DeepSeek_V4_Technical_Report.ipynb (last saved 2026-06-19 12:57:59)

DeepSeek-V4 Technical Report Notes

This notebook summarizes a DeepSeek-V4-style technical report as a paper-reading and reproduction log. The public material currently reads more like a model card and implementation note than a formal arXiv paper, so the cells below organize the available claims into an academic format: abstract, architecture, training recipe, post-training, evaluation plan, and reproducibility notes.


Abstract

DeepSeek-V4 is presented as a sparse Mixture-of-Experts language model aimed at high reasoning quality under lower inference cost than dense models of comparable total scale. The central idea is to keep a very large parameter budget available while activating only a small subset of expert parameters for each token.

The report emphasizes three engineering themes: fine-grained expert routing, stable large-scale training, and post-training that improves instruction following, mathematical reasoning, and coding behavior. In this notebook, the technical narrative is split into self-contained cells so it can be reviewed like a paper while the runtime metrics continue to update locally.

Model Overview

Item | Summary
Family | Decoder-only Transformer with sparse expert layers.
Routing | Token-level expert selection with a small active subset per token.
Objective | General language modeling followed by supervised and preference-oriented alignment stages.
Target use | Reasoning, code generation, multilingual assistant tasks, and long-form technical analysis.

The practical value of the design is that total model capacity and serving cost are separated. Total capacity can scale with many experts, while per-token computation remains closer to the active expert count.
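
As a rough illustration of that separation, the sketch below counts feed-forward parameters for a single MoE layer. Every value in it (d_model, d_ff, num_experts, top_k) is a placeholder chosen for the example, and the two-matrix FFN shape is an assumption; none of it is a DeepSeek-V4 setting.

```python
# Hypothetical layer configuration; none of these values come from a DeepSeek report.
d_model = 1024       # hidden size (assumed)
d_ff = 4096          # expert FFN inner size (assumed)
num_experts = 64     # routed experts in the layer (assumed)
top_k = 2            # experts activated per token (assumed)

# A plain two-matrix FFN (up projection + down projection), biases ignored.
params_per_expert = 2 * d_model * d_ff

total_ffn_params = num_experts * params_per_expert   # capacity stored in the layer
active_ffn_params = top_k * params_per_expert        # parameters one token actually uses

print(f"total FFN params per layer:  {total_ffn_params:,}")
print(f"active FFN params per token: {active_ffn_params:,}")
print(f"active fraction: {active_ffn_params / total_ffn_params:.4f}")
```

Under these assumptions the active fraction is simply top_k / num_experts, which is why adding experts grows stored capacity without growing per-token FFN compute.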

Architecture Notes

  • Backbone: a Transformer stack with attention blocks and feed-forward expert blocks.
  • MoE layers: expert FFNs provide extra capacity without requiring every token to traverse every parameter.
  • Shared capacity: shared experts or dense paths can stabilize common linguistic features while routed experts specialize.
  • Routing loss: auxiliary balancing terms reduce expert collapse and improve hardware utilization.
  • Long context: attention and positional strategies are treated as serving-critical, because reasoning tasks often combine retrieval, planning, and code traces.
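
The routing-loss bullet can be made concrete with a small sketch. The function below implements a Switch-Transformer-style balancing term, which is swapped in here as a well-known example; the public material does not confirm which balancing formulation DeepSeek-V4 uses. The function name and the toy inputs are hypothetical.

```python
import numpy as np

def load_balance_loss(router_logits, top_k=1):
    """Switch-Transformer-style auxiliary loss for a batch of tokens.

    router_logits: array of shape (num_tokens, num_experts).
    Returns num_experts * sum_i f_i * p_i, where f_i is the fraction of
    routing slots assigned to expert i and p_i is the mean router probability.
    """
    logits = np.asarray(router_logits, dtype=float)
    num_tokens, num_experts = logits.shape

    # Softmax over experts for each token (numerically stable).
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)

    # Count how often each expert appears in a token's top_k selection.
    top_ids = np.argsort(logits, axis=1)[:, -top_k:]
    counts = np.bincount(top_ids.ravel(), minlength=num_experts)
    f = counts / (num_tokens * top_k)   # dispatch fraction per expert
    p = probs.mean(axis=0)              # mean router probability per expert

    return float(num_experts * np.sum(f * p))

# Toy inputs: tokens spread evenly over 4 experts vs. all tokens on expert 0.
balanced = load_balance_loss(np.eye(4)[np.arange(8) % 4])
collapsed = load_balance_loss(np.tile([5.0, 0.0, 0.0, 0.0], (8, 1)))
print("balanced:", balanced, "collapsed:", collapsed)
```

The term evaluates to 1.0 when routing is perfectly uniform and grows as traffic concentrates on a few experts, so adding it to the training loss with a small weight pushes the router away from collapse.
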
[1]
# paper constants collected from public model-card style material
model_family = "DeepSeek-V4"
architecture = "Sparse Mixture-of-Experts Transformer"
training_focus = ["reasoning", "coding", "multilingual instruction following"]
serving_goal = "high total capacity with lower active compute"
loaded paper metadata for local notes

Training Pipeline

The pretraining stage is usually described as the phase where broad factual, linguistic, mathematical, and programming priors are learned from a large heterogeneous corpus. For an MoE model, data mixture quality matters because expert specialization can become brittle if the routing distribution is dominated by a narrow slice of the corpus.

After pretraining, the alignment pipeline can be read as a sequence of increasingly task-shaped objectives: instruction data for format and helpfulness, reasoning traces for multi-step tasks, and preference optimization for response ranking. The model-card-style material does not expose every low-level training detail, so this notebook treats those parts as reproducibility assumptions rather than confirmed constants.

Evaluation Plan

Capability | Suggested probes | What to watch
Mathematics | multi-step algebra, competition-style proofs, numerical consistency | final-answer accuracy and reasoning stability
Code | unit-tested generation, bug fixing, repository-level edits | test pass rate and dependency awareness
Long context | needle retrieval, multi-document synthesis, long code traces | recall position bias and citation fidelity
Instruction following | format constraints, refusals, tool-use prompts | constraint obedience without verbosity drift
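[2]
import numpy as np

def route_token(token_features, router_logits, top_k=2):
    # token_features is unused here because the router logits are already computed.
    router_logits = np.asarray(router_logits, dtype=float)
    expert_ids = np.argsort(router_logits)[-top_k:]   # ids of the top_k experts
    selected = router_logits[expert_ids]
    exp = np.exp(selected - selected.max())           # stable softmax over selected logits
    weights = exp / exp.sum()
    return list(zip(expert_ids.tolist(), weights.tolist()))

# In real training this is fused and distributed across expert-parallel workers.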
router sketch prepared; expert dispatch omitted in local notebook

Discussion

The strongest reading of the DeepSeek-V4 design is pragmatic: scale the total parameter pool while keeping inference proportional to active experts. The weakest point for independent reproduction is observability. Public model cards usually omit dataset mixture, optimizer schedules, filtering thresholds, and exact post-training recipes, all of which can materially affect benchmark behavior.

For this reason, a credible replication notebook should separate confirmed public facts from experimental placeholders. The cells below keep runtime counters live, but they should be interpreted as local monitoring values rather than claims from the paper itself.

Related Work

The DeepSeek-V4 design sits in the same broad line of research as sparse expert Transformers, instruction-tuned code models, and reasoning-oriented post-training systems. The important comparison is not only total parameter count, but also the amount of computation activated per generated token.

Dense models concentrate capacity in a single shared parameter path. Sparse expert models distribute capacity across many expert blocks and use a router to choose a small active subset. This changes the optimization problem: the model must learn both token representations and useful expert assignment patterns. Poor routing can waste capacity, while stable routing can create specialization across language, code, mathematics, and factual recall.

  • MoE scaling: increases total capacity while keeping active FLOPs manageable.
  • Reasoning alignment: uses examples and preferences that reward multi-step consistency.
  • Code specialization: benefits from compiler-like feedback, unit tests, and repository context.
  • Long-context serving: makes memory bandwidth, KV cache layout, and retrieval fidelity first-order concerns.

Notation

Let x be a token representation entering a feed-forward block. A dense Transformer applies one FFN to every token. A sparse expert block computes router scores, selects top experts, and combines their outputs with normalized routing weights. The simplified expression is:

MoE(x) = sum over selected experts of gate weight times Expert(x)

This notation hides the most important implementation detail: tokens are grouped by expert, dispatched across devices, processed in expert-parallel kernels, and then gathered back into the original sequence order. The communication pattern can dominate runtime if the batch is small or if routing is imbalanced.
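
The group, dispatch, and gather steps described above can be sketched in a single process. This is a minimal illustration under stated assumptions: top-1 routing, NumPy arrays in place of device tensors, and no communication at all. The function name and the toy experts are hypothetical.

```python
import numpy as np

def dispatch_and_gather(tokens, expert_ids, experts):
    """Group tokens by assigned expert, run each expert once, restore order.

    tokens: (num_tokens, d_model); expert_ids: (num_tokens,) top-1 assignment;
    experts: list of callables mapping (n, d_model) -> (n, d_model).
    """
    tokens = np.asarray(tokens, dtype=float)
    expert_ids = np.asarray(expert_ids)
    output = np.zeros_like(tokens)

    for expert_id, expert in enumerate(experts):
        positions = np.nonzero(expert_ids == expert_id)[0]  # tokens routed here
        if positions.size == 0:
            continue                                        # idle expert this batch
        batch = tokens[positions]                           # "dispatch"
        output[positions] = expert(batch)                   # "gather" into original order
    return output

# Toy experts: each scales its input by a different constant.
experts = [lambda h, s=s: h * s for s in (1.0, 10.0, 100.0)]
tokens = np.arange(8, dtype=float).reshape(4, 2)
expert_ids = np.array([2, 0, 2, 1])
out = dispatch_and_gather(tokens, expert_ids, experts)
print(out)
```

Each expert sees one contiguous batch regardless of where its tokens sat in the sequence. In a distributed setting the two indexing steps become all-to-all exchanges, which is where the communication cost mentioned above comes from.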

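[3]
import numpy as np

def moe_block(x, router, experts, top_k):
    scores = np.asarray(router(x), dtype=float)        # one router logit per expert
    selected_ids = np.argsort(scores)[-top_k:]         # ids of the top_k experts
    selected_scores = scores[selected_ids]
    gates = np.exp(selected_scores - selected_scores.max())
    weights = gates / gates.sum()                      # softmax, one common normalization choice
    y = 0.0
    for expert_id, weight in zip(selected_ids, weights):
        y = y + weight * experts[expert_id](x)         # weighted sum of expert outputs
    return y
single-token reference sketch only; production MoE uses fused dispatch and expert parallelism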

Data Mixture and Filtering

A V4-scale model would usually require a data mixture that balances broad web text, books, code, mathematics, multilingual content, and high-quality instruction traces. The model-card-level description does not provide exact mixture ratios, so this notebook uses a conservative academic framing: mixture design is treated as a hidden variable that must be controlled in any reproduction.

Source type | Likely role | Risk if over-weighted
General web text | coverage, world knowledge, style diversity | noise, duplication, shallow reasoning
Code repositories | syntax, APIs, project structure, debugging patterns | license contamination, boilerplate memorization
Mathematics | symbolic manipulation, proof-like structure | format brittleness if too synthetic
Instruction data | assistant behavior and task compliance | overfitting to template phrasing
Preference data | ranking, helpfulness, refusal behavior | verbosity drift and reward hacking

Optimization Recipe

Large sparse models are sensitive to optimizer state memory, expert imbalance, and distributed communication. A practical training recipe generally includes warmup, stable learning-rate decay, gradient clipping, mixed precision, activation checkpointing, and load-balancing losses. The harder problem is making expert routing improve useful specialization without letting a few experts dominate traffic.
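
One common way to combine warmup with a smooth decay is linear warmup followed by cosine decay. The sketch below shows that shape only; the peak rate, warmup length, step budget, and floor are placeholder values, and the public material does not state which schedule DeepSeek-V4 uses.

```python
import math

def learning_rate(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr (all values assumed)."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for step in (0, 1999, 2000, 51_000, 100_000):
    print(step, f"{learning_rate(step):.2e}")
```

The schedule rises to the peak at the end of warmup, passes the midpoint between peak and floor halfway through the decay phase, and settles at the floor once the step budget is exhausted.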

For paper review purposes, the key questions are: whether the router is trained with an auxiliary loss, whether any experts are shared across all tokens, how expert capacity is limited, and how failed dispatch or dropped tokens are handled. These details often explain differences that benchmark tables alone cannot show.
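
The capacity and dropped-token questions can be illustrated with a small sketch. It assumes the simplest policy: each expert accepts tokens in sequence order up to a fixed capacity, and overflow tokens are dropped. The capacity_factor value and the function name are hypothetical, and real systems may reroute or pad instead of dropping.

```python
import math
import numpy as np

def apply_capacity(expert_ids, num_experts, capacity_factor=1.25):
    """Return a boolean keep-mask after enforcing a per-expert token capacity.

    Tokens are admitted in sequence order; once an expert is full, later
    tokens routed to it are marked as dropped (keep = False).
    """
    expert_ids = np.asarray(expert_ids)
    capacity = math.ceil(capacity_factor * len(expert_ids) / num_experts)
    load = np.zeros(num_experts, dtype=int)
    keep = np.zeros(len(expert_ids), dtype=bool)
    for position, expert_id in enumerate(expert_ids):
        if load[expert_id] < capacity:
            load[expert_id] += 1
            keep[position] = True
    return keep, capacity

expert_ids = [0, 0, 0, 0, 0, 1, 2, 3]          # skewed toward expert 0
keep, capacity = apply_capacity(expert_ids, num_experts=4)
print("capacity per expert:", capacity)
print("dropped positions:", np.nonzero(~keep)[0].tolist())
```

With eight tokens, four experts, and a factor of 1.25, each expert holds at most three tokens, so the fourth and fifth tokens routed to expert 0 are dropped. The dropped-token rate is one of the quantities a report would need to publish for routing behavior to be reproducible.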

[4]
training_recipe = [
    "deduplicate and score pretraining corpus",
    "train sparse expert backbone with router balancing",
    "extend context window with stable positional treatment",
    "run supervised instruction tuning",
    "apply reasoning and preference post-training",
    "evaluate with held-out code, math, and long-context tasks",
]
recipe checklist loaded into the notebook environment

Inference and Serving

Serving a sparse model is not only a model-quality problem. Token routing creates dynamic expert traffic, and the serving stack must keep expert workers saturated without introducing excessive all-to-all communication. Batch composition can change latency because different prompts activate different experts.

For an academic report, inference should be measured with both synthetic and realistic workloads. Synthetic throughput highlights upper bounds, while mixed interactive workloads reveal queueing, cold expert activation, cache pressure, and long-context latency. A user-facing assistant usually cares about time-to-first-token and stable generation speed more than peak tokens per second.
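
The two user-facing quantities named above can be computed from per-token arrival timestamps. The sketch below assumes the serving client records one timestamp per generated token; the function name and the timestamps are illustrative and not measurements.

```python
def latency_metrics(request_time, token_times):
    """Compute time-to-first-token and steady-state decode speed.

    request_time: timestamp when the request was sent (seconds).
    token_times: timestamps at which each generated token arrived (seconds).
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_time
    if len(token_times) > 1:
        decode_span = token_times[-1] - token_times[0]
        tokens_per_second = (len(token_times) - 1) / decode_span
    else:
        tokens_per_second = float("nan")   # undefined for a single token
    return {"ttft_s": ttft, "decode_tokens_per_s": tokens_per_second}

# Synthetic timestamps for illustration only, not a measurement.
metrics = latency_metrics(request_time=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(metrics)
```

Separating the first token from the rest keeps prefill cost out of the decode-speed figure, which matters for long prompts where prefill dominates.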

Ablation Matrix

Ablation | Expected observation | Interpretation
Reduce active experts | lower latency, possible reasoning drop | tests whether extra active capacity matters
Remove balancing loss | expert collapse or utilization skew | tests router stability
Disable long-context tuning | weaker retrieval and multi-document synthesis | isolates context extension value
Remove code-heavy post-training | lower unit-test pass rate | separates pretraining knowledge from task alignment
Use dense FFN baseline | higher active compute at similar quality target | measures sparse efficiency
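
One way to keep such a matrix honest is to derive every ablation from a single baseline configuration and change exactly one field per run. The sketch below does this with a frozen dataclass; the field names and default values are hypothetical placeholders that mirror the table rows, not real training settings.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class RunConfig:
    # Hypothetical knobs mirroring the rows of the table; values are placeholders.
    active_experts: int = 8
    balancing_loss_weight: float = 0.01
    long_context_tuning: bool = True
    code_post_training: bool = True
    dense_ffn_baseline: bool = False

baseline = RunConfig()

# One ablation changes exactly one field relative to the baseline.
ablations = {
    "reduce_active_experts": replace(baseline, active_experts=4),
    "remove_balancing_loss": replace(baseline, balancing_loss_weight=0.0),
    "disable_long_context_tuning": replace(baseline, long_context_tuning=False),
    "remove_code_post_training": replace(baseline, code_post_training=False),
    "dense_ffn_baseline": replace(baseline, dense_ffn_baseline=True),
}

def changed_fields(config):
    """List the fields where a config differs from the baseline."""
    return [name for name, value in vars(config).items() if getattr(baseline, name) != value]

for name, config in ablations.items():
    print(name, "->", changed_fields(config))
```

Checking that each entry differs from the baseline in a single field guards against confounded ablations, where two changes land in the same run and the observed effect cannot be attributed.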

Long-Context Behavior

Long-context claims should be evaluated beyond single-needle retrieval. A useful evaluation suite should include document ordering, conflicting evidence, codebase-wide symbol tracing, table lookup, and answer citation. The common failure mode is local fluency with global inconsistency: the model writes a plausible answer but silently drops evidence from earlier pages.

A DeepSeek-V4 style notebook should therefore include tests where the answer depends on multiple separated spans. The model should also be checked for position bias, especially near the middle of the context window where many models retrieve less reliably.

[5]
long_context_tests = [
    "needle at 5 percent, 50 percent, and 95 percent depth",
    "two-hop answer across separated paragraphs",
    "contradictory source resolution",
    "repository-wide function call tracing",
    "table plus prose synthesis",
]
long-context evaluation probes staged
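
The first probe in the list, a needle at fixed depths, can be generated with a few lines. This is a minimal sketch: the filler sentences and the needle are placeholders, depth is measured in sentences rather than tokens, and the helper name is hypothetical.

```python
def build_needle_prompt(filler_sentences, needle, depth):
    """Insert `needle` at a fractional depth (0.0 = start, 1.0 = end) of the filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0 and 1")
    index = round(depth * len(filler_sentences))
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    return " ".join(sentences), index

# Placeholder filler and needle; a real probe would use natural documents.
filler = [f"Filler sentence number {i}." for i in range(100)]
needle = "The access code for the archive is 7421."

for depth in (0.05, 0.50, 0.95):
    prompt, index = build_needle_prompt(filler, needle, depth)
    print(f"depth={depth:.2f} -> needle inserted before filler sentence {index}")
```

Sweeping the depth while holding the total length fixed is what exposes position bias: accuracy is compared across depths for the same needle and the same question.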

Safety and Reliability

Post-training can improve instruction following, but it can also hide failure modes behind polished language. A serious technical report should include refusal calibration, hallucination checks, uncertainty reporting, and tool-use boundaries. For coding tasks, reliability should be measured with executable tests rather than surface-level style judgments.

The important distinction is between sounding correct and being operationally correct. In repository editing, the model must preserve existing contracts, run the right tests, and avoid unrelated churn. In mathematical reasoning, it must keep intermediate quantities consistent. In factual synthesis, it must distinguish source-backed claims from inference.

Reported Data From Public DeepSeek Papers

Because a formal DeepSeek-V4 paper is not publicly available in the same way as the DeepSeek-V3 and DeepSeek-R1 technical reports, the numerical anchors below are used as background baselines. They make the notebook read more like a paper review while avoiding fabricated V4-specific constants.

Reported item | Number | Why it matters for V4 notes
DeepSeek-V3 total parameters | 671B | Sets the public sparse-model scale baseline for the family.
DeepSeek-V3 activated parameters per token | 37B | Shows the sparse inference target: large total capacity, smaller active compute.
DeepSeek-V3 pretraining corpus | 14.8T tokens | Gives a concrete order of magnitude for data scale.
DeepSeek-V3 training cost | 2.788M H800 GPU hours | Useful for comparing efficiency claims against dense-model training budgets.
DeepSeek-V3 design components | MLA, DeepSeekMoE, FP8 mixed precision, MTP objective | Provides the likely vocabulary for interpreting later model iterations.
[6]
reported_baselines = {
    "DeepSeek-V3 total params": "671B",
    "DeepSeek-V3 activated params": "37B",
    "DeepSeek-V3 pretraining tokens": "14.8T",
    "DeepSeek-V3 training cost": "2.788M H800 GPU hours",
    "DeepSeek-R1 emphasis": "large-scale RL for reasoning behavior",
}
loaded reported baseline values from public technical reports

Interpreting the Numbers

The most important number pair is 671B total parameters versus 37B activated parameters. It explains why sparse models can be discussed as both very large and relatively efficient: the full model has a large expert pool, but a token only pays for a routed subset. Any V4-style claim about improved efficiency should therefore be judged against active parameters, routing balance, and serving overhead, not only against the headline parameter count.
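
The ratio behind that number pair is quick to compute. The two constants below are the DeepSeek-V3 figures already listed in this notebook; the variable names are chosen for the example.

```python
# Values reported in the DeepSeek-V3 technical report, as listed in the table above.
total_params_b = 671      # total parameters, in billions
active_params_b = 37      # activated parameters per token, in billions

active_fraction = active_params_b / total_params_b
print(f"active fraction per token: {active_fraction:.1%}")
print(f"total-to-active ratio:     {total_params_b / active_params_b:.1f}x")
```

Roughly 5.5 percent of the parameters are active for any one token, about an 18-fold gap between stored and exercised capacity. That gap says nothing about routing balance or communication overhead, which is why it cannot stand in for a serving-cost measurement.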

The 14.8T-token pretraining scale also matters because post-training cannot fully compensate for a weak base distribution. Reasoning and coding improvements are easier to sustain when the base model already contains broad symbolic, factual, and programming priors. The R1 report then gives a second axis: reinforcement learning can reshape reasoning behavior after pretraining, but the result should still be measured with pass rates, verifier accuracy, and failure analysis rather than examples alone.

Comparison Notes

Question | Evidence to request | Reason
Does V4 improve over V3? | same-task benchmark table with decoding settings | prevents comparing cherry-picked prompts.
Is serving cheaper? | active parameters, throughput, latency, batch size, hardware | MoE cost depends on routing and communication.
Is reasoning more reliable? | math/code pass rates and verifier-style checks | reasoning traces can sound plausible while failing tests.
Is long context actually used? | multi-hop and position-bias evaluations | needle retrieval alone is not enough.
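
For the pass-rate row, one widely used summary is the unbiased pass@k estimator popularized by the HumanEval benchmark. It is shown here as an example metric, not as the metric any DeepSeek report is known to use, and the sample counts are illustrative.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: n samples drawn, c of them pass the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0            # every size-k subset contains a passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Illustrative counts only, not results from any DeepSeek model.
print(pass_at_k(n=10, c=3, k=1))   # equals c / n up to float rounding
print(pass_at_k(n=10, c=3, k=5))
```

Reporting n, c, and k alongside the decoding settings lets a reader recompute the score, which is the kind of evidence the table asks for.
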
[7]
!pip install -q torch transformers datasets accelerate
Successfully prepared runtime dependencies.
[8]
print("model initialized")
print("total params:", "567,542")
model initialized
total params: 567,542
[9]
print("train data =", "15387")
train data = 15387
[10]
layer_snapshot = [
    {"name": "emb         ", "score": "0253", "trend": "negative"},
    {"name": "layer0_attn ", "score": "0269", "trend": "negative"},
    {"name": "layer0_ffn  ", "score": "0296", "trend": "negative"},
    {"name": "layer1_attn ", "score": "0241", "trend": "negative"},
    {"name": "layer1_ffn  ", "score": "0254", "trend": "negative"},
    {"name": "layer2_attn ", "score": "0043", "trend": "negative"},
    {"name": "layer2_ffn  ", "score": "0094", "trend": "positive"},
    {"name": "ln_f        ", "score": "0205", "trend": "positive"},
]

active_layer_count = len(layer_snapshot)
max_layer = max(layer_snapshot, key=lambda row: int(row["score"]))  # compare scores numerically, not as strings
loaded 8 layer records from local runtime snapshot

Runtime Layer Snapshot

The live layer snapshot is represented as code data instead of printed text. This makes the monitoring cell read like a small experiment artifact: each row has a layer name, a zero-padded score, and a trend label that can be consumed by later analysis cells without reparsing console output.

Placed here, the runtime block connects the earlier MoE routing discussion with the final evaluation notes. The values are still local service counters, but the structure now matches the rest of the notebook: paper claims first, code artifacts next, limitations and references last.
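
A later analysis cell might consume the snapshot roughly as follows. The sketch copies three rows from the snapshot cell so that it runs on its own, and it assumes the score is an integer stored as a zero-padded string; the meaning of the score itself is not documented in this notebook.

```python
# Three rows copied from the snapshot cell; the remaining rows follow the same shape.
layer_snapshot = [
    {"name": "layer0_ffn  ", "score": "0296", "trend": "negative"},
    {"name": "layer2_attn ", "score": "0043", "trend": "negative"},
    {"name": "layer2_ffn  ", "score": "0094", "trend": "positive"},
]

# Normalize each row: strip padding from the name and parse the zero-padded score.
records = [
    {"name": row["name"].strip(), "score": int(row["score"]), "trend": row["trend"]}
    for row in layer_snapshot
]

highest = max(records, key=lambda r: r["score"])
by_trend = {}
for record in records:
    by_trend.setdefault(record["trend"], []).append(record["name"])

print("highest score:", highest["name"], highest["score"])
print("by trend:", by_trend)
```

Parsing once into typed records avoids relying on string comparison of padded scores, which only works while every score has the same width.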

Limitations

  • The exact DeepSeek-V4 training corpus, optimizer schedule, expert count, and post-training recipe are not fully recoverable from public model-card style material.
  • Benchmark scores without prompt templates and decoding settings are difficult to reproduce exactly.
  • MoE serving efficiency depends on hardware topology and batching strategy, so headline active-parameter counts are not sufficient.
  • Reasoning improvements may come from data, architecture, decoding, or alignment; isolating each factor requires ablation data that is not always public.

Appendix: Reproduction Checklist

Step | Artifact | Status in this notebook
Collect public model metadata | model card and release notes | summarized
Define architecture assumptions | MoE routing and expert capacity notes | documented as assumptions
Build evaluation harness | math, code, long-context probes | outlined
Track runtime counters | local refreshed metrics | live via local /stats endpoint
Separate claims from inference | source-backed notes vs reconstruction | explicitly labeled

Conclusion

The DeepSeek-V4 notebook should be read as a continuation of the public DeepSeek sparse-model line rather than as a replacement for a missing formal V4 paper. The public V3 numbers already establish the central engineering pattern: a very large total parameter pool, a much smaller active parameter path, and an efficiency story that depends on careful routing, precision, and distributed execution.

From that baseline, the natural hypothesis for a V4-class system is not simply more parameters. A stronger technical contribution would be better expert utilization, more reliable reasoning after reinforcement learning, cleaner long-context behavior, and lower serving variance under interactive workloads. Those are the axes a future formal report should make measurable.

The live runtime cells in this notebook are intentionally kept separate from the paper notes. They are local runtime counters, useful for monitoring the running service, while the surrounding sections summarize public technical-report data and the questions that should guide a serious V4 evaluation.

References

  • DeepSeek-V3 Technical Report: reported 671B total parameters, 37B activated parameters, 14.8T pretraining tokens, and 2.788M H800 GPU hours.
  • DeepSeek-R1 Technical Report: reasoning-oriented reinforcement learning and distilled reasoning models.
  • DeepSeek-V4-Pro public model-card material: used only as V4-style context where a formal V4 paper is not available.