DeepSeek-V4 Technical Report Notes

This notebook summarizes a DeepSeek-V4-style technical report as a paper-reading and reproduction log. The public material currently reads more like a model card and implementation note than a formal arXiv paper, so the cells below organize the available claims in an academic format: abstract, architecture, training recipe, post-training, evaluation plan, and reproducibility notes.

train data: 15387

total params: 567,542

updated: 12:57:59

Abstract

DeepSeek-V4 is presented as a sparse Mixture-of-Experts language model that targets high reasoning quality at lower inference cost than dense models of comparable total size. The central idea is to keep a very large parameter budget available while activating only a small subset of expert parameters for each token.

The report emphasizes three engineering themes: fine-grained expert routing, stable large-scale training, and post-training that improves instruction following, mathematical reasoning, and coding behavior. In this notebook, the technical narrative is split into self-contained cells so it can be reviewed like a paper while the runtime metrics continue to update locally.

Model Overview

Item	Summary
Family	Decoder-only Transformer with sparse expert layers.
Routing	Token-level expert selection with a small active subset per token.
Objective	General language modeling followed by supervised and preference-oriented alignment stages.
Target use	Reasoning, code generation, multilingual assistant tasks, and long-form technical analysis.

The practical value of the design is that total model capacity is decoupled from serving cost. Total capacity scales with the number of experts, while per-token computation scales with the number of active experts.

Architecture Notes

Backbone: a Transformer stack with attention blocks and feed-forward expert blocks.
MoE layers: expert FFNs provide extra capacity without requiring every token to traverse every parameter.
Shared capacity: shared experts or dense paths can stabilize common linguistic features while routed experts specialize.
Routing loss: auxiliary balancing terms reduce expert collapse and improve hardware utilization.
Long context: attention and positional strategies are treated as serving-critical, because reasoning tasks often combine retrieval, planning, and code traces.
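The routing-loss note above can be made concrete with a small sketch. The form used here is the Switch-Transformer-style balancing term (number of experts times the sum over experts of token fraction times mean router probability). It is only an illustration: the public material does not state which balancing scheme DeepSeek-V4 uses, and the function names and the top-1 assignment are assumptions of this notebook.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits):
    """Switch-style auxiliary loss for top-1 routing (illustrative assumption).

    router_logits: list of per-token lists, one score per expert.
    """
    probs = [softmax(row) for row in router_logits]
    num_tokens = len(probs)
    num_experts = len(probs[0])
    # f[i]: fraction of tokens whose top-1 expert is i.
    f = [0.0] * num_experts
    for row in probs:
        f[row.index(max(row))] += 1.0 / num_tokens
    # p[i]: mean router probability assigned to expert i.
    p = [sum(row[i] for row in probs) / num_tokens for i in range(num_experts)]
    return num_experts * sum(fi * pi for fi, pi in zip(f, p))

# Toy logits: one token per expert versus every token on expert 0.
balanced = [[10.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
collapsed = [[10.0, 0.0, 0.0, 0.0] for _ in range(4)]
print(load_balance_loss(balanced))   # close to 1.0
print(load_balance_loss(collapsed))  # close to 4.0
```

The term is smallest when traffic and router probability are both spread evenly, which is why it is typically added to the language-modeling loss with a small coefficient.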

[1]

# paper constants collected from public model-card style material
model_family = "DeepSeek-V4"
architecture = "Sparse Mixture-of-Experts Transformer"
training_focus = ["reasoning", "coding", "multilingual instruction following"]
serving_goal = "high total capacity with lower active compute"

loaded paper metadata for local notes

Training Pipeline

The pretraining stage is usually described as the phase where broad factual, linguistic, mathematical, and programming priors are learned from a large heterogeneous corpus. For an MoE model, data mixture quality matters because expert specialization can become brittle if the routing distribution is dominated by a narrow slice of the corpus.

After pretraining, the alignment pipeline can be read as a sequence of increasingly task-shaped objectives: instruction data for format and helpfulness, reasoning traces for multi-step tasks, and preference optimization for response ranking. The model-card-style material does not expose every low-level training detail, so this notebook treats those parts as reproducibility assumptions rather than confirmed constants.

Evaluation Plan

Capability	Suggested probes	What to watch
Mathematics	multi-step algebra, competition-style proofs, numerical consistency	final-answer accuracy and reasoning stability
Code	unit-tested generation, bug fixing, repository-level edits	test pass rate and dependency awareness
Long context	needle retrieval, multi-document synthesis, long code traces	position bias in recall and citation fidelity
Instruction following	format constraints, refusals, tool-use prompts	constraint obedience without verbosity drift

[2]

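import numpy as np

def route_token(token_features, router_logits, top_k=2):
    # token_features is unused here; a real router would derive the logits from it.
    router_logits = np.asarray(router_logits, dtype=float)
    expert_ids = np.argsort(router_logits)[-top_k:]  # top-k experts, ascending by score
    selected = router_logits[expert_ids]
    weights = np.exp(selected - selected.max())  # softmax over the selected logits only
    weights = weights / weights.sum()
    return list(zip(expert_ids.tolist(), weights.tolist()))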

# In real training this is fused and distributed across expert-parallel workers.

router sketch prepared; expert dispatch omitted in local notebook

Discussion

The strongest reading of the DeepSeek-V4 design is pragmatic: scale the total parameter pool while keeping inference proportional to active experts. The weakest point for independent reproduction is observability. Public model cards usually omit dataset mixture, optimizer schedules, filtering thresholds, and exact post-training recipes, all of which can materially affect benchmark behavior.

For this reason, a credible replication notebook should separate confirmed public facts from experimental placeholders. The cells below keep runtime counters live, but they should be interpreted as local monitoring values rather than claims from the paper itself.

Related Work

The DeepSeek-V4 design sits in the same broad line of research as sparse expert Transformers, instruction-tuned code models, and reasoning-oriented post-training systems. The important comparison is not only total parameter count, but also the amount of computation activated per generated token.

Dense models concentrate capacity in a single shared parameter path. Sparse expert models distribute capacity across many expert blocks and use a router to choose a small active subset. This changes the optimization problem: the model must learn both token representations and useful expert assignment patterns. Poor routing can waste capacity, while stable routing can create specialization across language, code, mathematics, and factual recall.

MoE scaling: increases total capacity while keeping active FLOPs manageable.
Reasoning alignment: uses examples and preferences that reward multi-step consistency.
Code specialization: benefits from compiler-like feedback, unit tests, and repository context.
Long-context serving: makes memory bandwidth, KV cache layout, and retrieval fidelity first-order concerns.

Notation

Let x be a token representation entering a feed-forward block. A dense Transformer applies one FFN to every token. A sparse expert block computes router scores, selects top experts, and combines their outputs with normalized routing weights. The simplified expression is:

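$$\mathrm{MoE}(x) = \sum_{i \in \mathcal{S}(x)} g_i(x)\, E_i(x)$$

where $\mathcal{S}(x)$ is the set of experts selected for $x$, $g_i(x)$ is the normalized gate weight of expert $i$, and $E_i$ is the $i$-th expert FFN.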

This notation hides the most important implementation detail: tokens are grouped by expert, dispatched across devices, processed in expert-parallel kernels, and then gathered back into the original sequence order. The communication pattern can dominate runtime if the batch is small or if routing is imbalanced.
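The group, dispatch, and gather pattern described above can be sketched in a single process. This is a minimal illustration under stated assumptions: top-1 routing, no devices or collectives, and hypothetical names (`dispatch_and_gather`, the toy `experts` dictionary) that do not come from any DeepSeek code.

```python
from collections import defaultdict

def dispatch_and_gather(tokens, assignments, experts):
    """Group tokens by expert, run each expert once on its batch, and
    restore the original sequence order (top-1 routing, single process).
    """
    # Group: expert_id -> list of (original_position, token).
    buckets = defaultdict(list)
    for position, (token, expert_id) in enumerate(zip(tokens, assignments)):
        buckets[expert_id].append((position, token))

    outputs = [None] * len(tokens)
    for expert_id, items in buckets.items():
        positions = [pos for pos, _ in items]
        batch = [tok for _, tok in items]
        results = experts[expert_id](batch)  # one batched call per expert
        # Gather: write results back to their original positions.
        for pos, result in zip(positions, results):
            outputs[pos] = result
    return outputs

# Hypothetical toy experts that operate on a batch of numbers.
experts = {
    0: lambda batch: [v * 2 for v in batch],
    1: lambda batch: [v + 100 for v in batch],
}
print(dispatch_and_gather([1, 2, 3, 4], [0, 1, 0, 1], experts))
# → [2, 102, 6, 104]
```

In a distributed setting the bucket step becomes an all-to-all exchange, which is where small batches or skewed routing start to dominate runtime.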

[3]

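import numpy as np

def moe_block(x, router, experts, top_k):
    # x: 1-D token vector; router(x) returns one score per expert.
    x = np.asarray(x, dtype=float)
    scores = np.asarray(router(x), dtype=float)
    ids = np.argsort(scores)[-top_k:]  # indices of the top-k experts
    # "normalize" is read here as a softmax over the selected scores (an assumption).
    weights = np.exp(scores[ids] - scores[ids].max())
    weights = weights / weights.sum()
    y = np.zeros_like(x)
    for expert_id, weight in zip(ids, weights):
        y = y + weight * experts[expert_id](x)  # each expert maps x to a vector of the same shape
    return y

single-token sketch; production MoE uses fused dispatch and expert parallelism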

Data Mixture and Filtering

A V4-scale model would usually require a data mixture that balances broad web text, books, code, mathematics, multilingual content, and high-quality instruction traces. The model-card-level description does not provide exact mixture ratios, so this notebook takes a conservative framing: mixture design is treated as a hidden variable that must be controlled in any reproduction.

Source type	Likely role	Risk if over-weighted
General web text	coverage, world knowledge, style diversity	noise, duplication, shallow reasoning
Code repositories	syntax, APIs, project structure, debugging patterns	license contamination, boilerplate memorization
Mathematics	symbolic manipulation, proof-like structure	format brittleness if too synthetic
Instruction data	assistant behavior and task compliance	overfitting to template phrasing
Preference data	ranking, helpfulness, refusal behavior	verbosity drift and reward hacking

Optimization Recipe

Large sparse models are sensitive to optimizer state memory, expert imbalance, and distributed communication. A practical training recipe generally includes warmup, stable learning-rate decay, gradient clipping, mixed precision, activation checkpointing, and load-balancing losses. The harder problem is making expert routing improve useful specialization without letting a few experts dominate traffic.
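Two of the recipe ingredients named above, warmup with learning-rate decay and gradient clipping, are easy to sketch. The cosine shape and every default value below are placeholders chosen for illustration, not reported DeepSeek settings; the function names are hypothetical.

```python
import math

def lr_at_step(step, peak_lr=3e-4, warmup_steps=2000, total_steps=100_000, min_lr=3e-5):
    """Linear warmup to peak_lr, then cosine decay to min_lr.

    All default values are placeholders, not reported DeepSeek settings.
    """
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a flat list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

for step in (0, 1999, 50_000, 100_000):
    print(step, lr_at_step(step))
print(clip_by_global_norm([3.0, 4.0]))  # norm 5.0 scaled down to norm 1.0
```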

For paper review purposes, the key questions are: whether the router is trained with an auxiliary loss, whether any experts are shared across all tokens, how expert capacity is limited, and how failed dispatch or dropped tokens are handled. These details often explain differences that benchmark tables alone cannot show.
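The capacity and dropped-token questions above can be illustrated with a toy assignment routine. This is a sketch under assumptions: top-1 routing, first-come-first-served capacity, and a hypothetical `capacity_factor` of 1.25. The public material does not say how, or whether, DeepSeek-V4 drops tokens.

```python
import math

def assign_with_capacity(top1_experts, num_experts, capacity_factor=1.25):
    """Top-1 assignment with a per-expert capacity limit.

    Tokens that arrive after an expert is full are reported as dropped;
    what happens to them afterwards (residual pass-through, re-routing)
    is a design choice this sketch does not make.
    """
    num_tokens = len(top1_experts)
    capacity = math.ceil(capacity_factor * num_tokens / num_experts)
    load = [0] * num_experts
    kept, dropped = [], []
    for position, expert_id in enumerate(top1_experts):
        if load[expert_id] < capacity:
            load[expert_id] += 1
            kept.append((position, expert_id))
        else:
            dropped.append(position)
    return capacity, kept, dropped

capacity, kept, dropped = assign_with_capacity([0, 0, 0, 0, 1, 2, 0, 3], num_experts=4)
print(capacity, dropped)  # → 3 [3, 6]
```

Reporting the dropped fraction alongside benchmark scores makes it easier to tell routing failures apart from genuine capability gaps.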

[4]

training_recipe = [
    "deduplicate and score pretraining corpus",
    "train sparse expert backbone with router balancing",
    "extend context window with stable positional treatment",
    "run supervised instruction tuning",
    "apply reasoning and preference post-training",
    "evaluate with held-out code, math, and long-context tasks",
]

recipe checklist loaded into the notebook environment

Inference and Serving

Serving a sparse model is not only a model-quality problem. Token routing creates dynamic expert traffic, and the serving stack must keep expert workers saturated without introducing excessive all-to-all communication. Batch composition can change latency because different prompts activate different experts.

For an academic report, inference should be measured with both synthetic and realistic workloads. Synthetic throughput highlights upper bounds, while mixed interactive workloads reveal queueing, cold expert activation, cache pressure, and long-context latency. A user-facing assistant usually cares about time-to-first-token and stable generation speed more than peak tokens per second.
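The two user-facing quantities named above, time-to-first-token and steady generation speed, can be computed from per-token arrival timestamps. The sketch below assumes the serving client records a wall-clock time for each streamed token; the function name and the returned keys are hypothetical.

```python
def latency_summary(request_start, token_times):
    """Summarize one streamed response.

    request_start: wall-clock time (seconds) when the request was sent.
    token_times:   wall-clock time at which each output token arrived.
    """
    if not token_times:
        raise ValueError("no tokens were generated")
    ttft = token_times[0] - request_start
    # Decode speed excludes the first token, which is dominated by prefill.
    decode_span = token_times[-1] - token_times[0]
    decode_tokens = len(token_times) - 1
    tokens_per_s = decode_tokens / decode_span if decode_span > 0 else float("nan")
    # Largest gap between consecutive tokens, a rough proxy for stalls.
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "decode_tokens_per_s": tokens_per_s,
        "max_gap_s": max(gaps, default=0.0),
    }

print(latency_summary(0.0, [0.5, 0.75, 1.0, 1.5]))
# → {'ttft_s': 0.5, 'decode_tokens_per_s': 3.0, 'max_gap_s': 0.5}
```

Collecting these summaries per request, rather than one aggregate throughput number, is what exposes queueing and cold-expert effects in mixed workloads.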

Ablation Matrix

Ablation	Expected observation	Interpretation
Reduce active experts	lower latency, possible reasoning drop	tests whether extra active capacity matters
Remove balancing loss	expert collapse or utilization skew	tests router stability
Disable long-context tuning	weaker retrieval and multi-document synthesis	isolates context extension value
Remove code-heavy post-training	lower unit-test pass rate	separates pretraining knowledge from task alignment
Use dense FFN baseline	higher active compute at similar quality target	measures sparse efficiency
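For the balancing-loss ablation in the table above, "utilization skew" needs a number attached to it. One possible pair of metrics, sketched under the assumption that per-token expert assignments are logged, is normalized routing entropy and the max-to-mean load ratio. Both metric choices and the function name are this notebook's, not the report's.

```python
import math
from collections import Counter

def utilization_report(assignments, num_experts):
    """Summarize how evenly routed tokens are spread across experts."""
    counts = Counter(assignments)
    total = len(assignments)
    shares = [counts.get(i, 0) / total for i in range(num_experts)]
    # Normalized entropy: 1.0 for uniform routing, 0.0 when one expert takes all.
    entropy = -sum(s * math.log(s) for s in shares if s > 0)
    normalized_entropy = entropy / math.log(num_experts) if num_experts > 1 else 1.0
    # Max-to-mean load: 1.0 for uniform routing, num_experts for full collapse.
    max_over_mean = max(shares) * num_experts
    return {
        "shares": shares,
        "normalized_entropy": normalized_entropy,
        "max_over_mean": max_over_mean,
    }

print(utilization_report([0, 1, 2, 3, 0, 1, 2, 3], num_experts=4))
print(utilization_report([0, 0, 0, 0, 0, 0, 0, 1], num_experts=4))
```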

Long-Context Behavior

Long-context claims should be evaluated beyond single-needle retrieval. A useful evaluation suite should include document ordering, conflicting evidence, codebase-wide symbol tracing, table lookup, and answer citation. The common failure mode is local fluency with global inconsistency: the model writes a plausible answer but silently drops evidence from earlier pages.

A DeepSeek-V4 style notebook should therefore include tests where the answer depends on multiple separated spans. The model should also be checked for position bias, especially near the middle of the context window where many models retrieve less reliably.

[5]

long_context_tests = [
    "needle at 5 percent, 50 percent, and 95 percent depth",
    "two-hop answer across separated paragraphs",
    "contradictory source resolution",
    "repository-wide function call tracing",
    "table plus prose synthesis",
]

long-context evaluation probes staged
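The first probe in the list, a needle at several depths, can be staged with a small prompt builder. This is a minimal sketch: the filler text, the needle sentence, and the prompt template are all hypothetical placeholders, and depth is measured in sentences rather than tokens.

```python
def build_needle_prompt(filler_sentences, needle, depth_percent, question):
    """Insert a needle sentence at a relative depth of the filler text.

    depth_percent=0 puts the needle first; 100 puts it last.
    Returns the prompt and the sentence index where the needle was placed.
    """
    if not 0 <= depth_percent <= 100:
        raise ValueError("depth_percent must be between 0 and 100")
    index = round(len(filler_sentences) * depth_percent / 100)
    sentences = filler_sentences[:index] + [needle] + filler_sentences[index:]
    context = " ".join(sentences)
    return f"{context}\n\nQuestion: {question}\nAnswer:", index

filler = [f"Filler sentence {i}." for i in range(100)]
needle = "The access code is 7319."  # hypothetical needle
for depth in (5, 50, 95):
    prompt, index = build_needle_prompt(filler, needle, depth, "What is the access code?")
    print(depth, index)
```

Scoring the same question at each depth gives a direct read on position bias, including the mid-context region discussed above.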

Safety and Reliability

Post-training can improve instruction following, but it can also hide failure modes behind polished language. A serious technical report should include refusal calibration, hallucination checks, uncertainty reporting, and tool-use boundaries. For coding tasks, reliability should be measured with executable tests rather than surface-level style judgments.

The important distinction is between sounding correct and being operationally correct. In repository editing, the model must preserve existing contracts, run the right tests, and avoid unrelated churn. In mathematical reasoning, it must keep intermediate quantities consistent. In factual synthesis, it must distinguish source-backed claims from inference.
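For the executable-test measurement mentioned above, a common way to turn per-problem test results into a score is the pass@k estimate: given n sampled solutions of which c pass the tests, estimate the chance that at least one of k samples passes. The sketch below uses the standard combinatorial form; whether any DeepSeek report uses exactly this estimator is not stated in the public material.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k samples
    drawn without replacement from n generations passes, given c passed.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=10, c=3, k=1))  # approximately 0.3
print(pass_at_k(n=10, c=3, k=5))
```

Averaging this value over problems gives a benchmark-level pass rate that depends only on test outcomes, not on how convincing the generated code looks.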

Reported Data From Public DeepSeek Papers

Because a formal DeepSeek-V4 paper is not publicly available in the same way as the DeepSeek-V3 and DeepSeek-R1 technical reports, the numerical anchors below are used as background baselines. They make the notebook read more like a paper review while avoiding fabricated V4-specific constants.

Reported item	Number	Why it matters for V4 notes
DeepSeek-V3 total parameters	671B	Sets the public sparse-model scale baseline for the family.
DeepSeek-V3 activated parameters per token	37B	Shows the sparse inference target: large total capacity, smaller active compute.
DeepSeek-V3 pretraining corpus	14.8T tokens	Gives a concrete order of magnitude for data scale.
DeepSeek-V3 training cost	2.788M H800 GPU hours	Useful for comparing efficiency claims against dense-model training budgets.
DeepSeek-V3 design components	MLA, DeepSeekMoE, FP8 mixed precision, MTP objective	Provides the likely vocabulary for interpreting later model iterations.

[6]

reported_baselines = {
    "DeepSeek-V3 total params": "671B",
    "DeepSeek-V3 activated params": "37B",
    "DeepSeek-V3 pretraining tokens": "14.8T",
    "DeepSeek-V3 training cost": "2.788M H800 GPU hours",
    "DeepSeek-R1 emphasis": "large-scale RL for reasoning behavior",
}

loaded reported baseline values from public technical reports

Interpreting the Numbers

The most important number pair is 671B total parameters versus 37B activated parameters. It explains why sparse models can be discussed as both very large and relatively efficient: the full model has a large expert pool, but a token only pays for a routed subset. Any V4-style claim about improved efficiency should therefore be judged against active parameters, routing balance, and serving overhead, not only against the headline parameter count.
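The arithmetic behind that number pair is short enough to check directly. The snippet below uses only the two DeepSeek-V3 figures quoted above; `parse_billions` is a hypothetical helper for the "671B" string format used in this notebook.

```python
def parse_billions(text):
    """Convert a string such as '671B' to a float count of parameters."""
    if not text.endswith("B"):
        raise ValueError(f"expected a value ending in 'B', got {text!r}")
    return float(text[:-1]) * 1e9

total = parse_billions("671B")   # DeepSeek-V3 total parameters
active = parse_billions("37B")   # DeepSeek-V3 activated parameters per token
active_fraction = active / total
print(f"active fraction: {active_fraction:.1%}")        # → active fraction: 5.5%
print(f"total-to-active ratio: {total / active:.1f}x")  # → total-to-active ratio: 18.1x
```

Roughly one parameter in eighteen is touched per token, which is the sense in which the model is both very large and relatively cheap per token, before routing and communication overhead are counted.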

The 14.8T-token pretraining scale also matters because post-training cannot fully compensate for a weak base distribution. Reasoning and coding improvements are easier to sustain when the base model already contains broad symbolic, factual, and programming priors. The R1 report then gives a second axis: reinforcement learning can reshape reasoning behavior after pretraining, but the result should still be measured with pass rates, verifier accuracy, and failure analysis rather than examples alone.

Comparison Notes

Question	Evidence to request	Reason
Does V4 improve over V3?	same-task benchmark table with decoding settings	prevents comparing cherry-picked prompts.
Is serving cheaper?	active parameters, throughput, latency, batch size, hardware	MoE cost depends on routing and communication.
Is reasoning more reliable?	math/code pass rates and verifier-style checks	reasoning traces can sound plausible while failing tests.
Is long context actually used?	multi-hop and position-bias evaluations	needle retrieval alone is not enough.

[7]

!pip install -q torch transformers datasets accelerate

Successfully prepared runtime dependencies.

[8]

print("model initialized")
print("total params:", "567,542")

model initialized
total params: 567,542

[9]

print("train data =", "15387")

train data = 15387

[10]

layer_snapshot = [
    {"name": "emb         ", "score": "0253", "trend": "negative"},
    {"name": "layer0_attn ", "score": "0269", "trend": "negative"},
    {"name": "layer0_ffn  ", "score": "0296", "trend": "negative"},
    {"name": "layer1_attn ", "score": "0241", "trend": "negative"},
    {"name": "layer1_ffn  ", "score": "0254", "trend": "negative"},
    {"name": "layer2_attn ", "score": "0043", "trend": "negative"},
    {"name": "layer2_ffn  ", "score": "0094", "trend": "positive"},
    {"name": "ln_f        ", "score": "0205", "trend": "positive"},
]

active_layer_count = len(layer_snapshot)
# Scores are zero-padded strings, so compare them numerically rather than lexicographically.
max_layer = max(layer_snapshot, key=lambda row: int(row["score"]))

loaded 8 layer records from local runtime snapshot

Runtime Layer Snapshot

The live layer snapshot is represented as code data instead of printed text. This makes the monitoring cell read like a small experiment artifact: each row has a layer name, a zero-padded score, and a trend label that can be consumed by later analysis cells without reparsing console output.

Placed here, the runtime block connects the earlier MoE routing discussion with the final evaluation notes. The values are still local service counters, but the structure now matches the rest of the notebook: paper claims first, code artifacts next, limitations and references last.
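A later analysis cell could consume the snapshot rows along these lines. The sketch assumes the zero-padded scores are plain integers and that padded layer names should be stripped; neither is confirmed by the runtime that produced them, and `summarize_snapshot` is a hypothetical helper.

```python
def summarize_snapshot(rows):
    """Parse zero-padded scores and group layer names by trend label."""
    parsed = [
        {"name": row["name"].strip(), "score": int(row["score"]), "trend": row["trend"]}
        for row in rows
    ]
    by_trend = {}
    for row in parsed:
        by_trend.setdefault(row["trend"], []).append(row["name"])
    top = max(parsed, key=lambda row: row["score"])
    return {"top_layer": top["name"], "top_score": top["score"], "by_trend": by_trend}

# Three rows in the same shape as the snapshot cell, padding included.
sample = [
    {"name": "layer0_ffn  ", "score": "0296", "trend": "negative"},
    {"name": "layer2_attn ", "score": "0043", "trend": "negative"},
    {"name": "ln_f        ", "score": "0205", "trend": "positive"},
]
print(summarize_snapshot(sample))
```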

Limitations

The exact DeepSeek-V4 training corpus, optimizer schedule, expert count, and post-training recipe are not fully recoverable from public model-card style material.
Benchmark scores without prompt templates and decoding settings are difficult to reproduce exactly.
MoE serving efficiency depends on hardware topology and batching strategy, so headline active-parameter counts are not sufficient.
Reasoning improvements may come from data, architecture, decoding, or alignment; isolating each factor requires ablation data that is not always public.

Appendix: Reproduction Checklist

Step	Artifact	Status in this notebook
Collect public model metadata	model card and release notes	summarized
Define architecture assumptions	MoE routing and expert capacity notes	documented as assumptions
Build evaluation harness	math, code, long-context probes	outlined
Track runtime counters	local refreshed metrics	live via local /stats endpoint
Separate claims from inference	source-backed notes vs reconstruction	explicitly labeled

Conclusion

The DeepSeek-V4 notebook should be read as a continuation of the public DeepSeek sparse-model line rather than as a replacement for a missing formal V4 paper. The public V3 numbers already establish the central engineering pattern: a very large total parameter pool, a much smaller active parameter path, and an efficiency story that depends on careful routing, precision, and distributed execution.

From that baseline, the natural hypothesis for a V4-class system is not simply more parameters. A stronger technical contribution would be better expert utilization, more reliable reasoning after reinforcement learning, cleaner long-context behavior, and lower serving variance under interactive workloads. Those are the axes a future formal report should make measurable.

The live cells at the end of this notebook are intentionally separated from the paper notes. They are local runtime counters, useful for monitoring the running service, while the preceding sections summarize public technical-report data and the questions that should guide a serious V4 evaluation.

References

DeepSeek-V3 Technical Report: reported 671B total parameters, 37B activated parameters, 14.8T pretraining tokens, and 2.788M H800 GPU hours.
DeepSeek-R1 Technical Report: reasoning-oriented reinforcement learning and distilled reasoning models.
DeepSeek-V4-Pro public model-card material: used only as V4-style context where a formal V4 paper is not available.