Break-even LLM calls (number of LLM calls that pay back the training investment)
--
Inference cost savings (vs. running the same calls on the LLM)
--
Inputs
LLM provider & pricing
Task & volume
SLM & GPU costs
Cost breakdown & details
One-time training cost
Inference cost
Calculation breakdown
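The break-even metric above can be sketched as a one-line division: the one-time training cost divided by the per-call saving of the SLM over the LLM. This is a hypothetical sketch of that definition, not the calculator's actual implementation; the function name, variable names, and per-call cost model are all assumptions.

```python
# Hypothetical sketch of "break-even LLM calls": the call count at which
# cumulative per-call savings cover the one-time training cost.
# All names and the flat per-call cost model are assumptions.

def break_even_calls(training_cost_usd: float,
                     llm_cost_per_call_usd: float,
                     slm_cost_per_call_usd: float) -> float:
    """Calls at which cumulative LLM spend equals one-time SLM
    training cost plus cumulative SLM inference spend."""
    saving_per_call = llm_cost_per_call_usd - slm_cost_per_call_usd
    if saving_per_call <= 0:
        raise ValueError("SLM must be cheaper per call to break even")
    return training_cost_usd / saving_per_call

# Illustrative numbers: $500 training, $0.01/call LLM, $0.002/call SLM
print(round(break_even_calls(500.0, 0.01, 0.002)))  # 62500
```

The example figures are placeholders; the calculator derives the per-call costs from the pricing and throughput inputs listed above.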
Speed factor reference (s)
The 1,400 tok/s figure is a back-calculated estimate, not a direct citation.
- Raschka's benchmark: Llama 2 7B, LoRA r=256, A100 80 GB → ~3 h on Alpaca 50k (~110 tok/sample → ~5.5 M training tokens).
- Implied throughput: 5.5 M ÷ (3 × 3,600) ≈ ~510 tok/s for a 7 B model.
- A 7–8 B model uses speed factor s ≈ 3.1 in this calculator, so setting the 1.7 B baseline to 1,400 tok/s implies 1,400 ÷ 3.1 ≈ 450 tok/s for an 8 B model.
- This is consistent with Raschka's ~510 tok/s for 7 B (within expected variance from architecture and LoRA rank differences).
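The back-calculation in the bullets above is plain arithmetic and can be checked in a few lines. Every figure below comes straight from the bullet points; nothing new is assumed.

```python
# Reproduce the back-of-envelope estimate from the bullets above.

alpaca_samples = 50_000
tokens_per_sample = 110
training_tokens = alpaca_samples * tokens_per_sample   # ~5.5 M tokens
hours = 3
implied_7b_tps = training_tokens / (hours * 3600)      # Raschka-implied 7B rate

baseline_1p7b_tps = 1400   # calculator's 1.7 B baseline
speed_factor_8b = 3.1      # s for a 7-8 B model
implied_8b_tps = baseline_1p7b_tps / speed_factor_8b   # calculator's 8B rate

print(round(implied_7b_tps), round(implied_8b_tps))  # 509 452
```

The two rates land within ~15% of each other, which is the "within expected variance" claim made above.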
SLM throughput inputs
Training hours: $$t_{train} = \frac{N \cdot S \cdot E}{\text{tps}_{train} \cdot 3600 / s}$$ Inference hours: $$t_{inf} = \frac{T_{month}}{\text{tps}_{inf} \cdot 3600 / s}$$
Legend: $N$ = training samples, $S$ = tokens per sample (prompt + input + output), $E$ = epochs, $T_{month}$ = total tokens, $s$ = model size slowdown factor (larger models → higher $s$), $\text{tps}_{train}$ = baseline training tokens/sec (divided by $s$ in formula), $\text{tps}_{inf}$ = inference tokens/sec
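The two timing formulas above translate directly into code. This is a minimal sketch whose parameter names mirror the legend (N, S, E, T_month, s, tps_train, tps_inf); the function names are assumptions.

```python
# Minimal sketch of the two timing formulas; names mirror the legend.

def training_hours(N: int, S: int, E: int, tps_train: float, s: float) -> float:
    """t_train = N * S * E / (tps_train * 3600 / s)."""
    return (N * S * E) / (tps_train * 3600 / s)

def inference_hours(T_month: int, tps_inf: float, s: float) -> float:
    """t_inf = T_month / (tps_inf * 3600 / s)."""
    return T_month / (tps_inf * 3600 / s)

# Using the document's reference numbers: 50k samples x 110 tok, 1 epoch,
# 1,400 tok/s baseline, s = 3.1 (roughly a 7-8 B model):
print(round(training_hours(50_000, 110, 1, 1400, 3.1), 1))  # 3.4
```

Dividing the baseline throughput by `s` models the slowdown of larger models, which is why higher `s` yields more hours for the same token count.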
Throughput references: Raschka — Practical Tips for Finetuning LLMs (LoRA/QLoRA benchmarks), Hu et al. 2021 — LoRA paper, Dettmers et al. 2023 — QLoRA paper, HuggingFace PEFT blog.