SLM Benefit Calculator

Compare the training cost of a small language model (SLM) with the ongoing usage cost of a hosted LLM. Estimates assume LoRA/PEFT tuning for GPU sizing and training time; adjust inputs to match your stack.

Outputs

- Break-even LLM calls: the number of LLM calls that pays back the one-time training investment.
- Inference cost saving: versus running the same call volume on the hosted LLM.
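For concreteness, here is a minimal Python sketch of the break-even idea. The formula (training cost ÷ per-call saving) and every number in it are illustrative assumptions, not the calculator's published internals:

```python
# Sketch of the break-even output (assumed formula; numbers are illustrative).

def break_even_calls(training_cost_usd: float,
                     llm_cost_per_call: float,
                     slm_cost_per_call: float) -> float:
    """Calls at which cumulative LLM spend equals the one-time training
    cost plus cumulative SLM inference spend."""
    saving_per_call = llm_cost_per_call - slm_cost_per_call
    if saving_per_call <= 0:
        return float("inf")  # the SLM never pays for itself
    return training_cost_usd / saving_per_call

# Example: $40 training run, $0.004/call on the hosted LLM, $0.0005/call on the SLM
print(round(break_even_calls(40.0, 0.004, 0.0005)))  # ≈ 11,429 calls
```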

Inputs

- LLM provider & pricing: defaults to published list prices (a last-updated date is shown); you can override below.
- Task & volume: estimate tokens with tiktokenizer.
- SLM & GPU costs: uses Runpod serverless A100 pricing per second; the current Runpod serverless and Vast.ai rates are displayed in the calculator.
Cost breakdown & details

- One-time training cost
- Inference cost
- Calculation breakdown
- Speed factor reference ($s$)
How the 1,400 tok/s baseline was derived

The 1,400 tok/s figure is a back-calculated estimate, not a direct citation.

  1. Raschka's benchmark: Llama 2 7B, LoRA r=256, A100 80 GB → ~3 h on Alpaca 50k (~110 tok/sample → ~5.5 M training tokens).
  2. Implied throughput: 5.5 M ÷ (3 × 3,600) ≈ 510 tok/s for a 7B model.
  3. A 7B/8B model uses speed factor s ≈ 3.1 in this calculator, so the 1.7B baseline is set to 1,400 tok/s → 1,400 ÷ 3.1 ≈ 450 tok/s for 8B.
  4. This is consistent with Raschka's ~510 tok/s for 7B, within expected variance from architecture and LoRA-rank differences; the sketch below replays the arithmetic.
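The sketch below simply replays that arithmetic in Python, using only the numbers quoted in the list:

```python
# Back-of-the-envelope check of the 1,400 tok/s baseline derivation.
tokens = 50_000 * 110              # Alpaca 50k at ~110 tok/sample ≈ 5.5M tokens
raschka_tps = tokens / (3 * 3600)  # ~3 h total on an A100 80GB
print(f"implied 7B throughput: {raschka_tps:.0f} tok/s")   # ≈ 509

s = 3.1            # calculator's speed factor for a 7B/8B model
baseline_tps = 1_400  # 1.7B reference baseline
print(f"8B from baseline: {baseline_tps / s:.0f} tok/s")   # ≈ 452

# Cross-check: 50k samples at ~452 tok/s back out to a few hours of training
print(f"hours for 50k samples: {tokens / (baseline_tps / s) / 3600:.1f}")  # ≈ 3.4
```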
SLM throughput inputs

Training hours: $$t_{train} = \frac{N \cdot S \cdot E}{\text{tps}_{train} \cdot 3600 / s}$$

Inference hours: $$t_{inf} = \frac{T_{month}}{\text{tps}_{inf} \cdot 3600 / s}$$

Legend:
- $N$ = training samples
- $S$ = tokens per sample (prompt + input + output)
- $E$ = epochs
- $T_{month}$ = total tokens per month
- $s$ = model-size slowdown factor (larger models → higher $s$)
- $\text{tps}_{train}$ = baseline training tokens/sec (divided by $s$ in the formula)
- $\text{tps}_{inf}$ = baseline inference tokens/sec (likewise divided by $s$)

$\text{tps}_{train}$: baseline training throughput on an A100 80GB for the 1.7B reference model (LoRA r=256), divided by the model slowdown factor $s$; yields ~450 tok/s for 8B → ~3.4 h for 50k samples.
$\text{tps}_{inf}$: inference throughput measured on an A100 80GB; adjust to match your stack.
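A minimal Python sketch of the two formulas above, with a placeholder GPU rate to convert hours into dollars. The $2.50/h rate, the 10M tokens/month volume, and the use of the 1,400 tok/s baseline for inference are illustrative assumptions, not live Runpod or Vast.ai figures:

```python
# Sketch of the training/inference-hours formulas plus dollar conversion.
# All example numbers are illustrative placeholders.

def training_hours(n_samples: int, tok_per_sample: float, epochs: float,
                   tps_train: float, s: float) -> float:
    # t_train = N * S * E / (tps_train * 3600 / s)
    return (n_samples * tok_per_sample * epochs) / (tps_train * 3600 / s)

def inference_hours(tokens_per_month: float, tps_inf: float, s: float) -> float:
    # t_inf = T_month / (tps_inf * 3600 / s)
    return tokens_per_month / (tps_inf * 3600 / s)

gpu_usd_per_hour = 2.50  # placeholder A100 80GB rate, not a quoted price
t_train = training_hours(50_000, 110, 1, 1_400, 3.1)   # 8B model via s = 3.1
t_inf = inference_hours(10_000_000, 1_400, 3.1)        # assumed 10M tok/month

print(f"training: {t_train:.1f} h one-time  → ${t_train * gpu_usd_per_hour:.2f}")
print(f"inference: {t_inf:.1f} h/month → ${t_inf * gpu_usd_per_hour:.2f}/month")
```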