Break-even LLM calls (number of LLM calls that pay back the training investment)
--
Inference cost savings (vs. running the same calls on the LLM)
--
Inputs
LLM provider & pricing
Task & volume
SLM & GPU costs
Cost breakdown & details
One-time training cost
Inference cost
Calculation breakdown
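The break-even metric above can be sketched as a one-line division: the one-time training cost divided by the per-call saving of the SLM over the LLM. This is a hypothetical sketch of that definition, not the calculator's actual implementation; the function name, variable names, and per-call cost model are all assumptions.

```python
# Hypothetical sketch of "break-even LLM calls": the call count at which
# cumulative per-call savings cover the one-time training cost.
# All names and the flat per-call cost model are assumptions.

def break_even_calls(training_cost_usd: float,
                     llm_cost_per_call_usd: float,
                     slm_cost_per_call_usd: float) -> float:
    """Calls at which cumulative LLM spend equals one-time SLM
    training cost plus cumulative SLM inference spend."""
    saving_per_call = llm_cost_per_call_usd - slm_cost_per_call_usd
    if saving_per_call <= 0:
        raise ValueError("SLM must be cheaper per call to break even")
    return training_cost_usd / saving_per_call

# Illustrative numbers: $500 training, $0.01/call LLM, $0.002/call SLM
print(round(break_even_calls(500.0, 0.01, 0.002)))  # 62500
```

The example figures are placeholders; the calculator derives the per-call costs from the pricing and throughput inputs listed above.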
Speed factor reference (s)
The 1,400 tok/s figure is a back-calculated estimate, not a direct citation.
- Raschka's benchmark: Llama 2 7B, LoRA r=256, A100 80 GB → ~3 h on Alpaca 50k (~110 tok/sample → ~5.5 M training tokens).
- Implied throughput: 5.5 M ÷ (3 × 3,600) ≈ ~510 tok/s for a 7 B model.
- A 7–8 B model uses speed factor s ≈ 3.1 in this calculator, so setting the 1.7 B baseline to 1,400 tok/s implies 1,400 ÷ 3.1 ≈ 450 tok/s for an 8 B model.
- This is consistent with Raschka's ~510 tok/s for 7 B (within expected variance from architecture and LoRA rank differences).
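The back-calculation in the bullets above is plain arithmetic and can be checked in a few lines. Every figure below comes straight from the bullet points; nothing new is assumed.

```python
# Reproduce the back-of-envelope estimate from the bullets above.

alpaca_samples = 50_000
tokens_per_sample = 110
training_tokens = alpaca_samples * tokens_per_sample   # ~5.5 M tokens
hours = 3
implied_7b_tps = training_tokens / (hours * 3600)      # Raschka-implied 7B rate

baseline_1p7b_tps = 1400   # calculator's 1.7 B baseline
speed_factor_8b = 3.1      # s for a 7-8 B model
implied_8b_tps = baseline_1p7b_tps / speed_factor_8b   # calculator's 8B rate

print(round(implied_7b_tps), round(implied_8b_tps))  # 509 452
```

The two rates land within ~15% of each other, which is the "within expected variance" claim made above.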
SLM throughput inputs
Training hours: $$t_{train} = \frac{N \cdot S \cdot E}{\text{tps}_{train} \cdot 3600 / s}$$ Inference hours: $$t_{inf} = \frac{T_{month}}{\text{tps}_{inf} \cdot 3600 / s}$$
Legend: $N$ = training samples, $S$ = tokens per sample (prompt + input + output), $E$ = epochs, $T_{month}$ = total tokens, $s$ = model size slowdown factor (larger models → higher $s$), $\text{tps}_{train}$ = baseline training tokens/sec (divided by $s$ in formula), $\text{tps}_{inf}$ = inference tokens/sec
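The two timing formulas above translate directly into code. This is a minimal sketch whose parameter names mirror the legend (N, S, E, T_month, s, tps_train, tps_inf); the function names are assumptions.

```python
# Minimal sketch of the two timing formulas; names mirror the legend.

def training_hours(N: int, S: int, E: int, tps_train: float, s: float) -> float:
    """t_train = N * S * E / (tps_train * 3600 / s)."""
    return (N * S * E) / (tps_train * 3600 / s)

def inference_hours(T_month: int, tps_inf: float, s: float) -> float:
    """t_inf = T_month / (tps_inf * 3600 / s)."""
    return T_month / (tps_inf * 3600 / s)

# Using the document's reference numbers: 50k samples x 110 tok, 1 epoch,
# 1,400 tok/s baseline, s = 3.1 (roughly a 7-8 B model):
print(round(training_hours(50_000, 110, 1, 1400, 3.1), 1))  # 3.4
```

Dividing the baseline throughput by `s` models the slowdown of larger models, which is why higher `s` yields more hours for the same token count.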
Throughput references: Raschka — Practical Tips for Finetuning LLMs (LoRA/QLoRA benchmarks), Hu et al. 2021 — LoRA paper, Dettmers et al. 2023 — QLoRA paper, HuggingFace PEFT blog.