Llama 3.3 70B Instruct
Overview
The Meta Llama 3.3 multilingual large language model (LLM) is a pretrained and instruction-tuned generative model in 70B (text in/text out). The Llama 3.3 instruction-tuned, text-only model is optimized for multilingual dialogue use cases.
History
Llama 3.3 70B Instruct was released by Meta on 2024-12-06.
Training & availability
Training data has a knowledge cutoff of 2023-12-31; information about events after that date is unlikely to appear in the model's responses. Meta released the model weights under the Llama 3.3 Community License, and the model is also hosted by many third-party inference providers (see the provider table below).
Capabilities
- Context window: 131K tokens.
- Input modalities: text.
- Recommended for: fast, low-cost, high-volume text tasks.
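A rough way to use the 131K-token figure in practice is a pre-flight length check. The sketch below relies on the common ~4 characters-per-token heuristic, which is an approximation; exact counts require the model's tokenizer.

```python
# Rough check that a prompt plus the expected reply fits in the
# 131K-token context window. The chars/4 ratio is a heuristic,
# not an exact token count.
CONTEXT_WINDOW = 131_000

def fits_in_context(prompt: str, max_output_tokens: int = 1024) -> bool:
    est_prompt_tokens = len(prompt) / 4  # heuristic estimate
    return est_prompt_tokens + max_output_tokens <= CONTEXT_WINDOW

print(fits_in_context("Explain quantum computing in one sentence."))
```

For production use, swap the heuristic for a real tokenizer-based count.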
Limitations
- The knowledge cutoff is December 2023, so the model will not know about recent events, releases, or API changes.
- Text-only: it cannot process image, audio, or video inputs.
Pricing
- Input: $0.10 per 1M tokens
- Output: $0.32 per 1M tokens
Use the cost calculator below to estimate monthly spend for your workload.
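At these rates, monthly spend is a simple linear function of token volume. A minimal sketch, using the listed prices; the request counts and token sizes in the example are made-up workload numbers, not recommendations:

```python
# Estimate monthly spend from the listed per-1M-token prices.
INPUT_PRICE = 0.10 / 1_000_000   # $ per input token
OUTPUT_PRICE = 0.32 / 1_000_000  # $ per output token

def monthly_cost(requests_per_day: int,
                 in_tokens_per_req: int,
                 out_tokens_per_req: int,
                 days: int = 30) -> float:
    total_in = requests_per_day * in_tokens_per_req * days
    total_out = requests_per_day * out_tokens_per_req * days
    return total_in * INPUT_PRICE + total_out * OUTPUT_PRICE

# Hypothetical workload: 10,000 requests/day, 500 input + 200 output tokens each.
print(f"${monthly_cost(10_000, 500, 200):.2f}")  # → $34.20
```

Output tokens dominate the bill here despite being fewer, because the output rate is 3.2× the input rate.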
Example interactions
Curated prompts showing the model's response style — not cherry-picked to impress, picked to show what typical output looks like.
Quick start
Minimal example using the OpenRouter API. Copy, paste, replace the key.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # replace with your OpenRouter API key
)

resp = client.chat.completions.create(
    model="meta-llama/llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Explain quantum computing in one sentence."}],
)

print(resp.choices[0].message.content)
```
Cost calculator
Estimate your monthly bill. Presets are typical workload sizes.
Providers & performance
16 providers. Multi-provider inference routes for this model, sorted by throughput. Latency is time-to-first-token (TTFT); throughput is output tokens per second. Data from OpenRouter, measured over the last 30 minutes.
| Provider | Throughput | Latency (TTFT) | Input $ / 1M | Output $ / 1M | Context | Quant | Supports |
|---|---|---|---|---|---|---|---|
| Groq | 184 tok/s | 308 ms | $0.59 | $0.79 | 131K | — | tools · json |
| SambaNova | 92 tok/s | 548 ms | $0.45 | $0.90 | 16K | bf16 | — |
| SambaNova | 77 tok/s | — | — | — | — | — | — |
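TTFT and throughput combine into a useful end-to-end estimate: total time ≈ TTFT + output_tokens / throughput. A small sketch, plugging in the Groq row from the table above:

```python
def response_time_s(ttft_ms: float, throughput_tok_s: float, output_tokens: int) -> float:
    """Estimated end-to-end response time: time-to-first-token plus generation time."""
    return ttft_ms / 1000 + output_tokens / throughput_tok_s

# Groq row: 308 ms TTFT at 184 tok/s, generating a 500-token reply.
print(round(response_time_s(308, 184, 500), 2))  # → 3.03
```

For long replies the throughput term dominates; for short replies TTFT matters more, which is why the table reports both.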
Popularity
Signals from open-source communities — not a quality measure, but useful for gauging adoption among developers.
Integrations & tooling support
- Tool calling: not supported on the default route; some providers in the table above (e.g. Groq) do support tool and JSON output.
- Structured outputs: not supported on the default route; availability varies by provider.
Price vs quality
Priced low, which suits high-volume tasks. Quality tier pending more benchmark coverage.
- Quality percentile: —
- Effective price: $0.265/1M
- Pricing breakdown: $0.10/1M in · $0.32/1M out
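The effective price appears to be a weighted blend of the input and output rates. A 1:3 input:output weighting reproduces the $0.265/1M figure shown above; the site's actual weighting is an assumption, so treat the ratio as illustrative:

```python
# Blended ("effective") price per 1M tokens from separate input/output rates.
# The 1:3 input:output weighting is an assumption that happens to reproduce
# the $0.265/1M figure listed for this model.
def effective_price(in_price: float, out_price: float,
                    in_weight: float = 1, out_weight: float = 3) -> float:
    total = in_weight + out_weight
    return (in_price * in_weight + out_price * out_weight) / total

print(round(effective_price(0.10, 0.32), 3))
```

With equal weights the blend would instead be $0.21/1M, so the assumed ratio materially changes the headline number.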