Embeddings Cost Calculator — Plan Your Vector Index Bill

🔒 Runs in your browser — nothing is sent to a server

Embeddings cost calculator for OpenAI text-embedding-3-small/large, Anthropic-recommended Voyage 3, Google Gemini Embedding and Mistral Embed. Enter the number of documents (or chunks) and the average size in tokens, words or characters, and the calculator shows the total token volume, the one-time cost to build the index, and the recurring monthly cost when a share of the corpus is re-embedded. Check the Batch box to apply the Batch tier rate where the provider offers one. Useful for sizing a RAG project before you commit to a vector store contract, or for deciding whether the quality jump from 3-small to 3-large is worth the 6.5× price.

Pricing snapshot: May 16, 2026

Embedding model

Number of documents / chunks

Average size per document

Re-embed share per month (0–1)

0.1 = 10% of corpus is updated and re-embedded each month

Use Batch API (50% off where supported)

Total tokens to embed

5.00M

10,000 docs × 500 tokens/doc

One-time embedding cost

$0.1000

Standard rate · snapshot 2026-05-16

Monthly re-embed cost

$0.0100

~500.0K tokens / month

Vector storage footprint (fp32)

Bytes per vector: 6,144 B

Total raw size: 58.59 MB

fp32 is the worst case; production stores use fp16, int8 or product quantization to cut storage 2–32×. A dedicated Vector Storage Calculator is on the MVP-2 roadmap.

Pricing breakdown · text-embedding-3-small

Standard rate: $0.020/M

Batch rate: $0.010/M

Dimensions: 1,536

Max input tokens: 8,192

Picking the right embedding model

Three criteria dominate the choice: quality (MTEB benchmark score on the closest task type to yours), price per million tokens, and dimensions (storage and search compute scale linearly with this). For most teams, OpenAI text-embedding-3-small is the right default — cheap enough to be irrelevant in the budget, 1536 dimensions storeable in any vector DB, and within 1–3% of best-in-class on retrieval benchmarks. Reach for text-embedding-3-large or Voyage 3 Large only when retrieval quality is mission-critical and you have an evaluation pipeline that can prove a lift. Gemini Embedding wins on Google Cloud setups; Mistral Embed wins on European-data-residency stories.

Where embedding cost actually shows up

Three line items: (1) one-time index build — usually a single-digit-dollar bill for a small corpus, a few hundred for millions of chunks; (2) periodic re-embeds — typically 5–15% of the index per month, scales linearly; (3) per-query embedding — every search call embeds the user query once, free at small volume but adds up to a few dollars per million queries on text-embedding-3-small. Total embedding spend in production RAG is almost always under 5% of the total LLM bill, which is why most teams safely ignore it. The real saving comes from chunking smarter, not embedding cheaper.
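
The three line items can be sketched as a small Python function. This is an illustration only: the function name, the parameter names and the 100-token query length are assumptions, and the rate is the text-embedding-3-small snapshot price quoted on this page.

```python
def embedding_line_items(corpus_tokens, reembed_share, queries_per_month,
                         tokens_per_query, rate_per_million):
    """Split embedding spend into the three line items, in dollars."""
    per_token = rate_per_million / 1_000_000
    return {
        "index_build_once": corpus_tokens * per_token,
        "reembed_per_month": corpus_tokens * reembed_share * per_token,
        "queries_per_month": queries_per_month * tokens_per_query * per_token,
    }


# 50M-token corpus, 10% re-embedded monthly, 1M queries of ~100 tokens each
# (the query length is an assumption), at $0.02 per 1M tokens.
items = embedding_line_items(50_000_000, 0.10, 1_000_000, 100, 0.02)
for name, dollars in items.items():
    print(f"{name}: ${dollars:.2f}")
```

With these inputs the build costs about $1.00, the monthly re-embed about $0.10 and a million queries about $2.00. Shorter queries shrink the last figure proportionally.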

Examples

RAG over 100K chunks of 500 tokens each (text-embedding-3-small)

Input: 100,000 chunks × 500 tokens

Output: Total 50M tokens. One-time cost: $1.00 standard, $0.50 batch. Re-embedding 10%/month: $0.10/mo.

Same corpus on text-embedding-3-large (3072-dim)

Input: 100,000 chunks × 500 tokens

Output: One-time cost: $6.50 standard, $3.25 batch. 3072-dim vectors at fp32 = 1.2 GB raw vs 600 MB for 3-small.

High-quality semantic search (Voyage 3 Large for 1M chunks)

Input: 1,000,000 chunks × 400 tokens

Output: Total 400M tokens. One-time cost: $72 (no batch tier yet). Re-embedding 5%/month: $3.60/mo.

FAQ

How is embedding cost computed?

Embedding APIs charge per million input tokens, with no output cost (the response is a fixed-size vector). Total bill equals total tokens divided by 1,000,000, times the model's per-million-token rate. The Batch API on OpenAI and Mistral cuts this in half for asynchronous workloads. Anthropic's recommended Voyage models do not currently expose a batch tier.
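
The formula above can be sketched in a few lines of Python. The function and parameter names are illustrative rather than taken from the calculator's source, and the 50% batch discount is the figure quoted on this page, passed in as an assumption.

```python
def embedding_cost(num_docs, tokens_per_doc, rate_per_million,
                   batch=False, batch_discount=0.5):
    """Return the dollar cost of embedding num_docs documents once.

    rate_per_million is the standard price per 1M input tokens;
    batch_discount is the assumed Batch-tier discount (0.5 = 50% off).
    """
    total_tokens = num_docs * tokens_per_doc
    rate = rate_per_million * (1 - batch_discount) if batch else rate_per_million
    return total_tokens / 1_000_000 * rate


# 100,000 chunks of 500 tokens on text-embedding-3-small ($0.02 per 1M tokens)
standard = embedding_cost(100_000, 500, 0.02)
batched = embedding_cost(100_000, 500, 0.02, batch=True)
print(f"standard ${standard:.2f}, batch ${batched:.2f}")
```

Only pass `batch=True` for models whose provider actually publishes a batch rate.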

When should I use text-embedding-3-small vs 3-large?

text-embedding-3-small at $0.02/M is the default for almost everything: it produces 1536-dim vectors and benchmarks within a few percent of 3-large on MTEB retrieval scores. text-embedding-3-large at $0.13/M produces 3072-dim vectors and squeezes a few more points on hard retrieval tasks; useful when search relevance is mission-critical and the corpus is large enough that you can measure the difference. Try small first; upgrade only if you can prove a quality lift.

How do I size the average document if I have a mix of sizes?

Take the median, not the mean — a corpus with one huge outlier document distorts the mean badly. Better yet, run a quick token-count pass over a 100-document sample and feed the median into the calculator. For RAG specifically, you embed chunks not whole documents, so the right unit is the chunk size you plan to use (typically 200–800 tokens with 10–20% overlap). The RAG Chunk Estimator on this site walks through that calculation.
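
A minimal sketch of the sampling step, using Python's standard `statistics` module. The token counts are invented for illustration; in practice they would come from running your tokenizer over a sample of real documents.

```python
import statistics

# Hypothetical token counts from a small sample; the last one is an outlier.
sample_token_counts = [420, 380, 510, 450, 395, 470, 430, 12_000]

mean_tokens = statistics.mean(sample_token_counts)
median_tokens = statistics.median(sample_token_counts)

print(f"mean:   {mean_tokens:.0f} tokens")
print(f"median: {median_tokens:.0f} tokens")
```

Here the single 12,000-token document pulls the mean to roughly four times the median. The outlier's own tokens are still billed, so if the corpus holds many very long documents it is safer to count them separately than to fold them into the average.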

Do I really pay to re-embed unchanged content?

Only when you change the embedding model, change the chunking strategy, or rebuild the index. As long as you keep the same model and the source content does not change, you embed each chunk exactly once. The "re-embed share" parameter on the calculator models the case where 5–20% of a knowledge base is updated each month, which is typical for product catalogues, help centres or evolving documentation.

How big are embedding vectors in storage?

Each vector is `dimensions × 4 bytes` at native fp32 precision. text-embedding-3-small = 6 KB per vector; 3-large = 12 KB; Voyage 3 Large at 2048d = 8 KB. Production stores typically halve that with fp16 or quarter it with int8 quantization (1.5 KB per vector for 3-small, 3 KB for 3-large), usually with little measurable quality loss. Estimate raw storage from the calculator, then multiply by your vector DB's overhead factor (Pinecone, Qdrant, pgvector all carry ~2–3× metadata overhead).
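
The storage arithmetic can be sketched as follows. The function name and the `overhead` parameter are illustrative assumptions; the overhead factor in particular depends on your vector DB and should be measured rather than taken from this sketch.

```python
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2, "int8": 1}


def vector_storage_bytes(num_vectors, dimensions, precision="fp32", overhead=1.0):
    """Raw vector storage in bytes, optionally scaled by a DB overhead factor."""
    return num_vectors * dimensions * BYTES_PER_VALUE[precision] * overhead


# 10,000 vectors from text-embedding-3-small (1536 dimensions)
raw_fp32 = vector_storage_bytes(10_000, 1536)
raw_int8 = vector_storage_bytes(10_000, 1536, precision="int8")

print(f"fp32: {raw_fp32 / 2**20:.2f} MiB")
print(f"int8: {raw_int8 / 2**20:.2f} MiB")
```

The fp32 result is 61,440,000 bytes, which is the calculator's 58.59 figure when expressed in binary megabytes (MiB).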

Can I use the Batch API for embeddings?

Yes on OpenAI and Mistral — both publish batch rates at 50% of standard. The catch is the 24-hour turnaround window: you submit a JSONL file, wait, then download results. Suitable for initial index builds, periodic re-embeds, or moving offline corpora. Not suitable for indexing new user-uploaded content in real time — keep that on the standard tier.
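
As an illustration, the JSONL file for an OpenAI batch job could be built like this. The request-line shape (`custom_id`, `method`, `url`, `body`) follows OpenAI's Batch API documentation as understood at the time of writing; verify it against the current docs before relying on it. Mistral's batch format is different and is not covered here. The helper name and the `chunk-N` ID scheme are hypothetical.

```python
import json


def build_batch_lines(chunks, model="text-embedding-3-small"):
    """Return one JSON request line per chunk for an embeddings batch file."""
    lines = []
    for i, text in enumerate(chunks):
        request = {
            "custom_id": f"chunk-{i}",  # hypothetical ID scheme to match results back
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": model, "input": text},
        }
        lines.append(json.dumps(request, ensure_ascii=False))
    return lines


chunks = ["First chunk of text.", "Second chunk of text."]
jsonl = "\n".join(build_batch_lines(chunks))
print(jsonl)
```

The resulting text is what you would save as a `.jsonl` file and upload. The `custom_id` is what lets you pair each returned vector with its chunk, since batch results are not guaranteed to come back in submission order.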

Are embedding costs the dominant cost in RAG?

Almost never. The index build is a small one-time bill (a few dollars to a few hundred for typical corpora). The recurring costs in production are the LLM generation calls at query time, whose per-token price is often 100–1000× that of embeddings. Optimise for retrieval quality (which controls how few chunks you need at query time, and thus LLM input tokens), not for embedding cost alone.

Glossary

Embedding

An embedding is a fixed-length numeric vector representing the semantic content of a piece of text. Texts with similar meaning produce vectors that are close together under cosine or dot-product distance. Modern embedding models output 512–3072 dimensions. Used as the indexing primitive for semantic search, retrieval-augmented generation, clustering and de-duplication.

Vector dimensions

Dimensions is the length of the embedding vector. OpenAI text-embedding-3-small outputs 1536 dimensions; 3-large outputs 3072; Voyage 3 Large supports up to 2048; Gemini Embedding outputs 3072. Higher-dimension vectors carry slightly more semantic information, but storage and search compute scale linearly: doubling the dimensions doubles the storage and roughly doubles the search compute. Most pipelines do not need more than 1536.

Cosine similarity

Cosine similarity is the standard similarity measure for embeddings: the cosine of the angle between two vectors, ranging from −1 (opposite direction) to 1 (identical direction). All major embedding models are trained so that semantically similar text produces high cosine similarity. The vector database does the cosine computation; the LLM never sees the math, only the top-K retrieved chunks.
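
For reference, the computation itself is short. This is a plain-Python sketch for illustration; a vector database would use an optimized, usually approximate, implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same = cosine_similarity([1.0, 2.0], [2.0, 4.0])       # same direction
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])  # unrelated
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])   # opposite direction
print(same, orthogonal, opposite)
```

Note that vector length does not matter, only direction: `[1, 2]` and `[2, 4]` score as identical. A zero vector would raise `ZeroDivisionError` in this sketch.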

Re-embedding

Re-embedding is the process of regenerating vectors when the source content changes, the chunking strategy changes, or you migrate to a different embedding model. Re-embedding the full corpus costs the same as the initial build. Most production systems re-embed 5–15% of the corpus per month for content updates, and do a full re-embed every 6–18 months when migrating models.

Quantization

Quantization reduces the precision of stored vector values to save space. fp32 (32-bit float) is the native output; int8 (8-bit signed) cuts storage 4×; binary quantization (1 bit per dimension) cuts 32× at modest quality loss. Most production vector stores quantize automatically; the calculator on this page shows raw fp32 size as the worst case.
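
A minimal sketch of the two schemes, assuming simple symmetric scalar quantization with one scale per vector. Real vector stores use their own, often more elaborate, variants, so treat the function names and the scaling rule as illustrative.

```python
def quantize_int8(vector):
    """Map floats to integers in [-127, 127] using one scale per vector."""
    scale = max(abs(x) for x in vector) / 127 or 1.0  # avoid a zero scale
    return [round(x / scale) for x in vector], scale


def dequantize_int8(codes, scale):
    """Approximate reconstruction of the original floats."""
    return [c * scale for c in codes]


def quantize_binary(vector):
    """Keep only the sign of each dimension: 1 bit instead of 32."""
    return [1 if x > 0 else 0 for x in vector]


vector = [0.12, -0.5, 0.33, 0.0]
codes, scale = quantize_int8(vector)
restored = dequantize_int8(codes, scale)
bits = quantize_binary(vector)
print(codes, bits)
```

Each int8 code fits in one byte instead of four, and the reconstruction error per value is at most half the scale. Binary quantization discards magnitude entirely, which is where its larger quality loss comes from.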

Related tools

RAG Chunk Estimator

Estimate chunks, embedding cost and per-query LLM cost for a RAG pipeline end-to-end

LLM Cost Calculator

Compare API cost across GPT, Claude, Gemini, Llama and Mistral for a given token budget

LLM Token Counter

Count tokens for GPT, Claude, Gemini, Llama, and Mistral with live cost estimate

Context Window Fit Checker

Check whether system + history + prompt fit any model context window, with output headroom

Prompt Caching Savings

Estimate how much prompt caching saves on Claude, GPT and Gemini at a given cache hit rate

JSON Beautify

Pretty-print and format minified JSON with proper 2-space indentation in your browser

XML to JSON

Convert XML documents to JSON with attribute mapping and namespace support

HTML Encode

Escape HTML special characters into safe entities with a free online HTML encoder