PromptUtils


BERT/GPT Tokenizer Visualizer

Paste your text and see how different tokenizers (BERT, GPT-4, Claude, etc.) break it into tokens. Understand token counts and identify problematic splits.


How to Use

  1. Select a tokenizer – Choose BERT, GPT-4, GPT-2, or Claude
  2. Paste your text – Enter text you want to analyze
  3. Click Visualize – See tokens, IDs, and highlighting
  4. Review stats – Check token count, efficiency, and costs

Understanding Token Colors

  • Normal Token – Standard word or subword token
  • Special Token – [CLS], [SEP], [UNK], start/end markers
  • Problematic – Rare or unusual splits that waste tokens
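A visualizer might assign these categories with simple rules. Here is a minimal sketch in Python; the special-token list and the "problematic" heuristic are illustrative assumptions, not this tool's actual logic:

```python
# Toy token classifier mirroring the three color categories above.
# SPECIAL_TOKENS and the "problematic" heuristic are assumptions made
# for illustration, not the tool's real rules.
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[UNK]", "[PAD]", "[MASK]", "<s>", "</s>"}

def classify_token(token: str) -> str:
    """Return the display category for a single token."""
    if token in SPECIAL_TOKENS:
        return "special"
    # Heuristic: a one-character subword continuation ("##x") usually
    # means the tokenizer fell back to a rare, wasteful split.
    if token.startswith("##") and len(token) == 3:
        return "problematic"
    return "normal"
```

Real tools tune such heuristics per tokenizer, since what counts as a "rare" split differs between vocabularies.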

Tokenization Algorithms

  • BPE (Byte-Pair Encoding): Used by GPT-2, GPT-3.5, GPT-4, RoBERTa. Splits words into subwords based on frequency. Good for handling rare words and multiple languages.
  • WordPiece: Used by BERT and related models. Similar to BPE but uses likelihood instead of frequency. Uses ## prefix for subword continuations. Prefers longer tokens.
  • SentencePiece: Used by Llama, T5, XLNet, PaLM/Gemini. Treats spaces as tokens (▁). Handles multiple languages well. Creates more linguistically meaningful splits.
  • cl100k_base & p50k_base: OpenAI-specific encodings. cl100k_base is newer (~100K vocab) and generally more efficient than p50k_base (~50K vocab).
  • Custom (Anthropic): Claude models use a proprietary tokenizer whose details are not public, so Claude token counts are estimates.
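The core BPE procedure above, repeatedly merging the most frequent adjacent symbol pair, can be sketched in a few lines of Python. This is a toy trainer over a word-frequency dict, not any library's actual implementation:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a {word: frequency} dict (toy sketch)."""
    # Each word starts as a tuple of single characters.
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        merged = best[0] + best[1]
        # Apply the merge everywhere it occurs.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(merged)
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab
```

Training on {"low": 5, "lower": 2, "lowest": 3} first merges ("l", "o"), then ("lo", "w"), so "low" becomes a single token after two merges. Production tokenizers work the same way at byte level with hundreds of thousands of merges.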

Example: Token Count Comparison

Text: "Hello, I'm learning about tokenization!"

GPT-4 (cl100k_base)
9 tokens
Hello | , | I | ' | m | learning | about | tokenization | !
BERT (WordPiece)
11 tokens
Hello | , | I | ' | m | learn | ##ing | about | token | ##ization | !
Llama 2 (SentencePiece)
9 tokens
Hello | , | I | ' | m | learning | about | tokenization | !

Key insights:
• GPT-4 (cl100k_base) and Llama 2 both need 9 tokens for this sentence
• BERT splits more aggressively due to its smaller vocab (11 tokens, before adding [CLS] and [SEP])
• The same text tokenizes differently per model, so always count with the tokenizer your target model uses
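BERT's ##-prefixed splits come from WordPiece's greedy longest-match-first rule: take the longest vocabulary entry that matches from the current position, then continue with ##-prefixed pieces. A toy sketch, using a hypothetical mini-vocabulary rather than BERT's real one:

```python
# Greedy longest-match-first WordPiece splitting (toy sketch).
# VOCAB is a made-up mini-vocabulary for illustration only.
VOCAB = {"token", "##ization", "learn", "##ing", "about"}

def wordpiece(word, vocab=VOCAB):
    """Split one word into WordPiece subwords, or [UNK] on failure."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Shrink the window until a known subword is found.
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = "##" + sub   # continuation pieces carry the ## prefix
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]       # no subword matched: unknown token
        tokens.append(piece)
        start = end
    return tokens
```

With this vocabulary, "tokenization" splits into ["token", "##ization"] and "learning" into ["learn", "##ing"], matching the BERT row above.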

Why Token Count Matters

  • Context Window: Prompt and response together must fit within the model's context window, so every prompt token leaves less room for output.
  • API Costs: Most LLM APIs charge per token. Fewer tokens = lower costs.
  • Model Understanding: Unusual token splits can sometimes confuse models or reduce understanding.
  • Performance: More tokens mean more compute and higher latency. Trimming wasteful tokenization speeds up responses.
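The cost arithmetic is simple once you have token counts. A sketch follows; the per-million-token prices are invented for illustration, so check your provider's current pricing page before relying on any numbers:

```python
# Hypothetical prices in USD per million tokens, for illustration only.
PRICES_PER_MTOK = {"input": 3.00, "output": 15.00}

def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate one request's cost in USD from its token counts."""
    return (input_tokens * PRICES_PER_MTOK["input"]
            + output_tokens * PRICES_PER_MTOK["output"]) / 1_000_000
```

For example, a 1,000-token prompt with a 500-token response would cost about $0.0105 at these illustrative rates. Note that output tokens are typically priced several times higher than input tokens.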

Related Tools