If you have spent any time reading about large language models or experimenting with AI tools, you have almost certainly stumbled across three terms that tend to appear without much explanation: tokens, embeddings, and context windows. They get mentioned in documentation, in pricing pages, in research papers, and in casual conversations about AI capabilities, often as though everyone already knows what they mean. Most people do not, and that is completely understandable. These are genuinely technical concepts that emerged from a specialized field and landed in mainstream conversation faster than anyone had time to properly explain them.
That changes here. By the end of this article, all three concepts will make sense, not in a vague hand-wavy way, but in a way that actually helps you understand what is happening when you interact with an AI system and why these ideas matter for the real-world performance of these tools.
Starting With Tokens: The Alphabet of AI Language
When a large language model reads text, it does not take it in letter by letter or word by word, the way a human reader might. It reads it in chunks called tokens. Understanding what tokens are is the first step to understanding how these systems process language at all.
A token is roughly a piece of a word. Not always a whole word, and not always just a single letter. The exact size and shape of a token depends on how the model was trained and what tokenization method it uses, but a useful rule of thumb is that one token corresponds to approximately four characters of English text, which works out to roughly three quarters of a word on average.
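The rule of thumb above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an estimate: the function name estimate_tokens and the constant CHARS_PER_TOKEN are made up for illustration, and a real tokenizer will give different counts depending on the text and the model.

```python
import math

# Rule-of-thumb average for English text; an approximation, not an exact figure.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count as one token per four characters."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


sample = "Tokens are the units a language model reads."
print(len(sample), "characters is roughly", estimate_tokens(sample), "tokens")
```

For an exact count you would run the model's own tokenizer, since two models can split the same sentence differently.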
To make this concrete, consider the word tokenization itself. A tokenizer might split this into two tokens: token and ization. The word cat might be a single token. The word cats might also be a single token, or it might be split into cat and s, depending on the system. Common short words like the, is, and and are almost always single tokens. Rare or technical words might be split into several tokens each.
Numbers, punctuation, spaces, and characters from non-English languages are also tokenized, though the way they are handled varies. Languages that are underrepresented in the tokenizer’s training data tend to require more tokens to represent the same amount of meaning, because fewer of their common words and word pieces earn a place in the vocabulary. This has practical implications for cost and efficiency when using AI tools in different languages.
Why does any of this matter? Because the model does not actually work with words or letters. It works with tokens. Every step of the process, from reading your input to generating its response, happens in terms of tokens. When you see a pricing page for an AI service that charges per token, that is what it means. When you hit a limit on how much text a model can handle, that limit is measured in tokens. When researchers talk about how much text a model was trained on, they often express it in tokens, sometimes in the trillions.
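To see how per-token pricing adds up, here is a minimal sketch of the arithmetic. The prices are placeholders invented for this example, not the rates of any real service, and the split between input and output prices is an assumption that matches how many (but not all) services bill.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  input_price_per_million: float,
                  output_price_per_million: float) -> float:
    """Return the cost of one request, given prices quoted per million tokens."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


# Placeholder prices in dollars per million tokens (hypothetical values).
INPUT_PRICE = 2.00
OUTPUT_PRICE = 8.00

cost = estimate_cost(input_tokens=12_000, output_tokens=800,
                     input_price_per_million=INPUT_PRICE,
                     output_price_per_million=OUTPUT_PRICE)
print(f"Estimated cost: ${cost:.4f}")
```

The takeaway is that cost scales with the number of tokens going in and coming out, so a long pasted document can dominate the bill even when the answer is short.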
Tokens are the fundamental unit of currency in the language model economy. Everything else builds on top of them.
The Tokenization Process Up Close
It helps to understand briefly how tokenization is decided in the first place. Before a model is trained, researchers choose a tokenization scheme, essentially a vocabulary of chunks that will be used to represent all possible text. This vocabulary is built by analyzing a large sample of text and finding the most common subword units.
One of the most widely used approaches is called byte pair encoding. It starts with individual characters (or, in many modern tokenizers, raw bytes) and repeatedly merges the most frequently occurring pair of adjacent units into a single new token. Run this process enough times, and you end up with a vocabulary of tens or even hundreds of thousands of tokens that covers common words as single units and handles rare or unknown words by breaking them into smaller pieces.
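The merge loop described above can be sketched in a few lines. This is a toy version under simplifying assumptions: it works on characters rather than bytes, treats each word separately, and uses the made-up names learn_bpe_merges and merge_pair. Production tokenizers add many details this sketch leaves out.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(words, num_merges):
    """Learn an ordered list of merge rules from a list of training words."""
    # Every word starts as a tuple of single characters, counted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # nothing left to merge
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


print(learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=3))
```

Each learned merge becomes a new vocabulary entry, so the number of merges you allow is effectively the knob that sets vocabulary size.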
This approach is elegant because it handles any text without ever completely failing. Even if the model encounters a word it has never seen before, it can break it down into familiar subword pieces and process it meaningfully. The word antidisestablishmentarianism, for instance, would be split into several tokens, but the model could still make sense of it by understanding the meaning of its component parts.
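Here is a small sketch of why an unseen word never causes a failure. The merge list below is hypothetical, hand-written for this example rather than learned from real data, and the function name apply_merges is likewise made up. The point is only that a word outside the vocabulary still decomposes into known pieces, and in the worst case into single characters.

```python
def apply_merges(word, merges):
    """Split `word` into characters, then apply each merge rule in order."""
    symbols = list(word)
    for left, right in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == left and symbols[i + 1] == right:
                # Fuse the two neighbouring symbols into one.
                symbols[i:i + 2] = [left + right]
            i += 1
    return symbols


# Hypothetical merge rules, listed in the order they would have been learned.
merges = [
    ("u", "n"), ("h", "a"), ("p", "p"), ("ha", "pp"),
    ("n", "e"), ("s", "s"), ("ne", "ss"),
]

print(apply_merges("unhappiness", merges))  # pieces built from known merges
print(apply_merges("xyz", merges))          # no rule applies: single characters
```

Because the fallback is always the individual character (or byte), the tokenizer can represent any input, just less efficiently when the text is unfamiliar.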
The tokenizer is trained separately from the language model itself and remains fixed once established. The same tokenizer is used at inference time, meaning when you actually use the model, ensuring that the text you provide is converted into the same format the model learned to work with during training.
Embeddings: Turning Tokens Into Mathematics
Once text has been broken into tokens, the model needs to do something with them. But computers cannot work directly with words or word fragments. They work with numbers. This is where embeddings come in.
An embedding is a way of representing a token as a point in a high-dimensional mathematical space. More concretely, it is a list of numbers, often hundreds or thousands of numbers long, that together encode something about the meaning and character of that token.
The key property that makes embeddings powerful is that tokens with similar meanings end up with similar embeddings. Tokens that are semantically related are positioned close together in this mathematical space. Tokens that are unrelated end up far apart.
Think of it this way. Imagine a two-dimensional map where every word in the English language is placed as a dot. Words that are related in meaning are placed near each other. Dog and cat end up close together. Dog and automobile end up far apart. King and queen end up near each other, and interestingly, the direction and distance between them might mirror the direction and distance between man and woman, capturing a relationship between concepts through geometry.
Real embeddings are not two-dimensional. They might have hundreds or thousands of dimensions, which makes them impossible to visualize directly. But the same principle applies. Meaning is encoded as geometry. Relationships between concepts are encoded as directions and distances in this high-dimensional space.
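The idea that meaning is geometry can be made concrete with a few hand-made vectors. Everything in the sketch below is invented for illustration: the four dimensions (loosely royalty, gender, animal, machine) and the values are not taken from any real model, where dimensions are learned and rarely have such tidy interpretations. Closeness is measured with cosine similarity, a common choice for comparing embeddings.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy embeddings: [royalty, gender, animal, machine]. Values are made up.
vectors = {
    "king":       [1.0,  1.0, 0.0, 0.0],
    "queen":      [1.0, -1.0, 0.0, 0.0],
    "man":        [0.0,  1.0, 0.0, 0.0],
    "woman":      [0.0, -1.0, 0.0, 0.0],
    "dog":        [0.0,  0.2, 1.0, 0.0],
    "cat":        [0.0, -0.2, 1.0, 0.0],
    "automobile": [0.0,  0.0, 0.0, 1.0],
}


def nearest(target, exclude=()):
    """Return the word whose vector points most nearly in the direction of `target`."""
    candidates = (word for word in vectors if word not in exclude)
    return max(candidates, key=lambda word: cosine_similarity(vectors[word], target))


print(cosine_similarity(vectors["dog"], vectors["cat"]))         # high: related
print(cosine_similarity(vectors["dog"], vectors["automobile"]))  # low: unrelated

# The classic analogy: king - man + woman lands nearest to queen.
analogy = [k - m + w for k, m, w in
           zip(vectors["king"], vectors["man"], vectors["woman"])]
print(nearest(analogy, exclude={"king", "man", "woman"}))
```

In real embedding spaces the analogy arithmetic works only approximately, but the mechanism is the same: a relationship shows up as a consistent direction.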
This geometric encoding of meaning is part of what allows language models to handle analogy, context, and nuance. When the model processes the word bank, it needs to figure out whether you mean a financial institution or the side of a river. The token bank starts with the same embedding in both cases. The model then blends in information from the surrounding words, whose embeddings supply the geometric context that resolves the ambiguity: words associated with finance cluster in one region of the space, while words associated with rivers cluster in another.
Embeddings are not just used for individual tokens. Sentences, paragraphs, and entire documents can also be represented as embeddings by combining the embeddings of their component tokens in various ways. These document-level embeddings are widely used in search systems, recommendation engines, and retrieval tools that need to find semantically similar content quickly.
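One simple way to combine token embeddings into a document embedding is to average them, sometimes called mean pooling. The sketch below uses that approach for a tiny semantic search. The two-dimensional vectors (loosely finance and nature) are invented for illustration, and real systems typically use a dedicated embedding model rather than a plain average.

```python
import math

# Hypothetical token embeddings: [finance, nature]. Values are made up.
token_vectors = {
    "bank": [0.6, 0.4], "loan": [0.9, 0.1], "interest": [0.8, 0.0],
    "river": [0.1, 0.9], "water": [0.0, 0.8], "fish": [0.1, 0.7],
    "money": [1.0, 0.0], "credit": [0.9, 0.0],
}


def mean_pool(tokens):
    """Average the token vectors to get one vector for the whole text."""
    vecs = [token_vectors[t] for t in tokens]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_tokens, documents):
    """Return the name of the document whose embedding is closest to the query's."""
    query_vec = mean_pool(query_tokens)
    return max(documents,
               key=lambda name: cosine_similarity(query_vec, mean_pool(documents[name])))


documents = {
    "finance": ["bank", "loan", "interest"],
    "nature": ["river", "water", "fish"],
}

# The query shares no words with either document, yet one is clearly closer.
print(search(["money", "credit"], documents))
```

Notice that the query matches on meaning rather than on shared keywords, which is the property that makes embedding-based search useful.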
How Embeddings Are Learned
Embeddings are not manually designed. Nobody sat down and decided that the embedding for dog should be a specific list of numbers. Instead, embeddings are learned during the training process alongside the rest of the model’s parameters.
At the start of training, embeddings are initialized to random values. As the model trains on enormous amounts of text, the embeddings gradually adjust so that tokens which appear in similar contexts end up with similar values. The model learns that dog and cat often appear in similar sentences, surrounded by similar words, and their embeddings drift closer together in the mathematical space. The token bank, by contrast, appears in very different contexts, so its single embedding has to reflect a blend of its senses, and the later layers of the model learn to work out which sense applies from the surrounding words.
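The principle that tokens in similar contexts end up with similar vectors can be illustrated without any neural network at all. The sketch below is a deliberate stand-in: it counts neighbouring words instead of running gradient-based training, so it is not how a language model actually learns its embeddings. The sentences and the window size are made up for the example.

```python
import math
from collections import Counter, defaultdict

sentences = [
    "the dog chased the ball",
    "the cat chased the ball",
    "the dog ate the food",
    "the cat ate the food",
    "the car needs new tires",
    "the car needs more fuel",
]


def context_vectors(sentences, window=2):
    """Represent each word by counts of the words that appear within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, token in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[token][tokens[j]] += 1
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


vectors = context_vectors(sentences)
print("dog vs cat:", cosine_similarity(vectors["dog"], vectors["cat"]))
print("dog vs car:", cosine_similarity(vectors["dog"], vectors["car"]))
```

Dog and cat share the same neighbours in this tiny corpus, so their vectors line up more closely than dog and car. Training nudges learned embeddings in the same direction, only with far more data and far subtler signals.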
By the end of training, the embedding space has become a rich map of meaning extracted from human language at scale. It captures not just which words are similar but subtle relationships between concepts, associations across domains, and even cultural and contextual nuance embedded in the patterns of how language is actually used.
This learned embedding space is one of the reasons large language models can do things that feel surprisingly intelligent. They are not looking words up in a dictionary. They are navigating a geometry of meaning that emerged from reading more text than any human could process in many lifetimes.
Context Windows: The Model’s Working Memory
The third concept, the context window, is in some ways the most immediately practical of the three. It determines how much text a model can consider at any one time, and understanding it explains both the capabilities and the limitations of AI systems in ways that matter for everyday use.
A context window is the total amount of text, measured in tokens, that a model can hold in its attention at once. That budget typically covers both the input you provide and the response the model generates. Think of it as working memory. Just as a human can only actively think about so many things simultaneously, a language model can only process so much text in a single operation. Everything within the context window is available to the model as it generates a response. Everything outside it effectively does not exist from the model’s perspective.
In the early days of modern language models, context windows were quite small, often just a few thousand tokens. This created frustrating limitations. You could not paste in a long document and ask questions about it. You could not have a very long conversation without the model losing track of what was said at the beginning. You could not ask the model to analyze a lengthy codebase or summarize an entire book.
Over time, context windows have expanded dramatically. Models now exist with context windows of hundreds of thousands of tokens, and the frontier continues to push outward. This expansion has opened up use cases that simply were not possible before, including analyzing entire legal contracts, maintaining coherent context across very long conversations, processing multiple long documents simultaneously, and working with large codebases in a single session.
What Happens at the Edge of the Context Window
Understanding what happens when you approach or exceed a model’s context window helps explain some behaviors that might otherwise seem mysterious.
When a conversation or document exceeds the context window, something has to give. Different systems handle this differently. Some simply truncate the oldest content, dropping the earliest parts of the conversation to make room for new input. Others use techniques to compress or summarize older content before it falls out of the window. Some use retrieval systems that pull in relevant pieces of older content when they seem relevant to the current query.
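The simplest of these strategies, dropping the oldest content, can be sketched as follows. This is a minimal illustration under stated assumptions: token counts are estimated with the four-characters-per-token rule of thumb rather than a real tokenizer, and the function names are made up. Real systems often also reserve room for a system prompt and for the model's reply.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about one token per four characters."""
    return math.ceil(len(text) / 4)


def truncate_history(messages, max_tokens):
    """Keep the most recent messages whose combined estimated size fits the budget."""
    kept = []
    used = 0
    # Walk backwards from the newest message so recent context is kept first.
    for message in reversed(messages):
        cost = estimate_tokens(message)
        if used + cost > max_tokens:
            break  # this message and everything older is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "My name is Priya and I am planning a trip to Lisbon in May.",
    "Here are some neighbourhoods you might like to stay in.",
    "Which of those is closest to the train station?",
]
print(truncate_history(history, max_tokens=30))
```

With a budget of 30 estimated tokens, the first message falls out of the window, which is exactly how a detail mentioned early in a conversation can silently disappear.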
None of these solutions is perfect. If a system has truncated the beginning of a long conversation, the model genuinely cannot access what was said there. It is not that it is choosing not to remember. The information is simply no longer in its working memory. This explains why very long conversations with AI assistants can sometimes feel like the system has forgotten important context established early on. It has, in a very literal computational sense.
Models can also use information less reliably as the context window fills up. Research has shown that models often make better use of content near the very beginning and very end of the context window, sometimes losing track of information buried in the middle of very long inputs. This phenomenon is commonly called the lost in the middle problem, and it is an active area of research and improvement.
How These Three Concepts Connect
Tokens, embeddings, and context windows are not independent ideas. They form a connected chain that describes how language moves through a model from raw text to meaningful output.
Text arrives as a string of characters. The tokenizer breaks it into tokens. Each token is converted into an embedding, a rich numerical representation of its meaning. These embeddings, along with positional information about where each token sits in the sequence, are fed into the transformer architecture. The attention mechanism allows each token to weigh its relationship to the other tokens in the context window; in the decoder-style models behind most chat assistants, that means every token that came before it. Layer by layer, the model builds up increasingly abstract representations of what the text means and what should come next. Finally, the output layer converts the model’s internal state into a probability distribution over every token in the vocabulary. A next token is chosen from that distribution, either by taking the most likely candidate or, more commonly, by sampling, and it is converted back to text and added to the response. The whole cycle then repeats, one token at a time, until the response is complete.
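The chain described above can be traced end to end in a toy sketch. Nothing here is trained: the five-word vocabulary and the embedding values are made up, there is a single attention step with no learned projections, and the output layer simply reuses the embedding table. The predicted token is therefore meaningless. The sketch only shows how data flows from text to a probability distribution over tokens.

```python
import math

vocab = ["the", "cat", "sat", "on", "mat"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

# Hypothetical embedding table, one 4-number vector per token (values invented).
embeddings = [
    [0.1, 0.3, -0.2, 0.5],
    [0.7, -0.1, 0.4, 0.0],
    [-0.3, 0.6, 0.2, 0.1],
    [0.0, -0.4, 0.5, 0.3],
    [0.6, 0.0, 0.3, -0.2],
]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Turn arbitrary scores into positive numbers that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def positional(pos, dim):
    """Sinusoidal position signal, one common way to encode token order."""
    return [
        math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / dim))
        for i in range(dim)
    ]


def causal_attention(vectors):
    """Each position becomes a weighted mix of itself and all earlier positions."""
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for i, query in enumerate(vectors):
        visible = vectors[: i + 1]  # a token cannot look at later tokens
        weights = softmax([dot(query, key) / scale for key in visible])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, visible))
            for d in range(len(query))
        ])
    return outputs


def next_token_distribution(text):
    ids = [token_to_id[tok] for tok in text.split()]            # 1. tokenize
    dim = len(embeddings[0])
    x = [[e + p for e, p in zip(embeddings[i], positional(pos, dim))]
         for pos, i in enumerate(ids)]                          # 2. embed + position
    hidden = causal_attention(x)                                # 3. attention
    logits = [dot(hidden[-1], emb) for emb in embeddings]       # 4. score each token
    return softmax(logits)                                      # 5. probabilities


probs = next_token_distribution("the cat sat on")
for token, p in zip(vocab, probs):
    print(f"{token}: {p:.3f}")
```

A real model stacks many such layers with learned weights, then either picks the highest-probability token or samples from the distribution, appends the result, and runs the whole thing again for the next token.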
This entire process, from your typed input to the model’s generated output, is grounded in these three ideas. Tokens are the units. Embeddings are the meaning. The context window is the stage on which it all plays out.
Why This Knowledge Is Practically Useful
Understanding these concepts is not just intellectually satisfying. It has practical implications for anyone who uses AI tools regularly.
Knowing about tokens helps you understand why longer inputs cost more when using paid AI services, why some languages are more expensive to process than others, and why being concise in your prompts can sometimes improve both efficiency and response quality.
Knowing about embeddings helps you understand how AI-powered search works, why semantic search can find relevant results even when your search terms do not exactly match the source text, and how recommendation systems figure out that you might like something you have never explicitly searched for.
Knowing about context windows helps you understand why models sometimes seem to forget earlier parts of a conversation, why there are limits on how long a document you can ask a model to analyze, and how to structure your interactions with AI tools to get the best results within those limits.
These are not exotic technical details relevant only to researchers and developers. They are the underlying mechanics of tools that millions of people use every day, and understanding them gives you a clearer, more accurate picture of what these tools can and cannot do. That clarity is worth having, both for getting more out of AI tools and for thinking critically about their role in an increasingly AI-shaped world.