So Daniel sent us this one today, and it is a deep dive into the literal foundations of how AI understands us. He writes: Positional encoding in transformers. How does a transformer actually know word order when the attention mechanism is permutation-invariant? He wants us to walk through the main approaches, starting with the sinusoidal encoding from the original Attention Is All You Need paper, then moving through learned embeddings, ALiBi, and finally RoPE, which is the rotary position embedding that dominates the landscape today. He also wants us to tie that into how RoPE's specific design enables those massive context window extensions, like YaRN, that we see in the latest models.
This is a fantastic prompt. It hits on the fundamental "hack" that makes modern AI work. Because, if you really look at the math of a transformer, it’s essentially colorblind to time. It sees a sentence not as a sequence, but as a cloud of points.
Right, a bag of words. Which is a problem if you're trying to distinguish between "The dog bit the man" and "The man bit the dog." To a raw transformer, those are identical inputs. It's like a chef who has all the ingredients for a cake but no recipe telling him what order to mix them in. He just throws them all in the oven at once and hopes for the best.
Well, not exactly—sorry, I promised I'd stop saying that. It's more that the transformer processes everything in parallel. That’s its great strength, right? That’s why we can train these things on massive clusters of GPUs. But the price of that parallelism is that the model, by default, has no idea where any word sits in relation to any other word.
And that’s where positional encoding comes in. It’s the "you are here" map for every token in the sequence. By the way, before we get too deep into the weeds of sine waves and rotations, I should mention that today’s episode of My Weird Prompts is powered by Google Gemini three Flash. It’s the brain behind the script today, so if we say anything particularly brilliant, you know who to thank.
And if I start acting like a donkey who’s spent too much time in a server room, you know who to blame. I’m Herman Poppleberry, and I have been waiting for an excuse to talk about the trigonometry of language models for a long time.
I knew you’d say that. You’ve probably got a unit circle tattooed on your arm somewhere. But let's start at the beginning. Two thousand seventeen. The Big Bang of modern AI. The paper is "Attention Is All You Need." Vaswani and the crew at Google. They had this problem: they wanted to ditch the old recurrent neural networks, the RNNs, because those were slow. They processed words one by one, like a person reading a book.
Right, and RNNs had a "memory" problem. By the time they got to the end of a long sentence, they’d forgotten how it started. The transformer fixed that by looking at everything at once using self-attention. But as you said, self-attention is permutation-invariant. If you swap the positions of two words in the input, the attention scores between them don't change. The dot product of two vectors is the same regardless of where they are in the list.
So they needed a way to bake the position into the word itself. Their solution was beautiful in a very nerdy way: sinusoidal positional encoding. They didn’t just add a number like "one, two, three" to the words. Why not just do that? Why not just add the index?
Because if you just add a raw integer, the numbers get huge very quickly. If you have a sequence of ten thousand tokens, the ten-thousandth token has this massive value added to it that completely overwhelms the actual meaning of the word. Plus, the model needs to understand relative distance. It needs to know that the word five slots away is just as "far" whether you’re at position ten or position five hundred.
So they used waves. Sines and cosines.
Yes. For every dimension in the word’s vector—and remember, these vectors are hundreds or thousands of dimensions wide—they added a specific value from a sine or cosine wave. Each dimension got a different frequency. The first few dimensions might have very fast-moving waves, while the later dimensions have very slow, long-wavelength waves.
It’s like a clock, isn’t it? The fast dimensions are the second hand, moving quickly. The middle dimensions are the minute hand. The deep dimensions are the hour hand. By looking at all the hands at once, the model can tell exactly what "time" it is—or in this case, exactly what position the token is in.
That’s a great way to visualize it. And because it’s based on these trigonometric functions, there’s a mathematical property where the encoding for any position "p plus k" can be expressed as a linear transformation of the encoding for position "p." This means the model can easily learn to attend to things at a fixed relative distance. It doesn't have to learn a new rule for every single absolute position; it just learns how to "shift" its focus.
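For anyone following along with a keyboard instead of headphones, here is a minimal NumPy sketch of that sinusoidal scheme, roughly as the original paper defines it; the function name and variable names are ours, not the paper’s.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal encoding in the style of "Attention Is All You Need".

    Each pair of dimensions gets its own frequency, falling off
    geometrically from fast (the "second hand") to slow (the "hour hand").
    """
    positions = np.arange(seq_len)[:, None]            # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]           # (1, d_model / 2)
    angles = positions / (10000 ** (dims / d_model))   # (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)    # even dimensions take the sine
    pe[:, 1::2] = np.cos(angles)    # odd dimensions take the paired cosine
    return pe

# The encoding is simply added to the token embeddings:
# x = token_embeddings + sinusoidal_positions(seq_len, d_model)
```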
And this was additive, right? They literally just took the word embedding—the vector that means "dog"—and added this wave vector on top of it. Doesn't that... I don't know, mess up the meaning? Does the "dog" become a "salty dog" because of the sine wave?
It’s a concern! It’s called "feature pollution." You’re mixing the semantic signal—what the word is—with the structural signal—where the word is. But because the word embeddings are high-dimensional, there’s plenty of "room" for both signals to coexist. The model learns to tease them apart. However, as we’ll see later, this additive nature is one of the reasons we moved toward more sophisticated methods.
But wait, if it’s just a fixed formula, doesn't that mean it can go on forever? If I have a formula for a sine wave, I can calculate it for position one million just as easily as position one. So why were those early models like BERT stuck at five hundred twelve tokens?
Well, BERT actually used a different approach. Even though the original Transformer paper used sines and cosines, the researchers who built BERT and the early GPT models went with "Learned Positional Embeddings."
Ah, the "brute force" method.
Instead of using a clever math formula, they just said, "Let’s let the model learn a unique vector for every position." So the model has a specific vector it learns for 'Position One,' another for 'Position Two,' and so on, all the way up to five hundred twelve.
It’s like a vocabulary, but for spots in the sentence.
Precisely. It’s very flexible. The model can learn exactly how it wants to represent each position. But the downside is the "hard ceiling." If you train a model with learned embeddings for five hundred twelve positions, and then you try to give it a sentence with five hundred thirteen words... it has no idea what to do. It’s never seen that five hundred thirteenth vector. It hasn't "learned" it. You can't just extrapolate.
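A minimal PyTorch sketch of that learned-table approach, BERT-style; the class name and the defaults here are illustrative, not anyone’s actual implementation.

```python
import torch
import torch.nn as nn

class LearnedPositions(nn.Module):
    """One trainable vector per position slot, added to the token embeddings."""

    def __init__(self, max_positions=512, d_model=768):
        super().__init__()
        self.pos_table = nn.Embedding(max_positions, d_model)

    def forward(self, token_embeddings):
        seq_len = token_embeddings.size(1)   # (batch, seq_len, d_model)
        positions = torch.arange(seq_len, device=token_embeddings.device)
        # This is the hard ceiling: position 512 has no row in the table,
        # so a 513-token input simply indexes out of range.
        return token_embeddings + self.pos_table(positions)
```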
It’s like a parking lot with only fifty spaces. You can be the best driver in the world, but if space fifty-one doesn't exist, you're not parking there. And that brings us to the big shift. We wanted models that could handle more than a page of text. We wanted books. We wanted codebases. And that led us to ALiBi.
ALiBi was such a clever pivot. It stands for Attention with Linear Biases, introduced by Ofir Press and his team. They realized that maybe we shouldn't be messing with the word vectors at all. Maybe we should just change how the attention mechanism itself calculates scores.
Instead of giving the words a "map," we just tell the model "hey, things that are closer to you are probably more important."
That’s the intuition. In ALiBi, when the model calculates the attention score between word 'i' and word 'j,' it takes the standard dot product but then subtracts a penalty based on the distance between them. So if two words are right next to each other, the penalty is small. If they are a thousand tokens apart, the penalty is huge.
It’s a recency bias baked into the math. It’s like being at a party. You can hear the person standing next to you perfectly, but the guy across the room is just a murmur. You naturally "attend" to the people closest to you.
And the genius of ALiBi is that it generalizes to any length. If you train on a thousand tokens, the model learns how to handle that linear penalty. If you then show it two thousand tokens, the penalty just keeps growing linearly. The model doesn't get confused by "new" positions because it doesn't care about absolute position; it only cares about the distance.
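Here is a rough sketch of that linear penalty, assuming the per-head geometric slope schedule the ALiBi paper uses when the head count is a power of two; treat the details as illustrative.

```python
import torch

def alibi_bias(seq_len, num_heads):
    """Distance penalty added to the attention logits before the softmax."""
    # Geometric slope schedule: 2^(-8/n), 2^(-16/n), ... for n heads.
    slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

    positions = torch.arange(seq_len)
    # distance[i, j] = how many tokens key j sits behind query i.
    distance = (positions[:, None] - positions[None, :]).clamp(min=0)

    # Shape (num_heads, seq_len, seq_len); a causal mask is still applied separately.
    return -slopes[:, None, None] * distance.float()

# scores = q @ k.transpose(-2, -1) / math.sqrt(d_head) + alibi_bias(seq_len, num_heads)
```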
I remember the MPT models from MosaicML used this. They were some of the first to really show off massive context lengths—like sixty-five thousand tokens—because ALiBi just let them keep scaling without the model's brain melting. But... we don't see ALiBi as much anymore, do we? Llama, Mistral, GPT-4... they all went a different way.
They went with RoPE. Rotary Positional Embedding. And Corn, this is where it gets really elegant. If sinusoidal encoding was a clock, and ALiBi was a volume knob, RoPE is a kaleidoscope.
Okay, explain that one to me before I get dizzy.
So, RoPE was introduced by Jianlin Su and others in twenty-one. Instead of adding a vector to the word embedding, they decided to rotate the vector. Think of the word embedding as a point in space. To encode its position, you rotate that point around the origin by an angle that depends on where it is in the sentence.
So word one gets rotated ten degrees, word two gets rotated twenty degrees, and so on?
Essentially, yes. But it happens in pairs of dimensions. You take two dimensions of the vector, treat them as a coordinate on a plane, and spin them. This is done for the Query and the Key vectors in the attention mechanism.
Now, why on earth is spinning a vector better than adding a wave to it?
Because of a beautiful property of rotations. When you take the dot product of two rotated vectors—the Query and the Key—the absolute rotation "washes out" and you are left with a result that only depends on the difference between the angles.
Wait, so the math naturally distills it down to the relative distance?
Yes! If word 'A' is at position ten and word 'B' is at position fifteen, their individual rotations are ten times theta and fifteen times theta. When they meet in the attention mechanism, the dot product looks at the difference: five times theta. It doesn't matter if they are at positions ten and fifteen or positions a thousand and a thousand-and-five. The relative signal is identical.
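A minimal sketch of that rotation, using the interleaved-pair convention; real implementations differ in how they pair dimensions and cache the angles, so read this as the idea rather than any particular library’s code.

```python
import torch

def apply_rope(x, positions, base=10000.0):
    """Rotate each 2-D pair of dimensions by an angle proportional to position.

    x: (..., seq_len, d) with d even; positions: (seq_len,) indices.
    Pair i spins at frequency base^(-2i/d): fast pairs first, slow pairs last.
    """
    d = x.shape[-1]
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = positions[:, None].float() * inv_freq[None, :]                # (seq_len, d/2)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[..., 0::2], x[..., 1::2]          # split into the 2-D pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin         # standard plane rotation
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Both queries and keys get rotated by their own positions; the dot product
# q_m · k_n then depends only on the offset m - n, not on m or n themselves.
```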
That’s the "best of both worlds" Daniel mentioned. You get the absolute position encoded in the rotation, but the attention mechanism treats it as a relative distance. It’s like everyone in a theater is holding a flashlight, and as you go further back in the rows, everyone tilts their flashlight a bit more. When you look at two people, you can tell how many rows apart they are just by the difference in the angles of their beams.
And it doesn't "pollute" the features as much. When you add a vector, you're changing the magnitude and direction of the word's meaning. When you rotate a vector, you're just changing its orientation. The "norm"—the length of the vector—stays exactly the same. The semantic content is preserved in the magnitude, while the positional content is stored in the phase.
It’s cleaner. It’s more mathematically "pure." And it’s clearly worked. I mean, Llama two, Llama three, Mistral... they all use it. But the real magic, and what Daniel asked about, is what happens when we want to go beyond the training length. We’re in twenty-six now, and we’re seeing models with context windows of a million, two million tokens. How does RoPE enable that?
This is where we get into the "hacks" like YaRN and Position Interpolation. See, because RoPE is based on frequencies—those rotation angles—we can play with the "speed" of the rotation. Imagine you trained a model on a sequence of four thousand tokens. The model "knows" what a full rotation looks like across those four thousand spots.
It’s seen the whole circle.
Now, you want it to read eight thousand tokens. If you just let it keep rotating, it enters "uncharted territory." It’s seeing angles it never saw during training. The model gets confused. It’s like a compass that suddenly points south-south-west when it was only ever taught North, East, South, and West.
So what do we do? Do we just... slow down the rotation?
That is exactly what "Position Interpolation" does. If you want to double the context window, you just divide the position indices by two. So word number eight thousand now looks to the model like word four thousand. You’re "squeezing" the eight thousand words into the space of the original four thousand.
But doesn't that make everything... blurry? If I take a high-resolution photo and squeeze it to half its size, I lose the fine details.
That’s the problem! If you scale all the frequencies equally, the model loses its ability to distinguish between words that are very close to each other. It loses the "high-frequency" information. It can see the big picture, but it can't tell the difference between slot one and slot one-point-five.
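In code, position interpolation is almost embarrassingly small. Assuming the apply_rope sketch from earlier, it is just a rescaling of the position indices before they are turned into angles; the default of four thousand ninety-six is illustrative.

```python
import torch

def interpolated_positions(seq_len, trained_len=4096):
    """Squeeze a long sequence back into the position range seen in training."""
    scale = min(1.0, trained_len / seq_len)       # e.g. 0.5 for an 8k input on a 4k model
    return torch.arange(seq_len).float() * scale  # token 8000 now "looks like" token 4000

# q = apply_rope(q, interpolated_positions(seq_len))
# k = apply_rope(k, interpolated_positions(seq_len))
```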
So we need a smarter way to squeeze. Which I’m guessing is where things like NTK-aware scaling and YaRN come in.
You’re on it. NTK-aware scaling—the NTK stands for Neural Tangent Kernel—was this breakthrough where researchers realized we shouldn't scale all frequencies the same way. Remember how I said RoPE uses multiple frequencies? Some move fast, some move slow.
The second hand and the hour hand.
Right. NTK-aware scaling says: "Keep the fast waves—the high frequencies—exactly as they are." This preserves the model’s ability to see local, word-to-word relationships. But for the slow waves—the ones that represent the "big" structure of the sentence—we scale those down to fit the longer context.
So the model still knows exactly who its neighbors are, but its sense of "how long is this whole book" gets stretched.
Yes. And YaRN, which stands for "Yet another RoPE extensioN"—because AI researchers are great at naming things—takes this even further. It applies different scaling factors to different parts of the vector. It realizes that some dimensions are better at "extrapolating" to new distances, while others are better at "interpolating" within known distances.
It’s a surgical approach to stretching the model’s brain. And the results are wild. We’ve seen models trained on just a few thousand tokens get "stretched" to handle over a hundred thousand tokens with almost no loss in performance. It’s like taking a person who’s only ever lived in a small apartment and "stretching" their spatial awareness so they can navigate a skyscraper on their first day.
And it’s why we’re seeing this explosion in "long-context" applications. Analyzing entire legal contracts, summarizing whole series of books, or even feeding an entire codebase into a model so it can find a bug in a file you forgot existed. Without RoPE’s frequency-based design, these "stretching" techniques wouldn't be possible. You can't "stretch" a learned embedding. There’s nothing to interpolate.
It’s funny, isn't it? The original transformer paper was called "Attention Is All You Need," but it turns out, attention is actually quite blind without a very specific, very clever set of trigonometric glasses.
It really is. And it’s a reminder that even in this era of "just add more data and more GPUs," the underlying architecture—the actual math of how we represent information—still matters immensely. A slight change in how you rotate a vector can be the difference between a model that can read a paragraph and a model that can read a library.
So, what’s the limit? I mean, we’re talking about a million tokens. Can we go to a billion? Is there a point where the rotation becomes so slow that the model just loses its mind?
There is a theoretical limit. Eventually, the "resolution" of the numbers—the floating-point precision of the computer—becomes an issue. If you're rotating a vector by zero-point-zero-zero-zero-zero-one degrees, the computer might just round that to zero. And suddenly, your positional encoding disappears.
The "rounding error" apocalypse.
Also, there’s the quadratic cost of attention itself. Even if the positional encoding works for a billion tokens, the attention mechanism still has to compare every token to every other token. That’s an 'N-squared' problem. If you double the length, you quadruple the work. So even if RoPE can handle it, the GPUs might catch fire.
Well, that’s where things like Ring Attention and Flash Attention come in, but maybe that’s a prompt for another day. I want to circle back to something you mentioned earlier—feature pollution. In the early days, with sinusoidal addition, we were worried about mixing "where" and "what." With RoPE, you said it’s cleaner. But is it perfectly clean? Does the rotation still influence how the model understands the word?
It’s an active area of research. Some people argue that RoPE still introduces certain biases. For example, because it’s based on pairs of dimensions, it might favor certain types of relationships over others. There’s also the "lost in the middle" phenomenon.
Ah, right. We’ve talked about this before. Models tend to be really good at remembering the beginning of a document and the end, but they get "fuzzy" in the middle.
And some researchers think that’s partly a byproduct of how our positional encodings work. If the "signal" for the middle of a document looks too similar to the "signal" for other parts, the model loses its grip. It’s like driving across a very long, flat desert. After a while, mile fifty looks exactly like mile five hundred. You lose your sense of progress.
So maybe the next big breakthrough isn't just about "stretching" RoPE, but finding a new way to encode position that stays "sharp" even in the deep middle of a million-token sequence.
I wouldn't be surprised. People are already looking at "Positional-Free" transformers or models that use different types of state-space representations instead of standard attention. But for now, RoPE is the king. It’s what powers the AI revolution we’re living through.
It’s a good king. A bit obsessed with circles, maybe, but it gets the job done. I think the big takeaway for people who aren't math nerds like you, Herman, is that when you see a model boasting about a huge context window, you’re not just seeing the result of a bigger computer. You’re seeing the result of some very elegant, very clever mathematical "hacks" that allow these models to transcend their original training.
Well put. It’s the difference between a bigger bucket and a better way to organize the water. And for developers and practitioners, understanding which encoding a model uses is actually really important. If you’re trying to fine-tune a model for long-context tasks, you need to know if it’s using RoPE, because that determines which scaling techniques—like YaRN—will actually work.
If you try to "YaRN" a BERT model, you’re going to have a bad time.
You’re going to have a very confused model, that’s for sure. It’s like trying to use a map of London to navigate Jerusalem. The coordinates just don't match up.
Well, I think we’ve successfully rotated this topic around until we’ve seen all the angles. Before we wrap up, I want to give a quick shout-out to our producer, Hilbert Flumingtop, who keeps this whole operation spinning. And big thanks to Modal for providing the GPU credits that power the generation of this show. Without those H-hundreds, we’d still be waiting for the sinusoidal waves to finish calculating.
And if you enjoyed this deep dive into the guts of transformers, we have a whole archive of this stuff. You might want to check out our previous discussion on the "Great Kitchen War" of context windows—that was episode eleven-oh-three. It provides some good context—pun intended—for why these numbers get marketed so heavily today.
Nice one. If you're listening on Spotify or Apple Podcasts, hit that follow button. It helps us out more than you know. And as always, you can find everything at myweirdprompts.com. We’re also on Telegram if you want to get a ping every time a new episode drops—just search for My Weird Prompts.
This has been My Weird Prompts. Thanks for listening, and we’ll catch you in the next sequence.
See ya.