We examine Top-K attention, the theoretical upper bound of prior search-based algorithms (both static and dynamic), in terms of estimation error and downstream performance on synthetic and reasoning tasks. We show that Top-K's performance is unsatisfactory, which might be caused by the geometry of attention. This forces us to consider an alternative way to approximate attention: sampling.
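For reference, here is a minimal sketch of what we mean by (exact) Top-K attention: for each query, keep only the K keys with the largest attention scores and renormalize over them. The function names and shapes are ours, for illustration only.

```python
import torch
import torch.nn.functional as F

def topk_attention(q, K, V, k):
    """Exact Top-K attention for a single query.

    q: (d,) query; K: (n, d) keys; V: (n, d) values.
    Keeps only the k largest-scoring keys and renormalizes over them;
    this is the oracle that search-based methods try to approach.
    """
    scores = K @ q / K.shape[-1] ** 0.5          # (n,) scaled dot products
    top = scores.topk(k)                         # exact top-k (oracle search)
    weights = F.softmax(top.values, dim=-1)      # renormalize over the k survivors
    return weights @ V[top.indices]              # (d,) approximate output

def full_attention(q, K, V):
    """Reference: standard softmax attention over all n keys."""
    weights = F.softmax(K @ q / K.shape[-1] ** 0.5, dim=-1)
    return weights @ V
```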
(Estimation Error) Top-K attention tends to be biased and imprecise, particularly when attention score distributions are long-tailed and the computational budget or density (i.e., K) is constrained. This issue is prevalent across all layers of LLMs, where long-tailed distributions are common. For instance, the top 20% of tokens often account for only 70–80% of the total attention scores, leaving a significant portion of keys and values unaccounted for. This omission translates to a notable estimation error of approximately 15–20%.
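The gap can be reproduced in spirit with a synthetic experiment: draw random keys and values, sharpen the scores into a long tail, and compare the Top-K output against the full softmax output. The sizes and temperature below are illustrative choices, not our measurement setting, and the resulting numbers will vary with the score distribution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 4096, 64
k = n // 5                                       # keep the top 20% of keys
tau = 0.65                                       # sharpening picked to mimic a long tail

q, K, V = torch.randn(d), torch.randn(n, d), torch.randn(n, d)
scores = (K @ q / d ** 0.5) / tau
probs = F.softmax(scores, dim=-1)

top = probs.topk(k)
coverage = top.values.sum()                      # attention mass captured by the top 20%

full_out = probs @ V
topk_out = F.softmax(scores[top.indices], dim=-1) @ V[top.indices]
rel_err = (topk_out - full_out).norm() / full_out.norm()
print(f"top-20% mass: {coverage:.1%}, relative output error: {rel_err:.1%}")
```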
(Synthetic Tasks) While Top-K attention maintains accuracy on retrieval tasks that require identifying only a small subset of the context (e.g., needle-in-a-haystack scenarios with single or multiple keys), it degrades significantly on aggregation tasks that depend on leveraging the entire context (e.g., common-word or frequent-word extraction). In aggregation tasks, information is spread broadly across the context, so the attention score distribution is far less concentrated.
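A toy version of this contrast: under a fixed 10% budget, a peaked (retrieval-like) score distribution keeps almost all of its mass in the top K, while a flat (aggregation-like) one does not. The two temperatures are hypothetical stand-ins for the two task types.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, k = 4096, 4096 // 10                          # fixed 10% budget

logits = torch.randn(n)
for name, tau in [("retrieval-like (peaked)", 0.2), ("aggregation-like (flat)", 2.0)]:
    probs = F.softmax(logits / tau, dim=-1)
    coverage = probs.topk(k).values.sum()
    print(f"{name}: top-10% of keys hold {coverage:.1%} of the attention mass")
```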
(Reasoning Tasks) Tasks that require LLMs to process complex relationships over extended text ranges are significantly more challenging for Top-K attention approximations than information retrieval. This difficulty underscores the purpose of building costly long-context LLMs rather than relying solely on RAG techniques. In our recent benchmark, FACTOR, we observed substantial performance degradation even when using exact Top-K attention scores with only 10% of the KV cache, highlighting the limitations of this approach. FACTOR is a synthetic benchmark with controlled mathematical complexity and will be released in the near future.
Several observations explain why Top-K attention cannot consistently work well. Beyond accounting for Top-K's failures, they are surprising and fundamental to understanding the attention mechanism itself. (1) The hidden states (including but not limited to the key and value states) of the first input tokens (attention sinks) are almost identical regardless of the input tokens (Left). (2) The orientation of the center of the key states remains stable across different input sentences (Mid). (3) The center of the key states and the sink token's key states point in almost opposite directions (Right).
Geometric information of attention. Left: With arbitrary inputs, the orientation of the sink token's key states remains almost the same, with a minimum similarity >0.99 across sampled inputs. Mid: The orientation of the center of the key states in a sequence is stable across various input sentences, with similarity >0.9. Right: The sink token's key states and the center of the key states in the sequence are almost opposite, with similarity between -0.9 and -0.8.
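These observations can be checked with a few lines against any open model. Below is a sketch of the measurement, assuming a HuggingFace causal LM whose past_key_values expose per-layer key states; the model name and prompts are placeholders, and we treat the first token as the sink (special tokens are disabled so the first tokens genuinely differ).

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"                # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

prompts = ["The quick brown fox", "Attention is all you need", "1 2 3 4 5"]
layer, head = 8, 0                               # inspect one layer/head for brevity
sink_keys, centers = [], []

with torch.no_grad():
    for p in prompts:
        ids = tok(p, return_tensors="pt", add_special_tokens=False).input_ids
        out = model(ids, use_cache=True)
        k = out.past_key_values[layer][0][0, head]   # key states, (seq_len, head_dim)
        sink_keys.append(k[0])                       # key state of the first token
        centers.append(k.mean(dim=0))                # center of the key states

# (1) sink keys barely move across inputs; (3) sink vs. center is near-opposite
print(F.cosine_similarity(sink_keys[0], sink_keys[1], dim=0))
print(F.cosine_similarity(sink_keys[0], centers[0], dim=0))
```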
These observations define the geometric pattern illustrated on the left. The attention sink, which remains static irrespective of the input, creates a highly sparse attention distribution, while the remaining regions exhibit a more uniform distribution.
Applying Top-K attention exacerbates this issue by disproportionately weighting the sink token, which discards critical contextual information. Moreover, misalignment between queries and keys makes accurate retrieval even harder.
Contrary to previous assumptions, our findings reveal that attention is not always sparse, particularly in tasks that utilize the entire context. As depicted on the right, attention distributions can exhibit long-tailed patterns in certain layers. Additionally, what appears to be high sparsity is often the result of the attention sink, an element that, surprisingly, remains nearly static regardless of the input.
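Sparsity itself is just as easy to probe: request the attention probabilities and measure, per layer, how much mass the top 20% of keys receive from the last query. Another hedged sketch with a placeholder model (eager attention is required for output_attentions):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Llama-2-7b-hf"                # placeholder model
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, attn_implementation="eager")

ids = tok("some long input text ...", return_tensors="pt").input_ids
with torch.no_grad():
    out = model(ids, output_attentions=True)

for i, attn in enumerate(out.attentions):        # one (b, h, q, k) tensor per layer
    last_q = attn[0, :, -1, :]                   # last query's distribution, all heads
    k = max(1, last_q.shape[-1] // 5)            # top 20% of keys
    coverage = last_q.topk(k, dim=-1).values.sum(-1).mean()
    print(f"layer {i}: top-20% keys hold {coverage:.1%} of attention on average")
```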
A full explanation of these findings is beyond the scope of this project, but they are important for understanding how transformers work. Our conclusion here is that Top-K-based methods have fundamental problems with accurate approximation. The next section introduces how sampling can address these problems.