System Co-design
LLM decoding is memory-bandwidth-bound: most of the time is spent loading data (parameters and the KV cache) into GPU cores rather than on computation. Moreover, the most time-consuming part, the long-context attention computation, is offloaded to the CPU, so the 1.8% to 8.5% extra computation on the GPU makes only a minor difference in execution time. However, the enlarged hash tables prevent us from always increasing (K, L) to obtain more accurate results. We leave reducing this memory overhead with better hashing algorithms as future work.
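To make the memory-overhead trade-off concrete, the sketch below estimates hash-table size for an LSH scheme with L tables, each addressed by a K-bit code, so that table memory grows linearly in L and exponentially (in bucket count) with K. The entry sizes, bookkeeping layout, and example context length are illustrative assumptions, not measurements from this work:

```python
# Back-of-envelope estimate of LSH hash-table memory for a (K, L)
# configuration: L tables, each with 2**K buckets, storing one entry
# per cached key. Entry/bookkeeping sizes are assumptions for
# illustration only.

def hash_table_bytes(K: int, L: int, n_keys: int, entry_bytes: int = 4) -> int:
    """Estimated bytes for L tables over n_keys cached keys.

    Each key contributes one stored entry (e.g., a 4-byte token index)
    per table, and each of the 2**K buckets per table carries a small
    offset/counter for bookkeeping.
    """
    entries = L * n_keys * entry_bytes       # stored token indices
    buckets = L * (2 ** K) * entry_bytes     # per-bucket offsets/counters
    return entries + buckets

# Example: a 96K-token context under two hypothetical configurations.
small = hash_table_bytes(K=9, L=75, n_keys=96_000)
large = hash_table_bytes(K=10, L=150, n_keys=96_000)
print(f"(K=9,  L=75):  {small / 2**20:.1f} MiB")
print(f"(K=10, L=150): {large / 2**20:.1f} MiB")
```

Because the per-key entries dominate, doubling L roughly doubles table memory per layer, which is why (K, L) cannot be raised indefinitely for accuracy.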