Fusion of RAG (Retrieval Augmented Generation) and CAG (Cache Augmented Generation). How can you benefit from it as an AI Engineer?

A few months ago there was a lot of hype around a technique called CAG. While it is powerful in its own right, the real magic happens when you combine CAG with regular RAG. Let's see what that would look like and what additional considerations should be taken into account.

Here are example steps to implement a CAG + RAG architecture:

Data Preprocessing:

1. Use only rarely changing data sources for Cache Augmented Generation. Beyond the requirement that the data changes rarely, also consider which sources are hit most often by relevant queries. Only once you have this information, pre-compute all of the selected data into the LLM's KV cache and keep it in memory. This needs to be done only once; the following steps can run many times without recomputing the initial cache (see the first sketch below).

2. For RAG, if necessary, precompute and store vector embeddings in a compatible database to be searched later in step 4. Sometimes simpler data types are enough for RAG, and a regular database might suffice (second sketch below).

Query Path: we can now utilise the preprocessed data.

3. Compose a prompt that includes the user query and a system prompt instructing the LLM on how the cached context and the retrieved external context should be used.

4. Embed the user query for semantic search and query the vector DB (the context store) to retrieve relevant data. If semantic search is not required, query other sources, such as real-time databases or the web.

5. Enrich the final prompt with the external context retrieved in step 4.

6. Return the final answer to the user (the third sketch below ties steps 3-6 together).

Some Considerations:

➡️ The context window is not infinite, and even though some models boast enormous context window sizes, the needle-in-a-haystack problem has not been solved yet, so use the available context wisely and cache only the data you really need.

✅ For some business cases, specific datasets are extremely valuable when passed to the model as cache. Think of an assistant that must always comply with a lengthy set of internal rules spread across multiple documents.

✅ While CAG has been popularised for open-source models only recently, it has been available for some time via the Prompt Caching features of the OpenAI and Anthropic APIs. It is really easy to start prototyping there (fourth sketch below).

✅ Always separate hot and cold data sources, and use only cold data (data that changes rarely) in your cache; otherwise the data will go stale and the application will fall out of sync.

✅ Be very careful about what you cache, as the cached data will be available for all users to query.

✅ It is very hard to enforce role-based access control (RBAC) for cached data unless you run a separate model instance with its own cache per role.

Have you used the fusion of CAG and RAG already? Let me know about your results in the comments 👇

#LLM #AI #MachineLearning
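To make step 1 concrete, here is a minimal CAG sketch using Hugging Face transformers and its DynamicCache, assuming a local model. The model name, the internal_rules.txt source, and the answer() helper are illustrative placeholders, and chat templating is omitted for brevity:

```python
# Minimal CAG sketch (assumed stack: Hugging Face transformers; adapt to yours).
# Precompute the KV cache for rarely-changing context once, reuse it per query.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.cache_utils import DynamicCache

MODEL_NAME = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

static_context = open("internal_rules.txt").read()  # hypothetical cold data source

# Run the static context through the model once to prefill the KV cache.
ctx_ids = tokenizer(static_context, return_tensors="pt").input_ids
kv_cache = DynamicCache()
with torch.no_grad():
    model(input_ids=ctx_ids, past_key_values=kv_cache, use_cache=True)
cache_len = ctx_ids.shape[1]  # remember where the static prefix ends

def answer(query: str, max_new_tokens: int = 256) -> str:
    # Trim the cache back to the static prefix: the previous call's query and
    # answer tokens were appended to it during generation.
    kv_cache.crop(cache_len)
    q_ids = tokenizer(query, return_tensors="pt", add_special_tokens=False).input_ids
    full_ids = torch.cat([ctx_ids, q_ids], dim=-1)
    # Only the query tokens are newly processed; the prefix comes from the cache.
    out = model.generate(full_ids, past_key_values=kv_cache,
                         max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0, full_ids.shape[1]:], skip_special_tokens=True)
```

The crop() call is what keeps the cache reusable: generation mutates the cache by appending query and answer tokens, so it has to be reset to the static prefix before the next request.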
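For step 2, a minimal indexing-and-retrieval sketch, assuming sentence-transformers plus a plain NumPy dot product in place of a real vector DB; the documents and the retrieve() helper are made up for illustration:

```python
# Minimal RAG indexing sketch (assumed stack: sentence-transformers + NumPy;
# swap in your embedding model and vector DB of choice).
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# "Hot" documents that change too often to bake into the KV cache.
docs = [
    "Q3 pricing sheet, updated weekly.",
    "Current on-call support rota.",
    "Release 2.4 notes and known issues.",
]
doc_vecs = embedder.encode(docs, normalize_embeddings=True)  # unit vectors

def retrieve(query: str, k: int = 3) -> list[str]:
    # On normalized vectors, cosine similarity reduces to a dot product.
    q_vec = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q_vec
    top = np.argsort(scores)[::-1][:k]
    return [docs[i] for i in top]
```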
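And for steps 3-6, a query-path sketch that stitches the two pieces above together; it reuses the hypothetical answer() and retrieve() helpers, and the system prompt wording is just an example:

```python
# Query-path sketch tying steps 3-6 together (builds on the two sketches above).
SYSTEM_PROMPT = (
    "Answer using the cached internal rules already in context. "
    "Use the retrieved documents below only for facts the rules do not cover."
)

def answer_with_cag_and_rag(query: str) -> str:
    retrieved = retrieve(query)                        # step 4: semantic search
    external = "\n".join(f"- {d}" for d in retrieved)  # step 5: enrich the prompt
    prompt = (f"{SYSTEM_PROMPT}\n\nRetrieved context:\n{external}\n\n"
              f"User question: {query}")
    return answer(prompt)                              # step 6: generate on the KV cache
```

Note the ordering: the KV cache is a prefix cache, so the static rules stay first and everything that changes per request (retrieved snippets, the query) comes after the cached tokens.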
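As a reference for the managed route, here is roughly what prompt caching looks like with the Anthropic API. The cache_control block is the documented mechanism; the model id, rules file, and question are placeholders:

```python
# Managed prompt-caching sketch via the Anthropic API.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
long_rules = open("internal_rules.txt").read()  # hypothetical cold data source

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=512,
    system=[
        {"type": "text", "text": "Follow the internal rules below in every answer."},
        # Mark the large, rarely-changing block as cacheable; later calls that
        # reuse the exact same prefix hit the cache instead of re-processing it.
        {"type": "text", "text": long_rules, "cache_control": {"type": "ephemeral"}},
    ],
    messages=[{"role": "user", "content": "Does rule 12 apply to contractors?"}],
)
print(response.content[0].text)
```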