Beyond Fine-Tuning: The Power of RAG

This article explores Retrieval-Augmented Generation as a flexible and cost-effective alternative to LLM fine-tuning, demonstrating how it leverages dynamic knowledge retrieval to enhance accuracy, explainability, and adaptability across various applications.

Oct 27, 2025 · 7 min read

Introduction

Large Language Models (LLMs) have revolutionized natural language processing, but their reliance on pre-trained knowledge presents a critical limitation: a lack of adaptability to real-time data and specialized domains. Retrieval-Augmented Generation (RAG) pipelines address this by combining the generative power of LLMs with the ability to dynamically retrieve information from external knowledge sources. In today's article, we dive into the impact of RAG, exploring how it unlocks new possibilities for LLM applications – from building highly accurate chatbots and knowledge assistants to enhancing code generation and powering personalized content creation.

We'll examine the core components of a RAG pipeline, discuss common architectures, and analyze the technical challenges and opportunities shaping the future of this increasingly vital approach to leveraging LLMs.

Core Problem

Consider this scenario: a large firm wants to improve efficiency in its support operations by using machine learning to automate common customer interactions. A frequent task involves routing customers to the appropriate specialist office, whether that is a product manager, the finance department, or another specialized team. While this task appears relatively low-risk because the mapping between customer issues and the responsible office is usually straightforward, it is incredibly time-consuming for the firm's employees, who could be focusing on more valuable work.

Traditionally, a solution might involve training a Large Language Model (LLM) to learn these mappings, then using it to automate the routing of incoming requests by phone (combined with voice recognition) or email. However, this approach relies on the assumption that the internal organization, particularly the people responsible for handling specific issues, remains consistent.

What happens when a finance specialist is promoted to a management role, while still retaining some of their previous responsibilities? This creates a complex edge case in the routing flowchart. Over time, these flowcharts can become unwieldy, especially in large organizations. This complexity increases the risk of errors during both data collection for training and the training process itself. Furthermore, training and regularly fine-tuning large LLMs requires significant financial investment.

A more adaptable solution lies in using RAG. Instead of fine-tuning the LLM, RAG provides the system with external, up-to-date information that is processed with each request. This shifts the upfront investment from continuous model fine-tuning to slightly more complex inference processes due to the additional data. However, this approach offers significantly greater flexibility, allowing the data to be easily exchanged or even automatically selected by the system to match the specific request.
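
To make this more tangible, the sketch below shows what this "external, up-to-date information" could look like for the routing scenario. It is purely illustrative: the categories, offices, and function names are invented, and the point is only that the promotion edge case becomes a one-record data edit instead of a retraining run.

```python
# Hypothetical routing knowledge base for the scenario above (all names invented).
# Each record maps an issue category to the office currently responsible for it.
routing_records = {
    "invoice_dispute": "Finance department, billing team",
    "feature_request": "Product manager of the affected product line",
    "contract_renewal": "Finance department, contracts team",
}

# The promotion edge case: responsibility moves, so we edit one record.
# No retraining or fine-tuning of the LLM is required.
routing_records["contract_renewal"] = "Team lead, account management (formerly a finance specialist)"

def build_routing_prompt(customer_message: str, category: str) -> str:
    """Attach the current routing record to the request before it reaches the LLM."""
    return (
        f"Current routing rule: {category} -> {routing_records[category]}\n"
        f"Customer message: {customer_message}\n"
        "Reply with the office this customer should be routed to."
    )

print(build_routing_prompt("I would like to extend our contract for next year.", "contract_renewal"))
```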

Components

Essentially, a RAG pipeline combines two primary components: a retrieval system that extracts relevant information from a source, and a powerful LLM capable of processing that information. Putting this into practice requires a few supporting components (a short code sketch follows the list):

  1. Knowledge Base: This serves as the central information hub for the system. It’s typically structured as a database to minimize data redundancy, ensure consistency, and offer other organizational advantages. This is where the system will retrieve information from.
  2. Embedding Model: Focused on indexing and vectorizing the content, this model transforms text into numerical representations (vectors). Similar texts will have vectors that are close to each other in the vector space. This is a crucial step, enabling the LLM to effectively process the data from the knowledge base. To avoid repeated computations, these vectorized text representations can also be stored in a vector database.
  3. Vector Database Search: After obtaining a vector representation of the user query using the embedding model, the vector database search retrieves content that is similar to the query. This process can use a k-Nearest-Neighbor approach or return all vectors satisfying a predefined similarity criterion. The original user query is then augmented by combining it with the retrieved information.
  4. LLM: The augmented output is then fed into the LLM. Here, the user query and retrieved context are processed, and a response is generated. This output is then passed to the calling application or user interface.
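
To make the interplay of these components more concrete, here is a minimal, illustrative Python sketch. It assumes the sentence-transformers package as the embedding model and keeps the "vector database" as a plain in-memory array; the final LLM call is left as a commented placeholder, since any local or hosted model could fill that role.

```python
# Minimal RAG sketch (illustrative only): embed a small knowledge base, retrieve the
# k most similar chunks for a query, and build an augmented prompt for the LLM.
# Assumes the sentence-transformers package; the LLM call itself is a placeholder.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = [
    "Refund requests for enterprise contracts are handled by the finance department.",
    "Feature requests and roadmap questions go to the responsible product manager.",
    "Password resets and account lockouts are handled by the IT service desk.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")

# Indexing step: vectorize the knowledge base once and keep the vectors around.
# In a production system these vectors would live in a dedicated vector database.
doc_vectors = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    query_vector = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vectors @ query_vector       # dot product equals cosine on normalized vectors
    top_k = np.argsort(scores)[::-1][:k]
    return [documents[i] for i in top_k]

def build_prompt(query: str) -> str:
    """Augment the user query with the retrieved context before calling the LLM."""
    context = "\n".join(retrieve(query))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

prompt = build_prompt("Who handles a refund question from an enterprise customer?")
# response = call_llm(prompt)  # placeholder: send the augmented prompt to any LLM
print(prompt)
```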

The following image provides a short summary of the components and how they interact with each other:

RAG with AnythingLLM

To demonstrate how to utilize RAG for your own projects, we’re using AnythingLLM. AnythingLLM is a free tool that simplifies the process of using local models, like those hosted by Ollama, for your RAG tasks.

For our experimentation with RAG, we're leveraging Llama 3.2-Vision. This model can extract information not only from text but also from images, although we're not using that capability in our demonstrations. If your hardware has limited VRAM and cannot support Llama 3.2-Vision, consider using smaller models from the Gemma family, such as gemma3:1b or gemma3:4b. These will also provide excellent results.

In addition, we reduced the model temperature to 0.2 in order to keep the system's outputs close to the wording of the source documents and to reduce the risk of hallucinations.
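
AnythingLLM exposes the temperature directly in its workspace settings. For readers who want to reproduce the setting outside the UI, the sketch below shows roughly how an equivalent request could be sent to a locally running Ollama server (default port 11434). The prompt text is made up, and we assume the model has already been pulled, for example with "ollama pull llama3.2-vision".

```python
# Illustrative request to a local Ollama server with a low temperature (0.2),
# mirroring the setting we used in AnythingLLM. Assumes Ollama is running on its
# default port and the model has already been pulled.
import requests

response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.2-vision",
        "prompt": "Summarize the key facts of CVE-2025-57819 in three sentences.",
        "options": {"temperature": 0.2},  # low temperature keeps output close to the source wording
        "stream": False,                  # return a single JSON object instead of a token stream
    },
    timeout=300,
)
print(response.json()["response"])
```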

CVE-2025-57819

In our first experiment, we download the page contents of CVE-2025-57819 by printing the page as a PDF. After inserting the document (either by drag and drop or by manually selecting it in a thread), we can begin asking questions.


Request: Please summarize the content, which is contained in the given document.

Response: The document describes a security vulnerability, CVE-2025-57819, affecting FreePBX, an open-source web-based graphical user interface. The vulnerability allows an unauthenticated attacker to bypass authentication, leading to SQL injection and remote code execution. The issue has been patched in versions 15.0.66, 16.0.89, and 17.0.3. The vulnerability has a CVSS score of 10.0 (Critical) and is related to CWE-89 (SQL Injection) and CWE-288 (Authentication Bypass).


Request: How does the document describe the criticality of the vulnerability?

Response: The document describes the criticality of the vulnerability by stating that it has a CVSS (Common Vulnerability Scoring System) score of 10.0, which is classified as “Critical”. This is the highest severity rating in the CVSS scoring system. Additionally, the document notes that the vulnerability allows for “arbitrary database manipulation and remote code execution”, which further emphasizes its critical nature.


After this step, we also added the pages CVE-2025-57819-ioc-check and Authentication Bypass Leading to SQL Injection and RCE to the context. Both of these documents provide additional information on how to mitigate the vulnerability.


Request: How can this vulnerability be mitigated?

Response: According to the document, the vulnerability can be mitigated by:

  1. Upgrading to the latest supported versions of FreePBX (currently 15, 16, and 17).
  2. Confirming that the installed "endpoint" module meets the minimum patched versions.
  3. Systems not configured for automatic updates can manually update via the Administrator Control Panel or via a generic command-line method:
    • $ fwconsole ma upgradeall
    • $ fwconsole ma list | grep endpoint

Additionally, the document references a GitHub repository (Sucuri-Labs/CVE-2025-57819-ioc-check) that provides a script to check for indicators of compromise (IoCs) related to this vulnerability.

Request: Please provide me with any important information, which has been given to you via the appendages and not been covered in our discussion so far.

Response: One important piece of information that has not been covered in our discussion so far is the CVSS (Common Vulnerability Scoring System) metrics for the vulnerability. According to the document, the CVSS metrics are:

  • Attack Vector: Network (AV:N)
  • Attack Complexity: Low (AC:L)
  • Privileges Required: None (PR:N)
  • User Interaction: None (UI:N)
  • Vulnerable System Impact Metrics:
    • Confidentiality: High
    • Integrity: High
    • Availability: High
  • Subsequent System Impact Metrics:
    • Confidentiality: High
    • Integrity: High
    • Availability: High
  • CVSS Score: 10.0 (Critical)
  • CVSS Vector: CVSS:4.0/AV:N/AC:L/AT:N/PR:N/UI:N/VC:H/VI:H/VA:H/SC:H/SI:H/SA:H

Advantages and Disadvantages of RAG Compared to Fine-Tuning

Advantages of RAG over Fine-Tuning:

  • Cost-Effectiveness: Significantly cheaper than fine-tuning, which requires substantial computational resources and curated datasets. RAG primarily relies on an existing LLM and a comparatively small knowledge base.
  • Ease of Updates & Maintenance: Keeping a RAG system current is much easier: you simply update the documents in your knowledge base. With fine-tuning, every change requires a full retraining process.
  • Reduced Risk of Catastrophic Forgetting: Fine-tuning can lead to "catastrophic forgetting", where the LLM loses previously learned information. RAG minimizes this by relying on the original LLM's knowledge and augmenting it with specific context.
  • Improved Explainability: RAG offers better explainability. You can trace the response back to the retrieved documents, making it easier to understand the reasoning behind the answer (see the short sketch after this list).
  • Handles Dynamic/Changing Information: RAG excels at incorporating new or frequently updated information because the knowledge base can be updated without retraining the entire model.
  • Scalability: Easier to scale. Scaling a knowledge base is generally more straightforward than scaling a fine-tuned model.
  • Data Privacy: You can use your own data in a RAG system without exposing it to third-party fine-tuning services, enhancing data privacy and control.
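
As a small illustration of the explainability point above, a RAG service can simply return the retrieved chunks alongside the generated answer. The function names below are placeholders standing in for the retrieval and generation steps sketched earlier.

```python
# Hedged sketch of source tracing in RAG: keep the retrieved chunks and return them
# with the answer so each response can be traced back to its source documents.
# retrieve_chunks and generate_answer are simple stand-ins for the real pipeline steps.
def retrieve_chunks(query: str) -> list[str]:
    return ["(retrieved document chunk 1)", "(retrieved document chunk 2)"]  # placeholder retrieval

def generate_answer(prompt: str) -> str:
    return "(LLM answer grounded in the prompt's context)"  # placeholder generation

def answer_with_sources(query: str) -> dict:
    chunks = retrieve_chunks(query)
    prompt = "Context:\n" + "\n".join(chunks) + f"\n\nQuestion: {query}"
    return {"answer": generate_answer(prompt), "sources": chunks}

result = answer_with_sources("How can the vulnerability be mitigated?")
print(result["sources"])  # the exact chunks the answer was grounded in
```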

Disadvantages of RAG compared to Fine-Tuning:

  • Dependency on Retrieval Quality: The quality of the RAG system is heavily dependent on the quality of the retrieval system. If the retrieval fails to find relevant information, the response will be poor.
  • Latency: Retrieval adds an extra step, which can increase latency (response time) compared to a fine-tuned model. Optimization can mitigate this, but a RAG pipeline will generally respond more slowly.
  • Prompt Engineering Sensitivity: RAG responses can be highly sensitive to the prompt design. Crafting effective prompts to leverage the retrieved context can be challenging.
  • Potential for Irrelevant Information: The retrieval process might occasionally return irrelevant information, which can confuse the LLM and lead to inaccurate responses.
  • Limited Customization of Reasoning Style: While RAG allows for content augmentation, it’s harder to change the LLM’s inherent reasoning style or personality.

TL;DR

Retrieval-Augmented Generation (RAG) offers a compelling alternative to traditional LLM fine-tuning by combining the generative power of LLMs with dynamically retrieved information from external knowledge bases, significantly reducing costs and complexity. This approach is particularly useful for scenarios requiring up-to-date information, like routing customer support tickets or detailing security vulnerabilities (as demonstrated by experiments with CVE-2025-57819), where constant re-training would be impractical. By leveraging tools like AnythingLLM and models like Llama 3.2-Vision, RAG systems can readily incorporate new data and even detailed technical reports, allowing for nuanced responses and explainability. The experimental results highlight how RAG can not only provide answers but also trace them back to the original source material, ensuring accuracy and trustworthiness. While RAG’s reliance on retrieval quality can introduce latency and require careful prompt engineering, it ultimately provides a flexible and cost-effective way to enhance LLM capabilities without the risks associated with catastrophic forgetting. This makes RAG a valuable tool for tasks demanding both specialized knowledge and adaptability.
