# RAG Ingestion
In this tutorial, you'll build a Retrieval-Augmented Generation (RAG) ingestion pipeline using WSO2 Integrator: BI. The pipeline loads content from a file, chunks it into smaller sections, generates embeddings, and stores those embeddings in a vector knowledge base for efficient retrieval.
By the end of this tutorial, you'll have created a complete ingestion flow that reads a markdown file, processes the content, and stores it in a vector store for use in RAG applications.
Note
This tutorial focuses solely on the ingestion aspect of RAG. Retrieval and querying will be covered in a separate guide. The ingestion pipeline is designed using WSO2 Integrator: BI's low-code interface, allowing you to visually orchestrate each step with ease.
## Prerequisites
To get started, you need a knowledge file (in Markdown format) that you want to ingest into the vector store.
Note: This tutorial uses an in-memory vector store for simplicity, but you can also use external vector stores like Pinecone, Milvus, or Weaviate.
What is an In-Memory Vector Store?
An in-memory vector store holds your data (the chunked and embedded text) directly in your computer's active memory (RAM). This makes it very fast and easy to set up, as it requires no external databases or services. However, this data is temporary—it will be completely erased when you stop the integration or close the project.
## Step 1: Create a new integration project
- Click on the BI icon in the sidebar.
- Click on the Create New Integration button.
- Enter the project name as `rag_ingestion`.
- Select a directory location by clicking on the Select Path button.
- Click Create New Integration to generate the project.
## Step 2: Create an automation
In WSO2 Integrator: BI, an automation is a flow that runs automatically when the integration starts. We will use this to ensure our data is loaded and ingested into the knowledge base as soon as the application starts, making it ready for the query service (see the sketch after this list).
- In the design screen, click on + Add Artifact.
- Select Automation under the Automation artifact category.
- Click Create to open the flow editor.
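Under the hood, an automation corresponds to the integration's entry point. The minimal Ballerina sketch below shows the shape of that entry point; the body is filled in by the steps that follow, so this is an illustration rather than code you need to write yourself:

```ballerina
// An automation runs once when the integration starts.
// The flow assembled in the next steps becomes the body of this function.
public function main() returns error? {
    // data loading, ingestion, and logging steps go here
}
```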
## Step 3: Create a text data loader
- Hover over the flow line and click the + icon to open the side panel.
- Click on Data Loader from the AI section.
- Click + Add Data Loader to create a new instance.
- Choose Text Data Loader.
- Under the paths field, click on + Add Another Value and add the path to your markdown file.
- Set Data Loader Name as `loader`.
- Click Save to continue.
## Step 4: Load data using the data loader
- In the Data Loaders section, click on `loader`.
- Click on load to open the configuration panel.
- Name the result as `doc`.
- Click Save to complete the data loading step.
This step wraps the file content into an `ai:Document` record, preparing it for chunking and embedding.
Note
In WSO2 Integrator: BI, an `ai:Document` is a generic container that wraps the content of any data source, such as a file, webpage, or database entry. It not only holds the main content but can also include additional metadata, which becomes useful during retrieval operations in RAG workflows. In this tutorial, no metadata is used.
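To make that concrete, here is a simplified Ballerina sketch of what such a container conceptually holds; the field names are illustrative, not the exact `ai:Document` definition:

```ballerina
// Illustrative stand-in for ai:Document (not the library's actual definition).
type Document record {|
    string content;          // the main text loaded from the source
    map<json> metadata = {}; // optional metadata; unused in this tutorial
|};
```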
## Step 5: Create a vector knowledge base
A vector knowledge base in WSO2 Integrator: BI acts as an interface to a vector store and manages the ingestion and retrieval of documents.
Note
This tutorial uses an In-Memory Vector Store for simplicity and to get you started quickly. For production use cases or persistent storage, you can choose from other supported vector stores including Pinecone, Milvus, Weaviate, and more. Simply select your preferred option when creating the vector store in the steps below.
When using external vector stores, you may need to provide API keys and other configuration details. It's recommended to externalize sensitive values like API keys using configurables to avoid exposing them in your project files. See Configurations for more information.
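For example, an external store's API key can be declared as a Ballerina configurable and supplied at runtime through Config.toml; the variable name below is hypothetical:

```ballerina
// Hypothetical configurable for an external vector store's API key.
// The `?` default makes it required, so the value must be supplied at
// runtime (for example, via an entry in Config.toml) rather than in code.
configurable string vectorStoreApiKey = ?;
```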
- Hover over the flow line and click the + icon.
- Select Vector Knowledge Bases under the AI section.
- Click + Add Vector Knowledge Base to create a new instance.
- In the Vector Store section, click + Create New Vector Store and choose InMemory Vector Store, then click Save to create the vector store. This will return you to the vector knowledge base configuration.
- In the Embedding Model section, click + Create New Embedding Model, select Default Embedding Provider (WSO2), then click Save.
- For the Chunker setting, you can leave it at the default value of AUTO or create a new chunker if needed.
- Set the Vector Knowledge Base Name to `knowledgeBase`.
- Click Save to complete the configuration.
Embedding Dimensions
The Default Embedding Provider (WSO2) generates dense vectors with 1536 dimensions. If you're using an external vector store (Pinecone, Milvus, Weaviate, etc.), ensure your vector store index is configured to support 1536-dimensional vectors.
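Conceptually, this step pairs a vector store with an embedding model behind a single knowledge base interface. The Ballerina sketch below mirrors that wiring; the type and function names are assumptions based on this flow, so check the source generated in your project for the exact API:

```ballerina
import ballerina/ai;

public function main() returns error? {
    // Wiring sketch only; names are assumed, not the verified library API.
    ai:VectorStore vectorStore = new ai:InMemoryVectorStore();
    ai:EmbeddingProvider embeddingModel = check ai:getDefaultEmbeddingProvider();
    ai:VectorKnowledgeBase knowledgeBase = new (vectorStore, embeddingModel);
}
```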
## Step 6: Ingest data into the knowledge base
- In the Vector Knowledge Bases section, click on `knowledgeBase`.
- Click on ingest to open the configuration panel.
- Provide `doc` as the input for Documents.
- Click Save to complete the ingestion step.
This step splits the document into chunks, converts each chunk into an embedding, and stores the embeddings in the vector store for future retrieval.
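In code terms, the whole step reduces to a single call on the knowledge base. This sketch reuses the `doc` and `knowledgeBase` names from the flow; the method name and signature are assumptions, so verify them against your generated source:

```ballerina
// Chunk the document, embed each chunk, and store the resulting vectors
// (method name assumed; see the generated source for the exact signature).
check knowledgeBase.ingest([doc]);
```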
## Step 7: Add a confirmation message
- Hover over the flow line and click the + icon.
- Select Log Info under the Logging section.
- Enter `"Ingestion completed."` in the Msg field.
- Click Save.
This step will print a confirmation once the ingestion is complete.
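The Log Info node corresponds to a standard `ballerina/log` call, roughly as follows:

```ballerina
import ballerina/log;

// Writes an INFO-level entry to the console once ingestion has finished.
log:printInfo("Ingestion completed.");
```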
## Step 8: Configure the default WSO2 provider and run the integration
- As the workflow uses the Default Embedding Provider (WSO2), you need to configure its settings:
    - Press `Ctrl/Cmd + Shift + P` to open the VS Code command palette.
    - Run the command `Ballerina: Configure default WSO2 model provider`. This will automatically generate the required configuration entries.
- Click the Run button in the top-right corner to execute the integration.
- Once the integration runs successfully, you will see the message `"Ingestion completed."` in the console.