Lab 1: Ingestion Pipeline

Ingestion Pipeline Diagram

In this lab, you will learn how to populate a Vector Database (Elasticsearch) with selected content from the Knowledge Base stored in Alfresco. Vectors are extracted from the content using the nomic-embed-text embedding model served via Ollama.

Components

  • Alfresco acts as the Knowledge Base, storing documents within folders that have the cm:syndication aspect (1)
  • alfresco-ai-sync retrieves documents from the Alfresco Repository using the Alfresco REST API and uploads the content to the Vector Database through the AI RAG Framework REST API (2)
  • ai-rag-framework exposes the REST API for ingestion and utilizes Ollama to access the nomic-embed-text embedding model, which generates vector representations of the ingested documents and stores them in Elasticsearch (3)
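
Component (2) interacts with Alfresco through its public REST API. As a simplified illustration of the kind of call involved (this is not the actual alfresco-ai-sync code, and the folder id is a placeholder), listing the documents of the sync folder with Spring's RestClient looks roughly like this:

```java
import org.springframework.web.client.RestClient;

public class ListSyncFolderChildren {

    public static void main(String[] args) {
        // Placeholder node id of the folder to synchronize
        String folderId = "00000000-0000-0000-0000-000000000000";

        // Alfresco public REST API of the default local installation, default credentials
        RestClient alfresco = RestClient.builder()
                .baseUrl("http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1")
                .defaultHeaders(headers -> headers.setBasicAuth("admin", "admin"))
                .build();

        // List the children (documents) of the folder and print the raw JSON response
        String children = alfresco.get()
                .uri("/nodes/{folderId}/children", folderId)
                .retrieve()
                .body(String.class);

        System.out.println(children);
    }
}
```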

Configuration Files

  • (1) The aspect and properties to be used can be configured in the application.properties file of the alfresco-ai-sync application.
  • (2) Alfresco credentials, protocol, host, and port can be configured in the application.properties file of the alfresco-ai-sync application.
  • (3) Ollama and Elasticsearch settings can be configured in the application.yml file of the ai-rag-framework application.
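
For orientation only, the snippet below sketches what such settings could look like in application.properties. The key names are illustrative placeholders, not the actual alfresco-ai-sync keys; check the application.properties file in the repository for the real names.

```properties
# Illustrative placeholders only; the real alfresco-ai-sync keys may differ

# (2) Alfresco connection details
alfresco.protocol=http
alfresco.host=localhost
alfresco.port=8080
alfresco.username=admin
alfresco.password=admin

# (1) Aspect and date property used to select the folders to synchronize
alfresco.sync.aspect=cm:syndication
alfresco.sync.updated-property=cm:updated
```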

Step 1: Populate the Knowledge Base

  1. Start Alfresco Community by running the following command:

    cd alfresco-docker
    docker compose up --build --force-recreate
  2. Log in to Alfresco Share at http://localhost:8080/share using the default credentials:

    Username: admin
    Password: admin

  3. Create the "Knowledge Base" folder with the following rule:

    • Name: Sync Folder
    • When: Items are created or enter this folder
    • If all criteria are met: Content of type or sub-type is Folder
    • Action:
      Set the property cm:updated to 01/01/1999 00:00

    If you use an aspect other than the default cm:syndication (which includes the cm:updated property), the application.properties file of the alfresco-ai-sync application needs to be updated accordingly.

  4. Create a child folder named RAG within the Knowledge Base folder and add some documents to it, either through the Share UI or via the Alfresco REST API (see the sketch after this list).

    At this point, your Knowledge Base is populated and ready for synchronization.
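
As an alternative to uploading through the Share UI, documents can also be added to the RAG folder with the Alfresco REST API (a multipart POST to /nodes/{folderId}/children with a filedata part). A minimal sketch using Spring's RestClient follows; the folder node id is a placeholder, and admin/admin are the default credentials used above.

```java
import org.springframework.core.io.FileSystemResource;
import org.springframework.http.MediaType;
import org.springframework.util.LinkedMultiValueMap;
import org.springframework.util.MultiValueMap;
import org.springframework.web.client.RestClient;

public class UploadToRagFolder {

    public static void main(String[] args) {
        // Placeholder node id of the "Knowledge Base/RAG" folder
        String ragFolderId = "00000000-0000-0000-0000-000000000000";

        RestClient alfresco = RestClient.builder()
                .baseUrl("http://localhost:8080/alfresco/api/-default-/public/alfresco/versions/1")
                .defaultHeaders(headers -> headers.setBasicAuth("admin", "admin"))
                .build();

        // The v1 REST API expects the binary content in a multipart "filedata" part
        MultiValueMap<String, Object> body = new LinkedMultiValueMap<>();
        body.add("filedata", new FileSystemResource("cryptography-0.pdf"));

        alfresco.post()
                .uri("/nodes/{folderId}/children", ragFolderId)
                .contentType(MediaType.MULTIPART_FORM_DATA)
                .body(body)
                .retrieve()
                .toBodilessEntity();
    }
}
```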

Step 2: Synchronize the Knowledge Base with the Vector Database

Follow these steps to synchronize the content from the Alfresco Knowledge Base to the Vector Database (Elasticsearch) using the alfresco-ai-sync application.

  1. Stop Alfresco Community by pressing Ctrl+C

    Data won't be lost, as local Docker volumes are used

  2. Verify that Ollama is running, or start it:

    ollama -v

    The nomic-embed-text embedding model must also be available locally; it can be pulled with ollama pull nomic-embed-text
  3. Start the full stack from the root folder, including the alfresco-ai-sync and ai-rag-framework services:

    cd ..
    docker compose up --build --force-recreate
  4. After a while, the alfresco-ai-sync service finishes the initial synchronization of the folder, as shown in its logs:

    docker logs ai-framework-alfresco-ai-sync-1
     Successfully initialized with folder ID: [5730a944-248d-43cf-b0a9-44248d23cfec]
     Starting initial sync process.
     Starting initial synchronization for folder: AlfrescoSyncFolder[id=5730a944-248d-43cf-b0a9-44248d23cfec, publishedDate=null, updatedDate=2024-11-15T13:17Z, docLastUpdatedDate=2024-11-15T13:17:41.487Z]
     Initial synchronization for folder AlfrescoSyncFolder[id=5730a944-248d-43cf-b0a9-44248d23cfec, publishedDate=null, updatedDate=2024-11-15T13:17Z, docLastUpdatedDate=2024-11-15T13:17:41.487Z] complete. Processed 7 documents
     Finished initial sync process.

**Spring AI for ingestion**

The **ai-rag-framework** service performs the ingestion of the documents using the following pieces of code:

The configuration for the Vector Database (Elasticsearch), Ollama, and the embedding model is defined in [application.yml](https://github.com/aborroy/alfresco-ai-framework/blob/main/ai-rag-framework/src/main/resources/application.yml):

```yaml
spring:
  elasticsearch:
    uris: http://localhost:9200
  ai:
    ollama:
      base-url: http://localhost:11434
      embedding:
        options:
          model: nomic-embed-text
    vectorstore:
      elasticsearch:
        initialize-schema: true
        index-name: alfresco-ai-document-index
        dimensions: 768
```

The document is processed with the TikaDocumentReader to extract its text. The extracted text is then split into smaller chunks that fit within the embedding model's input limits using the TokenTextSplitter. A 768-dimensional vector embedding (matching the dimensions setting above) is calculated for each chunk, and the resulting collection of chunks, each holding the vector embedding and its corresponding text, is stored in the vector database through the VectorStore object.

```java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;

// Extract the text with Tika and split it into token-sized chunks
List<Document> documents =
    TokenTextSplitter.builder().build().apply(new TikaDocumentReader(file).get());

// Store the chunks (text and embeddings) in the Vector Database
vectorStore.add(documents);
```
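
In the ai-rag-framework service, the vectorStore object used above does not need to be created manually: the Spring AI starters auto-configure the Ollama embedding model and the Elasticsearch-backed VectorStore from the application.yml shown earlier, so it can simply be injected. The class below is a minimal sketch under that assumption, with an illustrative name rather than the framework's actual class:

```java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.ai.transformer.splitter.TokenTextSplitter;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Service;

@Service
public class DocumentIngestionService {

    // Elasticsearch-backed VectorStore, auto-configured from application.yml
    private final VectorStore vectorStore;

    public DocumentIngestionService(VectorStore vectorStore) {
        this.vectorStore = vectorStore;
    }

    public void ingest(Resource file) {
        // Read with Tika, split into token-sized chunks, then embed and store
        List<Document> documents =
                new TokenTextSplitter().apply(new TikaDocumentReader(file).get());
        vectorStore.add(documents);
    }
}
```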

Step 3: Verify the Vector Database Population

Once synchronization is complete, use Kibana to verify that the Vector Database (Elasticsearch) has been populated with the documents from the Alfresco Knowledge Base.

Access the Kibana Developer Tools console at http://localhost:5601/app/dev_tools#/console and type the following request:

```
GET /alfresco-ai-document-index/_search
```

The response should look similar to the following (abbreviated):

```json
{
  "hits": {
    "total": {
      "value": 7,
      "relation": "eq"
    },
    "hits": [
      {
        "_index": "alfresco-ai-document-index",
        "_id": "f6532062-dc95-42fa-babf-dafb09614564",
        "_source": {
          "embedding": [
            0.013943307,
            0.05599264,
            ...
          ],
          "content": "Contents 2. Introduction to ...",
          "id": "f6532062-dc95-42fa-babf-dafb09614564",
          "metadata": {
            "fileName": "file.pdf",
            "documentId": "536fe0b0-cb3f-43f4-afe0-b0cb3f43f42c",
            "source": "cryptography-0.pdf",
            "folderId": "5730a944-248d-43cf-b0a9-44248d23cfec"
          }
        }
      }
      ...
    ]
  }
}
```

This result represents a search query response from Elasticsearch for the alfresco-ai-document-index.

Each hit includes the content of one document chunk as structured fields:

  • embedding: An array of numbers representing the chunk's embedding in a multidimensional vector space. These values are used for similarity searches.
  • content: The extracted textual content of the chunk, to be provided as context for chatting.
  • fileName: Name of the original file (file.pdf) in the Alfresco Repository.
  • documentId: A unique identifier for the source document in the Alfresco Repository.
  • folderId: ID of the sync folder in the Alfresco Repository where the document resides.
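
As an optional, purely illustrative alternative to Kibana, the same check can be done programmatically: with the Spring AI configuration shown earlier, the auto-configured VectorStore can be queried directly. The class below is only a sketch under those assumptions and is not part of the labs' code.

```java
import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.vectorstore.VectorStore;
import org.springframework.boot.CommandLineRunner;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

@Configuration
public class VectorStoreCheck {

    @Bean
    CommandLineRunner verifyIndex(VectorStore vectorStore) {
        return args -> {
            // Similarity search against the configured alfresco-ai-document-index
            List<Document> results = vectorStore.similaritySearch("cryptography");

            // Print the source file name of each matching chunk (see the metadata fields above)
            results.forEach(doc -> System.out.println(doc.getMetadata().get("source")));
        };
    }
}
```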