RAG Sample - 3. Generate Schema-compliant Embeddings


So far, we've initialised the collection with a custom schema and built the capability to generate embeddings via Ollama. Now we'll tie it all together.

A sample to process

We define a format for the document (based on the paperless-ngx API response):

doc = {
  "id": 3031,
  "correspondent": 63,
  "document_type": 10,
  "storage_path": None,
  "title": "10_ws3-petrov",
  "content": "Laura Oana Petrov * and Nobukazu Nakagoshi\n\n\nThe Use of GIS ...",
  "tags": [
    68
  ],
  "original_file_name": "10_ws3-petrov.pdf",
  "archived_file_name": "2024-07-17 Laura 10_ws3-petrov.pdf",
  "owner": 3,
  "notes": [],
  "custom_fields": [],
  "source_url": "http://docs.home.laurivan.com/documents/3031/preview"
}

Note: all the numeric fields are references to other objects (correspondent, document type, tags), and I've ignored them for the time being.
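For context, such a dict can be retrieved straight from the paperless-ngx REST API. The sketch below is hypothetical (the host, the token and the way source_url is built are assumptions, not part of the original code):

import requests

PAPERLESS_URL = "http://docs.home.laurivan.com"   # placeholder host
TOKEN = "<your-paperless-api-token>"              # placeholder token

# Fetch the document metadata (including the OCR'd "content" field)
resp = requests.get(
    f"{PAPERLESS_URL}/api/documents/3031/",
    headers={"Authorization": f"Token {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()
doc = resp.json()

# Keep a link back to the source document for the "uri" field later
doc["source_url"] = f"{PAPERLESS_URL}/documents/3031/preview"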

Generate chunks

First, we implement the chunk generator:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_chunks(text, chunk_size=250, overlap=0):
    '''
    Split a document into chunks
    '''
    text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        chunk_size=chunk_size,
        chunk_overlap=overlap
    )

    doc_splits = text_splitter.split_text(text)

    return doc_splits

This is probably the simplest chunk splitter! It splits a text into segments of approximately 250 tokens each (the splitter measures length with the tiktoken encoder, not in characters).
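As a quick sanity check, you can run the splitter against the sample document above (the chunk count depends on the text, so the numbers are not fixed):

# Split the sample document's content and peek at the result
chunks = get_chunks(doc["content"], chunk_size=250, overlap=0)
print(f"{len(chunks)} chunks")
print(chunks[0][:100])   # first 100 characters of the first chunk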

Build the embeddings

I generate the embeddings starting from a JSON structure with the following minimum set of fields:

  • content - the text to be indexed
  • id - the document ID to be referred
  • source_url - the reference point
  • title or file_name or archived_file_name or original_file_name
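If you want to fail early when a document doesn't meet that contract, a small validation helper (hypothetical, not part of my code) could look like this:

def validate_doc(doc):
    '''
    Check that the document carries the minimum set of fields
    needed to build the embeddings (sketch only)
    '''
    name_fields = ["title", "file_name", "archived_file_name", "original_file_name"]
    missing = [f for f in ["content", "id", "source_url"] if not doc.get(f)]
    if not any(doc.get(f) for f in name_fields):
        missing.append(" or ".join(name_fields))
    if missing:
        raise ValueError(f"document is missing required fields: {missing}")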

First I make sure I have the information from the JSON object stored in a variable named doc:

chunk_size = 250
overlap = 0

# Get the chunks based on "content"
chunks = get_chunks(doc["content"], chunk_size=chunk_size, overlap=overlap)
file_name = doc.get("file_name", doc.get("archived_file_name", doc.get("original_file_name", doc.get("title"))))
source_url = doc.get("source_url")

After that, I calculate the embeddings:

# Calculate the embeddings (one vector per chunk)
embeddings = emb_chunks(chunks)
num_embeddings = len(embeddings)

Once I have the embeddings (this might be a bit slow because of Ollama), I build the schema-compliant array to be inserted:

# Prepare the vectors to be inserted in Milvus
data = []
for i in range(num_embeddings):
  val = {
    "embeddings": embeddings[i],
    "text": chunks[i],
    "uri": source_url,
    "title": file_name
  }
  data.append(val)

print(file_name)
print(len(embeddings))

I know there's probably a better way to do this than what's written above, but that'll be for production code (plus, it's been a while since I wrote anything meaningful in Python).
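For what it's worth, a slightly more compact version of the same loop (a sketch with identical output) could zip the chunks with their embeddings:

# Same records as above, built with a list comprehension
data = [
    {
        "embeddings": embedding,
        "text": chunk,
        "uri": source_url,
        "title": file_name,
    }
    for chunk, embedding in zip(chunks, embeddings)
]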

My code looks like this:

from langchain.text_splitter import RecursiveCharacterTextSplitter

def get_chunks(text, chunk_size=250, overlap=0):
  '''
  Split a document into chunks
  '''
  text_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=chunk_size,
    chunk_overlap=overlap
  )
  doc_splits = text_splitter.split_text(text)
  return doc_splits

# import the emb_chunks function defined previously
from embeddings import emb_chunks

def generate_embeddings(doc, chunk_size=250, overlap=0):
  '''
  Generate the embeddings for a document and prepare the corresponding data structure for insertion

  The document is a JSON/dict and must at least have the following fields:
  - content - the text to be indexed
  - id - the document ID to be referred
  - source_url - the reference point
  - title or file_name or archived_file_name or original_file_name
  '''

  # Get the chunks based on "content"
  chunks = get_chunks(doc["content"], chunk_size=chunk_size, overlap=overlap)
  file_name = doc.get("file_name", 
    doc.get("archived_file_name", 
      doc.get("original_file_name", 
        doc.get("title")
      )
    )
  )
  source_url = doc.get("source_url")

  # Calculate the embeddings (one vector per chunk)
  embeddings = emb_chunks(chunks)
  num_embeddings = len(embeddings)

  # Prepare the vectors to be inserted in Milvus
  data = []
  for i in range(num_embeddings):
    val = {
      "embeddings": embeddings[i],
      "text": chunks[i],
      "uri": source_url,
      "title": file_name
    }
    data.append(val)

  print(file_name)
  print(len(embeddings))
  return data

import pathlib
import json

if __name__ == "__main__":
    # "doc" is the sample paperless-ngx document defined earlier
    data = generate_embeddings(doc)
    pathlib.Path("embad.json").write_text(json.dumps(data))

It has the following characteristics:

  • The generate_embeddings function requires a JSON object with a specific minimum set of fields. It parameterizes the chunk size and the overlap for the text splitter.
  • The main block generates the embeddings for the sample document doc and writes the resulting array as JSON to a file. I did this to see what the generated embeddings actually look like.
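To double-check that the generated records match the collection schema from the earlier parts, a quick inspection of the output file might look like this (field names as used above):

import json
import pathlib

# Load the generated records and verify each carries the expected fields
records = json.loads(pathlib.Path("embad.json").read_text())
expected = {"embeddings", "text", "uri", "title"}

for record in records:
    assert expected <= set(record), f"missing fields: {expected - set(record)}"

print(f"{len(records)} records, embedding dimension: {len(records[0]['embeddings'])}")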