Chunk and index text

This is part of the Use RAG with series.

With the dependencies installed, we can start filling the vector database.

Set up

The first step is to create the LanceDB database and the embedding model:

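from lancedb import connect
from lancedb.embeddings import get_registry

# Assumed value (Ollama's default local address); the original host is not shown
OLLAMA_HOST = "http://localhost:11434"
OLLAMA_MODEL = "mxbai-embed-large:latest"

# The argument is a directory path, not a database name
db = connect("lance")

# The original call is truncated; the name/host keywords are an assumption
embed_model = get_registry().get("ollama").create(name=OLLAMA_MODEL, host=OLLAMA_HOST)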

As you can see, I have defined:

  • OLLAMA_HOST - the host where Ollama is installed
  • OLLAMA_MODEL - the embedding model we want to use
  • db - the database (warning: the parameter is actually a path). In my case, the database lives in the lance directory, next to the script
  • embed_model - the LanceDB object that defines the embedding we want to use

The embed_model approach in LanceDB is quite useful because the size of the vector field can be derived from the embedding model itself (embed_model.ndims()) instead of being hard-coded.

I have also defined a text splitter, using llama_index:

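from llama_index.embeddings.ollama import OllamaEmbedding
from llama_index.packs.node_parser_semantic_chunking.base import SemanticChunker

# The original call is truncated; model_name and base_url are assumed to reuse the values above
ollama_embeddings = OllamaEmbedding(
    model_name=OLLAMA_MODEL,
    base_url=OLLAMA_HOST,
    ollama_additional_kwargs={"mirostat": 0},
)
splitter = SemanticChunker(embed_model=ollama_embeddings)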

The LanceDB model

The LanceDB model I defined is quite basic. It has:

  • the file name
  • the repository name
  • the text of the chunk
  • the vector embedding

from lancedb.pydantic import LanceModel, Vector

class CodeChunks(LanceModel):
    filename: str
    repository: str
    text: str = embed_model.SourceField()
    vector: Vector(embed_model.ndims()) = embed_model.VectorField()

Create necessary tables

When you query LanceDB, you must tell it where to query from, and that's a table. The function to initialise a table is:

def create_table(name: str):
    result = None
    try:
        # Create the table with the CodeChunks schema
        result = db.create_table(f"{name}", schema=CodeChunks)
    except Exception:
        # The table already exists, so open it instead
        result = db.open_table(f"{name}")

    return result

As you can see, db.create_table() raises an exception if the table already exists (you can give it an extra parameter to overwrite, but that's not what we want). In that case, we just open the table.
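One drawback of catching a bare Exception is that it also hides unrelated failures (a bad schema, a permissions problem). A possible alternative is to check whether the table exists before creating it. The sketch below is only an illustration: InMemoryDB is a hypothetical stand-in so the snippet runs on its own, and it assumes the connection exposes table_names(), create_table() and open_table().

```python
class InMemoryDB:
    """Hypothetical stand-in for a database connection (illustration only)."""

    def __init__(self):
        self._tables = {}

    def table_names(self):
        return list(self._tables)

    def create_table(self, name, schema=None):
        if name in self._tables:
            raise ValueError(f"Table {name!r} already exists")
        self._tables[name] = []
        return self._tables[name]

    def open_table(self, name):
        return self._tables[name]


def get_or_create_table(db, name, schema=None):
    # Check first, so errors raised by create_table are not silently swallowed
    if name in db.table_names():
        return db.open_table(name)
    return db.create_table(name, schema=schema)


fake_db = InMemoryDB()
first = get_or_create_table(fake_db, "my_repo")
second = get_or_create_table(fake_db, "my_repo")  # opens the existing table
```

The second call returns the table created by the first one instead of raising.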

Index a file

Once I have all the components above, file indexing is quite simple:

def index_file(project_name, repo_name, file_name, file_content):
    repo_table = create_table(repo_name)
    project_table = create_table(project_name)

    chunks = splitter.split_text(file_content)
    array = []
    for chunk in chunks:
        array.append({"text": chunk, "repository": repo_name, "filename": file_name})

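    # debug
    print(f"   {len(array)} chunks")

    # Adding the rows triggers the embedding model for each chunk
    repo_table.add(array)
    project_table.add(array)

    # debug
    print(f"repo rows: {repo_table.count_rows()}, project_rows: {project_table.count_rows()}")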

First, split the input file into chunks using the splitter initialised above. Then, build an array of rows matching the model and add it to both the repository-specific and the project-specific tables. This way, the project-specific table will have embeddings from the whole code base.
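The row-building step can also be pulled out into a small helper, which makes it easy to check without a database. This is a minimal sketch: build_rows is a hypothetical name, and skipping whitespace-only chunks is my own assumption, not something the original loop does.

```python
def build_rows(chunks, repo_name, file_name):
    """Turn text chunks into row dictionaries with the model's non-vector fields."""
    return [
        {"text": chunk, "repository": repo_name, "filename": file_name}
        for chunk in chunks
        if chunk.strip()  # assumption: whitespace-only chunks are not worth embedding
    ]


rows = build_rows(["def a(): pass", "   ", "def b(): pass"], "my-repo", "src/app.py")
print(len(rows))  # 2, the whitespace-only chunk is dropped
```

The vector field is left out on purpose, since the table fills it in when the rows are added.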

When xxx_table.add(array) runs, LanceDB calls the embedding model to create the vector for each chunk. Much nicer than writing boilerplate!

I chose to create one table that contains all chunks from my Bitbucket project, plus one table per repository that contains only that repository's embeddings. I did that to see which variant would give me better results.

Index a project

Great! Now we can use the function to index the codebase. The indexing code would look something along the lines of:

files = get_all_files_from_bitbucket(PROJECT_NAME, repo, main_branch)

for filename in files:
    # IGNORE_EXTENSIONS must be a string or a tuple of suffixes for str.endswith
    if filename.lower().endswith(IGNORE_EXTENSIONS):
        print(f"Skipping {filename} because of extension")
        continue

    print(f"Indexing {filename}:")
    content = get_file_from_bitbucket(PROJECT_NAME, repo, main_branch, filename)
    # This is the important line
    index_file(PROJECT_NAME, repo, filename, content)


In pseudocode:

for each file from the project
    get the file's content
    index the file
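The extension filter relies on str.endswith accepting a tuple of suffixes. A small, self-contained sketch of that check follows; the contents of IGNORE_EXTENSIONS and the file names are hypothetical, and should_skip is a helper name I made up for the illustration.

```python
# Hypothetical ignore list; it must be a tuple (not a list) for str.endswith
IGNORE_EXTENSIONS = (".png", ".jpg", ".lock", ".min.js")


def should_skip(filename, ignored=IGNORE_EXTENSIONS):
    # Lower-case first so that "logo.PNG" is matched as well
    return filename.lower().endswith(ignored)


files = ["README.md", "logo.PNG", "src/app.py", "yarn.lock"]
to_index = [f for f in files if not should_skip(f)]
print(to_index)  # ['README.md', 'src/app.py']
```

Passing a list instead of a tuple raises a TypeError, so it is worth defining the constant as a tuple from the start.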
