Vibe Coding - Add a field to a Qdrant collection

It's fashionable to add Ghibli-style images to everything these days.

I created a Qdrant collection from one of my repositories, with the following payload fields:

  • file_name - the source code file name (path actually)
  • branch - the git repository branch that I have indexed
  • repository_name - the git repository name
  • chunk - the text chunk

I forgot to add a language field to record the programming language of each file/chunk. So, what better way to fix that than vibe coding?

I fired up my Claude client and entered:

💡
write a python script that adds a new field to an existing qdrant collection. The field name is "field_name" and the field content is a string "dummy". The script should use the qdrant python client.

knowing the collection has a field "file_name" containing the file name, change the above function to add a field named "language" with the programming language extracted from the file name. The languages recognised should be java, javascript, typescript, markdown, html, xml, json. For all other languages, default to "unknown"

And I got:

import os

from qdrant_client import QdrantClient

# Initialize the Qdrant client
# Replace with your connection details if needed
client = QdrantClient(host="localhost", port=6333)

# Name of the collection to update
collection_name = "your_collection_name"

def detect_language_from_filename(filename):
    """
    Detect programming language from file name based on extension.
    
    Args:
        filename (str): Name of the file
        
    Returns:
        str: Detected language or "unknown"
    """
    # Remove path if present and get extension
    _, ext = os.path.splitext(filename)
    ext = ext.lower().lstrip('.')
    
    # Map extensions to languages
    language_map = {
        'java': 'java',
        'js': 'javascript',
        'jsx': 'javascript',
        'ts': 'typescript',
        'tsx': 'typescript',
        'md': 'markdown',
        'markdown': 'markdown',
        'html': 'html',
        'htm': 'html',
        'xml': 'xml',
        'json': 'json'
    }
    
    return language_map.get(ext, "unknown")

def add_language_field_to_collection(collection_name):
    """
    Add a 'language' field to all points in a Qdrant collection,
    based on the existing 'file_name' field.
    
    Args:
        collection_name (str): Name of the collection to update
    """
    # Get the number of points in the collection
    collection_info = client.get_collection(collection_name=collection_name)
    points_count = collection_info.points_count
    
    # Set batch size for processing
    batch_size = 100
    
    print(f"Adding 'language' field to {points_count} points in collection '{collection_name}'")
    
    # Process in batches to avoid memory issues with large collections.
    # Note: scroll() pages by point ID rather than by numeric position, so
    # each call must resume from the next_page_offset returned by the last one.
    next_offset = None
    updated = 0
    while True:
        # Get a batch of points with their file_name payload
        points, next_offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=next_offset,
            with_payload=["file_name"]
        )

        if not points:
            break

        # Process each point to determine language and prepare updates
        for point in points:
            point_id = point.id

            # Extract file_name from payload (skip points that lack it)
            if point.payload and "file_name" in point.payload:
                file_name = point.payload["file_name"]
                language = detect_language_from_filename(file_name)

                # Update the point with the language field
                client.set_payload(
                    collection_name=collection_name,
                    payload={"language": language},
                    points=[point_id]
                )
                updated += 1

        print(f"Processed {len(points)} points ({updated} updated so far)")

        # A next_page_offset of None means the last page has been read
        if next_offset is None:
            break

    print(f"Field 'language' added to {updated} points in collection '{collection_name}'")

# Execute the update
add_language_field_to_collection(collection_name)

The detect_language_from_filename() function looks at the file name's extension and returns the matching programming language for recognised extensions, or "unknown" otherwise.
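To see how the extension lookup behaves on less obvious file names, here is a small standalone sketch. It repeats a compact copy of the extension table so it runs on its own, and the sample paths are made up for illustration; they are not taken from my repository.

```python
import os

# Compact copy of the extension table from the script above,
# repeated here so the snippet runs on its own.
LANGUAGE_MAP = {
    "java": "java", "js": "javascript", "jsx": "javascript",
    "ts": "typescript", "tsx": "typescript",
    "md": "markdown", "markdown": "markdown",
    "html": "html", "htm": "html", "xml": "xml", "json": "json",
}


def detect_language_from_filename(filename):
    # os.path.splitext only looks at the last extension of the last path component
    _, ext = os.path.splitext(filename)
    return LANGUAGE_MAP.get(ext.lower().lstrip("."), "unknown")


# Hypothetical paths, chosen to exercise the edge cases
samples = [
    "src/main/java/App.java",   # plain extension, path ignored
    "web/component.test.tsx",   # only the last extension counts
    "README.MD",                # upper-case extension is lowered
    "Dockerfile",               # no extension at all
    ".gitignore",               # a leading dot is not an extension
    "dist/bundle.js.map",       # last extension is "map", not "js"
]

results = {path: detect_language_from_filename(path) for path in samples}
for path, language in results.items():
    print(f"{path} -> {language}")
```

The last three samples all fall through to "unknown", which is worth knowing before filtering on the new field.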

The add_language_field_to_collection() function iterates through all points in the collection, in batches, and adds the language field to the payload of every point that has a file_name.
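The generated script issues one set_payload call per point. A possible refinement, not something I have run against my collection, is to group the point IDs of a batch by detected language and send one update per language. The sketch below shows only the grouping step in plain Python; group_ids_by_language and the (point_id, language) pair shape are names I made up for the illustration.

```python
from collections import defaultdict


def group_ids_by_language(pairs):
    """Group point IDs by language.

    pairs: iterable of (point_id, language) tuples, a hypothetical shape
    standing in for the points of one scroll batch after detection.
    """
    groups = defaultdict(list)
    for point_id, language in pairs:
        groups[language].append(point_id)
    return dict(groups)


# Made-up batch: IDs and languages are illustrative only
batch = [(1, "java"), (2, "html"), (3, "java"), (4, "unknown")]
groups = group_ids_by_language(batch)

for language, ids in groups.items():
    # With a real client, each group would need a single call, e.g.:
    # client.set_payload(
    #     collection_name=collection_name,
    #     payload={"language": language},
    #     points=ids,
    # )
    print(language, ids)
```

Since the script only recognises seven languages plus "unknown", this would cap the number of updates at eight per batch instead of one per point.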

Note: The code just adds the field, and it could be extended to add an index for the new field. I didn't do it because I'm still testing to see how useful the RAG database is in my use case.

I verified the code and tested it on a dummy collection, so it works. 😄

HTH,