Bitbucket - Scan a repository

Following on from the previous post, we need to get the list of files in the repository. This is similar to listing the contents of a directory, except that it cannot be done recursively. A call to the list files API returns a list containing both the files and the directories of a single directory.

Get files from the repository

The Bitbucket API can list the content of a folder, and it returns both files and folders. I wrote a function that performs the query and returns a dict with separate lists for files and folders:

def get_all_files_in_dir_from_bitbucket(project_key, repo_slug, branch, dir=""):
    """
    Lists the files and directories directly under a directory on a specific
    branch of a Bitbucket (Data Center) repository.

    :param project_key: The project key
    :param repo_slug: The repository slug
    :param branch: The branch name
    :param dir: The directory to list, relative to the repository root
    :return: Dict with "files" and "dirs" lists of paths, or None on error
    """
    url = f"{BITBUCKET_URL}/{BITBUCKET_API_1_0}/projects/{project_key}/repos/{repo_slug}/browse/{dir}?at=refs/heads/{branch}"
    files = get_all_children(url)
    
    if files is None:
        print("Error: No files found")
        return None

    file_results = []
    directory_results = []
    
    for file in files:
        if file.get("type") == "FILE":
            file_results.append("".join([dir,file['path']['toString']]))
        elif file.get("type") == "DIRECTORY":
            fulldir = "".join([dir, file['path']['toString'], '/'])
            directory_results.append(fulldir)
    
    return {"files": file_results, "dirs": directory_results}

In the result, the files key holds the list of files, with paths relative to the root URL, and the dirs key holds all subdirectories. I return both, because the directories are needed to continue the traversal.
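
To make the shape of that result concrete, here is a minimal sketch of the same file/directory split, run against a hand-written sample instead of a live server. The helper name split_children and the sample entries are assumptions for illustration; they only mimic the type and path.toString fields that the function above reads.

```python
def split_children(children, directory=""):
    """Split browse entries into file paths and directory paths."""
    files, dirs = [], []
    for child in children:
        path = directory + child["path"]["toString"]
        if child.get("type") == "FILE":
            files.append(path)
        elif child.get("type") == "DIRECTORY":
            # Trailing slash so the next level can be appended directly.
            dirs.append(path + "/")
        # Entries of any other type are ignored, as in the function above.
    return {"files": files, "dirs": dirs}


# Hypothetical sample, shaped like the entries the loop iterates over.
sample = [
    {"type": "FILE", "path": {"toString": "README.md"}},
    {"type": "DIRECTORY", "path": {"toString": "src"}},
    {"type": "SUBMODULE", "path": {"toString": "vendor"}},
]

result = split_children(sample, "docs/")
print(result)
```

Because the directory prefix is concatenated as-is, the caller is responsible for passing it with a trailing slash, which is why each entry in dirs is stored with one.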

Now we need to run this repeatedly until we have traversed the whole directory tree:

def get_all_files_from_bitbucket(project, repo, branch: str = "develop"):
    files = []
    dirs = ['/']

    main_branch = calculate_main_branch(project, repo, branch)
    
    if main_branch is None:
        return None

    while dirs:
        processed_dir = dirs.pop()
        print(f"processing: {processed_dir} (files: {len(files)}, dirs: {len(dirs)})...")
        r = get_all_files_in_dir_from_bitbucket(project, repo, main_branch, processed_dir)
        if r is None:
            # The listing failed for this directory; skip it and carry on.
            continue
        files += r['files']
        dirs += r['dirs']

    return files

Instead of recursing through the tree, I use a stack-based approach to consume all directories. The result is a list of files (strings), where each entry has the relative path to the base URL.
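
The stack-based traversal can be tried without a Bitbucket server. The sketch below replaces the API call with a lookup in a small in-memory tree. FAKE_REPO, list_dir and walk are made-up names for this illustration only, not part of the code in this post.

```python
# Hypothetical in-memory tree: directory path -> (files, subdirectories).
FAKE_REPO = {
    "/": (["README.md"], ["src", "docs"]),
    "/src/": (["main.py"], ["utils"]),
    "/src/utils/": (["helpers.py"], []),
    "/docs/": (["index.md"], []),
}


def list_dir(directory):
    """Stand-in for one API call: full paths of files and subdirectories."""
    files, subdirs = FAKE_REPO[directory]
    return {
        "files": [directory + name for name in files],
        "dirs": [directory + name + "/" for name in subdirs],
    }


def walk(root="/"):
    """Collect every file path below root using an explicit stack."""
    files = []
    pending = [root]
    while pending:
        current = pending.pop()  # last in, first out: depth-first order
        listing = list_dir(current)
        files += listing["files"]
        pending += listing["dirs"]
    return files


all_files = walk()
print(all_files)
```

Each directory is listed exactly once, and the loop ends when the stack is empty, so there is no recursion depth to worry about on deeply nested repositories.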

Get a file content

Now that we have all the files, the API call to get a file's content from the repository is quite straightforward:

def get_file_from_bitbucket(project_key, repo_slug, branch, file_path):
    """
    Retrieves a file from a specific branch in a Bitbucket Data Center repository.
    
    :param project_key: The project key
    :param repo_slug: The repository slug
    :param branch: The branch name
    :param file_path: The path to the file in the repository
    :return: Content of the file or an error message
    """
    url = f"{BITBUCKET_URL}/{BITBUCKET_API_1_0}/projects/{project_key}/repos/{repo_slug}/raw/{file_path}?at=refs/heads/{branch}"
    response = requests.get(url, headers=HEADERS)
    
    if response.status_code == 200:
        return response.text  # Assuming the file is a text-based file
    else:
        return f"Error {response.status_code}: {response.text}"

When we call the function with the right parameters, we get the file's text content as a string. On any status other than 200, it returns an error message instead.
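
One thing to watch is that file paths and branch names can contain spaces or slashes. A possible way to handle this, sketched here as an assumption rather than something the function above does, is to percent-encode both parts before building the URL. The helper name build_raw_url, the host name and the "rest/api/1.0" value are placeholders for illustration.

```python
from urllib.parse import quote


def build_raw_url(base_url, api_path, project_key, repo_slug, branch, file_path):
    """Build a raw-file URL with the path and the ref percent-encoded."""
    # lstrip avoids a double slash when file_path starts with "/".
    encoded_path = quote(file_path.lstrip("/"), safe="/")
    # safe="" also encodes "/" so the ref stays a single query value.
    encoded_ref = quote(f"refs/heads/{branch}", safe="")
    return (
        f"{base_url}/{api_path}/projects/{project_key}/repos/{repo_slug}"
        f"/raw/{encoded_path}?at={encoded_ref}"
    )


url = build_raw_url(
    "https://bitbucket.example.com",  # placeholder host
    "rest/api/1.0",                   # assumed value of BITBUCKET_API_1_0
    "PROJ", "my-repo", "feature/x", "/docs/user guide.md",
)
print(url)
```

The resulting string can be passed to requests.get() in place of the f-string used above.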

We need to call this function for each file in the list returned by get_all_files_from_bitbucket(). The code can look something like this:

files = get_all_files_from_bitbucket(PROJECT_NAME, repo, main_branch)

for filename in files:
    if filename.lower().endswith(IGNORE_EXTENSIONS):
        print(f"Skipping {filename} because of extension")
    else:
        print(f"Indexing {filename}:")
        content = get_file_from_bitbucket(PROJECT_NAME, repo, main_branch, filename)
        # Do something with the content
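
The extension check relies on str.endswith() accepting a tuple of suffixes, so IGNORE_EXTENSIONS has to be a tuple (a list raises a TypeError). The post does not show its definition, so the values below are only a guess at what it might contain, and should_skip is a made-up helper name.

```python
# Hypothetical extension list; the real one is not shown in this post.
IGNORE_EXTENSIONS = (".png", ".jpg", ".zip", ".jar")


def should_skip(filename):
    """True when the file name ends with one of the ignored extensions."""
    # lower() makes the comparison case-insensitive, so ".PNG" also matches.
    return filename.lower().endswith(IGNORE_EXTENSIONS)


names = ["/docs/Logo.PNG", "/src/main.py", "/lib/tool.jar"]
to_index = [name for name in names if not should_skip(name)]
print(to_index)
```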

In the next blog entry we'll see how to use lancedb to index the content.

HTH,