Bitbucket - Scan a repository

Following the previous post, we need to get the list of files in the repository. This is similar to listing the contents of a directory, except that it cannot be done recursively. A call to the list files API returns a list containing both the files and the directories of a single directory.
Get files from the repository
The Bitbucket API can list the content of a folder, and it returns both files and folders. I wrote a function that performs the query and returns a dict with the files and the folders in two separate lists:
def get_all_files_in_dir_from_bitbucket(project_key, repo_slug, branch, dir=""):
    """
    Lists the files and subdirectories of one directory on a specific branch
    of a Bitbucket (Data Center) repository.
    :param project_key: The project key
    :param repo_slug: The repository slug
    :param branch: The branch name
    :param dir: The directory to list, relative to the repository root
    :return: Dictionary with "files" and "dirs" lists, or None on error
    """
    url = f"{BITBUCKET_URL}/{BITBUCKET_API_1_0}/projects/{project_key}/repos/{repo_slug}/browse/{dir}?at=refs/heads/{branch}"
    files = get_all_children(url)
    if files is None:
        print("Error: No files found")
        return None
    file_results = []
    directory_results = []
    for file in files:
        if file.get("type") == "FILE":
            file_results.append("".join([dir, file['path']['toString']]))
        elif file.get("type") == "DIRECTORY":
            # Keep a trailing slash so child names can be appended directly
            fulldir = "".join([dir, file['path']['toString'], '/'])
            directory_results.append(fulldir)
    return {"files": file_results, "dirs": directory_results}
The final result will have the "files" entry containing the list of files, with paths relative to the root URL. The "dirs" entry will contain all subdirectories. I return both.
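To make the shape of that result concrete, here is a minimal sketch that applies the same FILE/DIRECTORY split to a hand-written list of entries. The sample entries only mimic the two fields the function above reads (type and path.toString); they are an assumption for illustration, not a captured API response, and split_children is a hypothetical helper name.

```python
def split_children(children, parent_dir=""):
    """Split directory entries into file paths and subdirectory paths."""
    files, dirs = [], []
    for entry in children:
        name = entry["path"]["toString"]
        if entry.get("type") == "FILE":
            files.append(parent_dir + name)
        elif entry.get("type") == "DIRECTORY":
            # Trailing slash, so the path can be used as the next parent_dir
            dirs.append(parent_dir + name + "/")
    return {"files": files, "dirs": dirs}


# Hand-written sample entries (assumed shape, for illustration only)
sample_children = [
    {"type": "FILE", "path": {"toString": "README.md"}},
    {"type": "DIRECTORY", "path": {"toString": "src"}},
    {"type": "FILE", "path": {"toString": "setup.py"}},
]

result = split_children(sample_children, parent_dir="/")
print(result)
```

Each path in "dirs" can be passed back in as the next directory to list, which is what the traversal relies on.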
Now we need to run this repeatedly until we have traversed the whole directory tree:
def get_all_files_from_bitbucket(project, repo, branch: str = "develop"):
    files = []
    dirs = ['/']
    main_branch = calculate_main_branch(project, repo, branch)
    if main_branch is None:
        return None
    while len(dirs) > 0:
        processed_dir = dirs.pop()
        print(f"processing: {processed_dir} (files: {len(files)}, dirs:{len(dirs)})...")
        r = get_all_files_in_dir_from_bitbucket(project, repo, main_branch, processed_dir)
        if r is None:
            # The listing failed for this directory; skip it and carry on
            continue
        files += r['files']
        dirs += r['dirs']
    return files
Instead of recursing through the tree, I use a stack-based approach: each directory popped from the stack is listed, and its subdirectories are pushed back until the stack is empty. The result is a list of files (strings), where each entry is a path relative to the base URL.
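The stack-based traversal can be tried without a Bitbucket server. The sketch below swaps the API call for a lookup in a small in-memory tree; FAKE_TREE, list_dir and walk are hypothetical names invented for this illustration.

```python
# Hypothetical in-memory tree standing in for the Bitbucket API:
# each directory path maps to the files and subdirectories it contains.
FAKE_TREE = {
    "/": {"files": ["/README.md"], "dirs": ["/src/", "/docs/"]},
    "/src/": {"files": ["/src/main.py"], "dirs": ["/src/utils/"]},
    "/src/utils/": {"files": ["/src/utils/helpers.py"], "dirs": []},
    "/docs/": {"files": ["/docs/index.md"], "dirs": []},
}


def list_dir(path):
    """Stand-in for get_all_files_in_dir_from_bitbucket()."""
    return FAKE_TREE.get(path)


def walk(root="/"):
    files = []
    dirs = [root]  # the stack of directories still to visit
    while dirs:
        current = dirs.pop()
        listing = list_dir(current)
        if listing is None:  # unknown directory: skip it
            continue
        files += listing["files"]
        dirs += listing["dirs"]
    return files


all_files = walk()
print(sorted(all_files))
```

Because pop() takes the most recently added directory, the walk is depth-first. The order of the resulting list depends on that, but the set of files found does not.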
Get a file's content
Now that we have all the files, the API call to get a file's content from a repository is quite straightforward:
import requests

def get_file_from_bitbucket(project_key, repo_slug, branch, file_path):
    """
    Retrieves a file from a specific branch in a Bitbucket Data Center repository.
    :param project_key: The project key
    :param repo_slug: The repository slug
    :param branch: The branch name
    :param file_path: The path to the file in the repository
    :return: Content of the file or an error message
    """
    url = f"{BITBUCKET_URL}/{BITBUCKET_API_1_0}/projects/{project_key}/repos/{repo_slug}/raw/{file_path}?at=refs/heads/{branch}"
    response = requests.get(url, headers=HEADERS)
    if response.status_code == 200:
        return response.text  # Assuming the file is a text-based file
    else:
        return f"Error {response.status_code}: {response.text}"
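When we call the function with the right parameters, we get the (text) content as a string. Note that on failure the function also returns a string (the error message), so the caller cannot tell the two cases apart by type alone.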
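One detail the function does not handle is a file path containing spaces or other characters that are not URL-safe. A possible way to deal with that, sketched here under the assumption that the raw endpoint accepts percent-encoded paths, is to quote the path before building the URL. build_raw_url is a hypothetical helper and BASE_URL is a placeholder, not the real value of BITBUCKET_URL or BITBUCKET_API_1_0.

```python
from urllib.parse import quote

# Placeholder base URL (assumption, for illustration only)
BASE_URL = "https://bitbucket.example.com/rest/api/1.0"


def build_raw_url(project_key, repo_slug, branch, file_path):
    """Build the raw-file URL with the path and branch percent-encoded."""
    # Drop a leading slash to avoid "raw//..." and keep "/" as a separator
    safe_path = quote(file_path.lstrip("/"), safe="/")
    safe_branch = quote(branch, safe="/")
    return (
        f"{BASE_URL}/projects/{project_key}/repos/{repo_slug}"
        f"/raw/{safe_path}?at=refs/heads/{safe_branch}"
    )


url = build_raw_url("PROJ", "my-repo", "develop", "/docs/user guide.md")
print(url)
```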
We need to call the function for each of the files in the list returned by the get_all_files_from_bitbucket() function. The code can look something like:
files = get_all_files_from_bitbucket(PROJECT_NAME, repo, main_branch)
for filename in files:
    # IGNORE_EXTENSIONS must be a tuple of suffixes for str.endswith()
    if filename.lower().endswith(IGNORE_EXTENSIONS):
        print(f"Skipping {filename} because of extension")
    else:
        print(f"Indexing {filename}:")
        content = get_file_from_bitbucket(PROJECT_NAME, repo, main_branch, filename)
        # Do something with the content
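The extension filter relies on str.endswith() accepting either a single string or a tuple of strings; a list would raise a TypeError. The value of IGNORE_EXTENSIONS is not shown in this post, so the tuple below is an assumed example, and should_skip is a hypothetical helper name.

```python
# Assumed example value; the real list of extensions is not shown in the post
IGNORE_EXTENSIONS = (".png", ".jpg", ".zip", ".jar")


def should_skip(filename):
    """Return True when the file name ends with an ignored extension."""
    # lower() makes the check case-insensitive (".PNG" matches ".png")
    return filename.lower().endswith(IGNORE_EXTENSIONS)


candidates = ["/README.md", "/img/Logo.PNG", "/lib/app.jar", "/src/main.py"]
to_index = [name for name in candidates if not should_skip(name)]
print(to_index)
```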
In the next blog entry we'll see how to use lancedb to index the content.
HTH,