Using ChromaDB to Generate Related Blog Content

I generate my blog with Jekyll. I’ve always loved the simplicity and flexibility that Jekyll provides. Jekyll allows us to store data as yaml in _data and retrieve it later. I wanted to copy Simon Willison and generate related content. I don’t have anywhere near the number of articles but it’s the exercise that matters.

Unlike Simon, I really want to use existing tools to do the embedding. Simon is working on quite a number of Python packages. I’m going to start by using ChromaDB to do the heavy-lifting. ChromaDB describes itself as “the AI-native open-source embedding database”.

It’s written in Python, which is pretty standard for ML tools. This is a minor downside for us, because Jekyll is Ruby based, but given that we’re just going to store this in _data anyway, I can just run it on my machine before pushing for now – I’ll just have to remember.

First we install the dependencies:

pip install chromadb
pip install pyyaml

Then create a python script to run the embedding and save the results to a file.

import os
import yaml
import chromadb
from chromadb.utils import embedding_functions

chroma_client = chromadb.Client()
default_ef = embedding_functions.DefaultEmbeddingFunction()

collection = chroma_client.create_collection(name="vertis-io", embedding_function=default_ef, metadata={ "hnsw:space": "cosine" })

folder_name = "."
allowed_paths = ["_posts"]
markdown_files = []

for root, dirs, files in os.walk(folder_name):
    if any(allowed_path in root for allowed_path in allowed_paths):
        for file in files:
            if file.endswith('.md'):
                file_path = os.path.join(root, file)
                markdown_files.append(file_path)


for file_path in markdown_files:
    try:
        with open(file_path, "r", encoding="utf-8") as o:
            content = o.read()  # Changed from readlines to read
            file_name = os.path.basename(file_path)
            collection.add(
                documents=[content],  # Directly using read content
                metadatas=[{"source": file_name}],
                ids=[file_name]  # Removed f-string as it's unnecessary
            )
    except UnicodeDecodeError:
        # some files are not utf-8 encoded; let's ignore them for now.
        pass

related_content = {}

for file_path in markdown_files:
    try:
        file_name = os.path.basename(file_path)
        doc = collection.get(ids=[file_name], include=["embeddings", "metadatas"])
        if doc:
            doc_embeddings = doc['embeddings']
            nearest_matches = collection.query(
                query_embeddings=doc_embeddings,
                n_results=5,
                where={"source": {"$ne": file_name}}
            )
            filtered_ids = [id for id, distance in zip(nearest_matches['ids'], nearest_matches['distances'][0]) if distance < 0.5]
            related_content[file_name] = filtered_ids
    except UnicodeDecodeError:
        # some files are not utf-8 encoded; let's ignore them for now.
        pass

with open('_data/related_content.yml', 'w', encoding='utf-8') as file:
    yaml.dump(related_content, file, allow_unicode=True)

From there it’s a simple matter to use the data in _data/related_content.yml to generate the similar articles.

{% assign related_content = site.data.related_content[page.name] %}
{% if related_content.size > 0 %}
    <section class="mt-12">
        <h2 class="text-2xl font-bold">Related Content</h2>
        <ul class="mt-4 list-disc">
            {% for post_filename in related_content %}
                {% assign post = site.posts | where: "name", post_filename | first %}
                {% if post %}
                    <li>
                        <a class="text-blue-500 hover:underline" href="{{ post.url }}">{{ post.title }}</a>
                    </li>
                {% endif %}
            {% endfor %}
        </ul>
    </section>
{% endif %}

Styling as appropriate.