Intro to ColBERT

January 2024

Do you have a query? Do you have documents that might answer it? Let's focus on the question of our times: given a query, how can we fetch the most relevant documents?

Vanilla RAG

Usually, a dense retrieval RAG pipeline goes like:

  1. Embed your query
  2. Go to your store of document vectors & find the top-k documents whose embeddings are closest to your query (ie. most semantically similar)
  3. Feed these k documents to your LLM as context
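As a sketch, step 2 amounts to a nearest-neighbor search over pre-computed vectors. The embeddings below are made-up placeholders (real ones come from an embedding model), but the retrieval logic is the same:

```python
import numpy as np

# Hypothetical pre-computed document embeddings, one vector per document.
doc_vectors = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.2],
    [0.2, 0.2, 0.9],
])
query_vector = np.array([0.8, 0.2, 0.1])

def top_k(query, docs, k=2):
    # Cosine similarity between the query and every document vector.
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    # Indices of the k most similar documents, best first.
    return np.argsort(sims)[::-1][:k]

print(top_k(query_vector, doc_vectors))  # -> [0 2]
```

In production this brute-force scan is replaced by an approximate nearest-neighbor index, but the scoring idea is identical.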

Note: Different RAG pipelines are variations on each of these steps.

All is well, right? Not quite. A single document vector has to compress everything into one point, so it trades away local, token-level detail to capture global context.

Take this example: a dense embedding will encode 'I love to run' & 'I hate to run' pretty close to each other, even though they mean opposite things.

ColBERT

ColBERT solves this; it introduces the following powerful ideas:

  1. Instead of one vector per document, embed every token of the query & the document separately (with BERT), so a passage becomes a bag of contextualized token vectors.
  2. Late interaction: keep query & document encodings independent until query time, when they interact only through cheap token-level comparisons.

The query-doc similarity score is computed as follows: for each query token embedding q_i, take its maximum dot product with any document token embedding d_j (the "MaxSim"), then sum these maxima over all query tokens:

  score(q, d) = Σ_i max_j (q_i · d_j)
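The MaxSim computation is simple enough to sketch in a few lines of NumPy. The toy embeddings here are placeholders; in real ColBERT they are ~128-dim vectors produced by BERT:

```python
import numpy as np

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style MaxSim: for each query token embedding, take the
    best (max) dot product against all document token embeddings,
    then sum those maxima over the query tokens."""
    # (num_query_tokens, num_doc_tokens) matrix of dot products
    sim_matrix = query_tokens @ doc_tokens.T
    return sim_matrix.max(axis=1).sum()

# Toy token embeddings: 2 query tokens, 3 document tokens.
query = np.array([[1.0, 0.0], [0.0, 1.0]])
doc   = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])

print(maxsim_score(query, doc))  # 0.9 + 0.8 ≈ 1.7
```

Note the asymmetry: every query token gets to find its best match anywhere in the document, which is what lets rare keywords punch through.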

In the end, you get embeddings that capture both keyword information as well as some context at the level of a few tokens. You can then ask a query in the style of the responses you want.

[Figure: ColBERT late interaction architecture, showing a Question Encoder and a Passage Encoder joined by MaxSim operations]

Figure 1: The late interaction architecture. Diagram from Khattab et al. (2021b).

ColBERT v2 Improvements

Since the original ColBERT paper in 2020, ColBERT v2 has come out. The key improvement is a more efficient encoding of the token vectors, which follows from this observation:

When ColBERT embeds a document, it doesn't represent it as a single whole but as the sum of its parts. This fundamentally changes the model's training into a much easier task: it doesn't need to cram every possible meaning into a single vector; it only needs to capture the meaning of a few tokens at a time.
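Because token embeddings carry only local meaning, similar tokens across documents land near shared "centroids", and ColBERTv2 exploits this: each token vector is stored as a centroid id plus a small quantized residual. Here is a rough sketch of that idea (not ColBERTv2's exact scheme, which uses learned centroids and 1-2 bit residuals):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in token embeddings (ColBERTv2 works with ~128-dim BERT outputs).
embeddings = rng.normal(size=(1000, 8)).astype(np.float32)

# Pretend we already ran k-means; grab a few rows as "centroids".
centroids = embeddings[rng.choice(1000, size=16, replace=False)]

def compress(vec, centroids, n_bits=8):
    # Store each vector as: nearest-centroid id + coarsely quantized residual.
    cid = np.argmin(np.linalg.norm(centroids - vec, axis=1))
    residual = vec - centroids[cid]
    # Uniform scalar quantization of the residual components.
    scale = np.abs(residual).max() / (2 ** (n_bits - 1) - 1) or 1.0
    quantized = np.round(residual / scale).astype(np.int8)
    return cid, quantized, scale

def decompress(cid, quantized, scale, centroids):
    return centroids[cid] + quantized.astype(np.float32) * scale

cid, q, s = compress(embeddings[0], centroids)
approx = decompress(cid, q, s, centroids)
# The reconstruction stays close to the original vector,
# while storing mostly small integers instead of full floats.
print(np.max(np.abs(approx - embeddings[0])))
```

The payoff: instead of a full float vector per token, you keep an id into a small codebook plus a handful of low-precision integers, which is what makes storing per-token embeddings practical at scale.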

Example: Learning about the Industrial Revolution

Let's use the Wikipedia page on the Industrial Revolution as our corpus. When we ask ColBERT: "how did the economy change after the industrial revolution?", we get highly relevant passages about GDP growth, standard of living improvements, and economic transformation.

More Complex Query

Let's make the query more complicated: "From where did British iron manufacturers use considerable amounts of iron?"

ColBERT returns the text "Up to that time, British iron manufacturers had used considerable amounts of iron imported from Sweden and Russia to supplement domestic supplies" as the highest-ranked result, while dense embedding retrieval did not rank it even in the top 5. ColBERT can do both semantic and keyword matching!
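You can see why with a contrived toy: use one-hot word vectors as a stand-in for token embeddings (real ColBERT uses learned BERT vectors, which is what gives it the semantic half). Per-token MaxSim rewards the document containing the query's keywords, while a mean-pooled single vector can be pulled toward a document that merely shares generic words:

```python
import numpy as np

def one_hot_embed(tokens, vocab):
    # Toy stand-in for a token encoder: each word becomes a one-hot vector.
    vecs = np.zeros((len(tokens), len(vocab)))
    for i, t in enumerate(tokens):
        vecs[i, vocab[t]] = 1.0
    return vecs

query = "where did manufacturers import iron".split()
doc_a = "manufacturers import iron from sweden and russia".split()
doc_b = "where did where did it all go".split()

vocab = {w: i for i, w in enumerate(sorted(set(query + doc_a + doc_b)))}
q, a, b = (one_hot_embed(t, vocab) for t in (query, doc_a, doc_b))

def maxsim(q, d):
    # ColBERT: per-query-token max dot product, summed over query tokens.
    return (q @ d.T).max(axis=1).sum()

def pooled_cosine(q, d):
    # Vanilla dense retrieval stand-in: mean-pool tokens into one vector.
    qv, dv = q.mean(axis=0), d.mean(axis=0)
    return qv @ dv / (np.linalg.norm(qv) * np.linalg.norm(dv))

print(maxsim(q, a), maxsim(q, b))                 # 3.0 2.0 -> keyword doc wins
print(pooled_cosine(q, a) > pooled_cosine(q, b))  # False   -> generic doc wins
```

MaxSim ranks the keyword-bearing document first; the pooled vectors prefer the document that repeats common filler words. The example is rigged to make the point, but it mirrors the failure mode above.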

Parting Thoughts

I'm not asking you to trust me blindly here; this is a tiny corpus & does not reflect the tradeoffs you'll make in a production system with cost, latency, size & snippet quality.

Having good benchmarks for your use case will help you understand which system is best for you, but I think you should at least give ColBERT a serious look!

Resources