✰Vicki Boykis✰

Querying 3 billion vectors

Recently, I got nerd-sniped by this exchange between Jeff Dean and someone trying to query 3 billion vectors. I was curious to see if I could implement the optimal map-reduce solution he alludes to in his reply.

image

A vector is a list/array of floating point numbers of n dimensions, where n is the length of the list. The reason you might perform vector search is to find words or items that are semantically similar to each other, a common pattern in search, recommendations, and generative retrieval applications like Cursor which heavily leverage embeddings.

I started by writing an extremely naive implementation which made the following assumptions:

import numpy as np
from loguru import logger
import time
import os

# start with 3_000 vectors to keep things small 
total_vectors_num = 3_000_000_000
query_vectors_num = 1_000

def generate_random_vectors(num_vectors:int)-> np.array:
    logger.info(f"Generating {num_vectors} vectors...")
    rng = np.random.default_rng() 
    vectors = rng.random((num_vectors, 768)) 
    return vectors

def get_dot_products(vectors_file:np.array, query_vectors:np.array) -> list[np.array]:
    total_products_computed = 0

    dot_products = []
    for v in vectors_file:
        for qv in query_vectors:
            dot_product = v @ qv
            dot_products.append(dot_product)
            total_products_computed += 1
            if total_products_computed % 100000 == 0:
                 logger.info(f"Total vectors processed:{total_products_computed}")

    return dot_products


# Generate initial vectors and query vectors and write to disk
doc_vectors = generate_random_vectors(total_vectors_num)
query_vectors = generate_random_vectors(query_vectors_num)
np.save('vectors.npy', doc_vectors)

# Load vectors from disk 
logger.info("Loading file from disk...")
vectors_file = np.load('vectors.npy')

start_time = time.time()
logger.info("Getting dot products...")
results =  get_dot_products(vectors_file, query_vectors)
end_time = time.time()

logger.info(f"Execution time: {end_time - start_time:.4f} seconds")
logger.info(f"Number of dot products computed: {len(results)}")

This, predictably, didn’t do so great, even on my M2 Macbook, even at 3,000 vectors, one million times less than 3 billion embeddings, taking 2 seconds.

Results:

2025-12-13 17:53:25.675 | INFO     | __main__:generate_random_vectors:9 - Generating 3000 vectors...
2025-12-13 17:53:25.691 | INFO     | __main__:generate_random_vectors:9 - Generating 1000 vectors...
2025-12-13 17:53:25.698 | INFO     | __main__:<module>:39 - Loading file from disk...
2025-12-13 17:53:25.700 | INFO     | __main__:<module>:43 - Getting dot products...
2025-12-13 17:53:27.688 | INFO     | __main__:get_dot_products:24 - Total vectors processed:3000000
2025-12-13 17:53:27.688 | INFO     | __main__:<module>:47 - Execution time: 1.9877 seconds
2025-12-13 17:53:27.688 | INFO     | __main__:<module>:48 - Number of dot products computed: 3000000

So I vectorized the numpy operation, which made things much faster.

import numpy as np
from loguru import logger
import time

total_vectors_num = 3_000
query_vectors_num = 1_000

def generate_random_vectors(num_vectors:int)-> np.array:
    logger.info(f"Generating {num_vectors} vectors...")
    rng = np.random.default_rng() 
    vectors = rng.random((num_vectors, 768))
    return vectors

def get_dot_products_vectorized(vectors_file:np.array, query_vectors:np.array):
    dot_products = vectors_file @ query_vectors.T
    return dot_products.flatten() # collapse into single dim

# Generate initial vectors and query vectors and write to disk
ram_vectors = generate_random_vectors(total_vectors_num)
query_vectors = generate_random_vectors(query_vectors_num)
np.save('vectors.npy', ram_vectors)

# Load vectors from disk 
logger.info("Loading file from disk...")
vectors_file = np.load('vectors.npy')

start_time = time.time()
logger.info("Getting dot products...")
results =  get_dot_products_vectorized(vectors_file, query_vectors)
end_time = time.time()

logger.info(f"Execution time: {end_time - start_time:.4f} seconds")
logger.info(f"Number of dot products computed: {len(results)}")

At .017 seconds, this was a big improvement!

Results:

2025-12-13 17:52:52.810 | INFO     | __main__:generate_random_vectors:9 - Generating 3000 vectors...
2025-12-13 17:52:52.831 | INFO     | __main__:generate_random_vectors:9 - Generating 1000 vectors...
2025-12-13 17:52:52.874 | INFO     | __main__:<module>:39 - Loading file from disk...
2025-12-13 17:52:52.876 | INFO     | __main__:<module>:43 - Getting dot products...
2025-12-13 17:52:52.887 | INFO     | __main__:<module>:47 - Execution time: 0.0107 seconds
2025-12-13 17:52:52.887 | INFO     | __main__:<module>:48 - Number of dot products computed: 3000000

I tried a 3 million sample size with this improvement. This took 12 seconds.

2025-12-13 19:39:43.830 | INFO     | __main__:generate_random_vectors:12 - Generating 3000000 vectors...
2025-12-13 19:39:57.509 | INFO     | __main__:generate_random_vectors:12 - Generating 1000 vectors...
2025-12-13 19:39:58.978 | INFO     | __main__:<module>:57 - Loading file from disk...
2025-12-13 19:40:00.131 | INFO     | __main__:<module>:61 - Getting dot products...
2025-12-13 19:40:12.984 | INFO     | __main__:<module>:65 - Execution time: 12.8491 seconds
2025-12-13 19:40:12.992 | INFO     | __main__:<module>:66 - Number of dot products computed: 3000000000

We could also reduce even further by converting the data to float32:

doc_vectors = generate_random_vectors(total_vectors_num).astype(np.float32)
query_vectors = generate_random_vectors(query_vectors_num).astype(np.float32)
2025-12-13 18:13:52.152 | INFO     | __main__:generate_random_vectors:10 - Generating 3000 vectors...
2025-12-13 18:13:52.168 | INFO     | __main__:generate_random_vectors:10 - Generating 1000 vectors...
2025-12-13 18:13:52.176 | INFO     | __main__:<module>:55 - Loading file from disk...
2025-12-13 18:13:52.178 | INFO     | __main__:<module>:59 - Getting dot products...
2025-12-13 18:13:52.182 | INFO     | __main__:<module>:63 - Execution time: 0.0045 seconds
2025-12-13 18:13:52.182 | INFO     | __main__:<module>:64 - Number of dot products computed: 3000000

With these small improvements, we’ve already sped up inference to ~13 seconds for 3 million vectors, which means for 3 billion, it would take 1000x longer, or ~3216 minutes.

|approach    | query_vectors | doc_vectors   | time     |
|----------- |---------------|---------------|----------|
| Naive      | 1,000         | 3,000         | 1.9877s  |
| Vectorized | 1,000         | 3,000         | 0.0107s  |
| Vectorized | 1,000         | 3,000,000     | 12.8491s |
| Np.Float32 | 1,000         | 3,0000        | 0.0045s  |

When we start to run it to test, however, we run into a different problem: OOM. Why? The amount of memory needed to process 3 billion objects, each as float32 object that’s 4 bytes in size, would be 8 million GB.


vectors = rng.random((1, 768)).astype(np.float32)
print(vectors.nbytes)
3072
print(vectors.itemsize)
4

bytes_per_float32 = 4
memory_gb = (3000000000 * 1000 * 768 * bytes_per_float32) / (1024**3)
8583068.84765625 = 8.6 TB

In order to improve this, we would need to do some heavy lifting of the kind Jeff Dean prescribed. First, we could to change the code to use generators and batch the comparison operations. We could write every n operations to disk, either directly or through memory mapping. Or, we could use system-level optimized code calls - we could rewrite the code in Rust or C, or use a library like SimSIMD explicitly made for similarity comparisons between vectors at scale.

Before I started on any further optimizations, upon further inspection, there were some things about the problem that I realized weren’t clear to me: 3 billion vector embeddings queried a few thousand times could mean:

All of these dictate the additional time and resources spent on the solution. What I realized is the same thing I’ve seen so many of these problems over the years, that the technical solution is no longer the hardest one to achieve: the hardest one is nailing down the requirements.

#jeff dean #engineering #error handling #embeddings #vectors #ambiguity