
2025 in review

Young Girl Reading a Letter by Candlelight (Jeune fille lisant une lettre à la bougie), Jean-Baptiste Santerre, 1700

Machine learning engineers spend their lives alternating between two states: staring at tqdm progress bars during model training and staring at error logs during model inference.

A third category now involves staring at coding agent CLI progress bars, but using too much AI assistance during coding makes me feel like I’m losing my own context window.

I started a new job as a founding MLE in March and, as is true for engineers in any small and young team, I’ve been staring at error logs all across the stack.

The good news is that if you are at a point where you care a lot about error logs, it means you have users who care a lot about the answer to “What if this breaks?” If you have people other than you who care about your software, congratulations, you are in production. Being in the state of production is the best possible outcome for software engineers because it means the work we’re doing is useful.

Standing up a new thing in production, though, is scary, because production means you are responsible - to the end-user, to the other developers on your team, and, finally, to yourself.

In “The Tombs of Atuan”, Ursula K. Le Guin writes of a girl, Tenar, who is taken at a very young age from her family and goes to live in a holy city. In a ritual ceremony, her name is changed and she loses her identity to become Arha, the Eaten One. Arha is the High Priestess of the Place of the Tombs, a vast series of caverns under the earth, silent, and in complete darkness, “full of gold and the swords of old heroes, and old crowns, and bones, and years, and silence.”

In her first trip to the tombs with an elder priestess, they descend into the dark without a torch.

“Light is forbidden here.” Kossil’s whisper was sharp. Even as she said it, Arha knew it must be so. This was the very home of darkness, the inmost center of the night. Three times her fingers swept across a gap in the complex, rocky blackness. The fourth time she felt for the height and width of the opening, and entered it. Kossil came behind. In this tunnel, which went upward again at a slight slant, they passed an opening on the left, and then at a branching way took the right: all by feel, by groping, in the blindness of the underearth and the silence inside the ground. In such a passageway as this, one must reach out almost constantly to touch both sides of the tunnel, lest one of the openings that must be counted be missed, or the forking of the way go unnoticed.

This is what building a new production system is like. You go into it, like Arha, scared, groping entirely in the darkness.

If you need a cache, you can use a Go map, an LRU cache, or Redis. If you need a vector store, you can use np.array, or Postgres, or Pinecone/Chroma/Turbopuffer, or Elasticsearch. You could run your whole stack on Docker Compose, or a single bare-metal server in your basement, or Kubernetes in the cloud, or on a vape.

Forget large-scale architecture choices. How will you name things? Should you name your method get_results or get_query_results? Will you ever get results that are more than just the answer to a query? Will you ever use this method in a more abstract way? Does it need to be part of a class? What exceptions should you write for it? We want to write code that never lies, but what if, at the time, we are only telling the truth to ourselves?

How soon do we need to ship get_query_results? How many other methods will rely on it? How many of our teammates need this method for their work as well? Will they understand it? Code is meant, largely still and hopefully for the foreseeable future, for humans to read first and machines to compile second.

In “The Bell Jar”, Sylvia Plath laments the endless myriad combinations of a possible life: “I saw my life branching out before me like the green fig tree in the story. From the tip of every branch, like a fat purple fig, a wonderful future beckoned and winked.” By the time she’s done thinking through all these futures, she hasn’t made a decision, and they all slip away from her.

The forking branches of a decision tree in a codebase are likewise boundless, and the neat part is that there is no right answer. You are constrained by your business requirements, but the choice of implementation of those requirements is of an endless variety. It will depend on: the stack you already have, the budget for the rest of the stack, your own past experience and engineering values, the social norms and expectations of your team and their collective experience, the industry’s vocabulary, the language you’re writing in, its conventions and affordances, and how much time you have to actually think about, finish, and merge this current PR before you need to deploy. That’s why staff engineers make money mostly by saying “it depends” for years on end in different ways.

Let’s say your goal is to build a scalable, distributed, microservice architecture that processes inputs and streams machine learning inference outputs to users with minimal downtime, and that the system is resilient to failures via container orchestration. I’ve just described every machine learning system since the beginning of time. This is the machine learning system they show you in Plato’s ML engineering cave, in every single Medium post on the planet; this is the set of boxes connected by dotted lines that arrives unbidden to every engineer in their dreams. This is the light at the end of your labyrinth.

But, in building such a system, you never start by building such a system. You have to start with a single keystroke in the inky blackness. This is one of the reasons it’s so hard to write code with LLMs and evaluate their output, by the way: humans reason from the inside out - “When you write this code, you don’t go from top to bottom, you go from chunks and you grow the chunks…” - and LLMs write code from the top down.

So, instead, you take a deep breath and write a single method to tokenize a string.

def tokenize(string: str) -> list[str]:
    tokens = string.split()
    return tokens

Tokenizing a string is the process of splitting it into sub-parts and converting each of them into a number so they can be processed by a transformer-style model. Tokenization is not specific to LLMs. We’ve been doing tokenization for a long time. Here’s a paper from 1992 on how important it is for preprocessing!
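To make the “converting into a number” half concrete before we get to the splitting, here is a minimal sketch. The vocabulary here is built on the fly purely for illustration; real tokenizers ship a fixed, pre-trained one.

def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign each unique token an integer ID, in sorted order.
    return {token: i for i, token in enumerate(sorted(set(tokens)))}

tokens = "the truest power is the power to choose".split()
vocab = build_vocab(tokens)
print([vocab[token] for token in tokens])

[3, 5, 2, 1, 3, 2, 4, 0]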

The first part of tokenizing any text is splitting it into those sub-parts. Those sub-parts can be words, syllables, or characters. We start with words to make it easier.

sentence = "The truest power is the power to choose. - Ursula K. Le Guin"

def tokenize(string: str) -> list[str]:
    tokens = string.split()
    return tokens

print(tokenize(sentence))

['The', 'truest', 'power', 'is', 'the', 'power', 'to', 'choose.', '-', 'Ursula', 'K.', 'Le', 'Guin']

Easy, and we have something concrete. Do we count “K.” as a word? How about the punctuation, do we generally want to include that for tokenization in a transformer-style model?
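If you decide punctuation should stand on its own, one sketch - a regex split, which is an illustration and not what production tokenizers actually do - peels each punctuation mark off into its own token:

import re

def tokenize_with_punct(string: str) -> list[str]:
    # \w+ grabs runs of word characters; [^\w\s] grabs each
    # punctuation mark as its own token.
    return re.findall(r"\w+|[^\w\s]", string)

print(tokenize_with_punct(sentence))

['The', 'truest', 'power', 'is', 'the', 'power', 'to', 'choose', '.', '-', 'Ursula', 'K', '.', 'Le', 'Guin']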

What do we do with input text that has numbers in it? Do we include them or strip them out?

sentence = "The truest 1 power is the power to choose. - Ursula K. Le Guin"

print(tokenize(sentence))

['The', 'truest', '1', 'power', 'is', 'the', 'power', 'to', 'choose.', '-', 'Ursula', 'K.', 'Le', 'Guin']

What happens when we instead get malformed inputs from a webpage, or if we have delimiters?

sentence = "The<html>truest\npower\nis\nthe\npower\nto\nchoose\nUrsula\nK\nLe\nGuin"
print(tokenize(sentence))

['The<html>truest', 'power', 'is', 'the', 'power', 'to', 'choose', 'Ursula', 'K', 'Le', 'Guin']

What if the input is an empty string? Do we want to send that empty string downstream?

What if we don’t have a string to tokenize? We now get an exception that we have to think about how to handle if we’re in production.

sentence = None
print(tokenize(sentence))

Traceback (most recent call last):
  File "/Users/vicki/arha/tokenize.py", line 9, in < module>
    print(tokenize(sentence))
  File "/Users/vicki/arha/tokenize.py", line 5, in tokenize
    tokens = string.split()
AttributeError: 'NoneType' object has no attribute 'split'
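One way to handle this - a sketch of one option among many, since you could also let the exception propagate and catch it at the service boundary - is a guard clause that fails fast with a clear message:

def tokenize(string: str | None) -> list[str]:
    if string is None:
        # Fail fast here instead of with an opaque AttributeError
        # from deep inside the call stack.
        raise ValueError("tokenize() received None instead of a string")
    return string.split()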

What if we receive emojis? This string representation returns only one token, which may not be correct in representing these two different emotions. And how do we know the model accepts emoji codepoints as input anyway? Many GPT-style models do, but many BERT models don’t.

sentence = "🤪🥰"
print(tokenize(sentence))

['🤪🥰']

What if we send an extremely large string because we don’t have upstream error handling for our tokenization queries? This code block will take a long time to return because you are creating a large object in memory. Since we don’t have logging yet, we won’t know in the rest of our application where that large object creation takes place. We will just have a web service that hangs silently, indefinitely.

num_tokens = 500_000_000
pattern = "The truest power is the power to choose. - Ursula K. Le Guin"
large_string = pattern * num_tokens

print(tokenize(large_string))
hanging forever...
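One defensive option, sketched here with an arbitrary ceiling you would tune for your own workload, is to reject oversized inputs before they ever reach split():

MAX_INPUT_CHARS = 1_000_000  # arbitrary limit, for illustration

def tokenize(string: str) -> list[str]:
    if len(string) > MAX_INPUT_CHARS:
        raise ValueError(
            f"input of {len(string):,} chars exceeds {MAX_INPUT_CHARS:,}"
        )
    return string.split()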

What if our input includes languages other than English? How do we account for Unicode?
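Whitespace splitting quietly assumes the writing system uses spaces. Japanese, for one, doesn’t, so our naive tokenizer hands the whole sentence back as a single “token”:

sentence = "真の力とは選ぶ力である"  # roughly, "true power is the power to choose"
print(tokenize(sentence))

['真の力とは選ぶ力である']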

Sometimes, you’re able to come up with most of these edge cases at once. Mostly, though, you end up building out these edge cases over countless iterations, because software engineering is programming integrated over time.

Eventually, you build up to a codebase that has a level of thought and care around tokenization, punctuation, and input sanitization. You’ve now written a codebase at the level of tokenizers, spaCy, or tiktoken.

In fact, you should probably use the libraries that people have already collectively put dozens of years of thought into - those corridors are closed, safe, and tested, in the case of open-source libraries, by a much larger community than just you. You import tokenizers.

In doing so, you close off a number of paths further into the labyrinth behind you, the doors darkened and abandoned. You light a single flickering torch to mark your path. On top of that code, you start adding the things that make that path more resilient (but also, then, harder to refactor if you decide to pivot): precise exception handling, traces, clear human-legible logging, and unit tests. A single corridor in your application starts to shine. With the path in front of you illuminated, you move forward.

Now that tokenization works, since you need this method throughout your application, you’ll want to create a tokenizer inference service.
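What that service looks like depends on your stack; as one sketch, assuming FastAPI and a route name invented here for illustration, a minimal endpoint might look like:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TokenizeRequest(BaseModel):
    text: str

@app.post("/tokenize")
def tokenize_route(request: TokenizeRequest) -> list[str]:
    # A real service would call the shared tokenizer, not a bare split().
    return request.text.split()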

Within the logic itself, if you’re using an external tokenizer, you’ll need to make sure it matches the model you’re using, because the model depends on a given tokenization scheme. In this case, we are better off relying on AutoTokenizer, which handles the import and comparison logic for us:

from transformers import AutoTokenizer

name_or_path = "arha-based-uncased"
tokenizer = AutoTokenizer.from_pretrained(name_or_path)

If you are downloading the tokenizer from HuggingFace, as AutoTokenizer does unless you specify a custom tokenizer path, you add a network call and an external dependency to the complications of your initial service call.

A new branch blooms on your decision tree. If you make this call in your service, how much latency does it add? On cold-start, can you assume the artifact will always be there? How long does it take to load the tokenizer? To stave off that branch of questions, you could instead download the artifact locally, or write your own tokenizer. Now, you’re in the business of object storage and artifact management.
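A sketch of the local-artifact route, using the save_pretrained/from_pretrained pair that transformers provides (the paths here are placeholders):

from transformers import AutoTokenizer

# At build time: download once and bake the files into your artifact.
tokenizer = AutoTokenizer.from_pretrained("arha-based-uncased")
tokenizer.save_pretrained("./artifacts/tokenizer")

# At runtime: load from disk, with no network call to HuggingFace.
tokenizer = AutoTokenizer.from_pretrained("./artifacts/tokenizer")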

You decide to tackle another question: if you’re calling the tokenizer again and again, you’ll need to create a method, or better yet, a class, because it’s highly likely you’ll want other functionality attached to the tokenizer.
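A sketch of what that class might look like - the name and interface here are illustrative, not prescriptive:

from transformers import AutoTokenizer

class TokenizerService:
    """Wraps the tokenizer so every caller shares one loaded instance."""

    def __init__(self, name_or_path: str):
        # Loading is expensive; do it once, at construction.
        self.tokenizer = AutoTokenizer.from_pretrained(name_or_path)

    def tokenize(self, text: str) -> list[str]:
        return self.tokenizer.tokenize(text)

    def encode(self, text: str) -> list[int]:
        return self.tokenizer.encode(text)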

And now, if you have a service, you need to Dockerize the code. If you have a Docker image, you need to pin dependencies for reproducibility. If you are pinning dependencies and have Docker, you need a build process for your deployable artifact.

The closer and closer to production you get, the more questions start to form around “what happens if this breaks for other people?” - other people meaning my team and my users.

You write the service and add logs. Lots of logs. So many logs. Logs are your first line of defense in production. They are the candle in the darkness. If they’re good enough for Brian Kernighan, who wrote, “The most effective debugging tool is still careful thought, coupled with judiciously placed print statements,” they’re good enough for you.
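A sketch of what judicious placement might look like with Python’s standard logging module:

import logging

logger = logging.getLogger(__name__)

def tokenize(string: str) -> list[str]:
    logger.debug("tokenizing input of %d chars", len(string))
    tokens = string.split()
    if not tokens:
        # The empty case is exactly the one you'll want to see at 2:17 a.m.
        logger.warning("tokenize() produced zero tokens")
    return tokens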

You need to register the service with your observability platform. You add exception handling. So many Exceptions. What if the tokenizer doesn’t load? What if it loads, but to the wrong directory? What if HuggingFace is down? What if the tokenizer service can’t talk to other services? What happens if the new container doesn’t launch? What happens if the tokenizer artifact is corrupted? What happens if you’re serving the tokenizer, but the intra-service network fails? Exceptions and logging, logging and exceptions, unit tests and integration tests, layer upon layer of defenses.
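For the “what if the tokenizer doesn’t load” branch, one hedged sketch is to catch the failure at startup, log it, and fail loudly rather than limp along serving requests with no tokenizer:

import logging

from transformers import AutoTokenizer, PreTrainedTokenizerBase

logger = logging.getLogger(__name__)

def load_tokenizer(name_or_path: str) -> PreTrainedTokenizerBase:
    try:
        return AutoTokenizer.from_pretrained(name_or_path)
    except OSError as e:
        # Raised when the artifact is missing or corrupted, or when
        # HuggingFace is unreachable and there is no local cache.
        logger.exception("failed to load tokenizer %s", name_or_path)
        raise RuntimeError(f"tokenizer {name_or_path} is unavailable") from e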

Beej, who previously wrote an incredible guide to git, recently released a guide to learning computer science, in which he says that in order to be a good computer scientist, you need to think like a villain.

Expect the unexpected in terms of data that your code will receive. Expect malicious actors to feed in data in an attempt to gain unauthorized access or manipulate the system in undesirable ways. Test for that stuff in your code.

You think like a villain, because the villain in the production environment is usually yourself. With each new door comes the possibility of falling through into a bottomless pit.

For me, this usually happens either immediately after I deploy, or at 2:17 in the morning a week from now. The tokenization method I wrote or imported fails. Or there’s a memory leak. Or there’s a network outage that causes a cascading failure.

I add more logs. I debug. I run tests. I develop locally. I add more tests. I consult and I fix, and, finally, I am back on the golden path. The alerts subside. I add more torches. The system becomes stronger, more legible from me having struggled with it. My reasoning about the system becomes stronger and clearer, too. We grow together, the system in production and me.

There is one comforting thought about being with the system: when you enter the Tombs, you are not actually truly alone. Software development is a team sport, and when we build a system as a team, we build it together, PR by PR.

And we are also building with others: we build on the libraries and tools that others before us have built and hardened. In spite of the fact that an increasing amount of code is machine-assisted, software development is still mostly humans talking to each other in a language that computers also understand. As we build, alone in the dark with our torches in the Tombs, we hear the echoes of others in other hallways, each carefully closing off dead ends, placing their own torches.

Once you illuminate enough of the system by reasoning through failure points and architectural decisions, the hallway is no longer unlimited darkness. It is an illuminated system in production. You keep walking around the app and hardening the system. You start to get more sleep.

At the end of the year, I made it out of the Tombs.

Just in time to go back to staring at tqdm.

#development #engineering #error handling #tokenization #ml