<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Tech Blog on ✰Vicki Boykis✰</title><link>https://vickiboykis.com/</link><description>Recent content in Tech Blog on ✰Vicki Boykis✰</description><generator>Hugo</generator><language>en-US</language><copyright>Copyright © 2026, Vicki Boykis.</copyright><lastBuildDate>Mon, 20 Apr 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://vickiboykis.com/index.xml" rel="self" type="application/rss+xml"/><item><title>Build yourself flowers</title><link>https://vickiboykis.com/2026/04/20/build-yourself-flowers/</link><pubDate>Mon, 20 Apr 2026 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2026/04/20/build-yourself-flowers/</guid><description>&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_2.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_1.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;div style="background-color: rgba(128, 128, 128, 0.1); padding: 1em; border-radius: 0.5em; ">
This is an edited transcript of the keynote I gave at the &lt;a href="https://appliedml.us/2026/">Applied Machine Learning Conference&lt;/a> in Charlottesville, VA in April 2026.
&lt;p>I first wrote a draft of this talk by hand. This part took 2 months.&lt;/p>
&lt;p>I then recorded myself giving a version of this talk with &lt;a href="https://goodsnooze.gumroad.com/l/macwhisper">MacWhisper&lt;/a>, and transcribed it with Whisper locally. This part took 45 minutes (the total time of my practice run.)&lt;/p>
&lt;p>Then, I ran it through Gemini Flash 2.5 &lt;a href="https://pi.dev/">running in Pi&lt;/a> to break into paragraphs. I also had Gemini break up my slide deck from a PDF I generated from Google Slides, into individual images to insert into the blog post and optimize the image format to webp for blog rendering. This part took about 10 minutes.&lt;/p>
&lt;p>Then I went through the text paragraph by paragraph manually to make corrections, remove redundant phrasing and pauses, and added clarifications to make it more legible than a talk. This part took 3 hours.&lt;/p>
&lt;p>Before all that, I generated the content for this talk. This part took 13 years.&lt;/p>
&lt;/div>
&lt;p>I&amp;rsquo;m Vicki, and I build machine learning systems.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_2.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_2.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>I debated for a long time how to introduce myself. Am I a data scientist? Am I still a machine learning engineer? Am I an AI engineer now? I&amp;rsquo;m not really sure. I think, like a lot of people over the past six months in the industry, I&amp;rsquo;ve been having existential angst. So, I&amp;rsquo;ll go with &amp;ldquo;I build machine learning systems.&amp;rdquo;&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_3.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_3.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>I&amp;rsquo;ve built and broken systems at &lt;a href="https://vickiboykis.com/2022/07/25/looking-back-at-two-years-at-automattic-and-tumblr/">Tumblr&lt;/a>, at Automattic, at Duo, at Mozilla.ai. Now, I build realtime personalization and search systems at Malachyte.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_4.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_4.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>We&amp;rsquo;ve had a lot of different conversations in all of the corners of the internet about how we should incorporate LLMs into machine learning workflows. And the question that came to me is, if all we&amp;rsquo;re doing is generative AI, where does traditional machine learning even fit in here? Are we still doing machine learning at all? So, the larger question behind my existential angst is, what is the state of machine learning engineering as an industry today? That is - is it still worth doing machine learning engineering?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_5.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_5.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And the second question that came to me was, not only is it still worth doing ML, but, in an era where we&amp;rsquo;re having LLMs generate a lot of code, when the most important thing is for us to ship quickly, to get to a prototype quickly, is it still worth doing machine learning well?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_6.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_6.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>So hopefully we&amp;rsquo;ll be able to definitively solve all of my existential crises in 45 minutes, but until then, I wanted to take a step back and talk about flowers.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_7.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_7.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>In 2024, the lovely folks at &lt;a href="https://amsterdam.pydata.org/">PyData Amsterdam&lt;/a>, invited me to give a keynote. I called it &lt;a href="https://vickiboykis.com/2023/09/13/build-and-keep-your-context-window/">Build and Keep Your Context Window&lt;/a>. ChatGPT was still fairly new, out for just about a year and a half at the time, and everyone was freaking out. I talked about how important it is to build and keep your own context window - the historical understanding of software tools and concepts, so that we can use the new tools and understand where they fit into the context of existing software engineering.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_8.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_8.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>The analogy I used was the ChatGPT UI. At the time, everybody was using the text interface, where you have a sidebar and you have the big text box. All your chat sessions live in the sidebar, and the sessions have titles. And I talked about that as being part of a context window.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_9.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_9.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>The context window is the amount of text that a model can recall at any given time, as measured in tokens. If the text you input overflows the context window, the model loses the thread of the conversation and can&amp;rsquo;t reply coherently anymore. What determines the context window in reply is the attention mechanism in the Transformers model. And the attention mechanism is really a cache, which is really a hash map. So if you understand all these four things and how they fit together as concepts that naturally emerge over time successively, it&amp;rsquo;s not perhaps as surprising that we get the emergence of LLMs.&lt;/p>
&lt;p>Before I gave this talk, I got to spend a few days in Amsterdam (I was told they were the only two sunny days in the Netherlands in September) and particularly in the &lt;a href="https://www.rijksmuseum.nl/en">Rijksmuseum&lt;/a>, which is the national Dutch museum of art.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_10.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_10.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>It&amp;rsquo;s absolutely enormous and beautiful. You could spend days there. But I only had a few hours, so of course I went to see the headline piece, which is the Night Watch by Rembrandt. When I went to see it, it was covered by this big plastic window and there was machinery set up all around it. I was thinking, what is this? Why is this blocking the painting?&lt;/p>
&lt;p>When I went later to PyData Amsterdam, Robert Erdmann, who was working at the Rijksmuseum, gave &lt;a href="https://www.youtube.com/watch?v=kMfl5SzfkVc">an incredible talk&lt;/a> about how he was using deep learning-based ink-removal, especially 3D imaging, pixel detection, and high resolution photography to see details in art at the museum that nobody was able to see before.&lt;/p>
&lt;p>So for example, for The Night Watch, they set up all this equipment to be able to photograph it many times in many resolutions. And as a result, they were able to get high resolution pictures like the figure on the left, who is believed to be actually Rembrandt, the artist, in a cameo. And if you really zoom in, further, further, you can see in his eye there&amp;rsquo;s a little white sliver.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_11.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_11.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And so the question is, how do you become good enough that you know if you put just a little bit of a white dot, it will render as an iris of an eye to somebody? And so those are the kinds of questions, around the mystery of mastery, that he was now able to explore as a result of this photography.&lt;/p>
&lt;h2 id="rachel-ruysch">Rachel Ruysch&lt;/h2>
&lt;p>One of the artists whose paintings are also on display at the Rijksmuseum is Rachel Ruysch, one of the premier flower painters during the Dutch Golden Age, and that&amp;rsquo;s who I really want to talk about. Being a premier painter of the Dutch Golden Age was a huge accomplishment, and I want to understand more about how she works, because maybe we can use her as a guide to navigate this crazy time where we&amp;rsquo;re doing a lot of code generation with LLMs.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_12.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_12.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>So who was she? Rachel lived in Amsterdam, and all she did was paint flowers. Lots and lots of different compositions of flowers in all sorts of shapes and sizes, starting from when she was 15, well into her 80s.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_13.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_13.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>Let&amp;rsquo;s take a step back and talk about the Dutch Golden Age for a minute. What made this era of creativity possible? In the late 1500s, the Dutch won independence from the Spanish government. The political stability was followed by economic stability and the rise of the Dutch empire. As Dutch life got significantly easier and more people joined the growing middle class, they wanted to buy art. In response, art, became seen as an important aspect of the Dutch national character, and art which had previously been mostly religious in theme, became secular. People started painting scenes from real life, science, and nature. All of this resulted in millions of paintings, many of them scientific or representative of everyday life.&lt;/p>
&lt;p>What people wanted to see were pictures of everyday life that were accurate, that were true to what was going on around them, of local scenes, of scientific discovery. Of themes of religion and mortality, but conveyed in a different way.&lt;/p>
&lt;p>So they wanted to paint stuff like gentlemen smoking and playing backgammon in a tavern. They wanted to paint stuff like people skating in the winter.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_14.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_14.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>They wanted to paint stuff like people getting rowdy in their house while the mistress was asleep.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_15.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_15.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And if you look at all these paintings, you can see something come through, which is that the colors are down to earth. They&amp;rsquo;re very normcore.&lt;/p>
&lt;p>Another thing people wanted was a lot of different paintings of flowers, hundreds and thousands of paintings of flowers.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_16.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_16.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>Out of the five to ten million paintings that we think were created during the Dutch Golden Age, thousands of them were flowers. Why flowers? People wanted flowers in their homes that wouldn’t wilt over the long Dutch winters, and that showcased how worldly and interesting the homeowners were. Exotic and unusual flowers in particular (such as tulips of the famous Dutch tulip mania) were prized, and flower painters sought to accommodate this trend. There were many very famous Dutch flower painters and hundreds to thousands more, all producing and focusing on flowers. For artists, painting flowers accurately and lifelike was an important test of mastery.&lt;/p>
&lt;p>And so Rachel Ruysch came into her skills at a time when people demanded this. And she was really good at it. And so the question I had, was that, if you have all these people painting flowers, why is it important to paint good flowers? Why do you want to stand out?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_17.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_17.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And the same question of, in a world where it&amp;rsquo;s easy and fast to write code, why is technical excellence still important?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_18.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_18.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>As I was researching this, I came across a &lt;a href="https://ntrs.nasa.gov/api/citations/20130000445/downloads/20130000445.pdf">report from NASA&lt;/a>&lt;/p>
&lt;p>NASA has been on my mind lately with the Artemis II mission and their orbit of the moon and their triumphant return home. Because spaceflight is so hard and so enormously error-prone NASA has been continuously on a loop to improve quality, because it&amp;rsquo;s already very dangerous to be an astronaut at NASA. I&amp;rsquo;d seen somewhere that the Artemis crew knew they had a 1 in 30 chance of dying.&lt;/p>
&lt;p>But NASA is always looking at risk, and they did an investigation in 2012 and came out with this report of what happens when things fail. Why do they fail? &lt;a href="https://vickiboykis.com/2026/04/05/nasa-elements-of-engineering-excellence/">And they came up with these five problems.&lt;/a> And the one that really struck out to me was the first one, shifting engineering excellence to insight and oversight.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_19.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_19.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>So there were periods when NASA outsourced a lot of their engineering to subcontractors. And what happened was that the engineers who understood the design and the system design based on actual experience left. And the shift that resulted eliminated independent analysis and testing and understanding of the systems. And this was what led to degradation of engineering quality.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_20.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_20.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And if it sounds familiar, this is because it&amp;rsquo;s a very easy analogy for what happens when we outsource our work as engineers and data scientists and machine learning engineers entirely to AI.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_21.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_21.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>So, striving for technical excellence in industry saves us from engineering, product, and cost problems. Quality is important.&lt;/p>
&lt;p>The second reason is that software engineering is a craft, just like painting.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_22.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_22.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And it takes time to get good at painting. It takes time to understand where to put the little white iris. It takes time to get good at painting flowers. And so it&amp;rsquo;s important to understand that because it takes time to get to quality.&lt;/p>
&lt;p>And finally, technical excellence is important because striving for mastery is just about what it means to be human.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_23.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_23.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>&lt;a href="https://vickiboykis.com/2025/10/20/i-want-to-see-the-claw/">We like when people do things well.&lt;/a> We appreciate people who are competent. We like it when systems are designed and built well. It feels good for us to be part of systems that work. One tangential idea that I&amp;rsquo;ve seen going around the internet is that you shouldn&amp;rsquo;t talk down to LLMs, or call them &amp;ldquo;stupid&amp;rdquo;. Not because they understand, they&amp;rsquo;re just models, but because it&amp;rsquo;s bad for us as humans. In that same way, it&amp;rsquo;s important for us to strive for good things as humans.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_24.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_24.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>Okay, so if we want to paint good flowers, how do we become good at painting good flowers? What made Rachel technically excellent? What might we be able to replicate?&lt;/p>
&lt;p>The first one is that she was in an artistic family. So she was in one of the premier families of Amsterdam where all her great uncles were painters. Her grandfather was involved in art. Her father was a great artist and botanist, Frederick Ruysch. You can see him here with this creepy skull. He collected a lot of really creepy dead things, which is where she learned about scientific accuracy, embalming, preserving flowers, etc.&lt;/p>
&lt;p>It was here that she also learned discipline: how to capture minute details, She was known for her technical skill, and for gathering flowers in botanic gardens and pressing them so she could have different compositions for different seasons, surprising the Dutch, who were generally orderly and used to seeing only seasonal bouquets. During her life, her paintings sold for 3x Rembrandt’s.&lt;/p>
&lt;p>Then when she was 15, she was apprenticed to Willem van Aelst, who was one of the premier painters in Amsterdam at the time. It was tradition at the time for masters to take on apprentices who would live with them and start with stuff like washing their paintbrushes and work all the way up to painting, and then finally produce a work of art which, if the master accepted, the apprentice could become a master in their own right. So she had a very big support system and she was steeped in a culture of mentorship being extremely important.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_25.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_25.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>Second, she painted her whole life. She was keenly interested in botany, in science, and made sure that her art reflected the true appearance of nature. She spent a long time - over 60 years, honing and working on her career, not even slowing down to have ten children. For her, her career was a vocation. She got inspiration from many different places, and she would do the same thing, again and again, but differently.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_26.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_26.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And finally, she remixed and experimented. So you can see in the background of this particular painting, there&amp;rsquo;s a cactus. And it&amp;rsquo;s not easy to get a cactus in Amsterdam in the summer or the winter. So she got this from the New World because she had connections to Amsterdam botanical societies and she knew people who were coming and going all the time. She bartered for and preserved exotic flowers. And she also remixed flowers from different seasons by drying them. She collected them all the time. She was always changing something about how she was working.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_27.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_27.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>So, given all of this, how can we, as engineers, be like Rachel?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_28.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_28.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;h2 id="building-rijksearch">Building Rijksearch&lt;/h2>
&lt;p>I thought about all this and decided to see how I could apply these ideas to a concrete side project that I was working on. And the side project is called Rijksearch, a semantic search engine.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_29.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_29.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>A semantic search engine searches by intent rather than keyword. So what you do is you type in &amp;ldquo;chill dude&amp;rdquo; and it&amp;rsquo;ll return you this portrait of, for example, a man with a dancing dog. Which I think is actually a really good example result. Or like a &amp;ldquo;confidential chat&amp;rdquo; or &amp;ldquo;man smoking pipe&amp;rdquo;, the last one for sure being a chill dude. So it&amp;rsquo;s a search engine that works by intent.&lt;/p>
&lt;p>Or, you type in &amp;ldquo;pensive&amp;rdquo; and it will return a philosopher in his study, which is definitely pensive, a boy with a golf club probably not, you don&amp;rsquo;t want toddlers to start thinking too hard about what they can do with a golf club. But like, ok, these results are not bad!&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_30.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_30.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Or &amp;ldquo;flower party&amp;rdquo;, which didn&amp;rsquo;t work that well because you can see it&amp;rsquo;s doing more keyword matching for &amp;ldquo;flower&amp;rdquo; than semantic matching for &amp;ldquo;party&amp;rdquo;, but this is where the craft of tuning semantic search comes into play.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_31.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_31.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>There are already a lot of &lt;a href="https://data.rijksmuseum.nl/">wonderful tools&lt;/a> that the Rijksmuseum itself has that you can build with on public data and that they&amp;rsquo;ve built with like &lt;a href="https://www.rijksmuseum.nl/en/collection/art-explorer">Art Explorer&lt;/a>.&lt;/p>
&lt;p>But I decided to build my own because I wanted to remix concepts. Rachel learned how to remix as an apprentice and from the artists she talked to.
I learned it from working at Tumblr where the reblog feature was one of the original places on the social web that implemented the functionality of you posting something and somebody else riffing on top of it.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_34.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_34.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>So what was I thinking about remixing? Oh, I had a lot of stuff going in my head. First, over the past couple years, I&amp;rsquo;ve been thinking about embeddings a lot. In fact, I&amp;rsquo;m thinking about embeddings right now.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_35.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_35.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>In search and recommendations, embeddings - vector representations of text or images (or other media) - matter a lot. They also matter within LLMs themselves because positional embeddings are how we decide what concepts are related to each other when we train the models. I wrote a really long text about them, examining their place in ML systems, called &lt;a href="https://github.com/veekaybee/what_are_embeddings/">What Are Embeddings?&lt;/a>, and I spent a lot of time understanding how they fit into the context of our workflows.&lt;/p>
&lt;p>Second, I&amp;rsquo;ve also been working with Redis for a while now. It&amp;rsquo;s one of &lt;a href="https://vickiboykis.com/2024/04/16/redis-is-forked/">my favorite pieces of software&lt;/a> in the world because it just works. It&amp;rsquo;s extremely cleanly designed. It&amp;rsquo;s really fast. It&amp;rsquo;s extremely reliable. It is just very elegantly put together. So I had worked with Redis for search before, and I wanted to see if I could use it again. At the time, about a year back, Salvatore Sanfilippo, antirez, who initially wrote Redis, picked up working on it again and created a new data structure called &lt;a href="https://antirez.com/news/149">vector sets&lt;/a>.&lt;/p>
&lt;p>Vector sets in Redis work much like sorted sets, which are collections of unique, non-repeating strings sorted by a score. In the case of vector sets, the string elements are associated with a vector. For anyone working with embeddings, this unlocks a huge amount of very fast, efficient vector operations, and they allow for vector similarity lookups using HNSW, a popular and efficient approximate nearest neighbor search algorithm.&lt;/p>
&lt;p>Third, I&amp;rsquo;ve also been hacking around &lt;a href="https://vickiboykis.com/2025/01/23/you-can-just-hack-on-atproto/">AtProto&lt;/a>, which is the protocol that powers Bluesky, a social network that has become popular in the past couple years. It turns out that a lot of the Bluesky codebases are open, and when I started looking around, it turned out that a lot of Bluesky services were written in Go. (And TypeScript, but I scoped myself.)&lt;/p>
&lt;p>Fourth, I&amp;rsquo;ve worked in Python (and Java and Scala, unfortunately), before, and I was curious to explore a new language that was meant for high concurrency and throughput and was the spiritual successor to Java. I did a couple projects with atproto where I created a feed of content in Go, and wanted to see what else it could do.&lt;/p>
&lt;p>Of course, all the other stuff swirling around in my head was LLMs.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_36.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_36.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>It&amp;rsquo;s impossible not to hear about them. I wish I could block the news out because it does feel like it distracts me enormously in my day-to-day life. What I&amp;rsquo;m particularly interested in with LLMs is the power of &lt;a href="https://vickiboykis.com/2025/07/16/my-favorite-use-case-for-ai-is-writing-logs/">small, local models&lt;/a>. So when &lt;code>llama.cpp&lt;/code> came out and you could use models directly, or with apps like LM Studio to &lt;a href="https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/">serve GGUF artifacts&lt;/a>, I became interested in that.&lt;/p>
&lt;p>I also became interested in hosted embedding APIs because before I&amp;rsquo;d been working with hosted models that you either train or pre-train models that you have to serve. And &lt;a href="https://vickiboykis.com/2025/09/01/how-big-are-our-embeddings-now-and-why/">OpenAI made those accessible as API endpoints&lt;/a>, and Gemini started serving new multimodal embedding models, so I became interested in those.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_37.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_37.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And then finally, the humanities. A lot of my projects revolve partially around the humanities. A lot of the stuff that I write ties back in some way to the humanities because I don&amp;rsquo;t think you can really have a balanced view of what developing a good product looks like if you&amp;rsquo;re just looking at tech. And, I like art. So I&amp;rsquo;m always down to look at and use art.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_38.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_38.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;h2 id="build-the-same-thing-again-and-again">Build the same thing again and again&lt;/h2>
&lt;p>All of this stuff percolated for a year or so in my mind, and the result was that, at some point, I realized I wanted to re-implement Viberary, which is a semantic search engine.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_39.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_39.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>A quick note on search engines. They come in several different forms. There&amp;rsquo;s keyword search, which is if you type in Apple, you get back results for Apple varieties. And there&amp;rsquo;s semantic search, which is you type in Apple and you get back iPhone, and maybe also Android, depending on what you&amp;rsquo;re looking for, which is why the &amp;ldquo;chill dude&amp;rdquo; query works.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_32.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_32.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>And there are many different ways to implement semantic search. The way that I had implemented it before, one of the ways, &lt;a href="https://vickiboykis.com/2024/01/05/retro-on-viberary/">was in Viberary.&lt;/a>&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_40.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_40.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>Viberary was a side project that I&amp;rsquo;d done in 2023 that was based on text-only representation of books from Goodreads and all their reviews. And the idea was that you type in a phrase like sci-fi, funny sci-fi, or moody drama, and you get back a list of books that are recommended based on vibe rather than the books themselves, which is something that I always wanted.&lt;/p>
&lt;p>I&amp;rsquo;m not always in the mood for a particular title, but I&amp;rsquo;m sometimes in the mood for a particular type of book. And so you can see for this one, funny Sci-Fi Space with Feeling, one of the results you get back is Asimov, which I guess, yes, kind of, and Haruki Murakami, which kind of also makes sense? So these vibes kind of fit.&lt;/p>
&lt;p>What allows us to do this is that we&amp;rsquo;re not comparing just text or images. We project them into the embedding space by creating dense vectors out of them and comparing the vectors, the numerical representations.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_41.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_41.webp"
 alt="image" width="400px">&lt;/a>
&lt;/figure>

&lt;p>That way, we can compare &amp;ldquo;dog&amp;rdquo; and &amp;ldquo;bird&amp;rdquo; and &amp;ldquo;fly&amp;rdquo; and see that &amp;ldquo;bird&amp;rdquo; and &amp;ldquo;fly&amp;rdquo; are closer to each other because they&amp;rsquo;re semantically similar. And this is also essentially the way the transformer attention mechanism also works on a much, much, much larger scale.&lt;/p>
&lt;p>It used to be the case that people only learned or fine-tune their own embeddings, but now many of the major AI providers have their own hosted sets of embeddings. For example, Google &lt;a href="https://arxiv.org/abs/2503.07891">recently released embeddings for Gemini&lt;/a>, with both API and local versions being available.&lt;/p>
&lt;h2 id="two-tower-architectures">Two Tower Architectures&lt;/h2>
&lt;p>Building semantic search involves implementing a two tower architecture, a really common paradigm in both search and recommendations, and one that I&amp;rsquo;ve reached for again and again. The two tower architecture and semantic search are for me what flowers were for Rachel.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_42.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_42.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>There are generally three steps to creating a two-tower model.&lt;/p>
&lt;ol>
&lt;li>Embed the items once, in batch.&lt;/li>
&lt;li>Embed your query in realtime at inference.&lt;/li>
&lt;li>Compare the distance between the two.&lt;/li>
&lt;/ol>
&lt;p>So when I search for &amp;ldquo;chill dude&amp;rdquo; it generates an embedding for that phrase and then we do a similarity computation, usually using cosine similarity or dot product, or there are other distance measures you can look at, that says &amp;ldquo;this embedding is most similar to the embeddings and then it surfaces those results just as we do here.&lt;/p>
&lt;p>But there are just so many decisions to make in this process that only become apparent once you start looking at the constraints of the data and the engineering space you&amp;rsquo;re working on.&lt;/p>
&lt;p>This is a very simple structure but it&amp;rsquo;s very elegant because it allows for fast indexing and fast lookup on a large, coarse dataset. When we do any kind of information retrieval whether it&amp;rsquo;s for recommendations or for search, our goal is to get from tens of millions of items the entire internet a whole catalog worth of 800, 000 paintings like the Rijksmuseum has down to tens that you can look at in a carousel or in a screen or in a list of results.&lt;/p>
&lt;p>Two towers is one way that allows us to pre-compute the embedding for our entire catalog and then it allows us to filter them quickly using approximate nearest neighbor search, which is a fast algorithm which says this thing is closer to this group of things but not this other group of things.&lt;/p>
&lt;p>That allows us to do a &lt;a href="https://arxiv.org/abs/2007.16122">first pass from tens of millions&lt;/a> to tens of thousands or hundreds, and then we can use more expensive models to rank the smaller result sets and really think about how we would present those last, smaller result sets, to the user.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_43.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_43.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Many large companies then expand on this architecture to create search and recommendations data sources for different product surfaces.&lt;/p>
&lt;p>Here&amp;rsquo;s a couple examples. For example, the one on the left is &lt;a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/45530.pdf">YouTube&amp;rsquo;s recommender system&lt;/a> where they take millions of videos that they could recommend you and narrow them down to a few hundred and then rank the ones that are actually on display.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_44.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_44.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>The input features for their custom trained model include &amp;ldquo;videos you&amp;rsquo;ve watched before&amp;rdquo; and then &amp;ldquo;videos you&amp;rsquo;ve searched for&amp;rdquo;, &amp;ldquo;geography, &amp;quot; and then metadata about you. And the query is the video in question. And then the model similarity lookup generates a list of candidates to that video based on all this input data.&lt;/p>
&lt;p>&lt;a href="https://arxiv.org/abs/2006.11632">Facebook Search&lt;/a> also uses embeddings.
It contains the search text plus the query context, like geography or language. The document side includes Facebook names or page titles, and graph-derived signals like relationships to other people.&lt;/p>
&lt;p>So, there are many variants of these systems in production. How does this look in practice for Rijksearch in particular? We have our input query, which comes from the front end as happy flowers. The embeddings are from the Gemini API. And then we also already have all our document metadata that we&amp;rsquo;ve embedded before. The indexing app is in Go, writing into Redis vector sets, and the serving web app is written in pure Go. Redis is served as a docker container that connects to the web app with everything served in docker-compose.&lt;/p>
&lt;p>It’s not a particularly original idea: there are now a lot of semantic image/vibe search engines out there today. But it’s the one that I personally wanted to do because I’d developed a connection to the Rijksmuseum, and because the amount of data they generously make public made it lovely to work with.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_45.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_45.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;h2 id="getting-data-for-rijksearch">Getting data for Rijksearch&lt;/h2>
&lt;p>Something that you&amp;rsquo;ll notice is missing here if you&amp;rsquo;ve worked in data before is the actual source data. Where do we actually get this image data? How do we collect it? And that&amp;rsquo;s its own hard problem that we&amp;rsquo;ll talk about in a minute.&lt;/p>
&lt;p>But okay, so how do we build this, given that we&amp;rsquo;ve already built Viberary and other versions of this system before? The way that I thought about it is that every company gets about three innovation tokens. This is from &lt;a href="https://mcfunley.com/choose-boring-technology">an amazing and often-cited blog post&lt;/a> by Dan McKinley, which says that every company gets three of these tokens.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_46.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_46.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>You can spend those however you want, but then the supply is fixed. That&amp;rsquo;s also true for any project, whether it&amp;rsquo;s at work or a side project. For a specific project, a way to think about innovation tokens is that you can try new things in several dimensions:&lt;/p>
&lt;ul>
&lt;li>The domain area&lt;/li>
&lt;li>The language stack&lt;/li>
&lt;li>The app architecture&lt;/li>
&lt;li>How/where you deploy it&lt;/li>
&lt;/ul>
&lt;p>If you do three new things that you can&amp;rsquo;t do or don&amp;rsquo;t know how to do, you&amp;rsquo;re just going to get held up with your product. So where should we spend our innovation tokens?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_47.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_47.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>We should spend them on what helps us grow, because humans want to reach mastery. So we should spend them on what helps us reach mastery.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_48.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_48.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Taking a look at the same steps for Viberary and Rijksearch, you can see what I implemented differently. So there&amp;rsquo;s a couple steps. First, we have to collect the data. Then we have to think about what embedding model we want to use and how we want to embed it. Then we need to index our embeddings. This is an important step. And then we need to serve our embeddings and create a web server so we can actually stand up the front end and return results.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_49.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_49.webp"
 alt="image" width="600px">&lt;/a>
&lt;/figure>

&lt;p>So the first time I created Viberary by using text-only data from Goodreads, a locally-hosted embeddings model artifact, indexing using &lt;code>FT.SEARCH&lt;/code> from Redis&amp;rsquo;s search , and Python&amp;rsquo;s Flask to serve the app.&lt;/p>
&lt;p>So I lied earlier about using 3 innovation tokens: I cheated a bit and used four here.&lt;/p>
&lt;p>For Rijksearch data, I used multimodal text and image data because I was interested in creating text and image embeddings. This made processing the data trickier because I needed to associate a given set of text metadata to a link of an image, as well as process the images in an efficient way so they&amp;rsquo;d download quickly and non-disruptively.&lt;/p>
&lt;p>For the actual embedding inference, I changed from a hosted model to an API, which was easier in some ways, but now required cost and latency considerations. Additionally, test and image processing together was trickier because you have to think about &lt;a href="https://ai.google.dev/gemini-api/docs/embeddings#embedding-aggregation">text and image aggregation together.&lt;/a>.&lt;/p>
&lt;p>For indexing, I switched to vector sets, though since vector sets don&amp;rsquo;t require the construction of a search index and are a native structure. This makes it easier in that you don&amp;rsquo;t need to update fields, but vector sets also don&amp;rsquo;t support metadata updates in the same way that &lt;a href="https://redis.io/docs/latest/develop/ai/search-and-query/">search indices do&lt;/a> because vector sets are schema-free (though you can add JSON through &lt;a href="https://redis.io/docs/latest/commands/vsetattr/">VSETATTR&lt;/a> ), so there is a tradeoff.&lt;/p>
&lt;p>For serving, I changed from Flask to Go, which was a pretty big change because I had to relearn web development in a new language. Go has a lot of tooling for web development out of the box, as well as support for concurrency, and for processing web data. This was probably the innovation token that took me the longest time to overcome.&lt;/p>
&lt;p>And, interestingly throughout this process, I also tracked the steps where AI helped. The yellow parts are where I included AI or I had AI assistance.&lt;/p>
&lt;p>I consider myself a fairly normcore AI user. I&amp;rsquo;ll usually wait a couple months for something to come on the market to try it. If something has been on the market for 6 months and has lasted, it probably makes sense to keep. And so in 2023, what was on the market was MidJourney for generating some of my logos, ChatGPT for answering some questions, and then some of the open models. Mistral was making good open models at the time and I was using it locally via llama-cpp-python, though getting limited results beyond summarization and tagging.&lt;/p>
&lt;p>When I did Rijksearch, I found several new use cases for AI. For example, I used Claude, as well as &lt;a href="https://huggingface.co/openai/gpt-oss-20b">GPT-OSS 20B&lt;/a> locally to help me figure out some of the paradigms for data collection from the Rijksmuseum, as well as write some quick data analysis scripts. I used Gemini APIs for embeddings and Perplexity to help me decide which embeddings to use. And I used Claude Code to generate some of the front-end code, and Qwen Coder and GPT-OSS and Perplexity to help me learn about Go.&lt;/p>
&lt;h2 id="data-sense">Data Sense&lt;/h2>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_51.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_51.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>For Rijksearch, data collection was also a lot of work, even more than scaffolding the app. But as ML and data professionals, what helps us here is the same thing that helped Rachel. She had help and mentorship from her family. As data practitioners, what we have from doing this for a while, and being surrounded by data-intensive environments, is data sense. We have been in the trenches of collecting data for a long time. We understand the pain points. We understand NaNs and what loopholes to look for in non-deterministic data flows, such as the ones generated by LLMs. We understand what it means to &amp;ldquo;look at the data.&amp;rdquo;&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_52.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_52.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>And this is something that &lt;a href="https://www.cs.unc.edu/~stotts/COMP590-059-f24/robsrules.html">Rob Pike, one of the creators of Go, said&lt;/a>. If you&amp;rsquo;ve chosen the right data structures and organized things well, the algorithms will almost always become self-evident around the data. Data structures, not algorithms, are essential to programming. And this was really important for me because good similarity matching between images and text requires good data for both.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_53.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_53.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Luckily, the Rijksmuseum has a lot of data, and it&amp;rsquo;s all open and accessible to the public. It&amp;rsquo;s an amazing resource.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_54.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_54.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>They have metadata for 800, 000 objects, high-resolution photographs. But something that was new to me in working with this data was working with &lt;a href="https://www.openarchives.org/pmh/">OAI-PMH&lt;/a>, the data format they offer their artifacts in. It&amp;rsquo;s available through an API, but the API is very different from what I&amp;rsquo;ve generally seen. OAI-PMH is an XML-based protocol over HTTP, which is different from most of the JSON-based REST APIs I&amp;rsquo;ve worked with.&lt;/p>
&lt;p>LLMs helped me quite a bit here because I needed to understand, for example, &lt;code>ListSet&lt;/code> records I needed. And, since this was open data it was easy to traverse the API results and get an understanding of what lives in that data set.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_55.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_55.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>And it turns out there&amp;rsquo;s a lot of data to figure out.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_56.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_56.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>First, I pulled all the paintings we needed from the record files. The data’s stored across multiple collections, so I chose to use the “Dutch Paintings of the Seventeenth Century in the Rijksmuseum” set because that collection had the most visually striking paintings. From there, I extracted each entry, making sure to keep the full metadata (title, artist, date, medium, etc.) instead of just a short snippet. This gave me a clean, complete list to work with for the rest of the project.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_57.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_57.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>How do we actually collect that data into our app? Here&amp;rsquo;s where Go really helps. Go structs are typed collection fields, much like Pydantic models, and they really allow you to do this well. If you have an XML snippet, you can either run it through an LLM to get results, or use a site like &lt;a href="https://xml-to-go.github.io/">XML to Go&lt;/a>, no code generation needed. But, of course, LLMs do really well with Go because it is so strongly typed.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_58.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_58.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>It really takes a while to wrangle all the data into the correct format, then, from collecting it as XML, to writing it to disk in Redis.&lt;/p>
&lt;p>Then the fun starts, because you collect these records and you start to see that not every piece of data is labeled correctly. This is true of any given dataset, but particularly true here, because this data has been collected over lots and lots of time in many different languages, across historical collections, and who amongst us has not had the wrong language for a title or duplicate data in any of our data. This is where data sense and really looking at and being hands-on with the data helps.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_59.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_59.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;h2 id="picking-embedding-models">Picking Embedding Models&lt;/h2>
&lt;p>The second part of data is picking an embedding model. For Viberary, I used&lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">MiniLM&lt;/a> just because it was small, fast, and it was standard. Now we have larger models &lt;a href="https://vickiboykis.com/2025/09/01/how-big-are-our-embeddings-now-and-why/">that are hosted as APIs&lt;/a>. Generally, depending on your use-case, it may make sense to use a domain-specific model, or fine-tune. But for a side project, a general model was just fine.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_60.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_60.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>There are a lot of other factors we care about when selecting an embedding model, such as inference costs, storage costs, and the overhead of setting up a new account. For me, multimodality was also important (whether the model was trained on both images and text) because that becomes relevant for retrieval.&lt;/p>
&lt;p>To help narrow down my options, I used Perplexity to generate a comparison grid and verify the information. One thing I like about Perplexity is that it cites its sources, so you can click through and confirm that the model didn&amp;rsquo;t hallucinate anything.&lt;/p>
&lt;p>And so I came to three choices based on price, storage costs. What was interesting is that what really unblocked me was I already had a Google account with access to Google AI. And so I could just use and set up an account key easier than I could for Cohere and Mongo, although I&amp;rsquo;ve heard good things about both.&lt;/p>
&lt;p>And then you have to pick which vector index you&amp;rsquo;d like to use. The index is the data structure that helps you quickly find related items.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_61.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_61.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Vector sets already implement &lt;a href="https://arxiv.org/abs/1603.093200">HNSW&lt;/a> &lt;a href="https://antirez.com/news/149">by default&lt;/a>. That makes the application really fast at retrieval time, which is what I care about. I didn&amp;rsquo;t really care as much as how long batch indexing takes, because at the end I only had something like 2, 000 paintings that I needed to index at a time. If I had a large selection, the choice between HNSW and IVF, which is great for data sets with large memory constraints, would have become less clear.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_62.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_62.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>So then you embed everything, and then you start to get your results. And of course, because this is a data-intensive project, you get this thing where you query for &amp;ldquo;boat&amp;rdquo;, and you get the same thing for every single result, which doesn&amp;rsquo;t look like a boat at all. But there&amp;rsquo;s a pyramid, and you&amp;rsquo;re like, okay, why is this happening?&lt;/p>
&lt;p>So then you have to go through and debug step by step, and it turns out the issue is not in the data structure, and not in the ingest, and not in the Gemini embeddings, and not in the indexing, and not in the web server, but in the fact that you picked an art category that only has 100 artifacts and your query is similar to all of them. And that&amp;rsquo;s all part of the process of looking at your data, too.&lt;/p>
&lt;h3 id="where-ai-helped-and-where-it-didnt">Where AI helped and where it didn&amp;rsquo;t&lt;/h3>
&lt;p>Rachel painted her cactus. This was her remixing and tastefully adding small, exotic elements to the larger whole. For me, this was AI and really evaluating where it might help or hurt my process to create good things and get better at what I do. So, where did I end up?&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_63.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_63.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>I used AI to build the front end, which turned out nice. It wasn&amp;rsquo;t anything out of the ordinary, and I am not sure how solid the actual code is, and whether it&amp;rsquo;s production-ready, but it was nice to get going since I&amp;rsquo;m not a front-end person at all and we always want to &lt;a href="https://mitchellh.com/writing/building-large-technical-projects">get to a good demo.&lt;/a> Last time, I used a bootstrap template for this that I tweaked slightly, which I guess is not that much different.&lt;/p>
&lt;p>LLMs also helped me with Go, but they didn&amp;rsquo;t help me by writing all of the code. How I&amp;rsquo;ve been learning Go is step by step: first, by reading dead tree books, then by trying out &lt;a href="https://gobyexample.com/">Go by Example&lt;/a>, and finally, by asking about different Go concepts from the LLMs, or working with a very small, method-level snippet at a time.&lt;/p>
&lt;p>Because what happened when I was having the model write Go directly is I found, due to cognitive offloading, that I couldn&amp;rsquo;t understand what &amp;ldquo;I&amp;rsquo;d&amp;rdquo; written, at all, and when I got up from the computer, I couldn&amp;rsquo;t replicate it. But it helped me answer questions like how structs work, what you should put in a struct, what a channel does. The kind of stuff that you would have to constantly bug a senior person about, because Go is a new language for me, the LLM will just answer. And particularly in some cases where LLM systems have RAG results, it will lead you to the source documentation because Golang has really, really excellent documentation, and that&amp;rsquo;s really what we want out of learning by reinforcement.&lt;/p>
&lt;p>LLMs also helped with data formatting and exploration. So writing all of the matplotlib methods, the pandas methods, all that stuff where you need to constantly look stuff up or you need to get data from the API. A lot of that glue stuff, the models, any of them really, are good with.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_64.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_64.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>So where did it not help? Largely with architecture changes and decisions. Understanding that I needed a two-tower model, what two-tower models were, how they&amp;rsquo;d been used previously in search and recommendations, understanding my latency constraints, knowing that serving this app has to be fast and that the indexing job&amp;rsquo;s speed didn&amp;rsquo;t matter as much: all of that came from my head and past experience rather than AI.&lt;/p>
&lt;p>Picking vector sets as a storage and retrieval primitive was similar. I only learned they existed because I pay close attention to what&amp;rsquo;s happening in the vector space. (haha) You could probably build a bot to synthesize that kind of knowledge for you, but it was much easier to verify for myself.&lt;/p>
&lt;p>More complex concepts like &lt;code>Go channels&lt;/code> , which I haven&amp;rsquo;t gotten to, were also a weak spot. I couldn&amp;rsquo;t verify whether what it gave me in Go was actually correct, and had a lot of non-idiomatic code until I got an actual experienced Go developer to take a look and review &amp;ldquo;my&amp;rdquo; code. And of course, AI didn&amp;rsquo;t help in writing this talk. It took a very long time, and none of it was written with AI. It all came from ideas that had been slowly combining in my head. Which brings me to the final point: it takes time to build good things.&lt;/p>
&lt;h3 id="it-takes-time-to-build-good-things">It takes time to build good things&lt;/h3>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_65.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_65.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>It took Rachel 60 years to get good at painting.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_66.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_66.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>You can see one of her first paintings on the left, and then one of her later, much, much later paintings on the right. Take a look at how much more complex the composition is, how the flowers are behind each other, how much more delicate the background is, the movement of the flowers in the space. It takes so much time to get good at that stuff. And we have to remind ourselves in the world of speed that that&amp;rsquo;s true for programming too.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_67.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_67.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Peter Norvig wrote this post a long time ago about &lt;a href="https://norvig.com/21-days.html">teaching yourself programming in 10 years.&lt;/a> There is no shortcut for deliberative practice, which is what we&amp;rsquo;re trying to do here when we build ourselves flowers.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_68.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_68.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>For me, I found it takes time because I was going through my tweet archive to see what I could find for this talk. In 2013, I wrote about how I was learning Python and counting words and just really struggling through it. And it&amp;rsquo;s like, it&amp;rsquo;s not very optimal and I don&amp;rsquo;t really understand what&amp;rsquo;s going on and what RGV is, etc. And then 10 years later I was able to do Viberary just through deliberate practice. And three years later, I did it again, in Go, with multimodal embeddings.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_69.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_69.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>And the other thing is, I didn&amp;rsquo;t work on my craft alone. I&amp;rsquo;ve written about how important it is to write with your &lt;a href="https://vickiboykis.com/2024/12/16/write-code-with-your-alphabet-radio-on/">alphabet radio on&lt;/a>. This is a concept from Bird by Bird by Anne Lamott, which is a book on how to write, but also on how to be creative.&lt;/p>
&lt;p>Anne Lamott talks about this alphabet radio that plays in your head. Of all the past people that have either criticized or praised you and how you have to tune some of them out selectively when you work. For me, it&amp;rsquo;s all the people that have reviewed my code and that I&amp;rsquo;ve worked with and the ones that I respect are always echoing in my head. That kind of feedback comes from good mentorship. That&amp;rsquo;s the apprenticeship that Rachel went through as well right so it takes time to get good at that.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_70.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_70.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>It also takes time to teach other machines. Here&amp;rsquo;s another small example from another side project I did before called Gandinsky, where I tried to train a generative adversarial network to paint pictures in the style of Vassily Kandinsky. At the time, this didn&amp;rsquo;t work well. This was like eight years ago, and it didn&amp;rsquo;t work because you need thousands of samples to train a GAN and Kandinsky unfortunately in all his artistic prolificacy only painted 200 paintings.&lt;/p>
&lt;p>Nowadays, you can just get Nano Banana to do style transfer for almost anything for free in an API call because we&amp;rsquo;ve been working on these models for such a long time. But it&amp;rsquo;s taken years and years for the models to get this good.&lt;/p>
&lt;p>And finally, it takes time to own your own vision, as in this quote is from antirez and his &lt;a href="https://antirez.com/news/159">recent thoughts&lt;/a> on agentic programming.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_72.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_72.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Programming as an automatic vision is not yet. That only comes from doing the work over and over and over again. So building a semantic search engine really took me about a month once I buckled down and did it (though I&amp;rsquo;m still not entirely done). But it also took me ten years. And I think that&amp;rsquo;s important to keep in mind.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_73.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_73.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>Okay, so now we&amp;rsquo;re at the end of the talk, and the question is, does this still matter? Yes, I think so. I think data sense matters, which we all inherit from having worked in machine learning and data science for years. I think mastery matters, which we get by working with others and building our own intuition for &lt;a href="https://vickiboykis.com/2026/04/13/mechanical-sympathy/">mechanical sympathy&lt;/a>. I think craft still matters.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_74.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_74.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>

&lt;p>So let&amp;rsquo;s try to build ourselves flowers. Thank you.&lt;/p>
&lt;figure>&lt;a href="https://vickiboykis.com/images/flowers/flowers_75.webp" target="_blank">&lt;img src="https://vickiboykis.com/images/flowers/flowers_75.webp"
 alt="image" width="500px">&lt;/a>
&lt;/figure>
</description></item><item><title>Mechanical sympathy</title><link>https://vickiboykis.com/2026/04/13/mechanical-sympathy/</link><pubDate>Mon, 13 Apr 2026 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2026/04/13/mechanical-sympathy/</guid><description>&lt;figure>&lt;img src="https://vickiboykis.com/images/weaver.png"
 alt="image" width="400px">&lt;figcaption>
 &lt;p>&lt;em>Weaver, seen from the Front&lt;/em>, Vincent van Gogh, 1884&lt;/p>
 &lt;/figcaption>
&lt;/figure>

&lt;p>Something that&amp;rsquo;s been floating around in my head lately is the idea that I don&amp;rsquo;t know any truly good engineers who are also not good at at product design.&lt;/p>
&lt;p>Product design can roughly be designed as the contract between the creator and the user, where the contract is designed by a set of affordances, or actions that the product allows the user to take. This is all cribbed from &lt;a href="https://en.wikipedia.org/wiki/The_Design_of_Everyday_Things">Don Norman&lt;/a> and &lt;a href="https://www.penguinrandomhouse.com/books/608234/the-beauty-of-everyday-things-by-soetsu-yanagi/">The Beauty of Everyday Things&lt;/a>.&lt;/p>
&lt;p>For example, an affordance of a chair is that it allows you to sit. An affordance of most social media feeds is that they allow you to &lt;a href="https://en.wikipedia.org/wiki/Pull-to-refresh">pull to refresh.&lt;/a> A search bar, in theory, affords you the ability to look for results. Recently this has changed because we&amp;rsquo;ve been &lt;a href="https://vickiboykis.com/2024/05/06/weve-been-put-in-the-vibe-space/">vibe spaced.&lt;/a>&lt;/p>
&lt;p>What makes good engineers good at product design is the same thing that makes them good at engineering. They feel for the boundaries of what the code and the product allows them to do and stop at those boundaries.&lt;/p>
&lt;p>Another name for being able to understand and plan for affordances, either through good product intuition, or experience, or both, in the real world is mechanical sympathy.&lt;/p>
&lt;p>Mechanical sympathy was initially coined in referring to Formula 1 racecar handling, and &lt;a href="https://groups.google.com/g/mechanical-sympathy">in software&lt;/a>, has been adapted to mean &amp;ldquo;writing the kind of code that makes sense with the underlying stack so we can get good performance from our systems.&amp;rdquo;&lt;/p>
&lt;p>Mechanical sympathy for both developers and end-users means understanding when asyncio is and is not helpful. It means using the right language, the right build system, the right font. It means using the least amount of tooling possible. Allowing for local development. It means reading code inside out rather than top to bottom. &lt;a href="https://aleyan.com/blog/2026-why-arent-we-uv-yet/">Using uv&lt;/a>. Removing code where not necessary. Respecting boundaries.&lt;/p>
&lt;p>I was watching &lt;a href="https://www.youtube.com/live/_zdroS0Hc74?si=sb_sBpoRmp-hdx8Q&amp;amp;t=3665">this really great talk&lt;/a> by &lt;a href="https://mariozechner.at/">Mario&lt;/a> on how he built an agentic coding harness, and was reminded that agentic coding tools, just like &lt;a href="https://vickiboykis.com/2022/12/05/the-cloudy-layers-of-modern-day-programming/">many cloud services&lt;/a>, don&amp;rsquo;t have mechanical sympathy.&lt;/p>
&lt;p>Things that have happened to me over the past few weeks, off the top of my head, across various projects, languages, and providers:&lt;/p>
&lt;ul>
&lt;li>A test failed, producing a &lt;code>500&lt;/code> error. I asked the agent to fix the test. The first time, it tried to fix the code instead of the test. The second time, it rewrote the test to assert a &lt;code>500&lt;/code> instead of a &lt;code>200&lt;/code>.&lt;/li>
&lt;li>Even with a spec file, agents (and most chatbots, still) will use Python&amp;rsquo;s &lt;code>List&lt;/code> and &lt;code>Dict&lt;/code>, which &lt;a href="https://pyfound.blogspot.com/2020/04/the-path-forward-for-typing-python.html">have been deprecated&lt;/a> for years.&lt;/li>
&lt;li>Agents won&amp;rsquo;t use &lt;a href="https://vickiboykis.com/2026/02/21/querying-3-billion-vectors/">vectorized numerical operations&lt;/a> unless prompted.&lt;/li>
&lt;li>When writing a test, the agent fails numerous times, and instead of suggesting that the code should be refactored to be simpler and easier to test, just keeps trying, endlessly, to work agains the grain of the code flow and fix the failing thing.&lt;/li>
&lt;/ul>
&lt;ul>
&lt;li>In new Python projects, uv is &lt;a href="https://aleyan.com/blog/2026-why-arent-we-uv-yet/">skipped entirely as a first-class citizen&lt;/a> across many providers as a suggestion of how to package projects.&lt;/li>
&lt;/ul>
&lt;p>Here&amp;rsquo;s more stuff:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://www.sri.inf.ethz.ch/blog/fixedcode">Agents fix code that already works correctly&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.sri.inf.ethz.ch/publications/gloaguen2026agentsmd">Context files are completely ignored&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://www.sri.inf.ethz.ch/publications/gloaguen2026agentsmd">Struggles with the nuances of business logic&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://github.com/SprocketLab/slop-code-bench">Erode under iterative refinement&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Maybe &lt;code>agents.md&lt;/code> / &lt;code>gastown&lt;/code> / &lt;code>mcp&lt;/code> / &lt;code>ralph&lt;/code> / &lt;code>subagents&lt;/code> / using a different provider that was nerfed last week but is cracked this week, and switching from the provider that was cracked last week but is cooked this week, solve this. It&amp;rsquo;s possible that by the time I publish the post, or three microseconds from now, a new model or harness that will come out that renders these cases obsolete.&lt;/p>
&lt;p>But the larger point is that sympathy, and good product design, is made up of hundreds and thousands of moments like this, where missing one is not critical, but, when looked at in its entirely, result in an app that feels &amp;ldquo;hard&amp;rdquo; to use and develop.&lt;/p>
&lt;p>Mechanical sympathy, just like real sympathy, comes from an enormous context window learned over a human lifetime. People, over time, learn to get good at mechanical sympathy. Coding agents can&amp;rsquo;t. It&amp;rsquo;s (maybe) possible they will someday, until then we need to provide the intuition.&lt;/p></description></item><item><title>On Programming Joy and Octocat</title><link>https://vickiboykis.com/2026/04/06/on-programming-joy-and-octocat/</link><pubDate>Mon, 06 Apr 2026 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2026/04/06/on-programming-joy-and-octocat/</guid><description>&lt;p>While GitHub has been busy &lt;a href="https://www.theregister.com/2026/02/10/github_outages/">losing its last nine of availablility&lt;/a>, I&amp;rsquo;ve been thinking about how the
internet used to be.&lt;/p>
&lt;p>Not the internet people talk about &lt;a href="https://vickiboykis.com/2024/09/19/dead-internet-souls/">from the 90s&lt;/a>, but the internet that we used to have even 10-15 years ago. This was the heyday of startups like GitHub, Twitter, Airbnb, and, Google was in its prime (though likely slightly past it at that point - &lt;a href="https://www.youtube.com/watch?v=4XpnKHJAok8&amp;amp;t=4s">Linus&amp;rsquo;s git tech talk&lt;/a> there was in 2007.)&lt;/p>
&lt;p>I&amp;rsquo;ve specifically been thinking about the &lt;a href="https://myoctocat.com/">Octocat Builder&lt;/a>. GitHub created it back in those years, hard to say when, but it dates back to at least 2018. The Octocat mascot itself was created in 2006.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/octocat.png" width="400">
&lt;/figure>

&lt;p>The builder is so much fun - you can select different colors of cat, hairstyles, accessories, and send to friends.&lt;/p>
&lt;p>There is no reason for this to exist. Perhaps there is some branding cache that GitHub got from this, but really, someone in the company just wanted to exist, and now it does.&lt;/p>
&lt;p>Something like this would have a hard time being born in today&amp;rsquo;s tech industry, I think, with its cycle of promo packets, now layoffs, anxiety about AI, and now older, sclerotic tech companies thinking about how to stay technically and socially relevant. In a world with one nine left and the rest of the app dedicated to the growth of Copilot, there is no room for a small purple Octocat.&lt;/p>
&lt;p>There&amp;rsquo;s some hope: there is a little whimsy in products like &lt;a href="https://github.com/wynandw87/claude-code-spinner-verbs">Claude Code&amp;rsquo;s spinner verbs and animations&lt;/a>, but mostly it feels like whimsy ended with ZIRP, and that makes me a bit sad. I&amp;rsquo;m happy that Octocat is still here.&lt;/p></description></item><item><title>NASA Elements of Engineering Excellence</title><link>https://vickiboykis.com/2026/04/05/nasa-elements-of-engineering-excellence/</link><pubDate>Sun, 05 Apr 2026 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2026/04/05/nasa-elements-of-engineering-excellence/</guid><description>&lt;p>I stumbled across &lt;a href="https://ntrs.nasa.gov/archive/nasa/casi.ntrs.nasa.gov/20130000445.pdf">this report from NASA&lt;/a>, &amp;ldquo;Elements of Engineering Excellence&amp;rdquo;,
published in 2012,&lt;/p>
&lt;blockquote>
&lt;p>The inspiration for this paper originated in discussions with the director of MSFC
Engineering in 2006 who asked that we investigate the question: “How do you achieve excellence in aerospace engineering?” The authors’ approach to answering this question was a short course on Excellence in Engineering which is documented in this report.&lt;/p>&lt;/blockquote>
&lt;p>The report talks about five areas of an organization that led to failures at NASA:&lt;/p>
&lt;ol>
&lt;li>Shifting from engineering “hands-on” and “excellence” to “insight/oversight”. Lack of ownership.&lt;/li>
&lt;li>“Normalization of the deviances”. Not questioning anomalies.&lt;/li>
&lt;li>Lack of critical thinking. Over-reliance on procedures and computer codes.&lt;/li>
&lt;li>Decentralization of authority.&lt;/li>
&lt;li>Organizational and technical complexity.&lt;/li>
&lt;/ol>
&lt;p>And I found this first point particularly relevant in the age of agents:&lt;/p>
&lt;blockquote>
&lt;p>The five root causes are not listed in any priority order. The first root cause listed deals with a shift in the NASA culture, where the organization moved from a hands-on engineering approach to an insight/oversight approach. In the early days the heritage was basically the arsenal approach where you designed, built and verified the system before contracting it out for production. The engineers really understood the design, the hardware/software and the system based on actual experience. In the early culture much of the technology development was an in-house, hands-on activity. The shift has resulted in the elimination of much independent analysis and test and experience based understanding of the systems required to catch and prevent problems. Howard E. McCurdy in his book Inside NASA says, “NASA officials from the original cultures believed they needed to provide their engineers and scientists with hands-on experience in order to maintain the technical side of the house. It was the only way to keep them technically sharp. By keeping their own engineers and scientists sharp, they could penetrate the work of the contractor. During the first decade of space flight, a strong technical culture guided the work of NASA employees. The norms typical of that period required NASA to maintain a corps of professional employees deeply involved in the details of space flight and aeronautics. The technical culture counterbalanced many organizational forces that rose up to challenge it. It overpowered the usual bureaucratic tendencies present in government operations. It provided a counterweight to the centralizing and organizational necessities of the Apollo mission.” [McCurdy, H. 1993] The loss of the technical excellence based on hands-on experience has led to many of the problems and therefore is one of the root causes of problems. To prevent problems, NASA needs to re-establish the culture of technical excellence based on hands-on work.&lt;/p>&lt;/blockquote></description></item><item><title>Antidote</title><link>https://vickiboykis.com/2026/03/04/antidote/</link><pubDate>Wed, 04 Mar 2026 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2026/03/04/antidote/</guid><description>&lt;p>If you love building things, and the process of building is just as important to you as the result itself, it’s not unreasonable that you’re in a slump these days.&lt;/p>
&lt;p>The world is telling you that your thinking process is extraneous, unnecessary, and must be commoditized and compressed. But you are multidimensional, you need room to touch the code, to explore, to rise above the local minima. In engineering, &lt;a href="https://vickiboykis.com/2021/09/23/reaching-mle-machine-learning-enlightenment/">the journey is the destination.&lt;/a> The working system in production is our reward, and we always move towards that.&lt;/p>
&lt;p>Unstick yourself. Do stuff that is not the most efficient. Build something from scratch with your bare hands and your brain. Read programming books written on dead trees. Reason through something on the back of an envelope. Scribble and daydream. Join Discords where people are actively building, reviewing pull requests, building community. Write a blog post. Mentor a colleague. Answer an email from a budding engineer. Create a private repo and merge to main without a PR. Take a library or tool and look underneath the hood. Go to an in-person meetup. Pick apart the living, breathing beast that is the transformer model that governs most of our daily conversations these days, down to the attention mechanism, down to the RoPE embeddings. Explain it back to yourself.&lt;/p>
&lt;p>Having agency is one of the most important things a creator can have. Give the gift of it to yourself. It&amp;rsquo;s still there, waiting for you.&lt;/p></description></item><item><title>Querying 3 billion vectors</title><link>https://vickiboykis.com/2026/02/21/querying-3-billion-vectors/</link><pubDate>Sat, 21 Feb 2026 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2026/02/21/querying-3-billion-vectors/</guid><description>&lt;p>Recently, I got nerd-sniped by this exchange between &lt;a href="https://github.com/LRitzdorf/TheJeffDeanFacts">Jeff Dean&lt;/a> and someone trying to query 3 billion vectors.
I was curious to see if I could implement the &lt;a href="https://vickiboykis.com/2021/06/06/the-humble-hash-aggregate/">optimal map-reduce&lt;/a> solution he alludes to in his reply.&lt;/p>
&lt;img width="400" alt="image" src="https://gist.github.com/user-attachments/assets/ecca4afd-81bf-45a4-9043-ad7da174d93a" />
&lt;p>A vector is a list/array of floating point numbers of &lt;code>n&lt;/code> dimensions, where &lt;code>n&lt;/code> is the length of the list. The reason you might perform vector search is to find words or items that are semantically similar to each other, a common pattern in search, recommendations, and generative retrieval applications &lt;a href="https://read.engineerscodex.com/p/how-cursor-indexes-codebases-fast">like Cursor&lt;/a> which heavily leverage &lt;a href="https://vickiboykis.com/what_are_embeddings/">embeddings.&lt;/a>&lt;/p>
&lt;p>I started by writing an extremely naive implementation which made the following assumptions:&lt;/p>
&lt;ul>
&lt;li>we have 3 billion searchable (document) vectors and ~1k query vectors (a number I made up)&lt;/li>
&lt;li>Both of the vector sets are stored on disk in &lt;code>.npy&lt;/code> format (simple format for &lt;a href="https://numpy.org/doc/stable/reference/generated/numpy.lib.format.html">storing numpy arrays&lt;/a>&lt;/li>
&lt;li>We&amp;rsquo;d like to compare each of the query vectors against the larger pool of document vectors and return the resulting similarity (dot product) for each of the vector combinations.&lt;/li>
&lt;li>3k total reference vectors (to see if we could intially run this amount before scaling)&lt;/li>
&lt;li>The vectors are of dimensionality (n) 768, a common dimensionality for many models that allow for
similarity-based embedding queries&lt;/li>
&lt;/ul>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> loguru &lt;span style="color:#f92672">import&lt;/span> logger
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> time
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> os
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># start with 3_000 vectors to keep things small &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>total_vectors_num &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3_000_000_000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>query_vectors_num &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1_000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_random_vectors&lt;/span>(num_vectors:int)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Generating &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>num_vectors&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> vectors...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> rng &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>random&lt;span style="color:#f92672">.&lt;/span>default_rng() 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> vectors &lt;span style="color:#f92672">=&lt;/span> rng&lt;span style="color:#f92672">.&lt;/span>random((num_vectors, &lt;span style="color:#ae81ff">768&lt;/span>)) 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> vectors
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_dot_products&lt;/span>(vectors_file:np&lt;span style="color:#f92672">.&lt;/span>array, query_vectors:np&lt;span style="color:#f92672">.&lt;/span>array) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> list[np&lt;span style="color:#f92672">.&lt;/span>array]:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_products_computed &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dot_products &lt;span style="color:#f92672">=&lt;/span> []
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> v &lt;span style="color:#f92672">in&lt;/span> vectors_file:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> qv &lt;span style="color:#f92672">in&lt;/span> query_vectors:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dot_product &lt;span style="color:#f92672">=&lt;/span> v &lt;span style="color:#f92672">@&lt;/span> qv
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dot_products&lt;span style="color:#f92672">.&lt;/span>append(dot_product)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> total_products_computed &lt;span style="color:#f92672">+=&lt;/span> &lt;span style="color:#ae81ff">1&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> total_products_computed &lt;span style="color:#f92672">%&lt;/span> &lt;span style="color:#ae81ff">100000&lt;/span> &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Total vectors processed:&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>total_products_computed&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> dot_products
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Generate initial vectors and query vectors and write to disk&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>doc_vectors &lt;span style="color:#f92672">=&lt;/span> generate_random_vectors(total_vectors_num)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>query_vectors &lt;span style="color:#f92672">=&lt;/span> generate_random_vectors(query_vectors_num)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>np&lt;span style="color:#f92672">.&lt;/span>save(&lt;span style="color:#e6db74">&amp;#39;vectors.npy&amp;#39;&lt;/span>, doc_vectors)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Load vectors from disk &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&amp;#34;Loading file from disk...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>vectors_file &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>load(&lt;span style="color:#e6db74">&amp;#39;vectors.npy&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>start_time &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>time()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&amp;#34;Getting dot products...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>results &lt;span style="color:#f92672">=&lt;/span> get_dot_products(vectors_file, query_vectors)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>end_time &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>time()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Execution time: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>end_time &lt;span style="color:#f92672">-&lt;/span> start_time&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">.4f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> seconds&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Number of dot products computed: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>len(results)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This, predictably, didn&amp;rsquo;t do so great, even on my M2 Macbook, even at 3,000 vectors, one million times less than 3 billion embeddings, taking 2 seconds.&lt;/p>
&lt;p>Results:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:25.675 | INFO | __main__:generate_random_vectors:9 - Generating &lt;span style="color:#ae81ff">3000&lt;/span> vectors...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:25.691 | INFO | __main__:generate_random_vectors:9 - Generating &lt;span style="color:#ae81ff">1000&lt;/span> vectors...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:25.698 | INFO | __main__:&amp;lt;module&amp;gt;:39 - Loading file from disk...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:25.700 | INFO | __main__:&amp;lt;module&amp;gt;:43 - Getting dot products...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:27.688 | INFO | __main__:get_dot_products:24 - Total vectors processed:3000000
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:27.688 | INFO | __main__:&amp;lt;module&amp;gt;:47 - Execution time: 1.9877 seconds
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:53:27.688 | INFO | __main__:&amp;lt;module&amp;gt;:48 - Number of dot products computed: &lt;span style="color:#ae81ff">3000000&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So I &lt;a href="https://pythonspeed.com/articles/vectorization-python/">vectorized the numpy operation&lt;/a>, which made things much faster.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> numpy &lt;span style="color:#66d9ef">as&lt;/span> np
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> loguru &lt;span style="color:#f92672">import&lt;/span> logger
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> time
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>total_vectors_num &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">3_000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>query_vectors_num &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">1_000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">generate_random_vectors&lt;/span>(num_vectors:int)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>array:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Generating &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>num_vectors&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> vectors...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> rng &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>random&lt;span style="color:#f92672">.&lt;/span>default_rng() 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> vectors &lt;span style="color:#f92672">=&lt;/span> rng&lt;span style="color:#f92672">.&lt;/span>random((num_vectors, &lt;span style="color:#ae81ff">768&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> vectors
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_dot_products_vectorized&lt;/span>(vectors_file:np&lt;span style="color:#f92672">.&lt;/span>array, query_vectors:np&lt;span style="color:#f92672">.&lt;/span>array):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> dot_products &lt;span style="color:#f92672">=&lt;/span> vectors_file &lt;span style="color:#f92672">@&lt;/span> query_vectors&lt;span style="color:#f92672">.&lt;/span>T
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> dot_products&lt;span style="color:#f92672">.&lt;/span>flatten() &lt;span style="color:#75715e"># collapse into single dim&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Generate initial vectors and query vectors and write to disk&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>ram_vectors &lt;span style="color:#f92672">=&lt;/span> generate_random_vectors(total_vectors_num)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>query_vectors &lt;span style="color:#f92672">=&lt;/span> generate_random_vectors(query_vectors_num)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>np&lt;span style="color:#f92672">.&lt;/span>save(&lt;span style="color:#e6db74">&amp;#39;vectors.npy&amp;#39;&lt;/span>, ram_vectors)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e"># Load vectors from disk &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&amp;#34;Loading file from disk...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>vectors_file &lt;span style="color:#f92672">=&lt;/span> np&lt;span style="color:#f92672">.&lt;/span>load(&lt;span style="color:#e6db74">&amp;#39;vectors.npy&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>start_time &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>time()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">&amp;#34;Getting dot products...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>results &lt;span style="color:#f92672">=&lt;/span> get_dot_products_vectorized(vectors_file, query_vectors)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>end_time &lt;span style="color:#f92672">=&lt;/span> time&lt;span style="color:#f92672">.&lt;/span>time()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Execution time: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>end_time &lt;span style="color:#f92672">-&lt;/span> start_time&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">.4f&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> seconds&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;Number of dot products computed: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>len(results)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>At .017 seconds, this was a big improvement!&lt;/p>
&lt;p>Results:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>2025-12-13 17:52:52.810 | INFO | __main__:generate_random_vectors:9 - Generating &lt;span style="color:#ae81ff">3000&lt;/span> vectors...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:52:52.831 | INFO | __main__:generate_random_vectors:9 - Generating &lt;span style="color:#ae81ff">1000&lt;/span> vectors...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:52:52.874 | INFO | __main__:&amp;lt;module&amp;gt;:39 - Loading file from disk...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:52:52.876 | INFO | __main__:&amp;lt;module&amp;gt;:43 - Getting dot products...
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:52:52.887 | INFO | __main__:&amp;lt;module&amp;gt;:47 - Execution time: 0.0107 seconds
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>2025-12-13 17:52:52.887 | INFO | __main__:&amp;lt;module&amp;gt;:48 - Number of dot products computed: &lt;span style="color:#ae81ff">3000000&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>I tried a 3 million sample size with this improvement. This took 12 seconds.&lt;/p>
&lt;pre tabindex="0">&lt;code>2025-12-13 19:39:43.830 | INFO | __main__:generate_random_vectors:12 - Generating 3000000 vectors...
2025-12-13 19:39:57.509 | INFO | __main__:generate_random_vectors:12 - Generating 1000 vectors...
2025-12-13 19:39:58.978 | INFO | __main__:&amp;lt;module&amp;gt;:57 - Loading file from disk...
2025-12-13 19:40:00.131 | INFO | __main__:&amp;lt;module&amp;gt;:61 - Getting dot products...
2025-12-13 19:40:12.984 | INFO | __main__:&amp;lt;module&amp;gt;:65 - Execution time: 12.8491 seconds
2025-12-13 19:40:12.992 | INFO | __main__:&amp;lt;module&amp;gt;:66 - Number of dot products computed: 3000000000
&lt;/code>&lt;/pre>&lt;p>We could also reduce even further by converting the data to float32:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>doc_vectors &lt;span style="color:#f92672">=&lt;/span> generate_random_vectors(total_vectors_num)&lt;span style="color:#f92672">.&lt;/span>astype(np&lt;span style="color:#f92672">.&lt;/span>float32)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>query_vectors &lt;span style="color:#f92672">=&lt;/span> generate_random_vectors(query_vectors_num)&lt;span style="color:#f92672">.&lt;/span>astype(np&lt;span style="color:#f92672">.&lt;/span>float32)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;pre tabindex="0">&lt;code>2025-12-13 18:13:52.152 | INFO | __main__:generate_random_vectors:10 - Generating 3000 vectors...
2025-12-13 18:13:52.168 | INFO | __main__:generate_random_vectors:10 - Generating 1000 vectors...
2025-12-13 18:13:52.176 | INFO | __main__:&amp;lt;module&amp;gt;:55 - Loading file from disk...
2025-12-13 18:13:52.178 | INFO | __main__:&amp;lt;module&amp;gt;:59 - Getting dot products...
2025-12-13 18:13:52.182 | INFO | __main__:&amp;lt;module&amp;gt;:63 - Execution time: 0.0045 seconds
2025-12-13 18:13:52.182 | INFO | __main__:&amp;lt;module&amp;gt;:64 - Number of dot products computed: 3000000
&lt;/code>&lt;/pre>&lt;p>With these small improvements, we&amp;rsquo;ve already sped up inference to ~13 seconds for 3 million vectors, which means for 3 billion, it would take 1000x longer, or ~3216 minutes.&lt;/p>
&lt;pre tabindex="0">&lt;code>|approach | query_vectors | doc_vectors | time |
|----------- |---------------|---------------|----------|
| Naive | 1,000 | 3,000 | 1.9877s |
| Vectorized | 1,000 | 3,000 | 0.0107s |
| Vectorized | 1,000 | 3,000,000 | 12.8491s |
| Np.Float32 | 1,000 | 3,0000 | 0.0045s |
&lt;/code>&lt;/pre>&lt;p>When we start to run it to test, however, we run into a different problem: OOM. Why? The amount of memory needed to process 3 billion objects, each as &lt;code>float32&lt;/code> object that&amp;rsquo;s 4 bytes in size, would be 8 million GB.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>vectors &lt;span style="color:#f92672">=&lt;/span> rng&lt;span style="color:#f92672">.&lt;/span>random((&lt;span style="color:#ae81ff">1&lt;/span>, &lt;span style="color:#ae81ff">768&lt;/span>))&lt;span style="color:#f92672">.&lt;/span>astype(np&lt;span style="color:#f92672">.&lt;/span>float32)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(vectors&lt;span style="color:#f92672">.&lt;/span>nbytes)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">3072&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(vectors&lt;span style="color:#f92672">.&lt;/span>itemsize)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>bytes_per_float32 &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">4&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>memory_gb &lt;span style="color:#f92672">=&lt;/span> (&lt;span style="color:#ae81ff">3000000000&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">1000&lt;/span> &lt;span style="color:#f92672">*&lt;/span> &lt;span style="color:#ae81ff">768&lt;/span> &lt;span style="color:#f92672">*&lt;/span> bytes_per_float32) &lt;span style="color:#f92672">/&lt;/span> (&lt;span style="color:#ae81ff">1024&lt;/span>&lt;span style="color:#f92672">**&lt;/span>&lt;span style="color:#ae81ff">3&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">8583068.84765625&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">8.6&lt;/span> TB
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In order to improve this, we would need to do some heavy lifting of the kind Jeff Dean prescribed. First, we could to change the code to use generators and batch the comparison operations. We could write every n operations to disk, either directly or through memory mapping. Or, we could use system-level optimized code calls - we could rewrite the code in Rust or C, or use a library like &lt;a href="https://github.com/ashvardanian/SimSIMD">SimSIMD&lt;/a> explicitly made for similarity comparisons between vectors at scale.&lt;/p>
&lt;p>Before I started on any further optimizations, upon further inspection, there were some things about the problem that I realized weren&amp;rsquo;t clear to me: 3 billion vector embeddings queried a few thousand times could mean:&lt;/p>
&lt;ul>
&lt;li>I have a single query vector, and I query all 3 billion vectors once, get the dot product, and get all results&lt;/li>
&lt;li>I have a single query vector, I query all 3 billion vectors once, get the dot product, and return top-k results, which is easier because we can do ANN search
&lt;ul>
&lt;li>In this case, do I need to return the two initial vectors also? Or just the result?&lt;/li>
&lt;li>Do I need to re-rank the results by similarity in any way?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>I have 1,000 query vectors, and I query all 3 billion vectors once, and get the dot product of all results&lt;/li>
&lt;li>Are these vectors already in-memory when we intially start working with them or will they always be on-disk? Are we reading them one at a time, or streaming them?&lt;/li>
&lt;li>What kind of machine are we assuming: Are we running this locally? What are the specs of the machine? Are we assuming the vectors come to us in a specific, optimized format?
&lt;ul>
&lt;li>Do we have GPUs and are we allowed to use them?&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>How &lt;a href="https://vickiboykis.com/2025/09/01/how-big-are-our-embeddings-now-and-why/">big are our embeddings?&lt;/a> - this is extremely important and could significantly impact our representation, input vector size and output results&lt;/li>
&lt;li>Are we assuming we can compress their representation at all, i.e. is compressiong from float64 to float32 tolerable wrt to accuracy?&lt;/li>
&lt;li>How much time do we have to generate this one-off project? Are we sure it&amp;rsquo;s really a one-off?&lt;/li>
&lt;/ul>
&lt;p>All of these dictate the additional time and resources spent on the solution. What I realized is the same thing I&amp;rsquo;ve seen so many of these problems over the years, that the technical solution is no longer the hardest one to achieve: the hardest one is nailing down the requirements.&lt;/p></description></item><item><title>2025 in review</title><link>https://vickiboykis.com/2025/12/22/2025-in-review/</link><pubDate>Mon, 22 Dec 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/12/22/2025-in-review/</guid><description>&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/girl_with_candle.jpg" width="400">
&lt;/figure>

&lt;em>Jeune fille lisant une lettre à la bougie, Jean-Baptiste Santerre, 1700&lt;/em>&lt;/p>
&lt;p>Machine learning engineers spend their lives alternating between two states: staring at &lt;a href="https://tqdm.github.io/">tqdm progress bars&lt;/a> during model training and staring at error logs during model inference.&lt;/p>
&lt;p>A third category now involves staring at coding agent CLI progress bars, but using too much AI assistance during coding makes me feel like I&amp;rsquo;m losing &lt;a href="https://vickiboykis.com/2023/09/13/build-and-keep-your-context-window/">my own context window&lt;/a>.&lt;/p>
&lt;p>I started a new job as a founding MLE in March and, as is true for engineers in any small and young team, I’ve been staring at error logs all across the stack.&lt;/p>
&lt;p>The good news is that if you are at a point where you care a lot about error logs, it means you have users who care a lot about the answer to &amp;ldquo;What if this breaks?&amp;rdquo; If you have people other than you who care about your software, congratulations, &lt;a href="https://vickiboykis.com/2021/06/20/the-ritual-of-the-deploy/">you are in production&lt;/a>. Being in the state of production is the best possible outcome for software engineers because it means the work we’re doing is useful.&lt;/p>
&lt;p>Standing up a new thing in production, though, is scary, because production means you are responsible - to the end-user, to the other developers on your team, and, finally, to yourself.&lt;/p>
&lt;p>In &amp;ldquo;The Tombs of Atuan&amp;rdquo;, Ursula K. Le Guin writes of a girl, Tenar, who is taken at a very young age from her family and goes to live in a holy city. In a ritual ceremony, her name is changed and she loses her identity to become Arha, the Eaten One. Arha is the High Priestess of the Place of the Tombs, a vast series of caverns under the earth, silent, and in complete darkness, “full of gold and the swords of old heroes, and old crowns, and bones, and years, and silence.”&lt;/p>
&lt;p>In her first trip to the tombs with an elder priestess, they descend into the dark without a torch.&lt;/p>
&lt;blockquote>
&lt;p>“Light is forbidden here.” Kossil’s whisper was sharp. Even as she said it, Arha knew it must be so. This was the very home of darkness, the inmost center of the night. Three times her fingers swept across a gap in the complex, rocky blackness. The fourth time she felt for the height and width of the opening, and entered it. Kossil came behind. In this tunnel, which went upward again at a slight slant, they passed an opening on the left, and then at a branching way took the right: all by feel, by groping, in the blindness of the underearth and the silence inside the ground. In such a passageway as this, one must reach out almost constantly to touch both sides of the tunnel, lest one of the openings that must be counted be missed, or the forking of the way go unnoticed.&lt;/p>&lt;/blockquote>
&lt;p>This is what building a new production system is like. You go into it, like Arha, scared, groping entirely in the darkness.&lt;/p>
&lt;p>If you need a cache, you can use a &lt;a href="https://go.dev/blog/swisstable">Go map&lt;/a>, lru cache, or Redis. If you need a vector store, you can use &lt;code>np.array&lt;/code>, or Postgres, or Pinecone/Chroma/Turbopuffer/ or Elasticsearch. You could run your whole stack on Docker Compose, or a single bare-metal server in your basement, or Kubernetes in the cloud, or &lt;a href="https://bogdanthegeek.github.io/blog/projects/vapeserver/">on a vape.&lt;/a>&lt;/p>
&lt;p>Forget large-scale architecture choices.
How will &lt;a href="https://vickiboykis.com/2023/06/29/naming-things/">you name things&lt;/a>? Should you name your method &lt;code>get_results&lt;/code> or &lt;code>get_query_results&lt;/code>? Will you ever get results that are more than just the answer to a query? Will you ever use this method in a more abstract way? Does it need to be part of a class? What exceptions should you write for it? We want to write &lt;a href="https://npf.io/2017/08/lies/">code that must never lie&lt;/a>, but what if, at the time, we are telling the truth to ourselves?&lt;/p>
&lt;p>How soon do we need to ship &lt;code>get_query_results&lt;/code>? How many other methods will rely on it? How many of our teammates need this method for their work as well? Will they understand it? Code is meant, largely still and hopefully for the forseeable future, for humans to read first and machines to compile second.&lt;/p>
&lt;p>In &amp;ldquo;The Bell Jar&amp;rdquo;, Sylvia Plath laments about the endless myriad combinations of a possible life, &amp;ldquo;I saw my life branching out before me like the green fig tree in the story. From the tip of every branch, like a fat purple fig, a wonderful future beckoned and winked.&amp;rdquo; By the time she&amp;rsquo;s done thinking through all these futures,she hasn&amp;rsquo;t made a decision, and they all slip away from her.&lt;/p>
&lt;p>The forking branches of a decision tree in a codebase, are likewise boundless, and the neat part is that there is no right answer. You are constrained by your business requirements, but the choice of implementation of those requirements is of an endless variety. It will depend on: the stack you already have, the budget for the rest of the stack, your &lt;a href="https://vickiboykis.com/2024/12/16/write-code-with-your-alphabet-radio-on/">own past experience and engineering values&lt;/a>, the social norms and expectations of your team and their collective experience, the industry&amp;rsquo;s vocabulary, the language you&amp;rsquo;re writing in, its conventions and affordances, and how much time you have to actually think about, finish, and merge this current PR before you need to deploy. That&amp;rsquo;s why staff engineers make money mostly by saying &amp;ldquo;it depends&amp;rdquo; for years on end in different ways.&lt;/p>
&lt;p>Let&amp;rsquo;s say your goal is to build a scalable, distributed, microservice architecture that processes inputs and streams machine learning inference outputs to users with minimal downtime, and that the system is resilient to failures via container orchestration. I&amp;rsquo;ve just described every machine learning systems since the beginning of time, this is the machine learning system they show you in Plato&amp;rsquo;s ML engineering cave, in every single Medium post on the planet, this is the set of boxes connected by dotted lines that arrives to unbidden to every engineer in their dreams. This is the light at the end of your labyrinth.&lt;/p>
&lt;p>But, in building such a system, you never start by building such a system. You have to start with a single keystroke in the inky blackness.This is one of the reasons it&amp;rsquo;s so hard to write code with LLMs and evaluate their output, by the way: humans reason from the inside out - &lt;a href="https://www.dwarkesh.com/p/andrej-karpathy">&amp;ldquo;When you write this code, you don’t go from top to bottom, you go from chunks and you grow the chunks&amp;hellip;&amp;rdquo;&lt;/a> - and LLMs write code from top down.&lt;/p>
&lt;p>So, instead, you take a deep breath and write a single method to tokenize a string.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">tokenize&lt;/span>(string:str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> list[str]:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> tokens &lt;span style="color:#f92672">=&lt;/span> string&lt;span style="color:#f92672">.&lt;/span>split()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> tokens
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Tokenizing a string is the process of splitting it into subsections and converting each of them into a number so they can be processed by a &lt;a href="https://huggingface.co/learn/llm-course/en/chapter2/4">transformer-style model&lt;/a>. Tokenization is not specific to LLMs. We&amp;rsquo;ve been doing tokenization for a long time. &lt;a href="https://aclanthology.org/C92-4173.pdf">Here&amp;rsquo;s a paper on how important it is for preprocessing from 1992!&lt;/a>&lt;/p>
&lt;p>The first part of tokenizing any text is splitting it into those sub-parts. Those sub-parts can be words, syllables, or characters. We start with words to make it easier.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>sentence &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;The truest power is the power to choose. - Ursula K. Le Guin&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">tokenize&lt;/span>(string:str) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> list[str]:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> tokens &lt;span style="color:#f92672">=&lt;/span> string&lt;span style="color:#f92672">.&lt;/span>split()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> tokens
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(tokenize(sentence))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>[&lt;span style="color:#e6db74">&amp;#39;The&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;truest&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;power&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;is&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;the&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;power&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;to&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;choose.&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;-&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Ursula&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;K.&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Le Guin&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Easy, and we have something concrete. Do we count &amp;ldquo;K.&amp;rdquo; as a word? How about the punctuation, do we generally want to include that for tokenization in a transformer-style model?&lt;/p>
&lt;p>What do we do with input text that has numbers in it? Do we include them or strip them out?&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>sentence &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;The truest 1 power is the power to choose. - Ursula K. Le Guin&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(tokenize(sentence))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>[&lt;span style="color:#e6db74">&amp;#39;The&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;truest&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;1&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;power&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;is&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;the&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;power&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;to&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;choose&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Ursula&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;K.&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Le Guin&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What happens when we instead get malformed inputs from a webpage, or if we have delimiters?&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>sentence &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;The&amp;amp;lt;html&amp;gt;truest&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">power&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">is&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">the&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">power&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">to&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">choose&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">Ursula&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">K&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">Le&lt;/span>&lt;span style="color:#ae81ff">\n&lt;/span>&lt;span style="color:#e6db74">Guin&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(tokenize(sentence))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>[&lt;span style="color:#e6db74">&amp;#39;The&amp;amp;lt;html&amp;gt;truest&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;power&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;is&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;the&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;power&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;to&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;choose&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Ursula&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;K.&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Le&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Guin&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What if the input is an empty string? Do we want to send that empty string downstream?&lt;/p>
&lt;p>What if we don’t have a string to tokenize? We now get an exception that we have to think about how to handle if we&amp;rsquo;re in production.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>string &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>tokenize(sentence)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Traceback (most recent call last):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> File &lt;span style="color:#e6db74">&amp;#34;/Users/vicki/arha/tokenize.py&amp;#34;&lt;/span>, line &lt;span style="color:#ae81ff">9&lt;/span>, &lt;span style="color:#f92672">in&lt;/span> &lt;span style="color:#f92672">&amp;amp;&lt;/span>lt; module&lt;span style="color:#f92672">&amp;gt;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(tokenize(sentence))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> File &lt;span style="color:#e6db74">&amp;#34;/Users/vicki/arha/tokenize.py&amp;#34;&lt;/span>, line &lt;span style="color:#ae81ff">5&lt;/span>, &lt;span style="color:#f92672">in&lt;/span> tokenize
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> tokens &lt;span style="color:#f92672">=&lt;/span> string&lt;span style="color:#f92672">.&lt;/span>split()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">AttributeError&lt;/span>: &lt;span style="color:#e6db74">&amp;#39;NoneType&amp;#39;&lt;/span> object has no attribute &lt;span style="color:#e6db74">&amp;#39;split&amp;#39;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What if we receive emojis? This string representation returns only one token, which may not be correct in representing these two different emotions. And how do we know the model accepts emoji codepoints as input anyway? Many GPT-style models do, but many &lt;a href="https://github.com/huggingface/transformers/issues/7648">BERT models don&amp;rsquo;t&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>sentence &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;🤪🥰&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(tokenize(sentence))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>[&lt;span style="color:#e6db74">&amp;#39;🤪🥰&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What if we send an extremely large string because we don&amp;rsquo;t have upstream error handling for our tokenization queries? This code block will take a long time to return because you are creating a large object in memory. Since we don&amp;rsquo;t have logging yet, we won&amp;rsquo;t know in the rest of our application where that large object creation takes place. We will just have a web service that hangs silently, indefinitely.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>num_tokens &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">500_000_000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>pattern &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;The truest power is the power to choose. - Ursula K. Le Guin&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>large_string &lt;span style="color:#f92672">=&lt;/span> pattern &lt;span style="color:#f92672">*&lt;/span> num_tokens
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>print(tokenize(large_string))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>hanging forever&lt;span style="color:#f92672">...&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What if our input includes languages other than English? How do we account for &lt;a href="https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/">Unicode&lt;/a>?&lt;/p>
&lt;p>Sometimes, you&amp;rsquo;re able to come up with most of these edge cases at once. Mostly, though, you end up building out these edge cases over countless iterations, because software engineering is &lt;a href="https://adamj.eu/tech/2021/11/03/software-engineering-is-programming-integrated-over-time/">programming integrated over time.&lt;/a>&lt;/p>
&lt;p>Eventually, you build up to a codebase that has a &lt;a href="https://github.com/openai/tiktoken/blob/97e49cbadd500b5cc9dbb51a486f0b42e6701bee/src/lib.rs#L97">level of thought and care&lt;/a> around tokenization, punctuation, and input sanitization. You&amp;rsquo;ve now written a codebase at the level of &lt;a href="https://github.com/huggingface/tokenizers/tree/main">tokenizers&lt;/a>, or &lt;a href="https://spacy.io/api/tokenizer">spacy&lt;/a> or &lt;a href="https://github.com/openai/tiktoken">tiktoken&lt;/a>.&lt;/p>
&lt;p>In fact, you should probably use the libraries that people have already put collectively dozens of combined years of thought into - those corridors are closed, safe, and tested, in the case of open-source libraries, by a much larger community than just you. You import &lt;code>tokenizers&lt;/code>.&lt;/p>
&lt;p>In doing so, you close off a number of paths further into the labyrinth behind you, the doors darkened and abandoned. You light a single flickering torch to mark your path. On top of that code, you start adding the things that make that path more resilient (but also, then harder to refactor if you decide to pivot): precise exception handling, traces, clear &lt;a href="https://vickiboykis.com/2025/07/16/my-favorite-use-case-for-ai-is-writing-logs/">human-legible logging&lt;/a>, and unit tests. A single corridor in your application starts to shine. With the path in front fo you illuminated, you move forward.&lt;/p>
&lt;p>Now that tokenization works, since you need this method throughout your application, you&amp;rsquo;ll want to create a tokenizer inference service.&lt;/p>
&lt;p>Within the logic itself, if you&amp;rsquo;re using an external tokenizer, you&amp;rsquo;ll need to make sure it matches the model you&amp;rsquo;re using, because the model depends on a given tokenization scheme, In this case, we are better off relying on &lt;a href="https://huggingface.co/docs/transformers/en/model_doc/auto">AutoTokenizer&lt;/a>, which handles the import and comparison logic for us:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> transformers &lt;span style="color:#f92672">import&lt;/span> AutoTokenizer
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>name_or_path &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;arha-based-uncased&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>tokenizer &lt;span style="color:#f92672">=&lt;/span> AutoTokenizer&lt;span style="color:#f92672">.&lt;/span>from_pretrained(name_or_path)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>If you are downloading the tokenizer from HuggingFace, as &lt;code>AutoTokenizer&lt;/code> does unless you specify a custom tokenizer path, you add a network call and an external dependency to the complications of your initial service call.&lt;/p>
&lt;p>A new branch blooms on your decision tree. If you make this call in your service, how much latency does it add? On cold-start, can you assume the artifact will always be there? How long does it take to load the tokenizer? To stave off that branch of questions, you could instead download the artifact locally, or write your own tokenizer. Now, you&amp;rsquo;re in the business of object storage and artifact management.&lt;/p>
&lt;p>You decide to tackle another question: if you&amp;rsquo;re calling the tokenizer again and again, you&amp;rsquo;ll need to create a method, or better yet, a class, because it&amp;rsquo;s highly likely you&amp;rsquo;ll want other functionality attached to the tokenizer.&lt;/p>
&lt;p>And now, if you have a service, you need to Dockerize the code. If you have a Docker image, you need to pin dependencies for reproducibility. If you are pinning dependencies and have Docker, you need a build process for your deployable artifact.&lt;/p>
&lt;p>The closer and closer to production you get, the more questions around &amp;ldquo;what happens if this breaks for other people who include my team and my users&amp;rdquo; start to form.&lt;/p>
&lt;p>You write the service and add logs. Lots of logs. So many logs. Logs are your first line of defense in production. They are the candle in the darkness. If they&amp;rsquo;re good enough for Brian Kernighan, who wrote, &amp;ldquo;The most effective debugging tool is still careful thought, coupled with judiciously placed print statements,” they&amp;rsquo;re good enough for you.&lt;/p>
&lt;p>You need to register the service into your observability platform. You add exception handling. So many &lt;code>Exceptions&lt;/code>. What if the tokenizer doesn&amp;rsquo;t load? What if it loads, but to the wrong directory? What if HuggingFace is down? What if the tokenizer service can&amp;rsquo;t talk to other services? What happens if the new container doesn’t launch? What happens if the tokenizer artifact is corrupted? What happens if you&amp;rsquo;re serving the tokenizer, but the intra-service network fails? Exceptions and logging, logging and exceptions, unit tests and integration tests, layer upon layer of defenses.&lt;/p>
&lt;p>Beej, who previously wrote an &lt;a href="https://beej.us/guide/bggit/html/split/index.html">incredible guide to git&lt;/a>, recently released a guide to &lt;a href="https://beej.us/guide/bglcs/html/split/index.html">learning computer science&lt;/a>, in which he says that in order to be a good computer scientist, you need to think &lt;a href="https://beej.us/guide/bglcs/html/split/problem-solving.html">like a villain&lt;/a>.&lt;/p>
&lt;blockquote>
&lt;p>Expect the unexpected in terms of data that your code will receive. Expect malicious actors to feed in data in an attempt to gain unauthorized access or manipulate the system in undesirable ways. Test for that stuff in your code.&lt;/p>&lt;/blockquote>
&lt;p>You think like a villain, because the villain in the production environment is usually yourself. With each new door, comes the possibility of falling through into a bottomless pit.&lt;/p>
&lt;p>For me, this usually happens either immediately after I deploy, or at 2:17 in the morning a week from now. The tokenization method I wrote or imported fails. Or there&amp;rsquo;s a memory leak. Or there&amp;rsquo;s a network outage that causes a cascading failure.&lt;/p>
&lt;p>I add more logs. I debug. I run tests. I develop locally. I add more tests. I consult and I fix, and, finally, I am back on the golden path. The alerts subside. I add more torches. The system becomes stronger, more legible from me having struggled with it. My reasoning about the system becomes stronger and clearer, too. We grow together, the system in production and me.&lt;/p>
&lt;p>There is one comforting thought about being with the system: when you enter the Tombs, you are not actually truly alone. Software development is a team sport, and when we build a system as a team, we built it together, PR by PR.&lt;/p>
&lt;p>And, we are also building with others: we build on the libraries and tools others before us have hardened, with the tools that others have built. In spite of the fact that an increasing amount of code is machine-assisted, software development is still mostly humans talking to each other in language that computers also understand. As we build, alone in the dark with our torches in the Tombs, we hear the echoes of others in other hallways, each carefully closing off dead ends, placing their own torches.&lt;/p>
&lt;p>Once you illuminate enough of the system by reasoning through failure points, architectural decisions, the hallway is no longer unlimited darkness. It is an illuminated system in production. You keep &lt;a href="https://vickiboykis.com/2025/09/09/walking-around-the-app/">walking around the app&lt;/a> and harden the system. You start to get more sleep.&lt;/p>
&lt;p>At the end of the year, I made it out of the Tombs.&lt;/p>
&lt;p>Just in time to go back to staring at tqdm.&lt;/p></description></item><item><title>I want to see the claw</title><link>https://vickiboykis.com/2025/10/20/i-want-to-see-the-claw/</link><pubDate>Mon, 20 Oct 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/10/20/i-want-to-see-the-claw/</guid><description>&lt;p>I respect quality software and the people who write it. And, I’ve invested years of my life in working on becoming one of these people (&lt;a href="https://vickiboykis.com/2021/09/23/reaching-mle-machine-learning-enlightenment/">even if the journey has been long and hard&lt;/a> and has involved lots of YAML). I have seen and used code written by people who care about software correctness, who get pleasure out of defining correct interfaces, who have &lt;a href="https://vickiboykis.com/2025/09/09/walking-around-the-app/">walked the app&lt;/a>, who have spent years working towards mastery. And, I have seen and written the code the rest of us write.&lt;/p>
&lt;p>Selfishly, I want more people to aspire to mastery, because wanting to be good at writing code means you care about the code, but more importantly, about the people on the other end of the code. And, I want a world where developers care about other developers and users. I want a world of teams of seniors shipping good code and mentoring juniors and working with product managers and open-source committers. I don’t want a world where three agents in a trenchcoat from whichever AI lab is better this week quietly churn out CRUD slop that I approve out of the corner of my brain every thirty minutes.&lt;/p>
&lt;p>Perhaps the first time I understood the value of mastery and care was my first job out of college, where I, a baby analyst, handed in a report to my senior analyst to check. “Did you double-check that SQL to make sure it’s correct, ” she asked. I had not, and when she ran my data, she got different results. I learned to always check my work.&lt;/p>
&lt;p>A bit later, I worked with a senior architect trying to extract data into a relational database from a schemaless store. I watched him setting up for days, first preparing a staging environment and backups, then running and re-running again and again, carefully timestamping the files so he could roll back to an earlier run. Often, when I watched, his chair was empty - he was in meetings with the team who owned the database, making sure that when he hit their service to extract the data it didn’t fall over and impact them. Finally, when he was done, he wasn’t done. He checked it against the downstream reporting we were doing to make sure the numbers matched.&lt;/p>
&lt;p>There is no single element that comprises this quality of care, the striving towards excellence and mastery. But, as Bernoulli said when he received an anonymous solution to the &lt;a href="https://mathworld.wolfram.com/BrachistochroneProblem.html">brachistochrone&lt;/a> problem that turned out to be Isaac Newton, “I recognize the lion by its claw.”&lt;/p>
&lt;p>There is a great recent post that posits that most industry programmers are &lt;a href="https://www.seangoedecke.com/pure-and-impure-engineering/">impure engineers&lt;/a>: we will always be up against business constraints and reality, and do not have the time and luxury that pure engineers, who care about only the heart of the technical solution, to care about quality.&lt;/p>
&lt;p>But over and over in my career, as I tuned &lt;a href="https://vickiboykis.com/2024/12/16/write-code-with-your-alphabet-radio-on/"> my alphabet radio&lt;/a>, I came to consistently understand that intent matters. Even if we are, almost all of us, impure engineers and mere mortals and the sprint is ending in two days but I haven’t finished my t-shirt size medium task yet, the closer impure engineers &lt;a href="https://tidyfirst.substack.com/p/mastering-programming">strove to work at the level of purity&lt;/a> (and did their second job &lt;a href="https://jacobian.org/2017/nov/1/you-have-two-jobs/">of being easy to work with&lt;/a>), the more I wanted to both work with them and use their software.&lt;/p>
&lt;p>I recognize the claw of the lion in software like Redis, cURL, uv, Ghostty, sqlite, llama.cpp - software that is elegant, well-built, considered and thoughtful. Software that is joyful to use. Software that helps me. I want to write software like this, and I want to use software like this, and I want us as programming people to be incentivized to value the process that creates software like this.&lt;/p>
&lt;p>It has, with generative code, become harder and harder to strive towards the lions because the models produce code that is, quite literally, mid, the compressed and weighted average of every excellent Stack Overflow answer, but also questions like, “&lt;a href="https://stats.stackexchange.com/questions/185507/what-happens-if-the-explanatory-and-response-variables-are-sorted-independently">What happens if the explanatory and response variables are sorted independently before regression?&lt;/a>” It is the average of all publicly available software, updated at some cadence and mixed into training data soup and then RLHFed according to some arbitrary metrics, and as such can only offer a ghost of quality.&lt;/p>
&lt;p>We are being &lt;a href="https://daniel.haxx.se/blog/2025/08/18/ai-slop-attacks-on-the-curl-project/">overrun by mediocrity&lt;/a> and sloppiness, &lt;a href="https://github.com/ghostty-org/ghostty/pull/8289">we are trying&lt;/a> to fight against it, and yet, no matter how good the models and ecosystems around them get, I find myself more and more wanting software that I know is written with humans at the wheel - &lt;a href="https://antirez.com/news/153">we are still better&lt;/a> at reasoning, at aesthetic judgment, at architecture.&lt;/p>
&lt;p>The best code is no code, &lt;a href="https://www.stilldrinking.org/programming-sucks"> programming still sucks&lt;/a> and always will, and yet, I find myself still searching for the claw, the mark of mastery. Because that mark comes from people who want to reach other people directly. I want to see the claw, because if there is a claw, it means there is a living, breathing lion on the other side of the screen building the software that elevates us and binds us together as a community of software engineers.&lt;/p></description></item><item><title>Walking around the app</title><link>https://vickiboykis.com/2025/09/09/walking-around-the-app/</link><pubDate>Tue, 09 Sep 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/09/09/walking-around-the-app/</guid><description>&lt;p>There is a very vigorous debate happening online right now around what shape evaluation for LLM-based products should take. I don’t want to rehash all of it, other than saying that if you are building any applications with with non-deterministic components (most AI-powered apps), or applications that are data-intensive (also most AI powered apps!), and you are serious about these features in your application, you should at a very minimum be testing your online outputs in production on a daily basis. If offline testing (testing your given model outputs internally before it goes live in production) is a luxury you can also afford based on your engineering capacity and tooling maturity, even better!&lt;/p>
&lt;p>More importantly: if you are building an app, you need to also constantly be testing and touching different parts of it to see if the app flow makes sense. Look at the UI dropdowns, try the search bar. Toggle the toggles. How long does it take to load a result? What do the results look like? What colors are the buttons? How do they work on different devices? At different internet speeds? What about your payment gateway? How does onboarding look for a brand new user in North America? In New Zealand? Could you onboard from a fresh device in under three minutes? Do images load from your CDN?&lt;/p>
&lt;p>I’ve heard this kind of testing called different things in different product cultures, but the phrase that stuck with me the most was when &lt;a href="https://overreacted.io/">Dan&lt;/a> called it “walking around the app”. Walking around the app is exactly what it sounds like - checking out the sidewalks, picking up idle pieces of trash, letting the store across the street know their street lamp is out. You could call this QA, although it’s more than that. QA is scoped transactionally and takes place at the end of a PR. Walking around the app is a mindset, a process something you do every day, broadly, habitually, passionately, with interest and curiosity, because the app is the neighborhood you’re building for the people you’re building for.&lt;/p>
&lt;p>Because what happens if you don’t walk around the app regularly is that you start to get &lt;a href="https://en.wikipedia.org/wiki/Broken_windows_theory">broken windows.&lt;/a> The broken window theory posits that any given signs of decay in a town make it seem permissive to create more: littering begets more littering. People understand the impact of this: For example, in real life, in the Italian town of Brescia, &lt;a href="https://www.instagram.com/ghostpitur/">there is an anonymous guy&lt;/a> who goes around at night painting over malignant graffiti.&lt;/p>
&lt;p>In the applications we build, builders are both the graffiti creators and the graffiti erasers, and we have the responsibility to make our applications as habitable as possible for both our users and ourselves. What does it mean to clean up broken windows for users? Buttons that go nowhere, results pages that don’t render, links that are missing results, misspellings, cache misses, tiemouts, misaligned CSS. Deprecated routes. Latency.&lt;/p>
&lt;p>How about for ourselves? Apps are easier to walk around if they are consumer-facing apps that have front-ends, but APIs and backends can be walked around, too. &lt;a href="https://restfulapi.net/resource-naming/">Naming inconsistencies.&lt;/a> Broken local builds. A nested class inheritance hierarchy that’s multiple levels deep is a broken window because it’s hard to trace in our mind - humans can only really &lt;a href="https://vickiboykis.com/2021/11/07/the-programmers-brain-in-the-lands-of-exploration-and-production/">keep 2+5 things in memory.&lt;/a> A hard-coded environment variable that creates extra mental load. A method that relies on an untestable internal method that results in having to use a mock. A build process so long it causes you to lose interest and tab away to Reddit. Anything that creates cognitive overhead, &lt;a href="https://grugbrain.dev/">that Grug doesn’t like&lt;/a>, can be a broken window.&lt;/p>
&lt;p>There is not one thing that, once fixed, makes the app work well, but it becomes intuitively obvious to end-users what a well-loved app feels like, because it&amp;rsquo;s constantly being walked around and being fixed in a million small ways, every day.&lt;/p>
&lt;p>We are mere mortals and will always create bugs, especially when it comes to data work. Walking the app will not prevent this - to write code is to generate bugs and the best code is the code never written.&lt;/p>
&lt;p>But, if we are walking around our app every day, we’ll catch at least as many things as we can create. And if we have that in place, evaluation of any kind becomes infinitely easier to establish as a habit, too.&lt;/p></description></item><item><title>How big are our embeddings now and why?</title><link>https://vickiboykis.com/2025/09/01/how-big-are-our-embeddings-now-and-why/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/09/01/how-big-are-our-embeddings-now-and-why/</guid><description>&lt;p>A few years ago, I wrote&lt;a href="https://vickiboykis.com/what_are_embeddings/"> a paper on embeddings&lt;/a>. At the time, I wrote that 200-300 dimension embeddings were fairly common in industry, and that adding more dimensions during training would create diminishing returns for the effectiveness of your downstream tasks (classification, recommendation, semantic search, topic modeling, etc.)&lt;/p>
&lt;p>I wrote the paper to be resilient to changes in the industry since it focuses on fundamentals and historical context rather than libraries or bleeding edge architectures, but this assumption about embedding size is now out of date and worth revisiting in the context of growing embedding dimensionality and embedding access patterns.&lt;/p>
&lt;p>As a quick review, embeddings are compressed numerical representations of a variety of features (text, images, audio) that we can use for machine learning tasks like search, recommendations, RAG, and classification. The size of the embedding is how many features our item has.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/embeddings-sizes/img_1.png" width="400">
&lt;/figure>

&lt;p>For example, let’s say we have two butterflies. We can compare them among many dimensions, including wingspan, wing color, number of antennae. Let’s say we have a 3-dimensional feature for a butterfly, it might look like this.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/embeddings-sizes/img_2.png" width="400">
&lt;/figure>

&lt;p>Doing a visual vibe scan, we can eyeball the data and see that &lt;code>butterfly_1&lt;/code> and &lt;code>butterfly_3&lt;/code> are more similar to each other than to &lt;code>butterfly_2&lt;/code> because their features are closer together.&lt;/p>
&lt;p>But butterflies are 3-dimensional animals and some of these features are numerical. When we talk about embeddings in industry these days, we generally mean trying to understand the properties of text, so how does this concept work with words? We can’t directly compare words like “bird” and “butterfly” or &amp;ldquo;fly&amp;rdquo; in a given text but we can compare their numerical representations if we map them into the same shared space. We can see that &amp;ldquo;bird&amp;rdquo; and &amp;ldquo;flying&amp;rdquo; are more similar to each other than to &amp;ldquo;dog&amp;rdquo; through their numerical representations.&lt;/p>
&lt;p>Intuitively, we know this is true, but how do we artificially create these relationships?&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/embeddings-sizes/img_3.png" width="400">
&lt;/figure>

&lt;p>There are different ways of creating text embeddings from a given word, all of which rely on analyzing that word in relation to the words around it in a given corpus of data. We can use traditional count-based approaches like TF-IDF based on term frequency in documents, or statistical approaches like PCA or LSA.&lt;/p>
&lt;p>With the advent of deep learning models, we started learning representations generated from models like Word2Vec that maximized the probability that a left-out word would be next to other given words in the training dataset.&lt;/p>
&lt;p>When we learn the embedding representation of a given word using probabilistic models, we are comparing how similar these words are to other words. Each feature is not an explicit feature like “wing color”, but rather a vibe-based latent representation in the latent space that doesn’t have a clear explanation.&lt;/p>
&lt;p>For example, one dimension might be “this word is an action word” or maybe “this word is related to other words about food”, but we generally don’t know exactly what the model thinks each feature represents. In fact, this is a &lt;a href="https://www.anthropic.com/news/mapping-mind-language-model">fascinating area of study&lt;/a> we are just starting to understand how these latent representations work through ideas like &lt;a href="https://vgel.me/posts/representation-engineering/">control vectors&lt;/a>, a concept that Anthropic explored in the famous Golden Gate Claude paper.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/embeddings-sizes/img_4.png" width="400">
&lt;/figure>

&lt;p>When we train a model, embedding size is initialized as a hyperparameter before model training, and we iterate on the size depending on our downstream evaluations after training. Picking the right hyperparameter is (&lt;a href="https://archives.argmin.net/2017/12/05/kitchen-sinks/">alchemy&lt;/a>) a combination of art and science and &lt;a href="https://docs.aws.amazon.com/wellarchitected/latest/generative-ai-lens/genperf04-bp02.html">depends on optimizing &lt;/a>training throughput, final embedding storage size, and the performance of the embedding on your downstream task both qualitatively and wrt to latency of performing search on embeddings of different sizes.&lt;/p>
&lt;p>Since previous generations of models were smaller and trained in-house, hyperparameters were usually not published by companies, and as such, there was no standard agreement on embedding size. We generally, as an industry, understood &lt;a href="https://aclanthology.org/I17-2006/">that somewhere around 300 dimensions for a given embedding model might be enough to compress all the nuance of a given textual dataset&lt;/a>. 300 was the number of dimensions typically used by earlier models like Word2Vec and &lt;a href="https://nlp.stanford.edu/projects/glove/">GloVE&lt;/a>.&lt;/p>
&lt;p>After the publication of the attention paper &lt;a href="https://arxiv.org/abs/1810.04805">BERT was released&lt;/a> in 2018. This model’s architecture &lt;a href="https://huggingface.co/google-bert/bert-base-uncased">introduced embeddings of 768 dimensions.&lt;/a> Although previous &lt;a href="https://svail.github.io/persistent_rnns/">RNN&lt;/a> and LSTM models had been trained on GPUs, BERT was one of the first larger embedding models to be trained on GPUs (&lt;a href="https://timdettmers.com/2018/10/17/tpus-vs-gpus-for-transformers-bert/">and TPUs&lt;/a>), which meant that GPU optimization now became increasingly important.&lt;/p>
&lt;p>The key behind &lt;a href="https://jiegroup-genai.readthedocs-hosted.com/en/latest/resource/">training Transformer models efficiently&lt;/a> is the ability to &lt;a href="https://horace.io/brrr_intro.html">efficiently move data onto GPU&lt;/a> and parallelize their matrix multiplication operations between several pipelines, aka attention heads, where each attention head can focus on understanding and defining a different part of the embedding feature space. As such, the embedding needs to be able to be partitioned evenly between the number of attention heads.&lt;/p>
&lt;p>BERT has 12 attention heads, so 768 dimensions was selected from a combination of trial and error and efficiently parallelizing computation to “attend” to different parts of the feature space, meaning that, in model training, each head operates on a 64-dimensional subspace of the original 768-dimensional input embedding. This itself comes from the Transformer paper, where each sub-embedding size per head &lt;a href="https://arxiv.org/pdf/1706.03762">is commonly chosen as 64&lt;/a>.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/embeddings-sizes/img_5.png" width="400">
&lt;/figure>

&lt;p>As a result, many BERT-style models, and related model families, used 768 as a baseline for embedding dimensions. Although it had a much larger training dataset, &lt;a href="https://www.alignmentforum.org/posts/BMghmAxYxeSdAteDc/an-exploration-of-gpt-2-s-embedding-weights">GPT-2 also &lt;/a>implemented 768. And, although CLIP uses an embedding size of &lt;strong>768&lt;/strong> as derived from the &lt;a href="https://arxiv.org/abs/2010.11929">Vision Transformer &lt;/a> architecture that CLIP uses for its image encoder, and for consistency, the text encoder also uses this dimension size[^1].&lt;/p>
&lt;p>Even though training cycles for BERT were fairly small (4 days for the original BERT model) compared to the months-long pretraining processes that LLMs require these days, it was still hard computationally to infer these embedding sizes, even with GPU optimizations. &lt;a href="https://arxiv.org/abs/1908.10084">For BERT, for example, &lt;/a>&lt;/p>
&lt;pre>&lt;code>Finding the most similar pair in a collection of 10,000 sentences requires about 50 million inference computations (~65 hours).
&lt;/code>&lt;/pre>
&lt;p>So in 2019, &lt;a href="https://github.com/UKPLab/sentence-transformers">UKP created&lt;/a> and optimized a model, SBERT, focusing on &lt;a href="https://arxiv.org/abs/1908.10084">sentence-level representation&lt;/a>s, which unlocked faster inference, and &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2">Minilm&lt;/a> became the standard baseline model for document-level chunk embeddings at 384 dimensions and still performs extremely well for a number of tasks.&lt;/p>
&lt;p>What else changed was the rise of the open-source HuggingFace platform where model artifacts could be shared. Previously, they were either hosted in undiscoverable repos in GitHub or locked away in scientific repositories missing metadata. The rise of HuggingFace as a platform and the &lt;code>transformers&lt;/code> library as a centralized API into training and inference for PyTorch-based models unlocked a level of standardization: now, many more people could simply download and replicate the existing models instead of rebuilding arcane research code from scratch. This led to much more standard embedding and architecture sizes used both in academia and industry.&lt;/p>
&lt;p>Although 768 held for a fairly long time, with upwards pressure from market competition on model sizes as the result of the release of GPT-2 (the backbone of ChatGPT), we started seeing standard embedding sizes increase.&lt;/p>
&lt;p>Part of this was the fact that companies no longer had to infer their own embeddings. Previously, embeddings had been custom-learned in labs or in&lt;a href="https://arxiv.org/abs/2007.03634"> companies whose core competency was processing large amounts of information&lt;/a> that could be culled in retrieval through search or recommendation.&lt;/p>
&lt;p>But now, with the advent of ChatGPT and API-based model availability, embeddings became a commodity available with a GET request, and the most popular embeddings&lt;a href="https://openai.com/index/new-and-improved-embedding-model/"> became OpenAI’s&lt;/a>, which used 1536 dimensions, in line with GPT-3, which used much more training data than any previous model. (&lt;a href="https://lambda.ai/blog/demystifying-gpt-3">570GB versus GPT-2 which was 40GB, and 96 attention heads.)&lt;/a>&lt;/p>
&lt;p>It used to be the case that people only learned or fine-tune their own embeddings, but now all of the major AI providers have their own hosted sets of embeddings. Particularly common ones are OpenAI, with Cohere and Nomic close behind. Google also &lt;a href="https://arxiv.org/abs/2503.07891">recently released embeddings for Gemini&lt;/a>, with both API and local versions being available.&lt;/p>
&lt;p>In addition to standardization via HuggingFace and APIs, MTEB, &lt;a href="https://huggingface.co/blog/mteb">which benchmarks embeddings publicly&lt;/a>, has grown as a resource where you can compare embedding models.&lt;/p>
&lt;p>If we look at MTEB, &lt;a href="https://huggingface.co/spaces/mteb/leaderboard">the embedding benchmark today&lt;/a>, we can see embedding sizes anywhere from our classic 768 to 4096 and beyond (note all of them are neatly divisible by 2, in line with architectural constraints)&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/embeddings-sizes/img_6.png" width="400">
&lt;/figure>

&lt;p>Finally, another consideration has been the change in the landscape with vector databases, which used to be huge, are now becoming more commoditized features in software and platforms like &lt;a href="https://github.com/pgvector/pgvector">Postgres&lt;/a>, &lt;a href="https://aws.amazon.com/s3/features/vectors/">S3&lt;/a>, and &lt;a href="https://www.elastic.co/docs/explore-analyze/machine-learning/nlp/ml-nlp-text-emb-vector-search-example">Elasticsearch&lt;/a>, leading to less out of the box necessity in vector storage and performance tuning.&lt;/p>
&lt;p>With the constant upward pressure on embedding sizes not limited by having to train models in-house, it’s not clear where we’ll slow down: Qwen-3, along with many others is already at 4096. It does appear that we are beginning to trim down on growth. OpenAI implemented a concept called &lt;a href="https://arxiv.org/abs/2205.13147">matryoshka representation learning &lt;/a>that trains embeddings with the “most important” concepts first, and additional embeddings adding incremental gains, meaning that an embedding learned in 1024 dimensions might be just as useful in 64, as long as the first 64 &lt;a href="https://huggingface.co/blog/matryoshka">dimensions compress most of the information efficiently&lt;/a>, and also making sure we re-normalize them. There are also research that indicates that not all embeddings are necessary in retrieval + search tasks, and that we can &lt;a href="https://arxiv.org/abs/2508.17744">in fact, in some cases, truncate 50% of them anyway. &lt;/a>&lt;/p>
&lt;p>It’s been fascinating to watch the rise of dimensionality and the constant struggle in tradeoffs between creating ever-larger models and then the need for those embedding sizes to perform at inference and storage time, aka continuously coming up against the age-old machine learning tradeoff of recall versus precision, and the engineering tradeoff of hardware limitations versus software versus business considerations.&lt;/p>
&lt;p>As these architectures have matured and internal models have become inference points of public-facing paid APIs, embeddings have gone from a mysterious byproduct of internal machine learning systems in companies with lots of data to a commodity used across many AI-powered applications across numerous application stacks.&lt;/p>
&lt;p>All in all, it’s been so much fun watching this space evolve, even if it means it looks like I’ll be having to update my paper over and over again.&lt;/p></description></item><item><title>Enabling Hugo static site search with Lunr.js</title><link>https://vickiboykis.com/2025/08/08/enabling-hugo-static-site-search-with-lunr.js/</link><pubDate>Fri, 08 Aug 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/08/08/enabling-hugo-static-site-search-with-lunr.js/</guid><description>&lt;p>There comes a time in every woman&amp;rsquo;s life when she only wants one thing: for her mininmal static site to finally have some of the same features that dynamic blogging platforms do, namely search.&lt;/p>
&lt;p>So now I&amp;rsquo;ve implemented search on this blog, you should see it in the top right and the results render in-line in the drop-down.&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/search_bar.png" width="400">
&lt;/figure>

&lt;figure>&lt;img src="https://vickiboykis.com/images/search_bar_results.png" width="400">
&lt;/figure>
&lt;/p>
&lt;p>I&amp;rsquo;m going to tweak the results because the boosting is not quite where I&amp;rsquo;d like it to be yet, but try it out!&lt;/p>
&lt;p>Some implmenetation details:&lt;/p>
&lt;p>My blog runs on Hugo, so I looked for solutions that work with Jekyll/Hugo. I picked very fun and lightweight &lt;a href="https://lunrjs.com/">lunr.js&lt;/a>, which is a single file that implements &lt;a href="https://github.com/olivernn/lunr.js/blob/aa5a878f62a6bba1e8e5b95714899e17e8150b38/lunr.js#L319">BM-25 search&lt;/a> based on Solr primitives (it&amp;rsquo;s called Lunr because it&amp;rsquo;s smaller and &amp;ldquo;less bright&amp;rdquo; than Solr, which I love.)&lt;/p>
&lt;p>To create the index that Lunr uses for search, I had Claude write a quick Python script that traverses my post directories and outputs a single JSON file of all indexes posts.&lt;/p>
&lt;p>The entries look like this:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-json" data-lang="json">&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;id&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2018/07/23/good-small-datasets/&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;Good small datasets&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;ref&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2018/07/23/good-small-datasets/&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;content&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;John Lavery, The Chess Players (1929) I&amp;#39;ve been working on a project that, like most projects, requires testing with a dataset. My personal criteria are: + Relatively small size (Less than 100 KB, or 100ish rows) + At least 5-6 features (columns) + Should have both numerical and text-based features + Ideally a range of different kinds of numbers + Has good documentation + Is open and available to the public + Relatively available for both R and as individual CSV files or Python imports (APIs and&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;summary&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;John Lavery, The Chess Players (1929) I&amp;#39;ve been working on a project that, like most projects, requires testing with a dataset. My personal criteria are: + Rela...&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;date&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2018-07-23T00:00:00Z&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;creator&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;site&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;url&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;/blog/2018-07-23-small-datasets&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>following this logic:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">return&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;id&amp;#39;&lt;/span>: slug,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;title&amp;#39;&lt;/span>: frontmatter&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#39;title&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Untitled&amp;#39;&lt;/span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;ref&amp;#39;&lt;/span>: &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>&lt;span style="color:#e6db74">{&lt;/span>slug&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>, 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;content&amp;#39;&lt;/span>: text[:&lt;span style="color:#ae81ff">500&lt;/span>], 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;summary&amp;#39;&lt;/span>: text[:&lt;span style="color:#ae81ff">160&lt;/span>] &lt;span style="color:#f92672">+&lt;/span> &lt;span style="color:#e6db74">&amp;#39;...&amp;#39;&lt;/span> &lt;span style="color:#66d9ef">if&lt;/span> len(text) &lt;span style="color:#f92672">&amp;gt;&lt;/span> &lt;span style="color:#ae81ff">160&lt;/span> &lt;span style="color:#66d9ef">else&lt;/span> text,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;date&amp;#39;&lt;/span>: frontmatter&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#39;date&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;creator&amp;#39;&lt;/span>: frontmatter&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#39;creator&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#39;site&amp;#39;&lt;/span>: frontmatter&lt;span style="color:#f92672">.&lt;/span>get(&lt;span style="color:#e6db74">&amp;#39;site&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>I keep that code separate from my Hugo site, in a &lt;code>search&lt;/code> directory that&amp;rsquo;s initialized as a uv project:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>├── README.md
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── create_index.py
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>├── pyproject.toml
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>└── uv.lock
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>and generates the lunr.json index into my &lt;code>static&lt;/code> repo.&lt;/p>
&lt;p>This code is run at build time in my GitHub Actions before the &lt;code>hugo build&lt;/code> step:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>- name: Install uv
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> uses: astral-sh/setup-uv@v4
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>- name: Generate search index
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>run: |
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> echo &lt;span style="color:#e6db74">&amp;#34;Generating search index...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> uv run search/create_index.py lunr-index.json
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> echo &lt;span style="color:#e6db74">&amp;#34;Search index generated successfully&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Generating the script correctly took several passes because Claude missed some data cleaning nuances (I don&amp;rsquo;t want any HTML output, I wanted slightly longer summarization outputs, I didn&amp;rsquo;t want to include base pages, etc.), but now it regenerates the index in &amp;lt;5 sec every time.&lt;/p>
&lt;p>Then, all of that is rendered in my navbar via Hugo templates, namely by including the logic to serve the search bar in the &lt;code>header&lt;/code> partial. I&amp;rsquo;m abstracting away a ton of &amp;ldquo;draw the owl&amp;rdquo; features here, such as doing passes to make sure the CSS for the search bar matches my main blog theme, and rendering the dropdown results.&lt;/p>
&lt;p>LLMs helped move this feature along a fair bit here because:&lt;/p>
&lt;ol>
&lt;li>It&amp;rsquo;s an extremely small feature where I can clearly test the output and see the generated code&lt;/li>
&lt;li>The context window for changes was fairly small&lt;/li>
&lt;li>The biggest lift was in automating index creation and getting the right syntax for the Python script and the data was easy to check.&lt;/li>
&lt;li>It has a tight local testing loop: &lt;code>hugo build&lt;/code> and &lt;code>hugo serve&lt;/code> are extremely fast locally and offer 95% parity to the served site in GH pages.&lt;/li>
&lt;/ol>
&lt;p>This has been a fun addition and I&amp;rsquo;m excited to do more such as potentially add typeahead suggestions and tune the query.&lt;/p></description></item><item><title>My favorite use-case for AI is writing logs</title><link>https://vickiboykis.com/2025/07/16/my-favorite-use-case-for-ai-is-writing-logs/</link><pubDate>Wed, 16 Jul 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/07/16/my-favorite-use-case-for-ai-is-writing-logs/</guid><description>&lt;p>One of my favorite AI dev products today is &lt;a href="https://plugins.jetbrains.com/plugin/14823-full-line-code-completion">Full Line Code Completion&lt;/a> in PyCharm (bundled with the IDE since late 2023). It’s extremely well-thought out, unintrusive, and makes me a more effective developer. Most importantly, it still keeps me mostly in control of my code. I’ve now used it in &lt;a href="https://vickiboykis.com/2025/01/23/you-can-just-hack-on-atproto/">GoLand&lt;/a> as well. I’ve been a happy JetBrains customer for a long time now, and it’s because they ship features like this.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/inline_completion.png" width="400">
&lt;/figure>

&lt;p>I frequently work with code that involves sequential data processing, computations, and async API calls across multiple services. I also deal with a lot of precise vector operations in PyTorch that &lt;a href="https://medium.com/@NoamShazeer/shape-suffixes-good-coding-style-f836e72e24fd">shape suffixes&lt;/a> don’t always illuminate. So, print statement debugging and writing good logs has been a critical part of my workflows for years.&lt;/p>
&lt;p>As Kerningan and Pike say in &lt;em>The Practice of Programming&lt;/em> about preferring print to debugging,&lt;/p>
&lt;blockquote>
&lt;p>…[W]e find stepping through a program less productive than thinking harder and adding output statements and self-checking code at critical places. Clicking over statements takes longer than scanning the output of judiciously-placed displays. It takes less time to decide where to put print statements than to single-step to the critical section of code, even assuming we know where that is.&lt;/p>&lt;/blockquote>
&lt;p>One thing that is annoying about logging is that &lt;a href="https://docs.python.org/3/tutorial/inputoutput.html#formatted-string-literals">f-strings&lt;/a> are great but become repetitive to write if you have to write them over and over, particularly if you’re formatting values or accessing elements of data frames, lists, and nested structures, and particularly if you have to scan your codebase to find those variables. Writing good logs is important but also breaks up a debugging flow.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> loguru &lt;span style="color:#f92672">import&lt;/span> logger
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>logger&lt;span style="color:#f92672">.&lt;/span>info(&lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#39;Adding a log for &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>your_variable&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> and &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>len(my_list)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74"> and &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>df&lt;span style="color:#f92672">.&lt;/span>head(&lt;span style="color:#ae81ff">0&lt;/span>)&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#39;&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The amount of cognitive overhead in this deceptively simple log is several levels deep: you have to first stop to type &lt;a href="logger.info">logger.info&lt;/a> (or is it &lt;a href="logging.info">logging.info&lt;/a>? I use both loguru and logger depending on the codebase and end up always getting the two confused.) Then, the parentheses, the f-string itself, and then the variables in brackets. Now, was it &lt;code>your_variable&lt;/code> or &lt;code>your_variable_with_edits&lt;/code> from five lines up? And what’s the syntax for accessing a subset of &lt;code>df.head&lt;/code> again?&lt;/p>
&lt;p>With full-line-code completion, JetBrains’ model auto-infers the log completion from the surrounding text, with a limit of 384 characters. Inference starts by taking the file extension as input, combined with the filepath, and then the part of the code above the input cursor, so that all of the tokens in the file extension, plus path, plus code above the caret, fit. Everything is combined and sent to the model in the prompt.&lt;/p>
&lt;p>The constrained output good enough most of the time that it speeds up my workflow a lot. An added bonus is that it often writes a much clearer log than I, a lazy human, would write, logs. Because they’re so concise, I often don’t even remove when I’m done debugging because they’re now valuable in prod.&lt;/p>
&lt;p>Here’s an example from a side project I’m working on. In the first case, the is autocomplete inferring that I actually want to check the Redis URL, a logical conclusion here.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>redis &lt;span style="color:#f92672">=&lt;/span> aioredis&lt;span style="color:#f92672">.&lt;/span>from_url(settings&lt;span style="color:#f92672">.&lt;/span>redis_url, decode_responses&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;figure>&lt;img src="https://vickiboykis.com/images/redis_completion.png" width="400">
&lt;/figure>

&lt;p>In this second case, it assumes I’d like the shape of the dataframe, also a logical conclusion because the profiling dataframes is a very popular use-case for logs.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> pandas &lt;span style="color:#f92672">import&lt;/span> DataFrame
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>data &lt;span style="color:#f92672">=&lt;/span> [[&lt;span style="color:#e6db74">&amp;#39;apples&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;2.5&amp;#39;&lt;/span>], [&lt;span style="color:#e6db74">&amp;#39;oranges&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;3.0&amp;#39;&lt;/span>], [&lt;span style="color:#e6db74">&amp;#39;bananas&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;4.0&amp;#39;&lt;/span>]]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>column_names &lt;span style="color:#f92672">=&lt;/span> [&lt;span style="color:#e6db74">&amp;#39;Fruit&amp;#39;&lt;/span>, &lt;span style="color:#e6db74">&amp;#39;Quantity&amp;#39;&lt;/span>]
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>df &lt;span style="color:#f92672">=&lt;/span> pd&lt;span style="color:#f92672">.&lt;/span>DataFrame(data, columns&lt;span style="color:#f92672">=&lt;/span>column_names)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;figure>&lt;img src="https://vickiboykis.com/images/df_completion.png" width="400">
&lt;/figure>

&lt;p>&lt;strong>Implementation&lt;/strong>&lt;/p>
&lt;p>The coolest part of this feature is that the inference model is entirely local to your machine.&lt;/p>
&lt;p>This enforces a few very important requirements on the development team, namely compression and speed.&lt;/p>
&lt;ul>
&lt;li>The model has to be small enough to bundle with the IDE for desktop memory footprints (already coming in at around ~1GB for the MacOS binary), which eliminates 99% of current LLMs&lt;/li>
&lt;li>And yet, the model has to be smart enough to interpolate lines of code from its small context window&lt;/li>
&lt;li>The local requirement eliminates any model inference engines like vLLM, SGLM, or Ray which implement KV cache optimization like &lt;a href="https://huggingface.co/docs/text-generation-inference/conceptual/paged_attention">PagedAttention&lt;/a>&lt;/li>
&lt;li>It has to be a model that’s fast enough to produce &lt;a href="https://bentoml.com/llm/inference-optimization/llm-inference-metrics#latency">its first token&lt;/a> (and all subsequent tokens) extremely quickly,&lt;/li>
&lt;li>Finally, it has to be optimized for Python specifically since this model is only available in PyCharm&lt;/li>
&lt;/ul>
&lt;p>This is drastically different from the current assumptions around how we build and ship LLMs: that they need to be extremely large, general-purpose models served over proprietary APIs. we We find ourselves in a very constrained solution space because we no longer have to do all this other stuff that generalized LLMs have to do: write poetry, reason through math problems, act as OCR, offer code canvas templating, write marketing emails, and generate Studio Ghibli memes.&lt;/p>
&lt;p>All we have to do is train a model to complete a single line of code with a context of 384 characters! And then compress the crap out of that model so that it can fit on-device and perform inference.&lt;/p>
&lt;p>So how did they do it? Luckily, JetBrains &lt;a href="https://arxiv.org/abs/2405.08704v1">published a paper on this&lt;/a>, and there are a bunch of interesting notes. The work is split into two parts, model training, and then the integration of the plugin itself.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/jetbrains_model.png" width="400">
&lt;/figure>

&lt;p>The &lt;strong>model is trained&lt;/strong> is done in PyTorch and then quantized.&lt;/p>
&lt;ul>
&lt;li>First, they train a GPT-2 style Transformer &lt;a href="https://cameronrwolfe.substack.com/p/decoder-only-transformers-the-workhorse">decoder-only&lt;/a> model of 100 million parameters, including a tokenizer (aka autoregressive text completion like you’d get from Claude, OpenAI, Gemini, and friends these days). They later changed this architecture to Llama2 after the success of the growing llama.cpp and &lt;a href="https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/">GGUF community&lt;/a>, as well as the better performance of the newer architecture.&lt;/li>
&lt;li>The original dataset they used to train the model was a subset of &lt;a href="https://huggingface.co/datasets/bigcode/the-stack">The Stack&lt;/a>, a code dataset across permissive licenses with 6TB of code in 30 programming languages
&lt;ul>
&lt;li>The initial training set was “just” 45 GB and in preparing the data for training, in data cleaning, for space constraints, they remove all code comments in the training data specifically to focus on code generation&lt;/li>
&lt;li>They do a neat trick for tokenizing Python (using a &lt;a href="https://huggingface.co/learn/llm-course/en/chapter6/5">BPE-style tokenizer&lt;/a> optimized for character pairs rather than bytes, since code is made up of smaller snippets and idioms than natural language text) which is indentation-sensitive, by converting spaces and tabs to start-end &lt;code>&amp;lt;SCOPE_IN&amp;gt;&amp;lt;SCOPE_OUT&amp;gt;&lt;/code> tokens, to remove tokens that might be different only because they have different whitespacing. They ended up going with a tokenizer vocab size of 16,384.&lt;/li>
&lt;li>They do another very cool step in training which is to remove imports because they find that developers usually add imports in after writing the actual code, a fact that the model needs to anticipate&lt;/li>
&lt;/ul>
&lt;/li>
&lt;li>They then split into train/test for evaluation and trained for several days on 8 NVidia A100 GPU with a&lt;a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html"> cross-entropy loss objective function&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>Because they were able to so clearly focus on the domain and understanding of how code inference works, focus on a single programming languages with its own nuances, they were able to make the training data set smaller, make the output more exact, and spend much less time and money training the model.&lt;/p>
&lt;p>The &lt;strong>actual plugin&lt;/strong> that’s included in PyCharm “is implemented in Kotlin, however, it utilizes an additional native server that is run locally and is implemented in C++” for serving the inference tokens.&lt;/p>
&lt;p>In order to prepare the model for serving, they:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Quantized it from &lt;a href="https://developer.nvidia.com/blog/achieving-fp32-accuracy-for-int8-inference-using-quantization-aware-training-with-tensorrt/">FP32 to INT8&lt;/a> which compressed the model from 400 MB to 100 MB&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Prepared as a served ONNX RT artifact, which allowed them to use CPU inference, which removed the CUDA overhead tax(later, they switched to using llama.cpp to serve the llama model architecture for the server.&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Finally, in order to perform inference on a sequence of tokens, they use &lt;a href="https://d2l.ai/chapter_recurrent-modern/beam-search.html">beam search. &lt;/a>Generally, Transformer-decoders are trained on predicting the next token in any given sequence so any individual step will give you a list of tokens along with their ranked probabilities (cementing my long-running theory that everything is a search problem).&lt;/p>
&lt;p>Since this is computationally impossible at large numbers of tokens, a number of solutions exist to solve the problem of decoding optimally. Beam search creates a graph of all possible returned token sequences and expands at each node with the highest potential probability, limiting to &lt;code>k&lt;/code> possible beams. In FLCC, the max number of beams, k, is 20, and they chose to limit generation to collect only those hypotheses that end with a newline character.&lt;/p>
&lt;p>Additionally they made use of a number of caching strategies, including initializing the model at 50% of total context - i.e. it starts by preloading ~192 characters of previous code, to give you space to either go back and edit old code, which now no longer has to be put into context, or to add new code, which is then added to the context. That way, if your cursor clicks on code you&amp;rsquo;ve already written, the model doesn&amp;rsquo;t need to re-infer.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>There are a number of other very cool architecture and model decisions from the paper that are very worth reading and that show the level of care put into the input data, the modeling, and the inference architecture.&lt;/p>
&lt;p>The bottom line is that, for me as a user, this experience is extremely thoughtful. It has saved me countless times both in print log debugging and in the logs I ship to prod.&lt;/p>
&lt;p>In LLM land, there’s both a place for large, generalist models, and there’s a place for small models, and while much of the rest of the world writes about the former, I’m excited to also find more applications built with the latter.&lt;/p></description></item><item><title>20 years of YC</title><link>https://vickiboykis.com/2025/03/17/20-years-of-yc/</link><pubDate>Mon, 17 Mar 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/03/17/20-years-of-yc/</guid><description>&lt;p>I &lt;a href="https://news.ycombinator.com/item?id=43332658">saw recently&lt;/a> that YCombinator celebrated its 20th anniversary.&lt;/p>
&lt;p>&lt;a href="https://www.paulgraham.com/hackernews.html">Hacker News is slightly younger&lt;/a>, but to me the two go hand in hand.&lt;/p>
&lt;p>As far as I can tell, I actively started reading Hacker News around 2011. I don&amp;rsquo;t remember how I heard about it. It was probably on Reddit or Digg. Once I found it, I started reading every day, mostly because the comment sections were so full of smart people in tech.&lt;/p>
&lt;p>At the time I was working &lt;a href="https://increment.com/planning/the-best-laid-plans-tech-careers/">as a data analyst&lt;/a>, mostly with SQL and Excel. I understood that I would need to learn much more to move into engineering, which I was getting excited about, but didn&amp;rsquo;t quite understand how to bridge the gap.&lt;/p>
&lt;p>When I first started reading HN, I didn&amp;rsquo;t understand 99% of the linked content, the jargon in the discussions, or the companies mentioned, but I was determined to learn. My general approach was to skim the top headlines and headlines that I could understand, read the post, and then read the HackerNews discussion.&lt;/p>
&lt;p>As I read the discussion, I would come upon terms I had no idea about: Big O notation, collaborative filtering, caching, Hindley-Milner type systems, lambda architectures, CI/CD pipelines, cryptography, generics, build or buy, B-trees, Bloom filters, trunk-based development, red/green deploys. I would look all of them up and go down countless rabbit holes.&lt;/p>
&lt;p>HN is a bit bigger these days, but, thanks to contributors and the &lt;a href="https://www.newyorker.com/news/letter-from-silicon-valley/the-lonely-work-of-moderating-hacker-news">tireless efforts of dang&lt;/a>, still so perfectly ambiently captures what tech is thinking about in the current moment. If you don&amp;rsquo;t have a supportive community at work － like &lt;a href="https://www.ethanrosenthal.com/2023/01/10/data-scientists-alone/">many of us data scientists&lt;/a> － HN was the perfect ambient watercooler to be near senior technically excellent people in the industry.&lt;/p>
&lt;p>In the first few years, I read maybe 2 links out of the 30 ever on the front page. But, because I read HN 4-5 times a day, after a few years, things started falling into place.&lt;/p>
&lt;p>&lt;a href="https://vickiboykis.com/2022/11/10/how-i-learn-machine-learning/">Other than constantly studying&lt;/a>, Hacker News was one of the main things that pulled me up out of my bootstraps from a non-technical major afraid of SSHing into a server, to a person confidently &lt;a href="https://vickiboykis.com/2021/06/20/the-ritual-of-the-deploy/">deploying code to millions of users&lt;/a>. There is only one other thing that has brought as much value to my career in tech, and that was the connections I made on old Twitter.&lt;/p>
&lt;p>Over time, I gained enough confidence to start blogging on technical topics, HN gave me something new: being aware that HN might pick apart that content forced me to learn to write clearly and precisely for a highly technical, educated audience.&lt;/p>
&lt;p>My favorite feeling is when, via a link, something I have written or created has made an impact on someone out there in the ether. Or, from the reader side, when I read something that makes me go &amp;ldquo;I&amp;rsquo;m not alone. This person has also thought about these problems,&amp;rdquo; and I can&amp;rsquo;t wait to see if there are any good discussions on the topic.&lt;/p>
&lt;p>YC is proud of the companies it launched: Reddit, Airbnb, Instacart, Doordash, Stripe. Ironically, the biggest multiples it&amp;rsquo;s generated for the industry have been from a low-dazzle text forum written in Arc Lisp.&lt;/p>
&lt;p>Thank you for everything, HN.&lt;/p></description></item><item><title>You can just hack on ATProto</title><link>https://vickiboykis.com/2025/01/23/you-can-just-hack-on-atproto/</link><pubDate>Thu, 23 Jan 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/01/23/you-can-just-hack-on-atproto/</guid><description>&lt;figure>&lt;img src="https://raw.githubusercontent.com/veekaybee/gitfeed/refs/heads/main/static/android-chrome-512x512.png" width="200">
&lt;/figure>

&lt;h5 id="icon-by-iconixar">&lt;a href="https://www.freepik.com/author/user8839173/icons">Icon by iconixar&lt;/a>&lt;/h5>
&lt;p>Since I signed up for Bluesky last year,
I&amp;rsquo;ve been wanting to make something using the &lt;a href="https://atproto.com/">AT Protocol&lt;/a> that the platform is built on top of.&lt;/p>
&lt;p>I finally had a chance to do it over the holiday break and built &lt;a href="https://github.com/veekaybee/gitfeed">GitFeed&lt;/a>, a small Go app that filters the Bluesky network firehose by posts that have GitHub links and renders them into a refreshable, ephemeral feed.&lt;/p>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/veekaybee/gitfeed/refs/heads/main/ui.png" width="400">
&lt;/figure>

&lt;p>You can &lt;a href="http://www.gitfeed.me/">see GitFeed here&lt;/a>, but it might not actually be running since I didn&amp;rsquo;t build it for scale, and it&amp;rsquo;s running on a tiny &lt;a href="https://www.digitalocean.com/products/droplets">DigitalOcean Droplet&lt;/a> that&amp;rsquo;s specced out at &lt;code>1 GB Memory / 1 Intel vCPU / 35 GB Disk&lt;/code> , is entirely unloadbalanced, un-load-tested and has zero observability or alerting when my very scientific process of &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/.github/workflows/restart.sh#L11">running nohup fails.&lt;/a>&lt;/p>
&lt;p>The cool part of an open protocol is that you can also &lt;a href="https://github.com/veekaybee/gitfeed">just clone the repo&lt;/a> and run a hosted version of it yourself.&lt;/p>
&lt;h2 id="atproto">ATProto&lt;/h2>
&lt;p>Bluesky is both a decentralized protocol, called AtProto and a social media company, called Bluesky plc that develops both the protocol and one of the Apps running on the protocol, Bluesky.&lt;/p>
&lt;p>There is a lot more in the &lt;a href="https://arxiv.org/abs/2402.03239">AT Protocol Paper&lt;/a>, but the basics are this:&lt;/p>
&lt;ul>
&lt;li>
&lt;p>Data for each user is hosted individually in a &lt;a href="https://github.com/bluesky-social/pds">PDS&lt;/a> - a personal data store - which is a database storing a collection of records that are cryptographically signed and encoded in &lt;a href="https://ipld.io/specs/codecs/dag-cbor/spec/">DAG-CBOR&lt;/a> format. The record schema is defined by a &lt;a href="https://atproto.com/guides/lexicon">&amp;ldquo;lexicon&amp;rdquo;&lt;/a>, which is dependent on the type of data being transferred. The records themselves are stored in a &lt;a href="https://inria.hal.science/hal-02303490/document">merkle search tree&lt;/a> structure which makes it easy to rebalance records efficiently both on read and write. Its default storage engine &lt;a href="https://bsky.app/profile/jacob.gold/post/3lbar43hgx22t">is SQLite.&lt;/a>&lt;/p>
&lt;/li>
&lt;li>
&lt;p>Each user has a PDS, exposed as a web service to network indexers. There is an indexer, the relay, which &amp;ldquo;scrapes&amp;rdquo; but really hits all the PDSes in the network for updates. Right now there is only one true relay, run by Bluesky the company, and there is a lot of debate around what that means for a decentralized network and &lt;a href="https://alice.bsky.sh/post/3laega7icmi2q">efforts to diversify and decentralize&lt;/a>. To better get a sense for how the data model works, &lt;a href="https://atproto-browser.vercel.app/">you can play around with this tool, &lt;/a> which is what I spent a lot of time doing during this project.&lt;/p>
&lt;/li>
&lt;/ul>
&lt;p>The TL; DR is that you can think of the At Proto Atmosphere as a collection of databases, or, really, websites, that the relay indexes and turns into the firehose. Data is then filtered on the firehose side for CSAM and other logic, before it&amp;rsquo;s turned into an AppView. The AppView is what you see if you sign into &lt;code>bsky.app. &lt;/code>&lt;/p>
&lt;p>If this sounds familiar, it&amp;rsquo;s because it&amp;rsquo;s how web crawlers, including Google work, with the exception that their crawled results are not available to everyone for access.&lt;/p>
&lt;p>Steve has a very &lt;a href="https://steveklabnik.com/writing/how-does-bluesky-work">nice write-up of all of this&lt;/a>, with a beautiful ascii diagram.&lt;/p>
&lt;p>Al(most) all of the data streaming through each person&amp;rsquo;s PDS is public, and enables the creation of &lt;a href="https://docs.bsky.app/showcase">projects like&lt;/a> the &lt;a href="https://news.ycombinator.com/item?id=42159786">Bluesky firehose as a screensaver&lt;/a>, or goodfeeds, &lt;a href="https://goodfeeds.co/">surfacing feeds across the network.&lt;/a>, or TikTok and &lt;a href="https://bsky.app/profile/did:plc:24kqkpfy6z7avtgu3qg57vvl">Instagram-like apps.&lt;/a> As you can imagine, the protocol then lends itself to a lot of nice experimentation (make sure to &lt;a href="https://bsky.social/about/support/community-guidelines">check the TOS/Developer guidelines&lt;/a> before you do so).&lt;/p>
&lt;h1 id="lets-find-all-the-gists">Let&amp;rsquo;s find all the gists&lt;/h1>
&lt;p>Initially, I wanted to &lt;a href="https://docs.bsky.app/docs/starter-templates/custom-feeds">create a custom feed&lt;/a> on Bluesky. Generally, people create these by filtering the network to include all feeds about cats, or feeds from only mutual follows, or one of my recent favorites, &lt;a href="https://bsky.app/profile/did:plc:o4s55v3tsfph6whswxccpsia/feed/aaaixbb5liqbu">gift articles, &lt;/a> which includes links from gift articles that you can click through to read.&lt;/p>
&lt;p>My idea was: collect all the posts that have a link to github gists, because people put really cool stuff in gists, so I could find and expose to other users some really cool code snippets of what people are hacking on around the platform.&lt;/p>
&lt;p>Initially I thought I might be able to create a lightweight recommendation feed based on aggregate likes. Or, I could create a trending links feed. But, if you want to do any machine learning, you need to start consuming the firehose at scale, collecting the data, and setting up storage, and I wanted to learn Go, not implement distributed systems - &lt;a href="https://www.youtube.com/watch?v=RqubKSF3wig">don&amp;rsquo;t use N computers when you can use one&lt;/a>.&lt;/p>
&lt;p>Moreover, in order to implement a feed, &lt;a href="https://www.reddit.com/r/RedditEng/comments/158f8o3/evolving_reddits_feed_architecture/">generally&lt;/a>, you need to also implement:&lt;/p>
&lt;ul>
&lt;li>pagination&lt;/li>
&lt;li>a cursor in case you lose your place in consuming the feed&lt;/li>
&lt;li>latency considerations&lt;/li>
&lt;li>Thinking about how you render feed objects (eventually they become large and need to be &lt;a href="https://jrashford.com/2022/04/22/how-to-hydrate-tweets-using-hydrator/">hydrated&lt;/a>)&lt;/li>
&lt;li>potentially ranking that feed&amp;rsquo;s content&lt;/li>
&lt;li>retries and handling for if you skip feed elements&lt;/li>
&lt;/ul>
&lt;p>Additionally, specific to atproto, your feed is published at your own PDS - &lt;a href="https://atproto.com/guides/glossary#pds-personal-data-server">Personal Data Server, a LOT more about this here&lt;/a>. You can see this at the link on the Gift Articles Feed:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>https://bsky.app/profile/did:plc:o4s55v3tsfph6whswxccpsia/feed/aaaixbb5liqbu
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Given that the feed is published and linked to your own data store, it made me hesitant to experiment in case I messed something up and lost all of my data.&lt;/p>
&lt;p>There are lots of clients to create and publish feeds, but as a nerd, I didn&amp;rsquo;t want to have a third party handle this out of principle. Moreover, for gists only, there was not enough data to be interesting. The Bluesky network is growing, having &lt;a href="https://bsky-users.theo.io/">surpassed 29 million users&lt;/a>. But, at this scale, interesting content when you filter at this level of granularity is sparse.&lt;/p>
&lt;p>After messing around with the (excellent) &lt;a href="https://github.com/MarshalX/atproto">Python client&lt;/a> a bit &lt;a href="https://github.com/veekaybee/blusky">in this repo&lt;/a>, I narrowed down to the problem I actually wanted to solve:&lt;/p>
&lt;p>&lt;code>consuming all links with &amp;quot;github.com&amp;quot; in the link name, and consuming them via jetstream rather than the firehose.&lt;/code>&lt;/p>
&lt;p>That&amp;rsquo;s how GitFeed was born.&lt;/p>
&lt;h2 id="why-gitfeed-why-go">Why GitFeed? Why Go?&lt;/h2>
&lt;p>McFunley says you only have &lt;a href="https://mcfunley.com/choose-boring-technology">so many innovation tokens&lt;/a>. This also applies to side projects. The other way I&amp;rsquo;ve seen this explained, is that you can either pick a new language, a new stack, a new business problem to solve, or new people to work with, but not all four at the same time, otherwise you will never ship.&lt;/p>
&lt;p>I always want to &lt;a href="https://mitchellh.com/writing/building-large-technical-projects">get to a demo&lt;/a> quickly. But, the goal of side projects is to do stuff you wouldn&amp;rsquo;t get to explore otherwise, so I chose to spend my innovation tokens on:&lt;/p>
&lt;ul>
&lt;li>learning a new backend language, Go: this thing had to serve code fast as a binary and have a very simple deployment story. I have a background in Java/Scala and initially thought about Java, but unfortunately there is no small light-weight Java server that I&amp;rsquo;m aware of (Spring doesn&amp;rsquo;t count) and Go has steadily been growing in popularity since its inception and especially after the language added features like &lt;a href="https://go.dev/blog/why-generics">generics&lt;/a> and I wanted to check it out. Just as importantly, the backend of Bluesky and the protocol were specced out and &lt;a href="https://docs.bsky.app/docs/starter-templates/clients">written in Typescript and Go&lt;/a>, and I thought it might be easier to navigate the Atmosphere in those languages.&lt;/li>
&lt;li>getting slightly better at front-end dev: I&amp;rsquo;d dabbled a bit with front-end design &lt;a href="https://vickiboykis.com/2024/01/05/retro-on-viberary/">for Viberary&lt;/a> but used almost no Javascript, which I&amp;rsquo;d need to render a feed.&lt;/li>
&lt;li>understanding the At Proto data model&lt;/li>
&lt;/ul>
&lt;p>I didn&amp;rsquo;t need to:&lt;/p>
&lt;ul>
&lt;li>Understand the problem of working with a feed of streamable content - I did a &lt;a href="https://vickiboykis.com/2022/07/25/looking-back-at-two-years-at-automattic-and-tumblr/">lot of feeds work&lt;/a> when I worked on recommendations at Tumblr and I find it to be one of the most fun and satisfying problems in engineering to work on.&lt;/li>
&lt;li>Understand how to run/deploy on DigitalOcean using GitHub Actions - I already did this for Viberary and the path was pretty smooth&lt;/li>
&lt;li>Understand SQLite&lt;/li>
&lt;li>Finally, I wanted to track and catalogue my usage of LLMs for this project. I&amp;rsquo;m &lt;a href="https://vickiboykis.com/2023/02/26/what-should-you-use-chatgpt-for/">constantly evaluating&lt;/a> their &lt;a href="https://vickiboykis.com/2025/01/03/everything-i-did-in-2024/">use-cases&lt;/a>.&lt;/li>
&lt;/ul>
&lt;h2 id="the-jetstream">The Jetstream&lt;/h2>
&lt;p>I didn&amp;rsquo;t want to spend time implementing pagination, hydration, latency mitigation, and a data storage strategy. I wanted to get going quickly, so I decided to use the Jetstream. &lt;a href="https://docs.bsky.app/blog/jetstream">Jetstream&lt;/a> is a relatively new content source for Bluesky content (relatively because everything is brand-new and being built on the fly.)&lt;/p>
&lt;p>Working with the Bluesky firehose has a set of complications other than pagination: it also has its own data formats (CBOR and CAR for all the full merkle trees in the git repos) that take time in learning how to parse.&lt;/p>
&lt;p>Moreover, the sheer volume of firehose events has grown to the point where folks consuming it need to invest heavily in scaling strategies for downstream application consumers.&lt;/p>
&lt;p>Jetstream instead streams content in JSON, with &lt;a href="https://jazco.dev/2024/09/24/jetstream/">reduced bandwidth and costs&lt;/a>.&lt;/p>
&lt;p>As a tradeoff, Jetstream is less stable, doesn&amp;rsquo;t contain content that needs API verification, and don&amp;rsquo;t offer pagination, activity offets, or uptime guarantees. As the docs say, it&amp;rsquo;s good for low-stakes side projects that don&amp;rsquo;t require heavy authentication or veracity, aka gitfeed.&lt;/p>
&lt;h2 id="serving-gitfeed">Serving GitFeed&lt;/h2>
&lt;figure>&lt;img src="https://raw.githubusercontent.com/veekaybee/gitfeed/main/architecture.png" width="400">
&lt;/figure>

&lt;p>We&amp;rsquo;re building a simple web app with &lt;a href="https://github.com/veekaybee/gitfeed/tree/main/cmd">two components:&lt;/a> that are two separate go processes:&lt;/p>
&lt;ul>
&lt;li>An &lt;code>ingest&lt;/code> &lt;a href="https://github.com/veekaybee/gitfeed/tree/main/cmd/ingest">go module&lt;/a> that will ingest posts and writes them to a SQLite DB that we &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/cmd/ingest/ingest.go#L174">instantiate as part of the application&lt;/a>&lt;/li>
&lt;li>A &lt;code>serve&lt;/code> &lt;a href="https://github.com/veekaybee/gitfeed/tree/main/cmd/serve">go module&lt;/a> that serves the app &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/routes/routes.go">via API.&lt;/a>&lt;/li>
&lt;/ul>
&lt;p>The serving module is just a few Javascript files with static HTML pages. I was very overwhelmed by all the choices in the front-end ecosystem (although the developer docs helped a ton!), so I just ended up going with plain old &lt;a href="http://vanilla-js.com/">Vanilla JS.&lt;/a>&lt;/p>
&lt;p>When we refresh the site, we make a call to the DB (via the &lt;code>posts&lt;/code> &lt;a href="https://github.com/veekaybee/gitfeed/blob/4e9248a83435e586a1c855d659baf103323678ec/static/feed.js#L187">API endpoint&lt;/a>) to surface the posts in reverse chronological order.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-javascript" data-lang="javascript">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">export&lt;/span> &lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">function&lt;/span> &lt;span style="color:#a6e22e">fetchPosts&lt;/span>() {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">const&lt;/span> &lt;span style="color:#a6e22e">container&lt;/span> &lt;span style="color:#f92672">=&lt;/span> document.&lt;span style="color:#a6e22e">getElementById&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;postContainer&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">container&lt;/span>.&lt;span style="color:#a6e22e">innerHTML&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;&amp;lt;div class=&amp;#34;loading&amp;#34;&amp;gt;Loading posts...&amp;lt;/div&amp;gt;&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">try&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">console&lt;/span>.&lt;span style="color:#a6e22e">log&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;Fetching new posts...&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">const&lt;/span> &lt;span style="color:#a6e22e">response&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">await&lt;/span> &lt;span style="color:#a6e22e">fetch&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;/api/v1/posts&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> (&lt;span style="color:#f92672">!&lt;/span>&lt;span style="color:#a6e22e">response&lt;/span>.&lt;span style="color:#a6e22e">ok&lt;/span>) {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">throw&lt;/span> &lt;span style="color:#66d9ef">new&lt;/span> Error(&lt;span style="color:#e6db74">`HTTP error! status: &lt;/span>&lt;span style="color:#e6db74">${&lt;/span>&lt;span style="color:#a6e22e">response&lt;/span>.&lt;span style="color:#a6e22e">status&lt;/span>&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">`&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">const&lt;/span> &lt;span style="color:#a6e22e">posts&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#66d9ef">await&lt;/span> &lt;span style="color:#a6e22e">response&lt;/span>.&lt;span style="color:#a6e22e">json&lt;/span>();
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">container&lt;/span>.&lt;span style="color:#a6e22e">innerHTML&lt;/span> &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#39;&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>And here&amp;rsquo;s the actual call site:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-javascript" data-lang="javascript">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">import&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">fetchPosts&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">updateTimestamp&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>} &lt;span style="color:#a6e22e">from&lt;/span> &lt;span style="color:#e6db74">&amp;#39;./feed.js&amp;#39;&lt;/span>;
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">console&lt;/span>.&lt;span style="color:#a6e22e">log&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;Main.js loaded&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>document.&lt;span style="color:#a6e22e">addEventListener&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;DOMContentLoaded&amp;#39;&lt;/span>, &lt;span style="color:#66d9ef">async&lt;/span> () =&amp;gt; {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">console&lt;/span>.&lt;span style="color:#a6e22e">log&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;DOM Content Loaded&amp;#39;&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">try&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">await&lt;/span> &lt;span style="color:#a6e22e">fetchPosts&lt;/span>();
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">await&lt;/span> &lt;span style="color:#a6e22e">updateTimestamp&lt;/span>();
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> } &lt;span style="color:#66d9ef">catch&lt;/span> (&lt;span style="color:#a6e22e">error&lt;/span>) {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">console&lt;/span>.&lt;span style="color:#a6e22e">error&lt;/span>(&lt;span style="color:#e6db74">&amp;#39;Error in main initialization:&amp;#39;&lt;/span>, &lt;span style="color:#a6e22e">error&lt;/span>);
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>});
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>I was surprised at how much about JS was already familiar to me from Python and PHP, but where I really got stuck was in understanding how the DOM and Javascript work together, what a Javascript app structure looks like, and the Javascript ecosystem.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Using LLMs to build GitFeed&lt;/strong>:&lt;/p>&lt;/blockquote>
&lt;p>I&amp;rsquo;ve done a lot of work with &lt;a href="https://github.com/Mozilla-Ocho/llamafile">llamafile&lt;/a>, and recently, I&amp;rsquo;ve also been enjoying the local LLM stack of: &lt;strong>Ollama&lt;/strong> for the model backend and &lt;strong>OpenWebUI&lt;/strong> for the front-end. Ollama serves versioned &lt;a href="https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/">GGUF model weights&lt;/a> wrapped in a Docker-like paradigm that hits an API wrapping llama.cpp in an (of course) Go interface. For this project, I used &lt;code>mistral:latest&lt;/code> and &lt;code>qwen2.5-coder:latest&lt;/code> , the best code model at the time (in the ancient space of 3 months ago, Deepseek3 wasn&amp;rsquo;t out). I did reasonably well between the two of them, with only 5% of requests that I had to bypass and send to Claude. I did find myself getting frustrated because I couldn&amp;rsquo;t clearly articulate the unknown unknowns I had about Javascript, though, and eventually I just gave up and bought &lt;code>Eloquent Javascript&lt;/code> which I&amp;rsquo;m hoping to dig into this year to better understand what &lt;code>Qwen&lt;/code> and I wrote together and how I can improve it.&lt;/p>
&lt;h2 id="ingest">Ingest&lt;/h2>
&lt;p>Consuming the Jetstream is easy. There are &lt;a href="https://github.com/bluesky-social/jetstream?tab=readme-ov-file#public-instances">four public instances&lt;/a> of Jetstream, hosted by Bluesky the company, so we need to connect to one of these, consume the content, process and filter the data, and save it to our database for serving.&lt;/p>
&lt;p>You can check what&amp;rsquo;s up in the Jetstream with this nifty command line tool:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>websocat wss://jetstream2.us-west.bsky.network/subscribe&lt;span style="color:#ae81ff">\?&lt;/span>wantedCollections&lt;span style="color:#f92672">=&lt;/span>app.bsky.feed.post | jq .
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>This is everything people are posting on Bluesky! (It sometimes gets very, very weird so be careful if you don&amp;rsquo;t want to look at NSWF texts.)&lt;/p>
&lt;p>And now let&amp;rsquo;s look at GitHub posts. There aren&amp;rsquo;t a lot of them, so you might not get an even for at least a few minutes:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-sh" data-lang="sh">&lt;span style="display:flex;">&lt;span>websocat wss://jetstream2.us-west.bsky.network/subscribe&lt;span style="color:#ae81ff">\?&lt;/span>wantedCollections&lt;span style="color:#f92672">=&lt;/span>app.bsky.feed.post | grep &lt;span style="color:#e6db74">&amp;#34;github&amp;#34;&lt;/span> | jq .
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Jetstream is implemented as &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/cmd/ingest/ingest.go#L211">a websocket connection&lt;/a> to the source.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">wsManager&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">NewWebSocketManager&lt;/span>(
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;wss://jetstream2.us-west.bsky.network/subscribe?wantedCollections=app.bsky.feed.post&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">pr&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>)
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>A websocket is a protocol (like HTTP) that enables client-server communication over TCP but works best for streaming data without the need for continuous polling or webhooks (unlike HTTP).&lt;/p>
&lt;p>A connection is instantiated with a handshake between the client and the server &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSockets_API/Writing_WebSocket_servers">in HTTP first&lt;/a>, and when the request is processed, both switch to websockets for communication.&lt;/p>
&lt;p>I used the &lt;code>gorilla/websocket&lt;/code> implementation of the websocket procool. for handling the core websocket logic.&lt;/p>
&lt;p>We need to be able to &lt;a href="https://brojonat.com/posts/websockets/">read and write to/from the websocket&lt;/a>:&lt;/p>
&lt;blockquote>
&lt;p>The main thing to internalize about working with WebSockets in Go is that each client connection should get at least two goroutines: one that continuously processes messages coming from the client (i.e., a “read pump”), and one that continuously processes message going out to the client (i.e., a “write pump”).&lt;/p>&lt;/blockquote>
&lt;p>However, this becomes easier since for GitFeed, we&amp;rsquo;re only reading from and not writing to the websocket, we need to implement logic to readPump&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">func&lt;/span> (&lt;span style="color:#a6e22e">w&lt;/span> &lt;span style="color:#f92672">*&lt;/span>&lt;span style="color:#a6e22e">WebSocketManager&lt;/span>) &lt;span style="color:#a6e22e">readPump&lt;/span>(&lt;span style="color:#a6e22e">ctx&lt;/span> &lt;span style="color:#a6e22e">context&lt;/span>.&lt;span style="color:#a6e22e">Context&lt;/span>) {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">Connect&lt;/span>(&lt;span style="color:#a6e22e">ctx&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">counter&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#66d9ef">for&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>		&lt;span style="color:#66d9ef">select&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>		&lt;span style="color:#66d9ef">case&lt;/span> &lt;span style="color:#f92672">&amp;lt;-&lt;/span>&lt;span style="color:#a6e22e">ctx&lt;/span>.&lt;span style="color:#a6e22e">Done&lt;/span>():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#a6e22e">log&lt;/span>.&lt;span style="color:#a6e22e">Printf&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;Exiting readPump: got kill signal\n&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#66d9ef">return&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>		&lt;span style="color:#66d9ef">default&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#66d9ef">var&lt;/span> &lt;span style="color:#a6e22e">post&lt;/span> &lt;span style="color:#a6e22e">db&lt;/span>.&lt;span style="color:#a6e22e">ATPost&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">conn&lt;/span>.&lt;span style="color:#a6e22e">ReadJSON&lt;/span>(&lt;span style="color:#f92672">&amp;amp;&lt;/span>&lt;span style="color:#a6e22e">post&lt;/span>); &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">nil&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>				&lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">Connect&lt;/span>(&lt;span style="color:#a6e22e">ctx&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>				&lt;span style="color:#66d9ef">continue&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#a6e22e">counter&lt;/span>&lt;span style="color:#f92672">++&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#a6e22e">counter&lt;/span>&lt;span style="color:#f92672">%&lt;/span>&lt;span style="color:#ae81ff">100&lt;/span> &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#ae81ff">0&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>				&lt;span style="color:#a6e22e">log&lt;/span>.&lt;span style="color:#a6e22e">Printf&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;Read %d posts\n&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">counter&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#75715e">// Process the post&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#a6e22e">dbPost&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">ProcessPost&lt;/span>(&lt;span style="color:#a6e22e">post&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">postRepo&lt;/span>.&lt;span style="color:#a6e22e">WritePost&lt;/span>(&lt;span style="color:#a6e22e">dbPost&lt;/span>); &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">nil&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>				&lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">errorHandler&lt;/span>(&lt;span style="color:#a6e22e">fmt&lt;/span>.&lt;span style="color:#a6e22e">Errorf&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;failed to write post: %v&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">err&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>				&lt;span style="color:#66d9ef">continue&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>			&lt;span style="color:#a6e22e">log&lt;/span>.&lt;span style="color:#a6e22e">Printf&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;Wrote Post %v&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">dbPost&lt;/span>.&lt;span style="color:#a6e22e">Did&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>		}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>However, unlike an HTTP call, websocksets are open persistently and don&amp;rsquo;t offer any guarantees of retries, so we have to implement this logic ourselves. Fortunately, gorilla has a lot of &lt;a href="https://github.com/gorilla/websocket/blob/main/examples/chat/client.go">good examples.&lt;/a>&lt;/p>
&lt;p>You&amp;rsquo;ll notice a couple key points here: first, we log and handle the case where the web socket disconnects. Then, we do some ultra-fancy print logging to keep track of how many posts we&amp;rsquo;ve actually processed. And finally, we now get to the actual data, an &lt;code>var post db.ATPost&lt;/code> that we process, parse, and write to the database, which we initialize as a post &lt;a href="https://martinfowler.com/eaaCatalog/repository.html">repository&lt;/a>, a fancy word for &amp;ldquo;database with dependency injection&amp;rdquo;.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Golang tooling&lt;/strong>:&lt;/p>&lt;/blockquote>
&lt;p>Go just works out of the box. Unlike my beloved Python, it doesn&amp;rsquo;t need uv, formatting, linting, or special build processes. At least, for a fairly small project, everything is batteries included. In fact, its boringness and rigidity allowed me to move really quickly. What surprised me is that I thought that VSCode would work really well with go, but actually didn&amp;rsquo;t as code I imported wouldn&amp;rsquo;t get loaded automatically, and there were a couple bugs that made me switch to Goland, which works extremely smoothly, without fail, and its local autocomplete at the line level is much better than PyCharm&amp;rsquo;s equivalent, likely because Go is much smaller, and statically-typed.&lt;/p>
&lt;blockquote>
&lt;p>&lt;strong>Jetbrains LLMs&lt;/strong>:&lt;/p>&lt;/blockquote>
&lt;p>Jetbrains&amp;rsquo; local LLMs are extremely well-done and I&amp;rsquo;d encourage anyone interested &lt;a href="https://arxiv.org/html/2405.08704v3">to check out the paper.&lt;/a>&lt;/p>
&lt;h1 id="what-is-an-atproto-post">What is an AtProto Post&lt;/h1>
&lt;p>Now we get to the heart of the matter: once our websocket is open, we are ingesting a stream of JSON objects. AtProto has its own data model, defined using schemas called &amp;ldquo;Lexicons&amp;rdquo;. For posts and actions, they look like this.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-json" data-lang="json">&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;did&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;did:plc:eabmaihciaxprqvxpfvl6flk&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;time_us&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1725911162329308&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;kind&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;commit&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;commit&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;rev&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;3l3qo2vutsw2b&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;operation&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;create&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;collection&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;app.bsky.feed.like&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;rkey&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;3l3qo2vuowo2b&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;record&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;$type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;app.bsky.feed.like&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;createdAt&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2024-09-09T19:46:02.102Z&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;subject&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;cid&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;bafyreidc6abdkkbchcyg62v77wbhzvb2mvytlmsychqgwf2xojjtirmzj4&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;uri&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;at://did:plc:ab7b35aakoll7hugkrjtf3xf/app.bsky.feed.post/3l3pte3p2e325&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;cid&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;abfyreidwaivazkwu67xztlmuobx35hs2lnfh3kolmgfmucldvhd3sgzcqi&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>The &lt;code>DID&lt;/code> is the ID of the PDS (user repository) where the action happened, the record collection type of &lt;code>app.bsky.feed.post&lt;/code> is what we care about, and each record has both a text entry, which truncates the text, and a facet, which has all the contained &lt;a href="https://docs.bsky.app/docs/advanced-guides/post-richtext">links and rich text elements&lt;/a> in the post.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-json" data-lang="json">&lt;span style="display:flex;">&lt;span>{
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;did&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;did:plc:&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;time_us&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">1735494134541&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;com&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;kind&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;commit&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;commit&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;rev&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;c&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;operation&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;create&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;collection&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;app.bsky.feed.post&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;rkey&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;record&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;$type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;app.bsky.feed.post&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;createdAt&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;2024-12-29T17:42:14.541Z&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;embed&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;$type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;app.bsky.embed.external&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;external&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;description&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;thumb&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;$type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;blob&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;ref&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;$link&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;mimeType&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;image/jpeg&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;size&amp;#34;&lt;/span>: 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;title&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;uri&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;facets&amp;#34;&lt;/span>: [
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;features&amp;#34;&lt;/span>: [
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;$type&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;app.bsky.richtext.facet#link&amp;#34;&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;uri&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;index&amp;#34;&lt;/span>: {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;byteEnd&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">85&lt;/span>,
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;byteStart&amp;#34;&lt;/span>: &lt;span style="color:#ae81ff">54&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;langs&amp;#34;&lt;/span>: [
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;en&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> ],
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;text&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;...&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> },
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#f92672">&amp;#34;cid&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> }
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>So we parse these JSON objects and &lt;a href="https://github.com/veekaybee/gitfeed/blob/4e9248a83435e586a1c855d659baf103323678ec/db/db.go#L17">store them as Go structs&lt;/a>. Go has a &lt;a href="https://transform.tools/json-to-go">super handy tool&lt;/a> where you can paste a JSON object and get back the Go struct.&lt;/p>
&lt;p>And then we write that struct to a DB with a &lt;code>posts&lt;/code> &lt;a href="https://github.com/veekaybee/gitfeed/blob/4e9248a83435e586a1c855d659baf103323678ec/cmd/ingest/ingest.go#L185">table&lt;/a> that we&amp;rsquo;ve already instantiated for this purpose.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">postRepo&lt;/span>.&lt;span style="color:#a6e22e">WritePost&lt;/span>(&lt;span style="color:#a6e22e">dbPost&lt;/span>); &lt;span style="color:#a6e22e">err&lt;/span> &lt;span style="color:#f92672">!=&lt;/span> &lt;span style="color:#66d9ef">nil&lt;/span> {
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#a6e22e">w&lt;/span>.&lt;span style="color:#a6e22e">errorHandler&lt;/span>(&lt;span style="color:#a6e22e">fmt&lt;/span>.&lt;span style="color:#a6e22e">Errorf&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;failed to write post: %v&amp;#34;&lt;/span>, &lt;span style="color:#a6e22e">err&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">continue&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;h1 id="sqlite">SQLite&lt;/h1>
&lt;p>Enough has been written about why SQLite is awesome and amazing for smaller, and even larger projects so I&amp;rsquo;ll skip that here and say that I didn&amp;rsquo;t even consider using anything else for GitFeed.&lt;/p>
&lt;p>There are a &lt;a href="https://www.powersync.com/blog/sqlite-optimizations-for-ultra-high-performance">ton of optimizations&lt;/a> you can perform on SQlite to really juice performance, and I &lt;a href="https://github.com/veekaybee/gitfeed/blob/4e9248a83435e586a1c855d659baf103323678ec/db/db.go#L87">set a few of them&lt;/a> in anticipation of many users hitting the server, and also easing read-write contention on the db.&lt;/p>
&lt;h1 id="serving">Serving&lt;/h1>
&lt;p>After working extensively with &lt;a href="https://vickiboykis.com/2025/01/14/how-fastapi-path-operations-work/">FastAPI&lt;/a> and &lt;a href="https://vickiboykis.com/2024/01/05/retro-on-viberary/">Flask&lt;/a>, I was really excited to learn just how batteries-included &lt;code>net/http&lt;/code> module was in Go. I didn&amp;rsquo;t need to install anything extra - I was immediately writing routes and handlers.&lt;/p>
&lt;p>We set up &lt;a href="https://github.com/veekaybee/gitfeed/blob/4e9248a83435e586a1c855d659baf103323678ec/cmd/serve/serve.go#L28">several routes&lt;/a> to deal with our saved posts here.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-go" data-lang="go">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#75715e">// Start post service&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">fmt&lt;/span>.&lt;span style="color:#a6e22e">Println&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;Connect to post service...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">pr&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#a6e22e">db&lt;/span>.&lt;span style="color:#a6e22e">NewPostRepository&lt;/span>(&lt;span style="color:#a6e22e">database&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">postService&lt;/span> &lt;span style="color:#f92672">:=&lt;/span> &lt;span style="color:#f92672">&amp;amp;&lt;/span>&lt;span style="color:#a6e22e">handlers&lt;/span>.&lt;span style="color:#a6e22e">PostService&lt;/span>{&lt;span style="color:#a6e22e">PostRepository&lt;/span>: &lt;span style="color:#a6e22e">pr&lt;/span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#75715e">// Create web routes&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">routes&lt;/span>.&lt;span style="color:#a6e22e">CreateRoutes&lt;/span>(&lt;span style="color:#a6e22e">postService&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">log&lt;/span>.&lt;span style="color:#a6e22e">Printf&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;Starting gitfeed server...&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>	&lt;span style="color:#a6e22e">log&lt;/span>.&lt;span style="color:#a6e22e">Fatal&lt;/span>(&lt;span style="color:#a6e22e">http&lt;/span>.&lt;span style="color:#a6e22e">ListenAndServe&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;:80&amp;#34;&lt;/span>, &lt;span style="color:#66d9ef">nil&lt;/span>))
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>And then we&amp;rsquo;re up and running as soon as we build and run our Go executable.&lt;/p>
&lt;h1 id="building-and-running-go-artifacts">Building and Running Go Artifacts&lt;/h1>
&lt;p>So easy. &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/Makefile">Just a small Makefile&lt;/a> and we&amp;rsquo;re rebuilding and testing, and then serving the binaries.&lt;/p>
&lt;h1 id="data-considerations">Data Considerations&lt;/h1>
&lt;p>I don&amp;rsquo;t want to keep any data or manage it, this is meant to be an ephemeral snapshot, so a &lt;a href="https://github.com/veekaybee/gitfeed/blob/4e9248a83435e586a1c855d659baf103323678ec/cmd/ingest/ingest.go#L183">goroutine&lt;/a> deletes the oldest data once there are more than 10 posts in the database.&lt;/p>
&lt;h1 id="github">GitHub&lt;/h1>
&lt;p>One last piece we need, which early GitFeed testers suggested, was a way to, after you load the posts from the database, render the associated GitHub metadata with the repo, so we also, after we load the posts from the DB, hit the &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/handlers/github.go">GitHUB API to enrich the post.&lt;/a>&lt;/p>
&lt;h1 id="ops-considerations">Ops Considerations&lt;/h1>
&lt;p>I run all of this on a small DigitalOcean droplet, and &lt;a href="https://github.com/veekaybee/gitfeed/blob/main/.github/workflows/deploy.yaml">redeploy to the droplet&lt;/a> with new code code merged to &lt;code>main&lt;/code> via GitHub actions. There&amp;rsquo;s no monitoring or alerting, something I&amp;rsquo;d like to add for the future.&lt;/p>
&lt;h1 id="final-reflections">Final Reflections&lt;/h1>
&lt;p>This app was so much fun to develop and I learned an enormous amount of stuff. Small, self-contained apps are a joy, and especially when there&amp;rsquo;s a front-end component where you have a self-reinforcing feedback loop. Since us machine learning engineers work at what a friend called &amp;ldquo;the back-end of the backend&amp;rdquo;, we don&amp;rsquo;t often get to experience UI changes, and seeing through something end-to-end was a joy.&lt;/p>
&lt;p>There were points of friction: It was definitely frustrating getting up and going with a whole new language and tech stack, but once I got back into the flow, it was great.&lt;/p>
&lt;p>As always, the hardest part of this project, as with any project, was understanding the data model and the business logic, and parsing out those objects correctly. The second-hardest was aligning elements in CSS.&lt;/p>
&lt;p>I&amp;rsquo;d love to get to a point &lt;a href="https://github.com/veekaybee/gitfeed/issues/9">where the app can surface Trending GitHub repos.&lt;/a>. And maybe &lt;a href="https://github.com/veekaybee/gitfeed/issues/12">add some unit tests.&lt;/a>&lt;/p>
&lt;p>Upon writing this, I realized I have like three other posts I want to write about this process, so I&amp;rsquo;ll leave this as-is for now.&lt;/p></description></item><item><title>How FastAPI path operations work</title><link>https://vickiboykis.com/2025/01/14/how-fastapi-path-operations-work/</link><pubDate>Tue, 14 Jan 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/01/14/how-fastapi-path-operations-work/</guid><description>&lt;p>If you&amp;rsquo;re building a new Python web app these days, there&amp;rsquo;s a good chance you&amp;rsquo;re using FastAPI. There are a lot of features that make FastAPI easy to get started with. There are also a lot of nuances that take a while to understand. One feature I&amp;rsquo;ve been untangling is the way FastAPI manages calls to API routes &lt;a href="https://fastapi.tiangolo.com/tutorial/path-params/">via decorated path parameters.&lt;/a> The new year is a perfect time to take a deeper dive.&lt;/p>
&lt;h1 id="what-happens-in-a-web-server">What happens in a web server&lt;/h1>
&lt;p>When we build a web app, one of the critical components is &lt;a href="https://newsletter.vickiboykis.com/archive/when-you-write-a-web-server-but-you-get-served/">a web server&lt;/a>, a program that listens for incoming requests from the network. It then translates those requests into methods that are called in the backend.&lt;/p>
&lt;p>To better understand what&amp;rsquo;s going on under the covers, we can first implement a simple web server using the &lt;code>http.server&lt;/code> module &lt;a href="https://github.com/python/cpython/blob/main/Lib/http/server.py">included in Python&amp;rsquo;s standard library&lt;/a>.&lt;/p>
&lt;p>We need to write a program that listens on a port and accepts HTTP requests. It accepts the request, parses the path route, and parses any data attached to the HTTP call. Or, &lt;a href="https://crawshaw.io/blog/programming-with-llms">&amp;ldquo;All I want is to cURL and parse a JSON object&amp;rdquo;&lt;/a>.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">import&lt;/span> json
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> http.server &lt;span style="color:#f92672">import&lt;/span> BaseHTTPRequestHandler, HTTPServer
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> urllib.parse &lt;span style="color:#f92672">import&lt;/span> urlparse, parse_qs
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">class&lt;/span> &lt;span style="color:#a6e22e">RequestHandler&lt;/span>(BaseHTTPRequestHandler):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">parse_path&lt;/span>(self, request_path: str)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> dict:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Parse request path
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> parsed &lt;span style="color:#f92672">=&lt;/span> urlparse(request_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> params_dict &lt;span style="color:#f92672">=&lt;/span> parse_qs(parsed&lt;span style="color:#f92672">.&lt;/span>query)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> params_dict
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">store_urls&lt;/span>(self, request_path: str)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Parse URLs and store them
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> params &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>parse_path(request_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">for&lt;/span> key, val &lt;span style="color:#f92672">in&lt;/span> params&lt;span style="color:#f92672">.&lt;/span>items():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>data_store&lt;span style="color:#f92672">.&lt;/span>put_data(val[&lt;span style="color:#ae81ff">0&lt;/span>])
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">return_k_json&lt;/span>(self, k:dict)&lt;span style="color:#f92672">-&amp;gt;&lt;/span> BinaryIO:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Return json response
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_response(&lt;span style="color:#ae81ff">200&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_header(&lt;span style="color:#e6db74">&amp;#34;Content-type&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;application/json&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>end_headers()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># Contains the output stream for writing a response back to the client. &lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># BufferedIOBase that writes to a stream&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># See https://docs.python.org/3/library/io.html#io.BufferedIOBase.write&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>wfile&lt;span style="color:#f92672">.&lt;/span>write(json&lt;span style="color:#f92672">.&lt;/span>dumps(k)&lt;span style="color:#f92672">.&lt;/span>encode(&lt;span style="color:#e6db74">&amp;#39;utf-8&amp;#39;&lt;/span>))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">bad_request&lt;/span>(self):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#e6db74">&amp;#34;&amp;#34;&amp;#34;
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> Handle bad request
&lt;/span>&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#e6db74"> &amp;#34;&amp;#34;&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_response(&lt;span style="color:#ae81ff">400&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_header(&lt;span style="color:#e6db74">&amp;#34;Content-type&amp;#34;&lt;/span>, &lt;span style="color:#e6db74">&amp;#34;application/json&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>end_headers()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">do_GET&lt;/span>(self):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> request_path &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>path
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>path &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;/&amp;#34;&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>return_k_json({&lt;span style="color:#e6db74">&amp;#34;ciao&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;mondo&amp;#34;&lt;/span>})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> request_path&lt;span style="color:#f92672">.&lt;/span>startswith(&lt;span style="color:#e6db74">&amp;#34;/get&amp;#34;&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> key &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>parse_path(request_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>return_k_json({&lt;span style="color:#e6db74">&amp;#34;jars&amp;#34;&lt;/span>: key[&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>]})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_response(&lt;span style="color:#ae81ff">200&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>bad_request()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>end_headers()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">do_POST&lt;/span>(self):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> request_path &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>path
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> request_path&lt;span style="color:#f92672">.&lt;/span>startswith(&lt;span style="color:#e6db74">&amp;#34;/set&amp;#34;&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>store_urls(request_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_response(&lt;span style="color:#ae81ff">200&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>bad_request()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">if&lt;/span> __name__ &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;__main__&amp;#34;&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> host &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#e6db74">&amp;#34;localhost&amp;#34;&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> port &lt;span style="color:#f92672">=&lt;/span> &lt;span style="color:#ae81ff">8000&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> server &lt;span style="color:#f92672">=&lt;/span> HTTPServer((host, port), RequestHandler)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> print(&lt;span style="color:#e6db74">&amp;#34;Server started http://&lt;/span>&lt;span style="color:#e6db74">%s&lt;/span>&lt;span style="color:#e6db74">:&lt;/span>&lt;span style="color:#e6db74">%s&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span> &lt;span style="color:#f92672">%&lt;/span> (host, port))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> server&lt;span style="color:#f92672">.&lt;/span>serve_forever()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>What&amp;rsquo;s going on here?&lt;/p>
&lt;p>Let’s say that we produce Nulltella, an artisinal hazlenut spread for statisticians, and are looking to build a web app that keeps track of &lt;a href="https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/">all of our Nulltella jars so we can later stand up a prediction service.&lt;/a>&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/308394339-66ca00e6-1baf-4eb0-9d3d-112966beb797.png" width="200">
&lt;/figure>

&lt;p>We would start by designing a super simple API: As users,&lt;/p>
&lt;ul>
&lt;li>we want to test the server and get back a simple response&lt;/li>
&lt;li>we&amp;rsquo;d like to add jars to our inventory, and&lt;/li>
&lt;li>to see the jars we added.&lt;/li>
&lt;/ul>
&lt;p>We translate these actions to GET and PUT requests so we can write HTTP calls for them. For simplicity&amp;rsquo;s sake, we won&amp;rsquo;t actually store them server-side but we will write them so we can can very simply see how to send data to our app:&lt;/p>
&lt;p>We want to test the server:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;gt; python serve.py
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;gt; curl -X POST http://localhost:8000/
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;gt; &lt;span style="color:#f92672">{&lt;/span>&lt;span style="color:#e6db74">&amp;#34;ciao&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;mondo&amp;#34;&lt;/span>&lt;span style="color:#f92672">}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We want to store items:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;gt; curl -X POST http://localhost:8000/set&lt;span style="color:#ae81ff">\?&lt;/span>key&lt;span style="color:#ae81ff">\=&lt;/span>&lt;span style="color:#ae81ff">8&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#ae81ff">200&lt;/span> OK
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>And get back the stored items:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-bash" data-lang="bash">&lt;span style="display:flex;">&lt;span>&amp;gt; curl -X GET http://localhost:8000/get&lt;span style="color:#ae81ff">\?&lt;/span>key&lt;span style="color:#ae81ff">\=&lt;/span>&lt;span style="color:#ae81ff">8&lt;/span> 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&amp;gt; &lt;span style="color:#f92672">{&lt;/span>&lt;span style="color:#e6db74">&amp;#34;jars&amp;#34;&lt;/span>: &lt;span style="color:#f92672">[&lt;/span>&lt;span style="color:#e6db74">&amp;#34;8&amp;#34;&lt;/span>&lt;span style="color:#f92672">]}&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Our server needs a way to parse the key pieces of information it receives:&lt;/p>
&lt;ol>
&lt;li>They type of request. &lt;code>do_GET&lt;/code> and &lt;code>do_POST&lt;/code> &lt;a href="https://stackoverflow.com/a/50944691">handle this implicitly&lt;/a> in the &lt;a href="https://docs.python.org/3/library/http.server.html#http.server.SimpleHTTPRequestHandler.do_GET">HTTP implementation&lt;/a>.&lt;/li>
&lt;li>The parameters we pass to the path request so that we can do something with them&lt;/li>
&lt;li>A route to a method inside our application itself that processes the data&lt;/li>
&lt;/ol>
&lt;p>In our simple server, the heart of the routing happens at the method level. If we send a base path, we return &lt;code>{&amp;quot;ciao&amp;quot;: &amp;quot;mondo&amp;quot;}&lt;/code> . Otherwise, we return the amount of jars we&amp;rsquo;ve passed in via the request path by parsing the parameters in the path.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">do_GET&lt;/span>(self) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> &lt;span style="color:#66d9ef">None&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> request_path &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>path
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>path &lt;span style="color:#f92672">==&lt;/span> &lt;span style="color:#e6db74">&amp;#34;/&amp;#34;&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>return_k_json({&lt;span style="color:#e6db74">&amp;#34;ciao&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;mondo&amp;#34;&lt;/span>})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">if&lt;/span> request_path&lt;span style="color:#f92672">.&lt;/span>startswith(&lt;span style="color:#e6db74">&amp;#34;/get&amp;#34;&lt;/span>):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> key &lt;span style="color:#f92672">=&lt;/span> self&lt;span style="color:#f92672">.&lt;/span>parse_path(request_path)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#75715e"># action performed within the web app here&lt;/span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>return_k_json({&lt;span style="color:#e6db74">&amp;#34;jars&amp;#34;&lt;/span>: key[&lt;span style="color:#e6db74">&amp;#34;key&amp;#34;&lt;/span>]})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>send_response(&lt;span style="color:#ae81ff">200&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">else&lt;/span>:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>bad_request()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> self&lt;span style="color:#f92672">.&lt;/span>end_headers()
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We can see how this can become complicated quickly. For example, what if we have multiple operations we perform during a &lt;code>GET&lt;/code> : what if we get data from a database, or a cache, or we retrieve assets? We&amp;rsquo;ll have different methods that we process depending on how the path is parsed. What if we also have &lt;code>PUT/DELETE&lt;/code> verbs? What if we need authentication? To write to a database? Static pages? Our code complexity relative to our starting point starts to grow, and we now need a framework.&lt;/p>
&lt;h2 id="starlette">Starlette&lt;/h2>
&lt;p>Early Python web dev frameworks include juggernauts &lt;a href="https://www.david-dahan.com/blog/comparing-fastapi-and-django">Django&lt;/a> and Flask. More recently, since Python&amp;rsquo;s async story has grown stronger, frameworks like &lt;a href="https://www.starlette.io/">Starlette&lt;/a> have come onto the scene to include async functionality out of the box.&lt;/p>
&lt;p>Starlette was built by the creator of Django Rest Framework and includes lightweight operations for the core functionality of HTTP calls and additional operations like web sockets, with the added bonus of being async by default.&lt;/p>
&lt;p>To manage an HTTP call the same way we would with our simple server, we can do the following with Starlette:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> starlette.applications &lt;span style="color:#f92672">import&lt;/span> Starlette
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> starlette.responses &lt;span style="color:#f92672">import&lt;/span> JSONResponse
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> starlette.routing &lt;span style="color:#f92672">import&lt;/span> Route
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">homepage&lt;/span>(request):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse({&lt;span style="color:#e6db74">&amp;#39;ciao&amp;#39;&lt;/span>: &lt;span style="color:#e6db74">&amp;#39;mondo&amp;#39;&lt;/span>})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>app &lt;span style="color:#f92672">=&lt;/span> Starlette(debug&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>, routes&lt;span style="color:#f92672">=&lt;/span>[Route(&lt;span style="color:#e6db74">&amp;#39;/&amp;#39;&lt;/span>, homepage),])
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We start an instance of a Starlette application, which has &lt;a href="https://github.com/encode/starlette/blob/7c0d1e6d1a499e6eeb68d447321838be3927e83b/starlette/routing.py#L208">processes routes.&lt;/a> &lt;a href="https://www.starlette.io/routing/">Each route is linked&lt;/a>, at the path level, to the actual method it calls. If Starlette sees that specific route, it calls the method, taking into account logic for parsing and reading HTTP request headers and bodies.&lt;/p>
&lt;p>What if we want to add a second method call based on a different route, getting our jar count again?&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> starlette.applications &lt;span style="color:#f92672">import&lt;/span> Starlette
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> starlette.responses &lt;span style="color:#f92672">import&lt;/span> JSONResponse
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> starlette.routing &lt;span style="color:#f92672">import&lt;/span> Route
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">homepage&lt;/span>(request):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse({&lt;span style="color:#e6db74">&amp;#39;ciao&amp;#39;&lt;/span>: &lt;span style="color:#e6db74">&amp;#39;mondo&amp;#39;&lt;/span>})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_jars&lt;/span>(request):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> JSONResponse({&lt;span style="color:#e6db74">&amp;#39;jars&amp;#39;&lt;/span>: [&lt;span style="color:#e6db74">&amp;#39;8&amp;#39;&lt;/span>]})
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>app &lt;span style="color:#f92672">=&lt;/span> Starlette(debug&lt;span style="color:#f92672">=&lt;/span>&lt;span style="color:#66d9ef">True&lt;/span>, routes&lt;span style="color:#f92672">=&lt;/span>[
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Route(&lt;span style="color:#e6db74">&amp;#39;/&amp;#39;&lt;/span>, homepage),
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> Route(&lt;span style="color:#e6db74">&amp;#39;/get_jars&amp;#39;&lt;/span>, get_jars)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>])
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>We see that we are also passing and processing params, and there is &lt;a href="https://github.com/encode/starlette/blob/7c0d1e6d1a499e6eeb68d447321838be3927e83b/starlette/_utils.py#L85">logic that processes the path params&lt;/a> based on the method as &lt;a href="https://github.com/encode/starlette/blob/7c0d1e6d1a499e6eeb68d447321838be3927e83b/docs/requests.md?plain=1#L5">they come in from the request&lt;/a>.&lt;/p>
&lt;h2 id="fastapis-implementation">FastAPI&amp;rsquo;s implementation&lt;/h2>
&lt;p>FastAPI wraps Starlette - &amp;ldquo;as it is basically Starlette on steroids&amp;rdquo; per the docs - and &lt;a href="https://fastapi.tiangolo.com/alternatives/#intro">includes Pydantic type validation&lt;/a> at the logical boundaries of the application.&lt;/p>
&lt;p>Under the covers, when we instantiate a FastAPI application, it&amp;rsquo;s really &amp;ldquo;just&amp;rdquo; an instance of a Starlette application with properties that we override at the application level.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#f92672">from&lt;/span> fastapi &lt;span style="color:#f92672">import&lt;/span> FastAPI
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>app &lt;span style="color:#f92672">=&lt;/span> FastAPI()
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">@app.get&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">root&lt;/span>():
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;ciao&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">&amp;#34;mondo&amp;#34;&lt;/span>}
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#a6e22e">@app.get&lt;/span>(&lt;span style="color:#e6db74">&amp;#34;/jars/&lt;/span>&lt;span style="color:#e6db74">{id}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>)
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">async&lt;/span> &lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">get_jars&lt;/span>(id):
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> {&lt;span style="color:#e6db74">&amp;#34;message&amp;#34;&lt;/span>: &lt;span style="color:#e6db74">f&lt;/span>&lt;span style="color:#e6db74">&amp;#34;jars: &lt;/span>&lt;span style="color:#e6db74">{&lt;/span>id&lt;span style="color:#e6db74">}&lt;/span>&lt;span style="color:#e6db74">&amp;#34;&lt;/span>}
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>In development, FastAPI uses &lt;a href="https://www.uvicorn.org/"> &lt;code>uvicorn&lt;/code> &lt;/a>, an &lt;a href="https://asgi.readthedocs.io/en/latest/">ASGI server&lt;/a> to listen for incoming requests and handle them according to the routes defined in your application.&lt;/p>
&lt;p>Uvicorn initializes the ASGI server, binds it to a &lt;a href="https://github.com/encode/uvicorn/blob/ae8253f10b9f73f10e92be52a0d9f70696b77c62/uvicorn/server.py#L115">socket connections&lt;/a> on &lt;a href="https://github.com/encode/uvicorn/blob/ae8253f10b9f73f10e92be52a0d9f70696b77c62/uvicorn/main.py#L73">port &lt;code>8000&lt;/code> &lt;/a>, and starts listening for incoming connections. So, when we send a &lt;code>GET&lt;/code> request to the main route hosted by default on port 8000, we expect to get back &lt;code>ciao mondo&lt;/code> as a response.&lt;/p>
&lt;p>Like our previous applications, FastAPI is still delegating path operations and methods to a router that processes them and parses parameters, but it wraps these in a &lt;a href="https://nedbatchelder.com/blog/202210/decorator_shortcuts.html">Python decorator&lt;/a>. This is easier to write, but adds a level of complexity at the layer of understanding how the path processing actually happens.&lt;/p>
&lt;p>When we perform a path operation in FastAPI, we&amp;rsquo;re performing the equivalent work of routing that we do with our simple method, but with a lot more rigor and nested definitions.&lt;/p>
&lt;p>Within our simple server, we:&lt;/p>
&lt;ol>
&lt;li>Start the server&lt;/li>
&lt;li>Listen on port &lt;code>8000&lt;/code> for incoming requests&lt;/li>
&lt;li>When we receive a request, we route it to the &lt;code>do_GET &lt;/code>method&lt;/li>
&lt;li>Depending on the path of the request, we route it to &lt;code>&amp;quot;/&amp;quot;&lt;/code>&lt;/li>
&lt;li>We return the results to the client via a &lt;code>200&lt;/code> status&lt;/li>
&lt;/ol>
&lt;p>In FastAPI, we:&lt;/p>
&lt;ol>
&lt;li>Start the uvicorn web server (if in development mode, if production we have to choose gunicorn using the &lt;a href="https://stackoverflow.com/a/71546833">compatible worker class&lt;/a>)&lt;/li>
&lt;li>Listen on port &lt;code>8000&lt;/code> for incoming requests&lt;/li>
&lt;li>We instantiate an instance of the FastAPI application&lt;/li>
&lt;li>This in turn instantiates an instance of Starlette&lt;/li>
&lt;li>When we receive a &lt;code>GET&lt;/code> request, it&amp;rsquo;s routed to the application&amp;rsquo;s &lt;a href="https://github.com/fastapi/fastapi/blob/144f09ea146b2cc026bf317f730aa0e0dbc3de24/fastapi/applications.py#L1460">&lt;code>self.get&lt;/code>&lt;/a> method&lt;/li>
&lt;li>This in turn calls &lt;code>self.router.get&lt;/code> with &lt;a href="https://github.com/fastapi/fastapi/blob/144f09ea146b2cc026bf317f730aa0e0dbc3de24/fastapi/applications.py#L1807">the path operation&lt;/a>&lt;/li>
&lt;li>The router is an instance of &lt;a href="https://github.com/fastapi/fastapi/blob/144f09ea146b2cc026bf317f730aa0e0dbc3de24/fastapi/applications.py#L932">&lt;code>routing.APIRouter&lt;/code>&lt;/a>&lt;/li>
&lt;li>The &lt;code>.get&lt;/code> &lt;a href="https://github.com/fastapi/fastapi/blob/144f09ea146b2cc026bf317f730aa0e0dbc3de24/fastapi/routing.py#L1366">method on &lt;code>APIRouter&lt;/code> takes the path&lt;/a> and retunrs &lt;code>return self.api_route&lt;/code>. This is the point where the decorater is actually called - &lt;a href="https://github.com/fastapi/fastapi/blob/144f09ea146b2cc026bf317f730aa0e0dbc3de24/fastapi/routing.py#L963">we can see the decorator in that method&lt;/a> takes a &lt;code>DecoratedCallable&lt;/code> function as input and returns a decorated &lt;code>add_api_route&lt;/code>, which actually &lt;a href="https://github.com/fastapi/fastapi/blob/144f09ea146b2cc026bf317f730aa0e0dbc3de24/fastapi/routing.py#L961">appends the route to the list of routes.&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>This is purely the set of steps that happens for correct routing - and we didn&amp;rsquo;t yet address how the path parameters in the path are processed.&lt;/p>
&lt;h1 id="path-parameter-routing">Path Parameter Routing&lt;/h1>
&lt;p>Path parameter routing happens in Starlette, where &lt;a href="https://github.com/encode/starlette/blob/0109dce29b76c64e93c56c01fa5020860f935ed3/starlette/requests.py#L182">path parameters are parsed out&lt;/a> of the &lt;a href="https://github.com/encode/starlette/blob/0109dce29b76c64e93c56c01fa5020860f935ed3/starlette/requests.py#L76">request&lt;/a> into a dictionary (just like we do in our simple web application), via the magic of &lt;a href="https://github.com/encode/starlette/blob/0109dce29b76c64e93c56c01fa5020860f935ed3/starlette/templating.py#L123">Jinja Templating.&lt;/a>&lt;/p>
&lt;h2 id="tl-dr">TL; DR&lt;/h2>
&lt;p>When we write a route in FastAPI that accepts path parameters, we are creating a lengthy callstack that goes through several levels of logic in FastAPI using decorators as input into an application that routes requests and appends methods using decorators to a group of route methods; those requests are then passed onto Starlette which does the work of parsing the path variables, using Jinja templates, into dictionaries which the application can then work with and return data to you!&lt;/p></description></item><item><title>Everything I did in 2024</title><link>https://vickiboykis.com/2025/01/03/everything-i-did-in-2024/</link><pubDate>Fri, 03 Jan 2025 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2025/01/03/everything-i-did-in-2024/</guid><description>&lt;p>I want to get back into writing more regularly this year, so in light of that, here&amp;rsquo;s my last year in review.&lt;/p>
&lt;h1 id="evaluating-llms">Evaluating LLMs&lt;/h1>
&lt;p>Like many of us in tech, I spent a large portion of 2024 thinking about and working with LLMs, but I was lucky enough to do it for work. I spent the year designing, building, open-sourcing, (and naming! 🐊) &lt;a href="https://blog.mozilla.ai/lets-build-an-app-for-evaluating-llms/">an application to evaluate LLMs&lt;/a>, &lt;a href="https://github.com/mozilla-ai/lumigator">Lumigator.&lt;/a>&lt;/p>
&lt;p>In support of that work, I &lt;a href="https://blog.mozilla.ai/open-source-in-the-age-of-llms/">did open-source work in the LLM ecosystem&lt;/a>and learned a ton of stuff along the way. Just when I thought I had &lt;a href="https://vickiboykis.com/2021/09/23/reaching-mle-machine-learning-enlightenment/">reached machine learning enlightenment&lt;/a>, I learned a ton about the weird &lt;a href="https://vickiboykis.com/2024/05/06/weve-been-put-in-the-vibe-space/">nondeterministic properties of LLMs&lt;/a>, their evaluation methods, Ray + Ray Serve, vLLM, Llamafile, &lt;a href="https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/">GGUF&lt;/a>, FastAPI, OpenAPI-compatible APIs, and much, much, much more. I am hoping to translate some of these into blog posts as well.&lt;/p>
&lt;p>There is always more to learn in machine learning, particularly in the fast-moving world at the bleeding edge. &lt;a href="https://vickiboykis.com/2022/11/10/how-i-learn-machine-learning/">Lucikly, the fundamentals always remain the same.&lt;/a>&lt;/p>
&lt;p>My biggest takeaways from this year are that:&lt;/p>
&lt;ol>
&lt;li>&lt;strong>Machine learning in industry is mostly still made up of good engineering&lt;/strong>, and in order to work with LLMs, you need to have both machine learning context and engineering discipline. There&amp;rsquo;s some debate whether the &amp;ldquo;old&amp;rdquo; machine learning skills (gradient boosted trees anyone? 👵) still apply in this Brave New World of &lt;a href="https://m-clark.github.io/posts/2021-07-15-dl-for-tabular/index.html">NonTabular Scraped Internet&lt;/a>, and &lt;a href="https://timkellogg.me/blog/2024/12/10/ml-liability">I am of the opinion&lt;/a> that these skills are more valuable than ever, particularly as engineers without ML context enter the frustrating world of building with non-deterministic experimental loops.&lt;/li>
&lt;/ol>
&lt;p>But what&amp;rsquo;s even more valuable is being able to &lt;a href="https://vickiboykis.com/2022/12/05/the-cloudy-layers-of-modern-day-programming/">ship things that hold engineering rigor&lt;/a>. Even if you are deep in academic research, &lt;a href="https://bsky.app/profile/eugenevinitsky.bsky.social/post/3ldm5i4ljks2z">taking an engineering approach&lt;/a> will make your life significantly easier.&lt;/p>
&lt;p>As part of this, I spent a significant amount of the year &lt;a href="https://vickiboykis.com/2023/09/13/build-and-keep-your-context-window/">widening my own context window&lt;/a> about engineering best practices: Over the summer, I &lt;a href="https://publish.obsidian.md/learning-c/Learning+C/Learn+C+Programming+and+OOP">learned C&lt;/a> and this fall, I learned JavaScript and Go and &lt;a href="https://github.com/veekaybee/gitfeed">built my first working project, Gitfeed&lt;/a>. I am particularly enamored with Go: it&amp;rsquo;s so boring, so clean, and so easy to build things that are simple and move quickly out of the box compared to the Python ecosystem.&lt;/p>
&lt;p>This is not to say that the &lt;a href="https://pydevtools.com/blog/effective-python-developer-tooling-in-december-2024/">Python world&lt;/a>, which is my first love, is not &lt;a href="https://thedataquarry.com/posts/towards-a-unified-python-toolchain/">changing and evolving&lt;/a>. I &lt;a href="https://bsky.app/profile/vickiboykis.com/post/3lazmcuftus25">moved to &lt;code>uv&lt;/code> from &lt;code>pyenv&lt;/code> &lt;/a> for both work and personal projects in 2024 and have never looked back, particularly as &lt;code>uv&lt;/code> continues to work &lt;a href="https://docs.astral.sh/uv/guides/integration/pytorch/">to gain support for PyTorch&lt;/a> and all of its cuda-related install issues.&lt;/p>
&lt;ol start="2">
&lt;li>&lt;strong>There is too much going on in the model landscape:&lt;/strong> too many models, too much choice, too many platforms, too much money, too much drama. My main strategy is to follow what&amp;rsquo;s going on on a daily basis to keep my finger on the pulse and then basically ignore it. If people mention it again over the course of 2-3 months, continue to pay attention.&lt;/li>
&lt;/ol>
&lt;p>In light of this, some contradictary advice if you&amp;rsquo;re in the space is that you should also &lt;strong>always be playing around with stuff&lt;/strong>, trying it out, building it and breakign it. There are some &lt;a href="https://gist.github.com/veekaybee/be375ab33085102f9027853128dc5f0e">classic papers and books&lt;/a> for understanding the context and theory, but most of the stuff I read is still in &lt;code> r/LocalLLaMA&lt;/code> or in various tweets/skeets and blog posts. We are not at the level yet for this stuff where cannon exists, although I have noticed that there are technical books being published about AI engineering this year, which means we are starting to solidify.&lt;/p>
&lt;p>Some LLM tools I tried out and loved this year were &lt;a href="https://github.com/Mozilla-Ocho/llamafile">llamafile&lt;/a>, for getting started with GGUF transformer and embedding models with a server extremely quickly, and &lt;code>ollama + openwebui&lt;/code> for an experience that is nearly identical to using Claude. In fact, I&amp;rsquo;ve switched over from Claude to using &lt;code>mistral:latest&lt;/code> locally for most of my LLM usage, which is basically code search. Mistral is a model that has consistently worked well for me both at work and for my own purposes, and I&amp;rsquo;ve enjoyed the chance to pit it against newer models like &lt;a href="https://simonwillison.net/2024/Nov/12/qwen25-coder/">Qwen2.5 Coder&lt;/a>, Llama2, and phi.&lt;/p>
&lt;p>I am still having a hard time wrapping my mind around what LLM models are useful for, in general, and for me personally. I &lt;a href="https://mathstodon.xyz/@tao/113132502735585408">spend a lot&lt;/a> of time &lt;a href="https://antirez.com/news/140">reading blog posts&lt;/a> about &lt;a href="https://nicholas.carlini.com/writing/2024/how-i-use-ai.html">how other (smarter) people use LLMS&lt;/a> and against academic papers that study the success of these models at machin learning tasks stemming in their roots as NLP models focusing on problems like text completion, classification, translation, and summarization.&lt;/p>
&lt;p>A very cool and fun related thing I did while reviewing the &amp;ldquo;classics&amp;rdquo; was edit this wonderful series of posts by Katharine about &lt;a href="https://blog.kjamistan.com/a-deep-dive-into-memorization-in-deep-learning.html">model memorization&lt;/a> which you should read if you&amp;rsquo;re interested in their internals.&lt;/p>
&lt;ol start="3">
&lt;li>I also spent a significant part of the year working on web development. I wrote earlier in the year that there is a &lt;a href="https://blog.mozilla.ai/open-source-in-the-age-of-llms/">bifurcation in how we consume LLMs&lt;/a>:&lt;/li>
&lt;/ol>
&lt;blockquote>
&lt;p>The LLM ecosystem is currently bifurcated between HuggingFace and OpenAI compatibility: An interesting pattern has developed in my development work on open-source in LLMs. It’s become clear to me that, in this new space of developer tooling around transformer-style language models at an industrial scale, you are generally conforming to be downstream of one of two interfaces:
&lt;strong>models that are trained and hosted using HuggingFace libraries&lt;/strong> and particularly the HuggingFace hub as infrastructure - in practicality, this means dependence on PyTorch’s programming paradigm, which HuggingFace tools wrap (although they now also provide interop between Tensorflow and JAX)&lt;/p>&lt;/blockquote>
&lt;blockquote>
&lt;p>&lt;strong>Models that are available via API endpoints&lt;/strong>, particularly as hosted by OpenAI. Given that OpenAI was a first mover in the product LLM space, they currently have the API advantage, and many tools that have developed have developed OpenAI-compatible endpoints which don’t always mean using OpenAI, but conform to the same set of patterns that the Chat Completions API v1/chat/completions offers. For example, adding OpenAI-style interop chat completions allowed us to stand up our own vLLM OpenAI-compatible server that works against models we’ve started with on HuggingFace and fine-tuned locally.&lt;/p>&lt;/blockquote>
&lt;blockquote>
&lt;p>If you want to be successful in this space today and you&amp;rsquo;d like to cater to a broad audience, you as a library or service provider have to be able to interface with both of these.&lt;/p>&lt;/blockquote>
&lt;p>I still believe this is true and it will be interesting to see how this develops over the course of the next year as APIs and their model ecosystems become even broader.&lt;/p>
&lt;ol start="4">
&lt;li>&lt;strong>These models will integrate with, not replace traditional machine learning systems.&lt;/strong> In the beginning, everyone thought that we could replace whole classes of engineering and machine learning problems with LLMs. It&amp;rsquo;s becoming increasingly clearer that one big-ass model is not as a good as many smaller models for specific tasks performed in industry, which makes complete sense because LLMs as a concept rose out of academic labs whose goals are to generalize on out-of-sample problems, whereas the task of industry is to specialize in healthcare or OCR document scanning for legal, or sentiment analysis for social media. Even if fine-tuning did decline as a practice this year, people are making models specialized in a lot of deifferent ways, which I think the rise of agents at the end of this year really makes clear.&lt;/li>
&lt;/ol>
&lt;p>An area that has particularly interested me is the integration of LLMs into recommender systems, my favorite area of applied machine learning. Earlier last year, I, along with &lt;a href="https://jfkirk.github.io/posts/trustworthiness-ai/">James&lt;/a> and &lt;a href="https://bsky.app/profile/ravimody.bsky.social">Ravi&lt;/a>, started a position paper on what will change in recommender systems in the light of LLMs based on a joke tweet I had made,&lt;/p>
&lt;blockquote>
&lt;p>personalized recommendations based on implicit matrix factorization from data acquired through large log streaming architectures were a low interest rate phenomenon&lt;/p>&lt;/blockquote>
&lt;p>My hypothesis was that companies used to collect a lot of streaming user data, particularly in social and other B2C settings, that were then used for personalization. Now, all the data collection happens publicly at the level of the interet, &lt;a href="https://vickiboykis.com/2024/01/15/whats-new-with-ml-in-production/">where it is re-compressed into LLMs.&lt;/a> What does this mean for personalization, now?&lt;/p>
&lt;p>Unfortunately we didn&amp;rsquo;t finish the post (maybe someday?), but our main hypothesis was that traditional recsys will not be replaced with, but instead augmented with LLMs, that LLMs could help fill the cold-start gap because they&amp;rsquo;re so good at zero-shot retrieval and general topic classification and can serve as places to augment cold-start problems and in onboarding, and that LLMs could offer more explainability in recommendations. This has &lt;a href="https://arxiv.org/html/2410.17152v1">so far&lt;/a> &lt;a href="https://arxiv.org/abs/2307.02046">proven to be true.&lt;/a>&lt;/p>
&lt;h1 id="italy--learning-italian">Italy + Learning Italian&lt;/h1>
&lt;p>I had the extreme pleasure of being invited to give a keynote at &lt;a href="https://2025.pycon.it/en">PyCon Italia&lt;/a> in Florence in May. I took all these learnings that had been running on crazed hamster wheels in my brain and threw them into a &lt;a href="https://vickiboykis.com/2024/05/20/dont-worry-about-llms/">talk based loosely on the Decameron&lt;/a>.&lt;/p>
&lt;p>In addition to subjecting my poor audience to jokes about the Medicis, I also gave the intro in Italian to an audience of +100, which was extremely nervewracking. I&amp;rsquo;ve been wanting to study Italian since I was 18 and the stars aligned two years ago for me to seriously start. and I&amp;rsquo;ve been learning ever singe. This year, I also managed to read my first (Level A1/A2) books in Italian.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/902697ae-eefa-4315-bf30-f4603dc1489b.jpg" width="400">
&lt;/figure>

&lt;p>My family also came to Italy. It was the kids&amp;rsquo; first trip to Europe. Everyone had an amazing time (at least after my talk was over), ate a lot of gelato, and experienced Rome and Florence in the spring. One of my favorite moments of the trip (and of my year) was listening to a &lt;a href="https://www.youtube.com/watch?v=6xTTHC2Y18E">Ricchi e Poveri&lt;/a> cover band playing at a disco outside our hotel at 11:30 at night in Florence.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/bceed358-ffa0-4909-a353-834b30d215e4.jpg" width="400">
&lt;/figure>

&lt;p>Inspired by all of this, and to take a break from LLMs, I also started learning how to mix music, and hopefully will have something more concrete to write/share about DJing this year.&lt;/p>
&lt;p>I also &lt;a href="https://vickiboykis.com/essays/2024-12-31-favorite-books/">managed to read some books.&lt;/a>&lt;/p>
&lt;h1 id="shift-in-social-strategy">Shift in Social Strategy&lt;/h1>
&lt;p>As I spent the year trying my best to keep up with the wild, unruly, and growing LLM landscape, another part of my ecosystem was wilting. It&amp;rsquo;s clear for all intents and purposes that Twitter as we knew her is dead. It doesn&amp;rsquo;t have the juice anymore, as the kids say. I&amp;rsquo;m surprised that it died from human intervention rather than engineering failure, but as anyone who&amp;rsquo;s read my newsletter knows, this is by far the most likely outcome for techno-social systems. I have been wanting to write a long eulogy on my personal experinece and what Twitter meant to me: it was a place where I made friends, learned to be a serious programmer, worked out my best ideas, got leads for jobs, started a newsletter, &lt;a href="https://vickiboykis.com/2022/12/22/everything-i-learned-about-accidentally-running-a-successful-tech-conference/">organized Normconf&lt;/a>, and most importantly, had fun. But in the end, it all died with a whimper, and I, like many others, just quietly left.&lt;/p>
&lt;p>I&amp;rsquo;ve always been a firm believer in owning your own internet space, and I continued to do that by blogging, and moving from Substack to Buttondown, but I also need micro social interaction because I get energy and ideas from it. So, I engaged more on Bluesky, where I&amp;rsquo;ve been since last year, and was pleasantly surprised to see the machine learning community start to migrate there. Amost immediately as soon as I came back to the platform, I had a conversation that &lt;a href="https://vickiboykis.com/2024/11/09/why-are-we-using-llms-as-calculators/">led to an idea for a post.&lt;/a>&lt;/p>
&lt;p>A lot of people have (rightfully so) sworn off agglomerated social &lt;a href="https://vickiboykis.com/2024/09/19/dead-internet-souls/">in favor of group chats.&lt;/a> This makes complete sense to me, yet as a creator and someone who thrives on community, I was sorely missing this space until I came back to Bluesky. There is a lot of conversation around protocol versus platform and what it means for Bluesky the app versus Bluesky the protocol to succeed.&lt;/p>
&lt;p>In particular, I encourage anyone interested in the space to read &lt;a href="https://dustycloud.org/blog/how-decentralized-is-bluesky/">this series of exchanges&lt;/a> &lt;a href="https://whtwnd.com/bnewbold.net/3lbvbtqrg5t2t">on the question&lt;/a> but I remain hopeful that that space grows and remains a place where users can choose and experiment. It allowed me, at the end of the year, &lt;a href="https://github.com/veekaybee/gitfeed">to hack on gitfeed&lt;/a>, which is very, very very cool.&lt;/p>
&lt;h1 id="2025">2025&lt;/h1>
&lt;p>What next? Who knows, who can even predict the future? Not machine learning models, and certainly not me. I&amp;rsquo;m just excited to keep learning and building and making and dreaming, and, hopefully writing more about it all.&lt;/p></description></item><item><title>Write code with your Alphabet Radio on</title><link>https://vickiboykis.com/2024/12/16/write-code-with-your-alphabet-radio-on/</link><pubDate>Mon, 16 Dec 2024 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2024/12/16/write-code-with-your-alphabet-radio-on/</guid><description>&lt;p>There is a lot of debate in the software community around whether LLMs can replace developers. Part of the reason is the way we formulate the problem of what it means to write software. In industry, we still give outsize cultural deference to software developers as lone wizards who come into the room, put on their hoodie, crank up the techno, and write the application on their own.&lt;/p>
&lt;p>When we sit down at our keyboard, we assume that we sit down alone, with our precious treasure trove of programming knowledge that we have built up over the years, hoarding the nuggets about byte size and data types like a dragon guarding its gold. It is us and our hard-earned brain patterns from years of algorithm classes and hundreds of pages spent trawling StackOverflow versus the machine, working alone, against each other.&lt;/p>
&lt;p>We sit down, we crack our knuckles and deserialize the knowledge from our brain into clear streams of beautiful code, in solitude and blessed silence. These days, we are spending less time with books and search results and more time with ChatGPT or Claude, and it’s this prompting process, we assume, that will ultimately replace us. But that’s only true if we consider the developer as an individual unit of work separate from anything else, without our own context window.&lt;/p>
&lt;p>&lt;a href="https://vickiboykis.com/2024/01/15/whats-new-with-ml-in-production/">Machine learning is compression&lt;/a> and LLMs doubly so. When we sit with an LLM, we are not only asking the average of the internet but the average, &lt;a href="https://x.com/karpathy/status/1862565643436138619">as Karpathy puts it&lt;/a>, of every human data labeler that has contributed to moving LLMs forward.&lt;/p>
&lt;p>When we ask LLMs to write code, we get back compressed representations of the entire field, optimized via RLHF, sorted by descending probability, and limited by the size of the model’s context window. So we are not alone, but we are also often don’t ascend outside a local minimum of average code, sometimes not even outside the &lt;a href="https://vickiboykis.com/2021/08/05/the-local-minima-of-suckiness/">local minima of suckiness.&lt;/a> Not to mention, these models have no access to hundreds and thousands and millions of lines of elegant code that has never seen the public internet because it sits quietly, undiscoverable in corporate git servers. When we ask the model, almost any model (with the exception of finetunes on our own data), we are peeking into a very small window of what good code could be.&lt;/p>
&lt;p>When I sit down at the keyboard, I also sit down with the average of the local minima, but I also have a secret superpower. I’ve had the enormous good fortune to work with and learn from very good developers, and it’s the average habits of all of these developers that lift me out of the depths of compressed scraped GitHub.&lt;/p>
&lt;p>These developers’ good habits stream from my memories into my consciousness. As I write code, they enrich my decision-making process, like a radio station of good advice that never turns off.&lt;/p>
&lt;p>&lt;em>A&lt;/em> insists that I need to learn fundamental algorithms and data structures inside and out because they are the basis for what makes good code fast. “Even if you never implement a LinkedList in your own code, you’ll be able to more clearly reason through the decisions others have made and use the libraries that make the most efficient use of them.”&lt;/p>
&lt;p>&lt;em>B&lt;/em> tells me, through hundreds of PR comments over years and years, that it’s not good enough to have a PR that just passes tests. The code needs to be clean, reasonable, and legible for others because we read code more than we write it.&lt;/p>
&lt;p>&lt;em>C&lt;/em> tells me to write code that is elegant. C says it will take longer to write this kind of code than something that just ships, but that it is our professional responsibility as developers to carve out that time, to demand it. “Write your code, then write a second time. Make it work, make it right, make it fast.”&lt;/p>
&lt;p>&lt;em>D&lt;/em> tells me not to mess with fancy tools. A simple print debug statement goes a long way. A 2-line unit test fixes 100 lines of obtuse code. Code completion editors don&amp;rsquo;t work when you don&amp;rsquo;t know what you want to autocomplete, yet.&lt;/p>
&lt;p>&lt;em>E&lt;/em> tells me to master my tools, to dig into problems I don’t understand, to get to a reproducible example, to play with things in the terminal, take them apart, write smaller and smaller pieces of code until they make sense.&lt;/p>
&lt;p>&lt;em>F&lt;/em> tells me to take time to analyze the data correctly, and to be crisp and concrete about my analysis.&lt;/p>
&lt;p>&lt;em>G&lt;/em> and &lt;em>H&lt;/em> tell me to start without machine learning if I can, and if I have to use it, to use a simple model first of all.&lt;/p>
&lt;p>&lt;em>I&lt;/em> tells me to get to a local development loop as soon as possible.&lt;/p>
&lt;p>&lt;em>J&lt;/em> tells me to take my craft seriously, but never myself.&lt;/p>
&lt;p>&lt;em>K&lt;/em> tells me to be patient and have grace, both for others, and for myself. Code is hard, but humans are harder.&lt;/p>
&lt;p>&lt;a href="https://grugbrain.dev/">Above all of them is Grug&lt;/a>, constantly whispering “complexity bad” with every line I add.&lt;/p>
&lt;p>Over and over my Alphabet Radio plays my greatest hits, fixing a variable name, digging through a class hierarchy, adding a test, using dependency injection, avoiding unnecessary libraries, deleting lines of code until the number of lines I add is less than the number I subtract, refactoring, refactoring, refactoring.&lt;/p>
&lt;p>I am so lucky to have my Alphabet Radio. Engineers like me who come from data land sometimes don’t get the privilege of building a playlist like this &lt;a href="https://www.ethanrosenthal.com/2023/01/10/data-scientists-alone/">because we don’t work in teams&lt;/a>, and for a long time at the beginning of my career, I was one of these people.&lt;/p>
&lt;p>It was very lonely and I felt wrong a lot of the time without guardrails or intuition as to why what I was doing was inefficient and incorrect. LLMs don’t and can’t help with this part. Repeated exposure to the best practices of good developers does.&lt;/p>
&lt;p>When I am at the keyboard, my joyous, cacophonous radio station keeps streaming the greatest hits on an infinite loop and allows me to rise above the local minimum, encouraging me to work harder to produce code that is not just functional, but easy for others to read, fast to modify, and doesn’t break for longer.&lt;/p>
&lt;p>Nothing is black and white. Code is not precious, nor the be-all end-all. The end goal is a functioning product. All code is eventually thrown away. &lt;a href="https://vickiboykis.com/2023/02/26/what-should-you-use-chatgpt-for/">LLMs help with some tasks&lt;/a>, if you already know what you want to do and give you shortcuts. But they can’t help with this part. They can’t turn on the radio. We have to &lt;a href="https://vickiboykis.com/2023/09/13/build-and-keep-your-context-window/">build our own context window&lt;/a> and make our own playlist.&lt;/p>
&lt;p>When LLMs can stream advice as clearly and well as my Alphabet Radio, then, I’ll worry. Until then, I build with my radio on.&lt;/p></description></item><item><title>Why are we using LLMs as calculators?</title><link>https://vickiboykis.com/2024/11/09/why-are-we-using-llms-as-calculators/</link><pubDate>Sat, 09 Nov 2024 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2024/11/09/why-are-we-using-llms-as-calculators/</guid><description>&lt;p>We &lt;a href="https://www.reddit.com/r/singularity/comments/122ilav/why_is_maths_so_hard_for_llms/">keep trying to get LLMs to do math&lt;/a>. We want them &lt;a href="https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618">to count the number of &amp;ldquo;rs&amp;rdquo; in strawberry&lt;/a>, to perform &lt;a href="https://arxiv.org/abs/2303.05398">algebraic reasoning&lt;/a>, &lt;a href="https://news.ycombinator.com/item?id=30309302">do multiplication&lt;/a>, and &lt;a href="https://mathstodon.xyz/@tao/113132502735585408">to solve math theorems.&lt;/a>&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/llm_calc.png" width="400">
&lt;/figure>

&lt;p>A &lt;a href="https://x.com/yuntiandeng/status/1836114401213989366">recent experiment particularly&lt;/a> piqued my interest. Researchers used OpenAI&amp;rsquo;s new &lt;a href="https://openai.com/index/hello-gpt-4o/">4o model&lt;/a> to solve multiplication problems by using the prompt:&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-shell" data-lang="shell">&lt;span style="display:flex;">&lt;span>Calculate the product of x and y. Please provide the final answer in the format: 
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span>Final Answer: &lt;span style="color:#f92672">[&lt;/span>result&lt;span style="color:#f92672">]&lt;/span>
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;figure>&lt;img src="https://vickiboykis.com/images/72f1f906-20b6-4d5d-880a-da1065e15f39.png" width="400">
&lt;/figure>

&lt;p>These models are generally &lt;a href="https://arxiv.org/abs/2204.07705">trained for natural language tasks&lt;/a>, particularly text completions and chat.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/1ef1bc9d-408d-40ea-b2ec-073120785ac6.png" width="400">
&lt;/figure>

&lt;p>So why are we trying to get these enormous models, good for natural text completion tasks like summarization, translation, and writing poems, to multiply three-digit numbers and, what&amp;rsquo;s more, attempt to return the results as a number?&lt;/p>
&lt;p>Two reasons:&lt;/p>
&lt;ol>
&lt;li>Humans always try to use any new software/hardware we invent to do calculation&lt;/li>
&lt;li>We don&amp;rsquo;t actually want them to do math for the sake of replacing calculators, we want to understand if they can reason their way to AGI.&lt;/li>
&lt;/ol>
&lt;h1 id="computers-and-counting-in-history">Computers and counting in history&lt;/h1>
&lt;p>In the history of human relationships with computers, we&amp;rsquo;ve always wanted to count large groups of things because &lt;a href="https://en.wikipedia.org/wiki/The_Magical_Number_Seven,_Plus_or_Minus_Two">we&amp;rsquo;re terrible at it&lt;/a>. Initially we used our hands - or others&amp;rsquo; - in the Roman empire, administrators known as &lt;em>calculatores&lt;/em> and slaves known as &lt;a href="https://kartsci.org/kocomu/computer-history/history-abacus-ancient-computing/">calculones&lt;/a> performed household accounting manually.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/bbe7a45f-86a5-40f3-abba-ec635ce7c93f.png" width="400">
&lt;/figure>

&lt;p>Then, we started inventing calculation lookup tables. After the French Revolution, the French Republican government switched to the metric system in order to collect property taxes. In order to perform these calculations, it hired human computers to do the conversions by creating large tables of logarithms for decimal division of angles, &lt;a href="https://inria.hal.science/inria-00543946/document">Tables du Cadastre&lt;/a>. This system was never completed and eventually scrapped, but it inspired Charles Babbage to do his work on machiens for calculation along with Ada Lovelace, which in turn kicked off the modern era of computing.&lt;/p>
&lt;p>UNIVAC, one of the first modern computers, was used by the Census Bureau in &lt;a href="https://www.census.gov/about/history/bureau-history/census-innovations/technology/univac-i.html">population counting.&lt;/a>&lt;/p>
&lt;p>The nascent field of artificial intelligence developed jointly in line with the expectation that machines should be able to replace humans in computation through historical developments like the Turing Test and &lt;a href="https://en.wikipedia.org/wiki/Turochamp">Turing&amp;rsquo;s chess program&lt;/a>, the &lt;a href="https://spectrum.ieee.org/dartmouth-ai-workshop">Dartmouth Artificial Intelligence Conference&lt;/a> and &lt;a href="https://www.ibm.com/history/early-games">Arthur Samuel&amp;rsquo;s checkers demo&lt;/a>. &lt;/p>
&lt;p>Humans have been inventing machines to mostly do math for milennia, and it&amp;rsquo;s only recently that computing tasks have moved up the stack from calculations to higher human endeavors like writing, searching for information, and shitposting. So naturally, we want to use LLMs to do the thing we&amp;rsquo;ve been doing with computers and software all these years.&lt;/p>
&lt;h1 id="making-computers-think">Making computers think&lt;/h1>
&lt;p>Second, we want to understand if LLMs can &amp;ldquo;think.&amp;rdquo; There is no one definition of what &amp;ldquo;thinking&amp;rdquo; means, but for these models in particular, &lt;a href="https://arxiv.org/abs/2212.10403">we are interested to see&lt;/a> if they can work through a chain of steps to come to an answer about logical things that are easy for humans, as an example:&lt;/p>
&lt;blockquote>
&lt;p>all whales are mammals, all mammals have kidneys; therefore, all whales have kidneys&lt;/p>&lt;/blockquote>
&lt;p>One way humans reason is through performing different kinds of math: arithmetic, solving proofs, and reasoning through symbolic logic. The underlying question in artificial intelligence is whether machines can reason outside of the original task we gave them. For large language models, the ask is whether they can move from summarizing first a book if they were trained for books, to a movie script plot, to finally, summarizing what you did all day if you pass it a bunch of documents about your activity. So, it stands to reason that if LLMs can &amp;ldquo;solve&amp;rdquo; math problems, they can achieve AGI.&lt;/p>
&lt;p>There are approximately seven hundred million benchmarks to see if LLMs can reason. &lt;a href="https://www.llm-reasoning-benchmark.com/">Here&amp;rsquo;s an example&lt;/a>, and &lt;a href="https://arxiv.org/abs/2307.13692">here&amp;rsquo;s another one&lt;/a>. Even since I started this draft yesterday, &lt;a href="https://epochai.org/frontiermath/the-benchmark">a new one came out.&lt;/a>&lt;/p>
&lt;p>Since it&amp;rsquo;s hard to define what &amp;ldquo;reasoning&amp;rdquo; or &amp;ldquo;thinking&amp;rdquo; means, the benchmarks try to proxy to see if models can answer the same questions we give to humans in settings such as university tests and compare the answers between human annotators generating ground truth and inference run on the model.&lt;/p>
&lt;p>These types of tasks make up a &lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard">large number of LLM benchmarks that are popular on LLM leaderboards.&lt;/a>&lt;/p>
&lt;h1 id="how-calculators-work">How calculators work&lt;/h1>
&lt;p>However, evaluating how good LLMs are at calculation doesn&amp;rsquo;t take into account a critical component: the way that calculators arrive at their answer is radically different from how these models work. A calculator records the button you pressed and converts it to a binary representation of those digits. Then, it stores those number in memory registers until you press an operation key. For basic hardware calculators, the machine has built-in operations that perform variations of addition on the binary representation of the number stored in-memory:&lt;/p>
&lt;pre>&lt;code> + addition is addition, 
 + subtraction is performed via two's complement operations, 
 + multiplication is just addition, and 
 + division is subtraction
&lt;/code>&lt;/pre>
&lt;p>In software calculators, &lt;a href="https://www.bloomberg.com/graphics/2015-paul-ford-what-is-code/">the software takes user keyboard input&lt;/a>, generates a scan code for that key press, encodes the signal, converts it to character data, and uses an encoding standard to convert the key press to a binary representation. That binary representation is sent to the application level, which now starts to work with the variable in the programming language the calculator uses, and performs operations on those variables based on &lt;a href="https://gitlab.gnome.org/GNOME/gnome-calculator/-/blob/main/lib/number.vala?ref_type=heads#L587">internally-defined methods for addition, subtraction, multiplication, and division.&lt;/a>&lt;/p>
&lt;p>Software calculators can grow to be fairly complicated with the addition of graphing operations and calculus, but usually have a standard collected set of methods to follow to perform the actual calcuation. As a fun aside, &lt;a href="https://www.pcalc.com/mac/thirty.html">here&amp;rsquo;s a great piece&lt;/a> on what it was like to build a calculator app Back In The Day.&lt;/p>
&lt;p>The hardest part of the calculator is writing the logic for representing numbers correctly and creating manual classes of operations that cover all of math&amp;rsquo;s weird corner cases.&lt;/p>
&lt;p>However, to get an LLM to add &amp;ldquo;2+2&amp;rdquo;, we have a much more complex level of operations. Instead of a binary calculation machine that uses small, simple math business logic to derive an answer based on addition, we create an enormous model of the entire universe of human public thought and try to reason our way into the correct mathematical answer based on how many times the model has &amp;ldquo;seen&amp;rdquo; or been exposed to the text &amp;ldquo;2+2&amp;rdquo; in written form.&lt;/p>
&lt;p>We first train a large language model to answer questions.&lt;/p>
&lt;p>&lt;img alt="d8e7f5ec-c333-4890-8430-7f73fe9e89fa_1588x386" height="386" id="h-rh-i-0" src="https://vickiboykis.com/images/87e4ee63-7270-4fcf-89da-45769b7aba53.jpg" width="1588">&lt;/p>
&lt;p>&lt;a href="https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training">Source&lt;/a>&lt;/p>
&lt;p>This includes:&lt;/p>
&lt;ol>
&lt;li>&lt;a href="https://magazine.sebastianraschka.com/p/new-llm-pre-training-and-post-training">Gathering and deduplicating&lt;/a> an enormous amount of large-scale, clean internet text&lt;/li>
&lt;li>We then train the model by feeding it the data and asking it, at a very simplified level, to predict the next word in a given sentence. We then compare that prediction to the baseline sentence and adjust a loss function. An attention mechanism helps guide the prediction by keeping a context map of all the words of our vocabulary (our large-scale clean internet text.)&lt;/li>
&lt;li>Once the model is trained initially to perform the task of text completion, we perform &lt;a href="https://arxiv.org/abs/2308.10792">instruction fine-tuning&lt;/a>, to more closely align the model with the task of performing a summarization task or following instructions.&lt;/li>
&lt;li>The model is aligned with human preferences with RLHF. &lt;a href="https://huggingface.co/blog/rlhf">This process&lt;/a> involves collecting a set of questions with human responses, and having human annotators rank the response of the model, and then feeding those ranks back into the model for tuning.&lt;/li>
&lt;li>Finally, we stand up that artifact (or have it accessable as a service.) The artifact is &lt;a href="https://vickiboykis.com/2024/02/28/gguf-the-long-way-around/">a file or a collection of files&lt;/a> that contain the model architecture and weights and biases of the model generated from steps 2 and 3.&lt;/li>
&lt;/ol>
&lt;p>Then, when we&amp;rsquo;re ready to query our model. This is the step that most people take to get an answer from an LLM when they hit a service or run a local model, equivalent to opening up the calculator app.&lt;/p>
&lt;ol>
&lt;li>We write &amp;ldquo;What&amp;rsquo;s 2 + 2&amp;rdquo; into the text box.&lt;/li>
&lt;li>This natural-language query &lt;a href="https://cybernetist.com/2024/10/21/you-should-probably-pay-attention-to-tokenizers/">is tokenized&lt;/a>. Tokenization is the process of first converting our query into a string of words that the model uses as the first step in performing numerical lookups.&lt;/li>
&lt;li>That text is then embedded in the context of the model&amp;rsquo;s vocabulary by converting each word to an embedding and then creating an embedding vector of the input query.&lt;/li>
&lt;li>We then passing the vector to the model&amp;rsquo;s encoder, which stores the relative position of embeddings to each other in the model&amp;rsquo;s vocabulary&lt;/li>
&lt;li>Passing those results to the attention mechanism for lookup, which compares the similarity using various metrics of each token and position with every other token in the reference text (the model). This happens many times in multi-head attention architectures.&lt;/li>
&lt;li>Getting results back from the decoder. A &lt;a href="https://huggingface.co/docs/transformers/en/llm_tutorial">set of tokens and the probability of those tokens is returned from the decoder.&lt;/a> We need to generate the first token that all the other tokens are conditioned upon. However, afterwards, &lt;a href="https://huggingface.co/blog/how-to-generate">returning probablities takes many forms&lt;/a>: namely search strategies like greedy search and and sampling, most frequently top-k sampling, the method originally used by GPT-2. Depending on which strategy you pick and what tradeoffs you&amp;rsquo;d like to make, you will get &lt;a href="https://gist.github.com/kalomaze/4473f3f975ff5e5fade06e632498f73e">slightly different answers of resulting tokens selected from the model&amp;rsquo;s vocabulary.&lt;/a>&lt;/li>
&lt;/ol>
&lt;p>Finally, even after this part, to ensure that what the model outputs is an actual number, we could do a number of different guided generation strategies to ensure we get ints or longs as output from &lt;a href="https://dottxt-ai.github.io/outlines/latest/welcome/">multiplication, addition, etc.&lt;/a>&lt;/p>
&lt;p>So this entire process, in order to add &amp;ldquo;what is 2+2&amp;rdquo;, we do a non-deterministic a lookup from an enormous hashtable that contains the sum of public human knowledge we&amp;rsquo;ve seen fit to collect for our dataset, then we squeeze it through the tiny, nondeterministic funnels of decoding strategies and guided generation to get to an answer from a sampled probability distribution.&lt;/p>
&lt;p>These steps include a large amount of actual humans in the loop guiding the model throughout its various stages.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/8bd85044-8583-48dd-b620-f8f13a134d18.png" width="700">
&lt;/figure>

&lt;p>And, all of this, only to get an answer that&amp;rsquo;s right only some percent of the time, not consistent across all model architectures and platforms and in many cases has to be coaxed out of the model using techniques like chain of thought.&lt;/p>
&lt;p>As an example, here&amp;rsquo;s an aswer I&amp;rsquo;ve tried on OpenAI, Claude, Gemini, and locally using Mistral via llamafile and ollama:&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/7be40c7f-8f7b-48db-9ad9-80c421e3c05c.png" width="400">
&lt;/figure>

Claude Sonnet 3.5&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/ec024d0d-3d58-4bab-9052-a31c91a0bc62.png" width="400">
&lt;/figure>

Gemini 1.5 Flash&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/ecd53bb1-063a-477d-9969-877cfa3eb35c.png" width="400">
&lt;/figure>

OpenAI ChatGPT GPT-4 Turbo&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/3fd6bdaf-dc08-4208-959f-46df356bc4d9.png" width="400">
&lt;/figure>

Llamafile Mistral 7-B Instruct 2&lt;/p>
&lt;p>&lt;figure>&lt;img src="https://vickiboykis.com/images/986371b3-6703-41a9-bdf8-ea74680149ed.png" width="400">
&lt;/figure>

Ollama Mistral&lt;/p>
&lt;p>If you ask any given calculator what 2+2 is, you&amp;rsquo;ll always get 4. This doesn&amp;rsquo;t work with LLMs, even when it&amp;rsquo;s variations of the same model, much less different models hosted across different service providers and in different levels of quantization, different sampling strategies, mix of input data, and more.&lt;/p>
&lt;h2 id="why-are-we-even-doing-this">Why are we even doing this?&lt;/h2>
&lt;p>From a user perspective, this is absolutely a disastrous violation of Jakob&amp;rsquo;s Law of UX, which states that people &lt;a href="https://vickiboykis.com/2024/05/06/weve-been-put-in-the-vibe-space/">expect the same kind of output&lt;/a> from the same kind of interface.&lt;/p>
&lt;p>However, when you realize that the goal is, as &lt;a href="https://mathstodon.xyz/@tao/113132502735585408">Terrence Tao notes&lt;/a>, to get models to solve mathematical theorems, it makes more sense, although all these models are still very far from actual reasoning.&lt;/p>
&lt;p>I&amp;rsquo;d love to see us spend time more understanding and working on the practical uses &lt;a href="https://unlocked.microsoft.com/ai-anthology/terence-tao/">he discusses&lt;/a>: drafts of documents, as ways to check understanding of a codebase, and of course, &lt;a href="https://vickiboykis.com/2023/02/26/what-should-you-use-chatgpt-for/">generating boilerplate Pydantic models for me personally&lt;/a>.&lt;/p>
&lt;p>But, this is the core tradeoff between practicality and research: do we spend time on Pydantic now because it&amp;rsquo;s what&amp;rsquo;s useful to us at the moment, or do we try to get the model to write the code itself to the point where we don&amp;rsquo;t even need Pydantic, or Python, or programming languages, and can write natural language code, backed by mathematical reasoning?&lt;/p>
&lt;p>If we didn&amp;rsquo;t spend time on the second, we never would have gotten even to GPT-2, but the question is, how much further can we get? I&amp;rsquo;m not sure, but I personally am still not using LLMs for tasks that can&amp;rsquo;t be verified or for reasoning, or for counting Rs.&lt;/p>
&lt;hr>
&lt;p>Further Reading:&lt;/p>
&lt;ul>
&lt;li>&lt;a href="https://mitpress.mit.edu/9780262549349/artificial-general-intelligence/">Artificial General Intelligence by Julian Togelius&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://wwnorton.com/books/9780393882148">Empire of the Sum: The Rise and Reign of the Pocket Calculator by Keith Houston&lt;/a>&lt;/li>
&lt;li>&lt;a href="http://jmc.stanford.edu/articles/dartmouth/dartmouth.pdf">Dartmouth AI Workshop Original Proposal&lt;/a>&lt;/li>
&lt;/ul></description></item><item><title>Dead Internet Souls</title><link>https://vickiboykis.com/2024/09/19/dead-internet-souls/</link><pubDate>Thu, 19 Sep 2024 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2024/09/19/dead-internet-souls/</guid><description>&lt;p>In the 1800s, before serfdom was abolished in the Russian empire, landowners paid taxes based on how many serfs they had. A census was conducted every few years by government employees traveling across the empire and doing counts; a manual map-reduce of epic proportions. If a person was dead, it would often be years before the government cleared the cache, so to speak, and landowners continued to pay taxes on these dead souls.&lt;/p>
&lt;p>Alexandr Pushkin, the greatest living Russian-language author at the time, heard a story about how landowners took advantage of this by buying up dead souls from landowners, and passed this story onto fellow writer, Nikolai Gogol as an idea for a book or play, which resulted in Gogol’s seminal satirical work, &lt;a href="https://en.wikipedia.org/wiki/Dead_Souls">“Dead Souls.”&lt;/a> Gogol meant dead souls on two levels: both the serfs, and the banality and falsity of Russian landowning society at the time.&lt;/p>
&lt;p>The internet today is filled with dead souls. Or, more accurately, souls who were never alive in any sense of the word: text copypasta from LLMS, slop artwork generated from generative art tools, and bots on aging social networks that are quickly emptying of real content as real people &lt;a href="https://sriramk.com/group-chats-rule-the-world">migrate to group chats&lt;/a>.&lt;/p>
&lt;p>It used to be different. In the beginning, the internet was made of people. I came online in the mid 1990s, along with the rest of America, and for the first few years of my internet experience, my primary threat model was someone on AOL finding out that I was 12 (&lt;a href="https://books.google.com/books?id=P54CfcXKMUUC&amp;amp;pg=PA87#v=onepage&amp;amp;q&amp;amp;f=false">A/S/L&lt;/a> , and also where I live. After all, no one on the internet knew if &lt;a href="https://en.wikipedia.org/wiki/On_the_Internet,_nobody_knows_you%27re_a_dog">I was a dog.&lt;/a>)&lt;/p>
&lt;p>Based on my priors, I thought that this would be the world my kids would grow up into as well, and as they got older, I started preparing them against the dangers of strangers on the internet.&lt;/p>
&lt;p>But the explosion of generative AI over the past two years has brought to light something even worse: the problems we encounter on the internet today are not because of people, but because of the lack of them, and now the danger is not that you will find people who want to do you harm, but that the &lt;a href="https://www.envisioning.io/vocab/slop">amount of slop&lt;/a> will overwhelm us all: Not 1984, but Brave New World.&lt;/p>
&lt;p>&lt;a href="https://theconversation.com/side-job-self-employed-high-paid-behind-the-ai-slop-flooding-tiktok-and-facebook-237638">It continues&lt;/a> &lt;a href="https://www.theverge.com/2024/9/18/24248471/linkedin-ai-training-user-accounts-data-opt-in">to get worse&lt;/a>, and &lt;a href="https://techcrunch.com/2023/12/27/the-new-york-times-wants-openai-and-microsoft-to-pay-for-training-data/">will get much worse&lt;/a> before it gets better.&lt;/p>
&lt;p>But the good news is that, when there is a lot of noise, it is even easier for people who have original thoughts to stand out. &lt;a href="https://justine.lol/history/">As Justine writes&lt;/a> in this important and thought-provoking piece,&lt;/p>
&lt;blockquote>
&lt;p>In a world of infinite automation and infinite surveillance, survival is going to depend on being the least boring person.&lt;/p>&lt;/blockquote>
&lt;p>If you are a human out there in the vast wilds of the new internet of dead souls, speak up. We are still out there, on the &lt;a href="https://benhoyt.com/writings/the-small-web-is-beautiful/">small web&lt;/a>, being ingested into &lt;a href="https://vickiboykis.com/2024/05/06/weve-been-put-in-the-vibe-space/">indie search engine indexes&lt;/a>, on blogs, on websites.&lt;/p>
&lt;p>The machines are thinking, but so are people, and they’re still much better. I’m still out here. Come be out here with me, too.&lt;/p></description></item><item><title>Don't worry about LLMs</title><link>https://vickiboykis.com/2024/05/20/dont-worry-about-llms/</link><pubDate>Sat, 25 May 2024 00:00:00 +0000</pubDate><guid>https://vickiboykis.com/2024/05/20/dont-worry-about-llms/</guid><description>&lt;p>This is a near-transcript of &lt;a href="https://2024.pycon.it/en/keynotes/stay-close-to-the-metal">the talk I gave&lt;/a> at &lt;a href="https://www.youtube.com/watch?v=Ik0voaZmf5A">PyCon Italia 2024&lt;/a> in May in Florence.&lt;/p>
&lt;div style="position: relative; padding-bottom: 56.25%; height: 0; overflow: hidden;">
 &lt;iframe allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share; fullscreen" loading="eager" referrerpolicy="strict-origin-when-cross-origin" src="https://www.youtube.com/embed/Ik0voaZmf5A?si=h2Yfx5v_LIfWd6xz&amp;amp;amp;%20start=30316?autoplay=0&amp;amp;controls=1&amp;amp;end=0&amp;amp;loop=0&amp;amp;mute=0&amp;amp;start=0" style="position: absolute; top: 0; left: 0; width: 100%; height: 100%; border:0;" title="YouTube video">&lt;/iframe>
 &lt;/div>

&lt;h1 id="introduction">Introduction&lt;/h1>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_4_resized.png" width="400">
&lt;/figure>

&lt;p>Buongiorno PyconIt, grazie per avermi invitata a parlare! Avrei voluta fare tutto il discorso in italiano, ma lo sto ancora imparando.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_2_resized.png" width="400">
&lt;/figure>

&lt;p>Per adesso posso parlare soltanto di gelato o colori. Perché non so ancora dire, “don’t worry about LLMs”, il resto sarà in inglese.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_3_resized.png" width="400">
&lt;/figure>

&lt;p>&lt;a href="https://vickiboykis.com/">I’m Vicki&lt;/a> and I work as a machine learning engineer &lt;a href="https://blog.mozilla.ai/author/vicki/">at Mozilla.ai&lt;/a>, building a platform to enable developers to evaluate and select between different LLMs. Before working on LLMs, I’ve built large-scale ML systems in security, social recommenders, and tv content recommendations&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_4_resized.png" width="400">
&lt;/figure>

&lt;p>After working with LLMs for the past year, what I&amp;rsquo;ve found is that the new engineering systems we’re building around these LLMs are a lot like the old ones. Once we cut away the hype, what we’re usually left with are plain engineering and machine learning problems.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_5_resized.png" width="400">
&lt;/figure>

&lt;p>But how do we as practitioners cut away the hype? Since we are in Firenze, I want to tell a story I recently heard while talking to other tech folks that took place here that might help us to navigate this. Around town, there is a company called Medici Corp.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_6_resized.png" width="400">
&lt;/figure>

&lt;p>They are a very large organization, spread out into lots of different industries in banking, pharma, fine arts, and philanthropy. They were very successful, but recently, their CEO was worried about being left behind in AI.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_7_resized.png" width="400">
&lt;/figure>

&lt;p>And she wanted to see if there was some way they could incorporate a chatbot or similar into their offerings.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_8_resized.png" width="400">
&lt;/figure>

&lt;p>The CEO created a small R&amp;amp;D task force of developers and machine learning engineers and tasked them with investigating what it would take to add AI to their product over the course of a sprint.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_9_resized.png" width="400">
&lt;/figure>

&lt;p>Now, the developers, machine learning engineers, and PMs, were all very experienced in industry, but new to LLMs, and when they started looking at all the different model choices, platforms, and the buzz around LLMs, they became very distraught.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_10_resized.png" width="400">
&lt;/figure>

&lt;p>There were too many open-ended options , a lot of people who were loud online. So, the developers decided to rent an Airbnb outside the city for a week so they could really focus, isolate, and ship some code. When they got together around a whiteboard, they frantically started researching what tools to build an LLM with, and what they saw was &lt;a href="https://mattturck.com/mad2024/">this chart.&lt;/a>&lt;/p>
&lt;p>“What do we do, ” they asked. How were they going to compete in this insanely crowded market? More importantly, how could they as engineers even understand this landscape?&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_11_resized.png" width="400">
&lt;/figure>

&lt;p>As they worried their team lead stepped forward and said, “In times like these, I turn to the wisdom of the foundational thinkers. &lt;a href="http://antirez.com/news/140">There is one&lt;/a> who said,&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_12_resized.png" width="400">
&lt;/figure>

&lt;p>The developers turned to the team lead and said, “This man says the truth, but how can we possibly turn this into actionable advice that we can write features for with feature flags and unit tests in a two-week sprint? And deliver a product that has a magic emoji on it?” In other words, &amp;ldquo;How do we deliver AI?&amp;rdquo;&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_13_resized.png" width="400">
&lt;/figure>

&lt;p>The team lead said, “Here’s what we’re going to do. We’re going to get in small groups and research what other people have done in the situation, and then we’re going to present what other people did.”&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_14_resized.png" width="400">
&lt;/figure>

&lt;p>The engineers groaned, because group work is the worst. But the team lead said, “Reading about what people in the past have done is the only way to build and keep our context window, which is our knowledge base of classical architecture patterns and historical engineering context that allows us to make good, grounded engineering decisions.”&lt;/p>
&lt;h2 id="the-singular-machine-learning-task">The Singular Machine Learning Task&lt;/h2>
&lt;p>So the engineers agreed and went off to do research. All day the engineers researched and read, and the next day, they all gathered in a group. One stepped forward, and began to tell her story.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_15_resized.png" width="400">
&lt;/figure>

&lt;p>She said, once upon a time in Barcelona, they were building a grand cathedral, called Santa Maria Del Mar. At that time, &lt;a href="https://en.wikipedia.org/wiki/Building_a_Gothic_cathedral">many Gothic cathedrals&lt;/a> were built in periods that took a long amount of time, usually fifty to one hundred years. &lt;a href="https://carrersbcn.com/2021/01/17/the-legendary-bastaixos-of-the-santa-maria-del-mar/">Santa Maria del Mar&lt;/a> was finished in only fifty five, even notwithstanding a fire and the plague that started the neighborhood where it was being built, and is the only church surviving in the Catalan Gothic style.&lt;/p>
&lt;p>It’s unique because it was one of the only churches built with backing from commoners as opposed to the nobility of the city of Barcelona. The rich would pay with their money, and the poor would pay with their labor, for the neighborhood to build the cathedral together.&lt;/p>
&lt;p>A key force on the project were the bastaixos, or porters. They had an organized guild, and these men were traditionally in charge of loading and unloading ships. They were already extremely good at one thing: carrying heavy things. When they heard about the cathedral, the guild volunteered to take the stones from the royal quarry at Montjuïc, at a high elevation above the city, to the cathedral.&lt;/p>
&lt;p>Right now, to get from here to the Cathedral, it takes about 50 minutes to walk, downward. In those days, the bastaixos would have walked past the Port of Barcelona to get there, taking longer, over an hour, carrying a stone that weighs over 40kg on his back alone.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_16_resized.png" width="400">
&lt;/figure>

&lt;p>So a bastaix would first put the stone on his back, and then move it all the way to the cathedral. Then, he would go back and move another stone. They did this all day, day-in and day out. The stones weighed 40 kg and the only protection they had was the turban on their head, called a capcana that rolled up above the neck.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_17_resized.png" width="400">
&lt;/figure>

&lt;p>The machine learning engineer paused and said, you know, now that I’m telling this story, it reminds me of something, and that something is gradient descent.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_18_resized.png" width="400">
&lt;/figure>

&lt;p>&lt;a href="https://arxiv.org/abs/1805.05052">Gradient descent&lt;/a> is a key algorithm that many machine learning models, including neural networks in the transformer family that powers GPT-style models, use for training the model.&lt;/p>
&lt;p>Gradient descent minimizes the loss function by iteratively adjusting the model&amp;rsquo;s parameters (weights and biases). The process involves calculating the gradient of the loss function with respect to the model parameters and then updating these parameters in the direction that reduces the loss.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_19_resized.png" width="400">
&lt;/figure>

&lt;p>For example, &lt;a href="http://vickiboykis.com/test_blog/2024/02/26/gguf-the-long-way-around/">let’s say that we produce artisanal hazelnut spread&lt;/a> for statisticians, Nulltella. We want to predict how many jars of Nulltella we’ll produce on any given day. Let’s say we have some data available to us, and that is, how many hours of sunshine we have per day, and how many jars of Nulltella we’ve been able to produce every day.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_20_resized.png" width="400">
&lt;/figure>

&lt;p>It turns out that we feel more inspired to produce hazelnut spread when it’s sunny out. We can clearly see this relationship between input and output in our data (we do not produce Nulltella Friday-Sunday because we prefer to spend those days talking about Python.)&lt;/p>
&lt;p>Now that we have our data, we need to write our prediction algorithm, where we know, based on our current values, what future values could potentially be. The equation to predict output Y from inputs X for linear regression is outlined here.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_21_resized.png" width="400">
&lt;/figure>

&lt;p>Our task is to continuously adjust our weights and biases for all of our features to optimally solve this equation for the difference between our actual as presented by our data and a prediction based on the algorithm to find the smallest sum of squared differences, between each point and the line.&lt;/p>
&lt;p>In other words, we’d like to minimize epsilon, because it will mean that, at each point, our predicted Y is as close to our actual Y as we can get, given the other points.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_22_resized.png" width="400">
&lt;/figure>

&lt;p>The equation we use for this is RMSE (Root mean squared error).&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_23_resized.png" width="400">
&lt;/figure>

&lt;p>Let’s say we initialize the function with some x values and weights. How do we optimize it? Using gradient descent. We start with either zeros or randomly-initialized values for the weights and continue recalculating both the weights and error term until we come to an optimal stopping point. We’ll know we’re succeeding because our loss, as calculated by RMSE should incrementally decrease in every training iteration.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_24_resized.png" width="400">
&lt;/figure>

&lt;p>For our particular model, we can see that the loss curve reaches a local minimum after ten iterations. If you look at the curve for the elevation the bastaixos walked from Montjuic to Santa Maria del Mar, you’ll see that it follows the same pattern.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_25_resized.png" width="400">
&lt;/figure>

&lt;p>That’s because these two things lay out a fundamental pattern: &lt;a href="https://arxiv.org/abs/1609.04747">working on optimizing one thing at a time.&lt;/a>&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_26_resized.png" width="400">
&lt;/figure>

&lt;p>The bastaixos had a single goal: the completion of the cathedral, and the single functionality of carrying stones to that goal. They didn’t work on carving the stone, nor on the stained glass, nor on architecture. They just moved rocks. Everything else was peripheral, and the focus allowed them to get just one great thing done. The engineers realized that they also needed to learn to be bastaixos and focus on a single machine learning task.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_27_resized.png" width="400">
&lt;/figure>

&lt;p>But what they realized was that LLMs in and of themselves, while also operating on the principle of gradient descent, &lt;a href="https://arxiv.org/abs/2402.06196">are set up to solve&lt;/a> an unbounded number of open-ended tasks: they can write poems, complete recipes, classify text, summarize text, evaluate other models, act as chatbots, and much, much more.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_28_resized.png" width="400">
&lt;/figure>

&lt;p>When the engineers went back to the drawing board, it was clear what they needed: the focus on a single use-case for their customers. What’s the best way to figure out what you need LLMs for? List all the problems your business is facing and see if machine learning will be the right way to address them.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_29_resized.png" width="400">
&lt;/figure>

&lt;p>Machine learning generally, and LLMs more specifically, are good for where the number of heuristics you develop starts to outweigh their maintenance cost. I’ve heard this also called the &lt;a href="https://www.youtube.com/watch?t=280&amp;amp;v=glpR1MD1UoM&amp;amp;feature=youtu.be">1000-intern heuristic&lt;/a>: if it’s a task that can be simplified by a thousand people entirely new to the task doing it, then do that.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_30_resized.png" width="400">
&lt;/figure>

&lt;p>The simplest tasks for large language models to do are summarization, classification, translation, named entity recognition, and similar. If your problems fall in that space, you’ll have an easier time than with open-ended tasks like reasoning. This is also the reason why, often, simpler models perform better for specific tasks than general LLMs that are meant to generalize to out-of-distribution tasks.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_31_resized.png" width="400">
&lt;/figure>

&lt;p>When the team thought about all the numerous things their company did, what they realized is that one of their biggest problems was trending topic detection: they have a lot of documents constantly floating around, particularly around the patronages they’re performing, and they perform a lot of art patronage.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_32_resized.png" width="400">
&lt;/figure>

&lt;p>They wanted to be able to get a sense of the types of art and types of artists in the hands of the corporation at any given time so they could further explore those trends and make sure their art funding was even.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_33_resized.png" width="400">
&lt;/figure>

&lt;p>Armed with this information, The Medici team decided that they wanted to use it internally to look at all the documents they had related to the large trove of artwork they had acquired. It would be easier to do internal topic detection than other outward-facing use-cases because they had subject-matter experts, and the categories of art were fairly established since they had in-house art experts. They presented this plan to the CEO, who was pleased. Now, they were at their next task, but they were still overwhelmed.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_34_resized.png" width="400">
&lt;/figure>

&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_35_resized.png" width="400">
&lt;/figure>

&lt;h2 id="the-measurable-goal">The Measurable Goal&lt;/h2>
&lt;p>On the second day, another developer had a story. She said, I’ve been reading about medieval monks.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_36_resized.png" width="400">
&lt;/figure>

&lt;p>Jamie Keiner, at the Department of Classics in the University of Georgia, wrote a book called &lt;a href="https://wwnorton.com/books/9781631498053">“The Wandering Mind”&lt;/a>, about how medieval monks used to harness attention to make their prayers more effective. Every day, their goal was to get higher on the spiritual plane away from earthly matters. But they kept getting distracted by the daily minutiae of life - legal disputes, farming and livestock, gossip, banality of everyday life that overloads your attention.&lt;/p>
&lt;p>From the writings of one monk, &lt;a href="https://en.wikipedia.org/wiki/John_Cassian">John Cassian&lt;/a>, a Christian monk in the Roman empire, for example, who traveled to Egypt, it turns out that distractions are not new: In the 420s, he wrote that “the mind gets pushed around by random distractions.” Like him, many monks and nuns saw distraction as a “primordial struggle”, and something they felt obligated to continually fight&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_38_resized.png" width="400">
&lt;/figure>

&lt;p>What monks and nuns recognized is that, if they wanted to get closer to true understanding, they would have to separate from distractions that bound them to society, because it was impossible to focus on one thing at a time, and they only wanted to focus on prayer. As a monk called Hildemar of Civate put, “it is impossible to focus on two things at once - “in uno homine duas intentiones esse non posse.”&lt;/p>
&lt;p>Much later, scientists confirmed that human brains have very poor capabilities for multitasking, as well. The most items a human can hold in memory is about 7, plus or minus two, and we definitely cannot do two things simultaneously. After 7 items our cache is cleared and we start getting confused and lost.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_39_resized.png" width="400">
&lt;/figure>

&lt;p>That means when we as engineers try to build large systems with multiple components, with large amounts of decisions to be made, the brain will shut down. This is the case in the LLM land today. The average LLM system might look something like this.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_40_resized.png" width="400">
&lt;/figure>

&lt;p>Each of these components is a million decisions to be made. Here is a list I can tick off just by looking at this diagram.&lt;/p>
&lt;ul>
&lt;li>Should we use our own model or an open-source base-model or an API vendor? Which API vendor?&lt;/li>
&lt;li>How should we go about testing our prompts? How should we constrain the outputs of the model? + Do we want to generate plain text, or JSON, or binary responses?&lt;/li>
&lt;li>What kind of UI do we need? Is text-based chat enough?&lt;/li>
&lt;li>Where do we store the model artifact - do we store it locally&lt;/li>
&lt;li>Are we using cloud vendors, are we using their implementations of cloud artifacts?&lt;/li>
&lt;li>Are we doing model fine-tuning, do we need separate data-collection mechanisms?&lt;/li>
&lt;li>How do we evaluate this model?&lt;/li>
&lt;/ul>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_41_resized.png" width="400">
&lt;/figure>

&lt;p>The monks told us that, even though we will have a lot of distractions and environmental variables, we need to focus on developing one good piece of software at a time as we build.&lt;/p>
&lt;p>The developer telling the story said this reminds me of another established traditional development pattern: the philosophy of Unix, which says, tools that we build should be minimal and modular and do one thing at a time so we can combine them together later.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_43_resized.png" width="400">
&lt;/figure>

&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_44_resized.png" width="400">
&lt;/figure>

&lt;p>It’s true that when we build machine learning systems, we have no choice: we need vector databases, and APIs, and data stores, and, most importantly, models. But, when we build an entire machine learning system at once, for example, connecting five different APIs rather than one, or trying five models rather than one, five use-cases, we see a deterioration at each stage of the process rather than focusing on streamlining a single use-case and task end-to-end.&lt;/p>
&lt;p>How do we get to programs that do one thing well in machine learning, i.e. perform our task of topic detection? The best way is to understand first that machine learning, unlike developing product features like “the ability to click a button” or “sending data to the database”, is a cyclical process that involves a lot of iteration before we get something working we’re happy with, and the way to get to happiness is to pick an evaluatable baseline.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_45_resized.png" width="400">
&lt;/figure>

&lt;p>Machine learning-based systems are typically also services in the backend of web applications. They are integrated into production workflows. But, they process data much differently. In these systems, we don’t start with business logic. We start with input data that we use to build a model that will suggest the business logic for us.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_46_resized.png" width="400">
&lt;/figure>

&lt;p>This requires thinking about application development slightly differently - we need to be able to loop through a machine learning inference cycle quickly and examine the results: are they right? If yes we keep the model, if no we go back and change one thing and move on.&lt;/p>
&lt;p>This means there’s a lot of trial and error. Not many people realize it, but machine learning is more like alchemy than even software engineering. The top people in the field even said so, as Ali Rahmi did in his &lt;a href="https://www.science.org/content/article/ai-researchers-allege-machine-learning-alchemy">NeurIPS keynote in 2017. &lt;/a>&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_47_resized.png" width="400">
&lt;/figure>

&lt;p>This is doubly true for LLMs, where our main medium of input and output is freeform text. We have some model that we add text requirements to, and hope to get back logical text, or instructions, or code out. Because the inputs of natural language are so varied and so are the outputs, the process becomes three times harder to control and evaluate.&lt;/p>
&lt;p>What this means for focus is that the developer, like the monks, has to create a center of focus, and that center of focus is the first use-case for evaluation for your model. That means, if you’re a bank, you can’t evaluate a model on how well it writes poetry. It needs to be on how well a model creates top-level topics for all of your documents.&lt;/p>
&lt;p>This means creating both what we call offline and online evaluation metrics. Online evaluation metrics are those that can be assessed by people using your product or platform, or simply doing what we call a vibe check, and offline evaluation is more scientific and offers us a grounded reference against academic benchmarks to see if our model generalizes well. In order to have a well-functioning machine&lt;/p>
&lt;p>So the developers decided to look at two metrics:&lt;/p>
&lt;ol>
&lt;li>The vibe check&lt;/li>
&lt;li>The offline eval&lt;/li>
&lt;/ol>
&lt;p>The vibe check is as simple as: creating a list of documents and manually labeling topics for these documents, for a number of different documents. Here’s an example from the wikipedia page for Michelangelo.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_48_resized.png" width="400">
&lt;/figure>

&lt;p>We can see that we added some human topics based on simply scanning the text and our knowledge of categories, as experts in art.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_49_resized.png" width="400">
&lt;/figure>

&lt;p>The comparison is now to run it against our LLM and see what topics it generates. We use &lt;a href="https://github.com/Mozilla-Ocho/llamafile">llamafile locally&lt;/a>, with a local quantized model for quick iteration between prompts and responses without having to use external models.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_51_resized.png" width="400">
&lt;/figure>

&lt;p>We can now compare the results we generated ourselves and what the model generates.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_52_resized.png" width="400">
&lt;/figure>

&lt;p>We can see the results are ok, but maybe not as low-level as we&amp;rsquo;d like? For example, &amp;ldquo;Renaissance&amp;rdquo; doesn&amp;rsquo;t help us at all since all of our artists are from the Renaissance. And literature is not a category in art. So our next step would be to modify the prompt.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_53_resized.png" width="400">
&lt;/figure>

&lt;p>We can do this kind of prompt tuning manually for a bit, or use automatic prompt-tuning libraries, but then we might find that our model itself doesn&amp;rsquo;t have enough information about art, in which case we might want to fine-tune it with samples of text specifically related to Renaissance art, and try again.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_54_resized.png" width="400">
&lt;/figure>

&lt;p>Once we go through this manual cycle, we can also perform offline metric evaluation, which allows us to more systematically evaluate the model based on agreed-up academic benchmarks. Topic alignment is an imprecise science because it&amp;rsquo;s based on what humans think are good topics for categories, but &lt;a href="http://proceedings.mlr.press/v28/chuang13.pdf">we can use metrics&lt;/a> like cosine similarity to look at how well given topic pairs match.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_55_resized.png" width="400">
&lt;/figure>

&lt;p>Once the developers did this and tuned the model, the CEO was once again happy. But, now the team had a different problem, the curse of success. They needed to deploy this model to production.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_56_resized.png" width="400">
&lt;/figure>

&lt;p>Now, the team had a use-case for trending topic detection, and they had an evaluation metric: manual vibes and rescaled dot product for similarity between human-assigned and machine-assigned topics.&lt;/p>
&lt;p>They had problems, though. The systems they were building were big and complicated. They now involved the model, and something horrible called LLM ops that &lt;a href="https://towardsdatascience.com/llm-monitoring-and-observability-c28121e75c2f">now involved&lt;/a> updating, monitoring the system, storing prompts, monitoring model output, monitoring latency, security, package updates. They had built a product with an LLM producing summarization inputs, but now, whenever something broke, they didn’t know where to look.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_57_resized.png" width="400">
&lt;/figure>

&lt;h2 id="the-reproducible-example">The Reproducible Example&lt;/h2>
&lt;p>So, on the final day, the final developer shared a story. He said, “This story is about &lt;a href="https://en.wikipedia.org/wiki/Ellen_Ullman">Ellen Ullman&lt;/a>. She was a software engineer who worked on complex systems starting in the late 1970s including at startups and large companies, wrote a number of essays about the art of practicing computer science. Her latest book is called “Life in Code”, and in it, she describes the mind of a programmer as they are writing a program. And this ties together everything we’ve been talking about so far.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_58_resized.png" width="400">
&lt;/figure>

&lt;p>She writes that keeping track of translation between human and code logic is hard. “When you are writing code, your mind is full of details, millions of bits of knowledge. This knowledge is in human form, which is to say rather chaotic. For example, try to think of everything you know about something as simple as an invoice. Now try to tell an alien how to prepare one. That is programming&amp;hellip; To program is to translate between the chaos of human life and the line-by-line world of computer language.”&lt;/p>
&lt;p>She says that, in order to do this, a developer must not “lose attention. As the human world-knowledge tumbles about in your mind, you must keep typing, typing. You must not be interrupted.”&lt;/p>
&lt;p>In this kind of environment, you are information-constrained, but focused. It is harder to get through a mistake, but easier to find it and test possible solutions because the feedback cycle is fairly short. The reason this was the case was because, in earlier times, developers worked closer to the machine, closer to the metal, closer to the source of the computation, and even though there were interruptions and blockers, they were different.&lt;/p>
&lt;p>Most programs were written and compiled locally in languages like C. There were much fewer third-party libraries. Implementations were written from scratch, and the main source of information were books, man pages, and Usenet. The approximate user experience would be similar to writing python by only being able to access python.org.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_59_resized.png" width="400">
&lt;/figure>

&lt;p>Ullman describes this phenomenon in another part of the book where she describes working with a developer named Frank. Frank previously had worked as a hardcore technical contributor, but when he was working with Ullman, he had moved to financial reports, and he personally was miserable, and he also hated Ullman because she was “close to the metal.”&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_60_resized.png" width="400">
&lt;/figure>

&lt;p>These days, it is extremely hard to be close to the metal, because when we work with distributed systems, and machine learning, and the cloud, each of these have been built on top of the levels of turtles that previous developers have built, and it is easier to get distracted. When you throw non-deterministic LLMs and the distributed systems used to train and serve them into the mix, you come up with a special kind of hell that makes it impossible to have a good developer experience.&lt;/p>
&lt;p>What we can do is what people have always done: &lt;a href="https://reprex.tidyverse.org/">create reprex&lt;/a>. A reprex is a reproducible example, an idea that comes from the R community. In R, it’s fairly easy because it’s a self-contained piece of software, but we can also strive for the same thing.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_61_resized.png" width="400">
&lt;/figure>

&lt;p>Here’s an example of a reprex in Python, the RMSE code we just reviewed in the first section. We can know it will run the same way on our reviewer’s computer as ours, we can troubleshoot and check.&lt;/p>
&lt;div class="highlight">&lt;pre tabindex="0" style="color:#f8f8f2;background-color:#272822;-moz-tab-size:4;-o-tab-size:4;tab-size:4;">&lt;code class="language-python" data-lang="python">&lt;span style="display:flex;">&lt;span>&lt;span style="color:#66d9ef">def&lt;/span> &lt;span style="color:#a6e22e">root_mean_squared_error&lt;/span>(y_true:float, y_predicted:float) &lt;span style="color:#f92672">-&amp;gt;&lt;/span> float:
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> cost &lt;span style="color:#f92672">=&lt;/span> math&lt;span style="color:#f92672">.&lt;/span>sqrt(np&lt;span style="color:#f92672">.&lt;/span>sum((y_true&lt;span style="color:#f92672">-&lt;/span>y_predicted)&lt;span style="color:#f92672">**&lt;/span>&lt;span style="color:#ae81ff">2&lt;/span>) &lt;span style="color:#f92672">/&lt;/span> len(y_true))
&lt;/span>&lt;/span>&lt;span style="display:flex;">&lt;span> &lt;span style="color:#66d9ef">return&lt;/span> cost
&lt;/span>&lt;/span>&lt;/code>&lt;/pre>&lt;/div>&lt;p>Reprex comes in handy when you&amp;rsquo;re dealing with complex distributed systems. For example, here’s an actual problem the developers were dealing with while trying to serve their model: trying to troubleshooting Ray.&lt;/p>
&lt;p>Ray is a Python and C++-based distributed framework for training and serving machine learning models, used by companies training LLMs. Its predecessors include Hadoop and Spark. It also takes inspiration from other distributed libraries. The cool thing about Ray is that it is very easy to run locally, and spinning up a single instance is something that comes out of the box, meaning we are already closer to the metal.&lt;/p>
&lt;p>However, if you have an issue, it can take a while to get to the bottom of it because of how complex the architecture is. Ray has several &lt;a href="https://docs.ray.io/en/latest/ray-observability/key-concepts.html">orthogonal patterns&lt;/a> working together. First, there are tasks that you can execute on a remote cluster and actors, which are stateful tasks. This comes from the actor pattern in computer science which receives messages from its environment and can send messages back.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_63_resized.png" width="400">
&lt;/figure>

&lt;p>We also have the cluster-level communication patterns, with the global control store managing transactions and if you are running this on top of Kubernetes, also the Kubernetes primitives there is also the Task execution graph, the various modules: the dashboard, ray train, and Ray serve. The amount of patterns you have to understand in order to wrap your mind around it is truly astounding.&lt;/p>
&lt;p>But don&amp;rsquo;t forget that humans can only keep several things in memory when they trace through complexity!&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_65_resized.png" width="400">
&lt;/figure>

&lt;p>In building their trending topic application, the team had an issue where they were looking to test a topic detection pipeline of their model using Ray serve. Ray Serve takes your local deployable code and ships it to your Ray cluster using the Ray client, which is an API that connects a Python script to a remote Ray cluster.&lt;/p>
&lt;p>Ray Serve allows you the ability to serve a model with code that gets sent using bundled Ray actors known as deployments. A Deployment is served usually on top of Kubernetes, on top of Ray, within some sort of cloud cluster. It’s made up of several Ray Actors, which are stateful services run and managed by the Ray control plane. The Controller acts as the entrypoint for the deployment , tied to a proxy on the head node of a Ray cluster, and forwards it to replicas which serve a request with an instantiated model.&lt;/p>
&lt;p>In order to serve a deployment, you can use the pattern of specifying a &lt;a href="https://docs.ray.io/en/latest/serve/configure-serve-deployment.html">YAML-base config.&lt;/a>&lt;/p>
&lt;p>One critical piece here is the working dir, which specifies the where code to download to the cluster at runtime: This is part of how ray specifies the &lt;a href="https://docs.ray.io/en/latest/ray-core/handling-dependencies.html#runtime-environments">RuntimeEnv.&lt;/a>&lt;/p>
&lt;blockquote>
&lt;p>A copy of the working_dir will be downloaded to the cluster at runtime, and the current working directory of each remote Ray worker will be changed to that &lt;code>working_dir&lt;/code> copy&lt;/p>&lt;/blockquote>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_68_resized.png" width="400">
&lt;/figure>

&lt;p>In a production-grade deployment, it’s recommended that the working_dir comes from a served zip executable, which is usually pinned to a hash in GitHub. So the team had their config file set up like this:&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_69_resized.png" width="400">
&lt;/figure>

&lt;p>However, there are several points of failure in the YAML file, which looks to simplify but in fact hides complexity from us in the ways config is read and implemented in the code itself. But this model didn’t launch or run. Why? If you look at all the failure points, there are at least three or four that could trip you up. First, the way Ray handles import paths could be different behavior than the way we usually assume uvicorn routes them. Then, there is the question of the runtime environment: what happens when our deployable asset is hosted on github. Then, we have the issue of Python dependencies: when you’re working with fast-moving libraries like transformers and torch, it’s always guaranteed that you’ll have conflicts, sometimes even if you pin them to specific versions. Finally, there is the Ray deployment logic itself: what happens when we specify CPU and GPU options, and how do those work with our Kubernetes cluster?&lt;/p>
&lt;p>Whew. Time to start debugging.&lt;/p>
&lt;p>The way to start here is to start reading logs and stack traces. When you have a very large, distributed system, here’s a sample of the stacktrace you might get, and it can be very discouraging because it doesn’t point you to where the issue is.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_70_resized.png" width="400">
&lt;/figure>

&lt;p>There’s a lot going on here, and it seems to be coming from importlib rather than Ray itself, but Ray uses importlib as a dependency. So, in order to isolate it, the team went into the module in Ray that calls importlib, which lives in &lt;code>python/ray/_private/utils.py&lt;/code> .&lt;/p>
&lt;p>What the team then did was to &lt;a href="https://github.com/ray-project/ray/blob/683d7f378b913b943f7edf5cdca6548e807f33ed/python/ray/_private/utils.py#L1163">find the method&lt;/a> that opens the import path, and created a reprex for the issue, where they took that exact method calling importlib.import_module, put it into a notebook, and called the code directly.&lt;/p>
&lt;p>This directly didn’t work, and they proved it without using any external dependencies or Ray itself - they found the source of the issue. Which meant that, in their case, they couldn’t use the nested path to build their model deployment, but for simplicity’s sake, they could serve it from the root level. It turns out that the issue was in the import_attr, which meant that they had to include the config file and the script to launch the cluster in the same root-level of the repository.&lt;/p>
&lt;p>They weren’t done yet, though. When they tried to deploy this, they also got a 404 error when they tried to download the deployable from the Git hash. This is because they were trying to enable asset downloads, aka our Topic trend detection code, from a private GitHub repo onto the head node in the cluster for running the code. Initially, they tried to use a fine-grained PAT: This didn&amp;rsquo;t work, so they had to use a legacy GitHUB token. Along the way, they learned that Ray uses a library called &lt;code>smart_open&lt;/code> for streaming files from S3/GitHub, which helped them troubleshoot further issues.&lt;/p>
&lt;p>All of this is abstracted away from us by the YAML, quickly-changing documentation, and multiple parts of the system. To get from that to the piece of information you need, the team needed to ruthlessly strip out all areas of detail and focus on the problem, and, hopefully, get to a reproducible example. This is getting closer to the metal, this is attention.&lt;/p>
&lt;p>Finally, the team had a prototype running in production. With this issue resolved, the team moved on, and was successfully able to deploy the topic detection app, much to the delight of the CEO. Everyone got raises and lived happily ever after in their own castles in Florence.&lt;/p>
&lt;p>Until they started to develop more products.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_75_resized.png" width="400">
&lt;/figure>

&lt;h1 id="conclusion">Conclusion&lt;/h1>
&lt;p>The developers celebrated. They had launched something into production, managed to cut away from the hype of the LLM space, and focused on a use-case that was important to their company.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_76_resized.png" width="400">
&lt;/figure>

&lt;p>They now came together to do a retro on what they had accomplished, and why they were able to do so.
They started by picking a single use-case for generative AI in their application and evaluating the models available for that use-case. Then, they picked online and offline evaluation metrics to continue the evaluation loop. Finally, when it came down to building the system, they focused on the smallest composable parts possible instead of starting with the top, and broke down the code until they understood it.&lt;/p>
&lt;figure>&lt;img src="https://vickiboykis.com/images/dontworryllms_77_resized.png" width="400">
&lt;/figure>

&lt;p>All of this helped them concentrate, focus on what was most important and stay close to the metal.&lt;/p>
&lt;hr>
&lt;p>Thanks to Davide Eynard, Guenia Izquierdo, James Kirk, and Ravi Mody for patiently reading versions of this talk.&lt;/p>
&lt;p>Credits&lt;/p>
&lt;ul>
&lt;li>Icons: Freepik&lt;/li>
&lt;li>Bronzino, Portrait of Eleonora of Toledo, c. 1539&lt;/li>
&lt;li>The Decameron, Franz Xaver Winterhalter 1837&lt;/li>
&lt;li>View of Barcelona, by Anton van den Wyngaerde, 1563&lt;/li>
&lt;li>&lt;a href="https://arxiv.org/pdf/1805.04829.pdf">Gradient Descent&lt;/a>&lt;/li>
&lt;li>&lt;a href="https://poissonisfish.com/2023/04/11/gradient-descent/">Gradient Descent&lt;/a>&lt;/li>
&lt;/ul></description></item></channel></rss>