How is Twitter Still Working?

I recently came across a LinkedIn article by Casey Rosenthal that is worth your time: Decline and fall of a billionaire’s Twitter. It tries to answer an interesting question: How has Twitter not crashed yet? How is it still running, despite losing over 75% of its staff, and after hitting (per its new owner) an all-time high in site visits? Remember: some people had predicted it would crash within a week of the firings.

Rosenthal has some expertise in this domain: he used to lead the Chaos Engineering team at Netflix, now runs a company specializing in complex system resiliency, and has authored a book and several IEEE papers on the subject.

A few key takeaways for me from this article:

Musk no longer understands modern software development.

“He has made several non-sensical claims of a technical nature. (Ex: There are the same number of RPC calls as microservices to render the home timeline; latency was reduced by 400ms; serializing trips will improve speed; etc.)”

Taken together with other ridiculous ideas (e.g. asking Twitter developers to print their most pertinent code for his review - source), it’s pretty clear that he’s woefully out of date. Of course, he doesn’t think so himself, which is potentially worse because it can lead to bad decisions. As Charlie Munger pointed out, it’s better to have a manager with a 130 IQ who thinks it’s 120 than someone with a 150 IQ who thinks it’s 170.

Microservices architectures led to this (positive) outcome.

Contrary to Musk’s implication, having a small critical subset of microservices is a feature of that architecture and a positive sign for Twitter’s robustness. It doesn’t mean that the other 80% of microservices are worthless. The non-critical microservices provide room for engineers to offer features, enhancements, integrations, bug-fixes, customizations, and all sorts of other nice-to-haves. That’s valuable. And in the face of an incident, it’s extremely helpful to know which microservices you can ignore or shut down to save the critical ones.

Maybe the implication is that Musk misunderstood what a tiered microservice architecture is (Tier 1 being the most critical):

Thousands of microservices communicating with each other enables the designation of criticality of some versus others. At Netflix we called this tiering. This is a feature. It allows teams with different priorities to optimize for different business goals simultaneously with very little coordination.
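To make the tiering idea concrete, here is a minimal Python sketch. The service names, tier numbers, and the services_to_shed helper are all invented for illustration; nothing here reflects how Twitter or Netflix actually label their services. It only shows how a tier label lets an on-call engineer answer "what can I turn off right now?" mechanically.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str  # hypothetical service name, made up for this example
    tier: int  # 1 = most critical; higher numbers are less critical


def services_to_shed(services, max_tier_to_keep):
    """Return the names of services that may be shut down in an incident.

    Anything in a tier above max_tier_to_keep is treated as non-critical.
    """
    return sorted(s.name for s in services if s.tier > max_tier_to_keep)


# An invented fleet: two Tier 1 services and three nice-to-haves.
fleet = [
    Service("timeline", 1),
    Service("tweet-store", 1),
    Service("search", 2),
    Service("trends", 3),
    Service("who-to-follow", 3),
]

# During a severe incident, keep only Tier 1 running.
print(services_to_shed(fleet, max_tier_to_keep=1))
# prints ['search', 'trends', 'who-to-follow']
```

The point of the sketch is that the decision needs no coordination between teams at incident time: the criticality was agreed on in advance and is encoded in the tier.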

The (ex) Twitter team built one heck of a good infrastructure.

Which begins to explain why Twitter is still running. A microservice architecture done well is very robust. Each boundary between services provides an opportunity to catch errors, retry requests, fall back to another service, circuit break, or do any of a dozen other things that maintain a good customer experience in the face of degradation. […] Twitter can lose a significant portion of its staff, and the remaining staff can continue to operate the system by navigating the complexity as they have always done, without ever having a complete mental model of how the entire system works.
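As a rough illustration of what happens at one of those service boundaries, here is a small Python sketch that combines a retry, a fallback, and a circuit breaker. It is a toy under stated assumptions, not Twitter's implementation: the CircuitBreaker class, its parameters, and the two recommendation functions are hypothetical, and production systems typically rely on a hardened library for this.

```python
import time


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has passed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # failures tolerated before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so it can be tested
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, primary, fallback, retries=1):
        # While the circuit is open, skip the primary and degrade gracefully.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            # Cool-down is over: close the circuit and allow a trial call.
            self.opened_at = None
            self.failures = 0

        # Try the primary once, plus up to `retries` more attempts.
        for _ in range(retries + 1):
            try:
                result = primary()
            except Exception:  # broad on purpose; this is only a sketch
                self.failures += 1
                if self.failures >= self.max_failures:
                    self.opened_at = self.clock()  # open the circuit
                    break
            else:
                self.failures = 0
                return result
        return fallback()


# Hypothetical non-critical dependency that is currently down.
def live_recommendations():
    raise TimeoutError("recommendation service is down")


# Hypothetical fallback: a stale but acceptable answer.
def cached_recommendations():
    return ["popular-item-1", "popular-item-2"]


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
print(breaker.call(live_recommendations, cached_recommendations))
# prints ['popular-item-1', 'popular-item-2']
```

The user still gets a page, just with slightly worse recommendations, and once the circuit is open the struggling service is no longer hit with requests while it recovers. None of that requires a human to be awake.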

In other words, those engineers built a system that could survive without them. With high levels of automation, redundancy, and failsafes in place, the system was incredibly robust. Humans didn’t need to be part of the machinery that kept the site ticking; instead, they built fantastic mechanisms to ensure that it kept ticking automatically like clockwork.

The site will fail, just not in the way some people pictured it.

Do I think Twitter will crash and burn in a molten heap? No. I think it is much more likely that Twitter will decline in two specific ways. First: it seems Musk is driving the remaining engineering team to optimize for efficiency, specifically runtime performance. Efficiency is brittle. The more efficient the code, the less room the sociotechnical system will have for improvisation under unforeseen conditions. As improvisation becomes more frictious, Twitter will experience longer, more frequent, and wider-reaching degradations in service. As a consequence, this will put a noxious burden on the remaining on-call and operational staff, further degrading morale within the company. Second: Musk has no grasp of UX. He has mistaken his personal philosophy for a design virtue, and that’s just not how Twitter developed as a product. As the years of work that went into content moderation and UX are neglected, the polish of the tool as a medium of communication will suffer. Arguably Twitter could have done much more to make content moderation in particular better prior to the Musk takeover. Now I predict it will get much worse.

This is a critical insight. Twitter the product was a lot more than its IT infrastructure. Hundreds of thousands of decisions around UX, content moderation, and the “right” thing to do are what made it a vibrant town square. Yes, they didn’t do enough in terms of safety and moderation, which left the platform vulnerable in the first place. Post-acquisition, you could argue that Musk doesn’t value content moderation, because he personally doesn’t want to be moderated. His echo chamber didn’t want to be moderated or banned either, which I think is pretty much why he bought Twitter in the first place. They didn’t want Twitter per se; they just wanted Gab, but with Twitter’s infrastructure and reach. And that’s what he bought.

Note: other interesting reading on this:

(LinkedIn, Nov 10 2022) Twitter is Falling Apart at the Seams
(YouTube) Betrayal & Greed: How 4 CEOs and 5 Billion Failed Twitter