Introduction to case studies
This book’s Introduction started with a commitment to grounding its approach in concrete case studies. In this section, we’re living up to that commitment by presenting ten real-world strategies I’ve directly worked on or observed. These strategies take the somewhat abstract concepts we’ve covered thus far and materialize them into concrete ideas, hopefully making them easier to grasp and easier for you to apply.
The first five strategies are selected to show a varied mix of refinement techniques and operational mechanisms. The next five strategies are organized by the companies in which they were implemented. If you work through these case studies and find yourself wanting more, the Strategy Resources Appendix includes suggestions for further study.
Uber's service migration strategy circa 2014.
In early 2014, I joined as an engineering manager for Uber’s Infrastructure team. We were responsible for a wide range of things, including provisioning new services. While the overall team I led grew significantly over time, the subset working on service provisioning never grew beyond four engineers.
Those four engineers successfully migrated 1,000+ services onto a new, future-proofed service platform. More importantly, they did it while absorbing the majority, although certainly not the entirety, of the migration workload onto that small team rather than spreading it across the 2,000+ engineers working at Uber at the time. Their strategy serves as an interesting case study of how a team can drive strategy, even without an executive sponsor, by focusing on solving a pressing user problem and providing effective ergonomics while doing so.
Service onboarding model for Uber (2014).
At the core of Uber’s service migration strategy (2014) is understanding the service onboarding process, and identifying the levers to speed up that process. Here we’ll develop a system model representing that onboarding process, and exercise the model to test a number of hypotheses about how to best speed up provisioning.
In this chapter, we’ll cover:
- Where the model of service onboarding suggested we focus our efforts
- Developing a system model using the lethain/systems package on GitHub. That model is available in the lethain/eng-strategy-models repository
- Exercising that model to learn from it
Let’s figure out what this model can teach us.
Wardley mapping the service orchestration ecosystem (2014).
In Uber’s 2014 service migration strategy, we explore how to navigate the move from a Python monolith to a services-oriented architecture while also scaling with user traffic that doubled every six months.
This Wardley map explores how orchestration frameworks were evolving during that period, and serves as an input into determining the most effective path forward for Uber’s Infrastructure Engineering team.
Reading this map
To quickly understand this Wardley map, read from top to bottom. If you want to review how this map was written, then you should read section by section from the bottom up, starting with Users, then Value Chains, and so on.
How should you adopt LLMs?
Whether you’re a product engineer, a product manager, or an engineering executive, you’ve probably been pushed to consider using Large Language Models (LLMs) to extend your product or enhance your processes. 2023-2024 is an interesting era for LLM adoption, where these capabilities have transitioned into the mainstream, with many companies worrying that they’re falling behind despite the fact that most integrations appear superficial.
That context makes LLM adoption a great topic for a strategy case study. This document is an engineering strategy document determining how a hypothetical company, Theoretical Ride Sharing, could adopt LLMs.
Modeling impact of LLMs on Developer Experience.
In How should you adopt LLMs?, we considered how LLMs might impact a company’s developer experience. To support that exploration, I’ve developed a system model of the software development process at the company.
In this chapter, we’ll work through:
- Summary results from this model
- How the model was developed, both sketching and building the model in a spreadsheet. (As discussed in the overview of systems modeling, I generally would recommend against using spreadsheets to develop most models, but it’s educational to attempt doing so once or twice.)
- Exercising the model to see what it has to teach us
Let’s get into it.
Wardley mapping the LLM ecosystem.
In How should you adopt LLMs?, we explore how a hypothetical ride sharing company, Theoretical Ride Sharing, should adopt Large Language Models (LLMs). Part of that strategy’s diagnosis depends on understanding the expected evolution of the LLM ecosystem, so we’ve built a Wardley map to explore it.
This map of the LLM space focuses on how product companies should address the proliferation of model providers such as Anthropic, Google and OpenAI, as well as the proliferation of LLM product patterns like agentic workflows, Retrieval Augmented Generation (RAG), and running evals to maintain performance as models change.
Modeling driver onboarding.
The How should you adopt LLMs? strategy explores how Theoretical Ride Sharing might adopt LLMs. It builds on several models, the first of which is about LLMs’ impact on Developer Experience. The second model, documented here, looks at whether LLMs might improve a core product and business problem: maximizing active drivers on their ride sharing platform.
In this chapter, we’ll cover:
- Where the model of ridesharing drivers identifies opportunities for LLMs
- How the model was sketched and developed using the lethain/systems package on GitHub
- Exercising this model to learn from it
Let’s get started.
Navigating Private Equity ownership.
In 2020, you could credibly argue that ZIRP explains the world, but that’s an impossible argument to make in 2024, when zero interest-rate policy is only a fond memory. Instead, we’re seeing a number of companies designed for rapid expansion learning to adapt to a world that expects immediate free cash flow rather than accepting the sweet promise of discounted future cash flow.
This chapter aims to tackle that problem head-on, taking the role of an engineering organization attempting to navigate new ownership by a private equity group. It’s an increasingly frequent scenario: after many years of learning to operate under the direction of its original founders, and the brief excitement of going public, now there’s a short runway to change operating models.
Eng org seniority-mix model.
One of the trademarks of private equity ownership is the expectation that either the company maintains its current margin and grows revenue at 25-30%, or it grows more slowly and increases its free cash flow year over year. In many organizations, engineering costs have a major impact on free cash flow. There are many costs to reduce, such as cloud hosting, but inevitably part of the discussion is addressing engineering headcount costs directly.
How should we control access to user data?
At some point in a startup’s lifecycle, they decide that they need to be ready to go public in 18 months, and a flurry of IPO-readiness activity kicks off. This strategy focuses on a company working on IPO readiness, which has identified a gap in internal controls for managing user data access. It’s a company that wants to meaningfully improve their security posture around user data access, but which has had a number of failed security initiatives over the years.
Should we decompose our monolith?
Since their first introduction in 2005, the choice between adopting a microservices architecture, a monolithic service architecture, or a hybrid of the two has become one of the least-reversible decisions that most engineering organizations make. Even migrating to a different database technology is generally a less expensive change than moving from monolith to microservices or from microservices to monolith.
The industry has in many ways gone full circle on that debate, from most hyperscalers in the 2010s partaking in a multi-year monolith to microservices migration, to Kelsey Hightower’s iconic tweet on the perils of distributed monoliths:
"We're a product engineering company!" — Engineering strategy at Calm.
In my career, the majority of the strategy work I’ve done has been in non-executive roles, things like Uber’s service migration. Joining Calm was my first executive role, where I was able to not only propose but also mandate strategy.
As at almost all startups, the engineering team was scattered when I joined. Was our most important work creating more scalable infrastructure? Was our greatest risk the failure to adopt leading programming languages? How did we rescue the stuck service decomposition initiative?
How to resource Engineering-driven projects at Calm? (2020)
One of the recurring challenges in any organization is how to split your attention across long-term and short-term problems. Your software might be struggling to scale with ramping user load while you also know that you have a series of meaningful security vulnerabilities that need to be closed sooner rather than later. How do you balance across them?
These sorts of balance questions occur at every level of an organization. A particularly frequent format is the debate between Product and Engineering about how much time goes towards developing new functionality versus improving what’s already been implemented. In 2020, Calm was growing rapidly as we navigated the COVID-19 pandemic, and the team was struggling to make improvements, as they felt saturated by incoming new requests. This strategy for resourcing Engineering-driven projects was our attempt to solve that problem.
How should Stripe deprecate APIs? (~2016)
While Stripe is a widely admired company for things like its creation of the Sorbet typer project, I personally think that Stripe’s most interesting strategy work is also among its most subtle: its willingness to significantly prioritize API stability.
This strategy is almost invisible externally. Internally, discussions around it were frequent and detailed, but mostly confined to dedicated API design conversations. API stability isn’t just a technical design quirk; it’s a foundational decision in an API-driven business, and I believe it is one of the unsung heroes of Stripe’s business success.
Systems model of API deprecation
In How should Stripe deprecate APIs?, the diagnosis depends on the claim that deprecating APIs is a significant cause of customer churn. While there is internal data that can be used to correlate deprecation with churn, it’s also valuable to build a model to help us decide if we believe that correlation and causation are aligned in this case.
In this chapter, we’ll cover:
- What we learn from modeling API deprecation’s impact on user retention
- Developing a system model using the lethain/systems package on GitHub. That model is available in the lethain/eng-strategy-models repository
- Exercising that model to learn from it
Time to investigate whether it’s reasonable to believe that API deprecation is a major influence on user retention and churn.
Why did Stripe build Sorbet? (~2017)
Many hypergrowth companies of the 2010s battled increasing complexity in their codebase by decomposing their monoliths. Stripe was somewhat of an exception, largely delaying decomposition until it had grown beyond three thousand engineers and had accumulated a decade of development in its core Ruby monolith. Even now, significant portions of their product are maintained in the monolithic repository, and it’s safe to say this was only possible because of Sorbet’s impact.
How to integrate Stripe's acquisition of Index? (2018)
Discussions around acquisitions often focus on technical diligence and deciding whether to make the acquisition. However, the integration that follows can be even more complex. There are few irreversible trapdoor decisions in engineering, but decisions made early in an integration tend to be surprisingly durable.
This engineering strategy explores Stripe’s approach to integrating their 2018 acquisition of Index. While a business book would focus on the rationale for the acquisition itself, here that rationale is merely part of the diagnosis that defines the integration tradeoffs. The integration itself is the area of focus.