Frozen Caveman in an Ivory Tower: How Architecture Anti-Patterns Crashed an E-Commerce Thanksgiving Party
Once bitten twice shy? Broaden your gaze, lest you turn into the Frozen Caveman!
The Story
Deals Galore
It was a pre-covid Thanksgiving eve. Black Friday marketing was almost over. Deals had been plastered for months on almost everything that ran ads. People making up one of the largest consumer bases in the US were powdering their hands, lest a stuck scroll under their excitedly sweaty fingers diminish their chances of sniping a new Nintendo.
The time was approaching.
On the other side, in a giant room somewhere in the San Francisco Bay Area, sat huddled an army of software engineers, ready to remedy any anomaly that appeared in the metrics dashboards on the equally giant screen in front of them. The office hadn’t seen a conglomeration of VPs and Directors as big as this since last year’s sale. All hands were on deck and all eyes were out to sea.
The time had arrived.
The much-awaited deals were live. Many of the seated staff were scampering to clinch some of the deals themselves - using their employee discount cards to sweeten the already candied deals. Customers around the nation were glued to the listing pages.
Then it happened.
The “Crash”
At the anticipated sales peak, the sales chart was nowhere near the projected counts.
Everyone in the room panicked. What was happening? Were customers genuinely buying less? Or was there a systemic reason? Someone with eyes on the dashboard was quick to notice: a tall spike had towered up in the add-to-cart error metrics.
This was a mind-boggling anomaly. Why were customers getting errors on adding items to cart? How did so many comprehensive load tests miss this scenario? The answer at the time was elusive. No one wanted an explanation. Only a fix. The concerned teams performed rolling restarts of their servers. But it was too little, too late.
It was a severity-one customer-impacting issue. A high number of buyers were seeing the usual “something went wrong” or the much dreaded “out of stock” messages. Unable to add the juicy deals to their carts, many of them took to social media to vent their frustration - tagging the company’s social handles and anything else they could to bump up their posts. Media outlets were quick to pick up on these and portrayed the error spike as a humiliating crash on a big day.
People pointed out that the company wasn’t doing as much in their tech as they were in procurement & marketing. But that wasn’t true. The efforts, simply put, were misdirected.
The History
Frozen Caveman Anti-Pattern
Frozen Caveman Anti-Pattern describes an architect who always reverts back to their pet irrational concern for every architecture. …Generally, this anti-pattern manifests in architects who have been burned in the past by a poor decision or unexpected occurrence, making them particularly cautious in the future. - From Fundamentals of Software Architecture by Mark Richards & Neal Ford.
A few years before, during the same sale (Black Friday), the company had oversold thousands of units of a particularly popular item. The company had to pull all the strings it could to fulfill those orders - including a top-executive-level push on the manufacturer to increase production! It had been a PR nightmare. The cause had been a race condition that prevented the inventory count in the database from being reduced as soon as online checkouts succeeded. It had been a classic case of row lock contention in the database.
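To make the overselling failure mode concrete, here is a minimal sketch of one common way a count can fail to go down under concurrency: two checkouts read the same count and each writes back a value computed from its stale read. This is a generic illustration, not the company's actual code or fix; the table layout, the SKU and the function names are all assumptions made for the example, and SQLite merely stands in for whatever database was in use.

```python
import sqlite3

def make_db(units):
    # Hypothetical one-table schema, in memory, for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory (sku TEXT PRIMARY KEY, units INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO inventory VALUES ('console', ?)", (units,))
    conn.commit()
    return conn

def units_left(conn, sku):
    row = conn.execute(
        "SELECT units FROM inventory WHERE sku = ?", (sku,)
    ).fetchone()
    return row[0]

def sell_read_then_write(conn, sku, stale_units):
    # Racy pattern: the new count is derived from an earlier read, so two
    # checkouts that both saw the same count both "succeed".
    conn.execute(
        "UPDATE inventory SET units = ? WHERE sku = ?", (stale_units - 1, sku)
    )
    conn.commit()
    return True

def sell_atomic(conn, sku):
    # Conditional decrement in a single statement: the database applies it
    # only while stock remains, and rowcount tells us whether it did.
    cur = conn.execute(
        "UPDATE inventory SET units = units - 1 WHERE sku = ? AND units > 0",
        (sku,),
    )
    conn.commit()
    return cur.rowcount == 1

# Two checkouts interleave: both read before either writes.
racy = make_db(1)
seen_a = units_left(racy, "console")
seen_b = units_left(racy, "console")
racy_sales = [
    sell_read_then_write(racy, "console", seen_a),
    sell_read_then_write(racy, "console", seen_b),
]

safe = make_db(1)
safe_sales = [sell_atomic(safe, "console"), sell_atomic(safe, "console")]
```

With one unit in stock, the racy path records two sales, while the conditional update lets only the first checkout through.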
When a user adds an item to their cart, a unit is taken out of the inventory and a timer starts. This ensures that if, say, it is the last unit of the product on sale, no one else can add it to their cart while this user is in the process of making the purchase. It also ensures that the unit is returned to the inventory if the user does not purchase the item within the specified time.
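The timed hold described above can be sketched as a small in-memory model. Everything here is hypothetical: the class name, the method names, the 600-second hold and the injectable clock are assumptions chosen to keep the example self-contained, not details of the real system.

```python
import time

class HoldInventory:
    """Toy inventory with timed cart holds (illustrative only)."""

    def __init__(self, hold_seconds, clock=time.monotonic):
        self.hold_seconds = hold_seconds
        self.clock = clock      # injectable so the timer can be simulated
        self.available = {}     # sku -> units neither held nor sold
        self.holds = {}         # (user, sku) -> hold expiry timestamp

    def stock(self, sku, units):
        self.available[sku] = self.available.get(sku, 0) + units

    def release_expired(self):
        # Return units to the inventory for every hold whose timer ran out.
        now = self.clock()
        for (user, sku), expiry in list(self.holds.items()):
            if expiry <= now:
                del self.holds[(user, sku)]
                self.available[sku] += 1

    def add_to_cart(self, user, sku):
        self.release_expired()
        if (user, sku) in self.holds:
            return True         # already holding a unit; do not take another
        if self.available.get(sku, 0) < 1:
            return False        # the "out of stock" case
        self.available[sku] -= 1
        self.holds[(user, sku)] = self.clock() + self.hold_seconds
        return True

    def checkout(self, user, sku):
        # A purchase succeeds only while the buyer's hold is still alive.
        self.release_expired()
        return self.holds.pop((user, sku), None) is not None

# Simulated clock: a one-element list we can advance by hand.
now = [0.0]
inv = HoldInventory(hold_seconds=600, clock=lambda: now[0])
inv.stock("console", 1)
first = inv.add_to_cart("alice", "console")    # takes the last unit
blocked = inv.add_to_cart("bob", "console")    # nothing left to hold
now[0] = 601.0                                 # alice's timer runs out
retry = inv.add_to_cart("bob", "console")      # the unit is back
```

Here the second shopper is turned away while the hold is live and gets the unit once the timer lapses, which is the behavior the paragraph describes.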
Getting back to the point. The changes proposed in the aftermath were implemented shortly afterwards. a) The inventory count was moved from the database to a distributed, cached configuration management service. While this may not have had a strong case for atomicity in theory, faster I/O trumped that concern in practice. It meant that the database and the cache had to be kept in sync, but this was fine as long as the site had a single source of truth and made all updates through the cache. b) The code talking to the database was refactored to handle locks more gracefully.
The architects concerned had been pulled up by management for the goof-up. But they had been able to continue in their roles, owing to past laurels. The same architects now led the effort to convert the archaic monolithic architecture into a more granularly scalable set of services. They were also in charge of preparations for holiday elasticity. But the incident had pushed “avoiding lock contentions” to the top of the architects’ priorities. Domain-level characteristics were of course considered, albeit not as a priority.
The load tests leading up to the holiday season focused on the accuracy of inventory counts under stressful conditions. Numerous concurrent processes were spawned to write and read the inventory counts. User behavior patterns during the anticipated high-load moments were simulated, and a lot of preparation went into how the counts tallied up at various checkpoints. There was no dearth of seriousness or commitment to ensuring a smooth peak. Yet, the Frozen Caveman had manifested and thrived.
Ivory Tower Anti-Pattern
A software architect who is in an ivory tower refers to one who, either by how they approach their position or because of how an organization works, is isolated from others. …In addition, they may not be working closely with the developers who will be creating implementations based on the architecture. - From Fundamentals of Software Architecture by Mark Richards & Neal Ford.
The company had been able to evolve into having a pretty good work culture. But the baggage of the past is a sore spot for all transformational initiatives. The ivory tower was no longer visible from the outside, but its remnants still lingered in the architects’ collaborative behavior. The isolation wasn’t only on the development side - it could also be seen in how the company ran its offices. Tech and business had been segregated physically. Collaboration happened more through documents and sheets than through conversation. The condescending tone of the tech gurus was perhaps a deterrent for the business executives to push for solid discussions on new features. This meant that covering the bases for all aspects of those features was left to the few people driving the discussions at a technical level. We’ll see how this caused “the crash”, as we understand what really went wrong on that eventful day.
What Really Happened?
The Change
The difference between all previous years and this one was that users had been allowed to add the sale items to their carts well before the actual sale day. This meant that users were able to add products that in fact did not have any inventory set up.
This had been a brilliant strategic move in the eyes of many on the business side of things. It allowed for more accurate sales projections and directly drove inventory procurement. On the other hand, the ramifications on the tech side of things had not been thoroughly studied.
The reason: the change did not call for elasticity, since users organically added these products to their carts over the course of a few weeks. The services supporting the add-to-cart operation did not require scaling - at least not before the actual sale day.
But scaling wasn’t the only concern here.
The Oversight
There was a simple logical error born out of not having the right discussion about this change. Let’s see how it unfurled in the face of the heat.
As soon as the deals went live - which really meant that the inventory for the sale items was updated - the site-level inventory was instantly reduced by the total number of those items sitting in users’ carts. This happened because the timed hold kicked in only when the inventory was updated and the products became available on the site. A large number of the users whose carts held the sale items weren’t even logged in at the time. They had simply added the items to their carts and forgotten about them. It was a glaring edge-case oversight that surfaced at the most crucial moment in the US e-commerce calendar.
In reality, there is no Inventory Manager; the inventory update is an event-based trigger.
The inventory of hot items was instantly depleted as a result of people’s activity over the preceding weeks. This blocked the customers who were now online from adding the deal items to their carts. Of course, the inventory count tallied up with the actual sales once the holds on those in-cart items were automatically released. But the moment had passed. The company eventually made all the sales over the next few hours and days, but at the cost of dissatisfied customers and a peak-sales performance well below what had been projected.
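The sequence of events above can be replayed with a few lines of arithmetic. This is a simplified sketch under stated assumptions: the function names are hypothetical, the figures (1,000 units, 5,000 pre-sale carts, 900 abandoned holds) are invented for illustration, and each cart is assumed to hold a single unit.

```python
def go_live(stock_units, presale_carts):
    # The inventory update fires the timed hold for every cart that already
    # contains the item, whether or not its owner is online.
    held = min(stock_units, presale_carts)
    return {"available": stock_units - held, "held": held}

def add_to_cart(state):
    # A live customer can add the item only if an unheld unit exists.
    if state["available"] < 1:
        return False
    state["available"] -= 1
    state["held"] += 1
    return True

def expire_holds(state, abandoned):
    # Holds on forgotten carts lapse and their units return to stock.
    released = min(abandoned, state["held"])
    state["held"] -= released
    state["available"] += released
    return released

state = go_live(stock_units=1000, presale_carts=5000)
at_peak = add_to_cart(state)          # live customer at the sales peak
released = expire_holds(state, 900)   # timers lapse on abandoned carts
after_expiry = add_to_cart(state)     # same customer, some time later
```

In this run the stock is fully held the instant it goes live, so the customer at the peak is refused, and the same request succeeds only after the abandoned holds have expired.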
Needless to say, a few heads did roll this time around!
This article may or may not have been a work of fiction!