At Glance we build a feed of personalized content for users, not unlike Reels or TikTok. We have a prediction service, a REST endpoint, that serves recommendations to users. In this particular case, this one endpoint was handling around 12,000 requests per second. On a typical day this endpoint would have a latency well under 100ms. It had an auto-scaling policy enabled to account for the cyclical nature of the traffic pattern.
Today was not a typical day.
The first two messages on Teams were from
Clicking on the link to the message led to the monitoring dashboard for this particular endpoint.
Not good. ~100% error rate, at over 25k RPS.
Increasing the time window tells us exactly when the failure started:
Sometime just before six the traffic increased steeply, and almost immediately the errors went to 100%. The drastic increase in traffic seems to be the culprit. But the outage carried on, even when traffic fell back to levels similar to the previous day: notice the RPS around 7 AM is not dissimilar to the traffic around 6 PM the previous night. 0% errors then, 100% errors now? And from the previous graph, all of them were HTTP 503: Service Unavailable.
Why were the servers unavailable, when they were available yesterday…
All this while the latencies are through the roof.
So the servers are not available. The first graph under Resource Utilization is this:
Interesting. We need 21 nodes to service the current traffic but are only able to maintain 5-7. We are not able to bring up the servers. Increasing the time window we see that yes, indeed, around 0600, when traffic spiked up swiftly and the errors shot up to 100%, the resource starvation started. Notice the two lines almost overlapping before, but diverging during the outage:
The configuration is set up to go as high as 80 nodes (technically 100 if you are keen-eyed), but something is preventing it from even hitting 20 nodes.
The CPU usage graph is… a mess. Looks like the nodes are getting killed constantly. And they are getting killed around 350%, when they could theoretically go up to 400% (800%, less the autoscaling threshold, set to 50%). You see, we have deployed this on n1-highcpu-8 nodes, and clearly it has hit 400% before without a hitch.
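The headroom math, for the record. This is a sketch under the assumption that the dashboard reports CPU in per-core percentage units (so 8 vCPUs = 800%) and that the autoscaler uses a utilization target, HPA-style:

```python
# CPU headroom on an n1-highcpu-8 in the dashboard's per-core percentage
# units (8 vCPUs = 800%). Assumes a 50% utilization target for scale-out.
vcpus = 8
node_capacity_pct = vcpus * 100                # 800%
autoscaling_threshold = 0.50                   # scale out at 50% utilization
scale_out_point_pct = node_capacity_pct * autoscaling_threshold

print(scale_out_point_pct)  # 400.0 -- nodes should cruise near this, not die at 350%
```

So with scaling working properly, nodes should hover around 400% while new ones come up. Dying at 350% means something other than CPU is killing them.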
The memory usage paints a similar picture:
Hey, that is a very sharp cutoff in memory! Almost as if the nodes are hitting some memory limit and getting killed.
n1-highcpu-8, from memory, should have 8GB of memory. A quick lookup told me I was wrong; it is somewhere close to 7GB. Which is still above the seeming cutoff at 5.3GB, so what gives?
We do not have access to all the memory of the node!
Because our systems run within Docker containers. The OS, Docker daemons and other processes would of course need some memory for themselves. Fair.
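How much memory do workloads actually get? Assuming this is a GKE-style managed cluster (my assumption; the n1 machine types suggest GCP), Google documents a per-node memory reservation formula, which we can sketch:

```python
# Sketch of GKE's documented memory reservations (assumption: this cluster
# is GKE; percentages are from Google's node-allocatable documentation).
def allocatable_gb(machine_gb: float) -> float:
    reserved = 0.25 * min(machine_gb, 4)                 # 25% of the first 4 GB
    reserved += 0.20 * max(min(machine_gb, 8) - 4, 0)    # 20% of the next 4 GB
    reserved += 0.10 * max(min(machine_gb, 16) - 8, 0)   # 10% of the next 8 GB
    # (higher tiers omitted; they don't apply to a ~7 GB node)
    eviction_threshold = 0.1                             # ~100 MiB hard-eviction buffer
    return machine_gb - reserved - eviction_threshold

print(round(allocatable_gb(7.2), 2))  # 5.46 -- right around that 5.3GB ceiling
```

A ~7GB node ends up with roughly 5.4GB allocatable, which lines up suspiciously well with the sharp cutoff in the graph.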
But but but, 5.3GB of memory during the outage, and a near-constant 3GB usage the previous day. Huh?
To recap, we know the following:
If you noticed, all the graphs start around 5 PM the previous day. That is because this model was scaled up to 100% of traffic then, and the node configuration was decided based on that peak traffic. As the memory utilization was consistently under 5GB, we moved from n1-standard-8 nodes, which have 30GB of memory, to n1-highcpu-8 nodes with 7GB. For the cost savings.
So a simple redeployment would have sufficed for the time being: a slow canary deployment to move traffic to new nodes. But to account for the 0530 spike we decided to overprovision the buffers slightly, and go to
Tomorrow when the traffic rises, the minimum of 6 nodes should be able to handle the bump until new nodes spin up.
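Rough capacity math for that minimum, using the earlier graph's numbers (~25k RPS wanted ~21 nodes) as the per-node estimate:

```python
# Per-node throughput estimated from the outage graph: ~25k RPS needed ~21 nodes.
per_node_rps = 25_000 / 21        # ≈ 1,190 RPS a node can absorb
min_nodes = 6
absorbable_rps = min_nodes * per_node_rps

print(round(per_node_rps), round(absorbable_rps))  # 1190 7143
```

Around 7k RPS of standing capacity should comfortably cover the early-morning baseline while fresh nodes come online.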
Wrapping up with some happy-looking graphs. No errors at 30k RPS, sub-200ms latency.
Look ma, I didn’t even touch the logs!