Not to pile on the recent Cloudflare outage, but I want to talk about it from a testing perspective, with respect to a couple of different aspects.
We are seeing an increase in cloud outages lately, affecting all the major infrastructure service providers — Amazon, Microsoft, Google, and now Cloudflare. Cloudflare is somewhat unique in that it is a relatively small company (compared to the others) with an outsized impact on internet infrastructure.
There are probably several factors contributing to the recent spike in outages, with our increasing reliance on these providers being the biggest. Outages happened before too; you just didn’t notice them as much, and the increasingly digitized, interconnected, and subscription-based model of services makes them more noticeable. But there are also two major factors that I think are underpinning the problem: increased use of AI and the outsourcing of core technical responsibilities.
The two are related, but what it boils down to is a lack of responsibility and an unwillingness to prepare for (and invest in) contingencies. And that’s where it ties to testing.
You’ve got to give Cloudflare props for their transparency and the amount of detail they are willing to share in their postmortem, which will no doubt lead to improved engineering at the company.
The Cloudflare issue was specifically related to a rewrite and replacement of a piece of their core infrastructure (memory pre-allocation), which may have been needed to increase the performance and scale of their services, but which was not properly tested.
This type of thing is notoriously difficult to test, because it commonly requires infrastructure at scale, and there are a lot of moving parts. You can’t find infrastructure scaling issues by using Playwright or Selenium to click buttons on a website, which is, unfortunately, where too much QA testing effort goes.
But this was something that could have been anticipated with boundary testing. And a clear test strategy is described in their postmortem.
The bot management feature configuration file is updated frequently, processed, and rules are inserted into the database to prevent malicious bots. There had been a fixed number of features (rules), and a bug in the configuration processing caused an increasing number of rows to be inserted.
So this couldn’t have been caught in a unit test, but it could easily have been tested on small-scale infrastructure that did the following:
- Generate the feature config
- Process it
- Check the database
The exact scenario (test data) that caused it to pre-allocate and overwhelm their proxy is not described, but it appears to be a typical memory-allocation overrun defect, which is exactly what edge-case boundary checking is good at finding.
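Even if the full generate-process-check flow needs a small integration environment, the limit itself can be probed at the component level. Here is a minimal sketch in Rust of what such a boundary test might look like; the names (`load_features`, `LoadError`), the fixed limit of 200 features, and the error handling are hypothetical stand-ins, not Cloudflare’s actual code:

```rust
// Hypothetical loader with a fixed pre-allocated feature limit. The names
// and the limit are illustrative, not Cloudflare's actual implementation.
const MAX_FEATURES: usize = 200;

#[derive(Debug, PartialEq)]
enum LoadError {
    TooManyFeatures { got: usize, max: usize },
}

fn load_features(rows: &[String]) -> Result<Vec<String>, LoadError> {
    if rows.len() > MAX_FEATURES {
        // Fail loudly but recoverably; never panic in the request path.
        return Err(LoadError::TooManyFeatures { got: rows.len(), max: MAX_FEATURES });
    }
    let mut features = Vec::with_capacity(MAX_FEATURES); // fixed pre-allocation
    features.extend_from_slice(rows);
    Ok(features)
}

#[cfg(test)]
mod boundary_tests {
    use super::*;

    fn config_with(n: usize) -> Vec<String> {
        (0..n).map(|i| format!("feature_{i}")).collect()
    }

    #[test]
    fn at_the_limit_loads() {
        assert!(load_features(&config_with(MAX_FEATURES)).is_ok());
    }

    #[test]
    fn one_over_the_limit_is_rejected_without_panicking() {
        let err = load_features(&config_with(MAX_FEATURES + 1)).unwrap_err();
        assert_eq!(
            err,
            LoadError::TooManyFeatures { got: MAX_FEATURES + 1, max: MAX_FEATURES }
        );
    }

    #[test]
    fn oversized_config_is_rejected() {
        // The kind of unexpectedly large config the processing bug produced.
        assert!(load_features(&config_with(MAX_FEATURES * 10)).is_err());
    }
}
```

Tests like these run in milliseconds on a laptop; you don’t need at-scale infrastructure to exercise the limit itself, only to confirm how the surrounding system behaves when it’s hit.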
People have argued about whether Rust (a theoretically memory-safe language) was at fault, or rather, whether the fault lies with the assumption that because Rust guards against buffer overruns you don’t have to worry about memory allocation. That might be true at the micro scale, but memory allocation problems (and leaks) can happen at the macro scale too (see Java).
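To illustrate the distinction, here is a small, entirely hypothetical Rust snippet. The borrow checker guarantees there is no buffer overrun in either function, but it says nothing about a collection growing without bound, or about an `unwrap()` that turns an oversized config into a process-killing panic:

```rust
// Memory-safe is not the same as allocation-safe: everything below compiles
// cleanly. (Illustrative only; not Cloudflare's code.)
const MAX_FEATURES: usize = 200; // hypothetical fixed limit

fn unbounded_growth(rows: impl Iterator<Item = String>) -> Vec<String> {
    // No buffer overrun is possible here, but nothing stops this Vec from
    // growing until the process runs out of memory if `rows` is unexpectedly huge.
    rows.collect()
}

fn check_limit(n: usize) -> Result<usize, String> {
    if n <= MAX_FEATURES {
        Ok(n)
    } else {
        Err(format!("{n} features exceeds the limit of {MAX_FEATURES}"))
    }
}

fn load_or_die(raw: &str) -> usize {
    // The limit check returns a Result, but unwrap() converts "config too big"
    // into a panic that takes the whole service down with it.
    check_limit(raw.lines().count()).unwrap()
}

fn main() {
    let rows: Vec<String> = (0..10).map(|i| format!("feature_{i}")).collect();
    println!("loaded {} features", unbounded_growth(rows.clone().into_iter()).len());
    println!("within limit: {}", load_or_die(&rows.join("\n")));
}
```

Rust prevents an entire class of memory-corruption bugs, but resource exhaustion and panics on unexpected input are design and testing concerns in any language.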
But the point I want to make is that this highlights:
- The need for resiliency, failover, recovery, and monitoring as part of your system architecture (see the sketch after this list).
- The need for testing at this higher level. (Or lower level, if you think of infrastructure as the bottom of the stack; what I really mean is “broader” as opposed to “narrower” testing.)
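As a minimal sketch of the resiliency point, assume a hypothetical ConfigCache fed by a periodic refresh job: a rejected or failed config update gets logged and alerted on, while the data path keeps serving the last-known-good configuration. None of this is Cloudflare’s actual design; it just illustrates failover, recovery, and monitoring living in the architecture rather than only in a runbook:

```rust
use std::time::{Duration, Instant};

// Hypothetical resiliency pattern: keep serving the last-known-good config
// when a refresh fails, and surface the failure to monitoring instead of
// crashing the data path. Names and types are illustrative only.
struct ConfigCache {
    current: Vec<String>,  // last successfully validated feature set
    last_refresh: Instant, // when that set was accepted
    stale_after: Duration, // monitoring threshold for "config is too old"
}

impl ConfigCache {
    fn refresh(&mut self, candidate: Result<Vec<String>, String>) {
        match candidate {
            Ok(features) => {
                self.current = features;
                self.last_refresh = Instant::now();
            }
            Err(e) => {
                // Degrade gracefully: alert, keep the old config serving traffic.
                eprintln!("config refresh rejected, keeping last-known-good: {e}");
            }
        }
    }

    fn is_stale(&self) -> bool {
        // Monitoring hook: a config that has not refreshed for too long is an incident.
        self.last_refresh.elapsed() > self.stale_after
    }
}

fn main() {
    let mut cache = ConfigCache {
        current: vec!["feature_0".to_string()],
        last_refresh: Instant::now(),
        stale_after: Duration::from_secs(300),
    };
    // A bad refresh (e.g. an oversized or malformed file) doesn't take the service down.
    cache.refresh(Err("too many features".to_string()));
    println!("serving {} features, stale: {}", cache.current.len(), cache.is_stale());
}
```

The detail that matters is the separation: validation failures are an alerting problem, not a reason for the serving path to stop.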
Organizations need to be sure not only that the user-centric functionality of their software works as expected, but that the infrastructure-focused aspects of it do as well.
At a dinner meeting with executives at a company several years ago, I was asked why the R&D team (comprising brilliant data scientists) could create prototypes so rapidly, and yet it took development so many months to produce a working version 1 for customers. Implicit in the question were the assumption that the product engineers were less skilled and the suggestion that we should seek to improve the talent of our developers.
I started off by explaining that exposing a system to your customers (and the wide-open internet) brings far more requirements for scalability, security, and usability than a proof of concept has, and that meeting them takes a lot of time. I also pointed out that testing takes time, and highlighted some bugs we had uncovered that led to expanded feature scope.
While this company did have an extremely slow and ineffective testing routine (that I had been brought in to help fix), testing wasn’t the primary bottleneck.
The nods of approval from IT and product leadership helped convince the executive team, but I knew that we still had a lot of work to do in QA to remove friction and increase development velocity, because we weren’t without blame.
But that conversation did lead to an initiative to invest in building out a more robust test environment that, thanks to the infrastructure engineers, could be spun up quickly, and that test infrastructure became the basis of their multi-cloud resiliency strategy for production systems.


