A lot has been written about the merits of testing complex systems in production. My favorite is the now-famous presentation Yes, I Test in Production (And So Do You) by Charity Majors. Its main premise is that it’s impossible to recreate live (production) system conditions in pre-production, so pre-production testing is always flawed and we end up having to test quite a lot once the system is already live. As Charity puts it: “They do test in prod, they just don’t admit it”, where by “they” she means everybody who thinks they can test systems before production thoroughly enough that nothing is left to post-go-live testing.

I have spent the last several years being responsible for the development of core financial systems at one of the largest financial institutions in the United States. The financial industry is arguably one of the most heavily regulated industries, with a strong emphasis on system resiliency, so if anybody is not leaving anything to chance, it would be us (second, perhaps, only to the aviation industry and NASA). So - do we test in production? Yes, we do. And we also do a lot of testing before going to production, but most importantly: what these two different kinds of tests protect us from is the critical nuance that I would like to address in this short blog post.

Let’s get one fact out of the way: I don’t know any serious practitioner of system resiliency who believes it is possible to catch all problems (bugs) before going live with a complex system. It’s noble to want to create 100% issue-free systems, but it is simply impossible. Being in denial doesn’t help either. Accepting reality and working with it - creating systems that can quickly detect problems, limit the blast radius of those problems, and ideally self-heal - is the way to go. Much of Charity’s talk addressed how to quickly detect and analyze problems by investing in observability.

When “testing in production” we also make sure the blast radius of problems is limited. That is why every time we launch a system we do it using canary releases, blue-green deployments and silent parallel runs. A silent parallel run is when we run the old and the new systems in parallel for some reasonable time. Initially, only the old system serves customer traffic. We start sending the same traffic to the new system as well, but the reactions and responses from the new system are only used to validate its behavior (e.g. by comparing it to the output of the old system, when possible). Once we are confident in the correctness of the new system, it can gradually start accepting an increasing share of customer traffic. All of this happens in production, and it can be quite a grueling process, but automation helps a ton, and the safety of testing your systems under actual customer conditions is simply priceless. A lot goes into making sure financial systems are reliable. Many of those things, indeed, need to be performed in production.
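To make the mechanics concrete, here is a minimal sketch of the shadow-traffic part of a silent parallel run. It assumes two HTTP services; the URLs, the mirroring logic and the mismatch logging are hypothetical illustrations of the idea, not our actual implementation.

```python
# A minimal sketch of a silent parallel run, assuming two HTTP services.
# All names and endpoints below are hypothetical.
import logging

import requests

OLD_SYSTEM_URL = "https://old-system.internal/api/quote"   # serves customers
NEW_SYSTEM_URL = "https://new-system.internal/api/quote"   # under validation

log = logging.getLogger("parallel_run")


def handle_request(payload: dict) -> dict:
    # The old system's response is the only one ever returned to the customer.
    old_response = requests.post(OLD_SYSTEM_URL, json=payload, timeout=2).json()

    # Mirror the same traffic to the new system; its response is used
    # solely for comparison and never reaches the customer.
    try:
        new_response = requests.post(NEW_SYSTEM_URL, json=payload, timeout=2).json()
        if new_response != old_response:
            log.warning("Mismatch for payload %s: old=%s new=%s",
                        payload, old_response, new_response)
    except Exception:
        # A failure in the new system must not affect customer traffic.
        log.exception("Shadow call to new system failed")

    return old_response
```

In a real deployment the shadow call would be made asynchronously so it can never add latency or failures to the customer path, but the core idea stands: same input to both systems, compare the outputs, serve only the old system’s answer.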

Does this mean we can stop testing before going live? If we test so much in production, do we still need a substantial number of unit tests, integration tests and all the other forms of pre-production tests? Or are they just a waste of time, and would we be better off getting our code to production as soon as possible? You know - where we can actually test things?

Our answer to these questions is that, despite the enormous emphasis we place on in-production testing, we still write a lot of automated tests before code ever reaches production. As a matter of fact, the code our teams write has a higher level of test coverage than code anywhere else I have worked. We monitor test coverage levels and quality, and we are pretty insistent on maintaining those tests. So, why do we do it?

To put it bluntly: for us, test coverage means very little for production resilience. Production resilience is all about self-healing and testing in production (canary releases, blue/green deployments, parallel runs, etc.). The reason we still insist on high levels of test coverage is that it makes our systems evolvable in the future! When we need to change a system’s known functionality, automated tests help us detect whether the change causes hidden ripple effects in that known functionality. Knowing whether we violated any prior assumptions - that is what we write automated tests for!
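As an illustration of what encoding a prior assumption looks like, here is a minimal, self-contained sketch in the spirit of the tests we rely on. The calculate_wire_fee function and its fee rules are hypothetical stand-ins for real business logic.

```python
# A minimal sketch of a regression test that encodes prior assumptions.
# calculate_wire_fee and its rules are hypothetical stand-ins.
import unittest


def calculate_wire_fee(amount: float, domestic: bool) -> float:
    """Hypothetical fee rule: domestic wires under $1,000 are free."""
    if domestic and amount < 1_000:
        return 0.0
    return 25.0 if domestic else 45.0


class WireFeeAssumptions(unittest.TestCase):
    def test_domestic_wires_under_threshold_are_free(self):
        # A prior assumption captured as a test: if a later change to the
        # fee logic silently breaks this rule, the suite fails before the
        # change ever reaches production.
        self.assertEqual(calculate_wire_fee(amount=500, domestic=True), 0.0)

    def test_international_wires_always_carry_a_fee(self):
        self.assertGreater(calculate_wire_fee(amount=500, domestic=False), 0.0)


if __name__ == "__main__":
    unittest.main()
```

If a later change quietly violates either rule, the suite fails long before the change reaches the canary or parallel-run stages.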

As far as I am concerned, automated tests are about the ability to change a system in the future. They have very little benefit for the present. And that really is the critical nuance we use to differentiate automated, pre-production tests from in-production testing. I hope other teams find this perspective useful in their work as well.