Let’s Not Turn the Test Automation Pyramid Upside Down Just Yet

A few days ago, I listened to Gojko Adzic’s talk “Humans vs Computers: five key challenges for software quality tomorrow” on Jfokus.  It was a great talk and it really gave some food for thought. This summary will not do it justice, but basically the plot was that our software is now being used by other software, and there’s AI, and voice recognition and a mix of all this will (and already does) cause new kinds of trouble. Not only must we be prepared for the fact the “user” is no longer human; we must also take into account new edge cases, such as twins for facial and voice recognition, and the fact that our software may stop working because someone else’s does. All in all, the risk is rightfully shifted towards integration and to handle it, we need to turn to monitoring of unexpected behavior. This made Mr. Adzic propose that we do something about the test automation pyramid. Turn it upside down maybe?

Personally, I vote for the test automation monolith :), or rectangle. I’ll tell you why. First, I have to admit that this talk made some pieces fall into place for me. My ambition in regards to developer testing is to raise the bar in the industry. I don’t want us to wonder about how many unit tests we need to write or how we should name them. Mocks and stubs should be used appropriately, and testability should be in every developer’s autonomic nervous system. But why? And here’s the eye opener: Because we’ll need to be solving harder problems in a few years (if not today already). Instead of, or more likely in addition to, finding simple boundary values to avoid off-by-one errors, we’ll also need to handle the twins using voice authorization to login to our software. Needless to say, we shouldn’t spend too much of our mental juice writing simple unit tests and alike.

That being said, we can’t abandon the bottom layer of the pyramid. Imagine handling strange AI-induced edge cases in a codebase that isn’t properly crafted for testability and tested to some degree. It would probably be the equivalent of adding unit tests to poorly designed code or even worse.  Yes, monitoring will probably play a greater part in the software of tomorrow, but isn’t it just another facet of observability?

So, what will probably happen next is that the top of the testing pyramid will grow thicker, maybe like this (couldn’t resist the “AI”):

Test Automation Monolith

The Developer Testing Maturity Curve

I’ve been talking about this topic for years, and have been asked questions about it while giving presentations on developer testing. Still, I’ve been unable to articulate fully what I mean by the “Developer Testing Maturity Curve”. Many wanted to know, because it’s tempting to rank your organization once there’s something to rank against.

After organizing my thoughts for a while I’ve finally come up with a model that matches my experience of how developer testing tends to get implemented across different organizations. Just a caveat: I may revise this model if I learn more or have an epiphany, but this is what it looks like today.

Graph Axis

The horizontal axis is technical maturity. It’s the knowledge and understanding of various tools and techniques. I consider employing a unit testing framework “immature”, which means that you need relatively little technical skill to author some simple unit tests. Conversely, implementing an infrastructure that enables repeatable, automated end-to-end tests would be on the other side of the scale.

The vertical axis is more elusive. In the image, I call it “organizational maturity”, but in reality, it means several things:

  • Understanding that developer testing (any automated testing, in fact) must be allowed to take time
  • Acknowledging testability as a primary quality attribute and designing the systems accordingly
  • Willingness to refactor legacy code to make it testable
  • Time and motivation to clean up test data to make it work for you, not against you
  • Dedicating time and resources to cleaning up any other old sins that prevent from harnessing the power of developer testing and automated checking, be they related to infrastructure, architecture, or the development process in general

You should get the idea… If you’re in an organization that has internalized the above, you won’t be throwing quality and (developer) testing out the window as soon as there’s a slightest risk of not meeting a deadline.

The Maturity Zones

The hard part of this model was to place the individual practices in the different zones. Fortunately, in my experience, technical and organizational maturities seem to go hand in hand. By that, I mean that I haven’t seen an organization with superior technical maturity that would totally neglect the organizational climate needed to sustain the technical practices, and vice versa. After having made this discovery, placing the individual practices became easier. Next, I’ll describe what they are.


Unit tests

Having only unit tests is immature in my opinion. True, certain systems can run solely on unit tests, but they are the exception. If you only do unit tests, you probably have no way of dealing with integration and realistic test data. From an organizational point of view, having only unit tests means that developers write them because they must, not because they see any value in them. Had this been the case, they’d engage in other developer testing activities as well.

Mocking frameworks

This is the only artifact on the border. Let me explain. A mature way of employing a mocking framework assumes that you understand how indirect input and output affects your design and its testability, and then use the framework to produce the correct type of test double. The immature approach is to call all test doubles “mocks” and use the framework because everyone else seems to.


Specialized frameworks

These are frameworks that help you solve a specific testing problem. They may be employable at unit test level, but they’re most frequent (and arguably useful) for integration tests. Examples? QuickCheck, Selenium WebDriver, RestAssured, Code Contracts. Some of these frameworks may be entire topics and areas of competence in themselves. Therefore, I consider it mature to make use of them.

Integration tests without orchestration

What does “without orchestration” mean? It means that the framework you’re using does everything for you and that you don’t need to write any code to start components or set them to a certain state.  I’m thinking about frameworks like WireMock, Dumbster, or Spring’s test facilities.

Using a BDD framework

You can use a BDD/ATDD framework to launch tests, hopefully the complex ones. This requires its infrastructure and training, so I consider it mature (barely). In this sector, you’re not reaping the full benefits of BDD, just using the tools.


Leveraging BDD

In contrast to the immature case, where the BDD framework is just a tool, in this sector the organization understands that BDD is about shared understanding and a common vocabulary. Furthermore, various stakeholders are involved in creating specifications together, and they use concrete examples to do so.

Testability as primary a quality attribute

Testability­—controllability, observability, smallness—applies at all levels of system and code design. Organizations that have internalized this practice ensure that all new code and refactored old code take it into account. When it comes to designing code, it’s mostly about ensuring that it’s testable at the unit level and that replacing dependencies with test doubles is easy and natural. At the architectural level it’s about designing the systems so that they can be observed and controlled (and kept small), and that any COTS that enters the organization isn’t just a black box. The same goes for services operated by partners and SaaS solutions.

Integration tests with orchestration

There’s a fine line between such tests and end-to-end tests. The demarcation line, albeit a bit subjective, is the scope of the test. An integration tests with orchestration targets two components/services/systems, but it can’t rely solely on a specialized framework to set them up. It needs to do some heavy lifting.

End-to-end tests

Anything goes here! These tests will start up services and servers, populate databases with data and simulate a user’s interaction with this system. Doing this consistently and in a repeatable fashion is a clear indication of internalized developer testing practices.

All test data is controlled

This is true for a majority of systems: The more complex the tests, the more complex the test data. At some point your tests will most likely not be able to rely on specific entries in the database (“the standard customer”) and you’ll need to implement a layer that creates test data with specific properties on the fly before each test. If you can do this for all test data, I’d say you’ve internalized this practice.

The entire system can be redeployed in a repeatable manner

If you have your end-to-end tests in place, you most likely have ticked off this practice. If not, there’s some work to do. A repeatable deployment usually requires a bit of provisioning, a pinch of database versioning, and a grain of container/server tinkering. Irrespective of the exact composition of your stack, you want to be able to deploy at will. Why is this important from a developer testing perspective? Because it implies controllability, as all moving parts of the system are understood.

And the point is?

Congratulations on making it this far. What actions can you take now? If you really want to gauge your maturity level, please do so. My advice is that you map the areas in the maturity curve to your organization’s/team’s architecture and practices, and start thinking about where to start digging and in what order.







Deciphering the Test Pyramid

The test pyramid is a very frequently quoted model. I believe it originates from Mike Cohn’s book Succeeding with Agile Software Development using Scrum. Originally, the test pyramid is drawn with three tiers: UI, Service, and Unit, but google it, and you’ll find many adaptations and refinements. I really like this model, because it illustrates so much about how testing is done on an agile team. In this post I aim to present some ways of reading and interpreting it along some dimensions.The Test Pyramid

Who does the “testing”

Since unit tests are at the bottom of the pyramid, it should come as no surprise that developers will actually be the ones who create the greatest amount of test code. Unit tests are the best place to employ standard testing techniques like equivalence partitioning, boundary value analysis and various sorts of table-based techniques, which means that there’ll be quite a few of them (there are other reasons as well, of course, like TDD). This doesn’t say anything about the testing process as a whole, but the fact remains: developers will create the most test artifacts and do the most checking.


Visually, the model implies that there’s a ratio between the layers, i.e, there’s a relation like 1:x between service level tests and unit tests, and there’s a relation 1:y between UI tests and service tests. Personally, I don’t think it’s meaningful to strive for a certain ratio as such. Different systems with different architectures and history will have different ratios. As long as there are more lower-level tests we should be fine. However, for reasons listed in the following sections, we really want the majority of the tests at the bottom of the pyramid.

Level of abstraction and Language

The higher up in the pyramid, the more domain-related the language, or at least it should be. Good unit tests most likely use domain concepts in their code and read as specifications, but they can get away with compact names related to the solution domain at times. This doesn’t work for higher-level tests, since they often work by orchestrating bits and pieces of quite complex test infrastructures in many cases. A typical example is a UI-based test of a specific scenario. The underlying test code will interact with a layer typically called “flow layer” or “scenario layer”, which in turn will orchestrate Page Objects or the like. So, basically, the test will talk to the test infrastructure using language like “login,” “open this customer,” “buy three drill presses.”


People often mention that the test cost increases as we move up along the pyramid’s tiers.  For models that put manual testing at the top, this is certainly true. And yes, the top of the pyramid is inhabited by more complex tests. However, it’s not a truism that that the cost of such tests should be spiralling.

Tests at the top of the pyramid need more code and more moving parts, so they’ll be more expensive, but good teams will have ways of working and a test infrastructure that make the price of creating yet another higher-level test reasonable.


Different layers/tiers require different tools. For example, unit tests will most likely rely on a unit testing framework and some mocking framework. In some specific cases, some kind of special-purpose testing library (used in unit tests) will be required.

Tests than run through the user interface will obviously use libraries for automating interaction with web pages, fat clients, or mobile apps. Tests in the middle tier will probably use the most diverse flora of tooling. They may include lightweight servers, things like Spring Boot, in-memory databases, libraries for managing transactions and test data setup; you getthe picture.

Also, BDD frameworks, if used, will most likely be used in the middle or topmost tier (or both), as well as tools for model-based testing.

Execution Time

As we move up towards the top of the pyramid, the tests have a larger footprint: they may require entire servers to be up and running, databases to be repopulated, a series of API calls over a slow network, etc. This naturally will affect their execution time. A corollary of this is that we should strive to push tests as far down the pyramid as we can.


Related to the previous point. Tests closer to the top will execute slower and consequently provide delayed feedback.Not only that, but the quality of the information they provide will most likely be lower. A UI-based end-to-end test is usually not the best tool for error localization, since there’s virtually no practical way for it to truly understand the system(s) it tests—not at the level of granularity needed to provide detailed information about what went wrong anyway.

Communication and Stakeholder Involvement

Tests at the top of the pyramid can have a very distinct advantage over unit tests: they may be authored so that they’re interesting to non-technical stakeholders. A good implementation of ATDD, BDD, or specification by example will produce a manageable quantity of tests at a high enough level of abstraction to be interesting to non-technical stakeholders, given that the documentation part is relevant and well written.

The test pyramid also tells us something about environment dependence. Unit tests are, by definition, environment independent. Service-level tests will often make some assumptions about the environment: a port must be open, a process can be launched, there’s a disk to write to, etc. Finally, UI tests probably depend on pretty much the entire system to be running. Depending on the architecture and method of deployment, the environment dependence may become absolute. Try a Cobol backend + licensed database with enterprise features…

These are some dimensions I find useful to discuss in putting the test pyramid to work when deciding on a testing strategy. You may use others, so please share.