Integrated Tests Are A Scam

I just learned about this blog post and now I would really like to know your opinion about Ian Connors' way: http://vimeo.com/68375232

They are integra*tion* tests, but not integra*ted* tests, which explains why I changed the name of this talk several years ago.

I'm slow to respond. Sorry.

I haven't watched Ian's talk yet. I know that several programmers whom I respect don't use mock objects (or at least don't use message expectations, although they "fake" functions to return hardcoded responses), and I have not yet taken the time and opportunity to learn how they do things differently than I do.

I see many programmers use stubs/fakes as a way to implement a message expectation: they stub foo() to return 23, then call bar(18), which they know will call foo() and return its value plus 9, then they check that bar(18) returns 32. I find this risky: it uses indirect knowledge of bar()'s implementation to justify expecting 32 (23 + 9) at the end. I prefer simply to say "when I invoke bar(18), the result should be that something eventually invokes foo()", because, although it still encodes some implementation detail, (1) it describes the *essential* result of bar(18) that we care about and (2) it describes this result quite abstractly ("something eventually invokes foo()"). I find this message expectation more stable and easier to understand than "I know that bar() returns foo() + 9 and I know that bar(18) causes foo() to return 23."
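Here is a minimal sketch of that difference in Python, using unittest.mock; the Pricer/base_rate names are invented for illustration and merely stand in for bar()/foo():

    from unittest.mock import Mock

    # Hypothetical stand-ins: price() plays the role of bar(), base_rate() plays foo().
    class Pricer:
        def __init__(self, rates):
            self.rates = rates

        def price(self, amount):
            return self.rates.base_rate() + amount

    # Stub style: justifies the expected 32 by knowing that price() adds base_rate() and amount.
    def test_price_with_stub():
        rates = Mock()
        rates.base_rate.return_value = 23
        assert Pricer(rates).price(9) == 32

    # Message-expectation style: states the essential result abstractly.
    def test_price_asks_for_the_base_rate():
        rates = Mock()
        Pricer(rates).price(9)
        rates.base_rate.assert_called_once()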

You are wrong at about 19:30 of your talk, where you said that after refactoring a cluster of objects you would need more tests afterwards. If you refactor, you will not need more tests afterwards; otherwise you are not refactoring. This was a really horrible example, imo; otherwise the talk was ok/good.

Pulling things apart generally means opening code up to being used in ways that its current client (probably the entry point of the cluster) happens not to exercise. This is a (generally) unavoidable consequence of removing code from its context. When we leave a block of code in its context and its client only sends it a limited set of inputs, we can safely avoid some of the tests we would otherwise think to write, and in the interest of time, we usually don't write those tests. When we separate that block of code from its context, it becomes liable to receive those previously-unseen inputs, and we have to decide whether to care about that. It seems generally (though not always) irresponsible never to add tests for those previously-unconsidered inputs.
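A tiny, invented illustration of the point:

    # Hypothetical example: while this logic lived inline in its cluster, the one
    # caller always passed a non-empty list, so no test ever covered the empty case.
    def average_price(line_items):
        # Once extracted for reuse, a new caller can pass [] and hit
        # ZeroDivisionError, an input the original context never produced.
        return sum(item.price for item in line_items) / len(line_items)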

In the process, we have refactored--we haven't changed the behavior of the system yet--but we need more tests in order to support reusing newly-available code in other contexts. We don't need to add those tests for the current system yet, but it's only a matter of time before we regret not adding them.

So it is that refactoring can make tests that we were once able to safely avoid less safe to avoid. Of course, in exchange for this risk, our refactoring opens up code for potential reuse that was not previously available for reuse, so if we don't intend to try to reuse that code, then we probably shouldn't separate it just yet.

This highlights some interesting tension between which set of tests the current system needs as a whole and which set of tests the parts of the system need individually. On the one hand, we don't want to waste energy testing paths that the system as a whole does not execute; but on the other hand, if we don't write those tests, then we might run into latent mistakes (bugs) while adding features that use never-before-executed paths of existing code. I'd never thought of that in particular before. Another of the many tradeoffs that make writing software complicated.

Thank you for your kind words.

First, when I count layers, I refer to frames in the call stack, focusing on just the code I need to test. In a typical application I have some framework "above" me (it calls my code), some libraries "below" me (I call them), and my stuff in the "middle". If I halt in some arbitrary spot in the code, I get a call stack and can look at how many frames of that call stack represent "my code" (or, more precisely, code I want to test). A "layer" means a level/frame of the call stack. My code might go to a maximum depth of 10 layers between the framework I deploy it into and the libraries I use. Broadly speaking, then, my code "is 10 layers deep". Of course, different code paths go through different numbers of layers, but in something like a typical MVC structure, most of the controllers have similar call stack depth and I'd use that most common call stack depth as a stand-in for "the number of layers".

The actual number matters less than the fact that it is an exponent!

As for "examples*paths^layers", I don't think I wrote that and couldn't find that. Broadly speaking, one needs to write one example per path, so examples=paths is roughly true and we need approximately paths^layers tests/examples to check the code thoroughly. Again, I've used this as a simplifying approximation: the number of examples is roughly the product of the number of paths through each layer, so if there are 5 paths through layer 1, 7 paths through layer 2, and 3 paths through layer 3, then there are close to 5*7*3 paths through the 3 layers when taken together, assuming that all 3 layers are involved in every path, which might not be the case. (Even if you can cut this in half or in thirds, in a typical system, it grows out of control quickly.) This is pretty close to 5^3, where 5 is the median number of paths through a single layer and 3 is the number of layers. Again, even if we can multiply this by a relatively small constant like 1/5 or 1/10, as the exponent grows slowly, the number of tests/examples we need explodes. Of course, this all applies to checking the code by running all the layers together, meaning an integrated test.

In the case of form fields, I would count the paths through this layer this way: how many different data formats do I have to worry about (dates, numbers, other special kinds of text) and how many different responses are there for each type (are there many success paths? are there any fundamentally different failure paths?). If you test mostly through integrated tests, then you'd have to write the same tests over and over again for every date field, every number field, and so on. Duplication. Ick. If you test the UI separately from handling the requests in the controllers, then you can avoid a lot of this duplication by extracting duplicate code into Helpers (or whatever replaces those these days) and testing them directly in isolation. Then when you say "this is another number field", you can just wire up the right Helper with confidence. (You might write a smoke test to double-check that you've wired a NumberFieldHelper to that number field.)
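A hedged sketch of that idea, translated out of Rails into plain Python; NumberFieldParser is an invented stand-in for a Helper:

    # Invented illustration: check number parsing once, in isolation, instead of
    # re-testing it through an integrated test for every screen with a number field.
    class NumberFieldParser:
        def parse(self, raw):
            try:
                return float(raw)
            except (TypeError, ValueError):
                raise ValueError("not a number: %r" % (raw,))

    def test_parses_plain_numbers():
        assert NumberFieldParser().parse("42.5") == 42.5

    def test_rejects_garbage():
        try:
            NumberFieldParser().parse("forty-two")
            assert False, "expected ValueError"
        except ValueError:
            pass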

Each controller method will probably have multiple paths, so count those; and if (for example) 3 fundamentally different kinds of input happen to lead down the same code path, count that as 3 paths, not 1.

With Rails in particular, you need to focus on model methods with custom code, and if that custom code calls some of the ActiveRecord magic and you're worried about using ActiveRecord correctly, then you need to include the ActiveRecord layer in your test. When that's a straight pass-through, you can count Model + ActiveRecord as a single path through a single layer, as long as you feel confident that ActiveRecord isn't going to blow up on that path. It's when you do things like use scopes or complicated queries or complicated updates that you have to count more paths. It's even worse if you use ActiveRecord lifecycle callbacks. (Don't.)

Certainly, if you rely on end-to-end scenarios to check behavior in one part of the system (like focusing on a particular UI element or a particular complicated model update), then you'll see rampant duplication/redundancy and understand how big a waste of energy integrated tests can be. I simply won't volunteer to write all those tests. We have to choose between spending all our time writing redundant integrated tests and cutting corners, writing those tests "tactically", and hoping that we haven't missed anything important.

Rails does, however, make it a little better, because if you stick to scaffolding, then the whole thing acts like one big layer. This means fewer integrated tests, but it also means that you're relying on Rails' default behavior. As you deviate from its omakase settings, you "introduce" new layers in the sense that things that used to behave as a single layer don't any more. It's probably more accurate, then, to think about the Rails UI+controller+model+ActiveRecord beast as a single layer for the purposes of this calculation. The number of paths still grows the same way (combinatorially), but it's clearer that we can treat large parts of the app as "a single layer". It means that you need thousands of tests instead of millions. (You still shouldn't need to write all those tests.)

I hope this helps.

You're most welcome. I hope it helps.

What makes you think that your Unit Tests are covering the permutations? They're not either. Your argument is flawed. Yes, there are billions of potential permutations, but not from a business requirements perspective. The main problem with Unit Tests - apart from the fact that they don't tell you if your code actually works in production - is that they're typically not based on business requirements, because they're too low level.

This ended up being a long answer, so let me summarize:

1. I know that my unit tests aren't covering all the permutations; I didn't claim that here.
2. The billions of permutations come from our design choices, not the business needs; but when we ignore those permutations, we get stack traces and admin calls at 4 AM, so let's stop ignoring them.
3. What you call a "problem" with unit tests seems similar to saying that the problem with a cat is that it's not a dog. Well, yes: cats are cats and dogs are dogs.

The details will take longer to draft than I have time for right now, but I will post them in the coming days.

Thanks for the reply!

You simply can't cover every permutation, no matter what approach. It's just not a valid argument.

I agree that design is key. More importantly, the ability to refactor allows the implemented design to improve. Unit tests hinder refactoring. Integration tests enable and validate refactored code.

I'm not getting the cat analogy, sorry. It seems like you're so attached to the solution that it now trumps the requirements. If I asked for a pet that is faithful, likes to take walks and fetch sticks, and you give me a cat, then yes, that's a problem. ;-)

Writing performant and reliable Integration Tests is hard, but it's the right solution. Unit Tests should be a last resort or developer tool.

You're most welcome!

Alex, I don't quite understand what you mean by "valid argument" here. I don't claim to be able to cover every permutation, so I don't understand why you'd consider my argument invalid based on negating a claim that I still haven't made. Where is the wrong link in the chain?

"X is a bad approach" can be true even if there is no perfect approach. "X is a bad approach" can be true even if there is no good approach (trivially true, but still true). I agree that "X is a bad approach" is a weak argument if there are no better approaches, but even that wouldn't make the argument *invalid*. None of these match my argument.

I do not have "cover every permutation" as a goal, even though I do have the goal of "cover most of the most interesting permutations", because it generally leads to "have more confidence both in the code as is and in being able to change it safely when needed". I therefore prefer approaches that let me cover more permutations with fewer tests (less effort, less resistance). I get this with isolated object/module tests over integrated tests.

"Unit tests hinder refactoring" is just plain wrong, because it misidentifies the cause of the problem. Excessive coupling hinders refactoring. Yes, I see excessive coupling in a lot of code, and yes, some of that code is in what the programmer probably intended to be unit tests. When I use isolated tests to drive towards suitable abstractions, I just don't have this coupling problem and my tests don't hinder refactoring--at least not for long. If I see that my tests are threatening to hinder refactoring, then I look for the missing abstraction that provides the needed reduction in coupling, extract it, and the problem disappears. Indeed, this is how I learned to really understand how to engage the power of abstraction.

Integrated tests (not "integration tests"!) constrain the design less, which allows design freedom but also provides less feedback about excessive coupling. They tolerate weaker (harder to change, harder to understand) designs, and those get in the way of refactoring, usually due to high levels of hardwired interdependence. Isolated tests provide considerable feedback about unhealthy coupling, but a lot of programmers seem either to have too high a tolerance for this unhealthy coupling or not to see the signs. I teach them to see the signs and heed them.

The cat/dog thing is just this: cats are not better than dogs and dogs are not better than cats, but rather they are different and suit different situations. That a cat is not a dog isn't a "problem" with cats. Similarly, that an isolated programmer test doesn't do the job of a customer test (check requirements) isn't a problem with isolated programmer tests. You say it yourself, that "unit tests should be a [...] developer tool". Exactly right: they are, and that's how I use them. I use them specifically to drive the design and help me identify design risks. They help me build components that I can confidently and safely arrange in a way to solve business problems.

I don't understand why you'd think that I believe that "the solution... trumps the requirements". I don't. I believe that both matter and attend to both. I also use different tools for different needs: programmer tests help me with the design of the solution and customer tests help me check that we're meeting the needs of the users/customers. So yes, if I try to use programmer tests to talk to a customer, that'd be like giving you a cat to play fetch with. (Some cats play fetch, but I wouldn't bet on it.) Sadly, I see a lot of programmers do the opposite: they try to use their customer tests to check the correctness of tiny pieces of their design. This wastes a lot of time and energy. I recommend against it. I recommend approaching two different kinds of testing/checking differently. Specifically, I recommend not using integrated tests to check low-level details in the design.

"Writing performant and reliable integrated (I assume) tests... is the right solution." ...to which problem? I use them where I find them appropriate, but I don't find them appropriate when I really want to check that I've built the pieces correctly and that they will talk to each other correctly. Think about checking the wiring in your house: isn't it much more effective to just check the current on any wire at any point, so that you can isolate whether the problem is the wire or the ceiling fan? Why limit yourself to detaching and reattaching various ceiling fans in various rooms in order to isolate the problem to the wire or the ceiling fan? (This is a real problem we had recently in our house.) Why would such an indirect way to investigate the problem ever be "the (singular?) right solution"? Yes, if we can't justify digging into the walls, then it might be the most cost-effective solution in that situation, but now imagine that there are no walls! Why limit yourself then? Why act like there are walls to work around when there are no walls?

When we write software, we can just tear the walls down and rebuild them any time we want, so why would we volunteer to pretend that we can't touch them? I don't see the value in limiting ourselves in that way. I don't. My programmer tests help me build components that I can recombine however I need, and if I find that a combination of components isn't behaving as expected, I can inspect each one on its own, find the unstated assumption I'm making, and add or fix a couple of tests in a few minutes. No need to pore through logs and long integrated tests with excessive setup code to isolate the problem.

Now, of course, we do only discover some problems by putting everything together and seeing how it runs, so we absolutely need some integrated tests, but why would I ever volunteer to find *every* problem this way?! I would hope that finding a problem with integrated tests would be the exception, and not the rule.

Hi, I was wondering if integration tests for database-related classes are scams as well?

I've been tasked with building yet another crud screen, and have been using integrated tests to ensure queries return the right information, and commands result in the correct side effects.

The integrated tests seem to have shortened my feedback loop when developing the SQL statements, since they tell me when a change to the SQL has made something go awry.

They also provide a type of living documentation as to the database environment that the class is living in (where I work, there is no version control for the database and no tracking of changes).

At least this way, when something changes, the test must change as well, keeping this documentation up to date (I've been in-lining the SQL statements in one project, and using an ORM in another one).

I feel like they are helping me to some extent, but they are slow to develop and slow to run. I still unit test most of the functionality. Am I doing something terribly wrong by testing those types of classes this way? Is there an alternative to this type of testing?

Great article. In the beginning I thought that mocking everything would lead me to complicated tests and would hinder refactoring. But after a few tests I noticed that it's pretty clean and refactoring is easy as long as I keep a good design (no excessive coupling), so it's not a problem of the TDD approach but of the design itself.

I have a question about acceptance tests that drive our unit tests.
From what I know, we shouldn't write more unit tests if an acceptance test is already passing. It means we have to write an acceptance test for every little feature we want to implement. That leads me to writing a lot of acceptance tests. Am I doing something wrong? Because my testing pyramid looks a bit like a block, rather than a pyramid.

Another question is about collaborators that return other objects with non-trivial behaviour: should I mock both the collaborator that returns this object and the object itself? Or do I maybe have some problem with my design if I have to do this?

Yes and no. I have a "standard talk" that I do to explain the details, and I don't have time to reproduce that here, but I appreciate the reminder to record it or write it down! :)

The "No" part comes from writing integrated tests only at the point of actually integrating with third-party software with the express goal of checking that integration. If you write integrated tests, then write them to check that the last layer of Your Stuff integrates well with Their Stuff.

The "Yes" part comes from ignoring the duplication in the production code at this integration point. I wrote about this in detail in _JUnit Recipes_, in the chapter on testing databases. You probably don't need to test 100 times that you can execute an SQL query or update and can clean up database resources and so on. (1) Check that once and (2) Use a library for it. (In the Java world, for example, I still like Spring JDBC templates, as long as we use them as a library, and don't inherit everything else from Spring.) So I recommend this to you: start relentlessly removing duplication among your modules that talk to the database directly, and see what happens. What kinds of duplication can you extract? Some of it will be specifically talking to the database without worrying about your domain model and some of it will be specifically working with your domain model without talking to the database. Both of these options are easier to test individually than putting both together. When we check database integration using our domain model or when we run the database just to check our domain model, that's where the scam returns. Don't do that.

But, as always, do whatever you find helpful until it becomes less helpful, then try removing more duplication. That almost always helps me. :)

Thanks for the reply! This definitely clears up a lot of confusion I was having.

"We shouldn't write more unit tests if acceptance test is already passing." I disagree, and even worse, following this rule is exactly a version of The Scam. Here's the problem: to change code confidently we need very fast feedback from a set of tests that check our code thoroughly; (1) to have very fast feedback from tests, we need tests to execute very quickly, but acceptance tests tend to execute more slowly AND (2) to check our code thoroughly requires a lot of tests AND acceptance tests tend to run more of the system, so the number of acceptance tests we need to cover code is much higher than the number of microtests (similar to unit tests) we need to cover the same code equally well. So it seems to me that using bigger tests in this way will create risk, and that risk will only increase over time until it becomes a problem, and when it becomes a problem, it will become a BIG problem.

This explains why I don't try to use one test for two purposes.

We write two kinds of tests: Customer Tests help the customer feel comfortable that we have built the feature that they have asked for, and Programmer Tests help the programmer feel comfortable that they understand the behavior of each part of the code and can change it with confidence and ease. These happen to be two very different goals that we solve best with two different kinds of tests. These two kinds of tests happen to compete with each other: Customer Tests generally need to be long, run the whole system, and are therefore less helpful in telling us what happens in smaller parts of the code. Programmer Tests generally need to be fast, zoom in on one small part of the system, and therefore aren't enough to give customers the confidence that we want to give them. Trying to use one kind of test for these two competing goals mostly doesn't work.

When the system is small, the two sets of tests look similar, and so we believe that we are needlessly duplicating effort. As the system grows and we remove more duplication and the structure becomes more complicated, these two sets of tests diverge more and more from each other, and the difference between these two kinds of tests becomes much clearer. Sadly, many people don't see this divergence because the cost of your "testing block" starts to look too high, and many people lose patience and throw away the tests. They often don't gain enough experience to see the point where the cost/benefit curve starts to bend. :P It's quite sad, really, because this makes me sound like I'm crazy and it makes the idea sound like a theoretical one, instead of the very practical one that it is. On the other hand, I got quite a few clients over the years because they reached a point where the tests were "too expensive" and asked themselves "Are we doing something strange?" and then contacted me.

If you are comfortable with the idea of having two kinds of tests for two different purposes and two different audiences, then The Scam won't kill you.

Regarding "collaborators that returns other objects with non-trivial behaviour", this alone doesn't seem like a problem to me, as long as the current test focuses on one module's behavior at a time. The problem happens when you start to want to stub/mock A to return a B and then stub/mock B in the test in order to check the interesting behavior. I interpret this as a sign of a design problem. If X (the module you are checking now) only uses A to get to B, then I consider this dependency unhealthy. I would change the design so that X depends directly on B *without knowing where B came from*. This is an example of "pushing details up the call stack". The origin of B (that it comes from A) is the detail that X perhaps doesn't really need to know about. Read http://blog.thecodewhispere... for a deeper description.

If X uses A and B, then you might decide that X needs to know that it can ask A for B. I disagree. If X uses A and B, then it might just be a coincidence that I can ask A for B, and in this case, I imagine that A-provides-B is a property that can change over time, so I would change X to accept A and B in its constructor, and push the detail of A-provides-B up the call stack. If, on the other hand, A-provides-B sounds like an essential property of A and B-must-come-from-A sounds like an essential property of B, then I see another risk: if X needs both A and B, then X probably has too many responsibilities. I would expect to split X into Y and Z, where Y needs A and Z needs B.
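Here is a minimal sketch of that move in Python; X, A, and B are the same stand-ins as in the text above:

    # Before (invented sketch): X reaches through A to get B, so tests for X
    # have to stub A just to hand back a stubbed B.
    class X:
        def __init__(self, a):
            self.b = a.provide_b()

        def do_work(self):
            return self.b.interesting_behaviour()

    # After: X depends directly on B; the fact that B happens to come from A
    # moves up the call stack to whoever wires X together.
    class XRevised:
        def __init__(self, b):
            self.b = b

        def do_work(self):
            return self.b.interesting_behaviour()

    # Wiring, somewhere near the entry point:
    # x = XRevised(a.provide_b())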

I can think of one more possibility: if X needs both A and B AND A-provides-B seems to make sense AND X seems like it has only one responsibility, then maybe X knows too much about the interaction between A and B. You can see this if you have to copy/paste a lot of stubs/mocks of A and B throughout your tests of X. You can also see this if you feel tempted to write comments in your tests for X that explain why you have to write so many stubs/mocks of A and B in those tests. In this situation, X probably wants to rely on some new abstraction C that summarizes the purpose of using A and B together, and X should depend on C, and maybe not on A or B at all!

This happens to me when I notice that I start to have many unrelated steps to perform in the same place, and I notice that adding a feature means adding a new unrelated step to an ever-expanding algorithm. Each of these unrelated steps needs to happen, but they are quite independent of each other, and as I add more collaborators, I have tests with 3-5 important side-effect goals, one on each collaborator. What's happening? These are event handlers, and my X is really just firing an event, and today I need 3 listeners for this event, but when I add new features, I often add a new listener for this event. The missing abstraction--the C in the previous paragraph--is the event. When I introduce the event, X's tests simplify to "fire this event with the right data at the right time" and all the tests for A, B, and their friends simplify to "here are the various kinds of event data you can receive--how do you handle each of them?" All those complicated tests for X that mock 3, 4, 5 (and every month more...) different collaborators disappear.
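A compressed, hypothetical sketch of that shape in Python (the bus and the event name are invented for illustration):

    # Invented sketch: instead of X calling each collaborator directly, X only
    # publishes an event; listeners subscribe independently.
    class EventBus:
        def __init__(self):
            self.handlers = {}

        def subscribe(self, event_name, handler):
            self.handlers.setdefault(event_name, []).append(handler)

        def publish(self, event_name, data):
            for handler in self.handlers.get(event_name, []):
                handler(data)

    class X:
        def __init__(self, bus):
            self.bus = bus

        def complete_order(self, order):
            # X's tests shrink to: "publishes order_completed with the right data".
            self.bus.publish("order_completed", order)

    # Each listener (today's 3, next month's 4th) gets its own focused tests:
    # bus.subscribe("order_completed", send_confirmation_email)
    # bus.subscribe("order_completed", update_inventory)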

I know that this sounds quite abstract. It's harder to describe without a concrete example, which I plan eventually to include in my next online TDD course (The World's Best Intro to TDD: Level 2... some time in 2017, I hope). I hope that, for now, you can find your situation in one of these cases. :)

...or maybe you meant something else entirely? I can try again.

First, I would like to thank you for your very detailed answer. I really appreciate that you can find time to help guys like me and others who cannot find answers on the web or in books.

Coming back to my question: I have a concrete example, and it's actually pretty simple. What I tried to test was an "application service" in the DDD sense.
For example:


def load_container_onto_vessel(vessel_id, container_data, vessels):
    vessel = vessels.get(vessel_id)  # vessels is a repository
    vessel.load(Container(container_data['volume'], container_data['mass']))

This is where I had to mock `vessels.get` and also mock `vessel.load` on the vessel that the `vessels.get` mock returns.
According to your answer this is a design problem, if I got it right, but I have no idea how to do it differently.

I also have a similar problem when I try to use the mockist style with a functional paradigm in the domain layer, since values are immutable, so functions return new values.

The idea of different-purpose tests appeals to me. But then, how do you drive the implementation of features that are not covered by acceptance tests?
For example: "a client can buy one piece of X from the store" is acceptance-tested, but there is no acceptance test for the case "a client can buy two pieces of X", yet we still have to implement it. Do we just have to remember to write a unit test for it, or do you write some more coarse-grained tests for that, but not at acceptance-test scale?