Getting Started with Contract Tests

June 25, 2017 Integrated Tests Are a Scam, Test Doubles

This is a companion discussion topic for the original entry at

Assume you have a bunch of end-to-end tests that hit live services and you'd like to point them towards static json files.

How do you recommend extending your approach (validating behavioral contracts between apis), to also cover syntax contracts with regard to the JSON payload/schema?
Basically, I'd like to hit the stub file 'endpoint_a.json' in my tests, but I want to know that 'endpoint_a.json' matches the schema and output from 'endpoint/a'.

Also, I've struggled with where this test belongs in the CI pipeline. Is it a prerequisite for running end-to-end tests on the client? A requirement for deploying to the server? Who should "own" this type of test?

It sounds like you want to test your simulated output (static JSON files). I literally don't understand this impulse, so I'm missing something, and I hope someone can explain it to me. (Programmers who read about contract tests suddenly want to test their test doubles. Why? Aren't they simple enough just to get right? Why aren't they?) When I write a contract test for this situation, I start with tests that check the output of endpoint/a, then I extract the parts of that output (or some property of it) that clients _need_ to depend on. I use this extracted information--the contract--to compose the simulated output (static JSON files). If someone changes the behavior of endpoint/a, then those tests will fail, and I'll know that I need to either weaken the contract (to ignore the change in implementation details) or change the contract (to compensate for a change that clients need to know about).

Would that work?

I, programmer, write contract tests in order to avoid end-to-end tests. It would satisfy me to replace my end-to-end tests with collaboration and contract tests over time. I run the more-focused tests more often and the end-to-end tests as rarely as I can reasonably justify. I'd like to (eventually, when my confidence reaches a high-enough point) give the end-to-end tests away to a separate testing group or the customers. If I have to keep them, then I tolerate them only as much as I have to. I "shouldn't" need them, but I might like to have them as a last line of defence.

That makes sense. So say I have tests that check endpoint/a. When those fail I fix the test and update the simulated output. Here's my issue: let's say I'm a moron (or just human) and I make a mistake updating the simulated output. Now I have updated stubs that are wrong. My code appears to work because it's against the stub data but won't work in real life because the stub isn't verified.

In your opinion, does this fall under the category of acceptable risk, or am I looking at this wrong and there is no risk? Or is this just a matter of developing against still-developing services versus mature services that are less likely to change?

This happens. I accept this risk, because at least it doesn't become exponentially more expensive to fix over time, the way integrated tests do. I also accept this risk, because I have a systematic way to check the results: a small set of rules that govern which changes lead to which other changes.

For example, if I change the simulated output from endpoint/a in order to test its clients, then I need to check that simulated output against the test for endpoint/a that produces that output. I need a way to compare the simulated output for the clients to the actual output of the endpoint.
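One way to make that comparison concrete is to express the contract as a single check and run it against both the live output and the stub. The following is only a sketch under assumptions: the field names `id` and `name`, the `ENDPOINT_A_CONTRACT` dictionary, and the `violations` helper are all hypothetical, and the two payloads are inlined here where a real suite would read `endpoint_a.json` and call `endpoint/a`.

```python
import json

# Hypothetical contract for endpoint/a: only the fields (and types) that
# clients actually depend on. Everything else is an implementation detail.
ENDPOINT_A_CONTRACT = {"id": int, "name": str}


def violations(payload, contract=ENDPOINT_A_CONTRACT):
    """Return a list of contract violations found in one payload."""
    problems = []
    for field, expected_type in contract.items():
        if field not in payload:
            problems.append(f"missing field: {field}")
        elif not isinstance(payload[field], expected_type):
            problems.append(f"wrong type for {field}")
    return problems


# Stand-in for the live response of endpoint/a; the extra field is ignored.
live_response = json.loads('{"id": 7, "name": "widget", "internal_rev": "abc"}')
# Stand-in for the contents of the static stub file endpoint_a.json.
stub_response = json.loads('{"id": 1, "name": "example"}')

# The same check runs against both sides of the contract.
assert violations(live_response) == []
assert violations(stub_response) == []
```

If someone edits the stub incorrectly, the second assertion fails; if the endpoint changes something clients depend on, the first one does.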

If I get the expected result of a test wrong, then there's not much I can do, except interpret that as a sign that I should try to simplify the expected result somehow. Most commonly, I generate the expected result from some kind of Builder that helps me avoid making that mistake.
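As an illustration of the Builder idea, here is a minimal sketch. The class name `EndpointAResponseBuilder`, its fields, and its defaults are assumptions made for this example and do not come from the original discussion.

```python
import copy


class EndpointAResponseBuilder:
    """Hypothetical builder for simulated endpoint/a responses."""

    def __init__(self):
        # The defaults always satisfy the assumed contract:
        # "id" is an int and "name" is a str.
        self._data = {"id": 1, "name": "example"}

    def with_id(self, value):
        if not isinstance(value, int):
            raise TypeError("id must be an int")
        self._data["id"] = value
        return self

    def with_name(self, value):
        if not isinstance(value, str):
            raise TypeError("name must be a str")
        self._data["name"] = value
        return self

    def build(self):
        # Return a copy so that tests cannot mutate the builder's state.
        return copy.deepcopy(self._data)


# A test states only the detail it cares about; the builder fills in the rest.
stub = EndpointAResponseBuilder().with_name("widget").build()
```

Because every simulated response passes through the same few methods, a mistake in the expected shape has one place to be made and one place to be fixed.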

I'm trying to apply collaboration & contract tests to my existing python code. As an example, let's say I have a function A which calls function B. In turn B calls function C.

I write a collaboration test for when a specific exception is caught by A. However, this exception is generated by C and B just let it bubble up. Do I need to write a contract test for B for that exception?

Very short version: yes, if B expects A to recover from the exception and probably no, otherwise. The longer version needs its own article, which I intend to publish soon. It includes things like why I'd like you to have asked a different question. :)

Read my longer answer here: https://experience.jbrains....
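The short version can be sketched in Python as follows, assuming that A does recover from the exception. The functions `a`, `b`, `c` and the exception `RecordNotFound` are hypothetical stand-ins for the code in the question.

```python
from unittest.mock import Mock


class RecordNotFound(Exception):
    """Hypothetical exception that C raises and B lets bubble up."""


def c(key):
    raise RecordNotFound(key)


def b(key, fetch=c):
    # B catches nothing: whatever fetch raises reaches B's caller.
    return fetch(key)


def a(key, lookup=b):
    # A recovers from the failure, so A depends on B signalling it this way.
    try:
        return lookup(key)
    except RecordNotFound:
        return None


# Collaboration test for A: a stubbed B raises, and A recovers.
failing_b = Mock(side_effect=RecordNotFound("k"))
assert a("k", lookup=failing_b) is None

# Contract test for B: the real B really does signal failure this way.
try:
    b("k")
    raised = False
except RecordNotFound:
    raised = True
assert raised
```

The collaboration test assumes that B can raise `RecordNotFound`; the contract test is what justifies that assumption.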

Let's assume we have a payment gateway interface with a variant capable of charging any positive amount. We also define a contract like "the payment gateway should successfully charge this amount", and the current variant passes the contract.

Eventually another variant arises that charges only if the payment is greater than a threshold and throws an exception otherwise. Because the earlier contract makes no assumption about the amount (the concept of a threshold is still unfamiliar to the client), it tests with an amount less than the threshold, so this variant fails while the first one passes.

Now, does the second variant's contract failure tell us that this variant is not suitable for clients unfamiliar with the threshold, and that the client has to familiarize itself with the threshold in order to use the second variant?

I would like some resources that provide insights and guidelines for writing contracts.

It depends how we articulate the contract. What does the Client (meaning the union of all Clients) of the Payment Gateway need to know about it? What is the _minimum_ that it needs to know?

Option 1: The Client knows that the Payment Gateway might charge any positive amount.

Option 2: The Client knows that whatever amount the Payment Gateway charges, that amount is positive.

With Option 1, the situation you describe creates a problem; but with Option 2, it does not. With Option 2, both variants satisfy the contract, although it might be easy to write an incorrect test for that contract. I have in mind that we must use property-based tests for those contract tests, because otherwise we risk sampling with values above 0 but below the threshold, and the test would accidentally fail.

So what does the Client really need to know? What is the WEAKEST assumption about the Gateway's behavior that the Client can make, but still say something useful about the Gateway's behavior? Option 2 is weaker than Option 1, so I would probably prefer it. In fact, we could even generalize Option 2 to "whatever the Gateway charges, that amount is greater than N currency units", and with Option 1, N=0.

At best the Client needs to know that some payments succeed and others fail, and maybe the Client doesn't need to know the reason for that failure. That sounds even better: different variants of the Gateway fail for different reasons (thresholds, currency, authentication failures, whatever) and the Client only needs to know how the Gateway signals failure. Maybe the Client only needs to know how to distinguish a technology-based failure (connection timed out) from a business failure (payment is the wrong currency, not high enough, account not recognized). This way, both variants satisfy the contract.
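A sketch of that idea, in which the Client knows only how failure is signalled and not why. The exception names, the two gateway classes, and the `pay` function are all hypothetical.

```python
class BusinessFailure(Exception):
    """The gateway refused the payment for a business reason."""


class TechnicalFailure(Exception):
    """The gateway could not be reached or did not answer."""


class AnyAmountGateway:
    def charge(self, amount_cents):
        if amount_cents <= 0:
            raise BusinessFailure("amount must be positive")


class ThresholdGateway:
    def __init__(self, minimum_cents):
        self.minimum_cents = minimum_cents

    def charge(self, amount_cents):
        if amount_cents < self.minimum_cents:
            raise BusinessFailure("amount below minimum")


def pay(gateway, amount_cents):
    # The Client distinguishes the two kinds of failure and nothing else.
    try:
        gateway.charge(amount_cents)
        return "paid"
    except BusinessFailure as failure:
        return f"declined: {failure}"
    except TechnicalFailure:
        return "try again later"
```

Both variants satisfy the same contract, because `pay` never mentions a threshold: `pay(AnyAmountGateway(), 500)` succeeds and `pay(ThresholdGateway(1000), 500)` is declined, through the same signalling path.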

I'm using one simple guideline, based on the Interface Segregation Principle: I want my Client to know as little as possible about its collaborators. This means agreeing on a contract with the fewest demands/constraints possible. I look for ways to push details out of the contract and down into the implementation. It's possible that this guideline suffices for most people most of the time!

Thanks for the reply.

1. The difference between Option 1 and Option 2 is so subtle for me. They sound nearly the same. Is Option 2 a parameterized (the payment amount being the parameter) version of Option 1?


> ... we must use property-based tests for those contract tests, because otherwise we risk sampling with values above 0 but below the threshold

That is exactly my concern (sampling values from an invalid range), and I like the solution. I didn't know that another test methodology exists, property-based testing, that addresses the issue: to put it simply, and if I haven't got it wrong, it provides a specification or formula for test inputs instead of concrete values. I need to explore it further.


> ... the Client only needs to know how the Gateway signals failure. Maybe the Client only needs to know how to distinguish a technology-based failure (connection timed out) from a business failure (payment is the wrong currency, not high enough, account not recognized).

Agreed on error reporting: a single generic exception for business rule failures (with sufficient information for the user in the message string) will do. In my current scenario, payment failure due to the payment threshold is reasonable information that the client must be aware of; I suppose a property-based testing tool needs that information too, in order to generate inputs.

I can't see any way other than exposing the information (payment threshold) through the gateway interface and providing a degenerate threshold value (0) for the first variant. (BTW, is that a violation of Interface Segregation?)


> What is the WEAKEST assumption about the Gateway's behavior that the Client can make, but still say something useful about the Gateway's behavior?

> I'm using one simple guideline, based on the Interface Segregation Principle: I want my Client to know as little as possible about its collaborators. This means agreeing on a contract with the fewest demands/constraints possible.

Concise as they are, these go some way towards guiding how to write contract tests. Thanks.

You're welcome for the answer.

Option 1 means this: for all n > 0, charge(n) is possible/defined, so we should certainly check the values 0, 1 (penny), something typical like 1495, and then the maximum integer value.

Option 2 means this: for all n such that charge(n) is possible, n > 0. In this case, charge(1) might not be possible, but certainly charge(0) and charge(-1) are not possible. The minimum threshold of, for example, 1000 fits Option 2 and we would check in the contract that charge(999) is rejected.

Option 1 is exactly Option 2 with a threshold of 0.

It feels natural to me to use both examples and property-based testing to check both options. I think this becomes a question of style/preference.
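To make the two options concrete, here is a hand-rolled sketch that mixes fixed examples with randomly sampled amounts. It uses only the standard library's `random` module in place of a real property-based testing library, and it assumes a hypothetical gateway whose `charge(n)` returns `True` when the charge is accepted, which happens for amounts strictly greater than `n_min`.

```python
import random


def make_gateway(n_min):
    """Hypothetical gateway variant: accepts amounts strictly above n_min."""
    def charge(n):
        return n > n_min
    return charge


def sample_amounts(low, high, count=200, seed=0):
    rng = random.Random(seed)  # seeded, so that failures are reproducible
    return [rng.randint(low, high) for _ in range(count)]


def satisfies_option_1(charge):
    """Option 1: for all n > 0, charge(n) is accepted."""
    amounts = [1, 1495, 2**31 - 1] + sample_amounts(1, 100_000)
    return all(charge(n) for n in amounts)


def satisfies_option_2(charge):
    """Option 2: for all n such that charge(n) is accepted, n > 0."""
    amounts = [-1, 0, 1] + sample_amounts(-100_000, 100_000)
    return all(n > 0 for n in amounts if charge(n))


no_threshold = make_gateway(0)
with_threshold = make_gateway(999)  # charge(999) rejected, charge(1000) accepted

assert satisfies_option_1(no_threshold)
assert not satisfies_option_1(with_threshold)  # charge(1) is rejected
assert satisfies_option_2(no_threshold)
assert satisfies_option_2(with_threshold)
```

The fixed examples at the boundaries make the outcome independent of the random sample; the sampled amounts only add breadth.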

The big difference lies in interpretation: is the threshold of 1000 part of the essential contract or an implementation detail? In your case, I infer that it is part of the essential contract, in which case I expect three kinds of responses from charge(): (1) OK, (2) Rejected because the amount lies below the threshold, (3) Failed, and here is the reason/cause.
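If the threshold does belong to the essential contract, the three kinds of responses could be modelled explicitly. This is a sketch only: the names `Outcome` and `ChargeResult`, the default threshold of 1000, and the `network_up` flag that simulates a failure are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    REJECTED_BELOW_THRESHOLD = "rejected_below_threshold"
    FAILED = "failed"


@dataclass(frozen=True)
class ChargeResult:
    outcome: Outcome
    reason: str = ""  # filled in only for FAILED


def charge(amount, threshold=1000, network_up=True):
    """Hypothetical gateway whose contract exposes the threshold."""
    if not network_up:
        return ChargeResult(Outcome.FAILED, "network timeout")
    if amount < threshold:
        return ChargeResult(Outcome.REJECTED_BELOW_THRESHOLD)
    return ChargeResult(Outcome.OK)


# The contract test named in the text: charge(999) is rejected.
assert charge(999).outcome is Outcome.REJECTED_BELOW_THRESHOLD
assert charge(1000).outcome is Outcome.OK
```

Here the rejection is a distinct, named response, so clients can react to it differently from an ordinary failure.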

Maybe in other situations, the threshold is just an implementation detail. Maybe negative charges are allowed and treated as refunds, if they happen, but they are not required. Maybe this particular payment gateway vendor doesn't yet support refunds, and so their implementation conforms to the contract (no charge amount is rejected), even though sometimes payments fail (and negative amount is just another reason to fail, alongside disk full and network timeout). The contract can say, for example, "charge(n) is undefined for n <= 0", which means, "do whatever you want when n <= 0". This might be OK and this might be dangerous, depending on the situation.

And yes, it might be a violation of ISP to force clients to know details that they don't need to know. That's not always bad, but it's always worth reconsidering.

> I need a way to compare the simulated output for the clients to the actual output of the endpoint.

How often is the comparison made? In practice how do you achieve this in a big org? My team has services that talk to other services, some owned by my team, some owned by other teams in the org and some owned by 3rd party.

At a minimum, I recommend running these tests whenever the Supplier side of the contract changes. If we can directly detect that situation, then we know to run the contract tests at that time. In this case, we try to detect whether the Supplier implementation has changed, so that we can judge that change as a mistake or intentional, then we can coordinate with the Consumers to publish the change and control the resulting disruption.

In large organizations we often don't achieve that level of precision. Instead, we run those tests periodically and regularly hoping that they help us detect breaking changes so that we can recover from the disruption. In this case, I fall back to the typical advice: as disruption increases, we run these tests more often.

In many organizations, the Consumers start bearing the responsibility for running these tests, in part because a cultural rule has developed that allows Suppliers to change their contracts unilaterally and without advance notice. In this case, I advise Consumers to run these tests whenever they fear that a breaking change might have occurred, whenever they notice that something might have changed, or periodically and regularly as a defence mechanism. With patience and good communication, Consumers can gradually share this responsibility with the Suppliers. This becomes an ongoing negotiation and, in many large organizations, that never ends.

When the Supplier lives outside the organization, the Consumers typically run these tests as classical acceptance tests: I won't accept your upgrade until your new version passes these tests. Of course, that works better when the Consumer decides whether to upgrade; if the Supplier merely deploys new versions at their discretion, then the Consumer needs to treat the relationship more like the ones I described above.

When a programmer tells me that this seems like too much work, I ask them "Compared to what?" :)


Hi, so I’ve been really interested in using Contract testing for replacing e2e tests. However, one thing that’s been really tripping me up is the idea of testing semantic vs. syntactic using contract tests. Advocates of syntactic testing seem to be against the idea of using contract testing to replace e2e tests.

I recall you saying that we can use contracts to test the scenario where an item gets put into a service with a request and then it appears in some kind of list after another request. However, it seems like the Pact maintainers recommend that you don’t keep track of the provider’s state with Pact and that the Pact contract is only for testing communication (syntax), not functional testing (semantic?).

Also, in the example I linked above, their issue doesn’t seem to be with functional testing so much as with robustness. The detail “given the username is longer than 20 characters, when saving the user object, then the server returns an error” tests a property of the server that, in their case, the client doesn’t depend on and therefore doesn’t need to test. That doesn’t seem to imply any limitation on functional testing with a contract, such as “given a request to put the object into the service, when making a request to list objects in the service, then the object should appear in the list”.

What can I do differently than Pact that would allow me to effectively test semantics without running into the same problems? Is it just allowing for sequences of requests in a single contract? If not would you happen to have an example of what I can do?

Also, I’m assuming that semantic contracts, similar to Pact contracts, aren’t used directly in top-to-bottom testing due to the combinatorial nature of those tests but instead alongside something like Mountebank, where the stubs are slightly permuted. Am I right in this assumption?



First things first, I need to fix the notification mechanism for comments here, because I didn’t notice one for your comment. I’m sorry that that failed and that I didn’t reply sooner. I would have wanted to.

I agree that Contract Testing doesn’t replace Customer Testing (which I believe to be Functional Testing with a name I prefer: testing designed to give Customers the warm, fuzzy feeling that we’ve enabled the capabilities they need). I consider Contract Testing as a kind of Programmer Testing (give the programmers confidence that their code behaves as they intend it to). I claim that relying on Customer Testing to find coding errors is overpaying for a guarantee compared to using Collaboration and Contract Testing for that purpose. Moreover, Collaboration and Contract Testing helps expose weaknesses in the design in a way that Customer Testing rarely does.

I use Contract Tests to change the role/purpose of Customer Tests, not to replace them. Even so, when we do it well, we need fewer Customer Tests for the same level of confidence both in the code and in the capabilities of the system.

I can’t comment on Pact, because I don’t use it. I can only comment on Contract Testing in general. I don’t know whether that invalidates whatever I write next in your eyes. You need to decide that. :)

As for the example you cite, I think I remember it. The example revolves around the contract of two Controllers in a Point of Sale system, collaborating to implement the feature of a Shopper (a person buying items from our store) purchasing multiple items in a single Purchase (a transaction unit with multiple items). One Controller handles scanning a product’s barcode and the other Controller handles the Cashier (the person scanning items and collecting money from the Shopper) signaling “these are all the items that this Shopper wants to buy”, presumably by pressing some button probably marked “TOTAL”. In this case, we can imagine an event sequence like this:

barcode_scanned "12345"
barcode_scanned "23456"
barcode_scanned "34567"

Each Controller handles a different event and they have to work together to compute the total price of those three items (assuming that we know the prices of all three items) in order to display to both the Cashier and Shopper, “That’ll be $45.81”.

What is the responsibility of each Controller? Well, the Barcode Scanned Controller needs to signal to something that an item has been found and reserved for purchase (but, of course, only for barcodes that match products that we know about). The Total Pressed Controller needs to signal to something to summarize the current Shopper’s Purchase so that the Cashier knows how much money to demand from the current Shopper.

Which something? I don’t know. Let’s invent something. “Current Purchase”? No. What matters is something like pending Purchase. It’s the one accumulating at the moment. So we make an abstraction representing this idea and it needs two methods (I’ll use OOP language for now, just to make it easier to describe): addItem() and completePurchase(). What are the contract semantics of Pending Purchase?

We could try to get too detailed and say that completePurchase() must return a Purchase value whose totalCost property is equal to the sum of the prices of the items added by addItem(), but that’s already making assumptions about the implementation. What happens when we add taxes? or allow for discounts? Oy. No. Let’s let that be an implementation detail.

And maybe this is where the confusion arises. Certainly, we don’t want clients of Pending Purchase to know details about how Pending Purchase tracks its state. This means that we wouldn’t put in the contract semantics something such as “addItem() adds items to an internal list”. I mean… we could do that, but eventually we’d likely regret it. And maybe I made the mistake of suggesting that some time ago. If I did, then I was young and foolish. :)

If we choose not to do that, then what are the contract semantics of Pending Purchase? Something like this:

  • addItem() never fails
  • completePurchase() returns a Purchase that summarizes the pending purchase
  • completePurchase() prepares Pending Purchase for the next Purchase, presumably by doing something such as clearing its internal list of items reserved for purchase

…uh, no. Again, too many implementation details. What could we say instead?

  • invoking completePurchase() a second time in a row returns an empty purchase

which is equivalent to

  • invoking completePurchase() without any items previously added returns an empty purchase

(And, of course, if you prefer to avoid an empty Purchase, it could raise a “No Purchase in Progress” error or something like that.)
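The two equivalent statements above could be written as a reusable contract check that any implementation of Pending Purchase must pass. This is a sketch under assumptions: the methods are renamed to Python style (`add_item`, `complete_purchase`), and `Purchase`, `is_empty`, and `InMemoryPendingPurchase` are hypothetical names, with the in-memory list being just one possible implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Purchase:
    items: tuple = ()

    def is_empty(self):
        return len(self.items) == 0


class InMemoryPendingPurchase:
    """One hypothetical implementation; the contract does not require a list."""

    def __init__(self):
        self._items = []

    def add_item(self, barcode):
        self._items.append(barcode)

    def complete_purchase(self):
        purchase = Purchase(tuple(self._items))
        self._items = []
        return purchase


def check_pending_purchase_contract(make_pending_purchase):
    # Completing a purchase with no items added returns an empty purchase.
    fresh = make_pending_purchase()
    assert fresh.complete_purchase().is_empty()

    # Completing a purchase a second time in a row returns an empty purchase.
    pending = make_pending_purchase()
    pending.add_item("12345")
    pending.complete_purchase()
    assert pending.complete_purchase().is_empty()


check_pending_purchase_contract(InMemoryPendingPurchase)
```

The check takes a factory rather than an instance, so the same function can be run against every implementation without knowing how any of them tracks its state.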

What Happened Here?

I tried to describe a Contract Test (or an aspect of the semantics of a contract), noticed that the resulting test would overspecify the contract, and used that as the trigger for raising the level of abstraction of the interaction between Client and Supplier. I used that as an opportunity to reduce the number of details that the Client needs to know to use the Supplier successfully.

At the same time, I illustrated the point of guidelines such as “Contract Tests aren’t designed to replace Customer Tests”, because I’ve moved the responsibility for “purchase cost = sum of item prices” somewhere else. I’ve made it a free choice in our design! This makes life easier for the programmers (less to get wrong), but riskier for the customers (more behavior to worry about). What’s truly happened is merely that the Controllers no longer need to care about this behavior, because either the implementation of Pending Purchase will handle it or the Purchase value object itself will handle it. Both options work fairly well, although I suspect that eventually the programmers will prefer one choice over the other. The OOP folks will push that behavior into Purchase and the FP folks will leave it as some pluggable policy that returns a Purchase Summary value that looks like { items = […], cost = Cents 4581 }.

This is how I use Contract Tests to guide my design.

What If I Don’t Have the Same Insight?

No worries. You can always write the more-complicated Contract Tests, which would result in more-complicated Collaboration Tests. At some point, you’d notice the duplication of irrelevant details in the Collaboration Tests (why do I have to do item price and sales taxes arithmetic in this damn test?! I’m not checking any of that arithmetic!!) and you could use that as the signal to raise the level of abstraction. And if you weren’t sure how, you’d ask for help. That’s how I learned it.

I use contract semantics in top-to-bottom design/implementation. I call it Client-First Design with Test Doubles. When I implement layer 6, my Collaboration Tests for that layer make assumptions about how layer 7 behaves. Those assumptions become the beginning of the Contract Test list for the abstractions in layer 7. I can often identify design regrets in those abstractions before I even try to implement layer 7. This way I notice the impending combinatorial explosion and stop it before it happens. Most of the time. Sometimes there are little explosions and I stop them before the factorial curve truly hurts me.
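A small sketch of how a Collaboration Test in one layer exposes an assumption about the next layer. The `TotalPressedController` class, its method names, and the `display` collaborator are hypothetical, and the test doubles come from the standard library's `unittest.mock`.

```python
from unittest.mock import Mock


class TotalPressedController:
    def __init__(self, pending_purchase, display):
        self.pending_purchase = pending_purchase
        self.display = display

    def on_total_pressed(self):
        # The controller does no arithmetic: it passes the summary along.
        purchase = self.pending_purchase.complete_purchase()
        self.display.show_total(purchase)


# Collaboration test: both collaborators are test doubles.
pending_purchase = Mock()
pending_purchase.complete_purchase.return_value = "a purchase summary"
display = Mock()

TotalPressedController(pending_purchase, display).on_total_pressed()

display.show_total.assert_called_once_with("a purchase summary")
```

The stubbed return value is the assumption: "complete_purchase() returns a summary that can be displayed". That sentence becomes an item on the Contract Test list for whatever implements Pending Purchase in the next layer.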

How much does all that help?