Examples Aren’t Specifications

“Specification by Example” is a particular implementation strategy of Behavior-Driven Development. The central justification, as far as I can tell, is expressed in the following snippet from the Wikipedia page:

Human brains are generally not that great at understanding abstractions or novel ideas/concepts when first exposed to them, but they’re really good at deriving abstractions or concepts if given enough concrete examples

Ironically enough, this is immediately followed with “citation needed”.

Anyone with experience teaching math will immediately understand what is off about this statement. The number of people who can see the first few numbers in a sequence, and deduce from that the sequence itself, is much, much smaller than the number of people who can understand the sequence when they are shown its rule. If I endeavored to teach you what a derivative is by just showing you a few examples of functions and their derivatives, I would be shocked if you were able to “derive” the abstraction that way.

It’s not just a matter of raw intelligence. It is true that only the highly intelligent can engage in this kind of pattern recognition. This is, in fact, exactly what an IQ test is. But the bigger problem is that multiple sequences can have the same values for the first few elements. It is quite simply not enough information to deduce a sequence from a few examples of its elements.
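To make that concrete, here is a small sketch of two well-known sequences that agree on their first five terms and then part ways. The function names are mine, chosen for illustration; the second sequence is the maximum number of regions you get by joining n points on a circle with chords.

```python
from math import comb


def powers_of_two(n: int) -> int:
    """n-th term (1-indexed) of the doubling sequence 1, 2, 4, 8, 16, ..."""
    return 2 ** (n - 1)


def circle_regions(n: int) -> int:
    """Maximum number of regions formed by chords joining n points on a circle."""
    return comb(n, 4) + comb(n, 2) + 1


first_five_doubling = [powers_of_two(n) for n in range(1, 6)]
first_five_regions = [circle_regions(n) for n in range(1, 6)]

print(first_five_doubling)  # [1, 2, 4, 8, 16]
print(first_five_regions)   # [1, 2, 4, 8, 16]

# The sixth term is where the two rules finally disagree.
print(powers_of_two(6), circle_regions(6))  # 32 31
```

Five examples were not enough to tell these two rules apart, and nothing about the examples warns you that a sixth is needed.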

Examples help illustrate an abstraction, and thereby make it easier to understand. First I present an abstraction to you. I explain that a derivative measures the rate of change of a function, in the limit as the change goes to zero. Then I show you examples to help you grasp it. I don’t do it the other way around, and I certainly don’t skip the part where I explain what a derivative is, and hope by simply seeing a few derivatives you’ll realize what I’m showing you.

The “specification by example” practices I’ve seen all recognize that it would be a terrible idea to have developers truly try to derive the product specification from examples. All of them supplement the examples with actual statements of the abstractions. They do what I said: explain a “rule”, then follow it with examples to help illustrate the rule. But then, out of some kind of confusion, they insist on enshrining the examples, instead of the rules, as “the specification”.

A good overview of how this practice fits into BDD is given here. The practice of “example mapping” is applied to generate Gherkin scenarios for concrete examples of behavior. The essential practice is that Gherkin is written exclusively for concrete examples, and not for abstract rules.

Let’s go back to the Wikipedia article to see how a few cases of sleight-of-hand are applied in order to justify this. From the article:

With Specification by example, different roles participate in creating a single source of truth that captures everyone’s understanding.

This is, in fact, an argument for something completely different: elimination of the overlapping and largely redundant documents that different roles of a development organization maintain. It has nothing whatsoever to do with expressing specifications through concrete examples. A “single source of truth” is equally possible with specifications expressed directly. In fact, doing so is far better in this sense, because no interpretative burden is left on developers to get from what is documented to what is specified. Specifying directly by abstractions avoids each reader of the concrete examples deriving his own personal “source of truth” about what the examples mean.

We see this kind of thing a lot. The justification for scrum teams and ceremonies, apparently, is that they keep manual testing load low. No, that’s the justification for test automation. That has nothing to do with scrum teams. It is a very common practice to try to “trojan horse” some novel concept in by attaching it to another, unrelated and generally already widely lauded practice. Avoiding redundant documentation is already a good idea. It is not a reason to adopt an entirely unrelated practice of specification by example.

Continuing:

Examples are used to provide clarity and precision, so that the same information can be used both as a specification and a business-oriented functional test.

Examples don’t provide precision; they provide clarity at the expense of precision. This is the fundamental point of confusion here. Examples are not specifications. I can provide these examples of a business rule:

“If John enters a 2 digit number into the field and tries to submit it, he is unsuccessful”

“If John enters a 5 digit number into the field and tries to submit it, he is successful”

“If John enters an 8 digit number into the field and tries to submit it, he is unsuccessful”

There are so many ways to interpret what I’m really getting at with these examples, I can’t list them all. Is the rule that the number of digits must be between 3 and 7? 4 and 6? Exactly 5? No one would dare hand only this to programmers and expect them to produce the desired software. That’s why every “specification by example” system supplements these examples with an actual rule like, “the number of digits must be between 3 and 6”.
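A quick sketch shows how little the three John examples pin down. The candidate rules below are the ones just mentioned, written as hypothetical predicates on the number of digits; every one of them agrees with all three examples, yet they disagree the moment John enters a 3 digit number.

```python
# Each example: (number of digits John enters, whether the submit succeeds).
examples = [(2, False), (5, True), (8, False)]

# Hypothetical candidate rules, each a predicate on the digit count.
candidate_rules = {
    "between 3 and 7 digits": lambda digits: 3 <= digits <= 7,
    "between 4 and 6 digits": lambda digits: 4 <= digits <= 6,
    "exactly 5 digits": lambda digits: digits == 5,
    "between 3 and 6 digits": lambda digits: 3 <= digits <= 6,
}


def fits_all_examples(rule) -> bool:
    """True when the rule gives the expected outcome for every example."""
    return all(rule(digits) == expected for digits, expected in examples)


consistent_rules = [name for name, rule in candidate_rules.items()
                    if fits_all_examples(rule)]
print(len(consistent_rules))  # 4: the examples rule out none of the candidates

# The rules only reveal their differences on an input the examples never cover.
verdicts_for_3_digits = {name: rule(3) for name, rule in candidate_rules.items()}
print(verdicts_for_3_digits)
```

Adding more examples shrinks the list of surviving candidates, but only the stated rule reduces it to one.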

The imprecision of examples is exactly why they can’t be specifications. Examples don’t specify. They exemplify.

As for “business-oriented test”, that’s BDD and TDD. The specification should be the business requirement, not some technical realization of that requirement. The requirement should be tested, preferably with an automated test. None of that requires the specification to be expressed through concrete examples.

Continuing:

Any additional information discovered during development or delivery, such as clarification of functional gaps, missing or incomplete requirements or additional tests, is added to this single source of truth.

It is? Why? What makes that happen? Does specifying by example force developers to go back and amend the requirements when they discover something new? Of course not. Maybe they will, maybe they won’t. Hopefully they will. It’s good practice to document a discovery that led to a code change in the requirements. This has nothing to do with whether requirements are expressed directly (abstractly) or indirectly (through concrete examples).

Continuing:

When applied to required changes, a refined set of examples is effectively a specification and a business-oriented test for acceptance of software functionality. After the change is implemented, specification with examples becomes a document explaining existing functionality. As the validation of such documents is automated, when they are validated frequently, such documents are a reliable source of information on business functionality of underlying software. To distinguish between such documents and typical printed documentation, which quickly gets outdated,[4] a complete set of specifications with examples is called Living Documentation.

This is an argument for tests as specifications, wherein the specifications are directly used by the CI/CD system to enumerate the test suite. The problem is that “a refined set of examples” cannot effectively be a specification. The author of this paragraph actually understands this. That’s why he says “specification with examples” (emphasis mine), instead of “specification by examples”, which is what this article is supposed to be advocating for. That one change in preposition completely alters what is being discussed, and completely undermines their case.

There are multiple (usually infinitely many) specifications that would align with any finite set of examples. Concrete examples simply don’t map 1-1 to abstractions. There’s a reason why the human mind employs abstractions so pervasively. I can’t tell you I’m hungry by pointing to a girl who is eating, and also sitting at a table, and also reading something on her phone (am I telling you I’m hungry, or that I want to sit down, or that I want to play on my phone, or that I think that girl is cute?)

Like I keep saying, everyone actually knows this is nonsense. If anyone really believed “specification by example” were possible, they would deliver only a set of examples to a development team and tell them to get working. They don’t do that. In the Cucumber world of “example mapping”, the actual acceptance criteria are of such critical importance, they are elevated to the same first-class citizen status as examples, and placed directly into the .feature files.

The rules are placed in .feature files as commented-out, non-executable plain English. If those rules change, well, maybe someone will go update that comment. We could completely overhaul the actual scenarios, the Gherkin (which is executable code and entails at least some sort of code change), and not touch the rules, and nothing would fail. These rules don’t explain existing functionality at all. They’re just comments, and you can write anything in them. They’re as bad as code comments for documentation.

By sticking with the practice of writing Gherkin for examples, instead of rules, the Gherkin ceases to be the specification. That’s why the feature files have to be augmented with a bunch of plain English. That English is actually the specification. All that’s happening here is that the benefits of a DSL like Gherkin are not exploited. The specifications are written in English, which is ambiguous, vague and imprecise (particularly in the way most people use it). To whatever extent the examples help resolve these ambiguities (especially when those examples are written in Gherkin), it would be far more effective to write the rules in Gherkin. The whole point of Gherkin is that English is too flexible and imprecise a language with which to express software specifications. Writing Gherkin in a way that requires it to be supplemented with plain English negates its benefit.

My point is not that examples are unhelpful. Quite to the contrary, examples are extremely helpful, and often crucial in arriving at the desired abstractions. But “specification by example” assigns an entirely inappropriate role to examples. The primary role of examples is to motivate the discovery of an appropriate specification. Examples stimulate people, particularly the ones who define the specifications, to think more carefully about what exactly the specifications are. A counterexample can prove that a scenario is too generic, and that a “given” needs to be added to constrain its scope.

Let’s return to the initial quote from the article. In my experience, the inability to understand abstract specifications is a nearly nonexistent problem in software development. I don’t ever remember a case where a requirement was truly specified in unambiguous terms, and someone simply drew a blank while reading it (or even just misinterpreted it, which would require an objectively wrong reading of the words). Instead, requirements are vague, unclear, ambiguous, confusing, and incomplete. Here’s an example:

When the user logs in, he should see his latest activities

What does that mean exactly? What counts as an “activity”? How many of the latest ones should he see? Is there a maximum? How are they displayed to the user? How should they be ordered?

The problem here isn’t that this requirement is so mind-blowing that we need to employ the tactics of a college level math lecture to get anyone to comprehend it. The information simply isn’t there. This is a lousy requirement because it isn’t specific, which means it isn’t a specification.

Really, what the supplemental examples do is fill in information that is missing in the rule. I can supplement this requirement with an example:

Given a user Sally
Given Sally has made 3 purchases
Given Sally has changed her delivery address 2 times
Given Sally has changed her payment info 3 times
When Sally logs in
Then Sally sees, in descending order by date, her 3 purchases, 2 delivery address changes, and 1 payment info change

Okay, this example is hinting at more information. A purchase, a delivery address change, and a payment info change are all examples of “activities”. Great. That was a missing detail in the “rule”. It specifies an ordering. It also seems that the recent activity is limited to 6 items. That was also a missing detail in the rule.

But I can interpret that differently. Maybe the rule is that there is no limit on the total number of activities shown, but at most 1 payment info change is shown. Both of those rules fulfill this example. We need the actual rules.
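Here is a minimal sketch of those two readings. The `Activity` type, the day numbers, and the second user Bob are all assumptions made up for illustration; the only thing taken from the scenario is Sally's counts. Both rules produce exactly the screen the example describes for Sally, and then disagree about Bob.

```python
from collections import Counter
from typing import NamedTuple


class Activity(NamedTuple):
    kind: str  # "purchase", "address_change" or "payment_change"
    day: int   # hypothetical timestamp; larger means more recent


def newest_first(activities):
    return sorted(activities, key=lambda a: a.day, reverse=True)


def rule_a(activities):
    """Reading A: show the 6 most recent activities of any kind."""
    return newest_first(activities)[:6]


def rule_b(activities):
    """Reading B: show everything, but only the most recent payment info change."""
    shown, payment_seen = [], False
    for activity in newest_first(activities):
        if activity.kind == "payment_change":
            if payment_seen:
                continue  # skip older payment info changes
            payment_seen = True
        shown.append(activity)
    return shown


# Sally: 3 purchases, 2 address changes, 3 payment info changes (assumed dates).
sally = [
    Activity("payment_change", 1), Activity("payment_change", 2),
    Activity("purchase", 3), Activity("address_change", 4),
    Activity("purchase", 5), Activity("payment_change", 6),
    Activity("address_change", 7), Activity("purchase", 8),
]
# Bob: a hypothetical second user with 7 purchases and nothing else.
bob = [Activity("purchase", day) for day in range(1, 8)]

sally_a, sally_b = rule_a(sally), rule_b(sally)
bob_a, bob_b = rule_a(bob), rule_b(bob)

print(sally_a == sally_b)      # True: the example cannot tell the rules apart
print(len(bob_a), len(bob_b))  # 6 7: the rules were never the same rule
```

Note that the agreement on Sally depends on the dates I picked. With her payment info changes spread differently, the same example would contradict one reading or the other, which is just more information the example does not state.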

Relying on examples in this manner is just a way to get by with vague and incomplete “rules”. In fact, if there is ever a perceived need to supplement a rule with examples, that is a very reliable sign that the rule is incomplete and needs to be improved. We can take the fact that the rule is enough on its own, no examples needed, as a bellwether for the completeness and specificity of the rules.

Making the examples the target for Gherkin, which is what turns into your acceptance tests, completely fails as a BDD/TDD mechanism. The fundamental process of development driven by behaviors and tests is that you don’t touch the production code unless and until, and only to the minimal extent that, a failing test requires it. If you’re only writing tests for specific examples, the minimum work you need to do to make those tests pass is to satisfy the examples, not the rules.

I could write code that simply hardcodes 3 purchases, 2 address changes and 1 payment info change into the “activities” view on the home screen. Doing so would almost certainly be easier than fetching the logged in user’s real list of activities, parsing and truncating them. That would make this test pass. Even if there are a couple more examples with different sets of example activities, I could still get away with hardcoding them. And to the extent that the examples are our “documentation”, this is correct. But I know that’s not what I am supposed to be doing, so eventually, even though all the tests are passing, I have to go in and start messing with production code to make it do what I understand is really what we want it to do. In this workflow, acceptance tests simply aren’t the driving force of the production code, in any sense. They revert to the old role of tests as merely being verifications.
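As a sketch of what that looks like, here is the cheapest "production" function that makes Sally's scenario pass. The activity labels, the in-memory store, and the function names are all hypothetical; the point is only that the function never looks at the user or the store.

```python
# The "Then" of Sally's scenario, newest first (labels are made up).
SALLY_SCREEN = [
    "purchase 3", "address change 2", "payment info change 3",
    "purchase 2", "address change 1", "purchase 1",
]


def latest_activities_hardcoded(user, activity_store):
    """Minimal code that satisfies the example: ignores both arguments."""
    return list(SALLY_SCREEN)


def sally_scenario_passes(latest_activities) -> bool:
    # Given: Sally's recorded history, oldest first (hypothetical store).
    store = {"Sally": [
        "payment info change 1", "payment info change 2", "purchase 1",
        "address change 1", "purchase 2", "payment info change 3",
        "address change 2", "purchase 3",
    ]}
    # When Sally logs in
    shown = latest_activities("Sally", store)
    # Then she sees the expected screen
    return shown == SALLY_SCREEN


passes = sally_scenario_passes(latest_activities_hardcoded)
print(passes)  # True

# A user with no history at all is shown Sally's activities.
bob_screen = latest_activities_hardcoded("Bob", {"Bob": []})
print(bob_screen == SALLY_SCREEN)  # True
```

The test suite is green and the feature is plainly broken. Nothing in the example-based test objects to that.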

(This hints at a bigger discussion about whether tests, even under the hood, should ever use hardcoded stub data. Doing so always risks a false positive when the production code also hardcodes the same data, but it’s a very common and quick-to-stand-up method of testing. If this is an implementation detail of the tests, at least the test self-documents that this hardcoded data isn’t part of the test definition, or the requirement, which is certainly much better than a test in which arbitrary hardcoded stub data is right there in the test and requirement definition. The problem of “I can make this test pass by hardcoding the same data in production code” is still present, but arguably at a much smaller risk of occurring, because it’s clear from reading the test that those stubbed values are fake and private to the test implementation. If you want to fully eliminate this problem, you should randomly generate the fake data as part of executing the test.)
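A rough sketch of that last idea, using only Python's standard `random` module. The rule being checked ("the 6 newest activities, newest first") is an assumption standing in for whatever the real rule turns out to be, and every name here is hypothetical. Because the history is generated fresh on each run, a hardcoded implementation has nothing to copy.

```python
import random

KINDS = ["purchase", "address_change", "payment_change"]


def latest_activities(history, limit=6):
    """Assumed production rule: the `limit` newest activities, newest first."""
    return sorted(history, key=lambda item: item[1], reverse=True)[:limit]


def latest_activities_hardcoded(history, limit=6):
    """Cheating implementation: always returns the same six activities."""
    return [("purchase", ts) for ts in (6, 5, 4, 3, 2, 1)]


def rule_holds_on_random_history(implementation, seed, limit=6) -> bool:
    rng = random.Random(seed)  # seeded so a failing run can be reproduced
    size = rng.randint(0, 20)
    # Distinct timestamps, so "newest first" has exactly one right answer.
    timestamps = rng.sample(range(1000, 1_000_000), size)
    history = [(rng.choice(KINDS), ts) for ts in timestamps]

    shown = implementation(history)

    expected_timestamps = sorted(timestamps, reverse=True)[:limit]
    return ([ts for _, ts in shown] == expected_timestamps
            and all(item in history for item in shown))


real_ok = all(rule_holds_on_random_history(latest_activities, seed)
              for seed in range(50))
fake_ok = any(rule_holds_on_random_history(latest_activities_hardcoded, seed)
              for seed in range(50))
print(real_ok, fake_ok)  # True False
```

Notice what the test had to contain in order to generate its own data: a statement of the rule. The random data is disposable; the rule is the part that is written down.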

The fact that different people come away with different understandings of what exactly the requirement means does not point to some defect in the human brain’s ability to comprehend abstractions. It points to a defect in the language of the requirement, which genuinely does not specify exactly what the requirement is. The pervasive problem is vague requirements, not developers who can’t understand what the product owner wants. The problem is a language problem, not a comprehension problem. That’s why the solution is a domain-specific language (Gherkin), not a brain transplant.

Examples are fine, and they can help. But they don’t get rid of the problem that the plain English business rule, I can almost guarantee you, is vague and ambiguous. Even if the ritual of communal exemplification causes the participants to all reach a shared understanding, it’s not going to help the next guy who comes along. And in case this isn’t clear, specifications are documentation, and documentation lives longer than any particular team member’s involvement. The whole point of the “document” is to be the thing anyone reads when they need to get an answer to something.

No one really believes examples can be the documentation. So when you insist on your Gherkin being only for examples, you necessitate plain English documentation. Is it better to have plain English documentation plus examples in Gherkin, than to just have plain English documentation? I’m sure it is. But both are far, far inferior to having all Gherkin documentation. (The major exception here is visual requirements such as fonts, colors, spacing and sizes, where a picture is literally worth a thousand words and the requirement is best expressed visually.) The point of this is to produce true, precise specifications of the product. Plain English specifications aren’t precise enough, and examples (in any language) are even worse in this sense. Keeping examples around only allows you to get away with incomplete specifications. You shouldn’t need examples to supplement rules. The rules should be expressive and clear enough on their own. You can use examples to help arrive at that clear, expressive rule. Once you do, the examples are scaffolding and they can, and probably should, be torn down.

Specifications define. Examples illustrate. It will cause nothing but trouble to confuse the two.