On Protocol Witnesses

Introduction

So, it’s just a normal day at work, you’re writing some Swift code, and the need arises to model an abstract type: there will be multiple implementations, and you want users to be able to use those implementations without knowing which implementation they are using.

So, of course, you reach for the tool that Swift provides for modeling abstract types: a protocol. If you were writing Kotlin, C#, Java or TypeScript, the comparable tool would be an interface.

As an example, let’s say you’re writing the interface to your backend, which exposes a few endpoints for fetching data. You are going to want to swap out the real or “live” backend for a fake backend during testing, so you’re at least going to need two variations. We’ll therefore write it as a protocol:

protocol NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing]

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant]

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing]
}

We’ll write each flavor as a conforming concrete type. The “live” one:

struct NetworkInterfaceLive: NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing] {
    let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings"))

    guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
      throw BadResponse(response)
    }

    return try JSONDecoder().decode([JobListing].self, from: data)
  }

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings/\(listing.id)/applicants"))

    guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
      throw BadResponse(response)
    }

    return try JSONDecoder().decode([Applicant].self, from: data)
  }

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    let (data, response) = try await urlSession.data(from: host.appendingPathComponent("applicants/\(applicant.id)/listings"))

    guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
      throw BadResponse(response)
    }

    return try JSONDecoder().decode([JobListing].self, from: data)
  }

  init(
    urlSession: URLSession = .shared,
    host: URL
  ) {
    self.urlSession = urlSession
    self.host = host
  }

  let urlSession: URLSession
  let host: URL
}

And then a fake one that just returns canned successful responses:

struct NetworkInterfaceHappyPathFake: NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing] {
    StubData.jobListings
  }

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    StubData.applicants[listing.id]!
  }

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    StubData.applicantListings[applicant.id]!
  }
}

Simple enough, right?

But wait! You start hearing that this isn’t the only way to implement this design. You can, in fact, build abstract types in Swift without using protocols at all! How could this be possible!?

We use what are called “protocol witnesses”. We replace the protocol with a struct, replace function declarations with closure members, and replace the concrete implementations with factories that assign the closures to the implementations provided by that concrete type. First the abstract type:

struct NetworkInterface {
  var fetchJobListings: () async throws -> [JobListing]

  var fetchApplicants: (_ listing: JobListing) async throws -> [Applicant]

  var fetchJobListingsForApplicant: (_ applicant: Applicant) async throws -> [JobListing]
}

Then the live implementation:

extension NetworkInterface {
  static func live(
    urlSession: URLSession = .shared,
    host: URL
  ) -> Self {
    func fetchJobListings() async throws -> [JobListing] {
      let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings"))

      guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
        throw BadResponse(response)
      }

      return try JSONDecoder().decode([JobListing].self, from: data)
    }

    func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
      let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings/\(listing.id)/applicants"))

      guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
        throw BadResponse(response)
      }

      return try JSONDecoder().decode([Applicant].self, from: data)
    }

    func fetchJobListingsForApplicant(_ applicant: Applicant) async throws -> [JobListing] {
      let (data, response) = try await urlSession.data(from: host.appendingPathComponent("applicants/\(applicant.id)/listings"))

      guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
        throw BadResponse(response)
      }

      return try JSONDecoder().decode([JobListing].self, from: data)
    }

    return .init(
      fetchJobListings: fetchJobListings,
      fetchApplicants: fetchApplicants,
      fetchJobListingsForApplicant: fetchJobListingsForApplicant
    )
  }
}

(If you haven’t seen this before, yes you can write “local functions” inside other functions, and they’re automatically closures with default implicit capture, which is how they have access to urlSession and host. If you need to explicitly capture anything you have to switch to writing them as closures, as in let fetchJobListings: () async throws -> [JobListing] = { [...] in ... }).
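For instance, the first local function above could instead be written as a stored closure with an explicit capture list. A quick sketch, capturing copies of the factory’s parameters:

    let fetchJobListings: () async throws -> [JobListing] = { [urlSession, host] in
      let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings"))

      guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
        throw BadResponse(response)
      }

      return try JSONDecoder().decode([JobListing].self, from: data)
    }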

And the happy path fake:

extension NetworkInterface {
  static var happyPathFake: Self {
    .init(
      fetchJobListings: { StubData.jobListings },
      fetchApplicants: { listing in StubData.applicants[listing.id]! },
      fetchJobListingsForApplicant: { applicant in StubData.applicantListings[applicant.id]! }
    )
  }
}

Take a moment to study those to understand that they are, in fact, equivalent. The only difference is that when you want to instantiate a live instance, instead of writing let network: NetworkInterfaceProtocol = NetworkInterfaceLive(host: prodUrl), we write let network: NetworkInterface = .live(host: prodUrl). And similarly when instantiating the happy path fake. Other than that (and the small caveat that closure members can’t have argument labels), it’s equivalent.

And, in fact it is more powerful: I can, for example, mix and match the individual functions from different implementations (i.e. make an instance whose fetchJobListings is the happyPathFake one, but the other two functions are live). I can even change the implementation of one function after creating an instance, substituting new implementations inline in other code, like test code, and those functions will close over the context, so I can capture stuff like state defined on my test case file, to make a test double that, say, returns job listings that depends on how I’ve configured the test.
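To make that concrete, here’s a sketch of what mixing and hot-swapping looks like with the witness struct above (prodUrl and the captured test state are hypothetical):

var network = NetworkInterface.live(host: prodUrl)

// Mix and match: take just one function from the fake.
network.fetchJobListings = NetworkInterface.happyPathFake.fetchJobListings

// Or hot-swap an implementation that closes over test-local state.
var requestedListingIDs: [String] = []
network.fetchApplicants = { listing in
  requestedListingIDs.append(listing.id)
  return StubData.applicants[listing.id]!
}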

This technique is discussed by PointFree here. I don’t subscribe to their content, so I only see the free portion of this article, and similarly for other articles they link to from there. From this limited vantage point, it appears to me their position is that “protocol oriented programming”, a popular paradigm in the Swift community, is not a panacea, that there are situations where it is not the right tool, and the protocol witness approach can be a better and more powerful alternative. They specifically say that some things are harder to do with protocols than with protocol witnesses, and some things are not possible to do with protocols that can be done with protocol witnesses.

Now, the fundamental starting point of this discussion seems to be that these two options (the protocol with implementing structs, and the struct with settable closure members) are different implementations of the same design. That is, we’re attempting to model the same system, with the same requirements and behavior. And we’re not even debating two different conceptual models of the system. We agree there is an abstract type of network interface (and this, not somewhere else, is where the abstraction lies… this is important) and it has multiple concrete implementations. Rather, we are only comparing two mechanisms of bringing this conceptual model into reality.

As an example of what I’m talking about, here are two different models. This:

struct JobListing {
  let id: String
  let role: String
  let applicants: [Applicant]
  let candidates: [Candidate]
  ...
}

struct Applicant {
  let id: String
  let person: Person
  ...
}

struct Candidate {
  let id: String
  let person: Person
  ...
}

struct Person {
  let id: String
  let firstName: String
  let lastName: String
  ...
}

vs. this:

struct JobListing {
  let id: String
  let role: String
  let applicants: [Applicant]
  let candidates: [Candidate]
  ...
}

struct Applicant {
  let id: String
  let isCandidate: Bool
  let firstName: String
  let lastName: String
  ...
}

These are both attempts to conceptually model the same requirements, but they are different models (i.e. the former allows applicants and candidates to have different attributes, the latter does not). Contrast with the following, which is two implementations, with competing language features, of the same model. This:

struct JobListing {
  let id: String
  let role: String
  var applicants: [Applicant] { ... }
  var candidates: [Candidate] { ... }
  ...
}

vs. this:

struct JobListing {
  let id: String
  let role: String
  func getApplicants() -> [Applicant] { ... }
  func getCandidates() -> [Candidate] { ... }
  ...
}

These both model the domain in the exact same way. The difference is in whether we use the language feature of computed read-only variables vs. functions. That’s purely mechanical.

I believe PointFree are framing this choice of modeling our problem as a protocol with implementing structs, vs. a struct with closure members, as the latter: the same conceptual model but with different tools to construct the model. This is why the discussion focuses on what the language can and can’t do, not the exact nature of the requirements or system, since it is fundamentally about choosing language constructs, not correctly modeling things.

I think this framing is mistaken.

I rather consider the choice over these two implementations to be a discussion over two different conceptual models. This will become clear as I analyze the differences, and I will point out exactly where the differences have implications for the conceptual model of the system we’re building, which goes beyond mere selection of language mechanisms. In the process, we need to ask and satisfactorily answer:

  • What is the nature of this “extra power”, and what are the implications of being able to wield it?
  • What exactly are the things you “can’t” do (or can’t do as easily) with protocols, and why is it you can’t do them? Is it a language defect, something a future language feature might fix?
  • Does the protocol witness design actually eliminate protocols?
  • Why is this struct called a “protocol witness”? (that’s very significant and reveals a lot about what’s going on)

About That Extra Power

One of the fundamental arguments used to justify protocol witnesses, if not the fundamental argument, is that they are “more powerful” than protocols. Now, based on the part of the article I can see, I have a suspicion this is at least partially based on mistaken beliefs about what you can do with protocols. But there’s at least some truth to it, simply considering the examples they show (which I mentioned above in my own version of this): you can reconfigure individual pieces of functionality in the object, and even compose functionality together in different combinations, where in the protocol -> implementer approach you can’t (well, sort of… we’ll get there). With protocols, the behaviors always come as a set: they can’t be mixed and matched, and they can’t be changed later.

This is, indeed, much more “powerful” in the sense you are less restricted in what you can do with these objects.

And that’s bad. Very bad.

“More powerful”, which we can also phrase as “more flexible”, always balances with “less safe”.  Being able to do more in code can very well mean, and usually does mean, you can do more wrong things.

After all, why are these restrictions in place to begin with?  Is it a language defect?  Have we just not figured out how to improve protocols to make them this flexible? Can you imagine a Swift update in which protocols enable you to do this swapping out of individual requirements after initializing a concrete implementing type?

No, it’s the opposite.  Unfettered flexibility is the “stone age” of software programming.  What’s the most flexible code, “powerful” in the sense of “you can implement more behaviors”, you could possibly write?

Assembly.

There, you can do lots of things higher level languages (where even BASIC or C is considered “high level”) won’t let you do.  Some examples are: perform addition on two values in memory that are execution instructions and store the result in another place that is later treated as an instruction, not decrement the stack before jumping to the return address, decrement the stack by an amount that’s not equal to the amount it was incremented originally, and have a for loop increment the index in the middle of an iteration instead of at the end.

These are all things you can’t do in higher level languages.  And thank God for that!  They’re all bugs.  That’s the point of higher level languages being more restricted: they stop you from writing incorrect code, which is the vast majority of code.

See, the main challenge of software engineering is not the inability to write code easily.  It’s the ability to easily write incorrect code.  The precise goal of raising the abstraction level with higher level languages is to restrict you from writing code that’s wrong.  The restrictions are not accidents or unsolved problems.  They are intentional guardrails installed to stop you from flying off the side of the mountain.

This is the point of strongly typed languages.  In JavaScript you can add an integer to a string.  You can’t do this in C.  Is C “missing” a capability?  Is JavaScript “more powerful” than C?  I guess in this sense yes.  But if adding an integer to a string is a programmer error, what is that extra power except a liability?

This is what the “footgun” analogy is about.  Some footguns are very very powerful.  They’re like fully automatic heat-seeking (or maybe foot-seeking?) assault footguns.  And that’s even worse than a single action revolver footgun because, after all, my goal is to not shoot myself in the foot.

While a looser type system or lower level language is “more powerful” at runtime, these languages are indeed less powerful at design time. Assembly is the most powerful in terms of what behavior you can create in the running program, but it is the least powerful in terms of what you can express, at design time, is definitely wrong and should be reported at design time as an error.

There’s no way to express in JavaScript that adding a network client to a list of usernames is nonsense.  There’s no way to express that calling .length on a DOM node is nonsense.  You can express this in Swift, in a way the compiler understands so that it prevents you from doing nonsensical things.  This is more powerful.

So more power at runtime is less power at design time, and vice versa. Computer code is just a sequence of instructions, and the vast vast majority of such sequences are useless or worse. Our goal is not to generate lots of those sequences quickly, it’s to sift through that incomprehensibly giant space of instruction sequences and filter out the garbage ones. And because of the halting problem, running the program to exercise every pathway is not a viable way of doing so.

Is this relevant to this discussion about protocol witnesses?  Absolutely.  All this is doing is turning Swift into JavaScript.  In JavaScript, there are no static classes with fixed functions.  “Methods” are just members of an object that can be called.  “Objects” are just string -> value dictionaries that can hold any values at any “key” (member name).

The only difference between this and the structs PointFree is showing us here is that JavaScript objects are even more powerful still: you can add or remove any functions, at any name, you want during runtime.  How can we make Swift do this?

@dynamicCallable
struct AnyClosure {
  // This will be much better with parameter packs
  init<R>(_ body: @escaping () -> R) {
    _call = { _ in body() }
  }

  init<T, R>(_ body: @escaping (T) -> R) {
    _call = { args in body(args[0] as! T) }
  }

  init<T1, T2, R>(_ body: @escaping (T1, T2) -> R) {
    _call = { args in body(args[0] as! T1, args[1] as! T2) }
  }

  init<T1, T2, T3, R>(_ body: @escaping (T1, T2, T3) -> R) {
    _call = { args in body(args[0] as! T1, args[1] as! T2, args[2] as! T3) }
  }

  ...

  @discardableResult
  func dynamicallyCall(withArguments args: [Any]) -> Any {
    _call(args)
  }

  private let _call: ([Any]) -> Any
}

@dynamicMemberLookup
struct NetworkInterface {
  private var methods: [String: AnyClosure]

  init(methods: [String: AnyClosure] = [:]) {
    self.methods = methods
  }

  subscript(dynamicMember member: String) -> AnyClosure {
    get { methods[member]! }
    set { methods[member] = newValue }
  }
}

In a language that doesn’t have stuff like @dynamicCallable and @dynamicMemberLookup, you can still do this, but you have to settle for ugly syntax: network.fetchApplicants(listing.id) will have to be written as network.method("fetchApplicants").invoke([listing.id]).

With this, we can not only change the implementation of fetchJobListings or fetchApplicants, we can add more methods! Or delete one of those methods. Or change the parameters those methods take, or the type of their return value. Talk about being more powerful!
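Here’s a sketch of that in action, reusing the StubData and Applicant types from earlier (fetchShoeSize is, of course, a made-up “method”):

var network = NetworkInterface()

// "Methods" are just dictionary entries now; add whatever we like, at any name.
network.fetchJobListings = AnyClosure { StubData.jobListings }
network.fetchShoeSize = AnyClosure { (applicant: Applicant) -> Int in 42 }

let listings = network.fetchJobListings()  // compiles, returns Any
// let oops = network.fechJobListings()    // also compiles; crashes at runtime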

So that’s even better! Well, if you consider this added “power” to be a good thing. I don’t. What’s the point of adding a new method, or even worse changing the parameters of an existing one? It’s not like production code is going to call that new method, or send those new types of parameters in. Well, you might misspell fetchJobListings as fechJobListings, or forget or reorder parameters, and now that’s a runtime error instead of a compile time error.

I like statically typed languages, and I like them because they restrict me from writing the bugs I can easily write in JavaScript, like calling a method that doesn’t exist, or changing the definition of sqrt to something that doesn’t return the square root of the input.

And this is very controversial.  I’ve met devs who strongly dislike statically typed languages and prefer JavaScript or Ruby because the static type system keeps stopping them from doing what they want to do.  I don’t want to be completely unfair to these devs: it is more work to design types that allow everything you need but also design bugs out of the system.  Rather, it’s more work at first, and then much much less work later.  It’s tempting to pick the “easier now and harder later” option because humans have time preference.

(Again to be fair to them, no static type system in existence today is expressive enough to do everything at compile time that dynamic languages can do at runtime. Macros, and more generally metaprogramming, will hopefully bridge the gap).

What bugs are no longer impossible to write when we use protocol witnesses?  Exercising any of these new capabilities that are just for tests in production code.  You can now swap an individual function out.  You should never do that in production code.  If you do it’s a bug.  If your response to this is “why would anyone ever do that?”, my counter-response is “LOL“.

Indeed this design, by being “more powerful” at runtime, is less powerful at compile time.  I simply can’t express that a flavor of the object exists where the definitions of the two methods can’t vary independently.  With the protocol I can still express that a flavor exists where they can vary independently: make the same witness struct but make it also conform to the protocol.  So I actually have more expressiveness at compile time with the protocol.  I can even say, for example, that one particular function or type only works with one concrete type (like a test that needs the happy path type, I can express that: func runTest(network: NetworkInterfaceHappyPathFake)), or that two parts of code must work with the same type (make it a generic parameter).

I can’t do any of this with protocol witnesses because the compiler has no idea about my types.  It only knows about the abstract type (represented by the witness struct), instead of knowing not only about all the types (the abstract type, as the protocol, and the live concrete type, and the happy path concrete type, etc.) but also their relationship (the latter are all subtypes of the abstract type, expressed by the conformance). So, as it turns out, and specifically because the protocol witness can do things the protocol design can’t do at runtime, the protocol design can do things at compile time the protocol witness design cannot.
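A sketch of both of those compile time expressions, assuming the protocol-based types from earlier (the function names are just for illustration):

// Only the happy path fake will do here; passing any other implementation is a compile error.
func runHappyPathTest(network: NetworkInterfaceHappyPathFake) async throws {
  _ = try await network.fetchJobListings()
}

// Two collaborators forced to use the same concrete implementation, whichever one that is.
func runScenario<N: NetworkInterfaceProtocol>(primary: N, secondary: N) async throws {
  _ = try await primary.fetchJobListings()
  _ = try await secondary.fetchJobListings()
}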

“More” or “less” powerful depends on whether you’re looking at compile time or runtime, and one will be the opposite of the other.

These Are Different Models

As I said at the beginning, the very framing here is that these are alternative language mechanisms for the same model. The abstract type and concrete implementing types still exist in the protocol witness code. You just haven’t told the compiler about them, so it can’t enforce any rules related to them.

But whether or not you should be able to mix and match, and by extension swap out, individual functions in the NetworkInterface, is a matter of what system we’re modeling: what exactly the NetworkInterface concept represents, and what relationship, if any, its functions have to each other. Why, after all, are we even putting these three functions in the same type? Why did we decide the correct conceptual model of this system is for fetchJobListings and fetchApplicants(for:) to be instance methods on one type? Why didn’t we originally make three separate protocols, each of which defines only one of these functions?

Well, because they should vary together! Presumably the jobListing you pass into fetchApplicants(for:) is going to be, in fact needs to be, one you previously retrieved from fetchJobListings on the same NetworkInterface instance… or at least an instance of the same type of NetworkInterface. If you grabbed a jobListing from one implementation of NetworkInterface and then asked another implementation of NetworkInterface what its applicants are, do you think it’s going to be able to tell you?

This means we really should try to express that a particular JobListing type is tied back to a particular NetworkInterface type, and that it is a programmer error that we ideally want to catch at compile time to send a JobListing retrieved from one type of NetworkInterface into another type of NetworkInterface. How would we do that? Definitely not with protocol witnesses. We not only need protocols, we need to exercise more of their capabilities:

protocol NetworkInterfaceProtocol {
  associatedtype JobListing
  associatedtype Applicant

  func fetchJobListings() async throws -> [JobListing]

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant]

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing]
}

This way, each NetworkInterface type has to specify its own JobListing and Applicant types, and (as long as we pick different types, which we should if we want strong typing) it will become a compile time error to try to pass one type of NetworkInterface‘s job listing into another type of NetworkInterface, which we know is not going to produce a sensible answer anyways (it might crash or throw an exception due to not finding a matching job listing, or even worse it might successfully return a wrong result because both backend systems happen to contain a job listing with the same id).
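Here’s a sketch of what that buys us; the two conforming types and their nested entity types are hypothetical stand-ins with stubbed bodies:

struct LiveNetwork: NetworkInterfaceProtocol {
  struct JobListing: Decodable { let id: String }
  struct Applicant: Decodable { let id: String }

  func fetchJobListings() async throws -> [JobListing] { [] /* real HTTP here */ }
  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] { [] }
  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] { [] }
}

struct FakeNetwork: NetworkInterfaceProtocol {
  struct JobListing { let id: String }
  struct Applicant { let id: String }

  func fetchJobListings() async throws -> [JobListing] { [] /* stub data here */ }
  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] { [] }
  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] { [] }
}

// let listing = try await LiveNetwork().fetchJobListings()[0]
// try await FakeNetwork().fetchApplicants(for: listing)
// ^ compile error: LiveNetwork.JobListing is not FakeNetwork.JobListing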

Obviously, then, it’s nonsense to mix and match the individual functions. The whole point of grouping these functions together is that they form a cohesive unit, and the implementation of one needs to “match” the implementation of another. Modeling this in a way where we are allowed to compose different individual functions is just plain wrong.

But if these three functions happened to be totally separate, like let’s say we have separate backends for different resources, they have no relationship to each other (no foreign keys from one to the other, for example), and it should be possible, and might eventually be a production use case, to mix and match different backends, then it would be just plain wrong to model all those functions as being a cohesive unit, incapable of being individually composed.

See, these are different models of a system. Which one is correct depends on what system we’re trying to build. This is not about language mechanics.

Even with the system as it is, where these functions clearly are a cohesive unit… why should NetworkInterface be a protocol? Why should we be able to define two interfaces that have entirely different implementations? Look at what’s defined in the Live implementation: we implement each one by doing an HTTP GET from a particular path added to a shared base endpoint (each one can’t have a separate base URL, they all have to go to the same backend system). Why would we want to replace this entire implementation? Are we going to have a backend that doesn’t use HTTP? If so, yes it makes sense this needs to be abstract. If not… well then what do we expect to vary?

The base URL? Because you have a Development, Staging and Production backend hosted at different URLs?

Well then just make that a parameter of a concrete type. If you want to be really fancy and make it a compile time error to send a Development JobListing into a Staging NetworkInterface, you can make NetworkInterface generic and use that to strongly type the response entities by their environment:

protocol Environment {
  static var host: URL { get }
}

enum Environments {
  enum Development: Environment { static let host = ... }
  enum Staging: Environment { static let host = ... }
  enum Production: Environment { static let host = ... }
}

struct JobListing<E: Environment> {
  ...
}

struct Applicant<E: Environment> {
 ...
}

struct NetworkInterface<E: Environment> {
  func fetchJobListings() async throws -> [JobListing<E>] {
    let (data, response) = try await urlSession.data(from: E.host.appendingPathComponent("listings"))

    guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
      throw BadResponse(response)
    }

    return try JSONDecoder().decode([JobListing<E>].self, from: data)
  }

  func fetchApplicants(for listing: JobListing<E>) async throws -> [Applicant<E>] {
    let (data, response) = try await urlSession.data(from: E.host.appendingPathComponent("listings/\(listing.id)/applicants"))

    guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
      throw BadResponse(response)
    }

    return try JSONDecoder().decode([Applicant<E>].self, from: data)
  }

  func fetchJobListings(for applicant: Applicant<E>) async throws -> [JobListing<E>] {
    let (data, response) = try await urlSession.data(from: E.host.appendingPathComponent("applicants/\(applicant.id)/listings"))

    guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
      throw BadResponse(response)
    }

    return try JSONDecoder().decode([JobListing<E>].self, from: data)
  }

  init(
    urlSession: URLSession = .shared
  ) {
    self.urlSession = urlSession
  }

  let urlSession: URLSession
}

So here, we use protocols, but in a different place: we’re expressing that the variation is not in the network interface itself. There’s only one network interface. The variation is the environment. Making NetworkInterface abstract is not correct because we don’t want that to vary. We’re not going to build a system that connects to anything other than one of these environments, all of which have an HTTP backend with identical RESTful APIs. The web protocol, the resource paths, the shape of the responses, none of that varies. Only the environment varies. Not only can we not swap out individual function implementations, we can’t even swap out a network interface. There’s only one concrete type (parameterized by its environment).
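Usage then looks like this (a sketch; demo is a throwaway function, and the commented-out line is the mistake we now catch at compile time):

func demo() async throws {
  let staging = NetworkInterface<Environments.Staging>()
  let production = NetworkInterface<Environments.Production>()

  let listing = try await staging.fetchJobListings()[0]
  _ = try await staging.fetchApplicants(for: listing)

  // _ = try await production.fetchApplicants(for: listing)
  // ^ compile error: JobListing<Environments.Staging> is not JobListing<Environments.Production>
}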

In fact… why did we even think to make NetworkInterface abstract to begin with? We don’t even have multiple environments yet!

Ohh, right… testing.

The Interaction of Testing and Design

That’s a weird use case for this. We’re making things abstract specifically so we can break them… if we consider faking to be a breakage, which we should, as our customers certainly would if we accidentally shipped that to them. So then all these extra capabilities, like individually hot-swapping functions… sure, that creates network interfaces that are obviously broken, as any mix-and-matched one would have to be. But that’s the goal: we want to write a test where we can inject fake (a type of broken) behavior into specific parts of the code.

Testing is uniquely challenging in Swift, compared to other industry languages. In other languages the runtime system is dynamic enough that you can mess with objects at runtime by default. This is how you’re able to build mocking frameworks that can take any method on any class and rewrite it while the test is running. Back in Objective-C days we had OCMock. A major pain point in converting ObjC code with tests to Swift is that you can’t do this anymore.

Why is Swift uniquely difficult in this way (it is comparable to C++)? Because it is so much more strongly typed and compile time safe, which also affects how it is compiled. In other languages basically everything is dynamic dispatch, but in Swift a lot of stuff is static dispatch, so there is literally no place for tests to hook into and redirect that dispatch (at runtime at least). To allow a test to redirect something at runtime, you have to write Swift code specifically to abandon some of that compile time safety… or write Swift code that is compile time polymorphic instead of run time polymorphic, a.k.a. generics.

It’s very unfortunate, then, that we’ve introduced the ability to construct broken network interfaces into the production code just so we can do these things in tests. Wouldn’t it be much better if we could isolate this God mode cheat where we can break the invariants to just when the tests run? How might we do that?

Well, first let’s back up a moment here. What are we writing tests for? To prove the system works? To test-drive the features? It’s a popular idea in TDD circles that “testability” should dictate design. Now this evolved out of a belief that “testable design” is really just synonymous with “good design”… that a design that is untestable is untestable because it has design flaws. Testing is therefore doubly valuable because it additionally forces design improvements. Basically, testability reveals a bunch of places where you hardcoded stuff you shouldn’t have (like dependencies, which should be injected because that makes all your code far more reusable anyways).

But that concept can get carried away, to the point that your design becomes a slave to “testability”. You introduce abstractions, the capability for variation, customization, swapping out dependencies, all over the place just so you can finely probe the running system with test doubles. Even though we not only don’t need that variation but it would be fundamentally incorrect for those things to vary, we introduce the capability anyway, just so we can construct intentionally broken (not shippable to production) configurations of our code to run tests against.

Again, this wasn’t really a concern in the industry languages TDD evolved in, because in those languages it’s not even possible to avoid having the capability for variation literally everywhere (people eventually figured out how to mock even ostensibly static calls in Java). C++ devs might have known about it, and Swift brought the issue to the surface for a different audience.

There is a lot of debate over different testing strategies, including whether it’s a good idea in the first place for tests to “see” things customers would never be able to see, or for tests to run against anything other than the exact code that customers are going to use. The argument is simple: the more you screw with the configuration of code just to test it, the less accurate that test will be because no one in the wild is using that configuration. On the other hand, black box testing the exact production build with no special test capabilities or probes inserted is really hard and tends to produce brittle, slow, shallow tests that can’t even reliably put the system into a lot of the initial conditions we want to test.

So it really is worth asking ourselves: what exactly do we need to test that requires replacing implementations of NetworkInterface functions with test doubles? After all, this is the interface into a different system that we communicate with over the wire. Can’t we do the interception and substitution at the wire? Can’t we take full control over what responses NetworkInterface is going to produce by controlling the HTTP server at the base URL? Is that maybe a better way to test here, because then we don’t have to make NetworkInterface abstract, because arguably it shouldn’t be, there should not be a NetworkInterface that does anything other than make those specific HTTP calls?

You can do all the mixing, matching and hot swapping you want this way. As long as you can control your HTTP server in your tests, you’re good to go. If you run the server in the test process, you can supply closures that capture state in your test functions.
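One way to do that interception inside the test process, without standing up a separate server, is Foundation’s URLProtocol hook. This is a minimal sketch, not necessarily the setup the author has in mind; the handler property is my own addition, and the registration steps are shown in the trailing comments:

import Foundation

final class StubURLProtocol: URLProtocol {
  // The test assigns this to decide what bytes come back for each request.
  static var handler: ((URLRequest) throws -> (HTTPURLResponse, Data))?

  override class func canInit(with request: URLRequest) -> Bool { true }
  override class func canonicalRequest(for request: URLRequest) -> URLRequest { request }

  override func startLoading() {
    guard let handler = Self.handler else { return }
    do {
      let (response, data) = try handler(request)
      client?.urlProtocol(self, didReceive: response, cacheStoragePolicy: .notAllowed)
      client?.urlProtocol(self, didLoad: data)
      client?.urlProtocolDidFinishLoading(self)
    } catch {
      client?.urlProtocol(self, didFailWithError: error)
    }
  }

  override func stopLoading() {}
}

// In a test:
// let config = URLSessionConfiguration.ephemeral
// config.protocolClasses = [StubURLProtocol.self]
// let session = URLSession(configuration: config)
// let network = NetworkInterfaceLive(urlSession: session, host: URL(string: "https://example.test")!)
// StubURLProtocol.handler = { request in ... }  // closes over test state, just like before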

Okay so this is a pretty specific issue with this specific example. But then again, this is the specific example used to motivate the desire to mix and match and hot swap methods in our network interface. I think if we ignored the testing use case, it would become clear that this is a capability we very much do not want, and then protocol witnesses are plainly the wrong design specifically because they allow this.

But let’s say you really don’t want to spin up a local HTTP server to power your tests. I get it: it’s a pain in the a** to do that, and a heck of a lot easier to just hot-swap the functions. On the other hand, you aren’t exercising your production NetworkInterface, even though it would be nice for your tests to uncover problems in there too. And then it’s possible for your faked implementations to do things the real one simply wouldn’t be able to do. For example, the live implementation can throw a BadResponse error, whatever URLSession.data(from:) can throw, and whatever JSONDecoder.decode can throw. If you swap in a fake function that throws something else, and an explosion occurs… so what? Why does the rest of the code need to prepare for that eventuality even though it would never (and perhaps should never) happen?

Well, if NetworkInterface is a protocol, you can’t be sure that would never happen. After all, by having production code couple to a protocol, it’s coupling to implementations that can do anything. If NetworkInterface were concrete we’d know that isn’t possible, and we wouldn’t waste time ensuring production code deals with it. Correspondingly, we wouldn’t be able to write a test that throws something else. We could only control the response by controlling what the server returns, which can only trigger one of the errors the live implementation can throw… and we’d be sure whatever scenario we created is one that can really happen and that we really need to deal with.

And so on… but anyways, you’re just not going to spin up an HTTP server, so can we somehow isolate these special hot-swapping capabilities to just the tests? Yes. To do so we still need to make NetworkInterface a protocol, which makes the production code potentially way more open-ended than we want it to be… but we can try to express that:

  • The entire production code should all use one implementation of NetworkInterface. That is, we shouldn’t be able to switch from one type to another in the middle of the app
  • The only implementation that should be available in production code is the live implementation

The second part is easy: we’re going to use the protocol model, the Live implementation will be in production code, and a Test implementation will be defined in the test suite. The Live implementation will look just like it did originally. The Test one will look like this:

struct NetworkInterfaceTest: NetworkInterfaceProtocol {
  var fetchJobListingsImp: () async throws -> [JobListing]
  func fetchJobListings() async throws -> [JobListing] {
    try await fetchJobListingsImp()
  }

  var fetchApplicantsImp: (_ listing: JobListing) async throws -> [Applicant]
  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    try await fetchApplicantsImp(listing)
  }

  var fetchJobListingsForApplicantImp: (_ applicant: Applicant) async throws -> [JobListing]
  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    try await fetchJobListingsForApplicantImp(applicant)
  }
}

extension NetworkInterfaceTest {
  static var happyPathFake: Self {
    .init(
      fetchJobListingsImp: { StubData.jobListings },
      fetchApplicantsImp: { listing in StubData.applicants[listing.id]! },
      fetchJobListingsForApplicantImp: { applicant in StubData.applicantListings[applicant.id]! }
    )
  }
}

See, we’re still using the protocol witness approach, and we’re still building factories for creating specific common configurations, which we can always reconfigure or customize later. But we’re concentrating it in a specific implementation of the protocol. This expresses that in general a NetworkInterface has cohesion among its functions, and they therefore cannot be individually selected. But this individual configuration capability exists specifically in one implementation of NetworkInterface. By defining NetworkInterfaceTest in the test target instead of the prod target, we’re keeping this, and all its added capabilities, out of the production code.

For the first point, since NetworkInterface is a protocol, there’s nothing stopping you from creating new implementations and using them in specific places in production code. Maybe it’s enough to define only the Live one in prod code, but substituting a different (and definitely wrong) one at one specific place (like in a Model for one screen, or component on a screen, buried deep in the structure of the app) would be easy to miss because you’d have to go drill down into that specific place to see the mistake.

Is it not a requirement of the production code that everything uses the same type of network interface? It would always be incorrect to switch from one type to another for one part of the app. Even in tests, aren’t we going to swap in the fake server for everything? Can we somehow express this in the design, to make it a compiler error to mix multiple types? If we could, then picking the wrong type would affect the app everywhere, and that would probably be immediately obvious upon the most trivial testing.

The reason it would be possible to switch network types mid-way is because every UI model stores an instance of the protocol existential, like let network: NetworkInterfaceProtocol, which is really shorthand for let network: any NetworkInterfaceProtocol. That any box is able to store, well, any implementing type, and two different boxes can store two different types. How do we make them store the same type?

With generics.

Instead of this:

final class SomeComponentModel {
  ...

  let otherComponentModel: OtherComponentModel

  private let network: any NetworkInterfaceProtocol
}

final class OtherComponentModel {
  ...

  private let network: any NetworkInterfaceProtocol
}

We do this:

final class SomeComponentModel<Network: NetworkInterfaceProtocol> {
  ...

  let otherComponentModel: OtherComponentModel<Network>

  private let network: Network
}

final class OtherComponentModel<Network: NetworkInterfaceProtocol> {
  ...

  private let network: Network
}

See, the generic parameter links the two network members of the two component models together: they can still vary, we can make component models for any network interface type, but when we select one for SomeComponentModel, we thereby select the same one for OtherComponentModel. Do this throughout all the models in your app, and you express that the entire app structure works with a single type of network interface. You select the type for the entire app, not for each individual component model.
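Concretely, the selection happens once, at the root of the tree. A sketch with hypothetical stand-in models (the real component models above would work the same way):

final class AppModel<Network: NetworkInterfaceProtocol> {
  let screenModel: ScreenModel<Network>

  init(network: Network) {
    self.screenModel = ScreenModel(network: network)
  }
}

final class ScreenModel<Network: NetworkInterfaceProtocol> {
  private let network: Network

  init(network: Network) {
    self.network = network
  }
}

// Production composes the whole tree with the live type:
// let app = AppModel(network: NetworkInterfaceLive(host: prodUrl))
// The test target composes the same tree with the test type:
// let app = AppModel(network: NetworkInterfaceTest.happyPathFake)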

This is an example of “fail big or not at all”.  If you can’t eliminate an error from the design, try to make it a nuclear explosion in your face so you immediately catch and correct it.  The worst kinds of failures are subtle ones that fly under the radar. If your response to this is “just cover everything with tests”, well… I’ve yet to see any GUI app come anywhere close to doing that. Plus if you aren’t categorizing in your mind compilation errors as failed tests, and the type system you create as a test suite, you’re not getting the point of static typing.

Whether this is worth it to you, well that depends. How determined are you to prevent mistakes like creating new network interface types and using them in tucked away parts of the app (is this something you think is appreciably risky?), and how much does making all the types in your app generic bother you (maybe you don’t like the bracket syntax, or you have other aesthetic or design philosophy reasons to object to a proliferation of generics in a codebase)?

I have been steadily evolving for the past 3 years toward embracing generics, and I mean really embracing them: I have no problem with every single type in my app being generic, and having multiple type parameters (if you want to see code that works this way, just look at SwiftUI). I used to find the brackets, extra syntactic boilerplate, and cognitive load of working out constraints (where clauses) alarming or tedious. Then I got used to it, it became second nature to me, and now I see no reason not to. It leads to much more type safe code, and greatly expands the scope of rules I can tell the compiler about, that the compiler then enforces for me. I don’t necessarily think inventing a new network interface for an individual component is a likely bug to come up, but the cost of preventing it is essentially zero, and I additionally document through the design the requirement that the backend selection is app-wide, not per-component.

(The same evolution is happening in my C++ codebases: everything is becoming templates, especially once C++20 concepts showed up).

And this all involves using the two most powerful, expressive, and, in my opinion, important, parts of Swift: protocols and generics.

It’s Not a Choice Over Protocols

Is this “protocol oriented programming”? And if it is, does this mean I believe protocol oriented programming is a panacea? I don’t know, maybe. But the protocol witness design hasn’t removed protocols, it’s just moved them around. The issue isn’t whether to use protocols or not, it’s where they properly belong, which is determined by the correct conceptual model of your system, specifically in what should really be abstract.

Where are the protocols in the protocol witness version? After all, the keyword protocol is nowhere to be found in that code. How can I claim it’s still using protocols?

Because closures are protocols.

What else could they be? They’re abstract, aren’t they? I can’t write let wtf = (() async throws -> [JobListing]).init(). They express a variation: a closure variable holds a closure body, but it doesn’t specify which one. Even the underlying mechanisms are the same. When you call a closure, the compiler has to insert a dynamic dispatch mechanism. That means a witness table (there’s that word “witness”, we’ll get to that later), which lists out the ways a particular instance fulfills the requirements of a protocol.

This is more obvious if we think about how we would implement closures if the language didn’t directly support them. We’d use protocols:

protocol AsyncThrowingCallable0<R> {
  associatedtype R

  func callAsFunction() async throws -> R
}

protocol AsyncThrowingCallable1<T, R> {
  associatedtype T
  associatedtype R

  func callAsFunction(_ arg0: T) async throws -> R
}

struct NetworkInterface {
  var fetchJobListings: any AsyncThrowingCallable0<[JobListing]>

  var fetchApplicants: any AsyncThrowingCallable1<JobListing, [Applicant]>

  var fetchJobListingsForApplicant: any AsyncThrowingCallable1<Applicant, [JobListing]>
}

extension NetworkInterface {
  static func live(
    urlSession: URLSession = .shared,
    host: URL
  ) -> Self {
    struct FetchJobListings: AsyncThrowingCallable0 {
      let urlSession: URLSession
      let host: URL

      func callAsFunction() async throws -> [JobListing] {
        let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings"))

        guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
          throw BadResponse(response)
        }

        return try JSONDecoder().decode([JobListing].self, from: data)
      }
    }

    struct FetchApplicants: AsyncThrowingCallable1 {
      let urlSession: URLSession
      let host: URL

      func callAsFunction(_ listing: JobListing) async throws -> [Applicant] {
        let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings/\(listing.id)/applicants"))

        guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
          throw BadResponse(response)
        }

        return try JSONDecoder().decode([Applicant].self, from: data)
      }
    }

    struct FetchJobListingsForApplicant: AsyncThrowingCallable1 {
      let urlSession: URLSession
      let host: URL

      func callAsFunction(_ applicant: Applicant) async throws -> [JobListing] {
        let (data, response) = try await urlSession.data(from: host.appendingPathComponent("applicants/\(applicant.id)/listings"))

        guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
          throw BadResponse(response)
        }

        return try JSONDecoder().decode([JobListing].self, from: data)
      }
    }

    return .init(
      fetchJobListings: FetchJobListings(urlSession: urlSession, host: host),
      fetchApplicants: FetchApplicants(urlSession: urlSession, host: host),
      fetchJobListingsForApplicant: FetchJobListingsForApplicant(urlSession: urlSession, host: host)
    )
  }
}

This is something I think every Swift developer should see at least once, because it shows the underlying mechanism of closures and helps demystify them. We have to implement callability, which we do as a protocol requirement (even if we didn’t have callAsFunction to improve the syntax, we could name the function something like invoke and it still works, we just have to write out invoke to call a closure). We also have to implement capturing, which is fundamentally why these are protocols (whose implementations can be any struct with any amount of instance storage) and not just function pointers.

Now, am I just being pedantic? Can’t I just as well conceive of closures as being function pointers, instead of protocols? No, the specific reason a closure is not just a function pointer is because of capture. I could not keep urlSession and host around and accessible in the live implementation’s methods if they were just function pointers. And what exactly gets captured varies across the implementations.

The key capability of the protocol design is that each implementing type can have its own set of (possibly private) fields, giving each one a different size and layout in memory. The protocol witness itself can’t do this. It’s a struct and its only fields are the methods. Every instance is going to have the exact same layout in memory. How, then, can we possibly give the live version more instance fields? By stuffing them inside the captures of the methods, which requires the methods to support variation in their memory layout. The protocoliness of closures is, in fact, crucial to the protocol witness approach working at all.

So, with this clear, we can plainly see that we aren’t choosing protocols vs. no protocols. Rather, we’re choosing one protocol with three requirements at the top vs. three protocols with one requirement (being callable) on the inside. It’s no different than any other modeling problem where you try to work out what the correct way to conceptualize the system you’re building. What parts of it are abstract? Do we have a single abstraction here, or three separate abstractions placed into a concrete aggregate?

If this is your way to avoid being “protocol oriented”, well… I have some bad news for you!

Since we’re on the subject, could we implement either one of these conceptual models (the single abstraction with multiple requirements or the multiple abstractions, each with a single requirement) without any protocols? That might seem plainly impossible. How can we model abstraction without using the mechanism of abstraction that Swift gives us? Well, you can do it, because of course abstraction itself has an implementation, and we can always do that by hand instead of using the language’s version of it.

I really, really hope no one actually suggests this would ever be a good idea. And I’m more nervous about that than you might think because I’ve seen a lot of developers who seem to have forgotten that dynamic dispatch exists and insist on writing their own dispatch code with procedural flow, usually by switching over enums. I hesitate to even show this because I might just be giving them ideas. But if you promise me you’ll take this as educational only and not go start replacing protocols in your code with what you’re about to see… let’s proceed.

This is, of course, how you would model this if you were using a language that simply didn’t have an abstract type system… like C. How would you implement this system in C? Well, that would be a syntactic translation of what you’re about to see in Swift:

enum NetworkInterfaceType {
  case live(urlSession: URLSession, host: URL)
  case happyPathFake
}

struct NetworkInterface {
  let type: NetworkInterfaceType

  func fetchJobListings() async throws -> [JobListing] {
    switch type {
      case let .live(urlSession, host):
        let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings"))

        guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
          throw BadResponse(response)
        }

        return try JSONDecoder().decode([JobListing].self, from: data)

      case .happyPathFake:
        return StubData.jobListings
    }
  }

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    switch type {
      case let .live(urlSession, host):
        let (data, response) = try await urlSession.data(from: host.appendingPathComponent("listings/\(listing.id)/applicants"))

        guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
          throw BadResponse(response)
        }

        return try JSONDecoder().decode([Applicant].self, from: data)

      case .happyPathFake:
        return StubData.applicants[listing.id]!
    }    
  }

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    switch type {
      case let .live(urlSession, host):
        let (data, response) = try await urlSession.data(from: host.appendingPathComponent("applicants/\(applicant.id)/listings"))

        guard let response = response as? HTTPURLResponse, response.statusCode == 200 else {
          throw BadResponse(response)
        }

        return try JSONDecoder().decode([JobListing].self, from: data)

      case .happyPathFake:
        return StubData.applicantListings[applicant.id]!
    }    
  }
}

For those curious: how would you do this in C, particularly the enum with associated values, something that does not translate directly to C? You would use a union of all the associated value tuples (structs) for each case, and have a plain enum for tracking which value is currently stored in the union (this is basically how enums with associated values are implemented under the hood in Swift).

How would you even add a new test double to this? You’d have to define a new enum case, and add the appropriate switch cases to each method. Yes, you’d have to do this for each and every “concrete implementation” you come up with. The code for every possible implementation of the network interface would all coexist in this giant struct, and in giant methods that cram all those different varying implementations next to each other.

How would you capture state from where you define these other implementations? Ohh… manually, by adding the captured values as associated values of that particular enum case, that you’d then have to pass in by hand. Every variation of state capture would need its own enum case!

Please, please please don’t start doing this, please!

Again, my anxiety here is well founded. I’ve seen plenty of Swift code where an enum is defined whose name ends with Type, it’s added as a member to another type X, and several methods on X are implemented by switching over the enum. If you’ve ever done this… I mean if you’ve ever written an enum whose name is WhateverType and made one a member of the type Whatever, then you’ve invented your own type system. After all, wouldn’t it make more sense to embed the enum inside the other type and remove the redundant prefix?

struct Whatever {
  enum Type {
    ...
  }

  let type: Type
  ...
}

Try it. See what the compiler tells you. Whatever already has an inner typealias named Type, which is the metatype: literally Whatever.Type, that Swift defines for every type you create.

You’re inventing your own type system!

Why you’re doing this instead of using the type system the language ships with… I just don’t know.

There are other ways you could do this that might be slightly less ridiculous than cramming all the bodies of all the implementations of the abstract type next to each other in a single source file, and even in single functions. You could, for example, use function pointers, and pass Any parameters in to handle captures. By that point you’re literally just implementing dynamic dispatch and abstract types.

To be clear, this implements the first design of a single protocol with three requirements. If we wanted to implement the protocol witness this way, we’d need three separate structs with their own Type enums, so they can vary independently.

So… yeah. Good thing we decided to use protocols. I don’t know about you, but I’m very happy we don’t write iOS apps in C.

I do see this as being peripherally related to the protocol witness stuff, particularly the conception of it as a choice over language mechanics, because that is also inventing your own type system. As I pointed out earlier, the abstract type and multiple concrete types still exist. The compiler just doesn’t know about it. You’re still typing things. You’re just doing it without making your conceptual “types” literally types as defined in Swift.

I don’t know why you’d want to invent your own type system, and I doubt there is ever a good reason to. Doing so has very surprising implications, it’s an expert level technique that rapidly gets complicated beyond the most trivial use cases, and all it really does is shut off a large part of compile time verification and force you to do type validation at runtime (and probably fatalError or something on validation failures). If you’re doing this to circumvent the rules of a static type system, like for example that you can’t replace the definition of a type’s function at runtime or construct instances that are stitched together from multiple different types (in other words, static types are static)… well you absolutely should not be doing that, the type system works the way it does for a reason and you should stop trying to break it.

What’s a “Witness” Anyways?

Speaking of inventing your own version of what Swift already gives you… why is this technique called a “protocol witness”? If you Google “Swift” and “witness” you’ll likely encounter discussions about how Swift is implemented, something involving a “witness table”. You may have even seen a mention of such a thing in certain crashes, either the debug message that gets printed or somewhere in the stack trace.

To understand that, let’s think about what happens when we declare a variable like let network: any NetworkInterfaceProtocol. Now, today, you may or may not have to write any here. Whether you do is based on some rather esoteric (albeit recent) history of Swift. Basically, if you were allowed to write this at all before Swift 5.7, you don’t have to write any, but you can. If something about the protocol prevented you from using it as a variable type before Swift 5.7, specifically before this proposal was implemented, then you have to write any.

I’m harping on this because it’s of critical importance here. This is the correct way to think about it: the any is always there, and always has been, wherever you use a protocol as a variable type. This just was, and still is in many places, implied by the compiler, so writing it out is optional. After all, the compiler can already tell that a protocol is being used as a variable type, so there’s no syntactic ambiguity by omitting it (similar to how you can omit the type of a variable if it gets initialized inline).

How else can you use a protocol besides as the type of a variable? To declare conformance (i.e. to say NetworkInterfaceLive conforms to NetworkInterfaceProtocol) and in a generic constraint. Those two places involve the protocol itself. That is completely different from the protocol name with any before it. If P is a protocol, then any P is an existential box that holds an instance of any concrete type that implements P.
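
To see all three side by side (CannedNetworkInterface is a throwaway conforming type I’m making up purely for illustration):

// 1. Declaring conformance: the protocol itself.
struct CannedNetworkInterface: NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing] { [] }
  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] { [] }
  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] { [] }
}

// 2. A generic constraint: also the protocol itself.
func refresh<N: NetworkInterfaceProtocol>(using network: N) async throws -> [JobListing] {
  try await network.fetchJobListings()
}

// 3. The type of a variable: really the existential box, whether or not you write `any`.
let explicitBox: any NetworkInterfaceProtocol = CannedNetworkInterface()
let implicitBox: NetworkInterfaceProtocol = CannedNetworkInterface() // `any` implied (newer language modes may ask you to spell it out)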

What is an “existential box”? I don’t want to get too deep into the implementation details of Swift’s type system (I’m writing another article series for that), but just think of it this way: P is abstract, which literally means no variable can have P as its type. But every variable has to have some concrete type, and it has to be static: the type of a variable var theBox: any P can’t change as you put different concrete implementations of P inside it. So the type can’t be whatever concrete type you put in it.

That’s why we have to have a special concrete type, any P, for holding literally any P. Imagine for a moment the compiler just didn’t let you do this. You just aren’t allowed to declare a variable whose type is a protocol. But you need to be able to store any instance that implements a protocol in a variable. What would you do? Let’s say we have the protocol:

protocol MyProtocol {
  var readOnlyMember: Int { get }
  var readWriteMember: String { get set }

  func method1() -> Int
  func method2(param: Int) -> String
}

Now, the point of trying to declare a variable var theBox: MyProtocol is so we can call the requirements, like theBox.readOnlyMember, or theBox.method1(). We need to create a type that lets us do this, but those calls actually dispatch to any implementing instance we want. Let’s try this:

struct AnyMyProtocol {
  var readOnlyMember: Int { readOnlyMember_get() }
  var readWriteMember: String { 
    get { readWriteMember_get() }
    set { readWriteMember_set(newValue) }
  }

  func method1() -> Int { method1_imp() } 
  func method2(param: Int) -> String { method2_imp(param) }  

  init<P: MyProtocol>(_ value: P) {
    readOnlyMember_get = { value.readOnlyMember }
   
    readWriteMember_get = { value.readWriteMember }
    readWriteMember_set = { newValue in value.readWriteMember = newValue }

    method1_imp = { value.method1() } 
    method2_imp = { param in value.method2(param: param) } 
  }

  private let readOnlyMember_get: () -> Int

  private let readWriteMember_get: () -> String
  private let readWriteMember_set: (String) -> Void

  private let method1_imp: () -> Int
  private let method2_imp: (Int) -> String
}

What’s happening here is we store individual closures for each of the requirements defined by the protocol (all getters and setters of vars and all funcs). We define a generic initializer that takes an instance of some concrete implementation of MyProtocol, and we assign all these closures to forward their calls to that implementation.

This is called a type eraser, because that’s exactly what it’s doing. It’s dropping knowledge of the concrete type (whatever the generic parameter P was bound to when the init was called) while preserving the capabilities of the instance. That way, whoever is using the AnyMyProtocol doesn’t know which concrete implementation of MyProtocol those calls are being forwarded to.

But hold on… this isn’t correct. If you try compiling this code, you’ll see it fails. Specifically, we’re trying to assign readWriteMember on the value coming into the initializer. But value is a function parameter, and therefore read-only. We need to store our own mutable copy first, and make sure to send all our calls to that copy. We can do that by simply shadowing the variable coming in:

  init<P: MyProtocol>(_ value: P) {
    var value = value

    readOnlyMember_get = { value.readOnlyMember }
   
    readWriteMember_get = { value.readWriteMember }
    readWriteMember_set = { newValue in value.readWriteMember = newValue }

    method1_imp = { value.method1() } 
    method2_imp = { param in value.method2(param: param) } 
  }

What exactly is happening here? Well, when the init is called, we create the local variable, which makes a copy of value. All those closures capture value by reference (because we didn’t explicitly add [value] in the capture list to capture a copy), so we are able to write back to it. Since those closures are escaping, Swift sees that this local variable is captured by reference in escaping closures, which means that variable has to escape too. So it actually allocates that variable on the heap, allowing it to live past the init call. It gets put in a reference counted box, retained by each closure, and the closures are then retained by the AnyMyProtocol instance being initialized. When this instance is discarded, all the closures are discarded, the reference count of this value instance goes to 0, and it gets deallocated.
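
You can see this sharing behavior in miniature with any pair of closures that capture the same local variable:

func makeCounter() -> (increment: () -> Void, read: () -> Int) {
  var count = 0 // captured by reference by both closures, so it gets boxed on the heap

  return (
    increment: { count += 1 },
    read: { count }
  )
}

let counter = makeCounter()
counter.increment()
print(counter.read()) // 1: both closures share the same boxed variable, which outlived makeCounter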

In a rather convoluted way, this effectively sneaks value into the storage for the AnyMyProtocol instance. It’s not literally inside the instance, it just gets attached to it and has the same lifetime.
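
To make the next snippet concrete, imagine a trivial value type conforming to MyProtocol (SomeConcreteStruct is just a stand-in I’m sketching here):

struct SomeConcreteStruct: MyProtocol {
  var readOnlyMember: Int { 42 }
  var readWriteMember: String = "initial"

  func method1() -> Int { 1 }
  func method2(param: Int) -> String { String(param) }
}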

Now think about what happens here:

let box1 = AnyMyProtocol(SomeConcreteStruct())
var box2 = box1

box2.readWriteMember = "Uh oh..."

print(box1.readWriteMember)

What gets printed? This: "Uh oh...".

We assigned readWriteMember on box2, and it affected the same member on box1. Evidently AnyMyProtocol has reference semantics, even though the value it’s erasing, a SomeConcreteStruct, has value semantics. This is wrong. When we assign box1 to box2, it’s supposed to make a copy of the SomeConcreteStruct instance inside box1. But above I explained that the actual boxed value is the var value inside the init, which is placed in a reference counted, heap-allocated box to keep it alive as long as the closures that capture it are alive. This has to happen: this value must have reference semantics, because the closures have to share it. When the readWriteMember_set closure is called and writes to value, that has to be “seen” by a subsequent call to the readWriteMember_get closure.

But while we need the value to be shared among the various closures in a single instance of AnyMyProtocol, we don’t want the value to be shared across instances. How do we fix this?

We have to put the erased value in the AnyMyProtocol instance as a member, and we need that member to be the one the closures operate on, which means it needs to be passed into the closures:

struct AnyMyProtocol {
  var readOnlyMember: Int { readOnlyMember_get(erased) }
  var readWriteMember: String { 
    get { readWriteMember_get(erased) }
    set { readWriteMember_set(&erased, newValue) }
  }

  func method1() -> Int { method1_imp(erased) } 
  func method2(param: Int) -> String { method2_imp(erased, param) }  

  init<P: MyProtocol>(_ value: P) {
    erased = value

    readOnlyMember_get = { erased in (erased as! P).readOnlyMember }
   
    readWriteMember_get = { erased in (erased as! P).readWriteMember }
    readWriteMember_set = { erased, newValue in 
      var value = erased as! P
      value.readWriteMember = newValue
      erased = value
    }

    method1_imp = { erased in (erased as! P).method1() } 
    method2_imp = { erased, param in (erased as! P).method2(param: param) } 
  }

  private var erased: Any

  private let readOnlyMember_get: (Any) -> Int

  private let readWriteMember_get: (Any) -> String
  private let readWriteMember_set: (inout Any, String) -> Void

  private let method1_imp: (Any) -> Int
  private let method2_imp: (Any, Int) -> String
}

Here we see the appearance of Any. What’s that? It’s Swift’s type erasing box for literally anything. It’s what implements the functionality of keeping track of what’s inside the box and properly copying it when assigning one Any to another. The closures now take what is effectively a self parameter. The closures are going to be shared among copies of the AnyMyProtocol; we can’t change that. So if they’re shared, and we don’t want them to operate on a shared P instance, we have to pass in the particular P instance we want them to operate on.

In fact, these aren’t closures anymore because they aren’t capturing anything. And that’s fortunate, because we’re trying to solve the “Swift doesn’t let you use abstract types as variable types” hypothetical, which would ban closures too. By eliminating capture, these are now just plain old function pointers, which aren’t abstract types.

Now, we can collect these function pointers into a struct:

struct AnyMyProtocol {
  var readOnlyMember: Int { witnessTable.readOnlyMember_get(erased) }
  var readWriteMember: String { 
    get { witnessTable.readWriteMember_get(erased) }
    set { witnessTable.readWriteMember_set(&erased, newValue) }
  }

  func method1() -> Int { witnessTable.method1_imp(erased) } 
  func method2(param: Int) -> String { witnessTable.method2_imp(erased, param) }  

  init<P: MyProtocol>(_ value: P) {
    erased = value

    witnessTable = .init(
      readOnlyMember_get: { erased in (erased as! P).readOnlyMember },
      readWriteMember_get: { erased in (erased as! P).readWriteMember },
      readWriteMember_set: { erased, newValue in 
        var value = erased as! P
        value.readWriteMember = newValue
        erased = value
      },
      method1_imp: { erased in (erased as! P).method1() },
      method2_imp: { erased, param in (erased as! P).method2(param: param) }
    )
  }

  struct WitnessTable {
    let readOnlyMember_get: (Any) -> Int

    let readWriteMember_get: (Any) -> String
    let readWriteMember_set: (inout Any, String) -> Void

    let method1_imp: (Any) -> Int
    let method2_imp: (Any, Int) -> String
  }

  private var erased: Any
  private let witnessTable: WitnessTable
}

The specific WitnessTable instance being created doesn’t depend on anything except the type P, so we can move it into an extension on MyProtocol:

extension MyProtocol {
  static var witnessTable: AnyMyProtocol.WitnessTable {
    .init(
      readOnlyMember_get: { erased in (erased as! Self).readOnlyMember },
      readWriteMember_get: { erased in (erased as! Self).readWriteMember },
      readWriteMember_set: { erased, newValue in 
        var value = erased as! Self
        value.readWriteMember = newValue
        erased = value
      },
      method1_imp: { erased in (erased as! Self).method1() },
      method2_imp: { erased, param in (erased as! Self).method2(param: param) }
    )
  }
}

Then the init is just this:

  init<P: MyProtocol>(_ value: P) {
    erased = value
    witnessTable = P.witnessTable
  }

Does this idea of a table of function pointer members, one for each requirement of a protocol, sound familiar to you?

Hey! This is a protocol witness! “Witness” refers to the fact that the table “sees” the concrete P and its concrete implementations of all those requirements, recording the “proof”, so to speak, of those implementations in the form of the function pointers. When someone else comes along and asks, “hey, does your value implement this requirement?”, the witness answers, “yes! I saw that earlier. Here, look at this closure, this is what its implementation is”.

Well, this is exactly what the Swift compiler writes for you, for every protocol that you write. The only difference is that theirs is called any MyProtocol instead of AnyMyProtocol. The compiler creates a witness table for every protocol and uses it in its existential boxes to figure out where to jump to.

At any point an existential box will hold a pointer to a particular witness table, and when you call a method on it, the compiled code goes to the witness table, grabs the function pointer at the right index (depending on what requirement you’re invoking), and then jumps to that function pointer.

The box is, in fact, implementing dynamic dispatch, which always requires some type of runtime lookup mechanism. The members of the witness table are function pointers, which are just the addresses of the first line of the compiled code for the bodies. Calling one just jumps the execution to that address. Every function has a known compile time constant address, so if you want dynamic dispatch, you have to store a function pointer in a variable. That’s what the witness table is.

There are things the compiler can make its box do that we can’t make our box do, and vice versa. The compiler makes its box transparent with respect to casting: we can directly downcast an instance of any P to P1 (where P1 is a concrete implementer of P), and the compiler checks what it has in the box and, if it matches, pulls that out and gives it to us. We can’t make our own box do that, at least not transparently. On the other hand, the compiler never conforms its box to any protocols (not even the protocol being erased: any P does not conform to P; you may have seen compiler errors about this, which used to have much worse messages like P as a type cannot conform to itself, which is pretty freaking confusing!). You can conform your box to whatever protocols you want.

If we wrote our own existential box for NetworkInterfaceProtocol, it would look like this:

struct AnyNetworkInterface: NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing] {
    try await fetchJobListings_imp()
  }

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    try await fetchApplicants_imp(listing)
  }

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    try await fetchJobListingsForApplicant_imp(applicant)
  }

  init<Erasing: NetworkInterfaceProtocol>(erasing: Erasing) {
    fetchJobListings_imp = { try await erasing.fetchJobListings() } 
    fetchApplicants_imp = { listing in try await erasing.fetchApplicants(for: listing) } 
    fetchJobListingsForApplicant_imp = { applicant in try await erasing.fetchJobListings(for: applicant) } 
  }

  private let fetchJobListings_imp: () async throws -> [JobListing]
  private let fetchApplicants_imp: (JobListing) async throws -> [Applicant]
  private let fetchJobListingsForApplicant_imp: (Applicant) async throws -> [JobListing]
}

// Convenience function, so we can box an instance with `value.erase()` instead of `AnyNetworkInterface(erasing: value)`
extension NetworkInterfaceProtocol {
  func erase() -> AnyNetworkInterface {
    .init(erasing: self)
  }
}
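
Using it looks something like this: the concrete type disappears behind the box at the call site.

let network: AnyNetworkInterface = NetworkInterfaceHappyPathFake().erase()

func loadListings(from network: AnyNetworkInterface) async throws -> [JobListing] {
  // The caller has no idea which concrete NetworkInterfaceProtocol is doing the work.
  try await network.fetchJobListings()
}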

Well this is almost exactly like the Test implementation I showed earlier, isn’t it!?

“Protocol witness” is a reference to the fact that this is, for all intents and purposes, a type erasing box. The only difference is that the one PointFree shows us is made to be configurable after the fact. We can do that too:

struct ConfigurableAnyNetworkInterface: NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing] {
    try await fetchJobListings_imp()
  }

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    try await fetchApplicants_imp(listing)
  }

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    try await fetchJobListingsForApplicant_imp(applicant)
  }

  init<Erasing: NetworkInterfaceProtocol>(erasing: Erasing) {
    fetchJobListings_imp = { try await erasing.fetchJobListings() } 
    fetchApplicants_imp = { listing in try await erasing.fetchApplicants(for: listing) } 
    fetchJobListingsForApplicant_imp = { applicant in try await erasing.fetchJobListings(for: applicant) } 
  }

  var fetchJobListings_imp: () async throws -> [JobListing]
  var fetchApplicants_imp: (JobListing) async throws -> [Applicant]
  var fetchJobListingsForApplicant_imp: (Applicant) async throws -> [JobListing]
}

extension NetworkInterfaceProtocol {
  func makeConfigurable() -> ConfigurableAnyNetworkInterface {
    .init(erasing: self)
  }
}

With this, we can do something like start off with the Live implementation, put it in a configurable box, then start customizing it:

var network = NetworkInterfaceLive(host: prodHost)
  .makeConfigurable()

network.fetchJobListings_imp = { 
  ...
}

What we’re really doing here is writing our own existential box. That’s what I mean when I say this technique is getting into the territory of creating our own versions of stuff the compiler already creates for us. It’s just a less absurd form of building our own type system with Type enums.

Now, there are reasons why you might need to build your own type erasing box. Even Apple does this in several of their frameworks (AnySequence, AnyPublisher, AnyView, etc.). It usually comes down to making the box conform to the protocol it’s abstracting over (AnySequence conforms to Sequence, AnyPublisher conforms to Publisher, AnyView conforms to View, etc.). This is a language limitation, something we expect, or at least hope, will be alleviated in future versions (for example by allowing extensions of the compiler-provided box: extension any Sequence: Sequence), and sometimes we just need to work around language limitations by doing stuff ourselves that the compiler normally does.

(…well, not necessarily. Whether the type erasing box should ever conform to the protocol it abstracts over is not so obvious. Remember how we used a generic to ensure two component models use the same concrete network interface type? If any NetworkInterfaceProtocol automatically counted as a concrete network interface, you could pick that as the type parameter and then you’re able to mix and match, or switch mid-execution. Then what’s the point of making it generic? Maybe what we really need is a way to specify that a generic parameter can be either a concrete type or the existential).

However, writing a type erasing box that you can then start screwing with is not an example of this. This is not a missing language feature. Allowing methods of an any P to be hot-swapped or mix-and-matched would break the existential box. You’d never be sure, when you have an any P, if you have a well-formed instance conforming to P, and everyone has the keys to reach in and break it.

This is why I would only want to expose an implementation of NetworkInterfaceProtocol that allows such violations to tests, and call it Test to make it clear it’s only for use as a test double. Now that’s a fine approach. I’ve actually gone one step further and turned such a Test implementation into a Mock implementation by having it record invocations:

typealias Invocation<Params> = (time: Date, params: Params)

final class NetworkInterfaceMock: NetworkInterfaceProtocol {
  private(set) var fetchJobListings_invocations: [Invocation<()>] = []
  var fetchJobListings_setup: () async throws -> [JobListing]
  func fetchJobListings() async throws -> [JobListing] {
    fetchJobListings_invocations.append((.now, ()))
    try await fetchJobListings_setup()
  }

  private(set) var fetchApplicants_invocations: [Invocation<JobListing>] = []
  var fetchApplicants_setup: (_ listing: JobListing) async throws -> [Applicant]
  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    fetchApplicants_invocations.append((.now, listing))
    try await fetchApplicants_setup(listing)
  }

  private(set) var fetchJobListingsForApplicant_invocations: [Invocation<Applicant>] = []
  var fetchJobListingsForApplicant_setup: (_ applicant: Applicant) async throws -> [JobListing]
  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    fetchJobListingsForApplicant_invocations.append((.now, applicant))
    try await fetchJobListingsForApplicant_setup(applicant)
  }

  func reset() {
    fetchJobListings_invocations.removeAll()
    fetchApplicants_invocations.removeAll()
    fetchJobListingsForApplicant_invocations.removeAll()
  }
}

This kind of boilerplate is a perfect candidate for macros. Make an @Mock macro and you can produce a mock implementation that records invocations and is fully configurable from the outside, with as little as @Mock final class MockMyProtocol: MyProtocol {}. You can probably do the same with the type erasing boxes.
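
As a sketch, the declaration of such a macro might look something like this (MockMacros and MockMacro are hypothetical names; the actual expansion is code you’d have to write in a macro implementation target):

// An attached member macro that would generate the _invocations/_setup members and the
// forwarding method bodies shown above.
@attached(member, names: arbitrary)
public macro Mock() = #externalMacro(module: "MockMacros", type: "MockMacro")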

Remember I said earlier that Swift’s compile time safety shuts off a lot of the easy test configurability that other languages get by rewriting methods at runtime? Well, macros are the solution: the same level of expressiveness is recovered, but at compile time.

What Can’t You Do with Protocols?

But let’s rewind a little bit. Do you actually need to create this kind of open-ended type, whose behavior you can change after the fact (a strange thing to be able to do to any type: changing the meaning of its methods), in order to cover the examples shown?

No. For example, in the case where we want a test to be able to swap in its own implementation of fetchJobListings, we don’t have to make any concrete NetworkInterfaceProtocol whose behavior can change ex post facto. Instead, we can build transforms that take one type of network interface and create a new type of network interface with new behavior. The key is that we stick to the usual paradigm of a static type having static behavior. We don’t take a particular instance (whose type, of course, can’t change) and make it dynamic to this degree. We create a new instance of a new type:

struct NetworkInterfaceReplaceJobListings<Base: NetworkInterfaceProtocol>: NetworkInterfaceProtocol {
  func fetchJobListings() async throws -> [JobListing] {
    try await fetchJobListingsImp()
  }

  func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
    try await base.fetchApplicants(for: listing)
  }

  func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
    try await base.fetchJobListings(for: applicant)
  }

  init(
    base: Base,
    fetchJobListingsImp: @escaping () async throws -> [JobListing]
  ) {
    self.base = base
    self.fetchJobListingsImp = fetchJobListingsImp
  }

  private let base: Base
  private let fetchJobListingsImp: () async throws -> [JobListing]
}

extension NetworkInterfaceProtocol {
  func reimplementJobListings(`as` fetchJobListingsImp: @escaping () async throws -> [JobListing]) -> NetworkInterfaceReplaceJobListings<Self> {
    .init(base: self, fetchJobListingsImp: fetchJobListingsImp)
  }
}

...

let network = NetworkInterfaceHappyPathFake()
  .reimplementJobListings { 
    throw TestError()
  }

The key difference here is that the network interface with the swapped out implementation is a different type from the original one. This can interact in interesting ways with the generics system. For example, if you implemented your component models as generics in order to constrain all component models in your app to always use the same network interface type, then this ensures that either the entire app uses the network interface with the swapped out implementation, or no one does.
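
Roughly what that looks like, with hypothetical model names:

struct JobListingsModel<Network: NetworkInterfaceProtocol> {
  let network: Network
}

struct ApplicantDetailModel<Network: NetworkInterfaceProtocol> {
  let network: Network
}

// Both models are forced to agree on one concrete network interface type.
func makeModels<Network: NetworkInterfaceProtocol>(
  network: Network
) -> (JobListingsModel<Network>, ApplicantDetailModel<Network>) {
  (.init(network: network), .init(network: network))
}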

This is more strongly typed. It expresses something I think is very reasonable: a network interface with a swapped out implementation of fetchJobListings is a different type of network interface. However, there’s still an element of dynamism. Each time we create a new NetworkInterfaceReplaceJobListings instance, we can supply a different reimplementation. Everything else is static, but the fetchJobListings implementation still varies per instance. This is clear from the fact that the type has a closure member. Closures are abstract, so that’s dynamic dispatch. Can we get rid of that exception and make our types fully static? Can we instead make it so that each specific reimplementation of fetchJobListings produces a distinct type of network interface?

Yes, but it requires a lot of boilerplate, and we unfortunately lose automatic capture, so we have to implement it ourselves. Let’s say we’re calling reimplementJobListings inside a method, either in a test or in test setup:

final class SomeTests: XCTestCase {
  var jobListingsCallCount = 0

  func testJobListingsGetsCalledOnlyOnce() {
    let network = NetworkInterfaceHappyPathFake()
      .reimplementJobListings { [weak self] in
        self?.jobListingsCallCount += 1
        return []
      }

    ...

    XCTAssertEqual(jobListingsCallCount, 1)
  }
}

We can replace this with defining a local type of network interface:

final class SomeTests: XCTestCase {
  var jobListingsCallCount = 0

  func testJobListingsGetsCalledOnlyOnce() {
    struct NetworkInterfaceForThisTest: NetworkInterfaceProtocol {
      func fetchJobListings() async throws -> [JobListing] {
        parent?.jobListingsCallCount += 1
        return []
      }

      func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
        try await base.fetchApplicants(for: listing)
      }

      func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
        try await base.fetchJobListings(for: applicant)
      }

      init(
        parent: SomeTests?
      ) {
        self.parent = parent
      }

      private let base = NetworkInterfaceHappyPathFake()
      private weak var parent: SomeTests?
    }

    let network = NetworkInterfaceForThisTest(parent: self)

    ...

    XCTAssertEqual(jobListingsCallCount, 1)
  }
}

Here, we’re going all the way with “static types are static”, meaning a type’s behavior is statically bound to that type. If we want different behavior in a different test, we make a different type. But having to type all of this out, and handle capture by hand (in this case a weak reference back to the test class), is a lot of tedium. Especially in tests, I probably wouldn’t bother, and would bend the rules of a static type having fixed behavior to avoid all this boilerplate.

What I would love to be able to do, though, is take this fully static approach but avoid having to invent a throwaway name for the local type, and have it close over its context using the same syntax as always to specify capture:

final class SomeTests: XCTestCase {
  var jobListingsCallCount = 0

  func testJobListingsGetsCalledOnlyOnce() {
    let network = NetworkInterfaceProtocol { [weak self] in
      func fetchJobListings() async throws -> [JobListing] {
        self?.jobListingsCallCount += 1
        return []
      }

      func fetchApplicants(for listing: JobListing) async throws -> [Applicant] {
        try await base.fetchApplicants(for: listing)
      }

      func fetchJobListings(for applicant: Applicant) async throws -> [JobListing] {
        try await base.fetchJobListings(for: applicant)
      }

      private let base = NetworkInterfaceHappyPathFake()
    }

    ...

    XCTAssertEqual(jobListingsCallCount, 1)
  }
}

This isn’t a very odd language capability, either. Java (with anonymous classes) and TypeScript (with object literals) both support this very thing.

Another minor issue here is performance. I’ll be very frank: I would be shocked if this ever actually mattered for something like an iOS app. But it’s interesting to point out.

The protocol witness pays various runtime costs, in terms of both memory and CPU usage, because it is fundamentally, and entirely, a runtime solution to the problem of swapping out implementations. First, what is the representation of the protocol witness NetworkInterface in memory? Well, it’s a struct, and it looks to be a pretty small one (just three members, which are closures), so it’s likely to be placed on the stack where it’s a local variable. However, the struct itself isn’t doing anything interesting. The real work happens inside the closure members. Those are all existential boxes for a closure, which as we saw above is a concrete type that holds all the captured state and implements a callAsFunction requirement. Depending on how much gets captured, the instances of those concrete types may or may not fit directly into their type erasing boxes. If they don’t, they’ll be allocated on the heap and the box will store a pointer to its instance. This will result in runtime costs of pointer indirection, heap allocation and cache invalidation.

When methods are called, the closure members act as a thunk to the body of the closure, so that’s going to be another runtime cost of pointer indirection.

Contrast that with the first approach shown above (which is really an example of the Proxy Pattern), where the static typing is much stronger. The type that we end up with is a NetworkInterfaceReplaceJobListings<NetworkInterfaceHappyPathFake>. Notice the way the proxy type retains information about the original type that it is proxying. In particular, it does not erase this source type. Furthermore, NetworkInterfaceHappyPathFake is a static type, not a factory that constructs particular instances.
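
For concreteness, here’s that type spelled out at a call site, using the same types as above:

let network: NetworkInterfaceReplaceJobListings<NetworkInterfaceHappyPathFake> =
  NetworkInterfaceHappyPathFake()
    .reimplementJobListings { [] }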

The implication of this is that the network variable can never (even if it was made var) store another type of network interface. We can’t change the definitions of any of the functions on this variable at any point in its life. That means it is known at compile time what the precise behavior is, except for the closure provided as the reimplementation of fetchJobListings. The fact that, for example, the other two calls just call the happy path fake version, is known at compile time. It is known at compile time that the base member is a NetworkInterfaceHappyPathFake. The size of this member, and the closure member, is known at compile time. If it’s small enough it can be put on the stack. There is no runtime dispatch of any of the calls, except for the closure.

If we go to the fully static types where we define new ones locally, we eliminate the one remaining runtime dispatch of the closure, and literally everything is hardwired at compile time.

The cost is all paid at compile time. The compiler looks at that more complex nested/generic type and figures out, from that, how to wire everything up hardcoded. This is the difference, in terms of runtime performance, between resolving the calls through runtime mechanisms, as in the protocol witness, and resolving them through the compile time mechanism of generics.

(Caveat: when you call generic code in Swift, supplying specific type parameters, the body of the generic code can be compiled into a dedicated copy for that type parameter, with everything hardwired at compile time, and there will be no runtime cost, but only if the compiler has access to that source code during compilation of your code that calls it, which is generally true only if you aren’t crossing module boundaries. If you do cross a module boundary, when the compiler compiled the module with the generic code, it compiled an “erased” copy of the generic code that replaces the generic parameter with an existential and dynamic-dispatches everything. That’s the only compiled code your module will be able to invoke, and you’ll end up paying runtime costs).

Again, I’m sure this will never matter in an iOS app. We used to write apps in Objective-C, where everything is dynamic dispatch (specifically message passing, the slowest type), on phones half the speed (or less) of what we write them on now. I rather think it’s interesting to point out the trivial improvement in runtime performance (and corresponding, probably also trivial, degradation of compile time performance) of the strongly typed implementation because it further illustrates that the behavior of the program is being worked out at compile time… and it therefore does more validation, and rejection of invalid code, at compile time.

These kinds of techniques, where you build up more sophisticated types out of protocols, typically with some kind of type composition (through generic parameters), with the goal of eliminating runtime value-level (existential) variation and expressing the variation at the type level instead, are where the power of protocols can really yield fruit, especially in terms of expanding your thinking about what they can do. It’s easy to just give up and say “this can’t be done with protocols or at the type level”, but you should always try. You might be surprised what you can do, and even if you revert to existentials later, you’ll probably learn something useful.

Conclusion

So, do I think you should ever employ this struct witness pattern?

Well, that depends on what it means exactly.

As a refactor of a design with protocols on the basis that it works around language limitations? No, that’s fundamentally confused: those are not language limitations, that’s called a static type system, and if you’re going to throw that away for dynamic typing at least frame it honestly and accurately. Then do your best to convince me dynamic typing is better than static typing (good luck).

What about if we realize those individual functions are where the abstractions should be, because it is correct modeling of the system we’re building to let them vary independently? Well, in that case, the question is: should those members be just closures, or should you define protocols with just one function as their requirement? For example, if you define fetchJobListings to simply be any () async throws -> [JobListing] closure, to me this communicates that any function with that signature is acceptable. I could have two such functions, both defined as mere closures, but they mean different things and it’s a programmer error to assign one to the other. If I introduce two protocols, each of which has a single function requirement, well, first, I can name that function, thereby indicating what it means. Second, it’s strongly typed: the compiler won’t let me use one where the other is expected.

So even in this case, I would want to think carefully about whether closures are sufficient, or if I want stronger typing than that and instead define a protocol whose single requirement is callAsFunction. Being as biased toward strong typing as I am, it’s likely I’ll choose to write out protocols. That extra boilerplate? The names of those protocols, and possibly the names of their one function requirement? I want that. Terseness has never been my goal. If it were I’d name my variables things like x or y1.
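
Here’s a minimal sketch of what I mean for the fetch-job-listings case (the protocol and type names are ones I’m inventing on the spot):

protocol FetchJobListings {
  func callAsFunction() async throws -> [JobListing]
}

struct FetchJobListingsFromStubs: FetchJobListings {
  func callAsFunction() async throws -> [JobListing] { StubData.jobListings }
}

// Strongly typed: some other () async throws -> [JobListing] closure that happens to have
// the same shape can't be passed here by mistake.
func loadScreen(fetchJobListings: some FetchJobListings) async throws -> [JobListing] {
  try await fetchJobListings()
}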

As a hand-written type eraser, intending to be a replacement for the language-provided existential where I need more flexible behavior? Never, I would consider that to be breaking the existential by enabling it to circumvent and violate the static type system.

To implement test doubles? Yes I’d do something that’s effectively the same as a protocol witness, but I wouldn’t replace the protocol with this, I would add to the protocol based design by having this witness implement the protocol.

So, ultimately, as it is presented: no, I never would, and I would consider it to be broadly in the same category as the style of coding where you define your own TheThingType enum that appears as an instance member of a type TheThing. That’s a bad style of Swift that tries to replace the language’s type system with a homegrown type system which, however good or even better it might be (which I doubt), will never be verified by the compiler, and that’s a hard dealbreaker for me.

Where might you see me writing code that involves simple structs with a closure member where I then create a handful of different de-facto “types” by initializing the closure in a specific way for each one? In prototyping. Like I’m scaffolding out some code, it’s early and I’m not sure about the design, and because it’s less boilerplate, whipping up that struct is just a little bit faster than writing out the protocol and conforming type hierarchy. Once I’m more sure this abstract type is here to stay, I’m almost certainly going to refactor it to a protocol with the closure member becoming a proper function.

If I can tie together all my decisions regarding this into a single theme, it is: I completely, 100% favor compile time safety over runtime ease or being able to quickly write code without restriction. Because perhaps my central goal of design is to elevate as many programmer errors as possible into compilation failures, and because generally the way to do that is to use statically typed languages and develop a strong (meaning lots of fine-grained) type system, I immediately dislike the protocol witness approach because it degrades the type system that the compiler sees. My goal is not and has never been to avoid or minimize the boilerplate that defining more types entails.

If you’re interested in pursuing this style of programming, where you are trying to reject as much invalid code as possible at compile time, my #1 advice, if you’re working in Swift, is to embrace generics and spend time honing your skills with protocols, especially the more advanced features like associated types. If there’s a single concrete goal to drive this, try your best to eliminate as many protocol existentials in your code base as possible (this will almost always involve replacing them with generics). As part of this, you have to treat closures as existentials (which, fundamentally, they are).
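
A tiny before-and-after of that advice, using the network interface from earlier (the model names are made up):

// Before: an existential; the concrete type is erased, dispatch is resolved at runtime.
struct DashboardModelDynamic {
  let network: any NetworkInterfaceProtocol
}

// After: a generic; the concrete type is pinned at compile time.
struct DashboardModelStatic<Network: NetworkInterfaceProtocol> {
  let network: Network
}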

You will get frustrated at times, and probably overwhelmed. Run into the challenge, not away. You’ll be happy you did later.

In a follow-up article I will explore a more nontrivial conceptual model of an abstract type with concrete implementations, and what happens (specifically what breaks down, especially related to compile time safety) if you try to build it with protocol witnesses.

Testing Async Code

Introduction

By now, pretty much every major modern OOP language has gained async/await. Java is the odd man out (they have a very different way of handling lightweight concurrency).

So, you go on a rampage, async-ifying all the things, and basking in the warm rays of the much more readable and naturally flowing fruits of your labor. Everything is fine, everything is good.

Then you get to your tests.

And suddenly you’re having lots of problems.

What’s the issue, exactly? Well, it isn’t async code per se. In fact, usually the test frameworks that are used with each language support test methods that are themselves async. If you want to test an async method, simply make the test method async, call the async method you’re testing, and await the result. You end up with what is functionally the same test as if everything were synchronous.

But what if the async code you want to test isn’t awaitable from the test?

Async vs. Await

This is a good time to review the paradox that async and await are almost antonyms. Asynchrony allows for concurrency: by not synchronizing two events, other stuff is able to happen between the two events. But adding async to code just allows you to await it. “Waiting” means to synchronize with the result: don’t continue executing below the await until the result is ready. What exactly is “asynchronous” about that?

Let’s look at an example async method:

func doSomeStuff() async -> String {

  let inputs = Inputs(5, "hello!", getSomeMoreInputs())

  let intermediateResult = await processTheInputs(inputs)

  let anotherIntermediateResult = await checkTheResultForProblems(intermediateResult)

  logTheResults(anotherIntermediateResult)

  await save(result: anotherIntermediateResult)

  return "All done!"
}

Asynchrony appears in three places (where the awaits are). And yet this code is utterly sequential. That’s the whole point of async-await language features. Writing this code with callbacks, while functionally equivalent, obscures the fact that this code all runs in a strict sequence, with the output of one step being used as the input to later steps. Async-await restores the natural return-value flow of code while also allowing it to be executed in several asynchronous chunks.

So then why bother making stuff async and awaiting it? Why not just make the functions like processTheInputs synchronous? After all, when you call a regular old function, the caller waits for the function to return a result. What’s different?

The answer is how the threads work. If the function were synchronous, one thread would execute this from beginning to end, and not do anything else in the process. Now if the functions like processTheInputs are just crunching a bunch of CPU instructions, this makes sense, because the thread keeps the CPU busy with something throughout the entire process. But if the function is asking some other thread, possibly in some other process or even on another machine, to do something for us, our thread has nothing to do, so it has to sit there and wait. You typically do that, ultimately, by using a condition variable: that tells the operating system this thread is waiting for something and not to bother giving it a slice of CPU time.

This doesn’t waste CPU resources (you aren’t busy waiting), but you do waste other resources, like memory for the thread’s stack. Using asynchrony lets multiple threads divide and conquer the work so that none of them have to sit idle. The await lets the thread go jump to other little chunks of scheduled work (anything between two awaits in any async function) while the result is prepared somewhere else.

Okay, but this is a rather esoteric implementation detail about utilizing resources more efficiently. The code is logically identical to the same method with the async and awaits stripped out… that is, the entirely synchronous flavor of it.

So then why does simply marking a method as async, and replacing blocking operations that force a thread to go dormant with awaits that allow the thread to move onto something else, and a (potentially) different thread to eventually resume the work, suddenly make testing a problem?

It doesn’t. Just add async to the test method, and add await to the function call. Now the test is logically equivalent to the version where everything is synchronous.

The problem is when we introduce concurrency.

The Problem

How do we call async code? Well, if we’re already in an async method, we can just await it. If we’re not in an async method, what do we do?

It depends on the language, but in some way or another, you start a background task. Here, “background” means concurrency: something that happens “in the background”, meaning it hums along without interfering with whatever else you want to do in the meantime. In .NET, this means either calling an async function that returns a Task but not awaiting it (which causes the part of the task before the first suspension to run synchronously), or calling Task.Run(...) (which causes the entire task to run concurrently). In Swift, it means calling Task {...}. In Kotlin, it means calling launch { ... }. In JavaScript, similar to .NET, you call an async function that returns a Promise but don’t await it, or construct a new Promise that awaits the async function and then resolves itself (and, of course, don’t await this Promise).
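
In Swift, concretely, that looks like this (using the happy path fake from earlier):

func startRefresh() {
  Task {
    // Runs concurrently; startRefresh() returns without waiting for this to finish.
    let listings = try await NetworkInterfaceHappyPathFake().fetchJobListings()
    print(listings.count)
  }
}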

That’s where the trouble happens.

This is how you kick off async code from a sync method. The sync method continues along, while the async code executes concurrently. The sync method does not wait for the async code to finish. The sync code can, and often will, finish before the async code does. We can also kick off concurrent async work in an async method, the exact same way. In this case we’re allowed to await the result of this concurrent task. If we do, then the outer async function will only complete after that concurrent work completes, and so awaiting the outer function in a test will ensure both streams of work are finished. But if the outer function only spawns the inner task and doesn’t await it, the same problem exists: the outer task can complete before the inner task does.
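
Here are those two cases side by side, again with the happy path fake standing in for the real work:

func awaitedChild() async throws -> [JobListing] {
  let child = Task { try await NetworkInterfaceHappyPathFake().fetchJobListings() }

  // Awaiting the child: awaitedChild() only finishes after the concurrent work finishes.
  return try await child.value
}

func fireAndForgetChild() {
  // Not awaited: fireAndForgetChild() returns immediately, and the child may still be running.
  Task {
    _ = try await NetworkInterfaceHappyPathFake().fetchJobListings()
  }
}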

Let’s look at an example. You create a ViewModel for a screen in your app. As soon as it gets created, in the init, it spawns a concurrent task to download data from your server. When the response comes in, it is saved to a variable that the UI can read to build up what it will display to the user. Before it’s ready, we show a spinner:

@MainActor
final class MyViewModel: ObservableObject {
  @Published
  var results: [String]?

  init() {
    Task {
      let (data, response) = try await urlSession.data(for: dataRequest)

      try await MainActor.run {
        self.results = try JSONDecoder().decode([String].self, from: data)
      }
    }
  }
}

struct MyView: View {
  @ObservedObject var viewModel: MyViewModel = .init()

  var body: some View {
    if let results = viewModel.results {
      ForEach(results, id: \.self) { result in 
        Text(result)
      }
    } else {
      ProgressView()
    }
  }
}

You want to test that the view model loads the results from the server and stores them in the right place:

final class MyViewModelTests: XCTestCase {
  func testLoadData() {
    let mockResults = setupMockServerResponse()

    let viewModel = MyViewModel()

    XCTAssertEqual(viewModel.results, mockResults)
  }
}

You run this test, and it passes… sometimes. Sometimes it fails, complaining that viewModel.results is nil. It’s actually kind of surprising that it ever passes. The mock server response you set up can be fetched almost instantaneously (it doesn’t have to actually call out to a remote server), so the Task spun off in the init completes in a few microseconds. It also takes a few microseconds for the thread running testLoadData to get from the let viewModel = ... line to the XCTAssertEqual line. The two threads are racing against each other: if the Task inside the init wins, the test passes. If not, it fails, and viewModel.results will be nil, because that’s the initial value and the Task hasn’t set it yet.

Do we fix this by making things async? Let’s do this:

@MainActor
final class MyViewModel: ObservableObject {
  @Published
  var results: [String]?

  init() async throws {
    let (data, response) = try await urlSession.data(for: dataRequest)

    try await MainActor.run {
      self.results = try JSONDecoder().decode([String].self, from: data)
    }
  }
}

...

final class MyViewModelTests: XCTestCase {
  func testLoadData() async throws {
    let mockResults = setupMockServerResponse()

    let viewModel = try await MyViewModel()

    XCTAssertEqual(viewModel.results, mockResults)
  }
}

Now it passes every time, and it’s clear why: it’s now the init itself that awaits the work to set the results. And the test is able to (in fact it must) await the init, so we’re sure this all completes before we can get past the let viewModel = ... line.

But the view no longer compiles. We’re supposed to be able to create a MyView(), which creates the default view model without us specifying it. But that init is now async. We would have to make an init() async throws for MyView as well. But MyView is part of the body of another view somewhere, and that can’t be async and so can’t await this init.

Plus, this defeats the purpose of the ProgressView. In fact, since results is set in the init, we can make it a non-optional let, never assigning it to nil. Then the View will always have results and never need to show a spinner. That’s not what we want. Even if we push the “show a spinner until the results are ready” logic outside of MyView, we have to solve that problem somewhere.

This is a problem of concurrency. We want the spinner to show on screen concurrent with the results being fetched. The problem isn’t the init being async per se, it’s that the init awaits the results being fetched. We can keep the async on the init but we need to make the fetch concurrent again:

@MainActor
final class MyViewModel: ObservableObject {
  @Published
  var results: [String]?

  init() async throws {
    Task {
      let (data, response) = try await urlSession.data(for: dataRequest)

      try await MainActor.run {
        self.results = try JSONDecoder().decode([String].self, from: data)
      }
    }
  }
}

...

final class MyViewModelTests: XCTestCase {
  func testLoadData() async throws {
    let mockResults = setupMockServerResponse()

    let viewModel = try await MyViewModel()

    XCTAssertEqual(viewModel.results, mockResults)
  }
}

Now we’re back to the original problem. The async on the init isn’t helping. Sure we await that in the test, but the thing we actually need to wait for is concurrent again, and we get the same flaky result as before.

This really has nothing to do with the async-await language feature per se. The exact same problem would have arisen had we achieved our desired result in a more old school way:

@MainActor
final class MyViewModel: ObservableObject {
  @Published
  var results: [String]?

  init() {
    DispatchQueue.global().async { // This really isn't necessary, but it makes this version more directly analogous to the async-await one
            
      URLSession.shared.dataTask(with: .init(url: .init(string: "")!)) { data, response, _ in
        guard let data, let response else {
          return
        }
                
        DispatchQueue.main.async {
          self.results = try? JSONDecoder().decode([String].self, from: data)
        }
      }.resume()
    }
  }
}

There’s still no way for the test to wait for that work thrown onto a background DispatchQueue to finish, or for the work to store the result thrown back onto the main DispatchQueue to finish either.

So what do we do?

The Hacky “Solution”

Well, the most common way to deal with this is ye old random delay.

We need to wait for the spun off task to finish, and the method we’re calling can finish slightly earlier. So after the method we call finishes, we just wait. How long? Well, for a… bit. Who knows. Half a second? A tenth of a second? It depends on the nature of the task too. If the task tends to take a couple seconds, we need to wait a couple seconds.

final class MyViewModelTests: XCTestCase {
  func testLoadData() async throws {
    let mockResults = setupMockServerResponse()

    let viewModel = MyViewModel()

    try await Task.sleep(for: .seconds(0.2)) // That's probably long enough!

    XCTAssertEqual(viewModel.results, mockResults)
  }
}

(Note that Task.yield() is just another flavor of this; it just causes execution to pause for some indeterminate, brief amount of time.)

To be clear, this “solves” the problem no matter what mechanism of concurrency we decided to use: async-await, dispatch queues, run loops, threads… doesn’t matter.

Whatever the delay is, you typically discover it by just running the test and seeing what kind of delay makes it usually pass. And that works… until random fluctuations in the machine running the tests make it not work, and you have to bump the delay up slightly.

That, ladies and gentlemen, is how you get brittle tests.

Is This a Solution?

The problem is that by adding a delay and then going ahead with our assertions as though the task completed in time, we’re encoding into our tests a requirement that the task complete within the time we chose. The time we chose was utterly random and tuned to make the test pass, so it’s certainly not a valid requirement. You don’t want tests that inadvertently enforce nonexistent requirements.

I’ve heard some interesting takes on this. Well, let’s think about the requirements. Really, there is a requirement that mentions a time window. After all, the product owner wouldn’t be happy if this task completed 30 minutes after the sync method (the “when” of the scenario) got triggered. The solution, according to this perspective, is to sit down with the product owner, work out the nonfunctional requirements regarding timing constraints (that this ought to finish in no more than some amount of time), and voilà: there’s your delay amount, and now your tests are enforcing real requirements, not made up ones.

But hold on… why must there be a nonfunctional requirement about timing? This is about a very technical detail in code that concerns how threads execute work on the machine, and whether it’s possible for the test method to know exactly when some task that got spun off has finished. Why does this implementation detail imply that a NFR about timing exists? Do timing NFRs exist for synchronous code? After all, nothing a computer does is instantaneous. If this were true, then all requirements, being fundamentally about state changes in a computer system, would have to mention something about the allowed time constraints for how long this state change can take.

Try asking the product owner what the max allowed time should be. Really, ask him. I’ll tell you what his answer will be 99% of the time:

“Uhh, I don’t know… I mean, as short as it can be?”

Exactly. NFRs are about tuning a system and deciding when to stop optimizing. Sure, we can make the server respond in <10ms, but it will take a month of aggressive optimization. Not worth it? Then you’ll get <100ms. The reason to state the NFR is to determine how much effort needs to be spent obtaining realistic but nontrivial performance.

In the examples we’re talking about with async methods, there’s no question of optimization. What would that even mean? Let’s say the product owner decides the results should be ready in no more than 10 seconds. Okay, first of all, that means this test has to wait 10 seconds every time it runs! The results will actually be ready in a few microseconds, but instead every test run costs 10 seconds just to take this supposed NFR into account. It would be great if the test could detect that the results came in sooner than the maximum time and go ahead and complete early. But if we could solve that problem, we’d solve the original problem too (the test knowing when the results came in).

Even worse, what do we do with that information? The product owner wants the results in 10ms, but the response is large and it takes longer for the fastest iPhone to JSON decode it. What do we do with this requirement? Invent a faster iPhone?

Fine, then the product owner will have to take the limitations of the system into account when he picks the NFR. Well now we’re just back to “it should be as fast as it reasonably can”. So, like, make the call as early as possible, then store the results as soon as they come in.

The Real Requirements

The requirement, in terms of timing, is quite simply that the task gets started immediately and that it finishes as soon as it can, which means the task only performs what it needs to and nothing else.

These are the requirements we should be expressing in our tests. We shouldn’t be saying “X should finish in <10 s”, we should be saying “X should start now, and X should only do A, B and C, and nothing else”.

The second part, that code should only do something, and not do anything else, is tricky because it’s open-ended. How do you test that code only does certain things? Generally you can’t, and don’t try to. But that’s not the thing that’s likely to “break”, or what a test in a TDD process needs to prove got implemented.

The first part… that’s what our tests should be looking for… not that this concurrent task finishes within a certain time, but that it is started immediately (synchronously!). We of course need to test that the task does whatever it needs to do. That’s a separate test. So, really, what starts off as one test:

Given some results on the server
When we create a view model
Then the view model’s results should be the server’s results

Ends up as two tests:

When we create the view model
Then the “fetch results” task should be started with the view model

Given some results on the server
And the “fetch results” task is running with the view model
When the “fetch results” finishes
Then the view model’s results should be the server’s results

I should rather say this ended up as two requirements. Writing two tests that correspond exactly to these two requirements is a more advanced topic that I will talk about another time, so it doesn’t distract too much from the issue here.

For now let’s just try to write a test that indeed tests both of these requirements together, but still confirms that the task does start and eventually finish, doing what it is supposed to do in the process, without having to put in a delay.

Solution 1: Scheduler Abstraction

The fundamental problem is that an async task is spun off concurrently, and the test doesn’t have access to it. How does that happen? By initializing a Task. This is, effectively, how you schedule some work to run concurrently. By writing Task.init(...), we are hardcoding this schedule call to the “real” scheduling of creating a Task. If we can abstract this call, then we can substitute test schedulers that give us more capabilities, like being able to see what was scheduled and await all of it.

Let’s look at the interface for Task.init:

struct Task<Success, Failure: Error> {

}

extension Task where Failure == Never {

  @discardableResult
  init(
    priority: TaskPriority? = nil, 
    operation: @Sendable @escaping () async -> Success
  )
}

extension Task where Failure == Error {

  @discardableResult
  init(
    priority: TaskPriority? = nil, 
    operation: @Sendable @escaping () async throws -> Success
  )
}

There are actually some hidden attributes that you can see if you look at the source code for this interface. I managed to catch Xcode admitting to me once that these attributes are there, but I can't remember how I did it and can no longer reproduce it. So really this is the interface:

extension Task where Failure == Never {

  @discardableResult
  @_alwaysEmitIntoClient
  init(
    priority: TaskPriority? = nil, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async -> Success
  )
}

extension Task where Failure == Error {

  @discardableResult
  @_alwaysEmitIntoClient
  init(
    priority: TaskPriority? = nil, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async throws -> Success
  )
}

Okay, so let’s write an abstract TaskScheduler type that presents this same interface. Since protocols don’t support default parameters we need to deal with that in an extension:

protocol TaskScheduler {
  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    priority: TaskPriority?, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async -> Success
  ) -> Task<Success, Never>

  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    priority: TaskPriority?, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async throws -> Success
  ) -> Task<Success, Error>
}

extension TaskScheduler {
  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async -> Success
  ) -> Task<Success, Never> {
    schedule(priority: nil, operation: operation)
  }

  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async throws -> Success
  ) -> Task<Success, Error> {
    schedule(priority: nil, operation: operation)
  }
}

Now we can write a DefaultTaskScheduler that just creates the Task and nothing else:

struct DefaultTaskScheduler: TaskScheduler {
  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    priority: TaskPriority?, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async -> Success
  ) -> Task<Success, Never> {
    .init(priority: priority, operation: operation)
  }

  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    priority: TaskPriority?, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async throws -> Success
  ) -> Task<Success, Error> {
    .init(priority: priority, operation: operation)
  }
}

And we can introduce this abstraction into MyViewModel:

@MainActor
final class MyViewModel: ObservableObject {
  @Published
  var results: [String]?

  init(taskScheduler: some TaskScheduler = DefaultTaskScheduler()) {
    taskScheduler.schedule {
      let (data, response) = try await urlSession.data(for: dataRequest)

      self.results = try JSONDecoder().decode([String].self, from: data)
    }
  }
}

Now in tests, we can write a RecordingTaskScheduler decorator that records all tasks scheduled on some other scheduler, and lets us await them all later. In order to do that, we need to be able to store tasks with any Success and Failure type. Then we need to be able to await them all. How do we await a Task? By awaiting its value:

extension Task {
  public var value: Success {
    get async throws
  }
}

extension Task where Failure == Never {
  public var value: Success {
    get async
  }
}

In order to store all running Tasks of any success type, throwing and non-throwing, and to be able to await their values, we need a protocol that covers all those cases:

protocol TaskProtocol {
  associatedtype Success

  var value: Success {
    get async throws
  }
}

extension Task: TaskProtocol {}

Now we can use this in RecordingTaskScheduler:

final class RecordingTaskScheduler<Scheduler: TaskScheduler>: TaskScheduler {
  private(set) var runningTasks: [any TaskProtocol] = []

  func awaitAll() async throws {
    // Be careful here. While tasks are running, they may schedule more tasks
    // themselves. So instead of looping over a snapshot of runningTasks, we
    // repeatedly pull off the next one, if it's there.
    while !runningTasks.isEmpty {
      let task = runningTasks.removeFirst()
      _ = try await task.value
    }
  }

  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    priority: TaskPriority?, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async -> Success
  ) -> Task<Success, Never> {
    let task = scheduler.schedule(priority: priority, operation: operation)
    runningTasks.append(task)
    return task
  }

  @discardableResult
  @_alwaysEmitIntoClient
  func schedule<Success>(
    priority: TaskPriority?, 
    @_inheritActorContext @_implicitSelfCapture operation: __owned @Sendable @escaping () async throws -> Success
  ) -> Task<Success, Error> {
    let task = scheduler.schedule(priority: priority, operation: operation)
    runningTasks.append(task)
    return task
  }

  init(scheduler: Scheduler) {
    self.scheduler = scheduler
  }

  let scheduler: Scheduler
}

extension TaskScheduler {
  var recorded: RecordingTaskScheduler<Self> {
    .init(scheduler: self)
  }
}

You probably want to make that runningTasks state thread safe.
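
For instance, here's a minimal sketch of one way to do that, assuming an NSLock-based helper (LockedTaskList is invented for illustration, not part of the scheduler above):

import Foundation

// Illustrative helper: serializes access to the recorded tasks with an NSLock
// so that schedule(...) can be called from any thread.
final class LockedTaskList {
  private let lock = NSLock()
  private var tasks: [any TaskProtocol] = []

  func append(_ task: any TaskProtocol) {
    lock.lock()
    defer { lock.unlock() }
    tasks.append(task)
  }

  // Removes and returns the next recorded task, or nil when the list is empty.
  func popFirst() -> (any TaskProtocol)? {
    lock.lock()
    defer { lock.unlock() }
    return tasks.isEmpty ? nil : tasks.removeFirst()
  }
}

With something like this in place, both schedule overloads call append(_:) instead of touching the array directly, and awaitAll() becomes a loop over popFirst().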

Now we can use the recording scheduler in the test:

final class MyViewModelTests: XCTestCase {
  func testLoadData() async throws {
    let mockResults = setupMockServerResponse()

    let taskScheduler = DefaultTaskScheduler().recorded
    let viewModel = MyViewModel(taskScheduler: taskScheduler)

    try await taskScheduler.awaitAll()

    XCTAssertEqual(viewModel.results, mockResults)
  }
}

We’ve replaced the sleep for an arbitrary time with a precise await for all scheduled tasks to complete. Much better!

Now, there are other ways to schedule concurrent work in Swift. Initializing a new Task is unstructured concurrency: creating a new top-level task that runs independently of any other Task, even if it was spawned from an async function being run by another Task. The other ways to spawn concurrent work are the structured concurrency APIs: async let and with(Throwing)TaskGroup. Do we need to worry about these causing problems in tests?

No, we don't. The consequence of the concurrency being structured is that these tasks spawned inside another Task are owned by the outer Task (they are "child tasks" of that "parent task"). This primarily means parent tasks cannot complete until all of their child tasks complete. It doesn't matter if you explicitly await these child tasks or not. The outer task will implicitly await all that concurrent work at the very end (before the return) if you didn't explicitly await it earlier than that. Thus, as long as the top-level Task that owns all these child tasks is awaitable in the tests, awaiting it will await all those concurrent child tasks as well.
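
For example, here's a small illustrative sketch (childWork is a made-up async function): awaiting the top-level Task is enough to know its async let children have finished too.

func childWork(_ label: String) async -> String {
  try? await Task.sleep(nanoseconds: 100_000_000)
  return label
}

func demonstrateStructuredAwaiting() async -> [String] {
  let outer = Task {
    // async let spawns structured child tasks. This closure cannot return
    // until both children have finished, whether or not we await each one
    // explicitly before this point.
    async let a = childWork("a")
    async let b = childWork("b")
    return await [a, b]
  }

  // Awaiting the top-level Task therefore awaits its child tasks as well,
  // which is why only Task.init needs to go through the TaskScheduler.
  return await outer.value
}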

It’s only the unstructured part of concurrency we need to worry about. That is handled by Task.init, and our TaskScheduler abstraction covers it.

(It’s becoming popular to claim that “unstructured concurrency is bad” and that you should replace all instances of it with structured concurrency, but this doesn’t make sense. Structured concurrency might very well be called structured awaiting. When you actually don’t want one thing to await another, i.e. genuine concurrency, unstructured concurrency is exactly what you need. The view model where we made init async throws is an example: it’s not correct to use async let to kick off the fetch work, because that causes init to await that child task, and destroys the very concurrency we’re seeking to create.)

Things look pretty similar on other platforms/frameworks, with some small caveats. In Kotlin, the way to spawn concurrent work is by calling CoroutineScope.launch. There's a global scope available, but many times you need to launch coroutines from another scope. This is nice because the scope is already basically the abstraction we need. Just make it configurable in tests, and make a decorator for CoroutineScope that additionally records the launched coroutines and lets you await them all. You might even be able to do this by installing a spy with mockk.

In .NET, the equivalent to Task.init in Swift is Task.Run:

void SpawnATask() {
  Task.Run(async () => { await DoSomeWork(); });
}

async Task DoSomeWork() {
  ...
}

Task.Run is really Task.Factory.StartNew with parameters set to typical defaults. Whichever one you need, you can wrap it in a TaskScheduler interface. Let’s assume Task.Run is good enough for our needs:

interface ITaskScheduler {
  Task Schedule(Func<Task> work);
  Task<TResult> Schedule<TResult>(Func<TResult> work);
  Task<TResult> Schedule<TResult>(Func<Task<TResult>> work);
}

struct DefaultTaskScheduler : ITaskScheduler {
  public Task Schedule(Func<Task> work) {
    return Task.Run(work);
  }

  public Task<TResult> Schedule<TResult>(Func<TResult> work) {
    return Task.Run(work);
  }

  public Task<TResult> Schedule<TResult>(Func<Task<TResult>> work) {
    return Task.Run(work);
  }
}

Then we replace naked Task.Run with this abstraction:

void SpawnATask(ITaskScheduler scheduler) {
  scheduler.Schedule(async () => { await DoSomeWork(); });
}

And similarly we can make a RecordingTaskScheduler to allow tests to await all scheduled work:

sealed class RecordingTaskScheduler : ITaskScheduler {
  public IImmutableQueue<Task> RunningTasks { get; private set; } = ImmutableQueue<Task>.Empty;

  public async Task AwaitAll() {
    while (!RunningTasks.IsEmpty) {
      Task task = RunningTasks.Peek();
      RunningTasks = RunningTasks.Dequeue();
      await task;
    }
  }

  public Task Schedule(Func<Task> work) {
    var task = _scheduler.Schedule(work);
    RunningTasks = RunningTasks.Enqueue(task);
    return task;
  }

  public Task<TResult> Schedule<TResult>(Func<TResult> work) {
    var task = _scheduler.Schedule(work);
    RunningTasks = RunningTasks.Enqueue(task);
    return task;
  }

  public Task<TResult> Schedule<TResult>(Func<Task<TResult>> work) {
    var task = _scheduler.Schedule(work);
    RunningTasks = RunningTasks.Enqueue(task);
    return task;
  }

  public RecordingTaskScheduler(ITaskScheduler scheduler) {
    _scheduler = scheduler;
  }

  private readonly ITaskScheduler _scheduler;
}

// Extension methods must live in a non-generic static class in C#.
static class TaskSchedulerExtensions {
  public static RecordingTaskScheduler Recorded(this ITaskScheduler scheduler) {
    return new(scheduler);
  }
}

Because in C#, generic classes are generally created as subclasses of a non-generic flavor (Task<TResult> is a subclass of Task), we don’t have to do any extra work to abstract over tasks of all result types.

There’s a shark in the water, though.

Async/await works a little differently in C# than in Swift. Kotlin is Swift-like, while Typescript and C++ are C#-like in this regard.

In C# (and Typescript and C++), there aren’t really two types of functions (sync and async). The async keyword is just a request to the compiler to rewrite your function to deal with the awaits and return a special type, like Task or Promise (you have to write your own in C++). And correspondingly, an async function can’t just return anything, it has to return one of these special types. But that’s all that’s different. Specifically, you can call async functions from anywhere in these languages. What you can’t do in non-async functions is await. You can call that async function from a non-async function, you just can’t await the result, which is always going to be some awaitable type like Task or Promise.

Long story short, you can do this:

void SpawnATask() {
  DoSomeWork();
}

async Task DoSomeWork() {
  ...
}

There’s a slight difference in how this executes, compared to wrapping it in Task.Run, but I certainly hope you aren’t writing code that depends in any way on that difference. So you should be able to wrap it in Task.Run, and then change that to scheduler.Schedule.

But before you make that change, know that this is a sneaky little ninja lurking around in your codebase. It's really easy to miss. If a test is failing, clearly because it isn't waiting long enough, and you're going crazy because you've searched for every last Task.Run (or Task. in general) in your code base, the culprit can be one of these crypto task constructors that you'd never even notice is spawning concurrent work. Just keep that in mind. Again, it should be fine to wrap it in scheduler.Schedule.

This isn’t a thing in Swift/Kotlin because they do async/await differently. In those languages there are two types of functions, and you simply aren’t allowed to call async functions from non-async ones. You have to explicitly call something like Task.init to jump from one to another.
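
For example, in Swift the compiler simply refuses the direct call; a tiny sketch (the function names are made up):

func doSomeWork() async {
  // ...
}

func spawnSomething() {
  // Error: 'async' call in a function that does not support concurrency.
  // await doSomeWork()

  // The only way in is to explicitly create a Task (or go through something
  // like the TaskScheduler abstraction from earlier).
  Task {
    await doSomeWork()
  }
}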

A Not So Good Solution

This is not a new problem. I showed earlier that we can handle the concurrency with DispatchQueue. Similarly, and I’ve done this plenty of times, you would write an abstraction that captures work scheduled on global queues so the test can synchronize with it…

…well, no that’s not exactly what I did. I did something a little different. First I made a protocol so that I can customize what happens when I dispatch something to a queue:

protocol DispatchQueueProtocol {
  func async(
    group: DispatchGroup?, 
    qos: DispatchQoS, 
    flags: DispatchWorkItemFlags, 
    execute work: @escaping @convention(block) () -> Void
  )
}

extension DispatchQueueProtocol {
  // Deal with default parameters
  func async(
    group: DispatchGroup? = nil, 
    qos: DispatchQoS = .unspecified, 
    flags: DispatchWorkItemFlags = [], 
    execute work: @escaping @convention(block) () -> Void
  ) {
    // Beware of missing conformance and infinite loops!
    self.async(
      group: group,
      qos: qos,
      flags: flags,
      execute: work
    )
  }
}

extension DispatchQueue: DispatchQueueProtocol {}

Just as with the scheduler, the view model takes a dispatch queue as an init parameter:

final class MyViewModel: ObservableObject {
  @Published
  var results: [String]?

  init(
    backgroundQueue: some DispatchQueueProtocol = DispatchQueue.global(),
    mainQueue: some DispatchQueueProtocol = DispatchQueue.main
  ) {
    backgroundQueue.async {
      URLSession.shared.dataTask(with: dataRequest) { data, _, _ in
        guard let data else {
          return
        }

        mainQueue.async {
          self.results = try? JSONDecoder().decode([String].self, from: data)
        }
      }
      .resume() // The data task must be resumed, or the request never starts
    }
  }
}

Then I defined a test queue that didn't actually queue anything; it just ran the work outright:

struct ImmediateDispatchQueue: DispatchQueueProtocol {
  func async(
    group: DispatchGroup?, 
    qos: DispatchQoS, 
    flags: DispatchWorkItemFlags, 
    execute work: @escaping @convention(block) () -> Void
  ) {
    work()
  }
}

And if I supply this queue for both queue parameters in the test, then it doesn’t need to wait for anything. I have, in fact, removed concurrency from the code entirely. That certainly solves the problem the test was having!
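
Such a test might look something like this. It's only a sketch: it assumes setupMockServerResponse() stubs URLSession so that the data task's completion handler fires synchronously; without that, removing the queues alone wouldn't make the test deterministic.

import XCTest

final class MyViewModelImmediateQueueTests: XCTestCase {
  func testLoadData() {
    let mockResults = setupMockServerResponse()

    let viewModel = MyViewModel(
      backgroundQueue: ImmediateDispatchQueue(),
      mainQueue: ImmediateDispatchQueue()
    )

    // Everything ran inline during init, so the results are already set.
    XCTAssertEqual(viewModel.results, mockResults)
  }
}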

This is often the go-to solution for these kinds of problems: if tests are intermittently failing because of race conditions, just remove concurrency from the code while it is being tested. How would you do this with the async-await version? You need to be able to take control of the "executor": the underlying object that's responsible for running the synchronous slices of async functions between the awaits. The default executor is comparable to DispatchQueue.global: it uses a shared thread pool to run everything. You would replace this with something like an ImmediateExecutor, which just runs each slice inline. That causes "async" functions to become synchronous.

Substituting your own executor is possible in C# and Kotlin. It isn't in Swift. Swift took one step in this direction in 5.9 with custom actor executors, but they're still working on it.

However, even if it was possible, I don’t think it’s a good idea.

Changing the underlying execution model from asynchronous to synchronous significantly changes what you're testing. Your test is running something quite different from what happens in prod code. For example, by making everything synchronous and running on a single thread, everything becomes thread safe by default (though not necessarily reentrant). If there are any thread safety issues with the "real" version that runs concurrently, your test will be blind to them. On the flip side, you might encounter deadlocks from running everything synchronously that don't happen when things run concurrently.

It's just not as accurate a test as I'd like. I want to exercise that concurrent execution.

That’s why I prefer to not mess with how things execute, and just record what work has been scheduled so that the test can wait on it. This is a little more convoluted to get working in the DispatchQueue version, but we can do it:

final class RecordingDispatchQueue<Queue: DispatchQueueProtocol>: DispatchQueueProtocol {
  private(set) var runningTasks: [DispatchGroup] = []

  func waitForAll(onComplete: @escaping () -> Void) {
    let outerGroup = DispatchGroup() 
    while let innerGroup = runningTasks.first {
      runningTasks.removeFirst()
      outerGroup.enter()
      innerGroup.notify(queue: .global(), execute: outerGroup.leave)
    }
    
    outerGroup.notify(queue: .global(), execute: onComplete)
  }

  func async(
    group: DispatchGroup?, 
    qos: DispatchQoS, 
    flags: DispatchWorkItemFlags, 
    execute work: @escaping @convention(block) () -> Void
  ) {
    // Track completion with our own group, while still forwarding the
    // caller's group, qos and flags to the wrapped queue.
    let trackingGroup = DispatchGroup()
    trackingGroup.enter()

    queue.async(group: group, qos: qos, flags: flags) {
      work()
      trackingGroup.leave()
    }

    runningTasks.append(trackingGroup)
  }

  init(queue: Queue) {
    self.queue = queue
  }

  private let queue: Queue
}

extension DispatchQueueProtocol {
  var recorded: RecordingDispatchQueue<Self> {
    .init(queue: self)
  }
}

Then in the tests:

final class MyViewModelTests: XCTestCase {
  func testLoadData() {
    let mockResults = setupMockServerResponse()

    let backgroundQueue = DispatchQueue.global().recorded
    let mainQueue = DispatchQueue.main.recorded

    let viewModel = MyViewModel(
      backgroundQueue: backgroundQueue,
      mainQueue: mainQueue
    )

    let backgroundWorkComplete = self.expectation(description: "backgroundWorkComplete")
    let mainWorkComplete = self.expectation(description: "mainWorkComplete")

    backgroundQueue.waitForAll(onComplete: backgroundWorkComplete.fulfill)
    mainQueue.waitForAll(onComplete: mainWorkComplete.fulfill)

    waitForExpectations(timeout: 5.0, handler: nil)

    XCTAssertEqual(viewModel.results, mockResults)
  }
}

A lot less pretty than async-await, but functionally equivalent.

Solution 2: Rethink the Problem

Abstracting task scheduling gives us a way to add a hook for tests to record scheduled work and wait for all of it to complete before making assertions. Instead of just waiting arbitrarily and hoping it's long enough for everything to finish, we expose the things we're waiting for so we can know when they're done. This solves the problem we had of the test needing to know how long to wait… but are we thinking about this correctly?

Why does the test need to be aware that tasks are being spawned and concurrent work is happening? Does the production code that uses the object we're testing need to know all of that? It didn't look that way. After all, we started with working production code where no scheduling abstraction was present, the default scheduling mechanism (like Task.init) was hardcoded inside the private implementation details of MyViewModel, and yet… everything worked. Specifically, the object interacting with MyViewModel, namely MyView, didn't know and didn't care about any of this.

Why, then, do the tests need to care? After all, why in general do tests need to know about private implementation details? And it is (or at least was, before we exposed the scheduler parameter) a totally private implementation detail that any asynchronous scheduling was happening at all.

What were we trying to test? We wanted to test, basically, that our view shows a spinner until results become ready, that those results will eventually become ready, and that they will be displayed once they are ready. We don't want to involve the view in the test, so we don't literally look for spinners or text rows; instead, we test that the view model instructs the view appropriately. The key is asking ourselves: why is this "we don't know exactly when the results are ready" problem not a problem for the view? How does the view get notified that results are ready?

Aha!

The view model's results are an @Published. It is publishing changes to its results to the outside world. See, we've already solved the problem we have in tests. We had to, because it's a problem for production code too. It's perhaps obscured a bit inside the utility types in SwiftUI, but the view is notified of updates by being bound to an ObservableObject, which has an objectWillChange publisher that fires any time any of its @Published members are about to change (specifically in willSet). This triggers an evaluation of the view's body in the next main run loop cycle, where it reads from viewModel.results.

So, we just need to simulate this in the test:

final class MyViewModelTests: XCTestCase {
  @MainActor
  func testLoadData() async throws {
    let mockResults = setupMockServerResponse()

    let viewModel = MyViewModel()

    let updates = viewModel
      .objectWillChange
      .prepend(Just(())) // We want to inspect the view model immediately in case the change already occurred before we start observing
      .receive(on: RunLoop.main) // We want to inspect the view model after updating on the next main run loop cycle, just like the view does

    for await _ in updates.values {
      guard let results = viewModel.results else {
        continue
      }

      XCTAssertEqual(results, mockResults)
      break
    }
  }
}

Now this test is faithfully testing what the view does, and how a manual tester would react to it: the view model’s results are reevaluated each time the view model announces it updated, and we wait until results appear. Then we check that they equal the results we expect.

With this, a stalling test is now a concern. If the prod code is broken, the results may never get set, and then this will wait forever. So we should wrap it in some sort of timeout check. Usually test frameworks come with timeouts already. Unfortunately, XCTest's built-in per-test time allowance only has a resolution of one minute. It would be nice to specify something like 5 seconds, so we can write our own withTimeout function (a rough sketch of it follows the test):

final class MyViewModelTests: XCTestCase {
  @MainActor
  func testLoadData() async throws {
    let mockResults = setupMockServerResponse()

    let viewModel = MyViewModel()

    let updates = viewModel
      .objectWillChange
      .prepend(Just(())) // We want to inspect the view model immediately in case the change already occurred before we start observing
      .receive(on: RunLoop.main) // We want to inspect the view model after updating on the next main run loop cycle, just like the view does
 
    try await withTimeout(seconds: 5.0) {
      for await _ in updates.values {
        guard let results = viewModel.results else {
          continue
        }

        XCTAssertEqual(results, mockResults)
        break
      }
    }
  }
}
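
For reference, here is the rough sketch I have in mind. TimedOutError is an invented type, and this version relies on the operation cooperating with cancellation so the losing child task actually winds down:

struct TimedOutError: Error {}

func withTimeout<T: Sendable>(
  seconds: Double,
  operation: @escaping @Sendable () async throws -> T
) async throws -> T {
  try await withThrowingTaskGroup(of: T.self) { group in
    // Race the real work against a sleeping child task.
    group.addTask {
      try await operation()
    }

    group.addTask {
      try await Task.sleep(nanoseconds: UInt64(seconds * 1_000_000_000))
      throw TimedOutError()
    }

    // Whichever child finishes first wins; cancel the loser.
    let result = try await group.next()!
    group.cancelAll()
    return result
  }
}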

The mindset here is to think about why anyone cares that this concurrent work is started to begin with. The only reason why anyone else would care is that they have access to some kind of state or notification that is changed or triggered by the concurrent task. Whatever that is, it’s obviously public (otherwise no one else could couple to it, and we’re back to the original question), so have your test observe that instead of trying to hook into scheduling of concurrent work.

In this example, it was fairly obvious what the observable side effect of the task running was. It may not always be obvious, but it has to exist somewhere. Otherwise you’re trying to test a process that no one can possibly notice actually happened, in which case why is it a requirement (why are you even running that task)? Whether this is setting some shared observable state, triggering some kind of event that can be subscribed to, or sending a message out to an external system, all of those can be asserted on in tests. You shouldn’t need to be concerned about tasks finishing, that’s an implementation detail.

Conclusion

We found a way to solve the immediate problem of a test not knowing how long to wait for asynchronous work to complete. As always, introducing an abstraction is all that’s needed to be able to insert a test hook that provides the missing information.

But the more important lesson is that inserting probes like this into objects in order to test them raises a question: why would you need to probe an object in a way production code can't, just to test how that object behaves from the perspective of the production objects that interact with it (which is, after all, all that matters)? I'm not necessarily saying there's never a good answer. At the very least, such a probe may replace a more faithful (in terms of recreating what users do) but extremely convoluted, slow and unreliable test with a much more straightforward, fast and reliable one that "cheats" slightly (is the risk of the convolutedness, of skipping runs because the test is slow, and of ignoring its results because it's unreliable, more or less than the risk of producing invalid results by cheating?).

But you should always ask this question, and settle for probing the internals of behavior only after you have exhaustively concluded that it really is necessary, and for good reason. In the case of testing concurrent tasks, the point of waiting for tasks to “complete” is that the tasks did something, and it’s this side effect that you care about. You should assert on the side effect, which is the only real meaningful test that code executed (and it’s all that matters).

In general, if you write concurrent code, you will already solve the notification problem for the sake of production code. You don’t need to insert more hooks to observe the status of jobs in tests. The only reason the status of that job could matter is because it affects something that is already visible, and it is possible to listen to and be immediately notified of when that change happens. Whether it’s an @Published, an Observable/Publisher, a notification, event, C# delegate, or whatever else, you have surely already introduced one of these things into your code after introducing a concurrent task. Either that, or you’re calling out to an external system, and that can be mocked.

Just find that observable state that the job is affecting, and then you have something to listen to in tests. Introducing the scheduling interface is a way to quickly cover areas with tests and get reliable passing, but you should always revisit this and figure out what the proper replacement is.

On Code Readability

Introduction

I have many times encountered the statement from developers that "90% of time is spent reading code", and therefore readability is a very important quality of code. So important, in fact, that it can overtake other qualities. This is almost always used to justify less abstract code. Highly abstracted code can be "hard to read", so much so that its positive qualities (less duplication, more scalable and maintainable, etc.) may be invalidated by the fact that it is just so hard to read. More "straightforward" code, perhaps with the lower-level details spelled out explicitly inline, may suffer problems of maintainability, but since we spend most of our time reading it, not making changes to it, the less abstract but easier-to-read code may still be better.

Since you hear this exact phrase (the 90% line), it’s obviously coming from some influential source. That would be Uncle Bob’s book, Clean Code: A Handbook of Agile Software Craftsmanship:

Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. …[Therefore,] making it easy to read makes it easier to write.

Now, he’s saying improving readability improves modifiability. But these discussions end up becoming about readability over modifiability. I’ll dig into why that matters.

Excuses for poorly abstracted code are abundant, particularly in “agile” communities. The first one I encountered was spoken by the first scrum master I ever had. Basically, since agile development is about rapid iteration, thinking design-wise more than two weeks ahead is wasteful, and you instead just want to focus on getting a barely working design out quickly, to complete the iteration before starting the next one.

Then, I heard more excuses later, ranging from YAGNI to “junior developers can’t understand this” to arguments about emergent design and “up-front design is bad” (the implication being that writing highly abstract code always counts as upfront design). There are a lot of misconceptions around this topic, especially the idea that building up a runway of library tools (i.e. abstractions) that can be quickly and easily composed into new features somehow hurts your ability to quickly implement new features with short turnaround (in other words, agility), but that’s not what I want to talk about here.

I just want to talk about this argument that “code should be easy to read”, implying that some other quality of it should be sacrificed. The reason I bring this other stuff up is to put it in context: developers are constantly looking for excuses to write less abstract code, so the “readability” argument isn’t unique at all. It’s just another of many avenues to reach the same conclusion. You should view these arguments with a bit of suspicion, as all of these unrelated premises land at the same place. Perhaps the conclusion was foregone, and people are working backward from it to various starting points.

(The simple, banal, reason developers look for excuses to not write abstract code is because abstracting is a form of low time preference: investing into the future instead of blowing all your money today, or running your credit card up. Investment requires delaying gratification, which is unpleasant here and now. It’s the same reason everyone wants to be fit but no one wants to go to the gym.)

For transparency’s sake, I’m the one who thinks abstraction, lots of it, is literally the only way any of us can begin to comprehend modern software. If you don’t believe me, remove all the abstractions and read the compiled code in assembly. This is why whenever I see a principle like YAGNI or readability used to conclude “un-abstracted code is better, actually”, I become extremely suspicious of the principle. At best it’s being completely misunderstood. Sometimes on closer inspection the principles turn out to be nonsensical.

I think that’s the case with “readability”. It doesn’t make sense as a principle of good software design.

Readability Is Largely Subjective

Let’s start by assuming the statement, “90% of time is spent reading code” is correct (I’m not sure how this could ever be measured… it feels accurate enough, but giving it a number like it was scientifically measured is just making it sound more authoritative), and let’s assume that this straightforwardly implies readability is 10 times more important than other qualities of the code.

I'm going to say something, the exact same thing, in two different ways. The only difference will be the phrasing. But the meanings of these two sentences are exactly equivalent:

  1. This code is hard to read
  2. I have a hard time reading this code

A skill I think everyone should learn is to pick up on the habit of people using existential phrasing (sentences where the verb is "is") to take themselves out of the picture. People will share their opinions, but instead of saying "I like Led Zeppelin more than the Beatles", they say, "Led Zeppelin is better than the Beatles". See the difference? By removing themselves as the subject, they are declaring their subjective judgements to be objective traits of the things they are judging.

It's not that you find this code hard to read, which is a description of you and your reading abilities; it's that the code is just hard to read, like it's an innate quality of the code itself.

The implication is that all developers will always agree on which of two versions of code is more readable. Hopefully I don't have to emphasize or prove with examples that this is nowhere close to the truth. This is what leads to advice from developers that's basically tautological: "you should write code that's easy to read". Who's intentionally writing code that's hard to read? This is like telling someone, "you should be less lazy", as if anyone gets up in the morning and yells, "WOO, bring on the laziness!" Yeah, I know being lazy is bad; do you have any techniques for overcoming it, or just empty platitudes?

With the "readability" argument, developers are claiming, whether they realize it or not, that certain developers are consciously eschewing readability in favor of something else, like the DRY principle. Maybe some of them do, but that's rather presumptuous. Why do you assume those developers don't actually find the DRY code easier to read and understand? Maybe, for them, seeing nearly identical blocks of code in multiple places, realizing the only difference is the name of a local variable, and reverse engineering in their heads that it's the same code doing the same thing, which after a moment they can see is X, makes code harder to read than seeing a single line that calls the function doX(someLocalVariable). Maybe they introduced the abstraction because, in their opinion, that improves readability.

Are you assuming they consciously rejected readability in favor of something else simply because you find it harder to read? Why are you assuming if you find it harder to read, everyone else must too?

Example 1: I have a book in my hand that’s written in Spanish. Embarrassing as this is, I’m not a polyglot. I only speak and read English. The Spanish classes I took in college didn’t take. How hard is it for me to read the Spanish book? Well, straight up impossible. I just can’t read it, at all. A native Spanish speaker, on the other hand, might find it quite easy to read. Or, if I bothered to learn the language, I could read it too.

Example 2: I spend more time than I’d like to admit in public reading physics textbooks. For me, an undergrad or grad level textbook on classical mechanics is an easy read. I haven’t read fiction since middle school. Any moderately sophisticated adult fiction book would probably leave me lost trying to figure out the metaphors and recall tiny plot hints that were dropped chapters ago that explain what I’m reading now. Maybe I’d figure it out, but I wouldn’t call it easy. Contrast this with, well, normal people, and I’m sure it’s the exact opposite: they breeze through the fiction, picking up on all the literary devices, and can’t get through the introductory math section of the physics textbook.

Now, some of this is difference in personality. I’m the kind of person who gets and likes math, and doesn’t really care about fictional stories, some people are the opposite. But a lot of it is practice. I find physics literature easy to read because I’ve spent a lot of time practicing. That grad level textbook was not easy the first time I read it, not even close. But it got easier each time I read it. The next one I read was easier because I already had experience. I would probably get better at reading fiction if I ever practiced it.

You see my point, right?

You can’t read highly abstracted code because you aren’t used to reading it.

It’s not an innate quality of the code. It’s a quality of you. The code doesn’t read itself.

I am not implying there are no objective qualities of code that would affect anyone’s ability to read it. The author could have scrambled the words into a random order. Yeah, that makes it harder to read for everyone. This is irrelevant because no one does this. No one sets out to write something that’s hard to read. If you suggest a change that everyone agrees, including its author, makes it easier to read, it’s going to be accepted with no resistance (the author probably just didn’t think of it), and “more readable code is good” didn’t help reveal that change (it’s an empty tautology). If the suggested change is controversial, it’s quite likely some people think the change degrades readability.

This is why “readability” can’t be a guideline for development teams. Either team members are going to disagree on which code is actually more readable, or they agree code is hard to read but can’t think of how to make it better. Telling them “make it more readable” doesn’t help in either case.

This applies not just to the level of abstraction, but to the style and tools. Someone who's never heard of ReactiveX would probably be totally bewildered by code written by someone who thinks "everything is an Observable". But it's easy to read for the author, and for anyone else who has years of experience reading that style of code.

My point is not that there’s no such thing as bad code. Of course not. It’s just that whether code is high or low quality has nothing to do with whether any particular developer is good at reading it or not.

We have the fortune of not having to learn to read assembly code. Coders in the 1960s, or even 90s, weren’t so lucky. Super Mario World was written in assembly. The authors were probably so good at reading assembly their brains would recognize control flow structures like loops and subroutines. They didn’t just see unrelated blocks of code, they saw Mario jumping around and hitting scenery and launching fireballs.

You’ll find highly abstracted code easy to read if you spend more time reading it, as long as you don’t approach it with a cynical attitude of “this is stupid, I shouldn’t have to put up with code like this”. Again, I’m not saying you have to like that code. It may be the worst code you’ve ever read. And you can still make it easy to read.

It's somewhat ironic that I've spent so much of my career reading code I think was pretty poorly designed, from working on "legacy" apps, that I've gotten pretty damn good at reading bad code. It's bad, but there's a lot of it out there, so being good at reading it is necessary for my job.

How to Read Abstracted Code

I'm aware of a few differences between the way I read code and the way I think developers who particularly dislike highly abstracted code read it. These differences are probably relevant to why I have an easier time reading such code.

The absolute biggest one is: I don’t feel a need to “click into” an implementation, or class, or other abstraction and read its implementation details simply because I encountered a use of it in code. If I ever do feel that urge, it immediately turns into an urge to rename something. The name should make it self-evident enough what role it’s playing in the code that’s calling it.

I should not need to open up a coffee machine and look at how all its gears and pumps are hooked together to know that it's a coffee machine, and that pressing the "brew" button is going to make coffee come out of it. But if it's just an unlabeled machine with a blank button on it, okay, maybe I'd want to open it up to figure out what the hell it is. I could also just press the button and see what happens (for code, that means find the test that calls this method or instantiates this class, and if one doesn't exist, write it). Once I figure out what it is, I make sure to label it properly, so no one has to do that again.

The theme here is: when you organize a large codebase into modules, and you’re still clicking from one module to another while reading, or even worse stepping from one module to another during debugging, you’re doing it wrong. Want to challenge yourself? Put your library modules in different repos and consume them in your app through a package manager, in binary format. No source code available. You shouldn’t need a library’s source code while you’re working on the app that uses the library, and vice versa. This will also more or less force you to test the library with tests for the library instead of testing it indirectly through testing the app.

This is the single biggest complaint I've gotten about code I wrote for large applications with dozens of very similar screens. No, I'm not f**king copy-pasting it 25 times. That's a maintenance nightmare. But the complaint is, "I have to click through dozens of files just to get through a single behavior". No, you don't have to, you just do. Have you considered not clicking through? Have you done some self-reflection on what drives your need to dig through implementation details of implementation details to understand something? You usually can't do that once you hit the platform SDKs. It's not like there aren't implementation details inside those classes, they're just not publicly accessible. If you can stop there, why can't you stop at abstractions in your own source code?

The answer might be, "because they're bad abstractions". They're badly named, or worse yet leaky or un-cohesive; they slice an actual abstraction into pieces and scatter it around multiple files, or combine two unrelated functionalities and select one with a parameter, etc. Yes, developers can absolutely do a bad job of abstracting. The solution is not to destroy the abstractions, it's to fix them. This is why you need to ask yourself why you want to click into everything. There may be a good answer, and it may lead to an improvement in the code that's a lot better than just inlining implementations everywhere.

Clicking into stuff isn’t always bad. But you need to treat it as a context switch. You unload everything from your brain you were reading before you clicked in. It’s irrelevant now. How you got to a method or class or whatever should have nothing to do with what that method or class is doing. If the meaning isn’t clear from just looking at the method or class, it needs a better name, or it’s a bad abstraction if it really requires context from the call site to understand.

How did I learn to read code like this? Practicing. The most reliable way to get better at X is to practice X. Do you want running a marathon to be easy? Then run some marathons. The first few will be really, really hard. They’ll get easier. Want to learn to play a musical instrument? Pick up the instrument and start playing it. It will be really hard at first. It will get easier.

Just because you can get better at reading a certain style of code doesn’t mean you should. I’m sure it’s a waste of time for any of us to get as good at reading assembly as the Super Mario World programmers were. It’s not a waste of time to get good at reading highly abstract code. Change it as soon as you can if you want, but you’ll only get faster at that if it’s easy to read.

Does Optimizing for Readability Even Make Sense?

Why cite the actual number for how much time we spend reading code? It’s 90%, apparently. The reason is because we humans really like simple relationships. Ideally, linear and proportional ones. We like to believe that if we can measure the proportions of time allocation on different activities, we can straightforwardly calculate the weights for different optimizations. If we spend X1% amount of time doing Y1, and X2% of time doing Y2, then optimizing Y2 will be X2/X1 times as important as optimizing Y1. We love these kinds of simple calculations.

This is why I never hesitate to remind people that physics problems become literally impossible to solve analytically as soon as you introduce… wait for it… three bodies. The number of problems in the world that are actually solvable with these kinds of basic algebra techniques is minuscule. We are heavily biased toward simplifying real problems down until they become solvable like this, destroying in the process any relevance the now toy model of the problem has to the real problem.

There are a lot of unspoken assumptions between the statement “we spend 90% of our time reading code and 10% of the time modifying it” and “optimizing readability is 10x more important than optimizing modifiability”, or even that optimizing readability is important at all. I’m going to try to bring out some of them, and we’re going to find they aren’t justifiable.

First, let’s clarify what exactly we’re trying to “optimize”, overall. Hopefully, as professional developers, we’re not merely trying to optimize for our own comfort. Of course making our jobs more pleasant is good for the profession, because we’re more eager to do our jobs, and we’re people who work to live, not faceless cogs in a corporate machine. But they’re still jobs (not, as Red Foreman says, “Super Happy Fun Time”). At the end of the day we’re being paid to deliver software.

It doesn’t matter if we spend 99.99999% of our time reading code.

We aren’t paid to read code.

We are literally paid to modify code. The only reason the reading is valuable at all is because it is a necessary precondition to the eventual modification.

Thus, optimizing for being happy during the 90% of our jobs when we aren’t actually producing deliverables is extremely dubious.

This is literally in Uncle Bob’s quote: that the reason improving readability is important is insofar as it improves writability. He was not pitting the two against each other.

But this problem is being framed as a “readability vs. modifiability” problem (some sort of tradeoff). Any adjustment of code that makes it simultaneously easier to read but harder to modify is absolutely detrimental to productivity, because the only actually productive phase is during modification. The only possible way readability could help productivity is because improving readability improves modifiability.

If you find it way easier to read less abstracted code, but directly at the expense of making it more complicated to modify it appropriately, how is this improving productivity? I totally get that it’s making your day more pleasant, since you’re spending 90% of that day reading instead of modifying the code. But is that what we’re optimizing for? You not tearing your hair out because you hate your job so much? I’m not being totally facetious. It is important for productivity that you not want to go postal every time you open your IDE. But this can’t come at the expense of you getting stuff done. You need a way to be happier and more productive.

A way to do that is get used to reading the code that’s driving you crazy. Over time it won’t drive you crazy. At worst you’ll start finding it funny.

Honestly, I don't think readability and modifiability are in competition very often, if ever. I think it's a false dilemma, as I explained above, born out of developers presuming that all other developers share their opinion about which code is easiest to read. Developers assume the code they're staring at and finding hard to read was written by someone who agrees it's hard to read, but who believed readability and some other quality (which is going to be some aspect of modifiability) were in competition, and let the other quality win. I think it's way more likely the author believes his version of the code is both easier to read and easier to modify, and the current reader simply disagrees with him about that.

Is It Good We Read Code This Much?

Another unspoken assumption here is that the ratio of reading to modifying is constant. Not just constant, but specifically independent of the ease of reading and modifying code. No matter how much the ease of either of these activities changes, the assumption is made that the reading/modifying ratio will stay at 10/1. I mentioned above that this number was probably pulled out of someone's nether regions, and is completely meaningless (87.3% of all statistics are made up on the spot). Given that we're just making up numbers, the only thing we could do, if we wanted to model how this made-up number is coupled to the other variables we're thinking about modifying, is to make up more facts about how this coupling works. But that's still better than just assuming there's no coupling at all.

It seems pretty natural to me that there should be some rather strong coupling between the reading/modifying allocation ratio and how easy/pleasant both of these activities are. I would assume that making code much, much easier to modify is likely to lower the read/modify time ratio. Now that’s interesting. Remember again that, if we apply productivity multipliers to both activities, modifying gets a full 100% (actually that’s too simple, sometimes it can be negative, more on that below), and reading gets a big fat goose egg (again, not implying you can optimize reading out of the equation, it’s indirectly productive). In terms of productivity then, anything that decreases the read/modify time ratio is literally increasing the amount of actual productive time. That cannot be ignored. It could also disproportionately reduce the rate of productivity during modification (“velocity”), so this doesn’t necessarily imply increased productivity.

So, perhaps it's true that today developers typically spend 90% of their time reading code, and only 10% of the time modifying it. And maybe that's a problem. Maybe that's because most of the code bases we're working on today are huge piles of ten-year-old spaghetti, following a design that could barely scale to a 9-month-old app, and we can't help but stare at them in disbelief for hours before finally mustering up the courage to edit one line and see if it actually accomplishes what we're trying to do. Maybe we're in this situation because so much code out there is a total maintenance nightmare, and we can't figure out how to do a 10 minute modification without rereading a chunk of code for an hour and a half first. Maybe if we all wrote code with ease of modifiability as the primary goal, we wouldn't need to preamble the next modification with such a massive amount of preparatory study.

Or maybe not. Maybe this is just how it goes, and it’s unreasonable to think it could be any different. But there’s a wide range of options here. It could be that spending any less than 90% of time reading code is just unreasonable, or it could mean that spending less than 50% of time reading code is unreasonable, but we’ve got a solid 40% of potential improvement to work with. Or somewhere in between.

A large amount of my time is spent neither reading nor writing code but simply thinking about the problem. I really believe this is the most crucial phase of my job. Software engineering is more creative than formulaic, it’s much more like architecture than construction. The most valuable moments for the businesses I work for are random epiphanies I have in the middle of a meal when I realize a certain design is perfectly suited for the new requirements, and this is followed by a three day bender where I’m almost exclusively writing code. The time allocation for this is kind of nonsensical because it is often just background while I’m doing other stuff.

Does time allocation really even matter for productivity? Recall that if you’re optimizing a software algorithm, when you find that 90% of time is spent on one part, your goal is to reduce that number.

Quality vs. Quantity

Readability is subjective. But there are objective qualities of writing that virtually everyone will subjectively agree makes it harder to read.

What’s easier for me to read? A press release or a Top Secret classified document? Okay, sure, the former is easier to read, by simple virtue of it being available to me. That’s not an accident. I’m not supposed to read the classified document.

Now, imagine your friend complains that a novel is hard to read. You inquire further, and determine she is frustrated that the novel doesn’t come with the author’s notes and early drafts added as an appendix. She exclaims, “how can I know what the author meant here if he’s not going to share his notes with me!?”

Okay… if you’re having trouble reading a novel because you can’t read the author’s notes about it, there’s something wrong with the way you’re approaching reading the novel. That extra stuff is intentionally not part of the book.

Now imagine your other friend is complaining a novel is hard to read. You inquire further, and find he is jumping around, reading all the odd chapters and then reading all the even chapters. He reads Chapter 1, then 3, then 5, and so on, then circles back and reads Chapter 2, then 4, etc.

After lamenting to yourself how you managed to find all of these friends, you ask him in bewilderment why he isn't just reading the damn thing in order. The order is very intentional. Maybe the novel isn't chronological, and that's very important to its structure, but your friend thinks all stories should be presented in chronological order and is jumping around trying to reorganize it that way.

If your teammate wrote code you find hard to read, it’s quite possible your teammate finds it easy to read. It’s also possible that however you’re trying to read it, the author made that harder on purpose.

Reading code is only indirectly productive, insofar as it aids in future modification, which is directly productive. It is also the case that not all modification is productive. Correct modification is productive. Incorrect modification (introducing bugs) is counterproductive. That's even worse than reading, which is merely (directly) unproductive.

Another tendency I see among some developers is to focus on how overall easy it is to make changes to code. This is definitely a big part of the arguments I hear in favor of non-compiled (basically, not statically typed) languages. Having to create types is hard, it makes every little change harder.

And changing is what we get paid for, so making changing code harder is bad, right?

F*** no it isn’t.

It isn’t if most changes are actually harmful.

It’s way, way easier to break code than it is to fix it or correctly add or modify functionality. Thus, most changes to code are actually harmful.

If you’re into XP, you love automated tests, right? TDD all the way right!? All a test does is stop you from modifying code. It does nothing but make editing production code harder. That’s why some developers hate test suites.

In my experience, high quality code is high quality not so much because it makes correct modification easier (this is the #2 aspect of high quality code) but because it makes incorrect modification harder (this is the #1 aspect). Thus, when I hear developers complaining that working on this code base, given its design, is too hard, I’m suspicious that they’re complaining their bugs are being found as they’re being typed instead of later by QA or customers.

Is it “easier” to read the internal details of a library class that are exposed to you than it is to read those that are encapsulated? Well, yeah. But that doesn’t mean you should expose everything as public to make it “easier” to find. Hiding stuff is a fundamental tool of improving software quality.

If "easier to read" means less abstraction, with all the implementation details spelled out right there so you don't have to click around to find them, then it literally means less encapsulation. Less intentional hiding of stuff you might want to read but damn well shouldn't be reading, because all that reading it will do is make you want to couple to it.

Some parts of code need to be easy to read and understood by certain people in certain roles. Other parts of code, those people in those roles have no business reading it and doing so is likely to just confuse them or give them dangerous ideas. This is directly relevant to the way “readability” is often connected directly to abstraction.

Abstraction introduces boundaries, and intentionally moves one thing (an implementation detail) away from another thing (an invocation of the thing being implemented elsewhere) specifically because it is not helpful to read both at the same time. Doing so only feeds confusing and misleading ideas that they are relevant to each other, that they should be colocated, are likely to change together, that they form a cohesive unit, and have no independent meaning.

Highly abstract code does intentionally make certain approaches to reading code harder, as a direct corollary of making certain approaches to modifying code harder. Properly abstracted code makes the unhelpful, quality-degrading modifications harder, along with the reading that typically precedes those modifications. This is a feature, not a bug.

The structure may be stopping you from even reading the code in a way that leads up to you making a damaging modification, especially if the main problem you’re having is that you’d have to punch through a bunch of encapsulation boundaries to do whatever it is you’re trying to do.

Conclusion

The overall thesis here is simple: readability has no simple relation to code quality. Readability has a large subjective component, and to the extent authors intentionally frustrate certain approaches to reading code, there can be valid reasons for doing so.

Even for examples I come up with that I think are about as objectively more readable as it gets, like naming a variable thisSpecificThing instead of x, I know someone is gonna say, “no, x is better because it’s more concise, and the more concise the easier it is to read”. I can’t argue with that. I can’t tell someone they don’t like reading something they like reading. If I want to convince him to please name his variables descriptively, “more readable” isn’t a way to do it.

If you have trouble reading highly abstract code, you just need to practice reading it, which will help you understand why it discourages certain reading strategies. It’s also really helpful to practice writing it. If you always tell yourself you don’t have time or it’s not agile enough, when will you ever get this practice?

There’s nothing intrinsically unreadable about even extremely DRY’d, heavily templated and layered code. It may even be excessive or inappropriate abstraction (I don’t like saying “excessive”; what matters is quality, not quantity), but even that code can be easy to read if you just practice doing so. Then you’ll get really good at improving it.

The rules of thumb for achieving high code quality need to be objective, not subjective. Examples of subjective (or at best tautological: they’re all fancy synonyms for “good”) qualities are:

  • Readable
  • “Clean” (what does this mean? You wiped your code down with Lysol?)
  • Simple
  • Understandable

Examples of qualities that are objective but so vague that they aren’t functional rules of thumb on their own (they are rather guiding principles for producing the rules of thumb) are:

  • Scalable
  • Maintainable
  • Agile
  • Modular
  • Reusable
  • Resilient
  • Stable

Examples of objective rules of thumb are (you may agree or disagree about whether these are good rules; I’m just giving examples of rules that are objective):

  • Everything is private by default
  • Use value semantics until you need reference semantics
  • Always couple to the highest possible level of abstraction (start with types at the top of the hierarchy, and cast down only once needed)
  • Law of Demeter (never access a dependency’s dependencies directly)
  • Use inheritance for polymorphism, use composition for sharing code
  • Make all required dependencies of a class constructor parameters (don’t allow constructed but nonfunctional instances)
  • Use a single source of truth for state, and only introduce caching when performance measurements prove it necessary
  • Don’t use dynamic dispatch for build-time variations
  • Use dynamic dispatch over procedural control flow
  • Views should only react to model changes and never modify themselves directly

You see the difference, right? Good or bad, these rules actually tell you something concrete. Whether you are following them or not is a matter of fact, not of subjective preference.

The first category, in my opinion, is useless. The second category is useful only as a means of generating statements in the last category. It’s the last category that should go in the style guide of your README.

Since it sits in the first category, I think developers should stop talking about readability. I mean, if they just want to complain, confiding in their fellow team members about how frustrating their job can be, that’s fine. But citing “improve readability” as any kind of functional advice or guiding principle… it’s not. And if you have trouble reading code, reframe that as an opportunity for you to grow. After all, we’re software engineers; our job is to be good at reading and understanding code.

What Should Your Entities Be?

Introduction

So, you’re writing an application in an object-oriented programming language, and you need to work with data in a relational database. The problem is, data in a relational database is not in the same form as data in OOP objects. You have to solve this “impedance mismatch” in some way or another.

What exactly is the mismatch? Well, plain old data (POD) objects in OOP are compositions: one object is composed of several other objects, each of which are composed of several more objects, and so on. Meanwhile, relational databases are structured as flat rows in tables with relations to each other.

Let’s say we have the following JSON object:

{
  "aNumber": 5,
  "aString": "Hello",
  "anInnerObject": {
    "anotherNumber": 10,
    "aDate": "12-25-2015",
    "evenMoreInnerObject": {
      "yetAnotherNumber": 30,
      "andOneMoreNumber": 35,
      "oneLastString": "Hello Again"
    }
  }
}

In OOP code, we would represent this with the following composition of classes:

import Foundation

class TopLevelData {

  let aNumber: Int
  let aString: String
  let anInnerObject: InnerData

  init(aNumber: Int, aString: String, anInnerObject: InnerData) {
    self.aNumber = aNumber
    self.aString = aString
    self.anInnerObject = anInnerObject
  }
}

class InnerData {

  let anotherNumber: Int
  let aDate: Date
  let evenMoreInnerObject: InnerInnerData

  init(anotherNumber: Int, aDate: Date, evenMoreInnerObject: InnerInnerData) {
    self.anotherNumber = anotherNumber
    self.aDate = aDate
    self.evenMoreInnerObject = evenMoreInnerObject
  }
}

class InnerInnerData {

  let yetAnotherNumber: Int
  let andOneMoreNumber: Int
  let oneLastString: String

  init(yetAnotherNumber: Int, andOneMoreNumber: Int, oneLastString: String) {
    self.yetAnotherNumber = yetAnotherNumber
    self.andOneMoreNumber = andOneMoreNumber
    self.oneLastString = oneLastString
  }
}

This is effectively equivalent to the JSON representation, with one important caveat: in most OOP languages, unless you use structs, the objects have reference semantics: the InnerData instance is not literally embedded inside the memory of the TopLevelData instance, it exists somewhere else in memory, and anInnerObject is really, under the hood, a pointer to that other memory. In the JSON, each sub-object is literally embedded. This means we can’t, for example, refer to the same sub-object twice without having to duplicate it (and by extension all its related objects), and circular references are just plain impossible.

This value vs. reference distinction is another impedance mismatch between OOP objects and JSON, which is what standards like json-api are designed to solve.

In the database, this would be represented with three tables with foreign key relations:

TopLevelData

  id               Int   Primary Key
  aNumber          Int
  aString          Text
  anInnerObjectId  Int   Foreign Key -> InnerData.id

InnerData

  id                     Int    Primary Key
  anotherNumber          Int
  aDate                  Date
  evenMoreInnerObjectId  Int    Foreign Key -> InnerInnerData.id

InnerInnerData

  id                Int   Primary Key
  yetAnotherNumber  Int
  andOneMoreNumber  Int
  oneLastString     Text

This representation is more like how OOP objects are represented in memory, where foreign keys are equivalent to pointers. Despite this “under the hood” similarity, on the visible level they’re completely different. OOP compilers “assemble” the memory into hierarchical structures we work with, but SQL libraries don’t do the same for the result of database queries.

The problem you have to solve, if you want to work with data stored in such tables but represented as such OOP classes, is to convert between foreign key relations in tables and nesting in objects…

…or is it?

It may seem “obvious” that this impedance mismatch needs to be bridged. After all, this is the same data, with different representations. Don’t we need an adapter that converts one to the other?

Well, not necessarily. Why do we need to represent the structure of the data in the database in our code?

ORMs to the Rescue

Assuming that yes, we do need that, the tools that solve this problem are called object-relational mapping, or ORM, libraries. The purpose of an ORM is to automate the conversion between compositional objects and database tables. At minimum, this means we get a method to query the TopLevelData table and get back a collection of TopLevelData instances, where the implementation knows how to do the necessary SELECTs and JOINs to get all the necessary data, build each object out of its relevant parts, then assign them to each other’s reference variables.

If we want to modify data, instead of hand-crafting the INSERTs or UPDATEs, we simply hand the database a collection of these data objects, and it figures out what records to insert or update. The more clever ones can track whether an object was originally created from a database query, and if so, which fields have been modified, so that it doesn’t have to write the entire object back to the database, only what needs updating.

We still have to design the database schema, connect to it, and query it in some way, but the queries are abstracted from raw SQL, and we don’t have to bother with forming the data the database returns into the objects we want, or breaking those objects down into update statements.

The fancier ORMs go further than this and allow you to use your class definitions to build your database schema. They can analyze the source code for the three classes, inspect their fields, and work out what tables, and what columns in those tables, are needed. When they see a reference-type field, one object containing another, that’s a cue to create a foreign key relationship. With this, we no longer need to design the schema; we get it “for free” by simply coding our POD objects.

This is fancy and clever. It’s also, the way we’ve stated it, unworkably inefficient.

Inefficiency is a problem with any of these ORM solutions because of their tendency to work on the granularity of entire objects, which correspond to entire rows in the database. This is a big problem because of foreign key relations. The examples we’ve seen so far only have one-to-one relations. But we can also have one-to-many, which would look like TopLevelData having a field let someInnerObjects: [InnerData] whose type is a collection of objects, and many-to-many, which would add to this a “backward” field let theTopLevelObjects: [TopLevelData] on InnerData.

The last one is interesting because it is unworkable in languages that use reference counting for memory management. That’s a circular reference, which means you need to weaken one of them, but weakening one (say, the reference from InnerData back to TopLevelData) means you must hold onto the TopLevelData separately. If you, for example, query the database for an InnerData, and want to follow it to its related TopLevelData, that TopLevelData will be gone as soon as you get your InnerData back.
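A minimal Swift sketch of the problem, using hypothetical variants of the entity classes above, simplified to a one-to-many relation with a single weak back-reference:

class TopLevelData {
  var someInnerObjects: [InnerData] = []
}

class InnerData {
  // Must be weak to break the cycle TopLevelData -> InnerData -> TopLevelData.
  weak var theTopLevelObject: TopLevelData?
}

func fetchInnerData() -> InnerData {
  let top = TopLevelData()
  let inner = InnerData()
  top.someInnerObjects.append(inner)
  inner.theTopLevelObject = top
  // `top` is only strongly referenced by this local variable, so once we
  // return, it is deallocated and inner.theTopLevelObject becomes nil.
  return inner
}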

This is, of course, not a problem in garbage collected languages. You just have to deal with all the other problems of garbage collection.

With x-to-many relations, constructing a single object of any of our classes might end up pulling hundreds or thousands of rows out of the database. The promise we’re making in our class, however implicit, is that when we have a TopLevelData instance, we can follow it through references to any of its related objects, and again, and eventually end up on any instance that is, through an arbitrarily long chain of references, related back to that TopLevelData instance. In any nontrivial production database, that’s an immediate showstopper.

A less severe form of this same problem is that when I grab a TopLevelData instance, I might only need to read one field. But I end up getting the entire row back. Even in the absence of relations, this is still wasteful, and can become unworkably so if I’m doing something like a search that returns 10,000 records, where I only need one column from each, but the table has 50 columns in it, so I end up querying 50,000 cells of data. That 50x cost, in memory and CPU, is a real big deal.

By avoiding the laborious task of crafting SQL queries, where I worry about SELECTing and JOINing only as is strictly necessary, I lose the ability to optimize. Is that premature optimization? In any nontrivial system, eventually no.

Every ORM has to deal with this problem. You can’t just “query for everything in the object” in the general case, because referenced objects are “in”, and that cascades.

And this is where ORMs start to break down. We’ll soon realize that the very intent of ORMs, to make database records look like OOP objects, is fundamentally flawed.

The Fundamental Flaw of ORMs

We have to expose a way to SELECT only part of an object, and JOIN only on some of its relations. That’s easy enough. Entity Framework has a fancy way of letting you craft SQL queries that looks like you’re doing functional transformations on a collection. But the ability to make the queries isn’t the problem. Okay, so you make a query for only part of an object.

What do you get back?

An instance of TopLevelData? If so, bad. Very bad.

TopLevelData has everything. If I query only the aNumber field, what happens when I access aString? I mean, it’s there! It’s not like this particular TopLevelData instance doesn’t have an aString. If that were the case, it wouldn’t be a TopLevelData. A class in an OOP language is literally a contract guaranteeing the declared fields exist!

So, what do the other fields equal? Come on, you know what the answer’s gonna be, and it’s perfectly understandable that you’re starting to cry slightly (I’d be more concerned if you weren’t):

null

I won’t belabor this here, but the programming industry collectively learned somewhere in the last 10 years or so that the ancient decision of C, from which so many of our languages descend in some way, to make NULL not just a valid but the default value of any pointer type, is one of the most expensive bug-breeding decisions that’s ever been made. It wasn’t wrong for C to do this, NULL (just sugar for 0) is a valid pointer. The convention to treat this as “lacking a value” or “not set” is the problem, but again, in C, there’s really no better option.

But carrying this forward into C++, Objective-C, C# and Java was a bad idea. Well, okay, Objective-C doesn’t really have a better option either. C++ has pointers but more than enough tools to forever hide them from anyone except low level library developers. C# and Java don’t have them at all, and it’s their decision to make null a valid value of any declared variable type (reference types at least) that’s really regrettable. It’s a hole in their type system.

This is one of the greatest improvements that Swift and Kotlin (and Typescript if you configure it properly) made over these languages. null is not a String, so if I tell my type system this variable is a String, assigning null to it should be a type error! If I want to signal that a variable is either a String or null, I need a different type, like String?, or Optional<String>, or std::optional<String>, or String | null, which is not identical to String and can’t be cast to one.

I said I wouldn’t belabor this, so back to the subject: the ability of ORMs to do their necessary optimization in C# and Java is literally built on the biggest hole in their type systems. And of course this doesn’t work with primitives, so you either have to make everything boxed types, or, God forbid, decide that false and 0 are what anything you didn’t ask for will equal.

It really strikes to the heart of this issue that in Swift, which patched that hole, an ORM literally can’t do what it wants to do in this situation. You’d have to declare every field in your POD objects to be optional. But then what if a particular field is genuinely nullable in the database, and you want to be able to tell that it’s actually null, and not just something you didn’t query for? Make it an optional optional? For fu…
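A quick Swift sketch of where that leads, with a hypothetical partial-entity class, just to show the ambiguity you’d be signing up for:

// Every field must be optional so the ORM can omit whatever wasn't SELECTed.
class PartialTopLevelData {
  var aNumber: Int?
  // aString is also genuinely nullable in the database, so distinguishing
  // "not queried" from "queried, and NULL" requires an optional optional.
  var aString: String??
}

let partial = PartialTopLevelData()
switch partial.aString {
case .none:
  print("aString wasn't part of the query")
case .some(.none):
  print("aString was queried, and it's NULL in the database")
case .some(.some(let value)):
  print("aString is \(value)")
}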

Either way, making every field optional would throw huge red flags up, as it should.

In Java and C#, there’s no way for me to know, just by looking at the TopLevelData instance I have, if the null or false or 0 I’m staring at came from the database or just wasn’t queried. All the information about what was actually SELECTed is lost in the type system.

We could try to at least restrict this problem to the inefficiency of loading an entire row (without relations) by making relations lazy loaded: the necessary data is only queried from the database when it is accessed in code. This tries to solve the problem of ensuring the field has valid data whenever accessed, while also avoiding the waste of loading a potentially very expensive (such as x-to-many) relation that is never accessed.

This comes with a host of its own problems, and in my experience it’s never actually a workable solution. Database connections are typically managed with some type of “context” object that, among other things, is how you control concurrency, since database access is generally not thread safe. You usually create a context in order to make a query, get all the data you need, then throw the context away once the data is safely stored in POD objects.

If you try to lazy-load relations, you’re trying to hit the database after the first query is over, and you’ve thrown the context away. Either it will fail because the context is gone, or it’s going to fail because the object with lazy-loading capabilities keeps the context alive, and when someone else creates a context it throws a concurrency exception.

You can try to solve this by storing some object capable of creating a new context in order to make the query on accessing a field. But even if you can get this to work, you’ll end up potentially hitting the database while using an object you returned to something like your UI. To avoid UI freezes you’d have to be aware that some data is lazy-loaded, keep track of whether it’s been loaded or not, and if not, make sure to do it asynchronously and call back to update the UI when it’s ready. By that point you’re just reinventing an actual database query in a much more convoluted way.

The Proper Solution

What we’re trying to do is simply not a good idea. Returning a partial object of some class, but having it declared as a full instance of that class, violates the basic rules of object-oriented programming. The whole point of a type system is to signal that a particular variable has particular members with valid values. Returning partials throws that out the window.

We can do much better in Typescript, whose type system is robust enough to let us define Partial<T> for any T, which maps every member of type M in T to a member of type M | undefined. That way, we’re at least signaling in the type system that we don’t have a full TopLevelData. But we still can’t signal which part of TopLevelData we have. The stuff we queried for becomes nullable even when it shouldn’t be, and we have to do null checks on everything.

Updating objects is equally painful with ORMs. We have to supply a TopLevelData instance to the database, which means we need to create a full one somehow. But we only want to update one or a few fields. How does the framework know what parts we’re trying to update? Combine this with the fact that part of the object may be missing because we didn’t query for all of it, and what should the framework do? Does it interpret those empty fields as instructions to clear the data from the database, or just ignore them?

I know Entity Framework tries to handle this by having the generated subclasses of your POD objects try to track what was done to them in code. But it’s way more complicated than just setting fields on instances and expecting it to work. And it’s a disaster with relations, especially x-to-many relations. I’ve never been able to get update statements to work without loading the entire relations, which it needs just so it can tell exactly how what I’m saving back is different from what’s already there. That’s ridiculous. I want to set a single cell on a row, and end up having to load an entire set of records from another table just so the framework can confirm that I didn’t change any of those relations?

Well, of course I do. If I’m adding three new records to a one-to-many relation, and removing one, then how do I tell EF this? For the additions, I can’t just make an entity where the property for this relationship is an array that contains the three added records. That’s telling EF those are now the only three related entities, and it would try to delete the rest. And I couldn’t tell it to remove any this way. The only thing I can do is load the current relations, then apply those changes (delete one, add three) to the loaded array and save it back. There’s no way to do this in an acceptably optimized fashion.

The conclusion is inescapable:

It is not correct to represent rows with relations in a database as objects in OOP

It should be fairly obvious, then, what we should be representing as objects in OOP:

We should represent queries as objects in OOP

Instead of writing a POD object to mirror the table structure, we should write POD objects to mirror query structures: objects that contain exactly and only the fields that a query SELECTed. Whether those fields came from a single table or were JOINed together doesn’t matter. The point is that whatever set of data each database result has, we write a class that contains exactly those fields.

For example, if I need to grab the aNumber from a TopLevelData, the aDate from its related InnerData, and both yetAnotherNumber and oneLastString from its related InnerInnerData, I write the following class:

struct ThisOneQuery {

  let aNumber: Int
  let aDate: Date
  let yetAnotherNumber: Int
  let oneLastString: String
}
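To make that concrete, here’s a rough sketch of a Store method that populates it. The Database and DatabaseRow types here are stand-ins for whatever driver you actually use, not a real library API:

import Foundation

// Stand-in database API: just enough to run a query and read typed columns.
protocol DatabaseRow {
  func int(_ column: String) throws -> Int
  func string(_ column: String) throws -> String
  func date(_ column: String) throws -> Date
}

protocol Database {
  func query(_ sql: String) throws -> [DatabaseRow]
  func execute(_ sql: String, parameters: [Any]) throws
}

struct Store {
  let database: Database

  // SELECT exactly the fields ThisOneQuery needs, and nothing more.
  func fetchThisOneQuery() throws -> [ThisOneQuery] {
    let rows = try database.query("""
      SELECT t.aNumber, i.aDate, ii.yetAnotherNumber, ii.oneLastString
      FROM TopLevelData t
      JOIN InnerData i ON i.id = t.anInnerObjectId
      JOIN InnerInnerData ii ON ii.id = i.evenMoreInnerObjectId
      """)

    return try rows.map { row in
      ThisOneQuery(
        aNumber: try row.int("aNumber"),
        aDate: try row.date("aDate"),
        yetAnotherNumber: try row.int("yetAnotherNumber"),
        oneLastString: try row.string("oneLastString")
      )
    }
  }
}

The particular API shape doesn’t matter; what matters is that the return type names exactly what this query produces.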

This means we may have a lot more classes than if we just wrote one for each table. We might have dozens of carefully crafted queries, each returning slightly different combinations of data. Each one gets a class. That may sound like extra work, but it’s upfront work that saves work later, as is always the case with properly designing a type system. No more accidentally accessing or misinterpreting nulls because they weren’t part of the query.

We apply the same principle to modifying data. Whatever exact set of values a particular update statement needs, we make a class that contains all and only those fields, regardless of whether they come from a single table or get distributed to multiple tables. Again, we use the type system to signal to users of an update method on our Store class exactly what they need to provide, and what is going to be updated.
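Sticking with the stand-in Store and Database from the sketch above, an update object might look like this; the parameter type tells callers exactly what they must provide and exactly what gets written:

struct UpdateTopLevelData {
  let id: Int
  let newANumber: Int
  let newAString: String
}

extension Store {
  // All and only the values this statement needs.
  func apply(_ update: UpdateTopLevelData) throws {
    try database.execute(
      "UPDATE TopLevelData SET aNumber = ?, aString = ? WHERE id = ?",
      parameters: [update.newANumber, update.newAString, update.id]
    )
  }
}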

These query objects don’t need to be flat. They can be hierarchical and use reference semantics wherever it is helpful. We can shape them however we want, in whatever way makes it easiest to work with them. The rule is that every field is assigned a meaningful value, and nulls can only ever mean that something is null in the database.

Entity Framework does something interesting that approximates what I’m talking about here: when you do a Select on specific fields, the result is an anonymous type that contains only the fields you selected. This is exactly what we want. However, since the type is anonymous (it doesn’t have a name), you can’t return them as is. We still need to write those query result classes and give them a name, but this feature of Entity Framework will make it a lot easier to construct those objects out of database queries.

We can get similar functionality in C++ by using variadic templates, to write a database query method that returns a tuple<...> containing exactly the fields we asked for. In that case, it’s a named type and we can return it as is, but the type only indicates which types of fields, in what order, we asked for. The fields aren’t named. So we’d still want to explicitly define a class, presumably one we can reinterpret_cast that tuple<...> to.

The payoff of carefully crafting these query classes is that we get stronger coupling between what our model layers work with to drive the UI and what the UI actually needs, and looser coupling between the model layer and the exact details of how data is stored in a database. It’s always a good idea to let requirements drive design, including the database schema. Why create a schema that doesn’t correspond in a straightforward way to what the business logic in your models actually needs? But even if we do this, some decisions about how to split data across tables, whether to create intermediate objects (as is required for many-to-many relations), and so on may arise purely out of the mechanics of relational databases, and constitute essentially “implementation details” of efficiently and effectively storing the data.

Writing classes that mirror the table structure of a database needlessly forces the rest of the application to work with data in this same shape. You can instead start by writing the exact POD objects you’d most prefer to use to drive your business logic. Once you’ve crafted them, they signal what needs to be queried from the database. You have to write your SELECTs and JOINs so as to populate every field on these objects, and no more.

If, later, the UI or business logic requirements change, and this necessitates adding or removing a field to/from these query classes, your query method will no longer compile, guiding you toward updating the query appropriately. You get a nice, compiler-driven pipeline from business requirements to database queries, optimized out of the box to supply exactly what the business requirements need, wasting no time on unnecessary fetches.

This also guides how queries and statements are batched. Another problem with ORMs is that you can’t bundle fetches of multiple unrelated entities into a single query, because there’s no type representing the result. It would be a tuple, but only C++ has the machinery necessary to write a generic function that returns arbitrary tuples (variadic generics). You’re stuck having to make multiple separate trips to the database. This may be okay, or even preferred, in clients working with embedded databases, but wherever the database lives on another machine, each database statement is a network trip, and you want to batch those where possible.

By writing classes for queries, the class can be a struct that contains the structs for each part of the data, however unrelated the parts are to each other. With this we can hit the database once, retrieving all we need, and nothing extra, even if it constitutes multiple fetches of completely unrelated data. We can do the same with updates, though we could also achieve that in an ORM with transactions.
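As a sketch (the struct names here are illustrative), a single screen’s worth of data might be bundled like this, so that one batched round trip can populate all of it:

struct ScreenQuery {
  // Two unrelated result sets, retrieved from the database in a single trip.
  struct TopLevelSummary {
    let aNumber: Int
    let aString: String
  }

  struct InnerInnerDetail {
    let yetAnotherNumber: Int
    let oneLastString: String
  }

  let summaries: [TopLevelSummary]
  let detail: InnerInnerDetail
}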

The queries-as-classes approach also integrates very well with web APIs, especially if they follow a standard like json-api that supports partial objects. Anyone who’s tried writing the network glue for updating a few fields in an object whose class represents an entire backend database entity knows the awkwardness of having to decide either to inefficiently send the entire object every time, or come up with some way to represent partial objects. This could be straightforward in Typescript, where a Partial<T> would contain only what needs to be updated, but even there we can improve the situation with transaction objects because they signal what data is going to be updated. With queries, requesting specifically the fields needed translates straightforwardly into parsing the responses into the query objects, which contain the same fields as what was requested.

Conclusion

It turns out that it is not only unnecessary but wholly misguided to try to represent your database tables as classes in OOP code. That set of classes exists conceptually, as that’s exactly what the database is ultimately storing, but just because those classes conceptually exist doesn’t mean you need to code them. You may find it useful to write them purely to take advantage of the schema-specifying features of ORMs, but their usage should not go beyond this.

The actual interactions with the database, with data going in and out, don’t work in terms of entire rows with all their relations, but with carefully selected subsets. The solution we’re yearning for, that made us think an ORM might help, is in fact a rather different solution of representing individual queries as classes. Perhaps eventually a tool can be written that automates this with some type of code generation. Until then, I promise you’ll be much happier handwriting those query classes than you ever were working with entity classes.

On ReactiveX – Part IV

In the previous parts, first we tried to write a simple app with pure Rx concepts and were consumed by demons, then we disentangled the Frankenstein Observable into its genuinely cohesive components, then organized the zoo of operators by associating them with their applicable targets. Now it’s time to put it all together and fix the problems with our Rx app.

Remember, our requirements are simple: we have a screen that needs to download some data from a server, display parts of the downloaded data in two labels, and have those labels display “Loading…” until the data is downloaded. Let’s recall the major headaches that arose when we tried to do this with Rx Observables:

  • We started off with no control over when the download request was triggered, causing it to be triggered slightly too late
  • We had trouble sharing the results of one Observable without inadvertently changing significant behavior we didn’t want to change. Notice that both this and the previous issue arose from the trigger for making the request being implicit
  • We struggled to find a stream pipeline that correctly mixed the concepts of “hot” and “cold” in such a way that our labels displayed “Loading…” only when necessary.

With our new belt of more precise tools, let’s do this right.

The first Observable we created was one to represent the download of data from the server. The root Observable was a general web call Observable that we created with an HTTPRequest, and whose type parameter is an HTTPResponse. So, which of the four abstractions is this, really? It’s a Task. Representing this as a “stream” makes no sense because there is no stream… no multiple values. One request, one response. It’s just a piece of work that takes time, that we want to execute asynchronously, and may produce a result or fail.

We then transformed the HTTPResponse using parsers to get an object representing the DataStructure our server responds with. This is a transformation on the Task. This is just some work we need to do to get a Task with the result we actually need. So, we apply transformations to the HTTP Task, until we end up with a Task that gives us the DataStructure.

Then, what do we do with it? Well, multiple things. What matters is at some point we have the DataStructure we need from the server. This DataStructure is a value that, at any time, we either have or don’t have, and we’re interested in when it changes. This is an ObservableValue, particularly of a nullable DataStructure. It starts off null, indicating we haven’t retrieved it yet. Once the Task completes, we assign the retrieved result to this ObservableValue.

That last part… having the result of a Task get saved in an ObservableValue… that’s probably a common need. We can write a convenience function for that.

We then need to pull out two different strings from this data. These are what will be displayed by labels once the data is loaded. We get this by applying a map to the ObservableValue for the DataStructure, resulting in two ObservableValue Strings. But wait… the source ObservableValue DataStructure is nullable. A straight map would produce a nullable String. But we need a non-null String to tell the label what to display. Well, what does a null DataStructure represent? That the data isn’t available yet. What should the labels display in that case? The “Loading…” text! So we null-coalesce the nullable String with the loading text. Since we need to do that multiple times, we can define a reusable operator to do that.

Finally we end up with two public ObservableValue Strings in our ViewModel. The View wires these up to the labels by subscribing and assigning the label text on each update. Remember that ObservableValues give the option to have the subscriber be immediately notified with the current value. That’s exactly what we want! We want the labels to immediately display whatever value is already assigned to those ObservableValues, and then update whenever those values change. This only makes sense for ObservableValues, not for any kind of “stream”, which doesn’t have a “current” value.

This is precisely that “not quite hot, not quite cold” behavior we were looking for. Almost all the pain we experienced with our Rx-based attempt was due to us taking an Observable, which includes endless transformations, many of which are geared specifically toward streams, subscribing to it and writing the most recently emitted item to a UI widget. What is that effectively doing? It’s caching the latest item, which as we saw is exactly what a conversion from an EventStream to an ObservableValue does. Rx Observables don’t have a “current value”, but the labels on the screen certainly do! It turned out that the streams we were constructing were very sensitive to timing in ways we didn’t want, and it was becoming obvious by remembering whatever the latest emitted item was. By using the correct abstraction, ObservableValue, we simply don’t have all these non-applicable transformations like merge, prepend or replay.

Gone is the need to carefully balance an Observable that gets made hot so it can begin its work early, then caches its values to make it cold again, but only caches one to avoid repeating stale data (remember that caching the latest value from a stream is really a conversion from an EventStream to an… ObservableValue!). All along, we just needed to express exactly what a reactive user interface needs: a value, whose changes over time can be reacted to.

Let’s see it:

class DataStructure
{
    String title;
    int version;
    String displayInfo;
    ...
}

extension Task<Result>
{
    public void assignTo(ObservableValue<Result> destination)
    {
        Task.start(async () ->
        {
            destination.value = await self;
        });
    }
}

extension ObservableValue<String?>
{
    public ObservableValue<String> loadedValue()
    {
        String loadingText = "Loading...";

        return self
            .map(valueIn -> valueIn ?? loadingText);
    }
}

class AppScreenViewModel
{
    ...

    public final ObservableValue<String> dataLabelText;
    public final ObservableValue<String> versionLabelText;

    private final ObservableValue<DataStructure?> data = new ObservableValue<DataStructure?>(null);
    ...

    public AppScreenViewModel()
    {
        ...

        Task<DataStructure> fetchDataStructure = getDataStructureRequest(_httpClient)
            .map(response -> parseResponse<DataStructure>(response));

        fetchDataStructure.assignTo(data);

        dataLabelText = data
            .map(dataStructure -> dataStructure?.displayInfo)
            .loadedValue();

        versionLabelText = data
            .map(dataStructure -> dataStructure?.version.toString())
            .loadedValue();
    }
}

...

class AppScreen : View
{
    private TextView dataLabel;
    private TextView versionLabel;

    ...

    private void bindToViewModel(AppScreenViewModel viewModel)
    {
        subscriptions.add(viewModel.dataLabelText
            .subscribeWithInitial(text -> dataLabel.setText(text)));

        subscriptions.add(viewModel.versionLabelText
            .subscribeWithInitial(text -> versionLabel.setText(text)));
    }
}

Voila.

(By the way, I’m only calling ObservableValues “ObservableValue“s to avoid confusing them with Rx Observables. I believe they are what should be properly named Observable, and that’s what I would call them in a codebase that doesn’t import Rx)

This, I believe, achieves the declarative UI style we’re seeking, that avoids the need to manually trigger UI refreshes and ensures rendered data is never stale, and also avoids the pitfalls of Rx that are the result of improperly hiding multiple incompatible abstractions behind a single interface.

Where can you find an implementation of these concepts? Well, I’m working on it (I’m doing the first pass in Swift, and will follow with C#, Kotlin, C++ and maybe Java implementations), and maybe someone reading this will also start working on it. For the time being you can just build pieces you need when you need them. If you’re building UI, you can do what I’ve done several times and write a quick and dirty ObservableValue abstraction with map, flatMap and combine. You can even be lazy and make them all eagerly stored (it probably isn’t that inefficient to just compute them all eagerly, unless your app is really crazy sophisticated). You’ll get a lot of mileage out of that alone.

You can also continue to use Rx, as long as you’re strict about never using Observables to represent DataStreams or Tasks. They can work well enough as EventStreams, and Observables that derive from BehaviorSubjects work reasonably well as ObservableValues (until you need to read the value imperatively). But don’t use Rx as a replacement for asynchrony. Remember that you can always block threads you create, and yes you can create your own threads and block them as you please, and I promise the world isn’t going to end. If you have async/await, remember that it was always the right way to handle DataStreams and Tasks, but don’t try to force it to handle EventStreams or ObservableValues… producer-driven callbacks really are the right tool for that.

Follow these rules and you can even continue to use Rx and never tear your hair out trying to figure out what the “temperature” of your pipelines is.

On ReactiveX – Part III

Introduction

In the last part, we took a knife to Rx’s one-size-fits-all “Observable” abstraction, and carved out four distinct abstractions, each with unique properties that should not be hidden from whoever is using them.

The true value, I believe, of Rx is in its transformation operators… the almost endless zoo of options for how to turn one Observable into another one. The programming style Rx encourages is to build every stream you need through operators, instead of by creating Subjects and publishing to them “imperatively”.

So this begs the question… what happens to this rich (perhaps too rich, as the sheer volume of them is headache-inducing) language of operators when we cleave the Observable into the EventStream, the DataStream, the ObservableValue and the Task?

What happens is operators also get divided, this time into two distinct categories. The first category is those operators that take one of those four abstractions and produces the same type of abstraction. They are, therefore, still transformations. They produce an EventStream from an EventStream, or a DataStream from a DataStream, etc. The second category is those operators that take one of those four abstractions and produces a different type of abstraction. They are not transformers but converters. We can, for example, convert an EventStream to a DataStream.

Transformations

First let’s talk about transformations. After dividing up the abstractions, we now need to divvy up the transformations that were originally available on Observable, by determining which ones apply to which of these more focused abstractions. We’ll find that some still apply in all cases, while others simply don’t make sense in all contexts.

We should first note that, like Observable itself, all four abstractions I identified retain the property of being generics with a single type parameter. Now, let’s consider the simplest transformation, map: a 1-1 conversion transformation that can change that type parameter. This transformation continues to apply to all four abstractions.

We can map an EventStream to an EventStream: this creates a new EventStream, which subscribes to its source EventStream, and for each received event, applies a transformation to produce a new event of a new type, and then publishes it. We can map a DataStream to a DataStream: this creates a new DataStream, where every time we consume one of its values, it first consumes a value of its source DataStream, then applies a transformation and returns the result to us. We can map an ObservableValue: this creates a new ObservableValue whose value is, at any time, the provided transformation of the source ObservableValue (this means it must be read only. We can’t manually set the derived ObservableValue‘s value without breaking this relationship). It therefore updates every time the source ObservableValue updates. We can map a Task to a Task: this is a Task that performs its source Task, then takes the result, transforms it, and returns it as its own result.
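As a rough Swift sketch of one of these cases, here is what map on an ObservableValue might look like. This ObservableValue is a simplified, eagerly-stored stand-in, not a real library type; the lazy-versus-cached distinction is discussed further below:

// Simplified, non-thread-safe stand-in for the ObservableValue abstraction.
final class ObservableValue<Value> {
  private var subscribers: [(Value) -> Void] = []

  var value: Value {
    didSet { subscribers.forEach { $0(value) } }
  }

  init(_ value: Value) {
    self.value = value
  }

  func subscribe(_ onUpdate: @escaping (Value) -> Void) {
    subscribers.append(onUpdate)
  }

  // map: the derived value is, at all times, the transform of the source's value.
  func map<Derived>(_ transform: @escaping (Value) -> Derived) -> ObservableValue<Derived> {
    let derived = ObservableValue<Derived>(transform(value))
    subscribe { derived.value = transform($0) }
    return derived
  }
}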

We also have the flatMap operator. The name is confusing, and derives from collections: a flatMap of a Collection maps each element to a Collection, then takes the resulting Collection of Collections and flattens it into a single Collection. Really, this is a compound operation that first does a map, then a flatten. The generalization of this transformation is that it takes an X of X of T, and turns it into an X of T.

The flatten operator, and therefore flatMap, also continues to apply to all of our abstractions. How do we turn an EventStream of EventStreams of T into an EventStream of T? By subscribing to each inner EventStream as it is published by the outer EventStream, and publishing its events to a single EventStream. The resulting EventStream receives all the events from all the inner EventStreams as they become available. How do we turn a DataStream of DataStreams of T into a DataStream of T? When we consume a value, we consume a single DataStream from the outer DataStream, store it, and then supply each value from it on each consumption, until it runs out; then we go consume the next DataStream from the outer DataStream, and repeat. How do we turn an ObservableValue of an ObservableValue of T into an ObservableValue of T? By making the current value the current value of the current ObservableValue of the outer ObservableValue (which then updates every time either the outer or inner ObservableValue updates). How do we turn a Task that produces a Task that produces a T into a Task that produces a T? We run the outer Task, then take the resulting inner Task, run it, and return its result.
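For the Task case, Swift’s own Task type is close enough to the abstraction described here to sketch flatten directly (restricted to non-failing tasks to keep the sketch short):

extension Task where Failure == Never {
  // flatten: a Task that produces a Task becomes a single Task producing
  // the inner task's result.
  func flattened<T>() -> Task<T, Never> where Success == Task<T, Never> {
    Task<T, Never> {
      let inner = await self.value   // run the outer task, get the inner task
      return await inner.value       // run the inner task, return its result
    }
  }
}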

In Rx land, it was realized that flatten actually has another variation that didn’t come up with ordinary Collections. Each time a new inner Observable is published, we could continue to observe the older inner Observables, or we can stop observing the old one and switch to the new one. This is a slightly different operator called switch, and it leads to the combined operator switchMap. For us, this continues to apply only to EventStream, because the concept depends on being producer-driven: streams publish values of their own accord, and we must decide whether to keep listening for them. DataStreams are consumer-driven, so flatMap must get to the end of one inner DataStream before moving to the next. ObservableValue and Task don’t involve multiple values, so the concept doesn’t apply there.

Now let’s look at another basic transformation: filter. Does this apply to all the abstractions? No... because filtering is inherently about multiple values: some get through, some don’t. But only two of our four abstractions involve multiple values: EventStream and DataStream. We can therefore meaningfully filter those. But ObservableValue and Task? Filtering makes no sense there, because there’s only one value or result. Any other transformations that inherently involve multiple values (filter is just one, others include buffer or accumulate) therefore only apply to EventStream and DataStream, but not ObservableValue or Task.

Another basic operator is combining: taking multiple abstractions and combining them into a single one. If we have an X of T1 and an X of T2, we may be able to combine them into a single X of a combined T1 and T2 (i.e. a (T1, T2) tuple). Or, if we have a Collection of Xs of Ts, we can combine them into a single X of a Collection of Ts, or possibly a single X of T. Can this apply to all four abstractions? Yes, but we’ll see that for abstractions that involve multiple values, there are multiple ways to “combine” their values, while for the ones that involve only a single value, there’s only one way to “combine” their values.

That means for ObservableValue and Task, there’s one simple combine transformation. A combined ObservableValue is one whose value is the tuple/collection made up of its sources’ values, and therefore it changes when any one of its source values changes. A combined Task is one that runs each one of its source Tasks in parallel, waits for them all to finish, then returns the combined results of all as its own result (notice that, since Task is fundamentally concerned with execution order, this becomes a key feature of one of its transformations).

With EventStream and DataStream, there are multiple ways in which we can combine their values. With EventStream, we can wait for all sources to publish one value, at which point we publish the first combined value; we store the latest value from each source, then each time any source publishes, we update only that source’s value, keeping all the rest the same, and publish the new combination. This is the combineLatest operator: each published event represents the most recently published events from each source stream. We can alternatively wait for each source to publish once, at which point we publish the combination, then we don’t save any values, and again wait for all sources to publish again before combining and publishing again. This is the zip operator.

But combineLatest doesn’t make sense for DataStream, because it is based on when each source stream publishes. The “latest” of combineLatest refers to the timing of the source stream events. Since DataStreams are consumer-driven, there is no timing. The DataStream is simply asked to produce a value by a consumer. Therefore, there’s only the combine where, when a consumer consumes a value, the combined DataStream consumes a value from each of its sources, combines them and returns the result. This is the zip operator, which continues to apply to DataStream.

Both EventStream and DataStream also have ways to combine multiple streams into a single stream of the same type. With EventStream, this is simply the stream that subscribes to multiple sources and publishes when it receives a value from any of them. This is the merge operator. The order in which a merged EventStream publishes is dictated by the order in which it receives events from its source streams. We can do something similar with DataStream, but since DataStreams are consumer-driven, the transformation has to decide which source to consume from. A merge would be for the DataStream to first consume all the values of the first source stream, then all the values of the second one, and so on (thus making it equivalent to a merge on a Collection… we could also call this concatenate to avoid ambiguity). We can also do a roundRobin: each time we consume a value, the combined stream consumes one from a particular source, then one from the next one, and so on, wrapping back around after it reaches the end. There are all sorts of ways to decide how to pick the order of consumption, and a custom algorithm can probably be plugged in as a Strategy to a transformation.

Somewhat surprisingly, I believe that covers it for ObservableValue and Task, with one exception (see below): map, flatten and combine are really the only transformations we can meaningfully do with them, because all other transformations involve either timing or value streams. Most of the remaining transformations from Rx we haven’t talked about will still apply to both EventStream and DataStream, but there are some important ones that only apply to one or the other. Any transformations that involve order apply only to DataStream, for example append or prepend. Any transformations that are driven by timing of the source streams apply only to EventStream, for example debounce or delay. And some transformations are really not transformations but conversions.

The exception I mentioned is for ObservableValue. EventStreams are “hot”, and their transformations are “eager”, and it never makes sense for them to not be (in the realist interpretation of events and “observing”). Derived ObservableValues, however, can be “eager” or “lazy”, and both are perfectly compatible with the abstraction. If we produce one ObservableValue from a sequence of transformations (say, maps and combines) on other ObservableValues, then we can choose to either perform those transformations every time someone reads the value, or we can choose to store the value, and simply serve it up when asked.

I believe the best way to implement this is to have derived ObservableValues be lazy by default: their values get computed from their sources on each read. This also means when there are subscribers to updates, they must subscribe to the source values’ updates, then apply the transformations each time new values are received by the sources. But sometimes this isn’t the performance we want. We might need one of those derived values to be fast and cheap to read. To do that, we can provide the cache operator. This takes an ObservableValue, and creates a new one that stores its value directly. This also requires that it eagerly subscribe to its source value’s updates and use them to update the stored value accordingly. There is also an issue of thread safety: what if we want to read a cached ObservableValue from one thread but it’s being written to on another thread? To handle this we can allow the cache operator to specify how (using a Scheduler) the stored value is updated. These issues of caching and thread safety are unique to observable values.
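Here is a rough Swift sketch of that distinction, again with stand-in types rather than a real library, and with thread safety and the Scheduler parameter left out:

// Stand-in: anything with a readable current value and subscribable updates.
protocol ObservableValueType {
  associatedtype Value
  var value: Value { get }
  func subscribe(_ onUpdate: @escaping (Value) -> Void)
}

// Lazy (the default): a mapped value recomputes from its source on every read.
final class LazyMappedValue<Source: ObservableValueType, Value>: ObservableValueType {
  private let source: Source
  private let transform: (Source.Value) -> Value

  init(_ source: Source, _ transform: @escaping (Source.Value) -> Value) {
    self.source = source
    self.transform = transform
  }

  var value: Value { transform(source.value) }  // computed on each read

  func subscribe(_ onUpdate: @escaping (Value) -> Void) {
    let transform = self.transform
    source.subscribe { onUpdate(transform($0)) }  // applied per source update
  }
}

// cache(): eagerly subscribes to the source and stores the computed value,
// trading work on every source update for cheap reads.
final class CachedValue<Source: ObservableValueType>: ObservableValueType {
  private(set) var value: Source.Value
  private var subscribers: [(Source.Value) -> Void] = []

  init(_ source: Source) {
    value = source.value
    source.subscribe { [weak self] newValue in
      self?.value = newValue
      self?.subscribers.forEach { $0(newValue) }
    }
  }

  func subscribe(_ onUpdate: @escaping (Source.Value) -> Void) {
    subscribers.append(onUpdate)
  }
}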

Converters

Now let’s talk about how we can turn one of these four abstractions into another one of the four. In total that would be 12 possible converters, assuming that there’s a useful or sensible way to convert each abstraction into each other one.

Let’s start with EventStream as the source.

What does it mean to convert an EventStream into a DataStream? It means we’re taking a producer-driven stream and converting it to a consumer-driven stream. Remember, the key distinction is that EventStreams are defined by timing: the events happen when they do, and subscribers are either subscribed at the time or they miss them. DataStreams are defined by order: the data are returned in a specific sequence, and it’s not possible to “miss” one (you can skip one, but that’s a conscious choice of the consumer). Thus, turning an EventStream into a DataStream is fundamentally about storing events as they are received until they are later consumed, ensuring that none can be missed. It is, therefore, buffering. For this reason, this conversion operator is called buffer. It internally builds up a Collection of events received from the EventStream, and when a consumer consumes a value, the first element in the collection is returned immediately. If the Collection is empty, the consumer will be blocked until the next event is received.
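A rough Swift sketch of the idea, using a stand-in EventStream and leaning on AsyncStream for the consumer-driven side (buffer-size policy and cancellation are ignored):

// Stand-in EventStream: producer-driven; subscribers miss anything published
// before they subscribe.
final class EventStream<Event> {
  private var subscribers: [(Event) -> Void] = []

  func subscribe(_ onEvent: @escaping (Event) -> Void) {
    subscribers.append(onEvent)
  }

  func publish(_ event: Event) {
    subscribers.forEach { $0(event) }
  }
}

extension EventStream {
  // buffer: store events as they arrive so a consumer can pull them later
  // without missing any. AsyncStream's continuation does the buffering.
  func buffer() -> AsyncStream<Event> {
    AsyncStream { continuation in
      self.subscribe { event in
        continuation.yield(event)  // enqueue each event as it is published
      }
    }
  }
}

// Usage sketch: the consumer pulls values at its own pace.
// for await event in clicks.buffer() { handle(event) }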

What does it mean to convert an EventStream to an ObservableValue? It would mean we’re storing the latest event emitted by the stream, so we can query what it is at any time. We call this converter cacheLatest. Note that the latest event must be cached, or else we wouldn’t be able to read it on demand. That’s fundamentally what this converter is doing: taking transient events that are gone right after they occur, and making them persistent values that can be queried as needed. This can be combined with other transformations on EventStream to produce some useful derived converters. For example, if we apply the accumulate operator to the EventStream, then use cacheLatest to produce an ObservableValue, the result is an accumulateTo operator, which stores a running accumulation (perhaps a sum) of incoming values over time.

What does it mean to convert an EventStream to a Task? Well, basically it would mean we create a Task that waits for one or more events to be published, then returns them as the result. But as we will see soon, it makes more sense to create “wait for the next value” Tasks out of DataStreams, and we can already convert EventStreams to DataStreams with buffer. Therefore, this converter would really be a compound conversion of first buffering and then fetching the next value. We can certainly write it as a convenience function, but under the hood it’s just composing other converters.

Now let’s move to DataStream as the source.

What does it mean to convert a DataStream to an EventStream? Well, an EventStream publishes its own events, but a DataStream only returns values when a consumer consumes them. Thus, turning a DataStream into an EventStream involves setting up a consumer to immediately start consuming values, and publish them as soon as they are available. The result is that multiple observers can now listen for those values as they get published, with the caveat that they’ll miss any values if they don’t subscribe at the right time. We can call this conversion broadcast.

What does it mean to convert a DataStream to an ObservableValue? Nothing useful or meaningful, as far as I can tell. Remember, the meaning of converting an EventStream to an ObservableValue was to cache the latest value. That’s a reference to timing. But timing in a DataStream is controlled by the consumer, so all that could mean is that a consumer powers through all the values and saves them to an ObservableValue. The result is a rapidly changing value that then gets stuck on the last value in the DataStream. That doesn’t appear to be a valid concept.

What does it mean to convert a DataStream to a Task? Simple: read the next value! In fact, in .NET, where Task is the return value of all async functions, the return value of the async read method would then have to be a Task. There can, of course, be other related functions to read more than one value. We can also connect an input DataStream to an output DataStream (which, remember, is a valid abstraction but not one carved out from Observable, which only represents sources and not sinks), which results in a Task whose work is to consume values from the input and send them to the output.

Now let’s move to ObservableValue as the source.

What does it mean to convert an ObservableValue to an EventStream? Simple: publish the updates! Now, part of an ObservableValue‘s features is being able to subscribe to updates. How is this related to publishing updates? When we subscribe to an ObservableValue, the subscription is lazy (for derived ObservableValues, the transformations are applied to source values as they arrive), and we have the option of notifying our subscriber immediately with the current value. But when we produce an EventStream from an ObservableValue, remember EventStreams are always hot! It must eagerly subscribe to the ObservableValue and then publish each update it receives. This is significant for derived lazy ObservableValues, because as long as someone holds onto an EventStream produced from it, it has an active subscription and therefore its value is being calculated, which it wouldn’t be if no one was subscribed to it. We can call this converter publishUpdates: it is basically ensuring that updates are eagerly computed and broadcasted so that anyone can observe them as any other EventStream.

What does it mean to convert an ObservableValue to a DataStream? Nothing useful or meaningful that I can think of. At best it would be a stream that lets us read the updates, but that’s just publishUpdates followed by buffer.

What does it mean to convert an ObservableValue to a Task? Again, I can’t think of anything useful or meaningful, that wouldn’t be a composition of other converters.

Now let’s move to Task as the source.

Tasks don’t convert to any of the other abstractions in any meaningful way, because they have only a single return value. Even ObservableValues, which fundamentally represent single values, still have an associated stream of updates. Tasks don’t even have this. For this reason, we can’t derive any kind of meaningful stream from a Task, which means there’s also nothing to observe.

The converters are summarized in the following table (the row is the input, the column is the output):

                   EventStream     DataStream   ObservableValue   Task
  EventStream      N/A             buffer       cacheLatest       none
  DataStream       broadcast       N/A          none              read
  ObservableValue  publishUpdates  none         N/A               none
  Task             none            none         none              N/A

There are, in total, only five valid converters.

Conclusion

After separating out the abstractions, we find that the humongous zoo of operators attached to Observable is tidied up into more focused groups of transformations on each of the four abstractions, plus (where applicable) ways to convert from one to the other. What this reveals is that the places where an Rx-driven app creates deep confusion over “hotness/coldness” and side effects are the places where an Observable really represents one of these four abstractions but is combined with a transformation operation that does not apply to that abstraction. For example, one true event stream (say, of mouse clicks or other user gestures) appended to another one makes no sense. Nor does trying to merge two observable values into a stream based on which one changes first.

In the final part, we’ll revisit the example app from the first part, rewrite it with our new abstractions and escape the clutches of Rx Hell.

On ReactiveX – Part II

In the last part, we explored the bizarre world of extreme observer-dependence that gets created in a ReactiveX (Rx)-driven app, and how that world rapidly descends into hell, especially when Rx is applied as a blanket solution to every problem.

Is the correct reaction to all of this to say “Screw Rx” and be done with it? Well, not entirely. As for the part where we try to cram every shaped peg into a square hole: we should absolutely say “to hell” with that. Whenever you see library tutorials say any variant of “everything is an X”, you should back away slowly, clutching whatever instrument of self-defense you carry. The only time that statement is true is if X = thing. Yes, everything is a thing… and that’s not very insightful, is it? The reason “everything is an X” with some more specific X seems profound is that it’s plainly false, and you have to imagine some massive change in perception for it to possibly be true.

Rx’s cult of personality cut its way through the Android world a few years ago, and now most of its victims have sobered up and moved on. In what is a quintessentially Apple move, Apple invented its own, completely proprietary and incompatible version of Rx, called Combine, a couple of years ago, and correspondingly the “everything is a stream” drug is making its rounds through the iOS world. It, too, shall pass. A large part of what caused RxJava to wane was Kotlin coroutines, and with Swift finally gaining async/await, Combine will subside as well. Why do these “async” language features replace Rx? Because Rx was touted as the blanket solution to concurrency.

Everything is not an event stream, or an observable, period. Some things are. Additionally, the Rx Observable is a concept with far too much attached to it. It is trying to be so many things at once, owing to the fact that it’s trying to live up to the “everything is a me” expectation. That can only result in Observable becoming a synonym for Any, except that instead of doing what the most general, highest category should do (namely, nothing), it endeavors to do the opposite: everything. It’s a God object in the making. That’s why it ends up everywhere in your code, and gradually erodes all the information a robust type system is supposed to communicate.

But is an event stream, endeavoring only to be an event stream, with higher-order quasi-functional transformations, a useful abstraction? I believe it’s a massively useful one. I still use it for user interfaces, but I reluctantly do so with Rx’s version of one, mostly because it’s the best one available.

The biggest problem with Rx is that its central abstraction is really several different abstractions, all crammed together. After thinking about this for a while, I have identified four distinct concepts that have been merged under the umbrella of the Observable interface. By disentangling these from each other, we can start to rebuild more focused, well-formed libraries that aren’t infected with scope creep.

These are the four abstractions I have identified:

  • Event streams
  • Data streams
  • Observable values
  • Tasks

Let’s talk about each one, what they are (and just as importantly, what they aren’t), and how they are similar to and different from Rx’s Observable.

Event Streams

Let us return from the mind-boggling solipsism of extreme Copenhagen interpretation, where the world around us is brought into being by observing it, and return to classical realism, where objective reality exists independent of observation. Observation simply tells us what is out there. An event stream is literally a stream of events: things that occur in time. Observing is utterly passive. It does not, in any way, change what events occur. It merely signs one up to be notified when they do.

The Rx Observable practically commands the Copenhagen outlook by making subscribe the abstract method that the various subclasses returned by operators override. It is what, exactly, subscribing (a synonym for observing) means that varies with different types of streams. This is where the trouble starts. It sets us up to have subscribe be what controls publish.

A sane approach to an event stream is for the subscribe method to be final. Subscribing is what it is: it just adds a callback to the list of callbacks to be triggered when an event is published. It should not alter what is published. The interesting behavior should occur exclusively in the constructor of a stream.

Let us recall the original purpose of the Observer Pattern. The primary purpose is not really to allow one-to-many communication. That’s a corollary of its main purpose. The main purpose is to decouple the endpoints of communication, specifically to allow one object to send messages to another object without ever knowing about that other object, not even the interfaces it implements.

Well, this is no different than any delegation pattern. I can define a delegate in class A, then have class B implement that delegate, allowing A to communicate with B without knowing about B. So what is it, specifically, about the Observer pattern that loosens the coupling even more than this?

The answer is that the communication is strictly one way. If an A posts an event, and B happens to be listening, B will receive it, but B cannot (without going through some other interface that A exposes) send anything back to A, not even a return value. Essentially, all the methods in an observer interface must have void returns. This is what makes one-to-many broadcasting a trivial upgrade to the pattern, and why you typically get it for free. Broadcasting with return values wouldn’t make sense.

The one-way nature of the message flow creates an objective distinction between the publisher (or sender) and the subscriber (or receiver). The intermediary that moves the messages around is the channel, or broker. This is distinct from, say, the Mediator Pattern, where the two ends of communication are symmetric. An important consequence of the asymmetry of observers is that the presence of subscribers cannot directly influence the publisher. In fact, the publisher in your typical Observer pattern implementation can’t even query who is a subscriber, or even how many subscribers there are.

A mediator is like your lawyer talking to the police. An observer is like someone attending a public speech you give, where the “channel” is the air carrying the sound of your voice. What you say through your lawyer depends on what questions the police ask you. But the speech you give doesn’t depend on who’s in the audience. The speaker is therefore decoupled from his audience to a greater degree than you are decoupled from the police questioning you.

By moving the publishing behavior into subscribe, Rx is majorly messing with this concept. It muddles the distinction between publisher/sender and subscriber/receiver, by allowing the subscribe/receive end of the chain to significantly alter what the publisher/sender side does. It’s this stretching of the word “observe” to mean something closer to “discuss” that can cause confusion like “why did that web request get sent five times?”. It’s because what we’re calling “observing a response event” is more like “requesting the response and waiting for it to arrive”, which is a two-way communication.

We should view event streams as a higher abstraction level for the Observer Pattern. An EventStream is just a wrapper around a channel, that encapsulates publishing and defines transformation operators that produce new EventStreams. The publishing behavior of a derived stream is set up at construction of the stream. The subscribe method is final. Its meaning never changes. It simply forwards a subscribe call to the underlying channel.

Event streams are always “hot”. If the events occur, they are published, if not, they aren’t. The transformation operations are eager, not lazy. The transform in map is evaluated on each event as soon as the event is published, independent of subscribers. This expresses the realism of this paradigm: those mapped events happen, period. Subscribing doesn’t make them happen, it just tells us about them. The way we handle whether derived streams continue to publish their derived events is by holding onto the stream. If a derived stream exists, it is creating and publishing derived events. If we want the derived events to stop firing, we don’t throw away subscriptions, we throw away the stream itself.
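
To make that concrete, here is a minimal sketch of an eagerly constructed derived stream, reusing the hypothetical EventStream from the converter sketches earlier (an illustration of the idea, not a finished design):

import java.util.function.Function;

final class EventStreamOperators
{
    // map: the derived stream is wired up at construction time. The transform runs
    // on every source event as soon as it is published, whether or not anyone has
    // subscribed to the derived stream. Subscribing never changes what happens.
    static <T, R> EventStream<R> map(EventStream<T> source, Function<T, R> transform)
    {
        EventStream<R> derived = new EventStream<>();
        source.subscribe(event -> derived.publish(transform.apply(event)));
        return derived;
    }
}

A real implementation would presumably have the source hold its derived streams weakly, so that discarding the derived stream is what stops the derived events from being created, as described above.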

There’s no problem of duplication here. The subscribing is one-to-many, but the construction of the events, the only place where any kind of side effects can occur, is tied to the construction of derived streams, which only happens once. One stream = one instance of each event. The other side of that coin is that missed events are missed, period. If you want any kind of caching behavior, that’s not an event stream. It’s something else.

I think we’ll also find that by separating out the other concepts we’ll get to next, the need to ever create event streams that have any side effects is reduced to essentially zero.

Rx streams have behavior for handling the stream “completing”, and handling exceptions that get thrown during construction of an item to be emitted. I have gone back and forth over whether it makes sense for a strict event stream to have a notion of “completing”. I lean more toward thinking it doesn’t, and that “completion” applies strictly to the next concept we’ll talk about.

What definitely does not make sense for event streams is failure. Event streams themselves can’t “fail”. Events happen or they don’t. If some exception gets thrown by a publisher, it’s a problem for the publisher: it will either be trapped by the publisher, kill the publisher, or kill the process. Having it propagate to subscribers, and especially having it (by design) terminate the whole stream, doesn’t make sense.

Data Streams

The next concept is a data stream. How are “data” streams different from “event” streams? Isn’t an event just some data? Well, an event holds data, but the event is the occurrence itself. With data streams, the items are not things that occur at a specific time. They may become available at a specific time, but that time is otherwise meaningless. The only significance of the arrival time of a datum is that we have to wait for it.

More importantly, in a stream of data, every datum matters. It’s really the order, not the timing, of the items that’s important. It’s critical that someone reading the data stream receive every element in the correct order. If a reader wants to skip some elements, that’s his business. But it wouldn’t make sense for a reader to miss elements and not know it.

We subscribe to an event stream, but we consume a data stream. Subscribing is passive. It has no impact on the events in the stream. Consuming is active. It is what drives the stream forward. The “next” event in a stream is emitted whenever it occurs, independent of who is subscribed. The “next” event of a data stream is emitted when the consumer decides to consume it. In both cases, once an element is emitted, it is never re-emitted.

Put succinctly, an event stream is producer-driven, and a data stream is consumer-driven. An event stream is a push stream, and a data stream is a pull stream.

This means a data stream cannot be one-to-many. An event stream can have arbitrarily many subscribers, only because subscribing is passive; entirely invisible to the publisher. But a data stream cannot have multiple simultaneous consumers. If we passed a data stream to multiple consumers who tried to read at the same time, they would step on each others’ toes. One would consume a datum and cause the other one to miss it.

To clarify, we’re talking about a specific data stream we call an input stream. It produces values that a consumer consumes. The other type of data stream is an output stream, which is a consumer itself, rather than a producer. Output streams are a separate concept not related to Rx Observables, because Observables are suppliers, not consumers (consumers in Rx are called Subscribers).

Most languages already have input and output stream classes, but they aren’t generic. Their element type is always bytes. We can define a generic one like this:

interface InputStream<Element>
{
    boolean hasMore();

    Element read();

    long skip(long count);
}

This time it’s a pure interface. There’s no default behavior. Different types of streams have to define what read means.

Data streams can be transformed in ways similar to event streams. But since the “active” part of a data stream is the reading, it is here that a derived stream will interact with its source stream. This will look more like how Rx Observable implements operators. The read method will be abstract, and each operator, like map and filter, will implement read by calling read on the source stream and applying the transform. In this case, the operators are lazy. The transform is not applied to a datum until a consumer consumes the mapped stream.
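
For instance, a mapped data stream might look roughly like this, building on the InputStream interface above (a sketch, not a finished design):

import java.util.function.Function;

// A derived stream whose transform is lazy: nothing happens until a consumer
// calls read() on the mapped stream, which pulls a datum from the source.
final class MappedInputStream<Source, Element> implements InputStream<Element>
{
    private final InputStream<Source> source;
    private final Function<Source, Element> transform;

    MappedInputStream(InputStream<Source> source, Function<Source, Element> transform)
    {
        this.source = source;
        this.transform = transform;
    }

    public boolean hasMore() { return source.hasMore(); }

    public Element read() { return transform.apply(source.read()); }

    public long skip(long count) { return source.skip(count); }
}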

The obvious difference between this and Rx Observables is that this is a pull, rather than push, interface. The read method doesn’t take a callback, it returns a result. This is exactly what we want for a stream where the next value is produced by the consumer requesting it. A data stream is inherently a pull paradigm. A push-style interface just obscures this. Typical needs with data streams, for example reading “n” values, then switching to do other stuff and then returning to read some more, become incredibly convoluted with an interface designed for a stream where the producer drives the flow.

A pull interface requires that if the next datum isn’t available yet, the thread must block. This is the horror that causes people to turn everything into callbacks: so they never block threads. The phobia of blocking threads (which is really a phobia of creating your own threads that can be freely blocked without freezing the UI or starving a thread pool) is a topic for another day. For the sake of argument I’ll accept that it’s horrible and we must do everything to avoid it.

The proper solution to the problem of long-running methods with return values that don’t block threads is not callbacks. Callback hell is the price we pay for ever thinking it was, and Rx hell is really a slight variation of callback hell with even worse problems layered on top. The proper solution is coroutines, specifically async/await.

This is, of course, exactly how we’d do it today in .NET, or any other language that has coroutines. If you’re stuck with Java, frankly I think you should just let the thread block, and make sure you do the processing on a thread you created (not the UI thread). That is, after all, exactly how Java’s InputStream works. If you are really insistent on not blocking, use a Future. That allows consuming with a callback, but it at least communicates in some way that you only expect the callback to be called once. That means you get a Future each time you read a chunk of the stream. If that seems ugly/ridiculous to you, then just block the damn thread!

Data streams definitely have a notion of “completing”. Their interface needs to be able to tell a consumer that there’s nothing left to consume. How does it handle errors? Well, since the interface is synchronous, an exception thrown by a transformation will propagate to the consumer. It’s his business to trap it and decide how to proceed. It should only affect that one datum. It should be possible to continue reading after that. If an intermediate derived stream doesn’t deal with an exception thrown by a source stream, it will propagate through until it gets to an outer stream that handles it, or all the way out to the consumer. This is another reason why a synchronous interface is appropriate. It is exactly what try-catch blocks do. Callback interfaces require you to essentially try-catch on every step, even if a step actually doesn’t care about (and cannot handle) an error and simply forwards it. You know you hate all that boilerplate. Is it really worth all of that just to not block a thread?
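
Concretely, a consumer of the hypothetical InputStream above can handle failures with ordinary try-catch, something along these lines (handle() is a stand-in for whatever the consumer actually does with each datum):

final class DataStreamConsumer<Element>
{
    // Stand-in for the consumer's real work.
    void handle(Element datum) { System.out.println(datum); }

    void consume(InputStream<Element> stream)
    {
        while (stream.hasMore())
        {
            try
            {
                handle(stream.read());
            }
            catch (RuntimeException error)
            {
                // An exception thrown by a transform (or the source) lands here,
                // at the consumer. It affects only this datum; the loop keeps
                // reading. Intermediate derived streams needed no code at all.
            }
        }
    }
}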

(If I was told I simply cannot block threads I’d port the project to Kotlin before trying to process data streams with callbacks)

Observable Values

Rx named its central abstraction Observable. This made me think that if I create an Observable<String>, it’s just like a regular String, except I can also subscribe to be notified when it changes. But that’s not at all what it is. It’s a stream, and streams aren’t values. They emit values, but they aren’t values themselves. What’s the difference, exactly? Well, if I had what was literally an observable String, I could read it, and get a String. But you can’t “read” an event stream. An event stream doesn’t have a “current value”. It might have a most recently emitted item, but those are, conceptually, completely different.

Unfortunately, in its endeavor toward “everything is me”, Rx provides an implementation of Observable whose exact purpose is to try to cram these two orthogonal concepts together: the BehaviorSubject. It is a literal observable value. It can be read to get its current value. It can be subscribed to, to get notified whenever the value changes. It can be written to, which triggers the subscribers.

But since it implements Observable, I can pass it along to anything that expects an Observable, thereby forgetting that it’s really a BehaviorSubject. This is where it advertises itself as a stream. You might think: well, it is a stream, or rather changes to the value are a stream. And that is true. But that’s not what you’re subscribing to when you subscribe to a BehaviorSubject. Subscribing to changes would mean you don’t get notified until the next time the value gets updated. If it never changes, the subscriber would never get called. But subscribers to a BehaviorSubject always get called immediately with the current value. If all you know is that you’ve got an Observable, you’ll have no idea whether this will happen or not.

Once you’ve upcast to an Observable, you lose the ability to read the current value. To preserve this, you’ll have to expose it as a BehaviorSubject. The problem then becomes that this exposes both reading and writing. What if you want to only expose reading the current value, but not writing? There’s no way to do this.

The biggest problem is that operators on a BehaviorSubject produce the same Observable types that those operators always do, which again loses the ability to read the current value. You end up with a derived Observable where the subscriber always gets called immediately (unless you drop or filter or do something else to prevent this), so it certainly always has a current value, you just can’t read it. This has forced me to do very stupid stuff like this:

BehaviorSubject<Integer> someInt = BehaviorSubject.createDefault(5);

Observable<String> stringifiedInt = someInt
    .map(value -> value.toString());

...

// Java lambdas can only capture effectively final locals, so we even need a
// mutable holder just to smuggle the value out of the callback.
AtomicReference<String> currentStringifiedInt = new AtomicReference<>();

Disposable subscription = stringifiedInt
    .subscribe(value -> currentStringifiedInt.set(value));

subscription.dispose();

System.out.print("Current value: " + currentStringifiedInt.get());
...

This is ugly, verbose, obtuse and unsafe. I have to subscribe just to trigger the callback to produce the current value for me, then immediately close the subscription because I don’t want that callback getting called again. I have to rely on the fact that a BehaviorSubject-derived observable will emit items immediately (synchronously), to ensure currentStringifiedInt gets populated before I use it. If I turn the derived observable back into a BehaviorSubject (which basically subscribes internally and sticks each updated value into the new BehaviorSubject), I can read the current value, but I can write to it myself, thereby breaking the relationship between the derived observable and the source BehaviorSubject.

The fundamental problem is that observable values and event streams aren’t the same thing. We need a separate type for this. Specifically, we need two interfaces: one for read-only observable values, and one for read-write observable values. This is where we’re going to see the type of subscribe-driven lazy evaluation that we see inside of Rx Observables. Derived observables are read-only. Reading them triggers whatever cascade of processing and upstream reading is necessary to produce the value. When we subscribe, that is where it will subscribe to its source observables, inducing them to compute their values when necessary (when those values update) to send them downstream.

Furthermore, the subscribe method on our Observable should explicitly ask whether the subscriber wants to be immediately notified with the current value (by requiring a boolean parameter). Since we have a separate abstraction for observable values, we know there is always a current value, so this question always makes sense.
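
Here is a sketch of the shape I have in mind; the names and signatures are my own assumptions, not an existing library (it repeats the ObservableValue shape assumed in the converter sketches above, and adds the read-write variant plus a derived map):

import java.util.function.Consumer;
import java.util.function.Function;

// Read-only: anything handed one of these can read the current value and
// subscribe to changes, but cannot write.
interface ObservableValue<Value>
{
    Value read();

    // The subscriber must say whether it wants the current value delivered
    // immediately, in addition to being notified of later changes.
    void subscribe(Consumer<Value> onUpdate, boolean notifyImmediately);
}

// Read-write: adds mutation. Writing notifies subscribers of this value and,
// transitively, of any values derived from it.
interface MutableObservableValue<Value> extends ObservableValue<Value>
{
    void write(Value newValue);
}

final class ObservableValues
{
    // A derived, read-only value: reading applies the transform to the source on
    // demand; subscribing forwards the source's updates, mapped.
    static <T, R> ObservableValue<R> map(ObservableValue<T> source, Function<T, R> transform)
    {
        return new ObservableValue<R>()
        {
            public R read() { return transform.apply(source.read()); }

            public void subscribe(Consumer<R> onUpdate, boolean notifyImmediately)
            {
                source.subscribe(value -> onUpdate.accept(transform.apply(value)), notifyImmediately);
            }
        };
    }
}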

Since the default is lazy, and therefore expensive and repetitious evaluation, we’ll need an operator specifically to store a derived observable in memory for quick evaluation. Is this comparable to turning a cold (lazy) Rx Observable into a hot (eager) one? No, because the thing you subscribe to with observable values, the changes, are always hot. They happen, and you miss them if you aren’t subscribed. Caching is purely a matter of efficiency, trading computing time for computing space (storage). It has no impact whatsoever on when updates get published.

Caching will affect whether transformations to produce a value are run repeatedly, but only for synchronous reads (multiple subscribers won’t cause repeated calculations). The major difference is that we can eliminate repeated side-effects from double-calculating a value without changing how or when its updates are published. What subscribers see is totally separate from whether an observable value is cached, unlike in Rx where “sharing” an Observable changes what subscribers see (it causes them to miss what they otherwise would have received).

A single Observable represents a single value. Multiple subscribers means multiple people are interested in one value. There’s no issue of “making sure all observers see the same sequence”. If a late subscriber comes in, he’ll either request the current value, whatever it is, or just request to be notified of later changes. The changes are true events (they happen, or they don’t, and if they do they happen at a specific time). We’d never need to duplicate calculations to make multiple subscribers see stale updates.

Furthermore, we communicate more clearly what, if any, “side effects” should be happening inside a transformation. They should be limited to whatever is necessary to calculate the value. If we have a derived value that requires an HTTP request to calculate it, this request will go out either when the source value changes, requiring a re-evaluation, or it will happen when someone tries to read the value… unless we cache it, which ensures the request always goes out as soon as it can. It is multiple synchronous reads that would, for non-cached values, trigger multiple requests, not multiple subscribers. This makes sense. If we’ve specified we don’t want to store the value, we’re saying each time we want to query the value we need to do the work of computing it.

Derived (and therefore read-only) observable values, which can both be subscribed to and read synchronously, are the most important missing piece in Rx. They are so important that I’ve gone through the trouble multiple times of building rudimentary versions in some of my apps.

“Completion” obviously makes no sense for observable values. They never stop existing. Errors should probably never happen in transformations. If a runtime exception sneaks through, it’s going to break the observable. It would need to be rethrown every time anyone tries to read the value (and what about subscribing to updates?). The possibility of failure stretches the concept of a value whose changes can be observed past, in my opinion, its range of valid interpretation. You can, of course, define a value that has two variations of success and failure (aka a Result), but then the possibility of failure is baked into the value itself, not its observability.

Tasks

The final abstraction is tasks. Tasks are just asynchronous function invocations. They are started, and they do or do not produce a result. This is fundamentally different from any kind of “stream” because tasks only produce one result. They may also fail, in which case they produce one exception. The central focus of tasks is not so much on the value they produce but on the process of producing it. The fact the process is nontrivial and long-running is the only reason you’d pick a task over a regular function to begin with. As such, tasks expose an interface to start, pause/resume and cancel. Tasks are, in this way, state machines.

Unlike any of the other abstractions, tasks really do have distinct steps for starting and finishing. This is what ConnectableObservable is trying to capture with its addition (or rather, separation from subscribe) of connect. The request and the response are always distinct. Furthermore, once a task is started, it can’t be “restarted”. Multiple people waiting on its response doesn’t trigger the work to happen multiple times. The task produces its result once, and stores it as long as it hangs around in case anyone else asks for it.

Since the focus here is on the process, not the result, task composition looks fundamentally different from stream composition. Stream composition, including pipelines, focuses on the events or values flowing through the network. Task composition deals with results too, but its primary concern is dependency: exactly when the various subtasks can be started, relative to when other tasks start or finish. Task composition is concerned with whether tasks can be done in parallel or serially. This is a concern even for tasks that don’t produce results.

Since tasks can fail, they also need to deal with error propagation. An error in a task means an error occurring somewhere in the process of running the task: moving it from start to finish. It’s the finishing that is sabotaged by an error, not the starting. We expect starting a task to always succeed. It’s the finishing that might never happen due to an error. This is represented by an additional state for failed. This is why it is not starting a task that would throw an exception, but waiting on its result. It makes sense that in a composed task, if a subtask fails, the outer task may fail. The outer task either expects and handles the error by trapping it, or it doesn’t, in which case it propagates out and becomes a failure of the outer task.

This propagation outward of errors, through steps that simply ignore those errors (and therefore, ideally, should contain absolutely no boilerplate code for simply passing an error through), is similar to data streams, and it therefore demands a synchronous interface. This is a little more tricky though because tasks are literally concerned with composing asynchrony. Even if we’re totally okay with blocking threads, what if we want subtasks to start simultaneously? Well, that’s what separating starting from waiting on the result lets us do. We only need to block when we need the result. That can be where exceptions are thrown, and they’ll automatically propagate through steps that don’t deal with them, which is exactly what we want. This separates when an exception is thrown from when an exception is (potentially) caught, and therefore requires tasks to cache exceptions just like they do their result.
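
For a language without async/await, the same separation of starting from waiting can be sketched with Java’s CompletableFuture (an analogy for illustration, not the .NET Task API used in the examples below):

import java.util.concurrent.CompletableFuture;

final class ComposedTask
{
    static String run()
    {
        // Starting: both subtasks kick off immediately and run in parallel.
        CompletableFuture<String> first = CompletableFuture.supplyAsync(() -> "first result");
        CompletableFuture<String> second = CompletableFuture.supplyAsync(() -> "second result");

        // Waiting: we block (or, with coroutines, suspend) only where the results
        // are needed. join() rethrows a subtask's failure here, wrapped in a
        // CompletionException, so an unhandled error in a subtask becomes a failure
        // of the composed task with no error-forwarding boilerplate in between.
        return first.join() + ", " + second.join();
    }
}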

We can, of course, avoid blocking any threads by using coroutines. That’s exactly what the .NET Tasks do. If you’re in a language that doesn’t have coroutines, I have the same advice I have for data streams: just block the damn threads. You’ll tear your hair out with the handleResult/handleError pyramids of callback doom, where most of your handleError callbacks are just calling the outer handleError to pass errors through.

What’s missing in the Task APIs I’ve seen is functional transformations like what we have on the other abstractions. This is probably because the need is much smaller. It’s not hard at all to do what is essentially a map on a Task:

async Task<MappedResult> mapATask()
{
    Task<Result> sourceTask = getSourceTask();
    Func<Result, MappedResult> transform = getTransform();

    return transform(await sourceTask);
}

But still, we can eliminate some of that boilerplate with some nice extension methods:

static async Task<MappedResult> Map<Result, MappedResult>(this Task<Result> task, Func<Result, MappedResult> transform)
{
    return transform(await task);
}

...

Task<Result> someTask = getTask();

await someTask
  .Map(someTransform)
  .Map(someOtherTransform);

Conclusion

By separating out these four somewhat similar but ultimately distinct concepts, we’ll find that the “hot” vs. “cold” distinction is expressed by choosing the right abstraction, and that this is exposed to clients rather than hidden in implementation details. Furthermore, the implication of side effects is easier to understand and address. We make a distinction between how “active” or “passive” different actions are. Observing an event is totally passive, and cannot itself incur side effects. Constructing a derived event stream is not passive; it entails the creation of new events. Consuming a value in a data stream is also not passive. Notice that broadcasting requires passivity. The only one-to-many operations available, once we distinguish the various abstractions, are observing an event stream and observing changes to an observable value. The former alone cannot incur side effects itself, and the latter can only incur side effects when going from no observers to more than none, and thus is independent of the multiplicity of observers. We have, in this way, eliminated the possibility of accidentally duplicating effort in the almost trivial manner that is possible in Rx.

In the next part, we’ll talk about those transformation operators, and what they look like after separating the abstractions.

On ReactiveX – Part I

If a tree falls in the forest and no one hears it, does it make a sound?

The ReactiveX libraries have finally answered this age-old philosophical dilemma. If no one is listening for the tree falling, not only does it not make a sound, the tree didn’t even fall. In fact, the wind that knocked the tree down didn’t even blow. If no one’s in the forest, then the forest doesn’t exist at all.

Furthermore, if there are three people in the forest listening, there are three separate sounds that get made. Not only that, there are three trees, each one making a sound. And there are three gusts of wind to knock each one down. There are, in fact, three forests.

ReactiveX is the Copenhagen Interpretation on steroids (or, maybe, just taken to its logical conclusion). We don’t just discard counterfactual definiteness, we take it out back and shoot it. What better way to implement Schrodinger’s Cat in your codebase than this:

final class SchrodingersCat extends Observable<Boolean>
{
    public SchrodingersCat()
    {
        cat = new Cat("Mittens");
    }

    @Override
    protected void subscribeActual(@NonNull Observer<? super Boolean> observer)
    {
        if(!observed)
        {
            observed = true;

            boolean geigerCounterTripped = new Random().nextInt(2) == 0;
            if(geigerCounterTripped)
                new BluntInstrument().murder(cat);
        }

        observer.onNext(cat.alive());
    }

    private final Cat cat;

    boolean observed = false;
}

In this example, I have to go out of my way to prevent multiple observers from creating multiple cats, each with its own fate. Most Observables aren’t like that.

When you first learn about ReactiveX (Rx, as I will refer to it from now on), it’s pretty cool. The concept of transforming event streams, whose values occur over time, the same way you transform collections (Arrays, Dictionarys, etc.), whose values occur over space (memory, or some other storage location), with operators like map, filter, zip and reduce, immediately struck me as extremely powerful. And, to be sure, it is. This began the Rx Honeymoon. The first thing I knew would benefit massively from these abstractions was the thing I had already learned to write reactively, but without the help of explicit abstractions for that purpose: graphical user interfaces.

But, encouraged by the “guides”, I didn’t stop there. “Everything is an event stream”, they said. They showed me the classic example of executing a web request, parsing its result, and attaching it to some view in the UI. It seems like magic. Just define your API service’s call as an Observable, which is just a map of the Observable for a general HTTP request (if your platform doesn’t provide one for you, you can easily write one by bridging a callback interface to an event stream). Then just do some more mapping and you have a text label that displays “Loading…” until the data is downloaded, then automatically switches to display the loaded data:

class DataStructure
{
    String title;
    int version;
    String displayInfo;
    ...
}

class AppScreenViewModel
{
    ...

    public final Observable<String> dataLabelText;

    ...

    public AppScreenViewModel()
    {
        ...

        Observable<HTTPResponse> response = _httpClient.request(
             "https://myapi.com/getdatastructure",
             HTTPMethod::get
         );

        Observable<DataStructure> parsedResponse = response
            .map(response -> new JSONParser().parse<DataStructure>(response.body, new DataStructure()));

        Observable<String> loadedText = parsedResponse
             .map(dataStructure -> dataStructure.displayInfo);

        Observable<String> loadingText = Observable.just("Loading...");

        dataLabelText = loadingText
             .merge(loadedText);
    }
}

...

class AppScreen : View
{
    private TextView dataLabel;

    ...

    private void bindToViewModel(AppScreenViewModel viewModel)
    {
        subscription = viewModel.dataLabelText
            .subscribe(text -> dataLabel.setText(text));
    }
}

That’s pretty neat. And you wouldn’t actually write it like this. I just did it like this to illustrate what’s going on. It would more likely look something like this:

    public AppScreenViewModel()
    {
        ...

        dataLabelText = getDataStructureRequest(_httpClient)
            .map(response -> parseResponse<DataStructure>(response, new DataStructure()))
            .map(dataStructure -> dataStructure.displayInfo)
            .merge(Observable.just("Loading..."));

        ...
    }

And, of course, you’d want to move the low-level HTTP client stuff out of the ViewModel. What you get is an elegant expression of a pipeline of retrieval and processing steps, with end of the pipe plugged into your UI. Pretty neat!

But… hold on. I’m confused. I have my UI subscribe (that is, listen) to a piece of data that, through a chain of processing steps, depends on the response to an HTTP request. I can understand why, once the response comes in, the data makes its way to the text label. But where did I request the data? Where did I tell the system to go ahead and issue the HTTP request, so that eventually all of this will get triggered?

The answer is that it happens automatically by subscribing to this pipeline of events. That is also when it happens. The subscription happens in bindToViewModel. The request will be triggered by that method calling subscribe on the observable string, which triggers subscribes on all the other observables, because that’s how the Observables returned by operators like map and merge work.

Okay… that makes sense, I guess. But it’s kind of a waste of time to wait until then to send the request out. We’re ready to start downloading the data as soon as the view-model is constructed. Minor issue, I guess, since in this case these two times are probably a fraction of a second apart.

But now let’s say I also want to send that version number to another text label:

class DataStructure
{
    String title;
    int version;
    String displayInfo;
    ...
}

class AppScreenViewModel
{
    ...

    public final Observable<String> dataLabelText;
    public final Observable<String> versionLabelText;

    ...

    public AppScreenViewModel()
    {
        ...

        Observable<DataStructure> dataStructure = getDataStructureRequest(_httpClient)
            .map(response -> parseResponse<DataStructure>(response, new DataStructure()));

        String loadingText = "Loading...";

        dataLabelText = dataStructure
            .map(dataStructure -> dataStructure.displayInfo)
            .merge(Observable.just(loadingText));

        versionLabelText = dataStructure
            .map(dataStructure -> Int(dataStructure.version).toString())
            .merge(Observable.just(loadingText));
    }
}

...

class AppScreen : View
{
    private TextView dataLabel;
    private TextView versionLabel;

    ...

    private void bindToViewModel(AppScreenViewModel viewModel)
    {
        subscriptions.add(viewModel.dataLabelText
            .subscribe(text -> dataLabel.setText(text)));

        subscriptions.add(viewModel.versionLabelText
            .subscribe(text -> versionLabel.setText(text)));
    }
}

I fire up my app, and then notice in my web proxy that the call to my API went out twice. Why did that happen? I didn’t create two of the HTTP request observables. But remember I said that the request gets triggered in subscribe? Well, we can clearly see two subscribes here. They are each on different observables, but both of those observables are the result of operator chains that begin with the HTTP request observable. Their subscribe methods call subscribe on the “upstream” observable. Thus, both chains eventually call subscribe, once each, on the HTTP request observable.

The honeymoon is wearing off.

Obviously this isn’t acceptable. I need to fix it so that only one request gets made. The ReactiveX docs refer to these kinds of observables as cold. They don’t do anything until you subscribe to them, and when you do, they emit the same items for each subscriber. Normally, we might think of “items” as just values. So at worst this just means we’re making copies of our structures. But really, an “item” in this world is any arbitrary code that runs when the value is produced. This is what makes it possible to stuff very nontrivial behavior, like executing an HTTP request, inside an observable. By “producing” the value of the HTTP response, we execute the code that calls the HTTP client. If we produce that value for “n” listeners, we literally have to produce it “n” times, which means we call the service “n” times.

The nontrivial code that happens as part of producing the next value in a stream is what we can call side effects. This is where the hyper-Copenhagen view of reality starts getting complicated (if it wasn’t already). That tree falling sound causes stuff on its own. It chases birds off, and shakes leaves off of branches. Maybe it spooks a deer, causing it to run into a street, which causes a car driving by to swerve into a service pole, knocking it down and cutting off the power to a neighborhood miles away. So now, “listening” to the tree falling sound means being aware of anything that was caused by that sound. Sitting in my living room and having the lights go out now makes me an observer of that sound.

There’s a reason Schrodinger put the cat in a box: to try as best he could to isolate events inside the box from events outside. Real life isn’t so simple. “Optimizing” out the unobserved part of existence requires you to draw a line (or box?) around all the effects of a cause. The Butterfly Effect laughs derisively at the very suggestion.

Not all Observables are like this. Some of them are hot. They emit items completely on their own terms, even if no subscribers are present. By subscribing, you’ll receive the same values at the same time as any other subscribers. If one subscriber subscribes late, they’ll miss any previously emitted items. An example would be an Observable for mouse clicks. Obviously a new subscriber can’t make you click the mouse again, and you can click the mouse before any subscribers show up.

To fix our problem, we need to convert the cold HTTP response observable into a hot one. We want it to emit its value (the HTTP response, which as a side effect will trigger the HTTP request) of its own accord, independent of who subscribes. This will solve both the problem of waiting too long to start the request, and the problem of the request going out twice. To do this, Rx gives us a subclass of Observable called ConnectableObservable. In addition to subscribe, these also have a method connect, which triggers them to start emitting items. I can use the publish operator to turn a cold observable into a connectable hot one. This way, I can start the request immediately, without duplicating it:

        ConnectableObservable<DataStructure> dataStructure = getDataStructureRequest(_httpClient)
            .map(response -> parseResponse<DataStructure>(response, new DataStructure()))
            .publish();

        dataStructure.connect();

        String loadingText = "Loading...";

        dataLabelText = dataStructure
            .map(dataStructure -> dataStructure.displayInfo)
            .merge(Observable.just(loadingText));

        versionLabelText = dataStructure
            .map(dataStructure -> Int(dataStructure.version).toString())
            .merge(Observable.just(loadingText));

Now I fire it up again. Only one request goes out! Yay!! But wait… both of my labels still say “Loading…”. What happened? They never updated.

The response observable is now hot: it emits items on its own. Whatever subscribers are there when that item gets emitted are triggered. Any subscribers that show up later miss earlier items. Well, my dev server running in a VM on my laptop here served up that API response in milliseconds, faster than the time between this code running and the View code subscribing to these observables. By the time they subscribed, the response had already been emitted, and the subscribers miss it.

Okay, back to the Rx books. There’s an operator called replay, which will give us a connectable observable that begins emitting as soon as we call connect, but also caches the items that come in. When anyone subscribes, it first powers through any of those cached items, sending them to the new subscriber in rapid succession, to ensure that every subscriber sees the same sequence of items:

        ConnectableObservable<DataStructure> dataStructure = getDataStructureRequest(_httpClient)
            .map(response -> parseResponse<DataStructure>(response, new DataStructure()))
            .replay();

        dataStructure.connect();

        String loadingText = "Loading...";

        dataLabelText = dataStructure
            .map(dataStructure -> dataStructure.displayInfo)
            .merge(Observable.just(loadingText));

        versionLabelText = dataStructure
            .map(dataStructure -> Int(dataStructure.version).toString())
            .merge(Observable.just(loadingText));

I fire it up, still see one request go out, but then… I see my labels briefly flash with the loaded text, then go back to “Loading…”. What the fu…

If you think carefully about the last operator, the merge: if the response comes in before we get there, we’re actually constructing a stream that consists first of the response-derived string, and then the text “Loading…”. So it’s doing what we told it to do. It’s just confusing. The replay operator, as I said, fires off the exact sequence of emitted items, in the order they were originally emitted. That’s what I’m seeing.

But wait… I’m not replaying the merged stream. I’m replaying the upstream event of the HTTP response. Now it’s not even clear to me what that means. I need to think about this… the dataStructure stream is a replay of the underlying stream that makes the request, emits the response, then maps it to the parsed object. That all happens almost instantaneously after I call connect. That one item gets cached, and when anyone subscribes, it loops through and emits the cached items, which is just that one. Then I merge this with a Just stream. What does Just mean, again? Well, that’s a stream that emits just the item given to it whenever you subscribe to it. Each subscriber gets that one item. Okay, and what does merge do? Well, the subscribe method of a merged stream subscribes to both the upstream observables used to build it, so that the subscriber gets triggered by either one’s emitted items. It has to subscribe to both in some order, and I guess it makes sense that it first subscribes to the stream on which merge was called, and then subscribes to the other stream passed in as a parameter.

So what’s happening is by the time I call subscribe on what happens to be a merged stream, it first subscribes to the replay stream, which already has a cached item and therefore immediately emits it to the subscriber. Then it subscribes to the Just stream, which immediately emits the loading text. Hence, I see the loaded text, then the loading text.

If I swapped the operands so that the Just is what I call merge on, and the mapped data structure stream is the parameter, then the order reverses. That’s scary. I didn’t even think to consider that the placement of those two in the call would matter.

Sigh… okay, I need to express that the loading text needs to always come before the loaded text. Instead of using merge, I need to use prepend. That makes sure all the events of the stream I pass in will get emitted before any events from the other stream:

    ConnectableObservable<DataStructure> dataStructure = getDataStructureRequest(_httpClient)
        .map(response -> parseResponse<DataStructure>(response, new DataStructure()))
        .replay();

    dataStructure.connect();

    String loadingText = "Loading...";

    dataLabelText = dataStructure
        .map(dataStructure -> dataStructure.displayInfo)
        .prepend(Observable.just(loadingText));

    versionLabelText = dataStructure
        .map(dataStructure -> Int(dataStructure.version).toString())
        .prepend(Observable.just(loadingText));

Great, now the labels look right! But wait… I always see “Loading…” briefly flash on the screen. All the trouble I just dealt with derived from my dev server responding before my view gets created. I shouldn’t ever see “Loading…”, because by the time the labels are being drawn, the loaded text is available.

But the above explanation covers this as well. We’ve constructed a stream where every subscriber will get the “Loading…” item first, even if the loaded text comes immediately after. The prepend operator produces a cold stream. It always emits the items in the provided stream before switching to the one we’re prepending to.

The stream is still too cold. I don’t want the subscribers to always see the full sequence of items. If they come in late, I want them to only see the latest ones. But I don’t want the stream to be entirely hot either. That would mean that if a subscriber comes in after the loaded text is emitted, it will never receive any events. I need to Goldilocks this stream. I want subscribers to only receive the last item emitted, and none before that. I need to move the replay up to the concatenated stream, and I need to specify that the cached items should never exceed a single one:

    Observable<DataStructure> dataStructure = getDataStructureRequest(_httpClient)
        .map(response -> parseResponse<DataStructure>(response, new DataStructure()));

    String loadingText = "Loading...";

    dataLabelText = dataStructure
        .map(dataStructure -> dataStructure.displayInfo)
        .prepend(Observable.just(loadingText))
        .replay(1);

    dataLabelText.connect();

    versionLabelText = dataStructure
        .map(dataStructure -> Int(dataStructure.version).toString())
        .prepend(Observable.just(loadingText))
        .replay(1);

    versionLabelText.connect();

Okay, there we go, the flashing is gone. Oh shit! Now two requests are going out again. By moving the replay up to after the stream bifurcated, each stream is subscribing and caching its item, so each one is triggering the HTTP response to get “produced”. Uggg… I have to keep that first replay to “share” the response with each derived stream and ensure each one gets it even if it came in before their own connect calls.

This is all the complexity we have to deal with to handle a simple Y-shaped network of streams driving two labels on a user interface. Can you imagine building an entire, even moderately complex, app as an intricate network of streams, and having to worry about how “hot” or “cold” each edge in the stream graph is?

Is the honeymoon over yet?

All this highly divergent behavior is hidden behind a single interface called Observable, which intentionally obscures it from the users of the interface. When an object hands you an Observable to use in some way, you have no idea what kind of observable it is. That makes it difficult or impossible to track down or even understand why a system built out of reactive event streams is behaving the way it is.

This is the point where I throw up my hands and say wait wait wait… why am I trying to say an HTTP request is a stream of events? It’s not a stream of any sort. There’s only one response. How is that a “stream”? What could possibly make me think that’s the appropriate abstraction to use here?

Ohhh, I see… it’s asynchronous. Rx isn’t just a library with a powerful transformable event stream abstraction. It’s the “all purpose spray” of concurrency! Any time I ever need to do anything with a callback, which I put in because I don’t want to block a thread for a long-running process, apparently that’s a cue to turn it into an Observable and then have it spread to everything that gets triggered by that callback, and so on and so on. Great. I graduated from callback hell to Rx hell. I’ll have to consult Dante’s map to see if that moved me up a level or down.

In the next part, I’ll talk about whether any of the Rx stuff is worth salvaging, and why things went so off the rails.

REST SchmEST

On almost every project I’ve worked on, everyone has been telling themselves the web services are RESTful services. Most of them don’t really follow RESTful principles, at least not strictly. But the basic ideas of REST are the driving force of how the APIs are written. After working with them for years, and reading into what a truly RESTful interface is, and what the justification is, I am ready to say:

REST has no reason to exist.

It’s perfectly situated in a middle ground no one asked for. It’s too high-level to give raw access to a database to make arbitrary queries using an actual query language like SQL, and it’s too low-level to directly drive application activities without substantial amounts of request forming and response stitching and processing.

It features the worst of both worlds, and I don’t know what the benefit is supposed to be.

Well, let’s look at what the justification for REST is. Before RESTful services, the API world was dominated by “remote procedure call” (RPC) protocols, like XML-RPC, and later Simple Object Access Protocol (SOAP). The older devs I’ve worked with told horror stories about how painful it was to write requests to those APIs, and they welcomed the “simplicity” of REST.

The decision to use REST is basically the decision to not use an RPC protocol. However, if we look at the original paper for REST, it doesn’t mention RPC protocols a single time. It focuses on things like statelessness, uniformity, and cacheability. But since choosing this architectural style for an API is just as much about not choosing the alternative, the discussion began to focus on a SOAP vs. REST comparison.

On the wiki page for SOAP, it briefly mentions one justification in the comparison:

SOAP is less “simple” than the name would suggest. The verbosity of the protocol, slow parsing speed of XML, and lack of a standardized interaction model led to the dominance of services using the HTTP protocol more directly. See, for example, REST.

This point tends to come up a lot. For example, this page repeats the idea that by using the capabilities of HTTP “directly”, REST can get away with defining less itself, which makes it “simpler”. So it seems the idea is that protocols like SOAP add unnecessary complexity to APIs, by reinventing capabilities already contained in the underlying communication protocols. The RPC protocols were written to be application-layer agnostic. As such, they couldn’t take advantage of the concepts already present in HTTP. Once it became clear that all these RPC calls were being delivered over HTTP anyway, those protocols became redundant. We can instead simply use what HTTP gives us out of the box to design APIs.

See, for example, the answers on this StackOverflow question:

SOAP is a protocol on top of HTTP, so it bypasses a lot of HTTP conventions to build new conventions in SOAP, and is in a number of ways redundant with HTTP. HTTP, however, is more than sufficient for retreiving, searching, writing, and deleting information via HTTP, and that’s a lot of what REST is. Because REST is built with HTTP instead of on top of it, it also means that software that wants to integrate with it (such as a web browser) does not need to understand SOAP to do so, just HTTP, which has to be the most widely understood and integrated-with protocol in use at this point.

Bottom line, REST removes many of the most time-consuming and contentious design and implementation decisions from your team’s workflow. It shifts your attention from implementing your service to designing it. And it does so without piling gobbledygook onto the HTTP protocol.

It’s curious that this “SOAP reinvents what HTTP already gives us” argument did not appear in the original REST paper. It’s a bad argument, which leads directly to the no-man’s land between raw database access and high-level application interfaces.

HTTP already gives us what we need to make resource-based requests to a server; in short, CRUD. They claim that a combination of the HTTP request method (GET, POST, DELETE, etc.), path elements, and query parameters already does for us what the RPC protocols stuff into overcomplicated request bodies.

The problem with this argument is that what HTTP provides and what RPC provides are not the same, either in implementation or in purpose. Those features of HTTP (method, path, and query parameters) expose a much different surface than what RPC calls expose. RPC is designed to invoke code (procedure) on another machine, while HTTP is designed to retrieve or manipulate resources on another machine.

Somewhere along the line, the idea arose that REST tells you how to map database queries and transactions to HTTP calls. For example, a SELECT of a single row by ID from a table maps to a GET request with the table name being a path element, and the ID being the next (and last) path element. An INSERT with column-cell value pairs maps to a POST request, again with the table in the path, and this time no further path elements, and the value pairs as body form data.
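
Spelled out, the supposed mapping usually ends up looking something like this (an informal illustration of the convention; the table name and the elided column details are placeholders):

GET    /employees        ->  SELECT * FROM employees
GET    /employees/42     ->  SELECT * FROM employees WHERE id = 42
POST   /employees        ->  INSERT INTO employees (...) VALUES (...)
PUT    /employees/42     ->  UPDATE employees SET ... WHERE id = 42
DELETE /employees/42     ->  DELETE FROM employees WHERE id = 42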

This certainly didn’t come from the original paper, which doesn’t mention “database” a single time. It mentions resources. The notion most likely arose because REST was shoehorned into being a replacement for RPC, used to build APIs that almost always sit on top of a database. If you’re supposed to “build your API server RESTfully”, and that server is primarily acting as a shell with a database at its kernel, then a reading of “REST principles”, which drives your API toward dumb data access (rather than execution of arbitrary code), will inevitably become an “HTTP -> SQL” dictionary.

The mapping covers basic CRUD operations on a database, but it’s not a full query language. Once you start adding WHERE clauses, simple ones may map to query parameters, but there’s no canonical way to do the mapping, and there’s no way to do more sophisticated stuff like subqueries. There’s not even a canonical way to select specific columns. Then there’s joining. Since neither selecting specific columns nor joining map to any HTTP concept, you’re stuck with an ORM style interface into the database, where you basically fetch entire rows and all their related rows all at once, no matter how much of that data you actually need.

The original paper specifically called out this limitation:

By applying the software engineering principle of generality to the component interface, the overall system architecture is simplified and the visibility of interactions is improved. Implementations are decoupled from the services they provide, which encourages independent evolvability. The trade-off, though, is that a uniform interface degrades efficiency, since information is transferred in a standardized form rather than one which is specific to an application’s needs. The REST interface is designed to be efficient for large-grain hypermedia data transfer, optimizing for the common case of the Web, but resulting in an interface that is not optimal for other forms of architectural interaction.

So, basically, this manifestation of “REST as querying through HTTP” gives us a lot less than what query languages already gave us, and nothing more. I’ll bet if you ask people why they picked REST over just opening SQL connections from the clients, they’ll say something about security, i.e. by supplying an indirect and limited interface to the database, REST allows you to build a sort of firewall in the server code on the way to actually turning the requests into queries. Well, you’re supposed to be able to solve that with user permissions. Either way, it’s nothing about the interface actually being nicer to work with or more powerful.

It’s only barely more abstract. You could probably make a fair argument it’s really not more abstract. REST is just a stunted query language. We can say this is a straw man, since the paper never said to do this. But unless the very suggestion to make application servers RESTful in general is a straw man (making the vast majority of “RESTful” services a misapplication of REST), I don’t see how it could have ended up any differently.

Claiming that this serves as a replacement for RPC fundamentally misunderstands what RPC protocols do. It’s in the name: remote procedure call. It’s a protocol to invoke a function on another machine. The protocol is there to provide a calling convention that works over network wires, just like a C compiler defines a calling convention for functions that works within a single process on a single machine. It defines how to select the function (by name), how to send the parameters, and how to receive the return value.

How much does HTTP help with this? Well, I guess you can put the class name, or function name (or both) into the URL path. But there’s no one way to do this that jumps out to me as obviously correct. The HTTP methods aren’t of much use (functions don’t, generally, correspond to the short list of HTTP methods), and query parameters are quite inappropriate for function parameters, which can be arbitrarily complex objects. Any attempt to take what SOAP does and move “redundant” pieces into HTTP mechanisms isn’t going to accomplish much.

We can, of course, send RPC calls over HTTP. The bulk, if not entirety, of the calling convention goes into the request body. By limiting HTTP to RESTful calls, we’re foregoing the advantage of the very difference between an API and direct querying: that it triggers arbitrary code on the server, not just the very narrowly defined type of work that a database can do. We can raise the abstraction layer and simply invoke methods on models that closely represent application concerns, and execute substantial amounts of business logic on top of the necessary database queries. To demand that APIs be “RESTful” is to demand that they remain mere mappings to dumb data accesses, the only real difference being that we’re robbed of a rich query language.

What you get is clients forced to do all the business logic themselves, and make inefficient coarse fetches of entire rows or hierarchies of data, or even worse a series of fetches, each incurring an expensive round trip to the server, to stitch together a JOIN on their side. When people start noticing this, they’ll start breaking the REST rules. They’ll design an API a client really wants, which is to handle all that business logic on the server, and you end up with a /api/fetchStuffForThisUseCase endpoint that has no correspondence whatsoever to a database query (the path isn’t the name of any table, and the response might be a unique conglomerate and/or mapping of database entities into a different structure).

That’s way better… and it’s exactly what this “REST as HTTP -> SQL” notion tries to forbid you from doing.

The middle of the road is, as usual, the worst place to be. If the client needs to do its own queries, just give it database access, and don’t treat REST as a firewall. It’s very much like the fallacy of treating network address translation as being a firewall. Even if it can serve that role, it’s not designed for it, and there are much better methods designed specifically for security. Handle security with user permissions. If you’re really concerned with people on mobile devices MITM attacking your traffic and seeing your SQL queries, have your clients send those queries over an end-to-end encrypted request.

If you’re a server/database admin and you’re nervous about your client developers writing insane queries, set up a code review process to approve any queries that go into client code.

If the client doesn’t need to do the business logic itself, and especially if you have multiple clients that all need the same business logic, then implement it all on the server and use an RPC protocol to let your clients invoke virtual remote objects. You’re usually not supposed to hand-write RPC calls anyway. That’s almost as bad as hand-writing assembly with a C compiler spec’s calling convention chapter open in your lap. That can all be automated, because the point of RPC protocols is that they map directly to function or method calls in a programming language. You shouldn’t have to write the low-level stuff yourself.

What, then, is the purpose of the various features of HTTP? Well, again it’s in the name: hypertext transfer protocol. HTTP was designed to enable the very first extremely rudimentary websites on the very first extremely rudimentary internet to be built and delivered. It’s designed to let you stick HTML files somewhere on a server, in such a way they can be read back from the same location where you stuck them, to update or delete them later, and to embed links among them in the HTML.

The only reason we’re still using HTTP for APIs is the same reason we’re still using HTML for websites: because that’s the legacy of the internet, and wholesale swapping to new technology is hard. Both of them are completely outdated and mostly just get in our way now. Most internet traffic isn’t HTML pages anymore, it’s APIs, and it’s sitting on top of a communication protocol built for the narrow purpose of sticking HTML (to be later downloaded directly, like a file) onto servers. Even TCP is mostly just getting in our way, which is why it’s being replaced by things like QUIC. There’s really no reason not to run APIs directly on the transport protocol.

Rather than RPC, it’s really HTTP that’s redundant in this world.

Even for the traffic that is still HTML, it’s mostly there to bootstrap Javascript, and to act as the final delivery format for views. Having HTML blend the concepts of composing view components and decorating text with styling attributes is, I believe, outdated and redundant in the same way.

To argue that APIs need to use such a protocol, to the point of being restricted to that protocol’s capabilities, makes no sense.

The original REST paper never talked about trying to map HTTP calls to database queries or transactions. Instead it focused on resources, which more closely correspond to (but don’t necessarily have to be) the filesystem on the server (hence why they are identified with paths in the URL). It doesn’t even really talk about using query parameters. The word “query” appears a single time, and not in the context of using them to do searches.

The driving idea of REST is more about not needing to do searches at all. Searches are ways to discover resources. But the “RESTful” way to discover resources is to first retrieve a resource you already know about, and that tells you, with a list of hyperlinks, what other resources exist and where to find them. If we were strict, a client using a RESTful API would have to crawl it, following hyperlinks to build up the data needed to drive a certain activity.

The central justifications of REST (statelessness, caching and uniformity) aren’t really ever brought up much in API design discussions… well, caching is, and I’ll get to that. RPC protocols can be as stateless (or stateful) as you want, so REST certainly isn’t required to achieve statelessness. Nor does sticking to a REST-style interface guarantee statelessness. Uniformity isn’t usually a design requirement, and since it comes at the cost of inefficient whole-row (or more) responses, it usually just causes problems.

That leaves caching, the only really valid reason I can see to make an API RESTful. However, the way REST achieves caching is basically an example of its “uniformity”: it sticks to standard HTTP mechanisms, which you get “for free” on almost any platform that implements HTTP, but it comes at the cost of being very restricted. For it to work, you have to express your query as an HTTP GET, with the specific query details encoded as query parameters. As I’ve mentioned, there’s not really a good way to handle complex queries like this.

Besides, what does HTTP caching do? It tells the client to reuse the previous response for a certain time range, and then send a request with an ETag to let the server respond with a 304 and save bandwidth by not resending an identical body. The first part, the cache age, can be easily done on any system. All you need is a client-side cache with a configurable age. The second part… well, what does the server do when it gets a request with an ETag? Unless the query is a trivial file or row lookup, it has to generate the response and then compute the ETag and compare it. For example, any kind of search or JOIN is going to require the server to really hit the database to prove whether the response has changed.

So, what are you really saving by doing this? Some bandwidth in the response. In most cases, that’s not the bottleneck. If we’re talking about a huge search that returns thousands (or more) of entities, and your customers are mostly using your app on slow cell networks in remote locations, then… sure, saving that response bandwidth is a big deal. But the more typical use case is that response bodies are small, bandwidth is plentiful, but the cloud resources needed to compute the response and prove it’s unchanged are what’s scarce.

You’ll probably do a much better job optimizing the system in this way by making sure you’re only requesting exactly what you need… the very capability that you lose with REST. This even helps with the response body size, which is going to be way bigger if you’re returning entire rows when all you need is a couple column values. Either that, or the opposite, where you basically dump entire blobs of the database onto the client so that it can do its querying locally, and just periodically asks for diffs (this also enables offline mode).

Again, the caching that REST gives us is a kind of no-man’s land middle ground that is suboptimal in all respects. It is, again, appropriate only for the narrow use case of hypertext driven links to resource files in a folder structure on a server that are downloaded “straight” (they aren’t generated or modified as they’re being returned).

The next time I have the authority to design an API, there’s either not going to be one (I’ll grant direct database access to the clients), or it will be a direct implementation of high-level abstract methods that can be mapped directly into an application’s views, and I’ll pick a web framework that automates the RPC aspect to let me build a class on one machine and call its methods from another machine. Either way, I’ll essentially avoid “designing” an API. I’ll either bypass the need and just use a query language, or I’ll write classes in an OOP language, decide where to slice them to run on separate machines, and let a framework write the requisite glue.

If I’m really bold I might try to sell running it all directly on top of TCP or QUIC.

The C++ Resource Management Model (a.k.a. Why I Don’t Want Your Garbage Collector)

My Language of Choice

“What’s your favorite programming language, Dan?”

“Oh, definitely C++”

Am I a masochist? Well, if I am, it’s irrelevant here. Am I just unfamiliar with all those fancy newer “high-level” languages? Nope, I don’t use C++ professionally. On jobs I’m writing Swift, Java, Kotlin, C# or even Ruby and Javascript. C++ is what I write my own apps in.

Am I just insane? Again, if I am, it’s not the reason for my opinion on this matter (at least from my possibly insane perspective).

C++ is an incredibly powerful language. To be fair, it has problems (what Bjarne Stroustrup calls “barnacles”). I consider 3 of them to be major. C++20 fixed 2 of them (the headers problem that makes gratuitous use of templates murder your compile time and forces you to distribute source code, fixed with modules, and the duck typing of templates that makes template error messages unintelligible, fixed with concepts). The remaining one is reflection, which we were supposed to get in C++20, but now it’s been punted to at least C++26.

But overall, I prefer C++ because it is so powerful. Of all the languages I’ve used, I find myself saying the least often in C++ “hmm, I just can’t do what I want to do in this language”. It’s not that I’ve never said that. I just say it less often than I do in other languages.

When this conversation comes up, someone almost always asks me about memory management. It’s not uncommon for people, especially Java/C# guys, to say, “when is C++ going to get a garbage collector?”

C++ had a garbage collector… or, rather, an interface for adding one. It was removed in C++23. Not deprecated, removed. Ripped out in one clean yank.

In my list of problems/limitations of C++, resource management (not memory management, I’ll explain that shortly) is nowhere on the list. C++ absolutely kicks every other language’s ass in this area. There’s another language, D, that follows the design of C++ but frees itself from the shackles of backward compatibility, and is in almost every way far more powerful. Why do I have absolutely no interest in it? Because it has garbage collection. With that one single decision, they ruined what could easily be the best programming language in existence.

I think the problem is a lot of developers who aren’t familiar with C++ assume it’s C with the added ability to stick methods on structs and encapsulate their members. Hence, they think memory management in C++ is the same as in C, and you get stuff like this:

Programmers working in languages without garbage collection (like C and C++) must implement manual memory management in their code.

Even the Wikipedia article for garbage collectors says:

Other languages were designed for use with manual memory management… for example, C and C++

I have a huge C++ codebase, including several generic frameworks, for my own projects. I can count the number of deletes I’ve written on two hands, maybe one.

The Dark Side of Garbage Collection

Before I explain the C++ resource management system, I’m going to explain what’s wrong with garbage collection. Now, “garbage collection” has a few definitions, but I’m talking about the most narrow definition: the “tracer”. It’s the thing Java, C# and D have. Objective-C and Swift don’t have this kind of “garbage collector”, they do reference counting.

I can sum up the problem with garbage collector languages by mentioning a single interface in each of the languages: IDisposable for C#, and Closeable (or AutoCloseable) for Java.

The promise garbage collectors give me is that I don’t have to worry about cleaning stuff up anymore. The fact these interfaces exist, and work the way they do, reveals that garbage collectors are dirty liars. We might as well have named the interfaces Deletable, and the method delete.

Then, remember that I told you I can count the number of deletes I’ve written in tens of thousands of lines of C++ on one or two hands. How many of these effective deletes are in a C#/Java codebase?

Even if you don’t use these interfaces, any semantically equivalent “cleanup” call, whether you call it finish, discard, terminate, release, or whatever, counts as a delete. Now tell me, who has fewer of these calls? Java/C# or C++?

C++ wins massively, unless you’re writing C++ code that belongs in the late 90s.

Interestingly, I’ve found most developers assume, when I say I don’t like garbage collectors, that I’m going to start talking about performance (i.e. tracing is too slow/resource intensive), and it surprises them that I say nothing about that and jump straight to these pseudo-delete interfaces. They don’t even know how much better things are in my world.

If you doubt that dispose/close patterns are the worst possible way to deal with resources, allow me to explain how they suffer from all the problems that manual pointers in C suffer from, plus more:

  • You have to clean them up, and it’s invisible and innocuous if you don’t
  • If you forget to clean them up, the explosion is far away (in space and time) from where the mistake was made
  • You have no idea if it’s your responsibility. In C, if you get a pointer from a function, what do you do? Call free when you’re done, or not? Naming conventions? Read docs? What if the answer is different each time you call a function!?
  • Static analysis is impossible, because a pointer that needs to be freed is syntactically indistinguishable from one that shouldn’t be freed
  • You can’t share pointers. Someone has to free it, and therefore be designated the sole true owner.
  • Already freed pointers are still around, land mines ready to be stepped on and blow your leg off.

Replace “pointer” and free with IDisposable/Closeable and Dispose/close respectively, and everything carries over.

The inability to share these types is a real pain. When the need arises, you have to reinvent a special solution. ADO.NET does this with database connections. When you obtain a connection, which is an IDisposable, internally the framework maintains a count of how many identical connections are open. Since you can’t properly share an IDisposable, you instead “open” a new connection every time, but behind the scenes it keeps track of the fact that an identical connection is already open, and it just hands you a handle to this open connection.

Connection pooling is purported to solve a different problem of opening and closing identical connections in rapid succession, but the need to do this to begin with is born out of the inability to create a single connection and share it. The cost of this is that the system has to guess when you’re really done with the connection:

If MinPoolSize is either not specified in the connection string or is specified as zero, the connections in the pool will be closed after a period of inactivity. However, if the specified MinPoolSize is greater than zero, the connection pool is not destroyed until the AppDomain is unloaded and the process ends.

This is ironic, because the whole point of IDisposable is to recover the deterministic release of scarce resources that is lost by using a GC. By this point, you might as well just hand the database connection to GC, and do the closing in the finalizer… except that’s dangerous (more on this later), and it also loses you any control over release (i.e. you can’t define a “period of inactivity” to be the criterion).

This is just reinvented reference counting, but worse: instead of expressing directly what you’re doing (sharing an expensive object, so that the last user releasing it causes it to be destroyed), you have to hack around the limitation of no sharing and write code that looks like it’s needlessly recreating expensive objects. Each time you need something like this, you have to rebuild it. You can’t write a generic shared resource that implements the reference counting once. People have tried, and it never works (we’ll see why later).

Okay, well hopefully we can restrict our use of these interfaces to just where they’re absolutely needed, right?

IDisposable/Closeable are zombie viruses. When you add one as a member to a class A, it’s not uncommon that the (or at least a) “proper” time to clean up that member is when the A instance is no longer used. So you need to make A an IDisposable/Closeable too. Anything holding an A as a member then likely needs to become an IDisposable/Closeable itself, and on and on. Then you have to write boilerplate, which can usually be generated by your IDE (that’s always a sign of a language defect, that a tool can autogenerate code you need but the compiler can’t), to have your Dispose/close just call Dispose/close on all IDisposable/Closeable members. Except that’s not always correct. Maybe some of those members are just being borrowed. Back to the docs!

Now you’re doing what C++ devs had to do in the 90s: write destructors that do nothing but call delete on all pointer members… except when they shouldn’t.

In fact, IDisposable/Closeable aren’t enough for the common case of hierarchies and member cleanup. A class might also hold handles to “native” objects that need to be cleaned up whenever the instance is destroyed. As I’ll explain in a moment, you can’t safely Dispose/close your member objects in a finalizer, but you can safely clean up native resources (sort of…). So you need two cleanup paths: one that cleans up everything, which is what a call to Dispose/close will do, and one that only does native cleanup, which is what the finalizer will trigger. But then, since the finalizer could get called after someone calls Dispose, you need to make sure you don’t do any of this twice, so you also need to keep track of whether you’ve already done the cleanup.

The result is this monstrosity:

protected virtual void Dispose(bool disposing)
{
    if (_disposed)
    {
        return;
    }

    if (disposing)
    {
        // TODO: dispose managed state (managed objects).
    }

    // TODO: free unmanaged resources (unmanaged objects) and override a finalizer below.
    // TODO: set large fields to null.

    _disposed = true;
}

I mean, come on! The word “Dispose” shows up as an imperative verb, a present participle, and a past participle. It’s a method whose parameter basically means “but sort of not really” (I call these “LOLJK” parameters). Where did I find this demonry? On Microsoft’s docs, as an example of a pattern you should follow, which means you won’t just see this once, but over and over.

Raw C pointers never necessitated anything that ridiculous.

For the love of God keep this out of C++. Keep it as far away as possible.

Now, the real question here isn’t why do we have to go through all this trouble when using IDisposable/Closeable. Those are just interfaces marking a uniform API for utterly manual resource management. We already know manual resource management sucks. The real question is: why can’t the garbage collector handle this? Why don’t we just do our cleanup in finalizers?

Because finalizers are horrible.

They’ve been completely deprecated in Java, and Microsoft is warning people to never write them. The consensus is now that you can’t even safely release native resources there. It’s so easy to get it wrong. Storing multiple native resources in a managed collection? The collection is managed, so you can’t touch it. Did you know finalizers can get called on objects while the scope they’re declared in is still running, which means they can get called mid-execution of one of their methods? And there’s more. Allowing arbitrary code to run during garbage collection can cause all sorts of performance problems or even deadlocks. Take a look at this and this thread.

Is this really “easier” than finding and removing reference cycles?

The prospect of doing cascading cleanup in finalizers fails because of how garbage collectors work. When I’m in a finalizer, I can’t safely assume anything in my instance is still valid except for native/external objects that I know aren’t being touched by the garbage collector. In particular, the basic assumption about a valid object is violated: that its members are valid. They might not be. Finalizers are intrinsically messages sent to half-dead objects.

Why can’t the garbage collector guarantee order? This is, I think, the biggest irony in all of this. The answer is reference cycles. It turns out neglecting to define an ordered topology of your objects causes some real headaches. Garbage collectors just hide this, encourage you to neglect the work of repairing cyclical references, and force you to always deal with the possibility of cycles even when you can prove they don’t exist. If those cyclical references are nothing but bags of bits taken from the memory pool, maybe it will work out okay. Maybe. As soon as you want any kind of well-ordered cleanup logic, you’re hosed.

It doesn’t even make sense to try to apply garbage collectors to non-memory resources like files, sockets, database connections, and so on, especially when you remember some of those resources are owned by entire machines, or even networks, rather than single processes. It turns out that “trigger a sequence to build up a structure, then trigger the exact opposite sequence in reverse order to tear it down” is a highly generic, widely useful paradigm, which we C++ guys call Resource Acquisition Is Initialization.

Anything from opening and closing a file, to acquiring and releasing a mutex, to describing and committing an animation, can fall under this paradigm. It covers any situation you can imagine where there is balanced “start” and “finish” logic, and that logic is inherently hierarchical: if “starting” X really means starting A, B, then C in that order, then “finishing” X will at least include “finishing” C, B, then A in that order.

By giving up deterministic “cleanup” of your objects in a language, you’re depriving yourself of this powerful strategy, which goes way beyond the simple case of “deleting X, who was constructed out of A, B and C, means deleting C, B and A”. Deterministically run pairs of hierarchical setup/teardown logic are ubiquitous in software. Memory allocation and freeing is just a narrow example of it.

For this reason, garbage collection definitely is not what you want to have baked into a language, attempting to be the one-size-fits-all resource management strategy. It simply can’t be that, and then, since the language-baked resource management is limited in what it can handle, you’re left out to dry, reverting to totally manual management of all other resources. At best, garbage collection is something you can opt into for specific resources. That requires a language capability to tag variables with a specific resource management strategy. Ideally that strategy can be written in the language itself, using its available features, and shipped as a library.

I don’t know any language that could do this, but I know one that comes really close, and does allow “resource management as libraries” for every other management technique beside tracing.

What was my favorite language, again?

The Underlying Problem

I place garbage collection into the same category as ORMs: tools that attempt to hide a problem instead of abstract the problem.

We all agree manual resource management is bad. Why? Because managing resources manually forces us to tell a system how to solve a problem instead of telling it what problem to solve. There are generally two ways to deal with the tedium of spelling out how. The first is to abstract: understand what exactly you’re telling the system to do, and write a higher level interface to directly express this information that encapsulates the implementation details. The other is to hide: try to completely take over the problem and “automagically” solve it without any guidance at all.

ORMs, especially of the Active Record variety, are an example of the second approach applied to interacting with a database. Instead of relieving you from wrestling with the how of mapping database queries to objects, it promises you can forget that you’re even working with a database. It hides database stuff entirely within classes that “look” and “act” like regular objects. The database interaction is under-the-hood automagic you can’t see, and therefore can’t control.

Garbage collection is the same idea applied to memory management: the memory releases are done totally automagically, and you are promised you can forget a memory management problem even exists.

Of course, not really. Besides the fact that, as I’ve explained, it totally can’t manage non-memory resources, it also doesn’t really let you forget memory management exists. In my experience with reference counted languages like Swift, the most common source of “leaks” isn’t reference cycles, but simply holding onto references to unneeded stuff for too long. This is especially easy to do if you’re sticking references in a collection, and nothing is ever pruning the collection. That’s not a leak in the strict sense (an object unreachable to the program that can’t be deleted), but it’s a semantic leak with identical consequences. Tracers won’t help you with that.

All of these approaches suffer from the same problems: some percentage, usually fairly high (let’s say 85-90%) of problems are perfectly solved by these automagic engines. The remaining 10-15% are not, and the very nature of automagic systems that hide the problem is that they can’t be controlled or extended (doing so re-exposes the problem they’re attempting to hide). Therefore, nothing can be done to cover that 10-15%, and those problems become exponentially worse than they would have been without a fancy generic engine. You have to hack around the engine to deal with that 10-15%, and the result is more headaches than contending directly with the 85% ever would have caused.

Automagic solutions that hide the problem intrinsically run afoul of the open-closed principle. Any library or tool that violates the open-closed principle will make 85-90% of your problems super easy, and the remaining 10-15% total nightmares.

The absolute worst thing to do with automagic engines is bake them into a language. In one sense, doing so is consistent with the underlying idea: that the automagic solution really is so generic and such a panacea that it deserves to be an ever-present and unavoidable all-purpose approach. But it also significantly exacerbates the underlying problem: that such “silver bullets” are never actually silver bullets.

I’ve been dunking on garbage collectors, but baking reference counting into a language is the same sort of faulty reasoning: that reference counting is the true silver bullet of resource management. Reference counting at least gives us deterministic release. We don’t have to wrestle with the abominations of IDisposable/Closeable. But the fact that you literally can’t create a variable without having to manage an atomic integer is a real problem inside tight loops. As I’ll get into shortly, reference counting is the way to handle shared ownership, but the vast majority of variables in a program aren’t shared (and the ones that are usually don’t need to be). Making everything shared anyway causes a proliferation of unnecessary cyclic references and memory leaks.

What is the what, and the how, of resource management? Figuring out exactly what needs to get released, and where, is the how. The what is object lifetimes. In most cases, objects need to stay alive exactly as long as they are accessible to the program. The case of daemons that keep themselves alive can be treated as a separate exception (speaking of which, those are obnoxious in garbage collected languages, you have to stick them into a global variable). For something to be accessible to the program, it needs to be stored in a variable. In object-oriented languages, variables live inside other objects, or inside blocks of code, which are all called recursively by the entry function of a thread.

We can see that the lifetime problem is precisely the problem of defining a directed, non-cyclical graph of ownership. Why can there not be cycles? Not for the narrow reason garbage collectors are designed to address, which is that determining in a very “dumb” manner what is reachable and what is not fails on cycles. Cycles make the order of release undefined. Since teardown logic must occur in the reverse order of setup (at least in general), this makes it impossible to determine what the correct teardown logic is.

The missing abstraction in a language like C is the one that lets us express in our software what this ownership graph is, instead of just imagining it and writing out the implications of it (that this pointer gets freed here, and that one gets freed there).

The Typology of Ownership

We can easily list out the types of ownership relationships that will occur in a program. The simplest one is scope ownership: an object lives, and will only ever live, in a single variable, and therefore its lifetime is equal to the scope of that one variable. The scope of a variable is either the block of code it’s declared in (for “local” variables), or the object it’s an instance member of. The ownership is unique and static: there is one owner, and it doesn’t change.

Both code blocks and objects have cascading ownership, and therefore trigger cascading release. When a block of code ends, that block dies, which causes all objects owned by it to die, which causes all objects owned by those objects to die, and so on. The cascading nature is a consequence of the unique and static nature of the ownership, with the parent-child relationship (i.e. the direction of the graph) clearly defined at the outset.

Slightly more complex than this is when an object’s ownership remains unique at all times (we can guarantee there is only ever one variable that holds an object), but the owner can change, and thereby transfer from one scope to another. Function return values are a basic example. We call this unique ownership. The basic requirement of unique ownership is that only transfers can occur, wherein the original variable must release and no longer be a reference to the object when the transfer occurs.

The next level of complexity is to relax the requirement of uniqueness, by allowing multiple variables to be assigned to the same object. This gives us shared ownership. The basic requirement of shared ownership is that the object becomes unowned, and therefore cleaned up, when the last owner releases it.

That’s it! There’s no more to ownership. The owner either changes or it doesn’t. If it does change, either the number of owners can change or it can’t (all objects start with a single owner, so if it doesn’t change, it stays at 1). There’s no more to say.

However, we have to contend with the requirement that ownership form a directed, non-cyclical graph. The graph of all variable references generally doesn’t satisfy this. This is why we can’t just make everything shared, with every variable an owner of its assigned object. We’d get a proliferation of cycles, and that destroys the well-ordered cleanup logic, whether or not we can trace the graph to find disconnected islands.

We need to be able to send messages in both directions. Parents will send messages to their children, but children need to send messages to parents. To do this, a child simply needs a non-owning reference back to its parent. Now, the introduction of non-owning references is what creates this risk of dangling references… a problem guaranteed to not exist if every reference is owning. How can we be sure non-owning references are still valid?

Well, the reason we have to introduce non-owning references is to send messages up the ownership hierarchy, in reverse direction of the graph. When does a child have to worry if its parent is still alive? Well, definitely not in the case of unique ownership. In that case, the fact the child is still alive and able to send messages is already proof the (one, unique) parent is still around. The same applies for more distant ancestors. If an ownership graph is all unique, then a child can safely send a message to a great-grandparent, knowing that there’s no way he could still exist to send messages if any of his unique ancestors were gone.

This is no longer true when objects are shared. A shared object only knows that one of its owners is still alive, so it cannot safely send a message to any particular parent. And thus we have the partner to shared ownership, which is the weak reference: a reference that is non-owning and also can be safely checked before access to see if the object still exists.

This is an important point that does not appear to be well-appreciated: weak references are only necessary in the context of shared ownership. Weak references force the user to contend with the possibility of the object being gone. What should happen then? The most common tactic may be to do nothing, but that’s likely just a case of stuffing problems under the rug (i.e. avoiding crashing when crashing is better than undefined behavior). You have to understand what the correct behavior is in both variations (the object is still present, and the object is absent) when you use weak references.
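To make that concrete, here’s a minimal sketch using the standard library’s shared_ptr and weak_ptr (the names Parent and Child are just placeholders for this sketch): the child holds a non-owning back-reference up the hierarchy and has to decide, deliberately, what to do in both cases.

#include <iostream>
#include <memory>

struct Parent;

struct Child
{
  // Non-owning back-reference up the hierarchy: it must not keep the
  // parent alive, and it must be checkable before use.
  std::weak_ptr<Parent> parent;

  void notifyParent();
};

struct Parent
{
  std::shared_ptr<Child> child; // owning reference down the hierarchy
};

void Child::notifyParent()
{
  if (auto p = parent.lock()) // temporarily become an owner, if the parent still exists
  {
    std::cout << "parent is still alive\n";
  }
  else
  {
    std::cout << "parent is gone; this branch needs a deliberate answer\n";
  }
}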

In summary, we have for ownership:

  1. Scope Ownership
  2. Unique Ownership
  3. Shared Ownership

And for non-owning references:

  1. “Unsafe” references
  2. Weak references

What we want is a language where we can tell it what it needs to know about ownership, and let it figure out from that when to release stuff.

Additionally, we want to be able to control both what “creating” and “releasing” a certain object entails. The cascading of scope-owned members is given, and we shouldn’t have to, nor should we be able to, modify this (to do so breaks the definition of scope ownership). We should also be able to add additional custom logic.

Once our language lets us express who’s an owner of what, everything else should be taken care of. We should not have to tell the program when to clean stuff up. That should happen purely as a consequence of an object becoming unowned.

The Proper Solution

Let’s think through how we might try to solve this problem in C. A raw C pointer does not provide any information on ownership. An owned C pointer and a borrowed C pointer are exactly the same. There are two possibilities about ownership: the owner is either known at compile-time (really authorship time, which applies to interpreted languages too), or it’s known only at run-time. A basic example is a function that mallocs a pointer and returns it. The returned pointer is clearly an owning pointer. The caller is responsible for freeing it.

Whenever something is known at authorship time, we express it with the type system. If a function returns an int*, it should instead return a type that indicates it’s an owning pointer. Let’s call it owned_int_ptr:

struct owned_int_ptr
{
    int* ptr;
};

When a function returns an owned_int_ptr, that adds the information that the caller must free it. We can also define an unowned_int_ptr:

struct unowned_int_ptr
{
    int* ptr;
};

This indicates a pointer should not be freed.

For the case where it’s only known at runtime if a pointer is owned, we can define a dynamic_int_ptr:

struct dynamic_int_ptr
{
    int* ptr;
    char owning;
};

(The owning member is really a bool, but classic C, before C99 added _Bool, doesn’t have a bool type, so we use a char where 0 means false and everything else means true.)

If we have one of these, we need to check owning to determine if we need to call free or not.
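As a quick sketch (the consume function is purely illustrative), a consumer of a dynamic_int_ptr has to branch on the flag before it can safely clean up:

#include <stdio.h>
#include <stdlib.h>

void consume(struct dynamic_int_ptr p)
{
    printf("%d\n", *p.ptr);

    /* Only at runtime do we know whether this call site must free the pointer. */
    if (p.owning)
        free(p.ptr);
}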

Now, let’s think about the problems with this approach:

  • We’d have to declare these pointer types for every variable type.
  • We have to tediously add a .ptr to every access to the underlying pointer
  • While this tells us whether we need to call free before tossing a variable, we still have to actually do it, and we can easily forget

For the first problem, a C developer would use macros. Macros are black magic, so we’d really like to find a better solution. Ignoring macros, none of these three problems can really be solved in C. We need to add some stuff to the language to make them properly solvable:

  • Templates
  • User-overridable * and -> operators
  • User-defined cleanup code that gets automatically inserted by the compiler whenever a variable goes out of scope

You see where I’m going, don’t you? (Unless you’re that unfamiliar with C++)

With these additions, the C++ solution is:

template<typename T> class auto_ptr
{

public:

  auto_ptr(T* ptr) : _ptr(ptr) 
  {
 
  }

  ~auto_ptr()
  {
      delete _ptr;
  }

  T* operator->() const
  {
      return _ptr;
  }

  T& operator*() const
  {
      return *_ptr;
  }

private:

  T* _ptr;
};

Welcome to C++03 (that’s the C++ released in 2003)!

By returning an auto_ptr, which is an owning pointer, you’ll get a variable that behaves identically to a raw pointer when you dereference or access members via the arrow operator, and that automatically deletes the pointer when the auto_ptr is discarded by the program (when it goes out of scope).

The last part is very significant. There’s something unique to C++ that makes this possible:

C++ has auto memory management!

This is what all those articles that say “C++ only has manual memory management” fail to recognize. C++ does have manual memory management (new and delete), but it also has a type of variable with automatic storage. These are local variables, declared as values (not as pointers), and instance members, also declared as values. This is usually considered equivalent to being stack-allocated, but that’s neither important nor always correct (a heap-allocated object’s members are on the heap, but are automatically deleted when the object is deleted).

The important part is auto variables in C++ are automatically destroyed at the end of the scope in which they are declared. This behavior is inherited from C, which “automatically” cleans up variables at the end of their scope.

But C++ makes a crucial enhancement to this: destructors.

Destructors are user defined code, whatever your heart desires, added to a class A that gets called any time an instance of A is deleted. That includes when an A instance with automatic storage goes out of scope. This means the compiler automatically inserts code when variables go out of scope, and we can control what that code is, as long as we control what types the variables are.
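A tiny illustration of that point (Logger is just a made-up class for this sketch): the compiler inserts the destructor calls at the closing brace, in the reverse order of construction.

#include <cstdio>

struct Logger
{
  const char* name;

  Logger(const char* n) : name(n) { std::printf("construct %s\n", name); }
  ~Logger()                       { std::printf("destroy %s\n", name); }
};

void scopeDemo()
{
  Logger a("a");
  Logger b("b");

  // At the closing brace the compiler inserts the destructor calls for us:
  // "destroy b" prints first, then "destroy a".
}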

That’s the real garbage collection, and it’s the only garbage collection we actually need. It’s completely deterministic and doesn’t cost a single CPU cycle more than what it takes to do the actual releasing, because the instructions are inserted (and can be inlined) at compile-time.

You can’t have destructors in a garbage collected language. Finalizers aren’t destructors, and the pervasive myth that they are (encouraged at least in C# by notating them identically to C++ destructors) has caused endless pain. You can have them in reference counted languages. So far, reference counted languages are on par with C++ (except for performance; bumping those atomic reference counts is expensive). But let’s keep going.

Custom Value Semantics

Why can’t we build our own “shared disposable” as a userland class in C#? Something like this:

using System;
using System.Threading;

class SharedDisposable<T> : IDisposable where T : IDisposable
{
  private class ControlBlock
  {
    public T Source;
    public int Count;
  }

  public SharedDisposable(T source)
  {
    _controlBlock = new ControlBlock { Source = source, Count = 1 };
  }

  public SharedDisposable(SharedDisposable<T> other)
  {
    _controlBlock = other._controlBlock;
    Interlocked.Increment(ref _controlBlock.Count);
  }

  public T Get()
  {
    return _controlBlock.Source;
  }

  public void Dispose()
  {
    if (Interlocked.Decrement(ref _controlBlock.Count) == 0)
    {
      _controlBlock.Source.Dispose();
    }
  }

  private readonly ControlBlock _controlBlock;
}

One problem, of course, is that if the source IDisposable is accessible directly to anyone, they can Dispose it themselves. Sure, but really that problem exists for any “resource manager” class, including smart pointers in C++. The bigger problem is that if I do this:

void StoreSharedDisposable(SharedDisposable<MyClass> incoming)
{
  this._theSharedThing = incoming;
}

The thing that’s supposed to happen, namely incrementing the reference count, doesn’t happen. This is just a reference assignment. None of my code gets executed at the =. What I have to do is write this:

void StoreSharedDisposable(SharedDisposable<MyClass> incoming)
{
  this._theSharedThing = new SharedDisposable<MyClass>(incoming);
}

Like calling Dispose, this is another thing you’ll easily forget to do. We need to be able to require that assigning one SharedDisposable to another invokes the second constructor.

This is where C++ pulls ahead even of reference counted languages, and where it becomes, AFAIK, truly unique (except direct derivatives like D). A C++ dev will look at that second constructor for SharedDisposable and recognize it as a copy constructor. But it doesn’t have the same effect. Like most “modern” languages, C# has reference semantics, so assigning a variable involves no copying whatsoever. C++ has primarily value semantics, unless you specifically opt out with * or &, and unlike the limited value semantics (structs) in C# and Swift, you have total control over what happens on copy.

(If C# or Swift allowed custom copy constructors for structs, it would render the copy-on-write optimization impossible, and since structs only give you value semantics, unless you tediously wrap a struct in a class, losing this optimization would mean a whole lot of unnecessary copying.)

Speaking of this, there’s a big, big problem with auto_ptr. You can easily copy it. Then what? You have two auto_ptrs to the same pointer. Well, auto_ptrs are owning. You have two owners, but no sharing logic. The result is double delete. This is so severe a problem it screws up simply forwarding an auto_ptr through a function:

auto_ptr<int> forwardPointerThrough()
{
    auto_ptr<int> result = getThePointer();
    ...
    return result; // Here, the auto_ptr gets copied. Then result goes out of scope,
                   // and its destructor is called, which deletes the underlying pointer.
                   // You’ve now returned an auto_ptr to an already deleted pointer!
}

Luckily, C++ lets us take full control over what happens when a value is copied. We can even forbid copying:

template<typename T> class auto_ptr
{

public:

  ...

  auto_ptr(const auto_ptr& other) = delete;
  auto_ptr& operator=(const auto_ptr& other) = delete;

  ...
};

We also suppressed copy-assignment, which would be just as bad.

C++ again lets us define for ourselves exactly what happens when we do this:

SomeType t;
SomeType t2 = t; // We can make the compiler insert any code we want here, or forbid us from writing it.

This is the interplay of value semantics and user-defined types that lets us take total control of how those semantics are implemented.

That helps us avoid the landmine of creating an auto_ptr from another auto_ptr, which means the underlying ptr now has two conflicting owners. Our attempt to pass an auto_ptr up a level in a return value will now cause a compile error. Okay, that’s good, but… I still want to pass the return value through. How can I do this?

I need some way for an auto_ptr to release its control of its _ptr. Well, let’s back up a bit. There’s a problem with auto_ptr already. What if I create an auto_ptr by assigning it to nullptr?

auto_ptr<int> ohCrap = nullptr;

When this goes out of scope, it calls delete on a nullptr. Deleting a null pointer is technically a harmless no-op in C++, but let’s make the intent explicit and check for that case:

~auto_ptr()
{
    if(_ptr)
        delete _ptr;
}

With that in place, it’s fairly obvious what I need to do to get an auto_ptr to not delete its _ptr when it goes out of scope: set _ptr to nullptr:

T* release()
{
    T* ptr = _ptr;
    _ptr = nullptr;
    return ptr;
}

Then, to transfer ownership from one auto_ptr to another, I can do this:

auto_ptr<int> forwardPointerThrough()
{
    auto_ptr<int> result = getThePointer();
    ...
    return result.release();
}

Okay, I’m able to forward auto_ptrs, because I’m able to transfer ownership from one auto_ptr to another. But it sucks I have to add .release(). Why can’t this be done automatically? If I’m at the end of a function, and I assign one variable to another, why do I need to copy the variable? I don’t want to copy it, I want to move it.

The same problem exists if I call a function to get a return value, then immediately pass it by value to another function, like this:

doSomethingWithAutoPtr(getTheAutoPtr());

What the compiler does (or did) here is assign the result of getTheAutoPtr() to a temporary unnamed variable, then copy it into the incoming parameter of doSomethingWithAutoPtr. Since a copy happens, and we have forbidden copying an auto_ptr, this will not compile. We have to do this:

doSomethingWithAutoPtr(getTheAutoPtr().release());

But why is this necessary? The reason to call release is to make sure that we don’t end up with two usable auto_ptrs to the same object, both claiming to be owners. But the second auto_ptr here is a temporary variable, which is never assigned to a named variable, and is therefore unusable to the program except to be passed into doSomethingWithAutoPtr. Shouldn’t the compiler be able to tell that there’s never really two accessible variables? There’s only one, it’s just being transferred around.

This is really a specific example of a much bigger problem. Imagine instead of an auto_ptr, we’re doing this (passing the result of one function to another function) with some gigantic std::vector, which could be megabytes of memory. We’ll end up creating the std::vector in the function, copying it when we return it (maybe the compiler optimizes this with copy elision), and then copying it again into the other function. If the function it was passed to wants to store it, it needs to copy it again. That’s as many as three copies of this giant object when really there only needs to be one. Just as with the auto_ptr, the std::vector shouldn’t be copied, it should be moved.

This was solved with C++11 (released in 2011) with the introduction of move semantics. With the language now able to distinguish copying from moving, the unique_ptr was born:

template<typename T> class unique_ptr
{

public:

  unique_ptr(T* ptr) : _ptr(ptr) 
  {
 
  }

  unique_ptr(const unique_ptr& other) = delete; // Forbid copy construction
  
  unique_ptr& operator=(const unique_ptr& other) = delete; // Forbid copy assignment

  unique_ptr(unique_ptr&& other) : _ptr(other._ptr) // Move construction
  {
    other._ptr = nullptr;
  }

  unique_ptr& operator=(unique_ptr&& other) // Move assignment
  {
    if (this != &other)
    {
      delete _ptr;         // Release whatever we currently own
      _ptr = other._ptr;
      other._ptr = nullptr;
    }
    return *this;
  }

  ~unique_ptr()
  {
    if(_ptr)
      delete _ptr;
  }

  T* operator->() const
  {
    return _ptr;
  }

  T& operator*() const
  {
    return *_ptr;
  }

  T* release()
  {
    T* ptr = _ptr;
    _ptr = nullptr;
    return ptr;
  }

private:

  T* _ptr;
};

Using unique_ptr, we no longer need to call release when simply passing it around. We can forward a return value, or pass a returned unique_ptr by value (or rvalue reference) from one function to another, and ownership is transferred automatically via our move constructor.

(We still define release in case we need to manually take over the underlying pointer).
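To illustrate (getTheUniquePtr and doSomethingWithUniquePtr are hypothetical stand-ins for the earlier auto_ptr examples), passing unique_ptrs around now needs no explicit release calls:

#include <utility>

unique_ptr<int> getTheUniquePtr();
void doSomethingWithUniquePtr(unique_ptr<int> ptr);

unique_ptr<int> forwardPointerThrough()
{
  unique_ptr<int> result = getTheUniquePtr();
  // ...
  return result; // moved out (or elided) automatically; no .release() needed
}

void example()
{
  // The temporary returned by getTheUniquePtr() is moved into the parameter.
  doSomethingWithUniquePtr(getTheUniquePtr());

  // Moving out of a named variable requires saying so explicitly:
  unique_ptr<int> ptr = getTheUniquePtr();
  doSomethingWithUniquePtr(std::move(ptr));
}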

We had to exercise all the capabilities of C++ related to value types, including ones that even reference counted languages don’t have, to build unique_ptr. There’s no way I could build a UniqueReference in Swift, because I can’t control, much less suppress, what happens when one variable is assigned to another. Since I can’t define unique ownership, everything is shared in a reference counted language, and I have to be way more careful about using unsafe references. What most devs do, of course, is make every unsafe reference a weak reference, which forces it to be optional, and make you contend with situations that may never arise and for which no proper action is defined.

C++ comes with scope ownership and unsafe references out of the box, and with unique_ptr we’ve added unique ownership as a library class. To complete the typology, we just add a shared_ptr and the corresponding weak_ptr, and we’re done. Building a correct shared_ptr similarly exercises the capability of custom copy constructors: we don’t suppress copying like we do on a unique_ptr, we define it to increment the reference count. Unlike the C# example, that changes the meaning of thisSharedPtr = thatSharedPtr, instead of requiring us to call something extra.
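Here’s a minimal sketch of that idea (single-threaded, no weak_ptr support, and with copy assignment simply deleted for brevity), just to show that the copy constructor is where the counting lives:

template<typename T> class shared_ptr_sketch
{

public:

  explicit shared_ptr_sketch(T* ptr) : _ptr(ptr), _count(new int(1))
  {
  }

  shared_ptr_sketch(const shared_ptr_sketch& other) : _ptr(other._ptr), _count(other._count)
  {
    ++(*_count); // copying now *means* "add an owner"
  }

  shared_ptr_sketch& operator=(const shared_ptr_sketch& other) = delete; // omitted from this sketch

  ~shared_ptr_sketch()
  {
    if (--(*_count) == 0)
    {
      delete _ptr;   // the last owner going away tears the object down
      delete _count;
    }
  }

  T* operator->() const { return _ptr; }
  T& operator*() const { return *_ptr; }

private:

  T* _ptr;
  int* _count;
};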

And with that, the typology is complete. We are able to express every type of ownership by selecting the right type for variables. With that, we have told the system what it needs to know to trigger teardown logic properly.

The vast majority of cleanup logic is cascading. For this reason, not only do we essentially never have to write delete (the only deletes are inside the smart pointer destructors), we also very rarely have to write destructors. We don’t, for example, have to write a destructor that simply deletes the members of a class. We just make those members values, or smart pointers, and the compiler ensures (and won’t let us stop) this cascading cleanup happens.

The only time we need to write a destructor is to tell the compiler how to do the cleanup of some non-memory resource. For example, we can define a database connection class that adapts a C database library to C++:

class DatabaseConnection
{

public:

  DatabaseConnection(std::string connectionString) :
    _handle(createDbConnection(connectionString.c_str()))
  {

  }

  DatabaseConnection(const DatabaseConnection& other) = delete;

  ~DatabaseConnection()
  {
    closeDbConnection(_handle);
  }

private:

  DbConnectionHandle* _handle;
};

Then, in any class A that holds a database connection, we simply make the DatabaseConnection a member variable. Its destructor will get called automatically when the A gets destroyed.
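For example (JobBoardService is just a hypothetical owner for this sketch), the holding class needs no destructor of its own:

class JobBoardService
{

public:

  JobBoardService(std::string connectionString) :
    _connection(connectionString)
  {

  }

  // No destructor needed: the compiler-generated one destroys _connection,
  // whose destructor closes the underlying handle.

private:

  DatabaseConnection _connection;
};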

We can use RAII to do things like declare a critical section locked by a mutex. First we write a class that represents a mutex acquisition as a C++ class:

class AcquireMutex
{

public:

  AcquireMutex(Mutex& mutex) :
    _mutex(mutex)
  {
    _mutex.lock();
  }

  ~AcquireMutex()
  {
    _mutex.unlock();
  }

private:
  
  Mutex& _mutex;
};

Then to use it:

void doTheStuff()
{
  doSomeStuffThatIsntCritical();

  // Critical section
  {
    AcquireMutex acquire(_theMutex);

    doTheCriticalStuff();
  }

  doSomeMoreStuffThatIsntCritical();
}

The mutex is locked at the beginning of the scope by the constructor of AcquireMutex, and automatically unlocked at the end by the destructor of AcquireMutex. This is really useful, because it’s exception safe. If doTheCriticalStuff() throws an exception, the mutex still needs to be unlocked. Manually writing unlock after doTheCriticalStuff() will result in it never getting unlocked if doTheCriticalStuff() throws. But since C++ guarantees that when an exception is thrown and caught, all scopes between the throw and catch are properly unwound, with all local variables being properly cleaned up (including their destructors getting called… this is why throwing exceptions in destructors is a big no-no), doing the unlock in a destructor behaves correctly even in this case.

This whole paradigm is totally unavailable in garbage collected languages, because they don’t have destructors. You can do this in reference counted languages, but at the cost of making everything shared, which is much harder to reason correctly about than unique ownership, and the vast majority of objects are really uniquely owned. In C# this code would have to be written like this:

void DoTheStuff()
{
  DoSomeStuffThatIsntCritical();

  _theMutex.Lock();

  try
  {
    DoTheCriticalStuff();
  }
  catch
  {
    throw;
  }
  finally
  {
    _theMutex.Unlock();
  }

  DoSomeMoreStuffThatIsntCritical();
}

Microsoft’s docs on try-catch-finally show a finally block being used for precisely the purpose of ensuring a resource is properly cleaned up.

In fact, this isn’t fully safe, because a finally block isn’t guaranteed to run in every case (for example, when an exception goes unhandled and the runtime tears the process down). To be absolutely sure the mutex is unlocked, we’d have to do this:

void DoTheStuff()
{
  DoSomeStuffThatIsntCritical();

  _theMutex.Lock();

  Exception? error = null;

  try
  {
    DoTheCriticalStuff();
  }
  catch (Exception e)
  {
    error = e;
  }
  finally
  {
    _theMutex.Unlock();
  }

  if (error != null)
    throw error;

  DoSomeMoreStuffThatIsntCritical();
}

Gross.

C# and Java created using/try-with-resources to mitigate this problem:

void DoTheStuff()
{
  DoSomeStuffThatIsntCritical();

  // AcquireMutex here would implement IDisposable, locking in its
  // constructor and unlocking in Dispose(), which using calls on exit.
  using (var acquire = new AcquireMutex(_theMutex))
  {
    DoTheCriticalStuff();
  }

  DoSomeMoreStuffThatIsntCritical();
}

That solves the problem for relatively simple cases like this where a resource doesn’t cross scopes. But if you want to do something like open a file, call some methods that might throw, then pass the file stream to a method that will hold onto it for some amount of time (maybe kicking off an async task), using won’t help you because that assumes the file needs to get closed locally.
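
In C++, the same situation is handled by moving the unique ownership into whatever outlives the current scope; the resource is released when that owner is done with it, wherever that happens to be. A sketch (std::async, std::move, and std::unique_ptr are real; FileStream and mightThrow are invented placeholders):

#include <future>
#include <memory>
#include <utility>

// Invented RAII wrapper: opens a file in its constructor, closes it in
// its destructor.
class FileStream
{

public:

  explicit FileStream(const char* path) { /* open the file */ }
  ~FileStream() { /* close the file */ }

  void write(const char* data) { /* ... */ }
};

void mightThrow();

std::future<void> processFile()
{
  auto file = std::make_unique<FileStream>("data.txt");

  mightThrow(); // If this throws, unwinding destroys the unique_ptr and closes the file.

  // Ownership moves into the async task; the file is closed when the task
  // is finished with it, not when this scope ends.
  return std::async(std::launch::async,
    [file = std::move(file)]()
    {
      file->write("done");
    });
}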

Adding using/try-with-resources was a great decision, but notice that it has nothing to do with the garbage collector and receives no assistance from it at all. These are special language features with new keywords; they never could have been added as library features. And they only simulate scope ownership, not unique or shared ownership. Adding them is an admission that the garbage collector isn’t the panacea it promised to be.

Tracing?

The basic idea here is not to bake a specific resource management strategy into the language, but to let the coder opt each variable into a specific resource management strategy. We’ve seen that C++ gives us the tools necessary to add reference counting, a strategy sometimes baked directly into languages, as an opt-in library feature. That raises the question: could we do the same for tracing? Could we build some sort of traced_ptr<T> whose pointees are deleted by a collector running on a separate thread, which traces the graph of references to determine what is still reachable?

C++ is still missing the crucial capability we need to implement this. The tracer needs to be able to tell which things inside an object are pointers that need to be traced. In order to do that, it needs to be able to collect information about a type, namely what its members are and their layout, so it can figure out which members are pointers. Well, that’s reflection. Once we get it, it will be part of the template system, and we could write a tracer where much of the logic that would normally happen at runtime is worked out at compile time. The trace_references<T>(traced_ptr<T>& ptr) function would be largely generated at compile time for any T for which a traced_ptr<T> is used somewhere in our program. The runtime logic would not have to work out where the pointers to trace are; it would just have to actually trace them.

Once we get reflection, we can write a traced_ptr<T> class that knows whether or not T has any traced_ptr members. The destructor of traced_ptr itself will do nothing. The tracer will periodically follow any such members, repeat the step for each of those, and voilà: opt-in tracing. This is interesting because it greatly mitigates a problem that baked-in tracing has: total uncertainty about the state of an object during destruction. What can you do in the destructor of a class whose instances are owned through traced_ptrs? Well, you can be sure everything except the traced_ptr members is still valid. You just can’t touch the traced_ptr members.

Since it is now your responsibility, and decision, to work out which members of a class will be owned by the tracer, you can draw whatever dividing line you want between deterministic and nondeterministic release. A class that holds both a file handle and other complex objects might decide that the complex objects will be traced_ptrs, but the file handle will be a unique_ptr. That way we don’t have to write a destructor at all, and the destructor the compiler writes for us will delete the file handle, and not touch the complex objects.
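
A sketch of what that dividing line might look like. The traced_ptr here is purely hypothetical and stubbed out, since nothing like it exists yet (a real one would register its pointee with the tracer); the surrounding names are invented for illustration:

#include <memory>

// Hypothetical stand-in for a tracer-owned pointer. Its destructor
// deliberately does nothing; reclamation is the tracer's job.
template<typename T>
class traced_ptr
{

public:

  traced_ptr() = default;
  explicit traced_ptr(T* pointee) : _pointee(pointee) {}

  T* operator->() const { return _pointee; }

private:

  T* _pointee = nullptr;
};

struct FileHandle { /* ... */ };
struct ComplexObject { /* ... */ };

class Document
{

private:

  // Deterministic: the compiler-generated destructor destroys the
  // unique_ptr, which releases the file promptly.
  std::unique_ptr<FileHandle> _file;

  // Nondeterministic: the tracer reclaims these eventually; Document's
  // destructor doesn't touch them.
  traced_ptr<ComplexObject> _model;
  traced_ptr<ComplexObject> _cache;
};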

There may be problems with delegating only part of your allocations to a tracer. The other key job of a tracer is keeping track of available memory, so to make this work you’d probably also need to provide overrides of operator new and operator delete. But you may also be okay with relaxing the promises of traced references: instead of the tracer doing nothing until it “needs to” (when memory reaches a critical usage threshold), it just runs continuously in the background. That gives you assurance that you can build some temporary webs of objects that you know aren’t anywhere close to your full memory allotment, and be sure they’ll all be swept away soon after you’re done with them.

While this is a neat idea, I would consider it an even lazier approach than a proliferation of shared and weak references: it more or less avoids the ownership problem altogether. It may be a neat tool to have in our toolbelts, but I’d probably want a linter to warn on every usage, just as with weak_ptr, to make us think carefully about whether we can be bothered to work out actual ownership.

Conclusion

I have gone over all the options in C++ for deciding how a variable is owned. They are:

  • Scope ownership: auto variables with value semantics
  • Unique ownership: unique_ptr
  • Shared ownership: shared_ptr
  • Manual ownership: new and delete

Then there are two options for how to borrow a variable:

  • Raw (unowned) pointers/references
  • Weak references: weak_ptr

These lists are in deliberate order: prefer the first option, and only go to the next option if that can’t work, and so on. These options really cover all the possibilities of ownership; neither garbage collected nor reference counted languages give you all of them. Really, the first two are the most important. Resource management is far simpler when you can make the vast majority of cases scope or unique ownership. Unique ownership (that can cross scope boundaries) is, no pun intended, unique to C++.

For this reason, I have far fewer resource leaks in C++ than in other languages, and far less boilerplate code to write. The vast majority of dangling references I’ve encountered were caused by inappropriate use of shared ownership, a habit I brought from reference counted languages, where you end up thinking in terms of sharing everything. And almost all of my desires to use weak references were me being lazy about fixing cyclical references (it’s not some subtle realization after profiling that the cycles exist; it’s obvious when I write the code that they’re cyclical).

I wouldn’t add a garbage collector to that code if you paid me.