July 2022 – Agile Design

Introduction

So, you’re writing an application in an object-oriented programming language, and you need to work with data in a relational database. The problem is, data in a relational database is not in the same form as data in OOP objects. You have to solve this “impedance mismatch” in some way or another.

What exactly is the mismatch? Well, plain old data (POD) objects in OOP are compositions: one object is composed of several other objects, each of which are composed of several more objects, and so on. Meanwhile, relational databases are structured as flat rows in tables with relations to each other.

Let’s say we have the following JSON object:

{
  "aNumber": 5,
  "aString": "Hello",
  "anInnerObject": {
    "anotherNumber": 10,
    "aDate": "12-25-2015",
    "evenMoreInnerObject": {
      "yetAnotherNumber": 30,
      "andOneMoreNumber": 35,
      "oneLastString": "Hello Again"
    }
  }
}

In OOP code, we would represent this with the following composition of classes:

class TopLevelData {

  let aNumber: Int
  let aString: String
  let anInnerObject: InnerData 
}

class InnerData {

  let anotherNumber: Int
  let aDate: Date
  let evenMoreInnerObject: InnerInnerData
}

class InnerInnerData {
  
  let yetAnotherNumber: Int
  let andOneMoreNumber: Int
  let oneLastString: String
}

This is effectively equivalent to the JSON representation, with one important caveat: in most OOP languages, unless you use structs, the objects have reference semantics: the InnerData instance is not literally embedded inside the memory of the TopLevelData instance, it exists somewhere else in memory, and anInnerObject is really, under the hood, a pointer to that other memory. In the JSON, each sub-object is literally embedded. This means we can’t, for example, refer to the same sub-object twice without having to duplicate it (and by extension all its related objects), and circular references are just plain impossible.

This value vs. reference distinction is another impedance mismatch between OOP objects and JSON, which is what standards like json-api are designed to solve.

In the database, this would be represented with three tables with foreign key relations:

TopLevelData

id

Int
Primary Key

aNumber

Int

aString

Text

anInnerObjectId

Int
ForeignKey
InnerData.id

InnerData

id

Int
Primary Key

anotherNumber

Int

aDate

Date

evenMoreInnerObjectId

Int
ForeignKey
InnerInnerData.id

InnerInnerData

id

Int
Primary Key

yetAnotherNumber

Int

andOneMoreNumber

Int

oneLastString

Text

This representation is more like how OOP objects are represented in memory, where foreign keys are equivalent to pointers. Despite this “under the hood” similarity, on the visible level they’re completely different. OOP compilers “assemble” the memory into hierarchical structures we work with, but SQL libraries don’t do the same for the result of database queries.

The problem you have to solve, if you want to work with data stored in such tables but represented as such OOP classes, is to convert between foreign key relations in tables to nesting in objects…

…or is it?

It may seem “obvious” that this impedance mismatch needs to be bridged. After all, this is the same data, with different representations. Don’t we need an adapter that converts one to the other?

Well, not necessarily. Why do we need to represent the structure of the data in the database in our code?

ORMs to the Rescue

Assuming that yes, we do need that, the tools that solve this problem are called object-relational mapping, or ORM, libraries. The purpose of an ORM is to automate the conversion between compositional objects and database tables. At minimum, this means we get a method to query the TopLevelData table and get back a collection of TopLevelData instances, where the implementation knows how to do the necessary SELECTs and JOINs to get all the necessary data, build each object out of its relevant parts, then assign them to each other’s reference variables.

If we want to modify data, instead of hand-crafting the INSERTs or UPDATEs, we simply hand the database a collection of these data objects, and it figures out what records to insert or update. The more clever ones can track whether an object was originally created from a database query, and if so, which fields have been modified, so that it doesn’t have to write the entire object back to the database, only what needs updating.

We still have to design the database schema, connect to it, and query it in some way, but the queries are abstracted from raw SQL, and we don’t have to bother with forming the data the database returns into the objects we want, or breaking those objects down into update statements.

The fancier ORMs go further than this and allow you to use your class definitions to build your database schema. They can analyze the source code for the three classes, inspect its fields, and work out what tables and columns in those tables are needed. When it sees a reference-type field with one object containing another, it’s a cue to create a foreign key relationship. With this, we no longer need to design the schema, we get it “for free” by simply coding our POD objects.

This is fancy and clever. It’s also, the way we’ve stated it, unworkably inefficient.

Inefficiency is a problem with any of these ORM solutions because of their tendency to work on the granularity of entire objects, which correspond to entire rows in the database. This is a big problem because of foreign key relations. The examples we’ve seen so far only have one-to-one relations. But we can also have one-to-many, which would look like TopLevelData having a field let someInnerObjects: [InnerData] whose type is a collection of objects, and many-to-many, which would add to this a “backward” field let theTopLevelObjects: [TopLevelData] on InnerData.

The last one is interesting because it is unworkable in languages that use reference counting for memory management. That’s a circular reference, which means you need to weaken one of them, but by weakening one (say, the reference from InnerData back to TopLevelData) means you must hold onto the TopLevelData separately. If you, for example, query the database for an InnerData, and want to follow it to its related TopLevelData, they’ll be gone as soon as you get your InnerData back.

This is, of course, not a problem in garbage collected languages. You just have to deal with all the other problems of garbage collection.

With x-to-many relations, constructing a single object of any of our classes might end up pulling hundreds or thousands of rows out of the database. The promise we’re making in our class, however implicit, is that when we have a TopLevelData instance, we can follow it through references to any of its related objects, and again, and eventually end up on any instance that is, through an arbitrarily long chain of references, related back to that TopLevelData instance. In any nontrivial production database, that’s an immediate showstopper.

A less severe form of this same problem is that when I grab a TopLevelData instance, I might only need to read one field. But I end up getting the entire row back. Even in the absence of relations, this is still wasteful, and can become unworkably so if I’m doing something like a search that returns 10,000 records, where I only need one column from each, but the table has 50 columns in it, so I end up querying 50,000 cells of data. That 50x cost, in memory and CPU, is a real big deal.

By avoiding the laborious task of crafting SQL queries, where I worry about SELECTing and JOINing only as is strictly necessary, I lose the ability to optimize. Is that premature optimization? In any nontrivial system, eventually no.

Every ORM has to deal with this problem. You can’t just “query for everything in the object” in the general case, because referenced objects are “in”, and that cascades.

And this is where ORMs start to break down. We’ll soon realize that the very intent of ORMs, to make database records look like OOP objects, is fundamentally flawed.

The Fundamental Flaw of ORMs

We have to expose a way to SELECT only part of an object, and JOIN only on some of its relations. That’s easy enough. Entity Framework has a fancy way of letting you craft SQL queries that looks like you’re doing functional transformations on a collection. But the ability to make the queries isn’t the problem. Okay, so you make a query for only part of an object.

What do you get back?

An instance of TopLevelData? If so, bad. Very bad.

TopLevelData has everything. If I query only the aNumber field, what happens when I access aString? I mean, it’s there! It’s not like this particular TopLevelData instance doesn’t have an aString. If that were the case, it wouldn’t be a TopLevelData. A class in an OOP language is literally a contract guaranteeing the declared fields exist!

So, what do the other fields equal? Come on, you know what the answer’s gonna be, and it’s perfectly understandable that you’re starting to cry slightly (I’d be more concerned if you weren’t):

null

I won’t belabor this here, but the programming industry collectively learned somewhere in the last 10 years or so that the ancient decision of C, from which so many of our languages descend in some way, to make NULL not just a valid but the default value of any pointer type, is one of the most expensive bug-breeding decisions that’s ever been made. It wasn’t wrong for C to do this, NULL (just sugar for 0) is a valid pointer. The convention to treat this as “lacking a value” or “not set” is the problem, but again, in C, there’s really no better option.

But carrying this forward into C++, Objective-C, C# and Java was a bad idea. Well, okay, Objective-C doesn’t really have a better option either. C++ has pointers but more than enough tools to forever hide them from anyone except low level library developers. C# and Java don’t have them at all, and it’s their decision to make null a valid value of any declared variable type (reference types at least) that’s really regrettable. It’s a hole in their type system.

This is one of the greatest improvements that Swift and Kotlin (and Typescript if you configure it properly) made over these languages. null is not a String, so if I tell my type system this variable is a String, assigning null to it should be a type error! If I want to signal that a variable is either a String or null, I need a different type, like String?, or Optional<String>, or std::optional<String>, or String | null, which is not identical to String and can’t be cast to one.

I said I wouldn’t belabor this, so back to the subject: the ability of ORMs to do their necessary optimization in C# and Java is literally built on the biggest hole of their type systems. And of course this doesn’t work with primitives, so you either have to make everything boxed types, or God forbid, decide that false and 0 is what anything you didn’t ask for will equal.

It really strikes to the heart of this issue that in Swift, which patched that hole, an ORM literally can’t do what it wants to do in this situation. You’d have to declare every field in your POD objects to be optional. But then what if a particular field is genuinely nullable in the database, and you want to be able to tell that it’s actually null, and not just something you didn’t query for? Make it an optional optional? For fu…

Either way, making every field optional would throw huge red flags up, as it should.

In Java and C#, there’s no way for me to know, just by looking at the TopLevelData instance I have, if the null or false or 0 I’m staring at came from the database or just wasn’t queried. All the information about what was actually SELECTed is lost in the type system.

We could try to at least restrict this problem to the inefficiency of loading an entire row (without relations), but making relations lazy loaded: the necessary data is only queried from the database when it is accessed in code. This tries to solve the problem of ensuring the field has valid data whenever accessed while also avoiding the waste of loading a potentially very expensive (such as x-to-many) relation that is never accessed.

This comes with a host of its own problems, and in my experience it’s never actually a workable solution. Database connections are typically managed with some type of “context” object that, among other things, is how you control concurrency, since database access is generally not thread safe. You usually create a context in order to make a query, get all the data you need, then throw the context away once the data is safely stored in POD objects.

If you try to lazy-load relations, you’re trying to hit the database after the first query is over, and you’ve thrown the context away. Either it will fail because the context is gone, or it’s going to fail because the object with lazy-loading capabilities keeps the context alive, and when someone else creates a context it throws a concurrency exception.

You can try so solve this by storing some object capable of creating a new context in order to make the query on accessing a field. But even if you can get this to work, you’ll end up potentially hitting the database while using an object you returned to something like your UI. To avoid UI freezes you’d have to be aware that some data is lazy-loaded, keep track of whether it’s been loaded or not, and if not, make sure to do it asynchronously and call back to update the UI when it’s ready. By that point you’re just reinventing an actual database query in a much more convoluted way.

The Proper Solution

What we’re trying to do is simply not a good idea. Returning a partial object of some class, but having it declared as a full instance of that class, violates the basic rules of object-oriented programming. The whole point of a type system is to signal that a particular variable has particular members with valid values. Returning partials throws that out the window.

We can do much better in Typescript, whose type system is robust enough to let us define Partial<T> for any T, that will map every member of type M in T to a member of type M | undefined. That way, we’re at least signaling in the type system that we don’t have a full TopLevelData. But we still can’t signal which part of TopLevelData we have. The stuff we queried for becomes nullable even when it shouldn’t be, and we have to do null checks on everything.

Updating objects is equally painful with ORMs. We have to supply a TopLevelData instance to the database, which means we need to create a full one somehow. But we only want to update one or a few fields. How does the framework know what parts we’re trying to update? Combine this with the fact that part of the object may be missing because we didn’t query for all of it, and what should the framework do? Does it interpret those empty fields as instructions to clear the data from the database, or just ignore them?

I know Entity Framework tries to handle this by having the generated subclasses of your POD objects try to track what was done to them in code. But it’s way more complicated than just setting fields on instances and expecting it to work. And it’s a disaster with relations, especially x-to-many relations. I’ve never been able to get update statements to work without loading the entire relations, which it needs just so it can tell exactly how what I’m saving back is different from what’s already there. That’s ridiculous. I want to set a single cell on a row, and end up having to load an entire set of records from another table just so the framework can confirm that I didn’t change any of those relations?

Well, of course I do. If I’m adding three new records to a one-to-many relation, and removing one, then how do I tell EF this? For the additions, I can’t just make an entity where the property for this relationship is an array that contains the three added records. That’s telling EF those are now the only three related entities, and it would try to delete the rest. And I couldn’t tell it to remove any this way. The only thing I can do is load the current relations, then apply those changes (delete one, add three) to the loaded array and save it back. There’s no way to do this in an acceptably optimized fashion.

The conclusion is inescapable:

It is not correct to represent rows with relations in a database as objects in OOP

It should be fairly obvious, then, what we should be representing as objects in OOP:

We should represent queries as objects in OOP

Instead of writing a POD object to mirror the table structure, we should write POD objects to mirror query structures: objects that contain exactly and only the fields that a query SELECTed. Whether those fields came from a single table or were JOINed together doesn’t matter. The point is whatever array of data each database result has, we write a class that contains exactly those fields.

For example, if I need to grab the aNumber from a TopLevelData, the aDate from its related InnerData, and both yetAnotherNumber and oneLastString from its related InnerInnerData, I write the following class:

struct ThisOneQuery {

  let aNumber: Int
  let aDate: Date
  let oneLastString: String
}

This means we may have a lot more classes than if we just wrote one for each table. We might have dozens of carefully crafted queries, each returning slightly different combinations of data. Each one gets a class. That may sound like extra work, but it’s upfront work that saves work later, as is always the case with properly designing a type system. No more accidentally accessing or misinterpreting nulls because they weren’t part of the query.

We apply the same principal to modifying data. Whatever exact set of values a particular update statement needs, we make a class that contains all and only those fields, regardless of whether they come from a single table or get distributed to multiple tables. Again, we use the type system to signal to users of an update method on our Store class exactly what they need to provide, and what is going to be updated.

These query objects don’t need to be flat. They can be hierarchical and use reference semantics wherever it is helpful. We can shape them however we want, in whatever way makes it easiest to work with them. The rule is that every field is assigned a meaningful value, and nulls can only ever mean that something is null in the database.

Entity Framework does something interesting that approximates what I’m talking about here: when you do a Select on specific fields, the result is an anonymous type that contains only the fields you selected. This is exactly what we want. However, since the type is anonymous (it doesn’t have a name), you can’t return them as is. We still need to write those query result classes and give them a name, but this feature of Entity Framework will make it a lot easier to construct those objects out of database queries.

We can get similar functionality in C++ by using variadic templates, to write a database query method that returns a tuple<...> containing exactly the fields we asked for. In that case, it’s a named type and we can return it as is, but the type only indicates which types of fields, in what order, we asked for. The fields aren’t named. So we’d still want to explicitly define a class, presumably one we can reinterpret_cast that tuple<...> to.

The payoff of carefully crafting these query classes is that we get stronger coupling between what our model layers work with to drive the UI and what the UI actually needs, and looser coupling between the model layer and the exact details of how data is stored in a database. It’s always a good idea to let requirements drive design, including the database schema. Why create a schema that doesn’t correspond in a straightforward way to what the business logic in your models actually needs? But even if we do this, some decisions about how to split data across tables, whether to create intermediate objects (as is required for many-to-many relations), and so on may arise purely out of the mechanics of relational databases, and constitute essentially “implementation details” of efficiently and effectively storing the data.

Writing classes that mirror the table structure of a database needlessly forces the rest of the application to work with data in this same shape. You can instead start by writing the exact POD objects you’d most prefer to use to drive your business logic. Once you’ve crafted them, they signal what needs to be queried from the database. You have to write your SELECTs and JOINs so as to populate every field on these objects, and no more.

If, later, the UI or business logic requirements change, and this necessitates adding or removing a field to/from these query classes, your query method will no longer compile, guiding you toward updating the query appropriately. You get a nice, compiler-driven pipeline from business requirements to database queries, optimized out of the box to supply exactly what the business requirement need, wasting no time on unnecessary fetches.

This also guides how queries and statements are batched. Another problem with ORMs is that you can’t bundle fetches of multiple unrelated entities into a single query, because there’s no type representing the result. It would be a tuple, but only C++ has the machinery necessary to write a generic function that returns arbitrary tuples (variadic generics). You’re stuck having to make multiple separate trips to the database. This may be okay, or even preferred, in clients working with embedded databases, but wherever the database lives on another machine, each database statement is a network trip, and you want to batch those where possible.

By writing classes for queries, the class contain be a struct that contains the structs for each part of the data, however unrelated it is to other parts. With this we can hit the database once, retrieving all we need, and nothing extra, even if it constitutes multiple fetches of completely unrelated data. We can do the same with updates, although we could achieve the same in ORM with transactions.

Queries as classes also integrates very well with web APIs, especially if they follow a standard like json-api that supports partial objects. Anyone who’s tried writing the network glue for updating a few fields in an object whose class represents an entire backend database entity knows the awkwardness of having to decide either to inefficiently send the entire object every time, or come up with some way to represent partial objects. This could be straightforward in Typescript, where a Partial<T> would contain only what needs to be updated, but even there we can improve the situation with transaction objects because they signal what data is going to be updated. With queries, requesting specifically the fields needed translates straightforwardly into parsing the responses to the query objects, which contain the same fields as what was requested.

Conclusion

It turns out it not only is not necessary but wholly misguided to try to represent your database tables as classes in OOP code. That set of classes exists conceptually, as that’s exactly what the database is ultimately storing, but just because those classes conceptually exist doesn’t mean you need to code them. You may find it useful to write them purely to take advantage of the schema-specifying features of ORMs, but their usage should not go beyond this.

The actual interactions with the database, with data going in and out, don’t work in terms of entire rows with all their relations, but with carefully selected subsets. The solution we’re yearning for, that made us think an ORM might help, is in fact a rather different solution of representing individual queries as classes. Perhaps eventually a tool can be written that automates this with some type of code generation. Until then, I promise you’ll be much happier handwriting those query classes than you ever were working with entity classes.

Agile Design

Month: July 2022

What Should Your Entities Be?