N+1 Problem with ORM

3 minute read

Updated:

I was helping my colleague with an issue we saw in a slow API endpoint. We wanted to increase the number of days of data we returned, from 2 to 30. The expected amount of records was in order of magnitude of 200-400 records, joined across 4 small tables (1000s of records).

This was expected to be very fast but when we increased the limit to 30 days, the endpoint slowed to a halt. What happened?

Problem

The first thing we noticed was that there were hundreds of quick SELECT queries. Each was very fast but the sheer volume of each individual execution added up.

Turns out this is a very common symptom when working with ORM. It’s called the N+1 problem and is a ORM mis-configuration. This post illustrates an example of the problem.

ORM

First stop, what is an ORM?

An ORM is an Object Relational Mapping, which is how you map records in a database to objects in an OOP program. You setup the model in the ORM software and it will set up the mappings. When you update the property of a object, that will automatically result in an UPDATE SQL statement. If you need to get a nested object or attribute, it will automatically query the db to get the information to instantiate the instance.

This is all fine and dandy, as it lets a developer write OOP and not need to think about the details of the underlying data store. Instead of a database, this can even be a stream of data to a remote server! Or maybe it’s JSON, written to disk.

Unfortunately, it’s not always that easy. A misconfigured setting in the modeling or query access patterns can wreak havoc and this means developers need to be somewhat aware of the potential consequences of their code.

N + 1 Problem

The N+1 problem is a side effect of lazy-loading a collection of objects. The sqlalchemy docs have a good write up on this.

Lazy Loading

Lazy loading is where some attributes are not necessarily loaded by default, as this would require querying extra tables. If the code doesn’t use that attribute throughout the its life cycle, this means we paid upfront the cost of querying and memory overhead.

However, if we’re going to access the attribute for many items in a collection, the ORM is not smart enough to know this apriori: without this knowledge, it will naively retrieve the attribute one-at-a-time, which results in many SELECT queries. With apriori knowledge, it would bulk retrieve values, which is much faster. Imagine searching a list in an O(n^2) manner instead of doing set intersection (O(n)).

Eager Loading

The other side of the coin is eager-loading. When the model is configured this way, every access to the instance will automatically be sure to retrieve the information for the nested attribute.

The trade-off is initial upfront query cost and memory overhead vs. amortized cost. The default behaviour in many ORMs is to lazy-load. It’s a very sensible default, to only get values as-needed.

This means that as a developer, you need to be aware of when your access pattern does not match this and you need to preemptively eager-load as a performance optimization.

Gotchas

The first gotcha I notice is that previous devs misunderstood the correct solution. They set all the relationship loading to be eager. This meant that they unknowingly caused increases in memory use and query times. Without looking into the code further, it’s possible that some code depends on this behaviour but my hunch is that we’ve been paying upfront cost, instead of amortizing it.

This gotcha actually caused us more headaches because the query was still pretty slow. It’s a lot faster than before but it’s not what I would call ideal. I would expect the endpoint to take 500ms but it was taking 2 seconds. This is better than timing out at 10 seconds though.

If we were to go back to lazy-loading everything, and then selectively eager-loading on as-needed basis, I suspect we could hit these speed targets.