How to Organize Application SQL
Treat every data boundary as an API opportunity
If you’re writing SQL (or other query code), how does it live in your codebase? SQL is a delightfully boring technology. But it’s not obvious how to keep it organized. I’ve seen this done a handful of different ways and I’ve come up with a system that has proven pleasant, sturdy, and extensible.
TL;DR
- Declare each query as a method on a protobuf service: `db.proto`
- Implement each query in a single file: `src/sql/db/{service}/{query}.sql`
- Build exotic data infrastructure later, without refactoring application code
- meta: Treat every data boundary as an API opportunity
NOTE: I’m avoiding a lengthy discussion of when to use an off-the-shelf ORM versus writing your own SQL/data-layer. It’s a tough decision and a hard one to change later. Typically, an off-the-shelf ORM will generate SQL and let you move quicker at first. And it can be hard to see the benefits of rolling your own SQL/data-layer.
To see those benefits (and sketch how to get them), I’ve been sharing a draft of this article for years with teammates, friends, and companies I advise that are struggling with this decision. I was waiting to publish until I could share some more supporting tools. But instead of waiting, I’ve decided to publish as-is — I hope that it will help you better balance the tradeoffs for your project.
Why Bother?
If you’re writing the application SQL, the most straightforward way is to type it directly amidst the rest of your application code:
NOTE: In these first examples I’m using SQL in a Javascript codebase. Later I’ll share how this extends to other languages and query methods.
```js
db.query("SELECT episode_id, episode_name " +
         "FROM episodes " +
         "WHERE show_id = :show_id", {show_id: 'abc123'})
  .then(episodes => ...)
```
This approach has the virtue of minimal fuss: it’s all in one place and there aren’t many moving parts. But it has some problems:
- The query can’t be re-used somewhere else.
- It’s hard to automate lint, style, and other checks on the SQL.
- Most IDEs can’t help you write the inline SQL.
- Reading this requires someone to learn to read both Javascript and SQL.
- When you add or switch programming languages, the query will have to be rewritten and separately maintained.
- The input and result types are hard to decipher and they’ll have to be declared for each new application language.
- It’s hard to monitor this query in production.
- It’s hard to introduce caching or other data sources for the query.
Depending on which problem they hit first, creative developers have solved these problems in unique ways. I wanted a system that would solve most of them up front, even at the expense of some ceremony.
Declare the Database Services in Protobuf
In a file named `db.proto`, I declare all database queries as methods on protobuf services, including the input and output types:
```proto
// src/proto/db.proto
service Episodes {
  rpc FindEpisodesByShowId(ShowId) returns (stream Episode);
  ...
}

message Episode {
  string episode_id = 1;
  string episode_name = 2;
  ...
}

message ShowId {
  string show_id = 1;
}
```
And then implement the SQL:
```sql
-- src/sql/db/Episodes/FindEpisodesByShowId.sql
SELECT
  episode_id,
  episode_name,
  ...
FROM episodes
WHERE
  show_id = :show_id
ORDER BY
  published_at DESC
```
And then in the application I can call the method:
```js
// src/js/controller/EpisodeController.js
db.Episodes.findEpisodesByShowId({show_id: 'abc123'})
  .then(episodes => ...)
```
To do this, I need to implement a shim that adapts the generated service call to execute the corresponding SQL and yield results. But, importantly, I do not need to do this for each query — I only need to do it once. The only per-query overhead (besides implementing and using it) is declaring the `.proto` method for each query — that’s the ceremony that enables everything else.
Short-Term: Improved Experience Writing and Using SQL
This has some nice characteristics. It tightens the abstraction from the calling code. The application remains helpfully unaware of future refactoring. With the same declared method signature we can do lots of things:
- reading Javascript doesn’t require reading SQL
- use generated type definitions for all query input and outputs
- monitor performance or usage (with meaningful query names)
It also makes it possible for us to use and build tools to improve the experience of writing and maintaining the SQL:
- reading SQL doesn’t require reading Javascript
- IDEs can do magic when editing files named `.sql` with a known schema
- automated lint and style checks enforce SQL conventions
- automated checks to detect unused or missing SQL inputs and outputs
Long-Term: Easier Evolution of Data Layer
The coding experience improvements listed above kick in right away, even if you never change how you store and query your data. But over time, most applications do evolve. They fetch and store data in increasingly sophisticated ways: caching, load-balancing, new languages, new data services, etc.
The power of abstracting away your data services kicks in later. You’ll feel it when you start using the abstraction as the entry-point for high-leverage data infrastructure projects. Here are some examples.
Performance Monitoring and Logging
An early thing you can do with this approach is to start keeping an eye on the performance. Because each query has a meaningful name, it is much easier to see and talk about the places that are causing problems.
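Because the shim is a single choke point, per-query timing can be added in one place. A sketch, where `report` is a stand-in for whatever metrics client you use (statsd, Prometheus, plain logs):

```javascript
// Wrap a query function so every call reports its duration under a
// meaningful name like "db.Episodes.FindEpisodesByShowId".
function withTiming(queryName, queryFn, report) {
  return async (params) => {
    const start = Date.now();
    try {
      return await queryFn(params);
    } finally {
      // Reported even when the query throws, so failures are visible too.
      report(queryName, Date.now() - start);
    }
  };
}
```

Applied inside the shim, this instruments every query in the codebase at once, with names that match the `.proto` declarations.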
Caching
One way to mitigate those slower queries is to cache the results. Caching is easy to screw up. Doing one-off caching logic within the application code can introduce divergent conventions that tend to cause confusion and mistakes. But with the data service abstraction in place, I can carefully implement consistent caching conventions once. Then I can use the `.proto` file as the place where select queries declare their cache configuration.
```proto
service Episodes {
  rpc FindEpisodesByShowId(ShowId) returns (stream Episode) {
    option (db_cache_entry_count) = 100;
    option (db_cache_max_age_ms) = 600000;
  }
}
```
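The shim side of that convention can be sketched as a small wrapper. This assumes those `db_cache_*` options (which are hypothetical custom options) have been compiled into a per-query config object:

```javascript
// Wrap a query function with a small TTL cache keyed by the serialized input.
// `cacheConfig` mirrors the hypothetical db_cache_* proto options above.
// `now` is injectable for testing.
function withCache(queryFn, {entryCount, maxAgeMs}, now = Date.now) {
  const cache = new Map(); // key -> {at, value}
  return async (params) => {
    const key = JSON.stringify(params);
    const hit = cache.get(key);
    if (hit && now() - hit.at < maxAgeMs) return hit.value;
    const value = await queryFn(params);
    cache.delete(key); // refresh insertion order before re-inserting
    cache.set(key, {at: now(), value});
    if (cache.size > entryCount) {
      // Evict the oldest entry (Maps iterate in insertion order).
      cache.delete(cache.keys().next().value);
    }
    return value;
  };
}
```

Because the wrapper lives in the shim, every query declared with cache options gets the same eviction and expiry behavior, instead of each call site inventing its own.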
New Data Stores
Sometimes caching is not enough. And sometimes, for other reasons, I’ll need to introduce another external service or storage system. Refactoring an application to use new storage can be a gnarly task. With the data service abstraction it gets easier. At minimum you have a hit-list of the queries you’ll need to convert. But even better, you can use the abstraction to adapt the data storage layer transparently: declare which queries use the new data source and implement the corresponding shim.
For example, imagine I want to start using Cassandra for some of my largest and ever-growing tables. I start by declaring that the service for that table uses a Cassandra (CQL) database:
```proto
service Episodes {
  option (db_type) = "cql";
  rpc FindEpisodesByShowId(ShowId) returns (stream Episode);
}
```
And then the queries against the new data store have a predictable home in the codebase:
```sql
-- src/cql/db/Episodes/FindEpisodesByShowId.cql
SELECT
  episode_id,
  episode_name,
  ...
```
After I implement the corresponding shim to execute the CQL then the entire application maintains a unified database access layer. The application doesn’t need to know what type of database it is hitting. And it will use Cassandra when appropriate without me having to refactor any existing application calls.
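The dispatch logic in the shim stays simple. A sketch, assuming each service's compiled options expose the (hypothetical) `db_type` option and that you have one executor per backend:

```javascript
// Route a service's queries to the right backend based on its declared
// db_type option. The option name and executor keys are illustrative.
function makeExecutorPicker(executors /* e.g. {sql: mysqlExec, cql: cassandraExec} */) {
  return (serviceOptions) => {
    // Services that declare no db_type default to the primary SQL database.
    const dbType = (serviceOptions && serviceOptions.db_type) || 'sql';
    const executor = executors[dbType];
    if (!executor) throw new Error(`no executor for db_type "${dbType}"`);
    return executor;
  };
}
```

Flipping a service from MySQL to Cassandra is then a one-line `.proto` change plus the new query files, with no edits at any call site.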
Supporting Other Languages
What happens when I want to spin up another application that hits the same database? Or what if the new application is to be maintained by a team of Java or Python experts, or they need access to libraries only available in those languages? The `.proto` data service abstraction makes it straightforward to use the same queries across projects — all that’s needed is to implement the shim for that language.
```js
// js
db.Episodes.findEpisodesByShowId({show_id: 'abc123'})
  .then(episodes => ...)
```

```java
// java
dbEpisodes.findEpisodesByShowId(_ -> _.showId = "abc123")
  .thenApply(episodes -> ...);
```

```python
# py
show_id = ShowId(show_id="abc123")
episodes = await db_episodes.FindEpisodesByShowId(show_id)
```
New applications can get off the ground much quicker. And this increases the leverage of query optimization work, since the gains will be shared across multiple codebases. Used this way, the data abstraction becomes a shared API to the database.
How it Works
The first step of this approach is implementing the shims — the code that implements the abstraction for the particular language and database. Design the generated interfaces to fit your or your team’s conventions and style — the generated clients should feel familiar and convenient within your codebase. Design your `.proto` declarations to be expressive but easy to grok — be opinionated and consistent here. Consider automated enforcement of naming conventions.
It’s helpful to tailor the implementation to your team/conventions. I’ve implemented this pattern many times now for different combinations of language (js/java/py/go) and datastores (mysql/cassandra/bigquery/grpc) and teams. Unfortunately, none of my implementations are mine to share. Time permitting, I’ll share more complete examples and supporting tools.
Zooming Out
This approach to organizing application SQL is an instance of a broader strategy: treat every data boundary as an API opportunity. Wherever data is flowing from one language/process/service into another, take a hard look at the interface: that is a candidate for a well-defined API.
treat every data boundary as an API opportunity
Consider declaring the types explicitly using something like `.proto`, thus preparing a sharp cut-point for clean refactoring on either side of that typed interface.
Not Just Traditional RPC APIs
Tools like gRPC have made `.proto`-defined APIs familiar for traditional client/server APIs:
```proto
service Api {
  rpc FetchFoo(FooId) returns (Foo);
}
```

```js
api.fetchFoo({fooId});
```
But once you get comfortable writing your own shim layers, they can be useful in lots of places. I’ve seen it work well for things like triggering tasks, rendering speech dialog, calibrating ML models, exposing flutter plugins, or sending emails and push notifications. For example here’s how you might declare your email sender:
```proto
// src/proto/email.proto
service Email {
  rpc Welcome(WelcomeInput) returns (EmailPieces);
  rpc Farewell(FarewellInput) returns (EmailPieces);
  ...
}
```

```handlebars
// src/handlebars/Email/Welcome.mjml.handlebars
<mjml>... Hello {user_name} ...</mjml>

// src/handlebars/Email/Farewell.mjml.handlebars
<mjml>... Farewell {user_name} ...</mjml>
```

```js
// src/js/app.js
email.welcome({user_name}).sendTo({user_id});
...
email.farewell({user_name}).sendTo({user_id});
```
Some things to note:
- a newcomer can guess how to add a new email
- a designer can edit the `mjml` template instead of `js`
- you can move the rendering/sending logic to an external process without touching the calling code
- you can declare per-email configuration (A/B tests, logging, priority, etc) right inline in the `.proto` file as options under each `rpc`
But Not Always
It’s not always a good idea to take this approach. Most notably, if your team vomits when they see `.proto`, then it may be a nonstarter. Or if you’re committed to a single language/process and there are really strong native libraries for the same abstraction (e.g. committing to Python and using SQLAlchemy). Or when there are too few call patterns (e.g. if you’ll only ever have 1 or 2 email templates), the overhead of proto declarations won’t be worth it. Nonetheless, I recommend thinking through each data boundary as a possible candidate.
I hope this gives you some good ideas for organizing your SQL. And more generally, I hope this helps as you think about preparing your data boundaries to evolve.