On GraphQL-to-SQL

May 21, 2020 | 7 min read

GraphQL has a reputation for its N+1 problem which can often happen when implemented naively. This leads to a lot of us trying to solve the issue of data fetching with GraphQL in the most efficient way possible. Besides the popular Dataloader approach, another very common way of tackling this problem is by taking a GraphQL query, and coming up with the exact SQL needed to resolve it:

// example from Join Monster
{                           SELECT                  
  user(id: 1) {               "user"."id",          
    idEncoded                 "user"."first_name",  
    fullName        ==>       "user"."last_name",    
    email                     "user"."email_address"
  }                         FROM "accounts" AS "user"
}                           WHERE "user"."id" = 1

Looking at this query, this is the absolute most efficient way of getting the data for it. This is a really interesting idea, and it's not surprising many people are aiming for something like this 👇

All these tools are slightly different and the core problem they solve is not all the same. However, they all rely a lot on compiling a GraphQL AST into a SQL statement that will hopefully be more performant than the combined SQL statements that would be executed from a naive GraphQL implementation. I see a lot of effort towards those tools, but they never really seemed to be able to solve the GraphQL data fetching issues I see day-to-day.

Some are tied to a particular style of development

Hasura and PostGraphile are really solid products, and are different from just a GraphQL-to-SQL in the sense that they aim to be a complete engine for building applications. In both these tools the database is what drives a lot of the application logic, although Hasura allows you to "stitch" your own custom resolvers with the database-backed schema and PostGraphile handles custom use cases with PostgreSQL Functions.

Honestly, both these tools are an amazing fit to get an "instant GraphQL API", especially when you don't want to deal with maintaining your own GraphQL server. That's the tradeoff here. However, for existing server applications interested in offering a GraphQL interface, it's less of the sweet spot. Same thing with Super Graph. It takes an existing database, introspects it, and generates a full GraphQL schema for it, including sortings and orderings. Useful if you want to scaffold something quickly, but definitely not that useful for large existing codebases with a lot of logic outside of the database.

Coupling

Other tools like Join Monster and SqlMancer take a slightly different approach. They don't necessarily need to introspect an existing database, decoupling the GraphQL schema definition from the database schema definition.

Tools like Join Monster (and Prisma Tools to a certain degree) need to somehow look at a GraphQL query, and decide what to fetch from the database.

For Join Monster, a requirement is that our object types map to database tables:

const User = new GraphQLObjectType({
  name: 'User',
  sqlTable: 'accounts'
  uniqueKey: 'id',
  fields: () => ({ /*...*/ })
})

The biggest hurdle for these tools is going beyond a 1:1 mapping between GraphQL fields and SQL columns. For example, how would you handle a GraphQL field name, which actually relies on logic concatenating firstName and lastName? It would be very hard to make that happen magically, so Join Monster for example allows you to configure this behavior:

const User = new GraphQLObjectType({
  //...
  fields: () => ({
    fullName: {
      description: 'A user\'s first and last name',
      type: GraphQLString,
      // perhaps there is no 1-to-1 mapping of field to column
      // this field depends on multiple columns
      sqlDeps: [ 'first_name', 'last_name' ],
      resolve: user => `${user.first_name} ${user.last_name}`
    }
  })
})

The Join Monster API comes with quite a few constraints. Our types must map to database tables and we must know the SQL dependencies of our fields. Imagine the field PullRequest.mergeable on GitHub's GraphQL API. We can imagine this field probably needs to load data coming from git, data coming from various "checks" running on the PullRequest, data coming from different pull request reviews. Not only is that already more than a single table, but it's likely to evolve with more data needed overtime. This makes it really hard to come up with a sqlDeps configuration that would not be brittle.

It's quite difficult for an API as complex as GitHub, or really any really large application to be able to provide the exact database requirements to fulfill a use case. Sometimes data is loaded deep within existing logic, wrapped in conditionals. Not only that, but this also doesn't take into consideration application caching, and fields that make calls to other data sources or other services. Don't get me wrong, this is also really challenging with a lazy loading approach. But at least we don't have to maintain configuration for SQL at our interface layer, which does seem better to me.

Beware of tools that make this seem too easy. Join Monster is really good at what it does, but as you can see requires complex configuration to make things work outside the happy path. If the library doesn't have these configurations and only relies on naming conventions, either it's an auto-generated GraphQL API from a database schema, or it's probably too naive to handle more complex cases (Or maybe it has some AI to analyze your data dependencies, who knows 🙊).

Is one large query always better?

Is generating a giant SQL statement always faster than a few smaller SQL statements? That's far from a new question and is obviously very hard to answer generically like this. However, we can certainly say it is not always the case. For questions like these, I always read up High Performance MySQL. I don't want to spoil the book here, but it does highlight many reasons why a giant SQL query is not always the better choice. A lot of applications often need to try "join decomposition". This gives us potentially a few advantages:

  • More potential for cache hits
  • Sometimes reduces lock contention
  • IN() queries can sometimes be faster than complex JOINS
  • And a lot more things that are explained in that book!

It's probably possible for some GraphQL-to-SQL tools to be smart enough to know how to do this, but it definitely is more of a black box than optimizing your dataloaders. All that to say that database performance is not always as simple as making one large SQL statement.

Data fetching remains something to improve

While I would not suggest most of these tools for those who have existing and complex code bases, I do think there is room for a lot of improvement with GraphQL execution. While GraphQL is datastore agnostic, the nature of its execution does impose certain constraints on how we structure application logic. Ideally, things would just work TM. Prisma client is exciting on that side of things, if it keeps getting better and better. Maybe another solution we'll see is some kind of proxy a la ProxySQL where the application can use the database without worrying about GraphQL execution, and somehow the proxy making sense of queries and making for performant ones.

If you're interested in bootstrapping a new API and you're comfortable with using the database server as the center-piece, I can only recommend Hasura and Postgraphile, they're both great. For existing codebases, and especially for large/complex ones, I'm sticking with the recommendation of using a Dataloader/Lazy/Asynchronous loading approach for GraphQL APIs, over tools that aim for ahead-of-time SQL generation.

If you've used some of these tools on existing complex codebases, I'd love to hear from you.

If you've enjoyed this post, you might like the Production Ready GraphQL book, which I have just released!

Thanks for reading 💚

Sign up for my newsletter

Stay up to date when I release courses, posts, and anything related to GraphQL

No spam, just great GraphQL content!

© 2020 MYUL Digital, Inc. All rights reserved.