Data vs Use Cases

If you’ve been following my posts over the past months and even years, you know how important I think it is to design well-crafted GraphQL schemas that express real use cases. When in doubt, designing a GraphQL schema for behaviors instead of for data has been my go-to rule. An older post on “Anemic Mutations” explains my rationale behind that principle.

While this usually leads to a great experience when consuming an API, there are really two pretty distinct uses for an API. The first one is the more use-case-driven one. Imagine an application using the GitHub GraphQL API: the schema our API is currently exposing is perfect for an application looking to interact with the GitHub domain and consume common use cases: think opening an issue, commenting, merging a pull request, listing branches, paginating pull requests, etc.

However, certain clients, when seeing GraphQL’s syntax and features, see great potential for getting exactly the data they require out of GitHub. After all, the “one GraphQL query to fetch exactly what you need“ thing we hear so often may be interpreted that way. Take for example a comment analysis application that needs to sync all comments for all issues every 5 minutes. GraphQL sounds like a great fit for this: craft the query for the data you need, send one request, and profit. However, we hit certain problems pretty quickly:

  • Pagination: the GitHub API was designed with snappier, use-case-driven clients in mind. We assume these kinds of clients probably don’t want to fetch a very large number of comments in a single request, but rather recreate something a bit like GitHub’s own UI (the sketch right after this list shows what that means for a client trying to sync everything).
  • Timeouts: we’re pretty aggressive with GraphQL query timeouts with our API at GitHub; we don’t want to let a gigantic GraphQL query run for too long. However, purely data-driven clients might need to make pretty large queries (one query only, yay) to achieve their goals. Even though it’s a valid use case and not an abuse scenario, there is quite a high chance queries could time out if they query hundreds, even thousands, of records.
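
To make the pagination point concrete, here is a rough sketch of what a “sync all comments” client ends up doing against a use-case driven schema. It is written in TypeScript with a hypothetical helper, and the field selections are only loosely modeled on GitHub’s public schema, so treat it as illustrative rather than a verified query:

async function syncAllComments(owner: string, name: string, token: string) {
  let issuesCursor: string | null = null;
  do {
    // Page through issues 100 at a time: the server-imposed page size means
    // one round trip per page, at every level of nesting.
    const data: any = await githubGraphql(token, `
      query($owner: String!, $name: String!, $after: String) {
        repository(owner: $owner, name: $name) {
          issues(first: 100, after: $after) {
            pageInfo { hasNextPage endCursor }
            nodes { number }
          }
        }
      }`, { owner, name, after: issuesCursor });

    const issues = data.repository.issues;
    for (const issue of issues.nodes) {
      // ...and the same dance again, per issue, for comments(first: 100, after: ...).
    }
    issuesCursor = issues.pageInfo.hasNextPage ? issues.pageInfo.endCursor : null;
  } while (issuesCursor);
}

// Minimal GraphQL-over-HTTP helper, assuming the standard POST endpoint.
async function githubGraphql(token: string, query: string, variables: object) {
  const response = await fetch("https://api.github.com/graphql", {
    method: "POST",
    headers: { Authorization: `bearer ${token}`, "Content-Type": "application/json" },
    body: JSON.stringify({ query, variables }),
  });
  return ((await response.json()) as any).data;
}

Every level of nesting multiplies either the number of round trips or the size of a single query, which is exactly where the timeouts mentioned above start to bite.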

Pretty hard to deal with, right? On one hand, clients with a purely data-driven use case may have a legitimate reason for it, but on the other hand, our GraphQL API is not designed (nor should it be) for this purpose. In fact, it’s not a problem unique to GraphQL. Most APIs out there today are built mostly for use-case-driven clients, and are hard to deal with when you want to sync a large amount of data (large GraphQL requests, batched HTTP requests, or tons of HTTP requests).

So what can we do? Let’s explore a few options.


Ship a new data-driven schema

One option could be to expose a totally new GraphQL endpoint/schema for more data-driven use cases: get all issues and their comments without pagination, with batch-loaded types and resolvers. This could possibly even live in the same schema as different fields, but since they’re such different use cases, I can see them being in a completely different schema. The timeout problem still might be hard to solve however, because these kinds of use cases often aren’t simple to consume synchronously. So what if… we could run queries asynchronously instead?
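
As a sketch of what that separate schema could look like, here is a minimal and entirely hypothetical SDL built with graphql-js; none of these type or field names come from GitHub’s actual API. The idea is plural fields that return complete collections, no connections or cursors, with resolvers expected to batch-load records instead of resolving them one by one:

import { buildSchema } from "graphql";

// Hypothetical data-oriented schema: no pagination, fields return everything,
// and the resolvers behind them bulk-load from storage in as few queries as possible.
const dataSchema = buildSchema(`
  type Query {
    issues: [Issue!]!
  }

  type Issue {
    number: Int!
    comments: [Comment!]!
  }

  type Comment {
    id: ID!
    bodyText: String!
  }
`);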

Asynchronous GraphQL Jobs

POST /async_graphql

{
  allTheThings {
    andEvenMore {
      things
    }
  }
}

202 ACCEPTED
Location: /async_graphql/HS3HlKN76EI5es7qSTHNmA

Then, clients can poll on that link until the result is ready:

GET /async_graphql/HS3HlKN76EI5es7qSTHNmA
202 ACCEPTED
Location: /async_graphql/HS3HlKN76EI5es7qSTHNmA

GET /async_graphql/HS3HlKN76EI5es7qSTHNmA
{ "data": { ... } }

The cool thing is these are actually very similar to the concept of persisted queries. Applications could register GraphQL queries, and run them on a schedule, or asynchronously whenever they want. Again, this is not necessarily specific to GraphQL. Something I found out about recently is Stripe’s Sigma, with scheduled queries: a SQL-powered tool to create reports / extract data out of Stripe for customers/integrators.

Streaming

Darrel Miller and I met for ☕️ recently and we talked a bit about this problem. One thing he mentioned is that an event stream would actually be great for clients who only care about data. Integrators can then keep things in sync / analyze data however they want. This really resonated with me. If an API client really only cares about raw data and not so much about a business/use-case oriented API, then they might as well connect to some kind of data firehose. The Twitter PowerTrack API is a good example of this, allowing (privileged/enterprise) clients to consume 100% of Twitter’s tweet data.

Best of both worlds?

Maybe a mix of both is what we’re looking for? Register a GraphQL query to filter a firehose of data events, using subscriptions and a separate, more data-oriented schema:

subscription Comments {
  comments(pullRequests: [...]) {
    comment {
      id
      bodyHTML
      author {
        name
      }
    }
  }
}

Here the comment field is populated with batches of comments as they come. Connect to a long-lived HTTP connection and get a stream of events with the data matched by our GraphQL subscription. This way we push data at our own rhythm, but clients can still filter payloads using GraphQL. Of course, this requires clients to be smarter and to build more resilient client-side applications: retries, re-connects, connections dropping, consuming with multiple connections, etc. The server also needs to build resiliency features (last_message_received, replay_since, etc.).
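
Here is a rough sketch of what that client-side resiliency could look like, assuming a server-sent-events style endpoint and a replay_since parameter like the ones mentioned above (both entirely hypothetical), using the browser’s EventSource API:

function consumeCommentStream(baseUrl: string, onBatch: (comments: unknown[]) => void) {
  // Remember the last event we saw so that a reconnect can ask the server
  // to replay anything we missed while the connection was down.
  let lastMessageReceived: string | null = null;

  function connect() {
    const url = new URL("/streams/comments", baseUrl);
    if (lastMessageReceived) url.searchParams.set("replay_since", lastMessageReceived);

    const source = new EventSource(url.toString());
    source.onmessage = (event) => {
      lastMessageReceived = event.lastEventId;
      onBatch(JSON.parse(event.data).comments);
    };
    source.onerror = () => {
      // Connection dropped: close it and reconnect, replaying from the last cursor.
      source.close();
      setTimeout(connect, 1_000);
    };
  }

  connect();
}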


Food for thought 🌮💭. Let me know what you think about the tension between these two pretty distinct use cases. I’m pretty excited to see how we can solve these issues; there are a lot of things to try out there!

Thanks for reading 🚀

Enjoyed the post? Subscribe to Production Ready GraphQL!