Why GraphQL Performance Monitoring is Hard šŸ“ˆ

July 19, 2019 | 6 min read

This post is a preview of Production Ready GraphQL, a book I recently released that goes into great detail on building GraphQL servers at scale: https://book.productionreadygraphql.com

Measuring API performance is very important to get right, and GraphQL doesn't make it any less important. In fact, some of you may be using GraphQL precisely because of performance concerns, replacing multiple calls with a single GraphQL query. But are your GraphQL queries actually faster? This is why it's important to monitor them.

If you're used to endpoint-based HTTP APIs, you've probably done some of this monitoring successfully before. One of the most basic ways to monitor an API's performance is to measure response time, usually broken down per endpoint on a dashboard.
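If you've never wired that up, a minimal sketch could look like the Rack middleware below; `Metrics.timing` is a stand-in for whatever metrics client (StatsD, Datadog, etc.) you actually use.

# A minimal sketch of per-endpoint response time monitoring as a Rack
# middleware. `Metrics.timing` stands in for your real metrics client.
class ResponseTimeMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    status, headers, body = @app.call(env)
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

    # Tag the measurement with method + path, e.g. "POST /posts"
    Metrics.timing("api.response_time", elapsed_ms,
      tags: ["endpoint:#{env['REQUEST_METHOD']} #{env['PATH_INFO']}"])

    [status, headers, body]
  end
end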

Monitoring our APIs like this allows us to quickly catch the subset of the API that is having issues: we might notice, for example, that something is off with the POST /posts endpoint while everything else looks healthy. So our first instinct might be to approach a GraphQL API the same way.

Unfortunately, we quickly discover that measuring response time of a GraphQL endpoint gives us almost no insight into the health of our GraphQL API. Let's see why by looking at a super simplified example. Imagine you maintain a GraphQL API that currently serves a simple query:

query {
  viewer {
    name
    bestFriend {
      name
    }
  }
}

Now a new API client starts using our GraphQL API, but has different, much larger needs:

query {
  viewer {
    friends(first: 1000) {
      bestFriend {
        name
      }
    }
  }
}

Notice how the two query strings are not that far apart, yet the second one requests a thousand friends (and each of their best friends), while the first one only fetches a couple of objects. If you were monitoring the GraphQL endpoint as a whole, you would see your response time spike by a lot. And that would be accurate: responses really are slower, because we're simply serving more complex queries! But most of us aren't actually interested in that; we'd rather know whether performance degraded given the same usual load and use cases.

In fact, we are not interested in monitoring the *endpoint*, but the *queries*. We want to know if a query that a user used to run in 200ms now takes 500ms. If you're the maintainer of a private API, with a small set of known clients and a small set of queries, you could actually just do that: monitor the known queries individually. This is even easier with "persisted queries", since each query is already identified by a stable ID.
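For that small, known set of queries, a minimal sketch might look like this. It assumes a graphql-ruby style `MySchema.execute` and the same hypothetical `Metrics.timing` client; with persisted queries, the query ID becomes the natural key for the metric.

# Sketch: time each known query individually instead of the whole endpoint.
# Assumes a graphql-ruby style schema; `Metrics` is a hypothetical client.
def execute_monitored(query_string, operation_name: nil, variables: {}, query_id: nil)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  result = MySchema.execute(query_string, operation_name: operation_name, variables: variables)
  elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

  # With persisted queries, `query_id` identifies the query text, so the
  # metric's cardinality stays as small as the set of known queries.
  Metrics.timing("graphql.query_time", elapsed_ms, tags: ["query:#{query_id || operation_name}"])

  result
end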

However, if you're managing a public API, or a very large internal API with a larger set of clients and queries, this may not be easy to do, due to the high cardinality of queries. We must then find other ways.

Monitoring Per-Field

A common approach is to change our mindset from monitoring queries to monitoring resolvers, where individual fields are computed. We could then quickly see if a particular field is acting slow. There's a major problem that makes this approach very hard to use for accurately monitoring the performance of GraphQL APIs: lazy/batch loading (worth reading up on if you haven't heard of it before).

With dataloader-style libraries, the time a resolver takes to compute a field doesn't tell the whole story.

def my_resolver
  # Returns immediately with a promise; no actual loading happens here.
  load_one_million_things.then do |a_million_things|
    # This block only runs later, once the batch loader has fetched everything.
    a_million_things.map { |thing| thing.name }
  end
end

If we measured the timing of this resolve method, we would probably see it return very fast, and that would be true! Using lazy loading, the method asks to load a million objects but instantly returns a promise. A query using this field would probably be quite slow, but we wouldn't instantly know why: depending on which library you're using, maybe another field ended up kicking off the actual loading, or maybe it even happened outside of resolvers.

If you're using asynchronous loading or similar execution strategies, chances are monitoring only resolvers won't give you the data you're looking for! Not only this, but fields often behave differently based on a number of things, like which parent field they were selected on šŸ˜Ø.
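To make this concrete, here's a rough sketch of a tracer that times both the eager part of a field and its lazy resolution. The event names follow graphql-ruby's tracing API (exact names and payloads vary by library and version), and `Metrics` is again a stand-in.

# Rough sketch: time both the eager part of a resolver and the lazy part
# that runs later. Event names follow graphql-ruby's tracing API; adapt
# them to your library of choice.
module FieldTimingTracer
  def self.trace(key, data)
    case key
    when "execute_field", "execute_field_lazy"
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      result = yield
      elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000

      # "execute_field" mostly measures how long it takes to return a promise;
      # "execute_field_lazy" captures the time spent actually resolving it.
      Metrics.timing("graphql.field_time", elapsed_ms,
        tags: ["field:#{data[:field]&.path}", "phase:#{key}"])
      result
    else
      yield
    end
  end
end

# Registered on the schema with something like `tracer(FieldTimingTracer)`.

Even with the lazy phase instrumented, the cost of a batch load tends to get billed to whichever field happened to trigger it first, which is exactly the attribution problem described above.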

Alternative Approaches

As you can see, reliably monitoring the performance of a GraphQL server is quite a hard task, especially for a public API. There are other approaches we've been considering. One of them is to try to detect regressions across all possible queries. But you'll quickly notice that we need to account not only for variations in queries, but also in variables. Take a look at this query:

query ($num: Int) {
  friends(first: $num) {
    name
  }
}

We can probably affect the execution time of this query simply by passing a different value for the $num variable: getting only 1 friend is most likely much faster than getting 1000, for example.

Really, what we're looking for are regressions or anomalies for a given pair of query + variables. If fetching exactly 50 friends is getting slower and slower, there's probably a problem. I'm not aware of any tool that lets you do that at the moment, but it's something you could build yourself if you have the infrastructure for it: track the execution of these pairs over time, store the results, and analyze them for regressions.
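Here's a rough sketch of what that could look like; `QueryTimings` is hypothetical plumbing you'd replace with whatever storage and aggregation you actually have.

require "digest"
require "json"

# Sketch: key timings by a fingerprint of the query string + variables,
# then compare recent timings against a baseline. `QueryTimings` is a
# hypothetical store (a database table, a time-series system, ...).
def record_query_timing(query_string, variables, elapsed_ms)
  normalized = query_string.gsub(/\s+/, " ").strip
  fingerprint = Digest::SHA256.hexdigest(normalized + JSON.generate(variables.sort.to_h))
  QueryTimings.record(fingerprint: fingerprint, ms: elapsed_ms, at: Time.now)
end

# "Fetching exactly 50 friends used to take 200ms, now it takes 500ms."
def regressed?(fingerprint, factor: 1.5)
  baseline = QueryTimings.median(fingerprint, window: :last_month)
  recent   = QueryTimings.median(fingerprint, window: :last_day)
  recent > baseline * factor
end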

Anomaly Detection

I'm also really interested in how we can determine if the response time from a GraphQL endpoint was too slow for a particular query. Something like a slow query log could be built to help engineers find out which queries run problematically slow.

In fact, I've often wondered if we could find a relationship between query complexity and query execution time. If we find out that they are related in a particular way, we could in theory detect anomalies. For example, a low-complexity query with a high execution time may be a sign that we need to take a good look at the execution of the query.
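As a very rough illustration, a "slow for its complexity" check could look like the sketch below. Computing the complexity score is left to whatever complexity analyzer you use (graphql-ruby ships one, for example), and the per-point budget here is completely made up.

# Sketch of a "slow for its complexity" log. Assumes you already compute a
# complexity score per query and measure its execution time.
MS_PER_COMPLEXITY_POINT = 2.0 # made-up budget; tune it from your own data

def check_for_anomaly(query_string, complexity:, elapsed_ms:, logger:)
  expected_ms = complexity * MS_PER_COMPLEXITY_POINT
  return if elapsed_ms <= expected_ms

  # A low-complexity query taking a long time is worth a closer look.
  logger.warn(
    "Slow GraphQL query: #{elapsed_ms.round}ms for complexity #{complexity} " \
    "(expected ~#{expected_ms.round}ms)\n#{query_string}"
  )
end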

GraphQL monitoring at a large scale is really hard. Not many tools exist to help us with these problems right now, but I'm hopeful we'll keep exploring smarter ways to get insight from GraphQL execution performance. Stay tuned!

If you've enjoyed this post, you might like the Production Ready GraphQL book, which I recently released šŸ’œ It goes into much more depth on building GraphQL servers at scale: https://book.productionreadygraphql.com

Thanks for reading šŸ’š

