I’m not a data scientist, but I don’t like being misled by numbers. Sometimes the lies are intentional, but very often they’re not. Having all the data you need readily available doesn’t guarantee you’ll get a true picture of what that data means. It all depends, of course, on what you do with it.
Have you ever heard this classic statistics joke:
Bill Gates walks into a pub. Instantly, the average wealth of each person in the pub is 45 million dollars.
I’ve never been in a pub with Bill Gates, but I imagine everyone else there would join me in disagreeing with that statement.
So today, instead of complaining too much about why averages are bad, I’ll give you a gentle introduction to how to use percentiles instead, and how to apply them in your work.
But really, averages are bad
There are plenty of books and articles written on this very topic, and I don’t claim to know more than any of the information I sourced to write this. But let’s start by looking at what averages represent, and how data works in real life.
The primary purpose of averages is to measure changes over time within the same sample group or cohort. The most common errors come from misapplying averages to other purposes.
Most sets of data have outliers. More often than not, extreme outliers. Let’s see an initial example:
Let’s gather 1% of the population of London and ask them to run 10km on a track. Depending on the upper and lower age bounds of the sample, we’ll have clusters of outliers at the top and at the bottom of the list (say, the 550 athletes who happen to do track every week, and the 1500 elderly folks well over 90 years old who need assistance to walk those 10km).
- 550 people will run the 10k in under 37 minutes.
- 1500 people will run the 10k in over 2h30.
In a graph of these averages, we may not even see the outliers plotted, but they skew the data heavily in both directions, suggesting the data points cluster around a value that doesn’t represent where they really cluster. If the average 10k time you got was 58 minutes, that isn’t representative of the experience of the individuals who ran: those 1500 elderly folks certainly didn’t have a “scientifically average” experience running a 10k, did they? Averages mislead, and they often hide where the biggest pain points in a dataset are.
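To make the skew concrete, here’s a minimal sketch in Python with made-up finish times in minutes (illustrative numbers, not the London survey above): the mean lands far from where most runners actually finished, while the median stays inside the cluster.

```python
# Made-up 10k finish times in minutes: a typical cluster plus a few
# extreme outliers (fast athletes and assisted walkers).
typical = [52, 55, 57, 58, 60, 62]
outliers = [33, 35, 170, 180]

times = sorted(typical + outliers)
mean = sum(times) / len(times)

# Median: the middle of the sorted list (even count: average the middle two).
mid = len(times) // 2
median = (times[mid - 1] + times[mid]) / 2

print(f"mean: {mean} min, median: {median} min")  # mean: 76.2 min, median: 57.5 min
```

Four outliers out of ten points were enough to push the mean almost 20 minutes past the typical runner.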
Percentiles are a fantastic way to avoid this skew, because they give you several metrics, each representative of an actual slice of the data.
You’ll see percentiles in charts labelled with the letter p followed by a number; typically p10, p50 and p90. But what are these?
Let’s see an actual engineering example: measuring page load times for performance metrics. You and your team start capturing how long your product page takes to load, and you get the following values (in milliseconds):
[120, 120, 134, 155, 300, 867, 980, 1800]
The average page load time for the set above is roughly 559ms, but as you can see, nobody actually had a 559ms experience: some users loaded the page very fast, others extremely slowly. Again, that average is misleading and unhelpful.
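If you want to verify that figure, the arithmetic is a one-liner (Python here, but any language will do):

```python
# The page load dataset from above, in milliseconds.
load_times_ms = [120, 120, 134, 155, 300, 867, 980, 1800]

average = sum(load_times_ms) / len(load_times_ms)
print(average)  # 559.5, i.e. the ~559ms average
```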
You probably already know how to calculate a p50 percentile, because it’s really just the median: you sort the data in ascending order (like our set above) and discard the bottom 50% of your points; the next remaining value is your p50. So above, we throw away half of the values and we’re left with:
Original dataset: [120, 120, 134, 155, 300, 867, 980, 1800]
- p50 values = [ ̶1̶2̶0̶,̶ ̶1̶2̶0̶,̶ ̶1̶3̶4̶,̶ ̶1̶5̶5̶, 300, 867, 980, 1800]
- p90 values = [ ̶1̶2̶0̶,̶ ̶1̶2̶0̶,̶ ̶1̶3̶4̶,̶ ̶1̶5̶5̶,̶ ̶3̶0̶0̶,̶ ̶8̶6̶7̶,̶ ̶9̶8̶0̶, 1800]
The p50 value for that dataset is 300ms, the p90 is 1800ms, and the p10 is 120ms.
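The discard procedure above is easy to code up. Here’s a small sketch in Python; the function name and the clamping at the end are my own choices, not from any library:

```python
# Discard-based percentile, as described above: sort ascending, drop
# the bottom p percent of the points, and take the next remaining value.
def percentile(values, p):
    ordered = sorted(values)
    cut = int(len(ordered) * p / 100)          # how many points to discard
    return ordered[min(cut, len(ordered) - 1)]  # clamp so p100 stays in range

load_times_ms = [120, 120, 134, 155, 300, 867, 980, 1800]
print(percentile(load_times_ms, 10))  # 120
print(percentile(load_times_ms, 50))  # 300
print(percentile(load_times_ms, 90))  # 1800
```

Note that real libraries usually interpolate between neighbouring values rather than picking one entry, so their p90 may differ slightly from this discard method.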
What percentiles tell us
Imagine you’re feeling pretty good about your 559ms average. One morning, a few reports come in from customer support warning your team about user frustration: some users are experiencing page load times of almost 2 seconds. It’s easy to dismiss them, because after all your average is well under 600ms. But what the percentile calculation gives you is the knowledge that 90% of your users load the page in under 1.8 seconds, while the slowest 10% aren’t happy at all.
Using percentiles gives us much more valuable information. Instead of focusing on averages, example performance goals for your team can now be:
- Make the p90 of page loads lower than the worst-case scenario of 1000ms.
- Bring the p50 of page loads below 500ms.
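As a sketch of how such goals might be checked automatically, here’s a hypothetical helper built on the discard-based percentile method described earlier (all names and thresholds are illustrative, not from any tool):

```python
# Discard-based percentile: sort ascending, drop the bottom p percent,
# take the next remaining value.
def percentile(values, p):
    ordered = sorted(values)
    cut = int(len(ordered) * p / 100)
    return ordered[min(cut, len(ordered) - 1)]

# Hypothetical goal check: p90 under 1000ms and p50 under 500ms.
def meets_goals(load_times_ms):
    return percentile(load_times_ms, 90) < 1000 and percentile(load_times_ms, 50) < 500

print(meets_goals([120, 120, 134, 155, 300, 867, 980, 1800]))  # False: the p90 is 1800ms
```

A check like this could gate a deploy or feed an alert, which is exactly the kind of decision an average would have waved through.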
In some situations, you may really only be interested in extremes like the p99 or p99.9 percentiles, provided your dataset is large enough to make them meaningful.
Downsides to all this
While there are frameworks and tools to help you calculate these values over time, the calculations tend to be computationally heavy and demand extra resources, both in servers and in costs. If you’re saving every piece of information (page load time, request time, etc.), you’ll have a ton of data on your hands. Most tools let you tweak the memory-versus-accuracy trade-off, because at scale percentiles are fundamentally expensive approximations.
If you have terabytes of data, sorting it all and calculating the p95 like:
data_set[floor(count(data_set) * 0.95)]
… is expensive, and you shouldn’t build your own tools from scratch for this. Tools like Elasticsearch or Datadog will get you started relatively quickly. For small datasets, something as simple as the percentile package in Node.js will give you a head start. Even good old Excel will give you these values easily.
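For illustration, here’s that naive exact calculation made runnable in Python; it’s fine for data that fits in memory, and exactly the thing you’d hand off to a tool at scale:

```python
# Naive exact p95: sort everything, then index past the bottom 95%.
# O(n log n) and needs all the data in memory, which is why large-scale
# systems use streaming approximations (e.g. t-digest) instead.
def naive_p95(values):
    ordered = sorted(values)
    index = int(len(ordered) * 0.95)
    return ordered[min(index, len(ordered) - 1)]

print(naive_p95(range(1, 101)))  # 96
```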