As machine learning and AI grow in popularity, I have recently been in several discussions with people who enthusiastically described to me how they or their team implemented a machine learning model, or ‘used AI’, in their business to quickly and easily model a complex scenario with high accuracy. I am delighted to see the passion for adopting machine learning methods, and I was initially quite impressed at just how effective their models sounded. On reflection afterwards though, I realised that while some of the models may indeed have been amazing, there was a good chance that some people had become enamoured with models which didn’t necessarily add any value at all. High accuracy alone didn’t actually mean anything much in the scenarios they had described.
All the people I refer to were in mid-level or senior positions, with a keen interest in computing or technology, but without a mathematical or analytical background. Given the huge degree of hype that currently exists around AI / ML, I am keen to see that decision makers have a basic understanding of how to evaluate a good machine learning model, and what questions to ask.
In particular, I realised that the term ‘accuracy’ can be very misleading.
Each case had two common themes (which together describe a huge set of common real-world scenarios):
- The problems were ones of binary classification, i.e. they classified an outcome variable as being in one of two opposing states. Examples include fraudulent vs legitimate transactions; cancerous vs normal cells; churning vs retained customers; faulty vs good batches of products.
- One of the two outcomes was much rarer than the other.
Under these conditions, if the models were optimising for accuracy, a high accuracy score was not only easy to achieve but practically inevitable.
But what’s the problem with this? Surely accuracy is good anyhow?
When we think of accuracy we tend to think of an archer hitting the bullseye, a sniper hitting their mark from a mile away, or perhaps a stock trader predicting the peak of a bull market and cashing in moments before the crash. In all these cases, more accuracy is better – there are no tradeoffs.
The definition of classification accuracy in statistics is essentially the proportion of cases which are correctly predicted by the model. Again, this sounds simple enough and quite reasonable. If a model correctly predicts 95% of cases then surely it’s a good model, isn’t it? Well no, not necessarily. It depends on what we want to use the model for. In many real-world cases, maximising accuracy will reduce the utility of the model. An example can illustrate this best.
Imagine that a bank hires a monkey to identify fraudulent credit card transactions in order to block them at the point of sale. For each transaction, the monkey is given a card with some details of the sale, plus the customer’s history and purchase patterns. The monkey is to press a green button to indicate a legitimate transaction or red to indicate a fraudulent transaction. For each correct prediction the monkey is rewarded with a banana, and for each wrong prediction he is given a mild electric shock. The monkey enjoys the reward of the banana to an equal degree that he dislikes the electric shock, so one shock and one banana cancel each other out. This particular monkey is also infinitely hungry and doesn’t ever tire of bananas.
Of course, the monkey cannot read, and he does not have the slightest grasp of commerce or human geography. But he does have a general capacity for learning. The monkey receives his first card. He sniffs it, shakes it around and chews on it for a bit. Then he spots the shiny buttons which look much more interesting. Initially, the monkey presses the green or red button at random, receiving a few bananas and a few electric shocks. But as it happens, 98% of this bank’s transactions are non-fraudulent, and the monkey very quickly learns that the green button is best as it gives him a banana, while the red button should be avoided as it leads to a shock. So the monkey learns to press only the green button, over and over (remember this monkey is infinitely hungry).
Occasionally, the monkey is surprised to find the green button also gives him an electric shock, such as when the card comes through concerning an elderly gentleman who has never shopped online, suddenly ordering 10 PlayStations to an address in Somalia. But the monkey doesn’t mind – the multitude of bananas outweigh the occasional mild electric shock.
In fact, the monkey is incredibly accurate – 98% accurate. The bank is deeply impressed by this. They conclude his ‘learning’ period, reward him with a hefty pay rise and make plans for an entire department of monkeys under his leadership.
Soon enough though, the fraud claims start coming in as customers spot the fraudulent transactions. Many PlayStations have been ordered to Somalia. The bank is rapidly swamped by claims and forced to take the loss. Their entire profit for the next three quarters is wiped out at a stroke. It dawns on them that despite being 98% accurate, the monkey has failed to identify a single fraudulent transaction.
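The monkey’s 98% figure is easy to reproduce in code. Here is a minimal sketch (using made-up counts, assuming a 2% fraud rate as in the story) showing that a classifier which always predicts ‘legitimate’ scores 98% accuracy while catching no fraud at all:

```python
# A "monkey" classifier that always predicts the majority class.
# Labels: 1 = fraudulent, 0 = legitimate (illustrative data only).
actuals = [1] * 2 + [0] * 98        # 2% fraud, 98% legitimate
predictions = [0] * len(actuals)    # always press the green button

correct = sum(p == a for p, a in zip(predictions, actuals))
accuracy = correct / len(actuals)
fraud_caught = sum(p == 1 and a == 1 for p, a in zip(predictions, actuals))

print(accuracy)      # 0.98 - looks impressive
print(fraud_caught)  # 0    - not a single fraud detected
```

The accuracy simply mirrors the class balance: do nothing clever, and you still score 98%.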
Beyond accuracy: Precision, Sensitivity and Specificity
Clearly, the bank should have been looking at how often the monkey correctly identified fraudulent cases when they occurred. This is referred to as the sensitivity of the model. The monkey flagged everything as non-fraudulent, so regardless of how rare the fraud cases were, his sensitivity was zero.
Sensitivity is one of the three other key model performance measures that will generally be of interest along with accuracy. The remaining two are specificity and precision. These measures are all interrelated and easy to understand when examined together.
In any binary classification scenario, we have two possible actual outcomes (true and false), and two possible predictions from the model (true and false). Therefore, there are four groups into which each prediction falls:
- True positives – model correctly predicts a true outcome, e.g. monkey flags a fraudulent transaction as fraudulent
- False positives – model wrongly predicts true where the actual outcome is false, e.g. monkey flags a legitimate transaction as fraudulent
- True negatives – model correctly predicts a negative outcome, e.g. monkey flags a legitimate transaction as legitimate
- False negatives – model wrongly predicts false where the actual outcome is true, e.g. monkey flags a fraudulent transaction as legitimate
The model can then be described in terms of how ‘good’ it is based on key ratios between these.
Precision measures how often the model is correct when it predicts a positive outcome. It is defined as [true positives] / [true positives + false positives].
Sensitivity measures how good the model is at catching actual positive outcomes. It is defined as [true positives] / [true positives + false negatives].
Specificity measures how good the model is at identifying actual negative outcomes. It is defined as [true negatives] / [true negatives + false positives].
In the earlier example, the ‘true’ event is that a transaction is fraudulent. In addition to his accuracy of 98%, the monkey would have a sensitivity of zero and specificity of 100%, while his precision would be undefined since he never predicted fraud at all.
We should never consider only one of these measures, or aim to maximise one at the complete expense of the others. In fact, we can always achieve 100% on either sensitivity or specificity if we wish – simply by predicting a single outcome. For example, if the monkey were to change tack and start reporting every single transaction as fraudulent, he would then achieve a sensitivity of 100%. Unfortunately, he would also rapidly lose the bank all their customers once all transactions started getting blocked.
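The red-button-only strategy is just as easy to sketch as the green-button one. With the same illustrative 2% fraud rate, flagging everything as fraudulent yields perfect sensitivity but a precision equal to the fraud rate itself:

```python
# "Red button only": flag every transaction as fraudulent.
# 1 = fraudulent, 0 = legitimate; illustrative data with a 2% fraud rate.
actuals = [1] * 2 + [0] * 98
predictions = [1] * len(actuals)

tp = sum(p == 1 and a == 1 for p, a in zip(predictions, actuals))
fp = sum(p == 1 and a == 0 for p, a in zip(predictions, actuals))
fn = sum(p == 0 and a == 1 for p, a in zip(predictions, actuals))

sens = tp / (tp + fn)
prec = tp / (tp + fp)
print(sens)  # 1.0  - every fraud is caught...
print(prec)  # 0.02 - ...but 98% of flagged transactions are legitimate
```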
Bringing it all together
So, what does a good model look like here? It really depends on what the model is meant to do, and the benefits and costs of each outcome.
In the previous example, the cost of a false positive is a mildly frustrated customer having to phone up the bank to unblock the transaction, and a call centre agent to deal with the call. On the other hand, a false negative means a fraudulent transaction being let through, costing the bank possibly hundreds or thousands of pounds unless they can recover the funds.
In some cases, the costs of each outcome can be neatly quantified financially, modelled mathematically and a perfect optimum found. However, in many cases the benefits or costs are either impossible to estimate with any accuracy, or are inherently unmeasurable, such as the harm inflicted when a cancer detection test fails to detect an actual cancer.
In some cases, a good model may be able to achieve simultaneously high levels of accuracy, precision, sensitivity and specificity if it is fed most of the information which determines the outcome, and if there are few random effects. Conversely, in other cases even a great model may achieve accuracy or precision only slightly above 50% (the baseline for a balanced sample, since with two equally likely categories even random guessing would get one in two correct).
In most real-world cases though, we will know which measure is most important, and maximise this while keeping the others within acceptable levels.
A smoke detector, for example, will obviously need very high sensitivity, i.e. it should virtually always detect a fire where one occurs. Low precision is perfectly acceptable here, i.e. we can accept many false alarms for each real fire (the same will be true for credit card fraud). But there are limits to this – if the smoke alarm were set off every time the gas stove was lit, people would either disable or ignore the detector, with tragic results.
Most statistical packages will pull together the above metrics, and more, within a confusion matrix. You can also look into ROC curves which consider trade-offs between specificity and sensitivity, and may be suitable in some cases to determine an optimal balance.
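The standard 2×2 layout is simple enough to build by hand, which makes the package output easier to read. A minimal sketch (scikit-learn’s `sklearn.metrics.confusion_matrix`, for example, produces the same table for binary labels, with rows as actual classes and columns as predicted classes):

```python
# A minimal confusion matrix built by hand.
# Rows: actual class (0 = negative, 1 = positive).
# Columns: predicted class. Data is illustrative.
actuals     = [1, 1, 0, 0, 0, 1, 0, 0]
predictions = [1, 0, 0, 1, 0, 1, 0, 0]

matrix = [[0, 0], [0, 0]]
for a, p in zip(actuals, predictions):
    matrix[a][p] += 1

print(matrix)  # [[4, 1], [1, 2]] -> tn=4, fp=1, fn=1, tp=2
```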
Don’t be the monkey
You may be thinking that no real model could be as dumb as the example above. After all, your model didn’t always produce the same outcome; your regression produced non-zero coefficients for all the hundreds of input variables; your random forest model grew lots of different trees; your neural network created many nodes and determined weightings for the connections. But the models will always do this whether they have found something valuable or not. In the end, all that apparent complexity may just be the result of noise. Or the model may indeed be picking up some useful patterns which are then swamped by the constant term or equivalent.
If you’re not an analyst or data scientist then you don’t need to worry about recalling each of the metrics discussed. Just remember not to be fooled by high accuracy: it can sometimes mean nothing at all. When you come to evaluate the efficacy of a model, simply ask yourself three questions:
- When the model predicts an event, how often is it correct?
- When the event does happen, how often did the model predict correctly?
- When the event does not happen, how often did the model predict correctly?