Know what you don't know - allowing classifiers to reject input samples
Insight in Brief
Typical classification algorithms assign a class to every sample, even if that sample is dissimilar to the training data or to the type of object you actually want to classify. This is especially problematic for safety-critical systems. This article therefore presents four methods for integrating novelty detection into classification algorithms and compares them using a toy example:
- Rejection samples for neural networks
- Sigma ellipses for Bayesian classifiers with Gaussian class densities
- One-class SVMs with radial basis functions
- Isolation forests
Which method suits you best depends – as always – on the application.
Introduction
Many classification algorithms assign one class to every sample (feature vector), even if that sample is far from the training data. This property is problematic for practical applications, especially safety-critical ones. It is therefore desirable to allow classification algorithms to respond with “I don’t know” for samples dissimilar to the training data. In this article, methods to achieve this goal are presented and compared using a toy example.
Toy example
Suppose a fruit company wants to build a machine that packs apples and pears coming along a conveyor belt into boxes for shipping. The machine needs to distinguish between apples and pears to pack them into the right box. The properties measured for classification are the weight and color of the fruits. The company decides to train a linear classifier with training data obtained from measuring a large sample of apples and pears from its fields (the training data is visualized in Figure 1 on the right). The company brings the classifier into production and the system seems to work fine.
After some weeks, customers start complaining about finding tennis balls in their apple boxes. The company investigates the problem. They know that there is a tennis court near their fields, and they find out that the trained classifier classifies the incoming tennis balls as apples – although they are dissimilar to the training data (see Figure 2 on the right). How can the company solve this problem?
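The failure mode is easy to reproduce. The sketch below trains a linear classifier on synthetic fruit data; the fruit measurements (weights, colour values) are illustrative assumptions, since the real data behind Figure 1 is not available.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Hypothetical (weight in g, colour index) measurements per fruit class.
apples = rng.normal([150.0, 0.8], [10.0, 0.05], size=(200, 2))
pears = rng.normal([180.0, 0.4], [12.0, 0.05], size=(200, 2))
X = np.vstack([apples, pears])
y = np.array([0] * 200 + [1] * 200)  # 0 = apple, 1 = pear

clf = LogisticRegression(max_iter=1000).fit(X, y)

# A tennis ball (~58 g, bright yellow) is nothing like the training data,
# yet the linear classifier still assigns it to one of the two classes.
tennis_ball = [[58.0, 0.9]]
print(clf.predict(tennis_ball))  # always an apple or a pear, never "I don't know"
```

The classifier partitions the *entire* feature space into two half-planes, so every conceivable input lands in one class or the other.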
Possible solutions
1. Neural networks
“Classic” neural networks are not inherently able to reject out-of-distribution samples. One way to solve this problem is to introduce an additional rejection class. To generate training samples for that rejection class, the smallest surrounding hypercube is constructed for the data of each class. Training samples are then generated in the regions not covered by any class hypercube and labeled with the “rejection” class. Care must be taken that the number of rejection samples is in a sound ratio to the other training samples to avoid a skewed distribution. With the generated rejection samples, the neural network can be trained normally (the network capacity may need to be increased). You can see in Figure 3 that the tennis ball would NOT be classified as an apple or a pear.
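One reading of this construction can be sketched as follows. The fruit data, the enlarged sampling region, and the network size are illustrative assumptions, not the article's actual setup:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical (weight, colour) training data for apples (0) and pears (1).
apples = rng.normal([150.0, 0.8], [10.0, 0.05], size=(200, 2))
pears = rng.normal([180.0, 0.4], [12.0, 0.05], size=(200, 2))
X = np.vstack([apples, pears])
y = np.array([0] * 200 + [1] * 200)

# Smallest axis-aligned bounding box ("hypercube") per class.
boxes = [(X[y == c].min(axis=0), X[y == c].max(axis=0)) for c in (0, 1)]

def inside_any(points, boxes):
    return np.any([np.all((points >= lo) & (points <= hi), axis=1)
                   for lo, hi in boxes], axis=0)

# Draw candidates from an enlarged region around the data and keep only those
# outside every class hypercube; they become the "rejection" class samples.
span = np.ptp(X, axis=0)
cand = rng.uniform(X.min(axis=0) - 0.5 * span,
                   X.max(axis=0) + 0.5 * span, size=(4000, 2))
reject = cand[~inside_any(cand, boxes)][:200]  # keep class sizes balanced

X_aug = np.vstack([X, reject])
y_aug = np.concatenate([y, np.full(len(reject), 2)])  # class 2 = "reject"

net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32, 32),
                                  max_iter=2000, random_state=0))
net.fit(X_aug, y_aug)
# Points far from both class boxes should now fall into class 2.
print(net.predict([[150.0, 0.8], [90.0, 0.6]]))
```

Truncating the rejection samples to the per-class sample count is one simple way to keep the class ratio sound, as noted above.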
2. Bayesian classifiers with Gaussian class densities
Assume that the classifier consists of a generative model that has normally distributed (or Gaussian mixture) class densities \( p\left(\mathbf{x}\middle|C_i\right)\sim\mathcal{N}\left(\boldsymbol{\mu}_i,\boldsymbol{\Sigma}_i\right) \). Then one could assume that the evidence term (calculated by the law of total probability),
\( p(\mathbf{x})=\sum_i p\left(\mathbf{x}\middle|C_i\right)p\left(C_i\right), \)
could be used to reject “unlikely” samples with respect to the model. However, \( p(\mathbf{x}) \) is a probability density and has no absolute meaning: its values depend on the scaling of the features. Therefore, a more reliable measure is needed. One idea is to use the Mahalanobis distance of the feature vector to the mean of a certain class,
\( d_i(\mathbf{x})^2=\left(\mathbf{x}-\boldsymbol{\mu}_i\right)^T\boldsymbol{\Sigma}_i^{-1}\left(\mathbf{x}-\boldsymbol{\mu}_i\right). \)
This distance forms ellipses in two dimensions (ellipsoids in higher dimensions) around the distribution mean, whose axes are scaled by the distribution's variances. One can then calculate the probability of observing a distance greater than a certain value by
\( p\left(D_i>d\right)=1-\operatorname{erf}\left(\frac{d}{\sqrt2}\right). \)
As such, samples can be rejected if the probability of observing a certain distance for that sample is below a threshold
\( p\left(D_i>d_i\left(\mathbf{x}\right)\right)<\delta. \)
These thresholds correspond to the well-known three-sigma rule: for example, choosing a threshold of \( \delta=0.0027 \) means that 99.73 % of the data generated by the model (the three-sigma region) would lie within the accepted region. To extend the criterion to multiple classes, the following relation can be used:
\( p_\sigma(\mathbf{x})=\frac{\max_i\, p\left(D_i>d_i(\mathbf{x})\right)p\left(C_i\right)}{\max_i\, p\left(C_i\right)}. \)
You can see in Figure 4 that the tennis ball would NOT be classified as an apple or a pear and that the decision regions have an elliptic shape.
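The rejection criterion above is short enough to write out directly. In the sketch below, the class means, covariances, and priors are illustrative assumptions standing in for a model fitted on the fruit data:

```python
import numpy as np
from math import erf, sqrt

# Hypothetical class-conditional Gaussians (means, diagonal covariances)
# and priors p(C_i) for apples and pears; illustrative assumptions.
mu = [np.array([150.0, 0.8]), np.array([180.0, 0.4])]
Sigma = [np.diag([100.0, 0.0025]), np.diag([144.0, 0.0025])]
prior = [0.5, 0.5]

def mahalanobis(x, mu_i, Sigma_i):
    diff = x - mu_i
    return sqrt(diff @ np.linalg.inv(Sigma_i) @ diff)

def tail_prob(d):
    # p(D_i > d): probability of seeing a Mahalanobis distance beyond d
    return 1.0 - erf(d / sqrt(2.0))

def p_sigma(x):
    # Multi-class criterion: max over classes of p(D_i > d_i(x)) p(C_i),
    # normalized by the largest prior.
    num = max(tail_prob(mahalanobis(x, m, S)) * p
              for m, S, p in zip(mu, Sigma, prior))
    return num / max(prior)

delta = 0.0027  # three-sigma threshold: 99.73 % of model data is accepted
print(p_sigma(np.array([58.0, 0.9])) < delta)   # tennis ball: rejected (True)
print(p_sigma(np.array([150.0, 0.8])) < delta)  # typical apple: kept (False)
```

The tennis ball sits roughly nine standard deviations from the apple mean in weight alone, so its tail probability is vanishingly small and it falls below any reasonable threshold.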
3. Two stages (two models) approaches
In general, it is also possible to follow a two-stage approach to rejecting samples. In this case, two models are trained: the first model detects samples that are not part of the training distribution and performs the rejection, while the second model is any other classifier (such as the linear classifier in the initial toy example). In the following, two possible rejection models are shown: the one-class SVM and the isolation forest. That said, nearly any model that can perform outlier detection can be used in this role.
3.1. One-class SVM with radial basis function kernel
One possibility for the rejection model is a one-class support vector machine. To capture arbitrary data distributions, a radial basis function (RBF) kernel can be used. Figure 5 shows the result of applying the one-class SVM to the toy example. Again, the tennis ball would be rejected.
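With scikit-learn, such a first-stage detector is a few lines. The fruit data and the `nu`/`gamma` values below are illustrative assumptions that would need tuning per application:

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Hypothetical in-distribution fruit data (weight, colour); a one-class
# model needs only "normal" samples, no labels.
X_train = np.vstack([
    rng.normal([150.0, 0.8], [10.0, 0.05], size=(200, 2)),
    rng.normal([180.0, 0.4], [12.0, 0.05], size=(200, 2)),
])

# Stage 1: a one-class SVM with an RBF kernel learns the support of the data.
# nu bounds the fraction of training points treated as outliers; gamma sets
# the kernel width.
detector = make_pipeline(StandardScaler(),
                         OneClassSVM(kernel="rbf", nu=0.05, gamma=0.5))
detector.fit(X_train)

# -1 means reject; +1 means pass the sample on to the second-stage classifier.
print(detector.predict([[58.0, 0.9]]))   # tennis ball
print(detector.predict([[150.0, 0.8]]))  # typical apple
```

Standardizing the features first matters here: weight and colour live on very different scales, and the RBF kernel is distance-based.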
3.2. Isolation forest
As a second example of a rejection model, Figure 6 shows the result of training an isolation forest on the toy example. The key idea behind isolation forests is that anomalies can be isolated from the data with only a few decisions and therefore end up at a low depth in the trained isolation trees.
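The same first-stage pattern works with an isolation forest; again, the fruit data and the `contamination` value are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(2)
# Hypothetical fruit training data (weight, colour).
X_train = np.vstack([
    rng.normal([150.0, 0.8], [10.0, 0.05], size=(200, 2)),
    rng.normal([180.0, 0.4], [12.0, 0.05], size=(200, 2)),
])

# Points that can be isolated with few random splits sit at shallow tree
# depths and receive an anomalous score; contamination sets the expected
# outlier fraction and thus the decision threshold.
forest = IsolationForest(n_estimators=200, contamination=0.01, random_state=0)
forest.fit(X_train)

print(forest.predict([[58.0, 0.9]]))    # tennis ball: flagged as anomaly (-1)
print(forest.predict([[150.0, 0.8]]))   # typical apple: accepted (+1)
```

Unlike the SVM variant, the tree splits operate per feature, so no feature scaling is required.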
Summary
In this article, three different approaches to integrating “I don’t know” into classification algorithms have been shown. Which method works best (in terms of accuracy, number of dimensions, computational performance, memory requirements, overall system design, etc.) depends on the application. In general, it is desirable to train only one model to keep the overhead and the computational complexity low. However, if the classification model itself is not able to reject samples (e.g. because the class densities are not normally distributed), a two-stage procedure can be used. In any case, one should always be aware of the consequences of misclassification for a particular application.