Neural networks: Multi-class classification
Page Summary
- This document explores multi-class classification models, which predict one of multiple possible classes rather than just one of two, as binary classification models do.
- Multi-class classification can be achieved through two main approaches: one-vs.-all and one-vs.-one (softmax).
- One-vs.-all uses multiple binary classifiers, one for each possible outcome, to determine the probability of each class independently.
- One-vs.-one (softmax) predicts the probability of each class relative to all the other classes, using the softmax function to ensure that the probabilities sum to 1.
- Softmax is efficient when there are few classes but becomes computationally expensive with many; candidate sampling offers a more efficient alternative.
Earlier, you encountered binary classification models that could pick between one of two possible choices, such as whether:
- A given email is spam or not spam.
- A given tumor is malignant or benign.
In this section, we'll investigate multi-class classification models, which can pick from multiple possibilities. For example:
- Is this dog a beagle, a basset hound, or a bloodhound?
- Is this flower a Siberian Iris, Dutch Iris, Blue Flag Iris, or Dwarf Bearded Iris?
- Is that plane a Boeing 747, Airbus 320, Boeing 777, or Embraer 190?
- Is this an image of an apple, bear, candy, dog, or egg?
Some real-world multi-class problems entail choosing from millions of separate classes. For example, consider a multi-class classification model that can identify the image of just about anything.
This section details the two main variants of multi-class classification:
- one-vs.-all
- one-vs.-one, which is usually known as softmax
One versus all
One-vs.-all provides a way to use binary classification for a series of yes or no predictions across multiple possible labels.
Given a classification problem with N possible solutions, a one-vs.-all solution consists of N separate binary classifiers—one binary classifier for each possible outcome. During training, the model runs through a sequence of binary classifiers, training each to answer a separate classification question.
For example, given a picture of a piece of fruit, four different recognizers might be trained, each answering a different yes/no question:
- Is this image an apple?
- Is this image an orange?
- Is this image a pear?
- Is this image a grape?
The following image illustrates how this works in practice.
[Figure: a separate binary classifier answers each yes/no fruit question]
This approach is fairly reasonable when the total number of classes is small, but becomes increasingly inefficient as the number of classes rises.
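To make the idea concrete, here is a minimal sketch of the separate-classifier approach, assuming a Keras-style setup with made-up data shapes and class names (the feature size, epoch count, and fruit labels are illustrative, not part of the original example):

```python
import numpy as np
import tensorflow as tf

# Hypothetical data: 200 examples with 64 features, integer labels 0..3
# standing in for apple, orange, pear, grape.
num_classes = 4
features = np.random.rand(200, 64).astype("float32")
labels = np.random.randint(0, num_classes, size=200)

# One-vs.-all with separate models: one independent binary (sigmoid)
# classifier per class, each answering its own yes/no question.
classifiers = []
for class_id in range(num_classes):
    yes_no_labels = (labels == class_id).astype("float32")
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(64,)),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")
    model.fit(features, yes_no_labels, epochs=5, verbose=0)
    classifiers.append(model)

# Each classifier scores its own class independently, so the four scores
# for a single image need not sum to 1.0.
scores = [m.predict(features[:1], verbose=0)[0, 0] for m in classifiers]
```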
We can create a significantly more efficient one-vs.-all model with a deep neural network in which each output node represents a different class. The following image illustrates this approach.
[Figure 8: one-vs.-all implemented as a single neural network with one output node per class]
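The following is a minimal Keras-style sketch of that idea, assuming illustrative input and hidden-layer sizes: a single network whose output layer has one sigmoid node per class.

```python
import tensorflow as tf

# One shared hidden layer and one sigmoid output node per class.
# The layer sizes are assumptions made for illustration.
num_classes = 4
one_vs_all_model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="sigmoid"),  # one yes/no score per class
])
# Each output node acts as its own binary classifier, so binary
# cross-entropy is applied to every node independently.
one_vs_all_model.compile(optimizer="adam", loss="binary_crossentropy")
```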
One versus one (softmax)
You may have noticed that the probability values in the output layer of Figure 8 don't sum to 1.0 (or 100%). (In fact, they sum to 1.43.) In a one-vs.-all approach, the probability of each binary set of outcomes is determined independently of all the other sets. That is, we're determining the probability of "apple" versus "not apple" without considering the likelihood of our other fruit options: "orange", "pear", or "grape."
But what if we want to predict the probabilities of each fruit relative to each other? In this case, instead of predicting "apple" versus "not apple", we want to predict "apple" versus "orange" versus "pear" versus "grape". This type of multi-class classification is called one-vs.-one classification.
We can implement one-vs.-one classification using the same type of neural network architecture used for one-vs.-all classification, with one key change: we need to apply a different transform to the output layer.
For one-vs.-all, we applied the sigmoid activation function to each output node independently, which resulted in an output value between 0 and 1 for each node, but did not guarantee that these values summed to exactly 1.
For one-vs.-one, we can instead apply a function called softmax, which assigns decimal probabilities to each class in a multi-class problem such that all probabilities add up to 1.0. This additional constraint helps training converge more quickly than it otherwise would.
The softmax equation is as follows:

$$p(y = j \mid \mathbf{x}) = \frac{e^{\mathbf{w}_j^\top \mathbf{x} + b_j}}{\sum_{k \in K} e^{\mathbf{w}_k^\top \mathbf{x} + b_k}}$$

Note that this formula basically extends the formula for logistic regression into multiple classes.
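As a sketch of the computation, here is the softmax function in plain NumPy, with the usual max-subtraction trick for numerical stability; the input logits below are made-up numbers:

```python
import numpy as np

def softmax(logits):
    """Convert a vector of logits into probabilities that sum to 1.0."""
    # Subtracting the max does not change the result (softmax is
    # shift-invariant) but prevents overflow in exp().
    exps = np.exp(logits - np.max(logits))
    return exps / np.sum(exps)

# Example: raw scores for apple, orange, pear, grape (made-up numbers).
print(softmax(np.array([2.0, 1.0, 0.5, -1.0])))
# -> roughly [0.61, 0.22, 0.14, 0.03]; the values add up to 1.0
```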
The following image re-implements our one-vs.-all multi-class classification task as a one-vs.-one task. Note that in order to perform softmax, the hidden layer directly preceding the output layer (called the softmax layer) must have the same number of nodes as the output layer.
[Figure 9: the fruit classifier re-implemented as a one-vs.-one (softmax) network]
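A minimal Keras-style sketch of the change, assuming the same illustrative layer sizes as before: only the output activation and the loss differ from the one-vs.-all network.

```python
import tensorflow as tf

# One-vs.-one (softmax) variant: the output layer now produces a
# probability distribution over the classes that sums to 1.0.
num_classes = 4
softmax_model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),  # probabilities sum to 1.0
])
# Integer class labels (0..3) pair with sparse categorical cross-entropy.
softmax_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```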
Softmax options
Consider the following variants of softmax:
- Full softmax is the softmax we've been discussing; that is, softmax calculates a probability for every possible class.
- Candidate sampling means that softmax calculates a probability for all the positive labels but only for a random sample of negative labels. For example, if we are interested in determining whether an input image is a beagle or a bloodhound, we don't have to provide probabilities for every non-doggy example.
Full softmax is fairly cheap when the number of classes is small but becomes prohibitively expensive when the number of classes climbs. Candidate sampling can improve efficiency in problems having a large number of classes.
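As a rough illustration of the idea (not a production implementation, which would also correct for the bias that sampling introduces, as TensorFlow's sampled softmax does), here is a NumPy sketch that normalizes only over the true class plus a random handful of negatives:

```python
import numpy as np

def sampled_softmax_loss(logits, true_class, num_sampled, rng=None):
    """Toy illustration of candidate sampling for a single example."""
    rng = rng or np.random.default_rng()
    num_classes = logits.shape[0]
    # Draw a few candidate negatives and drop the true class if it was drawn.
    negatives = rng.choice(num_classes, size=num_sampled, replace=False)
    negatives = negatives[negatives != true_class]
    candidate_logits = np.concatenate(([logits[true_class]], logits[negatives]))
    # Softmax restricted to the sampled candidates; the true class is index 0.
    exps = np.exp(candidate_logits - np.max(candidate_logits))
    return -np.log(exps[0] / exps.sum())

# With 100,000 classes, the loss above only ever touches about 21 of them.
logits = np.random.randn(100_000)
loss = sampled_softmax_loss(logits, true_class=42, num_sampled=20)
```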
One label versus many labels
Softmax assumes that each example is a member of exactly one class. Some examples, however, can simultaneously be a member of multiple classes. For such examples:
- You may not use softmax.
- You must rely on multiple logistic regressions.
For example, the one-vs.-one model in Figure 9 above assumes that each input image will depict exactly one type of fruit: an apple, an orange, a pear, or a grape. However, if an input image might contain multiple types of fruit—a bowl of both apples and oranges—you'll have to use multiple logistic regressions instead.
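A minimal sketch of that multi-label setup, assuming Keras and the same illustrative sizes as before: sigmoid outputs trained with binary cross-entropy, so each output node acts as its own logistic regression.

```python
import tensorflow as tf

# Multi-label case: an image can score high for several fruits at once,
# so no softmax is applied across the output nodes.
num_classes = 4
multi_label_model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="sigmoid"),
])
multi_label_model.compile(optimizer="adam", loss="binary_crossentropy")
# Targets are multi-hot vectors, e.g. [1, 1, 0, 0] for a bowl of apples and oranges.
```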