AdaptiveLogSoftmaxWithLoss#
- class torch.nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs, div_value=4.0, head_bias=False, device=None, dtype=None)[source]#
Efficient softmax approximation.
As described in Efficient softmax approximation for GPUs by Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou.
Adaptive softmax is an approximate strategy for training models with large output spaces. It is most effective when the label distribution is highly imbalanced, for example in natural language modelling, where the word frequency distribution approximately follows Zipf's law.
Adaptive softmax partitions the labels into several clusters according to their frequency. These clusters may each contain a different number of targets. Additionally, clusters containing less frequent labels assign lower-dimensional embeddings to those labels, which speeds up the computation. For each minibatch, only the clusters for which at least one target is present are evaluated.
The idea is that the clusters which are accessed frequently (like the first one, containing the most frequent labels) should also be cheap to compute, that is, contain a small number of assigned labels.
We highly recommend taking a look at the original paper for more details.
cutoffs should be an ordered Sequence of integers sorted in increasing order. It controls the number of clusters and the partitioning of targets into clusters. For example, setting cutoffs = [10, 100, 1000] means that the first 10 targets will be assigned to the 'head' of the adaptive softmax, targets 11, 12, …, 100 will be assigned to the first cluster, and targets 101, 102, …, 1000 will be assigned to the second cluster, while targets 1001, 1002, …, n_classes - 1 will be assigned to the last, third cluster.

div_value is used to compute the size of each additional cluster, which is given as $\left\lfloor \frac{\text{in\_features}}{\text{div\_value}^{idx}} \right\rfloor$, where $idx$ is the cluster index (with clusters for less frequent words having larger indices, and indices starting from $1$).

head_bias, if set to True, adds a bias term to the 'head' of the adaptive softmax. See the paper for details. Set to False in the official implementation.
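As a quick sketch of the formula above (the in_features and div_value values here are assumptions for illustration), the projection size of each tail cluster can be computed directly:

```python
# Illustrative only: how div_value shrinks the projection size of each
# tail cluster. With in_features=512, div_value=4.0, and three cutoffs,
# cluster idx (starting from 1) gets floor(in_features / div_value**idx).
in_features = 512
div_value = 4.0
cutoffs = [10, 100, 1000]

cluster_sizes = [int(in_features // (div_value ** idx))
                 for idx in range(1, len(cutoffs) + 1)]
print(cluster_sizes)  # [128, 32, 8]
```

Less frequent clusters thus get progressively smaller projections, which is where the speedup comes from.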
Warning
Labels passed as inputs to this module should be sorted according to their frequency. This means that the most frequent label should be represented by the index 0, and the least frequent label should be represented by the index n_classes - 1.
Note
This module returns a NamedTuple with output and loss fields. See further documentation for details.

Note

To compute log-probabilities for all classes, the log_prob method can be used.

- Parameters
in_features (int) – Number of features in the input tensor
n_classes (int) – Number of classes in the dataset
cutoffs (Sequence) – Cutoffs used to assign targets to their buckets
div_value (float, optional) – value used as an exponent to compute the sizes of the clusters. Default: 4.0
head_bias (bool, optional) – If True, adds a bias term to the 'head' of the adaptive softmax. Default: False
- Returns
output is a Tensor of size N containing the computed target log probabilities for each example

loss is a Scalar representing the computed negative log likelihood loss
- Return type
NamedTuple with output and loss fields
- Shape:
input: (N, in_features) or (in_features)

target: (N) or () where each value satisfies 0 <= target[i] <= n_classes

output1: (N) or ()

output2: Scalar
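A minimal forward pass matching the shapes above (the feature, class, and cutoff sizes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 64 input features, 1000 classes, a head covering
# targets 0-9 and two tail clusters (10-99 and 100-999).
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=1000,
                                    cutoffs=[10, 100])
x = torch.randn(8, 64)                 # minibatch of 8 examples
target = torch.randint(0, 1000, (8,))  # one class index per example
out, loss = asm(x, target)             # NamedTuple: (output, loss)
print(out.shape)   # torch.Size([8]) -- per-example target log probability
print(loss.shape)  # torch.Size([]) -- scalar NLL loss
```

Note that, unlike most loss modules, the target is passed to forward together with the input.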
- log_prob(input)[source]#
Compute log probabilities for all n_classes classes.
- Parameters
input (Tensor) – a minibatch of examples
- Returns
log-probabilities for each class c in the range 0 <= c <= n_classes - 1, where n_classes is a parameter passed to the AdaptiveLogSoftmaxWithLoss constructor.

- Return type

Tensor
- Shape:
Input: (N, in_features)

Output: (N, n_classes)
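A small sketch of log_prob (sizes are assumptions for illustration); since the result is a full log-distribution over classes, each row exponentiates and sums to 1:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=16, n_classes=50,
                                    cutoffs=[5, 20])
x = torch.randn(4, 16)
log_probs = asm.log_prob(x)  # full distribution, shape (N, n_classes)
print(log_probs.shape)       # torch.Size([4, 50])
# Each row is a proper log-distribution: exp() sums to 1.
rows_normalized = torch.allclose(log_probs.exp().sum(dim=1),
                                 torch.ones(4), atol=1e-5)
print(rows_normalized)       # True
```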
- predict(input)[source]#
Return the class with the highest probability for each example in the input minibatch.
self.log_prob(input).argmax(dim=1), but is more efficient in some cases.

- Parameters
input (Tensor) – a minibatch of examples
- Returns
a class with the highest probability for each example
- Return type
output (Tensor)
- Shape:
Input: (N, in_features)

Output: (N)
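A sketch of predict and its documented equivalence to an argmax over log_prob (sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=16, n_classes=50,
                                    cutoffs=[5, 20])
x = torch.randn(4, 16)
pred = asm.predict(x)        # most likely class per example, shape (N,)
print(pred.shape)            # torch.Size([4])
# Documented equivalence: same result as an argmax over log_prob.
same = torch.equal(pred, asm.log_prob(x).argmax(dim=1))
print(same)                  # True
```

predict avoids materializing the full (N, n_classes) log-probability matrix when the predicted class falls in the head, which is why it can be faster.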