AdaptiveLogSoftmaxWithLoss#
- class torch.nn.AdaptiveLogSoftmaxWithLoss(in_features, n_classes, cutoffs, div_value=4.0, head_bias=False, device=None, dtype=None)[source]#
Efficient softmax approximation.
As described in Efficient softmax approximation for GPUs by Edouard Grave, Armand Joulin, Moustapha Cissé, David Grangier, and Hervé Jégou.
Adaptive softmax is an approximate strategy for training models with large output spaces. It is most effective when the label distribution is highly imbalanced, for example in natural language modelling, where the word frequency distribution approximately follows Zipf's law.
Adaptive softmax partitions the labels into several clusters according to their frequency. These clusters may each contain a different number of targets. Additionally, clusters containing less frequent labels assign lower-dimensional embeddings to those labels, which speeds up the computation. For each minibatch, only the clusters for which at least one target is present are evaluated.
The idea is that the clusters which are accessed frequently (like the first one, containing the most frequent labels) should also be cheap to compute, that is, contain a small number of assigned labels.
We highly recommend taking a look at the original paper for more details.
cutoffs should be an ordered Sequence of integers sorted in increasing order. It controls the number of clusters and the partitioning of targets into clusters. For example, setting cutoffs = [10, 100, 1000] means that the first 10 targets will be assigned to the 'head' of the adaptive softmax, targets 11, 12, …, 100 will be assigned to the first cluster, and targets 101, 102, …, 1000 will be assigned to the second cluster, while targets 1001, 1002, …, n_classes - 1 will be assigned to the last, third cluster.

div_value is used to compute the size of each additional cluster, which is given as $\left\lfloor \frac{\text{in\_features}}{\text{div\_value}^{idx}} \right\rfloor$, where $idx$ is the cluster index (with clusters for less frequent words having larger indices, and indices starting from $1$).

head_bias, if set to True, adds a bias term to the 'head' of the adaptive softmax. See the paper for details. Set to False in the official implementation.
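As a quick sketch of the formula above (the in_features and div_value values here are assumptions for illustration), the projection size of each tail cluster can be computed directly:

```python
# Illustrative only: how div_value shrinks the projection size of each
# tail cluster. With in_features=512, div_value=4.0, and three cutoffs,
# cluster idx (starting from 1) gets floor(in_features / div_value**idx).
in_features = 512
div_value = 4.0
cutoffs = [10, 100, 1000]

cluster_sizes = [int(in_features // (div_value ** idx))
                 for idx in range(1, len(cutoffs) + 1)]
print(cluster_sizes)  # [128, 32, 8]
```

Less frequent clusters thus get progressively smaller projections, which is where the speedup comes from.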
Warning
Labels passed as inputs to this module should be sorted according to their frequency. This means that the most frequent label should be represented by the index 0, and the least frequent label should be represented by the index n_classes - 1.
Note
This module returns a NamedTuple with output and loss fields. See further documentation for details.

Note

To compute log-probabilities for all classes, the log_prob method can be used.

- Parameters
in_features (int) – Number of features in the input tensor
n_classes (int) – Number of classes in the dataset
cutoffs (Sequence) – Cutoffs used to assign targets to their buckets
div_value (float, optional) – value used as an exponent to compute the sizes of the clusters. Default: 4.0
head_bias (bool, optional) – If True, adds a bias term to the 'head' of the adaptive softmax. Default: False
- Returns
output is a Tensor of size N containing the computed target log probabilities for each example

loss is a Scalar representing the computed negative log likelihood loss
- Return type
NamedTuple with output and loss fields
- Shape:
input: (N, in_features) or (in_features)

target: (N) or () where each value satisfies 0 <= target[i] <= n_classes

output1: (N) or ()

output2: Scalar
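A minimal forward pass matching the shapes above (the feature, class, and cutoff sizes here are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes: 64 input features, 1000 classes, a head covering
# targets 0-9 and two tail clusters (10-99 and 100-999).
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=64, n_classes=1000,
                                    cutoffs=[10, 100])
x = torch.randn(8, 64)                 # minibatch of 8 examples
target = torch.randint(0, 1000, (8,))  # one class index per example
out, loss = asm(x, target)             # NamedTuple: (output, loss)
print(out.shape)   # torch.Size([8]) -- per-example target log probability
print(loss.shape)  # torch.Size([]) -- scalar NLL loss
```

Note that, unlike most loss modules, the target is passed to forward together with the input.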
- log_prob(input)[source]#
Compute log probabilities for all n_classes classes.
- Parameters
input (Tensor) – a minibatch of examples
- Returns
log-probabilities for each class c in the range 0 <= c <= n_classes - 1, where n_classes is a parameter passed to the AdaptiveLogSoftmaxWithLoss constructor.

- Return type

Tensor
- Shape:
Input: (N, in_features)

Output: (N, n_classes)
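A small sketch of log_prob (sizes are assumptions for illustration); since the result is a full log-distribution over classes, each row exponentiates and sums to 1:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=16, n_classes=50,
                                    cutoffs=[5, 20])
x = torch.randn(4, 16)
log_probs = asm.log_prob(x)  # full distribution, shape (N, n_classes)
print(log_probs.shape)       # torch.Size([4, 50])
# Each row is a proper log-distribution: exp() sums to 1.
rows_normalized = torch.allclose(log_probs.exp().sum(dim=1),
                                 torch.ones(4), atol=1e-5)
print(rows_normalized)       # True
```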
- predict(input)[source]#
Return the class with the highest probability for each example in the input minibatch.
self.log_prob(input).argmax(dim=1), but is more efficient in some cases.

- Parameters
input (Tensor) – a minibatch of examples
- Returns
a class with the highest probability for each example
- Return type
output (Tensor)
- Shape:
Input: (N, in_features)

Output: (N)
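A sketch of predict and its documented equivalence to an argmax over log_prob (sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration.
asm = nn.AdaptiveLogSoftmaxWithLoss(in_features=16, n_classes=50,
                                    cutoffs=[5, 20])
x = torch.randn(4, 16)
pred = asm.predict(x)        # most likely class per example, shape (N,)
print(pred.shape)            # torch.Size([4])
# Documented equivalence: same result as an argmax over log_prob.
same = torch.equal(pred, asm.log_prob(x).argmax(dim=1))
print(same)                  # True
```

predict avoids materializing the full (N, n_classes) log-probability matrix when the predicted class falls in the head, which is why it can be faster.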