Machine Learning Glossary


This glossary defines artificial intelligence terms.

Do you have questions about this glossary?

See our FAQ.

A

ablation

A technique for evaluating the importance of a feature or component by temporarily removing it from a model. You then retrain the model without that feature or component, and if the retrained model performs significantly worse, then the removed feature or component was likely important.

For example, suppose you train a classification model on 10 features and achieve 88% precision on the test set. To check the importance of the first feature, you can retrain the model using only the nine other features. If the retrained model performs significantly worse (for instance, 55% precision), then the removed feature was probably important. Conversely, if the retrained model performs equally well, then that feature was probably not that important.

Ablation can also help determine the importance of:

  • Larger components, such as an entire subsystem of a larger ML system
  • Processes or techniques, such as a data preprocessing step

In both cases, you would observe how the system's performance changes (or doesn't change) after you've removed the component.
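For example, here is a minimal Python sketch of a feature-ablation loop. The train_model and evaluate helpers are hypothetical placeholders for your own training and metric code:

def ablation_study(features, train_model, evaluate):
    # Baseline metric with all features included.
    baseline = evaluate(train_model(features))
    importances = {}
    for feature in features:
        reduced = [f for f in features if f != feature]
        # Retrain without this one feature and re-evaluate.
        score = evaluate(train_model(reduced))
        # A large drop from the baseline suggests the feature was important.
        importances[feature] = baseline - score
    return importances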

A/B testing

A statistical way of comparing two (or more) techniques: the A and the B. Typically, the A is an existing technique, and the B is a new technique. A/B testing not only determines which technique performs better but also whether the difference is statistically significant.

A/B testing usually compares a single metric on two techniques; for example, how does model accuracy compare for two techniques? However, A/B testing can also compare any finite number of metrics.
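For example, when the metric is a rate (such as click-through rate), a two-proportion z-test is one common way to check statistical significance. This is a sketch of one valid test, not the only one:

import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    # Observed rates for techniques A and B.
    p_a, p_b = successes_a / n_a, successes_b / n_b
    # Pooled rate under the null hypothesis that A and B are identical.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value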

accelerator chip

#GoogleCloud

A category of specialized hardware components designed to perform key computations needed for deep learning algorithms.

Accelerator chips (or just accelerators, for short) can significantly increase the speed and efficiency of training and inference tasks compared to a general-purpose CPU. They are ideal for training neural networks and similar computationally intensive tasks.

Examples of accelerator chips include:

  • Google's Tensor Processing Units (TPUs), which contain dedicated hardware for deep learning.
  • NVIDIA's GPUs, which were initially designed for graphics processing but enable parallel processing, which can significantly increase processing speed.

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}}{\text{correct predictions + incorrect predictions}}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{40}{40 + 10} = 80\%$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

  • TP is the number of true positives (correct positive predictions).
  • TN is the number of true negatives (correct negative predictions).
  • FP is the number of false positives (incorrect positive predictions).
  • FN is the number of false negatives (incorrect negative predictions).

Compare and contrast accuracy with precision and recall.


Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category | Number
TP | 0
TN | 36499
FP | 0
FN | 25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.
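As a quick sketch, the calculation above in Python:

def accuracy(tp, tn, fp, fn):
    # Correct predictions divided by all predictions.
    return (tp + tn) / (tp + tn + fp + fn)

print(accuracy(tp=0, tn=36499, fp=0, fn=25))  # ~0.9993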


See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

action

In reinforcement learning, the mechanism by which the agent transitions between states of the environment. The agent chooses the action by using a policy.

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

  • ReLU
  • Sigmoid

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A Cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0) and has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.


In a neural network, activation functions manipulate the weighted sum of all the inputs to a neuron. To calculate a weighted sum, the neuron adds up the products of the relevant values and weights. For example, suppose the relevant input to a neuron consists of the following:

input value | input weight
2 | -1.3
-1 | 0.6
3 | 0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

Suppose the designer of this neural network chooses the sigmoid function to be the activation function. In that case, the neuron calculates the sigmoid of -2.0, which is approximately 0.12. Therefore, the neuron passes 0.12 (rather than -2.0) to the next layer in the neural network. The following figure illustrates the relevant part of the process:

An input layer with three features passing three feature values and three weights to a neuron in a hidden layer. The hidden layer calculates the raw value (-2.0), and then passes the raw value to the activation function. The activation function calculates the sigmoid of the raw value and passes the result (0.12) to the next layer of the neural network.
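As a sketch, the same calculation in Python:

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]
weighted_sum = sum(v * w for v, w in zip(inputs, weights))  # -2.0
print(round(sigmoid(weighted_sum), 2))  # 0.12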


See Neural networks: Activation functions in Machine Learning Crash Course for more information.

active learning

A training approach in which the algorithm chooses some of the data it learns from. Active learning is particularly valuable when labeled examples are scarce or expensive to obtain. Instead of blindly seeking a diverse range of labeled examples, an active learning algorithm selectively seeks the particular range of examples it needs for learning.

AdaGrad

A sophisticated gradient descent algorithm that rescales the gradients of each parameter, effectively giving each parameter an independent learning rate. For a full explanation, see Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.
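A minimal NumPy sketch of the core update rule (omitting many practical details):

import numpy as np

def adagrad_update(params, grads, accum, learning_rate=0.01, eps=1e-8):
    # Accumulate the squared gradient of each parameter over time.
    accum += grads ** 2
    # Parameters with a large gradient history take smaller steps,
    # giving each parameter an independent effective learning rate.
    params -= learning_rate * grads / (np.sqrt(accum) + eps)
    return params, accum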

adaptation

#generativeAI

Synonym for tuning or fine-tuning.

agent

#generativeAI

Software that can reason about multimodal user inputs in order to plan and execute actions on behalf of the user.

In reinforcement learning, an agent is the entity that uses a policy to maximize the expected return gained from transitioning between states of the environment.

agentic

#generativeAI

The adjective form of agent. Agentic refers to the qualities that agents possess (such as autonomy).

agentic workflow

#generativeAI

A dynamic process in which an agent autonomously plans and executes actions to achieve a goal. The process may involve reasoning, invoking external tools, and self-correcting its plan.

agglomerative clustering

#clustering

See hierarchical clustering.

AI slop

#generativeAI

Output from a generative AI system that favors quantity over quality. For example, a web page with AI slop is filled with cheaply produced, AI-generated, low-quality content.

anomaly detection

The process of identifying outliers. For example, if the mean for a certain feature is 100 with a standard deviation of 10, then anomaly detection should flag a value of 200 as suspicious.
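One simple approach is a z-score test; this sketch flags values far from the mean, though many other anomaly detection methods exist:

def is_anomaly(value, mean, stddev, threshold=3.0):
    # Flag values more than `threshold` standard deviations from the mean.
    return abs(value - mean) / stddev > threshold

print(is_anomaly(200, mean=100, stddev=10))  # True: 200 is 10 deviations out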

AR

Abbreviation for augmented reality.

area under the PR curve

#Metric

See PR AUC (Area under the PR Curve).

area under the ROC curve

#Metric

See AUC (Area under the ROC curve).

artificial general intelligence

A non-human mechanism that demonstrates a broad range of problem solving, creativity, and adaptability. For example, a program demonstrating artificial general intelligence could translate text, compose symphonies, and excel at games that have not yet been invented.

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

attention

A mechanism used in a neural network that indicates the importance of a particular word or part of a word. Attention compresses the amount of information a model needs to predict the next token/word. A typical attention mechanism might consist of a weighted sum over a set of inputs, where the weight for each input is computed by another part of the neural network.

Refer also to self-attention and multi-head self-attention, which are the building blocks of Transformers.

See LLMs: What's a large language model? in Machine Learning Crash Course for more information about self-attention.

attribute

#responsible

Synonym for feature.

In machine learning fairness, attributes often refer to characteristics pertaining to individuals.

attribute sampling

#df

A tactic for training a decision forest in which each decision tree considers only a random subset of possible features when learning the condition. Generally, a different subset of features is sampled for each node. In contrast, when training a decision tree without attribute sampling, all possible features are considered for each node.

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and 9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is negative, negative, negative, negative, positive, negative, positive, positive, negative, positive, positive, positive.

AUC ignores any value you set for classification threshold. Instead, AUC considers all possible classification thresholds.


AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

Cartesian plot. The x-axis is false positive rate; the y-axis is true positive rate. The graph starts at (0,0), goes straight up to (0,1), and then straight to the right, ending at (1,1).

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

Cartesian plot. The x-axis is false positive rate; the y-axis is true positive rate. The graph starts at (0,0) and goes diagonally to (1,1).

A more typical ROC curve looks approximately like the following:

Cartesian plot. The x-axis is false positive rate; the y-axis is true positive rate. The graph starts at (0,0) and takes an irregular arc to (1,1).

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.



AUC is the probability that a classification model will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
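That probabilistic definition can be computed directly. A brute-force Python sketch (real libraries integrate the ROC curve instead):

def auc_by_pairs(positive_scores, negative_scores):
    # Fraction of (positive, negative) pairs in which the positive example
    # gets the higher score; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positive_scores
        for n in negative_scores
    )
    return wins / (len(positive_scores) * len(negative_scores))

print(auc_by_pairs([0.9, 0.8, 0.7], [0.3, 0.2, 0.1]))  # 1.0 (perfect separation)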


See Classification: ROC and AUC in Machine Learning Crash Course for more information.

augmented reality

A technology that superimposes a computer-generated image on a user's view of the real world, thus providing a composite view.

autoencoder

A system that learns to extract the most important information from the input. Autoencoders are a combination of an encoder and decoder. Autoencoders rely on the following two-step process:

  1. The encoder maps the input to a (typically) lossy lower-dimensional (intermediate) format.
  2. The decoder builds a lossy version of the original input by mapping the lower-dimensional format to the original higher-dimensional input format.

Autoencoders are trained end-to-end by having the decoder attempt to reconstruct the original input from the encoder's intermediate format as closely as possible. Because the intermediate format is smaller (lower-dimensional) than the original format, the autoencoder is forced to learn what information in the input is essential, and the output won't be perfectly identical to the input.

For example:

  • If the input data is a graphic, the non-exact copy would be similar to the original graphic, but somewhat modified. Perhaps the non-exact copy removes noise from the original graphic or fills in some missing pixels.
  • If the input data is text, an autoencoder would generate new text that mimics (but is not identical to) the original text.

See also variational autoencoders.
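A minimal Keras sketch of this two-step structure. The 784 and 32 dimensions are assumptions for illustration (say, flattened 28x28 images):

import tensorflow as tf

# Encoder: maps the input down to a 32-dimensional intermediate format.
encoder = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="relu")])
# Decoder: maps the intermediate format back to the 784-dimensional input space.
decoder = tf.keras.Sequential([tf.keras.layers.Dense(784, activation="sigmoid")])
autoencoder = tf.keras.Sequential([encoder, decoder])

autoencoder.compile(optimizer="adam", loss="mse")
# Trained end-to-end to reconstruct its own input, so the input is also the target:
# autoencoder.fit(x_train, x_train, epochs=10)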

automatic evaluation

#generativeAI

Using software to judge the quality of a model's output.

When model output is relatively straightforward, a script or program can compare the model's output to a golden response. This type of automatic evaluation is sometimes called programmatic evaluation. Metrics such as ROUGE or BLEU are often useful for programmatic evaluation.

When model output is complex or has no one right answer, a separate ML program called an autorater sometimes performs the automatic evaluation.

Contrast with human evaluation.

automation bias

#responsible

When a human decision maker favors recommendations made by an automated decision-making system over information made without automation, even when the automated decision-making system makes errors.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

AutoML

Any automated process for building machine learning models. AutoML can automatically do tasks such as the following:

  • Search for the most appropriate model.
  • Tune hyperparameters.
  • Prepare data.
  • Perform feature engineering.

AutoML is useful for data scientists because it can save them time and effort in developing machine learning pipelines and improve prediction accuracy. It is also useful to non-experts, by making complicated machine learning tasks more accessible to them.

See Automated Machine Learning (AutoML) in Machine Learning Crash Course for more information.

autorater evaluation

#generativeAI

A hybrid mechanism for judging the quality of a generative AI model's output that combines human evaluation with automatic evaluation. An autorater is an ML model trained on data created by human evaluation. Ideally, an autorater learns to mimic a human evaluator.

Prebuilt autoraters are available, but the best autoraters are fine-tuned specifically to the task you are evaluating.

Note: A running autorater is a fully automated process; humans "only" provide data that helps train an autorater.

auto-regressive model

#generativeAI

A model that infers a prediction based on its own previous predictions. For example, auto-regressive language models predict the next token based on the previously predicted tokens. All Transformer-based large language models are auto-regressive.

In contrast, GAN-based image models are usually not auto-regressive since they generate an image in a single forward pass and not iteratively in steps. However, certain image generation models are auto-regressive because they generate an image in steps.

auxiliary loss

A loss function, used in conjunction with a neural network model's main loss function, that helps accelerate training during the early iterations when weights are randomly initialized.

Auxiliary loss functions push effective gradients to the earlier layers. This facilitates convergence during training by combating the vanishing gradient problem.

average precision at k

#Metric

A metric for summarizing a model's performance on a single prompt that generates ranked results, such as a numbered list of book recommendations. Average precision at k is, well, the average of the precision at k values for each relevant result. The formula for average precision at k is therefore:

\[\text{average precision at k} = \frac{1}{n} \sum_{i=1}^n \text{(precision at k for each relevant item)}\]

where:

  • \(n\) is the number of relevant items in the list.

Contrast with recall at k.

Note: Average precision at k evaluates the output for a single prompt. Use mean average precision at k to evaluate the quality of a model's output across many different prompts.

Note: Some people abbreviate average precision at k to simply average precision.


Suppose a large language model is given the following query:

List the 6 funniest movies of all time in order.

And the large language model returns the following list:

  1. The General
  2. Mean Girls
  3. Platoon
  4. Bridesmaids
  5. Citizen Kane
  6. This is Spinal Tap
Four of the movies in the returned list are very funny (that is, they are relevant) but two movies are dramas (not relevant). The following table details the results:

Position | Movie | Relevant? | Precision at k
1 | The General | Yes | 1.0
2 | Mean Girls | Yes | 1.0
3 | Platoon | No | not relevant
4 | Bridesmaids | Yes | 0.75
5 | Citizen Kane | No | not relevant
6 | This is Spinal Tap | Yes | 0.67

The number of relevant results is 4. Therefore, you can calculate the average precision at 6 as follows:

$$\text{average precision at 6} = \frac{1}{4} \text{(1.0 + 1.0 + 0.75 + 0.67)}$$

$$\text{average precision at 6} = \text{~0.85}$$
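A short Python sketch that reproduces this calculation from a ranked list of relevance judgments:

def average_precision_at_k(relevances):
    # relevances: booleans for each ranked result, in order.
    hits, precisions = 0, []
    for k, relevant in enumerate(relevances, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / k)  # precision at this relevant position
    return sum(precisions) / len(precisions) if precisions else 0.0

# The movie example above:
print(average_precision_at_k([True, True, False, True, False, True]))  # ~0.85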

axis-aligned condition

#df

In a decision tree, a condition that involves only a single feature. For example, if area is a feature, then the following is an axis-aligned condition:

area > 200

Contrast with oblique condition.

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!
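To make the two-pass cycle concrete, here is a minimal sketch for a single linear neuron with squared-error loss; real networks apply the same chain rule layer by layer:

import numpy as np

x, y_true = np.array([2.0, -1.0, 3.0]), 1.0
w, b, learning_rate = np.zeros(3), 0.0, 0.01

for _ in range(200):
    y_pred = w @ x + b                    # forward pass: prediction
    loss = (y_pred - y_true) ** 2         # squared-error loss
    dloss_dpred = 2 * (y_pred - y_true)   # chain rule: dL/dy'
    w -= learning_rate * dloss_dpred * x  # backward pass: dL/dw = dL/dy' * x
    b -= learning_rate * dloss_dpred      # dL/db = dL/dy' * 1

print(round(float(w @ x + b), 3))  # approaches the label 1.0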

See Neural networks in Machine Learning Crash Course for more information.

bagging

#df

A method to train an ensemble where each constituent model trains on a random subset of training examples sampled with replacement. For example, a random forest is a collection of decision trees trained with bagging.

The term bagging is short for bootstrap aggregating.

See Random forests in the Decision Forests course for more information.

bag of words

A representation of the words in a phrase or passage, irrespective of order. For example, bag of words represents the following three phrases identically:

  • the dog jumps
  • jumps the dog
  • dog jumps the

Each word is mapped to an index in a sparse vector, where the vector has an index for every word in the vocabulary. For example, the phrase the dog jumps is mapped into a feature vector with non-zero values at the three indexes corresponding to the words the, dog, and jumps. The non-zero value can be any of the following:

  • A 1 to indicate the presence of a word.
  • A count of the number of times a word appears in the bag. For example, if the phrase were the maroon dog is a dog with maroon fur, then both maroon and dog would be represented as 2, while the other words would be represented as 1.
  • Some other value, such as the logarithm of the count of the number of times a word appears in the bag.
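A small Python sketch of the count-based variant, using a toy vocabulary built from the example phrase:

from collections import Counter

vocab = {"the": 0, "maroon": 1, "dog": 2, "is": 3, "a": 4, "with": 5, "fur": 6}

def bag_of_words(phrase):
    # Build a count vector indexed by vocabulary position; word order is lost.
    vector = [0] * len(vocab)
    for word, count in Counter(phrase.split()).items():
        vector[vocab[word]] = count
    return vector

print(bag_of_words("the maroon dog is a dog with maroon fur"))
# [1, 2, 2, 1, 1, 1, 1]: maroon and dog each appear twice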

baseline

#Metric

A model used as a reference point for comparing how well another model (typically, a more complex one) is performing. For example, a logistic regression model might serve as a good baseline for a deep model.

For a particular problem, the baseline helps model developers quantify the minimal expected performance that a new model must achieve for the new model to be useful.

base model

#generativeAI

A pre-trained model that can serve as the starting point for fine-tuning to address specific tasks or applications.

See also pre-trained model and foundation model.

batch

#fundamentals

The set of examples used in one training iteration. The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

batch inference

#GoogleCloud

The process of inferring predictions on multiple unlabeled examples divided into smaller subsets ("batches").

Batch inference can take advantage of the parallelization features of accelerator chips. That is, multiple accelerators can simultaneously infer predictions on different batches of unlabeled examples, dramatically increasing the number of inferences per second.

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

batch normalization

Normalizing the input or output of the activation functions in a hidden layer. Batch normalization can provide the following benefits:

  • Make neural networks more stable by protecting against outlier weights.
  • Enable higher learning rates, which can speed up training.
  • Reduce overfitting.

batch size

#fundamentals

The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

  • Stochastic Gradient Descent (SGD), in which the batch size is 1.
  • Full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • Mini-batch, in which the batch size is usually between 10 and 1,000. Mini-batch is usually the most efficient strategy.

See Machine Learning Crash Course for more information.

Bayesian neural network

A probabilistic neural network that accounts for uncertainty in weights and outputs. A standard neural network regression model typically predicts a scalar value; for example, a standard model predicts a house price of 853,000. In contrast, a Bayesian neural network predicts a distribution of values; for example, a Bayesian model predicts a house price of 853,000 with a standard deviation of 67,200.

A Bayesian neural network relies on Bayes' Theorem to calculate uncertainties in weights and predictions. A Bayesian neural network can be useful when it is important to quantify uncertainty, such as in models related to pharmaceuticals. Bayesian neural networks can also help prevent overfitting.

Bayesian optimization

A probabilistic regression model technique for optimizing computationally expensive objective functions by instead optimizing a surrogate that quantifies the uncertainty using a Bayesian learning technique. Since Bayesian optimization is itself very expensive, it is usually used to optimize expensive-to-evaluate tasks that have a small number of parameters, such as selecting hyperparameters.

Bellman equation

In reinforcement learning, the following identity satisfied by the optimal Q-function:

\[Q(s, a) = r(s, a) + \gamma \mathbb{E}_{s'|s,a} \max_{a'} Q(s', a')\]

Reinforcement learning algorithms apply this identity to create Q-learning using the following update rule:

\[Q(s,a) \gets Q(s,a) + \alpha \left[r(s,a) + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]\]

Beyond reinforcement learning, the Bellman equation has applications to dynamic programming. See the Wikipedia entry for Bellman equation.
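A minimal tabular sketch of that update rule in Python, with the Q-function stored as a dict keyed by (state, action) pairs:

def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.99):
    # max over a' of Q(s', a'), with unseen pairs defaulting to 0.
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)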

BERT (Bidirectional EncoderRepresentations from Transformers)

A model architecture for text representation. A trained BERT model can act as part of a larger model for text classification or other ML tasks.

BERT has the following characteristics:

  • Uses the Transformer architecture (specifically, the encoder), and therefore relies on self-attention.
  • Is bidirectional, drawing on both the preceding and the following context.
  • Was pre-trained with masking, predicting randomly masked tokens.

BERT's variants include:

  • ALBERT (A Lite BERT)
  • LaBSE (Language-agnostic BERT Sentence Embedding)

See Open Sourcing BERT: State-of-the-Art Pre-training for Natural Language Processing for an overview of BERT.

bias (ethics/fairness)

#responsible
#fundamentals

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

  • automation bias
  • confirmation bias
  • group attribution bias
  • implicit bias

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

  • coverage bias
  • non-response bias
  • participation bias
  • reporting bias
  • sampling bias
  • selection bias

Not to be confused with the bias term in machine learning models or prediction bias.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w0

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
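As a sketch, the amusement park model in Python:

def amusement_park_cost(hours):
    bias = 2.0    # the 2 Euro entrance fee: the cost when hours is 0
    weight = 0.5  # 0.5 Euro per hour
    return bias + weight * hours

print(amusement_park_cost(3))  # 3.5 Euros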

Bias is not to be confused with bias in ethics and fairness or prediction bias.

See Linear Regression in Machine Learning Crash Course for more information.

bidirectional

A term used to describe a system that evaluates the text that both precedes and follows a target section of text. In contrast, a unidirectional system only evaluates the text that precedes a target section of text.

For example, consider a masked language model that must determine probabilities for the word or words representing the underline in the following question:

What is the _____ with you?

A unidirectional language model would have to base its probabilities only on the context provided by the words "What", "is", and "the". In contrast, a bidirectional language model could also gain context from "with" and "you", which might help the model generate better predictions.

bidirectional language model

A language model that determines the probability that a given token is present at a given location in an excerpt of text based on the preceding and following text.

bigram

An N-gram in which N=2.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

  • the positive class
  • the negative class

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification.

See also logistic regression and classification threshold.

See Classification in Machine Learning Crash Course for more information.

binary condition

#df

In a decision tree, a condition that has only two possible outcomes, typically yes or no. For example, the following is a binary condition:

temperature >= 100

Contrast with non-binary condition.

See Types of conditions in the Decision Forests course for more information.

binning

Synonym for bucketing.

black box model

Amodel whose "reasoning" is impossible or difficult for humansto understand. That is, although humans can see howpromptsaffectresponses, humans can't determine exactly how a blackbox model determines the response. In other words, a black box model is lackinginterpretability.

Mostdeep models andlarge language models are black boxes.

BLEU (Bilingual Evaluation Understudy)

A metric between 0.0 and 1.0 for evaluating machine translations, for example, from Spanish to Japanese.

To calculate a score, BLEU typically compares an ML model's translation (generated text) to a human expert's translation (reference text). The degree to which N-grams in the generated text and reference text match determines the BLEU score.

The original paper on this metric is BLEU: a Method for Automatic Evaluation of Machine Translation.

See also BLEURT.

BLEURT (Bilingual Evaluation Understudy from Transformers)

A metric for evaluating machine translations from one language to another, particularly to and from English.

For translations to and from English, BLEURT aligns more closely to human ratings than BLEU. Unlike BLEU, BLEURT emphasizes semantic (meaning) similarities and can accommodate paraphrasing.

BLEURT relies on a pre-trained large language model (BERT to be exact) that is then fine-tuned on text from human translators.

The original paper on this metric is BLEURT: Learning Robust Metrics for Text Generation.

Boolean Questions (BoolQ)

#Metric

A dataset for evaluating an LLM's proficiency in answering yes-or-no questions. Each of the challenges in the dataset has three components:

  • A query
  • A passage implying the answer to the query.
  • The correct answer, which is either yes or no.

For example:

  • Query: Are there any nuclear power plants in Michigan?
  • Passage: ...three nuclear power plants supply Michigan with about 30% of its electricity.
  • Correct answer: Yes

Researchers gathered the questions from anonymized, aggregated Google Search queries and then used Wikipedia pages to ground the information.

For more information, see BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions.

BoolQ is a component of the SuperGLUE ensemble.

BoolQ

#Metric

Abbreviation for Boolean Questions.

boosting

A machine learning technique that iteratively combines a set of simple and not very accurate classification models (referred to as "weak classifiers") into a classification model with high accuracy (a "strong classifier") by upweighting the examples that the model is currently misclassifying.

See Gradient Boosted Decision Trees in the Decision Forests course for more information.

bounding box

In an image, the (x, y) coordinates of a rectangle around an area of interest, such as the dog in the image below.

Photograph of a dog sitting on a sofa. A green bounding box with top-left coordinates of (275, 1271) and bottom-right coordinates of (2954, 2761) circumscribes the dog's body.

broadcasting

Expanding the shape of an operand in a matrix math operation to dimensions compatible for that operation. For example, linear algebra requires that the two operands in a matrix addition operation must have the same dimensions. Consequently, you can't add a matrix of shape (m, n) to a vector of length n. Broadcasting enables this operation by virtually expanding the vector of length n to a matrix of shape (m, n) by replicating the same values down each column.


Given the following definitions of A and B, linear algebra prohibits A+B because A and B have different dimensions:

A = [[7, 10, 4],
     [13, 5, 9]]
B = [2]

However, broadcasting enables the operation A+B by virtually expanding B to:

[[2, 2, 2],
 [2, 2, 2]]

Thus, A+B is now a valid operation:

[[7, 10, 4],    [[2, 2, 2],    [[ 9, 12, 6],
 [13, 5, 9]] +   [2, 2, 2]] =   [15, 7, 11]]
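The same operation in NumPy, which broadcasts automatically:

import numpy as np

A = np.array([[7, 10, 4], [13, 5, 9]])
B = np.array([2])
# B (shape (1,)) is virtually expanded to A's shape (2, 3) before adding.
print(A + B)  # [[ 9 12  6] [15  7 11]]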

See the following description of broadcasting in NumPy for more details.

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the "cold" bucket.
  • 11 - 24 degrees Celsius would be the "temperate" bucket.
  • >= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
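A sketch of such a bucketing function in Python:

def bucketize(celsius):
    # Maps a continuous temperature to one of three discrete buckets.
    if celsius <= 10:
        return "cold"
    elif celsius <= 24:
        return "temperate"
    else:
        return "warm"

print(bucketize(13), bucketize(22))  # both map to "temperate"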


If you represent temperature as a continuous feature, then the model treats temperature as a single feature. If you represent temperature as three buckets, then the model treats each bucket as a separate feature. That is, a model can learn separate relationships of each bucket to the label. For example, a linear regression model can learn separate weights for each bucket.

Increasing the number of buckets makes your model more complicated by increasing the number of relationships that your model must learn. For example, the cold, temperate, and warm buckets are essentially three separate features for your model to train on. If you decide to add two more buckets (for example, freezing and hot), your model would now have to train on five separate features.

How do you know how many buckets to create, or what the ranges for each bucket should be? The answers typically require a fair amount of experimentation.


See Numerical data: Binning in Machine Learning Crash Course for more information.

C

calibration layer

A post-prediction adjustment, typically to account for prediction bias. The adjusted predictions and probabilities should match the distribution of an observed set of labels.

candidate generation

The initial set of recommendations chosen by a recommendation system. For example, consider a bookstore that offers 100,000 titles. The candidate generation phase creates a much smaller list of suitable books for a particular user, say 500. But even 500 books is way too many to recommend to a user. Subsequent, more expensive, phases of a recommendation system (such as scoring and re-ranking) reduce those 500 to a much smaller, more useful set of recommendations.

See Candidate generation overview in the Recommendation Systems course for more information.

candidate sampling

A training-time optimization that calculates a probability for all the positive labels, using, for example, softmax, but only for a random sample of negative labels. For instance, given an example labeled beagle and dog, candidate sampling computes the predicted probabilities and corresponding loss terms for:

  • beagle
  • dog
  • a random subset of the remaining negative classes (for example, cat, lollipop, fence).

The idea is that the negative classes can learn from less frequent negative reinforcement as long as positive classes always get proper positive reinforcement, and this is indeed observed empirically.

Candidate sampling is more computationally efficient than training algorithms that compute predictions for all negative classes, particularly when the number of negative classes is very large.

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features.

Contrast with numerical data.

See Working with categorical data in Machine Learning Crash Course for more information.

causal language model

Synonym for unidirectional language model.

See bidirectional language model to contrast different directional approaches in language modeling.

CB

#Metric

Abbreviation for CommitmentBank.

centroid

#clustering

The center of a cluster as determined by a k-means or k-median algorithm. For example, if k is 3, then the k-means or k-median algorithm finds 3 centroids.

See Clustering algorithms in the Clustering course for more information.

centroid-based clustering

#clustering

A category of clustering algorithms that organizes data into nonhierarchical clusters. k-means is the most widely used centroid-based clustering algorithm.

Contrast with hierarchical clustering algorithms.

See Clustering algorithms in the Clustering course for more information.

chain-of-thought prompting

#generativeAI

A prompt engineering technique that encourages a large language model (LLM) to explain its reasoning, step by step. For example, consider the following prompt, paying particular attention to the second sentence:

How many g forces would a driver experience in a car that goes from 0 to 60 miles per hour in 7 seconds? In the answer, show all relevant calculations.

The LLM's response would likely:

  • Show a sequence of physics formulas, plugging in the values 0, 60, and 7 in appropriate places.
  • Explain why it chose those formulas and what the various variables mean.

Chain-of-thought prompting forces the LLM to perform all the calculations, which might lead to a more correct answer. In addition, chain-of-thought prompting enables the user to examine the LLM's steps to determine whether or not the answer makes sense.

Character N-gram F-score (ChrF)

#Metric

A metric to evaluate machine translation models. Character N-gram F-score determines the degree to which N-grams in reference text overlap the N-grams in an ML model's generated text.

Character N-gram F-score is similar to metrics in the ROUGE and BLEU families, except that:

  • Character N-gram F-score operates on character N-grams.
  • ROUGE and BLEU operate on word N-grams or tokens.

chat

#generativeAI

The contents of a back-and-forth dialogue with an ML system, typically a large language model. The previous interaction in a chat (what you typed and how the large language model responded) becomes the context for subsequent parts of the chat.

A chatbot is an application of a large language model.

checkpoint

Data that captures the state of a model's parameters either during training or after training is completed. For example, during training, you can:

  1. Stop training, perhaps intentionally or perhaps as the result of certain errors.
  2. Capture the checkpoint.
  3. Later, reload the checkpoint, possibly on different hardware.
  4. Restart training.

Choice of Plausible Alternatives (COPA)

#Metric

A dataset for evaluating how well an LLM can identify the better of two alternative answers to a premise. Each of the challenges in the dataset consists of three components:

  • A premise, which is typically a statement followed by a question
  • Two possible answers to the question posed in the premise, one of which is correct and the other incorrect
  • The correct answer

For example:

  • Premise: The man broke his toe. What was the CAUSE of this?
  • Possible answers:
    1. He got a hole in his sock.
    2. He dropped a hammer on his foot.
  • Correct answer: 2

COPA is a component of the SuperGLUE ensemble.

class

#fundamentals

A category that a label can belong to. For example:

  • In a binary classification model that detects spam, the two classes might be spam and not spam.
  • In a multi-class classification model that identifies dog breeds, the classes might be poodle, beagle, pug, and so on.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

class-balanced dataset

A dataset containing categorical labels in which the number of instances of each category is approximately equal. For example, consider a botanical dataset whose binary label can be either native plant or nonnative plant:

  • A dataset with 515 native plants and 485 nonnative plants is a class-balanced dataset.
  • A dataset with 875 native plants and 125 nonnative plants is a class-imbalanced dataset.

A formal dividing line between class-balanced datasets and class-imbalanced datasets doesn't exist. The distinction only becomes important when a model trained on a highly class-imbalanced dataset can't converge. See Datasets: imbalanced datasets in Machine Learning Crash Course for details.

classification model

#fundamentals

A model whose prediction is a class. For example, the following are all classification models:

  • A model that predicts an input sentence's language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

  • binary classification
  • multi-class classification

classification threshold

#fundamentals

In binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.

The choice of classification threshold strongly influences the number of false positives and false negatives.
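A sketch of thresholding in Python:

def classify(raw_value, threshold=0.8):
    # Converts a logistic regression output into a class prediction.
    return "positive" if raw_value > threshold else "negative"

print(classify(0.9))  # positive
print(classify(0.7))  # negative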


As models or datasets evolve, engineers sometimes also change the classification threshold. When the classification threshold changes, positive class predictions can suddenly become negative classes and vice-versa.

For example, consider a binary classification disease prediction model.Suppose that when the system runs in the first year:

  • The raw value for a particular patient is 0.95.
  • The classification threshold is 0.94.

Therefore, the system diagnoses the positive class. (The patient gasps, "Oh no! I'm sick!")

A year later, perhaps the values now look as follows:

  • The raw value for the same patient remains at 0.95.
  • The classification threshold changes to 0.97.

Therefore, the system now reclassifies that patient as the negative class. ("Happy day! I'm not sick.") Same patient. Different diagnosis.


See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

classifier

#fundamentals

A casual term for a classification model.

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is class-balanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class "green"
  • 200 labels with class "purple"
  • 350 labels with class "orange"

Training on class-imbalanced datasets can present special challenges. See Imbalanced datasets in Machine Learning Crash Course for details.

See also entropy, majority class, and minority class.

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.
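For example, a sketch of both steps with NumPy's clip function:

import numpy as np

values = np.array([12, 55, 38, 71, 44])
# Values below 40 become 40; values above 60 become 60.
print(np.clip(values, 40, 60))  # [40 55 40 60 44]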

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

Cloud TPU

#TensorFlow
#GoogleCloud

A specialized hardware accelerator designed to speed up machine learning workloads on Google Cloud.

clustering

#clustering

Grouping related examples, particularly during unsupervised learning. Once all the examples are grouped, a human can optionally supply meaning to each cluster.

Many clustering algorithms exist. For example, the k-means algorithm clusters examples based on their proximity to a centroid, as in the following diagram:

A two-dimensional graph in which the x-axis is labeled tree width, and the y-axis is labeled tree height. The graph contains two centroids and several dozen data points. The data points are categorized based on their proximity. That is, the data points closest to one centroid are categorized as cluster 1, while those closest to the other centroid are categorized as cluster 2.

A human researcher could then review the clusters and, for example, label cluster 1 as "dwarf trees" and cluster 2 as "full-size trees."

As another example, consider a clustering algorithm based on an example's distance from a center point, illustrated as follows:

Dozens of data points are arranged in concentric circles, almost like holes around the center of a dart board. The innermost ring of data points is categorized as cluster 1, the middle ring is categorized as cluster 2, and the outermost ring as cluster 3.
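As a sketch of the first example, assuming scikit-learn is available and using made-up tree measurements:

import numpy as np
from sklearn.cluster import KMeans

# Each example is [tree width, tree height]; the values are invented.
examples = np.array([[1.2, 2.1], [1.0, 1.8], [4.1, 9.7], [4.5, 10.2]])
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(examples)
print(cluster_labels)  # e.g., [0 0 1 1]; a human might then name the two clusters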

See the Clustering course for more information.

co-adaptation

An undesirable behavior in which neurons predict patterns in training data by relying almost exclusively on outputs of specific other neurons instead of relying on the network's behavior as a whole. When the patterns that cause co-adaptation are not present in validation data, then co-adaptation causes overfitting. Dropout regularization reduces co-adaptation because dropout ensures neurons cannot rely solely on specific other neurons.

collaborative filtering

Making predictions about the interests of one user based on the interests of many other users. Collaborative filtering is often used in recommendation systems.

See Collaborative filtering in the Recommendation Systems course for more information.

CommitmentBank (CB)

#Metric

A dataset for evaluating an LLM's proficiency in determining whether the author of a passage believes a target clause within that passage. Each entry in the dataset contains:

  • A passage
  • A target clause within that passage
  • A Boolean value indicating whether the passage's author believes the target clause

For example:

  • Passage: What fun to hear Artemis laugh. She's such a serious child. I didn't know she had a sense of humor.
  • Target clause: she had a sense of humor
  • Boolean: True, which means the author believes the target clause

CommitmentBank is a component of the SuperGLUE ensemble.

compact model

Any small model designed to run on small devices with limited computational resources. For example, compact models can run on mobile phones, tablets, or embedded systems.

compute

(Noun) The computational resources used by a model or system, such as processing power, memory, and storage.

See accelerator chips.

concept drift

A shift in the relationship between features and the label. Over time, concept drift reduces a model's quality.

During training, the model learns the relationship between the features and their labels in the training set. If the labels in the training set are good proxies for the real world, then the model should make good real-world predictions. However, due to concept drift, the model's predictions tend to degrade over time.

For example, consider a binary classification model that predicts whether or not a certain car model is "fuel efficient." That is, the features could be:

  • car weight
  • engine compression
  • transmission type

while the label is either:

  • fuel efficient
  • not fuel efficient

However, the concept of "fuel efficient car" keepschanging. A car model labeledfuel efficient in 1994 would almost certainlybe labelednot fuel efficient in 2024. A model suffering from concept drifttends to make less and less useful predictions over time.

Compare and contrast withnonstationarity.


To compensate for concept drift, retrain models faster than the rate of concept drift. For example, if concept drift reduces model precision by a meaningful margin every two months, then retrain your model more frequently than every two months.


condition

#df

In a decision tree, any node that performs a test. For example, the following decision tree contains two conditions:

A decision tree consisting of two conditions: (x > 0) and (y > 0).

A condition is also called a split or a test.

Contrast condition with leaf.


See Types of conditions in the Decision Forests course for more information.

confabulation

Synonym for hallucination.

Confabulation is probably a more technically accurate term than hallucination. However, hallucination became popular first.

configuration

The process of assigning the initial property values used to train a model, including:

  • hyperparameters, such as the learning rate and batch size
  • model architecture settings, such as the number of layers
  • paths to training and evaluation data

In machine learning projects, configuration can be done through a special configuration file or using a configuration library.

confirmation bias

#responsible

The tendency to search for, interpret, favor, and recall information in a way that confirms one's pre-existing beliefs or hypotheses. Machine learning developers may inadvertently collect or label data in ways that influence an outcome supporting their existing beliefs. Confirmation bias is a form of implicit bias.

Experimenter's bias is a form of confirmation bias in which an experimenter continues training models until a pre-existing hypothesis is confirmed.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

 | Tumor (predicted) | Non-Tumor (predicted)
Tumor (ground truth) | 18 (TP) | 1 (FN)
Non-Tumor (ground truth) | 6 (FP) | 452 (TN)

The preceding confusion matrix shows the following:

  • Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
  • Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

 | Setosa (predicted) | Versicolor (predicted) | Virginica (predicted)
Setosa (ground truth) | 88 | 12 | 0
Versicolor (ground truth) | 6 | 141 | 7
Virginica (ground truth) | 2 | 27 | 109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall.
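For example, a sketch deriving precision and recall from the tumor matrix above:

def precision_and_recall(tp, tn, fp, fn):
    precision = tp / (tp + fp)  # of predicted positives, how many were right
    recall = tp / (tp + fn)     # of actual positives, how many were found
    return precision, recall

print(precision_and_recall(tp=18, tn=452, fp=6, fn=1))  # (0.75, ~0.947)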

constituency parsing

Dividing a sentence into smaller grammatical structures ("constituents"). A later part of the ML system, such as a natural language understanding model, can parse the constituents more easily than the original sentence. For example, consider the following sentence:

My friend adopted two cats.

A constituency parser can divide this sentence into the following two constituents:

  • My friend is a noun phrase.
  • adopted two cats is a verb phrase.

These constituents can be further subdivided into smaller constituents. For example, the verb phrase

adopted two cats

could be further subdivided into:

  • adopted is a verb.
  • two cats is another noun phrase.

contextualized language embedding

#generativeAI

An embedding that comes close to "understanding" words and phrases in ways that fluent human speakers can. Contextualized language embeddings can understand complex syntax, semantics, and context.

For example, consider embeddings of the English word cow. Older embeddings such as word2vec can represent English words such that the distance in the embedding space from cow to bull is similar to the distance from ewe (female sheep) to ram (male sheep) or from female to male. Contextualized language embeddings can go a step further by recognizing that English speakers sometimes casually use the word cow to mean either cow or bull.

context window

#generativeAI

The number of tokens a model can process in a given prompt. The larger the context window, the more information the model can use to provide coherent and consistent responses to the prompt.

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature.

convenience sampling

Using a dataset not gathered scientifically in order to run quick experiments. Later on, it's essential to switch to a scientifically gathered dataset.

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration. For example, the following loss curve suggests convergence at around 700 iterations:

Cartesian plot. X-axis is the number of training iterations. Y-axis is loss. Loss is very high during the first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

See Model convergence and loss curves in Machine Learning Crash Course for more information.

conversational coding

#generativeAI

An iterative dialog between you and a generative AI model for the purpose of creating software. You issue a prompt describing some software. Then, the model uses that description to generate code. Then, you issue a new prompt to address the flaws in the previous prompt or in the generated code, and the model generates updated code. You two keep going back and forth until the generated software is good enough.

Conversational coding is essentially the original meaning of vibe coding.

Contrast with specificational coding.

convex function

A function in which the region above the graph of the function is a convex set. The prototypical convex function is shaped something like the letter U. For example, the following are all convex functions:

U-shaped curves, each with a single minimum point.

In contrast, the following function is not convex. Notice how the region above the graph is not a convex set:

A W-shaped curve with two different local minimum points.

A strictly convex function has exactly one local minimum point, which is also the global minimum point. The classic U-shaped functions are strictly convex functions. However, some convex functions (for example, straight lines) are not U-shaped.


A lot of the common loss functions, including the following, are convex functions:

  • L2 loss
  • Log Loss
  • L1 regularization
  • L2 regularization

Many variations of gradient descent are guaranteed to find a point close to the minimum of a strictly convex function. Similarly, many variations of stochastic gradient descent have a high probability (though not a guarantee) of finding a point close to the minimum of a strictly convex function.

The sum of two convex functions (for example, L2 loss + L1 regularization) is a convex function.

Deep models are never convex functions. Remarkably, algorithms designed for convex optimization tend to find reasonably good solutions on deep networks anyway, even though those solutions are not guaranteed to be a global minimum.


See Convergence and convex functions in Machine Learning Crash Course for more information.

convex optimization

The process of using mathematical techniques such as gradient descent to find the minimum of a convex function. A great deal of research in machine learning has focused on formulating various problems as convex optimization problems and on solving those problems more efficiently.

For complete details, see Boyd and Vandenberghe, Convex Optimization.
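
As an illustrative sketch (the function and step size below are invented for this example, not taken from the glossary), gradient descent quickly finds the minimum of a simple convex function:

```python
# Minimize the convex function f(x) = (x - 3)**2, whose single global
# minimum is at x = 3. The derivative is f'(x) = 2 * (x - 3).
def grad(x):
    return 2 * (x - 3)

x = 0.0             # arbitrary starting point
learning_rate = 0.1
for _ in range(100):
    x -= learning_rate * grad(x)   # step downhill along the gradient

print(round(x, 6))  # ≈ 3.0, the global minimum
```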

convex set

A subset of Euclidean space such that a line drawn between any two points in the subset remains completely within the subset. For instance, the following two shapes are convex sets:

One illustration of a rectangle. Another illustration of an oval.

In contrast, the following two shapes are not convex sets:

One illustration of a pie-chart with a missing slice. Another illustration of a wildly irregular polygon.

convolution

In mathematics, casually speaking, a mixture of two functions. In machine learning, a convolution mixes the convolutional filter and the input matrix in order to train weights.

The term "convolution" in machine learning is often a shorthand way of referring to either convolutional operation or convolutional layer.

Without convolutions, a machine learning algorithm would have to learn a separate weight for every cell in a large tensor. For example, a machine learning algorithm training on 2K x 2K images would be forced to find 4M separate weights. Thanks to convolutions, a machine learning algorithm only has to find weights for every cell in the convolutional filter, dramatically reducing the memory needed to train the model. When the convolutional filter is applied, it is simply replicated across cells such that each is multiplied by the filter.

convolutional filter

One of the two actors in a convolutional operation. (The other actor is a slice of an input matrix.) A convolutional filter is a matrix having the same rank as the input matrix, but a smaller shape. For example, given a 28x28 input matrix, the filter could be any 2D matrix smaller than 28x28.

In photographic manipulation, all the cells in a convolutional filter are typically set to a constant pattern of ones and zeroes. In machine learning, convolutional filters are typically seeded with random numbers and then the network trains the ideal values.

convolutional layer

A layer of a deep neural network in which a convolutional filter passes along an input matrix. For example, consider the following 3x3 convolutional filter:

A 3x3 matrix with the following values: [[0,1,0], [1,0,1], [0,1,0]]

The following animation shows a convolutional layer consisting of 9 convolutional operations involving the 5x5 input matrix. Notice that each convolutional operation works on a different 3x3 slice of the input matrix. The resulting 3x3 matrix (on the right) consists of the results of the 9 convolutional operations:

An animation showing two matrixes. The first matrix is the 5x5 matrix: [[128,97,53,201,198], [35,22,25,200,195], [37,24,28,197,182], [33,28,92,195,179], [31,40,100,192,177]]. The second matrix is the 3x3 matrix: [[181,303,618], [115,338,605], [169,351,560]]. The second matrix is calculated by applying the convolutional filter [[0, 1, 0], [1, 0, 1], [0, 1, 0]] across different 3x3 subsets of the 5x5 matrix.

convolutional neural network

A neural network in which at least one layer is a convolutional layer. A typical convolutional neural network consists of some combination of the following layers:

  • convolutional layers
  • pooling layers
  • dense layers

Convolutional neural networks have had great success in certain kinds of problems, such as image recognition.

convolutional operation

The following two-step mathematical operation:

  1. Element-wise multiplication of the convolutional filter and a slice of an input matrix. (The slice of the input matrix has the same rank and size as the convolutional filter.)
  2. Summation of all the values in the resulting product matrix.

For example, consider the following 5x5 input matrix:

The 5x5 matrix: [[128,97,53,201,198], [35,22,25,200,195], [37,24,28,197,182], [33,28,92,195,179], [31,40,100,192,177]].

Now imagine the following 2x2 convolutional filter:

The 2x2 matrix: [[1, 0], [0, 1]]

Each convolutional operation involves a single 2x2 slice of the input matrix. For example, suppose we use the 2x2 slice at the top-left of the input matrix. So, the convolution operation on this slice looks as follows:

Applying the convolutional filter [[1, 0], [0, 1]] to the top-left 2x2 section of the input matrix, which is [[128,97], [35,22]]. The convolutional filter leaves the 128 and 22 intact, but zeroes out the 97 and 35. Consequently, the convolution operation yields the value 150 (128+22).

A convolutional layer consists of a series of convolutional operations, each acting on a different slice of the input matrix.
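
The worked example above can be reproduced in a few lines of NumPy (an illustrative sketch, not code from the glossary):

```python
import numpy as np

input_matrix = np.array([[128,  97,  53, 201, 198],
                         [ 35,  22,  25, 200, 195],
                         [ 37,  24,  28, 197, 182],
                         [ 33,  28,  92, 195, 179],
                         [ 31,  40, 100, 192, 177]])
conv_filter = np.array([[1, 0],
                        [0, 1]])

# One convolutional operation: element-wise multiplication of the filter
# and a same-sized slice of the input, then summation of the products.
top_left_slice = input_matrix[0:2, 0:2]        # [[128, 97], [35, 22]]
print(np.sum(top_left_slice * conv_filter))    # 150 (128 + 22)
```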

COPA

#Metric

Abbreviation for Choice of Plausible Alternatives.

cost

#Metric

Synonym for loss.

co-training

A semi-supervised learning approach particularly useful when all of the following conditions are true:

  • The ratio of unlabeled examples to labeled examples in the dataset is high.
  • This is a classification problem.
  • The dataset contains two different sets of predictive features that are independent of each other and complementary.

Co-training essentially amplifies independent signals into a stronger signal. For example, consider a classification model that categorizes individual used cars as either Good or Bad. One set of predictive features might focus on aggregate characteristics such as the year, make, and model of the car; another set of predictive features might focus on the previous owner's driving record and the car's maintenance history.

The seminal paper on co-training is Combining Labeled and Unlabeled Data with Co-Training by Blum and Mitchell.

counterfactual fairness

#responsible
#Metric

A fairness metric that checks whether a classification model produces the same result for one individual as it does for another individual who is identical to the first, except with respect to one or more sensitive attributes. Evaluating a classification model for counterfactual fairness is one method for surfacing potential sources of bias in a model.


coverage bias

#responsible

See selection bias.

crash blossom

A sentence or phrase with an ambiguous meaning. Crash blossoms present a significant problem in natural language understanding. For example, the headline Red Tape Holds Up Skyscraper is a crash blossom because an NLU model could interpret the headline literally or figuratively.


Just to clarify that mysterious headline:
  • Red Tape could refer to either of the following:
    • An adhesive
    • Excessive bureaucracy
  • Holds Up could refer to either of the following:
    • Structural support
    • Delays

critic

Synonym for Deep Q-Network.

cross-entropy

#Metric

A generalization of Log Loss to multi-class classification problems. Cross-entropy quantifies the difference between two probability distributions. See also perplexity.

cross-validation

A mechanism for estimating how well a model would generalize to new data by testing the model against one or more non-overlapping data subsets withheld from the training set.

cumulative distribution function (CDF)

#Metric

A function that defines the frequency of samples less than or equal to a target value. For example, consider a normal distribution of continuous values. A CDF tells you that approximately 50% of samples should be less than or equal to the mean and that approximately 84% of samples should be less than or equal to one standard deviation above the mean.
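
For example, the two values quoted above can be checked with SciPy (the library choice is mine; this is an illustrative sketch, not part of the glossary):

```python
from scipy.stats import norm

# CDF of a standard normal distribution (mean 0, standard deviation 1).
print(norm.cdf(0))  # ≈ 0.50: about half the samples fall at or below the mean
print(norm.cdf(1))  # ≈ 0.84: ~84% fall at or below one standard deviation above it
```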

D

data analysis

Obtaining an understanding of data by considering samples, measurement, and visualization. Data analysis can be particularly useful when a dataset is first received, before one builds the first model. It is also crucial in understanding experiments and debugging problems with the system.

data augmentation

Artificially boosting the range and number of training examples by transforming existing examples to create additional examples. For example, suppose images are one of your features, but your dataset doesn't contain enough image examples for the model to learn useful associations. Ideally, you'd add enough labeled images to your dataset to enable your model to train properly. If that's not possible, data augmentation can rotate, stretch, and reflect each image to produce many variants of the original picture, possibly yielding enough labeled data to enable excellent training.

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

A DataFrame is structured like a 2D array, except that each column can be assigned its own data type.

See also the official pandas.DataFrame reference page.
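
A minimal, hypothetical example of building a DataFrame (the column names and values below are invented for illustration):

```python
import pandas as pd

# Each column has a name; each row gets a unique number (0, 1, 2, ...).
df = pd.DataFrame({
    "temperature": [15, 19, 18],                   # integer column
    "humidity": [47, 34, 92],                      # integer column
    "test_score": ["Good", "Excellent", "Poor"],   # string column
})

print(df.dtypes)  # each column carries its own data type
print(df)
```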

data parallelism

A way of scaling training or inference that replicates an entire model onto multiple devices and then passes a subset of the input data to each device. Data parallelism can enable training and inference on very large batch sizes; however, data parallelism requires that the model be small enough to fit on all devices.

Data parallelism typically speeds training and inference.

See alsomodel parallelism.

Dataset API (tf.data)

#TensorFlow

A high-level TensorFlow API for reading data and transforming it into a form that a machine learning algorithm requires. A tf.data.Dataset object represents a sequence of elements, in which each element contains one or more Tensors. A tf.data.Iterator object provides access to the elements of a Dataset.
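
A brief illustrative sketch of the API (the specific pipeline below is invented for this example):

```python
import tensorflow as tf

# Build a Dataset from in-memory values, then transform and batch it.
dataset = tf.data.Dataset.from_tensor_slices([1, 2, 3, 4, 5, 6])
dataset = dataset.map(lambda x: x * 2).batch(3)

for batch in dataset:       # a Dataset is directly iterable
    print(batch.numpy())    # [2 4 6], then [8 10 12]
```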

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format

decision boundary

The separator between classes learned by a model in a binary class or multi-class classification problem. For example, in the following image representing a binary classification problem, the decision boundary is the frontier between the orange class and the blue class:

A well-defined boundary between one class and another.

decision forest

#df

A model created from multiple decision trees. A decision forest makes a prediction by aggregating the predictions of its decision trees. Popular types of decision forests include random forests and gradient boosted trees.

See the Decision Forests section in the Decision Forests course for more information.

decision threshold

Synonym for classification threshold.

decision tree

#df

A supervised learning model composed of a set of conditions and leaves organized hierarchically. For example, the following is a decision tree:

A decision tree consisting of four conditions arranged hierarchically, which lead to five leaves.

decoder

In general, any ML system that converts from a processed, dense, or internal representation to a more raw, sparse, or external representation.

Decoders are often a component of a larger model, where they are frequently paired with an encoder.

In sequence-to-sequence tasks, a decoder starts with the internal state generated by the encoder to predict the next sequence.

Refer to Transformer for the definition of a decoder within the Transformer architecture.

See Large language models in Machine Learning Crash Course for more information.

deep model

#fundamentals

A neural network containing more than one hidden layer.

A deep model is also called a deep neural network.

Contrast with wide model.

deep neural network

Synonym for deep model.

Deep Q-Network (DQN)

In Q-learning, a deep neural network that predicts Q-functions.

Critic is a synonym for Deep Q-Network.

demographic parity

#responsible
#Metric

A fairness metric that is satisfied if the results of a model's classification are not dependent on a given sensitive attribute.

For example, if both Lilliputians and Brobdingnagians apply to Glubbdubdrib University, demographic parity is achieved if the percentage of Lilliputians admitted is the same as the percentage of Brobdingnagians admitted, irrespective of whether one group is on average more qualified than the other.

Contrast with equalized odds and equality of opportunity, which permit classification results in aggregate to depend on sensitive attributes, but don't permit classification results for certain specified ground truth labels to depend on sensitive attributes. See "Attacking discrimination with smarter machine learning" for a visualization exploring the tradeoffs when optimizing for demographic parity.

See Fairness: demographic parity in Machine Learning Crash Course for more information.

denoising

A common approach to self-supervised learning in which:

  1. Noise is artificially added to the dataset.
  2. The model tries to remove the noise.

Denoising enables learning from unlabeled examples. The original dataset serves as the target or label and the noisy data as the input.

Some masked language models use denoising as follows:

  1. Noise is artificially added to an unlabeled sentence by masking some of the tokens.
  2. The model tries to predict the original tokens.

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

| 8 | 3 | 7 | 5 | 2 | 4 | 0 | 4 | 9 | 6 |

Contrast with sparse feature.

dense layer

Synonym for fully connected layer.

depth

#fundamentals

The sum of the following in a neural network:

  • the number of hidden layers
  • the number of output layers, which is typically 1
  • the number of any embedding layers

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

depthwise separable convolutional neural network (sepCNN)

A convolutional neural network architecture based on Inception, but where Inception modules are replaced with depthwise separable convolutions. Also known as Xception.

A depthwise separable convolution (also abbreviated as separable convolution) factors a standard 3D convolution into two separate convolution operations that are more computationally efficient: first, a depthwise convolution, with a depth of 1 (n ✕ n ✕ 1), and then second, a pointwise convolution, with length and width of 1 (1 ✕ 1 ✕ n).

To learn more, see Xception: Deep Learning with Depthwise Separable Convolutions.

derived label

Synonym for proxy label.

device

#TensorFlow
#GoogleCloud

An overloaded term with the following two possible definitions:

  1. A category of hardware that can run a TensorFlow session, including CPUs, GPUs, and TPUs.
  2. When training an ML model on accelerator chips (GPUs or TPUs), the part of the system that actually manipulates tensors and embeddings. The device runs on accelerator chips. In contrast, the host typically runs on a CPU.

differential privacy

In machine learning, an anonymization approach to protect any sensitive data (for example, an individual's personal information) included in a model's training set from being exposed. This approach ensures that the model doesn't learn or remember much about a specific individual. This is accomplished by sampling and adding noise during model training to obscure individual data points, mitigating the risk of exposing sensitive training data.

Differential privacy is also used outside of machine learning. For example, data scientists sometimes use differential privacy to protect individual privacy when computing product usage statistics for different demographics.

dimension reduction

Decreasing the number of dimensions used to represent a particular feature in a feature vector, typically by converting to an embedding vector.

dimensions

Overloaded term having any of the following definitions:

  • The number of levels of coordinates in a Tensor. For example:

    • A scalar has zero dimensions; for example, "Hello".
    • A vector has one dimension; for example, [3, 5, 7, 11].
    • A matrix has two dimensions; for example, [[2, 4, 18], [5, 7, 14]]. You can uniquely specify a particular cell in a one-dimensional vector with one coordinate; you need two coordinates to uniquely specify a particular cell in a two-dimensional matrix.

  • The number of entries in a feature vector.

  • The number of elements in an embedding layer.

direct prompting

#generativeAI

Synonym for zero-shot prompting.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

Contrast with continuous feature.

discriminative model

A model that predicts labels from a set of one or more features. More formally, discriminative models define the conditional probability of an output given the features and weights; that is:

p(output | features, weights)

For example, a model that predicts whether an email is spam from features and weights is a discriminative model.

The vast majority of supervised learning models, including classification and regression models, are discriminative models.

Contrast with generative model.

discriminator

A system that determines whether examples are real or fake.

Alternatively, the subsystem within a generative adversarial network that determines whether the examples created by the generator are real or fake.

See The discriminator in the GAN course for more information.

disparate impact

#responsible

Making decisions about people that impact different population subgroups disproportionately. This usually refers to situations where an algorithmic decision-making process harms or benefits some subgroups more than others.

For example, suppose an algorithm that determines a Lilliputian's eligibility for a miniature-home loan is more likely to classify them as "ineligible" if their mailing address contains a certain postal code. If Big-Endian Lilliputians are more likely to have mailing addresses with this postal code than Little-Endian Lilliputians, then this algorithm may result in disparate impact.

Contrast with disparate treatment, which focuses on disparities that result when subgroup characteristics are explicit inputs to an algorithmic decision-making process.

disparate treatment

#responsible

Factoring subjects' sensitive attributes into an algorithmic decision-making process such that different subgroups of people are treated differently.

For example, consider an algorithm that determines Lilliputians' eligibility for a miniature-home loan based on the data they provide in their loan application. If the algorithm uses a Lilliputian's affiliation as Big-Endian or Little-Endian as an input, it is enacting disparate treatment along that dimension.

Contrast with disparate impact, which focuses on disparities in the societal impacts of algorithmic decisions on subgroups, irrespective of whether those subgroups are inputs to the model.

Warning: Because sensitive attributes are almost always correlated with other features the data may have, explicitly removing sensitive attribute information doesn't guarantee that subgroups will be treated equally. For example, removing sensitive demographic attributes from a training dataset that still includes postal code as a feature may address disparate treatment of subgroups, but there still might be disparate impact upon these groups because postal code might serve as a proxy for other demographic information.

distillation

#generativeAI

The process of reducing the size of one model (known as the teacher) into a smaller model (known as the student) that emulates the original model's predictions as faithfully as possible. Distillation is useful because the smaller model has two key benefits over the larger model (the teacher):

  • Faster inference time
  • Reduced memory and energy usage

However, the student's predictions are typically not as good as the teacher's predictions.

Distillation trains the student model to minimize a loss function based on the difference between the predictions of the student and teacher models.

See LLMs: Fine-tuning, distillation, and prompt engineering in Machine Learning Crash Course for more information.

distribution

The frequency and range of different values for a given feature or label. A distribution captures how likely a particular value is.

The following image shows histograms of two different distributions:

  • On the left, a power law distribution of wealth versus the number of people possessing that wealth.
  • On the right, a normal distribution of height versus the number of people possessing that height.

Two histograms. One histogram shows a power law distribution with wealth on the x-axis and number of people having that wealth on the y-axis. Most people have very little wealth, and a few people have a lot of wealth. The other histogram shows a normal distribution with height on the x-axis and number of people having that height on the y-axis. Most people are clustered somewhere near the mean.

Understanding each feature and label's distribution can help you determine how to normalize values and detect outliers.

The phrase out of distribution refers to a value that doesn't appear in the dataset or is very rare. For example, an image of the planet Saturn would be considered out of distribution for a dataset consisting of cat images.

divisive clustering

#clustering

See hierarchical clustering.

downsampling

Overloaded term that can mean either of the following:

  • Reducing the amount of information in a feature in order to train a model more efficiently. For example, before training an image recognition model, downsampling high-resolution images to a lower-resolution format.
  • Training on a disproportionately low percentage of over-represented class examples in order to improve model training on under-represented classes. For example, in a class-imbalanced dataset, models tend to learn a lot about the majority class and not enough about the minority class. Downsampling helps balance the amount of training on the majority and minority classes.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

DQN

Abbreviation for Deep Q-Network.

dropout regularization

A form of regularization useful in training neural networks. Dropout regularization removes a random selection of a fixed number of the units in a network layer for a single gradient step. The more units dropped out, the stronger the regularization. This is analogous to training the network to emulate an exponentially large ensemble of smaller networks. For full details, see Dropout: A Simple Way to Prevent Neural Networks from Overfitting.

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training) is the process of training frequently or continuously.
  • Dynamic inference (or online inference) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model.

Contrast with static model.

E

eager execution

#TensorFlow

A TensorFlow programming environment in which operations run immediately. In contrast, operations called in graph execution don't run until they are explicitly evaluated. Eager execution is an imperative interface, much like the code in most programming languages. Eager execution programs are generally far easier to debug than graph execution programs.

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.


Early stopping may seem counterintuitive. After all, telling a model to halt training while the loss is still decreasing may seem like telling a chef to stop cooking before the dessert has fully baked. However, training a model for too long can lead to overfitting. That is, if you train a model too long, the model may fit the training data so closely that the model doesn't make good predictions on new examples.


Contrast with early exit.

earth mover's distance (EMD)

#Metric

A measure of the relative similarity of two distributions. The lower the earth mover's distance, the more similar the distributions.

edit distance

#Metric

A measurement of how similar two text strings are to each other. In machine learning, edit distance is useful for the following reasons:

  • Edit distance is easy to compute.
  • Edit distance can compare two strings known to be similar to each other.
  • Edit distance can determine the degree to which different strings are similar to a given string.

Several definitions of edit distance exist, each using different string operations. See Levenshtein distance for an example.
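
For illustration, here is a compact implementation of Levenshtein distance, one common definition of edit distance (a sketch of mine, not code from the glossary):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))   # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```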

Einsum notation

An efficient notation for describing how two tensors are to be combined. The tensors are combined by multiplying the elements of one tensor by the elements of the other tensor and then summing the products. Einsum notation uses symbols to identify the axes of each tensor, and those same symbols are rearranged to specify the shape of the new resulting tensor.

NumPy provides a common Einsum implementation.
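
A small illustrative example: the subscript string 'ij,jk->ik' names the axes of the two input tensors and of the result, which here expresses ordinary matrix multiplication:

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # axes i, j
b = np.arange(12).reshape(3, 4)   # axes j, k

# Multiply along the shared axis j, sum the products, and lay the result
# out along axes i and k.
c = np.einsum('ij,jk->ik', a, b)
print(np.array_equal(c, a @ b))   # True
```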

embedding layer

#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

An array of 73,000 elements. The first 6,232 elements hold the value 0. The next element holds the value 1. The final 66,767 elements hold the value zero.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

embedding space

The d-dimensional vector space that features from a higher-dimensional vector space are mapped to. Embedding space is trained to capture structure that is meaningful for the intended application.

The dot product of two embeddings is a measure of their similarity.
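
As a toy illustration (the vectors below are invented, not real trained embeddings), a larger dot product indicates items that sit closer together in the embedding space:

```python
import numpy as np

redwood = np.array([0.9, 0.1, 0.8, 0.2])
sequoia = np.array([0.8, 0.2, 0.9, 0.1])
palm    = np.array([0.1, 0.9, 0.2, 0.8])

print(round(float(np.dot(redwood, sequoia)), 2))  # 1.48: related, high similarity
print(round(float(np.dot(redwood, palm)), 2))     # 0.50: dissimilar, lower similarity
```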

embedding vector

Broadly speaking, an array of floating-point numbers taken from any hidden layer that describe the inputs to that hidden layer. Often, an embedding vector is the array of floating-point numbers trained in an embedding layer. For example, suppose an embedding layer must learn an embedding vector for each of the 73,000 tree species on Earth. Perhaps the following array is the embedding vector for a baobab tree:

An array of 12 elements, each holding a floating-point number between 0.0 and 1.0.

An embedding vector is not a bunch of random numbers. An embedding layer determines these values through training, similar to the way a neural network learns other weights during training. Each element of the array is a rating along some characteristic of a tree species. Which element represents which tree species' characteristic? That's very hard for humans to determine.

The mathematically remarkable part of an embedding vector is that similar items have similar sets of floating-point numbers. For example, similar tree species have a more similar set of floating-point numbers than dissimilar tree species. Redwoods and sequoias are related tree species, so they'll have a more similar set of floating-point numbers than redwoods and coconut palms. The numbers in the embedding vector will change each time you retrain the model, even if you retrain the model with identical input.

empirical cumulative distribution function (eCDF or EDF)

#Metric

A cumulative distribution function based on empirical measurements from a real dataset. The value of the function at any point along the x-axis is the fraction of observations in the dataset that are less than or equal to the specified value.

empirical risk minimization (ERM)

Choosing the function that minimizes loss on the training set. Contrast with structural risk minimization.

encoder

In general, any ML system that converts from a raw, sparse, or external representation into a more processed, denser, or more internal representation.

Encoders are often a component of a larger model, where they are frequently paired with a decoder. Some Transformers pair encoders with decoders, though other Transformers use only the encoder or only the decoder.

Some systems use the encoder's output as the input to a classification or regression network.

In sequence-to-sequence tasks, an encoder takes an input sequence and returns an internal state (a vector). Then, the decoder uses that internal state to predict the next sequence.

Refer to Transformer for the definition of an encoder in the Transformer architecture.

See LLMs: What's a large language model in Machine Learning Crash Course for more information.

endpoints

A network-addressable location (typically a URL) where a service can be reached.

ensemble

A collection of models trained independently whose predictions are averaged or aggregated. In many cases, an ensemble produces better predictions than a single model. For example, a random forest is an ensemble built from multiple decision trees. Note that not all decision forests are ensembles.

See Random Forest in Machine Learning Crash Course for more information.

entropy

#df
#Metric

In information theory, a description of how unpredictable a probability distribution is. Alternatively, entropy is also defined as how much information each example contains. A distribution has the highest possible entropy when all values of a random variable are equally likely.

The entropy of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) has the following formula:

$$H = -p \log p - q \log q = -p \log p - (1-p) \log (1-p)$$

where:

  • H is the entropy.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = (1 - p).
  • log is generally log2. In this case, the entropy unit is a bit.

For example, suppose the following:

  • 100 examples contain the value "1"
  • 300 examples contain the value "0"

Therefore, the entropy value is:

  • p = 0.25
  • q = 0.75
  • H = -0.25 log2(0.25) - 0.75 log2(0.75) = 0.81 bits per example

A set that is perfectly balanced (for example, 200 "0"s and 200 "1"s)would have an entropy of 1.0 bit per example. As a set becomes moreimbalanced, its entropy moves towards 0.0.
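
The worked example above can be verified with a few lines of Python (an illustrative sketch, not glossary code):

```python
import math

def entropy(p: float) -> float:
    """Entropy in bits of a two-valued set, where p is the fraction of '1's
    (0 < p < 1)."""
    q = 1 - p
    return -p * math.log2(p) - q * math.log2(q)

print(round(entropy(0.25), 2))  # 0.81 bits per example, as computed above
print(entropy(0.5))             # 1.0 bit: a perfectly balanced set
```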

In decision trees, entropy helps formulate information gain to help the splitter select the conditions during the growth of a classification decision tree.

Compare entropy with gini impurity.

Entropy is often called Shannon's entropy.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

environment

In reinforcement learning, the world that contains the agent and allows the agent to observe that world's state. For example, the represented world can be a game like chess, or a physical world like a maze. When the agent applies an action to the environment, then the environment transitions between states.

episode

In reinforcement learning, each of the repeated attempts by the agent to learn an environment.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N/batch size training iterations, where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

epsilon greedy policy

In reinforcement learning, a policy that either follows a random policy with epsilon probability or a greedy policy otherwise. For example, if epsilon is 0.9, then the policy follows a random policy 90% of the time and a greedy policy 10% of the time.

Over successive episodes, the algorithm reduces epsilon's value in order to shift from following a random policy to following a greedy policy. By shifting the policy, the agent first randomly explores the environment and then greedily exploits the results of random exploration.

equality of opportunity

#responsible
#Metric

A fairness metric to assess whether a model is predicting the desirable outcome equally well for all values of a sensitive attribute. In other words, if the desirable outcome for a model is the positive class, the goal would be to have the true positive rate be the same for all groups.

Equality of opportunity is related to equalized odds, which requires that both the true positive rates and false positive rates are the same for all groups.

Suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equality of opportunity is satisfied for the preferred label of "admitted" with respect to nationality (Lilliputian or Brobdingnagian) if qualified students are equally likely to be admitted irrespective of whether they're a Lilliputian or a Brobdingnagian.

For example, suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 1. Lilliputian applicants (90% are qualified)

|          | Qualified | Unqualified |
|----------|-----------|-------------|
| Admitted | 45        | 3           |
| Rejected | 45        | 7           |
| Total    | 90        | 10          |

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 7/10 = 70%
Total percentage of Lilliputian students admitted: (45+3)/100 = 48%

Table 2. Brobdingnagian applicants (10% are qualified):

|          | Qualified | Unqualified |
|----------|-----------|-------------|
| Admitted | 5         | 9           |
| Rejected | 5         | 81          |
| Total    | 10        | 90          |

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 81/90 = 90%
Total percentage of Brobdingnagian students admitted: (5+9)/100 = 14%

The preceding examples satisfy equality of opportunity for acceptance of qualified students because qualified Lilliputians and Brobdingnagians both have a 50% chance of being admitted.

While equality of opportunity is satisfied, the following two fairness metrics are not satisfied:

  • demographic parity: Lilliputians and Brobdingnagians are admitted to the university at different rates; 48% of Lilliputian students are admitted, but only 14% of Brobdingnagian students are admitted.
  • equalized odds: While qualified Lilliputian and Brobdingnagian students both have the same chance of being admitted, the additional constraint that unqualified Lilliputians and Brobdingnagians both have the same chance of being rejected is not satisfied. Unqualified Lilliputians have a 70% rejection rate, whereas unqualified Brobdingnagians have a 90% rejection rate.

See Fairness: Equality of opportunity in Machine Learning Crash Course for more information.

equalized odds

#responsible
#Metric

A fairness metric to assess whether a model is predicting outcomes equally well for all values of a sensitive attribute with respect to both the positive class and negative class—not just one class or the other exclusively. In other words, both the true positive rate and false positive rate should be the same for all groups.

Equalized odds is related to equality of opportunity, which only focuses on error rates for a single class (positive or negative).

For example, suppose Glubbdubdrib University admits both Lilliputians and Brobdingnagians to a rigorous mathematics program. Lilliputians' secondary schools offer a robust curriculum of math classes, and the vast majority of students are qualified for the university program. Brobdingnagians' secondary schools don't offer math classes at all, and as a result, far fewer of their students are qualified. Equalized odds is satisfied provided that no matter whether an applicant is a Lilliputian or a Brobdingnagian, if they are qualified, they are equally as likely to get admitted to the program, and if they are not qualified, they are equally as likely to get rejected.

Suppose 100 Lilliputians and 100 Brobdingnagians apply to Glubbdubdrib University, and admissions decisions are made as follows:

Table 3. Lilliputian applicants (90% are qualified)

|          | Qualified | Unqualified |
|----------|-----------|-------------|
| Admitted | 45        | 2           |
| Rejected | 45        | 8           |
| Total    | 90        | 10          |

Percentage of qualified students admitted: 45/90 = 50%
Percentage of unqualified students rejected: 8/10 = 80%
Total percentage of Lilliputian students admitted: (45+2)/100 = 47%

Table 4. Brobdingnagian applicants (10% are qualified):

|          | Qualified | Unqualified |
|----------|-----------|-------------|
| Admitted | 5         | 18          |
| Rejected | 5         | 72          |
| Total    | 10        | 90          |

Percentage of qualified students admitted: 5/10 = 50%
Percentage of unqualified students rejected: 72/90 = 80%
Total percentage of Brobdingnagian students admitted: (5+18)/100 = 23%

Equalized odds is satisfied because qualified Lilliputian and Brobdingnagian students both have a 50% chance of being admitted, and unqualified Lilliputian and Brobdingnagian students both have an 80% chance of being rejected.

Note: While equalized odds is satisfied here, demographic parity is not satisfied. Lilliputian and Brobdingnagian students are admitted to Glubbdubdrib University at different rates; 47% of Lilliputian students are admitted, and 23% of Brobdingnagian students are admitted.

Equalized odds is formally defined in "Equality of Opportunity in Supervised Learning" as follows: "predictor Ŷ satisfies equalized odds with respect to protected attribute A and outcome Y if Ŷ and A are independent, conditional on Y."

Note: Contrast equalized odds with the more relaxed equality of opportunity metric.

Estimator

#TensorFlow

A deprecated TensorFlow API. Use tf.keras instead of Estimators.

evals

#generativeAI
#Metric

Primarily used as an abbreviation for LLM evaluations. More broadly, evals is an abbreviation for any form of evaluation.

evaluation

#generativeAI
#Metric

The process of measuring a model's quality or comparing different models against each other.

To evaluate a supervised machine learning model, you typically judge it against a validation set and a test set. Evaluating an LLM typically involves broader quality and safety assessments.

exact match

#Metric

An all-or-nothing metric in which the model's output either matches ground truth or the reference text exactly or it doesn't. For example, if ground truth is orange, the only model output that satisfies exact match is orange.

Exact match can also evaluate models whose output is a sequence (a ranked list of items). In general, exact match requires the generated ranked list to exactly match ground truth; that is, each item in both lists must be in the same order. That said, if ground truth consists of multiple correct sequences, then exact match only requires the model's output to match one of the correct sequences.
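
A minimal sketch of the metric (the helper function and its signature are hypothetical, not from the glossary):

```python
def exact_match(model_output: str, correct_answers: list[str]) -> int:
    """Returns 1 if the output matches any correct answer exactly, else 0."""
    return int(model_output in correct_answers)

print(exact_match("orange", ["orange"]))   # 1
print(exact_match("Orange!", ["orange"]))  # 0: all-or-nothing, no partial credit
print(exact_match("b", ["a", "b"]))        # 1: matching any correct answer suffices
```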

example

#fundamentals

The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

| Temperature | Humidity | Pressure | Test score (label) |
|-------------|----------|----------|--------------------|
| 15          | 47       | 998      | Good               |
| 19          | 34       | 1020     | Excellent          |
| 18          | 92       | 1012     | Poor               |

Here are three unlabeled examples:

| Temperature | Humidity | Pressure |
|-------------|----------|----------|
| 12          | 62       | 1014     |
| 21          | 47       | 1017     |
| 19          | 41       | 1021     |

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features, such as feature crosses.

See Supervised Learning in the Introduction to Machine Learning course for more information.

experience replay

In reinforcement learning, a DQN technique used to reduce temporal correlations in training data. The agent stores state transitions in a replay buffer, and then samples transitions from the replay buffer to create training data.

experimenter's bias

#responsible

See confirmation bias.

exploding gradient problem

The tendency for gradients in deep neural networks (especially recurrent neural networks) to become surprisingly steep (high). Steep gradients often cause very large updates to the weights of each node in a deep neural network.

Models suffering from the exploding gradient problem become difficult or impossible to train. Gradient clipping can mitigate this problem.

Compare to vanishing gradient problem.

Extreme Summarization (xsum)

#Metric

A dataset for evaluating an LLM's ability to summarize a single document. Each entry in the dataset consists of:

  • A document authored by the British Broadcasting Corporation (BBC).
  • A one-sentence summary of that document.

For details, see Don't Give Me the Details, Just the Summary! Topic-Aware Convolutional Neural Networks for Extreme Summarization.

F

F1

#Metric

A "roll-up"binary classification metric thatrelies on bothprecision andrecall.Here is the formula:

$$F{_1} =\frac{\text{2 * precision * recall}} {\text{precision + recall}}$$


Suppose precision and recall have the following values:

  • precision = 0.6
  • recall = 0.4

You calculate F1 as follows:

$$F{_1} =\frac{\text{2 * 0.6 * 0.4}} {\text{0.6 + 0.4}} = 0.48$$

When precision and recall are fairly similar (as in the preceding example), F1 is close to their mean. When precision and recall differ significantly, F1 is closer to the lower value. For example:

  • precision = 0.9
  • recall = 0.1
$$F{_1} =\frac{\text{2 * 0.9 * 0.1}} {\text{0.9 + 0.1}} = 0.18$$
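
Both worked examples can be verified with a tiny helper (illustrative only, not glossary code):

```python
def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1(0.6, 0.4), 2))  # 0.48: close to the mean when the values are similar
print(round(f1(0.9, 0.1), 2))  # 0.18: pulled toward the much lower value
```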

factuality

#generativeAI

Within the ML world, a property describing a model whose output is based on reality. Factuality is a concept rather than a metric. For example, suppose you send the following prompt to a large language model:

What is the chemical formula for table salt?

A model optimizing factuality would respond:

NaCl

It is tempting to assume that all models should be based on factuality. However, some prompts, such as the following, should cause a generative AI model to optimize creativity rather than factuality.

Tell me a limerick about an astronaut and a caterpillar.

It is unlikely that the resulting limerick would be based on reality.

Contrast with groundedness.

fairness constraint

#responsible

Applying a constraint to an algorithm to ensure one or more definitions of fairness are satisfied. Examples of fairness constraints include:

  • Post-processing your model's output.
  • Altering the loss function to incorporate a penalty for violating a fairness metric.
  • Directly adding a mathematical constraint to an optimization problem.

fairness metric

#responsible
#Metric

A mathematical definition of "fairness" that is measurable. Some commonly used fairness metrics include:

  • demographic parity
  • equality of opportunity
  • equalized odds
  • counterfactual fairness

Many fairness metrics are mutually exclusive; see incompatibility of fairness metrics.

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false negative rate

#Metric

The proportion of actual positive examples for which the model mistakenly predicted the negative class. The following formula calculates the false negative rate:

$$\text{false negative rate} =\frac{\text{false negatives}}{\text{false negatives} + \text{true positives}}$$

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} =\frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

fast decay

#generativeAI

A training technique to improve the performance of LLMs. Fast decay involves rapidly decreasing the learning rate during training. This strategy helps prevent the model from overfitting to the training data, and improves generalization.

feature

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

| Temperature | Humidity | Pressure | Test score (label) |
|-------------|----------|----------|--------------------|
| 15          | 47       | 998      | 92                 |
| 19          | 34       | 1020     | 84                 |
| 18          | 92       | 1012     | 87                 |

Contrast with label.

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that represents temperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product.
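
For illustration, the 12 crossed values above can be generated as a Cartesian product with Python's standard library (a sketch, not code from the glossary):

```python
import itertools

temperature = ["freezing", "chilly", "temperate", "warm"]
wind_speed = ["still", "light", "windy"]

# The feature cross is the Cartesian product of the two bucketed features.
crossed = [f"{t}-{w}" for t, w in itertools.product(temperature, wind_speed)]

print(len(crossed))  # 12
print(crossed[:3])   # ['freezing-still', 'freezing-light', 'freezing-windy']
```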

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

feature engineering

#fundamentals
#TensorFlow

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization.


In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform.


See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feature extraction

Overloaded term having either of the following definitions:

  • Retrieving intermediate feature representations calculated by an unsupervised or pre-trained model (for example, hidden layer values in a neural network) for use in another model as input.
  • Synonym for feature engineering.

feature importances

#df
#Metric

Synonym for variable importances.

feature set

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

feature spec

#TensorFlow

Describes the information required to extract features data from the tf.Example protocol buffer. Because the tf.Example protocol buffer is just a container for data, you must specify the following:

  • The data to extract (that is, the keys for the features)
  • The data type (for example, float or int)
  • The length (fixed or variable)

feature vector

#fundamentals

The array of feature values comprising an example. The feature vector is input during training and during inference. For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer. The input layer contains two nodes, one containing the value 0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

  • a categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
  • another categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
  • a floating-point feature; for example: 8.3

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
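
As an illustrative sketch, that nine-value vector is just the concatenation of the three per-feature representations:

```python
# The three features from the preceding list.
one_hot_five = [0.0, 1.0, 0.0, 0.0, 0.0]  # categorical, five possible values
one_hot_three = [0.0, 0.0, 1.0]           # categorical, three possible values
floating_point = [8.3]                    # a single floating-point feature

feature_vector = one_hot_five + one_hot_three + floating_point
print(feature_vector)       # [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
print(len(feature_vector))  # 9
```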

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

featurization

The process of extracting features from an input source, such as a document or video, and mapping those features into a feature vector.

Some ML experts use featurization as a synonym for feature engineering or feature extraction.

federated learning

A distributed machine learning approach that trains machine learning models using decentralized examples residing on devices such as smartphones. In federated learning, a subset of devices downloads the current model from a central coordinating server. The devices use the examples stored on the devices to make improvements to the model. The devices then upload the model improvements (but not the training examples) to the coordinating server, where they are aggregated with other updates to yield an improved global model. After the aggregation, the model updates computed by devices are no longer needed, and can be discarded.

Since the training examples are never uploaded, federated learning follows the privacy principles of focused data collection and data minimization.

See the Federated Learning comic (yes, a comic) for more details.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

feedforward neural network (FFN)

A neural network without cyclic or recursive connections. For example, traditional deep neural networks are feedforward neural networks. Contrast with recurrent neural networks, which are cyclic.

few-shot learning

A machine learning approach, often used for object classification, designed to train effective classification models from only a small number of training examples.

See also one-shot learning and zero-shot learning.

few-shot prompting

#generativeAI

A prompt that contains more than one (a "few") example demonstrating how the large language model should respond. For example, the following lengthy prompt contains two examples showing a large language model how to answer a query.

| Parts of one prompt | Notes |
|---------------------|-------|
| What is the official currency of the specified country? | The question you want the LLM to answer. |
| France: EUR | One example. |
| United Kingdom: GBP | Another example. |
| India: | The actual query. |

Few-shot prompting generally produces more desirable results than zero-shot prompting and one-shot prompting. However, few-shot prompting requires a lengthier prompt.

Few-shot prompting is a form of few-shot learning applied to prompt-based learning.

See Prompt engineering in Machine Learning Crash Course for more information.

Fiddle

A Python-first configuration library that sets the values of functions and classes without invasive code or infrastructure. In the case of Pax—and other ML codebases—these functions and classes represent models and training hyperparameters.

Fiddle assumes that machine learning codebases are typically divided into:

  • Library code, which defines the layers and optimizers.
  • Dataset "glue" code, which calls the libraries and wires everything together.

Fiddle captures the call structure of the glue code in an unevaluated and mutable form.

fine-tuning

#generativeAI

A second, task-specific training pass performed on a pre-trained model to refine its parameters for a specific use case. For example, the full training sequence for some large language models is as follows:

  1. Pre-training: Train a large language model on a vast general dataset, such as all the English language Wikipedia pages.
  2. Fine-tuning: Train the pre-trained model to perform a specific task, such as responding to medical queries. Fine-tuning typically involves hundreds or thousands of examples focused on the specific task.

As another example, the full training sequence for a large image model is asfollows:

  1. Pre-training: Train a large image model on a vast general image dataset, such as all the images in Wikimedia commons.
  2. Fine-tuning: Train the pre-trained model to perform a specific task, such as generating images of orcas.

Fine-tuning can entail any combination of the following strategies:

  • Modifying all of the pre-trained model's existing parameters. This is sometimes called full fine-tuning.
  • Modifying only some of the pre-trained model's existing parameters (typically, the layers closest to the output layer), while keeping other existing parameters unchanged (typically, the layers closest to the input layer). See parameter-efficient tuning.
  • Adding more layers, typically on top of the existing layers closest to the output layer.

Fine-tuning is a form of transfer learning. As such, fine-tuning might use a different loss function or a different model type than those used to train the pre-trained model. For example, you could fine-tune a pre-trained large image model to produce a regression model that returns the number of birds in an input image.
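
As an illustration only, here is one way the second and third strategies might look in Keras; the base model, the frozen layers, and the bird-count regression head are hypothetical stand-ins, not a prescribed recipe:

  import keras

  # Hypothetical pre-trained image model; freeze all pre-trained parameters
  # (the layers closest to the input stay unchanged).
  base = keras.applications.MobileNetV2(
      weights="imagenet", include_top=False, pooling="avg")
  base.trainable = False

  # Add a new layer near the output: a regression head that could be
  # fine-tuned to predict, say, the number of birds in an image.
  model = keras.Sequential([base, keras.layers.Dense(1)])
  model.compile(optimizer=keras.optimizers.Adam(1e-4), loss="mse")
  # model.fit(fine_tuning_images, bird_counts, epochs=5)  # task-specific data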

Compare and contrast fine-tuning with the following terms:

See Fine-tuning in Machine Learning Crash Course for more information.

Flash model

#generativeAI

A family of relatively small Gemini models optimized for speed and low latency. Flash models are designed for a wide range of applications where quick responses and high throughput are crucial.

Flax

A high-performance open-source library for deep learning built on top of JAX. Flax provides functions for training neural networks, as well as methods for evaluating their performance.

Flaxformer

An open-source Transformer library, built on Flax, designed primarily for natural language processing and multimodal research.

forget gate

The portion of a Long Short-Term Memory cell that regulates the flow of information through the cell. Forget gates maintain context by deciding which information to discard from the cell state.

foundation model

#generativeAI
#Metric

A very large pre-trained model trained on an enormous and diverse training set. A foundation model can do both of the following:

  • Respond well to a wide range of requests.
  • Serve as a base model for additional fine-tuning or other customization.

In other words, a foundation model is already very capable in a general sense but can be further customized to become even more useful for a specific task.

fraction of successes

#generativeAI
#Metric

A metric for evaluating an ML model's generated text. The fraction of successes is the number of "successful" generated text outputs divided by the total number of generated text outputs. For example, if a large language model generated 10 blocks of code, five of which were successful, then the fraction of successes would be 50%.

Although fraction of successes is broadly useful throughout statistics, within ML, this metric is primarily useful for measuring verifiable tasks like code generation or math problems.

full softmax

Synonym for softmax.

Contrast with candidate sampling.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

fully connected layer

A hidden layer in which each node is connected to every node in the subsequent hidden layer.

A fully connected layer is also known as a dense layer.

function transformation

A function that takes a function as input and returns a transformed function as output. JAX uses function transformations.

G

GAN

Abbreviation for generative adversarial network.

Gemini

#generativeAI

The ecosystem comprising Google's most advanced AI. Elements of this ecosystem include:

  • Various Gemini models.
  • The interactive conversational interface to a Gemini model. Users type prompts and Gemini responds to those prompts.
  • Various Gemini APIs.
  • Various business products based on Gemini models; for example, Gemini for Google Cloud.

Gemini models

#generativeAI

Google's state-of-the-art Transformer-based multimodal models. Gemini models are specifically designed to integrate with agents.

Users can interact with Gemini models in a variety of ways, including through an interactive dialog interface and through SDKs.

Gemma

#generativeAI

A family of lightweight open models built from the same research and technology used to create the Gemini models. Several different Gemma models are available, each providing different features, such as vision, code, and instruction following. See Gemma for details.

GenAI or genAI

#generativeAI

Abbreviation for generative AI.

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting.


You train a model on the examples in the training set. Consequently, the model learns the peculiarities of the data in the training set. Generalization essentially asks whether your model can make good predictions on examples that are not in the training set.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.


See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations.

A generalization curve can help you detect possible overfitting. For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

A Cartesian graph in which the y-axis is labeled loss and the x-axis is labeled iterations. Two plots appear. One plot shows the training loss and the other shows the validation loss. The two plots start off similarly, but the training loss eventually dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

generalized linear model

A generalization of least squares regression models, which are based on Gaussian noise, to other types of models based on other types of noise, such as Poisson noise or categorical noise. Examples of generalized linear models include:

The parameters of a generalized linear model can be found through convex optimization.

Generalized linear models exhibit the following properties:

  • The average prediction of the optimal least squares regression model is equal to the average label on the training data.
  • The average probability predicted by the optimal logistic regression model is equal to the average label on the training data.

The power of a generalized linear model is limited by its features. Unlike a deep model, a generalized linear model cannot "learn new features."

generated text

#generativeAI

In general, the text that an ML model outputs. When evaluating large language models, some metrics compare generated text against reference text. For example, suppose you are trying to determine how effectively an ML model translates from French to Dutch. In this case:

  • The generated text is the Dutch translation that the ML model outputs.
  • The reference text is the Dutch translation that a human translator (or software) creates.

Note that some evaluation strategies don't involve reference text.

generative adversarial network (GAN)

A system to create new data in which a generator creates data and a discriminator determines whether that created data is valid or invalid.

See the Generative Adversarial Networks course for more information.

generative AI

#generativeAI

An emerging transformative field with no formal definition. That said, most experts agree that generative AI models can create ("generate") content that is all of the following:

  • complex
  • coherent
  • original

Examples of generative AI include:

  • Large language models, which can generate sophisticated original text and answer questions.
  • Image generation models, which can produce unique images.
  • Audio and music generation models, which can compose original music or generate realistic speech.
  • Video generation models, which can generate original videos.

Some earlier technologies, including LSTMs and RNNs, can also generate original and coherent content. Some experts view these earlier technologies as generative AI, while others feel that true generative AI requires more complex output than those earlier technologies can produce.

Contrast with predictive ML.

generative model

Practically speaking, a model that does either of the following:

  • Creates (generates) new examples from the training dataset. For example, a generative model could create poetry after training on a dataset of poems. The generator part of a generative adversarial network falls into this category.
  • Determines the probability that a new example comes from the training set, or was created from the same mechanism that created the training set. For example, after training on a dataset consisting of English sentences, a generative model could determine the probability that new input is a valid English sentence.

A generative model can theoretically discern the distribution of examples or particular features in a dataset. That is:

p(examples)

Unsupervised learning models are generative.

Contrast with discriminative models.

generator

The subsystem within a generative adversarial network that creates new examples.

Contrast with discriminative model.

gini impurity

#df
#Metric

A metric similar to entropy. Splitters use values derived from either gini impurity or entropy to compose conditions for classification decision trees. Information gain is derived from entropy. No universally accepted equivalent term for the metric derived from gini impurity exists; however, this unnamed metric is just as important as information gain.

Gini impurity is also called gini index, or simply gini.


Gini impurity is the probability of misclassifying a new piece of data taken from the same distribution. The gini impurity of a set with two possible values "0" and "1" (for example, the labels in a binary classification problem) is calculated from the following formula:

$$I = 1 - (p^2 + q^2) = 1 - (p^2 + (1-p)^2)$$

where:

  • I is the gini impurity.
  • p is the fraction of "1" examples.
  • q is the fraction of "0" examples. Note that q = 1 - p.

For example, consider the following dataset:

  • 100 labels (0.25 of the dataset) contain the value "1"
  • 300 labels (0.75 of the dataset) contain the value "0"

Therefore, the gini impurity is:

  • p = 0.25
  • q = 0.75
  • I = 1 - (0.25² + 0.75²) = 0.375

Consequently, a random label from the same dataset would have a 37.5% chance of being misclassified, and a 62.5% chance of being properly classified.

A perfectly balanced label (for example, 200 "0"s and 200 "1"s) would have a gini impurity of 0.5. A highly imbalanced label would have a gini impurity close to 0.0.
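
The formula translates directly into code; a minimal sketch that reproduces the worked example above:

  def gini_impurity(p):
      """Gini impurity of a binary label set, where p is the fraction of "1"s."""
      return 1.0 - (p**2 + (1.0 - p)**2)

  print(gini_impurity(0.25))  # 0.375, matching the example above
  print(gini_impurity(0.5))   # 0.5, a perfectly balanced label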


golden dataset

A set of manually curated data that captures ground truth. Teams can use one or more golden datasets to evaluate a model's quality.

Some golden datasets capture different subdomains of ground truth. For example, a golden dataset for image classification might capture lighting conditions and image resolution.

golden response

#generativeAI

A response known to be good. For example, given the following prompt:

2 + 2

The golden response is hopefully:

4

Note: Some organizations define additional terms such as silver response and platinum response for responses of lower or higher quality, respectively, than the golden response. For example, an organization might use platinum response to indicate a golden response generated by an expert and then further vetted by other experts.


Some evaluation metrics, such as ROUGE, compare reference text to a model's generated text. When there is a single right answer to a prompt, the golden response typically serves as the reference text.

Some prompts have no one right answer. For example, the prompt Summarize this document would likely have many right answers. For such prompts, reference text is often impractical because a model can generate a very wide range of possible summaries. However, a golden response might be helpful in this situation. For example, a golden response containing a good document summary can help train an autorater to discover patterns of good document summaries.


Google AI Studio

A Google tool providing a user-friendly interface for experimenting with and building applications using Google's large language models. See the Google AI Studio home page for details.

GPT (Generative Pre-trained Transformer)

#generativeAI

A family of Transformer-based large language models developed by OpenAI.

GPT variants can apply to multiple modalities, including:

  • image generation (for example, ImageGPT)
  • text-to-image generation (for example, DALL-E).

gradient

The vector of partial derivatives with respect to all of the independent variables. In machine learning, the gradient is the vector of partial derivatives of the model function. The gradient points in the direction of steepest ascent.

gradient accumulation

A backpropagation technique that updates the parameters only once per epoch rather than once per iteration. After processing each mini-batch, gradient accumulation simply updates a running total of gradients. Then, after processing the last mini-batch in the epoch, the system finally updates the parameters based on the total of all gradient changes.

Gradient accumulation is useful when the batch size is very large compared to the amount of available memory for training. When memory is an issue, the natural tendency is to reduce batch size. However, reducing the batch size in normal backpropagation increases the number of parameter updates. Gradient accumulation enables the model to avoid memory issues but still train efficiently.
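
A framework-agnostic sketch of the idea on a toy linear-regression problem; the synthetic data, the batch size of 8, and the choice to average the accumulated gradients are illustrative assumptions:

  import numpy as np

  rng = np.random.default_rng(0)
  X, y = rng.normal(size=(64, 3)), rng.normal(size=64)
  w, learning_rate, batch_size = np.zeros(3), 0.1, 8

  accumulated, num_batches = np.zeros_like(w), 0
  for start in range(0, len(X), batch_size):
      xb, yb = X[start:start + batch_size], y[start:start + batch_size]
      grad = 2 * xb.T @ (xb @ w - yb) / len(xb)  # mini-batch MSE gradient
      accumulated += grad                        # running total; no update yet
      num_batches += 1
  # A single parameter update for the whole epoch:
  w -= learning_rate * accumulated / num_batches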

gradient boosted (decision) trees (GBT)

#df

A type ofdecision forest in which:

See Gradient Boosted Decision Trees in the Decision Forests course for more information.

gradient boosting

#df

A training algorithm where weak models are trained to iteratively improve the quality (reduce the loss) of a strong model. For example, a weak model could be a linear or small decision tree model. The strong model becomes the sum of all the previously trained weak models.

In the simplest form of gradient boosting, at each iteration, a weak model is trained to predict the loss gradient of the strong model. Then, the strong model's output is updated by subtracting the predicted gradient, similar to gradient descent.

$$F_{0} = 0$$
$$F_{i+1} = F_i - \xi f_i$$

where:

  • $F_{0}$ is the starting strong model.
  • $F_{i+1}$ is the next strong model.
  • $F_{i}$ is the current strong model.
  • $\xi$ is a value between 0.0 and 1.0 called shrinkage, which is analogous to the learning rate in gradient descent.
  • $f_{i}$ is the weak model trained to predict the loss gradient of $F_{i}$.

Modern variations of gradient boosting also include the second derivative (Hessian) of the loss in their computation.

Decision trees are commonly used as weak models in gradient boosting. See gradient boosted (decision) trees.
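
A minimal sketch of the squared-loss case, where the negative loss gradient is simply the residual; the synthetic data, tree depth, and shrinkage value are arbitrary choices for illustration:

  import numpy as np
  from sklearn.tree import DecisionTreeRegressor

  rng = np.random.default_rng(0)
  X = rng.uniform(-3, 3, size=(200, 1))
  y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

  shrinkage, prediction = 0.3, np.zeros_like(y)
  for _ in range(50):
      residual = y - prediction            # negative gradient of squared loss
      weak = DecisionTreeRegressor(max_depth=2).fit(X, residual)
      # Equivalent to F_{i+1} = F_i - ξ f_i, since the tree fits the
      # negative of the loss gradient:
      prediction += shrinkage * weak.predict(X)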

gradient clipping

A commonly used mechanism to mitigate the exploding gradient problem by artificially limiting (clipping) the maximum value of gradients when using gradient descent to train a model.
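
Two common variants, sketched with NumPy: clipping each component by value (as described above) and the closely related clipping by norm:

  import numpy as np

  def clip_by_value(grad, limit):
      """Cap each gradient component to the range [-limit, limit]."""
      return np.clip(grad, -limit, limit)

  def clip_by_norm(grad, max_norm):
      """Rescale the whole gradient if its L2 norm exceeds max_norm."""
      norm = np.linalg.norm(grad)
      return grad if norm <= max_norm else grad * (max_norm / norm)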

gradient descent

#fundamentals

A mathematical technique to minimize loss. Gradient descent iteratively adjusts weights and biases, gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.
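
A minimal sketch on a one-feature linear model, assuming squared loss and a made-up dataset:

  import numpy as np

  x = np.array([1.0, 2.0, 3.0, 4.0])
  y = np.array([3.0, 5.0, 7.0, 9.0])   # underlying relationship: y = 2x + 1
  w, b, learning_rate = 0.0, 0.0, 0.05

  for _ in range(2000):
      error = (w * x + b) - y
      w -= learning_rate * 2 * np.mean(error * x)  # partial derivative w.r.t. w
      b -= learning_rate * 2 * np.mean(error)      # partial derivative w.r.t. b

  print(round(w, 2), round(b, 2))  # converges toward 2.0 and 1.0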

See Linear regression: Gradient descent in Machine Learning Crash Course for more information.

graph

#TensorFlow

In TensorFlow, a computation specification. Nodes in the graph represent operations. Edges are directed and represent passing the result of an operation (a Tensor) as an operand to another operation. Use TensorBoard to visualize a graph.

graph execution

#TensorFlow

A TensorFlow programming environment in which the program first constructs a graph and then executes all or part of that graph. Graph execution is the default execution mode in TensorFlow 1.x.

Contrast with eager execution.

greedy policy

In reinforcement learning, a policy that always chooses the action with the highest expected return.

groundedness

A property of a model whose output is based on (is "grounded on") specific source material. For example, suppose you provide an entire physics textbook as input ("context") to a large language model. Then, you prompt that large language model with a physics question. If the model's response reflects information in that textbook, then that model is grounded on that textbook.

Note that a grounded model is not always a factual model. For example, the input physics textbook could contain mistakes.

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.


We assess model quality against ground truth. However, ground truth is not always completely, well, truthful. For example, consider the following examples of potential imperfections in ground truth:

  • In the graduation example, are we certain that the graduation records for each student are always correct? Is the university's record-keeping flawless?
  • Suppose the label is a floating-point value measured by instruments (for example, barometers). How can we be sure that each instrument is calibrated identically or that each reading was taken under the same circumstances?
  • If the label is a matter of human opinion, how can we be sure that each human rater is evaluating events in the same way? To improve consistency, expert human raters sometimes intervene.

group attribution bias

#responsible

Assuming that what is true for an individual is also true for everyone in that group. The effects of group attribution bias can be exacerbated if convenience sampling is used for data collection. In a non-representative sample, attributions may be made that don't reflect reality.

See also out-group homogeneity bias and in-group bias. Also, see Fairness: Types of bias in Machine Learning Crash Course for more information.

H

hallucination

#generativeAI

The production of plausible-seeming but factually incorrect output by a generative AI model that purports to be making an assertion about the real world. For example, a generative AI model that claims that Barack Obama died in 1865 is hallucinating.

hashing

In machine learning, a mechanism for bucketing categorical data, particularly when the number of categories is large, but the number of categories actually appearing in the dataset is comparatively small.

For example, Earth is home to about 73,000 tree species. You could represent each of the 73,000 tree species in 73,000 separate categorical buckets. Alternatively, if only 200 of those tree species actually appear in a dataset, you could use hashing to divide tree species into perhaps 500 buckets.

A single bucket could contain multiple tree species. For example, hashing could place baobab and red maple—two genetically dissimilar species—into the same bucket. Regardless, hashing is still a good way to map large categorical sets into the selected number of buckets. Hashing turns a categorical feature having a large number of possible values into a much smaller number of values by grouping values in a deterministic way.
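
A minimal sketch of that deterministic mapping; the 500-bucket count comes from the example above, and hashlib is used (rather than Python's built-in hash) so results are stable across runs:

  import hashlib

  NUM_BUCKETS = 500

  def hash_bucket(category):
      """Deterministically map a categorical value to one of NUM_BUCKETS buckets."""
      digest = hashlib.md5(category.encode("utf-8")).hexdigest()
      return int(digest, 16) % NUM_BUCKETS

  print(hash_bucket("baobab"), hash_bucket("red maple"))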

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

heuristic

A simple and quickly implemented solution to a problem. For example, "With a heuristic, we achieved 86% accuracy. When we switched to a deep neural network, accuracy went up to 98%."

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons. For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

Four layers. The first layer is an input layer containing two features. The second layer is a hidden layer containing three neurons. The third layer is a hidden layer containing two neurons. The fourth layer is an output layer. Each feature contains three edges, each of which points to a different neuron in the second layer. Each of the neurons in the second layer contains two edges, each of which points to a different neuron in the third layer. Each of the neurons in the third layer contains one edge pointing to the output layer.

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

hierarchical clustering

#clustering

A category of clustering algorithms that create a tree of clusters. Hierarchical clustering is well-suited to hierarchical data, such as botanical taxonomies. There are two types of hierarchical clustering algorithms:

  • Agglomerative clustering first assigns every example to its own cluster, and iteratively merges the closest clusters to create a hierarchical tree.
  • Divisive clustering first groups all examples into one cluster and then iteratively divides the cluster into a hierarchical tree.

Contrast with centroid-based clustering.

See Clustering algorithms in the Clustering course for more information.

hill climbing

An algorithm for iteratively improving ("walking uphill") an ML model until the model stops improving ("reaches the top of a hill"). The general form of the algorithm is as follows:

  1. Build a starting model.
  2. Create new candidate models by making small adjustments to the way you train or fine-tune. This might entail working with a slightly different training set or different hyperparameters.
  3. Evaluate the new candidate models and take one of the following actions:
    • If a candidate model outperforms the starting model, then that candidate model becomes the new starting model. In this case, repeat Steps 2 and 3.
    • If no model outperforms the starting model, then you've reached the top of the hill and should stop iterating.
Note: Think of the top of the hill as a local maximum that isn't necessarily a global maximum. That is, hill climbing can help you find the best model within your current constraints. However, you might be able to build an even better model by starting over with a new approach.

See Deep Learning Tuning Playbook for guidance on hyperparameter tuning. See the Data modules of Machine Learning Crash Course for guidance on feature engineering.

hinge loss

#Metric

A family of loss functions for classification designed to find the decision boundary as distant as possible from each training example, thus maximizing the margin between examples and the boundary. KSVMs use hinge loss (or a related function, such as squared hinge loss). For binary classification, the hinge loss function is defined as follows:

$$\text{loss} = \text{max}(0, 1 - (y * y'))$$

where y is the true label, either -1 or +1, and y' is the raw output of the classification model:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

Consequently, a plot of hinge loss versus (y * y') looks as follows:

A Cartesian plot consisting of two joined line segments. The first line segment starts at (-3, 4) and ends at (1, 0). The second line segment begins at (1, 0) and continues indefinitely with a slope of 0.
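
The formula is easy to sanity-check in code; a minimal sketch:

  def hinge_loss(y, y_prime):
      """y is the true label (-1 or +1); y_prime is the model's raw output."""
      return max(0.0, 1.0 - y * y_prime)

  print(hinge_loss(+1, 2.5))   # 0.0: correct and outside the margin
  print(hinge_loss(+1, 0.3))   # 0.7: correct but inside the margin
  print(hinge_loss(-1, 0.3))   # 1.3: on the wrong side of the boundary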

historical bias

#responsible

A type of bias that already exists in the world and has made its way into a dataset. These biases have a tendency to reflect existing cultural stereotypes, demographic inequalities, and prejudices against certain social groups.

For example, consider a classification model that predicts whether or not a loan applicant will default on their loan, which was trained on historical loan-default data from the 1980s from local banks in two different communities. If past applicants from Community A were six times more likely to default on their loans than applicants from Community B, the model might learn a historical bias resulting in the model being less likely to approve loans in Community A, even if the historical conditions that resulted in that community's higher default rates were no longer relevant.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

holdout data

Examples intentionally not used ("held out") during training. The validation dataset and test dataset are examples of holdout data. Holdout data helps evaluate your model's ability to generalize to data other than the data it was trained on. The loss on the holdout set provides a better estimate of the loss on an unseen dataset than does the loss on the training set.

host

#TensorFlow
#GoogleCloud

When training an ML model on accelerator chips (GPUs or TPUs), the part of the system that controls both of the following:

  • The overall flow of the code.
  • The extraction and transformation of the input pipeline.

The host typically runs on a CPU, not on an accelerator chip; the device manipulates tensors on the accelerator chips.

human evaluation

#generativeAI

A process in which people judge the quality of an ML model's output; for example, having bilingual people judge the quality of an ML translation model. Human evaluation is particularly useful for judging models that have no one right answer.

Contrast with automatic evaluation and autorater evaluation.

human in the loop (HITL)

#generativeAI

A loosely-defined idiom that could mean either of the following:

  • A policy of viewing generative AI output critically or skeptically.
  • A strategy or system for ensuring that people help shape, evaluate, and refine a model's behavior. Keeping a human in the loop enables an AI to benefit from both machine intelligence and human intelligence. For example, a system in which an AI generates code which software engineers then review is a human-in-the-loop system.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and biases that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

hyperplane

A boundary that separates a space into two subspaces. For example, a line is a hyperplane in two dimensions and a plane is a hyperplane in three dimensions. More typically in machine learning, a hyperplane is the boundary separating a high-dimensional space. Kernel Support Vector Machines use hyperplanes to separate positive classes from negative classes, often in a very high-dimensional space.

I

i.i.d.

Abbreviation for independently and identically distributed.

image recognition

A process that classifies object(s), pattern(s), or concept(s) in an image. Image recognition is also known as image classification.

imbalanced dataset

Synonym for class-imbalanced dataset.

implicit bias

#responsible

Automatically making an association or assumption based on one's mental models and memories. Implicit bias can affect the following:

  • How data is collected and classified.
  • How machine learning systems are designed and developed.

For example, when building a classification model to identify wedding photos, an engineer may use the presence of a white dress in a photo as a feature. However, white dresses have been customary only during certain eras and in certain cultures.

See also confirmation bias.

imputation

Short form of value imputation.

incompatibility of fairness metrics

#responsible
#Metric

The idea that some notions of fairness are mutually incompatible and cannot be satisfied simultaneously. As a result, there is no single universal metric for quantifying fairness that can be applied to all ML problems.

While this may seem discouraging, incompatibility of fairness metrics doesn't imply that fairness efforts are fruitless. Instead, it suggests that fairness must be defined contextually for a given ML problem, with the goal of preventing harms specific to its use cases.

See"On the(im)possibility of fairness" for a more detailed discussion of theincompatibility of fairness metrics.

in-context learning

#generativeAI

Synonym for few-shot prompting.

independently and identically distributed (i.i.d)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. An i.i.d. is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also nonstationarity.

individual fairness

#responsible
#Metric

A fairness metric that checks whether similar individuals are classified similarly. For example, Brobdingnagian Academy might want to satisfy individual fairness by ensuring that two students with identical grades and standardized test scores are equally likely to gain admission.

Note that individual fairness relies entirely on how you define "similarity" (in this case, grades and test scores), and you can run the risk of introducing new fairness problems if your similarity metric misses important information (such as the rigor of a student's curriculum).

See"Fairness ThroughAwareness" for a more detailed discussion of individual fairness.

inference

#fundamentals
#generativeAI

In traditional machine learning, the process of making predictions by applying a trained model to unlabeled examples. See Supervised Learning in the Intro to ML course to learn more.

In large language models, inference is the process of using a trained model to generate a response to an input prompt.

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

inference path

#df

In a decision tree, during inference, the route a particular example takes from the root to other conditions, terminating with a leaf. For example, in the following decision tree, the thicker arrows show the inference path for an example with the following feature values:

  • x = 7
  • y = 12
  • z = -3

The inference path in the following illustration travels through three conditions before reaching the leaf (Zeta).

A decision tree consisting of four conditions and five leaves. The root condition is (x > 0). Since the answer is Yes, the inference path travels from the root to the next condition (y > 0). Since the answer is Yes, the inference path then travels to the next condition (z > 0). Since the answer is No, the inference path travels to its terminal node, which is the leaf (Zeta).

The three thick arrows show the inference path.

See Decision trees in the Decision Forests course for more information.

information gain

#df
#Metric

In decision forests, the difference between a node's entropy and the weighted (by number of examples) sum of the entropy of its children nodes. A node's entropy is the entropy of the examples in that node.

For example, consider the following entropy values:

  • entropy of parent node = 0.6
  • entropy of one child node with 16 relevant examples = 0.2
  • entropy of another child node with 24 relevant examples = 0.1

So 40% of the examples are in one child node and 60% are in the other child node. Therefore:

  • weighted entropy sum of child nodes = (0.4 * 0.2) + (0.6 * 0.1) = 0.14

So, the information gain is:

  • information gain = entropy of parent node - weighted entropy sum of child nodes
  • information gain = 0.6 - 0.14 = 0.46

Most splitters seek to create conditions that maximize information gain.
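
The preceding arithmetic, expressed as a small helper; the (count, entropy) pairs mirror the example above:

  def information_gain(parent_entropy, children):
      """children is a list of (example_count, entropy) pairs, one per child node."""
      total = sum(count for count, _ in children)
      weighted = sum(count / total * entropy for count, entropy in children)
      return parent_entropy - weighted

  print(information_gain(0.6, [(16, 0.2), (24, 0.1)]))  # ≈ 0.46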

in-group bias

#responsible

Showing partiality to one's own group or own characteristics. If testers or raters consist of the machine learning developer's friends, family, or colleagues, then in-group bias may invalidate product testing or the dataset.

In-group bias is a form of group attribution bias. See also out-group homogeneity bias.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

input generator

A mechanism by which data is loaded into a neural network.

An input generator can be thought of as a component responsible for processing raw data into tensors which are iterated over to generate batches for training, evaluation, and inference.

input layer

#fundamentals

The layer of a neural network that holds the feature vector. That is, the input layer provides examples for training or inference. For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

in-set condition

#df

In a decision tree, a condition that tests for the presence of one item in a set of items. For example, the following is an in-set condition:

  house-style in [tudor, colonial, cape]

During inference, if the value of the house-style feature is tudor or colonial or cape, then this condition evaluates to Yes. If the value of the house-style feature is something else (for example, ranch), then this condition evaluates to No.

In-set conditions usually lead to more efficient decision trees than conditions that test one-hot encoded features.

instance

Synonym for example.

instruction tuning

#generativeAI

A form of fine-tuning that improves a generative AI model's ability to follow instructions. Instruction tuning involves training a model on a series of instruction prompts, typically covering a wide variety of tasks. The resulting instruction-tuned model then tends to generate useful responses to zero-shot prompts across a variety of tasks.

Compare and contrast with:

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

inter-rater agreement

#Metric

A measurement of how often human raters agree when doing a task. If raters disagree, the task instructions may need to be improved. Also sometimes called inter-annotator agreement or inter-rater reliability. See also Cohen's kappa, which is one of the most popular inter-rater agreement measurements.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

intersection over union (IoU)

The intersection of two sets divided by their union. In machine-learning image-detection tasks, IoU is used to measure the accuracy of the model's predicted bounding box with respect to the ground-truth bounding box. In this case, the IoU for the two boxes is the ratio between the overlapping area and the total area, and its value ranges from 0 (no overlap of predicted bounding box and ground-truth bounding box) to 1 (predicted bounding box and ground-truth bounding box have the exact same coordinates).

For example, in the image below:

  • The predicted bounding box (the coordinates delimiting where the model predicts the night table in the painting is located) is outlined in purple.
  • The ground-truth bounding box (the coordinates delimiting where the night table in the painting is actually located) is outlined in green.

The Van Gogh painting Vincent's Bedroom in Arles, with two different bounding boxes around the night table beside the bed. The ground-truth bounding box (in green) perfectly circumscribes the night table. The predicted bounding box (in purple) is offset 50% down and to the right of the ground-truth bounding box; it encloses the bottom-right quarter of the night table, but misses the rest of the table.

Here, the intersection of the bounding boxes for prediction and ground truth (below left) is 1, and the union of the bounding boxes for prediction and ground truth (below right) is 7, so the IoU is \(\frac{1}{7}\).

Two copies of the same image as above, each with both bounding boxes divided into four quadrants. There are seven quadrants total, as the bottom-right quadrant of the ground-truth bounding box and the top-left quadrant of the predicted bounding box overlap each other. In the left image, the overlapping section (highlighted in green) represents the intersection, and has an area of 1. In the right image, the entire interior enclosed by both bounding boxes (highlighted in green) represents the union, and has an area of 7.
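
A minimal sketch for axis-aligned boxes given as (x_min, y_min, x_max, y_max); the example boxes reproduce the 1/7 result above:

  def iou(box_a, box_b):
      """Intersection over union of two axis-aligned bounding boxes."""
      ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
      iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
      intersection = ix * iy
      area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
      area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
      return intersection / (area_a + area_b - intersection)

  # Two 2x2 boxes offset by half their width and height, as in the figure:
  print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.143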

IoU

Abbreviation for intersection over union.

item matrix

In recommendation systems, a matrix of embedding vectors generated by matrix factorization that holds latent signals about each item. Each row of the item matrix holds the value of a single latent feature for all items. For example, consider a movie recommendation system. Each column in the item matrix represents a single movie. The latent signals might represent genres, or might be harder-to-interpret signals that involve complex interactions among genre, stars, movie age, or other factors.

The item matrix has the same number of columns as the target matrix that is being factorized. For example, given a movie recommendation system that evaluates 10,000 movie titles, the item matrix will have 10,000 columns.

items

In a recommendation system, the entities that a system recommends. For example, videos are the items that a video store recommends, while books are the items that a bookstore recommends.

iteration

#fundamentals

A single update of a model's parameters—the model's weights and biases—during training. The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network, a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass (backpropagation) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

J

JAX

An array computing library, bringing together XLA (Accelerated Linear Algebra) and automatic differentiation for high-performance numerical computing. JAX provides a simple and powerful API for writing accelerated numerical code with composable transformations. JAX provides features such as:

  • grad (automatic differentiation)
  • jit (just-in-time compilation)
  • vmap (automatic vectorization or batching)
  • pmap (parallelization)

JAX is a language for expressing and composing transformations of numerical code, analogous—but much larger in scope—to Python's NumPy library. (In fact, the .numpy library under JAX is a functionally equivalent, but entirely rewritten version of the Python NumPy library.)
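
A minimal sketch composing two of the transformations listed above (grad and jit) on a toy loss function:

  import jax
  import jax.numpy as jnp

  def loss(w):
      return jnp.sum((w - 3.0) ** 2)

  grad_fn = jax.jit(jax.grad(loss))      # compose autodiff with JIT compilation
  print(grad_fn(jnp.array([1.0, 5.0])))  # [-4.  4.], since d/dw = 2(w - 3)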

JAX is particularly well-suited for speeding up many machine learning tasks by transforming the models and data into a form suitable for parallelism across GPU and TPU accelerator chips.

Flax, Optax, Pax, and many other libraries are built on the JAX infrastructure.

K

Keras

A popular Python machine learning API. Keras runs on several deep learning frameworks, including TensorFlow, where it is made available as tf.keras.

Kernel Support Vector Machines (KSVMs)

A classification algorithm that seeks to maximize the margin between positive and negative classes by mapping input data vectors to a higher dimensional space. For example, consider a classification problem in which the input dataset has a hundred features. To maximize the margin between positive and negative classes, a KSVM could internally map those features into a million-dimension space. KSVMs use a loss function called hinge loss.

keypoints

The coordinates of particular features in an image. For example, for an image recognition model that distinguishes flower species, keypoints might be the center of each petal, the stem, the stamen, and so on.

k-fold cross validation

An algorithm for predicting a model's ability to generalize to new data. The k in k-fold refers to the number of equal groups you divide a dataset's examples into; that is, you train and test your model k times. For each round of training and testing, a different group is the test set, and all remaining groups become the training set. After k rounds of training and testing, you calculate the mean and standard deviation of the chosen test metric(s).

For example, suppose your dataset consists of 120 examples. Further suppose you decide to set k to 4. Therefore, after shuffling the examples, you divide the dataset into four equal groups of 30 examples and conduct four training and testing rounds:

A dataset broken into four equal groups of examples. In Round 1, the first three groups are used for training and the last group is used for testing. In Round 2, the first two groups and the last group are used for training, while the third group is used for testing. In Round 3, the first group and the last two groups are used for training, while the second group is used for testing. In Round 4, the first group is used for testing, while the final three groups are used for training.

For example, Mean Squared Error (MSE) might be the most meaningful metric for a linear regression model. Therefore, you would find the mean and standard deviation of the MSE across all four rounds.
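
A minimal sketch of that 120-example, k=4 setup using scikit-learn; the synthetic data and the linear model are illustrative assumptions:

  import numpy as np
  from sklearn.linear_model import LinearRegression
  from sklearn.metrics import mean_squared_error
  from sklearn.model_selection import KFold

  rng = np.random.default_rng(0)
  X = rng.normal(size=(120, 3))
  y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=120)

  mses = []
  for train_idx, test_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(X):
      model = LinearRegression().fit(X[train_idx], y[train_idx])
      mses.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

  print(np.mean(mses), np.std(mses))  # mean and standard deviation across rounds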

k-means

#clustering

A popular clustering algorithm that groups examples in unsupervised learning. The k-means algorithm basically does the following:

  • Iteratively determines the best k center points (known as centroids).
  • Assigns each example to the closest centroid. Those examples nearest the same centroid belong to the same group.

The k-means algorithm picks centroid locations to minimize the cumulative square of the distances from each example to its closest centroid.

For example, consider the following plot of dog height to dog width:

A Cartesian plot with several dozen data points.

If k=3, the k-means algorithm will determine three centroids. Each example is assigned to its closest centroid, yielding three groups:

The same Cartesian plot as in the previous illustration, except with three centroids added. The previous data points are clustered into three distinct groups, with each group representing the data points closest to a particular centroid.

Imagine that a manufacturer wants to determine the ideal sizes for small, medium, and large sweaters for dogs. The three centroids identify the mean height and mean width of each dog in that cluster. So, the manufacturer should probably base sweater sizes on those three centroids. Note that the centroid of a cluster is typically not an example in the cluster.

The preceding illustrations show k-means for examples with only two features (height and width). Note that k-means can group examples across many features.
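
A minimal sketch with scikit-learn; the synthetic height/width data is a stand-in for real dog measurements:

  import numpy as np
  from sklearn.cluster import KMeans

  rng = np.random.default_rng(0)
  dogs = rng.normal(loc=[40.0, 25.0], scale=6.0, size=(60, 2))  # height, width

  kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(dogs)
  print(kmeans.cluster_centers_)  # three centroids: candidate sweater sizes
  print(kmeans.labels_[:10])      # cluster assignment for the first ten dogs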

See What is k-means clustering? in the Clustering course for more information.

k-median

#clustering

A clustering algorithm closely related to k-means. The practical difference between the two is as follows:

  • In k-means, centroids are determined by minimizing the sum of the squares of the distance between a centroid candidate and each of its examples.
  • In k-median, centroids are determined by minimizing the sum of the distance between a centroid candidate and each of its examples.

Note that the definitions of distance are also different:

  • k-means relies on the Euclidean distance from the centroid to an example. (In two dimensions, the Euclidean distance means using the Pythagorean theorem to calculate the hypotenuse.) For example, the k-means distance between (2,2) and (5,-2) would be:
$${\text{Euclidean distance}} = {\sqrt {(2-5)^2 + (2-(-2))^2}} = 5$$
  • k-median relies on the Manhattan distance from the centroid to an example. This distance is the sum of the absolute deltas in each dimension. For example, the k-median distance between (2,2) and (5,-2) would be:
$${\text{Manhattan distance}} = \lvert 2-5 \rvert + \lvert 2-(-2) \rvert = 7$$
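
Both distances for the (2,2) and (5,-2) example, sketched in NumPy:

  import numpy as np

  a, b = np.array([2.0, 2.0]), np.array([5.0, -2.0])
  print(np.sqrt(np.sum((a - b) ** 2)))  # Euclidean distance (k-means): 5.0
  print(np.sum(np.abs(a - b)))          # Manhattan distance (k-median): 7.0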

L

L0 regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization.


L0 regularization is generally impractical in large models because L0 regularization turns training into a nonconvex optimization problem.


L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

Actual value of example | Model's predicted value | Absolute value of delta
7 | 6 | 1
5 | 4 | 1
8 | 11 | 3
4 | 6 | 2
9 | 8 | 1
  |   | 8 = L1 loss

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.


$$L_1\ \text{loss} = \sum_{i=0}^{n} |y_i - \hat{y}_i|$$

where:
  • $n$ is the number of examples.
  • $y$ is the actual value of the label.
  • $\hat{y}$ is the value that the model predicts for $y$.
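
Reproducing the preceding batch in NumPy:

  import numpy as np

  actual = np.array([7, 5, 8, 4, 9])
  predicted = np.array([6, 4, 11, 6, 8])
  print(np.sum(np.abs(actual - predicted)))   # L1 loss: 8
  print(np.mean(np.abs(actual - predicted)))  # Mean Absolute Error: 1.6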

See Linear regression: Loss in Machine Learning Crash Course for more information.

L1 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute value of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

Actual value of example | Model's predicted value | Square of delta
7 | 6 | 1
5 | 4 | 1
8 | 11 | 9
4 | 6 | 4
9 | 8 | 1
  |   | 16 = L2 loss

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.


$$L_2\ \text{loss} = \sum_{i=0}^{n} (y_i - \hat{y}_i)^2$$

where:
  • $n$ is the number of examples.
  • $y$ is the actual value of the label.
  • $\hat{y}$ is the value that the model predicts for $y$.
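
The same batch as in the L1 loss entry, now squared:

  import numpy as np

  actual = np.array([7, 5, 8, 4, 9])
  predicted = np.array([6, 4, 11, 6, 8])
  print(np.sum((actual - predicted) ** 2))   # L2 loss: 16 (the outlier adds 9)
  print(np.mean((actual - predicted) ** 2))  # Mean Squared Error: 3.2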

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with weights very close to 0 remain in the model but don't influence the model's prediction very much.

L2 regularization always improves generalization in linear models.

Contrast with L1 regularization.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

label

#fundamentals

In supervised machine learning, the "answer" or "result" portion of an example.

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label. For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

Number of bedrooms | Number of bathrooms | House age | House price (label)
3 | 2 | 15 | $345,000
2 | 1 | 72 | $179,000
4 | 2 | 34 | $392,000

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

Contrast labeled example with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

label leakage

A model design flaw in which a feature is a proxy for the label. For example, consider a binary classification model that predicts whether or not a prospective customer will purchase a particular product. Suppose that one of the features for the model is a Boolean named SpokeToCustomerAgent. Further suppose that a customer agent is only assigned after the prospective customer has actually purchased the product. During training, the model will quickly learn the association between SpokeToCustomerAgent and the label.

See Monitoring pipelines in Machine Learning Crash Course for more information.

lambda

#fundamentals

Synonym for regularization rate.

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization.

LaMDA (Language Model for Dialogue Applications)

A Transformer-based large language model developed by Google trained on a large dialogue dataset that can generate realistic conversational responses.

LaMDA: our breakthrough conversation technology provides an overview.

landmarks

Synonym for keypoints.

language model

A model that estimates the probability of a token or sequence of tokens occurring in a longer sequence of tokens.


Though counterintuitive, many models that evaluate text are not language models. For example, text classification models and sentiment analysis models are not language models.


See What is a language model? in Machine Learning Crash Course for more information.

large language model

#generativeAI

At a minimum, a language model having a very high number of parameters. More informally, any Transformer-based language model, such as Gemini or GPT.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

latency

#generativeAI

The time it takes for a model to process input and generate a response. A high latency response takes longer to generate than a low latency response.

Factors that influence latency of large language models include:

  • Input and output token lengths
  • Model complexity
  • The infrastructure the model runs on

Optimizing for latency is crucial for creating responsive and user-friendly applications.

latent space

Synonym for embedding space.

layer

#fundamentals

A set of neurons in a neural network. Three common types of layers are as follows:

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

In TensorFlow, layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

Layers API (tf.layers)

#TensorFlow

A TensorFlow API for constructing a deep neural network as a composition of layers. The Layers API lets you build different types of layers, such as:

The Layers API follows the Keras layers API conventions. That is, aside from a different prefix, all functions in the Layers API have the same names and signatures as their counterparts in the Keras layers API.

leaf

#df

Any endpoint in a decision tree. Unlike a condition, a leaf doesn't perform a test. Rather, a leaf is a possible prediction. A leaf is also the terminal node of an inference path.

For example, the following decision tree contains three leaves:

A decision tree with two conditions leading to three leaves.

See Decision trees in the Decision Forests course for more information.

Learning Interpretability Tool (LIT)

A visual, interactive model-understanding and data visualization tool.

You can use open-source LIT to interpret models or to visualize text, image, and tabular data.

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration. For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter. If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence.


During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.


See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

least squares regression

A linear regression model trained by minimizing L2 Loss.

Levenshtein Distance

#Metric

An edit distance metric that calculates the fewest delete, insert, and substitute operations required to change one word to another. For example, the Levenshtein distance between the words "heart" and "darts" is three because the following three edits are the fewest changes to turn one word into the other:

  1. heart → deart (substitute "h" with "d")
  2. deart → dart (delete "e")
  3. dart → darts (insert "s")

Note that the preceding sequence isn't the only path of three edits.
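
A standard dynamic-programming sketch of the metric:

  def levenshtein(a, b):
      """Fewest deletes, inserts, and substitutes to turn string a into string b."""
      previous = list(range(len(b) + 1))
      for i, ca in enumerate(a, start=1):
          current = [i]
          for j, cb in enumerate(b, start=1):
              cost = 0 if ca == cb else 1
              current.append(min(previous[j] + 1,          # delete
                                 current[j - 1] + 1,       # insert
                                 previous[j - 1] + cost))  # substitute
          previous = current
      return previous[-1]

  print(levenshtein("heart", "darts"))  # 3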

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear.

linear model

#fundamentals

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) In contrast, the relationship of features to predictions in deep models is generally nonlinear.

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.


A linear model follows this formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$
where:
  • y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression.)
  • b is the bias.
  • w is a weight, so $w_1$ is the weight of the first feature, $w_2$ is the weight of the second feature, and so on.
  • x is a feature, so $x_1$ is the value of the first feature, $x_2$ is the value of the second feature, and so on.
For example, suppose a linear model for three features learns the following bias and weights:
  • b = 7
  • $w_1$ = -2.5
  • $w_2$ = -1.2
  • $w_3$ = 1.4
Therefore, given three features ($x_1$, $x_2$, and $x_3$), the linear model uses the following equation to generate each prediction:
$$y' = 7 + (-2.5)(x_1) + (-1.2)(x_2) + (1.4)(x_3)$$

Suppose a particular example contains the following values:

  • \(x_1 = 4\)
  • \(x_2 = -10\)
  • \(x_3 = 5\)
Plugging those values into the formula yields a prediction for this example:
y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16
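
Equivalently, here is a minimal NumPy sketch of the same worked example; the variable names are illustrative:

  import numpy as np

  bias = 7.0
  weights = np.array([-2.5, -1.2, 1.4])
  features = np.array([4.0, -10.0, 5.0])

  prediction = bias + np.dot(weights, features)
  print(prediction)   # 16.0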

Linear models include not only models that use only a linear equation to make predictions but also a broader set of models that use a linear equation as just one component of the formula that makes predictions. For example, logistic regression post-processes the raw prediction (\(y'\)) to produce a final prediction value between 0 and 1, exclusive.


linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

  • The model is alinear model.
  • The prediction is a floating-point value. (This is the regression part of linear regression.)

Contrast linear regression with logistic regression. Also, contrast regression with classification.

See Linear regression in Machine Learning Crash Course for more information.

LIT

Abbreviation for the Learning Interpretability Tool (LIT), which was previously known as the Language Interpretability Tool.

LLM

#generativeAI

Abbreviation for large language model.

LLM evaluations (evals)

#generativeAI
#Metric

A set of metrics and benchmarks for assessing the performance of large language models (LLMs). At a high level, LLM evaluations:

  • Help researchers identify areas where LLMs need improvement.
  • Are useful in comparing different LLMs and identifying the best LLM for a particular task.
  • Help ensure that LLMs are safe and ethical to use.

See Large language models (LLMs) in Machine Learning Crash Course for more information.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

  • The label is categorical. The term logistic regression usually refers to binary logistic regression, that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression, calculates probabilities for labels with more than two possible values.
  • The loss function during training is Log Loss. (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
  • The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

  • A 72% chance of the email being spam.
  • A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

  1. The model generates a raw prediction (y') by applying a linear function of input features.
  2. The model uses that raw prediction as input to a sigmoid function, which converts the raw prediction to a value between 0 and 1, exclusive. (A minimal sketch of both steps follows this list.)
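
Here is that two-step architecture as a minimal Python sketch; the bias, weights, and feature values are hypothetical:

  import math

  def predict_probability(bias, weights, features):
      # Step 1: raw prediction (y') from a linear function of the features.
      raw_prediction = bias + sum(w * x for w, x in zip(weights, features))
      # Step 2: the sigmoid function squashes y' into a value between 0 and 1.
      return 1.0 / (1.0 + math.exp(-raw_prediction))

  # A hypothetical spam model; the output is the predicted probability of spam.
  print(predict_probability(-1.0, [0.8, 2.1], [1.0, 0.9]))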

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

  • If the predicted number is greater than the classification threshold, the binary classification model predicts the positive class.
  • If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.

See Logistic regression in Machine Learning Crash Course for more information.

logits

The vector of raw (non-normalized) predictions that a classification model generates, which is ordinarily then passed to a normalization function. If the model is solving a multi-class classification problem, logits typically become an input to the softmax function. The softmax function then generates a vector of (normalized) probabilities with one value for each possible class.

Log Loss

#fundamentals

The loss function used in binary logistic regression.


The following formula calculates Log Loss:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$
where:
  • \((x,y)\in D\) is the dataset containing many labeled examples, which are \((x,y)\) pairs.
  • \(y\) is the label in a labeled example. Since this is logistic regression, every value of \(y\) must either be 0 or 1.
  • \(y'\) is the predicted value (somewhere between 0 and 1, exclusive), given the set of features in \(x\).
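
For illustration, here is a minimal Python sketch of the formula; the labeled examples are hypothetical:

  import math

  def log_loss(examples):
      """Sum of -y*log(y') - (1-y)*log(1-y') over (label, prediction) pairs."""
      return sum(-y * math.log(p) - (1 - y) * math.log(1 - p)
                 for y, p in examples)

  # Hypothetical (label, predicted probability) pairs.
  print(log_loss([(1, 0.9), (0, 0.2), (1, 0.6)]))   # ~0.84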

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.


If the event is a binary probability, then odds refers to the ratio of the probability of success (p) to the probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:

$${\text{odds}} =\frac{\text{p}} {\text{(1-p)}} =\frac{.9} {.1} ={\text{9}}$$

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to the natural logarithm, but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

$${\text{log-odds}} = \ln(9) \approx 2.2$$

The log-odds function is the inverse of the sigmoid function.
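
For illustration, here is a minimal Python sketch showing that applying the sigmoid function to the log-odds recovers the original probability:

  import math

  def sigmoid(z):
      return 1.0 / (1.0 + math.exp(-z))

  def log_odds(p):
      return math.log(p / (1.0 - p))

  z = log_odds(0.9)   # ln(9), approximately 2.2
  print(sigmoid(z))   # approximately 0.9, the original probability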


Long Short-Term Memory (LSTM)

A type of cell in a recurrent neural network used to process sequences of data in applications such as handwriting recognition, machine translation, and image captioning. LSTMs address the vanishing gradient problem that occurs when training RNNs due to long data sequences by maintaining history in an internal memory state based on new input and context from previous cells in the RNN.

LoRA

#generativeAI

Abbreviation for Low-Rank Adaptability.

loss

#fundamentals
#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss aggregator

A type of machine learning algorithm that improves the performance of a model by combining the predictions of multiple models and using those predictions to make a single prediction. As a result, a loss aggregator can reduce the variance of the predictions and improve the accuracy of the predictions.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations. The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a rapid drop in loss for the initial iterations, followed by a gradual drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting.

Loss curves can plot all of the following types of loss:

  • training loss
  • validation loss
  • test loss

See also generalization curve.

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss functionreturns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

  • Linear regression models typically use L2 loss as the loss function.
  • Logistic regression models typically use Log Loss as the loss function.

loss surface

A graph of weight(s) versus loss. Gradient descent aims to find the weight(s) for which the loss surface is at a local minimum.

lost-in-the-middle effect

An LLM's tendency to use information from the start and end of a long context window more effectively than information from the middle. That is, given a long context, the lost-in-the-middle effect causes accuracy to be:

  • Relatively high when the relevant information to form a response is near the beginning or end of the context.
  • Relatively low when the relevant information to form a response is in the middle of the context.

The term comes from Lost in the Middle: How Language Models Use Long Contexts.

Low-Rank Adaptability (LoRA)

#generativeAI

A parameter-efficient technique for fine-tuning that "freezes" the model's pre-trained weights (such that they can no longer be modified) and then inserts a small set of trainable weights into the model. This set of trainable weights (also known as "update matrixes") is considerably smaller than the base model and is therefore much faster to train.

LoRA provides the following benefits:

  • Improves the quality of a model's predictions for the domain where the fine-tuning is applied.
  • Fine-tunes faster than techniques that require fine-tuning all of a model's parameters.
  • Reduces the computational cost of inference by enabling concurrent serving of multiple specialized models sharing the same base model.


The update matrixes used in LoRA consist of rank decomposition matrixes, which are derived from the base model to help filter out noise and focus training on the most important features of the model.

LSTM

Abbreviation for Long Short-Term Memory.

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

machine translation

#generativeAI

Using software (typically, a machine learning model) to convert text from one human language to another human language, for example, from English to Japanese.

majority class

#fundamentals

The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

Markov decision process (MDP)

A graph representing the decision-making model where decisions (or actions) are taken to navigate a sequence of states under the assumption that the Markov property holds. In reinforcement learning, these transitions between states return a numerical reward.

Markov property

A property of certain environments, where state transitions are entirely determined by information implicit in the current state and the agent's action.

masked language model

A language model that predicts the probability of candidate tokens to fill in blanks in a sequence. For example, a masked language model can calculate probabilities for candidate word(s) to replace the underline in the following sentence:

The ____ in the hat came back.

The literature typically uses the string "MASK" instead of an underline. For example:

The "MASK" in the hat came back.

Most modern masked language models are bidirectional.

math-pass@k

A metric to determine an LLM's accuracy in solving a math problem within k attempts. For example, math-pass@2 measures an LLM's ability to solve math problems within two attempts. An accuracy of 0.85 on math-pass@2 indicates that an LLM was able to solve math problems 85% of the time within two attempts.

math-pass@k is identical to the pass@k metric, except that the term math-pass@k is specifically used for math evaluation.

matplotlib

An open-source Python 2D plotting library. matplotlib helps you visualize different aspects of machine learning.

matrix factorization

In math, a mechanism for finding the matrixes whose dot product approximates a target matrix.

In recommendation systems, the target matrix often holds users' ratings on items. For example, the target matrix for a movie recommendation system might look something like the following, where the positive integers are user ratings and 0 means that the user didn't rate the movie:

           | Casablanca | The Philadelphia Story | Black Panther | Wonder Woman | Pulp Fiction
    User 1 | 5.0        | 3.0                    | 0.0           | 2.0          | 0.0
    User 2 | 4.0        | 0.0                    | 0.0           | 1.0          | 5.0
    User 3 | 3.0        | 1.0                    | 4.0           | 5.0          | 0.0

The movie recommendation system aims to predict user ratings for unrated movies. For example, will User 1 like Black Panther?

One approach for recommendation systems is to use matrix factorization to generate the following two matrixes:

  • A user matrix, shaped as the number of users X the number of embedding dimensions.
  • An item matrix, shaped as the number of embedding dimensions X the number of items.

For example, using matrix factorization on our three users and five items could yield the following user matrix and item matrix:

  User Matrix        Item Matrix

  1.1   2.3          0.9   0.2   1.4    2.0   1.2
  0.6   2.0          1.7   1.2   1.2   -0.1   2.1
  2.5   0.5

The dot product of the user matrix and item matrix yields a recommendation matrix that contains not only the original user ratings but also predictions for the movies that each user hasn't seen. For example, consider User 1's rating of Casablanca, which was 5.0. The dot product corresponding to that cell in the recommendation matrix should hopefully be around 5.0, and it is:

(1.1 * 0.9) + (2.3 * 1.7) = 4.9

More importantly, will User 1 like Black Panther? Taking the dot product corresponding to the first row and the third column yields a predicted rating of 4.3:

(1.1 * 1.4) + (2.3 * 1.2) = 4.3
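
Here is a minimal NumPy sketch of those two checks, using the user matrix and item matrix from the example:

  import numpy as np

  user_matrix = np.array([[1.1, 2.3],
                          [0.6, 2.0],
                          [2.5, 0.5]])
  item_matrix = np.array([[0.9, 0.2, 1.4, 2.0, 1.2],
                          [1.7, 1.2, 1.2, -0.1, 2.1]])

  # The dot product reconstructs known ratings and predicts unrated movies.
  predictions = user_matrix @ item_matrix
  print(predictions[0, 0])   # 4.9, close to User 1's actual 5.0 for Casablanca
  print(predictions[0, 2])   # 4.3, the predicted rating for Black Panther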

Matrix factorization typically yields a user matrix and item matrix that, together, are significantly more compact than the target matrix.

MBPP

#Metric

Abbreviation for Mostly Basic Python Problems.

Mean Absolute Error (MAE)

#Metric

The average loss per example when L1 loss is used. Calculate Mean Absolute Error as follows:

  1. Calculate the L1 loss for a batch.
  2. Divide the L1 loss by the number of examples in the batch.


$$\text{Mean Absolute Error} = \frac{1}{n}\sum_{i=1}^{n} | y_i - \hat{y}_i |$$

where:

  • $n$ is the number of examples.
  • $y$ is the actual value of the label.
  • $\hat{y}$ is the value that the model predicts for $y$.

For example, consider the calculation of L1 loss on the following batch of five examples:

  Actual value of example | Model's predicted value | Loss (difference between actual and predicted)
  7                       | 6                       | 1
  5                       | 4                       | 1
  8                       | 11                      | 3
  4                       | 6                       | 2
  9                       | 8                       | 1
                                                      8 = L1 loss

So, L1 loss is 8 and the number of examples is 5. Therefore, the Mean Absolute Error is:

Mean Absolute Error = L1 loss / Number of Examples
Mean Absolute Error = 8/5 = 1.6
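
Here is a minimal Python sketch of that calculation, using the batch from the preceding table:

  def mean_absolute_error(actuals, predictions):
      n = len(actuals)
      return sum(abs(y - y_hat) for y, y_hat in zip(actuals, predictions)) / n

  print(mean_absolute_error([7, 5, 8, 4, 9], [6, 4, 11, 6, 8]))   # 1.6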

Contrast Mean Absolute Error with Mean Squared Error and Root Mean Squared Error.

mean average precision at k (mAP@k)

#generativeAI
#Metric

The statistical mean of all average precision at k scores across a validation dataset. One use of mean average precision at k is to judge the quality of recommendations generated by a recommendation system.

Although the phrase "mean average" sounds redundant, the name of the metric is appropriate. After all, this metric finds the mean of multiple average precision at k values.


Suppose you build a recommendation system that generates a personalized list of recommended novels for each user. Based on feedback from selected users, you calculate the following five average precision at k scores (one score per user):

  • 0.73
  • 0.77
  • 0.67
  • 0.82
  • 0.76

The mean average precision at k is therefore:

$$\text{mean } = \frac{\text{0.73 + 0.77 + 0.67 + 0.82 + 0.76}} {\text{5}} = \text{0.75}$$

Mean Squared Error (MSE)

#Metric

The average loss per example when L2 loss is used. Calculate Mean Squared Error as follows:

  1. Calculate the L2 loss for a batch.
  2. Divide the L2 loss by the number of examples in the batch.


$$\text{Mean Squared Error} = \frac{1}{n}\sum_{i=1}^{n} {(y_i - \hat{y}_i)}^2$$

where:
  • $n$ is the number of examples.
  • $y$ is the actual value of the label.
  • $\hat{y}$ is the model's prediction for $y$.

For example, consider the loss on the following batch of five examples:

  Actual value | Model's prediction | Loss | Squared loss
  7            | 6                  | 1    | 1
  5            | 4                  | 1    | 1
  8            | 11                 | 3    | 9
  4            | 6                  | 2    | 4
  9            | 8                  | 1    | 1
                                            16 = L2 loss

Therefore, the Mean Squared Error is:

Mean Squared Error = L2 loss / Number of Examples
Mean Squared Error = 16/5 = 3.2
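
Here is a minimal Python sketch of that calculation, using the same batch:

  def mean_squared_error(actuals, predictions):
      n = len(actuals)
      return sum((y - y_hat) ** 2 for y, y_hat in zip(actuals, predictions)) / n

  print(mean_squared_error([7, 5, 8, 4, 9], [6, 4, 11, 6, 8]))   # 3.2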

Mean Squared Error is a popular loss function for training, particularly for linear regression.

Contrast Mean Squared Error with Mean Absolute Error and Root Mean Squared Error.

TensorFlow Playground uses Mean Squared Error to calculate loss values.


Outliers strongly influence Mean Squared Error. For example, a loss of 1 is a squared loss of 1, but a loss of 3 is a squared loss of 9. In the preceding table, the example with a loss of 3 accounts for ~56% of the Mean Squared Error, while each of the examples with a loss of 1 accounts for only ~6% of the Mean Squared Error.

Outliers don't influence Mean Absolute Error as strongly as Mean Squared Error. For example, a loss of 3 accounts for only ~38% of the Mean Absolute Error.

Clipping is one way to prevent extreme outliers from damaging your model's predictive ability.


mesh

#TensorFlow
#GoogleCloud

In ML parallel programming, a term associated with assigning the data and model to TPU chips, and defining how these values will be sharded or replicated.

Mesh is an overloaded term that can mean either of the following:

  • A physical layout of TPU chips.
  • An abstract logical construct for mapping the data and model to the TPU chips.

In either case, a mesh is specified as a shape.

meta-learning

A subset of machine learning that discovers or improves a learning algorithm. A meta-learning system can also aim to train a model to quickly learn a new task from a small amount of data or from experience gained in previous tasks. Meta-learning algorithms generally try to achieve the following:

  • Improve or learn hand-engineered features (such as an initializer or an optimizer).
  • Be more data-efficient and compute-efficient.
  • Improve generalization.

Meta-learning is related to few-shot learning.

metric

#TensorFlow
#Metric

A statistic that you care about.

An objective is a metric that a machine learning system tries to optimize.

Metrics API (tf.metrics)

#Metric

A TensorFlow API for evaluating models. For example, tf.metrics.accuracy determines how often a model's predictions match labels.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration. The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

mini-batch stochastic gradient descent

A gradient descent algorithm that uses mini-batches. In other words, mini-batch stochastic gradient descent estimates the gradient based on a small subset of the training data. Regular stochastic gradient descent uses a mini-batch of size 1.

minimax loss

#Metric

A loss function for generative adversarial networks, based on the cross-entropy between the distribution of generated data and real data.

Minimax loss is used in the first paper to describe generative adversarial networks.

See Loss Functions in the Generative Adversarial Networks course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast with majority class.


A training set with a million examples sounds impressive. However, if the minority class is poorly represented, then even a very large training set might be insufficient. Focus less on the total number of examples in the dataset and more on the number of examples in the minority class.

If your dataset doesn't contain enough minority class examples, consider using downsampling (the definition in the second bullet) to supplement the minority class.


See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mixture of experts

#generativeAI

A scheme to increase neural network efficiency by using only a subset of its parameters (known as an expert) to process a given input token or example. A gating network routes each input token or example to the proper expert(s).

For details, see either of the following papers:

ML

Abbreviation for machine learning.

MMIT

#generativeAI

Abbreviation for multimodal instruction-tuned.

MNIST

A public-domain dataset compiled by LeCun, Cortes, and Burges containing 60,000 images, each image showing how a human manually wrote a particular digit from 0–9. Each image is stored as a 28x28 array of integers, where each integer is a grayscale value between 0 and 255, inclusive.

MNIST is a canonical dataset for machine learning, often used to test new machine learning approaches. For details, see The MNIST Database of Handwritten Digits.

modality

A high-level data category. For example, numbers, text, images, video, and audio are five different modalities.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

  • A linear regression model consists of a set of weights and a bias.
  • A neural network model consists of:
    • A set of hidden layers, each containing one or more neurons.
    • The weights and bias associated with each neuron.
  • A decision tree model consists of:
    • The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
    • The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster.


An algebraic function such as the following is a model:

  f(x, y) = 3x - 5xy + y² + 17

The preceding function maps input values (x and y) to output.

Similarly, a programming function like the following is also a model:

  def half_of_greater(x, y):
      if x > y:
          return x / 2
      else:
          return y / 2

A caller passes arguments to the preceding Python function, and the Python function generates output (via the return statement).

Although a deep neural network has a very different mathematical structure than an algebraic or programming function, a deep neural network still takes input (an example) and returns output (a prediction).

A human programmer codes a programming function manually. In contrast, a machine learning model gradually learns the optimal parameters during automated training.


model capacity

#Metric

The complexity of problems that a model can learn. The more complex the problems that a model can learn, the higher the model's capacity. A model's capacity typically increases with the number of model parameters. For a formal definition of classification model capacity, see VC dimension.

model cascading

#generativeAI

A system that picks the ideal model for a specific inference query.

Imagine a group of models, ranging from very large (lots of parameters) to much smaller (far fewer parameters). Very large models consume more computational resources at inference time than smaller models. However, very large models can typically infer more complex requests than smaller models. Model cascading determines the complexity of the inference query and then picks the appropriate model to perform the inference. The main motivation for model cascading is to reduce inference costs by generally selecting smaller models, and only selecting a larger model for more complex queries.

Imagine that a small model runs on a phone and a larger version of that model runs on a remote server. Good model cascading reduces cost and latency by enabling the smaller model to handle simple requests and only calling the remote model to handle complex requests.

See also model router.

model parallelism

A way of scaling training or inference that puts different parts of one model on different devices. Model parallelism enables models that are too big to fit on a single device.

To implement model parallelism, a system typically does the following:

  1. Shards (divides) the model into smaller parts.
  2. Distributes the training of those smaller parts across multiple processors. Each processor trains its own part of the model.
  3. Combines the results to create a single model.

Model parallelism slows training.

See also data parallelism.

model router

#generativeAI

The algorithm that determines the ideal model for inference in model cascading. A model router is itself typically a machine learning model that gradually learns how to pick the best model for a given input. However, a model router could sometimes be a simpler, non-machine learning algorithm.

model training

The process of determining the best model.

MOE

#generativeAI

Abbreviation for mixture of experts.

Momentum

A sophisticated gradient descent algorithm in which a learning step depends not only on the derivative in the current step, but also on the derivatives of the step(s) that immediately preceded it. Momentum involves computing an exponentially weighted moving average of the gradients over time, analogous to momentum in physics. Momentum sometimes prevents learning from getting stuck in local minima.
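
For illustration, here is a minimal Python sketch of a momentum update; beta, the learning rate, and the gradients are hypothetical:

  beta = 0.9            # how much of the moving average to keep each step
  learning_rate = 0.1
  velocity = 0.0        # exponentially weighted moving average of gradients
  weight = 1.0

  for gradient in [0.5, 0.4, 0.45]:   # hypothetical gradients, one per step
      velocity = beta * velocity + gradient
      weight -= learning_rate * velocity
  print(weight)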

Mostly Basic Python Problems (MBPP)

#Metric

A dataset for evaluating an LLM's proficiency in generating Python code. Mostly Basic Python Problems provides about 1,000 crowd-sourced programming problems. Each problem in the dataset contains:

  • A task description
  • Solution code
  • Three automated test cases

MT

#generativeAI

Abbreviation for machine translation.

multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification models. For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

multi-class logistic regression

Using logistic regression in multi-class classification problems.

multi-head self-attention

An extension of self-attention that applies the self-attention mechanism multiple times for each position in the input sequence.

Transformers introduced multi-head self-attention.

multimodal instruction-tuned

An instruction-tuned model that can process input beyond text, such as images, video, and audio.

multimodal model

A model whose inputs, outputs, or both include more than one modality. For example, consider a model that takes both an image and a text caption (two modalities) as features, and outputs a score indicating how appropriate the text caption is for the image. So, this model's inputs are multimodal and the output is unimodal.

multinomial classification

Synonym for multi-class classification.

multinomial regression

Synonym for multi-class logistic regression.

Multi-sentence Reading Comprehension (MultiRC)

A dataset to evaluate an LLM's ability to answer multiple-choice exercises. Each example in the dataset contains:

  • A context paragraph
  • A question about that paragraph
  • Multiple answers to the question. Each answer is labeled True or False. Multiple answers may be True.

For example:

  • Context paragraph:

    Susan wanted to have a birthday party. She called all of her friends. She has five friends. Her mom said that Susan can invite them all to the party. Her first friend could not go to the party because she was sick. Her second friend was going out of town. Her third friend was not so sure if her parents would let her. The fourth friend said maybe. The fifth friend could go to the party for sure. Susan was a little sad. On the day of the party, all five friends showed up. Each friend had a present for Susan. Susan was happy and sent each friend a thank you card the next week.

  • Question: Did Susan's sick friend recover?

  • Multiple answers:

    • Yes, she recovered. (True)
    • No. (False)
    • Yes. (True)
    • No, she didn't recover. (False)
    • Yes, she was at Susan's party. (True)

MultiRC is a component of the SuperGLUE ensemble.

For details, see Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences.

multitask

A machine learning technique in which a single model is trained to perform multiple tasks.

Multitask models are created by training on data that is appropriate for each of the different tasks. This allows the model to learn to share information across the tasks, which helps the model learn more effectively.

A model trained for multiple tasks often has improved generalization abilities and can be more robust at handling different types of data.

N

Nano

#generativeAI

A relatively small Gemini model designed for on-device use. See Gemini Nano for details.

See also Pro and Ultra.

NaN trap

When one number in your model becomes a NaN during training, which causes many or all other numbers in your model to eventually become a NaN.

NaN is an abbreviation for Not a Number.

natural language processing

The field of teaching computers to process what a user said or typed using linguistic rules. Almost all modern natural language processing relies on machine learning.

natural language understanding

A subset of natural language processing that determines the intentions of something said or typed. Natural language understanding can go beyond natural language processing to consider complex aspects of language like context, sarcasm, and sentiment.

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classification model might be "not spam."

Contrast with positive class.

negative sampling

Synonym for candidate sampling.

Neural Architecture Search (NAS)

A technique for automatically designing the architecture of a neural network. NAS algorithms can reduce the amount of time and resources required to train a neural network.

NAS typically uses:

  • A search space, which is a set of possible architectures.
  • A fitness function, which is a measure of how well a particular architecture performs on a given task.

NAS algorithms often start with a small set of possible architectures and gradually expand the search space as the algorithm learns more about what architectures are effective. The fitness function is typically based on the performance of the architecture on a training set, and the algorithm is typically trained using a reinforcement learning technique.

NAS algorithms have proven effective in finding high-performing architectures for a variety of tasks, including image classification, text classification, and machine translation.

neural network

#fundamentals

A model containing at least one hidden layer. A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an output layer.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connects to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network.

See Neural networks in Machine Learning Crash Course for more information.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network. Each neuron performs the following two-step action:

  1. Calculates the weighted sum of input values multiplied by their corresponding weights.
  2. Passes the weighted sum as input to an activation function.

A neuron in the first hidden layer accepts inputs from the feature values in the input layer. A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an output layer. Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden layer.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

N-gram

An ordered sequence of N words. For example, truly madly is a 2-gram. Because order is relevant, madly truly is a different 2-gram than truly madly.

  N | Name(s) for this kind of N-gram | Examples
  2 | bigram or 2-gram                | to go, go to, eat lunch, eat dinner
  3 | trigram or 3-gram               | ate too much, happily ever after, the bell tolls
  4 | 4-gram                          | walk in the park, dust in the wind, the boy ate lentils

Many natural language understanding models rely on N-grams to predict the next word that the user will type or say. For example, suppose a user typed happily ever. An NLU model based on trigrams would likely predict that the user will next type the word after.
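
Here is a minimal Python sketch that extracts the N-grams from a sequence of words:

  def ngrams(words, n):
      """Return the ordered sequences of n consecutive words."""
      return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

  print(ngrams("the bell tolls for thee".split(), 2))
  # [('the', 'bell'), ('bell', 'tolls'), ('tolls', 'for'), ('for', 'thee')]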

Contrast N-grams with bag of words, which are unordered sets of words.

See Large language models in Machine Learning Crash Course for more information.

NLP

Abbreviation for natural language processing.

NLU

Abbreviation for natural language understanding.

node (decision tree)

#df

In a decision tree, any condition or leaf.

A decision tree with two conditions and three leaves.

See Decision Trees in the Decision Forests course for more information.

node (neural network)

#fundamentals

A neuron in a hidden layer.

See Neural Networks in Machine Learning Crash Course for more information.

node (TensorFlow graph)

#TensorFlow

An operation in a TensorFlow graph.

noise

Broadly speaking, anything that obscures the signal in a dataset. Noise can be introduced into data in a variety of ways. For example:

  • Human raters make mistakes in labeling.
  • Humans and instruments mis-record or omit feature values.

non-binary condition

#df

A condition containing more than two possible outcomes. For example, the following non-binary condition contains three possible outcomes:

A condition (number_of_legs = ?) that leads to three possible outcomes. One outcome (number_of_legs = 8) leads to a leaf named spider. A second outcome (number_of_legs = 4) leads to a leaf named dog. A third outcome (number_of_legs = 2) leads to a leaf named penguin.

See Types of conditions in the Decision Forests course for more information.

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

Two plots. One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

non-response bias

#responsible

See selection bias.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

  • The number of swimsuits sold at a particular store varies with the season.
  • The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
  • Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity.

no one right answer (NORA)

#generativeAI

A prompt having multiple correct responses. For example, the following prompt has no one right answer:

Tell me a funny joke about elephants.

Evaluating the responses to no one right answer prompts is usually far more subjective than evaluating prompts with one right answer. For example, evaluating an elephant joke requires a systematic way to determine how funny the joke is.

NORA

#generativeAI

Abbreviation for no one right answer.

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering, you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering. Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.
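
For illustration, here is a minimal NumPy sketch of two common normalizations applied to a hypothetical feature:

  import numpy as np

  values = np.array([800.0, 1200.0, 1600.0, 2400.0])

  # Scale the actual range down to 0 to 1.
  zero_to_one = (values - values.min()) / (values.max() - values.min())

  # Z-scores: how many standard deviations each value is from the mean.
  z_scores = (values - values.mean()) / values.std()

  print(zero_to_one)   # [0.   0.25 0.5  1.  ]
  print(z_scores)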

See also Z-score normalization.

See Numerical Data: Normalization in Machine Learning Crash Course for more information.

Notebook LM

#generativeAI

A Gemini-based tool that enables users to upload documents and then use prompts to ask questions about, summarize, or organize those documents. For example, an author could upload several short stories and ask Notebook LM to find their common themes or to identify which one would make the best movie.

novelty detection

The process of determining whether a new (novel) example comes from the same distribution as the training set. In other words, after training on the training set, novelty detection determines whether a new example (during inference or during additional training) is an outlier.

Contrast with outlier detection.

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.

See Working with numerical data in Machine Learning Crash Course for more information.

NumPy

An open-source math library that provides efficient array operations in Python. pandas is built on NumPy.

O

objective

#Metric

A metric that your algorithm is trying to optimize.

objective function

#Metric

The mathematical formula or metric that a model aims to optimize. For example, the objective function for linear regression is usually Mean Squared Loss. Therefore, when training a linear regression model, training aims to minimize Mean Squared Loss.

In some cases, the goal is to maximize the objective function. For example, if the objective function is accuracy, the goal is to maximize accuracy.

See also loss.

oblique condition

#df

In a decision tree, a condition that involves more than one feature. For example, if height and width are both features, then the following is an oblique condition:

  height > width

Contrast with axis-aligned condition.

See Types of conditions in the Decision Forests course for more information.

offline

#fundamentals

Synonym for static.

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.

Offline inference is also called static inference.

Contrast with online inference. See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

  • "Denmark"
  • "Sweden"
  • "Norway"
  • "Finland"
  • "Iceland"

One-hot encoding could represent each of the five values as follows:

  Country   | Vector
  "Denmark" | 1 0 0 0 0
  "Sweden"  | 0 1 0 0 0
  "Norway"  | 0 0 1 0 0
  "Finland" | 0 0 0 1 0
  "Iceland" | 0 0 0 0 1

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.
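
Here is a minimal NumPy sketch of one-hot encoding the Scandinavia feature:

  import numpy as np

  countries = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]
  index = {country: i for i, country in enumerate(countries)}

  def one_hot(country):
      vector = np.zeros(len(countries), dtype=int)
      vector[index[country]] = 1
      return vector

  print(one_hot("Norway"))   # [0 0 1 0 0]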

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

  • "Denmark" is 0
  • "Sweden" is 1
  • "Norway" is 2
  • "Finland" is 3
  • "Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one right answer (ORA)

#generativeAI

A prompt having a single correct response. For example, consider the following prompt:

True or false: Saturn is bigger than Mars.

The only correct response is true.

Contrast with no one right answer.

one-shot learning

A machine learning approach, often used for object classification, designed to learn an effective classification model from a single training example.

See also few-shot learning and zero-shot learning.

one-shot prompting

#generativeAI

A prompt that contains one example demonstrating how the large language model should respond. For example, the following prompt contains one example showing a large language model how it should answer a query.

  Parts of one prompt                                     | Notes
  What is the official currency of the specified country? | The question you want the LLM to answer.
  France: EUR                                             | One example.
  India:                                                  | The actual query.

Compare and contrast one-shot prompting with the following terms:

  • zero-shot prompting
  • few-shot prompting

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classification models, one binary classification model for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classification models:

  • animal versus not animal
  • vegetable versus not vegetable
  • mineral versus not mineral

online

#fundamentals

Synonym for dynamic.

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference.

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

operation (op)

#TensorFlow

In TensorFlow, any procedure that creates, manipulates, or destroys a Tensor. For example, a matrix multiply is an operation that takes two Tensors as input and generates one Tensor as output.

Optax

A gradient processing and optimization library for JAX. Optax facilitates research by providing building blocks that can be recombined in custom ways to optimize parametric models such as deep neural networks. Other goals include:

  • Providing readable, well-tested, efficient implementations of core components.
  • Improving productivity by making it possible to combine low-level ingredients into custom optimizers (or other gradient processing components).
  • Accelerating adoption of new ideas by making it easy for anyone to contribute.

optimizer

A specific implementation of the gradient descent algorithm. Popular optimizers include:

  • AdaGrad, which stands for ADAptive GRADient descent.
  • Adam, whose name is derived from ADAptive Moment estimation.

ORA

#generativeAI

Abbreviation for one right answer.

out-group homogeneity bias

#responsible

The tendency to see out-group members as more alike than in-group members when comparing attitudes, values, personality traits, and other characteristics. In-group refers to people you interact with regularly; out-group refers to people you don't interact with regularly. If you create a dataset by asking people to provide attributes about out-groups, those attributes may be less nuanced and more stereotyped than attributes that participants list for people in their in-group.

For example, Lilliputians might describe the houses of other Lilliputians in great detail, citing small differences in architectural styles, windows, doors, and sizes. However, the same Lilliputians might simply declare that Brobdingnagians all live in identical houses.

Out-group homogeneity bias is a form of group attribution bias.

See also in-group bias.

outlier detection

The process of identifying outliers in a training set.

Contrast with novelty detection.

outliers

Values distant from most other values. In machine learning, any of the following are outliers:

  • Input data whose values are more than roughly 3 standard deviations from the mean.
  • Weights with high absolute values.
  • Predicted values relatively far away from the actual values.

For example, suppose that widget-price is a feature of a certain model. Assume that the mean widget-price is 7 Euros with a standard deviation of 1 Euro. Examples containing a widget-price of 12 Euros or 2 Euros would therefore be considered outliers because each of those prices is five standard deviations from the mean.

Outliers are often caused by typos or other input mistakes. In other cases, outliers aren't mistakes; after all, values five standard deviations away from the mean are rare but hardly impossible.

Outliers often cause problems in model training. Clipping is one way of managing outliers.

See Working with numerical data in Machine Learning Crash Course for more information.

out-of-bag evaluation (OOB evaluation)

#df

A mechanism for evaluating the quality of a decision forest by testing each decision tree against the examples not used during training of that decision tree. For example, in the following diagram, notice that the system trains each decision tree on about two-thirds of the examples and then evaluates against the remaining one-third of the examples.

A decision forest consisting of three decision trees. One decision tree trains on two-thirds of the examples and then uses the remaining one-third for OOB evaluation. A second decision tree trains on a different two-thirds of the examples than the previous decision tree, and then uses a different one-third for OOB evaluation than the previous decision tree.

Out-of-bag evaluation is a computationally efficient and conservative approximation of the cross-validation mechanism. In cross-validation, one model is trained for each cross-validation round (for example, 10 models are trained in a 10-fold cross-validation). With OOB evaluation, a single model is trained. Because bagging withholds some data from each tree during training, OOB evaluation can use that data to approximate cross-validation.

See Out-of-bag evaluation in the Decision Forests course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.


Overfitting is like strictly following advice from only your favorite teacher. You'll probably be successful in that teacher's class, but you might "overfit" to that teacher's ideas and be unsuccessful in other classes. Following advice from a mixture of teachers will enable you to adapt better to new situations.


See Overfitting in Machine Learning Crash Course for more information.

oversampling

Reusing the examples of a minority class in a class-imbalanced dataset in order to create a more balanced training set.

For example, consider a binary classification problem in which the ratio of the majority class to the minority class is 5,000:1. If the dataset contains a million examples, then the dataset contains only about 200 examples of the minority class, which might be too few examples for effective training. To overcome this deficiency, you might oversample (reuse) those 200 examples multiple times, possibly yielding sufficient examples for useful training.

You need to be careful about overfitting when oversampling.

Contrast with undersampling.

P

packed data

An approach for storing data more efficiently.

Packed data stores data either by using a compressed format or in some other way that allows it to be accessed more efficiently. Packed data minimizes the amount of memory and computation required to access it, leading to faster training and more efficient model inference.

Packed data is often used with other techniques, such as data augmentation and regularization, further improving the performance of models.

PaLM

Abbreviation for Pathways Language Model.

pandas

#fundamentals

A column-oriented data analysis API built on top of NumPy. Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training. For example, in a linear regression model, the parameters consist of the bias (\(b\)) and all the weights (\(w_1\), \(w_2\), and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

parameter-efficient tuning

#generativeAI

A set of techniques to fine-tune a large pre-trained language model (PLM) more efficiently than full fine-tuning. Parameter-efficient tuning typically fine-tunes far fewer parameters than full fine-tuning, yet generally produces a large language model that performs as well (or almost as well) as a large language model built from full fine-tuning.

Compare and contrast parameter-efficient tuning with:

  • fine-tuning
  • prompt tuning

Parameter-efficient tuning is also known as parameter-efficient fine-tuning.

Parameter Server (PS)

#TensorFlow

A job that keeps track of a model's parameters in a distributed setting.

parameter update

The operation of adjusting a model's parameters during training, typically within a single iteration of gradient descent.

partial derivative

A derivative in which all but one of the variables is considered a constant. For example, the partial derivative of \(f(x, y)\) with respect to \(x\) is the derivative of \(f\) considered as a function of \(x\) alone (that is, keeping \(y\) constant). The partial derivative of \(f\) with respect to \(x\) focuses only on how \(x\) is changing and ignores all other variables in the equation.

participation bias

#responsible

Synonym for non-response bias. See selection bias.

partitioning strategy

The algorithm by which variables are divided across parameter servers.

pass at k (pass@k)

#Metric

A metric to determine the quality of code (for example, Python) that a large language model generates. More specifically, pass at k tells you the likelihood that at least one generated block of code out of k generated blocks of code will pass all of its unit tests.

Large language models often struggle to generate good code for complex programming problems. Software engineers adapt to this problem by prompting the large language model to generate multiple (k) solutions for the same problem. Then, software engineers test each of the solutions against unit tests. The calculation of pass at k depends on the outcome of the unit tests:

  • If one or more of those solutions pass the unit test, then the LLM Passes that code generation challenge.
  • If none of the solutions pass the unit test, then the LLM Fails that code generation challenge.

The formula for pass at k is as follows:

\[\text{pass at k} = \frac{\text{total number of passes}} {\text{total number of challenges}}\]

In general, higher values of k produce higher pass at k scores; however, higher values of k require more large language model and unit testing resources.


Suppose a software engineer asks a large language model to generate k=10 solutions for n=50 challenging coding problems. Here are the results:

  • 30 Passes
  • 20 Fails

The pass at 10 score is therefore:

$$\text{pass at 10} = \frac{\text{30}} {\text{50}} = 0.6$$
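
Here is a minimal Python sketch of the metric; each challenge's entry lists whether each of its k generated solutions passed all of its unit tests:

  def pass_at_k(results_per_challenge):
      """Fraction of challenges where at least one solution passed."""
      passes = sum(any(results) for results in results_per_challenge)
      return passes / len(results_per_challenge)

  # Three hypothetical challenges with k=3 generated solutions each.
  print(pass_at_k([[False, True, False],
                   [False, False, False],
                   [True, True, False]]))   # 2 of 3 challenges pass, so ~0.67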

Pathways Language Model (PaLM)

An older model and predecessor to Gemini models.

Pax

#generativeAI

A programming framework designed for training large-scale neural network models so large that they span multiple TPU accelerator chip slices or pods.

Pax is built on Flax, which is built on JAX.

Diagram indicating Pax's position in the software stack. Pax is built on top of JAX. Pax itself consists of three layers. The bottom layer contains TensorStore and Flax. The middle layer contains Optax and Flaxformer. The top layer contains Praxis Modeling Library. Fiddle is built on top of Pax.

perceptron

A system (either hardware or software) that takes in one or more input values, runs a function on the weighted sum of the inputs, and computes a single output value. In machine learning, the function is typically nonlinear, such as ReLU, sigmoid, or tanh. For example, the following perceptron relies on the sigmoid function to process three input values:

$$f(x_1, x_2, x_3) = \text{sigmoid}(w_1 x_1 + w_2 x_2 + w_3 x_3)$$

In the following illustration, the perceptron takes three inputs, each of which is itself modified by a weight before entering the perceptron:

A perceptron that takes in 3 inputs, each multiplied by separate weights. The perceptron outputs a single value.
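
Here is a minimal Python sketch of that perceptron; the inputs and weights are hypothetical:

  import math

  def perceptron(inputs, weights):
      # Sigmoid activation applied to the weighted sum of the inputs.
      weighted_sum = sum(w * x for w, x in zip(weights, inputs))
      return 1.0 / (1.0 + math.exp(-weighted_sum))

  print(perceptron([0.5, -1.0, 2.0], [0.8, 0.4, 0.3]))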

Perceptrons are the neurons in neural networks.
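As a minimal sketch, a sigmoid perceptron can be written in a few lines of Python (the example inputs and weights are arbitrary):

import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def perceptron(inputs, weights):
    # Compute the weighted sum of the inputs, then apply the
    # nonlinear sigmoid function to produce a single output value.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(weighted_sum)

print(perceptron([0.5, -1.2, 3.0], [0.1, 0.4, -0.2]))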

performance

#Metric

Overloaded term with the following meanings:

  • The standard meaning within software engineering. Namely: How fast (or efficiently) does this piece of software run?
  • The meaning within machine learning. Here, performance answers the following question: How correct is this model? That is, how good are the model's predictions?

permutation variable importances

#df
#Metric

A type of variable importance that evaluates the increase in the prediction error of a model after permuting the feature's values. Permutation variable importance is a model-independent metric.

perplexity

#Metric

One measure of how well a model is accomplishing its task. For example, suppose your task is to read the first few letters of a word a user is typing on a phone keyboard, and to offer a list of possible completion words. Perplexity, P, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type.

Perplexity is related to cross-entropy as follows:

$$P= 2^{-\text{cross entropy}}$$
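Translating that relationship directly into Python:

def perplexity(cross_entropy):
    # Direct translation of the formula above: P = 2^(-cross entropy).
    return 2 ** (-cross_entropy)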

pipeline

The infrastructure surrounding a machine learning algorithm. A pipeline includes gathering the data, putting the data into training data files, training one or more models, and exporting the models to production.

See ML pipelines in the Managing ML Projects course for more information.

pipelining

A form of model parallelism in which a model's processing is divided into consecutive stages and each stage is executed on a different device. While a stage is processing one batch, the preceding stage can work on the next batch.

See also staged training.

pjit

A JAX function that splits code to run across multiple accelerator chips. The user passes a function to pjit, which returns a function that has the equivalent semantics but is compiled into an XLA computation that runs across multiple devices (such as GPUs or TPU cores).

pjit enables users to shard computations without rewriting them by using the SPMD partitioner.

As of March 2023, pjit has been merged with jit. Refer to Distributed arrays and automatic parallelization for more details.

PLM

#generativeAI

Abbreviation for pre-trained language model.

pmap

A JAX function that executes copies of an input function on multiple underlying hardware devices (CPUs, GPUs, or TPUs), with different input values. pmap relies on SPMD.

policy

In reinforcement learning, an agent's probabilistic mapping from states to actions.

pooling

Reducing a matrix (or matrixes) created by an earlier convolutional layer to a smaller matrix. Pooling usually involves taking either the maximum or average value across the pooled area. For example, suppose we have the following 3x3 matrix:

The 3x3 matrix [[5,3,1], [8,2,5], [9,4,3]].

A pooling operation, just like a convolutional operation, divides that matrix into slices and then slides that convolutional operation by strides. For example, suppose the pooling operation divides the convolutional matrix into 2x2 slices with a 1x1 stride. As the following diagram illustrates, four pooling operations take place. Imagine that each pooling operation picks the maximum value of the four in that slice:

The input matrix is 3x3 with the values: [[5,3,1], [8,2,5], [9,4,3]].          The top-left 2x2 submatrix of the input matrix is [[5,3], [8,2]], so          the top-left pooling operation yields the value 8 (which is the          maximum of 5, 3, 8, and 2). The top-right 2x2 submatrix of the input          matrix is [[3,1], [2,5]], so the top-right pooling operation yields          the value 5. The bottom-left 2x2 submatrix of the input matrix is          [[8,2], [9,4]], so the bottom-left pooling operation yields the value          9. The bottom-right 2x2 submatrix of the input matrix is          [[2,5], [4,3]], so the bottom-right pooling operation yields the value          5. In summary, the pooling operation yields the 2x2 matrix          [[8,5], [9,5]].
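Here is a minimal NumPy sketch of that 2x2 max-pooling operation with a 1x1 stride:

import numpy as np

matrix = np.array([[5, 3, 1],
                   [8, 2, 5],
                   [9, 4, 3]])

# Slide a 2x2 window over the matrix with a 1x1 stride, taking the
# maximum of each slice.
pooled = np.array([[matrix[i:i+2, j:j+2].max() for j in range(2)]
                   for i in range(2)])
print(pooled)  # [[8 5]
               #  [9 5]]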

Pooling helps enforce translational invariance in the input matrix.

Pooling for vision applications is known more formally as spatial pooling. Time-series applications usually refer to pooling as temporal pooling. Less formally, pooling is often called subsampling or downsampling.

positional encoding

A technique to add information about the position of a token in a sequence to the token's embedding. Transformer models use positional encoding to better understand the relationship between different parts of the sequence.

A common implementation of positional encoding uses a sinusoidal function. (Specifically, the frequency and amplitude of the sinusoidal function are determined by the position of the token in the sequence.) This technique enables a Transformer model to learn to attend to different parts of the sequence based on their position.

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class.


The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negative classes.


post-processing

#responsible
#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classification model by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

post-trained model

#generativeAI

Loosely defined term that typically refers to a pre-trained model that has gone through some post-training refinement, such as fine-tuning or alignment through Reinforcement Learning from Human Feedback (RLHF).

PR AUC (area under the PR curve)

#Metric

Area under the interpolated precision-recall curve, obtained by plotting (recall, precision) points for different values of the classification threshold.

Praxis

A core, high-performance ML library of Pax. Praxis is often called the "Layer library".

Praxis contains not just the definitions for the Layer class, but most of its supporting components as well. Praxis also provides the definitions for the Model class.

precision

#fundamentals
#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class, what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} =\frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

  • true positive means the model correctly predicted the positive class.
  • false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions. Of these 200 positive predictions:

  • 150 were true positives.
  • 50 were false positives.

In this case:

$$\text{Precision} =\frac{\text{150}} {\text{150} + \text{50}} = 0.75$$
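Translated directly into Python, using the counts from the example above:

def precision(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)

print(precision(150, 50))  # 0.75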

Contrast with accuracy and recall.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

precision at k (precision@k)

#Metric

A metric for evaluating a ranked (ordered) list of items. Precision at k identifies the fraction of the first k items in that list that are "relevant." That is:

\[\text{precision at k} = \frac{\text{relevant items in first k items of the list}} {\text{k}}\]

The value of k must be less than or equal to the length of the returned list. Note that the length of the returned list is not part of the calculation.

Relevance is often subjective; even expert human evaluators often disagree on which items are relevant.

Compare with recall at k.


Suppose a large language model is given the following query:

List the 6 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns of the following table:

Position | Movie | Relevant?
1 | The General | Yes
2 | Mean Girls | Yes
3 | Platoon | No
4 | Bridesmaids | Yes
5 | Citizen Kane | No
6 | This is Spinal Tap | Yes

Two of the first three movies are relevant, so precision at 3 is:

$$\text{precision at 3} = \frac{\text{2}} {\text{3}} = 0.67$$

Three of the first five movies are very funny, so precision at 5 is:

$$\text{precision at 5} = \frac{\text{3}} {\text{5}} = 0.6$$
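A small Python sketch of precision at k, run against the relevance column above:

def precision_at_k(relevant, k):
    # relevant: list of booleans in ranked order.
    return sum(relevant[:k]) / k

relevant = [True, True, False, True, False, True]
print(round(precision_at_k(relevant, 3), 2))  # 0.67
print(precision_at_k(relevant, 5))            # 0.6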

precision-recall curve

#Metric

A curve of precision versus recall at different classification thresholds.

prediction

#fundamentals

A model's output. For example:

  • The prediction of a binary classification model is either the positive class or the negative class.
  • The prediction of a multi-class classification model is one class.
  • The prediction of a linear regression model is a number.

prediction bias

#Metric

A value indicating how far apart the average of predictions is from the average of labels in the dataset.

Not to be confused with the bias term in machine learning models or with bias in ethics and fairness.

predictive ML

Any standard ("classic") machine learning system.

The term predictive ML doesn't have a formal definition. Rather, the term distinguishes a category of ML systems not based on generative AI.

predictive parity

#responsible
#Metric

A fairness metric that checks whether, for a given classification model, the precision rates are equivalent for subgroups under consideration.

For example, a model that predicts college acceptance would satisfypredictive parity for nationality if its precision rate is the samefor Lilliputians and Brobdingnagians.

Predictive parity is sometimes also called predictive rate parity.

See"Fairness DefinitionsExplained" (section 3.2.1)for a more detailed discussion of predictive parity.

predictive rate parity

#responsible
#Metric

Another name for predictive parity.

preprocessing

#responsible

Processing data before it's used to train a model. Preprocessing could be as simple as removing words from an English text corpus that don't occur in the English dictionary, or could be as complex as re-expressing data points in a way that eliminates as many attributes that are correlated with sensitive attributes as possible. Preprocessing can help satisfy fairness constraints.

pre-trained model

#generativeAI

Although this term could refer to any trained model or trained embedding vector, pre-trained model now typically refers to a trained large language model or other form of trained generative AI model.

See also base model and foundation model.

pre-training

#generativeAI

The initial training of a model on a large dataset. Some pre-trained models are clumsy giants and must typically be refined through additional training. For example, ML experts might pre-train a large language model on a vast text dataset, such as all the English pages in Wikipedia. Following pre-training, the resulting model might be further refined through techniques such as fine-tuning or prompt tuning.

prior belief

What you believe about the data before you begin training on it. For example, L2 regularization relies on a prior belief that weights should be small and normally distributed around zero.

Pro

#generativeAI

A Gemini model with fewer parameters than Ultra but more parameters than Nano. See Gemini Pro for details.

probabilistic regression model

A regression model that uses not only the weights for each feature, but also the uncertainty of those weights. A probabilistic regression model generates a prediction and the uncertainty of that prediction. For example, a probabilistic regression model might yield a prediction of 325 with a standard deviation of 12. For more information about probabilistic regression models, see this Colab on tensorflow.org.

probability density function

#Metric

A function that identifies the frequency of data samples having exactly a particular value. When a dataset's values are continuous floating-point numbers, exact matches rarely occur. However, integrating a probability density function from value x to value y yields the expected frequency of data samples between x and y.

For example, consider a normal distribution having a mean of 200 and a standard deviation of 30. To determine the expected frequency of data samples falling within the range 211.4 to 218.7, you can integrate the probability density function for a normal distribution from 211.4 to 218.7.
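With SciPy, that integral is a difference of two values of the normal distribution's cumulative distribution function:

from scipy.stats import norm

# Expected fraction of samples between 211.4 and 218.7 for a normal
# distribution with mean 200 and standard deviation 30.
fraction = norm.cdf(218.7, loc=200, scale=30) - norm.cdf(211.4, loc=200, scale=30)
print(fraction)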

prompt

#generativeAI

Any text entered as input to a large language model to condition the model to behave in a certain way. Prompts can be as short as a phrase or arbitrarily long (for example, the entire text of a novel). Prompts fall into multiple categories, including those shown in the following table:

Prompt category | Example | Notes
Question | How fast can a pigeon fly? |
Instruction | Write a funny poem about arbitrage. | A prompt that asks the large language model to do something.
Example | Translate Markdown code to HTML. For example: Markdown: * list item / HTML: <ul> <li>list item</li> </ul> | The first sentence in this example prompt is an instruction. The remainder of the prompt is the example.
Role | Explain why gradient descent is used in machine learning training to a PhD in Physics. | The first part of the sentence is an instruction; the phrase "to a PhD in Physics" is the role portion.
Partial input for the model to complete | The Prime Minister of the United Kingdom lives at | A partial input prompt can either end abruptly (as this example does) or end with an underscore.

A generative AI model can respond to a prompt with text, code, images, embeddings, videos…almost anything.

prompt-based learning

#generativeAI

A capability of certain models that enables them to adapt their behavior in response to arbitrary text input (prompts). In a typical prompt-based learning paradigm, a large language model responds to a prompt by generating text. For example, suppose a user enters the following prompt:

Summarize Newton's Third Law of Motion.

A model capable of prompt-based learning isn't specifically trained to answerthe previous prompt. Rather, the model "knows" a lot of facts about physics,a lot about general language rules, and a lot about what constitutes generallyuseful answers. That knowledge is sufficient to provide a (hopefully) usefulanswer. Additional human feedback ("That answer was too complicated." or"What's a reaction?") enables some prompt-based learning systems to graduallyimprove the usefulness of their answers.

prompt design

#generativeAI

Synonym for prompt engineering.

prompt engineering

#generativeAI

The art of creating prompts that elicit the desired responses from a large language model. Humans perform prompt engineering. Writing well-structured prompts is an essential part of ensuring useful responses from a large language model. Prompt engineering depends on many factors, including:

  • The dataset used to pre-train and possibly fine-tune the large language model.
  • The temperature and other decoding parameters that the model uses to generate responses.

Prompt design is a synonym for prompt engineering.

See Introduction to prompt design for more details on writing helpful prompts.

prompt set

#generativeAI

A group of prompts for evaluating a large language model. For example, the following illustration shows a prompt set consisting of three prompts:

Three prompts to an LLM produce three responses. The three prompts          are the prompt set. The three responses are the response set.

Good prompt sets consist of a sufficiently "wide" collection of prompts to thoroughly evaluate the safety and helpfulness of a large language model.

See also response set.

prompt tuning

#generativeAI

A parameter-efficient tuning mechanism that learns a "prefix" that the system prepends to the actual prompt.

One variation of prompt tuning—sometimes called prefix tuning—is to prepend the prefix at every layer. In contrast, most prompt tuning only adds a prefix to the input layer.


For prompt tuning, the "prefix" (also known as a "soft prompt") is a handful of learned, task-specific vectors prepended to the text token embeddings from the actual prompt. The system learns the soft prompt by freezing all other model parameters and fine-tuning on a specific task.


proxy (sensitive attributes)

#responsible

An attribute used as a stand-in for a sensitive attribute. For example, an individual's postal code might be used as a proxy for their income, race, or ethnicity.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against the sun than the rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

pure function

A function whose outputs are based only on its inputs, and that has no side effects. Specifically, a pure function doesn't use or change any global state, such as the contents of a file or the value of a variable outside the function.

Pure functions can be used to create thread-safe code, which is beneficial when sharding model code across multiple accelerator chips.

JAX's function transformation methods require that the input functions are pure functions.

Q

Q-function

In reinforcement learning, the function that predicts the expected return from taking an action in a state and then following a given policy.

Q-function is also known as the state-action value function.

Q-learning

In reinforcement learning, an algorithm that allows an agent to learn the optimal Q-function of a Markov decision process by applying the Bellman equation. The Markov decision process models an environment.

quantile

Each bucket in quantile bucketing.

quantile bucketing

Distributing a feature's values into buckets so that each bucket contains the same (or almost the same) number of examples. For example, the following figure divides 44 points into 4 buckets, each of which contains 11 points. In order for each bucket in the figure to contain the same number of points, some buckets span a different width of x-values.

44 data points divided into 4 buckets of 11 points each.          Although each bucket contains the same number of data points,          some buckets contain a wider range of feature values than other          buckets.
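As a minimal NumPy sketch, quartile boundaries can be computed and each value assigned to a bucket (the random data is only for illustration):

import numpy as np

values = np.random.exponential(scale=10, size=44)

# Quartile boundaries split the data into 4 buckets of (almost) equal count.
boundaries = np.quantile(values, [0.25, 0.5, 0.75])
buckets = np.digitize(values, boundaries)

# Each of the 4 buckets contains (almost) the same number of values.
print(np.bincount(buckets))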

See Numerical data: Binning in Machine Learning Crash Course for more information.

quantization

Overloaded term that could be used in any of the following ways:

  • Implementing quantile bucketing on a particular feature.
  • Transforming data into zeroes and ones for quicker storing, training, and inferring. As Boolean data is more robust to noise and errors than other formats, quantization can improve model correctness. Quantization techniques include rounding, truncating, and binning.
  • Reducing the number of bits used to store a model's parameters. For example, suppose a model's parameters are stored as 32-bit floating-point numbers. Quantization converts those parameters from 32 bits down to 4, 8, or 16 bits. Quantization reduces the following:

    • Compute, memory, disk, and network usage
    • Time to infer a prediction
    • Power consumption

    However, quantization sometimes decreases the correctness of a model's predictions.
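A minimal NumPy sketch of the third meaning, simulating symmetric 8-bit quantization of floating-point parameters:

import numpy as np

params = np.array([0.8, -1.9, 0.02, 1.5], dtype=np.float32)

# Map float32 values onto the int8 range [-127, 127].
scale = np.abs(params).max() / 127
quantized = np.round(params / scale).astype(np.int8)

# Dequantizing recovers an approximation of the original values.
restored = quantized.astype(np.float32) * scale
print(quantized, restored)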

queue

#TensorFlow

A TensorFlow Operation that implements a queue data structure. Typically used in I/O.

R

RAG

#fundamentals

Abbreviation for retrieval-augmented generation.

random forest

#df

An ensemble of decision trees in which each decision tree is trained with a specific random noise, such as bagging.

Random forests are a type of decision forest.

See Random Forest in the Decision Forests course for more information.

random policy

In reinforcement learning, a policy that chooses an action at random.

rank (ordinality)

The ordinal position of a class in a machine learning problem that categorizes classes from highest to lowest. For example, a behavior ranking system could rank a dog's rewards from highest (a steak) to lowest (wilted kale).

rank (Tensor)

#TensorFlow

The number of dimensions in a Tensor. For example, a scalar has rank 0, a vector has rank 1, and a matrix has rank 2.

Not to be confused with rank (ordinality).

ranking

A type of supervised learning whose objective is to order a list of items.

rater

#fundamentals

A human who provides labels for examples. "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

Reading Comprehension with Commonsense Reasoning Dataset (ReCoRD)

#Metric

A dataset to evaluate an LLM's ability to perform commonsense reasoning. Each example in the dataset contains three components:

  • A paragraph or two from a news article
  • A query in which one of the entities explicitly or implicitly identified in the passage is masked.
  • The answer (the name of the entity that belongs in the mask)

See ReCoRD for an extensive list of examples.

ReCoRD is a component of the SuperGLUE ensemble.

RealToxicityPrompts

#Metric

A dataset that contains a set of sentence beginnings that might contain toxic content. Use this dataset to evaluate an LLM's ability to generate non-toxic text to complete the sentence. Typically, you use the Perspective API to determine how well the LLM performed at this task.

See RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models for details.

recall

#fundamentals
#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class, what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} =\frac{\text{true positives}} {\text{true positives} + \text{false negatives}}\]

where:

  • true positive means the model correctly predicted the positive class.
  • false negative means that the model mistakenly predicted the negative class.

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

  • 180 were true positives.
  • 20 were false negatives.

In this case:

\[\text{Recall} =\frac{\text{180}} {\text{180} + \text{20}} = 0.9\]
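The matching Python sketch for recall mirrors the precision function shown earlier:

def recall(true_positives, false_negatives):
    return true_positives / (true_positives + false_negatives)

print(recall(180, 20))  # 0.9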


Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

  • 30 True Positives
  • 20 False Negatives
  • 4,999,000 True Negatives
  • 950 False Positives

The recall of this model is therefore:

recall = TP / (TP + FN) = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

accuracy = (TP + TN) / (TP + TN + FP + FN) = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.


See Classification: Accuracy, recall, precision and related metrics for more information.

recall at k (recall@k)

#Metric

A metric for evaluating systems that output a ranked (ordered) list of items. Recall at k identifies the fraction of relevant items in the first k items in that list out of the total number of relevant items returned.

\[\text{recall at k} = \frac{\text{relevant items in first k items of the list}} {\text{total number of relevant items in the list}}\]

Contrast with precision at k.


Suppose a large language model is given the following query:

List the 10 funniest movies of all time in order.

And the large language model returns the list shown in the first two columns:

Position | Movie | Relevant?
1 | The General | Yes
2 | Mean Girls | Yes
3 | Platoon | No
4 | Bridesmaids | Yes
5 | This is Spinal Tap | Yes
6 | Airplane! | Yes
7 | Groundhog Day | Yes
8 | Monty Python and the Holy Grail | Yes
9 | Oppenheimer | No
10 | Clueless | Yes

Eight of the movies in the preceding list are very funny, so they are "relevant items in the list." Therefore, 8 will be the denominator in all the calculations of recall at k. What about the numerator? Well, 3 of the first 4 items are relevant, so recall at 4 is:

$$\text{recall at 4} = \frac{\text{3}} {\text{8}} = 0.375$$

7 of the first 8 movies are very funny, so recall at 8 is:

$$\text{recall at 8} = \frac{\text{7}} {\text{8}} = 0.875$$
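The recall at k computation is a small variation on precision at k:

def recall_at_k(relevant, k):
    # relevant: list of booleans in ranked order.
    return sum(relevant[:k]) / sum(relevant)

relevant = [True, True, False, True, True,
            True, True, True, False, True]
print(recall_at_k(relevant, 4))  # 0.375
print(recall_at_k(relevant, 8))  # 0.875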

Recognizing Textual Entailment (RTE)

#Metric

A dataset for evaluating an LLM's ability to determine whether a hypothesis can be entailed (logically drawn) from a text passage. Each example in an RTE evaluation consists of three parts:

  • A passage, typically from news or Wikipedia articles
  • A hypothesis
  • The correct answer, which is either:
    • True, meaning the hypothesis can be entailed from the passage
    • False, meaning the hypothesis can't be entailed from the passage

For example:

  • Passage: The Euro is the currency of the European Union.
  • Hypothesis: France uses the Euro as currency.
  • Entailment: True, because France is part of the European Union.

RTE is a component of the SuperGLUE ensemble.

recommendation system

A system that selects for each user a relatively small set of desirable items from a large corpus. For example, a video recommendation system might recommend two videos from a corpus of 100,000 videos, selecting Casablanca and The Philadelphia Story for one user, and Wonder Woman and Black Panther for another. A video recommendation system might base its recommendations on factors such as:

  • Movies that similar users have rated or watched.
  • Genre, directors, actors, target demographic...

See the Recommendation Systems course for more information.

ReCoRD

#Metric

Abbreviation for Reading Comprehension with Commonsense Reasoning Dataset.

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

  • If input is negative or zero, then the output is 0.
  • If input is positive, then the output is equal to the input.

For example:

  • If the input is -3, then the output is 0.
  • If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

A cartesian plot of two lines. The first line has a constant          y value of 0, running along the x-axis from -infinity,0 to 0,-0.          The second line starts at 0,0. This line has a slope of +1, so          it runs from 0,0 to +infinity,+infinity.

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label.

recurrent neural network

A neural network that is intentionally run multiple times, where parts of each run feed into the next run. Specifically, hidden layers from the previous run provide part of the input to the same hidden layer in the next run. Recurrent neural networks are particularly useful for evaluating sequences, so that the hidden layers can learn from previous runs of the neural network on earlier parts of the sequence.

For example, the following figure shows a recurrent neural network that runs four times. Notice that the values learned in the hidden layers from the first run become part of the input to the same hidden layers in the second run. Similarly, the values learned in the hidden layer on the second run become part of the input to the same hidden layer in the third run. In this way, the recurrent neural network gradually trains and predicts the meaning of the entire sequence rather than just the meaning of individual words.

An RNN that runs four times to process four input words.

reference text

#generativeAI

An expert's response to a prompt. For example, given the following prompt:

Translate the question "What is your name?" from English to French.

An expert's response might be:

Comment vous appelez-vous?

Various metrics (such as ROUGE) measure the degree to which the reference text matches an ML model's generated text.

Note: The expert is typically a human but could be an ML model.

reflection

#generativeAI

A strategy for improving the quality of an agentic workflow by examining (reflecting on) a step's output before passing that output to the next step.

The examiner is often the same LLM that generated the response (though it could be a different LLM). How could the same LLM that generated a response be a fair judge of its own response? The "trick" is to put the LLM in a critical (reflective) mindset. This process is analogous to a writer who uses a creative mindset to write a first draft and then switches to a critical mindset to edit it.

For example, imagine an agentic workflow whose first step is to create text for coffee mugs. The prompt for this step might be:

You are a creative. Generate humorous, original text of less than 50 characters suitable for a coffee mug.

Now imagine the following reflective prompt:

You are a coffee drinker. Would you find the preceding response humorous?

The workflow might then only pass text that receives a high reflection score to the next stage.

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

  • A model that predicts a certain house's value in Euros, such as 423,000.
  • A model that predicts a certain tree's life expectancy in years, such as 23.2.
  • A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

  • Linear regression, which finds the line that best fits label values to features.
  • Logistic regression, which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting. Popular types of regularization include L1 regularization, L2 regularization, dropout regularization, and early stopping.

Regularization can also be defined as the penalty on a model's complexity.


Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

Actually, no. The goal isn't to minimize training loss. The goal is to make excellent predictions on real-world examples. Remarkably, even though increasing regularization increases training loss, it usually helps models make better predictions on real-world examples.


See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.


The regularization rate is usually represented as the Greek letter lambda. The following simplified loss equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, such as L1 or L2 regularization.


See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

reinforcement learning (RL)

A family of algorithms that learn an optimal policy, whose goal is to maximize return when interacting with an environment. For example, the ultimate reward of most games is victory. Reinforcement learning systems can become expert at playing complex games by evaluating sequences of previous game moves that ultimately led to wins and sequences that ultimately led to losses.

Reinforcement Learning from Human Feedback (RLHF)

#generativeAI

Using feedback from human raters to improve the quality of a model's responses. For example, an RLHF mechanism can ask users to rate the quality of a model's response with a 👍 or 👎 emoji. The system can then adjust its future responses based on that feedback.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit.

replay buffer

In DQN-like algorithms, the memory used by the agent to store state transitions for use in experience replay.

replica

A copy (or part) of a training set or model, typically stored on another machine. For example, a system could use the following strategy for implementing data parallelism:

  1. Place replicas of an existing model on multiple machines.
  2. Send different subsets of the training set to each replica.
  3. Aggregate the parameter updates.

A replica can also refer to another copy of an inference server. Increasing the number of replicas increases the number of requests that the system can serve simultaneously but also increases serving costs.

reporting bias

#responsible

The fact that the frequency with which people write about actions, outcomes, or properties is not a reflection of their real-world frequencies or the degree to which a property is characteristic of a class of individuals. Reporting bias can influence the composition of data that machine learning systems learn from.

For example, in books, the word laughed is more prevalent than breathed. A machine learning model that estimates the relative frequency of laughing and breathing from a book corpus would probably determine that laughing is more common than breathing.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

representation

The process of mapping data to useful features.

re-ranking

The final stage of a recommendation system, during which scored items may be re-graded according to some other (typically, non-ML) algorithm. Re-ranking evaluates the list of items generated by the scoring phase, taking actions such as:

  • Eliminating items that the user has already purchased.
  • Boosting the score of fresher items.

See Re-ranking in the Recommendation Systems course for more information.

response

#generativeAI

The text, images, audio, or video that a generative AI model infers. In other words, a prompt is the input to a generative AI model and the response is the output.

response set

#generativeAI

The collection of responses a large language model returns to an input prompt set.

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

  • Increasing the factual accuracy of a model's generated responses.
  • Giving the model access to knowledge it was not trained on.
  • Changing the knowledge that the model uses.
  • Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend:

  1. Searches for ("retrieves") data that's relevant to the user's query.
  2. Appends ("augments") the relevant chemistry data to the user's query.
  3. Instructs the LLM to create a summary based on the appended data.

return

In reinforcement learning, given a certain policy and a certain state, the return is the sum of all rewards that the agent expects to receive when following the policy from the state to the end of the episode. The agent accounts for the delayed nature of expected rewards by discounting rewards according to the state transitions required to obtain the reward.

Therefore, if the discount factor is \(\gamma\), and \(r_0, \ldots, r_{N-1}\) denote the rewards until the end of the episode, then the return calculation is as follows:

$$\text{Return} = r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots + \gamma^{N-1} r_{N-1}$$
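A direct Python translation of that formula:

def discounted_return(rewards, gamma):
    # rewards: [r_0, r_1, ..., r_{N-1}]; gamma: discount factor.
    return sum(gamma**i * r for i, r in enumerate(rewards))

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0 + 0.81*2 = 2.62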

reward

In reinforcement learning, the numerical result of taking an action in a state, as defined by the environment.

ridge regularization

Synonym for L2 regularization. The term ridge regularization is more frequently used in pure statistics contexts, whereas L2 regularization is used more often in machine learning.

RNN

Abbreviation for recurrent neural networks.

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and          7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis          is True Positive Rate. The curve has an inverted L shape. The curve          starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve          goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes          completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0)          to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis          is True Positive Rate. The ROC curve approximates a shaky arc          traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

role prompting

#generativeAI

A prompt, typically beginning with the pronoun you, that tells a generative AI model to pretend to be a certain person or a certain role when generating the response. Role prompting can help a generative AI model get into the right "mindset" in order to generate a more useful response. For example, any of the following role prompts might be appropriate depending on the kind of response you are seeking:

You have a PhD in computer science.

You are a software engineer who enjoys giving patient explanations about Python to new programming students.

You are an action hero with a very particular set of programming skills. Assure me that you will find a particular item in a Python list.

root

#df

The starting node (the first condition) in a decision tree. By convention, diagrams put the root at the top of the decision tree. For example:

A decision tree with two conditions and three leaves. The          starting condition (x > 2) is the root.

root directory

#TensorFlow

The directory you specify for hosting subdirectories of the TensorFlow checkpoint and events files of multiple models.

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error.

rotational invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the orientation of the image changes. For example, the algorithm can still identify a tennis racket whether it is pointing up, sideways, or down. Note that rotational invariance is not always desirable; for example, an upside-down 9 shouldn't be classified as a 9.

See also translational invariance and size invariance.

ROUGE (Recall-Oriented Understudy for Gisting Evaluation)

#Metric

A family of metrics that evaluate automatic summarization and machine translation models. ROUGE metrics determine the degree to which a reference text overlaps an ML model's generated text. Each member of the ROUGE family measures overlap in a different way. Higher ROUGE scores indicate more similarity between the reference text and generated text than lower ROUGE scores.

Each ROUGE family member typically generates the following metrics:

  • Precision
  • Recall
  • F1
Note: ROUGE uses precision and recall somewhat differently than traditional precision and recall.


Note: BLEU and BLEURT optimize for precision while ROUGE optimizes for recall. Consequently, BLEU and BLEURT are better metrics for evaluating machine translation (since the focus is precision) while ROUGE is a better metric for summarization (since the focus is recall).

ROUGE-L

#Metric

A member of the ROUGE family focused on the length of the longest common subsequence in the reference text and generated text. The following formulas calculate recall and precision for ROUGE-L:

$$\text{ROUGE-L recall} = \frac{\text{longest common subsequence}} {\text{number of words in the reference text}}$$
$$\text{ROUGE-L precision} = \frac{\text{longest common subsequence}} {\text{number of words in the generated text}}$$

You can then use F1 to roll up ROUGE-L recall and ROUGE-L precision into a single metric:

$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{ROUGE-L recall} * \text{ROUGE-L precision}} {\text{ROUGE-L recall} + \text{ROUGE-L precision} }$$
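A minimal Python sketch of ROUGE-L over whitespace-separated words, using a standard dynamic-programming longest-common-subsequence routine:

def lcs_length(a, b):
    # Classic dynamic program for longest common subsequence length.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, word_a in enumerate(a, 1):
        for j, word_b in enumerate(b, 1):
            if word_a == word_b:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l(reference, generated):
    ref, gen = reference.split(), generated.split()
    lcs = lcs_length(ref, gen)
    recall, precision = lcs / len(ref), lcs / len(gen)
    return 2 * recall * precision / (recall + precision)

print(rouge_l("I want to understand a wide variety of things.",
              "I want to learn plenty of things."))
# 0.625 (the worked example below rounds intermediate values to get 0.63)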


Consider the following reference text and generated text.

Category | Who produced? | Text
Reference text | Human translator | I want to understand a wide variety of things.
Generated text | ML model | I want to learn plenty of things.
Therefore:
  • The longest common subsequence is 5 (I want to of things)
  • The number of words in the reference text is 9.
  • The number of words in the generated text is 7.
Consequently:
$$\text{ROUGE-L recall} = \frac{\text{5}} {\text{9} } = 0.56$$
$$\text{ROUGE-L precision} = \frac{\text{5}} {\text{7} } = 0.71$$
$$\text{ROUGE-L F} {_1} = \frac{\text{2} * \text{0.56} * \text{0.71}} {\text{0.56} + \text{0.71} } = 0.63$$

ROUGE-L ignores any newlines in the reference text and generated text, so the longest common subsequence could cross multiple sentences. When the reference text and generated text involve multiple sentences, a variation of ROUGE-L called ROUGE-Lsum is generally a better metric. ROUGE-Lsum determines the longest common subsequence for each sentence in a passage and then calculates the mean of those longest common subsequences.


Consider the following reference text and generated text.

Category | Who produced? | Text
Reference text | Human translator | The surface of Mars is dry. Nearly all the water is deep underground.
Generated text | ML model | Mars has a dry surface. However, the vast majority of water is underground.

Therefore:

 | First sentence | Second sentence
Longest common subsequence | 2 (Mars dry) | 3 (water is underground)
Sentence length of reference text | 6 | 7
Sentence length of generated text | 5 | 8
Consequently:
$$\text{recall of first sentence} = \frac{\text{2}} {\text{6}} = 0.33 $$
$$\text{recall of second sentence} = \frac{\text{3}} {\text{7}} = 0.43 $$
$$\text{ROUGE-Lsum recall} = \frac{\text{0.33} + \text{0.43}} {\text{2}} = 0.38 $$
$$\text{precision of first sentence} = \frac{\text{2}} {\text{5}} = 0.4 $$
$$\text{precision of second sentence} = \frac{\text{3}} {\text{8}} = 0.38 $$
$$\text{ROUGE-Lsum precision} = \frac{\text{0.4} + \text{0.38}} {\text{2}} = 0.39 $$
$$\text{ROUGE-Lsum F}{_1} = \frac{\text{2} * \text{0.38} * \text{0.39}} {\text{0.38} + \text{0.39}} = 0.38 $$

ROUGE-N

#Metric

A set of metrics within the ROUGE family that compares the shared N-grams of a certain size in the reference text and generated text. For example:

  • ROUGE-1 measures the number of shared tokens in the reference text and generated text.
  • ROUGE-2 measures the number of shared bigrams (2-grams) in the reference text and generated text.
  • ROUGE-3 measures the number of shared trigrams (3-grams) in the reference text and generated text.

You can use the following formulas to calculate ROUGE-N recall and ROUGE-N precision for any member of the ROUGE-N family:

$$\text{ROUGE-N recall} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the reference text} }$$
$$\text{ROUGE-N precision} = \frac{\text{number of matching N-grams}} {\text{number of N-grams in the generated text} }$$

You can then use F1 to roll up ROUGE-N recall and ROUGE-N precision into a single metric:

$$\text{ROUGE-N F}{_1} = \frac{\text{2} * \text{ROUGE-N recall} * \text{ROUGE-N precision}} {\text{ROUGE-N recall} + \text{ROUGE-N precision} }$$
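A small Python sketch of ROUGE-N using collections.Counter to count matching N-grams:

from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def rouge_n(reference, generated, n):
    ref_counts = Counter(ngrams(reference.split(), n))
    gen_counts = Counter(ngrams(generated.split(), n))
    # Count N-grams that appear in both texts (clipped by count).
    matches = sum((ref_counts & gen_counts).values())
    recall = matches / sum(ref_counts.values())
    precision = matches / sum(gen_counts.values())
    return 2 * recall * precision / (recall + precision)

print(rouge_n("I want to understand a wide variety of things.",
              "I want to learn plenty of things.", n=2))  # ~0.43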


Suppose you decide to use ROUGE-2 to measure the effectiveness of an ML model's translation compared to a human translator's.

Category | Who produced? | Text | Bigrams
Reference text | Human translator | I want to understand a wide variety of things. | I want, want to, to understand, understand a, a wide, wide variety, variety of, of things
Generated text | ML model | I want to learn plenty of things. | I want, want to, to learn, learn plenty, plenty of, of things

Therefore:

  • The number of matching 2-grams is 3 (I want, want to, and of things).
  • The number of 2-grams in the reference text is 8.
  • The number of 2-grams in the generated text is 6.
Consequently:
$$\text{ROUGE-2 recall} = \frac{\text{3}} {\text{8} } = 0.375$$
$$\text{ROUGE-2 precision} = \frac{\text{3}} {\text{6} } = 0.5$$
$$\text{ROUGE-2 F}{_1} = \frac{\text{2} * \text{0.375} * \text{0.5}} {\text{0.375} + \text{0.5} } = 0.43$$

ROUGE-S

#Metric

A forgiving form of ROUGE-N that enables skip-gram matching. That is, ROUGE-N only counts N-grams that match exactly, but ROUGE-S also counts N-grams separated by one or more words. For example, consider the following:

When calculating ROUGE-N, the 2-gram White clouds doesn't match White billowing clouds. However, when calculating ROUGE-S, White clouds does match White billowing clouds.

R-squared

#Metric

A regression metric indicating how much variation in a label is due to an individual feature or to a feature set. R-squared is a value between 0 and 1, which you can interpret as follows:

  • An R-squared of 0 means that none of a label's variation is due to the feature set.
  • An R-squared of 1 means that all of a label's variation is due to the feature set.
  • An R-squared between 0 and 1 indicates the extent to which the label's variation can be predicted from a particular feature or the feature set. For example, an R-squared of 0.10 means that 10 percent of the variance in the label is due to the feature set, an R-squared of 0.20 means that 20 percent is due to the feature set, and so on.

R-squared is the square of the Pearson correlation coefficient between the values that a model predicted and ground truth.
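Following that definition, a NumPy sketch computes R-squared as the squared Pearson correlation between predictions and ground truth (the arrays here are arbitrary illustration data):

import numpy as np

predictions = np.array([3.1, 4.9, 7.2, 9.0])
ground_truth = np.array([3.0, 5.0, 7.0, 9.5])

# np.corrcoef returns the 2x2 correlation matrix; take the off-diagonal.
r = np.corrcoef(predictions, ground_truth)[0, 1]
print(r ** 2)  # R-squared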

RTE

#Metric

Abbreviation for Recognizing Textual Entailment.

S

sampling bias

#responsible

See selection bias.

sampling with replacement

#df

A method of picking items from a set of candidate items in which the same item can be picked multiple times. The phrase "with replacement" means that after each selection, the selected item is returned to the pool of candidate items. The inverse method, sampling without replacement, means that a candidate item can only be picked once.

For example, consider the following fruit set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Suppose that the system randomly picks fig as the first item. If using sampling with replacement, then the system picks the second item from the following set:

fruit = {kiwi, apple, pear, fig, cherry, lime, mango}

Yes, that's the same set as before, so the system could potentially pick fig again.

If using sampling without replacement, once picked, a sample can't be picked again. For example, if the system randomly picks fig as the first sample, then fig can't be picked again. Therefore, the system picks the second sample from the following (reduced) set:

fruit = {kiwi, apple, pear, cherry, lime, mango}
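Python's standard library illustrates the difference directly: random.choices samples with replacement, while random.sample samples without replacement:

import random

fruit = ["kiwi", "apple", "pear", "fig", "cherry", "lime", "mango"]

print(random.choices(fruit, k=3))  # with replacement; repeats possible
print(random.sample(fruit, 3))     # without replacement; no repeats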


The word replacement in sampling with replacement confuses many people. In English, replacement means "substitution." However, sampling with replacement actually uses the French definition for replacement, which means "putting something back."

The English word replacement is translated as the French word remplacement.


SavedModel

#TensorFlow

The recommended format for saving and recovering TensorFlow models. SavedModel is a language-neutral, recoverable serialization format, which enables higher-level systems and tools to produce, consume, and transform TensorFlow models.

See the Saving and Restoring section of the TensorFlow Programmer's Guide for complete details.

Saver

#TensorFlow

A TensorFlow object responsible for saving model checkpoints.

scalar

A single number or a single string that can be represented as a tensor of rank 0. For example, the following lines of code each create one scalar in TensorFlow:

breed = tf.Variable("poodle", tf.string)
temperature = tf.Variable(27, tf.int16)
precision = tf.Variable(0.982375101275, tf.float64)

scaling

Any mathematical transform or technique that shifts the range of a label, a feature value, or both. Some forms of scaling are very useful for transformations like normalization.

Common forms of scaling useful in Machine Learning include:

  • linear scaling, which typically uses a combination of subtraction and division to replace the original value with a number between -1 and +1 or between 0 and 1.
  • logarithmic scaling, which replaces the original value with its logarithm.
  • Z-score normalization, which replaces the original value with a floating-point value representing the number of standard deviations from that feature's mean.
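As a minimal NumPy sketch, here are those three forms of scaling applied to an arbitrary feature column:

import numpy as np

values = np.array([2.0, 15.0, 60.0, 250.0])

linear = (values - values.min()) / (values.max() - values.min())  # 0 to 1
logarithmic = np.log(values)
z_scores = (values - values.mean()) / values.std()

print(linear, logarithmic, z_scores, sep="\n")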

scikit-learn

A popular open-source machine learning platform. See scikit-learn.org.

scoring

#Metric

The part of a recommendation system that provides a value or ranking for each item produced by the candidate generation phase.

selection bias

#responsible

Errors in conclusions drawn from sampled data due to a selection process that generates systematic differences between samples observed in the data and those not observed. The following forms of selection bias exist:

  • coverage bias: The population represented in the dataset doesn't match the population that the machine learning model is making predictions about.
  • sampling bias: Data is not collected randomly from the target group.
  • non-response bias (also called participation bias): Users from certain groups opt out of surveys at different rates than users from other groups.

For example, suppose you are creating a machine learning model that predicts people's enjoyment of a movie. To collect training data, you hand out a survey to everyone in the front row of a theater showing the movie. Offhand, this may sound like a reasonable way to gather a dataset; however, this form of data collection may introduce the following forms of selection bias:

  • coverage bias: By sampling from a population who chose to see the movie, your model's predictions may not generalize to people who did not already express that level of interest in the movie.
  • sampling bias: Rather than randomly sampling from the intended population (all the people at the movie), you sampled only the people in the front row. It is possible that the people sitting in the front row were more interested in the movie than those in other rows.
  • non-response bias: In general, people with strong opinions tend to respond to optional surveys more frequently than people with mild opinions. Since the movie survey is optional, the responses are more likely to form a bimodal distribution than a normal (bell-shaped) distribution.

self-attention (also called self-attention layer)

A neural network layer that transforms a sequence of embeddings (for example, token embeddings) into another sequence of embeddings. Each embedding in the output sequence is constructed by integrating information from the elements of the input sequence through an attention mechanism.

The self part of self-attention refers to the sequence attending to itself rather than to some other context. Self-attention is one of the main building blocks for Transformers and uses dictionary lookup terminology, such as "query", "key", and "value".

A self-attention layer starts with a sequence of input representations, one for each word. The input representation for a word can be a simple embedding. For each word in an input sequence, the network scores the relevance of the word to every element in the whole sequence of words. The relevance scores determine how much the word's final representation incorporates the representations of other words.

For example, consider the following sentence:

The animal didn't cross the street because it was too tired.

The following illustration (from Transformer: A Novel Neural Network Architecture for Language Understanding) shows a self-attention layer's attention pattern for the pronoun it, with the darkness of each line indicating how much each word contributes to the representation:

The following sentence appears twice: The animal didn't cross the          street because it was too tired. Lines connect the pronoun it in          one sentence to five tokens (The, animal, street, it, and          the period) in the other sentence. The line between the pronoun it          and the word animal is strongest.

The self-attention layer highlights words that are relevant to "it". In this case, the attention layer has learned to highlight words that it might refer to, assigning the highest weight to animal.

For a sequence of n tokens, self-attention transforms a sequence of embeddings n separate times, once at each position in the sequence.

Refer also to attention and multi-head self-attention.

self-supervised learning

A family of techniques for converting an unsupervised machine learning problem into a supervised machine learning problem by creating surrogate labels from unlabeled examples.

Some Transformer-based models such as BERT use self-supervised learning.

Self-supervised training is a semi-supervised learning approach.

self-training

A variant of self-supervised learning that is particularly useful when both of the following conditions are true:

  • The dataset contains only a small number of labeled examples.
  • The dataset contains a large number of unlabeled examples.

Self-training works by iterating over the following two steps until the modelstops improving:

  1. Use supervised machine learning to train a model on the labeled examples.
  2. Use the model created in Step 1 to generate predictions (labels) on the unlabeled examples, moving those in which there is high confidence into the labeled examples with the predicted label.

Notice that each iteration of Step 2 adds more labeled examples for Step 1 to train on.

semi-supervised learning

Training a model on data where some of the training examples have labels but others don't. One technique for semi-supervised learning is to infer labels for the unlabeled examples, and then to train on the inferred labels to create a new model. Semi-supervised learning can be useful if labels are expensive to obtain but unlabeled examples are plentiful.

Self-training is one technique for semi-supervised learning.

sensitive attribute

#responsible

A human attribute that may be given special consideration for legal, ethical, social, or personal reasons.

sentiment analysis

Using statistical or machine learning algorithms to determine a group's overall attitude—positive or negative—toward a service, product, organization, or topic. For example, using natural language understanding, an algorithm could perform sentiment analysis on the textual feedback from a university course to determine the degree to which students generally liked or disliked the course.

See the Text classification guide for more information.

sequence model

A model whose inputs have a sequential dependence. For example, predicting the next video watched from a sequence of previously watched videos.

sequence-to-sequence task

A task that converts an input sequence of tokens to an output sequence of tokens. For example, two popular kinds of sequence-to-sequence tasks are:

  • Translators:
    • Sample input sequence: "I love you."
    • Sample output sequence: "Je t'aime."
  • Question answering:
    • Sample input sequence: "Do I need my car in New York City?"
    • Sample output sequence: "No. Keep your car at home."

serving

The process of making a trained model available to provide predictions through online inference or offline inference.

shape (Tensor)

The number of elements in each dimension of a tensor. The shape is represented as a list of integers. For example, the following two-dimensional tensor has a shape of [3,4]:

[[5, 7, 6, 4], [2, 9, 4, 8], [3, 6, 5, 1]]

TensorFlow uses row-major (C-style) format to represent the order of dimensions, which is why the shape in TensorFlow is [3,4] rather than [4,3]. In other words, in a two-dimensional TensorFlow Tensor, the shape is [number of rows, number of columns].

A static shape is a tensor shape that is known at compile time.

A dynamic shape is unknown at compile time and is therefore dependent on runtime data. This tensor might be represented with a placeholder dimension in TensorFlow, as in [3, ?].
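
As a minimal TensorFlow illustration (using the tensor values from the example above):

import tensorflow as tf

t = tf.constant([[5, 7, 6, 4],
                 [2, 9, 4, 8],
                 [3, 6, 5, 1]])
print(t.shape)     # (3, 4): 3 rows, 4 columns
print(tf.rank(t))  # rank 2: a two-dimensional tensor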

shard

#TensorFlow
#GoogleCloud

A logical division of the training set or the model. Typically, some process creates shards by dividing the examples or parameters into (usually) equal-sized chunks. Each shard is then assigned to a different machine.

Sharding a model is called model parallelism; sharding data is called data parallelism.

shrinkage

#df

A hyperparameter in gradient boosting that controls overfitting. Shrinkage in gradient boosting is analogous to learning rate in gradient descent. Shrinkage is a decimal value between 0.0 and 1.0. A lower shrinkage value reduces overfitting more than a larger shrinkage value.

side-by-side evaluation

Comparing the quality of two models by judging their responses to the same prompt. For example, suppose the following prompt is given to two different models:

Create an image of a cute dog juggling three balls.

In a side-by-side evaluation, a rater would pick which image was "better" (More accurate? More beautiful? Cuter?).

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.

The sigmoid function has several uses in machine learning, including:

  • Converting the raw output of a logistic regression model into a probability.
  • Acting as an activation function in some neural networks.


The sigmoid function over an input number x has the following formula:

$$\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}$$

In machine learning, x is generally a weighted sum.
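
A minimal sketch of the formula in Python:

import math

def sigmoid(x):
    # Squash any real number into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(0))   # 0.5
print(sigmoid(6))   # ~0.9975
print(sigmoid(-6))  # ~0.0025, still inside (0, 1)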


similarity measure

#clustering
#Metric

In clustering algorithms, the metric used to determine how alike (how similar) any two examples are.

single program / multiple data (SPMD)

A parallelism technique where the same computation is run on different input data in parallel on different devices. The goal of SPMD is to obtain results more quickly. It is the most common style of parallel programming.

size invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the size of the image changes. For example, the algorithm can still identify a cat whether it consumes 2M pixels or 200K pixels. Note that even the best image classification algorithms still have practical limits on size invariance. For example, an algorithm (or human) is unlikely to correctly classify a cat image consuming only 20 pixels.

See also translational invariance and rotational invariance.

See the Clustering course for more information.

sketching

#clustering

In unsupervised machine learning, a category of algorithms that perform a preliminary similarity analysis on examples. Sketching algorithms use a locality-sensitive hash function to identify points that are likely to be similar, and then group them into buckets.

Sketching decreases the computation required for similarity calculations on large datasets. Instead of calculating similarity for every single pair of examples in the dataset, we calculate similarity only for each pair of points within each bucket.

skip-gram

An n-gram which may omit (or "skip") words from the original context, meaning the N words might not have been originally adjacent. More precisely, a "k-skip-n-gram" is an n-gram for which up to k words may have been skipped.

For example, "the quick brown fox" has the following possible 2-grams:

  • "the quick"
  • "quick brown"
  • "brown fox"

A "1-skip-2-gram" is a pair of words that have at most 1 word between them.Therefore, "the quick brown fox" has the following 1-skip 2-grams:

  • "the brown"
  • "quick fox"

In addition, all the 2-grams are also 1-skip-2-grams, since fewer than one word may be skipped.

Skip-grams are useful for understanding more of a word's surrounding context. In the example, "fox" was directly associated with "quick" in the set of 1-skip-2-grams, but not in the set of 2-grams.

Skip-grams help train word embedding models.
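
A minimal sketch of k-skip-2-gram extraction (the function name is illustrative, not from any particular library):

from itertools import combinations

def skip_bigrams(tokens, k):
    # Return all ordered pairs of tokens separated by at most k intervening words.
    return [(tokens[i], tokens[j])
            for i, j in combinations(range(len(tokens)), 2)
            if j - i - 1 <= k]  # j - i - 1 words are skipped

print(skip_bigrams("the quick brown fox".split(), k=1))
# [('the', 'quick'), ('the', 'brown'), ('quick', 'brown'),
#  ('quick', 'fox'), ('brown', 'fox')]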

softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...   Probability
dog             0.85
cat             0.13
horse           0.02

Softmax is also called full softmax.

Contrast with candidate sampling.


The softmax equation is as follows:

$$\sigma_i = \frac{e^{z_i}} {\sum_{j=1}^{K} {e^{z_j}}}$$

where:

  • $\sigma_i$ is element i of the output vector, $\sigma$, and specifies the probability of class i. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
  • $z$ is the input vector. Each element of the input vector contains a floating-point value.
  • $K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154$$
$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565$$
$$\sigma_3 = \frac{e^{1.8}}{21.552} = 0.281$$

The output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. Phew!
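
A minimal Python sketch of the same calculation (production implementations usually subtract the maximum input value before exponentiating, for numerical stability):

import math

def softmax(z):
    # Convert a vector of raw scores into probabilities that sum to 1.0.
    exps = [math.exp(v) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([1.2, 2.5, 1.8]))
# [0.154..., 0.565..., 0.280...]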


See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

soft prompt tuning

#generativeAI

A technique for tuning a large language model for a particular task, without resource-intensive fine-tuning. Instead of retraining all the weights in the model, soft prompt tuning automatically adjusts a prompt to achieve the same goal.

Given a textual prompt, soft prompt tuning typically appends additional token embeddings to the prompt and uses backpropagation to optimize the input.

A "hard" prompt contains actual tokens instead of token embeddings.

sparse feature

#fundamentals

A feature whose values are predominantly zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree. Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding. If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

A vector in which positions 0 through 23 hold the value 0, position 24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.

Note: You shouldn't pass a sparse representation as a direct feature input to a model. Instead, you should convert the sparse representation into a one-hot representation before training on it.
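
A minimal sketch of that conversion (the helper name is illustrative):

def one_hot_from_sparse(index, depth):
    # Expand a sparse (index) representation into a one-hot vector.
    vector = [0] * depth
    vector[index] = 1
    return vector

maple = one_hot_from_sparse(24, depth=36)
print(sum(maple), maple.index(1))  # 1 24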


Suppose each example in your model must represent the words—but not the order of those words—in an English sentence. English consists of about 170,000 words, so English is a categorical feature with about 170,000 elements. Most English sentences use an extremely tiny fraction of those 170,000 words, so the set of words in a single example is almost certainly going to be sparse data.

Consider the following sentence:

My dog is a great dog

You could use a variant of one-hot vector to represent the words in this sentence. In this variant, multiple cells in the vector can contain a nonzero value. Furthermore, in this variant, a cell can contain an integer other than one. Although the words "my", "is", "a", and "great" appear only once in the sentence, the word "dog" appears twice. Using this variant of one-hot vectors to represent the words in this sentence yields the following 170,000-element vector:

A vector of 170,000 integers. The number 1 is at vector positions 0, 45770, 58906, and 91520. The number 2 is at position 26100. Zeroes are at the remaining 169,995 positions.

A sparse representation of the same sentence would simply be:

0:1, 26100:2, 45770:1, 58906:1, 91520:1


The term "sparse representation" confuses a lot of people because sparserepresentation is itselfnot a sparse vector. Rather, sparserepresentation is actually adense representation of a sparse vector.The synonymindex representation is a little clearer than"sparse representation."


See Working with categorical data in Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity.

sparsity

#Metric

The number of elements set to zero (or null) in a vector or matrix divided by the total number of entries in that vector or matrix. For example, consider a 100-element matrix in which 98 cells contain zero. The calculation of sparsity is as follows:

$${\text{sparsity}} =\frac{\text{98}} {\text{100}} ={\text{0.98}}$$

Feature sparsity refers to the sparsity of a feature vector; model sparsity refers to the sparsity of the model weights.
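
A minimal sketch of the calculation:

def sparsity(values):
    # Fraction of elements that are zero (or null).
    zeros = sum(1 for v in values if v == 0 or v is None)
    return zeros / len(values)

matrix = [0] * 98 + [4.2, 7.1]  # 100 entries, 98 of them zero
print(sparsity(matrix))          # 0.98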

spatial pooling

See pooling.

specificational coding

#generativeAI

The process of writing and maintaining a file in a human language (for example, English) that describes software. You can then tell a generative AI model or another software engineer to create the software that fulfills that description.

Automatically generated code generally requires iteration. In specificational coding, you iterate on the description file. By contrast, in conversational coding, you iterate within the prompt box. In practice, automatic code generation sometimes involves a combination of both specificational coding and conversational coding.

split

#df

In a decision tree, another name for a condition.

splitter

#df

While training a decision tree, the routine (and algorithm) responsible for finding the best condition at each node.

SPMD

Abbreviation for single program / multiple data.

SQuAD

#Metric

Acronym for Stanford Question Answering Dataset, introduced in the paper SQuAD: 100,000+ Questions for Machine Comprehension of Text. The questions in this dataset come from people posing questions about Wikipedia articles. Some of the questions in SQuAD have answers, but other questions intentionally don't have answers. Therefore, you can use SQuAD to evaluate an LLM's ability to do both of the following:

  • Answer questions that can be answered.
  • Identify questions that cannot be answered.

Exact match in combination with F1 are the most common metrics for evaluating LLMs against SQuAD.

squared hinge loss

#Metric

The square of the hinge loss. Squared hinge loss penalizes outliers more harshly than regular hinge loss.

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

staged training

A tactic of training a model in a sequence of discrete stages. The goal can be either to speed up the training process or to achieve better model quality.

An illustration of the progressive stacking approach is shown below:

  • Stage 1 contains 3 hidden layers, stage 2 contains 6 hidden layers, and stage 3 contains 12 hidden layers.
  • Stage 2 begins training with the weights learned in the 3 hidden layers of Stage 1. Stage 3 begins training with the weights learned in the 6 hidden layers of Stage 2.

Three stages, which are labeled Stage 1, Stage 2, and Stage 3. Each stage contains a different number of layers: Stage 1 contains 3 layers, Stage 2 contains 6 layers, and Stage 3 contains 12 layers. The 3 layers from Stage 1 become the first 3 layers of Stage 2. Similarly, the 6 layers from Stage 2 become the first 6 layers of Stage 3.

See also pipelining.

state

In reinforcement learning, the parameter values that describe the current configuration of the environment, which the agent uses to choose an action.

state-action value function

Synonym for Q-function.

static

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

  • A static model (or offline model) is a model trained once and then used for a while.
  • Static training (or offline training) is the process of training a static model.
  • Static inference (or offline inference) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic.

static inference

#fundamentals

Synonym for offline inference.

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity.

step

A forward pass and backward pass of one batch.

See backpropagation for more information on the forward pass and backward pass.

step size

Synonym for learning rate.

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set.
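
A minimal sketch for a one-feature linear model (the learning rate, step count, and toy data are arbitrary choices for illustration):

import random

def sgd_step(w, b, x, y, lr=0.01):
    # One SGD update on a single example, using squared loss.
    grad = 2 * ((w * x + b) - y)     # derivative of (prediction - y)^2
    return w - lr * grad * x, b - lr * grad

data = [(x, 3 * x + 1) for x in range(10)]   # examples drawn from y = 3x + 1
w, b = 0.0, 0.0
for _ in range(5000):
    x, y = random.choice(data)               # batch size of one, chosen at random
    w, b = sgd_step(w, b, x, y)
print(round(w, 2), round(b, 2))              # approaches 3.0 and 1.0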

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

stride

In a convolutional operation or pooling, the delta in each dimension of the next series of input slices. For example, the following animation demonstrates a (1,1) stride during a convolutional operation. Therefore, the next input slice starts one position to the right of the previous input slice. When the operation reaches the right edge, the next slice is all the way over to the left but one position down.

An input 5x5 matrix and a 3x3 convolutional filter. Because the stride is (1,1), a convolutional filter will be applied 9 times. The first convolutional slice evaluates the top-left 3x3 submatrix of the input matrix. The second slice evaluates the top-middle 3x3 submatrix. The third convolutional slice evaluates the top-right 3x3 submatrix. The fourth slice evaluates the middle-left 3x3 submatrix. The fifth slice evaluates the middle 3x3 submatrix. The sixth slice evaluates the middle-right 3x3 submatrix. The seventh slice evaluates the bottom-left 3x3 submatrix. The eighth slice evaluates the bottom-middle 3x3 submatrix. The ninth slice evaluates the bottom-right 3x3 submatrix.

The preceding example demonstrates a two-dimensional stride. If the input matrix is three-dimensional, the stride would also be three-dimensional.

structural risk minimization (SRM)

An algorithm that balances two goals:

  • The need to build the most predictive model (for example, lowest loss).
  • The need to keep the model as simple as possible (for example, strong regularization).

For example, a function that minimizes loss + regularization on the training set is a structural risk minimization algorithm.

Contrast with empirical risk minimization.

subsampling

See pooling.

subword token

In language models, a token that is a substring of a word, which may be the entire word.

For example, a word like "itemize" might be broken up into the pieces "item" (a root word) and "ize" (a suffix), each of which is represented by its own token. Splitting uncommon words into such pieces, called subwords, allows language models to operate on the word's more common constituent parts, such as prefixes and suffixes.

Conversely, common words like "going" might not be broken up and might be represented by a single token.

summary

#TensorFlow

In TensorFlow, a value or set of values calculated at a particular step, usually used for tracking model metrics during training.

SuperGLUE

#Metric

An ensemble of datasets for rating an LLM's overall ability to understand and generate text. The ensemble includes datasets such as Words in Context (WiC) and the Winograd Schema Challenge (WSC), both defined elsewhere in this glossary.

For details, see SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels. Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning.

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following (see the sketch after this list):

  • Bucketing a continuous feature into range bins.
  • Creating a feature cross.
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
    • a × b
    • a²
  • Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
    • sin(c)
    • ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.
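
A minimal sketch of the last two methods (the feature and function names are illustrative):

import math

def add_synthetic_features(example):
    # Derive synthetic features from raw features a, b, and c.
    out = dict(example)
    out["a_times_b"] = example["a"] * example["b"]  # product of two features
    out["a_squared"] = example["a"] ** 2            # a feature multiplied by itself
    out["sin_c"] = math.sin(example["c"])           # a transcendental function
    return out

print(add_synthetic_features({"a": 2.0, "b": 3.0, "c": 0.5}))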

T

T5

A text-to-text transfer learning model introduced by Google AI in 2020. T5 is an encoder-decoder model, based on the Transformer architecture, trained on an extremely large dataset. It is effective at a variety of natural language processing tasks, such as generating text, translating languages, and answering questions in a conversational manner.

T5 gets its name from the five Ts in "Text-to-Text Transfer Transformer."

T5X

An open-source machine learning framework designed to build and train large-scale natural language processing (NLP) models. T5 is implemented on the T5X codebase (which is built on JAX and Flax).

tabular Q-learning

In reinforcement learning, implementing Q-learning by using a table to store the Q-functions for every combination of state and action.
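
A minimal sketch of such a table and one Q-learning update (the learning rate and discount values are arbitrary):

from collections import defaultdict

q_table = defaultdict(float)  # maps (state, action) -> estimated return

def q_update(state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    # One tabular Q-learning update toward reward plus discounted best next value.
    best_next = max(q_table[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])

q_update(state=0, action="right", reward=1.0, next_state=1, actions=["left", "right"])
print(q_table[(0, "right")])  # 0.1 after one update from a zero-initialized table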

target

Synonym for label.

target network

In Deep Q-learning, a neural network that is a stable approximation of the main neural network, where the main neural network implements either a Q-function or a policy. Then, you can train the main network on the Q-values predicted by the target network. Therefore, you prevent the feedback loop that occurs when the main network trains on Q-values predicted by itself. By avoiding this feedback, training stability increases.

task

A problem that can be solved using machine learning techniques, such as:

temperature

#generativeAI

A hyperparameter that controls the degree of randomness of a model's output. Higher temperatures result in more random output, while lower temperatures result in less random output.

Choosing the best temperature depends on the specific application and the desired properties of the model's output.

temporal data

Data recorded at different points in time. For example, winter coat sales recorded for each day of the year would be temporal data.

Tensor

#TensorFlow

The primary data structure in TensorFlow programs. Tensors are N-dimensional (where N could be very large) data structures, most commonly scalars, vectors, or matrixes. The elements of a Tensor can hold integer, floating-point, or string values.

TensorBoard

#TensorFlow

The dashboard that displays the summaries saved during the execution of one or more TensorFlow programs.

TensorFlow

#TensorFlow

A large-scale, distributed, machine learning platform. The term also refers to the base API layer in the TensorFlow stack, which supports general computation on dataflow graphs.

Although TensorFlow is primarily used for machine learning, you may also use TensorFlow for non-ML tasks that require numerical computation using dataflow graphs.

TensorFlow Playground

#TensorFlow

A program that visualizes how different hyperparameters influence model (primarily neural network) training. Go to http://playground.tensorflow.org to experiment with TensorFlow Playground.

TensorFlow Serving

#TensorFlow

A platform to deploy trained models in production.

Tensor Processing Unit (TPU)

#TensorFlow
#GoogleCloud

An application-specific integrated circuit (ASIC) that optimizes the performance of machine learning workloads. These ASICs are deployed as multiple TPU chips on a TPU device.

Tensor rank

#TensorFlow

See rank (Tensor).

Tensor shape

#TensorFlow

The number of elements a Tensor contains in various dimensions. For example, a [5, 10] Tensor has a shape of 5 in one dimension and 10 in another.

Tensor size

#TensorFlow

The total number of scalars a Tensor contains. For example, a [5, 10] Tensor has a size of 50.

TensorStore

A library for efficiently reading and writing large multi-dimensional arrays.

termination condition

In reinforcement learning, the conditions that determine when an episode ends, such as when the agent reaches a certain state or exceeds a threshold number of state transitions. For example, in tic-tac-toe (also known as noughts and crosses), an episode terminates either when a player marks three consecutive spaces or when all spaces are marked.

test

#df

In a decision tree, another name for a condition.

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set. When building a model, you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss.

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate.

test set

A subset of the dataset reserved for testing a trained model.

Traditionally, you divide examples in the dataset into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Each example in a dataset should belong to only one of the preceding subsets. For instance, a single example shouldn't belong to both the training set and the test set.

The training set and validation set are both closely tied to training a model. Because the test set is only indirectly associated with training, test loss is a less biased, higher quality metric than training loss or validation loss.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

text span

The array index span associated with a specific subsection of a text string. For example, the word good in the Python string s="Be good now" occupies the text span from 3 to 6.

tf.Example

#TensorFlow

A standard protocol buffer for describing input data for machine learning model training or inference.

tf.keras

#TensorFlow

An implementation of Keras integrated into TensorFlow.

threshold (for decision trees)

#df

In an axis-aligned condition, the value that a feature is being compared against. For example, 75 is the threshold value in the following condition:

grade >= 75

This form of the term threshold is different from classification threshold.

See Exact splitter for binary classification with numerical features in the Decision Forests course for more information.

time series analysis

#clustering

A subfield of machine learning and statistics that analyzes temporal data. Many types of machine learning problems require time series analysis, including classification, clustering, forecasting, and anomaly detection. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data.

timestep

One "unrolled" cell within arecurrent neural network.For example, the following figure shows three timesteps (labeled withthe subscripts t-1, t, and t+1):

Three timesteps in a recurrent neural network. The output of the          first timestep becomes input to the second timestep. The output          of the second timestep becomes input to the third timestep.

token

In a language model, the atomic unit that the model is training on and making predictions on. A token is typically one of the following:

  • a word—for example, the phrase "dogs like cats" consists of three word tokens: "dogs", "like", and "cats".
  • a character—for example, the phrase "bike fish" consists of nine character tokens. (Note that the blank space counts as one of the tokens.)
  • subwords—in which a single word can be a single token or multiple tokens. A subword consists of a root word, a prefix, or a suffix. For example, a language model that uses subwords as tokens might view the word "dogs" as two tokens (the root word "dog" and the plural suffix "s"). That same language model might view the single word "taller" as two subwords (the root word "tall" and the suffix "er").

In domains outside of language models, tokens can represent other kinds of atomic units. For example, in computer vision, a token might be a subset of an image.

See Large language models in Machine Learning Crash Course for more information.

tokenizer

A system or algorithm that translates a sequence of input data into tokens.

Most modern foundation models are multimodal. A tokenizer for a multimodal system must translate each input type into the appropriate format. For example, given input data consisting of both text and graphics, the tokenizer might translate input text into subwords and input images into small patches. The tokenizer must then convert all the tokens into a single unified embedding space, which enables the model to "understand" a stream of multimodal input.

top-k accuracy

#Metric

The percentage of times that a "target label" appears within the first k positions of generated lists. The lists could be personalized recommendations or a list of items ordered by softmax.

Top-k accuracy is also known as accuracy at k.

Note: The target label could be any class (not necessarily the ground truth class), so top-k accuracy is not always equivalent to traditional accuracy.


Consider a machine learning system that uses softmax to identify tree probabilities based on a picture of tree leaves. The following table shows output lists generated from five input tree pictures. Each row contains a target label and the five most likely trees. For example, when the target label was maple, the machine learning model identified elm as the most likely tree, oak as the second most likely tree, and so on.

Target label   1        2         3        4         5
maple          elm      oak       maple    beech     poplar
dogwood        oak      dogwood   poplar   hickory   maple
oak            oak      basswood  locust   alder     linden
linden         maple    paw-paw   oak      basswood  poplar
oak            locust   linden    oak      maple     paw-paw

The target label appears in the first position only once, so the top-1 accuracy is:

$$\text{top-1 accuracy} = \frac{\text{1}} {\text{5}} = 0.2$$

The target label appears in one of the top three positions four times, so the top-3 accuracy is:

$$\text{top-3 accuracy} = \frac{\text{4}} {\text{5}} = 0.8$$
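
A minimal sketch of the metric applied to the preceding table:

def top_k_accuracy(targets, ranked_lists, k):
    # Fraction of examples whose target label appears in the top k positions.
    hits = sum(target in ranked[:k] for target, ranked in zip(targets, ranked_lists))
    return hits / len(targets)

targets = ["maple", "dogwood", "oak", "linden", "oak"]
ranked = [["elm", "oak", "maple", "beech", "poplar"],
          ["oak", "dogwood", "poplar", "hickory", "maple"],
          ["oak", "basswood", "locust", "alder", "linden"],
          ["maple", "paw-paw", "oak", "basswood", "poplar"],
          ["locust", "linden", "oak", "maple", "paw-paw"]]
print(top_k_accuracy(targets, ranked, k=1))  # 0.2
print(top_k_accuracy(targets, ranked, k=3))  # 0.8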

tower

A component of a deep neural network that is itself a deep neural network. In some cases, each tower reads from an independent data source, and those towers stay independent until their output is combined in a final layer. In other cases (for example, in the encoder and decoder towers of many Transformers), towers have cross-connections to each other.

toxicity

#Metric

The degree to which content is abusive, threatening, or offensive. Many machine learning models can identify, measure, and classify toxicity. Most of these models identify toxicity along multiple parameters, such as the level of abusive language and the level of threatening language.

TPU

#TensorFlow
#GoogleCloud

Abbreviation for Tensor Processing Unit.

TPU chip

#TensorFlow
#GoogleCloud

A programmable linear algebra accelerator with on-chip high bandwidth memory that is optimized for machine learning workloads. Multiple TPU chips are deployed on a TPU device.

TPU device

#TensorFlow
#GoogleCloud

A printed circuit board (PCB) with multiple TPU chips, high bandwidth network interfaces, and system cooling hardware.

TPU node

#TensorFlow
#GoogleCloud

A TPU resource on Google Cloud with a specific TPU type. The TPU node connects to your VPC Network from a peer VPC network. TPU nodes are a resource defined in the Cloud TPU API.

TPU Pod

#TensorFlow
#GoogleCloud

A specific configuration of TPU devices in a Google data center. All of the devices in a TPU Pod are connected to one another over a dedicated high-speed network. A TPU Pod is the largest configuration of TPU devices available for a specific TPU version.

TPU resource

#TensorFlow
#GoogleCloud

A TPU entity on Google Cloud that you create, manage, or consume. For example, TPU nodes and TPU types are TPU resources.

TPU slice

#TensorFlow
#GoogleCloud

A TPU slice is a fractional portion of the TPU devices in a TPU Pod. All of the devices in a TPU slice are connected to one another over a dedicated high-speed network.

TPU type

#TensorFlow
#GoogleCloud

A configuration of one or more TPU devices with a specific TPU hardware version. You select a TPU type when you create a TPU node on Google Cloud. For example, a v2-8 TPU type is a single TPU v2 device with 8 cores. A v3-2048 TPU type has 256 networked TPU v3 devices and a total of 2048 cores. TPU types are a resource defined in the Cloud TPU API.

TPU worker

#TensorFlow
#GoogleCloud

A process that runs on a host machine and executes machine learning programs on TPU devices.

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model. During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error. Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence.

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization.

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving.

training set

#fundamentals

The subset of the dataset used to train a model.

Traditionally, examples in the dataset are divided into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

trajectory

In reinforcement learning, a sequence of tuples that represent a sequence of state transitions of the agent, where each tuple corresponds to the state, action, reward, and next state for a given state transition.

transfer learning

Transferring information from one machine learning task to another. For example, in multi-task learning, a single model solves multiple tasks, such as a deep model that has different output nodes for different tasks. Transfer learning might involve transferring knowledge from the solution of a simpler task to a more complex one, or involve transferring knowledge from a task where there is more data to one where there is less data.

Most machine learning systems solve a single task. Transfer learning is a baby step towards artificial intelligence in which a single program can solve multiple tasks.

Transformer

A neural network architecture developed at Google that relies on self-attention mechanisms to transform a sequence of input embeddings into a sequence of output embeddings without relying on convolutions or recurrent neural networks. A Transformer can be viewed as a stack of self-attention layers.

A Transformer can include any of the following:

  • an encoder
  • a decoder

An encoder transforms a sequence of embeddings into a new sequence of the same length. An encoder includes N identical layers, each of which contains two sub-layers. These two sub-layers are applied at each position of the input embedding sequence, transforming each element of the sequence into a new embedding. The first encoder sub-layer aggregates information from across the input sequence. The second encoder sub-layer transforms the aggregated information into an output embedding.

A decoder transforms a sequence of input embeddings into a sequence of output embeddings, possibly with a different length. A decoder also includes N identical layers with three sub-layers, two of which are similar to the encoder sub-layers. The third decoder sub-layer takes the output of the encoder and applies the self-attention mechanism to gather information from it.

The blog post Transformer: A Novel Neural Network Architecture for Language Understanding provides a good introduction to Transformers.

See LLMs: What's a large language model? in Machine Learning Crash Course for more information.

translational invariance

In an image classification problem, an algorithm's ability to successfully classify images even when the position of objects within the image changes. For example, the algorithm can still identify a dog, whether it is in the center of the frame or at the left end of the frame.

See also size invariance and rotational invariance.

trigram

An N-gram in which N=3.

Trivia Question Answering

#Metric

Datasets to evaluate an LLM's ability to answer trivia questions. Each dataset contains question-answer pairs authored by trivia enthusiasts. Different datasets are grounded by different sources, including:

  • Web search (TriviaQA)
  • Wikipedia (TriviaQA_wiki)

For more information, see TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class. For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.

TTL

Abbreviation for time to live.

Typologically Diverse Question Answering (TyDi QA)

#Metric

A large dataset for evaluating an LLM's proficiency in answering questions. The dataset contains question and answer pairs in many languages.

For details, see TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages.

U

Ultra

#generativeAI

The Gemini model with the most parameters. See Gemini Ultra for details.

See also Pro and Nano.

unawareness (to a sensitive attribute)

#responsible

A situation in which sensitive attributes are present, but not included in the training data. Because sensitive attributes are often correlated with other attributes of one's data, a model trained with unawareness about a sensitive attribute could still have disparate impact with respect to that attribute, or violate other fairness constraints.

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

See Overfitting in Machine Learning Crash Course for more information.

undersampling

Removing examples from the majority class in a class-imbalanced dataset in order to create a more balanced training set.

For example, consider a dataset in which the ratio of the majority class to the minority class is 20:1. To overcome this class imbalance, you could create a training set consisting of all of the minority class examples but only a tenth of the majority class examples, which would create a training-set class ratio of 2:1. Thanks to undersampling, this more balanced training set might produce a better model. Alternatively, this more balanced training set might contain insufficient examples to train an effective model.

Contrast with oversampling.

unidirectional

A system that only evaluates the text that precedes a target section of text. In contrast, a bidirectional system evaluates both the text that precedes and follows a target section of text. See bidirectional for more details.

unidirectional language model

A language model that bases its probabilities only on the tokens appearing before, not after, the target token(s). Contrast with bidirectional language model.

unlabeled example

#fundamentals

An example that contains features but no label. For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

Number of bedrooms   Number of bathrooms   House age
3                    2                     15
2                    1                     72
4                    2                     34

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example.

unsupervised machine learning

#clustering
#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning.


Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.


See What is Machine Learning? in the Introduction to ML course for more information.

uplift modeling

A modeling technique, commonly used in marketing, that models the "causal effect" (also known as the "incremental impact") of a "treatment" on an "individual." Here are two examples:

  • Doctors might use uplift modeling to predict the mortality decrease (causal effect) of a medical procedure (treatment) depending on the age and medical history of a patient (individual).
  • Marketers might use uplift modeling to predict the increase in probability of a purchase (causal effect) due to an advertisement (treatment) on a person (individual).

Uplift modeling differs from classification or regression in that some labels (for example, half of the labels in binary treatments) are always missing in uplift modeling. For example, a patient can either receive or not receive a treatment; therefore, we can only observe whether the patient is going to heal or not heal in only one of these two situations (but never both). The main advantage of an uplift model is that it can generate predictions for the unobserved situation (the counterfactual) and use it to compute the causal effect.

upweighting

Applying a weight to the downsampled class equal to the factor by which you downsampled.

user matrix

In recommendation systems, an embedding vector generated by matrix factorization that holds latent signals about user preferences. Each row of the user matrix holds information about the relative strength of various latent signals for a single user. For example, consider a movie recommendation system. In this system, the latent signals in the user matrix might represent each user's interest in particular genres, or might be harder-to-interpret signals that involve complex interactions across multiple factors.

The user matrix has a column for each latent feature and a row for each user. That is, the user matrix has the same number of rows as the target matrix that is being factorized. For example, given a movie recommendation system for 1,000,000 users, the user matrix will have 1,000,000 rows.

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set.

Because the validation set differs from the training set, validation helps guard against overfitting.

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve.

validation set

#fundamentals

The subset of the dataset that performs initial evaluation against a trained model. Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set.

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

value imputation

The process of replacing a missing value with an acceptable substitute. When a value is missing, you can either discard the entire example or you can use value imputation to salvage the example.

For example, consider a dataset containing a temperature feature that is supposed to be recorded every hour. However, the temperature reading was unavailable for a particular hour. Here is a section of the dataset:

Timestamp      Temperature
1680561000     10
1680564600     12
1680568200     missing
1680571800     20
1680575400     21
1680579000     21

A system could either delete the missing example or impute the missing temperature as 12, 16, 18, or 20, depending on the imputation algorithm.
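
A minimal sketch of one such algorithm, linear interpolation between the hourly neighbors (it assumes the missing value is never the first or last reading):

def impute_by_interpolation(readings):
    # Replace a missing reading (None) with the mean of its two neighbors.
    out = list(readings)
    for i, value in enumerate(out):
        if value is None:
            out[i] = (out[i - 1] + out[i + 1]) / 2
    return out

print(impute_by_interpolation([10, 12, None, 20, 21, 21]))
# [10, 12, 16.0, 20, 21, 21]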

vanishing gradient problem

The tendency for the gradients of early hidden layers of some deep neural networks to become surprisingly flat (low). Increasingly lower gradients result in increasingly smaller changes to the weights on nodes in a deep neural network, leading to little or no learning. Models suffering from the vanishing gradient problem become difficult or impossible to train. Long Short-Term Memory cells address this issue.

Compare to exploding gradient problem.

variable importances

#df
#Metric

A set of scores that indicates the relative importance of each feature to the model.

For example, consider a decision tree that estimates house prices. Suppose this decision tree uses three features: size, age, and style. If the variable importances for the three features are calculated to be {size=5.8, age=2.5, style=4.7}, then size is more important to the decision tree than age or style.

Different variable importance metrics exist, which can inform ML experts about different aspects of models.

variational autoencoder (VAE)

A type of autoencoder that leverages the discrepancy between inputs and outputs to generate modified versions of the inputs. Variational autoencoders are useful for generative AI.

VAEs are based on variational inference: a technique for estimating theparameters of a probability model.

vector

A very overloaded term whose meaning varies across different mathematical and scientific fields. Within machine learning, a vector has two properties:

  • Data type: Vectors in machine learning usually hold floating-point numbers.
  • Number of elements: This is the vector's length or its dimension.

For example, consider a feature vector that holds eight floating-point numbers. This feature vector has a length or dimension of eight. Note that machine learning vectors often have a huge number of dimensions.

You can represent many different kinds of information as a vector. For example:

  • Any position on the surface of Earth can be represented as a 2-dimensional vector, where one dimension is the latitude and the other is the longitude.
  • The current prices of each of 500 stocks can be represented as a 500-dimensional vector.
  • A probability distribution over a finite number of classes can be represented as a vector. For example, a multiclass classification system that predicts one of three output colors (red, green, or yellow) could output the vector (0.3, 0.2, 0.5) to mean P[red]=0.3, P[green]=0.2, P[yellow]=0.5.

Vectors can be concatenated; therefore, a variety of different media can be represented as a single vector. Some models operate directly on the concatenation of many one-hot encodings.

Specialized processors such as TPUs are optimized to perform mathematical operations on vectors.

A vector is a tensor of rank 1.

Vertex

#GoogleCloud
#generativeAI
Google Cloud's platform for AI and machine learning. Vertex provides tools and infrastructure for building, deploying, and managing AI applications, including access to Gemini models.

vibe coding

#generativeAI

Prompting a generative AI model to create software. That is, your prompts describe the software's purpose and features, which a generative AI model translates into source code. The generated code doesn't always match your intentions, so vibe coding usually requires iteration.

Andrej Karpathy coined the term vibe coding in this X post. In the X post, Karpathy describes it as "a new kind of coding...where you fully give in to the vibes..." So, the term originally implied an intentionally loose approach to creating software in which you might not even examine the generated code. However, the term has rapidly evolved in many circles to now mean any form of AI-generated coding.

For a more detailed description of vibe coding, see What is vibe coding?.

In addition, compare and contrast vibe coding with:

  • conversational coding
  • specificational coding

W

Wasserstein loss

#Metric

One of the loss functions commonly used in generative adversarial networks, based on the earth mover's distance between the distribution of generated data and real data.

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.


Imagine a linear model with two features. Suppose that training determines the following weights (and bias):

  • The bias, b, has a value of 2.2.
  • The weight, w1, associated with one feature is 1.5.
  • The weight, w2, associated with the other feature is 0.4.

Now imagine an example with the following feature values:

  • The value of one feature, x1, is 6.
  • The value of the other feature, x2, is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature doesn't contribute to the model. For example, if w1 is 0, then the value of x1 is irrelevant.
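
A minimal sketch of that prediction in Python:

def predict(features, weights, bias):
    # Linear model prediction: the bias plus the weighted sum of the features.
    return bias + sum(w * x for w, x in zip(weights, features))

print(predict(features=[6, 10], weights=[1.5, 0.4], bias=2.2))  # 15.2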


See Linear regression in Machine Learning Crash Course for more information.

Weighted Alternating Least Squares (WALS)

An algorithm for minimizing the objective function during matrix factorization in recommendation systems, which allows a downweighting of the missing examples. WALS minimizes the weighted squared error between the original matrix and the reconstruction by alternating between fixing the row factorization and column factorization. Each of these optimizations can be solved by least squares convex optimization. For details, see the Recommendation Systems course.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

input value   input weight
2             -1.3
-1            0.6
3             0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0

A weighted sum is the input argument to an activation function.

WiC

#Metric

Abbreviation for Words in Context.

wide model

A linear model that typically has many sparse input features. We refer to it as "wide" since such a model is a special type of neural network with a large number of inputs that connect directly to the output node. Wide models are often easier to debug and inspect than deep models. Although wide models cannot express nonlinearities through hidden layers, wide models can use transformations such as feature crossing and bucketization to model nonlinearities in different ways.

Contrast with deep model.

width

The number of neurons in a particular layer of a neural network.

WikiLingua (wiki_lingua)

#Metric

A dataset for evaluating an LLM's ability to summarize short articles. WikiHow, an encyclopedia of articles explaining how to do various tasks, is the human-authored source for both the articles and the summaries. Each entry in the dataset consists of:

  • An article, which is created by appending each step of the prose (paragraph) version of the numbered list, minus the opening sentence of each step.
  • A summary of that article, consisting of the opening sentence of each step in the numbered list.

For details, see WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization.

Winograd Schema Challenge (WSC)

#Metric

A format (or dataset conforming to that format) for evaluating an LLM's ability to determine the noun phrase that a pronoun refers to.

Each entry in a Winograd Schema Challenge consists of:

  • A short passage, which contains a target pronoun
  • A target pronoun
  • Candidate noun phrases, followed by the correct answer (a Boolean). If the target pronoun refers to this candidate, the answer is True. If the target pronoun does not refer to this candidate, the answer is False.

For example:

  • Passage: Mark told Pete many lies about himself, which Pete included in his book. He should have been more truthful.
  • Target pronoun: He
  • Candidate noun phrases:
    • Mark: True, because the target pronoun refers to Mark
    • Pete: False, because the target pronoun doesn't refer to Pete

The Winograd Schema Challenge is a component of the SuperGLUE ensemble.

wisdom of the crowd

#df

The idea that averaging the opinions or estimates of a large group of people ("the crowd") often produces surprisingly good results. For example, consider a game in which people guess the number of jelly beans packed into a large jar. Although most individual guesses will be inaccurate, the average of all the guesses has been empirically shown to be surprisingly close to the actual number of jelly beans in the jar.

Ensembles are a software analog of wisdom of the crowd. Even if individual models make wildly inaccurate predictions, averaging the predictions of many models often generates surprisingly good predictions. For example, although an individual decision tree might make poor predictions, a decision forest often makes very good predictions.

WMT

Strangely, an abbreviation for Conference on Machine Translation. (The abbreviation is WMT because the original name was Workshop on Machine Translation.) The conference focuses on developments in machine translation systems.

word embedding

Representing each word in a word set within an embedding vector; that is, representing each word as a vector of floating-point values between 0.0 and 1.0. Words with similar meanings have more-similar representations than words with different meanings. For example, carrots, celery, and cucumbers would all have relatively similar representations, which would be very different from the representations of airplane, sunglasses, and toothpaste.

Words in Context (WiC)

#Metric

A dataset for evaluating how well an LLM uses context to understand words that have multiple meanings. Each entry in the dataset contains:

  • Two sentences, each containing the target word
  • The target word
  • The correct answer (a Boolean), where:
    • True means the target word has the same meaning in the two sentences
    • False means the target word has a different meaning in the two sentences

For example:

  • Two sentences:
    • There's a lot of trash on the bed of the river.
    • I keep a glass of water next to my bed when I sleep.
  • The target word: bed
  • Correct answer: False, because the target word has a different meaning in the two sentences.

For details, see WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations.

Words in Context is a component of the SuperGLUE ensemble.

WSC

#Metric

Abbreviation for Winograd Schema Challenge.

X

XLA (Accelerated Linear Algebra)

An open-source machine learning compiler for GPUs, CPUs, and ML accelerators.

The XLA compiler takes models from popular ML frameworks such as PyTorch, TensorFlow, and JAX, and optimizes them for high-performance execution across different hardware platforms including GPUs, CPUs, and ML accelerators.

XL-Sum (xlsum)

#Metric

A dataset for evaluating an LLM's proficiency in summarizing text. XL-Sum provides entries in many languages. Each entry in the dataset contains:

  • An article, taken from the British Broadcasting Corporation (BBC).
  • A summary of the article, written by the article's author. Note that the summary can contain words or phrases not present in the article.

For details, see XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages.

xsum

Abbreviation for Extreme Summarization.

Z

zero-shot learning

A type of machine learning training where the model infers a prediction for a task that it was not specifically already trained on. In other words, the model is given zero task-specific training examples but asked to do inference for that task.

zero-shot prompting

#generativeAI

A prompt that does not provide an example of how you want the large language model to respond. For example:

Parts of one prompt                                        Notes
What is the official currency of the specified country?   The question you want the LLM to answer.
India:                                                     The actual query.

The large language model might respond with any of the following:

  • Rupee
  • INR
  • Indian rupee
  • The rupee
  • The Indian rupee

All of the answers are correct, though you might prefer a particular format.

Compare and contrast zero-shot prompting with the following terms:

  • one-shot prompting
  • few-shot prompting

Z-score normalization

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

Raw value   Z-score
800         0
950         +1.5
575         -2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
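
A minimal sketch of the mapping:

def z_score(value, mean, stddev):
    # Number of standard deviations a raw value lies from the feature's mean.
    return (value - mean) / stddev

for raw in (800, 950, 575):
    print(raw, z_score(raw, mean=800, stddev=100))
# 800 0.0
# 950 1.5
# 575 -2.25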

See Numerical data: Normalization in Machine Learning Crash Course for more information.

Except as otherwise noted, the content of this page is licensed under the Creative Commons Attribution 4.0 License, and code samples are licensed under the Apache 2.0 License. For details, see the Google Developers Site Policies. Java is a registered trademark of Oracle and/or its affiliates.

Last updated 2025-12-16 UTC.