Machine Learning Glossary: ML Fundamentals

  • This content provides definitions for fundamental machine learning terms, including concepts related to models, training, features, and evaluation.

  • It covers a broad range of topics such as supervised/unsupervised learning, neural networks, regularization, loss functions, and specific algorithms like linear/logistic regression.

  • The glossary aims to equip beginners with essential vocabulary for comprehending core machine learning principles.

  • Various practical considerations like data preprocessing techniques (normalization, bucketing), dealing with imbalanced datasets, and preventing overfitting are also addressed.

  • The content serves as a foundational resource for individuals starting their machine learning journey, offering clear explanations and illustrations of key concepts.

This page contains ML Fundamentals glossary terms. For all glossary terms, click here.

A

accuracy

#fundamentals
#Metric

The number of correct classification predictions divided by the total number of predictions. That is:

$$\text{Accuracy} = \frac{\text{correct predictions}}{\text{correct predictions} + \text{incorrect predictions}}$$

For example, a model that made 40 correct predictions and 10 incorrect predictions would have an accuracy of:

$$\text{Accuracy} = \frac{40}{40 + 10} = 80\%$$

Binary classification provides specific names for the different categories of correct predictions and incorrect predictions. So, the accuracy formula for binary classification is as follows:

$$\text{Accuracy} = \frac{\text{TP} + \text{TN}} {\text{TP} + \text{TN} + \text{FP} + \text{FN}}$$

where:

  • TP is the number of true positives (correct positive predictions).
  • TN is the number of true negatives (correct negative predictions).
  • FP is the number of false positives (incorrect positive predictions).
  • FN is the number of false negatives (incorrect negative predictions).

Compare and contrast accuracy with precision and recall.


Although a valuable metric for some situations, accuracy is highly misleading for others. Notably, accuracy is usually a poor metric for evaluating classification models that process class-imbalanced datasets.

For example, suppose snow falls only 25 days per century in a certain subtropical city. Since days without snow (the negative class) vastly outnumber days with snow (the positive class), the snow dataset for this city is class-imbalanced. Imagine a binary classification model that is supposed to predict either snow or no snow each day but simply predicts "no snow" every day. This model is highly accurate but has no predictive power. The following table summarizes the results for a century of predictions:

Category | Number
TP | 0
TN | 36,499
FP | 0
FN | 25

The accuracy of this model is therefore:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (0 + 36499) / (0 + 36499 + 0 + 25) = 0.9993 = 99.93%

Although 99.93% accuracy seems like a very impressive percentage, the model actually has no predictive power.
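To make the arithmetic concrete, here is a minimal Python sketch (the counts are taken from the table above) that reproduces the accuracy calculation and also computes recall, which exposes the model's lack of predictive power:

```python
# Counts from the "always predict no snow" model over one century.
tp, tn, fp, fn = 0, 36499, 0, 25

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)  # fraction of actual snow days the model caught

print(f"accuracy = {accuracy:.4f}")  # 0.9993
print(f"recall   = {recall:.4f}")    # 0.0000 -- no predictive power
```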

Precision and recall are usually more useful metrics than accuracy for evaluating models trained on class-imbalanced datasets.


See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

activation function

#fundamentals

A function that enables neural networks to learn nonlinear (complex) relationships between features and the label.

Popular activation functions include:

  • ReLU
  • Sigmoid

The plots of activation functions are never single straight lines. For example, the plot of the ReLU activation function consists of two straight lines:

A cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from (-infinity, 0) to (0, 0). The second line starts at (0, 0). This line has a slope of +1, so it runs from (0, 0) to (+infinity, +infinity).

A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at (0, 0.5) and gradually decreasing slopes as the absolute value of x increases.


In a neural network, activation functions manipulate the weighted sum of all the inputs to a neuron. To calculate a weighted sum, the neuron adds up the products of the relevant values and weights. For example, suppose the relevant input to a neuron consists of the following:

input value | input weight
2 | -1.3
-1 | 0.6
3 | 0.4
The weighted sum is therefore:
weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0
Suppose the designer of this neural network chooses the sigmoid function to be the activation function. In that case, the neuron calculates the sigmoid of -2.0, which is approximately 0.12. Therefore, the neuron passes 0.12 (rather than -2.0) to the next layer in the neural network. The following figure illustrates the relevant part of the process:

An input layer with three features passing three feature values and          three weights to a neuron in a hidden layer. The hidden layer          calculates the raw value (-2.0), and then passes the raw value to          the activation function. The activation function calculates the          sigmoid of the raw value and passes the result (0.12) to the next          layer of the neural network.
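As a rough sketch of the calculation described above (using only Python's standard library), the weighted sum and the sigmoid activation could be computed like this:

```python
import math

inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]

# Weighted sum: (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0
weighted_sum = sum(x * w for x, w in zip(inputs, weights))

def sigmoid(z):
  """Squashes a raw value into the range (0, 1)."""
  return 1 / (1 + math.exp(-z))

print(round(weighted_sum, 1))            # -2.0
print(round(sigmoid(weighted_sum), 2))   # 0.12
```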


See Neural networks: Activation functions in Machine Learning Crash Course for more information.

artificial intelligence

#fundamentals

A non-human program or model that can solve sophisticated tasks. For example, a program or model that translates text or a program or model that identifies diseases from radiologic images both exhibit artificial intelligence.

Formally, machine learning is a sub-field of artificial intelligence. However, in recent years, some organizations have begun using the terms artificial intelligence and machine learning interchangeably.

AUC (Area under the ROC curve)

#fundamentals
#Metric

A number between 0.0 and 1.0 representing a binary classification model's ability to separate positive classes from negative classes. The closer the AUC is to 1.0, the better the model's ability to separate classes from each other.

For example, the following illustration shows a classification model that separates positive classes (green ovals) from negative classes (purple rectangles) perfectly. This unrealistically perfect model has an AUC of 1.0:

A number line with 8 positive examples on one side and          9 negative examples on the other side.

Conversely, the following illustration shows the results for a classification model that generated random results. This model has an AUC of 0.5:

A number line with 6 positive examples and 6 negative examples. The sequence of examples is positive, negative, positive, negative, positive, negative, positive, negative, positive, negative, positive, negative.

Yes, the preceding model has an AUC of 0.5, not 0.0.

Most models are somewhere between the two extremes. For instance, the following model separates positives from negatives somewhat, and therefore has an AUC somewhere between 0.5 and 1.0:

A number line with 6 positive examples and 6 negative examples.          The sequence of examples is negative, negative, negative, negative,          positive, negative, positive, positive, negative, positive, positive,          positive.

AUC ignores any value you set for the classification threshold. Instead, AUC considers all possible classification thresholds.


AUC represents the area under an ROC curve. For example, the ROC curve for a model that perfectly separates positives from negatives looks as follows:

Cartesian plot. x-axis is false positive rate; y-axis          is true positive rate. Graph starts at 0,0 and goes straight up          to 0,1 and then straight to the right ending at 1,1.

AUC is the area of the gray region in the preceding illustration. In this unusual case, the area is simply the length of the gray region (1.0) multiplied by the width of the gray region (1.0). So, the product of 1.0 and 1.0 yields an AUC of exactly 1.0, which is the highest possible AUC score.

Conversely, the ROC curve for a classification model that can't separate classes at all is as follows. The area of this gray region is 0.5.

Cartesian plot. x-axis is false positive rate; y-axis is true          positive rate. Graph starts at 0,0 and goes diagonally to 1,1.

A more typical ROC curve looks approximately like the following:

Cartesian plot. x-axis is false positive rate; y-axis is true positive rate. Graph starts at 0,0 and takes an irregular arc to 1,1.

It would be painstaking to calculate the area under this curve manually, which is why a program typically calculates most AUC values.



AUC is the probability that a classification model will be more confident that a randomly chosen positive example is actually positive than that a randomly chosen negative example is positive.
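The following sketch (with made-up scores, not from any real model) estimates AUC directly from that probabilistic definition by comparing every positive/negative pair:

```python
import itertools

# Hypothetical model scores; higher means more confident "positive."
positive_scores = [0.9, 0.8, 0.75, 0.6]
negative_scores = [0.7, 0.4, 0.3, 0.2]

# AUC = probability that a random positive outranks a random negative
# (ties count as half a win).
pairs = list(itertools.product(positive_scores, negative_scores))
wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs)
print(wins / len(pairs))  # 0.9375
```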


See Classification: ROC and AUC in Machine Learning Crash Course for more information.

B

backpropagation

#fundamentals

The algorithm that implements gradient descent in neural networks.

Training a neural network involves many iterations of the following two-pass cycle:

  1. During the forward pass, the system processes a batch of examples to yield prediction(s). The system compares each prediction to each label value. The difference between the prediction and the label value is the loss for that example. The system aggregates the losses for all the examples to compute the total loss for the current batch.
  2. During the backward pass (backpropagation), the system reduces loss by adjusting the weights of all the neurons in all the hidden layer(s).

Neural networks often contain many neurons across many hidden layers. Each of those neurons contributes to the overall loss in different ways. Backpropagation determines whether to increase or decrease the weights applied to particular neurons.

The learning rate is a multiplier that controls the degree to which each backward pass increases or decreases each weight. A large learning rate will increase or decrease each weight more than a small learning rate.

In calculus terms, backpropagation implements the chain rule from calculus. That is, backpropagation calculates the partial derivative of the error with respect to each parameter.

Years ago, ML practitioners had to write code to implement backpropagation. Modern ML APIs like Keras now implement backpropagation for you. Phew!
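For instance, here is a minimal Keras sketch (the toy data and architecture are invented for illustration): compile() selects the loss and optimizer, and fit() runs the forward pass, backpropagation, and weight updates for you.

```python
import numpy as np
import tensorflow as tf

# Toy regression data: y = 3x + 2 plus a little noise.
x = np.random.uniform(-1, 1, size=(256, 1)).astype("float32")
y = (3 * x + 2 + np.random.normal(scale=0.1, size=(256, 1))).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1),
])

# fit() performs the forward pass and backpropagation on each batch.
model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss="mse")
model.fit(x, y, batch_size=32, epochs=10, verbose=0)
```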

See Neural networks in Machine Learning Crash Course for more information.

batch

#fundamentals

The set of examples used in one training iteration. The batch size determines the number of examples in a batch.

See epoch for an explanation of how a batch relates to an epoch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

batch size

#fundamentals

The number of examples in a batch. For instance, if the batch size is 100, then the model processes 100 examples per iteration.

The following are popular batch size strategies:

  • Stochastic Gradient Descent (SGD), in which the batch size is 1.
  • Full batch, in which the batch size is the number of examples in the entire training set. For instance, if the training set contains a million examples, then the batch size would be a million examples. Full batch is usually an inefficient strategy.
  • Mini-batch, in which the batch size is usually between 10 and 1,000. Mini-batch is usually the most efficient strategy.

See the following for more information:

bias (ethics/fairness)

#responsible
#fundamentals

1. Stereotyping, prejudice or favoritism towards some things, people, or groups over others. These biases can affect collection and interpretation of data, the design of a system, and how users interact with a system. Forms of this type of bias include:

2. Systematic error introduced by a sampling or reporting procedure. Forms of this type of bias include:

Not to be confused with the bias term in machine learning models or prediction bias.

See Fairness: Types of bias in Machine Learning Crash Course for more information.

bias (math) or bias term

#fundamentals

An intercept or offset from an origin. Bias is a parameter in machine learning models, which is symbolized by either of the following:

  • b
  • w0

For example, bias is the b in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$

In a simple two-dimensional line, bias just means "y-intercept." For example, the bias of the line in the following illustration is 2.

The plot of a line with a slope of 0.5 and a bias (y-intercept) of 2.

Bias exists because not all models start from the origin (0,0). For example, suppose an amusement park costs 2 Euros to enter and an additional 0.5 Euro for every hour a customer stays. Therefore, a model mapping the total cost has a bias of 2 because the lowest cost is 2 Euros.
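As a small Python illustration of the amusement-park example, the bias is simply the prediction when every feature value is 0:

```python
def predict_total_cost(hours):
  """Amusement park model: total cost = bias + weight * hours."""
  b = 2.0    # bias: the 2-Euro entrance fee (the y-intercept)
  w1 = 0.5   # weight: 0.5 Euro per hour
  return b + w1 * hours

print(predict_total_cost(0))  # 2.0 -- even a zero-hour visit costs the entrance fee
print(predict_total_cost(3))  # 3.5
```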

Bias is not to be confused with bias in ethics and fairness or prediction bias.

See Linear Regression in Machine Learning Crash Course for more information.

binary classification

#fundamentals

A type of classification task that predicts one of two mutually exclusive classes:

  • the positive class
  • the negative class

For example, the following two machine learning models each perform binary classification:

  • A model that determines whether email messages are spam (the positive class) or not spam (the negative class).
  • A model that evaluates medical symptoms to determine whether a person has a particular disease (the positive class) or doesn't have that disease (the negative class).

Contrast with multi-class classification.

See also logistic regression and classification threshold.

See Classification in Machine Learning Crash Course for more information.

bucketing

#fundamentals

Converting a single feature into multiple binary features called buckets or bins, typically based on a value range. The chopped feature is typically a continuous feature.

For example, instead of representing temperature as a single continuous floating-point feature, you could chop ranges of temperatures into discrete buckets, such as:

  • <= 10 degrees Celsius would be the "cold" bucket.
  • 11 - 24 degrees Celsius would be the "temperate" bucket.
  • >= 25 degrees Celsius would be the "warm" bucket.

The model will treat every value in the same bucket identically. For example, the values 13 and 22 are both in the temperate bucket, so the model treats the two values identically.
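A minimal sketch of this bucketing, assuming pandas is available (pd.cut is one common way to bin a continuous feature):

```python
import pandas as pd

temperatures = pd.Series([-3, 8, 13, 22, 27, 31])

# Bin the continuous feature into the three buckets described above.
buckets = pd.cut(
    temperatures,
    bins=[float("-inf"), 10, 24, float("inf")],
    labels=["cold", "temperate", "warm"],
)
print(buckets.tolist())
# ['cold', 'cold', 'temperate', 'temperate', 'warm', 'warm']
```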


If you represent temperature as a continuous feature, then the model treats temperature as a single feature. If you represent temperature as three buckets, then the model treats each bucket as a separate feature. That is, a model can learn separate relationships of each bucket to the label. For example, a linear regression model can learn separate weights for each bucket.

Increasing the number of buckets makes your model more complicated by increasing the number of relationships that your model must learn. For example, the cold, temperate, and warm buckets are essentially three separate features for your model to train on. If you decide to add two more buckets--for example, freezing and hot--your model would now have to train on five separate features.

How do you know how many buckets to create, or what the ranges for each bucket should be? The answers typically require a fair amount of experimentation.


See Numerical data: Binning in Machine Learning Crash Course for more information.

C

categorical data

#fundamentals

Features having a specific set of possible values. For example, consider a categorical feature named traffic-light-state, which can only have one of the following three possible values:

  • red
  • yellow
  • green

By representing traffic-light-state as a categorical feature, a model can learn the differing impacts of red, green, and yellow on driver behavior.

Categorical features are sometimes called discrete features.

Contrast with numerical data.

See Working with categorical data in Machine Learning Crash Course for more information.

class

#fundamentals

A category that a label can belong to. For example:

  • In a binary classification model that detects spam, the two classes might be spam and not spam.
  • In a multi-class classification model that identifies iris species, the classes might be Iris setosa, Iris versicolor, and Iris virginica.

A classification model predicts a class. In contrast, a regression model predicts a number rather than a class.

See Classification in Machine Learning Crash Course for more information.

classification model

#fundamentals

A model whose prediction is a class. For example, the following are all classification models:

  • A model that predicts an input sentence's language (French? Spanish? Italian?).
  • A model that predicts tree species (Maple? Oak? Baobab?).
  • A model that predicts the positive or negative class for a particular medical condition.

In contrast, regression models predict numbers rather than classes.

Two common types of classification models are:

  • binary classification models
  • multi-class classification models

classification threshold

#fundamentals

In binary classification, a number between 0 and 1 that converts the raw output of a logistic regression model into a prediction of either the positive class or the negative class. Note that the classification threshold is a value that a human chooses, not a value chosen by model training.

A logistic regression model outputs a raw value between 0 and 1. Then:

  • If this raw value is greater than the classification threshold, then the positive class is predicted.
  • If this raw value is less than the classification threshold, then the negative class is predicted.

For example, suppose the classification threshold is 0.8. If the raw value is 0.9, then the model predicts the positive class. If the raw value is 0.7, then the model predicts the negative class.
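A minimal sketch of applying a threshold to a model's raw output (the 0.8 threshold comes from the example above):

```python
def classify(raw_value, threshold=0.8):
  """Convert a logistic regression output into a class prediction."""
  return "positive" if raw_value > threshold else "negative"

print(classify(0.9))  # positive
print(classify(0.7))  # negative
```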

The choice of classification threshold strongly influences the number of false positives and false negatives.


As models or datasets evolve, engineers sometimes also change the classification threshold. When the classification threshold changes, positive class predictions can suddenly become negative classes and vice-versa.

For example, consider a binary classification disease prediction model. Suppose that when the system runs in the first year:

  • The raw value for a particular patient is 0.95.
  • The classification threshold is 0.94.

Therefore, the system diagnoses the positive class. (The patient gasps, "Oh no! I'm sick!")

A year later, perhaps the values now look as follows:

  • The raw value for the same patient remains at 0.95.
  • The classification threshold changes to 0.97.

Therefore, the system now reclassifies that patient as the negative class. ("Happy day! I'm not sick.") Same patient. Different diagnosis.


See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

classifier

#fundamentals

A casual term for a classification model.

class-imbalanced dataset

#fundamentals

A dataset for a classification problem in which the total number of labels of each class differs significantly. For example, consider a binary classification dataset whose two labels are divided as follows:

  • 1,000,000 negative labels
  • 10 positive labels

The ratio of negative to positive labels is 100,000 to 1, so this is a class-imbalanced dataset.

In contrast, the following dataset is class-balanced because the ratio of negative labels to positive labels is relatively close to 1:

  • 517 negative labels
  • 483 positive labels

Multi-class datasets can also be class-imbalanced. For example, the following multi-class classification dataset is also class-imbalanced because one label has far more examples than the other two:

  • 1,000,000 labels with class "green"
  • 200 labels with class "purple"
  • 350 labels with class "orange"

Training on class-imbalanced datasets can present special challenges. See Imbalanced datasets in Machine Learning Crash Course for details.

See also entropy, majority class, and minority class.

clipping

#fundamentals

A technique for handling outliers by doing either or both of the following:

  • Reducing feature values that are greater than a maximum threshold down to that maximum threshold.
  • Increasing feature values that are less than a minimum threshold up to that minimum threshold.

For example, suppose that <0.5% of values for a particular feature fall outside the range 40–60. In this case, you could do the following:

  • Clip all values over 60 (the maximum threshold) to be exactly 60.
  • Clip all values under 40 (the minimum threshold) to be exactly 40.

Outliers can damage models, sometimes causing weights to overflow during training. Some outliers can also dramatically spoil metrics like accuracy. Clipping is a common technique to limit the damage.
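A quick sketch using NumPy (np.clip is one common way to apply both thresholds at once):

```python
import numpy as np

values = np.array([12, 45, 51, 58, 63, 97])

# Clip to the range 40-60: values below 40 become 40, values above 60 become 60.
clipped = np.clip(values, 40, 60)
print(clipped)  # [40 45 51 58 60 60]
```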

Gradient clipping forces gradient values within a designated range during training.

See Numerical data: Normalization in Machine Learning Crash Course for more information.

confusion matrix

#fundamentals

An NxN table that summarizes the number of correct and incorrect predictions that a classification model made. For example, consider the following confusion matrix for a binary classification model:

 | Tumor (predicted) | Non-Tumor (predicted)
Tumor (ground truth) | 18 (TP) | 1 (FN)
Non-Tumor (ground truth) | 6 (FP) | 452 (TN)

The preceding confusion matrix shows the following:

  • Of the 19 predictions in which ground truth was Tumor, the model correctly classified 18 and incorrectly classified 1.
  • Of the 458 predictions in which ground truth was Non-Tumor, the model correctly classified 452 and incorrectly classified 6.

The confusion matrix for a multi-class classification problem can help you identify patterns of mistakes. For example, consider the following confusion matrix for a 3-class multi-class classification model that categorizes three different iris types (Virginica, Versicolor, and Setosa). When the ground truth was Virginica, the confusion matrix shows that the model was far more likely to mistakenly predict Versicolor than Setosa:

 | Setosa (predicted) | Versicolor (predicted) | Virginica (predicted)
Setosa (ground truth) | 88 | 12 | 0
Versicolor (ground truth) | 6 | 141 | 7
Virginica (ground truth) | 2 | 27 | 109

As yet another example, a confusion matrix could reveal that a model trained to recognize handwritten digits tends to mistakenly predict 9 instead of 4, or mistakenly predict 1 instead of 7.

Confusion matrixes contain sufficient information to calculate a variety of performance metrics, including precision and recall.
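For example, a short sketch computing precision and recall from the cells of the tumor confusion matrix above:

```python
# Cells from the binary (tumor) confusion matrix above.
tp, fn = 18, 1
fp, tn = 6, 452

precision = tp / (tp + fp)  # 18 / 24
recall = tp / (tp + fn)     # 18 / 19

print(f"precision = {precision:.3f}")  # 0.750
print(f"recall    = {recall:.3f}")     # 0.947
```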

continuous feature

#fundamentals

A floating-point feature with an infinite range of possible values, such as temperature or weight.

Contrast with discrete feature.

convergence

#fundamentals

A state reached when loss values change very little or not at all with each iteration. For example, the following loss curve suggests convergence at around 700 iterations:

Cartesian plot. X-axis is the number of training iterations. Y-axis is loss. Loss is very high during the first few iterations, but drops sharply. After about 100 iterations, loss is still descending but far more gradually. After about 700 iterations, loss stays flat.

A model converges when additional training won't improve the model.

In deep learning, loss values sometimes stay constant or nearly so for many iterations before finally descending. During a long period of constant loss values, you may temporarily get a false sense of convergence.

See also early stopping.

See Model convergence and loss curves in Machine Learning Crash Course for more information.

D

DataFrame

#fundamentals

A popular pandas data type for representing datasets in memory.

A DataFrame is analogous to a table or a spreadsheet. Each column of a DataFrame has a name (a header), and each row is identified by a unique number.

Each column in a DataFrame is structured like a 2D array, except that each column can be assigned its own data type.
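A small sketch, reusing the weather-and-test-scores example from this glossary, that builds a DataFrame with differently typed columns:

```python
import pandas as pd

# Each key becomes a named column; each row gets a unique integer index.
df = pd.DataFrame({
    "temperature": [15, 19, 18],
    "humidity": [47, 34, 92],
    "pressure": [998, 1020, 1012],
    "test_score": ["Good", "Excellent", "Poor"],  # a different dtype than the others
})

print(df.dtypes)
print(df)
```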

See also the official pandas.DataFrame reference page.

data set or dataset

#fundamentals

A collection of raw data, commonly (but not exclusively) organized in one of the following formats:

  • a spreadsheet
  • a file in CSV (comma-separated values) format

deep model

#fundamentals

A neural network containing more than one hidden layer.

A deep model is also called a deep neural network.

Contrast with wide model.

dense feature

#fundamentals

A feature in which most or all values are nonzero, typically a Tensor of floating-point values. For example, the following 10-element Tensor is dense because 9 of its values are nonzero:

8 | 3 | 7 | 5 | 2 | 4 | 0 | 4 | 9 | 6

Contrast with sparse feature.

depth

#fundamentals

The sum of the following in a neural network:

  • the number of hidden layers
  • the number of output layers, which is typically 1

For example, a neural network with five hidden layers and one output layer has a depth of 6.

Notice that the input layer doesn't influence depth.

discrete feature

#fundamentals

A feature with a finite set of possible values. For example, a feature whose values may only be animal, vegetable, or mineral is a discrete (or categorical) feature.

Contrast with continuous feature.

dynamic

#fundamentals

Something done frequently or continuously. The terms dynamic and online are synonyms in machine learning. The following are common uses of dynamic and online in machine learning:

  • A dynamic model (or online model) is a model that is retrained frequently or continuously.
  • Dynamic training (or online training) is the process of training frequently or continuously.
  • Dynamic inference (or online inference) is the process of generating predictions on demand.

dynamic model

#fundamentals

A model that is frequently (maybe even continuously) retrained. A dynamic model is a "lifelong learner" that constantly adapts to evolving data. A dynamic model is also known as an online model.

Contrast with static model.

E

early stopping

#fundamentals

A method for regularization that involves ending training before training loss finishes decreasing. In early stopping, you intentionally stop training the model when the loss on a validation dataset starts to increase; that is, when generalization performance worsens.


Early stopping may seem counterintuitive. After all, telling a model to halt training while the loss is still decreasing may seem like telling a chef to stop cooking before the dessert has fully baked. However, training a model for too long can lead to overfitting. That is, if you train a model too long, the model may fit the training data so closely that the model doesn't make good predictions on new examples.
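In Keras, one common way to get this behavior is the EarlyStopping callback. The sketch below is illustrative only; model, x_train, and y_train are placeholders rather than objects defined in this glossary.

```python
import tensorflow as tf

# Stop training once validation loss has not improved for 3 consecutive epochs,
# and restore the weights from the best epoch seen so far.
early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True,
)

# model.fit(x_train, y_train, epochs=100,
#           validation_split=0.2, callbacks=[early_stopping])
```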


Contrast with early exit.

embedding layer

#fundamentals

A special hidden layer that trains on a high-dimensional categorical feature to gradually learn a lower dimension embedding vector. An embedding layer enables a neural network to train far more efficiently than training just on the high-dimensional categorical feature.

For example, Earth currently supports about 73,000 tree species. Suppose tree species is a feature in your model, so your model's input layer includes a one-hot vector 73,000 elements long. For example, perhaps baobab would be represented something like this:

An array of 73,000 elements. The first 6,232 elements hold the value     0. The next element holds the value 1. The final 66,767 elements hold     the value zero.

A 73,000-element array is very long. If you don't add an embedding layer to the model, training is going to be very time consuming due to multiplying 72,999 zeros. Perhaps you pick the embedding layer to consist of 12 dimensions. Consequently, the embedding layer will gradually learn a new embedding vector for each tree species.
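A hedged Keras sketch of such a layer (the species index 6232 simply matches the baobab one-hot example above):

```python
import tensorflow as tf

# Map each of ~73,000 tree-species IDs to a learned 12-dimensional vector.
# The layer takes integer indexes, so no 73,000-element one-hot vector is needed.
embedding = tf.keras.layers.Embedding(input_dim=73000, output_dim=12)

species_ids = tf.constant([6232])     # e.g., the index representing "baobab"
print(embedding(species_ids).shape)   # (1, 12)
```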

In certain situations, hashing is a reasonable alternative to an embedding layer.

See Embeddings in Machine Learning Crash Course for more information.

epoch

#fundamentals

A full training pass over the entire training set such that each example has been processed once.

An epoch represents N/batch size training iterations, where N is the total number of examples.

For instance, suppose the following:

  • The dataset consists of 1,000 examples.
  • The batch size is 50 examples.

Therefore, a single epoch requires 20 iterations:

1 epoch = (N/batch size) = (1,000 / 50) = 20 iterations

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

example

#fundamentals

The values of one row of features and possibly a label. Examples in supervised learning fall into two general categories:

  • A labeled example consists of one or more features and a label. Labeled examples are used during training.
  • An unlabeled example consists of one or more features but no label. Unlabeled examples are used during inference.

For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. Here are three labeled examples:

Features | | | Label
Temperature | Humidity | Pressure | Test score
15 | 47 | 998 | Good
19 | 34 | 1020 | Excellent
18 | 92 | 1012 | Poor

Here are three unlabeled examples:

Temperature | Humidity | Pressure
12 | 62 | 1014
21 | 47 | 1017
19 | 41 | 1021

The row of a dataset is typically the raw source for an example. That is, an example typically consists of a subset of the columns in the dataset. Furthermore, the features in an example can also include synthetic features, such as feature crosses.

See Supervised Learning in the Introduction to Machine Learning course for more information.

F

false negative (FN)

#fundamentals
#Metric

An example in which the model mistakenly predicts the negative class. For example, the model predicts that a particular email message is not spam (the negative class), but that email message actually is spam.

false positive (FP)

#fundamentals
#Metric

An example in which the model mistakenly predicts the positive class. For example, the model predicts that a particular email message is spam (the positive class), but that email message is actually not spam.

See Thresholds and the confusion matrix in Machine Learning Crash Course for more information.

false positive rate (FPR)

#fundamentals
#Metric

The proportion of actual negative examples for which the model mistakenly predicted the positive class. The following formula calculates the false positive rate:

$$\text{false positive rate} = \frac{\text{false positives}}{\text{false positives} + \text{true negatives}}$$

The false positive rate is the x-axis in an ROC curve.

See Classification: ROC and AUC in Machine Learning Crash Course for more information.

feature

#fundamentals

An input variable to a machine learning model. An example consists of one or more features. For instance, suppose you are training a model to determine the influence of weather conditions on student test scores. The following table shows three examples, each of which contains three features and one label:

Features | | | Label
Temperature | Humidity | Pressure | Test score
15 | 47 | 998 | 92
19 | 34 | 1020 | 84
18 | 92 | 1012 | 87

Contrast with label.

See Supervised Learning in the Introduction to Machine Learning course for more information.

feature cross

#fundamentals

A synthetic feature formed by "crossing" categorical or bucketed features.

For example, consider a "mood forecasting" model that representstemperature in one of the following four buckets:

  • freezing
  • chilly
  • temperate
  • warm

And represents wind speed in one of the following three buckets:

  • still
  • light
  • windy

Without feature crosses, the linear model trains independently on each of the preceding seven buckets. So, the model trains on, for example, freezing independently of the training on, for example, windy.

Alternatively, you could create a feature cross of temperature and wind speed. This synthetic feature would have the following 12 possible values:

  • freezing-still
  • freezing-light
  • freezing-windy
  • chilly-still
  • chilly-light
  • chilly-windy
  • temperate-still
  • temperate-light
  • temperate-windy
  • warm-still
  • warm-light
  • warm-windy

Thanks to feature crosses, the model can learn mood differences between a freezing-windy day and a freezing-still day.
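A quick sketch of building this feature cross as the Cartesian product of the two bucketed features:

```python
import itertools

temperature_buckets = ["freezing", "chilly", "temperate", "warm"]
wind_buckets = ["still", "light", "windy"]

# The feature cross is the Cartesian product of the two bucketed features.
crossed = [f"{t}-{w}" for t, w in itertools.product(temperature_buckets, wind_buckets)]

print(len(crossed))  # 12
print(crossed[:3])   # ['freezing-still', 'freezing-light', 'freezing-windy']
```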

If you create a synthetic feature from two features that each have a lot of different buckets, the resulting feature cross will have a huge number of possible combinations. For example, if one feature has 1,000 buckets and the other feature has 2,000 buckets, the resulting feature cross has 2,000,000 buckets.

Formally, a cross is a Cartesian product.

Feature crosses are mostly used with linear models and are rarely used with neural networks.

See Categorical data: Feature crosses in Machine Learning Crash Course for more information.

feature engineering

#fundamentals
#TensorFlow

A process that involves the following steps:

  1. Determining which features might be useful in training a model.
  2. Converting raw data from the dataset into efficient versions of those features.

For example, you might determine that temperature might be a useful feature. Then, you might experiment with bucketing to optimize what the model can learn from different temperature ranges.

Feature engineering is sometimes called feature extraction or featurization.


In TensorFlow, feature engineering often means converting raw log file entries to tf.Example protocol buffers. See also tf.Transform.


See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feature set

#fundamentals

The group of features your machine learning model trains on. For example, a simple feature set for a model that predicts housing prices might consist of postal code, property size, and property condition.

feature vector

#fundamentals

The array of feature values comprising an example. The feature vector is input during training and during inference. For example, the feature vector for a model with two discrete features might be:

[0.92, 0.56]

Four layers: an input layer, two hidden layers, and one output layer.          The input layer contains two nodes, one containing the value          0.92 and the other containing the value 0.56.

Each example supplies different values for the feature vector, so the feature vector for the next example could be something like:

[0.73, 0.49]

Feature engineering determines how to represent features in the feature vector. For example, a categorical feature with five possible values might be represented with one-hot encoding. In this case, the portion of the feature vector for a particular example would consist of four zeroes and a single 1.0 in the third position, as follows:

[0.0, 0.0, 1.0, 0.0, 0.0]

As another example, suppose your model consists of three features:

  • a categorical feature with five possible values represented with one-hot encoding; for example: [0.0, 1.0, 0.0, 0.0, 0.0]
  • another categorical feature with three possible values represented with one-hot encoding; for example: [0.0, 0.0, 1.0]
  • a floating-point feature; for example: 8.3.

In this case, the feature vector for each example would be represented by nine values. Given the example values in the preceding list, the feature vector would be:

[0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
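A small sketch that assembles such a nine-value feature vector; the vocabularies (a1...a5 and b1...b3) are hypothetical stand-ins for the two categorical features:

```python
def build_feature_vector(category_a, category_b, measurement):
  """Concatenate two one-hot encodings and a raw floating-point value."""
  a_vocabulary = ["a1", "a2", "a3", "a4", "a5"]   # hypothetical five-value feature
  b_vocabulary = ["b1", "b2", "b3"]               # hypothetical three-value feature
  one_hot_a = [1.0 if v == category_a else 0.0 for v in a_vocabulary]
  one_hot_b = [1.0 if v == category_b else 0.0 for v in b_vocabulary]
  return one_hot_a + one_hot_b + [measurement]

print(build_feature_vector("a2", "b3", 8.3))
# [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 8.3]
```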

See Numerical data: How a model ingests data using feature vectors in Machine Learning Crash Course for more information.

feedback loop

#fundamentals

In machine learning, a situation in which a model's predictions influence the training data for the same model or another model. For example, a model that recommends movies will influence the movies that people see, which will then influence subsequent movie recommendation models.

See Production ML systems: Questions to ask in Machine Learning Crash Course for more information.

G

generalization

#fundamentals

A model's ability to make correct predictions on new, previously unseen data. A model that can generalize is the opposite of a model that is overfitting.


You train a model on the examples in the training set. Consequently, the model learns the peculiarities of the data in the training set. Generalization essentially asks whether your model can make good predictions on examples that are not in the training set.

To encourage generalization, regularization helps a model train less exactly to the peculiarities of the data in the training set.


See Generalization in Machine Learning Crash Course for more information.

generalization curve

#fundamentals

A plot of both training loss and validation loss as a function of the number of iterations.

A generalization curve can help you detect possible overfitting. For example, the following generalization curve suggests overfitting because validation loss ultimately becomes significantly higher than training loss.

A Cartesian graph in which the y-axis is labeled loss and the x-axis is labeled iterations. Two plots appear. One plot shows the training loss and the other shows the validation loss. The two plots start off similarly, but the training loss eventually dips far lower than the validation loss.

See Generalization in Machine Learning Crash Course for more information.

gradient descent

#fundamentals

A mathematical technique to minimize loss. Gradient descent iteratively adjusts weights and biases, gradually finding the best combination to minimize loss.

Gradient descent is older—much, much older—than machine learning.
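The following is a deliberately tiny sketch (plain Python, invented data) of gradient descent fitting a one-feature linear model y' = b + wx by minimizing mean squared error; real libraries do this for you:

```python
# Data generated by y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

w, b = 0.0, 0.0
learning_rate = 0.05

for _ in range(2000):
  # Gradients of mean squared error with respect to w and b.
  grad_w = sum(2 * ((b + w * x) - y) * x for x, y in zip(xs, ys)) / len(xs)
  grad_b = sum(2 * ((b + w * x) - y) for x, y in zip(xs, ys)) / len(xs)
  # Each update steps against the gradient, scaled by the learning rate.
  w -= learning_rate * grad_w
  b -= learning_rate * grad_b

print(round(w, 2), round(b, 2))  # approximately 2.0 and 1.0
```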

See Linear regression: Gradient descent in Machine Learning Crash Course for more information.

ground truth

#fundamentals

Reality.

The thing that actually happened.

For example, consider a binary classification model that predicts whether a student in their first year of university will graduate within six years. Ground truth for this model is whether or not that student actually graduated within six years.


We assess model quality against ground truth. However, ground truth is not always completely, well, truthful. For example, consider the following examples of potential imperfections in ground truth:

  • In the graduation example, are we certain that the graduation records for each student are always correct? Is the university's record-keeping flawless?
  • Suppose the label is a floating-point value measured by instruments (for example, barometers). How can we be sure that each instrument is calibrated identically or that each reading was taken under the same circumstances?
  • If the label is a matter of human opinion, how can we be sure that each human rater is evaluating events in the same way? To improve consistency, expert human raters sometimes intervene.

H

hidden layer

#fundamentals

A layer in a neural network between the input layer (the features) and the output layer (the prediction). Each hidden layer consists of one or more neurons. For example, the following neural network contains two hidden layers, the first with three neurons and the second with two neurons:

Four layers. The first layer is an input layer containing two          features. The second layer is a hidden layer containing three          neurons. The third layer is a hidden layer containing two          neurons. The fourth layer is an output layer. Each feature          contains three edges, each of which points to a different neuron          in the second layer. Each of the neurons in the second layer          contains two edges, each of which points to a different neuron          in the third layer. Each of the neurons in the third layer contain          one edge, each pointing to the output layer.

A deep neural network contains more than one hidden layer. For example, the preceding illustration is a deep neural network because the model contains two hidden layers.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course for more information.

hyperparameter

#fundamentals

The variables that you or a hyperparameter tuning service adjust during successive runs of training a model. For example, learning rate is a hyperparameter. You could set the learning rate to 0.01 before one training session. If you determine that 0.01 is too high, you could perhaps set the learning rate to 0.003 for the next training session.

In contrast, parameters are the various weights and bias that the model learns during training.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

I

independently and identically distributed (i.i.d)

#fundamentals

Data drawn from a distribution that doesn't change, and where each value drawn doesn't depend on values that have been drawn previously. An i.i.d. is the ideal gas of machine learning—a useful mathematical construct but almost never exactly found in the real world. For example, the distribution of visitors to a web page may be i.i.d. over a brief window of time; that is, the distribution doesn't change during that brief window and one person's visit is generally independent of another's visit. However, if you expand that window of time, seasonal differences in the web page's visitors may appear.

See also nonstationarity.

inference

#fundamentals
#generativeAI

In traditional machine learning, the process of making predictions by applying a trained model to unlabeled examples. See Supervised Learning in the Intro to ML course to learn more.

In large language models, inference is the process of using a trained model to generate a response to an input prompt.

Inference has a somewhat different meaning in statistics. See the Wikipedia article on statistical inference for details.

input layer

#fundamentals

The layer of a neural network that holds the feature vector. That is, the input layer provides examples for training or inference. For example, the input layer in the following neural network consists of two features:

Four layers: an input layer, two hidden layers, and an output layer.

interpretability

#fundamentals

The ability to explain or to present an ML model's reasoning in understandable terms to a human.

Most linear regression models, for example, are highly interpretable. (You merely need to look at the trained weights for each feature.) Decision forests are also highly interpretable. Some models, however, require sophisticated visualization to become interpretable.

You can use the Learning Interpretability Tool (LIT) to interpret ML models.

iteration

#fundamentals

A single update of a model's parameters—the model's weights and biases—during training. The batch size determines how many examples the model processes in a single iteration. For instance, if the batch size is 20, then the model processes 20 examples before adjusting the parameters.

When training a neural network, a single iteration involves the following two passes:

  1. A forward pass to evaluate loss on a single batch.
  2. A backward pass (backpropagation) to adjust the model's parameters based on the loss and the learning rate.

See Gradient descent in Machine Learning Crash Course for more information.

L

L0 regularization

#fundamentals

A type of regularization that penalizes the total number of nonzero weights in a model. For example, a model having 11 nonzero weights would be penalized more than a similar model having 10 nonzero weights.

L0 regularization is sometimes called L0-norm regularization.


L0 regularization is generally impractical in large models because L0 regularization turns training into a non-convex optimization problem.


L1 loss

#fundamentals
#Metric

A loss function that calculates the absolute value of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L1 loss for a batch of five examples:

Actual value of example | Model's predicted value | Absolute value of delta
7 | 6 | 1
5 | 4 | 1
8 | 11 | 3
4 | 6 | 2
9 | 8 | 1
 | | 8 = L1 loss

L1 loss is less sensitive to outliers than L2 loss.

The Mean Absolute Error is the average L1 loss per example.


$$\text{L}_1\text{ loss} = \sum_{i=0}^n | y_i - \hat{y}_i |$$

where:
  • $n$ is the number of examples.
  • $y$ is the actual value of the label.
  • $\hat{y}$ is the value that the model predicts for $y$.
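Reproducing the five-example batch above in plain Python:

```python
actuals = [7, 5, 8, 4, 9]
predictions = [6, 4, 11, 6, 8]

# L1 loss: sum of absolute differences between labels and predictions.
l1_loss = sum(abs(y - y_hat) for y, y_hat in zip(actuals, predictions))
mean_absolute_error = l1_loss / len(actuals)

print(l1_loss)              # 8
print(mean_absolute_error)  # 1.6
```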

See Linear regression: Loss in Machine Learning Crash Course for more information.

L1 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the absolute value of the weights. L1 regularization helps drive the weights of irrelevant or barely relevant features to exactly 0. A feature with a weight of 0 is effectively removed from the model.

Contrast with L2 regularization.

L2 loss

#fundamentals
#Metric

A loss function that calculates the square of the difference between actual label values and the values that a model predicts. For example, here's the calculation of L2 loss for a batch of five examples:

Actual value of example | Model's predicted value | Square of delta
7 | 6 | 1
5 | 4 | 1
8 | 11 | 9
4 | 6 | 4
9 | 8 | 1
 | | 16 = L2 loss

Due to squaring, L2 loss amplifies the influence of outliers. That is, L2 loss reacts more strongly to bad predictions than L1 loss. For example, the L1 loss for the preceding batch would be 8 rather than 16. Notice that a single outlier accounts for 9 of the 16.

Regression models typically use L2 loss as the loss function.

The Mean Squared Error is the average L2 loss per example. Squared loss is another name for L2 loss.


$$\text{L}_2\text{ loss} = \sum_{i=0}^n {(y_i - \hat{y}_i)}^2$$

where:
  • $n$ is the number of examples.
  • $y$ is the actual value of the label.
  • $\hat{y}$ is the value that the model predicts for $y$.

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

L2 regularization

#fundamentals

A type of regularization that penalizes weights in proportion to the sum of the squares of the weights. L2 regularization helps drive outlier weights (those with high positive or low negative values) closer to 0 but not quite to 0. Features with weights very close to 0 remain in the model but don't influence the model's prediction very much.

L2 regularization always improves generalization in linear models.

Contrast with L1 regularization.

See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

label

#fundamentals

In supervised machine learning, the "answer" or "result" portion of an example.

Each labeled example consists of one or more features and a label. For example, in a spam detection dataset, the label would probably be either "spam" or "not spam." In a rainfall dataset, the label might be the amount of rain that fell during a certain period.

See Supervised Learning in Introduction to Machine Learning for more information.

labeled example

#fundamentals

An example that contains one or more features and a label. For example, the following table shows three labeled examples from a house valuation model, each with three features and one label:

Number of bedrooms | Number of bathrooms | House age | House price (label)
3 | 2 | 15 | $345,000
2 | 1 | 72 | $179,000
4 | 2 | 34 | $392,000

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

Contrast labeled examples with unlabeled examples.

See Supervised Learning in Introduction to Machine Learning for more information.

lambda

#fundamentals

Synonym for regularization rate.

Lambda is an overloaded term. Here we're focusing on the term's definition within regularization.

layer

#fundamentals

A set of neurons in a neural network. Three common types of layers are as follows:

  • the input layer
  • hidden layers
  • the output layer

For example, the following illustration shows a neural network with one input layer, two hidden layers, and one output layer:

A neural network with one input layer, two hidden layers, and one          output layer. The input layer consists of two features. The first          hidden layer consists of three neurons and the second hidden layer          consists of two neurons. The output layer consists of a single node.

In TensorFlow, layers are also Python functions that take Tensors and configuration options as input and produce other tensors as output.

learning rate

#fundamentals

A floating-point number that tells the gradient descent algorithm how strongly to adjust weights and biases on each iteration. For example, a learning rate of 0.3 would adjust weights and biases three times more powerfully than a learning rate of 0.1.

Learning rate is a key hyperparameter. If you set the learning rate too low, training will take too long. If you set the learning rate too high, gradient descent often has trouble reaching convergence.


During each iteration, the gradient descent algorithm multiplies the learning rate by the gradient. The resulting product is called the gradient step.


See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

linear

#fundamentals

A relationship between two or more variables that can be represented solely through addition and multiplication.

The plot of a linear relationship is a line.

Contrast with nonlinear.

linear model

#fundamentals

A model that assigns one weight per feature to make predictions. (Linear models also incorporate a bias.) In contrast, the relationship of features to predictions in deep models is generally nonlinear.

Linear models are usually easier to train and more interpretable than deep models. However, deep models can learn complex relationships between features.

Linear regression and logistic regression are two types of linear models.


A linear model follows this formula:

$$y' = b + w_1x_1 + w_2x_2 + \ldots + w_nx_n$$
where:
  • y' is the raw prediction. (In certain kinds of linear models, this raw prediction will be further modified. For example, see logistic regression.)
  • b is the bias.
  • w is a weight, so w1 is the weight of the first feature, w2 is the weight of the second feature, and so on.
  • x is a feature, so x1 is the value of the first feature, x2 is the value of the second feature, and so on.
For example, suppose a linear model for three features learns the following bias and weights:
  • b = 7
  • w1 = -2.5
  • w2 = -1.2
  • w3 = 1.4
Therefore, given three features (x1, x2, and x3), the linear model uses the following equation to generate each prediction:
y' = 7 + (-2.5)(x1) + (-1.2)(x2) + (1.4)(x3)

Suppose a particular example contains the following values:

  • x1 = 4
  • x2 = -10
  • x3 = 5
Plugging those values into the formula yields a prediction for this example:
y' = 7 + (-2.5)(4) + (-1.2)(-10) + (1.4)(5)
y' = 16

Linear models include not only models that use only a linear equation to make predictions but also a broader set of models that use a linear equation as just one component of the formula that makes predictions. For example, logistic regression post-processes the raw prediction (y') to produce a final prediction value between 0 and 1, exclusive.


linear regression

#fundamentals

A type of machine learning model in which both of the following are true:

  • The model is a linear model.
  • The prediction is a floating-point value. (This is the regression part of linear regression.)

Contrast linear regression with logistic regression. Also, contrast regression with classification.

See Linear regression in Machine Learning Crash Course for more information.

logistic regression

#fundamentals

A type of regression model that predicts a probability. Logistic regression models have the following characteristics:

  • The label is categorical. The term logistic regression usually refers to binary logistic regression, that is, to a model that calculates probabilities for labels with two possible values. A less common variant, multinomial logistic regression, calculates probabilities for labels with more than two possible values.
  • The loss function during training is Log Loss. (Multiple Log Loss units can be placed in parallel for labels with more than two possible values.)
  • The model has a linear architecture, not a deep neural network. However, the remainder of this definition also applies to deep models that predict probabilities for categorical labels.

For example, consider a logistic regression model that calculates the probability of an input email being either spam or not spam. During inference, suppose the model predicts 0.72. Therefore, the model is estimating:

  • A 72% chance of the email being spam.
  • A 28% chance of the email not being spam.

A logistic regression model uses the following two-step architecture:

  1. The model generates a raw prediction (y') by applying a linear function of input features.
  2. The model uses that raw prediction as input to a sigmoid function, which converts the raw prediction to a value between 0 and 1, exclusive.

Like any regression model, a logistic regression model predicts a number. However, this number typically becomes part of a binary classification model as follows:

  • If the predicted number is greater than the classification threshold, the binary classification model predicts the positive class.
  • If the predicted number is less than the classification threshold, the binary classification model predicts the negative class.
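Here is a rough sketch of that two-step architecture and of the thresholding described in the list above; the bias, weights, and 0.5 threshold are invented for illustration:

```python
import math

def sigmoid(z):
  return 1 / (1 + math.exp(-z))

# Hypothetical learned parameters for a two-feature model.
b, w1, w2 = -1.0, 2.0, -0.5

def predict_probability(x1, x2):
  raw = b + w1 * x1 + w2 * x2   # step 1: linear function of the input features
  return sigmoid(raw)           # step 2: squash the raw prediction into (0, 1)

p = predict_probability(1.2, 0.4)
predicted_class = "positive" if p > 0.5 else "negative"  # assumed threshold of 0.5
print(round(p, 2), predicted_class)
```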

See Logistic regression in Machine Learning Crash Course for more information.

Log Loss

#fundamentals

The loss function used in binary logistic regression.


The following formula calculates Log Loss:

$$\text{Log Loss} = \sum_{(x,y)\in D} -y\log(y') - (1 - y)\log(1 - y')$$
where:
  • \((x,y)\in D\) is the dataset containing many labeled examples, which are \((x,y)\) pairs.
  • \(y\) is the label in a labeled example. Since this is logistic regression, every value of \(y\) must either be 0 or 1.
  • \(y'\) is the predicted value (somewhere between 0 and 1, exclusive), given the set of features in \(x\).
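A direct translation of that formula into Python (the labels and predicted probabilities are made up):

```python
import math

def log_loss(labels, predictions):
  """Sum of per-example log loss; labels are 0 or 1, predictions are in (0, 1)."""
  return sum(
      -y * math.log(p) - (1 - y) * math.log(1 - p)
      for y, p in zip(labels, predictions)
  )

print(round(log_loss([1, 0, 1], [0.9, 0.2, 0.7]), 3))  # 0.685
```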

See Logistic regression: Loss and regularization in Machine Learning Crash Course for more information.

log-odds

#fundamentals

The logarithm of the odds of some event.


If the event is a binary probability, then odds refers to the ratio of the probability of success (p) to the probability of failure (1-p). For example, suppose that a given event has a 90% probability of success and a 10% probability of failure. In this case, odds is calculated as follows:

$$\text{odds} = \frac{p}{(1-p)} = \frac{0.9}{0.1} = 9$$

The log-odds is simply the logarithm of the odds. By convention, "logarithm" refers to natural logarithm, but logarithm could actually be any base greater than 1. Sticking to convention, the log-odds of our example is therefore:

$$\text{log-odds} = \ln(9) \approx 2.2$$

The log-odds function is the inverse of the sigmoid function.
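A short sketch of the 90%-probability example, which also shows the sigmoid undoing the log-odds:

```python
import math

def log_odds(p):
  return math.log(p / (1 - p))

def sigmoid(z):
  return 1 / (1 + math.exp(-z))

print(round(log_odds(0.9), 1))           # 2.2
print(round(sigmoid(log_odds(0.9)), 1))  # 0.9 -- sigmoid inverts the log-odds
```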


loss

#fundamentals
#Metric

During the training of a supervised model, a measure of how far a model's prediction is from its label.

A loss function calculates the loss.

See Linear regression: Loss in Machine Learning Crash Course for more information.

loss curve

#fundamentals

A plot of loss as a function of the number of training iterations. The following plot shows a typical loss curve:

A Cartesian graph of loss versus training iterations, showing a          rapid drop in loss for the initial iterations, followed by a gradual          drop, and then a flat slope during the final iterations.

Loss curves can help you determine when your model is converging or overfitting.

Loss curves can plot all of the following types of loss:

  • training loss
  • validation loss
  • test loss

See also generalization curve.

See Overfitting: Interpreting loss curves in Machine Learning Crash Course for more information.

loss function

#fundamentals
#Metric

During training or testing, a mathematical function that calculates the loss on a batch of examples. A loss function returns a lower loss for models that make good predictions than for models that make bad predictions.

The goal of training is typically to minimize the loss that a loss functionreturns.

Many different kinds of loss functions exist. Pick the appropriate loss function for the kind of model you are building. For example:

  • Linear regression models typically use L2 loss (Mean Squared Error) as the loss function.
  • Binary logistic regression models use Log Loss as the loss function.

M

machine learning

#fundamentals

A program or system that trains a model from input data. The trained model can make useful predictions from new (never-before-seen) data drawn from the same distribution as the one used to train the model.

Machine learning also refers to the field of study concerned with these programs or systems.

See the Introduction to Machine Learning course for more information.

majority class

#fundamentals

The more common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the negative labels are the majority class.

Contrast with minority class.

See Datasets: Imbalanced datasets in Machine Learning Crash Course for more information.

mini-batch

#fundamentals

A small, randomly selected subset of a batch processed in one iteration. The batch size of a mini-batch is usually between 10 and 1,000 examples.

For example, suppose the entire training set (the full batch) consists of 1,000 examples. Further suppose that you set the batch size of each mini-batch to 20. Therefore, each iteration determines the loss on a random 20 of the 1,000 examples and then adjusts the weights and biases accordingly.

It is much more efficient to calculate the loss on a mini-batch than the loss on all the examples in the full batch.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

minority class

#fundamentals

The less common label in a class-imbalanced dataset. For example, given a dataset containing 99% negative labels and 1% positive labels, the positive labels are the minority class.

Contrast withmajority class.

Click the icon for additional notes.

A training set with a millionexamples soundsimpressive. However, if the minority class is poorly represented,then even a very large training set might be insufficient. Focus lesson the total number of examples in the dataset and more on the number ofexamples in the minority class.

If your dataset doesn't contain enough minority class examples, considerusingdownsampling (the definitionin the second bullet) to supplement the minority class.


SeeDatasets: Imbalanced datasetsin Machine Learning Crash Course for more information.

model

#fundamentals

In general, any mathematical construct that processes input data and returns output. Phrased differently, a model is the set of parameters and structure needed for a system to make predictions. In supervised machine learning, a model takes an example as input and infers a prediction as output. Within supervised machine learning, models differ somewhat. For example:

  • A linear regression model consists of a set of weights and a bias.
  • A neural network model consists of:
    • A set of hidden layers, each containing one or more neurons.
    • The weights and bias associated with each neuron.
  • A decision tree model consists of:
    • The shape of the tree; that is, the pattern in which the conditions and leaves are connected.
    • The conditions and leaves.

You can save, restore, or make copies of a model.

Unsupervised machine learning also generates models, typically a function that can map an input example to the most appropriate cluster.

Click the icon to compare algebraic and programming functions to ML models.

An algebraic function such as the following is a model:

  f(x, y) = 3x - 5xy + y² + 17

The preceding function maps input values (x and y) to output.

Similarly, a programming function like the following is also a model:

def half_of_greater(x, y):
  # Returns half of whichever argument is larger.
  if (x > y):
    return(x / 2)
  else:
    return(y / 2)

A caller passes arguments to the preceding Python function, and the Python function generates output (via the return statement).

Although a deep neural network has a very different mathematical structure than an algebraic or programming function, a deep neural network still takes input (an example) and returns output (a prediction).

A human programmer codes a programming function manually. In contrast,a machine learning model gradually learns the optimal parametersduring automated training.


multi-class classification

#fundamentals

In supervised learning, a classification problem in which the dataset contains more than two classes of labels. For example, the labels in the Iris dataset must be one of the following three classes:

  • Iris setosa
  • Iris virginica
  • Iris versicolor

A model trained on the Iris dataset that predicts Iris type on new examples is performing multi-class classification.

In contrast, classification problems that distinguish between exactly two classes are binary classification models. For example, an email model that predicts either spam or not spam is a binary classification model.

In clustering problems, multi-class classification refers to more than two clusters.

See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

N

negative class

#fundamentals
#Metric

In binary classification, one class is termed positive and the other is termed negative. The positive class is the thing or event that the model is testing for and the negative class is the other possibility. For example:

  • The negative class in a medical test might be "not tumor."
  • The negative class in an email classification model might be "not spam."

Contrast with positive class.

neural network

#fundamentals

A model containing at least one hidden layer. A deep neural network is a type of neural network containing more than one hidden layer. For example, the following diagram shows a deep neural network containing two hidden layers.

A neural network with an input layer, two hidden layers, and an output layer.

Each neuron in a neural network connects to all of the nodes in the next layer. For example, in the preceding diagram, notice that each of the three neurons in the first hidden layer separately connects to both of the two neurons in the second hidden layer.

Neural networks implemented on computers are sometimes called artificial neural networks to differentiate them from neural networks found in brains and other nervous systems.

Some neural networks can mimic extremely complex nonlinear relationships between different features and the label.

See also convolutional neural network and recurrent neural network.

See Neural networks in Machine Learning Crash Course for more information.

neuron

#fundamentals

In machine learning, a distinct unit within a hidden layer of a neural network. Each neuron performs the following two-step action:

  1. Calculates the weighted sum of input values multiplied by their corresponding weights.
  2. Passes the weighted sum as input to an activation function.

A neuron in the first hidden layer accepts inputs from the feature values in the input layer. A neuron in any hidden layer beyond the first accepts inputs from the neurons in the preceding hidden layer. For example, a neuron in the second hidden layer accepts inputs from the neurons in the first hidden layer.

The following illustration highlights two neurons and their inputs.

A neural network with an input layer, two hidden layers, and an output layer. Two neurons are highlighted: one in the first hidden layer and one in the second hidden layer. The highlighted neuron in the first hidden layer receives inputs from both features in the input layer. The highlighted neuron in the second hidden layer receives inputs from each of the three neurons in the first hidden layer.

A neuron in a neural network mimics the behavior of neurons in brains and other parts of nervous systems.

node (neural network)

#fundamentals

A neuron in a hidden layer.

See Neural Networks in Machine Learning Crash Course for more information.

nonlinear

#fundamentals

A relationship between two or more variables that can't be represented solely through addition and multiplication. A linear relationship can be represented as a line; a nonlinear relationship can't be represented as a line. For example, consider two models that each relate a single feature to a single label. The model on the left is linear and the model on the right is nonlinear:

Two plots. One plot is a line, so this is a linear relationship. The other plot is a curve, so this is a nonlinear relationship.

See Neural networks: Nodes and hidden layers in Machine Learning Crash Course to experiment with different kinds of nonlinear functions.

nonstationarity

#fundamentals

A feature whose values change across one or more dimensions, usually time. For example, consider the following examples of nonstationarity:

  • The number of swimsuits sold at a particular store varies with the season.
  • The quantity of a particular fruit harvested in a particular region is zero for much of the year but large for a brief period.
  • Due to climate change, annual mean temperatures are shifting.

Contrast with stationarity.

normalization

#fundamentals

Broadly speaking, the process of converting a variable's actual range of values into a standard range of values, such as:

  • -1 to +1
  • 0 to 1
  • Z-scores (roughly, -3 to +3)

For example, suppose the actual range of values of a certain feature is 800 to 2,400. As part of feature engineering, you could normalize the actual values down to a standard range, such as -1 to +1.

Normalization is a common task in feature engineering. Models usually train faster (and produce better predictions) when every numerical feature in the feature vector has roughly the same range.

See also Z-score normalization.

See Numerical Data: Normalization in Machine Learning Crash Course for more information.
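For example, here is a small Python sketch (illustrative, not from the glossary) that rescales the 800-to-2,400 range mentioned above into the standard range -1 to +1:

def normalize(value, min_value=800.0, max_value=2400.0):
  # Map [min_value, max_value] linearly onto [-1, +1].
  return 2 * (value - min_value) / (max_value - min_value) - 1

print(normalize(800), normalize(1600), normalize(2400))  # -1.0 0.0 1.0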

numerical data

#fundamentals

Features represented as integers or real-valued numbers. For example, a house valuation model would probably represent the size of a house (in square feet or square meters) as numerical data. Representing a feature as numerical data indicates that the feature's values have a mathematical relationship to the label. That is, the number of square meters in a house probably has some mathematical relationship to the value of the house.

Not all integer data should be represented as numerical data. For example, postal codes in some parts of the world are integers; however, integer postal codes shouldn't be represented as numerical data in models. That's because a postal code of 20000 is not twice (or half) as potent as a postal code of 10000. Furthermore, although different postal codes do correlate to different real estate values, we can't assume that real estate values at postal code 20000 are twice as valuable as real estate values at postal code 10000. Postal codes should be represented as categorical data instead.

Numerical features are sometimes called continuous features.

See Working with numerical data in Machine Learning Crash Course for more information.

O

offline

#fundamentals

Synonym for static.

offline inference

#fundamentals

The process of a model generating a batch of predictions and then caching (saving) those predictions. Apps can then access the inferred prediction from the cache rather than rerunning the model.

For example, consider a model that generates local weather forecasts (predictions) once every four hours. After each model run, the system caches all the local weather forecasts. Weather apps retrieve the forecasts from the cache.

Offline inference is also called static inference.

Contrast with online inference. See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

one-hot encoding

#fundamentals

Representing categorical data as a vector in which:

  • One element is set to 1.
  • All other elements are set to 0.

One-hot encoding is commonly used to represent strings or identifiers that have a finite set of possible values. For example, suppose a certain categorical feature named Scandinavia has five possible values:

  • "Denmark"
  • "Sweden"
  • "Norway"
  • "Finland"
  • "Iceland"

One-hot encoding could represent each of the five values as follows:

Country      Vector
"Denmark"    10000
"Sweden"     01000
"Norway"     00100
"Finland"    00010
"Iceland"    00001

Thanks to one-hot encoding, a model can learn different connections based on each of the five countries.

Representing a feature as numerical data is an alternative to one-hot encoding. Unfortunately, representing the Scandinavian countries numerically is not a good choice. For example, consider the following numeric representation:

  • "Denmark" is 0
  • "Sweden" is 1
  • "Norway" is 2
  • "Finland" is 3
  • "Iceland" is 4

With numeric encoding, a model would interpret the raw numbers mathematically and would try to train on those numbers. However, Iceland isn't actually twice as much (or half as much) of something as Norway, so the model would come to some strange conclusions.
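Here is a minimal Python sketch (illustrative only) of one-hot encoding the Scandinavia feature:

countries = ["Denmark", "Sweden", "Norway", "Finland", "Iceland"]

def one_hot(value, vocabulary=countries):
  # Returns a vector with a single 1 at the position of the given value.
  return [1 if v == value else 0 for v in vocabulary]

print(one_hot("Norway"))  # [0, 0, 1, 0, 0]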

See Categorical data: Vocabulary and one-hot encoding in Machine Learning Crash Course for more information.

one-vs.-all

#fundamentals

Given a classification problem with N classes, a solution consisting of N separate binary classification models—one binary classification model for each possible outcome. For example, given a model that classifies examples as animal, vegetable, or mineral, a one-vs.-all solution would provide the following three separate binary classification models:

  • animal versus not animal
  • vegetable versus not vegetable
  • mineral versus not mineral
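A minimal sketch of this setup, assuming the scikit-learn library is available (that library is not part of the glossary), trains one binary classifier per class and predicts the class whose classifier reports the highest score:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)
# One binary logistic regression model per Iris class.
one_vs_all = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(one_vs_all.predict(X[:3]))  # class indexes for the first three examples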

online

#fundamentals

Synonym for dynamic.

online inference

#fundamentals

Generating predictions on demand. For example, suppose an app passes input to a model and issues a request for a prediction. A system using online inference responds to the request by running the model (and returning the prediction to the app).

Contrast with offline inference.

See Production ML systems: Static versus dynamic inference in Machine Learning Crash Course for more information.

output layer

#fundamentals

The "final" layer of a neural network. The output layer contains the prediction.

The following illustration shows a small deep neural network with an input layer, two hidden layers, and an output layer:

A neural network with one input layer, two hidden layers, and one output layer. The input layer consists of two features. The first hidden layer consists of three neurons and the second hidden layer consists of two neurons. The output layer consists of a single node.

overfitting

#fundamentals

Creating a model that matches the training data so closely that the model fails to make correct predictions on new data.

Regularization can reduce overfitting. Training on a large and diverse training set can also reduce overfitting.

Click the icon for additional notes.

Overfitting is like strictly following advice from only your favorite teacher. You'll probably be successful in that teacher's class, but you might "overfit" to that teacher's ideas and be unsuccessful in other classes. Following advice from a mixture of teachers will enable you to adapt better to new situations.


See Overfitting in Machine Learning Crash Course for more information.

P

pandas

#fundamentals

A column-oriented data analysis API built on top of numpy. Many machine learning frameworks, including TensorFlow, support pandas data structures as inputs. See the pandas documentation for details.

parameter

#fundamentals

The weights and biases that a model learns during training. For example, in a linear regression model, the parameters consist of the bias (b) and all the weights (w1, w2, and so on) in the following formula:

$$y' = b + w_1x_1 + w_2x_2 + … w_nx_n$$

In contrast, hyperparameters are the values that you (or a hyperparameter tuning service) supply to the model. For example, learning rate is a hyperparameter.

positive class

#fundamentals
#Metric

The class you are testing for.

For example, the positive class in a cancer model might be "tumor." The positive class in an email classification model might be "spam."

Contrast with negative class.

Click the icon for additional notes.

The term positive class can be confusing because the "positive" outcome of many tests is often an undesirable result. For example, the positive class in many medical tests corresponds to tumors or diseases. In general, you want a doctor to tell you, "Congratulations! Your test results were negative." Regardless, the positive class is the event that the test is seeking to find.

Admittedly, you're simultaneously testing for both the positive and negativeclasses.


post-processing

#responsible
#fundamentals

Adjusting the output of a model after the model has been run. Post-processing can be used to enforce fairness constraints without modifying models themselves.

For example, one might apply post-processing to a binary classification model by setting a classification threshold such that equality of opportunity is maintained for some attribute by checking that the true positive rate is the same for all values of that attribute.

precision

#fundamentals
#Metric

A metric for classification models that answers the following question:

When the model predicted the positive class, what percentage of the predictions were correct?

Here is the formula:

$$\text{Precision} =\frac{\text{true positives}} {\text{true positives} + \text{false positives}}$$

where:

  • true positive means the model correctly predicted the positive class.
  • false positive means the model mistakenly predicted the positive class.

For example, suppose a model made 200 positive predictions.Of these 200 positive predictions:

  • 150 were true positives.
  • 50 were false positives.

In this case:

$$\text{Precision} =\frac{\text{150}} {\text{150} + \text{50}} = 0.75$$

Contrast with accuracy and recall.

See Classification: Accuracy, recall, precision and related metrics in Machine Learning Crash Course for more information.

prediction

#fundamentals

A model's output. For example:

  • The prediction of a binary classification model is either the positive class or the negative class.
  • The prediction of a multi-class classification model is one class.
  • The prediction of a linear regression model is a number.

proxy labels

#fundamentals

Data used to approximate labels not directly available in a dataset.

For example, suppose you must train a model to predict employee stress level. Your dataset contains a lot of predictive features but doesn't contain a label named stress level. Undaunted, you pick "workplace accidents" as a proxy label for stress level. After all, employees under high stress get into more accidents than calm employees. Or do they? Maybe workplace accidents actually rise and fall for multiple reasons.

As a second example, suppose you want is it raining? to be a Boolean label for your dataset, but your dataset doesn't contain rain data. If photographs are available, you might establish pictures of people carrying umbrellas as a proxy label for is it raining? Is that a good proxy label? Possibly, but people in some cultures may be more likely to carry umbrellas to protect against sun than against rain.

Proxy labels are often imperfect. When possible, choose actual labels over proxy labels. That said, when an actual label is absent, pick the proxy label very carefully, choosing the least horrible proxy label candidate.

See Datasets: Labels in Machine Learning Crash Course for more information.

R

RAG

#fundamentals

Abbreviation for retrieval-augmented generation.

rater

#fundamentals

A human who provides labels for examples. "Annotator" is another name for rater.

See Categorical data: Common issues in Machine Learning Crash Course for more information.

recall

#fundamentals
#Metric

A metric for classification models that answers the following question:

When ground truth was the positive class, what percentage of predictions did the model correctly identify as the positive class?

Here is the formula:

\[\text{Recall} =\frac{\text{true positives}} {\text{true positives} + \text{false negatives}}\]

where:

  • true positive means the model correctly predicted the positive class.
  • false negative means that the model mistakenly predicted the negative class.

For instance, suppose your model made 200 predictions on examples for which ground truth was the positive class. Of these 200 predictions:

  • 180 were true positives.
  • 20 were false negatives.

In this case:

\[\text{Recall} =\frac{\text{180}} {\text{180} + \text{20}} = 0.9\]

Click the icon for notes about class-imbalanced datasets.

Recall is particularly useful for determining the predictive power of classification models in which the positive class is rare. For example, consider a class-imbalanced dataset in which the positive class for a certain disease occurs in only 10 patients out of a million. Suppose your model makes five million predictions that yield the following outcomes:

  • 30 True Positives
  • 20 False Negatives
  • 4,999,000 True Negatives
  • 950 False Positives

The recall of this model is therefore:

recall = TP / (TP + FN)
recall = 30 / (30 + 20) = 0.6 = 60%

By contrast, the accuracy of this model is:

accuracy = (TP + TN) / (TP + TN + FP + FN)
accuracy = (30 + 4,999,000) / (30 + 4,999,000 + 950 + 20) = 99.98%

That high value of accuracy looks impressive but is essentially meaningless. Recall is a much more useful metric for class-imbalanced datasets than accuracy.
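The following Python sketch (illustrative only) reproduces these numbers from the raw counts:

tp, fn, tn, fp = 30, 20, 4_999_000, 950

recall = tp / (tp + fn)                      # 0.6
accuracy = (tp + tn) / (tp + tn + fp + fn)   # about 0.9998

print(recall, round(accuracy, 4))  # 0.6 0.9998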


See Classification: Accuracy, recall, precision and related metrics for more information.

Rectified Linear Unit (ReLU)

#fundamentals

An activation function with the following behavior:

  • If input is negative or zero, then the output is 0.
  • If input is positive, then the output is equal to the input.

For example:

  • If the input is -3, then the output is 0.
  • If the input is +3, then the output is 3.0.

Here is a plot of ReLU:

A cartesian plot of two lines. The first line has a constant y value of 0, running along the x-axis from -infinity,0 to 0,-0. The second line starts at 0,0. This line has a slope of +1, so it runs from 0,0 to +infinity,+infinity.

ReLU is a very popular activation function. Despite its simple behavior, ReLU still enables a neural network to learn nonlinear relationships between features and the label.
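A minimal Python sketch of ReLU (illustrative only):

def relu(x):
  # 0 for negative or zero input; the input itself otherwise.
  return max(0.0, x)

print(relu(-3), relu(3.0))  # 0.0 3.0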

regression model

#fundamentals

Informally, a model that generates a numerical prediction. (In contrast, a classification model generates a class prediction.) For example, the following are all regression models:

  • A model that predicts a certain house's value in Euros, such as 423,000.
  • A model that predicts a certain tree's life expectancy in years, such as 23.2.
  • A model that predicts the amount of rain in inches that will fall in a certain city over the next six hours, such as 0.18.

Two common types of regression models are:

  • Linear regression, which finds the line that best fits label values to features.
  • Logistic regression, which generates a probability between 0.0 and 1.0 that a system typically then maps to a class prediction.

Not every model that outputs numerical predictions is a regression model. In some cases, a numeric prediction is really just a classification model that happens to have numeric class names. For example, a model that predicts a numeric postal code is a classification model, not a regression model.

regularization

#fundamentals

Any mechanism that reduces overfitting. Popular types of regularization include:

  • L1 regularization
  • L2 regularization
  • dropout regularization
  • early stopping

Regularization can also be defined as the penalty on a model's complexity.

Click the icon for additional notes.

Regularization is counterintuitive. Increasing regularization usually increases training loss, which is confusing because, well, isn't the goal to minimize training loss?

Actually, no. The goal isn't to minimize training loss. The goal is to make excellent predictions on real-world examples. Remarkably, even though increasing regularization increases training loss, it usually helps models make better predictions on real-world examples.


See Overfitting: Model complexity in Machine Learning Crash Course for more information.

regularization rate

#fundamentals

A number that specifies the relative importance of regularization during training. Raising the regularization rate reduces overfitting but may reduce the model's predictive power. Conversely, reducing or omitting the regularization rate increases overfitting.

Click the icon to see the math.

The regularization rate is usually represented as the Greek letter lambda. The following simplified loss equation shows lambda's influence:

$$\text{minimize(loss function + }\lambda\text{(regularization))}$$

where regularization is any regularization mechanism, including L1 regularization and L2 regularization.
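For instance, here is a small Python sketch (illustrative, not from the glossary) of a training objective that adds an L2 regularization term scaled by lambda:

def l2_penalty(weights):
  # Sum of the squared weights (the L2 regularization term).
  return sum(w ** 2 for w in weights)

def regularized_loss(data_loss, weights, lambda_=0.01):
  # Raising lambda_ penalizes complex models more heavily.
  return data_loss + lambda_ * l2_penalty(weights)

print(regularized_loss(2.5, [0.4, -1.2, 3.0]))  # 2.5 + 0.01 * 10.6 = about 2.606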


See Overfitting: L2 regularization in Machine Learning Crash Course for more information.

ReLU

#fundamentals

Abbreviation for Rectified Linear Unit.

retrieval-augmented generation (RAG)

#fundamentals

A technique for improving the quality of large language model (LLM) output by grounding it with sources of knowledge retrieved after the model was trained. RAG improves the accuracy of LLM responses by providing the trained LLM with access to information retrieved from trusted knowledge bases or documents.

Common motivations to use retrieval-augmented generation include:

  • Increasing the factual accuracy of a model's generated responses.
  • Giving the model access to knowledge it was not trained on.
  • Changing the knowledge that the model uses.
  • Enabling the model to cite sources.

For example, suppose that a chemistry app uses the PaLM API to generate summaries related to user queries. When the app's backend receives a query, the backend takes the following steps (a Python sketch follows the list):

  1. Searches for ("retrieves") data that's relevant to the user's query.
  2. Appends ("augments") the relevant chemistry data to the user's query.
  3. Instructs the LLM to create a summary based on the appended data.
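Here is a highly simplified Python sketch of that flow; retrieve_chemistry_docs and call_llm are hypothetical placeholders, not real APIs:

def answer_query(user_query, retrieve_chemistry_docs, call_llm):
  # 1. Retrieve: find data that's relevant to the user's query.
  documents = retrieve_chemistry_docs(user_query)
  # 2. Augment: append the relevant data to the user's query.
  augmented_prompt = user_query + "\n\nRelevant data:\n" + "\n".join(documents)
  # 3. Generate: instruct the LLM to create a summary based on the appended data.
  return call_llm("Summarize, using only the data provided:\n" + augmented_prompt)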

ROC (receiver operating characteristic) Curve

#fundamentals
#Metric

A graph of true positive rate versus false positive rate for different classification thresholds in binary classification.

The shape of an ROC curve suggests a binary classification model's ability to separate positive classes from negative classes. Suppose, for example, that a binary classification model perfectly separates all the negative classes from all the positive classes:

A number line with 8 positive examples on the right side and 7 negative examples on the left.

The ROC curve for the preceding model looks as follows:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The curve has an inverted L shape. The curve starts at (0.0,0.0) and goes straight up to (0.0,1.0). Then the curve goes from (0.0,1.0) to (1.0,1.0).

In contrast, the following illustration graphs the raw logistic regression values for a terrible model that can't separate negative classes from positive classes at all:

A number line with positive examples and negative classes completely intermixed.

The ROC curve for this model looks as follows:

An ROC curve, which is actually a straight line from (0.0,0.0) to (1.0,1.0).

Meanwhile, back in the real world, most binary classification models separate positive and negative classes to some degree, but usually not perfectly. So, a typical ROC curve falls somewhere between the two extremes:

An ROC curve. The x-axis is False Positive Rate and the y-axis is True Positive Rate. The ROC curve approximates a shaky arc traversing the compass points from West to North.

The point on an ROC curve closest to (0.0,1.0) theoretically identifies the ideal classification threshold. However, several other real-world issues influence the selection of the ideal classification threshold. For example, perhaps false negatives cause far more pain than false positives.

A numerical metric called AUC summarizes the ROC curve into a single floating-point value.

Root Mean Squared Error (RMSE)

#fundamentals
#Metric

The square root of the Mean Squared Error.

S

sigmoid function

#fundamentals

A mathematical function that "squishes" an input value into a constrained range, typically 0 to 1 or -1 to +1. That is, you can pass any number (two, a million, negative billion, whatever) to a sigmoid and the output will still be in the constrained range. A plot of the sigmoid activation function looks as follows:

A two-dimensional curved plot with x values spanning the domain -infinity to +infinity, while y values span the range almost 0 to almost 1. When x is 0, y is 0.5. The slope of the curve is always positive, with the highest slope at 0,0.5 and gradually decreasing slopes as the absolute value of x increases.

The sigmoid function has several uses in machine learning, including:

  • Converting the raw output of a logistic regression model into a probability.
  • Acting as an activation function in some neural networks.

Click the icon to see the math.

The sigmoid function over an input number x has the following formula:

$$sigmoid(x) = \frac{1}{1 + e^{-\text{x}}}$$

In machine learning, x is generally a weighted sum.
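A minimal Python sketch of the formula (illustrative only):

import math

def sigmoid(x):
  return 1 / (1 + math.exp(-x))

print(sigmoid(0), round(sigmoid(6), 3), round(sigmoid(-6), 3))  # 0.5 0.998 0.002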


softmax

#fundamentals

A function that determines probabilities for each possible class in a multi-class classification model. The probabilities add up to exactly 1.0. For example, the following table shows how softmax distributes various probabilities:

Image is a...    Probability
dog              .85
cat              .13
horse            .02

Softmax is also called full softmax.

Contrast with candidate sampling.

Click the icon to see the math.

The softmax equation is as follows:

$$\sigma_i = \frac{e^{\text{z}_i}} {\sum_{j=1}^{j=K} {e^{\text{z}_j}}} $$
where:

  • $\sigma$ is the output vector. Each element of the output vector specifies the probability of that element. The sum of all the elements in the output vector is 1.0. The output vector contains the same number of elements as the input vector, $z$.
  • $z$ is the input vector. Each element of the input vector contains a floating-point value.
  • $K$ is the number of elements in the input vector (and the output vector).

For example, suppose the input vector is:

[1.2, 2.5, 1.8]

Therefore, softmax calculates the denominator as follows:

$$\text{denominator} = e^{1.2} + e^{2.5} + e^{1.8} = 21.552$$

The softmax probability of each element is therefore:

$$\sigma_1 = \frac{e^{1.2}}{21.552} = 0.154 $$
$$\sigma_2 = \frac{e^{2.5}}{21.552} = 0.565 $$
$$\sigma_3 = \frac{e^{1.8}}{21.552} = 0.281 $$

The output vector is therefore:

$$\sigma = [0.154, 0.565, 0.281]$$

The sum of the three elements in $\sigma$ is 1.0. Phew!
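The same calculation as a Python sketch (illustrative only):

import math

def softmax(z):
  exponentials = [math.exp(value) for value in z]
  total = sum(exponentials)
  return [value / total for value in exponentials]

print([round(p, 3) for p in softmax([1.2, 2.5, 1.8])])  # [0.154, 0.565, 0.281]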


See Neural networks: Multi-class classification in Machine Learning Crash Course for more information.

sparse feature

#fundamentals

A feature whose values are predominantly zero or empty. For example, a feature containing a single 1 value and a million 0 values is sparse. In contrast, a dense feature has values that are predominantly not zero or empty.

In machine learning, a surprising number of features are sparse features. Categorical features are usually sparse features. For example, of the 300 possible tree species in a forest, a single example might identify just a maple tree. Or, of the millions of possible videos in a video library, a single example might identify just "Casablanca."

In a model, you typically represent sparse features with one-hot encoding. If the one-hot encoding is big, you might put an embedding layer on top of the one-hot encoding for greater efficiency.

sparse representation

#fundamentals

Storing only the position(s) of nonzero elements in a sparse feature.

For example, suppose a categorical feature named species identifies the 36 tree species in a particular forest. Further assume that each example identifies only a single species.

You could use a one-hot vector to represent the tree species in each example. A one-hot vector would contain a single 1 (to represent the particular tree species in that example) and 35 0s (to represent the 35 tree species not in that example). So, the one-hot representation of maple might look something like the following:

A vector in which positions 0 through 23 hold the value 0, position 24 holds the value 1, and positions 25 through 35 hold the value 0.

Alternatively, sparse representation would simply identify the position of the particular species. If maple is at position 24, then the sparse representation of maple would simply be:

24

Notice that the sparse representation is much more compact than the one-hot representation.

Note: You shouldn't pass a sparse representation as a direct feature input to a model. Instead, you should convert the sparse representation into a one-hot representation before training on it.
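A small Python sketch (illustrative only) that expands a sparse representation back into a one-hot vector before training, as the note recommends:

def sparse_to_one_hot(position, vocabulary_size=36):
  # Expand a single stored position (for example, 24 for maple) into a one-hot vector.
  vector = [0] * vocabulary_size
  vector[position] = 1
  return vector

one_hot_maple = sparse_to_one_hot(24)
print(sum(one_hot_maple), one_hot_maple[24])  # 1 1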

Click the icon for a slightly more complex example.

Suppose each example in your model must represent the words—but not the order of those words—in an English sentence. English consists of about 170,000 words, so English is a categorical feature with about 170,000 elements. Most English sentences use an extremely tiny fraction of those 170,000 words, so the set of words in a single example is almost certainly going to be sparse data.

Consider the following sentence:

My dog is a great dog

You could use a variant of a one-hot vector to represent the words in this sentence. In this variant, multiple cells in the vector can contain a nonzero value. Furthermore, in this variant, a cell can contain an integer other than one. Although the words "my", "is", "a", and "great" appear only once in the sentence, the word "dog" appears twice. Using this variant of one-hot vectors to represent the words in this sentence yields the following 170,000-element vector:

A vector of 170,000 integers. The number 1 is at vector position 0, 45770, 58906, and 91520. The number 2 is at position 26,100. Zeroes are at the remaining 169,996 positions.

A sparse representation of the same sentence would simply be:

0:1  26100:2  45770:1  58906:1  91520:1

Click the icon if you are confused.

The term "sparse representation" confuses a lot of people because sparserepresentation is itselfnot a sparse vector. Rather, sparserepresentation is actually adense representation of a sparse vector.The synonymindex representation is a little clearer than"sparse representation."


SeeWorking with categorical datain Machine Learning Crash Course for more information.

sparse vector

#fundamentals

A vector whose values are mostly zeroes. See also sparse feature and sparsity.

squared loss

#fundamentals
#Metric

Synonym for L2 loss.

static

#fundamentals

Something done once rather than continuously. The terms static and offline are synonyms. The following are common uses of static and offline in machine learning:

  • A static model (or offline model) is a model trained once and then used for a while.
  • Static training (or offline training) is the process of training a static model.
  • Static inference (or offline inference) is a process in which a model generates a batch of predictions at a time.

Contrast with dynamic.

static inference

#fundamentals

Synonym for offline inference.

stationarity

#fundamentals

A feature whose values don't change across one or more dimensions, usually time. For example, a feature whose values look about the same in 2021 and 2023 exhibits stationarity.

In the real world, very few features exhibit stationarity. Even features synonymous with stability (like sea level) change over time.

Contrast with nonstationarity.

stochastic gradient descent (SGD)

#fundamentals

A gradient descent algorithm in which the batch size is one. In other words, SGD trains on a single example chosen uniformly at random from a training set.

See Linear regression: Hyperparameters in Machine Learning Crash Course for more information.

supervised machine learning

#fundamentals

Training a model from features and their corresponding labels. Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.

Compare with unsupervised machine learning.

See Supervised Learning in the Introduction to ML course for more information.

synthetic feature

#fundamentals

A feature not present among the input features, but assembled from one or more of them. Methods for creating synthetic features include the following:

  • Bucketing a continuous feature into range bins.
  • Creating a feature cross.
  • Multiplying (or dividing) one feature value by other feature value(s) or by itself. For example, if a and b are input features, then the following are examples of synthetic features:
    • ab
    • a²
  • Applying a transcendental function to a feature value. For example, if c is an input feature, then the following are examples of synthetic features:
    • sin(c)
    • ln(c)

Features created by normalizing or scaling alone are not considered synthetic features.

T

test loss

#fundamentals
#Metric

A metric representing a model's loss against the test set. When building a model, you typically try to minimize test loss. That's because a low test loss is a stronger quality signal than a low training loss or low validation loss.

A large gap between test loss and training loss or validation loss sometimes suggests that you need to increase the regularization rate.

training

#fundamentals

The process of determining the ideal parameters (weights and biases) comprising a model. During training, a system reads in examples and gradually adjusts parameters. Training uses each example anywhere from a few times to billions of times.

See Supervised Learning in the Introduction to ML course for more information.

training loss

#fundamentals
#Metric

A metric representing a model's loss during a particular training iteration. For example, suppose the loss function is Mean Squared Error. Perhaps the training loss (the Mean Squared Error) for the 10th iteration is 2.2, and the training loss for the 100th iteration is 1.9.

A loss curve plots training loss versus the number of iterations. A loss curve provides the following hints about training:

  • A downward slope implies that the model is improving.
  • An upward slope implies that the model is getting worse.
  • A flat slope implies that the model has reached convergence.

For example, the following somewhat idealized loss curve shows:

  • A steep downward slope during the initial iterations, which implies rapid model improvement.
  • A gradually flattening (but still downward) slope until close to the end of training, which implies continued model improvement at a somewhat slower pace than during the initial iterations.
  • A flat slope towards the end of training, which suggests convergence.

The plot of training loss versus iterations. This loss curve starts with a steep downward slope. The slope gradually flattens until the slope becomes zero.

Although training loss is important, see also generalization.

training-serving skew

#fundamentals

The difference between a model's performance during training and that same model's performance during serving.

training set

#fundamentals

The subset of the dataset used to train a model.

Traditionally, examples in the dataset are divided into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

true negative (TN)

#fundamentals
#Metric

An example in which the model correctly predicts the negative class. For example, the model infers that a particular email message is not spam, and that email message really is not spam.

true positive (TP)

#fundamentals
#Metric

An example in which the model correctly predicts the positive class. For example, the model infers that a particular email message is spam, and that email message really is spam.

true positive rate (TPR)

#fundamentals
#Metric

Synonym for recall. That is:

$$\text{true positive rate} = \frac {\text{true positives}} {\text{true positives} + \text{false negatives}}$$

True positive rate is the y-axis in an ROC curve.

U

underfitting

#fundamentals

Producing a model with poor predictive ability because the model hasn't fully captured the complexity of the training data. Many problems can cause underfitting, including:

  • Training on the wrong set of features.
  • Training for too few epochs or at too low a learning rate.
  • Training with too high a regularization rate.
  • Providing too few hidden layers in a deep neural network.

See Overfitting in Machine Learning Crash Course for more information.

unlabeled example

#fundamentals

An example that contains features but no label. For example, the following table shows three unlabeled examples from a house valuation model, each with three features but no house value:

Number of bedrooms    Number of bathrooms    House age
3                     2                      15
2                     1                      72
4                     2                      34

In supervised machine learning, models train on labeled examples and make predictions on unlabeled examples.

In semi-supervised and unsupervised learning, unlabeled examples are used during training.

Contrast unlabeled example with labeled example.

unsupervised machine learning

#clustering
#fundamentals

Training a model to find patterns in a dataset, typically an unlabeled dataset.

The most common use of unsupervised machine learning is to cluster data into groups of similar examples. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent. For example, in domains such as anti-abuse and fraud, clusters can help humans better understand the data.

Contrast with supervised machine learning.

Click the icon for additional notes.

Another example of unsupervised machine learning is principal component analysis (PCA). For example, applying PCA on a dataset containing the contents of millions of shopping carts might reveal that shopping carts containing lemons frequently also contain antacids.


See What is Machine Learning? in the Introduction to ML course for more information.

V

validation

#fundamentals

The initial evaluation of a model's quality. Validation checks the quality of a model's predictions against the validation set.

Because the validation set differs from the training set, validation helps guard against overfitting.

You might think of evaluating the model against the validation set as the first round of testing and evaluating the model against the test set as the second round of testing.

validation loss

#fundamentals
#Metric

A metric representing a model's loss on the validation set during a particular iteration of training.

See also generalization curve.

validation set

#fundamentals

The subset of the dataset that performs initial evaluation against a trained model. Typically, you evaluate the trained model against the validation set several times before evaluating the model against the test set.

Traditionally, you divide the examples in the dataset into the following three distinct subsets:

  • a training set
  • a validation set
  • a test set

Ideally, each example in the dataset should belong to only one of the preceding subsets. For example, a single example shouldn't belong to both the training set and the validation set.

See Datasets: Dividing the original dataset in Machine Learning Crash Course for more information.

W

weight

#fundamentals

A value that a model multiplies by another value. Training is the process of determining a model's ideal weights; inference is the process of using those learned weights to make predictions.

Click the icon to see an example of weights in a linear model.

Imagine a linear model with two features. Suppose that training determines the following weights (and bias):

  • The bias, b, has a value of 2.2.
  • The weight, w1, associated with one feature is 1.5.
  • The weight, w2, associated with the other feature is 0.4.

Now imagine an example with the following feature values:

  • The value of one feature, x1, is 6.
  • The value of the other feature, x2, is 10.

This linear model uses the following formula to generate a prediction, y':

$$y' = b + w_1x_1 + w_2x_2$$

Therefore, the prediction is:

$$y' = 2.2 + (1.5)(6) + (0.4)(10) = 15.2$$

If a weight is 0, then the corresponding feature doesn't contribute to the model. For example, if w1 is 0, then the value of x1 is irrelevant.


See Linear regression in Machine Learning Crash Course for more information.

weighted sum

#fundamentals

The sum of all the relevant input values multiplied by their corresponding weights. For example, suppose the relevant inputs consist of the following:

input value    input weight
2              -1.3
-1             0.6
3              0.4

The weighted sum is therefore:

weighted sum = (2)(-1.3) + (-1)(0.6) + (3)(0.4) = -2.0
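The same calculation as a Python sketch (illustrative only):

inputs = [2, -1, 3]
weights = [-1.3, 0.6, 0.4]

weighted_sum = sum(x * w for x, w in zip(inputs, weights))
print(round(weighted_sum, 2))  # -2.0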

A weighted sum is the input argument to an activation function.

Z

Z-score normalization

#fundamentals

A scaling technique that replaces a raw feature value with a floating-point value representing the number of standard deviations from that feature's mean. For example, consider a feature whose mean is 800 and whose standard deviation is 100. The following table shows how Z-score normalization would map the raw value to its Z-score:

Raw value    Z-score
800          0
950          +1.5
575          -2.25

The machine learning model then trains on the Z-scores for that feature instead of on the raw values.
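For example, a minimal Python sketch (illustrative only) of Z-score normalization with this feature's mean and standard deviation:

def z_score(raw_value, mean=800.0, standard_deviation=100.0):
  # Number of standard deviations the raw value lies from the mean.
  return (raw_value - mean) / standard_deviation

print(z_score(800), z_score(950), z_score(575))  # 0.0 1.5 -2.25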

See Numerical data: Normalization in Machine Learning Crash Course for more information.
