Making developers awesome at machine learning
The key to getting good at applied machine learning is practicing on lots of different datasets.
This is because each problem is different, requiring subtly different data preparation and modeling methods.
In this post, you will discover 10 top standard machine learning datasets that you can use for practice.
Let’s dive in.
Each dataset is summarized in a consistent way. This makes them easy to compare and navigate when you want to practice a specific data preparation technique or modeling method.
The aspects that you need to know about each dataset are:
Below is a list of the 10 datasets we’ll cover.
Each dataset is small enough to fit into memory and review in a spreadsheet. All datasets consist of tabular data with no (explicitly) missing values.
The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.
It is a regression problem. It comprises 63 observations with 1 input variable and 1 output variable. The variable names are as follows:
The baseline performance of predicting the mean value is an RMSE of approximately 81 thousand Kronor.
A sample of the first 5 rows is listed below.
108,392.5
19,46.2
13,15.7
124,422.2
40,119.4
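The mean-prediction baseline is easy to reproduce; below is a minimal NumPy sketch using only the five sample payments above (the published ~81 figure uses all 63 observations, so the number here is illustrative only).

```python
import numpy as np

def baseline_rmse(y):
    """RMSE of always predicting the mean of y (the ZeroR baseline for regression)."""
    y = np.asarray(y, dtype=float)
    return float(np.sqrt(np.mean((y - y.mean()) ** 2)))

# Payments (thousands of Kronor) from the five sample rows above.
print(baseline_rmse([392.5, 46.2, 15.7, 422.2, 119.4]))
```

The same function can be pointed at the full payment column once the dataset is loaded.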
Below is a scatter plot of the entire dataset.

Swedish Auto Insurance Dataset
The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.
It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and one output variable. The variable names are as follows:
The baseline performance of predicting the mean value is an RMSE of approximately 0.148 quality points.
A sample of the first 5 rows is listed below.
7,0.27,0.36,20.7,0.045,45,170,1.001,3,0.45,8.8,6
6.3,0.3,0.34,1.6,0.049,14,132,0.994,3.3,0.49,9.5,6
8.1,0.28,0.4,6.9,0.05,30,97,0.9951,3.26,0.44,10.1,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
7.2,0.23,0.32,8.5,0.058,47,186,0.9956,3.19,0.4,9.9,6
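When framing quality as a regression problem, continuous predictions can be mapped back onto the ordinal score by rounding and clipping; a minimal sketch (the 0–10 bounds are the nominal scale, an assumption here, as the observed scores span a narrower range):

```python
import numpy as np

def to_quality(pred, lo=0, hi=10):
    """Round continuous predictions to the nearest integer quality score."""
    return np.clip(np.rint(pred), lo, hi).astype(int)

print(to_quality([5.4, 6.7, 10.3]))  # [ 5  7 10]
```

Alternatively, the quality column can be used directly as a class label for multi-class classification.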
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
A sample of the first 5 rows is listed below.
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
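The zero-encoded missing values can be made explicit before modeling. Below is a sketch with pandas on the five sample rows above; the short column names follow the convention used elsewhere on this page, and the choice of "implausible" columns is an assumption.

```python
import numpy as np
import pandas as pd

# The five sample rows above, with conventional short column names.
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
rows = [
    [6, 148, 72, 35, 0, 33.6, 0.627, 50, 1],
    [1, 85, 66, 29, 0, 26.6, 0.351, 31, 0],
    [8, 183, 64, 0, 0, 23.3, 0.672, 32, 1],
    [1, 89, 66, 23, 94, 28.1, 0.167, 21, 0],
    [0, 137, 40, 35, 168, 43.1, 2.288, 33, 1],
]
df = pd.DataFrame(rows, columns=names)

# A zero is physiologically implausible in these columns (an assumption),
# so treat it as a missing value.
implausible = ['plas', 'pres', 'skin', 'test', 'mass']
df[implausible] = df[implausible].replace(0, np.nan)
print(df.isna().sum())
```

Marking the values as NaN lets imputation or NaN-aware models handle them explicitly instead of treating zero as a real measurement.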
The Sonar Dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.
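This kind of majority-class (ZeroR) baseline is easy to verify; a minimal sketch, assuming the commonly reported class counts of 111 mines and 97 rocks:

```python
from collections import Counter

def zero_r_accuracy(labels):
    """Accuracy of always predicting the most frequent class."""
    return Counter(labels).most_common(1)[0][1] / len(labels)

# 111 mine (M) and 97 rock (R) cases give the ~53% baseline.
print(zero_r_accuracy(['M'] * 111 + ['R'] * 97))
```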
A sample of the first 5 rows is listed below.
0.0200,0.0371,0.0428,0.0207,0.0954,0.0986,0.1539,0.1601,0.3109,0.2111,0.1609,0.1582,0.2238,0.0645,0.0660,0.2273,0.3100,0.2999,0.5078,0.4797,0.5783,0.5071,0.4328,0.5550,0.6711,0.6415,0.7104,0.8080,0.6791,0.3857,0.1307,0.2604,0.5121,0.7547,0.8537,0.8507,0.6692,0.6097,0.4943,0.2744,0.0510,0.2834,0.2825,0.4256,0.2641,0.1386,0.1051,0.1343,0.0383,0.0324,0.0232,0.0027,0.0065,0.0159,0.0072,0.0167,0.0180,0.0084,0.0090,0.0032,R
0.0453,0.0523,0.0843,0.0689,0.1183,0.2583,0.2156,0.3481,0.3337,0.2872,0.4918,0.6552,0.6919,0.7797,0.7464,0.9444,1.0000,0.8874,0.8024,0.7818,0.5212,0.4052,0.3957,0.3914,0.3250,0.3200,0.3271,0.2767,0.4423,0.2028,0.3788,0.2947,0.1984,0.2341,0.1306,0.4182,0.3835,0.1057,0.1840,0.1970,0.1674,0.0583,0.1401,0.1628,0.0621,0.0203,0.0530,0.0742,0.0409,0.0061,0.0125,0.0084,0.0089,0.0048,0.0094,0.0191,0.0140,0.0049,0.0052,0.0044,R
0.0262,0.0582,0.1099,0.1083,0.0974,0.2280,0.2431,0.3771,0.5598,0.6194,0.6333,0.7060,0.5544,0.5320,0.6479,0.6931,0.6759,0.7551,0.8929,0.8619,0.7974,0.6737,0.4293,0.3648,0.5331,0.2413,0.5070,0.8533,0.6036,0.8514,0.8512,0.5045,0.1862,0.2709,0.4232,0.3043,0.6116,0.6756,0.5375,0.4719,0.4647,0.2587,0.2129,0.2222,0.2111,0.0176,0.1348,0.0744,0.0130,0.0106,0.0033,0.0232,0.0166,0.0095,0.0180,0.0244,0.0316,0.0164,0.0095,0.0078,R
0.0100,0.0171,0.0623,0.0205,0.0205,0.0368,0.1098,0.1276,0.0598,0.1264,0.0881,0.1992,0.0184,0.2261,0.1729,0.2131,0.0693,0.2281,0.4060,0.3973,0.2741,0.3690,0.5556,0.4846,0.3140,0.5334,0.5256,0.2520,0.2090,0.3559,0.6260,0.7340,0.6120,0.3497,0.3953,0.3012,0.5408,0.8814,0.9857,0.9167,0.6121,0.5006,0.3210,0.3202,0.4295,0.3654,0.2655,0.1576,0.0681,0.0294,0.0241,0.0121,0.0036,0.0150,0.0085,0.0073,0.0050,0.0044,0.0040,0.0117,R
0.0762,0.0666,0.0481,0.0394,0.0590,0.0649,0.1209,0.2467,0.3564,0.4459,0.4152,0.3952,0.4256,0.4135,0.4528,0.5326,0.7306,0.6193,0.2032,0.4636,0.4148,0.4292,0.5730,0.5399,0.3161,0.2285,0.6995,1.0000,0.7262,0.4724,0.5103,0.5459,0.2881,0.0981,0.1951,0.4181,0.4604,0.3217,0.2828,0.2430,0.1979,0.2444,0.1847,0.0841,0.0692,0.0528,0.0357,0.0085,0.0230,0.0046,0.0156,0.0031,0.0054,0.0105,0.0110,0.0015,0.0072,0.0048,0.0107,0.0094,R
The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 50%.
A sample of the first few rows is listed below.
3.6216,8.6661,-2.8073,-0.44699,0
4.5459,8.1674,-2.4586,-1.4621,0
3.866,-2.6383,1.9242,0.10645,0
3.4566,9.5228,-4.0112,-3.5944,0
0.32924,-4.4552,4.5718,-0.9888,0
4.3684,9.6718,-3.9606,-3.1625,0
The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.
It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33%, since the three classes are balanced.
A sample of the first 5 rows is listed below.
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
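As a worked example, a k-nearest neighbors classifier does very well on this dataset. Below is a minimal scikit-learn sketch; k=5 and the 75/25 split are arbitrary choices, not a tuned setup.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# The 150-observation iris dataset is bundled with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
acc = model.score(X_test, y_test)
print(acc)
```

Accuracy well above the ~33% baseline should be expected from almost any sensible model here.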
The Abalone Dataset involves predicting the age of abalone given objective measures of individuals.
It is a multi-class classification problem, but can also be framed as a regression. The number of observations for each class is not balanced. There are 4,177 observations with 8 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 16%. The baseline performance of predicting the mean value is an RMSE of approximately 3.2 rings.
A sample of the first 5 rows is listed below.
M,0.455,0.365,0.095,0.514,0.2245,0.101,0.15,15
M,0.35,0.265,0.09,0.2255,0.0995,0.0485,0.07,7
F,0.53,0.42,0.135,0.677,0.2565,0.1415,0.21,9
M,0.44,0.365,0.125,0.516,0.2155,0.114,0.155,10
I,0.33,0.255,0.08,0.205,0.0895,0.0395,0.055,7
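Age is conventionally taken as the ring count plus 1.5 years, and the classification framing predicts the ring count directly. A small sketch of the ZeroR classification baseline, using just the five sample rows above so the numbers are illustrative only:

```python
import numpy as np

# Ring counts from the five sample rows above; age = rings + 1.5 years.
rings = np.array([15, 7, 9, 10, 7])
print(rings + 1.5)

# ZeroR classification baseline: always predict the modal ring count.
values, counts = np.unique(rings, return_counts=True)
mode = values[np.argmax(counts)]
print(mode, np.mean(rings == mode))
```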
The Ionosphere Dataset requires the prediction of structure in the atmosphere given radar returns targeting free electrons in the ionosphere.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 351 observations with 34 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 64%. Top results achieve a classification accuracy of approximately 94%.
A sample of the first 5 rows is listed below.
1,0,0.99539,-0.05889,0.85243,0.02306,0.83398,-0.37708,1,0.03760,0.85243,-0.17755,0.59755,-0.44945,0.60536,-0.38223,0.84356,-0.38542,0.58212,-0.32192,0.56971,-0.29674,0.36946,-0.47357,0.56811,-0.51171,0.41078,-0.46168,0.21266,-0.34090,0.42267,-0.54487,0.18641,-0.45300,g
1,0,1,-0.18829,0.93035,-0.36156,-0.10868,-0.93597,1,-0.04549,0.50874,-0.67743,0.34432,-0.69707,-0.51685,-0.97515,0.05499,-0.62237,0.33109,-1,-0.13151,-0.45300,-0.18056,-0.35734,-0.20332,-0.26569,-0.20468,-0.18401,-0.19040,-0.11593,-0.16626,-0.06288,-0.13738,-0.02447,b
1,0,1,-0.03365,1,0.00485,1,-0.12062,0.88965,0.01198,0.73082,0.05346,0.85443,0.00827,0.54591,0.00299,0.83775,-0.13644,0.75535,-0.08540,0.70887,-0.27502,0.43385,-0.12062,0.57528,-0.40220,0.58984,-0.22145,0.43100,-0.17365,0.60436,-0.24180,0.56045,-0.38238,g
1,0,1,-0.45161,1,1,0.71216,-1,0,0,0,0,0,0,-1,0.14516,0.54094,-0.39330,-1,-0.54467,-0.69975,1,0,0,1,0.90695,0.51613,1,1,-0.20099,0.25682,1,-0.32382,1,b
1,0,1,-0.02401,0.94140,0.06531,0.92106,-0.23255,0.77152,-0.16399,0.52798,-0.20275,0.56409,-0.00712,0.34395,-0.27457,0.52940,-0.21780,0.45107,-0.17813,0.05982,-0.35575,0.02309,-0.52879,0.03286,-0.65158,0.13290,-0.53206,0.02431,-0.62197,-0.05707,-0.59573,-0.04608,-0.65697,g
The Wheat Seeds Dataset involves the prediction of species given measurements of seeds from different varieties of wheat.
It is a multi-class (3-class) classification problem. The number of observations for each class is balanced. There are 210 observations with 7 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33%, since the three classes are balanced.
A sample of the first 5 rows is listed below.
15.26,14.84,0.871,5.763,3.312,2.221,5.22,1
14.88,14.57,0.8811,5.554,3.333,1.018,4.956,1
14.29,14.09,0.905,5.291,3.337,2.699,4.825,1
13.84,13.94,0.8955,5.324,3.379,2.259,4.805,1
16.14,14.99,0.9034,5.658,3.562,1.355,5.175,1
The Boston House Price Dataset involves the prediction of a house price in thousands of dollars given details of the house and its neighborhood.
It is a regression problem. There are 506 observations with 13 input variables and 1 output variable. The variable names are as follows:
The baseline performance of predicting the mean value is an RMSE of approximately 9.21 thousand dollars.
A sample of the first 5 rows is listed below.
0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00
0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60
0.02729 0.00 7.070 0 0.4690 7.1850 61.10 4.9671 2 242.0 17.80 392.83 4.03 34.70
0.03237 0.00 2.180 0 0.4580 6.9980 45.80 6.0622 3 222.0 18.70 394.63 2.94 33.40
0.06905 0.00 2.180 0 0.4580 7.1470 54.20 6.0622 3 222.0 18.70 396.90 5.33 36.20
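Note that this dataset is whitespace-delimited rather than comma-separated, so it needs a different separator when loading. A sketch using pandas on the first two rows above; the column names are the commonly used convention for this dataset, an assumption here.

```python
import io
import pandas as pd

# Commonly used column names for this dataset (an assumption here).
names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
         'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']

raw = (
    "0.00632 18.00 2.310 0 0.5380 6.5750 65.20 4.0900 1 296.0 15.30 396.90 4.98 24.00\n"
    "0.02731 0.00 7.070 0 0.4690 6.4210 78.90 4.9671 2 242.0 17.80 396.90 9.14 21.60\n"
)
# The file is whitespace-delimited, so use a regex separator.
df = pd.read_csv(io.StringIO(raw), sep=r'\s+', header=None, names=names)
print(df[['RM', 'MEDV']])
```

Replacing the `io.StringIO` object with a path to the full file loads all 506 rows the same way.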
In this post, you discovered 10 top standard datasets that you can use to practice applied machine learning.
Here is your next step:
The format for the Swedish Auto Insurance data has changed. It's not in CSV format anymore and there are extra rows at the beginning of the data.
You can copy-paste the data from this page into a file, load it in Excel, then convert it to CSV:
https://www.math.muni.cz/~kolacek/docs/frvs/M7222/data/AutoInsurSweden.txt
Your posts have been a big help. Could you recommend a dataset which i can use to practice clustering and PCA on ?
Thanks.
Perhaps something where all features have the same units, like the iris flowers dataset?
Hello, in reference to the Swedish auto data, is it not possible to use scikit-learn to perform linear regression? I get deprecation errors that request that I reshape the data. When I reshape, I get the error that the samples are different sizes. What am I missing, please? Thank you.
Sorry, I don't know, Joe. Perhaps try posting your code and errors to Stack Overflow?
Sir, for the wheat dataset I got a result like this:
0.97619047619
[[ 9 0 1]
[ 0 20 0]
[ 0 0 12]]
precision recall f1-score support
1.0 1.00 0.90 0.95 10
2.0 1.00 1.00 1.00 20
3.0 0.92 1.00 0.96 12
avg / total 0.98 0.98 0.98 42
Is it correct, sir?
Sir, are the confusion matrix and the accuracy I got acceptable? Is that right?
It really depends on the problem. Sorry, I don’t know the problem well enough, perhaps compare it to the confusion matrix of other algorithms.
What is the difference between numeric and clinical cancer data, or are they the same? I need leukemia, lung, and colon datasets for my work. I also need an image dataset for my research. Where can I get these datasets?
Thanks a lot for sharing Jason!
I applied sklearn random forest and SVM classifiers to the wheat seed dataset in my very first Python notebook! 😀 The error oscillates between 10% and 20% from one run to another. I can share it if anyone is interested.
Bye
Thanks Jason
I tried decision tree classifier with 70% training and 30% testing on Banknote dataset.
Achieved accuracy of 99%.
[Accuracy: 0.9902912621359223]
used k- nearest neighbors classifier with 75% training & 25% testing on the iris data set. Achieved 0.973684 accuracy.
used k- nearest neighbors classifier with 75% training & 25% testing on the iris data set. Achieved 0.9970845481049563 accuracy.
99.71%
Do you have any of these solved that I can reference back to?
Yes, I have solutions to most of them on the blog, you can try a blog search.
Hi, I used Support Vector Classifier and KNN classifier on the Wheat Seeds Dataset (80% train data, 20% test data )
Accuracy Score of SVC : 0.9047619047619048
Accuracy Score of KNN : 0.8809523809523809
Hiya! Found some incredible topological trends in Iris that I am looking to replicate in another multi-class problem.
Are people typically classifying the gender of the species, or the ring number as a discrete output?
The age is the target on that dataset, but you can frame any predictive modeling problem you like with the dataset for practice.
Hi sir, I am looking for a dataset on wheat production to use with the SVM regression algorithm. Please point me to a proper dataset for machine learning.
Search for datasets here:
https://machinelearningmastery.com/faq/single-faq/where-can-i-get-a-dataset-on-___
Hi guys, I am new to ML.
Thanks for this set of data!
Has anyone beaten the wine quality problem?
My results are so bad.
Thanks
I need a data set that:
• Contains at least 5 dimensions/features, including at least one categorical and one numerical dimension.
• Contains a clear class label attribute (binary or multi-label).
• Is of a simple tabular structure (i.e., no time series, multimedia, etc.).
• Is of reasonable size, and contains at least 2K tuples.
Where can I find the default result for the problems so I can compare with my result?
Thanks for the post – it is very helpful!
I would like to know if anyone knows about a classification dataset where the importance of the features with respect to the output classes is known. For example: Feature 1 is a good indicator for class 1, or Features 3, 4, and 5 are good indicators for class 2, …
Hope anyone can help 😉
You’re welcome.
Generally, we let the model discover the importance and how best to use input features.
Thank you very much for your answer. I was asking because I want to validate my approach to assessing feature importance via global sensitivity analysis (Sobol indices). In order to do so, I am searching for a dataset (or a dummy dataset) with the described properties.
It's a variance-based global sensitivity analysis (ANOVA). It is quite similar to permutation-importance ranking but can reveal cross-correlations of features by calculating the so-called "total effect index". If you are further interested in the topic I can recommend the following paper:
Some Python code for straightforward calculation of sobol indices is provided here:
https://salib.readthedocs.io/en/latest/api.html#sobol-sensitivity-analysis
Coming back to my first question: Do you know about a dataset with those properties or do you have any idea how I can build up a dummy dataset with known feature importance for each output?
Thanks.
Yes, you can contrive a dataset with relevant/irrelevant inputs via the make_classification() function. I use it all the time.
Beyond that, you will have to contrive your own problem I would expect. Feature importance is not objective!
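For example, a minimal sketch of contriving such a dataset with scikit-learn; with shuffle=False the informative columns come first, so the "true" importance structure is known by construction:

```python
from sklearn.datasets import make_classification

# 5 informative inputs followed by 5 irrelevant ones; with shuffle=False
# the informative columns stay first, so the importance structure is known.
X, y = make_classification(n_samples=1000, n_features=10,
                           n_informative=5, n_redundant=0,
                           n_repeated=0, shuffle=False, random_state=1)
print(X.shape, y.shape)
```

A sensitivity-analysis method should then assign most of the importance to the first five columns.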
Day 4 lesson
code:
import pandas as pd
url = "https://goo.gl/bDdBiA"
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
data = pd.read_csv(url, names=names)
description = data.describe()
print(description)
output:
             preg        plas        pres        skin        test        mass        pedi         age       class
count  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000  768.000000
mean     3.845052  120.894531   69.105469   20.536458   79.799479   31.992578    0.471876   33.240885    0.348958
std      3.369578   31.972618   19.355807   15.952218  115.244002    7.884160    0.331329   11.760232    0.476951
min      0.000000    0.000000    0.000000    0.000000    0.000000    0.000000    0.078000   21.000000    0.000000
25%      1.000000   99.000000   62.000000    0.000000    0.000000   27.300000    0.243750   24.000000    0.000000
50%      3.000000  117.000000   72.000000   23.000000   30.500000   32.000000    0.372500   29.000000    0.000000
75%      6.000000  140.250000   80.000000   32.000000  127.250000   36.600000    0.626250   41.000000    1.000000
max     17.000000  199.000000  122.000000   99.000000  846.000000   67.100000    2.420000   81.000000    1.000000
The output did not fit properly in the comment section.
Hi there,
thanks a lot for the post!
I'm quite a beginner and there's something I'm not sure about. I've fit the data with a straight line (the first dataset), but how do we measure accuracy?
Should we leave out some data points and use them for testing?
Good question, this may help:
https://machinelearningmastery.com/regression-metrics-for-machine-learning/
Thanks, following the post, but with my own code:
import numpy as np

# Y is the target row vector, shape (1, n)
mean = np.mean(Y)
l = Y.shape[1]
res = Y - mean
rmse = np.sqrt(np.dot(res, res.T) / l)
I calculated the rmse, and it yields
>>> rmse
array([[86.63170447]])
Also the values of wine quality have a max of 8 not 10, at least that’s what I get.
Tx for the help
Would you please tell me where the 1.48 comes from with the wine dataset? I've calculated the mean squared error but it yields 0.034, using np.sqrt(np.mean(Y)/len(Y))
It is probably the RMSE of a model that predicts the mean value from the training dataset.
The calculation for RMSE is here:
https://machinelearningmastery.com/regression-metrics-for-machine-learning/
Hi Jason,
The Wine quality dataset poses a multi-class classification problem. How could we have RMSE as a metric? Is that a mistake or am I missing something?
Hi Amit... yes, the dataset can be used for classification. However, when it is framed as a regression problem, RMSE can be used to measure how accurate the continuous predictions of the quality score are.
With a linear model coded from scratch got
naive estimation rmse: [[86.63170447]]
model rmse: [[35.44597361]]
I’ve also done a simple visual of the model’s evolution here:
Thank you so much, Jason. Was really looking for these datasets for practice today.
Thanks a lot! I’ll use some of these for practice
Btw, it is written in the Wheat Seeds Dataset section that it is a binary classification problem; however, 3 classes are given. Is this a mistake or something? Thank you
Just found that the evaluation dataset of wine isn't the same as the one here:
https://machinelearningmastery.com/results-for-standard-classification-and-regression-machine-learning-datasets/
Welcome!
I'm Jason Brownlee PhD and I help developers get results with machine learning.