Making developers awesome at machine learning
32 Tips, Tricks and Hacks That You Can Use To Make Better Predictions.
The most valuable part of machine learning is predictive modeling.
This is the development of models that are trained on historical data and make predictions on new data.
And the number one question when it comes to predictive modeling is:
How can I get better results?
This cheat sheet contains my best advice distilled from years of my own application and studying top machine learning practitioners and competition winners.
With this guide, you will not only get unstuck and lift performance, you might even achieve world-class results on your prediction problems.
Let’s dive in.
Note, the structure of this guide is based on an earlier guide on improving performance for deep learning that you might find useful, titled: How To Improve Deep Learning Performance.

Machine Learning Performance Improvement Cheat Sheet
Photo by NASA, some rights reserved.
This cheat sheet is designed to give you ideas to lift performance on your machine learning problem.
All it takes is one good idea to get a breakthrough.
Find that one idea, then come back and find another.
I have divided the list into 4 sub-topics:
The gains often get smaller the further you go down the list.
For example, a new framing of your problem or more data is often going to give you more payoff than tuning the parameters of your best performing algorithm. Not always, but in general.
You can get big wins with changes to your training data and problem definition. Perhaps even the biggest wins.
Strategy: Create new and different perspectives on your data in order to best expose the structure of the underlying problem to the learning algorithms.
Outcome: You should now have a suite of new views and versions of your dataset.
Next: You can evaluate the value of each with predictive modeling algorithms.
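As a sketch of this step, the snippet below builds several "views" of a synthetic dataset with different transforms and scores each view with the same model. The dataset and the particular transforms (standardization, normalization, a power transform) are illustrative choices, not the only options:

```python
# Sketch: build several views of one dataset and compare them with one model.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import MinMaxScaler, PowerTransformer, StandardScaler

# Synthetic stand-in for your dataset.
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

views = {
    "raw": X,
    "standardized": StandardScaler().fit_transform(X),
    "normalized": MinMaxScaler().fit_transform(X),
    "power": PowerTransformer().fit_transform(X),
}

# Score each view with the same algorithm to expose which view helps most.
for name, Xv in views.items():
    scores = cross_val_score(LogisticRegression(max_iter=1000), Xv, y, cv=5)
    print(f"{name}: {np.mean(scores):.3f}")
```

In practice you would also add views built from feature engineering and feature selection, and carry the best few forward to the spot-checking step.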
Machine learning is all about algorithms.
Strategy: Identify the algorithms and data representations that perform above a baseline of performance and better than average. Remain skeptical of results and design experiments that make it hard to fool yourself.
Outcome: You should now have a short list of well-performing algorithms and data representations.
Next: The next step is to improve performance with algorithm tuning.
Algorithm tuning might be where you spend most of your time. It can be very time-consuming. You can often unearth one or two well-performing algorithms quickly from spot-checking. Getting the most from those algorithms can take days, weeks or months.
Strategy: Get the most out of well-performing machine learning algorithms.
Outcome: You should now have a short list of highly tuned algorithms on your machine learning problem, maybe even just one.
Next: One or more models could be finalized at this point and used to make predictions or put into production. Further lifts in performance can be gained by combining the predictions from multiple models.
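A common way to do this tuning is an exhaustive grid search over a small hyperparameter grid, sketched below. The grid values are illustrative; a real search would be larger and would use your own dataset:

```python
# Sketch: tune a random forest with an exhaustive grid search.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=200, random_state=1)

# A small, illustrative grid; expand it for a real search.
grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
search = GridSearchCV(RandomForestClassifier(random_state=1), grid, cv=3)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

`best_estimator_` on the fitted search object gives you the tuned model ready for finalization or ensembling.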
You can combine the predictions from multiple models. After algorithm tuning, this is the next big area for improvement. In fact, you can often get good performance from combining the predictions from multiple “good enough” models rather than from multiple highly tuned (and fragile) models.
Strategy: Combine the predictions of multiple well-performing models.
Outcome: You should have one or more ensembles of well-performing models that outperform any single model.
Next: One or more ensembles could be finalized at this point and used to make predictions or put into production.
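As a minimal sketch of combining predictions, the snippet below soft-votes across three diverse "good enough" models. The model choices and synthetic data are illustrative:

```python
# Sketch: combine three diverse models with soft voting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=1)

# Soft voting averages predicted probabilities across the member models.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier(random_state=1)),
                ("nb", GaussianNB())],
    voting="soft",
)
score = float(np.mean(cross_val_score(ensemble, X, y, cv=5)))
print(round(score, 3))
```

Diversity matters more than individual tuning here: members with uncorrelated errors give the voting the most to average away.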
This cheat sheet is jam-packed with ideas to try to improve performance on your problem.
You do not need to do everything. You just need one good idea to get a lift in performance.
Here’s how to handle the overwhelm:
Did you find this post useful?
Did you get that one idea or method that made a difference?
Let me know, leave a comment! I would love to hear about it.






What tactics could be used for Multi-Class classification problems?
Most classification machine learning algorithms support multi-class classification.
Perhaps start with a decision tree?
Very useful. Many thanks Jason.
As a person who is always in search of learning something new I am only just starting on Python/Machine Learning/Analytics.
I am currently looking for a basic discussion as I am trying to get a clearer basic understanding of what is involved. I have done a bit of reading, but could do with a demystifying conversation.
If it is OK with you, kindly confirm: how does machine learning fit into statistics, and how do Python algorithms and Python-related libraries come into play?
Great question Benson,
Machine learning could be thought of as applied statistics with computers and much larger datasets. The useful part of machine learning is predictive modeling, as distinct from the descriptive modeling often performed in statistics.
We do not want to implement methods every time a new model is needed or new project is started. Therefore we use robust standard implementations. The Python library scikit-learn provides industry standard implementations that we can use for R&D and operational models.
I hope that helps?
Excellent post as usual. I really like how concise, clear and precise your articles are; they really help in getting the big picture that I missed as a Master's student. Keep up the good work!
It's amazing info!!! Your blog is the best machine learning resource I have ever seen, congratulations. I will buy your book next month; it must be really useful. Thanks!
Hi Jason,
I’m playing around with soil spectroscopy and trying to correlate the spectra with reference method data. When I had few samples (less than 100) I got an R² over 0.70; now with over 4000 samples it’s down to 0.60… weird, huh? To me this means that the property I’m trying to measure is not really correlated with the signal, so increasing the number of samples would not improve the results. What do you think?
Sincerely
Grandonia
It is possible that there are non-linear interactions of that variable with other input variables that influence the output variable. These are hard, or even impossible, to measure.
I recommend developing models with different framings of the problem to see what works best.
Thanks for your insights Jason. Could you explain what you mean with framings? I’m pretty sure that there is no direct relationship with the elements I wanna measure (like Phosphorous) and the spectra, only indirect (P bound to some organic molecules that have a signal in the infrared region).
So do you really think that in this cases more data does not mean better results, since the interactions are so non-linear?
By framings, I mean the choice of the inputs and outputs to the mapping function that you are trying to approximate. This is not just feature selection, but the structure/nature of what is being predicted and from what.
There are many ways to frame a given predictive modeling problem, and improved results can come from exploring variations, and even combining predictions from models developed from different framings.
Generally, neural nets do require more data than linear methods. Yes:
https://machinelearningmastery.com/much-training-data-required-machine-learning/
Hi Jason, Is there more scope for improving performance with ensembling if I’m already using an ensemble model like gradient boosting regressor which has been tuned?
Yes, but you will need a model that is quite different with regard to features or modeling approach (e.g. uncorrelated errors).
any suggestions as to appropriate algorithm to use to improve lottery predictions?
Can you suggest some good links for generating new features or doing feature engineering using statistics?
Thanks in Advance!!
Hi Jason great pathway for the thought process.
Does an ensemble model tend to overfit quickly? I mean combining predictions on, say, a binary classification task having output 0 or 1.
I have tried ensembling using gradient boosting and achieved accuracy of 0.99 to 1. Do model predictions need to be uncorrelated in order to combine predictions?
How should we choose which models to combine?
Not generally. It can be easy to overfit due to using information from the test set in creating the ensemble though. Be careful.
Hi Jason,
Do you have any suggestion in imbalance sequence labeling with LSTM?
Sequence of length N to sequence of length N.
Binary classification.
One class is highly imbalanced and the model is overfitting badly!
A good start would be to carefully choose a performance measure, such as AUC or ROC Curves.
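To illustrate why the metric choice matters on an imbalanced problem, the sketch below scores one model on a synthetic 95:5 dataset with both accuracy and ROC AUC; accuracy can look high simply by favoring the majority class, while ROC AUC reflects how well the classes are separated:

```python
# Sketch: accuracy vs ROC AUC on an imbalanced binary problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Roughly 95:5 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=1)

model = LogisticRegression(max_iter=1000)
acc = float(np.mean(cross_val_score(model, X, y, cv=5, scoring="accuracy")))
auc = float(np.mean(cross_val_score(model, X, y, cv=5, scoring="roc_auc")))
print(f"accuracy={acc:.3f} roc_auc={auc:.3f}")
```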
Hi Jason,
Awesome pointers. I will try to work with some of them .I hope they work for me , I am struggling with exact same performance levels, although I have changed data views and altered some algorithm parameters as well.
Hi,
Thank for your information. I have question and hope you can help me.
In total, I have 55 data points, which I divided 70:30 into training and testing data.
However, the accuracy on training is high, while on testing it is very poor.
I have 13 factors as input and 1 target for the model (I use machine learning).
What is my problem here? How can I fix it?
Thank you very much
Perhaps try using k-fold cross validation or leave one out cross validation to get a more robust estimate of model performance.
Generally, I recommend following this process to get the best performance:
https://machinelearningmastery.com/start-here/#process
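As a sketch of the leave-one-out suggestion on a dataset of this size (55 rows, 13 inputs, synthetic here), each row takes a turn as the test set, giving a far more stable estimate than a single 70:30 split:

```python
# Sketch: leave-one-out cross-validation on a tiny dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic stand-in: 55 rows, 13 input features, 1 binary target.
X, y = make_classification(n_samples=55, n_features=13, random_state=1)

scores = cross_val_score(LogisticRegression(max_iter=1000), X, y,
                         cv=LeaveOneOut())
print(f"LOOCV accuracy: {np.mean(scores):.3f} over {len(scores)} folds")
```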
Using grid search for hyperparameter optimization takes a long time: I ran 2048 combinations and it took me nearly 120 hours to get the best fit for a Keras RNN model with an LSTM layer.
Can you please suggest how I can speed up the processing of grid search, or is there any other approach?
I have some ideas:
Perhaps try running on a faster machine (e.g. AWS EC2)?
Perhaps try fewer combinations of hyperparameters?
Perhaps try a smaller model?
Perhaps try a smaller dataset?
Perhaps evaluate different grids of hyperparameters on different machines?
I hope that helps as a start.
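One further option in the same spirit is randomized search, which samples a fixed number of configurations instead of exhausting the grid. The sketch below uses a random forest for brevity; `n_iter` and the value lists are illustrative:

```python
# Sketch: randomized search samples n_iter configurations from the space,
# instead of trying every combination as grid search does.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=200, random_state=1)

# 16 possible combinations, but only 4 are evaluated.
space = {"n_estimators": [25, 50, 100, 200], "max_depth": [None, 3, 5, 10]}
search = RandomizedSearchCV(RandomForestClassifier(random_state=1), space,
                            n_iter=4, cv=3, random_state=1, n_jobs=-1)
search.fit(X, y)
print(search.best_params_)
```

Setting `n_jobs=-1` also spreads the candidate evaluations across all local CPU cores.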
Sir, I wish to develop a hybrid classification and regression model for Ebola prediction,
and I have failed to understand the philosophy of how this can be formed. For instance, if 4 algorithms are selected to be applied in this hybrid, how can this model work for categorical and sometimes numerical outcomes? Sir, is it possible, and how can it be formed? Your advice is highly welcome.
That is tricky to achieve.
You could have a model that outputs a real value, that is interpreted by another model for a classification outcome?
You could also have a multi-output model, like a neural net, that outputs a real value and classification prediction.
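As a sketch of the first idea, the snippet below chains a regressor into a classifier: the regressor produces a real value, and a second model interprets that value as a class. The models and synthetic data are illustrative stand-ins:

```python
# Sketch: a regressor's real-valued output is interpreted by a classifier.
from sklearn.datasets import make_classification
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Stage 1: regress a real value against the class label.
reg = LinearRegression().fit(X_tr, y_tr)

# Stage 2: a classifier interprets the real-valued output as a class.
clf = LogisticRegression().fit(reg.predict(X_tr).reshape(-1, 1), y_tr)

pred = clf.predict(reg.predict(X_te).reshape(-1, 1))
acc = accuracy_score(y_te, pred)
print(round(acc, 3))
```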
Let me know how you go.
I have a problem which I think can be related to what you say above as,
“1. Improve Performance With Data: Resample Data: resample data to change the size or distribution”.
My problem is a binary classification (classes 0 and 1). Each row of data is a 3D point having three columns of X,Y,Z point coordinates, and the rest of the columns are attribute values.
Although the classification only uses the attribute (feature) columns, I keep X,Y,Z for each row (point), because I need them to eventually visualize the results of classification (for example, colorizing points in class 0 as blue while points in class 1 are red).
The points are 3D coordinates of a building, so when I visualize the points I can randomly cut different parts of the building and label them as train/test data. I have randomly cut different parts of the building, so I have several train/test datasets. For each one I do the process (fit the model on train data, test on test data, and then predict the classes of unseen data, e.g., another building).
Therefore, I can have different train/test data with different spatial distributions of the two classes. (By spatial distributions of the two classes, I mean where the two classes are located in 3D space.) So as you say, I am doing resampling to find a sample which gives the best performance and using it as the training dataset.
In train/test data called A, the 3D locations of red and blue classes, are different from those in train/test data called B.
The f1-score of A and B on their test set are different but good (high around 90% for either of classes). However, when I predict unseen data with model fitted to A, the f1-score is awful while when I predict unseen data with model fitted to B, the f1-score is good (and visualizing the building gives meaningful predicted classes).
Can I call this change of f1-score for A and B on unseen data as model variance?
By trial and error, I concluded that when classes 0 and 1 are surrounded by each other (the spatial distribution of B) I get a good f1-score on unseen data, while when classes 0 and 1 are away from each other I get an awful f1-score on unseen data. Is there any scientific reason for this?
For all cases, I have almost equal number of 0 and 1 classes.
Many Thanks
Model variance would be a measure of the same model fit on different training datasets and evaluated on different test datasets. Variance is the spread of model skill, defined by a chosen metric, in this scenario.
This might be a case of requiring sufficiently representative training/test sets when fitting/evaluating a model.
Hi Jason,
I want to develop a model that will do hierarchical multi-class classification. In my case each child node can have more than 10 classes, down to level 4. I tried creating a model at the leaf level but had no luck (since there are more than 1000 classes). I tried classification at each level (fewer classes than the flat structure), still no luck. Is there any other way I can optimize the model, maybe creating a model per node (too many models)?
What best can you suggest?
Many Thanks
Sorry, I don’t know much about hierarchical multi-class classification. Perhaps hit the literature to see what the state of the art is?
Sir, when I am running this code I am getting this error. How can I clear the error? I tried the same code in Python 3.6 and got the same error. Please reply to me.
RESTART: C:\Users\poornima\AppData\Local\Programs\Python\Python37-32\ex2.py
Scores: [0.0, 0.0, 0.0, 0.0, 0.0]
Mean Accuracy: 0.000%
>>>
Looks like your model has no skill. Perhaps try debugging your code?
Hi, I am implementing a model for the financial forecast of a European index based on the data of a systemic risk indicator, and I have followed your instructions to prepare the data. I have used an LSTM, but I have a low loss value and a very low accuracy (0.0014). What could be the problem?
You cannot measure accuracy for a regression task:
https://machinelearningmastery.com/faq/single-faq/how-do-i-calculate-accuracy-for-regression
I implemented J48. How can I measure the effect of lowering the number of examples in the training set?
Perhaps prepare a separate dataset with fewer rows, then evaluate the model on the new dataset in Weka.
Do we evaluate the model by considering “precision and recall” or by corrected instance percentage?
You must choose a metric for your project that best captures the goals of your project.
This can help:
https://machinelearningmastery.com/tour-of-evaluation-metrics-for-imbalanced-classification/
Very good article, thank you for sharing.
I have a question and I hope you can help me: Is there any specific method or algorithm for “Invent More Data”? I want to focus on ordinary algorithms, rather than data expansion methods similar to methods such as inverting and mirroring pictures in deep learning. I searched related content, but did not find the ideal result.
Looking forward to your response, thank you!
Thanks!
Yes, it is called oversampling and there are many approaches for this. Perhaps start here:
https://machinelearningmastery.com/smote-oversampling-for-imbalanced-classification/
Supplementary explanation following the above content:
What I have learned about data expansion is as follows:
* Regarding deep learning, such as image processing, data can be increased and expanded through image rotation;
* Regarding other machine learning: 1. Classification problems (unbalanced samples): data can be expanded by algorithms similar to SMOTE; 2. Regression problems? I hope you can help me.
I want to know if there are some methods or algorithms that are sufficient for data expansion, not only to deal with the imbalance of samples, but also to be applicable to more general situations and to increase training data.
Looking forward to your reply, thank you.
Thanks.
The methods for imbalanced classification can be used for any tabular data (e.g. random oversampling).
Also, simply adding gaussian noise to examples can be used to expand the dataset.
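Both ideas can be sketched together in a few lines of NumPy: resample rows with replacement, then jitter the copies with small Gaussian noise. The array shapes, noise scale, and sample counts are illustrative:

```python
# Sketch: expand a tabular dataset by random oversampling plus Gaussian noise.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in: 100 rows, 5 features, binary labels.
X = rng.normal(size=(100, 5))
y = rng.integers(0, 2, size=100)

# Randomly resample 50 rows with replacement, then jitter with small noise.
idx = rng.integers(0, len(X), size=50)
X_new = X[idx] + rng.normal(scale=0.05, size=(50, 5))
y_new = y[idx]

X_aug = np.vstack([X, X_new])
y_aug = np.concatenate([y, y_new])
print(X_aug.shape, y_aug.shape)  # (150, 5) (150,)
```

The same recipe works for regression targets: copy the target along with the row, and optionally jitter it too.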
Dear Jason Brownlee, kindly suggest how I can make an expert system for diabetes using the advanced features of WEKA. Is it possible?
Welcome!