Review
BioData Min. 2017 Dec 8;10:35. doi: 10.1186/s13040-017-0155-3. eCollection 2017.

Ten quick tips for machine learning in computational biology

Davide Chicco

Abstract

Machine learning has become a pivotal tool for many projects in computational biology, bioinformatics, and health informatics. Nevertheless, beginners and biomedical researchers often do not have enough experience to run a data mining project effectively, and therefore can follow incorrect practices that may lead to common mistakes or over-optimistic results. With this review, we present ten quick tips to take advantage of machine learning in any computational biology context, by avoiding some common errors that we have observed hundreds of times in multiple bioinformatics projects. We believe our ten suggestions can strongly help any machine learning practitioner to carry out a successful project in computational biology and related sciences.

Keywords: Bioinformatics; Biomedical informatics; Computational biology; Computational intelligence; Data mining; Health informatics; Machine learning; Tips.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The author declares that he has no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
a Example of a dataset feature which needs data pre-processing and cleaning before being employed in a machine learning program. All the feature data have values in the [0; 0.5] range, except an outlier having value 80 (Tip 1). b Representation of a typical dataset table having N features as columns and M data instances as rows. An effective ratio for the split of an input dataset table: 50% of the data instances for the training set; 30% of the data instances for the validation set; and the last 20% of the data instances for the test set (Tip 2). c Example of a typical imbalanced biological dataset, which can contain 90% negative data instances and only 10% positive instances. This aspect can be tackled with under-sampling and other techniques (Tip 5)
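The 50/30/20 split of panel b and the under-sampling of panel c can be illustrated with a short Python sketch. The choice of NumPy and scikit-learn, and the synthetic data, are assumptions made here for illustration; the paper does not prescribe a particular library.

```python
# Sketch of the 50/30/20 train/validation/test split (Fig. 1b, Tip 2) and of
# random under-sampling of the majority class (Fig. 1c, Tip 5).
# Library choice and synthetic data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # M = 1000 instances, N = 20 features
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positives, ~90% negatives

# 50% for training; the remaining half is split 60/40, giving 30% validation
# and 20% test of the full dataset.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 500 300 200

# Random under-sampling: keep every positive training instance and an
# equally sized random subset of the negatives.
pos = np.where(y_train == 1)[0]
neg = rng.choice(np.where(y_train == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
X_bal, y_bal = X_train[idx], y_train[idx]
```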
Fig. 2
Example of how an algorithm’s behavior and results change when the hyper-parameter changes, for the k-nearest neighbors method [20] (image adapted from [72]). a In this example, there are six blue square points and five red triangle points in the Euclidean space. A new point (the green circle) enters the space, and k-NN has to decide to which category to assign it (red triangle or blue square). b If we set the hyper-parameter k = 3, the algorithm considers only the three points nearest to the new green circle, and assigns the green circle to the red triangle category (two red triangles versus one blue square). c Likewise, if we set the hyper-parameter k = 4, the algorithm considers only the four points nearest to the new green circle, and again assigns the green circle to the red triangle category (the two red triangles are nearer to the green circle than the two blue squares). d However, if we set the hyper-parameter k = 5, the algorithm considers only the five points nearest to the new green circle, and assigns the green circle to the blue square category (three blue squares versus two red triangles)
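The dependence on k described above can be reproduced with scikit-learn's KNeighborsClassifier; the point coordinates below are invented for illustration and only loosely mirror the six squares and five triangles of the figure.

```python
# Sketch of Fig. 2: the category assigned to a new point by k-nearest neighbors
# changes with the hyper-parameter k. Coordinates are invented, not taken from the figure.
from sklearn.neighbors import KNeighborsClassifier

X = [(0, 1.5), (3, 0), (0, 3.5), (5, 5), (-5, 5), (5, -5),   # six "blue square" points
     (1, 0), (0, 2), (-6, 0), (0, -6), (6, 6)]               # five "red triangle" points
y = ["square"] * 6 + ["triangle"] * 5
new_point = [(0, 0)]                                         # the "green circle"

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # with these made-up points: k = 3 -> "triangle", k = 5 -> "square"
    print(k, knn.predict(new_point)[0])
```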
Fig. 3
a Example of a Precision-Recall curve, with the precision score on the y axis and the recall score on the x axis (Tip 8). The grey area is the area under the PR curve (AUPRC). b Example of a receiver operating characteristic (ROC) curve, with the recall (true positive rate) score on the y axis and the fallout (false positive rate) score on the x axis (Tip 8). The grey area is the area under the ROC curve (AUROC)
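Both curves and their areas can be computed directly from predicted scores, for example with scikit-learn; the library choice and the synthetic labels and scores below are assumptions for illustration only.

```python
# Sketch of the two evaluation summaries in Fig. 3 (Tip 8): the Precision-Recall
# curve with its area (AUPRC) and the ROC curve with its area (AUROC).
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)             # binary ground-truth labels
y_score = 0.3 * y_true + 0.7 * rng.random(200)    # imperfect predicted scores

precision, recall, _ = precision_recall_curve(y_true, y_score)   # Fig. 3a axes
fpr, tpr, _ = roc_curve(y_true, y_score)                         # Fig. 3b axes
print(f"AUPRC = {average_precision_score(y_true, y_score):.3f}")
print(f"AUROC = {roc_auc_score(y_true, y_score):.3f}")
```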


References

    1. Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant to be? Genome Biol. 2013;14(5):205. doi: 10.1186/gb-2013-14-5-205.
    2. Baldi P, Brunak S. Bioinformatics: the machine learning approach. Cambridge: MIT Press; 2001.
    3. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, et al. Machine learning in bioinformatics. Brief Bioinform. 2006;7(1):86–112. doi: 10.1093/bib/bbk007.
    4. Tarca AL, Carey VJ, Chen X-W, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Comput Biol. 2007;3(6):e116. doi: 10.1371/journal.pcbi.0030116.
    5. Schölkopf B, Tsuda K, Vert J-P. Kernel methods in computational biology. Cambridge: MIT Press; 2004.
