Review
BioData Min. 2017 Dec 8;10:35. doi: 10.1186/s13040-017-0155-3. eCollection 2017.

Ten quick tips for machine learning in computational biology

Davide Chicco

Abstract

Machine learning has become a pivotal tool for many projects in computational biology, bioinformatics, and health informatics. Nevertheless, beginners and biomedical researchers often do not have enough experience to run a data mining project effectively, and therefore can follow incorrect practices that may lead to common mistakes or over-optimistic results. With this review, we present ten quick tips to take advantage of machine learning in any computational biology context, by avoiding some common errors that we have observed hundreds of times in multiple bioinformatics projects. We believe our ten suggestions can strongly help any machine learning practitioner to carry out a successful project in computational biology and related sciences.

Keywords: Bioinformatics; Biomedical informatics; Computational biology; Computational intelligence; Data mining; Health informatics; Machine learning; Tips.

Conflict of interest statement

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The author declares that he has no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Figures

Fig. 1
a Example of a dataset feature which needs data pre-processing and cleaning before being employed in a machine learning program. All the feature data have values in the [0; 0.5] range, except an outlier having value 80 (Tip 1). b Representation of a typical dataset table having N features as columns and M data instances as rows. An effective ratio for the split of an input dataset table: 50% of the data instances for the training set; 30% of the data instances for the validation set; and the last 20% of the data instances for the test set (Tip 2). c Example of a typical imbalanced biological dataset, which can contain 90% negative data instances and only 10% positive instances. This aspect can be tackled with under-sampling and other techniques (Tip 5)
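The 50/30/20 split of panel b and the under-sampling of panel c can be illustrated with a short Python sketch. The choice of NumPy and scikit-learn, and the synthetic data, are assumptions made here for illustration; the paper does not prescribe a particular library.

```python
# Sketch of the 50/30/20 train/validation/test split (Fig. 1b, Tip 2) and of
# random under-sampling of the majority class (Fig. 1c, Tip 5).
# Library choice and synthetic data are illustrative assumptions.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))            # M = 1000 instances, N = 20 features
y = (rng.random(1000) < 0.1).astype(int)   # ~10% positives, ~90% negatives

# 50% for training; the remaining half is split 60/40, giving 30% validation
# and 20% test of the full dataset.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, train_size=0.6, stratify=y_rest, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 500 300 200

# Random under-sampling: keep every positive training instance and an
# equally sized random subset of the negatives.
pos = np.where(y_train == 1)[0]
neg = rng.choice(np.where(y_train == 0)[0], size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
X_bal, y_bal = X_train[idx], y_train[idx]
```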
Fig. 2
Example of how an algorithm’s behavior and results change when the hyper-parameter changes, for the k-nearest neighbors method [20] (image adapted from [72]). a In this example, there are six blue square points and five red triangle points in the Euclidean space. A new point (the green circle) enters the space, and k-NN has to decide to which category to assign it (red triangle or blue square). b If we set the hyper-parameter k = 3, the algorithm considers only the three points nearest to the new green circle, and assigns the green circle to the red triangle category (two red triangles versus one blue square). c Likewise, if we set the hyper-parameter k = 4, the algorithm considers only the four points nearest to the new green circle, and again assigns the green circle to the red triangle category (the two red triangles are nearer to the green circle than the two blue squares). d However, if we set the hyper-parameter k = 5, the algorithm considers only the five points nearest to the new green circle, and assigns the green circle to the blue square category (three blue squares versus two red triangles)
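The dependence on k described above can be reproduced with scikit-learn's KNeighborsClassifier; the point coordinates below are invented for illustration and only loosely mirror the six squares and five triangles of the figure.

```python
# Sketch of Fig. 2: the category assigned to a new point by k-nearest neighbors
# changes with the hyper-parameter k. Coordinates are invented, not taken from the figure.
from sklearn.neighbors import KNeighborsClassifier

X = [(0, 1.5), (3, 0), (0, 3.5), (5, 5), (-5, 5), (5, -5),   # six "blue square" points
     (1, 0), (0, 2), (-6, 0), (0, -6), (6, 6)]               # five "red triangle" points
y = ["square"] * 6 + ["triangle"] * 5
new_point = [(0, 0)]                                         # the "green circle"

for k in (3, 5):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    # with these made-up points: k = 3 -> "triangle", k = 5 -> "square"
    print(k, knn.predict(new_point)[0])
```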
Fig. 3
a Example of a Precision-Recall curve, with the precision score on the y axis and the recall score on the x axis (Tip 8). The grey area is the area under the PR curve (AUPRC). b Example of a receiver operating characteristic (ROC) curve, with the recall (true positive rate) score on the y axis and the fallout (false positive rate) score on the x axis (Tip 8). The grey area is the area under the ROC curve (AUROC)
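Both curves and their areas can be computed directly from predicted scores, for example with scikit-learn; the library choice and the synthetic labels and scores below are assumptions for illustration only.

```python
# Sketch of the two evaluation summaries in Fig. 3 (Tip 8): the Precision-Recall
# curve with its area (AUPRC) and the ROC curve with its area (AUROC).
import numpy as np
from sklearn.metrics import (average_precision_score, precision_recall_curve,
                             roc_auc_score, roc_curve)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)             # binary ground-truth labels
y_score = 0.3 * y_true + 0.7 * rng.random(200)    # imperfect predicted scores

precision, recall, _ = precision_recall_curve(y_true, y_score)   # Fig. 3a axes
fpr, tpr, _ = roc_curve(y_true, y_score)                         # Fig. 3b axes
print(f"AUPRC = {average_precision_score(y_true, y_score):.3f}")
print(f"AUROC = {roc_auc_score(y_true, y_score):.3f}")
```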


References

    1. Yip KY, Cheng C, Gerstein M. Machine learning and genome annotation: a match meant to be? Genome Biol. 2013;14(5):205. doi: 10.1186/gb-2013-14-5-205.
    2. Baldi P, Brunak S. Bioinformatics: the machine learning approach. Cambridge: MIT Press; 2001.
    3. Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armañanzas R, Santafé G, Pérez A, et al. Machine learning in bioinformatics. Brief Bioinform. 2006;7(1):86–112. doi: 10.1093/bib/bbk007.
    4. Tarca AL, Carey VJ, Chen X-W, Romero R, Drăghici S. Machine learning and its applications to biology. PLoS Comput Biol. 2007;3(6):e116. doi: 10.1371/journal.pcbi.0030116.
    5. Schölkopf B, Tsuda K, Vert J-P. Kernel methods in computational biology. Cambridge: MIT Press; 2004.
