RK900/Flu-Prediction

Predicting Future Influenza Virus Sequences with Machine Learning

Flu-Prediction

Predicting Future Flu Virus Strains with Machine Learning. These programs predict future influenza virus strains based on previous trends in flu mutations.

Talks

Check out my talks at PyData and PyGotham.

License

Flu-Prediction is available under the GPLv3 License.

Dependencies

Python 2 or 3 with the NumPy, Biopython, and scikit-learn libraries installed.

To use:

Clone or download the repository. Install the dependencies by running `pip install -r requirements.txt`.

Input any HA (hemagglutinin) or NA (neuraminidase) flu protein sequence and its corresponding child sequence into the program, and it will output a predicted offspring of that specific flu strain.

Reading in a FASTA file with Biopython

Use the Biopython library to import a sequence from a FASTA file. You can use any flu FASTA file of your choosing, or one of the files in the Flu-Data folder. That folder contains a wide variety of flu FASTA files, from single strains up to 1000 strains, grouped by flu subtype and protein. The data was obtained from the Influenza Research Database (IRD).

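```python
from Bio import SeqIO

# SeqIO.parse yields one SeqRecord per FASTA entry; list() makes them indexable.
records = list(SeqIO.parse('myfasta.fasta', 'fasta'))  # put your FASTA file here

# Assumption: the first record is the parent strain and the second is its child.
parent, child = records[0], records[1]
parent_fasta = parent.format('fasta')  # the record as FASTA-formatted text
parent_seq = parent.seq
child_fasta = child.format('fasta')
child_seq = child.seq
```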
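To see what Biopython returns before pointing it at a real file, the sketch below parses two made-up FASTA records from an in-memory string. The record names and sequences are invented for illustration and are not taken from the Flu-Data folder.

```python
from io import StringIO

from Bio import SeqIO

# Two made-up FASTA records (not real flu data), standing in for a parent/child pair.
fasta_text = """>parent_example
ATGGCA
>child_example
ATGGTA
"""

# SeqIO.parse accepts a file path or a text handle; StringIO keeps the sketch self-contained.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))

for record in records:
    # Each SeqRecord carries an identifier and a Seq object.
    print(record.id, str(record.seq), len(record.seq))
```

Replacing `StringIO(fasta_text)` with a file path should give the same kind of `SeqRecord` objects for a file on disk.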

Encoding

Then encode the sequences with the Encoding_v2 module:

```python
from Encoding_v2 import encoding

# Encode the parent sequence one position at a time.
parent = []
for k in range(len(parent_seq)):
    encoded_parent = encoding(parent_seq[k])
    parent.append(encoded_parent)

# Encode the child sequence the same way.
child = []
for l in range(len(child_seq)):
    encoded_child = encoding(child_seq[l])
    child.append(encoded_child)
```

This turns each sequence into a list of float64 values. Then pass X (the encoded parent) and y (the encoded child) to the machine learning algorithm. Any scikit-learn regressor (e.g., RandomForestRegressor, DecisionTreeRegressor) can go in the 'algorithm' parts of the code.
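As a rough illustration of what such an encoding step does, here is a minimal stand-in written in plain Python. It is not the Encoding_v2 implementation: the function name `encode_sequence` is hypothetical, and the mapping (A=1, T=2, G=3, C=4) is simply the one this README gives for the prediction output.

```python
# Hypothetical stand-in for Encoding_v2.encoding; the real module may differ.
# Mapping taken from the README's description of the output: A=1, T=2, G=3, C=4.
BASE_TO_NUMBER = {"A": 1.0, "T": 2.0, "G": 3.0, "C": 4.0}


def encode_sequence(sequence):
    """Return a list of floats, one per base in the sequence."""
    return [BASE_TO_NUMBER[base] for base in sequence.upper()]


# Made-up example sequences, not real flu data.
parent_encoded = encode_sequence("ATGGCA")
child_encoded = encode_sequence("ATGGTA")
print(parent_encoded)  # [1.0, 2.0, 3.0, 3.0, 4.0, 1.0]
print(child_encoded)   # [1.0, 2.0, 3.0, 3.0, 2.0, 1.0]
```

A sequence containing any other character raises a `KeyError` here, which makes unexpected input easy to spot.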

Fitting the model

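Substitute `algorithm` with any scikit-learn model of your choosing.

```python
# 'algorithms' and 'algorithm' are placeholders: substitute a real scikit-learn
# module and estimator, e.g. sklearn.tree and DecisionTreeRegressor.
from sklearn.algorithms import algorithm

alg = algorithm()
alg.fit(X, y)
alg.predict(new_X)
```

The algorithm I use in this project is a random forest regressor:

```python
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()  # specify any parameters in the parentheses
rfr.fit(X, y)
rfr.predict(new_X)
```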
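For a sense of the array shapes involved, the sketch below fits a random forest regressor on a handful of made-up encoded sequences. The data, `n_estimators=10`, and `random_state=0` are illustrative assumptions, not the project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up encoded sequences (A=1, T=2, G=3, C=4); each row is one strain.
X = np.array([[1, 2, 3, 3, 4, 1],
              [1, 2, 3, 3, 2, 1],
              [1, 2, 3, 4, 2, 1],
              [1, 2, 3, 4, 2, 2]], dtype=float)  # parents
y = np.array([[1, 2, 3, 3, 2, 1],
              [1, 2, 3, 4, 2, 1],
              [1, 2, 3, 4, 2, 2],
              [1, 2, 4, 4, 2, 2]], dtype=float)  # corresponding children

# Because y is 2-D, this is multi-output regression: one target per position.
rfr = RandomForestRegressor(n_estimators=10, random_state=0)
rfr.fit(X, y)

# Predict the child of a new (made-up) parent; input must be 2-D.
new_X = np.array([[1, 2, 4, 4, 2, 2]], dtype=float)
prediction = rfr.predict(new_X)
print(prediction.shape)  # (1, 6)
print(prediction)
```

All sequences must have the same length, since scikit-learn expects a rectangular feature matrix.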

Computing accuracy using K-Fold cross-validation:

```python
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection.
from sklearn.model_selection import cross_val_score

algorithm_scores = cross_val_score(algorithm, X, y, cv=2)
print('Algorithm Trees', algorithm_scores)
print("Average Accuracy: %0.2f (+/- %0.2f)" % (algorithm_scores.mean() * 100, algorithm_scores.std() * 100))
```
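A runnable sketch of 2-fold cross-validation on made-up data follows. Note that for scikit-learn regressors, `cross_val_score` reports the estimator's default score, which is R² per fold rather than a classification accuracy. The random data and the rule used to create the "children" are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.RandomState(0)

# Made-up data: 20 "parent" sequences of length 6, encoded as values 1-4.
X = rng.randint(1, 5, size=(20, 6)).astype(float)
# Made-up "children": each parent with position 0 overwritten by position 1.
y = X.copy()
y[:, 0] = X[:, 1]

model = RandomForestRegressor(n_estimators=20, random_state=0)
folds = KFold(n_splits=2, shuffle=True, random_state=0)

# One score per fold; for a regressor this is R^2 on the held-out fold.
scores = cross_val_score(model, X, y, cv=folds)
print("Fold scores:", scores)
print("Mean score: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
```

With only two folds the spread between scores can be large, so a higher `n_splits` may give a steadier estimate when enough strains are available.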

Computing accuracy using R² (for linear models):

```python
from sklearn import metrics

y_pred = algorithm.predict(X_test)
print('Algorithm R2 score:', metrics.r2_score(y_test, y_pred, multioutput='variance_weighted'))
```

Computing accuracy using Mean Squared Error (MSE):

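```python
from sklearn import metrics

y_pred = algorithm.predict(X_test)
# mean_squared_error accepts 'raw_values' or 'uniform_average' (or explicit weights)
# for multioutput; 'variance_weighted' is only valid for r2_score.
print('Algorithm mean squared error:', metrics.mean_squared_error(y_test, y_pred, multioutput='uniform_average'))
```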
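The metric snippets use `X_test` and `y_test`, which the README does not define. One common way to obtain them is scikit-learn's `train_test_split`; the sketch below assumes that approach and uses made-up data (each "child" is simply its parent reversed) to compute both R² and MSE on a held-out set.

```python
import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Made-up data: 40 encoded "parents" (values 1-4) and "children" (parents reversed).
X = rng.randint(1, 5, size=(40, 6)).astype(float)
y = X[:, ::-1].copy()

# Hold out 25% of the pairs for evaluation (an assumed split, not the project's).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

algorithm = RandomForestRegressor(n_estimators=20, random_state=0)
algorithm.fit(X_train, y_train)
y_pred = algorithm.predict(X_test)

# R^2 averaged over positions, weighted by each position's variance.
r2 = metrics.r2_score(y_test, y_pred, multioutput='variance_weighted')
# MSE averaged uniformly over positions, and MSE for each position separately.
mse = metrics.mean_squared_error(y_test, y_pred, multioutput='uniform_average')
mse_per_position = metrics.mean_squared_error(y_test, y_pred, multioutput='raw_values')

print('R2 score:', r2)
print('Mean squared error:', mse)
print('MSE per position:', mse_per_position)
```

The per-position values from `multioutput='raw_values'` can show which sequence positions the model predicts worst.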

Predicting Flu Strains:

```python
y_pred = algorithm.predict(X)
print(y_pred)
```

The prediction output is a list of floats. Each number in the list corresponds to a base: A to 1, T to 2, G to 3, and C to 4.
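Since a regressor can return non-integer values, turning the output back into letters needs some rounding rule. The README does not specify one, so the sketch below assumes rounding to the nearest code and clamping to the 1-4 range; `decode_prediction` is a hypothetical helper, not part of the project.

```python
# Inverse of the README's mapping: 1=A, 2=T, 3=G, 4=C.
NUMBER_TO_BASE = {1: "A", 2: "T", 3: "G", 4: "C"}


def decode_prediction(values):
    """Round each float to the nearest code, clamp it to 1-4, and map it to a base."""
    bases = []
    for value in values:
        code = min(4, max(1, int(round(value))))
        bases.append(NUMBER_TO_BASE[code])
    return "".join(bases)


# Made-up prediction values for illustration.
print(decode_prediction([1.0, 2.1, 2.9, 3.2, 3.8, 1.4]))  # ATGGCA
```

For a 2-D prediction array, the helper can be applied to each row in turn.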

About

rk900.github.io/Flu-Prediction

Releases (6)

Version 2.1.2 (latest), Feb 11, 2018

5 earlier releases
