Predicting Future Influenza Virus Sequences with Machine Learning
RK900/Flu-Prediction
Predicting Future Flu Virus Strains with Machine Learning. These programs predict future influenza virus strains based on previous trends in flu mutations.
Check out my talks at PyData and PyGotham.
Flu-Prediction is available under the GPLv3 License.
Requires Python 2 or 3 with the NumPy, Biopython, and scikit-learn libraries installed.
Clone or download the repository, then install the dependencies by running `pip install -r requirements.txt`.
Input any HA (hemagglutinin) or NA (neuraminidase) flu protein sequence and its corresponding child sequence into the program, and it will output a predicted offspring of that specific flu strain.
Use the Biopython library to import a sequence in FASTA format. You can use any flu FASTA file of your choosing, or the ones in the Flu-Data folder. The Flu-Data folder contains a wide variety of flu FASTA files, from single flu strains up to 1000 flu strains, grouped by flu subtype and protein. Data was obtained from the Influenza Research Database (IRD).
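```python
from Bio import SeqIO

# Put your FASTA file here. This assumes the first record is the parent
# strain and the second record is its child; adjust the indices to your file.
records = list(SeqIO.parse('myfasta.fasta', 'fasta'))
parent_fasta = records[0]
parent_seq = parent_fasta.seq
child_fasta = records[1]
child_seq = child_fasta.seq
```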
Then encode it with the Encoding_v2 module:
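```python
from Encoding_v2 import encoding

# Encode the parent sequence one position at a time
parent = []
for k in range(len(parent_seq)):
    encoded_parent = encoding(parent_seq[k])
    parent.append(encoded_parent)

# Encode the child sequence the same way
child = []
for l in range(len(child_seq)):
    encoded_child = encoding(child_seq[l])
    child.append(encoded_child)
```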
This turns each sequence into a list of float64 values. Then give X (the encoded parent) and y (the encoded child) to the machine learning algorithm.
Substitute any scikit-learn model of your choosing (e.g., RandomForestRegressor, DecisionTreeRegressor) for `algorithm` in the code.
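```python
# Template only: `sklearn.algorithms` and `algorithm` are placeholders,
# not real scikit-learn names. Replace them with an actual model import.
from sklearn.algorithms import algorithm

alg = algorithm()
alg.fit(X, y)
alg.predict(new_X)
```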
The algorithm I use in this project is a Random Forests Regressor model:
```python
from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor()  # Specify any parameters in the parentheses
rfr.fit(X, y)
rfr.predict(new_X)
```
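As a rough illustration of the array shapes scikit-learn expects here, the sketch below fits a random forest on made-up encoded sequences. The toy values, the sequence length of six, and the `n_estimators` and `random_state` settings are assumptions chosen for the example, not data or parameters from this project. Each row of `X` is one encoded parent and the matching row of `y` is its child.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy encoded sequences (A=1, T=2, G=3, C=4); hypothetical data for illustration only.
X = np.array([
    [1, 2, 3, 4, 1, 2],
    [1, 2, 3, 4, 1, 3],
    [1, 2, 3, 3, 1, 3],
    [1, 2, 4, 3, 1, 3],
], dtype=np.float64)  # parents: one row per sequence
y = np.array([
    [1, 2, 3, 4, 1, 3],
    [1, 2, 3, 3, 1, 3],
    [1, 2, 4, 3, 1, 3],
    [1, 2, 4, 3, 2, 3],
], dtype=np.float64)  # children: row i is the child of X[i]

rfr = RandomForestRegressor(n_estimators=10, random_state=0)
rfr.fit(X, y)  # a 2-D y gives one regression output per sequence position

new_X = y[-1:]  # the latest strain, kept 2-D with shape (1, 6)
prediction = rfr.predict(new_X)
print(prediction.shape)  # one predicted row per input row
```

A single parent/child pair gives the model only one training row, so in practice you would likely stack many pairs this way before fitting.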
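To cross-validate the model:

```python
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection
from sklearn.model_selection import cross_val_score

algorithm_scores = cross_val_score(algorithm, X, y, cv=2)
print('Algorithm scores: %s' % algorithm_scores)
# For regressors, cross_val_score reports R^2 by default, not classification accuracy
print("Average score: %0.2f (+/- %0.2f)" % (algorithm_scores.mean()*100, algorithm_scores.std()*100))
```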
To compute the R2 score on held-out data:

```python
from sklearn import metrics

y_pred = algorithm.predict(X_test)
print('Algorithm R2 score: %s' % metrics.r2_score(y_test, y_pred, multioutput='variance_weighted'))
```
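To compute the mean squared error:

```python
from sklearn import metrics

y_pred = algorithm.predict(X_test)
# mean_squared_error does not accept 'variance_weighted'; use 'uniform_average' or 'raw_values'
print('Algorithm mean squared error: %s' % metrics.mean_squared_error(y_test, y_pred, multioutput='uniform_average'))
```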
To print the prediction:

```python
y_pred = algorithm.predict(X)
print(y_pred)
```
The prediction output is a list of floats. Each number in the list corresponds to a nucleotide: A to 1, T to 2, G to 3, and C to 4.
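To read a prediction as a sequence, the floats need to be mapped back to letters. The sketch below is one possible way to do that, using the A=1, T=2, G=3, C=4 mapping described above. The helper name `decode_prediction`, the rounding to the nearest code, and the clamping of out-of-range values are assumptions made for illustration and are not part of this repository.

```python
# Mapping taken from the text above: 1=A, 2=T, 3=G, 4=C.
CODE_TO_BASE = {1: "A", 2: "T", 3: "G", 4: "C"}


def decode_prediction(values):
    """Hypothetical helper: snap each float to the nearest code and map it to a letter."""
    bases = []
    for value in values:
        code = int(value + 0.5)      # round half up; predictions are positive
        code = min(4, max(1, code))  # clamp out-of-range predictions into 1..4
        bases.append(CODE_TO_BASE[code])
    return "".join(bases)


y_pred = [1.0, 2.3, 2.8, 3.9, 1.4]  # made-up regressor output
decoded = decode_prediction(y_pred)
print(decoded)  # ATGCA
```

Rounding treats the codes as if they were ordered quantities, so a value halfway between two codes is ambiguous; treat such positions with caution.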