Preprocessing: For splice site prediction, Sonnenburg and Franc (2010) internally map data to a high-dimensional space and train a linear classifier. Agarwal et al. (2014) use the script to explicitly store all feature values. We apply the same script to the original data to generate training/test files:

> splice_explicit_features data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta_down data/H_sapiens_acc_all_examples_plain_139-279_50000000.fasta_up data/H_sapiens_acc_all_examples_plain_50000000.label
and

> splice_explicit_features data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta_down data/H_sapiens_acc_all_examples_plain_139-279_5e7_test.fasta_up data/H_sapiens_acc_all_examples_plain_5e7_test.label
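The point of explicit feature storage is that every (position, k-mer) pair in the fixed-length sequence window becomes its own binary feature, so a linear classifier trained on these features behaves like a string-kernel SVM. The following Python sketch illustrates one simplified variant of this expansion; it is not the splice_explicit_features script above, and the k-mer length K and the helper names are illustrative only.

    # A minimal sketch of explicit position-aware k-mer features (assumption:
    # a single fixed k; the real script follows Sonnenburg and Franc's feature map).
    from itertools import product

    K = 3                                    # illustrative k-mer length
    ALPHABET = "ACGT"
    KMER_INDEX = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=K))}

    def explicit_features(seq):
        """Sparse (index, value) pairs: one binary feature per (position, k-mer)."""
        feats = []
        for pos in range(len(seq) - K + 1):
            kmer = seq[pos:pos + K]
            if kmer in KMER_INDEX:           # skip windows containing e.g. 'N'
                feats.append((pos * len(KMER_INDEX) + KMER_INDEX[kmer] + 1, 1))
        return feats

    def to_libsvm_line(label, seq):
        """Format one example in the sparse text format LIBLINEAR reads."""
        return f"{label} " + " ".join(f"{i}:{v}" for i, v in explicit_features(seq))

    print(to_libsvm_line(+1, "ACGTACGTAC"))  # prints one LIBSVM-format line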
This set is highly skewed, so auPRC (area under the precision-recall curve) is the suitable criterion. Using the MATLAB Statistics Toolbox, you can obtain auPRC by

[Xpr,Ypr,Tpr,AUCpr] = perfcurve(labels, predictions, 1, 'xCrit', 'reca', 'yCrit', 'prec'); AUCpr
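If you prefer Python to MATLAB (an assumption; the page itself uses perfcurve), the same quantity is available in scikit-learn. Note that average_precision_score uses a step-wise summation, so it can differ slightly from the trapezoidal value MATLAB computes.

    # Hedged Python alternative for auPRC; labels/predictions are toy placeholders.
    import numpy as np
    from sklearn.metrics import average_precision_score, precision_recall_curve, auc

    labels = np.array([1, -1, 1, 1, -1])                # true labels
    predictions = np.array([0.9, -0.3, 0.4, 0.8, 0.1])  # decision values

    print(average_precision_score(labels, predictions, pos_label=1))

    # Closer to the MATLAB call: integrate the PR curve with the trapezoidal rule.
    prec, reca, _ = precision_recall_curve(labels, predictions, pos_label=1)
    print(auc(reca, prec))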
where labels are the true labels and predictions are your predicted decision values. You can use LIBLINEAR with option -s 3 (i.e., L2-regularized L1-loss SVM) to get an auPRC of 0.5773, similar to the 0.5775 reported in Table 2 of Sonnenburg and Franc (2010). If you don't have enough RAM to run LIBLINEAR, you can use the following code at LIBSVM tools and see our experimental log here. The code used is a disk-level linear classifier. [HFY11a]
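For intuition about the disk-level approach, here is a rough out-of-core sketch using scikit-learn's SGDClassifier rather than the actual block-minimization code from [HFY11a]; the file names, feature count, and hyperparameters are placeholders. Training reads one block from disk at a time, so memory use is bounded by the block size instead of the full data size.

    # Out-of-core linear SVM sketch (not the LIBSVM-tools implementation).
    from sklearn.datasets import load_svmlight_file
    from sklearn.linear_model import SGDClassifier

    N_FEATURES = 1_000_000                   # placeholder; use the data's true dimension
    BLOCKS = ["train.part1", "train.part2"]  # hypothetical pre-split chunks of the file

    clf = SGDClassifier(loss="hinge", alpha=1e-6)   # hinge loss ~ L1-loss linear SVM
    for epoch in range(3):                          # a few passes over all blocks
        for path in BLOCKS:
            X, y = load_svmlight_file(path, n_features=N_FEATURES)
            clf.partial_fit(X, y, classes=[-1, 1])  # update on this block only

    # clf.decision_function(X_test) then feeds the auPRC computation above.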