Reproduction of the TrAdaBoost experiments
The purpose of this example is to reproduce the results obtained in the paper Boosting for Transfer Learning (2007). In this work, the authors developed a transfer algorithm called TrAdaBoost dedicated to supervised domain adaptation. You can find more details about this algorithm here. The goal of this algorithm is to combine a source dataset with many labeled instances with a target dataset with few labels in order to learn a good model on the target domain.
We try to reproduce the two following experiments:
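To make the idea concrete, here is a minimal sketch of one TrAdaBoost reweighting step for binary labels, written from our reading of the paper. The function name, the toy inputs and the clipping of the error to stay strictly between 0 and 1/2 are our own assumptions; this is not the implementation used by the `adapt` library.

```python
import numpy as np

def tradaboost_weight_update(w_src, w_tgt, err_src, err_tgt, n_iterations):
    """One reweighting step (hypothetical helper, binary labels).

    err_src / err_tgt hold |h(x) - y| in {0, 1} for each instance.
    """
    n_src = len(w_src)
    # Weighted error of the current learner, measured on target data only.
    eps = np.sum(w_tgt * err_tgt) / np.sum(w_tgt)
    # Assumption: clip so that 0 < eps < 1/2, as the update requires.
    eps = float(np.clip(eps, 1e-10, 0.499))
    beta_t = eps / (1.0 - eps)
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(n_src) / n_iterations))
    # Misclassified source instances are down-weighted ...
    new_w_src = w_src * beta ** err_src
    # ... while misclassified target instances are up-weighted.
    new_w_tgt = w_tgt * beta_t ** (-err_tgt)
    return new_w_src, new_w_tgt

# Toy example: 4 source and 4 target instances, one error in each domain.
w_src = np.ones(4)
w_tgt = np.ones(4)
err_src = np.array([1.0, 0.0, 0.0, 0.0])
err_tgt = np.array([1.0, 0.0, 0.0, 0.0])
new_w_src, new_w_tgt = tradaboost_weight_update(
    w_src, w_tgt, err_src, err_tgt, n_iterations=10
)
print(new_w_src, new_w_tgt)
```

In this toy case the misclassified source instance loses weight while the misclassified target instance gains weight, which is how the algorithm progressively ignores source instances that disagree with the target domain.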
Mushrooms
20newsgroups
Mushrooms
Dataset description: The Mushrooms data set includes descriptions of hypothetical samples corresponding to 23 species of gilled mushrooms in the Agaricus and Lepiota Family (pp. 500-525). Each species is identified as definitely edible, definitely poisonous, or of unknown edibility and not recommended. This latter class was combined with the poisonous one. The Guide clearly states that there is no simple rule for determining the edibility of a mushroom.
Experiment description: For the TrAdaBoost experiment, according to the authors:
The data is split in two sets based on the feature stalk-shape. The diff-distribution data set (the source data set) consists of all the instances whose stalks are enlarging, while the same-distribution data set (the target data set) consists of the instances about tapering mushrooms. Then, the two sets contain examples from different types of mushrooms, which makes the distributions different. – Boosting for Transfer Learning (2007)
[53]:
from IPython.display import display
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from adapt.datasets import open_uci_dataset
[54]:
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/mushroom/agaricus-lepiota.data"
columns = ["target", "cap-shape", "cap-surface", "cap-color", "bruises?", "odor",
           "gill-attachment", "gill-spacing", "gill-size", "gill-color", "stalk-shape",
           "stalk-root", "stalk-surface-above-ring", "stalk-surface-below-ring",
           "stalk-color-above-ring", "stalk-color-below-ring", "veil-type", "veil-color",
           "ring-number", "ring-type", "spore-print-color", "population", "habitat"]
data = pd.read_csv(url, header=None)
data.columns = columns
X = data.drop(["target"], axis=1)
y = data[["target"]]
display(X.head())
| | cap-shape | cap-surface | cap-color | bruises? | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | ... | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | x | s | n | t | p | f | c | n | k | e | ... | s | w | w | p | w | o | p | k | s | u |
| 1 | x | s | y | t | a | f | c | b | k | e | ... | s | w | w | p | w | o | p | n | n | g |
| 2 | b | s | w | t | l | f | c | b | n | e | ... | s | w | w | p | w | o | p | n | n | m |
| 3 | x | y | w | t | p | f | c | n | n | e | ... | s | w | w | p | w | o | p | k | s | u |
| 4 | x | s | g | f | n | f | w | b | k | t | ... | s | w | w | p | w | o | e | n | a | g |
5 rows × 22 columns
[55]:
X["stalk-shape"].value_counts()
[55]:
t    4608
e    3516
Name: stalk-shape, dtype: int64
Note: When looking at the number of instances in each category of the stalk-shape attribute, it seems that the authors swapped the source data set with the target one in the text above. Indeed, when looking at Table 1 in the paper, the number of source instances should be 4608, which corresponds to the tapering class and not the enlarging one.
For the first experiment, the number of target instances is set to 1% of the length of the source data set:
Each same-distribution (understand “target”) data set is split into two sets: a same-distribution training set Ts and a test set S. Table 3 presents the experimental results of SVM, SVMt, AUX and TrAdaBoost(SVM) when the ratio between same-distribution and diff-distribution (understand “source”) training data is 0.01. The performance in error rate was the average of 10 repeats by random. The number of iterations (of TrAdaBoost) is set to 100. – Boosting for Transfer Learning (2007)
Here SVM refers to a linear SVM classifier fitted only with source data, SVMt to the same classifier fitted with source and target labeled data with uniform weights, AUX refers to the BalanceWeighting method and TrAdaBoost(SVM) to TrAdaBoost used with a linear SVM classifier as base learner. We also add a comparison with AdaBoost to verify that TrAdaBoost is not advantaged by the averaging of predictions over multiple estimators.
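As a quick sanity check of the 1% ratio, the sizes of the labeled target set and of the remaining target test set can be worked out from the class counts shown above. This is only arithmetic on those counts; the variable names are ours.

```python
n_tapering = 4608   # "t" instances, used here as the source domain
n_enlarging = 3516  # "e" instances, used here as the target domain
ratio = 0.01

# Labeled target instances: 1% of the source size, truncated to an integer.
n_target_labeled = int(ratio * n_tapering)
# Remaining target instances, used only for evaluation.
n_target_test = n_enlarging - n_target_labeled

print(n_target_labeled, n_target_test)
```

With these counts only 46 target labels are available to the adapted models, against 4608 labeled source instances.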
[56]:
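def split_source_target(X, y, ratio_of_target_labels=0.01):
    # Source domain: tapering stalks ("t"); target domain: enlarging stalks ("e")
    Xs = X.loc[X["stalk-shape"] == "t"]
    ys = y.loc[Xs.index]
    Xt = X.loc[X["stalk-shape"] == "e"]
    yt = y.loc[Xt.index]
    # Draw the few labeled target instances and remove them from the test set
    Xt_lab = Xt.sample(int(ratio_of_target_labels * len(Xs)))
    yt_lab = yt.loc[Xt_lab.index]
    Xt = Xt.drop(Xt_lab.index, axis=0)
    yt = yt.drop(yt_lab.index, axis=0)
    # Note: scikit-learn >= 1.2 renames `sparse` to `sparse_output`
    ohe = OneHotEncoder(sparse=False).fit(X)
    Xs = ohe.transform(Xs)
    Xt = ohe.transform(Xt)
    Xt_lab = ohe.transform(Xt_lab)
    return Xs, ys["target"], Xt, yt["target"], Xt_lab, yt_lab["target"]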
[57]:
from adapt.base import BaseAdaptEstimator
from scipy.sparse import vstack, issparse

# We create here the AUX model which consist in a balanced weighting
# between instances from source and target domains.
class BalancedWeighting(BaseAdaptEstimator):

    def __init__(self, estimator=None, alpha=1., Xt=None, yt=None):
        super().__init__(estimator=estimator, alpha=alpha, Xt=Xt, yt=yt)

    def fit(self, Xs, ys, Xt=None, yt=None, **kwargs):
        Xt, yt = self._get_target_data(Xt, yt)
        if issparse(Xs):
            X = vstack((Xs, Xt))
        else:
            X = np.concatenate((Xs, Xt))
        y = np.concatenate((ys, yt))
        sample_weight = np.ones(X.shape[0])
        sample_weight[Xs.shape[0]:] *= (Xs.shape[0] / Xt.shape[0]) * self.alpha
        self.fit_estimator(X, y, sample_weight=sample_weight)
We repeat the experiment 10 times with different random seeds. The trade-off parameter alpha for the Balanced Weighting technique (AUX) is set to 4, as the authors did:
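The weighting used by this class can be illustrated on its own. The sketch below reproduces only the sample-weight computation with NumPy; the helper name `balanced_sample_weight` is ours and the sizes are the ones of the Mushrooms split.

```python
import numpy as np

def balanced_sample_weight(n_source, n_target, alpha=1.0):
    """Weights of 1 for source instances, (n_source / n_target) * alpha for target ones."""
    weights = np.ones(n_source + n_target)
    weights[n_source:] *= (n_source / n_target) * alpha
    return weights

weights = balanced_sample_weight(n_source=4608, n_target=46, alpha=4.0)
source_total = weights[:4608].sum()
target_total = weights[4608:].sum()
print(source_total, target_total)
```

With alpha=1 both domains carry the same total weight; with alpha=4 the 46 labeled target instances together weigh four times as much as the 4608 source instances.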
Besides the baselines, we also compare TrAdaBoost with the method developed for learning with auxiliary data proposed by Wu and Dietterich (2004), which is denoted as AUX. The parameter Cp/Ca (as used in (Wu & Dietterich, 2004)) is set to 4 after tuning. – Boosting for Transfer Learning (2007)
Besides, we balanced the weights between positive and negative instances:
Furthermore, we also added some constraints to the basic learners to avoid the case of training weights being unbalanced. When training SVM, we always balance the overall training weights between positive and negative examples. – Boosting for Transfer Learning (2007)
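In this notebook the balancing is obtained by passing `class_weight="balanced"` to the linear SVM. As a rough illustration, scikit-learn documents this heuristic as `n_samples / (n_classes * np.bincount(y))`; the sketch below recomputes it with NumPy on toy labels. The helper name and the toy labels are ours.

```python
import numpy as np

def balanced_class_weights(y):
    """Per-class weights following the documented 'balanced' heuristic."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy labels: 6 edible ("e") and 2 poisonous ("p") mushrooms.
y_toy = ["e"] * 6 + ["p"] * 2
class_weights = balanced_class_weights(y_toy)
print(class_weights)
```

Each class ends up with the same total weight (here 4 for each class), so the minority class is not dominated during training.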
[58]:
from adapt.instance_based import TrAdaBoost
from sklearn.svm import LinearSVC

names = ["SVM", "SVMt", "AUX", "TrAdaBoost"]
scores = {k: [] for k in names}

for state in range(10):
    np.random.seed(state)
    Xs, ys, Xt, yt, Xt_lab, yt_lab = split_source_target(X, y, ratio_of_target_labels=0.01)
    if state == 0:
        print("Xs shape: %s, Xt shape: %s" % (str(Xs.shape), str(Xt.shape)))
    models = [
        LinearSVC(class_weight="balanced"),
        LinearSVC(class_weight="balanced"),
        BalancedWeighting(LinearSVC(class_weight="balanced"), alpha=4., Xt=Xt_lab, yt=yt_lab),
        TrAdaBoost(LinearSVC(class_weight="balanced"), n_estimators=100, verbose=0, Xt=Xt_lab, yt=yt_lab),
    ]
    for model, name in zip(models, names):
        if name == "SVMt":
            model.fit(np.concatenate((Xs, Xt_lab)), np.concatenate((ys, yt_lab)))
        else:
            model.fit(Xs, ys)
        scores[name].append(1 - model.score(Xt, yt))
    print("Round %i : %s" % (state, str({k: np.round(v[-1], 3) for k, v in scores.items()})))
Xs shape: (4608, 117), Xt shape: (3470, 117)
Round 0 : {'SVM': 0.262, 'SVMt': 0.069, 'AUX': 0.067, 'TrAdaBoost': 0.067}
Round 1 : {'SVM': 0.263, 'SVMt': 0.06, 'AUX': 0.062, 'TrAdaBoost': 0.061}
Round 2 : {'SVM': 0.262, 'SVMt': 0.045, 'AUX': 0.046, 'TrAdaBoost': 0.048}
Round 3 : {'SVM': 0.261, 'SVMt': 0.021, 'AUX': 0.017, 'TrAdaBoost': 0.028}
Round 4 : {'SVM': 0.262, 'SVMt': 0.049, 'AUX': 0.048, 'TrAdaBoost': 0.052}
Round 5 : {'SVM': 0.261, 'SVMt': 0.052, 'AUX': 0.052, 'TrAdaBoost': 0.052}
Round 6 : {'SVM': 0.261, 'SVMt': 0.08, 'AUX': 0.08, 'TrAdaBoost': 0.063}
Round 7 : {'SVM': 0.262, 'SVMt': 0.086, 'AUX': 0.086, 'TrAdaBoost': 0.082}
Round 8 : {'SVM': 0.263, 'SVMt': 0.048, 'AUX': 0.049, 'TrAdaBoost': 0.044}
Round 9 : {'SVM': 0.261, 'SVMt': 0.042, 'AUX': 0.042, 'TrAdaBoost': 0.031}
Results Summary
[59]:
error_mu = np.round(pd.DataFrame(pd.DataFrame(scores).mean(0), columns=["Error"]), 3).transpose().astype(str)
error_std = np.round(pd.DataFrame(pd.DataFrame(scores).std(0), columns=["Error"]), 3).transpose().astype(str)
display(error_mu + " (" + error_std + ")")
| | SVM | SVMt | AUX | TrAdaBoost |
|---|---|---|---|---|
| Error | 0.261 (0.001) | 0.055 (0.019) | 0.055 (0.02) | 0.053 (0.016) |
The results that we obtain differ a little from the ones obtained by the authors in Table 3. Here, the error for SVMt, AUX and TrAdaBoost is smaller but the error of SVM is higher. Moreover, the error of SVMt is much lower than the corresponding error computed by the authors.
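The table reports the mean error over the 10 rounds followed by the standard deviation in parentheses. pandas computes the sample standard deviation (`ddof=1`) by default, so the summary can be approximately recomputed from the rounded per-round errors printed above. The sketch below does this for the TrAdaBoost column with the standard library only; since the inputs are already rounded to three decimals, the result is only expected to agree with the table up to rounding.

```python
import statistics

# Per-round TrAdaBoost errors, copied from the rounded output above.
tradaboost_errors = [0.067, 0.061, 0.048, 0.028, 0.052,
                     0.052, 0.063, 0.082, 0.044, 0.031]

mean_error = statistics.mean(tradaboost_errors)
# statistics.stdev is the sample standard deviation (divides by n - 1),
# which matches the pandas default ddof=1.
std_error = statistics.stdev(tradaboost_errors)

print("%.3f (%.3f)" % (mean_error, std_error))
```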
20 NewsGroup
Dataset description: The 20 NewsGroup data set comprises around 18000 newsgroup posts on 20 main topics, with some topics divided into subcategories.
Experiment description: For the TrAdaBoost experiment, according to the authors:
We define the tasks as top-category classification problems. When we split the data to generate diff-distribution (source) and same-distribution (target) sets, the data are split based on subcategories instead of based on random splitting. Then, the two data sets contain data in different subcategories. Their distributions also differ as a result. – Boosting for Transfer Learning (2007)
The authors do not specify which categories have been selected within each domain. We try to infer them based on the number of instances in each domain given by the authors in Table 1.
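To illustrate this kind of split, here is a minimal sketch on toy data. It shows how subcategories define the source and target domains, how the top category ("rec" here) defines the binary label, and how the domain sizes can be counted to compare against Table 1. The `doc_targets` array is hypothetical; only the category names come from the data set.

```python
import numpy as np

target_names = ["rec.autos", "rec.motorcycles", "rec.sport.baseball",
                "rec.sport.hockey", "sci.crypt", "sci.electronics",
                "sci.med", "sci.space"]
# Hypothetical category index of each document (toy data).
doc_targets = np.array([0, 1, 1, 2, 3, 4, 5, 5, 6, 7, 7, 7])

source_rec = ["rec.autos", "rec.motorcycles"]
target_rec = ["rec.sport.baseball", "rec.sport.hockey"]
source_sci = ["sci.crypt", "sci.electronics"]
target_sci = ["sci.med", "sci.space"]

def to_indices(names):
    return [target_names.index(name) for name in names]

# Domains are defined by subcategories ...
source_mask = np.isin(doc_targets, to_indices(source_rec + source_sci))
target_mask = np.isin(doc_targets, to_indices(target_rec + target_sci))
# ... while the label is the top category (1.0 for "rec", 0.0 for "sci").
positive = to_indices(source_rec + target_rec)
ys = np.isin(doc_targets[source_mask], positive).astype(float)
yt = np.isin(doc_targets[target_mask], positive).astype(float)

print("source size:", source_mask.sum(), "target size:", target_mask.sum())
```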
[6]:
from sklearn.datasets import fetch_20newsgroups

# Set download_if_missing to True if not downloaded yet
data = fetch_20newsgroups(download_if_missing=False, subset="all")

source_rec = ['rec.autos', 'rec.motorcycles']
target_rec = ['rec.sport.baseball', 'rec.sport.hockey']
source_sci = ['sci.crypt', 'sci.electronics']
target_sci = ['sci.med', 'sci.space']
source_talk = ['talk.politics.guns', 'talk.politics.mideast']
target_talk = ['talk.politics.misc', 'talk.religion.misc']
The authors do not specify which preprocessing is applied to the data, so we use a standard scikit-learn preprocessing step: TfidfVectorizer
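The vectorizer in this notebook is configured with `min_df=5` and `max_df=0.1`. As we understand the scikit-learn documentation, an integer value is an absolute document count and a float is a proportion of documents, and terms whose document frequency falls outside these bounds are dropped. The sketch below is a simplified pure-Python illustration of that filtering on toy documents, not scikit-learn's implementation; the helper name and the toy corpus are ours.

```python
from collections import Counter

def filter_vocabulary(tokenized_docs, min_df=1, max_df=1.0):
    """Keep terms whose document frequency lies within [min_df, max_df]."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each term once per document
    # Integers are absolute counts, floats are proportions of documents.
    low = min_df if isinstance(min_df, int) else min_df * n_docs
    high = max_df if isinstance(max_df, int) else max_df * n_docs
    return sorted(t for t, count in doc_freq.items() if low <= count <= high)

docs = [["the", "car", "engine"],
        ["the", "bike", "engine"],
        ["the", "hockey", "game"],
        ["the", "space", "probe"]]
vocabulary = filter_vocabulary(docs, min_df=2, max_df=0.5)
print(vocabulary)
```

In this toy corpus "the" appears in every document and is removed by `max_df`, while words appearing in a single document are removed by `min_df`.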
[17]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", analyzer="word", min_df=5, max_df=0.1)
X = vectorizer.fit_transform(data.data)

def split_source_target(source_index, target_index, positive_index, ratio_of_target_labels=0.01):
    Xs = X[source_index]
    Xt = X[target_index]
    ys = np.isin(data.target[source_index], positive_index).astype(float)
    yt = np.isin(data.target[target_index], positive_index).astype(float)
    # Number of labeled target instances: a fraction of the source size
    lab_index = np.random.choice(Xt.shape[0], int(ratio_of_target_labels * Xs.shape[0]), replace=False)
    unlab_index = np.array(list(set(np.arange(Xt.shape[0])) - set(lab_index)))
    Xt_lab = Xt[lab_index]
    yt_lab = yt[lab_index]
    Xt = Xt[unlab_index]
    yt = yt[unlab_index]
    return Xs, ys, Xt, yt, Xt_lab, yt_lab
We conduct the three proposed experiments: “rec vs talk”, “rec vs sci” and “sci vs talk”. We set the number of TrAdaBoost estimators to 10 instead of 100, because we found that using 100 estimators gives poor results for TrAdaBoost.
Rec vs Talk
[38]:
source_rec = ['rec.autos', 'rec.motorcycles']
target_rec = ['rec.sport.baseball', 'rec.sport.hockey']
source_talk = ['talk.politics.guns', 'talk.politics.misc']
target_talk = ['talk.religion.misc', 'talk.politics.mideast']

source_index = np.isin(data.target, [data.target_names.index(s) for s in source_rec + source_talk])
target_index = np.isin(data.target, [data.target_names.index(s) for s in target_rec + target_talk])
positive_index = [data.target_names.index(s) for s in target_rec + source_rec]
[39]:
from adapt.instance_based import TrAdaBoost
from sklearn.svm import LinearSVC
from scipy.sparse import vstack

names = ["SVM", "SVMt", "AUX", "TrAdaBoost"]
scores = {k: [] for k in names}

for state in range(10):
    np.random.seed(state)
    Xs, ys, Xt, yt, Xt_lab, yt_lab = split_source_target(
        source_index, target_index, positive_index, ratio_of_target_labels=0.01)
    if state == 0:
        print("Xs shape: %s, Xt shape: %s" % (str(Xs.shape), str(Xt.shape)))
    models = [
        LinearSVC(class_weight="balanced"),
        LinearSVC(class_weight="balanced"),
        BalancedWeighting(LinearSVC(class_weight="balanced"), alpha=4., Xt=Xt_lab, yt=yt_lab),
        TrAdaBoost(LinearSVC(class_weight="balanced"), n_estimators=10, verbose=0, Xt=Xt_lab, yt=yt_lab),
    ]
    for model, name in zip(models, names):
        if name == "SVMt":
            model.fit(vstack((Xs, Xt_lab)), np.concatenate((ys, yt_lab)))
        else:
            model.fit(Xs, ys)
        scores[name].append(1 - model.score(Xt, yt))
    print("Round %i : %s" % (state, str({k: np.round(v[-1], 3) for k, v in scores.items()})))
Xs shape: (3671, 34814), Xt shape: (3525, 34814)
Round 0 : {'SVM': 0.206, 'SVMt': 0.112, 'AUX': 0.099, 'TrAdaBoost': 0.091}
Round 1 : {'SVM': 0.207, 'SVMt': 0.106, 'AUX': 0.085, 'TrAdaBoost': 0.076}
Round 2 : {'SVM': 0.206, 'SVMt': 0.107, 'AUX': 0.089, 'TrAdaBoost': 0.076}
Round 3 : {'SVM': 0.205, 'SVMt': 0.119, 'AUX': 0.1, 'TrAdaBoost': 0.084}
Round 4 : {'SVM': 0.205, 'SVMt': 0.092, 'AUX': 0.08, 'TrAdaBoost': 0.078}
Round 5 : {'SVM': 0.205, 'SVMt': 0.107, 'AUX': 0.089, 'TrAdaBoost': 0.081}
Round 6 : {'SVM': 0.205, 'SVMt': 0.106, 'AUX': 0.087, 'TrAdaBoost': 0.076}
Round 7 : {'SVM': 0.207, 'SVMt': 0.104, 'AUX': 0.089, 'TrAdaBoost': 0.081}
Round 8 : {'SVM': 0.207, 'SVMt': 0.104, 'AUX': 0.092, 'TrAdaBoost': 0.091}
Round 9 : {'SVM': 0.206, 'SVMt': 0.104, 'AUX': 0.088, 'TrAdaBoost': 0.073}
Results Summary
[40]:
error_mu = np.round(pd.DataFrame(pd.DataFrame(scores).mean(0), columns=["Error"]), 3).transpose().astype(str)
error_std = np.round(pd.DataFrame(pd.DataFrame(scores).std(0), columns=["Error"]), 3).transpose().astype(str)
display(error_mu + " (" + error_std + ")")
| | SVM | SVMt | AUX | TrAdaBoost |
|---|---|---|---|---|
| Error | 0.206 (0.001) | 0.106 (0.007) | 0.09 (0.006) | 0.081 (0.006) |
Rec vs Sci
[41]:
source_rec = ['rec.autos', 'rec.motorcycles']
target_rec = ['rec.sport.baseball', 'rec.sport.hockey']
source_sci = ['sci.crypt', 'sci.electronics']
target_sci = ['sci.med', 'sci.space']

source_index = np.isin(data.target, [data.target_names.index(s) for s in source_sci + source_rec])
target_index = np.isin(data.target, [data.target_names.index(s) for s in target_sci + target_rec])
positive_index = [data.target_names.index(s) for s in target_rec + source_rec]
[42]:
from adapt.instance_based import TrAdaBoost
from sklearn.svm import LinearSVC
from scipy.sparse import vstack

names = ["SVM", "SVMt", "AUX", "TrAdaBoost"]
scores = {k: [] for k in names}

for state in range(10):
    np.random.seed(state)
    Xs, ys, Xt, yt, Xt_lab, yt_lab = split_source_target(
        source_index, target_index, positive_index, ratio_of_target_labels=0.01)
    if state == 0:
        print("Xs shape: %s, Xt shape: %s" % (str(Xs.shape), str(Xt.shape)))
    models = [
        LinearSVC(class_weight="balanced"),
        LinearSVC(class_weight="balanced"),
        BalancedWeighting(LinearSVC(class_weight="balanced"), alpha=4., Xt=Xt_lab, yt=yt_lab),
        TrAdaBoost(LinearSVC(class_weight="balanced"), n_estimators=10, verbose=0, Xt=Xt_lab, yt=yt_lab),
    ]
    for model, name in zip(models, names):
        if name == "SVMt":
            model.fit(vstack((Xs, Xt_lab)), np.concatenate((ys, yt_lab)))
        else:
            model.fit(Xs, ys)
        scores[name].append(1 - model.score(Xt, yt))
    print("Round %i : %s" % (state, str({k: np.round(v[-1], 3) for k, v in scores.items()})))
Xs shape: (3961, 34814), Xt shape: (3931, 34814)
Round 0 : {'SVM': 0.347, 'SVMt': 0.194, 'AUX': 0.16, 'TrAdaBoost': 0.131}
Round 1 : {'SVM': 0.347, 'SVMt': 0.17, 'AUX': 0.14, 'TrAdaBoost': 0.116}
Round 2 : {'SVM': 0.349, 'SVMt': 0.208, 'AUX': 0.177, 'TrAdaBoost': 0.144}
Round 3 : {'SVM': 0.347, 'SVMt': 0.163, 'AUX': 0.139, 'TrAdaBoost': 0.119}
Round 4 : {'SVM': 0.346, 'SVMt': 0.165, 'AUX': 0.137, 'TrAdaBoost': 0.115}
Round 5 : {'SVM': 0.349, 'SVMt': 0.205, 'AUX': 0.163, 'TrAdaBoost': 0.138}
Round 6 : {'SVM': 0.347, 'SVMt': 0.166, 'AUX': 0.14, 'TrAdaBoost': 0.121}
Round 7 : {'SVM': 0.349, 'SVMt': 0.22, 'AUX': 0.182, 'TrAdaBoost': 0.15}
Round 8 : {'SVM': 0.35, 'SVMt': 0.185, 'AUX': 0.153, 'TrAdaBoost': 0.115}
Round 9 : {'SVM': 0.349, 'SVMt': 0.205, 'AUX': 0.169, 'TrAdaBoost': 0.139}
Results Summary
[43]:
error_mu = np.round(pd.DataFrame(pd.DataFrame(scores).mean(0), columns=["Error"]), 3).transpose().astype(str)
error_std = np.round(pd.DataFrame(pd.DataFrame(scores).std(0), columns=["Error"]), 3).transpose().astype(str)
display(error_mu + " (" + error_std + ")")
| | SVM | SVMt | AUX | TrAdaBoost |
|---|---|---|---|---|
| Error | 0.348 (0.001) | 0.188 (0.021) | 0.156 (0.017) | 0.129 (0.013) |
Talk vs Sci
[44]:
source_sci = ['sci.crypt', 'sci.electronics']
target_sci = ['sci.med', 'sci.space']
source_talk = ['talk.politics.misc', 'talk.religion.misc']
target_talk = ['talk.politics.guns', 'talk.politics.mideast']

source_index = np.isin(data.target, [data.target_names.index(s) for s in source_sci + source_talk])
target_index = np.isin(data.target, [data.target_names.index(s) for s in target_sci + target_talk])
positive_index = [data.target_names.index(s) for s in target_sci + source_sci]
[52]:
from adapt.instance_based import TrAdaBoost
from sklearn.svm import LinearSVC
from scipy.sparse import vstack

names = ["SVM", "SVMt", "AUX", "TrAdaBoost"]
scores = {k: [] for k in names}

for state in range(10):
    np.random.seed(state)
    Xs, ys, Xt, yt, Xt_lab, yt_lab = split_source_target(
        source_index, target_index, positive_index, ratio_of_target_labels=0.01)
    if state == 0:
        print("Xs shape: %s, Xt shape: %s" % (str(Xs.shape), str(Xt.shape)))
    models = [
        LinearSVC(class_weight="balanced"),
        LinearSVC(class_weight="balanced"),
        BalancedWeighting(LinearSVC(class_weight="balanced"), alpha=4., Xt=Xt_lab, yt=yt_lab),
        TrAdaBoost(LinearSVC(class_weight="balanced"), n_estimators=10, verbose=0, Xt=Xt_lab, yt=yt_lab),
    ]
    for model, name in zip(models, names):
        if name == "SVMt":
            model.fit(vstack((Xs, Xt_lab)), np.concatenate((ys, yt_lab)))
        else:
            model.fit(Xs, ys)
        scores[name].append(1 - model.score(Xt, yt))
    print("Round %i : %s" % (state, str({k: np.round(v[-1], 3) for k, v in scores.items()})))
Xs shape: (3378, 34814), Xt shape: (3794, 34814)
Round 0 : {'SVM': 0.261, 'SVMt': 0.209, 'AUX': 0.185, 'TrAdaBoost': 0.159}
Round 1 : {'SVM': 0.26, 'SVMt': 0.218, 'AUX': 0.202, 'TrAdaBoost': 0.184}
Round 2 : {'SVM': 0.26, 'SVMt': 0.203, 'AUX': 0.182, 'TrAdaBoost': 0.166}
Round 3 : {'SVM': 0.26, 'SVMt': 0.214, 'AUX': 0.199, 'TrAdaBoost': 0.177}
Round 4 : {'SVM': 0.26, 'SVMt': 0.197, 'AUX': 0.176, 'TrAdaBoost': 0.154}
Round 5 : {'SVM': 0.261, 'SVMt': 0.214, 'AUX': 0.199, 'TrAdaBoost': 0.18}
Round 6 : {'SVM': 0.26, 'SVMt': 0.202, 'AUX': 0.182, 'TrAdaBoost': 0.16}
Round 7 : {'SVM': 0.26, 'SVMt': 0.202, 'AUX': 0.179, 'TrAdaBoost': 0.158}
Round 8 : {'SVM': 0.259, 'SVMt': 0.186, 'AUX': 0.165, 'TrAdaBoost': 0.142}
Round 9 : {'SVM': 0.26, 'SVMt': 0.192, 'AUX': 0.167, 'TrAdaBoost': 0.145}
Results Summary
[46]:
error_mu = np.round(pd.DataFrame(pd.DataFrame(scores).mean(0), columns=["Error"]), 3).transpose().astype(str)
error_std = np.round(pd.DataFrame(pd.DataFrame(scores).std(0), columns=["Error"]), 3).transpose().astype(str)
display(error_mu + " (" + error_std + ")")
| | SVM | SVMt | AUX | TrAdaBoost |
|---|---|---|---|---|
| Error | 0.26 (0.0) | 0.204 (0.01) | 0.184 (0.013) | 0.162 (0.014) |
We can see that our results are not very similar to the ones that the authors obtained, but we observe the same ordering of error levels: SVM > SVMt > AUX > TrAdaBoost
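The ordering can be checked directly from the mean errors reported in the three summary tables above. The short sketch below only restates those reported values; the dictionary and function names are ours.

```python
# Mean errors copied from the three 20 NewsGroup summary tables above.
reported_errors = {
    "rec vs talk": {"SVM": 0.206, "SVMt": 0.106, "AUX": 0.090, "TrAdaBoost": 0.081},
    "rec vs sci": {"SVM": 0.348, "SVMt": 0.188, "AUX": 0.156, "TrAdaBoost": 0.129},
    "talk vs sci": {"SVM": 0.260, "SVMt": 0.204, "AUX": 0.184, "TrAdaBoost": 0.162},
}
expected_order = ["SVM", "SVMt", "AUX", "TrAdaBoost"]

def has_expected_order(errors):
    """True if the errors strictly decrease along expected_order."""
    values = [errors[name] for name in expected_order]
    return all(a > b for a, b in zip(values, values[1:]))

ordering_holds = {task: has_expected_order(errors)
                  for task, errors in reported_errors.items()}
print(ordering_holds)
```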
