14.2. Fine-Tuning

In earlier chapters, we discussed how to train models on the Fashion-MNIST training dataset with only 60000 images. We also described ImageNet, the most widely used large-scale image dataset in academia, which has more than 10 million images and 1000 object classes. However, the size of the dataset that we usually encounter is between those of the two datasets.

Suppose that we want to recognize different types of chairs from images, and then recommend purchase links to users. One possible method is to first identify 100 common chairs, take 1000 images from different angles for each chair, and then train a classification model on the collected image dataset. Although this chair dataset may be larger than the Fashion-MNIST dataset, the number of examples is still less than one-tenth of that in ImageNet. This may lead to overfitting when complicated models that are suitable for ImageNet are applied to this chair dataset. Moreover, due to the limited number of training examples, the accuracy of the trained model may not meet practical requirements.

In order to address the above problems, an obvious solution is to collect more data. However, collecting and labeling data can take a lot of time and money. For example, researchers have spent millions of dollars of research funding to collect the ImageNet dataset. Although data collection costs have been significantly reduced since then, they still cannot be ignored.

Another solution is to apply transfer learning to transfer the knowledge learned from the source dataset to the target dataset. For example, although most of the images in the ImageNet dataset have nothing to do with chairs, a model trained on this dataset may extract more general image features, which can help identify edges, textures, shapes, and object composition. These general features may also be effective for recognizing chairs.

14.2.1. Steps

In this section, we will introduce a common technique in transfer learning: fine-tuning. As shown in Fig. 14.2.1, fine-tuning consists of the following four steps:

  1. Pretrain a neural network model, i.e., the source model, on a source dataset (e.g., the ImageNet dataset).

  2. Create a new neural network model, i.e., the target model. This copies all model designs and their parameters from the source model except the output layer. We assume that these model parameters contain the knowledge learned from the source dataset and that this knowledge will also be applicable to the target dataset. We also assume that the output layer of the source model is closely related to the labels of the source dataset; thus it is not used in the target model.

  3. Add an output layer to the target model, whose number of outputs is the number of categories in the target dataset. Then randomly initialize the model parameters of this layer.

  4. Train the target model on the target dataset, such as a chair dataset. The output layer will be trained from scratch, while the parameters of all the other layers are fine-tuned based on the parameters of the source model.

../_images/finetune.svg

Fig. 14.2.1 Fine tuning.

When target datasets are much smaller than source datasets, fine-tuninghelps to improve models’ generalization ability.
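To make these four steps concrete, here is a minimal PyTorch sketch (a sketch only: the number of target classes below is a placeholder, and the rest of this section walks through a complete example):

import torchvision
from torch import nn

# Step 1: the source model, pretrained on the source dataset (ImageNet)
source_model = torchvision.models.resnet18(pretrained=True)

# Steps 2 and 3: create the target model with the source model's design and
# pretrained parameters, then replace the output layer with a randomly
# initialized layer whose number of outputs matches the target categories
num_target_classes = 2  # placeholder for the target dataset's class count
target_model = torchvision.models.resnet18(pretrained=True)
target_model.fc = nn.Linear(target_model.fc.in_features, num_target_classes)
nn.init.xavier_uniform_(target_model.fc.weight)

# Step 4: train target_model on the target dataset, with the output layer
# trained from scratch and all other layers fine-tuned at a small learning
# rate (see the training function defined later in this section)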

14.2.2. Hot Dog Recognition

Let’s demonstrate fine-tuning via a concrete case: hot dog recognition. We will fine-tune a ResNet model, pretrained on the ImageNet dataset, on a small dataset consisting of thousands of images with and without hot dogs. We will then use the fine-tuned model to recognize hot dogs in images.

%matplotlib inline
import os
import torch
import torchvision
from torch import nn
from d2l import torch as d2l
%matplotlib inline
import os
from mxnet import gluon, init, np, npx
from mxnet.gluon import nn
from d2l import mxnet as d2l

npx.set_np()

14.2.2.1. Reading the Dataset

The hot dog dataset we use was taken from online images. This dataset consists of 1400 positive-class images containing hot dogs, and as many negative-class images containing other foods. 1000 images of each class are used for training and the rest are for testing.

After unzipping the downloaded dataset, we obtain two folders hotdog/train and hotdog/test. Both folders have hotdog and not-hotdog subfolders, each of which contains images of the corresponding class.

#@save
d2l.DATA_HUB['hotdog'] = (d2l.DATA_URL + 'hotdog.zip',
                          'fba480ffa8aa7e0febbb511d181409f899b9baa5')

data_dir = d2l.download_extract('hotdog')
Downloading ../data/hotdog.zip from http://d2l-data.s3-accelerate.amazonaws.com/hotdog.zip...
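As a quick sanity check (an addition, not part of the original code), we can confirm the folder layout described above:

# Verify the extracted folder structure
print(sorted(os.listdir(os.path.join(data_dir, 'train'))))  # ['hotdog', 'not-hotdog']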

We create two instances to read all the image files in the training andtesting datasets, respectively.

train_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'train'))
test_imgs = torchvision.datasets.ImageFolder(os.path.join(data_dir, 'test'))
train_imgs = gluon.data.vision.ImageFolderDataset(
    os.path.join(data_dir, 'train'))
test_imgs = gluon.data.vision.ImageFolderDataset(
    os.path.join(data_dir, 'test'))

The first 8 positive examples and the last 8 negative examples are shown below. As you can see, the images vary in size and aspect ratio.

hotdogs = [train_imgs[i][0] for i in range(8)]
not_hotdogs = [train_imgs[-i - 1][0] for i in range(8)]
d2l.show_images(hotdogs + not_hotdogs, 2, 8, scale=1.4);
../_images/output_fine-tuning_368659_30_0.png

During training, we first crop a random area of random size and random aspect ratio from the image, and then scale this area to a \(224 \times 224\) input image. During testing, we scale both the height and width of an image to 256 pixels, and then crop a central \(224 \times 224\) area as input. In addition, for the three RGB (red, green, and blue) color channels we standardize their values channel by channel. Concretely, the mean value of a channel is subtracted from each value of that channel and then the result is divided by the standard deviation of that channel.

# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = torchvision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_augs = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
    normalize])

test_augs = torchvision.transforms.Compose([
    torchvision.transforms.Resize([256, 256]),
    torchvision.transforms.CenterCrop(224),
    torchvision.transforms.ToTensor(),
    normalize])
# Specify the means and standard deviations of the three RGB channels to
# standardize each channel
normalize = gluon.data.vision.transforms.Normalize(
    [0.485, 0.456, 0.406], [0.229, 0.224, 0.225])

train_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.RandomResizedCrop(224),
    gluon.data.vision.transforms.RandomFlipLeftRight(),
    gluon.data.vision.transforms.ToTensor(),
    normalize])

test_augs = gluon.data.vision.transforms.Compose([
    gluon.data.vision.transforms.Resize(256),
    gluon.data.vision.transforms.CenterCrop(224),
    gluon.data.vision.transforms.ToTensor(),
    normalize])
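To make the channel-wise standardization concrete, the following small sketch (an addition, not part of the original pipeline) reproduces what the Normalize transform computes:

# For each channel c: x[c] <- (x[c] - mean[c]) / std[c], using the ImageNet
# statistics specified above
mean = torch.tensor([0.485, 0.456, 0.406]).reshape(3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).reshape(3, 1, 1)
img = torch.rand(3, 224, 224)  # stand-in for a ToTensor output in [0, 1]
standardized = (img - mean) / std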

14.2.2.2. Defining and Initializing the Model

We use ResNet-18, which was pretrained on the ImageNet dataset, as the source model. Here, we specify pretrained=True to automatically download the pretrained model parameters. If this model is used for the first time, an Internet connection is required for the download.

pretrained_net = torchvision.models.resnet18(pretrained=True)

The pretrained source model instance contains a number of feature layers and an output layer fc. The main purpose of this division is to facilitate the fine-tuning of model parameters of all layers but the output layer. The member variable fc of the source model is given below.

pretrained_net.fc
Linear(in_features=512, out_features=1000, bias=True)
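To see this division concretely, we can list the network's top-level modules (a quick inspection, not in the original text); everything before fc serves as the feature extractor that fine-tuning reuses:

# Print the top-level modules of the pretrained network; the final entry
# is the output layer `fc`
for name, module in pretrained_net.named_children():
    print(name, type(module).__name__)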
pretrained_net = gluon.model_zoo.vision.resnet18_v2(pretrained=True)
Downloading /opt/mxnet/models/resnet18_v2-a81db45f.zip2fac831f-e7e6-40bb-987f-e037f3e8e5d3 from https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/models/resnet18_v2-a81db45f.zip...

The pretrained source model instance contains two member variables: features and output. The former contains all layers of the model except the output layer, and the latter is the output layer of the model. The main purpose of this division is to facilitate the fine-tuning of model parameters of all layers but the output layer. The member variable output of the source model is shown below.

pretrained_net.output
Dense(512 -> 1000, linear)

As a fully connected layer, it transforms ResNet’s final global averagepooling outputs into 1000 class outputs of the ImageNet dataset. We thenconstruct a new neural network as the target model. It is defined in thesame way as the pretrained source model except that its number ofoutputs in the final layer is set to the number of classes in the targetdataset (rather than 1000).
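As a quick check (an addition, not in the original text), we can verify in PyTorch that the pretrained network indeed maps a single \(224 \times 224\) RGB input to 1000 ImageNet class outputs:

pretrained_net.eval()  # evaluation mode for the batch normalization layers
X = torch.randn(1, 3, 224, 224)
print(pretrained_net(X).shape)  # torch.Size([1, 1000])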

In the code below, the model parameters before the output layer of the target model instance finetune_net are initialized to the model parameters of the corresponding layers of the source model. Since these model parameters were obtained via pretraining on ImageNet, they are already effective. Therefore, we use only a small learning rate to fine-tune such pretrained parameters. In contrast, the model parameters in the output layer are randomly initialized and generally require a larger learning rate to be learned from scratch. Letting the base learning rate be \(\eta\), a learning rate of \(10\eta\) will be used to iterate the model parameters in the output layer.

finetune_net = torchvision.models.resnet18(pretrained=True)
finetune_net.fc = nn.Linear(finetune_net.fc.in_features, 2)
nn.init.xavier_uniform_(finetune_net.fc.weight);
finetune_net = gluon.model_zoo.vision.resnet18_v2(classes=2)
finetune_net.features = pretrained_net.features
finetune_net.output.initialize(init.Xavier())
# The model parameters in the output layer will be iterated using a learning
# rate ten times greater
finetune_net.output.collect_params().setattr('lr_mult', 10)

14.2.2.3. Fine-Tuning the Model

First, we define a training function train_fine_tuning that uses fine-tuning, so it can be called multiple times.

# If `param_group=True`, the model parameters in the output layer will be
# updated using a learning rate ten times greater
def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5,
                      param_group=True):
    train_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'train'), transform=train_augs),
        batch_size=batch_size, shuffle=True)
    test_iter = torch.utils.data.DataLoader(torchvision.datasets.ImageFolder(
        os.path.join(data_dir, 'test'), transform=test_augs),
        batch_size=batch_size)
    devices = d2l.try_all_gpus()
    loss = nn.CrossEntropyLoss(reduction="none")
    if param_group:
        params_1x = [param for name, param in net.named_parameters()
                     if name not in ["fc.weight", "fc.bias"]]
        trainer = torch.optim.SGD([{'params': params_1x},
                                   {'params': net.fc.parameters(),
                                    'lr': learning_rate * 10}],
                                  lr=learning_rate, weight_decay=0.001)
    else:
        trainer = torch.optim.SGD(net.parameters(), lr=learning_rate,
                                  weight_decay=0.001)
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)
def train_fine_tuning(net, learning_rate, batch_size=128, num_epochs=5):
    train_iter = gluon.data.DataLoader(
        train_imgs.transform_first(train_augs), batch_size, shuffle=True)
    test_iter = gluon.data.DataLoader(
        test_imgs.transform_first(test_augs), batch_size)
    devices = d2l.try_all_gpus()
    net.collect_params().reset_ctx(devices)
    net.hybridize()
    loss = gluon.loss.SoftmaxCrossEntropyLoss()
    trainer = gluon.Trainer(net.collect_params(), 'sgd', {
        'learning_rate': learning_rate, 'wd': 0.001})
    d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs,
                   devices)

We set the base learning rate to a small value in order to fine-tune the model parameters obtained via pretraining. Based on the previous settings, we will train the output layer parameters of the target model from scratch using a learning rate that is ten times greater.

train_fine_tuning(finetune_net, 5e-5)
loss 0.242, train acc 0.909, test acc 0.940
1062.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]
../_images/output_fine-tuning_368659_79_1.svg
train_fine_tuning(finetune_net, 0.01)
loss 0.255, train acc 0.923, test acc 0.944
368.7 examples/sec on [gpu(0), gpu(1)]
../_images/output_fine-tuning_368659_82_1.svg

For comparison, we define an identical model, but initialize all of itsmodel parameters to random values. Since the entire model needs to betrained from scratch, we can use a larger learning rate.

scratch_net = torchvision.models.resnet18()
scratch_net.fc = nn.Linear(scratch_net.fc.in_features, 2)
train_fine_tuning(scratch_net, 5e-4, param_group=False)
loss 0.352, train acc 0.846, test acc 0.850
1525.4 examples/sec on [device(type='cuda', index=0), device(type='cuda', index=1)]
../_images/output_fine-tuning_368659_88_1.svg
scratch_net = gluon.model_zoo.vision.resnet18_v2(classes=2)
scratch_net.initialize(init=init.Xavier())
train_fine_tuning(scratch_net, 0.1)
loss 0.356, train acc 0.842, test acc 0.860
574.1 examples/sec on [gpu(0), gpu(1)]
../_images/output_fine-tuning_368659_91_1.svg

As we can see, the fine-tuned model tends to perform better for the same number of epochs because its initial parameter values are more effective.

14.2.3. Summary

  • Transfer learning transfers knowledge learned from the source dataset to the target dataset. Fine-tuning is a common technique for transfer learning.

  • The target model copies all model designs with their parameters from the source model except the output layer, and fine-tunes these parameters based on the target dataset. In contrast, the output layer of the target model needs to be trained from scratch.

  • Generally, fine-tuning parameters uses a smaller learning rate, while training the output layer from scratch can use a larger learning rate.

14.2.4. Exercises

  1. Keep increasing the learning rate of finetune_net. How does the accuracy of the model change?

  2. Further adjust hyperparameters of finetune_net and scratch_net in the comparative experiment. Do they still differ in accuracy?

  3. Set the parameters before the output layer of finetune_net to those of the source model and do not update them during training. How does the accuracy of the model change? You can use the following code.

for param in finetune_net.parameters():
    param.requires_grad = False
finetune_net.features.collect_params().setattr('grad_req', 'null')
  4. In fact, there is a “hotdog” class in the ImageNet dataset. Its corresponding weight parameter in the output layer can be obtained via the following code. How can we leverage this weight parameter?

weight = pretrained_net.fc.weight
hotdog_w = torch.split(weight.data, 1, dim=0)[934]
hotdog_w.shape
torch.Size([1, 512])
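One possible way to leverage it (a sketch of one answer, assuming the positive class of the target dataset has index 0; check train_imgs.classes to confirm) is to initialize the corresponding row of the new output layer with this pretrained weight instead of a random value:

# Copy the pretrained "hotdog" row into the output-layer row of the
# positive class (index 0 assumed here)
finetune_net.fc.weight.data[0] = hotdog_w.reshape(-1)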


weight = pretrained_net.output.weight
hotdog_w = np.split(weight.data(), 1000, axis=0)[713]
hotdog_w.shape
(1, 512)
