Chapter 6
Deep Feedforward Networks

Deep feedforward networks, also called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, as presented in chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. The name “deep learning” arose from this terminology. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data do not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, they are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. Modern neural network research, however, is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.
One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in section 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.
1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. It requires decades of human effort for each separate task, with practitioners specializing in different domains, such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic—we do so by using a very broad family φ(x; θ). Deep learning can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.
This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic mappings from x to y that lack feedback connections. Other models, presented later, apply these principles to learning stochastic mappings, functions with feedback, and probability distributions over a single vector.
We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
6.1 Example: Learning XOR

To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function (“exclusive or”) is an operation on two binary values, x_1 and x_2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.
In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, and [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared error loss function. We have chosen this loss function to simplify the math for this example as much as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data. More appropriate approaches are described in section 6.2.2.2.

Evaluated on our whole training set, the MSE loss function is

    J(θ) = (1/4) ∑_{x∈X} ( f*(x) − f(x; θ) )².    (6.1)
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

    f(x; w, b) = x⊤w + b.    (6.2)

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Figure 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.
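As an aside, the closed-form result quoted above can be checked numerically. The sketch below (a minimal NumPy example added here for illustration, not part of the original text) solves the normal equations for the XOR training set with an intercept column appended, and recovers w = 0 and b = 1/2:

```python
import numpy as np

# XOR training set: inputs and targets
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Append a column of ones so the intercept b is estimated along with w
X_aug = np.hstack([X, np.ones((4, 1))])

# Normal equations: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = theta[:2], theta[2]
print(w, b)            # w is (numerically) [0, 0] and b = 0.5
print(X_aug @ theta)   # the fitted linear model outputs 0.5 on all four points
```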
Specifically, we will introduce a simple feedforward network with one hidden layer containing two hidden units. See figure 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x.
The network now contains two functions chained together, h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).
What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = x⊤Ww. We could represent this function as f(x) = x⊤w′, where w′ = Ww.
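To make the collapse concrete, here is a quick numerical check (a minimal NumPy sketch with arbitrarily chosen shapes, added here for illustration) that composing two linear layers yields nothing beyond a single linear model with weights w′ = Ww:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # first (linear) layer: f1(x) = W^T x
w = rng.standard_normal(2)        # second layer: f2(h) = h^T w
w_prime = W @ w                   # the equivalent single linear model

x = rng.standard_normal(3)
composed = (W.T @ x) @ w          # f2(f1(x)) = x^T W w
single = x @ w_prime              # f(x) = x^T w'
print(np.allclose(composed, single))  # True: the composition gained no capacity
```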
Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.

Figure 6.1 (left panel: the original x space; right panel: the learned h space): Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x_1 = 0, the model’s output must increase as x_2 increases. When x_1 = 1, the model’s output must decrease as x_2 increases. A linear model must apply a fixed coefficient w_2 to x_2. The linear model therefore cannot use the value of x_1 to change the coefficient on x_2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h_1 and decreasing in h_2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.

Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is explicit and unambiguous, but for networks larger than this example, it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer’s activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.

Figure 6.3 (plot of g(z) = max{0, z} against z): The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. The function remains very close to linear, however, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine’s memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can then specify a solution to the XOR problem. Let

    W = [[1, 1],
         [1, 1]],    (6.4)

    c = [0, −1]⊤,    (6.5)

    w = [1, −2]⊤,    (6.6)

and b = 0.
We can now walk through how the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

    X = [[0, 0],
         [0, 1],
         [1, 0],
         [1, 1]].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer’s weight matrix:

    XW = [[0, 0],
          [1, 1],
          [1, 1],
          [2, 2]].    (6.8)

Next, we add the bias vector c, to obtain

    [[0, −1],
     [1, 0],
     [1, 0],
     [2, 1]].    (6.9)

In this space, all the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [[0, 0],
     [1, 0],
     [1, 0],
     [2, 1]].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in figure 6.1, they now lie in a space where a linear model can solve the problem.

We finish with multiplying by the weight vector w:

    [0, 1, 1, 0]⊤.    (6.11)
The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
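The batch computation of equations 6.7 to 6.11 can also be reproduced directly in code. The following is a minimal NumPy sketch (added here for illustration; it simply evaluates equation 6.3 with the parameters of equations 6.4 to 6.6):

```python
import numpy as np

# Parameters from equations 6.4-6.6
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

# Design matrix of equation 6.7, one example per row
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

H = np.maximum(0.0, X @ W + c)   # equations 6.8-6.10: affine transform, then ReLU
y = H @ w + b                    # equation 6.11: linear output layer
print(y)                         # [0. 1. 1. 0.], the XOR of each row of X
```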
6.2 Gradient-Based Learning

Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In section 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become nonconvex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory—in practice it is robust but can encounter numerical problems). Stochastic gradient descent applied to nonconvex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models are described in detail in chapter 8, with parameter initialization in particular discussed in section 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in section 4.3, and, more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in section 5.9.
We can of course train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network but can still be done efficiently and exactly. In section 6.5 we describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.
6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions enable us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in section 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks are described in chapter 7.
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = (1/2) E_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const,    (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.
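For concreteness, the single expansion step behind this equivalence is the following worked equation (added here as an aside; d denotes the dimensionality of y, a symbol introduced only for this derivation):

\begin{aligned}
-\log \mathcal{N}\big(y;\, f(x;\theta),\, I\big)
  &= -\log\!\left[(2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\,\lVert y - f(x;\theta)\rVert^{2}\right)\right] \\
  &= \tfrac{1}{2}\,\lVert y - f(x;\theta)\rVert^{2} + \tfrac{d}{2}\log(2\pi).
\end{aligned}

Taking the expectation over the empirical distribution then gives equation 6.13, with the additive constant (d/2) log(2π) absorbed into the term that does not depend on θ.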
An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function −log p(y | x).
One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Several output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in section 6.2.2.
One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.
6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y.

If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.
Our first result derived using calculus of variations is that solving the optimization problem

    f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖²    (6.14)

yields

    f*(x) = E_{y∼p_data(y|x)} [y],    (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of y for each value of x.
Different cost functions give different statistics. A second result derived using calculus of variations is that

    f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖_1    (6.16)

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
6.2.2 Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in section 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
6.2.2.1 Linear Units for Gaussian Output Distributions

One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units.

Given features h, a layer of linear output units produces a vector ŷ = W⊤h + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y | x) = N(y; ŷ, I).    (6.17)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in section 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit and threshold its value to obtain a valid probability:

    P(y = 1 | x) = max{0, min{1, w⊤h + b}}.    (6.18)

This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that w⊤h + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is defined by

    ŷ = σ(w⊤h + b),    (6.19)

where σ is the logistic sigmoid function described in section 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = w⊤h + b. Next, it uses the sigmoid activation function to convert z into a probability.
We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

    log P̃(y) = yz,    (6.20)
    P̃(y) = exp(yz),    (6.21)
    P(y) = exp(yz) / ∑_{y′=0}^{1} exp(y′z),    (6.22)
    P(y) = σ((2y − 1)z).    (6.23)

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.
This approach to predicting the probabilities in log space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(θ) = −log P(y | x)    (6.24)
         = −log σ((2y − 1)z)    (6.25)
         = ζ((1 − 2y)z).    (6.26)
This derivation makes use of some properties from section 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer—when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is useful because it means that gradient-based learning can act to quickly correct a mistaken z.
When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning when this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.
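To illustrate this point concretely, the sketch below (a minimal NumPy example added here; the function names are chosen only for this illustration) evaluates the loss of equation 6.26 directly from the logit z, using a numerically stable softplus, and compares it with the naive formula written in terms of ŷ = σ(z):

```python
import numpy as np

def softplus(x):
    # Stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll_from_logits(z, y):
    # -log P(y | x) = softplus((1 - 2y) z), written in terms of the logit z
    # rather than of sigmoid(z), so it stays finite for extreme z.
    return softplus((1.0 - 2.0 * y) * z)

def naive_nll(z, y):
    # Same loss written through y_hat = sigmoid(z); underflow/overflow in the
    # sigmoid makes log(0) appear for extreme logits.
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
print(bernoulli_nll_from_logits(z, y))  # finite everywhere: [800, 2.13, 0.69, 2.13, 800]
print(naive_nll(z, y))                  # inf (with overflow warnings) at the extreme logits
```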
6.2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.

In the case of binary variables, we wished to produce a single number

    ŷ = P(y = 1 | x).    (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ_i be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

    z = W⊤h + b,    (6.28)

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

    softmax(z)_i = exp(z_i) / ∑_j exp(z_j).    (6.29)
As with the logistic sigmoid, the use of the exp function works well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

    log softmax(z)_i = z_i − log ∑_j exp(z_j).    (6.30)
The first term of equation 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of equation 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, log ∑_j exp(z_j), observe that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the −z_i term and the log ∑_j exp(z_j) ≈ max_j z_j = z_i terms will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.
So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:

    softmax(z(x; θ))_i ≈ ∑_{j=1}^{m} 1_{y^(j)=i, x^(j)=x} / ∑_{j=1}^{m} 1_{x^(j)=x}.    (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen as long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.
Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. The softmax has multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activating function.
To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all its inputs:

    softmax(z) = softmax(z + c).    (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

    softmax(z) = softmax(z − max_i z_i).    (6.33)

The reformulated version enables us to evaluate softmax with only small numerical errors, even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.
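A minimal NumPy sketch of this stabilization (added here for illustration; the function names are chosen only for this example) follows. It implements equation 6.33 together with the corresponding log-softmax of equation 6.30:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax (equation 6.33): subtracting max_i z_i leaves
    # the output unchanged but keeps exp() from overflowing.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def log_softmax(z):
    # log softmax(z)_i = z_i - log sum_j exp(z_j) (equation 6.30), computed
    # with the same max-shift trick so the log never sees an overflowed sum.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))       # ~[0.090, 0.245, 0.665]; a naive exp(z) would overflow
print(log_softmax(z))   # finite log-probabilities
```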
An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate and can cause similar difficulties for learning if the loss function is not designed to compensate for it.
The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = W⊤h + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.
From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the extreme (when the difference between the maximal a_i and the others is large in magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1, and the others are nearly 0).
The name “softmax” can be somewhat confusing. The function is more closely related to the argmax function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The argmax function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the argmax. The corresponding soft version of the maximum function is softmax(z)⊤z. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.
6.2.2.4 Other Output Types

The linear, sigmoid, and softmax output units described above are the most common. Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).
For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in equation 3.22. In the multivariate case, it is most common to use a diagonal precision matrix

    diag(β).    (6.34)
This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division as well as squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared. Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: β = ζ(a). This same strategy applies equally if using variance or standard deviation rather than precision or if using a scalar times identity rather than diagonal matrix.
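A minimal sketch of this recipe (added here for illustration; the function and variable names, and the choice of raw activations, are assumptions made only for this example) computes the negative log-likelihood of a diagonal-precision Gaussian whose precision β = ζ(a) is produced by the network:

```python
import numpy as np

def softplus(a):
    # Stable softplus zeta(a) = log(1 + exp(a))
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

def hetero_gaussian_nll(y, mean, raw_precision):
    # Heteroscedastic Gaussian NLL with precision beta = softplus(raw_precision).
    # Up to an additive constant:
    #   -log N(y; mean, diag(beta)^(-1)) = 0.5 * sum_i [ beta_i (y_i - mean_i)^2 - log beta_i ] + const,
    # which involves only multiplication by beta_i and addition of log beta_i.
    beta = softplus(raw_precision)            # guarantees positivity
    return 0.5 * np.sum(beta * (y - mean) ** 2 - np.log(beta), axis=-1)

# Hypothetical network outputs for one example with a 3-dimensional y:
y = np.array([0.2, -1.0, 3.0])
mean = np.array([0.0, -0.8, 2.5])
raw_precision = np.array([1.0, -2.0, 0.3])    # unconstrained activations a
print(hetero_gaussian_nll(y, mean, raw_precision))
```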
It is rare to learn a covariance or precision matrix with richer structure than diagonal. If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive definiteness of the predicted covariance matrix. This can be achieved by writing Σ(x) = B(x)B⊤(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d×d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).
We often want to perform multimodal regression, that is, to predict real values from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution:

    p(y | x) = ∑_{i=1}^{n} p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).    (6.35)
The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing μ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints:
1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable¹ c, and can typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.
2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n×d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example’s contribution to the loss for each component by the probability that the component produced the example.
3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see section 10.11.1), while another is to scale the gradients heuristically (Uria et al., 2014).
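The following is a minimal sketch (added here for illustration; the function names, the log-variance parametrization, and the variance floor are assumptions of this example rather than prescriptions from the text) of the negative log-likelihood of equation 6.35 for diagonal-covariance components, computed with a log-sum-exp for stability:

```python
import numpy as np

def logsumexp(a, axis=-1):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mdn_nll(y, log_mix_weights, means, log_vars, min_var=1e-6):
    # NLL of a diagonal-covariance Gaussian mixture (equation 6.35).
    # Shapes: y is (d,), log_mix_weights is (n,) (log of the softmax outputs),
    # means is (n, d), log_vars is (n, d).
    var = np.maximum(np.exp(log_vars), min_var)   # simple variance floor
    # Per-component log N(y; mu_i, diag(var_i)):
    log_comp = -0.5 * np.sum(
        (y - means) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)
    # log p(y|x) = logsumexp_i [ log p(c=i|x) + log N(y; mu_i, Sigma_i) ]
    return -logsumexp(log_mix_weights + log_comp)

# Hypothetical network outputs for one example, n = 3 components, d = 2:
logits = np.array([0.1, -0.5, 0.2])
log_w = logits - logsumexp(logits)                # log softmax -> log p(c = i | x)
means = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]])
log_vars = np.array([[0.0, 0.0], [-1.0, -1.0], [0.5, 0.5]])
print(mdn_nll(np.array([0.9, -0.8]), log_w, means, log_vars))
```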
Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) and movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture density network is shown in figure 6.4.

¹We consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and we can make that unobserved choice a random variable.

Figure 6.4 (scatter plot of y against x): Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution, and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, if we want our neural network to output a sequence of characters that forms a sentence, we might continue to use the principle of maximum likelihood applied to our model p(y; ω(x)). In this case, the model we use to describe y would become complex enough to be beyond the scope of this chapter. In chapter 10 we describe how to use recurrent neural networks to define such models over sequences, and in part III we describe advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units

So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can help decide when to try out which unit. Predicting in advance which will work best is usually impossible. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in figure 4.3. (These ideas are described further in chapter 8.) Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.

Hidden units that are not differentiable are usually nondifferentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0, and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value ε that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the nondifferentiability of the hidden unit activation functions described below.
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W⊤x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
6.3.1 Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = max{0, z}.

These units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

    h = g(W⊤x + b).    (6.36)

When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small positive value, such as 0.1. Doing so makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.
One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. Various generalizations of rectified linear units guarantee that they receive gradient everywhere.

Three generalizations of rectified linear units are based on using a nonzero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes α_i to a small value like 0.01, while a parametric ReLU, or PReLU, treats α_i as a learnable parameter (He et al., 2015).
Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:

    g(z)_i = max_{j∈G^(i)} z_j,    (6.37)

where G^(i) is the set of indices into the inputs for group i, {(i − 1)k + 1, ..., ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.
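The following minimal NumPy sketch (added here for illustration; the function names and example values are chosen only for this purpose) shows the generalized rectifier formula and equation 6.37 side by side:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i): alpha = 0 gives the ReLU,
    # alpha = 0.01 a leaky ReLU, alpha = -1 absolute value rectification,
    # and a learned alpha gives a PReLU.
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def maxout(z, k):
    # Maxout (equation 6.37): split z into groups of k consecutive values and
    # return the maximum of each group. len(z) must be divisible by k.
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(z))                    # [0.   0.   0.5  3. ]
print(generalized_relu(z, 0.01))  # leaky ReLU
print(generalized_relu(z, -1.0))  # |z|
print(maxout(z, k=2))             # [-0.5  3. ]
```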
A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, the absolute value rectification function, or the leaky or parametric ReLU, or it can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.
Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.
Because each unit is driven by multiple filters, maxout units have some redundancy that helps them resist a phenomenon called catastrophic forgetting, in which neural networks forget how to perform tasks that they were trained on in the past (Goodfellow et al., 2014a).

Rectified linear units and all these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks. Recurrent networks can learn from sequences and produce a sequence of states and outputs. When training them, one needs to propagate information through several time steps, which is much easier when some linear computations (with some directional derivatives being of magnitude near 1) are involved. One of the best-performing recurrent network architectures, the LSTM, propagates information through time via summation—a particularly straightforward kind of linear activation. This is discussed further in section 10.10.
6.3.2 Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

    g(z) = σ(z)    (6.38)

or the hyperbolic tangent activation function

    g(z) = tanh(z).    (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.

We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity function near 0, training a deep neural network ŷ = w⊤tanh(U⊤tanh(V⊤x)) resembles training a linear model ŷ = w⊤U⊤V⊤x as long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
6.3.3 Other Hidden Units

Many other types of hidden units are possible but are used less frequently.

In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, we tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1 percent, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.
It would be impractical to list all the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.
One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W^⊤ x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W. The factored approach is to compute h = g(V^⊤ U^⊤ x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.
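As an illustration, the short Python sketch below compares the parameter counts of a full layer and its factored counterpart. The sizes n, p, and q are arbitrary choices for the example, not values from the text.

```python
import numpy as np

# Illustrative sizes (not from the text): n inputs, p outputs, bottleneck q.
n, p, q = 1024, 1024, 64

rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# Standard layer: h = g(W^T x + b), with W containing n*p parameters.
W = rng.standard_normal((n, p))
b = np.zeros(p)

# Factored layer: h = g(V^T U^T x + b), with U (n x q) and V (q x p).
U = rng.standard_normal((n, q))
V = rng.standard_normal((q, p))

relu = lambda z: np.maximum(0.0, z)

h_full = relu(W.T @ x + b)
h_factored = relu(V.T @ (U.T @ x) + b)   # constrained to a rank-q transformation

print("full parameters:    ", W.size)            # n * p
print("factored parameters:", U.size + V.size)   # (n + p) * q
```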
Softmax units are another kind of unit that is usually used as an output (as described in section 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch. These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, as described in section 10.12.
A few other reasonably common hidden unit types include:

Radial basis function (RBF) unit: h_i = exp(−(1/σ_i²) ‖W_{:,i} − x‖²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

Hard tanh: This is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).
Hidden unit design remains an active area of research, and many useful hidden unit types remain to be discovered.
6.4 Architecture Design

Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by
h^(1) = g^(1)(W^(1)⊤ x + b^(1));    (6.40)
the second layer is given by
h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2));    (6.41)
and so on.
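A minimal Python sketch of this chain structure is given below; the layer sizes, random weights, and choice of ReLU and identity activations are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_chain(x, weights, biases, activations):
    """Forward pass through a chain-structured network:
    h^(k) = g^(k)(W^(k)T h^(k-1) + b^(k)), with h^(0) = x."""
    h = x
    for W, b, g in zip(weights, biases, activations):
        h = g(W.T @ h + b)
    return h

# Illustrative two-layer chain (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(3)
W2, b2 = rng.standard_normal((3, 2)), np.zeros(2)

y = forward_chain(x, [W1, W2], [b1, b2], [relu, lambda z: z])
print(y.shape)  # (2,)
```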
In these chain-based architectures, the main architectural considerations are choosing the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks are often able to use far fewer units per layer and far fewer parameters, as well as frequently generalizing to the test set, but they also tend to be harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.
6.4.1 Universal Approximation Properties and Depth

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want our systems to learn nonlinear functions.

At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of R^n is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite dimensional discrete space to another. While the original theorems were first stated in terms of units with activation functions that saturate for both very negative and very positive arguments, universal approximation theorems have also been proved for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).
The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function. We are not guaranteed, however, that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons. First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function as a result of overfitting. Recall from section 5.2.1 that the no free lunch theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.
According to the universal approximation theorem, there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}^n is 2^(2^n), and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.
In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
Various families of functions can be approximated efficiently by an architecture with depth greater than some value d, but they require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n. Such results
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers). Figure reproduced with permission from Montufar et al. (2014).
were first proved for models that do not resemble the continuous, differentiable neural networks used for machine learning but have since been extended to these models. The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with nonnegative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency; they specify only that a sufficiently wide rectifier network could represent any function. Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions that can capture all kinds of regular (e.g., repeating) patterns.
The main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is
O( \binom{n}{d}^{d(l-1)} n^d ),    (6.42)
that is, exponential in depth l. In the case of maxout networks with k filters per unit, the number of linear regions is
O( k^{(l-1)+d} ).    (6.43)
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See figure 6.6 and figure 6.7 for examples of some of these empirical results. These results suggest that using deep architectures does indeed express a useful prior over the space of functions the model learns.
6.4.2 Other Architectural Considerations

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.
[Plot: test accuracy (percent) on the y-axis versus number of hidden layers (3 to 11) on the x-axis.]
Figure 6.6: Effect of depth. Empirical results showing that deeper networks generalize better when used to transcribe multidigit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in chapter 10, which have their own architectural considerations.

In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for decreasing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network but are often highly problem dependent. For example, convolutional
[Plot: test accuracy (percent) versus number of parameters (up to 1.0 × 10^8), with curves for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional models.]
Figure 6.7: Effect of number of parameters. Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance, as illustrated in this figure. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
networks, described in chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give more specific advice concerning the architecture of a generic neural network. In subsequent chapters we develop the particular architectural strategies that have been found to work well for different application domains.
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The input x provides the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backward through the network in order to compute the gradient.
Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.
The term back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case, where f has a single output.
6.5.1 Computational Graphs

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible.

Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in figure 6.8.
6.5.2 Chain Rule of Calculus

The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
[Figure 6.8 shows four computational graph diagrams, (a) through (d), described in the caption.]
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x^⊤ w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ Σ_i w_i².
the chain rule states that
dz/dx = (dz/dy)(dy/dx).    (6.44)
We can generalize this beyond the scalar case. Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then
∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)
In vector notation, this may be equivalently written as
∇_x z = (∂y/∂x)^⊤ ∇_y z,    (6.46)
where ∂y/∂x is the n × m Jacobian matrix of g.
From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.
Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.
To denote the gradient of a value z with respect to a tensor X, we write ∇_X z, just as if X were a vector. The indices into X now have multiple coordinates; for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how for all possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then
∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.    (6.47)
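The vector form of the chain rule, equation 6.46, can be checked numerically with a few lines of Python. The functions g and f below are arbitrary illustrative choices, and the Jacobian and gradient are written out by hand for this example.

```python
import numpy as np

# A small numerical check of equation 6.46 (illustrative functions, not from
# the text): g(x) maps R^3 -> R^2, and f(y) maps R^2 -> R.
def g(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.0, 2.0, 0.5])
y = g(x)

# Jacobian dy/dx (2 x 3) and gradient dz/dy (2,), written out by hand.
J = np.array([[x[1], x[0], 0.0],
              [0.0,  0.0,  np.cos(x[2])]])
grad_y = np.array([2.0 * y[0], 3.0])

grad_x = J.T @ grad_y            # equation 6.46: a Jacobian-gradient product

# Compare against central finite differences of z = f(g(x)).
eps = 1e-6
fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad_x, fd, atol=1e-5))  # True
```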
6.5.3 Recursively Applying the Chain Rule to Obtain Backprop

Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. Actually evaluating that expression in a computer, however, introduces some extra considerations.

Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in figure 6.9. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible. In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

We begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (algorithm 6.2 along with algorithm 6.1 for the associated forward computation), in the order it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and the construction of the symbolic graph that performs the gradient computation. Such a formulation is presented in section 6.5.6, with algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say, the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, . . . , n_i}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.
We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i + 1) and going up to u^(n). As defined in algorithm 6.1, each node u^(i) is associated with an operation f^(i) and is computed by evaluating the function
u^(i) = f(A^(i)),    (6.48)
where A^(i) is the set of all nodes that are parents of u^(i).
[Figure 6.9 shows a chain-structured computational graph w → x → y → z, with each arrow labeled f.]
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:
∂z/∂w    (6.49)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.50)
= f′(y) f′(x) f′(w)    (6.51)
= f′(f(f(w))) f′(f(w)) f′(w).    (6.52)
Equation 6.51 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.52, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of equation 6.51 is clearly preferable because of its reduced runtime. However, equation 6.52 is also a valid implementation of the chain rule and is useful when memory is limited.
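The two implementations suggested by equations 6.51 and 6.52 can be compared directly in code. The choice of f below (the sine function) is an illustrative assumption; any differentiable f would do.

```python
import math

# Illustrative choice of f (not from the text): f(u) = sin(u), so f'(u) = cos(u).
f  = math.sin
df = math.cos

w = 0.3

# Backprop-style (equation 6.51): store the intermediate activations x and y.
x = f(w)
y = f(x)
grad_stored = df(y) * df(x) * df(w)

# Memory-saving style (equation 6.52): recompute f(w) and f(f(w)) as needed.
grad_recomputed = df(f(f(w))) * df(f(w)) * df(w)

print(grad_stored == grad_recomputed)  # True; same value, different time/memory trade-off
```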
That algorithm specifies the forward propagation computation, which we could put in a graph G. To perform back-propagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G. Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):
∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))    (6.53)
as specified by algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i). To summarize, the amount of computation required for performing the back-propagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.
Algorithm 6.1 A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, . . . , n_i do
  u^(i) ← x_i
end for
for i = n_i + 1, . . . , n do
  A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
  u^(i) ← f^(i)(A^(i))
end for
return u^(n)
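A minimal Python sketch of algorithm 6.1 is given below. The graph representation (a list of nodes with explicit parent indices and per-node functions) is an assumption made for illustration, not a data structure from the text.

```python
# A minimal sketch of algorithm 6.1 in Python. The graph representation here
# (a list of nodes with explicit parent indices and per-node functions) is a
# hypothetical choice for illustration.
def forward_propagation(x, nodes):
    """x: list of n_i input values.
    nodes: list of (f, parent_indices) for nodes n_i+1 .. n, in topological order.
    Returns the list u of all node values; u[-1] is the output u^(n)."""
    u = list(x)                       # u^(1) .. u^(n_i)
    for f, parents in nodes:
        A = [u[j] for j in parents]   # A^(i): values of the parent nodes
        u.append(f(A))                # u^(i) = f^(i)(A^(i))
    return u

# Example: compute u4 = (x1 + x2) * x2 from inputs x1, x2.
nodes = [
    (lambda A: A[0] + A[1], [0, 1]),  # u3 = u1 + u2
    (lambda A: A[0] * A[1], [2, 1]),  # u4 = u3 * u2
]
u = forward_propagation([1.5, 2.0], nodes)
print(u[-1])  # 7.0
```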
The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (algorithm 6.2) visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j). Back-propagation thus avoids the exponential explosion in repeated subexpressions. Other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We revisit these ideas after describing the back-propagation algorithm itself.
Algorithm 6.2 Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), . . . , u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.

Run forward propagation (algorithm 6.1 for this example) to obtain the activations of the network.
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
  The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
  grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] ∂u^(i)/∂u^(j)
end for
return {grad_table[u^(i)] | i = 1, . . . , n_i}
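A matching Python sketch of algorithm 6.2, continuing the hypothetical graph representation used for the forward-pass sketch above, looks as follows. Each non-input node now also carries one partial-derivative function per parent; these are illustrative choices, not an API from the text.

```python
def back_propagation(u, nodes, n_inputs):
    """u: all node values from the forward pass.
    nodes: list of (f, parent_indices, partials), where partials[k](A) returns
           the derivative of the node with respect to its k-th parent.
    Returns grad_table[:n_inputs], with entry i equal to d u^(n) / d u^(i)."""
    n = len(u)
    grad_table = [0.0] * n
    grad_table[n - 1] = 1.0                      # d u^(n) / d u^(n) = 1
    for i in reversed(range(n_inputs, n)):       # visit nodes in reverse order
        f, parents, partials = nodes[i - n_inputs]
        A = [u[j] for j in parents]
        for j, dpartial in zip(parents, partials):
            grad_table[j] += grad_table[i] * dpartial(A)   # equation 6.53
    return grad_table[:n_inputs]

# Same example graph as before: u3 = u1 + u2, u4 = u3 * u2.
nodes = [
    (lambda A: A[0] + A[1], [0, 1], [lambda A: 1.0, lambda A: 1.0]),
    (lambda A: A[0] * A[1], [2, 1], [lambda A: A[1], lambda A: A[0]]),
]
u = [1.5, 2.0, 3.5, 7.0]                         # forward values for x = [1.5, 2.0]
print(back_propagation(u, nodes, 2))             # [2.0, 5.5]
```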
6.5.4 Back-Propagation Computation in Fully Connected MLP

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully connected multilayer MLP.

Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided in input.

Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph.

Algorithms 6.3 and 6.4 are demonstrations chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of back-propagation described in section 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.
Algorithm 6.3 Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see section 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See section 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, . . . , l}, the weight matrices of the model
Require: b^(i), i ∈ {1, . . . , l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, . . . , l do
  a^(k) = b^(k) + W^(k) h^(k−1)
  h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)
Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
for k = l, l − 1, . . . , 1 do
  Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
  g ← ∇_{a^(k)} J = g ⊙ f′(a^(k))
  Compute gradients on weights and biases (including the regularization term, where needed):
  ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)
  ∇_{W^(k)} J = g h^(k−1)⊤ + λ∇_{W^(k)} Ω(θ)
  Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
  g ← ∇_{h^(k−1)} J = W^(k)⊤ g
end for
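The following Python sketch instantiates algorithms 6.3 and 6.4 for a single example. The concrete choices here are assumptions made for the illustration: a ReLU nonlinearity f, the squared-error loss L(ŷ, y) = ½‖ŷ − y‖², an L2 weight penalty Ω applied to weights only, and arbitrary layer sizes.

```python
import numpy as np

def forward(x, Ws, bs):
    hs, pre = [x], []
    for W, b in zip(Ws, bs):
        a = b + W @ hs[-1]              # a^(k) = b^(k) + W^(k) h^(k-1)
        pre.append(a)
        hs.append(np.maximum(0.0, a))   # h^(k) = f(a^(k)), here f = relu
    return hs, pre, hs[-1]              # activations, pre-activations, y_hat

def backward(hs, pre, y_hat, y, Ws, lam):
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    g = y_hat - y                                   # grad of 0.5*||y_hat - y||^2
    for k in reversed(range(len(Ws))):
        g = g * (pre[k] > 0)                        # g <- g (elementwise) f'(a^(k))
        grads_b[k] = g.copy()                       # no weight decay on biases here
        grads_W[k] = np.outer(g, hs[k]) + 2 * lam * Ws[k]
        g = Ws[k].T @ g                             # propagate to h^(k-1)
    return grads_W, grads_b

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
x, y, lam = rng.standard_normal(3), np.array([1.0, -1.0]), 0.01

hs, pre, y_hat = forward(x, Ws, bs)
grads_W, grads_b = backward(hs, pre, y_hat, y, Ws, lam)
print([g.shape for g in grads_W])  # [(4, 3), (2, 4)]
```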
6.5.5 Symbol-to-Symbol Derivatives

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8].

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach symbol-to-number differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes
[Figure 6.10 shows two diagrams: on the left, the chain graph w → x → y → z with each arrow labeled f; on the right, the same graph extended with nodes computing dz/dy, dy/dx, dx/dw, dz/dx, and dz/dw.]
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how it works is illustrated in figure 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives to obtain higher derivatives. (Computation of higher-order derivatives is described in section 6.5.10.)

We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.
The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number approach does not expose the graph.
6.5.6 General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians, traveling backward through the graph in this way until we reach x. For any node that may be reached by going backward from z through two or more paths, we simply sum the gradients arriving from different paths at that node.

More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors in general can have any number of dimensions. They subsume scalars, vectors, and matrices.
We assume that each variable V is associated with the following subroutines:

get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the get_operation function. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(V) returns a pointer to an instance of the corresponding C++ class.

get_consumers(V, G): This returns the list of variables that are children of V in the computational graph G.

get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.
Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by equation 6.47. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB^⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A^⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return
Σ_i (∇_X op.f(inputs)_i) G_i,    (6.54)
which is just an implementation of the chain rule as expressed in equation 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
The op.bprop method should always pretend that all its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x², the op.bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.
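A small Python sketch of this convention follows. The class and method names (MatMul, Mul, bprop) are hypothetical, chosen to mirror the discussion above rather than the API of any particular library.

```python
import numpy as np

class MatMul:
    def f(self, A, B):
        return A @ B
    def bprop(self, inputs, X, G):
        A, B = inputs
        if X is A:                 # gradient with respect to the first argument
            return G @ B.T
        if X is B:                 # gradient with respect to the second argument
            return A.T @ G
        raise ValueError("X must be one of the operation's inputs")

class Mul:                         # element-wise product
    def f(self, a, b):
        return a * b
    def bprop(self, inputs, X, G):
        a, b = inputs
        # Pretend the inputs are distinct: return the derivative for one slot.
        return G * (b if X is a else a)

A = np.ones((2, 3)); B = np.ones((3, 4)); G = np.ones((2, 4))
print(MatMul().bprop([A, B], A, G).shape)   # (2, 3), i.e. G B^T

x = 3.0
m = Mul()
# x^2 computed as mul(x, x): bprop returns x for each slot; the caller then
# sums the two contributions to obtain the total derivative 2x.
print(m.bprop([x, x], x, 1.0) + m.bprop([x, x], x, 1.0))  # 6.0
```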
Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation or advanced users who need to add their own operation to an existing library must usually derive the op.bprop method for any new operations manually.
The back-propagation algorithm is formally described in algorithm 6.5.

In section 6.5.2, we explained that back-propagation was developed in order to avoid computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed.
Require: G, the computational graph
Require: z, the variable to be differentiated
Let G′ be G pruned to contain only nodes that are ancestors of z and descendents of nodes in T.
Initialize grad_table, a data structure associating tensors to their gradients
grad_table[z] ← 1
for V in T do
  build_grad(V, G, G′, grad_table)
end for
Return grad_table restricted to T
Algorithm 6.6 The inner loop subroutine build_grad(V, G, G′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in algorithm 6.5.

Require: V, the variable whose gradient should be added to G and grad_table
Require: G, the graph to modify
Require: G′, the restriction of G to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
if V is in grad_table then
  Return grad_table[V]
end if
i ← 1
for C in get_consumers(V, G′) do
  op ← get_operation(C)
  D ← build_grad(C, G, G′, grad_table)
  G^(i) ← op.bprop(get_inputs(C, G′), V, D)
  i ← i + 1
end for
G ← Σ_i G^(i)
grad_table[V] = G
Insert G and the operations creating it into the graph G
Return G
same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of several arithmetic operations (for example, we might have a graph that treats matrix multiplication as a single operation). Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable. For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph. We can see that computing the gradient requires at most O(n²) operations because the forward propagation stage will at worst execute all n nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with O(1) nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph it has at most O(n²) edges. For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are roughly chain-structured, causing back-propagation to have O(n) cost. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (equation 6.53) nonrecursively:
∂u^(n)/∂u^(j) = Σ_{paths (u^(π_1), u^(π_2), . . . , u^(π_t)), from π_1 = j to π_t = n} Π_{k=2}^{t} ∂u^(π_k)/∂u^(π_{k−1}).    (6.55)
Since the number of paths from node j to node n can grow exponentially in the length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for ∂u^(i)/∂u^(j) would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results ∂u^(n)/∂u^(i). Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
6.5.7 Example: Back-Propagation for MLP Training

As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent. The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW^(1)}. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW^(2). We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_MLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost
J = J_MLE + λ ( Σ_{i,j} (W^(1)_{i,j})² + Σ_{i,j} (W^(2)_{i,j})² )    (6.56)
consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in figure 6.11.
The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.
We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in figure 6.11. To train, we wish to compute both ∇_{W^(1)} J and ∇_{W^(2)} J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW^(i) to the gradient on W^(i).
The other path through the cross-entropy cost is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U^(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to
[Figure 6.11 diagram: X and W^(1) feed a matmul producing U^(1), which passes through relu to give H; H and W^(2) feed a matmul producing U^(2); U^(2) and y feed cross_entropy to give J_MLE; the squared-and-summed weights, scaled by λ, are added to J_MLE to produce J.]
Figure 6.11: The computational graph used to compute the cost to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
explore two different branches. On the shorter branch, it adds H^⊤G to the gradient on W^(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇_H J = GW^(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-propagation rule to zero out components of the gradient corresponding to entries of U^(1) that are less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X^⊤G′ to the gradient on W^(1).
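The trace above can be written out concretely in a few lines of Python. The minibatch sizes and data are illustrative, and we assume the cross_entropy operation returns the average loss over the minibatch, so its gradient on the unnormalized log probabilities is G = (softmax(U^(2)) − onehot(y)) / m; that normalization is a convention chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_hid, n_cls = 5, 4, 3, 2
X = rng.standard_normal((m, n_in))
y = rng.integers(0, n_cls, size=m)
W1 = rng.standard_normal((n_in, n_hid))
W2 = rng.standard_normal((n_hid, n_cls))
lam = 0.01

# Forward pass.
U1 = X @ W1
H = np.maximum(0.0, U1)          # relu
U2 = H @ W2                      # unnormalized log probabilities

# Backward pass.
P = np.exp(U2 - U2.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
G = P.copy()
G[np.arange(m), y] -= 1.0
G /= m                           # gradient on U2 from cross_entropy

grad_W2 = H.T @ G + 2 * lam * W2             # shorter branch + weight decay
G_prime = (G @ W2.T) * (U1 > 0)              # grad on H, zeroed where U1 < 0
grad_W1 = X.T @ G_prime + 2 * lam * W1       # longer branch + weight decay

print(grad_W1.shape, grad_W2.shape)          # (4, 3) (3, 2)
```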
After these gradients have been computed, the gradient descent algorithm, or another optimization algorithm, uses these gradients to update the parameters.

For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer. This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(m n_h), where m is the number of examples in the minibatch and n_h is the number of hidden units.
6.5.8 Complications

Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together. In the naive approach, each of these tensors would be computed separately, then all of them would be added in a second step. The naive approach has an overly high memory bottleneck that can be avoided by maintaining a single buffer and adding each value to that buffer as it is computed.

Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for handling each of these types takes special care to design.

Some operations have undefined gradients, and it is important to track these cases and determine whether the gradient requested by the user is undefined.

Various other technicalities make real-world differentiation more complicated. These technicalities are not insurmountable, and this chapter has described the key intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.
6.5.9DifferentiationoutsidetheDeepLearningCommunity
The deeplearning communityhas been somewhat isolatedfrom thebroader
computersciencecommunityandhaslargelydevelopeditsownculturalattitudes
concerninghowtoperformdifferentiation.Moregenerally,thefieldof
automatic
differentiation
isconcernedwithhowtocomputederivativesalgorithmically.
Theback-propagationalgorithmdescribedhereisonlyoneapproachtoautomatic
differentiation.Itisaspecialcaseofabroaderclassoftechniquescalled
reverse
217
CHAPTER6.DEEPFEEDFORWARDNETWORKS
modeaccumulation
.Otherapproachesevaluatethesubexpressionsofthechain
ruleindifferentorders.Ingeneral, determiningtheorderofevaluationthat
resultsinthelowestcomputationalcostisadifficultproblem.Findingtheoptimal
sequenceofoperationstocomputethegradientisNP-complete(Naumann,2008),
inthesensethatitmayrequiresimplifyingalgebraicexpressionsintotheirleast
expensiveform.
For example, suppose we have variables p_1, p_2, . . . , p_n representing probabilities, and variables z_1, z_2, . . . , z_n representing unnormalized log probabilities. Suppose we define
q_i = exp(z_i) / Σ_i exp(z_i),    (6.57)
where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = −Σ_i p_i log q_i. A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i. The back-propagation algorithm is not capable of simplifying the gradient this way and will instead explicitly propagate gradients through all the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.
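The simplified form q_i − p_i is easy to confirm numerically, as in the short check below; the particular random values of z and p are illustrative.

```python
import numpy as np

# Check that dJ/dz_i = q_i - p_i for J = -sum_i p_i log q_i with q = softmax(z).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def J(z, p):
    return -np.sum(p * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
p = softmax(rng.standard_normal(5))      # any valid probability vector

analytic = softmax(z) - p                # the simplified form q - p
eps = 1e-6
numeric = np.array([(J(z + eps * e, p) - J(z - eps * e, p)) / (2 * eps)
                    for e in np.eye(5)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```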
When the forward graph G has a single output node and each partial derivative ∂u^(i)/∂u^(j) can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in algorithm 6.2, because each local partial derivative ∂u^(i)/∂u^(j) needs to be computed only once along with an associated multiplication and addition for the recursive chain-rule formulation (equation 6.53). The overall computation is therefore O(# edges). It can potentially be reduced, however, by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task. Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns to iteratively attempt to simplify the graph.
We defined back-propagation only for computing a scalar output gradient, but back-propagation can be extended to compute a Jacobian (either of k different scalar nodes in the graph, or of a tensor-valued node containing k values). A naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient. When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode accumulation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). This approach also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory. The relationship between forward mode and backward mode is analogous to the relationship between left-multiplying versus right-multiplying a sequence of matrices, such as
ABCD,    (6.58)
where the matrices can be thought of as Jacobians. For example, if D is a column vector while A has many rows, the graph will have a single output and many inputs, and starting the multiplications from the end and going backward requires only matrix-vector products. This order corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. If A has fewer rows than D has columns, however, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
In many communities outside machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the bprop methods for every operation and limiting the user of the library to only those operations that have been defined. Yet the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, enabling the developer to improve speed or stability in nonobvious ways that an automatic procedure would presumably be unable to replicate.

Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a practical method that continues to serve the deep learning community well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.
6.5.10 Higher-Order Derivatives

Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian matrix. If we have a function f : R^n → R, then the Hessian matrix is of size n × n. In typical deep learning applications, n will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.

Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations, such as approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.
To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute
Hv = ∇_x [ (∇_x f(x))^⊤ v ].    (6.59)
Both gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression.

If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.

While computing the Hessian is usually not advisable, it is possible to do with Hessian-vector products. One simply computes He^(i) for all i = 1, . . . , n, where e^(i) is the one-hot vector with e^(i)_i = 1 and all other entries equal to 0.
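Equation 6.59 can be illustrated with a small numerical example. The function below is an arbitrary quadratic chosen so that its Hessian is known exactly; in practice both gradients in equation 6.59 would be produced by an automatic differentiation library rather than written by hand or approximated by finite differences.

```python
import numpy as np

# Illustrative function: f(x) = 0.5 x^T A x + b^T x, whose Hessian is A.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)   # symmetric Hessian
b = rng.standard_normal(n)

grad_f = lambda x: A @ x + b          # inner gradient, known in closed form here
x = rng.standard_normal(n)
v = rng.standard_normal(n)

# Outer gradient of g(x) = (grad f(x))^T v, taken by central finite differences.
g = lambda x: grad_f(x) @ v
eps = 1e-6
Hv = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(Hv, A @ v, atol=1e-5))  # True: matches the Hessian-vector product
```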
6.6 Historical Notes

Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented in the seventeenth century (Leibniz, 1676; L'Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the nineteenth century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.
Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. The ideas put forward by the authors of that book, particularly by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name "connectionism" because of the importance this school of thought places on the connections between neurons as the locus of learning and memory. In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.
The core ideas behind modern feedforward networks have not changed sub-
stantially since the 1980s. The same back-propagation algorithm and the same
approaches to gradient descent are still in use. Most of the improvement in neural
network performance from 1986 to 2015 can be attributed to two factors. First,
larger datasets have reduced the degree to which statistical generalization is a
challenge for neural networks. Second, neural networks have become much larger,
because of more powerful computers and better software infrastructure. A small
number of algorithmic changes have also improved the performance of neural
networks noticeably.

One of these algorithmic changes was the replacement of mean squared error
with the cross-entropy family of loss functions. Mean squared error was popular in
the 1980s and 1990s but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community
and the machine learning community. The use of cross-entropy losses greatly
improved the performance of models with sigmoid and softmax outputs, which
had previously suffered from saturation and slow learning when using the mean
squared error loss.
The other major algorithmic change that has greatly improved the performance
of feedforward networks was the replacement of sigmoid hidden units with piecewise
linear hidden units, such as rectified linear units. Rectification using the max{0, z}
function was introduced in early neural network models and dates back at least as far
as the cognitron and neocognitron (Fukushima, 1975, 1980). These early models did
not use rectified linear units but instead applied rectification to nonlinear functions.
Despite the early popularity of rectification, it was largely replaced by sigmoids
in the 1980s, perhaps because sigmoids perform better when neural networks are
very small. As of the early 2000s, rectified linear units were avoided because of
a somewhat superstitious belief that activation functions with nondifferentiable
points must be avoided. This began to change in about 2009. Jarrett et al. (2009)
observed that "using a rectifying nonlinearity is the single most important factor
in improving the performance of a recognition system," among several different
factors of neural network architecture design.
For small datasets, Jarrett et al. (2009) observed that using rectifying non-
linearities is even more important than learning the weights of the hidden layers.
Random weights are sufficient to propagate useful information through a rectified
linear network, enabling the classifier layer at the top to learn how to map different
feature vectors to class identities.
When more data is available, learning begins to extract enough useful knowledge
to exceed the performance of randomly chosen parameters. Glorot et al. (2011a)
showed that learning is far easier in deep rectified linear networks than in deep
networks that have curvature or two-sided saturation in their activation functions.
Rectified linear units are also of historical interest because they show that
neuroscience has continued to have an influence on the development of deep
learning algorithms. Glorot et al. (2011a) motivated rectified linear units from
biological considerations. The half-rectifying nonlinearity was intended to capture
these properties of biological neurons: (1) For some inputs, biological neurons
are completely inactive. (2) For some inputs, a biological neuron's output is
proportional to its input. (3) Most of the time, biological neurons operate in the
regime where they are inactive (i.e., they should have sparse activations).
When the modern resurgence of deep learning began in 2006, feedforward
networks continued to have a bad reputation. From about 2006 to 2012, it was
widely believed that feedforward networks would not perform well unless they were
assisted by other models, such as probabilistic models. It is now known
that with the right resources and engineering practices, feedforward networks
perform very well. Today, gradient-based learning in feedforward networks is used
as a tool to develop probabilistic models, such as the variational autoencoder
and generative adversarial networks, described in chapter 20. Rather than being
viewed as an unreliable technology that must be supported by other techniques,
gradient-based learning in feedforward networks has been viewed since 2012 as a
powerful technology that can be applied to many other machine learning tasks. In
2006, the community used unsupervised learning to support supervised learning,
and now, ironically, it is more common to use supervised learning to support
unsupervised learning.
Feedforward networks continue to have unfulfilled potential. In the future, we
expect that they will be applied to many more tasks and that advances in optimization
algorithms and model design will improve their performance even further. This
chapter has primarily described the neural network family of models. In the
subsequent chapters, we turn to how to use these models: how to regularize and
train them.