Chapter 6
Deep Feedforward Networks

Deep feedforward networks, also called feedforward neural networks, or multilayer perceptrons (MLPs), are the quintessential deep learning models. The goal of a feedforward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that result in the best function approximation.

These models are called feedforward because information flows through the function being evaluated from x, through the intermediate computations used to define f, and finally to the output y. There are no feedback connections in which outputs of the model are fed back into itself. When feedforward neural networks are extended to include feedback connections, they are called recurrent neural networks, as presented in chapter 10.

Feedforward networks are of extreme importance to machine learning practitioners. They form the basis of many important commercial applications. For example, the convolutional networks used for object recognition from photos are a specialized kind of feedforward network. Feedforward networks are a conceptual stepping stone on the path to recurrent networks, which power many natural language applications.
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing how the functions are composed together. For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain, to form f(x) = f^(3)(f^(2)(f^(1)(x))). These chain structures are the most commonly used structures of neural networks. In this case, f^(1) is called the first layer of the network, f^(2) is called the second layer, and so on. The overall length of the chain gives the depth of the model. The name “deep learning” arose from this terminology. The final layer of a feedforward network is called the output layer. During neural network training, we drive f(x) to match f*(x). The training data provides us with noisy, approximate examples of f*(x) evaluated at different training points. Each example x is accompanied by a label y ≈ f*(x). The training examples specify directly what the output layer must do at each point x; it must produce a value that is close to y. The behavior of the other layers is not directly specified by the training data. The learning algorithm must decide how to use those layers to produce the desired output, but the training data do not say what each individual layer should do. Instead, the learning algorithm must decide how to use these layers to best implement an approximation of f*. Because the training data does not show the desired output for each of these layers, they are called hidden layers.
Finally, these networks are called neural because they are loosely inspired by neuroscience. Each hidden layer of the network is typically vector valued. The dimensionality of these hidden layers determines the width of the model. Each element of the vector may be interpreted as playing a role analogous to a neuron. Rather than thinking of the layer as representing a single vector-to-vector function, we can also think of the layer as consisting of many units that act in parallel, each representing a vector-to-scalar function. Each unit resembles a neuron in the sense that it receives input from many other units and computes its own activation value. The idea of using many layers of vector-valued representations is drawn from neuroscience. The choice of the functions f^(i)(x) used to compute these representations is also loosely guided by neuroscientific observations about the functions that biological neurons compute. Modern neural network research, however, is guided by many mathematical and engineering disciplines, and the goal of neural networks is not to perfectly model the brain. It is best to think of feedforward networks as function approximation machines that are designed to achieve statistical generalization, occasionally drawing some insights from what we know about the brain, rather than as models of brain function.
One way to understand feedforward networks is to begin with linear models and consider how to overcome their limitations. Linear models, such as logistic regression and linear regression, are appealing because they can be fit efficiently and reliably, either in closed form or with convex optimization. Linear models also have the obvious defect that the model capacity is limited to linear functions, so the model cannot understand the interaction between any two input variables.

To extend linear models to represent nonlinear functions of x, we can apply the linear model not to x itself but to a transformed input φ(x), where φ is a nonlinear transformation. Equivalently, we can apply the kernel trick described in section 5.7.2, to obtain a nonlinear learning algorithm based on implicitly applying the φ mapping. We can think of φ as providing a set of features describing x, or as providing a new representation for x.

The question is then how to choose the mapping φ.
1. One option is to use a very generic φ, such as the infinite-dimensional φ that is implicitly used by kernel machines based on the RBF kernel. If φ(x) is of high enough dimension, we can always have enough capacity to fit the training set, but generalization to the test set often remains poor. Very generic feature mappings are usually based only on the principle of local smoothness and do not encode enough prior information to solve advanced problems.

2. Another option is to manually engineer φ. Until the advent of deep learning, this was the dominant approach. It requires decades of human effort for each separate task, with practitioners specializing in different domains, such as speech recognition or computer vision, and with little transfer between domains.

3. The strategy of deep learning is to learn φ. In this approach, we have a model y = f(x; θ, w) = φ(x; θ)⊤w. We now have parameters θ that we use to learn φ from a broad class of functions, and parameters w that map from φ(x) to the desired output. This is an example of a deep feedforward network, with φ defining a hidden layer. This approach is the only one of the three that gives up on the convexity of the training problem, but the benefits outweigh the harms. In this approach, we parametrize the representation as φ(x; θ) and use the optimization algorithm to find the θ that corresponds to a good representation. If we wish, this approach can capture the benefit of the first approach by being highly generic—we do so by using a very broad family φ(x; θ). Deep learning can also capture the benefit of the second approach. Human practitioners can encode their knowledge to help generalization by designing families φ(x; θ) that they expect will perform well. The advantage is that the human designer only needs to find the right general function family rather than finding precisely the right function.
This general principle of improving models by learning features extends beyond the feedforward networks described in this chapter. It is a recurring theme of deep learning that applies to all the kinds of models described throughout this book. Feedforward networks are the application of this principle to learning deterministic mappings from x to y that lack feedback connections. Other models, presented later, apply these principles to learning stochastic mappings, functions with feedback, and probability distributions over a single vector.
We begin this chapter with a simple example of a feedforward network. Next, we address each of the design decisions needed to deploy a feedforward network. First, training a feedforward network requires making many of the same design decisions as are necessary for a linear model: choosing the optimizer, the cost function, and the form of the output units. We review these basics of gradient-based learning, then proceed to confront some of the design decisions that are unique to feedforward networks. Feedforward networks have introduced the concept of a hidden layer, and this requires us to choose the activation functions that will be used to compute the hidden layer values. We must also design the architecture of the network, including how many layers the network should contain, how these layers should be connected to each other, and how many units should be in each layer. Learning in deep neural networks requires computing the gradients of complicated functions. We present the back-propagation algorithm and its modern generalizations, which can be used to efficiently compute these gradients. Finally, we close with some historical perspective.
6.1 Example: Learning XOR

To make the idea of a feedforward network more concrete, we begin with an example of a fully functioning feedforward network on a very simple task: learning the XOR function.

The XOR function (“exclusive or”) is an operation on two binary values, x_1 and x_2. When exactly one of these binary values is equal to 1, the XOR function returns 1. Otherwise, it returns 0. The XOR function provides the target function y = f*(x) that we want to learn. Our model provides a function y = f(x; θ), and our learning algorithm will adapt the parameters θ to make f as similar as possible to f*.
In this simple example, we will not be concerned with statistical generalization. We want our network to perform correctly on the four points X = {[0, 0]⊤, [0, 1]⊤, [1, 0]⊤, and [1, 1]⊤}. We will train the network on all four of these points. The only challenge is to fit the training set.
We can treat this problem as a regression problem and use a mean squared error loss function. We have chosen this loss function to simplify the math for this example as much as possible. In practical applications, MSE is usually not an appropriate cost function for modeling binary data. More appropriate approaches are described in section 6.2.2.2.

Evaluated on our whole training set, the MSE loss function is

    J(θ) = (1/4) ∑_{x∈X} ( f*(x) − f(x; θ) )².    (6.1)
Now we must choose the form of our model, f(x; θ). Suppose that we choose a linear model, with θ consisting of w and b. Our model is defined to be

    f(x; w, b) = x⊤w + b.    (6.2)

We can minimize J(θ) in closed form with respect to w and b using the normal equations.

After solving the normal equations, we obtain w = 0 and b = 1/2. The linear model simply outputs 0.5 everywhere. Why does this happen? Figure 6.1 shows how a linear model is not able to represent the XOR function. One way to solve this problem is to use a model that learns a different feature space in which a linear model is able to represent the solution.
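As an aside, the closed-form result quoted above can be checked numerically. The sketch below (a minimal NumPy example added here for illustration, not part of the original text) solves the normal equations for the XOR training set with an intercept column appended, and recovers w = 0 and b = 1/2:

```python
import numpy as np

# XOR training set: inputs and targets
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 0.0])

# Append a column of ones so the intercept b is estimated along with w
X_aug = np.hstack([X, np.ones((4, 1))])

# Normal equations: theta = (X^T X)^{-1} X^T y
theta = np.linalg.solve(X_aug.T @ X_aug, X_aug.T @ y)
w, b = theta[:2], theta[2]
print(w, b)            # w is (numerically) [0, 0] and b = 0.5
print(X_aug @ theta)   # the fitted linear model outputs 0.5 on all four points
```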
Specifically, we will introduce a simple feedforward network with one hidden layer containing two hidden units. See figure 6.2 for an illustration of this model. This feedforward network has a vector of hidden units h that are computed by a function f^(1)(x; W, c). The values of these hidden units are then used as the input for a second layer. The second layer is the output layer of the network. The output layer is still just a linear regression model, but now it is applied to h rather than to x.
The network now contains two functions chained together, h = f^(1)(x; W, c) and y = f^(2)(h; w, b), with the complete model being f(x; W, c, w, b) = f^(2)(f^(1)(x)).
What function should f^(1) compute? Linear models have served us well so far, and it may be tempting to make f^(1) linear as well. Unfortunately, if f^(1) were linear, then the feedforward network as a whole would remain a linear function of its input. Ignoring the intercept terms for the moment, suppose f^(1)(x) = W⊤x and f^(2)(h) = h⊤w. Then f(x) = x⊤Ww. We could represent this function as f(x) = x⊤w′, where w′ = Ww.
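To make the collapse concrete, here is a quick numerical check (a minimal NumPy sketch with arbitrarily chosen shapes, added here for illustration) that composing two linear layers yields nothing beyond a single linear model with weights w′ = Ww:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))   # first (linear) layer: f1(x) = W^T x
w = rng.standard_normal(2)        # second layer: f2(h) = h^T w
w_prime = W @ w                   # the equivalent single linear model

x = rng.standard_normal(3)
composed = (W.T @ x) @ w          # f2(f1(x)) = x^T W w
single = x @ w_prime              # f(x) = x^T w'
print(np.allclose(composed, single))  # True: the composition gained no capacity
```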
Clearly, we must use a nonlinear function to describe the features. Most neural networks do so using an affine transformation controlled by learned parameters, followed by a fixed nonlinear function called an activation function. We use that strategy here, by defining h = g(W⊤x + c), where W provides the weights of a linear transformation and c the biases. Previously, to describe a linear regression model, we used a vector of weights and a scalar bias parameter to describe an affine transformation from an input vector to an output scalar. Now, we describe an affine transformation from a vector x to a vector h, so an entire vector of bias parameters is needed. The activation function g is typically chosen to be a function that is applied element-wise, with h_i = g(x⊤W_{:,i} + c_i). In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU (Jarrett et al., 2009; Nair and Hinton, 2010; Glorot et al., 2011a), defined by the activation function g(z) = max{0, z}, depicted in figure 6.3.

Figure 6.1 (left panel: the original x space; right panel: the learned h space): Solving the XOR problem by learning a representation. The bold numbers printed on the plot indicate the value that the learned function must output at each point. (Left) A linear model applied directly to the original input cannot implement the XOR function. When x_1 = 0, the model’s output must increase as x_2 increases. When x_1 = 1, the model’s output must decrease as x_2 increases. A linear model must apply a fixed coefficient w_2 to x_2. The linear model therefore cannot use the value of x_1 to change the coefficient on x_2 and cannot solve this problem. (Right) In the transformed space represented by the features extracted by a neural network, a linear model can now solve the problem. In our example solution, the two points that must have output 1 have been collapsed into a single point in feature space. In other words, the nonlinear features have mapped both x = [1, 0]⊤ and x = [0, 1]⊤ to a single point in feature space, h = [1, 0]⊤. The linear model can now describe the function as increasing in h_1 and decreasing in h_2. In this example, the motivation for learning the feature space is only to make the model capacity greater so that it can fit the training set. In more realistic applications, learned representations can also help the model to generalize.

Figure 6.2: An example of a feedforward network, drawn in two different styles. Specifically, this is the feedforward network we use to solve the XOR example. It has a single hidden layer containing two units. (Left) In this style, we draw every unit as a node in the graph. This style is explicit and unambiguous, but for networks larger than this example, it can consume too much space. (Right) In this style, we draw a node in the graph for each entire vector representing a layer’s activations. This style is much more compact. Sometimes we annotate the edges in this graph with the name of the parameters that describe the relationship between two layers. Here, we indicate that a matrix W describes the mapping from x to h, and a vector w describes the mapping from h to y. We typically omit the intercept parameters associated with each layer when labeling this kind of drawing.

Figure 6.3 (plot of g(z) = max{0, z} against z): The rectified linear activation function. This activation function is the default activation function recommended for use with most feedforward neural networks. Applying this function to the output of a linear transformation yields a nonlinear transformation. The function remains very close to linear, however, in the sense that it is a piecewise linear function with two linear pieces. Because rectified linear units are nearly linear, they preserve many of the properties that make linear models easy to optimize with gradient-based methods. They also preserve many of the properties that make linear models generalize well. A common principle throughout computer science is that we can build complicated systems from minimal components. Much as a Turing machine’s memory needs only to be able to store 0 or 1 states, we can build a universal function approximator from rectified linear functions.
We can now specify our complete network as

    f(x; W, c, w, b) = w⊤ max{0, W⊤x + c} + b.    (6.3)

We can then specify a solution to the XOR problem. Let

    W = [[1, 1],
         [1, 1]],    (6.4)

    c = [0, −1]⊤,    (6.5)

    w = [1, −2]⊤,    (6.6)

and b = 0.
We can now walk through how the model processes a batch of inputs. Let X be the design matrix containing all four points in the binary input space, with one example per row:

    X = [[0, 0],
         [0, 1],
         [1, 0],
         [1, 1]].    (6.7)

The first step in the neural network is to multiply the input matrix by the first layer’s weight matrix:

    XW = [[0, 0],
          [1, 1],
          [1, 1],
          [2, 2]].    (6.8)

Next, we add the bias vector c, to obtain

    [[0, −1],
     [1, 0],
     [1, 0],
     [2, 1]].    (6.9)

In this space, all the examples lie along a line with slope 1. As we move along this line, the output needs to begin at 0, then rise to 1, then drop back down to 0. A linear model cannot implement such a function. To finish computing the value of h for each example, we apply the rectified linear transformation:

    [[0, 0],
     [1, 0],
     [1, 0],
     [2, 1]].    (6.10)

This transformation has changed the relationship between the examples. They no longer lie on a single line. As shown in figure 6.1, they now lie in a space where a linear model can solve the problem.

We finish with multiplying by the weight vector w:

    [0, 1, 1, 0]⊤.    (6.11)
The neural network has obtained the correct answer for every example in the batch.

In this example, we simply specified the solution, then showed that it obtained zero error. In a real situation, there might be billions of model parameters and billions of training examples, so one cannot simply guess the solution as we did here. Instead, a gradient-based optimization algorithm can find parameters that produce very little error. The solution we described to the XOR problem is at a global minimum of the loss function, so gradient descent could converge to this point. There are other equivalent solutions to the XOR problem that gradient descent could also find. The convergence point of gradient descent depends on the initial values of the parameters. In practice, gradient descent would usually not find clean, easily understood, integer-valued solutions like the one we presented here.
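The batch computation of equations 6.7 to 6.11 can also be reproduced directly in code. The following is a minimal NumPy sketch (added here for illustration; it simply evaluates equation 6.3 with the parameters of equations 6.4 to 6.6):

```python
import numpy as np

# Parameters from equations 6.4-6.6
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

# Design matrix of equation 6.7, one example per row
X = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0],
              [1.0, 1.0]])

H = np.maximum(0.0, X @ W + c)   # equations 6.8-6.10: affine transform, then ReLU
y = H @ w + b                    # equation 6.11: linear output layer
print(y)                         # [0. 1. 1. 0.], the XOR of each row of X
```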
6.2 Gradient-Based Learning

Designing and training a neural network is not much different from training any other machine learning model with gradient descent. In section 5.10, we described how to build a machine learning algorithm by specifying an optimization procedure, a cost function, and a model family.

The largest difference between the linear models we have seen so far and neural networks is that the nonlinearity of a neural network causes most interesting loss functions to become nonconvex. This means that neural networks are usually trained by using iterative, gradient-based optimizers that merely drive the cost function to a very low value, rather than the linear equation solvers used to train linear regression models or the convex optimization algorithms with global convergence guarantees used to train logistic regression or SVMs. Convex optimization converges starting from any initial parameters (in theory—in practice it is robust but can encounter numerical problems). Stochastic gradient descent applied to nonconvex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters. For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. The iterative gradient-based optimization algorithms used to train feedforward networks and almost all other deep models are described in detail in chapter 8, with parameter initialization in particular discussed in section 8.4. For the moment, it suffices to understand that the training algorithm is almost always based on using the gradient to descend the cost function in one way or another. The specific algorithms are improvements and refinements on the ideas of gradient descent, introduced in section 4.3, and, more specifically, are most often improvements of the stochastic gradient descent algorithm, introduced in section 5.9.
We can of course train models such as linear regression and support vector machines with gradient descent too, and in fact this is common when the training set is extremely large. From this point of view, training a neural network is not much different from training any other model. Computing the gradient is slightly more complicated for a neural network but can still be done efficiently and exactly. In section 6.5 we describe how to obtain the gradient using the back-propagation algorithm and modern generalizations of the back-propagation algorithm.

As with other machine learning models, to apply gradient-based learning we must choose a cost function, and we must choose how to represent the output of the model. We now revisit these design considerations with special emphasis on the neural networks scenario.
6.2.1 Cost Functions

An important aspect of the design of a deep neural network is the choice of the cost function. Fortunately, the cost functions for neural networks are more or less the same as those for other parametric models, such as linear models.

In most cases, our parametric model defines a distribution p(y | x; θ) and we simply use the principle of maximum likelihood. This means we use the cross-entropy between the training data and the model’s predictions as the cost function.

Sometimes, we take a simpler approach, where rather than predicting a complete probability distribution over y, we merely predict some statistic of y conditioned on x. Specialized loss functions enable us to train a predictor of these estimates.

The total cost function used to train a neural network will often combine one of the primary cost functions described here with a regularization term. We have already seen some simple examples of regularization applied to linear models in section 5.2.2. The weight decay approach used for linear models is also directly applicable to deep neural networks and is among the most popular regularization strategies. More advanced regularization strategies for neural networks are described in chapter 7.
6.2.1.1 Learning Conditional Distributions with Maximum Likelihood

Most modern neural networks are trained using maximum likelihood. This means that the cost function is simply the negative log-likelihood, equivalently described as the cross-entropy between the training data and the model distribution. This cost function is given by

    J(θ) = −E_{x,y∼p̂_data} log p_model(y | x).    (6.12)

The specific form of the cost function changes from model to model, depending on the specific form of log p_model. The expansion of the above equation typically yields some terms that do not depend on the model parameters and may be discarded. For example, as we saw in section 5.5.1, if p_model(y | x) = N(y; f(x; θ), I), then we recover the mean squared error cost,

    J(θ) = (1/2) E_{x,y∼p̂_data} ‖y − f(x; θ)‖² + const,    (6.13)

up to a scaling factor of 1/2 and a term that does not depend on θ. The discarded constant is based on the variance of the Gaussian distribution, which in this case we chose not to parametrize. Previously, we saw that the equivalence between maximum likelihood estimation with an output distribution and minimization of mean squared error holds for a linear model, but in fact, the equivalence holds regardless of the f(x; θ) used to predict the mean of the Gaussian.
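For concreteness, the single expansion step behind this equivalence is the following worked equation (added here as an aside; d denotes the dimensionality of y, a symbol introduced only for this derivation):

\begin{aligned}
-\log \mathcal{N}\big(y;\, f(x;\theta),\, I\big)
  &= -\log\!\left[(2\pi)^{-d/2} \exp\!\left(-\tfrac{1}{2}\,\lVert y - f(x;\theta)\rVert^{2}\right)\right] \\
  &= \tfrac{1}{2}\,\lVert y - f(x;\theta)\rVert^{2} + \tfrac{d}{2}\log(2\pi).
\end{aligned}

Taking the expectation over the empirical distribution then gives equation 6.13, with the additive constant (d/2) log(2π) absorbed into the term that does not depend on θ.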
An advantage of this approach of deriving the cost function from maximum likelihood is that it removes the burden of designing cost functions for each model. Specifying a model p(y | x) automatically determines a cost function −log p(y | x).
One recurring theme throughout neural network design is that the gradient of the cost function must be large and predictable enough to serve as a good guide for the learning algorithm. Functions that saturate (become very flat) undermine this objective because they make the gradient become very small. In many cases this happens because the activation functions used to produce the output of the hidden units or the output units saturate. The negative log-likelihood helps to avoid this problem for many models. Several output units involve an exp function that can saturate when its argument is very negative. The log function in the negative log-likelihood cost function undoes the exp of some output units. We will discuss the interaction between the cost function and the choice of output unit in section 6.2.2.
One unusual property of the cross-entropy cost used to perform maximum likelihood estimation is that it usually does not have a minimum value when applied to the models commonly used in practice. For discrete output variables, most models are parametrized in such a way that they cannot represent a probability of zero or one, but can come arbitrarily close to doing so. Logistic regression is an example of such a model. For real-valued output variables, if the model can control the density of the output distribution (for example, by learning the variance parameter of a Gaussian output distribution) then it becomes possible to assign extremely high density to the correct training set outputs, resulting in cross-entropy approaching negative infinity. Regularization techniques described in chapter 7 provide several different ways of modifying the learning problem so that the model cannot reap unlimited reward in this way.
6.2.1.2 Learning Conditional Statistics

Instead of learning a full probability distribution p(y | x; θ), we often want to learn just one conditional statistic of y given x.

For example, we may have a predictor f(x; θ) that we wish to employ to predict the mean of y.

If we use a sufficiently powerful neural network, we can think of the neural network as being able to represent any function f from a wide class of functions, with this class being limited only by features such as continuity and boundedness rather than by having a specific parametric form. From this point of view, we can view the cost function as being a functional rather than just a function. A functional is a mapping from functions to real numbers. We can thus think of learning as choosing a function rather than merely choosing a set of parameters. We can design our cost functional to have its minimum occur at some specific function we desire. For example, we can design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x. Solving an optimization problem with respect to a function requires a mathematical tool called calculus of variations, described in section 19.4.2. It is not necessary to understand calculus of variations to understand the content of this chapter. At the moment, it is only necessary to understand that calculus of variations may be used to derive the following two results.
Our first result derived using calculus of variations is that solving the optimization problem

    f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖²    (6.14)

yields

    f*(x) = E_{y∼p_data(y|x)} [y],    (6.15)

so long as this function lies within the class we optimize over. In other words, if we could train on infinitely many samples from the true data generating distribution, minimizing the mean squared error cost function would give a function that predicts the mean of y for each value of x.
Different cost functions give different statistics. A second result derived using calculus of variations is that

    f* = argmin_f E_{x,y∼p_data} ‖y − f(x)‖_1    (6.16)

yields a function that predicts the median value of y for each x, as long as such a function may be described by the family of functions we optimize over. This cost function is commonly called mean absolute error.

Unfortunately, mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units that saturate produce very small gradients when combined with these cost functions. This is one reason that the cross-entropy cost function is more popular than mean squared error or mean absolute error, even when it is not necessary to estimate an entire distribution p(y | x).
6.2.2 Output Units

The choice of cost function is tightly coupled with the choice of output unit. Most of the time, we simply use the cross-entropy between the data distribution and the model distribution. The choice of how to represent the output then determines the form of the cross-entropy function.

Any kind of neural network unit that may be used as an output can also be used as a hidden unit. Here, we focus on the use of these units as outputs of the model, but in principle they can be used internally as well. We revisit these units with additional detail about their use as hidden units in section 6.3.

Throughout this section, we suppose that the feedforward network provides a set of hidden features defined by h = f(x; θ). The role of the output layer is then to provide some additional transformation from the features to complete the task that the network must perform.
6.2.2.1 Linear Units for Gaussian Output Distributions

One simple kind of output unit is based on an affine transformation with no nonlinearity. These are often just called linear units.

Given features h, a layer of linear output units produces a vector ŷ = W⊤h + b.

Linear output layers are often used to produce the mean of a conditional Gaussian distribution:

    p(y | x) = N(y; ŷ, I).    (6.17)

Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.

The maximum likelihood framework makes it straightforward to learn the covariance of the Gaussian too, or to make the covariance of the Gaussian be a function of the input. However, the covariance must be constrained to be a positive definite matrix for all inputs. It is difficult to satisfy such constraints with a linear output layer, so typically other output units are used to parametrize the covariance. Approaches to modeling the covariance are described shortly, in section 6.2.2.4.

Because linear units do not saturate, they pose little difficulty for gradient-based optimization algorithms and may be used with a wide variety of optimization algorithms.
6.2.2.2 Sigmoid Units for Bernoulli Output Distributions

Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form.

The maximum likelihood approach is to define a Bernoulli distribution over y conditioned on x.

A Bernoulli distribution is defined by just a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].

Satisfying this constraint requires some careful design effort. Suppose we were to use a linear unit and threshold its value to obtain a valid probability:

    P(y = 1 | x) = max{0, min{1, w⊤h + b}}.    (6.18)

This would indeed define a valid conditional distribution, but we would not be able to train it very effectively with gradient descent. Any time that w⊤h + b strayed outside the unit interval, the gradient of the output of the model with respect to its parameters would be 0. A gradient of 0 is typically problematic because the learning algorithm no longer has a guide for how to improve the corresponding parameters.

Instead, it is better to use a different approach that ensures there is always a strong gradient whenever the model has the wrong answer. This approach is based on using sigmoid output units combined with maximum likelihood.

A sigmoid output unit is defined by

    ŷ = σ(w⊤h + b),    (6.19)

where σ is the logistic sigmoid function described in section 3.10.

We can think of the sigmoid output unit as having two components. First, it uses a linear layer to compute z = w⊤h + b. Next, it uses the sigmoid activation function to convert z into a probability.
We omit the dependence on x for the moment to discuss how to define a probability distribution over y using the value z. The sigmoid can be motivated by constructing an unnormalized probability distribution P̃(y), which does not sum to 1. We can then divide by an appropriate constant to obtain a valid probability distribution. If we begin with the assumption that the unnormalized log probabilities are linear in y and z, we can exponentiate to obtain the unnormalized probabilities. We then normalize to see that this yields a Bernoulli distribution controlled by a sigmoidal transformation of z:

    log P̃(y) = yz,    (6.20)
    P̃(y) = exp(yz),    (6.21)
    P(y) = exp(yz) / ∑_{y′=0}^{1} exp(y′z),    (6.22)
    P(y) = σ((2y − 1)z).    (6.23)

Probability distributions based on exponentiation and normalization are common throughout the statistical modeling literature. The z variable defining such a distribution over binary variables is called a logit.
This approach to predicting the probabilities in log space is natural to use with maximum likelihood learning. Because the cost function used with maximum likelihood is −log P(y | x), the log in the cost function undoes the exp of the sigmoid. Without this effect, the saturation of the sigmoid could prevent gradient-based learning from making good progress. The loss function for maximum likelihood learning of a Bernoulli parametrized by a sigmoid is

    J(θ) = −log P(y | x)    (6.24)
         = −log σ((2y − 1)z)    (6.25)
         = ζ((1 − 2y)z).    (6.26)
This derivation makes use of some properties from section 3.10. By rewriting the loss in terms of the softplus function, we can see that it saturates only when (1 − 2y)z is very negative. Saturation thus occurs only when the model already has the right answer—when y = 1 and z is very positive, or y = 0 and z is very negative. When z has the wrong sign, the argument to the softplus function, (1 − 2y)z, may be simplified to |z|. As |z| becomes large while z has the wrong sign, the softplus function asymptotes toward simply returning its argument |z|. The derivative with respect to z asymptotes to sign(z), so, in the limit of extremely incorrect z, the softplus function does not shrink the gradient at all. This property is useful because it means that gradient-based learning can act to quickly correct a mistaken z.
When we use other loss functions, such as mean squared error, the loss can saturate anytime σ(z) saturates. The sigmoid activation function saturates to 0 when z becomes very negative and saturates to 1 when z becomes very positive. The gradient can shrink too small to be useful for learning when this happens, whether the model has the correct answer or the incorrect answer. For this reason, maximum likelihood is almost always the preferred approach to training sigmoid output units.
Analytically, the logarithm of the sigmoid is always defined and finite, because the sigmoid returns values restricted to the open interval (0, 1), rather than using the entire closed interval of valid probabilities [0, 1]. In software implementations, to avoid numerical problems, it is best to write the negative log-likelihood as a function of z, rather than as a function of ŷ = σ(z). If the sigmoid function underflows to zero, then taking the logarithm of ŷ yields negative infinity.
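To illustrate this point concretely, the sketch below (a minimal NumPy example added here; the function names are chosen only for this illustration) evaluates the loss of equation 6.26 directly from the logit z, using a numerically stable softplus, and compares it with the naive formula written in terms of ŷ = σ(z):

```python
import numpy as np

def softplus(x):
    # Stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return np.maximum(x, 0.0) + np.log1p(np.exp(-np.abs(x)))

def bernoulli_nll_from_logits(z, y):
    # -log P(y | x) = softplus((1 - 2y) z), written in terms of the logit z
    # rather than of sigmoid(z), so it stays finite for extreme z.
    return softplus((1.0 - 2.0 * y) * z)

def naive_nll(z, y):
    # Same loss written through y_hat = sigmoid(z); underflow/overflow in the
    # sigmoid makes log(0) appear for extreme logits.
    p = 1.0 / (1.0 + np.exp(-z))
    return -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

z = np.array([-800.0, -2.0, 0.0, 2.0, 800.0])
y = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
print(bernoulli_nll_from_logits(z, y))  # finite everywhere: [800, 2.13, 0.69, 2.13, 800]
print(naive_nll(z, y))                  # inf (with overflow warnings) at the extreme logits
```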
6.2.2.3 Softmax Units for Multinoulli Output Distributions

Any time we wish to represent a probability distribution over a discrete variable with n possible values, we may use the softmax function. This can be seen as a generalization of the sigmoid function, which was used to represent a probability distribution over a binary variable.

Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. More rarely, softmax functions can be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.

In the case of binary variables, we wished to produce a single number

    ŷ = P(y = 1 | x).    (6.27)

Because this number needed to lie between 0 and 1, and because we wanted the logarithm of the number to be well behaved for gradient-based optimization of the log-likelihood, we chose to instead predict a number z = log P̃(y = 1 | x). Exponentiating and normalizing gave us a Bernoulli distribution controlled by the sigmoid function.
To generalize to the case of a discrete variable with n values, we now need to produce a vector ŷ, with ŷ_i = P(y = i | x). We require not only that each element of ŷ_i be between 0 and 1, but also that the entire vector sums to 1 so that it represents a valid probability distribution. The same approach that worked for the Bernoulli distribution generalizes to the multinoulli distribution. First, a linear layer predicts unnormalized log probabilities:

    z = W⊤h + b,    (6.28)

where z_i = log P̃(y = i | x). The softmax function can then exponentiate and normalize z to obtain the desired ŷ. Formally, the softmax function is given by

    softmax(z)_i = exp(z_i) / ∑_j exp(z_j).    (6.29)
As with the logistic sigmoid, the use of the exp function works well when training the softmax to output a target value y using maximum log-likelihood. In this case, we wish to maximize log P(y = i; z) = log softmax(z)_i. Defining the softmax in terms of exp is natural because the log in the log-likelihood can undo the exp of the softmax:

    log softmax(z)_i = z_i − log ∑_j exp(z_j).    (6.30)
The first term of equation 6.30 shows that the input z_i always has a direct contribution to the cost function. Because this term cannot saturate, we know that learning can proceed, even if the contribution of z_i to the second term of equation 6.30 becomes very small. When maximizing the log-likelihood, the first term encourages z_i to be pushed up, while the second term encourages all of z to be pushed down. To gain some intuition for the second term, log ∑_j exp(z_j), observe that this term can be roughly approximated by max_j z_j. This approximation is based on the idea that exp(z_k) is insignificant for any z_k that is noticeably less than max_j z_j. The intuition we can gain from this approximation is that the negative log-likelihood cost function always strongly penalizes the most active incorrect prediction. If the correct answer already has the largest input to the softmax, then the −z_i term and the log ∑_j exp(z_j) ≈ max_j z_j = z_i terms will roughly cancel. This example will then contribute little to the overall training cost, which will be dominated by other examples that are not yet correctly classified.
So far we have discussed only a single example. Overall, unregularized maximum likelihood will drive the model to learn parameters that drive the softmax to predict the fraction of counts of each outcome observed in the training set:

    softmax(z(x; θ))_i ≈ ∑_{j=1}^{m} 1_{y^(j)=i, x^(j)=x} / ∑_{j=1}^{m} 1_{x^(j)=x}.    (6.31)

Because maximum likelihood is a consistent estimator, this is guaranteed to happen as long as the model family is capable of representing the training distribution. In practice, limited model capacity and imperfect optimization will mean that the model is only able to approximate these fractions.
Many objective functions other than the log-likelihood do not work as well with the softmax function. Specifically, objective functions that do not use a log to undo the exp of the softmax fail to learn when the argument to the exp becomes very negative, causing the gradient to vanish. In particular, squared error is a poor loss function for softmax units and can fail to train the model to change its output, even when the model makes highly confident incorrect predictions (Bridle, 1990). To understand why these other loss functions can fail, we need to examine the softmax function itself.

Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. The softmax has multiple output values. These output values can saturate when the differences between input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activating function.
To see that the softmax function responds to the difference between its inputs, observe that the softmax output is invariant to adding the same scalar to all its inputs:

    softmax(z) = softmax(z + c).    (6.32)

Using this property, we can derive a numerically stable variant of the softmax:

    softmax(z) = softmax(z − max_i z_i).    (6.33)

The reformulated version enables us to evaluate softmax with only small numerical errors, even when z contains extremely large or extremely negative numbers. Examining the numerically stable variant, we see that the softmax function is driven by the amount that its arguments deviate from max_i z_i.
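A minimal NumPy sketch of this stabilization (added here for illustration; the function names are chosen only for this example) follows. It implements equation 6.33 together with the corresponding log-softmax of equation 6.30:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax (equation 6.33): subtracting max_i z_i leaves
    # the output unchanged but keeps exp() from overflowing.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

def log_softmax(z):
    # log softmax(z)_i = z_i - log sum_j exp(z_j) (equation 6.30), computed
    # with the same max-shift trick so the log never sees an overflowed sum.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    return shifted - np.log(np.sum(np.exp(shifted), axis=-1, keepdims=True))

z = np.array([1000.0, 1001.0, 1002.0])
print(softmax(z))       # ~[0.090, 0.245, 0.665]; a naive exp(z) would overflow
print(log_softmax(z))   # finite log-probabilities
```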
An output softmax(z)_i saturates to 1 when the corresponding input is maximal (z_i = max_i z_i) and z_i is much greater than all the other inputs. The output softmax(z)_i can also saturate to 0 when z_i is not maximal and the maximum is much greater. This is a generalization of the way that sigmoid units saturate and can cause similar difficulties for learning if the loss function is not designed to compensate for it.
The argument z to the softmax function can be produced in two different ways. The most common is simply to have an earlier layer of the neural network output every element of z, as described above using the linear layer z = W⊤h + b. While straightforward, this approach actually overparametrizes the distribution. The constraint that the n outputs must sum to 1 means that only n − 1 parameters are necessary; the probability of the n-th value may be obtained by subtracting the first n − 1 probabilities from 1. We can thus impose a requirement that one element of z be fixed. For example, we can require that z_n = 0. Indeed, this is exactly what the sigmoid unit does. Defining P(y = 1 | x) = σ(z) is equivalent to defining P(y = 1 | x) = softmax(z)_1 with a two-dimensional z and z_1 = 0. Both the n − 1 argument and the n argument approaches to the softmax can describe the same set of probability distributions but have different learning dynamics. In practice, there is rarely much difference between using the overparametrized version or the restricted version, and it is simpler to implement the overparametrized version.
From a neuroscientific point of view, it is interesting to think of the softmax as a way to create a form of competition between the units that participate in it: the softmax outputs always sum to 1 so an increase in the value of one unit necessarily corresponds to a decrease in the value of others. This is analogous to the lateral inhibition that is believed to exist between nearby neurons in the cortex. At the extreme (when the difference between the maximal a_i and the others is large in magnitude) it becomes a form of winner-take-all (one of the outputs is nearly 1, and the others are nearly 0).
The name “softmax” can be somewhat confusing. The function is more closely related to the argmax function than the max function. The term “soft” derives from the fact that the softmax function is continuous and differentiable. The argmax function, with its result represented as a one-hot vector, is not continuous or differentiable. The softmax function thus provides a “softened” version of the argmax. The corresponding soft version of the maximum function is softmax(z)⊤z. It would perhaps be better to call the softmax function “softargmax,” but the current name is an entrenched convention.
6.2.2.4 Other Output Types

The linear, sigmoid, and softmax output units described above are the most common. Neural networks can generalize to almost any kind of output layer that we wish. The principle of maximum likelihood provides a guide for how to design a good cost function for nearly any kind of output layer.

In general, if we define a conditional distribution p(y | x; θ), the principle of maximum likelihood suggests we use −log p(y | x; θ) as our cost function.

In general, we can think of the neural network as representing a function f(x; θ). The outputs of this function are not direct predictions of the value y. Instead, f(x; θ) = ω provides the parameters for a distribution over y. Our loss function can then be interpreted as −log p(y; ω(x)).
For example, we may wish to learn the variance of a conditional Gaussian for y, given x. In the simple case, where the variance σ² is a constant, there is a closed form expression because the maximum likelihood estimator of variance is simply the empirical mean of the squared difference between observations y and their expected value. A computationally more expensive approach that does not require writing special-case code is to simply include the variance as one of the properties of the distribution p(y | x) that is controlled by ω = f(x; θ). The negative log-likelihood −log p(y; ω(x)) will then provide a cost function with the appropriate terms necessary to make our optimization procedure incrementally learn the variance. In the simple case where the standard deviation does not depend on the input, we can make a new parameter in the network that is copied directly into ω. This new parameter might be σ itself or could be a parameter v representing σ², or it could be a parameter β representing 1/σ², depending on how we choose to parametrize the distribution. We may wish our model to predict a different amount of variance in y for different values of x. This is called a heteroscedastic model. In the heteroscedastic case, we simply make the specification of the variance be one of the values output by f(x; θ). A typical way to do this is to formulate the Gaussian distribution using precision, rather than variance, as described in equation 3.22. In the multivariate case, it is most common to use a diagonal precision matrix

    diag(β).    (6.34)
This formulation works well with gradient descent because the formula for the log-likelihood of the Gaussian distribution parametrized by β involves only multiplication by β_i and addition of log β_i. The gradient of multiplication, addition, and logarithm operations is well behaved. By comparison, if we parametrized the output in terms of variance, we would need to use division. The division function becomes arbitrarily steep near zero. While large gradients can help learning, arbitrarily large gradients usually result in instability. If we parametrized the output in terms of standard deviation, the log-likelihood would still involve division as well as squaring. The gradient through the squaring operation can vanish near zero, making it difficult to learn parameters that are squared. Regardless of whether we use standard deviation, variance, or precision, we must ensure that the covariance matrix of the Gaussian is positive definite. Because the eigenvalues of the precision matrix are the reciprocals of the eigenvalues of the covariance matrix, this is equivalent to ensuring that the precision matrix is positive definite. If we use a diagonal matrix, or a scalar times the diagonal matrix, then the only condition we need to enforce on the output of the model is positivity. If we suppose that a is the raw activation of the model used to determine the diagonal precision, we can use the softplus function to obtain a positive precision vector: β = ζ(a). This same strategy applies equally if using variance or standard deviation rather than precision or if using a scalar times identity rather than diagonal matrix.
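A minimal sketch of this recipe (added here for illustration; the function and variable names, and the choice of raw activations, are assumptions made only for this example) computes the negative log-likelihood of a diagonal-precision Gaussian whose precision β = ζ(a) is produced by the network:

```python
import numpy as np

def softplus(a):
    # Stable softplus zeta(a) = log(1 + exp(a))
    return np.maximum(a, 0.0) + np.log1p(np.exp(-np.abs(a)))

def hetero_gaussian_nll(y, mean, raw_precision):
    # Heteroscedastic Gaussian NLL with precision beta = softplus(raw_precision).
    # Up to an additive constant:
    #   -log N(y; mean, diag(beta)^(-1)) = 0.5 * sum_i [ beta_i (y_i - mean_i)^2 - log beta_i ] + const,
    # which involves only multiplication by beta_i and addition of log beta_i.
    beta = softplus(raw_precision)            # guarantees positivity
    return 0.5 * np.sum(beta * (y - mean) ** 2 - np.log(beta), axis=-1)

# Hypothetical network outputs for one example with a 3-dimensional y:
y = np.array([0.2, -1.0, 3.0])
mean = np.array([0.0, -0.8, 2.5])
raw_precision = np.array([1.0, -2.0, 0.3])    # unconstrained activations a
print(hetero_gaussian_nll(y, mean, raw_precision))
```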
It is rare to learn a covariance or precision matrix with richer structure than diagonal. If the covariance is full and conditional, then a parametrization must be chosen that guarantees positive definiteness of the predicted covariance matrix. This can be achieved by writing Σ(x) = B(x)B⊤(x), where B is an unconstrained square matrix. One practical issue if the matrix is full rank is that computing the likelihood is expensive, with a d×d matrix requiring O(d³) computation for the determinant and inverse of Σ(x) (or equivalently, and more commonly done, its eigendecomposition or that of B(x)).
We often want to perform multimodal regression, that is, to predict real values from a conditional distribution p(y | x) that can have several different peaks in y space for the same value of x. In this case, a Gaussian mixture is a natural representation for the output (Jacobs et al., 1991; Bishop, 1994). Neural networks with Gaussian mixtures as their output are often called mixture density networks. A Gaussian mixture output with n components is defined by the conditional probability distribution:

    p(y | x) = ∑_{i=1}^{n} p(c = i | x) N(y; μ^(i)(x), Σ^(i)(x)).    (6.35)
The neural network must have three outputs: a vector defining p(c = i | x), a matrix providing μ^(i)(x) for all i, and a tensor providing Σ^(i)(x) for all i. These outputs must satisfy different constraints:
1. Mixture components p(c = i | x): these form a multinoulli distribution over the n different components associated with latent variable¹ c, and can typically be obtained by a softmax over an n-dimensional vector, to guarantee that these outputs are positive and sum to 1.
2. Means μ^(i)(x): these indicate the center or mean associated with the i-th Gaussian component and are unconstrained (typically with no nonlinearity at all for these output units). If y is a d-vector, then the network must output an n×d matrix containing all n of these d-dimensional vectors. Learning these means with maximum likelihood is slightly more complicated than learning the means of a distribution with only one output mode. We only want to update the mean for the component that actually produced the observation. In practice, we do not know which component produced each observation. The expression for the negative log-likelihood naturally weights each example’s contribution to the loss for each component by the probability that the component produced the example.
3. Covariances Σ^(i)(x): these specify the covariance matrix for each component i. As when learning a single Gaussian component, we typically use a diagonal matrix to avoid needing to compute determinants. As with learning the means of the mixture, maximum likelihood is complicated by needing to assign partial responsibility for each point to each mixture component. Gradient descent will automatically follow the correct process if given the correct specification of the negative log-likelihood under the mixture model.
It has been reported that gradient-based optimization of conditional Gaussian mixtures (on the output of neural networks) can be unreliable, in part because one gets divisions (by the variance) which can be numerically unstable (when some variance gets to be small for a particular example, yielding very large gradients). One solution is to clip gradients (see section 10.11.1), while another is to scale the gradients heuristically (Uria et al., 2014).
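The following is a minimal sketch (added here for illustration; the function names, the log-variance parametrization, and the variance floor are assumptions of this example rather than prescriptions from the text) of the negative log-likelihood of equation 6.35 for diagonal-covariance components, computed with a log-sum-exp for stability:

```python
import numpy as np

def logsumexp(a, axis=-1):
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mdn_nll(y, log_mix_weights, means, log_vars, min_var=1e-6):
    # NLL of a diagonal-covariance Gaussian mixture (equation 6.35).
    # Shapes: y is (d,), log_mix_weights is (n,) (log of the softmax outputs),
    # means is (n, d), log_vars is (n, d).
    var = np.maximum(np.exp(log_vars), min_var)   # simple variance floor
    # Per-component log N(y; mu_i, diag(var_i)):
    log_comp = -0.5 * np.sum(
        (y - means) ** 2 / var + np.log(2.0 * np.pi * var), axis=-1)
    # log p(y|x) = logsumexp_i [ log p(c=i|x) + log N(y; mu_i, Sigma_i) ]
    return -logsumexp(log_mix_weights + log_comp)

# Hypothetical network outputs for one example, n = 3 components, d = 2:
logits = np.array([0.1, -0.5, 0.2])
log_w = logits - logsumexp(logits)                # log softmax -> log p(c = i | x)
means = np.array([[0.0, 0.0], [1.0, -1.0], [2.0, 2.0]])
log_vars = np.array([[0.0, 0.0], [-1.0, -1.0], [0.5, 0.5]])
print(mdn_nll(np.array([0.9, -0.8]), log_w, means, log_vars))
```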
Gaussian mixture outputs are particularly effective in generative models of speech (Schuster, 1999) and movements of physical objects (Graves, 2013). The mixture density strategy gives a way for the network to represent multiple output modes and to control the variance of its output, which is crucial for obtaining a high degree of quality in these real-valued domains. An example of a mixture density network is shown in figure 6.4.

¹We consider c to be latent because we do not observe it in the data: given input x and target y, it is not possible to know with certainty which Gaussian component was responsible for y, but we can imagine that y was generated by picking one of them, and we can make that unobserved choice a random variable.

Figure 6.4 (scatter plot of y against x): Samples drawn from a neural network with a mixture density output layer. The input x is sampled from a uniform distribution, and the output y is sampled from p_model(y | x). The neural network is able to learn nonlinear mappings from the input to the parameters of the output distribution. These parameters include the probabilities governing which of three mixture components will generate the output as well as the parameters for each mixture component. Each mixture component is Gaussian with predicted mean and variance. All these aspects of the output distribution are able to vary with respect to the input x, and to do so in nonlinear ways.
In general, we may wish to continue to model larger vectors y containing more variables, and to impose richer and richer structures on these output variables. For example, if we want our neural network to output a sequence of characters that forms a sentence, we might continue to use the principle of maximum likelihood applied to our model p(y; ω(x)). In this case, the model we use to describe y would become complex enough to be beyond the scope of this chapter. In chapter 10 we describe how to use recurrent neural networks to define such models over sequences, and in part III we describe advanced techniques for modeling arbitrary probability distributions.
6.3 Hidden Units

So far we have focused our discussion on design choices for neural networks that are common to most parametric machine learning models trained with gradient-based optimization. Now we turn to an issue that is unique to feedforward neural networks: how to choose the type of hidden unit to use in the hidden layers of the model.

The design of hidden units is an extremely active area of research and does not yet have many definitive guiding theoretical principles.

Rectified linear units are an excellent default choice of hidden unit. Many other types of hidden units are available. It can be difficult to determine when to use which kind (though rectified linear units are usually an acceptable choice). We describe here some of the basic intuitions motivating each type of hidden unit. These intuitions can help decide when to try out which unit. Predicting in advance which will work best is usually impossible. The design process consists of trial and error, intuiting that a kind of hidden unit may work well, and then training a network with that kind of hidden unit and evaluating its performance on a validation set.
Some of the hidden units included in this list are not actually differentiable at all input points. For example, the rectified linear function g(z) = max{0, z} is not differentiable at z = 0. This may seem like it invalidates g for use with a gradient-based learning algorithm. In practice, gradient descent still performs well enough for these models to be used for machine learning tasks. This is in part because neural network training algorithms do not usually arrive at a local minimum of the cost function, but instead merely reduce its value significantly, as shown in figure 4.3. (These ideas are described further in chapter 8.) Because we do not expect training to actually reach a point where the gradient is 0, it is acceptable for the minima of the cost function to correspond to points with undefined gradient.

Hidden units that are not differentiable are usually nondifferentiable at only a small number of points. In general, a function g(z) has a left derivative defined by the slope of the function immediately to the left of z and a right derivative defined by the slope of the function immediately to the right of z. A function is differentiable at z only if both the left derivative and the right derivative are defined and equal to each other. The functions used in the context of neural networks usually have defined left derivatives and defined right derivatives. In the case of g(z) = max{0, z}, the left derivative at z = 0 is 0, and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value ε that was rounded to 0. In some contexts, more theoretically pleasing justifications are available, but these usually do not apply to neural network training. The important point is that in practice one can safely disregard the nondifferentiability of the hidden unit activation functions described below.
Unless indicated otherwise, most hidden units can be described as accepting a vector of inputs x, computing an affine transformation z = W⊤x + b, and then applying an element-wise nonlinear function g(z). Most hidden units are distinguished from each other only by the choice of the form of the activation function g(z).
6.3.1 Rectified Linear Units and Their Generalizations

Rectified linear units use the activation function g(z) = max{0, z}.

These units are easy to optimize because they are so similar to linear units. The only difference between a linear unit and a rectified linear unit is that a rectified linear unit outputs zero across half its domain. This makes the derivatives through a rectified linear unit remain large whenever the unit is active. The gradients are not only large but also consistent. The second derivative of the rectifying operation is 0 almost everywhere, and the derivative of the rectifying operation is 1 everywhere that the unit is active. This means that the gradient direction is far more useful for learning than it would be with activation functions that introduce second-order effects.

Rectified linear units are typically used on top of an affine transformation:

    h = g(W⊤x + b).    (6.36)

When initializing the parameters of the affine transformation, it can be a good practice to set all elements of b to a small positive value, such as 0.1. Doing so makes it very likely that the rectified linear units will be initially active for most inputs in the training set and allow the derivatives to pass through.
One drawback to rectified linear units is that they cannot learn via gradient-based methods on examples for which their activation is zero. Various generalizations of rectified linear units guarantee that they receive gradient everywhere.

Three generalizations of rectified linear units are based on using a nonzero slope α_i when z_i < 0: h_i = g(z, α)_i = max(0, z_i) + α_i min(0, z_i). Absolute value rectification fixes α_i = −1 to obtain g(z) = |z|. It is used for object recognition from images (Jarrett et al., 2009), where it makes sense to seek features that are invariant under a polarity reversal of the input illumination. Other generalizations of rectified linear units are more broadly applicable. A leaky ReLU (Maas et al., 2013) fixes α_i to a small value like 0.01, while a parametric ReLU, or PReLU, treats α_i as a learnable parameter (He et al., 2015).
Maxout units (Goodfellow et al., 2013a) generalize rectified linear units further. Instead of applying an element-wise function g(z), maxout units divide z into groups of k values. Each maxout unit then outputs the maximum element of one of these groups:

    g(z)_i = max_{j∈G^(i)} z_j,    (6.37)

where G^(i) is the set of indices into the inputs for group i, {(i − 1)k + 1, ..., ik}. This provides a way of learning a piecewise linear function that responds to multiple directions in the input x space.
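The following minimal NumPy sketch (added here for illustration; the function names and example values are chosen only for this purpose) shows the generalized rectifier formula and equation 6.37 side by side:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def generalized_relu(z, alpha):
    # h_i = max(0, z_i) + alpha_i * min(0, z_i): alpha = 0 gives the ReLU,
    # alpha = 0.01 a leaky ReLU, alpha = -1 absolute value rectification,
    # and a learned alpha gives a PReLU.
    return np.maximum(0.0, z) + alpha * np.minimum(0.0, z)

def maxout(z, k):
    # Maxout (equation 6.37): split z into groups of k consecutive values and
    # return the maximum of each group. len(z) must be divisible by k.
    return z.reshape(-1, k).max(axis=1)

z = np.array([-2.0, -0.5, 0.5, 3.0])
print(relu(z))                    # [0.   0.   0.5  3. ]
print(generalized_relu(z, 0.01))  # leaky ReLU
print(generalized_relu(z, -1.0))  # |z|
print(maxout(z, k=2))             # [-0.5  3. ]
```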
A maxout unit can learn a piecewise linear, convex function with up to k pieces. Maxout units can thus be seen as learning the activation function itself rather than just the relationship between units. With large enough k, a maxout unit can learn to approximate any convex function with arbitrary fidelity. In particular, a maxout layer with two pieces can learn to implement the same function of the input x as a traditional layer using the rectified linear activation function, the absolute value rectification function, or the leaky or parametric ReLU, or it can learn to implement a totally different function altogether. The maxout layer will of course be parametrized differently from any of these other layer types, so the learning dynamics will be different even in the cases where maxout learns to implement the same function of x as one of the other layer types.
Each maxout unit is now parametrized by k weight vectors instead of just one, so maxout units typically need more regularization than rectified linear units. They can work well without regularization if the training set is large and the number of pieces per unit is kept low (Cai et al., 2013).

Maxout units have a few other benefits. In some cases, one can gain some statistical and computational advantages by requiring fewer parameters. Specifically, if the features captured by n different linear filters can be summarized without losing information by taking the max over each group of k features, then the next layer can get by with k times fewer weights.
Because each unit is driven by multiple filters, maxout units have some redundancy that helps them resist a phenomenon called catastrophic forgetting, in which neural networks forget how to perform tasks that they were trained on in the past (Goodfellow et al., 2014a).

Rectified linear units and all these generalizations of them are based on the principle that models are easier to optimize if their behavior is closer to linear. This same general principle of using linear behavior to obtain easier optimization also applies in other contexts besides deep linear networks. Recurrent networks can learn from sequences and produce a sequence of states and outputs. When training them, one needs to propagate information through several time steps, which is much easier when some linear computations (with some directional derivatives being of magnitude near 1) are involved. One of the best-performing recurrent network architectures, the LSTM, propagates information through time via summation—a particularly straightforward kind of linear activation. This is discussed further in section 10.10.
6.3.2 Logistic Sigmoid and Hyperbolic Tangent

Prior to the introduction of rectified linear units, most neural networks used the logistic sigmoid activation function

    g(z) = σ(z)    (6.38)

or the hyperbolic tangent activation function

    g(z) = tanh(z).    (6.39)

These activation functions are closely related because tanh(z) = 2σ(2z) − 1.

We have already seen sigmoid units as output units, used to predict the probability that a binary variable is 1. Unlike piecewise linear units, sigmoidal units saturate across most of their domain—they saturate to a high value when z is very positive, saturate to a low value when z is very negative, and are only strongly sensitive to their input when z is near 0. The widespread saturation of sigmoidal units can make gradient-based learning very difficult. For this reason, their use as hidden units in feedforward networks is now discouraged. Their use as output units is compatible with the use of gradient-based learning when an appropriate cost function can undo the saturation of the sigmoid in the output layer.
When a sigmoidal activation function must be used, the hyperbolic tangent activation function typically performs better than the logistic sigmoid. It resembles the identity function more closely, in the sense that tanh(0) = 0 while σ(0) = 1/2. Because tanh is similar to the identity function near 0, training a deep neural network ŷ = w⊤tanh(U⊤tanh(V⊤x)) resembles training a linear model ŷ = w⊤U⊤V⊤x as long as the activations of the network can be kept small. This makes training the tanh network easier.

Sigmoidal activation functions are more common in settings other than feedforward networks. Recurrent networks, many probabilistic models, and some autoencoders have additional requirements that rule out the use of piecewise linear activation functions and make sigmoidal units more appealing despite the drawbacks of saturation.
6.3.3 Other Hidden Units

Many other types of hidden units are possible but are used less frequently.

In general, a wide variety of differentiable functions perform perfectly well. Many unpublished activation functions perform just as well as the popular ones. To provide a concrete example, we tested a feedforward network using h = cos(Wx + b) on the MNIST dataset and obtained an error rate of less than 1 percent, which is competitive with results obtained using more conventional activation functions. During research and development of new techniques, it is common to test many different activation functions and find that several variations on standard practice perform comparably. This means that usually new hidden unit types are published only if they are clearly demonstrated to provide a significant improvement. New hidden unit types that perform roughly comparably to known types are so common as to be uninteresting.
It would be impractical to list all the hidden unit types that have appeared in the literature. We highlight a few especially useful and distinctive ones.
One possibility is to not have an activation g(z) at all. One can also think of this as using the identity function as the activation function. We have already seen that a linear unit can be useful as the output of a neural network. It may also be used as a hidden unit. If every layer of the neural network consists of only linear transformations, then the network as a whole will be linear. However, it is acceptable for some layers of the neural network to be purely linear. Consider a neural network layer with n inputs and p outputs, h = g(W^⊤ x + b). We may replace this with two layers, with one layer using weight matrix U and the other using weight matrix V. If the first layer has no activation function, then we have essentially factored the weight matrix of the original layer based on W. The factored approach is to compute h = g(V^⊤ U^⊤ x + b). If U produces q outputs, then U and V together contain only (n + p)q parameters, while W contains np parameters. For small q, this can be a considerable saving in parameters. It comes at the cost of constraining the linear transformation to be low rank, but these low-rank relationships are often sufficient. Linear hidden units thus offer an effective way of reducing the number of parameters in a network.
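As an illustration, the short Python sketch below compares the parameter counts of a full layer and its factored counterpart. The sizes n, p, and q are arbitrary choices for the example, not values from the text.

```python
import numpy as np

# Illustrative sizes (not from the text): n inputs, p outputs, bottleneck q.
n, p, q = 1024, 1024, 64

rng = np.random.default_rng(0)
x = rng.standard_normal(n)

# Standard layer: h = g(W^T x + b), with W containing n*p parameters.
W = rng.standard_normal((n, p))
b = np.zeros(p)

# Factored layer: h = g(V^T U^T x + b), with U (n x q) and V (q x p).
U = rng.standard_normal((n, q))
V = rng.standard_normal((q, p))

relu = lambda z: np.maximum(0.0, z)

h_full = relu(W.T @ x + b)
h_factored = relu(V.T @ (U.T @ x) + b)   # constrained to a rank-q transformation

print("full parameters:    ", W.size)            # n * p
print("factored parameters:", U.size + V.size)   # (n + p) * q
```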
Softmax units are another kind of unit that is usually used as an output (as described in section 6.2.2.3) but may sometimes be used as a hidden unit. Softmax units naturally represent a probability distribution over a discrete variable with k possible values, so they may be used as a kind of switch. These kinds of hidden units are usually only used in more advanced architectures that explicitly learn to manipulate memory, as described in section 10.12.
A few other reasonably common hidden unit types include:

Radial basis function (RBF) unit: h_i = exp(−(1/σ_i²) ‖W_{:,i} − x‖²). This function becomes more active as x approaches a template W_{:,i}. Because it saturates to 0 for most x, it can be difficult to optimize.

Softplus: g(a) = ζ(a) = log(1 + e^a). This is a smooth version of the rectifier, introduced by Dugas et al. (2001) for function approximation and by Nair and Hinton (2010) for the conditional distributions of undirected probabilistic models. Glorot et al. (2011a) compared the softplus and rectifier and found better results with the latter. The use of the softplus is generally discouraged. The softplus demonstrates that the performance of hidden unit types can be very counterintuitive: one might expect it to have an advantage over the rectifier due to being differentiable everywhere or due to saturating less completely, but empirically it does not.

Hard tanh: This is shaped similarly to the tanh and the rectifier, but unlike the latter, it is bounded, g(a) = max(−1, min(1, a)). It was introduced by Collobert (2004).
Hidden unit design remains an active area of research, and many useful hidden unit types remain to be discovered.
6.4 Architecture Design

Another key design consideration for neural networks is determining the architecture. The word architecture refers to the overall structure of the network: how many units it should have and how these units should be connected to each other.

Most neural networks are organized into groups of units called layers. Most neural network architectures arrange these layers in a chain structure, with each layer being a function of the layer that preceded it. In this structure, the first layer is given by
h^(1) = g^(1)(W^(1)⊤ x + b^(1));    (6.40)
the second layer is given by
h^(2) = g^(2)(W^(2)⊤ h^(1) + b^(2));    (6.41)
and so on.
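A minimal Python sketch of this chain structure is given below; the layer sizes, random weights, and choice of ReLU and identity activations are illustrative assumptions, not prescriptions from the text.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def forward_chain(x, weights, biases, activations):
    """Forward pass through a chain-structured network:
    h^(k) = g^(k)(W^(k)T h^(k-1) + b^(k)), with h^(0) = x."""
    h = x
    for W, b, g in zip(weights, biases, activations):
        h = g(W.T @ h + b)
    return h

# Illustrative two-layer chain (sizes chosen arbitrarily).
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((4, 3)), np.zeros(3)
W2, b2 = rng.standard_normal((3, 2)), np.zeros(2)

y = forward_chain(x, [W1, W2], [b1, b2], [relu, lambda z: z])
print(y.shape)  # (2,)
```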
In these chain-based architectures, the main architectural considerations are choosing the depth of the network and the width of each layer. As we will see, a network with even one hidden layer is sufficient to fit the training set. Deeper networks are often able to use far fewer units per layer and far fewer parameters, as well as frequently generalizing to the test set, but they also tend to be harder to optimize. The ideal network architecture for a task must be found via experimentation guided by monitoring the validation set error.
6.4.1 Universal Approximation Properties and Depth

A linear model, mapping from features to outputs via matrix multiplication, can by definition represent only linear functions. It has the advantage of being easy to train because many loss functions result in convex optimization problems when applied to linear models. Unfortunately, we often want our systems to learn nonlinear functions.

At first glance, we might presume that learning a nonlinear function requires designing a specialized model family for the kind of nonlinearity we want to learn. Fortunately, feedforward networks with hidden layers provide a universal approximation framework. Specifically, the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989) states that a feedforward network with a linear output layer and at least one hidden layer with any "squashing" activation function (such as the logistic sigmoid activation function) can approximate any Borel measurable function from one finite-dimensional space to another with any desired nonzero amount of error, provided that the network is given enough hidden units. The derivatives of the feedforward network can also approximate the derivatives of the function arbitrarily well (Hornik et al., 1990). The concept of Borel measurability is beyond the scope of this book; for our purposes it suffices to say that any continuous function on a closed and bounded subset of R^n is Borel measurable and therefore may be approximated by a neural network. A neural network may also approximate any function mapping from any finite dimensional discrete space to another. While the original theorems were first stated in terms of units with activation functions that saturate for both very negative and very positive arguments, universal approximation theorems have also been proved for a wider class of activation functions, which includes the now commonly used rectified linear unit (Leshno et al., 1993).
The universal approximation theorem means that regardless of what function we are trying to learn, we know that a large MLP will be able to represent this function. We are not guaranteed, however, that the training algorithm will be able to learn that function. Even if the MLP is able to represent the function, learning can fail for two different reasons. First, the optimization algorithm used for training may not be able to find the value of the parameters that corresponds to the desired function. Second, the training algorithm might choose the wrong function as a result of overfitting. Recall from section 5.2.1 that the no free lunch theorem shows that there is no universally superior machine learning algorithm. Feedforward networks provide a universal system for representing functions in the sense that, given a function, there exists a feedforward network that approximates the function. There is no universal procedure for examining a training set of specific examples and choosing a function that will generalize to points not in the training set.
According to the universal approximation theorem, there exists a network large enough to achieve any degree of accuracy we desire, but the theorem does not say how large this network will be. Barron (1993) provides some bounds on the size of a single-layer network needed to approximate a broad class of functions. Unfortunately, in the worst case, an exponential number of hidden units (possibly with one hidden unit corresponding to each input configuration that needs to be distinguished) may be required. This is easiest to see in the binary case: the number of possible binary functions on vectors v ∈ {0, 1}^n is 2^(2^n), and selecting one such function requires 2^n bits, which will in general require O(2^n) degrees of freedom.
In summary, a feedforward network with a single layer is sufficient to represent any function, but the layer may be infeasibly large and may fail to learn and generalize correctly. In many circumstances, using deeper models can reduce the number of units required to represent the desired function and can reduce the amount of generalization error.
Various families of functions can be approximated efficiently by an architecture with depth greater than some value d, but they require a much larger model if depth is restricted to be less than or equal to d. In many cases, the number of hidden units required by the shallow model is exponential in n. Such results
Figure 6.5: An intuitive, geometric explanation of the exponential advantage of deeper rectifier networks, shown formally by Montufar et al. (2014). (Left) An absolute value rectification unit has the same output for every pair of mirror points in its input. The mirror axis of symmetry is given by the hyperplane defined by the weights and bias of the unit. A function computed on top of that unit (the green decision surface) will be a mirror image of a simpler pattern across that axis of symmetry. (Center) The function can be obtained by folding the space around the axis of symmetry. (Right) Another repeating pattern can be folded on top of the first (by another downstream unit) to obtain another symmetry (which is now repeated four times, with two hidden layers). Figure reproduced with permission from Montufar et al. (2014).
were first proved for models that do not resemble the continuous, differentiable neural networks used for machine learning but have since been extended to these models. The first results were for circuits of logic gates (Håstad, 1986). Later work extended these results to linear threshold units with nonnegative weights (Håstad and Goldmann, 1991; Hajnal et al., 1993), and then to networks with continuous-valued activations (Maass, 1992; Maass et al., 1994). Many modern neural networks use rectified linear units. Leshno et al. (1993) demonstrated that shallow networks with a broad family of non-polynomial activation functions, including rectified linear units, have universal approximation properties, but these results do not address the questions of depth or efficiency; they specify only that a sufficiently wide rectifier network could represent any function. Montufar et al. (2014) showed that functions representable with a deep rectifier net can require an exponential number of hidden units with a shallow (one hidden layer) network. More precisely, they showed that piecewise linear networks (which can be obtained from rectifier nonlinearities or maxout units) can represent functions with a number of regions that is exponential in the depth of the network. Figure 6.5 illustrates how a network with absolute value rectification creates mirror images of the function computed on top of some hidden unit, with respect to the input of that hidden unit. Each hidden unit specifies where to fold the input space in order to create mirror responses (on both sides of the absolute value nonlinearity). By composing these folding operations, we obtain an exponentially large number of piecewise linear regions that can capture all kinds of regular (e.g., repeating) patterns.
The main theorem in Montufar et al. (2014) states that the number of linear regions carved out by a deep rectifier network with d inputs, depth l, and n units per hidden layer is
O( \binom{n}{d}^{d(l-1)} n^d ),    (6.42)
that is, exponential in depth l. In the case of maxout networks with k filters per unit, the number of linear regions is
O( k^{(l-1)+d} ).    (6.43)
Of course, there is no guarantee that the kinds of functions we want to learn in applications of machine learning (and in particular for AI) share such a property.

We may also want to choose a deep model for statistical reasons. Any time we choose a specific machine learning algorithm, we are implicitly stating some set of prior beliefs we have about what kind of function the algorithm should learn. Choosing a deep model encodes a very general belief that the function we want to learn should involve composition of several simpler functions. This can be interpreted from a representation learning point of view as saying that we believe the learning problem consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation. Alternately, we can interpret the use of a deep architecture as expressing a belief that the function we want to learn is a computer program consisting of multiple steps, where each step makes use of the previous step's output. These intermediate outputs are not necessarily factors of variation but can instead be analogous to counters or pointers that the network uses to organize its internal processing. Empirically, greater depth does seem to result in better generalization for a wide variety of tasks (Bengio et al., 2007; Erhan et al., 2009; Bengio, 2009; Mesnil et al., 2011; Ciresan et al., 2012; Krizhevsky et al., 2012; Sermanet et al., 2013; Farabet et al., 2013; Couprie et al., 2013; Kahou et al., 2013; Goodfellow et al., 2014d; Szegedy et al., 2014a). See figure 6.6 and figure 6.7 for examples of some of these empirical results. These results suggest that using deep architectures does indeed express a useful prior over the space of functions the model learns.
6.4.2 Other Architectural Considerations

So far we have described neural networks as being simple chains of layers, with the main considerations being the depth of the network and the width of each layer. In practice, neural networks show considerably more diversity.
[Plot: test accuracy (percent) on the y-axis versus number of hidden layers (3 to 11) on the x-axis.]
Figure 6.6: Effect of depth. Empirical results showing that deeper networks generalize better when used to transcribe multidigit numbers from photographs of addresses. Data from Goodfellow et al. (2014d). The test set accuracy consistently increases with increasing depth. See figure 6.7 for a control experiment demonstrating that other increases to the model size do not yield the same effect.
Many neural network architectures have been developed for specific tasks. Specialized architectures for computer vision called convolutional networks are described in chapter 9. Feedforward networks may also be generalized to the recurrent neural networks for sequence processing, described in chapter 10, which have their own architectural considerations.

In general, the layers need not be connected in a chain, even though this is the most common practice. Many architectures build a main chain but then add extra architectural features to it, such as skip connections going from layer i to layer i + 2 or higher. These skip connections make it easier for the gradient to flow from output layers to layers nearer the input.
Another key consideration of architecture design is exactly how to connect a pair of layers to each other. In the default neural network layer described by a linear transformation via a matrix W, every input unit is connected to every output unit. Many specialized networks in the chapters ahead have fewer connections, so that each unit in the input layer is connected to only a small subset of units in the output layer. These strategies for decreasing the number of connections reduce the number of parameters and the amount of computation required to evaluate the network but are often highly problem dependent. For example, convolutional
[Plot: test accuracy (percent) versus number of parameters (up to 1.0 × 10^8), with curves for 3-layer convolutional, 3-layer fully connected, and 11-layer convolutional models.]
Figure 6.7: Effect of number of parameters. Deeper models tend to perform better. This is not merely because the model is larger. This experiment from Goodfellow et al. (2014d) shows that increasing the number of parameters in layers of convolutional networks without increasing their depth is not nearly as effective at increasing test set performance, as illustrated in this figure. The legend indicates the depth of network used to make each curve and whether the curve represents variation in the size of the convolutional or the fully connected layers. We observe that shallow models in this context overfit at around 20 million parameters while deep ones can benefit from having over 60 million. This suggests that using a deep model expresses a useful preference over the space of functions the model can learn. Specifically, it expresses a belief that the function should consist of many simpler functions composed together. This could result either in learning a representation that is composed in turn of simpler representations (e.g., corners defined in terms of edges) or in learning a program with sequentially dependent steps (e.g., first locate a set of objects, then segment them from each other, then recognize them).
networks, described in chapter 9, use specialized patterns of sparse connections that are very effective for computer vision problems. In this chapter, it is difficult to give more specific advice concerning the architecture of a generic neural network. In subsequent chapters we develop the particular architectural strategies that have been found to work well for different application domains.
6.5 Back-Propagation and Other Differentiation Algorithms

When we use a feedforward neural network to accept an input x and produce an output ŷ, information flows forward through the network. The input x provides the initial information that then propagates up to the hidden units at each layer and finally produces ŷ. This is called forward propagation. During training, forward propagation can continue onward until it produces a scalar cost J(θ). The back-propagation algorithm (Rumelhart et al., 1986a), often simply called backprop, allows the information from the cost to then flow backward through the network in order to compute the gradient.
Computing an analytical expression for the gradient is straightforward, but numerically evaluating such an expression can be computationally expensive. The back-propagation algorithm does so using a simple and inexpensive procedure.
The term back-propagation is often misunderstood as meaning the whole learning algorithm for multilayer neural networks. Actually, back-propagation refers only to the method for computing the gradient, while another algorithm, such as stochastic gradient descent, is used to perform learning using this gradient. Furthermore, back-propagation is often misunderstood as being specific to multilayer neural networks, but in principle it can compute derivatives of any function (for some functions, the correct response is to report that the derivative of the function is undefined). Specifically, we will describe how to compute the gradient ∇_x f(x, y) for an arbitrary function f, where x is a set of variables whose derivatives are desired, and y is an additional set of variables that are inputs to the function but whose derivatives are not required. In learning algorithms, the gradient we most often require is the gradient of the cost function with respect to the parameters, ∇_θ J(θ). Many machine learning tasks involve computing other derivatives, either as part of the learning process, or to analyze the learned model. The back-propagation algorithm can be applied to these tasks as well and is not restricted to computing the gradient of the cost function with respect to the parameters. The idea of computing derivatives by propagating information through a network is very general and can be used to compute values such as the Jacobian of a function f with multiple outputs. We restrict our description here to the most commonly used case, where f has a single output.
6.5.1 Computational Graphs

So far we have discussed neural networks with a relatively informal graph language. To describe the back-propagation algorithm more precisely, it is helpful to have a more precise computational graph language.

Many ways of formalizing computation as graphs are possible.

Here, we use each node in the graph to indicate a variable. The variable may be a scalar, vector, matrix, tensor, or even a variable of another type.

To formalize our graphs, we also need to introduce the idea of an operation. An operation is a simple function of one or more variables. Our graph language is accompanied by a set of allowable operations. Functions more complicated than the operations in this set may be described by composing many operations together.

Without loss of generality, we define an operation to return only a single output variable. This does not lose generality because the output variable can have multiple entries, such as a vector. Software implementations of back-propagation usually support operations with multiple outputs, but we avoid this case in our description because it introduces many extra details that are not important to conceptual understanding.

If a variable y is computed by applying an operation to a variable x, then we draw a directed edge from x to y. We sometimes annotate the output node with the name of the operation applied, and other times omit this label when the operation is clear from context.

Examples of computational graphs are shown in figure 6.8.
6.5.2 Chain Rule of Calculus

The chain rule of calculus (not to be confused with the chain rule of probability) is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. Back-propagation is an algorithm that computes the chain rule, with a specific order of operations that is highly efficient.

Let x be a real number, and let f and g both be functions mapping from a real number to a real number. Suppose that y = g(x) and z = f(g(x)) = f(y). Then
[Figure 6.8 shows four computational graph diagrams, (a) through (d), described in the caption.]
Figure 6.8: Examples of computational graphs. (a) The graph using the × operation to compute z = xy. (b) The graph for the logistic regression prediction ŷ = σ(x^⊤ w + b). Some of the intermediate expressions do not have names in the algebraic expression but need names in the graph. We simply name the i-th such variable u^(i). (c) The computational graph for the expression H = max{0, XW + b}, which computes a design matrix of rectified linear unit activations H given a design matrix containing a minibatch of inputs X. (d) Examples a–c applied at most one operation to each variable, but it is possible to apply more than one operation. Here we show a computation graph that applies more than one operation to the weights w of a linear regression model. The weights are used to make both the prediction ŷ and the weight decay penalty λ Σ_i w_i².
the chain rule states that
dz/dx = (dz/dy)(dy/dx).    (6.44)
We can generalize this beyond the scalar case. Suppose that x ∈ R^m, y ∈ R^n, g maps from R^m to R^n, and f maps from R^n to R. If y = g(x) and z = f(y), then
∂z/∂x_i = Σ_j (∂z/∂y_j)(∂y_j/∂x_i).    (6.45)
In vector notation, this may be equivalently written as
∇_x z = (∂y/∂x)^⊤ ∇_y z,    (6.46)
where ∂y/∂x is the n × m Jacobian matrix of g.
From this we see that the gradient of a variable x can be obtained by multiplying a Jacobian matrix ∂y/∂x by a gradient ∇_y z. The back-propagation algorithm consists of performing such a Jacobian-gradient product for each operation in the graph.
Usually we apply the back-propagation algorithm to tensors of arbitrary dimensionality, not merely to vectors. Conceptually, this is exactly the same as back-propagation with vectors. The only difference is how the numbers are arranged in a grid to form a tensor. We could imagine flattening each tensor into a vector before we run back-propagation, computing a vector-valued gradient, and then reshaping the gradient back into a tensor. In this rearranged view, back-propagation is still just multiplying Jacobians by gradients.
To denote the gradient of a value z with respect to a tensor X, we write ∇_X z, just as if X were a vector. The indices into X now have multiple coordinates; for example, a 3-D tensor is indexed by three coordinates. We can abstract this away by using a single variable i to represent the complete tuple of indices. For all possible index tuples i, (∇_X z)_i gives ∂z/∂X_i. This is exactly the same as how for all possible integer indices i into a vector, (∇_x z)_i gives ∂z/∂x_i. Using this notation, we can write the chain rule as it applies to tensors. If Y = g(X) and z = f(Y), then
∇_X z = Σ_j (∇_X Y_j) ∂z/∂Y_j.    (6.47)
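The vector form of the chain rule, equation 6.46, can be checked numerically with a few lines of Python. The functions g and f below are arbitrary illustrative choices, and the Jacobian and gradient are written out by hand for this example.

```python
import numpy as np

# A small numerical check of equation 6.46 (illustrative functions, not from
# the text): g(x) maps R^3 -> R^2, and f(y) maps R^2 -> R.
def g(x):
    return np.array([x[0] * x[1], np.sin(x[2])])

def f(y):
    return y[0] ** 2 + 3.0 * y[1]

x = np.array([1.0, 2.0, 0.5])
y = g(x)

# Jacobian dy/dx (2 x 3) and gradient dz/dy (2,), written out by hand.
J = np.array([[x[1], x[0], 0.0],
              [0.0,  0.0,  np.cos(x[2])]])
grad_y = np.array([2.0 * y[0], 3.0])

grad_x = J.T @ grad_y            # equation 6.46: a Jacobian-gradient product

# Compare against central finite differences of z = f(g(x)).
eps = 1e-6
fd = np.array([(f(g(x + eps * e)) - f(g(x - eps * e))) / (2 * eps)
               for e in np.eye(3)])
print(np.allclose(grad_x, fd, atol=1e-5))  # True
```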
6.5.3 Recursively Applying the Chain Rule to Obtain Backprop

Using the chain rule, it is straightforward to write down an algebraic expression for the gradient of a scalar with respect to any node in the computational graph that produced that scalar. Actually evaluating that expression in a computer, however, introduces some extra considerations.

Specifically, many subexpressions may be repeated several times within the overall expression for the gradient. Any procedure that computes the gradient will need to choose whether to store these subexpressions or to recompute them several times. An example of how these repeated subexpressions arise is given in figure 6.9. In some cases, computing the same subexpression twice would simply be wasteful. For complicated graphs, there can be exponentially many of these wasted computations, making a naive implementation of the chain rule infeasible. In other cases, computing the same subexpression twice could be a valid way to reduce memory consumption at the cost of higher runtime.

We begin with a version of the back-propagation algorithm that specifies the actual gradient computation directly (algorithm 6.2 along with algorithm 6.1 for the associated forward computation), in the order it will actually be done and according to the recursive application of the chain rule. One could either directly perform these computations or view the description of the algorithm as a symbolic specification of the computational graph for computing the back-propagation. However, this formulation does not make explicit the manipulation and the construction of the symbolic graph that performs the gradient computation. Such a formulation is presented in section 6.5.6, with algorithm 6.5, where we also generalize to nodes that contain arbitrary tensors.
First consider a computational graph describing how to compute a single scalar u^(n) (say, the loss on a training example). This scalar is the quantity whose gradient we want to obtain, with respect to the n_i input nodes u^(1) to u^(n_i). In other words, we wish to compute ∂u^(n)/∂u^(i) for all i ∈ {1, 2, . . . , n_i}. In the application of back-propagation to computing gradients for gradient descent over parameters, u^(n) will be the cost associated with an example or a minibatch, while u^(1) to u^(n_i) correspond to the parameters of the model.
We will assume that the nodes of the graph have been ordered in such a way that we can compute their output one after the other, starting at u^(n_i + 1) and going up to u^(n). As defined in algorithm 6.1, each node u^(i) is associated with an operation f^(i) and is computed by evaluating the function
u^(i) = f(A^(i)),    (6.48)
where A^(i) is the set of all nodes that are parents of u^(i).
[Figure 6.9 shows a chain-structured computational graph w → x → y → z, with each arrow labeled f.]
Figure 6.9: A computational graph that results in repeated subexpressions when computing the gradient. Let w ∈ R be the input to the graph. We use the same function f : R → R as the operation that we apply at every step of a chain: x = f(w), y = f(x), z = f(y). To compute ∂z/∂w, we apply equation 6.44 and obtain:
∂z/∂w    (6.49)
= (∂z/∂y)(∂y/∂x)(∂x/∂w)    (6.50)
= f′(y) f′(x) f′(w)    (6.51)
= f′(f(f(w))) f′(f(w)) f′(w).    (6.52)
Equation 6.51 suggests an implementation in which we compute the value of f(w) only once and store it in the variable x. This is the approach taken by the back-propagation algorithm. An alternative approach is suggested by equation 6.52, where the subexpression f(w) appears more than once. In the alternative approach, f(w) is recomputed each time it is needed. When the memory required to store the value of these expressions is low, the back-propagation approach of equation 6.51 is clearly preferable because of its reduced runtime. However, equation 6.52 is also a valid implementation of the chain rule and is useful when memory is limited.
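The two implementations suggested by equations 6.51 and 6.52 can be compared directly in code. The choice of f below (the sine function) is an illustrative assumption; any differentiable f would do.

```python
import math

# Illustrative choice of f (not from the text): f(u) = sin(u), so f'(u) = cos(u).
f  = math.sin
df = math.cos

w = 0.3

# Backprop-style (equation 6.51): store the intermediate activations x and y.
x = f(w)
y = f(x)
grad_stored = df(y) * df(x) * df(w)

# Memory-saving style (equation 6.52): recompute f(w) and f(f(w)) as needed.
grad_recomputed = df(f(f(w))) * df(f(w)) * df(w)

print(grad_stored == grad_recomputed)  # True; same value, different time/memory trade-off
```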
That algorithm specifies the forward propagation computation, which we could put in a graph G. To perform back-propagation, we can construct a computational graph that depends on G and adds to it an extra set of nodes. These form a subgraph B with one node per node of G. Computation in B proceeds in exactly the reverse of the order of computation in G, and each node of B computes the derivative ∂u^(n)/∂u^(i) associated with the forward graph node u^(i). This is done using the chain rule with respect to the scalar output u^(n):
∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j))    (6.53)
as specified by algorithm 6.2. The subgraph B contains exactly one edge for each edge from node u^(j) to node u^(i) of G. The edge from u^(j) to u^(i) is associated with the computation of ∂u^(i)/∂u^(j). In addition, a dot product is performed for each node, between the gradient already computed with respect to nodes u^(i) that are children of u^(j) and the vector containing the partial derivatives ∂u^(i)/∂u^(j) for the same children nodes u^(i). To summarize, the amount of computation required for performing the back-propagation scales linearly with the number of edges in G, where the computation for each edge corresponds to computing a partial derivative (of one node with respect to one of its parents) as well as performing one multiplication and one addition. Below, we generalize this analysis to tensor-valued nodes, which is just a way to group multiple scalar values in the same node and enable more efficient implementations.
Algorithm 6.1 A procedure that performs the computations mapping n_i inputs u^(1) to u^(n_i) to an output u^(n). This defines a computational graph where each node computes numerical value u^(i) by applying a function f^(i) to the set of arguments A^(i) that comprises the values of previous nodes u^(j), j < i, with j ∈ Pa(u^(i)). The input to the computational graph is the vector x, and is set into the first n_i nodes u^(1) to u^(n_i). The output of the computational graph is read off the last (output) node u^(n).

for i = 1, . . . , n_i do
  u^(i) ← x_i
end for
for i = n_i + 1, . . . , n do
  A^(i) ← {u^(j) | j ∈ Pa(u^(i))}
  u^(i) ← f^(i)(A^(i))
end for
return u^(n)
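A minimal Python sketch of algorithm 6.1 is given below. The graph representation (a list of nodes with explicit parent indices and per-node functions) is an assumption made for illustration, not a data structure from the text.

```python
# A minimal sketch of algorithm 6.1 in Python. The graph representation here
# (a list of nodes with explicit parent indices and per-node functions) is a
# hypothetical choice for illustration.
def forward_propagation(x, nodes):
    """x: list of n_i input values.
    nodes: list of (f, parent_indices) for nodes n_i+1 .. n, in topological order.
    Returns the list u of all node values; u[-1] is the output u^(n)."""
    u = list(x)                       # u^(1) .. u^(n_i)
    for f, parents in nodes:
        A = [u[j] for j in parents]   # A^(i): values of the parent nodes
        u.append(f(A))                # u^(i) = f^(i)(A^(i))
    return u

# Example: compute u4 = (x1 + x2) * x2 from inputs x1, x2.
nodes = [
    (lambda A: A[0] + A[1], [0, 1]),  # u3 = u1 + u2
    (lambda A: A[0] * A[1], [2, 1]),  # u4 = u3 * u2
]
u = forward_propagation([1.5, 2.0], nodes)
print(u[-1])  # 7.0
```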
The back-propagation algorithm is designed to reduce the number of common subexpressions without regard to memory. Specifically, it performs on the order of one Jacobian product per node in the graph. This can be seen from the fact that backprop (algorithm 6.2) visits each edge from node u^(j) to node u^(i) of the graph exactly once in order to obtain the associated partial derivative ∂u^(i)/∂u^(j). Back-propagation thus avoids the exponential explosion in repeated subexpressions. Other algorithms may be able to avoid more subexpressions by performing simplifications on the computational graph, or may be able to conserve memory by recomputing rather than storing some subexpressions. We revisit these ideas after describing the back-propagation algorithm itself.
Algorithm 6.2 Simplified version of the back-propagation algorithm for computing the derivatives of u^(n) with respect to the variables in the graph. This example is intended to further understanding by showing a simplified case where all variables are scalars, and we wish to compute the derivatives with respect to u^(1), . . . , u^(n_i). This simplified version computes the derivatives of all nodes in the graph. The computational cost of this algorithm is proportional to the number of edges in the graph, assuming that the partial derivative associated with each edge requires a constant time. This is of the same order as the number of computations for the forward propagation. Each ∂u^(i)/∂u^(j) is a function of the parents u^(j) of u^(i), thus linking the nodes of the forward graph to those added for the back-propagation graph.

Run forward propagation (algorithm 6.1 for this example) to obtain the activations of the network.
Initialize grad_table, a data structure that will store the derivatives that have been computed. The entry grad_table[u^(i)] will store the computed value of ∂u^(n)/∂u^(i).
grad_table[u^(n)] ← 1
for j = n − 1 down to 1 do
  The next line computes ∂u^(n)/∂u^(j) = Σ_{i : j ∈ Pa(u^(i))} (∂u^(n)/∂u^(i)) (∂u^(i)/∂u^(j)) using stored values:
  grad_table[u^(j)] ← Σ_{i : j ∈ Pa(u^(i))} grad_table[u^(i)] ∂u^(i)/∂u^(j)
end for
return {grad_table[u^(i)] | i = 1, . . . , n_i}
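A matching Python sketch of algorithm 6.2, continuing the hypothetical graph representation used for the forward-pass sketch above, looks as follows. Each non-input node now also carries one partial-derivative function per parent; these are illustrative choices, not an API from the text.

```python
def back_propagation(u, nodes, n_inputs):
    """u: all node values from the forward pass.
    nodes: list of (f, parent_indices, partials), where partials[k](A) returns
           the derivative of the node with respect to its k-th parent.
    Returns grad_table[:n_inputs], with entry i equal to d u^(n) / d u^(i)."""
    n = len(u)
    grad_table = [0.0] * n
    grad_table[n - 1] = 1.0                      # d u^(n) / d u^(n) = 1
    for i in reversed(range(n_inputs, n)):       # visit nodes in reverse order
        f, parents, partials = nodes[i - n_inputs]
        A = [u[j] for j in parents]
        for j, dpartial in zip(parents, partials):
            grad_table[j] += grad_table[i] * dpartial(A)   # equation 6.53
    return grad_table[:n_inputs]

# Same example graph as before: u3 = u1 + u2, u4 = u3 * u2.
nodes = [
    (lambda A: A[0] + A[1], [0, 1], [lambda A: 1.0, lambda A: 1.0]),
    (lambda A: A[0] * A[1], [2, 1], [lambda A: A[1], lambda A: A[0]]),
]
u = [1.5, 2.0, 3.5, 7.0]                         # forward values for x = [1.5, 2.0]
print(back_propagation(u, nodes, 2))             # [2.0, 5.5]
```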
6.5.4 Back-Propagation Computation in Fully Connected MLP

To clarify the above definition of the back-propagation computation, let us consider the specific graph associated with a fully connected multilayer MLP.

Algorithm 6.3 first shows the forward propagation, which maps parameters to the supervised loss L(ŷ, y) associated with a single (input, target) training example (x, y), with ŷ the output of the neural network when x is provided in input.

Algorithm 6.4 then shows the corresponding computation to be done for applying the back-propagation algorithm to this graph.

Algorithms 6.3 and 6.4 are demonstrations chosen to be simple and straightforward to understand. However, they are specialized to one specific problem.

Modern software implementations are based on the generalized form of back-propagation described in section 6.5.6 below, which can accommodate any computational graph by explicitly manipulating a data structure for representing symbolic computation.
Algorithm 6.3 Forward propagation through a typical deep neural network and the computation of the cost function. The loss L(ŷ, y) depends on the output ŷ and on the target y (see section 6.2.1.1 for examples of loss functions). To obtain the total cost J, the loss may be added to a regularizer Ω(θ), where θ contains all the parameters (weights and biases). Algorithm 6.4 shows how to compute gradients of J with respect to parameters W and b. For simplicity, this demonstration uses only a single input example x. Practical applications should use a minibatch. See section 6.5.7 for a more realistic demonstration.

Require: Network depth, l
Require: W^(i), i ∈ {1, . . . , l}, the weight matrices of the model
Require: b^(i), i ∈ {1, . . . , l}, the bias parameters of the model
Require: x, the input to process
Require: y, the target output
h^(0) = x
for k = 1, . . . , l do
  a^(k) = b^(k) + W^(k) h^(k−1)
  h^(k) = f(a^(k))
end for
ŷ = h^(l)
J = L(ŷ, y) + λΩ(θ)
Algorithm 6.4 Backward computation for the deep neural network of algorithm 6.3, which uses, in addition to the input x, a target y. This computation yields the gradients on the activations a^(k) for each layer k, starting from the output layer and going backwards to the first hidden layer. From these gradients, which can be interpreted as an indication of how each layer's output should change to reduce error, one can obtain the gradient on the parameters of each layer. The gradients on weights and biases can be immediately used as part of a stochastic gradient update (performing the update right after the gradients have been computed) or used with other gradient-based optimization methods.

After the forward computation, compute the gradient on the output layer:
g ← ∇_ŷ J = ∇_ŷ L(ŷ, y)
for k = l, l − 1, . . . , 1 do
  Convert the gradient on the layer's output into a gradient on the pre-nonlinearity activation (element-wise multiplication if f is element-wise):
  g ← ∇_{a^(k)} J = g ⊙ f′(a^(k))
  Compute gradients on weights and biases (including the regularization term, where needed):
  ∇_{b^(k)} J = g + λ∇_{b^(k)} Ω(θ)
  ∇_{W^(k)} J = g h^(k−1)⊤ + λ∇_{W^(k)} Ω(θ)
  Propagate the gradients w.r.t. the next lower-level hidden layer's activations:
  g ← ∇_{h^(k−1)} J = W^(k)⊤ g
end for
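The following Python sketch instantiates algorithms 6.3 and 6.4 for a single example. The concrete choices here are assumptions made for the illustration: a ReLU nonlinearity f, the squared-error loss L(ŷ, y) = ½‖ŷ − y‖², an L2 weight penalty Ω applied to weights only, and arbitrary layer sizes.

```python
import numpy as np

def forward(x, Ws, bs):
    hs, pre = [x], []
    for W, b in zip(Ws, bs):
        a = b + W @ hs[-1]              # a^(k) = b^(k) + W^(k) h^(k-1)
        pre.append(a)
        hs.append(np.maximum(0.0, a))   # h^(k) = f(a^(k)), here f = relu
    return hs, pre, hs[-1]              # activations, pre-activations, y_hat

def backward(hs, pre, y_hat, y, Ws, lam):
    grads_W, grads_b = [None] * len(Ws), [None] * len(Ws)
    g = y_hat - y                                   # grad of 0.5*||y_hat - y||^2
    for k in reversed(range(len(Ws))):
        g = g * (pre[k] > 0)                        # g <- g (elementwise) f'(a^(k))
        grads_b[k] = g.copy()                       # no weight decay on biases here
        grads_W[k] = np.outer(g, hs[k]) + 2 * lam * Ws[k]
        g = Ws[k].T @ g                             # propagate to h^(k-1)
    return grads_W, grads_b

rng = np.random.default_rng(0)
Ws = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
bs = [np.zeros(4), np.zeros(2)]
x, y, lam = rng.standard_normal(3), np.array([1.0, -1.0]), 0.01

hs, pre, y_hat = forward(x, Ws, bs)
grads_W, grads_b = backward(hs, pre, y_hat, y, Ws, lam)
print([g.shape for g in grads_W])  # [(4, 3), (2, 4)]
```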
6.5.5 Symbol-to-Symbol Derivatives

Algebraic expressions and computational graphs both operate on symbols, or variables that do not have specific values. These algebraic and graph-based representations are called symbolic representations. When we actually use or train a neural network, we must assign specific values to these symbols. We replace a symbolic input to the network x with a specific numeric value, such as [1.2, 3.765, −1.8].

Some approaches to back-propagation take a computational graph and a set of numerical values for the inputs to the graph, then return a set of numerical values describing the gradient at those input values. We call this approach symbol-to-number differentiation. This is the approach used by libraries such as Torch (Collobert et al., 2011b) and Caffe (Jia, 2013).

Another approach is to take a computational graph and add additional nodes
[Figure 6.10 shows two diagrams: on the left, the chain graph w → x → y → z with each arrow labeled f; on the right, the same graph extended with nodes computing dz/dy, dy/dx, dx/dw, dz/dx, and dz/dw.]
Figure 6.10: An example of the symbol-to-symbol approach to computing derivatives. In this approach, the back-propagation algorithm does not need to ever access any actual specific numeric values. Instead, it adds nodes to a computational graph describing how to compute these derivatives. A generic graph evaluation engine can later compute the derivatives for any specific numeric values. (Left) In this example, we begin with a graph representing z = f(f(f(w))). (Right) We run the back-propagation algorithm, instructing it to construct the graph for the expression corresponding to dz/dw. In this example, we do not explain how the back-propagation algorithm works. The purpose is only to illustrate what the desired result is: a computational graph with a symbolic description of the derivative.
to the graph that provide a symbolic description of the desired derivatives. This is the approach taken by Theano (Bergstra et al., 2010; Bastien et al., 2012) and TensorFlow (Abadi et al., 2015). An example of how it works is illustrated in figure 6.10. The primary advantage of this approach is that the derivatives are described in the same language as the original expression. Because the derivatives are just another computational graph, it is possible to run back-propagation again, differentiating the derivatives to obtain higher derivatives. (Computation of higher-order derivatives is described in section 6.5.10.)

We will use the latter approach and describe the back-propagation algorithm in terms of constructing a computational graph for the derivatives. Any subset of the graph may then be evaluated using specific numerical values at a later time. This allows us to avoid specifying exactly when each operation should be computed. Instead, a generic graph evaluation engine can evaluate every node as soon as its parents' values are available.
The description of the symbol-to-symbol based approach subsumes the symbol-to-number approach. The symbol-to-number approach can be understood as performing exactly the same computations as are done in the graph built by the symbol-to-symbol approach. The key difference is that the symbol-to-number approach does not expose the graph.
6.5.6 General Back-Propagation

The back-propagation algorithm is very simple. To compute the gradient of some scalar z with respect to one of its ancestors x in the graph, we begin by observing that the gradient with respect to z is given by dz/dz = 1. We can then compute the gradient with respect to each parent of z in the graph by multiplying the current gradient by the Jacobian of the operation that produced z. We continue multiplying by Jacobians, traveling backward through the graph in this way until we reach x. For any node that may be reached by going backward from z through two or more paths, we simply sum the gradients arriving from different paths at that node.

More formally, each node in the graph G corresponds to a variable. To achieve maximum generality, we describe this variable as being a tensor V. Tensors in general can have any number of dimensions. They subsume scalars, vectors, and matrices.
We assume that each variable V is associated with the following subroutines:

get_operation(V): This returns the operation that computes V, represented by the edges coming into V in the computational graph. For example, there may be a Python or C++ class representing the matrix multiplication operation, and the get_operation function. Suppose we have a variable that is created by matrix multiplication, C = AB. Then get_operation(V) returns a pointer to an instance of the corresponding C++ class.

get_consumers(V, G): This returns the list of variables that are children of V in the computational graph G.

get_inputs(V, G): This returns the list of variables that are parents of V in the computational graph G.
Each operation op is also associated with a bprop operation. This bprop operation can compute a Jacobian-vector product as described by equation 6.47. This is how the back-propagation algorithm is able to achieve great generality. Each operation is responsible for knowing how to back-propagate through the edges in the graph that it participates in. For example, we might use a matrix multiplication operation to create a variable C = AB. Suppose that the gradient of a scalar z with respect to C is given by G. The matrix multiplication operation is responsible for defining two back-propagation rules, one for each of its input arguments. If we call the bprop method to request the gradient with respect to A given that the gradient on the output is G, then the bprop method of the matrix multiplication operation must state that the gradient with respect to A is given by GB^⊤. Likewise, if we call the bprop method to request the gradient with respect to B, then the matrix operation is responsible for implementing the bprop method and specifying that the desired gradient is given by A^⊤G. The back-propagation algorithm itself does not need to know any differentiation rules. It only needs to call each operation's bprop rules with the right arguments. Formally, op.bprop(inputs, X, G) must return
Σ_i (∇_X op.f(inputs)_i) G_i,    (6.54)
which is just an implementation of the chain rule as expressed in equation 6.47. Here, inputs is a list of inputs that are supplied to the operation, op.f is the mathematical function that the operation implements, X is the input whose gradient we wish to compute, and G is the gradient on the output of the operation.
The op.bprop method should always pretend that all its inputs are distinct from each other, even if they are not. For example, if the mul operator is passed two copies of x to compute x², the op.bprop method should still return x as the derivative with respect to both inputs. The back-propagation algorithm will later add both of these arguments together to obtain 2x, which is the correct total derivative on x.
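A small Python sketch of this convention follows. The class and method names (MatMul, Mul, bprop) are hypothetical, chosen to mirror the discussion above rather than the API of any particular library.

```python
import numpy as np

class MatMul:
    def f(self, A, B):
        return A @ B
    def bprop(self, inputs, X, G):
        A, B = inputs
        if X is A:                 # gradient with respect to the first argument
            return G @ B.T
        if X is B:                 # gradient with respect to the second argument
            return A.T @ G
        raise ValueError("X must be one of the operation's inputs")

class Mul:                         # element-wise product
    def f(self, a, b):
        return a * b
    def bprop(self, inputs, X, G):
        a, b = inputs
        # Pretend the inputs are distinct: return the derivative for one slot.
        return G * (b if X is a else a)

A = np.ones((2, 3)); B = np.ones((3, 4)); G = np.ones((2, 4))
print(MatMul().bprop([A, B], A, G).shape)   # (2, 3), i.e. G B^T

x = 3.0
m = Mul()
# x^2 computed as mul(x, x): bprop returns x for each slot; the caller then
# sums the two contributions to obtain the total derivative 2x.
print(m.bprop([x, x], x, 1.0) + m.bprop([x, x], x, 1.0))  # 6.0
```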
Software implementations of back-propagation usually provide both the operations and their bprop methods, so that users of deep learning software libraries are able to back-propagate through graphs built using common operations like matrix multiplication, exponents, logarithms, and so on. Software engineers who build a new implementation of back-propagation or advanced users who need to add their own operation to an existing library must usually derive the op.bprop method for any new operations manually.
The back-propagation algorithm is formally described in algorithm 6.5.

In section 6.5.2, we explained that back-propagation was developed in order to avoid computing the same subexpression in the chain rule multiple times. The naive algorithm could have exponential runtime due to these repeated subexpressions. Now that we have specified the back-propagation algorithm, we can understand its computational cost. If we assume that each operation evaluation has roughly the
Algorithm 6.5 The outermost skeleton of the back-propagation algorithm. This portion does simple setup and cleanup work. Most of the important work happens in the build_grad subroutine of algorithm 6.6.

Require: T, the target set of variables whose gradients must be computed.
Require: G, the computational graph
Require: z, the variable to be differentiated
Let G′ be G pruned to contain only nodes that are ancestors of z and descendents of nodes in T.
Initialize grad_table, a data structure associating tensors to their gradients
grad_table[z] ← 1
for V in T do
  build_grad(V, G, G′, grad_table)
end for
Return grad_table restricted to T
Algorithm 6.6 The inner loop subroutine build_grad(V, G, G′, grad_table) of the back-propagation algorithm, called by the back-propagation algorithm defined in algorithm 6.5.

Require: V, the variable whose gradient should be added to G and grad_table
Require: G, the graph to modify
Require: G′, the restriction of G to nodes that participate in the gradient
Require: grad_table, a data structure mapping nodes to their gradients
if V is in grad_table then
  Return grad_table[V]
end if
i ← 1
for C in get_consumers(V, G′) do
  op ← get_operation(C)
  D ← build_grad(C, G, G′, grad_table)
  G^(i) ← op.bprop(get_inputs(C, G′), V, D)
  i ← i + 1
end for
G ← Σ_i G^(i)
grad_table[V] = G
Insert G and the operations creating it into the graph G
Return G
same cost, then we may analyze the computational cost in terms of the number of operations executed. Keep in mind here that we refer to an operation as the fundamental unit of our computational graph, which might actually consist of several arithmetic operations (for example, we might have a graph that treats matrix multiplication as a single operation). Computing a gradient in a graph with n nodes will never execute more than O(n²) operations or store the output of more than O(n²) operations. Here we are counting operations in the computational graph, not individual operations executed by the underlying hardware, so it is important to remember that the runtime of each operation may be highly variable. For example, multiplying two matrices that each contain millions of entries might correspond to a single operation in the graph. We can see that computing the gradient requires at most O(n²) operations because the forward propagation stage will at worst execute all n nodes in the original graph (depending on which values we want to compute, we may not need to execute the entire graph). The back-propagation algorithm adds one Jacobian-vector product, which should be expressed with O(1) nodes, per edge in the original graph. Because the computational graph is a directed acyclic graph it has at most O(n²) edges. For the kinds of graphs that are commonly used in practice, the situation is even better. Most neural network cost functions are roughly chain-structured, causing back-propagation to have O(n) cost. This is far better than the naive approach, which might need to execute exponentially many nodes. This potentially exponential cost can be seen by expanding and rewriting the recursive chain rule (equation 6.53) nonrecursively:
∂u^(n)/∂u^(j) = Σ_{paths (u^(π_1), u^(π_2), . . . , u^(π_t)), from π_1 = j to π_t = n} Π_{k=2}^{t} ∂u^(π_k)/∂u^(π_{k−1}).    (6.55)
Since the number of paths from node j to node n can grow exponentially in the length of these paths, the number of terms in the above sum, which is the number of such paths, can grow exponentially with the depth of the forward propagation graph. This large cost would be incurred because the same computation for ∂u^(i)/∂u^(j) would be redone many times. To avoid such recomputation, we can think of back-propagation as a table-filling algorithm that takes advantage of storing intermediate results ∂u^(n)/∂u^(i). Each node in the graph has a corresponding slot in a table to store the gradient for that node. By filling in these table entries in order, back-propagation avoids repeating many common subexpressions. This table-filling strategy is sometimes called dynamic programming.
6.5.7 Example: Back-Propagation for MLP Training

As an example, we walk through the back-propagation algorithm as it is used to train a multilayer perceptron.

Here we develop a very simple multilayer perceptron with a single hidden layer. To train this model, we will use minibatch stochastic gradient descent. The back-propagation algorithm is used to compute the gradient of the cost on a single minibatch. Specifically, we use a minibatch of examples from the training set formatted as a design matrix X and a vector of associated class labels y. The network computes a layer of hidden features H = max{0, XW^(1)}. To simplify the presentation we do not use biases in this model. We assume that our graph language includes a relu operation that can compute max{0, Z} element-wise. The predictions of the unnormalized log probabilities over classes are then given by HW^(2). We assume that our graph language includes a cross_entropy operation that computes the cross-entropy between the targets y and the probability distribution defined by these unnormalized log probabilities. The resulting cross-entropy defines the cost J_MLE. Minimizing this cross-entropy performs maximum likelihood estimation of the classifier. However, to make this example more realistic, we also include a regularization term. The total cost
J = J_MLE + λ ( Σ_{i,j} (W^(1)_{i,j})² + Σ_{i,j} (W^(2)_{i,j})² )    (6.56)
consists of the cross-entropy and a weight decay term with coefficient λ. The computational graph is illustrated in figure 6.11.
The computational graph for the gradient of this example is large enough that it would be tedious to draw or to read. This demonstrates one of the benefits of the back-propagation algorithm, which is that it can automatically generate gradients that would be straightforward but tedious for a software engineer to derive manually.
We can roughly trace out the behavior of the back-propagation algorithm by looking at the forward propagation graph in figure 6.11. To train, we wish to compute both ∇_{W^(1)} J and ∇_{W^(2)} J. There are two different paths leading backward from J to the weights: one through the cross-entropy cost, and one through the weight decay cost. The weight decay cost is relatively simple; it will always contribute 2λW^(i) to the gradient on W^(i).
The other path through the cross-entropy cost is slightly more complicated. Let G be the gradient on the unnormalized log probabilities U^(2) provided by the cross_entropy operation. The back-propagation algorithm now needs to
[Figure 6.11 diagram: X and W^(1) feed a matmul producing U^(1), which passes through relu to give H; H and W^(2) feed a matmul producing U^(2); U^(2) and y feed cross_entropy to give J_MLE; the squared-and-summed weights, scaled by λ, are added to J_MLE to produce J.]
Figure 6.11: The computational graph used to compute the cost to train our example of a single-layer MLP using the cross-entropy loss and weight decay.
explore two different branches. On the shorter branch, it adds H^⊤G to the gradient on W^(2), using the back-propagation rule for the second argument to the matrix multiplication operation. The other branch corresponds to the longer chain descending further along the network. First, the back-propagation algorithm computes ∇_H J = GW^(2)⊤ using the back-propagation rule for the first argument to the matrix multiplication operation. Next, the relu operation uses its back-propagation rule to zero out components of the gradient corresponding to entries of U^(1) that are less than 0. Let the result be called G′. The last step of the back-propagation algorithm is to use the back-propagation rule for the second argument of the matmul operation to add X^⊤G′ to the gradient on W^(1).
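The trace above can be written out concretely in a few lines of Python. The minibatch sizes and data are illustrative, and we assume the cross_entropy operation returns the average loss over the minibatch, so its gradient on the unnormalized log probabilities is G = (softmax(U^(2)) − onehot(y)) / m; that normalization is a convention chosen for this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_in, n_hid, n_cls = 5, 4, 3, 2
X = rng.standard_normal((m, n_in))
y = rng.integers(0, n_cls, size=m)
W1 = rng.standard_normal((n_in, n_hid))
W2 = rng.standard_normal((n_hid, n_cls))
lam = 0.01

# Forward pass.
U1 = X @ W1
H = np.maximum(0.0, U1)          # relu
U2 = H @ W2                      # unnormalized log probabilities

# Backward pass.
P = np.exp(U2 - U2.max(axis=1, keepdims=True))
P /= P.sum(axis=1, keepdims=True)
G = P.copy()
G[np.arange(m), y] -= 1.0
G /= m                           # gradient on U2 from cross_entropy

grad_W2 = H.T @ G + 2 * lam * W2             # shorter branch + weight decay
G_prime = (G @ W2.T) * (U1 > 0)              # grad on H, zeroed where U1 < 0
grad_W1 = X.T @ G_prime + 2 * lam * W1       # longer branch + weight decay

print(grad_W1.shape, grad_W2.shape)          # (4, 3) (3, 2)
```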
After these gradients have been computed, the gradient descent algorithm, or another optimization algorithm, uses these gradients to update the parameters.

For the MLP, the computational cost is dominated by the cost of matrix multiplication. During the forward propagation stage, we multiply by each weight matrix, resulting in O(w) multiply-adds, where w is the number of weights. During the backward propagation stage, we multiply by the transpose of each weight matrix, which has the same computational cost. The main memory cost of the algorithm is that we need to store the input to the nonlinearity of the hidden layer. This value is stored from the time it is computed until the backward pass has returned to the same point. The memory cost is thus O(m n_h), where m is the number of examples in the minibatch and n_h is the number of hidden units.
6.5.8 Complications

Our description of the back-propagation algorithm here is simpler than the implementations actually used in practice.

As noted above, we have restricted the definition of an operation to be a function that returns a single tensor. Most software implementations need to support operations that can return more than one tensor. For example, if we wish to compute both the maximum value in a tensor and the index of that value, it is best to compute both in a single pass through memory, so it is most efficient to implement this procedure as a single operation with two outputs.

We have not described how to control the memory consumption of back-propagation. Back-propagation often involves summation of many tensors together. In the naive approach, each of these tensors would be computed separately, then all of them would be added in a second step. The naive approach has an overly high memory bottleneck that can be avoided by maintaining a single buffer and adding each value to that buffer as it is computed.

Real-world implementations of back-propagation also need to handle various data types, such as 32-bit floating point, 64-bit floating point, and integer values. The policy for handling each of these types takes special care to design.

Some operations have undefined gradients, and it is important to track these cases and determine whether the gradient requested by the user is undefined.

Various other technicalities make real-world differentiation more complicated. These technicalities are not insurmountable, and this chapter has described the key intellectual tools needed to compute derivatives, but it is important to be aware that many more subtleties exist.
6.5.9DifferentiationoutsidetheDeepLearningCommunity
The deeplearning communityhas been somewhat isolatedfrom thebroader
computersciencecommunityandhaslargelydevelopeditsownculturalattitudes
concerninghowtoperformdifferentiation.Moregenerally,thefieldof
automatic
differentiation
isconcernedwithhowtocomputederivativesalgorithmically.
Theback-propagationalgorithmdescribedhereisonlyoneapproachtoautomatic
differentiation.Itisaspecialcaseofabroaderclassoftechniquescalled
reverse
217
CHAPTER6.DEEPFEEDFORWARDNETWORKS
modeaccumulation
.Otherapproachesevaluatethesubexpressionsofthechain
ruleindifferentorders.Ingeneral, determiningtheorderofevaluationthat
resultsinthelowestcomputationalcostisadifficultproblem.Findingtheoptimal
sequenceofoperationstocomputethegradientisNP-complete(Naumann,2008),
inthesensethatitmayrequiresimplifyingalgebraicexpressionsintotheirleast
expensiveform.
For example, suppose we have variables p_1, p_2, . . . , p_n representing probabilities, and variables z_1, z_2, . . . , z_n representing unnormalized log probabilities. Suppose we define
q_i = exp(z_i) / Σ_i exp(z_i),    (6.57)
where we build the softmax function out of exponentiation, summation and division operations, and construct a cross-entropy loss J = −Σ_i p_i log q_i. A human mathematician can observe that the derivative of J with respect to z_i takes a very simple form: q_i − p_i. The back-propagation algorithm is not capable of simplifying the gradient this way and will instead explicitly propagate gradients through all the logarithm and exponentiation operations in the original graph. Some software libraries such as Theano (Bergstra et al., 2010; Bastien et al., 2012) are able to perform some kinds of algebraic substitution to improve over the graph proposed by the pure back-propagation algorithm.
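The simplified form q_i − p_i is easy to confirm numerically, as in the short check below; the particular random values of z and p are illustrative.

```python
import numpy as np

# Check that dJ/dz_i = q_i - p_i for J = -sum_i p_i log q_i with q = softmax(z).
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def J(z, p):
    return -np.sum(p * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.standard_normal(5)
p = softmax(rng.standard_normal(5))      # any valid probability vector

analytic = softmax(z) - p                # the simplified form q - p
eps = 1e-6
numeric = np.array([(J(z + eps * e, p) - J(z - eps * e, p)) / (2 * eps)
                    for e in np.eye(5)])
print(np.allclose(analytic, numeric, atol=1e-5))  # True
```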
When the forward graph G has a single output node and each partial derivative ∂u^(i)/∂u^(j) can be computed with a constant amount of computation, back-propagation guarantees that the number of computations for the gradient computation is of the same order as the number of computations for the forward computation: this can be seen in algorithm 6.2, because each local partial derivative ∂u^(i)/∂u^(j) needs to be computed only once along with an associated multiplication and addition for the recursive chain-rule formulation (equation 6.53). The overall computation is therefore O(# edges). It can potentially be reduced, however, by simplifying the computational graph constructed by back-propagation, and this is an NP-complete task. Implementations such as Theano and TensorFlow use heuristics based on matching known simplification patterns to iteratively attempt to simplify the graph.
We defined back-propagation only for computing a scalar output gradient, but back-propagation can be extended to compute a Jacobian (either of k different scalar nodes in the graph, or of a tensor-valued node containing k values). A naive implementation may then need k times more computation: for each scalar internal node in the original forward graph, the naive implementation computes k gradients instead of a single gradient. When the number of outputs of the graph is larger than the number of inputs, it is sometimes preferable to use another form of automatic differentiation called forward mode accumulation. Forward mode accumulation has been proposed for obtaining real-time computation of gradients in recurrent networks, for example (Williams and Zipser, 1989). This approach also avoids the need to store the values and gradients for the whole graph, trading off computational efficiency for memory. The relationship between forward mode and backward mode is analogous to the relationship between left-multiplying versus right-multiplying a sequence of matrices, such as
ABCD,    (6.58)
where the matrices can be thought of as Jacobians. For example, if D is a column vector while A has many rows, the graph will have a single output and many inputs, and starting the multiplications from the end and going backward requires only matrix-vector products. This order corresponds to the backward mode. Instead, starting to multiply from the left would involve a series of matrix-matrix products, which makes the whole computation much more expensive. If A has fewer rows than D has columns, however, it is cheaper to run the multiplications left-to-right, corresponding to the forward mode.
In many communities outside machine learning, it is more common to implement differentiation software that acts directly on traditional programming language code, such as Python or C code, and automatically generates programs that differentiate functions written in these languages. In the deep learning community, computational graphs are usually represented by explicit data structures created by specialized libraries. The specialized approach has the drawback of requiring the library developer to define the bprop methods for every operation and limiting the user of the library to only those operations that have been defined. Yet the specialized approach also has the benefit of allowing customized back-propagation rules to be developed for each operation, enabling the developer to improve speed or stability in nonobvious ways that an automatic procedure would presumably be unable to replicate.

Back-propagation is therefore not the only way or the optimal way of computing the gradient, but it is a practical method that continues to serve the deep learning community well. In the future, differentiation technology for deep networks may improve as deep learning practitioners become more aware of advances in the broader field of automatic differentiation.
6.5.10 Higher-Order Derivatives

Some software frameworks support the use of higher-order derivatives. Among the deep learning software frameworks, this includes at least Theano and TensorFlow. These libraries use the same kind of data structure to describe the expressions for derivatives as they use to describe the original function being differentiated. This means that the symbolic differentiation machinery can be applied to derivatives.

In the context of deep learning, it is rare to compute a single second derivative of a scalar function. Instead, we are usually interested in properties of the Hessian matrix. If we have a function f : R^n → R, then the Hessian matrix is of size n × n. In typical deep learning applications, n will be the number of parameters in the model, which could easily number in the billions. The entire Hessian matrix is thus infeasible to even represent.

Instead of explicitly computing the Hessian, the typical deep learning approach is to use Krylov methods. Krylov methods are a set of iterative techniques for performing various operations, such as approximately inverting a matrix or finding approximations to its eigenvectors or eigenvalues, without using any operation other than matrix-vector products.
To use Krylov methods on the Hessian, we only need to be able to compute the product between the Hessian matrix H and an arbitrary vector v. A straightforward technique (Christianson, 1992) for doing so is to compute
Hv = ∇_x [ (∇_x f(x))^⊤ v ].    (6.59)
Both gradient computations in this expression may be computed automatically by the appropriate software library. Note that the outer gradient expression takes the gradient of a function of the inner gradient expression.

If v is itself a vector produced by a computational graph, it is important to specify that the automatic differentiation software should not differentiate through the graph that produced v.

While computing the Hessian is usually not advisable, it is possible to do with Hessian-vector products. One simply computes He^(i) for all i = 1, . . . , n, where e^(i) is the one-hot vector with e^(i)_i = 1 and all other entries equal to 0.
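Equation 6.59 can be illustrated with a small numerical example. The function below is an arbitrary quadratic chosen so that its Hessian is known exactly; in practice both gradients in equation 6.59 would be produced by an automatic differentiation library rather than written by hand or approximated by finite differences.

```python
import numpy as np

# Illustrative function: f(x) = 0.5 x^T A x + b^T x, whose Hessian is A.
rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)); A = 0.5 * (A + A.T)   # symmetric Hessian
b = rng.standard_normal(n)

grad_f = lambda x: A @ x + b          # inner gradient, known in closed form here
x = rng.standard_normal(n)
v = rng.standard_normal(n)

# Outer gradient of g(x) = (grad f(x))^T v, taken by central finite differences.
g = lambda x: grad_f(x) @ v
eps = 1e-6
Hv = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(n)])

print(np.allclose(Hv, A @ v, atol=1e-5))  # True: matches the Hessian-vector product
```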
6.6 Historical Notes

Feedforward networks can be seen as efficient nonlinear function approximators based on using gradient descent to minimize the error in a function approximation. From this point of view, the modern feedforward network is the culmination of centuries of progress on the general function approximation task.

The chain rule that underlies the back-propagation algorithm was invented in the seventeenth century (Leibniz, 1676; L'Hôpital, 1696). Calculus and algebra have long been used to solve optimization problems in closed form, but gradient descent was not introduced as a technique for iteratively approximating the solution to optimization problems until the nineteenth century (Cauchy, 1847).

Beginning in the 1940s, these function approximation techniques were used to motivate machine learning models such as the perceptron. However, the earliest models were based on linear models. Critics including Marvin Minsky pointed out several of the flaws of the linear model family, such as its inability to learn the XOR function, which led to a backlash against the entire neural network approach.
Learning nonlinear functions required the development of a multilayer perceptron and a means of computing the gradient through such a model. Efficient applications of the chain rule based on dynamic programming began to appear in the 1960s and 1970s, mostly for control applications (Kelley, 1960; Bryson and Denham, 1961; Dreyfus, 1962; Bryson and Ho, 1969; Dreyfus, 1973) but also for sensitivity analysis (Linnainmaa, 1976). Werbos (1981) proposed applying these techniques to training artificial neural networks. The idea was finally developed in practice after being independently rediscovered in different ways (LeCun, 1985; Parker, 1985; Rumelhart et al., 1986a). The book Parallel Distributed Processing presented the results of some of the first successful experiments with back-propagation in a chapter (Rumelhart et al., 1986b) that contributed greatly to the popularization of back-propagation and initiated a very active period of research in multilayer neural networks. The ideas put forward by the authors of that book, particularly by Rumelhart and Hinton, go much beyond back-propagation. They include crucial ideas about the possible computational implementation of several central aspects of cognition and learning, which came under the name "connectionism" because of the importance this school of thought places on the connections between neurons as the locus of learning and memory. In particular, these ideas include the notion of distributed representation (Hinton et al., 1986).

Following the success of back-propagation, neural network research gained popularity and reached a peak in the early 1990s. Afterwards, other machine learning techniques became more popular until the modern deep learning renaissance that began in 2006.
The core ideas behind modern feedforward networks have not changed sub-
stantially since the 1980s. The same back-propagation algorithm and the same
approaches to gradient descent are still in use. Most of the improvement in neural
network performance from 1986 to 2015 can be attributed to two factors. First,
larger datasets have reduced the degree to which statistical generalization is a
challenge for neural networks. Second, neural networks have become much larger,
because of more powerful computers and better software infrastructure. A small
number of algorithmic changes have also improved the performance of neural
networks noticeably.

One of these algorithmic changes was the replacement of mean squared error
with the cross-entropy family of loss functions. Mean squared error was popular in
the 1980s and 1990s but was gradually replaced by cross-entropy losses and the
principle of maximum likelihood as ideas spread between the statistics community
and the machine learning community. The use of cross-entropy losses greatly
improved the performance of models with sigmoid and softmax outputs, which
had previously suffered from saturation and slow learning when using the mean
squared error loss.
The other major algorithmic change that has greatly improved the performance
of feedforward networks was the replacement of sigmoid hidden units with piecewise
linear hidden units, such as rectified linear units. Rectification using the max{0, z}
function was introduced in early neural network models and dates back at least as far
as the cognitron and neocognitron (Fukushima, 1975, 1980). These early models did
not use rectified linear units but instead applied rectification to nonlinear functions.
Despite the early popularity of rectification, it was largely replaced by sigmoids
in the 1980s, perhaps because sigmoids perform better when neural networks are
very small. As of the early 2000s, rectified linear units were avoided because of
a somewhat superstitious belief that activation functions with nondifferentiable
points must be avoided. This began to change in about 2009. Jarrett et al. (2009)
observed that "using a rectifying nonlinearity is the single most important factor
in improving the performance of a recognition system," among several different
factors of neural network architecture design.
For small datasets, Jarrett et al. (2009) observed that using rectifying non-
linearities is even more important than learning the weights of the hidden layers.
Random weights are sufficient to propagate useful information through a rectified
linear network, enabling the classifier layer at the top to learn how to map different
feature vectors to class identities.
When more data is available, learning begins to extract enough useful knowledge
to exceed the performance of randomly chosen parameters. Glorot et al. (2011a)
showed that learning is far easier in deep rectified linear networks than in deep
networks that have curvature or two-sided saturation in their activation functions.
Rectified linear units are also of historical interest because they show that
neuroscience has continued to have an influence on the development of deep
learning algorithms. Glorot et al. (2011a) motivated rectified linear units from
biological considerations. The half-rectifying nonlinearity was intended to capture
these properties of biological neurons: (1) For some inputs, biological neurons
are completely inactive. (2) For some inputs, a biological neuron's output is
proportional to its input. (3) Most of the time, biological neurons operate in the
regime where they are inactive (i.e., they should have sparse activations).
When the modern resurgence of deep learning began in 2006, feedforward
networks continued to have a bad reputation. From about 2006 to 2012, it was
widely believed that feedforward networks would not perform well unless they were
assisted by other models, such as probabilistic models. It is now known
that with the right resources and engineering practices, feedforward networks
perform very well. Today, gradient-based learning in feedforward networks is used
as a tool to develop probabilistic models, such as the variational autoencoder
and generative adversarial networks, described in chapter 20. Rather than being
viewed as an unreliable technology that must be supported by other techniques,
gradient-based learning in feedforward networks has been viewed since 2012 as a
powerful technology that can be applied to many other machine learning tasks. In
2006, the community used unsupervised learning to support supervised learning,
and now, ironically, it is more common to use supervised learning to support
unsupervised learning.
Feedforward networks continue to have unfulfilled potential. In the future, we
expect that they will be applied to many more tasks and that advances in optimization
algorithms and model design will improve their performance even further. This
chapter has primarily described the neural network family of models. In the
subsequent chapters, we turn to how to use these models: how to regularize and
train them.