Introduction to Convolutional Neural Network.pptx

1. General introduction
Convolutional Neural Networks (CNNs) represent a major advance in artificial intelligence, and more specifically in supervised machine learning. Inspired by the workings of the human visual cortex, CNNs are now ubiquitous in image and video recognition, natural language processing, medical diagnosis, and much more.
The goal of this presentation is to provide a thorough but accessible understanding of CNNs. We will explore their structure, how they work, the reasons for their effectiveness, and examples of real applications, supported by illustrations and, where relevant, practical demonstrations.
2. Context and motivation
a. Origins. CNNs trace their roots to the 1980s and the work of Yann LeCun, who introduced the first convolutional networks for handwritten digit recognition. The system was notably used by American banks for the automatic reading of checks.
b. Why CNNs? Before CNNs, traditional models required a manual feature engineering step: humans had to extract the characteristics of an image by hand (e.g., edges, corners, shapes). CNNs let the machine learn these features automatically from the raw data.
c. Applications. Computer vision: object detection, face recognition, image segmentation. Medical: tumor detection in MRI scans. Security: biometric recognition, intelligent video surveillance. Autonomous cars: reading road signs, identifying pedestrians. Art and creation: style transfer, automatic colorization.
3. Anatomy of a CNN
A CNN is composed of several layers, each playing a specific role in processing and analyzing images.
a. Convolution layer. The heart of the CNN. It applies a filter (or kernel) to the image to extract local features; for example, a filter may detect vertical or horizontal edges.
b. ReLU (Rectified Linear Unit). A nonlinear activation function. It applies f(x) = max(0, x) to each value, suppressing negative values and introducing nonlinearity into the model.
c. Pooling layer (subsampling). Reduces the size of the representations (feature maps). The most common variants: max pooling, average pooling. Reduces complexity and improves robustness.
d. Flatten + fully connected layers. At the end of the CNN, the data is flattened and processed by fully connected layers. This is where the final classification is made (e.g., cat or dog).
4. How a CNN works end to end
a. Forward propagation. The image passes from layer to layer, transformed at each step.

What is computer vision?
• The search for the fundamental visual features, and the two fundamental applications of reconstruction and recognition
– Features: map a 2D image into a vector or a point x
– Recognition
– Reconstruction
2012
• The current mania and euphoria of the AI revolution
– 2012, at the annual gathering, an improvement of 10% (from 75 to 85)
• Computer vision researchers use machine learning techniques to recognize objects in large amounts of images
– Going back to 1998 (1.5 decades!)
• Text (hand-written and printed) is actually visual!
– So why wait so long?
• A silent hardware revolution: the GPU
• Sadly driven by video gaming – Nvidia (the GPU maker) is now in the driving seat of this AI revolution!
• 2016, AlphaGo beats professionals
• A narrow AI program
• Re-shaping AI, computer science, the digital revolution …
Visual matching, and recognition for understanding
• Finding the visually similar things in different images --- visual similarities
• Visual matching: find the ‘same’ thing under different viewpoints; better defined, no semantics per se.
• Visual recognition: find the pre-trained ‘labels’, semantics
– We define ‘labels’, then ‘learn’ from labeled data, and finally classify ‘labels’
The state of the art of visual classification and recognition
• Anything you can clearly define and label
• Then show a few thousand examples (labeled data) of this thing to the computer
• A computer recognizes a new image, not seen before, now as well as humans, even better!
• This is done by deep neural networks.
References In the notes.pptx, some advanced topics In talks, some vision history CNN for Visual Recognition, Stanford http://cs231n.github.io/ Deep Learning Tutorial, LeNet, Montrealhttp://www.deeplearning.net/tutorial/mlp.html Deep Learning, Goodfellow, Bengio, and Courville Pattern Recognition and Machine Learning, Bishop Sparse and Redundant Representations, Elad Pattern Recognition and Neural Networks, Ripley Pattern Classification, Duda, Hart, different editions A wavelet tour of signal processing, a sparse way, Mallat Introduction to applied mathematics, StrangSome figures and texts in the slides are cut/paste from these references.
A few basics
Vector spaces and geometries
[Figure: the hierarchy of spaces R^n, A^n (affine), E^n (Euclidean, with the dot product), P^n (projective, with points at infinity)]
Naturally everything starts from the known vector space:
• add two vectors
• multiply any vector by any scalar
• zero vector – origin
• finite basis
• Vector space to affine: isomorphic, one-to-one (points, lines, parallelism)
• Vector to Euclidean as an enrichment: scalar product (angles, distances, circles)
• Affine to projective as an extension: add ideal elements (points at infinity)
Numerical computation and linear algebra
• Gaussian elimination
• LU decomposition
• Cholesky (symmetric positive) L L^T
• orthogonal decomposition
• QR (Gram-Schmidt)
• SVD (the high(est)light of linear algebra!)
A (m×n) = U Σ V^T
• row space: first Vs
• null space: last Vs
• column space: first Us
• null space of the transpose: last Us
Applications
• solving equations A x = b, by decomposition, SVD
• fewer equations than unknowns: constraints, sparse solutions
• more equations than unknowns: least squares solution, minimize ||Ax – b||^2, an eigendecomposition
• solving linear equations, and also iteratively the nonlinear ones f(x), or optimization
• PCA is SVD
High dimensionality discussion --- the curse
• [0,1]^d
• …
Visual recognition
Image classification and object recognition
• Viewing an image as a one-dimensional vector of features x, thanks to features! (or simply pixels if you would like)
• Image classification and recognition becomes a machine learning application f(x).
• What are good features?
• Unsupervised: given examples x_i of a random variable x, implicitly or explicitly we learn the distribution p(x), or some properties of p(x)
– PCA, Principal Components Analysis
– K-means clustering
• Given a set of training images and labels, we predict the labels of a test image.
• Supervised: given examples x_i and associated values or labels y_i, we learn to predict a y from an x, usually by estimating p(y|x)
– KNN, K-Nearest Neighbors
• Distance metric and the (hyper)parameter K
• Slow by the curse of dimension
– From linear classifiers to nonlinear neural networks
Unsupervised
• K-means clustering
– Partition n data points into k clusters
– NP hard – heuristics
• PCA, Principal Components Analysis
– Learns an orthogonal, linear transformation of the data
Supervised
• Given a set of training images and labels, we predict the labels of a test image.
• Supervised learning algorithms
– KNN, K-Nearest Neighbors
• Distance metric and the (hyper)parameter K
• Slow by the curse of dimension
– From linear classifiers to nonlinear neural networks
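The KNN idea above fits in a few lines. A minimal sketch (the function name and the toy data are our own, for illustration): classify a query point by a majority vote over its K nearest training points.

```python
def knn_predict(train, query, k=3):
    """Label a query point by majority vote of its k nearest neighbors.

    train: list of (point, label) pairs; distance is squared Euclidean.
    """
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # sort the training pairs by distance to the query and keep the k closest
    neighbors = sorted(train, key=lambda pair: dist2(pair[0], query))[:k]
    labels = [label for _, label in neighbors]
    # majority vote over the neighbor labels
    return max(set(labels), key=labels.count)
```

Every query requires a pass over all training points, which is the "slow by the curse of dimension" bullet above.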
Fundamental linear classifiers
Fundamental binary linear classifiers
• Binary linear classifier: the classification surface is a hyperplane
• The decision function could be a nonlinear thresholding d(x_i, wx+b), a nonlinear distance function, or a probability-like sigmoid
• Geometry, 3-d and n-d; linear algebra, linear space
• Linear classifiers: linear regression is the simplest case, where we have a closed-form solution
A neuron is a linear classifier
• A single neuron is a linear classifier, where the decision function is the activation function
• w x + b, a linear classifier, a neuron
– It’s a dot product of two vectors, a scalar product
– A template matching, a correlation, between the template w and the input vector x (or a matched filter)
– Also an algebraic distance, not the geometric one, which is nonlinear (therefore the solution is usually nonlinear!)
• The dot product is a metric distance between two points: one the data, the other the representative
• Its angular distance: x · w = ||x|| ||w|| cos θ; the true distance, by the cosine law, is d = ||x − w||^2 = ||x||^2 + ||w||^2 − 2 x · w
– The ‘linear’ is that the decision surface is linear, a hyperplane. The decision function is not linear at all
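The dot-product view above can be sketched in a few lines of Python (the weights, bias, and helper names are made-up, for illustration only):

```python
import math

def neuron_score(w, x, b):
    """A neuron's linear score: the dot product w . x plus the bias b."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def cos_angle(w, x):
    """The angular view: x . w = ||x|| ||w|| cos(theta), so
    cos(theta) = (x . w) / (||x|| ||w||)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    nw = math.sqrt(sum(wi * wi for wi in w))
    nx = math.sqrt(sum(xi * xi for xi in x))
    return dot / (nw * nx)
```

The decision surface is the hyperplane where the score crosses zero; the cosine-law identity ||x − w||^2 = ||x||^2 + ||w||^2 − 2 x · w ties the geometric distance to the same dot product.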
A biological neuron and its mathematical model.
A very simple old example
• Chroma-key segmentation: we want to remove the blue or green background in images
• Given an RGB image: alpha R + beta G + gamma B > threshold, e.g. 0.0 R + 0.0 G + 1.0 B > 100
A linear classifier is not that straightforward
• Two things: inference and learning
• Only the inference ‘scoring’ function is linear
– We have only one ‘linear’ classifier, but we do have different ways to define the loss to make the learning feasible. The learning is a (nonlinear) optimization problem in which we define a loss function.
• Learning is almost never linear!
– How to compute or learn this hyperplane, and how to assess which one is better? By defining an objective ‘loss’ function
• The classification is ‘discrete’, binary 0, 1
• No ‘analytical’ forms of the loss functions, except true linear regression with y − y_i
• It should be computationally feasible
– Logistic regression (though it is a classification, not a regression) converts the output of the linear function into a probability
– SVM
Activation (nonlinearity) functions
• ReLU, max(0, x)
• The sigmoid logistic function s(x) = 1/(1 + e^−x), normalized to between 0 and 1, is naturally probability-like
– so naturally, sigmoid for two classes,
– (and softmax for N classes)
– An activation (nonlinearity) function is not necessarily the logistic sigmoid between 0 and 1; others include tanh (centered), ReLU, …
• Tanh, 2 s(2x) − 1, centered between −1 and 1, is better
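The three activations named above, as a minimal sketch (function names are our own):

```python
import math

def relu(x):
    """ReLU: max(0, x)."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic sigmoid: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh_act(x):
    """tanh via the identity tanh(x) = 2*sigmoid(2x) - 1, centered in (-1, 1)."""
    return 2.0 * sigmoid(2.0 * x) - 1.0
```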
VisGraph, HKUST
The rectified linear unit, ReLU
• The smoother nonlinearities used to be favoured in the past.
• The sigmoid kills gradients; not used any more
• At present, ReLU is the most popular.
• Easy to implement: max(0, x)
• It learns faster with many layers; it accelerated the learning by a factor of 6 (in AlexNet)
• Not smooth at x = 0: the subderivative of a convex function is a set between the left and right derivatives; set it to zero for simplicity and sparsity
From two to N classes
• Classification f : R^n → {1, 2, …, n}, while regression f : R^n → R
• Multi-class: output a vector function x′ = g(W x + b), where g is the rectified linear unit
• W x + b: each row is a linear classifier, a neuron
The two common linear classifiers, with different loss functions
• SVM, uncalibrated score
– A hinge loss, the max-margin loss: a loss over the classes j different from y
– Computationally more feasible; leads to convex optimization
• Softmax f(*), the normalized exponentials, y = f(x′) = f(g(W x + b))
– multi-class logistic regression
– The scores are the unnormalized log probabilities
– the negative log-likelihood loss, then gradient descent
– (1, −2, 0) → exp → (2.71, 0.14, 1) → normalize → (0.7, 0.04, 0.26)
– ½(1, −2, 0) = (0.5, −1, 0) → exp → (1.65, 0.37, 1) → normalize → (0.55, 0.12, 0.33); more uniform with increasing regularization
• They are usually comparable
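The numeric example above can be checked directly. A minimal sketch (the helper names are our own):

```python
import math

def softmax(scores):
    """Exponentiate the unnormalized log probabilities, then normalize."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def nll_loss(scores, correct):
    """Negative log-likelihood of the correct class: L_i = -log p_correct."""
    return -math.log(softmax(scores)[correct])
```

softmax([1, -2, 0]) gives roughly (0.71, 0.04, 0.26), and the halved scores give roughly (0.55, 0.12, 0.33): a more uniform distribution, as stated above.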
From linear to non-linear classifiers, Multi-Layer Perceptrons
• Go higher and linear!
• Find a map or a transform x ↦ φ(x) to make them linear, but in higher dimensions
– A complete basis of polynomials → too many parameters for the limited training data
– Kernel methods, support vector machines, …
• Learn the nonlinearity at the same time as the linear classifiers → multilayer neural networks
Multi-Layer Perceptrons
• The first layer h^1 = g^1(W^1 x + b^1)
• The second layer h^2 = g^2(W^2 h^1 + b^2)
• One layer, g^1(x): linear classifiers
• Two layers, one hidden layer --- a universal nonlinear approximator
• The depth is the number of the layers, n+1: f ∘ g^n ∘ … ∘ g^1
• The dimension of the output vector h is the width of the layer
• An N-layer neural network does not count the input layer x
• But it does count the output layer f(*). It represents the class-score vector; it does not have an activation function (or has the identity activation function).
A 2-layer neural network: one hidden layer of 4 neurons (or units), one output layer with 2 neurons, and three inputs.
The network has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
A 3-layer neural network with three inputs, two hidden layers of 4 neurons each and one output layer. Notice that in both cases there are connections (synapses) between neurons across layers, but not within a layer.
The network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
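The parameter counts in the two captions above can be reproduced with a small helper (an illustrative function of our own, not from the slides):

```python
def count_params(layer_sizes):
    """Learnable parameters of a fully connected net.

    layer_sizes lists the widths, input first: [3, 4, 2] means three
    inputs, one hidden layer of 4 units, and an output layer of 2.
    Weights: a x b per consecutive pair of layers; biases: one per
    non-input unit.
    """
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    biases = sum(layer_sizes[1:])
    return weights + biases
```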
Universal approximator
• Given a function, there exists a feedforward network that approximates the function
• A single layer is sufficient to represent any function,
• but it may be infeasibly large,
• and it may fail to learn and generalize correctly
• Using deeper models can reduce the number of units required to represent the function, and can reduce the amount of generalization error
Functional summary
• The input x, the output y
• The hidden features h
• The hidden layer h = g(W x + b) is a nonlinear transformation: it provides a set of features describing x, or provides a new representation for x
• The nonlinearity is compositional
• Each layer h^k = g(W^k h^{k−1} + b^k), with g(z) = max(0, z)
• The output y = f(h)
MLP is not new, but the deep feedforward neural network is modern!
• Feedforward: from x to y, no feedback
• Networks: modeled as a directed acyclic graph for the composition of functions connected in a chain, f(x) = f^(N)( … f^(3)(f^(2)(f^(1)(x))))
• The depth of the network is N
• ‘Deep learning’ as N is increasing.
Forward inference f(x) and backward learning ∇f(x)
• A (parameterized) score function f(x, w) mapping the data to class scores: forward inference, modeling
• A loss function (objective) L measuring the quality of a particular set of parameters based on the ground-truth labels
• Optimization: minimize the loss over the parameters with a regularization, backward learning
The dataset of pairs (x, y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in vector f. The loss function contains two components: the data loss computes the compatibility between the scores f and the labels y; the regularization loss is only a function of the weights. During gradient descent, we compute the gradient on the weights (and optionally on the data if we wish) and use it to perform a parameter update.
Cost or loss functions
• Usually the parametric model f(x, theta) defines a distribution p(y | x; theta) and we use the maximum likelihood, that is, the cross-entropy between the training data y and the model’s predictions f(x, theta), as the loss or the cost function.
– The cross-entropy between a ‘true’ distribution p and an estimated distribution q is H(p, q) = − sum_x p(x) log q(x)
• The cost J(theta) = − E log p(y|x); if p is normal, we have the mean squared error cost J = ½ E ||y − f(x; theta)||^2 + const
• The cost can also be viewed as a functional, mapping functions to real numbers; we are learning functions f parameterized by theta. By calculus of variations, f(x) = E_{y∼p(y|x)}[y]
• The SVM loss is carefully designed and special: hinge loss, max-margin loss
• The softmax loss is the cross-entropy between the estimated class probabilities e^{y_i} / sum_j e^{y_j} and the true class labels, also the negative log-likelihood loss L_i = − log(e^{y_i} / sum_j e^{y_j})
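The cross-entropy definition H(p, q) = −Σ p log q can be sketched directly; with a one-hot ‘true’ distribution it reduces to the negative log-likelihood of the true class:

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x); terms with p(x) = 0 contribute 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)
```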
MLE, likelihood and probability discussions; KL, information theory, ….
Gradient-based learning
• Define the loss function
• Gradient-based optimization with the chain rule
• z = f(y) = f(g(x)), dz/dx = dz/dy · dy/dx
• In vectors x and y, the gradient ∇_x z = J^T ∇_y z, where J is the Jacobian matrix of g
• In tensors, back-propagation
• Analytical gradients are simple
– d max(0, x)/dx = 1 or 0; d sigmoid(x)/dx = (1 − sig) sig
• Use the centered difference (f(x+h) − f(x−h))/2h, error of order O(h^2)
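The centered difference in the last bullet is the standard numerical gradient check. A sketch:

```python
def numeric_grad(f, x, h=1e-5):
    """Centered difference (f(x+h) - f(x-h)) / (2h), error of order O(h^2)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

It is used to verify analytic gradients, e.g. d(x^3)/dx = 3x^2, or the ReLU subderivative away from 0.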
Stochastic gradient descent
• min loss(f(x; theta)), a function of the parameters theta, not of x
• min f(x)
• Gradient descent, or the method of steepest descent: x′ = x − ε ∇_x f(x)
• Gradient descent follows the gradient of an entire training set downhill
• Stochastic gradient descent follows the gradient of randomly selected minibatches downhill
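A toy sketch of the minibatch idea (a hypothetical example of our own: the per-example loss ½(x − d)² has gradient x − d, and the full-batch minimizer is the data mean):

```python
import random

def sgd(grad_fn, data, x0, lr=0.1, batch_size=2, steps=200, seed=0):
    """Follow the gradient of randomly selected minibatches downhill."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)            # a random minibatch
        g = sum(grad_fn(x, d) for d in batch) / len(batch)
        x = x - lr * g                                  # descent step
    return x

# per-example gradient of 0.5*(x - d)^2 is (x - d)
data = [1.0, 2.0, 3.0, 4.0, 5.0]
x_hat = sgd(lambda x, d: x - d, data, x0=0.0)
```

x_hat hovers near the full-batch minimizer 3.0, with some noise from the minibatch sampling; that noise is the price paid for never touching the whole training set at once.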
Monte Carlo discussions ….
Auto-differentiation discussions …
Regularization
• loss term + lambda * regularization term
• Regularization as constraints for an underdetermined system
– E.g. A^T A + alpha I; solving the linear homogeneous Ax = 0, min ||Ax||^2 with ||x||^2 = 1
• Regularization for the stable, unique solution of ill-posed problems (no unique solution)
• Regularization as a prior for MAP on top of the maximum likelihood estimation; again all approximations to the full Bayesian inference
• L1/L2 regularization
– L1 regularization has the ‘feature selection’ property, that is, it produces a sparse vector, setting many entries to zero
– L2 is usually diffuse, producing small numbers; it is not explicitly selecting features
Architecture prior and inductive biases discussions ...
Hyperparameters and validation
• The hyper-parameter lambda
• Split the training data into two disjoint subsets
– One is to learn the parameters
– The other subset, the validation set, is to guide the selection of the hyperparameters
A linearly non-separable toy example
• The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.
– 300 pts, 3 classes
• A linear classifier fails to learn the toy spiral dataset.
• A neural network classifier crushes the spiral dataset.
– One hidden layer of width 100
A toy example from cs231n
• The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable.
– 3 classes, 100 pts for each class
• A softmax linear classifier fails to learn the toy spiral dataset.
– One layer: W, 2*3, b
– Analytical gradients, 190 iterations, loss from 1.09 to 0.78, 48% training-set accuracy
• A neural network classifier crushes the spiral dataset.
– One hidden layer of width 100: W1, 2*100, b1, W2, 100*3 --- only a few extra lines of Python code!
– 9000 iterations, loss from 1.09 to 0.24, 98% training-set accuracy
Generalization
• The ‘optimization’ reduces the training errors (the residual errors of before)
• ‘Machine learning’ wants to reduce the generalization error, or the test error, as well.
• The generalization error is the expected value of the error on a new input, from the test set.
– Make the training error small.
– Make the gap between training and test error small.
• Underfitting: not able to achieve sufficiently low error on the training set
• Overfitting: not able to narrow the gap between the training and the test error.
The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting. (Center) A quadratic function fit to the data generalizes well to unseen points. It does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all of the training points exactly, but we have not been lucky enough for it to extract the correct structure.
The capacity of a model
• The old Occam’s razor: among competing ones, we should choose the “simplest” one.
• The modern VC (Vapnik-Chervonenkis) dimension: the largest possible value of m for which there exists a training set of m different x points that the binary classifier can label arbitrarily.
• The no-free-lunch theorem: no best learning algorithm, no best form of regularization. Task specific.
Practice: learning rate
Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on the CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy).
Practice: avoid overfitting
The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting. When you see this in practice you probably want to increase regularization or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.
Why deeper?
• Deeper networks are able to use far fewer units per layer and far fewer parameters, while frequently generalizing better to the test set
• But they are harder to optimize!
• Choosing a deep model encodes a very general belief that the function we want to learn involves the composition of several simpler functions. Or: the learning consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.
Curse of dimensionality
• The core idea in deep learning is that we assume the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy.
• This assumption allows an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished.
• The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
Convolutional Neural Networks, or CNNs: a visual network, back to a 2D lattice from the abstraction of the 1D feature vector x in neural networks
From a regular network to CNN
• We have regarded an input image as a vector of features, x, by virtue of feature detection and selection, like other machine learning applications, to learn f(x)
• We now regard it as it is: a 2D image, a 2D grid, a topological discrete lattice, I(i, j)
• Converting input images into a feature vector loses the spatial neighborhood-ness
• Full connectivity makes the complexity increase cubically
• Yet the connectivities become local, to reduce the complexity!
What is a convolution?
• The fundamental operation, the convolution: (I * K)(i, j) = sum_m sum_n I(m, n) K(i − m, j − n)
• Flipping the kernel makes the convolution commutative, which is fundamental in theory, but not required in NN to compose with other functions; so discrete correlations are included as well into “convolution”
• Convolution is a linear operator, dot-product like correlation, not a matrix multiplication, but can be implemented as a sparse matrix multiplication, to be viewed as an affine transform
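A direct sketch of the ‘valid’ 2D convolution above, with the kernel flip included (set up for small Python lists, not real feature maps):

```python
def conv2d(image, kernel):
    """'Valid' 2D convolution: sums I(i+m, j+n) against the kernel flipped
    in both axes. Dropping the flip gives the cross-correlation that NN
    implementations typically compute under the name 'convolution'."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = 0.0
            for m in range(kh):
                for n in range(kw):
                    s += image[i + m][j + n] * kernel[kh - 1 - m][kw - 1 - n]
            row.append(s)
        out.append(row)
    return out
```

For instance, a [1, −1] kernel responds exactly at a vertical step edge, the kind of local feature the slides mention.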
A CNN arranges its neurons in three dimensions (width, height, depth). Every layer of a CNN transforms the 3D input volume to a 3D output volume. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (red, green, blue channels).
A regular 3-layer neural network.
LeNet: a layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits. [LeNet]
Convolutional neural networks: 1998. Input 32*32. CPU.
AlexNet: a layered model composed of convolution, subsampling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [AlexNet]
+ data
+ gpu
+ non-saturating nonlinearity
+ regularization
Convolutional neural networks: 2012. Input 224*224*3. GPU.
LeNet vs AlexNet
LeNet:
• 32*32*1 input
• 7 layers: 2 conv and 4 classification
• 60 thousand parameters
• Only two complete convolutional layers
– Conv, nonlinearities, and pooling as one complete layer
• About 17 k parameters
• (from the 1989 LeNet-1 of about 10 k parameters to the 1998 LeNet)
AlexNet:
• 224*224*3 input
• 8 layers: 5 conv and 3 fully connected classification
• 5 convolutional layers, with 3, 4, 5 stacked on top of each other
• Three complete conv layers
• 60 million parameters, insufficient data
• Data augmentation:
– Patches (224 from 256 input), translations, reflections
– PCA, to simulate changes in intensity and colors
The motivation of convolutions
• Sparse interaction, or local connectivity
– The receptive field of the neuron, or the filter size
– The connections are local in space (width and height), but always full in depth
– A set of learnable filters
• Parameter sharing: the weights are tied
• Equivariant representation, translation invariance
Convolution and matrix multiplication
• Discrete convolution can be viewed as multiplication by a matrix
• The kernel is a doubly block circulant matrix
• It is very sparse!
The ‘convolution’ operation
• The convolution is commutative because we have flipped the kernel
– Many implementations compute a cross-correlation, without flipping
• A convolution can be defined for 1, 2, 3, and N dimensions
– The 2D convolution is different from a real 3D convolution, which integrates the spatio-temporal information; the standard CNN convolution has only ‘spatial’ spreading
• In a CNN, even for 3-channel RGB input images, the standard convolution is 2D in each channel
– Each channel has a different filter or kernel; the convolution per channel is then summed up over all channels to produce a scalar for the nonlinearity activation
– The filter in each channel is not normalized, so there is no need for different linear combination coefficients
– A 1*1 convolution is a dot product across channels, a linear combination of the different channels
• The backward pass of a convolution is also a convolution, with spatially flipped filters.
The convolution layers
• Stacking several small convolution layers is different from cascading convolutions
– As each small convolution is followed by the nonlinearity ReLU
– The nonlinearities make the features more expressive!
– Small filters have fewer parameters, but use more memory.
– Cascading simply enlarges the spatial extent, the receptive field
• Is each conv layer also followed by a pooling?
– LeNet does not do this!
– AlexNet at first did not.
The pooling layer
• Reduces the spatial size
• Reduces the number of parameters
• Avoids over-fitting
• Backpropagation for a max: only route the gradient to the input that had the highest value in the forward pass
• It is unclear whether pooling is essential.
Pooling layer down-samples the volume spatially, independently in each depth slice of the input volume.
Left: the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into an output volume of size [112x112x64]. Notice that the volume depth is preserved.
Right: The most common down-sampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (a little 2x2 square).
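The 2x2, stride-2 max pooling described above, as a sketch over one depth slice (the function name is our own):

```python
def max_pool(x, size=2, stride=2):
    """Max pooling on one depth slice: each output value is the max over
    a size x size window, with windows stepping by `stride`."""
    h, w = len(x), len(x[0])
    return [[max(x[i + m][j + n] for m in range(size) for n in range(size))
             for j in range(0, w - size + 1, stride)]
            for i in range(0, h - size + 1, stride)]
```

A 4x4 slice becomes 2x2: the spatial size halves while each kept value is the max of its little 2x2 square.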
The spatial hyperparameters
• Depth
• Stride
• Zero-padding
The convolution and pooling act asan infinitely strong prior!• A strong prior has very low entropy, e.g. a Gaussianwith low variance• An infinitely strong prior says that some parametersare forbidden, and places zero probability on them• The convolution ‘prior’ says the identical and zeroweights• The pooling forces the invariance of small translations
The neuroscientific basis for CNN
• The primary visual cortex, V1, is the area about which we know the most
• The brain region LGN, the lateral geniculate nucleus, at the back of the head, carries the signal from the eye to V1; a convolutional layer captures three aspects of V1:
– It has a 2-dimensional structure
– V1 contains many simple cells, linear units
– V1 has many complex cells, corresponding to features with shift invariance, similar to pooling
• When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT, the inferotemporal cortex, corresponding to the last layer of CNN features
• Not modeled at all: the mammalian vision system develops an attention mechanism
– The human eye is mostly very low resolution, except for a tiny patch, the fovea.
– The human brain makes several eye movements, saccades, to salient parts of the scene
– Human vision perceives 3D
• A simple cell responds to a specific spatial frequency of brightness in a specific direction at a specific location --- Gabor-like functions
Receptive field
Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input - see the discussion of depth columns in the text below.
Right: The neurons from the neural network chapter remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.
Receptive field
Gabor functions
Gabor-like learned features
CNN architectures and algorithms
CNN architectures
• The conventional linear structure: a linear list of layers, feedforward
• Generally a DAG, a directed acyclic graph
• ResNet simply adds the input back
• Different terminology: complex layer and simple layer
– A complex (complete) convolutional layer includes different stages such as the convolution per se, batch normalization, nonlinearity, and pooling
– Each stage is a layer, even when there are no parameters
• The traditional CNNs are just a few complex convolutional layers to extract features, followed by a softmax classification output layer
• Convolutional networks can output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task; the output is a tensor
– S_{i,j,k} is the probability that pixel (j, k) belongs to class i
The popular CNNs
• LeNet, 1998
• AlexNet, 2012
• VGGNet, 2014
• ResNet, 2015
VGGNet
• 16 layers
• Only 3*3 convolutions
• 138 million parameters
ResNet
• 152 layers
• ResNet50
Computational complexity
• The memory bottleneck
• GPU, a few GB
Stochastic gradient descent
• Gradient descent follows the gradient of an entire training set downhill
• Stochastic gradient descent follows the gradient of randomly selected minibatches downhill
The dropout regularization
• Randomly shut down a subset of units in training
• It is a sparse representation
• It is a different net each time, but all the nets share the parameters
– A net with n units can be seen as a collection of 2^n possible thinned nets, all of which share weights.
– At test time, it is a single net with averaging
• Avoids over-fitting
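‘Inverted’ dropout is the common way to get the test-time averaging essentially for free: scale the surviving activations by 1/p at train time so the expected activation is unchanged and the test-time net is used as-is. A sketch (names are our own):

```python
import random

def dropout(acts, p_keep=0.5, train=True, seed=None):
    """Randomly zero each unit with probability 1 - p_keep at train time
    and scale the survivors by 1/p_keep; at test time, pass through."""
    if not train:
        return list(acts)
    rng = random.Random(seed)
    return [a / p_keep if rng.random() < p_keep else 0.0 for a in acts]
```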
Ensemble methods
• Bagging (bootstrap aggregating), model averaging, ensemble methods
– Model averaging outperforms, with increased computation and memory
– Model averaging is discouraged when benchmarking for publications
• Boosting, an ensemble method, constructs an ensemble with higher capacity than the individual models
Batch normalization
• After convolution, before the nonlinearity
• ‘Batch’ because it is done for a subset of the data
• Instead of forcing a 0-1 (zero mean, unit variance) input, the distribution will be learnt, and even undone if necessary
• Data normalization or PCA/whitening is common in general NNs, but in CNNs the ‘normalization layer’ has been shown to have minimal effect in some nets as well.
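The training-time normalization stage can be sketched in numpy; the learnable gamma/beta are what let the net re-scale, or even undo, the normalization if that helps (toy data, illustrative shapes):

```python
import numpy as np

def batchnorm_train(x, gamma, beta, eps=1e-5):
    # Normalize each feature over the minibatch, then re-scale and shift.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(2)
x = 5.0 + 3.0 * rng.normal(size=(256, 4))        # batch of 256, 4 features
out = batchnorm_train(x, gamma=np.ones(4), beta=np.zeros(4))
```

With gamma = 1 and beta = 0 the output has zero mean and unit variance per feature; during training the net is free to move gamma/beta away from these defaults.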
Open questions
• Networks are non-convex
– Need regularization
• Smaller networks are hard to train with local methods
– Local minima are bad in loss, not stable, large variance
• Bigger ones are easier
– More local minima, but better ones, more stable, small variance
• As big as the computational power, and the data, allow!
Local minima, saddle points, and plateaus?
• We don’t care about finding the exact minimum; we only want to obtain good generalization error by reducing the function value.
• In low dimensions, local minima are common.
• In higher dimensions, local minima are rare, and saddle points are more common.
– The Hessian at a local minimum has all positive eigenvalues. The Hessian at a saddle point has a mixture of positive and negative eigenvalues.
– It is exponentially unlikely that all n eigenvalues will have the same sign in high n dimensions
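The eigenvalue criterion can be seen on two toy quadratics (the functions are invented for illustration): f(x, y) = x² + y² has a minimum at the origin, f(x, y) = x² − y² has a saddle there.

```python
import numpy as np

# Hessian of f(x, y) = x^2 + y^2 at the origin: all eigenvalues positive -> minimum.
H_min = np.array([[2.0, 0.0],
                  [0.0, 2.0]])
# Hessian of f(x, y) = x^2 - y^2 at the origin: mixed signs -> saddle point.
H_saddle = np.array([[2.0, 0.0],
                     [0.0, -2.0]])

eig_min = np.linalg.eigvalsh(H_min)
eig_saddle = np.linalg.eigvalsh(H_saddle)
```

Under a rough "random sign" model for the n eigenvalues, all-same-sign has probability about 2^(1−n), which is the intuition behind the last bullet.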
CNN applications
CNN applications
• Transfer learning
• Fine-tuning the CNN
– Keep some early layers
• Early layers contain more generic features: edges, color blobs
• Common to many visual tasks
– Fine-tune the later layers
• More specific to the details of the class
• CNN as feature extractor
– Remove the last fully connected layer
– A kind of descriptor or CNN codes for the image
– AlexNet gives a 4096-dimensional descriptor
VisGraph, HKUST
CNN classification/recognition nets
• CNN layers and fully-connected classification layers
• From ResNet to DenseNet
– Densely connected
– Feature concatenation
Fully convolutional nets: semantic segmentation
• Classification/recognition nets produce ‘non-spatial’ outputs
– the last fully connected layer has the fixed dimension of the classes and throws away spatial coordinates
• Fully convolutional nets output maps as well
Semantic segmentation
Using sliding windows for semantic segmentation
Fully convolutional
Fully convolutional
VisGraph, HKUST
Detection and segmentation nets: the Mask Region-based CNN (R-CNN)
• Class-independent region (bounding box) proposals
– From selective search to a region proposal net with objectness
• Use a CNN to classify each region
• Regression on the bounding box or contour segmentation
• Mask R-CNN: end-to-end
– Use a CNN to make proposals on object/non-object in parallel
• The good old idea of face detection by Viola
– Proposal generation
– Cascading (AdaBoost)
Using sliding windows for object detection as classification
Mask R-CNN
Excellent results
End.
VisGraph, HKUST
Some old notes in 2017 and 2018
Fundamentally from continuous to discrete views … from geometry to recognition
• ‘Simple’ neighborhood from topology
• discrete high order
• Modeling the higher-order discrete, yet solving it with first-order differentiable optimization
• Modeling and implementation become easier
• The multiscale local jet, a hierarchical and local characterization of the image in a full scale-space neighborhood
VisGraph, HKUST
Local jets? Why?
• The multiscale local jet, a hierarchical and local characterization of the image in a full scale-space neighborhood
• A feature in a CNN is one element of the descriptor
Classification vs regression
• Regression predicts a value from a continuous set, a real/continuous number
– Given a set of data, find the relationship (often, a continuous mathematical function) that represents the set of data
– To predict the output value using training data
• Whereas classification predicts the ‘belonging’ to a class, a discrete/categorical variable
– Given a set of data and classes, identify the class that the data belongs to
– To group the output into a class
Autoencoder and decoder
• Compression, and reconstruction
• Convolutional autoencoder, trained to reconstruct its input
– Wide and thin (RGB) to narrow and thick
• …
Convolution and deconvolution
• The convolution is not invertible, so there is no strict definition, or closed form, of the so-called ‘deconvolution’
• In an iterative procedure, a kind of convolution transpose is applied, hence the name ‘deconvolution’
• The ‘deconvolution’ filters are reconstruction bases
CNN as natural features and descriptors
• Each point is an interest point or feature point with a high-dimensional descriptor
• Some of them are naturally ‘unimportant’, and weighted down by the nets
• The spatial information is kept through the nets
• The hierarchical description is natural, from local to global, for each point, each pixel, and each feature point
Traditional stereo vs deep stereo regression
VisGraph, HKUST
CNN regression nets: deep stereo regression
• Regularize the cost volume.
Traditional stereo
• Input image H × W × C
• (The matching) cost volume in disparities D, or in depths
• D × H × W
• The value d_i for each level of D is the matching cost, or the correlation score, or the score accumulated in the scale space, for disparity i or depth i
• Disparities are a function of H and W: d = c(x, y; x+d, y+d)
• Argmin over D
• H × W
End-to-end deep stereo regression
• Input image H × W × C
• 18 CNN layers
• H × W × F
• (F features are descriptor vectors for each pixel; we may just correlate or dot-product two descriptor vectors f1 and f2 to produce a score in D × H × W. But F could be further refined in successive convolution layers.)
• Cost volume, for each disparity level
• D × H × W × 2F
• 4D volume, viewed as a descriptor vector 2F for each voxel of D × H × W
• 3D convolution on H, W, and D
• 19-37: 3D CNN layers
• The last one (a deconvolution) turns F into a scalar score
• D × H × W
• Soft argmin over D
• H × W
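The final "soft argmin" over the disparity axis is the differentiable stand-in for the hard argmin of traditional stereo: it is the expected disparity under softmax(−cost). A numpy sketch on a toy cost volume (shapes and values invented for illustration):

```python
import numpy as np

def soft_argmin(cost, axis=0):
    """sum_d d * softmax(-cost)_d along the disparity axis: differentiable argmin."""
    w = np.exp(-cost - (-cost).max(axis=axis, keepdims=True))  # stable softmax of -cost
    w /= w.sum(axis=axis, keepdims=True)
    shape = [-1 if i == axis else 1 for i in range(cost.ndim)]
    d = np.arange(cost.shape[axis]).reshape(shape)             # disparity indices
    return (w * d).sum(axis=axis)

# Toy D x H x W cost volume whose minimum is at disparity 3 everywhere.
D, H, W = 8, 2, 2
cost = np.full((D, H, W), 10.0)
cost[3] = 0.0
disp = soft_argmin(cost)     # an H x W disparity map, close to 3 everywhere
```

When one disparity dominates, the soft argmin approaches the hard argmin, while remaining differentiable for end-to-end training.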
Bayesian decision
posterior = likelihood × prior / evidence
Decide ω1 if P(ω1|x) > P(ω2|x); otherwise decide ω2
Statistical measurements: precision, recall, coverage
VisGraph, HKUST
Reinforcement learning (RL)
• Dynamic programming uses a mathematical model of the environment, which is typically formulated as a Markov Decision Process
• The main difference between classical dynamic programming and RL is that RL does not assume an exact mathematical model of the Markov Decision Process, and targets large MDPs where exact methods become infeasible
• Therefore, RL is neuro-dynamic programming. In operational research and control, it’s called approximate dynamic programming
Non-linear iterative optimisation
• J d = r, from the linearization F(x+d) = F(x) + J d
• minimize the square of y − F(x+d) = y − F(x) − J d = r − J d
• the normal equation is J^T J d = J^T r (Gauss-Newton)
• (H + lambda I) d = J^T r (LM)
Note: F is a vector of functions, i.e. min f = (y − F)^T (y − F)
General non-linear optimisation
f(x+d) = f(x) + g^T d + ½ d^T H d
• 1st-order gradient descent: d = −g, i.e. H = I
• 2nd-order:
• Newton step: H d = −g
• Gauss-Newton for least squares: f = (y − F)^T (y − F), H = J^T J, g = −J^T r
• ‘restricted step’, trust-region, LM: (H + lambda W) d = −g
R. Fletcher: Practical Methods of Optimisation
Note: f is a scalar-valued function here.
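The LM update (J^T J + lambda I) d = J^T r can be sketched on a toy 1-D curve fit; the model y = a·exp(b·t), the data, and the damping value are all invented for illustration:

```python
import numpy as np

# Fit y = a * exp(b t) by damped Gauss-Newton / LM steps.
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)              # noise-free data, true (a, b) = (2, -1.5)

a, b, lam = 1.0, 0.0, 1e-3              # initial guess and a small fixed damping
for _ in range(50):
    F = a * np.exp(b * t)
    r = y - F                           # residual vector
    # Jacobian of F w.r.t. (a, b): columns dF/da and dF/db.
    J = np.stack([np.exp(b * t), a * t * np.exp(b * t)], axis=1)
    d = np.linalg.solve(J.T @ J + lam * np.eye(2), J.T @ r)   # (H + lam I) d = J^T r
    a, b = a + d[0], b + d[1]
```

At convergence r → 0, so the fixed point does not depend on lambda; the damping only stabilizes the early steps (a full LM implementation would also adapt lambda).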
Statistics
• ‘small errors’ --- classical estimation theory
• analytical, based on first-order approximation
• Monte Carlo
• ‘big errors’ --- robust statistics
• RANSAC
• LMS
• M-estimators
Talk about it later
An abstract view
• The input x
• The classification with linear models; can be an SVM
• The output layer of the linear softmax classifier
• Find a map or a transform to make them linear, but in higher dimensions
• Provides a set of features describing x
• Or provides a new representation for x
• The nonlinear transformation
• Can be hand-designed kernels in an SVM
• It is the hidden layer of a feedforward network
• Deep learning
• To learn is compositional, multiple layers
• A CNN is convolutional for each layer
VisGraph, HKUST
To do:
• Add drawing for ‘receptive fields’
• Dot product for vectors, convolution, more specific for 2D? Time invariant, or translation invariant, equivariant
• Convolution, then nonlinear activation, also called the ‘detection stage’, detector
• Pooling is for efficiency, down-sampling, and handling inputs of variable sizes
Cost or loss functions - bis
• Classification and regression are different! for different application communities
• We used more traditional regression, but the modern practice is more on classification, resulting in different loss considerations
• Regression is harder to optimize, while classification is easier and more stable
• Classification is easier than regression, so always discretize and quantize the output, and convert the tasks into classification tasks!
• One big improvement in modern NN development is that the cross-entropy dominates the mean squared error L2; the mean squared error was popular and good for regression, but not that good for NNs, because cross-entropy rests on a fundamentally more appropriate distribution assumption, not normal distributions
• L2 for regression is harder to optimize than the more stable softmax for classification
• L2 is also less robust
Automatic differentiation (algorithmic differentiation), and backprop: its role in the development
• Differentiation: symbolic or numerical (finite differences)
• Automatic differentiation computes derivatives algorithmically; backprop is only one approach to it
• Its history is related to that of NNs and deep learning
• It worked for traditional small systems, a f(x)
• The larger and more explicitly compositional nature of f1(f2(f3(… fn(x)))) goes back to the very nature of the derivatives
– Forward mode and reverse mode (based on f(a + b·epsilon) = f(a) + f'(a)·b·epsilon, and f(g(a + b·epsilon)) = f(g(a)) + f'(g(a))·g'(a)·b·epsilon)
– The reverse mode is backprop for NNs
• In the end, it benefits as well the classical large optimizations such as bundle adjustment
Viewing the composition of an arbitrary function as a natural layering
• Take f(x, y) = (x + s(y)) / (s(x) + (x+y)^2), with s the sigmoid, at a given point x = 3, y = −4
• Forward pass
• f1 = s(y), f2 = x + f1, f3 = s(x), f4 = x + y, f5 = f4^2, f6 = f3 + f5, f7 = 1/f6, f8 = f2 · f7
• So f(·) = f8(f7(f6(f5(f4(f3(f2(f1(·)))))))), each fn a known elementary function or operation
• Backprop to get (df/dx, df/dy), abbreviated as (dx, dy), at (3, −4)
• f8 = f; abbreviate df/df7 or df8/df7 as df7; df/df7 = f2, …, and df/dx as dx, …
• df7 = f2, (df2 = f7), df6 = (−1/f6^2)·df7, df5 = df6, (df3 = df6), df4 = (2·f4)·df5, dx = df4, dy = df4, dx += (1 − s(x))·s(x)·df3 (backprop through s(x) = f3), dx += df2 and df1 = df2 (backprop through f2), dy += (1 − s(y))·s(y)·df1 (backprop through s(y) = f1)
• In NN, there are just more variables in each layer, but the elementary functions are much simpler: add, multiply, and max.
• Even the primitive function in each layer is also the simplest one! Then just a lot of them!
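The worked example can be checked numerically. The sketch below mirrors the forward pass f1…f8 and the backward sweep, then compares the analytic gradient with centered differences at (3, −4):

```python
import numpy as np

def s(x):                              # the sigmoid s(x) = 1 / (1 + e^-x)
    return 1.0 / (1.0 + np.exp(-x))

def f(x, y):                           # f(x, y) = (x + s(y)) / (s(x) + (x + y)^2)
    return (x + s(y)) / (s(x) + (x + y) ** 2)

def grad_f(x, y):
    # Forward pass, naming the intermediates as on the slide.
    f1 = s(y); f2 = x + f1
    f3 = s(x); f4 = x + y; f5 = f4 ** 2
    f6 = f3 + f5; f7 = 1.0 / f6        # f8 = f2 * f7 is the output
    # Backward sweep: dfk denotes df/dfk, starting from df8 = 1.
    df7 = f2
    df2 = f7
    df6 = (-1.0 / f6 ** 2) * df7
    df5 = df6
    df3 = df6
    df4 = 2.0 * f4 * df5
    df1 = df2                          # f2 = x + f1
    dx = df4 + (1 - f3) * f3 * df3 + df2   # paths via f4, s(x) = f3, and f2
    dy = df4 + (1 - f1) * f1 * df1         # paths via f4 and s(y) = f1
    return dx, dy

x0, y0 = 3.0, -4.0
dx_a, dy_a = grad_f(x0, y0)
h = 1e-5                               # centered-difference check
dx_n = (f(x0 + h, y0) - f(x0 - h, y0)) / (2 * h)
dy_n = (f(x0, y0 + h) - f(x0, y0 - h)) / (2 * h)
```

Such a gradient check is the standard way to validate a hand-derived backward pass before trusting it inside a training loop.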
Computational graph
• NN is described with a relatively informal graph language
• A more precise computational graph describes the backprop algorithms
Structured probabilistic models, graphical models, and factor graphs
• A structured probabilistic model is a way of describing a probability distribution.
• Structured probabilistic models are referred to as graphical models
• Factor graphs are another way of drawing undirected models that resolve an ambiguity in the graphical representation of standard undirected model syntax.

Introduction to Convolutional Neural Network.pptx

  • 1.
    What is computer vision?
• The search for the fundamental visual features, and the two fundamental applications of reconstruction and recognition
– Features: maps a 2D image into a vector or a point x
– Recognition
– Reconstruction
  • 2.
    2012
• The current mania and euphoria of the AI revolution
– 2012, annual gathering, an improvement of 10% (from 75 to 85)
• Computer vision researchers use machine learning techniques to recognize objects in large amounts of images
– Going back to 1998 (1.5 decades!)
• Textual content (hand-written and printed) is actually visual!
– And why wait for so long?
• A silent hardware revolution: the GPU
• Sadly driven by video gaming – Nvidia (the GPU maker) is now in the driving seat of this AI revolution!
• 2016, AlphaGo beats professionals
• A narrow AI program
• Re-shaping AI, Computer Science, the digital revolution …
  • 3.
    Visual matching, and recognition for understanding
• Finding the visually similar things in different images --- visual similarities
• Visual matching: find the ‘same’ thing under different viewpoints; better defined, no semantics per se.
• Visual recognition: find the pre-trained ‘labels’, semantics
– We define ‘labels’, then ‘learn’ from labeled data, and finally classify ‘labels’
  • 4.
    The state of the art of visual classification and recognition
• Anything you can clearly define and label
• Then show a few thousand examples (labeled data) of this thing to the computer
• A computer recognizes a new image, not seen before, now as well as humans, even better!
• This is done by deep neural networks.
  • 5.
    References In thenotes.pptx, some advanced topics In talks, some vision history CNN for Visual Recognition, Stanford http://cs231n.github.io/ Deep Learning Tutorial, LeNet, Montrealhttp://www.deeplearning.net/tutorial/mlp.html Deep Learning, Goodfellow, Bengio, and Courville Pattern Recognition and Machine Learning, Bishop Sparse and Redundant Representations, Elad Pattern Recognition and Neural Networks, Ripley Pattern Classification, Duda, Hart, different editions A wavelet tour of signal processing, a sparse way, Mallat Introduction to applied mathematics, StrangSome figures and texts in the slides are cut/paste from these references.
  • 6.
  • 7.
    Vector spaces and geometries
Naturally everything starts from the known vector space R^n:
• add two vectors
• multiply any vector by any scalar
• zero vector – origin
• finite basis
(Diagram: R^n, the affine space A^n, the Euclidean space E^n with the dot product, the projective space P^n with points at infinity.)
  • 8.
    • Vector space to affine: isomorph, one-to-one
• vector to Euclidean as an enrichment: scalar product
• affine to projective as an extension: add ideal elements
Pts, lines, parallelism
Angles, distances, circles
Pts at infinity
  • 9.
    Numerical computation and linear algebra
• Gaussian elimination
• LU decomposition
• Cholesky (symmetric positive) L L^T
• orthogonal decomposition
• QR (Gram-Schmidt)
• SVD (the high(est)light of linear algebra!)
A (m×n) = U Σ V^T
• row space: first Vs
• null space: last Vs
• col space: first Us
• null space of the transpose: last Us
  • 10.
    Applications
• solving equations A x = b: decomposition, SVD
• underdetermined: constraints, sparse solutions
• overdetermined: the least squares solution, min ||Ax − b||^2, through the eigen decomposition of A^T A
• solving linear equations, and iteratively nonlinear ones f(x), or optimization
• PCA is SVD
  • 11.
    High dimensionality discussion --- the curse
• [0,1]^d
• …
  • 12.
  • 13.
    Image classification and object recognition
• Viewing an image as a one-dimensional vector of features x, thanks to features! (or simply pixels if you would like)
• Image classification and recognition become a machine learning application f(x).
• What are good features?
• Unsupervised: given examples x_i of a random variable x, implicitly or explicitly we learn the distribution p(x), or some properties of p(x)
– PCA, Principal Component Analysis
– K-means clustering
• Given a set of training images and labels, we predict the labels of a test image.
• Supervised: given examples x_i and associated values or labels y_i, we learn to predict a y from an x, usually by estimating p(y|x)
– KNN, K-Nearest Neighbors
• Distance metric and the (hyper)parameter K
• Slow, by the curse of dimension
– From linear classifiers to nonlinear neural networks
  • 14.
    Unsupervised
• K-means clustering
– Partition n data points into k clusters
– NP hard
– heuristics
• PCA, Principal Component Analysis
– Learns an orthogonal, linear transformation of the data
  • 16.
    Supervised
• Given a set of training images and labels, we predict the labels of a test image.
• Supervised learning algorithms
– KNN, K-Nearest Neighbors
• Distance metric and the (hyper)parameter K
• Slow, by the curse of dimension
– From linear classifiers to nonlinear neural networks
  • 18.
  • 19.
    Fundamental binary linear classifiers
• Binary linear classifier: the classification surface is a hyper-plane
• The decision function could be a nonlinear thresholding d(x_i, wx+b)
• A nonlinear distance function, or a probability-like sigmoid
• Geometry, 3D, and n-D
• Linear algebra, linear space
• Linear classifiers
• Linear regression is the simplest case, where we have a closed-form solution
  • 20.
    A neuron is a linear classifier
• A single neuron is a linear classifier, where the decision function is the activation function
• w x + b, a linear classifier, a neuron
– It’s a dot product of two vectors, a scalar product
– A template matching, a correlation between the template w and the input vector x (or a matched filter)
– Also an algebraic distance, not the geometric one, which is non-linear (therefore the solution is usually nonlinear!)
• The dot product is the metric distance of two points, one the data, the other the representative
• its angular distance: x · w = ||x|| ||w|| cos θ; the true distance d = ||x − w||^2 = ||x||^2 + ||w||^2 − 2 x·w (the cosine law)
– The ‘linear’ means that the decision surface is linear, a hyper-plane. The decision function is not linear at all
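The squared-distance expansion used above can be verified directly on arbitrary vectors (random vectors invented for the check):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                 # an input vector
w = rng.normal(size=5)                 # a template / weight vector

# ||x - w||^2 = ||x||^2 + ||w||^2 - 2 x.w  (cosine law, with x.w = ||x|| ||w|| cos theta)
lhs = float(np.sum((x - w) ** 2))
rhs = float(x @ x + w @ w - 2.0 * (x @ w))
```

This identity is why a large dot product x·w behaves like a small distance between the input and the template, up to the norms of x and w.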
  • 21.
    A biological neuron and its mathematical model.
  • 22.
    A very simple old example
• Chroma-key segmentation: we want to remove the blue or green background in images
• Given an image in RGB: alpha R + beta G + gamma B > threshold, e.g. 0.0 R + 0.0 G + 1.0 B > 100
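The chroma-key rule is exactly a single neuron: w = (0, 0, 1), b = −100, and the decision w·x + b > 0. A sketch with two hypothetical pixel values (the example colors are made up):

```python
import numpy as np

# The rule 0.0*R + 0.0*G + 1.0*B > 100 written as a neuron w.x + b > 0.
w = np.array([0.0, 0.0, 1.0])
b = -100.0

def is_background(rgb):
    """Linear decision: positive side of the hyper-plane = blue background."""
    return float(np.dot(w, np.asarray(rgb, dtype=float)) + b) > 0.0

blue_screen = [30.0, 40.0, 220.0]   # hypothetical blue-screen pixel
red_object = [200.0, 30.0, 40.0]    # hypothetical foreground pixel
```

The decision surface is a plane in RGB space (B = 100); everything the example does is choose on which side a pixel falls.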
  • 23.
    A linear classifier is not that straightforward
• Two things: inference and learning
• Only the inference ‘scoring’ function is linear
– We have only one ‘linear’ classifier, but we do have different ways to define the loss to make the learning feasible. The learning is a (nonlinear) optimization problem in which we define a loss function.
• Learning is almost never linear!
– How to compute or to learn this hyperplane, and how to assess which one is better? Define an objective ‘loss’ function
• The classification is ‘discrete’, binary 0, 1
• No ‘analytical’ forms of the loss functions, except the true linear regression with y − y_i
• It should be computationally feasible
– The logistic regression (though it is a classification, not a regression) converts the output of the linear function into a probability
– SVM
  • 24.
    Activation (nonlinearity) functions
• ReLU, max(0, x)
• The sigmoid logistic function s(x) = 1/(1 + e^−x), normalized to between 0 and 1, is naturally probability-like
– so naturally, sigmoid for two classes
– (and softmax for N)
– Activation function and nonlinearity function: not necessarily the logistic sigmoid between 0 and 1; others like tanh (centered), relu, …
• Tanh, 2 s(2x) − 1, centered between −1 and 1, better
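The three activations, and the tanh identity quoted above, in a few lines of numpy:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-3.0, 3.0, 13)
# tanh is a centered, rescaled sigmoid: tanh(x) = 2 * sigmoid(2x) - 1.
tanh_via_sigmoid = 2.0 * sigmoid(2.0 * x) - 1.0
```

The identity makes the "tanh is a centered sigmoid" bullet precise: same shape, shifted from (0, 1) to (−1, 1).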
  • 25.
  • 26.
    The rectified linear unit, ReLU
• The smoother non-linearities used to be favoured in the past.
• The sigmoid kills gradients and is not used any more
• At present, ReLU is the most popular.
• Easy to implement: max(0, x)
• It learns faster with many layers, accelerating the learning by a factor of 6 (in AlexNet)
• Not smooth at x = 0; the subderivative is a set for a convex function, with left and right derivatives; set it to zero for simplicity and sparsity
  • 27.
    From two to N classes
• Classification f : R^n → {1, 2, …, n}, while regression f : R^n → R
• Multi-class: output a vector function y = f(x′), with x′ = g(W x + b), where g is the rectified linear unit
• W x + b: each row is a linear classifier, a neuron
  • 28.
    The two common linear classifiers, with different loss functions
• SVM, uncalibrated scores
– A hinge loss, the max-margin loss; a loss for each j different from y
– Computationally more feasible; leads to convex optimization
• Softmax f(·), the normalized exponentials, y = f(x′) = f(g(W x + b))
– multi-class logistic regression
– The scores are the unnormalized log probabilities
– the negative log-likelihood loss, then gradient descent
– (1, −2, 0) → exp → (2.71, 0.14, 1) → normalize → (0.7, 0.04, 0.26)
– ½ (1, −2, 0) = (0.5, −1, 0) → exp → (1.65, 0.37, 1) → normalize → (0.55, 0.12, 0.33), more uniform with increasing regularization
• They are usually comparable
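The numeric example on the slide can be reproduced exactly with a stable softmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # subtract the max for numerical stability
    return e / e.sum()

scores = np.array([1.0, -2.0, 0.0])
p = softmax(scores)             # the slide's example: ~(0.7, 0.04, 0.26)
p_half = softmax(0.5 * scores)  # shrunken scores: ~(0.55, 0.12, 0.33)
```

Halving the scores (what stronger regularization effectively does to the weights) visibly flattens the distribution, which is the point of the second line of the example.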
  • 29.
    From linear to non-linear classifiers, Multi-Layer Perceptrons
• Go higher and linear!
• find a map or a transform x ↦ φ(x) to make them linear, but in higher dimensions
– A complete basis of polynomials → too many parameters for the limited training data
– Kernel methods, support vector machines, …
• Learn the nonlinearity at the same time as the linear classifiers → multilayer neural networks
  • 30.
    Multi-Layer Perceptrons
• The first layer
• The second layer
• One layer, g^1(x): linear classifiers
• Two layers, one hidden layer --- a universal nonlinear approximator
• The depth is the number of layers, n+1: f g^n … g^1
• The dimension of the output vector h is the width of the layer
• An N-layer neural network does not count the input layer x
• But it does count the output layer f(·). It represents the class-scores vector; it does not have an activation function (or has the identity activation function).
  • 31.
    A 2-layer Neural Network: one hidden layer of 4 neurons (or units), one output layer with 2 neurons, and three inputs.
The network has 4 + 2 = 6 neurons (not counting the inputs), [3 x 4] + [4 x 2] = 20 weights and 4 + 2 = 6 biases, for a total of 26 learnable parameters.
  • 32.
    A 3-layer neural network with three inputs, two hidden layers of 4 neurons each, and one output layer. Notice that in both cases there are connections (synapses) between neurons across layers, but not within a layer.
The network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
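The two parameter counts follow one formula: for each layer, n_in × n_out weights plus one bias per output unit. A small helper (the function name is ours) checks both examples:

```python
def mlp_params(sizes):
    """Learnable parameters of a fully-connected net with layer widths `sizes`:
    n_in * n_out weights per layer, plus one bias per output unit."""
    weights = sum(a * b for a, b in zip(sizes[:-1], sizes[1:]))
    biases = sum(sizes[1:])
    return weights + biases

two_layer = mlp_params([3, 4, 2])       # the 2-layer net: 20 weights + 6 biases
three_layer = mlp_params([3, 4, 4, 1])  # the 3-layer net: 32 weights + 9 biases
```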
  • 33.
    Universal approximator
• Given a function, there exists a feedforward network that approximates the function
• A single layer is sufficient to represent any function,
• but it may be infeasibly large,
• and it may fail to learn and generalize correctly
• Using deeper models can reduce the number of units required to represent the function, and can reduce the amount of generalization error
  • 34.
    Functional summary Theinput , the output y The hidden features The hidden layers is a nonlinear transformation provides a set of features describing Or provides a new representation for The nonlinear is compositional Each layer ), g(z)=max(0,z). The output
  • 35.
    MLP is not new, but the deep feedforward neural network is modern!
• Feedforward: from x to y, no feedback
• Networks: modeled as a directed acyclic graph for the composition of functions connected in a chain, f(x) = f^(N)( … f^(3)(f^(2)(f^(1)(x))))
• The depth of the network is N
• ‘deep learning’ as N is increasing.
  • 36.
    Forward inference f(x) and backward learning ∇f(x)
• A (parameterized) score function f(x, w) mapping the data to class scores: forward inference, modeling
• A loss function (objective) L measuring the quality of a particular set of parameters based on the ground-truth labels
• Optimization: minimize the loss over the parameters with a regularization; backward learning
  • 37.
    The dataset of pairs (x, y) is given and fixed. The weights start out as random numbers and can change. During the forward pass the score function computes class scores, stored in vector f. The loss function contains two components: the data loss computes the compatibility between the scores f and the labels y; the regularization loss is only a function of the weights. During gradient descent, we compute the gradient on the weights (and optionally on the data if we wish) and use it to perform a parameter update.
  • 38.
    Cost or loss functions
• Usually the parametric model f(x, theta) defines a distribution p(y | x; theta) and we use the maximum likelihood, that is, the cross-entropy between the training data y and the model’s predictions f(x, theta), as the loss or cost function.
– The cross-entropy between a ‘true’ distribution p and an estimated distribution q is H(p, q) = − sum_x p(x) log q(x)
• The cost is J(theta) = −E log p(y|x); if p is normal, we have the mean squared error cost J = ½ E ||y − f(x; theta)||^2 + const
• The cost can also be viewed as a functional, mapping functions to real numbers; we are learning functions f parameterized by theta. By the calculus of variations, f(x) = E_*[y]
• The SVM loss is carefully designed and special: hinge loss, max-margin loss
• The softmax is the cross-entropy between the estimated class probabilities e^{y_i} / sum_j e^{y_j} and the true class labels; it is also the negative log-likelihood loss L_i = −log(e^{y_i} / sum_j e^{y_j})
  • 39.
    MLE, likelihood and probability discussions
KL divergence, information theory, …
  • 40.
    Gradient-based learning
• Define the loss function
• Gradient-based optimization with chain rules
• z = f(y) = f(g(x)), dz/dx = dz/dy · dy/dx
• In vectors x and y, the gradient ∇_x z = J^T ∇_y z, where J is the Jacobian matrix of g
• In tensors, back-propagation
• Analytical gradients are simple
– d max(0, x)/dx = 1 or 0; d sigmoid(x)/dx = (1 − sig) · sig
• Use the centered difference (f(x+h) − f(x−h)) / 2h, with error of order O(h^2)
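The O(h²) claim for the centered difference is easy to see numerically against the one-sided O(h) formula (test function and step size invented for the demo):

```python
import numpy as np

f = np.sin                              # any smooth test function
x0, h = 1.0, 1e-4
exact = np.cos(x0)                      # the true derivative of sin at x0
forward = (f(x0 + h) - f(x0)) / h                 # one-sided, error O(h)
centered = (f(x0 + h) - f(x0 - h)) / (2.0 * h)    # centered, error O(h^2)
err_fwd = abs(forward - exact)
err_cen = abs(centered - exact)
```

With h = 1e-4 the centered error is several orders of magnitude smaller, which is why centered differences are the standard choice for gradient checking.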
  • 41.
    Stochastic gradient descent
• min loss(f(x; theta)): a function of the parameters theta, not of x
• min f(x)
• Gradient descent, or the method of steepest descent: x′ = x − epsilon ∇_x f(x)
• Gradient descent follows the gradient of an entire training set downhill
• Stochastic gradient descent follows the gradient of randomly selected minibatches downhill
  • 42.
  • 43.
  • 44.
    Regularization loss term+ lambda regularization term Regularization as constraints for underdetermined system Eg. A^T A + alpha I, Solving linear homogeneous Ax = 0, ||Ax||^2 with ||x||^2=1 Regularization for the stable uniqueness solution of ill-posed problems (no unique solution) Regularization as prior for MAP on top of the maximum likelihood estimation, again allapproximations to the full Bayesian inference L1/2 regularization L1 regularization has ‘feature selection’ property, that is, it produces a sparse vector,setting many to zeros L2 is usually diffuse, producing small numbers. L2 superior is not explicitly selecting features
  • 45.
    Architecture priors and inductive biases discussions ...
  • 46.
    Hyperparameters and validation
• The hyper-parameter lambda
• Split the training data into two disjoint subsets
– One is to learn the parameters
– The other subset, the validation set, is to guide the selection of the hyperparameters
  • 47.
A linearly non-separable toy example • The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable. – 300 pts, 3 classes • A linear classifier fails to learn the toy spiral dataset. • A neural network classifier crushes the spiral dataset. – One hidden layer of width 100
  • 48.
A toy example from cs231n • The toy spiral data consists of three classes (blue, red, yellow) that are not linearly separable. – 3 classes, 100 pts per class • A softmax linear classifier fails to learn the toy spiral dataset. – One layer: W, 2*3, and b – Analytical gradients; 190 iterations, loss from 1.09 to 0.78, 48% training set accuracy • A neural network classifier crushes the spiral dataset. – One hidden layer of width 100: W1, 2*100, b1, W2, 100*3, only a few extra lines of Python code! – 9000 iterations, loss from 1.09 to 0.24, 98% training set accuracy
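The spiral dataset above follows the published cs231n recipe; a sketch of its generation (the noise scale 0.2 and 4-radian sweep match that example):

```python
import numpy as np

# cs231n toy spiral: 3 classes, 100 2D points each, not linearly separable.
rng = np.random.default_rng(0)
N, D, K = 100, 2, 3                 # points per class, dimensionality, classes
X = np.zeros((N * K, D))
y = np.zeros(N * K, dtype=int)
for k in range(K):
    ix = range(N * k, N * (k + 1))
    r = np.linspace(0.0, 1.0, N)                                          # radius
    t = np.linspace(k * 4, (k + 1) * 4, N) + rng.normal(scale=0.2, size=N)  # angle
    X[ix] = np.column_stack([r * np.sin(t), r * np.cos(t)])
    y[ix] = k
```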
  • 49.
Generalization • 'Optimization' reduces the training error (the residual error from before) • 'Machine learning' wants to reduce the generalization error, or test error, as well • The generalization error is the expected value of the error on a new input, from the test set – Make the training error small – Make the gap between training and test error small • Underfitting: not able to achieve sufficiently low error on the training set • Overfitting: not able to narrow the gap between the training and the test error
  • 50.
The training data was generated synthetically, by randomly sampling x values and choosing y deterministically by evaluating a quadratic function. (Left) A linear function fit to the data suffers from underfitting. (Center) A quadratic function fit to the data generalizes well to unseen points; it does not suffer from a significant amount of overfitting or underfitting. (Right) A polynomial of degree 9 fit to the data suffers from overfitting. Here we used the Moore-Penrose pseudoinverse to solve the underdetermined normal equations. The solution passes through all of the training points exactly, but we have not been lucky enough for it to extract the correct structure.
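The figure's setup can be sketched numerically (the particular quadratic and sample size are assumptions for illustration): fit degrees 1, 2, and 9 with the Moore-Penrose pseudoinverse and compare training errors.

```python
import numpy as np

rng = np.random.default_rng(1)
# As in the figure: sample x randomly, choose y deterministically from a quadratic.
x = rng.uniform(-1, 1, size=10)
y = 2 * x**2 - x + 0.5

def polyfit_pinv(x, y, degree):
    """Least-squares polynomial fit via the Moore-Penrose pseudoinverse."""
    V = np.vander(x, degree + 1)        # design matrix [x^d, ..., x, 1]
    return np.linalg.pinv(V) @ y

train_err = {d: np.mean((np.polyval(polyfit_pinv(x, y, d), x) - y) ** 2)
             for d in (1, 2, 9)}
# Degree 1 underfits; degree 2 recovers the data exactly; degree 9 also drives
# the training error to ~0 (it interpolates all 10 points) yet overfits.
```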
  • 51.
The capacity of a model • The old Occam's razor: among competing hypotheses, choose the 'simplest' one. • The modern VC (Vapnik-Chervonenkis) dimension: the largest possible value of m for which there exists a training set of m different x points that the binary classifier can label arbitrarily. • The no-free-lunch theorem: there is no best learning algorithm and no best form of regularization; it is task specific.
  • 52.
Practice: learning rate • Left: A cartoon depicting the effects of different learning rates. With low learning rates the improvements will be linear. With high learning rates they will start to look more exponential. Higher learning rates will decay the loss faster, but they get stuck at worse values of loss (green line). This is because there is too much "energy" in the optimization and the parameters are bouncing around chaotically, unable to settle in a nice spot in the optimization landscape. Right: An example of a typical loss function over time, while training a small network on the CIFAR-10 dataset. This loss function looks reasonable (it might indicate a slightly too small learning rate based on its speed of decay, but it's hard to say), and also indicates that the batch size might be a little too low (since the cost is a little too noisy).
  • 53.
Practice: avoid overfitting • The gap between the training and validation accuracy indicates the amount of overfitting. Two possible cases are shown in the diagram on the left. The blue validation error curve shows very small validation accuracy compared to the training accuracy, indicating strong overfitting. When you see this in practice you probably want to increase regularization or collect more data. The other possible case is when the validation accuracy tracks the training accuracy fairly well. This case indicates that your model capacity is not high enough: make the model larger by increasing the number of parameters.
  • 54.
Why deeper? • Deeper networks are able to use far fewer units per layer and far fewer parameters, while frequently generalizing better to the test set • But they are harder to optimize! • Choosing a deep model encodes a very general belief that the function we want to learn involves the composition of several simpler functions. Or: learning consists of discovering a set of underlying factors of variation that can in turn be described in terms of other, simpler underlying factors of variation.
  • 55.
Curse of dimensionality • The core idea in deep learning is that we assume that the data was generated by the composition of factors or features, potentially at multiple levels in a hierarchy. • This assumption allows an exponential gain in the relationship between the number of examples and the number of regions that can be distinguished. • The exponential advantages conferred by the use of deep, distributed representations counter the exponential challenges posed by the curse of dimensionality.
  • 56.
Convolutional Neural Networks, or CNN: a visual network, back to a 2D lattice from the abstraction of the 1D feature vector x in neural networks
  • 57.
From a regular network to CNN • We have so far regarded an input image as a vector of features x, obtained by feature detection and selection, like other machine learning applications, in order to learn f(x) • We now regard it as is: a 2D image, a 2D grid, a topological discrete lattice I(i, j) • Converting input images into feature vectors loses the spatial neighborhood-ness • The complexity increases cubically • Yet the connectivities become local, to reduce the complexity!
  • 58.
What is a convolution? • The fundamental operation, the convolution: (I * K)(i, j) = sum_m sum_n I(m, n) K(i - m, j - n) • Flipping the kernel makes the convolution commutative, which is fundamental in theory, but not required in a NN to compose with other functions; so discrete correlations are included in "convolution" as well • Convolution is a linear operator, a dot-product-like correlation, not a matrix multiplication, but it can be implemented as a sparse matrix multiplication, and viewed as an affine transform
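A minimal sketch of the CNN-style "convolution", i.e. a valid cross-correlation without kernel flipping, as the slide notes; flipping the kernel first would give the commutative textbook convolution.

```python
import numpy as np

def conv2d(image, kernel):
    """'Convolution' as used in CNNs: a valid cross-correlation (no kernel flip).

    Each output value is the dot product of the kernel with one image patch.
    """
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H - h + 1, W - w + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + h, j:j + w] * kernel)
    return out

image = np.arange(16, dtype=float).reshape(4, 4)
edge = np.array([[1.0, -1.0]])       # horizontal difference filter
result = conv2d(image, edge)
```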
  • 59.
A CNN arranges its neurons in three dimensions (width, height, depth). Every layer of a CNN transforms the 3D input volume to a 3D output volume. In this example, the red input layer holds the image, so its width and height would be the dimensions of the image, and the depth would be 3 (Red, Green, Blue channels). Compare: a regular 3-layer Neural Network.
  • 60.
LeNet: a layered model composed of convolution and subsampling operations followed by a holistic representation and ultimately a classifier for handwritten digits. [LeNet] Convolutional Neural Networks: 1998. Input 32*32. CPU.
  • 61.
AlexNet: a layered model composed of convolution, subsampling, and further operations followed by a holistic representation and all-in-all a landmark classifier on ILSVRC12. [AlexNet] + data + GPU + non-saturating nonlinearity + regularization. Convolutional Neural Networks: 2012. Input 224*224*3. GPU.
  • 62.
LeNet vs AlexNet • LeNet: 32*32*1 input • 7 layers: 2 conv and 4 classification • About 60 thousand parameters • Only two complete convolutional layers – conv, nonlinearities, and pooling together as one complete layer • About 17 k parameters • (from the 1989 LeNet-1 of about 10 k parameters to the 1998 LeNet) • AlexNet: 224*224*3 input • 8 layers: 5 conv and 3 fully connected classification • 5 convolutional layers, with layers 3, 4, 5 stacked on top of each other • Three complete conv layers • 60 million parameters, insufficient data • Data augmentation: – Patches (224 from 256 input), translations, reflections – PCA, to simulate changes in intensity and colors
  • 63.
The motivation of convolutions • Sparse interaction, or local connectivity – The receptive field of the neuron, i.e. the filter size – The connections are local in space (width and height), but always full in depth – A set of learnable filters • Parameter sharing: the weights are tied • Equivariant representation: translation equivariance
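Local connectivity plus weight sharing is what makes these layers cheap. A back-of-the-envelope sketch, with assumed sizes (a 32x32x3 input, 10 output units, 10 filters of size 5x5), of the parameter counts:

```python
# Sparse connectivity and weight sharing drastically cut parameter counts.
# Assumed sizes for illustration: 32x32x3 input, 10 filters of size 5x5.
H, W, C = 32, 32, 3
num_filters, F = 10, 5

# Fully connected layer mapping the whole input to 10 units:
# one weight per input value, plus biases.
fc_params = (H * W * C) * 10 + 10

# Conv layer: each filter has F*F*C shared weights plus one bias,
# reused at every spatial position (translation equivariance).
conv_params = num_filters * (F * F * C + 1)
```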
  • 64.
Convolution and matrix multiplication • Discrete convolution can be viewed as multiplication by a matrix • The kernel becomes a doubly block circulant matrix • It is very sparse!
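A 1D sketch of this view: build the sparse banded matrix whose rows are shifted copies of the kernel, and check that multiplying by it reproduces the (valid) correlation.

```python
import numpy as np

def conv_matrix(kernel, n):
    """Build the (n-k+1) x n matrix whose product with x equals the valid 1D
    cross-correlation of x with `kernel`; each row is a shifted kernel copy."""
    k = len(kernel)
    M = np.zeros((n - k + 1, n))
    for i in range(n - k + 1):
        M[i, i:i + k] = kernel
    return M

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
kernel = np.array([1.0, 0.0, -1.0])
direct = np.correlate(x, kernel, mode="valid")
via_matrix = conv_matrix(kernel, len(x)) @ x
```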
  • 65.
The 'convolution' operation (VisGraph, HKUST) • The convolution is commutative because we have flipped the kernel – Many implementations use a cross-correlation without flipping • A convolution can be defined in 1, 2, 3, and N dimensions – The 2D convolution is different from a real 3D convolution, which integrates spatio-temporal information; the standard CNN convolution has only 'spatial' spreading • In a CNN, even for 3-channel RGB input images, the standard convolution is 2D in each channel – Each channel has a different filter or kernel; the per-channel convolutions are then summed over all channels to produce a scalar for the nonlinear activation – The filter in each channel is not normalized, so there is no need for different linear combination coefficients – A 1*1 convolution is a dot product across channels, a linear combination of the different channels • The backward pass of a convolution is also a convolution, with spatially flipped filters.
  • 66.
The convolution layers • Stacking several small convolution layers is different from cascading convolutions – Each small convolution is followed by the nonlinearity ReLU – The nonlinearities make the features more expressive! – Small filters give fewer parameters, but use more memory – Cascading simply enlarges the spatial extent, the receptive field • Is each conv layer also followed by a pooling layer? – LeNet does not do this! – AlexNet at first did not either.
  • 67.
The pooling layer • Reduces the spatial size • Reduces the number of parameters • Helps avoid over-fitting • Backpropagation through a max: only route the gradient to the input that had the highest value in the forward pass • It is unclear whether pooling is essential.
  • 68.
The pooling layer down-samples the volume spatially, independently in each depth slice of the input volume. Left: the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into an output volume of size [112x112x64]. Notice that the volume depth is preserved. Right: The most common down-sampling operation is max, giving rise to max pooling, here shown with a stride of 2. That is, each max is taken over 4 numbers (a little 2x2 square).
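The 2x2 / stride-2 max pooling just described, sketched on a single depth slice:

```python
import numpy as np

def max_pool(x, size=2, stride=2):
    """2x2 max pooling with stride 2 on a single depth slice (H x W array)."""
    H, W = x.shape
    out = np.zeros((H // stride, W // stride))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = x[i * stride:i * stride + size, j * stride:j * stride + size]
            out[i, j] = patch.max()   # each max is taken over a 2x2 square
    return out

x = np.array([[1.0, 2.0, 5.0, 6.0],
              [3.0, 4.0, 7.0, 8.0],
              [0.0, 1.0, 1.0, 0.0],
              [2.0, 1.0, 0.0, 3.0]])
pooled = max_pool(x)
```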
  • 69.
The spatial hyperparameters • Depth • Stride • Zero-padding
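These hyperparameters fix the output size through the usual cs231n convention (W - F + 2P)/S + 1 for input width W, filter size F, zero-padding P, and stride S; a small sketch (cs231n quotes a 227, not 224, AlexNet input so that this arithmetic works out):

```python
def conv_output_size(W, F, P, S):
    """Spatial output size of a conv layer: (W - F + 2P) / S + 1."""
    assert (W - F + 2 * P) % S == 0, "hyperparameters do not tile the input"
    return (W - F + 2 * P) // S + 1

# AlexNet's first layer: 227x227 input, 11x11 filters, stride 4, no padding.
first = conv_output_size(227, 11, 0, 4)
# 'Same' padding: 32x32 input, 3x3 filter, pad 1, stride 1 keeps the size.
same = conv_output_size(32, 3, 1, 1)
```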
  • 70.
AlexNet 2012 • A strong prior has very low entropy, e.g. a Gaussian with low variance • An infinitely strong prior says that some parameters are forbidden, placing zero probability on them • The convolution 'prior' imposes identical and zero weights • The pooling forces invariance to small translations
  • 71.
The convolution and pooling act as an infinitely strong prior! • A strong prior has very low entropy, e.g. a Gaussian with low variance • An infinitely strong prior says that some parameters are forbidden, placing zero probability on them • The convolution 'prior' imposes identical and zero weights • The pooling forces invariance to small translations
  • 72.
The neuroscientific basis for CNN • The primary visual cortex, V1, is the area we know the most about • The brain region LGN, the lateral geniculate nucleus, at the back of the head, carries the signal from the eye to V1; a convolutional layer captures three aspects of V1: – It has a 2-dimensional structure – V1 contains many simple cells, linear units – V1 has many complex cells, corresponding to features with shift invariance, similar to pooling • When viewing an object, information flows from the retina, through the LGN, to V1, then onward to V2, then V4, then IT, the inferotemporal cortex, corresponding to the last layer of CNN features • Not modeled at all: the mammalian vision system develops an attention mechanism – The human eye is mostly very low resolution, except for a tiny patch, the fovea – The human brain makes several eye movements, saccades, to salient parts of the scene – Human vision perceives 3D • A simple cell responds to a specific spatial frequency of brightness, in a specific direction, at a specific location: Gabor-like functions
  • 73.
Receptive field • Left: An example input volume in red (e.g. a 32x32x3 CIFAR-10 image), and an example volume of neurons in the first convolutional layer. Each neuron in the convolutional layer is connected only to a local region in the input volume spatially, but to the full depth (i.e. all color channels). Note, there are multiple neurons (5 in this example) along the depth, all looking at the same region in the input; see the discussion of depth columns in the text below. Right: The neurons from the Neural Network chapter remain unchanged: they still compute a dot product of their weights with the input followed by a non-linearity, but their connectivity is now restricted to be local spatially.
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
CNN architectures • The conventional linear structure: a linear list of layers, feedforward • More generally a DAG, a directed acyclic graph • ResNet simply adds the input back • Different terminology: complex layers and simple layers – A complex (complete) convolutional layer includes different stages such as the convolution per se, batch normalization, the nonlinearity, and pooling – Alternatively, each stage is a layer, even when it has no parameters • Traditional CNNs are just a few complex convolutional layers to extract features, followed by a softmax classification output layer • Convolutional networks can output a high-dimensional, structured object, rather than just predicting a class label for a classification task or a real value for a regression task; the output is a tensor – e.g. S_{i,j,k} is the probability that pixel (j, k) belongs to class i
  • 79.
The popular CNNs • LeNet, 1998 • AlexNet, 2012 • VGGNet, 2014 • ResNet, 2015
  • 80.
VGGNet • 16 layers • Only 3*3 convolutions • 138 million parameters
  • 81.
  • 82.
Computational complexity • The memory bottleneck • GPU: a few GB
  • 83.
Stochastic gradient descent • Gradient descent follows the gradient of an entire training set downhill • Stochastic gradient descent follows the gradient of randomly selected minibatches downhill
  • 84.
The dropout regularization • Randomly shut down a subset of units in training • It is a sparse representation • It is a different net each time, but all nets share the parameters – A net with n units can be seen as a collection of 2^n possible thinned nets, all of which share weights – At test time, it is a single net with averaging • Helps avoid over-fitting
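A sketch of the common "inverted dropout" variant of this idea: zero units with probability p at training time and rescale by 1/(1-p), so that the single averaging net at test time needs no extra scaling.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p=0.5, train=True):
    """Inverted dropout: randomly zero units with probability p during training,
    rescaling survivors by 1/(1-p); at test time it is a plain identity."""
    if not train:
        return x
    mask = (rng.random(x.shape) >= p) / (1 - p)
    return x * mask

x = np.ones(1000)
out = dropout_forward(x, p=0.5)   # roughly half the units are zeroed
```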
  • 85.
Ensemble methods • Bagging (bootstrap aggregating), model averaging, ensemble methods – Model averaging outperforms, at the cost of increased computation and memory – Model averaging is discouraged when benchmarking for publications • Boosting, an ensemble method, constructs an ensemble with higher capacity than the individual models
  • 86.
Batch normalization • After the convolution, before the nonlinearity • 'Batch' because it is done on a subset of the data • Instead of a fixed 0-1 (zero mean, unit variance) input, the distribution is learnt, and can even be undone if necessary • Data normalization or PCA/whitening is common in general NN, but in CNN the 'normalization layer' has been shown to contribute minimally in some nets as well.
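A minimal sketch of the batch-norm forward pass: normalize each feature over the batch, then apply a learned scale gamma and shift beta, which can undo the normalization if training prefers.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the batch (axis 0) to zero mean and unit
    variance, then apply the learned scale gamma and shift beta."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.array([[1.0, 10.0],
              [3.0, 30.0],
              [5.0, 50.0]])
out = batchnorm_forward(x, gamma=np.ones(2), beta=np.zeros(2))
```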
  • 87.
Open questions • Networks are non-convex – Need regularization • Smaller networks are hard to train with local methods – Local minima are bad in loss, not stable, large variance • Bigger ones are easier – More local minima, but better ones, more stable, small variance • Go as big as the computational power, and the data, allow!
  • 88.
Local minima, saddle points, and plateaus? • We don't care about finding the exact minimum; we only want to obtain good generalization error by reducing the function value. • In low dimensions, local minima are common. • In higher dimensions, local minima are rare and saddle points are more common. – The Hessian at a local minimum has all positive eigenvalues. The Hessian at a saddle point has a mixture of positive and negative eigenvalues. – It is exponentially unlikely that all n eigenvalues will have the same sign in high n dimensions
  • 89.
  • 90.
CNN applications • Transfer learning • Fine-tuning the CNN – Keep some early layers • Early layers contain more generic features: edges, color blobs • Common to many visual tasks – Fine-tune the later layers • More specific to the details of the class • CNN as feature extractor – Remove the last fully connected layer – A kind of descriptor or 'CNN codes' for the image – AlexNet gives a 4096-dimensional descriptor
  • 91.
CNN classification/recognition nets • CNN layers and fully-connected classification layers • From ResNet to DenseNet – Densely connected – Feature concatenation
  • 92.
Fully convolutional nets: semantic segmentation • Classification/recognition nets produce 'non-spatial' outputs – The last fully connected layer has the fixed dimension of the classes and throws away the spatial coordinates • Fully convolutional nets output maps as well
  • 93.
  • 94.
Using sliding windows for semantic segmentation
  • 95.
  • 96.
  • 97.
Detection and segmentation nets: the Mask Region-based CNN (R-CNN) • Class-independent region (bounding box) proposals – From selective search to a region proposal net with objectness • Use a CNN to classify each region • Regression on the bounding box, or contour segmentation • Mask R-CNN: end-to-end – Uses a CNN to make object/non-object proposals in parallel • The good old idea of face detection by Viola: – Proposal generation – Cascading (AdaBoost)
  • 98.
Using sliding windows for object detection as classification
  • 99.
  • 100.
  • 101.
  • 102.
Some old notes in 2017 and 2018
  • 103.
Fundamentally from continuous to discrete views … from geometry to recognition • 'Simple' neighborhoods from topology • Discrete, high order • Modeling higher-order discrete structure, yet solving it with first-order differentiable optimization • Modeling and implementation become easier • The multiscale local jet: a hierarchical and local characterization of the image in a full scale-space neighborhood
  • 104.
Local jets? Why? • The multiscale local jet: a hierarchical and local characterization of the image in a full scale-space neighborhood • A feature in a CNN is one element of the descriptor
  • 105.
Classification vs regression • Regression predicts a value from a continuous set, a real/continuous number – Given a set of data, find the relationship (often continuous mathematical functions) that represents the set of data – To predict the output value using training data • Whereas classification predicts the 'belonging' to a class, a discrete/categorical variable – Given a set of data and classes, identify the class that the data belongs to – To group the output into a class
  • 106.
Autoencoder and decoder • Compression and reconstruction • A convolutional autoencoder is trained to reconstruct its input – Wide and thin (RGB) to narrow and thick • ….
  • 107.
Convolution and deconvolution • The convolution is not invertible, so there is no strict definition, or closed form, of the so-called 'deconvolution' • In an iterative procedure, a kind of convolution transpose is applied, hence the name 'deconvolution' • The 'deconvolution' filters are reconstruction bases
  • 108.
CNN as natural features and descriptors • Each point is an interest point or feature point with a high-dimensional descriptor • Some of them are naturally 'unimportant', and are weighted down by the nets • The spatial information is kept through the nets • The hierarchical description is natural, from local to global, for each point, each pixel, and each feature point
  • 109.
Traditional stereo vs deep stereo regression
  • 110.
CNN regression nets: deep stereo regression • Regularize the cost volume.
  • 111.
Traditional stereo • Input image H * W * C • (The matching) cost volume over disparities D, or over depths: D * H * W • The value d_i for each disparity level is the matching cost, or the correlation score, possibly accumulated in scale space, for disparity i or depth i • Disparities are a function of H and W: d = c(x, y; x+d, y+d) • argmin over D • Output: H * W
  • 112.
End-to-end deep stereo regression • Input image H * W * C • Layers 1-18: CNN • H * W * F • (The F features are descriptor vectors for each pixel; we may simply correlate or dot-product two descriptor vectors f1 and f2 to produce a score in D * H * W. But F could be further refined in successive convolution layers.) • Cost volume, for each disparity level: D * H * W * 2F • A 4D volume, viewed as a descriptor vector of length 2F for each voxel of D * H * W • 3D convolution on H, W, and D • Layers 19-37: 3D CNN • The last one (a deconvolution) turns F into a scalar score: D * H * W • Soft argmin over D • Output: H * W
  • 113.
    Bayesian decision posterior= likelihood * prior / evidence Decide if > ; otherwise decide
  • 114.
  • 115.
Reinforcement learning (RL) • Dynamic programming uses a mathematical model of the environment, which is typically formulated as a Markov Decision Process • The main difference between classical dynamic programming and RL is that RL does not assume an exact mathematical model of the Markov Decision Process, and targets large MDPs where exact methods become infeasible • Therefore, RL is neuro-dynamic programming; in operations research and control, it is called approximate dynamic programming
  • 116.
Non-linear iterative optimisation • J d = r, from the vector expansion F(x + d) = F(x) + J d • Minimize the square of y - F(x + d) = y - F(x) - J d = r - J d • The normal equation is J^T J d = J^T r (Gauss-Newton) • (H + lambda I) d = J^T r (LM) • Note: F is a vector of functions, i.e. min f = (y - F)^T (y - F)
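The Gauss-Newton normal equation J^T J d = J^T r above can be sketched on a tiny assumed model (F(theta) = exp(theta · t) is invented for illustration; the residual is r = y - F):

```python
import numpy as np

# Gauss-Newton: linearize F(x + d) ~ F(x) + J d and solve J^T J d = J^T r.
# Assumed toy model: F(theta) = exp(theta * t), fit to data from theta = 0.5.
t = np.array([0.0, 1.0, 2.0])
y = np.exp(0.5 * t)

theta = 0.0
for _ in range(10):
    F = np.exp(theta * t)
    r = y - F                        # residual vector
    J = (t * F).reshape(-1, 1)       # Jacobian dF/dtheta
    d = np.linalg.solve(J.T @ J, J.T @ r)
    theta += d.item()
```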
  • 117.
General non-linear optimisation • f(x + d) = f(x) + g^T d + ½ d^T H d • 1st order: gradient descent, d = -g (i.e. H = I) • 2nd order: Newton step, H d = -g • Gauss-Newton for least squares: f = (y - F)^T (y - F), H = J^T J, g = -J^T r • 'Restricted step', trust-region, LM: (H + lambda W) d = -g • R. Fletcher: Practical Methods of Optimization • Note: f is a scalar-valued function here.
  • 118.
Statistics • 'Small errors': classical estimation theory – analytical, based on first-order approximation – Monte Carlo • 'Big errors': robust statistics – RANSAC – LMS (least median of squares) – M-estimators • Talk about it later
  • 119.
    An abstract viewThe input The classification with linear models, Can be a SVM The output layer of the linear softmax classifier find a map or a transform to make them linear, but in higherdimensions provides a set of features describing Or provides a new representation for The nonlinear transformation Can be hand designed kernels in svm It is the hidden layer of a feedforward network Deep learning To learn is compositional, multiple layers CNN is convolutional for each layer
  • 120.
To do • Add a drawing for 'receptive fields' • Dot product for vectors; convolution, more specific for 2D? Time invariant, or translation invariant, equivariant • Convolution, then nonlinear activation, also called the 'detection stage', the detector • Pooling is for efficiency, down-sampling, and handling inputs of variable sizes
  • 121.
Cost or loss functions - bis • Classification and regression are different! for different application communities • We used more traditional regression, but the modern emphasis is more on classification, resulting in different loss considerations • Regression is harder to optimize, while classification is easier and more stable • Classification is easier than regression, so always discretize and quantize the output, and convert the task into a classification task! • One big improvement in modern NN development is that the cross-entropy dominates the mean squared error L2; the mean squared error was popular and good for regression, but not that good for NN, because the cross-entropy rests on a fundamentally more appropriate distribution assumption, not normal distributions • L2 for regression is harder to optimize than the more stable softmax for classification • L2 is also less robust
  • 122.
Automatic differentiation (algorithmic differentiation), backprop, and its role in the development • Differentiation: symbolic, or numerical (finite differences) • Automatic differentiation computes derivatives algorithmically; backprop is only one approach to it • Its history is related to that of NN and deep learning • It worked for traditional small systems, a single f(x) • The larger and more explicitly compositional nature of f1(f2(f3(… fn(x)))) goes back to the very nature of derivatives – Forward mode and reverse mode (forward mode is based on dual numbers: f(a + b epsilon) = f(a) + f'(a) b epsilon, and f(g(a + b epsilon)) = f(g(a)) + f'(g(a)) g'(a) b epsilon) – The reverse mode is backprop for NN • In the end, it also benefits classical large-scale optimization such as bundle adjustment
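The dual-number rule f(a + b·eps) = f(a) + f'(a)·b·eps above is all that forward-mode AD needs; a minimal sketch with a hand-rolled `Dual` class (only `+`, `*`, and `sin` are implemented here):

```python
import math

class Dual:
    """Dual number a + b*eps with eps^2 = 0: propagating b = df/dx forward
    implements forward-mode automatic differentiation."""
    def __init__(self, a, b=0.0):
        self.a, self.b = a, b
    def __add__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a + o.a, self.b + o.b)
    __radd__ = __add__
    def __mul__(self, o):
        o = o if isinstance(o, Dual) else Dual(o)
        return Dual(self.a * o.a, self.a * o.b + self.b * o.a)   # product rule
    __rmul__ = __mul__

def sin(x):
    # f(a + b*eps) = f(a) + f'(a)*b*eps
    return Dual(math.sin(x.a), math.cos(x.a) * x.b)

# d/dx [x * sin(x)] at x = 2 is sin(2) + 2*cos(2).
x = Dual(2.0, 1.0)      # seed the derivative with b = 1
y = x * sin(x)          # y.a is the value, y.b the derivative
```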
  • 123.
Viewing the composition of an arbitrary function as a natural layering • Take f(x, y) = (x + s(y)) / (s(x) + (x + y)^2), with s the sigmoid, at a given point x = 3, y = -4 • Forward pass: f1 = s(y), f2 = x + f1, f3 = s(x), f4 = x + y, f5 = f4^2, f6 = f3 + f5, f7 = 1/f6, f8 = f2 * f7 • So f(*) = f8(f7(f6(f5(f4(f3(f2(f1(*)))))))); each fn is a known elementary function or operation • Backprop to get (df/dx, df/dy), abbreviated (dx, dy), at (3, -4) • f8 = f; abbreviate df/df7 (i.e. df8/df7) as df7, etc.; df7 = f2 (and df2 = f7); df6 = (-1/f6^2) * df7; df5 = df6 (and df3 = df6); df4 = (2 * f4) * df5; dx = df4, dy = df4; dx += (1 - s(x)) * s(x) * df3 (backprop through s(x) = f3); dx += df2 (backprop through f2); df1 = df2 (backprop through f2); dy += (1 - s(y)) * s(y) * df1 (backprop through s(y) = f1) • In NN, there are just more variables in each layer, but the elementary functions are much simpler: add, multiply, and max • Even the primitive function in each layer is the simplest one possible! Then there are just a lot of them!
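The forward/backward steps above can be written out directly; this sketch follows the f1..f8 decomposition and can be checked against a centered difference.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def f_and_grads(x, y):
    """f(x, y) = (x + s(y)) / (s(x) + (x + y)^2), with its gradient by backprop."""
    # forward pass, the f1..f8 decomposition
    f1 = sigmoid(y)
    f2 = x + f1
    f3 = sigmoid(x)
    f4 = x + y
    f5 = f4 ** 2
    f6 = f3 + f5
    f7 = 1.0 / f6
    f8 = f2 * f7
    # backward pass, reverse order; dfk abbreviates df8/dfk
    df2 = f7                       # f8 = f2 * f7
    df7 = f2
    df6 = -df7 / f6 ** 2           # f7 = 1/f6
    df3 = df6                      # f6 = f3 + f5
    df5 = df6
    df4 = 2.0 * f4 * df5           # f5 = f4^2
    df1 = df2                      # f2 = x + f1
    dx = df4 + (1.0 - f3) * f3 * df3 + df2   # via f4, s(x), and f2
    dy = df4 + (1.0 - f1) * f1 * df1         # via f4 and s(y)
    return f8, dx, dy

value, dx, dy = f_and_grads(3.0, -4.0)
```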
  • 124.
Computational graph • A NN is described with a relatively informal graph language • A more precise computational graph describes the backprop algorithms
  • 125.
Structured probabilistic models, graphical models, and factor graphs • A structured probabilistic model is a way of describing a probability distribution. • Structured probabilistic models are also referred to as graphical models. • Factor graphs are another way of drawing undirected models that resolves an ambiguity in the graphical representation of standard undirected model syntax.

Editor's Notes

  • #2 1587, A Year of No Significance, Ray Huang (万历十五年, 黄仁宇); China: A Macro History (中国大历史)
  • #5 Now the big trends in visual computing are …
  • #8 Affine and eucl are finite geometry in which we are handling only fnite pts
  • #9 LU for solving a simple linear system: instead of inverting the matrix, it reduces to two triangular systems which can be easily computed by forward/backward substitution. QR can solve full-rank least squares. Then, if you don't know the rank, do it with SVD and the pseudo-inverse.
  • #10 This means: g is the gradient vector, H is the Hessian matrix; the Jacobian is for the squared model, r^T r …
  • #13 K-means clustering is an unsupervised learning method, whereas KNN is a supervised learning method. Unsupervised learning: clustering and PCA
  • #19 Can take the nonlinear distance function to the hyperplane, and interpret this distance function as a probability
  • #20 (If we take the sigmoid as the activation, then a neuron is a binary softmax classifier.) w · x + b > 0 is w · x > -b, merely a thresholding shift or bias. The dot product (inner product, scalar product) relates to the Euclidean metric: between the template w and the input x it is an angular distance, x · w = ||x|| ||w|| cos theta, and the true distance expands by the cosine law as ||x - w||^2 = x · x + w · w - 2 x · w. Dot product: w is a template, or a prototype.
  • #24 Sigmoid, with x1=0, x2=x
  • #29 They admit simple algorithms where the form of the nonlinearity can be learned from the training data. They are extremely powerful, have nice theoretical properties, apply well to a vast array of applications.
  • #30 Multi-layer perceptrons are a subclass of neural networks, the classical type of NN, and are always feedforward NNs; a perceptron is more restrictive, originally. Mostly, MLP is what we call feedforward neural networks. Then we have CNN, and RNN …
  • #40 Chain rule of calculus, chain rule of probability
  • #44 The expansion of log p typically yields some terms that do not depend on the model parameters and may be discarded. If p is normal, then we recover the mean squared error cost, J = E ||y - f(x, theta)||^2 + c, up to a scaling factor and a term that does not depend on theta.
  • #78 http://cs231n.github.io/neural-networks-1/
  • #97 HOG features, Histogram of Oriented Gradients, by Dalal and Triggs
  • #112 Alex Kendall et al., Skydio Inc., End-to-end learning of geometry and context for deep stereo regression, 2017
  • #116 This means: g is the gradient vector, H is the Hessian matrix; the Jacobian is for the squared model, r^T r …
  • #122 Through cs231n; https://arxiv.org/abs/1502.05767
  • #123 See Stanford cs231n python example
  • #124 http://www.ams.org/publicoutreach/feature-column/fc-2017-12
