Pattern recognition

From Wikipedia, the free encyclopedia
Automated recognition of patterns and regularities in data
This article is about pattern recognition as a branch of statistics. For the cognitive process, see Pattern recognition (psychology). For other uses, see Pattern recognition (disambiguation).

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. Pattern recognition (PR) should not be confused with pattern machines (PM), which may possess PR capabilities but whose primary function is to distinguish and create emergent patterns. PR has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; some modern approaches to pattern recognition include the use of machine learning, owing to the increased availability of big data and a new abundance of processing power.

Pattern recognition systems are commonly trained from labeled "training" data. When no labeled data are available, other algorithms can be used to discover previously unknown patterns. KDD and data mining have a larger focus on unsupervised methods and a stronger connection to business use. Pattern recognition focuses more on the signal and also takes acquisition and signal processing into consideration. It originated in engineering, and the term is popular in the context of computer vision: a leading computer vision conference is named Conference on Computer Vision and Pattern Recognition.

In machine learning, pattern recognition is the assignment of a label to a given input value. In statistics, discriminant analysis was introduced for this same purpose in 1936. An example of pattern recognition is classification, which attempts to assign each input value to one of a given set of classes (for example, determine whether a given email is "spam"). Pattern recognition is a more general problem that encompasses other types of output as well. Other examples are regression, which assigns a real-valued output to each input;[1] sequence labeling, which assigns a class to each member of a sequence of values[2] (for example, part of speech tagging, which assigns a part of speech to each word in an input sentence); and parsing, which assigns a parse tree to an input sentence, describing the syntactic structure of the sentence.[3]

Pattern recognition algorithms generally aim to provide a reasonable answer for all possible inputs and to perform "most likely" matching of the inputs, taking into account their statistical variation. This is opposed to pattern matching algorithms, which look for exact matches in the input with pre-existing patterns. A common example of a pattern-matching algorithm is regular expression matching, which looks for patterns of a given sort in textual data and is included in the search capabilities of many text editors and word processors.
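To make the contrast concrete, the following sketch (illustrative Python; the keyword weights and the decision threshold are invented for the example) compares exact regular-expression matching with a simple score-based decision that tolerates statistical variation in the input:

```python
import re

# Exact pattern matching: a regular expression either matches or it does not.
exact = re.compile(r"^invoice #\d{4}$")
print(bool(exact.match("invoice #1234")))   # True
print(bool(exact.match("invoice # 1234")))  # False: one extra space, no match

# Pattern recognition: score noisy input against weights that would normally be
# learned from data, then pick the most likely class. The weights below are
# made up purely for illustration.
spam_weights = {"free": 1.2, "winner": 2.0, "invoice": -0.5}

def spam_score(text):
    words = re.findall(r"[a-z]+", text.lower())
    return sum(spam_weights.get(w, 0.0) for w in words)

msg = "You are a WINNER - claim your free prize"
label = "spam" if spam_score(msg) > 0.5 else "non-spam"
print(label)
```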

Overview

Further information on Combination Of Shifted FIlter REsponses: COSFIRE

A modern definition of pattern recognition is:

The field of pattern recognition is concerned with the automatic discovery of regularities in data through the use of computer algorithms and with the use of these regularities to take actions such as classifying the data into different categories.[4]

Pattern recognition is generally categorized according to the type of learning procedure used to generate the output value. Supervised learning assumes that a set of training data (the training set) has been provided, consisting of a set of instances that have been properly labeled by hand with the correct output. A learning procedure then generates a model that attempts to meet two sometimes conflicting objectives: perform as well as possible on the training data, and generalize as well as possible to new data (usually, this means being as simple as possible, for some technical definition of "simple", in accordance with Occam's razor, discussed below). Unsupervised learning, on the other hand, assumes training data that has not been hand-labeled, and attempts to find inherent patterns in the data that can then be used to determine the correct output value for new data instances.[5] A combination of the two that has been explored is semi-supervised learning, which uses a combination of labeled and unlabeled data (typically a small set of labeled data combined with a large amount of unlabeled data). In cases of unsupervised learning, there may be no training data at all.
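A minimal supervised-learning sketch, assuming scikit-learn and NumPy are available and using synthetic data: a model is fit to hand-labeled training instances and then judged by how well it generalizes to a held-out test set, reflecting the two objectives described above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # synthetic "correct" labels

# Hold out part of the labeled data to measure generalization.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

print("train accuracy:", accuracy_score(y_train, model.predict(X_train)))
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```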

Sometimes different terms are used to describe the corresponding supervised and unsupervised learning procedures for the same type of output. The unsupervised equivalent of classification is normally known as clustering, based on the common perception of the task as involving no training data to speak of, and of grouping the input data into clusters based on some inherent similarity measure (e.g. the distance between instances, considered as vectors in a multi-dimensional vector space), rather than assigning each input instance into one of a set of pre-defined classes. In some fields, the terminology is different. In community ecology, the term classification is used to refer to what is commonly known as "clustering".
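For the unsupervised counterpart, a short clustering sketch (again assuming scikit-learn; the synthetic data and the choice of two clusters are illustrative): instances are grouped purely by distance in the feature space, with no pre-defined classes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two unlabeled blobs of points in a 2-D feature space.
X = np.vstack([rng.normal(loc=0.0, size=(50, 2)),
               rng.normal(loc=5.0, size=(50, 2))])

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(clusters[:10])   # cluster indices discovered from similarity, not given class labels
```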

The piece of input data for which an output value is generated is formally termed an instance. The instance is formally described by a vector of features, which together constitute a description of all known characteristics of the instance. These feature vectors can be seen as defining points in an appropriate multidimensional space, and methods for manipulating vectors in vector spaces can be correspondingly applied to them, such as computing the dot product or the angle between two vectors. Features typically are either categorical (also known as nominal, i.e., consisting of one of a set of unordered items, such as a gender of "male" or "female", or a blood type of "A", "B", "AB" or "O"), ordinal (consisting of one of a set of ordered items, e.g., "large", "medium" or "small"), integer-valued (e.g., a count of the number of occurrences of a particular word in an email) or real-valued (e.g., a measurement of blood pressure). Often, categorical and ordinal data are grouped together, and this is also the case for integer-valued and real-valued data. Many algorithms work only in terms of categorical data and require that real-valued or integer-valued data be discretized into groups (e.g., less than 5, between 5 and 10, or greater than 10).
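The sketch below illustrates, in plain Python, how a single instance with categorical, ordinal, integer-valued and real-valued features might be encoded as a feature vector, and how a real-valued feature can be discretized into groups; the field names, codes and bin edges are hypothetical.

```python
# One instance described by four features of different types.
instance = {
    "blood_type": "AB",        # categorical / nominal
    "size": "medium",          # ordinal
    "word_count": 3,           # integer-valued
    "blood_pressure": 118.0,   # real-valued
}

ORDINAL = {"small": 0, "medium": 1, "large": 2}
BLOOD_TYPES = ["A", "B", "AB", "O"]

def discretize(x, edges=(5, 10)):
    # Map a real value into one of three groups: below, between, above the edges.
    if x < edges[0]:
        return 0
    return 1 if x <= edges[1] else 2

features = [
    BLOOD_TYPES.index(instance["blood_type"]),   # simple integer coding of a nominal value
    ORDINAL[instance["size"]],
    instance["word_count"],
    discretize(instance["blood_pressure"], edges=(90, 120)),
]
print(features)   # a point in a multidimensional feature space
```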

Probabilistic classifiers

Main article: Probabilistic classifier

Many common pattern recognition algorithms are probabilistic in nature, in that they use statistical inference to find the best label for a given instance. Unlike other algorithms, which simply output a "best" label, probabilistic algorithms often also output a probability of the instance being described by the given label. In addition, many probabilistic algorithms output a list of the N-best labels with associated probabilities, for some value of N, instead of simply a single best label. When the number of possible labels is fairly small (e.g., in the case of classification), N may be set so that the probability of all possible labels is output. Probabilistic algorithms have many advantages over non-probabilistic algorithms (see the sketch after this list):

  • They output a confidence value associated with their choice. (Note that some other algorithms may also output confidence values, but in general, only for probabilistic algorithms is this value mathematically grounded in probability theory. Non-probabilistic confidence values can in general not be given any specific meaning, and can only be used for comparison against other confidence values output by the same algorithm.)
  • Correspondingly, they can abstain when the confidence of choosing any particular output is too low.
  • Because of the probabilities output, probabilistic pattern-recognition algorithms can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of error propagation.
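A brief sketch of this behaviour, assuming scikit-learn and synthetic data: predict_proba returns one probability per label, from which an N-best list can be read off and a low-confidence decision rejected (the 0.7 threshold is purely illustrative).

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
# Three synthetic classes centred at different locations in a 2-D feature space.
X = np.vstack([rng.normal(0, 1, (50, 2)),
               rng.normal(3, 1, (50, 2)),
               rng.normal(6, 1, (50, 2))])
y = np.repeat(["A", "B", "C"], 50)

clf = GaussianNB().fit(X, y)
probs = clf.predict_proba([[2.9, 3.1]])[0]          # one probability per possible label

n_best = sorted(zip(clf.classes_, probs), key=lambda p: -p[1])[:2]
print("2-best labels with probabilities:", n_best)

THRESHOLD = 0.7                                     # illustrative confidence cut-off
decision = n_best[0][0] if n_best[0][1] >= THRESHOLD else "abstain"
print("decision:", decision)
```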

Number of important feature variables


Feature selection algorithms attempt to directly prune out redundant or irrelevant features. A general introduction to feature selection, which summarizes approaches and challenges, has been given.[6] Because of its non-monotonic character, feature selection is an optimization problem in which, given a total of $n$ features, the powerset consisting of all $2^n - 1$ non-empty subsets of features needs to be explored. The Branch-and-Bound algorithm[7] does reduce this complexity but is intractable for medium to large values of the number of available features $n$.
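The combinatorial cost is easy to see in code. The sketch below (assuming scikit-learn; the synthetic data and the cross-validated scoring choice are illustrative, and this is plain exhaustive search rather than Branch-and-Bound) scores every non-empty feature subset, which is only feasible for small $n$:

```python
from itertools import combinations
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 5))
y = (X[:, 0] - 2 * X[:, 2] > 0).astype(int)   # only features 0 and 2 actually matter

n = X.shape[1]
best_subset, best_score = None, -np.inf
for k in range(1, n + 1):
    for subset in combinations(range(n), k):  # enumerates all 2^n - 1 non-empty subsets
        score = cross_val_score(LogisticRegression(), X[:, list(subset)], y, cv=5).mean()
        if score > best_score:
            best_subset, best_score = subset, score

print(best_subset, round(best_score, 3))
```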

Techniques to transform the raw feature vectors (feature extraction) are sometimes used prior to application of the pattern-matching algorithm. Feature extraction algorithms attempt to reduce a large-dimensionality feature vector into a smaller-dimensionality vector that is easier to work with and encodes less redundancy, using mathematical techniques such as principal components analysis (PCA). The distinction between feature selection and feature extraction is that the resulting features after feature extraction has taken place are of a different sort than the original features and may not easily be interpretable, while the features left after feature selection are simply a subset of the original features.
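A minimal feature-extraction sketch, assuming scikit-learn and synthetic data: PCA maps a 10-dimensional feature vector onto two derived components that are combinations of the original features and thus not individually interpretable.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 10))            # 100 instances, 10 raw features each

pca = PCA(n_components=2).fit(X)
X_reduced = pca.transform(X)              # new, lower-dimensional feature vectors
print(X_reduced.shape)                    # (100, 2)
print(pca.explained_variance_ratio_)      # variance captured by each derived component
```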

Problem statement


The problem of pattern recognition can be stated as follows: Given an unknown function $g: \mathcal{X} \rightarrow \mathcal{Y}$ (the ground truth) that maps input instances $\boldsymbol{x} \in \mathcal{X}$ to output labels $y \in \mathcal{Y}$, along with training data $\mathbf{D} = \{(\boldsymbol{x}_1, y_1), \dots, (\boldsymbol{x}_n, y_n)\}$ assumed to represent accurate examples of the mapping, produce a function $h: \mathcal{X} \rightarrow \mathcal{Y}$ that approximates as closely as possible the correct mapping $g$. (For example, if the problem is filtering spam, then $\boldsymbol{x}_i$ is some representation of an email and $y$ is either "spam" or "non-spam".) In order for this to be a well-defined problem, "approximates as closely as possible" needs to be defined rigorously. In decision theory, this is defined by specifying a loss function or cost function that assigns a specific value to the "loss" resulting from producing an incorrect label. The goal then is to minimize the expected loss, with the expectation taken over the probability distribution of $\mathcal{X}$. In practice, neither the distribution of $\mathcal{X}$ nor the ground truth function $g: \mathcal{X} \rightarrow \mathcal{Y}$ is known exactly; they can only be characterized empirically by collecting a large number of samples of $\mathcal{X}$ and hand-labeling them with the correct value of $\mathcal{Y}$ (a time-consuming process, which is typically the limiting factor in the amount of data of this sort that can be collected). The particular loss function depends on the type of label being predicted. For example, in the case of classification, the simple zero-one loss function is often sufficient. This corresponds to assigning a loss of 1 to any incorrect labeling and implies that the optimal classifier minimizes the error rate on independent test data (i.e., counting up the fraction of instances that the learned function $h: \mathcal{X} \rightarrow \mathcal{Y}$ labels wrongly, which is equivalent to maximizing the number of correctly classified instances). The goal of the learning procedure is then to minimize the error rate (maximize the correctness) on a "typical" test set.
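As a small worked illustration of the zero-one loss, the empirical risk of a hypothesized function $h$ on a test set is simply the fraction of mislabeled instances (plain Python, toy labels):

```python
def zero_one_loss(y_true, y_pred):
    # Loss of 1 for every incorrect label, averaged over the test set.
    return sum(1 for t, p in zip(y_true, y_pred) if t != p) / len(y_true)

y_true = ["spam", "spam", "non-spam", "non-spam", "spam"]
y_pred = ["spam", "non-spam", "non-spam", "non-spam", "spam"]
print(zero_one_loss(y_true, y_pred))   # 0.2: the learned h mislabels 1 of 5 instances
```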

For a probabilistic pattern recognizer, the problem is instead to estimate the probability of each possible output label given a particular input instance, i.e., to estimate a function of the form

$$p(\mathrm{label} \mid \boldsymbol{x}, \boldsymbol{\theta}) = f(\boldsymbol{x}; \boldsymbol{\theta})$$

where the feature vector input is $\boldsymbol{x}$, and the function $f$ is typically parameterized by some parameters $\boldsymbol{\theta}$.[8] In a discriminative approach to the problem, $f$ is estimated directly. In a generative approach, however, the inverse probability $p(\boldsymbol{x} \mid \mathrm{label})$ is instead estimated and combined with the prior probability $p(\mathrm{label} \mid \boldsymbol{\theta})$ using Bayes' rule, as follows:

$$p(\mathrm{label} \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{x} \mid \mathrm{label}, \boldsymbol{\theta})\, p(\mathrm{label} \mid \boldsymbol{\theta})}{\sum_{L \in \text{all labels}} p(\boldsymbol{x} \mid L)\, p(L \mid \boldsymbol{\theta})}.$$

When the labels are continuously distributed (e.g., in regression analysis), the denominator involves integration rather than summation:

$$p(\mathrm{label} \mid \boldsymbol{x}, \boldsymbol{\theta}) = \frac{p(\boldsymbol{x} \mid \mathrm{label}, \boldsymbol{\theta})\, p(\mathrm{label} \mid \boldsymbol{\theta})}{\int_{L \in \text{all labels}} p(\boldsymbol{x} \mid L)\, p(L \mid \boldsymbol{\theta})\, \mathrm{d}L}.$$
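A toy generative-approach sketch (assuming SciPy; the class priors and the Gaussian class-conditional densities are invented for illustration) that evaluates exactly this Bayes'-rule formula for a one-dimensional feature:

```python
from scipy.stats import norm

priors = {"spam": 0.4, "non-spam": 0.6}        # p(label | theta), chosen for the example
likelihoods = {                                # p(x | label, theta): per-class Gaussians
    "spam": norm(loc=5.0, scale=1.0),
    "non-spam": norm(loc=2.0, scale=1.5),
}

x = 4.0                                        # a single 1-D feature value
joint = {L: likelihoods[L].pdf(x) * priors[L] for L in priors}
evidence = sum(joint.values())                 # the denominator (sum over all labels)
posterior = {L: joint[L] / evidence for L in joint}
print(posterior)                               # p(label | x, theta) for each label
```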

The value of $\boldsymbol{\theta}$ is typically learned using maximum a posteriori (MAP) estimation. This finds the best value that simultaneously meets two conflicting objectives: to perform as well as possible on the training data (smallest error rate) and to find the simplest possible model. Essentially, this combines maximum likelihood estimation with a regularization procedure that favors simpler models over more complex models. In a Bayesian context, the regularization procedure can be viewed as placing a prior probability $p(\boldsymbol{\theta})$ on different values of $\boldsymbol{\theta}$. Mathematically:

$$\boldsymbol{\theta}^{*} = \arg\max_{\boldsymbol{\theta}} p(\boldsymbol{\theta} \mid \mathbf{D})$$

where $\boldsymbol{\theta}^{*}$ is the value used for $\boldsymbol{\theta}$ in the subsequent evaluation procedure, and $p(\boldsymbol{\theta} \mid \mathbf{D})$, the posterior probability of $\boldsymbol{\theta}$, is given, up to a normalizing constant that does not depend on $\boldsymbol{\theta}$, by

$$p(\boldsymbol{\theta} \mid \mathbf{D}) \propto \left[\prod_{i=1}^{n} p(y_i \mid \boldsymbol{x}_i, \boldsymbol{\theta})\right] p(\boldsymbol{\theta}).$$
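A minimal MAP sketch, assuming NumPy and SciPy and a hypothetical one-parameter logistic model: maximizing the (log) posterior amounts to maximizing the log-likelihood plus a log-prior penalty, which is the regularization trade-off described above.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
x = rng.normal(size=100)
# Synthetic labels generated from a logistic model with true slope 2.0.
y = (rng.random(100) < 1 / (1 + np.exp(-2.0 * x))).astype(float)

def neg_log_posterior(theta, prior_var=1.0):
    p = 1 / (1 + np.exp(-theta * x))
    log_lik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    log_prior = -theta**2 / (2 * prior_var)   # Gaussian prior on theta acts as regularization
    return -(log_lik + log_prior)

theta_map = minimize_scalar(neg_log_posterior, bounds=(-10, 10), method="bounded").x
print(round(theta_map, 2))                    # MAP estimate, shrunk slightly toward 0 by the prior
```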

In the Bayesian approach to this problem, instead of choosing a single parameter vector $\boldsymbol{\theta}^{*}$, the probability of a given label for a new instance $\boldsymbol{x}$ is computed by integrating over all possible values of $\boldsymbol{\theta}$, weighted according to the posterior probability:

$$p(\mathrm{label} \mid \boldsymbol{x}) = \int p(\mathrm{label} \mid \boldsymbol{x}, \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \mathbf{D})\, \mathrm{d}\boldsymbol{\theta}.$$
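A rough illustration of this integral via Monte Carlo averaging; the posterior samples below are stand-ins for the output of a real posterior sampler, and the one-parameter logistic model is hypothetical.

```python
import numpy as np

# Pretend these were drawn from p(theta | D) by a posterior sampler.
theta_samples = np.random.default_rng(6).normal(loc=2.0, scale=0.3, size=1000)

def p_label_given_x(x, theta):
    # p(label | x, theta) for a simple 1-D logistic model.
    return 1 / (1 + np.exp(-theta * x))

x_new = 0.8
# Average the predictive probability over the posterior samples.
p_label = np.mean([p_label_given_x(x_new, t) for t in theta_samples])
print(round(p_label, 3))
```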

Frequentist or Bayesian approach to pattern recognition


The first pattern classifier – the linear discriminant presented by Fisher – was developed in the frequentist tradition. The frequentist approach entails that the model parameters are considered unknown, but objective. The parameters are then computed (estimated) from the collected data. For the linear discriminant, these parameters are precisely the mean vectors and the covariance matrix. Also the probability of each class $p(\mathrm{label} \mid \boldsymbol{\theta})$ is estimated from the collected dataset. Note that the usage of 'Bayes' rule' in a pattern classifier does not make the classification approach Bayesian.
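A frequentist sketch in this spirit, assuming scikit-learn and synthetic two-class data: the class means, the shared covariance matrix and the class probabilities are all estimated from the collected data and then plugged into the linear discriminant.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(7)
X = np.vstack([rng.normal([0, 0], 1.0, (60, 2)),
               rng.normal([3, 3], 1.0, (60, 2))])
y = np.repeat([0, 1], 60)

lda = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(lda.means_)        # estimated mean vector per class
print(lda.covariance_)   # estimated (shared) covariance matrix
print(lda.priors_)       # class probabilities estimated from the data
print(lda.predict([[1.4, 1.6]]))
```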

Bayesian statistics has its origin in Greek philosophy, where a distinction was already made between 'a priori' and 'a posteriori' knowledge. Later Kant defined his distinction between what is known a priori – before observation – and the empirical knowledge gained from observations. In a Bayesian pattern classifier, the class probabilities $p(\mathrm{label} \mid \boldsymbol{\theta})$ can be chosen by the user, which are then a priori. Moreover, experience quantified as a priori parameter values can be weighted with empirical observations – using, e.g., the Beta (conjugate prior) and Dirichlet distributions. The Bayesian approach facilitates a seamless intermixing between expert knowledge in the form of subjective probabilities and objective observations.
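A small sketch of this intermixing for class probabilities, using Dirichlet/Beta-style pseudo-counts; the prior counts and observed counts are purely illustrative.

```python
import numpy as np

prior_pseudo_counts = np.array([8.0, 2.0])   # expert belief expressed as pseudo-counts (~80% class A)
observed_counts = np.array([30, 70])         # class labels actually observed in the data

# Conjugate update: prior pseudo-counts and empirical counts simply add.
posterior_counts = prior_pseudo_counts + observed_counts
posterior_mean = posterior_counts / posterior_counts.sum()
print(posterior_mean)                        # posterior class probabilities, approx. [0.345, 0.655]
```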

Probabilistic pattern classifiers can be used according to a frequentist or a Bayesian approach.

Uses

A face detected by facial recognition software

Within medical science, pattern recognition is the basis for computer-aided diagnosis (CAD) systems. CAD describes a procedure that supports the doctor's interpretations and findings. Other typical applications of pattern recognition techniques are automatic speech recognition, speaker identification, classification of text into several categories (e.g., spam or non-spam email messages), the automatic recognition of handwriting on postal envelopes, automatic recognition of images of human faces, or handwriting image extraction from medical forms.[9][10] The last two examples form the subtopic image analysis of pattern recognition that deals with digital images as input to pattern recognition systems.[11][12]

Optical character recognition is an example of the application of a pattern classifier. The method of signing one's name was captured with stylus and overlay starting in 1990.[citation needed] The strokes, speed, relative minima, relative maxima, acceleration and pressure are used to uniquely identify and confirm identity. Banks were first offered this technology, but were content to collect from the FDIC for any bank fraud and did not want to inconvenience customers.[citation needed]

Pattern recognition has many real-world applications in image processing and related signal analysis. Some examples include:

  • automatic number plate recognition;[13]
  • face detection and recognition;[14]
  • speaker verification;[15]
  • computer-aided screening of cervical smears (PAPNET);[16]
  • camera-based perception and control for autonomous vehicles.[17][18][19][20][21]

In psychology, pattern recognition is used to make sense of and identify objects, and is closely related to perception. This explains how the sensory inputs humans receive are made meaningful. Pattern recognition can be thought of in two different ways. The first concerns template matching and the second concerns feature detection. A template is a pattern used to produce items of the same proportions. The template-matching hypothesis suggests that incoming stimuli are compared with templates in long-term memory; if there is a match, the stimulus is identified. Feature detection models, such as the Pandemonium system for classifying letters (Selfridge, 1959), suggest that stimuli are broken down into their component parts for identification; for example, a capital E is detected as three horizontal lines and one vertical line.[22]

Algorithms


Algorithms for pattern recognition depend on the type of label output, on whether learning is supervised or unsupervised, and on whether the algorithm is statistical or non-statistical in nature. Statistical algorithms can further be categorized as generative or discriminative.

Classification methods (methods predicting categorical labels)

Main article: Statistical classification

Parametric:[23]

Nonparametric:[24]

Clustering methods (methods for classifying and predicting categorical labels)

Main article: Cluster analysis

Ensemble learning algorithms (supervised meta-algorithms for combining multiple learning algorithms together)

Main article: Ensemble learning

General methods for predicting arbitrarily-structured (sets of) labels


Multilinear subspace learning algorithms (predicting labels of multidimensional data using tensor representations)


Unsupervised:

Real-valued sequence labeling methods (predicting sequences of real-valued labels)

Main article: Sequence labeling

Regression methods (predicting real-valued labels)

Main article: Regression analysis

Sequence labeling methods (predicting sequences of categorical labels)




See also


References

  1. ^ Howard, W.R. (2007-02-20). "Pattern Recognition and Machine Learning". Kybernetes. 36 (2): 275. doi:10.1108/03684920710743466. ISSN 0368-492X.
  2. ^ "Sequence Labeling" (PDF). utah.edu. Archived (PDF) from the original on 2018-11-06. Retrieved 2018-11-06.
  3. ^ Chiswell, Ian (2007). Mathematical Logic, p. 34. Oxford University Press. ISBN 978-0-19-921562-1. OCLC 799802313.
  4. ^ Bishop, Christopher M. (2006). Pattern Recognition and Machine Learning. Springer.
  5. ^ Carvalko, J.R.; Preston, K. (1972). "On Determining Optimum Simple Golay Marking Transforms for Binary Image Processing". IEEE Transactions on Computers. 21 (12): 1430–33. doi:10.1109/T-C.1972.223519. S2CID 21050445.
  6. ^ Guyon, Isabelle; Elisseeff, André (2003). "An Introduction to Variable and Feature Selection". The Journal of Machine Learning Research. 3: 1157–1182. Link Archived 2016-03-04 at the Wayback Machine.
  7. ^ Foroutan, Iman; Sklansky, Jack (1987). "Feature Selection for Automatic Classification of Non-Gaussian Data". IEEE Transactions on Systems, Man, and Cybernetics. 17 (2): 187–198. Bibcode:1987ITSMC..17..187F. doi:10.1109/TSMC.1987.4309029. S2CID 9871395.
  8. ^ For linear discriminant analysis the parameter vector $\boldsymbol{\theta}$ consists of the two mean vectors $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ and the common covariance matrix $\boldsymbol{\Sigma}$.
  9. ^ Milewski, Robert; Govindaraju, Venu (31 March 2008). "Binarization and cleanup of handwritten text from carbon copy medical form images". Pattern Recognition. 41 (4): 1308–1315. Bibcode:2008PatRe..41.1308M. doi:10.1016/j.patcog.2007.08.018. Archived from the original on 10 September 2020. Retrieved 26 October 2011.
  10. ^ Sarangi, Susanta; Sahidullah, Md; Saha, Goutam (September 2020). "Optimization of data-driven filterbank for automatic speaker verification". Digital Signal Processing. 104: 102795. arXiv:2007.10729. Bibcode:2020DSP...10402795S. doi:10.1016/j.dsp.2020.102795. S2CID 220665533.
  11. ^ Duda, Richard O.; Hart, Peter E.; Stork, David G. (2001). Pattern Classification (2nd ed.). Wiley, New York. ISBN 978-0-471-05669-0. Archived from the original on 2020-08-19. Retrieved 2019-11-26.
  12. ^ Brunelli, R. (2009). Template Matching Techniques in Computer Vision: Theory and Practice. Wiley. ISBN 978-0-470-51706-2.
  13. ^ The Automatic Number Plate Recognition Tutorial. http://anpr-tutorial.com/ Archived 2006-08-20 at the Wayback Machine.
  14. ^ Neural Networks for Face Recognition Archived 2016-03-04 at the Wayback Machine. Companion to Chapter 4 of the textbook Machine Learning.
  15. ^ Poddar, Arnab; Sahidullah, Md; Saha, Goutam (March 2018). "Speaker Verification with Short Utterances: A Review of Challenges, Trends and Opportunities". IET Biometrics. 7 (2): 91–101. doi:10.1049/iet-bmt.2017.0065. Archived from the original on 2019-09-03. Retrieved 2019-08-27.
  16. ^ PAPNET For Cervical Screening Archived 2012-07-08 at archive.today.
  17. ^ "Development of an Autonomous Vehicle Control Strategy Using a Single Camera and Deep Neural Networks (2018-01-0035 Technical Paper) - SAE Mobilus". saemobilus.sae.org. 3 April 2018. doi:10.4271/2018-01-0035. Archived from the original on 2019-09-06. Retrieved 2019-09-06.
  18. ^ Gerdes, J. Christian; Kegelman, John C.; Kapania, Nitin R.; Brown, Matthew; Spielberg, Nathan A. (2019-03-27). "Neural network vehicle models for high-performance automated driving". Science Robotics. 4 (28): eaaw1975. doi:10.1126/scirobotics.aaw1975. ISSN 2470-9476. PMID 33137751. S2CID 89616974.
  19. ^ Pickering, Chris (2017-08-15). "How AI is paving the way for fully autonomous cars". The Engineer. Archived from the original on 2019-09-06. Retrieved 2019-09-06.
  20. ^ Ray, Baishakhi; Jana, Suman; Pei, Kexin; Tian, Yuchi (2017-08-28). "DeepTest: Automated Testing of Deep-Neural-Network-driven Autonomous Cars". arXiv:1708.08559. Bibcode:2017arXiv170808559T.
  21. ^ Sinha, P. K.; Hadjiiski, L. M.; Mutib, K. (1993-04-01). "Neural Networks in Autonomous Vehicle Control". IFAC Proceedings Volumes. 1st IFAC International Workshop on Intelligent Autonomous Vehicles, Hampshire, UK, 18–21 April. 26 (1): 335–340. doi:10.1016/S1474-6670(17)49322-0. ISSN 1474-6670.
  22. ^ "A-level Psychology Attention Revision - Pattern recognition | S-cool, the revision website". S-cool.co.uk. Archived from the original on 2013-06-22. Retrieved 2012-09-17.
  23. ^ Assuming known distributional shape of feature distributions per class, such as the Gaussian shape.
  24. ^ No distributional assumption regarding shape of feature distributions per class.

Further reading


External links
