CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims the benefit under 35 U.S.C. §119(e) of U.S. Provisional Patent Application No. 62/331,084, filed on May 3, 2016, the entire contents of which are hereby incorporated by reference.
BACKGROUND
Support vector data description (SVDD) is a machine-learning technique used for single-class classification and outlier detection. The SVDD formulation with a kernel function provides a flexible description of the data.
The SVDD of a dataset is obtained by solving a quadratic programming problem. The time required to solve the quadratic programming problem is directly related to the number of observations in the training dataset, resulting in very high computing times for large training datasets.
SUMMARY
In an example embodiment, a non-transitory computer-readable medium is provided having stored thereon computer-readable instructions that, when executed by a computing device, cause the computing device to determine a support vector data description for outlier identification. A first set of observation vectors is randomly selected from a training dataset. A number of the first set of observation vectors is a predefined sample size. A first optimal value of an objective function defined for a support vector data description (SVDD) model is computed using the selected first set of observation vectors to define a first set of support vectors that define a first data description for the training dataset. (a) A second set of observation vectors is randomly selected from the training dataset, wherein a number of the second set of observation vectors is the predefined sample size. (b) A second optimal value of the objective function is computed using the selected second set of observation vectors to define a second set of support vectors, wherein the second set of support vectors define a second data description for the training dataset. (c) The first set of support vectors is updated to include the defined second set of support vectors. (d) A third optimal value of the objective function is computed using the updated first set of support vectors to define a third set of support vectors, wherein the third set of support vectors define a third data description for the training dataset. (e) A value of a stop parameter is computed. (f) Whether or not a stop condition is satisfied is determined by comparing the computed value to a stop criterion. (g) When the stop condition is not satisfied, the first set of support vectors is defined as the defined third set of support vectors, and (a)-(g) are repeated until the stop condition in (f) is satisfied. When the stop condition in (f) is satisfied, the defined third set of support vectors is output for identifying an outlier in a scoring dataset.
In another example embodiment, a computing device is provided. The computing device includes, but is not limited to, a processor and a non-transitory computer-readable medium operably coupled to the processor. The computer-readable medium has instructions stored thereon that, when executed by the computing device, cause the computing device to determine a support vector data description for outlier identification.
In yet another example embodiment, a method of determining a support vector data description for outlier identification is provided.
Other principal features of the disclosed subject matter will become apparent to those skilled in the art upon review of the following drawings, the detailed description, and the appended claims.
BRIEF DESCRIPTION OF THE DRAWINGS
Illustrative embodiments of the disclosed subject matter will hereafter be described referring to the accompanying drawings, wherein like numerals denote like elements.
FIG. 1 depicts a block diagram of a support vector data description (SVDD) training device in accordance with an illustrative embodiment.
FIG. 2 depicts a SVDD result defining a normal data description in accordance with an illustrative embodiment.
FIG. 3 depicts an SVDD result defining a flexible data description using a Gaussian kernel function in accordance with an illustrative embodiment.
FIGS. 4A, 4B, and 4C depict a flow diagram illustrating examples of operations performed by the SVDD training device of FIG. 1 in accordance with an illustrative embodiment.
FIG. 5 depicts a first example dataset having a banana shape in accordance with an illustrative embodiment.
FIGS. 6-12 depict SVDD results using a Gaussian kernel function at different iterations of the operations of FIGS. 4A, 4B, and 4C using the first example dataset of FIG. 5 in accordance with an illustrative embodiment.
FIG. 13 depicts a value of a threshold R² as a function of an iteration number of the operations of FIGS. 4A, 4B, and 4C using the first example dataset of FIG. 5 in accordance with an illustrative embodiment.
FIG. 14 depicts a value of a run time and a number of iterations as a function of a sample size for the operations of FIGS. 4A, 4B, and 4C using the first example dataset of FIG. 5 in accordance with an illustrative embodiment.
FIG. 15 depicts a second example dataset having a star shape in accordance with an illustrative embodiment.
FIG. 16 depicts a value of a run time and a number of iterations as a function of a sample size for the operations of FIGS. 4A, 4B, and 4C using the second example dataset of FIG. 15 in accordance with an illustrative embodiment.
FIG. 17 depicts a third example dataset having a two-donut shape in accordance with an illustrative embodiment.
FIG. 18 depicts a value of a run time and a number of iterations as a function of a sample size for the operations of FIGS. 4A, 4B, and 4C using the third example dataset of FIG. 17 in accordance with an illustrative embodiment.
FIG. 19 depicts a value of a processing time as a function of a number of observations in a training dataset selected from the third example dataset of FIG. 17 using the full SVDD method (solving for the SVDD using the entire selected training dataset in one iteration) in accordance with an illustrative embodiment.
FIG. 20 depicts a block diagram of an SVDD training system in accordance with an illustrative embodiment.
FIG. 21 depicts a block diagram of an outlier identification device in accordance with an illustrative embodiment.
FIG. 22 depicts a flow diagram illustrating examples of operations performed by the outlier identification device of FIG. 21 in accordance with an illustrative embodiment.
FIG. 23 depicts scoring results using the SVDD defined using the operations of FIGS. 4A, 4B, 4C, and 22 with the first example dataset of FIG. 5 in accordance with an illustrative embodiment.
FIG. 24 depicts scoring results using an SVDD defined using the first example dataset of FIG. 5 in accordance with an illustrative embodiment.
FIG. 25 depicts scoring results using the SVDD defined using the operations of FIGS. 4A, 4B, 4C, and 22 with the second example dataset of FIG. 15 in accordance with an illustrative embodiment.
FIG. 26 depicts scoring results using an SVDD defined using the second example dataset of FIG. 15 in accordance with an illustrative embodiment.
FIG. 27 depicts scoring results using the SVDD defined using the operations of FIGS. 4A, 4B, 4C, and 22 with the third example dataset of FIG. 17 in accordance with an illustrative embodiment.
FIG. 28 depicts scoring results using an SVDD defined using the third example dataset of FIG. 17 in accordance with an illustrative embodiment.
DETAILED DESCRIPTION
Referring to FIG. 1, a block diagram of a support vector data description (SVDD) training device 100 is shown in accordance with an illustrative embodiment. SVDD training device 100 may include an input interface 102, an output interface 104, a communication interface 106, a non-transitory computer-readable medium 108, a processor 110, a training application 122, a training dataset 124, and a support vector data description (SVDD) 126. Fewer, different, and/or additional components may be incorporated into SVDD training device 100.
Input interface102 provides an interface for receiving information from the user or another device for entry intoSVDD training device100 as understood by those skilled in the art.Input interface102 may interface with various input technologies including, but not limited to, akeyboard112, amicrophone113, a mouse114, adisplay116, a track ball, a keypad, one or more buttons, etc. to allow the user to enter information intoSVDD training device100 or to make selections presented in a user interface displayed ondisplay116. The same interface may support bothinput interface102 andoutput interface104. For example, display116 comprising a touch screen provides a mechanism for user input and for presentation of output to the user.SVDD training device100 may have one or more input interfaces that use the same or a different input interface technology. The input interface technology further may be accessible bySVDD training device100 throughcommunication interface106.
Output interface104 provides an interface for outputting information for review by a user ofSVDD training device100 and/or for use by another application or device. For example,output interface104 may interface with various output technologies including, but not limited to, display116, aspeaker118, aprinter120, etc.SVDD training device100 may have one or more output interfaces that use the same or a different output interface technology. The output interface technology further may be accessible bySVDD training device100 throughcommunication interface106.
Communication interface106 provides an interface for receiving and transmitting data between devices using various protocols, transmission technologies, and media as understood by those skilled in the art.Communication interface106 may support communication using various transmission media that may be wired and/or wireless.SVDD training device100 may have one or more communication interfaces that use the same or a different communication interface technology. For example,SVDD training device100 may support communication using an Ethernet port, a Bluetooth antenna, a telephone jack, a USB port, etc. Data and messages may be transferred betweenSVDD training device100 and another computing device usingcommunication interface106.
Computer-readable medium108 is an electronic holding place or storage for information so the information can be accessed byprocessor110 as understood by those skilled in the art. Computer-readable medium108 can include, but is not limited to, any type of random access memory (RAM), any type of read only memory (ROM), any type of flash memory, etc. such as magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips, . . . ), optical disks (e.g., compact disc (CD), digital versatile disc (DVD), . . . ), smart cards, flash memory devices, etc.SVDD training device100 may have one or more computer-readable media that use the same or a different memory media technology. For example, computer-readable medium108 may include different types of computer-readable media that may be organized hierarchically to provide efficient access to the data stored therein as understood by a person of skill in the art. As an example, a cache may be implemented in a smaller, faster memory that stores copies of data from the most frequently/recently accessed main memory locations to reduce an access latency.SVDD training device100 also may have one or more drives that support the loading of a memory media such as a CD, DVD, an external hard drive, etc. One or more external hard drives further may be connected toSVDD training device100 usingcommunication interface106.
Processor110 executes instructions as understood by those skilled in the art. The instructions may be carried out by a special purpose computer, logic circuits, or hardware circuits.Processor110 may be implemented in hardware and/or firmware.Processor110 executes an instruction, meaning it performs/controls the operations called for by that instruction. The term “execution” is the process of running an application or the carrying out of the operation called for by an instruction. The instructions may be written using one or more programming language, scripting language, assembly language, etc.Processor110 operably couples withinput interface102, withoutput interface104, withcommunication interface106, and with computer-readable medium108 to receive, to send, and to process information.Processor110 may retrieve a set of instructions from a permanent memory device and copy the instructions in an executable form to a temporary memory device that is generally some form of RAM.SVDD training device100 may include a plurality of processors that use the same or a different processing technology.
Training application122 performs operations associated with definingSVDD126 from data stored intraining dataset124.SVDD126 may be used to classify data stored in a scoring dataset and to identify outliers in the scoring dataset that may be stored in an outlier dataset to support various data analysis functions as well as provide alert/messaging related to the identified outliers. Some or all of the operations described herein may be embodied intraining application122. The operations may be implemented using hardware, firmware, software, or any combination of these methods.
Referring to the example embodiment ofFIG. 1,training application122 is implemented in software (comprised of computer-readable and/or computer-executable instructions) stored in computer-readable medium108 and accessible byprocessor110 for execution of the instructions that embody the operations oftraining application122.Training application122 may be written using one or more programming languages, assembly languages, scripting languages, etc.Training application122 may be integrated with other analytic tools. For example,training application122 may be part of SAS® Enterprise Miner™ developed and provided by SAS Institute Inc. of Cary, N.C. that may be used to create highly accurate predictive and descriptive models based on analysis of vast amounts of data from across an enterprise. Data mining is applicable in a variety of industries.
Training application122 may be integrated with other system processing tools to automatically process data generated as part of operation of an enterprise, device, system, facility, etc., to identify any outliers in the processed data, and to provide a warning or alert associated with the outlier identification usinginput interface102,output interface104, and/orcommunication interface106 so that appropriate action can be initiated in response to the outlier identification.
Training application122 may be implemented as a Web application. For example,training application122 may be configured to receive hypertext transport protocol (HTTP) responses and to send HTTP requests. The HTTP responses may include web pages such as hypertext markup language (HTML) documents and linked objects generated in response to the HTTP requests. Each web page may be identified by a uniform resource locator (URL) that includes the location or address of the computing device that contains the resource to be accessed in addition to the location of the resource on that computing device. The type of file or resource depends on the Internet application protocol such as the file transfer protocol, HTTP, H.323, etc. The file accessed may be a simple text file, an image file, an audio file, a video file, an executable, a common gateway interface application, a Java applet, an extensible markup language (XML) file, or any other type of file supported by HTTP.
Training dataset124 may include, for example, a plurality of rows and a plurality of columns. The plurality of rows may be referred to as observation vectors or records, and the columns may be referred to as variables.Training dataset124 may be transposed.Training dataset124 may include unsupervised data. The plurality of variables may define multiple dimensions for each observation vector. An observation vector ximay include a value for each of the plurality of variables associated with the observation i. Each variable of the plurality of variables describes a characteristic of a physical object. For example, iftraining dataset124 includes data related to operation of a vehicle, the variables may include an oil pressure, a speed, a gear indicator, a gas tank level, a tire pressure for each tire, an engine temperature, a radiator level, etc.Training dataset124 may include data captured as a function of time for one or more physical objects.
Training dataset124 may be stored on computer-readable medium108 or on one or more computer-readable media of distributedcomputing system2128 and accessed bySVDD training device100 usingcommunication interface106,input interface102, and/oroutput interface104. Data stored intraining dataset124 may be sensor measurements or signal values captured by a sensor, may be generated or captured in response to occurrence of an event or a transaction, generated by a device such as in response to an interaction by a user with the device, etc. The data stored intraining dataset124 may include any type of content represented in any computer-readable format such as binary, alphanumeric, numeric, string, markup language, etc. The content may include textual information, graphical information, image information, audio information, numeric information, etc. that further may be encoded using various encoding techniques as understood by a person of skill in the art. The data stored intraining dataset124 may be captured at different time points periodically, intermittently, when an event occurs, etc. One or more columns may include a time value.
Training dataset124 may include data captured under normal operating conditions of the physical object.Training dataset124 may include data captured at a high data rate such as 200 or more observations per second for one or more physical objects. For example, data stored intraining dataset124 may be generated as part of the Internet of Things (IoT), where things (e.g., machines, devices, phones, sensors) can be connected to networks and the data from these things collected and processed within the things and/or external to the things before being stored intraining dataset124. For example, the IoT can include sensors in many different devices and types of devices, and high value analytics can be applied to identify hidden relationships and drive increased efficiencies. This can apply to both big data analytics and real-time analytics. Some of these devices may be referred to as edge devices, and may involve edge computing circuitry. These devices may provide a variety of stored or generated data, such as network data or data specific to the network devices themselves. Some data may be processed with an event stream processing engine, which may reside in the cloud or in an edge device before being stored intraining dataset124.
Training dataset124 may be stored in various compressed formats such as a coordinate format, a compressed sparse column format, a compressed sparse row format, etc.Training dataset124 further may be stored using various structures as known to those skilled in the art including a file system, a relational database, a system of tables, a structured query language database, etc. onSVDD training device100 or on distributedcomputing system2128.SVDD training device100 may coordinate access totraining dataset124 that is distributed across a plurality of computing devices. For example,training dataset124 may be stored in a cube distributed across a grid of computers as understood by a person of skill in the art. As another example,training dataset124 may be stored in a multi-node Hadoop® cluster. For instance, Apache™ Hadoop® is an open-source software framework for distributed computing supported by the Apache Software Foundation. As another example,training dataset124 may be stored in a cloud of computers and accessed using cloud computing technologies, as understood by a person of skill in the art. The SAS® LASR™ Analytic Server developed and provided by SAS Institute Inc. of Cary, N.C. may be used as an analytic platform to enable multiple users to concurrently access data stored intraining dataset124. Some systems may use SAS In-Memory Statistics for Hadoop® developed and provided by SAS Institute Inc. of Cary, N.C. to read big data once and analyze it several times by persisting it in-memory for the entire session. Some systems may be of other types and configurations.
An SVDD model is used in domains where a majority of the data in training dataset 124 belongs to a single class. An SVDD model for normal data description builds a minimum-radius hypersphere around the data. The objective function for the SVDD model for normal data description is

max( Σ_{i=1}^{n} α_i (x_i · x_i) − Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j (x_i · x_j) ),   (1)

subject to:

Σ_{i=1}^{n} α_i = 1,   (2)

0 ≤ α_i ≤ C, ∀ i = 1, . . . , n,   (3)

where x_i ∈ ℝ^m, i = 1, . . . , n represents the n observations in training dataset 124, the α_i ∈ ℝ are Lagrange constants, C = 1/(nf) is a penalty constant that controls a trade-off between a volume and errors, and f is an expected outlier fraction. The expected outlier fraction is generally known to an analyst. Data preprocessing can ensure that training dataset 124 belongs to a single class. In this case, f can be set to a very low value such as 0.001. SV is the set of support vectors that includes the observation vectors in training dataset 124 that have C ≥ α_i > 0 after solving equation (1) above. SV_{<C} is a subset of the support vectors that includes the observation vectors in training dataset 124 that have C > α_i > 0 after solving equation (1) above.
Depending upon a position of an observation vector, the following results are true:
Center position: a = Σ_{i=1}^{n} α_i x_i.   (4)

Inside position: ‖x_i − a‖ < R → α_i = 0.   (5)

Data description position: ‖x_i − a‖ = R → 0 < α_i < C.   (6)

Outside position: ‖x_i − a‖ > R → α_i = C.   (7)

where a is a center of the hypersphere and R is a radius of the hypersphere. The radius of the hypersphere is calculated as:

R² = x_k · x_k − 2 Σ_{i=1}^{N_SV} α_i (x_i · x_k) + Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j (x_i · x_j)   (8)

using any x_k ∈ SV_{<C}, where x_i and x_j are the support vectors, α_i and α_j are the Lagrange constants of the associated support vectors, and N_SV is a number of the support vectors included in the set of support vectors. An observation vector z is indicated as an outlier when dist²(z) > R², where dist²(z) = (z · z) − 2 Σ_{i=1}^{N_SV} α_i (x_i · z) + Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j (x_i · x_j).
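For illustration only, the following is a minimal Python sketch of the center, radius, and outlier test of equations (4)-(8) for the normal (linear) data description. It is not part of the claimed method; the function and variable names (center, radius_squared, is_outlier, support_vectors, alphas, x_k) are hypothetical, and x_k is assumed to be a support vector with α_k < C.

```python
import numpy as np

def center(support_vectors, alphas):
    # Center of the hypersphere, equation (4): a = sum_i alpha_i * x_i.
    return alphas @ support_vectors

def radius_squared(support_vectors, alphas, x_k):
    # Equation (8): R^2 = x_k.x_k - 2 sum_i alpha_i (x_i.x_k)
    #                     + sum_ij alpha_i alpha_j (x_i.x_j).
    gram = support_vectors @ support_vectors.T
    return x_k @ x_k - 2.0 * alphas @ (support_vectors @ x_k) + alphas @ gram @ alphas

def is_outlier(z, support_vectors, alphas, r2):
    # dist^2(z) = z.z - 2 sum_i alpha_i (x_i.z) + sum_ij alpha_i alpha_j (x_i.x_j);
    # z is indicated as an outlier when dist^2(z) > R^2.
    gram = support_vectors @ support_vectors.T
    dist2 = z @ z - 2.0 * alphas @ (support_vectors @ z) + alphas @ gram @ alphas
    return dist2 > r2
```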
Referring to FIG. 2, a SVDD is illustrated in accordance with an illustrative embodiment that defines a normal data description 200 having a radius R from a center a. Data description 200 is characterized by observation vectors 202 (shown as data points on the graph), which are the set of support vectors SV. For illustration, observation vectors 202 are defined by values of variables x1 and x2.
Normal data description 200 can include a significant amount of space with a very sparse distribution of training observations. Scoring with this model can increase the probability of false positives. Hence, instead of a circular shape, a compact bounded outline around the data that approximates a shape of the data in training dataset 124 is preferred. This is possible using a kernel function. The SVDD is made flexible by replacing the inner product (x_i · x_j) with a suitable kernel function K(x_i, x_j). A Gaussian kernel function is described herein, though this is not intended to be limiting. For example, any exponential function or polynomial function may be used. The Gaussian kernel function may be defined as:
where s is a kernel parameter that is termed a Gaussian bandwidth parameter.
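For illustration only, a minimal Python sketch of a Gaussian kernel follows. Because equation (9) is not reproduced above, the exact normalization of the bandwidth s used here (the 2s² denominator) is an assumption, and the function name gaussian_kernel is hypothetical.

```python
import numpy as np

def gaussian_kernel(x_i, x_j, s):
    # Gaussian (radial basis function) kernel with bandwidth parameter s.
    # The 2*s**2 normalization is an assumed parameterization; equation (9)
    # above may scale s differently.
    diff = np.asarray(x_i) - np.asarray(x_j)
    return np.exp(-np.sum(diff ** 2) / (2.0 * s ** 2))
```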
The objective function for the SVDD model with the Gaussian kernel function is
max( Σ_{i=1}^{n} α_i K(x_i, x_i) − Σ_{i=1}^{n} Σ_{j=1}^{n} α_i α_j K(x_i, x_j) ),   (10)

subject to:

Σ_{i=1}^{n} α_i = 1,   (11)

0 ≤ α_i ≤ C, ∀ i = 1, . . . , n,   (12)

where again SV is the set of support vectors that includes the observation vectors in training dataset 124 that have C ≥ α_i > 0 after solving equation (10) above, and SV_{<C} is the subset of the support vectors that includes the observation vectors in training dataset 124 that have C > α_i > 0 after solving equation (10) above.
The results from equations (4) to (7) above remain valid. The threshold is computed as:
R² = K(x_k, x_k) − 2 Σ_{i=1}^{N_SV} α_i K(x_i, x_k) + Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j K(x_i, x_j)   (13)

using any x_k ∈ SV_{<C}, where x_i and x_j are the support vectors, α_i and α_j are the Lagrange constants of the associated support vectors, and N_SV is a number of the support vectors included in the set of support vectors.

An observation vector z is indicated as an outlier when dist²(z) > R², where

dist²(z) = K(z, z) − 2 Σ_{i=1}^{N_SV} α_i K(x_i, z) + Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j K(x_i, x_j).   (14)

Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j K(x_i, x_j) is a constant that can be denoted as W and that can be determined from the set of support vectors. R² is a threshold determined using the set of support vectors. For a Gaussian kernel function, K(z, z) = 1. Thus, equation (14) can be simplified to dist²(z) = 1 − 2 Σ_{i=1}^{N_SV} α_i K(x_i, z) + W for a Gaussian kernel function.
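For illustration only, a minimal Python sketch of scoring an observation vector z using the simplified form of equation (14) follows. The function and parameter names (score_observation, support_vectors, alphas, W, r2, s) are hypothetical, and the Gaussian kernel uses the assumed parameterization sketched above.

```python
import numpy as np

def score_observation(z, support_vectors, alphas, W, r2, s):
    # dist^2(z) = 1 - 2 * sum_i alpha_i K(x_i, z) + W for a Gaussian kernel,
    # where W = sum_ij alpha_i alpha_j K(x_i, x_j) is precomputed from the
    # support vectors and K(z, z) = 1.
    k = np.exp(-np.sum((support_vectors - z) ** 2, axis=1) / (2.0 * s ** 2))
    dist2 = 1.0 - 2.0 * (alphas @ k) + W
    return dist2 > r2  # True indicates z is an outlier
```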
Referring to FIG. 3, a SVDD is shown in accordance with an illustrative embodiment that defines a flexible data description 300. Flexible data description 300 is characterized by support vectors 302, which are the set of support vectors SV.
Referring to FIGS. 4A, 4B, and 4C, example operations associated with training application 122 are described. For example, training application 122 may be used to create SVDD 126 from training dataset 124. Instead of using all observations from training dataset 124, training application 122 computes SVDD 126 by iteratively computing an SVDD on independent random samples obtained from training dataset 124 and combining them. Training application 122 has been shown to work well even when the random samples have only a few observations.
Additional, fewer, or different operations may be performed depending on the embodiment of training application 122. The order of presentation of the operations of FIGS. 4A, 4B, and 4C is not intended to be limiting. Although some of the operational flows are presented in sequence, the various operations may be performed in various repetitions, concurrently (in parallel, for example, using threads and/or a distributed computing system), and/or in other orders than those that are illustrated. For example, a user may execute training application 122, which causes presentation of a first user interface window, which may include a plurality of menus and selectors such as drop down menus, buttons, text boxes, hyperlinks, etc. associated with training application 122 as understood by a person of skill in the art. The plurality of menus and selectors may be accessed in various orders. An indicator may indicate one or more user selections from a user interface, one or more data entries into a data field of the user interface, one or more data items read from computer-readable medium 108 or otherwise defined with one or more default values, etc. that are received as an input by training application 122.
Referring to FIG. 4A, in an operation 400, a first indicator may be received that indicates training dataset 124. For example, the first indicator indicates a location and a name of training dataset 124. As an example, the first indicator may be received by training application 122 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, training dataset 124 may not be selectable. For example, a most recently created dataset may be used automatically.
In an operation 402, a second indicator may be received that indicates a plurality of variables of training dataset 124 to define x_i. The second indicator may indicate that all or only a subset of the variables stored in training dataset 124 be used to define SVDD 126. For example, the second indicator indicates a list of variables to use by name, column number, etc. In an alternative embodiment, the second indicator may not be received. For example, all of the variables may be used automatically.
In an operation 404, a third indicator may be received that indicates a sample size N_s. The third indicator indicates a number of observations to use from training dataset 124, a percentage of observations to use from training dataset 124, etc. N_s may be very small. For illustration, N_s may be between 3 and 20 for a training dataset that includes greater than 15,000 observations for a dataset with two variables. N_s observations may be created from training dataset 124 by sampling. An example sampling algorithm is uniform sampling, though other random sampling algorithms may be used. For illustration, the sample size N_s may be selected to be any value greater than a number of the plurality of variables of training dataset 124 that define x_i indicated in operation 402.
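For illustration only, a minimal Python sketch of drawing such a sample follows. The function name draw_sample is hypothetical, and without-replacement sampling is an assumption; the operation above specifies only uniform random sampling.

```python
import numpy as np

def draw_sample(training_data, n_s, rng=None):
    # Uniform random sampling of n_s observation vectors (rows) from the
    # training dataset; sampling without replacement is assumed here.
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(training_data.shape[0], size=n_s, replace=False)
    return training_data[idx]
```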
In an operation 406, a fourth indicator of a kernel function to apply may be received. For example, the fourth indicator indicates a name of a kernel function. The fourth indicator may be received by training application 122 after selection from a user interface window or after entry by a user into a user interface window. A default value for the kernel function may further be stored, for example, in computer-readable medium 108. As an example, a kernel function may be selected from "Gaussian", "Exponential", etc. For example, a default kernel function may be the Gaussian kernel function, though any positive definite kernel function could be used. Of course, the kernel function may be labeled or selected in a variety of different manners by the user as understood by a person of skill in the art. In an alternative embodiment, the kernel function may not be selectable, and a single kernel function is implemented in training application 122. For example, the Gaussian kernel function may be used by default or without allowing a selection.
In an operation 408, a fifth indicator of a kernel parameter value to use with the kernel function may be received. For example, a value for s, the Gaussian bandwidth parameter, may be received for the Gaussian kernel function. In an alternative embodiment, the fifth indicator may not be received. For example, a default value for the kernel parameter value may be stored, for example, in computer-readable medium 108 and used automatically, or the kernel parameter value may not be used. In another alternative embodiment, the value of the kernel parameter may not be selectable. Instead, a fixed, predefined value may be used.
In an operation 410, a sixth indicator of a value of the expected outlier fraction f may be received. In an alternative embodiment, the sixth indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically. In another alternative embodiment, the value of the expected outlier fraction f may not be selectable. Instead, a fixed, predefined value may be used.
In an operation 412, a value of the penalty constant C = 1/(nf) may be computed from n and f.
In an operation 414, a seventh indicator of a value of a maximum number of iterations M may be received. In an alternative embodiment, the seventh indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically, or the maximum number of iterations M may not be used. In another alternative embodiment, the value of the maximum number of iterations M may not be selectable. Instead, a fixed, predefined value may be used. The maximum number of iterations M may be identified as a first stop criterion. The maximum number of iterations M may be selected to stop execution when convergence is not being reached. Merely for illustration, the maximum number of iterations M may be set between 10 and 1000, though the user may determine that other values are more suitable for their application as understood by a person of skill in the art, for example, based on the accuracy desired, computing resources available, etc.
In an operation 416, an eighth indicator of a convergence test may be received. For example, the eighth indicator indicates a name of a convergence test. The eighth indicator may be received by training application 122 after selection from a user interface window or after entry by a user into a user interface window. A default value for the convergence test may further be stored, for example, in computer-readable medium 108. As an example, a convergence test may be selected from "Max Iterations", "R² only", "a only", "R² and a", etc. For example, a default convergence test may be "R² and a" as discussed further below. Of course, the convergence test may be labeled or selected in a variety of different manners by the user as understood by a person of skill in the art. In an alternative embodiment, the convergence test may not be selectable, and a single convergence test is implemented by training application 122. For example, the convergence test "R² and a" as discussed further below may be used by default or without allowing a selection.
In an operation 418, a ninth indicator of a value of a distance tolerance value ε_R may be received if the convergence test selected includes an evaluation of changes in the value of the threshold R² from iteration to iteration. In an alternative embodiment, the ninth indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically, or the distance tolerance value ε_R may not be used. In another alternative embodiment, the value of the distance tolerance parameter may not be selectable. Instead, a fixed, predefined value may be used. The distance tolerance parameter ε_R may be identified as a second stop criterion.
In an operation 420, a tenth indicator of a value of a center tolerance value ε_a may be received if the convergence test selected includes an evaluation of changes in a center a from iteration to iteration. In an alternative embodiment, the tenth indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically, or the center tolerance parameter ε_a may not be used. In another alternative embodiment, the value of the center tolerance parameter may not be selectable. Instead, a fixed, predefined value may be used. The center tolerance parameter ε_a may be identified as a third stop criterion. Values for the tolerance parameters ε_R and/or ε_a may be selected to achieve a representational quality of training dataset 124 by SVDD 126.
In an operation 422, an eleventh indicator of a value of a number of consecutive iterations for convergence to be complete t may be received. In an alternative embodiment, the eleventh indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically, or the number of consecutive iterations for convergence to be complete t may not be used. In another alternative embodiment, the value of the number of consecutive iterations for convergence to be complete may not be selectable. Instead, a fixed, predefined value may be used. Use of the number of consecutive iterations for convergence to be complete avoids convergence to a local extremum by requiring that the stop criterion be satisfied for a consecutive number of iterations. Merely for illustration, the number of consecutive iterations for convergence to be complete t may be set between 1 and 10, though the user may determine that other values are more suitable for their application. The number of consecutive iterations for convergence to be complete t set to a value of 5 has been used to achieve quality results for SVDD 126.
In an operation 424, a twelfth indicator of a value of a number of sample computations per iteration q may be received. In an alternative embodiment, the twelfth indicator may not be received. For example, a default value may be stored, for example, in computer-readable medium 108 and used automatically, or the number of sample computations per iteration q may not be used. In another alternative embodiment, the value of the number of sample computations per iteration may not be selectable. Instead, a fixed, predefined value may be used. Merely for illustration, the number of sample computations per iteration q may be set between 1 and 5, though the user may determine that other values are more suitable for their application. The number of sample computations per iteration q set to a value of 2 or 3 has been used to achieve quality results for SVDD 126.
In an operation 426, a first set of observation vectors x_i is randomly selected from training dataset 124, for example, using uniform random sampling to select the sample size N_s number of observations. Each observation vector x_i includes values for each of the plurality of variables indicated in operation 402.
In an operation 428, an optimal value for the objective function is computed by optimizing the objective function using the kernel function defined based on the fourth indicator and the selected first set of observation vectors x_i. For example, equations (10)-(13) above are used to solve for SV, a first set of support vectors that have 0 < α_i ≤ C. As part of solving for the optimal solution, values for the computed penalty constant C and/or the kernel parameter value may be used as indicated above. Values for the Lagrange constants α_i for each support vector of the first set of support vectors, for R², and for the center position a are computed as part of the optimal solution. Only the SV_{<C} are needed for the computation of R², and only the SV are needed for the computation of a, which avoids an additional read of training dataset 124, thereby improving performance.
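For illustration only, a minimal Python sketch of one such solve on a small sample follows, using a general-purpose constrained optimizer; the operation above does not prescribe a particular quadratic programming solver. The function name solve_svdd, the numerical tolerances, and the Gaussian kernel parameterization are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def solve_svdd(sample, C, s):
    # Solve the dual of equations (10)-(12) on a small sample:
    #   maximize sum_i alpha_i K(x_i, x_i) - sum_ij alpha_i alpha_j K(x_i, x_j)
    #   subject to sum_i alpha_i = 1 and 0 <= alpha_i <= C.
    n = sample.shape[0]
    sq_dists = np.sum((sample[:, None, :] - sample[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * s ** 2))  # assumed Gaussian kernel parameterization

    def neg_objective(alpha):
        return -(alpha @ np.diag(K) - alpha @ K @ alpha)

    res = minimize(neg_objective, np.full(n, 1.0 / n), method="SLSQP",
                   bounds=[(0.0, C)] * n,
                   constraints=({"type": "eq", "fun": lambda a: np.sum(a) - 1.0},))
    alpha = res.x

    sv_mask = alpha > 1e-8                   # SV: 0 < alpha_i <= C
    boundary = sv_mask & (alpha < C - 1e-8)  # SV_<C: 0 < alpha_i < C
    k_idx = np.flatnonzero(boundary)
    k = k_idx[0] if k_idx.size else np.flatnonzero(sv_mask)[0]
    a_sv = alpha[sv_mask]
    W = a_sv @ K[np.ix_(sv_mask, sv_mask)] @ a_sv
    r2 = K[k, k] - 2.0 * a_sv @ K[sv_mask, k] + W  # equation (13)
    return sample[sv_mask], a_sv, r2
```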
In an operation 432, iteration counter values i and k may be initialized. For example, i may be initialized to one, and k may be initialized to zero. i may be identified as an iteration counter, and k may be identified as a consecutive convergence counter.
Referring to FIG. 4B, in an operation 434, a determination is made concerning whether or not the number of sample computations per iteration q > 1. In an alternative embodiment, the number of sample computations per iteration q may not be used, in which case operations 434, 436, and 444-458 are not implemented by training application 122. As another option, the sample computations per iteration q may be implemented by training application 122, but not selected for use by the user by setting q ≤ 1 to skip operations 434, 436, and 444-458. When q > 1, processing continues in an operation 436. When q ≤ 1, processing continues in an operation 438.
In operation 436, a sampling iteration counter value j may be initialized, and processing continues in an operation 444. For example, j may be initialized to one.
In an operation 438, a second set of observation vectors x_i is randomly selected from training dataset 124, for example, using uniform random sampling to select the sample size N_s number of observations a next time. Each second observation vector x_i includes values for each of the plurality of variables indicated in operation 402.
In an operation 440, an optimal value for the objective function is computed by optimizing the objective function using the kernel function defined based on the fourth indicator and the selected second set of observation vectors x_i. For example, equations (10)-(13) above are used to solve for SV, a second set of support vectors that have 0 < α_i ≤ C, along with values for the Lagrange constants α_i for each support vector of the second set of support vectors, for R², and for the center position a.
In an operation 442, the first set of support vectors is updated to include the second set of support vectors SV computed in operation 440, and processing continues in an operation 460 shown referring to FIG. 4C.
In operation 444, the second set of observation vectors x_i is randomly selected from training dataset 124, for example, using uniform random sampling to select the sample size N_s number of observations a next time. Each second observation vector x_i includes values for each of the plurality of variables indicated in operation 402.
In an operation 446, an optimal value for the objective function is computed by optimizing the objective function using the kernel function defined based on the fourth indicator and the selected second set of observation vectors x_i. For example, equations (10)-(13) above are used to solve for SV, a second set of support vectors that have 0 < α_i ≤ C, along with values for the Lagrange constants α_i for each support vector of the second set of support vectors, for R², and for the center position a.
In an operation 448, a determination is made concerning whether or not the sampling iteration counter value j = 1. When j = 1, processing continues in an operation 450. When j ≠ 1, processing continues in an operation 452.
In operation 450, a set of iteration support vectors is initialized with the second set of support vectors SV computed in operation 446, and processing continues in an operation 454.
In operation 452, the set of iteration support vectors is updated to include the second set of support vectors SV computed in operation 446, and processing continues in operation 454.
In operation 454, a determination is made concerning whether or not the number of sample computations per iteration q has been performed by comparing the sampling iteration counter value j to the number of sample computations per iteration q. When j ≤ q, processing continues in an operation 456. When j > q, processing continues in an operation 458.
In operation 456, the sampling iteration counter value j is incremented by adding one to the current value, and processing continues in operation 444 to process a next sampled set of observation vectors to supplement the set of iteration support vectors.
In operation 458, the first set of support vectors is updated to include the set of iteration support vectors, and processing continues in operation 460 shown referring to FIG. 4C.
Referring to FIG. 4C, in operation 460, an optimal value for the objective function is computed by optimizing the objective function using the kernel function defined based on the fourth indicator and the updated first set of support vectors. For example, equations (10)-(13) above are used to solve for SV, a third set of support vectors that have 0 < α_i ≤ C, along with values for the Lagrange constants α_i for each support vector of the third set of support vectors, for R², and for the center position a. The penalty constant C = 1/(nf) may be computed for n equal to a number of vectors of the updated first set of support vectors.
In an operation 462, a determination is made concerning whether or not i ≥ M. In an alternative embodiment, the maximum number of iterations M may not be used, in which case operation 462 is not implemented by training application 122. When i ≥ M, processing continues in an operation 464. When i < M, processing continues in an operation 466. i ≥ M is a first stop condition.
In operation 464, the third set of support vectors, the Lagrange constants α_i for each of the third set of support vectors, the center position a, and/or R² computed from the third set of support vectors are stored as SVDD 126, and processing is complete for defining SVDD 126. Any other constants associated with the third set of support vectors may be stored. For example, K(z, z) = 1 may be stored when the Gaussian kernel function is used, and/or W = Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j K(x_i, x_j) may be stored for use in computing dist²(z) when scoring is performed as discussed further below.
In operation 466, one or more convergence parameters may be computed as additional stop conditions dependent on the convergence test indicated in operation 416. For example, when "Max Iterations" is indicated, none of operations 466-476 may be performed, and no convergence parameters are computed. When "R² only" is indicated, operation 470 may be skipped, and only an R² convergence parameter is computed. When "a only" is indicated, operation 468 may be skipped, and only an a convergence parameter is computed. When "R² and a" is indicated, the R² and a convergence parameters are both computed.
The R² convergence parameter may be computed as cp_R = |R_j² − R_{j−1}²| / R_{j−1}², where R_j² is the threshold computed using the third set of support vectors that have 0 < α_i ≤ C computed in operation 460, and R_{j−1}² is the threshold computed using the first set of support vectors that have 0 < α_i ≤ C. Prior to computing cp_R, a value of R_{j−1}² may be tested to determine if the value is zero. If so, cp_R may be set to a very large value.
The a convergence parameter may be computed as cp_a = ‖a_j − a_{j−1}‖ / ‖a_{j−1}‖, where a_j = Σ_{i=1}^{N_SV} α_i x_i is computed using the third set of support vectors as the x_i, and a_{j−1} = Σ_{i=1}^{N_SV} α_i x_i is computed using the first set of support vectors as the x_i. Prior to computing cp_a, a value of ‖a_{j−1}‖ may be tested to determine if the value is zero. If so, cp_a may be set to a very large value.
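For illustration only, a minimal Python sketch of these two convergence parameters follows. The relative-change form of cp_R is an assumption consistent with the description above, and the function name convergence_parameters is hypothetical.

```python
import numpy as np

def convergence_parameters(sv_new, alpha_new, r2_new, sv_old, alpha_old, r2_old):
    # cp_R: relative change in the threshold R^2 between successive iterations.
    # cp_a: relative change in the center a = sum_i alpha_i x_i.
    cp_r = abs(r2_new - r2_old) / r2_old if r2_old != 0.0 else np.inf
    a_new = alpha_new @ sv_new
    a_old = alpha_old @ sv_old
    norm_old = np.linalg.norm(a_old)
    cp_a = np.linalg.norm(a_new - a_old) / norm_old if norm_old != 0.0 else np.inf
    return cp_r, cp_a
```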
In an operation 468, a determination is made concerning whether or not cp_R ≤ ε_R. cp_R ≤ ε_R is a second stop condition. When cp_R ≤ ε_R, processing continues in an operation 470. When cp_R > ε_R, processing continues in an operation 472.
In operation 470, a determination is made concerning whether or not cp_a ≤ ε_a. cp_a ≤ ε_a is a third stop condition. When cp_a ≤ ε_a, processing continues in an operation 474. When cp_a > ε_a, processing continues in operation 472.
In operation 472, the consecutive convergence counter k is reset to zero to indicate that convergence has not occurred, and processing continues in an operation 478.
In operation 474, the consecutive convergence counter k is incremented by adding one to the current value, and processing continues in an operation 476.
In operation 476, a determination is made concerning whether or not k ≥ t. When k ≥ t, processing continues in operation 464 to complete processing because convergence has occurred for the indicated number of consecutive iterations. When k < t, processing continues in operation 478. k ≥ t is a fourth stop condition.
In operation 478, the first set of support vectors is replaced with the third set of support vectors computed in operation 460.
In an operation 482, the iteration counter i is incremented by adding one to the current value, and processing continues in operation 434 shown referring to FIG. 4B to perform another iteration.
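For illustration only, a minimal Python sketch of the overall iterative flow of FIGS. 4A, 4B, and 4C for the simple case q = 1 follows, built on the draw_sample, solve_svdd, and convergence_parameters sketches above. It is a simplified, hypothetical driver: the penalty constant C is held fixed rather than recomputed from the size of the updated first set, and duplicate support vectors in the union are not removed.

```python
import numpy as np

def sampling_svdd(training_data, n_s, C, s, eps_r, eps_a, max_iter=1000, t=5, rng=None):
    # Iteratively solve small SVDD problems on random samples and on the union
    # of the current support vectors with the new sample's support vectors,
    # stopping after t consecutive iterations satisfy both tolerances.
    rng = np.random.default_rng() if rng is None else rng
    sv, alpha, r2 = solve_svdd(draw_sample(training_data, n_s, rng), C, s)
    consecutive = 0
    for _ in range(max_iter):
        sv2, _, _ = solve_svdd(draw_sample(training_data, n_s, rng), C, s)
        sv3, alpha3, r2_3 = solve_svdd(np.vstack([sv, sv2]), C, s)
        cp_r, cp_a = convergence_parameters(sv3, alpha3, r2_3, sv, alpha, r2)
        consecutive = consecutive + 1 if (cp_r <= eps_r and cp_a <= eps_a) else 0
        sv, alpha, r2 = sv3, alpha3, r2_3
        if consecutive >= t:
            break
    return sv, alpha, r2
```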
Referring to FIG. 5, a first example dataset 500 including a first dimension (variable) x1 and a second dimension (variable) x2 having a banana shape is shown for training dataset 124 in accordance with an illustrative embodiment. First example dataset 500 included 11,016 observations.
FIGS. 6-12 show the development of a final solution for first example dataset 500 using training application 122 and the operations of FIGS. 4A, 4B, and 4C with sample size N_s = 4. The Gaussian kernel function was used with a value of s = 50. The remaining parameters were: ε_a = ε_R = 1e−5, M = 1000, f = 0.0001, q = 1, and t = 10. For example, FIG. 6 shows a plurality of support vectors 600 that are the third set of support vectors computed in operation 460 for i = 1. The plurality of support vectors 600 are indicated by black dots, and the remaining observations of first example dataset 500 are shown with gray dots. The plurality of support vectors 600 includes six support vectors.
FIG. 7 shows a plurality of support vectors 700 that are the third set of support vectors computed in operation 460 for i = 2. The plurality of support vectors 700 are again indicated by black dots. The plurality of support vectors 700 includes nine support vectors.
FIG. 8 shows a plurality of support vectors 800 that are the third set of support vectors computed in operation 460 for i = 10. The plurality of support vectors 800 are again indicated by black dots. The plurality of support vectors 800 includes thirteen support vectors.
FIG. 9 shows a plurality of support vectors 900 that are the third set of support vectors computed in operation 460 for i = 30. The plurality of support vectors 900 are again indicated by black dots. The plurality of support vectors 900 includes nineteen support vectors.
FIG. 10 shows a plurality of support vectors 1000 that are the third set of support vectors computed in operation 460 for i = 70. The plurality of support vectors 1000 are again indicated by black dots. The plurality of support vectors 1000 includes nineteen support vectors.
FIG. 11 shows a plurality of support vectors 1100 that are the third set of support vectors computed in operation 460 for i = 140. The plurality of support vectors 1100 are again indicated by black dots. The plurality of support vectors 1100 includes nineteen support vectors.
FIG. 12 shows a plurality of support vectors 1200 that are the third set of support vectors computed in operation 460 for i = 170. The plurality of support vectors 1200 are again indicated by black dots. The plurality of support vectors 1200 includes nineteen support vectors.
As illustrated by FIGS. 6-12, at lower iteration numbers, the plurality of support vectors were in the interior of first example dataset 500. As the number of iterations increased, the operations of training application 122 moved the plurality of support vectors toward the data description. At and near convergence, the plurality of support vectors were primarily along the data description.
Referring to FIG. 13, an R² curve 1300 shows a variation in R² computed using the third set of support vectors computed in operation 460 in successive iterations of operation 460 from i = 1 to i = 120.
For comparison, an SVDD′ was computed using all of the observations in first example dataset 500. After solving for the optimal solution using all of the observations in first example dataset 500, SVDD′ included 21 support vectors, resulted in R² = 0.8789, and required 1.98 seconds of computing time. In comparison, SVDD 126 included 19 support vectors, resulted in R² = 0.872, and required only 0.32 seconds of computing time for N_s = 6.
Referring to FIG. 14, a run time curve 1400 and a number of iterations curve 1402 are shown as a function of sample size N_s, which ranged from 3 to 20. Run time curve 1400 and number of iterations curve 1402 are b-spline curve fits to the data points for each sample size N_s. For example, run time curve data points are shown with open circles, and number of iterations curve data points are shown with an "x". A minimum run time data point 1404 indicates a minimum run time occurred for N_s = 6.
Referring to FIG. 15, a second example dataset 1500 including a first dimension (variable) x1 and a second dimension (variable) x2 having a star shape is shown for training dataset 124 in accordance with an illustrative embodiment. Second example dataset 1500 included 64,000 observations. The Gaussian kernel function was used with a value of s = 52. The remaining parameters were: ε_a = ε_R = 1e−5, M = 1000, f = 0.0001, q = 1, and t = 10.
Referring to FIG. 16, a run time curve 1600 and a number of iterations curve 1602 are shown as a function of sample size N_s, which ranged from 3 to 20. Run time curve 1600 and number of iterations curve 1602 are b-spline curve fits to the data points for each sample size N_s. For example, run time curve data points are shown with open circles, and number of iterations curve data points are shown with an "x". A minimum run time data point 1604 indicates a minimum run time occurred for N_s = 11.
For comparison, an SVDD′ was computed using all of the observations in second example dataset 1500. SVDD′ included 76 support vectors, resulted in R² = 0.9362, and required 11.55 seconds of computing time. In comparison, SVDD 126 computed using training application 122 with N_s = 11 included 44 support vectors, resulted in R² = 0.932, and required 0.28 seconds of computing time.
Referring to FIG. 17, a third example dataset 1700 including a first dimension (variable) x1 and a second dimension (variable) x2 having a two-donut shape is shown for training dataset 124 in accordance with an illustrative embodiment. Third example dataset 1700 included 1,333,334 observations. The Gaussian kernel function was used with a value of s = 1.5. The remaining parameters were: ε_a = ε_R = 1e−5, M = 1000, f = 0.0001, q = 1, and t = 10.
Referring to FIG. 18, a run time curve 1800 and a number of iterations curve 1802 are shown as a function of sample size N_s, which ranged from 3 to 20. Run time curve 1800 and number of iterations curve 1802 are b-spline curve fits to the data points for each sample size N_s. For example, run time curve data points are shown with open circles, and the number of iterations curve data points are shown with an "x". A minimum run time data point 1804 indicates a minimum run time occurred for N_s = 11.
For comparison, an SVDD′ was computed using all of the observations in third example dataset 1700. SVDD′ included 178 support vectors, resulted in R² = 0.8982, and required 32 minutes of computing time. In comparison, SVDD 126 computed using training application 122 with N_s = 11 included 37 support vectors, resulted in R² = 0.897, and required 0.29 seconds of computing time.
Referring to FIG. 19, a processing time curve 1900 shows a processing (run) time as a function of a number of observations in training dataset 124 selected from third example dataset 1700 to compute SVDD′. The processing time has an exponential shape as a function of the number of observations used to compute SVDD′. The training time for SVDD′ is low for small or moderately sized training datasets of up to 15,000 observations, but is prohibitively high for large datasets.
For extremely large training datasets, efficiency gains can be realized using a distributed implementation. Referring to FIG. 20, a block diagram of an SVDD training system 2000 is shown in accordance with an illustrative embodiment. In an illustrative embodiment, SVDD training system 2000 may include a controller device 2002, one or more worker devices 2004, and a network 2006. For illustration, the one or more worker devices 2004 may include a first worker device 2004a, a second worker device 2004b, . . . , and an nth worker device 2004n. Controller device 2002 and the one or more worker devices 2004 are in communication through network 2006.
Network 2006 may include one or more networks of the same or different types. Network 2006 can be any type of wired and/or wireless public or private network including a cellular network, a local area network, a wide area network such as the Internet or the World Wide Web, etc. Network 2006 further may comprise sub-networks and consist of any number of communication devices.
Controller device 2002 and the one or more worker devices 2004 may include computers of any form factor such as a server computer, a desktop, a smart phone, a laptop, a personal digital assistant, an integrated messaging device, a tablet computer, etc. For illustration, controller device 2002 and the one or more worker devices 2004 are each an instance of SVDD training device 100. Training dataset 124 with M_s observations is distributed across p worker devices 2004. Each worker device computes SVDD 126 for its M_s/p observations using training application 122 to determine its own plurality of support vectors SV*_i, where i indicates the worker device. Once the SVDD computations are completed, each worker device sends its plurality of support vectors SV*_i to controller device 2002. Controller device 2002 forms a union of all of the worker device support vectors SV*_i as S′ = ∪_{i=1}^{p} SV*_i to create a dataset S′. An optimal value for the objective function is computed by optimizing the objective function using the kernel function defined based on the fourth indicator and the dataset S′. For example, equations (10)-(13) above are used to solve for SV, a final plurality of support vectors that have 0 < α_i ≤ C, along with values for the Lagrange constants α_i for each support vector of the final plurality of support vectors, the center position a, and R². The final plurality of support vectors computed by controller device 2002, along with values for the Lagrange constants α_i for each support vector of the final plurality of support vectors, the center position a, and R², may be stored as SVDD 126.
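For illustration only, a minimal Python sketch of this controller/worker scheme follows, reusing the sampling_svdd and solve_svdd sketches above. The function name distributed_svdd is hypothetical, the partitions are assumed to be in-memory arrays, and communication between devices is not modeled.

```python
import numpy as np

def distributed_svdd(partitions, n_s, C, s, eps_r, eps_a):
    # Each worker runs the sampling method on its partition of the training
    # dataset; the controller unions the workers' support vectors and solves
    # one final SVDD on that union.
    worker_svs = [sampling_svdd(part, n_s, C, s, eps_r, eps_a)[0] for part in partitions]
    return solve_svdd(np.vstack(worker_svs), C, s)  # final SV, alphas, R^2
```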
Referring to FIG. 21, a block diagram of an outlier identification device 2100 is shown in accordance with an illustrative embodiment. Outlier identification device 2100 may include a second input interface 2102, a second output interface 2104, a second communication interface 2106, a second non-transitory computer-readable medium 2108, a second processor 2110, an outlier identification application 2122, SVDD 126, a scoring dataset 2124, and an outlier dataset 2126. Fewer, different, and/or additional components may be incorporated into outlier identification device 2100. Outlier identification device 2100 and SVDD training device 100 may be the same or different devices.
Second input interface2102 provides the same or similar functionality as that described with reference toinput interface102 ofSVDD training device100 though referring tooutlier identification device2100.Second output interface2104 provides the same or similar functionality as that described with reference tooutput interface104 ofSVDD training device100 though referring tooutlier identification device2100.Second communication interface2106 provides the same or similar functionality as that described with reference tocommunication interface106 ofSVDD training device100 though referring tooutlier identification device2100. Data and messages may be transferred betweenoutlier identification device2100 and a distributedcomputing system2128 usingsecond communication interface2106. Second computer-readable medium2108 provides the same or similar functionality as that described with reference to computer-readable medium108 ofSVDD training device100 though referring tooutlier identification device2100.Second processor2110 provides the same or similar functionality as that described with reference toprocessor110 ofSVDD training device100 though referring tooutlier identification device2100.
Outlier identification application 2122 performs operations associated with creating outlier dataset 2126 from data stored in scoring dataset 2124 using SVDD 126. SVDD 126 may be used to classify data stored in scoring dataset 2124 and to identify outliers in scoring dataset 2124 that are stored in outlier dataset 2126 to support various data analysis functions as well as provide alert/messaging related to the identified outliers stored in outlier dataset 2126. Dependent on the type of data stored in training dataset 124 and scoring dataset 2124, outlier dataset 2126 may identify anomalies as part of process control, for example, of a manufacturing process, for machine condition monitoring, for example, of an electro-cardiogram device, for image classification, for intrusion detection, for fraud detection, etc. Some or all of the operations described herein may be embodied in outlier identification application 2122. The operations may be implemented using hardware, firmware, software, or any combination of these methods.
Referring to the example embodiment of FIG. 21, outlier identification application 2122 is implemented in software (comprised of computer-readable and/or computer-executable instructions) stored in second computer-readable medium 2108 and accessible by second processor 2110 for execution of the instructions that embody the operations of outlier identification application 2122. Outlier identification application 2122 may be written using one or more programming languages, assembly languages, scripting languages, etc. Outlier identification application 2122 may be integrated with other analytic tools. For example, outlier identification application 2122 may be part of SAS® Enterprise Miner™ developed and provided by SAS Institute Inc. of Cary, N.C., which may be used to create highly accurate predictive and descriptive models based on analysis of vast amounts of data from across an enterprise. Data mining is applicable in a variety of industries.
Outlier identification application 2122 may be implemented as a Web application. Outlier identification application 2122 may be integrated with other system processing tools to automatically process data generated as part of operation of an enterprise, to identify any outliers in the processed data, and to provide a warning or alert associated with the outlier identification using second input interface 2102, second output interface 2104, and/or second communication interface 2106 so that appropriate action can be initiated in response to the outlier identification. Outlier identification application 2122 and training application 122 further may be integrated applications.
Training dataset 124 and scoring dataset 2124 may be generated, stored, and accessed using the same or different mechanisms. Similar to training dataset 124, scoring dataset 2124 may include a plurality of rows and a plurality of columns, with the plurality of rows referred to as observations or records and the columns referred to as variables that are associated with an observation. Scoring dataset 2124 may be transposed.
Similar to training dataset 124, scoring dataset 2124 may be stored on second computer-readable medium 2108 or on one or more computer-readable media of distributed computing system 2128 and accessed by outlier identification device 2100 using second communication interface 2106. Data stored in scoring dataset 2124 may be a sensor measurement or a data communication value, may be generated or captured in response to occurrence of an event or a transaction, may be generated by a device such as in response to an interaction by a user with the device, etc. The data stored in scoring dataset 2124 may include any type of content represented in any computer-readable format such as binary, alphanumeric, numeric, string, markup language, etc. The content may include textual information, graphical information, image information, audio information, numeric information, etc. that further may be encoded using various encoding techniques as understood by a person of skill in the art. The data stored in scoring dataset 2124 may be captured at different time points periodically, intermittently, when an event occurs, etc. One or more columns may include a time value. Similar to training dataset 124, data stored in scoring dataset 2124 may be generated as part of the IoT, and some or all data may be processed with an ESPE.
Similar to training dataset 124, scoring dataset 2124 may be stored in various compressed formats such as a coordinate format, a compressed sparse column format, a compressed sparse row format, etc. Scoring dataset 2124 further may be stored using various structures as known to those skilled in the art including a file system, a relational database, a system of tables, a structured query language database, etc. on SVDD training device 100, on outlier identification device 2100, and/or on distributed computing system 2128. Outlier identification device 2100 and/or distributed computing system 2128 may coordinate access to scoring dataset 2124 that is distributed across a plurality of computing devices. For example, scoring dataset 2124 may be stored in a cube distributed across a grid of computers as understood by a person of skill in the art. As another example, scoring dataset 2124 may be stored in a multi-node Hadoop® cluster. For instance, Apache™ Hadoop® is an open-source software framework for distributed computing supported by the Apache Software Foundation. As another example, scoring dataset 2124 may be stored in a cloud of computers and accessed using cloud computing technologies, as understood by a person of skill in the art. The SAS® LASR™ Analytic Server developed and provided by SAS Institute Inc. of Cary, N.C. may be used as an analytic platform to enable multiple users to concurrently access data stored in scoring dataset 2124.
Referring to FIG. 22, example operations of outlier identification application 2122 to use SVDD 126 to classify scoring dataset 2124 and create outlier dataset 2126 are described. The operations of FIGS. 4A, 4B, 4C, and 22 may be distributed between one or more applications that are integrated or that are independent.
In an operation 2200, a thirteenth indicator is received that indicates scoring dataset 2124. For example, the thirteenth indicator indicates a location and a name of scoring dataset 2124. As an example, the thirteenth indicator may be received by outlier identification application 2122 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, scoring dataset 2124 may not be selectable. For example, a most recently created dataset may be used automatically.
In an operation 2202, a fourteenth indicator is received that indicates SVDD 126. For example, the fourteenth indicator indicates a location and a name of SVDD 126. As an example, the fourteenth indicator may be received by outlier identification application 2122 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, SVDD 126 may not be selectable. For example, a default name and location for SVDD 126 may be used automatically.
In an operation 2204, a fifteenth indicator may be received that indicates a plurality of variables of scoring dataset 2124 to define observation vector z. The same set of the plurality of variables selected in operation 402 to define SVDD 126 should be selected. The fifteenth indicator may indicate that all or only a subset of the variables stored in scoring dataset 2124 be used to define observation vector z. For example, the fifteenth indicator indicates a list of variables to use by name, column number, etc. In an alternative embodiment, the fifteenth indicator may not be received. For example, all of the variables may be used automatically.
Similar to operations 406 and 408, in an operation 2206, a sixteenth indicator of a kernel function and any kernel parameter value to apply may be received. The same kernel function and any kernel parameter value selected in operations 406 and 408 to define SVDD 126 should be selected. For example, the sixteenth indicator indicates a name of a kernel function. The sixteenth indicator may be received by outlier identification application 2122 after selection from a user interface window or after entry by a user into a user interface window. A default value for the kernel function may further be stored, for example, in second computer-readable medium 2108.
In an operation 2208, a seventeenth indicator is received that indicates outlier dataset 2126. For example, the seventeenth indicator indicates a location and a name of outlier dataset 2126. As an example, the seventeenth indicator may be received by outlier identification application 2122 after selection from a user interface window or after entry by a user into a user interface window. In an alternative embodiment, outlier dataset 2126 may not be selectable. For example, a default name and location for outlier dataset 2126 may be used automatically.
In an operation 2210, a first observation is selected as observation vector z from scoring dataset 2124.
In an operation 2212, a distance value for observation vector z is computed using SVDD 126 as dist²(z) = K(z, z) − 2 Σ_{i=1}^{N_SV} α_i K(x_i, z) + Σ_{i=1}^{N_SV} Σ_{j=1}^{N_SV} α_i α_j K(x_i, x_j). As discussed previously, some of the values may be constant and may have been saved with SVDD 126.
In an operation 2214, a determination is made concerning whether or not dist²(z) > R², where R² may have been saved with SVDD 126. When dist²(z) > R², processing continues in an operation 2216. When dist²(z) ≤ R², processing continues in an operation 2218.
In operation 2216, observation vector z and/or an indicator of observation vector z is stored to outlier dataset 2126, and processing continues in operation 2218.
In operation 2218, a determination is made concerning whether or not scoring dataset 2124 includes another observation. When scoring dataset 2124 includes another observation, processing continues in an operation 2220. When scoring dataset 2124 does not include another observation, processing continues in an operation 2222.
In operation 2220, a next observation is selected as observation vector z from scoring dataset 2124, and processing continues in operation 2212 to determine if the next observation is an outlier.
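For illustration only, a minimal sketch of operations 2210 through 2220 is shown below, assuming a Gaussian kernel with bandwidth s and assuming that the support vectors x_i, the Lagrange constants α_i, and R² were saved with SVDD 126; the helper names and the choice of kernel are illustrative assumptions rather than a definitive implementation.

```python
import numpy as np

def gaussian_kernel(x, y, s):
    # K(x, y) = exp(-||x - y||^2 / (2 s^2)); s is the kernel parameter
    # value indicated by the sixteenth indicator.
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * s ** 2))

def score(scoring_data, sv, alpha, R2, s):
    """Flag observations of scoring dataset 2124 whose dist^2(z) exceeds R^2."""
    n = len(sv)
    # The third term of dist^2(z) does not depend on z and could equally
    # have been saved with SVDD 126 as a constant.
    const = sum(alpha[i] * alpha[j] * gaussian_kernel(sv[i], sv[j], s)
                for i in range(n) for j in range(n))
    outliers = []
    for z in scoring_data:                               # operations 2210, 2220
        dist2 = (gaussian_kernel(z, z, s)
                 - 2.0 * sum(alpha[i] * gaussian_kernel(sv[i], z, s)
                             for i in range(n))
                 + const)                                # operation 2212
        if dist2 > R2:                                   # operation 2214
            outliers.append(z)                           # operation 2216
    return np.array(outliers)   # contents written to outlier dataset 2126
```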
In operation 2222, scoring results are output. For example, statistical results associated with the scoring may be stored on one or more devices and/or on second computer-readable medium 2108 in a variety of formats as understood by a person of skill in the art. Outlier dataset 2126 and/or the scoring results further may be output to a second display 2116, to a second printer 2120, etc. In an illustrative embodiment, when an outlier is identified, an alert message may be sent to another device using second communication interface 2106, printed on second printer 2120 or another printer, presented visually on second display 2116 or another display, and/or presented audibly using a second speaker 2118 or another speaker.
To confirm that the data description defined by SVDD 126 using the sampling method performed by training application 122 is similar to SVDD′ computed using the entirety of training dataset 124 to train in a single iteration, scoring was performed using a 200×200 data grid. FIG. 23 depicts scoring results using SVDD 126 computed using the operations of FIGS. 4A, 4B, 4C, and 22 with first example dataset 500 as training dataset 124. FIG. 24 depicts scoring results using SVDD′ computed by training using the entirety of first example dataset 500 as training dataset 124 in a single iteration.
FIG. 25 depicts scoring results using SVDD 126 computed using the operations of FIGS. 4A, 4B, 4C, and 22 with second example dataset 1500 as training dataset 124. FIG. 26 depicts scoring results using SVDD′ computed by training using the entirety of second example dataset 1500 as training dataset 124 in a single iteration.
FIG. 27 depicts scoring results using SVDD 126 computed using the operations of FIGS. 4A, 4B, 4C, and 22 with third example dataset 1700 as training dataset 124. FIG. 28 depicts scoring results using SVDD′ computed by training using the entirety of third example dataset 1700 as training dataset 124 in a single iteration. The scoring results are similar in each case and were achieved with approximately an order of magnitude reduction in computation time.
Training application 122 incrementally learns training dataset 124 at each iteration by computing SVDD 126 on an independent random sample selected with replacement from training dataset 124. The illustrative results show that training application 122 is substantially faster than training on the entire dataset in a single iteration while providing a nearly identical data description. Training application 122 can be implemented as wrapper code around a core module for SVDD training computations, either on a single machine or in a multi-machine distributed environment.
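For illustration only, one plausible way to structure such wrapper code is sketched below; the solve_svdd core module, the stop test based on the change in R², and the other helper names are illustrative assumptions and not a definitive description of training application 122.

```python
import numpy as np

def solve_svdd(X, C, kernel):
    """Hypothetical core SVDD module; returns (support_vectors, alphas, a, R2)."""
    raise NotImplementedError  # placeholder for the core training computations

def sampled_svdd(training_data, sample_size, C, kernel, tol=1e-4, max_iter=100):
    """Wrapper sketch: learn incrementally from random samples drawn with replacement."""
    rng = np.random.default_rng()

    # Initial data description from one random sample of the predefined size.
    idx = rng.choice(len(training_data), sample_size, replace=True)
    sv, alphas, a, R2 = solve_svdd(training_data[idx], C, kernel)

    for _ in range(max_iter):
        # Draw an independent random sample and compute its data description.
        idx = rng.choice(len(training_data), sample_size, replace=True)
        sv_new, _, _, _ = solve_svdd(training_data[idx], C, kernel)

        # Merge the new support vectors with the current ones and re-solve.
        merged = np.vstack([sv, sv_new])
        sv, alphas, a, R2_new = solve_svdd(merged, C, kernel)

        # Illustrative stop test: the data description has stabilized.
        if abs(R2_new - R2) < tol:
            return sv, alphas, a, R2_new
        R2 = R2_new
    return sv, alphas, a, R2
```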
There are applications for training application 122 in areas such as process control and equipment health monitoring where the size of training dataset 124 can be very large, consisting of a few million observations. Training dataset 124 may include sensor readings measuring multiple key health or process parameters at a very high frequency. For example, a typical airplane currently has 7,000 sensors measuring critical health parameters and creates 2.5 terabytes of data per day. By 2020, this number is expected to triple or quadruple to over 7.5 terabytes. In such applications, multiple SVDD training models may be developed, with each representing a different operating mode of the equipment or different process settings. A successful application of SVDD in these types of applications requires algorithms that can train using huge amounts of training data in an efficient manner.
The word "illustrative" is used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "illustrative" is not necessarily to be construed as preferred or advantageous over other aspects or designs. Further, for the purposes of this disclosure and unless otherwise specified, "a" or "an" means "one or more". Still further, using "and" or "or" in the detailed description is intended to include "and/or" unless specifically indicated otherwise.
The foregoing description of illustrative embodiments of the disclosed subject matter has been presented for purposes of illustration and of description. It is not intended to be exhaustive or to limit the disclosed subject matter to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the disclosed subject matter. The embodiments were chosen and described in order to explain the principles of the disclosed subject matter and as practical applications of the disclosed subject matter to enable one skilled in the art to utilize the disclosed subject matter in various embodiments and with various modifications as suited to the particular use contemplated.