Detailed Description
Exemplary embodiments of the present invention are described below with reference to the accompanying drawings, in which various details of embodiments of the invention are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the invention. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments of the present invention are applicable to any scenario with an audit (approval) action, for example document audit (approval), where the document is not limited to a service form. For convenience of explaining the whole operation process, the present invention is described by taking a service list as an example.
The terms to which the invention relates are to be construed as follows:
weka: open-source machine learning and data mining software running in a JAVA environment.
Binary (two-class) classification problem: for example, identifying whether or not a picture shows a cat. That is, a classifier is trained that takes a picture as input, represented by a feature vector x, and outputs whether a cat is present, represented by y being 0 or 1.
Principle of J48: based on a top-down, recursive divide-and-conquer strategy, an attribute is selected and placed at the root node, a branch is generated for each possible attribute value, the instances are divided into subsets, each subset corresponding to one branch of the root node, and the process is then repeated recursively on each branch. The recursion stops when all instances at a node have the same classification.
Enumerated types: in practical applications, some variables have only a few possible values. For example, a person's sex has only two possible values, and the day of the week has only seven. In the C language, such variables can be defined as enumerated types. Enumeration means listing the possible values of a variable one by one; the variable is then limited to values within the enumerated set.
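Although the description above refers to the C language, an analogous sketch in Python (using the standard `enum` module) illustrates restricting a variable to a fixed set of listed values; the class and member names are illustrative only:

```python
from enum import Enum

class Sex(Enum):
    MALE = 0
    FEMALE = 1

class Weekday(Enum):
    MONDAY = 1
    TUESDAY = 2
    WEDNESDAY = 3
    THURSDAY = 4
    FRIDAY = 5
    SATURDAY = 6
    SUNDAY = 7

# A variable of an enumerated type is restricted to the listed values.
today = Weekday.FRIDAY
print(today.name, today.value)  # FRIDAY 5
print(len(Weekday))             # 7
```

Attempting `Weekday(8)` raises a `ValueError`, which is exactly the "limited to values within the enumerated values" behavior described above.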
Referring to fig. 1, a main flowchart of a document auditing method based on data dimension provided by the embodiment of the present invention is shown, and includes the following steps:
S101: determining data dimensions, extracting the feature data corresponding to the data dimensions from the document, and generating a data set in combination with the auditing result of the document;
S102: training and testing a data model according to the data set to obtain a trained data model;
S103: receiving a document to be audited, extracting the feature data corresponding to the data dimensions from the document to be audited, and inputting the extracted feature data into the trained data model to obtain the auditing result of the document to be audited.
In the above embodiment, for step S101, feature selection preliminarily selects multiple dimensions according to service research; for example, up to 200 dimensions may be selected according to the actual service scenario.
The present invention uses historical data to train the data model; the trained data model then predicts the current result based on that historical data.
However, some data changes in real time, for example the level of a user who applied for a service order a year ago; or the service system may not have been fully constructed earlier, so that some feature data was never recorded and stored. Examples include the user level in the scoring system, the total after-sales rate of the user, and the corresponding total after-sales rate of SKU (Stock Keeping Unit) goods.
It should be noted that the real-time feature data here refers to data that may change in real time; it does not mean that current data is used to replace the data in the obtained document. Because the documents are historical documents, if part of the information in a document were changed, its auditing result might change, which would affect the training of the data model.
Considering these factors, the information in the document needs to be collected in a way that ensures the real-time validity and accuracy of the data. The collection modes can be as follows:
1) determining a data record during generation according to the generation time of the document and the document identification, and calling the real-time characteristic data from the data record;
2) JSF (Java Server Faces, a Java web framework) can be used for remote calling, or the data can be called from another department through an interface; if the data is stored locally, it can be acquired directly. The present invention is primarily directed to the first mode.
Since the historical documents have been completely processed, they all have audit results, and the number of distinct audit results may differ between business scenarios.
Based on the number of data dimensions SA and the number of audit results SR (e.g., audit success, audit failure), the number of samples actually required is determined as SA × SR × 1000 (or other values are also possible).
TABLE 1 sample data
| | Dimension 1 | Dimension 2 | Dimension 3 | Dimension 4 | Dimension 5 | Audit result |
| Document 1 | | | | | | |
| Document 2 | | | | | | |
Further, before inputting the data set into the data model, a preprocessing operation may be performed on the data, specifically:
1) nominalization (converting values to enumerated/nominal types): commodity type, brand, order type, etc.;
2) missing-value handling: for example, for commodity price, substitute the average value for missing values; for commodity class, substitute the most frequently occurring value;
3) discretization: for example, a continuous feature such as level is converted into a classification attribute. The level is a numeric attribute, and the continuous levels are divided into bins, for example a bin beginning at 200.
For the above data preprocessing, weka, tensorflow, the python language, the R language, etc. can be used as data analysis tools. Taking weka as an example, the preprocessing steps map to weka filters as follows: nominalization - StringToNominal / NumericToNominal, missing-value handling - ReplaceMissingValues, discretization - Discretize.
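As a hedged illustration of the three preprocessing steps above (nominalization, missing-value handling, discretization), the following Python sketch uses pandas in place of the weka filters; the column names, values, and bin boundaries are hypothetical:

```python
import pandas as pd

# Hypothetical sample documents; columns and values are illustrative only.
df = pd.DataFrame({
    "order_type":      ["self-run", "third-party", "self-run", "self-run"],
    "price":           [120.0, None, 80.0, 100.0],
    "commodity_class": ["books", "books", None, "toys"],
    "user_level":      [150, 260, 420, 610],
})

# 1) Nominalization: treat string-valued columns as enumerated categories.
df["order_type"] = df["order_type"].astype("category")

# 2) Missing-value handling: mean for numeric, most frequent for nominal.
df["price"] = df["price"].fillna(df["price"].mean())
df["commodity_class"] = df["commodity_class"].fillna(df["commodity_class"].mode()[0])

# 3) Discretization: bin the continuous user level into grade buckets.
df["user_grade"] = pd.cut(df["user_level"], bins=[0, 200, 400, 600, 800],
                          labels=["D", "C", "B", "A"])
print(df)
```

The same operations correspond one-to-one to the weka filters named above (StringToNominal/NumericToNominal, ReplaceMissingValues, Discretize).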
Furthermore, training the data model is an iterative process, so the data in the data set and the auditing results may be reused. To facilitate subsequent re-extraction of the data, it may be persisted, for example written to MySQL.
In addition, given the processing bottleneck or storage capacity limitation of the existing MySQL, the data can be migrated to a data mart during low-traffic periods to reduce pressure on the service system, for example migrating old data dating from 2012.
For step S102, the data set may be divided into a training set and a test set in a certain proportion (e.g., 60% to 80% for training), with the output values known. The training set is used to generate the model and the test set is used to validate it.
The prior art is generally concerned only with whether the model passes testing, judged against a predetermined expected value: if the expected value is exceeded, the model is considered available.
In the present invention, the accuracy of model training is additionally compared with the accuracy of model testing. Since an accuracy (the model accuracy) is also obtained when the data model is trained on the training set, a decrease in accuracy on the test set indicates overfitting (the model fits the training set too closely and does not necessarily generalize to other data).
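A minimal sketch of this train/test split and overfitting check, using scikit-learn and synthetic data as stand-ins for the audit feature set (the library, data, and 0.1 gap threshold are illustrative assumptions, not part of the source):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in data: in the described system, X would be the
# preprocessed feature dimensions and y the historical audit results.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=2,
                           random_state=0)

# Split roughly 70/30 into a training set and a test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)

# A large drop from training to test accuracy signals overfitting.
print(f"train={train_acc:.2f} test={test_acc:.2f}")
if train_acc - test_acc > 0.1:
    print("possible overfitting: consider pruning or adjusting dimensions")
```

An unpruned tree typically reaches perfect training accuracy, so the comparison against test accuracy is what reveals whether the model merely memorized the training set.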
After testing the data model on the test set, the obtained test results may include the audit result to which each test datum belongs. The auditing result can be the probability values of the test data belonging to the different categories, or the TOP1 category, set according to the actual situation.
It should be noted that, in practical applications, each class in a service scenario has a corresponding probability value, and if the predicted accuracy of any class is lower than the corresponding training accuracy, the model needs to be adjusted. In addition, the maximum probability value TOP1 in the obtained results may be extracted; if it is also lower than the corresponding training accuracy, the model needs to be adjusted.
The business scenarios here include, but are not limited to: audit closure, direct claims, door-to-door pickup, door-to-door replacement, customer delivery, and others; the current business customer-service audit results can be summarized into these six types. The different classification results represent the subsequent direction in which the business party processes the document.
For example, as shown in fig. 2, the overall accuracy of the evaluation result is about 70%, but the accuracy of the classes 0 and 10 is significantly lower than the overall accuracy, and model adjustment is required.
In addition, the auditing result of each classification can also be compared with the expectation for that classification; for example, the accuracy of each classification may be required to reach 80%, and a certain dimension can be adjusted while other dimensions are held constant, so that the algorithm or classifier is refined through dimension adjustment.
However, if the result does not meet the expected value, for example if the error rate is too high, the training dimensions in the model can be adjusted and the model retrained. It is also possible to re-enter the step of selecting dimension data according to data dimensions, because some data is not actually needed; referring specifically to fig. 3, the present invention shows the flow that re-enters the dimension selection step, as this is the most complete flow.
The expected value may be derived from the cost loss model and the customer satisfaction model; for example, with a cost loss rate of 0.1 and a customer satisfaction of 0.8, the final expected value is (1 - 0.1) × 0.8 = 0.72. However, the expected value should not be set too low, since several classes may then exceed the preset value; in that case Top1 or Top N may be selected according to actual needs.
The model is trained continuously so that the accuracy approaches the expected value. The expected value should be greater than 1/n and less than or equal to 1 (accuracy equal to 1/n is no better than rolling a fair n-sided die, below 1/n is worse than that, and equal to 1 is a perfect model); of course, in practice the expected value also needs to be adjusted according to the sample distribution.
A note on the classification algorithm: many classification algorithms exist in current data mining research. The present invention mainly uses the supervised classification algorithms in weka (other algorithms can also be used; the present invention is not limited in this respect); the corresponding algorithm in weka is the decision tree J48, and a simple decision tree is shown in fig. 4.
In the decision tree, each internal node of the created tree represents a point where a decision must be made based on the input; from each node, the process moves to the next node until a leaf node is reached, where the predicted output can be read off.
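To illustrate this node/branch/leaf structure in code, the following sketch trains a small tree with scikit-learn's CART implementation as a rough stand-in for weka's J48 (which implements C4.5); the iris data set and depth limit are purely illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# J48 is weka's C4.5 implementation; scikit-learn's CART tree is used
# here only to illustrate the node/branch/leaf structure.
iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# Each internal node tests one attribute; each root-to-leaf path ends in
# a leaf holding the predicted class.
print(export_text(tree, feature_names=list(iris.feature_names)))

# Predicting one instance follows exactly one root-to-leaf path.
print(tree.predict(iris.data[:1]))
```

The printed text layout mirrors fig. 4: one attribute test per node, one branch per outcome, and a class label at each leaf.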
For step S103, after training of the model is finished, the model can be applied as the basis for auditing the large daily volume of service applications, efficiently completing document auditing.
The constructed model is deployed into the production environment; the verification process is similar to the data training process, except that customer-service audit results are recorded at the same time. Data is collected for a period of time, and the overall accuracy and the accuracy of each audit result are compared.
Regarding the recording of customer-service auditing results: before the model is confirmed to be fully reliable, document auditing is usually performed by the data model and by customer service simultaneously; if the two results differ, the model dimensions and feature data need to be adjusted. After the model passes the test set, customer service thus performs one more round of verification.
Of course, if the model is fully available at this time, it may not be necessary to record customer service reviews.
For documents to be audited in practical applications, operations such as preprocessing and feature-data extraction are required before they are input into the data model.
If the test result is greater than or equal to the preset expected value, the data is stored in the audit-success database cluster. The successfully audited documents can then be provided to the after-sales system for callback use. This is a system-architecture design aimed at decoupling: the after-sales system is a business system, and the predicted auditing result can be pushed to it directly.
If the test result does not meet the preset expected value, the data can be stored in the audit-failure database cluster and returned to the after-sales system for customer-service auditing.
In addition to the above manner, the data verification may also be combined with other evaluation strategies, for example:
Cross validation: the initial sample is divided into K sub-samples; one sub-sample is retained as validation data for the model while the other K-1 sub-samples are used for training. The cross validation is repeated K times so that each sub-sample is used for validation exactly once; the K results are averaged (or combined in another way) to obtain a single estimate;
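The K-fold procedure above can be sketched as follows; scikit-learn, the synthetic data, and the choice K = 10 are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Illustrative data standing in for the document feature set.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Each of the K = 10 folds is held out once for validation while the
# other K-1 folds train; the K scores are averaged into a single estimate.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print(scores.mean())
```

Compared with a single train/test split, this uses every sample for both training and validation, yielding a more stable single accuracy estimate.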
the classifier evaluates the policy, see for example the weka classifier evaluation policy in fig. 5.
In the above, the key numerical indicators in the model construction process need to be evaluated (training accuracy and testing accuracy being similar to each other and close to 1 means the model is stable and highly accurate). The basic indicators of data mining are shown in table 2 below:
TABLE 2 data mining basic metrics
| Prediction \ Actual | p | n | Total |
| p' | TP (True Positives) | FP (False Positives) | P' |
| n' | FN (False Negatives) | TN (True Negatives) | N' |
| Total | P | N | |
Wherein, TP: True Positive, predicted by the model as a positive sample and actually positive; a correct prediction. tprate = TP/(TP + FN).
TN: True Negative, predicted by the model as a negative sample and actually negative; a correct prediction. tnrate = TN/(FP + TN).
FN: False Negative, predicted by the model as a negative sample but actually positive; an incorrect prediction. fnrate = FN/(TP + FN).
FP: False Positive, predicted by the model as a positive sample but actually negative; an incorrect prediction. fprate = FP/(FP + TN).
Precision: among the predicted positives, the proportion that is truly correct. precision = TP/(TP + FP).
Recall: the proportion of correctly predicted positives among all actually positive samples in the data set. recall = TP/(TP + FN), which is the same as tprate.
F-Measure: the weighted harmonic mean of Precision and Recall. When the weight α in the F-measure function is 1, F1 combines the results of Precision and Recall; a higher F1 indicates a more effective test method. F1 = 2TP/(2TP + FP + FN).
MCC: metrics for unbalanced data sets
ROC: the ROC curve, whose abscissa is the False Positive Rate (FPR) and whose ordinate is the True Positive Rate (TPR); it remains stable when the distribution of positive and negative samples in the test set changes.
PRC: the precision-recall curve; when the distribution of positive and negative samples is very uneven (highly skewed data), it reflects the classifier's quality more effectively than the ROC curve.
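The index formulas above can be verified with a short computation; the TP/FP/FN/TN counts below are hypothetical:

```python
# Hypothetical counts for Table 2; any confusion matrix works the same way.
TP, FP, FN, TN = 80, 10, 20, 90

tprate = TP / (TP + FN)           # recall / true positive rate
fprate = FP / (FP + TN)           # false positive rate (ROC abscissa)
tnrate = TN / (FP + TN)           # true negative rate
precision = TP / (TP + FP)
recall = tprate
f1 = 2 * TP / (2 * TP + FP + FN)  # harmonic mean of precision and recall

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

With these counts, F1 = 160/190 ≈ 0.84, matching 2·precision·recall/(precision + recall), which confirms the equivalence of the two F1 forms.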
The method provided by this embodiment performs model construction and evaluation using a large amount of existing document data, and can dynamically configure the expected value of the classification result according to the training accuracy. It addresses the subjectivity and uncertainty of the manual examination process and can replace manual customer-service auditing of service lists with automatic auditing. The automatic auditing model has strong stability and supports supervised or unsupervised learning; as the document volume increases, the model base can also self-learn and self-improve.
Referring to fig. 6, a schematic flow chart of an optional data dimension-based document auditing method according to an embodiment of the present invention is shown, including the following steps:
S601: determining data dimensions, and extracting the feature data corresponding to the data dimensions from the document;
S602: counting the data volume of the feature data under each data dimension, and eliminating any data dimension whose data volume exceeds a preset data-volume threshold, together with the feature data corresponding to the eliminated dimension;
S603: generating a data set from the feature data remaining after dimension screening, in combination with the auditing result of the document;
S604: training and testing a data model according to the data set to obtain a trained data model;
S605: receiving a document to be audited, extracting the feature data corresponding to the data dimensions from the document to be audited, and inputting the extracted feature data into the trained data model to obtain the auditing result of the document to be audited.
In the above embodiment, steps S601, S603, S604 and S605 can refer to the descriptions of steps S101 to S103 shown in fig. 1, and are not described again here.
In the above embodiment, for step S602: at the initial establishment stage of the data model, a large and comprehensive feature set is usually assembled for step-by-step screening, so dimensions may subsequently be reduced or added.
The dimension reduction is taken as an example for specific explanation:
the enumerated value of a feature is severely skewed (e.g., order type, 90% of self-run) or severely dispersed (e.g., 5000 or more commodity classes), so if the feature is subsequently measured to have a low impact on the accuracy of the classifier (< 0.1%), the feature can be deleted or merged (when dispersed).
The present invention mainly audits documents, such as service sheets, which usually have multiple possible audit results; the corresponding problem is therefore a multi-classification problem.
Because some feature space in the document data may be excessively large, the model may become too large to load when the classifier is trained; moreover, redundant features not only influence or mislead the classifier but also increase the risk of overfitting.
To solve this problem, the present invention uses an algorithm (such as a clustering algorithm) to perform a dimension adjustment (hereinafter, dimension reduction) operation before model training. Meanwhile, the smaller the dimensionality, the smaller the data set required for training and testing and the better the training speed and time consumption, while good results can still be obtained.
The algorithm used by the present invention is mainly the KNN algorithm. The KNN algorithm is not affected by an oversized feature space; it can perform a feature-deletion accuracy-change test, determine the feature screening result through that test, and finally shrink the feature spaces that occupy too much room, which is similar to numeric discretization as a whole. For example, 5000 original categories may be grouped into 200 categories after dimension reduction.
In addition, it should be noted that deleting features via the KNN algorithm may have a certain effect on model training, but the effect is usually negligible.
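A sketch of the feature-deletion accuracy-change test described above. Note the source refers to KNN as a clustering algorithm; the sketch below uses scikit-learn's k-nearest-neighbors classifier for the accuracy test, with synthetic data and the stated 0.1% threshold; all other details are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Illustrative data standing in for the document feature dimensions.
X, y = make_classification(n_samples=400, n_features=6, n_informative=3,
                           random_state=0)
clf = KNeighborsClassifier(n_neighbors=5)
baseline = cross_val_score(clf, X, y, cv=5).mean()

# Drop one feature at a time; a feature whose removal costs less than
# the 0.1% accuracy threshold is a candidate for deletion or merging.
for i in range(X.shape[1]):
    X_drop = np.delete(X, i, axis=1)
    acc = cross_val_score(clf, X_drop, y, cv=5).mean()
    flag = "removable" if baseline - acc < 0.001 else "keep"
    print(f"feature {i}: {acc:.3f} ({flag})")
```

This is a simple ablation loop; in the described system it would run against the full document feature set before model training.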
The invention can also carry out dimension reduction by two modes:
1) feature extraction: each new feature after extraction is a mapping of the original features, which can also be understood as a combination of the original features. For example, if the original features are [a, b, c, d, e], the new features might be [a + b, c × d, e/a];
2) feature selection: the features after selection are a subset of the original features; that is, useless features are removed directly from the original feature set. For example, if the original features are [a, b, c, d, e], the new features might be [a, c, e].
Through the dimension reduction operation, the data dimensions can be reasonably planned; after dimension reduction, a classifier or classification algorithm such as J48 is used. Dimension reduction can also be performed before data preprocessing; no strict order between the two is required.
It should be noted that features may also be added during training and testing of the data model, for example when important features, such as user-level features, were omitted.
The method provided by this embodiment can eliminate, through the dimension reduction operation, features that have little impact on the classification result, thereby improving the accuracy of the auditing result and reducing the amount of data stored in the data set.
The method provided by the embodiment of the present invention can refer back to the original document and has strong generalization capability. The data model is highly usable, can completely replace complex business rules, supports the setting of dynamic expected values, and has a high fault tolerance. It is also highly stable and supports supervised or unsupervised learning; as the document volume keeps increasing, the model base can self-learn and self-improve, achieving efficient and highly accurate document processing.
Referring to fig. 7, a technical architecture of the present invention is shown, and the entire system is composed of a client module, an after-sales system server cluster, an automatic audit server cluster, and a data storage system in terms of server architecture.
1) Client module
The system comprises an APP client and a PC client, wherein the APP client and the PC client are used for applying for submitting/modifying a service list;
2) after-sales system service cluster
Mainly processes requests sent by the client and performs data preprocessing; belongs to the application-layer server cluster;
3) automatic audit server cluster
Given the large number of service orders per day, the automatic audit servers need to form a cluster architecture (a cluster performs computing tasks with a high degree of cooperation through a set of loosely integrated computers connected by software and/or hardware).
4) Data storage system
The cluster storage servers storing the feature data (i.e., service lists) can be divided into three types: servers storing successfully audited data, servers storing unsuccessfully audited data, and servers storing the automatic auditing models.
Referring to fig. 8, the whole system is composed of an after-market system server, an automatic audit server and a data storage system in terms of service details.
1) After-sale system server
Includes a main thread and worker threads. The main thread manages the worker threads; the worker threads include message-forwarding threads and service-calling threads, which are not further subdivided here.
In practice, however, such a division is common; for example, when one worker thread fails, the main thread designates another thread to replace it.
2) Automatic auditing system server
Comprises three steps: feature filling, feature preprocessing, and model prediction. The existing data model generated by the classification algorithm is mainly used to predict on the incoming data, and whether the audit succeeds is judged according to the set conditions.
3) The successful audit storage server is used for storing the service list which is successfully audited
4) The audit failure storage server is used for storing the service list with audit failure
5) And the automatic auditing model set storage server is used for storing the data model generated by the J48 algorithm.
Referring to fig. 9, a schematic diagram of main modules of a document auditing apparatus 900 based on data dimension according to an embodiment of the present invention is shown, including:
the data dimension determining module 901 is configured to determine a data dimension, extract feature data corresponding to the data dimension in a document, and generate a data set in combination with an audit result of the document;
a data model training module 902, configured to perform a training test on a data model according to the data set to obtain a trained data model;
and the document auditing module 903 is used for receiving a document to be audited, extracting the characteristic data corresponding to the data dimension in the document to be audited, and inputting the extracted characteristic data into the trained data model to obtain an auditing result of the document to be audited.
In the implementation apparatus of the present invention, the data dimension determining module 901 is configured to:
and if the characteristic data does not exist in the document, acquiring a data record of the document according to the document identification and the generation time of the document so as to extract the characteristic data from the data record.
In the implementation apparatus of the present invention, the data dimension determining module 901 is further configured to:
and counting the data volume of the feature data under each data dimension, and eliminating the data dimension of which the data volume exceeds a preset data volume threshold value and the feature data corresponding to the eliminated data dimension.
In the implementation apparatus of the present invention, the data model training module 902 is configured to:
dividing the data set into a training set and a test set;
inputting training data and audit results in the training set into the data model, obtaining total training accuracy according to the training accuracy under each audit result, and generating a data model to be tested;
inputting the test data in the test set into the data model to be tested, and if the total test accuracy is greater than or equal to the training accuracy and the test accuracy under each audit result is greater than or equal to the training accuracy, determining that the data model to be tested is the trained data model.
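The acceptance rule above (overall test accuracy and every per-audit-result test accuracy must reach the training accuracy) can be sketched as a small helper; the function and variable names, class labels, and numbers are hypothetical:

```python
def accept_model(train_acc, test_acc_total, per_class_test_acc):
    """Accept the trained model only when the overall test accuracy and
    every per-audit-result test accuracy reach the training accuracy.
    (Illustrative helper; names are not from the source.)"""
    if test_acc_total < train_acc:
        return False
    return all(acc >= train_acc for acc in per_class_test_acc.values())

# Example: six audit-result classes with a training accuracy of 0.70.
per_class = {"close": 0.75, "claim": 0.72, "pickup": 0.71,
             "replace": 0.80, "deliver": 0.74, "other": 0.70}
print(accept_model(0.70, 0.74, per_class))                       # True
print(accept_model(0.70, 0.74, {**per_class, "claim": 0.65}))    # False
```

In the second call a single class falls below the training accuracy, so the model is rejected and sent back for dimension adjustment, mirroring the check described for module 902.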
In addition, since the detailed implementation of the apparatus in the embodiment of the present invention has been described in detail in the method above, the description is not repeated here.
Fig. 10 shows an exemplary system architecture 1000 of a data dimension-based document auditing method or a data dimension-based document auditing apparatus to which an embodiment of the present invention may be applied.
As shown in fig. 10, the system architecture 1000 may include terminal devices 1001, 1002, 1003, a network 1004, and a server 1005 (by way of example only). The network 1004 is used to provide a medium for communication links between the terminal devices 1001, 1002, 1003 and the server 1005. The network 1004 may include various connection types, such as wired or wireless communication links, or fiber optic cables, to name a few.
A user may use the terminal devices 1001, 1002, 1003 to interact with the server 1005 via the network 1004 to receive or transmit messages or the like. The terminal devices 1001, 1002, 1003 may have installed thereon various messenger client applications such as shopping applications, web browser applications, search applications, instant messengers, mailbox clients, social platform software, etc. (by way of example only).
The terminal devices 1001, 1002, 1003 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 1005 may be a server that provides various services, such as a backend management server (for example only) that supports shopping websites browsed by users using the terminal devices 1001, 1002, 1003. The backend management server may analyze and perform other processing on the received data, such as a product information query request, and feed back a processing result (for example, target push information or product information - just an example) to the terminal device.
It should be noted that the document auditing method based on the data dimension provided by the embodiment of the present invention is generally executed by the server 1005, and accordingly, the document auditing apparatus based on the data dimension is generally disposed in the server 1005.
It should be understood that the number of terminal devices, networks, and servers in fig. 10 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
Referring now to FIG. 11, shown is a block diagram of a computer system 1100 suitable for use with a terminal device implementing an embodiment of the present invention. The terminal device shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present invention.
As shown in fig. 11, the computer system 1100 includes a Central Processing Unit (CPU) 1101, which can perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 1102 or a program loaded from a storage section 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data necessary for the operation of the system 1100 are also stored. The CPU 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to the bus 1104.
The following components are connected to the I/O interface 1105: an input portion 1106 including a keyboard, mouse, and the like; an output portion 1107 including a signal output unit such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and a speaker; a storage section 1108 including a hard disk and the like; and a communication section 1109 including a network interface card such as a LAN card, a modem, or the like. The communication section 1109 performs communication processing via a network such as the internet. A driver 1110 is also connected to the I/O interface 1105 as necessary. A removable medium 1111 such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory is mounted on the drive 1110 as necessary, so that a computer program read out therefrom is installed into the storage section 1108 as necessary.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication portion 1109 and/or installed from the removable medium 1111. The above-described functions defined in the system of the present invention are executed when the computer program is executed by the Central Processing Unit (CPU) 1101.
It should be noted that the computer readable medium shown in the present invention can be a computer readable signal medium or a computer readable storage medium, or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present invention, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present invention may be implemented by software or hardware. The described modules may also be provided in a processor, which may be described as: a processor including a data dimension determination module, a data model training module, and a document review module. The names of these modules do not, in some cases, constitute a limitation on the modules themselves; for example, the data model training module may also be described as a "module for training a data model based on feature data and audit results".
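A minimal sketch of how these three modules might cooperate, written in Python for illustration; all class, method, and dimension names here are hypothetical, and a simple lookup model with a majority-vote fallback stands in for a real learner such as weka's J48:

```python
from collections import Counter

class DataDimensionDeterminationModule:
    """Determines the data dimensions along which feature data is extracted."""
    def determine_dimensions(self):
        # Hypothetical dimensions; real ones would come from the service form schema.
        return ["amount_level", "department", "doc_type"]

class DataModelTrainingModule:
    """Trains a data model from (feature_tuple, audit_result) pairs."""
    def train(self, dataset):
        # A lookup table keyed by feature tuple, with a majority-vote fallback,
        # stands in here for a real trained classifier.
        table = {features: result for features, result in dataset}
        default = Counter(result for _, result in dataset).most_common(1)[0][0]
        return table, default

class DocumentReviewModule:
    """Predicts the audit result of a document to be audited."""
    def review(self, model, dimensions, document):
        table, default = model
        features = tuple(document.get(d) for d in dimensions)
        return table.get(features, default)
```

Splitting the responsibilities this way mirrors the description above: the first module fixes the feature schema, the second turns a data set into a model, and the third applies the model to new documents.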
As another aspect, the present invention also provides a computer-readable medium, which may be contained in the device described in the above embodiments, or may exist separately without being incorporated into the device. The computer readable medium carries one or more programs which, when executed by the device, cause the device to perform operations comprising:
determining a data dimension, extracting feature data corresponding to the data dimension from the document, and generating a data set by combining the feature data with an auditing result of the document;
training and testing a data model according to the data set to obtain a trained data model;
and receiving a document to be audited, extracting the feature data corresponding to the data dimension from the document to be audited, and inputting the extracted feature data into the trained data model to obtain the auditing result of the document to be audited.
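The operations above can be sketched end to end. The following Python illustration is a sketch under stated assumptions: the dimension names and audit labels are hypothetical, and a one-level decision tree (a stump chosen by information gain, i.e. the root-split step of the J48 procedure described earlier) stands in for a fully trained model:

```python
import math
from collections import Counter

# Illustrative data dimensions; the patent does not fix a concrete schema.
DIMENSIONS = ["amount_level", "department", "doc_type"]

def extract_features(document):
    """Extract the feature data corresponding to each data dimension."""
    return tuple(document.get(d) for d in DIMENSIONS)

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def train_model(dataset):
    """Train a one-level decision tree: in the spirit of J48, pick the
    attribute with the highest information gain as the root, and map each
    of its values to the majority audit result among matching instances."""
    labels = [result for _, result in dataset]
    base = entropy(labels)
    best_gain, best_attr = -1.0, 0
    for i in range(len(DIMENSIONS)):
        splits = {}
        for features, result in dataset:
            splits.setdefault(features[i], []).append(result)
        remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in splits.values())
        if base - remainder > best_gain:
            best_gain, best_attr = base - remainder, i
    branches = {}
    for features, result in dataset:
        branches.setdefault(features[best_attr], []).append(result)
    return {
        "attr": best_attr,
        "branches": {v: Counter(ys).most_common(1)[0][0] for v, ys in branches.items()},
        "default": Counter(labels).most_common(1)[0][0],
    }

def audit(model, document):
    """Return the predicted auditing result for a document to be audited."""
    features = extract_features(document)
    return model["branches"].get(features[model["attr"]], model["default"])
```

In practice the data set would be split into training and testing portions, and the split would be applied recursively to each branch until all instances in a subset share the same audit result, as the J48 procedure described earlier does.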
According to the technical solution of the embodiment of the present invention, historical documents can be referenced and the generalization capability is strong; the data model is highly usable, can completely replace complex business rules, supports the setting of dynamic expected values, and has a high fault tolerance rate; the stability is high, supervised or unsupervised learning can be carried out, and as the document volume continuously increases, the model library can also self-learn and improve, thereby achieving efficient, high-accuracy document processing.
The above-described embodiments should not be construed as limiting the scope of the invention. Those skilled in the art will appreciate that various modifications, combinations, sub-combinations, and substitutions can occur, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.