System and method for providing coding for target data
Technical Field
The present invention relates to the field of tax data application technology, and more particularly, to a system and method for providing encoding for target data.
Background
The "Tax Classification and Coding Table for Goods and Services" most recently issued by the national tax administration strictly classifies goods and services into 4207 categories, of which 675 are major categories and 3532 are minor categories. A notice issued by the national tax bureau in 2016 requires billing software to add, on a trial basis, tax classification codes and code-related functions. In addition, invoices issued in various localities contain a great number of mistakes and inaccurate manually labeled codes, which mislead the statistics, analysis and other work aimed at preventing tax evasion and tax leakage by enterprises, work that is based on enterprise tax rates and on input/output analysis of the sold commodities. Because invoicers and tax data analysts are limited in professional knowledge and energy, manually coding and classifying massive numbers of commodity and service names is infeasible. Therefore, to make tax data analysis more accurate and invoicing more convenient, a classification recommendation system relying on big data technology and machine learning models is specially designed.
The naive Bayes method is a classification method based on Bayes' theorem and the assumption of conditional independence between features: for a given training data set, the joint probability distribution of input and output is first learned under the conditional independence assumption, and then, for a given input, the output with the maximum posterior probability is obtained from the model using Bayes' theorem. Testing shows that the classification model obtained by naive Bayes training has good accuracy. The main idea is as follows: in the training stage, the word segments of the commodity names in the training sample set are taken as input, the prior probability of every category (code) is obtained, and the conditional probability of every word segment given a certain code is calculated; in the classification stage, the commodity name is segmented into words and the probability of every possible code is calculated according to Bayes' theorem. However, naive Bayes is a statistics-based algorithm and has certain limitations stemming from the conditional independence assumption: it cannot process out-of-vocabulary words; categories with few samples, or too many labeling errors in the samples, cause inaccurate classification; words at different positions in a commodity name are treated identically; and each recognition requires computing the probability of every category, which is computationally expensive and consumes large communication bandwidth. It therefore cannot be used in practice as-is, and an optimization scheme needs to be proposed.
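By way of background illustration only, the following minimal Python sketch shows the naive Bayes training and classification procedure described above; the toy sample data, the Laplace smoothing and all function names are assumptions made for this sketch, not part of the invention.

```python
from collections import Counter, defaultdict
import math

def train(samples):
    """samples: list of (tokens, code) pairs. Returns priors, per-code token counts, vocabulary size."""
    code_counts = Counter()              # frequency of each code (category)
    token_counts = defaultdict(Counter)  # per-code frequency of each word segment
    vocab = set()
    for tokens, code in samples:
        code_counts[code] += 1
        for t in tokens:
            token_counts[code][t] += 1
            vocab.add(t)
    total = sum(code_counts.values())
    priors = {c: n / total for c, n in code_counts.items()}
    return priors, token_counts, len(vocab)

def classify(tokens, priors, token_counts, vocab_size):
    """Return the code maximizing prior * product of conditionals, computed in log space.
    Laplace smoothing is an assumption added here so unseen words do not zero out a category."""
    best_code, best_score = None, -math.inf
    for code, prior in priors.items():
        denom = sum(token_counts[code].values()) + vocab_size
        score = math.log(prior) + sum(
            math.log((token_counts[code][t] + 1) / denom) for t in tokens)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Illustrative toy data: segmented commodity names labeled with hypothetical codes.
samples = [(["mineral", "water"], "1030201"),
           (["sparkling", "water"], "1030201"),
           (["laser", "printer"], "1090512")]
priors, conds, v = train(samples)
print(classify(["mineral", "water"], priors, conds, v))  # -> 1030201
```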
A traditional database system stores all data on disk, so the disk must be accessed frequently for read operations, and performance is low when the data volume is large and reads are frequent. In recent years, memory capacity has kept increasing while its price has kept falling, and at the same time the real-time response requirements placed on database systems have grown, so making full use of in-memory technology to improve database performance has become a research hotspot.
Disclosure of Invention
In view of the above, the present invention proposes a system for providing encoding for target data, the system comprising:
the training module is used for acquiring training data, wherein the training data comprises a classification coding table and historical invoice data, training the classification coding table and the historical invoice data, acquiring training results and generating a plurality of training models based on the training results;
the model merging module is used for merging the plurality of groups of training models and superimposing identical entries of the training result data sets;
the coding providing module is used for reading the training result data; the coding providing module is connected with a plurality of interfaces, receives target data to be coded transmitted by the interfaces, and provides coding information for the target data to be coded.
Optionally, training the classification code table and the historical invoice data includes:
filtering the training data or correcting the content with coding marking errors in the training data to obtain corrected training data;
preprocessing the corrected training data, wherein the preprocessing is to filter the time information, blank spaces and punctuation existing in the data;
performing word segmentation and cleaning on the corrected training data, and adding position weights to the segmented and cleaned corrected training data; extracting the unit and specification model data in the corrected training data; acquiring, from the corrected training data, the record frequencies corresponding to the classification coding table, and acquiring rule set training data based on these record frequencies; and combining the extracted unit and specification model data with the rule set training data to acquire sample training data;
and constructing a training result data set based on the sample training data and storing the training result data set in the distributed file system HDFS.
Optionally, the cleaning process includes: filtering data in the number-adjective connection pattern, filtering brand part-of-speech data, filtering nouns, adjectives and verbs, and filtering the adjectives in patterns where a plurality of adjectives connect to one noun.
Optionally, the training result data set includes: commodity name word segment, code, position weight and frequency data; commodity unit, code and frequency data; commodity specification model, code and frequency data; post-word-segmentation code and frequency data; and pre-word-segmentation code and frequency data.
Optionally, providing the coding information to the target data to be coded includes:
acquiring prior probability and conditional probability based on training result data;
broadcasting the prior probabilities and conditional probabilities when the target data to be coded is batch data on a big data cluster and recommended codes need to be obtained in batches;
performing word segmentation and filtering on the commodity name of the target data and adding position weights to the commodity name segments; acquiring the conditional probabilities of the plurality of candidate codes of the target data from the conditional probability data, and acquiring the prior probabilities of the plurality of candidate codes from the prior probability data; computing, for each candidate code, the product of its conditional probability and its prior probability; and taking the code with the maximum product as the recommended code.
Optionally, the system further comprises: a Web end code providing module, which is used for importing the training result data set stored in the distributed file system HDFS into a PostgreSQL database of the production environment and providing a Web end target data acquisition interface, and, after acquiring target data, returning the top five recommended codes, code names and recommendation probabilities for the acquired target data.
Optionally, the system further comprises: a memory database end code providing module, which loads the training result data set and the correction data into the data structure server Redis;
the correction data is commodity names and their coding data that are accurate or preferentially recommended;
and the top five recommended codes, code names and recommendation probabilities are returned for the acquired target data.
Optionally, the memory database end code providing module writes the recommended target data and the recommended codes of the target data into a cache with a preset expiration time, and matches relevant information from the cache at each code recommendation.
Optionally, the memory database end code providing module is configured to, if the acquired target data matches the correction data in the data structure server Redis, take the code corresponding to the correction data, with a probability of 0.5, as the first recommended code, and normalize the probabilities of the remaining recommended codes and multiply them by 0.5 to serve as the four recommended codes after the first.
Optionally, the system further comprises: an online information feedback module, which acquires the recommended code information actively selected by the user from the top five recommendations of the Web end code recommendation module and feeds the selection back to the training model.
Optionally, the prior probability is obtained by dividing the frequency of a code in the pre-word-segmentation code and frequency data of the training result data set by the total frequency count.
Optionally, the conditional probability is obtained by dividing the frequency of the name word segment data, the unit data and the specification model data in the training result data set by the frequency of the corresponding code in the post-word-segmentation code and frequency data.
The invention also provides a method for providing coding for target data, the method comprising:
acquiring training data, wherein the training data comprises a classification coding table and historical invoice data, training the classification coding table and the historical invoice data, acquiring training results, and generating a plurality of training models based on the training results;
merging the plurality of groups of training models and superimposing identical entries of the training result data sets;
and reading the training result data, wherein the coding module is connected with a plurality of interfaces, receives target data to be coded transmitted by the interfaces, and provides coding information for the target data to be coded.
Optionally, training the classification code table and the historical invoice data includes:
filtering the training data or correcting the content with coding marking errors in the training data to obtain corrected training data;
preprocessing the corrected training data, wherein the preprocessing is to filter the time information, blank spaces and punctuation existing in the data;
performing word segmentation and cleaning on the corrected training data, and adding position weights to the segmented and cleaned corrected training data; extracting the unit and specification model data in the corrected training data; acquiring, from the corrected training data, the record frequencies corresponding to the classification coding table, and acquiring rule set training data based on these record frequencies; and combining the extracted unit and specification model data with the rule set training data to acquire sample training data;
and constructing a training result data set based on the sample training data and storing the training result data set in the distributed file system HDFS.
Optionally, the cleaning process includes: filtering data in the number-adjective connection pattern, filtering brand part-of-speech data, filtering nouns, adjectives and verbs, and filtering the adjectives in patterns where a plurality of adjectives connect to one noun.
Optionally, the training result data set includes: commodity name word segment, code, position weight and frequency data; commodity unit, code and frequency data; commodity specification model, code and frequency data; post-word-segmentation code and frequency data; and pre-word-segmentation code and frequency data.
Optionally, providing the coding information to the target data to be coded includes:
acquiring prior probability and conditional probability based on training result data;
broadcasting the prior probabilities and conditional probabilities when the target data to be coded is batch data on a big data cluster and recommended codes need to be obtained in batches;
performing word segmentation and filtering on the commodity name of the target data and adding position weights to the commodity name segments; acquiring the conditional probabilities of the plurality of candidate codes of the target data from the conditional probability data, and acquiring the prior probabilities of the plurality of candidate codes from the prior probability data; computing, for each candidate code, the product of its conditional probability and its prior probability; and taking the code with the maximum product as the recommended code.
Optionally, the method further comprises: importing the training result data set stored in the distributed file system HDFS into a PostgreSQL database of the production environment, providing a Web end target data acquisition interface, and, after acquiring target data, returning the top five recommended codes, code names and recommendation probabilities for the acquired target data.
Optionally, the method further comprises:
loading the training result data set and the correction data into the data structure server Redis;
the correction data is commodity names and their coding data that are accurate or preferentially recommended;
and returning the top five recommended codes, code names and recommendation probabilities for the acquired target data.
Optionally, the method further comprises:
the recommended target data and the recommended codes of the target data are written into a data structure server Redis cache with a preset expiration time, and relevant information is matched from the cache at each code recommendation.
Optionally, the method further comprises:
if the acquired target data matches the correction data in the data structure server Redis, the code corresponding to the correction data is taken, with a probability of 0.5, as the first recommended code, and the probabilities of the remaining recommended codes are normalized and multiplied by 0.5 to serve as the four recommended codes after the first.
Optionally, the method further comprises:
and acquiring the recommended code information actively selected by the user from the top five recommendations of the Web end code recommendation module, and feeding the selection back to the training model.
Optionally, the prior probability is obtained by dividing the frequency of a code in the pre-word-segmentation code and frequency data of the training result data set by the total frequency count.
Optionally, the conditional probability is obtained by dividing the frequency of the name word segment data, the unit data and the specification model data in the training result data set by the frequency of the corresponding code in the post-word-segmentation code and frequency data.
According to the invention, the historical data is preprocessed in detail and effectively according to the actual condition of the data, interference information is removed, and the training accuracy is improved;
the invention simultaneously provides a variety of code recommendation interfaces, namely batch identification, Web end code recommendation and quick recommendation based on the data structure server Redis, together with a data storage method that improves the performance of the data structure server Redis; in addition, a model merging module and an online information feedback module are provided to further improve the recommendation accuracy of the model;
the invention better solves the problem of classifying and coding short texts, such as commodity and food names, in fields such as taxation and food and drug supervision.
Drawings
FIG. 1 is a block diagram of a system for providing encoding for target data in accordance with the present invention;
FIG. 2 is a flow chart of a method for providing encoding for target data in accordance with the present invention.
Detailed Description
The exemplary embodiments of the present invention will now be described with reference to the accompanying drawings, however, the present invention may be embodied in many different forms and is not limited to the examples described herein, which are provided to fully and completely disclose the present invention and fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to be limiting of the invention. In the drawings, like elements/components are referred to by like reference numerals.
Unless otherwise indicated, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. In addition, it will be understood that terms defined in commonly used dictionaries should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
The present invention provides a system 200 for providing coding for target data, as shown in FIG. 1, the system 200 comprising:
the training module 201 acquires training data, where the training data includes a classification coding table and historical invoice data, and trains the classification coding table and the historical invoice data, the training including:
filtering the training data or correcting the content with coding marking errors in the training data to obtain corrected training data;
preprocessing the corrected training data, wherein the preprocessing is to filter the time information, blank spaces and punctuation existing in the data;
performing word segmentation and cleaning on the corrected training data, and adding position weights to the segmented and cleaned corrected training data; extracting the unit and specification model data in the corrected training data; acquiring, from the corrected training data, the record frequencies corresponding to the classification coding table, and acquiring rule set training data based on these record frequencies; and combining the extracted unit and specification model data with the rule set training data to acquire sample training data;
and constructing a training result data set based on the sample training data and storing the training result data set in the distributed file system HDFS.
The cleaning process comprises: filtering data in the number-adjective connection pattern, filtering brand part-of-speech data, filtering nouns, adjectives and verbs, and filtering the adjectives in patterns where a plurality of adjectives connect to one noun;
the training result data set includes: commodity name word segment, code, position weight and frequency data; commodity unit, code and frequency data; commodity specification model, code and frequency data; post-word-segmentation code and frequency data; and pre-word-segmentation code and frequency data;
acquiring training results, and generating a plurality of training models based on the training results;
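As a concrete illustration of the preprocessing, word segmentation and position weighting steps above, a minimal Python sketch follows; the jieba part-of-speech segmenter, the regular expressions and the tail-weighted position scheme are assumptions made for this example, not a definitive implementation.

```python
import re
import jieba.posseg as pseg  # assumed third-party Chinese segmenter with part-of-speech tags

# Illustrative patterns for time information and for punctuation/whitespace.
TIME_PAT = re.compile(r"\d{4}[-/年]\d{1,2}[-/月]?(\d{1,2}日?)?")
PUNCT_PAT = re.compile(r"[\s,。;:!?、,.;:!?*()()\[\]【】]+")

def preprocess(name: str) -> str:
    """Filter time information, blank spaces and punctuation from a commodity name."""
    return PUNCT_PAT.sub("", TIME_PAT.sub("", name))

def segment_with_weights(name: str):
    """Segment the cleaned name and attach a position weight to each word segment.
    The weight grows toward the tail because the last words of a commodity name
    usually carry the category meaning -- the concrete scheme is an assumption."""
    pairs = [(p.word, p.flag) for p in pseg.cut(preprocess(name))]
    n = len(pairs) or 1
    return [(word, flag, (i + 1) / n) for i, (word, flag) in enumerate(pairs)]

print(segment_with_weights("2023年5月 农夫山泉 矿泉水 550ml"))
```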
the model merging module 202 merges the plurality of groups of training models and superimposes identical entries of the training result data sets;
the code providing module 203 reads the training result data; the code providing module 203 is connected with a plurality of interfaces, receives target data to be coded transmitted by the interfaces, and provides coding information for the target data to be coded;
providing encoding information for target data to be encoded includes:
acquiring prior probability and conditional probability based on training result data;
broadcasting the prior probabilities and conditional probabilities when the target data to be coded is batch data on a big data cluster and recommended codes need to be obtained in batches; preprocessing the target data, wherein the preprocessing is to filter the time information, blank spaces and punctuation existing in the target data;
performing word segmentation and filtering on the preprocessed commodity name of the target data and adding position weights to the commodity name segments; acquiring the conditional probabilities of the plurality of candidate codes of the target data from the conditional probability data, and acquiring the prior probabilities of the plurality of candidate codes from the prior probability data; computing, for each candidate code, the product of its conditional probability and its prior probability; and taking the code with the maximum product as the recommended code.
The prior probability is obtained by dividing the frequency of a code in the pre-word-segmentation code and frequency data of the training result data set by the total frequency count.
The conditional probability is obtained by dividing the frequency of the name word segment data, the unit data and the specification model data in the training result data set by the frequency of the corresponding code in the post-word-segmentation code and frequency data.
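For the batch path described above, the following minimal sketch assumes a PySpark cluster: broadcasting ships the two probability tables to every executor once instead of with every task. The variables priors, conditionals and names_rdd are assumed to exist already (read from the training result data on HDFS), and the scoring omits smoothing and position weights for brevity.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-code-recommendation").getOrCreate()
sc = spark.sparkContext

# Assumed to have been read from the training result data on HDFS:
#   priors: {code: P(code)}; conditionals: {(token, code): P(token | code)}
b_priors = sc.broadcast(priors)
b_conds = sc.broadcast(conditionals)

def recommend(tokens):
    """Score every candidate code by prior * product of conditionals; return the best code."""
    best, best_p = None, 0.0
    for code, prior in b_priors.value.items():
        p = prior
        for t in tokens:
            p *= b_conds.value.get((t, code), 0.0)
        if p > best_p:
            best, best_p = code, p
    return best

# names_rdd: an RDD of already segmented and filtered commodity names.
recommended = names_rdd.map(recommend)
```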
The Web end code providing module 204 imports the training result data set stored in the distributed file system HDFS into the PostgreSQL database of the production environment to provide a Web end target data acquisition interface, and, after acquiring target data, returns the top five recommended codes, code names and recommendation probabilities for the acquired target data.
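A minimal sketch of such a Web end interface follows, assuming Flask and psycopg2 and a hypothetical PostgreSQL table named recommendations; all table, column and endpoint names are illustrative assumptions.

```python
import psycopg2
from flask import Flask, jsonify, request

app = Flask(__name__)
conn = psycopg2.connect("dbname=tax_codes user=app")  # assumed production connection string

@app.route("/recommend")
def recommend():
    name = request.args["name"]  # commodity name submitted by the Web client
    with conn.cursor() as cur:
        # Hypothetical table imported from the HDFS training result data set.
        cur.execute(
            "SELECT code, code_name, probability FROM recommendations"
            " WHERE commodity_name = %s ORDER BY probability DESC LIMIT 5",
            (name,),
        )
        rows = cur.fetchall()
    return jsonify([{"code": c, "code_name": n, "probability": float(p)}
                    for c, n, p in rows])
```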
The memory database end code providing module 205 loads the training result data set and the correction data into the data structure server Redis as follows.
First, the frequency count total value is stored with a key of "code:sum" and a type of String.
Second, because the code length is 19 and the I/O pressure during recommendation is therefore large, the codes in the result set are replaced by numbers from 1 to N. Specifically: the codes appearing in the code and name data, the post-word-segmentation code and frequency data, and the pre-word-segmentation code and frequency data acquired by the rule set are numbered to form a code-number correspondence; then the data of these three data sets are stored in Redis as hash tables whose key is "code:$number" ($number denotes the variable number), whose fields are "name", "token" and "doc", and whose values are the code name, the post-word-segmentation frequency and the pre-word-segmentation frequency, respectively.
Further, the generated code-number correspondence is formed into a character string of the form "0:code1,1:code2…" and stored in Redis, where the key is "code:total".
Then, the name word segment data, unit data and specification model data are respectively stored into Redis as hash tables, where the key is the word segment, unit or specification model, the field is the position (the position weights of the unit and the specification model being 0), and the value is a character string of the form "number1:freq1,number2:freq2…" covering all codes corresponding to the same name and position;
the correction data are accurate or preferentially recommended commodity names and their coding data; the codes in the manual correction data are replaced by numbers using the code-number table generated in the previous step, the keys stored in Redis are "artificial:$name:dw:$dw:ggxh:$ggxh", and the values are the corresponding numbers.
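The storage layout described in the steps above can be sketched with redis-py as follows; the key patterns follow the description, while the shapes of the input dictionaries are illustrative assumptions.

```python
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed instance

def load_training_results(total_freq, code_info, token_data, manual_data):
    """Write the training result data set into Redis using the layout described above.
    code_info:   {code: (name, post_seg_freq, pre_seg_freq)}        -- assumed shape
    token_data:  {(segment, position): {code: freq}}, position 0 for units/spec models
    manual_data: {(name, dw, ggxh): code}                           -- corrected data"""
    r.set("code:sum", total_freq)  # String holding the frequency count total value

    # Replace the 19-digit codes with compact numbers to reduce I/O.
    code_to_number = {code: i for i, code in enumerate(code_info)}
    for code, (name, token_freq, doc_freq) in code_info.items():
        n = code_to_number[code]
        r.hset(f"code:{n}", mapping={"name": name, "token": token_freq, "doc": doc_freq})

    # Persist the code-number correspondence as "0:code1,1:code2,...".
    r.set("code:total", ",".join(f"{n}:{c}" for c, n in code_to_number.items()))

    # Name segments, units and specification models: one hash per word, field = position.
    for (segment, position), freqs in token_data.items():
        value = ",".join(f"{code_to_number[c]}:{f}" for c, f in freqs.items())
        r.hset(segment, str(position), value)

    # Manually corrected data, keyed by name, unit (dw) and specification model (ggxh).
    for (name, dw, ggxh), code in manual_data.items():
        r.set(f"artificial:{name}:dw:{dw}:ggxh:{ggxh}", code_to_number[code])
```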
The top five recommended codes, code names and recommendation probabilities are returned for the acquired target data.
The memory database end code providing module 205 writes the recommended target data and the recommended codes of the target data into the cache with a preset expiration time, and matches relevant information from the cache at each code recommendation.
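The caching behavior might be sketched as follows; the cache key prefix and the one-day expiration time are assumptions.

```python
import json

CACHE_TTL_SECONDS = 24 * 3600  # assumed preset expiration time

def cached_recommend(r, name, compute_recommendations):
    """Return cached recommendations for a commodity name, recomputing on a cache miss."""
    key = f"cache:{name}"  # hypothetical cache key prefix
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    result = compute_recommendations(name)  # fall back to the full recommendation path
    r.setex(key, CACHE_TTL_SECONDS, json.dumps(result))  # store with expiration
    return result
```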
If the acquired target data matches the correction data in the data structure server Redis, the memory database end code providing module 205 takes the code corresponding to the correction data, with a probability of 0.5, as the first recommended code, and normalizes the probabilities of the remaining recommended codes and multiplies them by 0.5 to serve as the four recommended codes after the first.
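The probability adjustment just described (corrected code first at 0.5, the next four model recommendations renormalized among themselves and scaled by 0.5) might look like the following sketch; the function name and toy data are illustrative.

```python
def merge_with_correction(corrected_code, model_recs):
    """model_recs: list of (code, probability) from the model, sorted descending.
    Returns the five recommendations after applying the correction rule."""
    rest = [(c, p) for c, p in model_recs if c != corrected_code][:4]
    total = sum(p for _, p in rest) or 1.0  # guard against an empty tail
    # The corrected code takes first place with probability 0.5; the next four
    # recommendations are normalized among themselves and multiplied by 0.5.
    return [(corrected_code, 0.5)] + [(c, 0.5 * p / total) for c, p in rest]

print(merge_with_correction("1030201", [("1030305", 0.4), ("1090512", 0.3),
                                        ("1100101", 0.2), ("1100102", 0.1)]))
# -> [('1030201', 0.5), ('1030305', 0.2), ('1090512', 0.15), ('1100101', 0.1), ('1100102', 0.05)]
```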
The online information feedback module 206 acquires the recommended code information actively selected by the user from the top five recommendations of the Web end code recommendation module, and feeds the selection back to the training model.
The invention also proposes a method for providing coding for target data, as shown in FIG. 2, the method comprising:
obtaining training data, wherein the training data comprises a classification coding table and historical invoice data, and training the classification coding table and the historical invoice data, the training comprising the following steps:
filtering the training data or correcting the content with coding marking errors in the training data to obtain corrected training data;
preprocessing the corrected training data, wherein the preprocessing is to filter the time information, blank spaces and punctuation existing in the data;
performing word segmentation and cleaning on the corrected training data, and adding position weights to the segmented and cleaned corrected training data; extracting the unit and specification model data in the corrected training data; acquiring, from the corrected training data, the record frequencies corresponding to the classification coding table, and acquiring rule set training data based on these record frequencies; and combining the extracted unit and specification model data with the rule set training data to acquire sample training data;
and constructing a training result data set based on the sample training data and storing the training result data set in the distributed file system HDFS.
The cleaning process comprises: filtering data in the number-adjective connection pattern, filtering brand part-of-speech data, filtering nouns, adjectives and verbs, and filtering the adjectives in patterns where a plurality of adjectives connect to one noun.
The training result data set includes: commodity name word segment, code, position weight and frequency data; commodity unit, code and frequency data; commodity specification model, code and frequency data; post-word-segmentation code and frequency data; and pre-word-segmentation code and frequency data;
acquiring training results, and generating a plurality of training models based on the training results;
merging the plurality of groups of training models and superimposing identical entries of the training result data sets;
reading the training result data, wherein the coding module is connected with a plurality of interfaces, receives target data to be coded transmitted by the interfaces, and provides coding information for the target data to be coded;
providing encoding information for target data to be encoded includes:
acquiring prior probability and conditional probability based on training result data;
broadcasting the prior probabilities and conditional probabilities when the target data to be coded is batch data on a big data cluster and recommended codes need to be obtained in batches;
performing word segmentation and filtering on the commodity name of the target data and adding position weights to the commodity name segments; acquiring the conditional probabilities of the plurality of candidate codes of the target data from the conditional probability data, and acquiring the prior probabilities of the plurality of candidate codes from the prior probability data; computing, for each candidate code, the product of its conditional probability and its prior probability; and taking the code with the maximum product as the recommended code;
the prior probability is obtained by dividing the frequency of a code in the pre-word-segmentation code and frequency data of the training result data set by the total frequency count.
The conditional probability is obtained by dividing the frequency of the name word segment data, the unit data and the specification model data in the training result data set by the frequency of the corresponding code in the post-word-segmentation code and frequency data.
The training result data set stored in the distributed file system HDFS is imported into a PostgreSQL database of the production environment, and a Web end target data acquisition interface is provided; after target data is acquired, the top five recommended codes, code names and recommendation probabilities for the acquired target data are returned.
The training result data set and the correction data are loaded into the data structure server Redis;
the correction data is commodity names and their coding data that are accurate or preferentially recommended;
and the top five recommended codes, code names and recommendation probabilities are returned for the acquired target data.
The recommended target data and the recommended codes of the target data are written into the data structure server Redis cache with a preset expiration time, and relevant information is matched from the cache at each code recommendation.
If the acquired target data matches the correction data in the data structure server Redis, the code corresponding to the correction data is taken, with a probability of 0.5, as the first recommended code, and the probabilities of the remaining recommended codes are normalized and multiplied by 0.5 to serve as the four recommended codes after the first.
The recommended code information actively selected by the user from the top five recommendations of the Web end code recommendation module is acquired and fed back to the training model.
According to the invention, the historical data is preprocessed in detail and effectively according to the actual condition of the data, interference information is removed, and the training accuracy is improved;
the invention simultaneously provides a variety of code recommendation interfaces, namely batch identification, Web end code recommendation and quick recommendation based on the data structure server Redis, together with a data storage method that improves the performance of the data structure server Redis; in addition, a model merging module and an online information feedback module are provided to further improve the recommendation accuracy of the model;
the invention better solves the problem of classifying and coding short texts, such as commodity and food names, in fields such as taxation and food and drug supervision.