Disclosure of Invention
The invention provides a multi-source heterogeneous big data processing system, which solves the technical problem in the related art that structuring and unifying big data through manually extracted features is time-consuming.
The invention provides a multi-source heterogeneous big data processing system, which comprises:
a category feature generation module that generates main data category features based on field names of main data, each field name corresponding to one main data category feature;
a data source feature generation module that generates data source features based on the original data set linked to the main data;
a generated feature extractor for randomly extracting characters and/or words from the original data set to generate unit feature vectors, and then combining the unit feature vectors to obtain generated features;
a model generation module for generating a main data generation model;
the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network and a second neural network, wherein the feature synthesis module synthesizes the main data category features and the generated features into basic features, and the first neural network takes the basic features as input and outputs a first feature;
the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each selected piece of main data, and synthesizes all the generated main data features with the main data category features to generate a second feature;
the second feature and the first feature are input into the second neural network, whose output is mapped to a classification space containing two classification labels, which respectively indicate that the input is the second feature and that the input is the first feature;
word vectors are obtained by processing the word parts of the main data with a Skip-Gram model; the main data features are generated by directly combining the word vectors extracted from the main data with general features, where general features are those contents of the main data that can be used directly as vector elements;
the original data set and the training samples of the training set of the main data generation model are both derived from regional government affairs data;
a main data generation module for inputting the field names of main data entered by a user into the category feature generation module to generate main data category features; synthesizing the main data category features with generated features produced from the original data set for which main data is to be generated, to obtain basic features; inputting the basic features into the first neural network of the main data generation model; and obtaining, based on the first feature generated by the first neural network, the main data of that original data set and the field names corresponding to the main data.
Further, the original data set linked to the main data refers to the original data set that needs to be associated with the main data.
Further, the first neural network and the second neural network are both multi-layer perceptrons.
Further, the main data feature is spliced after the main data category feature when the main data feature and the main data category feature are synthesized.
Further, when the main data category features are synthesized with the generated features, random feature vectors are spliced after the main data category features.
Further, the dimensions of the first feature and the second feature are the same, and after matrixing both are expressed as:
$$U = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mn} \end{pmatrix}$$
where $u_{1i}$, the i-th element of the first row of the matrix U, represents the i-th main data category feature; $u_{ji}$, the element in the j-th row and i-th column of the matrix U, represents the field of the j-th main data corresponding to the i-th main data category feature; m represents the total number of main data; and n represents the total number of main data category features of one main data.
Further, the second neural network outputs through the softmax layer, and the output value is a probability value.
Further, the first neural network and the second neural network are jointly trained. The training loss function is defined in terms of the following quantities: the loss value; a count equal to the number of training samples in the training set; y, a set constant; the probability value that the second neural network outputs for the classification label corresponding to the second feature when the second feature of the t-th training sample is input; and the probability value that the second neural network outputs for the classification label corresponding to the first feature when the g-th first feature of the t-th training sample is input.
Further, the training samples for the joint training are derived from original data sets for which main data has already been constructed; the generated feature extractor performs multiple extractions on one original data set serving as a training sample to obtain a plurality of generated features, so that a plurality of basic features can be synthesized and a plurality of first features generated through the first neural network.
Further, the main data generation module produces a plurality of generated features from the original data set for which main data is to be generated, synthesizes a plurality of basic features from them, inputs each basic feature into the first neural network to obtain a plurality of groups of main data, and deletes the duplicate main data from these groups to obtain the final main data set.
The beneficial effects of the invention are as follows: the invention can automatically generate main data matched to big data from a limited range of sources, and the big data is then structured and unified through the main data.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.
As shown in fig. 1, a multi-source heterogeneous big data processing system includes:
a category feature generation module 101 that generates main data category features based on field names of main data, each field name corresponding to one main data category feature;
a data source feature generation module 102 that generates data source features based on the original data set linked to the main data;
the original data set linked to the main data refers to the original data set that needs to be associated with the main data; conversely, the information in the main data is derived from that original data set.
A generated feature extractor 103 for randomly extracting characters and/or words from the original dataset to generate unit feature vectors, and then combining the unit feature vectors to obtain generated features;
a model generation module 104 for generating a main data generation model;
the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network and a second neural network, wherein the feature synthesis module synthesizes the main data category features and the generated features into basic features, and the first neural network takes the basic features as input and outputs a first feature;
the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each piece of extracted main data, and synthesizes all the generated main data features and main data category features to generate a second feature;
the second feature and the first feature are input into a second neural network, the output of the second neural network is mapped to a classification space, and the classification space contains two classification labels which respectively represent the input as the second feature and the input as the first feature.
The first neural network and the second neural network may be conventional neural networks. In one embodiment of the invention, the first neural network and the second neural network are both multi-layer perceptrons;
in another embodiment of the invention, the first neural network and the second neural network are both convolutional neural networks.
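The multi-layer perceptron embodiment can be illustrated with a minimal forward-pass sketch. It is not the patented implementation: the layer sizes, the ReLU activation, the random initial weights, and the names `make_mlp`, `forward`, `BASIC_DIM` and `FEATURE_DIM` are all assumptions chosen for illustration. It only shows how a basic feature flows through the first neural network to give a first feature, which the second neural network then maps to two scores, one per classification label.

```python
import random

def make_mlp(sizes, rng):
    """Create (weights, biases) pairs for an MLP with the given layer sizes."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(layers, x):
    """Forward pass: ReLU on hidden layers, linear output on the last layer."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

rng = random.Random(0)
BASIC_DIM, FEATURE_DIM = 8, 6  # hypothetical dimensions, not taken from the source

# First neural network: basic feature -> first feature.
first_network = make_mlp([BASIC_DIM, 16, FEATURE_DIM], rng)
# Second neural network: first or second feature -> two classification scores.
second_network = make_mlp([FEATURE_DIM, 16, 2], rng)

basic_feature = [rng.random() for _ in range(BASIC_DIM)]
first_feature = forward(first_network, basic_feature)
scores = forward(second_network, first_feature)
```

Because the first and second features share the same dimension, the same `second_network` can score either kind of input.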
In one embodiment of the invention, features are synthesized by splicing feature vectors; e.g., for two vectors $a = (a_1, \dots, a_p)$ and $b = (b_1, \dots, b_q)$, the result after synthesis is $(a_1, \dots, a_p, b_1, \dots, b_q)$.
When the main data features and the main data category features are synthesized, the main data features are spliced after the main data category features;
when the main data category features and the generated features are synthesized, random feature vectors are spliced after the main data category features;
the dimensions of the first feature and the second feature are the same, and after matrixing both are expressed as:
$$U = \begin{pmatrix} u_{11} & u_{12} & \cdots & u_{1n} \\ u_{21} & u_{22} & \cdots & u_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ u_{m1} & u_{m2} & \cdots & u_{mn} \end{pmatrix}$$
where $u_{1i}$, the i-th element of the first row of the matrix U, represents the i-th main data category feature; $u_{ji}$ (j > 1), the element in the j-th row and i-th column of the matrix U, represents the field of the j-th main data corresponding to the i-th main data category feature; m represents the total number of main data; and n represents the total number of main data category features of one main data.
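The splicing and matrixing steps can be sketched as follows. This is an illustrative reading, not the patented implementation: each feature is treated as a single number so the layout is easy to see, and the helper names `splice` and `to_matrix` as well as the sample values are hypothetical. The category features come first, the features of each selected piece of main data are spliced after them, and the flat result is cut into rows so that the first row holds the category features and each later row holds one piece of main data.

```python
def splice(*vectors):
    """Concatenate feature vectors in the order given."""
    result = []
    for vector in vectors:
        result.extend(vector)
    return result

def to_matrix(flat, n_cols):
    """Cut a flat feature vector into rows of n_cols elements."""
    if len(flat) % n_cols != 0:
        raise ValueError("feature length must be a multiple of the column count")
    return [flat[i:i + n_cols] for i in range(0, len(flat), n_cols)]

# Hypothetical values: one scalar per field name, three field names in total.
category_features = [0.1, 0.2, 0.3]
# Hypothetical main data features for two selected pieces of main data.
main_data_features = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

second_feature = splice(category_features, *main_data_features)
U = to_matrix(second_feature, n_cols=len(category_features))
```

In this sketch `U[0]` is the row of category features and `U[j][i]` for `j >= 1` is the field of one piece of main data under the i-th category feature (zero-based indices).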
For the word parts of the data (including the field names of the main data), word vectors are obtained by processing them with a Skip-Gram model. The main data features are generated by directly combining the word vectors extracted from the main data with general features, where general features are those contents of the main data that can be used directly as vector elements.
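A Skip-Gram model learns word vectors by predicting the context words that surround each center word. The sketch below shows only how the (center, context) training pairs are formed, not the training itself; the window size, the function name `skipgram_pairs` and the sample tokens are assumptions for illustration.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical field name split into words.
tokens = ["resident", "id", "number"]
pairs = skipgram_pairs(tokens, window=1)
```

In practice these pairs would be fed to a word-embedding trainer, and the learned vectors would then serve as the word vectors described above.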
To ensure consistent dimensions of the generated second features, the scope of the primary data may be limited, e.g., primary data in the primary data set are all of the same class, generated based on the same primary data table;
the first neural network and the second neural network together form a generative adversarial network.
In one embodiment of the invention, the first neural network and the second neural network are jointly trained. The training loss function is defined in terms of the following quantities: the loss value; a count equal to the number of training samples in the training set; y, a set constant; the probability value that the second neural network outputs for the classification label corresponding to the second feature when the second feature of the t-th training sample is input; and the probability value that the second neural network outputs for the classification label corresponding to the first feature when the g-th first feature of the t-th training sample is input;
the second neural network outputs through a softmax (normalized exponential function) layer, and the output value is a probability value.
The default value of y is 12.
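The softmax output of the second neural network can be illustrated with a short sketch. The input scores are made-up numbers, and treating index 0 as the "second feature" label and index 1 as the "first feature" label is an assumption made only for this example.

```python
import math

def softmax(scores):
    """Normalized exponential function; subtracting the max keeps exp() stable."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for the two classification labels.
probs = softmax([2.0, 0.5])
```

The two returned values are non-negative and sum to 1, so each can be read as the probability of the corresponding classification label.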
The training samples for the joint training are derived from original data sets for which main data has already been constructed. The generated feature extractor 103 performs multiple extractions on one original data set serving as a training sample to obtain a plurality of generated features, so that a plurality of basic features can be synthesized and a plurality of first features generated through the first neural network.
The original data set to be processed and the training samples of the training set are generally derived from the same type of data source, for example, from regional government affairs data.
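The repeated random extraction performed by the generated feature extractor 103 can be sketched as follows. Several points are assumptions rather than details from the source: words are obtained by whitespace splitting, each word is turned into a unit feature vector by a deterministic placeholder encoding, and the unit vectors are combined by concatenation. The names `unit_vector` and `extract_generated_feature` and the sample text are hypothetical.

```python
import random

def unit_vector(token, dim=4):
    """Placeholder encoding: a repeatable pseudo-random vector per token."""
    token_rng = random.Random(token)  # seeding with the token makes it repeatable
    return [token_rng.uniform(-1.0, 1.0) for _ in range(dim)]

def extract_generated_feature(tokens, k, rng, dim=4):
    """Randomly pick k tokens and concatenate their unit feature vectors."""
    chosen = rng.sample(tokens, k)
    feature = []
    for token in chosen:
        feature.extend(unit_vector(token, dim))
    return feature

# Hypothetical content of one original data set.
raw_tokens = "applicant name certificate number registered address issue date".split()

rng = random.Random(42)
# Several extractions from the same data set yield several generated features.
generated_features = [
    extract_generated_feature(raw_tokens, k=3, rng=rng) for _ in range(5)
]
```

Each generated feature can then be synthesized with the main data category features into one basic feature, giving several first features per training sample.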
The category feature generation module 101 generates a main data category feature based on field names of main data of an original data set of the training set.
A main data generation module 105 for inputting the field names of main data entered by a user into the category feature generation module 101 to generate main data category features; synthesizing the main data category features with generated features produced from the original data set for which main data is to be generated, to obtain basic features; inputting the basic features into the first neural network of the main data generation model; and obtaining, based on the first feature generated by the first neural network, the main data of that original data set and the field names corresponding to the main data.
The fields of the generated main data need to be mapped to the corresponding field names.
The field names obtained from the first feature generated by the first neural network may differ from the field names of the main data entered by the user.
In one embodiment of the present invention, the main data generation module 105 produces a plurality of generated features from the original data set for which main data is to be generated, synthesizes a plurality of basic features from them, inputs each basic feature into the first neural network to obtain a plurality of groups of main data, and deletes the duplicate main data from these groups to obtain the final main data set.
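The final deduplication step can be sketched as follows, assuming each piece of main data is represented as a mapping from field name to value and that two pieces are duplicates when all their fields match. The function name `merge_groups` and the sample records are hypothetical.

```python
def merge_groups(groups):
    """Merge groups of main data, keeping the first copy of each record."""
    seen = set()
    merged = []
    for group in groups:
        for record in group:
            # Sort the items so that field order does not affect equality.
            key = tuple(sorted(record.items()))
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Hypothetical groups of main data obtained from two basic features.
groups = [
    [{"name": "A", "code": "001"}, {"name": "B", "code": "002"}],
    [{"code": "001", "name": "A"}, {"name": "C", "code": "003"}],
]
final_main_data = merge_groups(groups)
```

Here the record with code "001" appears in both groups but is kept only once in `final_main_data`.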
The embodiments have been described above, but they are not limited to the specific implementations described, which are illustrative rather than restrictive; the many variations that those of ordinary skill in the art may make with the benefit of this disclosure fall within the scope of the embodiments.