CN116662434B - Multi-source heterogeneous big data processing system - Google Patents

Multi-source heterogeneous big data processing system

Info

Publication number
CN116662434B
Authority
CN
China
Prior art keywords
main data
feature
data
features
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310736600.3A
Other languages
Chinese (zh)
Other versions
CN116662434A (en)
Inventor
张晶
董哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Weijia Information Technology Co ltd
Original Assignee
Hebei Weijia Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2023-06-21
Filing date
2023-06-21
Publication date
2023-10-13
Application filed by Hebei Weijia Information Technology Co ltd
Priority to CN202310736600.3A
Publication of CN116662434A
Application granted
Publication of CN116662434B
Legal status: Active

Abstract

The invention relates to the technical field of big data and discloses a multi-source heterogeneous big data processing system comprising: a category feature generation module that generates main data category features based on the field names of main data, each field name corresponding to one main data category feature; a data source feature generation module that generates data source features based on the original data set linked to the main data; a generated feature extractor that randomly extracts characters and/or words from the original data set to generate unit feature vectors and then combines the unit feature vectors to obtain generated features; a model generation module that generates a main data generation model; and a main data generation module that generates, for an original data set whose main data is to be generated, the main data and the field names corresponding to the main data. The invention can automatically generate main data matched to big data with a limited source range, and structures and unifies the big data through the main data.

Description

Multi-source heterogeneous big data processing system
Technical Field
The invention relates to the technical field of big data, in particular to a multi-source heterogeneous big data processing system.
Background
Big data comprises structured, semi-structured, and unstructured data, and unstructured data increasingly makes up the main part. For big data with a limited source scope, such as regional government big data, the need to structure and unify the data is greater than the need to mine it, but structuring and unifying big data by manually extracting features takes a long time.
Disclosure of Invention
The invention provides a multi-source heterogeneous big data processing system that solves the technical problem in the related art that structuring and unifying big data by manually extracting features takes a long time.
The invention provides a multi-source heterogeneous big data processing system, which comprises:
a category feature generation module that generates main data category features based on the field names of main data, each field name corresponding to one main data category feature;
a data source feature generation module that generates data source features based on the original data set linked to the main data;
a generated feature extractor for randomly extracting characters and/or words from the original data set to generate unit feature vectors, and then combining the unit feature vectors to obtain generated features;
a model generation module for generating a main data generation model;
the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network, and a second neural network, wherein the feature synthesis module synthesizes main data category features and generated features into basic features, and the first neural network takes the basic features as input and outputs first features;
the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each selected piece of main data, and synthesizes all the generated main data features with the main data category features to generate a second feature;
the second feature and the first feature are input into the second neural network, whose output is mapped to a classification space containing two classification labels, one indicating that the input is a second feature and the other indicating that the input is a first feature;
word vectors are obtained by processing the text in the main data with a Skip-Gram model; the main data features are generated by directly combining the word vectors extracted from the main data with general vectors, the general vectors being features generated directly from main data content that can be used as vector content as-is;
the original data set and the training samples of the training set of the main data generation model are derived from regional government affairs data;
the main data generation module inputs the field names of main data supplied by the user into the category feature generation module to generate main data category features; synthesizes the main data category features with generated features produced from the original data set whose main data is to be generated to obtain basic features; inputs the basic features into the first neural network of the main data generation model; and obtains, from the first feature generated by the first neural network, the main data of that original data set and the field names corresponding to the main data.
Further, the original data set linked to the main data refers to the original data set that needs to be associated with the main data.
Further, the first neural network and the second neural network are both multi-layer perceptrons.
Further, the main data feature is spliced after the main data category feature when the main data feature and the main data category feature are synthesized.
Further, when the main data category features are synthesized with the generated features, random feature vectors are spliced after the main data category features.
Further, the dimensions of the first feature and the second feature are the same, and after matrixing each is expressed as a matrix $U$, where $u_{1i}$, the $i$-th element of the first row of $U$, represents the $i$-th main data category feature; $u_{ji}$, the element in row $j$ and column $i$ of $U$, represents the field of the $j$-th main data corresponding to the $i$-th main data category feature; $m$ is the total number of main data; and $n$ is the total number of main data category features of one piece of main data.
Further, the second neural network outputs through the softmax layer, and the output value is a probability value.
Further, the first neural network and the second neural network are jointly trained with a loss function whose terms are defined as follows (the formula and its symbols are not preserved in this text): one term denotes the loss value; one equals the number of training samples in the training set; $y$ is a preset constant; one is the probability value of the classification label corresponding to the second feature, output by the second neural network when the second feature of the $t$-th training sample is input; and one is the probability value of the classification label corresponding to the first feature, output by the second neural network when the $g$-th first feature of the $t$-th training sample is input.
Further, the training sample of the joint training is derived from the original data set which has constructed the main data, and the generated feature extractor extracts a plurality of times from one of the original data sets as the training sample to obtain a plurality of generated features, so that a plurality of basic features can be synthesized, and a plurality of first features are generated through the first neural network.
Further, the main data generating module generates a plurality of generating features from an original data set of main data to be generated, synthesizes a plurality of basic features respectively, inputs the synthesized basic features into the first neural network respectively to obtain a plurality of groups of main data, and deletes repeated main data from the plurality of groups of main data to obtain a final main data set.
The invention has the beneficial effects that: the invention can automatically generate the main data matched with the big data with limited source range, and the big data is structured and unified through the main data.
Drawings
FIG. 1 is a schematic block diagram of a multi-source heterogeneous big data processing system of the present invention.
In the figure: category feature generation module 101, data source feature generation module 102, generated feature extractor 103, model generation module 104, and main data generation module 105.
Detailed Description
The subject matter described herein will now be discussed with reference to example embodiments. It is to be understood that these embodiments are merely discussed so that those skilled in the art may better understand and implement the subject matter described herein and that changes may be made in the function and arrangement of the elements discussed without departing from the scope of the disclosure herein. Various examples may omit, replace, or add various procedures or components as desired. In addition, features described with respect to some examples may be combined in other examples as well.
As shown in fig. 1, a multi-source heterogeneous big data processing system includes:
a category feature generation module 101 that generates main data category features based on the field names of main data, each field name corresponding to one main data category feature;
a data source feature generation module 102 that generates data source features based on the original data set linked to the main data;
the original data set linked to the main data is the original data set that needs to be associated with the main data; conversely, the information in the main data is derived from that original data set.
A generated feature extractor 103 for randomly extracting characters and/or words from the original data set to generate unit feature vectors, and then combining the unit feature vectors to obtain generated features;
a model generation module 104 for generating a main data generation model;
the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network and a second neural network, wherein the feature synthesis module is used for synthesizing main data category features and generation features to generate basic features, and the first neural network inputs the basic features and then outputs the first features;
the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each piece of extracted main data, and synthesizes all the generated main data features and main data category features to generate a second feature;
the second feature and the first feature are input into a second neural network, the output of the second neural network is mapped to a classification space, and the classification space contains two classification labels which respectively represent the input as the second feature and the input as the first feature.
The first neural network and the second neural network are ordinary neural networks. In one embodiment of the invention, the first neural network and the second neural network are both multi-layer perceptrons;
in another embodiment of the invention, the first neural network and the second neural network are both convolutional neural networks.
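The data flow between the two networks can be illustrated with a minimal sketch of the multi-layer perceptron embodiment. The layer sizes, the random initial weights, the ReLU activation, and the names `make_layer` and `forward` are all assumptions made for illustration; the text does not specify them, and no training is performed here.

```python
import random

def make_layer(n_in, n_out, rng):
    """Create one fully connected layer with small random weights (illustrative only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases

def forward(layers, x):
    """Run a vector through an MLP; ReLU on hidden layers, linear output layer."""
    for idx, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if idx < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

rng = random.Random(0)
# Hypothetical sizes: basic feature of length 8, first/second feature of length 6.
first_network = [make_layer(8, 16, rng), make_layer(16, 6, rng)]
second_network = [make_layer(6, 16, rng), make_layer(16, 2, rng)]

basic_feature = [rng.random() for _ in range(8)]
first_feature = forward(first_network, basic_feature)   # output of the first network
logits = forward(second_network, first_feature)         # two scores, one per label
```

The second network ends in two outputs because the classification space holds two labels (input is a second feature, or input is a first feature).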
In one embodiment of the invention, features are synthesized by splicing (concatenating) feature vectors; e.g., for two vectors $a = (a_1, \dots, a_p)$ and $b = (b_1, \dots, b_q)$, the result after synthesis is $(a_1, \dots, a_p, b_1, \dots, b_q)$.
When main data features and main data category features are synthesized, the main data features are spliced after the main data category features;
when main data category features and generated features are synthesized, the random feature vectors of the generated features are spliced after the main data category features;
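The splicing rule can be sketched as plain vector concatenation in which the main data category feature always comes first. The function name `synthesize` and the sample values are hypothetical.

```python
def synthesize(category_feature, other_feature):
    """Concatenate two feature vectors; `other_feature` is appended after the category feature."""
    return list(category_feature) + list(other_feature)

category_feature = [0.1, 0.2]          # hypothetical main data category feature
generated_feature = [0.7, 0.8, 0.9]    # hypothetical generated feature
basic_feature = synthesize(category_feature, generated_feature)
```

The same helper covers both cases in the text, since only the second argument changes (a main data feature or a generated feature).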
the dimensions of the first feature and the second feature are the same, and after matrixing each is expressed as a matrix $U$, where $u_{1i}$, the $i$-th element of the first row of $U$, represents the $i$-th main data category feature; $u_{ji}$ ($j > 1$), the element in row $j$ and column $i$ of $U$, represents the field of the $j$-th main data corresponding to the $i$-th main data category feature; $m$ is the total number of main data; and $n$ is the total number of main data category features of one piece of main data.
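The matrix layout described here can be sketched as follows: the first row holds the main data category features and each later row holds the fields of one piece of main data, column-aligned with those category features. The function name, the string placeholders, and the exact row count are assumptions; the original matrix is not preserved in the text.

```python
def build_feature_matrix(category_features, main_data_rows):
    """Stack category features (first row) on top of one row per piece of main data."""
    n = len(category_features)
    for row in main_data_rows:
        if len(row) != n:
            raise ValueError("each main data row needs one field per category feature")
    return [list(category_features)] + [list(row) for row in main_data_rows]

# Hypothetical placeholders standing in for feature values.
U = build_feature_matrix(
    ["name_feat", "dept_feat", "date_feat"],
    [["n1", "d1", "t1"],
     ["n2", "d2", "t2"]],
)
```

Column `i` of every row refers to the same category feature, which is what lets a first feature and a second feature share one shape.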
For text in the data (including the field names of main data), word vectors are obtained by processing the text with a Skip-Gram model. The main data features are generated by directly combining the word vectors extracted from the main data with general vectors, where general vectors are features generated directly from main data content that can be used as vector content as-is.
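As background on Skip-Gram, the model is trained on (center word, context word) pairs drawn from a sliding window over the text. The sketch below only builds those training pairs; the window size and the sample tokens are assumptions, and the embedding training itself is omitted.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

# Hypothetical tokens from a field name.
pairs = skipgram_pairs(["resident", "id", "number"], window=1)
```

A trained Skip-Gram model would then map each token to a dense word vector, which is what the text combines with the general vectors.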
To ensure that the generated second features have consistent dimensions, the scope of the main data may be limited; e.g., the main data in the main data set all belong to the same category and are generated from the same main data table;
the first neural network and the second neural network together form a generative adversarial network.
In one embodiment of the invention, the first neural network and the second neural network are jointly trained with a loss function whose terms are defined as follows (the formula and its symbols are not preserved in this text): one term denotes the loss value; one equals the number of training samples in the training set; $y$ is a preset constant; one is the probability value of the classification label corresponding to the second feature, output by the second neural network when the second feature of the $t$-th training sample is input; and one is the probability value of the classification label corresponding to the first feature, output by the second neural network when the $g$-th first feature of the $t$-th training sample is input;
the second neural network outputs through a softmax (normalized exponential function) layer, and the output value is a probability value.
The default value of y is 12.
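The softmax output layer mentioned above can be sketched in a few lines. This is the standard numerically stable form; the two input scores are made-up values, and which position corresponds to which label is an assumption.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)                          # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores: index 0 = "input is a second feature", index 1 = "input is a first feature".
probabilities = softmax([2.0, -1.0])
```

With two labels, the two probabilities are complementary, so either one fully describes the second network's decision.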
The training samples of the joint training are derived from original data sets for which main data has already been constructed. The generated feature extractor 103 samples one original data set serving as a training sample multiple times to obtain multiple generated features, from which multiple basic features can be synthesized and multiple first features generated by the first neural network;
the original data set to be processed and the training samples of the training set are generally derived from the same type of data source, for example, regional government affairs data.
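Repeated random extraction from one original data set can be sketched as below. The toy embedding, the number of tokens drawn per feature, and all names are assumptions; the text only states that characters and/or words are extracted at random, turned into unit feature vectors, and combined.

```python
import random

def toy_embed(token, dim=4):
    """Hypothetical stand-in for a unit feature vector (not a trained embedding)."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 1)) % 97) / 97.0 for i in range(dim)]

def extract_generated_feature(tokens, embed, k, rng):
    """Draw k tokens at random and concatenate their unit feature vectors."""
    feature = []
    for _ in range(k):
        feature.extend(embed(rng.choice(tokens)))
    return feature

rng = random.Random(0)
dataset_tokens = ["permit", "district", "2023", "applicant"]   # hypothetical tokens
# Sampling the same data set several times yields several generated features.
generated_features = [
    extract_generated_feature(dataset_tokens, toy_embed, k=3, rng=rng)
    for _ in range(5)
]
```

Each generated feature has the same length (`k` times the unit vector size), so every one can be spliced after the same main data category features.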
The category feature generation module 101 generates a main data category feature based on field names of main data of an original data set of the training set.
A main data generation module 105 inputs the field names of main data supplied by the user into the category feature generation module 101 to generate main data category features; synthesizes the main data category features with generated features produced from the original data set whose main data is to be generated to obtain basic features; inputs the basic features into the first neural network of the main data generation model; and obtains, from the first feature generated by the first neural network, the main data of that original data set and the field names corresponding to the main data.
The fields of the generated main data need to be mapped to their corresponding field names.
The field names corresponding to the main data obtained from the first feature generated by the first neural network may differ from the field names of main data input by the user.
In one embodiment of the present invention, the main data generating module 105 generates a plurality of generating features from the original data set of the main data to be generated, synthesizes the plurality of basic features respectively, inputs the synthesized plurality of basic features into the first neural network respectively to obtain a plurality of groups of main data, and deletes the repeated main data from the plurality of groups of main data to obtain a final main data set.
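The final merge step can be sketched as an order-preserving deduplication across the groups of main data. Representing each piece of main data as a dictionary from field name to value is an assumption, as are the sample records.

```python
def merge_main_data(groups):
    """Merge several groups of main data records, dropping exact duplicates."""
    seen = set()
    merged = []
    for group in groups:
        for record in group:
            key = tuple(sorted(record.items()))   # hashable, order-independent identity
            if key not in seen:
                seen.add(key)
                merged.append(record)
    return merged

# Hypothetical output of two passes through the first neural network.
group_a = [{"name": "Office A", "district": "North"}]
group_b = [{"district": "North", "name": "Office A"},
           {"name": "Office B", "district": "South"}]
final_main_data = merge_main_data([group_a, group_b])
```

Sorting the items before hashing means two records with the same fields in a different order count as the same main data.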
The embodiments have been described above, but the invention is not limited to the specific implementations described, which are merely illustrative and not restrictive; many variations that those of ordinary skill in the art can make with the benefit of this disclosure fall within its scope.

Claims (10)

a category feature generation module that generates main data category features based on the field names of main data, each field name corresponding to one main data category feature; a data source feature generation module that generates data source features based on the original data set linked to the main data; a generated feature extractor for randomly extracting characters and/or words from the original data set to generate unit feature vectors, and then combining the unit feature vectors to obtain generated features; a model generation module for generating a main data generation model; the main data generation model comprises a feature synthesis module, a second feature generator, a first neural network, and a second neural network, wherein the feature synthesis module synthesizes main data category features and generated features into basic features, and the first neural network takes the basic features as input and outputs first features; the second feature generator randomly selects N pieces of main data from the main data set, generates a main data feature for each selected piece of main data, and synthesizes all the generated main data features with the main data category features to generate a second feature; the second feature and the first feature are input into the second neural network, whose output is mapped to a classification space containing two classification labels, one indicating that the input is a second feature and the other indicating that the input is a first feature;

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202310736600.3A (CN116662434B, en) | 2023-06-21 | 2023-06-21 | Multi-source heterogeneous big data processing system

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202310736600.3A (CN116662434B, en) | 2023-06-21 | 2023-06-21 | Multi-source heterogeneous big data processing system

Publications (2)

Publication Number | Publication Date
CN116662434A (en) | 2023-08-29
CN116662434B (en) | 2023-10-13

Family

ID=87720639

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202310736600.3A (Active; CN116662434B, en) | Multi-source heterogeneous big data processing system | 2023-06-21 | 2023-06-21

Country Status (1)

Country | Link
CN (1) | CN116662434B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN113157678A (en) * | 2021-04-19 | 2021-07-23 | 中国人民解放军91977部队 | Multi-source heterogeneous data association method
CN113505222A (en) * | 2021-06-21 | 2021-10-15 | 山东师范大学 | Government affair text classification method and system based on text circulation neural network
CN113590818A (en) * | 2021-06-30 | 2021-11-02 | 中国电子科技集团公司第三十研究所 | Government affair text data classification method based on integration of CNN, GRU and KNN
CN113626511A (en) * | 2021-08-12 | 2021-11-09 | 山东勤成健康科技股份有限公司 | Heterogeneous database fusion access system
CN114462603A (en) * | 2022-02-09 | 2022-05-10 | 中国银行股份有限公司 | Knowledge graph generation method and device for data lake
CN114661810A (en) * | 2022-05-24 | 2022-06-24 | 国网浙江省电力有限公司杭州供电公司 | Lightweight multi-source heterogeneous data fusion method and system
CN115908022A (en) * | 2022-12-05 | 2023-04-04 | 中信银行股份有限公司 | Abnormal transaction risk early warning method and system based on network modeling
CN115936624A (en) * | 2022-12-26 | 2023-04-07 | 中国电信股份有限公司 | Basic level data management method and device
CN116226238A (en) * | 2023-05-06 | 2023-06-06 | 合肥尚创信息技术有限公司 | Multi-dimensional heterogeneous big data mining method and system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20220245490A1 (en) * | 2021-02-03 | 2022-08-04 | Royal Bank Of Canada | System and method for heterogeneous multi-task learning with expert diversity

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
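Hui Guobao. A multi-source heterogeneous data fusion method based on deep learning. Modern Navigation (《现代导航》), 2017, (No. 03), 65-70. *
Li Gang et al. Visualization preprocessing method for electric power big data based on information fusion. Guangdong Electric Power (《广东电力》), 2016, Vol. 29 (No. 12), 10-14. *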

Also Published As

Publication number | Publication date
CN116662434A (en) | 2023-08-29

Similar Documents

Publication | Title
US20180157636A1 (en) | Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
CN103593194A (en) | Object serialization method and device
US20150089415A1 (en) | Method of processing big data, apparatus performing the same and storage media storing the same
CN113836038A (en) | Test data construction method, device, equipment and storage medium
US20250245256A1 (en) | Information processing system and method for processing information
CN110245228A (en) | The method and apparatus for determining text categories
CN111858617B (en) | User searching method and device, computer readable storage medium and electronic equipment
CN111506608A (en) | Method and device for comparing structured texts
CN113449808A (en) | Multi-source image-text information classification method and corresponding device, equipment and medium
CN111881664A (en) | An information extraction method, device, device and medium combining RPA and AI
CN116796758A (en) | Dialogue interaction method, dialogue interaction device, equipment and storage medium
CN118035423A (en) | Information query method, device, computer equipment and storage medium
Kim et al. | Deep-learned event variables for collider phenomenology
CN115438240B (en) | Data processing method, device, electronic device and storage medium
CN116226238A (en) | Multi-dimensional heterogeneous big data mining method and system
CN116662434B (en) | Multi-source heterogeneous big data processing system
Quoc et al. | A Vision-Language Foundation Model for Leaf Disease Identification
CN115546577A (en) | Data enhancement method and device for multi-modal data set
Ding et al. | Research on the Application of Improved Attention Mechanism in Image Classification and Object Detection.
US20070174306A1 (en) | Data extraction and conversion methods and apparatuses
CN113780365A (en) | Sample generation method and device
CN119066698A (en) | A dynamic desensitization method for structured data
CN113010220A (en) | Component type data processing method and system
CN114185536A (en) | Credit investigation data processing method and device, computer equipment and storage medium
CN116975885A (en) | Method, system, equipment and storage medium for generating reconciliation document based on configuration

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
