Disclosure of Invention
In view of the above, the embodiments of the present application provide a data classification model training method, a data processing method, a device, an electronic apparatus, and a storage medium, so as to at least solve the problems of low efficiency and poor effect of screening attribute names in the related art.
The technical scheme of the embodiment of the application is realized as follows:
The embodiment of the application provides a data classification model training method, which comprises the following steps:
Inputting a first characteristic tensor into a data classification model, and outputting a second characteristic tensor, wherein each row of first vectors in the first characteristic tensor corresponds to a first attribute name representing a data object;
Clustering each first vector based on the output second characteristic tensor to obtain at least one cluster;
labeling first attribute names corresponding to a set number of first vectors under each cluster to obtain a first sample set;
And determining a loss value based on the first sample set, and updating the weight parameters of the data classification model according to the determined loss value until all the determined first sample sets meet the set training ending condition.
In the above solution, labeling the first attribute names corresponding to the set number of first vectors in each cluster to obtain a first sample set includes:
combining the first attribute names corresponding to any two first vectors in the set number of first vectors under each cluster to obtain at least one first data set;
Labeling each first data set in the at least one first data set according to whether two first attribute names included in the first data set correspond to the same type of attribute, and obtaining a labeling result corresponding to each first data set;
And determining the first sample set based on the labeling result.
In the above solution, the determining the first sample set based on the labeling result includes:
determining at least one second data set based on the labeling result;
The first sample set is obtained according to all the determined second data sets, wherein the second data sets are composed of two first data sets meeting set composition conditions, the set composition conditions represent different labeling results corresponding to the two first data sets, and the two first data sets have the same first attribute name.
In the above scheme, the inputting the first feature tensor into the data classification model includes:
determining a corresponding first feature tensor based on at least one type of first information;
And inputting the determined first characteristic tensor into a data classification model.
In the above solution, the determining, based on at least one type of first information, a corresponding first feature tensor includes:
Constructing a bipartite graph based on the at least one type of first information, and determining a corresponding first characteristic tensor;
and/or,
And performing word segmentation on the at least one type of first information, and determining a corresponding first characteristic tensor based on a word vector corresponding to a word segmentation result.
In the above scheme, the corresponding first characteristic tensor is determined at least twice, and inputting the determined first characteristic tensor into the data classification model comprises the following steps:
nonlinear transformation is carried out on the determined at least two first characteristic tensors, and a third characteristic tensor is obtained;
And inputting the obtained third characteristic tensor into a data classification model.
In the above scheme, the set training ending condition includes:
The labeling result represents that the first attribute names corresponding to the first vectors of the set quantity under the same cluster correspond to the same type of attribute; and the labeling result represents a result obtained by labeling the first attribute names corresponding to the first vectors with the set quantity under each cluster.
The embodiment of the application also provides a data processing method, which comprises the following steps:
inputting a fourth characteristic tensor into the data classification model, and outputting a fifth characteristic tensor, wherein each row of second vectors in the fourth characteristic tensor corresponds to a second attribute name representing the data object;
Clustering each second vector based on the output fifth feature tensor to obtain at least one cluster;
determining a third attribute name corresponding to each cluster based on the cluster center corresponding to each cluster obtained by clustering, wherein the third attribute name represents the attribute name of the corresponding cluster,
The data classification model is obtained by training the data classification model training method according to any one of the above.
In the above solution, the inputting the fourth feature tensor into the data classification model includes:
determining a corresponding fourth feature tensor based on the at least one type of second information;
And inputting the determined fourth characteristic tensor into a data classification model.
In the above solution, the determining, based on the at least one type of second information, a corresponding fourth feature tensor includes:
constructing a bipartite graph based on the at least one type of second information, determining a corresponding fourth feature tensor, and/or,
And performing word segmentation on the at least one type of second information, and determining a corresponding fourth characteristic tensor based on a word vector corresponding to a word segmentation result.
In the above solution, the determining, based on the cluster center corresponding to each cluster obtained by clustering, the third attribute name corresponding to each cluster includes:
And determining a second attribute name corresponding to the second vector closest to the corresponding cluster as a third attribute name of the corresponding cluster based on the cluster center corresponding to each cluster obtained by clustering.
The embodiment of the application also provides a training device for the data classification model, which comprises the following steps:
The first classification unit is used for inputting a first characteristic tensor into the data classification model and outputting a second characteristic tensor, wherein each row of first vectors in the first characteristic tensor corresponds to a first attribute name representing the data object;
the first clustering unit is used for clustering each first vector based on the output second characteristic tensor to obtain at least one cluster;
The labeling unit is used for labeling the first attribute names corresponding to the set number of first vectors under each cluster to obtain a first sample set;
And the training unit is used for determining a loss value based on the first sample set, and updating the weight parameters of the data classification model according to the determined loss value until all the determined first sample sets meet the set training ending condition.
The embodiment of the application also provides a data processing device, which comprises:
The second classification unit is used for inputting a fourth characteristic tensor into the data classification model and outputting a fifth characteristic tensor, wherein each row of second vectors in the fourth characteristic tensor corresponds to a second attribute name representing the data object;
the second clustering unit is used for clustering each second vector based on the output fifth characteristic tensor to obtain at least one cluster;
a processing unit, configured to determine a third attribute name corresponding to each cluster based on a cluster center corresponding to each cluster obtained by clustering, where the third attribute name characterizes an attribute name of a corresponding cluster,
The data classification model is obtained by training the data classification model training method according to any one of the above.
The embodiment of the application also provides a first electronic device comprising a first processor and a first memory for storing a computer program capable of running on the processor,
Wherein the first processor is configured to execute the steps of the data classification model training method described in any one of the above when executing the computer program.
The embodiment of the application also provides a second electronic device comprising a second processor and a second memory for storing a computer program capable of running on the processor,
Wherein the second processor is configured to execute the steps of any one of the data processing methods when the computer program is run.
The embodiment of the application also provides a storage medium, on which a computer program is stored, the computer program, when executed by a processor, implements the steps of the data classification model training method described in any one of the above, or implements the steps of the data processing method described in any one of the above.
In the data classification model training method, each row of first vectors in the first feature tensor corresponds to a first attribute name representing a data object, the first feature tensor is input into a data classification model, the second feature tensor is output, each first vector is clustered based on the distance between each row of elements corresponding to any two rows of first vectors in the output second feature tensor, at least one cluster is obtained, the first attribute names corresponding to the first vectors under each cluster are marked, a first sample set is obtained, and weight parameters of the data classification model are updated based on the first sample set until all the determined first sample sets meet set end training conditions. In this way, the first vectors are clustered based on the distance between the first vectors, the corresponding first attribute names are marked based on the clustering result, and the marked first attribute names are used as sample set training data classification models, so that the trained data classification models can accurately judge the association degree between the first attribute names, the trained data classification models are used for data processing on the basis, the attribute names can be efficiently screened from a large amount of data, and the data processing effect is improved.
Detailed Description
Because of characteristics of Internet content such as its large scale and loose organizational structure, acquiring information and knowledge from it presents challenges to users. The knowledge graph (Knowledge Graph), with its strong semantic processing capability, lays a foundation for intelligent information applications in the Internet age. Knowledge graphs are intended to describe the various entities and relationships that exist in the real world, and common description forms include "entity 1-relationship-entity 2" or "entity-attribute name-attribute value".
The standardization of the attribute names is an important step of information structuring of the knowledge graph, and comprises the steps of obtaining the attribute names of products, determining standard attribute names corresponding to the attribute names, and generating the knowledge graph by utilizing the determined standard attribute names so as to realize information comparison among different products.
In the related art, when one standardized attribute name is determined from a plurality of attribute names, all candidate attribute names are recalled through literal matching, and the standard attribute name is then manually screened from them. However, literally matched attribute names may correspond to different attributes, while literally unmatched attribute names may correspond to the same attribute. For example, in an electrical product, the attribute name "shell and frame" has the same meaning as "optional shell" but a low degree of literal matching; conversely, the two attribute names "optional shell" and "optional rated voltage" both include "optional" but the corresponding attributes are completely different. That is, the recalled candidate attribute names for manual screening have a low degree of association, candidate attribute names with a high degree of association are easily missed, and the standard attribute name still needs to be manually screened from the candidate attribute names, so the efficiency of screening attribute names is low and the effect is poor.
Based on the above, in the data classification model training method, each row of first vectors in the first feature tensor corresponds to a first attribute name representing a data object, the first feature tensor is input into the data classification model, the second feature tensor is output, each first vector is clustered based on the distance between each row of elements corresponding to any two rows of first vectors in the output second feature tensor, at least one cluster is obtained, the first attribute names corresponding to the first vectors under each cluster are marked, a first sample set is obtained, and weight parameters of the data classification model are updated based on the first sample set until all the determined first sample sets meet the set end training conditions. In this way, the first vectors are clustered based on the distance between the first vectors, the corresponding first attribute names are marked based on the clustering result, and the marked first attribute names are used as sample set training data classification models, so that the trained data classification models can accurately judge the association degree between the first attribute names, the trained data classification models are used for data processing on the basis, the attribute names can be efficiently screened from a large amount of data, and the data processing effect is improved. Meanwhile, compared with the scheme for manual screening based on literal matching recall candidate attribute names in the related art, the embodiment of the application can exclude attribute names with low association degree during labeling, thereby reducing the amount of data to be processed and improving the data processing efficiency.
The present application will be described in further detail with reference to the accompanying drawings and examples.
The embodiment of the application provides a data classification model training method, as shown in fig. 1, which comprises the following steps:
and 101, inputting the first characteristic tensor into a data classification model, and outputting a second characteristic tensor.
Each row of first vectors in the first characteristic tensor corresponds to a first attribute name representing the data object, and each row of elements in the second characteristic tensor corresponds to a distance representing any two rows of first vectors.
The first characteristic tensor is input into a data classification model, and the data classification model outputs a second characteristic tensor, where each row of elements in the second characteristic tensor correspondingly represents the distance between two rows of first vectors in the first characteristic tensor. Here, a data object may be defined by a set of attributes, including, but not limited to, an external entity, thing, contingent event or event, role, organization unit, place or structure, and one attribute name of the data object characterizes the name of one attribute in the set of attributes of the data object. In practical applications, the data object may be various products, such as a bookcase, a wardrobe, etc., and the attribute names of the wardrobe include size, volume, etc. The distance may be the Euclidean distance between the two rows of first vectors, or may be obtained by calculating the distance through a metric matrix M.
For example, the first feature tensor input into the data classification model comprises three first vectors: a first vector A (0 1 0 1), a first vector B (1 0 0 0) and a first vector C (0 1 0 0). Each row of the output second feature tensor then characterizes the Euclidean distance between the first vector A and the first vector B, between the first vector A and the first vector C, and between the first vector B and the first vector C, respectively (that is, √3, 1 and √2).
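By way of illustration only, the following sketch (assuming NumPy; the metric matrix M here is a placeholder identity matrix, which reduces the computation to the plain Euclidean distance) shows how each row of such a second feature tensor can be computed as a pairwise distance between rows of the first feature tensor:

```python
import numpy as np

# First feature tensor: one row per first attribute name (values taken from the example above)
X = np.array([[0, 1, 0, 1],   # first vector A
              [1, 0, 0, 0],   # first vector B
              [0, 1, 0, 0]])  # first vector C

def metric_distance(u, v, M):
    # Distance induced by a metric matrix M: sqrt((u - v)^T M (u - v));
    # with M = identity this reduces to the plain Euclidean distance.
    d = u - v
    return np.sqrt(d @ M @ d)

pairs = [(0, 1), (0, 2), (1, 2)]              # (A, B), (A, C), (B, C)
M = np.eye(X.shape[1])                        # placeholder metric matrix
second_tensor = np.array([[metric_distance(X[i], X[j], M)] for i, j in pairs])
print(second_tensor)                          # approx. [[1.732], [1.0], [1.414]]
```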
Wherein in an embodiment, the inputting the first feature tensor into the data classification model includes:
determining a corresponding first feature tensor based on at least one type of first information;
And inputting the determined first characteristic tensor into a data classification model.
Here, corresponding types of data may be acquired based on each type of information. Feature extraction may be performed by inputting at least one type of first information into a feature extraction layer, and the feature extraction layer performs feature extraction on each type of first information in the input at least one type of first information to obtain a corresponding first feature tensor. The category of the first information includes, but is not limited to, key value pair information and/or text information. The first information may take the form of structured data or unstructured data, and its sources include, but are not limited to, data obtained through crawling techniques or data stored in a set database.
In practical application, as shown in the schematic diagram of the product specification manual picture in fig. 2, the data in the schematic diagram are extracted through optical character recognition (OCR, Optical Character Recognition) to obtain attribute-related key value pair information, where the key value pair information includes an attribute name and optional attribute values. For example, the attribute name of "rated current" and the corresponding optional attribute values, including "2.5A", "6.3A", "12.5A", "16A", and the like, are extracted.
As shown in the specification table schematic diagram in fig. 3, extracting data in the schematic diagram to obtain text information related to the attribute. For example, text information such as "earth leakage protection module", "×" relay "," characteristic "," disconnecting switch "and the like is extracted.
In an embodiment, the determining, based on the at least one type of first information, a corresponding first feature tensor includes:
Constructing a bipartite graph based on the at least one type of first information, determining a corresponding first feature tensor, and/or,
And performing word segmentation on the at least one type of first information, and determining a corresponding first characteristic tensor based on a word vector corresponding to a word segmentation result.
Here, the category of the first information includes, but is not limited to, key value pair information and/or text information. The first feature tensor obtained by feature extraction based on at least one type of first information may be a feature representation based on a graph relationship learning model such as the LINE model. A bipartite graph is constructed based on the attribute names and optional attribute values of the key value pair information, where the optional attribute values represent the value ranges of the attribute values. The attribute names and attribute values are regarded as two distinct kinds of entities, and the value-taking relationship may be regarded as an edge between the two entities. The feature representation of an attribute-name entity reflects a physical attribute of the attribute name; for example, the values of "housing", "rated current" and "housing current code" are all current values, so they are all attribute names related to the current attribute, and the degree of association between these attribute names is higher. For another example, both "clothing color" and "shoe color" are attribute names associated with colors, so the degree of association between these attribute names is also higher. Thus, this value-taking relationship may be represented by constructing a bipartite graph of the attributes, as shown in FIG. 4, and a feature representation of each node can be extracted from the bipartite graph using a graph mining or graph learning algorithm.
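A minimal sketch of this bipartite-graph construction, assuming networkx and illustrative key value pairs; training a LINE-style embedding on the resulting graph to obtain the node feature representations would be a separate step not shown here:

```python
import networkx as nx

# Illustrative key value pair information: attribute name -> optional attribute values
kv_info = {
    "rated current": ["2.5A", "6.3A", "12.5A", "16A"],
    "housing":       ["2.5A", "6.3A"],
}

# Bipartite graph: attribute-name nodes on one side, attribute-value nodes on the other;
# an edge encodes the value-taking relationship between the two kinds of entities.
g = nx.Graph()
for name, values in kv_info.items():
    g.add_node(name, bipartite=0)
    for value in values:
        g.add_node(value, bipartite=1)
        g.add_edge(name, value)

# A graph-embedding model (e.g. a LINE-style learner) would then be trained on g to
# obtain the row vectors of the first feature tensor for each attribute-name node.
print(g.number_of_nodes(), g.number_of_edges())
```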
In practical applications, there is a correlation between the literal meaning of the attribute name and the attribute characterized by the attribute name, for example, the "housing" and the "housing current code" are both attribute names describing the housing, and then the degree of correlation between them is higher than that between the "housing" and the "rated current". Thus, the feature extraction layer may be further trained in combination with the literal matching, thereby enabling the resulting first feature tensor to more accurately characterize the first information.
The method of extracting the first feature tensor based on the at least one type of first information may be based on feature representation of word embedding. Semantic analysis is performed by extracting word vectors from text data associated with attribute names. The text data extraction method comprises the steps of segmenting an attribute name to obtain at least one segmented word corresponding to the attribute name, and processing semantic vectors of the at least one segmented word to obtain a first vector corresponding to the attribute name. Here, the processing method includes, but is not limited to, stitching together the semantic vectors of at least one of the tokens and weighting and adding the semantic vectors of at least one of the tokens. The first vector of attribute names may be used to calculate the contextual semantic similarity of the attribute names. Here, the word vector model may be a Skip-gram model. As shown in fig. 5, the word vectors output by the model reflect the semantic similarity of the context between words. Thus, the attribute names can be represented by the word vectors output by the model, so that the association degree between different attribute names can be judged.
In practical applications, the semantic similarity of the context between the "housing" and the "shell" is higher, the semantic similarity of the context between the "housing" and the "current" is lower, and the semantic similarity of the context between the "housing" and the "nominal" is lower.
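A sketch of the word-embedding route, assuming gensim's Word2Vec in Skip-gram mode and an illustrative, already word-segmented corpus; the attribute-name vector is obtained here by averaging the semantic vectors of its segments, one of the processing methods mentioned above:

```python
import numpy as np
from gensim.models import Word2Vec

# Illustrative word-segmented corpus built from text data associated with attribute names
corpus = [["housing", "current", "code"],
          ["rated", "current"],
          ["housing"],
          ["rated", "voltage"]]

# sg=1 selects the Skip-gram training mode
model = Word2Vec(sentences=corpus, vector_size=32, window=2, min_count=1, sg=1)

def attribute_vector(tokens, model):
    # Average the semantic vectors of the segmentation result to get the vector of the attribute name
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0)

v_housing_code = attribute_vector(["housing", "current", "code"], model)
v_rated_current = attribute_vector(["rated", "current"], model)
```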
The first feature tensor obtained by feature extraction based on at least one type of first information may be a combination of feature representation learned based on graph relationship and feature representation of word embedding.
Thus, processing the first information by at least one feature representation method allows feature tensors corresponding to data from more sources to be input and trained, which improves the universality of the data classification model. Meanwhile, a feature representation combining the value-taking relationship and the semantics corresponding to the attribute names can more accurately characterize the features of the corresponding attribute names, so that the data classification effect of the model is better. In addition, decoupling the two models of representation learning and metric learning makes the feature extraction process of representation learning and the metric learning process independent of each other, so that the data classification model obtained through training can be suitable for more application scenarios and applied to more electronic equipment.
In one embodiment, the corresponding first feature tensor is determined at least twice, and the inputting of the determined first feature tensor into the data classification model comprises the following steps:
nonlinear transformation is carried out on the determined at least two first characteristic tensors, and a third characteristic tensor is obtained;
And inputting the obtained third characteristic tensor into a data classification model.
At least two corresponding first feature tensors are determined based on at least one type of first information, nonlinear transformation is carried out on the determined at least two first feature tensors, and the at least two first feature tensors are mapped to a new feature space, so that a third feature tensor is obtained. The way the at least two first feature tensors are mapped to the new feature space can be expressed by equation (1), which may, for example, take the following form:

F = σ(G·WgT + S·WsT)    (1)

wherein Fi,j represents the element of the i-th row and j-th column in the third feature tensor F; Gi,j represents the element of the i-th row and j-th column in one first feature tensor G; Si,j represents the element of the i-th row and j-th column in the other first feature tensor S; WgT is the transposed matrix of the weight matrix Wg; WsT is the transposed matrix of the weight matrix Ws; and σ(·) is a nonlinear activation function.

Here, the mapping of the at least two first feature tensors to the new feature space may be obtained through a hidden-layer transformation of a deep neural network. Wg and Ws can be optimized based on the metric learning result, so that the distance between attribute names corresponding to the same attribute is as small as possible, and the distance between attribute names corresponding to different attributes is as large as possible.
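A sketch of this fusion step under the reading of equation (1) given above, assuming NumPy, a tanh activation, and illustrative shapes for the two first feature tensors and the weight matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.random((3, 8))        # first feature tensor from graph-relation learning (illustrative)
S = rng.random((3, 8))        # first feature tensor from word embedding (illustrative)

W_g = rng.random((8, 8))      # weight matrix for G; optimized based on the metric learning result
W_s = rng.random((8, 8))      # weight matrix for S; optimized based on the metric learning result

# Map both first feature tensors into a new feature space with a nonlinear (hidden-layer style)
# transformation; tanh is an assumed activation, the description only requires the mapping be nonlinear.
F = np.tanh(G @ W_g.T + S @ W_s.T)   # third feature tensor
```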
Step 102, clustering each first vector based on the output second characteristic tensor to obtain at least one cluster.
The first vectors are clustered based on the distance between the two rows of first vectors represented by each row of elements in the second characteristic tensor, so as to obtain at least one cluster. Here, clustering may be performed based on the similarity between the respective first vectors using a clustering algorithm such as k-means. In the clustering result, the similarity between the first vectors in the same cluster may be higher than a similarity threshold.
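A sketch of this clustering step, assuming scikit-learn's KMeans over illustrative row vectors; the number of clusters is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans

# Row vectors corresponding to the first attribute names (illustrative random values)
vectors = np.random.default_rng(1).random((200, 16))

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(vectors)
labels = kmeans.labels_                                  # cluster index of each first vector
clusters = {c: np.where(labels == c)[0] for c in range(kmeans.n_clusters)}
print({c: len(idx) for c, idx in clusters.items()})      # size of each cluster
```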
And 103, marking the first attribute names corresponding to the set number of first vectors under each cluster to obtain a first sample set.
A set number of first vectors under each cluster are randomly selected and the corresponding first attribute names are determined, where the ratio of the set number to the number of attribute names under each cluster meets a set proportion; the selected set number of first attribute names are annotated based on a set rule, and the first sample set is determined based on the annotation result. Here, the selected set number of first attribute names may or may not be grouped, and the grouping method includes, but is not limited to, classifying the set number of first attribute names and uniformly distributing them to each group in a set proportion. The annotation of the first attribute names may be performed manually.
For example, if the set proportion is 5% and there are 200 first attribute names under the first cluster, the set number is 10, and the first attribute names under the 10 first clusters are randomly selected for annotation.
In an embodiment, the labeling the set number of first attribute names under each cluster to obtain a first sample set includes:
combining any two first attribute names of the set number of first attribute names under each cluster to obtain at least one first data set;
Labeling each first data set in the at least one first data set according to whether two first attribute names included in the first data set correspond to the same type of attribute, and obtaining a labeling result corresponding to each first data set;
And determining the first sample set based on the labeling result.
Here, any two first attribute names of the set number of first attribute names in each cluster are combined pairwise to obtain at least one first data group (x, y) comprising the two first attribute names. The set number of first attribute names in each cluster may be labeled by manually judging whether the two first attribute names in each first data group correspond to the same type of attribute, thereby obtaining a labeling result corresponding to each first data group, and the first sample set is determined from the first attribute names based on the labeling result representing whether the two first attribute names in the first data group belong to the same type of attribute.
In practical application, manual labeling is performed as a two-classification task: if the two first attribute names correspond to the same type of attribute, the label is 0; otherwise, the label is 1.
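A sketch of sampling and pairwise combination within one cluster; the attribute names, the sampling ratio and the label_same_attribute stub (which stands in for the manual two-class annotation) are illustrative assumptions:

```python
import itertools
import random

def build_first_data_groups(cluster_attribute_names, set_ratio=0.05, min_count=2):
    # Randomly select the set number of first attribute names under the cluster
    k = max(min_count, int(len(cluster_attribute_names) * set_ratio))
    sampled = random.sample(cluster_attribute_names, k)
    # Combine any two of them pairwise into first data groups (x, y)
    return list(itertools.combinations(sampled, 2))

def label_same_attribute(x, y):
    # Stand-in for manual annotation: return 0 if x and y correspond to the same
    # type of attribute, 1 otherwise.
    raise NotImplementedError("manual two-class annotation")

names = ["housing", "shell and frame", "optional shell", "rated current"]
first_data_groups = build_first_data_groups(names, set_ratio=0.5)
print(first_data_groups)   # e.g. [("housing", "optional shell"), ...]
```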
Therefore, the task of manually screening a large number of attribute names is converted into a manually labeled two-classification task, which reduces the difficulty for the user of processing attribute names, lowers the error rate of screening attribute names, and improves efficiency. Meanwhile, since the first attribute names under each cluster are close in distance but do not necessarily correspond to the same type of attribute, training the model based on the first attribute names under each cluster can accelerate the model learning process and improve the training efficiency of the model.
In an embodiment, the determining the first sample set based on the labeling result includes:
determining at least one second data set based on the labeling result;
The first sample set is obtained according to all the determined second data sets, wherein the second data sets are composed of two first data sets meeting set composition conditions, the set composition conditions represent different labeling results corresponding to the two first data sets, and the two first data sets have the same first attribute name.
Based on the labeling result of the first data group, two first data groups satisfying the set construction condition are taken as data constituting the second data group. Here, the set composition condition characterizes that the two first data sets correspond to different labeling results, and that the two first data sets have and only have one identical first attribute name.
In practical application, taking three first attribute names x, y and z as an example, where the first data group (x, y) is labeled as corresponding to the same type of attribute and the first data group (x, z) is labeled as corresponding to different types of attributes, either one of two kinds of second data groups may be generated based on the labeling results: a ternary data group (x, y, z), or a pair of positive and negative binary data groups (x, y) and (x, z). A corresponding first sample set is obtained from either of the two kinds of second data groups.
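A sketch of constructing second data groups from labeled first data groups under the set composition condition (different labeling results and exactly one shared first attribute name); the attribute names used are taken from the background example and the 0/1 label convention from above:

```python
import itertools

def build_second_data_groups(labeled_groups):
    # labeled_groups: list of ((x, y), label), label 0 = same attribute, 1 = different
    triplets, pair_samples = [], []
    for (g1, l1), (g2, l2) in itertools.combinations(labeled_groups, 2):
        shared = set(g1) & set(g2)
        # set composition condition: different labeling results and exactly one shared name
        if l1 != l2 and len(shared) == 1:
            anchor = shared.pop()
            pos_group, neg_group = (g1, g2) if l1 == 0 else (g2, g1)
            positive = next(n for n in pos_group if n != anchor)
            negative = next(n for n in neg_group if n != anchor)
            triplets.append((anchor, positive, negative))      # ternary data group (x, y, z)
            pair_samples.append(((anchor, positive), 0))       # positive binary data group
            pair_samples.append(((anchor, negative), 1))       # negative binary data group
    return triplets, pair_samples

labeled = [(("optional shell", "shell and frame"), 0),
           (("optional shell", "optional rated voltage"), 1)]
print(build_second_data_groups(labeled))
```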
And 104, determining a loss value based on the first sample set, and updating the weight parameters of the data classification model according to the determined loss value until all the determined first sample sets meet the set training ending condition.
Here, based on whether the data groups of the first sample set are binary data groups or ternary data groups, the loss function of the data classification model is correspondingly set to a Contrastive Loss (binary group loss) or a Triplet Loss (ternary group loss).
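A sketch of the two loss choices, assuming PyTorch; the margin values and the illustrative embeddings are assumptions:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(emb_x, emb_y, label, margin=1.0):
    # label = 0: same type of attribute (pull embeddings together);
    # label = 1: different types (push apart, up to the margin)
    d = F.pairwise_distance(emb_x, emb_y)
    return ((1 - label) * d.pow(2) + label * torch.clamp(margin - d, min=0).pow(2)).mean()

triplet_loss = torch.nn.TripletMarginLoss(margin=1.0)

# Illustrative embeddings standing in for the model output of a mini-batch
anchor, positive, negative = (torch.randn(4, 16) for _ in range(3))
labels = torch.tensor([0.0, 1.0, 0.0, 1.0])

loss_for_ternary_groups = triplet_loss(anchor, positive, negative)   # Triplet Loss
loss_for_binary_groups = contrastive_loss(anchor, positive, labels)  # Contrastive Loss
```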
In an embodiment, the set training ending condition includes:
The labeling result represents that the first attribute names corresponding to the first vectors of the set quantity under the same cluster correspond to the same type of attribute; and the labeling result represents a result obtained by labeling the first attribute names corresponding to the first vectors with the set quantity under each cluster.
Here, the set number of first attribute names under each cluster are labeled, and when the obtained labeling result represents that the set number of first attribute names under each cluster corresponds to the same type of attribute, the attribute names in each cluster obtained by clustering can be considered to be accurately clustered according to the corresponding attribute type based on the data classification model obtained by training, so that the attribute names in each cluster obtained by clustering correspond to the same type of attribute.
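A small sketch of this set training-ending check, assuming the pair-label convention (0 = same type of attribute) used above:

```python
def training_finished(labeled_groups_per_cluster):
    # labeled_groups_per_cluster: {cluster_id: [((x, y), label), ...]} for the
    # set number of first attribute names sampled from each cluster.
    # Training ends once every labeled pair within a cluster carries label 0,
    # i.e. all sampled first attribute names of a cluster correspond to the same type of attribute.
    return all(label == 0
               for groups in labeled_groups_per_cluster.values()
               for _, label in groups)
```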
In the embodiment of the application, the first vectors are clustered based on the distances between the first vectors, the first attribute names corresponding to the clustering result are labeled, and the labeled first attribute names are used as a sample set to train the data classification model, so that the trained data classification model can accurately judge the degree of association between the first attribute names; on this basis, the trained data classification model is used for data processing, so that attribute names can be efficiently screened from a large amount of data and the data processing effect is improved. Meanwhile, compared with the scheme in the related art of manually screening candidate attribute names recalled through literal matching, the embodiment of the application can exclude attribute names with a low degree of association during labeling, thereby reducing the amount of data to be processed and improving the data processing efficiency. In addition, decoupling the two models of representation learning and metric learning makes the feature extraction process and the metric learning process independent of each other, so that the data classification model obtained through training can be suitable for more application scenarios and more electronic equipment.
Fig. 6 is a schematic flow chart of a data processing method according to an embodiment of the present application, and referring to fig. 6, the data processing method includes:
and 601, inputting the fourth characteristic tensor into a data classification model, and outputting a fifth characteristic tensor.
Wherein each row of second vectors in the fourth feature tensor corresponds to a second attribute name characterizing a data object; and each row of elements in the fifth characteristic tensor correspondingly represents the distance between any two rows of second vectors, and the data classification model is obtained by training the data classification model training method according to any one of the above.
Inputting the fourth characteristic tensor into a data classification model trained by the data classification model training method according to any one of the above, outputting a fifth characteristic tensor by the data classification model, wherein each row of elements in the fifth characteristic tensor represents the distance between two corresponding rows of second vectors in the fourth characteristic tensor. Here, a data object may be defined by a set of attributes including, but not limited to, an external entity, thing, contingent event or event, role, organization unit, place or structure, one attribute name of the data object characterizing the name of one of the set of attributes of the data object. In practical applications, the data object may be various products, such as a bookcase, a wardrobe, etc., and the attribute name of the wardrobe includes size, volume, etc.
In an embodiment, the inputting the fourth feature tensor into the data classification model includes:
determining a corresponding fourth feature tensor based on the at least one type of second information;
And inputting the determined fourth characteristic tensor into a data classification model.
Here, at least one type of second information may be input into the feature extraction layer, and feature extraction may be performed on each type of second information in the input at least one type of second information by the feature extraction layer, so as to obtain a corresponding fourth feature tensor. The category of the second information includes, but is not limited to, key value pair information and/or text information, and the form of the second information includes structured data and unstructured data, and the source includes, but is not limited to, being obtained through a crawling technology and being stored by a setting database.
In an embodiment, the determining the corresponding fourth feature tensor based on the at least one type of second information includes:
constructing a bipartite graph based on the at least one type of second information, determining a corresponding fourth feature tensor, and/or,
And performing word segmentation on the at least one type of second information, and determining a corresponding fourth characteristic tensor based on a word vector corresponding to a word segmentation result.
Here, the category of the second information includes, but is not limited to, key value pair information and/or text information. The fourth feature tensor obtained by feature extraction based on at least one type of second information may be a feature representation based on a graph relationship learning model such as the LINE model. A bipartite graph is constructed based on the attribute names and optional attribute values of the key value pair information, where the optional attribute values represent the value ranges of the attribute values. The attribute names and attribute values are regarded as two distinct kinds of entities, and the value-taking relationship may be regarded as an edge between the two entities. The feature representation of an attribute-name entity reflects a physical attribute of the attribute name.
In practical applications, there is a correlation between the literal meaning of the attribute name and the attribute characterized by the attribute name, for example, the "housing" and the "housing current code" are both attribute names describing the housing, and then the degree of correlation between them is higher than that between the "housing" and the "rated current". Thus, the feature extraction layer may be further trained in combination with literal matching, enabling the resulting fourth feature tensor to more accurately characterize the second information.
The fourth feature tensor obtained by extracting features based on at least one type of second information may be a feature representation based on word embedding. Semantic analysis is performed by extracting word vectors from text data associated with attribute names. The text data extraction method comprises the steps of segmenting the attribute names to obtain at least one segmented word corresponding to the attribute names, and processing semantic vectors of the at least one segmented word to obtain second vectors corresponding to the attribute names. Here, the processing method includes, but is not limited to, stitching together the semantic vectors of at least one of the tokens and weighting and adding the semantic vectors of at least one of the tokens. The second vector of attribute names may be used to calculate the contextual semantic similarity of the attribute names. The word vector model may be a Skip-gram model. As shown in fig. 5, the word vectors output by the model reflect the semantic similarity of the context between words. Thus, the degree of association between attribute names can be determined by the word vector output by the model.
The fourth feature tensor obtained by feature extraction based on the at least one type of second information may be a combination of feature representation learned based on graph relationship and feature representation of word embedding.
In this way, processing the second information by at least one feature representation method allows feature tensors corresponding to data from more sources to be input, which improves the universality of the data classification model. Meanwhile, a feature representation combining the value-taking relationship and the semantics corresponding to the attribute names can more accurately characterize the features of the corresponding attribute names, so that the data classification effect of the model is better. In addition, decoupling the two models of representation learning and metric learning makes the feature extraction process and the metric learning process independent of each other, so that the data classification model obtained through training can be suitable for more application scenarios and more electronic equipment.
Step 602, clustering the second vectors based on the output fifth characteristic tensor to obtain at least one cluster.
The second vectors are clustered based on the distance between the two rows of second vectors represented by each row of elements in the fifth feature tensor, so as to obtain at least one cluster. Here, clustering may be performed based on the similarity between the respective second vectors using a clustering algorithm such as k-means. In the clustering result, the similarity between the second vectors in the same cluster may be higher than a similarity threshold.
Step 603, determining a third attribute name corresponding to each cluster based on the cluster center corresponding to each cluster obtained by clustering.
Wherein the third attribute name characterizes the attribute name of the corresponding cluster.
For each cluster, a third attribute name is determined from the second vectors under the corresponding cluster according to the distance between each second vector and the cluster center vector obtained by clustering, and the determined third attribute name is taken as the standard attribute name of the corresponding cluster.
In an embodiment, the determining, based on the cluster center corresponding to each cluster obtained by clustering, the third attribute name corresponding to each cluster includes:
And determining a second attribute name corresponding to the second vector closest to the corresponding cluster as a third attribute name of the corresponding cluster based on the cluster center corresponding to each cluster obtained by clustering.
The cluster center vector can represent the common characteristics of the second attribute names corresponding to the second vectors in the cluster; the second attribute name whose second vector is closest to the cluster center vector is therefore determined as the third attribute name, and the determined third attribute name can serve, on behalf of the corresponding second attribute names, as the standard attribute name of the corresponding cluster.
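A sketch of this selection step, assuming scikit-learn KMeans output; the function name, cluster count and data layout are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def standard_attribute_names(vectors, attribute_names, n_clusters=8):
    # vectors: one row per second attribute name; attribute_names: list of the corresponding names
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(vectors)
    result = {}
    for c in range(n_clusters):
        idx = np.where(km.labels_ == c)[0]
        # the second vector closest to the cluster center gives the standard (third) attribute name
        dists = np.linalg.norm(vectors[idx] - km.cluster_centers_[c], axis=1)
        result[c] = attribute_names[idx[np.argmin(dists)]]
    return result
```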
In this way, a large number of second attribute names can be classified through the fifth characteristic tensor of characterized distances output by the data classification model, and the standard attribute name corresponding to each class of second attribute names is determined. Compared with recalling all candidate attribute names through literal matching in the related art, the embodiment of the application solves the problems of low efficiency and poor effect caused by manually screening the standard attribute names during attribute-name standardization, can efficiently screen attribute names from a large amount of data, and improves the data processing effect. Meanwhile, compared with the scheme in the related art of manually screening candidate attribute names recalled through literal matching, the embodiment of the application can exclude attribute names with a low degree of association during labeling, thereby reducing the amount of data to be processed and improving the data processing efficiency. In addition, decoupling the two models of representation learning and metric learning makes the feature extraction process and the metric learning process independent of each other, so that the data classification model obtained through training can be suitable for more application scenarios and more electronic equipment.
In the related art, the standardization of attribute names depends on manual judgment screening, and judgment needs to be performed manually in a large amount of data, so that the efficiency is low and the error rate is high. Examples of application of embodiments of the present application are given below:
In connection with the framework diagram of data processing shown in fig. 7, feature extraction is performed on the information based on representation learning, and the attribute names are described from different angles. The extracted features are fused based on metric learning, and the distances between attribute names are calculated. Active learning is then used to decouple the representation learning and the metric learning, so that, on the one hand, the existing features in the representation learning can be continuously optimized and new features can be determined, and, on the other hand, difficult samples can be screened and supervised learning of the metric-learning model can be performed in combination with manual labeling. The representation learning, the active learning and the metric learning are performed synchronously throughout the process, which can accelerate the model learning process and improve the training efficiency of the model. In addition, the labor cost of the data processing task can be reduced based on the active-learning data processing framework.
A flow chart of data processing as shown in fig. 8, comprising:
Step 801, data information preparation.
And capturing information of the data associated with the attribute, wherein the information comprises key value pair information and text information.
Step 802, performing representation learning based on the data.
Features are extracted, a distance metric model is continuously updated based on metric learning under an active learning framework, and the distance matrix of the attribute names in a high-dimensional space is optimized.
Here, the feature representation of the node may be extracted based on graph mining or graph learning algorithm of graph relation learning, or may be performed based on a word vector model such as Skip-gram model or the like.
Clustering is carried out based on a distance matrix of the attribute names; the distance matrix stores the distance calculation results among the vectors representing the attribute names, and clustering based on these vectors and distance results yields at least one attribute-name class cluster. Based on the clustering result, the selection of difficult samples in the sample space can be realized by adopting a set query strategy framework. Here, the set query strategy framework may be a set policy, and the difficult samples are selected according to the clustering result. Difficult samples refer to samples that are very close in distance but do not correspond to the same class of attributes. Thus, the active learning selects the difficult samples based on the set rule, so that the model learning becomes efficient.
A certain proportion of attribute names are randomly sampled from each class cluster, and the attribute names of the same class cluster are combined in pairs. The data groups of the difficult samples are manually labeled as a two-classification task; the labeling can be performed through an identification bit, where 0 represents that the data group belongs to the same type of attribute and 1 represents that the data group belongs to different types of attributes. A difficult-sample training set is generated based on the labeled data groups; either one of two kinds of second data groups may be generated, a ternary data group (x, y, z) or a pair of positive and negative binary data groups (x, y) and (x, z), where x and y are labeled as belonging to the same type of attribute and x and z are labeled as belonging to different types of attributes. The data classification model is trained by taking either kind of second data group as training data, and based on whether the data groups of the first sample set are binary data groups or ternary data groups, the loss function of the data classification model is correspondingly set to a Contrastive Loss (binary group loss) or a Triplet Loss. Training makes the distance output by the data classification model between attribute names corresponding to the same attribute as small as possible, and the distance between attribute names corresponding to different attributes as large as possible. Based on the training of the data classification model, the distance matrix corresponding to the attribute names is updated, and the above steps are repeated.
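A high-level sketch of this iterative loop; every name here (pairwise_distances, update, the *_fn callables) is an assumption standing in for the corresponding components discussed above (feature extraction, clustering, difficult-sample selection, manual labeling, metric-learning update), not part of the described method itself:

```python
def active_learning_loop(features, attribute_names, model,
                         cluster_fn, select_difficult_fn, annotate_fn,
                         max_rounds=10):
    # features: current feature tensors; model: metric-learning model assumed to expose
    # pairwise_distances(...) and update(...); the *_fn arguments are injected callables.
    for _ in range(max_rounds):
        distance_matrix = model.pairwise_distances(features)
        clusters = cluster_fn(distance_matrix)
        hard_pairs = select_difficult_fn(clusters, attribute_names)
        labeled = [(pair, annotate_fn(*pair)) for pair in hard_pairs]
        if all(label == 0 for _, label in labeled):   # set training-ending condition
            break
        model.update(labeled)                          # contrastive / triplet loss step
    return model
```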
Step 803, clustering based on the distance of the attribute names, and selecting standard attribute names of the clusters based on the clustering result.
Here, based on the distance matrix output by the optimized data classification model, a better clustering result is obtained, and the attribute name corresponding to the vector closest to the cluster center vector in each cluster is determined as the standard attribute name of the cluster. In this way, the standard attribute names corresponding to each cluster can be obtained, the obtained standard attribute names are used as attribute name dictionaries, and the attribute names can be normalized based on the obtained attribute name dictionaries.
In this application embodiment, the process of normalizing attribute names is modeled as optimizing the sample distances in a high-dimensional vector space. Meanwhile, active learning is used to decouple representation learning and metric learning, so that the feature iteration of representation learning and the metric learning are independent of each other. In addition, in the active learning, the standardization of attribute names converts a strong dependence on manual work (tasks with subjective standards, in which data is manually collected, compared and screened) into a weak dependence on manual work (two-classification tasks with fixed standards), so that by performing simple, standardized labeling tasks on part of the sample set, the efficiency of the task is improved and its difficulty is reduced.
In order to implement the method of the embodiment of the present application, the embodiment of the present application further provides a data classification model training device, which is disposed on a first electronic device, as shown in fig. 9, and the device includes:
The first classification unit 901 is configured to input a first feature tensor into the data classification model and output a second feature tensor, where each line of first vectors in the first feature tensor corresponds to a first attribute name representing a data object;
a first clustering unit 902, configured to cluster each first vector based on the output second feature tensor, to obtain at least one cluster;
A labeling unit 903, configured to label first attribute names corresponding to a set number of first vectors under each cluster, so as to obtain a first sample set;
And the training unit 904 is configured to determine a loss value based on the first sample set, and update the weight parameter of the data classification model according to the determined loss value until all the determined first sample sets meet a set training ending condition.
Wherein, in an embodiment, the labeling unit 903 is configured to:
combining the first attribute names corresponding to any two first vectors in the set number of first vectors under each cluster to obtain at least one first data set;
Labeling each first data set in the at least one first data set according to whether two first attribute names included in the first data set correspond to the same type of attribute, and obtaining a labeling result corresponding to each first data set;
And determining the first sample set based on the labeling result.
In an embodiment, the labeling unit 903 is configured to:
determining at least one second data set based on the labeling result;
The first sample set is obtained according to all the determined second data sets, wherein the second data sets are composed of two first data sets meeting set composition conditions, the set composition conditions represent different labeling results corresponding to the two first data sets, and the two first data sets have the same first attribute name.
In an embodiment, the first classification unit 901 is configured to:
determining a corresponding first feature tensor based on at least one type of first information;
And inputting the determined first characteristic tensor into a data classification model.
In an embodiment, the first classification unit 901 is further configured to:
Constructing a bipartite graph based on the at least one type of first information, determining a corresponding first feature tensor, and/or,
And performing word segmentation on the at least one type of first information, and determining a corresponding first characteristic tensor based on a word vector corresponding to a word segmentation result.
In an embodiment, the first classification unit 901 is further configured to:
the inputting of the determined first feature tensor into the data classification model comprises the following steps:
nonlinear transformation is carried out on the determined at least two first characteristic tensors, and a third characteristic tensor is obtained;
And inputting the obtained third characteristic tensor into a data classification model.
In an embodiment, the set training ending condition includes:
The labeling result represents that the first attribute names corresponding to the first vectors of the set quantity under the same cluster correspond to the same type of attribute; and the labeling result represents a result obtained by labeling the first attribute names corresponding to the first vectors with the set quantity under each cluster.
In practical applications, the first classification unit 901, the first clustering unit 902, the labeling unit 903, and the training unit 904 may be implemented by a processor in the data classification model training device, such as a central processing unit (CPU, Central Processing Unit), a digital signal processor (DSP, Digital Signal Processor), a micro-control unit (MCU, Microcontroller Unit), or a field-programmable gate array (FPGA, Field-Programmable Gate Array).
It should be noted that, when the data classification model training device provided in the foregoing embodiment performs data classification model training, only the division of each program module is used for illustration, and in practical application, the processing allocation may be completed by different program modules according to needs, that is, the internal structure of the device is divided into different program modules to complete all or part of the processing described above. In addition, the data classification model training device and the data classification model training method provided in the foregoing embodiments belong to the same concept, and detailed implementation processes of the data classification model training device and the data classification model training method are detailed in the method embodiments, which are not described herein again.
In order to implement the method of the embodiment of the present application, the embodiment of the present application further provides a data processing apparatus, which is disposed on a second electronic device, as shown in fig. 10, and the apparatus includes:
a second classification unit 1001, configured to input a fourth feature tensor into the data classification model and output a fifth feature tensor, where each row of second vectors in the fourth feature tensor corresponds to a second attribute name representing the data object;
a second clustering unit 1002, configured to cluster each second vector based on the output fifth feature tensor, to obtain at least one cluster;
a processing unit 1003, configured to determine a third attribute name corresponding to each cluster based on a cluster center corresponding to each cluster obtained by clustering, where the third attribute name characterizes an attribute name of a corresponding cluster,
The data classification model is obtained by training the data classification model training method according to any one of the above.
Wherein, in an embodiment, the second classification unit 1001 is configured to:
determining a corresponding fourth feature tensor based on the at least one type of second information;
And inputting the determined fourth characteristic tensor into a data classification model.
In an embodiment, the second classification unit 1001 is further configured to:
constructing a bipartite graph based on the at least one type of second information, determining a corresponding fourth feature tensor, and/or,
And performing word segmentation on the at least one type of second information, and determining a corresponding fourth characteristic tensor based on a word vector corresponding to a word segmentation result.
In an embodiment, the processing unit 1003 is configured to:
And determining a second attribute name corresponding to the second vector closest to the corresponding cluster as a third attribute name of the corresponding cluster based on the cluster center corresponding to each cluster obtained by clustering.
In practical applications, the second classification unit 1001, the second clustering unit 1002, and the processing unit 1003 may be implemented by a processor in the data processing apparatus, such as a CPU, DSP, MCU, or FPGA.
It should be noted that, when the data processing apparatus provided in the foregoing embodiment performs data processing, the division into the above program modules is merely taken as an example for illustration; in practical applications, the above processing may be allocated to different program modules as needed, that is, the internal structure of the apparatus may be divided into different program modules to complete all or part of the processing described above. In addition, the data processing apparatus provided in the foregoing embodiment and the data processing method embodiment belong to the same concept; the detailed implementation process of the apparatus is described in the method embodiment and is not repeated here.
Based on the hardware implementation of the program modules, and in order to implement the data classification model training method of the embodiment of the present application, the embodiment of the present application further provides a first electronic device. As shown in Fig. 11, the first electronic device 1100 includes:
a first communication interface 1101 capable of information interaction with other devices such as network devices and the like;
a first processor 1102, connected to the first communication interface 1101 to implement information interaction with other devices, and configured to execute, when running a computer program, the method provided by one or more of the technical solutions on the first electronic device side, where the computer program is stored in the first memory 1103.
Of course, in practical applications, the various components in the first electronic device 1100 are coupled together by a first bus system 1104. It can be understood that the first bus system 1104 is used to implement connection and communication between these components. In addition to a first data bus, the first bus system 1104 further includes a first power bus, a first control bus, and a first status signal bus. However, for clarity of illustration, the various buses are labeled as the first bus system 1104 in Fig. 11.
The first memory 1103 in the embodiment of the present application is used to store various types of data to support the operation of the first electronic device 1100. Examples of such data include any computer program for operating on the first electronic device 1100.
It can be understood that the first memory 1103 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferroelectric random access memory (FRAM, Ferroelectric Random Access Memory), a flash memory (Flash Memory), a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a random access memory (RAM, Random Access Memory), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as a static random access memory (SRAM, Static Random Access Memory), a synchronous static random access memory (SSRAM, Synchronous Static Random Access Memory), a dynamic random access memory (DRAM, Dynamic Random Access Memory), a synchronous dynamic random access memory (SDRAM, Synchronous Dynamic Random Access Memory), a double data rate synchronous dynamic random access memory (DDRSDRAM, Double Data Rate Synchronous Dynamic Random Access Memory), an enhanced synchronous dynamic random access memory (ESDRAM, Enhanced Synchronous Dynamic Random Access Memory), a synclink dynamic random access memory (SLDRAM, SyncLink Dynamic Random Access Memory), and a direct rambus random access memory (DRRAM, Direct Rambus Random Access Memory). The first memory 1103 described in the embodiments of the present application is intended to include, but is not limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present application may be applied to the first processor 1102 or implemented by the first processor 1102. The first processor 1102 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the above method may be completed by an integrated logic circuit of hardware in the first processor 1102 or by instructions in the form of software. The first processor 1102 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The first processor 1102 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being completed by a hardware decoding processor, or completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the first memory 1103; the first processor 1102 reads the program in the first memory 1103 and completes the steps of the foregoing method in combination with its hardware.
Optionally, when the first processor 1102 executes the program, a corresponding flow implemented by the first electronic device in each method of the embodiment of the present application is implemented, and for brevity, will not be described herein again.
Based on the hardware implementation of the program modules, and in order to implement the data processing method on the second electronic device side according to the embodiment of the present application, the embodiment of the present application further provides a second electronic device. As shown in Fig. 12, the second electronic device 1200 includes:
A second communication interface 1201 capable of information interaction with other devices such as a network device and the like;
a second processor 1202, connected to the second communication interface 1201 to implement information interaction with other devices, and configured to execute, when running a computer program, the method provided by one or more of the technical solutions on the second electronic device side, where the computer program is stored in the second memory 1203.
Of course, in practical applications, the various components in the second electronic device 1200 are coupled together by a second bus system 1204. It can be understood that the second bus system 1204 is used to implement connection and communication between these components. In addition to a second data bus, the second bus system 1204 further includes a second power bus, a second control bus, and a second status signal bus. However, for clarity of illustration, the various buses are labeled as the second bus system 1204 in Fig. 12.
The second memory 1203 in the embodiment of the present application is used to store various types of data to support the operation of the second electronic device 1200. Examples of such data include any computer program for operating on the second electronic device 1200.
It can be understood that the second memory 1203 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories. The non-volatile memory may be a ROM, a PROM, an EPROM, an EEPROM, an FRAM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM; the magnetic surface memory may be a magnetic disk memory or a magnetic tape memory. The volatile memory may be a RAM, which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as an SRAM, an SSRAM, a DRAM, an SDRAM, a DDRSDRAM, an ESDRAM, an SLDRAM, and a DRRAM. The second memory 1203 described in the embodiments of the present application is intended to include, but is not limited to, these and any other suitable types of memory.
The method disclosed in the above embodiment of the present application may be applied to the second processor 1202 or implemented by the second processor 1202. The second processor 1202 may be an integrated circuit chip with signal processing capability. In the implementation process, the steps of the above method may be completed by an integrated logic circuit of hardware in the second processor 1202 or by instructions in the form of software. The second processor 1202 may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The second processor 1202 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, any conventional processor, or the like. The steps of the method disclosed in the embodiments of the present application may be directly embodied as being completed by a hardware decoding processor, or completed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium, the storage medium being located in the second memory 1203; the second processor 1202 reads the program in the second memory 1203 and completes the steps of the foregoing method in combination with its hardware.
Optionally, when the second processor 1202 executes the program, a corresponding flow implemented by the second electronic device in each method of the embodiment of the present application is implemented, which is not described herein for brevity.
In an exemplary embodiment, the present application further provides a storage medium, that is, a computer storage medium, specifically a computer-readable storage medium, for example, including the first memory 1103 and the second memory 1203 each storing a computer program, where the computer programs may be executed by the first processor 1102 and the second processor 1202 of the corresponding electronic devices, respectively, to complete the steps of the foregoing methods. The computer-readable storage medium may be an FRAM, a ROM, a PROM, an EPROM, an EEPROM, a flash memory, a magnetic surface memory, an optical disc, or a CD-ROM.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus, electronic device, and method may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling, or communication connection between the components shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present application may all be integrated in one processing unit, or each unit may separately serve as one unit, or two or more units may be integrated in one unit; the integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.
It will be appreciated by those of ordinary skill in the art that all or part of the steps of the above method embodiments may be completed by hardware related to program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disk.
Alternatively, if the above integrated unit of the present application is implemented in the form of a software functional module and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the embodiments of the present application, in essence or the part contributing to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of the present application. The storage medium includes various media capable of storing program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disk.
The technical solutions described in the embodiments of the present application may be arbitrarily combined without conflict. Unless otherwise specified and limited, the term "connected" should be understood in a broad sense; for example, it may be an electrical connection, or internal communication between two elements, or a direct connection, or an indirect connection through an intermediary. Those of ordinary skill in the art can understand the specific meaning of the term according to the specific situation.
In addition, in the embodiments of the present application, "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that the objects distinguished by "first", "second", and "third" may be interchanged where appropriate, so that the embodiments of the present application described herein can be implemented in an order other than that illustrated or described herein.
The foregoing is merely a description of the embodiments of the present application, and the protection scope of the present application is not limited thereto; any variation or substitution that can be readily conceived by a person skilled in the art within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
The term "and/or" is merely an association relationship describing the associated object, and means that three relationships may exist, for example, a and/or B may mean that a exists alone, while a and B exist together, and B exists alone. In addition, the term "at least one" herein means any combination of any one or at least two of the plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
The features described in the embodiments of the present application may be combined arbitrarily without conflict; for example, different embodiments may be formed by combining different features. To avoid unnecessary repetition, the various possible combinations of the features in the present application are not described further.