Disclosure of Invention
The present invention aims to solve at least one of the technical problems existing in the prior art. Therefore, the invention provides an intelligent data management system and method based on multi-source data acquisition, which realize intelligent management and value evaluation of multi-source heterogeneous data.
In order to achieve the above purpose, an intelligent data management method based on multi-source data acquisition is provided, which comprises the following steps:
Step one, obtaining multi-source data from multiple data sources, and generating an original data set with credibility marks;
Step two, classifying data types of the data records in the original data set by adopting a data type recognition method based on the combination of rules and a decision tree classifier, and proceeding to step three for data whose data type is time sequence data;
Step three, extracting trend features of the time sequence data by using a segmented multi-model time sequence feature extraction framework;
Step four, extracting unstructured features from the unstructured data through a feature extraction model based on a multi-layer perceptron;
Step five, establishing a self-adaptive association relation for the structured data by adopting a knowledge graph technology, and generating association relation data;
Step six, taking the trend features, unstructured features and association relation data as input of a pre-constructed multi-level fusion data value evaluation model to generate a dynamic weight data asset efficacy evaluation;
The step of obtaining multi-source data from multiple data sources and generating an original data set with credibility marks comprises the following steps:
Step 11, acquiring original multi-source data from multiple data sources through a distributed data acquisition network consisting of a plurality of types of data acquisition units;
Step 12, adopting a data preprocessing unit of an edge computing architecture to perform standardization processing and credibility evaluation on the collected original multi-source data to form an original data set with credibility marks;
The step of classifying the data types of the data records in the original data set by adopting the data type recognition method based on the combination of rules and a decision tree classifier comprises the following steps:
Step 21, constructing a data type feature extractor comprising a structural feature extraction rule, a time feature extraction rule and a content feature extraction rule, and extracting key feature vectors required by data type judgment from an original data set;
Step 22, based on the extracted key feature vector, classifying the data record by using a decision tree classifier to generate a classification result, wherein the classification result comprises time sequence data, unstructured data and structured data;
The step of applying the segmented multi-model time sequence feature extraction framework to extract trend features of the time sequence data comprises the following steps:
Step 31, constructing a segmented multi-model time sequence feature extraction framework comprising a self-adaptive segmentation module, a multi-model feature extraction module and a feature fusion module;
Step 32, dividing continuous time sequence data into data segments with similar statistical characteristics by adopting a similarity measurement method based on dynamic time warping through the self-adaptive segmentation module, dynamically determining the boundaries of the data segments through a sliding window algorithm, and automatically adjusting the window size according to the change characteristics of the time sequence data;
Step 33, through the multi-model feature extraction module comprising a statistical feature extraction channel, a frequency domain feature extraction channel and a shape feature extraction channel, applying the corresponding feature extraction method to each data segment in parallel, and extracting single-dimensional feature vectors of the time sequence data from different dimensions;
step 34, fusing the single-dimensional feature vectors with different dimensions by utilizing the feature fusion module to generate trend features comprehensively representing trend characteristics of the time sequence data;
The step of extracting unstructured features from the unstructured data through a feature extraction model based on a multi-layer perceptron comprises the following steps:
Step 41, extracting initial unstructured features from unstructured data by applying a feature extractor based on a statistical method;
step 42, designing a multi-layer perceptron model consisting of an input layer, a plurality of hidden layers and an output layer, adopting a fully-connected neural network structure, training the multi-layer perceptron model, and mapping initial unstructured features to unstructured feature spaces;
The multi-layer perceptron model consists of an input layer, a plurality of hidden layers and an output layer, and adopts a fully-connected neural network structure. The number of neurons of the input layer is consistent with the dimension of the initial characteristic, the hidden layer adopts a multi-layer structure, and the number of neurons of each layer is gradually decreased layer by layer.
Step 43, applying the trained multi-layer perceptron model to perform feature extraction on the unstructured data to generate unstructured features in a unified format;
The step of establishing a self-adaptive association relation for the structured data by using the dynamic evolution knowledge graph technology and generating association relation data comprises the following steps:
Step 51, constructing a domain ontology model and a key entity identification rule, and extracting a key entity and attributes thereof from the structured data;
Step 52, constructing a structural knowledge graph by applying a relation mapping rule based on the key entity and the attribute thereof extracted in the step 51;
Step 53, extracting a hierarchical association relation of the structured data and generating application-oriented association relation data based on the structural knowledge graph, wherein the hierarchical association relation comprises three levels of key entity direct association, path association and group association;
The step of taking the trend features, unstructured features and association relation data as input of the pre-constructed multi-level fusion data value evaluation model to generate the dynamic weight data asset efficacy evaluation comprises the following steps:
Step 61, constructing a multi-level data value evaluation model framework comprising a value feature extraction layer, a feature fusion layer and a performance evaluation layer;
Step 62, extracting value indexes from data of different data types through the value feature extraction layer, which comprises three parallel feature processing channels: a trend feature processing channel, an unstructured feature processing channel and an association relation processing channel;
The trend feature processing channel receives the trend features extracted by the segmented multi-model time sequence feature extraction framework and extracts the value information they contain through a convolutional neural network; the unstructured feature processing channel receives the unstructured features extracted by the multi-layer perceptron-based feature extraction model and identifies the key information they contain through an attention mechanism; and the association relation processing channel receives the association relation data generated by the knowledge graph technology and extracts structured value indexes through a graph neural network;
Step 63, performing self-adaptive weighted fusion on the multidimensional value indexes from the value characteristic extraction layer through a characteristic fusion layer based on a deep neural network to generate comprehensive characteristic representation;
Step 64, based on the comprehensive feature representation, simultaneously predicting the business value, technical value and innovation value of the data through a multi-task learning framework, and fusing the value evaluation results of the three dimensions into a final data asset efficacy evaluation score.
The intelligent data management system based on multi-source data acquisition comprises an original data collection module, a data classification module, a multi-type feature extraction module and an asset efficacy evaluation module, wherein the modules are electrically connected;
The original data collection module acquires multi-source data from multiple data sources, generates an original data set with credibility marks, and sends the original data set to the data classification module;
The data classification module classifies the data types of the data records in the original data set by adopting the data type recognition method based on the combination of rules and a decision tree classifier, divides the data records into time sequence data, unstructured data and structured data, and sends them to the multi-type feature extraction module;
The multi-type feature extraction module is used for extracting trend features of the time sequence data by applying a segmented multi-model time sequence feature extraction framework, extracting unstructured features of the unstructured data by using a feature extraction model based on a multi-layer perceptron, establishing a self-adaptive association relation for the structured data by using a knowledge graph technology, generating association relation data, and sending the trend features, the unstructured features and the association relation data to the asset efficacy evaluation module;
The asset efficacy evaluation module takes the trend features, unstructured features and association relation data as input of a pre-constructed multi-level fusion data value evaluation model to generate a dynamic weight data asset efficacy evaluation.
Compared with the prior art, the invention has the beneficial effects that:
Through multi-source data acquisition and data type identification based on the combination of rules and a decision tree classifier, the invention applies specialized feature extraction techniques to different types of data: a segmented multi-model time sequence feature extraction framework is adopted for time sequence data to accurately capture the fluctuation, trend and periodic characteristics of a time sequence; a multi-layer perceptron-based feature extraction model is adopted for unstructured data to effectively mine the hidden features in texts, images and other unstructured data; and a knowledge graph technology is adopted for structured data to construct self-adaptive association relations, revealing the internal connections between the data. Finally, a multi-level fusion data value evaluation model realizes intelligent management and value evaluation of multi-source heterogeneous data.
Detailed Description
The technical solutions of the present invention will be clearly and completely described in connection with the embodiments, and it is obvious that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
As shown in FIG. 1, an intelligent data management method based on multi-source data acquisition comprises the following steps:
Step one, obtaining multi-source data from multiple data sources, and generating an original data set with credibility marks;
Step two, classifying data types of the data records in the original data set by adopting a data type recognition method based on the combination of rules and a decision tree classifier, and proceeding to step three for data whose data type is time sequence data;
Step three, extracting trend features of the time sequence data by using a segmented multi-model time sequence feature extraction framework;
Step four, extracting unstructured features from the unstructured data through a feature extraction model based on a multi-layer perceptron;
Step five, establishing a self-adaptive association relation for the structured data by adopting a knowledge graph technology, and generating association relation data;
Step six, taking the trend features, unstructured features and association relation data as input of a pre-constructed multi-level fusion data value evaluation model to generate a dynamic weight data asset efficacy evaluation;
Wherein the multiple data sources include, but are not limited to, industrial sensors, internet of things devices, historical databases, and third party APIs;
In an embodiment of the present invention, the step of acquiring multi-source data from multiple data sources and generating an original data set with credibility marks comprises the following steps:
Step 11, acquiring original multi-source data from multiple data sources through a distributed data acquisition network consisting of a plurality of types of data acquisition units;
Specifically, the distributed data acquisition network comprises an industrial sensor acquisition unit, an Internet of things equipment acquisition unit, a history database access unit and a third party API call unit.
The industrial sensor acquisition unit is connected to various sensors in the production line, equipment and environment monitoring systems, acquires physical quantity data such as temperature, pressure, flow, current, vibration and gas concentration in real time, establishes connections with the sensors using standardized industrial communication protocols such as Modbus, Profibus and OPC UA, and reads the physical quantity data at a preset sampling frequency.
The Internet of things equipment acquisition unit is connected to Internet of things terminals distributed at different locations through wireless communication technologies such as LoRa, NB-IoT and ZigBee, and acquires data such as equipment state, environmental parameters and position information;
the history database access unit is connected to various history database systems in the enterprise through a database connector, and comprises a relational database, a time sequence database, a document database and the like, and extracts data with different data types such as a history operation record, an equipment maintenance record, a quality detection record and the like;
The third party API calling unit is connected to an external data service provider through a standardized Web service interface such as REST API, SOAP and the like, and obtains external environment data such as weather data, market information, supply chain state and the like.
In the specific implementation process of the invention, each data acquisition unit executes its data acquisition task according to a preset sampling strategy. For example, the sampling frequency of the industrial sensor acquisition unit is set according to the rate of change of the monitored parameter: key process parameters such as temperature and pressure are sampled at high frequency, every 10 seconds; equipment state parameters such as motor current and vibration are sampled at medium frequency, every 30 seconds; and environmental parameters such as ambient temperature and humidity are sampled at low frequency, every 5 minutes. The sampling frequency of the Internet of things equipment acquisition unit balances device battery life against data importance, and is generally set to once every 1-10 minutes. The historical database access unit performs incremental data extraction once per hour. The third party API calling unit sets reasonable calling intervals, typically ranging from once every 10 minutes to once per day, depending on the access restrictions of the API provider and the frequency of data updates.
The original data collected by all the collection units corresponding to each data source form original multi-source data, wherein the original data comprise industrial sensor data collected by an industrial sensor collection unit, internet of things equipment data collected by an internet of things equipment collection unit, a historical database record collected by a historical database access unit and API return data collected by a third-party API call unit;
step 12, carrying out standardization processing and credibility evaluation on the collected original multi-source data by adopting a data preprocessing unit of an edge computing architecture to form an original data set with credibility marks;
the standardized processing means that a special data preprocessing unit is configured for each data source, raw data of all acquisition units corresponding to the data source is received, and standardized processing of data format conversion, unit unification and time stamp alignment is performed.
The credibility evaluation is realized by adopting a multidimensional credibility evaluation method, and a credibility score is distributed to each piece of original data.
Specifically, in the data format conversion process, heterogeneous data from different sources are uniformly converted into a standard JSON or Avro format, ensuring the consistency of the data structure. In the unit unification stage, measured values in different units are converted into the system's standard units according to predefined unit conversion rules, such as Fahrenheit to Celsius and imperial to metric. In the timestamp alignment process, the time information of all data is uniformly converted into UTC format, and time correction is performed according to the actual acquisition delay, ensuring the comparability of data from different sources in the time dimension.
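By way of illustration only, the following Python sketch shows one possible form of this standardization step; the field names, conversion table and the assumed acquisition-delay field are examples chosen for the sketch and do not limit the invention.

```python
# Minimal sketch of the standardization step: unit conversion plus UTC
# timestamp alignment. Field names and rules are illustrative assumptions.
from datetime import datetime, timezone

UNIT_CONVERSIONS = {
    ("degF", "degC"): lambda v: (v - 32.0) * 5.0 / 9.0,  # Fahrenheit -> Celsius
    ("inch", "mm"): lambda v: v * 25.4,                   # imperial -> metric
}

def standardize_record(raw: dict) -> dict:
    """Convert a raw reading to standard units and a UTC timestamp."""
    value, unit = raw["value"], raw["unit"]
    for (src, dst), convert in UNIT_CONVERSIONS.items():
        if unit == src:
            value, unit = convert(value), dst
            break
    # Align to UTC, correcting for the (assumed) known acquisition delay.
    ts = datetime.fromtimestamp(raw["epoch_s"] - raw.get("delay_s", 0.0),
                                tz=timezone.utc)
    return {"value": value, "unit": unit, "timestamp": ts.isoformat()}

print(standardize_record({"value": 212.0, "unit": "degF", "epoch_s": 1_700_000_000}))
```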
Further, the credibility evaluation is based on at least the following factors: the reliability of the data source, evaluated from its historical performance; the freshness of the data, with data acquired closer to the present considered more credible; the integrity of the data, checked through the completeness and validity of the data fields; the consistency and rationality of the data, evaluated through comparison with historical or related data; and the health state of the sensor, evaluated from the sensor's self-diagnosis information.
In a specific implementation, the credibility evaluation adopts a weighted scoring model, and the weight of each evaluation factor is dynamically adjusted according to the requirements of different application scenes. For example, in a real-time control scenario, the data freshness is weighted higher, and in a quality analysis scenario, the data consistency is weighted higher. The final confidence score ranges from 0 to 100, where 0 means completely untrusted and 100 means completely trusted.
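A minimal sketch of such a weighted scoring model is given below; the factor names and scenario weight tables are illustrative assumptions, with the real-time control scenario weighting freshness higher and the quality analysis scenario weighting consistency higher, as described above.

```python
# Illustrative weighted scoring model for the credibility evaluation.
# Factor names and scenario weights are assumptions for this example.
SCENARIO_WEIGHTS = {
    "real_time_control": {"source": 0.2, "freshness": 0.4, "integrity": 0.2,
                          "consistency": 0.1, "sensor_health": 0.1},
    "quality_analysis":  {"source": 0.2, "freshness": 0.1, "integrity": 0.2,
                          "consistency": 0.4, "sensor_health": 0.1},
}

def credibility_score(factors: dict, scenario: str) -> float:
    """Weighted sum of factor scores (each 0-100) -> overall score in [0, 100]."""
    weights = SCENARIO_WEIGHTS[scenario]
    return sum(weights[name] * factors[name] for name in weights)

factors = {"source": 90, "freshness": 95, "integrity": 100,
           "consistency": 80, "sensor_health": 85}
print(credibility_score(factors, "real_time_control"))  # freshness dominates here
```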
After standardization processing and credibility evaluation, the original multi-source data is organized into an original data set with credibility marks, composed of a plurality of data records. Each data record at least comprises: a globally unique identifier for uniquely identifying the record; a timestamp accurate to the millisecond; a data source identifier marking the source type and specific device of the data; a parameter identifier uniquely identifying the monitored parameter type; a parameter value, i.e. the actual measurement after standardization; a credibility score representing the trustworthiness of the data; and metadata containing information about data acquisition and processing, such as the sampling frequency and processing method.
A specific example of the present invention is an equipment health management system for an intelligent manufacturing enterprise. The equipment health management system is connected through the industrial sensor acquisition unit to various sensors on core production equipment such as CNC (computer numerical control) machine tools, injection molding machines and robots, acquiring equipment operating parameters in real time; connected through the Internet of things equipment acquisition unit to environment monitoring terminals distributed in the workshop, acquiring environmental data such as temperature, humidity and dust concentration; connected through the historical database access unit to the enterprise's MES (manufacturing execution system) and equipment maintenance recording system, extracting equipment historical operation records and maintenance records; and, through the third party API calling unit, obtaining equipment parameter standards and fault feature libraries provided by equipment suppliers. The data preprocessing unit performs standardization processing on the multi-source data, evaluates data credibility based on factors such as sensor state, acquisition time and data consistency, and forms an equipment operation data set with credibility marks.
Further, the step of classifying the data types of the data records in the original data set by adopting the data type recognition method based on the combination of rules and a decision tree classifier comprises the following steps:
Step 21, constructing a data type feature extractor comprising a structural feature extraction rule, a time feature extraction rule and a content feature extraction rule, and extracting key feature vectors required by data type judgment from an original data set;
In particular, the structural feature extraction rules are used to analyze the organization of the data records, including, but not limited to, whether the data has a fixed key-value pair structure, the number and type distribution of data fields, the nesting hierarchy of the data, the serialization format of the data, and the like. For example, for JSON format data, the features of top field number, field type distribution (such as character string field proportion, numerical field proportion), nesting depth, etc. are extracted, and for CSV format data, the features of column number, numerical column proportion, date column proportion, etc. are extracted.
Wherein the time feature extraction rules are used to identify the time series characteristics of the data, including but not limited to whether a timestamp field is included, the distribution characteristics of the timestamps, the regularity of time intervals between data points, the continuity of the time series, etc. For example, whether a time field conforming to the ISO 8601 format exists in the data is detected, the mean and variance of the time intervals of adjacent data points are calculated, the sampling frequency of the time series is analyzed, and so on.
Wherein the content feature extraction rules are used to analyze the content characteristics of the data, including but not limited to the length distribution of text fields, the statistics of numeric fields, the proportion of binary data, the identification of special format content (e.g., URL, image data, audio data), etc. For example, the average length and coefficient of variation of the text field are calculated, whether HTML tags or XML tags are included is detected, whether Base64 encoded binary data is included is identified, and so on.
In the implementation process of the invention, the data type feature extractor applies the three types of feature extraction rules to each data record in the original data set to generate a key feature vector containing a plurality of key features. The key feature vector contains a structural feature sub-vector (10-dimensional), a temporal feature sub-vector (8-dimensional) and a content feature sub-vector (12-dimensional), for a total of 30-dimensional features to characterize the type of data record.
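The following sketch illustrates how such a 30-dimensional key feature vector (10 structural, 8 temporal and 12 content dimensions) could be assembled for a JSON record; the individual features computed here are examples, and the zero-padding merely fills out the stated sub-vector dimensions.

```python
# Sketch of the 30-dim key feature vector: 10 structural + 8 temporal
# + 12 content features. The concrete features are illustrative choices.
import json
import re

def structural_features(record: str) -> list[float]:
    obj = json.loads(record)
    fields = list(obj.values())
    feats = [float(len(obj)),                                  # top-level field count
             sum(isinstance(v, str) for v in fields) / max(len(fields), 1),
             sum(isinstance(v, (int, float)) for v in fields) / max(len(fields), 1)]
    return feats + [0.0] * (10 - len(feats))                   # pad to 10 dims

def temporal_features(record: str) -> list[float]:
    has_iso_ts = bool(re.search(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}", record))
    feats = [float(has_iso_ts)]                                # ISO 8601 field present?
    return feats + [0.0] * (8 - len(feats))                    # pad to 8 dims

def content_features(record: str) -> list[float]:
    texts = [v for v in json.loads(record).values() if isinstance(v, str)]
    avg_len = sum(map(len, texts)) / max(len(texts), 1)        # mean text-field length
    feats = [avg_len, float("<html" in record.lower())]        # HTML content flag
    return feats + [0.0] * (12 - len(feats))                   # pad to 12 dims

record = '{"ts": "2024-01-01T08:00:00Z", "sensor": "T-101", "value": 72.5}'
vector = structural_features(record) + temporal_features(record) + content_features(record)
print(len(vector))  # 30
```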
Step 22, based on the extracted key feature vector, classifying the data record by using a decision tree classifier to generate a classification result, wherein the classification result comprises time sequence data, unstructured data and structured data;
Specifically, the decision tree classifier is constructed by adopting a C4.5 algorithm, and a classification model is constructed by recursively dividing a feature space by taking an information gain ratio as a splitting standard. Each internal node of the decision tree represents a test for a feature, each branch represents the output of the test, and each leaf node represents a class of data types.
In the process of constructing the decision tree classifier, a set of explicit classification rules is first defined as an initial structure of the decision tree based on expert knowledge. For example, data tends to be classified as time series data if it has distinct time series characteristics (timestamp fields exist and time intervals are uniform), as unstructured data if it contains a large amount of unstructured text content or binary coded content, and as structured data if it has a clear key-value pair structure and consists mainly of structured fields.
The decision tree classifier is then trained and optimized using the annotated first training dataset. The first training data set contains a plurality of data records from various data sources in the original data set as typical data samples, and each typical data sample is marked as time sequence data, unstructured data or structured data by a field expert. By training the decision tree classifier, finer classification rules are learned to handle boundary conditions with complex feature combinations.
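A hedged sketch of this training step follows; scikit-learn's CART-based DecisionTreeClassifier with the entropy criterion stands in for the C4.5 algorithm named above (scikit-learn does not implement C4.5 itself), and the training data here are synthetic placeholders for the annotated first training data set.

```python
# Stand-in for step 22's C4.5 tree: scikit-learn CART with entropy criterion.
# Feature vectors and labels are synthetic placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_train = rng.random((300, 30))        # 30-dim key feature vectors (placeholder)
y_train = rng.integers(0, 3, 300)      # 0=time sequence, 1=unstructured, 2=structured

clf = DecisionTreeClassifier(criterion="entropy", max_depth=8, min_samples_leaf=5)
clf.fit(X_train, y_train)

labels = {0: "time sequence data", 1: "unstructured data", 2: "structured data"}
print(labels[int(clf.predict(rng.random((1, 30)))[0])])
```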
In a further preferred embodiment of the invention, the classification result of the decision tree classifier can be verified and corrected by combining the context information of the data source and the semantic features of the data content to generate a more accurate classification result of the data type;
Specifically, on the basis of decision tree classification, data source context information and data content semantic features are introduced to verify and correct classification results. The data source context information comprises the source type, the acquisition mode, the expected use and the like of the data, and the semantic features of the data content are extracted through a simple text analysis and pattern matching method and are used for understanding the actual meaning of the data.
For the utilization of data source context information, the types of data typically generated by different data sources are recorded by maintaining a knowledge base of the data sources. For example, industrial sensor acquisition units typically generate time series data, historical database access units may generate structured data or time series data, and third party API call units may return various types of data. In the classifying process, the prior probability in the data source knowledge base is referenced, and the classifying result of the decision tree is adjusted.
For analysis of semantic features of data content, a set of predefined semantic rules and pattern matching rules may be applied. For example, it is detected whether the data field name contains time-related words (such as "time", "date", etc.), whether the data content conforms to the term pattern of the specific field is identified, whether the data structure conforms to the common data exchange format is determined, etc. These semantic features help to understand the actual use and type of data.
In the specific example of the equipment health management system of the intelligent manufacturing enterprise, the process of classifying the equipment operation data set based on the data type recognition method combining rules and a decision tree classifier is as follows: sensor data from the CNC machine tools, injection molding machines and robots (such as continuous measurements of physical quantities like temperature, pressure and vibration) are recognized as time sequence data; text descriptions and image records such as equipment maintenance records and fault reports are recognized as unstructured data, which contain rich unstructured content and require deep learning models for feature extraction; and tabular data such as equipment parameter configurations, production plans and bills of material are recognized as structured data, which have clear relational structures and are suitable for establishing association relations through knowledge graph technology. Through accurate data type classification, the equipment health management system can select the most suitable processing method for each type of data, comprehensively mining the data value and supporting equipment state monitoring, fault early warning and maintenance decisions.
Further, in an embodiment of the present invention, the step of extracting trend features of the time sequence data by using the segmented multi-model time sequence feature extraction framework comprises the following steps:
Step 31, constructing a segmented multi-model time sequence feature extraction framework comprising a self-adaptive segmentation module, a multi-model feature extraction module and a feature fusion module;
Step 32, dividing continuous time sequence data into data segments with similar statistical characteristics through the self-adaptive segmentation module, which adopts a similarity measurement method based on dynamic time warping, dynamically determines the boundaries of the data segments through a sliding window algorithm, and automatically adjusts the window size according to the change characteristics of the time sequence data.
For example, for data segments with smooth changes, a larger window is used to increase the computational efficiency, and for data segments with severe changes, a smaller window is used to ensure the segmentation accuracy. In practical application, the variance of the data in the sliding window is calculated and compared with a preset threshold value to judge the change degree of the data, the smooth section is judged when the variance is smaller than the threshold value, and the fluctuation section is judged when the variance is larger than or equal to the threshold value.
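The following sketch illustrates only the variance-threshold window adjustment described in this paragraph (the dynamic-time-warping similarity measure is omitted for brevity); the window sizes and the threshold value are illustrative assumptions.

```python
# Sketch of variance-threshold sliding-window segmentation (step 32).
# Window sizes and the threshold are illustrative assumptions.
import numpy as np

def adaptive_segments(x: np.ndarray, var_threshold: float = 0.5,
                      smooth_win: int = 60, volatile_win: int = 15) -> list[slice]:
    """Split x into segments, using a larger window on smooth regions."""
    segments, start = [], 0
    while start < len(x):
        probe = x[start:start + volatile_win]          # look-ahead sample
        win = volatile_win if probe.var() >= var_threshold else smooth_win
        segments.append(slice(start, min(start + win, len(x))))
        start += win
    return segments

# Smooth plateau followed by a volatile region -> long then short windows.
signal = np.concatenate([np.ones(120),
                         np.random.default_rng(1).normal(0, 2, 60)])
print([(s.start, s.stop) for s in adaptive_segments(signal)])
```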
Step 33, through the multi-model feature extraction module comprising a statistical feature extraction channel, a frequency domain feature extraction channel and a shape feature extraction channel, applying the corresponding feature extraction method to each data segment in parallel, and extracting single-dimensional feature vectors of the time sequence data from different dimensions;
Specifically, the statistical feature extraction channel calculates statistics such as mean, variance, skewness, kurtosis and the like of each data segment, and is used for representing the basic statistical characteristics of the data segment. The frequency domain feature extraction channel converts the time domain data into a frequency domain representation through fast fourier transformation, extracts main frequency components of each data segment and energy distribution thereof, and is used for capturing periodicity and frequency characteristics of the data. The shape feature extraction channel adopts a symbolized aggregation approximation method to convert each data segment into a symbol sequence, and extracts the mode features of the data segment to be used for representing the shape features and the change modes of time sequence data. Each feature extraction channel operates independently to generate single-dimensional feature vectors with different dimensions.
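An illustrative sketch of the three parallel channels follows; the shape channel reduces the symbolized aggregation approximation to a simple quantile-symbol histogram for brevity, and the segment itself is synthetic.

```python
# Sketch of step 33's three parallel feature channels on one data segment.
import numpy as np
from scipy.stats import skew, kurtosis

def statistical_channel(seg: np.ndarray) -> np.ndarray:
    # Mean, variance, skewness, kurtosis of the segment.
    return np.array([seg.mean(), seg.var(), skew(seg), kurtosis(seg)])

def frequency_channel(seg: np.ndarray, k: int = 3) -> np.ndarray:
    # FFT magnitude spectrum; keep the k dominant bins and their energies.
    spectrum = np.abs(np.fft.rfft(seg - seg.mean()))
    top = np.argsort(spectrum)[-k:]
    return np.concatenate([top.astype(float), spectrum[top]])

def shape_channel(seg: np.ndarray, alphabet: int = 4) -> np.ndarray:
    # Simplified SAX: quantile-bin each point, then a normalized symbol histogram.
    bins = np.quantile(seg, np.linspace(0, 1, alphabet + 1)[1:-1])
    symbols = np.digitize(seg, bins)
    return np.bincount(symbols, minlength=alphabet) / len(seg)

seg = np.sin(np.linspace(0, 8 * np.pi, 256)) \
      + 0.1 * np.random.default_rng(2).normal(size=256)
print(statistical_channel(seg).round(3))
print(frequency_channel(seg).round(1))
print(shape_channel(seg))
```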
Step 34, fusing the single-dimensional feature vectors with different dimensions by utilizing the feature fusion module to generate trend features comprehensively representing trend characteristics of the time sequence data;
Specifically, the feature fusion module adopts an attention mechanism to carry out weighted fusion on the single-dimensional feature vectors extracted by the three feature extraction channels. Firstly, an initial weight is allocated to each feature extraction channel, and then, the contribution degree of each feature extraction channel to final trend characterization is learned by using labeled time sequence data samples as training data in a supervised learning mode, so that the weight allocation is dynamically adjusted. In the fusion process, the statistical features, the frequency domain features and the shape features are weighted and summed according to the learned weights to form a final trend feature vector. The feature vector not only maintains the statistical characteristics of the original data, but also contains the frequency domain and the shape features, and can comprehensively reflect the trend change features of the time sequence data.
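A minimal sketch of this weighted fusion is given below; in practice the channel weights would be learned from labeled time sequence samples as described, whereas here they are fixed logits chosen for the example.

```python
# Sketch of step 34: attention-style weighted fusion of the three channel
# outputs. The logits are placeholders for weights learned from labeled data.
import numpy as np

def fuse(channels: list[np.ndarray], logits: np.ndarray) -> np.ndarray:
    weights = np.exp(logits) / np.exp(logits).sum()   # softmax channel weights
    dim = max(len(c) for c in channels)
    padded = [np.pad(c, (0, dim - len(c))) for c in channels]  # align lengths
    return sum(w * c for w, c in zip(weights, padded))

stat_vec, freq_vec, shape_vec = np.ones(4), 2 * np.ones(6), 3 * np.ones(4)
trend_feature = fuse([stat_vec, freq_vec, shape_vec], np.array([0.8, 0.1, 0.4]))
print(trend_feature.round(2))
```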
In the specific implementation process of the invention, the processing flow of the segmented multi-model time sequence feature extraction framework comprises the following steps of firstly inputting original time sequence data into an adaptive segmentation module to generate a plurality of data segments with similar features, then inputting each data segment into three feature extraction channels simultaneously to respectively extract statistical features, frequency domain features and shape features, and finally carrying out weighted fusion on the three types of features through a feature fusion module to output final trend feature representation. The whole process can realize the efficient and comprehensive extraction of the time sequence trend characteristics, and provides powerful support for the subsequent data analysis and decision.
In the above example of the device health management system for an intelligent manufacturing enterprise, the segmented multi-model timing feature extraction framework is applied to trend feature extraction of CNC machine spindle vibration signals. The vibration signal which is continuously collected is divided into a plurality of time periods through the self-adaptive segmentation module, and the time periods correspond to different machining states of the machine tool, such as no-load, normal machining, overload and the like. Then, the multi-model feature extraction module extracts features of each period from three dimensions of statistics, frequency domain and shape simultaneously, wherein the statistics features reflect the intensity and stability of vibration signals, the frequency domain features reflect the natural frequency and possible fault frequency of a machine tool spindle, and the shape features capture abnormal modes of vibration waveforms. And finally, the feature fusion module fuses the three types of features into a comprehensive trend feature vector according to the weight learned by the historical fault data. The trend feature vector can reflect the change trend of the running state of the CNC machine tool spindle, and effectively supports subsequent equipment state evaluation and potential fault early warning. For example, when the fused trend characteristic shows that the vibration frequency of the spindle gradually approaches to the critical frequency of the machine tool, an early warning signal is generated to prompt maintenance personnel to check the state of the spindle bearing, so that the early prevention of equipment faults is realized.
Further, the step of extracting unstructured features from the unstructured data through a feature extraction model based on a multi-layer perceptron comprises the following steps:
Step 41, extracting initial unstructured features from unstructured data by applying a feature extractor based on a statistical method;
specifically, for unstructured data of the text class, a bag of words model or a TF-IDF method is applied to extract word frequency features of the text, and an initial vector representation of the text is formed by calculating the occurrence frequency of each word in the text and the importance of each word in the whole corpus. In order to reduce feature dimensions, feature selection methods such as chi-square test are adopted, and top N terms with the most discriminative ability are reserved. In addition, other statistical characteristics of the text, such as average sentence length, punctuation mark distribution, keyword density and the like, can be extracted, so that the diversity of text representation is enriched.
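For the text case, a sketch using scikit-learn's TF-IDF vectorizer and chi-square feature selection is shown below; the corpus, labels and the value of N are toy assumptions.

```python
# Sketch of step 41 for text: TF-IDF features, then chi-square selection
# of the N most discriminative terms. Corpus and labels are toy assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

corpus = ["spindle bearing overheating during finishing pass",
          "routine lubrication completed, no anomaly found",
          "gear breakage detected, spindle replaced"]
labels = [1, 0, 1]                        # 1 = fault-related, 0 = routine

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)           # word-frequency / importance features
selector = SelectKBest(chi2, k=5)         # keep the 5 most discriminative terms
X_initial = selector.fit_transform(X, labels)
print(X_initial.shape)                    # initial unstructured feature matrix
```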
For unstructured data of the image class, classical feature descriptors such as HOG (histogram of oriented gradients), LBP (local binary patterns) and SIFT (scale-invariant feature transform) are applied to extract the texture and shape features of the image. The HOG features capture object contour information by computing gradient-orientation histograms over local image regions; the LBP features extract local texture patterns by comparing the intensity relationship between a central pixel and its surrounding pixels; and the SIFT features capture local structural characteristics by detecting keypoints in the image and the gradient distribution around them. These descriptors complement each other and together form a comprehensive feature representation of the image.
For unstructured data of the audio class, acoustic feature descriptors such as MFCC (Mel-frequency cepstral coefficients), chroma features and energy features are applied to extract the spectral and time-domain characteristics of the audio signal. The MFCC features model the human ear's perception of sounds at different frequencies and effectively characterize the timbre of the audio; the chroma features characterize the pitch-class distribution of the audio; and the energy features reflect the intensity variation of the audio signal in the time domain. By computing statistics of these features, such as mean, variance and kurtosis, a composite feature representation of the audio signal is formed.
Step 42, designing a multi-layer perceptron model consisting of an input layer, a plurality of hidden layers and an output layer, adopting a fully-connected neural network structure, training the multi-layer perceptron model, and mapping initial unstructured features to unstructured feature spaces;
Specifically, for unstructured data of different data types, a corresponding multi-layer perceptron model architecture is designed. Each multi-layer perceptron model consists of an input layer, a plurality of hidden layers and an output layer, and adopts a fully-connected neural network structure. The number of neurons of the input layer is consistent with the dimension of the initial characteristics, the hidden layers are of a multi-layer structure and generally comprise 2-3 hidden layers, the number of neurons of each layer is gradually decreased from layer to layer, such as 512 neurons of the first layer, 256 neurons of the second layer and 128 neurons of the third layer, more abstract characteristic representations are gradually extracted, the number of neurons of the output layer is generally set to 64-128, and compact and information-rich characteristic representations are generated.
In the design of the neural network, each hidden layer uses a ReLU activation function to introduce nonlinear transformation capability and enhance the expressive power of the model; the output layer either outputs the linear transformation result directly without an activation function, or uses a tanh activation function to limit the output range to the [-1, 1] interval. Batch normalization layers are added between the layers to accelerate training convergence and improve model stability.
In the model training stage, a supervised learning mode is adopted, and the multi-layer perceptron model is trained with a manually annotated third training data set. The third training data set contains unstructured data samples with class labels reflecting the semantic categories or functional attributes of the data. The training goal is for the feature representation output by the model to effectively distinguish between different classes of unstructured data. Training uses mini-batch gradient descent with a batch size of typically 32-128; the loss function is cross-entropy loss (for classification tasks) or contrastive loss (for feature learning tasks); the optimizer is the Adam algorithm with an initial learning rate of 0.001 and a learning rate decay strategy. To prevent overfitting, dropout (rate 0.3-0.5) and L2 regularization (coefficient 0.0001) are added to the model, and an early-stopping strategy monitors validation set performance, stopping training when it no longer improves for several consecutive rounds.
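A hedged PyTorch sketch of this model and one training step follows, using the hyperparameters quoted above (512/256/128 hidden units, ReLU, batch normalization, dropout, Adam with learning rate 0.001 and L2 coefficient 0.0001); the input dimension, batch and labels are placeholders.

```python
# Sketch of the MLP from steps 42-43 plus one mini-batch training step.
# Input dimension and data are placeholders; hyperparameters follow the text.
import torch
import torch.nn as nn

def make_mlp(in_dim: int, feat_dim: int = 64, n_classes: int = 3) -> nn.Module:
    layers = []
    dims = [in_dim, 512, 256, 128]                     # decreasing hidden widths
    for d_in, d_out in zip(dims, dims[1:]):
        layers += [nn.Linear(d_in, d_out), nn.BatchNorm1d(d_out),
                   nn.ReLU(), nn.Dropout(0.3)]
    layers += [nn.Linear(dims[-1], feat_dim),          # 64-dim feature layer
               nn.Linear(feat_dim, n_classes)]         # classification head
    return nn.Sequential(*layers)

model = make_mlp(in_dim=300)                           # 300-dim initial features (assumed)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # Adam + L2
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(64, 300)                               # one mini-batch (batch size 64)
y = torch.randint(0, 3, (64,))                         # synthetic class labels
optimizer.zero_grad()
loss = loss_fn(model(X), y)
loss.backward()
optimizer.step()
print(float(loss))
```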
Step 43, applying the trained multi-layer perceptron model to perform feature extraction on the unstructured data to generate unstructured features in a unified format;
Specifically, for the unstructured data identified in step two, the initial features are first extracted by the statistical feature extractor of step 41. These initial features are then input into the trained multi-layer perceptron model of the corresponding type, and the activation values of the final (output) layer, obtained through forward propagation, serve as the feature representation of the unstructured data.
In the feature extraction process, multiple levels of feature representations may be acquired for the same piece of unstructured data. In addition to the features of the output layer, the activation values of the last one or more hidden layers may be extracted as intermediate layer features. These intermediate layer features often contain more detailed information that can be complementary to the output layer features to collectively form a multi-scale feature representation.
In the above example of the equipment health management system of the intelligent manufacturing enterprise, the feature extraction model based on the multi-layer perceptron is applied to the processing of unstructured data such as equipment maintenance records and fault reports. First, the text in the maintenance records is converted into word frequency feature vectors by the TF-IDF method and statistical features such as text length and technical term density are extracted, while HOG and LBP features are extracted from fault images to capture the texture and damage characteristics of equipment surfaces. These initial features are then mapped into semantically rich feature representations by multi-layer perceptron models trained on historical fault cases. The text-specific multi-layer perceptron model extracts from the maintenance records key semantic features reflecting equipment state (such as "normal", "slight anomaly", "severe fault"), fault type (such as "bearing wear", "gear breakage") and maintenance operations (such as "lubrication", "replacing parts"), while the image-specific multi-layer perceptron model extracts visual features representing the type, degree and position of equipment surface damage from the fault images. After PCA (principal component analysis) dimensionality reduction and random forest-based feature selection, a compact 48-dimensional unstructured feature representation is generated, which is subsequently used together with the trend features of the time sequence data and the association relation data of the structured data.
Further, the step of establishing a self-adaptive association relation for the structured data by adopting the dynamic evolution knowledge graph technology and generating association relation data comprises the following steps:
Step 51, constructing a domain ontology model and a key entity identification rule, and extracting a key entity and attributes thereof from the structured data;
Specifically, the domain ontology model is a concept hierarchy and a relationship network constructed based on specific domain knowledge and comprises three components, namely a category hierarchy, a relationship type definition and an attribute constraint rule. The category hierarchy structure defines main concepts and upper and lower hierarchical relationships thereof in the field, such as a classification system of core concepts of equipment types, process parameters, material properties and the like, the relationship type definition describes possible association modes between key entities of different categories, such as semantic relationships of 'containing', 'controlling', 'influencing', 'relying on', and the like, and the attribute constraint rules prescribe attribute characteristics of various key entities, including constraint information of data types, value ranges, units, validity conditions and the like of the attributes.
In the aspect of configuration of the key entity identification rules, the identification rules are defined for each type of key entity by combining category definition in the domain ontology model. These rules include both deterministic rules based on pattern matching, such as accurate recognition by field names, data formats, and value ranges, and probabilistic rules based on machine learning, for determining fuzzy conditions by pre-trained key entity recognition models. For example, the key entity of the production equipment type can be identified by the naming rule of the equipment number and the equipment parameter characteristics, and the key entity of the process parameter type can be identified by the parameter name key words and the distribution characteristics of the parameter values.
In the specific implementation process of the invention, the structural data is scanned and analyzed by applying the domain ontology model and the key entity identification rule, and all key entities and attributes thereof which meet the conditions are extracted. The extraction process firstly identifies the type of the key entity in the data, and then analyzes various attribute values of the key entity according to the attribute specifications defined in the ontology model. Each identified key entity contains at least a key entity unique identifier, a key entity type tag, a key entity name, a key entity attribute set, and a confidence score.
Step 52, constructing a structural knowledge graph by applying a relation mapping rule based on the key entity and the attribute thereof extracted in the step 51;
Specifically, the relation mapping rules comprise three types: explicit relation mapping rules, implicit relation inference rules and compound relation construction rules. The explicit relation mapping rules directly convert relationships explicitly identified between key entities in the structured data into edges in the structural knowledge graph; for example, "include" relationships in the device hierarchy and "predecessor"/"successor" relationships in the production flow are mapped directly into the graph. The implicit relation inference rules infer associations that may exist between key entities based on correspondences among their attribute values, such as establishing an "associated" relation between key entities of different types that share the same identifier, and a "similar" relation between key entities with similar attribute patterns. The compound relation construction rules build higher-level semantic relationships from combinations of basic relationships; for example, when key entity A controls key entity B and key entity B affects key entity C, a compound relationship that key entity A indirectly affects key entity C is established.
In the implementation process of the invention, the construction of the structural knowledge graph is divided into three stages. First, all the key entities extracted in step 51 are used as nodes of the knowledge graph, and each node retains a type label, a name and an attribute set of the key entity. And secondly, applying an explicit relation mapping rule, and establishing basic relation edges between corresponding key entity nodes according to the existing relation information in the structured data. Each relationship edge contains at least three basic attributes of relationship type, relationship strength and build time. Finally, an implicit relationship inference rule and a compound relationship construction rule are applied, the relationship among potential key entities is found and added, the semantic structure of the knowledge graph is enriched, the newly added relationship side is marked as an inferred relationship, and a confidence score is added to indicate the reliability level of the relationship. Through the above procedure, the key entities and their attributes extracted in step 51 are organized into a structured knowledge graph, in which the nodes represent the key entities, the edges represent the relationships, and together constitute a graphical representation of domain knowledge.
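The following networkx sketch illustrates the three construction stages; the entities, relations and the single compound-relation rule instance are illustrative assumptions drawn from the equipment example below, not the claimed rule set.

```python
# Sketch of step 52's three-stage graph construction with networkx.
# Entities, relations and the compound rule are illustrative assumptions.
import networkx as nx

kg = nx.DiGraph()
# Stage 1: key entities extracted in step 51 become typed, attributed nodes.
kg.add_node("CNC-01", type="equipment", attrs={"line": "A"})
kg.add_node("spindle_speed", type="process_parameter", attrs={"unit": "rpm"})
kg.add_node("surface_accuracy", type="quality_parameter", attrs={"unit": "um"})
# Stage 2: explicit relation mapping -> basic relation edges.
kg.add_edge("CNC-01", "spindle_speed", rel="controls", strength=1.0, inferred=False)
kg.add_edge("spindle_speed", "surface_accuracy", rel="affects", strength=0.7, inferred=False)
# Stage 3: compound rule "A controls B, B affects C => A indirectly affects C",
# added as an inferred edge carrying a confidence score.
for a, b, d1 in list(kg.edges(data=True)):
    for _, c, d2 in list(kg.edges(b, data=True)):
        if d1["rel"] == "controls" and d2["rel"] == "affects":
            kg.add_edge(a, c, rel="indirectly_affects",
                        strength=d1["strength"] * d2["strength"],
                        inferred=True, confidence=0.6)

print(list(kg.edges(data="rel")))
```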
Step 53, extracting a hierarchical association relation of the structured data and generating application-oriented association relation data based on the structural knowledge graph, wherein the hierarchical association relation comprises three levels of key entity direct association, path association and group association;
Specifically, the direct association of key entities refers to the relationship between two key entities in the knowledge graph, which are directly connected through a single edge, and represents the most basic association form. The path association refers to an indirect relation between non-adjacent key entities connected by a path formed by a plurality of edges, and can reflect a more complex association mode. The group association refers to the overall relationship between closely associated key entity sets identified by a community detection algorithm, and represents a modularized structure and a functional partition in the system.
In the implementation process of the invention, the generation of the association relationship data at least comprises three steps. First, for each pair of key entities in the knowledge graph, the direct association strength and type between the key entities are calculated to form a key entity direct association matrix. And secondly, calculating the shortest path and the weighted path between any two key entities by applying a path analysis algorithm to generate a path association data set, wherein the path association data set comprises path length, an intermediate node sequence and comprehensive association strength. And finally, a community detection algorithm based on modularity optimization is applied, the knowledge graph is divided into a plurality of closely related key entity groups, common characteristics of key entities in the groups and interaction modes among the groups are analyzed, and a group related data set is generated.
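A short sketch of the three association levels follows, using networkx shortest paths for path association and greedy modularity communities for group association; the graph itself is an illustrative assumption.

```python
# Sketch of step 53's three association levels on a toy parameter graph.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

g = nx.Graph()
g.add_weighted_edges_from([("cooling_state", "spindle_temp", 0.8),
                           ("spindle_temp", "accuracy", 0.7),
                           ("feed_rate", "accuracy", 0.6),
                           ("humidity", "dust", 0.5)])

# Level 1: direct associations are simply the weighted edges.
direct = list(g.edges(data="weight"))
# Level 2: path association between non-adjacent entities (Dijkstra).
path = nx.shortest_path(g, "cooling_state", "accuracy", weight="weight")
# Level 3: group association via modularity-based community detection.
groups = list(greedy_modularity_communities(g))
print(direct, path, groups, sep="\n")
```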
The generated association relation data is organized in a standardized data structure. Each association relation record at least comprises: an association subject identifier, which may be a single key entity, a key entity pair or a key entity group; an association type, i.e. direct association, path association or group association; an association strength value between 0 and 1 indicating the tightness of the association; an association directivity flag, directed or undirected; an association validity time range; and an association confidence score.
In the specific example of the device health management system of the intelligent manufacturing enterprise, the knowledge graph technology is applied to the establishment of the association relation of the structured data such as device parameter configuration, production plan, bill of materials and the like. First, the system extracts key entities of the CNC machine tool, the injection molding machine, the robot and other equipment, and the operation parameters, maintenance records and other attributes from the structured data. Then, a structural knowledge graph is constructed by applying a relation mapping rule, wherein the structural knowledge graph comprises a 'equipment-parameter' relation, such as a relation between a CNC machine tool and a spindle rotating speed parameter, a 'equipment-equipment' relation, such as a production line relation between the CNC machine tool and downstream conveying equipment, and a 'parameter-parameter' relation, such as an influence relation between a feeding speed and machining precision. Along with the production process, the system continuously collects new equipment operation data, and the relation strength in the knowledge graph is adjusted in real time through a state-aware dynamic association updating mechanism, for example, when the association between the increase of the temperature of the main shaft and the decrease of the processing precision is found to be enhanced, the weight of the relation between the two parameters is improved. Finally, the system extracts a multi-level correlation structure from the updated knowledge graph to generate correlation data, wherein the correlation data comprises direct correlation (such as 'main shaft temperature' directly influences 'processing precision'), path correlation (such as 'cooling system state' indirectly influences 'processing precision' by influencing 'main shaft temperature'), and group correlation (such as integral correlation between a temperature-related parameter group and a precision-related parameter group).
Further, the step of taking the trend features, unstructured features and association relation data as input of the pre-constructed multi-level fusion data value evaluation model to generate the dynamic weight data asset efficacy evaluation comprises the following steps:
Step 61, constructing a multi-level data value evaluation model framework comprising a value feature extraction layer, a feature fusion layer and a performance evaluation layer;
Step 62, extracting value indexes from data of different data types through the value feature extraction layer, which comprises three parallel feature processing channels: a trend feature processing channel, an unstructured feature processing channel and an association relation processing channel;
In the embodiment of the invention, the trend feature processing channel receives the trend features extracted by the segmented multi-model time sequence feature extraction framework and extracts the value information they contain through a convolutional neural network, such as the regularity of data change patterns and the predictability of abnormal events; the unstructured feature processing channel receives the unstructured features extracted by the multi-layer perceptron-based feature extraction model and identifies the key information points and implicit knowledge they contain through an attention mechanism; and the association relation processing channel receives the association relation data generated by the knowledge graph technology and extracts structured value indexes, such as the complexity, centrality and connection patterns of relationships among entities, through a graph neural network.
Step 63, performing self-adaptive weighted fusion on the multidimensional value indexes from the value characteristic extraction layer through a characteristic fusion layer based on a deep neural network to generate comprehensive characteristic representation;
Specifically, the feature fusion layer adopts a multi-head attention mechanism and a residual connection structure to fuse the value indexes extracted by the three processing channels. The multi-head attention mechanism allows the model to attend to value features of different dimensions and learn the interactions between them, while the residual connection structure ensures that the original feature information is not lost in the deep network.
In the process of feature fusion, the output features of three processing channels are mapped to feature spaces with the same dimension, and then the attention weights among different features are calculated through a multi-head attention layer. For each feature, the attention mechanism will calculate its relevance to the other features and assign a fusion weight accordingly. The fusion method can adaptively adjust the importance of different types of features under different scenes, for example, trend features can obtain higher weight in scenes with larger data fluctuation, unstructured features can obtain higher weight in scenes needing deep understanding of content, and association relationship features can obtain higher weight in scenes needing comprehensive analysis of a relationship network.
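A minimal PyTorch sketch of this fusion is given below: the three channel outputs are projected to a shared dimension, passed through multi-head attention, and combined with a residual connection; the channel output dimensions are assumptions chosen for the example.

```python
# Sketch of step 63's fusion layer: shared projection, multi-head attention,
# residual connection. Channel dimensions are illustrative assumptions.
import torch
import torch.nn as nn

d_model = 64                                    # shared fusion dimension
channel_dims = (32, 48, 40)                     # assumed channel output dims
projections = nn.ModuleList([nn.Linear(d, d_model) for d in channel_dims])
attention = nn.MultiheadAttention(embed_dim=d_model, num_heads=4, batch_first=True)

# One sample's value indexes from the trend, unstructured and association channels.
channel_outputs = [torch.randn(1, d) for d in channel_dims]
tokens = torch.stack([p(x) for p, x in zip(projections, channel_outputs)],
                     dim=1)                              # shape (1, 3, 64)

fused, attn_weights = attention(tokens, tokens, tokens)  # cross-channel attention
fused = fused + tokens                                   # residual connection
composite = fused.mean(dim=1)                            # (1, 64) composite feature
print(composite.shape, attn_weights.shape)               # weights: (1, 3, 3)
```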
Step 64, based on the comprehensive characteristic representation, predicting the business value, the technical value and the innovation value of the data through a multi-task learning framework simultaneously, and fusing the value evaluation results of the three dimensions into a final data asset efficacy evaluation score;
Specifically, the efficacy evaluation layer adopts a multi-task learning framework comprising three parallel evaluation branches: a business value evaluation branch, a technical value evaluation branch and an innovation value evaluation branch. The business value evaluation branch evaluates the contribution of the data to business decision-making, process optimization and revenue improvement; the technical value evaluation branch evaluates the degree to which the data supports technical innovation, problem diagnosis and performance improvement; and the innovation value evaluation branch evaluates the inspiration the data provides for new product development, new mode exploration and knowledge discovery.
Each evaluation branch is formed by a special deep neural network that receives the composite feature representation from the feature fusion layer and outputs a value evaluation score for the corresponding dimension. The multi-task learning framework not only can provide multi-dimensional value evaluation, but also can promote evaluation of different dimensions mutually through sharing the feature representation layer, and improves the accuracy of overall evaluation.
The final data asset effectiveness evaluation score is obtained by weighting and fusing the evaluation results of the three dimensions. The fusion weight is dynamically adjusted according to different industries, different application scenes and different evaluation targets. For example, technical value may be weighted higher in a quality control scenario in manufacturing, business value may be weighted higher in a marketing scenario, and innovation value may be weighted higher in a research and development innovation scenario.
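The following sketch illustrates such scenario-dependent weighted fusion; the weight tables and scores are illustrative assumptions, normalized so each scenario's weights sum to 1.

```python
# Sketch of step 64's scenario-dependent fusion of the three value scores.
# Scenario weights and input scores are illustrative assumptions.
SCENARIO_WEIGHTS = {
    "quality_control": {"business": 0.2, "technical": 0.6, "innovation": 0.2},
    "marketing":       {"business": 0.6, "technical": 0.2, "innovation": 0.2},
    "r_and_d":         {"business": 0.2, "technical": 0.2, "innovation": 0.6},
}

def asset_efficacy(scores: dict, scenario: str) -> float:
    """Weighted fusion of the three dimension scores into one efficacy score."""
    weights = SCENARIO_WEIGHTS[scenario]
    return sum(weights[k] * scores[k] for k in weights)

scores = {"business": 72.0, "technical": 88.0, "innovation": 65.0}
print(asset_efficacy(scores, "quality_control"))  # technical value dominates here
```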
In the above example of the equipment health management system of the intelligent manufacturing enterprise, the multi-level fusion data value evaluation model receives as input the trend feature vectors extracted from the CNC machine spindle vibration signals, the feature vectors extracted from unstructured data such as equipment fault reports, and the association relation data established from structured data such as equipment parameter configurations. The value feature extraction layer extracts, respectively, the early-warning value of the vibration signal trend features, the experiential knowledge value in the fault report texts, and the optimization potential value in the equipment parameter association relations; the feature fusion layer performs weighted fusion of these value features, dynamically adjusting the weight of each according to the current focus of equipment health management; and the efficacy evaluation layer generates a comprehensive value evaluation score for the equipment data, covering the supporting value of preventive maintenance decisions (business value), the guiding value for fault diagnosis (technical value) and the heuristic value for equipment optimization and improvement (innovation value). Based on the evaluation results, the enterprise can identify the most valuable equipment monitoring data, optimize the data acquisition strategy, and concentrate limited computing and storage resources on the processing and analysis of high-value data, thereby improving the efficiency and accuracy of equipment health management. For example, the system can automatically increase the sampling frequency and storage priority of vibration signal data that the evaluation shows to have high early-warning value, strengthen text feature extraction and knowledge graph construction for fault reports shown to have high diagnostic value, and deepen causal analysis and optimization space mining for parameter association data shown to have high optimization value. This data-value-driven resource allocation strategy ensures that the equipment health management system extracts the maximum value from its data assets in the most efficient manner.
Example 2
As shown in FIG. 2, the intelligent data management system based on multi-source data acquisition comprises an original data collection module, a data classification module, a multi-type feature extraction module and an asset efficacy evaluation module, wherein the modules are electrically connected;
The original data collection module acquires multi-source data from multiple data sources, generates an original data set with credibility marks, and sends the original data set to the data classification module;
The data classification module classifies the data types of the data records in the original data set by adopting the data type recognition method based on the combination of rules and a decision tree classifier, divides the data records into time sequence data, unstructured data and structured data, and sends them to the multi-type feature extraction module;
The multi-type feature extraction module is used for extracting trend features of the time sequence data by applying a segmented multi-model time sequence feature extraction framework, extracting unstructured features of the unstructured data by using a feature extraction model based on a multi-layer perceptron, establishing a self-adaptive association relation for the structured data by using a knowledge graph technology, generating association relation data, and sending the trend features, the unstructured features and the association relation data to the asset efficacy evaluation module;
The asset efficacy evaluation module takes the trend features, unstructured features and association relation data as input of a pre-constructed multi-level fusion data value evaluation model to generate a dynamic weight data asset efficacy evaluation.
The above embodiments are only for illustrating the technical method of the present invention and not for limiting the same, and it should be understood by those skilled in the art that the technical method of the present invention may be modified or substituted without departing from the spirit and scope of the technical method of the present invention.