CN109582963A - A kind of archives automatic classification method based on extreme learning machine - Google Patents

A kind of archives automatic classification method based on extreme learning machine

Info

Publication number
CN109582963A
Authority
CN
China
Prior art keywords
text
sample
bottom layer
classification
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811438592.XA
Other languages
Chinese (zh)
Inventor
曾伟波
张建辉
林培煜
潘淑英
陈泰隆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Linewell Software Co Ltd
Original Assignee
Fujian Linewell Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujian Linewell Software Co Ltd
Priority to CN201811438592.XA
Publication of CN109582963A
Legal status: Pending (current)

Links

Classifications

Landscapes

Abstract

The present invention relates to an automatic archive classification method based on an extreme learning machine. The method comprises a learning stage and an operating stage, and the first step of both stages passes the data through a preprocessing module, which standardizes the data and removes information irrelevant to the task. The preprocessing module first unifies the text content into the UTF-8 encoding format; it then filters illegal characters by regular-expression matching; word segmentation and part-of-speech tagging are then performed with the ICTCLAS Chinese lexical analysis system; finally, the Baidu stop-word list is used to filter words that occur frequently in the text but contribute little to text analysis. The invention can accurately understand the archive content of a text while constructing an efficient, stable, low-dimensional archive dictionary, and at the same time guarantees high classification accuracy.

Description

Automatic file classification method based on extreme learning machine
Technical Field
The invention belongs to the technical field of text classification, and particularly relates to an automatic file classification method based on an extreme learning machine.
Background
In the face of massive electronic file information, the current management mode relies on professionals with extensive archival experience to classify files manually and supervise the classification in an archive management system. However, as the number of electronic files grows explosively, the manpower consumed by manual classification increasingly exceeds what archive staff can handle; in addition, different archive professionals classify the same archival material in unpredictably different ways, which can leave parts of the archive inconsistently classified over time. Automatic computer text classification is therefore the best way to manage electronic files effectively and use them efficiently.
Several difficulties in the field of text classification still urgently need to be solved, mainly: (1) how to construct an efficient and stable semantic classification dictionary; (2) how to break the independence assumption between words in the vector space model; and (3) how to effectively balance classification accuracy against training speed on massive data.
The invention provides an automatic file classification method based on an extreme learning machine. The method comprises a preprocessing module, a text feature extraction module, a feature fusion module and a classification module based on an extreme learning machine. The text feature extraction module comprises two sub-modules: the bottom layer feature extraction module and the middle layer feature autonomous learning module. The invention can effectively solve the problems in the text classification field.
Disclosure of Invention
The invention aims to solve the problem that the existing file text classification is not efficient and stable enough, and provides an automatic file classification method based on an extreme learning machine.
In order to achieve the purpose, the technical scheme of the invention is as follows: an automatic file classification method based on an extreme learning machine comprises the following steps:
step S1, training sample preprocessing: carrying out standardization processing on a text training sample set for model learning;
step S2, extracting the bottom-layer features of the text training samples: the preprocessed samples are sent to the bottom-layer feature extraction module to extract the text bottom-layer features, which involves two processes, construction of the archive dictionary and corpus and formation of the bottom-layer feature expression of each training sample; the bottom-layer features are expressed in a vector space model in which each dimension is a normalized TF-IDF weight (a sketch of this extraction is given after step S12);
step S3, autonomous learning of the middle-layer features of the text training samples: the archive dictionary and corpus generated in step S2 are used to train a Skip-gram model in an unsupervised way, and the trained model generates word vectors for the training samples; finally, a pooling technique forms the middle-layer feature expression of each training document;
step S4, combining the bottom layer and middle layer features of the text training sample: the bottom layer characteristics calculated in the step S2 and the middle layer characteristics calculated in the step S3 are weighted and connected in series to form the final fusion characteristic expression of the document;
step S5, training a file classification model based on the extreme learning machine: respectively training three archive classification models based on the extreme learning machine by adopting a supervised training mode based on the bottom layer feature calculated in the step S2, the middle layer feature calculated in the step S3 and the fusion feature calculated in the step S4, wherein the three archive classification models correspond to the bottom layer feature archive classification model, the middle layer feature archive classification model and the fusion feature archive classification model;
step S6, preprocessing of the sample to be judged: the sample to be judged is subjected to standardization processing;
step S7, extracting bottom layer features of the sample to be judged: sending the preprocessed samples into a bottom layer feature extraction module to extract text bottom layer features, and directly forming a bottom layer feature expression based on the archive dictionary generated in the step S2, wherein the bottom layer features are expressed by selecting a vector space model, and the features of each dimension in the vector are normalized TF-IDF weights;
step S8, extracting the middle-layer features of the sample to be judged: the Skip-gram model learned in step S3 generates word vectors for the sample to be judged, and a pooling technique then forms its middle-layer feature expression;
step S9, combining the bottom layer and middle layer features of the sample text to be judged: the text bottom layer features calculated in the step S7 and the text middle layer features calculated in the step S8 are weighted and connected in series to form the final combined feature expression of the document to be judged;
step S10, automatically classifying the sample files to be determined: respectively sending the bottom layer features, the middle layer features and the combined features which are calculated in the steps S7, S8 and S9 into the three extreme learning machine-based archive classification models which are learned in the step S5 for classification, and synthesizing the classification results of the three classification models to obtain the archive category to which the sample to be judged belongs;
step S11, steps S6-S10 are run continuously to complete the classification of the text samples;
and step S12, inputting a new sample to be judged, and operating steps S6-S10 to finish automatic file classification of the new text sample.
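A minimal Python sketch of the bottom-layer feature extraction of steps S2 and S7 is given below; the dictionary size, the IDF form and the L2 normalization are illustrative assumptions, since the patent only states that each dimension carries a normalized TF-IDF weight.

    import math
    from collections import Counter

    def build_dictionary(tokenized_docs, max_terms=5000):
        # archive dictionary: top candidate words by document frequency (size is an assumption)
        df = Counter()
        for doc in tokenized_docs:
            df.update(set(doc))
        vocab = {w: i for i, (w, _) in enumerate(df.most_common(max_terms))}
        return vocab, df

    def tfidf_vector(doc, vocab, df, n_docs):
        # bottom-layer feature: normalized TF-IDF weights over the archive dictionary
        tf = Counter(w for w in doc if w in vocab)
        vec = [0.0] * len(vocab)
        for w, f in tf.items():
            vec[vocab[w]] = f * math.log(n_docs / (1 + df[w]))
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]   # L2 normalization, one possible choice

In the operating stage (step S7) the vocabulary and document frequencies obtained during training are reused, so the archive dictionary stays fixed.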
In an embodiment of the present invention, the preprocessing of the samples in steps S1 and S6 comprises four processes: coding-format standardization, illegal-character removal, word segmentation with part-of-speech tagging, and stop-word removal. Coding-format standardization unifies the text content into the UTF-8 encoding format; illegal characters are filtered by regular-expression matching; word segmentation and part-of-speech tagging are performed with the ICTCLAS Chinese lexical analysis system; and stop-word removal uses the Baidu stop-word list to filter words that occur frequently in text but carry little meaning for archive analysis.
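The following Python sketch illustrates these four preprocessing steps; jieba's part-of-speech tagger stands in for the ICTCLAS lexical analysis system, and the character whitelist and stop-word list are illustrative assumptions rather than the patent's exact choices.

    import re
    import jieba.posseg as pseg   # stand-in tokenizer; the patent uses ICTCLAS

    # keep CJK characters, ASCII letters/digits and basic punctuation; everything else is "illegal"
    ILLEGAL = re.compile(r"[^\u4e00-\u9fa5A-Za-z0-9，。！？、；：（）\s]")

    def preprocess(raw_bytes, stopwords):
        text = raw_bytes.decode("utf-8", errors="ignore")    # unify content as UTF-8
        text = ILLEGAL.sub("", text)                         # regex-based illegal-character filtering
        tokens = [(t.word, t.flag) for t in pseg.cut(text)]  # segmentation with part-of-speech tags
        return [(w, pos) for w, pos in tokens if w.strip() and w not in stopwords]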
In an embodiment of the present invention, in step S2, construction of the archive dictionary comprises two processes, part-of-speech selection and bottom-layer feature selection. Part-of-speech selection takes nouns, verbs, adjectives and adverbs together as candidate words, combining words of different parts of speech to capture the latent semantic information of a document. Bottom-layer feature selection then applies a chi-square-statistic criterion on top of the part-of-speech selection to retain the feature words that are most representative of the archive classes.
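A sketch of the chi-square selection under common assumptions (X is a binary term-presence matrix over the part-of-speech-filtered candidate words, y holds the archive labels; the statistic variant and the number of retained words are not fixed by the patent):

    import numpy as np

    def chi_square_scores(X, y):
        # per-term chi-square statistic against the archive classes; higher = more class-indicative
        classes = np.unique(y)
        N = X.shape[0]
        scores = np.zeros(X.shape[1])
        for c in classes:
            in_c = (y == c)
            A = X[in_c].sum(axis=0)            # term present, class c
            B = X[~in_c].sum(axis=0)           # term present, other classes
            C = in_c.sum() - A                 # term absent, class c
            D = (~in_c).sum() - B              # term absent, other classes
            chi2 = N * (A * D - B * C) ** 2 / ((A + C) * (B + D) * (A + B) * (C + D) + 1e-12)
            scores = np.maximum(scores, chi2)  # take the best score over all classes (one common choice)
        return scores

    # e.g. keep the highest-scoring words as the archive dictionary (the cut-off is an assumption)
    # selected = np.argsort(chi_square_scores(X, y))[::-1][:5000]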
In an embodiment of the present invention, in steps S4 and S9, the fusion strategy for the bottom-layer and middle-layer features is a weighted combination: F = αL | (1−α)M, where L is the bottom-layer feature vector, M is the middle-layer feature vector, α is the weight of the bottom-layer features and is set to 0.2, and | denotes concatenation.
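A two-line sketch of this weighted concatenation; splitting the weight as α and 1−α is one plausible reading, since the patent only fixes the bottom-layer weight at 0.2:

    import numpy as np

    def fuse(bottom, middle, alpha=0.2):
        # F = alpha * L  concatenated with  (1 - alpha) * M
        return np.concatenate([alpha * np.asarray(bottom), (1.0 - alpha) * np.asarray(middle)])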
In an embodiment of the present invention, in steps S3 and S8, the concrete pooling process is as follows: the k dimensions of the bottom-layer text feature vector are divided into N equal parts; within each part, the word vectors of the feature words that appear there are accumulated, and if no word falls in a part, its accumulated vector is set to all zeros; after the word vectors of every part have been accumulated, the partial vectors are concatenated in order of the parts to obtain a brand-new vector representing the document.
In an embodiment of the present invention, in step S10, the concrete process of deciding the archive class of the sample to be judged is as follows: the bottom-layer features, middle-layer features and combined features of the sample to be judged are sent to the corresponding trained extreme-learning-machine archive classification models, the output vectors of the three models are added to obtain the final decision vector, and the label corresponding to the maximum value of this vector is the final archive class.
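A sketch of this decision fusion, assuming each trained model returns an m-dimensional score vector with one entry per archive class:

    import numpy as np

    def fuse_decisions(score_bottom, score_middle, score_fused):
        # add the three models' output vectors and take the arg-max as the final archive class
        combined = np.asarray(score_bottom) + np.asarray(score_middle) + np.asarray(score_fused)
        return int(np.argmax(combined))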
Compared with the prior art, the invention has the following beneficial effects:
1. in the archive classification method based on the extreme learning machine, because part-of-speech selection and bottom-layer feature selection are adopted, the archive content of a text can be understood accurately while an efficient, stable archive dictionary of low dimensionality is constructed;
2. in the automatic archive classification method based on the extreme learning machine, because feature fusion and classifier fusion are adopted, the classification accuracy can be effectively improved;
3. the automatic archive classification method based on the extreme learning machine randomly selects the hidden nodes of the network structure and computes the output weights in a single least-squares step, thereby completing network training. Network training is therefore greatly simplified, and training is tens to hundreds of times faster than classical algorithms such as the SVM.
Drawings
FIG. 1 is a general block diagram of an extreme learning machine-based automatic document classification method.
FIG. 2 is a flow chart of a middle layer feature pooling step.
FIG. 3 is a flow chart of a combined classifier.
FIG. 4 is a schematic diagram of an extreme learning machine-based archival classification model.
Detailed Description
The technical scheme of the invention is specifically explained below with reference to the accompanying drawings.
The invention provides an automatic file classification method based on an extreme learning machine, which specifically comprises the following steps:
(1) training sample pretreatment: carrying out standardization processing on a text training sample set for model learning, and removing information irrelevant to the task;
(2) extracting bottom layer features of the text training sample: sending the sample processed by the preprocessing module into a bottom layer feature extraction module to extract text bottom layer features, wherein the module comprises two processes of constructing a file dictionary and forming bottom layer feature expression of a training sample in a model learning stage, wherein the bottom layer features are expressed by selecting a vector space model, and the features of each dimension in the vector are normalized TF-IDF weights;
(3) autonomous learning of the middle-layer features of the text training samples: the archive dictionary generated in step (2) is combined with the large-scale corpus to train a Skip-gram model in an unsupervised way, and the trained model generates word vectors for the training samples (a training sketch is given after step (12)); finally, a pooling technique forms the middle-layer feature expression of each training document;
(4) combining the bottom layer and middle layer features of the text training sample: weighting and connecting the bottom layer characteristics calculated in the step (2) and the middle layer characteristics calculated in the step (3) in series to form final characteristic expression of the document;
(5) training a file classification model based on an extreme learning machine: respectively training three archive classification models based on an extreme learning machine by adopting a supervised training mode based on the sample bottom layer features calculated in the step (2), the middle layer features calculated in the step (3) and the fusion features calculated in the step (4), wherein the three archive classification models correspond to a bottom layer feature archive classification model, a middle layer feature archive classification model and a fusion feature archive classification model;
(6) preprocessing a sample to be judged: carrying out standardization processing on a sample to be judged, and removing information irrelevant to the task;
(7) extracting bottom layer features of a sample to be judged: sending the sample processed by the preprocessing module into a bottom layer feature extraction module to extract text bottom layer features, and directly forming a bottom layer feature expression based on the archive dictionary generated in the step (2), wherein the bottom layer features are expressed by selecting a vector space model, and the features of each dimension in the vector are normalized TF-IDF weights;
(8) extracting the middle-layer features of the sample to be judged: generating word vectors for the sample to be judged with the Skip-gram model learned in step (3), and finally forming the middle-layer feature expression of the sample to be judged by a pooling technique;
(9) combining the bottom layer and middle layer features of the sample text to be judged: weighting and connecting the text bottom layer features calculated in the step (7) and the text middle layer features calculated in the step (8) in series to form the final feature expression of the document to be judged;
(10) automatically classifying the sample files to be judged: respectively sending the bottom layer features, the middle layer features and the combined features calculated in the steps (7), (8) and (9) into the three extreme learning machine-based archive classification models learned in the step (5) for classification, and synthesizing the classification results of the three classification models to obtain the archive category to which the sample to be judged belongs;
(11) in the model operation phase, steps (6)-(10) are executed continuously to complete the classification of the text samples.
(12) Inputting a new sample to be judged, and executing the steps (6) to (10) to finish the automatic file classification of the new text sample.
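A minimal sketch of the unsupervised Skip-gram training of step (3), with gensim's Word2Vec standing in as the implementation (gensim 4.x API; the vector size, window, minimum count and epoch count are illustrative assumptions):

    from gensim.models import Word2Vec

    def train_skipgram(tokenized_corpus, dim=100):
        # sg=1 selects the Skip-gram architecture; training is unsupervised
        return Word2Vec(sentences=tokenized_corpus, vector_size=dim,
                        window=5, min_count=2, sg=1, epochs=10)

    # word vectors for one preprocessed document (words outside the model are skipped)
    # model = train_skipgram(corpus_tokens)
    # doc_vectors = [model.wv[w] for w in doc_tokens if w in model.wv]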
In the method for automatically classifying archives based on the extreme learning machine, the preprocessing of the samples in step (1) and step (6) comprises four processes: coding-format standardization, illegal-character removal, word segmentation with part-of-speech tagging, and stop-word removal. Process 1 unifies the text content into the UTF-8 encoding format; process 2 filters illegal characters by regular-expression matching; process 3 performs word segmentation and part-of-speech tagging with the ICTCLAS Chinese lexical analysis system; process 4 uses the Baidu stop-word list to filter words that occur frequently in text but carry little meaning for archive analysis.
In the method for automatically classifying archives based on the extreme learning machine, in step (2), construction of the archive classification dictionary comprises two processes, part-of-speech selection and bottom-layer feature selection. Process 1 takes nouns, verbs, adjectives and adverbs together as candidate words and combines words of different parts of speech to capture the latent semantic information of a document, which maximizes the coverage of the archive dictionary while preserving the semantic information of the document. Process 2 then applies a chi-square-statistic criterion on top of process 1 to retain the feature words that are most representative of the archive classes.
In the archive classification method based on the extreme learning machine, in step (4) and step (9), the fusion strategy for the bottom-layer and middle-layer features is a weighted combination: F = αL | (1−α)M, where L is the bottom-layer feature vector, M is the middle-layer feature vector, α is the weight of the bottom-layer features and is set to 0.2, and | denotes concatenation.
In the automatic archive classification method based on the extreme learning machine, the concrete pooling process in step (3) and step (8) is as follows: the k dimensions of the bottom-layer text feature vector are divided equally into N parts; within each part, the word vectors of the feature words that appear there are accumulated, and if no word falls in a part, its accumulated vector is set to all zeros; after every part has been accumulated, the partial vectors are concatenated in order of the parts to obtain a brand-new vector representing the document.
In the method for automatically classifying archives based on the extreme learning machine, in step (10), the concrete process of deciding the archive class of the sample to be judged is as follows: the bottom-layer features, middle-layer features and combined features of the sample to be judged are sent to the corresponding trained extreme-learning-machine archive classification models, the output vectors of the three models are added to obtain the final decision vector, and the label corresponding to the maximum value of this vector is the final archive class.
The following are specific examples of the present invention.
The first embodiment is as follows: see fig. 1. The automatic file classification method based on the extreme learning machine mainly comprises two stages: a model learning phase and a model operating phase. Each stage contains four modules: the system comprises a preprocessing module, a text feature extraction module, a bottom layer feature and middle layer feature fusion module and an extreme learning machine-based archive classification module. The text feature extraction module comprises two sub-modules: the bottom layer feature extraction module and the middle layer feature autonomous learning module. The method comprises the following steps:
(1) training sample pretreatment: carrying out standardization processing on a text training sample set for model learning, and removing information irrelevant to the task;
(2) extracting bottom layer features of the text training sample: sending the sample processed by the preprocessing module into a bottom layer feature extraction module to extract text bottom layer features, wherein the module comprises two processes of constructing a file dictionary and forming bottom layer feature expression of a training sample in a model learning stage, wherein the bottom layer features are expressed by selecting a vector space model, and the features of each dimension in the vector are normalized TF-IDF weights;
(3) the middle-layer feature of the text training sample is independently learned: and (3) combining the archive classification dictionary generated in the step (2) and the large-scale corpus to train a Skip-gram model in an unsupervised mode, and generating a training sample word vector by using the trained model. Finally, forming a middle-layer characteristic expression of each training document by adopting a pooling technology;
(4) combining the bottom layer and middle layer features of the text training sample: weighting and connecting the bottom layer characteristics calculated in the step (2) and the middle layer characteristics calculated in the step (3) in series to form final characteristic expression of the document;
(5) training a file classification model based on an extreme learning machine: respectively training three archive classification models based on an extreme learning machine by adopting a supervised training mode based on the sample bottom layer features calculated in the step (2), the middle layer features calculated in the step (3) and the fusion features calculated in the step (4), wherein the three archive classification models correspond to a bottom layer feature archive classification model, a middle layer feature archive classification model and a fusion feature archive classification model;
(6) preprocessing a sample to be judged: carrying out standardization processing on a sample to be judged, and removing information irrelevant to the task;
(7) extracting bottom layer features of a sample to be judged: sending the sample processed by the preprocessing module into a bottom layer feature extraction module to extract text bottom layer features, and directly forming a bottom layer feature expression based on the archive dictionary generated in the step (2), wherein the bottom layer features adopt a TF-IDF algorithm to calculate the weight;
(8) extracting the middle-layer features of the sample to be judged: generating word vectors for the sample to be judged with the Skip-gram model learned in step (3), and finally forming the middle-layer feature expression of the sample to be judged by a pooling technique;
(9) combining the bottom layer and middle layer features of the sample text to be judged: weighting and connecting the text bottom layer features calculated in the step (7) and the text middle layer features calculated in the step (8) in series to form the final feature expression of the document to be judged;
(10) automatically classifying the sample files to be judged: respectively sending the bottom layer features, the middle layer features and the combined features calculated in the steps (7), (8) and (9) into the three extreme learning machine-based archive classification models learned in the step (5) for classification, and synthesizing the classification results of the three classification models to obtain the archive category to which the sample to be judged belongs;
(11) in the model operation stage, steps (6) to (10) are executed continuously to complete the automatic archive classification of the text samples.
Example two: see fig. 1, 2. This embodiment of the automatic archive classification method based on the extreme learning machine further details the pooling technique used in step (3) and step (8). The flow of this step is as follows:
(1) suppose the archive document contains x words, of which t words remain after bottom-layer feature extraction; the text is then represented as T = {w_1, w_2, ..., w_t}, where each word w_i has a word vector v(w_i) with k-dimensional features;
(2) the word vectors in the text T are divided equally into N parts, forming N word-vector groups, each group corresponding to t/N word vectors;
(3) for each word-vector group the following operation is performed: all word vectors in the group are accumulated, so that each group z finally forms a feature vector v(z) whose dimension is also k;
(4) the feature vectors of the N word-vector groups are concatenated to obtain the feature vector of the whole document, as shown in the formula:
V(T) = v(z_1) | v(z_2) | ... | v(z_N),
where | denotes concatenation.
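A numpy sketch of this pooling, following the grouping of the document's t word vectors into N consecutive parts described in this example (N and the word-vector dimension k are parameters; empty parts contribute zero vectors):

    import numpy as np

    def pool_middle_features(word_vectors, n_groups, k):
        # split the word-vector sequence into N roughly equal consecutive groups,
        # sum the vectors inside each group, and concatenate the N partial sums
        vecs = np.asarray(word_vectors, dtype=float).reshape(-1, k)
        groups = np.array_split(vecs, n_groups)
        pooled = [g.sum(axis=0) for g in groups]   # an empty group sums to the zero vector
        return np.concatenate(pooled)              # middle-layer feature of dimension N * k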
Example three: see fig. 1, 3. This embodiment of the automatic archive classification method based on the extreme learning machine further details the technical solution of step (10). The details of classifying the sample to be judged in step (10) are as follows:
the algorithm consists of the following steps:
(1) respectively extracting bottom layer characteristics, middle layer characteristics and fusion characteristics of the text sample;
(2) respectively sending the three characteristics into a trained file classification model based on the bottom layer characteristics, a trained file classification model based on the middle layer characteristics and a trained file classification model based on the fusion characteristics;
(3) adding the output result vectors of the three classification models (wherein each dimension of the vector corresponds to one class of archive category, and the numerical value of each dimension represents the probability that the text sample belongs to the archive category) to obtain a final output vector;
(4) the entry with the maximum value in the final output vector is found; its corresponding archive class is the class of the sample to be judged.
Example four: referring to fig. 4, the automatic document classification method based on the extreme learning machine according to this embodiment further details the technical solution of the document classification model based on the extreme learning machine in step (5). The details are as follows:
An archive classification model based on the extreme learning machine. The extreme learning machine is a single-hidden-layer feedforward neural network (SLFN) consisting of an input layer, a hidden layer and an output layer; the input layer is fully connected to the hidden layer, and the hidden layer is fully connected to the output layer. The input layer X is the sample feature vector, the hidden layer contains L hidden neurons (typically L is much smaller than the number of samples N), and the output layer produces an m-dimensional vector, one dimension per archive class. Unlike a traditional neural network, the weights between the input layer and the hidden layer of the extreme learning machine are generated randomly, so only the connection weights between the hidden layer and the output layer need to be determined. The optimization minimizes both the training error and the norm of the hidden-layer output weights, which gives the model good generalization ability; the optimization objective is:
minimize ||Hβ − T|| and ||β||        formula (1)

where

H = [h(x_1); h(x_2); ...; h(x_N)]        formula (2)

is the hidden-layer output matrix over the training samples: x denotes the set of N training-sample text expressions, and the size of H is determined by the number of training samples N and the number of hidden nodes L (typically L is much smaller than N). T is the label matrix formed by the training sample set; each row corresponds to one sample and is stored in one-hot form.

β        formula (3)

is the matrix of connection weights between the hidden layer and the output layer. Finally, the analytic solution of formula (1) is

β̂ = H†T        formula (4)

where H† is the Moore-Penrose generalized inverse of H.
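A minimal numpy sketch of this training scheme; the sigmoid activation, the plain Moore-Penrose pseudo-inverse solution and the hidden-layer size are assumptions that the patent leaves open:

    import numpy as np

    class ELMClassifier:
        # single-hidden-layer feedforward network trained in one least-squares step

        def __init__(self, n_hidden=500, seed=0):
            self.n_hidden = n_hidden
            self.rng = np.random.default_rng(seed)

        def fit(self, X, T):
            # X: (N, d) feature matrix; T: (N, m) one-hot label matrix
            d = X.shape[1]
            self.W = self.rng.normal(size=(d, self.n_hidden))   # random input weights, never updated
            self.b = self.rng.normal(size=self.n_hidden)        # random hidden biases
            H = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))    # hidden-layer output matrix H
            self.beta = np.linalg.pinv(H) @ T                   # output weights: beta = pinv(H) @ T
            return self

        def predict_scores(self, X):
            H = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))
            return H @ self.beta                                # one score per archive class

        def predict(self, X):
            return np.argmax(self.predict_scores(X), axis=1)

The three archive classification models of step (5) are three such networks, trained on the bottom-layer, middle-layer and fused features respectively.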
The above are preferred embodiments of the present invention; any change made according to the technical scheme of the present invention that produces equivalent functional effects, without exceeding the scope of the technical scheme, belongs to the protection scope of the present invention.

Claims (6)

CN201811438592.XA | 2018-11-29 | 2018-11-29 | A kind of archives automatic classification method based on extreme learning machine | Pending | CN109582963A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201811438592.XA | CN109582963A (en) | 2018-11-29 | 2018-11-29 | A kind of archives automatic classification method based on extreme learning machine

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201811438592.XA | CN109582963A (en) | 2018-11-29 | 2018-11-29 | A kind of archives automatic classification method based on extreme learning machine

Publications (1)

Publication Number | Publication Date
CN109582963A (en) | 2019-04-05

Family

ID=65924925

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201811438592.XA | Pending | CN109582963A (en) | 2018-11-29 | 2018-11-29 | A kind of archives automatic classification method based on extreme learning machine

Country Status (1)

Country | Link
CN (1) | CN109582963A (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN105160290A (en) * | 2015-07-03 | 2015-12-16 | 东南大学 | Mobile boundary sampling behavior identification method based on improved dense locus
US20180284747A1 (en) * | 2016-05-09 | 2018-10-04 | StrongForce IoT Portfolio 2016, LLC | Methods and systems for optimization of data collection and storage using 3rd party data from a data marketplace in an industrial internet of things environment
CN107451278A (en) * | 2017-08-07 | 2017-12-08 | 北京工业大学 | Chinese Text Categorization based on more hidden layer extreme learning machines
CN107590134A (en) * | 2017-10-26 | 2018-01-16 | 福建亿榕信息技术有限公司 | Text sentiment classification method, storage medium and computer
CN108733653A (en) * | 2018-05-18 | 2018-11-02 | 华中科技大学 | A kind of sentiment analysis method of the Skip-gram models based on fusion part of speech and semantic information
CN108875961A (en) * | 2018-06-11 | 2018-11-23 | 中国石油大学(华东) | A kind of online weighting extreme learning machine method based on pre- boundary's mechanism

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN111007068A (en) * | 2019-11-21 | 2020-04-14 | 中国兵器工业信息中心 | Yellow cultivation diamond grade classification method based on deep learning
CN111007068B (en) * | 2019-11-21 | 2022-05-13 | 中国兵器工业信息中心 | Yellow cultivated diamond grade classification method based on deep learning
CN112632971B (en) * | 2020-12-18 | 2023-08-25 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching
CN112632971A (en) * | 2020-12-18 | 2021-04-09 | 上海明略人工智能(集团)有限公司 | Word vector training method and system for entity matching
CN112785266A (en) * | 2021-01-22 | 2021-05-11 | 广西安怡臣信息技术有限公司 | Electronic archive detection management system
US20220245401A1 (en) * | 2021-01-29 | 2022-08-04 | Beijing Dajia Internet Information Technology Co., Ltd. | Method and apparatus for training model
CN113191123A (en) * | 2021-04-08 | 2021-07-30 | 中广核工程有限公司 | Indexing method and device for engineering design archive information and computer equipment
CN113434639A (en) * | 2021-07-08 | 2021-09-24 | 中国银行股份有限公司 | Audit data processing method and device
CN113609361A (en) * | 2021-08-20 | 2021-11-05 | 东北大学 | Data classification method based on Gaia system
CN113609361B (en) * | 2021-08-20 | 2023-11-14 | 东北大学 | Data classification method based on Gaia system
CN113610194A (en) * | 2021-09-09 | 2021-11-05 | 重庆数字城市科技有限公司 | Automatic classification method for digital files
CN113610194B (en) * | 2021-09-09 | 2023-08-11 | 重庆数字城市科技有限公司 | Automatic classification method for digital files
US12429842B2 (en) | 2021-10-07 | 2025-09-30 | Saudi Arabian Oil Company | Method and system for managing plant safety using machine learning

Similar Documents

Publication | Publication Date | Title
CN109582963A (en) | A kind of archives automatic classification method based on extreme learning machine
CN110580292A (en) | Text label generation method and device and computer readable storage medium
US20200394509A1 (en) | Classification Of Sparsely Labeled Text Documents While Preserving Semantics
US20170344822A1 (en) | Semantic representation of the content of an image
CN110188195A (en) | A kind of text intension recognizing method, device and equipment based on deep learning
CN110704616B (en) | Equipment alarm work order identification method and device
CN111143840B (en) | Method and system for identifying abnormity of host operation instruction
CN114627282A (en) | Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
CN115098690B (en) | Multi-data document classification method and system based on cluster analysis
CN112257425A (en) | Power data analysis method and system based on data classification model
CN113886562A (en) | An AI resume screening method, system, device and storage medium
CN112989058A (en) | Information classification method, test question classification method, device, server and storage medium
CN113722439A (en) | Cross-domain emotion classification method and system based on antagonism type alignment network
CN117494051A (en) | Classification processing method, model training method and related device
Athavale et al. | Predicting algorithm classes for programming word problems
Alam et al. | Social media content categorization using supervised based machine learning methods and natural language processing in bangla language
WO2025066156A1 (en) | Method and system for interpreting common interaction utility amongst multiple blackbox artificial intelligence models
CN119416874B (en) | Knowledge enhancement and self-adaptive fine tuning based power large language model construction method
Alshahrani et al. | Applied linguistics with red-tailed hawk optimizer-based ensemble learning strategy in natural language processing
CN118051592A (en) | Question answering method, device, equipment and storage medium
CN119782503A (en) | LLM-based document structured automatic processing method and system
CN116226747A (en) | Training method of data classification model, data classification method and electronic device
CN111046934A (en) | A kind of SWIFT message soft clause identification method and device
CN120068134A (en) | Sensitive word auditing method
CN115392254A (en) | Interpretable cognitive prediction and discrimination method and system based on target task

Legal Events

Date | Code | Title | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
RJ01 | Rejection of invention patent application after publication

Application publication date: 2019-04-05

