Summary of the invention
In view of this, embodiments of the present invention provide a machine-learning-based industry classification method and terminal device, to solve the problem that industry classification in the prior art is inefficient and inaccurate.
A first aspect of the embodiments of the present invention provides a machine-learning-based industry classification method, comprising:
obtaining a training set, wherein the training set is a manually labeled collection of texts covering a plurality of industry categories, and each text in the training set contains business operation information and is labeled with its corresponding industry category;
performing word segmentation on the text to obtain a vocabulary list corresponding to the text;
obtaining, by feature extraction, a first preset number of words from the vocabulary list as keywords;
for each keyword obtained, obtaining the word vector of the keyword by means of a word vector model;
averaging the word vectors of all the keywords to obtain a first vector;
obtaining the maximum word vector among the word vectors of all the keywords as a second vector;
obtaining the minimum word vector among the word vectors of all the keywords as a third vector;
forming the feature vector of the text from the first vector, the second vector and the third vector;
training an industry classification model on the training set;
performing industry classification on a text to be classified by means of the trained industry classification model.
A second aspect of the embodiments of the present invention provides a computer-readable storage medium storing computer-readable instructions which, when executed by a processor, implement the following steps:
obtaining a training set, wherein the training set is a manually labeled collection of texts covering a plurality of industry categories, and each text in the training set contains business operation information and is labeled with its corresponding industry category;
performing word segmentation on the text to obtain a vocabulary list corresponding to the text;
obtaining, by feature extraction, a first preset number of words from the vocabulary list as keywords;
for each keyword obtained, obtaining the word vector of the keyword by means of a word vector model;
averaging the word vectors of all the keywords to obtain a first vector;
obtaining the maximum word vector among the word vectors of all the keywords as a second vector;
obtaining the minimum word vector among the word vectors of all the keywords as a third vector;
forming the feature vector of the text from the first vector, the second vector and the third vector;
training an industry classification model on the training set;
performing industry classification on a text to be classified by means of the trained industry classification model.
A third aspect of the embodiments of the present invention provides a terminal device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
obtaining a training set, wherein the training set is a manually labeled collection of texts covering a plurality of industry categories, and each text in the training set contains business operation information and is labeled with its corresponding industry category;
performing word segmentation on the text to obtain a vocabulary list corresponding to the text;
obtaining, by feature extraction, a first preset number of words from the vocabulary list as keywords;
for each keyword obtained, obtaining the word vector of the keyword by means of a word vector model;
averaging the word vectors of all the keywords to obtain a first vector;
obtaining the maximum word vector among the word vectors of all the keywords as a second vector;
obtaining the minimum word vector among the word vectors of all the keywords as a third vector;
forming the feature vector of the text from the first vector, the second vector and the third vector;
training an industry classification model on the training set;
performing industry classification on a text to be classified by means of the trained industry classification model.
The present invention provides a machine-learning-based industry classification method and terminal device. Texts that have been manually labeled with industry categories form the training set, and each text contains business operation information. For each text in the training set, keywords are extracted, and the feature vector of the text is formed from the mean, maximum and minimum of the word vectors of all the extracted keywords. This takes into account the semantic content of the different word vectors and captures deeper semantics. The industry classification model is trained on the training set after feature extraction until the training termination condition is reached, and the trained model then classifies texts containing business operation information, which improves the efficiency and accuracy of classification.
Detailed description of the embodiments
In the following description, for purposes of explanation and not limitation, specific details such as particular system structures and techniques are set forth in order to provide a thorough understanding of the embodiments of the present invention. However, it will be apparent to those skilled in the art that the present invention may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present invention.
To illustrate the technical solutions of the present invention, specific embodiments are described below.
An embodiment of the present invention provides a machine-learning-based industry classification method. With reference to Fig. 1, the method comprises:
S101: Obtain a training set. The training set is a manually labeled collection of texts covering a plurality of industry categories; each text in the training set contains business operation information and is labeled with its corresponding industry category.
Each industry category in the training set corresponds to at least one text, and each text maps to exactly one industry category.
The business scope of a business organization describes the production and operating activities or other social activities that enterprises, public institutions, government organs and self-employed individuals engage in, and it contains all of the organization's business operation information.
The business scope information of an enterprise can be obtained through multiple channels. For example, the National Organization Code Management Center assigns a unique identification code to every lawfully established organization and keeps an electronic record for it; the record describes the organization's business scope in detail, and what the business scope records is the enterprise's business operation information. The business operation information of an organization may of course also be obtained in other ways, which the embodiments of the present invention do not limit.
The business operation information of an organization is strongly correlated with the industry to which it belongs, so the organization can be assigned to a corresponding industry on the basis of that information.
In the embodiments of the present invention, the business operation information of multiple organizations may be obtained, with the business operation information of each organization corresponding to one text. The training set of this step is obtained by manually labeling each text with an industry category.
Optionally, an industry category list may be preset. For example, the list may contain all industry categories specified in the classification standard "Industrial Classification for National Economic Activities". Each industry category is assigned an identifier that uniquely identifies it. When the texts in the training set are manually labeled, each text is labeled with the identifier of its industry category according to the one-to-one mapping between the industry categories in the list and their identifiers.
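By way of non-limiting illustration, the mapping between industry categories and their identifiers may be sketched as follows. The three categories, their numeric identifiers and the helper `label_text` are assumptions introduced for this sketch and are not taken from the classification standard.

```python
# Hypothetical excerpt of an industry category list: every category
# gets an identifier that uniquely identifies it.
INDUSTRY_IDS = {
    "agriculture": 0,
    "animal husbandry": 1,
    "retail": 2,
}

# Inverse mapping, used to turn a predicted identifier back into a name.
ID_TO_INDUSTRY = {identifier: name for name, identifier in INDUSTRY_IDS.items()}


def label_text(text, industry):
    """Attach the identifier of the manually chosen category to a text."""
    return text, INDUSTRY_IDS[industry]


sample = label_text("grain planting; vegetable planting", "agriculture")
```

Because the mapping is one-to-one, the inverse dictionary has the same number of entries as the original.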
S102: Perform word segmentation on the text to obtain the vocabulary list corresponding to the text.
Each text in the training set may be segmented by any of a variety of existing word segmentation models to obtain the word list corresponding to the text.
Optionally, the segmentation result may be further screened: the preset stop words and punctuation marks are removed from the segmentation result, the remaining words are sorted in descending order of word frequency, and the top second preset number of words are selected. The words that pass this screening constitute the vocabulary list of this step.
Removing the preset stop words means removing words such as " ", " " and other preset words that are of no use for industry classification. The words remaining after the stop words and punctuation marks are removed are sorted in descending order of word frequency. Optionally, the top 90% of the words in the sorted result are kept and the bottom 10% are discarded, and the final screening result is used as the vocabulary list.
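By way of non-limiting illustration, the screening described above may be sketched as follows. Whitespace splitting stands in for a real word segmentation model, and the stop-word list, the function name `build_vocabulary` and the alphabetical tie-breaking are assumptions made for this sketch.

```python
import string
from collections import Counter

# Hypothetical preset stop-word list (assumption for this sketch).
STOP_WORDS = {"the", "of", "and"}


def build_vocabulary(text, keep_ratio=0.9):
    """Segment a text, drop stop words and punctuation, keep the most frequent words."""
    # Whitespace splitting is only a stand-in for a word segmentation model.
    tokens = [t.strip(string.punctuation) for t in text.lower().split()]
    tokens = [t for t in tokens if t and t not in STOP_WORDS]
    counts = Counter(tokens)
    # Descending word frequency; ties are broken alphabetically.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]


vocab = build_vocabulary(
    "Sale of grain, sale of vegetables and wholesale of grain seeds."
)
```

With `keep_ratio=0.9` the function keeps the top 90% of the distinct words, which corresponds to the optional screening ratio mentioned above.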
S103: Obtain, by feature extraction, a first preset number of words from the vocabulary list as keywords.
TF-IDF (term frequency-inverse document frequency) is a feature extraction and feature weighting technique. The TF-IDF matrix gives, for each word, a score that measures how strongly the word is associated with a text category; the higher the score, the better the word discriminates between categories. In this step, TF-IDF may be used to compute in turn the importance to the text of each word in the word list obtained in step S102; the results are sorted in descending order, and the top first preset number of words are selected as keywords. For example, if the word list contains 50 words, the TF-IDF value of each word is computed, the results are sorted in descending order, and the first 10 words are selected as keywords.
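By way of non-limiting illustration, keyword selection by TF-IDF may be sketched as follows. The smoothed form of the inverse document frequency, the toy corpus and the function name `tfidf_keywords` are assumptions made for this sketch; the embodiment does not prescribe a particular TF-IDF variant.

```python
import math
from collections import Counter


def tfidf_keywords(doc_tokens, corpus, top_n):
    """Return the top_n words of doc_tokens ranked by TF-IDF over the corpus."""
    term_counts = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word, count in term_counts.items():
        doc_freq = sum(1 for doc in corpus if word in doc)
        # Smoothed inverse document frequency (an assumed variant).
        idf = math.log((1 + n_docs) / (1 + doc_freq)) + 1
        scores[word] = (count / len(doc_tokens)) * idf
    # Descending score; ties are broken alphabetically.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_n]


corpus = [
    ["grain", "planting", "grain", "sale"],
    ["software", "development", "sale"],
    ["cattle", "breeding", "sale"],
]
keywords = tfidf_keywords(corpus[0], corpus, top_n=2)
```

In this toy corpus the word "sale" occurs in every text and therefore scores lowest, which is the behavior that makes TF-IDF useful for picking category-specific keywords.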
S104: For each keyword obtained, obtain the word vector of the keyword by means of a word vector model.
The word vector of a keyword can be obtained from an existing word vector model. Typically, the word vector is a 256-dimensional vector.
S105: Average the word vectors of all the keywords to obtain a first vector; obtain the maximum word vector among the word vectors of all the keywords as a second vector; obtain the minimum word vector among the word vectors of all the keywords as a third vector; and form the feature vector of the text from the first vector, the second vector and the third vector.
For example, if the word vector of each keyword in step S105 has 256 dimensions, the feature vector of the text constructed in this step has 256 × 3 dimensions. The feature vector consists of the first vector, the second vector and the third vector, each of which occupies 256 consecutive dimensions of the feature vector.
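By way of non-limiting illustration, the construction of the feature vector may be sketched as follows. The sketch assumes that the "maximum" and "minimum" word vectors are taken element-wise across the keywords, which is one possible reading of step S105; the function name `text_feature_vector` and the two-dimensional toy vectors are likewise assumptions.

```python
def text_feature_vector(word_vectors):
    """Concatenate the element-wise mean, maximum and minimum of the word vectors."""
    # zip(*...) regroups the values by dimension: one tuple per dimension.
    dims = list(zip(*word_vectors))
    mean_vec = [sum(d) / len(d) for d in dims]
    max_vec = [max(d) for d in dims]  # element-wise maximum (assumed reading)
    min_vec = [min(d) for d in dims]  # element-wise minimum (assumed reading)
    return mean_vec + max_vec + min_vec


# Three keywords with 2-dimensional toy word vectors.
vectors = [[1.0, -2.0], [3.0, 0.0], [2.0, 4.0]]
feature = text_feature_vector(vectors)
```

The result has three times the dimension of a single word vector, so 256-dimensional word vectors would give a 768-dimensional feature vector.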
The feature vector obtained in this way takes into account the semantic content of the word vectors of the different keywords. Compared with existing methods of constructing text feature vectors, it captures deeper semantics and thereby improves the accuracy of industry classification.
S106: Train the industry classification model on the training set.
In the embodiments of the present invention, the industry classification model is a deep neural network model with four layers: an input layer, a first hidden layer, a second hidden layer and an output layer. The input of the input layer is the feature vector of the text. The first hidden layer contains a first preset number of nodes and the second hidden layer contains a second preset number of nodes; the activation function of both hidden layers is the ReLU function. The output layer outputs the probabilities of the industry types of the text, and its activation function is the logistic function.
Optionally, the input layer contains one node, and the feature vector of the text obtained in step S105 is the input of that node;
the first hidden layer contains 100 nodes, i.e. has dimension 1 × 100, and its activation function is the ReLU function;
the second hidden layer contains 200 nodes, i.e. has dimension 1 × 200, and its activation function is the ReLU function;
the activation function of the output layer is the logistic function, and its output is the probability of each industry type. For example, if the industries in the training set are divided into 95 classes, the output layer outputs the probability that the text belongs to each of the 95 classes.
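By way of non-limiting illustration, a forward pass through a network of the described shape may be sketched as follows. The sketch assumes a 768-dimensional input (256 × 3), randomly initialized untrained weights, and a softmax output, i.e. the multi-class generalization of the logistic function, so that the 95 outputs sum to one. All function names are assumptions introduced for this sketch.

```python
import math
import random


def relu(values):
    return [max(0.0, x) for x in values]


def softmax(values):
    # Assumed multi-class form of the logistic output layer.
    shift = max(values)
    exps = [math.exp(x - shift) for x in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(values, weights, bias):
    """Fully connected layer: one weight row and one bias per output node."""
    return [sum(w * x for w, x in zip(row, values)) + b
            for row, b in zip(weights, bias)]


def init_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def build_model(n_in=768, hidden1=100, hidden2=200, n_classes=95, seed=0):
    rng = random.Random(seed)
    return [init_layer(n_in, hidden1, rng),
            init_layer(hidden1, hidden2, rng),
            init_layer(hidden2, n_classes, rng)]


def predict_proba(model, feature):
    (w1, b1), (w2, b2), (w3, b3) = model
    h1 = relu(dense(feature, w1, b1))   # first hidden layer, 100 nodes
    h2 = relu(dense(h1, w2, b2))        # second hidden layer, 200 nodes
    return softmax(dense(h2, w3, b3))   # one probability per industry class


model = build_model()
probs = predict_proba(model, [0.1] * 768)
```

The weights here are untrained, so the probabilities carry no meaning; the sketch only shows the layer sizes and activation functions.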
Optionally, training the industry classification model on the training set comprises: training the model on the training set with a set learning rate, number of training iterations, batch size and termination error until a preset training termination condition is reached, where the preset training termination condition is that the number of training iterations is reached or that the error of the classification result falls below the termination error.
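By way of non-limiting illustration, the training termination condition may be sketched as follows. The callback `run_iteration`, which is assumed to perform one training pass and return the current error, and the list of error values are hypothetical and stand in for a real training procedure.

```python
def train(run_iteration, max_iterations, termination_error):
    """Repeat training passes until the iteration limit or the error threshold is hit."""
    iterations_done = 0
    error = float("inf")
    while iterations_done < max_iterations:
        error = run_iteration()  # hypothetical: one pass, returns current error
        iterations_done += 1
        if error < termination_error:
            break  # error fell below the termination error
    return iterations_done, error


# Made-up error values standing in for a real training run.
errors = iter([0.9, 0.4, 0.08, 0.03, 0.01])
result = train(lambda: next(errors), max_iterations=500, termination_error=0.05)
```

Either condition alone ends the loop, which matches the "or" in the termination condition described above.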
Further, with reference to Fig. 2, an embodiment of the present invention also provides a method for obtaining an optimal industry classification model, comprising:
S1061: Establish multiple deep neural network models, where any two of the deep neural network models differ in learning rate, number of training iterations, batch size and termination error.
Optionally, for the deep neural network industry classification model provided in step S106, multiple deep neural network models with different parameters are established.
For example, the learning rate takes one of the values 0.01, 0.02 and 0.03;
the number of training iterations takes one of the values 500, 1000 and 2000;
the batch size takes one of the values 100, 200 and 500;
the termination error takes one of the values 0.05, 0.1 and 0.5.
Multiple deep neural network classification models can be formed in this way. For example, a learning rate of 0.01, 500 training iterations, a batch size of 100 and a termination error of 0.05 constitute one industry classification model.
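By way of non-limiting illustration, the parameter combinations listed above may be enumerated as follows. The sketch assumes that every combination of the four values is a candidate, so that two models count as different when at least one of the four values differs; the dictionary keys are names chosen for this sketch.

```python
from itertools import product

LEARNING_RATES = [0.01, 0.02, 0.03]
ITERATIONS = [500, 1000, 2000]
BATCH_SIZES = [100, 200, 500]
TERMINATION_ERRORS = [0.05, 0.1, 0.5]

# One configuration per combination of the four parameter values.
configs = [
    {
        "learning_rate": lr,
        "iterations": it,
        "batch_size": bs,
        "termination_error": te,
    }
    for lr, it, bs, te in product(
        LEARNING_RATES, ITERATIONS, BATCH_SIZES, TERMINATION_ERRORS
    )
]
```

Under this assumption the grid contains 3 × 3 × 3 × 3 = 81 candidate models, the first of which is the example combination given above.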
S1062: Train each of the multiple deep neural network models on the training set.
Each of the deep neural network models from S1061 is trained on the training set until the training termination condition is reached.
S1063: Obtain a preset test set.
In the embodiments of the present invention, the test set is obtained in the same way as the training set.
S1064: Test each of the multiple deep neural network models on the preset test set.
S1065: According to the test results, select the deep neural network model with the highest industry classification accuracy to perform industry classification on the text to be classified.
The industry type of every text in the test set is known. Take a test text X whose industry type is agriculture, and input the feature vector of X into an industry classification model. If the probability the model computes for agriculture is the largest, the model's prediction of the type of X is correct. If the probability computed for agriculture is not the largest, for example if the model assigns the largest probability to animal husbandry, the model's prediction of the type of X is wrong.
In this way, testing on the test set yields the accuracy of each industry classification model, from which the optimal industry classification model is obtained.
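By way of non-limiting illustration, computing the accuracy of each model and selecting the best one may be sketched as follows. Each model is represented by a hypothetical function that maps a feature vector to a list of class probabilities; the two toy models and the three-text test set are made up for this sketch.

```python
def accuracy(predict_proba, test_set):
    """Fraction of test texts whose most probable class equals the known label."""
    correct = 0
    for feature, true_label in test_set:
        probs = predict_proba(feature)
        predicted = max(range(len(probs)), key=probs.__getitem__)
        correct += predicted == true_label
    return correct / len(test_set)


def select_best(models, test_set):
    """Return the name of the model with the highest test accuracy."""
    return max(models, key=lambda name: accuracy(models[name], test_set))


# Toy test set: (feature vector, known class index).
test_set = [([1.0], 0), ([2.0], 1), ([3.0], 1)]
# Toy models that ignore their input (hypothetical stand-ins).
models = {
    "model_a": lambda feature: [0.9, 0.1],
    "model_b": lambda feature: [0.2, 0.8],
}
best = select_best(models, test_set)
```

A prediction counts as correct only when the known class has the largest probability, as in the agriculture example above.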
By separately training and testing industry classification models built from multiple parameter combinations, the embodiments of the present invention obtain the industry classification model with the highest classification accuracy, which further improves the accuracy of industry classification.
The text to be classified is then classified by the deep neural network model with the highest accuracy.
S107: Perform industry classification on the text to be classified by means of the trained industry classification model.
The embodiments of the present invention provide a machine-learning-based industry classification method. Feature extraction is performed on the texts in the training set, which contain the business operation information of business organizations, to obtain the feature vector of each text; each text is labeled with a manually assigned industry category identifier. The industry classification model is trained with the feature vectors of the training texts as input, and the trained model then classifies the text to be classified. Industries are thus classified automatically on the basis of business operations, with high efficiency and accuracy.
With reference to Fig. 3, an embodiment of the present invention further provides a machine-learning-based industry classification method that is applied after the text to be classified has been classified by the trained industry classification model and that can be used to detect abnormal industry classification results. The method comprises:
S301: Obtain all texts in the training set whose industry category is a first industry category, where the first industry category is any one of the industry categories in the training set.
For example, if the training set contains texts for the 95 industry categories in the industry category list, this step screens the texts by industry category to obtain all texts belonging to each category.
S302: Perform density clustering on the feature vectors of all texts of the first industry category to obtain the cluster of the first industry category.
The first industry category is any one of the industry categories in the industry list. For example, if the first industry category is agriculture, all texts labeled as agriculture, say 100 texts, are obtained from the training set of step S101, and density clustering is performed on the feature vectors of these 100 texts, for example with the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm. DBSCAN is a representative density-based clustering algorithm: it defines a cluster as the maximal set of density-connected points and can partition regions of sufficiently high density into clusters. The cluster obtained for agriculture may also be called the portrait of the agriculture industry.
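By way of non-limiting illustration, density clustering of the feature vectors of one industry category may be sketched with a compact DBSCAN as follows. The parameter names `eps` and `min_pts`, the Euclidean distance and the two-dimensional toy points are assumptions made for this sketch; the embodiment does not specify these values.

```python
import math

NOISE = -1


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster index 0, 1, ... or NOISE."""
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

In this toy data the four nearby points are density-connected and form one cluster, while the isolated point is labeled as noise.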
S303: Obtain the center point and radius of the cluster of the first industry category.
S304: If the classification result is that the first industry category has the highest probability for the text to be classified, compute, from the feature vector of the text to be classified, the distance between the text and the center point of the cluster of the first industry category.
After a text to be classified has been classified by the method of the embodiments corresponding to Fig. 1 and Fig. 2, the probability that the text belongs to each industry category is obtained. If, for example, agriculture has the highest probability, the distance between the text and the center point of the agriculture cluster is computed from the feature vector of the text; if this distance is greater than the radius of the agriculture cluster from step S303, the text is judged to be an abnormal text.
The result of the industry classification model is the probability, computed from the business operation information in the text, that the text belongs to each industry category. Even when the model assigns the highest probability to agriculture, the business operation information of the text may also include information on several other industries that have little to do with agriculture. In that case agriculture has the highest probability, yet the text's association with agriculture is not particularly strong, and the text is judged to be an abnormal text.
S305: If the distance is greater than the radius of the cluster, judge the text to be classified to be an abnormal text.
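By way of non-limiting illustration, steps S303 to S305 may be sketched as follows. The embodiment does not state how the center point and radius are computed; the sketch assumes the center is the mean of the cluster members and the radius is the largest distance from the center to a member. The function names and toy vectors are likewise assumptions.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cluster_center_and_radius(vectors):
    """Centroid of the cluster and the largest centroid-to-member distance."""
    # Both definitions are assumptions; the source does not specify them.
    center = [sum(d) / len(d) for d in zip(*vectors)]
    radius = max(euclidean(center, v) for v in vectors)
    return center, radius


def is_abnormal(feature, center, radius):
    """A text is abnormal when it lies farther from the center than the radius."""
    return euclidean(feature, center) > radius


# Toy feature vectors of the texts in the agriculture cluster.
agriculture_cluster = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0)]
center, radius = cluster_center_and_radius(agriculture_cluster)

inside = is_abnormal((1.5, 1.0), center, radius)
outside = is_abnormal((5.0, 5.0), center, radius)
```

Here `inside` is `False` and `outside` is `True`: only the second text lies beyond the radius and would be flagged as abnormal.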
Based on a clustering algorithm, the embodiments of the present invention obtain from the training set the cluster corresponding to each industry. If the industry classification result of a text is that industry, whether the distance between the text and the cluster of the industry exceeds the radius of the cluster is determined from the feature vector of the text; if it does, the text is judged to be abnormal, which provides a further basis for accurate industry classification.
Fig. 4 is a schematic diagram of a machine-learning-based industry classification apparatus provided by an embodiment of the present invention. With reference to Fig. 4, the apparatus comprises a first acquisition unit 41, a word segmentation unit 42, a second acquisition unit 43, a third acquisition unit 44, a fourth acquisition unit 45, a training unit 46 and a classification unit 47;
the first acquisition unit 41 is configured to obtain a training set, wherein the training set is a manually labeled collection of texts covering a plurality of industry categories, each industry category corresponds to at least one text, each text maps to exactly one industry category, and each text in the training set contains business operation information and is labeled with its corresponding industry category;
the word segmentation unit 42 is configured to perform word segmentation on the text to obtain the vocabulary list corresponding to the text;
the second acquisition unit 43 is configured to obtain, by feature extraction, a first preset number of words from the vocabulary list as keywords;
the third acquisition unit 44 is configured to obtain, for each keyword obtained, the word vector of the keyword by means of a word vector model;
the fourth acquisition unit 45 is configured to average the word vectors of all the keywords to obtain a first vector, obtain the maximum word vector among the word vectors of all the keywords as a second vector, obtain the minimum word vector among the word vectors of all the keywords as a third vector, and form the feature vector of the text from the first vector, the second vector and the third vector;
the training unit 46 is configured to train the industry classification model on the training set;
the classification unit 47 is configured to perform industry classification on the text to be classified by means of the trained industry classification model.
Further, the apparatus also comprises a screening unit 48, configured to remove the preset stop words and punctuation marks from the segmentation result, sort the remaining words in descending order of word frequency, and select the top second preset number of words to obtain the vocabulary list.
Further, the second acquisition unit 43 is specifically configured to compute in turn, by TF-IDF, the importance of each word in the vocabulary list to the text, sort the results in descending order, and select the top first preset number of words as keywords.
Further, the industry classification model is a deep neural network model with four layers: an input layer, a first hidden layer, a second hidden layer and an output layer. The input of the input layer is the feature vector of the text. The first hidden layer contains a first preset number of nodes and the second hidden layer contains a second preset number of nodes; the activation function of both hidden layers is the ReLU function. The output layer outputs the probabilities of the industry types of the text, and its activation function is the logistic function.
Further, the apparatus also comprises an establishing unit 49 and a selection unit 410;
the establishing unit 49 is configured to establish multiple deep neural network models, where any two of the deep neural network models differ in learning rate, number of training iterations, batch size and termination error;
the training unit 46 is configured to train each of the multiple deep neural network models on the training set;
the first acquisition unit 41 is further configured to obtain a preset test set;
the classification unit 47 is configured to test each of the multiple deep neural network models on the preset test set;
the selection unit 410 is configured to select, according to the test results, the deep neural network model with the highest classification accuracy;
the classification unit 47 is specifically configured to perform industry classification on the text to be classified by means of the deep neural network model with the highest classification accuracy.
Further, the apparatus also comprises a clustering unit 411, a computing unit 412 and a judging unit 413. The clustering unit 411 is configured to obtain all texts in the training set whose industry category is a first industry category, the first industry category being any one of the industry categories in the training set; to perform density clustering on the feature vectors of all texts of the first industry category to obtain the cluster of the first industry category; and to obtain the center point and radius of the cluster of the first industry category;
if the classification result is that the first industry category has the highest probability for the text to be classified, the computing unit 412 computes, from the feature vector of the text to be classified, the distance between the text and the center point of the cluster of the first industry category;
if the distance is greater than the radius of the cluster, the judging unit 413 judges the text to be classified to be an abnormal text.
The embodiments of the present invention provide a machine-learning-based industry classification apparatus. The apparatus performs feature extraction on the texts in the training set, which contain the business operation information of business organizations, to obtain the feature vector of each text; each text is labeled with a manually assigned industry category identifier. The industry classification model is trained with the feature vectors of the training texts as input, and the trained model then classifies the text to be classified. Industries are thus classified automatically on the basis of business operations, with high efficiency and accuracy.
Fig. 5 is a schematic diagram of a machine-learning-based industry classification terminal device provided by an embodiment of the present invention. As shown in Fig. 5, the industry classification terminal device 5 of this embodiment comprises a processor 50, a memory 51, and a computer program 52, such as an industry classification program, that is stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps of the machine-learning-based industry classification method embodiments described above, for example steps S101 to S107 shown in Fig. 1, steps S1061 to S1065 shown in Fig. 2, or steps S301 to S305 shown in Fig. 3. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the modules/units in the apparatus embodiments described above, for example the functions of units 41 to 413 shown in Fig. 4.
Illustratively, the computer program 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to carry out the present invention. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, the instruction segments being used to describe the execution of the computer program 52 in the industry classification terminal device 5.
The industry classification terminal device 5 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The industry classification terminal device may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that Fig. 5 is merely an example of the industry classification terminal device 5 and does not limit it; the device may include more or fewer components than shown, combine certain components, or use different components. For example, the industry classification terminal device may also include input/output devices, network access devices, buses and the like.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 51 may be an internal storage unit of the industry classification terminal device 5, such as a hard disk or internal memory of the device. The memory 51 may also be an external storage device of the industry classification terminal device 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card fitted to the device. Further, the memory 51 may include both an internal storage unit and an external storage device of the industry classification terminal device 5. The memory 51 is used to store the computer program and the other programs and data required by the industry classification terminal device. The memory 51 may also be used to temporarily store data that has been output or is to be output.
An embodiment of the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the machine-learning-based industry classification method of any of the embodiments described above.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.
The embodiments described above are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention.