Summary of the Invention
In view of the above problems in the prior art, the inventors have made the present invention, which provides semantic recognition at the character and word level in medical text, improves the generalization ability of the model so that it can be switched freely between different scenarios, is highly portable, and greatly reduces the consumption of human resources.
According to an embodiment of the invention, a medical text feature extraction and automatic matching method is provided, characterized by comprising the following steps:
Step 1: extracting medical text from externally input medical data, and performing word segmentation on the medical text to obtain medical words to be matched against the standard words in a controlled term list;
Step 2: for each medical word, obtaining, by a word vectorization operation, the N-dimensional vector corresponding to each morpheme in the medical word, and forming an M × N matrix corresponding to the medical word, where M is the number of morphemes contained in the medical word;
Step 3: reducing the M × N matrix corresponding to the medical word to a vector, generating a dimension-reduced vector;
Step 4: calculating the vector distance between the dimension-reduced vector and the vector corresponding to each standard word in the controlled term list;
Step 5: sorting the calculated vector distances in ascending order, and selecting, from the standard words in the controlled term list, the one or more standard words ranked first by vector distance to the dimension-reduced vector, as candidate standard words.
According to an embodiment of the invention, the method further comprises the following step:
Step 6: calculating the logical inclusion distance between the medical word and each candidate standard word, and taking the candidate standard word with the smallest logical inclusion distance as the standard word finally matched to the medical word.
Optionally, according to an embodiment of the invention, the method further comprises the following step:
Step 6: calculating the edit distance between the medical word and each candidate standard word, and taking the candidate standard word with the smallest edit distance as the standard word finally matched to the medical word.
Optionally, according to an embodiment of the invention, the method further comprises the following step:
Step 6: calculating the logical inclusion distance and the edit distance between the medical word and each candidate standard word, computing the weighted sum of the logical inclusion distance and the edit distance, and taking the candidate standard word with the largest weighted sum as the standard word finally matched to the medical word.
According to an embodiment of the invention, in step 3 the dimensionality reduction is performed by a pooling method, the pooling method being one or more of average pooling, max pooling, and min pooling,
wherein, when one of average pooling, max pooling, and min pooling is used, the M × N matrix corresponding to the medical word is reduced to a 1 × N vector, which serves as the dimension-reduced vector,
and wherein, when several of average pooling, max pooling, and min pooling are used, the pooled vectors are concatenated to form the dimension-reduced vector.
According to an embodiment of the invention, the method further comprises, after step 1:
Step 1-1: determining, by text comparison, whether the medical word is identical to a standard word in the controlled term list; if so, matching the medical word directly to that standard word and ending the method.
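The exact-match shortcut of step 1-1 can be sketched as follows. This is a minimal illustration only; the function name `exact_match` and the sample entries are hypothetical and do not come from the specification.

```python
def exact_match(medical_word, standard_words):
    """Return the standard word identical to medical_word, or None."""
    # A set gives constant-time membership tests; the input list is untouched.
    lookup = set(standard_words)
    return medical_word if medical_word in lookup else None


# Hypothetical entries of a controlled term list.
standard_words = ["asthma", "tumor", "fracture"]
print(exact_match("asthma", standard_words))            # identical entry found
print(exact_match("bronchial asthma", standard_words))  # no identical entry
```

When the function returns a standard word, the vector-based steps can be skipped for that medical word; otherwise processing continues with step 2.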
According to an embodiment of the invention, the standard words and the medical word carry attribute labels, and in step 4 the vector distance is calculated between the dimension-reduced vector corresponding to the medical word and the vector corresponding to each standard word in the controlled term list that has the same attribute label as the medical word.
According to an embodiment of the invention, the vector distance is the Euclidean distance.
According to an embodiment of the invention, a medical text feature extraction and automatic matching system for executing the method is also provided, characterized by comprising a word segmentation module, a word vectorization module, a dimensionality reduction module, and a matching module,
wherein the word segmentation module is configured to extract medical text from externally input medical data and to perform word segmentation on the medical text, obtaining medical words to be matched against the standard words in a controlled term list;
the word vectorization module is configured to obtain, by a word vectorization operation, the N-dimensional vector corresponding to each morpheme in the medical word, forming an M × N matrix, where M is the number of morphemes contained in the medical word;
the dimensionality reduction module is configured to reduce the M × N matrix corresponding to the medical word to a vector, generating a dimension-reduced vector;
the matching module is configured to:
calculate the vector distance between the dimension-reduced vector and the vector corresponding to each standard word in the controlled term list;
sort the calculated vector distances in ascending order, and select, from the standard words in the controlled term list, the one or more standard words ranked first by vector distance to the dimension-reduced vector, as candidate standard words;
calculate the logical inclusion distance and/or the edit distance between the medical word and each candidate standard word, and select, according to the result, one of the candidate standard words as the standard word finally matched to the medical word.
According to an embodiment of the invention, a computer-readable storage medium is also provided, on which a program for the above method is stored; when the program is executed by a processor, the steps of the method are performed.
The beneficial effects of the present invention mainly lie in the following. Feature extraction efficiency is improved, and semantic recognition at the character and word level in medical text is achieved. Because the model is not rule-based, its generalization ability is significantly improved; it can be switched freely between different scenarios, is highly portable, and greatly reduces the consumption of human resources. In our matching tests on test data, the direct matching rate of the raw data set was below 8%, whereas the matching rate after processing by the standardized automatic matching system was stable at 85%, essentially without manual support. The dynamic incremental data-structure regularization system helps feed back information on non-Chinese characters in a timely manner, so that recognition is further improved over the feedback cycle. Compared with using a single vector-distance criterion alone, the effect is clearly improved. At the same time, using the Euclidean distance as a first-level screening step substantially reduces the number of text comparison operations, which saves computing resources and increases computing speed. Since word vectors require no annotation work and carry the semantic information of the vocabulary, the consumption of human resources can be greatly reduced, as can the burden and difficulty of the discrimination work on medical terms that would otherwise have to be carried out by professionals.
Detailed Description
The implementation of the technical solution is described in further detail below with reference to the drawings.
Those skilled in the art will appreciate that, although the following description involves many technical details of embodiments of the invention, these are only examples illustrating the principle of the invention and do not imply any limitation. The invention can be applied to situations other than the technical details exemplified below, as long as they do not depart from the principle and spirit of the invention.
In addition, to keep the description in this specification from becoming tedious, some technical details that can be obtained from prior-art material may have been omitted, simplified, or adapted in the description. Those skilled in the art will understand this, and it does not affect the sufficiency of disclosure of this specification.
Hereinafter, embodiments for carrying out the invention are described. Note that the description is given in the following order: 1. Overview of the inventive concept; 2. Medical text feature extraction and automatic matching method (Figs. 1 to 3); 3. Medical text feature extraction and automatic matching system (Fig. 4); 4. System with an application program installed according to an embodiment of the invention, and computer-readable medium storing the application program (Fig. 5).
1. Overview of the inventive concept
The present invention relates to an unsupervised model for automatic Chinese text feature extraction and normalization that combines Word2Vec with Euclidean distance, logical inclusion distance, and edit distance. It mainly includes the following aspects:
1. Using the word embedding method of the Skip-Gram model, the characters and words in the medical text to be matched are mapped to specific coordinate points in an N-dimensional space. This vectorizes Chinese text in an unsupervised setting and provides semantic recognition: the algorithm can automatically identify semantic information in the text, and the vectorized Chinese text addresses, at the source, the drawback of conventional methods that require a large number of added rules;
2. On this basis, a vector matrix is constructed for long or short text; dimensionality is reduced and key text features are captured by Max, Min, and Mean longitudinal (column-wise) feature pooling, and the list of best candidate matches is computed using the Euclidean distance;
3. The best match is obtained by a weighted calculation of the logical inclusion distance and the edit distance, effectively improving the matching rate between external medical big data and the target classification;
4. Before model input, a dynamic incremental data-structure regularization system based on active learning can be established to exclude non-Chinese characters that add noise to semantic recognition. These mainly include digits and letters of various writing systems, special characters, English aliases, and the like.
The realization of the above inventive concept is illustrated below with reference to embodiments.
2. Medical text feature extraction and automatic matching method
Figs. 1 and 2 are partial flow diagrams of the medical text feature extraction and automatic matching method according to an embodiment of the invention.
As shown in Fig. 1, an embodiment of the invention provides a medical text feature extraction and automatic matching method, which mainly includes the following steps:
Step S100: extracting medical text from externally input medical data, and performing word segmentation on the medical text to obtain medical words to be matched against the standard words in a controlled term list;
Step S200: obtaining, by a word vectorization operation, the N-dimensional vector corresponding to each morpheme in the medical word, forming an M × N matrix, where M is the number of morphemes contained in the medical word;
Step S300: reducing the M × N matrix, by a unidirectional pooling method, to an N-dimensional vector corresponding to the medical word;
Step S400: calculating the Euclidean distance between the N-dimensional vector corresponding to the medical word and the N-dimensional vector corresponding to each standard word in the controlled term list;
Step S500: sorting the calculated Euclidean distances, and selecting, from the standard words in the controlled term list, the multiple standard words with the smaller Euclidean distances to the N-dimensional vector corresponding to the medical word, as candidate standard words;
Step S600: calculating the logical inclusion distance and/or the edit distance between the medical word and each candidate standard word, computing the weighted sum of the logical inclusion distance and the edit distance, and taking the candidate standard word with the largest weighted sum as the standard word finally matched to the medical word.
Here, the N-dimensional vector corresponding to each standard word in the controlled term list is obtained by applying the above steps S200 and S300 to that standard word, with the standard word taking the place of the medical word.
Specifically, in step S100, the word segmentation can be performed, for example, by matching the medical text against the entries of a medical dictionary, yielding the segmented medical words; for example, a medical word after segmentation may be "asthma", "tumor", and so on.
Optionally, in step S100, text comparison is used to determine whether the medical word is identical to a standard word in the controlled term list; if so, the medical word is matched directly to that standard word and the method ends.
For example, the medical dictionary may include 500,000 disease names (the disease names used in production).
The following methods can be used for the matching:
1) forward maximum matching (scanning from left to right);
2) reverse maximum matching (scanning from right to left);
3) minimum segmentation (minimizing the number of words cut out of each sentence);
4) bidirectional maximum matching (two scans, one from left to right and one from right to left).
Those skilled in the art will appreciate that the above methods can be carried out with various known approaches and algorithms and can also be combined; the details are not repeated here.
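As an illustration of the first of the methods listed above, the following sketch implements forward maximum matching over a small dictionary. The function name `forward_max_match`, the fallback to single characters, and the toy dictionary are assumptions made for the example; the specification does not prescribe a particular implementation.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, always taking the longest dictionary entry."""
    if max_len is None:
        max_len = max((len(entry) for entry in dictionary), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it until an entry matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumption: an unmatched single character is emitted as-is.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Toy dictionary with Latin letters standing in for dictionary entries.
toy_dictionary = {"ab", "abc", "cd"}
print(forward_max_match("abcd", toy_dictionary))
```

On this input the left-to-right scan takes "abc" first and leaves "d" as a single character, whereas a right-to-left scan would prefer "ab" and "cd"; this difference is why the bidirectional variant compares both scans.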
In step S200, the word vectorization operation may use the Word2vec tool. Word vectors, also called word embeddings, convert the words of natural language into dense vectors that a computer can process, and words with similar meanings are mapped to nearby positions in the vector space.
The word vectorization operation finds, for each word, a suitable position vector in the embedding space. This vector reflects some of the syntactic and semantic meaning of the word.
As an example, step S100 may also include:
Step S101: excluding illegal characters from the medical text, including digits and letters of various writing systems, special characters, English aliases, and the like. Here, the illegal characters can be filtered out by filtering rules set in advance.
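One possible filtering rule for step S101 is sketched below. It assumes, for illustration only, that every character outside the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) counts as illegal; the actual rule set is not given in the specification, and the name `strip_non_chinese` is hypothetical.

```python
import re

# Assumption: anything outside U+4E00..U+9FFF is treated as an illegal character.
NON_CHINESE = re.compile(r"[^\u4e00-\u9fff]+")


def strip_non_chinese(text):
    """Remove digits, Latin letters, punctuation and other non-Chinese characters."""
    return NON_CHINESE.sub("", text)


# Hypothetical input mixing Chinese characters, an English alias and a digit.
print(strip_non_chinese("哮喘(asthma)2型"))
```

A production rule set would likely be broader (for example, extension blocks for rare characters) and, as described in the overview, would be updated incrementally from feedback.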
Specifically, as shown in Fig. 2, in step S200 the word vectorization is an unsupervised word vectorization operation based on Word2Vec, which mainly includes the following steps:
Step S201: first learning from a large corpus to identify "positive sampling words" and "negative sampling words";
Step S202: continually pulling positive sampling words closer together, the amount of pulling depending on their current distance, while continually pushing negative sampling words further apart, the amount of pushing depending on their current distance;
Step S203: finding, for each character, a suitable position vector in the embedding space. This vector reflects some of the syntactic and semantic meaning of the character.
That is, the larger the cosine (cos) of the angle between the position vectors of two characters, the more similar they are, and the more likely it is that the two characters form a word or are near-synonyms; conversely, the smaller the cosine, the less similar they are, and the less likely it is that the two characters form a word or are near-synonyms. The distance between the vectors is also taken into account.
The terms "positive sampling word" and "negative sampling word" have the following meanings:
Positive sampling words: characters or words that frequently appear within the same window. The semantic similarity between them is high and they conform to grammatical logic, for example characters that form a word together, such as "swollen" and "tumor", or "heavy breathing" and "asthma" (the quoted terms are glosses of individual Chinese characters).
Negative sampling words: characters or words that rarely appear within the same window, that is, that do not conform to grammatical logic and have low semantic similarity, for example characters that do not form a word together, such as "swollen" and "black", or "heavy breathing" and "pain".
As an example, character vectors are obtained by Word2Vec learning (a "morpheme" corresponds to one Chinese character). The dimension of the character vectors trained in the model is set by a hyperparameter to d = 100; for ease of understanding, d = 2 is used in the example below, as shown in Fig. 3. The quoted names are glosses of single characters.
swollen [0.498006, -2.489054], tumor [0.691923, -2.792727]
- cos(swollen, tumor) = 0.999, distance(swollen, tumor) = 0.360
swollen [0.498006, -2.489054], big [-0.340440, -0.981898]
- cos(swollen, big) = 0.862, distance(swollen, big) = 1.725
swollen [0.498006, -2.489054], red [1.092340, -3.372209]
- cos(swollen, red) = 0.993, distance(swollen, red) = 1.064
swollen [0.498006, -2.489054], upper arm [-4.788107, 2.656263]
- cos(swollen, upper arm) = -0.647, distance(swollen, upper arm) = 7.376
swollen [0.498006, -2.489054], split [-4.193781, 2.289126]
- cos(swollen, split) = -0.642, distance(swollen, split) = 6.696
swollen [0.498006, -2.489054], fold [-3.655881, 2.100383]
- cos(swollen, fold) = -0.658, distance(swollen, fold) = 6.190
bone [-3.781678, 2.185360], tumor [0.691923, -2.792727]
- cos(bone, tumor) = -0.693, distance(bone, tumor) = 6.692
bone [-3.781678, 2.185360], big [-0.340440, -0.981898]
- cos(bone, big) = -0.189, distance(bone, big) = 4.676
bone [-3.781678, 2.185360], red [1.092340, -3.372209]
- cos(bone, red) = -0.742, distance(bone, red) = 7.392
bone [-3.781678, 2.185360], upper arm [-4.788107, 2.656263]
- cos(bone, upper arm) = 0.999, distance(bone, upper arm) = 1.111
bone [-3.781678, 2.185360], split [-4.193781, 2.289126]
- cos(bone, split) = 0.999, distance(bone, split) = 0.424
bone [-3.781678, 2.185360], fold [-3.655881, 2.100383]
- cos(bone, fold) = 0.999, distance(bone, fold) = 0.151
In this way, the character vector corresponding to each character in a medical word can be obtained.
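The cosine and distance figures in the d = 2 example can be recomputed with a few lines of standard-library Python. The sketch below uses the coordinates listed above for three of the characters; the dictionary keys are the English glosses used in this translation, and the helper names `cosine` and `euclidean` are illustrative.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Straight-line distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Coordinates copied from the d = 2 example.
vectors = {
    "swollen": [0.498006, -2.489054],
    "tumor": [0.691923, -2.792727],
    "fold": [-3.655881, 2.100383],
}

for other in ("tumor", "fold"):
    c = cosine(vectors["swollen"], vectors[other])
    d = euclidean(vectors["swollen"], vectors[other])
    print(f"swollen vs {other}: cos={c:.3f}, distance={d:.3f}")
```

The pair that forms a word ("swollen", "tumor") has a cosine close to 1 and a small distance, while the unrelated pair has a negative cosine and a large distance. The recomputed values agree with the listed ones to within about 0.001.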
Specifically, in step S300, the unidirectional pooling method is longitudinal (column-wise) pooling and is one of average pooling (mean pooling), max pooling, and min pooling (that is, the pooling window is M × 1).
For example, with one of the above pooling methods, a 10 × 100 matrix (formed from ten 1 × 100 character vectors) can be reduced to a 1 × 100 vector.
Optionally, all three pooling methods can be used: the 10 × 100 matrix (formed from ten 1 × 100 character vectors) is reduced to three 1 × 100 vectors, which are then concatenated into a 1 × 300 vector, so that in the subsequent steps the matching calculation is performed against the 1 × 300 vector corresponding to each standard word in the controlled term list.
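Column-wise pooling with optional concatenation can be sketched in plain Python as follows. The function name `pool_columns` and the concatenation order (mean, then max, then min) are assumptions for illustration; the specification does not fix the order.

```python
def pool_columns(matrix, methods=("mean", "max", "min")):
    """Reduce an M x N matrix (a list of M rows) to one vector by column-wise pooling.

    Each method yields a 1 x N vector; the results are concatenated in the
    order given by `methods`.
    """
    ops = {
        "mean": lambda col: sum(col) / len(col),
        "max": max,
        "min": min,
    }
    columns = list(zip(*matrix))  # N columns, each holding M entries
    pooled = []
    for name in methods:
        pooled.extend(ops[name](col) for col in columns)
    return pooled


# A 2 x 2 toy matrix standing in for the M x N character-vector matrix.
toy_matrix = [[1.0, 4.0], [3.0, 2.0]]
print(pool_columns(toy_matrix, methods=("max",)))  # one method: length N
print(pool_columns(toy_matrix))                    # three methods: length 3N
```

Whatever order is chosen, it has to be the same for the medical words and for the standard words, since the distance in step S400 compares the concatenated vectors position by position.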
Optionally, the standard words and the medical word carry attribute labels, and in step S400 the Euclidean distance is calculated between the N-dimensional vector corresponding to the medical word and the N-dimensional vector corresponding to each standard word in the controlled term list that has the same attribute label as the medical word.
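Restricting the comparison to standard words with the same attribute label amounts to a simple filter before the distance calculation. In the sketch below, the representation of the list as (word, label) pairs, the label values, and the name `same_attribute` are all hypothetical.

```python
def same_attribute(medical_label, standard_entries):
    """Keep only the standard words whose attribute label equals medical_label."""
    return [word for word, label in standard_entries if label == medical_label]


# Hypothetical controlled term list with attribute labels.
entries = [("asthma", "disease"), ("aspirin", "drug"), ("tumor", "disease")]
print(same_attribute("disease", entries))
```

Filtering first reduces the number of Euclidean distances that have to be computed and prevents a medical word from being matched to a standard word of a different kind.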
Specifically, in step S400, after the text vectors have been normalized (that is, each medical word has been brought by the above steps to the same dimension as the standard words), the Euclidean distance between the N-dimensional vector corresponding to the medical word and the N-dimensional vector corresponding to each standard word in the controlled term list is calculated over all standard names using the Euclidean distance formula.
As an example, the medical word can be quantified as a 1 × 300 vector, expressed as follows:
A = [1.393092, 1.349219, …, 1.311361, -2.02858, -0.15119, …, -1.24318, -0.44072, 0.98503, …, -0.05916]
Each standard word (for example, a disease name) in the controlled term list has a corresponding 1 × 300 vector, expressed as follows:
B1 = [0.395221, 0.45926, …, -3.252446, -3.020052, 4.52419, …, 2.214458, -1.4547, -1.98543, …, 2.56514]
……
Bk = [1.393092, 1.349219, …, 1.311361, -2.02858, -0.15119, …, -1.24318, -0.44072, 0.98503, …, -0.05916]
……
By calculating the Euclidean distances, it can be determined that the standard word closest to A (smallest Euclidean distance) is Bk. In this example, the Euclidean distance between Bk and A is 0, indicating an exact match.
Specifically, in step S500, the calculated Euclidean distances are sorted in ascending order, and the multiple standard words with the smallest Euclidean distances to the N-dimensional vector corresponding to the medical word are selected from the standard words in the controlled term list as candidate standard words.
As an example, the 10 standard words with the shortest Euclidean distances are selected as the candidate set.
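Steps S400 and S500 together amount to a nearest-neighbour search. A minimal sketch, assuming the standard-word vectors are held in a dictionary from word to vector (the name `top_k_candidates` and the toy data are illustrative only):

```python
import heapq
import math


def top_k_candidates(query, standard_vectors, k=10):
    """Return the k standard words closest to `query` by Euclidean distance.

    `standard_vectors` maps each standard word to its dimension-reduced vector.
    The result is ordered from nearest to farthest.
    """
    def distance_to(word):
        vec = standard_vectors[word]
        return math.sqrt(sum((q - v) ** 2 for q, v in zip(query, vec)))

    # nsmallest avoids sorting the whole list when k is much smaller than it.
    return heapq.nsmallest(k, standard_vectors, key=distance_to)


# Toy 2-dimensional vectors standing in for the 1 x N or 1 x 3N vectors.
toy_vectors = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
print(top_k_candidates([0.9, 0.9], toy_vectors, k=2))
```

For a controlled term list with hundreds of thousands of entries, a vectorized or indexed search would normally replace this linear scan; the sketch only shows the selection logic.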
Specifically, in step S600, the logical inclusion distance indicates the degree of overlap between the medical word and the standard word; for example, the logical inclusion distance is expressed as the number of identical characters. The edit distance indicates the minimum number of edit operations required to edit the medical word into the standard word. In the weighted sum, the weight of the logical inclusion distance is 2 times the weight of the edit distance.
Optionally, the logical inclusion distance or the edit distance can also be used alone.
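The weighted combination can be sketched as follows. Two points are assumptions made for this illustration and are not stated in the specification: the number of identical characters is counted as a multiset intersection, and the edit distance, being a cost, enters the sum with a negative sign so that the largest score identifies the best candidate. The names `inclusion_overlap`, `combined_score`, and `best_candidate` are hypothetical, and the edit distances are passed in precomputed.

```python
from collections import Counter


def inclusion_overlap(medical_word, standard_word):
    """Count the characters shared by both words (multiset intersection)."""
    shared = Counter(medical_word) & Counter(standard_word)
    return sum(shared.values())


def combined_score(overlap, edit_distance, edit_weight=1.0):
    """Weighted sum in which the overlap weighs twice as much as the edit distance.

    Assumption: the edit distance is subtracted because it is a cost.
    """
    return 2.0 * edit_weight * overlap - edit_weight * edit_distance


def best_candidate(medical_word, candidates, edit_distances):
    """Pick the candidate with the largest combined score."""
    return max(
        candidates,
        key=lambda c: combined_score(inclusion_overlap(medical_word, c), edit_distances[c]),
    )


# Latin letters stand in for the characters of the medical and standard words.
print(best_candidate("XBCD", ["YBCD", "YBCDEF"], {"YBCD": 1, "YBCDEF": 3}))
```

Both toy candidates share three characters with the input, so the smaller edit distance decides in favour of the first one.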
Specifically, the edit distance, also known as the Levenshtein distance, is the minimum number of edit operations needed to change one string into another. The permitted edit operations are substituting one character for another, inserting a character, and deleting a character. In general, the smaller the edit distance, the greater the similarity of the two strings.
For example, after screening by Euclidean distance, the externally input medical word "contusion of local epidermal tissue of the palm of the hand" has small distances to the standard words "contusion of local epidermal tissue of the centre of the palm" and "contusion tumour of local epidermal tissue of the centre of the palm". Next, the edit distances are calculated.
The edit distance from "contusion of local epidermal tissue of the palm of the hand" to "contusion of local epidermal tissue of the centre of the palm" is 1; the edit distance from "contusion of local epidermal tissue of the palm of the hand" to "contusion tumour of local epidermal tissue of the centre of the palm" is 3 (the distances are counted in characters of the original Chinese terms, not of these English glosses). According to the above operation, the former may be selected as the matching result.
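A standard dynamic-programming implementation of the Levenshtein distance, together with the selection of the closest candidate, is sketched below. The toy strings use Latin letters in place of the characters of the original terms and are chosen to reproduce the same pattern as the example (one substitution versus one substitution plus two insertions); the name `closest_by_edit_distance` is illustrative.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute, free if equal
            ))
        previous = current
    return previous[-1]


def closest_by_edit_distance(medical_word, candidates):
    """Return the candidate with the smallest edit distance to medical_word."""
    return min(candidates, key=lambda c: levenshtein(medical_word, c))


print(levenshtein("XBCD", "YBCD"))    # one substitution
print(levenshtein("XBCD", "YBCDEF"))  # one substitution plus two insertions
print(closest_by_edit_distance("XBCD", ["YBCD", "YBCDEF"]))
```

Only two rows of the table are kept at a time, so memory use grows with the length of the shorter dimension rather than with the product of both lengths.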
The concepts of "Euclidean distance", "logical inclusion distance", and "edit distance" mentioned above are known in the art and, for brevity, are not described further here.
3. Medical text feature extraction and automatic matching system
According to an embodiment of the invention, a medical text feature extraction and automatic matching system is also provided for executing each step of the method in this application. As shown in Fig. 4, the medical text feature extraction and automatic matching system mainly includes a word segmentation module, a word vectorization module, a dimensionality reduction module, and a matching module.
The word segmentation module is configured to extract medical text from externally input medical data and to perform word segmentation on the medical text, obtaining medical words to be matched against the standard words in a controlled term list;
the word vectorization module is configured to obtain, by a word vectorization operation, the N-dimensional vector corresponding to each morpheme in the medical word, forming an M × N matrix, where M is the number of morphemes contained in the medical word;
the dimensionality reduction module is configured to reduce the M × N matrix, by a unidirectional pooling method, to an N-dimensional vector corresponding to the medical word;
the matching module is configured to:
calculate the Euclidean distance between the N-dimensional vector corresponding to the medical word and the N-dimensional vector corresponding to each standard word in the controlled term list;
sort the calculated Euclidean distances, and select, from the standard words in the controlled term list, the multiple standard words with the smaller Euclidean distances to the N-dimensional vector corresponding to the medical word, as candidate standard words;
calculate the logical inclusion distance and/or the edit distance between the medical word and each candidate standard word, compute the weighted sum of the logical inclusion distance and the edit distance, and take the candidate standard word with the largest weighted sum as the standard word finally matched to the medical word.
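The way the four modules hand data to one another can be sketched as a thin wrapper around four callables. The class name `MatchingSystem`, the method name `run`, and the stub modules in the usage lines are hypothetical placeholders; a real system would plug in the segmentation, vectorization, pooling, and matching logic described in section 2.

```python
class MatchingSystem:
    """Chains the four modules; each module is any callable of the shape noted."""

    def __init__(self, segment, vectorize, reduce_dim, match):
        self.segment = segment        # medical text -> list of medical words
        self.vectorize = vectorize    # medical word -> M x N matrix
        self.reduce_dim = reduce_dim  # M x N matrix -> single vector
        self.match = match            # (medical word, vector) -> standard word

    def run(self, medical_text):
        """Return a mapping from each medical word to its matched standard word."""
        results = {}
        for word in self.segment(medical_text):
            vector = self.reduce_dim(self.vectorize(word))
            results[word] = self.match(word, vector)
        return results


# Stub modules, for demonstration only.
system = MatchingSystem(
    segment=str.split,
    vectorize=lambda word: [[float(ord(ch))] for ch in word],
    reduce_dim=lambda matrix: [max(row[0] for row in matrix)],
    match=lambda word, vector: word.upper(),
)
print(system.run("asthma tumor"))
```

The matching callable receives both the word and its vector because the matching module uses the vector for the Euclidean screening and the text itself for the logical inclusion distance and the edit distance.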
In addition, different embodiments of the invention can also be implemented as software modules or as computer-readable instructions stored on one or more computer-readable media, where the computer-readable instructions, when executed by a processor or device component, carry out the different embodiments of the invention. Likewise, any combination of software modules, computer-readable media, and hardware components is contemplated by the invention. The software modules can be stored on any type of computer-readable storage medium, such as RAM, EPROM, EEPROM, flash memory, registers, hard disk, CD-ROM, DVD, and so on.
4. System with an application program installed according to an embodiment of the invention
Fig. 5 shows the running environment of the system with an application program installed according to an embodiment of the invention.
In this embodiment, the system with the application program installed is installed and run on an electronic device. The electronic device can be a computing device such as a desktop computer, a notebook, a palmtop computer, or a server. The electronic device may include, but is not limited to, a memory, a processor, and a display. The drawing shows only an electronic device with the above components; it should be understood that not all of the illustrated components are required, and more or fewer components may be implemented instead.
In some embodiments, the memory can be an internal storage unit of the electronic device, such as the hard disk or main memory of the electronic device. In other embodiments, the memory can be an external storage device of the electronic device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the electronic device. Further, the memory can include both an internal storage unit of the electronic device and an external storage device. The memory is used to store the application software installed on the electronic device and various kinds of data, for example the program code of the system with the application program installed. The memory can also be used to temporarily store data that has been output or is about to be output.
In some embodiments, the processor can be a central processing unit (CPU), a microprocessor, or another data processing chip, and is used to run the program code stored in the memory or to process data, for example to execute the system with the application program installed.
In some embodiments, the display can be an LED display, a liquid crystal display, a touch-control liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display is used to show the information processed in the electronic device and to show a visual user interface, such as an application menu interface or an application icon interface. The components of the electronic device communicate with one another over a system bus.
From the above description of the embodiments, those skilled in the art will clearly understand that the methods of the above embodiments can be implemented by software together with the necessary general-purpose hardware platform, and can of course also be implemented by hardware, although in many cases the former is the preferred implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, magnetic disk, or optical disc) and includes instructions that cause a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.
That is, according to an embodiment of the invention, a computer-readable storage medium is also provided, on which a program for executing the method according to an embodiment of the invention is stored; when the program is executed by a processor, each step of the method is performed.
From the above it will be appreciated that specific embodiments of the invention have been described here for purposes of illustration, but that various modifications can be made without departing from the scope of the invention. Those skilled in the art will understand that the steps drawn in the flow charts and the operations and routines described here can be varied in many ways. More specifically, the order of steps can be rearranged, steps can be executed in parallel, steps can be omitted, other steps can be included, and routines can be combined or omitted in various ways. Accordingly, the invention is limited only by the appended claims.