CN109830272A - Data normalization method, apparatus, computer equipment and storage medium - Google Patents

Data normalization method, apparatus, computer equipment and storage medium

Info

Publication number
CN109830272A
Authority
CN
China
Prior art keywords
data
type
occurrence
item data
item
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910011828.XA
Other languages
Chinese (zh)
Other versions
CN109830272B (en
Inventor
金晓辉
阮晓雯
徐亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910011828.XA
Publication of CN109830272A
Application granted
Publication of CN109830272B
Active
Anticipated expiration

Abstract

Embodiments of the present application provide a data normalization method, an apparatus, a computer device, and a storage medium. The method includes: obtaining an item of data to be standardized from a physical examination report; determining the data type corresponding to the specific value of the item of data; and standardizing the item of data according to the determined data type, where different data types correspond to different standardization methods. By applying a different standardization method to each data type, the embodiments can standardize a physical examination report comprehensively and improve the precision and accuracy of that standardization. The standardized data can in turn be used for model learning, which improves the consistency and accuracy of the data the model learns from.

Description

Data normalization method, apparatus, computer equipment and storage medium
Technical field
The present application relates to the technical field of data processing, and in particular to a data normalization method, an apparatus, a computer device, and a storage medium.
Background
An electronic physical examination report generally contains a large amount of information, and the examination items it covers are numerous and varied, which makes the report hard to process. As a result, the recognition methods commonly applied to electronic physical examination reports are fairly coarse: most of them match examination results directly by data format, keep only the data that can be recognized, and store and standardize that data for later model learning. However, across different physical examination reports, the results recorded for the same item may express the same meaning in entirely different wording, and the results of different items also differ greatly from one another. Such a coarse recognition method cannot recognize all of them. In addition, the examination items are not recognized comprehensively, and the data that is recognized is poorly suited to later model learning.
Summary of the invention
Embodiments of the present application provide a data normalization method, an apparatus, a computer device, and a storage medium, which can improve the precision and accuracy of data standardization.
In a first aspect, an embodiment of the present application provides a data normalization method, the method including:
obtaining an item of data to be standardized from a physical examination report; determining the data type corresponding to the specific value of the item of data; and standardizing the item of data according to the determined data type, where different data types correspond to different standardization methods.
In a second aspect, an embodiment of the present invention provides a data normalization apparatus, the apparatus including units for performing the method described in the first aspect.
In a third aspect, an embodiment of the present invention provides a computer device, the computer device including a memory and a processor connected to the memory;
the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory so as to perform the method described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method described in the first aspect.
The embodiments of the present application identify the data type corresponding to the specific value of an item of data and standardize that value differently depending on the data type. Because a different standardization method is applied to each data type, a physical examination report can be standardized comprehensively, the omission of important examination indicators or textual features is avoided, and the precision and accuracy of standardizing the report are improved. The standardized data can in turn be used for model learning, which improves the consistency and accuracy of the data the model learns from.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of the data normalization method provided by an embodiment of the present application;
Fig. 2 is a schematic sub-flowchart of the data normalization method provided by an embodiment of the present application;
Fig. 3 is a schematic sub-flowchart of the data normalization method provided by an embodiment of the present application;
Fig. 4 is a schematic sub-flowchart of Fig. 3 provided by an embodiment of the present application;
Fig. 5 is a schematic sub-flowchart of Fig. 3 provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of the data normalization apparatus provided by an embodiment of the present application;
Fig. 7 is a schematic block diagram of the type determination unit provided by an embodiment of the present application;
Fig. 8 is a schematic block diagram of the standardization unit provided by an embodiment of the present application;
Fig. 9 is a schematic block diagram of the regular expression matching unit provided by an embodiment of the present application;
Fig. 10 is a schematic block diagram of the natural language processing unit provided by an embodiment of the present application;
Fig. 11 is a schematic block diagram of the computer device provided by an embodiment of the present application.
Detailed description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The data involved in the embodiments of the present application is illustrated using the data in physical examination reports as an example. It should be understood that the scheme of the present application can also be applied to other scenarios and to data that does not come from physical examination reports.
Fig. 1 is a schematic flowchart of the data normalization method provided by an embodiment of the present application. As shown in Fig. 1, the method includes S101-S103.
S101: Obtain an item of data to be standardized from a physical examination report.
There may be several physical examination reports or only one. In this embodiment there are several. A physical examination report contains multiple items of data to be standardized, such as weight, heart rate, liver color ultrasound, and eyesight. Each item of data consists of a data item and a specific value, for example data item: height, specific value: 176cm. If there are several physical examination reports, the same item of data to be standardized is obtained from all of them, for example height and the specific values recorded for height. If there is only one report, the item of data to be standardized is obtained from that report. It should be understood that the examination results of different people may be written by different doctors. Because each doctor has different habits, the results for the same item of data may take several different values that all express the same meaning, so the examination results need to be standardized.
S102: Determine the data type corresponding to the specific value of the item of data, where the data types include the numeric type, the enumerated type, the simple mixed type, and the complex mixed type.
In this embodiment, the data types in a physical examination report are the numeric type, the enumerated type, the simple mixed type, and the complex mixed type. That is, the data in a physical examination report is divided into four types, which together cover essentially all examination results in a report.
Numeric type: the specific value of the item of data is a concrete number, such as 175cm or 50kg. Enumerated type: for example "negative", "normal", "not detected", "+", "++". Simple mixed type: for example "> 100 beats/min, nodal tachycardia"; this type is dominated by the numerical value. Complex mixed type: for example "multiple hypoechoic nodules are seen in the thyroid, the largest located in the right lobe and measuring about 14mm × 8mm, with a vascular ring around the periphery of the nodule". A value of this type may be pure text or a mixture of text and numbers. It is comparatively complicated and may itself contain enumerated and simple mixed content.
In one embodiment, as shown in Fig. 2, step S102 includes the following steps S201-S206.
S201: Obtain the specific value of the item of data and inspect the obtained value.
For example, if the data item of the item of data is height and its specific value is 176cm, the value 176cm is obtained. The specific value is then inspected to determine the data type it corresponds to.
S202: If the specific value of the item of data is a number, or a combination of a number and a unit, determine that the data type corresponding to the specific value is the numeric type.
For example, for the data item age with the specific value 28, the value is a number, so the data type is determined to be numeric. For the data item hemoglobin with the specific value 135g/L, the value is a combination of a number and a unit, so the data type is also determined to be numeric.
S203: If the specific value of the item of data is one of the preset enumerated values, determine that the data type corresponding to the specific value is the enumerated type.
The preset enumerated values include "normal", "no obvious abnormality", "no obvious abnormality seen", "negative", "not detected", "no congestion", "no enlargement", "nothing special", "positive", "abnormal", and so on. Values that express a grade are also treated as enumerated. Such preset enumerated values include "-", "+", "++", "+++", as used for example to grade urine glucose, and "grade I", "grade II", "grade III", as used for example to grade cleanliness.
S204: If the specific value of the item of data contains both text and numbers, judge whether the number of text characters does not exceed a first preset quantity and whether the count of numbers does not exceed a second preset quantity.
For example, suppose the specific value of an item of data is: "Both kidneys are normal in shape, size, and position; a strong echo with acoustic shadow is visible in the left kidney, about 4*3mm in size." This value contains both text and numbers. The number of text characters and the count of numbers in the value are tallied, and each is compared with its threshold: the character count with the first preset quantity and the count of numbers with the second preset quantity. The first preset quantity may be 20 and the second preset quantity may be 2. Both may also take other values.
S205: If the number of text characters in the specific value does not exceed the first preset quantity and the count of numbers does not exceed the second preset quantity, determine that the data type corresponding to the specific value is the simple mixed type.
S206: If the number of text characters in the specific value exceeds the first preset quantity, or the count of numbers exceeds the second preset quantity, or both, determine that the data type corresponding to the specific value is the complex mixed type.
It should be noted that the way of determining the data type is not limited to the scheme above. In other embodiments, other schemes may be used to determine the data type.
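Steps S201-S206 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the enumerated values are a small subset of those listed above, the unit syntax is a guess, and because the examples here are in English the sketch counts words rather than characters against the first preset quantity. Pure text that is not a preset enumerated value is treated as complex mixed, in line with the earlier remark that the complex mixed type may be pure text.

```python
import re

# Thresholds suggested in the text (20 and 2); other values are possible.
FIRST_PRESET = 20   # maximum amount of text for the simple mixed type
SECOND_PRESET = 2   # maximum count of numbers for the simple mixed type

# Illustrative subset of the preset enumerated values.
ENUM_VALUES = {"normal", "negative", "positive", "abnormal",
               "not detected", "-", "+", "++", "+++"}

NUMBER = re.compile(r"\d+(?:\.\d+)?")
WORD = re.compile(r"[A-Za-z]+")
# A number optionally followed by a unit such as "cm" or "g/L" (assumed syntax).
NUMERIC_VALUE = re.compile(r"\d+(?:\.\d+)?\s*[A-Za-z/%]*")


def classify(value: str) -> str:
    """Return the data type of a specific value (S202-S206)."""
    v = value.strip()
    if NUMERIC_VALUE.fullmatch(v):            # S202: number, or number plus unit
        return "numeric"
    if v.lower() in ENUM_VALUES:              # S203: preset enumerated value
        return "enumerated"
    numbers = NUMBER.findall(v)
    words = WORD.findall(v)                   # assumption: words, not characters
    if (numbers and words
            and len(words) <= FIRST_PRESET
            and len(numbers) <= SECOND_PRESET):
        return "simple mixed"                 # S205
    return "complex mixed"                    # S206, and pure free text
```

For instance, `classify("135g/L")` yields the numeric type, while a value with three separate numbers falls through to the complex mixed type.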
S103: Standardize the item of data according to the determined data type, where different data types correspond to different standardization methods.
The item of data is processed differently depending on its data type.
In one embodiment, as shown in Fig. 3, step S103 includes the following steps S301-S305.
S301: Obtain the determined data type corresponding to the specific value of the item of data.
S302: If the data type corresponding to the specific value of the item of data is the numeric type, process the specific values of the item of data in the physical examination report so as to unify the data unit of the item of data.
For example, if height is recorded as 168cm in one place and as 1.68m in another, the values are either all converted to centimetres (168cm, 168cm) or all converted to metres (1.68m, 1.68m), so that the data unit of the item of data is unified. If there are several physical examination reports, the specific values of the item of data in all of them need to be converted.
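The unit unification in S302 might look like the sketch below for height values. The conversion table and the choice of centimetres as the target unit are assumptions made for illustration; a real system would need one table per data item.

```python
import re

# Conversion factors to centimetres (illustrative assumption).
TO_CM = {"cm": 1.0, "m": 100.0, "mm": 0.1}

# A number followed by a unit made of letters, e.g. "1.68m".
VALUE = re.compile(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)\s*")


def to_cm(value: str) -> str:
    """Convert a height such as '1.68m' to a string expressed in cm."""
    m = VALUE.fullmatch(value)
    if m is None:
        raise ValueError(f"not a number plus unit: {value!r}")
    number, unit = float(m.group(1)), m.group(2).lower()
    if unit not in TO_CM:
        raise ValueError(f"unknown unit: {unit!r}")
    # Round to suppress floating-point noise, then drop trailing zeros.
    return f"{round(number * TO_CM[unit], 2):g}cm"
```

Applied to the values from several reports, `[to_cm(v) for v in ["168cm", "1.68m"]]` gives the same unit for every entry.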
S303: If the data type corresponding to the specific value of the item of data is the enumerated type, unify the wording of the specific value, or map the specific value to a preset numerical value.
For example, "normal", "no obvious abnormality", "no obvious abnormality seen", "negative", "not detected", and "nothing special" all express the same meaning, so they are all unified as "normal". Mapping the specific value to a preset numerical value means, for example, mapping "normal" and "abnormal" for an examination item to 0 and 1 respectively, where 0 and 1 are the numerical values preset for that item of data. Graded values such as "-", "+", "++", "+++" for urine glucose are mapped to 0, 1, 2, 3 respectively, where 0, 1, 2, 3 are the numerical values preset for that item of data.
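A minimal sketch of S303, assuming simple lookup tables. The synonym entries and codes come from the examples in the text; the table layout and function names are hypothetical.

```python
# Wordings that the text says all mean "normal" (illustrative subset).
SYNONYMS = {
    "normal": "normal",
    "no obvious abnormality": "normal",
    "negative": "normal",
    "not detected": "normal",
    "nothing special": "normal",
}

# Preset numerical values from the examples in the text.
CODES = {"normal": 0, "abnormal": 1}
GRADES = {"-": 0, "+": 1, "++": 2, "+++": 3}


def unify(value: str) -> str:
    """Replace a synonymous wording with its unified form."""
    v = value.strip().lower()
    return SYNONYMS.get(v, v)


def encode(value: str) -> int:
    """Map an enumerated value to its preset number.

    Raises KeyError for values that have no preset mapping.
    """
    v = value.strip()
    if v in GRADES:
        return GRADES[v]
    return CODES[unify(v)]
```

Raising on unknown values is a design choice of this sketch; it makes gaps in the preset tables visible instead of silently guessing a code.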
S304: If the data type corresponding to the specific value of the item of data is the simple mixed type, standardize it by regular expression matching.
A regular expression describes a pattern or rule for matching strings: text is matched against a predefined combination of specific characters (the rule). When standardizing by regular expression matching, the text is first matched with the regular expression, and the matched text is then standardized.
S305: If the data type corresponding to the specific value of the item of data is the complex mixed type, standardize it using natural language processing.
Natural language processing (NLP) standardizes the data by "understanding" the natural language.
The embodiment shown in Fig. 3 standardizes data of different types (numeric, enumerated, simple mixed, complex mixed) with different standardization methods.
In one embodiment, as shown in Fig. 4, step S304 includes the following steps S401-S405.
S401: According to the data item of the item of data, obtain the preset regular expression corresponding to the specific value of the item of data.
For example, for the data item heart rate, the preset regular expression is: sinus. The specific values in different physical examination reports may be: "heart rate 80 beats/min, sinus rhythm"; "sinus rhythm, 80 beats/min"; "80 beats/min, sinus heart rate"; and so on. Although the descriptions differ between reports, "sinus" appears in all of them, so matching with the preset regular expression sinus easily finds this data item. Note that one data item may have several preset regular expressions.
S402: Judge whether the preset regular expression matches the specific value of the item of data.
If the preset regular expression is sinus and "sinus" appears in the specific value of the item of data, the preset regular expression is determined to match the specific value. Otherwise it is determined not to match, and a prompt is issued.
S403: If the preset regular expression matches the specific value of the item of data, judge whether the specific value contains a symbol and a number.
The specific values of some data items contain a symbol, for example "< 30 times".
S404: If the specific value of the item of data contains a symbol and a number, extract the features corresponding to the specific value according to a preset format to obtain the standardization result.
The preset format may be: number, symbol, unit. For "< 30 times", the features extracted according to the preset format are: 30, <, times. For "< 3.12mmol/L", the features are: 3.12, <, mmol/L. The features extracted according to the preset format are taken as the standardization result.
S405: If the specific value of the item of data contains no symbol but does contain a number, extract the number from the specific value and take the extracted number as the standardization result.
If the specific value contains no symbol but does contain numbers, extraction is divided into single-number extraction and multi-number extraction depending on whether there is one number or several (the count of numbers never exceeds the second preset quantity). If there is only one number, that number is extracted, for example 80 from "heart rate 80 beats/min, sinus rhythm". If there are several numbers, all of them are extracted and serve as multiple result values of the item of data. For example, from the eyesight result "left eye vision 4.1, right eye vision 4.3", 4.1 and 4.3 are extracted, corresponding to left eye vision and right eye vision respectively.
This embodiment defines how data of the simple mixed type is standardized.
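The regular expression matching and feature extraction described in S402-S405 can be sketched as follows. The pattern list, the set of comparison symbols, and the function names are assumptions for illustration; the text does not specify the actual expressions beyond the "sinus" example.

```python
import re

# Hypothetical preset regular expressions for the data item "heart rate".
HEART_RATE_PATTERNS = [re.compile(r"sinus")]

# Symbol, number, unit, as in "< 3.12mmol/L"; the symbol set is an assumption.
SYMBOL_FORM = re.compile(r"([<>≤≥]=?)\s*(\d+(?:\.\d+)?)\s*([^\s,;]*)")
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def matches_preset(value, patterns):
    """S402: does any preset regular expression match the specific value?"""
    return any(p.search(value) for p in patterns)


def extract_features(value):
    """S403-S405: return (number, symbol, unit), a list of numbers, or None."""
    m = SYMBOL_FORM.search(value)
    if m:                                   # S404: symbol and number present
        symbol, number, unit = m.groups()
        return (number, symbol, unit)       # preset format: number, symbol, unit
    numbers = NUMBER.findall(value)         # S405: one or several numbers
    return numbers or None
```

A caller would first check `matches_preset` and issue a prompt on failure, then call `extract_features` on values that matched.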
In one embodiment, as shown in Fig. 5, step S305 includes the following steps S501-S506.
S501: Call a recursive grouping interface to split the text corresponding to the specific value of the item of data into groups by sentence.
The recursive grouping interface may be an interface provided by the Chinese lexical analysis toolkit THULAC, used to split the text corresponding to the specific value of the item of data into groups by sentence. THULAC is a Chinese lexical analysis toolkit developed and released by the Natural Language Processing and Social Humanities Computing Laboratory of Tsinghua University, and it provides functions such as Chinese word segmentation and part-of-speech tagging. It should be understood that the text corresponding to the specific value may contain long sentences, short clauses inside long sentences, content inside and outside parentheses, and so on. Calling the recursive grouping interface splits the text into nested groups: a large group (paragraph) contains middle groups (sentences), and a middle group contains small groups (short clauses or words).
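The nested grouping can be illustrated with a small punctuation-based splitter. This is a stand-in sketch, not the THULAC interface mentioned above: the delimiter levels are assumptions, and parentheses are not handled.

```python
import re

# Delimiters for each nesting level, from coarse to fine (assumed).
LEVELS = [
    r"\.(?!\d)|[。!?！？]",   # sentence ends; a dot inside "4.3" is not a split
    r"[;；]",                 # middle groups
    r"[,，、]",               # small groups (short clauses)
]


def group(text, level=0):
    """Recursively split text into nested lists of short clauses."""
    if level == len(LEVELS):
        return text.strip()
    parts = [p for p in re.split(LEVELS[level], text) if p.strip()]
    return [group(p, level + 1) for p in parts]
```

Each innermost string is a short clause that can then be passed to the data type check of the next step.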
S502: Judge whether the data type of the text obtained from the sentence grouping is the numeric type, the enumerated type, or the simple mixed type. That is, the data type of each short clause obtained from the grouping is determined.
S503: If the data type of the grouped text is the numeric type, the enumerated type, or the simple mixed type, standardize it with the standardization method corresponding to that type.
S504: If the data type of the grouped text is not the numeric type, the enumerated type, or the simple mixed type, call a word segmentation and part-of-speech tagging interface, perform word segmentation and part-of-speech tagging on the grouped text, and analyze it to obtain a first result.
Specifically, the short clauses obtained from the grouping are fetched, the word segmentation and part-of-speech tagging interface is called to segment each clause, and the part of speech of each segmented word is determined. Based on those parts of speech, the clause is analyzed according to certain rules to obtain the first result. Parts of speech include noun, adjective, and so on, and nouns are generally the core words. The word segmentation and part-of-speech tagging interface may be an interface provided by the Chinese lexical analysis toolkit THULAC, used for word segmentation, part-of-speech tagging, syntactic analysis, and the like. The same functions may also be performed with segmentation and tagging interfaces provided by other tools. As an example of analyzing a clause according to certain rules, a clause can be regarded as three parts: 1) which organ, 2) what is wrong, 3) the specific value; for example 1) thyroid, 2) nodule, 3) 2cm. Note that when the segmentation and part-of-speech tagging interface is called in this step, the analysis is mainly applied to clauses that contain numerical values, and it extracts the numerical features corresponding to the core word. If a clause contains no numerical feature, the first result is empty.
S505: Call a keyword extraction algorithm to compute statistics on the grouped clauses, obtaining a first frequency with which each candidate keyword occurs and a second frequency with which each candidate keyword occurs in the multiple physical examination report documents containing the item of data. According to the first frequency and the second frequency, extract a group of keywords for the specific value of the item of data from the candidate keywords, and take the extracted keywords as a second result.
The keyword extraction algorithm may be the TF-IDF algorithm. TF (Term Frequency) is the frequency with which a keyword occurs. It is taken as the first frequency, that is, the frequency with which a (candidate) keyword occurs in the specific value of the item of data. IDF (Inverse Document Frequency) reflects how often a word occurs across the whole corpus. It is called the second frequency, that is, the frequency with which a (candidate) keyword occurs in the multiple physical examination report documents containing the item of data. A group of keywords for the specific value is extracted from the candidate keywords according to the first and second frequencies as follows: multiply the first frequency and the second frequency of each candidate keyword to obtain a product; sort the products in descending order; take the top-ranked group of candidate keywords after sorting; and treat that group as the keywords of the specific value of the item of data. This group of keywords is taken as the features corresponding to the item of data, and these features are the second result.
For example, if the data item (examination item) corresponding to the item of data is lung, the extracted keywords (features) may be: inflammation, calcification, and so on. This indicates that the lung shows inflammation and calcification.
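The multiply-and-rank procedure can be sketched with a small TF-IDF function. The sketch assumes the text has already been segmented into words, and it uses a smoothed logarithmic inverse document frequency, which is one common formulation and an assumption here; the text itself only says that the two frequencies are multiplied and ranked in descending order.

```python
import math
from collections import Counter


def top_keywords(tokens, corpus, k=2):
    """Rank the words of one value by TF-IDF and return the top k.

    tokens: segmented words of the specific value being standardized.
    corpus: segmented words of the same item in every report (list of lists).
    """
    tf = Counter(tokens)
    n_docs = len(corpus)
    scores = {}
    for term, count in tf.items():
        df = sum(1 for doc in corpus if term in doc)    # documents containing term
        idf = math.log((1 + n_docs) / (1 + df)) + 1     # smoothed IDF (assumed form)
        scores[term] = (count / len(tokens)) * idf      # first x second frequency
    # Descending score; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

With a corpus in which "lung" appears in every report but "inflammation" and "calcification" appear in only one, the two rarer words rank first, which matches the lung example above.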
S506: Take the first result and the second result as the standardization result corresponding to the specific value of the item of data.
In one embodiment, before the word segmentation and part-of-speech tagging interface is called, the method further includes step S503a.
S503a: Detect whether the grouped text contains a number. If it does, perform the step of calling the word segmentation and part-of-speech tagging interface. If it does not, perform step S505.
In this embodiment, the step "call a word segmentation and part-of-speech tagging interface, perform word segmentation and part-of-speech tagging on the grouped text, and analyze it" is aimed mainly at text that contains numbers. If the grouped text contains no number, that step does not need to be performed, which reduces the amount of computation and saves time during standardization.
In one embodiment, after step S506, the method further includes S506a, S506b, and S506c.
S506a: Obtain the preset features and feature identifiers corresponding to the specific value of the item of data.
For example, for the data item lung, the preset features are "normal", "inflammation", "calcification", and so on. The corresponding feature identifiers are respectively "0, 1" (0 means normal, 1 means abnormal), "0, 1" (0 means the feature is absent, that is, no inflammation; 1 means the feature is present, that is, inflammation), and "0, 1" (0 means the feature is absent, that is, no calcification; 1 means the feature is present, that is, calcification).
S506b: Match the standardization result corresponding to the specific value of the item of data against the preset features to obtain a matching result.
If the standardization result is "abnormal", "inflammation", the matching result obtained after matching against the preset features is "abnormal", "inflammation".
S506c: According to the matching result, mark the standardization result with the corresponding feature identifiers.
If the matching result is "abnormal", "inflammation", the result of marking with the corresponding feature identifiers is "1", "1", "0". If the matching result is "abnormal", "inflammation", "calcification", the result of marking is "1", "1", "1".
This embodiment further marks the standardization result and thereby quantizes it, which makes analysis and statistics by a model easier.
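The marking in S506a-S506c amounts to turning the matched features into a fixed-length vector of 0/1 flags. A minimal sketch for the lung example, with a hypothetical feature list and function name:

```python
# Preset features for the data item "lung", in a fixed order (from the example).
# The first flag encodes normal (0) versus abnormal (1).
LUNG_FEATURES = ["abnormal", "inflammation", "calcification"]


def mark(matched, features=LUNG_FEATURES):
    """Return one 0/1 flag per preset feature, 1 if the feature was matched."""
    found = set(matched)
    return [1 if feature in found else 0 for feature in features]
```

Keeping the feature order fixed means every report yields a vector of the same length, which is what makes the output convenient as model input.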
Above method embodiment targetedly classifies to the data in physical examination report, and data type is such as divided into fourThe different type of kind, and different standardizations is carried out respectively to the data in physical examination report according to the Different Results of classification,Comprehensively physical examination report can be standardized, avoid the omission of important physical examination index or character features, improve simultaneouslyTo the precision and accuracy of the processing of physical examination reporting standardsization.Data after standardization can be further used for model learning, mentionThe high consistency and accuracy of the data of model learning.
Fig. 6 is the schematic block diagram of data normalization device provided by the embodiments of the present application.As shown in fig. 6, the device packetIt includes for executing unit corresponding to above-mentioned data normalization method.Specifically, as shown in fig. 6, the device 60 includes obtaining listFirst 601, type determining units 602, Standardisation Cell 603.
Acquiring unit 601, for obtaining an item data to be normalized in physical examination report.
Type determining units 602, for determining data type corresponding to the occurrence of the item data, wherein data classType includes numeric type, enumeration type, type, COMPLEX MIXED type is simply mixed.
In one embodiment, as shown in fig. 7, type determining units 602 are including obtaining detection unit 701, numeric type determinesUnit 702, enumeration type determination unit 703, quantity judging unit 704 and mixed type determination unit 705.Wherein, detection is obtainedUnit 701, for obtaining the occurrence of the item data and being detected to the occurrence of the acquired item data.Numeric type is trueOrder member 702 if the occurrence for the item data is number, or is the combination of number and unit, then it is determined that the item numberAccording to occurrence corresponding to data type be numeric type.Enumeration type determination unit 703, if the occurrence for the item data isOne of preset enumerated value, then it is determined that data type corresponding to the occurrence of the item data is enumeration type.QuantityJudging unit 704 if the occurrence for the item data not only includes text, but also includes number, judges that the number of words of text isIt is no to be less than whether the number that the first preset quantity and number occur is less than the second preset quantity.Mixed type determination unit 705,If the number of words for the text in the occurrence of the item data is less than the number that the first preset quantity and number occur and is less thanSecond preset quantity determines that data type corresponding to the occurrence of the item data is that type is simply mixed;Otherwise, it determines the item numberAccording to occurrence corresponding to data type be COMPLEX MIXED type.
The standardization unit 603 is configured to standardize the item of data according to the determined data type, where different data types correspond to different standardization methods.
In one embodiment, as shown in Fig. 8, the standardization unit 603 includes a type acquiring unit 801, a numeric processing unit 802, an enumeration processing unit 803, a regular expression matching unit 804, and a natural language processing unit 805. The type acquiring unit 801 is configured to obtain the determined data type corresponding to the specific value of the item of data. The numeric processing unit 802 is configured to, if the data type is the numeric type, process the specific value of the item of data in the physical examination report so as to unify the unit of the item of data. The enumeration processing unit 803 is configured to, if the data type is the enumeration type, unify the text in the specific value of the item of data, or match and map the specific value to a preset numerical value. The regular expression matching unit 804 is configured to, if the data type is the simple mixed type, standardize the item of data by regular expression matching. The natural language processing unit 805 is configured to, if the data type is the complex mixed type, standardize the item of data by natural language processing.
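The numeric and enumeration branches of this dispatch can be illustrated as follows. The sketch is not taken from the patent: the conversion table `UNIT_FACTORS`, the mapping `ENUM_CODES`, the item names, and all function names are hypothetical placeholders for whatever presets an implementation would configure. The two mixed types are left unimplemented here.

```python
import re

# Hypothetical presets: factors that convert each unit to the target unit of
# a data item (centimetres for "height"), and numeric codes for enumerated text.
UNIT_FACTORS = {"height": {"cm": 1.0, "m": 100.0}}
ENUM_CODES = {"negative": 0, "-": 0, "positive": 1, "+": 1}


def normalize_numeric(item: str, value: str) -> float:
    """Convert 'number + optional unit' to the item's target unit."""
    match = re.fullmatch(r"\s*([-+]?\d+(?:\.\d+)?)\s*(\S*)\s*", value)
    if match is None:
        raise ValueError(f"not a numeric value: {value!r}")
    number, unit = float(match.group(1)), match.group(2)
    if not unit:
        return number  # no unit given; assume it is already the target unit
    return number * UNIT_FACTORS[item][unit]


def normalize_enumerated(value: str) -> int:
    """Map the different spellings of an enumerated value to one preset code."""
    return ENUM_CODES[value.strip().lower()]


def standardize(item: str, value: str, data_type: str):
    """Pick the standardization method that matches the determined data type."""
    if data_type == "numeric":
        return normalize_numeric(item, value)
    if data_type == "enumerated":
        return normalize_enumerated(value)
    raise NotImplementedError(f"no handler in this sketch for {data_type!r}")
```

Keeping one handler per data type mirrors the unit structure in Fig. 8 and lets each method evolve independently.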
In one embodiment, as shown in Fig. 9, the regular expression matching unit 804 includes an expression acquiring unit 901, a matching judgment unit 902, a symbol and number judging unit 903, a first extraction unit 904, and a second extraction unit 905. The expression acquiring unit 901 is configured to obtain, according to the data item to which the item of data belongs, the preset regular expression corresponding to the specific value of the item of data. The matching judgment unit 902 is configured to judge whether the preset regular expression matches the specific value of the item of data. The symbol and number judging unit 903 is configured to judge, if the preset regular expression matches the specific value, whether the specific value contains symbols and numbers. The first extraction unit 904 is configured to extract, if the specific value contains both symbols and numbers, the features corresponding to the specific value according to a preset format, to obtain the standardization result. The second extraction unit 905 is configured to extract, if the specific value contains no symbols but contains numbers, the numbers in the specific value, and to take the extracted numbers as the standardization result.
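A minimal sketch of this regular expression branch is shown below. The per-item patterns, the set of characters counted as "symbols", and the use of named groups as the "preset format" are all assumptions made for illustration; the text does not list the actual expressions or formats.

```python
import re

# Hypothetical preset patterns, one per data item; named groups play the role
# of the "preset format" used for feature extraction.
PRESET_PATTERNS = {
    "blood_pressure": re.compile(r"(?P<systolic>\d+)\s*/\s*(?P<diastolic>\d+)"),
    "heart_rate": re.compile(r"(?P<value>\d+(?:\.\d+)?)"),
}
SYMBOLS = set("/<>+-~")  # assumed set of characters treated as symbols


def normalize_simple_mixed(item: str, value: str):
    """Return extracted features, a single number, or None if nothing matches."""
    pattern = PRESET_PATTERNS.get(item)
    if pattern is None:
        return None
    match = pattern.search(value)
    if match is None:
        return None
    has_number = any(ch.isdigit() for ch in value)
    has_symbol = any(ch in SYMBOLS for ch in value)
    if has_symbol and has_number:
        # Symbols and numbers: extract features according to the preset format.
        return {name: float(text) for name, text in match.groupdict().items()}
    if has_number:
        # Numbers only: the first number itself is the standardization result.
        return float(re.search(r"\d+(?:\.\d+)?", value).group())
    return None
```

For example, under these assumptions a blood pressure value written as "120/80 mmHg" yields two named features, while a heart rate written as "about 72 beats" yields the single number.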
In one embodiment, as shown in Fig. 10, the natural language processing unit 805 includes a sentence segmentation unit 101, a text type judging unit 102, a part-of-speech analysis unit 103, a keyword extraction unit 104, and a result determination unit 105. The sentence segmentation unit 101 is configured to call a recursive grouping interface to segment the text corresponding to the specific value of the item of data into groups of short sentences. The text type judging unit 102 is configured to judge whether the data type of the segmented text is the numeric type, the enumeration type, or the simple mixed type. If it is, the numeric processing unit, the enumeration processing unit, or the regular expression matching unit is triggered accordingly. The part-of-speech analysis unit 103 is configured to, if the data type of the segmented text is not the numeric type, the enumeration type, or the simple mixed type, call a word segmentation and part-of-speech tagging interface to segment the text into words and tag their parts of speech, and to analyze the output to obtain a first result. The keyword extraction unit 104 is configured to call a keyword extraction algorithm to compute statistics over the segmented short sentences, so as to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report files containing the item of data; according to the first frequency and the second frequency, it extracts from the candidate keywords a group of keywords for the specific value of the item of data and takes the extracted keywords as a second result. The result determination unit 105 is configured to take the first result and the second result as the standardization result corresponding to the specific value of the item of data.
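The keyword extraction step combines a frequency within the value and a frequency across report files, which resembles TF-IDF weighting. The sketch below uses a TF-IDF-style score as a stand-in; the text does not name the algorithm, so the scoring formula, the assumption that the first frequency is counted within the value's own text, the punctuation-based splitting, and the simple English tokenizer are all illustrative choices. A real system processing Chinese text would need a proper word segmenter.

```python
import math
import re
from collections import Counter


def split_clauses(text: str) -> list:
    """Split a free-text value into short sentences at punctuation marks."""
    parts = re.split(r"[,.;:!?，。；：！？、]", text)
    return [part.strip() for part in parts if part.strip()]


def tokenize(clause: str) -> list:
    """Very rough tokenizer: lowercase alphabetic words of two or more letters."""
    return re.findall(r"[a-z]{2,}", clause.lower())


def extract_keywords(value: str, corpus: list, top_k: int = 3) -> list:
    """Rank candidate keywords by in-value frequency and rarity across reports."""
    tokens = [tok for clause in split_clauses(value) for tok in tokenize(clause)]
    first_freq = Counter(tokens)  # occurrences within this value
    doc_tokens = [set(tokenize(doc)) for doc in corpus]
    scores = {}
    for term, tf in first_freq.items():
        # Second frequency: number of report texts that contain the term.
        second_freq = sum(term in doc for doc in doc_tokens)
        scores[term] = tf * math.log((1 + len(corpus)) / (1 + second_freq))
    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda term: (-scores[term], term))
    return ranked[:top_k]
```

Terms that appear in every report receive a score of zero under this formula, so boilerplate words drop out of the extracted group.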
In one embodiment, the natural language processing unit 805 further includes a number detection unit 102a. The number detection unit 102a is configured to detect, if the data type of the segmented text is not the numeric type, the enumeration type, or the simple mixed type, whether numbers are present in the segmented text. If numbers are present, the part-of-speech analysis unit 103 is triggered. If no numbers are present, the keyword extraction unit 104 is triggered.
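The routing performed by the number detection unit amounts to a single check on each short sentence. In the sketch below, the two handlers are passed in as callables because the text does not define their interfaces; the name `route_clause` and both parameter names are hypothetical.

```python
import re
from typing import Any, Callable


def route_clause(
    clause: str,
    analyze_pos: Callable[[str], Any],
    extract_keywords: Callable[[str], Any],
) -> Any:
    """Send a short sentence to part-of-speech analysis or keyword extraction."""
    if re.search(r"\d", clause):
        # Digits present: word segmentation and part-of-speech analysis.
        return analyze_pos(clause)
    # No digits: keyword extraction.
    return extract_keywords(clause)
```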
In one embodiment, the natural language processing unit 805 further includes a feature identifier acquiring unit 105a, a feature matching unit 105b, and a marking unit 105c. The feature identifier acquiring unit 105a is configured to obtain the preset features and feature identifiers corresponding to the specific value of the item of data. The feature matching unit 105b is configured to match the standardization result corresponding to the specific value of the item of data against the preset features to obtain a matching result. The marking unit 105c is configured to mark the standardization result with the corresponding feature identifier according to the matching result.
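Feature matching and marking can be pictured as a lookup from each standardization result to a preset identifier. The table `FEATURE_IDS`, its identifier strings, and the function name are hypothetical; exact string matching is also an assumption, since the text does not say how results are compared with features.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical preset features and their identifiers for one data item.
FEATURE_IDS: Dict[str, str] = {"cyst": "F_CYST", "nodule": "F_NODULE"}


def mark_results(
    results: List[str], feature_ids: Dict[str, str] = FEATURE_IDS
) -> List[Tuple[str, Optional[str]]]:
    """Pair each result with its feature identifier, or None if none matches."""
    return [(result, feature_ids.get(result)) for result in results]
```

Marked results of this kind give downstream model training a fixed vocabulary of identifiers instead of free text.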
It should be noted that, as will be clear to those skilled in the art, the specific implementation of the above device and of each unit may refer to the corresponding description in the foregoing method embodiments; for convenience and brevity of description, it is not repeated here.
The above device may be implemented in the form of a computer program, and the computer program can run on the computer equipment shown in Fig. 11.
Fig. 11 is a schematic block diagram of computer equipment provided by an embodiment of the present application. The equipment is a terminal or similar device, such as a mobile terminal, a PC terminal, or an iPad. The equipment 110 includes a processor 112, a memory, and a network interface 113 connected via a system bus 111, where the memory may include a non-volatile storage medium 114 and an internal memory 115.
The non-volatile storage medium 114 can store an operating system 1141 and a computer program 1142. When the computer program 1142 stored in the non-volatile storage medium is executed by the processor 112, the data normalization method described above can be implemented. The processor 112 provides computing and control capabilities and supports the operation of the whole equipment. The internal memory 115 provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor 112, the computer program causes the processor 112 to execute the data normalization method described above. The network interface 113 is used for network communication. Those skilled in the art will understand that the structure shown in Fig. 11 is only a block diagram of the part of the structure relevant to the solution of the present application and does not limit the equipment to which the solution is applied; a specific equipment may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
The processor 112 is configured to run the computer program stored in the memory to implement the following steps:
Obtain an item of data to be normalized from a physical examination report; determine the data type corresponding to the specific value of the item of data; standardize the item of data according to the determined data type, where different data types correspond to different standardization methods.
In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When executing the step of determining the data type corresponding to the specific value of the item of data, the processor 112 specifically implements the following steps:
Obtain the specific value of the item of data and examine the obtained value; if the specific value is a number, or a combination of a number and a unit, determine that the data type corresponding to the specific value is the numeric type; if the specific value is one of the preset enumerated values, determine that the data type is the enumeration type; if the specific value contains both text and numbers, judge whether the number of characters of text is less than a first preset quantity and whether the number of times numbers occur is less than a second preset quantity; if the number of characters of text in the specific value is less than the first preset quantity and the number of times numbers occur is less than the second preset quantity, determine that the data type is the simple mixed type; otherwise, determine that the data type is the complex mixed type.
In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When executing the step of standardizing the item of data according to the determined data type, the processor 112 specifically implements the following steps:
If the data type corresponding to the specific value of the item of data is the numeric type, process the specific value of the item of data in the physical examination report so as to unify the unit of the item of data; if the data type is the enumeration type, unify the text in the specific value, or match and map the specific value to a preset numerical value; if the data type is the simple mixed type, standardize the item of data by regular expression matching; if the data type is the complex mixed type, standardize the item of data by natural language processing.
In one embodiment, when executing the step of standardizing by regular expression matching if the data type corresponding to the specific value of the item of data is the simple mixed type, the processor 112 specifically implements the following steps:
According to the data item to which the item of data belongs, obtain the preset regular expression corresponding to the specific value of the item of data; judge whether the preset regular expression matches the specific value; if the preset regular expression matches the specific value, judge whether the specific value contains symbols and numbers; if the specific value contains both symbols and numbers, extract the features corresponding to the specific value according to a preset format to obtain the standardization result; if the specific value contains no symbols but contains numbers, extract the numbers in the specific value and take the extracted numbers as the standardization result.
In one embodiment, when executing the step of standardizing by natural language processing if the data type corresponding to the specific value of the item of data is the complex mixed type, the processor 112 specifically implements the following steps:
Call a recursive grouping interface to segment the text corresponding to the specific value of the item of data into groups of short sentences; judge whether the data type of the segmented text is the numeric type, the enumeration type, or the simple mixed type; if it is, standardize the text using the standardization method corresponding to the numeric type, the enumeration type, or the simple mixed type; if it is not, call a word segmentation and part-of-speech tagging interface to segment the text into words and tag their parts of speech, and analyze the output to obtain a first result; call a keyword extraction algorithm to compute statistics over the segmented short sentences, so as to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report files containing the item of data, extract from the candidate keywords, according to the first frequency and the second frequency, a group of keywords for the specific value of the item of data, and take the extracted keywords as a second result; take the first result and the second result as the standardization result corresponding to the specific value of the item of data.
In one embodiment, after executing the step of taking the first result and the second result as the standardization result corresponding to the specific value of the item of data, the processor 112 also implements the following steps:
Obtain the preset features and feature identifiers corresponding to the specific value of the item of data; match the standardization result corresponding to the specific value against the preset features to obtain a matching result; according to the matching result, mark the standardization result with the corresponding feature identifier.
In one embodiment, before executing the step of calling the word segmentation and part-of-speech tagging interface, segmenting the text into words and tagging their parts of speech, and analyzing the output to obtain the first result, the processor 112 also implements the following steps:
Detect whether numbers are present in the segmented text; if numbers are present in the segmented text, execute the step of calling the word segmentation and part-of-speech tagging interface, segmenting the text into words and tagging their parts of speech, and analyzing the output to obtain the first result.
It should be understood that, in the embodiments of the present application, the processor 112 may be a central processing unit (Central Processing Unit, CPU), or it may be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. The general-purpose processor may be a microprocessor or any conventional processor.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The computer program may be stored in a storage medium, which may be a computer-readable storage medium. The computer program is executed by at least one processor in the computer system to implement the process steps of the above method embodiments.
Therefore, the present application also provides a storage medium. The storage medium may be a computer-readable storage medium. The storage medium stores a computer program which, when executed by a processor, implements the following steps:
Obtain an item of data to be normalized from a physical examination report; determine the data type corresponding to the specific value of the item of data; standardize the item of data according to the determined data type, where different data types correspond to different standardization methods.
In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When executing the step of determining the data type corresponding to the specific value of the item of data, the processor specifically implements the following steps:
Obtain the specific value of the item of data and examine the obtained value; if the specific value is a number, or a combination of a number and a unit, determine that the data type corresponding to the specific value is the numeric type; if the specific value is one of the preset enumerated values, determine that the data type is the enumeration type; if the specific value contains both text and numbers, judge whether the number of characters of text is less than a first preset quantity and whether the number of times numbers occur is less than a second preset quantity; if the number of characters of text in the specific value is less than the first preset quantity and the number of times numbers occur is less than the second preset quantity, determine that the data type is the simple mixed type; otherwise, determine that the data type is the complex mixed type.
In one embodiment, the data types include the numeric type, the enumeration type, the simple mixed type, and the complex mixed type. When executing the step of standardizing the item of data according to the determined data type, the processor specifically implements the following steps:
If the data type corresponding to the specific value of the item of data is the numeric type, process the specific value of the item of data in the physical examination report so as to unify the unit of the item of data; if the data type is the enumeration type, unify the text in the specific value, or match and map the specific value to a preset numerical value; if the data type is the simple mixed type, standardize the item of data by regular expression matching; if the data type is the complex mixed type, standardize the item of data by natural language processing.
In one embodiment, when executing the step of standardizing by regular expression matching if the data type corresponding to the specific value of the item of data is the simple mixed type, the processor specifically implements the following steps:
According to the data item to which the item of data belongs, obtain the preset regular expression corresponding to the specific value of the item of data; judge whether the preset regular expression matches the specific value; if the preset regular expression matches the specific value, judge whether the specific value contains symbols and numbers; if the specific value contains both symbols and numbers, extract the features corresponding to the specific value according to a preset format to obtain the standardization result; if the specific value contains no symbols but contains numbers, extract the numbers in the specific value and take the extracted numbers as the standardization result.
In one embodiment, when executing the step of standardizing by natural language processing if the data type corresponding to the specific value of the item of data is the complex mixed type, the processor specifically implements the following steps:
Call a recursive grouping interface to segment the text corresponding to the specific value of the item of data into groups of short sentences; judge whether the data type of the segmented text is the numeric type, the enumeration type, or the simple mixed type; if it is, standardize the text using the standardization method corresponding to the numeric type, the enumeration type, or the simple mixed type; if it is not, call a word segmentation and part-of-speech tagging interface to segment the text into words and tag their parts of speech, and analyze the output to obtain a first result; call a keyword extraction algorithm to compute statistics over the segmented short sentences, so as to obtain a first frequency with which each candidate keyword occurs and a second frequency with which the candidate keyword occurs in the multiple physical examination report files containing the item of data, extract from the candidate keywords, according to the first frequency and the second frequency, a group of keywords for the specific value of the item of data, and take the extracted keywords as a second result; take the first result and the second result as the standardization result corresponding to the specific value of the item of data.
In one embodiment, after executing the step of taking the first result and the second result as the standardization result corresponding to the specific value of the item of data, the processor also implements the following steps:
Obtain the preset features and feature identifiers corresponding to the specific value of the item of data; match the standardization result corresponding to the specific value against the preset features to obtain a matching result; according to the matching result, mark the standardization result with the corresponding feature identifier.
In one embodiment, before executing the step of calling the word segmentation and part-of-speech tagging interface, segmenting the text into words and tagging their parts of speech, and analyzing the output to obtain the first result, the processor also implements the following steps:
Detect whether numbers are present in the segmented text; if numbers are present in the segmented text, execute the step of calling the word segmentation and part-of-speech tagging interface, segmenting the text into words and tagging their parts of speech, and analyzing the output to obtain the first result.
The storage medium may be a USB flash drive, a removable hard disk, a read-only memory (Read-Only Memory, ROM), a magnetic disk, an optical disc, or any other computer-readable storage medium capable of storing program code.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus, equipment, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a division by logical function, and other divisions are possible in actual implementation. Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus, equipment, and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here. The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in this application, and these modifications or substitutions shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

CN201910011828.XA | 2019-01-07 | 2019-01-07 | Data standardization method and device, computer equipment and storage medium | Active | CN109830272B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201910011828.XA / CN109830272B (en) | 2019-01-07 | 2019-01-07 | Data standardization method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201910011828.XA / CN109830272B (en) | 2019-01-07 | 2019-01-07 | Data standardization method and device, computer equipment and storage medium

Publications (2)

Publication Number | Publication Date
CN109830272A (en) | 2019-05-31
CN109830272B (en) | 2022-08-30

Family

ID=66860174

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201910011828.XA / CN109830272B (en), Active | Data standardization method and device, computer equipment and storage medium | 2019-01-07 | 2019-01-07

Country Status (1)

Country | Link
CN (1) | CN109830272B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20030105638A1 (en) * | 2001-11-27 | 2003-06-05 | Taira Rick K. | Method and system for creating computer-understandable structured medical data from natural language reports
CN107545934A (en) * | 2017-05-11 | 2018-01-05 | 新华三大数据技术有限公司 | The extracting method and device of numeric type index
CN108733837A (en) * | 2018-05-28 | 2018-11-02 | 杭州依图医疗技术有限公司 | A kind of the natural language structural method and device of case history text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
吕旭东: "一种电子病历系统体系结构及其关键技术", 《中国生物医学工程学报》*

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN110957016A (en) * | 2019-11-21 | 2020-04-03 | 山东鲁能软件技术有限公司 | Physical examination data intelligent recognition system and method based on health cloud management platform
CN110957016B (en) * | 2019-11-21 | 2023-08-08 | 山东鲁能软件技术有限公司 | Physical examination data intelligent identification system and method based on health cloud management platform
CN113392939A (en) * | 2021-08-16 | 2021-09-14 | 江苏苏宁银行股份有限公司 | Industrial code standardization method and device, electronic equipment and storage medium
CN115064237A (en) * | 2022-06-09 | 2022-09-16 | 山东浪潮智慧医疗科技有限公司 | A method to realize the standardization of hospital medical examination summary data
WO2024067442A1 (en) * | 2022-09-27 | 2024-04-04 | 华为技术有限公司 | Data management method and related apparatus
CN119905194A (en) * | 2025-03-27 | 2025-04-29 | 成都企康科技有限公司 | Standardized processing method, device, electronic device and storage medium for physical examination data
CN120048497A (en) * | 2025-04-23 | 2025-05-27 | 上海杉泰健康科技有限公司 | Method, system, electronic equipment and storage medium for reading medical examination sheets

Also Published As

Publication number | Publication date
CN109830272B (en) | 2022-08-30

Similar Documents

Publication | Title
CN109830272A (en) | Data normalization method, apparatus, computer equipment and storage medium
US9665565B2 (en) | Semantic similarity evaluation method, apparatus, and system
CN110543631B (en) | Implementation method and device for machine reading understanding, storage medium and electronic equipment
CN111898366A (en) | Document subject word aggregation method and device, computer equipment and readable storage medium
US11755661B2 (en) | Text entry assistance and conversion to structured medical data
US8935155B2 (en) | Method for processing medical reports
CN112541066B (en) | Text-based structured medical report detection method and related equipment
CN110929520B (en) | Unnamed entity object extraction method and device, electronic equipment and storage medium
CN110263155B (en) | Data classification method, and training method and system of data classification model
CN112631436B (en) | Method and device for filtering sensitive words of input method
CN109471950B (en) | Method for constructing structured knowledge network of abdominal ultrasonic text data
Litkowski | Pattern dictionary of english prepositions
CN111063446B (en) | Method, apparatus, device and storage medium for standardizing medical text data
CN109544376A (en) | A kind of abnormal case recognition methods and calculating equipment based on data analysis
CN111859032A (en) | Method and device for detecting character-breaking sensitive words of short message and computer storage medium
CN108363691A (en) | A kind of field term identifying system and method for 95598 work order of electric power
US11526657B2 (en) | Method and apparatus for error correction of numerical contents in text, and storage medium
CN115359799A (en) | Speech recognition method, training method, device, electronic equipment and storage medium
CN110347805A (en) | Petroleum industry security risk key element extracting method, device, server and storage medium
CN111177356A (en) | Acid-base index medical big data analysis method and system
CN119476428A (en) | A method and system for constructing a medical graph using data attributes
CN111427874A (en) | Quality control method and device for medical data production and electronic equipment
CN112925910B (en) | Auxiliary corpus labeling method, device, equipment and computer storage medium
Villavicencio et al. | Discovering multiword expressions
Derczynski et al. | Using signals to improve automatic classification of temporal relations

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
