A Semi-Automatic Learning Method for Extracting Form Features
Technical Field
The present invention relates to the fields of machine learning, data mining, and online experience, and specifically to a semi-automatic learning method for extracting form features.
Background
With the spread of Internet information technology, retrieving and exchanging information on the web through a browser has become one of the essential skills for raising productivity in modern society.
When accessing information on the web, a user often has to enter information into a website, for example to log in, post a comment, or take part in a vote. Some of this information has to be entered repeatedly and frequently. When logging in, for instance, different websites require different user names and passwords; when shopping online, each purchase requires the user to re-enter an address, postcode, recipient name, and similar details.
Because this information has to be entered frequently and in large quantity, and because it rarely changes (a user's own address does not change often when shopping online, and a name changes even less), almost every modern markup language processing unit shell, that is, the human-machine interface of a markup language processing unit such as a browser interface, provides automatic login and automatic form filling to relieve users of repetitive work and improve productivity.
If the markup language processing unit shell is to fill data automatically into a form in the markup language processing unit, it must know which form item corresponds to which entry, for example: the recipient's name corresponds to the 1st input box, the recipient's address to the 2nd, and the recipient's postcode to the 3rd. Under such rules, the structural features of the form must be known before data can be filled into the correct items.
HTML (HyperText Markup Language), the language standard proposed by the World Wide Web Consortium and referred to here as the "markup language", lets the Internet generate web page files in a unified, standardized language made up of tags; such a file is referred to here as a "markup file". HTML provides a series of standard basic components on top of tree-structured tags, so any markup language processing unit that implements the HTML standard remains interoperable.
When a website's markup language file is loaded in a markup language processing unit and data has to be submitted to the website (for example to chat, post a comment, buy or sell goods, or save personalized settings), the website must provide a way to collect data from the browser. For this purpose the HTML standard provides the "form" component. A form generally contains the following elements:
<form>: declares a form; the data inside it can be submitted to the server.
<input>: a child node of the <form> tag; declares a single-line text input box, which is rendered differently depending on its type attribute. For example, <input type=text> is an ordinary input box, and <input type=password> is a password box that hides what is typed.
Submit button: submitting the form is in fact one value of the type attribute of the <input> tag. When the type attribute of an <input> tag is set to submit, the markup language processing unit renders a button; when the button is activated, the user data entered in all valid <input> tags inside the <form> tag is submitted to the server.
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing unit sends a notification that the markup file has finished loading, and the page is assumed to contain the elements above, the markup file is analyzed through the interface provided by the markup language processing unit and the <form> and <input> features of the form are extracted. This method, however, falls short in the face of rapidly developing dynamic markup loading techniques, because dynamic loading causes the following problems:
After the markup language processing unit sends the page-loaded notification, the markup file does not yet contain the login box; the markup needed to present the form is still being loaded by JavaScript in the markup file. In other words, the markup set needed to present the form has not actually finished loading, so form feature extraction fails;
The submit button is not <input type=submit>; it may be any HTML tag with attached code that calls a JavaScript script, with the form being submitted by that script, so form feature extraction fails;
The <input> input boxes may not even be wrapped in a <form> tag. As a result, after the browser sends the page-loaded notification, the page does not satisfy the rules of static scanning, and the lookup fails.
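The limitation described above can be illustrated with a minimal sketch. It is not the analysis code of any actual browser: the node shape ({ tagName, attributes, children }) and the names makeTag, staticScan, classicPage, and dynamicPage are hypothetical stand-ins for the tag tree that a markup language processing unit would expose.

```javascript
// Hypothetical stand-in for a parsed tag: { tagName, attributes, children }.
function makeTag(tagName, attributes = {}, children = []) {
  return { tagName, attributes, children };
}

// Static scan: collect <input> tags, but only those nested inside a <form>.
function staticScan(root) {
  const found = [];
  function walk(tag, insideForm) {
    const inForm = insideForm || tag.tagName === "form";
    if (inForm && tag.tagName === "input") {
      found.push(tag.attributes);
    }
    for (const child of tag.children) walk(child, inForm);
  }
  walk(root, false);
  return found;
}

// A classic page: inputs wrapped in a <form>, submitted by a submit input.
const classicPage = makeTag("body", {}, [
  makeTag("form", { action: "/login" }, [
    makeTag("input", { type: "text", name: "user" }),
    makeTag("input", { type: "password", name: "pass" }),
    makeTag("input", { type: "submit" }),
  ]),
]);

// A script-driven page: no <form>, and the submit control is a plain <a>.
const dynamicPage = makeTag("body", {}, [
  makeTag("div", {}, [
    makeTag("input", { type: "text", name: "user" }),
    makeTag("input", { type: "password", name: "pass" }),
    makeTag("a", { onclick: "doLogin()" }),
  ]),
]);

console.log(staticScan(classicPage).length); // 3
console.log(staticScan(dynamicPage).length); // 0
```

The second page is perfectly usable by a person, yet a scan that relies on the <form> wrapper finds nothing in it.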
Summary of the Invention
The object of the present invention is to provide a semi-automatic learning method for extracting form features which, through manual participation, can extract web form structural features that are complete, authentic, and accurate.
The present invention is achieved through the following technical solution: a semi-automatic learning method for extracting form features, comprising the following steps:
(1) Start the learning device, which has a built-in markup language processing unit;
(2) Enter the location of the markup language file in the address bar;
(3) The learning device loads the markup language file through the built-in browser;
(4) When loading is complete, the built-in browser notifies the learning device that the markup language file has been loaded and generates the markup language set;
(5) The learning device inserts a learning module into the loaded markup language file;
(6) The form is operated; the learning device records the operations in full and generates the related feature information;
(7) On receiving the submit button click event, the learning module considers learning complete and stores the form structure information in the database;
(8) The form feature learning process is complete.
The above method builds a learning device with a built-in markup language processing unit and fixes the markup language in use; by default, the tag that the markup language processing unit uses to present an input box is the <input> tag.
In the semi-automatic learning process, the machine does not need to detect when the web page has finished loading; that judgment is made by a person.
Once the person sees that the markup language processing unit is displaying the form to be filled in, the markup language set that presents the form structure must already be fully present in the markup language processing unit.
The markup language processing unit is configured so that, whenever any <input> tag is activated, it notifies the learning device of the object of the activated <input> tag.
Through the activated <input> tag object, the learning device reads the attributes of that tag.
By traversing the markup language set in the markup language processing unit, the learning device calculates the absolute position of the currently activated <input> tag within the markup language set.
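One way to read "absolute position" is the index of the tag in a document-order (pre-order) traversal of the markup language set. The sketch below takes that reading as an assumption; the text does not define the exact numbering. The node shape and the names makeTag, absolutePosition, and readFeatures are hypothetical.

```javascript
// Hypothetical stand-in for a parsed tag: { tagName, attributes, children }.
function makeTag(tagName, attributes = {}, children = []) {
  return { tagName, attributes, children };
}

// Return the pre-order index of `target` under `root`, or -1 if it is absent.
function absolutePosition(root, target) {
  let index = 0;
  let result = -1;
  function walk(tag) {
    if (result !== -1) return; // already found, stop descending
    if (tag === target) {
      result = index;
      return;
    }
    index += 1;
    for (const child of tag.children) walk(child);
  }
  walk(root);
  return result;
}

// Read the attributes of the activated tag and attach its position.
function readFeatures(root, activated) {
  return {
    tagName: activated.tagName,
    attributes: { ...activated.attributes },
    position: absolutePosition(root, activated),
  };
}

const userInput = makeTag("input", { type: "text", name: "user" });
const passInput = makeTag("input", { type: "password", name: "pass" });
const page = makeTag("body", {}, [
  makeTag("div", {}, [userInput, passInput]),
  makeTag("a", { onclick: "doLogin()" }),
]);

// Document order here is body(0), div(1), user(2), pass(3), a(4).
console.log(readFeatures(page, passInput).position); // 3
```

Because the position is computed against whatever tags exist at activation time, it does not depend on the input box being wrapped in a <form> tag.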
The markup language processing unit is also configured so that, when a form submission event occurs, the form is not submitted to the server; instead, the learning device is notified of which object produced the submission event.
In the learning device, the input boxes that need to be filled in are activated one after another. During this process, each activated input box is recorded; input boxes that are never activated are ignored, as are repeated activations. When the submit button is clicked, a form submission event is produced; on receiving it, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
At this point, learning is complete.
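The recording rules above (record each activated input box once, ignore repeats, intercept submission, then store the result under the page URL) can be sketched as follows. This is an illustration under assumptions, not the patented implementation: createRecorder and its method names are hypothetical, a Map stands in for the form feature database, and the event object only mimics the target and preventDefault members of a DOM event.

```javascript
// Create a recorder for one page. `database` is a Map standing in for the
// form feature database (an assumption made for this sketch).
function createRecorder(pageUrl, database) {
  const recorded = [];
  const seen = new Set();
  return {
    // Called when an input box is activated; repeated activations are ignored.
    onInputActivated(inputTag, position) {
      if (seen.has(inputTag)) return;
      seen.add(inputTag);
      recorded.push({
        tagName: inputTag.tagName,
        attributes: { ...inputTag.attributes },
        position,
      });
    },
    // Called on a submission event: block the real submission, note which
    // object produced the event, and store everything under the page URL.
    onSubmit(event) {
      event.preventDefault();
      database.set(pageUrl, {
        inputs: recorded.slice(),
        submitTag: event.target.tagName,
      });
    },
  };
}

const featureDatabase = new Map();
const recorder = createRecorder("https://example.com/login", featureDatabase);

const user = { tagName: "input", attributes: { name: "user" } };
const pass = { tagName: "input", attributes: { name: "pass" } };

recorder.onInputActivated(user, 2);
recorder.onInputActivated(pass, 3);
recorder.onInputActivated(user, 2); // repeated activation, ignored

let sentToServer = true;
const submitEvent = {
  target: { tagName: "a", attributes: {} },
  preventDefault() { sentToServer = false; },
};
recorder.onSubmit(submitEvent);

console.log(featureDatabase.get("https://example.com/login").inputs.length); // 2
console.log(sentToServer); // false
```

An input box that the person never activates simply never reaches onInputActivated, so it is absent from the stored features.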
In this way the learning device can interact with the markup language processing unit, learn the features of a web form, and store them in the form feature database.
Whatever the engine, it delivers its value only once it is integrated into an application environment; an engine therefore exposes an operation interface through which third-party devices can operate it.
According to the JavaScript language standard, when a control in the markup language processing unit is clicked, an onClick event is produced.
When an onClick event is produced, a function can be called and the object that triggered the onClick is passed to it as a parameter, so that JavaScript can operate on the object associated with the event.
A JavaScript function is written that keeps traversing the tag objects in the current markup language processing unit and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled.
A second JavaScript function is written that is responsible for collecting the information sent by the onClick handler.
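The two functions described above might look roughly like the following sketch. It runs against plain objects with tagName and children members so that it can be tried outside a browser; the names CLICKABLE_TAGS, registerClickHandlers, and createCollector are hypothetical, and the repeated traversal is shown as two explicit passes rather than a timer.

```javascript
// Tags whose onClick events the learning module listens to.
const CLICKABLE_TAGS = new Set(["input", "button", "a", "img"]);

// Walk the tag tree and attach `handler` to every clickable tag that does not
// have it yet. Returns how many tags were newly registered in this pass.
function registerClickHandlers(root, handler) {
  let newlyRegistered = 0;
  function walk(tag) {
    const name = String(tag.tagName).toLowerCase();
    if (CLICKABLE_TAGS.has(name) && tag.onclick !== handler) {
      tag.onclick = handler;
      newlyRegistered += 1;
    }
    for (const child of tag.children) walk(child);
  }
  walk(root);
  return newlyRegistered;
}

// Collector: its handler gathers information about each clicked object.
function createCollector() {
  const collected = [];
  function handler(event) {
    collected.push({ tagName: String(event.target.tagName).toLowerCase() });
  }
  return { handler, collected };
}

const body = {
  tagName: "body",
  children: [
    { tagName: "input", children: [] },
    { tagName: "a", children: [] },
  ],
};
const collector = createCollector();

const firstPass = registerClickHandlers(body, collector.handler); // 2

// Simulate a control that a script adds after the page-loaded notification.
const lateButton = { tagName: "button", children: [] };
body.children.push(lateButton);
const secondPass = registerClickHandlers(body, collector.handler); // 1

lateButton.onclick({ target: lateButton }); // simulated click
console.log(firstPass, secondPass, collector.collected.length); // 2 1 1
```

In a real page the same traversal could be repeated on a timer, for example with setInterval and document.body as the root, which is what lets later, dynamically loaded controls be picked up. Assigning onclick replaces any handler the page set on that property, so a real learning module would have to decide whether that is acceptable.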
When the markup language processing unit confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing unit itself to register a private JavaScript interface, provided by the learning device, with the markup language processing unit. This private interface lets the JavaScript engine in the markup language processing unit communicate with the learning device; through it, the JavaScript engine running the current markup file can send the collected tag information to the learning device.
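The private interface can be pictured as an object that the learning device places into the page's script environment, and that page script then calls. The sketch below only models that hand-off in plain JavaScript: the names createLearningDeviceBridge, installBridge, sendTagInfo, and the learningDevice property are hypothetical, and pageGlobal is a plain object standing in for the page's global scope. How a real engine exposes host objects to script is engine-specific and is not shown.

```javascript
// Learning-device side: an object whose method page script is allowed to call.
function createLearningDeviceBridge() {
  const received = [];
  return {
    // The payload is a JSON string, so only plain data crosses the boundary.
    receive(payload) {
      received.push(JSON.parse(payload));
    },
    received,
  };
}

// Registration step, performed once the page-loaded notification arrives.
function installBridge(pageGlobal, bridge) {
  pageGlobal.learningDevice = bridge;
}

// Page-script side: send collected tag information through the bridge.
function sendTagInfo(pageGlobal, info) {
  pageGlobal.learningDevice.receive(JSON.stringify(info));
}

const pageGlobal = {}; // stand-in for the page's global object
const bridge = createLearningDeviceBridge();
installBridge(pageGlobal, bridge);

sendTagInfo(pageGlobal, { tagName: "input", attributes: { name: "user" }, position: 2 });
console.log(bridge.received.length); // 1
```

Passing a serialized string rather than a live tag object keeps the learning device independent of the engine's internal object model.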
Further, the learning device has a built-in physical markup language processing unit.
Further, the learning device has a built-in non-physical (virtual) markup language processing unit.
Further, the markup language processing unit is provided with an operation interface.
Further, the default markup language of the markup language processing unit is HTML.
Further, the markup language processing unit is the Trident engine, and the operation interface is the WebControl interface.
Many mature markup language processing units exist today, including Microsoft's Trident engine, Google's Blink engine, the Mozilla Foundation's Gecko engine, Apple's WebKit engine, and the proprietary physical or virtual engines of other companies in the industry. Each provides its own interface, under a variety of names. The preferred markup language processing unit here is the Trident engine, whose interface is the WebControl interface.
Further, the built-in browser is the IE browser.
Further, the markup language set is JavaScript script content.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) Through manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate;
(2) The submit button used in the method is <input type=submit> and form submission is handled by the learning device, so form feature extraction is unlikely to fail;
(3) The method has the <input> input boxes wrapped in a <form> tag, so that after the browser sends the page-loaded notification the page satisfies the rules of static scanning and the lookup proceeds smoothly.
Brief Description of the Drawings
Fig. 1 shows the workflow of the markup language processing unit;
Fig. 2 shows the workflow of the learning device with the markup language learning device;
Fig. 3 shows the flow of the markup language set traversal function;
Fig. 4 shows the flow of the "click" event handling function.
Detailed Description
The present invention is described in further detail below with reference to an embodiment, but implementation of the invention is not limited to it.
Embodiment:
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing unit sends a notification that the markup file has finished loading, and the page is assumed to contain the elements above, the markup file is analyzed through the interface provided by the markup language processing unit and the <form> and <input> features of the form are extracted. This method, however, falls short in the face of rapidly developing dynamic markup loading techniques.
This embodiment discloses a semi-automatic learning method for extracting form features. Through manual participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate. The specific implementation steps are:
(1) Start the learning device; a human-machine interface similar to the IE browser appears;
(2) Enter the markup language file in the address bar to locate it;
(3) The device loads the markup language file through the built-in IE browser;
(4) When loading is complete, the built-in IE browser notifies the learning device that the markup language file has been loaded and that the markup language set has been generated;
(5) The learning device inserts a learning module into the loaded markup language file;
(6) Operate the form, for example fill in content, choose an option, or click the submit button; the learning device records these operations in full and generates the related feature information, such as tag name, attributes, and absolute position;
(7) On receiving the submit button click event, the learning module considers learning complete and stores the feature information of the form structure in the database. The form feature learning process is complete.
Fig. 2 shows the workflow of the learning device with the markup language learning device. The default markup language is HTML, and by default the tag that the markup language processing unit uses to present an input box is the <input> tag.
In the semi-automatic learning process, the machine does not need to detect when the web page has finished loading; that judgment is made by a person.
Once the person sees that the markup language processing unit is displaying the form to be filled in, the markup language set that presents the form structure is already fully present in the markup language processing unit.
The markup language processing unit is configured so that, whenever any <input> tag is activated, it notifies the learning device of the object of the activated <input> tag. When parsing the markup language, the markup language processing unit generates a unique corresponding entry for each tag. Through the activated <input> tag object, the learning device reads the attributes of that tag; by traversing the markup language set in the markup language processing unit, it calculates the absolute position of the currently activated <input> tag within the set. The markup language processing unit is also configured so that, when a form submission event occurs, the form is not submitted to the server; instead, the learning device is notified of which object produced the submission event. In the learning device, the input boxes that need to be filled in are activated one after another; each activated input box is recorded, while input boxes that are never activated, and repeated activations, are ignored. When the submit button is then clicked in the learning device, a form submission event is produced; on receiving it, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
The markup language processing unit chosen is Microsoft's Trident engine, whose corresponding interface is the WebControl interface.
According to the JavaScript language standard, when a control in the markup language processing unit is clicked, an onClick event is produced. When that happens, a function can be called and the object that triggered the onClick is passed to it as a parameter, so that JavaScript can operate on the object associated with the event.
A JavaScript function is written that keeps traversing the tag objects in the current markup language processing unit, as shown in Fig. 3, and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled. A second function is responsible for collecting the information sent by the onClick handler, as shown in Fig. 4.
When the markup language processing unit confirms that the markup language file has finished loading, the WebControl interface provided by the Trident engine raises a DocumentCompleted event. The learning device then uses the interface provided by the markup language processing unit itself to place the function that reads control information into the markup set of the current markup file.
At the same point, the learning device uses the interface provided by the markup language processing unit itself to register a private JavaScript interface, provided by the learning device, with the markup language processing unit. This private interface lets the JavaScript engine in the markup language processing unit communicate with the learning device; through it, the JavaScript engine running the current markup file can send the collected tag information to the learning device.
After receiving the final tag information, the learning device writes it to the database.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form. Any simple modification or equivalent variation made to the above embodiment in accordance with the technical essence of the present invention falls within the scope of protection of the present invention.