A semi-automatic learning method for extracting form features
Technical field
The present invention relates to the fields of machine learning, data mining and online experience, and specifically to a semi-automatic learning method for extracting form features.
Background art
With the spread of Internet information technology, accessing web pages through a browser to retrieve and exchange information has become one of the skills needed to improve productivity in modern society.
When visiting websites to retrieve information, a user may frequently need to enter information, for example to log in, post comments or vote. Some of this information must be entered repeatedly: logging in to different websites requires different user names or passwords, and buying different goods online requires entering one's own address, postal code and consignee name again and again.
Because this information may have to be entered often and in large amounts, and because it is largely invariant (in online shopping, for example, one's address does not usually change often, and one's name even less so), nearly every modern markup language processing device shell, that is, the man-machine interface of a markup language processing device, such as a browser interface, provides automatic login and automatic form filling, which relieves users of repetitive work and improves productivity.
For the shell of a markup language processing device to fill data automatically into a form in the device, it must know which form item corresponds to which entry, for example: the addressee's name corresponds to the 1st input box, the addressee's address to the 2nd input box, and the addressee's postal code to the 3rd input box. Under such a rule, the structural features of the form must be known before the correct data can be filled into the corresponding items.
The HTML proposed by the World Wide Web Consortium, the HyperText Markup Language (hereinafter "markup language"), is a language standard that allows Internet web page files (hereinafter "markup files") to be produced in a unified, standardized language. HTML is based on tags arranged in a tree structure and provides a series of standard basic components; as long as a markup language processing device implements the HTML standard, it remains general-purpose.
When a markup language processing device loads a website's markup language file and data must be submitted to the website, for example to chat, post comments, buy and sell goods or save personalized settings, the website must provide a way to collect data from the browser. The HTML standard provides the "form" component for this purpose. A form usually contains the following elements. <form>: declares a form; the data inside it can be submitted to the server. <input>: a child node of the <form> tag that declares a single-line text input box and, depending on its type attribute, can be presented in different styles, for example <input type=text>, an ordinary input box, or <input type=password>, a password input box that hides what is typed. Submit button: the submit button is in fact one value of the type attribute of the <input> tag; when the type attribute of an <input> tag is set to submit, the markup language processing device presents a button, and when that button is activated, the data the user entered in all valid <input> elements inside the <form> tag are submitted to the server.
In the existing feature analysis method, shown in Figure 1, when the markup language processing device sends the notification that the markup file has finished loading, the page is assumed to contain the elements above, and the markup file is then analyzed through the interface provided by the markup language processing device to extract the <form> and <input> features of the form. This method, however, falls short in the face of rapidly developing dynamic markup loading techniques, which cause the following problems:
After the markup language processing device sends the page-loaded notification, the markup file does not yet contain the login box, because the markup needed to present the form is still being loaded by JavaScript in the markup file. In other words, the markup language set needed to present the form has not actually been loaded at that point, so form feature extraction fails;
The submit button is not <input type=submit> but may be any HTML tag with an attached call to JavaScript code, and the form is submitted by JavaScript, so form feature extraction fails;
The <input> input boxes may not even be wrapped in a <form> tag. As a result, the page does not satisfy the rules of static scanning after the browser sends the page-loaded notification, and the query fails.
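The limits of static scanning described above can be illustrated with a minimal sketch. It assumes a simplified in-memory tree of plain objects (tag, attrs, children) standing in for the parsed markup file; node, staticScan and the two sample pages are hypothetical names chosen for illustration and are not part of any browser interface.

```javascript
// Hypothetical simplified node: { tag, attrs, children }.
function node(tag, attrs = {}, children = []) {
  return { tag, attrs, children };
}

// Collect every <input> that is a descendant of a <form>, the way a static
// scan run right after the "load complete" notification would.
function staticScan(root) {
  const found = [];
  function walk(n, insideForm) {
    const inForm = insideForm || n.tag === "form";
    if (n.tag === "input" && inForm) found.push(n);
    for (const child of n.children) walk(child, inForm);
  }
  walk(root, false);
  return found;
}

// A classic static page: inputs wrapped by <form>, submit via <input type=submit>.
const staticPage = node("body", {}, [
  node("form", {}, [
    node("input", { type: "text", name: "user" }),
    node("input", { type: "password", name: "pass" }),
    node("input", { type: "submit" }),
  ]),
]);

// A dynamic page: no <form> wrapper, and the "submit button" is an <a> tag
// whose click is handled by script.
const dynamicPage = node("body", {}, [
  node("input", { type: "text", name: "user" }),
  node("input", { type: "password", name: "pass" }),
  node("a", { onclick: "login()" }),
]);

console.log(staticScan(staticPage).length);  // 3
console.log(staticScan(dynamicPage).length); // 0
```

On the second page the scan finds nothing even though a user would see a working login form, which is the failure mode the static approach runs into.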
Summary of the invention
The object of the invention is to provide, by means of human participation, a semi-automatic learning method for extracting form features that can extract web form structural features with integrity, authenticity and accuracy.
The present invention is achieved through the following technical solution: a semi-automatic learning method for extracting form features, comprising the following steps:
(1) start the learning device, which has a built-in markup language processing device;
(2) enter the location of the markup language file in the address bar;
(3) the learning device loads the markup language file through the built-in browser;
(4) after loading completes, the built-in browser notifies the learning device that the markup language file has been loaded, and the markup language set is generated;
(5) the learning device inserts a learning module into the loaded markup language file;
(6) the form is operated; the learning device records the operations in full and generates the related feature information;
(7) on receiving the click event of the submit button, the learning module considers learning finished and stores the form structure information in the database;
(8) the whole form feature learning process is complete.
In the above method, a learning device with a built-in markup language processing device is built and the markup language is determined; the tag that the markup language processing device uses to present an input box defaults to the <input> tag.
In the semi-automatic learning process, the machine does not need to recognize when the web page has finished loading; a person judges this instead.
When the person sees that the markup language processing device displays the form to be filled in, the markup language set that presents the form structure is necessarily already fully present in the markup language processing device.
The markup language processing device is instructed that, whenever any <input> tag is activated, it shall notify the learning device of the activated <input> tag object.
Through the activated <input> tag object, the learning device reads the attributes of the tag.
By traversing the markup language set in the markup language processing device, the learning device calculates the absolute position of the currently activated <input> tag within the markup language set.
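The text does not define "absolute position" precisely. One plausible reading, sketched below as an assumption, is the index of the activated tag in a depth-first, document-order traversal of the markup language set. The function name absolutePosition and the plain-object nodes ({ tag, children }) are hypothetical stand-ins for the real tag objects.

```javascript
// Returns the zero-based index of `target` in a depth-first, document-order
// traversal of the tree rooted at `root`, or -1 if it is not in the tree.
function absolutePosition(root, target) {
  let index = 0;
  const stack = [root];
  while (stack.length > 0) {
    const current = stack.pop();
    if (current === target) return index;
    index += 1;
    // Push children in reverse so the first child is visited first.
    for (let i = current.children.length - 1; i >= 0; i -= 1) {
      stack.push(current.children[i]);
    }
  }
  return -1;
}

const user = { tag: "input", children: [] };
const pass = { tag: "input", children: [] };
const page = {
  tag: "body",
  children: [
    { tag: "div", children: [user] },
    { tag: "div", children: [pass] },
  ],
};

console.log(absolutePosition(page, user)); // 2
console.log(absolutePosition(page, pass)); // 4
```

A position computed this way does not depend on the tag being wrapped in a <form>, so it can identify an input box on pages where the static rules do not hold.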
The markup language processing device is instructed that, when a form-submission event is produced, it shall not submit to the server but shall instead notify the learning device which object produced the form-submission event.
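In script terms, suppressing the submission and reporting its source might look like the following sketch. It relies on the standard preventDefault method and target property of DOM events; makeSubmitInterceptor and the notification callback are hypothetical names, and the event here is a stand-in object so that the example runs outside a browser.

```javascript
// Builds a handler that cancels the default submission and instead reports
// the object that produced the event to the learning device.
function makeSubmitInterceptor(notifyLearningDevice) {
  return function onSubmit(event) {
    event.preventDefault();             // do not send anything to the server
    notifyLearningDevice(event.target); // report the originating object
    return false;
  };
}

// Usage with a stand-in event object (in a browser this would be a real Event).
const reported = [];
const fakeButton = { tagName: "A", id: "login-link" };
const fakeEvent = {
  target: fakeButton,
  defaultPrevented: false,
  preventDefault() { this.defaultPrevented = true; },
};

makeSubmitInterceptor((source) => reported.push(source))(fakeEvent);
console.log(fakeEvent.defaultPrevented, reported[0].id); // true login-link
```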
In the learning device, the input boxes to be filled in are activated one after another. During this process, each activated input box is recorded; input boxes that are not activated, and repeated activations, are ignored. Clicking the submit button produces a form-submission event; on receiving it, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
At this point, learning is complete.
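The recording rule just described (record each activated input box once, ignore the rest, store everything under the page URL on submission) can be sketched as follows. FormFeatureStore is a hypothetical in-memory stand-in for the form feature database, and LearningSession, the example URL and the field objects are likewise assumptions made for illustration.

```javascript
// Hypothetical in-memory stand-in for the form feature database.
class FormFeatureStore {
  constructor() { this.records = new Map(); }
  save(url, fields) { this.records.set(url, fields); }
}

class LearningSession {
  constructor(url, store) {
    this.url = url;
    this.store = store;
    this.fields = [];
    this.seen = new Set();
  }
  // Called when an input box is activated; repeated activations are ignored.
  activate(input) {
    if (this.seen.has(input)) return;
    this.seen.add(input);
    this.fields.push({ tag: input.tag, attrs: { ...input.attrs } });
  }
  // Called on the form-submission event; stores the recorded inputs by URL.
  submit() {
    this.store.save(this.url, this.fields);
  }
}

const store = new FormFeatureStore();
const session = new LearningSession("https://example.com/login", store);
const userBox = { tag: "input", attrs: { type: "text", name: "user" } };
const passBox = { tag: "input", attrs: { type: "password", name: "pass" } };
const newsBox = { tag: "input", attrs: { type: "checkbox", name: "news" } };

session.activate(userBox);
session.activate(passBox);
session.activate(userBox); // repeated activation, ignored
session.submit();          // newsBox was never activated, so it is not stored

console.log(store.records.get("https://example.com/login").length); // 2
```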
Through these parts, the learning device can interact with the markup language processing device, learn web form features and store them in the form feature database.
Whatever the engine, it must ultimately be integrated into a service environment to be of value; an engine therefore exposes an operation interface through which third-party devices can operate it.
According to the JavaScript language standard, a click made with a controller in the markup language processing device produces an onClick event.
According to the JavaScript language standard, when an onClick event is produced, a function can be called and the object that triggered onClick is passed to it as a parameter, so that JavaScript code can operate on the object according to the event.
A JavaScript function is written that continually traverses the tag objects in the current markup language processing device and registers its own onClick handler for the onClick event of input, button, a and img tags, so that HTML controls loaded dynamically later are also handled.
A second JavaScript function is written that collects the information sent by the onClick handler.
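The two functions described above might be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the nodes are plain objects with tagName, children and onclick fields standing in for DOM elements, and registerClickHandlers, collect and onClick are hypothetical names. The traversal is written so that it can be called repeatedly; in a browser it could, for instance, be driven by a timer.

```javascript
const WATCHED_TAGS = new Set(["INPUT", "BUTTON", "A", "IMG"]);

// Collects the information sent by the click handler.
const collected = [];
function collect(info) {
  collected.push(info);
}

// Walks the tree and attaches the click handler to every watched tag that
// does not have it yet. Safe to call repeatedly, so controls added later by
// script are picked up on the next pass. Returns how many tags were newly hooked.
function registerClickHandlers(root, handler) {
  let added = 0;
  (function walk(n) {
    if (WATCHED_TAGS.has(n.tagName) && n.onclick !== handler) {
      n.onclick = handler;
      added += 1;
    }
    for (const child of n.children) walk(child);
  })(root);
  return added;
}

// The click handler receives the event and reports the triggering object.
function onClick(event) {
  const el = event.target;
  collect({ tag: el.tagName, name: el.name || null });
}

// Stand-in nodes; in a browser these would be DOM elements.
const body = {
  tagName: "BODY",
  children: [{ tagName: "INPUT", name: "user", children: [] }],
};
console.log(registerClickHandlers(body, onClick)); // 1

// A control is loaded dynamically, then the traversal runs again.
const link = { tagName: "A", name: "", children: [] };
body.children.push(link);
console.log(registerClickHandlers(body, onClick)); // 1

link.onclick({ target: link });
console.log(collected[0].tag); // A
```

Assigning onclick directly replaces any handler the page itself had set on that tag; on a real page, addEventListener would leave the page's own handlers in place.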
When the markup language processing device confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing device to register with it a private JavaScript interface supplied by the learning device. This private interface allows the JavaScript engine in the markup language processing device to communicate with the learning device: through it, the JavaScript engine running the current markup file can send the collected tag information to the learning device.
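The shape of such a private interface is not specified in the text. As a rough sketch, assuming the learning device exposes a single entry point that accepts a JSON string, the two sides could look like this; createPrivateInterface, report and sendTagInfo are hypothetical names, and the way the object is actually injected into the page depends on the engine.

```javascript
// Learning-device side: exposes one private entry point to page script.
function createPrivateInterface(onTagInfo) {
  return {
    // Receives a JSON string so that only plain data crosses the boundary.
    report(json) {
      onTagInfo(JSON.parse(json));
    },
  };
}

// Page side: script running in the markup file sends the collected tag info.
function sendTagInfo(privateInterface, info) {
  privateInterface.report(JSON.stringify(info));
}

const received = [];
const bridge = createPrivateInterface((info) => received.push(info));
sendTagInfo(bridge, { tag: "INPUT", attrs: { type: "password" }, position: 4 });
console.log(received[0].position); // 4
```

Passing serialized data rather than live tag objects keeps the boundary between the script engine and the learning device simple, at the cost of having to copy every attribute the learning device needs.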
Further, the learning device has a built-in physical markup language processing device.
Further, the learning device has a built-in non-physical markup language processing device.
Further, the markup language processing device is provided with an operation interface.
Further, the default markup language of the markup language processing device is HTML.
Further, the markup language processing device is the Trident engine and the operation interface is the WebControl interface. Many mature markup language processing devices exist at present, including Microsoft's Trident engine, Google's Blink engine, the Mozilla Foundation's Gecko engine, Apple's WebKit engine and the proprietary physical or virtual engines of other companies in the industry. Different markup language processing devices provide their own interfaces under a variety of names; here the preferred markup language processing device is the Trident engine, whose interface is the corresponding WebControl interface.
Further, the built-in browser is the IE browser.
Further, the markup language set is JavaScript script content.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) through human participation, the method of the invention learns markup language form structure semi-automatically and can extract web form structural features with integrity, authenticity and accuracy;
(2) with the method of the invention, the submit button used is <input type=submit> and form submission is completed by the learning device, so form feature extraction is unlikely to fail;
(3) with the method of the invention, the <input> input boxes are also wrapped in a <form> tag, so the rules of static scanning can be satisfied after the browser sends the page-loaded notification and the query can proceed smoothly.
Brief description of the drawings
Fig. 1 is the workflow of the markup language processing device;
Fig. 2 is the workflow of the learning device with the markup language learning device;
Fig. 3 is the flow of the markup language set traversal function;
Fig. 4 is the flow of the "click" event handling function.
Detailed description of the embodiments
The present invention is described in further detail below with reference to an embodiment, but embodiments of the present invention are not limited to it.
Embodiment:
In the existing feature analysis method, shown in Figure 1, when the markup language processing device sends the notification that the markup file has finished loading, the page is assumed to contain the elements above, and the markup file is then analyzed through the interface provided by the markup language processing device to extract the <form> and <input> features of the form. This method, however, falls short in the face of rapidly developing dynamic markup loading techniques.
This embodiment discloses a semi-automatic learning method for extracting form features. Through human participation, the method learns markup language form structure semi-automatically and can extract web form structural features with integrity, authenticity and accuracy. The concrete implementation steps are:
(1) start the learning device; a human-computer interaction interface similar to the IE browser appears;
(2) enter the markup language file in the address bar to locate the markup language file;
(3) the device loads the markup language file through the built-in IE browser;
(4) after loading completes, the built-in IE browser notifies the learning device that the markup language file has been loaded, and the markup language set has been generated;
(5) the learning device inserts a learning module into the loaded markup language file;
(6) the form is operated, for example by filling in content, choosing an option or clicking the submit button; the learning device records these operations in full and generates related feature information such as tag names, attributes and absolute positions;
(7) on receiving the click event of the submit button, the learning module considers learning finished and stores the feature information of the form structure in the database. The whole form feature learning process is then complete.
The workflow of the learning device with the markup language learning device is shown in Figure 2. The default markup language is HTML, and the tag that the markup language processing device uses to present an input box defaults to the <input> tag.
In the semi-automatic learning process, the machine does not need to recognize when the web page has finished loading; a person judges this instead.
When the person sees that the markup language processing device displays the form to be filled in, the markup language set that presents the form structure is already fully present in the markup language processing device.
The markup language processing device is instructed that, whenever any <input> tag is activated, it shall notify the learning device of the activated <input> tag object; when parsing the markup language, the markup language processing device generates a unique corresponding entry for each tag. Through the activated <input> tag object, the learning device reads the attributes of the tag. By traversing the markup language set in the markup language processing device, the learning device calculates the absolute position of the currently activated <input> tag within the markup language set. The markup language processing device is further instructed that, when a form-submission event is produced, it shall not submit to the server but shall instead notify the learning device which object produced the form-submission event. In the learning device, the input boxes to be filled in are activated one after another; during this process, each activated input box is recorded, while input boxes that are not activated, and repeated activations, are ignored. The submit button is then clicked in the learning device, producing a form-submission event; on receiving it, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
The markup language processing device chosen is Microsoft's Trident engine, and its corresponding interface is the WebControl interface.
According to the JavaScript language standard, a click made with a controller in the markup language processing device produces an onClick event. When the onClick event is produced, a function can be called and the object that triggered onClick is passed to it as a parameter, so that JavaScript code can operate on the object according to the event.
A JavaScript function is written that continually traverses the tag objects in the current markup language processing device, as shown in Fig. 3, and registers its own onClick handler for the onClick event of input, button, a and img tags, so that HTML controls loaded dynamically later are also handled. The function responsible for collecting the information sent by the onClick handler is shown in Fig. 4.
When the markup language processing device confirms that the markup language file has finished loading, the WebControl interface provided by the Trident engine raises the DocumentCompleted event. The learning device then uses the interface provided by the markup language processing device to put the function that reads control information into the tag set of the current markup file.
When the markup language processing device confirms that the markup language file has finished loading, the learning device also uses the interface provided by the markup language processing device to register with it a private JavaScript interface supplied by the learning device. This private interface allows the JavaScript engine in the markup language processing device to communicate with the learning device: through it, the JavaScript engine running the current markup file can send the collected tag information to the learning device.
On receiving the final tag information, the learning device writes it to the database.
The above is only a preferred embodiment of the present invention and does not limit the present invention in any form; any simple modification or equivalent variation made to the above embodiment in accordance with the technical essence of the present invention falls within the protection scope of the present invention.