CN104063488A - Semi-automatic learning type form feature extraction method - Google Patents

Semi-automatic learning type form feature extraction method

Info

Publication number
CN104063488A
Authority
CN
China
Prior art keywords
markup language
learning device
semi
treating apparatus
form feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410317562.9A
Other languages
Chinese (zh)
Other versions
CN104063488B (en)
Inventor
陈超一
范渊
吴永越
郑学新
姜毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu DBAPPSecurity Co Ltd
Original Assignee
Chengdu DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-07-07
Filing date
2014-07-07
Publication date
2014-09-24
Application filed by Chengdu DBAPPSecurity Co Ltd
Priority to CN201410317562.9A
Publication of CN104063488A
Application granted
Publication of CN104063488B
Legal status: Active

Abstract

The invention discloses a semi-automatic learning type form feature extraction method. The method comprises the following steps: (1) a learning device is started; (2) the location of a markup language file is entered; (3) the learning device loads the markup language file; (4) a markup language set is generated; (5) a learning module is inserted into the markup language file; (6) the form is operated, the operations are recorded, and feature information is generated; (7) the form structure information is stored in a database; (8) learning of the form features is complete. With human participation, the method learns the markup language form structure semi-automatically and can extract web form structure features that are complete, real, and accurate. Because the form is submitted through the learning device, form feature extraction is not prone to failure; and the <input> box is wrapped in a <form> tag, so that after the browser issues the page-load-complete notification the static scanning rule can be satisfied and the query can proceed smoothly.

Description

Semi-automatic learning type form feature extraction method
Technical field
The present invention relates to the fields of machine learning, data mining, and online experience, and specifically to a semi-automatic learning type form feature extraction method.
Background
With the spread of Internet information technology, retrieving and exchanging information on the web through a browser has become one of the essential skills for raising productivity in modern society.
When retrieving information from websites, users often need to enter information, for example to log in, post comments, or vote. Some of this information must be entered repeatedly and frequently: logging in requires a user name and password that may differ from site to site, and buying different goods online requires entering one's own address, postcode, and consignee name again and again.
Because this information is entered frequently and in large amounts, and because it seldom changes (for online shopping, one's address usually does not change often, and one's name even less so), nearly every modern markup language processing device shell, that is, the human-machine interface of a markup language processing device such as a browser interface, provides automatic login and automatic form filling, which relieves users of repetitive work and improves productivity.
For the shell to fill data automatically into a form in the markup language processing device, it must know which form item corresponds to each entry, for example: the recipient's name corresponds to the 1st input box, the recipient's address to the 2nd input box, and the recipient's postcode to the 3rd input box. Under such a rule, the structural features of the form must be known before the correct data can be filled into the corresponding items.
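As a minimal sketch of this rule, the JavaScript below fills entries into input boxes by their ordinal position. The helper names `collectInputs` and `autofill` and the `fieldOrder` format are assumptions made for illustration and do not come from the patent. A node is any object with `tagName` and `children`, which covers DOM elements as well as plain test objects.

```javascript
// Hypothetical helper: collect input boxes in document order.
// Works on DOM elements or on plain objects with `tagName` and `children`.
function collectInputs(node, found = []) {
  if (String(node.tagName).toLowerCase() === "input") {
    found.push(node);
  }
  for (const child of Array.from(node.children || [])) {
    collectInputs(child, found);
  }
  return found;
}

// Fill each entry into the input box that the known form structure assigns
// to it. `fieldOrder` maps an entry name to a 0-based input-box index
// (an assumed format for this sketch).
function autofill(root, fieldOrder, entries) {
  const inputs = collectInputs(root);
  for (const [name, index] of Object.entries(fieldOrder)) {
    if (inputs[index] !== undefined && name in entries) {
      inputs[index].value = entries[name];
    }
  }
  return inputs;
}
```

With `fieldOrder` set to `{ name: 0, address: 1, postcode: 2 }`, the three entries land in the 1st, 2nd, and 3rd input boxes. If the structure is unknown or wrong, the data ends up in the wrong boxes, which is why the form features have to be extracted first.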
HTML, the HyperText Markup Language proposed by the World Wide Web Consortium, is referred to here as the "markup language". The language standard makes it possible to produce Internet web page files in a unified, standardized language; such a file is referred to here as a "markup file". HTML is based on tags arranged in a tree structure and provides a series of standard basic components, so a markup language processing device remains general-purpose as long as it implements the HTML standard.
When a markup language processing device loads a website's markup language file and data must be submitted to the website (for example to chat, post comments, buy or sell goods, or save personalized settings), the website must provide a way to collect data from the browser. The HTML standard provides the form component for this purpose. A form usually contains the following elements. <form>: declares a form; the data inside it can be submitted to the server. <input>: a child node of the <form> tag that declares a single-line text input box; depending on its type attribute it is rendered differently, for example <input type=text> is an ordinary input box and <input type=password> is a password input box that hides what is typed. Submit button: the submit button is actually a value of the type attribute of the <input> tag; when the type attribute of an <input> tag is set to submit, the markup language processing device renders a button, and when the button is activated, the user-entered data of all valid <input> elements inside the <form> tag is submitted to the server.
In the existing feature analysis method, shown in Fig. 1, when the markup language processing device issues the notification that the markup file has finished loading, the page is assumed to contain the elements above; the markup file is then analyzed through the interface provided by the markup language processing device, and the <form> and <input> features of the form are extracted. This method, however, falls short in the face of rapidly developing dynamic markup loading techniques, which cause the following problems:
After the markup language processing device issues the page-load-complete notification, the markup file does not yet contain the login box, because the markup needed to render the form is still being loaded by JavaScript in the markup file. In other words, the markup language set needed to render the form has not actually been loaded, so form feature extraction fails;
The submit button is not <input type=submit> but may be any HTML tag to which a call to JavaScript code has been attached, and the form is submitted by the JavaScript, so form feature extraction fails;
The <input> boxes may not even be wrapped in a <form> tag, so that after the browser issues the page-load-complete notification the rule of the static scan cannot be satisfied and the query fails.
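The failure modes listed above can be illustrated with a small sketch of a static scan. The function `staticScan` is a hypothetical stand-in for the existing method, not code from the patent: it looks once for <input> elements nested inside a <form>, which is all a scan at load-complete time can do.

```javascript
// Hypothetical static scan: collect <input> nodes that sit inside a <form>.
// Nodes are DOM elements or plain objects with `tagName` and `children`.
function staticScan(node, insideForm = false, found = []) {
  const tag = String(node.tagName).toLowerCase();
  const nowInside = insideForm || tag === "form";
  if (tag === "input" && nowInside) {
    found.push(node);
  }
  for (const child of Array.from(node.children || [])) {
    staticScan(child, nowInside, found);
  }
  return found;
}

// A classic page: both input boxes are wrapped in a <form>.
const classicPage = {
  tagName: "BODY",
  children: [
    {
      tagName: "FORM",
      children: [
        { tagName: "INPUT", children: [] },
        { tagName: "INPUT", children: [] },
      ],
    },
  ],
};

// A script-driven page: the input box is not wrapped in a <form>, and a
// <div> with a script handler plays the role of the submit button.
const scriptedPage = {
  tagName: "BODY",
  children: [
    { tagName: "INPUT", children: [] },
    { tagName: "DIV", children: [] },
  ],
};

console.log(staticScan(classicPage).length);  // 2
console.log(staticScan(scriptedPage).length); // 0
```

The scan finds nothing on the second page even though a user would see a working form, and it would find nothing on any page whose form markup is inserted by script after the scan has run.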
Summary of the invention
The object of the present invention is to provide, with human participation, a semi-automatic learning type form feature extraction method that can extract web form structure features that are complete, real, and accurate.
The present invention is achieved through the following technical solution: a semi-automatic learning type form feature extraction method, comprising the following steps:
(1) start the learning device, which has a built-in markup language processing device;
(2) enter the location of the markup language file in the address bar;
(3) the learning device loads the markup language file through the built-in browser;
(4) after loading completes, the built-in browser notifies the learning device that the markup language file has been loaded, and the markup language set is generated;
(5) the learning device inserts a learning module into the loaded markup language file;
(6) operate the form; the learning device records the operations in full and generates the related feature information;
(7) after the click event of the submit button is received, the learning module considers learning complete and stores the form structure information in the database;
(8) the whole form feature learning process is complete.
The method builds a learning device with a built-in markup language processing device, fixes the markup language, and takes the <input> tag as the default tag with which the markup language processing device renders an input box.
In the semi-automatic learning process, the machine does not need to detect when the web page has finished loading; a person makes that judgment.
Once the user sees that the markup language processing device displays the form to be filled in, the markup language set that renders the form structure must already be completely present in the markup language processing device.
The markup language processing device is instructed that, whenever any <input> tag is activated, it must notify the learning device of the object of the activated <input> tag.
Through the activated <input> tag object, the learning device reads the attributes of the tag.
By traversing the markup language set in the markup language processing device, the learning device computes the absolute position of the currently activated <input> tag within the markup language set.
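One way to read these two steps is sketched below. The patent does not define how the absolute position is computed; this sketch assumes it is the index of the tag in a depth-first, document-order walk of the whole tree. The names `absolutePosition` and `describeInput`, and the `attrs` property used to hold attributes on test objects, are hypothetical.

```javascript
// Return the 0-based position of `target` in a depth-first (document-order)
// walk of the tree under `root`, or -1 if `target` is not in the tree.
function absolutePosition(root, target) {
  let index = -1;
  let result = -1;
  (function walk(node) {
    if (result !== -1) return; // already found, stop counting
    index += 1;
    if (node === target) {
      result = index;
      return;
    }
    for (const child of Array.from(node.children || [])) {
      walk(child);
    }
  })(root);
  return result;
}

// Build the feature record for one activated input box.
// Assumption: attributes are kept in a plain `attrs` object; on a real DOM
// element one would read `element.attributes` instead.
function describeInput(root, input) {
  return {
    tag: String(input.tagName).toLowerCase(),
    attributes: { ...(input.attrs || {}) },
    position: absolutePosition(root, input),
  };
}
```

A position defined this way identifies the input box even when it has no id or name attribute, but it changes whenever the page inserts nodes ahead of it, so the record is only valid for the page structure that was displayed when it was learned.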
The markup language processing device is also instructed that, when a form submission event is produced, it must not submit to the server but instead notify the learning device of which object produced the submission event.
In the learning device, the input boxes to be filled in are activated one after another. During this process, each activated input box is recorded; input boxes that are not activated, and repeated activations, are ignored. Clicking the submit button produces a form submission event; on receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
At this point, learning is complete.
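The recording and submission steps above can be sketched as a small recorder. Everything named here is an assumption for illustration: the class `FormLearner`, the use of a `Map` keyed by URL as the form feature database, and the shape of the stored record. The only standard API relied on is `event.preventDefault()` and `event.target` from DOM events.

```javascript
// Hypothetical recorder for one learning session.
class FormLearner {
  constructor(url, database) {
    this.url = url;           // URL of the markup file being learned
    this.database = database; // assumed: a Map from URL to learned features
    this.recorded = [];       // activated input boxes, in activation order
  }

  // Called when an input box is activated. Repeated activation is ignored.
  onInputActivated(input) {
    if (!this.recorded.includes(input)) {
      this.recorded.push(input);
    }
  }

  // Called when a form submission event is produced. The submission is
  // cancelled so nothing reaches the server, and the recorded input boxes
  // are stored together with the object that produced the event.
  onSubmit(event) {
    if (typeof event.preventDefault === "function") {
      event.preventDefault();
    }
    const inputs = this.recorded.map((input) => ({
      tag: String(input.tagName).toLowerCase(),
      type: input.type || "text",
    }));
    this.database.set(this.url, {
      submitter: String(event.target.tagName).toLowerCase(),
      inputs,
    });
    return inputs;
  }
}
```

Input boxes that were never activated do not appear in the stored record, which matches the rule that only the boxes a person actually used are learned.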
Through these components the learning device can interact with the markup language processing device, learn web form features, and store them in the form feature database.
Whatever the engine, it must eventually be integrated into a service environment to be of value; engines therefore expose an operation interface through which third-party devices can operate them.
According to the JavaScript language standard, a click made with a control device in the markup language processing device produces an onClick event.
Also according to the JavaScript language standard, when an onClick event is produced a function can be called, and the object that triggered the onClick event is passed to the function as a parameter, so that JavaScript code can operate on the object of the event.
A JavaScript function is written that continuously traverses the tag objects in the current markup language processing device and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled.
A second JavaScript function is written that collects the information sent by the onClick handler.
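A minimal sketch of these two functions follows. The names `registerClickHandlers` and `createCollector`, and the `registered` set used to avoid registering a node twice, are assumptions and not taken from the patent. Nodes are DOM elements or plain objects with `tagName` and `children`.

```javascript
const CLICKABLE_TAGS = new Set(["input", "button", "a", "img"]);

// Walk the tree once and attach `handler` as the onclick handler of every
// input, button, a and img node that has not been registered yet.
// Returns the number of newly registered nodes.
function registerClickHandlers(node, handler, registered) {
  let added = 0;
  const tag = String(node.tagName).toLowerCase();
  if (CLICKABLE_TAGS.has(tag) && !registered.has(node)) {
    node.onclick = handler;
    registered.add(node);
    added += 1;
  }
  for (const child of Array.from(node.children || [])) {
    added += registerClickHandlers(child, handler, registered);
  }
  return added;
}

// Collector for what the click handler reports: it keeps one record per
// click, naming the tag of the object that triggered the event.
function createCollector() {
  const events = [];
  const handler = function (event) {
    events.push({ tag: String(event.target.tagName).toLowerCase() });
  };
  return { events, handler };
}
```

Calling `registerClickHandlers` again after the page has inserted new controls picks up only the new ones, so repeating the traversal (for example from a timer, with an interval chosen by the implementer) is what lets dynamically loaded controls be handled. Assigning `onclick` replaces any handler the page set through that property; in a real browser, `addEventListener("click", handler)` would avoid that, at the cost of departing from the wording above.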
When the markup language processing device confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing device to register with it a private JavaScript interface provided by the learning device. This private interface lets the JavaScript engine in the markup language processing device communicate with the learning device; through it, the JavaScript engine running in the current markup file can send the collected tag information to the learning device.
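From the page's side, using such a private interface might look like the sketch below. The patent does not name the interface or its methods, so the `bridge` object and its `receiveTagInfo` method are purely hypothetical; how the learning device actually exposes the object to page scripts depends on the engine.

```javascript
// `bridge` stands for the private interface the learning device registers
// with the engine. The method name `receiveTagInfo` is hypothetical.
function sendTagInfo(bridge, url, tags) {
  if (!bridge || typeof bridge.receiveTagInfo !== "function") {
    // Interface not registered, e.g. the page is open outside the learning device.
    return false;
  }
  // Serialize to a string so that only a primitive value crosses the boundary.
  bridge.receiveTagInfo(JSON.stringify({ url, tags }));
  return true;
}
```

Passing a JSON string rather than live tag objects is a design choice of this sketch: it keeps the boundary between the script engine and the host simple and avoids handing engine-owned objects to the learning device.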
Further, the learning device has a built-in physical markup language processing device.
Further, the learning device has a built-in non-physical (virtual) markup language processing device.
Further, the markup language processing device is provided with an operation interface.
Further, the default markup language of the markup language processing device is HTML.
Further, the markup language processing device is the Trident engine, and the operation interface is the WebControl interface. Many mature markup language processing devices exist today, including Microsoft's Trident engine, Google's Blink engine, the Mozilla Foundation's Gecko engine, Apple's WebKit engine, and the proprietary physical or virtual engines of other companies in the industry. Different markup language processing devices provide their own interfaces, under many different names; here the preferred markup language processing device is the Trident engine, whose corresponding interface is the WebControl interface.
Further, the built-in browser is the IE browser.
Further, the markup language set is JavaScript script content.
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) with human participation, the method learns the markup language form structure semi-automatically and can extract web form structure features that are complete, real, and accurate;
(2) the submit button used by the method is <input type=submit> and the form is submitted by the learning device, so form feature extraction is not prone to failure;
(3) the method has the <input> boxes wrapped in a <form> tag, so that after the browser issues the page-load-complete notification the rule of the static scan can be satisfied and the query can proceed smoothly.
Brief description of the drawings
Fig. 1 is the workflow of the markup language processing device;
Fig. 2 is the workflow of the learning device with the markup language learning device;
Fig. 3 is the flow of the markup language set traversal function;
Fig. 4 is the flow of the "click" event handler.
Detailed description
The present invention is described in further detail below with reference to an embodiment, but embodiments of the present invention are not limited to it.
Embodiment:
In the existing feature analysis method, shown in Fig. 1, when the markup language processing device issues the notification that the markup file has finished loading, the page is assumed to contain the elements above; the markup file is then analyzed through the interface provided by the markup language processing device, and the <form> and <input> features of the form are extracted. This method, however, falls short in the face of rapidly developing dynamic markup loading techniques.
This embodiment discloses a semi-automatic learning type form feature extraction method. With human participation, the method learns the markup language form structure semi-automatically and can extract web form structure features that are complete, real, and accurate. The concrete implementation steps are:
(1) start the learning device; a human-computer interaction interface similar to the IE browser appears;
(2) enter the markup language file in the address bar to locate the markup language file;
(3) the device loads the markup language file through the built-in IE browser;
(4) after loading completes, the built-in IE browser notifies the learning device that the markup language file has been loaded and that the markup language set has been generated;
(5) the learning device inserts a learning module into the loaded markup language file;
(6) operate the form, for example fill in content, choose an option, or click the submit button; the learning device records these operations in full and generates the related feature information, such as tag name, attributes, and absolute position;
(7) after the click event of the submit button is received, the learning module considers learning complete and stores the feature information of the form structure in the database. The whole form feature learning process is complete.
The workflow of the learning device with the markup language learning device is shown in Fig. 2. The default markup language is HTML, and the <input> tag is taken as the default tag with which the markup language processing device renders an input box.
In the semi-automatic learning process, the machine does not need to detect when the web page has finished loading; a person makes that judgment.
Once the user sees that the markup language processing device displays the form to be filled in, the markup language set that renders the form structure is already completely present in the markup language processing device.
The markup language processing device is instructed that, whenever any <input> tag is activated, it must notify the learning device of the object of the activated <input> tag; when parsing the markup language, the markup language processing device generates a unique corresponding entry for each tag. Through the activated <input> tag object, the learning device reads the attributes of the tag. By traversing the markup language set in the markup language processing device, the learning device computes the absolute position of the currently activated <input> tag within the markup language set. The markup language processing device is also instructed that, when a form submission event is produced, it must not submit to the server but instead notify the learning device of which object produced the submission event. In the learning device, the input boxes to be filled in are activated one after another; during this process, each activated input box is recorded, while input boxes that are not activated, and repeated activations, are ignored. Clicking the submit button in the learning device produces a form submission event; on receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
The markup language processing device used is Microsoft's Trident engine, whose corresponding interface is the WebControl interface.
According to the JavaScript language standard, a click made with a control device in the markup language processing device produces an onClick event. When an onClick event is produced a function can be called, and the object that triggered the onClick event is passed to the function as a parameter, so that JavaScript code can operate on the object of the event.
A JavaScript function is written that continuously traverses the tag objects in the current markup language processing device, as shown in Fig. 3, and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled; a further function collects the information sent by the onClick handler, as shown in Fig. 4.
When the markup language processing device confirms that the markup language file has finished loading, the WebControl interface provided by the Trident engine raises the DocumentCompleted event. Through the interface provided by the markup language processing device, the learning device then places the function that reads control information into the tag set of the current markup file.
When the markup language processing device confirms that the markup language file has finished loading, the learning device uses the interface provided by the markup language processing device to register with it a private JavaScript interface provided by the learning device. This private interface lets the JavaScript engine in the markup language processing device communicate with the learning device; through it, the JavaScript engine running in the current markup file can send the collected tag information to the learning device.
After receiving the final tag information, the learning device writes it to the database.
The above is only a preferred embodiment of the present invention and does not limit the invention in any form; any simple modification or equivalent variation made to the above embodiment according to the technical essence of the present invention falls within the scope of protection of the present invention.

Claims (8)

Application Number | Priority Date | Filing Date | Title | Status | Publication
CN201410317562.9A | 2014-07-07 | 2014-07-07 | Semi-automatic learning type form feature extraction method | Active | CN104063488B (en)

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201410317562.9A | CN104063488B (en) | 2014-07-07 | 2014-07-07 | Semi-automatic learning type form feature extraction method

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201410317562.9A | CN104063488B (en) | 2014-07-07 | 2014-07-07 | Semi-automatic learning type form feature extraction method

Publications (2)

Publication Number | Publication Date
CN104063488A (en) | 2014-09-24
CN104063488B (en) | 2017-09-01

Family

ID=51551202

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201410317562.9A | Active | CN104063488B (en) | 2014-07-07 | 2014-07-07 | Semi-automatic learning type form feature extraction method

Country Status (1)

Country | Link
CN (1) | CN104063488B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
US20070271305A1 (en)* | 2006-05-18 | 2007-11-22 | Sivansankaran Chandrasekar | Efficient piece-wise updates of binary encoded XML data
US20140089302A1 (en)* | 2009-09-30 | 2014-03-27 | Gennady LAPIR | Method and system for extraction
CN102681994A (en)* | 2011-03-07 | 2012-09-19 | 北京百度网讯科技有限公司 | Webpage information extracting method and system
CN103443786A (en)* | 2011-03-15 | 2013-12-11 | 高通股份有限公司 | A task-independent machine learning method for identifying parallel layouts in web browsers
CN103793282A (en)* | 2012-11-02 | 2014-05-14 | 阿里巴巴集团控股有限公司 | Browser and tab ending method thereof
CN103440198A (en)* | 2013-08-27 | 2013-12-11 | 星云融创(北京)信息技术有限公司 | Method for calibrating form
CN103514292A (en)* | 2013-10-09 | 2014-01-15 | 南京大学 | Webpage data extraction method based on semi-supervised learning of small sample
CN103559234A (en)* | 2013-10-24 | 2014-02-05 | 北京邮电大学 | System and method for automated semantic annotation of RESTful Web services
CN103699683A (en)* | 2014-01-02 | 2014-04-02 | 国家电网公司 | Data processing method and data processing device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number | Priority date | Publication date | Assignee | Title
CN109445654A (en)* | 2018-09-28 | 2019-03-08 | 成都安恒信息技术有限公司 | A kind of method that graphic interface program is filled a vacancy automatically
CN109445654B (en)* | 2018-09-28 | 2022-02-08 | 成都安恒信息技术有限公司 | Method for automatically filling gaps in graphical interface program
CN112836150A (en)* | 2021-02-03 | 2021-05-25 | 捷玛计算机信息技术(上海)股份有限公司 | Identification method, system, equipment and medium for tracing code of medicine
CN112836150B (en)* | 2021-02-03 | 2024-07-16 | 捷玛计算机信息技术(上海)股份有限公司 | Identification method, system, equipment and medium for traceability code of medicine

Also Published As

Publication number | Publication date
CN104063488B (en) | 2017-09-01

Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant