CN104063488B - A semi-automatic learning method for form feature extraction

Info

Publication number
CN104063488B
Authority
CN
China
Prior art keywords
markup language
learning device
language processing
semi
processing unit
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410317562.9A
Other languages
Chinese (zh)
Other versions
CN104063488A (en)
Inventor
陈超
陈超一
范渊
吴永越
郑学新
姜毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu DBAPPSecurity Co Ltd
Original Assignee
Chengdu DBAPPSecurity Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2014-07-07
Filing date
2014-07-07
Publication date
2017-09-01
Application filed by Chengdu DBAPPSecurity Co Ltd
Priority to CN201410317562.9A
Publication of CN104063488A
Application granted
Publication of CN104063488B
Legal status: Active (current)

Abstract

The invention discloses a semi-automatic learning method for form feature extraction, comprising the following steps: (1) start the learning device; (2) enter the location of the markup language file; (3) the learning device loads the markup language file; (4) the markup language set is generated; (5) a learning module is inserted into the markup language file; (6) the form is operated, and the operations are recorded in full to generate feature information; (7) the form structure information is stored in a database; (8) form feature learning is complete. With human participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate. Form submission is completed by the learning device, so form feature extraction is unlikely to fail. The method has the <input> input boxes wrapped in a <form> tag, so that after the browser issues the page-loaded notification the rules of static scanning are satisfied and queries proceed smoothly.

Description

A semi-automatic learning method for form feature extraction
Technical field
The present invention relates to the fields of machine learning, data mining, and online experience, and specifically to a semi-automatic learning method for form feature extraction.
Background art
With the popularization of Internet information technology, retrieving and exchanging information by visiting websites through a browser has become one of the essential skills for raising productivity in modern society.
When visiting a website to retrieve information, a user may need to enter information frequently, for example to log in, post a comment, or vote. Some information must be entered repeatedly: to log in, a user must enter a different user name or password on each website; to shop online, a user must re-enter an address, postal code, consignee name, and similar details for each purchase.
Because this information may have to be entered often and in large amounts, and because it seldom varies (for online shopping, one's own address rarely changes, and one's name even less so), the shell of almost every modern markup language processing device, that is, its human-machine interface such as a browser interface, provides automatic login and automatic form-filling functions, which relieve people of repetitive work and improve productivity.
If the shell of a markup language processing device is to fill data automatically into a form in the markup language processing device, it must know which form item corresponds to each entry, for example: the recipient's name corresponds to the 1st input box, the recipient's address to the 2nd input box, and the recipient's postal code to the 3rd input box. Under such a rule, the structural features of the form must be known before data can be filled into the correct items.
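As an illustration of the rule just described, the following minimal sketch fills stored values into input boxes by position. The function name `autofillByPosition`, the `layout` and `profile` objects, and the plain objects standing in for input boxes are assumptions made for this example; none of them come from the patent.

```javascript
// Hypothetical sketch: fill stored values into input boxes by their position.
// `inputs` stands in for the form's input boxes in document order;
// `layout` maps a 0-based position to the name of a stored entry.
function autofillByPosition(inputs, layout, profile) {
  layout.forEach((entryName, position) => {
    const box = inputs[position];
    // Skip positions the form does not have and entries with no stored value.
    if (box && Object.prototype.hasOwnProperty.call(profile, entryName)) {
      box.value = profile[entryName];
    }
  });
  return inputs;
}

const inputs = [{ value: "" }, { value: "" }, { value: "" }];
const layout = ["name", "address", "postcode"];
const profile = { name: "Zhang San", address: "1 Example Road", postcode: "610000" };
autofillByPosition(inputs, layout, profile);
```

The sketch only works if `layout` is correct for the page at hand, which is exactly the form structure knowledge the rest of the document sets out to learn.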
HTML (HyperText Markup Language), proposed by the World Wide Web Consortium and referred to here as the "markup language", is a language standard that lets the Internet generate web page files composed of tags in a unified, standardized language; such a file is referred to as a "markup file". HTML provides a series of standard basic components on top of a tree structure of tags, so any markup language processing device that implements the HTML standard remains universal.
When a markup language processing device loads a website's markup language file and data must be submitted to the website, for example to chat, post a comment, buy or sell goods, or save personalized settings, the website must provide a way to collect data from the browser. For this purpose the HTML standard provides the "form" component. A form generally contains the following elements:
<form>: declares a form; the data inside it can be submitted to the server;
<input>: a child node of the <form> tag that declares a single-line text input box; depending on its type attribute it is presented in different styles, for example <input type=text>, an ordinary input box, and <input type=password>, a password input box that hides the input content;
Submit button: the submit button is actually one type of the <input> tag. When the type attribute of an <input> tag is set to submit, the markup language processing device presents a button; when the button is activated, the user-entered data of all valid <input> tags inside the <form> tag is submitted to the server.
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing device issues the notification that a markup file has finished loading, and the page is assumed to contain the above elements, the markup file is analyzed through the interface provided by the markup language processing device and the features of the form's <form> and <input> tags are extracted. However, this method falls short in the face of rapidly developing dynamic markup loading technology, because dynamic loading causes the following problems:
After the markup language processing device issues the page-loaded notification, the markup file does not yet contain the login box; the markup needed to present the form is in fact still being loaded by JavaScript in the markup file. In other words, the markup language set needed to present the form has not really finished loading, so form feature extraction fails;
The submit button is not <input type=submit>; it may be any HTML tag to which a call to JavaScript code has been added, with the form submitted by JavaScript, so form feature extraction fails;
The <input> input boxes may not even be wrapped in a <form> tag. As a result, after the browser issues the page-loaded notification the page does not satisfy the rules of static scanning, and the query fails.
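The failure modes above can be made concrete with a small sketch of a static scan. It assumes a parsed page represented as plain objects of the form `{tag, attrs, children}`; the function `staticScan` and both sample pages are hypothetical and serve only to show why a rule based on <form> and <input> finds nothing on a script-driven page.

```javascript
// Hypothetical sketch of the static scan: collect <input> nodes that sit
// inside a <form>, using plain objects {tag, attrs, children} as a stand-in
// for the parsed markup.
function staticScan(node, insideForm = false, found = []) {
  const inForm = insideForm || node.tag === "form";
  if (node.tag === "input" && inForm) {
    found.push(node);
  }
  for (const child of node.children || []) {
    staticScan(child, inForm, found);
  }
  return found;
}

// A classic page: inputs wrapped in a form are found.
const classicPage = {
  tag: "body",
  children: [{
    tag: "form",
    children: [
      { tag: "input", attrs: { type: "text" } },
      { tag: "input", attrs: { type: "submit" } },
    ],
  }],
};

// A script-driven page: the input has no <form> parent and the "button"
// is a <div> with a click handler, so the static rule finds nothing.
const dynamicPage = {
  tag: "body",
  children: [
    { tag: "input", attrs: { type: "text" } },
    { tag: "div", attrs: { onclick: "login()" } },
  ],
};

console.log(staticScan(classicPage).length); // 2
console.log(staticScan(dynamicPage).length); // 0
```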
Summary of the invention
The object of the invention is to provide, by means of human participation, a semi-automatic learning method for form feature extraction that can extract web form structural features that are complete, authentic, and accurate.
The invention is achieved through the following technical solution: a semi-automatic learning method for form feature extraction, comprising the following steps:
(1) Start the learning device, which has a built-in markup language processing device;
(2) Enter the location of the markup language file in the address bar;
(3) The learning device loads the markup language file through the built-in browser;
(4) When loading completes, the built-in browser notifies the learning device that the markup language file has finished loading, and the markup language set is generated;
(5) The learning device inserts a learning module into the loaded markup language file;
(6) The form is operated; the learning device records the operations in full and generates the related feature information;
(7) On receiving the click event of the submit button, the learning module considers learning complete and stores the form structure information in the database;
(8) The whole form feature learning process is complete.
The above method builds a learning device with a built-in markup language processing device and determines the markup language; the tag that the markup language processing device uses to present an input box defaults to the <input> tag.
In the semi-automatic learning process, the machine does not need to recognize when the web page has finished loading; a person makes that judgment.
When the person sees that the markup language processing device displays the form to be filled in, the markup language set that presents the form structure must already be completely present in the markup language processing device.
The markup language processing device is instructed that, when any <input> tag is activated, it notifies the learning device of the object of the activated <input> tag.
Through the activated <input> tag object, the learning device reads the attributes of the tag.
By traversing the markup language set in the markup language processing device, the learning device calculates the absolute position of the currently activated <input> tag within the markup language set.
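The attribute reading and position calculation described above might be sketched as follows. The text does not define "absolute position" precisely; this example assumes it is the tag's 0-based index in a pre-order traversal of the markup set. The names `absolutePosition` and `describeTag`, and the plain-object node shape `{tag, attrs, children}`, are likewise assumptions.

```javascript
// Hypothetical sketch: locate an activated <input> in the markup set.
// The "absolute position" is taken here to be the node's 0-based index in a
// pre-order traversal, which is one possible reading of the text.
function absolutePosition(root, target) {
  let index = 0;
  const stack = [root];
  while (stack.length > 0) {
    const node = stack.pop();
    if (node === target) return index;
    index += 1;
    const children = node.children || [];
    // Push children in reverse so the leftmost child is visited first.
    for (let i = children.length - 1; i >= 0; i -= 1) {
      stack.push(children[i]);
    }
  }
  return -1; // target is not part of this markup set
}

// Build the feature record for an activated tag.
function describeTag(root, target) {
  return {
    tagName: target.tag,
    attributes: { ...(target.attrs || {}) },
    position: absolutePosition(root, target),
  };
}

const password = { tag: "input", attrs: { type: "password", name: "pwd" } };
const page = {
  tag: "body",
  children: [
    { tag: "div", children: [{ tag: "input", attrs: { type: "text" } }] },
    password,
  ],
};
const feature = describeTag(page, password);
```

Here `feature.position` is 3, because the traversal visits body, div, and the text input before reaching the password box. A positional index is only stable while the page layout stays the same, which is a limitation of this particular reading.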
The markup language processing device is instructed that, when a form submission event is produced, it must not submit to the server but must instead notify the learning device of which object produced the form submission event.
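One way to picture this interception is the sketch below. `makeSubmitInterceptor` and the `notifyLearningDevice` callback are hypothetical names, and the event is a minimal stand-in so that the example runs outside a browser; only `preventDefault()` and `target` mirror the standard DOM event interface.

```javascript
// Hypothetical sketch: stop a form submission from reaching the server and
// report its source to the learning device instead.
// `notifyLearningDevice` is an assumed callback, not an API from the patent.
function makeSubmitInterceptor(notifyLearningDevice) {
  return function onSubmit(event) {
    event.preventDefault();             // do not submit to the server
    notifyLearningDevice(event.target); // report which object produced the event
  };
}

// Minimal stand-in for a DOM submit event.
const reported = [];
const fakeEvent = {
  target: { tag: "form", attrs: { id: "login" } },
  defaultPrevented: false,
  preventDefault() { this.defaultPrevented = true; },
};
makeSubmitInterceptor((source) => reported.push(source))(fakeEvent);
```

In a browser, such a handler could be attached with `document.addEventListener("submit", handler, true)`. Submissions triggered purely from script would need separate handling, since calling `form.submit()` does not raise a submit event.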
In the learning device, the input boxes to be filled in are activated one after another. During this process each activated input box is recorded, and repeated activation of a box that has already been recorded is ignored. The submit button is then clicked, producing a form submission event; on receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
At this point, learning is complete.
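The recording step can be sketched as a small class. This is an illustration under stated assumptions: `FormLearner` is a hypothetical name, a `Map` keyed by URL stands in for the form feature database, and the feature records are simplified to a tag name and a position.

```javascript
// Hypothetical sketch of the recording step. A Map keyed by URL stands in
// for the form feature database.
class FormLearner {
  constructor(database) {
    this.database = database;
    this.recorded = [];     // feature records in order of first activation
    this.seen = new Set();  // input boxes that were already recorded
  }

  // Called when an input box is activated.
  activate(inputBox, feature) {
    if (this.seen.has(inputBox)) return false; // repeated activation is ignored
    this.seen.add(inputBox);
    this.recorded.push(feature);
    return true;
  }

  // Called when the form submission event arrives.
  submit(url) {
    this.database.set(url, this.recorded.slice());
    return this.database.get(url);
  }
}

const featureDb = new Map();
const learner = new FormLearner(featureDb);
const userBox = { tag: "input" };
const passBox = { tag: "input" };
learner.activate(userBox, { tagName: "input", position: 4 });
learner.activate(passBox, { tagName: "input", position: 6 });
learner.activate(userBox, { tagName: "input", position: 4 }); // ignored
learner.submit("http://example.com/login");
```

Deduplication is by object identity rather than by attribute values, so two distinct input boxes with identical attributes are still recorded separately.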
Through this part, the learning device can interact with the markup language processing device, learn web form features, and store them in the form feature database.
Whatever the engine, to deliver its value it must ultimately be integrated into a service environment; therefore the engine exposes an operating interface that lets third-party devices operate it.
According to the JavaScript language standard, when a control is clicked in the markup language processing device, an onClick event is produced.
According to the JavaScript language standard, when an onClick event is produced a function can be called, and the object that triggered the onClick is passed to the function as a parameter, allowing JavaScript to operate on the object according to the event.
A JavaScript function is written that continuously traverses the tag objects in the current markup language processing device and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled.
A JavaScript function is written that is responsible for collecting the information sent by the onClick handler.
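The two functions described above might look roughly like the following sketch. Plain objects stand in for DOM nodes so that the example runs outside a browser, and `registerClickHandlers`, `collect`, and `onClickHandler` are hypothetical names. The registration pass is written to be safe to repeat, which is how later passes can pick up controls that were added dynamically.

```javascript
// Hypothetical sketch: attach one click handler to every input, button, a
// and img node, and make repeated passes safe so that later passes pick up
// controls added dynamically. Plain objects stand in for DOM nodes.
const WATCHED_TAGS = new Set(["input", "button", "a", "img"]);

function registerClickHandlers(node, handler) {
  let added = 0;
  if (WATCHED_TAGS.has(node.tag) && node.onclick !== handler) {
    node.onclick = handler;
    added += 1;
  }
  for (const child of node.children || []) {
    added += registerClickHandlers(child, handler);
  }
  return added; // number of nodes newly registered in this pass
}

// Collector for whatever the click handler reports.
const collected = [];
function collect(info) {
  collected.push(info);
}

// The handler receives an event whose target is the clicked object.
function onClickHandler(event) {
  const target = event.target;
  collect({ tagName: target.tag, attributes: target.attrs || {} });
}

const doc = { tag: "body", children: [{ tag: "input", attrs: { name: "user" } }] };
const firstPass = registerClickHandlers(doc, onClickHandler);
doc.children.push({ tag: "a", attrs: { href: "#" } }); // loaded later
const secondPass = registerClickHandlers(doc, onClickHandler);
doc.children[1].onclick({ target: doc.children[1] });  // simulate a click
```

In a browser, the pass could be repeated on a timer with `setInterval`, or triggered by a `MutationObserver`; the source only says that the traversal runs continuously and does not say which mechanism is used.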
When the markup language processing device confirms that the markup language file has finished loading, the learning device, through the interface provided by the markup language processing device itself, registers a private JavaScript interface provided by the learning device with the markup language processing device. This private interface lets the JavaScript engine in the markup language processing device communicate with the learning device; through it, the JavaScript engine in the current markup file can send the collected tag information to the learning device.
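A rough sketch of such a private interface follows. The names `createLearningBridge`, `reportTag`, and `learningBridge` are invented for this example, and the host side is simulated with plain objects. With the .NET WebBrowser control, for instance, an object assigned to `ObjectForScripting` is reachable from page script as `window.external`; the patent does not say whether its implementation uses that mechanism.

```javascript
// Hypothetical sketch of the private interface. In a real host the bridge
// object would be exposed to page scripts by the embedding engine.
function createLearningBridge(learningDevice) {
  return {
    // Page script calls this with a JSON string; a string is used here to
    // keep the boundary between page script and host simple.
    reportTag(json) {
      learningDevice.receive(JSON.parse(json));
    },
  };
}

// Stand-in for the learning device side.
const device = {
  received: [],
  receive(info) { this.received.push(info); },
};

// Stand-in for the page's global object after registration.
const pageGlobal = { learningBridge: createLearningBridge(device) };

// What the injected page script would do with collected tag information.
pageGlobal.learningBridge.reportTag(
  JSON.stringify({ tagName: "input", attributes: { name: "user" }, position: 4 })
);
```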
Further, the learning device has a built-in physical markup language processing device.
Further, the learning device has a built-in non-physical markup language processing device.
Further, the markup language processing device is provided with an operating interface.
Further, the default markup language of the markup language processing device is HTML.
Further, the markup language processing device is the Trident engine, and the operating interface is the WebControl interface. Many mature markup language processing devices now exist, including Microsoft's Trident engine, Google's Blink engine, the Mozilla Foundation's Gecko engine, Apple's WebKit engine, and the proprietary physical or virtual engines of other companies in related industries. Different markup language processing devices provide their own interfaces, of many kinds and names. The preferred markup language processing device here is the Trident engine, whose interface is the corresponding WebControl interface.
Further, the built-in browser is the IE browser.
Further, the markup language set is JavaScript script content.
Compared with the prior art, the invention has the following advantages and beneficial effects:
(1) With human participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate;
(2) The submit button used in the method is <input type=submit>, and form submission is completed by the learning device, so form feature extraction is unlikely to fail;
(3) The method has the <input> input boxes wrapped in a <form> tag, so that after the browser issues the page-loaded notification the rules of static scanning are satisfied and queries proceed smoothly.
Brief description of the drawings
Fig. 1 is the workflow of the markup language processing device;
Fig. 2 is the workflow of the learning device with the markup language learning device;
Fig. 3 is the flow of the markup language set traversal function;
Fig. 4 is the flow of the "click" event handling function.
Detailed description of the embodiments
The invention is described in further detail below with reference to an embodiment, but implementations of the invention are not limited to it.
Embodiment:
In the existing feature analysis method, shown in Fig. 1, whenever the markup language processing device issues the notification that a markup file has finished loading, and the page is assumed to contain the above elements, the markup file is analyzed through the interface provided by the markup language processing device and the features of the form's <form> and <input> tags are extracted. However, this method falls short in the face of rapidly developing dynamic markup loading technology.
This embodiment discloses a semi-automatic learning method for form feature extraction. With human participation, the method learns the structure of markup language forms semi-automatically and can extract web form structural features that are complete, authentic, and accurate. The specific implementation steps are:
(1) Start the learning device; a human-computer interaction interface similar to the IE browser appears;
(2) Enter the markup language file in the address bar to locate the markup language file;
(3) The device loads the markup language file through the built-in IE browser;
(4) When loading completes, the built-in IE browser notifies the learning device that the markup language file has finished loading and that the markup language set has been generated;
(5) The learning device inserts a learning module into the loaded markup language file;
(6) Operate the form, for example fill in content, choose an option, or click the submit button; the learning device records these operations in full and generates the related feature information, such as tag name, attributes, and absolute position;
(7) On receiving the click event of the submit button, the learning module considers learning complete and stores the feature information of the form structure in the database. The whole form feature learning process is complete.
The workflow of the learning device with the markup language learning device is shown in Fig. 2. The default markup language is HTML, and the tag that the markup language processing device uses to present an input box defaults to the <input> tag.
In the semi-automatic learning process, the machine does not need to recognize when the web page has finished loading; a person makes that judgment.
When the person sees that the markup language processing device displays the form to be filled in, the markup language set that presents the form structure is already completely present in the markup language processing device.
The markup language processing device is instructed that, when any <input> tag is activated, it notifies the learning device of the object of the activated <input> tag. When parsing the markup language, the markup language processing device generates a unique corresponding entry for each tag. Through the activated <input> tag object, the learning device reads the attributes of the tag. By traversing the markup language set in the markup language processing device, the learning device calculates the absolute position of the currently activated <input> tag within the markup language set. The markup language processing device is also instructed that, when a form submission event is produced, it must not submit to the server but must instead notify the learning device of which object produced the event. In the learning device, the input boxes to be filled in are activated one after another; each activated input box is recorded, and repeated activation of a box that has already been recorded is ignored. The submit button is then clicked in the learning device, producing a form submission event; on receiving the event, the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database.
The markup language processing device chosen is Microsoft's Trident engine, and its corresponding interface is the WebControl interface.
According to the JavaScript language standard, when a control is clicked in the markup language processing device, an onClick event is produced. When an onClick event is produced a function can be called, and the object that triggered the onClick is passed to the function as a parameter, allowing JavaScript to operate on the object according to the event.
A JavaScript function is written that continuously traverses the tag objects in the current markup language processing device, as shown in Fig. 3, and registers its own onClick handler for the onClick events of input, button, a, and img tags, so that HTML controls loaded dynamically later are also handled. The function responsible for collecting the information sent by the onClick handler is shown in Fig. 4.
When the markup language processing device confirms that the markup language file has finished loading, the WebControl interface provided by the Trident engine raises a DocumentCompleted event. The learning device then, through the interface provided by the markup language processing device itself, places the function that reads control information into the tag set of the current markup file.
When the markup language processing device confirms that the markup language file has finished loading, the learning device, through the interface provided by the markup language processing device itself, registers a private JavaScript interface provided by the learning device with the markup language processing device. This private interface lets the JavaScript engine in the markup language processing device communicate with the learning device; through it, the JavaScript engine in the current markup file can send the collected tag information to the learning device.
On receiving the final tag information, the learning device writes it to the database.
The above is only a preferred embodiment of the invention and does not limit the invention in any form. Any simple modification or equivalent variation made to the above embodiment according to the technical essence of the invention falls within the protection scope of the invention.

Claims (8)

(7) on receiving the click event of the submit button, the learning module considers learning complete and stores the form structure information in the database; in the semi-automatic learning process, the machine does not need to recognize when the web page has finished loading, and a person makes that judgment; when the markup language processing device displays the form to be filled in, the markup language set that presents the form structure is already completely present in the markup language processing device; the markup language processing device is instructed that, when any <input> tag is activated, it notifies the learning device of the object of the activated <input> tag; when parsing the markup language, the markup language processing device generates a unique corresponding entry for each tag; through the activated <input> tag object, the learning device reads the attributes of the tag; by traversing the markup language set in the markup language processing device, the learning device calculates the absolute position of the currently activated <input> tag within the markup language set; the markup language processing device is instructed that, when a form submission event is produced, it must not submit to the server but must instead notify the learning device of which object produced the form submission event; in the learning device, the input boxes to be filled in are activated one after another, each activated input box is recorded, and repeated activation of a box that has already been recorded is ignored; in the learning device, the submit button is clicked, producing a form submission event, and on receiving the event the learning device stores the input box information recorded in the previous step, together with the URL of the current markup file, in the form feature database;
CN201410317562.9A, priority date 2014-07-07, filing date 2014-07-07, A semi-automatic learning method for form feature extraction, Active, CN104063488B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201410317562.9A (CN104063488B, en) | 2014-07-07 | 2014-07-07 | A semi-automatic learning method for form feature extraction

Publications (2)

Publication Number | Publication Date
CN104063488A (en) | 2014-09-24
CN104063488B (en) | 2017-09-01

Family

ID=51551202

Country Status (1)

Country | Link
CN (1) | CN104063488B (en)


Legal Events

Code | Title
C06 | Publication
PB01 | Publication
C10 | Entry into substantive examination
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
GR01 | Patent grant
