CN108334560A - A kind of information acquisition method and relevant device - Google Patents


Info

Publication number
CN108334560A
Authority
CN
China
Prior art keywords
property name
attribute value
attribute
traverse path
xpath
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810009236.XA
Other languages
Chinese (zh)
Other versions
CN108334560B (en)
Inventor
王策
张锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201810009236.XA
Publication of CN108334560A
Application granted
Publication of CN108334560B
Legal status: Active
Anticipated expiration


Abstract

Embodiments of the invention disclose an information acquisition method and related device. The method includes: first obtaining a first traversal path of a property name and a second traversal path of an attribute value; then obtaining the property name from page information according to the first traversal path, and obtaining the attribute value from the page information according to the second traversal path; and then establishing a mapping relationship between the property name and the attribute value and outputting it as the information acquisition result. The embodiments of the invention can improve the accuracy of information acquisition.

Description

A kind of information acquisition method and relevant device
Technical field
The present invention relates to the field of computer technology, and in particular to an information acquisition method and related device.
Background technology
At present, the main carrier of information online is text. Information acquisition applies structured processing to the information contained in text and turns it into a table-like organization. The input to an information acquisition system is the original text, for example web page data or plain text content, and the output is information points in a fixed format. Information points are extracted from various documents and then integrated in a unified form, so that information can be obtained efficiently from a large number of documents. Information acquisition is usually implemented with the XML Path Language (XPath). In current information acquisition methods the property names are fixed: an XPath is configured only for the attribute value that corresponds to each required property name, and the XPath is used to obtain the specific content of the attribute value from the document object model tree (DOM tree) of the text. For example, Fig. 1 shows the infobox information of the encyclopedia entry "XXX", in which "company name", "foreign language title" and so on are property names, and "Shenzhen XXX Co., Ltd." and "ABC" are the corresponding attribute values. When the infobox information is assembled, "company name" and "foreign language title" are fixed, while "Shenzhen XXX Co., Ltd." and "ABC" are obtained through their XPaths from the DOM tree corresponding to the HTML text of the Baidu Baike page for "XXX".
However, across different pages the attribute values overlap heavily while the property names differ considerably. For example, the property name corresponding to the attribute value "internet" in Fig. 1 is "business scope", but in the infobox information of the Baidu entry "internet" the property name of "internet" is "Chinese name". Therefore, fixing the property names and obtaining only the attribute values by XPath leads to low accuracy of information acquisition.
Summary of the invention
Embodiments of the invention provide an information acquisition method and related device, which can improve the accuracy of information acquisition.
A first aspect of the invention provides an information acquisition method, including:
obtaining a first traversal path of a property name and a second traversal path of an attribute value;
obtaining the property name from page information according to the first traversal path, and obtaining the attribute value from the page information according to the second traversal path;
establishing a mapping relationship between the property name and the attribute value, and outputting it as the information acquisition result.
Establishing the mapping relationship between the property name and the attribute value includes:
obtaining a first mapping tag of the property name and a second mapping tag of the attribute value;
establishing the mapping relationship between the property name and the attribute value according to the first mapping tag and the second mapping tag.
Establishing the mapping relationship between the property name and the attribute value according to the first mapping tag and the second mapping tag includes:
when the first mapping tag is identical to the second mapping tag, establishing the mapping relationship between the property name and the attribute value.
Obtaining the property name from the page information according to the first traversal path and obtaining the attribute value from the page information according to the second traversal path includes:
creating a structure traversal tree according to the page information, where the structure traversal tree includes multiple content nodes;
traversing the multiple content nodes of the structure traversal tree, obtaining the property name according to the first traversal path and obtaining the attribute value according to the second traversal path.
Obtaining the first traversal path of the property name and the second traversal path of the attribute value includes:
obtaining an attribute identifier of the property name and the attribute value;
obtaining the first traversal path and the second traversal path from a configuration file according to the attribute identifier, where the configuration file contains the correspondence between the attribute identifier and the first and second traversal paths.
Obtaining the attribute identifier of the property name and the attribute value includes:
obtaining the uniform resource locator of the page information;
obtaining the attribute identifier of the property name and the attribute value according to the uniform resource locator.
Obtaining the property name from the page information according to the first traversal path includes:
determining the type of the property name;
if the property name is an open property name, obtaining the property name according to the first traversal path of the property name.
Correspondingly, a second aspect of the invention provides an information acquisition device, including:
a path acquisition module, configured to obtain a first traversal path of a property name and a second traversal path of an attribute value;
an information acquisition module, configured to obtain the property name from page information according to the first traversal path, and to obtain the attribute value from the page information according to the second traversal path;
a result output module, configured to establish a mapping relationship between the property name and the attribute value and output it as the information acquisition result.
The result output module is specifically configured to:
obtain a first mapping tag of the property name and a second mapping tag of the attribute value;
establish the mapping relationship between the property name and the attribute value according to the first mapping tag and the second mapping tag.
The result output module is specifically configured to:
when the first mapping tag is identical to the second mapping tag, establish the mapping relationship between the property name and the attribute value.
The information acquisition module is specifically configured to:
create a structure traversal tree according to the page information, where the structure traversal tree includes multiple content nodes;
traverse the multiple content nodes of the structure traversal tree, obtain the property name according to the first traversal path and obtain the attribute value according to the second traversal path.
The path acquisition module is specifically configured to:
obtain an attribute identifier of the property name and the attribute value;
obtain the first traversal path and the second traversal path from a configuration file according to the attribute identifier, where the configuration file contains the correspondence between the attribute identifier and the first and second traversal paths.
The path acquisition module is specifically configured to:
obtain the uniform resource locator of the page information;
obtain the attribute identifier of the property name and the attribute value according to the uniform resource locator.
The information acquisition module is specifically configured to:
determine the type of the property name;
if the property name is an open property name, obtain the property name according to the first traversal path of the property name.
In a third aspect, the invention provides an information acquisition apparatus, including a processor, a memory and a communication bus, where the communication bus implements connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the information acquisition method provided in the first aspect.
In one possible design, the information acquisition apparatus provided by the invention may include modules for performing the corresponding behaviours in the above method. The modules may be software and/or hardware.
Yet another aspect of the invention provides a computer-readable storage medium storing a plurality of instructions, the instructions being suitable for being loaded by a processor to execute the methods described in the above aspects.
Yet another aspect of the invention provides a computer program product including instructions which, when run on a computer, cause the computer to execute the methods described in the above aspects.
In implementing the embodiments of the invention, a first traversal path of a property name and a second traversal path of an attribute value are obtained first; the property name is then obtained from page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, a mapping relationship between the property name and the attribute value is established and output as the information acquisition result. Because both the property name and the attribute value are obtained through traversal paths and then mapped to each other, the accuracy of information acquisition is improved.
Description of the drawings
To explain the technical solutions of the embodiments of the invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an information acquisition result provided by the prior art;
Fig. 2 is a schematic structural diagram of an information acquisition system provided by an embodiment of the invention;
Fig. 3 is a schematic flowchart of an information acquisition method provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of a DOM tree provided by an embodiment of the invention;
Fig. 5 is a schematic flowchart of another information acquisition method provided by an embodiment of the invention;
Fig. 6 is a schematic flowchart of yet another information acquisition method provided by an embodiment of the invention;
Fig. 7 is a schematic structural diagram of an information acquisition device provided by an embodiment of the invention;
Fig. 8 is a schematic structural diagram of an information acquisition apparatus proposed by an embodiment of the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of an information acquisition system provided by an embodiment of the invention. The information acquisition system includes user equipment 201 and a server 202. The server 202 may be a web server capable of providing a network information browsing service. The user equipment 201 may be a device that provides voice and/or data connectivity to a user; it may be connected to a computing device such as a laptop or desktop computer, or it may be a standalone device such as a personal digital assistant (PDA). The server receives a service request sent by the user equipment, the service request being used to request browsing of page information. The server then parses the uniform resource locator (URL) of the page information, obtains the traversal path of the property name and the traversal path of the attribute value from a configuration file, obtains the property name according to the traversal path of the property name and the attribute value according to the traversal path of the attribute value, and finally establishes the correspondence between the property name and the attribute value and sends it to the user equipment as the information acquisition result. The user equipment sends the service request to the server, obtains the information acquisition result from the server and displays it.
Based on the above information acquisition system, as shown in Fig. 3, an information acquisition method proposed by an embodiment of the invention includes the following steps.
S301: the system loads the configuration files pattern.conf and xpath.conf. The pattern.conf file contains expression patterns (pattern) and the ID number (pattern_id) of each expression pattern; for example, as shown in Table 1, pattern.conf contains pattern_id 0 and pattern_id 1 and their corresponding patterns. The xpath.conf file contains the pattern_id, the property names and the XPaths of the attribute values; for example, as shown in Table 2, pattern_id 0 covers the two property names "title" and "brief introduction" and the XPaths of their corresponding attribute values, and pattern_id 1 covers the property name "label" and the XPaths of its two corresponding attribute values.
S302: the pattern_id is obtained from the pattern.conf file by parsing the URL of the page information, and the XPaths of the attribute values are then obtained from the xpath.conf file according to the pattern_id.
S303: the DOM tree of the page information is created, the XPath of each attribute value under the pattern_id is traversed, and the node content located by each XPath is obtained from the DOM tree as the corresponding attribute value.
S304: according to the correspondence between property names and attribute values, the result is output in the form <property name, attribute value>. If one property name corresponds to M attribute values, the result may be output in the form <property name, attribute value 1, attribute value 2, ..., attribute value M>. For example, if the property name "label" corresponds to the two attribute values "stock name" and "company", the output may be <label, stock name, company>.
Table 1. pattern.conf file
pattern_id | pattern
0 | ^https://baike\.baidu\.com/item/.+/\d+$
1 | ^https://baike\.baidu\.com/subview/\d+/\d+\.htm$
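The URL-to-pattern_id lookup against Table 1 can be sketched as follows. This is only an illustrative sketch: the two regular expressions are copied from Table 1, while the dictionary layout, the function name `match_pattern_ids` and the sample URLs are assumptions that do not come from the patent.

```python
import re

# Regular expressions copied from Table 1. Storing them in a dict keyed by
# pattern_id is an assumption; the patent does not specify the file layout.
PATTERNS = {
    0: r"^https://baike\.baidu\.com/item/.+/\d+$",
    1: r"^https://baike\.baidu\.com/subview/\d+/\d+\.htm$",
}


def match_pattern_ids(url):
    """Return the list of pattern_id values whose pattern matches the URL."""
    return [pid for pid, pattern in PATTERNS.items() if re.match(pattern, url)]


# Hypothetical URLs, used only to exercise the two patterns.
print(match_pattern_ids("https://baike.baidu.com/item/XXX/12345"))
print(match_pattern_ids("https://baike.baidu.com/subview/12/34.htm"))
```

Returning a list rather than a single value mirrors the "pattern_id list" mentioned in the description, since more than one pattern could in principle match a URL.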
Table 2. xpath.conf file
However, across different pages the attribute values overlap heavily while the property names differ considerably. On different pages, attribute values located by the same XPath may therefore correspond to entirely different property names, so fixing the property names and obtaining only the attribute values by XPath leads to low accuracy of information acquisition. To solve this problem, the invention proposes the following solution.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of another information acquisition method proposed by an embodiment of the invention. The method includes but is not limited to the following steps:
S401: obtain a first traversal path of a property name and a second traversal path of an attribute value.
In a specific implementation, a service request sent by user equipment may first be received, the service request being used to request page information, and the URL of the page information is then obtained. According to the URL, the attribute identifier of the property name and the attribute value is obtained. Finally, according to the attribute identifier, the first traversal path and the second traversal path are obtained from a configuration file, where the configuration file contains the correspondence between the attribute identifier and the first and second traversal paths.
The system includes the configuration files pattern.conf and xpath.conf. pattern.conf contains each pattern and its corresponding pattern_id; for example, as shown in Table 1, pattern.conf contains pattern_id 0 and pattern_id 1 and their corresponding patterns. xpath.conf contains the pattern_id, the property names and the XPaths of the attribute values. The property names in xpath.conf include specific property names and property names in XPath form. A property name in XPath form is an open property name, and the attribute value corresponding to an open property name is an open attribute value. Note that an open property name has exactly one corresponding open attribute value. For example, as shown in Table 3, pattern_id 0 corresponds to two kinds of property names. The first kind is a specific property name: as shown in the first row of Table 3, the property name "title" is a specific property name, and the XPath of its attribute value is "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind is a property name in XPath form: as shown in the second row of Table 3, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" is a property name in XPath form, and the XPath of its attribute value is "/html/body/div[4]/div[2]/div/dl[1]/dd[1]".
Table 3. Modified xpath.conf file
For example, after a service request is received, the configuration files pattern.conf and xpath.conf are loaded first. The URL of the page information requested by the user equipment is then obtained and matched against the regular expressions in pattern.conf to obtain the corresponding pattern and pattern_id, and a pattern_id list is generated. According to the pattern_id list, the specific property names and the XPaths of their attribute values, as well as the XPaths of the open property names and the XPaths of the corresponding open attribute values, are obtained from xpath.conf. For example, the pattern_id may first be obtained from pattern.conf as shown in Table 1; then, according to the pattern_id, the property names, the XPaths of the attribute values and the XPaths of the property names are obtained from the xpath.conf file shown in Table 3, and the configuration information list shown in Table 4 is generated. The configuration information list includes pattern_id 0, the specific property name "title" and the XPath of its attribute value "/html/body/div[4]/div[2]/div/div[2]/dd/h1", the XPath of the open property name "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and the XPath of the corresponding open attribute value "/html/body/div[4]/div[2]/div/dl[1]/dd[1]".
Table 4. Configuration information list
pattern_id | Property name / attribute value | XPath
0 | Title | /html/body/div[4]/div[2]/div/div[2]/dd/h1
0 | Open property name | /html/body/div[4]/div[2]/div/dl[1]/dt[2]
0 | Open attribute value | /html/body/div[4]/div[2]/div/dl[1]/dd[2]
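One possible in-memory form of the configuration information list is sketched below. The three rows reproduce Table 4 as printed; the `ConfigEntry` type, its field names and the `entries_for` helper are hypothetical names introduced only for illustration.

```python
from typing import NamedTuple


class ConfigEntry(NamedTuple):
    """One row of the configuration information list (hypothetical type)."""
    pattern_id: int
    label: str   # specific property name, or an "open ..." marker
    xpath: str


# Rows as printed in Table 4.
CONFIG = [
    ConfigEntry(0, "Title", "/html/body/div[4]/div[2]/div/div[2]/dd/h1"),
    ConfigEntry(0, "Open property name", "/html/body/div[4]/div[2]/div/dl[1]/dt[2]"),
    ConfigEntry(0, "Open attribute value", "/html/body/div[4]/div[2]/div/dl[1]/dd[2]"),
]


def entries_for(pattern_ids):
    """Select the rows that belong to any pattern_id in the given list."""
    wanted = set(pattern_ids)
    return [entry for entry in CONFIG if entry.pattern_id in wanted]


for entry in entries_for([0]):
    print(entry.label, entry.xpath)
```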
S402: obtain the property name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path.
In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes multiple content nodes. The content nodes of the structure traversal tree are traversed, the property name is obtained according to the first traversal path, and the attribute value is obtained according to the second traversal path.
Optionally, before the property name is obtained from the page information according to the first traversal path, the type of the property name may be determined. If the property name is an open property name (in XPath form), the property name is obtained from the page information according to the first traversal path. If the property name is a specific property name, for example "company business" or "development history", then "company business" or "development history" can be used directly as the property name, so in this case the property name does not need to be obtained from the page information according to the first traversal path.
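The type check described above might look like the following sketch. The patent only states that an open property name is given in XPath form; treating "starts with `/`" as the test for XPath form is an assumption, and `is_open_property_name`, `resolve_property_name` and `evaluate_xpath` are hypothetical names.

```python
def is_open_property_name(configured_name):
    # Assumption: an open property name is configured as an absolute XPath,
    # so it starts with "/"; a specific property name is plain text.
    return configured_name.startswith("/")


def resolve_property_name(configured_name, evaluate_xpath):
    """Return the property name, reading the page only for open names.

    evaluate_xpath is any callable that maps an XPath string to the text
    found at that path in the page.
    """
    if is_open_property_name(configured_name):
        return evaluate_xpath(configured_name)
    return configured_name


# Stand-in for a real DOM lookup, used only for this demonstration.
fake_page = {"/html/body/div[4]/div[2]/div/dl[1]/dt[1]": "foreign language title"}
print(resolve_property_name("company business", fake_page.get))
print(resolve_property_name("/html/body/div[4]/div[2]/div/dl[1]/dt[1]", fake_page.get))
```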
For example, as shown in Fig. 5, the HTML page information is parsed with DOM to generate the corresponding DOM tree. The DOM tree contains multiple content nodes, each of which is an HTML tag or the text content inside an HTML tag. After the DOM tree is created, the content nodes in the DOM tree are traversed according to the XPaths in the configuration information list shown in Table 4, and the corresponding node content is obtained as the information value of each XPath. For example, when the XPath is /html/head/title, the html node, head node and title node of the DOM tree shown in Fig. 5 are traversed in turn, and the text content "My title" of the title node is obtained as the information value of the XPath. In this way the information value of each XPath is obtained along its traversal path, and the attribute information list shown in Table 5 is finally generated. The attribute information list includes the specific property name "title" and its XPath information value "XXX", the open property name and its XPath information value "foreign language title", and the open attribute value and its XPath information value "ABC".
Table 5. Attribute information list
Property name / attribute value | XPath information value
Title | XXX
Open property name | Foreign language title
Open attribute value | ABC
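The traversal of the DOM tree along a path such as /html/head/title can be illustrated with a small hand-written walker. This is a simplified sketch rather than a full XPath engine: it handles only absolute paths made of tag names with an optional 1-based index, treats a step without an index as the first matching child, and relies on Python's `xml.etree.ElementTree`, which requires well-formed markup. The sample page and the names `parse_step` and `xpath_text` are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# A step is a tag name with an optional 1-based position, e.g. "dl[1]".
STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def parse_step(step):
    match = STEP.match(step)
    if match is None:
        raise ValueError("unsupported path step: %r" % step)
    tag, index = match.groups()
    return tag, int(index) if index else 1


def xpath_text(root, path):
    """Walk an absolute path from the root and return the node's text."""
    steps = [parse_step(s) for s in path.split("/") if s]
    if not steps or steps[0] != (root.tag, 1):
        return None
    node = root
    for tag, position in steps[1:]:
        same_tag = [child for child in node if child.tag == tag]
        if position < 1 or position > len(same_tag):
            return None  # the path does not exist in this page
        node = same_tag[position - 1]
    return (node.text or "").strip()


# Hypothetical, well-formed page used only for this demonstration.
PAGE = """<html><head><title>My title</title></head>
<body><dl><dt>Foreign language title</dt><dd>ABC</dd>
<dt>Headquarters location</dt><dd>Shenzhen, China</dd></dl></body></html>"""
root = ET.fromstring(PAGE)

print(xpath_text(root, "/html/head/title"))
print(xpath_text(root, "/html/body/dl[1]/dt[1]"))
print(xpath_text(root, "/html/body/dl[1]/dd[1]"))
```

Real web pages are often not well-formed XML, so a production system would more likely use an HTML parser with full XPath support; the walker above only shows how a path maps onto successive child lookups.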
S403: establish a mapping relationship between the property name and the attribute value, and output it as the information acquisition result.
In a specific implementation, if the property name is a specific property name, the XPath information value corresponding to the specific property name is used as the attribute value of that property name. If the property name is an open property name, a mapping relationship is established between the XPath information value of the open property name and the XPath information value of the open attribute value. The results are output in the format <property name, attribute value>.
For example, in the attribute information list shown in Table 5, the attribute value corresponding to the specific property name "title" is "XXX", and the attribute value corresponding to the open property name "foreign language title" is the XPath information value "ABC" of the open attribute value. They can be output as <title, XXX> and <foreign language title, ABC> respectively.
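A minimal sketch of this output step, using the values of Table 5, is given below. The dictionary representation of the attribute information list, the marker keys and the function name `build_records` are assumptions; only the pairing rule and the output format come from the description.

```python
OPEN_NAME = "Open property name"
OPEN_VALUE = "Open attribute value"


def build_records(attribute_info):
    """Turn an attribute information list into (property name, value) pairs."""
    records = []
    for label, text in attribute_info.items():
        if label in (OPEN_NAME, OPEN_VALUE):
            continue
        # Specific property name: the extracted text is its attribute value.
        records.append((label, text))
    if OPEN_NAME in attribute_info and OPEN_VALUE in attribute_info:
        # Open pair: the extracted name is mapped to the extracted value.
        records.append((attribute_info[OPEN_NAME], attribute_info[OPEN_VALUE]))
    return records


# Values taken from Table 5.
TABLE_5 = {
    "Title": "XXX",
    OPEN_NAME: "Foreign language title",
    OPEN_VALUE: "ABC",
}

for name, value in build_records(TABLE_5):
    print("<%s, %s>" % (name, value))
```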
In this embodiment of the invention, a first traversal path of a property name and a second traversal path of an attribute value are obtained first; the property name is then obtained from page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, a mapping relationship between the property name and the attribute value is established and output as the information acquisition result. Because both the property name and the attribute value are obtained through traversal paths and then mapped to each other, the accuracy of information acquisition is improved.
Referring to Fig. 6, Fig. 6 is a schematic flowchart of another information acquisition method proposed by an embodiment of the invention. The method includes but is not limited to the following steps:
S601: obtain a first traversal path of a property name and a second traversal path of an attribute value.
In a specific implementation, a service request sent by user equipment may first be received, the service request being used to request page information, and the URL of the page information is then obtained. According to the URL, the attribute identifier of the property name and the attribute value is obtained. Finally, according to the attribute identifier, the first traversal path and the second traversal path are obtained from a configuration file, where the configuration file contains the correspondence between the attribute identifier and the first and second traversal paths.
The system includes the configuration files pattern.conf and xpath.conf. pattern.conf contains each pattern and its corresponding pattern_id; for example, as shown in Table 1, pattern.conf contains pattern_id 0 and pattern_id 1 and their corresponding patterns. xpath.conf contains the pattern_id, the property names and the XPaths of the attribute values. The property names in xpath.conf include specific property names and property names in XPath form. A property name in XPath form is an open property name, and the attribute value corresponding to an open property name is an open attribute value. Note that an open property name has exactly one corresponding open attribute value. For example, as shown in Table 6, pattern_id 0 corresponds to two kinds of property names. The first kind is a specific property name: as shown in the first row of Table 6, the property name "title" is a specific property name, and the XPath of its attribute value is "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind is a property name in XPath form: as shown in the second and third rows of Table 6, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" are property names in XPath form, and the XPaths of their attribute values are "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dd[2]".
Table 6. Modified xpath.conf file
For example, after a service request is received, the configuration files pattern.conf and xpath.conf are loaded first. The URL of the page information requested by the user equipment is then obtained and matched against the regular expressions in pattern.conf to obtain the corresponding pattern and pattern_id, and a pattern_id list is generated. According to the pattern_id list, the specific property names and the XPaths of their attribute values, as well as the XPaths of the open property names and the XPaths of the corresponding open attribute values, are obtained from xpath.conf. If there are n open property names in total, they may be named open property name_1, open property name_2, ..., open property name_n, and the corresponding open attribute values are named open attribute value_1, open attribute value_2, ..., open attribute value_n. For example, the pattern_id may first be obtained from pattern.conf as shown in Table 1; then, according to the pattern_id, the property names, the XPaths of the attribute values and the XPaths of the property names are obtained from the xpath.conf file shown in Table 6, and the configuration information list shown in Table 7 is generated. The configuration information list includes pattern_id 0, the specific property name "title" and the XPath of its attribute value "/html/body/div[4]/div[2]/div/div[2]/dd/h1", the XPath of open property name_1 "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and the XPath of the corresponding open attribute value_1 "/html/body/div[4]/div[2]/div/dl[1]/dd[1]", and the XPath of open property name_2 "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" and the XPath of the corresponding open attribute value_2 "/html/body/div[4]/div[2]/div/dl[1]/dd[2]".
Table 7. Configuration information list
pattern_id | Property name / attribute value | XPath
0 | Title | /html/body/div[4]/div[2]/div/div[2]/dd/h1
0 | Open property name_1 | /html/body/div[4]/div[2]/div/dl[1]/dt[1]
0 | Open attribute value_1 | /html/body/div[4]/div[2]/div/dl[1]/dd[1]
0 | Open property name_2 | /html/body/div[4]/div[2]/div/dl[1]/dt[2]
0 | Open attribute value_2 | /html/body/div[4]/div[2]/div/dl[1]/dd[2]
S602: obtain the property name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path.
In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes multiple content nodes. The content nodes of the structure traversal tree are traversed, the property name is obtained according to the first traversal path, and the attribute value is obtained according to the second traversal path.
Optionally, before the property name is obtained from the page information according to the first traversal path, the type of the property name may be determined. If the property name is an open property name (in XPath form), the property name is obtained from the page information according to the first traversal path. If the property name is a specific property name, for example "company business" or "development history", then "company business" or "development history" can be used directly as the property name, so in this case the property name does not need to be obtained from the page information according to the first traversal path.
For example, as shown in Fig. 5, the HTML page information is parsed with DOM to generate the corresponding DOM tree. The DOM tree contains multiple content nodes, each of which is an HTML tag or the text content inside an HTML tag. After the DOM tree is created, the content nodes in the DOM tree are traversed according to the XPaths in the configuration information list shown in Table 7, and the corresponding node content is obtained as the information value of each XPath. For example, when the XPath is /html/head/title, the html node, head node and title node of the DOM tree shown in Fig. 5 are traversed in turn, and the text content "My title" of the title node is obtained as the information value of the XPath. In this way the information value of each XPath is obtained along its traversal path, and the attribute information list shown in Table 8 is finally generated. It includes the specific property name "title" and its XPath information value "XXX", open property name_1 and its XPath information value "foreign language title", open attribute value_1 and its XPath information value "ABC", open property name_2 and its XPath information value "headquarters location", and open attribute value_2 and its XPath information value "Shenzhen, China".
Table 8. Attribute information list
Attribute name / attribute value | XPath information value
title | XXX
open attribute name_1 | foreign language title
open attribute value_1 | ABC
open attribute name_2 | general headquarters place
open attribute value_2 | China Shenzhen
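The traversal that produces an attribute information list of this kind can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the implementation of the embodiment: the page is a small, well-formed, hypothetical document parsed with the standard-library `xml.etree.ElementTree` (real HTML would usually need a more lenient parser), and the helper `text_at` supports only absolute paths made of tag names with an optional 1-based index, such as /html/body/dl[1]/dt[2].

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page used only for illustration.
PAGE = """
<html>
  <head><title>My title</title></head>
  <body>
    <dl>
      <dt>foreign language title</dt><dd>ABC</dd>
      <dt>general headquarters place</dt><dd>China Shenzhen</dd>
    </dl>
  </body>
</html>
""".strip()

STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def text_at(root, path):
    """Walk an absolute path node by node; return the node text, or None if absent."""
    steps = [s for s in path.split("/") if s]
    if not steps:
        return None
    first = STEP.match(steps[0])
    if first is None or first.group(1) != root.tag:
        return None
    node = root
    for step in steps[1:]:
        m = STEP.match(step)
        if m is None:
            return None
        tag, index = m.group(1), int(m.group(2) or 1)
        same_tag = [child for child in node if child.tag == tag]
        if index < 1 or index > len(same_tag):
            return None
        node = same_tag[index - 1]  # XPath positions are 1-based
    return (node.text or "").strip()


# Hypothetical configuration list: (label, XPath).
CONFIG = [
    ("title", "/html/head/title"),
    ("open attribute name_1", "/html/body/dl[1]/dt[1]"),
    ("open attribute value_1", "/html/body/dl[1]/dd[1]"),
    ("open attribute name_2", "/html/body/dl[1]/dt[2]"),
    ("open attribute value_2", "/html/body/dl[1]/dd[2]"),
]

root = ET.fromstring(PAGE)
attribute_info = [(label, text_at(root, xpath)) for label, xpath in CONFIG]
for label, value in attribute_info:
    print(label, "->", value)
```

Each row of `attribute_info` pairs a label with the text found at its XPath, which is the same shape as the rows of Table 8.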
S603: Obtain the first mapping tag of the attribute name and the second mapping tag of the attribute value.
In a specific implementation, if the attribute name/attribute value is open attribute name_n or open attribute value_n, the number "n" in open attribute name_n may be obtained as the first mapping tag of the information value of the corresponding XPath, and the number "n" in open attribute value_n may be obtained as the second mapping tag of the information value of the corresponding XPath, where n may be any integer such as 1, 2, 3 and so on. For example, in the attribute information list shown in Table 8, the number "1" in open attribute name_1 is obtained as the first mapping tag of the corresponding XPath information value "foreign language title", and the number "1" in open attribute value_1 is obtained as the second mapping tag of the corresponding XPath information value "ABC".
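A minimal Python sketch of how the number "n" might be read from such a label is given below. The label spelling ("open attribute name_n", "open attribute value_n") and the function name `mapping_tag` are assumptions made for illustration only.

```python
import re

# Assumed label format: "open attribute name_<n>" or "open attribute value_<n>".
LABEL = re.compile(r"^open attribute (name|value)_(\d+)$")


def mapping_tag(label):
    """Return (kind, n) for an open label, or None for a specific attribute name."""
    m = LABEL.match(label)
    if m is None:
        return None  # a specific attribute name carries no mapping tag
    return m.group(1), int(m.group(2))


print(mapping_tag("open attribute name_1"))
print(mapping_tag("open attribute value_1"))
print(mapping_tag("title"))
```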
S604: Establish the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag, and output the information acquisition result.
In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is taken as the attribute value of that specific attribute name, and the two may be output in the form <attribute name: attribute value>. For example, in the attribute information list shown in Table 8, the attribute value corresponding to the specific attribute name "title" is "XXX", and the two are output as <title, XXX>.
If the attribute name/attribute value is open attribute name_n or open attribute value_n, the information value of the XPath corresponding to open attribute name_n is stored, as an attribute name, at the nth position of the open attribute name list. Similarly, the information value of the XPath corresponding to open attribute value_n is stored, as an attribute value, at the nth position of the open attribute value list. The attribute information list is traversed from open attribute name_1 to open attribute name_n and from open attribute value_1 to open attribute value_n. Finally, when the first mapping tag is the same as the second mapping tag, a mapping relationship is established between the attribute name corresponding to the first mapping tag and the attribute value corresponding to the second mapping tag, and the result may be output in the form <attribute name, attribute value>.
For example, as shown in Table 9-1 and Table 9-2, the first mapping tag of the attribute name "foreign language title" in the open attribute name list is 1, and the second mapping tag of the attribute value "ABC" in the open attribute value list is 1. The first mapping tag of the attribute name "foreign language title" is therefore the same as the second mapping tag of the attribute value "ABC", so a mapping relationship between "foreign language title" and "ABC" is established and the two are output in the form <foreign language title: ABC>. Similarly, the first mapping tag of the attribute name "general headquarters place" is 2 and the second mapping tag of the attribute value "China Shenzhen" is also 2, so a mapping relationship between "general headquarters place" and "China Shenzhen" can be established, and <general headquarters place: China Shenzhen> is output.
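The pairing by equal mapping tags can be sketched in Python as follows. This is an illustrative sketch rather than the embodiment itself: it uses dictionaries keyed by the tag instead of the position-indexed lists described above, and the label spelling and the function name `build_mapping` are assumptions.

```python
def build_mapping(attribute_info):
    """Pair open names and values whose numeric tags match.

    attribute_info: list of (label, value) tuples shaped like the rows of Table 8.
    """
    names, values, result = {}, {}, []
    for label, value in attribute_info:
        prefix, _, suffix = label.rpartition("_")
        if prefix == "open attribute name" and suffix.isdigit():
            names[int(suffix)] = value       # first mapping tag -> attribute name
        elif prefix == "open attribute value" and suffix.isdigit():
            values[int(suffix)] = value      # second mapping tag -> attribute value
        else:
            result.append((label, value))    # specific name: the XPath value is its value
    for tag in sorted(names):
        if tag in values:                    # equal tags yield one mapping
            result.append((names[tag], values[tag]))
    return result


TABLE_8 = [
    ("title", "XXX"),
    ("open attribute name_1", "foreign language title"),
    ("open attribute value_1", "ABC"),
    ("open attribute name_2", "general headquarters place"),
    ("open attribute value_2", "China Shenzhen"),
]

for name, value in build_mapping(TABLE_8):
    print(f"<{name}: {value}>")
```

An open attribute name with no value of the same tag is simply left unpaired in this sketch; the source does not say how such a case is handled.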
In the embodiments of the present invention, the first traversal path of the attribute name and the second traversal path of the attribute value are obtained first; the attribute name is then obtained from the page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, the mapping relationship between the attribute name and the attribute value is established and output as the information acquisition result. Because both the attribute name and the attribute value are obtained through traversal paths and are then mapped to each other, the accuracy of information acquisition is improved.
Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an information acquisition device according to an embodiment of the present invention. The information acquisition device may include:
A path acquisition module 701, configured to obtain the first traversal path of the attribute name and the second traversal path of the attribute value.
In a specific implementation, a service request sent by user equipment may be received first, where the service request is used to request page information, and the URL of the page information is then obtained. According to the URL, the attribute identifier of the attribute name and the attribute value is obtained. Finally, according to the attribute identifier, the first traversal path and the second traversal path are obtained from a configuration file, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
The system includes a configuration file pattern.conf and a configuration file xpath.conf. The configuration file pattern.conf includes each pattern and its corresponding pattern_id. For example, as shown in Table 1, the pattern.conf file includes pattern_id 0 and pattern_id 1 and their corresponding patterns. The configuration file xpath.conf includes the pattern_id, the attribute name and the XPath of the attribute value. The attribute names in the xpath.conf file include specific attribute names and attribute names in XPath form, where an attribute name in XPath form is an open attribute name and the attribute value corresponding to an open attribute name is an open attribute value. It should be noted that one open attribute name has only one corresponding open attribute value. For example, as shown in Table 6, pattern_id 0 corresponds to two kinds of attribute name. The first kind is a specific attribute name: as shown in the first row of Table 6, the attribute name "title" is a specific attribute name, and the XPath of its corresponding attribute value is "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind is an attribute name in XPath form: as shown in the second and third rows of Table 6, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" in the attribute name column are attribute names in XPath form, and the XPaths of their corresponding attribute values are "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dd[2]".
For example, after the service request is received, the configuration file pattern.conf and the configuration file xpath.conf are loaded first, and the URL of the page information requested by the user equipment is then obtained. The URL of the page is parsed by regular expression and matched against the configuration file pattern.conf to obtain the corresponding pattern and pattern_id, and a pattern_id list is generated. According to the pattern_id list, the specific attribute names and the XPaths of their corresponding attribute values, as well as the XPaths of the open attribute names and the XPaths of their corresponding open attribute values, are obtained from the configuration file xpath.conf. If there are n open attribute names in total, they may be named open attribute name_1, open attribute name_2, ..., open attribute name_n, and the corresponding open attribute values are named open attribute value_1, open attribute value_2, ..., open attribute value_n. For example, the pattern_id may first be obtained from the configuration file pattern.conf shown in Table 1; then, according to the pattern_id, the XPath of the attribute name, the XPath of the attribute value or the attribute name itself is obtained from the xpath.conf file shown in Table 6, and the configuration information list shown in Table 7 is generated. The configuration information list includes pattern_id 0, the specific attribute name "title" and the XPath "/html/body/div[4]/div[2]/div/div[2]/dd/h1" of its corresponding attribute value, the XPath "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" of open attribute name_1 and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" of the corresponding open attribute value_1, and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" of open attribute name_2 and the XPath "/html/body/div[4]/div[2]/div/dl[1]/dd[2]" of the corresponding open attribute value_2.
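The lookup from URL to configuration information list can be sketched in Python as follows. This is only an illustration under stated assumptions: the actual contents and file format of pattern.conf and xpath.conf (Tables 1 and 6) are not reproduced here, so the URL patterns in `PATTERNS`, the in-memory dictionaries and the function names are hypothetical; only the XPaths are taken from the text above.

```python
import re

# Hypothetical stand-in for pattern.conf: pattern_id -> URL regular expression.
PATTERNS = {
    0: r"^https?://example\.com/company/\d+$",
    1: r"^https?://example\.com/person/\d+$",
}

# Hypothetical stand-in for xpath.conf:
# pattern_id -> rows of (attribute name, XPath of the attribute value).
XPATHS = {
    0: [
        ("title", "/html/body/div[4]/div[2]/div/div[2]/dd/h1"),
        ("/html/body/div[4]/div[2]/div/dl[1]/dt[1]", "/html/body/div[4]/div[2]/div/dl[1]/dd[1]"),
        ("/html/body/div[4]/div[2]/div/dl[1]/dt[2]", "/html/body/div[4]/div[2]/div/dl[1]/dd[2]"),
    ],
}


def matching_pattern_ids(url):
    """Return the pattern_id list for every pattern that matches the URL."""
    return [pid for pid, pattern in PATTERNS.items() if re.search(pattern, url)]


def configuration_list(url):
    """Return (pattern_id, label, xpath) rows for the patterns that match the URL."""
    rows = []
    for pid in matching_pattern_ids(url):
        n = 0
        for name, value_xpath in XPATHS.get(pid, []):
            if name.startswith("/"):  # assumption: an open name is given as an XPath
                n += 1
                rows.append((pid, f"open attribute name_{n}", name))
                rows.append((pid, f"open attribute value_{n}", value_xpath))
            else:                     # specific name: only the value has an XPath
                rows.append((pid, name, value_xpath))
    return rows


for row in configuration_list("https://example.com/company/42"):
    print(row)
```

Numbering the open names in configuration order gives each open attribute name and its value the same suffix, which is what the later mapping step relies on.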
An information acquisition module 702, configured to obtain the attribute name from the page information according to the first traversal path, and to obtain the attribute value from the page information according to the second traversal path.
In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes multiple content nodes. The multiple content nodes on the structure traversal tree are traversed, the attribute name is obtained according to the first traversal path, and the attribute value is obtained according to the second traversal path.
Optionally, before the attribute name is obtained from the page information according to the first traversal path, the type of the attribute name may be determined. If the attribute name is an open attribute name (given in XPath form), the attribute name is obtained from the page information according to the first traversal path. If the attribute name is a specific attribute name, such as "corporate business" or "development course", that text is used directly as the attribute name, so in this case the attribute name does not need to be obtained from the page information according to the first traversal path.
For example, as shown in Fig. 5, the HTML page information is parsed through the DOM to generate the corresponding DOM tree. The DOM tree contains multiple content nodes, and each content node represents an HTML tag or the text content inside an HTML tag. After the DOM tree is created, the content nodes in the DOM tree are traversed according to the XPaths in the configuration information list shown in Table 7, and the content of the corresponding node is obtained as the information value of each XPath. For example, when the XPath is /html/head/title, the html node, the head node and the title node in the DOM tree shown in Fig. 5 can be traversed in turn according to /html/head/title, and the text content "My title" of the title node is then obtained as the information value of the XPath. The information value of each XPath is obtained in this way according to its traversal path, and the attribute information list shown in Table 8 is finally generated. The list includes the specific attribute name "title" and its XPath information value "XXX", open attribute name_1 and its XPath information value "foreign language title", open attribute value_1 and its XPath information value "ABC", open attribute name_2 and its XPath information value "general headquarters place", and open attribute value_2 and its XPath information value "China Shenzhen".
A result output module 703, configured to establish the mapping relationship between the attribute name and the attribute value and to output it as the information acquisition result.
In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is taken as the attribute value of that specific attribute name, and the two may be output in the form <attribute name: attribute value>. For example, in the attribute information list shown in Table 8, the attribute value corresponding to the specific attribute name "title" is "XXX", and the two are output as <title, XXX>.
If the attribute name/attribute value is open attribute name_n or open attribute value_n, the number "n" in open attribute name_n is first obtained as the first mapping tag of the information value of the corresponding XPath, and the information value of the XPath corresponding to open attribute name_n is stored, as an attribute name, at the nth position of the open attribute name list. Similarly, the number "n" in open attribute value_n is obtained as the second mapping tag of the information value of the corresponding XPath, and the information value of the XPath corresponding to open attribute value_n is stored, as an attribute value, at the nth position of the open attribute value list, where n may be any integer such as 1, 2, 3 and so on. The attribute information list is traversed from open attribute name_1 to open attribute name_n and from open attribute value_1 to open attribute value_n. For example, in the attribute information list shown in Table 8, the number "1" in open attribute name_1 is obtained as the first mapping tag of the corresponding XPath information value "foreign language title", and "foreign language title" is stored as an attribute name at the 1st position of the open attribute name list; the number "1" in open attribute value_1 is obtained as the second mapping tag of the corresponding XPath information value "ABC", and "ABC" is stored as an attribute value at the 1st position of the open attribute value list.
Finally, when the first mapping tag is the same as the second mapping tag, a mapping relationship is established between the attribute name corresponding to the first mapping tag and the attribute value corresponding to the second mapping tag, and the result may be output in the form <attribute name, attribute value>.
For example, as shown in Table 9-1 and Table 9-2, the first mapping tag of the attribute name "foreign language title" in the open attribute name list is 1, and the second mapping tag of the attribute value "ABC" in the open attribute value list is 1. The first mapping tag of the attribute name "foreign language title" is therefore the same as the second mapping tag of the attribute value "ABC", so a mapping relationship between "foreign language title" and "ABC" is established and the two are output in the form <foreign language title: ABC>. Similarly, the first mapping tag of the attribute name "general headquarters place" is 2 and the second mapping tag of the attribute value "China Shenzhen" is also 2, so a mapping relationship between "general headquarters place" and "China Shenzhen" can be established, and <general headquarters place: China Shenzhen> is output.
In the embodiments of the present invention, the first traversal path of the attribute name and the second traversal path of the attribute value are obtained first; the attribute name is then obtained from the page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, the mapping relationship between the attribute name and the attribute value is established and output as the information acquisition result. Because both the attribute name and the attribute value are obtained through traversal paths and are then mapped to each other, the accuracy of information acquisition is improved.
Referring further to Fig. 8, Fig. 8 is a schematic structural diagram of an information acquisition apparatus according to an embodiment of the present invention. As shown in the figure, the apparatus may include at least one processor 801, at least one communication interface 802, at least one memory 803 and at least one communication bus 804.
The processor 801 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field-programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logic blocks, modules and circuits described in connection with the disclosure of the present invention. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 804 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus and so on. For ease of illustration, only one thick line is used in Fig. 8, but this does not mean that there is only one bus or one type of bus. The communication bus 804 is used to implement connection and communication between these components. The communication interface 802 of the apparatus in this embodiment of the present invention is used for signaling or data communication with other node devices. The memory 803 may include random access memory, such as non-volatile random access memory (Nonvolatile Random Access Memory, NVRAM), phase-change random access memory (Phase Change RAM, PRAM) and magnetoresistive random access memory (Magnetoresistive RAM, MRAM), and may also include other non-volatile memory, for example at least one magnetic disk storage device, electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), flash memory devices such as NOR flash memory or NAND flash memory, and semiconductor devices such as a solid state disk (Solid State Disk, SSD). Optionally, the memory 803 may also be at least one storage device located remotely from the aforementioned processor 801. A set of program code is stored in the memory 803, and the processor 801 executes the program in the memory 803 to perform the following operations:
obtaining the first traversal path of the attribute name and the second traversal path of the attribute value;
obtaining the attribute name from the page information according to the first traversal path, and obtaining the attribute value from the page information according to the second traversal path;
establishing the mapping relationship between the attribute name and the attribute value, and outputting it as the information acquisition result.
Optionally, the processor 801 is further configured to perform the following operations:
obtaining the first mapping tag of the attribute name and the second mapping tag of the attribute value;
establishing the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag.
Optionally, the processor 801 is further configured to perform the following operation:
when the first mapping tag is the same as the second mapping tag, establishing the mapping relationship between the attribute name and the attribute value.
Optionally, the processor 801 is further configured to perform the following operations:
creating a structure traversal tree according to the page information, where the structure traversal tree includes multiple content nodes;
traversing the multiple content nodes on the structure traversal tree, obtaining the attribute name according to the first traversal path, and obtaining the attribute value according to the second traversal path.
Optionally, the processor 801 is further configured to perform the following operations:
obtaining the attribute identifier of the attribute name and the attribute value;
obtaining the first traversal path and the second traversal path from a configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
Optionally, the processor 801 is further configured to perform the following operations:
obtaining the uniform resource locator of the page information;
obtaining the attribute identifier of the attribute name and the attribute value according to the uniform resource locator.
Optionally, the processor 801 is further configured to perform the following operations:
determining the type of the attribute name;
if the attribute name is an open attribute name, obtaining the attribute name according to the first traversal path of the attribute name.
Further, the processor may also cooperate with the memory and the communication interface to perform the operations of the information acquisition device in the foregoing embodiments of the invention.
The above embodiments may be implemented wholly or partly by software, hardware, firmware or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the flows or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center by wired means (such as coaxial cable, optical fiber or digital subscriber line (DSL)) or wireless means (such as infrared, radio or microwave). The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device, such as a server or a data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk or a magnetic tape), an optical medium (for example, a DVD) or a semiconductor medium (for example, a solid state disk (Solid State Disk, SSD)).
The specific implementations described above explain the purpose, technical solutions and beneficial effects of the present invention in further detail. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.


Application Number | Priority Date | Filing Date | Title | Status | Publication
CN201810009236.XA | 2018-01-03 | 2018-01-03 | Information acquisition method and related equipment | Active | CN108334560B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201810009236.XA | 2018-01-03 | 2018-01-03 | Information acquisition method and related equipment

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201810009236.XA | 2018-01-03 | 2018-01-03 | Information acquisition method and related equipment

Publications (2)

Publication Number | Publication Date
CN108334560A (en) | 2018-07-27
CN108334560B (en) | 2022-04-15

Family

ID=62924834

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN201810009236.XA | Active | CN108334560B (en) | 2018-01-03 | 2018-01-03 | Information acquisition method and related equipment

Country Status (1)

Country | Link
CN (1) | CN108334560B (en)

Also Published As

Publication Number | Publication Date
CN108334560B (en) | 2022-04-15

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
