Summary of the Invention
Embodiments of the present invention provide an information acquisition method and a related device that can improve the accuracy of information acquisition.
A first aspect of the present invention provides an information acquisition method, including:
obtaining a first traversal path of an attribute name and a second traversal path of an attribute value;
obtaining the attribute name from page information according to the first traversal path, and obtaining the attribute value from the page information according to the second traversal path; and
establishing a mapping relationship between the attribute name and the attribute value, and outputting the mapping relationship as an information acquisition result.
Establishing the mapping relationship between the attribute name and the attribute value includes:
obtaining a first mapping tag of the attribute name and a second mapping tag of the attribute value; and
establishing the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag.
Establishing the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag includes:
when the first mapping tag is identical to the second mapping tag, establishing the mapping relationship between the attribute name and the attribute value.
Obtaining the attribute name from the page information according to the first traversal path and obtaining the attribute value from the page information according to the second traversal path includes:
creating a structure traversal tree according to the page information, where the structure traversal tree includes multiple content nodes; and
traversing the multiple content nodes on the structure traversal tree, obtaining the attribute name according to the first traversal path, and obtaining the attribute value according to the second traversal path.
Obtaining the first traversal path of the attribute name and the second traversal path of the attribute value includes:
obtaining an attribute identifier of the attribute name and the attribute value; and
obtaining the first traversal path and the second traversal path from a configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
Obtaining the attribute identifier of the attribute name and the attribute value includes:
obtaining a uniform resource locator of the page information; and
obtaining the attribute identifier of the attribute name and the attribute value according to the uniform resource locator.
Obtaining the attribute name from the page information according to the first traversal path includes:
determining the type of the attribute name; and
if the attribute name is an open attribute name, obtaining the attribute name according to the first traversal path of the attribute name.
Correspondingly, a second aspect of the present invention provides an information acquisition device, including:
a path acquisition module, configured to obtain a first traversal path of an attribute name and a second traversal path of an attribute value;
an information acquisition module, configured to obtain the attribute name from page information according to the first traversal path, and to obtain the attribute value from the page information according to the second traversal path; and
a result output module, configured to establish a mapping relationship between the attribute name and the attribute value and to output the mapping relationship as an information acquisition result.
The result output module is specifically configured to:
obtain a first mapping tag of the attribute name and a second mapping tag of the attribute value; and
establish the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag.
The result output module is specifically configured to:
establish the mapping relationship between the attribute name and the attribute value when the first mapping tag is identical to the second mapping tag.
The information acquisition module is specifically configured to:
create a structure traversal tree according to the page information, where the structure traversal tree includes multiple content nodes; and
traverse the multiple content nodes on the structure traversal tree, obtain the attribute name according to the first traversal path, and obtain the attribute value according to the second traversal path.
The path acquisition module is specifically configured to:
obtain an attribute identifier of the attribute name and the attribute value; and
obtain the first traversal path and the second traversal path from a configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
The path acquisition module is specifically configured to:
obtain a uniform resource locator of the page information; and
obtain the attribute identifier of the attribute name and the attribute value according to the uniform resource locator.
The information acquisition module is specifically configured to:
determine the type of the attribute name; and
if the attribute name is an open attribute name, obtain the attribute name according to the first traversal path of the attribute name.
In a third aspect, the present invention provides an information acquisition apparatus, including a processor, a memory, and a communication bus, where the communication bus is used to implement connection and communication between the processor and the memory, and the processor executes a program stored in the memory to implement the steps of the information acquisition method provided in the first aspect.
In a possible design, the information acquisition apparatus provided by the present invention may include modules corresponding to the behaviors performed in the above method. The modules may be software and/or hardware.
Yet another aspect of the present invention provides a computer-readable storage medium that stores a plurality of instructions, where the instructions are adapted to be loaded by a processor to execute the methods described in the above aspects.
Yet another aspect of the present invention provides a computer program product including instructions that, when run on a computer, cause the computer to execute the methods described in the above aspects.
In implementing the embodiments of the present invention, the first traversal path of the attribute name and the second traversal path of the attribute value are obtained first; the attribute name is then obtained from the page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, the mapping relationship between the attribute name and the attribute value is established and output as the information acquisition result. Because both the attribute name and the attribute value are obtained through traversal paths and the attribute name is mapped to the attribute value, the accuracy of information acquisition is improved.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 2, Fig. 2 is a schematic structural diagram of an information acquisition system provided in an embodiment of the present invention. The information acquisition system includes user equipment 201 and a server 202. The server 202 may be a website (Web) server capable of providing a network information browsing service. The user equipment 201 may refer to a device that provides voice and/or data connectivity to a user; it may be connected to a computing device such as a laptop computer or a desktop computer, or it may be a standalone device such as a personal digital assistant (Personal Digital Assistant, PDA). The server is configured to receive a service request sent by the user equipment, where the service request is used to request browsing of page information. The server then parses the uniform resource locator (Uniform Resource Locator, URL) of the page information, obtains the traversal path of the attribute name and the traversal path of the attribute value from the configuration files, obtains the attribute name according to the traversal path of the attribute name, obtains the attribute value according to the traversal path of the attribute value, and finally establishes the correspondence between the attribute name and the attribute value and sends it to the user equipment as the information acquisition result. The user equipment is configured to send the service request to the server, obtain the information acquisition result from the server, and display it.
Based on the above information acquisition system, as shown in Fig. 3, an information acquisition method proposed in an embodiment of the present invention includes the following steps.
S301: The system loads the configuration file pattern.conf and the configuration file xpath.conf. The pattern.conf file includes expression patterns (pattern) and the ID number (pattern_id) of each expression pattern. For example, as shown in Table 1, the pattern.conf file includes pattern_id 0 and pattern_id 1 and their corresponding patterns. The xpath.conf file includes the pattern_id, the attribute names, and the XPaths of the attribute values. For example, as shown in Table 2, pattern_id 0 contains the two attribute names "Title" and "Brief introduction" and the XPaths of their corresponding attribute values, and pattern_id 1 contains the attribute name "Label" and the XPaths of its two corresponding attribute values.
S302: The URL of the page information is parsed to obtain the pattern_id from the pattern.conf file, and the XPaths of the attribute values are then obtained from the xpath.conf file according to the pattern_id.
S303: A DOM tree of the page information is created. The XPath of each attribute value under the pattern_id is traversed, and the node content located by each XPath is obtained from the DOM tree as the corresponding attribute value.
S304: According to the correspondence between the attribute name and the attribute value, the result is output in the form <attribute name, attribute value>. If one attribute name corresponds to M attribute values, the result may be output in the form <attribute name, attribute value 1, attribute value 2, ..., attribute value M>. For example, if the attribute name "Label" corresponds to the two attribute values "stock name" and "company", then <Label, stock name, company> may be output.
Table 1. pattern.conf file
| pattern_id | pattern |
| 0 | ^https://baike\.baidu\.com/item/.+/\d+$ |
| 1 | ^https://baike\.baidu\.com/subview/\d+/\d+\.htm$ |
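The lookup from a page URL to a pattern_id can be sketched as follows. This is a minimal illustration, not the implementation described in this document: it assumes the two rows of Table 1 have already been loaded into a dictionary, and the names `PATTERNS`, `match_pattern_ids`, and the example URL are hypothetical.

```python
import re

# Hypothetical in-memory form of pattern.conf (Table 1): pattern_id -> URL pattern.
PATTERNS = {
    0: r"^https://baike\.baidu\.com/item/.+/\d+$",
    1: r"^https://baike\.baidu\.com/subview/\d+/\d+\.htm$",
}


def match_pattern_ids(url):
    """Return the pattern_id of every pattern that matches the given URL."""
    return [pid for pid, pattern in PATTERNS.items() if re.match(pattern, url)]


# Example URL invented for illustration only.
print(match_pattern_ids("https://baike.baidu.com/item/Example/12345"))
```

A URL that matches no pattern yields an empty list, in which case no XPaths would be looked up for that page.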
Table 2. xpath.conf file
However, across different pages the attribute values overlap heavily while the attribute names differ greatly. As a result, the attribute name that corresponds to an attribute value at the same XPath may be entirely different on different pages. A method that fixes the attribute names and obtains only the attribute values through XPath therefore yields low accuracy of information acquisition. To solve this problem, the present invention proposes the following solution.
Referring to Fig. 4, Fig. 4 is a schematic flowchart of another information acquisition method proposed in an embodiment of the present invention. The method includes but is not limited to the following steps:
S401: Obtain the first traversal path of the attribute name and the second traversal path of the attribute value.
In a specific implementation, a service request sent by the user equipment may be received first, where the service request is used to request page information, and the URL of the page information is then obtained. The attribute identifier of the attribute name and the attribute value is obtained according to the URL. Finally, the first traversal path and the second traversal path are obtained from the configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
The system includes the configuration file pattern.conf and the configuration file xpath.conf. The configuration file pattern.conf includes each pattern and its corresponding pattern_id; for example, as shown in Table 1, the pattern.conf file includes pattern_id 0 and pattern_id 1 and their corresponding patterns. The configuration file xpath.conf includes the pattern_id, the attribute names, and the XPaths of the attribute values. The attribute names in the xpath.conf file include specific attribute names and attribute names in XPath form. An attribute name in XPath form is an open attribute name, and the attribute value corresponding to an open attribute name is an open attribute value. It should be noted that each open attribute name has exactly one corresponding open attribute value. For example, as shown in Table 3, pattern_id 0 corresponds to two kinds of attribute names. The first kind is a specific attribute name: as shown in the first row of Table 3, the attribute name "Title" is a specific attribute name, and the XPath of its corresponding attribute value is "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind is an attribute name in XPath form: as shown in the second row of Table 3, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" is an attribute name in XPath form, and the XPath of its corresponding attribute value is "/html/body/div[4]/div[2]/div/dl[1]/dd[1]".
Table 3. Modified xpath.conf file
For example, after the service request is received, the configuration file pattern.conf and the configuration file xpath.conf are loaded first. The URL of the page information requested by the user equipment is then obtained and parsed by regular expression, and a matching query is performed against the configuration file pattern.conf to obtain the corresponding pattern and pattern_id and to generate a pattern_id list. According to the pattern_id list, the specific attribute names and the XPaths of their corresponding attribute values, as well as the XPaths of the open attribute names and the XPaths of their corresponding open attribute values, are obtained from the configuration file xpath.conf. For example, the pattern_id may first be obtained from the configuration file pattern.conf shown in Table 1; then, according to the pattern_id, the attribute names or the XPaths of the attribute names, together with the XPaths of the attribute values, are obtained from the xpath.conf file shown in Table 3, and the configuration information list shown in Table 4 is generated. The configuration information list includes pattern_id 0, the specific attribute name "Title" and the XPath of its corresponding attribute value "/html/body/div[4]/div[2]/div/div[2]/dd/h1", the XPath of the open attribute name "/html/body/div[4]/div[2]/div/dl[1]/dt[1]", and the XPath of the corresponding open attribute value "/html/body/div[4]/div[2]/div/dl[1]/dd[1]".
Table 4. Configuration information list
| pattern_id | Attribute name / attribute value | XPath |
| 0 | Title | /html/body/div[4]/div[2]/div/div[2]/dd/h1 |
| 0 | Open attribute name | /html/body/div[4]/div[2]/div/dl[1]/dt[1] |
| 0 | Open attribute value | /html/body/div[4]/div[2]/div/dl[1]/dd[1] |
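Building the configuration information list from the xpath.conf entries can be sketched as below. The on-disk layout of xpath.conf is not shown in this document, so the tuple layout of `XPATH_CONF`, the function names, and the rule that a name beginning with "/" is an open attribute name (that is, a name in XPath form) are all assumptions made for illustration.

```python
# Hypothetical in-memory form of xpath.conf entries:
# (pattern_id, attribute name or its XPath, XPath of the attribute value).
XPATH_CONF = [
    (0, "Title", "/html/body/div[4]/div[2]/div/div[2]/dd/h1"),
    (0, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]",
        "/html/body/div[4]/div[2]/div/dl[1]/dd[1]"),
]


def is_open_attribute_name(name):
    # Assumption: an attribute name written as an absolute XPath is "open".
    return name.startswith("/")


def build_config_list(pattern_id):
    """Return rows of (pattern_id, label, XPath) for one pattern_id."""
    rows = []
    for pid, name, value_xpath in XPATH_CONF:
        if pid != pattern_id:
            continue
        if is_open_attribute_name(name):
            # Both the name and the value must be fetched from the page.
            rows.append((pid, "open attribute name", name))
            rows.append((pid, "open attribute value", value_xpath))
        else:
            # A specific attribute name is already known; only the value is fetched.
            rows.append((pid, name, value_xpath))
    return rows


for row in build_config_list(0):
    print(row)
```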
S402: Obtain the attribute name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path.
In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes multiple content nodes. The multiple content nodes on the structure traversal tree are traversed, the attribute name is obtained according to the first traversal path, and the attribute value is obtained according to the second traversal path.
Optionally, before the attribute name is obtained from the page information according to the first traversal path, the type of the attribute name may be determined. If the attribute name is determined to be an open attribute name (in XPath form), the attribute name is obtained from the page information according to the first traversal path. If the attribute name is a specific attribute name, for example "corporate business" or "development course", then "corporate business", "development course", and so on can be used directly as the attribute name, so in this case the attribute name does not need to be obtained from the page information according to the first traversal path.
For example, as shown in Fig. 5, the HTML page information is parsed to generate the corresponding DOM tree. The DOM tree contains multiple content nodes, and each content node is either an HTML tag or the text content within an HTML tag. After the DOM tree is created, the content nodes in the DOM tree are traversed according to the XPaths in the configuration information list shown in Table 4, and the corresponding node content is obtained as the information value of each XPath. For example, when the XPath is /html/head/title, the html node, the head node, and the title node in the DOM tree shown in Fig. 5 can be traversed in turn according to /html/head/title, and the text content "My title" of the title node is obtained as the information value of that XPath. In this way the information value of each XPath is obtained according to its traversal path, and the attribute information list shown in Table 5 is finally generated. The attribute information list includes the specific attribute name "Title" and its corresponding XPath information value "XXX", the open attribute name and its corresponding XPath information value "Foreign language title", and the open attribute value and its corresponding XPath information value "ABC".
Table 5. Attribute information list
| Attribute name / attribute value | XPath information value |
| Title | XXX |
| Open attribute name | Foreign language title |
| Open attribute value | ABC |
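The node-by-node traversal described above can be sketched with the Python standard library. This is only an illustration under several assumptions: the page is well-formed markup that `xml.etree.ElementTree` can parse (real HTML pages often are not), the XPaths use only `tag` and `tag[n]` steps, and a step without an index is treated as the first matching child. The sample page and the names `PAGE`, `parse_step`, and `xpath_text` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page used only for this sketch.
PAGE = """<html>
  <head><title>My title</title></head>
  <body>
    <dl><dt>Foreign language title</dt><dd>ABC</dd></dl>
  </body>
</html>"""

# One path step: a tag name with an optional 1-based index, e.g. "div[4]".
STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def parse_step(step):
    match = STEP.match(step)
    if match is None:
        raise ValueError(f"unsupported XPath step: {step!r}")
    return match.group(1), int(match.group(2) or 1)


def xpath_text(root, xpath):
    """Walk an absolute tag[index] path from the root; return the node text or None."""
    steps = [s for s in xpath.split("/") if s]
    if not steps:
        return None
    tag, index = parse_step(steps[0])
    if root.tag != tag or index != 1:
        return None
    node = root
    for step in steps[1:]:
        tag, index = parse_step(step)
        same_tag = [child for child in node if child.tag == tag]
        if index < 1 or index > len(same_tag):
            return None  # the path does not exist on this page
        node = same_tag[index - 1]
    return (node.text or "").strip()


root = ET.fromstring(PAGE)
for path in ("/html/head/title", "/html/body/dl[1]/dt[1]", "/html/body/dl[1]/dd[1]"):
    print(path, "->", xpath_text(root, path))
```

Returning `None` for a missing path lets the caller skip attributes that a particular page does not contain.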
S403: Establish the mapping relationship between the attribute name and the attribute value, and output it as the information acquisition result.
In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is used as the attribute value of that specific attribute name. If the attribute name is an open attribute name, a mapping relationship is established between the XPath information value of the open attribute name and the XPath information value of the open attribute value. The result is output in the format <attribute name, attribute value>.
For example, in the attribute information list shown in Table 5, the attribute value corresponding to the specific attribute name "Title" is "XXX", and the attribute value corresponding to the open attribute name "Foreign language title" is "ABC", the XPath information value of the open attribute value. They can be output as <Title, XXX> and <Foreign language title, ABC>, respectively.
In this embodiment of the present invention, the first traversal path of the attribute name and the second traversal path of the attribute value are obtained first; the attribute name is then obtained from the page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, the mapping relationship between the attribute name and the attribute value is established and output as the information acquisition result. Because both the attribute name and the attribute value are obtained through traversal paths and the attribute name is mapped to the attribute value, the accuracy of information acquisition is improved.
Referring to Fig. 6, Fig. 6 is a schematic flowchart of another information acquisition method proposed in an embodiment of the present invention. The method includes but is not limited to the following steps:
S601: Obtain the first traversal path of the attribute name and the second traversal path of the attribute value.
In a specific implementation, a service request sent by the user equipment may be received first, where the service request is used to request page information, and the URL of the page information is then obtained. The attribute identifier of the attribute name and the attribute value is obtained according to the URL. Finally, the first traversal path and the second traversal path are obtained from the configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
The system includes the configuration file pattern.conf and the configuration file xpath.conf. The configuration file pattern.conf includes each pattern and its corresponding pattern_id; for example, as shown in Table 1, the pattern.conf file includes pattern_id 0 and pattern_id 1 and their corresponding patterns. The configuration file xpath.conf includes the pattern_id, the attribute names, and the XPaths of the attribute values. The attribute names in the xpath.conf file include specific attribute names and attribute names in XPath form. An attribute name in XPath form is an open attribute name, and the attribute value corresponding to an open attribute name is an open attribute value. It should be noted that each open attribute name has exactly one corresponding open attribute value. For example, as shown in Table 6, pattern_id 0 corresponds to two kinds of attribute names. The first kind is a specific attribute name: as shown in the first row of Table 6, the attribute name "Title" is a specific attribute name, and the XPath of its corresponding attribute value is "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind is an attribute name in XPath form: as shown in the second and third rows of Table 6, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" are attribute names in XPath form, and the XPaths of their corresponding attribute values are "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dd[2]", respectively.
Table 6. Modified xpath.conf file
For example, after the service request is received, the configuration file pattern.conf and the configuration file xpath.conf are loaded first. The URL of the page information requested by the user equipment is then obtained and parsed by regular expression, and a matching query is performed against the configuration file pattern.conf to obtain the corresponding pattern and pattern_id and to generate a pattern_id list. According to the pattern_id list, the specific attribute names and the XPaths of their corresponding attribute values, as well as the XPaths of the open attribute names and the XPaths of their corresponding open attribute values, are obtained from the configuration file xpath.conf. If there are n open attribute names in total, they may be named open attribute name_1, open attribute name_2, ..., open attribute name_n, and the corresponding open attribute values are named open attribute value_1, open attribute value_2, ..., open attribute value_n. For example, the pattern_id may first be obtained from the configuration file pattern.conf shown in Table 1; then, according to the pattern_id, the attribute names or the XPaths of the attribute names, together with the XPaths of the attribute values, are obtained from the xpath.conf file shown in Table 6, and the configuration information list shown in Table 7 is generated. The configuration information list includes pattern_id 0, the specific attribute name "Title" and the XPath of its corresponding attribute value "/html/body/div[4]/div[2]/div/div[2]/dd/h1", the XPath of open attribute name_1 "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and the XPath of the corresponding open attribute value_1 "/html/body/div[4]/div[2]/div/dl[1]/dd[1]", and the XPath of open attribute name_2 "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" and the XPath of the corresponding open attribute value_2 "/html/body/div[4]/div[2]/div/dl[1]/dd[2]".
Table 7. Configuration information list
| pattern_id | Attribute name / attribute value | XPath |
| 0 | Title | /html/body/div[4]/div[2]/div/div[2]/dd/h1 |
| 0 | Open attribute name_1 | /html/body/div[4]/div[2]/div/dl[1]/dt[1] |
| 0 | Open attribute value_1 | /html/body/div[4]/div[2]/div/dl[1]/dd[1] |
| 0 | Open attribute name_2 | /html/body/div[4]/div[2]/div/dl[1]/dt[2] |
| 0 | Open attribute value_2 | /html/body/div[4]/div[2]/div/dl[1]/dd[2] |
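The numbering of the n open attribute names and their open attribute values can be sketched as follows. The label strings, the function name `number_open_entries`, and the input layout (a list of XPath pairs in configuration order) are assumptions chosen for illustration.

```python
def number_open_entries(open_pairs):
    """Label each (name XPath, value XPath) pair with a shared 1-based number n."""
    rows = []
    for n, (name_xpath, value_xpath) in enumerate(open_pairs, start=1):
        rows.append((f"open attribute name_{n}", name_xpath))
        rows.append((f"open attribute value_{n}", value_xpath))
    return rows


# The two open entries described for pattern_id 0.
open_pairs = [
    ("/html/body/div[4]/div[2]/div/dl[1]/dt[1]", "/html/body/div[4]/div[2]/div/dl[1]/dd[1]"),
    ("/html/body/div[4]/div[2]/div/dl[1]/dt[2]", "/html/body/div[4]/div[2]/div/dl[1]/dd[2]"),
]
for label, xpath in number_open_entries(open_pairs):
    print(label, xpath)
```

Because a name and its value receive the same number, that number can later be used to pair them after their contents have been read from the page.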
S602: Obtain the attribute name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path.
In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes multiple content nodes. The multiple content nodes on the structure traversal tree are traversed, the attribute name is obtained according to the first traversal path, and the attribute value is obtained according to the second traversal path.
Optionally, before the attribute name is obtained from the page information according to the first traversal path, the type of the attribute name may be determined. If the attribute name is determined to be an open attribute name (in XPath form), the attribute name is obtained from the page information according to the first traversal path. If the attribute name is a specific attribute name, for example "corporate business" or "development course", then "corporate business", "development course", and so on can be used directly as the attribute name, so in this case the attribute name does not need to be obtained from the page information according to the first traversal path.
For example, as shown in Fig. 5, the HTML page information is parsed to generate the corresponding DOM tree. The DOM tree contains multiple content nodes, and each content node is either an HTML tag or the text content within an HTML tag. After the DOM tree is created, the content nodes in the DOM tree are traversed according to the XPaths in the configuration information list shown in Table 7, and the corresponding node content is obtained as the information value of each XPath. For example, when the XPath is /html/head/title, the html node, the head node, and the title node in the DOM tree shown in Fig. 5 can be traversed in turn according to /html/head/title, and the text content "My title" of the title node is obtained as the information value of that XPath. In this way the information value of each XPath is obtained according to its traversal path, and the attribute information list shown in Table 8 is finally generated. It includes the specific attribute name "Title" and its corresponding XPath information value "XXX", open attribute name_1 and its corresponding XPath information value "Foreign language title", open attribute value_1 and its corresponding XPath information value "ABC", open attribute name_2 and its corresponding XPath information value "General headquarters place", and open attribute value_2 and its corresponding XPath information value "China Shenzhen".
Table 8. Attribute information list
| Attribute name / attribute value | XPath information value |
| Title | XXX |
| Open attribute name_1 | Foreign language title |
| Open attribute value_1 | ABC |
| Open attribute name_2 | General headquarters place |
| Open attribute value_2 | China Shenzhen |
S603: Obtain the first mapping tag of the attribute name and the second mapping tag of the attribute value.
In a specific implementation, if the attribute name or attribute value is open attribute name_n or open attribute value_n, the number "n" in open attribute name_n may be obtained as the first mapping tag of the corresponding XPath information value, and the number "n" in open attribute value_n may be obtained as the second mapping tag of the corresponding XPath information value, where n may be any positive integer such as 1, 2, or 3. For example, in the attribute information list shown in Table 8, the number "1" in open attribute name_1 is obtained as the first mapping tag of the corresponding XPath information value "Foreign language title", and the number "1" in open attribute value_1 is obtained as the second mapping tag of the corresponding XPath information value "ABC".
S604: Establish the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag, and output the information acquisition result.
In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is used as the attribute value of that specific attribute name, and the pair may be output in the form <attribute name, attribute value>. For example, in the attribute information list shown in Table 8, the attribute value corresponding to the specific attribute name "Title" is "XXX", and the pair is output as <Title, XXX>.
If the attribute name or attribute value is open attribute name_n or open attribute value_n, the XPath information value corresponding to open attribute name_n is stored as an attribute name at the n-th position of the open attribute name list. Similarly, the XPath information value corresponding to open attribute value_n is stored as an attribute value at the n-th position of the open attribute value list. Open attribute name_1 through open attribute name_n and open attribute value_1 through open attribute value_n in the attribute information list are traversed in this way. Finally, when a first mapping tag is identical to a second mapping tag, a mapping relationship is established between the attribute name corresponding to the first mapping tag and the attribute value corresponding to the second mapping tag, and the pair may be output in the form <attribute name, attribute value>.
For example, as shown in Table 9-1 and Table 9-2, the first mapping tag of the attribute name "Foreign language title" in the open attribute name list is 1, and the second mapping tag of the attribute value "ABC" in the open attribute value list is 1. The first mapping tag of the attribute name "Foreign language title" is therefore identical to the second mapping tag of the attribute value "ABC", so the mapping relationship between "Foreign language title" and "ABC" is established and the pair is output as <Foreign language title, ABC>. Similarly, the first mapping tag of the attribute name "General headquarters place" is 2 and the second mapping tag of the attribute value "China Shenzhen" is also 2, so the mapping relationship between "General headquarters place" and "China Shenzhen" can be established and <General headquarters place, China Shenzhen> is output.
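Pairing names and values by mapping tag can be sketched as below, using the entries of the attribute information list described above. This is a minimal illustration: the label strings, the dictionaries standing in for the open attribute name list and the open attribute value list, and the function name `pair_by_mapping_tag` are assumptions.

```python
# Attribute information list from the example: (label, XPath information value).
attribute_info = [
    ("Title", "XXX"),
    ("open attribute name_1", "Foreign language title"),
    ("open attribute value_1", "ABC"),
    ("open attribute name_2", "General headquarters place"),
    ("open attribute value_2", "China Shenzhen"),
]


def pair_by_mapping_tag(info):
    """Return <attribute name, attribute value> pairs as a list of tuples."""
    names, values, result = {}, {}, []
    for label, text in info:
        kind, _, tag = label.rpartition("_")
        if kind == "open attribute name" and tag.isdigit():
            names[int(tag)] = text    # first mapping tag -> attribute name
        elif kind == "open attribute value" and tag.isdigit():
            values[int(tag)] = text   # second mapping tag -> attribute value
        else:
            # Specific attribute name: its information value is the attribute value.
            result.append((label, text))
    for tag in sorted(names):
        if tag in values:             # map only when the two tags are identical
            result.append((names[tag], values[tag]))
    return result


for name, value in pair_by_mapping_tag(attribute_info):
    print(f"<{name}, {value}>")
```

An open attribute name with no value carrying the same tag is simply left out of the result in this sketch; how such cases should be handled is not specified in the text.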
In this embodiment of the present invention, the first traversal path of the attribute name and the second traversal path of the attribute value are obtained first; the attribute name is then obtained from the page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, the mapping relationship between the attribute name and the attribute value is established and output as the information acquisition result. Because both the attribute name and the attribute value are obtained through traversal paths and the attribute name is mapped to the attribute value, the accuracy of information acquisition is improved.
Referring to Fig. 7, Fig. 7 is a schematic structural diagram of an information acquisition device proposed in an embodiment of the present invention. The information acquisition device may include:
a path acquisition module 701, configured to obtain the first traversal path of the attribute name and the second traversal path of the attribute value.
In a specific implementation, a service request sent by the user equipment may be received first, where the service request is used to request page information, and the URL of the page information is then obtained. The attribute identifier of the attribute name and the attribute value is obtained according to the URL. Finally, the first traversal path and the second traversal path are obtained from the configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first traversal path and the second traversal path.
The system includes a configuration file pattern.conf and a configuration file xpath.conf. The configuration file pattern.conf includes patterns and their corresponding pattern_id values. For example, as shown in Table 1, the pattern.conf file includes pattern_id 0 and pattern_id 1 and their corresponding patterns. The configuration file xpath.conf includes the pattern_id, the attribute name, and the XPath of the attribute value. The attribute names in the xpath.conf file are of two kinds: specific attribute names and attribute names in XPath form. An attribute name in XPath form is an open attribute name, and the attribute value corresponding to an open attribute name is an open attribute value. It should be noted that each open attribute name has exactly one corresponding open attribute value. For example, as shown in Table 6, pattern_id 0 corresponds to two kinds of attribute names. The first kind is a specific attribute name: as shown in the first row of Table 6, the attribute name "title" is a specific attribute name, and the XPath of its attribute value is "/html/body/div[4]/div[2]/div/div[2]/dd/h1". The second kind is an attribute name in XPath form: as shown in the second and third rows of Table 6, the attribute names "/html/body/div[4]/div[2]/div/dl[1]/dt[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dt[2]" are in XPath form, and the XPaths of their attribute values are "/html/body/div[4]/div[2]/div/dl[1]/dd[1]" and "/html/body/div[4]/div[2]/div/dl[1]/dd[2]" respectively.
For example, after the service request is received, the configuration files pattern.conf and xpath.conf are loaded first, and the URL of the page information requested by the user equipment is obtained. The URL of the page is parsed with regular expressions and matched against the configuration file pattern.conf to obtain the corresponding pattern and pattern_id, and a pattern_id list is generated. According to the pattern_id list, the specific attribute names and the XPaths of their attribute values, as well as the XPaths of the open attribute names and the XPaths of the corresponding open attribute values, are obtained from the configuration file xpath.conf. If there are n open attribute names in total, they may be named open attribute name_1, open attribute name_2, ..., open attribute name_n, and the corresponding open attribute values are named open attribute value_1, open attribute value_2, ..., open attribute value_n. For example, the pattern_id may first be obtained from the configuration file pattern.conf shown in Table 1; then, according to the pattern_id, the attribute name or the XPath of the attribute name and the XPath of the attribute value are obtained from the xpath.conf file shown in Table 6, and the configuration information list shown in Table 7 is generated. The configuration information list includes pattern_id 0; the specific attribute name "title" and the XPath of its attribute value, "/html/body/div[4]/div[2]/div/div[2]/dd/h1"; the XPath of open attribute name_1, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]", and the XPath of the corresponding open attribute value_1, "/html/body/div[4]/div[2]/div/dl[1]/dd[1]"; and the XPath of open attribute name_2, "/html/body/div[4]/div[2]/div/dl[1]/dt[2]", and the XPath of the corresponding open attribute value_2, "/html/body/div[4]/div[2]/div/dl[1]/dd[2]".
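The lookup described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the on-disk formats of pattern.conf and xpath.conf are not given in the text, so both files are replaced here by hypothetical in-memory lists; the URL patterns and the names match_pattern_ids and build_config_list are invented for the example, and an attribute name is treated as being in XPath form when it starts with "/". Only the XPaths are taken from the description.

```python
import re

# Hypothetical stand-in for pattern.conf: (pattern_id, URL regular expression).
# The URL patterns are invented for illustration.
PATTERN_CONF = [
    (0, r"^https?://example\.com/company/\d+$"),
    (1, r"^https?://example\.com/product/\d+$"),
]

# Hypothetical stand-in for xpath.conf:
# (pattern_id, attribute name or XPath of the attribute name, XPath of the value).
XPATH_CONF = [
    (0, "title", "/html/body/div[4]/div[2]/div/div[2]/dd/h1"),
    (0, "/html/body/div[4]/div[2]/div/dl[1]/dt[1]",
        "/html/body/div[4]/div[2]/div/dl[1]/dd[1]"),
    (0, "/html/body/div[4]/div[2]/div/dl[1]/dt[2]",
        "/html/body/div[4]/div[2]/div/dl[1]/dd[2]"),
]


def match_pattern_ids(url):
    """Return the pattern_id list for every pattern that matches the URL."""
    return [pid for pid, regex in PATTERN_CONF if re.search(regex, url)]


def build_config_list(url):
    """Build a configuration information list for the requested URL."""
    entries = []
    for pid in match_pattern_ids(url):
        open_count = 0  # numbering n of the open attribute names under this pattern_id
        for conf_pid, name, value_xpath in XPATH_CONF:
            if conf_pid != pid:
                continue
            if name.startswith("/"):
                # Assumption: a leading "/" marks an attribute name in XPath form.
                open_count += 1
                entries.append({"pattern_id": pid,
                                "label": f"open_attribute_name_{open_count}",
                                "xpath": name})
                entries.append({"pattern_id": pid,
                                "label": f"open_attribute_value_{open_count}",
                                "xpath": value_xpath})
            else:
                # Specific attribute name: only the value needs an XPath.
                entries.append({"pattern_id": pid,
                                "label": name,
                                "xpath": value_xpath})
    return entries
```

Calling build_config_list with a URL that matches pattern 0 yields one entry for the specific attribute name and a numbered name/value pair of entries for each open attribute name; the shared number is what later serves as the mapping tag.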
An information acquisition module 702, configured to obtain the attribute name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path.
In a specific implementation, a structure traversal tree may be created according to the page information, where the structure traversal tree includes multiple content nodes. The multiple content nodes on the structure traversal tree are traversed, the attribute name is obtained according to the first traversal path, and the attribute value is obtained according to the second traversal path.
Optionally, before the attribute name is obtained from the page information according to the first traversal path, the type of the attribute name may be determined. If the attribute name is determined to be an open attribute name (XPath form), the attribute name is obtained from the page information according to the first traversal path. If the attribute name is a specific attribute name, such as "corporate business" or "development history", then "corporate business", "development history", and the like can be used directly as the attribute name; in this case the attribute name need not be obtained from the page information according to the first traversal path.
For example, as shown in Fig. 5, the HTML page information is parsed with DOM to generate the corresponding DOM tree. The DOM tree contains multiple content nodes, each of which represents an HTML tag or the text content within an HTML tag. After the DOM tree is created, the content nodes in the DOM tree are traversed according to the XPaths in the configuration information list shown in Table 7, and the content of the corresponding node is obtained as the information value of the XPath. For example, when the XPath is /html/head/title, the html node, head node, and title node in the DOM tree shown in Fig. 5 are traversed in turn according to /html/head/title, and the text content "My title" of the title node is obtained as the information value of the XPath. The information value of each XPath is obtained in this way according to its traversal path, and the attribute information list shown in Table 8 is finally generated. This list includes the specific attribute name "title" and its XPath information value "XXX"; open attribute name_1 and its XPath information value "foreign language title"; open attribute value_1 and its XPath information value "ABC"; open attribute name_2 and its XPath information value "headquarters location"; and open attribute value_2 and its XPath information value "Shenzhen, China".
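The node-by-node traversal can be illustrated with a short Python sketch. It assumes well-formed markup so that the standard-library xml.etree.ElementTree parser can stand in for a DOM parser, and it supports only the simple path steps that appear in the text (a tag name with an optional positional index such as dl[1]). The function name text_at and the sample page are hypothetical; a real HTML page would need an HTML-tolerant parser.

```python
import re
import xml.etree.ElementTree as ET

# One path step: a tag name with an optional 1-based position, e.g. "dt[2]".
STEP = re.compile(r"^([A-Za-z][\w-]*)(?:\[(\d+)\])?$")


def text_at(root, xpath):
    """Walk the tree one step at a time and return the text of the node reached.

    Returns None when the path does not lead to a node.
    """
    steps = [s for s in xpath.split("/") if s]
    if not steps:
        return None
    first = STEP.match(steps[0])
    if first is None or first.group(1) != root.tag:
        return None
    node = root
    for step in steps[1:]:
        m = STEP.match(step)
        if m is None:
            return None
        tag, index = m.group(1), int(m.group(2) or 1)
        # Positional predicates count only child nodes with the same tag.
        same_tag = [child for child in node if child.tag == tag]
        if index < 1 or index > len(same_tag):
            return None
        node = same_tag[index - 1]
    return (node.text or "").strip()


# Hypothetical page, much simpler than the layout behind Table 6.
PAGE = (
    "<html><head><title>My title</title></head>"
    "<body><dl>"
    "<dt>foreign language title</dt><dd>ABC</dd>"
    "<dt>headquarters location</dt><dd>Shenzhen, China</dd>"
    "</dl></body></html>"
)
root = ET.fromstring(PAGE)

title_text = text_at(root, "/html/head/title")
second_name = text_at(root, "/html/body/dl[1]/dt[2]")
second_value = text_at(root, "/html/body/dl[1]/dd[2]")
```

Here the path /html/head/title visits the html, head, and title nodes in turn, and the dt[2] and dd[2] paths retrieve the second open attribute name and its value from the same list.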
A result output module 703, configured to establish the mapping relationship between the attribute name and the attribute value and output it as the information acquisition result.
In a specific implementation, if the attribute name is a specific attribute name, the XPath information value corresponding to the specific attribute name is used as its attribute value, and the pair may be output in the form <attribute name: attribute value>. For example, in the attribute information list shown in Table 8, the attribute value corresponding to the specific attribute name "title" is "XXX", and the pair is output as <title: XXX>.
If the attribute name or attribute value is open attribute name_n or open attribute value_n, the number "n" in open attribute name_n is first obtained as the first mapping tag of the corresponding XPath information value, and that information value is stored as an attribute name at the n-th position of the open attribute name list. Similarly, the number "n" in open attribute value_n is obtained as the second mapping tag of the corresponding XPath information value, and that information value is stored as an attribute value at the n-th position of the open attribute value list. Here n may be any integer such as 1, 2, 3, ...; the entries from open attribute name_1 to open attribute name_n and from open attribute value_1 to open attribute value_n in the attribute information list are traversed in this way. For example, in the attribute information list shown in Table 8, the number "1" in open attribute name_1 is obtained as the first mapping tag of the corresponding XPath information value "foreign language title", and "foreign language title" is stored as an attribute name at the 1st position of the open attribute name list; the number "1" in open attribute value_1 is obtained as the second mapping tag of the corresponding XPath information value "ABC", and "ABC" is stored as an attribute value at the 1st position of the open attribute value list.
Finally, when the first mapping tag is identical to the second mapping tag, a mapping relationship is established between the attribute name corresponding to the first mapping tag and the attribute value corresponding to the second mapping tag, and the pair may be output in the form <attribute name: attribute value>.
For example, as shown in Table 9-1 and Table 9-2, the first mapping tag of the attribute name "foreign language title" in the open attribute name list is 1, and the second mapping tag of the attribute value "ABC" in the open attribute value list is also 1. Because the two mapping tags are identical, a mapping relationship is established between "foreign language title" and "ABC", and the pair is output in the form <foreign language title: ABC>. Similarly, the first mapping tag of the attribute name "headquarters location" is 2 and the second mapping tag of the attribute value "Shenzhen, China" is also 2, so a mapping relationship is established between "headquarters location" and "Shenzhen, China", and <headquarters location: Shenzhen, China> is output.
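The tag-matching step can be sketched as follows. This is a minimal Python illustration, assuming the attribute information list is held as a dictionary whose keys follow a hypothetical open_attribute_name_n / open_attribute_value_n naming scheme; the function names pair_open_attributes and format_pair are likewise invented for the example. The number n taken from each key plays the role of the first or second mapping tag.

```python
import re

# Hypothetical label scheme for the open entries of the attribute information list.
LABEL = re.compile(r"^open_attribute_(name|value)_(\d+)$")


def pair_open_attributes(attribute_info):
    """Pair each open attribute name with the open attribute value whose tag matches."""
    names = {}   # first mapping tag -> attribute name
    values = {}  # second mapping tag -> attribute value
    for label, text in attribute_info.items():
        m = LABEL.match(label)
        if m is None:
            continue  # specific attribute names are handled separately
        kind, tag = m.group(1), int(m.group(2))
        if kind == "name":
            names[tag] = text
        else:
            values[tag] = text
    # A mapping is established only where the two tags are identical.
    return [(names[tag], values[tag]) for tag in sorted(names) if tag in values]


def format_pair(name, value):
    """Render one pair in the <attribute name: attribute value> output form."""
    return f"<{name}: {value}>"


attribute_info = {
    "title": "XXX",
    "open_attribute_name_1": "foreign language title",
    "open_attribute_value_1": "ABC",
    "open_attribute_name_2": "headquarters location",
    "open_attribute_value_2": "Shenzhen, China",
}
pairs = pair_open_attributes(attribute_info)
lines = [format_pair(name, value) for name, value in pairs]
```

Keying both lists by the tag, rather than relying on list order alone, means an open attribute name without a matching value is simply left unpaired instead of being attached to the wrong value.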
In this embodiment of the present invention, the first traversal path of the attribute name and the second traversal path of the attribute value are obtained first; the attribute name is then obtained from the page information according to the first traversal path, and the attribute value is obtained from the page information according to the second traversal path; finally, a mapping relationship between the attribute name and the attribute value is established and output as the information acquisition result. Because both the attribute name and the attribute value are obtained through traversal paths and then mapped to each other, the accuracy of information acquisition is improved.
Referring next to Fig. 8, Fig. 8 is a schematic structural diagram of an information acquisition apparatus according to an embodiment of the present invention. As shown in the figure, the apparatus may include: at least one processor 801, at least one communication interface 802, at least one memory 803, and at least one communication bus 804.
The processor 801 may be a central processing unit, a general-purpose processor, a digital signal processor, an application-specific integrated circuit, a field programmable gate array or another programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or execute the various illustrative logical blocks, modules, and circuits described in connection with the disclosure of the present invention. The processor may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a digital signal processor and a microprocessor. The communication bus 804 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is shown in Fig. 8, but this does not mean that there is only one bus or one type of bus. The communication bus 804 is used to implement connection and communication between these components. The communication interface 802 of the apparatus in this embodiment of the present invention is used for signaling or data communication with other node devices. The memory 803 may include volatile memory, such as nonvolatile random access memory (Nonvolatile Random Access Memory, NVRAM), phase change random access memory (Phase Change RAM, PRAM), or magnetoresistive random access memory (Magnetoresistive RAM, MRAM), and may also include non-volatile memory, for example at least one magnetic disk storage device, electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a flash memory device such as NOR flash memory or NAND flash memory, or a semiconductor device such as a solid state disk (Solid State Disk, SSD). Optionally, the memory 803 may also be at least one storage device located remotely from the processor 801. The memory 803 stores a set of program code, and the processor 801 executes the program in the memory 803 to perform the following operations:
Obtain the first traversal path of the attribute name and the second traversal path of the attribute value;
Obtain the attribute name from the page information according to the first traversal path, and obtain the attribute value from the page information according to the second traversal path;
Establish the mapping relationship between the attribute name and the attribute value and output it as the information acquisition result.
Optionally, the processor 801 is further configured to perform the following operations:
Obtain the first mapping tag of the attribute name and the second mapping tag of the attribute value;
Establish the mapping relationship between the attribute name and the attribute value according to the first mapping tag and the second mapping tag.
Optionally, the processor 801 is further configured to perform the following operation:
When the first mapping tag is identical to the second mapping tag, establish the mapping relationship between the attribute name and the attribute value.
Optionally, the processor 801 is further configured to perform the following operations:
Create a structure traversal tree according to the page information, where the structure traversal tree includes multiple content nodes;
Traverse the multiple content nodes on the structure traversal tree, obtain the attribute name according to the first traversal path, and obtain the attribute value according to the second traversal path.
Optionally, the processor 801 is further configured to perform the following operations:
Obtain the attribute identifier of the attribute name and the attribute value;
Obtain the first traversal path and the second traversal path from a configuration file according to the attribute identifier, where the configuration file includes the correspondence between the attribute identifier and the first and second traversal paths.
Optionally, the processor 801 is further configured to perform the following operations:
Obtain the uniform resource locator of the page information;
Obtain the attribute identifier of the attribute name and the attribute value according to the uniform resource locator.
Optionally, the processor 801 is further configured to perform the following operations:
Determine the type of the attribute name;
If the attribute name is an open attribute name, obtain the attribute name according to the first traversal path of the attribute name.
Further, the processor may cooperate with the memory and the communication interface to perform the operations of the device described in the foregoing embodiments of the invention.
The foregoing embodiments may be implemented wholly or partly by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented wholly or partly in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions described in the embodiments of the present invention are generated wholly or partly. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, hard disk, or magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (Solid State Disk, SSD)).
The specific implementations described above further describe the objectives, technical solutions, and beneficial effects of the present invention in detail. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.