Summary of the Invention
To solve the above technical problems, the present invention provides a method, an apparatus, and a device for processing a sitemap that can meet the respective needs of websites and search engines.
According to one aspect of the present invention, a method for processing a sitemap is provided, comprising:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, the links in the sitemap that adversely affect search-engine indexing;
generating a new sitemap.
Preferably, after obtaining the links of the pages in the sitemap and accessing them, the method further comprises:
extracting keywords and a body-text feature value from each accessed page;
deleting the links in the sitemap that adversely affect search-engine indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values.
Preferably, deleting, according to the access results, the links in the sitemap that adversely affect search-engine indexing comprises:
when the access result is an HTTP 404 error indicating that the page cannot be accessed, deleting the corresponding link; or,
when the access result is that the page response time is greater than or equal to a set threshold, deleting the corresponding link; or,
when the access result is that the title, keywords, and description of the page are incomplete, deleting the corresponding link; or,
when the access result is that the body content of the page does not match the title, keywords, and description of the page, deleting the corresponding link.
Preferably, deleting the links in the sitemap that adversely affect search-engine indexing according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values comprises:
when the comparison shows that the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, judging that the content is a duplicate submission and deleting the corresponding link.
Preferably, the method further comprises:
providing the generated new sitemap to a search engine for access.
Preferably, the method further comprises:
recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
According to another aspect of the present invention, an apparatus for processing a sitemap is provided, comprising:
an acquisition module, configured to obtain the sitemap of a website according to preset information;
an access module, configured to obtain, from the sitemap obtained by the acquisition module, the links of the pages in the sitemap and to access them;
a first processing module, configured to delete, according to the access results of the access module, the links in the sitemap that adversely affect search-engine indexing;
a generation module, configured to generate a new sitemap after the processing by the first processing module.
Preferably, the apparatus further comprises:
a second processing module, configured to extract keywords and a body-text feature value from each accessed page and to delete the links in the sitemap that adversely affect search-engine indexing according to the result of comparing the extracted keywords and body-text feature value with pre-stored keywords and body-text feature values;
the generation module generates the new sitemap after the processing by the first processing module and the second processing module.
Preferably, the apparatus further comprises:
an output module, configured to provide the new sitemap generated by the generation module to a search engine for access.
Preferably, the apparatus further comprises:
a monitoring module, configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
Preferably, the first processing module comprises:
a first deletion unit, configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed; or,
a second deletion unit, configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold; or,
a third deletion unit, configured to delete the corresponding link when the access result is that the title, keywords, and description of the page are incomplete; or,
a fourth deletion unit, configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords, and description of the page.
According to another aspect of the present invention, a processing device is provided, comprising:
a memory, configured to store programs;
a processor, configured to execute the following program stored in the memory:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, the links in the sitemap that adversely affect search-engine indexing;
generating a new sitemap.
As can be seen, the technical solution of the embodiments of the present invention first obtains the links of the pages in the sitemap and accesses them; only when the access results reveal links that adversely affect search-engine indexing are those links deleted from the sitemap, after which a new sitemap is generated. The website's original sitemap is thereby optimized: links with poor content or links that are prone to errors are kept out of the sitemap as far as possible, which improves the quality of the sitemap, increases the likelihood that pages are indexed by search engines, and meets the needs of both the website and the search engine.
Detailed Description of the Embodiments
Preferred embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show preferred embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
The present invention provides a method for processing a sitemap that can meet the respective needs of websites and search engines.
Fig. 1 is a schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 1, the method includes:
Step 101: obtain the sitemap of the website according to preset information.
In this step, after an agreement has been reached with the website, the sitemap of the website can be obtained according to the configuration information provided by the website.
Step 102: obtain the links of the pages in the sitemap and access them.
In this step, each URL (Uniform Resource Locator) link in the sitemap is obtained, and each URL link is accessed in turn for verification.
Step 103: delete, according to the access results, the links in the sitemap that adversely affect search-engine indexing.
In this step, deleting the links in the sitemap that adversely affect search-engine indexing according to the access results includes:
when the access result is an HTTP 404 error indicating that the page cannot be accessed, deleting the corresponding link; or,
when the access result is that the page response time is greater than or equal to a set threshold, deleting the corresponding link; or,
when the access result is that the title, keywords, and description of the page are incomplete, deleting the corresponding link; or,
when the access result is that the body content of the page does not match the title, keywords, and description of the page, deleting the corresponding link.
Step 104: generate a new sitemap.
In this step, after every link that adversely affects search-engine indexing has been deleted from the sitemap, the remaining links are rearranged and a new sitemap is generated.
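Steps 102 to 104 can be pictured as a loop that accesses each link, applies a list of checks, and separates the links to keep from the links to delete. The following is only a minimal sketch under stated assumptions: `AccessResult`, `Check`, `filter_sitemap`, and the injected `fetch` callable are hypothetical names introduced for illustration and do not appear in the source.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class AccessResult:
    """Hypothetical record of one page access (not defined in the source)."""
    status: int             # HTTP status code returned for the URL
    elapsed_seconds: float  # measured page response time
    html: str               # page content, empty if the page could not be fetched


# A check inspects an access result and returns a deletion reason,
# or None when the link should be kept.
Check = Callable[[AccessResult], Optional[str]]


def filter_sitemap(
    urls: Iterable[str],
    fetch: Callable[[str], AccessResult],
    checks: List[Check],
) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Return (kept URLs, [(deleted URL, reason), ...])."""
    kept: List[str] = []
    deleted: List[Tuple[str, str]] = []
    for url in urls:
        result = fetch(url)  # step 102: access the link
        # First non-empty reason wins; later checks are not evaluated.
        reason = next((r for r in (check(result) for check in checks) if r), None)
        if reason:           # step 103: drop the link and remember why
            deleted.append((url, reason))
        else:
            kept.append(url)
    return kept, deleted     # step 104 would build the new sitemap from `kept`
```

Passing `fetch` in as a parameter keeps the loop independent of any particular HTTP client, so the same function can be exercised with canned results.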
As can be seen, the technical solution of the embodiments of the present invention first obtains the links of the pages in the sitemap and accesses them; only when the access results reveal links that adversely affect search-engine indexing are those links deleted from the sitemap, after which a new sitemap is generated. The website's original sitemap is thereby optimized: links with poor content or links that are prone to errors are kept out of the sitemap as far as possible, which improves the quality of the sitemap, increases the likelihood that pages are indexed by search engines, and meets the needs of both the website and the search engine.
The technical solution is described in more detail below.
Fig. 2 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 2, the method includes:
Step 201: obtain the sitemap of the website according to preset information.
For this step, see the description of step 101 above.
Step 202: obtain the links of the pages in the sitemap and access them.
For this step, see the description of step 102 above.
Step 203: delete, according to the access results, the links in the sitemap that adversely affect search-engine indexing.
For this step, see the description of step 103 above.
Step 204: extract keywords and a body-text feature value from each accessed page.
In this step, any of various existing algorithms can be used to extract keywords from the content of the page and to extract a body-text feature value from the body content; the present invention imposes no limitation on this.
Step 205: delete the links in the sitemap that adversely affect search-engine indexing according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values.
In this step, when the comparison shows that the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, the content is judged to be a duplicate submission and the corresponding link is deleted.
Step 206: generate a new sitemap.
Step 207: provide the generated new sitemap to a search engine for access.
In this step, the generated new sitemap can replace the website's original sitemap so that the search engine accesses the new sitemap at the website; alternatively, the website can be configured so that the search engine accesses the new sitemap directly at the service platform. The present invention imposes no limitation on this, as long as the search engine can access the new sitemap.
It should be noted that there is no fixed order between the processing of steps 202 and 203 and the processing of steps 204 and 205; the arrangement of the steps above is only for convenience of description.
It should be noted that the following can also be performed after step 207: recording the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
As can be seen, the technical solution of the embodiments of the present invention can delete links that adversely affect search-engine indexing both according to the access results and according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values, which improves the optimization effect. In addition, the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap can be recorded, providing a reference for subsequent sitemap modifications or material for analysis by the website.
Fig. 3 is another schematic flowchart of a method for processing a sitemap according to an embodiment of the present invention.
As shown in Fig. 3, the method includes:
Step 301: the sitemap service platform extracts data from the website's sitemap according to the configuration information of the website.
In this step, the website reaches an agreement in advance with the sitemap service platform (hereinafter, the service platform). The website sets a mapping relationship between the sitemap and the service platform, allowing the service platform to process the sitemap according to configuration information, such as address information, provided by the website. The website can set the mapping relationship by means of XML. According to the configuration information provided by the website, the service platform can extract data from the sitemap and obtain the URL information of each link in it.
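The data extraction in step 301 can be illustrated with a short sketch. It assumes that the sitemap is an XML document in the sitemaps.org 0.9 format, with one `<loc>` element per link; the source says only that XML is involved, so the format and the function name `extract_urls` are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace of the sitemaps.org protocol (assumed format of the website's sitemap).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def extract_urls(sitemap_xml: str) -> List[str]:
    """Return the URL of every <loc> element in the sitemap, in document order."""
    root = ET.fromstring(sitemap_xml)
    urls: List[str] = []
    for loc in root.iter("{%s}loc" % SITEMAP_NS):
        # Skip empty <loc> elements and trim surrounding whitespace.
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls
```

The resulting list is what the later checking steps would iterate over, one URL at a time.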
Step 302: the service platform checks each URL in the extracted sitemap and judges whether accessing the URL produces an HTTP 404 error indicating that the page cannot be accessed. If so, the flow proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the flow proceeds to step 303.
An HTTP 404 error means that the page the link points to does not exist, that is, the URL of the original page is no longer valid. This happens frequently, for example when the rules for generating page URLs change, when a page file is renamed or moved, or when an imported link is misspelled, so that the original URL can no longer be accessed. When a web server receives such a request, it returns a 404 status code to tell the browser that the requested resource does not exist. Therefore, an HTTP 404 error on access indicates that the URL is no longer valid, and the URL is deleted from the sitemap and the reason is recorded.
Step 303: the service platform judges whether the page response speed when accessing the URL is abnormal. If so, the flow proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the flow proceeds to step 304.
When the URL can be accessed normally, the response speed of the page is checked; the response speed can be measured by the response time. If the response time is greater than or equal to a set threshold, the response speed is considered abnormal; if it is less than the set threshold, the response speed is considered normal. The set threshold can be chosen empirically, for example 500 milliseconds or 1 second; the present invention imposes no limitation on this.
It should be noted that the current access response speed can also be compared with the page's historical access response speed to judge whether the response speed is abnormal. If the current response time exceeds the historical response time by more than some threshold, the response speed is considered abnormal.
Therefore, an abnormal page response speed indicates that the page corresponding to the URL may have a problem, or that the network connection for the URL may have a problem. Either can degrade the user's browsing experience, so the URL is deleted from the sitemap and the reason is recorded.
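The checks of steps 302 and 303 can be expressed as one small function over an already measured access result. This is a sketch under assumptions: the 1-second threshold is one of the example values given above, while the slowdown factor used for the historical comparison, the function name `access_problem`, and the reason strings are hypothetical, since the source does not specify how much larger than the historical time counts as abnormal.

```python
from typing import Optional

RESPONSE_THRESHOLD_SECONDS = 1.0  # example value from the text (500 ms or 1 s)
HISTORY_SLOWDOWN_FACTOR = 3.0     # hypothetical: how much slower than history is "abnormal"


def access_problem(
    status: int,
    elapsed_seconds: float,
    historical_seconds: Optional[float] = None,
) -> Optional[str]:
    """Return a deletion reason for steps 302 and 303, or None if the access looks normal."""
    if status == 404:
        return "HTTP 404: page cannot be accessed"
    if elapsed_seconds >= RESPONSE_THRESHOLD_SECONDS:
        return "response time at or above the set threshold"
    # Optional comparison against the page's historical response time.
    if (
        historical_seconds is not None
        and elapsed_seconds > historical_seconds * HISTORY_SLOWDOWN_FACTOR
    ):
        return "response time far above the historical response time"
    return None
```

Returning the reason as a string rather than a boolean makes it straightforward to record why a URL was deleted, as step 311 requires.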
Step 304: the service platform judges whether the TKD of the page is incomplete. If so, the flow proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the flow proceeds to step 305.
TKD is an abbreviation of title, keywords, and description. The format of TKD content can be as follows:
<title>Title content goes here</title>
<meta name="keywords" content="Keyword content goes here"/>
<meta name="description" content="Description content goes here"/>
Keywords are the terms that a webmaster sets for a page of the website so that users can find the page through a search engine; the keywords represent the market positioning of the website. The description, also called the "content tag", "description tag", or "content summary", reflects the main content of the page.
Generally, only a complete TKD satisfies the search rules of search engines. If the TKD is incomplete and does not satisfy those rules, the search engine may not crawl the page or may not index the linked content. Therefore, when the TKD is found to be incomplete, the URL is deleted from the sitemap and the reason is recorded.
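A completeness check for step 304 could read the three TKD fields out of the page and treat the TKD as incomplete when any field is missing or empty. The sketch below uses the standard-library `html.parser` module; treating "incomplete" as "any field empty" is an assumption, and `TKDParser`, `extract_tkd`, and `tkd_is_complete` are illustrative names only.

```python
from html.parser import HTMLParser
from typing import Dict


class TKDParser(HTMLParser):
    """Collects the <title> text and the keywords/description meta tags."""

    def __init__(self) -> None:
        super().__init__()
        self.tkd: Dict[str, str] = {"title": "", "keywords": "", "description": ""}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Attribute values may be None (e.g. <meta name>), so normalize to "".
            attr = {k: (v or "") for k, v in attrs}
            name = attr.get("name", "").strip().lower()
            if name in ("keywords", "description"):
                self.tkd[name] = attr.get("content", "").strip()

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.tkd["title"] += data.strip()


def extract_tkd(html: str) -> Dict[str, str]:
    """Return the title, keywords, and description found in the page."""
    parser = TKDParser()
    parser.feed(html)
    parser.close()
    return parser.tkd


def tkd_is_complete(html: str) -> bool:
    """Assumption: the TKD is complete only if all three fields are non-empty."""
    return all(extract_tkd(html).values())
```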
Step 305: the service platform judges whether the body content of the page fails to match the TKD. If so, the flow proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the flow proceeds to step 306.
In this step, the body content of the page is examined to judge whether the keywords in the TKD appear in the body and whether the body content corresponds to the title and description in the TKD. If the TKD keywords appear in the body and the body content corresponds to the title and description in the TKD, the body is considered to match; otherwise it does not match. A mismatch may mean that the body was set incorrectly or that the TKD was set incorrectly; either affects the search quality of the search engine and the user's browsing experience. Therefore, when the body content of the page is found not to match the TKD, the URL is deleted from the sitemap and the reason is recorded.
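The source does not define how the body content is judged to "correspond" to the title and description, so the following sketch substitutes a simple term-overlap rule as an assumption: every TKD keyword must occur in the body, and at least half of the terms in the title and in the description must also occur in the body. The name `body_matches_tkd` and the `min_overlap` parameter are hypothetical, and the regular-expression tokenization is only suitable for space-delimited languages; Chinese text would need a word segmenter.

```python
import re
from typing import Dict, Set


def _terms(text: str) -> Set[str]:
    """Lower-cased word tokens of a text."""
    return set(re.findall(r"\w+", text.lower()))


def body_matches_tkd(body_text: str, tkd: Dict[str, str], min_overlap: float = 0.5) -> bool:
    """Assumed matching rule for step 305 (not specified by the source)."""
    body_lower = body_text.lower()
    # Keywords are assumed to be separated by ASCII or full-width commas.
    keywords = [k.strip().lower() for k in re.split(r"[,，]", tkd["keywords"]) if k.strip()]
    # Every TKD keyword must appear somewhere in the body text.
    if not keywords or any(k not in body_lower for k in keywords):
        return False
    body_terms = _terms(body_text)
    for field in ("title", "description"):
        terms = _terms(tkd[field])
        # The share of title/description terms found in the body must reach min_overlap.
        if not terms or len(terms & body_terms) / len(terms) < min_overlap:
            return False
    return True
```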
Step 306: the service platform extracts keywords from the content of the page and extracts a body-text feature value from the body content.
In this step, the service platform can use any of various existing algorithms to extract keywords from the content of the page and to extract a body-text feature value from the body content; the present invention imposes no limitation on this.
For example, keyword extraction can use the existing TF-IDF (term frequency-inverse document frequency) algorithm. The algorithm essentially stores the information for all words in a dictionary, sorts the dictionary by value, and takes the several words with the highest weights as the keywords. As another example, the body-text feature value can be extracted using a text feature extraction method based on a context framework or one based on ontology, among others.
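The TF-IDF procedure described above (store a weight per word in a dictionary, sort by value, take the top-weighted words) can be sketched as follows. The smoothed IDF formula, the tokenizer, and the names `tokenize` and `tfidf_keywords` are choices made for this illustration and are not taken from the source; the reference corpus would in practice be the other pages known to the service platform.

```python
import math
import re
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lower-cased word tokens; suitable only for space-delimited languages."""
    return re.findall(r"\w+", text.lower())


def tfidf_keywords(document: str, corpus: List[str], top_n: int = 3) -> List[str]:
    """Return the top_n words of `document` ranked by TF-IDF weight against `corpus`."""
    tokens = tokenize(document)
    if not tokens:
        return []
    tf = Counter(tokens)
    doc_sets = [set(tokenize(d)) for d in corpus]
    weights: Dict[str, float] = {}  # the "dictionary" holding every word's weight
    for word, count in tf.items():
        df = sum(1 for s in doc_sets if word in s)          # document frequency
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1.0  # smoothed IDF
        weights[word] = (count / len(tokens)) * idf
    # Sort by weight, highest first; ties are broken alphabetically for determinism.
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]
```

Words that occur in many corpus documents receive a low IDF, so common function words tend to fall out of the top of the ranking.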
Step 307: the service platform compares the extracted keywords and body-text feature value with the keywords and body-text feature values stored on the service platform to check whether content has been submitted more than once. If so, the flow proceeds to step 311, where the URL is deleted from the sitemap and the reason is recorded; if not, the flow proceeds to step 308.
In this step, the extracted keywords and body-text feature value are compared with the keywords and body-text feature values stored on the service platform in order to match text relevance. If identical keywords and an identical body-text feature value are found on the service platform, the content is judged to be a duplicate. This matching check detects whether content has been submitted more than once. The service platform stores in advance the keywords and body-text feature value of the article on each page that has already been checked.
Step 308: the service platform saves the extracted keywords, the body-text feature value, and the corresponding link for use in subsequent duplicate checks.
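Steps 307 and 308 amount to a lookup followed by a store. In the sketch below, a SHA-256 hash of the normalized body text stands in for the body-text feature value; this is a deliberately simple substitute for the context-framework or ontology-based methods mentioned above, and it detects only content that is identical after normalization. `DuplicateStore` and `text_feature_value` are hypothetical names.

```python
import hashlib
import re
from typing import Dict, FrozenSet, Iterable, Tuple

Fingerprint = Tuple[FrozenSet[str], str]


def text_feature_value(body_text: str) -> str:
    """Stand-in feature value: hash of the lower-cased, whitespace-normalized words."""
    normalized = " ".join(re.findall(r"\w+", body_text.lower()))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DuplicateStore:
    """In-memory stand-in for the keywords and feature values kept by the service platform."""

    def __init__(self) -> None:
        self._seen: Dict[Fingerprint, str] = {}

    def check_and_store(self, url: str, keywords: Iterable[str], body_text: str) -> bool:
        """Return True if another URL already submitted the same content (step 307);
        otherwise save the fingerprint for later duplicate checks (step 308)."""
        key = (frozenset(k.lower() for k in keywords), text_feature_value(body_text))
        previous = self._seen.get(key)
        if previous is not None and previous != url:
            return True
        self._seen[key] = url
        return False
```

Comparing against the stored URL means that re-checking the same page on a later run is not reported as a duplicate of itself.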
Step 309: the service platform generates the new, processed sitemap data for the search engine to obtain.
In this step, the website can be configured to direct the search engine to obtain the sitemap directly from the service platform; alternatively, the service platform can directly replace the website's original sitemap with the new sitemap.
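Generating the new sitemap data in step 309 can be sketched as serializing the URLs that survived the checks. As before, the sitemaps.org 0.9 XML format is an assumption about the sitemap's format, and `build_sitemap` is an illustrative name.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

# Namespace of the sitemaps.org protocol (assumed output format).
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: Iterable[str]) -> str:
    """Serialize the kept URLs as a <urlset> document with one <url><loc> per link."""
    urlset = ET.Element("urlset", {"xmlns": SITEMAP_NS})
    for url in urls:
        # ElementTree escapes characters such as "&" in the URL text.
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    return ET.tostring(urlset, encoding="unicode")
```

The returned string could then either be served by the service platform or written over the website's original sitemap, matching the two delivery options described in this step.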
Step 310: the service platform monitors the indexing status of the latest sitemap data.
If a URL in the sitemap is indexed by the search engine, tag information is returned. The service platform monitors which URLs have been indexed by the search engine, which can provide a reference for subsequent adjustments to the sitemap.
Step 311: the service platform deletes the link from the sitemap and records the reason for analysis by the website.
In this step, the reason each link was deleted can be recorded in detail for analysis by the website.
As can be seen, the technical solution of the embodiments of the present invention analyzes and filters the obtained sitemap data of the website and verifies the links provided in the sitemap by accessing them. It also extracts keywords and a body-text feature value from the body content and matches them against the pre-stored keywords and body-text feature values, so that duplicate or poor-quality content is not submitted. Finally, the search engine's indexing of the sitemap can be monitored. Through this processing, the present invention can optimize the quality of the sitemap, increase the number of the website's pages indexed by search engines, and allow search engines to index the website's pages more effectively. It also addresses the ranking demotion caused by submitting duplicate or spam content to search engines and allows the state of the website's content to be monitored more effectively.
The method for processing a sitemap according to the present invention has been described in detail above. Correspondingly, the present invention also provides an apparatus for processing a sitemap.
Fig. 4 is a schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 4, an apparatus for processing a sitemap includes: an acquisition module 401, an access module 402, a first processing module 403, and a generation module 404. The apparatus for processing a sitemap according to the present invention can be a service platform or another device.
The acquisition module 401 is configured to obtain the sitemap of the website according to preset information.
After an agreement has been reached with the website, the apparatus can obtain the sitemap of the website through the acquisition module 401 according to the configuration information provided by the website.
The access module 402 is configured to obtain, from the sitemap obtained by the acquisition module 401, the links of the pages in the sitemap and to access them.
The access module 402 obtains each URL link in the sitemap and accesses each URL link in turn for verification.
The first processing module 403 is configured to delete, according to the access results of the access module 402, the links in the sitemap that adversely affect search-engine indexing.
The first processing module 403 deletes the links in the sitemap that adversely affect search-engine indexing according to the various access results.
The generation module 404 is configured to generate a new sitemap after the processing by the first processing module 403.
Fig. 5 is another schematic block diagram of an apparatus for processing a sitemap according to the present invention.
As shown in Fig. 5, an apparatus for processing a sitemap includes: an acquisition module 401, an access module 402, a first processing module 403, and a generation module 404. For the function of each module, see the description of Fig. 4.
In addition, the apparatus further includes a second processing module 405.
The second processing module 405 is configured to extract keywords and a body-text feature value from each accessed page and to delete the links in the sitemap that adversely affect search-engine indexing according to the result of comparing the extracted keywords and body-text feature value with the pre-stored keywords and body-text feature values. The generation module 404 generates the new sitemap after the processing by the first processing module 403 and the second processing module 405.
When the comparison shows that the extracted keywords and body-text feature value are identical to pre-stored keywords and a pre-stored body-text feature value, the second processing module 405 judges that the content is a duplicate submission and deletes the corresponding link.
The apparatus further includes an output module 406.
The output module 406 is configured to provide the new sitemap generated by the generation module to a search engine for access.
In the present invention, the generated new sitemap can replace the website's original sitemap so that the search engine accesses the new sitemap at the website; alternatively, the website can be configured so that the search engine accesses the new sitemap directly at the service platform. The present invention imposes no limitation on this, as long as the search engine can access the new sitemap.
The apparatus further includes a monitoring module 407.
The monitoring module 407 is configured to record the indexing data produced when the search engine crawls and indexes pages after accessing the new sitemap.
The first processing module 403 includes: a first deletion unit 4031, a second deletion unit 4032, a third deletion unit 4033, or a fourth deletion unit 4034.
The first deletion unit 4031 is configured to delete the corresponding link when the access result is an HTTP 404 error indicating that the page cannot be accessed.
The second deletion unit 4032 is configured to delete the corresponding link when the access result is that the page response time is greater than or equal to a set threshold.
The third deletion unit 4033 is configured to delete the corresponding link when the access result is that the title, keywords, and description of the page are incomplete.
The fourth deletion unit 4034 is configured to delete the corresponding link when the access result is that the body content of the page does not match the title, keywords, and description of the page.
The present invention also provides a processing device.
Fig. 6 is a schematic block diagram of a processing device according to the present invention.
As shown in Fig. 6, the processing device includes a memory 601 and a processor 602.
The memory 601 is configured to store programs.
The processor 602 is configured to execute the following program stored in the memory 601:
obtaining the sitemap of a website according to preset information;
obtaining the links of the pages in the sitemap and accessing them;
deleting, according to the access results, the links in the sitemap that adversely affect search-engine indexing;
generating a new sitemap.
It should be noted that, for the other programs stored in the memory 601, reference is made to the description of the foregoing method flows, which is not repeated here; the processor 602 is further configured to execute the other programs stored in the memory 601.
In summary, the technical solution of the embodiments of the present invention analyzes and filters the obtained sitemap data of the website and verifies the links provided in the sitemap by accessing them. It also extracts keywords and a body-text feature value from the body content and matches them against the pre-stored keywords and body-text feature values, so that duplicate or poor-quality content is not submitted. Finally, the search engine's indexing of the sitemap can be monitored. Through this processing, the present invention can optimize the quality of the sitemap, increase the number of the website's pages indexed by search engines, and allow search engines to index the website's pages more effectively. It also addresses the ranking demotion caused by submitting duplicate or spam content to search engines and allows the state of the website's content to be monitored more effectively.
The technical solution according to the present invention has been described in detail above with reference to the accompanying drawings.
In addition, the method according to the present invention can also be implemented as a computer program that includes computer program code instructions for performing the steps defined in the above method of the present invention. Alternatively, the method according to the present invention can be implemented as a computer program product that includes a computer-readable medium on which is stored a computer program for performing the functions defined in the above method of the present invention. Those skilled in the art will also understand that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or a combination of both.
The flowcharts and block diagrams in the drawings show the possible architecture, functions, and operations of systems and methods according to multiple embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
Embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.