Web page title processing method and apparatus
Technical field
The present application relates to the field of Internet technologies, and in particular to a web page title processing method and apparatus.
Background technology
A title describes the whole page, that is, its core content. The title of a home page in particular describes the whole website: it tells users directly what the site does and what products or services it provides. The words written in a title are generally the keywords most central to the website, and title quality is the most direct factor affecting keyword ranking.
In the prior art, titles are determined mainly by hand. In general, a webmaster who maintains only a few web pages will usually set the title of each page carefully. The usual method is to extract several core keywords from the content the page expresses, search for those keywords on each major search engine, take the top N relevant titles of other sites' pages, split them up, and recombine them into a title for the webmaster's own page. Some time after the change, the webmaster checks the effect of the new title on each major search engine. A webmaster who maintains pages in bulk creates a template, uses it to generate the titles of a batch of similar pages, and some time later samples the major search engines to check the effect of the change.
The above scheme has the following shortcomings. (a) When the webmaster maintains few web pages, there is energy to optimize page titles, but no effective algorithm supports the work; the manual investigation workload is large and optimization efficiency is quite low. (b) When the webmaster maintains a large number of web pages, relying on manpower is unrealistic. Batch titles generated from a single template may have a positive effect on some pages but a negative effect on others; because the analysis is sample-based, only the overall effect can be observed, and a lasting optimization effect is hard to achieve. In addition, search engines crawl pages at random, so a batch of pages cannot all be updated in the index within a short time and ranking is not affected immediately; when sampling, the webmaster encounters many pages that the search engine has not yet updated, so sampling efficiency is very low.
Summary of the invention
The embodiments of the present application provide a web page title processing method and apparatus to solve the technical problem that web page titles are currently determined inefficiently.
In one aspect, an embodiment of the present application provides a web page title processing method, including:
obtaining web page content, parsing the web page content to obtain web page units, and extracting feature words of the web page units;
screening the feature words to determine the key feature words that are included in a title dictionary, and generating title units from the key feature words;
generating a title set according to the title units, scoring the titles in the title set, and selecting the title with the highest score.
In another aspect, an embodiment of the present application provides a web page title processing apparatus, including:
a web page parsing submodule, configured to obtain web page content and parse the web page content to obtain web page units;
a feature word extraction submodule, configured to extract feature words of the web page units;
a dictionary submodule, configured to store a title dictionary;
a feature word screening submodule, configured to screen the feature words and determine the key feature words that are included in the title dictionary;
a key feature word processing submodule, configured to generate title units from the key feature words;
a title set generation submodule, configured to generate a title set according to the title units;
a title scoring submodule, configured to score the titles in the title set and select the title with the highest score.
The beneficial effects are as follows:
In the present application, parsing the web page content and performing feature extraction yields the feature words of the web page. The feature words are screened against a maintained title dictionary; those contained in the title dictionary are taken as key feature words and used to generate title units. A title set is generated according to the title units, the titles in the title set are scored, and the highest-scoring title is selected, thereby determining the title of the web page. The scheme of the present application avoids excessive manual participation and improves the efficiency of determining web page titles.
Brief description of the drawings
Specific embodiments of the present application are described below with reference to the accompanying drawings, in which:
Fig. 1 shows a flow chart of the web page title processing method in an embodiment of the present application;
Fig. 2 shows an example flow chart of the web page title processing flow before web page publication in embodiment one of the present application;
Fig. 3 shows an example flow chart of the web page title processing flow before web page publication in embodiment two of the present application;
Fig. 4 shows an example flow chart of the web page title processing flow after web page publication in embodiment three of the present application;
Fig. 5 shows a structural diagram of an example web page title processing apparatus in the present application;
Fig. 6 shows a structural diagram of an example web page title processing apparatus in the present application;
Fig. 7 shows a structural diagram of an example web page title processing apparatus in the present application;
Fig. 8 shows a structural diagram of an example web page title processing apparatus in the present application;
Fig. 9 shows a structural diagram of an example web page title processing apparatus in the present application;
Fig. 10 shows a structural diagram of an example web page title processing apparatus in the present application.
Detailed description of the invention
To make the technical solutions and advantages of the present application clearer, exemplary embodiments of the present application are described in more detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present application, not an exhaustive list of all embodiments. Where no conflict arises, the embodiments in this description and the features in the embodiments may be combined with one another.
During the invention process, the inventor noticed that titles are currently determined mainly by the webmaster for the web pages being maintained. This relies mainly on manual determination and is inefficient. To address this deficiency, the embodiments of the present application propose a web page title processing scheme, described below.
The present application can be applied to website SEO (Search Engine Optimization), a way of using the search rules of search engines to improve a website's natural ranking in those search engines.
The page title referred to in the present application may be the content of the title tag inside the head tag of the page source code, visible after the web page is opened, for example: <title>1688.com, the world's largest purchasing and wholesale platform, built by Alibaba</title>.
Fig. 1 shows the web page title processing flow of an embodiment of the present application. As shown in the figure, the flow includes:
Step 101: obtain web page content, parse the web page content to obtain web page units, and extract the feature words of the web page units;
Step 102: screen the extracted feature words, determine the key feature words that are included in the title dictionary, and generate title units from the key feature words;
Step 103: generate a title set according to the generated title units, score the titles in the title set, and select the title with the highest score.
Beneficial effect:
By parsing the web page content and performing feature extraction, the embodiment of the present application obtains the feature words of the web page. The feature words are screened against a maintained title dictionary; those contained in the title dictionary are taken as key feature words and used to generate title units. A title set is generated according to the title units, the titles in the title set are scored, and the highest-scoring title is selected, thereby determining the title of the web page. The scheme of the present application avoids excessive manual participation and improves the efficiency of determining web page titles.
Further, so that the generated title better matches the current crawling characteristics of each search engine, the method may also be implemented as follows.
In implementation, after the title units are generated from the key feature words, the title units are also output to search engines for searching, and a search title set is generated according to the rankings returned by the search engines;
In that case, generating the title set according to the title units specifically means generating the title set according to both the title units and the search title set.
Beneficial effect:
Because the title set is generated from both the title units and the search rankings that the search engines currently return for those title units, the title set reflects each search engine's current crawling characteristics. Scoring the titles in this set and selecting the highest-scoring one therefore yields a title that better matches those characteristics.
Further, to better maintain the title dictionary, the method may also be implemented as follows.
In implementation, after the feature words of the web page units are extracted, when the feature words include a new word that is not in the title dictionary, the new word is pushed for manual review; when manual review finds that the new word is a valid new word, it is added to the title dictionary.
Beneficial effect:
With this implementation the title dictionary can be maintained dynamically, so that it is continuously improved. A better title dictionary in turn makes screening against it more reasonable, which improves the web page titles determined by the scheme of the present application.
Further, after the highest-scoring title is selected, the web page may be published with that title. Alternatively, to compensate for a title generation algorithm that is not yet mature in its early stage, and to incorporate the webmaster's experience, the web page may be published only after manual intervention has improved the precision of the generated title, as follows.
In implementation, after the highest-scoring title is selected, it is also pushed for manual review; the title of the web page content is determined according to the manual review, and the web page is then published.
Beneficial effect:
Because a manual review stage follows selection of the highest-scoring title, manual intervention improves the precision of the generated title.
The schemes above all concern processing before the web page is published. After the web page is published, the web page title can further be maintained continuously, as follows.
After the web page is published, its search engine ranking is monitored continuously, and when the ranking falls below a set threshold, web page title update processing is performed.
Beneficial effect:
Continuous monitoring of the web page's search engine ranking after publication, with title update processing when the ranking falls below the set threshold, keeps the web page title maintained over time and thus further ensures its quality.
Monitoring of the web page's search engine ranking may start immediately after the web page is published. However, a crawler may not fetch the web page immediately after publication, so monitoring the ranking right away wastes monitoring resources. The following implementation may therefore be used.
In implementation, after the web page is published, the web log is analyzed, and monitoring of the web page's search engine ranking starts only after the log shows that a crawler has fetched the web page.
Beneficial effect: checking whether a crawler has fetched the web page avoids monitoring the page's search engine ranking before any crawler has fetched it, which saves monitoring resources.
Further, a search engine updates its index for a page only some time after its crawler has fetched the page, and the update interval differs between engines. To save still more monitoring resources, the following implementation may be used.
In implementation, a crawl index timetable is maintained for each search engine; after a crawler is found to have fetched the web page, monitoring of the page's search engine ranking is started according to the crawl index timetable.
Beneficial effect: deciding when to start the ranking monitoring program according to the crawl index timetable reduces the number of crawls needed for monitoring and saves still more monitoring resources.
To facilitate implementation of the present application, examples are described below.
Embodiment one
Embodiment one is an example of the web page title processing flow before web page publication. As shown in Fig. 2, it includes:
Step 201: obtain web page content, and parse the web page content with XPath to obtain web page units;
The present application does not specifically limit how the web page content is parsed, as long as web page units can be obtained by parsing the web page content.
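As an illustration of step 201, the following is a minimal sketch of XPath-based parsing. It assumes the page is well-formed XHTML so that Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) can read it; the sample page, the unit names, and the XPath expressions in `UNIT_XPATHS` are hypothetical and not taken from the application.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample page; real pages are rarely well-formed XML.
PAGE = """<html>
  <head><title>Old title</title></head>
  <body>
    <h1>Summer dress</h1>
    <div class="attr">Brand: Acme</div>
    <div class="attr">Category: Women's clothing</div>
    <p class="promo">Free shipping</p>
  </body>
</html>"""

# Hypothetical mapping from a web page unit name to an XPath expression.
UNIT_XPATHS = {
    "heading": ".//h1",
    "attributes": ".//div[@class='attr']",
    "promotion": ".//p[@class='promo']",
}

def parse_page_units(page_source, unit_xpaths):
    """Return {unit name: [text, ...]} with the text of every matching node."""
    root = ET.fromstring(page_source)
    units = {}
    for name, xpath in unit_xpaths.items():
        texts = ["".join(node.itertext()).strip() for node in root.findall(xpath)]
        units[name] = [t for t in texts if t]
    return units

units = parse_page_units(PAGE, UNIT_XPATHS)
```

For real-world HTML that is not well-formed, a tolerant HTML parser with fuller XPath support would likely be needed in place of `ElementTree`.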
Step 202: extract the feature words of the web page units with a feature extraction algorithm;
Taking a shopping website as an example, the feature words targeted by the feature extraction algorithm may include marketing words, attribute words, product words, and so on. For example, marketing words may be "free shipping", "special offer", and the like; attribute words may be brand attributes, category attributes, and the like; product words may be "dress", "trousers", and the like.
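The application does not specify the feature extraction algorithm of step 202. One simple possibility, sketched here purely as an assumption, is to match the text of a web page unit against per-category word lists. The lexicon contents and the function name `extract_feature_words` are hypothetical.

```python
# Hypothetical category lexicons; a real system would use far larger ones.
CATEGORY_LEXICON = {
    "marketing": {"free shipping", "special offer"},
    "attribute": {"acme", "cotton"},
    "product": {"dress", "trousers"},
}

def extract_feature_words(text, lexicon=CATEGORY_LEXICON):
    """Return {category: [matched words]} for lexicon words found in the text."""
    lowered = text.lower()
    found = {}
    for category, words in lexicon.items():
        # Plain substring matching: simple, but "dress" would also match "address".
        hits = sorted(w for w in words if w in lowered)
        if hits:
            found[category] = hits
    return found

features = extract_feature_words("Acme cotton dress, free shipping")
```

Substring matching keeps the sketch short; word segmentation or token-boundary matching would be a more careful choice, especially for Chinese text.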
Step 203: screen the extracted feature words and determine the key feature words that are included in the title dictionary;
After the extracted feature words are screened in this step, feature words that are not included in the title dictionary may also be pushed for manual review as new words; when manual review finds that a new word is a valid new word, it is added to the title dictionary. In this way the title dictionary can be maintained dynamically and continuously improved. A better title dictionary in turn makes screening against it more reasonable, which improves the web page titles determined by the scheme of the present application.
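The screening of step 203 and the new-word review loop can be sketched as follows. This is an illustrative sketch only: the title dictionary is modeled as an in-memory set, and the review decisions are supplied as a plain mapping, both of which are assumptions rather than details from the application.

```python
def screen_feature_words(feature_words, title_dictionary):
    """Split feature words into key feature words and new-word candidates."""
    key_words = [w for w in feature_words if w in title_dictionary]
    new_words = [w for w in feature_words if w not in title_dictionary]
    return key_words, new_words

def apply_review(title_dictionary, reviewed):
    """Add the words a reviewer marked valid; `reviewed` maps word -> bool."""
    title_dictionary.update(w for w, valid in reviewed.items() if valid)

title_dictionary = {"dress", "free shipping"}
key_words, pending = screen_feature_words(
    ["dress", "free shipping", "boho", "asdf"], title_dictionary
)
# A human reviewer approves "boho" and rejects "asdf" (hypothetical decisions).
apply_review(title_dictionary, {"boho": True, "asdf": False})
```

After the review, a later run over the same page would treat "boho" as a key feature word, which is the dynamic maintenance effect described above.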
Step 204: generate title units from the key feature words by permutation and combination;
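A minimal sketch of step 204, assuming a title unit is a space-joined ordered arrangement of distinct key feature words. The length bounds `min_len` and `max_len` are hypothetical parameters added to keep the number of arrangements manageable.

```python
from itertools import permutations

def generate_title_units(key_words, min_len=2, max_len=3):
    """Yield every ordered arrangement of min_len..max_len distinct key words."""
    upper = min(max_len, len(key_words))
    for size in range(min_len, upper + 1):
        for combo in permutations(key_words, size):
            yield " ".join(combo)

title_units = list(generate_title_units(["Acme", "dress", "free shipping"]))
```

The number of arrangements grows factorially with the number of key feature words, so some bound on length or count is likely to be needed in practice.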
Step 205: take the title units as input and execute a title generation algorithm to generate a title set;
Step 206: score the titles in the title set and select the title with the highest score.
Scoring the titles in the title set may mean using an algorithm to score each title's relevance to the web page content; the present application does not limit the specific scoring algorithm.
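Since the scoring algorithm is left open, the following sketch uses one simple stand-in: the average relative frequency, within the page text, of the tokens in the title. This measure and the helper names are assumptions for illustration, not the scoring method of the application.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def relevance_score(title, page_text):
    """Average relative page frequency of the title's tokens (0.0 if none)."""
    page_counts = Counter(tokenize(page_text))
    title_tokens = tokenize(title)
    if not title_tokens or not page_counts:
        return 0.0
    total = sum(page_counts.values())
    return sum(page_counts[t] / total for t in title_tokens) / len(title_tokens)

def best_title(titles, page_text):
    """Return the title with the highest relevance score."""
    return max(titles, key=lambda t: relevance_score(t, page_text))

page_text = "Acme summer dress. This cotton dress ships free."
winner = best_title(["Acme dress", "Acme trousers", "cheap trousers"], page_text)
```

Any other relevance measure could be substituted for `relevance_score` without changing the selection step.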
After the highest-scoring title is selected, the web page may be published with that title. Alternatively, the highest-scoring title may be pushed for manual review, the title of the web page content determined according to the manual review, and the web page then published.
Pushing the highest-scoring title for manual review incorporates the webmaster's experience and uses manual intervention to improve the precision of the generated title.
Embodiment two
Embodiment two is also an example of the web page title processing flow before web page publication. So that the generated title better matches the current crawling characteristics of each search engine, this example incorporates the search results of each search engine. As shown in Fig. 3, it includes:
Step 301: obtain web page content, and parse the web page content with XPath to obtain web page units;
Step 302: extract the feature words of the web page units with a feature extraction algorithm;
Step 303: screen the extracted feature words and determine the key feature words that are included in the title dictionary;
Step 304: generate title units from the key feature words by permutation and combination;
Step 305: output the title units to search engines for searching, and generate a search title set according to the rankings returned by the search engines;
Specifically, this step may use each title unit as a search term on each major search engine and take the top-ranked titles (for example, the top five) to generate the search title set.
Step 306: take the title units and the search title set as input and execute the title generation algorithm to generate a title set;
Step 307: score the titles in the title set and select the title with the highest score.
In embodiment two, because the title set is generated from both the title units and the search rankings that the search engines currently return for those title units, the title set reflects each search engine's current crawling characteristics. Scoring the titles in this set and selecting the highest-scoring one therefore yields a title that better matches those characteristics.
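Steps 305 and 306 can be sketched as below. The search call is replaced by a stand-in function because no concrete search API is named in the application; `fake_engine`, the engine name, and the simple concatenation used to form the generator input are all hypothetical.

```python
def build_search_title_set(title_units, engines, top_n=5):
    """Query every engine with every title unit and keep the top_n titles."""
    found = []
    for search in engines.values():
        for unit in title_units:
            for title in search(unit)[:top_n]:
                if title not in found:  # keep first occurrence, preserve order
                    found.append(title)
    return found

def fake_engine(query):
    # Stand-in for a real search call; returns result titles ordered by rank.
    return [f"{query} result {i}" for i in range(1, 8)]

title_units = ["Acme dress", "dress free shipping"]
search_titles = build_search_title_set(title_units, {"engine_a": fake_engine})
# Input to the title generation algorithm: title units plus search titles.
generator_input = title_units + search_titles
```

With `top_n=5`, each title unit contributes at most five titles per engine, matching the "top five" example given for step 305.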
For the specific implementation of the corresponding parts of embodiment two, refer to embodiment one.
Embodiment three
Embodiment three is an example of the web page title processing flow after web page publication. As shown in Fig. 4, it includes:
Step 401: after the web page is published, analyze the current day's web log and judge whether a crawler has fetched the web page; if so, go to step 402; otherwise, return to step 401 and analyze the next day's web log;
In a specific implementation, step 401 is optional. It is included because a crawler may not fetch the web page immediately after publication, so monitoring the page's search engine ranking right away wastes monitoring resources; performing the subsequent processing only after a crawler is judged to have fetched the web page makes monitoring more efficient.
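The log check of step 401 can be sketched as a scan of access-log lines for crawler user agents. The sketch assumes the common "combined" access-log layout with the user agent as the last quoted field; the marker list and the sample log lines are made up for illustration.

```python
import re

# Hypothetical user-agent markers; real crawler names should be taken from
# each search engine's own documentation.
CRAWLER_MARKERS = ("Googlebot", "Baiduspider", "bingbot")

# Captures the request path and the last quoted field (the user agent).
LINE_RE = re.compile(r'"(?:GET|HEAD) (?P<path>\S+) [^"]*".*"(?P<agent>[^"]*)"$')

def crawlers_that_fetched(log_lines, page_path):
    """Return the set of crawler markers seen requesting page_path."""
    seen = set()
    for line in log_lines:
        match = LINE_RE.search(line.strip())
        if not match or match.group("path") != page_path:
            continue
        agent = match.group("agent").lower()
        for marker in CRAWLER_MARKERS:
            if marker.lower() in agent:
                seen.add(marker)
    return seen

# Made-up sample log lines.
LOG = [
    '1.2.3.4 - - [01/Jan/2024:10:00:00 +0000] "GET /item/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '5.6.7.8 - - [01/Jan/2024:10:05:00 +0000] "GET /item/42 HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 10.0)"',
]
crawled_by = crawlers_that_fetched(LOG, "/item/42")
```

User-agent strings can be spoofed, so a stricter check might also verify the requesting address; that is outside this sketch.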
Step 402: start monitoring the web page's search engine ranking according to the crawl index timetable;
In a specific implementation, monitoring of the page's search engine ranking may start directly after the web page is published, or directly after step 401 judges that a crawler has fetched the web page. The crawl index timetable is added because a search engine updates its index for a page only some time after its crawler has fetched the page, and the update interval differs between engines. Starting ranking monitoring according to the crawl index timetable therefore reduces the number of crawls needed for monitoring and saves still more monitoring resources.
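One way to picture the crawl index timetable of step 402 is as a per-engine delay between the crawl and the index update. The representation below, the engine names, and the delay values are all assumptions for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical delays between a crawl and the index update, per engine.
INDEX_DELAY = {
    "engine_a": timedelta(days=1),
    "engine_b": timedelta(days=3),
}

def monitoring_start_times(crawl_times, index_delay=INDEX_DELAY):
    """Map each engine to the earliest time its ranking is worth checking."""
    return {
        engine: crawled_at + index_delay[engine]
        for engine, crawled_at in crawl_times.items()
        if engine in index_delay
    }

# Made-up crawl times, as they might be read from the web log.
crawl_times = {
    "engine_a": datetime(2024, 1, 1, 10, 0),
    "engine_b": datetime(2024, 1, 1, 12, 0),
}
starts = monitoring_start_times(crawl_times)
```

Ranking checks for an engine are simply skipped until its start time has passed, which is where the saving in monitoring resources would come from.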
Step 403: judge whether the page's ranking in the search engines is below the set threshold; if so, go to step 405; otherwise, go to step 404;
How to judge whether the page's ranking is below the set threshold can be configured as required. For example, with rankings obtained from five search engines in total, the ranking is not counted as below the set threshold if the page ranks in the top 20 in three of them; alternatively, the average ranking across the five search engines is used, and the ranking is counted as below the set threshold if the average falls outside the top 20.
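The two example configurations for step 403 can be sketched as follows. The sketch reads "in three of them" as "in at least three", and it assigns a placeholder position (`unranked=100`) to engines where the page is not found; both choices, like the function names, are assumptions.

```python
def below_threshold_by_count(ranks, top_k=20, min_engines=3):
    """True when fewer than min_engines engines rank the page within top_k."""
    in_top = sum(1 for r in ranks.values() if r is not None and r <= top_k)
    return in_top < min_engines

def below_threshold_by_average(ranks, top_k=20, unranked=100):
    """True when the mean rank position is numerically larger than top_k."""
    positions = [unranked if r is None else r for r in ranks.values()]
    return sum(positions) / len(positions) > top_k

# Made-up rankings from five engines; None means the page was not found.
ranks = {"e1": 3, "e2": 15, "e3": 40, "e4": None, "e5": 18}
count_says_low = below_threshold_by_count(ranks)
average_says_low = below_threshold_by_average(ranks)
```

For this sample the two configurations disagree: three engines place the page in the top 20, but the average position is pulled past 20 by the other two, which shows why the choice of rule is left configurable.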
Because ranking monitoring of the page is a continuous process in the present application, when judging whether the page's ranking in the search engines is below the set threshold, the page's latest ranking in each search engine and the title shown for the page there can both be obtained. This makes it possible to judge whether the title has been updated in the search engine, to determine the ranking under the updated title, and to monitor the ranking for the updated title.
Step 404: return to step 403 when the monitoring period expires;
Step 405: perform web page title update processing.
Continuous monitoring of the web page's search engine ranking after publication, with title update processing when the ranking falls below the set threshold, keeps the web page title maintained over time and thus further ensures its quality.
The specific update processing may be a quick rollback of the title or further optimization. A quick rollback directly restores the title the web page used previously. For further optimization, manual intervention is possible: a person may directly select a new web page title, or may decide to run the web page title processing flow of the present application again. In a specific implementation, manual decision may also be omitted: when the page's ranking in the search engines is judged to be below the set threshold, the web page title processing flow of the present application is rerun automatically.
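The update options of step 405 can be sketched with a small title history that supports quick rollback, plus a handler that either reruns title generation or rolls back. The class, the handler, and the `regenerate` callback are hypothetical names; how a title is actually regenerated is left to the flow described earlier.

```python
class TitleHistory:
    """Keeps the titles a page has used, newest last."""

    def __init__(self, initial_title):
        self._titles = [initial_title]

    @property
    def current(self):
        return self._titles[-1]

    def update(self, new_title):
        self._titles.append(new_title)

    def rollback(self):
        """Return to the previous title, if there is one."""
        if len(self._titles) > 1:
            self._titles.pop()
        return self.current

def handle_low_rank(history, regenerate=None):
    """Regenerate the title when a generator is supplied, otherwise roll back."""
    if regenerate is not None:
        history.update(regenerate())
        return history.current
    return history.rollback()

history = TitleHistory("Acme dress")
history.update("dress free shipping Acme")
restored = handle_low_rank(history)  # quick rollback to the earlier title
```

Passing a callable as `regenerate` models the automatic rerun of the title processing flow; a manual choice could be modeled by a callable that returns the title a person selected.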
Based on the same inventive concept, an embodiment of the present application also provides a web page title processing apparatus. Because the principle by which the apparatus solves the problem is similar to that of the web page title processing method, the implementation of the apparatus may refer to the implementation of the method, and repeated details are omitted.
As shown in Fig. 5, the web page title processing apparatus in the present application may include:
a web page parsing submodule 501, configured to obtain web page content and parse the web page content to obtain web page units;
a feature word extraction submodule 502, configured to extract feature words of the web page units;
a dictionary submodule 503, configured to store the title dictionary;
a feature word screening submodule 504, configured to screen the feature words and determine the key feature words that are included in the title dictionary;
a key feature word processing submodule 505, configured to generate title units from the key feature words;
a title set generation submodule 506, configured to generate a title set according to the title units;
a title scoring submodule 507, configured to score the titles in the title set and select the title with the highest score.
For convenience of the following description, the web page parsing submodule 501, feature word extraction submodule 502, dictionary submodule 503, feature word screening submodule 504, key feature word processing submodule 505, title set generation submodule 506, and title scoring submodule 507 are grouped into a title generation module 51. In implementation, the units in Fig. 5 are not required to be contained in a single module.
So that the generated title better matches the current crawling characteristics of each search engine, the key feature word processing submodule 505 is further configured to output the title units to search engines for searching and to generate a search title set according to the rankings returned by the search engines; the title set generation submodule 506 is then configured to generate the title set according to the title units and the search title set.
To maintain the title dictionary dynamically and better, so that it is continuously improved, and so that screening against it becomes more reasonable and the web page titles determined by the scheme of the present application improve, the web page title processing apparatus in the present application may also, as shown in Fig. 6, include a first manual operation module 601;
the feature word screening submodule 504 is further configured to push a new word to the first manual operation module 601 when the feature words include a new word that is not in the title dictionary;
the first manual operation module 601 is configured to add the new word to the title dictionary stored in the dictionary submodule 503 when manual review finds that the new word is a valid new word.
The web page title processing apparatus in the present application may also include a web page publishing module 701, configured to publish the web page. After selecting the highest-scoring title, the title scoring submodule in the title generation module 51 may output that title directly to the web page publishing module 701. Alternatively, as shown in Fig. 7, the title generation module 51 pushes the highest-scoring title to a second manual operation module 702, which determines the title of the web page content according to manual review and provides it to the web page publishing module 701, which then publishes the web page. In the approach shown in Fig. 7, a manual review stage follows selection of the highest-scoring title, so manual intervention improves the precision of the generated title.
To maintain the title after the web page is published, the web page title processing apparatus in the present application may also include a ranking monitoring module 801, configured to continuously monitor the web page's search engine ranking after the web page is published and to output a low-ranking alarm when the ranking falls below the set threshold.
In a specific implementation, the ranking monitoring module 801 can obtain the page's latest ranking in each search engine and the title shown for the page there. This makes it possible to judge whether the title has been updated in the search engine, to determine the ranking under the updated title, and to monitor the ranking for the updated title.
When the ranking monitoring module 801 outputs a low-ranking alarm, the web page title can be quickly rolled back or further optimized. Further optimization may involve manual intervention. The scheme with manual intervention, shown in Fig. 8, includes the ranking monitoring module 801 and a third manual operation module 802; the third manual operation module 802 is configured to apply manual intervention to the web page title upon receiving a low-ranking alarm from the ranking monitoring module 801.
Monitoring of the web page's search engine ranking may start immediately after the web page is published. However, a crawler may not fetch the web page immediately after publication, so monitoring the ranking right away wastes monitoring resources. Further, therefore, the web page title processing apparatus in the present application may also, as shown in Fig. 9, include a log analysis module 901, configured to analyze the web log after the web page is published and to notify the ranking monitoring module 801 once a crawler is found to have fetched the web page; the ranking monitoring module 801 starts monitoring the web page's search engine ranking upon this notification.
After receiving the notification from the log analysis module 901, the ranking monitoring module 801 may start monitoring the web page's search engine ranking at once. However, a search engine updates its index for a page only some time after its crawler has fetched the page, and the update interval differs between engines. To save still more monitoring resources, the web page title processing apparatus in the present application may further, as shown in Fig. 10, include a crawl engine index module 1001, configured to maintain the crawl index timetable of each search engine;
after receiving the notification from the log analysis module 901, the ranking monitoring module 801 starts monitoring the web page's search engine ranking according to the crawl index timetable maintained by the crawl engine index module 1001.
For convenience of description, the parts of the apparatus described above are divided by function into modules or units and described separately. Of course, when the present application is implemented, the functions of the modules or units may be realized in one or more pieces of software or hardware.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present application is described with reference to flow charts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make further changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be interpreted as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.