Disclosure of Invention
In order to solve the above problems, the present invention provides a method for pushing profit information of a user, which can not only push profit information to users meeting conditions, but also ensure the comprehensiveness of the information, and adopts the following technical scheme:
the invention provides a user profit information pushing method based on web crawlers and big data analysis, which is used for pushing corresponding profit information to different users according to user information and is characterized by comprising the following steps: step S1, acquiring user information of a user; step S2, text analysis is carried out on the user information in sequence, and a plurality of condition characteristics reflecting the profit conditions of each user are obtained respectively; step S3, classifying the users according to the condition characteristics, thereby obtaining different user classifications; step S4, utilizing web crawler to obtain profit information related to profit from multiple information publishing websites associated with user profit; step S5, text analysis is carried out on the profit information to obtain profit conditions corresponding to different profit information; step S6, obtaining the matching degree of the user classification and the profit information according to the condition characteristics corresponding to the user classification and the profit condition corresponding to the profit information; step S7, pushing the profit information to the users in the user category according to the matching degree with the profit information.
The method for pushing the user profit information based on the web crawler and the big data analysis provided by the invention can also have the technical characteristics that the step S3 comprises the following steps: step S3-1, forming the condition characteristics into characteristic vectors corresponding to the users; step S3-2, clustering the users based on the feature vectors; and step S3-3, acquiring the user tags of each type of users in turn.
The user profit information pushing method based on the web crawler and the big data analysis provided by the invention can also have the technical characteristics that the clustering of the step S3-2 is carried out by adopting a clustering algorithm based on community discovery.
The method for pushing the user profit information based on the web crawler and the big data analysis provided by the invention can also have the technical characteristics that the step S5 comprises the following steps: step S5-1, carrying out data cleaning on the profit information so as to remove invalid information in the profit information; and step S5-2, performing text analysis on the cleaned profit information to obtain the profit conditions contained in each profit information.
The user profit information pushing method based on web crawlers and big data analysis provided by the invention can also have the technical characteristics that in step S6 the matching degree is a condition satisfaction degree, and step S6 comprises the following steps: step S6-1, judging whether each condition characteristic of the user classification conforms to the profit condition in the profit information, giving a high matching value if so and a low matching value if not; step S6-2, setting different weight values for different types of profit conditions; and step S6-3, multiplying each matching value by the corresponding weight value and summing the products to obtain the condition satisfaction degree between the user classification and the profit information.
The method for pushing the user profit information based on the web crawler and the big data analysis provided by the invention can also have the technical characteristics that the pushing in step S7 is carried out according to the following rule: a matching degree threshold is set, and when the matching degree between a user classification and a piece of profit information is higher than the matching degree threshold, that profit information is sent to the users under the user classification.
Action and Effect of the invention
According to the user profit information pushing method based on the web crawler and the big data analysis, the text analysis is performed on the user information to obtain a plurality of condition characteristics, the users are classified according to the condition characteristics, the profit information is also subjected to the text analysis to obtain a plurality of profit conditions, and therefore the matching degree between the user classification and the profit information can be obtained according to the condition characteristics and the profit conditions, the profit information is pushed to each user in the user classification, the user can obtain the profit information which is in accordance with the conditions of the user, and the user can obtain the profit information without spending a large amount of time for screening and condition matching.
Furthermore, since the above process is performed in which the users are classified into different categories and the condition features correspond to the user categories, the condition matching performed in the process is performed based on the categories rather than on the individual users. Therefore, even if the condition satisfaction of each classification and profit information needs to be calculated one by one, the calculation amount thereof is much smaller than that of matching based on each user, and condition matching can be completed quickly and efficiently.
Detailed Description
The following description of the embodiments of the present invention will be made with reference to the accompanying drawings. In the embodiments, a case where enterprises need to obtain fund policy support information issued by a government agency is described as an example, that is, in the embodiments, the enterprises are users who have a need to obtain profit information such as policy support information in compliance with their own conditions.
< example >
Fig. 1 is a flowchart of a user profit information pushing method based on web crawlers and big data analysis according to an embodiment of the present invention.
As shown in fig. 1, the method for pushing user profit information based on web crawlers and big data analysis of the present embodiment includes the following steps.
In step S1, user information of the user is acquired.
In this embodiment, the user is an enterprise. The user information includes enterprise basic information, such as the name, address, and contact information of the user, and enterprise operation information, such as business license information, shareholder and capital contribution information, key personnel information, branch information, liquidation information, administrative license information, administrative penalty information, whether the enterprise is listed in the directory of abnormal operations, and whether it is listed in the list of seriously illegal and dishonest enterprises. The enterprise basic information can be provided by the enterprise during registration; the enterprise operation information can likewise be provided by the enterprise during registration, or can be acquired from a relevant information publicizing website with the enterprise's permission.
Step S2, text analysis is sequentially performed on the user information, and a plurality of condition features reflecting the profit conditions of the respective users are obtained. Each enterprise operation information comprises characteristic attributes and characteristic values.
In this embodiment, the enterprise operation information obtained from the information publicizing website consists of a plurality of texts, and the condition characteristics can be obtained by performing text analysis on each text. For example, a pre-stored lexicon is used to segment the text "registered capital 30 million" of Company A, and the registered capital of Company A is then judged to be "30 million" from the context of the text; here "30 million" is a condition feature whose feature attribute is "registered capital" and whose feature value is "30 million".
In addition, when the enterprise operation information is provided by the enterprise itself, the user fills in content in designated fields (for example, entering "30 million" directly in the "registered capital" field), so the feature attribute and the feature value of each condition feature can be obtained without text analysis such as word segmentation.
After the analysis and the acquisition of the condition characteristics of all the users are completed, the process may proceed to step S3.
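The feature extraction of step S2 can be sketched as follows. This is only a minimal illustration under stated assumptions, not the embodiment's actual analyzer: the lexicon `ATTRIBUTE_LEXICON`, the function name `extract_condition_features`, the sample record, and the use of a regular expression in place of lexicon-based word segmentation are all hypothetical.

```python
import re

# Hypothetical lexicon of feature attributes (assumed for illustration).
ATTRIBUTE_LEXICON = ["registered capital", "registered place"]

def extract_condition_features(text):
    """Return {feature attribute: feature value} pairs found in a text record.

    The value is assumed to be whatever follows the attribute name,
    up to the next semicolon or the end of the text.
    """
    features = {}
    for attribute in ATTRIBUTE_LEXICON:
        match = re.search(re.escape(attribute) + r"\s*:?\s*([^;]+)", text)
        if match:
            features[attribute] = match.group(1).strip()
    return features

# Hypothetical sample record for one enterprise.
features = extract_condition_features(
    "Company A; registered capital: 30 million; registered place: Place A"
)
```

When the enterprise fills in designated fields itself, the same attribute/value dictionary would be available directly and this step could be skipped.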
And step S3, classifying the users according to the condition characteristics, thereby obtaining different user classifications.
In this embodiment, the classification of the user is performed based on a user portrait processing technique, and includes the following steps:
step S3-1, the condition features are formed into feature vectors corresponding to the users. Whether policy support can be obtained is related to the qualification and the operation condition of the enterprise, and the condition characteristics in the embodiment can reflect the qualification and the operation condition of the enterprise, so all the condition characteristics are used for forming the characteristic vector; in other embodiments, different condition features may be selected to form the feature vector for conditions required by different application scenarios.
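One possible way to form such feature vectors is one-hot encoding of the condition features, sketched below. The encoding scheme, the function name `build_feature_vectors`, and the sample enterprises are assumptions made for illustration; the embodiment does not specify how the vector is encoded.

```python
def build_feature_vectors(users, attributes):
    """One-hot encode condition features into numeric vectors.

    users: {user name: {feature attribute: feature value}}
    Returns (vectors, vocab), where vocab lists the (attribute, value)
    pair that each vector position stands for.
    """
    vocab = []
    for attribute in attributes:
        values = {u[attribute] for u in users.values() if attribute in u}
        vocab.extend((attribute, value) for value in sorted(values))
    vectors = {
        name: [1.0 if feats.get(a) == v else 0.0 for a, v in vocab]
        for name, feats in users.items()
    }
    return vectors, vocab

# Hypothetical sample data.
users = {
    "Company A": {"registered place": "Place A", "industry": "software"},
    "Company B": {"registered place": "Place A", "industry": "retail"},
    "Company C": {"registered place": "Place B", "industry": "retail"},
}
vectors, vocab = build_feature_vectors(users, ["registered place", "industry"])
```

Selecting a different `attributes` list corresponds to choosing different condition features for a different application scenario.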
And step S3-2, clustering the users based on the characteristic vectors, namely, clustering enterprises by adopting a clustering algorithm based on the characteristic vectors. In this embodiment, the clustering algorithm based on community discovery in the prior art is used to cluster enterprises, so as to obtain different enterprise classifications (i.e., user classifications). In addition, in other embodiments, other clustering algorithms in the prior art, such as a K-means clustering algorithm, a hierarchical clustering algorithm, etc., may also be used for clustering.
In this embodiment, the clustering algorithm based on community discovery is implemented as follows. Correlation coefficients are calculated from the feature vectors for all users (namely enterprises), yielding a correlation coefficient matrix among all users. Two users are connected when the correlation coefficient between them is larger than a preset threshold, so that a network with all users as nodes is obtained. Each user is a node, and the users form the structure of the whole network through their mutual connections. In this network, some users are tightly connected and others sparsely connected; a tightly connected part can be regarded as a community, within which nodes are densely connected while connections between two communities are relatively sparse. This is called a community structure. Based on this, the existing Fast Unfolding algorithm is used for community discovery, and the set of nodes in each resulting community is a user group, which realizes the clustering of the users.
After clustering is finished, the enterprises within each class have a certain similarity, such as a similar business, a similar registered place, or similar registered capital. That is, the enterprises in each class are relatively similar in one or several aspects.
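The network construction described above can be sketched as follows. The sketch computes Pearson correlation coefficients between feature vectors and connects users whose coefficient exceeds a threshold. For brevity, the community-discovery step is replaced here by plain connected components, which is a deliberate simplification and not the Fast Unfolding algorithm; the function names, the threshold of 0.5, and the sample vectors are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def build_graph(vectors, threshold):
    """Connect two users when their correlation exceeds the threshold."""
    names = list(vectors)
    graph = {name: set() for name in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(vectors[a], vectors[b]) > threshold:
                graph[a].add(b)
                graph[b].add(a)
    return graph

def connected_groups(graph):
    """Simplified stand-in for community discovery: connected components."""
    seen, groups = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups

# Hypothetical feature vectors for four enterprises.
vectors = {
    "Company A": [1.0, 0.0, 1.0, 0.0],
    "Company B": [0.9, 0.1, 0.8, 0.0],
    "Company C": [0.0, 1.0, 0.0, 1.0],
    "Company D": [0.1, 1.0, 0.0, 0.9],
}
groups = connected_groups(build_graph(vectors, threshold=0.5))
```

On a real network, a modularity-based method such as Fast Unfolding can split a single connected component into several communities, which connected components alone cannot do.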
And step S3-3, acquiring the user tags of each type of users in turn.
As described above, after the enterprises are clustered by the clustering algorithm, the enterprises in each class have a certain similarity. Therefore, using existing user portrait processing technology, a plurality of user tags are obtained for each class of enterprises from the characteristics common to that class (for example, common condition characteristics). Each user tag then represents a common characteristic of that class of enterprises, such as a common registered place.
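A minimal sketch of deriving user tags from common condition characteristics is given below. Taking the tags to be the attribute/value pairs shared by every member of a class is an assumption made for illustration; the function name `common_tags` and the sample data are hypothetical.

```python
def common_tags(members, features):
    """Return the (attribute, value) pairs shared by every member of a class."""
    shared = None
    for name in members:
        pairs = set(features[name].items())
        shared = pairs if shared is None else shared & pairs
    return sorted(shared or [])

# Hypothetical condition features of two enterprises in the same class.
features = {
    "Company A": {"registered place": "Place A", "industry": "software"},
    "Company B": {"registered place": "Place A", "industry": "retail"},
}
tags = common_tags(["Company A", "Company B"], features)
```

Here the only shared characteristic is the registered place, so it becomes the single tag of the class.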
In step S4, profit information is acquired by a web crawler from a plurality of information publishing websites associated with the user's profit.
In this embodiment, the information publishing websites associated with the profit of the enterprise as the user are mainly policy support information publishing websites and policy news publishing websites. For example, preferential tax policy information for enterprises in poverty-stricken areas may be published on a bulletin page of a regional government website, and talent introduction support information may be published on a bulletin page of a regional talent center website. Because this information is released to the public through public channels, this embodiment adopts crawler technology to acquire and store it item by item.
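The parsing half of such a crawler can be sketched as below, using only the Python standard library. The sketch works on an HTML string that is assumed to have been downloaded already; the class name `AnnouncementParser`, the sample page, and its link paths are hypothetical and do not correspond to any real website.

```python
from html.parser import HTMLParser

class AnnouncementParser(HTMLParser):
    """Collect (title, href) pairs for every link on a bulletin page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link with an href.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None

# Hypothetical bulletin page content.
page = (
    '<ul><li><a href="/notice/1">Tax incentive notice</a></li>'
    '<li><a href="/notice/2">Talent support notice</a></li></ul>'
)
parser = AnnouncementParser()
parser.feed(page)
links = parser.links
```

In a full crawler, each collected link would then be fetched (for example with `urllib.request.urlopen`) and the announcement text stored for the analysis in step S5.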
Step S5, performing text analysis on the profit information to obtain profit conditions corresponding to different profit information, which specifically includes the following steps:
Step S5-1, data cleaning is carried out on the profit information to remove invalid information. Because profit information usually takes the form of a bulletin or news item, it contains a large amount of redundant information irrelevant to the profit (such as website addresses and web page parameters) after being captured by the crawler, and this redundant information can be removed by data cleaning. Data cleaning may be performed with regular expressions to remove text that matches a redundant form (e.g., text in the form of a website address).
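A regular-expression-based cleaning step of this kind might look like the following sketch. The two patterns (one for website addresses, one for `key=value` page parameters), the function name `clean_profit_text`, and the sample text with its example.org address are assumptions for illustration; a real deployment would need patterns tuned to the crawled sites.

```python
import re

# Assumed patterns for two redundant forms: website addresses and
# key=value web page parameters.
URL_PATTERN = re.compile(r"https?://\S+|www\.\S+")
PARAM_PATTERN = re.compile(r"\b\w+=\w+(?:&\w+=\w+)*")

def clean_profit_text(raw):
    """Remove website addresses and page parameters, then tidy whitespace."""
    text = URL_PATTERN.sub(" ", raw)
    text = PARAM_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical crawled text.
cleaned = clean_profit_text(
    "Notice on tax incentives https://example.org/notice?id=12 "
    "page=3&lang=en for enterprises in Place A"
)
```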
And step S5-2, performing text analysis on the cleaned profit information to obtain the profit conditions contained in each profit information. Different articles are formed after the profit information such as bulletin news is cleaned, and the specific profit conditions contained in each profit information can be obtained through text analysis such as word segmentation, keyword extraction and the like.
In this embodiment, the word segmentation and the keyword extraction may be performed based on a pre-stored lexicon, and each obtained profit condition includes a plurality of profit keywords with different attributes. For example, after word segmentation and keyword extraction are performed on the text "Place A … … provides government interest subsidies on business creation guarantee loans for college student start-up enterprises that provide more than 15 employment positions", profit keywords such as Place A (registered place), more than 15 (employment position number), college student start-up enterprise (enterprise property), business creation guarantee loan (profit item) and government interest subsidy (profit content) are obtained, where the content in brackets is the attribute corresponding to each profit keyword. In addition, this embodiment uses a preset replacement lexicon to replace and unify profit keywords that have the same meaning but different wording, so that profit keywords with substantially the same meaning are not mistaken as different merely because different words were chosen.
After the above data cleaning and text analysis, the profit conditions are extracted from the profit information, and the profit conditions may be stored together with the acquisition time thereof, and then the process proceeds to step S6.
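The lexicon-based keyword extraction and the unification through a replacement lexicon can be sketched as follows. Both lexicons, the function name `extract_profit_condition`, and the sample sentence are hypothetical, and simple substring matching is used here in place of real word segmentation.

```python
# Hypothetical keyword lexicon: profit keyword -> attribute.
KEYWORD_LEXICON = {
    "Place A": "registered place",
    "more than 15": "employment position number",
    "business creation guarantee loan": "profit item",
}

# Hypothetical replacement lexicon: variant wording -> unified wording.
REPLACEMENTS = {
    "over 15": "more than 15",
    "startup guarantee loan": "business creation guarantee loan",
}

def extract_profit_condition(text):
    """Unify variant wording, then map each attribute to the keyword found."""
    for variant, unified in REPLACEMENTS.items():
        text = text.replace(variant, unified)
    return {attr: kw for kw, attr in KEYWORD_LEXICON.items() if kw in text}

condition = extract_profit_condition(
    "Place A grants a startup guarantee loan to enterprises offering over 15 jobs"
)
```

Because the replacement runs before matching, "over 15" and "more than 15" end up as the same profit keyword.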
In step S6, a matching degree between the user classification and the profit information is obtained according to the condition characteristics corresponding to the user classification and the profit condition corresponding to the profit information.
In this embodiment, the matching degree between the user classification and the profit information is expressed by the condition satisfaction degree.
The condition satisfaction degree refers to the degree that an enterprise satisfies the profit condition in the profit information, and is obtained by calculating in a manner of combining a matching value and a weight value, and the specific calculation rule is as follows:
first, it is determined whether each conditional feature of the enterprise classification meets each profit condition in the profit information (i.e., it is determined whether the conditional feature and the profit condition having the same attribute are consistent), if yes, a high matching value is given, and if not, a low matching value is given, or even the matching value is marked as 0.
Then, according to the influence of different types of conditions on the satisfaction degree, weight values are set for different types of profit conditions, the matching values are multiplied by the corresponding weight values and then are added and calculated, and the condition satisfaction degree between a certain enterprise classification and certain profit information can be obtained.
For example, when the condition characteristics of an enterprise classification include "Place A" (registered place) and a piece of profit information includes the profit condition "Place A" (registered place), a high matching value is given for the "registered place" condition because the two are identical; meanwhile, since the registered place is a very important condition for enterprise support policies (for example, an enterprise not registered in a given place cannot enjoy that place's support policy at all), a high weight value is given to this condition. In addition, in the above calculation, matching values for condition features and profit conditions stored in text form can be set at only two levels, high and low, whereas those stored in numeric form can be set at several levels, such as high, medium and low, to reflect how close the condition feature is to the profit condition.
The above calculation is performed in a one-to-one traversal manner, that is, each enterprise category is sequentially compared with each profit information and the condition satisfaction degree is calculated, so as to obtain the condition satisfaction degrees between different enterprise categories and all the profit information, and then step S7 may be performed.
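The weighted-sum calculation of the condition satisfaction degree can be sketched as below. The concrete matching values (1.0 and 0.0), the weight values, the default weight of 1.0, and the sample data are assumptions; the embodiment only states that matching values are multiplied by weights and summed.

```python
HIGH_MATCH, LOW_MATCH = 1.0, 0.0  # assumed matching values

def condition_satisfaction(class_features, profit_conditions, weights):
    """Sum of matching value times weight over all profit conditions."""
    total = 0.0
    for attribute, required in profit_conditions.items():
        matched = class_features.get(attribute) == required
        match_value = HIGH_MATCH if matched else LOW_MATCH
        total += match_value * weights.get(attribute, 1.0)
    return total

# Hypothetical enterprise classification and profit information.
class_features = {
    "registered place": "Place A",
    "enterprise property": "college student start-up",
}
profit_conditions = {
    "registered place": "Place A",
    "enterprise property": "college student start-up",
    "employment position number": "more than 15",
}
weights = {
    "registered place": 0.5,  # important condition, so a high weight
    "enterprise property": 0.25,
    "employment position number": 0.25,
}
score = condition_satisfaction(class_features, profit_conditions, weights)
```

Two of the three conditions match, so the score is 0.5 + 0.25 = 0.75. Running this function over every pair of classification and profit information gives the one-to-one traversal described above.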
Step S7, pushing the profit information to the users in the user category according to the matching degree with the profit information.
After the matching degree (i.e., the condition satisfaction degree) between the different business classifications and all the profit information is obtained through step S6, the profit information with a high matching degree (i.e., a high value of the condition satisfaction degree) can be sent to each business in the business classifications.
In this embodiment, the profit information to be sent is determined using a condition satisfaction degree threshold. For example, if the condition satisfaction degrees between a certain enterprise classification and several pieces of profit information are higher than a preset threshold, those pieces of profit information are sent to all enterprises under the classification (for instance, via the contact information provided when the enterprises registered). After receiving the profit information, an enterprise can judge from the bulletin or news content whether it can benefit and act accordingly.
In the above process, because this embodiment determines the profit information to be sent based on a threshold, an enterprise may well receive profit information whose condition satisfaction degree is above the threshold although the enterprise does not strictly meet every condition. In this case, although the enterprise does not fully meet the corresponding conditions, it can still work on the aspects in which it falls short based on the profit information, and obtain the corresponding fund policy support later.
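The threshold rule of step S7 can be sketched as follows. The function name `select_pushes`, the threshold of 0.6, the classification names, and the notice identifiers are hypothetical; actual delivery through the registered contact information is outside the sketch.

```python
def select_pushes(satisfaction, members, threshold):
    """Map each enterprise to the profit information to be sent to it.

    satisfaction: {(classification, info id): condition satisfaction degree}
    members: {classification: [enterprise names]}
    """
    pushes = {}
    for (classification, info_id), score in satisfaction.items():
        if score > threshold:
            for enterprise in members[classification]:
                pushes.setdefault(enterprise, []).append(info_id)
    return pushes

# Hypothetical scores from step S6.
satisfaction = {
    ("class 1", "notice-1"): 0.75,
    ("class 1", "notice-2"): 0.25,
    ("class 2", "notice-2"): 0.9,
}
members = {"class 1": ["Company A", "Company B"], "class 2": ["Company C"]}
pushes = select_pushes(satisfaction, members, threshold=0.6)
```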
Action and Effect of the Embodiment
According to the user profit information pushing method based on the web crawler and the big data analysis, the text analysis is performed on the user information to obtain a plurality of condition characteristics, the users are classified according to the condition characteristics, the profit information is also subjected to the text analysis to obtain a plurality of profit conditions, and therefore the matching degree between the user classification and the profit information can be obtained according to the condition characteristics and the profit conditions, the profit information is pushed to each user in the user classification, the user can obtain the profit information which is in accordance with the conditions of the user, and the user can obtain the profit information without spending a large amount of time for screening and condition matching.
Furthermore, since the above process is performed in which the users are classified into different categories and the condition features correspond to the user categories, the condition matching performed in the process is performed based on the categories rather than on the individual users. Therefore, even if the condition satisfaction of each classification and profit information needs to be calculated one by one, the calculation amount thereof is much smaller than that of matching based on each user, and condition matching can be completed quickly and efficiently.
In the embodiment, the matching degree between the user classification and the profit information is measured by adopting the condition satisfaction degree obtained based on the matching value and the weight value, so that the matching degree between one user classification and different profit information can be reflected integrally, important conditions can have greater influence on the result of the matching degree, the matching degree is more accurate, and the subsequent information push is more accurate.
In addition, because the profit information is pushed according to a preset matching degree threshold, profit information can be pushed even when the user does not completely meet its profit conditions. The user thus obtains more profit information that it may be able to satisfy and can determine its future development direction accordingly, so the pushing has a longer-lasting effect and supports the user's long-term development.
In the embodiment, the text analysis process of the profit information further comprises the step of replacing and unifying the profit keywords, so that the situation that the profit keywords which are substantially the same are mistakenly considered to be different due to different selected vocabularies can be avoided.
The above embodiments are only used to illustrate specific embodiments of the present invention, and the method for pushing user profit information based on web crawlers and big data analysis of the present invention is not limited to the scope of the above embodiments.
For example, in the embodiment, the matching degree is expressed by the condition satisfaction degree derived from the matching values and the weight values. However, in the present invention, the matching degree can also be obtained directly from the matching values, or directly from the number of condition features that match the profit conditions. Such a simplified approach cannot reflect the matching degree as a whole, but it can be applied in simpler application scenarios that do not require much overall consideration, where it reduces the amount of calculation and thereby improves efficiency.
In the embodiment, the profit information is pushed according to the preset matching degree threshold, so that the user can obtain some profit information whose conditions it cannot yet fully meet. However, in the present invention, the user may also choose during registration whether to receive such profit information. When the user chooses not to receive it, the condition features of the user are compared one by one with the profit conditions of each piece of profit information whose matching degree is above the threshold (that is, whose matching degree with the user classification to which the user belongs is above the threshold), and the profit information is sent only when the user completely meets the profit conditions. The user can thus choose to receive only profit information whose conditions it fully meets, which improves the efficiency of information acquisition.
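This optional per-user check can be sketched as follows. The function name `strict_filter`, the exact-equality comparison, and the sample data are assumptions for illustration.

```python
def strict_filter(user_features, candidate_infos, conditions):
    """Keep only the profit information whose every condition the user meets.

    candidate_infos: info ids already above the matching degree threshold
    conditions: {info id: {attribute: required value}}
    """
    return [
        info_id
        for info_id in candidate_infos
        if all(user_features.get(a) == v for a, v in conditions[info_id].items())
    ]

# Hypothetical user who opted out of partially matching information.
user_features = {
    "registered place": "Place A",
    "employment position number": "more than 15",
}
conditions = {
    "notice-1": {"registered place": "Place A"},
    "notice-2": {
        "registered place": "Place A",
        "enterprise property": "college student start-up",
    },
}
to_send = strict_filter(user_features, ["notice-1", "notice-2"], conditions)
```

Only the above-threshold candidates of the user's own classification are checked, so this per-user comparison stays small compared with matching every user against all profit information.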