Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" is used herein to describe only one relationship, and means that three relationships may exist, for example, A and/or B, and that three cases exist, A alone, A and B together, and B alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
As shown in fig. 1, an embodiment of the present application provides a method for cyclic automated data collection, including:
 S1, constructing a multi-source data entry set according to a distributed entry discovery mechanism of a knowledge graph, and forming an entry URL set by using dynamic priority queue management, a timeliness evaluation model and a semi-parametric batch-processing global decision mechanism.
In actual application, the system first obtains an initial set of URLs through a search engine API, a social media API, and an industry catalog database.
In the embodiment of the application, permission or consent for the data acquisition path is obtained before automated data collection; for example, a browser is used to access the target page during collection, and after the target page is captured, page information is acquired via OCR or via the browser's view-source function.
For example, when telephone number data of the catering industry needs to be collected, the system can search for the contact information of XX restaurants through a search engine API, obtain merchant information under catering-related topics through a microblog API, and extract restaurant list page URLs from an industry catalog.
Based on the initial URLs, the system builds a knowledge graph, taking restaurant names, areas, cuisines and other information as nodes and building association relationships among them. Through graph analysis, the system identifies aggregated pages containing the contact information of multiple restaurants, which typically have higher data value. The system then uses a semi-parametric batch-processing global decision mechanism to perform optimized resource allocation.
Specifically, for each restaurant website, the system analyzes the domain name age (e.g., a well-known review platform's domain typically has a longer history), update frequency (e.g., a food blog may update irregularly), and content richness (e.g., an official website may be more comprehensive), and constructs a feature vector.
Based on these features, the system trains a semi-parametric reward prediction model and calculates the expected data revenue of each URL using Thompson sampling. Finally, the system generates an optimized URL batch acquisition plan, such as preferentially accessing frequently updated restaurant list pages on a review platform and reducing the access frequency to slowly updated official websites, to form the entry URL set.
S2, according to the entry URL set, a high-value link queue is formed based on DOM structural feature analysis, semantic association evaluation, and link value scoring using TF-IDF and Word2Vec.
Taking restaurant data collection as an example, the system starts from an entry URL obtained in S1 (such as the XX restaurant page on a review platform), and first obtains the web content and performs preprocessing. The system parses the HTML structure, removes irrelevant elements such as advertisements and navigation, and builds a standardized DOM tree. Next, the system uses multiple technical strategies to extract links in parallel: matching all http(s) links with regular expressions, locating the links of all restaurant detail pages with XPath (e.g., //div[@class="shop-list"]/a), and identifying clickable elements with CSS selectors (e.g., .shop-title a).
For content that is dynamically loaded using JavaScript (e.g., more restaurants loaded as the page is scrolled), the system can also capture these dynamically generated links. The system then analyzes the value of these links using the TF-IDF algorithm and the Word2Vec model.
For example, when the link text contains keywords such as "contact us" or "reservation phone", or when the text surrounding the link is highly related to contact information, the system gives it a higher score. The system also considers the location of the link in the DOM tree; links in the page body content are typically more valuable than links in the navigation bar or footer.
Finally, the system sets dynamic threshold filtering advertisements and low-value links, such as automatically eliminating obvious advertisement links (such as URLs containing "ad", "sponsor" words) and typical automatic data acquisition traps (such as infinite links caused by calendar page turning), and finally forming a high-value link queue, wherein the links mainly point to restaurant detail pages, contact pages and other high-value pages containing telephone numbers.
S3, acquiring page content according to the high-value link queue, extracting digit sequences using regular expressions, and performing match judgment against a telephone number pattern knowledge base to form a page queue containing valid telephone numbers.
When the system accesses a high-value link generated in S2 (e.g., the detail page of a restaurant), the visible text content of the page is first extracted by DOM parsing. The system handles various encoding formats and recognizes hidden text, ensuring full coverage of the areas that may contain telephone numbers. Next, the system extracts all potential digit sequences using regular expressions, including consecutive digits (e.g., "13812345678"), numbers with separators (e.g., "010-87654321"), and special formats (e.g., "+86 (10) 87654321").
For each extracted number sequence, the system records its contextual information, such as prefix text of "order phone:", "contact us:", etc. The system then verifies the candidate digits using a global phone number format rules repository.
For example, Chinese mobile phone numbers are typically 11 digits and begin with 1, and landline numbers in XX are typically the area code 010 plus 8 digits. The system is capable of handling various format variations, such as bracketed numbers, international country codes, or telephone numbers with separators.
Finally, the system combines the context information to perform function identification, such as identifying a number whose prefix contains "order" as the restaurant's order line and a number near "complaint" as a customer service line, thereby forming a page queue containing each restaurant's valid telephone numbers and their functional attributes.
S4, performing time-series interpolation with a diffusion model according to the page queue, and associating merchant information by means of a domain name extraction algorithm and a domain-name-to-merchant mapping knowledge base, to form a merchant data set.
In restaurant data acquisition, the system designs a reasonable resource allocation strategy, such as limiting requests to a single restaurant website to at most 1000 per hour and the total crawl depth to no more than 5 levels.
By mixing breadth-first and depth-first strategies, the system orders requests so that the basic information pages of many restaurants are acquired broadly first, and then the detailed contact information of each restaurant is acquired in depth. The system continuously monitors the acquisition progress and success rate; when the anti-crawling mechanism of a catering platform is found to reduce the success rate, the request frequency for that platform is dynamically lowered.
Meanwhile, the system records the processing state and supports resumable collection; for example, after a network interruption, the system can continue collecting from the restaurant list where it last stopped. For collected data, the system extracts domain names and groups them, such as grouping all URLs from "dianping.com" into one group and those from "meituan.com" into another.
The system also uses historical data to build a domain-name-to-merchant mapping, such as identifying that both "xms.dianping.com" and "www.xiaomaoshu.com" belong to "Kitten Potato restaurant". In addition, the system uses a diffusion model for time-series interpolation, especially for restaurant data that is updated regularly (e.g., new menus released on a weekly schedule). The system records the historical update patterns and can intelligently predict likely data changes during network outages. Through multiple sampling analyses, the system increases the actual acquisition frequency for data points with high uncertainty (such as holiday special menus) and relies more on model predictions for highly predictable data (such as fixed telephone numbers), thereby optimizing resource allocation.
S5, according to the merchant data set, extracting field information with a supervised-learning-based field extraction neural network model and identifying telephone number groups with a sequence-labeling-based telephone number group identification model, to generate a structured data set.
In order to train a high-performance data extraction model, the system first collects a large amount of historical webpage data covering various dining website formats, such as pages with different layouts from review platforms, group-buying platforms, and restaurant directory sites. An expert team annotates the data, marking the positions of key fields such as merchant name, address, and business scope, as well as the functional attribute of each telephone number (such as order line, delivery line, etc.).
The system then performs feature engineering processing on the training data to extract DOM structural features (e.g., HTML tag type, depth where the field is located), text semantic features (e.g., text content, context), and positional relationship features (e.g., relative position in the page). Based on these features, the system builds a training model.
Through transfer learning, the system builds a field extraction neural network based on a pre-trained BERT language model and fine-tunes it for the specific terms and structures of the catering industry.
Meanwhile, the system adopts BiLSTM-CRF architecture to train a telephone number grouping identification model, and the model can accurately identify the function types of a plurality of telephone numbers of the same restaurant.
After model training is completed, the system uses 10-fold cross-validation to tune the model hyperparameters, ensuring good generalization across different dining website formats. By applying these models, the system can accurately extract normalized merchant information from cluttered webpage content; for example, from text such as "Kitten Potato (the Korean shop)  Business hours: 10:00-22:00  Order line: 010-12345678", it accurately identifies the merchant name "Kitten Potato (the Korean shop)" and the telephone number "010-12345678", whose functional attribute is "order line".
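The field extraction step described above can be illustrated with a minimal Python sketch using the Hugging Face transformers library as one possible implementation; the model name, BIO label set and example text are illustrative assumptions rather than the patent's actual configuration, and the model would of course be fine-tuned on the expert-annotated pages before real use.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-PHONE", "I-PHONE"]   # assumed BIO tags for merchant name / phone
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)
# In practice the classifier head would be fine-tuned on annotated restaurant pages first.
text = "Kitten Potato (the Korean shop) Order line: 010-12345678"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids.tolist()):
    print(tok, labels[pid])   # untrained head yields arbitrary tags; shown only to illustrate the interface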
And S6, performing multidimensional data fingerprint generation to perform data deduplication according to the structured data set, and performing data protection by adopting a differential privacy technology to form a high-quality data set after deduplication and desensitization.
When a large amount of catering data are processed, the system adopts a multidimensional data fingerprint algorithm to perform deduplication. The algorithm not only considers restaurant names and telephone numbers, but also combines address, business class and other information to generate unique data fingerprints.
For example, "kitten's potato (kohlrabi)" and "kitten's potato kohlrabi" are slightly different in name, but the system can recognize that they are the same restaurant through the similarity of telephone numbers and addresses.
The system adopts a multi-stage duplicate removal strategy, namely, firstly, exact matching is carried out to remove completely repeated data (such as completely identical restaurant records), then similarity calculation is used to identify approximate duplicates (such as records with slightly different names but identical phones and addresses), and finally, the most complete information is reserved through intelligent merging.
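The multi-stage deduplication described above can be illustrated with a small Python sketch; the field names, similarity threshold and merging rule are illustrative assumptions, not the patent's exact algorithm.

import hashlib
import re
from difflib import SequenceMatcher

def normalize(s):
    return re.sub(r"[\s()（）\-]", "", s or "").lower()

def fingerprint(record):
    # multidimensional fingerprint combining name, phone and address
    key = "|".join(normalize(record.get(f)) for f in ("name", "phone", "address"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def deduplicate(records, threshold=0.9):
    seen, kept = set(), []
    for rec in records:
        fp = fingerprint(rec)
        if fp in seen:                      # stage 1: exact duplicates
            continue
        dup = next((k for k in kept
                    if k["phone"] == rec["phone"]
                    and SequenceMatcher(None, normalize(k["name"]),
                                        normalize(rec["name"])).ratio() > threshold), None)
        if dup:                             # stage 2: near-duplicates; merge, preferring the fuller record
            if len(rec.get("address", "")) > len(dup.get("address", "")):
                dup.update({k: v for k, v in rec.items() if v})
        else:
            seen.add(fp)
            kept.append(rec)
    return kept

records = [
    {"name": "Kitten Potato (Kohlrabi)", "phone": "010-12345678", "address": "12 Example Road, Chaoyang"},
    {"name": "Kitten Potato Kohlrabi",   "phone": "010-12345678", "address": "12 Example Rd"},
    {"name": "Kitten Potato (Kohlrabi)", "phone": "010-12345678", "address": "12 Example Road, Chaoyang"},
]
print(deduplicate(records))               # one merged record remains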
For example, if one record contains a detailed address but an incomplete phone number, while another record has a complete phone number but an abbreviated address, the system merges them into one record containing the complete phone number and the detailed address.
In terms of privacy protection, the system automatically recognizes personal privacy information, such as a restaurant manager's private cell phone number, identification card number, etc., and processes these sensitive fields.
For example, personal phone numbers are partially masked (e.g., "138****5678"), and detailed addresses are blurred (retained only to the street level).
The system also adopts differential privacy technology, adding an appropriate amount of noise during data analysis so that individual information is not revealed even when aggregate statistics are released. Through these technologies, the system finally forms a deduplicated and desensitized high-quality data set, which ensures both the uniqueness and integrity of the data and its compliance and privacy protection.
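The desensitization step can be sketched in Python as follows, combining partial masking with a Laplace-mechanism noisy count; the mask pattern and epsilon value are illustrative assumptions.

import re
import numpy as np

def mask_phone(phone):
    # keep the first 3 and last 4 digits of a personal mobile number, mask the middle
    digits = re.sub(r"\D", "", phone)
    return digits[:3] + "*" * (len(digits) - 7) + digits[-4:] if len(digits) >= 8 else "*" * len(digits)

def noisy_count(true_count, epsilon=1.0):
    # Laplace mechanism for a counting query (sensitivity 1): noise scale = 1 / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print(mask_phone("13812345678"))        # -> 138****5678
print(noisy_count(42, epsilon=0.5))     # noisy number of restaurants in a district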
As shown in fig. 2, S1 specifically includes:
 S1.1, acquiring an initial URL set according to a search engine API, a social media API and an industry catalog database.
In the multi-source data entry acquisition process, the system fully utilizes various data sources to collect and screen initial URLs.
Taking business telephone number collection as an example, the system first obtains a URL list of relevant webpages through search engine APIs (application programming interfaces) such as Baidu and Google, using keyword combinations (such as "contact information of a certain e-commerce company", "telephone of XX catering enterprise", and the like).
Meanwhile, the system accesses social media APIs such as Weibo and LinkedIn, and captures contact information and related links published by official enterprise accounts. In addition, the system connects to industry directory databases such as yellow-pages sites and business information platforms (e.g., Qichacha and Tianyancha), and directly extracts the enterprise website addresses and contact page links contained in them.
These operations produce a large number of original URLs, constituting a broad, but unfiltered, initial set of URLs. By means of the multi-source data fusion mode, the system can extend the coverage range of data acquisition to the maximum extent, and the limitation caused by dependence on a single data source is avoided. Especially for small enterprises or newly established enterprises which are difficult to find through conventional channels, the system can acquire information of the small enterprises or newly established enterprises in time through the social media API and the latest index of a search engine, so that the comprehensiveness and timeliness of data are ensured.
S1.2, constructing a knowledge graph according to the initial URL set and combining domain knowledge, and identifying high-value entry nodes through graph analysis to generate a primarily screened URL subset.
In this link, the system combines the initial URL set acquired in S1.1 with the pre-established domain knowledge to construct a simple knowledge graph.
Taking catering industry as an example, the system firstly extracts basic information such as domain names, path structures, page titles and the like from the initial URL, and the basic information is used as a basic node of the map. The system then associates these nodes with concepts in domain knowledge (e.g., restaurant classification, geographic location, cuisine type, etc.).
For example, a node in a URL that contains "huoguo" or a title that contains "hot pot" would be conceptually associated with "hot pot restaurant". Meanwhile, the system also establishes the association relation between nodes, such as linking different restaurant websites under the same company or clustering restaurants in the same area.
Based on the knowledge graph, the system can perform deep analysis to identify the most valuable entry nodes. For example, the system may prefer an aggregated page containing multiple pieces of restaurant information (e.g., recommended posts for a food forum), an official contact page, and a detail page with rich user reviews, as these pages typically contain more valid phone number information. Through the graph analysis, the system generates a URL subset subjected to preliminary screening and value evaluation, and provides a basis for subsequent prioritization.
S1.3, designing a dynamic priority queue according to the URL subset, calculating a priority score based on the PageRank value, the content update frequency and the historical acquisition result of each URL, and generating a URL queue with ordered priorities.
In the dynamic priority queue management link, the system carries out further priority evaluation and sequencing on the URL subsets screened in the step S1.2.
The system designs a dynamic priority queue, and assigns a weight value to each URL by comprehensively considering various factors.
First, the system calculates the PageRank value of each URL to evaluate its importance and authority across the network; URLs with higher PageRank values, such as the official websites of well-known enterprises, are generally given higher initial weights.
And secondly, the system analyzes the content updating frequency of the website corresponding to the URL, and gives higher timeliness weight to the frequently updated website (such as a catering website for updating the promotion information every day) so as to ensure that the latest information can be captured in time.
In addition, the system refers to the history collection record to give higher experience weight to websites that have provided high quality phone number data in the past. Based on the comprehensive calculation of these factors, the system generates a comprehensive priority score for each URL and creates a prioritized URL queue accordingly.
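A minimal Python sketch of such a priority queue follows; the factor weights and example URLs are illustrative assumptions, and each factor is assumed to be pre-normalized to [0, 1].

import heapq

WEIGHTS = {"pagerank": 0.4, "freshness": 0.35, "history": 0.25}   # assumed weighting

def priority_score(url_info):
    return sum(WEIGHTS[k] * url_info[k] for k in WEIGHTS)

def build_queue(url_infos):
    # heapq is a min-heap, so push negative scores to pop the highest-priority URL first
    heap = []
    for info in url_infos:
        heapq.heappush(heap, (-priority_score(info), info["url"]))
    return heap

queue = build_queue([
    {"url": "https://example-review-site.com/shops", "pagerank": 0.8, "freshness": 0.9, "history": 0.7},
    {"url": "https://example-restaurant.com/contact", "pagerank": 0.5, "freshness": 0.2, "history": 0.9},
])
print(heapq.heappop(queue))   # highest combined priority is served first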
This method considers not only the static importance of a URL but also dynamic factors and historical experience, making resource allocation more reasonable and efficient. Notably, the queue is not static: it is continuously adjusted according to real-time acquisition results and the system's resource state, reflecting its dynamic nature.
The semi-parametric batch-processing global decision mechanism specifically comprises the following steps:
 A1, collecting feature data of target websites, and generating, through feature engineering, a feature vector containing domain name age, update frequency and content richness.
In the first step of the semi-parametric batch-processing global decision mechanism, the system performs comprehensive collection and processing of target website feature data.
For each target website, the system collects multidimensional feature data, including domain name age (registration time obtained by WHOIS query), website size (estimated from the site map or page count), technical architecture (the CMS, front-end framework, etc. identified), update frequency (analyzed by comparing historical snapshots), content richness (estimated text density, number of multimedia elements, etc.), external references (number and quality of backlinks), and historical acquisition success rate.
After collecting these raw features, the system performs feature engineering. First, numerical features are standardized so that features of different dimensions become comparable. Missing values are then handled: features that cannot be obtained (such as the domain name age of websites that cannot be queried) are filled with industry averages or data from similar websites. Feature selection is then performed, using correlation analysis and importance evaluation to keep the features with the most predictive value. Finally, combined features are constructed, such as combining update frequency and content richness into an "information value density" feature. Through this series of processing, the system generates high-quality feature vectors containing key dimensions such as domain name age, update frequency and content richness, providing a solid foundation for subsequent model training.
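The feature engineering described above may be sketched in Python as follows; the field names, normalization constants and industry averages are illustrative assumptions.

import numpy as np

def build_feature_vector(raw, industry_means):
    # fill missing values with industry averages, as described above
    filled = {k: raw.get(k) if raw.get(k) is not None else industry_means[k] for k in industry_means}
    # normalization constants are assumptions, not taken from the patent
    scale = {"domain_age_days": 3650.0, "updates_per_week": 14.0, "content_richness": 1.0}
    vec = np.array([min(filled[k] / scale[k], 1.0) for k in scale])
    # combined feature: "information value density" = update frequency x content richness
    info_density = vec[1] * vec[2]
    return np.append(vec, info_density)

industry_means = {"domain_age_days": 1200, "updates_per_week": 3, "content_richness": 0.6}
print(build_feature_vector({"domain_age_days": None, "updates_per_week": 7, "content_richness": 0.8},
                           industry_means))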
A2, training a mixed parametric and non-parametric model according to the feature vector to form a semi-parametric reward prediction model.
In this link, the system builds a semi-parametric reward prediction model using the feature vectors generated by A1. The model adopts a mixed architecture combining parameterization and non-parameterization methods, and fully exerts the advantages of the two methods.
In the parameterization part, the system captures the linear relation between the features and the acquired rewards (such as the acquired number of effective telephone numbers) by using a generalized linear model, and the parameterization part has simple and definite structure and high calculation efficiency and is suitable for processing definite feature association.
For example, the update frequency of a website and the freshness of data are generally in positive correlation, and can be effectively expressed by a linear model. In the non-parameterized part, the system adopts complex models such as random forests or gradient lifting trees and the like, and is used for capturing nonlinear interaction and complex modes among features.
This section is particularly suited to handle combinations of features that are complex or difficult to formulate with simple formulas, such as complex relationships between domain name age, content structure and collection efficiency.
The system then performs weighted fusion of the predictions of the two models, continuously adjusting the weights through Bayesian optimization, and finally forms the semi-parametric reward prediction model. The hybrid architecture retains the interpretability and computational efficiency of the parametric model while gaining the non-parametric model's ability to handle complex relationships, and is particularly suitable for automated data acquisition scenarios with diverse website characteristics and complex relationships.
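A minimal Python sketch of such a hybrid model follows, using scikit-learn's ridge regression as the parametric part and gradient boosting as the non-parametric part; the synthetic data and the fixed fusion weight are illustrative assumptions (the patent tunes the weight with Bayesian optimization).

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))                        # e.g. [update_freq, domain_age, content_richness]
y = 3 * X[:, 0] + np.sin(5 * X[:, 1]) * X[:, 2] + rng.normal(0, 0.1, 200)  # observed reward (valid numbers found)

linear = Ridge(alpha=1.0).fit(X, y)             # parametric part: simple, interpretable linear trend
booster = GradientBoostingRegressor().fit(X, y) # non-parametric part: non-linear interactions

def predict_reward(x, w=0.4):
    # w is the fusion weight; here it is simply fixed as an assumption
    x = np.asarray(x).reshape(1, -1)
    return w * linear.predict(x)[0] + (1 - w) * booster.predict(x)[0]

print(predict_reward([0.9, 0.3, 0.7]))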
A3, executing a Thompson sampling algorithm according to the semi-parametric reward prediction model to calculate the expected data gain of each URL target, and generating a priority decision scheme for URL batch acquisition.
Based on the semi-parametric reward prediction model trained in A2, the system calculates the expected data gain of each URL target using a Thompson sampling algorithm and generates an optimized batch acquisition decision scheme. The core idea of Thompson sampling is to balance exploration and exploitation through probabilistic sampling.
In particular implementations, the system first maintains a posterior belief of the rewards distribution for each URL, which may initially be a Gaussian distribution based on a predictive model. At each decision point, the system randomly extracts a sample value from the posterior distribution of each URL, represents the potential benefit of that URL, and then selects the batch of URLs with the highest sample values for access.
The method naturally balances exploration and exploitation: URLs with wide posterior distributions (options with high uncertainty) occasionally draw high sample values and are therefore explored, while URLs with high posterior means and small variances (options with high expected payoff) are selected frequently, reflecting the exploitation strategy.
As the system continuously collects new data, the posterior distributions gradually converge and the decisions become increasingly accurate. Compared with the traditional epsilon-greedy algorithm, Thompson sampling does not require a manually set exploration rate parameter, but adjusts naturally according to uncertainty, making it more flexible and adaptive. In this way, the system generates a batch acquisition priority decision scheme that maximizes expected revenue while keeping URLs sufficiently explored.
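The Thompson sampling loop can be sketched in Python as follows; the Gaussian posterior, its simplified update rule and the example URLs are illustrative assumptions rather than the patent's exact formulation.

import numpy as np

class ThompsonURLSelector:
    """Keeps a Gaussian posterior belief over the reward of each URL and samples from it."""
    def __init__(self, urls, prior_mean=1.0, prior_var=1.0):
        self.stats = {u: {"mean": prior_mean, "var": prior_var, "n": 0} for u in urls}

    def select_batch(self, k):
        # draw one sample per URL, then take the k URLs with the highest sampled rewards
        samples = {u: np.random.normal(s["mean"], np.sqrt(s["var"])) for u, s in self.stats.items()}
        return sorted(samples, key=samples.get, reverse=True)[:k]

    def update(self, url, reward):
        # simplified running-mean update; the posterior narrows as evidence accumulates
        s = self.stats[url]
        s["n"] += 1
        s["mean"] += (reward - s["mean"]) / s["n"]
        s["var"] = max(s["var"] * 0.95, 0.01)

selector = ThompsonURLSelector(["url_a", "url_b", "url_c"])
batch = selector.select_batch(2)
selector.update(batch[0], reward=5)          # e.g. 5 valid phone numbers found on that page
print(batch)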
 A3.1, acquiring expected reward data of the URL target set, and calculating an optimal combination through a combinatorial optimization algorithm to form an initial batch scheme.
In a detailed implementation of thompson sampling, the system first needs to obtain the desired reward data for the URL target set and calculate the optimal combination.
For each candidate URL, the system draws a number of samples (typically 100-1000) from its posterior reward distribution and calculates the expected reward value and its confidence interval. These reward data are typically measured as "effective information acquired per unit of resource", such as the number of valid phone numbers obtained per request or the amount of useful data captured per second.
After obtaining the expected reward data, the system faces a combinatorial optimization problem: how to select a set of URLs that maximizes the overall expected reward when resources are limited. This is essentially a constrained knapsack problem.
The system solves it with an improved greedy algorithm or dynamic programming, taking into account dependencies and complementary effects between URLs. For example, the value of a dining platform's detail pages increases after its list page has been accessed, while accessing many similar websites at the same time may reduce their value due to information duplication. By solving this optimization problem, the system forms an initial batch scheme that determines which URLs should be accessed in the current batch, together with their access order and resource allocation proportions.
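A minimal Python sketch of a greedy solution to this constrained knapsack-style selection follows; the reward, cost and budget figures are illustrative assumptions.

def greedy_batch(candidates, budget):
    """candidates: dicts with expected_reward and cost (e.g. requests needed); budget: total requests."""
    # classic greedy approximation for the constrained knapsack: sort by reward density
    ranked = sorted(candidates, key=lambda c: c["expected_reward"] / c["cost"], reverse=True)
    batch, spent = [], 0
    for c in ranked:
        if spent + c["cost"] <= budget:
            batch.append(c["url"])
            spent += c["cost"]
    return batch

plan = greedy_batch(
    [{"url": "list_page", "expected_reward": 40, "cost": 10},
     {"url": "detail_page", "expected_reward": 12, "cost": 2},
     {"url": "forum_post", "expected_reward": 5, "cost": 1}],
    budget=12)
print(plan)    # -> ['detail_page', 'forum_post'] under this budget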
A3.2, performing diversity constraint calculation using a submodular function maximization framework according to the initial batch scheme, to generate diversified batch combinations.
After the initial batch scheme is formed, the system introduces diversity constraints through a submodular function maximization framework to further optimize the batch combination.
Submodular functions are a class of functions with diminishing marginal gains, and are well suited to modeling selection problems with diversity requirements. In an automated data acquisition system, repeatedly selecting URLs of the same type often leads to information redundancy and reduces overall efficiency. The system therefore defines a submodular objective in which the marginal benefit of adding a new URL to a batch decreases with its similarity to the URLs already selected.
In the specific implementation, the system first builds a similarity matrix between URLs, computing pairwise similarity based on dimensions such as domain name, content type and target audience. A greedy algorithm then builds the batch step by step under the submodularity constraint, each time selecting the URL that maximizes the marginal revenue while ensuring that its average similarity to the already selected URLs does not exceed a preset threshold. This method ensures that a batch contains URLs of different types and from different sources while preserving the overall expected revenue, improving information coverage and diversity.
For example, instead of focusing on a single source, the system mixes a restaurant review website, restaurant listing sites and a local business directory in the same batch, thereby forming a diversified batch combination.
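The greedy submodular selection with a diversity constraint can be sketched in Python as follows; the similarity values, gain formula and threshold are illustrative assumptions.

def diversified_batch(candidates, similarity, k, max_avg_sim=0.5):
    """Greedy submodular selection: marginal gain shrinks with similarity to already chosen URLs."""
    chosen = []
    while len(chosen) < k:
        best, best_gain = None, float("-inf")
        for url, reward in candidates.items():
            if url in chosen:
                continue
            sims = [similarity[frozenset((url, c))] for c in chosen]
            avg_sim = sum(sims) / len(sims) if sims else 0.0
            if avg_sim > max_avg_sim:                 # hard diversity constraint
                continue
            gain = reward * (1.0 - avg_sim)           # diminishing marginal gain
            if gain > best_gain:
                best, best_gain = url, gain
        if best is None:
            break
        chosen.append(best)
    return chosen

candidates = {"review_site": 10, "review_site_2": 9, "chain_official": 7, "local_directory": 6}
similarity = {frozenset(p): s for p, s in [
    (("review_site", "review_site_2"), 0.9), (("review_site", "chain_official"), 0.2),
    (("review_site", "local_directory"), 0.1), (("review_site_2", "chain_official"), 0.2),
    (("review_site_2", "local_directory"), 0.1), (("chain_official", "local_directory"), 0.3)]}
print(diversified_batch(candidates, similarity, k=3))   # mixes sources instead of two review sites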
A3.3, according to the diversified batch combinations, coordinating multi-node task allocation using a federated learning framework to form a distributed decision scheme.
After the diversified batch combinations are formed, the system uses the federated learning framework to coordinate multi-node task allocation and generates a distributed decision scheme.
In a large-scale automated data acquisition system, multiple acquisition nodes are typically deployed, distributed across different geographic locations or network environments. The federated learning framework enables these nodes to share knowledge and coordinate actions while retaining a degree of autonomy.
First, each node maintains a local model and gathers local observations, such as the response time and success rate of particular websites. Model parameters are then synchronized periodically between nodes to update the global knowledge base, while the raw data remains local. This design improves the overall learning efficiency of the system and reduces communication overhead.
In terms of task allocation, the system employs a decentralised coordination mechanism, such as weighted voting or an auction mechanism. For example, when multiple nodes are all suited to access a high-value URL, the system comprehensively considers the current load, historical success rate and network conditions of the nodes to assign the task to the most suitable node. The system also applies knowledge distillation, compressing the globally learned strategy and distributing it to each node so that every node can adapt quickly to environmental changes. Through this coordination mechanism, the system forms an efficient distributed decision scheme, makes full use of the advantages of the distributed architecture, and improves overall acquisition efficiency and robustness.
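A minimal Python sketch of federated parameter averaging between collection nodes follows; the parameter names and node weights are illustrative assumptions.

import numpy as np

def federated_average(node_params, node_weights):
    """Weighted FedAvg-style aggregation of per-node model parameters (dicts of arrays)."""
    total = sum(node_weights)
    keys = node_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(node_weights, node_params)) / total for k in keys}

# two collection nodes share only model parameters, never the raw page data
node_a = {"w": np.array([0.2, 0.8]), "b": np.array([0.1])}
node_b = {"w": np.array([0.4, 0.6]), "b": np.array([0.3])}
global_model = federated_average([node_a, node_b], node_weights=[300, 100])  # weight by local sample count
print(global_model)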
And A3.4, executing a multi-objective optimization algorithm to fuse the compliance constraint conditions according to the distributed decision scheme, and generating a final URL batch acquisition execution plan for updating the URL queues with ordered priorities.
After the distributed decision scheme is formed, the system fuses the compliance constraint conditions through a multi-objective optimization algorithm to generate a final URL batch acquisition execution plan.
The system treats data collection as a multi-objective optimization problem whose main objectives include maximizing data value (such as the number of telephone numbers acquired), minimizing resource consumption (such as request count and bandwidth use), and minimizing compliance risk (such as obeying robots.txt rules and avoiding excessive load on target websites).
The system adopts a Pareto optimization method to find the best balance point between these objectives.
Specifically, the system first converts the compliance constraints into hard constraints and soft penalty terms. For hard constraints, such as robots.txt disallow rules, the system directly eliminates URLs that violate them; soft constraints, such as request frequency limits, are converted into penalty terms added to the optimization objective function.
The system also establishes an adaptive rate limiting mechanism that dynamically adjusts the request frequency according to the response time and error rate of the target website, avoiding the triggering of anti-crawling mechanisms. Through this multi-objective optimization, the system generates a final URL batch acquisition execution plan that balances data value, resource efficiency and compliance risk. The plan explicitly specifies the access time, request parameters and processing priority of each URL, and is used to guide the system in updating the priority-ordered URL queue, so that the acquisition process is efficient and compliant and can run stably in the long term.
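A minimal Python sketch of fusing the robots.txt hard constraint with soft penalty terms follows, using the standard library's urllib.robotparser; the penalty weights and example URLs are illustrative assumptions.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])     # normally fetched from the site's robots.txt

def plan_score(url, data_value, request_cost, rate_penalty, agent="*"):
    # hard constraint: URLs disallowed by robots.txt are eliminated outright
    if not rp.can_fetch(agent, url):
        return None
    # soft constraints enter the objective as weighted penalty terms (weights are assumptions)
    return 1.0 * data_value - 0.3 * request_cost - 0.5 * rate_penalty

print(plan_score("https://example-restaurant.com/contact", data_value=8, request_cost=1, rate_penalty=0.2))
print(plan_score("https://example-restaurant.com/private/admin", data_value=9, request_cost=1, rate_penalty=0.0))  # -> None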
A4, dynamically adjusting the weight parameters of the URL exploration and exploitation strategies according to the URL batch acquisition priority decision scheme, calculating the priority score of each URL, and updating the priority-ordered URL queue to form an optimized URL access order and frequency.
In this step, the system uses the URL batch acquisition priority decision scheme generated in A3 to further adjust the dynamic exploration and exploitation strategy and update priorities.
First, the system dynamically adjusts the weight parameters of the exploration and exploitation strategy according to the current task progress and resource situation.
For example, in the early stages of acquisition, the system tends to increase the exploration weight, trying various types of URLs and accumulating experience; in later stages, it increases the exploitation weight, concentrating resources on URL types known to yield high returns to secure results.
And secondly, the system dynamically adjusts specific parameters of each URL according to real-time feedback. When the recent data quality of a certain type of URL is found to be improved significantly, the priority of the URL is correspondingly improved.
In addition, the system also takes into account time and environmental factors such as reducing requests at peak web traffic times to avoid triggering protection mechanisms, or increasing access to the e-commerce web site at certain points in time (e.g., holiday promotions). Through the multi-dimensional dynamic adjustment, the system continuously updates the priority scores of the various items in the URL queue and reorders the priority scores to form the optimized URL access sequence and frequency. The self-adaptive mechanism enables the system to keep high-efficiency running in complex and changeable network environments, and flexibly meets various challenges.
S1.4, carrying out timeliness evaluation according to the URL queue, and dynamically adjusting the access frequency and the priority to form the entry URL set.
And in the link of timeliness evaluation and self-adaptive adjustment, the system performs dynamic timeliness analysis and adjustment on the priority queue established in the step S1.3.
The system first establishes an update cycle model of website content, analyzing the content change patterns of different websites through historical crawl records.
For example, the promotional information for the e-commerce platform may be updated daily, while the basic contact of the company may change monthly or quarterly. Based on these analyses, the system builds a timeliness assessment model, setting the appropriate access period for the different types of URLs. For highly time-efficient content (e.g., new job contact information for recruitment sites), the system will increase its priority in the queue and increase the access frequency, while for less time-efficient content (e.g., fixed-line telephones for government agencies), the access frequency is correspondingly reduced, allocating resources to URLs that need to be updated more timely.
The system also realizes a self-adaptive mechanism, and can adjust the estimated update period according to the actual acquisition result. For example, if a restaurant website is accessed for a plurality of times without content change, the system automatically prolongs the access interval, and once the content is found to start to change frequently, the access interval is shortened correspondingly. Through the dynamic adjustment, the system forms an entry URL set which has the advantages of full coverage and important emphasis, so that the timeliness of data can be ensured, and the resource utilization efficiency can be optimized.
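The adaptive adjustment of the revisit interval can be sketched in Python as follows; the back-off factors and interval bounds are illustrative assumptions.

def adjust_interval(current_interval, changed, min_interval=3600, max_interval=30 * 24 * 3600):
    """Lengthen the revisit interval after unchanged fetches, shorten it when content changes."""
    if changed:
        new_interval = current_interval / 2        # content is moving again: revisit sooner
    else:
        new_interval = current_interval * 1.5      # stable content: back off gradually
    return int(min(max(new_interval, min_interval), max_interval))

interval = 24 * 3600                               # start with a daily check
for changed in [False, False, True]:               # two unchanged fetches, then a change detected
    interval = adjust_interval(interval, changed)
    print(interval)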
As shown in fig. 3, S2 specifically includes:
 S2.1, acquiring webpage content according to the entry URL set, and performing HTML parsing and cleaning to obtain a standardized DOM tree structure.
In the page acquisition and preprocessing link, the system starts actual data acquisition work based on the optimized entry URL set.
Firstly, the system sends an HTTP request to acquire the original content of the target webpage, and the process needs to process various network conditions and server responses, including setting a reasonable timeout mechanism, processing redirection, maintaining a Cookie state and the like.
After the original content is obtained, the system first performs encoding detection, automatically identifying the character encoding of the page (such as UTF-8, GB2312, GBK, etc.) to ensure that multilingual content such as Chinese can be parsed accurately. The system then performs HTML parsing to convert the text content into a structured DOM tree, using a dedicated HTML parser (such as lxml or BeautifulSoup) that can handle non-standard HTML and repair common tag errors. After parsing, the system performs content cleaning to remove non-content elements such as JavaScript code, CSS styles and comments, and identifies and filters parts irrelevant to the target data, such as advertisements, navigation bars and footers. This step is critical to the accuracy of subsequent analysis, as it avoids interference from noisy data. Finally, the system normalizes the cleaned content into a standard DOM tree structure to facilitate unified processing by subsequent algorithms.
The whole preprocessing flow handles various abnormal cases, such as automatically repairing missing tags and normalizing special characters, ensuring the stability and accuracy of subsequent analysis. Through this processing, the system converts the originally chaotic webpage content into a clearly structured, easily analyzable DOM tree, laying a solid foundation for intelligent link discovery.
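A minimal Python sketch of the parsing and cleaning step follows, using BeautifulSoup with the lxml parser as one possible implementation; the selectors used to drop advertisement and navigation containers are illustrative assumptions.

from bs4 import BeautifulSoup

def clean_dom(html):
    soup = BeautifulSoup(html, "lxml")              # lxml tolerates and repairs much malformed HTML
    # remove scripts, styles and typical non-content containers before analysis
    for selector in ["script", "style", "noscript", "nav", "footer",
                     '[class*="ad"]', '[class*="banner"]']:
        for tag in soup.select(selector):
            tag.decompose()
    return soup

soup = clean_dom('<html><body><nav>menu</nav><div class="ad-box">buy!</div>'
                 '<p>Order line: 010-12345678</p><script>x()</script></body></html>')
print(soup.get_text(" ", strip=True))               # -> Order line: 010-12345678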
S2.2, extracting links using regular expressions, XPath positioning and CSS selectors according to the DOM tree structure to form an initial link set.
In the multi-strategy link extraction stage, the system applies several technical means in parallel to the DOM tree structure, extracting valuable links to the greatest possible extent.
First, the system uses regular expressions to match all possible URL patterns and identify http(s) links in the text content; this can capture links that are not in standard <a> tags, such as plain-text URLs or links embedded in JavaScript code.
At the same time, the system uses XPath to precisely locate links within a particular structure; for example, //div[@class="content"]/a/@href locates all links in a content area.
In addition, the system employs CSS selectors; for example, a selector such as .product-list-item a can select all item links in a product list.
Beyond these static extraction methods, the system also integrates a JavaScript execution environment and can capture dynamically generated links, addressing the challenge that modern websites heavily use AJAX to load content dynamically. For example, when a page loads more content on scroll or when the "show more" button is clicked, the system can simulate these operations and extract the newly appearing links.
The system also handles special cases, such as completing relative paths (converting "/products/1" to a full URL), URL decoding (processing encoded characters such as %20), and removing session identifiers from URLs, to ensure that the extracted links are uniform and valid.
Through the parallel application of the multiple strategies, the system can comprehensively collect link resources in the page to form an initial set containing links of various sources, and rich candidates are provided for subsequent value evaluation. The multi-strategy parallel method remarkably improves the coverage rate and adaptability of link discovery, and can cope with websites realized by various different structures and technologies.
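A minimal Python sketch of multi-strategy link extraction follows, using regular expressions and lxml XPath; the example XPath expression and page content are illustrative assumptions.

import re
from urllib.parse import urljoin
from lxml import html as lxml_html

def extract_links(page_url, raw_html):
    links = set()
    # strategy 1: regular expression over the raw text catches URLs outside <a> tags
    links.update(re.findall(r"https?://[^\s\"'<>]+", raw_html))
    tree = lxml_html.fromstring(raw_html)
    # strategy 2: XPath for links inside a specific structure
    for href in tree.xpath('//div[@class="shop-list"]//a/@href'):
        links.add(urljoin(page_url, href))
    # strategy 3: all anchor tags as a general fallback
    for href in tree.xpath("//a/@href"):
        links.add(urljoin(page_url, href))
    # normalise: drop fragments and obvious non-navigational links (simplified)
    return {re.sub(r"#.*$", "", u) for u in links if not u.startswith("javascript:")}

print(extract_links("https://example-review-site.com/list",
                    '<div class="shop-list"><a href="/shop/1">Kitten Potato</a></div>'))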
S2.3, analyzing the relationship between link text and context through a TF-IDF algorithm according to the initial link set, and calculating semantic similarity in combination with a Word2Vec model to obtain a link value scoring result.
 In the link value evaluation stage, the system performs in-depth analysis of the extracted initial link set and calculates a value score for each link.
First, the system uses the TF-IDF algorithm to analyze the relevance of the linked text to the context. The system treats the linked text and its surrounding context as one document, calculates word frequency (TF) and Inverse Document Frequency (IDF), and identifies keywords with high discrimination. For example, links in the link text or surrounding text that contain words of "contact us," "telephone," "customer service," etc., may achieve a higher relevance score.
Meanwhile, the system performs deep semantic analysis by combining a Word2Vec model, and the model can understand semantic relations among words through pre-trained Word vectors, and can identify related contents even if completely matched keywords do not appear.
For example, even if the link does not directly contain the word "phone", but has semantically related words such as "dial", "consultation", etc., the system can identify its potential value.
In addition to text semantics, the system evaluates structural features of links, such as DOM tree node depth (links in the body content are typically more valuable than links in the navigation bar or footer), sub-link density (the greater the number of links to a page, the more important the page is typically), and the location of the links in the page (links in the center region of the page are typically more important than links in the edge region).
The system also considers historical data, such as the historical yield of a particular path pattern under the domain name. All of these features are integrated by a random forest model, and a composite value score is calculated for each link, typically ranging from 0-100, with higher scores indicating that the link is more likely to contain target information. The multidimensional scoring mechanism can comprehensively evaluate the potential value of the links and provide scientific basis for subsequent link screening.
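A minimal Python sketch combining TF-IDF relevance and Word2Vec semantic similarity into a link value score follows; the toy corpus, the tiny locally trained Word2Vec model and the score weights are illustrative assumptions (in practice a pre-trained model and the random-forest fusion described above would be used).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec

target_terms = ["contact", "phone", "telephone", "customer", "service"]
link_contexts = ["contact us for reservations phone",
                 "download our mobile app banner",
                 "customer service hotline and consultation"]

# TF-IDF relevance of each link context to the target vocabulary
vec = TfidfVectorizer()
matrix = vec.fit_transform(link_contexts + [" ".join(target_terms)])
tfidf_scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()

# Word2Vec semantic similarity; a pre-trained model would be used in practice,
# a tiny model is trained on the toy corpus here only to keep the sketch self-contained
w2v = Word2Vec([c.split() for c in link_contexts] + [target_terms], vector_size=32, min_count=1, seed=1)
sem_scores = [w2v.wv.n_similarity(c.split(), target_terms) for c in link_contexts]

for ctx, t, s in zip(link_contexts, tfidf_scores, sem_scores):
    print(f"{ctx[:30]:30s}  tfidf={t:.2f}  semantic={s:.2f}  value={0.6 * t + 0.4 * s:.2f}")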
S2.4, setting a dynamic threshold to filter advertisements and low-value links according to the link value scoring result, and removing automated data acquisition traps using heuristic algorithms, to form a high-value link queue.
In the link filtering and optimization stage, the system intelligently screens and processes the links scored in S2.3 to form the final high-value link queue.
First, the system sets a dynamic threshold to filter low value links, which is not fixed, but automatically adjusts according to the overall quality profile of the current lot links.
For example, if the current lot links are generally of higher quality, the system will raise the threshold to only preserve links of the best quality, and otherwise lower the threshold appropriately to ensure adequate collection. The system focuses on and filters ad links specifically, automatically eliminating such distracters by identifying common ad features (e.g., keywords including "ad", "sponsor", "promotion", etc., or pointing to ad network domain names).
Meanwhile, the system adopts heuristic algorithms to identify and avoid automated data acquisition traps, such as infinite calendar pagination, tag loops and parameter traps, patterns that can cause the collector to fall into loops or cause an exponential explosion of URLs.
The system can also detect and process the URL repetition problem, and normalize URLs (such as URLs with different session identifications or sequencing parameters) which are different in form and actually point to the same content, so that repeated access is avoided.
In addition, the system optimizes the reserved high-value links, including completing relative paths, removing URL anchors, standardizing parameter sequences and the like, so as to ensure the unified specification of the link formats.
Finally, the system ranks the priority of the links according to the value scores, and sets reasonable acquisition intervals, so that excessive requests are prevented from being sent to the same domain name in a short time.
Through the series of filtering and optimizing processes, the system extracts a high-quality high-value link queue from the original hybrid link set, provides a high-efficiency target page set for the subsequent telephone number extraction link, and remarkably improves the overall acquisition efficiency and the data quality.
As shown in fig. 4, S3 specifically includes:
 S3.1, acquiring page content according to the high-value link queue, and extracting visible text content by DOM parsing.
In the page text extraction stage, the system visits the target pages one by one based on the high-value link queue and extracts text content that may contain telephone numbers.
Firstly, the system sends an HTTP request to acquire page content, and dynamically adjusts a request strategy according to actual conditions, such as setting different User-agents, maintaining a Cookie state or processing JavaScript redirection, and the like, so as to cope with access restrictions of various websites.
After the content is acquired, the system extracts visible text through DOM parsing, and the process not only comprises obvious text elements such as regular paragraphs, titles and the like, but also pays special attention to special areas possibly containing contact information, such as footers, sidebars, contact pages and the like.
The system can intelligently handle various complications, such as processing special encodings (converting HTML entities such as &nbsp; and &amp; into normal characters), identifying the alternative text descriptions of images (alt attributes), and even attempting to extract pseudo-element content added through CSS (e.g., via ::before and ::after).
It is particularly noted that the system also detects and extracts hidden text. Some websites use various techniques to hide telephone numbers, such as CSS display:none or visibility:hidden, or setting the text color to be the same as the background. The system can identify and extract these hidden contents by analyzing the CSS attributes.
In addition, the system focuses on dynamically loaded content to obtain complete information by simulating user interactions (e.g., clicking a "display more" button or triggering a particular event). For complex layouts, the system will analyze the spatial relationship of the elements, correctly associating the phone number with its descriptive text.
Through the comprehensive and fine extraction technology, the system ensures complete coverage of all the areas possibly containing telephone numbers in the page, outputs the cleaned and normalized plain text content, and lays a foundation for subsequent digital sequence identification.
S3.2, extracting candidate digit sequences from the visible text content using regular expressions, and recording the context of each candidate digit sequence in the original text.
In the step of digital sequence recognition, the system performs fine analysis on the extracted plain text content to recognize all the digital sequences possibly constituting telephone numbers.
First, the system uses a series of specially designed regular expression patterns to identify digit combinations in various forms. These patterns include consecutive digits (e.g., 13800138000), digits with separators (e.g., 010-88888888, 0755.83744944), bracketed numbers (e.g., (010) 88888888), internationally prefixed numbers (e.g., +8613800138000), and various mixed forms (e.g., +86 (10) 6552-9988).
The system's regular expressions are carefully designed to handle the telephone number format conventions of different countries and regions, such as the 11-digit format of Chinese mobile phone numbers and the 10-digit format (3-digit area code plus 7-digit number) used in the United States.
At the same time, the system handles special cases, such as digits in the text being separated by spaces, tabs or line breaks, or even being represented with full-width digits or Chinese numerals.
For each identified digit sequence, the system records not only the sequence itself but also its complete context in the original text, typically 50-100 characters before and after. Such contextual information is critical to subsequent telephone number verification and function identification, providing key cues such as "customer service phone:", "order hotline:" or "For service:".
The system also records the position information of the digit sequence in the page, such as the type of HTML element it appears in (tags such as <p>, <div>, <span>), the CSS class name (such as class="tel" or class="contact"), and its spatial position relative to the page, which helps determine the importance and function of the digit sequence.
Through the series of fine recognition and information association processing, the system outputs a candidate number sequence set containing rich metadata, and provides a comprehensive analysis basis for the next telephone number pattern matching.
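A minimal Python sketch of candidate digit-sequence extraction with context recording follows; the regular expression and context window size are illustrative simplifications of the pattern set described above.

import re

PHONE_CANDIDATE = re.compile(
    r"(?:\+?\d{1,3}[\s-]?)?(?:\(\d{2,4}\)[\s-]?)?\d{3,4}[\s.-]?\d{3,4}(?:[\s.-]?\d{3,4})?")

def find_candidates(text, window=50):
    """Return candidate digit sequences with the surrounding context recorded."""
    results = []
    for m in PHONE_CANDIDATE.finditer(text):
        start, end = m.span()
        results.append({
            "raw": m.group(),
            "digits": re.sub(r"\D", "", m.group()),
            "context": text[max(0, start - window):min(len(text), end + window)],
        })
    return results

sample = "Kitten Potato, order hotline: 010-87654321, mobile 13812345678, est. 2008"
for c in find_candidates(sample):
    print(c["digits"], "|", c["context"])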
S3.3, performing various combination and formatting operations on the candidate digit sequences using the telephone number pattern knowledge base, to generate a preliminarily identified telephone number list.
In the telephone number pattern matching link, the system uses the global telephone number format knowledge base to verify and format the candidate digit sequence.
The telephone number knowledge base of the system covers the number rules of more than 200 countries and regions worldwide, including international area codes, domestic area code length rules, total number length requirements, mobile phone number prefix rules and the like of each country/region.
For example, a Chinese mobile phone number must be 11 digits and begin with 1; landline numbers are typically an area code (e.g., 010, 0755) plus a 7-8 digit local number; the United States and Canada use the North American Numbering Plan (NANP) with 3-digit area codes plus 7-digit local numbers; and so on.
The system performs various combining and formatting processes on each candidate digit sequence, for example, for "01088888888", the system may try various segmentation methods such as "010-8888-8888", "0108-888-888", etc., and then verify which segmentation conforms to the valid phone number format based on the knowledge base.
The system can also handle special cases, such as a short local form that omits the area code, compound formats with extension numbers, or multiple telephone numbers appearing in the same text block. For a multinational enterprise website, the system can identify the telephone number formats of multiple countries/regions and classify them correctly.
In the verification process, the system not only considers the compliance of the number format, but also can combine some heuristic rules to enhance the judgment accuracy, such as excluding the number sequences which are obviously the product model, price, date and the like.
To increase efficiency, the system employs a multi-stage filtering strategy that first screens out sequences that do not significantly match the telephone number characteristics quickly with simple rules, and then performs more detailed format verification on potentially valid sequences. Through this complex series of matching and verification processes, the system outputs a list of primarily identified telephone numbers, each labeled with its possible country/region attribution and format validity scores, providing a basis for subsequent context verification and classification.
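A minimal Python sketch of rule-based format verification follows, covering only a few simplified Chinese number rules as an illustration of the knowledge-base checks; a production system would use the full per-country rule set.

import re

def classify_cn_number(digits):
    """Rule-of-thumb checks mirroring the knowledge-base rules quoted above (simplified)."""
    if re.fullmatch(r"1[3-9]\d{9}", digits):
        return "CN mobile"
    if re.fullmatch(r"0(10|2\d)\d{8}", digits):          # 3-digit area code + 8-digit local number
        return "CN landline (major city)"
    if re.fullmatch(r"0[3-9]\d{2}\d{7,8}", digits):      # 4-digit area code + 7-8 digit local number
        return "CN landline"
    return None

def verify(candidate):
    digits = re.sub(r"\D", "", candidate)
    kind = classify_cn_number(digits)
    return {"raw": candidate, "digits": digits, "kind": kind, "valid": kind is not None}

for c in ["010-8765-4321", "13812345678", "0755 1234567", "20080815"]:
    print(verify(c))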
S3.4, performing verification and function type identification on the telephone number list in combination with the context information, to form a page queue containing valid telephone numbers.
In the context verification and classification stage, the system performs deep context analysis on the preliminarily identified telephone numbers, further confirming their validity and identifying their function types.
First, the system analyzes the context vocabulary around each phone number, looking for keywords that can be confirmed as an explicit identification of the phone number, such as "phone", "Tel", "contact", "dial", etc.
The system analyzes texts in different ranges before and after the number by adopting a sliding window method, and distributes weights according to the distances between the keywords and the number, wherein the influence of the keywords with the closer distances is larger.
Such context-based verification can effectively exclude digit sequences that match the format but are not actually telephone numbers, such as product numbers or order numbers.
After confirming the validity, the system further analyzes the context to identify the function type of the phone number. The system uses a predefined function type dictionary containing various common telephone function classifications and their corresponding feature words, such as customer service (customer service, consultation, support), sales (sales, ordering, purchasing, sales), technical support (technical, fault, maintenance, technical), etc.
By matching the feature words in the context, the system is able to assign the most likely function label to each phone number.
For situations where the contextual information is insufficient, the system will analyze the position of the number in the page and DOM structural features, such as the number located in the "contact us" page being more likely to be a customer service call and the number located in the product detail page being more likely to be a sales call.
In addition, the system may also identify time limits for the number usage, such as "weekdays 9:00-18:00" etc. period labels.
After completing these analyses, the system stores each validated telephone number, together with the URL and title of the current page, as a key-value pair in queue B. Each entry in this queue contains complete metadata such as the number text, standardized format, country/region attribution, function type and credibility score, providing rich structured data for subsequent merchant information association. Through this deep context analysis and function recognition, the system greatly improves the accuracy and practical value of telephone number extraction.
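The distance-weighted, context-based function classification can be sketched in Python as follows; the keyword dictionary, window size and decay formula are illustrative assumptions.

FUNCTION_KEYWORDS = {
    "customer service": ["customer service", "consultation", "support", "complaint"],
    "ordering":         ["order", "reservation", "booking", "hotline"],
    "technical":        ["technical", "fault", "maintenance"],
}

def classify_function(context, number, window=40):
    """Weight keywords by their distance from the number: closer words count more."""
    pos = context.find(number)
    scores = {}
    for label, words in FUNCTION_KEYWORDS.items():
        score = 0.0
        for w in words:
            idx = context.lower().find(w)
            if idx < 0:
                continue
            distance = abs(idx - pos)
            if distance <= window:
                score += 1.0 / (1.0 + distance / 10.0)   # weight decays with distance
        if score:
            scores[label] = score
    return max(scores, key=scores.get) if scores else "unknown"

ctx = "Order hotline: 010-87654321 (weekdays 9:00-18:00)"
print(classify_function(ctx, "010-87654321"))            # -> ordering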
S4.1, extracting a second-level domain name and a top-level domain name according to the page queue, and grouping by adopting a domain name clustering algorithm to generate a domain name grouping result.
In the domain name extraction and analysis link, the system carries out systematic processing on the page queues containing the effective telephone numbers so as to realize the preliminary classification of merchant information.
First, the system parses each URL, extracting its secondary domain name and top-level domain name. For example, from a URL such as "https://beijing.shop.sample.com/contact", the system recognizes the secondary domain name "shop.sample" and the top-level domain name "com", while recording the subdomain "beijing" as possible region information. The system can also identify special situations: a website using a country-code top-level domain (e.g., .cn, .jp) may represent a business entity in a particular country, and a website using a special domain such as .edu or .gov may be an educational institution or government agency. After extracting the domain names, the system groups them using a domain name clustering algorithm. Such clustering is not based solely on exact matching but also considers the similarity of domain names, and can identify cases such as "shop.sample.com" and "mobile.sample.com" that originate from the same organization but use different subdomains. The system calculates domain name similarity using algorithms such as edit distance and longest common substring, and performs intelligent matching in combination with known common domain name patterns (such as the subdomain prefixes www, m and shop commonly used by enterprises). In addition, the system analyzes the path structure pattern of the URLs, identifying different merchants that may come from the same content management system or e-commerce platform. For example, "platform.com/shop/A" and "platform.com/shop/B" may be different merchants on the same platform. Through this multidimensional domain name analysis and clustering, the system generates a domain name grouping result, effectively grouping pages likely to belong to the same merchant or the same organization, and lays a foundation for subsequent merchant information association. This primary grouping based on the domain name greatly improves data processing efficiency and avoids the redundant work of processing each URL independently.
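As a purely illustrative sketch of the grouping idea described above, the following Python example strips common subdomain prefixes and clusters the remaining registrable domains by a string-similarity measure; the prefix list, the similarity function and the threshold are assumptions of this example.

```python
# Sketch of domain-name grouping: normalize hosts, then greedily cluster by similarity.
from difflib import SequenceMatcher

COMMON_PREFIXES = {"www", "m", "shop", "mobile"}   # illustrative enterprise subdomain prefixes

def registrable_part(host: str) -> str:
    """Drop a leading common subdomain prefix, e.g. shop.sample.com -> sample.com."""
    labels = host.lower().split(".")
    if len(labels) > 2 and labels[0] in COMMON_PREFIXES:
        labels = labels[1:]
    return ".".join(labels)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()      # stands in for an edit-distance similarity

def group_domains(hosts: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedy clustering: a host joins the first group whose representative is similar enough."""
    groups: list[list[str]] = []
    for host in hosts:
        key = registrable_part(host)
        for group in groups:
            if similarity(key, registrable_part(group[0])) >= threshold:
                group.append(host)
                break
        else:
            groups.append([host])
    return groups

# group_domains(["shop.sample.com", "mobile.sample.com", "other.org"])
# -> [["shop.sample.com", "mobile.sample.com"], ["other.org"]]
```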
And S4.2, establishing a domain name merchant mapping table according to the business database and the historical acquisition data.
In the construction link of the domain name merchant mapping table, the system integrates various data sources to establish the corresponding relation between the domain name and the actual merchant entity.
First, the system utilizes existing commercial database resources, such as enterprise business registration databases, commercial information service platforms (such as Qichacha and Tianyancha) and industry catalog databases, to obtain known domain name-merchant correspondences. These official or professional data sources provide a large amount of validated underlying mapping information.
And secondly, analyzing historical acquisition data by the system, and extracting the association mode of the domain name and the merchant name from the historical acquisition data. Through statistical analysis of a large amount of historical data, the system can identify those high-frequency and stable correspondences, such as that a specific domain name almost always appears simultaneously with a certain merchant name.
The system also adopts a machine learning technology to train a special mapping prediction model, and the model comprehensively considers domain name text characteristics (such as brand keywords contained in domain names), webpage content characteristics (such as website logo and company names in copyright statement) and link relation characteristics (such as reference modes of other known merchant websites) to predict merchant entities possibly corresponding to unknown domain names.
The model continuously improves the accuracy through continuous learning, and can process new domain names which are not directly matched with records.
In addition, the system implements a manual check and feedback mechanism that allows experts to review and correct automatically generated mappings; this feedback is in turn used to further improve model performance. Through these multi-source data fusion and intelligent learning techniques, the system constructs a comprehensive and accurate domain name merchant mapping table, which not only contains direct mapping relations but also records the reliability scores and data sources of each mapping, providing reliable knowledge base support for subsequent merchant information association. The construction of the mapping table is a dynamic and continuous process, and the system updates and expands the mapping data periodically to keep it synchronized with the continuously changing internet business environment.
And S4.3, adding merchant basic information for each URL according to the domain name grouping result and the domain name merchant mapping table, and generating a URL data set associated with the merchant.
In the primary association link of the merchant information, the system combines the domain name grouping result generated before with the domain name merchant mapping table, and adds corresponding merchant basic information for each URL.
Firstly, the system performs query matching on each domain name group, and finds out the corresponding merchant record from the domain name merchant mapping table. For the case of direct matching, the system directly associates the corresponding merchant ID, name, industry classification, etc. base information.
For domain names that do not directly match any record, the system attempts an approximate match based on similarity calculations and rules, for example when dealing with domain name variants (example-shop.com versus exampleshop.com) or subdomain changes (shop.example.com versus m.example.com).
After determining the merchant association, the system tags each URL with a set of core merchant attributes, including a unique identifier (e.g., merchant ID), merchant name (possibly including both the formal name and common abbreviations), industry classification (e.g., major classes such as catering, retail and services, together with finer sub-classes), business scale (e.g., large chain, small business, individual merchant, etc.), establishment time, regional information, etc.
For special cases, such as where a domain name may correspond to multiple merchants (e.g., commercial platform websites) or where a merchant may use multiple domain names, the system may establish many-to-many associations and record the confidence scores for each association.
In addition, the system marks the source of the data and the timestamp for each association, thereby facilitating subsequent data updating and conflict resolution. Through the systematic information association process, URL data is converted from simple website links to structured data with rich merchant contexts, so that a URL data set associated with merchants is formed. The association not only provides valuable business background information, but also provides important classification dimension for subsequent data grouping and content analysis, thereby greatly enhancing the business value and application potential of the data.
And S4.4, according to the URL data set, conducting content subdivision on the page title and the content abstract by using a text clustering algorithm to form the merchant data set.
In a content-based grouping optimization link, the system performs finer content analysis and grouping on URL datasets of associated merchant information.
First, the system analyzes the title and content abstract of the page corresponding to each URL, and extracts keywords and topic information. Through natural language processing techniques, the system identifies the main content type of the page, such as a product introduction page, a contact information page, a company profile page, and so forth.
The system adopts a text clustering algorithm (such as K-means, hierarchical clustering or topic model) to finely group different content pages of the same merchant.
For example, all of the store pages of a restaurant may form one sub-cluster, the menu pages form another sub-cluster, and the order contact pages form a third sub-cluster. The clustering not only considers text similarity, but also combines the URL path mode and the page structure characteristics, so that the functional category of the content can be identified more accurately. The system is particularly concerned with those pages containing contact information, which are marked as high value data sources.
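As one concrete and merely illustrative choice among the clustering algorithms mentioned above, the following Python sketch groups pages by their titles and content summaries using TF-IDF features and K-means; the sample pages and cluster count are hypothetical.

```python
# Hedged sketch of content-based sub-grouping using scikit-learn's TF-IDF and K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_pages(titles_and_summaries: list[str], n_clusters: int = 3) -> list[int]:
    """Return a cluster id per page based on its title + content summary text."""
    vectorizer = TfidfVectorizer(max_features=2000)
    features = vectorizer.fit_transform(titles_and_summaries)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(features).tolist()

pages = [
    "Store locations and opening hours",
    "Menu and seasonal dishes",
    "Contact us - reservations hotline",
]
print(cluster_pages(pages, n_clusters=3))  # e.g. [0, 1, 2]
```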
For large merchant websites, the system can also identify page groups belonging to different departments or business lines, such as the sales department, customer service center and technical support, which provides important guidance for the subsequent telephone number function classification. In addition, the system analyzes the temporal attributes of the pages to identify regularly updated content (e.g., promotional information) and relatively stable content (e.g., primary contact information) for differentiated handling.
With such content-based fine grouping, the system organizes URL data that might otherwise be intermixed into a well-structured, well-functioning merchant dataset, each grouping having specific content features and business functions. The optimized data structure not only improves the accuracy of the subsequent AI model extraction, but also provides a more reasonable organization framework for data display and application, so that the final data product meets the actual service requirements and the use habits of users.
S4.5, original data generated by an automatic data acquisition activity are acquired, and a standardized time sequence data set is formed through normalization and feature extraction processing.
In the link of automatic data acquisition activity data processing and standardization, the system carries out systematic processing on the original data generated in the acquisition process, and prepares for subsequent time sequence analysis.
First, the system collects raw data generated by all automated data collection activities, including the time stamp, URL, response status code, response time, data size, and extracted information content of each request, etc. These raw data are often in different formats, are large-scale and contain noise, and require standardized processing.
The system first cleans the data, handling missing values (e.g., requests that received no response), outliers (e.g., extreme response times) and conflicting data (e.g., different content returned for the same URL within a short time). The system then performs data normalization to uniformly convert features of different dimensions (such as response time in milliseconds and data size in KB) into a standard range, usually using Z-score normalization or Min-Max scaling. Next, the system performs feature extraction to mine valuable temporal pattern features from the raw data, such as the access frequency of specific URLs, content update periods, and the variation of data acquisition success rate over time.
The system can also construct derivative features, such as second-order features of calculated data change rate, request density, content similarity and the like, so that the expression capability of the data is enhanced. Finally, the system performs time granularity alignment on the data according to the service requirement, and may aggregate the original second-level data into a minute-level, hour-level or day-level time sequence for subsequent analysis.
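A minimal sketch of the cleaning, normalization and time-granularity alignment steps is given below, assuming a pandas DataFrame of raw request logs with hypothetical column names ("timestamp", "url", "status_code", "response_time_ms", "data_size_kb"); it is not the system's actual implementation.

```python
# Sketch: clean, Z-score-normalize and aggregate raw acquisition logs into an hourly series.
import pandas as pd

def build_timeseries(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.dropna(subset=["response_time_ms", "data_size_kb"]).copy()
    for col in ["response_time_ms", "data_size_kb"]:            # Z-score normalization
        df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    df["success"] = (df["status_code"] == 200).astype(int)
    hourly = df.set_index("timestamp").resample("1h").agg(
        {"url": "count", "success": "mean", "response_time_ms_z": "mean"}
    )
    hourly.columns = ["requests", "success_rate", "avg_response_time_z"]
    return hourly

# usage (hypothetical log file and columns):
# build_timeseries(pd.read_csv("crawl_log.csv", parse_dates=["timestamp"]))
```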
Through the series of processing, the system converts chaotic original automatic data acquisition activity data into a standardized time sequence data set with unified structure, rich characteristics and time alignment, the data not only reflects the content change rule of a target website, but also records the performance characteristics of the automatic data acquisition system, and provides a high-quality training and reasoning basis for a subsequent diffusion model.
And S4.6, performing forward diffusion and backward diffusion processes by using a conditional diffusion model according to the standardized time series data set to generate time series interpolation data.
In the link of conditional diffusion model and time sequence interpolation, the system utilizes the diffusion probability model technology of the leading edge to process the problem of missing values in the time sequence data.
The core idea of the conditional diffusion model is to consider the data generation process as a step-by-step denoising process, and the technology is particularly suitable for processing website content change data with complex time dependence.
First, the system defines a forward diffusion process, gradually adding gaussian noise to the complete time series data through multiple steps until completely randomized.
The system then trains a neural network to learn the backward diffusion process, i.e., to gradually recover the original signal from the noise. The network usually adopts a U-Net or Transformer architecture, and can effectively capture the long-term dependency relationships of time series data.
After training, the system uses the model to perform conditional generation, namely when missing segments in the time sequence are encountered, a known part of the time sequence is used as a conditional input to guide the model to generate missing parts consistent with the known data.
In particular, the system will leave the known data points unchanged, applying a back diffusion process to only the missing parts, gradually recovering the possible data values from random noise. This conditional generation ensures natural continuity of the interpolation results with the known data. Compared with the traditional interpolation method, the diffusion model can generate a result which is more consistent with the inherent distribution characteristic of the data, especially for updating the data for website contents with complex modes (such as periodicity, trend and sudden change).
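To make the conditioning step concrete, the following highly simplified numpy sketch illustrates only the masking idea: during reverse diffusion, observed points are held at their known values while missing points are denoised. The denoise_step function stands in for the trained U-Net/Transformer and is a placeholder assumption, not a real model.

```python
# Simplified illustration of conditional imputation with a reverse-diffusion loop.
import numpy as np

def denoise_step(x: np.ndarray, t: int) -> np.ndarray:
    """Placeholder for the learned reverse-diffusion update (not a real model)."""
    return x * 0.98

def conditional_impute(series: np.ndarray, observed_mask: np.ndarray, steps: int = 50,
                       rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(series.shape)          # missing parts start from pure noise
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                     # reverse-diffusion update everywhere
        x[observed_mask] = series[observed_mask]   # re-impose known data as the condition
    return x

values = np.array([1.0, 1.1, np.nan, np.nan, 1.4])
mask = ~np.isnan(values)
filled = conditional_impute(np.nan_to_num(values), mask)
```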
For example, when an e-commerce web site is temporarily inaccessible due to technical maintenance, the system can predict content changes that may occur during this period of time based on historical access patterns and automatically adjust the prediction after access is resumed. By the advanced time sequence interpolation technology, the system can effectively process the problem of data loss caused by various reasons, ensure the continuity and the integrity of time sequence data and provide a reliable basis for subsequent analysis.
And S4.7, carrying out data analysis by using a diversity sampling algorithm according to the time sequence interpolation data to obtain a prediction uncertainty index.
In the steps of diversity sampling and uncertainty evaluation, the system carries out deep analysis on time series interpolation data generated by the diffusion model, and the reliability of prediction is quantized.
First, the system employs a diversity sampling strategy: rather than generating a single prediction result, it runs the diffusion model multiple times, each time with a different random seed, to generate multiple sets of possible interpolation schemes (typically 50-100 sets). These different schemes together form a prediction distribution reflecting the model's uncertainty about the predicted values at different points in time.
Next, the system calculates statistical features of all sampling results, such as a mean (as a final predicted value), a standard deviation (as an uncertainty measure), a quantile (for constructing a prediction interval), etc., for each time point. The system is particularly concerned with those points in time where the inter-sample variance is large, which generally indicates that the model's predictions at these points are highly ambiguous and may require more real data to verify.
Based on these statistical analyses, the system generates an uncertainty indicator, typically expressed as a confidence score between 0 and 1, or the width of the prediction interval, for each predicted point in time.
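For illustration only, the following numpy sketch summarizes multiple diffusion samples into the statistics described above; the confidence mapping from interval width to a 0-1 score is an assumption of this example.

```python
# Sketch of uncertainty quantification over samples of shape (num_samples, num_time_points).
import numpy as np

def summarize_samples(samples: np.ndarray) -> dict[str, np.ndarray]:
    mean = samples.mean(axis=0)                              # final predicted value per time point
    std = samples.std(axis=0)                                # uncertainty measure
    q05, q95 = np.quantile(samples, [0.05, 0.95], axis=0)    # 90% prediction interval
    confidence = 1.0 / (1.0 + (q95 - q05))                   # narrower interval -> higher confidence
    return {"mean": mean, "std": std, "lower": q05, "upper": q95, "confidence": confidence}

samples = np.random.default_rng(0).normal(loc=10.0, scale=0.5, size=(80, 24))
stats = summarize_samples(samples)
```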
The system may also identify different types of uncertainty sources, such as aleatoric uncertainty (randomness of the data itself) and epistemic uncertainty (limitations of model knowledge). For unusual changes that may be caused by special events (e.g., an e-commerce website's traffic surging on a sales day), the system marks the corresponding points as high-uncertainty areas and signals that special attention may be required.
In addition, the system will continually update these uncertainty estimates over time, and as more observations are obtained, the prediction interval of the model will typically narrow, with reduced uncertainty. Through the comprehensive diversity sampling and uncertainty evaluation, the system not only provides specific predicted values, but also quantifies the reliability degree of the predictions, so that a decision maker can reasonably allocate resources according to the reliability degree of the predictions, and a data acquisition strategy can be planned more scientifically.
And S4.8, according to the prediction uncertainty index, executing a self-adaptive adjustment strategy to update the acquisition frequency parameter, so as to form data with better time continuity and integrity.
In the implementation link of the self-adaptive adjustment strategy, the system intelligently optimizes the acquisition strategy according to the uncertainty evaluation result to form a closed-loop self-adaptive system.
First, the system builds uncertainty threshold rules that classify the predicted time points into multiple levels by uncertainty, such as high certainty (confidence > 0.9), medium certainty (confidence 0.6-0.9), and high uncertainty (confidence < 0.6).
The system then designs a differentiated resource allocation strategy for each level: for highly certain time points, it may reduce the actual acquisition frequency and rely more on model predictions; for moderately certain points, it maintains the normal acquisition frequency; and for highly uncertain points, it significantly increases the acquisition frequency, acquiring more real data to reduce uncertainty. The system also considers the business value of the data: for data with lower business value (such as secondary information on non-core pages), a more conservative acquisition strategy can be adopted even if the prediction is uncertain, while for high-value data (such as contact information updates of important clients), a certain acquisition frequency is maintained even when the prediction is fairly certain, so as to stay on the safe side.
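The threshold rule can be illustrated with the short Python sketch below; the thresholds follow the levels described above, while the specific re-acquisition intervals are illustrative assumptions rather than fixed parameters of the embodiment.

```python
# Sketch: map prediction confidence (and a business-value flag) to a re-acquisition interval.
def acquisition_interval_hours(confidence: float, high_value: bool = False) -> int:
    if confidence > 0.9:
        interval = 24        # high certainty: rely more on model predictions
    elif confidence >= 0.6:
        interval = 12        # medium certainty: keep the normal frequency
    else:
        interval = 3         # high uncertainty: collect real data much more often
    if high_value:
        interval = min(interval, 6)   # high-value data keeps a minimum frequency regardless
    return interval
```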
In addition, the system realizes a dynamic feedback mechanism, and when the actually acquired data has a significant difference from the prediction, the model retraining or parameter adjustment is triggered to adapt to the change of the data distribution.
The system also designs a resource balancing algorithm so that each target website and time point receives a reasonable acquisition resource allocation under the overall resource constraint. This intelligent resource scheduling considers not only the uncertainty of prediction but also practical factors such as network conditions, server load and anti-crawling mechanisms. Through this series of self-adaptive adjustment strategies, the system can maximize resource utilization efficiency while ensuring data quality, achieving optimal time continuity and integrity of the data. As the system's running time increases and data accumulation grows richer, the self-adaptive mechanism becomes increasingly accurate, forming an intelligent acquisition system with continuous self-optimization.
S5 specifically includes: S5.1, collecting historical webpage data containing various website formats and field types.
In the historical webpage data collection link, the system establishes a comprehensive and various training data resource base, and provides a solid foundation for subsequent AI model training.
First, the system systematically collects historical web page data covering various industries and types of websites, paying particular attention to pages containing rich structured information (e.g., merchant names, addresses, phone numbers, etc.). Collection sources are diversified, including published web page archives (e.g., the Wayback Machine of the Internet Archive), historical snapshots provided by commercial data suppliers, collection records accumulated by the long-term operation of the system itself, and the like.
The system classifies the collected pages in a multi-dimensional manner, marking them according to industry (such as catering, retail, service industry, etc.), website type (such as enterprise official websites, e-commerce platforms, social media, etc.), content structure (such as table format, list type, paragraph type, etc.) and technical implementation (such as static HTML, JavaScript rendering and responsive design), ensuring the diversity and representativeness of the training data.
In particular, the system focuses on gathering web page types that are especially challenging, such as unusually complex layouts, non-standard field representations, multi-lingual mixed content, and pages that convey much of their information through images, so as to enhance the generalization capability of the model. In addition, the system also collects data for the same website across different periods, capturing the evolution of website design and content organization, so that the model can adapt to continuously changing web page design styles.
For scarce but important web page types, the system will also employ synthetic data techniques to create more training samples through template variation or content reorganization. All the collected data are subjected to preliminary quality screening, pages with obvious damage, incomplete content or over-high repeatability are removed, version management is carried out, and the sources, acquisition time and basic statistical characteristics of the data are recorded.
Through the systematic historical data collection work, the system constructs a rich training resource library containing various website formats and field types, and lays a solid foundation for subsequent expert labeling and model training.
S5.2, building a training data set through expert annotation, labeling key fields including merchant names, addresses and business scopes, and carrying out functional classification of the telephone numbers in the annotated data.
In the link of expert annotation and training data set establishment, a system organizes professional team to carry out high-quality manual annotation on historical webpage data, and standard answers required by supervised learning are generated.
Firstly, the system makes a detailed labeling guide, and clearly defines various fields (such as merchant names, addresses, operation ranges, business hours and the like) to be identified, standard formats thereof and judging standards of functional classifications (such as switchboard, customer service, sales, technical support and the like) of telephone numbers.
In order to ensure the labeling quality, the system adopts a multi-level auditing mechanism: after a primary annotator finishes the basic labeling, a senior auditor reviews it, and complex or disputed cases are finally judged by domain experts.
The system also implements a performance evaluation and training mechanism of the annotators, and continuously improves the professional level and standard uniformity of the annotating team through regular consistency test and case study.
In the technical aspect, the system develops a special labeling tool, supports efficient field selection, attribute labeling and functional classification, records uncertainty and difficulty rating in the labeling process, and provides important references for subsequent model training. Considering the diversity requirement of the data, the system ensures the balance of the labeling samples in the dimensions of industry distribution, website types, field complexity and the like, and avoids the bias of training data.
For rare but important situations (such as contact information in an unconventional format), the system can specially increase the labeling proportion of corresponding samples, so that the model can process various edge situations.
In addition, the system also implements a data segmentation strategy, and the labeling data is divided into a training set, a verification set and a test set according to the proportion of 8:1:1, so that the distribution similarity of the three sets in each dimension is ensured, and meanwhile, data leakage (such as the scattering of pages from the same website into different sets) is avoided.
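As one way to realize the 8:1:1 split while avoiding leakage across sets, the following sketch keeps all pages of one website in the same partition using scikit-learn's GroupShuffleSplit; the sample structure with a "domain" key is a hypothetical assumption.

```python
# Sketch of a group-aware 8:1:1 train/validation/test split keyed by website domain.
from sklearn.model_selection import GroupShuffleSplit

def split_by_site(samples: list[dict], seed: int = 0):
    """samples: annotated pages, each carrying a 'domain' key used as the grouping unit."""
    groups = [s["domain"] for s in samples]
    idx = list(range(len(samples)))
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(outer.split(idx, groups=groups))          # ~80% train
    rest_groups = [groups[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))  # remaining 20% -> 10/10
    val_idx = [rest_idx[i] for i in val_rel]
    test_idx = [rest_idx[i] for i in test_rel]
    return list(train_idx), val_idx, test_idx
```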
Through the professional and strict labeling flow, the system establishes a high-quality training data set which contains rich merchant field information and telephone number function classification, and provides reliable supervision signals for the next feature engineering and model training.
And S5.3, carrying out feature engineering processing on the training data set, extracting DOM structural features, text semantic features and position relation features, and constructing a training model based on the DOM structural features, the text semantic features and the position relation features.
In the link of feature engineering and training model construction, the system carries out deep analysis and feature extraction on the labeling data set, and prepares rich input signals for subsequent model training.
First, the system extracts DOM structural features, including HTML tag type, tag nesting depth, element position, CSS class name and ID, element size and visibility, etc., which can reflect the structured information and visual layout of the web page.
For example, the system may identify those elements that are within a particular container (e.g., class= "contact-info") as more likely to contain contact information.
Secondly, the system extracts text semantic features, including word bag representation of text content, TF-IDF features, word embedding vectors, named entity recognition results, etc., which can capture semantic information and entity types of text.
For example, the system may learn an association pattern that identifies keywords such as "contact", "dial", and the like, to telephone numbers.
Again, the system extracts positional relationship features, including relative locations between elements, proximity relationships, possible text-to-digital pairing patterns, etc., which can express spatial relationships between page elements.
For example, the system may learn to identify common layout patterns that pair label text with its corresponding value, such as the label appearing immediately to the left of the value in "phone: 12345678".
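To illustrate the three feature families, the following sketch extracts a few toy features per HTML element using BeautifulSoup as one possible parser; the feature names and cue words are assumptions for this example, not the system's actual schema.

```python
# Illustrative extraction of DOM-structure, text-semantic and positional features per element.
from bs4 import BeautifulSoup

def element_features(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    feats = []
    for idx, el in enumerate(soup.find_all(True)):
        text = el.get_text(" ", strip=True)
        feats.append({
            # DOM structural features
            "tag": el.name,
            "depth": len(list(el.parents)),
            "css_class": " ".join(el.get("class", [])),
            # text semantic features (toy stand-ins for TF-IDF / keyword cues)
            "has_contact_cue": any(k in text.lower() for k in ("phone", "tel", "contact")),
            "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
            # positional relation feature: order of appearance in the document
            "doc_position": idx,
        })
    return feats

print(element_features('<div class="contact-info">Phone: 12345678</div>'))
```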
Based on these rich features, the system builds the infrastructure of the training model. For the field extraction task, the system employs an encoder-decoder architecture, where the encoder is responsible for converting page features into high-dimensional representations, and the decoder is responsible for identifying specific fields from those representations.
For telephone number grouping and function identification tasks, the system adopts a sequence labeling framework, treats telephone numbers and contexts thereof as a sequence, and learns and predicts the function label of each telephone number.
The system also realizes feature selection and dimension reduction technology, removes redundant features through methods such as correlation analysis, principal component analysis and the like, and improves model training efficiency.
In addition, the system designs a feature fusion mechanism, can adaptively adjust the weights of different types of features, and optimizes the feature combination aiming at different website structures. Through the systematic characteristic engineering and model construction work, the system provides rich and refined input representation for subsequent deep learning model training, and the learning efficiency and the performance level of the model are greatly improved.
S5.4, constructing a field extraction neural network model based on a pre-trained language model by adopting a transfer learning method, and fine-tuning it for webpage structured information extraction.
In the link of transfer learning and field extraction model construction, the system utilizes the strong semantic understanding capability of the pre-training language model to develop a neural network model specially used for webpage structural information extraction.
First, the system selects an appropriate pre-trained language model, such as BERT, RoBERTa or their Chinese variants, as a basis; these models have already mastered rich language knowledge and semantic understanding capabilities through self-supervised learning on vast amounts of text.
The system then performs domain-adaptive tuning on these pre-trained models, further training the models using a large number of industry-related text (e.g., business description, product introduction, etc.) to better understand domain-specific terms and expressions.
Then, the system designs a special task fine tuning stage, combines the pre-training model with a task specific output layer, and constructs a complete field extraction network.
In terms of the specific architecture, the system adopts an encoder-decoder framework, wherein the encoder, based on the pre-trained model, is responsible for converting the webpage text and its structural characteristics into context-aware vector representations, and the decoder adopts a Conditional Random Field (CRF), pointer network or similar structure, and is responsible for accurately locating and extracting the target fields from those vector representations. The system also innovatively combines text and structure information by converting the HTML structure into special tags inserted into the text sequence, enabling the model to understand both the text content and the semantics of the page structure.
For example, <div class="contact">telephone: 12345678</div> is converted into a mixed input of special tag sequences and text content. In the training process, the system adopts a multi-task learning method, simultaneously optimizing several related objectives (such as field boundary identification, field type classification, entity relation extraction, etc.) so that the model can learn data characteristics from different angles.
The system also implements adversarial training and data augmentation techniques that improve the generalization ability and robustness of the model by generating adversarial samples and transforming existing samples. Through this transfer-learning-based method, the system effectively utilizes the language knowledge contained in the pre-trained model, greatly reduces the dependence on labeled data, and improves the field extraction accuracy of the model under complex webpage structures, especially for field information with variable formats or non-standard expression.
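For illustration, the following sketch shows how a pre-trained encoder can be loaded with a token-classification head for field extraction using the Hugging Face transformers library; the checkpoint name, label set and structure-tag convention are assumptions of this example, and the CRF/pointer decoder, domain-adaptive pre-training and multi-task losses described above are omitted.

```python
# Hedged sketch of the transfer-learning setup for token-level field extraction.
from transformers import AutoTokenizer, AutoModelForTokenClassification

FIELD_LABELS = ["O", "B-NAME", "I-NAME", "B-ADDR", "I-ADDR", "B-PHONE", "I-PHONE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(FIELD_LABELS)
)

# Structure tags mixed into the text stream, simplified from the description above:
text = "[DIV class=contact] telephone: 12345678 [/DIV]"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)                       # logits shape: (1, seq_len, num_labels)
predictions = outputs.logits.argmax(dim=-1)     # per-token label ids before any CRF decoding
```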
S5.5, designing a sequence labeling network structure, and training a telephone number grouping identification model by adopting BiLSTM-CRF architecture.
In the link of sequence labeling model and telephone number grouping identification, the system designs a special neural network architecture for identifying the function types and grouping relations of a plurality of telephone numbers of the same merchant.
Firstly, the system adopts a two-way long-short-term memory network (BiLSTM) as an infrastructure, and the network can effectively capture the front-back dependency relationship of sequence data, and is particularly suitable for processing text and number sequences.
The BiLSTM network receives input features processed by the embedding layer, including the phone number text, surrounding context vocabulary, location information, etc., and learns the sequence patterns in both forward and backward directions. The system then adds a Conditional Random Field (CRF) layer on top of the BiLSTM output layer, forming a BiLSTM-CRF architecture. The CRF layer is able to learn transition probabilities between tags and consider the overall rationality of the tag sequence; for example, a merchant is unlikely to have multiple "headquarters phones" but may well have multiple "branch phones".
This structural design allows the model to not only focus on the local features of a single phone number, but also to take into account overall tag consistency constraints.
In terms of feature engineering, the system builds rich feature vectors for each phone number, including format features of the number itself (such as length, whether an area code is present, and whether it is a mobile number), contextual semantic features (such as function indicators like "customer service" and "sales" appearing nearby), and location features (such as the relative position in the page and the distance from other numbers).
The system also introduces a focus mechanism that enables the model to better focus on contextual information related to phone number function decisions, such as department names or service type descriptions that may occur before and after.
In addition, the system also realizes a multi-head self-attention structure, so that the model can simultaneously pay attention to different types of related information and integrate the related information. Through the BiLSTM-CRF architecture with professional design, the system can accurately identify the function types (such as a switchboard, customer service, sales, technical support and the like) of each of a plurality of telephone numbers under the same merchant, reasonably groups the telephone numbers with similar functions, and provides an important basis for subsequent data integration and display.
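A minimal PyTorch sketch of the BiLSTM-CRF tagger described above is given below; the CRF layer is assumed to come from the third-party pytorch-crf package, the dimensions are illustrative, and the attention mechanisms mentioned above are omitted.

```python
# Sketch of a BiLSTM-CRF sequence tagger for phone-number function labels.
import torch
import torch.nn as nn
from torchcrf import CRF  # assumption: pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, embed_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens: torch.Tensor, tags: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        feats = self.emissions(self.bilstm(self.embedding(tokens))[0])
        return -self.crf(feats, tags, mask=mask)        # negative log-likelihood

    def predict(self, tokens: torch.Tensor, mask: torch.Tensor) -> list[list[int]]:
        feats = self.emissions(self.bilstm(self.embedding(tokens))[0])
        return self.crf.decode(feats, mask=mask)        # Viterbi-decoded tag sequences

# Tag ids could index labels such as switchboard, customer service, sales, technical support.
model = BiLSTMCRFTagger(vocab_size=5000, num_tags=4)
```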
And S5.6, optimizing the hyperparameters of the field extraction network and the telephone number grouping identification model through cross-validation, improving the generalization capability of the field extraction neural network model and the sequence-labeling-based telephone number grouping identification model under different website formats.
In the link of cross-validation and hyperparameter optimization, the system adopts a scientific and rigorous method to evaluate and optimize model performance, so that the models generalize well across various website formats.
Firstly, the system implements K-fold cross-validation, usually with a 10-fold scheme: the training data set is divided into 10 subsets of similar size; in each round, 9 subsets are used to train the model and the remaining one is used for validation; this is repeated 10 times in turn, and the average performance is finally taken as the evaluation index.
The method can comprehensively evaluate the performance of the model under different data distribution, and avoid the deviation possibly caused by a single test set.
Based on the cross-validation results, the system performs comprehensive hyperparameter optimization; the adjusted parameters include network architecture parameters (such as the LSTM hidden layer size, the number of layers, the number of attention heads, etc.), optimizer parameters (such as the learning rate, momentum, weight decay, etc.), regularization parameters (such as the dropout rate, L1/L2 regularization strength, etc.), and training strategy parameters (such as the batch size, learning rate scheduling strategy, early-stopping condition, etc.).
Compared with the traditional grid search or random search, the system adopts a Bayesian optimization method to perform efficient parameter search, and the method can more intelligently explore a parameter space and quickly find out a near-optimal parameter combination.
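For illustration only, the following sketch wraps a 10-fold cross-validation loop in a Bayesian-style hyperparameter search; Optuna is assumed as the optimizer, and train_and_score() is a placeholder for training the extraction model on one fold and returning its validation F1 score.

```python
# Sketch of 10-fold cross-validation inside a Bayesian (TPE) hyperparameter search.
import optuna
from sklearn.model_selection import KFold

def train_and_score(params: dict, train_idx, val_idx) -> float:
    """Placeholder: train with these hyperparameters on train_idx, return validation F1."""
    return 0.8  # stand-in value

def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "hidden_size": trial.suggest_categorical("hidden_size", [64, 128, 256]),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64]),
    }
    indices = list(range(10_000))          # illustrative dataset size
    scores = []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(indices):
        scores.append(train_and_score(params, train_idx, val_idx))
    return sum(scores) / len(scores)       # mean cross-validated F1

study = optuna.create_study(direction="maximize")   # TPE sampler by default
study.optimize(objective, n_trials=50)
```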
During the optimization process, the system evaluates model performance separately for different types of web site formats, focusing on performance under special formats (e.g., highly dynamic pages, pages of non-traditional layout) in particular, ensuring that the model does not significantly degrade on certain specific types.
The system also adopts multi-index comprehensive evaluation, and simultaneously considers accuracy, recall rate, F1 score and custom indexes related to specific tasks, such as accuracy of field boundary identification, consistency of telephone number function classification and the like, so as to ensure balanced performance of the model in all aspects.
In addition, the system also implements an error analysis mechanism, which records and classifies error cases on the validation set in detail, identifies the model's weak points and adjusts the architecture or parameters in a targeted manner. Through this systematic cross-validation and hyperparameter optimization flow, the system finally obtains field extraction network and telephone number grouping identification models with stable performance and strong generalization capability, and these models can maintain high-level performance under various website formats and data distributions, thereby providing reliable technical support for practical application.
And S5.7, extracting a neural network model according to the fields, performing automatic analysis on the pages of each merchant group, and identifying and extracting key field information.
In the field extraction neural network application link, the system deploys the optimized model to the actual production environment, and performs automatic analysis and information extraction on pages of each merchant group.
Firstly, the system preprocesses the webpage to be handled, including HTML parsing, text extraction, feature calculation and the like, converting it into model input consistent with the training data format. Then, the system activates the field extraction neural network and feeds the preprocessed page data into the model, which automatically identifies and locates the target fields through multi-layer computation.
In the recognition process, the model calculates probability scores of the various field types, such as "merchant name", "address", "business scope", "time of establishment", etc., for each token in the text, and then finds the optimal field boundaries and type assignment by combining the CRF layer using a forward-backward algorithm.
The process fully utilizes the understanding capability of the pre-training language model to the context and the modeling capability of the sequence labeling model to the tag dependency relationship, and can accurately identify the field information with complex format and expression mode.
For each field identified, the system also calculates a confidence score reflecting the degree of certainty of the model for the extracted result. The system may be particularly concerned with low confidence extraction results and may initiate backup processes such as secondary verification using a rules engine or marking requiring manual review.
In addition, the system also realizes an adaptive processing mechanism that automatically adjusts the processing strategy for different types of webpages, such as increasing the depth of feature extraction for pages with complex structures, or parsing only after rendering for highly dynamic pages.
The system also records intermediate states and attention weight distribution in the processing process, so that subsequent result interpretation and model improvement are facilitated. Through the intelligent field extraction process, the system can accurately identify and extract key merchant information such as merchant names, detailed addresses, operation ranges, established times, registered capital and the like from the mixed webpage content, and the structured field information provides high-quality basic data for subsequent merchant portrait construction and data application.
And S5.8, according to the telephone number grouping recognition model, automatically grouping and functionally recognizing a plurality of telephone numbers of the same merchant, recognizing and extracting telephone number data.
In the application link of the telephone number grouping identification model, the system utilizes a trained BiLSTM-CRF model to automatically identify and group a plurality of telephone numbers of the same merchant.
First, the system collects as input all extracted phone numbers under the same merchant, along with their context information (e.g., descriptive text around the number, the title and URL of the page where it is located, etc.). The system then preprocesses the input data, including normalizing phone number formats, performing word segmentation and tokenization on the context text, computing location features, etc., and converts it into a sequence input form acceptable to the model. Then, the system activates the telephone number grouping recognition model: the model converts the input sequence into vector representations through an embedding layer, extracts sequence features through the BiLSTM layer, and finally outputs the most probable function label for each telephone number through the CRF layer.
This process considers not only the characteristics and direct context of each number itself, but also the interrelationship between numbers and the rationality of the overall tag sequence.
The system attaches function labels (e.g., switchboard, customer service, sales, technical support, complaints, reservations, etc.) identified by the model to the corresponding telephone numbers and groups the numbers initially according to the functional similarity. In addition, the system analyzes the regional characteristics of the number (such as the region to which the area code belongs) and the use scene (such as a private line, 24-hour service and the like), and further refines the grouping information.
For some complications, such as the possibility that a number serves multiple functions at the same time, the system will calculate the probability distribution of multiple tags and select the most dominant function as the primary tag and the other possible functions as the secondary tags.
The system may also record confidence indicators for packet identification, and for low confidence results, may initiate a manual review process.
Through this intelligent telephone number function identification and grouping processing, the system converts originally isolated telephone number data into address book information with definite functional attributes and an organizational structure, greatly improving the practicality and commercial value of the data. Users can quickly find the contact channel suited to a specific purpose as needed, contacting the sales department directly when product information is required, or technical support directly when a technical problem is encountered, thereby significantly improving the efficiency of information inquiry and use.
And S5.9, carrying out association degree analysis on the key field information and the telephone number data, and confirming the main merchant name of the data packet to obtain the structured data set.
In the link of association analysis and main body merchant confirmation, the system carries out deep association analysis on the extracted field information and telephone number data to confirm the main body merchant name and core information of the data packet.
First, the system statistically analyzes the variations in merchant names (e.g., "ABC company", "ABC group", "ABC limited", etc.) that occur in each merchant group, and calculates the frequency of occurrence, location importance (e.g., at the title or salient location), and association strength (co-occurrence relationship with other key fields) of each variation.
Based on these statistics, the system uses a weighted voting mechanism to determine the most likely subject merchant name that will be identified as the primary key for the entire dataset.
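As a purely illustrative sketch of the weighted-voting rule described above, the following example scores name variants by frequency, position importance and association strength; the weighting coefficients and the input record structure are assumptions of this example.

```python
# Sketch: confirm the subject merchant name by weighted voting over observed name variants.
from collections import defaultdict

def pick_subject_name(variants: list[dict]) -> str:
    """variants: e.g. {'name': 'ABC company', 'count': 12, 'in_title': True, 'cooccurrence': 0.7}"""
    scores: dict[str, float] = defaultdict(float)
    for v in variants:
        score = 1.0 * v["count"]                     # frequency of occurrence
        score += 5.0 if v["in_title"] else 0.0       # position importance (title / salient location)
        score += 3.0 * v["cooccurrence"]             # association strength with other key fields
        scores[v["name"]] += score
    return max(scores, key=scores.get)

variants = [
    {"name": "ABC company", "count": 12, "in_title": True, "cooccurrence": 0.7},
    {"name": "ABC group", "count": 4, "in_title": False, "cooccurrence": 0.3},
]
print(pick_subject_name(variants))  # "ABC company"
```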
In addition, the system analyzes the association pattern between the merchant name and the telephone number, and identifies the most core contact way and the corresponding business entity. Under complex conditions, the system also analyzes the hierarchical structure and content organization of the page, distinguishes the relationship between the main business and the subsidiary business, and between the parent company and the subsidiary company, and the like, and ensures the accuracy and hierarchy of the data packet. The system focuses on multiple relationships that may exist, such as different brands or business lines under the same group, by constructing entity relationship maps to represent these complex associations.
Meanwhile, the system can analyze the consistency and complementarity between fields, such as the regional consistency of addresses and telephone number area codes, the corresponding relation between the operating range and specific business departments, and the like, and the reliability of the whole data is improved through the cross verification. For information that is conflicted or inconsistent, the system may intelligently reconcile based on the confidence score and business importance, preferably preserving more reliable and more core information.
And finally, organizing all the information subjected to association analysis and integration into a structured data set which takes a main merchant as a center by the system, clearly displaying key data such as basic information, multi-level contact information, service range and the like of the merchant, and maintaining association relation and hierarchical structure among the information. The structured data set subjected to the deep association analysis and the main body confirmation not only solves the problems of data dispersion and identity confusion, but also provides rich merchant portraits and relational networks, and greatly improves the commercial application value and user experience of the data.
The embodiment of the application also provides a cyclic automation data acquisition system, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the above method is implemented when the processor executes the computer program.
The embodiment of the application also provides a computer device, which comprises at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above cyclic automation data acquisition method.
The embodiment of the application also provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above cyclic automation data acquisition method. The embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the above-described cyclic automation data acquisition method.
The application has the following technical effects:
Through the knowledge-graph-based distributed entry discovery mechanism and dynamic priority queue management, more efficient resource allocation and acquisition scheduling are realized, and acquisition efficiency is improved. The link discovery algorithm based on DOM structural features and semantic association improves link value evaluation accuracy and reduces the invalid acquisition rate. The telephone number recognition mechanism combining multiple algorithms greatly improves the accuracy and adaptability of telephone number extraction. The self-attention-based diffusion model is adopted for time series interpolation, effectively solving the problem of data continuity in interruption-recovery and incremental-update scenarios. The AI-based data extraction model can automatically adapt to different website structures, improving the accuracy and coverage of data extraction. The application of multidimensional data fingerprint and differential privacy technology ensures the uniqueness and compliance of data and improves data quality.
By the technical scheme, the method and the device can realize more efficient and more accurate data acquisition under the background of explosive growth of internet information and increasingly complex website structure, and meet the application scene demands of business intelligence, risk monitoring and the like with higher requirements on data quality and timeliness. Meanwhile, the application of the multidimensional data fingerprint and differential privacy technology ensures the uniqueness and compliance of the data, and provides important guarantee for data security and privacy protection.