
Circular automated data acquisition method and system

Info

Publication number
CN120407898A
Authority
CN
China
Prior art keywords
data
url
model
merchant
queue
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202510434655.8A
Other languages
Chinese (zh)
Inventor
金紫燕
杨泽
李思萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yulore Innovation Technology Co ltd
Original Assignee
Beijing Yulore Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yulore Innovation Technology Co ltd
Priority to CN202510434655.8A
Publication of CN120407898A
Legal status: Pending

Abstract

The present application provides a cyclic automated data collection method and system, the method comprising: forming an entry URL set; based on the entry URL set, forming a high-value link queue based on DOM structural feature analysis and semantic relevance evaluation, as well as TF‑IDF and Word2Vec link value scoring; based on the high-value link queue, obtaining page content to form a page queue containing valid phone numbers; based on the page queue, using a self-attention diffusion model to perform time series interpolation to form a merchant data set; based on the merchant data set, using a field extraction neural network model and a phone number grouping recognition model to extract field information and recognize phone number groups to generate a structured data set; based on the structured data set, performing multidimensional data fingerprint generation to perform data deduplication. The present application solves the technical problems of traditional automated data collection technology in complex web page structure recognition, data timeliness maintenance, and data quality assurance.

Description

Circular automated data acquisition method and system
Technical Field
The application relates to the technical field of Internet automated data acquisition, and in particular to a cyclic automated data acquisition method and system.
Background
The Internet automatic data acquisition technology is a technical means for automatically collecting network information and is widely applied to the fields of search engines, data analysis, market research and the like. With the explosive growth of internet information, various website structures are increasingly complex, data formats are diversified, and the traditional automatic data acquisition technology faces a great challenge.
The automated data acquisition technologies commonly available on the market currently fall into two main categories: rule-based automated data acquisition and distributed automated data acquisition. Rule-based automated data acquisition locates and extracts specific content in a web page through preset rules such as XPath and CSS selectors; it is simple to operate but adapts poorly. Distributed automated data acquisition improves acquisition efficiency through multi-node parallel processing, but extraction precision remains low for complex web page structures.
The existing automatic data acquisition technology generally adopts a single data extraction method, mainly depends on DOM analysis and regular expression matching, and has limited recognition capability on complex webpage structures. These techniques often require custom rules for specific websites when dealing with dynamic loading of content, unstructured text, and diverse layouts, lack versatility and flexibility, and fail to achieve real-time updates and efficient loop collection.
In addition, the existing automated data acquisition technology has obvious defects in terms of data timeliness and accuracy. On one hand, the internet information is frequently updated, the traditional automatic data acquisition cannot capture content change in time, and on the other hand, the entity extraction process has higher error rate and relatively slower crawling speed, so that the requirements of large-scale and high-quality data acquisition are difficult to meet. These problems severely restrict the application of automated data acquisition technology in scenes with high timeliness requirements such as business intelligence and risk monitoring.
Disclosure of Invention
In view of the above, the application provides a cyclic automated data acquisition method and system, which solve the prior-art problems of limited recognition capability for complex web page structures and insufficient data timeliness and accuracy.
The embodiment of the application provides a cyclic automated data acquisition method, comprising: constructing a multisource data entry set according to a knowledge-graph-based distributed entry discovery mechanism, and forming an entry URL set by using dynamic priority queue management, a timeliness assessment model and a semi-parametric batch-processing global decision mechanism; according to the entry URL set, forming a high-value link queue based on DOM structural feature analysis and semantic association assessment, together with TF-IDF and Word2Vec link value scoring; according to the high-value link queue, acquiring page content, extracting digit sequences with regular expressions, and performing matching judgment against a telephone number pattern knowledge base to form a page queue containing valid telephone numbers; according to the page queue, performing time series interpolation with a self-attention diffusion model, and performing merchant information association by means of a domain name extraction algorithm and a domain-name-to-merchant mapping knowledge base to form a merchant data set; according to the merchant data set, performing field information extraction and telephone number grouping identification with a field extraction neural network model based on supervised learning and a telephone number grouping identification model based on sequence labeling, to generate a structured data set; and according to the structured data set, performing multidimensional data fingerprint generation for data deduplication and applying differential privacy technology for data protection, to form a high-quality data set after deduplication and desensitization.
Constructing the multisource data entry set according to the knowledge-graph-based distributed entry discovery mechanism and forming the entry URL set by using dynamic priority queue management and the timeliness assessment model comprises: acquiring an initial URL set from a search engine API, a social media API and an industry catalog database; constructing a knowledge graph from the initial URL set in combination with domain knowledge, and identifying high-value entry nodes through graph analysis to generate a preliminarily screened URL subset; designing a dynamic priority queue for the URL subset, and calculating a priority score for each URL based on its PageRank value, content update frequency and historical acquisition results to generate a priority-ordered URL queue; and performing timeliness assessment on the URL queue, dynamically adjusting access frequency and priority, to form the entry URL set.
Forming the high-value link queue comprises: acquiring web page content according to the entry URL set, and performing HTML parsing and cleaning to obtain a standardized DOM tree structure; extracting links from the DOM tree structure with regular expressions, XPath positioning and CSS selectors to form an initial link set; for the initial link set, analyzing the relation between link text and context with the TF-IDF algorithm and calculating semantic similarity with the Word2Vec model to obtain link value scoring results; and, according to the link value scoring results, setting a dynamic threshold to filter advertisements and low-value links, and removing automated data acquisition traps with a heuristic algorithm, to form the high-value link queue.
Acquiring page content according to the high-value link queue, extracting digit sequences with regular expressions, and performing matching judgment against the telephone number pattern knowledge base to form the page queue containing valid telephone numbers comprises: acquiring page content according to the high-value link queue, and extracting visible text content through DOM parsing; extracting candidate digit sequences from the visible text content with regular expressions, and recording the context of each candidate digit sequence in the original text; performing combination and formatting operations on the candidate digit sequences with the telephone number pattern knowledge base to generate a preliminarily identified telephone number list; and performing verification and function type identification on the telephone number list in combination with the context information, to form the page queue containing valid telephone numbers.
Forming the merchant data set comprises: extracting second-level domains and top-level domains according to the page queue, and grouping them with a domain name clustering algorithm to generate domain name grouping results; establishing a domain-name-to-merchant mapping table according to a business database and historical acquisition data; adding basic merchant information to each URL according to the domain name grouping results and the domain-name-to-merchant mapping table, to generate a URL data set associated with merchants; and, according to the URL data set, subdividing content by page title and content abstract with a text clustering algorithm, to form the merchant data set.
Obtaining the field extraction neural network model based on supervised learning and the telephone number grouping identification model based on sequence labeling comprises: collecting historical web page data covering various website formats and field types, and establishing a training data set through expert annotation; performing feature engineering on the training data set to extract DOM structural features, text semantic features and positional relation features; based on these features, constructing the field extraction neural network model on top of a pre-trained language model through transfer learning, and fine-tuning it for web page structural information extraction; designing a sequence labeling network structure and training the telephone number grouping identification model with a BiLSTM-CRF architecture; and optimizing the hyperparameters of the field extraction network and the telephone number grouping identification model through cross-validation to improve the generalization capability of both models across different website formats. Field information extraction and telephone number grouping identification are then performed automatically on the merchant data set with the two models, association analysis is carried out between the extracted key field information and the telephone number data, and the primary merchant name of each data group is confirmed to obtain the structured data set.
Performing time series interpolation with the self-attention diffusion model comprises: collecting the original data generated by automated data acquisition activities, and forming a standardized time series data set through normalization and feature extraction; performing forward diffusion and reverse diffusion processes on the standardized time series data set with a conditional diffusion model to generate time series interpolation data; performing data analysis on the time series interpolation data with a diversity sampling algorithm to obtain a prediction uncertainty index; and updating the acquisition frequency parameters with an adaptive adjustment strategy according to the prediction uncertainty index, to form data with better temporal continuity and integrity.
Executing the semi-parametric batch-processing global decision mechanism with covariates specifically comprises: collecting feature data of target websites, and generating feature vectors containing domain name age, update frequency and content richness through feature engineering; training a parametric and non-parametric hybrid model on the feature vectors to form a semi-parametric reward prediction model; executing a Thompson sampling algorithm on the semi-parametric reward prediction model to calculate the expected data revenue value of each URL target, and generating a priority decision scheme for URL batch acquisition; and, according to the priority decision scheme for URL batch acquisition, dynamically adjusting the weight parameters of the URL exploration and exploitation strategy, calculating the priority score of each URL, and updating the priority-ordered URL queue to form an optimized URL access order and frequency.
Executing the Thompson sampling algorithm on the semi-parametric reward prediction model to calculate the expected data revenue value of each URL target and generating the priority decision scheme for URL batch acquisition comprises: obtaining expected reward data for the URL target set, and calculating an optimal combination through a combinatorial optimization algorithm to form an initial batch scheme; performing diversity constraint calculation on the initial batch scheme with a submodular function maximization framework to generate diversified batch combinations; coordinating multi-node task allocation for the diversified batch combinations with a federated learning framework to form a distributed decision scheme; and executing a multi-objective optimization algorithm on the distributed decision scheme to fuse compliance constraints, generating a final URL batch acquisition execution plan for updating the priority-ordered URL queue.
The embodiment of the application also provides a computer device comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the above cyclic automated data acquisition method.
The embodiment of the application also provides a computer-readable storage medium storing computer instructions for causing a computer to perform the above cyclic automated data acquisition method. The embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the above-described cyclic automated data acquisition method.
The application has the following technical effects:
Through a knowledge-graph distributed entry discovery mechanism and dynamic priority queue management, more efficient resource allocation and acquisition scheduling are realized, and acquisition efficiency is improved.
The link discovery algorithm based on DOM structural features and semantic association improves the link value evaluation accuracy and reduces the invalid acquisition rate.
The telephone number recognition mechanism combined by various algorithms is utilized, so that the accuracy and adaptability of telephone number extraction are greatly improved.
The self-attention diffusion model is adopted to perform time series interpolation, effectively solving the problem of data continuity in interrupt recovery and incremental update scenarios.
The data extraction model based on the AI can be automatically adapted to different website structures, and the accuracy and coverage rate of data extraction are improved. The application of multidimensional data fingerprint and differential privacy technology ensures the uniqueness and compliance of data and improves the data quality.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required for the embodiments are briefly described below, which are incorporated in and constitute a part of the specification, these drawings showing embodiments consistent with the present disclosure and together with the description serve to illustrate the technical solutions of the present disclosure. It is to be understood that the following drawings illustrate only certain embodiments of the present disclosure and are therefore not to be considered limiting of its scope, for the person of ordinary skill in the art may admit to other equally relevant drawings without inventive effort.
FIG. 1 is a schematic flow chart of a method for cyclic automation data collection according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of an intelligent entrance discovery method according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of an intelligent link discovery method according to an embodiment of the present application;
Fig. 4 is a flowchart of a phone number extraction method according to an embodiment of the present application.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" is used herein to describe only one relationship, and means that three relationships may exist, for example, A and/or B, and that three cases exist, A alone, A and B together, and B alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
As shown in fig. 1, an embodiment of the present application provides a method for collecting cyclic automation data, including:
s1, constructing a multisource data entry set according to a distributed entry discovery mechanism of a knowledge graph, and forming an entry URL set by using dynamic priority queue management, a timeliness evaluation model and a semi-parameter batch processing global decision mechanism.
In actual application, the system first obtains an initial set of URLs through a search engine API, a social media API, and an industry catalog database.
In the embodiment of the application, permission or agreement for the data acquisition path is obtained before automated data acquisition; for example, a browser is used to access the target page during data acquisition, and after the target page is captured, page information is acquired by OCR or through the browser's own view-source function.
For example, when telephone number data for the catering industry needs to be collected, the system can search for XX restaurant contact information through a search engine API, obtain merchant information under relevant catering topics through a microblog API, and extract restaurant list page URLs from an industry catalog.
Based on the initial URLs, the system builds a knowledge graph, takes restaurant names, areas, cuisines and other information as nodes, and establishes association relationships between them. Through graph analysis, the system identifies aggregated pages containing contact information for multiple restaurants, which typically have higher data value. The system then uses the semi-parametric batch-processing global decision mechanism to optimize resource allocation.
Specifically, for each restaurant website, the system analyzes the domain name age (for example, a review platform's domain name has a longer history), the update frequency (a food blog may update irregularly), and the content richness (an official website may be more comprehensive), and constructs a feature vector.
Based on these features, the system trains a semi-parametric reward prediction model and calculates the expected data revenue for each URL using Thompson sampling. Finally, the system generates an optimized URL batch acquisition plan, for example preferentially accessing frequently updated restaurant list pages on a review platform and reducing the access frequency to slowly updated official websites, to form the entry URL set.
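As a concrete illustration of this step, the minimal sketch below assembles candidate entry URLs from several (hypothetical) source lists, links them to the merchants they mention in a small graph, and favors aggregation pages that cover many merchants. The source dictionaries, URLs and the degree-based scoring are illustrative assumptions, not the patented graph analysis.

```python
import networkx as nx

# Hypothetical outputs of the search-engine, social-media and catalog APIs.
search_api_urls = {"https://www.dianping.com/beijing/hotpot": ["Restaurant A", "Restaurant B"]}
social_api_urls = {"https://weibo.com/restaurantA/contact": ["Restaurant A"]}
catalog_db_urls = {"https://yellowpages.example/food/list": ["Restaurant A", "Restaurant B", "Restaurant C"]}

def build_entry_candidates(*sources):
    """Build a bipartite URL-merchant graph and rank URLs by how many
    merchants they aggregate (a rough proxy for 'high-value entry node')."""
    g = nx.Graph()
    for src in sources:
        for url, merchants in src.items():
            for m in merchants:
                g.add_edge(("url", url), ("merchant", m))
    scored = [(g.degree(node), node[1]) for node in g.nodes if node[0] == "url"]
    return [url for _, url in sorted(scored, reverse=True)]

entry_url_set = build_entry_candidates(search_api_urls, social_api_urls, catalog_db_urls)
print(entry_url_set)  # aggregation pages (the catalog list page) come first
```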
S2, according to the entry URL set, a high-value link queue is formed based on DOM structural feature analysis and semantic association evaluation and link value scoring of TF-IDF and Word2 Vec.
Taking restaurant data collection as an example, the system starts with an entry URL obtained in S1 (such as an XX restaurant page on a review platform), first fetches the web content and performs preprocessing. The system parses the HTML structure, removes irrelevant elements such as advertisements and navigation, and builds a standardized DOM tree. Next, the system uses multiple technical strategies to extract links in parallel: matching all http(s) links with regular expressions, locating links to restaurant detail pages with XPath (e.g., //div[@class="shop-list"]/a), and identifying clickable elements with CSS selectors (e.g., .shop-title a).
For some content that is dynamically loaded with JavaScript (for example, more restaurants loaded as the user scrolls down), the system can also capture the dynamically generated links. The system then analyzes the value of these links using the TF-IDF algorithm and the Word2Vec model.
For example, when the link text contains keywords such as "contact us", "subscribe to phone", etc., or the text surrounding the link is highly related to the contact, the system may give a higher score. The system also considers the location of the links in the DOM tree, typically links in the page body content are more valuable than links in the navigation bar or footer.
Finally, the system sets a dynamic threshold to filter advertisements and low-value links, for example automatically eliminating obvious advertisement links (such as URLs containing "ad" or "sponsor") and typical automated data acquisition traps (such as infinite links caused by calendar paging), to form a high-value link queue whose links mainly point to restaurant detail pages, contact pages and other high-value pages containing telephone numbers.
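A much-simplified stand-in for this link value scoring is sketched below: each candidate link is scored by keyword relevance of its anchor text and surrounding context, body links are weighted above navigation or footer links, and a dynamic threshold filters the rest. The keyword list, weights and threshold rule are assumptions for illustration, not the TF-IDF/Word2Vec models of the actual system.

```python
CONTACT_KEYWORDS = {"contact", "phone", "telephone", "order", "reservation", "订餐", "联系"}
AD_MARKERS = {"ad", "sponsor", "promo"}

def score_link(href, anchor_text, context_text, in_body):
    text = f"{anchor_text} {context_text}".lower()
    if any(m in href.lower() for m in AD_MARKERS):
        return 0.0                                    # obvious advertisement link
    keyword_hits = sum(1 for kw in CONTACT_KEYWORDS if kw in text)
    position_bonus = 1.0 if in_body else 0.3          # body links beat navbar/footer links
    return keyword_hits * position_bonus

def build_high_value_queue(links):
    """links: list of (href, anchor_text, context_text, in_body) tuples."""
    scored = [(score_link(*link), link[0]) for link in links]
    positive = [s for s, _ in scored if s > 0]
    threshold = (sum(positive) / len(positive)) if positive else 0.0   # dynamic threshold
    return [url for s, url in sorted(scored, reverse=True) if s > 0 and s >= threshold]

links = [
    ("https://example.com/shop/123/contact", "Contact us", "order phone reservation", True),
    ("https://example.com/ad/banner", "Great deal!", "sponsor", False),
    ("https://example.com/shop/456", "Kitten Potato", "hot pot restaurant phone", True),
]
print(build_high_value_queue(links))
```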
And S3, acquiring page contents according to the high-value link queue, extracting a digital sequence by using a regular expression, and performing matching judgment by using a telephone number pattern knowledge base to form a page queue containing effective telephone numbers.
When the system accesses the high-value link generated in S2 (e.g., a detail page of a restaurant), the visual text content of the page is first extracted by DOM parsing. The system processes various encoding formats and recognizes hidden text, ensuring full coverage of the area that may contain telephone numbers. Next, the system extracts all potential sequences of numbers using regular expressions, including consecutive numbers (e.g., "13812345678"), numbers with separators (e.g., "010-87654321"), and special formats (e.g., "86 (10) 87654321").
For each extracted number sequence, the system records its contextual information, such as prefix text of "order phone:", "contact us:", etc. The system then verifies the candidate digits using a global phone number format rules repository.
For example, chinese phone numbers are typically 11 digits and begin with 1, and XX fixed phones are typically 010 digits plus 8 digits. The system is capable of handling various format changes such as bracketed, international area code or telephone numbers with separators.
Finally, the system combines the context information to perform function identification, such as identifying the telephone with prefix containing 'order' as restaurant order telephone and identifying the telephone containing 'complaint' as customer service telephone, thereby forming a page queue containing the effective telephone numbers of each restaurant and the functional attributes thereof.
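The sketch below mirrors this step under simplifying assumptions: a regular expression pulls candidate digit sequences together with a little prefix context, a tiny pattern knowledge base (here only two Chinese formats) validates them, and prefix keywords assign a functional label. The patterns and keyword lists are illustrative only.

```python
import re

# Minimal 'pattern knowledge base': mainland mobile (11 digits starting with 1)
# and fixed-line (3-4 digit area code plus 7-8 digits, optional separator).
PHONE_PATTERNS = [
    re.compile(r"^1[3-9]\d{9}$"),
    re.compile(r"^0\d{2,3}-?\d{7,8}$"),
]
CANDIDATE_RE = re.compile(r"(?:\d[\d\s\-()]{6,18}\d)")

FUNCTION_KEYWORDS = {"order": "order line", "订餐": "order line",
                     "complaint": "customer service", "投诉": "customer service"}

def extract_phones(visible_text):
    results = []
    for m in CANDIDATE_RE.finditer(visible_text):
        normalized = re.sub(r"[\s()]", "", m.group())        # strip spaces and brackets
        if not any(p.match(normalized) for p in PHONE_PATTERNS):
            continue
        context = visible_text[max(0, m.start() - 20):m.start()]   # prefix context
        function = next((label for kw, label in FUNCTION_KEYWORDS.items()
                         if kw in context.lower()), "unknown")
        results.append({"number": normalized, "function": function})
    return results

text = "Kitten Potato (branch)  Order line: 010-12345678  Manager mobile 13812345678"
print(extract_phones(text))
```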
And S4, performing time series interpolation by using a self-attention diffusion model according to the page queue, and performing merchant information association by means of a domain name extraction algorithm and a domain-name-to-merchant mapping knowledge base to form a merchant data set.
In the restaurant data acquisition, the system designs a reasonable resource allocation strategy, such as limiting the maximum request number of a single restaurant website to 1000 times per hour, and the total grabbing depth is not more than 5 layers.
By mixing breadth-first and depth-first strategies, the system arranges the request order so that basic information pages of many restaurants are broadly acquired first, and detailed contact information of each restaurant is then acquired in depth. The system continuously monitors acquisition progress and success rate, and when the anti-crawling mechanism of a certain catering platform is found to reduce the success rate, the request frequency for that platform can be dynamically lowered.
Meanwhile, the system records the processing state and supports resumption from a breakpoint; for example, after a network interruption, the system can continue collecting from the restaurant list where it last stopped. For collected data, the system extracts domain names and groups URLs, such as grouping all URLs from "trianding.com" into one group and "meiuan.com" into another group.
The system also uses historical data to build a domain-name-to-merchant mapping, such as identifying that both "xms.dianping.com" and "www.xiaomaoshu.com" belong to the "Kitten Potato restaurant". In addition, the system uses a self-attention diffusion model for time series interpolation, especially for restaurant data that is updated regularly (e.g., five new menus released weekly). The system records the historical update patterns and can intelligently predict likely data changes during network outages. Through multiple sampling analyses, the system increases the actual acquisition frequency for data points with high uncertainty (such as holiday special menus) and relies more on model predictions for data with high predictability (such as fixed telephone numbers), thereby optimizing resource allocation.
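For the domain grouping and merchant association part of this step, a minimal sketch is shown below: it extracts host and second-level domain with urllib and associates pages through a small (hypothetical) domain-to-merchant mapping table; the diffusion-model interpolation itself is not reproduced here. The mapping entries and page records are illustrative assumptions.

```python
from collections import defaultdict
from urllib.parse import urlparse

# Hypothetical domain-to-merchant mapping knowledge base.
HOST_MERCHANT_MAP = {
    "xms.dianping.com": "Kitten Potato restaurant",
    "www.xiaomaoshu.com": "Kitten Potato restaurant",
}

def merchant_for(url):
    """Look up by full host first, then by second-level domain (naive, no public-suffix list)."""
    host = urlparse(url).netloc.lower()
    sld = ".".join(host.split(".")[-2:]) if host else ""
    return HOST_MERCHANT_MAP.get(host) or HOST_MERCHANT_MAP.get(sld) or f"unknown merchant ({sld})"

def group_by_merchant(page_queue):
    groups = defaultdict(list)
    for page in page_queue:                      # page: {"url": ..., "phones": [...]}
        groups[merchant_for(page["url"])].append(page)
    return groups

pages = [
    {"url": "https://xms.dianping.com/shop/1", "phones": ["010-12345678"]},
    {"url": "https://www.xiaomaoshu.com/contact", "phones": ["010-12345678"]},
]
for merchant, items in group_by_merchant(pages).items():
    print(merchant, [p["url"] for p in items])
```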
And S5, extracting field information and identifying telephone number groups by utilizing a neural network model based on field extraction of supervised learning and a telephone number group identification model based on sequence labeling according to the merchant data set, and generating a structured data set.
In order to train a high-performance data extraction model, the system first collects a large amount of historical web page data covering various dining website formats, such as pages of different styles from review platforms, group-buying platforms and restaurant directory sites. An expert team annotates the data, marking the positions of key fields such as merchant name, address and business scope, and the functional attribute of each telephone number (such as order telephone or delivery telephone).
The system then performs feature engineering processing on the training data to extract DOM structural features (e.g., HTML tag type, depth where the field is located), text semantic features (e.g., text content, context), and positional relationship features (e.g., relative position in the page). Based on these features, the system builds a training model.
Through transfer learning, the system builds a field extraction neural network on top of a pre-trained BERT language model and fine-tunes it for the specific terminology and structures of the catering industry.
Meanwhile, the system adopts BiLSTM-CRF architecture to train a telephone number grouping identification model, and the model can accurately identify the function types of a plurality of telephone numbers of the same restaurant.
After model training is completed, the system uses 10-fold cross-validation to optimize the model hyperparameters, ensuring good generalization across different dining website formats. By applying these models, the system can accurately extract normalized merchant information from cluttered web page content; for example, from text such as "Kitten Potato (Korean shop) Business hours: 10:00-22:00 Order line: 010-12345678", it identifies the merchant name "Kitten Potato (Korean shop)" and the telephone number "010-12345678" with the functional attribute "order line".
And S6, performing multidimensional data fingerprint generation to perform data deduplication according to the structured data set, and performing data protection by adopting a differential privacy technology to form a high-quality data set after deduplication and desensitization.
When a large amount of catering data are processed, the system adopts a multidimensional data fingerprint algorithm to perform deduplication. The algorithm not only considers restaurant names and telephone numbers, but also combines address, business class and other information to generate unique data fingerprints.
For example, "kitten's potato (kohlrabi)" and "kitten's potato kohlrabi" are slightly different in name, but the system can recognize that they are the same restaurant through the similarity of telephone numbers and addresses.
The system adopts a multi-stage duplicate removal strategy, namely, firstly, exact matching is carried out to remove completely repeated data (such as completely identical restaurant records), then similarity calculation is used to identify approximate duplicates (such as records with slightly different names but identical phones and addresses), and finally, the most complete information is reserved through intelligent merging.
For example, one record contains a detailed address but the phone is not full, another record is complete but the address is abbreviated, and the system would merge into one record containing the complete phone and detailed address.
In terms of privacy protection, the system automatically recognizes personal privacy information, such as a restaurant manager's private cell phone number, identification card number, etc., and processes these sensitive fields.
For example, personal telephone numbers are partially masked (e.g., "138****5678") and detailed addresses are blurred (kept only to street level).
The system also adopts a differential privacy technology, and adds proper amount of noise during data analysis, so that individual information is not revealed even if aggregation statistical data are released. Through the technologies, the system finally forms a high-quality data set after de-duplication and desensitization, which not only ensures the uniqueness and the integrity of the data, but also ensures the compliance and the privacy protection.
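The sketch below illustrates, under simplifying assumptions, the three ideas of this step: a multidimensional fingerprint hashed from normalized name, phone and address; a two-stage deduplication (exact fingerprint match, then phone-plus-address similarity) with record merging; and masking plus Laplace noise on counts for the privacy side. Field names, the similarity key and the noise scale are illustrative, not the patented procedure.

```python
import hashlib
import random
import re

def fingerprint(record):
    """Multidimensional fingerprint over normalized name + phone + address."""
    norm = lambda s: re.sub(r"\s+", "", (s or "").lower())
    key = "|".join(norm(record.get(f, "")) for f in ("name", "phone", "address"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def merge(a, b):
    """Keep the most complete value for every field."""
    return {k: a.get(k) or b.get(k) for k in set(a) | set(b)}

def deduplicate(records):
    by_fp, by_phone_addr = {}, {}
    for r in records:
        fp = fingerprint(r)
        if fp in by_fp:                                    # stage 1: exact duplicate
            continue
        key = (r.get("phone"), (r.get("address") or "")[:10])
        if key in by_phone_addr:                           # stage 2: near-duplicate
            by_phone_addr[key] = merge(by_phone_addr[key], r)
        else:
            by_phone_addr[key] = r
        by_fp[fp] = r
    return list(by_phone_addr.values())

def mask_phone(phone):
    return phone[:3] + "****" + phone[-4:] if phone and len(phone) >= 7 else phone

def noisy_count(true_count, epsilon=1.0):
    """Laplace mechanism for a counting query (sensitivity 1): Laplace noise as a
    difference of two exponential draws."""
    return true_count + random.expovariate(epsilon) - random.expovariate(epsilon)

records = [
    {"name": "Kitten Potato (branch)", "phone": "01012345678", "address": "12 Example Rd, Beijing"},
    {"name": "Kitten Potato branch", "phone": "01012345678", "address": "12 Example Rd"},
]
deduped = deduplicate(records)
print(len(deduped), mask_phone("13812345678"), round(noisy_count(len(deduped)), 2))
```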
As shown in fig. 2, S1 specifically includes:
S1.1, acquiring an initial URL set according to a search engine API, a social media API and an industry catalog database.
In the multi-source data entry acquisition process, the system fully utilizes various data sources to collect and screen initial URLs.
Taking commercial telephone number collection as an example, the system first obtains a URL list of related web pages through search engine APIs (application programming interfaces) such as Baidu and Google, using keyword combinations (such as "contact information of a certain e-commerce company", "telephone of XX catering enterprise", and the like).
Meanwhile, the system accesses social media APIs such as microblogs and LinkedIn, and captures contact information and related links released by official enterprise accounts. In addition, the system also connects to industry catalog databases such as yellow page networks and enterprise information query services (e.g., Qichacha, Tianyancha), and directly extracts the enterprise web addresses and contact page links from them.
These operations produce a large number of original URLs, constituting a broad, but unfiltered, initial set of URLs. By means of the multi-source data fusion mode, the system can extend the coverage range of data acquisition to the maximum extent, and the limitation caused by dependence on a single data source is avoided. Especially for small enterprises or newly established enterprises which are difficult to find through conventional channels, the system can acquire information of the small enterprises or newly established enterprises in time through the social media API and the latest index of a search engine, so that the comprehensiveness and timeliness of data are ensured.
S1.2, constructing a knowledge graph according to the initial URL set and combining domain knowledge, and identifying high-value entry nodes through graph analysis to generate a primarily screened URL subset.
In this link, the system combines the initial URL set acquired in S1.1 with the pre-established domain knowledge to construct a simple knowledge graph.
Taking catering industry as an example, the system firstly extracts basic information such as domain names, path structures, page titles and the like from the initial URL, and the basic information is used as a basic node of the map. The system then associates these nodes with concepts in domain knowledge (e.g., restaurant classification, geographic location, cuisine type, etc.).
For example, a node in a URL that contains "huoguo" or a title that contains "hot pot" would be conceptually associated with "hot pot restaurant". Meanwhile, the system also establishes the association relation between nodes, such as linking different restaurant websites under the same company or clustering restaurants in the same area.
Based on the knowledge graph, the system can perform deep analysis to identify the most valuable entry nodes. For example, the system may prefer an aggregated page containing multiple pieces of restaurant information (e.g., recommended posts for a food forum), an official contact page, and a detail page with rich user reviews, as these pages typically contain more valid phone number information. Through the graph analysis, the system generates a URL subset subjected to preliminary screening and value evaluation, and provides a basis for subsequent prioritization.
S1.3, designing a dynamic priority queue according to the URL subset, calculating a priority score based on the PageRank value, the content update frequency and the historical acquisition result of each URL, and generating a URL queue with ordered priorities.
In the dynamic priority queue management link, the system carries out further priority evaluation and sequencing on the URL subsets screened in the step S1.2.
The system designs a dynamic priority queue, and assigns a weight value to each URL by comprehensively considering various factors.
First, the system calculates the PageRank value for each URL, evaluating its importance and authority throughout the network, with higher PageRank values being generally given higher initial weights for the corporate networks.
And secondly, the system analyzes the content updating frequency of the website corresponding to the URL, and gives higher timeliness weight to the frequently updated website (such as a catering website for updating the promotion information every day) so as to ensure that the latest information can be captured in time.
In addition, the system refers to the history collection record to give higher experience weight to websites that have provided high quality phone number data in the past. Based on the comprehensive calculation of these factors, the system generates a comprehensive priority score for each URL and creates a prioritized URL queue accordingly.
The method not only considers the static importance of the URL, but also integrates dynamic factors and historical experience, so that the resource allocation is more reasonable and efficient. It is worth noting that the queue is not static and unchanged, but is continuously adjusted according to the real-time acquisition result and the system resource state, and the dynamic characteristic is reflected.
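As a minimal sketch of the dynamic priority queue, the snippet below combines a PageRank value, an update-frequency signal and a historical-quality signal into a single score with illustrative weights, and keeps URLs in a heap so the ordering can be rebuilt whenever scores change. The weights and example statistics are assumptions.

```python
import heapq

# Illustrative weights; the real system would adjust these continuously.
W_PAGERANK, W_UPDATE, W_HISTORY = 0.5, 0.3, 0.2

def priority_score(stats):
    return (W_PAGERANK * stats["pagerank"]        # static importance/authority
            + W_UPDATE * stats["update_freq"]     # timeliness signal
            + W_HISTORY * stats["past_quality"])  # past phone-number yield

def build_queue(url_table):
    heap = []
    for url, stats in url_table.items():
        heapq.heappush(heap, (-priority_score(stats), url))   # max-heap via negation
    return heap

url_table = {
    "https://example.com/restaurants/list": {"pagerank": 0.8, "update_freq": 0.9, "past_quality": 0.7},
    "https://example.org/about": {"pagerank": 0.4, "update_freq": 0.1, "past_quality": 0.2},
}
queue = build_queue(url_table)
while queue:
    neg_score, url = heapq.heappop(queue)
    print(round(-neg_score, 3), url)
```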
The semi-parametric batch-processing global decision mechanism specifically comprises the following steps:
A1, collecting characteristic data of a target website, and generating a characteristic vector containing domain name age, updating frequency and content richness through characteristic engineering processing.
In the first step of a semi-parameter batch processing global decision mechanism, the system performs comprehensive target website characteristic data acquisition and processing.
For each target website, the system collects multidimensional feature data including domain name age (domain name registration time obtained by WHOIS query), website size (estimated by site map or page count), technical architecture (CMS system used for identification, front end framework, etc.), update frequency (analyzed by historical snapshot comparison), content richness (text content density estimated, number of multimedia elements, etc.), external reference (reverse link number and quality analyzed), and historical acquisition success rate, etc.
After collecting these original features, the system performs elaborate feature engineering. First, numerical features are standardized so that features of different dimensions become comparable. Then missing values are handled: features that cannot be obtained (such as domain name ages that cannot be queried for certain websites) are filled with industry averages or data from similar websites. Feature selection follows, using correlation analysis and importance evaluation to keep the features with the most predictive value. Finally, combined features are constructed, such as combining update frequency and content richness into an information value density feature. Through this series of processing, the system generates high-quality feature vectors containing key dimensions such as domain name age, update frequency and content richness, providing a solid foundation for subsequent model training.
A2, training by using a parameterized and non-parameterized mixed model according to the feature vector to form a semi-parameter rewarding prediction model.
In this link, the system builds a semi-parametric reward prediction model using the feature vectors generated by A1. The model adopts a mixed architecture combining parameterization and non-parameterization methods, and fully exerts the advantages of the two methods.
In the parameterization part, the system captures the linear relation between the features and the acquired rewards (such as the acquired number of effective telephone numbers) by using a generalized linear model, and the parameterization part has simple and definite structure and high calculation efficiency and is suitable for processing definite feature association.
For example, the update frequency of a website and the freshness of its data are generally positively correlated and can be expressed effectively by a linear model. In the non-parametric part, the system adopts complex models such as random forests or gradient boosted trees to capture nonlinear interactions and complex patterns among features.
This section is particularly suited to handle combinations of features that are complex or difficult to formulate with simple formulas, such as complex relationships between domain name age, content structure and collection efficiency.
The system then performs weighted fusion of the predictions of the two models, continuously adjusting the weights through Bayesian optimization, and finally forms the semi-parametric reward prediction model. This hybrid architecture retains the interpretability and computational efficiency of the parametric model while gaining the non-parametric model's ability to handle complex relationships, making it particularly suitable for automated data acquisition scenarios with diverse website characteristics and complex relationships.
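The hybrid reward model can be pictured as below: a linear model captures the parametric part, a gradient-boosted tree captures the non-parametric part, and a weighted average fuses the two predictions. The fusion weight, feature names and training data are assumptions (in practice the weight would be tuned, e.g. by Bayesian optimization); this is a sketch on synthetic data, not the patented model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Synthetic features: [domain_age, update_freq, content_richness]; reward = valid phones per request.
X = rng.random((200, 3))
y = 2.0 * X[:, 1] + np.sin(4 * X[:, 0]) * X[:, 2] + 0.1 * rng.standard_normal(200)

linear_part = LinearRegression().fit(X, y)                        # parametric component
tree_part = GradientBoostingRegressor(random_state=0).fit(X, y)   # non-parametric component

def predict_reward(features, w_linear=0.4):
    """Weighted fusion of the two components (w_linear is assumed, not optimized here)."""
    features = np.atleast_2d(features)
    return (w_linear * linear_part.predict(features)
            + (1 - w_linear) * tree_part.predict(features))

print(predict_reward([0.5, 0.8, 0.6]))
```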
And A3, executing a Thompson sampling algorithm on the semi-parametric reward prediction model to calculate the expected data revenue value of each URL target, and generating a priority decision scheme for URL batch acquisition.
Based on the semi-parametric reward prediction model trained in A2, the system calculates the expected data revenue value of each URL target using the Thompson sampling algorithm and generates an optimized batch acquisition decision scheme. The core idea of the Thompson sampling algorithm is to balance exploration and exploitation through probabilistic sampling.
In particular implementations, the system first maintains a posterior belief of the rewards distribution for each URL, which may initially be a Gaussian distribution based on a predictive model. At each decision point, the system randomly extracts a sample value from the posterior distribution of each URL, represents the potential benefit of that URL, and then selects the batch of URLs with the highest sample values for access.
This method naturally balances exploration and exploitation: URLs with wider posterior distributions (i.e., high-uncertainty options) occasionally draw high sample values and thus get a chance to be explored, while URLs with high posterior means and small variances (i.e., high-reward options) are selected frequently, reflecting the exploitation strategy.
As the system continuously collects new data, the posterior distributions gradually converge and decisions become increasingly accurate. Compared with the traditional epsilon-greedy algorithm, Thompson sampling does not require manually setting an exploration-rate parameter; it adjusts naturally according to uncertainty, making it more flexible and adaptive. In this way, the system generates a priority decision scheme for batch URL acquisition that maximizes expected revenue while keeping exploration sufficient.
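A minimal Thompson-sampling loop over URLs might look like the following: each URL keeps a Gaussian posterior over its reward, a batch is chosen by sampling from those posteriors, and observed rewards move the mean and shrink the variance. The prior parameters, batch size and simulated rewards are assumptions.

```python
import random

class UrlArm:
    def __init__(self, prior_mean=0.5, prior_var=1.0):
        self.mean, self.var, self.n = prior_mean, prior_var, 0

    def sample(self):
        return random.gauss(self.mean, self.var ** 0.5)

    def update(self, reward):
        # Running update: posterior mean moves toward observed rewards,
        # variance shrinks as observations accumulate.
        self.n += 1
        self.mean += (reward - self.mean) / self.n
        self.var = 1.0 / (1.0 + self.n)

def select_batch(arms, batch_size=2):
    ranked = sorted(arms, key=lambda url: arms[url].sample(), reverse=True)
    return ranked[:batch_size]

arms = {f"https://example.com/page/{i}": UrlArm() for i in range(5)}
for _ in range(20):
    for url in select_batch(arms):
        observed = random.random()          # stand-in for 'valid phone numbers found'
        arms[url].update(observed)
print({url: round(arm.mean, 2) for url, arm in arms.items()})
```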
And A3.1, acquiring expected reward data of the URL target set, and calculating an optimal combination through a combination optimization algorithm to form an initial batch scheme.
In a detailed implementation of thompson sampling, the system first needs to obtain the desired reward data for the URL target set and calculate the optimal combination.
For each candidate URL, the system draws a number of samples (typically 100-1000) from its posterior reward distribution and calculates the expected reward value and its confidence interval. These reward data are typically measured in terms of "effective amount of information acquired per unit of resource", such as the number of valid phone numbers obtained per request or the amount of useful data captured per second.
After obtaining the expected reward data, the system faces a combinatorial optimization problem: how to select a set of URLs that maximizes the overall expected reward when resources are limited. This is essentially a constrained knapsack problem.
The system solves it with an improved greedy algorithm or dynamic programming, taking into account dependencies and complementary effects between URLs. For example, the value of a detail page on a dining platform increases after the corresponding list page has been visited, while visiting several similar websites in the same batch may lose value due to information repetition. By solving this optimization problem, the system forms an initial batch scheme that determines which URLs should be accessed in the current batch, along with their access order and resource allocation proportions.
And A3.2, performing diversity constraint calculation on the initial batch scheme with a submodular function maximization framework to generate diversified batch combinations.
After the initial batch scheme is formed, the system introduces diversity constraints through the submodular function maximization framework to further optimize the batch combination.
Submodular functions are a class of functions with a diminishing-marginal-gain property and are well suited to modeling selection problems with diversity requirements. In an automated data acquisition system, continuously selecting URLs of the same type often leads to information redundancy and reduces overall efficiency. The system therefore defines a submodular function such that the marginal benefit of adding a new URL to a batch is inversely related to its similarity to the URLs already selected.
In a specific implementation, the system first builds a similarity matrix between URLs, calculating pairwise similarity based on dimensions such as domain name, content type and target audience. A greedy algorithm then builds batches step by step subject to the submodular constraints, each time selecting the URL that maximizes marginal revenue while ensuring that its average similarity with the already selected URLs does not exceed a preset threshold. This method guarantees the overall expected benefit while ensuring that batches contain URLs of different types and from different sources, improving information coverage and diversity.
For example, instead of focusing on a single source, the system would mix a restaurant review website, a restaurant listing site and a local business catalogue in the same batch, thereby forming a diversified batch combination.
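The diversity step can be sketched as a greedy selection under a submodular-style objective: each candidate's marginal gain is its value minus a penalty proportional to its maximum similarity with already-selected URLs, and candidates too similar to the current batch are rejected. Similarity here is a toy Jaccard overlap of URL "type tags"; the penalty weight, threshold and candidate data are assumptions.

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def greedy_diverse_batch(candidates, k=3, penalty=0.5, max_sim=0.8):
    """candidates: {url: {"value": float, "tags": [...]}}; returns a diversified batch."""
    selected = []
    while len(selected) < k:
        best_url, best_gain = None, float("-inf")
        for url, info in candidates.items():
            if url in selected:
                continue
            sim = max((jaccard(info["tags"], candidates[s]["tags"]) for s in selected), default=0.0)
            if sim > max_sim:
                continue                              # too close to something already chosen
            gain = info["value"] - penalty * sim      # diminishing marginal gain
            if gain > best_gain:
                best_url, best_gain = url, gain
        if best_url is None:
            break
        selected.append(best_url)
    return selected

candidates = {
    "https://reviews.example/r1": {"value": 0.90, "tags": ["review", "restaurant"]},
    "https://reviews.example/r2": {"value": 0.85, "tags": ["review", "restaurant"]},
    "https://ranking.example/top": {"value": 0.70, "tags": ["ranking", "restaurant"]},
    "https://directory.example/a": {"value": 0.60, "tags": ["directory", "local"]},
}
print(greedy_diverse_batch(candidates))   # mixes review, directory and ranking sources
```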
A3.3, according to the diversified batch combinations, coordinating multi-node task allocation by using a federal learning framework to form a distributed decision scheme.
After the diversified batch combinations are formed, the system utilizes the federal learning framework to coordinate multi-node task allocation, and a distributed decision scheme is generated.
In large-scale automated data acquisition systems, a plurality of acquisition nodes are typically deployed, distributed in different geographic locations or network environments. The federal learning framework enables these nodes to share knowledge and coordinate actions while maintaining some autonomy.
First, each node maintains a local model and gathers local observations, such as response time, success rate, etc. for a particular website. Then, model parameter synchronization is carried out periodically among the nodes, the global knowledge base is updated, and the original data is kept in the local place. The design not only improves the overall learning efficiency of the system, but also reduces the communication overhead.
In terms of task allocation, the system employs a decentralised coordination mechanism, such as a weighted voting or auction mechanism. For example, when multiple nodes are all adapted to access a high value URL, the system may comprehensively consider the current load, historical success rate and network conditions of the nodes to assign tasks to the most suitable nodes. The system also realizes the knowledge distillation technology, compresses and distributes the globally learned strategy to each node, so that each node can be quickly adapted to the environmental change. Through the coordination mechanism, the system forms an efficient distributed decision scheme, fully utilizes the advantages of a distributed architecture, and improves the overall acquisition efficiency and robustness.
And A3.4, executing a multi-objective optimization algorithm to fuse the compliance constraint conditions according to the distributed decision scheme, and generating a final URL batch acquisition execution plan for updating the URL queues with ordered priorities.
After the distributed decision scheme is formed, the system fuses the compliance constraint conditions through a multi-objective optimization algorithm to generate a final URL batch acquisition execution plan.
The system treats data collection as a multi-objective optimization problem, whose main objectives include maximizing data value (such as the number of telephone numbers acquired), minimizing resource consumption (such as request counts and bandwidth usage), and minimizing compliance risks (such as obeying robots.txt rules and avoiding excessive load on target websites).
The system adopts a Pareto optimization method to find the best balance point between these objectives.
Specifically, the system first converts the compliance constraints into hard conditions and soft penalty terms. For hard constraints, such as robots.txt disallow rules, the system directly eliminates URLs that violate them; soft constraints, such as request frequency limits, are translated into penalty terms added to the optimization objective function.
The system also establishes an adaptive rate limiting mechanism, dynamically adjusting the request frequency according to the response time and error rate of the target website to avoid triggering anti-crawling mechanisms. Through this multi-objective optimization method, the system generates a final URL batch acquisition execution plan that balances data value, resource efficiency and compliance risk. The plan explicitly specifies the access time, request parameters and processing priority of each URL and is used to guide the system in updating the priority-ordered URL queue, so that the acquisition process is efficient and compliant and long-term stable operation can be ensured.
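Along these lines, a hedged sketch of the hard-constraint check and the adaptive rate limiter: urllib's RobotFileParser enforces disallow rules, and a small controller widens the delay when responses slow down or errors appear. The robots.txt content, thresholds and multipliers are illustrative assumptions.

```python
import time
from urllib.robotparser import RobotFileParser

# Hard constraint: obey robots.txt disallow rules (content here is illustrative).
robots_txt = """User-agent: *
Disallow: /private/
"""
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

def allowed(url, agent="data-collector"):
    return rp.can_fetch(agent, url)

class AdaptiveRateLimiter:
    """Soft constraint: lengthen the delay on slow responses or errors, shorten it otherwise."""
    def __init__(self, base_delay=1.0, min_delay=0.5, max_delay=30.0):
        self.delay, self.min_delay, self.max_delay = base_delay, min_delay, max_delay

    def record(self, response_time, error):
        if error or response_time > 2.0:
            self.delay = min(self.delay * 2, self.max_delay)    # back off
        else:
            self.delay = max(self.delay * 0.9, self.min_delay)  # recover slowly

    def wait(self):
        time.sleep(self.delay)

limiter = AdaptiveRateLimiter()
print(allowed("https://example.com/private/x"), allowed("https://example.com/shops/1"))
limiter.record(response_time=3.1, error=False)
print(round(limiter.delay, 2))   # delay doubled after a slow response
```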
And A4, according to the priority decision scheme for URL batch acquisition, dynamically adjusting the weight parameters of the URL exploration and exploitation strategy, calculating the priority score of each URL, and updating the priority-ordered URL queue to form an optimized URL access order and frequency.
In this link, the system takes the URL batch acquisition priority decision scheme generated in A3 and further adjusts the dynamic exploration and exploitation strategy and updates priorities.
Firstly, the system dynamically adjusts the weight parameters of the exploration and utilization strategy according to the current task progress and the resource condition.
For example, in the early stages of acquisition, the system tends to increase the exploration weight, trying various types of URLs and accumulating experience; in the later stages, it increases the exploitation weight, concentrating resources on known high-return URL types to secure results.
And secondly, the system dynamically adjusts specific parameters of each URL according to real-time feedback. When the recent data quality of a certain type of URL is found to be improved significantly, the priority of the URL is correspondingly improved.
In addition, the system also takes into account time and environmental factors such as reducing requests at peak web traffic times to avoid triggering protection mechanisms, or increasing access to the e-commerce web site at certain points in time (e.g., holiday promotions). Through the multi-dimensional dynamic adjustment, the system continuously updates the priority scores of the various items in the URL queue and reorders the priority scores to form the optimized URL access sequence and frequency. The self-adaptive mechanism enables the system to keep high-efficiency running in complex and changeable network environments, and flexibly meets various challenges.
S1.4, carrying out timeliness evaluation according to the URL queue, and dynamically adjusting the access frequency and the priority to form the entry URL set.
And in the link of timeliness evaluation and self-adaptive adjustment, the system performs dynamic timeliness analysis and adjustment on the priority queue established in the step S1.3.
The system firstly establishes an updating period model of website contents, and analyzes the content change rules of different websites through the history grabbing records.
For example, the promotional information of an e-commerce platform may be updated daily, while a company's basic contact information may change monthly or quarterly. Based on these analyses, the system builds a timeliness assessment model and sets an appropriate access period for each type of URL. For highly time-sensitive content (e.g., new job contact information on recruitment sites), the system raises its priority in the queue and increases the access frequency; for less time-sensitive content (e.g., fixed-line telephones of government agencies), the access frequency is correspondingly reduced, reallocating resources to URLs that need more timely updates.
The system also realizes a self-adaptive mechanism, and can adjust the estimated update period according to the actual acquisition result. For example, if a restaurant website is accessed for a plurality of times without content change, the system automatically prolongs the access interval, and once the content is found to start to change frequently, the access interval is shortened correspondingly. Through the dynamic adjustment, the system forms an entry URL set which has the advantages of full coverage and important emphasis, so that the timeliness of data can be ensured, and the resource utilization efficiency can be optimized.
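The adaptive revisit logic can be captured in a few lines: every unchanged fetch stretches the revisit interval, any detected change shrinks it back, within fixed bounds. The growth and shrink factors and the bounds are assumptions.

```python
class RevisitScheduler:
    def __init__(self, initial_hours=24, min_hours=1, max_hours=24 * 30):
        self.interval = initial_hours
        self.min_hours, self.max_hours = min_hours, max_hours

    def observe(self, content_changed):
        if content_changed:
            self.interval = max(self.min_hours, self.interval / 2)    # revisit sooner
        else:
            self.interval = min(self.max_hours, self.interval * 1.5)  # back off

sched = RevisitScheduler()
for changed in [False, False, False, True]:
    sched.observe(changed)
    print(round(sched.interval, 1), "hours until next visit")
```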
As shown in fig. 3, S2 specifically includes:
S2.1, acquiring web page content according to the entry URL set, and performing HTML parsing and cleaning to obtain a standardized DOM tree structure.
In the page acquisition and preprocessing link, the system starts actual data acquisition work based on the optimized entry URL set.
Firstly, the system sends an HTTP request to acquire the original content of the target webpage, and the process needs to process various network conditions and server responses, including setting a reasonable timeout mechanism, processing redirection, maintaining a Cookie state and the like.
After the original content is obtained, the system first performs encoding detection, automatically identifying the character encoding of the page (such as UTF-8, GB2312, GBK) to ensure that multi-language content such as Chinese can be accurately parsed. The system then performs HTML parsing to convert the text content into a structured DOM tree; this step uses a specialized HTML parser (e.g., lxml, BeautifulSoup) that can handle non-standard HTML formats and repair common tag errors. After parsing is completed, the system performs content cleaning to remove non-content elements such as JavaScript code, CSS styles, and comments, while identifying and filtering parts irrelevant to the target data such as advertisement content, navigation bars, and footers. This step is critical to improving the accuracy of subsequent analysis and avoids interference of noisy data with the analysis results. Finally, the system normalizes the cleaned content into a standard DOM tree structure, facilitating unified processing by subsequent algorithms.
The whole pretreatment process considers the treatment of various abnormal conditions, such as automatic repair of a missing label, standardization of special characters and the like, and ensures the stability and accuracy of subsequent analysis. Through the processing, the system converts the original chaotic webpage content into a DOM tree with a clear structure and easy analysis, and lays a solid foundation for intelligent link discovery.
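For illustration, a condensed preprocessing sketch along these lines, assuming the bs4, lxml and chardet packages; the exact set of removed tags is an illustrative choice:

```python
# Sketch of the page preprocessing step: encoding detection, DOM parsing,
# and cleaning of non-content elements (assumes bs4, lxml and chardet).
import chardet
from bs4 import BeautifulSoup, Comment

def preprocess(raw_bytes: bytes) -> BeautifulSoup:
    # 1. Encoding detection (UTF-8, GB2312, GBK, ...)
    encoding = chardet.detect(raw_bytes)["encoding"] or "utf-8"
    html = raw_bytes.decode(encoding, errors="replace")

    # 2. Parse into a DOM tree; the lxml backend tolerates non-standard HTML
    soup = BeautifulSoup(html, "lxml")

    # 3. Content cleaning: scripts, styles, comments, navigation, footers
    for tag in soup(["script", "style", "noscript", "nav", "footer"]):
        tag.decompose()
    for comment in soup.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()
    return soup
```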
And S2.2, extracting links by using a regular expression, XPath positioning and a CSS selector according to the DOM tree structure to form an initial link set.
In the multi-strategy link extraction link, the system adopts various technical means to process DOM tree structures in parallel, and extracts valuable links to the maximum extent.
First, the system uses regular expressions to match all possible URL patterns, identifying http(s) links in the text content; this can capture links that are not in standard <a> tags, such as plain-text URLs or links embedded in JavaScript code.
At the same time, the system uses XPath techniques to precisely locate links in a particular structure; for example, //div[@class="content"]/a/@href can locate all links in a content area.
In addition, the system employs CSS selector techniques; for example, the class selector .product-list-item can select all item links in a product list.
Besides the static extraction method, the system also integrates a JavaScript execution environment, can capture dynamically generated links, and solves the challenges brought by the fact that modern websites adopt AJAX technology to dynamically load content in a large number. For example, when a page scroll loads more content or clicks on the "display more" button, the system can simulate these operations and extract the newly appearing links.
The system also handles special situations, such as auto-completion of relative paths (converting "/products/1" to a full URL), URL decoding (processing %20 and similar encoded characters), and removing session identifiers in URLs, to ensure that the extracted link format is uniform and valid.
Through the parallel application of the multiple strategies, the system can comprehensively collect link resources in the page to form an initial set containing links of various sources, and rich candidates are provided for subsequent value evaluation. The multi-strategy parallel method remarkably improves the coverage rate and adaptability of link discovery, and can cope with websites realized by various different structures and technologies.
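A compact sketch of this multi-strategy extraction, assuming the lxml library; the XPath expression mirrors the example above, and the regex pattern is an illustrative assumption:

```python
# Sketch of multi-strategy link extraction: regex over raw text, XPath over
# the DOM, plus relative-path completion and fragment removal.
import re
from urllib.parse import urljoin, urldefrag
from lxml import html as lxml_html

URL_RE = re.compile(r"https?://[^\s\"'<>]+")

def extract_links(page_html: str, base_url: str) -> set:
    doc = lxml_html.fromstring(page_html)
    links = set()
    # Strategy 1: regex over raw text (catches URLs outside <a> tags)
    links.update(URL_RE.findall(page_html))
    # Strategy 2: XPath for links inside a specific structure
    links.update(doc.xpath('//div[@class="content"]/a/@href'))
    # Strategy 3: all anchor hrefs as a generic fallback
    links.update(doc.xpath("//a/@href"))
    # Normalisation: complete relative paths, drop URL fragments
    return {urldefrag(urljoin(base_url, href))[0] for href in links}
```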
S2.3, analyzing the relation between the link text and the context according to the initial link set through a TF-IDF algorithm, and calculating semantic similarity by combining a Word2Vec model to obtain a link value scoring result;
in the link value evaluation link, the system performs in-depth analysis on the extracted initial link set, and calculates a value score for each link.
First, the system uses the TF-IDF algorithm to analyze the relevance of the linked text to the context. The system treats the linked text and its surrounding context as one document, calculates word frequency (TF) and Inverse Document Frequency (IDF), and identifies keywords with high discrimination. For example, links in the link text or surrounding text that contain words of "contact us," "telephone," "customer service," etc., may achieve a higher relevance score.
Meanwhile, the system performs deep semantic analysis by combining a Word2Vec model, and the model can understand semantic relations among words through pre-trained Word vectors, and can identify related contents even if completely matched keywords do not appear.
For example, even if the link does not directly contain the word "phone", but has semantically related words such as "dial", "consultation", etc., the system can identify its potential value.
In addition to text semantics, the system evaluates structural features of links, such as DOM tree node depth (links in the body content are typically more valuable than links in the navigation bar or footer), sub-link density (the greater the number of links to a page, the more important the page is typically), and the location of the links in the page (links in the center region of the page are typically more important than links in the edge region).
The system also considers historical data, such as the historical yield of a particular path pattern under the domain name. All of these features are integrated by a random forest model, and a composite value score is calculated for each link, typically ranging from 0-100, with higher scores indicating that the link is more likely to contain target information. The multidimensional scoring mechanism can comprehensively evaluate the potential value of the links and provide scientific basis for subsequent link screening.
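The following sketch shows how a composite link-value score of this kind might be assembled; the keyword list, the hand-set weights, and the 0-100 scaling are assumptions, and in the described system a trained random forest over the full feature set would replace the hand-weighted combination:

```python
# Illustrative composite link-value score combining keyword relevance
# (TF-IDF mass on target words) with simple structural features.
from sklearn.feature_extraction.text import TfidfVectorizer

TARGET_WORDS = {"contact", "telephone", "phone", "customer", "service"}

def link_value_score(link_text: str, context: str, depth: int,
                     in_main_content: bool) -> float:
    doc = f"{link_text} {context}".lower()
    vec = TfidfVectorizer()
    tfidf = vec.fit_transform([doc])
    terms = dict(zip(vec.get_feature_names_out(), tfidf.toarray()[0]))
    # Text relevance: weight carried by the target keywords
    text_score = sum(w for t, w in terms.items() if t in TARGET_WORDS)
    # Structural features: body placement and shallow DOM depth add value
    struct_score = (1.0 if in_main_content else 0.3) * max(0.0, 1 - depth / 10)
    return round(100 * min(1.0, 0.7 * text_score + 0.3 * struct_score), 1)

print(link_value_score("Contact us", "customer service telephone", 3, True))
```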
And S2.4, setting a dynamic threshold value to filter advertisements and low-value links according to the link value scoring result, and removing an automatic data acquisition trap by adopting a heuristic algorithm to form a high-value link queue.
In the link filtering and optimizing link, the system intelligently screens and processes the links scored in S2.3 to form a final high-value link queue.
First, the system sets a dynamic threshold to filter low value links, which is not fixed, but automatically adjusts according to the overall quality profile of the current lot links.
For example, if the current lot links are generally of higher quality, the system will raise the threshold to only preserve links of the best quality, and otherwise lower the threshold appropriately to ensure adequate collection. The system focuses on and filters ad links specifically, automatically eliminating such distracters by identifying common ad features (e.g., keywords including "ad", "sponsor", "promotion", etc., or pointing to ad network domain names).
Meanwhile, the system adopts heuristic algorithms to identify and avoid automated data acquisition traps, such as infinite calendar pagination, tag loops, and parameter traps, which are patterns that can cause automated data acquisition to fall into endless loops or exponential URL explosion.
The system can also detect and process the URL repetition problem, and normalize URLs (such as URLs with different session identifications or sequencing parameters) which are different in form and actually point to the same content, so that repeated access is avoided.
In addition, the system optimizes the reserved high-value links, including completing relative paths, removing URL anchors, standardizing parameter sequences and the like, so as to ensure the unified specification of the link formats.
Finally, the system ranks the priority of the links according to the value scores, and sets reasonable acquisition intervals, so that excessive requests are prevented from being sent to the same domain name in a short time.
Through the series of filtering and optimizing processes, the system extracts a high-quality high-value link queue from the original hybrid link set, provides a high-efficiency target page set for the subsequent telephone number extraction link, and remarkably improves the overall acquisition efficiency and the data quality.
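One concrete piece of this filtering stage, URL canonicalisation for duplicate avoidance, might look like the following sketch; the list of session and tracking parameters to strip is an assumption:

```python
# Sketch of URL canonicalisation: strip session/tracking parameters, sort
# the remaining query parameters, and drop the fragment (anchor).
from urllib.parse import urlparse, urlencode, urlunparse, parse_qsl

STRIP_PARAMS = {"sessionid", "sid", "phpsessid", "utm_source", "utm_medium"}

def canonicalize(url: str) -> str:
    parts = urlparse(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query)
                   if k.lower() not in STRIP_PARAMS)
    return urlunparse(parts._replace(query=urlencode(query), fragment=""))

print(canonicalize("http://example.com/p?b=2&a=1&sessionid=xyz#top"))
# -> http://example.com/p?a=1&b=2
```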
As shown in fig. 4, S3 specifically includes:
And S3.1, acquiring page contents according to the high-value link queue, and extracting visible text contents by adopting DOM analysis.
In the page text extraction link, the system accesses the target pages one by one based on the high-value link queue, and extracts text content possibly containing telephone numbers.
Firstly, the system sends an HTTP request to acquire page content, and dynamically adjusts a request strategy according to actual conditions, such as setting different User-agents, maintaining a Cookie state or processing JavaScript redirection, and the like, so as to cope with access restrictions of various websites.
After the content is acquired, the system extracts visible text through DOM parsing, and the process not only comprises obvious text elements such as regular paragraphs, titles and the like, but also pays special attention to special areas possibly containing contact information, such as footers, sidebars, contact pages and the like.
The system can intelligently handle various complications, such as handling special encodings (converting HTML entities such as &nbsp; and &amp; to normal characters), identifying alternative text descriptions (alt attributes) of images, and even trying to extract pseudo-element content in CSS styles (e.g., content added via ::before and ::after).
It is particularly noted that the system also enables the detection and extraction of hidden text; some web sites may use various techniques to hide telephone numbers, such as hiding elements with CSS (for example display:none or visibility:hidden) or setting the text color to be the same as the background. The system can identify and extract these hidden contents by analyzing the CSS attributes.
In addition, the system focuses on dynamically loaded content to obtain complete information by simulating user interactions (e.g., clicking a "display more" button or triggering a particular event). For complex layouts, the system will analyze the spatial relationship of the elements, correctly associating the phone number with its descriptive text.
Through the comprehensive and fine extraction technology, the system ensures complete coverage of all the areas possibly containing telephone numbers in the page, outputs the cleaned and normalized plain text content, and lays a foundation for subsequent digital sequence identification.
And S3.2, extracting candidate digital sequences by using a regular expression according to the visible text content, and recording the context information of each candidate digital sequence in the original text.
In the step of digital sequence recognition, the system performs fine analysis on the extracted plain text content to recognize all the digital sequences possibly constituting telephone numbers.
First, the system uses a series of specially designed regular expression patterns to identify various forms of digit combinations. These patterns include consecutive digits (e.g., 13800138000), digits with separators (e.g., 010-8888888, 0755.83744944), bracketed numbers (e.g., (010) 8888888), internationally prefixed numbers (e.g., +8613800138000), and various mixed forms (e.g., +86 (10) 6552-9988).
The regular expression of the system is carefully designed, and can process telephone number format habits of different countries and regions, such as 11-bit digital format of Chinese mobile phone numbers, 10-bit digital format of American zone numbers and the like.
At the same time, the system handles special situations, such as that the numbers in the text may be separated by spaces, tabs or line breaks, and even full-angle numbers or Chinese numbers may be used for representation.
For each identified digit sequence, the system records not only the sequence itself but also its complete context information in the original text, typically 50-100 characters before and after. Such contextual information is critical to subsequent phone number verification and function identification, and can provide key cues such as "customer service phone:", "order hotline:", or "For service:".
The system also records the position information of the digit sequence in the page, such as the HTML element type (in tags such as <p>, <div>, <span>), the CSS class name (such as class="tel" or class="contact"), and the spatial position relative to the page, which helps determine the importance and function of the digit sequence.
Through the series of fine recognition and information association processing, the system outputs a candidate number sequence set containing rich metadata, and provides a comprehensive analysis basis for the next telephone number pattern matching.
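A minimal sketch of candidate digit-sequence extraction with context capture; the regular expression and the 60-character window size are illustrative assumptions:

```python
# Sketch of candidate digit-sequence extraction: find phone-like digit runs
# and record a context window around each match for later verification.
import re

CANDIDATE_RE = re.compile(r"(?:\+?\d[\d\s\-\.\(\)]{6,18}\d)")

def extract_candidates(text: str, window: int = 60):
    for m in CANDIDATE_RE.finditer(text):
        yield {
            "raw": m.group(),
            "context": text[max(0, m.start() - window): m.end() + window],
            "offset": m.start(),
        }

sample = "Customer service phone: 010-8888 8888, open weekdays"
for c in extract_candidates(sample):
    print(c["raw"], "|", c["context"])
```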
And S3.3, carrying out various combination and formatting processing by using a telephone number mode knowledge base according to the candidate number sequence to generate a primarily identified telephone number list.
In the telephone number pattern matching link, the system uses the global telephone number format knowledge base to verify and format the candidate digit sequence.
The telephone number knowledge base of the system covers the number rules of more than 200 countries and regions worldwide, including international area codes, domestic area code length rules, total number length requirements, mobile phone number prefix rules and the like of each country/region.
For example, a Chinese cell phone number must be 11 digits and begin with 1, Chinese landline numbers are typically an area code (e.g., 010, 0755) plus 7-8 local digits, and the United States and Canada employ the North American Numbering Plan (NANP), using 3-digit area codes plus 7-digit local numbers, and so on.
The system performs various combining and formatting processes on each candidate digit sequence, for example, for "01088888888", the system may try various segmentation methods such as "010-8888-8888", "0108-888-888", etc., and then verify which segmentation conforms to the valid phone number format based on the knowledge base.
The system can also handle special situations, such as cases where local users habitually omit the area code, compound formats that include extension numbers, or multiple telephone numbers appearing in the same text block. For a multinational enterprise website, the system can identify the telephone number formats of multiple countries/regions and classify them correctly.
In the verification process, the system not only considers the compliance of the number format, but also can combine some heuristic rules to enhance the judgment accuracy, such as excluding the number sequences which are obviously the product model, price, date and the like.
To increase efficiency, the system employs a multi-stage filtering strategy that first screens out sequences that do not significantly match the telephone number characteristics quickly with simple rules, and then performs more detailed format verification on potentially valid sequences. Through this complex series of matching and verification processes, the system outputs a list of primarily identified telephone numbers, each labeled with its possible country/region attribution and format validity scores, providing a basis for subsequent context verification and classification.
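A strongly simplified sketch of knowledge-base-driven validation; only two illustrative Chinese number rules are shown, whereas the knowledge base described above covers number rules for 200+ countries and regions:

```python
# Sketch of rule-based number validation: strip formatting, remove an
# optional country prefix, and test against per-region format rules.
import re

RULES = {
    "CN-mobile":   re.compile(r"^1[3-9]\d{9}$"),       # 11 digits, starts with 1
    "CN-landline": re.compile(r"^0\d{2,3}\d{7,8}$"),   # area code + 7-8 digits
}

def classify_number(candidate: str):
    digits = re.sub(r"\D", "", candidate)
    if digits.startswith("86") and len(digits) > 11:   # strip country code
        digits = digits[2:]
    for label, pattern in RULES.items():
        if pattern.match(digits):
            return label, digits
    return None, digits

print(classify_number("+86 138 0013 8000"))   # ('CN-mobile', '13800138000')
print(classify_number("(010) 8888 8888"))     # ('CN-landline', '01088888888')
```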
And S3.4, according to the telephone number list, carrying out verification and function type identification by combining the context information to form a page queue containing effective telephone numbers.
And in the context verification and classification link, the system carries out deep context analysis on the primarily identified telephone number, further confirms the validity of the telephone number and identifies the function type of the telephone number.
First, the system analyzes the context vocabulary around each phone number, looking for keywords that can be confirmed as an explicit identification of the phone number, such as "phone", "Tel", "contact", "dial", etc.
The system analyzes texts in different ranges before and after the number by adopting a sliding window method, and distributes weights according to the distances between the keywords and the number, wherein the influence of the keywords with the closer distances is larger.
Such context-based verification can effectively exclude digit sequences that, although in format, are not actually telephone numbers, such as product numbers, order numbers, etc.
After confirming the validity, the system further analyzes the context to identify the function type of the phone number. The system uses a predefined function type dictionary containing various common telephone function classifications and their corresponding feature words, such as customer service (customer service, consultation, support), sales (sales, ordering, purchasing, sales), technical support (technical, fault, maintenance, technical), etc.
By matching the feature words in the context, the system is able to assign the most likely function label to each phone number.
For situations where the contextual information is insufficient, the system will analyze the position of the number in the page and DOM structural features, such as the number located in the "contact us" page being more likely to be a customer service call and the number located in the product detail page being more likely to be a sales call.
In addition, the system may also identify time limits for the number usage, such as "weekdays 9:00-18:00" etc. period labels.
After completing these analyses, the system stores each validated phone number, together with the URL and title of the current page, as a key-value pair in queue B. Each entry in this queue contains complete metadata such as the number text, standardized format, country/region attribution, function type, and credibility score, providing rich structured data for subsequent merchant information association. Through this deep context analysis and function recognition, the system greatly improves the accuracy and practical value of telephone number extraction.
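A toy sketch of the sliding-window function classification; the keyword dictionary and the inverse-distance weighting are assumptions standing in for the predefined function-type dictionary described above:

```python
# Sketch of context-based function typing: keywords closer to the number
# contribute more weight to their function label.
FUNCTION_KEYWORDS = {
    "customer_service": ["customer service", "consultation", "support"],
    "sales":            ["sales", "order", "purchase"],
    "technical":        ["technical", "fault", "maintenance"],
}

def classify_function(context: str, number_offset: int) -> str:
    scores = {}
    lower = context.lower()
    for label, words in FUNCTION_KEYWORDS.items():
        for w in words:
            pos = lower.find(w)
            if pos >= 0:
                # Closer keywords weigh more (simple inverse-distance decay)
                weight = 1.0 / (1 + abs(pos - number_offset) / 20)
                scores[label] = scores.get(label, 0.0) + weight
    return max(scores, key=scores.get) if scores else "unknown"

ctx = "Order hotline: 400-800-1234"
print(classify_function(ctx, ctx.find("400")))   # 'sales'
```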
S4.1, extracting a second-level domain name and a top-level domain name according to the page queue, and grouping by adopting a domain name clustering algorithm to generate a domain name grouping result.
In the domain name extraction and analysis link, the system carries out systematic processing on the page queues containing the effective telephone numbers so as to realize the preliminary classification of merchant information.
First, the system parses each URL, extracting its secondary domain name and top domain name. For example, from a URL such as "https:// beiding. Shop. Sample. Com/contact", the system recognizes the secondary domain name "shop. Sample" and the top domain name "com", while recording the sub domain name "beijing" as possible region information. The system may also identify special situations, such as a website using country code top-level domain names (e.g.,. Cn,. Jp) may represent business entities in a particular country, and a website using a special domain name of. Edu,. Gov, etc. may be an educational institution or government agency. After extracting the domain name, the system adopts a domain name clustering algorithm to group. Such clustering is based not only on perfect matching, but also on considering the similarity of domain names, and can identify cases such as "shop. Sample. Com" and "mobile. Sample. Com" that originate from the same organization but use different subdomains. The system calculates the similarity of the domain names by using algorithms such as editing distance, longest public substring and the like, and performs intelligent matching by combining known common domain name modes (such as www, m, shop common to enterprises and other subdomain prefixes). In addition, the system also analyzes the path structure pattern of the URL, identifying different merchants that may be from the same content management system or e-commerce platform. For example, "platform.com/shop/A" and "platform.com/shop/B" may be different merchants on the same platform. Through the multidimensional domain name analysis and clustering technology, the system generates a domain name grouping result, effectively groups pages possibly belonging to the same merchant or the same organization into a group, and lays a foundation for subsequent merchant information association. The primary grouping based on the domain name greatly improves the data processing efficiency and avoids redundant work of independently processing each URL.
And S4.2, establishing a domain name merchant mapping table according to the business database and the historical acquisition data.
In the construction link of the domain name merchant mapping table, the system integrates various data sources to establish the corresponding relation between the domain name and the actual merchant entity.
First, the system utilizes existing commercial database resources, such as enterprise business registration databases, commercial information service platforms (such as Qichacha and Tianyancha), and industry catalog databases, to obtain known domain name-merchant correspondences. These official or professional data sources provide a large amount of validated base mapping information.
And secondly, analyzing historical acquisition data by the system, and extracting the association mode of the domain name and the merchant name from the historical acquisition data. Through statistical analysis of a large amount of historical data, the system can identify those high-frequency and stable correspondences, such as that a specific domain name almost always appears simultaneously with a certain merchant name.
The system also adopts a machine learning technology to train a special mapping prediction model, and the model comprehensively considers domain name text characteristics (such as brand keywords contained in domain names), webpage content characteristics (such as website logo and company names in copyright statement) and link relation characteristics (such as reference modes of other known merchant websites) to predict merchant entities possibly corresponding to unknown domain names.
The model continuously improves the accuracy through continuous learning, and can process new domain names which are not directly matched with records.
In addition, the system implements a manual checksum feedback mechanism that allows an expert to review and correct automatically generated mappings, which in turn is used to further improve model performance. Through the multi-source data fusion and intelligent learning technologies, the system constructs a comprehensive and accurate domain name merchant mapping table, which not only contains direct mapping relations, but also records the reliability scores and data sources of mapping, and provides reliable knowledge base support for subsequent merchant information association. The construction of the mapping table is a dynamic and continuous process, and the system can update and expand the mapping data periodically to ensure that the mapping data is kept synchronous with the continuously changing internet business environment.
And S4.3, adding merchant basic information for each URL according to the domain name grouping result and the domain name merchant mapping table, and generating a URL data set associated with the merchant.
In the primary association link of the merchant information, the system combines the domain name grouping result generated before with the domain name merchant mapping table, and adds corresponding merchant basic information for each URL.
Firstly, the system performs query matching on each domain name group, and finds out the corresponding merchant record from the domain name merchant mapping table. For the case of direct matching, the system directly associates the corresponding merchant ID, name, industry classification, etc. base information.
For domain names that do not directly match a record, the system attempts approximate matching based on similarity calculations and inference rules, for example when dealing with domain name variants (example-shop.com versus exampleshop.com) or subdomain changes (shop.example.com versus m.example.com).
After determining the merchant association, the system tags each URL with a set of core merchant attributes, including a unique identifier (e.g., merchant ID), merchant name (possibly including the formal name and common abbreviations), industry classification (e.g., major categories such as catering, retail, and services, plus finer sub-categories), business scale (e.g., large chain, small business, individual merchant), establishment time, and regional information.
For special cases, such as where a domain name may correspond to multiple merchants (e.g., commercial platform websites) or where a merchant may use multiple domain names, the system may establish many-to-many associations and record the confidence scores for each association.
In addition, the system marks the source of the data and the timestamp for each association, thereby facilitating subsequent data updating and conflict resolution. Through the systematic information association process, URL data is converted from simple website links to structured data with rich merchant contexts, so that a URL data set associated with merchants is formed. The association not only provides valuable business background information, but also provides important classification dimension for subsequent data grouping and content analysis, thereby greatly enhancing the business value and application potential of the data.
And S4.4, according to the URL data set, conducting content subdivision on the page title and the content abstract by using a text clustering algorithm to form the merchant data set.
In a content-based grouping optimization link, the system performs finer content analysis and grouping on URL datasets of associated merchant information.
First, the system analyzes the title and content abstract of the page corresponding to each URL, and extracts keywords and topic information. Through natural language processing techniques, the system identifies the main content type of the page, such as a product introduction page, a contact information page, a company profile page, and so forth.
The system adopts a text clustering algorithm (such as K-means, hierarchical clustering or topic model) to finely group different content pages of the same merchant.
For example, all of the store pages of a restaurant may form one sub-cluster, the menu pages form another sub-cluster, and the order contact pages form a third sub-cluster. The clustering not only considers text similarity, but also combines the URL path mode and the page structure characteristics, so that the functional category of the content can be identified more accurately. The system is particularly concerned with those pages containing contact information, which are marked as high value data sources.
For large merchant websites, the system can also identify page groups of different departments or business lines, such as sales departments, customer service centers, technical supports and the like, which has important guiding significance for the subsequent telephone number function classification. In addition, the system analyzes the temporal attributes of the pages to identify regularly updated content (e.g., promotional information) and relatively stable content (e.g., primary contact) for differentiated handling.
With such content-based fine grouping, the system organizes URL data that might otherwise be intermixed into a well-structured, well-functioning merchant dataset, each grouping having specific content features and business functions. The optimized data structure not only improves the accuracy of the subsequent AI model extraction, but also provides a more reasonable organization framework for data display and application, so that the final data product meets the actual service requirements and the use habits of users.
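A small sketch of this content-based sub-grouping step using TF-IDF vectors and K-means (scikit-learn); the sample titles and the number of clusters are illustrative:

```python
# Sketch of content-based page grouping for one merchant: vectorise page
# titles with TF-IDF and cluster them with K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

titles = [
    "Downtown branch - address and map",
    "Airport branch - address and map",
    "Menu and seasonal dishes",
    "Order hotline and delivery contact",
]
X = TfidfVectorizer().fit_transform(titles)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
for title, label in zip(titles, labels):
    print(label, title)   # the two branch pages should share a cluster label
```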
S4.5, original data generated by an automatic data acquisition activity are acquired, and a standardized time sequence data set is formed through normalization and feature extraction processing.
In the link of automatic data acquisition activity data processing and standardization, the system carries out systematic processing on the original data generated in the acquisition process, and prepares for subsequent time sequence analysis.
First, the system collects raw data generated by all automated data collection activities, including the time stamp, URL, response status code, response time, data size, and extracted information content of each request, etc. These raw data are often in different formats, are large-scale and contain noise, and require standardized processing.
The system first cleans the data, handling missing values (e.g., requests that received no response), outliers (e.g., extreme response times), and conflicting data (e.g., the same URL returning different content within a short time). The system then performs data normalization to uniformly convert features of different dimensions (such as response time on the millisecond scale and data volume on the KB scale) into a standard range, usually using Z-score normalization or Min-Max scaling. Next, the system performs feature extraction to mine valuable temporal pattern features from the raw data, such as the access frequency of specific URLs, content update periods, and the variation of data acquisition success rate over time.
The system can also construct derivative features, such as second-order features of calculated data change rate, request density, content similarity and the like, so that the expression capability of the data is enhanced. Finally, the system performs time granularity alignment on the data according to the service requirement, and may aggregate the original second-level data into a minute-level, hour-level or day-level time sequence for subsequent analysis.
Through the series of processing, the system converts chaotic original automatic data acquisition activity data into a standardized time sequence data set with unified structure, rich characteristics and time alignment, the data not only reflects the content change rule of a target website, but also records the performance characteristics of the automatic data acquisition system, and provides a high-quality training and reasoning basis for a subsequent diffusion model.
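A minimal sketch of the normalisation step over a toy feature matrix; the column meanings are illustrative assumptions:

```python
# Sketch of Z-score and Min-Max normalisation of acquisition features.
import numpy as np

# rows = time points, columns = [response_time_ms, payload_kb, success_rate]
raw = np.array([[120.0, 35.2, 1.0],
                [450.0, 36.1, 1.0],
                [130.0,  0.0, 0.0]])   # a failed request leaves zeros

z_scores = (raw - raw.mean(axis=0)) / raw.std(axis=0)
min_max  = (raw - raw.min(axis=0)) / (raw.max(axis=0) - raw.min(axis=0))
print(np.round(z_scores, 2))
print(np.round(min_max, 2))
```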
And S4.6, performing forward diffusion and backward diffusion processes by using a conditional diffusion model according to the standardized time series data set to generate time series interpolation data.
In the link of conditional diffusion model and time sequence interpolation, the system utilizes the diffusion probability model technology of the leading edge to process the problem of missing values in the time sequence data.
The core idea of the conditional diffusion model is to consider the data generation process as a step-by-step denoising process, and the technology is particularly suitable for processing website content change data with complex time dependence.
First, the system defines a forward diffusion process, gradually adding gaussian noise to the complete time series data through multiple steps until completely randomized.
The system then trains a neural network to learn the back-diffusion process, i.e., to gradually recover the original signal from the noise. The network usually adopts a U-Net or Transformer architecture, which can effectively capture the long-term dependency relationships of time series data.
After training, the system uses the model to perform conditional generation, namely when missing segments in the time sequence are encountered, a known part of the time sequence is used as a conditional input to guide the model to generate missing parts consistent with the known data.
In particular, the system will leave the known data points unchanged, applying a back diffusion process to only the missing parts, gradually recovering the possible data values from random noise. This conditional generation ensures natural continuity of the interpolation results with the known data. Compared with the traditional interpolation method, the diffusion model can generate a result which is more consistent with the inherent distribution characteristic of the data, especially for updating the data for website contents with complex modes (such as periodicity, trend and sudden change).
For example, when an e-commerce web site is temporarily inaccessible due to technical maintenance, the system can predict content changes that may occur during this period of time based on historical access patterns and automatically adjust the prediction after access is resumed. By the advanced time sequence interpolation technology, the system can effectively process the problem of data loss caused by various reasons, ensure the continuity and the integrity of time sequence data and provide a reliable basis for subsequent analysis.
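The conditional (inpainting-style) generation can be sketched as follows; the linear noise schedule and the toy denoiser are placeholders for the trained network and schedule described above, and only the control flow, which re-imposes observed points at every reverse step, is meant to be representative:

```python
# Sketch of conditional reverse diffusion for time-series gaps: missing
# positions start from noise and are denoised step by step, while observed
# positions are re-imposed after every step.
import numpy as np

def conditional_impute(series, missing_mask, denoise_step, steps=50, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    x = np.where(missing_mask, rng.standard_normal(series.shape), series)
    for t in reversed(range(1, steps + 1)):
        noise_level = t / steps                   # toy linear schedule
        x = denoise_step(x, noise_level)          # one reverse (denoising) step
        x = np.where(missing_mask, x, series)     # keep observed values fixed
    return x

# Toy denoiser: pulls values toward the series mean as the noise level drops
toy_denoiser = lambda x, level: x * level + x.mean() * (1 - level)

obs = np.array([1.0, 1.2, np.nan, np.nan, 1.8])
mask = np.isnan(obs)
filled = conditional_impute(np.nan_to_num(obs), mask, toy_denoiser)
print(filled.round(2))    # observed points stay fixed, gaps receive values
```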
And S4.7, carrying out data analysis by using a diversity sampling algorithm according to the time sequence interpolation data to obtain a prediction uncertainty index.
In the steps of diversity sampling and uncertainty evaluation, the system carries out deep analysis on time series interpolation data generated by the diffusion model, and the reliability of prediction is quantized.
First, the system employs a diversity sampling strategy, rather than simply generating a single prediction result, by running the diffusion model multiple times, each time using a different random seed, generating multiple sets of possible interpolation schemes (typically 50-100 sets). These different schemes together form a prediction distribution reflecting the uncertainty of the model for the predicted values at different points in time.
Next, the system calculates statistical features of all sampling results, such as a mean (as a final predicted value), a standard deviation (as an uncertainty measure), a quantile (for constructing a prediction interval), etc., for each time point. The system is particularly concerned with those points in time where the inter-sample variance is large, which generally indicates that the model's predictions at these points are highly ambiguous and may require more real data to verify.
Based on these statistical analyses, the system generates an uncertainty indicator, typically expressed as a confidence score between 0 and 1, or the width of the prediction interval, for each predicted point in time.
The system may also identify different types of uncertainty sources, such as aleatoric uncertainty (randomness of the data itself) and epistemic uncertainty (limitations of the model's knowledge). For unusual changes that may be caused by special events (e.g., an e-commerce site's traffic surging on a major sale day), the system may mark a high-uncertainty area and alert that it requires special attention.
In addition, the system will continually update these uncertainty estimates over time, and as more observations are obtained, the prediction interval of the model will typically narrow, with reduced uncertainty. Through the comprehensive diversity sampling and uncertainty evaluation, the system not only provides specific predicted values, but also quantifies the reliability degree of the predictions, so that a decision maker can reasonably allocate resources according to the reliability degree of the predictions, and a data acquisition strategy can be planned more scientifically.
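A small sketch of how the per-time-point statistics might be summarised from repeated sampling runs; the 90% interval and the toy confidence mapping are assumptions:

```python
# Sketch of diversity-sampling statistics: mean, spread, prediction interval
# and a toy 0-1 confidence score per time point.
import numpy as np

def uncertainty_report(samples: np.ndarray) -> dict:
    """samples: array of shape (n_runs, n_time_points)."""
    mean = samples.mean(axis=0)                       # final predicted value
    std = samples.std(axis=0)                         # uncertainty measure
    lo, hi = np.percentile(samples, [5, 95], axis=0)  # 90% prediction interval
    confidence = 1.0 / (1.0 + std)                    # toy confidence mapping
    return {"mean": mean, "std": std, "interval": (lo, hi),
            "confidence": confidence}

samples = np.random.default_rng(1).normal(loc=10, scale=[0.1, 2.0], size=(100, 2))
report = uncertainty_report(samples)
print(report["confidence"].round(2))   # higher for the low-variance point
```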
And S4.8, according to the prediction uncertainty index, executing a self-adaptive adjustment strategy to update the acquisition frequency parameter, so as to form data with better time continuity and integrity.
In the implementation link of the self-adaptive adjustment strategy, the system intelligently optimizes the acquisition strategy according to the uncertainty evaluation result to form a closed-loop self-adaptive system.
First, the system builds uncertainty threshold rules that classify the predicted time points into multiple levels by uncertainty, such as high certainty (confidence > 0.9), medium certainty (confidence 0.6-0.9), and high uncertainty (confidence < 0.6).
The system then designs a differentiated resource allocation strategy for each level: for highly certain time points it may reduce the actual acquisition frequency and rely more on model predictions, for moderately certain points it maintains the normal acquisition frequency, and for highly uncertain points it significantly increases the acquisition frequency to obtain more real data and reduce uncertainty. The system also considers the business value of the data: for data with lower business value (such as secondary information on non-core pages) it may adopt a more conservative acquisition strategy even when the prediction is uncertain, while for high-value data (such as contact information updates of important clients) it maintains a certain acquisition frequency even when the prediction is fairly certain, to stay on the safe side.
In addition, the system realizes a dynamic feedback mechanism, and when the actually acquired data has a significant difference from the prediction, the model retraining or parameter adjustment is triggered to adapt to the change of the data distribution.
The system also designs a resource balancing algorithm so that, under the overall resource constraint, a reasonable allocation of acquisition resources can be obtained for each target website and time point. This intelligent resource scheduling considers not only the prediction uncertainty but also practical factors such as network conditions, server load, and anti-crawling mechanisms. Through this series of adaptive adjustment strategies, the system can maximize resource utilization efficiency while ensuring data quality, achieving optimal time continuity and integrity of the data. As the system's running time increases and data accumulates, this adaptive mechanism becomes increasingly accurate, forming a continuously self-optimizing intelligent acquisition system.
S5, specifically, collecting historical webpage data containing various website formats and field types S5.1.
In the historical webpage data collection link, the system establishes a comprehensive and various training data resource base, and provides a solid foundation for subsequent AI model training.
First, the system systematically collects historical web page data covering various industries and website types, focusing on pages containing rich structured information (e.g., merchant names, addresses, phone numbers). Collection sources are diversified, including public web page archives (e.g., the Internet Archive's Wayback Machine), historical snapshots provided by commercial data suppliers, and collection records accumulated by the system's own long-term operation.
The system classifies the collected pages in a multi-dimensional manner, marks the collected pages according to industries (such as catering, retail, service industry and the like), website types (such as enterprise official networks, electronic commerce platforms, social media and the like), content structures (such as table formats, list types, paragraph types and the like) and technical implementation (such as static HTML, javaScript rendering and responsive design) and ensures the diversity and representativeness of training data.
In particular, the system may focus on gathering web page types that are particularly challenging, such as unusual complex layouts, non-standard field representations, multi-lingual mix content, pages that use a large amount of pictorial information, etc., to enhance the generalization capability of the model. In addition, the system also collects the same website data in different periods, captures the evolution trend of website design and content organization, and enables the model to adapt to the continuously-changing webpage design style.
For scarce but important web page types, the system will also employ synthetic data techniques to create more training samples through template variation or content reorganization. All the collected data are subjected to preliminary quality screening, pages with obvious damage, incomplete content or over-high repeatability are removed, version management is carried out, and the sources, acquisition time and basic statistical characteristics of the data are recorded.
Through the systematic historical data collection work, the system constructs a rich training resource library containing various website formats and field types, and lays a solid foundation for subsequent expert labeling and model training.
S5.2, building a training data set through expert annotation, and carrying out functional classification on key fields containing merchant names, addresses and business ranges and telephone numbers in the annotation data.
In the link of expert annotation and training data set establishment, a system organizes professional team to carry out high-quality manual annotation on historical webpage data, and standard answers required by supervised learning are generated.
Firstly, the system makes a detailed labeling guide, and clearly defines various fields (such as merchant names, addresses, operation ranges, business hours and the like) to be identified, standard formats thereof and judging standards of functional classifications (such as switchboard, customer service, sales, technical support and the like) of telephone numbers.
To ensure labeling quality, the system adopts a multi-level auditing mechanism: after a primary annotator finishes basic labeling, a senior reviewer checks the work, and complex or disputed cases are finally adjudicated by domain experts.
The system also implements a performance evaluation and training mechanism of the annotators, and continuously improves the professional level and standard uniformity of the annotating team through regular consistency test and case study.
In the technical aspect, the system develops a special labeling tool, supports efficient field selection, attribute labeling and functional classification, records uncertainty and difficulty rating in the labeling process, and provides important references for subsequent model training. Considering the diversity requirement of the data, the system ensures the balance of the labeling samples in the dimensions of industry distribution, website types, field complexity and the like, and avoids the bias of training data.
For rare but important situations (such as contact information in an unconventional format), the system can specially increase the labeling proportion of corresponding samples, so that the model can process various edge situations.
In addition, the system also implements a data segmentation strategy, and the labeling data is divided into a training set, a verification set and a test set according to the proportion of 8:1:1, so that the distribution similarity of the three sets in each dimension is ensured, and meanwhile, data leakage (such as the scattering of pages from the same website into different sets) is avoided.
Through the professional and strict labeling flow, the system establishes a high-quality training data set which contains rich merchant field information and telephone number function classification, and provides reliable supervision signals for the next feature engineering and model training.
And S5.3, carrying out feature engineering processing on the training data set, extracting DOM structural features, text semantic features and position relation features, and constructing a training model based on the DOM structural features, the text semantic features and the position relation features.
In the link of feature engineering and training model construction, the system carries out deep analysis and feature extraction on the labeling data set, and prepares rich input signals for subsequent model training.
First, the system extracts DOM structural features, including HTML tag type, tag nesting depth, element position, CSS class name and ID, element size and visibility, etc., which can reflect the structured information and visual layout of the web page.
For example, the system may identify elements located within a particular container (e.g., class="contact-info") as more likely to contain contact information.
Secondly, the system extracts text semantic features, including word bag representation of text content, TF-IDF features, word embedding vectors, named entity recognition results, etc., which can capture semantic information and entity types of text.
For example, the system may learn an association pattern that identifies keywords such as "contact", "dial", and the like, to telephone numbers.
Again, the system extracts positional relationship features, including relative locations between elements, proximity relationships, possible text-to-digital pairing patterns, etc., which can express spatial relationships between page elements.
For example, the system may learn the common layout pattern that pairs label text with its corresponding value, such as the label-on-the-left, value-on-the-right pattern in "phone: 12345678".
Based on these rich features, the system builds the infrastructure of the training model. For the field extraction task, the system employs an encoder-decoder architecture, where the encoder is responsible for converting page features into high-dimensional representations, and the decoder is responsible for identifying specific fields from those representations.
For telephone number grouping and function identification tasks, the system adopts a sequence labeling framework, treats telephone numbers and contexts thereof as a sequence, and learns and predicts the function label of each telephone number.
The system also realizes feature selection and dimension reduction technology, removes redundant features through methods such as correlation analysis, principal component analysis and the like, and improves model training efficiency.
In addition, the system designs a feature fusion mechanism, can adaptively adjust the weights of different types of features, and optimizes the feature combination aiming at different website structures. Through the systematic characteristic engineering and model construction work, the system provides rich and refined input representation for subsequent deep learning model training, and the learning efficiency and the performance level of the model are greatly improved.
S5.4, constructing a field extraction neural network model based on the pre-training language model by adopting a transfer learning method, and carrying out fine adjustment on the extraction of the structural information of the webpage.
In the link of transfer learning and field extraction model construction, the system utilizes the strong semantic understanding capability of the pre-training language model to develop a neural network model specially used for webpage structural information extraction.
First, the system selects an appropriate pre-trained language model, such as BERT, RoBERTa, or their Chinese variants, as a basis; these models have acquired rich language knowledge and semantic understanding capabilities through self-supervised learning on massive text corpora.
The system then performs domain-adaptive tuning on these pre-trained models, further training the models using a large number of industry-related text (e.g., business description, product introduction, etc.) to better understand domain-specific terms and expressions.
Then, the system designs a special task fine tuning stage, combines the pre-training model with a task specific output layer, and constructs a complete field extraction network.
The system adopts an encoder-decoder framework in the specific architecture: the encoder, based on the pre-trained model, is responsible for converting the web page text and its structural features into context-aware vector representations, while the decoder adopts a Conditional Random Field (CRF), pointer network, or similar structure and is responsible for accurately locating and extracting target fields from those representations. The system also innovatively combines text and structure information, converting the HTML structure into special tokens inserted into the text sequence, so that the model can understand both the text content and the semantics of the page structure.
For example, <div class="contact">telephone: 12345678</div> is converted into a mixed input of special token sequences and text content. In the training process, the system adopts a multi-task learning method, simultaneously optimizing multiple related objectives (such as field boundary identification, field type classification, and entity relation extraction) so that the model can learn data characteristics from different angles.
The system also implements adversarial training and data augmentation techniques that improve the generalization ability and robustness of the model by generating adversarial samples and transforming existing samples. Through this transfer-learning-based method, the system effectively utilizes the language knowledge contained in the pre-trained model, greatly reduces dependence on labeled data, and improves the model's field extraction accuracy under complex web page structures, especially for field information with variable formats or non-standard expression.
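A minimal sketch of the transfer-learning setup, framing field extraction as token classification with the Hugging Face transformers library; the label set, the bert-base-chinese checkpoint, and the example sentence are illustrative, and the CRF or pointer-network decoder described above is omitted:

```python
# Sketch of a pre-trained encoder with a token-classification head for
# field extraction; fine-tuning on the labeled data would follow.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "B-MERCHANT", "I-MERCHANT", "B-ADDR", "I-ADDR", "B-PHONE", "I-PHONE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(LABELS))

text = "联系电话：12345678"
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**enc).logits                 # (1, seq_len, num_labels)
pred = logits.argmax(-1)[0]
print([LABELS[i] for i in pred.tolist()])        # untrained head: labels random
```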
S5.5, designing a sequence labeling network structure, and training a telephone number grouping identification model by adopting BiLSTM-CRF architecture.
In the link of sequence labeling model and telephone number grouping identification, the system designs a special neural network architecture for identifying the function types and grouping relations of a plurality of telephone numbers of the same merchant.
Firstly, the system adopts a two-way long-short-term memory network (BiLSTM) as an infrastructure, and the network can effectively capture the front-back dependency relationship of sequence data, and is particularly suitable for processing text and number sequences.
The BiLSTM network receives input features processed by the embedding layer, including the phone number text, surrounding context vocabulary, and location information, and learns the sequence patterns from both forward and backward directions. The system then adds a Conditional Random Field (CRF) layer on top of the BiLSTM output layer, forming the BiLSTM-CRF architecture. The CRF layer is able to learn transition probabilities between tags, taking into account the overall plausibility of the tag sequence; for example, a merchant is unlikely to have multiple "headquarters phones" but may well have multiple "branch phones".
This structural design allows the model to not only focus on the local features of a single phone number, but also to take into account overall tag consistency constraints.
In terms of feature engineering, the system builds rich feature vectors for each phone number, including format features of the number itself (such as length, area code, mobile phone number), context semantic features (such as function indicators "customer service", "sales", etc. appearing around), location features (such as relative location in the page, distance from other numbers), etc.
The system also introduces a focus mechanism that enables the model to better focus on contextual information related to phone number function decisions, such as department names or service type descriptions that may occur before and after.
In addition, the system also realizes a multi-head self-attention structure, so that the model can simultaneously pay attention to different types of related information and integrate the related information. Through the BiLSTM-CRF architecture with professional design, the system can accurately identify the function types (such as a switchboard, customer service, sales, technical support and the like) of each of a plurality of telephone numbers under the same merchant, reasonably groups the telephone numbers with similar functions, and provides an important basis for subsequent data integration and display.
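A compact PyTorch sketch of the BiLSTM emission network; the dimensions are illustrative, and the CRF layer described above would be stacked on the per-token scores (for example via the pytorch-crf package) rather than reimplemented here:

```python
# Sketch of a BiLSTM emission model producing per-token tag scores for
# phone-number function tagging; a CRF layer would consume these scores.
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, embed_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.emit = nn.Linear(2 * hidden, num_tags)   # per-token tag scores

    def forward(self, token_ids):
        out, _ = self.lstm(self.embed(token_ids))
        return self.emit(out)                         # (batch, seq, num_tags)

model = BiLSTMTagger(vocab_size=5000, num_tags=5)
scores = model(torch.randint(0, 5000, (2, 12)))       # 2 sequences, 12 tokens
print(scores.shape)                                   # torch.Size([2, 12, 5])
```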
And S5.6, extracting super parameters of the network and telephone number grouping identification model through cross verification optimization fields, and improving generalization capability of the field extraction neural network model and the telephone number grouping identification model based on sequence labeling under different website formats.
In the link of cross validation and super parameter optimization, the system adopts a scientific and strict method to evaluate and optimize the model performance, so that the system has good generalization capability under various website formats.
Firstly, the system implements K-fold cross-validation, usually with a 10-fold scheme: the training data set is divided into 10 subsets of similar size, 9 subsets are used to train the model each time while the remaining subset is used for validation, this is rotated 10 times, and finally the average performance is taken as the evaluation index.
The method can comprehensively evaluate the performance of the model under different data distribution, and avoid the deviation possibly caused by a single test set.
Based on the cross-validation results, the system performs comprehensive super-parameter optimization, and the adjusted parameters comprise network architecture parameters (such as the hidden layer size, the layer number, the attention head number and the like of the LSTM), optimizer parameters (such as the learning rate, the momentum, the weight attenuation and the like), regularization parameters (such as the dropout rate, the L1/L2 regularization strength and the like), training strategy parameters (such as the batch size, the learning rate scheduling strategy, the early-stop condition and the like).
Compared with the traditional grid search or random search, the system adopts a Bayesian optimization method to perform efficient parameter search, and the method can more intelligently explore a parameter space and quickly find out a near-optimal parameter combination.
During the optimization process, the system evaluates model performance separately for different types of web site formats, focusing on performance under special formats (e.g., highly dynamic pages, pages of non-traditional layout) in particular, ensuring that the model does not significantly degrade on certain specific types.
The system also adopts multi-metric comprehensive evaluation, considering accuracy, recall, F1 score, and task-specific custom metrics such as field boundary recognition accuracy and the consistency of telephone number function classification, to ensure balanced model performance in all respects.
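A compact sketch of such a multi-metric report, combining token-level metrics from scikit-learn with a span-level boundary metric computed from BIO tags (the tag scheme and metric names are assumptions):

```python
from sklearn.metrics import precision_recall_fscore_support

def bio_spans(tags):
    """Collect (start, end, field_type) spans from a BIO tag sequence."""
    spans, start = [], None
    padded = list(tags) + ["O"]
    for i, tag in enumerate(padded):
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, padded[start][2:]))
            start = None
        if tag.startswith("B-"):
            start = i
    return set(spans)

def evaluate_fold(y_true, y_pred):
    """Token-level macro metrics plus a span-level boundary metric for field extraction."""
    p, r, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="macro", zero_division=0)
    true_spans, pred_spans = bio_spans(y_true), bio_spans(y_pred)
    boundary_acc = len(true_spans & pred_spans) / max(len(true_spans), 1)
    return {"precision": p, "recall": r, "f1": f1, "boundary_accuracy": boundary_acc}

print(evaluate_fold(["B-name", "I-name", "O", "B-addr"], ["B-name", "I-name", "O", "O"]))
```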
In addition, the system implements an error analysis mechanism that records and categorizes error cases on the validation set in detail, identifies the model's weak points, and adjusts the architecture or parameters accordingly. Through this systematic cross-validation and hyperparameter optimization flow, the system obtains a field extraction network and a telephone number grouping recognition model with stable performance and strong generalization capability; these models maintain a high level of performance across various website formats and data distributions, providing reliable technical support for practical application.
S5.7, according to the field extraction neural network model, performing automatic analysis on the pages of each merchant group, and identifying and extracting key field information.
In the field extraction neural network application stage, the system deploys the optimized model to the production environment and performs automatic analysis and information extraction on the pages of each merchant group.
First, the system preprocesses the web page to be processed, including HTML parsing, text extraction, and feature computation, converting it into model input consistent with the training data format. Then, the system activates the field extraction neural network, feeds the preprocessed page data into the model, and the network automatically identifies and locates the target fields through multi-layer computation.
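A minimal preprocessing sketch along these lines, assuming BeautifulSoup for HTML parsing and a crude whitespace tokenization in place of the system's actual feature pipeline:

```python
from bs4 import BeautifulSoup  # assumed HTML parsing dependency

def preprocess_page(html: str, max_tokens: int = 512) -> dict:
    """Parse HTML, pull visible text, and produce a simple tokenized model input."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):  # drop non-visible content
        tag.decompose()
    text = soup.get_text(separator=" ", strip=True)
    return {
        "title": soup.title.get_text(strip=True) if soup.title else "",
        "tokens": text.split()[:max_tokens],           # crude tokenization for the sketch
    }

page = preprocess_page("<html><title>ABC Co.</title><body>Tel: 010-12345678</body></html>")
print(page["title"], page["tokens"])
```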
In the recognition process, the model computes probability scores over the various field types, such as "merchant name", "address", "business scope", and "time of establishment", for each token in the text, and then determines the optimal field boundaries and type assignments by combining these scores with the CRF layer via a forward-backward algorithm.
This process fully leverages the pre-trained language model's understanding of context and the sequence labeling model's modeling of tag dependencies, and can accurately identify field information with complex formats and expression styles.
For each field identified, the system also calculates a confidence score reflecting how certain the model is about the extracted result. The system pays particular attention to low-confidence extraction results and may trigger backup processes, such as secondary verification with a rules engine or flagging the result for manual review.
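A minimal sketch of this confidence-based routing, assuming per-token class probabilities are available for each extracted field; the threshold and routing targets are illustrative:

```python
import numpy as np

REVIEW_THRESHOLD = 0.80   # illustrative cut-off for routing to backup processing

def field_confidence(token_probs: np.ndarray) -> float:
    """Confidence of one extracted field = mean winning-class probability over its tokens."""
    return float(token_probs.max(axis=-1).mean())

def route_extraction(field_name: str, token_probs: np.ndarray) -> str:
    conf = field_confidence(token_probs)
    if conf >= REVIEW_THRESHOLD:
        return f"{field_name}: accept (confidence {conf:.2f})"
    return f"{field_name}: rules-engine check / manual review (confidence {conf:.2f})"

probs = np.array([[0.9, 0.05, 0.05], [0.6, 0.3, 0.1]])  # 2 tokens x 3 field classes
print(route_extraction("merchant name", probs))
```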
In addition, the system implements an adaptive processing mechanism that automatically adjusts the processing strategy for different types of web pages, such as increasing the feature extraction depth for structurally complex pages and parsing highly dynamic pages only after rendering.
The system also records intermediate states and attention weight distributions during processing, which facilitates subsequent result interpretation and model improvement. Through this intelligent field extraction process, the system can accurately identify and extract key merchant information, such as merchant name, detailed address, business scope, establishment date, and registered capital, from mixed web page content; this structured field information provides high-quality basic data for subsequent merchant portrait construction and data applications.
S5.8, according to the telephone number grouping recognition model, automatically grouping and performing function recognition on multiple telephone numbers of the same merchant, and identifying and extracting the telephone number data.
In the application stage of the telephone number grouping recognition model, the system uses the trained BiLSTM-CRF model to automatically recognize and group multiple telephone numbers belonging to the same merchant.
First, the system collects all extracted phone numbers under the same merchant as input, together with their context information (such as the descriptive text around each number and the title and URL of the page where it appears). The system then preprocesses the input data, including normalizing phone number formats, segmenting and tokenizing the context text, and computing positional features, and converts it into a sequence input form the model can accept. Next, the system activates the telephone number grouping recognition model: the model converts the input sequence into vector representations through an embedding layer, extracts sequence features through a BiLSTM layer, and finally outputs the most probable function label for each telephone number through a CRF layer.
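A small sketch of the phone-number normalization step, assuming mainland-China conventions (the country-prefix handling and the treatment of extensions are simplifications):

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip separators and common noise so that equivalent numbers compare equal."""
    digits = re.sub(r"[^\d+]", "", raw)      # keep digits and a possible leading '+'
    digits = re.sub(r"^\+?86", "", digits)   # assumption: drop a China country prefix
    return digits                            # note: any extension digits are merged here

samples = ["(010) 1234-5678", "+86 138 0013 8000", "0755-8765 4321"]
print([normalize_phone(s) for s in samples])
```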
This process considers not only the characteristics and direct context of each number itself, but also the interrelationship between numbers and the rationality of the overall tag sequence.
The system attaches the function labels identified by the model (such as switchboard, customer service, sales, technical support, complaints, or reservations) to the corresponding telephone numbers and initially groups the numbers according to functional similarity. In addition, the system analyzes each number's regional characteristics (such as the region its area code belongs to) and usage scenario (such as dedicated lines or 24-hour service) to further refine the grouping information.
For complex cases, such as a number that serves multiple functions at the same time, the system calculates a probability distribution over multiple labels and selects the most dominant function as the primary label, keeping the other plausible functions as secondary labels.
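A minimal sketch of this primary/secondary split over the model's label probabilities; the probability floor is an illustrative assumption:

```python
def split_primary_secondary(label_probs: dict, secondary_floor: float = 0.25):
    """Pick the dominant function as the primary label; keep other plausible ones as secondary."""
    ranked = sorted(label_probs.items(), key=lambda kv: kv[1], reverse=True)
    primary = ranked[0][0]
    secondary = [label for label, p in ranked[1:] if p >= secondary_floor]
    return primary, secondary

probs = {"switchboard": 0.48, "customer service": 0.37, "sales": 0.10, "complaints": 0.05}
print(split_primary_secondary(probs))   # ('switchboard', ['customer service'])
```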
The system also records confidence indicators for the grouping recognition, and for low-confidence results it may initiate a manual review process.
Through this intelligent telephone number function recognition and grouping, the system converts originally isolated telephone number data into address book information with clear functional attributes and organizational structure, greatly improving the practicality and commercial value of the data. Users can quickly find the contact channel suited to a specific purpose, for example contacting the sales department directly when product information is needed, or technical support when a technical problem arises, which significantly improves the efficiency of information lookup and use.
S5.9, carrying out association analysis on the key field information and the telephone number data, and confirming the subject merchant name of the data group, to obtain the structured data set.
In the association analysis and subject merchant confirmation stage, the system performs deep association analysis on the extracted field information and telephone number data to confirm the subject merchant name and core information of the data group.
First, the system statistically analyzes the merchant name variants appearing in each merchant group (such as "ABC company", "ABC group", and "ABC limited"), and calculates each variant's occurrence frequency, positional importance (for example, appearing in the title or another salient position), and association strength (co-occurrence with other key fields).
Based on these statistics, the system uses a weighted voting mechanism to determine the most likely subject merchant name that will be identified as the primary key for the entire dataset.
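A minimal sketch of such a weighted vote over name variants; the weights and the way the three statistics are normalized to 0-1 are assumptions:

```python
from collections import defaultdict

# Illustrative weights for the three statistics described above
WEIGHTS = {"frequency": 0.4, "position": 0.35, "association": 0.25}

def pick_subject_name(variants):
    """variants: dicts with name, frequency, position_score, association_score (each 0-1)."""
    scores = defaultdict(float)
    for v in variants:
        scores[v["name"]] += (WEIGHTS["frequency"] * v["frequency"]
                              + WEIGHTS["position"] * v["position_score"]
                              + WEIGHTS["association"] * v["association_score"])
    return max(scores, key=scores.get)

variants = [
    {"name": "ABC Co., Ltd.", "frequency": 0.7, "position_score": 0.9, "association_score": 0.8},
    {"name": "ABC Group",     "frequency": 0.5, "position_score": 0.4, "association_score": 0.6},
]
print(pick_subject_name(variants))
```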
In addition, the system analyzes the association pattern between the merchant name and the telephone number, and identifies the most core contact way and the corresponding business entity. Under complex conditions, the system also analyzes the hierarchical structure and content organization of the page, distinguishes the relationship between the main business and the subsidiary business, and between the parent company and the subsidiary company, and the like, and ensures the accuracy and hierarchy of the data packet. The system focuses on multiple relationships that may exist, such as different brands or business lines under the same group, by constructing entity relationship maps to represent these complex associations.
Meanwhile, the system analyzes consistency and complementarity between fields, such as the regional consistency of addresses and telephone area codes, or the correspondence between business scope and specific business departments, and uses this cross-verification to improve the reliability of the data as a whole. For conflicting or inconsistent information, the system reconciles intelligently based on confidence scores and business importance, preferentially retaining the more reliable and more central information.
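One such consistency check, sketched with a hypothetical area-code-to-region table (a production system would draw on a full knowledge base):

```python
AREA_CODE_REGION = {"010": "Beijing", "021": "Shanghai", "0755": "Shenzhen"}  # hypothetical table

def address_matches_area_code(address: str, phone: str) -> bool:
    """Cross-check: does the region implied by the area code appear in the address text?"""
    normalized = phone.replace("-", "").replace(" ", "")
    for code, region in sorted(AREA_CODE_REGION.items(), key=lambda kv: len(kv[0]), reverse=True):
        if normalized.startswith(code):
            return region.lower() in address.lower()
    return True  # unknown area code: no evidence of a conflict

print(address_matches_area_code("No. 1 Chang'an Ave, Beijing", "010-12345678"))  # True
print(address_matches_area_code("Nanshan District, Shenzhen", "021-87654321"))   # False
```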
Finally, the system organizes all the information that has undergone association analysis and integration into a structured data set centered on the subject merchant, clearly presenting key data such as the merchant's basic information, multi-level contact information, and service scope while preserving the association relationships and hierarchical structure among the information. The structured data set produced by this deep association analysis and subject confirmation not only solves the problems of data dispersion and identity confusion, but also provides a rich merchant portrait and relationship network, greatly improving the commercial application value and user experience of the data.
The embodiment of the application also provides a cyclic automated data acquisition system, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the above method is implemented when the processor executes the computer program.
The embodiment of the application also provides a computer device, which comprises at least one processor and a memory communicatively connected with the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above cyclic automated data acquisition method.
The embodiment of the application also provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above cyclic automated data acquisition method. The embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the above cyclic automated data acquisition method.
The application has the following technical effects:
Through the knowledge-graph-based distributed entry discovery mechanism and dynamic priority queue management, more efficient resource allocation and acquisition scheduling are realized, and acquisition efficiency is improved. The link discovery algorithm based on DOM structural features and semantic association improves link value evaluation accuracy and reduces the invalid acquisition rate. The telephone number recognition mechanism that combines multiple algorithms greatly improves the accuracy and adaptability of telephone number extraction. The self-attention diffusion model is used for time series interpolation, effectively solving the data continuity problem in interrupt recovery and incremental update scenarios. The AI-based data extraction models automatically adapt to different website structures, improving the accuracy and coverage of data extraction. The application of multidimensional data fingerprinting and differential privacy technology ensures the uniqueness and compliance of the data and improves data quality.
Through the above technical scheme, the application can achieve more efficient and more accurate data acquisition against the background of explosive growth of internet information and increasingly complex website structures, and meets the needs of application scenarios such as business intelligence and risk monitoring that place high demands on data quality and timeliness. Meanwhile, the application of multidimensional data fingerprinting and differential privacy technology ensures the uniqueness and compliance of the data, providing important guarantees for data security and privacy protection.

Claims (10)

1. A cyclic automated data acquisition method, characterized by comprising:
constructing a multi-source data entry set according to a knowledge-graph-based distributed entry discovery mechanism, and forming an entry URL set by using dynamic priority queue management, a timeliness evaluation model, and a semi-parametric batch global decision mechanism;
forming a high-value link queue according to the entry URL set, based on DOM structural feature analysis and semantic relevance evaluation, and on TF-IDF and Word2Vec link value scoring;
obtaining page content according to the high-value link queue, extracting digit sequences by using regular expressions, and performing matching judgment by using a telephone number pattern knowledge base, to form a page queue containing valid telephone numbers;
performing time series interpolation according to the page queue by using a self-attention diffusion model, and associating merchant information by means of a domain name extraction algorithm and a domain-name-to-merchant mapping knowledge base, to form a merchant data set;
extracting field information and recognizing telephone number groups according to the merchant data set by using a supervised-learning-based field extraction neural network model and a sequence-labeling-based telephone number grouping recognition model, to generate a structured data set;
performing multidimensional data fingerprint generation according to the structured data set for data deduplication, and applying differential privacy technology for data protection, to form a deduplicated and desensitized high-quality data set.

2. The method according to claim 1, characterized in that constructing a multi-source data entry set according to the knowledge-graph-based distributed entry discovery mechanism, and forming an entry URL set by using dynamic priority queue management and a timeliness evaluation model, comprises:
obtaining an initial URL set according to search engine APIs, social media APIs, and industry directory databases;
constructing a knowledge graph according to the initial URL set in combination with domain knowledge, identifying high-value entry nodes through graph analysis, and generating a preliminarily screened URL subset;
designing a dynamic priority queue according to the URL subset, calculating a priority score based on the PageRank value, content update frequency, and historical collection results of each URL, and generating a priority-sorted URL queue;
performing timeliness evaluation according to the URL queue, and dynamically adjusting access frequency and priority to form the entry URL set.

3. The method according to claim 1, characterized in that forming a high-value link queue according to the entry URL set, based on DOM structural feature analysis and semantic relevance evaluation, and on TF-IDF and Word2Vec link value scoring, comprises:
obtaining web page content according to the entry URL set, and performing HTML parsing and cleaning to obtain a standardized DOM tree structure;
extracting links according to the DOM tree structure by using regular expressions, XPath positioning, and CSS selectors, to form an initial link set;
analyzing the relationship between link text and context according to the initial link set by using the TF-IDF algorithm, and calculating semantic similarity in combination with the Word2Vec model, to obtain a link value scoring result;
setting a dynamic threshold according to the link value scoring result to filter advertisements and low-value links, and removing automated data collection traps by using heuristic algorithms, to form the high-value link queue.

4. The method according to claim 1, characterized in that obtaining page content according to the high-value link queue, extracting digit sequences by using regular expressions, and performing matching judgment by means of a telephone number pattern knowledge base, to form a page queue containing valid telephone numbers, comprises:
obtaining page content according to the high-value link queue, and extracting visible text content by DOM parsing;
extracting candidate digit sequences from the visible text content by using regular expressions, and recording the context information of each candidate digit sequence in the original text;
performing multiple combination and formatting processes on the candidate digit sequences by using the telephone number pattern knowledge base, to generate a preliminarily recognized telephone number list;
performing verification and function type recognition according to the telephone number list in combination with the context information, to form the page queue containing valid telephone numbers.

5. The method according to claim 1, characterized in that associating merchant information according to the page queue by using a domain name extraction algorithm and a domain-name-to-merchant mapping knowledge base, to form a merchant data set, comprises:
extracting second-level domain names and top-level domain names according to the page queue, and grouping them by using a domain name clustering algorithm, to generate a domain name grouping result;
establishing a domain-name-to-merchant mapping table according to commercial databases and historical collection data;
adding basic merchant information to each URL according to the domain name grouping result and the domain-name-to-merchant mapping table, to generate a URL data set associated with merchants;
segmenting page titles and content summaries according to the URL data set by using a text clustering algorithm, to form the merchant data set.

6. The method according to claim 1, characterized in that the supervised-learning-based field extraction neural network model and the sequence-labeling-based telephone number grouping recognition model are obtained by:
collecting historical web page data containing various website formats and field types;
establishing a training data set through expert annotation, in which key fields including merchant name, address, and business scope, as well as the functional classification of telephone numbers, are annotated;
performing feature engineering on the training data set to extract DOM structural features, text semantic features, and positional relationship features, and constructing a training model based on the DOM structural features, text semantic features, and positional relationship features;
constructing a field extraction neural network model based on a pre-trained language model by using transfer learning, and fine-tuning it for extracting structured information from web pages;
designing a sequence labeling network structure and training a telephone number grouping recognition model with a BiLSTM-CRF architecture;
optimizing the hyperparameters of the field extraction network and the telephone number grouping recognition model through cross-validation, to improve the generalization capability of the field extraction neural network model and the sequence-labeling-based telephone number grouping recognition model under different website formats;
and that extracting field information and recognizing telephone number groups according to the merchant data set by using the supervised-learning-based field extraction neural network model and the sequence-labeling-based telephone number grouping recognition model, to generate a structured data set, comprises:
automatically analyzing the pages of each merchant group according to the field extraction neural network model, and identifying and extracting key field information;
automatically grouping and performing function recognition on multiple telephone numbers of the same merchant according to the telephone number grouping recognition model, and identifying and extracting telephone number data;
performing association analysis on the key field information and the telephone number data, and confirming the subject merchant name of the data group, to obtain the structured data set.

7. The method according to claim 1, characterized in that performing time series interpolation by the self-attention diffusion model comprises:
collecting raw data generated by automated data collection activities, and forming a standardized time series data set through normalization and feature extraction;
performing forward diffusion and reverse diffusion processes on the standardized time series data set by using a conditional diffusion model, to generate time series interpolation data;
performing data analysis on the time series interpolation data by using a diversity sampling algorithm, to obtain a prediction uncertainty index;
executing an adaptive adjustment strategy according to the prediction uncertainty index to update acquisition frequency parameters, to form data with better temporal continuity and completeness.

8. The method according to claim 2, characterized in that the step of designing a dynamic priority queue according to the URL subset comprises executing a semi-parametric batch global decision mechanism with covariates, specifically:
collecting feature data of target websites, and generating, through feature engineering, feature vectors containing domain name age, update frequency, and content richness;
training a hybrid parametric and non-parametric model according to the feature vectors, to form a semi-parametric reward prediction model;
executing a Thompson sampling algorithm according to the semi-parametric reward prediction model to calculate the expected data benefit value of each URL target, and generating a priority decision scheme for batch URL acquisition;
dynamically adjusting the weight parameters of the URL exploration and exploitation strategies according to the priority decision scheme for batch URL acquisition, calculating the priority score of each URL, and updating the priority-sorted URL queue, to form an optimized URL access order and frequency.

9. The method according to claim 8, characterized in that executing the Thompson sampling algorithm according to the semi-parametric reward prediction model to calculate the expected data benefit value of each URL target and generating the priority decision scheme for batch URL acquisition comprises:
obtaining expected reward data of the URL target set, and calculating an optimal combination through a combinatorial optimization algorithm, to form an initial batch scheme;
performing diversity constraint calculation on the initial batch scheme by using a submodular function maximization framework, to generate diversified batch combinations;
coordinating multi-node task allocation according to the diversified batch combinations by using a federated learning framework, to form a distributed decision scheme;
executing a multi-objective optimization algorithm according to the distributed decision scheme to integrate compliance constraints, and generating a final URL batch acquisition execution plan for updating the priority-sorted URL queue.

10. A cyclic automated data acquisition system, characterized by comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 9 when executing the computer program.

Priority Applications (1)

Application Number: CN202510434655.8A · Priority Date: 2025-04-08 · Filing Date: 2025-04-08 · Title: Circular automated data acquisition method and system

Applications Claiming Priority (1)

Application Number: CN202510434655.8A · Priority Date: 2025-04-08 · Filing Date: 2025-04-08 · Title: Circular automated data acquisition method and system

Publications (1)

Publication Number: CN120407898A (en) · Publication Date: 2025-08-01

Family

ID=96521962

Family Applications (1)

Application Number: CN202510434655.8A · Status: Pending · Publication: CN120407898A (en) · Priority/Filing Date: 2025-04-08 · Title: Circular automated data acquisition method and system

Country Status (1)

Country: CN (1) · Link: CN120407898A (en)


Legal Events

Code: PB01 · Title: Publication
Code: SE01 · Title: Entry into force of request for substantive examination
