Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present disclosure more apparent, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. The components of the embodiments of the present disclosure, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present disclosure provided in the accompanying drawings is not intended to limit the scope of the disclosure, as claimed, but is merely representative of selected embodiments of the disclosure. All other embodiments, which can be made by those skilled in the art based on the embodiments of this disclosure without making any inventive effort, are intended to be within the scope of this disclosure.
It should be noted that like reference numerals and letters refer to like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
The term "and/or" is used herein to describe only one relationship, and means that three relationships may exist, for example, A and/or B, and that three cases exist, A alone, A and B together, and B alone. In addition, the term "at least one" herein means any one of a plurality or any combination of at least two of a plurality, for example, including at least one of A, B, C, may mean including any one or more elements selected from the group consisting of A, B and C.
As shown in fig. 1, an embodiment of the present application provides a method for cyclic automated data collection, including:
 S1, constructing a multi-source data entry set according to a distributed entry discovery mechanism of a knowledge graph, and forming an entry URL set by using dynamic priority queue management, a timeliness evaluation model and a semi-parametric batch-processing global decision mechanism.
In actual application, the system first obtains an initial set of URLs through a search engine API, a social media API, and an industry catalog database.
In the embodiment of the application, permission or consent for the data acquisition path is obtained before automated data collection; for example, a browser is used to access the target page during collection, and after the target page is captured, page information is acquired via OCR or via the browser's view-source function.
For example, when telephone number data of the catering industry needs to be collected, the system can search for the contact information of XX restaurants through a search engine API, obtain merchant information under catering-related topics through a microblog API, and extract restaurant list page URLs from an industry catalog.
Based on the initial URLs, the system builds a knowledge graph, taking restaurant names, areas, cuisines and other information as nodes and building association relationships among them. Through graph analysis, the system identifies aggregated pages containing the contact information of multiple restaurants, which typically have higher data value. The system then uses a semi-parametric batch-processing global decision mechanism to perform optimized resource allocation.
Specifically, for each restaurant website, the system analyzes the domain name age (e.g., a well-known review platform's domain typically has a longer history), update frequency (e.g., a food blog may update irregularly), and content richness (e.g., an official website may be more comprehensive), and constructs a feature vector.
Based on these features, the system trains a semi-parametric reward prediction model and calculates the expected data revenue of each URL using Thompson sampling. Finally, the system generates an optimized URL batch acquisition plan, such as preferentially accessing frequently updated restaurant list pages on a review platform and reducing the access frequency to slowly updated official websites, to form the entry URL set.
S2, according to the entry URL set, a high-value link queue is formed based on DOM structural feature analysis, semantic association evaluation, and link value scoring using TF-IDF and Word2Vec.
Taking restaurant data collection as an example, the system starts from an entry URL obtained in S1 (such as the XX restaurant page on a review platform), and first obtains the web content and performs preprocessing. The system parses the HTML structure, removes irrelevant elements such as advertisements and navigation, and builds a standardized DOM tree. Next, the system uses multiple technical strategies to extract links in parallel: matching all http(s) links with regular expressions, locating the links of all restaurant detail pages with XPath (e.g., //div[@class="shop-list"]/a), and identifying clickable elements with CSS selectors (e.g., .shop-title a).
For content that is dynamically loaded using JavaScript (e.g., more restaurants loaded as the page is scrolled), the system can also capture these dynamically generated links. The system then analyzes the value of these links using the TF-IDF algorithm and the Word2Vec model.
For example, when the link text contains keywords such as "contact us" or "reservation phone", or when the text surrounding the link is highly related to contact information, the system gives it a higher score. The system also considers the location of the link in the DOM tree; links in the page body content are typically more valuable than links in the navigation bar or footer.
Finally, the system sets dynamic threshold filtering advertisements and low-value links, such as automatically eliminating obvious advertisement links (such as URLs containing "ad", "sponsor" words) and typical automatic data acquisition traps (such as infinite links caused by calendar page turning), and finally forming a high-value link queue, wherein the links mainly point to restaurant detail pages, contact pages and other high-value pages containing telephone numbers.
S3, acquiring page content according to the high-value link queue, extracting digit sequences using regular expressions, and performing match judgment against a telephone number pattern knowledge base to form a page queue containing valid telephone numbers.
When the system accesses a high-value link generated in S2 (e.g., the detail page of a restaurant), the visible text content of the page is first extracted by DOM parsing. The system handles various encoding formats and recognizes hidden text, ensuring full coverage of the areas that may contain telephone numbers. Next, the system extracts all potential digit sequences using regular expressions, including consecutive digits (e.g., "13812345678"), numbers with separators (e.g., "010-87654321"), and special formats (e.g., "+86 (10) 87654321").
For each extracted number sequence, the system records its contextual information, such as prefix text of "order phone:", "contact us:", etc. The system then verifies the candidate digits using a global phone number format rules repository.
For example, Chinese mobile phone numbers are typically 11 digits and begin with 1, and landline numbers in XX are typically the area code 010 plus 8 digits. The system is capable of handling various format variations, such as bracketed numbers, international country codes, or telephone numbers with separators.
Finally, the system combines the context information to perform function identification, such as identifying a number whose prefix contains "order" as the restaurant's order line and a number near "complaint" as a customer service line, thereby forming a page queue containing each restaurant's valid telephone numbers and their functional attributes.
S4, performing time-series interpolation with a diffusion model according to the page queue, and associating merchant information by means of a domain name extraction algorithm and a domain-name-to-merchant mapping knowledge base, to form a merchant data set.
In restaurant data acquisition, the system designs a reasonable resource allocation strategy, such as limiting requests to a single restaurant website to at most 1000 per hour and the total crawl depth to no more than 5 levels.
By mixing breadth-first and depth-first strategies, the system orders requests so that the basic information pages of many restaurants are acquired broadly first, and then the detailed contact information of each restaurant is acquired in depth. The system continuously monitors the acquisition progress and success rate; when the anti-crawling mechanism of a catering platform is found to reduce the success rate, the request frequency for that platform is dynamically lowered.
Meanwhile, the system records the processing state and supports resumable collection; for example, after a network interruption, the system can continue collecting from the restaurant list where it last stopped. For collected data, the system extracts domain names and groups them, such as grouping all URLs from "dianping.com" into one group and those from "meituan.com" into another.
The system also uses historical data to build a domain-name-to-merchant mapping, such as identifying that both "xms.dianping.com" and "www.xiaomaoshu.com" belong to "Kitten Potato restaurant". In addition, the system uses a diffusion model for time-series interpolation, especially for restaurant data that is updated regularly (e.g., new menus released on a weekly schedule). The system records the historical update patterns and can intelligently predict likely data changes during network outages. Through multiple sampling analyses, the system increases the actual acquisition frequency for data points with high uncertainty (such as holiday special menus) and relies more on model predictions for highly predictable data (such as fixed telephone numbers), thereby optimizing resource allocation.
S5, according to the merchant data set, extracting field information with a supervised-learning-based field extraction neural network model and identifying telephone number groups with a sequence-labeling-based telephone number group identification model, to generate a structured data set.
In order to train a high-performance data extraction model, the system first collects a large amount of historical webpage data covering various dining website formats, such as pages with different layouts from review platforms, group-buying platforms, and restaurant directory sites. An expert team annotates the data, marking the positions of key fields such as merchant name, address, and business scope, as well as the functional attribute of each telephone number (such as order line, delivery line, etc.).
The system then performs feature engineering processing on the training data to extract DOM structural features (e.g., HTML tag type, depth where the field is located), text semantic features (e.g., text content, context), and positional relationship features (e.g., relative position in the page). Based on these features, the system builds a training model.
Through transfer learning, the system builds a field extraction neural network based on a pre-trained BERT language model and fine-tunes it for the specific terms and structures of the catering industry.
Meanwhile, the system adopts BiLSTM-CRF architecture to train a telephone number grouping identification model, and the model can accurately identify the function types of a plurality of telephone numbers of the same restaurant.
After model training is completed, the system uses 10-fold cross-validation to tune the model hyperparameters, ensuring good generalization across different dining website formats. By applying these models, the system can accurately extract normalized merchant information from cluttered webpage content; for example, from text such as "Kitten Potato (the Korean shop)  Business hours: 10:00-22:00  Order line: 010-12345678", it accurately identifies the merchant name "Kitten Potato (the Korean shop)" and the telephone number "010-12345678", whose functional attribute is "order line".
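The field extraction step described above can be illustrated with a minimal Python sketch using the Hugging Face transformers library as one possible implementation; the model name, BIO label set and example text are illustrative assumptions rather than the patent's actual configuration, and the model would of course be fine-tuned on the expert-annotated pages before real use.

import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-NAME", "I-NAME", "B-PHONE", "I-PHONE"]   # assumed BIO tags for merchant name / phone
tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese",
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
)
# In practice the classifier head would be fine-tuned on annotated restaurant pages first.
text = "Kitten Potato (the Korean shop) Order line: 010-12345678"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred_ids = model(**inputs).logits.argmax(dim=-1)[0]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for tok, pid in zip(tokens, pred_ids.tolist()):
    print(tok, labels[pid])   # untrained head yields arbitrary tags; shown only to illustrate the interface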
And S6, performing multidimensional data fingerprint generation to perform data deduplication according to the structured data set, and performing data protection by adopting a differential privacy technology to form a high-quality data set after deduplication and desensitization.
When a large amount of catering data are processed, the system adopts a multidimensional data fingerprint algorithm to perform deduplication. The algorithm not only considers restaurant names and telephone numbers, but also combines address, business class and other information to generate unique data fingerprints.
For example, "kitten's potato (kohlrabi)" and "kitten's potato kohlrabi" are slightly different in name, but the system can recognize that they are the same restaurant through the similarity of telephone numbers and addresses.
The system adopts a multi-stage duplicate removal strategy, namely, firstly, exact matching is carried out to remove completely repeated data (such as completely identical restaurant records), then similarity calculation is used to identify approximate duplicates (such as records with slightly different names but identical phones and addresses), and finally, the most complete information is reserved through intelligent merging.
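The multi-stage deduplication described above can be illustrated with a small Python sketch; the field names, similarity threshold and merging rule are illustrative assumptions, not the patent's exact algorithm.

import hashlib
import re
from difflib import SequenceMatcher

def normalize(s):
    return re.sub(r"[\s()（）\-]", "", s or "").lower()

def fingerprint(record):
    # multidimensional fingerprint combining name, phone and address
    key = "|".join(normalize(record.get(f)) for f in ("name", "phone", "address"))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

def deduplicate(records, threshold=0.9):
    seen, kept = set(), []
    for rec in records:
        fp = fingerprint(rec)
        if fp in seen:                      # stage 1: exact duplicates
            continue
        dup = next((k for k in kept
                    if k["phone"] == rec["phone"]
                    and SequenceMatcher(None, normalize(k["name"]),
                                        normalize(rec["name"])).ratio() > threshold), None)
        if dup:                             # stage 2: near-duplicates; merge, preferring the fuller record
            if len(rec.get("address", "")) > len(dup.get("address", "")):
                dup.update({k: v for k, v in rec.items() if v})
        else:
            seen.add(fp)
            kept.append(rec)
    return kept

records = [
    {"name": "Kitten Potato (Kohlrabi)", "phone": "010-12345678", "address": "12 Example Road, Chaoyang"},
    {"name": "Kitten Potato Kohlrabi",   "phone": "010-12345678", "address": "12 Example Rd"},
    {"name": "Kitten Potato (Kohlrabi)", "phone": "010-12345678", "address": "12 Example Road, Chaoyang"},
]
print(deduplicate(records))               # one merged record remains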
For example, if one record contains a detailed address but an incomplete phone number, while another record has a complete phone number but an abbreviated address, the system merges them into one record containing the complete phone number and the detailed address.
In terms of privacy protection, the system automatically recognizes personal privacy information, such as a restaurant manager's private cell phone number, identification card number, etc., and processes these sensitive fields.
For example, personal phone numbers are partially masked (e.g., "138****5678"), and detailed addresses are blurred (retained only to the street level).
The system also adopts differential privacy technology, adding an appropriate amount of noise during data analysis so that individual information is not revealed even when aggregate statistics are released. Through these technologies, the system finally forms a deduplicated and desensitized high-quality data set, which ensures both the uniqueness and integrity of the data and its compliance and privacy protection.
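The desensitization step can be sketched in Python as follows, combining partial masking with a Laplace-mechanism noisy count; the mask pattern and epsilon value are illustrative assumptions.

import re
import numpy as np

def mask_phone(phone):
    # keep the first 3 and last 4 digits of a personal mobile number, mask the middle
    digits = re.sub(r"\D", "", phone)
    return digits[:3] + "*" * (len(digits) - 7) + digits[-4:] if len(digits) >= 8 else "*" * len(digits)

def noisy_count(true_count, epsilon=1.0):
    # Laplace mechanism for a counting query (sensitivity 1): noise scale = 1 / epsilon
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

print(mask_phone("13812345678"))        # -> 138****5678
print(noisy_count(42, epsilon=0.5))     # noisy number of restaurants in a district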
As shown in fig. 2, S1 specifically includes:
 S1.1, acquiring an initial URL set according to a search engine API, a social media API and an industry catalog database.
In the multi-source data entry acquisition process, the system fully utilizes various data sources to collect and screen initial URLs.
Taking business telephone number collection as an example, the system first obtains a URL list of relevant webpages through search engine APIs (application programming interfaces) such as Baidu and Google, using keyword combinations (such as "contact information of a certain e-commerce company", "telephone of XX catering enterprise", and the like).
Meanwhile, the system accesses social media APIs such as Weibo and LinkedIn, and captures contact information and related links published by official enterprise accounts. In addition, the system connects to industry directory databases such as yellow-pages sites and business information platforms (e.g., Qichacha and Tianyancha), and directly extracts the enterprise website addresses and contact page links contained in them.
These operations produce a large number of original URLs, constituting a broad, but unfiltered, initial set of URLs. By means of the multi-source data fusion mode, the system can extend the coverage range of data acquisition to the maximum extent, and the limitation caused by dependence on a single data source is avoided. Especially for small enterprises or newly established enterprises which are difficult to find through conventional channels, the system can acquire information of the small enterprises or newly established enterprises in time through the social media API and the latest index of a search engine, so that the comprehensiveness and timeliness of data are ensured.
S1.2, constructing a knowledge graph according to the initial URL set and combining domain knowledge, and identifying high-value entry nodes through graph analysis to generate a primarily screened URL subset.
In this link, the system combines the initial URL set acquired in S1.1 with the pre-established domain knowledge to construct a simple knowledge graph.
Taking catering industry as an example, the system firstly extracts basic information such as domain names, path structures, page titles and the like from the initial URL, and the basic information is used as a basic node of the map. The system then associates these nodes with concepts in domain knowledge (e.g., restaurant classification, geographic location, cuisine type, etc.).
For example, a node in a URL that contains "huoguo" or a title that contains "hot pot" would be conceptually associated with "hot pot restaurant". Meanwhile, the system also establishes the association relation between nodes, such as linking different restaurant websites under the same company or clustering restaurants in the same area.
Based on the knowledge graph, the system can perform deep analysis to identify the most valuable entry nodes. For example, the system may prefer an aggregated page containing multiple pieces of restaurant information (e.g., recommended posts for a food forum), an official contact page, and a detail page with rich user reviews, as these pages typically contain more valid phone number information. Through the graph analysis, the system generates a URL subset subjected to preliminary screening and value evaluation, and provides a basis for subsequent prioritization.
S1.3, designing a dynamic priority queue according to the URL subset, calculating a priority score based on the PageRank value, the content update frequency and the historical acquisition result of each URL, and generating a URL queue with ordered priorities.
In the dynamic priority queue management link, the system carries out further priority evaluation and sequencing on the URL subsets screened in the step S1.2.
The system designs a dynamic priority queue, and assigns a weight value to each URL by comprehensively considering various factors.
First, the system calculates the PageRank value of each URL to evaluate its importance and authority across the network; URLs with higher PageRank values, such as the official websites of well-known enterprises, are generally given higher initial weights.
And secondly, the system analyzes the content updating frequency of the website corresponding to the URL, and gives higher timeliness weight to the frequently updated website (such as a catering website for updating the promotion information every day) so as to ensure that the latest information can be captured in time.
In addition, the system refers to the history collection record to give higher experience weight to websites that have provided high quality phone number data in the past. Based on the comprehensive calculation of these factors, the system generates a comprehensive priority score for each URL and creates a prioritized URL queue accordingly.
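A minimal Python sketch of such a priority queue follows; the factor weights and example URLs are illustrative assumptions, and each factor is assumed to be pre-normalized to [0, 1].

import heapq

WEIGHTS = {"pagerank": 0.4, "freshness": 0.35, "history": 0.25}   # assumed weighting

def priority_score(url_info):
    return sum(WEIGHTS[k] * url_info[k] for k in WEIGHTS)

def build_queue(url_infos):
    # heapq is a min-heap, so push negative scores to pop the highest-priority URL first
    heap = []
    for info in url_infos:
        heapq.heappush(heap, (-priority_score(info), info["url"]))
    return heap

queue = build_queue([
    {"url": "https://example-review-site.com/shops", "pagerank": 0.8, "freshness": 0.9, "history": 0.7},
    {"url": "https://example-restaurant.com/contact", "pagerank": 0.5, "freshness": 0.2, "history": 0.9},
])
print(heapq.heappop(queue))   # highest combined priority is served first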
This method considers not only the static importance of a URL but also dynamic factors and historical experience, making resource allocation more reasonable and efficient. Notably, the queue is not static: it is continuously adjusted according to real-time acquisition results and the system's resource state, reflecting its dynamic nature.
The semi-parametric batch-processing global decision mechanism specifically comprises the following steps:
 A1, collecting feature data of target websites, and generating, through feature engineering, a feature vector containing domain name age, update frequency and content richness.
In the first step of the semi-parametric batch-processing global decision mechanism, the system performs comprehensive collection and processing of target website feature data.
For each target website, the system collects multidimensional feature data, including domain name age (registration time obtained by WHOIS query), website size (estimated from the site map or page count), technical architecture (the CMS, front-end framework, etc. identified), update frequency (analyzed by comparing historical snapshots), content richness (estimated text density, number of multimedia elements, etc.), external references (number and quality of backlinks), and historical acquisition success rate.
After collecting these raw features, the system performs feature engineering. First, numerical features are standardized so that features of different dimensions become comparable. Missing values are then handled: features that cannot be obtained (such as the domain name age of websites that cannot be queried) are filled with industry averages or data from similar websites. Feature selection is then performed, using correlation analysis and importance evaluation to keep the features with the most predictive value. Finally, combined features are constructed, such as combining update frequency and content richness into an "information value density" feature. Through this series of processing, the system generates high-quality feature vectors containing key dimensions such as domain name age, update frequency and content richness, providing a solid foundation for subsequent model training.
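The feature engineering described above may be sketched in Python as follows; the field names, normalization constants and industry averages are illustrative assumptions.

import numpy as np

def build_feature_vector(raw, industry_means):
    # fill missing values with industry averages, as described above
    filled = {k: raw.get(k) if raw.get(k) is not None else industry_means[k] for k in industry_means}
    # normalization constants are assumptions, not taken from the patent
    scale = {"domain_age_days": 3650.0, "updates_per_week": 14.0, "content_richness": 1.0}
    vec = np.array([min(filled[k] / scale[k], 1.0) for k in scale])
    # combined feature: "information value density" = update frequency x content richness
    info_density = vec[1] * vec[2]
    return np.append(vec, info_density)

industry_means = {"domain_age_days": 1200, "updates_per_week": 3, "content_richness": 0.6}
print(build_feature_vector({"domain_age_days": None, "updates_per_week": 7, "content_richness": 0.8},
                           industry_means))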
A2, training a mixed parametric and non-parametric model according to the feature vector to form a semi-parametric reward prediction model.
In this link, the system builds a semi-parametric reward prediction model using the feature vectors generated by A1. The model adopts a mixed architecture combining parameterization and non-parameterization methods, and fully exerts the advantages of the two methods.
In the parameterization part, the system captures the linear relation between the features and the acquired rewards (such as the acquired number of effective telephone numbers) by using a generalized linear model, and the parameterization part has simple and definite structure and high calculation efficiency and is suitable for processing definite feature association.
For example, the update frequency of a website and the freshness of data are generally in positive correlation, and can be effectively expressed by a linear model. In the non-parameterized part, the system adopts complex models such as random forests or gradient lifting trees and the like, and is used for capturing nonlinear interaction and complex modes among features.
This section is particularly suited to handle combinations of features that are complex or difficult to formulate with simple formulas, such as complex relationships between domain name age, content structure and collection efficiency.
The system then performs weighted fusion of the predictions of the two models, continuously adjusting the weights through Bayesian optimization, and finally forms the semi-parametric reward prediction model. The hybrid architecture retains the interpretability and computational efficiency of the parametric model while gaining the non-parametric model's ability to handle complex relationships, and is particularly suitable for automated data acquisition scenarios with diverse website characteristics and complex relationships.
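A minimal Python sketch of such a hybrid model follows, using scikit-learn's ridge regression as the parametric part and gradient boosting as the non-parametric part; the synthetic data and the fixed fusion weight are illustrative assumptions (the patent tunes the weight with Bayesian optimization).

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 3))                        # e.g. [update_freq, domain_age, content_richness]
y = 3 * X[:, 0] + np.sin(5 * X[:, 1]) * X[:, 2] + rng.normal(0, 0.1, 200)  # observed reward (valid numbers found)

linear = Ridge(alpha=1.0).fit(X, y)             # parametric part: simple, interpretable linear trend
booster = GradientBoostingRegressor().fit(X, y) # non-parametric part: non-linear interactions

def predict_reward(x, w=0.4):
    # w is the fusion weight; here it is simply fixed as an assumption
    x = np.asarray(x).reshape(1, -1)
    return w * linear.predict(x)[0] + (1 - w) * booster.predict(x)[0]

print(predict_reward([0.9, 0.3, 0.7]))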
A3, executing a Thompson sampling algorithm according to the semi-parametric reward prediction model to calculate the expected data gain of each URL target, and generating a priority decision scheme for URL batch acquisition.
Based on the semi-parametric reward prediction model trained in A2, the system calculates the expected data gain of each URL target using a Thompson sampling algorithm and generates an optimized batch acquisition decision scheme. The core idea of Thompson sampling is to balance exploration and exploitation through probabilistic sampling.
In particular implementations, the system first maintains a posterior belief of the rewards distribution for each URL, which may initially be a Gaussian distribution based on a predictive model. At each decision point, the system randomly extracts a sample value from the posterior distribution of each URL, represents the potential benefit of that URL, and then selects the batch of URLs with the highest sample values for access.
The method naturally balances exploration and exploitation: URLs with wide posterior distributions (options with high uncertainty) occasionally draw high sample values and are therefore explored, while URLs with high posterior means and small variances (options with high expected payoff) are selected frequently, reflecting the exploitation strategy.
As the system continuously collects new data, the posterior distributions gradually converge and the decisions become increasingly accurate. Compared with the traditional epsilon-greedy algorithm, Thompson sampling does not require a manually set exploration rate parameter, but adjusts naturally according to uncertainty, making it more flexible and adaptive. In this way, the system generates a batch acquisition priority decision scheme that maximizes expected revenue while keeping URLs sufficiently explored.
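The Thompson sampling loop can be sketched in Python as follows; the Gaussian posterior, its simplified update rule and the example URLs are illustrative assumptions rather than the patent's exact formulation.

import numpy as np

class ThompsonURLSelector:
    """Keeps a Gaussian posterior belief over the reward of each URL and samples from it."""
    def __init__(self, urls, prior_mean=1.0, prior_var=1.0):
        self.stats = {u: {"mean": prior_mean, "var": prior_var, "n": 0} for u in urls}

    def select_batch(self, k):
        # draw one sample per URL, then take the k URLs with the highest sampled rewards
        samples = {u: np.random.normal(s["mean"], np.sqrt(s["var"])) for u, s in self.stats.items()}
        return sorted(samples, key=samples.get, reverse=True)[:k]

    def update(self, url, reward):
        # simplified running-mean update; the posterior narrows as evidence accumulates
        s = self.stats[url]
        s["n"] += 1
        s["mean"] += (reward - s["mean"]) / s["n"]
        s["var"] = max(s["var"] * 0.95, 0.01)

selector = ThompsonURLSelector(["url_a", "url_b", "url_c"])
batch = selector.select_batch(2)
selector.update(batch[0], reward=5)          # e.g. 5 valid phone numbers found on that page
print(batch)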
 A3.1, acquiring expected reward data of the URL target set, and calculating an optimal combination through a combinatorial optimization algorithm to form an initial batch scheme.
In a detailed implementation of thompson sampling, the system first needs to obtain the desired reward data for the URL target set and calculate the optimal combination.
For each candidate URL, the system draws a number of samples (typically 100-1000) from its posterior reward distribution and calculates the expected reward value and its confidence interval. These reward data are typically measured as "effective information acquired per unit of resource", such as the number of valid phone numbers obtained per request or the amount of useful data captured per second.
After obtaining the expected reward data, the system faces a combinatorial optimization problem: how to select a set of URLs that maximizes the overall expected reward when resources are limited. This is essentially a constrained knapsack problem.
The system solves it with an improved greedy algorithm or dynamic programming, taking into account dependencies and complementary effects between URLs. For example, the value of a dining platform's detail pages increases after its list page has been accessed, while accessing many similar websites at the same time may reduce their value due to information duplication. By solving this optimization problem, the system forms an initial batch scheme that determines which URLs should be accessed in the current batch, together with their access order and resource allocation proportions.
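A minimal Python sketch of a greedy solution to this constrained knapsack-style selection follows; the reward, cost and budget figures are illustrative assumptions.

def greedy_batch(candidates, budget):
    """candidates: dicts with expected_reward and cost (e.g. requests needed); budget: total requests."""
    # classic greedy approximation for the constrained knapsack: sort by reward density
    ranked = sorted(candidates, key=lambda c: c["expected_reward"] / c["cost"], reverse=True)
    batch, spent = [], 0
    for c in ranked:
        if spent + c["cost"] <= budget:
            batch.append(c["url"])
            spent += c["cost"]
    return batch

plan = greedy_batch(
    [{"url": "list_page", "expected_reward": 40, "cost": 10},
     {"url": "detail_page", "expected_reward": 12, "cost": 2},
     {"url": "forum_post", "expected_reward": 5, "cost": 1}],
    budget=12)
print(plan)    # -> ['detail_page', 'forum_post'] under this budget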
A3.2, performing diversity constraint calculation using a submodular function maximization framework according to the initial batch scheme, to generate diversified batch combinations.
After the initial batch scheme is formed, the system introduces diversity constraints through a submodular function maximization framework to further optimize the batch combination.
Submodular functions are a class of functions with diminishing marginal gains, and are well suited to modeling selection problems with diversity requirements. In an automated data acquisition system, repeatedly selecting URLs of the same type often leads to information redundancy and reduces overall efficiency. The system therefore defines a submodular objective in which the marginal benefit of adding a new URL to a batch decreases with its similarity to the URLs already selected.
In the specific implementation, the system first builds a similarity matrix between URLs, computing pairwise similarity based on dimensions such as domain name, content type and target audience. A greedy algorithm then builds the batch step by step under the submodularity constraint, each time selecting the URL that maximizes the marginal revenue while ensuring that its average similarity to the already selected URLs does not exceed a preset threshold. This method ensures that a batch contains URLs of different types and from different sources while preserving the overall expected revenue, improving information coverage and diversity.
For example, instead of focusing on a single source, the system mixes a restaurant review website, restaurant listing sites and a local business directory in the same batch, thereby forming a diversified batch combination.
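The greedy submodular selection with a diversity constraint can be sketched in Python as follows; the similarity values, gain formula and threshold are illustrative assumptions.

def diversified_batch(candidates, similarity, k, max_avg_sim=0.5):
    """Greedy submodular selection: marginal gain shrinks with similarity to already chosen URLs."""
    chosen = []
    while len(chosen) < k:
        best, best_gain = None, float("-inf")
        for url, reward in candidates.items():
            if url in chosen:
                continue
            sims = [similarity[frozenset((url, c))] for c in chosen]
            avg_sim = sum(sims) / len(sims) if sims else 0.0
            if avg_sim > max_avg_sim:                 # hard diversity constraint
                continue
            gain = reward * (1.0 - avg_sim)           # diminishing marginal gain
            if gain > best_gain:
                best, best_gain = url, gain
        if best is None:
            break
        chosen.append(best)
    return chosen

candidates = {"review_site": 10, "review_site_2": 9, "chain_official": 7, "local_directory": 6}
similarity = {frozenset(p): s for p, s in [
    (("review_site", "review_site_2"), 0.9), (("review_site", "chain_official"), 0.2),
    (("review_site", "local_directory"), 0.1), (("review_site_2", "chain_official"), 0.2),
    (("review_site_2", "local_directory"), 0.1), (("chain_official", "local_directory"), 0.3)]}
print(diversified_batch(candidates, similarity, k=3))   # mixes sources instead of two review sites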
A3.3, according to the diversified batch combinations, coordinating multi-node task allocation using a federated learning framework to form a distributed decision scheme.
After the diversified batch combinations are formed, the system uses the federated learning framework to coordinate multi-node task allocation and generates a distributed decision scheme.
In a large-scale automated data acquisition system, multiple acquisition nodes are typically deployed, distributed across different geographic locations or network environments. The federated learning framework enables these nodes to share knowledge and coordinate actions while retaining a degree of autonomy.
First, each node maintains a local model and gathers local observations, such as the response time and success rate of particular websites. Model parameters are then synchronized periodically between nodes to update the global knowledge base, while the raw data remains local. This design improves the overall learning efficiency of the system and reduces communication overhead.
In terms of task allocation, the system employs a decentralised coordination mechanism, such as weighted voting or an auction mechanism. For example, when multiple nodes are all suited to access a high-value URL, the system comprehensively considers the current load, historical success rate and network conditions of the nodes to assign the task to the most suitable node. The system also applies knowledge distillation, compressing the globally learned strategy and distributing it to each node so that every node can adapt quickly to environmental changes. Through this coordination mechanism, the system forms an efficient distributed decision scheme, makes full use of the advantages of the distributed architecture, and improves overall acquisition efficiency and robustness.
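A minimal Python sketch of federated parameter averaging between collection nodes follows; the parameter names and node weights are illustrative assumptions.

import numpy as np

def federated_average(node_params, node_weights):
    """Weighted FedAvg-style aggregation of per-node model parameters (dicts of arrays)."""
    total = sum(node_weights)
    keys = node_params[0].keys()
    return {k: sum(w * p[k] for w, p in zip(node_weights, node_params)) / total for k in keys}

# two collection nodes share only model parameters, never the raw page data
node_a = {"w": np.array([0.2, 0.8]), "b": np.array([0.1])}
node_b = {"w": np.array([0.4, 0.6]), "b": np.array([0.3])}
global_model = federated_average([node_a, node_b], node_weights=[300, 100])  # weight by local sample count
print(global_model)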
And A3.4, executing a multi-objective optimization algorithm to fuse the compliance constraint conditions according to the distributed decision scheme, and generating a final URL batch acquisition execution plan for updating the URL queues with ordered priorities.
After the distributed decision scheme is formed, the system fuses the compliance constraint conditions through a multi-objective optimization algorithm to generate a final URL batch acquisition execution plan.
The system treats data collection as a multi-objective optimization problem whose main objectives include maximizing data value (such as the number of telephone numbers acquired), minimizing resource consumption (such as request count and bandwidth use), and minimizing compliance risk (such as obeying robots.txt rules and avoiding excessive load on target websites).
The system adopts a Pareto optimization method to find the best balance point between these objectives.
Specifically, the system first converts the compliance constraints into hard constraints and soft penalty terms. For hard constraints, such as robots.txt disallow rules, the system directly eliminates URLs that violate them; soft constraints, such as request frequency limits, are converted into penalty terms added to the optimization objective function.
The system also establishes an adaptive rate limiting mechanism that dynamically adjusts the request frequency according to the response time and error rate of the target website, avoiding the triggering of anti-crawling mechanisms. Through this multi-objective optimization, the system generates a final URL batch acquisition execution plan that balances data value, resource efficiency and compliance risk. The plan explicitly specifies the access time, request parameters and processing priority of each URL, and is used to guide the system in updating the priority-ordered URL queue, so that the acquisition process is efficient and compliant and can run stably in the long term.
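A minimal Python sketch of fusing the robots.txt hard constraint with soft penalty terms follows, using the standard library's urllib.robotparser; the penalty weights and example URLs are illustrative assumptions.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.parse(["User-agent: *", "Disallow: /private/"])     # normally fetched from the site's robots.txt

def plan_score(url, data_value, request_cost, rate_penalty, agent="*"):
    # hard constraint: URLs disallowed by robots.txt are eliminated outright
    if not rp.can_fetch(agent, url):
        return None
    # soft constraints enter the objective as weighted penalty terms (weights are assumptions)
    return 1.0 * data_value - 0.3 * request_cost - 0.5 * rate_penalty

print(plan_score("https://example-restaurant.com/contact", data_value=8, request_cost=1, rate_penalty=0.2))
print(plan_score("https://example-restaurant.com/private/admin", data_value=9, request_cost=1, rate_penalty=0.0))  # -> None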
A4, dynamically adjusting the weight parameters of the URL exploration and exploitation strategies according to the URL batch acquisition priority decision scheme, calculating the priority score of each URL, and updating the priority-ordered URL queue to form an optimized URL access order and frequency.
In this step, the system uses the URL batch acquisition priority decision scheme generated in A3 to further adjust the dynamic exploration and exploitation strategy and update priorities.
First, the system dynamically adjusts the weight parameters of the exploration and exploitation strategy according to the current task progress and resource situation.
For example, in the early stages of acquisition, the system tends to increase the exploration weight, trying various types of URLs and accumulating experience; in later stages, it increases the exploitation weight, concentrating resources on URL types known to yield high returns to secure results.
And secondly, the system dynamically adjusts specific parameters of each URL according to real-time feedback. When the recent data quality of a certain type of URL is found to be improved significantly, the priority of the URL is correspondingly improved.
In addition, the system also takes into account time and environmental factors such as reducing requests at peak web traffic times to avoid triggering protection mechanisms, or increasing access to the e-commerce web site at certain points in time (e.g., holiday promotions). Through the multi-dimensional dynamic adjustment, the system continuously updates the priority scores of the various items in the URL queue and reorders the priority scores to form the optimized URL access sequence and frequency. The self-adaptive mechanism enables the system to keep high-efficiency running in complex and changeable network environments, and flexibly meets various challenges.
S1.4, carrying out timeliness evaluation according to the URL queue, and dynamically adjusting the access frequency and the priority to form the entry URL set.
And in the link of timeliness evaluation and self-adaptive adjustment, the system performs dynamic timeliness analysis and adjustment on the priority queue established in the step S1.3.
The system first establishes an update cycle model of website content, analyzing the content change patterns of different websites through historical crawl records.
For example, the promotional information for the e-commerce platform may be updated daily, while the basic contact of the company may change monthly or quarterly. Based on these analyses, the system builds a timeliness assessment model, setting the appropriate access period for the different types of URLs. For highly time-efficient content (e.g., new job contact information for recruitment sites), the system will increase its priority in the queue and increase the access frequency, while for less time-efficient content (e.g., fixed-line telephones for government agencies), the access frequency is correspondingly reduced, allocating resources to URLs that need to be updated more timely.
The system also realizes a self-adaptive mechanism, and can adjust the estimated update period according to the actual acquisition result. For example, if a restaurant website is accessed for a plurality of times without content change, the system automatically prolongs the access interval, and once the content is found to start to change frequently, the access interval is shortened correspondingly. Through the dynamic adjustment, the system forms an entry URL set which has the advantages of full coverage and important emphasis, so that the timeliness of data can be ensured, and the resource utilization efficiency can be optimized.
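The adaptive adjustment of the revisit interval can be sketched in Python as follows; the back-off factors and interval bounds are illustrative assumptions.

def adjust_interval(current_interval, changed, min_interval=3600, max_interval=30 * 24 * 3600):
    """Lengthen the revisit interval after unchanged fetches, shorten it when content changes."""
    if changed:
        new_interval = current_interval / 2        # content is moving again: revisit sooner
    else:
        new_interval = current_interval * 1.5      # stable content: back off gradually
    return int(min(max(new_interval, min_interval), max_interval))

interval = 24 * 3600                               # start with a daily check
for changed in [False, False, True]:               # two unchanged fetches, then a change detected
    interval = adjust_interval(interval, changed)
    print(interval)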
As shown in fig. 3, S2 specifically includes:
 S2.1, acquiring webpage content according to the entry URL set, and performing HTML parsing and cleaning to obtain a standardized DOM tree structure.
In the page acquisition and preprocessing link, the system starts actual data acquisition work based on the optimized entry URL set.
Firstly, the system sends an HTTP request to acquire the original content of the target webpage, and the process needs to process various network conditions and server responses, including setting a reasonable timeout mechanism, processing redirection, maintaining a Cookie state and the like.
After the original content is obtained, the system first performs encoding detection, automatically identifying the character encoding of the page (such as UTF-8, GB2312, GBK, etc.) to ensure that multilingual content such as Chinese can be parsed accurately. The system then performs HTML parsing to convert the text content into a structured DOM tree, using a dedicated HTML parser (such as lxml or BeautifulSoup) that can handle non-standard HTML and repair common tag errors. After parsing, the system performs content cleaning to remove non-content elements such as JavaScript code, CSS styles and comments, and identifies and filters parts irrelevant to the target data, such as advertisements, navigation bars and footers. This step is critical to the accuracy of subsequent analysis, as it avoids interference from noisy data. Finally, the system normalizes the cleaned content into a standard DOM tree structure to facilitate unified processing by subsequent algorithms.
The whole preprocessing flow handles various abnormal cases, such as automatically repairing missing tags and normalizing special characters, ensuring the stability and accuracy of subsequent analysis. Through this processing, the system converts the originally chaotic webpage content into a clearly structured, easily analyzable DOM tree, laying a solid foundation for intelligent link discovery.
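A minimal Python sketch of the parsing and cleaning step follows, using BeautifulSoup with the lxml parser as one possible implementation; the selectors used to drop advertisement and navigation containers are illustrative assumptions.

from bs4 import BeautifulSoup

def clean_dom(html):
    soup = BeautifulSoup(html, "lxml")              # lxml tolerates and repairs much malformed HTML
    # remove scripts, styles and typical non-content containers before analysis
    for selector in ["script", "style", "noscript", "nav", "footer",
                     '[class*="ad"]', '[class*="banner"]']:
        for tag in soup.select(selector):
            tag.decompose()
    return soup

soup = clean_dom('<html><body><nav>menu</nav><div class="ad-box">buy!</div>'
                 '<p>Order line: 010-12345678</p><script>x()</script></body></html>')
print(soup.get_text(" ", strip=True))               # -> Order line: 010-12345678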
S2.2, extracting links using regular expressions, XPath positioning and CSS selectors according to the DOM tree structure to form an initial link set.
In the multi-strategy link extraction stage, the system applies several technical means in parallel to the DOM tree structure, extracting valuable links to the greatest possible extent.
First, the system uses regular expressions to match all possible URL patterns and identify http(s) links in the text content; this can capture links that are not in standard <a> tags, such as plain-text URLs or links embedded in JavaScript code.
At the same time, the system uses XPath to precisely locate links within a particular structure; for example, //div[@class="content"]/a/@href locates all links in a content area.
In addition, the system employs CSS selectors; for example, a selector such as .product-list-item a can select all item links in a product list.
Beyond these static extraction methods, the system also integrates a JavaScript execution environment and can capture dynamically generated links, addressing the challenge that modern websites heavily use AJAX to load content dynamically. For example, when a page loads more content on scroll or when the "show more" button is clicked, the system can simulate these operations and extract the newly appearing links.
The system also handles special cases, such as completing relative paths (converting "/products/1" to a full URL), URL decoding (processing encoded characters such as %20), and removing session identifiers from URLs, to ensure that the extracted links are uniform and valid.
Through the parallel application of the multiple strategies, the system can comprehensively collect link resources in the page to form an initial set containing links of various sources, and rich candidates are provided for subsequent value evaluation. The multi-strategy parallel method remarkably improves the coverage rate and adaptability of link discovery, and can cope with websites realized by various different structures and technologies.
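A minimal Python sketch of multi-strategy link extraction follows, using regular expressions and lxml XPath; the example XPath expression and page content are illustrative assumptions.

import re
from urllib.parse import urljoin
from lxml import html as lxml_html

def extract_links(page_url, raw_html):
    links = set()
    # strategy 1: regular expression over the raw text catches URLs outside <a> tags
    links.update(re.findall(r"https?://[^\s\"'<>]+", raw_html))
    tree = lxml_html.fromstring(raw_html)
    # strategy 2: XPath for links inside a specific structure
    for href in tree.xpath('//div[@class="shop-list"]//a/@href'):
        links.add(urljoin(page_url, href))
    # strategy 3: all anchor tags as a general fallback
    for href in tree.xpath("//a/@href"):
        links.add(urljoin(page_url, href))
    # normalise: drop fragments and obvious non-navigational links (simplified)
    return {re.sub(r"#.*$", "", u) for u in links if not u.startswith("javascript:")}

print(extract_links("https://example-review-site.com/list",
                    '<div class="shop-list"><a href="/shop/1">Kitten Potato</a></div>'))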
S2.3, analyzing the relationship between link text and context through a TF-IDF algorithm according to the initial link set, and calculating semantic similarity in combination with a Word2Vec model to obtain a link value scoring result.
 In the link value evaluation stage, the system performs in-depth analysis of the extracted initial link set and calculates a value score for each link.
First, the system uses the TF-IDF algorithm to analyze the relevance of the linked text to the context. The system treats the linked text and its surrounding context as one document, calculates word frequency (TF) and Inverse Document Frequency (IDF), and identifies keywords with high discrimination. For example, links in the link text or surrounding text that contain words of "contact us," "telephone," "customer service," etc., may achieve a higher relevance score.
Meanwhile, the system performs deep semantic analysis by combining a Word2Vec model, and the model can understand semantic relations among words through pre-trained Word vectors, and can identify related contents even if completely matched keywords do not appear.
For example, even if the link does not directly contain the word "phone", but has semantically related words such as "dial", "consultation", etc., the system can identify its potential value.
In addition to text semantics, the system evaluates structural features of links, such as DOM tree node depth (links in the body content are typically more valuable than links in the navigation bar or footer), sub-link density (the greater the number of links to a page, the more important the page is typically), and the location of the links in the page (links in the center region of the page are typically more important than links in the edge region).
The system also considers historical data, such as the historical yield of a particular path pattern under the domain name. All of these features are integrated by a random forest model, and a composite value score is calculated for each link, typically ranging from 0-100, with higher scores indicating that the link is more likely to contain target information. The multidimensional scoring mechanism can comprehensively evaluate the potential value of the links and provide scientific basis for subsequent link screening.
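A minimal Python sketch combining TF-IDF relevance and Word2Vec semantic similarity into a link value score follows; the toy corpus, the tiny locally trained Word2Vec model and the score weights are illustrative assumptions (in practice a pre-trained model and the random-forest fusion described above would be used).

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from gensim.models import Word2Vec

target_terms = ["contact", "phone", "telephone", "customer", "service"]
link_contexts = ["contact us for reservations phone",
                 "download our mobile app banner",
                 "customer service hotline and consultation"]

# TF-IDF relevance of each link context to the target vocabulary
vec = TfidfVectorizer()
matrix = vec.fit_transform(link_contexts + [" ".join(target_terms)])
tfidf_scores = cosine_similarity(matrix[:-1], matrix[-1]).ravel()

# Word2Vec semantic similarity; a pre-trained model would be used in practice,
# a tiny model is trained on the toy corpus here only to keep the sketch self-contained
w2v = Word2Vec([c.split() for c in link_contexts] + [target_terms], vector_size=32, min_count=1, seed=1)
sem_scores = [w2v.wv.n_similarity(c.split(), target_terms) for c in link_contexts]

for ctx, t, s in zip(link_contexts, tfidf_scores, sem_scores):
    print(f"{ctx[:30]:30s}  tfidf={t:.2f}  semantic={s:.2f}  value={0.6 * t + 0.4 * s:.2f}")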
S2.4, setting a dynamic threshold to filter advertisements and low-value links according to the link value scoring result, and removing automated data acquisition traps using heuristic algorithms, to form a high-value link queue.
In the link filtering and optimization stage, the system intelligently screens and processes the links scored in S2.3 to form the final high-value link queue.
First, the system sets a dynamic threshold to filter low value links, which is not fixed, but automatically adjusts according to the overall quality profile of the current lot links.
For example, if the current lot links are generally of higher quality, the system will raise the threshold to only preserve links of the best quality, and otherwise lower the threshold appropriately to ensure adequate collection. The system focuses on and filters ad links specifically, automatically eliminating such distracters by identifying common ad features (e.g., keywords including "ad", "sponsor", "promotion", etc., or pointing to ad network domain names).
Meanwhile, the system adopts heuristic algorithms to identify and avoid automated data acquisition traps, such as infinite calendar pagination, tag loops and parameter traps, patterns that can cause the collector to fall into loops or cause an exponential explosion of URLs.
The system can also detect and process the URL repetition problem, and normalize URLs (such as URLs with different session identifications or sequencing parameters) which are different in form and actually point to the same content, so that repeated access is avoided.
In addition, the system optimizes the reserved high-value links, including completing relative paths, removing URL anchors, standardizing parameter sequences and the like, so as to ensure the unified specification of the link formats.
Finally, the system ranks the priority of the links according to the value scores, and sets reasonable acquisition intervals, so that excessive requests are prevented from being sent to the same domain name in a short time.
Through the series of filtering and optimizing processes, the system extracts a high-quality high-value link queue from the original hybrid link set, provides a high-efficiency target page set for the subsequent telephone number extraction link, and remarkably improves the overall acquisition efficiency and the data quality.
As shown in fig. 4, S3 specifically includes:
 S3.1, acquiring page content according to the high-value link queue, and extracting visible text content by DOM parsing.
In the page text extraction stage, the system visits the target pages one by one based on the high-value link queue and extracts text content that may contain telephone numbers.
Firstly, the system sends an HTTP request to acquire page content, and dynamically adjusts a request strategy according to actual conditions, such as setting different User-agents, maintaining a Cookie state or processing JavaScript redirection, and the like, so as to cope with access restrictions of various websites.
After the content is acquired, the system extracts visible text through DOM parsing, and the process not only comprises obvious text elements such as regular paragraphs, titles and the like, but also pays special attention to special areas possibly containing contact information, such as footers, sidebars, contact pages and the like.
The system can intelligently handle various complications, such as processing special encodings (converting HTML entities such as &nbsp; and &amp; into normal characters), identifying the alternative text descriptions of images (alt attributes), and even attempting to extract pseudo-element content added through CSS (e.g., via ::before and ::after).
It is particularly noted that the system also detects and extracts hidden text. Some websites use various techniques to hide telephone numbers, such as CSS display:none or visibility:hidden, or setting the text color to be the same as the background. The system can identify and extract these hidden contents by analyzing the CSS attributes.
In addition, the system focuses on dynamically loaded content to obtain complete information by simulating user interactions (e.g., clicking a "display more" button or triggering a particular event). For complex layouts, the system will analyze the spatial relationship of the elements, correctly associating the phone number with its descriptive text.
Through the comprehensive and fine extraction technology, the system ensures complete coverage of all the areas possibly containing telephone numbers in the page, outputs the cleaned and normalized plain text content, and lays a foundation for subsequent digital sequence identification.
S3.2, extracting candidate digit sequences from the visible text content using regular expressions, and recording the context of each candidate digit sequence in the original text.
In the step of digital sequence recognition, the system performs fine analysis on the extracted plain text content to recognize all the digital sequences possibly constituting telephone numbers.
First, the system uses a series of specially designed regular expression patterns to identify digit combinations in various forms. These patterns include consecutive digits (e.g., 13800138000), digits with separators (e.g., 010-88888888, 0755.83744944), bracketed numbers (e.g., (010) 88888888), internationally prefixed numbers (e.g., +8613800138000), and various mixed forms (e.g., +86 (10) 6552-9988).
The system's regular expressions are carefully designed to handle the telephone number format conventions of different countries and regions, such as the 11-digit format of Chinese mobile phone numbers and the 10-digit format (3-digit area code plus 7-digit number) used in the United States.
At the same time, the system handles special cases, such as digits in the text being separated by spaces, tabs or line breaks, or even being represented with full-width digits or Chinese numerals.
For each identified digit sequence, the system records not only the sequence itself but also its complete context in the original text, typically 50-100 characters before and after. Such contextual information is critical to subsequent telephone number verification and function identification, providing key cues such as "customer service phone:", "order hotline:" or "For service:".
The system also records the position information of the digit sequence in the page, such as the type of HTML element it appears in (tags such as <p>, <div>, <span>), the CSS class name (such as class="tel" or class="contact"), and its spatial position relative to the page, which helps determine the importance and function of the digit sequence.
Through the series of fine recognition and information association processing, the system outputs a candidate number sequence set containing rich metadata, and provides a comprehensive analysis basis for the next telephone number pattern matching.
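A minimal Python sketch of candidate digit-sequence extraction with context recording follows; the regular expression and context window size are illustrative simplifications of the pattern set described above.

import re

PHONE_CANDIDATE = re.compile(
    r"(?:\+?\d{1,3}[\s-]?)?(?:\(\d{2,4}\)[\s-]?)?\d{3,4}[\s.-]?\d{3,4}(?:[\s.-]?\d{3,4})?")

def find_candidates(text, window=50):
    """Return candidate digit sequences with the surrounding context recorded."""
    results = []
    for m in PHONE_CANDIDATE.finditer(text):
        start, end = m.span()
        results.append({
            "raw": m.group(),
            "digits": re.sub(r"\D", "", m.group()),
            "context": text[max(0, start - window):min(len(text), end + window)],
        })
    return results

sample = "Kitten Potato, order hotline: 010-87654321, mobile 13812345678, est. 2008"
for c in find_candidates(sample):
    print(c["digits"], "|", c["context"])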
S3.3, performing various combination and formatting operations on the candidate digit sequences using the telephone number pattern knowledge base, to generate a preliminarily identified telephone number list.
In the telephone number pattern matching link, the system uses the global telephone number format knowledge base to verify and format the candidate digit sequence.
The telephone number knowledge base of the system covers the number rules of more than 200 countries and regions worldwide, including international area codes, domestic area code length rules, total number length requirements, mobile phone number prefix rules and the like of each country/region.
For example, a Chinese mobile phone number must be 11 digits and begin with 1; landline numbers are typically an area code (e.g., 010, 0755) plus a 7-8 digit local number; the United States and Canada use the North American Numbering Plan (NANP) with 3-digit area codes plus 7-digit local numbers; and so on.
The system performs various combining and formatting processes on each candidate digit sequence, for example, for "01088888888", the system may try various segmentation methods such as "010-8888-8888", "0108-888-888", etc., and then verify which segmentation conforms to the valid phone number format based on the knowledge base.
The system can also handle special cases, such as a short local form that omits the area code, compound formats with extension numbers, or multiple telephone numbers appearing in the same text block. For a multinational enterprise website, the system can identify the telephone number formats of multiple countries/regions and classify them correctly.
In the verification process, the system not only considers the compliance of the number format, but also can combine some heuristic rules to enhance the judgment accuracy, such as excluding the number sequences which are obviously the product model, price, date and the like.
To increase efficiency, the system employs a multi-stage filtering strategy that first screens out sequences that do not significantly match the telephone number characteristics quickly with simple rules, and then performs more detailed format verification on potentially valid sequences. Through this complex series of matching and verification processes, the system outputs a list of primarily identified telephone numbers, each labeled with its possible country/region attribution and format validity scores, providing a basis for subsequent context verification and classification.
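A minimal Python sketch of rule-based format verification follows, covering only a few simplified Chinese number rules as an illustration of the knowledge-base checks; a production system would use the full per-country rule set.

import re

def classify_cn_number(digits):
    """Rule-of-thumb checks mirroring the knowledge-base rules quoted above (simplified)."""
    if re.fullmatch(r"1[3-9]\d{9}", digits):
        return "CN mobile"
    if re.fullmatch(r"0(10|2\d)\d{8}", digits):          # 3-digit area code + 8-digit local number
        return "CN landline (major city)"
    if re.fullmatch(r"0[3-9]\d{2}\d{7,8}", digits):      # 4-digit area code + 7-8 digit local number
        return "CN landline"
    return None

def verify(candidate):
    digits = re.sub(r"\D", "", candidate)
    kind = classify_cn_number(digits)
    return {"raw": candidate, "digits": digits, "kind": kind, "valid": kind is not None}

for c in ["010-8765-4321", "13812345678", "0755 1234567", "20080815"]:
    print(verify(c))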
S3.4, performing verification and function type identification on the telephone number list in combination with the context information, to form a page queue containing valid telephone numbers.
In the context verification and classification stage, the system performs deep context analysis on the preliminarily identified telephone numbers, further confirming their validity and identifying their function types.
First, the system analyzes the context vocabulary around each phone number, looking for keywords that can be confirmed as an explicit identification of the phone number, such as "phone", "Tel", "contact", "dial", etc.
The system analyzes texts in different ranges before and after the number by adopting a sliding window method, and distributes weights according to the distances between the keywords and the number, wherein the influence of the keywords with the closer distances is larger.
Such context-based verification can effectively exclude digit sequences that match the format but are not actually telephone numbers, such as product numbers or order numbers.
After confirming the validity, the system further analyzes the context to identify the function type of the phone number. The system uses a predefined function type dictionary containing various common telephone function classifications and their corresponding feature words, such as customer service (customer service, consultation, support), sales (sales, ordering, purchasing, sales), technical support (technical, fault, maintenance, technical), etc.
By matching the feature words in the context, the system is able to assign the most likely function label to each phone number.
For situations where the contextual information is insufficient, the system will analyze the position of the number in the page and DOM structural features, such as the number located in the "contact us" page being more likely to be a customer service call and the number located in the product detail page being more likely to be a sales call.
In addition, the system may also identify time limits for the number usage, such as "weekdays 9:00-18:00" etc. period labels.
After completing these analyses, the system stores each validated telephone number, together with the URL and title of the current page, as a key-value pair in queue B. Each entry in this queue contains complete metadata such as the number text, standardized format, country/region attribution, function type and credibility score, providing rich structured data for subsequent merchant information association. Through this deep context analysis and function recognition, the system greatly improves the accuracy and practical value of telephone number extraction.
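The distance-weighted, context-based function classification can be sketched in Python as follows; the keyword dictionary, window size and decay formula are illustrative assumptions.

FUNCTION_KEYWORDS = {
    "customer service": ["customer service", "consultation", "support", "complaint"],
    "ordering":         ["order", "reservation", "booking", "hotline"],
    "technical":        ["technical", "fault", "maintenance"],
}

def classify_function(context, number, window=40):
    """Weight keywords by their distance from the number: closer words count more."""
    pos = context.find(number)
    scores = {}
    for label, words in FUNCTION_KEYWORDS.items():
        score = 0.0
        for w in words:
            idx = context.lower().find(w)
            if idx < 0:
                continue
            distance = abs(idx - pos)
            if distance <= window:
                score += 1.0 / (1.0 + distance / 10.0)   # weight decays with distance
        if score:
            scores[label] = score
    return max(scores, key=scores.get) if scores else "unknown"

ctx = "Order hotline: 010-87654321 (weekdays 9:00-18:00)"
print(classify_function(ctx, "010-87654321"))            # -> ordering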
S4.1, extracting a second-level domain name and a top-level domain name according to the page queue, and grouping by adopting a domain name clustering algorithm to generate a domain name grouping result.
In the domain name extraction and analysis link, the system carries out systematic processing on the page queues containing the effective telephone numbers so as to realize the preliminary classification of merchant information.
First, the system parses each URL, extracting its secondary domain name and top-level domain name. For example, from a URL such as "https://beijing.shop.sample.com/contact", the system recognizes the secondary domain name "shop.sample" and the top-level domain name "com", while recording the subdomain "beijing" as possible region information. The system can also identify special situations: a website using a country-code top-level domain (e.g., .cn, .jp) may represent a business entity in a particular country, and a website using a special domain such as .edu or .gov may be an educational institution or government agency. After extracting the domain names, the system groups them using a domain name clustering algorithm. Such clustering is not based solely on exact matching but also considers the similarity of domain names, and can identify cases such as "shop.sample.com" and "mobile.sample.com" that originate from the same organization but use different subdomains. The system calculates domain name similarity using algorithms such as edit distance and longest common substring, and performs intelligent matching in combination with known common domain name patterns (such as the subdomain prefixes www, m and shop commonly used by enterprises). In addition, the system analyzes the path structure pattern of the URLs, identifying different merchants that may come from the same content management system or e-commerce platform. For example, "platform.com/shop/A" and "platform.com/shop/B" may be different merchants on the same platform. Through this multidimensional domain name analysis and clustering, the system generates a domain name grouping result, effectively grouping pages likely to belong to the same merchant or the same organization, and lays a foundation for subsequent merchant information association. This primary grouping based on the domain name greatly improves data processing efficiency and avoids the redundant work of processing each URL independently.
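As a purely illustrative sketch of the grouping idea described above, the following Python example strips common subdomain prefixes and clusters the remaining registrable domains by a string-similarity measure; the prefix list, the similarity function and the threshold are assumptions of this example.

```python
# Sketch of domain-name grouping: normalize hosts, then greedily cluster by similarity.
from difflib import SequenceMatcher

COMMON_PREFIXES = {"www", "m", "shop", "mobile"}   # illustrative enterprise subdomain prefixes

def registrable_part(host: str) -> str:
    """Drop a leading common subdomain prefix, e.g. shop.sample.com -> sample.com."""
    labels = host.lower().split(".")
    if len(labels) > 2 and labels[0] in COMMON_PREFIXES:
        labels = labels[1:]
    return ".".join(labels)

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()      # stands in for an edit-distance similarity

def group_domains(hosts: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedy clustering: a host joins the first group whose representative is similar enough."""
    groups: list[list[str]] = []
    for host in hosts:
        key = registrable_part(host)
        for group in groups:
            if similarity(key, registrable_part(group[0])) >= threshold:
                group.append(host)
                break
        else:
            groups.append([host])
    return groups

# group_domains(["shop.sample.com", "mobile.sample.com", "other.org"])
# -> [["shop.sample.com", "mobile.sample.com"], ["other.org"]]
```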
And S4.2, establishing a domain name merchant mapping table according to the business database and the historical acquisition data.
In the construction link of the domain name merchant mapping table, the system integrates various data sources to establish the corresponding relation between the domain name and the actual merchant entity.
First, the system utilizes existing commercial database resources, such as enterprise business registration databases, commercial information service platforms (such as Qichacha and Tianyancha) and industry catalog databases, to obtain known domain name-merchant correspondences. These official or professional data sources provide a large amount of validated underlying mapping information.
And secondly, analyzing historical acquisition data by the system, and extracting the association mode of the domain name and the merchant name from the historical acquisition data. Through statistical analysis of a large amount of historical data, the system can identify those high-frequency and stable correspondences, such as that a specific domain name almost always appears simultaneously with a certain merchant name.
The system also adopts a machine learning technology to train a special mapping prediction model, and the model comprehensively considers domain name text characteristics (such as brand keywords contained in domain names), webpage content characteristics (such as website logo and company names in copyright statement) and link relation characteristics (such as reference modes of other known merchant websites) to predict merchant entities possibly corresponding to unknown domain names.
The model continuously improves the accuracy through continuous learning, and can process new domain names which are not directly matched with records.
In addition, the system implements a manual check and feedback mechanism that allows experts to review and correct automatically generated mappings; this feedback is in turn used to further improve model performance. Through these multi-source data fusion and intelligent learning techniques, the system constructs a comprehensive and accurate domain name merchant mapping table, which not only contains direct mapping relations but also records the reliability scores and data sources of each mapping, providing reliable knowledge base support for subsequent merchant information association. The construction of the mapping table is a dynamic and continuous process, and the system updates and expands the mapping data periodically to keep it synchronized with the continuously changing internet business environment.
And S4.3, adding merchant basic information for each URL according to the domain name grouping result and the domain name merchant mapping table, and generating a URL data set associated with the merchant.
In the primary association link of the merchant information, the system combines the domain name grouping result generated before with the domain name merchant mapping table, and adds corresponding merchant basic information for each URL.
Firstly, the system performs query matching on each domain name group, and finds out the corresponding merchant record from the domain name merchant mapping table. For the case of direct matching, the system directly associates the corresponding merchant ID, name, industry classification, etc. base information.
For domain names that do not directly match any record, the system attempts an approximate match based on similarity calculations and rules, for example when dealing with domain name variants (example-shop.com versus exampleshop.com) or subdomain changes (shop.example.com versus m.example.com).
After determining the merchant association, the system tags each URL with a set of core merchant attributes, including a unique identifier (e.g., merchant ID), merchant name (possibly including both the formal name and common abbreviations), industry classification (e.g., major classes such as catering, retail and services, together with finer sub-classes), business scale (e.g., large chain, small business, individual merchant, etc.), establishment time, regional information, etc.
For special cases, such as where a domain name may correspond to multiple merchants (e.g., commercial platform websites) or where a merchant may use multiple domain names, the system may establish many-to-many associations and record the confidence scores for each association.
In addition, the system marks the source of the data and the timestamp for each association, thereby facilitating subsequent data updating and conflict resolution. Through the systematic information association process, URL data is converted from simple website links to structured data with rich merchant contexts, so that a URL data set associated with merchants is formed. The association not only provides valuable business background information, but also provides important classification dimension for subsequent data grouping and content analysis, thereby greatly enhancing the business value and application potential of the data.
And S4.4, according to the URL data set, conducting content subdivision on the page title and the content abstract by using a text clustering algorithm to form the merchant data set.
In a content-based grouping optimization link, the system performs finer content analysis and grouping on URL datasets of associated merchant information.
First, the system analyzes the title and content abstract of the page corresponding to each URL, and extracts keywords and topic information. Through natural language processing techniques, the system identifies the main content type of the page, such as a product introduction page, a contact information page, a company profile page, and so forth.
The system adopts a text clustering algorithm (such as K-means, hierarchical clustering or topic model) to finely group different content pages of the same merchant.
For example, all of the store pages of a restaurant may form one sub-cluster, the menu pages form another sub-cluster, and the order contact pages form a third sub-cluster. The clustering not only considers text similarity, but also combines the URL path mode and the page structure characteristics, so that the functional category of the content can be identified more accurately. The system is particularly concerned with those pages containing contact information, which are marked as high value data sources.
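As one concrete and merely illustrative choice among the clustering algorithms mentioned above, the following Python sketch groups pages by their titles and content summaries using TF-IDF features and K-means; the sample pages and cluster count are hypothetical.

```python
# Hedged sketch of content-based sub-grouping using scikit-learn's TF-IDF and K-means.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def cluster_pages(titles_and_summaries: list[str], n_clusters: int = 3) -> list[int]:
    """Return a cluster id per page based on its title + content summary text."""
    vectorizer = TfidfVectorizer(max_features=2000)
    features = vectorizer.fit_transform(titles_and_summaries)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return model.fit_predict(features).tolist()

pages = [
    "Store locations and opening hours",
    "Menu and seasonal dishes",
    "Contact us - reservations hotline",
]
print(cluster_pages(pages, n_clusters=3))  # e.g. [0, 1, 2]
```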
For large merchant websites, the system can also identify page groups belonging to different departments or business lines, such as the sales department, customer service center and technical support, which provides important guidance for the subsequent telephone number function classification. In addition, the system analyzes the temporal attributes of the pages to identify regularly updated content (e.g., promotional information) and relatively stable content (e.g., primary contact information) for differentiated handling.
With such content-based fine grouping, the system organizes URL data that might otherwise be intermixed into a well-structured, well-functioning merchant dataset, each grouping having specific content features and business functions. The optimized data structure not only improves the accuracy of the subsequent AI model extraction, but also provides a more reasonable organization framework for data display and application, so that the final data product meets the actual service requirements and the use habits of users.
S4.5, original data generated by an automatic data acquisition activity are acquired, and a standardized time sequence data set is formed through normalization and feature extraction processing.
In the link of automatic data acquisition activity data processing and standardization, the system carries out systematic processing on the original data generated in the acquisition process, and prepares for subsequent time sequence analysis.
First, the system collects raw data generated by all automated data collection activities, including the time stamp, URL, response status code, response time, data size, and extracted information content of each request, etc. These raw data are often in different formats, are large-scale and contain noise, and require standardized processing.
The system first cleans the data, handling missing values (e.g., requests that received no response), outliers (e.g., extreme response times) and conflicting data (e.g., different content returned for the same URL within a short time). The system then performs data normalization to uniformly convert features of different dimensions (such as response time in milliseconds and data size in KB) into a standard range, usually using Z-score normalization or Min-Max scaling. Next, the system performs feature extraction to mine valuable temporal pattern features from the raw data, such as the access frequency of specific URLs, content update periods, and the variation of data acquisition success rate over time.
The system can also construct derivative features, such as second-order features of calculated data change rate, request density, content similarity and the like, so that the expression capability of the data is enhanced. Finally, the system performs time granularity alignment on the data according to the service requirement, and may aggregate the original second-level data into a minute-level, hour-level or day-level time sequence for subsequent analysis.
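A minimal sketch of the cleaning, normalization and time-granularity alignment steps is given below, assuming a pandas DataFrame of raw request logs with hypothetical column names ("timestamp", "url", "status_code", "response_time_ms", "data_size_kb"); it is not the system's actual implementation.

```python
# Sketch: clean, Z-score-normalize and aggregate raw acquisition logs into an hourly series.
import pandas as pd

def build_timeseries(raw: pd.DataFrame) -> pd.DataFrame:
    df = raw.dropna(subset=["response_time_ms", "data_size_kb"]).copy()
    for col in ["response_time_ms", "data_size_kb"]:            # Z-score normalization
        df[col + "_z"] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    df["success"] = (df["status_code"] == 200).astype(int)
    hourly = df.set_index("timestamp").resample("1h").agg(
        {"url": "count", "success": "mean", "response_time_ms_z": "mean"}
    )
    hourly.columns = ["requests", "success_rate", "avg_response_time_z"]
    return hourly

# usage (hypothetical log file and columns):
# build_timeseries(pd.read_csv("crawl_log.csv", parse_dates=["timestamp"]))
```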
Through the series of processing, the system converts chaotic original automatic data acquisition activity data into a standardized time sequence data set with unified structure, rich characteristics and time alignment, the data not only reflects the content change rule of a target website, but also records the performance characteristics of the automatic data acquisition system, and provides a high-quality training and reasoning basis for a subsequent diffusion model.
And S4.6, performing forward diffusion and backward diffusion processes by using a conditional diffusion model according to the standardized time series data set to generate time series interpolation data.
In the link of conditional diffusion model and time sequence interpolation, the system utilizes the diffusion probability model technology of the leading edge to process the problem of missing values in the time sequence data.
The core idea of the conditional diffusion model is to consider the data generation process as a step-by-step denoising process, and the technology is particularly suitable for processing website content change data with complex time dependence.
First, the system defines a forward diffusion process, gradually adding gaussian noise to the complete time series data through multiple steps until completely randomized.
The system then trains a neural network to learn the backward diffusion process, i.e., to gradually recover the original signal from the noise. The network usually adopts a U-Net or Transformer architecture, and can effectively capture the long-term dependency relationships of time series data.
After training, the system uses the model to perform conditional generation, namely when missing segments in the time sequence are encountered, a known part of the time sequence is used as a conditional input to guide the model to generate missing parts consistent with the known data.
In particular, the system will leave the known data points unchanged, applying a back diffusion process to only the missing parts, gradually recovering the possible data values from random noise. This conditional generation ensures natural continuity of the interpolation results with the known data. Compared with the traditional interpolation method, the diffusion model can generate a result which is more consistent with the inherent distribution characteristic of the data, especially for updating the data for website contents with complex modes (such as periodicity, trend and sudden change).
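To make the conditioning step concrete, the following highly simplified numpy sketch illustrates only the masking idea: during reverse diffusion, observed points are held at their known values while missing points are denoised. The denoise_step function stands in for the trained U-Net/Transformer and is a placeholder assumption, not a real model.

```python
# Simplified illustration of conditional imputation with a reverse-diffusion loop.
import numpy as np

def denoise_step(x: np.ndarray, t: int) -> np.ndarray:
    """Placeholder for the learned reverse-diffusion update (not a real model)."""
    return x * 0.98

def conditional_impute(series: np.ndarray, observed_mask: np.ndarray, steps: int = 50,
                       rng: np.random.Generator | None = None) -> np.ndarray:
    rng = rng or np.random.default_rng(0)
    x = rng.standard_normal(series.shape)          # missing parts start from pure noise
    for t in reversed(range(steps)):
        x = denoise_step(x, t)                     # reverse-diffusion update everywhere
        x[observed_mask] = series[observed_mask]   # re-impose known data as the condition
    return x

values = np.array([1.0, 1.1, np.nan, np.nan, 1.4])
mask = ~np.isnan(values)
filled = conditional_impute(np.nan_to_num(values), mask)
```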
For example, when an e-commerce web site is temporarily inaccessible due to technical maintenance, the system can predict content changes that may occur during this period of time based on historical access patterns and automatically adjust the prediction after access is resumed. By the advanced time sequence interpolation technology, the system can effectively process the problem of data loss caused by various reasons, ensure the continuity and the integrity of time sequence data and provide a reliable basis for subsequent analysis.
And S4.7, carrying out data analysis by using a diversity sampling algorithm according to the time sequence interpolation data to obtain a prediction uncertainty index.
In the steps of diversity sampling and uncertainty evaluation, the system carries out deep analysis on time series interpolation data generated by the diffusion model, and the reliability of prediction is quantized.
First, the system employs a diversity sampling strategy: rather than generating a single prediction result, it runs the diffusion model multiple times, each time with a different random seed, to generate multiple sets of possible interpolation schemes (typically 50-100 sets). These different schemes together form a prediction distribution reflecting the model's uncertainty about the predicted values at different points in time.
Next, the system calculates statistical features of all sampling results, such as a mean (as a final predicted value), a standard deviation (as an uncertainty measure), a quantile (for constructing a prediction interval), etc., for each time point. The system is particularly concerned with those points in time where the inter-sample variance is large, which generally indicates that the model's predictions at these points are highly ambiguous and may require more real data to verify.
Based on these statistical analyses, the system generates an uncertainty indicator, typically expressed as a confidence score between 0 and 1, or the width of the prediction interval, for each predicted point in time.
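For illustration only, the following numpy sketch summarizes multiple diffusion samples into the statistics described above; the confidence mapping from interval width to a 0-1 score is an assumption of this example.

```python
# Sketch of uncertainty quantification over samples of shape (num_samples, num_time_points).
import numpy as np

def summarize_samples(samples: np.ndarray) -> dict[str, np.ndarray]:
    mean = samples.mean(axis=0)                              # final predicted value per time point
    std = samples.std(axis=0)                                # uncertainty measure
    q05, q95 = np.quantile(samples, [0.05, 0.95], axis=0)    # 90% prediction interval
    confidence = 1.0 / (1.0 + (q95 - q05))                   # narrower interval -> higher confidence
    return {"mean": mean, "std": std, "lower": q05, "upper": q95, "confidence": confidence}

samples = np.random.default_rng(0).normal(loc=10.0, scale=0.5, size=(80, 24))
stats = summarize_samples(samples)
```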
The system may also identify different types of uncertainty sources, such as aleatoric uncertainty (randomness of the data itself) and epistemic uncertainty (limitations of model knowledge). For unusual changes that may be caused by special events (e.g., an e-commerce website's traffic surging on a sales day), the system marks the corresponding points as high-uncertainty areas and signals that special attention may be required.
In addition, the system will continually update these uncertainty estimates over time, and as more observations are obtained, the prediction interval of the model will typically narrow, with reduced uncertainty. Through the comprehensive diversity sampling and uncertainty evaluation, the system not only provides specific predicted values, but also quantifies the reliability degree of the predictions, so that a decision maker can reasonably allocate resources according to the reliability degree of the predictions, and a data acquisition strategy can be planned more scientifically.
And S4.8, according to the prediction uncertainty index, executing a self-adaptive adjustment strategy to update the acquisition frequency parameter, so as to form data with better time continuity and integrity.
In the implementation link of the self-adaptive adjustment strategy, the system intelligently optimizes the acquisition strategy according to the uncertainty evaluation result to form a closed-loop self-adaptive system.
First, the system builds uncertainty threshold rules that classify the predicted time points into multiple levels by uncertainty, such as high certainty (confidence > 0.9), medium certainty (confidence 0.6-0.9), and high uncertainty (confidence < 0.6).
The system then designs a differentiated resource allocation strategy for each level: for highly certain time points, it may reduce the actual acquisition frequency and rely more on model predictions; for moderately certain points, it maintains the normal acquisition frequency; and for highly uncertain points, it significantly increases the acquisition frequency, acquiring more real data to reduce uncertainty. The system also considers the business value of the data: for data with lower business value (such as secondary information on non-core pages), a more conservative acquisition strategy can be adopted even if the prediction is uncertain, while for high-value data (such as contact information updates of important clients), a certain acquisition frequency is maintained even when the prediction is fairly certain, so as to stay on the safe side.
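The threshold rule can be illustrated with the short Python sketch below; the thresholds follow the levels described above, while the specific re-acquisition intervals are illustrative assumptions rather than fixed parameters of the embodiment.

```python
# Sketch: map prediction confidence (and a business-value flag) to a re-acquisition interval.
def acquisition_interval_hours(confidence: float, high_value: bool = False) -> int:
    if confidence > 0.9:
        interval = 24        # high certainty: rely more on model predictions
    elif confidence >= 0.6:
        interval = 12        # medium certainty: keep the normal frequency
    else:
        interval = 3         # high uncertainty: collect real data much more often
    if high_value:
        interval = min(interval, 6)   # high-value data keeps a minimum frequency regardless
    return interval
```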
In addition, the system realizes a dynamic feedback mechanism, and when the actually acquired data has a significant difference from the prediction, the model retraining or parameter adjustment is triggered to adapt to the change of the data distribution.
The system also designs a resource balancing algorithm so that each target website and time point receives a reasonable acquisition resource allocation under the overall resource constraint. This intelligent resource scheduling considers not only the uncertainty of prediction but also practical factors such as network conditions, server load and anti-crawling mechanisms. Through this series of self-adaptive adjustment strategies, the system can maximize resource utilization efficiency while ensuring data quality, achieving optimal time continuity and integrity of the data. As the system's running time increases and data accumulation grows richer, the self-adaptive mechanism becomes increasingly accurate, forming an intelligent acquisition system with continuous self-optimization.
S5 specifically includes: S5.1, collecting historical webpage data containing various website formats and field types.
In the historical webpage data collection link, the system establishes a comprehensive and various training data resource base, and provides a solid foundation for subsequent AI model training.
First, the system systematically collects historical web page data covering various industries and types of websites, paying particular attention to pages containing rich structured information (e.g., merchant names, addresses, phone numbers, etc.). Collection sources are diversified, including published web page archives (e.g., the Wayback Machine of the Internet Archive), historical snapshots provided by commercial data suppliers, collection records accumulated by the long-term operation of the system itself, and the like.
The system classifies the collected pages in a multi-dimensional manner, marking them according to industry (such as catering, retail, service industry, etc.), website type (such as enterprise official websites, e-commerce platforms, social media, etc.), content structure (such as table format, list type, paragraph type, etc.) and technical implementation (such as static HTML, JavaScript rendering and responsive design), ensuring the diversity and representativeness of the training data.
In particular, the system focuses on gathering web page types that are especially challenging, such as unusually complex layouts, non-standard field representations, multi-lingual mixed content, and pages that convey much of their information through images, so as to enhance the generalization capability of the model. In addition, the system also collects data for the same website across different periods, capturing the evolution of website design and content organization, so that the model can adapt to continuously changing web page design styles.
For scarce but important web page types, the system will also employ synthetic data techniques to create more training samples through template variation or content reorganization. All the collected data are subjected to preliminary quality screening, pages with obvious damage, incomplete content or over-high repeatability are removed, version management is carried out, and the sources, acquisition time and basic statistical characteristics of the data are recorded.
Through the systematic historical data collection work, the system constructs a rich training resource library containing various website formats and field types, and lays a solid foundation for subsequent expert labeling and model training.
S5.2, building a training data set through expert annotation, labeling key fields including merchant names, addresses and business scopes, and carrying out functional classification of the telephone numbers in the annotated data.
In the link of expert annotation and training data set establishment, a system organizes professional team to carry out high-quality manual annotation on historical webpage data, and standard answers required by supervised learning are generated.
Firstly, the system makes a detailed labeling guide, and clearly defines various fields (such as merchant names, addresses, operation ranges, business hours and the like) to be identified, standard formats thereof and judging standards of functional classifications (such as switchboard, customer service, sales, technical support and the like) of telephone numbers.
In order to ensure the labeling quality, the system adopts a multi-level auditing mechanism: after a primary annotator finishes the basic labeling, a senior auditor reviews it, and complex or disputed cases are finally judged by domain experts.
The system also implements a performance evaluation and training mechanism of the annotators, and continuously improves the professional level and standard uniformity of the annotating team through regular consistency test and case study.
In the technical aspect, the system develops a special labeling tool, supports efficient field selection, attribute labeling and functional classification, records uncertainty and difficulty rating in the labeling process, and provides important references for subsequent model training. Considering the diversity requirement of the data, the system ensures the balance of the labeling samples in the dimensions of industry distribution, website types, field complexity and the like, and avoids the bias of training data.
For rare but important situations (such as contact information in an unconventional format), the system can specially increase the labeling proportion of corresponding samples, so that the model can process various edge situations.
In addition, the system also implements a data segmentation strategy, and the labeling data is divided into a training set, a verification set and a test set according to the proportion of 8:1:1, so that the distribution similarity of the three sets in each dimension is ensured, and meanwhile, data leakage (such as the scattering of pages from the same website into different sets) is avoided.
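As one way to realize the 8:1:1 split while avoiding leakage across sets, the following sketch keeps all pages of one website in the same partition using scikit-learn's GroupShuffleSplit; the sample structure with a "domain" key is a hypothetical assumption.

```python
# Sketch of a group-aware 8:1:1 train/validation/test split keyed by website domain.
from sklearn.model_selection import GroupShuffleSplit

def split_by_site(samples: list[dict], seed: int = 0):
    """samples: annotated pages, each carrying a 'domain' key used as the grouping unit."""
    groups = [s["domain"] for s in samples]
    idx = list(range(len(samples)))
    outer = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=seed)
    train_idx, rest_idx = next(outer.split(idx, groups=groups))          # ~80% train
    rest_groups = [groups[i] for i in rest_idx]
    inner = GroupShuffleSplit(n_splits=1, test_size=0.5, random_state=seed)
    val_rel, test_rel = next(inner.split(rest_idx, groups=rest_groups))  # remaining 20% -> 10/10
    val_idx = [rest_idx[i] for i in val_rel]
    test_idx = [rest_idx[i] for i in test_rel]
    return list(train_idx), val_idx, test_idx
```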
Through the professional and strict labeling flow, the system establishes a high-quality training data set which contains rich merchant field information and telephone number function classification, and provides reliable supervision signals for the next feature engineering and model training.
And S5.3, carrying out feature engineering processing on the training data set, extracting DOM structural features, text semantic features and position relation features, and constructing a training model based on the DOM structural features, the text semantic features and the position relation features.
In the link of feature engineering and training model construction, the system carries out deep analysis and feature extraction on the labeling data set, and prepares rich input signals for subsequent model training.
First, the system extracts DOM structural features, including HTML tag type, tag nesting depth, element position, CSS class name and ID, element size and visibility, etc., which can reflect the structured information and visual layout of the web page.
For example, the system may identify those elements that are within a particular container (e.g., class= "contact-info") as more likely to contain contact information.
Secondly, the system extracts text semantic features, including word bag representation of text content, TF-IDF features, word embedding vectors, named entity recognition results, etc., which can capture semantic information and entity types of text.
For example, the system may learn an association pattern that identifies keywords such as "contact", "dial", and the like, to telephone numbers.
Again, the system extracts positional relationship features, including relative locations between elements, proximity relationships, possible text-to-digital pairing patterns, etc., which can express spatial relationships between page elements.
For example, the system may learn to identify common layout patterns that pair label text with its corresponding value, such as the label appearing immediately to the left of the value in "phone: 12345678".
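To illustrate the three feature families, the following sketch extracts a few toy features per HTML element using BeautifulSoup as one possible parser; the feature names and cue words are assumptions for this example, not the system's actual schema.

```python
# Illustrative extraction of DOM-structure, text-semantic and positional features per element.
from bs4 import BeautifulSoup

def element_features(html: str) -> list[dict]:
    soup = BeautifulSoup(html, "html.parser")
    feats = []
    for idx, el in enumerate(soup.find_all(True)):
        text = el.get_text(" ", strip=True)
        feats.append({
            # DOM structural features
            "tag": el.name,
            "depth": len(list(el.parents)),
            "css_class": " ".join(el.get("class", [])),
            # text semantic features (toy stand-ins for TF-IDF / keyword cues)
            "has_contact_cue": any(k in text.lower() for k in ("phone", "tel", "contact")),
            "digit_ratio": sum(c.isdigit() for c in text) / max(len(text), 1),
            # positional relation feature: order of appearance in the document
            "doc_position": idx,
        })
    return feats

print(element_features('<div class="contact-info">Phone: 12345678</div>'))
```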
Based on these rich features, the system builds the infrastructure of the training model. For the field extraction task, the system employs an encoder-decoder architecture, where the encoder is responsible for converting page features into high-dimensional representations, and the decoder is responsible for identifying specific fields from those representations.
For telephone number grouping and function identification tasks, the system adopts a sequence labeling framework, treats telephone numbers and contexts thereof as a sequence, and learns and predicts the function label of each telephone number.
The system also realizes feature selection and dimension reduction technology, removes redundant features through methods such as correlation analysis, principal component analysis and the like, and improves model training efficiency.
In addition, the system designs a feature fusion mechanism, can adaptively adjust the weights of different types of features, and optimizes the feature combination aiming at different website structures. Through the systematic characteristic engineering and model construction work, the system provides rich and refined input representation for subsequent deep learning model training, and the learning efficiency and the performance level of the model are greatly improved.
S5.4, constructing a field extraction neural network model based on a pre-trained language model by adopting a transfer learning method, and fine-tuning it for webpage structured information extraction.
In the link of transfer learning and field extraction model construction, the system utilizes the strong semantic understanding capability of the pre-training language model to develop a neural network model specially used for webpage structural information extraction.
First, the system selects an appropriate pre-trained language model, such as BERT, RoBERTa or their Chinese variants, as a basis; these models have already mastered rich language knowledge and semantic understanding capabilities through self-supervised learning on vast amounts of text.
The system then performs domain-adaptive tuning on these pre-trained models, further training the models using a large number of industry-related text (e.g., business description, product introduction, etc.) to better understand domain-specific terms and expressions.
Then, the system designs a special task fine tuning stage, combines the pre-training model with a task specific output layer, and constructs a complete field extraction network.
In terms of the specific architecture, the system adopts an encoder-decoder framework, wherein the encoder, based on the pre-trained model, is responsible for converting the webpage text and its structural characteristics into context-aware vector representations, and the decoder adopts a Conditional Random Field (CRF), pointer network or similar structure, and is responsible for accurately locating and extracting the target fields from those vector representations. The system also innovatively combines text and structure information by converting the HTML structure into special tags inserted into the text sequence, enabling the model to understand both the text content and the semantics of the page structure.
For example, <div class="contact">telephone: 12345678</div> is converted into a mixed input of special tag sequences and text content. In the training process, the system adopts a multi-task learning method, simultaneously optimizing several related objectives (such as field boundary identification, field type classification, entity relation extraction, etc.) so that the model can learn data characteristics from different angles.
The system also implements adversarial training and data augmentation techniques that improve the generalization ability and robustness of the model by generating adversarial samples and transforming existing samples. Through this transfer-learning-based method, the system effectively utilizes the language knowledge contained in the pre-trained model, greatly reduces the dependence on labeled data, and improves the field extraction accuracy of the model under complex webpage structures, especially for field information with variable formats or non-standard expression.
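For illustration, the following sketch shows how a pre-trained encoder can be loaded with a token-classification head for field extraction using the Hugging Face transformers library; the checkpoint name, label set and structure-tag convention are assumptions of this example, and the CRF/pointer decoder, domain-adaptive pre-training and multi-task losses described above are omitted.

```python
# Hedged sketch of the transfer-learning setup for token-level field extraction.
from transformers import AutoTokenizer, AutoModelForTokenClassification

FIELD_LABELS = ["O", "B-NAME", "I-NAME", "B-ADDR", "I-ADDR", "B-PHONE", "I-PHONE"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForTokenClassification.from_pretrained(
    "bert-base-chinese", num_labels=len(FIELD_LABELS)
)

# Structure tags mixed into the text stream, simplified from the description above:
text = "[DIV class=contact] telephone: 12345678 [/DIV]"
inputs = tokenizer(text, return_tensors="pt", truncation=True)
outputs = model(**inputs)                       # logits shape: (1, seq_len, num_labels)
predictions = outputs.logits.argmax(dim=-1)     # per-token label ids before any CRF decoding
```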
S5.5, designing a sequence labeling network structure, and training a telephone number grouping identification model by adopting BiLSTM-CRF architecture.
In the link of sequence labeling model and telephone number grouping identification, the system designs a special neural network architecture for identifying the function types and grouping relations of a plurality of telephone numbers of the same merchant.
Firstly, the system adopts a two-way long-short-term memory network (BiLSTM) as an infrastructure, and the network can effectively capture the front-back dependency relationship of sequence data, and is particularly suitable for processing text and number sequences.
The BiLSTM network receives input features processed by the embedding layer, including the phone number text, surrounding context vocabulary, location information, etc., and learns the sequence patterns in both forward and backward directions. The system then adds a Conditional Random Field (CRF) layer on top of the BiLSTM output layer, forming a BiLSTM-CRF architecture. The CRF layer is able to learn transition probabilities between tags and consider the overall rationality of the tag sequence; for example, a merchant is unlikely to have multiple "headquarters phones" but may well have multiple "branch phones".
This structural design allows the model to not only focus on the local features of a single phone number, but also to take into account overall tag consistency constraints.
In terms of feature engineering, the system builds rich feature vectors for each phone number, including format features of the number itself (such as length, whether an area code is present, and whether it is a mobile number), contextual semantic features (such as function indicators like "customer service" and "sales" appearing nearby), and location features (such as the relative position in the page and the distance from other numbers).
The system also introduces a focus mechanism that enables the model to better focus on contextual information related to phone number function decisions, such as department names or service type descriptions that may occur before and after.
In addition, the system also realizes a multi-head self-attention structure, so that the model can simultaneously pay attention to different types of related information and integrate the related information. Through the BiLSTM-CRF architecture with professional design, the system can accurately identify the function types (such as a switchboard, customer service, sales, technical support and the like) of each of a plurality of telephone numbers under the same merchant, reasonably groups the telephone numbers with similar functions, and provides an important basis for subsequent data integration and display.
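A minimal PyTorch sketch of the BiLSTM-CRF tagger described above is given below; the CRF layer is assumed to come from the third-party pytorch-crf package, the dimensions are illustrative, and the attention mechanisms mentioned above are omitted.

```python
# Sketch of a BiLSTM-CRF sequence tagger for phone-number function labels.
import torch
import torch.nn as nn
from torchcrf import CRF  # assumption: pip install pytorch-crf

class BiLSTMCRFTagger(nn.Module):
    def __init__(self, vocab_size: int, num_tags: int, embed_dim: int = 128, hidden: int = 128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.bilstm = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.emissions = nn.Linear(2 * hidden, num_tags)
        self.crf = CRF(num_tags, batch_first=True)

    def loss(self, tokens: torch.Tensor, tags: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        feats = self.emissions(self.bilstm(self.embedding(tokens))[0])
        return -self.crf(feats, tags, mask=mask)        # negative log-likelihood

    def predict(self, tokens: torch.Tensor, mask: torch.Tensor) -> list[list[int]]:
        feats = self.emissions(self.bilstm(self.embedding(tokens))[0])
        return self.crf.decode(feats, mask=mask)        # Viterbi-decoded tag sequences

# Tag ids could index labels such as switchboard, customer service, sales, technical support.
model = BiLSTMCRFTagger(vocab_size=5000, num_tags=4)
```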
And S5.6, optimizing the hyperparameters of the field extraction network and the telephone number grouping identification model through cross-validation, improving the generalization capability of the field extraction neural network model and the sequence-labeling-based telephone number grouping identification model under different website formats.
In the link of cross-validation and hyperparameter optimization, the system adopts a scientific and rigorous method to evaluate and optimize model performance, so that the models generalize well across various website formats.
Firstly, the system implements K-fold cross-validation, usually with a 10-fold scheme: the training data set is divided into 10 subsets of similar size; in each round, 9 subsets are used to train the model and the remaining one is used for validation; this is repeated 10 times in turn, and the average performance is finally taken as the evaluation index.
The method can comprehensively evaluate the performance of the model under different data distribution, and avoid the deviation possibly caused by a single test set.
Based on the cross-validation results, the system performs comprehensive hyperparameter optimization; the adjusted parameters include network architecture parameters (such as the LSTM hidden layer size, the number of layers, the number of attention heads, etc.), optimizer parameters (such as the learning rate, momentum, weight decay, etc.), regularization parameters (such as the dropout rate, L1/L2 regularization strength, etc.), and training strategy parameters (such as the batch size, learning rate scheduling strategy, early-stopping condition, etc.).
Compared with the traditional grid search or random search, the system adopts a Bayesian optimization method to perform efficient parameter search, and the method can more intelligently explore a parameter space and quickly find out a near-optimal parameter combination.
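For illustration only, the following sketch wraps a 10-fold cross-validation loop in a Bayesian-style hyperparameter search; Optuna is assumed as the optimizer, and train_and_score() is a placeholder for training the extraction model on one fold and returning its validation F1 score.

```python
# Sketch of 10-fold cross-validation inside a Bayesian (TPE) hyperparameter search.
import optuna
from sklearn.model_selection import KFold

def train_and_score(params: dict, train_idx, val_idx) -> float:
    """Placeholder: train with these hyperparameters on train_idx, return validation F1."""
    return 0.8  # stand-in value

def objective(trial: optuna.Trial) -> float:
    params = {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 1e-2, log=True),
        "hidden_size": trial.suggest_categorical("hidden_size", [64, 128, 256]),
        "dropout": trial.suggest_float("dropout", 0.1, 0.5),
        "batch_size": trial.suggest_categorical("batch_size", [16, 32, 64]),
    }
    indices = list(range(10_000))          # illustrative dataset size
    scores = []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(indices):
        scores.append(train_and_score(params, train_idx, val_idx))
    return sum(scores) / len(scores)       # mean cross-validated F1

study = optuna.create_study(direction="maximize")   # TPE sampler by default
study.optimize(objective, n_trials=50)
```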
During the optimization process, the system evaluates model performance separately for different types of web site formats, focusing on performance under special formats (e.g., highly dynamic pages, pages of non-traditional layout) in particular, ensuring that the model does not significantly degrade on certain specific types.
The system also adopts multi-index comprehensive evaluation, and simultaneously considers accuracy, recall rate, F1 score and custom indexes related to specific tasks, such as accuracy of field boundary identification, consistency of telephone number function classification and the like, so as to ensure balanced performance of the model in all aspects.
In addition, the system also implements an error analysis mechanism, which records and classifies error cases on the validation set in detail, identifies the model's weak points and adjusts the architecture or parameters in a targeted manner. Through this systematic cross-validation and hyperparameter optimization flow, the system finally obtains field extraction network and telephone number grouping identification models with stable performance and strong generalization capability, and these models can maintain high-level performance under various website formats and data distributions, thereby providing reliable technical support for practical application.
And S5.7, extracting a neural network model according to the fields, performing automatic analysis on the pages of each merchant group, and identifying and extracting key field information.
In the field extraction neural network application link, the system deploys the optimized model to the actual production environment, and performs automatic analysis and information extraction on pages of each merchant group.
Firstly, the system preprocesses the webpage to be handled, including HTML parsing, text extraction, feature calculation and the like, converting it into model input consistent with the training data format. Then, the system activates the field extraction neural network and feeds the preprocessed page data into the model, which automatically identifies and locates the target fields through multi-layer computation.
In the recognition process, the model calculates probability scores of the various field types, such as "merchant name", "address", "business scope", "time of establishment", etc., for each token in the text, and then finds the optimal field boundaries and type assignment by combining the CRF layer using a forward-backward algorithm.
The process fully utilizes the understanding capability of the pre-training language model to the context and the modeling capability of the sequence labeling model to the tag dependency relationship, and can accurately identify the field information with complex format and expression mode.
For each field identified, the system also calculates a confidence score reflecting the degree of certainty of the model for the extracted result. The system may be particularly concerned with low confidence extraction results and may initiate backup processes such as secondary verification using a rules engine or marking requiring manual review.
In addition, the system also realizes an adaptive processing mechanism that automatically adjusts the processing strategy for different types of webpages, such as increasing the depth of feature extraction for pages with complex structures, or parsing only after rendering for highly dynamic pages.
The system also records intermediate states and attention weight distribution in the processing process, so that subsequent result interpretation and model improvement are facilitated. Through the intelligent field extraction process, the system can accurately identify and extract key merchant information such as merchant names, detailed addresses, operation ranges, established times, registered capital and the like from the mixed webpage content, and the structured field information provides high-quality basic data for subsequent merchant portrait construction and data application.
And S5.8, according to the telephone number grouping recognition model, automatically grouping and functionally recognizing a plurality of telephone numbers of the same merchant, recognizing and extracting telephone number data.
In the application link of the telephone number grouping identification model, the system utilizes a trained BiLSTM-CRF model to automatically identify and group a plurality of telephone numbers of the same merchant.
First, the system collects as input all extracted phone numbers under the same merchant, along with their context information (e.g., descriptive text around the number, the title and URL of the page where it is located, etc.). The system then preprocesses the input data, including normalizing phone number formats, performing word segmentation and tokenization on the context text, computing location features, etc., and converts it into a sequence input form acceptable to the model. Then, the system activates the telephone number grouping recognition model: the model converts the input sequence into vector representations through an embedding layer, extracts sequence features through the BiLSTM layer, and finally outputs the most probable function label for each telephone number through the CRF layer.
This process considers not only the characteristics and direct context of each number itself, but also the interrelationship between numbers and the rationality of the overall tag sequence.
The system attaches function labels (e.g., switchboard, customer service, sales, technical support, complaints, reservations, etc.) identified by the model to the corresponding telephone numbers and groups the numbers initially according to the functional similarity. In addition, the system analyzes the regional characteristics of the number (such as the region to which the area code belongs) and the use scene (such as a private line, 24-hour service and the like), and further refines the grouping information.
For some complications, such as the possibility that a number serves multiple functions at the same time, the system will calculate the probability distribution of multiple tags and select the most dominant function as the primary tag and the other possible functions as the secondary tags.
The system may also record confidence indicators for packet identification, and for low confidence results, may initiate a manual review process.
Through this intelligent telephone number function identification and grouping processing, the system converts originally isolated telephone number data into address book information with definite functional attributes and an organizational structure, greatly improving the practicality and commercial value of the data. Users can quickly find the contact channel suited to a specific purpose as needed, contacting the sales department directly when product information is required, or technical support directly when a technical problem is encountered, thereby significantly improving the efficiency of information inquiry and use.
And S5.9, carrying out association degree analysis on the key field information and the telephone number data, and confirming the main merchant name of the data packet to obtain the structured data set.
In the link of association analysis and main body merchant confirmation, the system carries out deep association analysis on the extracted field information and telephone number data to confirm the main body merchant name and core information of the data packet.
First, the system statistically analyzes the variations in merchant names (e.g., "ABC company", "ABC group", "ABC limited", etc.) that occur in each merchant group, and calculates the frequency of occurrence, location importance (e.g., at the title or salient location), and association strength (co-occurrence relationship with other key fields) of each variation.
Based on these statistics, the system uses a weighted voting mechanism to determine the most likely subject merchant name that will be identified as the primary key for the entire dataset.
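As a purely illustrative sketch of the weighted-voting rule described above, the following example scores name variants by frequency, position importance and association strength; the weighting coefficients and the input record structure are assumptions of this example.

```python
# Sketch: confirm the subject merchant name by weighted voting over observed name variants.
from collections import defaultdict

def pick_subject_name(variants: list[dict]) -> str:
    """variants: e.g. {'name': 'ABC company', 'count': 12, 'in_title': True, 'cooccurrence': 0.7}"""
    scores: dict[str, float] = defaultdict(float)
    for v in variants:
        score = 1.0 * v["count"]                     # frequency of occurrence
        score += 5.0 if v["in_title"] else 0.0       # position importance (title / salient location)
        score += 3.0 * v["cooccurrence"]             # association strength with other key fields
        scores[v["name"]] += score
    return max(scores, key=scores.get)

variants = [
    {"name": "ABC company", "count": 12, "in_title": True, "cooccurrence": 0.7},
    {"name": "ABC group", "count": 4, "in_title": False, "cooccurrence": 0.3},
]
print(pick_subject_name(variants))  # "ABC company"
```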
In addition, the system analyzes the association pattern between the merchant name and the telephone number, and identifies the most core contact way and the corresponding business entity. Under complex conditions, the system also analyzes the hierarchical structure and content organization of the page, distinguishes the relationship between the main business and the subsidiary business, and between the parent company and the subsidiary company, and the like, and ensures the accuracy and hierarchy of the data packet. The system focuses on multiple relationships that may exist, such as different brands or business lines under the same group, by constructing entity relationship maps to represent these complex associations.
Meanwhile, the system can analyze the consistency and complementarity between fields, such as the regional consistency of addresses and telephone number area codes, the corresponding relation between the operating range and specific business departments, and the like, and the reliability of the whole data is improved through the cross verification. For information that is conflicted or inconsistent, the system may intelligently reconcile based on the confidence score and business importance, preferably preserving more reliable and more core information.
And finally, organizing all the information subjected to association analysis and integration into a structured data set which takes a main merchant as a center by the system, clearly displaying key data such as basic information, multi-level contact information, service range and the like of the merchant, and maintaining association relation and hierarchical structure among the information. The structured data set subjected to the deep association analysis and the main body confirmation not only solves the problems of data dispersion and identity confusion, but also provides rich merchant portraits and relational networks, and greatly improves the commercial application value and user experience of the data.
The embodiment of the application also provides a cyclic automation data acquisition system, which comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the above method is implemented when the processor executes the computer program.
The embodiment of the application also provides a computer device, which comprises at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the above cyclic automation data acquisition method.
The embodiment of the application also provides a computer-readable storage medium storing computer instructions for causing a computer to execute the above cyclic automation data acquisition method. The embodiments of the present application also provide a computer program product comprising computer instructions which, when executed by a processor, implement the steps of the above-described cyclic automation data acquisition method.
The application has the following technical effects:
Through the knowledge-graph-based distributed entry discovery mechanism and dynamic priority queue management, more efficient resource allocation and acquisition scheduling are realized, and acquisition efficiency is improved. The link discovery algorithm based on DOM structural features and semantic association improves link value evaluation accuracy and reduces the invalid acquisition rate. The telephone number recognition mechanism combining multiple algorithms greatly improves the accuracy and adaptability of telephone number extraction. The self-attention-based diffusion model is adopted for time series interpolation, effectively solving the problem of data continuity in interruption-recovery and incremental-update scenarios. The AI-based data extraction model can automatically adapt to different website structures, improving the accuracy and coverage of data extraction. The application of multidimensional data fingerprint and differential privacy technology ensures the uniqueness and compliance of data and improves data quality.
By the technical scheme, the method and the device can realize more efficient and more accurate data acquisition under the background of explosive growth of internet information and increasingly complex website structure, and meet the application scene demands of business intelligence, risk monitoring and the like with higher requirements on data quality and timeliness. Meanwhile, the application of the multidimensional data fingerprint and differential privacy technology ensures the uniqueness and compliance of the data, and provides important guarantee for data security and privacy protection.