CN118898234B - A system and method for automatically labeling government big data - Google Patents

A system and method for automatically labeling government big data

Info

Publication number
CN118898234B
CN118898234B (application CN202411397428.4A)
Authority
CN
China
Prior art keywords
data
labeling
task
management module
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411397428.4A
Other languages
Chinese (zh)
Other versions
CN118898234A (en)
Inventor
刘奎
王亚坤
陈垚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hebei Neusoft Software Co ltd
Original Assignee
Hebei Neusoft Software Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hebei Neusoft Software Co., Ltd.
Priority to CN202411397428.4A
Publication of CN118898234A
Application granted
Publication of CN118898234B
Legal status: Active (current)


Abstract

Translated from Chinese


The present invention discloses a system and method for automatic labeling of government big data, and relates to the field of data processing technology. The system includes a data source management module, a label management module, a task management module, an execution node management module, a task execution node and a label verification module; the data source management module is used to obtain data to be labeled from government big data; the label management module is used to create or modify label configuration information according to the data to be labeled; the task management module is used to generate task instructions; the execution node management module is used to determine the task execution node from multiple service nodes; the task execution node is used to label the data to be labeled with corresponding labels according to the labeling rules; the label verification module is used to verify the accuracy of the labeling results according to the mutual relationship between the labeling rules. The present invention can automatically label and verify a large amount of government data, reduce labor costs, and enhance the consistency and accuracy of data labeling.

Description

Automatic labeling system and method for government affair big data
Technical Field
The invention relates to the technical field of data processing, in particular to an automatic labeling system and method for government affair big data.
Background
Government affairs big data is the large volume of data collected, generated, and used by government authorities in their daily operations, and generally includes citizen information, enterprise registration data, public safety records, economic statistics, business information, and so on. Such data is characterized by high value, high dimensionality, and high complexity, and is of great significance for government decision-making, service optimization, and policy making. Managing and analyzing government affairs big data is crucial to improving the transparency, efficiency, and quality of government work, while also raising higher requirements for data security and privacy protection.
With the deepening of digital transformation, government big data is accumulating at unprecedented speed, and its scale and diversity keep increasing. This presents new challenges for the management and analysis of government big data. To improve management efficiency, the government big data needs to be labeled.
However, current government big data labeling work often depends on manual labeling, whose efficiency is limited and which struggles to keep up with the growing data volume; it is costly and easily influenced by subjectivity, making the accuracy and consistency of labeling results difficult to guarantee.
Disclosure of Invention
In view of the above technical problems and defects, the invention aims to provide an automatic labeling system and method for government affairs big data that can automatically label a large amount of government data and verify the labeling results, thereby improving data labeling efficiency, reducing labor costs, and enhancing the consistency and accuracy of data labeling.
In order to achieve the above purpose, in a first aspect, the invention provides an automatic labeling system for government affairs big data, which comprises a data source management module, a label management module, a task management module, an execution node management module, a task execution node, and a label verification module; the data source management module is used for acquiring data to be labeled from government affairs big data; the label management module is used for creating or modifying label configuration information according to the data to be labeled, wherein the label configuration information comprises label classifications and labeling rules; the task management module is used for creating a labeling task according to the data to be labeled and the label configuration information and generating a task instruction; the execution node management module is used for determining the task execution node from a plurality of service nodes after receiving the task instruction; the task execution node is used for executing the labeling task so as to label the data to be labeled with the corresponding label classifications according to the labeling rules, obtaining labeling results; and the label verification module is used for verifying the accuracy of the labeling results according to the interrelationships among the labeling rules.
By adopting this automatic labeling system for government affairs big data, the automation and intelligence of data processing are realized through the integrated modules, and the efficiency and accuracy of labeling government data are significantly improved. The data source management module ensures efficient acquisition of data; the flexibility of the tag management module allows rapid adaptation to new data features and business requirements; and the task management module optimizes resource allocation through intelligent task creation and scheduling. The dynamic task allocation mechanism of the execution node management module guarantees quick response to and processing of tasks, and the accurate execution of the task execution node directly produces a preliminary labeling result. Finally, the label verification module automatically verifies the labeling result by analyzing the complex relationships among the labeling rules, ensuring high-quality output of data labeling. The invention can automatically label a large amount of government data and verify the labeling results, thereby improving the processing efficiency and reliability of data labeling, reducing manual labeling operations, lowering labor costs, and enhancing the consistency and accuracy of data labeling.
In some embodiments, the tag verification module is further configured to adjust the sensitivity of the verification policy according to the verification result through a logistic regression model.
By adopting the technical scheme of the embodiment, the tag verification module realizes dynamic adjustment of the sensitivity of the verification strategy by applying the logistic regression model so as to respond to the continuously changed data characteristics and labeling requirements. The self-adaptive adjustment mechanism enables the system to optimize model parameters according to actual verification results, so that annotation errors can be more accurately identified and corrected. With the lapse of time and the accumulation of data, the system improves the verification accuracy, reduces false alarm and missing report by continuous learning and optimization, and ensures the high accuracy and high reliability of the government big data labeling result.
In some embodiments, the logistic regression model may include the following formula:

p(y = 1 | x) = 1 / (1 + e^(-(α + βx)));

where y is the target variable representing the verification result, p(y = 1 | x) is the probability that y = 1 given the feature x, α is the intercept term of the model, and β is the coefficient of the model.
By adopting the technical scheme of this embodiment, based on the formula in the logistic regression model, the system can more accurately estimate the probability that each label is correct, thereby adjusting the verification rules and improving the accuracy and efficiency of the system. This probability-based verification method allows the system to adjust flexibly when facing complex and changeable data, ensures the quality of data labeling, reduces false positives and false negatives, and improves the overall reliability of government affairs big data labeling.
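The patent does not give an implementation of this adaptive verification mechanism. The following is a minimal sketch under stated assumptions: the feature x, the threshold-update rule, and all parameter values are invented for illustration and are not taken from the patent.

```python
import math

def verification_probability(x: float, alpha: float, beta: float) -> float:
    """p(y=1|x) = 1 / (1 + e^-(alpha + beta*x)): estimated probability a label is correct."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def adjust_sensitivity(threshold: float, false_positives: int, false_negatives: int,
                       step: float = 0.05) -> float:
    """Nudge the verification threshold toward fewer of the dominant error type."""
    if false_positives > false_negatives:
        return max(0.05, threshold - step)   # flagging too many correct labels: relax
    if false_negatives > false_positives:
        return min(0.95, threshold + step)   # missing labeling errors: tighten
    return threshold

# x could be, e.g., a rule-agreement score computed for one labeled record
p = verification_probability(x=2.0, alpha=-1.0, beta=1.5)
needs_review = p < 0.5   # flag low-confidence labels for verification
```

In this sketch the threshold plays the role of the "sensitivity of the verification strategy": raising it flags more labels for review, lowering it flags fewer.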
In some embodiments, the execution node management module is specifically configured to monitor an online state and a load condition of the service node in real time, and determine a task execution node according to the online state and the load condition.
By adopting the technical scheme of this embodiment, the real-time monitoring function of the execution node management module ensures the efficient operation of the automatic labeling system and the optimal allocation of resources. By monitoring the online state and load of the service nodes in real time, the system can dynamically adjust task allocation, avoid overload, and ensure quick response to tasks. This intelligent load-balancing strategy improves the stability and reliability of the system, increases data-processing throughput, and ensures the continuity and efficiency of labeling tasks.
In some embodiments, the service node is distributed in a plurality of servers, and the execution node management module is further configured to provide a registration service for the service node after the service node is deployed.
By adopting the technical scheme of the embodiment, the distributed deployment and registration service of the service nodes provides high expandability and flexibility for the automatic government big data labeling system. By deploying service nodes on multiple servers, the system is able to handle larger scale data sets while the registration service ensures fast integration and unified management of new nodes. The distributed architecture enhances the fault tolerance of the system, and even if part of nodes fail, the execution of the whole labeling task is not influenced, so that the persistence and stability of the labeling of government data are ensured.
In some embodiments, the automatic government affair big data labeling system further comprises a task execution recording module and an information display module, wherein the task execution recording module is used for recording the task execution condition of the labeling task, and the information display module is used for receiving the task execution condition and displaying the task execution condition to a user.
By adopting the technical scheme of the embodiment, the comprehensive labeling task monitoring and displaying capability is provided for the user through the task execution recording module and the information displaying module. The recording module records each link of task execution in detail, and the display module presents the information to the user in an intuitive mode, so that the user can know the task progress and execution condition in real time. The transparency not only enhances the trust of the user to the system, but also provides powerful support for system operation and task management.
In some embodiments, the task execution node is further configured to transmit the obtained labeling result to a result database for saving after executing the labeling task.
By adopting the technical scheme of this embodiment, the task execution node transmits the labeling results to a result database for storage, providing the automatic labeling system with data persistence and reuse capabilities. Centrally stored labeling results facilitate subsequent data analysis, auditing, and reprocessing, and ensure the long-term value and traceability of the data. At the same time, this facilitates data sharing and exchange, and promotes the openness and interconnection of government affairs data.
In some embodiments, the automatic government affair big data labeling system further comprises a report generating module, wherein the report generating module is used for generating a government affair big data labeling report according to the task execution condition and the labeling result through a template engine and a natural language technology.
By adopting the technical scheme of this embodiment, the report generation module automatically generates annotation reports through a template engine and natural language technology, greatly improving the report-generation efficiency and quality of the automatic labeling system. The module can rapidly produce structured, information-rich reports according to task execution conditions and labeling results, reduces the workload of manually writing reports, and ensures the consistency and professionalism of the reports. Automated report generation and distribution provides timely and accurate data support for government decision makers.
In some embodiments, the data source management module is further configured to associate and aggregate the data to be annotated according to the data identification through the data exchange platform.
By adopting the technical scheme of the embodiment, the data source management module carries out data association and aggregation through the data exchange platform, and provides strong data integration capability for the automatic government affair big data labeling system. The aggregation method based on the data identification not only improves the organization and availability of the data, but also provides possibility for cross-department and cross-system data sharing and collaborative work. The method has important significance for breaking information islands and realizing centralized management and efficient utilization of government affair data.
The automatic labeling method for the government affair big data is applied to the automatic labeling system for the government affair big data, and comprises the steps of obtaining to-be-labeled data from the government affair big data, creating or modifying label configuration information according to the to-be-labeled data, wherein the label configuration information comprises label classification and labeling rules, creating a labeling task according to the to-be-labeled data and the label configuration information, generating a task instruction, determining a task execution node from a plurality of service nodes after receiving the task instruction, executing the labeling task through the task execution node, labeling the to-be-labeled data with corresponding labels according to the labeling rules to obtain labeling results, and checking the accuracy of the labeling results according to the interrelationship among the labeling rules.
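The claimed method steps can be summarized as a minimal runnable sketch. Every function, field, and threshold below is invented for illustration (node selection is omitted for brevity; it belongs to the execution node management module):

```python
# Hypothetical sketch of the claimed method flow; no names here come from the patent.
def acquire(big_data):                 # step 1: obtain the data to be labeled
    return [r for r in big_data if "name" in r]

def create_task(data, configs):        # steps 2-3: label configuration + task creation
    return {"data": data, "configs": configs}

def execute(task):                     # steps 4-5: an execution node applies labeling rules
    return [{**r, "labels": [c["label"] for c in task["configs"] if c["rule"](r)]}
            for r in task["data"]]

def verify(results):                   # step 6: check interrelationships between rules
    # Example relationship: scale labels are mutually exclusive
    return all(len(r["labels"]) <= 1 for r in results)

configs = [{"label": "small_enterprise", "rule": lambda r: r["employees"] < 100},
           {"label": "large_enterprise", "rule": lambda r: r["employees"] >= 1000}]
task = create_task(acquire([{"name": "A", "employees": 50}]), configs)
results = execute(task)
ok = verify(results)
```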
For the technical effects of the method according to the second aspect of the invention, reference may be made to the description of the automatic labeling system according to the first aspect.
The one or more technical schemes provided by the invention have at least the following technical effects or advantages:
1. The automatic labeling system for the government big data realizes the high automation of the data processing flow. From the data acquisition of the data source management module, the intelligent label configuration of the label management module and the automatic task generation of the task management module to the dynamic task allocation and execution of the execution node management module, the whole system reduces the manual intervention and improves the speed and the scale of data processing. The automation not only improves the efficiency, but also enhances the accuracy and consistency of data annotation by reducing human errors.
2. The invention strengthens the data security and the quality control by introducing the tag verification module. The label checking module checks the label accuracy according to the logic relation between the labeling rules, and ensures the high-quality output of the data. In addition, the use of the task execution recording module and the information display module improves the transparency of the system operation, so that the data labeling process is traceable and auditable, and the reliability of the data and the trust degree of the system are further enhanced.
3. The invention realizes resource optimization and load balancing by performing real-time monitoring and intelligent scheduling on the service nodes by the execution node management module. The distributed deployment and registration service of the service nodes are combined with the function of monitoring the on-line state and the load condition in real time, so that the tasks can be reasonably distributed according to the current load and the capacity of each node. The intelligent scheduling mechanism improves the overall performance of the system, ensures the high efficiency and stability of task execution, and can keep the response speed and processing capacity of the system even under the condition of high load.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention. It is evident that the drawings in the following description are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art. In the drawings:
FIG. 1 is a schematic diagram of an architecture of an automatic annotation system for government big data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an architecture of another automatic annotation system for government big data according to an embodiment of the present invention;
fig. 3 is a flow chart of an automatic labeling method for government big data according to an embodiment of the invention.
Detailed Description
The terminology used in the following embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in the specification of the present invention, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of one or more of the listed items.
The terms "first" and "second" are used below for descriptive purposes only and should not be construed as indicating relative importance or implying the number of technical features indicated. Thus, a feature defined with "first" or "second" may explicitly or implicitly include one or more such features. In the description of embodiments of the invention, unless otherwise indicated, "a plurality" means two or more.
It should also be noted that, unless explicitly stated or limited otherwise, the terms "disposed," "connected," and the like in the embodiments of the present invention should be construed broadly. For example, the "connection" may be a fixed connection, a detachable connection, or an integral connection, may be a mechanical connection, or an electrical connection, may be a direct connection, or an indirect connection via an intermediate medium, may be communication between two elements, or may be a wired communication connection or a wireless communication connection. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances. The following describes embodiments of the present invention in detail.
Taking enterprise management as an example, the process of manually marking government big data in the related art generally proceeds as follows. First, a worker extracts relevant information from structured and unstructured data from different sources, such as enterprise registration information, tax records, and market supervision reports. They then manually categorize and label the information according to the business rules and requirements of government authorities, such as "industry category", "enterprise scale", and "credit rating". During this process, staff may need to read a large number of text reports, review various documents submitted by enterprises, and even judge enterprise compliance against specific legal or policy criteria.
Manual labeling is not only inefficient but also easily influenced by the subjective judgment of staff, causing inconsistency and errors in labeling results. In addition, as the number of enterprises grows and data accumulates, the workload and complexity of manual labeling keep increasing, posing challenges to the data analysis and decision support of government institutions.
Compared with manual annotation, this embodiment provides an automatic labeling system for government affairs big data, which can increase the speed and scale of data processing and reduce errors and inconsistencies caused by human factors. The system can continuously process and analyze large amounts of data, automatically identify government big data information, accurately classify it and apply preset labels such as enterprise type, scale, and credit grade, and verify the marking results after labeling, greatly improving labeling efficiency and accuracy. In addition, the system has self-learning and optimization capabilities, so labeling accuracy and efficiency continue to improve as time passes and data accumulates. The automated process also reduces dependence on human resources, lowers long-term operating costs, and enhances the flexibility and scalability of the system, enabling it to adapt to a constantly changing environment of government requirements and policies.
The invention provides an automatic labeling system for government affair big data, which comprises a data source management module 101, a label management module 102, a task management module 103, an execution node management module 104, a task execution node 105 and a label verification module 106, as shown in fig. 1.
The data source management module 101 is used for acquiring data to be marked from government affair big data.
In this embodiment, the data source management module 101 is the data input core of the automatic labeling system for government affairs big data, and is responsible for efficiently and accurately acquiring data to be labeled from government affairs big data. By integrating various database drivers and data interfaces, the module can connect seamlessly to structured data sources such as MySQL and Oracle and to unstructured data sources such as text files and PDF documents. It provides data extraction, synchronization, and preprocessing functions, ensuring data quality and consistency.
For example, in the field of city management, the module can extract video stream data from a city monitoring system in real time or acquire vehicle violation record data from a traffic management department, and provide raw materials for subsequent automatic labeling work.
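The patent names MySQL/Oracle and text sources but gives no API, so the adapter below is purely illustrative: `sqlite3` stands in for a real database driver, and all class, method, and table names are assumptions.

```python
# Illustrative data-source adapter layer; every name here is an assumption.
import sqlite3

class DataSourceManager:
    def fetch_structured(self, conn: sqlite3.Connection, table: str) -> list[dict]:
        """Pull rows from a relational source as dictionaries."""
        conn.row_factory = sqlite3.Row
        # Table name is assumed to come from trusted configuration, not user input.
        rows = conn.execute(f"SELECT * FROM {table}").fetchall()
        return [dict(r) for r in rows]

    def fetch_unstructured(self, text: str) -> list[dict]:
        """Minimal preprocessing for a text source: one record per non-empty line."""
        return [{"text": line.strip()} for line in text.splitlines() if line.strip()]

# Demo with an in-memory database standing in for a government registry
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE enterprises (name TEXT, employees INTEGER)")
conn.execute("INSERT INTO enterprises VALUES ('Enterprise A', 50)")
records = DataSourceManager().fetch_structured(conn, "enterprises")
```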
The tag management module 102 is configured to create or modify tag configuration information according to the data to be marked. The tag configuration information comprises tag classification and labeling rules.
In this embodiment, the tag management module 102 may automatically analyze the features and contents of the data to be marked through an integrated machine learning algorithm and natural language processing technology, and intelligently identify key information and patterns in the data. Based on these analysis results, tag configuration information may be automatically created or updated, including defining new tag classifications and setting corresponding labeling rules.
For example, if new business registration information is encountered in the system, the tag management module 102 can automatically identify the industry and scale to which the business belongs and generate or adjust the corresponding tag without human intervention. In addition, the module can continuously optimize and refine the label system according to the statistical rule and the relevance in the data, and ensure the accuracy and timeliness of the labels, thereby realizing an automatic label management process without manual operation. The intelligent label management not only improves the efficiency and reduces human errors, but also enables the system to adapt to rapidly-changing data environments and business requirements.
On the other hand, the tag management module 102 also allows the user to dynamically create or modify tag configuration information based on the characteristics and business requirements of the data to be tagged. This process typically involves defining new label classifications, such as "type of business" or "policy domain," and setting specific labeling rules for those classifications. The user may enter rules through an intuitive interface, for example, automatically categorizing the size of the business based on the business' annual revenue and employee count, or automatically marking policy issues based on keywords in the text content. In addition, the tag management module 102 also supports adjustments to existing tags and rules to accommodate policy changes or the evolution of data characteristics. For example, if new regulations require additional categorization of businesses, the user may add new tags to the tag management module 102 and define corresponding labeling logic, and the system will then categorize and label the new data to be labeled according to these updated rules. The flexibility and configurability of this module ensures that the labeling process accurately reflects the latest business needs and policy guidelines.
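The patent specifies only that label configuration information holds label classifications and labeling rules. One possible shape, with fields and thresholds invented for the example, is:

```python
# Hypothetical representation of "label configuration information"; all field names
# and numeric thresholds are assumptions, not taken from the patent.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LabelConfig:
    classification: str                       # e.g. "enterprise_scale"
    rule: Callable[[dict], Optional[str]]     # maps a record to a label, or None

def enterprise_scale_rule(record: dict) -> Optional[str]:
    revenue = record.get("annual_revenue", 0)
    staff = record.get("employees", 0)
    if revenue < 5_000_000 and staff < 100:       # illustrative thresholds
        return "small_enterprise"
    if revenue < 50_000_000 and staff < 1000:
        return "medium_enterprise"
    return "large_enterprise"

# A user (or the module itself) can create or modify configurations at runtime
configs = {"enterprise_scale": LabelConfig("enterprise_scale", enterprise_scale_rule)}
label = configs["enterprise_scale"].rule({"annual_revenue": 2_000_000, "employees": 40})
```

Adding a new regulation-driven classification then amounts to inserting another `LabelConfig` into `configs`, which matches the runtime flexibility the text describes.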
The task management module 103 is configured to create a labeling task according to the data to be labeled and the label configuration information, and generate a task instruction.
In this embodiment, the task management module 103 is an intelligent task planning and distribution center in the automatic labeling system for government big data, and automatically designs and generates labeling tasks by analyzing a data set to be labeled and existing label configuration information.
The task management module 103 first evaluates the features and requirements of the dataset and then intelligently constructs task parameters and execution flows according to the tag classification and labeling rules provided by the tag management module 102. The task management module 103 then encapsulates this information into specific task instructions, including specific requirements for data annotation, priority, expected completion time, etc. These task instructions are then automatically assigned to the execution node management module 104, ensuring that each labeling task is efficiently and accurately initiated and executed.
For example, if the data to be annotated is a new set of enterprise registration records, the task management module 103 creates an annotation task based on the enterprise scale and industry classification rules defined in the tag configuration information, automatically generates task instructions containing these rules, and prepares the task for distribution to the appropriate execution nodes for processing. The automatic flow not only improves the efficiency of task creation, but also ensures the rationality of task allocation and the consistency of execution.
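The text lists what a task instruction carries (labeling requirements, priority, expected completion time) but gives no schema; a plausible sketch, with all field names and conventions assumed, is:

```python
# Hypothetical task-instruction structure; field names and the priority convention
# (lower number = more urgent) are assumptions for illustration.
from dataclasses import dataclass
import datetime

@dataclass
class TaskInstruction:
    task_id: str
    dataset: str                     # reference to the data to be labeled
    rules: list[str]                 # names of labeling rules to apply
    priority: int                    # lower number = more urgent (assumed)
    expected_completion: datetime.datetime

instr = TaskInstruction(
    task_id="T-001",
    dataset="enterprise_registrations_2024",
    rules=["enterprise_scale", "industry_category"],
    priority=1,
    expected_completion=datetime.datetime(2024, 10, 1, 18, 0),
)
```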
The execution node management module 104 is configured to determine, after receiving the task instruction, a task execution node 105 from a plurality of service nodes.
In this embodiment, the execution node management module 104 plays roles of task allocation and resource optimization in the automatic government affair big data labeling system. Upon receiving a task instruction from the task management module 103, the module first evaluates the status of the various service nodes in the current system, including their workload, performance metrics, and availability.
Then, based on the information and the specific requirements of the task, the executing node management module 104 uses a built-in intelligent scheduling algorithm, such as polling, least connection or priority queue, to select a most suitable node from the plurality of service nodes to execute the task, where the selected service node is the task executing node 105. The module may also consider factors such as the complexity of the task, the sensitivity of the data, the geographical location of the executing node, and the expected execution time in the selection process to ensure that the task is completed efficiently, stably, and safely.
For example, if a labeling task needs to process a large amount of data and has a high requirement on processing speed, the execution node management module 104 may select a service node with high computing power and low current load to execute the task, thereby achieving optimal allocation of resources and high efficiency of task execution.
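The patent names round-robin, least-connection, and priority-queue strategies without detailing them; the least-load variant below is a sketch in which the node fields and blend weights are invented:

```python
# Sketch of load-aware node selection; fields and weights are assumptions.
from dataclasses import dataclass

@dataclass
class ServiceNode:
    node_id: str
    online: bool
    cpu_load: float      # 0.0 - 1.0, reported by real-time monitoring
    connections: int     # current active labeling tasks

def pick_execution_node(nodes: list[ServiceNode]) -> ServiceNode:
    """Among online nodes, pick the one with the lowest blended load score."""
    candidates = [n for n in nodes if n.online]
    if not candidates:
        raise RuntimeError("no online service node available")
    return min(candidates, key=lambda n: 0.7 * n.cpu_load + 0.3 * (n.connections / 100))

nodes = [ServiceNode("n1", True, 0.85, 40),
         ServiceNode("n2", True, 0.30, 10),
         ServiceNode("n3", False, 0.05, 0)]   # offline nodes are never chosen
chosen = pick_execution_node(nodes)
```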
The task execution node 105 is configured to execute the labeling task so as to label the data to be labeled with the corresponding label classifications according to the labeling rules, obtaining a labeling result.
The labeling results comprise the label classifications assigned to each data item; these labels reflect the features, attributes, or categories of the data. The labeling results turn originally unstructured or semi-structured data into ordered, queryable data, facilitating further analysis, storage, retrieval, and management, and forming an indispensable information foundation for application scenarios such as government decision-making, policy making, and resource optimization.
In this embodiment, the task execution node 105 is an execution unit in the automatic labeling system for large government data, and is responsible for performing actual labeling processing on the data according to task instructions and labeling rules. When the task execution node 105 receives a task instruction from the execution node management module 104, it will first parse the specific requirements and labeling rules of the task.
The automated annotation engine on the node is then started, and the data assigned to it to be annotated is analyzed and identified using built-in machine learning models and algorithms. In this process, the executing node automatically identifies the key information in the data according to predefined labeling rules, such as keyword matching, classification algorithm or cluster analysis, and matches the key information with the corresponding label classification. For example, if the task is to label the enterprise data, the execution node will automatically label the data as a category such as "small enterprise", "medium enterprise" or "large enterprise" according to information such as business income and asset size of the enterprise.
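The enterprise-size rule in the example above can be sketched as a simple threshold table; the revenue thresholds below are invented for illustration and do not reflect any official classification standard.

```python
# Illustrative rule-based labeling; revenue thresholds (in 万元, i.e.
# ten-thousands of yuan) are hypothetical, not an official standard.
def label_enterprise(revenue_wan_yuan: float) -> str:
    """Map an enterprise's business income to a size label."""
    if revenue_wan_yuan >= 40000:
        return "large enterprise"
    if revenue_wan_yuan >= 2000:
        return "medium enterprise"
    return "small enterprise"

labels = [label_enterprise(r) for r in (500, 5000, 100000)]
```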
After the labeling is completed, the execution node updates the labeling result into the database, and simultaneously feeds back the task completion state and any necessary metadata to the task management module 103, so that the accuracy and traceability of the whole labeling process are ensured.
In this way, the task execution node 105 realizes automatic labeling of data, greatly improves processing efficiency and accuracy, and reduces the need for manual intervention.
According to the automatic labeling system for the government affair big data, the rapid, accurate and automatic processing of a large amount of government affair data is realized, the data processing efficiency is remarkably improved, the labor cost is reduced, and the consistency and accuracy of data labeling are enhanced. Firstly, the system realizes the effective acquisition of large-scale government affair data through the data source management module, and provides a data basis for the labeling process. And secondly, the label management module is introduced, so that the creation and modification of label configuration information become more flexible and accurate, and the system can dynamically adjust a label system according to the data characteristics and business requirements. The intelligent task creation and instruction generation of the task management module further improve the efficiency and accuracy of task allocation. And the real-time monitoring and intelligent scheduling of the execution node management module ensure the balanced load and efficient execution of tasks among a plurality of service nodes. Finally, the task execution node marks the data according to the established rule to obtain a marked result, thereby realizing automatic classification and marking of the data. In the whole, the system remarkably improves the efficiency of government affair data marking through an automatic technical means, reduces the labor cost, ensures the accuracy and consistency of marking results, and provides powerful data support for government affair decision.
In some embodiments, the executing node management module 104 is specifically configured to monitor the online status and the load status of the service node in real time, and determine the task executing node 105 according to the online status and the load status.
In this embodiment, the executing node management module 104 tracks the online status and load status of each service node through a real-time monitoring mechanism. The executing node management module 104 may continuously collect data regarding node availability and workload using heartbeat detection, resource usage monitoring, performance index analysis, and the like. For example, by monitoring key performance indicators such as CPU usage, memory usage, network latency, and response time, the executing node management module 104 is able to accurately evaluate the current workload of each node.
In addition, by implementing the fault detection and automatic recovery policies, the executing node management module 104 is able to quickly respond to the offline or fault status of the node, ensuring high availability of the system. When receiving the task instruction, the executing node management module 104 comprehensively considers the online states and the load conditions of all the nodes, and selects the most suitable service node to execute the task by applying an intelligent scheduling algorithm, such as load balancing, priority allocation, failover, and the like. The dynamic task allocation strategy ensures that the task can be efficiently and uniformly executed among a plurality of nodes, optimizes the resource utilization rate and improves the response speed and stability of the whole labeling flow.
In some embodiments, the service node is distributed in a plurality of servers, and the execution node management module 104 is further configured to provide a registration service for the service node after the service node is deployed.
In this embodiment, the service nodes adopt a distributed deployment strategy in the automatic government big data labeling system, spanning multiple servers to enhance the scalability and fault tolerance of the system. The execution node management module 104 is not only responsible for monitoring the online status and load status of these service nodes in real time, but also assumes an important role in service node lifecycle management. When a new service node completes deployment and prepares to join the labeling job, the execution node management module 104 provides a registration service, allowing the node to register its own information with the system through a standardized registration procedure, including node identification, processing power, current state, and the like. After registration succeeds, the service node is brought into the system's node pool, starts to receive task instructions from the execution node management module 104, and participates in the distributed processing of data annotation.
The registration mechanism ensures the flexibility and dynamic expansion capability of the system and simultaneously provides convenience for unified management and maintenance of the service nodes.
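A minimal sketch of such a registration service, assuming an in-memory registry keyed by node identifier (the registry structure and field names are illustrative):

```python
# Hypothetical node registration sketch: an in-memory registry keyed by
# node id; a real system would persist this and add authentication.
import time

node_registry = {}

def register_node(node_id: str, capacity: int) -> bool:
    """Add a newly deployed service node to the pool; reject duplicates."""
    if node_id in node_registry:
        return False
    node_registry[node_id] = {
        "capacity": capacity,        # declared processing power
        "state": "idle",             # current state at registration
        "registered_at": time.time(),
    }
    return True

ok = register_node("annotator-01", capacity=8)   # first registration
dup = register_node("annotator-01", capacity=8)  # duplicate rejected
```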
In some embodiments, the task execution node 105 is further configured to transmit the obtained labeling result to the result database 111 for saving after executing the labeling task.
Specifically, after successfully executing the labeling task, the task execution node 105 is further responsible for transmitting the labeling results to the result database 111 for storage after analyzing the data to be labeled according to the predefined labeling rules and labeling the corresponding label classifications. This process involves storing the annotated dataset or metadata, including tag information and possibly analysis results, in a structured form into database 111, ensuring persistence and traceability of the data. The results database 111 serves as a central repository, providing not only auditing and monitoring capabilities for labeling results for system administrators, but also a basis for data sharing and further analysis for other modules or systems.
In this way, the task execution node 105 ensures the integrity and availability of labeling results, supporting the automation of government decisions and business processes.
In some embodiments, the data source management module 101 is further configured to associate and aggregate the data to be annotated according to the data identification through the data exchange platform.
In particular, the data source management module 101 uses data identifiers, such as unified social credit codes, organization codes, or unique item identifiers, to identify and correlate data to be annotated from different sources through tight integration with the data exchange platform (DXP).
The data source management module 101 first extracts data from a plurality of disparate data sources and then matches and associates the data according to a predefined data model and identification through the interfaces and services of the data exchange platform. For example, if the data source includes tax records, business registration information, and marketing administration reports, the data source management module 101 may identify business identifications in these data sets, and associate different data records of the same business to form a comprehensive data view. In this way, the module can gather the originally isolated data into a unified data set, and provide accurate and consistent original materials for subsequent automatic labeling and analysis. The process not only improves the availability and quality of the data, but also provides richer and comprehensive data support for government decision.
In some embodiments, the data to be annotated includes structured data and unstructured data.
For structured data, the data source management module 101 enables efficient association and aggregation through well-defined data models and query logic. These data are typically stored in a relational database, with a fixed format and schema. The data source management module 101 uses SQL queries or specific data access APIs to identify and extract relevant data records according to preset data association rules, such as primary key and foreign key relationships. The primary key and the foreign key are key concepts in databases for ensuring data integrity and establishing relationships between tables. The primary key is a field or combination of fields in a table that uniquely identifies each record, and the foreign key is a field or combination of fields in one table that references the primary key of another table, thereby establishing a relationship between the two tables. The use of foreign keys not only ensures the referential integrity of the data, but also allows the database to automatically maintain the consistency of the related data in a child table when records in the parent table are updated or deleted.
For example, by unifying social credit codes as the associated fields, the data source management module 101 can retrieve and consolidate multidimensional data for the same enterprise from enterprise registration information, tax records, and market regulatory databases to form a comprehensive data set. In addition, the data source management module 101 may also apply data cleansing and transformation techniques to ensure consistency and accuracy of structured data of different data sources during the aggregation process.
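The key-based association described above can be sketched as a merge over a shared identifier; the three source lists and the credit-code value below are fabricated examples.

```python
# Sketch of key-based association: records from three hypothetical sources
# are merged on the unified social credit code into one view per enterprise.
from collections import defaultdict

registration = [{"uscc": "91SAMPLE001", "name": "Acme Tech"}]
tax          = [{"uscc": "91SAMPLE001", "tax_paid": 120.0}]
supervision  = [{"uscc": "91SAMPLE001", "penalties": 0}]

def aggregate(*sources):
    """Merge records sharing the same unified social credit code."""
    merged = defaultdict(dict)
    for source in sources:
        for record in source:
            merged[record["uscc"]].update(record)
    return dict(merged)

views = aggregate(registration, tax, supervision)
```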
The association and aggregation of unstructured data is more complex because such data, such as text files, images, video, etc., has no fixed format or structure. The data source management module 101 in this case would employ Natural Language Processing (NLP), computer vision, and other artificial intelligence techniques to identify and extract key information in the data. For example, for text data, the data source management module 101 may use text mining techniques to identify and mark topics, entities, and emotional trends in the document, and for image and video data, image identification and scene analysis techniques may be used to extract features of the visual content. Through these techniques, the data source management module 101 can abstract key information in unstructured data and correlate with structured data or other unstructured data, such as matching text descriptions in news stories or social media posts through events or scenes identified in the images, to achieve cross-modal data aggregation.
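For text data, the keyword-matching variant of the rules mentioned above can be sketched as follows; a production system would use NLP models instead, and the topic keyword lists here are invented.

```python
# Toy keyword-matching sketch for text tagging; the topics and keyword
# lists are invented for illustration, not taken from the system.
TOPIC_KEYWORDS = {
    "taxation": ["tax", "invoice", "levy"],
    "environment": ["emission", "pollution", "waste"],
}

def tag_text(text: str):
    """Return sorted topic tags whose keywords appear in the text."""
    lowered = text.lower()
    return sorted(
        topic for topic, words in TOPIC_KEYWORDS.items()
        if any(w in lowered for w in words)
    )

tags = tag_text("The plant reported its emission data with the tax filing.")
```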
The tag verification module 106 is configured to verify the accuracy of the labeling result according to the correlation between the labeling rules.
The interrelationships may include, but are not limited to, exclusive relationships, subordinate relationships, conditional dependencies, and the like, and may be one or more of them.
The accuracy of the labeling result is checked to obtain the checking result, and the accuracy and the reliability of the labeling result are reflected. The verification result is to analyze mutual relations such as mutual exclusion relations, subordinate relations, conditional dependence relations and the like among marking rules, and determine whether the label of each data item correctly reflects the characteristics and classification of the label by using methods such as logic judgment, statistical analysis or machine learning algorithm and the like. The verification result not only comprises the confirmation of correct labeling, but also covers the identification of wrong or suspicious labeling, provides a basis for the correction of the subsequent labeling result, and ensures the quality and labeling precision of the data.
In this embodiment, the tag verification module 106 may retrieve the labeling result from the result database 111, and verify whether the labeled tag in the labeling result is accurate, or may directly obtain the labeled tag from the task execution node 105 to verify after the task execution node 105 completes the labeling task.
The tag verification module 106 is a key component in the government affair big data automatic labeling system, and is responsible for ensuring the accuracy and consistency of the tag. The module performs a verification process by analyzing and understanding complex interrelationships between labeling rules, such as mutual exclusion, dependencies, and conditional dependencies.
Specifically, the tag verification module 106 may verify the accuracy of the labeled tags by analyzing and comparing predefined correlations between the labeled tags in the labeling results. These relationships may include mutual exclusion relationships that ensure that the same data item is not wrongly assigned to two non-coexisting tags, dependency relationships that ensure that the hierarchy of tags is correct, e.g., a more specific tag is valid only when its broader class of tags exist, and conditional dependencies where assignment of certain tags may depend on the presence or value of other tags. The relationships can be automatically detected and verified by utilizing a logic judgment and rule engine, and when contradiction or non-compliance with the rule is found, the module marks potential errors and prompts manual auditing or automatic adjustment of the labels so as to ensure the logic and accuracy of data labeling. The rule-based verification mechanism is a key link for ensuring the labeling quality of government affair data, and is beneficial to reducing errors and improving the reliability of the data.
For example, in processing enterprise data: under a mutual exclusion relationship, the tag verification module 106 will check whether the same enterprise is mistakenly marked as both a "small enterprise" and a "large enterprise"; under a subordinate relationship, the tag verification module 106 will verify, for an enterprise marked as "export oriented", whether its industry is reasonably marked as "manufacturing"; and under a conditional dependency, the tag verification module 106 will verify whether an enterprise is eligible for a "government subsidy" based on its "credit rating" tag.
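These three relationship checks can be sketched as a small rule table; the rules and label names below mirror the examples in the text and are illustrative only.

```python
# Illustrative rule tables mirroring the three relationship types in the
# text; real rule sets would be configuration-driven.
MUTUALLY_EXCLUSIVE = [("small enterprise", "large enterprise")]
REQUIRES = {"export oriented": "manufacturing"}  # subordinate relation
CONDITIONAL = {"government subsidy": ("credit rating", {"A", "AA", "AAA"})}

def verify(labels: set, attributes: dict):
    """Return a list of rule violations for one data item."""
    errors = []
    for a, b in MUTUALLY_EXCLUSIVE:
        if a in labels and b in labels:
            errors.append(f"mutually exclusive: {a} / {b}")
    for child, parent in REQUIRES.items():
        if child in labels and parent not in labels:
            errors.append(f"{child} requires {parent}")
    for label, (attr, allowed) in CONDITIONAL.items():
        if label in labels and attributes.get(attr) not in allowed:
            errors.append(f"{label} needs {attr} in {sorted(allowed)}")
    return errors

errs = verify({"small enterprise", "large enterprise", "government subsidy"},
              {"credit rating": "B"})
```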
Through the intelligent verification mechanism, the tag verification module 106 can automatically find and correct errors in the labeling process, so that the labeling quality and reliability of the whole system are improved.
By combining the functional modules, the government affair big data automatic labeling system provided by the embodiment realizes automation and intellectualization of data processing through the integrated multiple modules, and remarkably improves the efficiency and accuracy of government affair data labeling. The data source management module ensures efficient acquisition of data, the flexibility of the tag management module allows for rapid adaptation to new data features and business requirements, and the task management module optimizes resource allocation through intelligent task creation and scheduling. The dynamic task allocation mechanism of the execution node management module guarantees the quick response and processing of the task, and the accurate execution of the task execution node directly produces a preliminary labeling result. Finally, the label verification module performs strict verification on the labeling result by deeply analyzing the complex relationship among the labeling rules, so that high-quality output of data labeling is ensured. The invention can automatically label a large amount of government affair data and verify the labeling result, thereby improving the data labeling processing efficiency, reducing manual labeling operation, reducing labor cost and enhancing the consistency and accuracy of data labeling.
In some embodiments, the tag verification module 106 may automatically adjust the sensitivity of the verification rules through a logistic regression model based on the verification results.
The tag verification module 106 is adopted, and dynamic adjustment of the sensitivity of the verification strategy is realized through application of a logistic regression model so as to respond to the continuously changing data characteristics and labeling requirements. The self-adaptive adjustment mechanism enables the system to optimize model parameters according to actual verification results, so that annotation errors can be more accurately identified and corrected. With the lapse of time and the accumulation of data, the system improves the verification accuracy, reduces false alarm and missing report by continuous learning and optimization, and ensures the high accuracy and high reliability of the government big data labeling result.
The sensitivity of the verification rule is adjusted to ensure that the government big data automatic labeling system can keep better labeling quality and efficiency in different data sets and application scenes. Due to the diversity and complexity of data, a single verification rule may not be suitable for all conditions, resulting in excessive false positives or false negatives of the verification result. By adjusting the sensitivity, correct labels and errors can be identified more accurately, so that the workload of manual auditing is reduced, and the accuracy of automatic labels is improved. In addition, as time passes and data is accumulated, the data characteristics and the labeling rules may change, and the sensitivity of the verification rules is adjusted regularly to help the system adapt to the changes, so that long-term labeling performance and data quality are ensured.
Specifically, the tag verification module 106 may continuously monitor and analyze the verification result, and may identify labeling error conditions frequently occurring in the data labeling process from the verification result, where the labeling error conditions include, but are not limited to, mislabeling, missing labeling, or inconsistent labels.
The tag verification module 106 then adjusts the sensitivity of the rules or retrains the verification model based on these labeling error conditions to better accommodate changes in the data or labeling habits of the user.
For example, if a particular tag error occurs frequently, the sensitivity of the verification rule to the labeling result of the tag may be enhanced, such as the tag verification module 106 may increase the verification severity of the tag, add more checkpoints, or adjust the verification logic.
In this embodiment, the logistic regression model is a statistical model applied to the two-classification problem, which can be used to estimate the probability of occurrence of an event.
Specifically, the logistic regression model may include the following formula:
p(y=1|x) = 1 / (1 + e^(-(α + βx)));
Wherein:
p (y=1|x) is the probability that the sample belongs to the positive class (label 1, i.e. y=1) given the feature x. y represents the target variable, i.e. the verification result. In a two-class problem, y can take two values, typically denoted by 0 and 1, y=1 representing a Positive class (Positive class) representing the occurrence of the predicted event and y=0 representing a negative class (NEGATIVE CLASS) representing the non-occurrence of the predicted event.
Specifically, x is a model input feature, and can be obtained from original data through data processing steps such as data cleaning, feature selection, feature extraction, feature scaling and the like.
The data cleaning comprises the steps of removing noise, processing missing values, removing duplication, modifying and the like on the original data so as to ensure the quality of the data.
Feature selection is the selection of data features from raw data that are related to a target variable based on domain knowledge, statistical analysis, or feature selection algorithms (e.g., recursive feature elimination, model-based feature selection, etc.).
Feature extraction is the conversion and encoding of the selected features so that they fit the model's input. In particular, text data is encoded to convert text categories into a numerical representation, which involves converting category data in text form into a numerical form that a machine learning model can process. For example, in processing enterprise data, the industry to which an enterprise belongs may be category data in text form, such as "manufacturing" or "financial services". To use these text categories in the logistic regression model, each unique text category may be converted to a numeric vector using a coding technique such as one-hot encoding. Assume the text category feature "industry type" is to be processed, and it includes three different industry categories: "science and technology", "finance", and "manufacturing". Three new binary feature columns may be created, corresponding to the three industries respectively. For an enterprise belonging to the "science and technology" industry, the "science and technology" column is set to 1, and the "finance" and "manufacturing" columns are both set to 0. Accordingly, if the enterprise belongs to the "finance" industry, the "finance" column is 1 and the rest are 0. Thus, each enterprise is assigned a unique numerical vector according to its industry type; for example, a "science and technology" enterprise corresponds to the vector [1, 0, 0], a "finance" enterprise to [0, 1, 0], and a "manufacturing" enterprise to [0, 0, 1]. This converts the text category into a numerical representation.
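The one-hot encoding walked through above can be expressed in a few lines (stdlib only):

```python
# One-hot encoding of the "industry type" example above.
def one_hot(category: str, vocabulary: list) -> list:
    """Return a binary vector with 1 at the category's position."""
    return [1 if category == v else 0 for v in vocabulary]

INDUSTRIES = ["science and technology", "finance", "manufacturing"]
vec = one_hot("finance", INDUSTRIES)  # [0, 1, 0]
```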
Feature scaling is the scaling of features to lie in the same numerical range to avoid excessive sensitivity of the model to certain features, and common methods include min-max normalization or normalization.
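A minimal sketch of min-max normalization, one of the scaling methods mentioned:

```python
# Min-max normalization: rescale values into the [0, 1] range.
def min_max_scale(values):
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant feature: no spread to scale
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

scaled = min_max_scale([10, 20, 30])   # [0.0, 0.5, 1.0]
```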
Through the above series of data processing steps, the associated data may be combined into a feature vector x containing n data elements, represented as x = [x1, x2, …, xn].
Further, y represents whether the data item is correctly tagged with a particular tag. For example, if it is being checked whether a data item is correctly marked as a "large enterprise", y will be marked as 1 if the data item is indeed a "large enterprise"; if it is not, y will be marked as 0. The logistic regression model predicts p(y=1|x) from the input feature x, i.e., the probability that the data item is correctly labeled as the positive class (e.g., "large enterprise") given feature x. This probability can then be used in decisions, such as setting a threshold: when the predicted probability exceeds this threshold, the prediction is considered to be the positive class, otherwise the negative class.
α is the intercept term of the model, corresponding to the output of the linear term when all features x are 0.
β is the coefficient vector of the model, representing the degree of influence of the feature x on the output. Specifically, β may be composed of multiple model coefficients and represented as β = [β1, …, βn]^T, n ≥ 1, where β1, …, βn are the model coefficients corresponding to the elements x1, …, xn of x. Here α and β are model parameters; they may be obtained through a model training process that uses historical labeling data to optimize the model parameters so as to minimize prediction error. Specifically, at the start of training, α and β may each be assigned an initial value, and then their values are continuously adjusted according to the gradient of the loss function by iteratively applying gradient descent or another optimization algorithm. Each iteration aims to refine the model parameters so as to predict the target variable y more accurately, until model performance reaches an optimum or meets a stop condition. The finally determined α and β enable the model to predict the most accurate tag classification probability p(y=1|x) from the input feature x, thereby effectively supporting the tag verification module's automatic adjustment of the sensitivity of the verification rules.
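The training loop described above can be sketched with stochastic gradient ascent on the log-likelihood; the single-feature setup, tiny dataset, and learning rate are illustrative, not the system's actual training procedure.

```python
# Minimal sketch of fitting the logistic model by stochastic gradient
# ascent on the log-likelihood; dataset and hyperparameters are invented.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit(xs, ys, lr=0.5, epochs=2000):
    """Fit alpha and beta for a single-feature logistic model."""
    alpha, beta = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid(alpha + beta * x)
            alpha += lr * (y - p)        # d(log-likelihood)/d(alpha)
            beta += lr * (y - p) * x     # d(log-likelihood)/d(beta)
    return alpha, beta

# feature: scaled patent count; label: 1 = correctly tagged as positive class
xs = [0.1, 0.2, 0.8, 0.9]
ys = [0, 0, 1, 1]
alpha, beta = fit(xs, ys)
p_high = sigmoid(alpha + beta * 0.85)   # near 1 for a clear positive
p_low = sigmoid(alpha + beta * 0.15)    # near 0 for a clear negative
```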
α + βx is the linear combination used for linear prediction, which converts the input feature x into a real value.
e is the natural constant, approximately equal to 2.71828.
1 / (1 + e^(-(α + βx))) is a Sigmoid function that maps the value of α + βx into the (0, 1) interval, representing a probability.
In the tag verification module 106, if certain specific types of data are found to be frequently mislabeled, the verification stringency for these data can be increased by adjusting the parameters (α and β) in the logistic regression model. Specifically, the model can be made more sensitive to the erroneously labeled features by increasing the weights of these features (increasing the absolute value of β), thereby improving the accuracy of the verification.
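How raising |β| increases verification stringency can be seen numerically: with the same feature value and audit threshold, the larger coefficient pushes the predicted probability past the flagging threshold. The threshold and values below are illustrative.

```python
# Illustration of sensitivity tuning: increasing |beta| makes the same
# feature deviation produce a more extreme probability, so borderline
# items are flagged for audit sooner. Values are invented.
import math

def p_correct(x, alpha, beta):
    """Predicted probability that the label is correct."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

x = -0.4                           # feature slightly indicating a wrong label
lenient = p_correct(x, 0.0, 1.0)   # beta = 1: p ~ 0.40
strict = p_correct(x, 0.0, 4.0)    # beta = 4: same x, p ~ 0.17

THRESHOLD = 0.3                    # below this, route to manual audit
flag_lenient = lenient < THRESHOLD
flag_strict = strict < THRESHOLD
```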
In this embodiment, the application of the logistic regression model provides a powerful prediction tool for the tag verification module of the automatic government big data labeling system, so that the system can accurately predict the probability p (y=1|x) that the verification result y is positive (i.e. labeled correctly) based on the given feature x. By accurately estimating this probability, the model helps the system identify high risk false labels, thereby giving priority to manual auditing or automatic adjustment of these labels. In addition, the introduction of the model parameters alpha and beta enables the system to adjust the verification strategy according to the data characteristics and optimize the sensitivity of the verification rule so as to adapt to different data characteristics and service requirements. The data driving-based method not only improves the accuracy of verification, but also enhances the self-adaptive capacity of the system, and ensures the high quality and high efficiency of government data marking.
The following illustrates how a logistic regression model may be used to further verify the accuracy of these labeling results and adjust the sensitivity of the verification strategy based on the verification results.
The automatic labeling system for government affair big data in the embodiment is used for labeling enterprise data, and particularly identifying enterprises in the scientific and technological industry. The embodiment can check whether the enterprise belongs to the scientific and technological industry according to various characteristics (such as research and development investment, patent quantity, product type and the like) of the enterprise, and comprises the following steps:
(1) Initial model setting:
Feature vector: x = [x1, x2, x3], representing the input features, where x1 is the enterprise's R&D investment, x2 is its patent count, and x3 is its number of technicians.
A logistic regression model is determined that includes the following formula:
p(y=1|x) = 1 / (1 + e^(-(α + β1x1 + β2x2 + β3x3)));
Where y = 1 means the enterprise belongs to the science and technology industry, α is the intercept term, and β1, β2, and β3 are the model coefficients.
(2) Verification:
Collect and label data: collect a batch of enterprise data and label it preliminarily using the automatic labeling system.
Verification execution: verify the labeling results using the logistic regression model, where the model output p(y=1|x) represents the probability that the enterprise belongs to the science and technology industry.
Analyze the verification results: analyze the consistency between the probabilities output by the model and the actual labels, identifying cases with high labeling error or uncertainty.
(3) Adjusting the verification strategy:
Identify problems: determine whether the prediction accuracy of certain feature combinations is low, or whether certain enterprises are incorrectly labeled or missed.
Adjust the coefficients: adjust the values of the model parameters α, β1, β2, and β3 according to the verification results. For example, if the patent count proves particularly important for distinguishing science and technology enterprises, the value of β2 can be increased to raise its weight in the model, thereby tuning the model's sensitivity to the patent-count feature.
Retrain the model: retrain the model using the adjusted coefficients so that its predictions on new data are expected to improve.
(4) Applying the adjusted model:
Re-verification: label and verify new enterprise data using the adjusted model to confirm whether the adjustment is effective.
Continuous optimization: continuously adjust and optimize the model coefficients according to newly collected verification results, so as to adapt to changes in the data and improve labeling accuracy.
Through the flow, the logistic regression model can verify the accuracy of the labeling results, adjust the sensitivity of the verification strategy according to the verification results, and ensure the accuracy of the labeling results and the self-adaptive capacity of the system. The data driving-based method improves the performance and the reliability of the automatic government affair big data labeling system.
In some embodiments, the tag verification module 106 may further have an adaptive learning mechanism, which can obtain a verification result after verifying the accuracy of the labeling result, and automatically adjust the verification policy according to the verification result or user feedback (where the user manually verifies the labeling result), and modify parameters in the verification policy. Such adaptive learning mechanisms are typically based on machine learning models, particularly those with online learning or incremental learning capabilities.
In some embodiments, the automatic adjustment function of the tag verification module 106 may be implemented by integrating adaptive learning algorithms based on an adaptive learning mechanism, which are capable of dynamic optimization of policies and parameters based on verification results and user feedback. The following is a detailed procedure for achieving this function:
1. Data collection: first collect verification result data, including detected errors, user correction behavior, and labeling accuracy statistics.
2. Error analysis: analyze the collected data to identify common error types, false positive and false negative cases, and the conditions under which users most often make corrections.
3. Model training: retrain the verification model on the analysis results using machine learning techniques such as supervised learning or reinforcement learning, so as to improve its error recognition capability.
4. Parameter optimization: automatically adjust the parameters of the verification strategy, such as thresholds, weights, or rule sets, to better fit the data characteristics and user behavior.
5. User feedback loop: establish a feedback mechanism that uses the user's corrections as part of the training data, so that the system can learn and adapt to the user's expectations and standards.
6. Continuous monitoring: after the strategy and parameters have been adjusted, continuously monitor the verification results to confirm that the improvements are effective and to provide a basis for further optimization.
7. Automated deployment: once the adjusted verification policies and parameters have been validated, deploy the changes automatically, without manual intervention.
8. Transparency and interpretability: keep the adjustment process transparent by providing explanations and reasons for each adjustment, so that users understand why a particular modification was made.
9. Performance evaluation: periodically evaluate the performance of the verification module on metrics such as accuracy, recall, and F1 score, to ensure that verification quality keeps improving.
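The parameter-optimization step above can be sketched as a simple online tuner. This is a minimal illustration, assuming a verification rule that flags a label whenever model confidence falls below a threshold; the class name, feedback codes, and step size are all hypothetical stand-ins for the learned models the text describes.

```python
# Minimal sketch of adaptive parameter optimization (step 4), assuming a
# verification rule that flags a label when model confidence falls below
# a threshold. All names and values here are illustrative.

class ThresholdTuner:
    """Adjusts a verification confidence threshold from user feedback."""

    def __init__(self, threshold=0.8, step=0.01, lo=0.5, hi=0.99):
        self.threshold = threshold
        self.step = step
        self.lo, self.hi = lo, hi

    def update(self, feedback):
        # "false_positive": the verifier flagged a label the user kept,
        #   so the threshold is too strict -> relax it slightly.
        # "false_negative": the verifier passed a label the user corrected,
        #   so the threshold is too lax -> tighten it slightly.
        if feedback == "false_positive":
            self.threshold = max(self.lo, self.threshold - self.step)
        elif feedback == "false_negative":
            self.threshold = min(self.hi, self.threshold + self.step)
        return self.threshold

tuner = ThresholdTuner()
for fb in ["false_negative", "false_negative", "false_positive"]:
    tuner.update(fb)
print(round(tuner.threshold, 2))  # 0.81
```

In a real deployment the update rule would typically come from retraining on logged corrections rather than a fixed step size, but the feedback loop has the same shape.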
Through this adaptive learning mechanism, the tag verification module 106 can evolve continuously, improving its intelligence and efficiency and reducing dependence on manual auditing, while ensuring the accuracy and consistency of the data labeling.
In some embodiments, the tag verification module 106 may also provide an intelligent rule discovery function: by applying unsupervised learning algorithms, the module can automatically identify and learn patterns and association rules in the data. It analyzes correctly labeled datasets to identify common labeling patterns, abnormal patterns, and the underlying data distribution. For example, clustering algorithms can reveal natural groupings in the data, and association rule learning can uncover latent relationships between tags. These findings help the system propose new labeling rules or optimize existing ones, and even allow it to identify labeling errors in the data automatically, without explicit guidance. Intelligent rule discovery also helps the system adapt to new data characteristics or changes, updating its verification logic without manual intervention.
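As one concrete, deliberately simplified reading of association rule learning over tags, the sketch below counts tag co-occurrence and emits pairwise rules above support and confidence thresholds. The record format, tag names, and threshold values are assumptions, not part of the original system.

```python
# Illustrative association-rule discovery over labeled records, where each
# record is assumed to be a set of tags. Thresholds are example values.
from collections import Counter
from itertools import combinations

def find_rules(records, min_support=0.5, min_confidence=0.9):
    """Return (antecedent, consequent, confidence) rules between tag pairs."""
    n = len(records)
    singles = Counter(tag for r in records for tag in set(r))
    pairs = Counter(frozenset(p) for r in records
                    for p in combinations(sorted(set(r)), 2))
    rules = []
    for pair, cnt in pairs.items():
        if cnt / n < min_support:          # prune rare pairs
            continue
        a, b = tuple(pair)
        for x, y in ((a, b), (b, a)):      # try both rule directions
            conf = cnt / singles[x]
            if conf >= min_confidence:
                rules.append((x, y, conf))
    return rules

records = [
    {"industry:IT", "scope:software"},
    {"industry:IT", "scope:software"},
    {"industry:IT", "scope:software"},
    {"industry:retail", "scope:goods"},
]
print(sorted(find_rules(records)))
```

A discovered rule such as "scope:software implies industry:IT with confidence 1.0" can then be promoted into a labeling or verification rule, which is the adaptation path the text describes.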
In some embodiments, the tag verification module 106 may also provide a multidimensional verification function, performing a more comprehensive check of tag accuracy that considers not only the logical consistency among tags but also temporal plausibility, spatial accuracy, and semantic consistency.
For example, the tag verification module 106 may check whether annotations of the same entity at different points in time show a reasonable evolution, or whether a geographic location tag matches the actual geographic data. In addition, using natural language processing techniques, the tag verification module 106 may evaluate the semantic consistency of text data, ensuring that a tag accurately reflects the deeper meaning of the data.
Such multidimensional verification can markedly improve the overall quality and reliability of the data annotation.
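Two of these dimensions can be illustrated with minimal checks. The field names, region bounding box, and plausibility criteria below are illustrative assumptions; a production system would use real geographic data and richer trend models.

```python
# Simplified sketches of temporal and spatial verification. The data
# shapes (year/value pairs, lon/lat bounding boxes) are assumptions.

def check_temporal(history):
    """Timestamps must be strictly increasing and values non-negative,
    e.g. an entity's registered capital over successive years."""
    increasing = all(t1 < t2 for (t1, _), (t2, _) in zip(history, history[1:]))
    return increasing and all(v >= 0 for _, v in history)

def check_spatial(tag, lon, lat, regions):
    """A geographic tag must match the bounding box of the tagged region."""
    box = regions.get(tag)
    if box is None:
        return False
    min_lon, min_lat, max_lon, max_lat = box
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat

regions = {"Hebei": (113.4, 36.0, 119.9, 42.6)}  # rough illustrative box
print(check_temporal([(2021, 100.0), (2022, 120.0), (2023, 150.0)]))  # True
print(check_spatial("Hebei", 114.5, 38.0, regions))                   # True
```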
In some embodiments, the tag verification module 106 also provides an automated repair suggestion function: when a labeling error or inconsistency is identified, the module does not merely flag the problem but also proposes a possible fix. This may be accomplished by applying data analysis techniques such as decision trees, probabilistic models, or deep learning to predict the most likely correct tag. Repair suggestions may be generated from context information, historical verification results, and the user's corrective actions. For example, if an enterprise's "industry type" tag does not match its "business scope" tag, the tag verification module 106 may suggest updating the "industry type" tag to a value consistent with the "business scope". This function can greatly improve the efficiency of the labeling workflow and reduce the workload of manual auditing and correction.
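The "industry type" / "business scope" example might look like the following sketch, where a small lookup table stands in for the decision tree, probabilistic model, or deep learning model the text mentions; the mapping values and field names are hypothetical.

```python
# Illustrative repair-suggestion sketch: when "industry_type" conflicts
# with "business_scope", propose the industry value implied by the scope.
# The mapping is a hypothetical stand-in for a learned model.

SCOPE_TO_INDUSTRY = {
    "software development": "information technology",
    "food retail": "wholesale and retail",
}

def suggest_repair(record):
    """Return a repair suggestion dict, or None if the record is consistent."""
    expected = SCOPE_TO_INDUSTRY.get(record.get("business_scope"))
    if expected and record.get("industry_type") != expected:
        return {"field": "industry_type",
                "current": record.get("industry_type"),
                "suggested": expected}
    return None

fix = suggest_repair({"industry_type": "agriculture",
                      "business_scope": "software development"})
print(fix["suggested"])  # information technology
```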
In the related art, government big data has several particularities: high sensitivity, strict compliance requirements, broad data relevance, and high demands on data accuracy and timeliness. Such data often contains personal privacy information, enterprise operating data, and the bases of government decisions, so its security, accuracy, and lawful use are of particular concern.
Given these particularities of government big data, the tag verification module 106 of this embodiment can provide strong safeguards for labeling such data through its adaptive learning, intelligent rule discovery, multidimensional verification, and automated repair suggestion functions. The module ensures the accuracy and consistency of government data during automatic labeling, reduces the risk of decision errors caused by mislabeling, improves the efficiency and responsiveness of data processing, and meets the timeliness requirements of government decision-making. In addition, the module's intelligent nature helps it accommodate dynamic changes in government data, providing governments with a flexible, reliable, and efficient tool for data labeling and verification, thereby enhancing both the quality of government data management and the data support for government decisions.
In some embodiments, as shown in fig. 2, the automatic government big data labeling system may further include one or more of a task execution record module 107, an information presentation module 108, a report generation module 109, and a user management module 110.
In some embodiments, the task execution recording module 107 is configured to record the execution of labeling tasks on the task execution node 105, and the information presentation module 108 is configured to receive this execution information and present it to the user.
The task execution recording module 107 serves as the system's auditor, recording the execution of each labeling task in detail, including key information such as the task's start time, end time, volume of data processed, labeling accuracy, and any errors or anomalies encountered. These records not only help system administrators monitor and evaluate the progress and effectiveness of labeling tasks, but are also critical for continuous improvement and troubleshooting of the system.
The information presentation module 108, as part of the user interface, receives the task execution data provided by the task execution recording module 107 and presents it to the user intuitively in the form of charts, reports, or logs. The user can thus follow task status, labeling quality, and system performance in real time, and make timely decisions and adjustments. The design of the information presentation module 108 emphasizes user experience, ensuring that information is presented clearly and understandably so that users can quickly grasp both the overall picture and the details of a labeling task.
Through the cooperative work of the two modules, the automatic labeling system not only improves the efficiency and accuracy of data processing, but also enhances the transparency of the system and the friendliness of user interaction.
In some embodiments, the report generation module 109 is configured to generate a government big data labeling report from the task execution information and the labeling results, using a template engine and natural language generation technology.
The report generation module 109 is an intelligent component in the government big data automatic labeling system, and automatically generates a labeling report by combining a template engine and natural language generation technology.
The report generation module 109 first receives detailed task execution and labeling result data from the task execution recording module 107, and then extracts key information and statistics according to a predefined report template or a user-specified format. The module converts these data into fluent, accurate text descriptions using natural language generation techniques, while preserving the professionalism and readability of the report content.
For example, the report generation module 109 may automatically summarize the overall labeling accuracy, data coverage, major trends, or potential problems found. In addition, the report generation module 109 can add charts, images, or other visual elements as needed to enhance the expressiveness and persuasiveness of the report.
Finally, the report output by the report generation module 109 not only comprehensively reflects the labeling of the government big data but is also presented in a format that is easy to understand and share, greatly improving the efficiency and quality of report generation and reducing the workload of writing reports by hand.
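A minimal sketch of the template-driven part of this flow, using Python's standard `string.Template` as a stand-in for the template engine (the natural language generation step is omitted); the statistic names and report wording are illustrative.

```python
# Template-based report generation sketch. The stats dict mirrors the kind
# of data the task execution recording module would supply; field names
# are assumptions.
from string import Template

TEMPLATE = Template(
    "Labeling report: task $task processed $total records; "
    "$labeled were labeled (accuracy $acc%)."
)

def render_report(stats):
    acc = round(100.0 * stats["correct"] / stats["labeled"], 1)
    return TEMPLATE.substitute(task=stats["task"], total=stats["total"],
                               labeled=stats["labeled"], acc=acc)

print(render_report({"task": "T-01", "total": 1000,
                     "labeled": 980, "correct": 931}))
```

A production system would layer chart generation and NLG polish on top of this skeleton, but the extract-then-fill structure is the same.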
In some embodiments, the user management module 110 is configured to provide user management services including user authentication, role management, and rights control, and provide corresponding system access restrictions according to the rights of the user.
The user management module 110 may limit the user's operation and editing rights to the tag management module 102, the task management module 103, the execution node management module 104, and other functional modules.
For example, an ordinary user may be able only to view the tag configuration information in the tag management module 102, without the right to modify it; the task management module 103 may allow only authorized users to create and schedule labeling tasks; and operations on the execution node management module 104 may be limited to a system administrator for monitoring and maintaining the service nodes. Through such hierarchical rights management, the user management module 110 ensures the security of government data, avoids unauthorized access and the potential risk of data disclosure, and guarantees the transparency and accountability of system operations.
The user management module 110 can provide comprehensive user management services to ensure security and compliance of the system by authenticating users, assigning roles, and setting rights. The module verifies the user identity by implementing a user authentication mechanism to ensure that only authorized users can access the system. In addition, it is responsible for role management, allowing a system administrator to assign users to different roles, each with different responsibilities and access levels, according to their work responsibilities and needs. The rights control function further refines the access restrictions to system resources, ensuring that users can only access data and functions corresponding to their roles. For example, an average user may be able to view reports and perform basic queries only, while an administrator is able to make modifications to user management, system configuration, and data labeling rules.
Through these measures, the user management module 110 ensures transparency and accountability of system operation while preventing unauthorized access and potential risk of data leakage.
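The hierarchical rights model described above can be sketched as a simple role-to-permission mapping. The role names and permission strings below are illustrative, though the module names mirror those in this document.

```python
# Role-based access control sketch: each role maps to a set of permitted
# actions; access is granted only if the action is in the role's set.

ROLE_PERMISSIONS = {
    "viewer": {"tag_management:view", "report:view"},
    "operator": {"tag_management:view", "task_management:create",
                 "task_management:schedule", "report:view"},
    "admin": {"tag_management:view", "tag_management:edit",
              "task_management:create", "task_management:schedule",
              "node_management:maintain", "user_management:edit",
              "report:view"},
}

def is_allowed(role, action):
    """Deny by default: unknown roles and unlisted actions are rejected."""
    return action in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("viewer", "tag_management:edit"))     # False
print(is_allowed("admin", "node_management:maintain"))  # True
```

Deny-by-default lookup is the key design choice here: a missing role or permission entry can never accidentally grant access.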
The automatic government affair big data labeling system is an integrated and automatic solution. The labeling process of the government affair data is optimized and simplified through a series of modules which work cooperatively.
The system begins with a data source management module 101 that is responsible for efficiently obtaining data to be annotated from structured and unstructured data sources. The tag management module 102 then automatically creates or updates tag configuration information, including tag classification and labeling rules, based on the data characteristics and business requirements. The task management module 103 further intelligently creates labeling tasks according to the data to be labeled and the label configuration information, generates task instructions, and distributes the task instructions to the most suitable service nodes for execution by the execution node management module 104. After receiving the task instruction, the task execution node 105 automatically labels the data by using advanced machine learning technology and transmits the result to the result database for storage.
In addition, the system further comprises a user management module 110 for ensuring safety and compliance, a task execution recording module 107 for recording task execution conditions, and an information display module 108 for displaying the task execution conditions to a user. The report generation module 109 automatically generates reports based on task performance and labeling results, while the predictive analysis module provides prospective data trend analysis.
Overall, the system remarkably improves the efficiency, accuracy and safety of government affair data marking through an automation and intelligent technology, reduces labor cost and enhances the flexibility and maintainability of the system.
The automatic government affair big data labeling system of the embodiment has wide application scenes, and is specifically as follows:
1. Enterprise liveness analysis:
Application: strengthening precise services for, and supervision of, market entities. Accurate enterprise liveness analysis models for different industries, regions, and enterprise scales are obtained through these technical means.
Function: helping to assess structural problems in regional economic development, enabling local governments to precisely identify zombie enterprises, and providing decision references so that regional governments can formulate corresponding policies to help such enterprises transform or exit the market.
2. Risk early warning analysis:
Application: the system monitors the credit status of enterprises or individuals and automatically identifies potential risk signals such as legal litigation, asset auctions, and tax violations.
Function: providing early risk warnings to financial institutions and supervisory departments, helping them take measures early and reduce potential losses.
3. Key investment project analysis:
Application: the system labels data related to investment projects, including project progress, use of funds, and expected benefits.
Function: enabling government departments to effectively monitor major investment projects, ensure that investment returns are maximized, and discover and resolve problems in time.
4. Intelligent approval:
Application: the system automatically labels various data in the approval process, such as the completeness and compliance of application materials.
Function: raising the level of intelligence of the approval process, reducing the workload of manual review, accelerating approvals, and improving the efficiency of government services.
5. Policies benefiting enterprises:
Application: the system labels government policies that benefit enterprises and, at the same time, analyzes enterprises' qualifications and conditions, achieving precise matching between policies and enterprises.
Function: helping enterprises quickly learn about and apply for applicable policy support, while enabling governments to implement such beneficial measures more effectively, promoting enterprise development and economic growth.
In each of these scenarios, the automatic government big data labeling system can deliver efficient, accurate analysis results through automated data processing and intelligent labeling, providing strong data support for government decisions. In this way, the system not only improves the efficiency of government data processing but also enhances the scientific rigor and accuracy of decision-making, which is of great significance for advancing the digital transformation of government affairs.
The invention provides an automatic labeling method for government affair big data, which is applied to the automatic labeling system for government affair big data provided by any embodiment, as shown in figure 3, and comprises the following steps:
Step 201, obtaining data to be labeled from the government big data.
Specifically, a data source management module of the automatic government affair big data labeling system is responsible for extracting required data to be labeled from a government affair data warehouse. Such data may include structured data such as database records, as well as unstructured data such as text documents and multimedia files. The module utilizes high-efficiency data extraction technology to ensure the integrity and consistency of data and provide high-quality original materials for subsequent labeling work.
Step 202, creating or modifying tag configuration information according to the data to be labeled.
The tag configuration information includes tag classifications and labeling rules.
Specifically, the tag management module of the government big data automatic labeling system allows a user to create new tag classifications or modify existing tag configurations according to new characteristics of data to be labeled. The configuration information not only defines the classification labels of the data, but also contains specific labeling rules, such as logic for classifying the data based on specific conditions. This step is critical because it provides the necessary guidance and criteria for automated labeling.
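One hypothetical shape for such tag configuration information, combining a tag classification with the labeling rule that assigns it, expressed as a serializable structure; all field names and the example rule are assumptions for illustration.

```python
# Hypothetical tag configuration record: a tag, its classification, and
# the condition-based labeling rule described in step 202. A real system
# would define its own schema; this only illustrates the idea.
import json

tag_config = {
    "tag": "zombie_enterprise",
    "classification": "enterprise_liveness",
    "rule": {
        "all_of": [
            {"field": "revenue_growth", "op": "<", "value": 0},
            {"field": "years_loss_making", "op": ">=", "value": 3},
        ]
    },
}

# The configuration should survive a JSON round trip so it can be stored
# by the tag management module and shipped inside task instructions.
assert json.loads(json.dumps(tag_config)) == tag_config
print(tag_config["tag"])  # zombie_enterprise
```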
Step 203, creating a labeling task according to the data to be labeled and the tag configuration information, and generating a task instruction.
Specifically, a task management module of the automatic government affair big data labeling system intelligently creates labeling tasks according to the defined data to be labeled and label configuration information. The module designs the execution flow of the task, determines the target and the requirement of the task, and generates detailed task instructions. These instructions will instruct the subsequent execution nodes how accurately to complete the labeling.
Step 204, after receiving the task instruction, determining a task execution node from the plurality of service nodes.
Specifically, after receiving a task instruction, an execution node management module of the automatic government affair big data labeling system selects the most suitable node to execute a labeling task according to the state and the capacity of a service node in the current system. The module adopts an intelligent scheduling algorithm, and takes load balance, processing speed and historical performance of the nodes into consideration so as to ensure that tasks can be completed efficiently and accurately.
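The scheduling idea can be sketched as a weighted score over node load, processing speed, and historical performance; the weights, field names, and scoring formula below are illustrative assumptions, not the system's actual algorithm.

```python
# Node-selection sketch for step 204: score each candidate service node
# and pick the highest scorer. Fields are assumed to be normalized to [0, 1].

def pick_node(nodes, w_load=0.5, w_speed=0.3, w_history=0.2):
    def score(n):
        return (w_load * (1.0 - n["load"])        # prefer idle nodes
                + w_speed * n["speed"]            # normalized throughput
                + w_history * n["success_rate"])  # past task success
    return max(nodes, key=score)["name"]

nodes = [
    {"name": "node-a", "load": 0.9, "speed": 0.8, "success_rate": 0.99},
    {"name": "node-b", "load": 0.2, "speed": 0.6, "success_rate": 0.95},
]
print(pick_node(nodes))  # node-b
```

Here the lightly loaded node-b wins despite node-a's better throughput, which is the load-balancing behavior the text describes.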
Step 205, executing the labeling task through the task execution node, so as to label the data to be labeled with the corresponding tag classification according to the labeling rules, thereby obtaining a labeling result.
Specifically, the selected task execution node in the automatic government affair big data labeling system starts to execute the labeling task according to the received task instruction and the labeling rule. The automatic labeling engine on the node analyzes the data to be labeled, applies a machine learning model and algorithm, classifies and matches the data with the corresponding labels, and completes labeling. Upon completion, the tagged data is stored or updated in a database for further analysis and decision making. The process not only improves the labeling speed and accuracy, but also reduces manual intervention and ensures the consistency and repeatability of data processing.
Step 206, verifying the accuracy of the labeling result according to the interrelationships among the labeling rules.
The accuracy of the assigned labels is verified by analyzing and comparing predefined interrelationships among the labels in the labeling result. These interrelationships are logical relationships and may include: exclusive relationships, ensuring that the same data item is not wrongly assigned two mutually incompatible tags; subordinate relationships, ensuring that the tag hierarchy is correct (for example, a more specific tag is valid only if its broader parent tag is present); and conditional dependencies, where the assignment of some tags depends on the presence or value of other tags.
These interrelationships are detected and verified automatically using logical checks and a rule engine; when a contradiction or rule violation is found, the module flags the potential error and prompts manual review or automatic adjustment of the labels, ensuring the logical soundness and accuracy of the data labeling. This rule-based verification mechanism is a key link in ensuring the quality of government data labeling, helping to reduce errors and improve data reliability.
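The three relationship types, exclusive, subordinate, and conditional dependency, can be sketched as follows; the tag names and rule tables are illustrative examples, not the system's actual rule sets.

```python
# Rule-engine sketch for step 206: check a set of assigned tags against
# exclusive, subordinate (hierarchical), and conditional-dependency rules.

EXCLUSIVE = [{"active_enterprise", "zombie_enterprise"}]
PARENT = {"industry:software": "industry:IT"}        # child -> required parent
REQUIRES = {"tax_violation": "taxpayer_registered"}  # tag -> required tag

def verify(tags):
    """Return a list of human-readable rule violations (empty if valid)."""
    errors = []
    for pair in EXCLUSIVE:
        if pair <= tags:  # both mutually incompatible tags present
            errors.append(f"exclusive conflict: {sorted(pair)}")
    for child, parent in PARENT.items():
        if child in tags and parent not in tags:
            errors.append(f"{child} requires parent {parent}")
    for tag, dep in REQUIRES.items():
        if tag in tags and dep not in tags:
            errors.append(f"{tag} depends on {dep}")
    return errors

print(verify({"active_enterprise", "zombie_enterprise", "industry:software"}))
```

In the flagged case the module would surface both violations for manual review or automatic adjustment, as described above.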
The technical effects of the automatic government affair big data labeling method in this embodiment may refer to the technical effects of the automatic government affair big data labeling system in the foregoing embodiment, and will not be described herein.
The foregoing embodiments are merely illustrative of the technical solutions of the present invention and are not limiting. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

Application CN202411397428.4A, "A system and method for automatically labeling government big data", filed 2024-10-09 (status: Active).

Publications: CN118898234A, published 2024-11-05; CN118898234B, granted 2025-01-10.
