Disclosure of Invention
In view of the above, it is necessary to provide a data acquisition method, an apparatus, a computer device and a storage medium capable of improving data acquisition efficiency.
A data acquisition method, said method comprising:
acquiring specified time interval information, and acquiring, from a business database according to the time interval information, first data updated in the corresponding time interval;
screening, from field information stored in an intermediate table, target field information which does not match the time interval information, wherein the intermediate table comprises the field information of a characteristic field of data to be acquired, and the data to be acquired is data determined from the business database according to preset acquisition logic;
acquiring second data from the business database according to the target field information; and
performing data integration processing on the first data and the second data to obtain target data.
In one embodiment, the method further comprises:
storing the target data into a partition table of a corresponding partition in a data warehouse.
In one embodiment, the intermediate table is a data table disposed in a data warehouse.
In one embodiment, before obtaining the specified time interval information, the method further includes: acquiring preset acquisition logic information; determining data which accords with the acquisition logic information in the service database as data to be acquired; extracting field information from the characteristic field of the data to be collected and storing the field information into the intermediate table.
In one embodiment, the screening of the field information stored in the intermediate table for the target field information not matching the time interval information includes: acquiring task parameter information, reading field information matched with the task parameter information from the intermediate table, and storing the field information into a temporary table; and screening target field information which does not match with the time interval information from the field information stored in the temporary table.
In one embodiment, collecting the second data from the business database according to the target field information comprises: generating a Structured Query Language (SQL) statement by taking the target field information as the value of the query condition; and collecting the second data from the business database according to the SQL statement.
In one embodiment, after performing the data integration processing on the first data and the second data, the method further includes: and carrying out data deduplication processing on the data after the data integration processing.
In one embodiment, after performing the data integration processing on the first data and the second data, the method further includes: and comparing the data after the data integration processing with the data to be acquired, and removing the data different from the data to be acquired.
A data acquisition device, said device comprising:
the first data acquisition module is used for acquiring the appointed time interval information and acquiring first data updated in the corresponding time interval from the business database according to the time interval information;
the field information acquisition module is used for screening target field information which does not match with the time interval information from field information stored in an intermediate table, the intermediate table comprises field information of a characteristic field of data to be acquired, and the data to be acquired is data determined from a business database according to preset acquisition logic;
the second data acquisition module is used for acquiring second data from the business database according to the target field information;
and the data integration processing module is used for performing data integration processing on the first data and the second data to obtain target data.
A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of the data acquisition method described above when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the data acquisition method described above.
According to the data acquisition method, apparatus, computer device and storage medium described above, the first data is acquired for the specified time interval; the target field information falling outside the specified time interval is screened from the intermediate table, which contains the field information of the predetermined data to be acquired; the corresponding second data is acquired according to the target field information; and finally the first data and the second data are integrated to obtain the target data of the acquisition task.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more clearly understood, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The data acquisition method provided by the application can be applied to the application environment shown in fig. 1. The server 102 acquires the specified time interval information, and acquires first data updated in the corresponding time interval from the business database 104 according to the time interval information; screens target field information which does not match the time interval information from field information stored in an intermediate table 106, wherein the intermediate table 106 comprises the field information of a characteristic field of data to be acquired, and the data to be acquired is data determined from the business database according to preset acquisition logic; collects second data from the business database 104 according to the target field information; and performs data integration processing on the first data and the second data to obtain target data. The server 102 may be implemented as a stand-alone server or as a server cluster composed of a plurality of servers.
In an embodiment, as shown in fig. 2, a data collection method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
Step S202: acquiring specified time interval information, and acquiring first data updated in the corresponding time interval from a business database according to the time interval information.
The business database (also referred to herein as the service database) is a database for storing business data, and may be a relational database or a non-relational database. The business database may include at least one business data table. The first data is data in a data table of the business database that was updated within the specified time interval.
Specifically, the user may collect data by taking the time interval of data update as a condition for data filtering. The time interval information may be information indicating any valid time period or time point. The server acquires time interval information designated by a user, takes the time interval information as a condition for data screening, and takes data updated in a time interval or at a time point corresponding to the time interval information in the service database as first data for collection.
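Step S202 can be illustrated with a minimal sketch. It assumes a relational business database in which each row carries an update timestamp; the table `return_orders`, the column `updated_at`, and the helper `collect_first_data` are hypothetical names, and an in-memory SQLite database merely stands in for the business database.

```python
import sqlite3

def collect_first_data(conn, start, end):
    """Return rows whose update time falls in the half-open interval [start, end)."""
    cur = conn.execute(
        "SELECT order_no, status, updated_at FROM return_orders "
        "WHERE updated_at >= ? AND updated_at < ? ORDER BY order_no",
        (start, end),
    )
    return cur.fetchall()

# In-memory SQLite stands in for the business database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE return_orders (order_no TEXT, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO return_orders VALUES (?, ?, ?)",
    [
        ("A1", "returned", "2024-01-01 09:00:00"),
        ("A2", "returned", "2024-01-02 10:30:00"),
        ("A3", "changed", "2024-01-02 23:59:59"),
    ],
)

# The "time interval information" is one calendar day in this example.
first_data = collect_first_data(conn, "2024-01-02 00:00:00", "2024-01-03 00:00:00")
```

Timestamps are stored as ISO-8601 text here, so string comparison orders them chronologically; a real business database would more likely use a native timestamp type.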
Step S204: screening, from field information stored in an intermediate table, target field information which does not match the time interval information, wherein the intermediate table comprises the field information of the characteristic field of the data to be acquired, and the data to be acquired is data determined from the business database according to preset acquisition logic.
The intermediate table is a data table used for storing intermediate calculation results in a database. The data to be acquired is determined from the data in the business database according to user-defined or preset acquisition logic; the purpose of determining it is to delimit the data acquisition range. The characteristic field may be set according to the data type; for example, for return order data, the characteristic field may include at least one of an order number, an order line number, a database number, a table number, and an order time. The field information may be the field value of a field.
Specifically, the server matches the field information stored in the intermediate table against the time interval information, selects the field information of the data to be acquired that does not fall within the time interval corresponding to the time interval information, and takes the selected field information as the target field information. For example, if the time interval information is yesterday, the field information of the data to be acquired that was not updated yesterday is selected from the intermediate table as the target field information.
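The screening in step S204 can be sketched as a simple filter. The sketch assumes that each intermediate-table entry stores an order number together with a time value; the row layout and the helper `screen_target_fields` are hypothetical and are not prescribed by the method.

```python
from datetime import datetime

def screen_target_fields(intermediate_rows, start, end):
    """Keep the field values of entries whose time lies outside [start, end)."""
    return [
        row["order_no"]
        for row in intermediate_rows
        if not (start <= row["order_time"] < end)
    ]

# Hypothetical content of the intermediate table.
intermediate_rows = [
    {"order_no": "A0", "order_time": datetime(2023, 12, 30, 8, 0)},
    {"order_no": "A1", "order_time": datetime(2024, 1, 1, 9, 0)},
    {"order_no": "A2", "order_time": datetime(2024, 1, 2, 10, 30)},
]

# "Yesterday" is taken to be 2024-01-02 in this example.
target_fields = screen_target_fields(
    intermediate_rows, datetime(2024, 1, 2), datetime(2024, 1, 3)
)
```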
Step S206: collecting second data from the business database according to the target field information.
The second data refers to the data queried from the business database according to the target field information. Specifically, after obtaining the target field information, the server may use it as a data screening condition and query the business database for the business data containing the target field information, using a query language that matches the business database; for example, the business data may be queried with an IN query in Structured Query Language (SQL). The queried business data is collected as the second data.
Step S208: performing data integration processing on the first data and the second data to obtain target data.
Specifically, data integration processing is performed on the acquired first data and the acquired second data, and all data in a data set obtained after data integration is used as target data of the data acquisition task.
According to the data acquisition method described above, the first data is acquired for the specified time interval; the target field information falling outside the specified time interval is screened from the intermediate table, which contains the field information of the predetermined data to be acquired; the corresponding second data is acquired according to the target field information; and finally the first data and the second data are integrated to obtain the target data of the acquisition task.
In one embodiment, the method further comprises: storing the target data into a partition table of a corresponding partition in the data warehouse. In this embodiment, the acquired target data is stored in the corresponding partition table, so that the data can be partitioned quickly, the computing resources consumed by data partitioning in the data warehouse of a big data platform are reduced, and the data processing efficiency is improved. The partition table may be a data table in a Hive database. Data may be written into the partition table in a custom format or in the default format; a Hive data table in a custom format can prevent the data confusion that arises when the content of some fields contains a line break.
In one embodiment, the intermediate table is a data table disposed in a data warehouse. In this embodiment, the intermediate table is set in the data warehouse of the big data platform and may be one or more data tables in the data warehouse; the format of the intermediate table is not limited and may be, for example, a Hive data table. In a conventional collection method, an intermediate table is created in each sub-database of the business database, and when the data of a certain table is extracted, collection is performed by joining that table with the intermediate table (inner join). For example, when return table data is collected, the query joins the return table with the intermediate table (inner join), and the newly added return order data and the original order data corresponding to the newly added return orders are collected into the data warehouse together for statistical analysis of downstream sales data.
In this embodiment, the intermediate table for storing the intermediate data of data acquisition is set directly in the data warehouse, so that the intermediate table in each sub-database of the business database in the business system can be removed. Writing data into an intermediate table in the business database (the data source) requires the data source to be configured with read-write permission, so conventional data acquisition can only use the primary database of the business database; system performance drops during acquisition, normal operation of the business is affected, and the write operations reduce the security of the database. With the method of this embodiment, the intermediate table created in the business database is removed and no query by inner join with the intermediate table is needed, so the standby database of the business data can be used for data acquisition, the primary database of the business data is not affected, the acquisition is decoupled from the business system, and system security is ensured.
In one embodiment, before obtaining the specified time interval information, the method further includes: acquiring preset acquisition logic information; determining data which accords with the acquisition logic information in the service database as data to be acquired; extracting field information from the characteristic field of the data to be collected and storing the field information into the intermediate table.
In this embodiment, before each acquisition task starts, acquisition logic information conforming to the business rule of that acquisition task may be preset. The data acquisition range is determined according to the preset acquisition logic information; that is, the data conforming to the acquisition logic is determined as the data to be acquired. The field information in the characteristic field of the data to be acquired is then extracted and stored in the intermediate table. The characteristic field may be specified in advance according to the acquisition task; for example, an order number, an order line number, a database number, a table number, or an order time may be specified as the characteristic field.
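Populating the intermediate table can be sketched as follows. The acquisition logic is assumed to be expressible as a trusted SQL predicate supplied by configuration; the tables `orders`, `return_orders` and `intermediate`, the predicate itself, and the helper `build_intermediate` are hypothetical. For brevity the intermediate table lives in the same in-memory SQLite database as the business tables, whereas in the embodiments above it may instead be placed in the data warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_no TEXT, order_time TEXT);
    CREATE TABLE return_orders (return_no TEXT, order_no TEXT);
    CREATE TABLE intermediate (task TEXT, order_no TEXT, order_time TEXT);
    INSERT INTO orders VALUES
        ('A1', '2024-01-01'), ('A2', '2024-01-02'), ('A3', '2024-01-02');
    INSERT INTO return_orders VALUES ('R1', 'A1'), ('R2', 'A3');
""")

# Hypothetical acquisition logic: only orders that have a return record are in scope.
# The predicate comes from trusted task configuration, never from end-user input.
ACQUISITION_LOGIC = "order_no IN (SELECT order_no FROM return_orders)"

def build_intermediate(conn, task, logic):
    """Extract the feature-field values of in-scope rows into the intermediate table."""
    conn.execute(
        "INSERT INTO intermediate (task, order_no, order_time) "
        f"SELECT ?, order_no, order_time FROM orders WHERE {logic}",
        (task,),
    )

build_intermediate(conn, "return_task", ACQUISITION_LOGIC)
rows = conn.execute("SELECT order_no FROM intermediate ORDER BY order_no").fetchall()
```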
In one embodiment, the screening of the field information stored in the intermediate table for the target field information not matching the time interval information includes: acquiring task parameter information, reading field information matched with the task parameter information from the intermediate table, and storing the field information into a temporary table; and screening target field information which does not match with the time interval information from the field information stored in the temporary table.
In this embodiment, the task parameter refers to a parameter of the acquisition task that the user configures before the acquisition task is started. For example, data acquisition may be performed by a Spark task of a big data platform; when the Spark task is started, the server obtains the task parameters configured by the user and loads them into the Spark task. The task parameters may include information specifying the business database to be queried, the field information of the source table to be acquired, the partition table to be written, and the like.
In this embodiment, the intermediate table may include data to be acquired that is predetermined according to acquisition logic information of different acquisition tasks, and by acquiring and loading task parameters configured by a user before the acquisition tasks are started, the data to be acquired that conforms to the current acquisition task may be acquired from the intermediate table, and the data to be acquired of the current acquisition task may be stored in the temporary table for subsequent processing. By setting the task parameters, distributed task execution can be performed, the problem of single-point tasks is solved, and the efficiency of data acquisition is improved.
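A minimal sketch of selecting the entries of the current task follows. The parameter names (`business_db`, `source_table`, `target_partition_table`) and the helper `load_temp_table` are hypothetical, and an in-memory list stands in for the temporary table.

```python
def load_temp_table(intermediate_rows, task_params):
    """Copy the entries that belong to the current task into a temporary list."""
    return [
        row
        for row in intermediate_rows
        if row["source_table"] == task_params["source_table"]
    ]

# The intermediate table may mix entries of several acquisition tasks.
intermediate_rows = [
    {"source_table": "return_orders", "order_no": "A1"},
    {"source_table": "change_orders", "order_no": "B7"},
    {"source_table": "return_orders", "order_no": "A3"},
]

# Hypothetical task parameters configured before the task starts.
task_params = {
    "business_db": "orders_db",
    "source_table": "return_orders",
    "target_partition_table": "dw.return_orders",
}

temp_table = load_temp_table(intermediate_rows, task_params)
```

Because each task only reads the entries matching its own parameters, several such tasks could run side by side on different parameter sets.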
In one embodiment, collecting the second data from the business database according to the target field information comprises: generating a Structured Query Language (SQL) statement by taking the target field information as the value of the query condition; and collecting the second data from the business database according to the SQL statement.
In this embodiment, the target field information is used as the value of the query condition to generate an SQL statement, for example an IN query statement of a relational database, so that the corresponding data can be located quickly in the business database according to the target field information, which improves the efficiency of data acquisition.
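Generating the IN query can be sketched as below, under the assumption that the target field information is a list of order numbers. The helper `build_in_query` and the table layout are hypothetical; the values are bound through placeholders rather than concatenated into the statement.

```python
import sqlite3

def build_in_query(table, column, values):
    """Build a parameterized IN statement; values are bound, not interpolated."""
    if not values:
        raise ValueError("at least one target field value is required")
    placeholders = ", ".join("?" for _ in values)
    sql = f"SELECT * FROM {table} WHERE {column} IN ({placeholders})"
    return sql, list(values)

# In-memory SQLite stands in for the business database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_no TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [("A0", 10), ("A1", 20), ("A2", 30)]
)

sql, params = build_in_query("orders", "order_no", ["A0", "A1"])
second_data = conn.execute(sql + " ORDER BY order_no", params).fetchall()
```

Table and column names are interpolated directly and are therefore assumed to come from trusted task configuration. Databases commonly cap the number of bound parameters, so a long list of target values may need to be split into several queries.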
In one embodiment, after performing the data integration processing on the first data and the second data, the method further includes: and carrying out data deduplication processing on the data after the data integration processing. According to the embodiment, redundant repeated data can be removed by performing deduplication processing on the data, and the accuracy of data acquisition is improved.
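Integration followed by deduplication can be sketched as follows, assuming that rows are dictionaries and that the order number identifies a record; the key choice and the helper `integrate_and_deduplicate` are hypothetical.

```python
def integrate_and_deduplicate(first_data, second_data, key="order_no"):
    """Concatenate both result sets and keep one row per key (first occurrence wins)."""
    unique = {}
    for row in list(first_data) + list(second_data):
        unique.setdefault(row[key], row)
    return list(unique.values())

first_data = [
    {"order_no": "A2", "status": "returned"},
    {"order_no": "A3", "status": "changed"},
]
second_data = [
    {"order_no": "A0", "status": "stock"},
    {"order_no": "A2", "status": "returned"},  # duplicate of a first-data row
]

target_data = integrate_and_deduplicate(first_data, second_data)
```

Keeping the first occurrence gives the first data precedence over the second data; another rule, such as keeping the most recently updated row, could be substituted.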
In one embodiment, after performing the data integration processing on the first data and the second data, the method further includes: and comparing the data after data integration processing with the data to be acquired, and removing data different from the data to be acquired. In the embodiment, the data to be acquired which is determined in advance is compared with the integrated target data, so that the data in the non-acquisition range can be excluded, and the accuracy of data acquisition is further improved.
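The comparison against the data to be acquired can be sketched as a membership filter, assuming the acquisition range is represented by the set of order numbers held in the intermediate table; the helper `remove_out_of_scope` is a hypothetical name.

```python
def remove_out_of_scope(integrated_rows, scope_keys, key="order_no"):
    """Drop integrated rows whose key is not part of the data to be acquired."""
    allowed = set(scope_keys)
    return [row for row in integrated_rows if row[key] in allowed]

integrated_rows = [
    {"order_no": "A0", "status": "stock"},
    {"order_no": "A2", "status": "returned"},
    {"order_no": "A9", "status": "returned"},  # updated in the interval but out of scope
]
scope_keys = ["A0", "A2"]  # keys recorded in the intermediate table

filtered_rows = remove_out_of_scope(integrated_rows, scope_keys)
```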
In the following, the data acquisition method of the present application is further described with reference to an application example, as shown in fig. 3 to fig. 4, fig. 3 shows a technical architecture diagram for executing a distributed data acquisition task in the application example, and fig. 4 shows a flow diagram of the data acquisition method in the application example, which specifically includes the following steps:
Step 1: collect the data of the intermediate table. The acquisition logic of the intermediate table may combine the logic of several scenarios, such as a return table and a change table, and the corresponding acquisition logic can be defined according to business requirements.
Step 2: read the data of the intermediate table. The acquisition Spark task reads the data of the intermediate table into memory to facilitate subsequent data processing.
Step 3: collect the return table incrementally. This is the first step of collecting the business table data; the incremental data collected is the data newly added or changed yesterday, and it is stored in memory to facilitate subsequent statistics and summarization.
Step 4: collect the data that was not newly added yesterday. This is the second step of collecting the business table data; part of the stock data of the business table is queried by means of an IN query and stored in memory to facilitate subsequent statistics and summarization.
Step 5: merge and filter the data. The incremental data and the partial stock data collected in the two preceding steps are read, summarized and deduplicated, and then filtered against the data of the intermediate table to remove any data that goes beyond the data of the intermediate table.
Step 6: write the data into the target table. The data obtained in the previous step is automatically matched to the table format according to the target Hive table and the target table format configured by the user, and is finally written into the target partition table.
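Steps 3 to 5 of the application example can be sketched end to end as below. This is an illustration under stated assumptions rather than the implementation of the application example: SQLite replaces both the business database and the Spark job, the table `orders` and the helper `run_collection` are hypothetical, and an intermediate-table key is treated as "not yesterday" whenever no incremental row carries it, which is a simplification of the screening step.

```python
import sqlite3

def run_collection(biz, intermediate_keys, start, end):
    """Collect incremental rows, IN-query the remaining keys, then merge and filter."""
    # Step 3: incremental rows updated inside [start, end).
    incremental = biz.execute(
        "SELECT order_no, updated_at FROM orders "
        "WHERE updated_at >= ? AND updated_at < ?",
        (start, end),
    ).fetchall()

    # Step 4: intermediate-table keys not covered by the incremental rows.
    covered = {order_no for order_no, _ in incremental}
    targets = [key for key in intermediate_keys if key not in covered]
    stock = []
    if targets:
        marks = ", ".join("?" for _ in targets)
        stock = biz.execute(
            f"SELECT order_no, updated_at FROM orders WHERE order_no IN ({marks})",
            targets,
        ).fetchall()

    # Step 5: merge, deduplicate by key, and drop rows outside the intermediate table.
    merged = {}
    for order_no, updated_at in incremental + stock:
        merged.setdefault(order_no, (order_no, updated_at))
    allowed = set(intermediate_keys)
    return sorted(row for key, row in merged.items() if key in allowed)

biz = sqlite3.connect(":memory:")
biz.execute("CREATE TABLE orders (order_no TEXT, updated_at TEXT)")
biz.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        ("A0", "2023-12-30 08:00:00"),
        ("A1", "2024-01-01 09:00:00"),
        ("A2", "2024-01-02 10:30:00"),
        ("A9", "2024-01-02 11:00:00"),  # updated yesterday but outside the acquisition range
    ],
)

target_data = run_collection(
    biz, ["A0", "A2"], "2024-01-02 00:00:00", "2024-01-03 00:00:00"
)
```

Writing `target_data` into a Hive partition table (step 6) is left out of the sketch because it depends on the platform configuration.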
It should be understood that although the steps in the flowcharts of fig. 2 and fig. 4 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2 and fig. 4 may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and which are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, as shown in fig. 5, there is provided a data acquisition apparatus comprising: a first data acquisition module 510, a field information acquisition module 520, a second data acquisition module 530 and a data integration processing module 540, wherein:
the first data acquisition module 510 is configured to acquire specified time interval information, and acquire, from the business database, first data updated in the corresponding time interval according to the time interval information;
the field information acquisition module 520 is configured to screen, from field information stored in an intermediate table, target field information that does not match the time interval information, where the intermediate table includes field information of a characteristic field of data to be acquired, and the data to be acquired is data determined from the business database according to preset acquisition logic;
the second data acquisition module 530 is configured to collect second data from the business database according to the target field information;
and the data integration processing module 540 is configured to perform data integration processing on the first data and the second data to obtain target data.
In one embodiment, the data integration processing module 540 is further configured to store the target data into a partition table of a corresponding partition in the data warehouse.
In one embodiment, the first data acquisition module 510 is further configured to, before obtaining the specified time interval information: obtain preset acquisition logic information; determine data in the business database that conforms to the acquisition logic information as the data to be acquired; and extract field information from the characteristic field of the data to be acquired and store the field information into the intermediate table.
In one embodiment, the field information acquisition module 520 is configured to obtain task parameter information, read the field information matching the task parameter information from the intermediate table, and store the field information in a temporary table; and to screen target field information which does not match the time interval information from the field information stored in the temporary table.
In one embodiment, the second data acquisition module 530 is configured to generate a Structured Query Language (SQL) statement by taking the target field information as the value of the query condition, and to collect the second data from the business database according to the SQL statement.
In an embodiment, the data integration processing module 540 is further configured to, after performing the data integration processing on the first data and the second data, perform data deduplication processing on the integrated data.
In an embodiment, the data integration processing module 540 is further configured to, after performing the data integration processing on the first data and the second data, compare the integrated data with the data to be acquired, and remove data different from the data to be acquired.
For specific limitations of the data acquisition device, reference may be made to the above limitations of the data acquisition method, which are not repeated here. The modules in the data acquisition device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing business data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data acquisition method.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program: acquiring appointed time interval information, and acquiring first data updated in a corresponding time interval from a business database according to the time interval information; screening target field information which does not match with the time interval information from field information stored in an intermediate table, wherein the intermediate table comprises the field information of a characteristic field of data to be acquired, and the data to be acquired is data determined from a business database according to preset acquisition logic; acquiring second data from the business database according to the target field information; and performing data integration processing on the first data and the second data to obtain target data.
In one embodiment, the processor, when executing the computer program, further performs the steps of: and storing the target data into a partition table of a corresponding partition in the data warehouse.
In one embodiment, the processor executes the computer program to perform the following steps before obtaining the specified time interval information: acquiring preset acquisition logic information; determining data which accords with the acquisition logic information in the service database as data to be acquired; extracting field information from the characteristic field of the data to be collected and storing the field information into the intermediate table.
In one embodiment, when the processor executes the computer program to screen the field information stored in the intermediate table for the target field information that does not match the time interval information, the following steps are specifically implemented: acquiring task parameter information, reading field information matched with the task parameter information from the intermediate table, and storing the field information into a temporary table; and screening target field information which does not match with the time interval information from the field information stored in the temporary table.
In one embodiment, when the processor executes the computer program to collect the second data from the business database according to the target field information, the following steps are specifically implemented: generating a Structured Query Language (SQL) statement by taking the target field information as the value of the query condition; and collecting the second data from the business database according to the SQL statement.
In one embodiment, after the processor executes the computer program to perform data integration processing on the first data and the second data, the following steps are further performed: and carrying out data deduplication processing on the data after the data integration processing.
In one embodiment, after the processor executes the computer program to perform data integration processing on the first data and the second data, the following steps are further performed: and comparing the data after the data integration processing with the data to be acquired, and removing the data different from the data to be acquired.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, performs the steps of: acquiring appointed time interval information, and acquiring first data updated in a corresponding time interval from a business database according to the time interval information; screening target field information which does not match with the time interval information from field information stored in an intermediate table, wherein the intermediate table comprises the field information of a characteristic field of data to be acquired, and the data to be acquired is data determined from a business database according to preset acquisition logic; acquiring second data from the business database according to the target field information; and performing data integration processing on the first data and the second data to obtain target data.
In one embodiment, the computer program when executed by the processor further performs the steps of: and storing the target data into a partition table of a corresponding partition in the data warehouse.
In one embodiment, the computer program, when executed by the processor, further implements the following steps before obtaining the specified time interval information: acquiring preset acquisition logic information; determining data in the business database that conforms to the acquisition logic information as the data to be acquired; and extracting field information from the characteristic field of the data to be acquired and storing the field information into the intermediate table.
In one embodiment, when the computer program is executed by the processor to realize the screening of the field information stored in the intermediate table for the target field information not matching with the time interval information, the following steps are specifically realized: acquiring task parameter information, reading field information matched with the task parameter information from the intermediate table, and storing the field information into a temporary table; and screening target field information which does not match with the time interval information from the field information stored in the temporary table.
In one embodiment, when the computer program is executed by the processor to collect the second data from the business database according to the target field information, the following steps are specifically implemented: generating a Structured Query Language (SQL) statement by taking the target field information as the value of the query condition; and collecting the second data from the business database according to the SQL statement.
In one embodiment, after the computer program is executed by the processor to perform the data integration processing on the first data and the second data, the following steps are further performed: and carrying out data deduplication processing on the data after the data integration processing.
In one embodiment, after the computer program is executed by the processor to perform the data integration processing on the first data and the second data, the following steps are further performed: and comparing the data after the data integration processing with the data to be acquired, and removing the data different from the data to be acquired.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above may be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of the present specification.
The above-mentioned embodiments only express several embodiments of the present application, and the description thereof is specific and detailed, but not to be understood as limiting the scope of the invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the concept of the present application, which falls within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.