

Technical Field
The invention belongs to the technical field of big data, and in particular relates to a method for implementing a fast data upload engine based on DataX.
Background
In everyday business application systems, importing data from Excel is an extremely common basic function. Taking JavaWeb-based projects as an example, Excel files are usually parsed with Apache POI, after which the relevant SQL statements are written and the data is imported via JDBC. Development typically has to be done separately for each import function, according to its Excel data template and the structure of the target table. On the one hand, customized development for every import function consumes a great deal of development effort; on the other hand, because of limits in experience and skill, the import functions implemented by ordinary programmers are mostly inefficient.
DataX is an offline synchronization tool for heterogeneous data sources, dedicated to stable and efficient data synchronization between all kinds of heterogeneous data sources. DataX transmits data over the HTTP protocol, has built-in reconnection and retry mechanisms, provides distributed transactions and data consistency protection, exchanges data at high speed between heterogeneous databases and file systems, and performs the whole transfer in memory within a single process. As an ecosystem, the DataX plug-in architecture provides a rich set of Reader and Writer plug-ins for common data sources and supports custom development of new plug-ins. However, a DataX execution instance depends on a pre-written json configuration file (containing the number of execution threads, the reader plug-in with its source data source, table and fields, and the writer plug-in with its target data source, table and fields). In practice the json configuration file often contains many fields; in scenarios involving multiple systems, different types of data transfer and large numbers of tables, configuration is cumbersome, inefficient and labor-intensive. Therefore, to integrate DataX into concrete projects as a built-in, general-purpose component, the problem of dynamically generating the json configuration file of a DataX execution instance urgently needs to be solved.
Summary of the Invention
The purpose of the present invention is to provide a method for implementing a fast data upload engine based on DataX, which solves the technical problem of providing a general-purpose data upload component that achieves convenient, efficient and fast data upload.
To achieve the above object, the present invention adopts the following technical solution: a method for implementing a fast data upload engine based on DataX, comprising the following steps:
Step 1: build a parsing and conversion module, a data reading module, a field mapping configuration module, a configuration file generation module, an upload module and a scheduling module on a distributed server;
Step 2: the parsing and conversion module presets an agreed template and receives imported files in Excel or csv format that follow the agreed template, i.e. template files; it parses the template file using Apache POI to obtain the data column names in the header row of the template file and the corresponding column index information;
Step 3: the data reading module connects to the database via jdbc or odbc and, given the table name, reads the detailed field list from the database system tables;
Step 4: the field mapping configuration module implements an intelligent field matching algorithm based on regular expression techniques, which matches the column names in the header row of the data template against the field names of the database table and establishes the mapping relationships;
Step 5: the field mapping configuration module provides a visual field mapping configuration interface for the template file, which displays the field relationships that have already been mapped automatically by the intelligent matching algorithm;
Step 6: the configuration file generation module provides a json configuration file generator for DataX execution instances, which dynamically generates the json configuration file used by DataX according to the field mapping configuration of the template file;
Step 7: the upload module provides a general data upload and import SDK to complete the file upload;
Step 8: the upload module provides XlsToCsv and XlsxToCsv utility classes that convert data files in the xls and xlsx Excel formats into csv format, so that data files in Excel format can be uploaded and imported;
Step 9: the scheduling module calls DataX directly through a Java wrapper, or triggers the execution of the DataX data synchronization task via shell or python using the python script provided by the DataX tool.
Preferably, in step 2, the imported Excel or csv files all follow the preset agreed template, in which the first row is the header row and the data rows start from the second row.
Preferably, in step 4, establishing the mapping relationships specifically comprises:
Step S4-1: first, match the column names of the data template header row exactly against the field names of the database table, and establish mapping relationships for the matches;
Step S4-2: then, match the unmatched template column names exactly against the remarks (comments) of the remaining fields of the database table, and establish the corresponding mapping relationships;
Step S4-3: finally, match any still-unmatched template column names against the remarks of the remaining database table fields by similarity, according to the similarity threshold set by the rules.
Preferably, in step 5, the visual field mapping configuration interface supports manual adjustment of the mapping relationships.
Preferably, in step 5, the visual field mapping configuration interface is displayed as one tab page per database table; the tab header shows the database table name and table remark, and the mapping is shown as a table below.
Preferably, in step 6, the json configuration file used by DataX is generated automatically according to preset rules and the field mapping configuration, and specifically comprises a Reader plug-in part, a Writer plug-in part and a Transformer part for data cleaning and conversion.
Preferably, in step 6, when the DataX task is actually executed, the absolute path of the csv data file is backfilled into the json to produce an executable json file.
Preferably, in step 7, the general data upload and import SDK specifically integrates file upload, format validation of the data file, verification that the file matches the data template, generation of the json configuration file of the executable DataX instance, and invocation of the DataX tool.
The method for implementing a fast data upload engine based on DataX according to the present invention solves the technical problem of providing a general-purpose data upload component that achieves convenient, efficient and fast data upload. It provides a visual field mapping component that implements a general maintenance and configuration function for the mapping between the data import template (Excel or csv format) of each business function and the database tables, and between the columns of the import template and the fields of the database tables. By means of an intelligent matching algorithm, the column names of the template header row are automatically matched and mapped to the field names or field remarks of the database tables, which greatly reduces the operational complexity and workload for users and improves configuration efficiency and accuracy. A dedicated json configuration file generator for DataX execution instances is implemented: it provides a method that, from the field mapping configuration of the data import template together with the application system data source and database tables, generates the json configuration file required by DataX with the actual csv data file address still missing (the path attribute of the txtfilereader plug-in temporarily holds a specific placeholder), and a method that backfills the actual csv data file address into the json configuration file. This realizes the ability to dynamically generate the json configuration file of a DataX execution instance, and effectively solves the problem that, in practical applications of DataX, the json configuration file contains many fields and configuration is cumbersome and inefficient in scenarios involving multiple systems, different types of data transfer and large numbers of tables. The method also verifies whether an uploaded data file matches the format specification of the function's import template, integrates Excel-to-csv conversion, and provides an execution scheduling strategy for DataX, thereby achieving fast upload and import of data files in Excel or csv format.
The invention greatly simplifies the development of data upload and import functions in application systems, saves development cost, improves development efficiency, and guarantees the reliability of business execution and the write efficiency of large batches of data. It is also well suited to business scenarios in which large batches of data are exchanged via Excel or csv files.
Brief Description of the Drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a logic diagram of the method for implementing a fast data upload engine based on DataX according to the present invention.
Detailed Description of Embodiments
As shown in Fig. 1 and Fig. 2, a method for implementing a fast data upload engine based on DataX comprises the following steps:
Step 1: build a parsing and conversion module, a data reading module, a field mapping configuration module, a configuration file generation module, an upload module and a scheduling module on a distributed server;
Step 2: the parsing and conversion module presets an agreed template and receives imported files in Excel or csv format that follow the agreed template, i.e. template files; it parses the template file using Apache POI to obtain the data column names in the header row of the template file and the corresponding column index information;
The imported Excel or csv files all follow the preset agreed template, in which the first row is the header row and the data rows start from the second row.
For each business function that needs to upload and import data, the invention maintains the template and the corresponding database tables into which the data is to be written, and stipulates that the first row of the template is the header row and the data rows start from the second row.
The invention uses technologies such as Apache POI to parse the header row of the Excel or csv template and obtain information such as the header column names and the corresponding column indexes.
In this embodiment, the format of the data import template is restricted to Excel (files with the xls or xlsx suffix) or csv, and it is agreed that the first row of the template is the header row and the data rows start from the second row. By introducing project technologies such as Apache POI, the template file is parsed to obtain the data column names in the header row and the corresponding column index information.
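For reference, a minimal sketch of how the header row of an Excel template could be parsed with Apache POI is given below; the class and method names are illustrative only and not part of the claimed implementation, and a csv template would instead be read with an ordinary text reader.

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateHeaderParser {

    /** Returns a map of header column name to 0-based column index. */
    public static Map<String, Integer> parseHeader(File templateFile) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(templateFile)) {   // handles xls and xlsx
            Sheet sheet = workbook.getSheetAt(0);    // the agreed template uses the first sheet
            Row header = sheet.getRow(0);            // row 0 is the header row by convention
            Map<String, Integer> columns = new LinkedHashMap<>();
            for (Cell cell : header) {
                columns.put(cell.getStringCellValue().trim(), cell.getColumnIndex());
            }
            return columns;
        }
    }
}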
Step 3: the data reading module connects to the database via jdbc or odbc and, given the table name, reads the detailed field list from the database system tables; in this embodiment, the field list contains information such as the field name, field type and field remark.
The user of the current data source must have query permission on the database system tables so that the field information list of the relevant tables can be obtained.
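A hedged sketch of reading such a field list is shown below; it uses the standard JDBC DatabaseMetaData API (which the driver resolves against the system tables) rather than a hand-written query, and the class and record names are assumptions for illustration.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class TableFieldReader {

    public record FieldInfo(String name, String type, String remark) {}

    public static List<FieldInfo> readFields(String jdbcUrl, String user, String pwd,
                                             String tableName) throws Exception {
        List<FieldInfo> fields = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pwd)) {
            DatabaseMetaData meta = conn.getMetaData();
            // null catalog/schema lets the driver resolve them from the connection
            try (ResultSet rs = meta.getColumns(null, null, tableName, null)) {
                while (rs.next()) {
                    fields.add(new FieldInfo(
                            rs.getString("COLUMN_NAME"),
                            rs.getString("TYPE_NAME"),
                            rs.getString("REMARKS")));   // column comment, if the driver exposes it
                }
            }
        }
        return fields;
    }
}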
Step 4: the field mapping configuration module implements an intelligent field matching algorithm based on regular expression techniques, which matches the column names in the header row of the data template against the field names of the database table and establishes the mapping relationships;
When step 4 is performed, establishing the mapping relationships, that is, writing the intelligent matching algorithm based on regular expression techniques, specifically comprises:
Step S4-1: first, match the column names of the data template header row exactly against the field names of the database table, and establish mapping relationships for the matches;
Step S4-2: then, match the unmatched template column names exactly against the remarks (comments) of the remaining fields of the database table, and establish the corresponding mapping relationships;
Step S4-3: finally, match any still-unmatched template column names against the remarks of the remaining database table fields by similarity, according to the similarity threshold set by the rules.
The invention matches the column names of the template header row against the field names or field remarks of the database table according to these rules and automatically establishes the mapping relationships: pairs are first formed by exact string match, the database table fields left unpaired are then paired by string similarity, and if the data import template corresponds to multiple database tables, multiple mapping relationships are generated automatically.
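An illustrative sketch of this three-pass strategy (exact field-name match, exact remark match, then similarity against remarks) is given below. The normalization rule (a simple regex strip of whitespace and underscores) and the similarity measure are assumptions for the example, not the invention's actual algorithm.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldMatcher {

    public static Map<String, String> match(List<String> templateColumns,
                                            Map<String, String> fieldToRemark,
                                            double threshold) {
        Map<String, String> mapping = new LinkedHashMap<>();   // template column -> db field
        Map<String, String> remaining = new LinkedHashMap<>(fieldToRemark);

        // Pass 1: exact match on field name (case/underscore-insensitive).
        for (String col : templateColumns) {
            remaining.keySet().stream()
                    .filter(f -> normalize(f).equals(normalize(col)))
                    .findFirst()
                    .ifPresent(f -> { mapping.put(col, f); remaining.remove(f); });
        }
        // Pass 2: exact match on the field remark (comment).
        for (String col : templateColumns) {
            if (mapping.containsKey(col)) continue;
            remaining.entrySet().stream()
                    .filter(e -> normalize(e.getValue()).equals(normalize(col)))
                    .findFirst()
                    .ifPresent(e -> { mapping.put(col, e.getKey()); remaining.remove(e.getKey()); });
        }
        // Pass 3: similarity match against the remarks of the remaining fields.
        for (String col : templateColumns) {
            if (mapping.containsKey(col)) continue;
            remaining.entrySet().stream()
                    .max((a, b) -> Double.compare(similarity(col, a.getValue()),
                                                  similarity(col, b.getValue())))
                    .filter(e -> similarity(col, e.getValue()) >= threshold)
                    .ifPresent(e -> { mapping.put(col, e.getKey()); remaining.remove(e.getKey()); });
        }
        return mapping;
    }

    private static String normalize(String s) {
        return s == null ? "" : s.toLowerCase().replaceAll("[\\s_]", "");
    }

    /** Simple character-overlap similarity; a real implementation might use edit distance instead. */
    private static double similarity(String a, String b) {
        String x = normalize(a), y = normalize(b);
        if (x.isEmpty() || y.isEmpty()) return 0;
        long common = x.chars().distinct().filter(c -> y.indexOf(c) >= 0).count();
        return (double) common / Math.max(x.length(), y.length());
    }
}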
Step 5: the field mapping configuration module provides a visual field mapping configuration interface for the template file, which displays the field relationships that have already been mapped automatically by the intelligent matching algorithm;
When step 5 is performed, the visual field mapping configuration interface supports manual adjustment of the mapping relationships.
The visual field mapping configuration interface is displayed as one tab page per database table, the tab header showing the database table name and table remark, with the mapping shown as a table below. Specifically, in this embodiment, when a template corresponds to multiple database tables, the interface switches between one tab page per database table; the tab header displays the database table name and table remark, and the table below shows the following columns: the column index of the data import template; the column name of the data import template header row; the database table field name (a drop-down box in which the already-mapped field is automatically selected, the drop-down options being the currently mapped field plus all fields of the database table that have not yet been mapped); the database table field type (linked to the field name column); the database table field remark (linked to the field name column); and the enumeration value conversion relationship (a list of key-value pairs separated by commas, for example "1-男,0-女", i.e. "1-male,0-female"). The database table field name column provides drop-down editing to support manual adjustment of the mapping between the data import template columns and the database table fields.
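One possible in-memory representation of a single row of this mapping table is sketched below; the record and field names are purely illustrative.

public record ColumnMappingRow(
        int templateColumnIndex,      // column index in the data import template
        String templateColumnName,    // column name from the template header row
        String dbFieldName,           // mapped database field (editable via the drop-down)
        String dbFieldType,           // linked to the selected field
        String dbFieldRemark,         // linked to the selected field
        String enumConversion         // e.g. "1-男,0-女" key-value pairs, comma separated
) {}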
Step 6: the configuration file generation module provides a json configuration file generator for DataX execution instances, which dynamically generates the json configuration file used by DataX according to the field mapping configuration of the template file;
When step 6 is performed, the generated json configuration file used by DataX specifically comprises a Reader plug-in part, a Writer plug-in part and a Transformer part for data cleaning and conversion.
The content of the DataX json configuration file is generated when the field mapping configuration of the data import template is saved, but at this point the json still lacks the address of the data file actually to be processed; when the DataX task is actually executed, the absolute path of the csv data file must be backfilled into the json to produce an executable json file.
In this embodiment, a base template of the json configuration file of a DataX execution instance is defined, implementing the basic json framework structure, including the DataX execution speed configuration and so on; the main content parts (the Reader plug-in part, the Writer plug-in part, the Transformer data conversion part, etc.) are temporarily replaced by corresponding placeholders.
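A sketch of what such a base template might look like is given below, expressed as a Java text block; the placeholder tokens and the channel value are assumptions for illustration, not the actual template used by the invention.

public final class DataxJobTemplate {

    /** __TRANSFORMER__ is later replaced by ', "transformer": [ ... ]' or by an empty string. */
    public static final String BASE_TEMPLATE = """
            {
              "job": {
                "setting": { "speed": { "channel": 3 } },
                "content": [
                  {
                    "reader": __READER__,
                    "writer": __WRITER__
                    __TRANSFORMER__
                  }
                ]
              }
            }
            """;

    private DataxJobTemplate() {}
}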
These placeholders correspond to the Reader plug-in part (mainly the txtfilereader plug-in, with the data file address and the column information of the corresponding data template), the Writer plug-in part (the connection information of the target data source, the table and the fields), and, where needed, the Transformer part for data cleaning and conversion (a custom groovy function implementing the enumeration value conversion defined in the mapping).
The specific steps in this embodiment are as follows (a sketch of the generator is given after the list):
Step S6-1: implement a method that generates the json text of the Reader plug-in part. According to the field mapping configuration of the data import template, the txtfilereader plug-in provided in the DataX standard plug-in library, which supports csv reading, is used to generate the json text of the Reader part (the header row of the data file is skipped via the "skipHeader":"true" attribute). At this point the path attribute of the txtfilereader plug-in is still a specific placeholder; at actual upload time it is replaced with the real absolute path of the file.
Step S6-2: implement a method that generates the json text of the Writer plug-in part. According to the field mapping configuration of the data import template and the reading order of the template columns in the Reader part, the corresponding json text of the Writer part is generated (including the data source information, the database table name, the field names, etc.). The Writer plug-in corresponding to the database type of the target data source is taken from the DataX standard plug-in library. The data source information can be read directly from the data source configuration file of the business application system that uses the invention.
Step S6-3: implement a method that generates the json text of the Transformer data conversion part. If the field mapping configuration of the data import template contains configuration data in the "enumeration value conversion relationship (key-value pair list)" column, a transformer field conversion fragment based on a custom groovy function is generated according to the rules of the DataX execution instance json configuration file; if no "enumeration value conversion relationship" is configured, an empty string is returned.
Step S6-4: backfill the json texts of the Reader plug-in part, the Writer plug-in part and the Transformer data conversion part into the predefined base template of the DataX execution instance json configuration file, replacing the corresponding placeholders, and save the result.
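The following minimal sketch, under the assumptions stated above, shows how the Reader fragment could be generated and how the real csv path could be backfilled later. "__CSV_PATH__" is an assumed placeholder token, and every template column is simply read as a string here.

import java.util.List;
import java.util.stream.Collectors;

public class DataxJsonGenerator {

    /** Builds the txtfilereader fragment; the path still holds a placeholder at this point. */
    public static String buildReaderFragment(List<Integer> columnIndexes) {
        String columns = columnIndexes.stream()
                .map(i -> "{\"index\": " + i + ", \"type\": \"string\"}")
                .collect(Collectors.joining(", "));
        return """
                {
                  "name": "txtfilereader",
                  "parameter": {
                    "path": ["__CSV_PATH__"],
                    "encoding": "UTF-8",
                    "fieldDelimiter": ",",
                    "skipHeader": "true",
                    "column": [%s]
                  }
                }
                """.formatted(columns);
    }

    /** Replaces the path placeholder with the absolute path of the uploaded csv file. */
    public static String backfillCsvPath(String jobJson, String absoluteCsvPath) {
        return jobJson.replace("__CSV_PATH__", absoluteCsvPath);
    }
}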
Step 7: the upload module provides a general data upload and import SDK to complete the file upload;
To support the upload and import of data files in Excel format, the function of converting data files in the xls and xlsx Excel formats into csv format must be implemented.
Because the invention reads data with the txtfilereader plug-in provided in the DataX standard plug-in library, data files in Excel format must be converted into csv format before the data can be read correctly.
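A hypothetical sketch of an XlsxToCsv-style utility is shown below: it streams an .xlsx sheet through Apache POI and writes a csv file that txtfilereader can consume. The csv quoting is simplified, and the implementation details are assumptions rather than the SDK's actual code.

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class XlsxToCsv {

    public static void convert(Path xlsxFile, Path csvFile) throws IOException {
        DataFormatter formatter = new DataFormatter();   // renders cells as displayed text
        try (FileInputStream in = new FileInputStream(xlsxFile.toFile());
             XSSFWorkbook workbook = new XSSFWorkbook(in);
             BufferedWriter out = Files.newBufferedWriter(csvFile, StandardCharsets.UTF_8)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                List<String> values = new ArrayList<>();
                // iterate by index so that blank cells keep their column positions
                for (int c = 0; c < row.getLastCellNum(); c++) {
                    Cell cell = row.getCell(c, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                    String text = formatter.formatCellValue(cell);
                    values.add("\"" + text.replace("\"", "\"\"") + "\"");   // minimal csv quoting
                }
                out.write(String.join(",", values));
                out.newLine();
            }
        }
    }
}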
When step 7 is performed, the general data upload and import SDK specifically integrates file upload, format validation of the data file, verification that the file matches the data template, generation of the json configuration file of the executable DataX instance, and invocation of the DataX tool.
The invention provides a general data upload and import SDK that checks whether an uploaded data file matches the format specification of the function's import template, integrates Excel-to-csv conversion, and implements the execution scheduling strategy for DataX.
Step 8: the upload module provides XlsToCsv and XlsxToCsv utility classes that convert data files in the xls and xlsx Excel formats into csv format, so that data files in Excel format can be uploaded and imported;
In this embodiment, a standard SDK is implemented in a particular programming language (for example, but not limited to, Java), providing a single open API method that executes file upload and data import as one integrated call.
The specific steps in this embodiment are as follows (a combined sketch of the flow is given after the list):
Step S8-1: provide a general file upload method and verify that the uploaded file is a legal Excel or csv file.
Step S8-2: inside the SDK, implement a check between the uploaded data file and the data import template for which the mapping rules have been configured, verifying whether the uploaded data file matches the format specification of the template and whether its data columns are consistent with those of the template (i.e. whether the column names and order of the first row are consistent with the header row of the template, while allowing the file to contain a few more columns than the template).
Step S8-3: in the SDK, implement an Excel-to-csv conversion toolkit based on Apache POI, containing the XlsToCsv and XlsxToCsv utility classes; they convert data files with the xls or xlsx suffix uploaded by users into csv data files suitable for reading by DataX's txtfilereader plug-in.
Step S8-4: after the data file has been uploaded, files that are not in csv format are first converted into csv; the method provided by the json configuration file generator of the DataX execution instance is then called to backfill the absolute path of the csv data file into the json configuration file corresponding to the data import template, producing the json configuration file finally usable for DataX execution.
Step S8-5: with the generated instantiated json configuration file specified, the DataX tool is scheduled to carry out the task of reading the data from the csv data file, cleaning and converting it in the transformer module, and finally writing it into the target database table.
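The following high-level sketch ties steps S8-1 to S8-5 together into the single open API. All helper methods referenced here (validateFormat, matchesTemplate, toCsv, backfillCsvPath, runDataxJob) are hypothetical names standing in for the corresponding capabilities of the SDK; they are not the SDK's real signatures.

import java.nio.file.Path;

public class UploadImportSdk {

    public void uploadAndImport(Path uploadedFile, String templateId) throws Exception {
        // S8-1: accept the upload and check that it is a legal Excel/csv file
        if (!validateFormat(uploadedFile)) {
            throw new IllegalArgumentException("Not a valid xls/xlsx/csv file");
        }
        // S8-2: verify that the header row matches the configured import template
        if (!matchesTemplate(uploadedFile, templateId)) {
            throw new IllegalArgumentException("File does not match template " + templateId);
        }
        // S8-3/S8-4: convert to csv if needed, then backfill its absolute path into
        // the pre-generated json configuration for this template
        Path csvFile = uploadedFile.toString().endsWith(".csv")
                ? uploadedFile
                : toCsv(uploadedFile);
        Path jobJson = backfillCsvPath(templateId, csvFile.toAbsolutePath());
        // S8-5: hand the instantiated json to DataX and run the synchronization task
        runDataxJob(jobJson);
    }

    // hypothetical helpers, shown only to make the flow concrete
    private boolean validateFormat(Path file) { return true; }
    private boolean matchesTemplate(Path file, String templateId) { return true; }
    private Path toCsv(Path excelFile) { return excelFile; }
    private Path backfillCsvPath(String templateId, Path csvAbsolutePath) { return csvAbsolutePath; }
    private void runDataxJob(Path jobJson) { }
}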
By writing an automatic field mapping algorithm, the invention achieves convenient and efficient visual online configuration of the field mapping. It implements a dedicated json configuration file generator for DataX execution instances, providing a method that generates the DataX json configuration file for a specific data import template and a method that, for the actually uploaded data file, backfills the absolute path of the csv file to produce the json configuration file of the DataX execution instance. It also provides an SDK toolkit containing a standard general-purpose API that executes file upload and data import as one integrated call, together with Excel-to-csv conversion. A fast data upload engine based on DataX that supports data files in Excel (xls or xlsx suffix) or csv format is thereby realized.
Step 9: the scheduling module calls DataX directly through a Java wrapper, or triggers the execution of the DataX data synchronization task via shell or python using the python script provided by the DataX tool.
The absolute path of the csv data file must be backfilled to generate the json configuration file executable by DataX, which is then passed to DataX. Because DataX is implemented in Java, the corresponding methods can be wrapped in Java to call DataX directly; alternatively, the execution of the DataX data synchronization task can be triggered via shell or python using the python script provided by the DataX tool.
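A minimal sketch of the shell/python style of triggering DataX from Java via the datax.py script that ships with the DataX tool is shown below; DATAX_HOME is an assumed environment variable, and error handling is reduced to checking the exit code.

import java.io.IOException;

public class DataxLauncher {

    public static int run(String jobJsonAbsolutePath) throws IOException, InterruptedException {
        String dataxHome = System.getenv("DATAX_HOME");   // assumed to point at the DataX install
        ProcessBuilder pb = new ProcessBuilder(
                "python", dataxHome + "/bin/datax.py", jobJsonAbsolutePath);
        pb.inheritIO();                                    // stream DataX logs to our console
        Process process = pb.start();
        return process.waitFor();                          // 0 means the sync task succeeded
    }
}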
Fig. 2 is a logic diagram of the method for implementing a fast data upload engine based on DataX according to the present invention, describing the whole process of uploading and importing data with the engine. First, for each business function that needs to upload and import data files, the corresponding data upload template and the related database tables are maintained; the background intelligent matching algorithm automatically maps the columns of the data template to the fields of the database tables, although some columns and fields may fail to match automatically. On the visual field mapping configuration interface, the successfully mapped columns and the template columns not yet mapped can be viewed, and the mapping between template columns and fields can be created or adjusted manually. When the field mapping is saved, the json configuration content for the data template and the database tables is assembled automatically according to the requirements of the DataX tool. A user of the business function then uploads a data file in Excel or csv format; the engine automatically validates the format of the data file and verifies that it matches the corresponding data template. A data file that passes the checks but is not in csv format is converted into csv by the packaged SDK utility classes. Finally, the absolute path of the csv data file is backfilled into the json configuration of the corresponding data template and database tables to produce the final executable json configuration file, which drives the DataX tool to execute the task of synchronizing the data from the csv file into the target database tables.