

Technical Field
The invention belongs to the technical field of big data, and in particular relates to a method for implementing a fast data upload engine based on DataX.
Background
In everyday business application systems, importing data from Excel is an extremely common basic function. Taking JavaWeb-based projects as an example, Excel files are usually parsed with Apache POI, after which the relevant SQL statements are written and the data is imported via JDBC. Development typically has to be done separately for each import function, according to its Excel data template and the structure of the target table. On the one hand, customized development for every import function consumes a great deal of development effort; on the other hand, because of limits in experience and skill, the import functions implemented by ordinary programmers are mostly inefficient.
DataX is an offline synchronization tool for heterogeneous data sources, dedicated to stable and efficient data synchronization between all kinds of heterogeneous data sources. DataX transmits data over the HTTP protocol, has built-in reconnection and retry mechanisms, provides distributed transactions and data consistency protection, exchanges data at high speed between heterogeneous databases and file systems, and performs the whole transfer in memory within a single process. As an ecosystem, the DataX plug-in architecture provides a rich set of Reader and Writer plug-ins for common data sources and supports custom development of new plug-ins. However, a DataX execution instance depends on a pre-written json configuration file (containing the number of execution threads, the reader plug-in with its source data source, table and fields, and the writer plug-in with its target data source, table and fields). In practice the json configuration file often contains many fields; in scenarios involving multiple systems, different types of data transfer and large numbers of tables, configuration is cumbersome, inefficient and labor-intensive. Therefore, to integrate DataX into concrete projects as a built-in, general-purpose component, the problem of dynamically generating the json configuration file of a DataX execution instance urgently needs to be solved.
Summary of the Invention
The purpose of the present invention is to provide a method for implementing a fast data upload engine based on DataX, which solves the technical problem of providing a general-purpose data upload component that achieves convenient, efficient and fast data upload.
To achieve the above object, the present invention adopts the following technical solution: a method for implementing a fast data upload engine based on DataX, comprising the following steps:
Step 1: build a parsing and conversion module, a data reading module, a field mapping configuration module, a configuration file generation module, an upload module and a scheduling module on a distributed server;
Step 2: the parsing and conversion module presets an agreed template and receives imported files in Excel or csv format that follow the agreed template, i.e. template files; it parses the template file using Apache POI to obtain the data column names in the header row of the template file and the corresponding column index information;
Step 3: the data reading module connects to the database via jdbc or odbc and, given the table name, reads the detailed field list from the database system tables;
Step 4: the field mapping configuration module implements an intelligent field matching algorithm based on regular expression techniques, which matches the column names in the header row of the data template against the field names of the database table and establishes the mapping relationships;
Step 5: the field mapping configuration module provides a visual field mapping configuration interface for the template file, which displays the field relationships that have already been mapped automatically by the intelligent matching algorithm;
Step 6: the configuration file generation module provides a json configuration file generator for DataX execution instances, which dynamically generates the json configuration file used by DataX according to the field mapping configuration of the template file;
Step 7: the upload module provides a general data upload and import SDK to complete the file upload;
Step 8: the upload module provides XlsToCsv and XlsxToCsv utility classes that convert data files in the xls and xlsx Excel formats into csv format, so that data files in Excel format can be uploaded and imported;
Step 9: the scheduling module calls DataX directly through a Java wrapper, or triggers the execution of the DataX data synchronization task via shell or python using the python script provided by the DataX tool.
Preferably, in step 2, the imported Excel or csv files all follow the preset agreed template, in which the first row is the header row and the data rows start from the second row.
Preferably, in step 4, establishing the mapping relationships specifically comprises:
Step S4-1: first, match the column names of the data template header row exactly against the field names of the database table, and establish mapping relationships for the matches;
Step S4-2: then, match the unmatched template column names exactly against the remarks (comments) of the remaining fields of the database table, and establish the corresponding mapping relationships;
Step S4-3: finally, match any still-unmatched template column names against the remarks of the remaining database table fields by similarity, according to the similarity threshold set by the rules.
Preferably, in step 5, the visual field mapping configuration interface supports manual adjustment of the mapping relationships.
Preferably, in step 5, the visual field mapping configuration interface is displayed as one tab page per database table; the tab header shows the database table name and table remark, and the mapping is shown as a table below.
Preferably, in step 6, the json configuration file used by DataX is generated automatically according to preset rules and the field mapping configuration, and specifically comprises a Reader plug-in part, a Writer plug-in part and a Transformer part for data cleaning and conversion.
Preferably, in step 6, when the DataX task is actually executed, the absolute path of the csv data file is backfilled into the json to produce an executable json file.
Preferably, in step 7, the general data upload and import SDK specifically integrates file upload, format validation of the data file, verification that the file matches the data template, generation of the json configuration file of the executable DataX instance, and invocation of the DataX tool.
The method for implementing a fast data upload engine based on DataX according to the present invention solves the technical problem of providing a general-purpose data upload component that achieves convenient, efficient and fast data upload. It provides a visual field mapping component that implements a general maintenance and configuration function for the mapping between the data import template (Excel or csv format) of each business function and the database tables, and between the columns of the import template and the fields of the database tables. By means of an intelligent matching algorithm, the column names of the template header row are automatically matched and mapped to the field names or field remarks of the database tables, which greatly reduces the operational complexity and workload for users and improves configuration efficiency and accuracy. A dedicated json configuration file generator for DataX execution instances is implemented: it provides a method that, from the field mapping configuration of the data import template together with the application system data source and database tables, generates the json configuration file required by DataX with the actual csv data file address still missing (the path attribute of the txtfilereader plug-in temporarily holds a specific placeholder), and a method that backfills the actual csv data file address into the json configuration file. This realizes the ability to dynamically generate the json configuration file of a DataX execution instance, and effectively solves the problem that, in practical applications of DataX, the json configuration file contains many fields and configuration is cumbersome and inefficient in scenarios involving multiple systems, different types of data transfer and large numbers of tables. The method also verifies whether an uploaded data file matches the format specification of the function's import template, integrates Excel-to-csv conversion, and provides an execution scheduling strategy for DataX, thereby achieving fast upload and import of data files in Excel or csv format.
The invention greatly simplifies the development of data upload and import functions in application systems, saves development cost, improves development efficiency, and guarantees the reliability of business execution and the write efficiency of large batches of data. It is also well suited to business scenarios in which large batches of data are exchanged via Excel or csv files.
Brief Description of the Drawings
Fig. 1 is a flowchart of the present invention;
Fig. 2 is a logic diagram of the method for implementing a fast data upload engine based on DataX according to the present invention.
Detailed Description of Embodiments
As shown in Fig. 1 and Fig. 2, a method for implementing a fast data upload engine based on DataX comprises the following steps:
Step 1: build a parsing and conversion module, a data reading module, a field mapping configuration module, a configuration file generation module, an upload module and a scheduling module on a distributed server;
Step 2: the parsing and conversion module presets an agreed template and receives imported files in Excel or csv format that follow the agreed template, i.e. template files; it parses the template file using Apache POI to obtain the data column names in the header row of the template file and the corresponding column index information;
The imported Excel or csv files all follow the preset agreed template, in which the first row is the header row and the data rows start from the second row.
For each business function that needs to upload and import data, the invention maintains the template and the corresponding database tables into which the data is to be written, and stipulates that the first row of the template is the header row and the data rows start from the second row.
The invention uses technologies such as Apache POI to parse the header row of the Excel or csv template and obtain information such as the header column names and the corresponding column indexes.
In this embodiment, the format of the data import template is restricted to Excel (files with the xls or xlsx suffix) or csv, and it is agreed that the first row of the template is the header row and the data rows start from the second row. By introducing project technologies such as Apache POI, the template file is parsed to obtain the data column names in the header row and the corresponding column index information.
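For reference, a minimal sketch of how the header row of an Excel template could be parsed with Apache POI is given below; the class and method names are illustrative only and not part of the claimed implementation, and a csv template would instead be read with an ordinary text reader.

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

public class TemplateHeaderParser {

    /** Returns a map of header column name to 0-based column index. */
    public static Map<String, Integer> parseHeader(File templateFile) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(templateFile)) {   // handles xls and xlsx
            Sheet sheet = workbook.getSheetAt(0);    // the agreed template uses the first sheet
            Row header = sheet.getRow(0);            // row 0 is the header row by convention
            Map<String, Integer> columns = new LinkedHashMap<>();
            for (Cell cell : header) {
                columns.put(cell.getStringCellValue().trim(), cell.getColumnIndex());
            }
            return columns;
        }
    }
}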
Step 3: the data reading module connects to the database via jdbc or odbc and, given the table name, reads the detailed field list from the database system tables; in this embodiment, the field list contains information such as the field name, field type and field remark.
The user of the current data source must have query permission on the database system tables so that the field information list of the relevant tables can be obtained.
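A hedged sketch of reading such a field list is shown below; it uses the standard JDBC DatabaseMetaData API (which the driver resolves against the system tables) rather than a hand-written query, and the class and record names are assumptions for illustration.

import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.List;

public class TableFieldReader {

    public record FieldInfo(String name, String type, String remark) {}

    public static List<FieldInfo> readFields(String jdbcUrl, String user, String pwd,
                                             String tableName) throws Exception {
        List<FieldInfo> fields = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, pwd)) {
            DatabaseMetaData meta = conn.getMetaData();
            // null catalog/schema lets the driver resolve them from the connection
            try (ResultSet rs = meta.getColumns(null, null, tableName, null)) {
                while (rs.next()) {
                    fields.add(new FieldInfo(
                            rs.getString("COLUMN_NAME"),
                            rs.getString("TYPE_NAME"),
                            rs.getString("REMARKS")));   // column comment, if the driver exposes it
                }
            }
        }
        return fields;
    }
}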
Step 4: the field mapping configuration module implements an intelligent field matching algorithm based on regular expression techniques, which matches the column names in the header row of the data template against the field names of the database table and establishes the mapping relationships;
When step 4 is performed, establishing the mapping relationships, that is, writing the intelligent matching algorithm based on regular expression techniques, specifically comprises:
Step S4-1: first, match the column names of the data template header row exactly against the field names of the database table, and establish mapping relationships for the matches;
Step S4-2: then, match the unmatched template column names exactly against the remarks (comments) of the remaining fields of the database table, and establish the corresponding mapping relationships;
Step S4-3: finally, match any still-unmatched template column names against the remarks of the remaining database table fields by similarity, according to the similarity threshold set by the rules.
The invention matches the column names of the template header row against the field names or field remarks of the database table according to these rules and automatically establishes the mapping relationships: pairs are first formed by exact string match, the database table fields left unpaired are then paired by string similarity, and if the data import template corresponds to multiple database tables, multiple mapping relationships are generated automatically.
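An illustrative sketch of this three-pass strategy (exact field-name match, exact remark match, then similarity against remarks) is given below. The normalization rule (a simple regex strip of whitespace and underscores) and the similarity measure are assumptions for the example, not the invention's actual algorithm.

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldMatcher {

    public static Map<String, String> match(List<String> templateColumns,
                                            Map<String, String> fieldToRemark,
                                            double threshold) {
        Map<String, String> mapping = new LinkedHashMap<>();   // template column -> db field
        Map<String, String> remaining = new LinkedHashMap<>(fieldToRemark);

        // Pass 1: exact match on field name (case/underscore-insensitive).
        for (String col : templateColumns) {
            remaining.keySet().stream()
                    .filter(f -> normalize(f).equals(normalize(col)))
                    .findFirst()
                    .ifPresent(f -> { mapping.put(col, f); remaining.remove(f); });
        }
        // Pass 2: exact match on the field remark (comment).
        for (String col : templateColumns) {
            if (mapping.containsKey(col)) continue;
            remaining.entrySet().stream()
                    .filter(e -> normalize(e.getValue()).equals(normalize(col)))
                    .findFirst()
                    .ifPresent(e -> { mapping.put(col, e.getKey()); remaining.remove(e.getKey()); });
        }
        // Pass 3: similarity match against the remarks of the remaining fields.
        for (String col : templateColumns) {
            if (mapping.containsKey(col)) continue;
            remaining.entrySet().stream()
                    .max((a, b) -> Double.compare(similarity(col, a.getValue()),
                                                  similarity(col, b.getValue())))
                    .filter(e -> similarity(col, e.getValue()) >= threshold)
                    .ifPresent(e -> { mapping.put(col, e.getKey()); remaining.remove(e.getKey()); });
        }
        return mapping;
    }

    private static String normalize(String s) {
        return s == null ? "" : s.toLowerCase().replaceAll("[\\s_]", "");
    }

    /** Simple character-overlap similarity; a real implementation might use edit distance instead. */
    private static double similarity(String a, String b) {
        String x = normalize(a), y = normalize(b);
        if (x.isEmpty() || y.isEmpty()) return 0;
        long common = x.chars().distinct().filter(c -> y.indexOf(c) >= 0).count();
        return (double) common / Math.max(x.length(), y.length());
    }
}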
Step 5: the field mapping configuration module provides a visual field mapping configuration interface for the template file, which displays the field relationships that have already been mapped automatically by the intelligent matching algorithm;
When step 5 is performed, the visual field mapping configuration interface supports manual adjustment of the mapping relationships.
The visual field mapping configuration interface is displayed as one tab page per database table, the tab header showing the database table name and table remark, with the mapping shown as a table below. Specifically, in this embodiment, when a template corresponds to multiple database tables, the interface switches between one tab page per database table; the tab header displays the database table name and table remark, and the table below shows the following columns: the column index of the data import template; the column name of the data import template header row; the database table field name (a drop-down box in which the already-mapped field is automatically selected, the drop-down options being the currently mapped field plus all fields of the database table that have not yet been mapped); the database table field type (linked to the field name column); the database table field remark (linked to the field name column); and the enumeration value conversion relationship (a list of key-value pairs separated by commas, for example "1-男,0-女", i.e. "1-male,0-female"). The database table field name column provides drop-down editing to support manual adjustment of the mapping between the data import template columns and the database table fields.
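One possible in-memory representation of a single row of this mapping table is sketched below; the record and field names are purely illustrative.

public record ColumnMappingRow(
        int templateColumnIndex,      // column index in the data import template
        String templateColumnName,    // column name from the template header row
        String dbFieldName,           // mapped database field (editable via the drop-down)
        String dbFieldType,           // linked to the selected field
        String dbFieldRemark,         // linked to the selected field
        String enumConversion         // e.g. "1-男,0-女" key-value pairs, comma separated
) {}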
Step 6: the configuration file generation module provides a json configuration file generator for DataX execution instances, which dynamically generates the json configuration file used by DataX according to the field mapping configuration of the template file;
When step 6 is performed, the generated json configuration file used by DataX specifically comprises a Reader plug-in part, a Writer plug-in part and a Transformer part for data cleaning and conversion.
The content of the DataX json configuration file is generated when the field mapping configuration of the data import template is saved, but at this point the json still lacks the address of the data file actually to be processed; when the DataX task is actually executed, the absolute path of the csv data file must be backfilled into the json to produce an executable json file.
In this embodiment, a base template of the json configuration file of a DataX execution instance is defined, implementing the basic json framework structure, including the DataX execution speed configuration and so on; the main content parts (the Reader plug-in part, the Writer plug-in part, the Transformer data conversion part, etc.) are temporarily replaced by corresponding placeholders.
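A sketch of what such a base template might look like is given below, expressed as a Java text block; the placeholder tokens and the channel value are assumptions for illustration, not the actual template used by the invention.

public final class DataxJobTemplate {

    /** __TRANSFORMER__ is later replaced by ', "transformer": [ ... ]' or by an empty string. */
    public static final String BASE_TEMPLATE = """
            {
              "job": {
                "setting": { "speed": { "channel": 3 } },
                "content": [
                  {
                    "reader": __READER__,
                    "writer": __WRITER__
                    __TRANSFORMER__
                  }
                ]
              }
            }
            """;

    private DataxJobTemplate() {}
}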
These placeholders correspond to the Reader plug-in part (mainly the txtfilereader plug-in, with the data file address and the column information of the corresponding data template), the Writer plug-in part (the connection information of the target data source, the table and the fields), and, where needed, the Transformer part for data cleaning and conversion (a custom groovy function implementing the enumeration value conversion defined in the mapping).
The specific steps in this embodiment are as follows (a sketch of the generator is given after the list):
Step S6-1: implement a method that generates the json text of the Reader plug-in part. According to the field mapping configuration of the data import template, the txtfilereader plug-in provided in the DataX standard plug-in library, which supports csv reading, is used to generate the json text of the Reader part (the header row of the data file is skipped via the "skipHeader":"true" attribute). At this point the path attribute of the txtfilereader plug-in is still a specific placeholder; at actual upload time it is replaced with the real absolute path of the file.
Step S6-2: implement a method that generates the json text of the Writer plug-in part. According to the field mapping configuration of the data import template and the reading order of the template columns in the Reader part, the corresponding json text of the Writer part is generated (including the data source information, the database table name, the field names, etc.). The Writer plug-in corresponding to the database type of the target data source is taken from the DataX standard plug-in library. The data source information can be read directly from the data source configuration file of the business application system that uses the invention.
Step S6-3: implement a method that generates the json text of the Transformer data conversion part. If the field mapping configuration of the data import template contains configuration data in the "enumeration value conversion relationship (key-value pair list)" column, a transformer field conversion fragment based on a custom groovy function is generated according to the rules of the DataX execution instance json configuration file; if no "enumeration value conversion relationship" is configured, an empty string is returned.
Step S6-4: backfill the json texts of the Reader plug-in part, the Writer plug-in part and the Transformer data conversion part into the predefined base template of the DataX execution instance json configuration file, replacing the corresponding placeholders, and save the result.
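The following minimal sketch, under the assumptions stated above, shows how the Reader fragment could be generated and how the real csv path could be backfilled later. "__CSV_PATH__" is an assumed placeholder token, and every template column is simply read as a string here.

import java.util.List;
import java.util.stream.Collectors;

public class DataxJsonGenerator {

    /** Builds the txtfilereader fragment; the path still holds a placeholder at this point. */
    public static String buildReaderFragment(List<Integer> columnIndexes) {
        String columns = columnIndexes.stream()
                .map(i -> "{\"index\": " + i + ", \"type\": \"string\"}")
                .collect(Collectors.joining(", "));
        return """
                {
                  "name": "txtfilereader",
                  "parameter": {
                    "path": ["__CSV_PATH__"],
                    "encoding": "UTF-8",
                    "fieldDelimiter": ",",
                    "skipHeader": "true",
                    "column": [%s]
                  }
                }
                """.formatted(columns);
    }

    /** Replaces the path placeholder with the absolute path of the uploaded csv file. */
    public static String backfillCsvPath(String jobJson, String absoluteCsvPath) {
        return jobJson.replace("__CSV_PATH__", absoluteCsvPath);
    }
}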
Step 7: the upload module provides a general data upload and import SDK to complete the file upload;
To support the upload and import of data files in Excel format, the function of converting data files in the xls and xlsx Excel formats into csv format must be implemented.
Because the invention reads data with the txtfilereader plug-in provided in the DataX standard plug-in library, data files in Excel format must be converted into csv format before the data can be read correctly.
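A hypothetical sketch of an XlsxToCsv-style utility is shown below: it streams an .xlsx sheet through Apache POI and writes a csv file that txtfilereader can consume. The csv quoting is simplified, and the implementation details are assumptions rather than the SDK's actual code.

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class XlsxToCsv {

    public static void convert(Path xlsxFile, Path csvFile) throws IOException {
        DataFormatter formatter = new DataFormatter();   // renders cells as displayed text
        try (FileInputStream in = new FileInputStream(xlsxFile.toFile());
             XSSFWorkbook workbook = new XSSFWorkbook(in);
             BufferedWriter out = Files.newBufferedWriter(csvFile, StandardCharsets.UTF_8)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (Row row : sheet) {
                List<String> values = new ArrayList<>();
                // iterate by index so that blank cells keep their column positions
                for (int c = 0; c < row.getLastCellNum(); c++) {
                    Cell cell = row.getCell(c, Row.MissingCellPolicy.CREATE_NULL_AS_BLANK);
                    String text = formatter.formatCellValue(cell);
                    values.add("\"" + text.replace("\"", "\"\"") + "\"");   // minimal csv quoting
                }
                out.write(String.join(",", values));
                out.newLine();
            }
        }
    }
}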
When step 7 is performed, the general data upload and import SDK specifically integrates file upload, format validation of the data file, verification that the file matches the data template, generation of the json configuration file of the executable DataX instance, and invocation of the DataX tool.
The invention provides a general data upload and import SDK that checks whether an uploaded data file matches the format specification of the function's import template, integrates Excel-to-csv conversion, and implements the execution scheduling strategy for DataX.
Step 8: the upload module provides XlsToCsv and XlsxToCsv utility classes that convert data files in the xls and xlsx Excel formats into csv format, so that data files in Excel format can be uploaded and imported;
In this embodiment, a standard SDK is implemented in a particular programming language (for example, but not limited to, Java), providing a single open API method that executes file upload and data import as one integrated call.
The specific steps in this embodiment are as follows (a combined sketch of the flow is given after the list):
Step S8-1: provide a general file upload method and verify that the uploaded file is a legal Excel or csv file.
Step S8-2: inside the SDK, implement a check between the uploaded data file and the data import template for which the mapping rules have been configured, verifying whether the uploaded data file matches the format specification of the template and whether its data columns are consistent with those of the template (i.e. whether the column names and order of the first row are consistent with the header row of the template, while allowing the file to contain a few more columns than the template).
Step S8-3: in the SDK, implement an Excel-to-csv conversion toolkit based on Apache POI, containing the XlsToCsv and XlsxToCsv utility classes; they convert data files with the xls or xlsx suffix uploaded by users into csv data files suitable for reading by DataX's txtfilereader plug-in.
Step S8-4: after the data file has been uploaded, files that are not in csv format are first converted into csv; the method provided by the json configuration file generator of the DataX execution instance is then called to backfill the absolute path of the csv data file into the json configuration file corresponding to the data import template, producing the json configuration file finally usable for DataX execution.
Step S8-5: with the generated instantiated json configuration file specified, the DataX tool is scheduled to carry out the task of reading the data from the csv data file, cleaning and converting it in the transformer module, and finally writing it into the target database table.
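The following high-level sketch ties steps S8-1 to S8-5 together into the single open API. All helper methods referenced here (validateFormat, matchesTemplate, toCsv, backfillCsvPath, runDataxJob) are hypothetical names standing in for the corresponding capabilities of the SDK; they are not the SDK's real signatures.

import java.nio.file.Path;

public class UploadImportSdk {

    public void uploadAndImport(Path uploadedFile, String templateId) throws Exception {
        // S8-1: accept the upload and check that it is a legal Excel/csv file
        if (!validateFormat(uploadedFile)) {
            throw new IllegalArgumentException("Not a valid xls/xlsx/csv file");
        }
        // S8-2: verify that the header row matches the configured import template
        if (!matchesTemplate(uploadedFile, templateId)) {
            throw new IllegalArgumentException("File does not match template " + templateId);
        }
        // S8-3/S8-4: convert to csv if needed, then backfill its absolute path into
        // the pre-generated json configuration for this template
        Path csvFile = uploadedFile.toString().endsWith(".csv")
                ? uploadedFile
                : toCsv(uploadedFile);
        Path jobJson = backfillCsvPath(templateId, csvFile.toAbsolutePath());
        // S8-5: hand the instantiated json to DataX and run the synchronization task
        runDataxJob(jobJson);
    }

    // hypothetical helpers, shown only to make the flow concrete
    private boolean validateFormat(Path file) { return true; }
    private boolean matchesTemplate(Path file, String templateId) { return true; }
    private Path toCsv(Path excelFile) { return excelFile; }
    private Path backfillCsvPath(String templateId, Path csvAbsolutePath) { return csvAbsolutePath; }
    private void runDataxJob(Path jobJson) { }
}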
By writing an automatic field mapping algorithm, the invention achieves convenient and efficient visual online configuration of the field mapping. It implements a dedicated json configuration file generator for DataX execution instances, providing a method that generates the DataX json configuration file for a specific data import template and a method that, for the actually uploaded data file, backfills the absolute path of the csv file to produce the json configuration file of the DataX execution instance. It also provides an SDK toolkit containing a standard general-purpose API that executes file upload and data import as one integrated call, together with Excel-to-csv conversion. A fast data upload engine based on DataX that supports data files in Excel (xls or xlsx suffix) or csv format is thereby realized.
Step 9: the scheduling module calls DataX directly through a Java wrapper, or triggers the execution of the DataX data synchronization task via shell or python using the python script provided by the DataX tool.
The absolute path of the csv data file must be backfilled to generate the json configuration file executable by DataX, which is then passed to DataX. Because DataX is implemented in Java, the corresponding methods can be wrapped in Java to call DataX directly; alternatively, the execution of the DataX data synchronization task can be triggered via shell or python using the python script provided by the DataX tool.
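A minimal sketch of the shell/python style of triggering DataX from Java via the datax.py script that ships with the DataX tool is shown below; DATAX_HOME is an assumed environment variable, and error handling is reduced to checking the exit code.

import java.io.IOException;

public class DataxLauncher {

    public static int run(String jobJsonAbsolutePath) throws IOException, InterruptedException {
        String dataxHome = System.getenv("DATAX_HOME");   // assumed to point at the DataX install
        ProcessBuilder pb = new ProcessBuilder(
                "python", dataxHome + "/bin/datax.py", jobJsonAbsolutePath);
        pb.inheritIO();                                    // stream DataX logs to our console
        Process process = pb.start();
        return process.waitFor();                          // 0 means the sync task succeeded
    }
}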
Fig. 2 is a logic diagram of the method for implementing a fast data upload engine based on DataX according to the present invention, describing the whole process of uploading and importing data with the engine. First, for each business function that needs to upload and import data files, the corresponding data upload template and the related database tables are maintained; the background intelligent matching algorithm automatically maps the columns of the data template to the fields of the database tables, although some columns and fields may fail to match automatically. On the visual field mapping configuration interface, the successfully mapped columns and the template columns not yet mapped can be viewed, and the mapping between template columns and fields can be created or adjusted manually. When the field mapping is saved, the json configuration content for the data template and the database tables is assembled automatically according to the requirements of the DataX tool. A user of the business function then uploads a data file in Excel or csv format; the engine automatically validates the format of the data file and verifies that it matches the corresponding data template. A data file that passes the checks but is not in csv format is converted into csv by the packaged SDK utility classes. Finally, the absolute path of the csv data file is backfilled into the json configuration of the corresponding data template and database tables to produce the final executable json configuration file, which drives the DataX tool to execute the task of synchronizing the data from the csv file into the target database tables.