CN115936358A - Feature processing method, generating method and device based on feature engineering platform - Google Patents

Feature processing method, generating method and device based on feature engineering platform

Info

Publication number
CN115936358A
Authority
CN
China
Prior art keywords
feature engineering
feature
node
engineering node
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211535042.6A
Other languages
Chinese (zh)
Inventor
谭荣
梁阳
钱正宇
施恩
林湘粤
王超
戴琳
曾亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211535042.6A
Publication of CN115936358A
Legal status: Pending


Abstract

The present disclosure provides a feature processing method based on a feature engineering platform, and relates to the fields of artificial intelligence, machine learning and data processing. The specific implementation is as follows: N feature engineering nodes are displayed on a visual interface of the feature engineering platform, and a 1st object list is displayed, the plurality of objects contained in the 1st object list corresponding respectively to a plurality of data sequences; an (i+1)-th object list is displayed through the following operations: in response to an operation of selecting at least one object from the i-th object list, the at least one object is taken as an input object of the i-th feature engineering node, and an output object of the i-th feature engineering node is determined; the (i+1)-th object list is determined according to the output object of the i-th feature engineering node; the (i+1)-th object list is displayed, where i = 1, ..., N-1; and a data structure of the feature engineering of the plurality of data sequences is determined according to the N-th object list. The present disclosure also provides a feature generation method, an apparatus, an electronic device and a storage medium.

Figure 202211535042

Description

Feature processing method, generation method and device based on feature engineering platform

Technical Field

The present disclosure relates to the field of artificial intelligence technology, and in particular to the fields of machine learning and data processing. It can be applied to artificial intelligence development platforms, end-to-end enterprise-level artificial intelligence platforms, machine learning platforms, and the like. More specifically, the present disclosure provides a feature processing method based on a feature engineering platform, a feature generation method based on a feature engineering platform, an apparatus, an electronic device, and a storage medium.

Background Art

At present, machine learning is developing toward greater ease of use, lower technical barriers, and lower development costs. For structured data modeling, feature engineering is both extremely time-consuming and crucial, and is often the most critical step in determining model performance.

Summary of the Invention

The present disclosure provides a feature processing method based on a feature engineering platform, a feature generation method based on a feature engineering platform, an apparatus, a device, and a storage medium.

According to a first aspect, a feature processing method based on a feature engineering platform is provided, wherein N feature engineering nodes are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1. The method comprises: displaying a 1st object list, wherein the plurality of objects contained in the 1st object list correspond respectively to a plurality of data sequences; displaying an (i+1)-th object list through the following operations: in response to an operation of selecting at least one object from the i-th object list, taking the at least one object as an input object of the i-th feature engineering node, and determining an output object of the i-th feature engineering node; determining the (i+1)-th object list according to the output object of the i-th feature engineering node; and displaying the (i+1)-th object list, where i = 1, ..., N-1; and determining, according to the N-th object list, a data structure of the feature engineering of the plurality of data sequences.

According to a second aspect, a feature generation method based on a feature engineering platform is provided, wherein N feature engineering nodes, together with the input objects and the object list of each feature engineering node, are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1; the input objects are selected according to the above feature processing method based on the feature engineering platform, and the object lists are generated according to that feature processing method. The method comprises: obtaining a plurality of data sequences, wherein the plurality of data sequences correspond respectively to the plurality of objects in the object list of the 1st feature engineering node, and the data structure of each data sequence is consistent with its corresponding object; in response to an instruction to generate features, generating, for each feature engineering node and according to the input objects of that node, a feature sequence of the plurality of data sequences at that node, wherein the data structure of the feature sequence at each feature engineering node is consistent with the objects in the object list of that node; and determining the feature sequence of the plurality of data sequences at the N-th feature engineering node as the feature engineering of the plurality of data sequences.

According to a third aspect, a feature processing apparatus based on a feature engineering platform is provided, wherein N feature engineering nodes are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1. The apparatus comprises: a first display module for displaying a 1st object list, wherein the plurality of objects contained in the 1st object list correspond respectively to a plurality of data sequences; a second display module for displaying an (i+1)-th object list, comprising: a first determination submodule for, in response to an operation of selecting at least one object from the i-th object list, taking the at least one object as an input object of the i-th feature engineering node and determining an output object of the i-th feature engineering node; a second determination submodule for determining the (i+1)-th object list according to the output object of the i-th feature engineering node; and a display submodule for displaying the (i+1)-th object list, where i = 1, ..., N-1; and a first determination module for determining, according to the N-th object list, a data structure of the feature engineering of the plurality of data sequences.

According to a fourth aspect, a feature generation apparatus based on a feature engineering platform is provided, wherein N feature engineering nodes, together with the input objects and the object list of each feature engineering node, are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1; the input objects are selected by the above feature processing apparatus based on the feature engineering platform, and the object lists are generated by that feature processing apparatus. The apparatus comprises: an acquisition module for obtaining a plurality of data sequences, wherein the plurality of data sequences correspond respectively to the plurality of objects in the object list of the 1st feature engineering node, and the data structure of each data sequence is consistent with its corresponding object; a generation module for, in response to an instruction to generate features, generating, for each feature engineering node and according to the input objects of that node, a feature sequence of the plurality of data sequences at that node, wherein the data structure of the feature sequence at each feature engineering node is consistent with the objects in the object list of that node; and a second determination module for determining the feature sequence of the plurality of data sequences at the N-th feature engineering node as the feature engineering of the plurality of data sequences.

According to a fifth aspect, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method provided according to the present disclosure.

According to a sixth aspect, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method provided according to the present disclosure.

According to a seventh aspect, a computer program product is provided, comprising a computer program stored on at least one of a readable storage medium and an electronic device, wherein the computer program, when executed by a processor, implements the method provided according to the present disclosure.

It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.

Brief Description of the Drawings

The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:

FIG. 1 is a schematic diagram of an exemplary system architecture to which the feature processing method based on a feature engineering platform and the feature generation method based on a feature engineering platform can be applied, according to an embodiment of the present disclosure;

FIG. 2 is a flow chart of a feature processing method based on a feature engineering platform according to an embodiment of the present disclosure;

FIG. 3A is a schematic diagram of multiple feature engineering nodes displayed on a visual interface of a feature engineering platform according to an embodiment of the present disclosure;

FIG. 3B is a schematic diagram of an object list of a feature engineering node according to an embodiment of the present disclosure;

FIG. 4 is a flow chart of a feature generation method based on a feature engineering platform according to an embodiment of the present disclosure;

FIG. 5 is a block diagram of a feature processing apparatus based on a feature engineering platform according to an embodiment of the present disclosure;

FIG. 6 is a block diagram of a feature generation apparatus based on a feature engineering platform according to an embodiment of the present disclosure;

FIG. 7 is a block diagram of an electronic device for the feature processing method based on a feature engineering platform and/or the feature generation method based on a feature engineering platform according to an embodiment of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art should therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.

Many artificial intelligence development platforms integrate feature engineering construction functions. They create, in a visual and interactive manner, feature engineering construction steps in categories such as data processing (e.g., filtering, column selection, column deletion), feature processing (e.g., scale transformation, normalization, OneHot encoding), feature derivation (e.g., feature crossing, feature hashing, polynomial expansion, timestamp feature derivation), and feature automation. For an original data set, the constructed feature engineering may be the set of features of the data in the original data set.

Each feature engineering construction step can serve as a feature engineering node. All feature engineering nodes (steps, operators) are connected in series to form a feature engineering platform (e.g., a Pipeline), and each node can select the feature columns to be processed in the current step. The output data of one node is the input data of the next node; the input data of the first node is the original data set, and the output data of the last node is the feature engineering processing result.

Feature engineering nodes include, for example, column selection, column deletion, oversampling, undersampling, filtering, feature crossing, polynomial expansion, outlier processing, value replacement, data type conversion, rounding, min-max normalization, standardization, dummy encoding, OneHot encoding, ordinal encoding, scale transformation, timestamp derivation, feature anomaly smoothing, binning, and feature hashing.

In a feature engineering platform, before the feature engineering nodes are actually executed, a subsequent node cannot know the output of the preceding nodes (including output column names and data types). In other words, the columns and data types that a subsequent node can select are determined only after the preceding nodes have actually run.

Therefore, feature engineering nodes can only be executed step by step: the complete input of a subsequent node is determined from the execution results of the preceding nodes, and only then can its feature input columns be selected. A complete feature engineering containing the processing results of multiple nodes cannot be built in one pass, and single-step execution and single-step debugging are very inefficient.

In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

In the technical solution of the present disclosure, the user's authorization or consent is obtained before the user's personal information is obtained or collected.

FIG. 1 is a schematic diagram of an exemplary system architecture to which the feature processing method based on a feature engineering platform and the feature generation method based on a feature engineering platform can be applied, according to an embodiment of the present disclosure. It should be noted that FIG. 1 is only an example of a system architecture to which embodiments of the present disclosure can be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments cannot be used in other devices, systems, environments or scenarios.

As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.

Users can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. The terminal devices 101, 102, 103 can be various electronic devices, including but not limited to smartphones, tablet computers and laptop computers.

The feature processing method based on a feature engineering platform provided in the embodiments of the present disclosure can generally be executed by the terminal devices 101, 102, 103. Accordingly, the feature processing apparatus based on a feature engineering platform provided in the embodiments of the present disclosure can generally be arranged in the terminal devices 101, 102, 103.

The feature generation method based on a feature engineering platform provided in the embodiments of the present disclosure can generally be executed by the server 105. Accordingly, the feature generation apparatus based on a feature engineering platform provided in the embodiments of the present disclosure can generally be arranged in the server 105.

FIG. 2 is a flow chart of a feature processing method based on a feature engineering platform according to an embodiment of the present disclosure.

N feature engineering nodes are displayed on the visual interface of the feature engineering platform, N being an integer greater than 1. The N feature engineering nodes include, for example, filtering nodes, sampling nodes, normalization nodes and column deletion nodes. The N feature engineering nodes can be added by the user through the visual interface, and the user can adjust their order.

The feature engineering platform can construct feature engineering for an original data set. The original data set may include multiple data sequences (multiple columns of data), and the data structure (Schema) of each column includes a column identifier (e.g., a column name) and a data type (e.g., bool, int, double, string).

As shown in FIG. 2, the feature processing method 200 based on a feature engineering platform may include operations S210 to S230. Operation S220 includes operations S221 to S223.

In operation S210, a 1st object list is displayed; the plurality of objects contained in the 1st object list correspond respectively to a plurality of data sequences.

In operation S220, an (i+1)-th object list is displayed through operations S221 to S223.

In operation S221, in response to an operation of selecting at least one object from the i-th object list, the at least one object is taken as an input object of the i-th feature engineering node, and an output object of the i-th feature engineering node is determined.

In operation S222, the (i+1)-th object list is determined according to the output object of the i-th feature engineering node.

In operation S223, the (i+1)-th object list is displayed.

In operation S230, a data structure of the feature engineering of the plurality of data sequences is determined according to the N-th object list.

For example, the 1st object list contains multiple objects that correspond respectively to the multiple data sequences in the original data set. The original data set may be the data set for which feature engineering is to be constructed, and the resulting feature engineering of the original data set may be the set of features of its data sequences. The data structure (Schema) of each data sequence in the original data set includes an identifier and a data type, and can be abstracted into an object that likewise includes an identifier and a data type. An object in the object list may be represented by its object identifier, and each object in the object list has a corresponding data type.
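
The abstraction described above can be illustrated with a minimal Python sketch. The names `SchemaObject`, `first_object_list`, `KNOWN_TYPES` and the sample column names are hypothetical and chosen only for illustration; the disclosure states only that each object carries an identifier and a data type.

```python
from dataclasses import dataclass

# Data types named in the description; treating them as a closed set is an assumption.
KNOWN_TYPES = {"bool", "int", "double", "string"}


@dataclass(frozen=True)
class SchemaObject:
    """One entry of an object list: a column identifier plus its data type."""
    name: str
    dtype: str


def first_object_list(schema: dict[str, str]) -> list[SchemaObject]:
    """Abstract each column's schema (name -> data type) into an object."""
    objects = []
    for name, dtype in schema.items():
        if dtype not in KNOWN_TYPES:
            raise ValueError(f"unsupported data type {dtype!r} for column {name!r}")
        objects.append(SchemaObject(name, dtype))
    return objects


# Hypothetical original data set with four columns.
raw_schema = {"a": "string", "b": "string", "c": "int", "d": "double"}
object_list_1 = first_object_list(raw_schema)
print([obj.name for obj in object_list_1])
```

Only the schema is needed to build the list; no column data is read, which is what allows the later inference to run before any node executes.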

For example, i = 1, ..., N-1. When i = 1, the 1st object list serves as the selectable list of the 1st feature engineering node, and the user can select at least one object from it as the input object of the 1st feature engineering node. The 1st feature engineering node may be provided with configuration information, from which it can be inferred how many new objects the selected objects will produce after passing through the node, what the types of those new objects are, and what new identifiers are generated for them. In other words, once the input objects of the 1st feature engineering node are selected, its output objects can be inferred automatically before the node is actually run.

For example, the 1st feature engineering node is a OneHot encoding node and the 1st object list is {a, b, c, d}. The user selects a and b as the input objects of the 1st feature engineering node, which can then produce the output objects {a_onehot, b_onehot}, both of data type int. Here a_onehot and b_onehot are the identifiers of the output objects.
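
The OneHot example can be reproduced with a few lines of Python. This is a sketch under the assumption that the node yields one output object per selected input and joins the input identifier and the suffix with an underscore, as the identifiers a_onehot and b_onehot suggest; the function name `infer_onehot_outputs` is hypothetical.

```python
def infer_onehot_outputs(selected: list[str],
                         suffix: str = "onehot",
                         output_type: str = "int") -> dict[str, str]:
    """Infer output identifiers and data types without running the node.

    Assumption: one output object per selected input object.
    """
    return {f"{name}_{suffix}": output_type for name in selected}


# The user selects a and b from the 1st object list {a, b, c, d}.
outputs = infer_onehot_outputs(["a", "b"])
print(outputs)
```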

Therefore, before the 1st feature engineering node is actually run, its input objects and output objects can be displayed intuitively and quickly on the feature engineering platform. Moreover, by adjusting the input objects of the 1st feature engineering node, the user can obtain the desired output objects.

According to an embodiment of the present disclosure, operation S222 specifically includes: taking, as the (i+1)-th object list, the object list composed of the output objects of the i-th feature engineering node and the objects in the i-th object list other than the input objects of the i-th feature engineering node; or adding the output objects of the i-th feature engineering node to the i-th object list to obtain a new object list, and taking the new object list as the (i+1)-th object list.

For example, after the output objects of the 1st feature engineering node are obtained (e.g., {a_onehot, b_onehot}), the 2nd object list can be generated. For example, the output objects {a_onehot, b_onehot} of the 1st feature engineering node and the objects {c, d} in the 1st object list other than the input objects can be combined into the object list {a_onehot, b_onehot, c, d}, which serves as the 2nd object list. Alternatively, the output objects {a_onehot, b_onehot} can be added to the 1st object list {a, b, c, d} to obtain the new object list {a, b, c, d, a_onehot, b_onehot}, which serves as the 2nd object list.
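
The two alternatives for operation S222 can be sketched as two small Python functions. The function names `replace_inputs` and `append_outputs` are hypothetical, and the ordering of the resulting lists simply follows the example above.

```python
def replace_inputs(current: list[str], inputs: list[str],
                   outputs: list[str]) -> list[str]:
    """Alternative 1: outputs plus the objects that were not selected as inputs."""
    remaining = [obj for obj in current if obj not in inputs]
    return outputs + remaining


def append_outputs(current: list[str], outputs: list[str]) -> list[str]:
    """Alternative 2: keep the whole i-th list and add the outputs to it."""
    return current + outputs


list_1 = ["a", "b", "c", "d"]
inputs = ["a", "b"]
outputs = ["a_onehot", "b_onehot"]

print(replace_inputs(list_1, inputs, outputs))
print(append_outputs(list_1, outputs))
```

The first alternative drops the consumed columns from later selection, while the second keeps them available to subsequent nodes; which one applies to a given node is not stated in this passage.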

For example, the 2nd object list can be displayed on the visual interface of the feature engineering platform. The 2nd object list serves as the selectable list of the 2nd feature engineering node. As in the operations above, the user can select at least one object from the 2nd object list as the input object of the 2nd feature engineering node, the output objects of the 2nd feature engineering node are obtained, the 3rd object list is determined, and the 3rd object list is displayed on the visual interface of the feature engineering platform.

By analogy, N object lists can be displayed in sequence. From each displayed object list, the user can select at least one object as the input object of the corresponding feature engineering node, and that node can produce output objects.

Therefore, the output Schema of each node (including output column names and data types) can be inferred automatically before any of the feature engineering nodes is actually executed. The data structures of the objects in the N-th object list then constitute the data structure of the feature engineering of the original data set, and the inferred result is consistent with the actual running result.

In the embodiments of the present disclosure, before the entire feature engineering is actually run, the input objects and output objects of each feature engineering node can be displayed intuitively and quickly on the feature engineering platform. When the user adjusts the input objects of any feature engineering node, the inference for all feature engineering nodes is completed automatically, so that the data structure of the feature engineering of the original data set can be constructed in one pass.
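
A minimal sketch of this chained inference follows, written in Python for brevity. It makes several assumptions that are not taken from the disclosure: each step produces one output per selected input, the next list is built by replacing the selected inputs with the outputs, and the step dictionaries, the function names and the "norm" suffix are all hypothetical.

```python
def apply_step(objects: dict[str, str], step: dict) -> dict[str, str]:
    """Infer the next object list (name -> data type) from one node's selection."""
    missing = [name for name in step["inputs"] if name not in objects]
    if missing:
        raise ValueError(f"inputs not in object list: {missing}")
    outputs = {f"{name}_{step['suffix']}": step["output_type"]
               for name in step["inputs"]}
    remaining = {name: dtype for name, dtype in objects.items()
                 if name not in step["inputs"]}
    return {**outputs, **remaining}


def infer_object_lists(first: dict[str, str], steps: list[dict]) -> list[dict[str, str]]:
    """Return the 1st object list followed by one inferred list per step."""
    lists = [dict(first)]
    for step in steps:
        lists.append(apply_step(lists[-1], step))
    return lists


first = {"a": "string", "b": "string", "c": "int", "d": "double"}
steps = [
    {"inputs": ["a", "b"], "suffix": "onehot", "output_type": "int"},
    # Hypothetical normalization node; the "norm" suffix is an assumption.
    {"inputs": ["c", "d"], "suffix": "norm", "output_type": "double"},
]
lists = infer_object_lists(first, steps)
print(lists[-1])

# Changing one node's selection only requires re-running the cheap inference.
steps[0]["inputs"] = ["a"]
rerun = infer_object_lists(first, steps)
print(rerun[-1])
```

Because only names and types are propagated, the whole chain can be recomputed on every selection change without touching the underlying data.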

本公开实施例的特征工程节点的推断过程可以完全由前端实现,计算量很小,不依赖后端服务器,没有服务器延时,用户体验很好。即使在没有服务器计算资源的时候,用户也可完成特征工程的初步构建。The inference process of the feature engineering node of the embodiment of the present disclosure can be completely implemented by the front end, with a small amount of calculation, no reliance on the back-end server, no server delay, and a good user experience. Even when there are no server computing resources, users can complete the preliminary construction of feature engineering.

根据本公开实施例,操作S221具体包括根据第i个特征工程节点的推断类型,确定第i个特征工程节点的输出对象的数量;根据第i个特征工程节点的后缀名属性,确定第i个特征工程节点的输出对象的标识;根据第i个特征工程节点的输出类型属性,确定第i个特征工程节点的输出对象的数据类型。According to an embodiment of the present disclosure, operation S221 specifically includes determining the number of output objects of the i-th feature engineering node according to the inferred type of the i-th feature engineering node; determining the identifier of the output object of the i-th feature engineering node according to the suffix name attribute of the i-th feature engineering node; and determining the data type of the output object of the i-th feature engineering node according to the output type attribute of the i-th feature engineering node.

For example, to enable each feature engineering node to infer its output objects automatically, an inference type, a suffix attribute, and an output type attribute may be configured for each feature engineering node.

For example, according to the inference type of a feature engineering node, the number of output objects can be determined from the number of input objects of the node. The identifier of an output object can be obtained by appending the suffix of the node to the identifier of an input object. The output type of the node can be assigned to the output object as its data type.

The inference type (SchemaInferType) may include a first type (Filter), a second type (Equal), a third type (Complement), a fourth type (Map), a fifth type (Reduce), and a sixth type (FlatMap).

The suffix attribute (ColSuffix) can be customized for each feature engineering node. For example, the suffix of the OneHot encoding node can be defined as "onehot", the suffix of the feature cross node can be defined as "cross", and so on.

The output type attribute (OutputType) can be customized for each feature engineering node and serves as the data type of the output objects. The output type attribute can include standard data types such as bool, int, double, and string, composite data types such as array<int> and array<string>, and origin, which denotes the original type.
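The three attributes above can be pictured as a small per-node configuration record. The following is a minimal sketch under stated assumptions: the class name `NodeConfig`, its field names, and the node names are hypothetical, and only the enum members, the "onehot" and "cross" suffixes, and the "int" and "string" output types come from the text.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class SchemaInferType(Enum):
    """The six inference types named in the description."""
    FILTER = "Filter"
    EQUAL = "Equal"
    COMPLEMENT = "Complement"
    MAP = "Map"
    REDUCE = "Reduce"
    FLAT_MAP = "FlatMap"


@dataclass(frozen=True)
class NodeConfig:
    """Hypothetical configuration record for one feature engineering node."""
    name: str
    infer_type: SchemaInferType
    col_suffix: Tuple[str, ...]   # ColSuffix: one entry per derived output
    output_type: Tuple[str, ...]  # OutputType: parallel to col_suffix


# Two nodes configured with the values given in the text.
ONEHOT = NodeConfig("onehot_encode", SchemaInferType.MAP, ("onehot",), ("int",))
CROSS = NodeConfig("feature_cross", SchemaInferType.REDUCE, ("cross",), ("string",))
```

Storing suffixes and output types as parallel tuples is a design choice of this sketch; it lets a one-to-many node carry several suffix and type pairs in the same record.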

To facilitate the description of the inference type, suffix attribute, and output type attribute provided by the present disclosure, the following concepts are explained first.

Initial columns (all_cols): all columns of the original data set;

Feature input columns (input_cols): the input objects selected for the current feature engineering node;

Feature output columns (output_cols): the output objects produced by the current feature engineering node after processing the input objects;

Result columns (result_cols): the complete output columns of the current feature engineering node, that is, the object list of the next feature engineering node, which contains the output objects of the current feature engineering node and the object list of the current feature engineering node (or the unselected objects in that object list).

Table 1 shows the inference type, suffix attribute, and output type attribute information according to an embodiment of the present disclosure.

Table 1

[Table 1 is provided as images in the original publication: Figure BDA0003973933720000091 and Figure BDA0003973933720000101]

This embodiment provides configuration information for a variety of feature engineering nodes. The configuration information includes the inference type, the suffix attribute, and the output type attribute, so that each feature engineering node can infer its output objects from its input objects automatically before actual execution, thereby realizing the preliminary construction of the feature engineering data structure.

For example, nodes whose inference type is the first type (Filter), such as filtering and sampling nodes, generally process all columns of the original data set. The result columns are therefore equal to the initial columns, and the schema of the feature output columns is unchanged. Illustratively, the initial columns are {a,b,c,d}, the feature input columns are {a,b,c,d}, the feature output columns are {a,b,c,d}, and the result columns are {a,b,c,d}.

A column selection node, whose inference type is the second type (Equal), selects the columns on which feature processing is to be performed. The selected columns (the feature input columns) participate in the processing of subsequent feature engineering nodes, so the result columns are equal to the feature input columns. Illustratively, the initial columns are {a,b,c,d}, the feature input columns are {a,b,c}, the feature output columns are {a,b,c}, and the result columns are {a,b,c}.

A deletion node, whose inference type is the third type (Complement), deletes the feature input columns. The result columns are therefore the set difference of the initial columns and the feature input columns, and the schema of the feature output columns is unchanged. Illustratively, the initial columns are {a,b,c,d}, the feature input columns are {d}, the feature output columns are {a,b,c}, and the result columns are {a,b,c}.

For feature engineering nodes of the first type (Filter), the second type (Equal), and the third type (Complement), the schema of the feature output columns is unchanged, and the result columns are determined by the meaning of the operation rather than by the feature output columns. These three types of feature engineering nodes can therefore be treated as special cases.
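The three special cases can be sketched as one small function. This is an illustration only: the function name `infer_result_cols` and the use of plain strings for the inference type are assumptions, and column data types are omitted because these node types leave the schema of the surviving columns unchanged.

```python
def infer_result_cols(infer_type, all_cols, input_cols):
    """Infer result_cols for Filter, Equal, and Complement nodes.

    all_cols: ordered list of the initial columns.
    input_cols: columns selected as feature input columns.
    """
    selected = set(input_cols)
    if infer_type == "Filter":
        # All columns are processed; the result equals the initial columns.
        return list(all_cols)
    if infer_type == "Equal":
        # Column selection: only the selected columns continue downstream.
        return [c for c in all_cols if c in selected]
    if infer_type == "Complement":
        # Deletion: the set difference of initial and input columns.
        return [c for c in all_cols if c not in selected]
    raise ValueError(f"not a special-case inference type: {infer_type}")


cols = ["a", "b", "c", "d"]
filter_result = infer_result_cols("Filter", cols, cols)
equal_result = infer_result_cols("Equal", cols, ["a", "b", "c"])
complement_result = infer_result_cols("Complement", cols, ["d"])
```

The three calls reproduce the illustrative column sets given in the text for each type.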

Nodes whose inference type is the fourth type (Map), such as outlier handling, normalization, standardization, and encoding nodes, belong to the one-to-one mapping type. According to an embodiment of the present disclosure, for an i-th feature engineering node of the one-to-one mapping type, the identifier of an output object of the i-th feature engineering node is determined according to the identifier of an input object of the i-th feature engineering node and the suffix attribute of the i-th feature engineering node.

Illustratively, for the OneHot encoding node, which belongs to the fourth type (Map), the suffix is defined as "onehot" and the output type is defined as "int". If the selectable columns (the object list) are {a,b,c,d} and the feature input columns are {a,b}, then the feature output columns are {a_onehot,b_onehot}, the data types of the feature output columns are {int,int}, and the result columns are {a,b,c,d,a_onehot,b_onehot}, where a, b, c, and d in the result columns retain their existing column names and data types.

For feature engineering nodes of the one-to-one mapping type, this embodiment can automatically generate the identifiers and data types of the output objects. The identifier of an output object can carry feature engineering information (for example, the object a_onehot carries the onehot information), which makes the processing applied to that output object clearer.
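The one-to-one rule above can be sketched as follows. The function name `infer_map` and the dict-based object list are assumptions of this sketch. Treating the output type "origin" as "inherit the data type of the input column" is also an assumption, since the text only says that origin denotes the original type.

```python
def infer_map(object_list, input_cols, suffix, output_type):
    """Infer the result columns of a one-to-one (Map) node.

    object_list: dict mapping column name -> data type (insertion ordered).
    Returns a new dict holding the existing columns plus one derived
    column per input column, named "<input>_<suffix>".
    """
    result = dict(object_list)  # existing columns keep names and types
    for col in input_cols:
        if col not in object_list:
            raise KeyError(f"column {col!r} is not selectable")
        # Assumption: "origin" means the output inherits the input's type.
        dtype = object_list[col] if output_type == "origin" else output_type
        result[f"{col}_{suffix}"] = dtype
    return result


objects = {"a": "string", "b": "string", "c": "double", "d": "double"}
onehot_result = infer_map(objects, ["a", "b"], "onehot", "int")
```

With the inputs from the text, `onehot_result` lists a, b, c, d, a_onehot, and b_onehot in that order, and both derived columns have the type int.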

Nodes whose inference type is the fifth type (Reduce), such as feature cross, polynomial expansion, and feature hashing nodes, belong to the many-to-one mapping type. According to an embodiment of the present disclosure, for an i-th feature engineering node of the many-to-one mapping type, the identifier of one output object of the i-th feature engineering node is determined according to the identifiers of multiple input objects of the i-th feature engineering node and the suffix attribute of the i-th feature engineering node.

Illustratively, for the feature cross node, which belongs to the fifth type (Reduce), the suffix is defined as "cross" and the output type is defined as "string". If the selectable columns are {a,b,c,d} and the feature input columns are {a,b,c}, then the feature output column is {a_b_c_cross}, the data type of the feature output column is {string}, and the result columns are {a,b,c,d,a_b_c_cross}, where a, b, c, and d in the result columns retain their existing column names and data types.

For feature engineering nodes of the many-to-one mapping type, this embodiment can automatically determine the number, identifiers, and data types of the output objects. The identifier of an output object can carry feature engineering information (for example, the object a_b_c_cross carries the cross information), which makes the processing applied to that output object clearer.
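The many-to-one rule can be sketched in the same style. The function name `infer_reduce` is hypothetical, and joining the input identifiers with underscores is inferred from the a_b_c_cross example rather than stated as a general rule in the text.

```python
def infer_reduce(object_list, input_cols, suffix, output_type):
    """Infer the result columns of a many-to-one (Reduce) node.

    All input columns are combined into a single derived column whose
    name joins the input identifiers and the node suffix with "_".
    """
    missing = [c for c in input_cols if c not in object_list]
    if missing:
        raise KeyError(f"columns not selectable: {missing}")
    if not input_cols:
        raise ValueError("a Reduce node needs at least one input column")
    result = dict(object_list)  # existing columns keep names and types
    output_name = "_".join(list(input_cols) + [suffix])
    result[output_name] = output_type
    return result


objects = {"a": "string", "b": "string", "c": "string", "d": "double"}
cross_result = infer_reduce(objects, ["a", "b", "c"], "cross", "string")
```

Exactly one column is added regardless of how many inputs are selected, which is what distinguishes this type from the one-to-one case.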

The timestamp derivation node, whose inference type is the sixth type (FlatMap), belongs to the one-to-many mapping type. According to an embodiment of the present disclosure, for an i-th feature engineering node of the one-to-many mapping type, the identifiers of multiple output objects of the i-th feature engineering node are determined according to the identifier of one input object of the i-th feature engineering node and the suffix attribute of the i-th feature engineering node.

Illustratively, for the timestamp derivation node, which belongs to the sixth type (FlatMap), the defined suffixes are ["year","month","day","hour","minute","second","season"] and the defined output types are ["int","int","int","int","int","int","string"]. If the selectable columns are {a,b,c,d}, the feature input column is {a}, and a contains time information, then the feature output columns are {a_year,a_month,a_day,a_hour,a_minute,a_second,a_season}, the data types of the feature output columns are {int,int,int,int,int,int,string}, and the result columns are {a,b,c,d,a_year,a_month,a_day,a_hour,a_minute,a_second,a_season}, where a, b, c, and d in the result columns retain their existing column names and data types.

For feature engineering nodes of the one-to-many mapping type, this embodiment can automatically determine the number, identifiers, and data types of the output objects. The identifier of an output object can carry feature engineering information (for example, the object a_year carries the year information), which makes the processing applied to that output object clearer.
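The one-to-many rule can be sketched as below. The function name `infer_flat_map` and the constant names are hypothetical; the suffix and type lists are the ones given in the text for the timestamp derivation node.

```python
TIMESTAMP_SUFFIXES = ["year", "month", "day", "hour", "minute", "second", "season"]
TIMESTAMP_TYPES = ["int", "int", "int", "int", "int", "int", "string"]


def infer_flat_map(object_list, input_col, suffixes, output_types):
    """Infer the result columns of a one-to-many (FlatMap) node.

    One input column yields one derived column per (suffix, type) pair.
    """
    if input_col not in object_list:
        raise KeyError(f"column {input_col!r} is not selectable")
    if len(suffixes) != len(output_types):
        raise ValueError("suffixes and output types must have equal length")
    result = dict(object_list)  # existing columns keep names and types
    for suffix, dtype in zip(suffixes, output_types):
        result[f"{input_col}_{suffix}"] = dtype
    return result


objects = {"a": "string", "b": "double", "c": "double", "d": "double"}
ts_result = infer_flat_map(objects, "a", TIMESTAMP_SUFFIXES, TIMESTAMP_TYPES)
```

Here the number of output objects is fixed by the length of the suffix list rather than by the number of input objects.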

Table 2 shows the inference type, suffix attribute, and output type attribute information of each feature engineering node according to an embodiment of the present disclosure.

Table 2

[Table 2 is provided as images in the original publication: Figure BDA0003973933720000121 and Figure BDA0003973933720000131]

The content displayed on the visual interface of the feature engineering platform provided by the present disclosure is described below with reference to FIGS. 3A to 3B.

FIG. 3A is a schematic diagram showing multiple feature engineering nodes on a visual interface of a feature engineering platform according to an embodiment of the present disclosure.

As shown in FIG. 3A, the multiple feature engineering nodes include a min-max normalization node, a data type conversion node, a OneHot encoding node, a feature cross node, and so on. The user can also add new nodes according to actual needs.

As shown in FIG. 3A, feature input columns (input objects) can be selected for each feature engineering node. For example, {slen,swid} is selected as the feature input columns of the min-max normalization node, on which min-max normalization is performed; "slen" and "swid" are column names (input object identifiers). Similarly, {plen,pwid} is selected as the feature input columns of the data type conversion node, on which data type conversion is performed; "plen" and "pwid" are column names (input object identifiers). {class} is selected as the feature input column of the OneHot encoding node, on which OneHot encoding is performed; "class" is a column name (input object identifier). {slen,swid} is selected as the feature input columns of the feature cross node, on which feature cross encoding is performed; "slen" and "swid" are column names (input object identifiers).

FIG. 3B is a schematic diagram of an object list of a feature engineering node according to an embodiment of the present disclosure.

As shown in FIG. 3B, after the user expands the data type conversion node, the selectable object list {slen,swid,plen,pwid,class,slen_norm} can be displayed, where "slen_norm" is an output object of the preceding feature engineering node, the min-max normalization node. For example, the user can select at least one object from this object list as an input object (feature input column).

The embodiment of the present disclosure displays the object list of each feature engineering node visually, which makes it convenient for the user to select input objects and to observe the output objects directly. The user can also adjust the input objects of any feature engineering node according to actual needs to obtain the desired output objects.

FIG. 4 is a flowchart of a feature generation method based on a feature engineering platform according to an embodiment of the present disclosure.

As shown in FIG. 4, the feature generation method 400 based on the feature engineering platform includes operations S410 to S430. The visual interface of the feature engineering platform displays N feature engineering nodes and the object list of each feature engineering node, where N is an integer greater than 1, and the object lists are generated according to the feature processing method based on the feature engineering platform described above.

In operation S410, multiple data sequences are acquired.

In operation S420, in response to an instruction to generate features, for each feature engineering node, the feature sequence of the multiple data sequences at that feature engineering node is generated according to the input objects of that feature engineering node.

In operation S430, the feature sequence of the multiple data sequences at the Nth feature engineering node is determined as the feature engineering of the multiple data sequences.

For example, the N feature engineering nodes include a filtering node, a sampling node, a normalization node, a column deletion node, and so on. The multiple data sequences may be multiple columns of data of an original data set. The original data set is the data set for which feature engineering is to be constructed, and the constructed feature engineering of the original data set may be the set of features of the multiple data sequences in the original data set.

For example, the multiple data sequences correspond respectively to the multiple objects in the object list of the 1st feature engineering node. The data structure (schema) of each data sequence includes an identifier and a data type, and the data structure of each data sequence is consistent with the corresponding object. For example, in the 1st object list {a,b,c,d}, object a represents the identifier of the 1st data sequence, object b represents the identifier of the 2nd data sequence, object c represents the identifier of the 3rd data sequence, and object d represents the identifier of the 4th data sequence.

For example, the object list of each feature engineering node contains the objects that can be selected for that node. The selected objects serve as the input objects of the node, and the node can produce output objects based on the input objects. The object list of the next feature engineering node can then be generated from the output objects. In other words, before the feature engineering nodes are actually executed, the output objects of each node (including output column names and data types) can be inferred automatically. The data structures of the objects in the Nth object list then constitute the data structure of the feature engineering of the original data set, and the inferred result is consistent with the result of actual execution.

For example, based on the inferred result of each feature engineering node, the user can manually adjust the input objects until the desired output is obtained. The user can also complete the construction of the entire feature engineering in one pass and submit it to the back-end server in one pass for actual execution.

For example, in response to an instruction to generate features input by the user (for example, the user selects a run operation on the visual interface of the feature engineering platform), the server can perform the actual execution of the feature engineering on the multiple data sequences of the original data set, node by node, in the order of the feature engineering nodes and with the selected input objects, thereby generating in turn the feature sequence of the multiple data sequences at each feature engineering node.
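The sequential execution described above can be sketched with an in-memory table. This is a toy illustration, not the server implementation: the table is a dict of column lists, and the function names `min_max_normalize`, `drop_columns`, and `run_pipeline` are hypothetical. The "norm" suffix follows the slen_norm example in the text.

```python
def min_max_normalize(table, input_cols):
    """One-to-one node: add "<col>_norm" scaled to the range [0, 1]."""
    out = dict(table)
    for col in input_cols:
        values = table[col]
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant column has zero span; map it to 0.0 to avoid dividing by zero.
        out[f"{col}_norm"] = [0.0 if span == 0 else (v - lo) / span for v in values]
    return out


def drop_columns(table, input_cols):
    """Deletion node: remove the selected columns."""
    return {name: vals for name, vals in table.items() if name not in input_cols}


def run_pipeline(table, steps):
    """Run the nodes in order; each node receives the previous node's result."""
    for node, input_cols in steps:
        table = node(table, input_cols)
    return table


raw = {"slen": [1.0, 3.0, 5.0], "swid": [2.0, 2.0, 2.0]}
features = run_pipeline(raw, [(min_max_normalize, ["slen"]), (drop_columns, ["swid"])])
```

After the run, the column names of `features` are slen and slen_norm, which is the same set of names that schema inference would have predicted for these two nodes.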

For example, at the i-th feature engineering node, the feature sequence of the multiple data sequences at the i-th feature engineering node can be generated according to the input objects of the i-th feature engineering node. This feature sequence can be called the i-th feature sequence, and its data structure is consistent with the objects in the object list of the i-th feature engineering node. For example, given the object list {a_onehot,b_onehot,c,d} of the i-th feature engineering node, object a_onehot is the identifier of the first feature in the i-th feature sequence, object b_onehot is the identifier of the second feature, object c is the identifier of the third feature, and object d is the identifier of the fourth feature, where i=1,......,N.

For example, the feature sequence of the multiple data sequences of the original data set at the Nth feature engineering node is determined as the feature engineering of the original data set, and the data structure of the feature engineering of the original data set is consistent with the objects in the inferred object list of the Nth feature engineering node.
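The consistency between the inferred object list and the actual result can be checked mechanically. The helper below is a hypothetical sketch: the function name `schema_mismatches` and the representation of both schemas as dicts from column name to data type are assumptions, and the text does not describe such a check.

```python
def schema_mismatches(inferred, actual):
    """Compare an inferred schema with the schema of the actual result.

    Both arguments map column name -> data type. Returns a list of
    human-readable problems; an empty list means the schemas agree.
    """
    problems = []
    for name, dtype in inferred.items():
        if name not in actual:
            problems.append(f"missing column: {name}")
        elif actual[name] != dtype:
            problems.append(
                f"type mismatch for {name}: inferred {dtype}, actual {actual[name]}"
            )
    for name in actual:
        if name not in inferred:
            problems.append(f"unexpected column: {name}")
    return problems


inferred = {"a_onehot": "int", "b_onehot": "int", "c": "double", "d": "double"}
same = schema_mismatches(inferred, dict(inferred))
different = schema_mismatches(inferred, {"a_onehot": "string", "c": "double", "e": "int"})
```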

In this embodiment, the entire feature engineering is constructed in one pass and submitted to the back-end server in one pass for actual execution. Because the inferred result is consistent with the result of actual execution, debugging time can be greatly reduced.

FIG. 5 is a block diagram of a feature processing apparatus based on a feature engineering platform according to an embodiment of the present disclosure.

As shown in FIG. 5, the feature processing apparatus 500 based on the feature engineering platform includes a first display module 510, a second display module 520, and a first determination module 530. The second display module 520 includes a first determination submodule 521, a second determination submodule 522, and a display submodule 523.

The first display module 510 is configured to display the 1st object list, where the multiple objects contained in the 1st object list correspond respectively to multiple data sequences.

The second display module 520 is configured to display the (i+1)-th object list.

The first determination submodule 521 is configured to, in response to an operation of selecting at least one object from the i-th object list, use the at least one object as the input objects of the i-th feature engineering node and determine the output objects of the i-th feature engineering node.

The second determination submodule 522 is configured to determine the (i+1)-th object list according to the output objects of the i-th feature engineering node.

The display submodule 523 is configured to display the (i+1)-th object list, where i=1,......,N-1.

The first determination module 530 is configured to determine the data structure of the feature engineering of the multiple data sequences according to the Nth object list.

The second determination submodule 522 is configured to use, as the (i+1)-th object list, an object list composed of the output objects of the i-th feature engineering node and the objects in the i-th object list other than the input objects of the i-th feature engineering node; or to add the output objects of the i-th feature engineering node to the i-th object list to obtain a new object list, which is used as the (i+1)-th object list.
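The two ways of forming the (i+1)-th object list can be sketched with one function. The name `next_object_list` and the `keep_inputs` flag are hypothetical, and placing the output objects after the retained objects is an assumption, since the text does not fix an ordering.

```python
def next_object_list(current, input_cols, output_cols, keep_inputs=True):
    """Build the (i+1)-th object list from the i-th one.

    keep_inputs=True:  add the outputs to the whole i-th list.
    keep_inputs=False: keep only the unselected objects, then add the outputs.
    """
    selected = set(input_cols)
    if keep_inputs:
        base = list(current)
    else:
        base = [c for c in current if c not in selected]
    # Skip outputs already present so an identifier never appears twice.
    return base + [c for c in output_cols if c not in base]


current = ["a", "b", "c", "d"]
appended = next_object_list(current, ["a", "b"], ["a_onehot", "b_onehot"])
replaced = next_object_list(current, ["a", "b"], ["a_onehot", "b_onehot"], keep_inputs=False)
```

The first mode keeps the raw columns available to later nodes, while the second mode leaves only the derived and untouched columns for later selection.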

According to an embodiment of the present disclosure, the i-th feature engineering node has a preset inference type, suffix attribute, and output type attribute.

The first determination submodule 521 includes a first determination unit, a second determination unit, and a third determination unit.

The first determination unit is configured to determine the number of output objects of the i-th feature engineering node according to the inference type of the i-th feature engineering node.

The second determination unit is configured to determine the identifiers of the output objects of the i-th feature engineering node according to the suffix attribute of the i-th feature engineering node.

The third determination unit is configured to determine the data types of the output objects of the i-th feature engineering node according to the output type attribute of the i-th feature engineering node.

The inference type includes a one-to-one mapping type. The second determination unit is configured to, for an i-th feature engineering node of the one-to-one mapping type, determine the identifier of an output object of the i-th feature engineering node according to the identifier of an input object of the i-th feature engineering node and the suffix attribute of the i-th feature engineering node.

The inference type includes a many-to-one mapping type. The second determination unit is configured to, for an i-th feature engineering node of the many-to-one mapping type, determine the identifier of one output object of the i-th feature engineering node according to the identifiers of multiple input objects of the i-th feature engineering node and the suffix attribute of the i-th feature engineering node.

The inference type includes a one-to-many mapping type. The second determination unit is configured to determine the identifiers of multiple output objects of the i-th feature engineering node according to the identifier of one input object of the i-th feature engineering node and the suffix attribute of the i-th feature engineering node.

FIG. 6 is a block diagram of a feature generation apparatus based on a feature engineering platform according to an embodiment of the present disclosure.

The visual interface of the feature engineering platform displays N feature engineering nodes and the input objects and object list of each feature engineering node, where N is an integer greater than 1. The input objects are selected by means of the feature processing apparatus 500 based on the feature engineering platform described above, and the object lists are generated by the feature processing apparatus 500.

As shown in FIG. 6, the feature generation apparatus 600 based on the feature engineering platform may include an acquisition module 610, a generation module 620, and a second determination module 630.

The acquisition module 610 is configured to acquire multiple data sequences.

The generation module 620 is configured to, in response to an instruction to generate features, for each feature engineering node, generate the feature sequence of the multiple data sequences at that feature engineering node according to the input objects of that feature engineering node, where the data structure of the feature sequence of the multiple data sequences at each feature engineering node is consistent with the objects in the object list of that feature engineering node.

The second determination module 630 is configured to determine the feature sequence of the multiple data sequences at the Nth feature engineering node as the feature engineering of the multiple data sequences.

According to an embodiment of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.

FIG. 7 shows a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in FIG. 7, the device 700 includes a computing unit 701, which can perform various appropriate actions and processing according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

Multiple components of the device 700 are connected to the I/O interface 705, including: an input unit 706, such as a keyboard or a mouse; an output unit 707, such as various types of displays and speakers; a storage unit 708, such as a magnetic disk or an optical disc; and a communication unit 709, such as a network card, a modem, or a wireless communication transceiver. The communication unit 709 allows the device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.

The computing unit 701 may be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, or microcontroller. The computing unit 701 performs the methods and processes described above, such as the feature processing method based on the feature engineering platform and/or the feature generation method based on the feature engineering platform. For example, in some embodiments, the feature processing method and/or the feature generation method may be implemented as a computer software program that is tangibly contained in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed on the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the feature processing method and/or the feature generation method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured in any other appropriate manner (e.g., by means of firmware) to perform the feature processing method based on the feature engineering platform and/or the feature generation method based on the feature engineering platform.

Various implementations of the systems and techniques described above can be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing device, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer with a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described herein), or in a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship to each other.

It should be understood that steps may be reordered, added, or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed herein can be achieved; no limitation is imposed in this respect.

The above specific embodiments do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.

Claims (17)

1. A feature processing method based on a feature engineering platform, wherein N feature engineering nodes are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1; the method comprising:
displaying a 1st object list, wherein a plurality of objects contained in the 1st object list respectively correspond to a plurality of data sequences;
displaying an (i+1)th object list by:
in response to an operation of selecting at least one object from an ith object list, taking the at least one object as an input object of an ith feature engineering node, and determining an output object of the ith feature engineering node;
determining the (i+1)th object list according to the output object of the ith feature engineering node; and
displaying the (i+1)th object list, wherein i = 1, ..., N-1; and
determining a data structure of the feature engineering of the plurality of data sequences according to the Nth object list.
2. The method of claim 1, wherein the determining the (i+1)th object list according to the output object of the ith feature engineering node comprises:
taking, as the (i+1)th object list, an object list composed of the output object of the ith feature engineering node and the objects in the ith object list other than the input object of the ith feature engineering node; or
adding the output object of the ith feature engineering node to the ith object list to obtain a new object list serving as the (i+1)th object list.
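The two alternatives of claim 2 can be sketched as a small list-update helper. This is only an illustrative sketch: the function name `next_object_list`, the `keep_inputs` flag, the use of plain strings as object identifiers, and the ordering of the resulting list are assumptions, not details given in the claims.

```python
from typing import List


def next_object_list(
    current: List[str],
    inputs: List[str],
    outputs: List[str],
    keep_inputs: bool = False,
) -> List[str]:
    """Build the (i+1)th object list from the ith list (hypothetical helper)."""
    missing = [name for name in inputs if name not in current]
    if missing:
        raise ValueError(f"inputs not in current object list: {missing}")
    if keep_inputs:
        # Second alternative: append the node's outputs to the ith list.
        return current + outputs
    # First alternative: outputs plus the objects not consumed as inputs.
    # Placing the remaining objects first is an assumption of this sketch.
    remaining = [name for name in current if name not in inputs]
    return remaining + outputs


replaced = next_object_list(["age", "city"], ["age"], ["age_norm"])
appended = next_object_list(["age", "city"], ["age"], ["age_norm"], keep_inputs=True)
```

Under these assumptions, `replaced` drops the consumed input `age`, while `appended` keeps every object of the ith list and adds the new output at the end.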
3. The method according to claim 1 or 2, wherein the ith feature engineering node has a preset inference type, a suffix name attribute, and an output type attribute; and the taking the at least one object as an input object of the ith feature engineering node and determining an output object of the ith feature engineering node comprises:
determining the number of output objects of the ith feature engineering node according to the inference type of the ith feature engineering node;
determining the identifier of an output object of the ith feature engineering node according to the suffix name attribute of the ith feature engineering node; and
determining the data type of the output object of the ith feature engineering node according to the output type attribute of the ith feature engineering node.
4. The method of claim 3, wherein the inference type comprises a one-to-one mapping type; and the determining the identifier of the output object of the ith feature engineering node according to the suffix name attribute of the ith feature engineering node comprises: for the ith feature engineering node belonging to the one-to-one mapping type,
determining the identifier of the output object of the ith feature engineering node according to the identifier of the input object of the ith feature engineering node and the suffix name attribute of the ith feature engineering node.
5. The method of claim 3, wherein the inference type comprises a many-to-one mapping type; and the determining the identifier of the output object of the ith feature engineering node according to the suffix name attribute of the ith feature engineering node comprises: for the ith feature engineering node belonging to the many-to-one mapping type,
determining the identifier of an output object of the ith feature engineering node according to the identifiers of a plurality of input objects of the ith feature engineering node and the suffix name attribute of the ith feature engineering node.
6. The method of claim 3, wherein the inference type comprises a one-to-many mapping type; and the determining the identifier of the output object of the ith feature engineering node according to the suffix name attribute of the ith feature engineering node comprises: for the ith feature engineering node belonging to the one-to-many mapping type,
determining the identifiers of a plurality of output objects of the ith feature engineering node according to the identifier of one input object of the ith feature engineering node and the suffix name attribute of the ith feature engineering node.
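The inference described in claims 3 to 6 can be illustrated with a short sketch that derives output identifiers and data types from a node's attributes. The class `FeatureNode`, the underscore-joined naming scheme, the `fan_out` field, and the type strings are all hypothetical; the claims only state that identifiers are determined from the input identifiers and the suffix name attribute.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class FeatureNode:
    inference_type: str  # "one_to_one", "many_to_one", or "one_to_many"
    suffix: str          # suffix name attribute
    output_type: str     # output type attribute
    fan_out: int = 1     # number of outputs for one_to_many (assumed field)


def infer_outputs(node: FeatureNode, input_ids: List[str]) -> List[Tuple[str, str]]:
    """Return (identifier, data type) pairs for the node's output objects."""
    if not input_ids:
        raise ValueError("at least one input object is required")
    if node.inference_type == "one_to_one":
        # One output per selected input: input identifier plus suffix.
        names = [f"{name}_{node.suffix}" for name in input_ids]
    elif node.inference_type == "many_to_one":
        # A single output named after all inputs plus the suffix.
        names = ["_".join(input_ids) + f"_{node.suffix}"]
    elif node.inference_type == "one_to_many":
        if len(input_ids) != 1:
            raise ValueError("one_to_many expects exactly one input object")
        # Several outputs derived from one input; the index is an assumption.
        names = [f"{input_ids[0]}_{node.suffix}_{k}" for k in range(node.fan_out)]
    else:
        raise ValueError(f"unknown inference type: {node.inference_type}")
    return [(name, node.output_type) for name in names]


log_outputs = infer_outputs(FeatureNode("one_to_one", "log", "float"), ["price"])
sum_outputs = infer_outputs(FeatureNode("many_to_one", "sum", "float"), ["a", "b"])
bin_outputs = infer_outputs(FeatureNode("one_to_many", "bin", "int", fan_out=2), ["age"])
```

Because the output objects are inferred from attributes alone, the next object list can be shown before any data is processed, which is consistent with the claims determining only the data structure at this stage.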
7. A feature generation method based on a feature engineering platform, wherein N feature engineering nodes and an input object and an object list of each feature engineering node are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1, the input object being selected according to the method of any one of claims 1 to 6, and the object list being generated according to the method of any one of claims 1 to 6; the method comprising:
acquiring a plurality of data sequences, wherein the plurality of data sequences respectively correspond to a plurality of objects in the object list of the 1st feature engineering node, and the data structure of each data sequence is consistent with the corresponding object;
in response to an instruction for instructing feature generation, generating, for each feature engineering node, a feature sequence of the plurality of data sequences at the feature engineering node according to the input object of the feature engineering node, wherein the data structure of the feature sequence of the plurality of data sequences at each feature engineering node is consistent with the objects in the object list of the feature engineering node; and
determining the feature sequences of the plurality of data sequences at the Nth feature engineering node as the feature engineering of the plurality of data sequences.
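The generation flow of claim 7 can be pictured as running the configured nodes in order over named data sequences. The sketch below is a simplified assumption: the `Node` tuple, the example transforms, and the choice to drop consumed inputs (the first alternative of claim 2) are illustrative and are not specified by the patent.

```python
from typing import Callable, Dict, List, NamedTuple

Column = List[float]


class Node(NamedTuple):
    inputs: List[str]          # identifiers of the selected input objects
    output: str                # identifier of the output object
    fn: Callable[..., Column]  # hypothetical transform applied to the inputs


def generate_features(data: Dict[str, Column], nodes: List[Node]) -> Dict[str, Column]:
    """Apply each node in order; the final mapping is the generated feature set."""
    columns = dict(data)  # copy so the caller's data is not modified
    for node in nodes:
        # Consume the inputs, mirroring an object list without the input objects.
        args = [columns.pop(name) for name in node.inputs]
        columns[node.output] = node.fn(*args)
    return columns


nodes = [
    Node(["price"], "price_x2", lambda p: [2 * v for v in p]),
    Node(["price_x2", "qty"], "price_x2_qty_sum",
         lambda a, b: [x + y for x, y in zip(a, b)]),
]
features = generate_features({"price": [1.0, 2.0], "qty": [3.0, 4.0]}, nodes)
```

After each node runs, the keys of `columns` match the object list configured for the next node, which is the consistency between feature sequences and object lists that the claim describes.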
8. A feature processing apparatus based on a feature engineering platform, wherein N feature engineering nodes are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1; the apparatus comprising:
a first display module, configured to display a 1st object list, wherein a plurality of objects contained in the 1st object list respectively correspond to a plurality of data sequences;
a second display module, configured to display an (i+1)th object list, the second display module comprising:
a first determining submodule, configured to, in response to an operation of selecting at least one object from an ith object list, take the at least one object as an input object of an ith feature engineering node, and determine an output object of the ith feature engineering node;
a second determining submodule, configured to determine the (i+1)th object list according to the output object of the ith feature engineering node; and
a display submodule, configured to display the (i+1)th object list, wherein i = 1, ..., N-1; and
a first determining module, configured to determine a data structure of the feature engineering of the plurality of data sequences according to the Nth object list.
9. The apparatus according to claim 8, wherein the second determining submodule is configured to take, as the (i+1)th object list, an object list composed of the output object of the ith feature engineering node and the objects in the ith object list other than the input object of the ith feature engineering node; or to add the output object of the ith feature engineering node to the ith object list to obtain a new object list serving as the (i+1)th object list.
10. The apparatus according to claim 8 or 9, wherein the ith feature engineering node has a preset inference type, a suffix name attribute, and an output type attribute; and the first determining submodule comprises:
a first determining unit, configured to determine the number of output objects of the ith feature engineering node according to the inference type of the ith feature engineering node;
a second determining unit, configured to determine the identifier of an output object of the ith feature engineering node according to the suffix name attribute of the ith feature engineering node; and
a third determining unit, configured to determine the data type of the output object of the ith feature engineering node according to the output type attribute of the ith feature engineering node.
11. The apparatus of claim 10, wherein the inference type comprises a one-to-one mapping type; and the second determining unit is configured to determine, for the ith feature engineering node belonging to the one-to-one mapping type, the identifier of the output object of the ith feature engineering node according to the identifier of the input object of the ith feature engineering node and the suffix name attribute of the ith feature engineering node.
12. The apparatus of claim 10, wherein the inference type comprises a many-to-one mapping type; and the second determining unit is configured to determine, for the ith feature engineering node belonging to the many-to-one mapping type, the identifier of an output object of the ith feature engineering node according to the identifiers of a plurality of input objects of the ith feature engineering node and the suffix name attribute of the ith feature engineering node.
13. The apparatus of claim 10, wherein the inference type comprises a one-to-many mapping type; and the second determining unit is configured to determine the identifiers of a plurality of output objects of the ith feature engineering node according to the identifier of one input object of the ith feature engineering node and the suffix name attribute of the ith feature engineering node.
14. A feature generation apparatus based on a feature engineering platform, wherein N feature engineering nodes and an input object and an object list of each feature engineering node are displayed on a visual interface of the feature engineering platform, N being an integer greater than 1, the input object being selected according to the apparatus of any one of claims 8 to 13, and the object list being generated according to the apparatus of any one of claims 8 to 13; the apparatus comprising:
an acquisition module, configured to acquire a plurality of data sequences, wherein the plurality of data sequences respectively correspond to a plurality of objects in the object list of the 1st feature engineering node, and the data structure of each data sequence is consistent with the corresponding object;
a generating module, configured to, in response to an instruction for instructing feature generation, generate, for each feature engineering node, a feature sequence of the plurality of data sequences at the feature engineering node according to the input object of the feature engineering node, wherein the data structure of the feature sequence of the plurality of data sequences at each feature engineering node is consistent with the objects in the object list of the feature engineering node; and
a second determining module, configured to determine the feature sequences of the plurality of data sequences at the Nth feature engineering node as the feature engineering of the plurality of data sequences.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
16. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1 to 7.
17. A computer program product comprising a computer program stored on at least one of a readable storage medium and an electronic device, the computer program, when executed by a processor, implementing the method according to any one of claims 1 to 7.
CN202211535042.6A | Priority date 2022-11-30 | Filing date 2022-11-30 | Feature processing method, generating method and device based on feature engineering platform | Pending | CN115936358A (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202211535042.6A | 2022-11-30 | 2022-11-30 | Feature processing method, generating method and device based on feature engineering platform

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202211535042.6A | 2022-11-30 | 2022-11-30 | Feature processing method, generating method and device based on feature engineering platform

Publications (1)

Publication Number | Publication Date
CN115936358A (en) | 2023-04-07

Family

ID=86650121

Family Applications (1)

Application Number | Status | Publication | Priority Date | Filing Date | Title
CN202211535042.6A | Pending | CN115936358A (en) | 2022-11-30 | 2022-11-30 | Feature processing method, generating method and device based on feature engineering platform

Country Status (1)

Country | Link
CN (1) | CN115936358A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication numberPriority datePublication dateAssigneeTitle
US10552002B1 (en)*2016-09-272020-02-04Palantir Technologies Inc.User interface based variable machine modeling
CN110727670A (en)*2019-10-112020-01-24集奥聚合(北京)人工智能科技有限公司Data structure prediction transfer and automatic data processing method based on flow chart
WO2021208685A1 (en)*2020-04-172021-10-21第四范式(北京)技术有限公司Method and apparatus for executing automatic machine learning process, and device
CN112668723A (en)*2020-12-292021-04-16杭州海康威视数字技术股份有限公司Machine learning method and system
CN113010164A (en)*2021-04-012021-06-22杭州初灵数据科技有限公司Visual machine learning feature extraction system and method based on feature calculation graph
CN114912544A (en)*2022-06-062022-08-16北京百度网讯科技有限公司Automatic characteristic engineering model training method and automatic characteristic engineering method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
I. CHEVYREV ET AL.: "Feature Engineering with Regularity Structures", JOURNAL OF SCIENTIFIC COMPUTING, vol. 98, 12 August 2021 (2021-08-12)*
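王成; 王昌琪: "一种面向网络支付反欺诈的自动化特征工程方法" [An automated feature engineering method for online payment anti-fraud], 计算机学报 (Chinese Journal of Computers), no. 10, 15 October 2020 (2020-10-15)*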


Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
