CN105677353A

Info

Publication number: CN105677353A
Application number: CN201610011587.5A
Authority: CN
Inventors: 白杨; 陈雨强
Original assignee: Beijing Wusi Chuangxiang Technology Co ltd
Current assignee: Beijing Wusi Chuangxiang Technology Co ltd
Priority date: 2016-01-08
Filing date: 2016-01-08
Publication date: 2016-06-15
Also published as: CN110442417A

Abstract

A feature extraction method, a machine learning method and corresponding apparatuses are provided. The feature extraction method comprises: acquiring data records; acquiring feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature comprises a source field item and a processing method item, the source field item defines the fields of the data records related to that feature as source fields, and the processing method item specifies a reference to a data processing function programmed in advance as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that feature; and performing data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features. The feature extraction and machine learning techniques according to embodiments of the invention enhance programming flexibility and code reusability, and are particularly suitable for big data applications.

Description

Translated from Chinese

Feature extraction method, machine learning method and device thereof

Technical field

The present invention relates generally to the field of information technology, and more specifically to a feature extraction method, a machine learning method and corresponding devices.

Background

In information technology fields such as data mining and machine learning, the object being processed is data, and features usually have to be extracted from the data before large volumes of it are processed.

Features serve as the raw material of data processing. Put simply, each data record may include multiple fields, and a feature may indicate a field itself, a part of a field, a combination of fields, or the result of transforming or otherwise processing fields, so as to better reflect the internal correlations and latent meaning of the data distribution. Taking data mining as an example, features are the raw material of a machine learning system and have a significant impact on the final model: extracting features efficiently and accurately helps the learning process distill the regularities in the data and examine, from multiple perspectives, the internal correlations and latent meaning in the data distribution. In machine learning this process is called feature engineering. The output of feature engineering is the material for machine learning, and its quality directly determines how accurately the machine learning problem is characterized, which in turn affects the quality of the model.

In fact, feature extraction is not limited to feature engineering in machine learning; it is usually required in any existing data processing system. To extract the corresponding features from the contents of the fields, programmers generally have to write executable program code for each class of feature.

For example, to obtain the year information in the time field of every record in the given data ("data"), the following Python program can be executed:

    # param: list - data stores records of fields as a list of dictionaries
    # param: string - 'YYYY-MM-DD' formatted date field
    # return: list - year sequence, one entry per record
    def getYearOf(data):
        # Collect the raw time field of every record.
        timeFields = [rec['time'] for rec in data]
        # Keep the part before the first '-'; list() makes this a list on Python 3 as well.
        years = list(map(lambda x: x.split('-')[0], timeFields))
        return years

The above program defines code that extracts, unchanged, the year of each data record (rec) in the data source (data) as the year feature. It first extracts the time field from the records of the data source, then, relying on the specific format of the time field (yyyy-mm-dd), takes the yyyy part delimited by "-" (that is, the part with index 0), maps it to the feature years, and returns the extracted year values.

As can be seen, this program imposes strong constraints on both the format of the data (the year field) and the output of the feature extraction. In other words, this feature extraction code is tailored to data in a specific format and to a specific output. In general, therefore, if the given data has a different format and/or the features to be obtained differ, substantially different code has to be written for the specific format and the algorithm used. Even if only the input order of the fields of the data records or the output order of the features differs, a completely customized set of code has to be rewritten. This not only places a heavy workload on programmers, but also incurs considerable overhead at run time. Given the diversity of real application scenarios and data specifications, this brute-force approach is hard to extend and reuse.

Therefore, the existing approach of developing a separate processing flow for each data format and each kind of extracted content amounts to enumerating the problem space. As a result, the development complexity of feature extraction grows nonlinearly, and the run-time complexity is also hard to bound.

Summary of the invention

The present invention has been made in view of the above circumstances.

According to one aspect of the present invention, there is provided a method for extracting features from data records, which may include: a data record acquisition step of acquiring data records; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data records involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value acquisition step of performing data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features.
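The three steps of this aspect can be illustrated with a minimal Python sketch. It is not the patented implementation: the registry `PROCESSORS`, the dictionary keys `feature`, `source_fields` and `method`, and the functions `take_year` and `identity` are all hypothetical names chosen for illustration. The point it shows is that the main routine `extract_features` never changes; only the configuration items and the pre-coded functions do.

```python
from typing import Any, Callable, Dict, List


# Pre-coded data processing functions (hypothetical examples).
def take_year(value: str) -> str:
    """Return the 'YYYY' part of a 'YYYY-MM-DD' string."""
    return value.split("-")[0]


def identity(value: Any) -> Any:
    """Use the field value itself as the feature value."""
    return value


# Registry so that a processing method item can refer to a function by name.
PROCESSORS: Dict[str, Callable[..., Any]] = {
    "take_year": take_year,
    "identity": identity,
}

# Feature extraction configuration items, one per predetermined feature.
# The key names are assumptions, not taken from the patent.
FEATURE_CONFIG: List[Dict[str, Any]] = [
    {"feature": "year", "source_fields": ["time"], "method": "take_year"},
    {"feature": "age", "source_fields": ["age"], "method": "identity"},
]


def extract_features(record: Dict[str, Any],
                     config: List[Dict[str, Any]],
                     processors: Dict[str, Callable[..., Any]] = PROCESSORS) -> Dict[str, Any]:
    """Generic main routine: apply each configured function to its source fields."""
    features = {}
    for item in config:
        func = processors[item["method"]]                      # resolve the reference
        args = [record[name] for name in item["source_fields"]]  # field values of the source fields
        features[item["feature"]] = func(*args)
    return features


record = {"time": "2016-01-08", "age": 31}
features = extract_features(record, FEATURE_CONFIG)
```

Supporting a new data source in this sketch means editing `FEATURE_CONFIG`, and supporting a new kind of processing means registering one more function; `extract_features` itself stays untouched.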

Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item acquisition step may include: reading the feature extraction configuration items from a configuration file in which they are set, or acquiring them according to the user's input operations, wherein the configuration file is stored locally or received remotely.
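Reading configuration items from a locally stored file might look like the following sketch. The on-disk format is not fixed by this passage, so JSON is purely an assumption here, as are the function name `load_feature_config` and the key names `source_fields` and `method`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"source_fields", "method"}  # assumed key names


def load_feature_config(path):
    """Read feature extraction configuration items from a JSON file (assumed format)."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    for item in config:
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"configuration item lacks {sorted(missing)}: {item}")
    return config


# Round trip through a temporary file to show the call.
sample = [{"feature": "year", "source_fields": ["time"], "method": "take_year"}]
with tempfile.TemporaryDirectory() as tmp:
    config_path = os.path.join(tmp, "features.json")
    with open(config_path, "w", encoding="utf-8") as handle:
        json.dump(sample, handle)
    loaded = load_feature_config(config_path)
```

A remotely received file could be handled the same way once its contents have been saved or buffered locally; that transport step is outside this sketch.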

Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item acquisition step may include: displaying to the user an interface for setting feature extraction configuration items; generating, according to the input operations performed by the user on the interface, a configuration file in which the feature extraction configuration items are set; and reading the feature extraction configuration items from the generated configuration file.

Further, in the feature extraction method according to an embodiment of the present invention, the interface for setting feature extraction configuration items may be a graphical user interface, and the graphical user interface may include a text editing interface for manually editing the configuration file and/or a selection input interface that displays the content options of the feature extraction configuration items for manual selection.

Further, in the feature extraction method according to an embodiment of the present invention, in the feature extraction configuration item acquisition step, the interface may be switched between the text editing interface and the selection input interface in response to an interface switching operation input by the user, and the feature extraction configuration item settings made in the interface before switching are synchronously displayed in the interface after switching.

Further, in the feature extraction method according to an embodiment of the present invention, the selection input interface displays at least the fields of the data records that can serve as source fields and the feature extraction configuration items of the predetermined features that have been set.

Further, in the feature extraction method according to an embodiment of the present invention, in the case where the graphical user interface includes the selection input interface, the step of displaying to the user the interface for setting feature extraction configuration items may include: displaying the field selected by the user from among the fields as the set source field; displaying, at the same time as the source field is selected, a processing method list near the source field; and displaying the processing method selected by the user from the processing method list as the set processing method.

Further, in the feature extraction method according to an embodiment of the present invention, the processing method list includes all processing methods with all of them active; or the processing method list includes all processing methods but only those applicable to the source field item are active; or the processing method list includes only the processing methods applicable to the source field item.
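The three alternatives for the processing method list can be sketched as a small function that returns, for a chosen source field, the entries to display together with an active flag. Everything named here is an assumption made for illustration: the applicability table `METHOD_APPLICABILITY`, the field types `number` and `string`, and the mode names.

```python
# Which field types each processing method can handle (hypothetical table).
METHOD_APPLICABILITY = {
    "identity": {"number", "string"},
    "discretize": {"number"},
    "take_year": {"string"},
}


def method_list(field_type, mode):
    """Return (method name, is_active) pairs for the selected source field."""
    names = sorted(METHOD_APPLICABILITY)
    if mode == "all_active":
        # Every method is listed and selectable.
        return [(name, True) for name in names]
    if mode == "all_listed":
        # Every method is listed, but only applicable ones are selectable.
        return [(name, field_type in METHOD_APPLICABILITY[name]) for name in names]
    if mode == "applicable_only":
        # Only applicable methods are listed at all.
        return [(name, True) for name in names
                if field_type in METHOD_APPLICABILITY[name]]
    raise ValueError(f"unknown mode: {mode}")


greyed_out_view = method_list("number", "all_listed")
filtered_view = method_list("number", "applicable_only")
```

The second and third modes give the same set of selectable methods; they differ only in whether the inapplicable ones remain visible to the user.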

Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a processing parameter item corresponding to the processing method item, the processing parameter item defining the parameters involved in the data processing function.
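A processing parameter item can be pictured as extra keyword arguments carried by the configuration item and handed to the referenced function. In the sketch below, the `params` key, the `discretize` function and its `boundaries` argument are hypothetical; the passage itself only says that the item defines the parameters the data processing function involves.

```python
def discretize(value, boundaries):
    """Return the index of the first boundary that exceeds value (bucket number)."""
    for index, upper in enumerate(boundaries):
        if value < upper:
            return index
    return len(boundaries)


PROCESSORS = {"discretize": discretize}

# Configuration item with a processing parameter item (assumed key: "params").
age_bucket_item = {
    "feature": "age_bucket",
    "source_fields": ["age"],
    "method": "discretize",
    "params": {"boundaries": [18, 30, 50]},
}


def apply_item(record, item, processors):
    """Call the referenced function with the source field values and its parameters."""
    func = processors[item["method"]]
    args = [record[name] for name in item["source_fields"]]
    return func(*args, **item.get("params", {}))


bucket = apply_item({"age": 31}, age_bucket_item, PROCESSORS)
```

Changing the bucket boundaries then requires editing only the configuration item, not the `discretize` code.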

Further, in the feature extraction method according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a storage location identifier, which indicates the storage area in memory of the calculation coefficients corresponding to the feature values of that predetermined feature.

Further, in the feature extraction method according to an embodiment of the present invention, in the feature value acquisition step, data processing may be performed in parallel on the individual data records, or on groups each consisting of multiple data records.

Further, in the feature extraction method according to an embodiment of the present invention, in the feature value acquisition step, the data processing may be performed in parallel by a distributed computing cluster.
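Because each record (or group of records) is processed independently, the feature value acquisition step parallelizes naturally. The sketch below uses a local thread pool from Python's standard library purely as a stand-in for a distributed computing cluster; the helper names `chunk`, `extract_group` and `extract_parallel` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def take_year(value):
    """Return the 'YYYY' part of a 'YYYY-MM-DD' string."""
    return value.split("-")[0]


def extract_group(group):
    """Extract features for one group of records; groups share no state."""
    return [{"year": take_year(rec["time"])} for rec in group]


def chunk(records, size):
    """Split the records into consecutive groups of at most `size` records."""
    return [records[i:i + size] for i in range(0, len(records), size)]


def extract_parallel(records, group_size=2, workers=2):
    """Process the groups concurrently and return features in input order."""
    groups = chunk(records, group_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the order of the submitted groups.
        results = pool.map(extract_group, groups)
        return [features for group_result in results for features in group_result]


records = [{"time": "2014-03-01"}, {"time": "2015-07-19"}, {"time": "2016-01-08"}]
parallel_features = extract_parallel(records)
```

Setting `group_size=1` corresponds to processing individual records in parallel, while larger values correspond to processing groups of records.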

According to another aspect of the present invention, there is provided a computer-executed machine learning method, which may include: a data record acquisition step of acquiring data records; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data records involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition step of performing data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features; a sample obtaining step of forming, based at least in part on the feature values acquired in the feature value acquisition step, feature vectors as samples for machine learning; and a machine learning step of performing machine learning based on the samples.
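The sample obtaining step and the machine learning step can be sketched as follows. The passage does not name a learning algorithm, so a nearest-centroid classifier is swapped in here only because it fits in a few lines; the feature names, the toy feature values and the labels are all made up for illustration.

```python
from collections import defaultdict

FEATURE_ORDER = ["age_bucket", "overdue_count"]  # hypothetical feature names


def to_vector(features, order=FEATURE_ORDER):
    """Sample obtaining step: arrange feature values into a fixed-order vector."""
    return [float(features[name]) for name in order]


def train_centroids(samples, labels):
    """Machine learning step (stand-in): compute the mean vector of each label."""
    grouped = defaultdict(list)
    for vector, label in zip(samples, labels):
        grouped[label].append(vector)
    return {label: [sum(column) / len(vectors) for column in zip(*vectors)]
            for label, vectors in grouped.items()}


def predict(centroids, vector):
    """Model application: pick the label whose centroid is closest."""
    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[label]))
    return min(centroids, key=squared_distance)


# Toy extracted feature values and labels (illustrative only).
extracted = [
    {"age_bucket": 0, "overdue_count": 3},
    {"age_bucket": 1, "overdue_count": 4},
    {"age_bucket": 2, "overdue_count": 0},
    {"age_bucket": 3, "overdue_count": 1},
]
labels = [1, 1, 0, 0]

samples = [to_vector(features) for features in extracted]
centroids = train_centroids(samples, labels)
prediction = predict(centroids, to_vector({"age_bucket": 0, "overdue_count": 4}))
```

Any other learner could replace `train_centroids` and `predict`; the part that mirrors the text is that the samples are built from the feature values produced by the configured extraction.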

Further, in the machine learning method according to an embodiment of the present invention, in the machine learning step, at least one of model training, model testing and model application is performed based on the samples.

According to another aspect of the present invention, there is provided a computing device for extracting features from data records, including a storage component and a processor, the storage component storing a set of computer-executable instructions which, when executed by the processor, perform the following steps: a data record acquisition step of acquiring data records; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data records involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value acquisition step of performing data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features.

According to another aspect of the present invention, there is provided a computing device for machine learning, including a storage component and a processor, the storage component storing a set of computer-executable instructions which, when executed by the processor, perform the following steps: a data record acquisition step of acquiring data records; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data records involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition step of performing data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features; a sample obtaining step of forming, based at least in part on the feature values acquired in the feature value acquisition step, feature vectors as samples for machine learning; and a machine learning step of performing machine learning based on the samples.

According to another aspect of the present invention, there is provided a feature extraction device for extracting features from data records, which may include: a data record acquisition unit configured to acquire data records; a feature extraction configuration item acquisition unit configured to acquire feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data records involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value acquisition unit configured to perform data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features.

Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item acquisition unit may read the feature extraction configuration items from a configuration file in which they are set, or acquire them according to the user's input operations, wherein the configuration file is stored locally or received remotely.

Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item acquisition unit may display to the user an interface for setting feature extraction configuration items, generate, according to the input operations performed by the user on the interface, a configuration file in which the feature extraction configuration items are set, and read the feature extraction configuration items from the generated configuration file.

Further, in the feature extraction device according to an embodiment of the present invention, the interface for setting feature extraction configuration items may be a graphical user interface, and the graphical user interface may include a text editing interface for manually editing the configuration file and/or a selection input interface that displays the content options of the feature extraction configuration items for manual selection.

Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item acquisition unit may switch between the text editing interface and the selection input interface in response to an interface switching operation input by the user, and the feature extraction configuration item settings made in the interface before switching are synchronously displayed in the interface after switching.

Further, in the feature extraction device according to an embodiment of the present invention, the selection input interface may display at least the fields of the data records that can serve as source fields and the feature extraction configuration items of the predetermined features that have been set.

Further, in the feature extraction device according to an embodiment of the present invention, in the case where the graphical user interface includes the selection input interface, the feature extraction configuration item acquisition unit may display the field selected by the user from among the fields as the set source field, display, at the same time as the source field is selected, a processing method list near the source field, and display the processing method selected by the user from the processing method list as the set processing method.

Further, in the feature extraction device according to an embodiment of the present invention, the processing method list may include all processing methods with all of them active; or the processing method list may include all processing methods but only those applicable to the source field item are active; or the processing method list may include only the processing methods applicable to the source field item.

Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a processing parameter item corresponding to the processing method item, and the processing parameter item may define the parameters involved in the data processing function.

Further, in the feature extraction device according to an embodiment of the present invention, the feature extraction configuration item of each predetermined feature may further include a storage location identifier, which indicates the storage area in memory of the calculation coefficients corresponding to the feature values of that predetermined feature.

Further, in the feature extraction device according to an embodiment of the present invention, the feature value acquisition unit may perform data processing in parallel on the individual data records, or on groups each consisting of multiple data records.

Further, in the feature extraction device according to an embodiment of the present invention, the feature value acquisition unit may perform the data processing in parallel by means of a distributed computing cluster.

According to another aspect of the present invention, there is provided a machine learning device, which may include: a data record acquisition unit configured to acquire data records; a feature extraction configuration item acquisition unit configured to acquire feature extraction configuration items that define how to extract predetermined features from the data records, wherein the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data records involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition unit configured to perform data processing on the field values of the data records based on the feature extraction configuration items to acquire the feature values of the predetermined features; a sample obtaining unit configured to form, based at least in part on the feature values acquired by the feature value acquisition unit, feature vectors as samples for machine learning; and a machine learning unit configured to perform machine learning based on the samples.

Further, in the machine learning device according to an embodiment of the present invention, the machine learning unit may perform at least one of model training, model testing and model application based on the samples.

With the feature extraction and machine learning techniques according to embodiments of the present invention, the individual feature extraction configuration items can be changed as needed, independently of the main feature extraction program. Feature extraction can thus be effectively "abstracted" and "represented" according to the scenario: the main feature extraction program does not need substantive changes, while data processing functions can be written or added flexibly and independently. Consequently, for different databases, once the feature extraction configuration items are defined as needed, the same main feature extraction program and the corresponding data processing functions can be used, which enhances programming flexibility, maintainability and code reusability.

Brief description of the drawings

These and/or other aspects and advantages of the present invention will become clearer and easier to understand from the following detailed description of embodiments of the present invention in conjunction with the accompanying drawings, in which:

Fig. 1 shows an overall flowchart of a feature extraction method according to an embodiment of the present invention.

Fig. 2 shows an example of the content of a feature extraction configuration file.

Fig. 3 shows an example of performing the feature extraction process in a distributed manner when the data records are sample data in machine learning.

Fig. 4A shows an example of a graphical user interface for configuring feature extraction according to an exemplary embodiment of the present invention.

Fig. 4B shows an example of part of the graphical user interface in which a processing method list is displayed while the user selects a single field (for example, the "age" field) in the left area.

Fig. 4C shows an example of part of the graphical user interface in which a processing method list is displayed while the user selects multiple fields in the left area.

Fig. 5 shows an exemplary graphical user interface with an area in which feature extraction configuration items can be edited as text.

Fig. 6 shows an overall flowchart of a machine learning method that applies the feature extraction method of the above embodiment, according to an embodiment of the present invention.

Fig. 7 shows a block diagram of the configuration of a computing device according to an embodiment of the present invention.

Detailed description

To enable those skilled in the art to better understand the present invention, it is described in further detail below with reference to the accompanying drawings and specific embodiments.

Fig. 1 shows an overall flowchart of a feature extraction method 100 according to an embodiment of the present invention. The method may be performed by a computer program or by a dedicated feature extraction apparatus. As an example, the program implementing the method may be packaged as a dedicated software package (for example, a lib library), so that a consistent feature extraction service is obtained whether the package is called online or offline. This overcomes the defect in the prior art that feature extraction results are inconsistent because the online and offline environments differ.

In step S110, data records are acquired. Data can be presented in the form of data records, where a data record is a complete set of related information corresponding to one row of information in a data source. For example, all the information about a given customer in a customer mailing list constitutes one data record.

The data records here are the raw material for the subsequent feature extraction; each data record may have fields of various types and their corresponding field values. As an example, the following table shows a single data record describing a customer's personal information and loan repayment history; the features extracted from this data record will be used to train a model of customer loan risk:

No. | Age | Job | Housing | Contact | Birthday | Loan repayment label
1 | 37 | teacher | yes ("y") | spouse | 1979-3-1 | yes ("y")

As listed in the table above, the data record describes some basic information about the customer, for example age (age), job (job), whether the customer owns housing (housing), contact person (contact), and birthday (birthday), and also includes a label (label) concerning the customer's loan repayment. Specifically, a label of "y" indicates that the customer has a positive loan repayment record, and a label of "n" indicates that the customer has a negative loan repayment record. As an example, the features extracted from the above data record can be used as a training sample to train, with a machine learning algorithm, a model for predicting customer loan risk.

It should be understood that a data record may have various different fields describing various aspects of information, and that the content and format of the fields are not limited. Moreover, a data record does not necessarily carry a label for the prediction target; it may carry no label at all.

There is no restriction on how data records are acquired; any way of acquiring online or offline data may be used. For example, one or more data files may be stored in advance on a local hard disk or in a distributed file storage system, and the data records obtained by reading those files; or the data records may be obtained by accessing a local or remote database. Data records may also be generated in real time; for example, they may be obtained in real time, through a specific communication protocol (for example, an interface description language, IDL), from the device that calls the feature extraction service. As an example, a complete data record may be generated by splicing together multiple data records.

Data records may be acquired one at a time or several at a time.

In step S120, feature extraction configuration items that define how predetermined features are to be extracted from the data records are acquired. The feature extraction configuration item of each predetermined feature includes a source field item and a processing method item. The source field item defines, as source fields, the fields of the data record involved in that predetermined feature. The processing method item specifies a reference to a data processing function that has been programmed in advance as executable code, where the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature.

In one example, the feature extraction configuration item of each predetermined feature further includes a processing parameter item corresponding to the processing method item; the processing parameter item defines the parameters used by the data processing function. Examples of processing parameter items include format parameters, extraction interval parameters, division threshold parameters, and mapping rule parameters. Setting the parameter item independently of the processing method item allows similar data processing functions to be consolidated, so that a separate function need not be written for every parameter variation, which further improves the code efficiency of data processing.
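As a minimal sketch of why a separate parameter item helps, the hypothetical function below handles several date formats and date components through parameters rather than through one function per variation. The name `get_date_part` and its `fmt` and `part` parameters are illustrative assumptions and do not come from the patent.

```python
from datetime import datetime


def get_date_part(value: str, fmt: str = "%Y-%m-%d", part: str = "year") -> str:
    """Return one component of a date string as text.

    `fmt` and `part` stand in for a processing parameter item: they are
    supplied by configuration, so one function covers many variations.
    """
    if part not in ("year", "month", "day"):
        raise ValueError(f"unsupported date part: {part}")
    parsed = datetime.strptime(value, fmt)
    return str(getattr(parsed, part))


# Same function, different (assumed) parameter items.
year = get_date_part("1979-03-01", part="year")
month = get_date_part("1979/03/01", fmt="%Y/%m/%d", part="month")
```

With this shape, adding a new date layout only requires a new parameter value in the configuration, not a new function.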

In one example, the feature extraction configuration item of each predetermined feature may further include a storage location identifier, which indicates the storage area in memory of the calculation coefficients corresponding to the feature values of that predetermined feature. Taking machine learning as an example, the calculation coefficients (for example, the weights in a machine learning algorithm) corresponding to the individual feature values of each predetermined feature are stored at corresponding locations in memory. However, as the dimensionality of the sample feature values keeps growing (even reaching hundreds of millions) while memory space remains limited, it is difficult to assign each feature value its own one-to-one storage address. The storage addresses of the calculation coefficients therefore need to be mapped into a limited address space for lookup; to this end, a hash transformation is applied to the storage location of the calculation coefficient of each feature value to obtain the mapped memory address. Hash transformation, however, causes address collisions: different sample feature values become confused with one another, which introduces considerable error into the machine learning computation. For this reason, the memory storage space can be partitioned based on the storage location identifier so that at least the calculation coefficients of different kinds of predetermined features are stored separately. For example, the storage location identifier can be designed as N bytes (N being an integer) corresponding to the feature type and used as the high-order bytes of the memory address, with the hash-transformed address used as the low-order bytes. The combined memory addresses are then separated by feature type in the storage space, so that the calculation coefficients of different kinds of features do not mistakenly overwrite one another.
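The address scheme described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `coefficient_address`, the 3-byte width of the hashed part, and the use of CRC32 as the hash are all hypothetical choices, since the patent does not name a hash function or an address width.

```python
import zlib

HASH_BYTES = 3  # assumed width of the low-order, hashed part of the address


def coefficient_address(type_id: int, feature_value: str,
                        hash_bytes: int = HASH_BYTES) -> int:
    """Combine a per-feature-type prefix with a hashed feature value."""
    low_mask = (1 << (8 * hash_bytes)) - 1
    # Hash the feature value into the limited low-order address space.
    low = zlib.crc32(feature_value.encode("utf-8")) & low_mask
    # The storage location identifier occupies the high-order bytes.
    return (type_id << (8 * hash_bytes)) | low


# The same raw value under two feature types lands in different regions.
addr_age = coefficient_address(1, "37")
addr_year = coefficient_address(2, "37")
```

Collisions between two values of the same feature type remain possible in this sketch; the prefix only prevents coefficients of different feature types from sharing an address.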

In one example, the feature extraction configuration items are set in advance in a configuration file; accordingly, in step S120, the feature extraction configuration items are read from that configuration file, for example by parsing it. The configuration file is stored locally or received remotely and, as an example, may be written by hand by a software programmer. A pre-stored configuration file written by a programmer may be read from a local database, or the configuration file may be received from another device over a network. Assuming that the extracted features will be used for model training in machine learning, the programmer writing the configuration file determines the features the model requires, based on the model to be built and the actual modeling scenario, and then designs a configuration item for each feature to obtain the configuration file. Alternatively, an interface (for example, a graphical user interface) for setting feature extraction configuration items may be displayed to the software user, and the configuration file generated automatically from the input operations the user performs on the interface. The way in which a user customizes feature extraction configuration items through an interface is described in detail below, by way of example, with reference to the drawings.

In another example, the feature extraction configuration items may be obtained from the user's input operations. As an example, the configuration items may be obtained directly, in real time, during the feature extraction process without forming any configuration file; for instance, a graphical user interface may pop up while the program is running and guide the user through selecting feature extraction configuration items.

According to exemplary embodiments of the present invention, each feature extraction configuration item can be changed as needed, independently of the feature extraction main program. Feature extraction can thus be effectively "abstracted" and "represented" for a given scenario: the main program needs no substantial change, while data processing functions can be written or added independently and flexibly. Consequently, for different databases, it suffices to define feature extraction configuration items as needed in order to reuse the same feature extraction main program and the corresponding data processing functions, which improves programming flexibility, maintainability, and code reusability.

Fig. 2 shows an example of feature extraction configuration items stored in a configuration file.

The configuration file shown in Fig. 2 has 10 lines. The first line defines the 6 fields of the data record: age, job, housing, contact, birthday, and y (the label). Lines 2 to 10 define the feature extraction configuration item of each feature, which may include a source field item and a processing method item. In addition, to manage the extraction of each feature more effectively, a feature name may be set for each feature, and a corresponding processing parameter item may be set for certain processing methods.

As shown in Fig. 2, each of lines 2 to 10 is divided into three or four columns. The first column specifies the name of the extracted feature; as Fig. 2 shows, the nine feature names are "F_AGE", "F_JOB", "F_HOUSING", "F_CONTACT", "F_YEAR", "F_MONTH", "F_DATE", "F_PROFILE", and ".label". For each feature, the second column specifies the corresponding source field item, that is, the field or fields of the data record from which the feature is extracted. The third column specifies the processing method item, that is, a reference to the intermediate processing that leads from the source fields to the output feature; this reference designates the data processing function to be called, which may be an already programmed software module, routine, library function, or the like. The fourth column, if present, specifies the parameters corresponding to the processing method item. Specifically, in the example of Fig. 2, feature sets the name of the extracted feature; depends sets its source fields; method specifies which data processing is to be performed on the source fields given in depends in order to obtain the feature value (that is, which data processing function is called); and args sets the format of the data involved in the data processing method.
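Fig. 2 itself is not reproduced in this text, so the exact file syntax is unknown. The sketch below therefore assumes a hypothetical layout (a comma-separated field list on the first line, then whitespace-separated `key=value` pairs using the `feature`, `depends`, `method`, and `args` keys named above) purely to illustrate how such configuration items could be parsed into a structure.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FeatureConfig:
    feature: str                # name of the extracted feature
    depends: List[str]          # source field item
    method: str                 # processing method item
    args: Optional[str] = None  # processing parameter item, if any


def parse_config(text: str) -> Tuple[List[str], List[FeatureConfig]]:
    """Parse the assumed layout into a field list and configuration items."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    fields = lines[0].split(",")
    items = []
    for line in lines[1:]:
        pairs = dict(token.split("=", 1) for token in line.split())
        items.append(FeatureConfig(
            feature=pairs["feature"],
            depends=pairs["depends"].split(","),
            method=pairs["method"],
            args=pairs.get("args"),
        ))
    return fields, items


# Hypothetical file content; only a few of the nine features are shown.
SAMPLE_CONFIG = """
age,job,housing,contact,birthday,y
feature=F_AGE depends=age method=Direct
feature=F_YEAR depends=birthday method=GetYearOfDate args=YYYY-mm-dd
feature=F_PROFILE depends=age,job method=Combine
"""

fields, items = parse_config(SAMPLE_CONFIG)
```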

Here, the processing method item is used to call the function that performs the predetermined extraction processing on the field values of the source fields. By way of example and not limitation, the data processing corresponding to some processing methods is listed below; some of these methods are specific to the field of machine learning:

1. Direct (direct extraction): outputs the source field unchanged, for example: "1" -> "1".

2. ExpNormalizer (exponential discretization): outputs the base-2 logarithm of a numeric source field, for example: "2" -> "1".

3. Combine (field combination): joins multiple source fields with "|" as the separator and outputs the result, for example: "1", "2" -> "1|2".

4. DataCalc (date interval): computes the interval between two dates, in days, for example: "1900-01-02", "1900-01-10" -> "9".

5. GetYearOfDate (year of date): extracts the year from a date field, for example: "1900-01-02" -> "1900".

6. GetMonthOfDate (month of date): extracts the month from a date field, for example: "1900-01-02" -> "01".

7. GetDayOfDate (day of date): extracts the day from a date field, for example: "1900-01-02" -> "02".

8. NumberFloor (floor): rounds a numeric field down, for example: "7.89" -> "7".

9. LabelDirect (numeric label): a sample labeling method in machine learning that outputs the source field directly as the label; the label must be an integer.

10. LabelBeta (field label): a sample labeling method in machine learning; if the source field contains "pos", the sample is labeled positive, otherwise negative.

11. LabelBinary (classification label): a numeric labeling method in machine learning; if the source field is "1", the sample is labeled positive, otherwise negative.
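A few of the processing methods listed above can be sketched as plain functions collected in a lookup table keyed by method name, so that a processing method item resolves to a function by reference. The function bodies, the `PROCESSING_METHODS` table, the floor applied to the logarithm, and the 1/0 encoding of positive and negative samples are assumptions for illustration; the patent gives only the method names and the input/output examples.

```python
import math
from typing import Callable, Dict, List


def direct(values: List[str]) -> str:
    """Direct: output the single source field unchanged."""
    return values[0]


def exp_normalizer(values: List[str]) -> str:
    """ExpNormalizer: base-2 logarithm, floored here to give a discrete bucket."""
    return str(math.floor(math.log2(float(values[0]))))


def combine(values: List[str]) -> str:
    """Combine: join several source fields with '|'."""
    return "|".join(values)


def get_year_of_date(values: List[str]) -> str:
    """GetYearOfDate: take the year part of a 'YYYY-mm-dd' style date."""
    return values[0].split("-")[0]


def number_floor(values: List[str]) -> str:
    """NumberFloor: round a numeric field down."""
    return str(math.floor(float(values[0])))


def label_binary(values: List[str]) -> int:
    """LabelBinary: 1 (positive) if the field is '1', otherwise 0 (negative)."""
    return 1 if values[0] == "1" else 0


# A processing method item names one of these keys.
PROCESSING_METHODS: Dict[str, Callable[[List[str]], object]] = {
    "Direct": direct,
    "ExpNormalizer": exp_normalizer,
    "Combine": combine,
    "GetYearOfDate": get_year_of_date,
    "NumberFloor": number_floor,
    "LabelBinary": label_binary,
}
```

Adding a new processing method then amounts to writing one function and registering it, without touching the code that walks the configuration.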

It should be noted that the data field names, processing method definitions, and so on described above with reference to Fig. 2 are only examples and may be designed differently as needed.

In one example, executable program code is obtained from the configuration items set in a configuration file such as that shown in Fig. 2. For example, a dedicated parser may parse the configuration items in the configuration file to produce corresponding executable program code which, when executed, performs on the source fields the data processing specified by the processing method items and assigns the resulting feature values to the defined features. In one example, the executable program code obtained by parsing the configuration items may be saved as one complete structure for use in subsequent execution.

Returning to Fig. 1, after the feature extraction configuration item acquisition step S120 is completed, the method proceeds to step S130.

In step S130, data processing is performed on the field values of the data records, based on the feature extraction configuration items, to obtain the feature values of the predetermined features. As an example, the executable program code obtained by parsing the configuration file may be run, or the feature extraction main program may be run according to feature extraction configuration items entered in real time, so that the predetermined data processing is performed on the field values of the relevant source fields in the data records that have been read, yielding the corresponding feature values.

Specifically, still taking the feature extraction configuration items shown in Fig. 2 as an example, executing step S130 outputs the age field of each record unchanged and assigns it to feature F_AGE; similarly, the job, housing, and contact fields of each record are output unchanged and assigned to features F_JOB, F_HOUSING, and F_CONTACT respectively; the year (YYYY), month (mm), and day (dd) are extracted from the birthday field in YYYY-mm-dd format and assigned to features F_YEAR, F_MONTH, and F_DATE; the age field and the job field are output together, unchanged, to feature F_PROFILE; and the label (y) field is output directly to feature label.
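The loop performed in step S130 can be sketched as follows for the sample record from the table above. This is an illustrative reading rather than the patented implementation: the tuple layout of `CONFIG_ITEMS`, the `METHODS` table, and the choice of Combine for F_PROFILE are assumptions, and only four of the nine features are shown.

```python
from typing import Callable, Dict, List, Tuple

# Assumed stand-ins for the pre-programmed data processing functions.
METHODS: Dict[str, Callable[[List[str]], str]] = {
    "Direct": lambda values: values[0],
    "Combine": lambda values: "|".join(values),
    "GetYearOfDate": lambda values: values[0].split("-")[0],
}

# (feature name, source fields, processing method) per configuration item.
CONFIG_ITEMS: List[Tuple[str, List[str], str]] = [
    ("F_AGE", ["age"], "Direct"),
    ("F_JOB", ["job"], "Direct"),
    ("F_YEAR", ["birthday"], "GetYearOfDate"),
    ("F_PROFILE", ["age", "job"], "Combine"),
]


def extract_features(record: Dict[str, str],
                     config_items: List[Tuple[str, List[str], str]] = CONFIG_ITEMS
                     ) -> Dict[str, str]:
    """Apply every configuration item to one data record."""
    features = {}
    for name, depends, method in config_items:
        values = [record[field] for field in depends]  # look up source fields
        features[name] = METHODS[method](values)       # call by reference
    return features


record = {"age": "37", "job": "teacher", "housing": "y",
          "contact": "spouse", "birthday": "1979-3-1", "y": "y"}
features = extract_features(record)
```

Note that `extract_features` never mentions a specific field or method; changing the extracted features means changing `CONFIG_ITEMS` only.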

The features extracted in this way may be combined into a feature vector, or combined with other features to form a feature vector. Such feature vectors can be used for any subsequent data statistics, analysis, computation, and/or other processing.

As an example, the feature vectors may serve as training samples in machine learning. The feature extraction described above is performed on every data record to form a training sample set, which can then be supplied to a machine learning algorithm or other algorithm for data mining.

According to exemplary embodiments of the present invention, thanks to the flat structure of the abstracted configuration items, data dependencies are confined to the data record currently being processed. Accordingly, the data record table can simply be split into row-based file shards, and feature extraction can be carried out in parallel on the resulting row shards. That is, in the feature value acquisition step, data processing may be performed in parallel on individual data records or on groups each consisting of multiple data records. For example, in the feature value acquisition step, feature extraction is performed row by row: each column of each data record is traversed, and data processing is performed according to the configured source fields and processing methods. As an example, in an application scenario where feature extraction is performed offline on historical data, a distributed computing cluster may be used to perform the feature value acquisition step on the rows.

Fig. 3 shows an example of performing the feature extraction process in a distributed manner when the data records are sample data for machine learning. The sample data source may be a data record table in which each row is one data record and each column corresponds to one field. The table can be split into row-based file shards, and feature value acquisition can then be executed in parallel on the shards. For example, the worker nodes of a distributed computing cluster may extract the feature values of the row shards in parallel.
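Row-sharded extraction can be sketched on a single machine as below. A thread pool from the Python standard library stands in for the worker nodes of a distributed cluster, which is an assumption made only to keep the example self-contained; `extract_row`, `split_rows`, and `extract_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

Record = Dict[str, str]


def extract_row(record: Record) -> Record:
    """Per-row extraction; depends only on the record it is given."""
    return {"F_AGE": record["age"],
            "F_PROFILE": record["age"] + "|" + record["job"]}


def split_rows(rows: List[Record], shard_size: int) -> List[List[Record]]:
    """Cut the record table into consecutive row shards."""
    return [rows[i:i + shard_size] for i in range(0, len(rows), shard_size)]


def extract_shard(shard: List[Record]) -> List[Record]:
    return [extract_row(record) for record in shard]


def extract_parallel(rows: List[Record], shard_size: int = 2,
                     workers: int = 4) -> List[Record]:
    """Extract features shard by shard, with shards processed concurrently."""
    shards = split_rows(rows, shard_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in the same order as the input shards.
        results = pool.map(extract_shard, shards)
        return [features for shard_result in results for features in shard_result]


rows = [{"age": str(20 + i), "job": "teacher"} for i in range(5)]
parallel_features = extract_parallel(rows)
```

Because each row is processed without reference to any other row, the sharded result is identical to processing the rows one after another.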

In another example, in addition to performing feature extraction on the rows in parallel, data processing within a row may be performed in parallel for the individual features, taking each feature to be obtained as a unit, so as to obtain the feature value of each feature.

It should be noted that, although the data record acquisition step and the feature extraction configuration item acquisition step are drawn one after the other in Fig. 1, this does not imply an order in time. In fact, the order in which these two steps are executed is not restricted; provided the logic of the context is not violated, the steps may be performed in parallel or in the reverse order.

An example of a method by which a user sets feature extraction configuration items through a graphical user interface, according to an embodiment of the present invention, is described below with reference to the drawings. It should be noted that the graphical user interface here is only an example; the present invention may use any other form of input interface. The feature extraction configuration items set through the interface may be used to form a configuration file from which the items are later read, or they may be applied directly to the feature extraction main program without generating any configuration file.

Fig. 4A shows an example of a graphical user interface 200 for configuring feature extraction according to an exemplary embodiment of the present invention. The graphical user interface 200 of Fig. 4A may be used in a modeling platform for model training and, with suitable modification, in any other feature extraction scenario. In it, the input table 201 (bankbasicdata) indicates the bank's raw data, the target value 202 (y) indicates the label of the training samples, and the output table 203 (bankdata_out) indicates the table of extracted features.

The graphical user interface 200 may display at least the fields of the data records that can serve as source fields and the feature extraction configuration items of the predetermined features that have been set. As an example, other information about the data source or data output may also be displayed. Specifically, as shown in Fig. 4A, the left area shows the fields of the data records in the input table, including field name 204 and field attribute 205. The right area shows a configuration page for configuring features; as an example, this page may include a selection input interface that displays the content options of the feature extraction configuration items for manual selection, in which each row corresponds to one specific feature and configures that feature's source item 206, processing method 207, and feature name 208.

As an example, the feature configuration items set by the user may be displayed in the right area according to the user's setting operations on the fields displayed in the left area. In one example, the user may manually edit the configuration items displayed in the right area.

Specifically, the fields of the data record may first be displayed on the graphical user interface (for example, in the left area). When the user selects (for example, by clicking) one or more of the displayed fields, the selected fields are set as the source fields in the configuration page and, at the same time, a list of processing methods is displayed on the graphical user interface. As an example, the list may be displayed near the source fields selected by the user, so that the user can choose from it the processing method to be shown in the configuration page. In the list, all processing methods may be active; or the list may include only the processing methods that can be applied to the selected source fields; or the list may include all processing methods but show those that can be applied as active and those that cannot as disabled.

Fig. 4B shows an example of part of a graphical user interface 300 that displays a processing method list 302 to the user while a single field 301 (for example, the "age" field) in the left area is selected. For example, when the user clicks the "age" field 301, the processing method list 302 pops up to the right, near the "age" field, for selection. The list 302 may contain all processing methods, with the method currently selected by the user highlighted. Alternatively, the list 302 may show only the processing methods that can be applied to the selected "age" field, or it may activate only those methods (for example, by showing them as selectable or highlighted) while showing the other processing methods as disabled.

Fig. 4C shows an example of part of a graphical user interface 400 that displays a processing method list 404 to the user while multiple fields 401, 402, and 403 in the left area are selected. That is, the user may select more than one source field 401, 402, and 403 on the left, whereupon the processing method list 404 pops up for the user to choose the processing method to apply to those source fields. As before, the list 404 may be popped up in any suitable manner, and it need not include all processing methods; the processing methods shown in the list 404 may be adjusted dynamically according to the source fields selected on the left.
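One way the processing method list could be adjusted to the current selection is to record, for each method, how many source fields it accepts and to mark each entry as enabled or disabled accordingly. The sketch below assumes this rule; the `METHOD_ARITY` table, its values, and the `method_list` function are hypothetical and are not taken from the patent, which leaves the applicability criterion open.

```python
from typing import Dict, List, Optional, Tuple

# Assumed metadata: (minimum, maximum) number of source fields per method;
# None means no upper bound.
METHOD_ARITY: Dict[str, Tuple[int, Optional[int]]] = {
    "Direct": (1, 1),
    "NumberFloor": (1, 1),
    "GetYearOfDate": (1, 1),
    "DataCalc": (2, 2),
    "Combine": (2, None),
}


def method_list(selected_fields: List[str]) -> List[Tuple[str, bool]]:
    """Return every method with a flag saying whether it can be applied."""
    count = len(selected_fields)
    entries = []
    for name, (low, high) in METHOD_ARITY.items():
        enabled = count >= low and (high is None or count <= high)
        entries.append((name, enabled))
    return entries


single = dict(method_list(["age"]))
multiple = dict(method_list(["age", "job", "housing"]))
```

An interface could render the disabled entries greyed out, or drop them entirely, which corresponds to the two display variants described above.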

除了上述显示特征抽取配置项的内容选项以供手动选择(例如，通过鼠标点击的方式)的选择输入型界面之外，还可以采用其他形式的用于设置特征抽取配置项的界面，例如，用于手动编辑配置文件的文本编辑界面，使得用户能够直接在文本编辑界面中编写“配置文件”，由于配置文件本身具有内容上的重复性，可通过文本编辑操作(例如，复制、粘贴、拖动等)来快速完成“配置文件”的编写。In addition to the selection input interface that displays the content options of feature extraction configuration items for manual selection (for example, by mouse click), other forms of interfaces for setting feature extraction configuration items can also be used, for example, using The text editing interface for manually editing configuration files enables users to directly write "configuration files" in the text editing interface. Since the configuration files themselves have repetitive content, they can be edited through text editing operations (such as copying, pasting, dragging, etc.) etc.) to quickly complete the writing of the "configuration file".

FIG. 5 shows an exemplary graphical user interface 500 having an area in which feature extraction configuration items can be edited as text. The left side of the graphical user interface 500 is similar to the graphical user interfaces shown in FIG. 4B and FIG. 4C, whereas the right area shows a text editing interface 501 for manually editing the configuration file. In the text editing interface 501 the user can manually edit feature extraction configuration items, including the feature name item, source field item, processing method item and processing parameter item. Using text editing operations performed in this interface (for example, copy, paste and drag), the user can set feature extraction configuration items efficiently.
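As an illustration of what such a text configuration might look like, the sketch below parses a line-based format into structured configuration items. The pipe-separated syntax, the `FeatureConfig` class and the example feature names are assumptions made for this sketch; the description does not specify a concrete configuration file syntax.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureConfig:
    name: str                      # feature name item
    source_fields: List[str]       # source field item
    method: str                    # processing method item
    params: Dict[str, str] = field(default_factory=dict)  # processing parameter item


def parse_config(text: str) -> List[FeatureConfig]:
    """Parse one feature per line: name | fields | method [| key=value, ...]."""
    configs = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = [p.strip() for p in line.split("|")]
        if len(parts) not in (3, 4):
            raise ValueError(f"malformed config line: {raw!r}")
        name, fields_part, method = parts[:3]
        params: Dict[str, str] = {}
        if len(parts) == 4 and parts[3]:
            for pair in parts[3].split(","):
                key, _, value = pair.partition("=")
                params[key.strip()] = value.strip()
        source_fields = [f.strip() for f in fields_part.split(",")]
        configs.append(FeatureConfig(name, source_fields, method, params))
    return configs


# Hypothetical configuration text, as a user might type or paste it.
EXAMPLE = """
# feature name | source fields | processing method | processing parameters
age_bucket | age | discretize | width=10
age_x_city | age, city | combine
"""

configs = parse_config(EXAMPLE)
```

Because each feature occupies one self-similar line, a new feature can be produced by copying an existing line and changing a field or parameter, which is the kind of repetition the text editing interface is meant to exploit.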

The two graphical user interfaces described above may be displayed on the screen at the same time, or displayed separately according to the user's choice. For example, in response to a user's interface switching input, the display switches between the text editing interface and the selection-input interface (a display switch or an activation switch), and the feature extraction configuration settings made in the interface before the switch are shown synchronously in the interface after the switch. The user can thus exploit the operational convenience of both configuration interfaces to set up multiple feature extraction schemes more effectively. For example, the user may first complete a representative feature extraction configuration in the selection-input interface by clicking or similar selection input, and then switch to the text editing interface. Because the earlier settings appear synchronously in the text editing interface, the user can quickly complete the extraction settings for a large number of features with operations such as copy and paste.
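One possible way to keep the two interfaces synchronized is to have both render from a single shared configuration model. The sketch below uses a simple observer arrangement; the class names `ConfigModel`, `SelectionView` and `TextView` and the text layout are hypothetical, and the description does not state how the synchronization is implemented.

```python
from typing import Callable, Dict, List

Item = Dict[str, object]


class ConfigModel:
    """Single source of truth for the feature extraction configuration items."""

    def __init__(self) -> None:
        self._items: List[Item] = []
        self._listeners: List[Callable[[List[Item]], None]] = []

    def subscribe(self, listener: Callable[[List[Item]], None]) -> None:
        self._listeners.append(listener)
        listener(list(self._items))  # bring a newly shown view up to date

    def add_item(self, item: Item) -> None:
        self._items.append(item)
        for listener in self._listeners:
            listener(list(self._items))


class SelectionView:
    """Stand-in for the selection-input interface: one row per feature."""

    def __init__(self) -> None:
        self.rows: List[tuple] = []

    def render(self, items: List[Item]) -> None:
        self.rows = [(i["name"], tuple(i["fields"]), i["method"]) for i in items]


class TextView:
    """Stand-in for the text editing interface: one line per feature."""

    def __init__(self) -> None:
        self.text = ""

    def render(self, items: List[Item]) -> None:
        self.text = "\n".join(
            f"{i['name']} | {', '.join(i['fields'])} | {i['method']}" for i in items
        )


model = ConfigModel()
selection_view = SelectionView()
model.subscribe(selection_view.render)

# A setting made while only the selection-input interface is shown...
model.add_item({"name": "age_bucket", "fields": ["age"], "method": "discretize"})

# ...is visible as soon as the text editing interface is switched in.
text_view = TextView()
model.subscribe(text_view.render)
```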

The feature extraction approach described above can be applied in any suitable scenario; it is described below using the field of machine learning as an example.

In the existing field of machine learning, performing model training, testing or application on large amounts of structured or unstructured data often demands considerable manpower in the feature engineering stage. For example, programmers must write, in advance, extraction code for each kind of feature according to specific feature extraction rules. Accordingly, in modeling products offered to customers, such as modeling platforms, the data input to the platform usually must already be extracted training data (that is, extracted feature vectors), and the user can hardly set or adjust the objects and rules of feature extraction flexibly, which limits the use of the modeling platform.

According to another embodiment of the present invention, a machine learning method applying the above feature extraction method is provided. The machine learning method can be applied in a system, such as a modeling platform, that makes it convenient for users (for example, business personnel) to perform data modeling. It is described below with reference to FIG. 6.

FIG. 6 shows an overall flowchart of a machine learning method 600 that applies the feature extraction method of the above embodiments, according to an embodiment of the present invention. Here, as an example, the program implementing the method may be packaged as a dedicated software package (for example, a lib library). For example, steps S610, S620 and S630 may be packaged as a separate software package, so that a consistent feature extraction service is obtained whether the package is called online or offline, overcoming the prior-art defect that feature extraction results are inconsistent because the online and offline environments differ. In addition, step S640 may also be packaged as a separate software package, so that machine learning can be performed on the extracted features whether the package is called online or offline.

In step S610, a data record acquisition step is performed to acquire data records.

In step S620, a feature extraction configuration item acquisition step is performed to acquire feature extraction configuration items that define how to extract predetermined features from the data records. The feature extraction configuration item of each predetermined feature includes a source field item and a processing method item. The source field item defines the fields of the data record involved in that predetermined feature as source fields. The processing method item specifies a reference to a data processing function pre-programmed as executable code, where the data processing function performs, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature.

In step S630, a feature value acquisition step is performed: data processing is performed on the field values of the data records based on the feature extraction configuration items to obtain the feature values of the predetermined features.

For the specific implementation and functions of steps S610, S620 and S630, reference may be made to steps S110, S120 and S130 described in conjunction with FIG. 1; they are not repeated here.
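Steps S610 to S630 can be sketched as a small configuration-driven routine. The sketch assumes that the "reference to a data processing function" is resolved by looking up a name in a registry of pre-programmed functions; the registry `REGISTRY`, the functions `discretize` and `combine`, and the dictionary layout of the configuration items are hypothetical choices for illustration only.

```python
from typing import Any, Callable, Dict, List


# Pre-programmed data processing functions (hypothetical examples).
def discretize(values: List[Any], width: int = 10) -> int:
    """Map a single numeric field value to a bucket index."""
    return int(values[0]) // int(width)


def combine(values: List[Any]) -> str:
    """Join several field values into one combined categorical value."""
    return "_".join(str(v) for v in values)


# The processing method item refers to a function by name.
REGISTRY: Dict[str, Callable[..., Any]] = {
    "discretize": discretize,
    "combine": combine,
}


def extract_features(record: Dict[str, Any],
                     configs: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Apply each configuration item to one data record (as in step S630)."""
    features = {}
    for cfg in configs:
        func = REGISTRY[cfg["method"]]               # resolve the reference
        values = [record[f] for f in cfg["fields"]]  # read the source fields
        features[cfg["name"]] = func(values, **cfg.get("params", {}))
    return features


# As in step S610: a data record. As in step S620: configuration items.
record = {"age": 37, "city": "Beijing"}
configs = [
    {"name": "age_bucket", "fields": ["age"], "method": "discretize",
     "params": {"width": 10}},
    {"name": "age_x_city", "fields": ["age", "city"], "method": "combine"},
]

features = extract_features(record, configs)
# features == {"age_bucket": 3, "age_x_city": "37_Beijing"}
```

Adding a feature in this arrangement means adding a configuration item rather than writing new extraction code, and the same `extract_features` routine can be called from both online and offline code paths.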

In step S640, a sample obtaining step is performed: a feature vector is formed, based at least in part on the feature values obtained in the feature value acquisition step, as a sample for machine learning.

In one example, this feature extraction yields all dimensions of the feature vector. That is, for each data record, data processing is performed on the relevant fields of the record based on the feature extraction configuration items to obtain the feature value of each feature, and these per-dimension feature values are combined to form a complete machine learning sample.

In another example, the feature values obtained in this way cover only some of the dimensions and may be combined with features of other dimensions to form the final feature vector. The form and source of these other features are not limited: they may come from an external source, or may be obtained locally using a similar or different feature extraction method.
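Both examples of step S640 can be illustrated with one helper that orders the extracted feature values into a vector and optionally appends features from another source. The function name `build_sample`, the fixed `feature_order` argument and the numeric-only values are assumptions of this sketch.

```python
from typing import Dict, List, Sequence


def build_sample(extracted: Dict[str, float],
                 feature_order: List[str],
                 extra: Sequence[float] = ()) -> List[float]:
    """Form a feature vector from extracted values plus optional other features.

    feature_order fixes the position of each extracted feature so that every
    sample has the same layout; extra holds features from any other source.
    """
    vector = [float(extracted[name]) for name in feature_order]
    vector.extend(float(x) for x in extra)
    return vector


extracted = {"age_bucket": 3, "income_log": 4.5}
order = ["age_bucket", "income_log"]

# First example: the extracted values make up the whole feature vector.
full_sample = build_sample(extracted, order)

# Second example: the extracted values are combined with an external feature.
combined_sample = build_sample(extracted, order, extra=[1.0])
```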

In step S650, a machine learning step is performed, in which machine learning is carried out based on the samples. Here, at least one of model training, model testing and model application may be performed based on the samples.

Here, the machine learning algorithm used for model training is not particularly limited; it may be any of various machine learning methods such as neural networks, Bayesian networks, support vector machines, decision trees, genetic algorithms and expert systems. Note that, after a model has been built from the training data, the same feature extraction method can be applied to the data records used to test the model's performance to obtain test samples, and the test samples can be input to the trained model to evaluate its performance. Likewise, the same feature extraction method can be applied to the data records on which the model is to make predictions to obtain application samples, which are input to the model to obtain the corresponding prediction results. The problem addressed by the model is not limited and depends on the task to be performed; examples include judging whether a factory workpiece is defective, whether a factory environment is safe, or how creditworthy a person is.
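The reuse of one extraction routine across training, testing and application can be sketched as below. The nearest-centroid classifier is a deliberately simple stand-in chosen for this sketch and is not one of the algorithms named in the description; the `extract` function, the field names and the toy credit-style data are likewise hypothetical.

```python
from typing import Dict, List


def extract(record: Dict[str, float]) -> List[float]:
    """Stand-in for the configuration-driven feature extraction."""
    return [record["age"] // 10, record["income"] / 1000.0]


def train(samples: List[List[float]], labels: List[str]) -> Dict[str, List[float]]:
    """Nearest-centroid training: compute the mean feature vector per label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for x, y in zip(samples, labels):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def predict(model: Dict[str, List[float]], x: List[float]) -> str:
    """Return the label whose centroid is closest to x."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


# Training: extract features from the training records, then fit the model.
train_records = [
    {"age": 25, "income": 2000}, {"age": 28, "income": 3000},
    {"age": 52, "income": 9000}, {"age": 58, "income": 11000},
]
train_labels = ["low", "low", "high", "high"]
model = train([extract(r) for r in train_records], train_labels)

# Testing: the same extract() produces the test samples.
test_records = [{"age": 30, "income": 2500}, {"age": 55, "income": 10000}]
test_labels = ["low", "high"]
correct = sum(predict(model, extract(r)) == y
              for r, y in zip(test_records, test_labels))
accuracy = correct / len(test_records)

# Application: again the same extract() produces the application sample.
prediction = predict(model, extract({"age": 45, "income": 8000}))
```

Because all three phases call the same `extract`, the model never sees features computed by a different rule than the one used in training.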

The machine learning method according to the embodiment of the present invention makes use of the feature extraction method according to the embodiment of the present invention and is particularly suitable for feature extraction and sample set acquisition on big data. In addition, it makes it easier for users of a modeling platform to take part directly in each stage of machine learning, for example, building, training and applying a model.

According to another embodiment of the present invention, a feature extraction apparatus for performing feature extraction on data records is provided, including: a data record acquisition unit configured to acquire data records; a feature extraction configuration item acquisition unit configured to acquire feature extraction configuration items that define how to extract predetermined features from the data records, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value acquisition unit configured to perform data processing on the field values of the data records based on the feature extraction configuration items to obtain the feature values of the predetermined features.

According to another embodiment of the present invention, a machine learning apparatus is provided, which may include: a data record acquisition unit configured to acquire data records; a feature extraction configuration item acquisition unit configured to acquire feature extraction configuration items that define how to extract predetermined features from the data records, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition unit configured to perform data processing on the field values of the data records based on the feature extraction configuration items to obtain the feature values of the predetermined features; a training sample obtaining unit configured to form a feature vector, based at least in part on the feature values obtained by the feature value acquisition unit, as a sample for machine learning; and a machine learning unit configured to perform machine learning based on the samples.

It should be noted that the above feature extraction apparatus and machine learning apparatus may rely entirely on the execution of a computer program to realize their functions. That is, each unit serves as a module corresponding to a respective step in the functional architecture of the computer program, so that the whole apparatus is invoked through a dedicated software package (for example, a lib library) to realize the corresponding feature extraction or machine learning function online or offline.

On the other hand, each of the above units may also be implemented in hardware, software, firmware, middleware, microcode or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments that perform the required tasks may be stored in a computer-readable medium such as a storage medium, and a processor may perform the required tasks.

Here, embodiments of the present invention may be implemented as a computing device including a storage component and a processor, the storage component storing a set of computer-executable instructions that, when executed by the processor, performs the above feature extraction method and/or machine learning method.

FIG. 7 shows a configuration block diagram of a computing device 1100 according to an embodiment of the present invention.

As shown in FIG. 7, the computing device 1100 includes a central processing unit 1110, a storage 1130, a display 1140, a network interface 1150, and an input device 1200 that may be connected in a wired or wireless manner. The storage 1130, the display 1140, the network interface 1150 and the input device 1200 are connected to the central processing unit 1110 via a bus 1120. The storage 1130 includes an internal memory 1131 and an external storage 1132. During normal operation of the computing device 1100, the operating system and various application programs reside in the internal memory 1131; the external storage 1132 may be a ROM, a hard disk or a solid-state disk, and may store the BIOS, data, programs and the like.

The storage holds a set of computer instructions capable of implementing the feature extraction method and/or machine learning method of the embodiments of the present invention; when executed by the central processing unit, the instruction set causes the feature extraction method and/or machine learning method according to the embodiments of the present invention to be performed. It should be noted that the central processing unit here may be a physically or logically distributed computing cluster and is not limited to a single-machine computing device.

Specifically, according to an embodiment of the present invention, a computing device for performing feature extraction on data records is provided, including a storage component and a processor, the storage component storing a set of computer-executable instructions that, when executed by the processor, performs the following steps: a data record acquisition step of acquiring data records; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how to extract predetermined features from the data records, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; and a feature value acquisition step of performing data processing on the field values of the data records based on the feature extraction configuration items to obtain the feature values of the predetermined features.

According to an embodiment of the present invention, a computing device for performing machine learning is provided, including a storage component and a processor, the storage component storing a set of computer-executable instructions that, when executed by the processor, performs the following steps: a data record acquisition step of acquiring data records; a feature extraction configuration item acquisition step of acquiring feature extraction configuration items that define how to extract predetermined features from the data records, where the feature extraction configuration item of each predetermined feature includes a source field item and a processing method item, the source field item defines the fields of the data record involved in that predetermined feature as source fields, and the processing method item specifies a reference to a data processing function pre-programmed as executable code, the data processing function performing, on the field values of the source fields defined by the source field item, the data processing for extracting that predetermined feature; a feature value acquisition step of performing data processing on the field values of the data records based on the feature extraction configuration items to obtain the feature values of the predetermined features; a sample obtaining step of forming a feature vector, based at least in part on the feature values obtained in the feature value acquisition step, as a sample for machine learning; and a machine learning step of performing machine learning based on the samples.

Various embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. Therefore, the scope of protection of the present invention shall be determined by the scope of the claims.