



Technical Field
The present disclosure relates to the field of machine learning, in particular to smart finance, artificial intelligence, and deep learning technologies, and specifically to a feature encoding method, apparatus, device, medium, and program product.
Background
In machine learning, sample data must be feature-encoded before a model is trained on it. The quality of the feature encoding directly affects how well the model trains.
When training linear models that perform classification tasks, existing feature encoding methods usually produce high-dimensional features. For a category with low sample coverage, the features of these sparse samples have extremely low discrimination and contribution when used for modeling, so the model cannot accurately identify that category.
Summary
The present disclosure provides a feature encoding method, apparatus, device, medium, and program product.
According to one aspect of the present disclosure, a feature encoding method is provided, comprising:
calculating first weights of a plurality of objects in at least two categories according to the numbers of samples of the plurality of objects and the numbers of samples of the plurality of objects under the at least two categories, wherein the goal of model training is to enable the model to classify input objects among the at least two categories;
binning the plurality of objects according to the first weights to obtain a plurality of object bins; and
calculating second weights of the plurality of object bins in the at least two categories according to the numbers of samples of the plurality of object bins and the numbers of samples of the plurality of object bins under the at least two categories, and using the second weights of the plurality of object bins as feature values of the plurality of object bins.
According to another aspect of the present disclosure, a feature encoding apparatus is provided, comprising:
a first weight calculation module, configured to calculate first weights of a plurality of objects in at least two categories according to the numbers of samples of the plurality of objects and the numbers of samples of the plurality of objects under the at least two categories, wherein the goal of model training is to enable the model to classify input objects among the at least two categories;
a binning module, configured to bin the plurality of objects according to the first weights to obtain a plurality of object bins; and
a second weight calculation module, configured to calculate second weights of the plurality of object bins in the at least two categories according to the numbers of samples of the plurality of object bins and the numbers of samples of the plurality of object bins under the at least two categories, and to use the second weights of the plurality of object bins as feature values of the plurality of object bins.
According to another aspect of the present disclosure, an electronic device is provided, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the feature encoding method according to any embodiment of the present disclosure.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being used to cause a computer to perform the feature encoding method according to any embodiment of the present disclosure.
According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the feature encoding method according to any embodiment of the present disclosure.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Brief Description of the Drawings
The accompanying drawings are provided for a better understanding of the solution and do not limit the present disclosure. In the drawings:
FIG. 1 is a schematic diagram of a feature encoding method according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a feature encoding method according to an embodiment of the present disclosure;
FIG. 3 is a schematic structural diagram of a feature encoding apparatus according to an embodiment of the present disclosure;
FIG. 4 is a block diagram of an electronic device for implementing the feature encoding method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Those of ordinary skill in the art will therefore recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, descriptions of well-known functions and structures are omitted below for clarity and conciseness.
FIG. 1 is a schematic flowchart of a feature encoding method according to an embodiment of the present disclosure. This embodiment is applicable to feature-encoding training samples so that a model can be trained on the resulting sample features, especially where the samples are sparse. It relates to the field of machine learning, in particular to smart finance, artificial intelligence, and deep learning technologies. The method can be performed by a feature encoding apparatus implemented in software and/or hardware and preferably configured in an electronic device such as a computer device or a server. As shown in FIG. 1, the method includes the following steps:
S101. Calculate first weights of a plurality of objects in at least two categories according to the numbers of samples of the plurality of objects and the numbers of samples of the plurality of objects under the at least two categories, wherein the goal of model training is to enable the model to classify input objects among the at least two categories.
A classification model is trained so that it can classify its input. Classification models include binary and multi-class models: a binary model assigns the input to one of two categories, whereas a multi-class model assigns it to one of more than two categories.
If, among the training samples of a classification model, the samples belonging to a certain category are sparse, that is, they account for a very small proportion of all samples, then encodings such as the prior-art multi-hot or one-hot, which use the full set of objects as feature dimensions, yield feature codes with no discriminative power for that category. The trained model cannot identify objects belonging to that category and cannot classify them accurately. For example, in a credit risk control scenario, the model must judge whether installing a given application (APP) carries a high risk of fraud, that is, perform a binary classification into high risk and low risk. However, samples that install such high-risk APPs and also exhibit fraudulent behavior are few to begin with: they are sparse samples, and positive-sample coverage is very low. Prior-art encodings such as multi-hot or one-hot use the full set of APPs as feature dimensions, marking a sample 1 for an APP it has installed and 0 otherwise. The feature dimension is therefore high, and when the features of sparse samples are used for training, their discrimination and contribution are extremely low, which degrades training and prevents the model from accurately identifying high-risk objects.
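To make the prior-art baseline concrete, the following is a minimal sketch of multi-hot encoding over a full APP vocabulary. The function name `multi_hot` and the APP identifiers are hypothetical and used only for illustration.

```python
def multi_hot(installed, vocabulary):
    """Return one 0/1 entry per APP in the vocabulary (prior-art style encoding)."""
    return [1 if app in installed else 0 for app in vocabulary]

# Hypothetical vocabulary: in practice it may hold many thousands of APPs,
# so every sample becomes a very long, mostly-zero vector.
vocabulary = ["app_a", "app_b", "app_c", "app_d"]
vector = multi_hot({"app_b", "app_d"}, vocabulary)
print(vector)  # [0, 1, 0, 1]
```

The vector length equals the vocabulary size, which is why the feature dimension grows with the number of distinct APPs while rarely installed high-risk APPs contribute only a handful of non-zero entries.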
In the embodiments of the present disclosure, the first weights of the plurality of objects in the at least two categories are first calculated according to the numbers of samples of the objects and the numbers of samples of the objects under the at least two categories. The first weight indicates the importance of each object in each category and may be expressed by parameters such as TF-IDF (term frequency-inverse document frequency), the lift value (a measure for evaluating whether a predictive model is effective), WOE (Weight of Evidence), or the category proportion. Specifically, the total number of samples can be determined from the number of samples of each object and then combined with the number of samples of each object under each category to calculate the first weight of each object in each category by the TF-IDF, lift, WOE, or category-proportion method. Moreover, where the samples of certain objects are sparse but are positive samples, the first weights of these objects in the positive class are higher than those of other objects even though the sparse samples are few. In effect, the first weight extracts the important sparse values and thereby raises the contribution of the sparse samples. For example, in a credit risk control scenario, the two categories to be distinguished are high-risk positive samples and low-risk negative samples. Samples that have installed APPs with high credit risk account for a smaller share than other samples, yet their first weight in the positive class is higher than that of other samples.
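As one illustration of the alternatives listed above, the sketch below computes a lift value per object. It assumes the common definition of lift as an object's positive rate divided by the overall positive rate; the text does not give a formula, so this definition, the function name `lift_weights`, and the sample counts are all assumptions.

```python
def lift_weights(sample_counts, positive_counts):
    """Lift of each object: its positive rate divided by the overall positive rate."""
    total = sum(sample_counts.values())
    total_pos = sum(positive_counts.get(obj, 0) for obj in sample_counts)
    if total == 0 or total_pos == 0:
        raise ValueError("need at least one sample and one positive sample")
    overall_rate = total_pos / total
    return {
        obj: (positive_counts.get(obj, 0) / n) / overall_rate if n else 0.0
        for obj, n in sample_counts.items()
    }

# Hypothetical counts: app_a is rare but mostly fraudulent.
sample_counts = {"app_a": 9, "app_b": 99, "app_c": 892}
positive_counts = {"app_a": 6, "app_b": 10, "app_c": 9}
for app, lift in lift_weights(sample_counts, positive_counts).items():
    print(app, round(lift, 3))
```

A lift above 1 marks an object whose samples are positive more often than average, so a rare but risky APP still receives a high weight.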
S102. Bin the plurality of objects according to the first weights to obtain a plurality of object bins.
Binning partitions the objects by their first weights. Each resulting object bin contains multiple objects whose first weights are similar. The binning procedure itself can follow the prior art and is not repeated here.
S103. Calculate second weights of the plurality of object bins in the at least two categories according to the numbers of samples of the plurality of object bins and the numbers of samples of the plurality of object bins under the at least two categories, and use the second weights of the plurality of object bins as feature values of the plurality of object bins.
The second weight indicates the importance of each object bin in each category. It may be calculated in the same way as the first weight or in a different way. The calculated second weight is the feature value of the corresponding object bin and is used to train the classification model.
It should be noted that, in the embodiments of the present disclosure, before binning the samples are taken per object, and their number is large. After binning according to the first weights, the samples are taken per object bin, so the number of object-bin samples is reduced several-fold. The coverage of the object bins that correspond to sparse samples, relative to all object-bin samples, therefore increases. Training the model on the feature values of the object bins raises the discrimination and contribution of the sparse-sample features and gives the model the ability to classify objects into the category to which the sparse samples belong. Examples include predicting, in a credit risk control scenario, whether installing an APP involves high-risk fraudulent behavior, or classifying and predicting in other binary scenarios such as anti-fraud identification and marketing response.
In the technical solution of the embodiments of the present disclosure, the first weight of each object in each category is calculated first, and the objects are then binned according to the first weights to obtain a plurality of object bins. Next, taking each object bin as a unit, the second weight of each object bin in each category is calculated and used as the feature value of that bin for training the model. Binning reduces the originally large number of sample values several-fold, which in turn raises the coverage of the few sparse samples within the sample population, alleviates the prior-art problem of low discrimination and contribution of sparse-sample features, and improves the training of the model.
In one implementation, the model may be a binary classification model. Accordingly, the categories include the category to which positive samples belong and the category to which negative samples belong, and the positive samples are sparse samples. The first weight of each object in the positive-sample category, calculated by the method of the embodiments of the present disclosure, then expresses the importance of the sparse samples within the sample population and prepares the data for the subsequent binning operation.
FIG. 2 is a schematic flowchart of a feature encoding method according to an embodiment of the present disclosure. Building on the above embodiment, this embodiment further refines the case where the model is a binary classification model and the weights are calculated by term frequency-inverse document frequency (TF-IDF). As shown in FIG. 2, the method includes the following steps:
S201. Calculate the inverse document frequencies of the plurality of objects according to the numbers of samples of the plurality of objects and the sum of the numbers of samples of the plurality of objects.
The inverse document frequency (IDF) of a given term is usually obtained by dividing the total number of documents by the number of documents containing the term and taking the logarithm of the quotient. In the embodiments of the present disclosure, an object plays the role of the "document". The inverse document frequency of each object is therefore calculated as: IDF = log[total number of samples / (number of samples of the object + 1)], where the total number of samples is the sum of the numbers of samples of all objects.
In one implementation, taking the credit risk control scenario as an example, the model is used to identify whether installing a given APP carries a high risk of fraud. Accordingly, an object is an application (APP), and the number of samples of an object is the number of times the APP has been installed. The categories are "APP installed with a high risk of financial fraud" and "APP installed with a low risk of financial fraud", and a positive sample of an object is one in which the APP is installed and the risk of financial fraud is high. In the credit risk control scenario, the inverse document frequency of each APP can therefore be written as: IDF = log[total number of samples / (number of samples that installed the APP + 1)], where the total number of samples is the sum of the numbers of samples that installed each APP.
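The IDF formula of step S201 can be sketched as follows. The text does not state the base of the logarithm, so the natural logarithm is assumed here; the function name `object_idf` and the install counts are hypothetical.

```python
import math

def object_idf(sample_counts):
    """IDF = log(total samples / (samples of the object + 1)) for each object.

    Assumption: natural logarithm; the source does not specify the base.
    """
    total = sum(sample_counts.values())
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    return {obj: math.log(total / (n + 1)) for obj, n in sample_counts.items()}

# Hypothetical install counts per APP; the total is 1000.
sample_counts = {"app_a": 9, "app_b": 99, "app_c": 892}
for app, idf in object_idf(sample_counts).items():
    print(app, round(idf, 4))
```

Rarely installed APPs receive the largest IDF, which is what lets sparse objects stand out before the positive-sample proportion is applied.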
S202. Calculate the positive-sample proportions of the plurality of objects according to the total number of positive samples among the samples of the plurality of objects and the sum of the numbers of samples of the plurality of objects, wherein the positive samples are sparse samples.
Specifically, dividing the total number of positive samples among the samples of an object by the sum of the numbers of samples of all objects gives the positive-sample proportion of that object, which is analogous to the term frequency (TF) of a word in a text. In the credit risk control scenario, this is the total number of positive samples that installed the APP divided by the sum of the numbers of samples that installed each APP. Because positive samples, which install such high-risk APPs and also exhibit fraudulent behavior, make up a very small share of the sample population in this scenario, they are sparse samples.
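The positive-sample proportion of step S202 can be sketched as below. The function name `positive_share` and the counts are hypothetical; the denominator is the sum of sample counts over all objects, as the text describes.

```python
def positive_share(sample_counts, positive_counts):
    """Positive samples of each object divided by the total samples of all objects."""
    total = sum(sample_counts.values())
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    return {obj: positive_counts.get(obj, 0) / total for obj in sample_counts}

# Hypothetical counts: 1000 samples in total, 25 of them positive.
sample_counts = {"app_a": 9, "app_b": 99, "app_c": 892}
positive_counts = {"app_a": 6, "app_b": 10, "app_c": 9}
print(positive_share(sample_counts, positive_counts))
```

Note that the proportion is taken against the whole sample population rather than against the object's own samples, so on its own it stays small for rare objects; the IDF factor is what compensates for that.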
S203. Use the product of the inverse document frequency and the positive-sample proportion of each of the plurality of objects as the first weight of that object in the category to which the positive samples belong.
Term frequency-inverse document frequency (TF-IDF) is a weighting technique commonly used in information retrieval and text mining to evaluate how important a word is to a document within a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document but decreases in inverse proportion to its frequency across the corpus. In the embodiments of the present disclosure, TF-IDF is applied to calculating the first weight of an object in the category to which the positive samples belong. Because the positive samples are sparse, TF-IDF can pick out the important sparse values, and binning then groups similar values together, which raises the discrimination and contribution of the sparse features. In the calculation result, APPs that are truly high-risk have higher first weights than other APPs, and among these truly high-risk APPs, the higher the risk, the larger the first weight.
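Steps S201 to S203 combine into a single first-weight computation, sketched below under the same assumptions as the formulas in the text (natural logarithm assumed; the function name `first_weights` and the counts are hypothetical).

```python
import math

def first_weights(sample_counts, positive_counts):
    """First weight = IDF * positive-sample proportion, per object."""
    total = sum(sample_counts.values())
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    weights = {}
    for obj, n in sample_counts.items():
        idf = math.log(total / (n + 1))              # step S201
        share = positive_counts.get(obj, 0) / total  # step S202
        weights[obj] = idf * share                   # step S203
    return weights

# Hypothetical counts: app_a is installed rarely but is mostly fraudulent.
sample_counts = {"app_a": 9, "app_b": 99, "app_c": 892}
positive_counts = {"app_a": 6, "app_b": 10, "app_c": 9}
weights = first_weights(sample_counts, positive_counts)
print(sorted(weights, key=weights.get, reverse=True))  # ['app_a', 'app_b', 'app_c']
```

In this toy data app_a has the fewest positive samples in absolute terms after app_c, yet it ranks first because its large IDF outweighs its small proportion.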
S204. Sort the first weights in reverse order and, based on the sorted first weights, bin the plurality of objects to obtain a plurality of object bins.
Specifically, equal-frequency binning or custom binning may be used. With equal-frequency binning, every resulting object bin contains the same number of object samples. With custom binning, the principle is that the object-sample coverage of each bin must exceed a preset threshold, so that no bin is disproportionately large or small. Accordingly, the embodiments of the present disclosure sort in reverse order so that, in custom binning, it is easier to set the binning thresholds manually, that is, how many object samples each bin contains and how many bins there are in total.
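A minimal sketch of step S204 with equal-frequency binning follows. It assumes that reverse order means descending first weight and interprets "equal frequency" as an (almost) equal number of objects per bin; balancing bins by cumulative sample count instead would be an equally plausible reading. The function name `equal_frequency_bins` is hypothetical.

```python
def equal_frequency_bins(weights, num_bins):
    """Sort objects by descending weight and split them into num_bins groups of near-equal size."""
    if num_bins < 1:
        raise ValueError("num_bins must be at least 1")
    ranked = sorted(weights, key=weights.get, reverse=True)
    base, extra = divmod(len(ranked), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        size = base + (1 if i < extra else 0)  # the first `extra` bins take one more object
        bins.append(ranked[start:start + size])
        start += size
    return bins

# Hypothetical first weights for five objects.
weights = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.7, "e": 0.3}
print(equal_frequency_bins(weights, 2))  # [['a', 'd', 'c'], ['e', 'b']]
```

Because the objects are sorted before splitting, each bin holds objects with neighbouring first weights, which matches the requirement that objects in the same bin have similar weights.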
S205. Calculate the inverse document frequencies of the plurality of object bins according to the numbers of samples of the plurality of object bins and the sum of the numbers of samples of the plurality of object bins.
S206. Calculate the positive-sample proportions of the plurality of object bins according to the total number of positive samples among the samples of the plurality of object bins and the sum of the numbers of samples of the plurality of object bins, wherein the positive samples are sparse samples.
S207. Use the product of the inverse document frequency and the positive-sample proportion of each of the plurality of object bins as the second weight of that object bin in the category to which the positive samples belong, and use the second weights of the plurality of object bins as the feature values of the plurality of object bins.
The second weight is calculated in the same way as the first weight, using TF-IDF, except that each object bin is treated as an independent unit. Specifically, the inverse document frequency of each object bin is calculated as: IDF = log[total number of samples / (number of samples of the object bin + 1)], where the total number of samples is the sum of the numbers of samples of all object bins. Dividing the total number of positive samples among the samples of an object bin by the sum of the numbers of samples of all object bins then gives the positive-sample proportion of that bin. Finally, multiplying the inverse document frequency of each object bin by its positive-sample proportion gives the second weight.
Accordingly, in the credit risk control scenario, the inverse document frequency of each APP bin can be written as: IDF = log[total number of samples / (number of samples that installed the APPs in the bin + 1)], where the total number of samples is the sum of the numbers of samples that installed the APPs of each bin. The total number of positive samples that installed the APPs in a bin, divided by the sum of the numbers of samples that installed the APPs of each bin, gives the positive-sample proportion of that APP bin. Finally, multiplying the inverse document frequency of an APP bin by its positive-sample proportion gives the second weight of that bin.
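Steps S205 to S207 can be sketched as below. The sketch assumes that the sample count of a bin is the sum of the sample counts of the objects it contains (so a sample that installs two APPs from the same bin is counted twice) and that the logarithm is natural; the function name `second_weights` and all counts are hypothetical.

```python
import math

def second_weights(bins, sample_counts, positive_counts):
    """Second weight of each bin = IDF of the bin * positive-sample proportion of the bin."""
    # Assumption: a bin's counts are the sums of its members' counts.
    bin_samples = [sum(sample_counts[obj] for obj in members) for members in bins]
    bin_positives = [sum(positive_counts.get(obj, 0) for obj in members) for members in bins]
    total = sum(bin_samples)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    return [
        math.log(total / (n + 1)) * (p / total)  # steps S205, S206, S207
        for n, p in zip(bin_samples, bin_positives)
    ]

# Hypothetical bins produced by the previous step.
bins = [["app_a", "app_b"], ["app_c"]]
sample_counts = {"app_a": 9, "app_b": 99, "app_c": 892}
positive_counts = {"app_a": 6, "app_b": 10, "app_c": 9}
print([round(w, 4) for w in second_weights(bins, sample_counts, positive_counts)])
```

Each returned value is the feature value shared by every object in the corresponding bin, so the feature takes only as many distinct values as there are bins.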
In the technical solution of the embodiments of the present disclosure, TF-IDF is first used to calculate the first weight of each object in the category to which the positive samples, which are sparse, belong. The objects are then sorted in reverse order of first weight and binned to obtain a plurality of object bins. Next, taking each object bin as a unit, the second weight of each bin in the positive-sample category is calculated and used as the feature value of that bin for training the model. On the one hand, TF-IDF picks out the important sparse values and serves as a measure of the importance of the sparse samples. On the other hand, binning reduces the originally large number of sample values several-fold, which in turn raises the coverage of the few sparse samples within the sample population, alleviates the prior-art problem of low discrimination and contribution of sparse-sample features, and improves the training of the model.
Furthermore, in the credit risk control scenario, samples that install risky APPs and use them for fraud are few to begin with and are sparse samples. If this small number of sparse samples is feature-encoded according to the prior art, its contribution to modeling is extremely low. With the technical solution of the embodiments of the present disclosure, TF-IDF is first used to calculate the weight of each APP among the high-risk positive samples, which picks out the important sparse values and raises the importance of the sparse samples. Binning by weight then reduces the originally large number of sample values several-fold, which raises the coverage of the few sparse samples within the sample population. Finally, taking each APP bin as a unit, the weight of each bin is calculated again to obtain the feature value of that bin. Moreover, the feature values obtained by this encoding method are monotonic: the more high-risk APPs are installed, the higher the feature value of the bin containing those APPs, and vice versa. A linear model can therefore achieve good performance and interpretability without further feature engineering.
图3是根据本公开实施例的特征编码装置的结构示意图,本实施例可适用于对训练样本进行特征编码,以利用编码后得到的样本特征训练模型的情况,尤其是针对稀疏样本的特征编码的情况,涉及机器学习技术领域,尤其涉及智慧金融、人工智能和深度学习技术。该装置可实现本公开任意实施例所述的特征编码方法。如图3所示,该装置300具体包括:FIG. 3 is a schematic structural diagram of a feature encoding apparatus according to an embodiment of the present disclosure. This embodiment is applicable to the case where feature encoding is performed on training samples, and a model is trained by using the sample features obtained after encoding, especially for feature encoding of sparse samples. It involves the field of machine learning technology, especially smart finance, artificial intelligence and deep learning technology. The apparatus can implement the feature encoding method described in any embodiment of the present disclosure. As shown in FIG. 3, the
第一权重计算模块301,用于根据多个对象的样本数,和至少两种类别下所述多个对象的样本数,计算所述多个对象在所述至少两种类别中的第一权重,其中,所述模型训练的目标是使所述模型在所述至少两种类别中对输入的对象进行分类;The first
分箱模块302,用于根据所述第一权重对所述多个对象进行分箱,得到多个对象分箱;A
第二权重计算模块303,用于根据所述多个对象分箱的样本数,和所述至少两种类别下所述多个对象分箱的样本数,计算所述多个对象分箱在所述至少两种类别中的第二权重,并将所述多个对象分箱的第二权重作为所述多个对象分箱的特征取值。The second
可选的,所述模型为二分类模型,所述类别包括正样本所属类别和负样本所属类别,且所述正样本为稀疏样本。Optionally, the model is a binary classification model, the categories include a category to which positive samples belong and a category to which negative samples belong, and the positive samples are sparse samples.
Optionally, the first weight calculation module 301 includes:
a first inverse document frequency calculation unit, configured to calculate inverse document frequencies of the plurality of objects according to the numbers of samples of the plurality of objects and the sum of the numbers of samples of the plurality of objects;
a first positive sample ratio calculation unit, configured to calculate positive sample ratios of the plurality of objects according to the total number of positive samples among the samples of the plurality of objects and the sum of the numbers of samples of the plurality of objects; and
a first weight calculation unit, configured to use the product of the inverse document frequencies of the plurality of objects and the positive sample ratios as the first weights of the plurality of objects in the category to which the positive samples belong.
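The three units above can be illustrated with a minimal Python sketch. The exact formulas are not spelled out in this passage, so the sketch assumes an inverse document frequency of log(total samples / samples of the object) and a positive sample ratio of (positive samples of the object / total samples). The function name `first_weights` and the object names are hypothetical.

```python
import math

def first_weights(sample_counts, positive_counts):
    """Return a TF-IDF-style weight per object for the positive class.

    sample_counts:   {object: number of samples of the object}, all > 0
    positive_counts: {object: number of those samples that are positive}
    """
    total = sum(sample_counts.values())  # sum of the sample numbers
    weights = {}
    for obj, n in sample_counts.items():
        idf = math.log(total / n)                        # assumed IDF form
        pos_ratio = positive_counts.get(obj, 0) / total  # assumed ratio form
        weights[obj] = idf * pos_ratio
    return weights

# Hypothetical counts: app_c is rare but frequently positive.
weights = first_weights(
    {"app_a": 900, "app_b": 90, "app_c": 10},
    {"app_a": 9, "app_b": 9, "app_c": 5},
)
```

Under these assumptions the rare but frequently positive object receives the largest weight, which matches the stated aim of raising the importance of sparse values.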
Optionally, the second weight calculation module 303 includes:
a second inverse document frequency calculation unit, configured to calculate inverse document frequencies of the plurality of object bins according to the numbers of samples of the plurality of object bins and the sum of the numbers of samples of the plurality of object bins;
a second positive sample ratio calculation unit, configured to calculate positive sample ratios of the plurality of object bins according to the total number of positive samples among the samples of the plurality of object bins and the sum of the numbers of samples of the plurality of object bins; and
a second weight calculation unit, configured to use the product of the inverse document frequencies of the plurality of object bins and the positive sample ratios as the second weights of the plurality of object bins in the category to which the positive samples belong, and to use the second weights of the plurality of object bins as the feature values of the plurality of object bins.
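As a hedged illustration of the second weight calculation, the following Python sketch aggregates per-object counts into per-bin counts and applies the same product of inverse document frequency and positive sample ratio at bin level. The formula forms (log of total over bin count, and bin positives over total) are assumptions, and `bin_feature_values` and the sample data are hypothetical.

```python
import math

def bin_feature_values(bins, sample_counts, positive_counts):
    """Return one feature value per bin (the bin's second weight).

    bins: list of lists of object names, each object in exactly one bin.
    """
    # Aggregate the sample and positive counts of the objects in each bin.
    bin_n = [sum(sample_counts[o] for o in b) for b in bins]
    bin_p = [sum(positive_counts.get(o, 0) for o in b) for b in bins]
    total = sum(bin_n)
    # Assumed forms: IDF = log(total / bin count), ratio = bin positives / total.
    return [math.log(total / n) * (p / total) for n, p in zip(bin_n, bin_p)]

values = bin_feature_values(
    [["app_c", "app_b"], ["app_a"]],
    {"app_a": 900, "app_b": 90, "app_c": 10},
    {"app_a": 9, "app_b": 9, "app_c": 5},
)
```

Every object in a bin then shares that bin's value as its encoded feature, so the number of distinct feature values equals the number of bins.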
Optionally, the binning module 302 is specifically configured to:
sort the first weights in reverse order, and bin the plurality of objects based on the reverse-sorted first weights to obtain a plurality of object bins.
Optionally, the binning module 302 is specifically configured to:
bin the plurality of objects according to the first weights by using an equal-frequency binning method to obtain a plurality of object bins.
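The two optional binning configurations can be combined in one short Python sketch. It assumes that "reverse order" means descending weight and that "equal frequency" means each bin holds roughly the same number of objects (an alternative reading, equal numbers of samples per bin, is not shown). The function name `equal_frequency_bins` and the weights are hypothetical.

```python
def equal_frequency_bins(weights, num_bins):
    """Split objects into num_bins bins of near-equal size by descending weight."""
    ranked = sorted(weights, key=weights.get, reverse=True)
    base, extra = divmod(len(ranked), num_bins)
    bins, start = [], 0
    for i in range(num_bins):
        # The first `extra` bins take one additional object each.
        size = base + (1 if i < extra else 0)
        bins.append(ranked[start:start + size])
        start += size
    return bins

bins = equal_frequency_bins(
    {"a": 0.5, "b": 0.1, "c": 0.3, "d": 0.2, "e": 0.4}, num_bins=2
)
```

With descending order, the first bin collects the objects with the highest first weights, so bin index itself carries an ordering by risk.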
Optionally, the model is used for credit risk control;
correspondingly, the object is an application program, the number of samples of the object is the number of installations of the application program, the categories include two categories, namely the application program being installed with a high risk of financial fraud and the application program being installed with a low risk of financial fraud, and a positive sample among the samples of the object is one in which the application program is installed and the risk of financial fraud is high.
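To make the mapping from this scenario to the counts used by the weight calculations concrete, the following Python sketch tallies, for each application program, its number of installations and the number of those installations that belong to high-risk users. The record layout (a list of installed application names plus a high-risk flag per user) and all names are assumptions for illustration only.

```python
from collections import Counter

def tally_installs(records):
    """Count installations and high-risk installations per application.

    records: iterable of (installed_apps, is_high_risk) pairs, one per user.
    """
    sample_counts, positive_counts = Counter(), Counter()
    for apps, is_high_risk in records:
        for app in set(apps):  # count each app at most once per user
            sample_counts[app] += 1
            if is_high_risk:
                positive_counts[app] += 1
    return sample_counts, positive_counts

sample_counts, positive_counts = tally_installs([
    (["loan_x", "chat"], True),
    (["chat"], False),
    (["loan_x"], True),
])
```

The two counters correspond to the number of samples of each object and the number of positive samples of each object, respectively.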
The above product can execute the method provided by any embodiment of the present disclosure, and has the functional modules and beneficial effects corresponding to executing the method.
In the technical solutions of the present disclosure, the collection, storage, use, processing, transmission, provision and disclosure of the user personal information involved all comply with relevant laws and regulations and do not violate public order and good customs.
According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium and a computer program product.
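FIG. 4 shows a schematic block diagram of an example electronic device 400 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers and other suitable computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smart phones, wearable devices and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are merely examples, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.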
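As shown in FIG. 4, the device 400 includes a computing unit 401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 402 or a computer program loaded from a storage unit 408 into a random access memory (RAM) 403. The RAM 403 can also store various programs and data required for the operation of the device 400. The computing unit 401, the ROM 402 and the RAM 403 are connected to one another via a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
A plurality of components of the device 400 are connected to the I/O interface 405, including: an input unit 406, such as a keyboard or a mouse; an output unit 407, such as various types of displays and speakers; a storage unit 408, such as a magnetic disk or an optical disc; and a communication unit 409, such as a network card, a modem or a wireless communication transceiver. The communication unit 409 allows the device 400 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.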
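The computing unit 401 may be any of various general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 401 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units that run machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller or microcontroller. The computing unit 401 performs the methods and processes described above, for example the feature encoding method. For example, in some embodiments, the feature encoding method may be implemented as a computer software program that is tangibly embodied in a machine-readable medium, such as the storage unit 408. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 400 via the ROM 402 and/or the communication unit 409. When the computer program is loaded into the RAM 403 and executed by the computing unit 401, one or more steps of the feature encoding method described above may be performed. Alternatively, in other embodiments, the computing unit 401 may be configured to perform the feature encoding method in any other appropriate manner (for example, by means of firmware).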
Various implementations of the systems and techniques described above may be realized in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor, and may receive data and instructions from a storage system, at least one input device and at least one output device, and transmit data and instructions to the storage system, the at least one input device and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any suitable combination of the foregoing. More specific examples of the machine-readable storage medium include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described here may be implemented on a computer having: a display apparatus (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user; and a keyboard and a pointing apparatus (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input or tactile input).
The systems and techniques described here may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer having a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described here), or a computing system that includes any combination of such back-end, middleware or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a local area network (LAN), a wide area network (WAN), a blockchain network and the Internet.
A computer system may include a client and a server. The client and the server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also known as a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS services. The server may also be a server of a distributed system, or a server combined with a blockchain.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking and planning), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing and knowledge graph technologies.
Cloud computing refers to a technical system in which an elastic, scalable pool of shared physical or virtual resources is accessed over a network, where the resources may include servers, operating systems, networks, software, applications and storage devices, and can be deployed and managed on demand in a self-service manner. Cloud computing technology can provide efficient and powerful data processing capabilities for technical applications such as artificial intelligence and blockchain, and for model training.
It should be understood that steps may be reordered, added or deleted using the various forms of flows shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially or in a different order, as long as the desired results of the technical solutions provided by the present disclosure can be achieved, and no limitation is imposed herein.
The above specific embodiments do not constitute a limitation on the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations and substitutions may be made according to design requirements and other factors. Any modification, equivalent replacement or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210564917.9A (published as CN114881163B) | 2022-05-23 | 2022-05-23 | A feature encoding method, device, equipment and medium |
| Publication Number | Publication Date |
|---|---|
| CN114881163A | 2022-08-09 |
| CN114881163B | 2025-06-06 |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |