CN112784556B - Method and device for generating pivot table value - Google Patents

Info

Publication number
CN112784556B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911088551.7A
Other languages
Chinese (zh)
Other versions
CN112784556A (en)
Inventor
苏奕虹
辛洋
皮霞林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Application filed by Beijing Kingsoft Office Software Inc, Zhuhai Kingsoft Office Software Co Ltd
Priority to CN201911088551.7A
Publication of CN112784556A
Application granted
Publication of CN112784556B

Abstract

A method for generating pivot table values comprises: after receiving an instruction to create a pivot table, obtaining the selected data columns in a table; traversing the selected data columns and obtaining at least one preset feature value for each data column; inputting the feature values of each data column into a pre-generated random forest model to obtain an analysis result for each data column; and determining the data columns whose analysis results satisfy a preset condition as the values of the pivot table. By using a random forest model, the method can generate pivot table values automatically, helping users process and analyze data, lowering the barrier to use, and giving users a more convenient workflow.

Description

Translated from Chinese
A method and device for generating pivot table values

Technical field

This document relates to computer technology, and in particular to a method and device for generating pivot table values.

Background

The pivot table is a spreadsheet feature with a high barrier to entry. No more than 2% of all spreadsheet users use it. In many spreadsheet documents, users need to count, sum, or average the columns of a worksheet. A pivot table is the most convenient way to do this, but because the feature is hard to learn, many users resort to clumsy workarounds.

Summary of the invention

This application provides a method and device for generating pivot table values that help users process and analyze data, lower the barrier to use, and give users a more convenient workflow.

This application provides a method for generating pivot table values, the method comprising: after receiving an instruction to create a pivot table for the current table, obtaining the selected data columns in the table; traversing the selected data columns in at least one predetermined order and, for each selected data column, obtaining at least one predetermined feature value of that column; inputting the predetermined feature values of each data column into a pre-generated random forest model to obtain an analysis result for each data column; and generating the values of the pivot table from the data columns whose analysis results satisfy a preset condition.

In an exemplary embodiment, generating the values of the pivot table from the data columns whose analysis results satisfy the preset condition comprises: summing the cell values of the data column in the current table separately for each row header of the pivot table, and using each sum as the value of the corresponding cell in the pivot table.

In an exemplary embodiment, inputting the predetermined feature values of each data column into the pre-generated random forest model to obtain the analysis result, and generating the values of the pivot table from the qualifying data columns, comprises: inputting the predetermined feature values of each data column into the pre-generated random forest model and computing a predicted row field score for each data column; and generating the values of the pivot table from the columns whose predicted row field score lies within a preset range.

In an exemplary embodiment, the preset random forest model is built by collecting multiple pivot tables as training samples, extracting at least one feature value, building pivot table value decision trees according to the decision tree generation steps, and building the model from those decision trees.

In an exemplary embodiment, the at least one predetermined order includes a first left-to-right order. When traversing in the first left-to-right order, the at least one predetermined feature value obtained for each selected data column includes: the keywords extracted from the title, the total number of data columns, the number of cells containing only numbers, and the variance of the character length of the integer digits in the cells.

In an exemplary embodiment, the at least one predetermined order includes a second left-to-right order. When traversing in the second left-to-right order, the at least one predetermined feature value obtained for each selected data column further includes: among the column itself and the columns to its left, the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates.

In an exemplary embodiment, the at least one predetermined order includes a right-to-left order. When traversing in the right-to-left order, the at least one predetermined feature value obtained for each selected data column further includes: among the column itself and the columns to its right, the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates.

In an exemplary embodiment, inputting the at least one predetermined feature value of each data column into the pre-generated random forest model comprises inputting the following into the random forest model, in this order: the keywords extracted from the title; the total number of data columns; the number of cells containing only numbers; the variance of the character length of the integer digits in the cells; among the column itself and the columns to its left, the number of columns containing numbers, the number containing only numbers, and the number containing Chinese, English, or dates; and among the column itself and the columns to its right, the number of columns containing numbers, the number containing only numbers, and the number containing Chinese, English, or dates.

This application also provides a device for targeted delivery of content, comprising: an acquisition module, configured to obtain the selected data columns in the table after receiving an instruction to create a pivot table; and an analysis module, configured to traverse the selected data columns in at least one predetermined order, obtain at least one predetermined feature value for each selected data column, input the at least one predetermined feature value of each data column into a pre-generated random forest model in a predetermined order to obtain an analysis result for each data column, and determine the data columns whose analysis results satisfy a preset condition as the values of the pivot table.

This application also provides a device for targeted delivery of content, comprising a processor and a memory, wherein the memory stores a program for targeted delivery of content, and the processor is configured to read the program and execute the method of any one of the above.

Compared with the related art, this application uses a random forest model to generate pivot table values automatically, helping users process and analyze data, lowering the barrier to use, and giving users a more convenient workflow.

Other features and advantages of this application will be set forth in the description that follows, and in part will be apparent from the description or may be learned by practicing this application. Other advantages of this application can be realized and obtained through the solutions described in the specification, the claims, and the drawings.

Description of the drawings

The drawings are provided to aid understanding of the technical solution of this application and form part of the specification. Together with the embodiments of this application they serve to explain the technical solution, and they do not limit it.

Fig. 1 is a flowchart of the method for generating pivot table values of this application;

Fig. 2 shows the table data of Embodiment 1 of this application;

Fig. 3 is a schematic diagram of generating a pivot table from the data of Embodiment 1 using the prior art;

Fig. 4 is a schematic diagram of the pivot table result generated for Embodiment 1 using the prior art;

Fig. 5 shows the table data of Embodiment 2 of this application;

Fig. 6 is a detailed workflow diagram of the method for generating pivot table values of this application;

Fig. 7 is a schematic diagram of the modules of the device for generating pivot table values of this application.

Detailed description

This application describes at least one embodiment, but the description is illustrative rather than restrictive, and it will be apparent to those of ordinary skill in the art that more embodiments and implementations are possible within the scope of the embodiments described here. Although many possible combinations of features are shown in the drawings and discussed in the detailed description, many other combinations of the disclosed features are also possible. Unless expressly limited, any feature or element of any embodiment may be used in combination with, or may replace, any other feature or element of any other embodiment.

This application includes and contemplates combinations with features and elements known to those of ordinary skill in the art. The embodiments, features, and elements disclosed in this application may also be combined with any conventional feature or element to form a unique inventive solution as defined by the claims. Any feature or element of any embodiment may also be combined with features or elements from other inventive solutions to form another unique inventive solution as defined by the claims. It should therefore be understood that any feature shown or discussed in this application may be implemented alone or in any suitable combination. Accordingly, the embodiments are not limited except by the appended claims and their equivalents. In addition, various modifications and changes may be made within the scope of the appended claims.

Furthermore, in describing representative embodiments, the specification may have presented a method or process as a particular sequence of steps. To the extent that the method or process does not depend on the particular order of steps described here, it should not be limited to that order. As those of ordinary skill in the art will understand, other orders of steps are also possible. The particular order of steps set forth in the specification should therefore not be construed as a limitation on the claims. Likewise, claims directed to the method or process should not be limited to performing the steps in the order written; those skilled in the art will readily understand that the order may vary while remaining within the spirit and scope of the embodiments of this application.

The technical solution of this application is described in more detail below with reference to the drawings and embodiments.

As shown in Fig. 1, an embodiment of the present invention provides a method for generating pivot table values, comprising the following steps:

S101. After receiving an instruction to create a pivot table for the current table, obtain the selected data columns in the table;

S102. Traverse the selected data columns in at least one predetermined order; for each selected data column, obtain at least one predetermined feature value of that column; input the predetermined feature values of each data column into a pre-generated random forest model to obtain an analysis result for each data column;

S103. Generate the values of the pivot table from the data columns whose analysis results satisfy a preset condition.
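The three steps above can be read as a small pipeline. The following is only a minimal sketch under stated assumptions: the column representation, the function names `choose_value_columns`, `numeric_fraction`, and `stub_model`, and the score range are all hypothetical, and the stub model merely stands in for the pre-generated random forest model.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical column representation: {"title": str, "cells": list}
Column = Dict[str, object]


def choose_value_columns(
    columns: Sequence[Column],
    extract_features: Callable[[Sequence[Column], int], List[float]],
    score_model: Callable[[List[float]], float],
    low: float = 0.5,   # assumed lower bound of the preset range
    high: float = 1.0,  # assumed upper bound of the preset range
) -> List[str]:
    """Return the titles of the columns whose score lies in [low, high]."""
    chosen = []
    for index, column in enumerate(columns):         # S102: visit each selected column
        features = extract_features(columns, index)  # S102: predetermined feature values
        score = score_model(features)                # S102: model produces an analysis result
        if low <= score <= high:                     # S103: preset condition
            chosen.append(column["title"])
    return chosen


def numeric_fraction(columns, index):
    """Toy feature extractor: fraction of cells that are int or float."""
    cells = columns[index]["cells"]
    return [sum(isinstance(c, (int, float)) for c in cells) / len(cells)]


def stub_model(features):
    """Stand-in for the trained model: returns the single feature as the score."""
    return features[0]


selected = [
    {"title": "Name", "cells": ["Ann", "Bob"]},
    {"title": "Teacher days", "cells": [1, 2]},
]
value_columns = choose_value_columns(selected, numeric_fraction, stub_model)
```

Passing the feature extractor and the model in as callables keeps the sketch independent of any particular library; a real implementation would substitute the trained model and the ten feature values described later in this document.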

A pivot table is an interactive table that can perform certain calculations, such as summing and counting. The calculations depend on how the data is arranged in the pivot table. The layout can be changed dynamically to analyze the data in different ways, and the row labels, column labels, and page fields can be rearranged. Each time the layout changes, the pivot table immediately recalculates the data according to the new layout. In addition, if the source data changes, the pivot table can be updated.

A random forest is a classifier containing at least one decision tree, and its output class is the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the random forest algorithm, and "Random Forests" is their trademark. The term derives from the random decision forests proposed by Tin Kam Ho of Bell Labs in 1995. The method combines Breiman's "bootstrap aggregating" idea with Ho's "random subspace method" to build a collection of decision trees.

In an exemplary embodiment, each tree is built according to the following algorithm. Let N denote the number of training cases (samples) and M the number of features. A number of input features m is given, used to determine the decision at a node of the decision tree, where m should be much smaller than M. From the N training cases, sample N times with replacement to form a training set (bootstrap sampling), and use the cases that were not drawn to make predictions and estimate the error. For each node, select m features at random; the decision at each node of the tree is determined from these features, and the best split is computed from these m features. Each tree is grown fully and is not pruned, although pruning may be applied after a normal tree classifier has been built.
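The two sources of randomness in the tree-building algorithm above, bootstrap sampling of the N cases and random selection of m of the M features at a node, can be illustrated as follows. This is a sketch only: the function names are hypothetical, and the split search and tree growth themselves are omitted.

```python
import random


def bootstrap_sample(n_samples, rng):
    """Draw n_samples indices with replacement.

    Returns (in_bag, out_of_bag): the drawn indices, and the indices that
    were never drawn and can be used to estimate the prediction error.
    """
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag


def random_feature_subset(n_features, m, rng):
    """Pick m distinct candidate feature indices for one node split (m < M)."""
    return rng.sample(range(n_features), m)


rng = random.Random(0)                           # fixed seed so the run is repeatable
in_bag, out_of_bag = bootstrap_sample(10, rng)   # N = 10 training cases
candidates = random_feature_subset(10, 3, rng)   # M = 10 features, m = 3
```

Each tree gets its own bootstrap sample, and a fresh feature subset is drawn at every node, which is what makes the individual trees differ from one another.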

In an exemplary embodiment, the data in a Microsoft Office Excel worksheet is used as the source of the list of table data columns.

In an exemplary embodiment, in step S101, the instruction to create a pivot table may be a preset option in a Microsoft Office Excel worksheet, such that clicking the option triggers creation of the pivot table; alternatively, a pivot table may be suggested automatically when the user selects data columns.

In an exemplary embodiment, in step S101, the selected data columns obtained from the table may be the directly selected data columns, or data columns obtained by trimming or extending the directly selected data columns.

In an exemplary embodiment, in step S101, each selected data column obtained from the table includes the title of the data column and the data under that title.

In an exemplary embodiment, the preset random forest model is built by collecting at least one pivot table as training samples, extracting at least one feature, building pivot table value decision trees according to the decision tree generation steps, and then building the model from those decision trees.

In an exemplary embodiment, in step S103, generating the values of the pivot table from the data columns whose analysis results satisfy the preset condition comprises:

summing the cell values of the data column in the current table separately for each row header of the pivot table, and using each sum as the value of the corresponding cell in the pivot table.

In an exemplary embodiment, as shown in Fig. 2, within the qualifying second data column "Teacher days", all values that correspond to, for example, "Full day" are summed, and the sum is the value of the "Full day" row in the pivot table.
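The per-row-header summation can be sketched as a simple group-and-sum. The cell values below are invented for illustration (the contents of Fig. 2 are not reproduced in this text), the "Half day" label is hypothetical, and the name `sum_by_row_header` is an assumption.

```python
from collections import defaultdict


def sum_by_row_header(row_labels, values):
    """Sum the values separately for each distinct row header."""
    totals = defaultdict(float)
    for label, value in zip(row_labels, values):
        totals[label] += value
    return dict(totals)


# Made-up sample data: one row-field column and one value column.
care_mode = ["Full day", "Half day", "Full day", "Half day", "Full day"]
teacher_days = [2, 1, 3, 1, 2]

pivot_values = sum_by_row_header(care_mode, teacher_days)
```

With this sample data the "Full day" row receives 2 + 3 + 2 and the "Half day" row receives 1 + 1.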

In an exemplary embodiment, in step S102, inputting the predetermined feature values of each data column into the pre-generated random forest model to obtain the analysis result of each data column, and generating the values of the pivot table from the data columns whose analysis results satisfy the preset condition, comprises:

inputting the at least one preset feature value of each data column into the pre-generated random forest model and computing a predicted row field score for each data column; and generating the values of the pivot table from the data columns whose predicted row field score lies within a preset range. In other embodiments, the inference may instead be logical, for example with the model directly outputting "yes" or "no". This embodiment is only an example and is not limiting.
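The text does not say how the score is computed from the forest. One common choice, assumed here purely for illustration, is the fraction of trees that vote "yes"; the yes/no variant then corresponds to taking the mode of the tree outputs. The function names and the range bounds are hypothetical.

```python
from collections import Counter


def forest_score(votes):
    """Fraction of trees voting True (an assumed scoring rule)."""
    return sum(votes) / len(votes)


def forest_label(votes):
    """Mode of the individual tree outputs, giving a direct yes/no answer."""
    return Counter(votes).most_common(1)[0][0]


def in_preset_range(score, low=0.5, high=1.0):
    """Check the preset condition; the bounds here are placeholders."""
    return low <= score <= high


votes = [True, True, False, True, False]  # made-up outputs of five trees
score = forest_score(votes)
label = forest_label(votes)
accepted = in_preset_range(score)
```

A score with a range check lets the threshold be tuned without retraining, whereas the yes/no output fixes the decision at a simple majority.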

In an exemplary embodiment, in step S103, after the data columns whose analysis results satisfy the preset condition are determined as the values of the pivot table, the method further comprises performing a calculation on those columns. In an exemplary embodiment, the values of the column that correspond to the same row header may be summed, counted, and so on.

In an exemplary embodiment, the at least one predetermined order includes a first left-to-right order. When traversing in the first left-to-right order, the at least one predetermined feature value obtained for each selected data column includes: the keywords extracted from the title, the total number of data columns, the number of cells containing only numbers, and the variance of the character length of the integer digits in the cells.

In an exemplary embodiment, the at least one predetermined order includes a second left-to-right order. When traversing in the second left-to-right order, the at least one predetermined feature value obtained for each selected data column includes: among the column itself and the columns to its left, the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates.

In an exemplary embodiment, the at least one predetermined order includes a right-to-left order. When traversing in the right-to-left order, the at least one predetermined feature value obtained for each selected data column includes: among the column itself and the columns to its right, the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates.

The first and second left-to-right orders mean that the columns are traversed from left to right twice, each pass collecting different feature values. One pass obtains the keywords extracted from the title, the total number of data columns, the number of cells containing only numbers, and the variance of the character length of the integer digits in the cells. The other pass obtains, among the column itself and the columns to its left, the number of columns containing numbers, the number containing only numbers, and the number containing Chinese, English, or dates.

The feature values in the above embodiments can be obtained in the predetermined orders, which is simple for a computer to process. Those skilled in the art may of course obtain the feature values in other predetermined orders; what matters in this application is the resulting feature values, not the order in which they are obtained, and no limitation is imposed here.

Feature value 1: keywords extracted from the title. For example, a title containing any of the following words counts +1: "quantity" (数量), "amount" (金额), "summary" (汇总), "total" (合计), "income" (收入), "expenditure" (支出), "sum" (额), "fee" (费), "sales" (销售); and a title containing any of the following words counts -1: "month" (月份), "year" (年份), "No." (号), "contact" (联系), "phone" (电话), "code" (码), "order number" (单号), "serial number" (序号), "unit price" (单价), "time" (时间), "date" (日期), "ID number" (编号), "unit" (单位). The result is a single number that serves as this feature value. Many more keywords can be extracted, and training on more samples can yield further keywords.
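Feature value 1 can be sketched as a signed keyword count over the title. The text does not say whether several matches accumulate or how overlapping keywords (for example 金额 and 额) are handled; the sketch assumes each listed keyword found in the title contributes once, so overlapping keywords both count. The name `title_keyword_score` is hypothetical.

```python
# Keyword lists taken from the description; matching is done on the
# original Chinese strings because the titles being scored are Chinese.
POSITIVE_KEYWORDS = ("数量", "金额", "汇总", "合计", "收入", "支出", "额", "费", "销售")
NEGATIVE_KEYWORDS = ("月份", "年份", "号", "联系", "电话", "码", "单号", "序号",
                     "单价", "时间", "日期", "编号", "单位")


def title_keyword_score(title):
    """Return (+1 per positive keyword found) + (-1 per negative keyword found).

    Assumption: every listed keyword that occurs in the title contributes
    once, even when one keyword is a substring of another.
    """
    plus = sum(1 for word in POSITIVE_KEYWORDS if word in title)
    minus = sum(1 for word in NEGATIVE_KEYWORDS if word in title)
    return plus - minus


sales_amount = title_keyword_score("销售金额")   # matches 销售, 金额, 额
contact_phone = title_keyword_score("联系电话")  # matches 联系, 电话
teacher_days = title_keyword_score("老师天数")   # no keyword matches
```

Under this assumption a title that looks like a measure scores positive, a title that looks like an identifier or a date scores negative, and a neutral title scores zero.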

Feature value 2: the total number of data columns. Specifically, this is the total number of columns in the table, so every column of the same table has the same value for this feature.

Feature value 3: the number of cells containing only numbers. Specifically, the cells of the column that contain only numbers are counted to obtain this value.

Feature value 4: the variance of the character length of the integer digits in the cells. Specifically, the character length of the integer number in each cell of the column is determined and the variance of those lengths is computed; for a decimal number, only the integer part is used.
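Feature values 3 and 4 can be sketched as below. Several details are assumptions not fixed by the text: cells are treated as strings, a decimal such as "3.5" counts as a number-only cell, a leading minus sign is ignored when measuring length, non-numeric cells are skipped, and the population variance is used. The function names are hypothetical.

```python
import re
from statistics import pvariance

# Assumed definition of a "number-only" cell: optional sign, digits,
# optional decimal part.
NUMERIC_CELL = re.compile(r"^-?\d+(\.\d+)?$")


def count_numeric_cells(cells):
    """Feature value 3: how many cells contain only a number."""
    return sum(1 for cell in cells if NUMERIC_CELL.match(str(cell).strip()))


def integer_length_variance(cells):
    """Feature value 4: variance of the digit count of each integer part."""
    lengths = []
    for cell in cells:
        text = str(cell).strip()
        if NUMERIC_CELL.match(text):
            integer_part = text.lstrip("-").split(".")[0]  # drop sign and decimals
            lengths.append(len(integer_part))
    return pvariance(lengths) if lengths else 0.0


sample_cells = ["12", "3.5", "1000", "n/a"]  # made-up column contents
numeric_count = count_numeric_cells(sample_cells)
length_variance = integer_length_variance(sample_cells)
```

Here the integer parts "12", "3", and "1000" have lengths 2, 1, and 4. A low variance suggests fixed-width values such as codes or phone numbers, while a higher variance is more typical of quantities.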

Feature value 5: the number of columns containing numbers among the column itself and the columns to its left. Specifically, counting from the first column on the left up to the current column, the number of columns that contain numbers.

Feature value 6: the number of columns containing only numbers among the column itself and the columns to its left. Specifically, counting from the first column on the left up to the current column, the number of columns that contain only numbers.

Feature value 7: the number of columns containing Chinese, English, or dates among the column itself and the columns to its left. Specifically, counting from the first column on the left up to the current column, the number of columns that contain Chinese, English, or dates.

Feature value 8: the number of columns containing numbers among the column itself and the columns to its right. Specifically, counting from the first column on the right up to the current column, the number of columns that contain numbers.

Feature value 9: the number of columns containing only numbers among the column itself and the columns to its right. Specifically, counting from the first column on the right up to the current column, the number of columns that contain only numbers.

Feature value 10: the number of columns containing Chinese, English, or dates among the column itself and the columns to its right. Specifically, counting from the first column on the right up to the current column, the number of columns that contain Chinese, English, or dates. A column counts if it contains any one of Chinese, English, or a date.
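Feature values 5 to 10 are running counts of three per-column flags, taken once from the left edge and once from the right edge, each including the current column. The sketch below assumes cells are strings without the title row, and the regular expressions used to detect numbers, Chinese, English, and dates are illustrative guesses; all function names are hypothetical.

```python
import re


def has_number(cells):
    """True if any cell contains a digit."""
    return any(re.search(r"\d", str(c)) for c in cells)


def only_numbers(cells):
    """True if every cell is a plain number (assumed pattern)."""
    return all(re.fullmatch(r"-?\d+(\.\d+)?", str(c).strip()) for c in cells)


def has_text_or_date(cells):
    """True if any cell contains Chinese, English letters, or a date-like value."""
    pattern = r"[\u4e00-\u9fff]|[A-Za-z]|\d{4}[-/]\d{1,2}[-/]\d{1,2}"
    return any(re.search(pattern, str(c)) for c in cells)


def inclusive_counts(flags):
    """For each position, count the True flags from the start up to and including it."""
    counts, running = [], 0
    for flag in flags:
        running += int(flag)
        counts.append(running)
    return counts


def positional_features(columns):
    """Return [f5, f6, f7, f8, f9, f10] for every column, in table order."""
    lefts, rights = [], []
    for predicate in (has_number, only_numbers, has_text_or_date):
        flags = [predicate(cells) for cells in columns]
        lefts.append(inclusive_counts(flags))                # left-to-right pass
        rights.append(inclusive_counts(flags[::-1])[::-1])   # right-to-left pass
    return [
        [left[i] for left in lefts] + [right[i] for right in rights]
        for i in range(len(columns))
    ]


# Made-up five-column table loosely shaped like the Fig. 2 example.
table = [
    ["一年级", "二年级"],
    ["Ann", "Bob"],
    ["Full day", "Half day"],
    ["20", "10"],
    ["2", "1"],
]
features = positional_features(table)
```

For the last column of this sample, two numeric columns lie at or to its left and three text columns precede it, while only the column itself lies at or to its right.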

In an exemplary embodiment, in step S102, inputting the at least one predetermined feature value of each data column into the pre-generated random forest model comprises:

inputting the following into the random forest model, in this order: the keywords extracted from the title; the total number of data columns; the number of cells containing only numbers; the variance of the character length of the integer digits in the cells; among the column itself and the columns to its left, the number of columns containing numbers, the number containing only numbers, and the number containing Chinese, English, or dates; and among the column itself and the columns to its right, the number of columns containing numbers, the number containing only numbers, and the number containing Chinese, English, or dates.
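Because the model expects the ten feature values in a fixed order, one way to guard against accidental reordering is to assemble the input vector from named values. This is a sketch only; the feature names in `FEATURE_ORDER` and the function `to_feature_vector` are hypothetical labels for the ten feature values listed above.

```python
# Hypothetical names for feature values 1 to 10, in the order they are
# fed to the model.
FEATURE_ORDER = (
    "title_keyword_score",
    "table_column_count",
    "numeric_cell_count",
    "integer_length_variance",
    "left_has_number",
    "left_only_number",
    "left_text_or_date",
    "right_has_number",
    "right_only_number",
    "right_text_or_date",
)


def to_feature_vector(features):
    """Return the feature values as a list in the fixed model order."""
    missing = [name for name in FEATURE_ORDER if name not in features]
    if missing:
        raise KeyError(f"missing feature values: {missing}")
    return [float(features[name]) for name in FEATURE_ORDER]


# Made-up values for one column, deliberately given out of order.
column_features = {
    "right_text_or_date": 0, "right_only_number": 1, "right_has_number": 1,
    "left_text_or_date": 3, "left_only_number": 2, "left_has_number": 2,
    "integer_length_variance": 0.25, "numeric_cell_count": 12,
    "table_column_count": 5, "title_keyword_score": 0,
}
vector = to_feature_vector(column_features)
```

Keeping the order in one constant means that training and prediction code can share it, so the vector layout stays the same when the model is retrained.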

Those skilled in the art will understand that the random forest algorithm only needs the correct feature values and a model in order to produce a score. The internal process of the random forest is encapsulated by the algorithm and is outside the scope of this patent.

Those skilled in the art will understand that a trained random forest model is tied to its feature values, and their order cannot be changed arbitrarily. The exception is when the model is retrained with new user data, for example while optimizing the algorithm; as long as no feature values are added or removed, the retrained model keeps the same feature values in the same order. The order adopted in this application yields fairly accurate results.
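One way to guard the fixed order in code is to name the features once and always build the model input from that list. The sketch below is illustrative only: the identifiers in `FEATURE_ORDER` and the function `to_feature_vector` are hypothetical names, not part of this application. The sample values are those stated for the first column of Embodiment 1.

```python
from typing import Dict, List

# Hypothetical identifiers for the ten feature values, in the order they are fed to the model.
FEATURE_ORDER = (
    "title_keyword_score",
    "total_columns",
    "number_only_cells",
    "integer_length_variance",
    "left_contains_numbers",
    "left_only_numbers",
    "left_contains_text_or_date",
    "right_contains_numbers",
    "right_only_numbers",
    "right_contains_text_or_date",
)


def to_feature_vector(features: Dict[str, float]) -> List[float]:
    """Return the feature values as a list in the fixed order expected by the model."""
    if set(features) != set(FEATURE_ORDER):
        raise ValueError("feature set must match the one the model was trained on")
    return [float(features[name]) for name in FEATURE_ORDER]


# First column of Embodiment 1, deliberately given in a different key order.
first_column = {
    "total_columns": 5,
    "title_keyword_score": 0,
    "number_only_cells": 0,
    "integer_length_variance": 0,
    "right_contains_numbers": 2,
    "right_only_numbers": 2,
    "right_contains_text_or_date": 0,
    "left_contains_numbers": 0,
    "left_only_numbers": 0,
    "left_contains_text_or_date": 0,
}
print(to_feature_vector(first_column))
# [0.0, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 2.0, 2.0, 0.0]
```

Rejecting a dictionary with missing or extra keys reflects the statement that features may not be added or removed without retraining.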

This application uses a random forest model to analyze at least one feature value of each column in the list of table data columns, obtained in a predetermined order. It automatically finds the columns that hold the pivot table values and determines them for the user, which lowers the barrier to use and gives users a more convenient way to work.

As shown in FIG. 2, the list of table data columns in Embodiment 1 of this application comprises five columns, A, B, C, D, and E, titled "Grade" (年级), "Name" (姓名), "Care mode" (托管方式), "Days" (天数), and "Teacher days" (老师天数). The user wants to sum "Teacher days" by "Care mode".

FIGS. 3 and 4 show the prior-art procedure for summing "Teacher days" (老师天数) by "Care mode" (托管方式) with a pivot table. First select the current table area and click Insert, then PivotTable. Next, drag "Care mode" from the upper right to "Rows" at the lower right, and drag "Teacher days" to "Values" at the lower right. Data analysis can proceed once the pivot table has been produced. The prior art thus requires multiple steps.

With the method of this application for determining pivot table values, the system automatically obtains at least one feature value of the table, 10 feature values in this embodiment. As shown in Table 1, taking the first column as an example:

Feature value 1 of the first column: the title keyword score is 0;

Feature value 2 of the first column: the total number of columns is 5;

Feature value 3 of the first column: the number of number-only cells is 0;

Feature value 4 of the first column: the variance of the integer-digit length of the cells is 0;

Feature value 5 of the first column: the number of columns containing numbers among the current column and the columns to its left is 0;

Feature value 6 of the first column: the number of columns containing only numbers among the current column and the columns to its left is 0;

Feature value 7 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its left is 0;

Feature value 8 of the first column: the number of columns containing numbers among the current column and the columns to its right is 2;

Feature value 9 of the first column: the number of columns containing only numbers among the current column and the columns to its right is 2;

Feature value 10 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its right is 0.

These 10 feature values are input into the random forest model for analysis, which gives the first column a predicted row-field score of 0.0. In this embodiment scores range from 0 to 1, and a data column whose score is greater than 0.65 can be used to generate the pivot table values. The first column's predicted row-field score of 0.0 is less than 0.65, so it cannot be used to generate the pivot table values. By the same process, the data column whose predicted row-field score is greater than 0.65 (and between 0 and 1) is the one titled "Teacher days" (老师天数), so column E, "Teacher days", is finally recommended for generating the pivot table values.
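The threshold step can be sketched in a few lines of Python. The function name `recommend_value_columns` is hypothetical; the scores are the per-column values reported in Table 1 for Embodiment 1, and 0.65 is the threshold stated in this embodiment.

```python
from typing import Dict, List

THRESHOLD = 0.65  # threshold stated in this embodiment


def recommend_value_columns(scores: Dict[str, float], threshold: float = THRESHOLD) -> List[str]:
    """Return the columns whose score is strictly greater than the threshold."""
    return [column for column, score in scores.items() if score > threshold]


# Predicted scores for columns A to E as listed in Table 1.
table1_scores = {"A": 0.0, "B": 0.0, "C": 0.0, "D": 0.3, "E": 1.0}
print(recommend_value_columns(table1_scores))  # ['E']
```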

Table 1

[Table 1 image: feature values 1 to 10 and the predicted row-field score for each of columns A to E]

As shown in FIG. 5, the list of table data columns in Embodiment 2 of this application comprises six columns, A, B, C, D, E, and F, titled "Site name" (站点名称), "Material name" (物资名称), "Unit" (单位), "Design quantity" (设计数量), "Construction usage" (施工用量), and "Supervisor-verified quantity" (监理核实数量).

The system automatically obtains at least one feature value of the table, 10 feature values in this embodiment. As shown in Table 2, taking the first column as an example:

Feature value 1 of the first column: the title keyword score is 0;

Feature value 2 of the first column: the total number of columns is 6;

Feature value 3 of the first column: the number of number-only cells is 0;

Feature value 4 of the first column: the variance of the integer-digit length of the cells is 0;

Feature value 5 of the first column: the number of columns containing numbers among the current column and the columns to its left is 0;

Feature value 6 of the first column: the number of columns containing only numbers among the current column and the columns to its left is 0;

Feature value 7 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its left is 0;

Feature value 8 of the first column: the number of columns containing numbers among the current column and the columns to its right is 3;

Feature value 9 of the first column: the number of columns containing only numbers among the current column and the columns to its right is 2;

Feature value 10 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its right is 0.

These 10 feature values are input into the random forest model for analysis, giving a predicted row-field score of 0.0. In this embodiment scores range from 0 to 1, and a data column whose score is greater than 0.65 can be used to generate the pivot table values. The first column's predicted row-field score of 0.0 is less than 0.65, so it cannot serve as a pivot table value. By the same process, the data columns whose predicted row-field scores are greater than 0.65 (and between 0 and 1) are those titled "Design quantity" (设计数量), "Construction usage" (施工用量), and "Supervisor-verified quantity" (监理核实数量), so columns D, E, and F are finally recommended for generating the pivot table values.

Table 2

[Table 2 image: feature values 1 to 10 and the predicted row-field score for each of columns A to F]

As shown in FIG. 7, the device for targeted delivery of content of the present invention comprises:

an acquisition module 10, configured to obtain the selected data columns in the table after receiving an instruction to create a pivot table for the current table;

The device for targeted delivery of content of the present invention further comprises an analysis module 20, configured to: traverse the selected data columns in each of at least one predetermined order and, for each selected data column, obtain at least one predetermined feature value of that column; input the predetermined feature values obtained for each data column into a pre-generated random forest model to obtain, for each data column, an analysis result corresponding to the predetermined feature values; and generate the pivot table values from the data columns whose analysis results satisfy a preset condition.

A pivot table is an interactive table that can perform calculations such as sums and counts. The calculations performed depend on how the data is arranged in the pivot table. The layout can be changed dynamically to analyze the data in different ways, and row numbers, column labels, and page fields can be rearranged. Each time the layout changes, the pivot table immediately recalculates the data according to the new arrangement. In addition, the pivot table can be refreshed if the source data changes.

A random forest is a classifier comprising at least one decision tree, whose output class is the mode of the classes output by the individual trees. Leo Breiman and Adele Cutler developed the random forest algorithm, and "Random Forests" is their trademark. The term derives from the random decision forests proposed in 1995 by Tin Kam Ho of Bell Labs. The method combines Breiman's "bootstrap aggregating" idea with Ho's "random subspace method" to build a collection of decision trees.
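Taking the mode of the tree outputs amounts to a majority vote. The following Python sketch is illustrative: `forest_vote` and `positive_fraction` are hypothetical names, and using the fraction of positive votes as a score between 0 and 1 is an assumption, since the way the score is computed is not described here.

```python
from collections import Counter
from typing import Hashable, Sequence


def forest_vote(tree_outputs: Sequence[Hashable]) -> Hashable:
    """Return the class predicted by the most trees (the mode of the tree outputs)."""
    if not tree_outputs:
        raise ValueError("at least one tree output is required")
    return Counter(tree_outputs).most_common(1)[0][0]


def positive_fraction(tree_outputs: Sequence[Hashable], positive: Hashable = "value") -> float:
    """Fraction of trees voting for the positive class (assumed way to get a 0-1 score)."""
    if not tree_outputs:
        raise ValueError("at least one tree output is required")
    return sum(1 for out in tree_outputs if out == positive) / len(tree_outputs)


votes = ["value", "other", "value", "value"]
print(forest_vote(votes))        # value
print(positive_fraction(votes))  # 0.75
```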

In an exemplary embodiment, each tree is built according to the following algorithm. Let N denote the number of training cases (samples) and M the number of features. A number of features m is given as input and used to determine the decision at each node of the decision tree; m should be much smaller than M. Sample N times with replacement from the N training cases to form a training set (bootstrap sampling), and use the cases that were not drawn to make predictions and estimate the error. For each node, randomly select m features; the decision at every node of the tree is based on these features, and the best split is computed from these m features. Each tree is grown fully without pruning, although pruning might be applied after building an ordinary tree classifier.
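The two random steps of this procedure, bootstrap sampling of the N cases and drawing m of the M features at a node, can be sketched with the Python standard library. The function names and the values N = 20, M = 10, m = 3 are illustrative assumptions; split selection and tree growth are omitted.

```python
import random
from typing import List, Tuple


def bootstrap_sample(n: int, rng: random.Random) -> Tuple[List[int], List[int]]:
    """Draw n indices with replacement; return (in-bag indices, out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def random_feature_subset(total_features: int, m: int, rng: random.Random) -> List[int]:
    """Pick m distinct feature indices out of M candidates for one node."""
    if not 0 < m <= total_features:
        raise ValueError("m must satisfy 0 < m <= M")
    return rng.sample(range(total_features), m)


rng = random.Random(0)  # fixed seed only so that the sketch is repeatable
in_bag, out_of_bag = bootstrap_sample(20, rng)      # N = 20 training cases
node_features = random_feature_subset(10, 3, rng)   # M = 10 features, m = 3 per node
```

The out-of-bag indices are the cases that were never drawn; they are the ones used to estimate the error of the tree trained on the in-bag cases.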

In an exemplary embodiment, data in a Microsoft Office Excel worksheet serves as the source of the list of table data columns.

In an exemplary embodiment, the instruction to create a pivot table may be a preset option in a Microsoft Office Excel worksheet, and clicking that option triggers creation of the pivot table; alternatively, the pivot table may be suggested automatically when the user selects data columns.

In an exemplary embodiment, the data columns obtained by the acquisition module 10 may be the directly selected data columns, or data columns obtained by trimming or expanding the directly selected data columns.

In an exemplary embodiment, the data columns obtained by the acquisition module 10 include the title of each data column and the data content under that title.

In an exemplary embodiment, the random forest model is prepared in advance. Specifically, at least one pivot table is collected as training data samples, at least one feature is extracted, pivot-table-value decision trees are built following the decision tree generation steps, and the model is then built from those decision trees.

In an exemplary embodiment, the analysis module 20 inputting the at least one predetermined feature value obtained for each data column into the pre-generated random forest model to obtain an analysis result for each data column, and determining the data columns whose analysis results satisfy the preset condition as the pivot table values, means:

the analysis module 20 inputs the at least one preset feature value obtained for each data column into the pre-generated random forest model and computes a predicted row-field score for each data column; the columns whose predicted row-field scores fall within a preset range are used as the pivot table values. Other implementations may infer the result by logical operation, for example with the model directly outputting "yes" or "no". This embodiment is only an example and is not limiting.

In an exemplary embodiment, the analysis module 20 generating the pivot table values from the data columns whose analysis results satisfy the preset condition means:

the analysis module 20 sums the cell values of the data column in the current table separately for each row heading of the pivot table, and uses each sum as the value of the corresponding cell in the pivot table.
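Summing a value column per row heading is a group-and-sum operation. A minimal Python sketch follows; the function name `sum_by_row_heading` and the sample rows are hypothetical and are not data from this application.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def sum_by_row_heading(rows: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Sum the value-column cells separately for each row heading."""
    totals: Dict[str, float] = defaultdict(float)
    for heading, value in rows:
        totals[heading] += value
    return dict(totals)


# Hypothetical (row heading, value) pairs for illustration.
rows = [("full day", 20), ("half day", 10), ("full day", 5)]
print(sum_by_row_heading(rows))  # {'full day': 25.0, 'half day': 10.0}
```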

In an exemplary embodiment, the at least one predetermined order includes a first left-to-right order. When traversing in the first left-to-right order, the at least one predetermined feature value that the acquisition module 10 obtains for each selected data column includes: the title keyword score, the total number of columns, the number of number-only cells, and the variance of the integer-digit length of the cells.

In an exemplary embodiment, the at least one predetermined order includes a second left-to-right order. When traversing in the second left-to-right order, the at least one predetermined feature value that the acquisition module 10 obtains for each selected data column includes: the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the current column and the columns to its left.

In an exemplary embodiment, the at least one predetermined order includes a right-to-left order. When traversing in the right-to-left order, the at least one predetermined feature value that the acquisition module 10 obtains for each selected data column includes: the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the current column and the columns to its right.

The first and second left-to-right orders mean that the columns are traversed from left to right twice, each pass obtaining different feature values. One pass obtains the total number of columns, the index value, the data types contained in the whole column, the number of cells after removing duplicate cells, the variance of the number of occurrences of duplicated cell contents, the maximum cell character length, and the variance of cell character length. The other pass obtains the number of columns containing numbers, the number of columns containing only numbers, and the number of columns containing Chinese, English, or dates, each counted among the current column and the columns to its left.

The feature values in the above embodiments may be obtained in a predetermined order, which is relatively simple for a computer to process. Those skilled in the art may of course obtain the feature values in other predetermined orders. What matters in this application is the resulting feature values, not the order in which they are obtained, which is not limited here.

Here, feature value 1 is the title keyword score. For example, a title containing one of the following words counts 1: "数量" (quantity), "金额" (amount of money), "汇总" (summary), "合计" (total), "收入" (income), "支出" (expenditure), "额" (amount), "费" (fee), "销售" (sales). A title containing one of the following words counts -1: "月份" (month), "年份" (year), "号" (number), "联系" (contact), "电话" (telephone), "码" (code), "单号" (order number), "序号" (serial number), "单价" (unit price), "时间" (time), "日期" (date), "编号" (ID number), "单位" (unit). The resulting number is the feature value. Many more keywords could be extracted, and more results can be obtained by training with more samples.

Feature value 2: the total number of columns. Specifically, this is the total number of columns in the table, so every column of the same table has the same value for this feature.

Feature value 3: the number of number-only cells. Specifically, count the cells in the column that contain only numbers.

Feature value 4: the variance of the integer-digit length of the cells. Specifically, take the character length of the integer number in each cell of the column and compute the variance; for a decimal, only the integer part is used.

Feature value 5: the number of columns containing numbers among the current column and the columns to its left. Specifically, counting from the leftmost column to the current column, count the columns that contain numbers.

Feature value 6: the number of columns containing only numbers among the current column and the columns to its left. Specifically, counting from the leftmost column to the current column, count the columns that contain only numbers.

Feature value 7: the number of columns containing Chinese, English, or dates among the current column and the columns to its left. Specifically, counting from the leftmost column to the current column, count the columns that contain Chinese, English, or dates.

Feature value 8: the number of columns containing numbers among the current column and the columns to its right. Specifically, counting from the rightmost column to the current column, count the columns that contain numbers.

Feature value 9: the number of columns containing only numbers among the current column and the columns to its right. Specifically, counting from the rightmost column to the current column, count the columns that contain only numbers.

Feature value 10: the number of columns containing Chinese, English, or dates among the current column and the columns to its right. Specifically, counting from the rightmost column to the current column, count the columns that contain Chinese, English, or dates. A column is counted if it contains any one of the three.
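The per-column features 1, 3, and 4 can be sketched in Python as follows. The keyword lists come from the description of feature value 1; everything else is an assumption made for illustration: each matching keyword contributes once to the score, a "number" is an optionally signed integer or decimal, and the variance is the population variance (the description does not say which variance is used). All function names are hypothetical.

```python
import re
from statistics import pvariance
from typing import List, Sequence

# Keyword lists taken from the description of feature value 1.
POSITIVE_WORDS = ("数量", "金额", "汇总", "合计", "收入", "支出", "额", "费", "销售")
NEGATIVE_WORDS = ("月份", "年份", "号", "联系", "电话", "码", "单号", "序号",
                  "单价", "时间", "日期", "编号", "单位")

# Assumed definition of a number-only cell: optional sign, digits, optional decimal part.
NUMBER_ONLY = re.compile(r"-?\d+(\.\d+)?")


def title_keyword_score(title: str) -> int:
    """Feature 1: +1 per positive keyword in the title, -1 per negative keyword (assumed rule)."""
    plus = sum(1 for word in POSITIVE_WORDS if word in title)
    minus = sum(1 for word in NEGATIVE_WORDS if word in title)
    return plus - minus


def number_only_cells(cells: Sequence[str]) -> List[str]:
    """Return the cells that consist solely of a number; feature 3 is the length of this list."""
    return [c.strip() for c in cells if NUMBER_ONLY.fullmatch(c.strip())]


def integer_length_variance(cells: Sequence[str]) -> float:
    """Feature 4: variance of the digit count of the integer part of each number-only cell."""
    lengths = [len(c.lstrip("-").split(".")[0]) for c in number_only_cells(cells)]
    return pvariance(lengths) if lengths else 0.0


print(title_keyword_score("设计数量"))                 # 1
print(title_keyword_score("单位"))                     # -1
print(len(number_only_cells(["12", "34", "5.5", "x"])))  # 3
```

A column with no number-only cells yields a variance of 0.0 in this sketch, in line with the value 0 stated for the text-only first column in the embodiments.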

In an exemplary embodiment, the analysis module 20 inputting the at least one predetermined feature value obtained for each data column into the pre-generated random forest model means:

the analysis module 20 inputs the following values obtained for each data column into the random forest model in this order: the title keyword score, the total number of columns, the number of number-only cells, the variance of the integer-digit length of the cells, the number of columns containing numbers among the current column and the columns to its left, the number of columns containing only numbers among the current column and the columns to its left, the number of columns containing Chinese, English, or dates among the current column and the columns to its left, the number of columns containing numbers among the current column and the columns to its right, the number of columns containing only numbers among the current column and the columns to its right, and the number of columns containing Chinese, English, or dates among the current column and the columns to its right.

Those skilled in the art will understand that the random forest algorithm only needs the correct feature values and a model in order to produce a score. The internal process of the random forest is encapsulated by the algorithm and is outside the scope of this patent.

Those skilled in the art will understand that a trained random forest model is tied to its feature values, and their order cannot be changed arbitrarily. The exception is when the model is retrained with new user data, for example while optimizing the algorithm; as long as no feature values are added or removed, the retrained model keeps the same feature values in the same order. The order adopted in this application yields fairly accurate results.

This application uses a random forest model to analyze at least one feature value of each column in the list of table data columns, obtained in a predetermined order. It automatically finds the columns that hold the pivot table values and determines them for the user, which lowers the barrier to use and gives users a more convenient way to work.

As shown in FIG. 2, the list of table data columns in Embodiment 1 of this application comprises five columns, A, B, C, D, and E, titled "Grade" (年级), "Name" (姓名), "Care mode" (托管方式), "Days" (天数), and "Teacher days" (老师天数). The user wants to sum "Teacher days" by "Care mode".

FIGS. 3 and 4 show the prior-art procedure for summing "Teacher days" (老师天数) by "Care mode" (托管方式) with a pivot table. First select the current table area and click Insert, then PivotTable. Next, drag "Care mode" from the upper right to "Rows" at the lower right, and drag "Teacher days" to "Values" at the lower right. Data analysis can proceed once the pivot table has been produced. The prior art thus requires multiple steps.

With the method of this application for determining pivot table values, the system automatically obtains at least one feature value of the table, 10 feature values in this embodiment. As shown in Table 1, taking the first column as an example:

Feature value 1 of the first column: the title keyword score is 0;

Feature value 2 of the first column: the total number of columns is 5;

Feature value 3 of the first column: the number of number-only cells is 0;

Feature value 4 of the first column: the variance of the integer-digit length of the cells is 0;

Feature value 5 of the first column: the number of columns containing numbers among the current column and the columns to its left is 0;

Feature value 6 of the first column: the number of columns containing only numbers among the current column and the columns to its left is 0;

Feature value 7 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its left is 0;

Feature value 8 of the first column: the number of columns containing numbers among the current column and the columns to its right is 2;

Feature value 9 of the first column: the number of columns containing only numbers among the current column and the columns to its right is 2;

Feature value 10 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its right is 0.

These 10 feature values are input into the random forest model for analysis, giving a predicted row-field score of 0.0. In this embodiment scores range from 0 to 1, and a data column whose score is greater than 0.65 can be used to generate the pivot table values. The first column's predicted row-field score of 0.0 is less than 0.65, so it cannot be used to generate the pivot table values. By the same process, the data column whose predicted row-field score is greater than 0.65 (and between 0 and 1) is the one titled "Teacher days" (老师天数), so column E, "Teacher days", is finally recommended for generating the pivot table values.

Table 1

| Column title | Grade (年级) | Name (姓名) | Care mode (托管方式) | Days (天数) | Teacher days (老师天数) |
|---|---|---|---|---|---|
| Feature value 1 | 0 | 0 | 0 | 0 | 0 |
| Feature value 2 | 5 | 5 | 5 | 5 | 5 |
| Feature value 3 | 0 | 0 | 0 | 25 | 25 |
| Feature value 4 | 0 | 0 | 0 | 0.405080694 | 0.405080694 |
| Feature value 5 | 0 | 0 | 0 | 1 | 2 |
| Feature value 6 | 0 | 1 | 2 | 2 | 2 |
| Feature value 7 | 0 | 0 | 0 | 0 | 0 |
| Feature value 8 | 2 | 2 | 2 | 2 | 1 |
| Feature value 9 | 2 | 2 | 1 | 0 | 0 |
| Feature value 10 | 0 | 0 | 0 | 0 | 0 |
| Predicted row-field score | 0.0 | 0.0 | 0.0 | 0.3 | 1.0 |

As shown in FIG. 5, the list of table data columns in Embodiment 2 of this application comprises six columns, A, B, C, D, E, and F, titled "Site name" (站点名称), "Material name" (物资名称), "Unit" (单位), "Design quantity" (设计数量), "Construction usage" (施工用量), and "Supervisor-verified quantity" (监理核实数量).

The system automatically obtains at least one feature value of the table, 10 feature values in this embodiment. As shown in Table 2, taking the first column as an example:

Feature value 1 of the first column: the title keyword score is 0;

Feature value 2 of the first column: the total number of columns is 6;

Feature value 3 of the first column: the number of number-only cells is 0;

Feature value 4 of the first column: the variance of the integer-digit length of the cells is 0;

Feature value 5 of the first column: the number of columns containing numbers among the current column and the columns to its left is 0;

Feature value 6 of the first column: the number of columns containing only numbers among the current column and the columns to its left is 0;

Feature value 7 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its left is 0;

Feature value 8 of the first column: the number of columns containing numbers among the current column and the columns to its right is 3;

Feature value 9 of the first column: the number of columns containing only numbers among the current column and the columns to its right is 2;

Feature value 10 of the first column: the number of columns containing Chinese, English, or dates among the current column and the columns to its right is 0.

These 10 feature values are input into the random forest model for analysis, giving a predicted row-field score of 0.0. In this embodiment scores range from 0 to 1, and a data column whose score is greater than 0.65 can be used to generate the pivot table values. The first column's predicted row-field score of 0.0 is less than 0.65, so it cannot serve as a pivot table value. By the same process, the data columns whose predicted row-field scores are greater than 0.65 (and between 0 and 1) are those titled "Design quantity" (设计数量), "Construction usage" (施工用量), and "Supervisor-verified quantity" (监理核实数量), so columns D, E, and F are finally recommended for generating the pivot table values.

Table 2

Figure BDA0002266175170000191

As shown in Figure 6, the method of the present invention for generating pivot table values comprises the following steps:

1) Start;

2) Obtain the list of data columns in the table and the title of each column;

3) Traverse all data columns and obtain the following features of each column: the keyword extracted from the title, the number of columns in the whole table, the number of cells containing only numbers, and the variance of the character length of the integer digits in the cells;

4) Traverse all data columns from the left and obtain the following features of each column: the number of columns containing numbers among this column and the columns to its left, the number of columns containing only numbers among this column and the columns to its left, and the number of columns containing Chinese, English, or dates among this column and the columns to its left (each count includes this column);

5) Traverse all data columns from the right and obtain the following features of each column: the number of columns containing numbers among this column and the columns to its right, the number of columns containing only numbers among this column and the columns to its right, and the number of columns containing Chinese, English, or dates among this column and the columns to its right (each count includes this column);

6) Use the 10 features obtained above and the model to calculate whether the column is a "value field";

7) End.
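Steps 3) to 5) can be sketched as a single feature-extraction pass. This is a minimal illustration, not the patented implementation. The tests used here are assumptions: a column "contains numbers" if any cell has a digit, "contains only numbers" if every non-empty cell is a plain number, and "contains Chinese, English, or dates" if any cell has a CJK character, a Latin letter, or a `YYYY-MM-DD` pattern. The variance is taken over the length of the integer part of each numeric cell. The title-keyword feature is omitted because the text does not give the keyword list. All function and key names are hypothetical.

```python
import re
from itertools import accumulate
from statistics import pvariance
from typing import Dict, List, Tuple

NUMBER = re.compile(r"^-?\d+(\.\d+)?$")
TEXT_OR_DATE = re.compile(r"[A-Za-z\u4e00-\u9fff]|\d{4}-\d{1,2}-\d{1,2}")


def contains_number(cells: List[str]) -> bool:
    return any(re.search(r"\d", c) for c in cells)


def only_numbers(cells: List[str]) -> bool:
    filled = [c.strip() for c in cells if c.strip()]
    return bool(filled) and all(NUMBER.match(c) for c in filled)


def contains_text_or_date(cells: List[str]) -> bool:
    return any(TEXT_OR_DATE.search(c) for c in cells)


def inclusive_counts(flags: List[bool]) -> Tuple[List[int], List[int]]:
    """Running counts of True flags from the left and from the right, both including the column itself."""
    left = list(accumulate(int(f) for f in flags))
    right = list(accumulate(int(f) for f in reversed(flags)))[::-1]
    return left, right


def column_features(columns: List[List[str]]) -> List[Dict[str, float]]:
    tests = {"num": contains_number, "only_num": only_numbers,
             "text": contains_text_or_date}
    counts = {name: inclusive_counts([test(col) for col in columns])
              for name, test in tests.items()}
    features = []
    for i, col in enumerate(columns):
        numeric = [c.strip() for c in col if NUMBER.match(c.strip())]
        # Length of the integer part, ignoring sign and decimals (assumption).
        int_lengths = [len(c.split(".")[0].lstrip("-")) for c in numeric]
        row: Dict[str, float] = {
            "total_columns": len(columns),
            "numeric_cells": len(numeric),
            "int_length_variance": pvariance(int_lengths) if int_lengths else 0.0,
        }
        for name, (left, right) in counts.items():
            row[f"{name}_left"] = left[i]
            row[f"{name}_right"] = right[i]
        features.append(row)
    return features


# A small made-up table (cell texts only, headers excluded).
columns = [
    ["Road", "Bridge", "Tunnel"],
    ["2019-01-05", "2019-02-11", "2019-03-20"],
    ["10", "250", "3"],
    ["7.5", "12", "100"],
]
features = column_features(columns)
```

The prefix and suffix sums correspond to the left-to-right and right-to-left traversals, so each directional count is obtained in one pass over the columns rather than by rescanning the table for every column.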

The present application also protects a device for targeted delivery of content, comprising a processor and a memory, wherein the memory stores a program for targeted delivery of content, and the processor is configured to read the program for targeted delivery of content and execute the method steps of any one of the above embodiments.

The present application also protects a computer storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method steps of any one of the above embodiments.

Those of ordinary skill in the art will understand that all or some of the steps of the methods disclosed above, and the functional modules/units of the systems and devices, may be implemented as software, firmware, hardware, or appropriate combinations thereof. In a hardware implementation, the division between the functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have at least one function, or one function or step may be performed cooperatively by several physical components. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or a microprocessor, or as hardware, or as an integrated circuit, such as an application-specific integrated circuit. Such software may be distributed on computer-readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information (such as computer-readable instructions, data structures, program modules, or other data). Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical disc storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by a computer. In addition, as is well known to those of ordinary skill in the art, communication media typically contain computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and may include any information delivery media.

Claims (10)

Translated from Chinese
1. A method for generating pivot table values, characterized in that the method comprises:
after receiving an instruction to create a pivot table for the current table, obtaining the selected data columns in the table, traversing the selected data columns in at least one predetermined order, and, for each selected data column, obtaining at least one predetermined feature value of that data column;
inputting the obtained predetermined feature values of each data column into a pre-generated random forest model to obtain, for each data column, an analysis result corresponding to the predetermined feature values;
generating the values of the pivot table from the data columns whose analysis results satisfy a preset condition;
wherein the predetermined feature values comprise: the keyword extracted from the title; the total number of data columns; the number of cells containing only numbers; the variance of the character length of the integer digits in the cells; the number of columns containing numbers among the column itself and the columns to its left; the number of columns containing only numbers among the column itself and the columns to its left; the number of columns containing Chinese, English, or dates among the column itself and the columns to its left; the number of columns containing numbers among the column itself and the columns to its right; the number of columns containing only numbers among the column itself and the columns to its right; and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right;
and wherein inputting the obtained predetermined feature values of each data column into the pre-generated random forest model to obtain, for each data column, the analysis result corresponding to the predetermined feature values comprises:
inputting the obtained preset feature values of each data column into the pre-generated random forest model and calculating a predicted row-field score for each data column.

2. The method according to claim 1, characterized in that generating the values of the pivot table from the data columns whose analysis results satisfy the preset condition comprises:
summing the values of the cells in the data column of the current table according to each row heading of the pivot table, and using the resulting sums as the values of the corresponding cells in the pivot table.

3. The method according to claim 1, characterized in that generating the values of the pivot table from the data columns whose analysis results satisfy the preset condition comprises:
generating the values of the pivot table from the columns whose predicted row-field scores are within a preset range.

4. The method according to claim 1, characterized in that the pre-generated random forest model is built by collecting a plurality of pivot tables as training data samples, extracting at least one feature value, building pivot-table-value decision trees according to the decision tree generation steps, and building the model from the pivot-table-value decision trees.

5. The method according to claim 1, characterized in that the at least one predetermined order comprises a first left-to-right order, and when traversing in the first left-to-right order, the at least one predetermined feature value obtained for each selected data column comprises: the keyword extracted from the title, the total number of data columns, the number of cells containing only numbers, and the variance of the character length of the integer digits in the cells.

6. The method according to claim 5, characterized in that the at least one predetermined order comprises a second left-to-right order, and when traversing in the second left-to-right order, the at least one predetermined feature value obtained for each selected data column further comprises: the number of columns containing numbers among the column itself and the columns to its left, the number of columns containing only numbers among the column itself and the columns to its left, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its left.

7. The method according to claim 6, characterized in that the at least one predetermined order comprises a right-to-left order, and when traversing in the right-to-left order, the at least one predetermined feature value obtained for each selected data column further comprises: the number of columns containing numbers among the column itself and the columns to its right, the number of columns containing only numbers among the column itself and the columns to its right, and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.

8. The method according to claim 7, characterized in that inputting the obtained at least one predetermined feature value of each data column into the pre-generated random forest model comprises:
inputting into the random forest model, in sequence, the following values obtained for each data column: the keyword extracted from the title; the total number of data columns; the number of cells containing only numbers; the variance of the character length of the integer digits in the cells; the number of columns containing numbers among the column itself and the columns to its left; the number of columns containing only numbers among the column itself and the columns to its left; the number of columns containing Chinese, English, or dates among the column itself and the columns to its left; the number of columns containing numbers among the column itself and the columns to its right; the number of columns containing only numbers among the column itself and the columns to its right; and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right.

9. A device for targeted delivery of content, characterized by comprising:
an obtaining module, configured to obtain the selected data columns in the table after receiving an instruction to create a pivot table;
an analysis module, configured to traverse the selected data columns in at least one predetermined order; for each selected data column, obtain at least one predetermined feature value of that data column; input the obtained at least one predetermined feature value of each data column, in a predetermined order, into a pre-generated random forest model to obtain an analysis result for each data column; and determine the data columns whose analysis results satisfy a preset condition as the values of the pivot table;
wherein the predetermined feature values comprise: the keyword extracted from the title; the total number of data columns; the number of cells containing only numbers; the variance of the character length of the integer digits in the cells; the number of columns containing numbers among the column itself and the columns to its left; the number of columns containing only numbers among the column itself and the columns to its left; the number of columns containing Chinese, English, or dates among the column itself and the columns to its left; the number of columns containing numbers among the column itself and the columns to its right; the number of columns containing only numbers among the column itself and the columns to its right; and the number of columns containing Chinese, English, or dates among the column itself and the columns to its right;
and wherein inputting the obtained predetermined feature values of each data column into the pre-generated random forest model to obtain, for each data column, the analysis result corresponding to the predetermined feature values comprises:
inputting the obtained preset feature values of each data column into the pre-generated random forest model and calculating a predicted row-field score for each data column.

10. A device for targeted delivery of content, comprising a processor and a memory, characterized in that the memory stores a program for targeted delivery of content, and the processor is configured to read the program for targeted delivery of content and execute the method according to any one of claims 1-8.
CN201911088551.7A | 2019-11-08 | 2019-11-08 | Method and device for generating pivot table value | Active | CN112784556B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201911088551.7A / CN112784556B (en) | 2019-11-08 | 2019-11-08 | Method and device for generating pivot table value

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN201911088551.7A / CN112784556B (en) | 2019-11-08 | 2019-11-08 | Method and device for generating pivot table value

Publications (2)

Publication Number | Publication Date
CN112784556A (en) | 2021-05-11
CN112784556B (en) | 2023-06-30

Family

ID=75748989

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN201911088551.7A (Active) / CN112784556B (en) | Method and device for generating pivot table value | 2019-11-08 | 2019-11-08

Country Status (1)

Country | Link
CN (1) | CN112784556B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN114064636A (en) * | 2021-10-20 | 2022-02-18 | Zhuhai Kingsoft Office Software Co., Ltd. | Processing method, device, electronic device and readable storage medium for data pivot table
CN113901770B (en) * | 2021-10-29 | 2024-10-29 | Ping An Property & Casualty Insurance Company of China, Ltd. | Report generation method based on random forest model and related equipment

Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN103020031A (en) * | 2012-12-19 | 2013-04-03 | Zhuhai Kingsoft Office Software Co., Ltd. | Method and device for updating data pivot table intelligently
CN109101864A (en) * | 2018-04-18 | 2018-12-28 | Changchun University of Science and Technology | The upper half of human body action identification method returned based on key frame and random forest
CN109657721A (en) * | 2018-12-20 | 2019-04-19 | Changsha University of Science and Technology | Multi-class decision method combining fuzzy set and random forest tree
WO2019143412A1 (en) * | 2018-01-19 | 2019-07-25 | Umajin Inc. | Configurable server kit

Family Cites Families (1)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US20160253308A1 (en) * | 2015-02-27 | 2016-09-01 | Microsoft Technology Licensing, Llc | Analysis view for pivot table interfacing

Also Published As

Publication number | Publication date
CN112784556A (en) | 2021-05-11

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant
