JP2018156549A


Info

Publication number: JP2018156549A
Application number: JP2017054494A
Authority: JP
Inventors: 真平齋藤; Shinpei Saito
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2017-03-21
Filing date: 2017-03-21
Publication date: 2018-10-04
Anticipated expiration: 2037-03-21
Also published as: JP6862969B2

Abstract

PROBLEM TO BE SOLVED: To enable the estimation of a more detailed type of a set of numerical data.

SOLUTION: An information processing device 1A includes: extraction means 2 for calculating a statistical value from numerical set data for learning and extracting additional data added to the numerical set data; feature quantity vector generation means 3 for generating a feature quantity vector from the statistical value and the additional data; and feature quantity vector hierarchization storage means 4 for storing the feature quantity vector in a storage part 5 in a state hierarchized by the type of the numerical set data.

SELECTED DRAWING: Figure 10

Description


The present invention relates to an information processing method, an information processing apparatus, and an information processing program for estimating, from a set of numerical data, the data type of that set.

With the development of data collection, data processing, and data storage technologies, there are more and more opportunities to handle large amounts of widely varied information. Most of the data handled was collected by others for other purposes, or was collected in the past. When data is collected, its meaning and specifications are often not recorded in the table that holds the data or in the data itself. Instead, they are described in the legend of a document that summarizes the table, in the system specifications, and so on. In general, such specification information may be unobtainable, may not have been kept up to date, or in some cases may have been lost.

In general, when numerical set data is handled, the type of the data, its origin, its units, and so on are known. It is therefore easy to determine how the data should be processed.

However, between the time the data is collected and the time the person who processes it obtains it, part of the information needed to process the data correctly may be lost, so that it cannot be determined at a glance what the data represents. For example, if no documentation on the table structure of a database is available and the column names are abbreviated, it is impossible to tell at a glance what the data represents. In such cases, measures such as analyzing the acquisition route or the processing method are taken. However, because contact with the person who collected the data is restricted, or because specialized knowledge is required, it may take a long time to clarify the type of the data.

For example, table data in spreadsheet software often contains terms that only the creator and the users understand. Once time has passed and the people involved at the time of creation are no longer available, there may be no clue at all as to what the data represents.

With character data, some meaning, such as an address or a name, can be inferred from the meaning of the words or sentences. With numerical data, however, it is difficult to make such an inference even by inspecting the individual values. For text data, a known method of judging similarity generates a multidimensional vector for each word and judges the similarity between data items. For numerical data, however, the numerical values themselves cannot be used as dimensions, so such a method is difficult to apply.

Patent Document 1 describes a method of estimating the type of data based on statistical values such as the mean and the variance.

Patent Document 1: JP 2006-99236 A

However, the invention described in Patent Document 1 uses only statistical values (mean, variance, and so on), so a large amount of calculation is needed to obtain a type that is likely to be correct. In addition, while it is possible to obtain a type such as "length", it is not possible to obtain a lower-level (more detailed) type.

An object of the present invention is to make it possible to estimate a more detailed type of a set of numerical data.

An information processing method according to the present invention is characterized by calculating a statistical value from numerical set data for learning, extracting additional data added to the numerical set data, generating a feature vector from the statistical value and the extracted additional data, and storing the feature vector in a storage unit in a state hierarchized by the type of the numerical set data.

An information processing apparatus according to the present invention is characterized by comprising: extraction means for calculating a statistical value from numerical set data for learning and extracting additional data added to the numerical set data; feature vector generation means for generating a feature vector from the statistical value and the extracted additional data; and feature vector hierarchization storage means for storing the feature vector in a storage unit in a state hierarchized by the type of the numerical set data.

An information processing program according to the present invention is characterized by causing a computer to execute: a process of calculating a statistical value from numerical set data for learning and extracting additional data added to the numerical set data; a process of generating a feature vector from the statistical value and the extracted additional data; and a process of storing the feature vector in a storage unit in a state hierarchized by the type of the numerical set data.

According to the present invention, it is possible to estimate a more detailed type of a set of numerical data.

FIG. 1 is a block diagram showing a data type estimation device, which is an example of an information processing apparatus for estimating a data type.
FIG. 2 is an explanatory diagram showing an example of learning data.
FIG. 3 is an explanatory diagram showing an example of estimation target data.
FIG. 4 is a flowchart showing processing in the learning phase.
FIG. 5 is an explanatory diagram for describing a feature vector generated by the feature vector generation unit.
FIG. 6 is an explanatory diagram for describing the hierarchical structure stored in the feature vector hierarchization storage unit.
FIG. 7 is a flowchart showing processing in the estimation phase.
FIG. 8 is an explanatory diagram showing an example of extracted feature vectors and related information.
FIG. 9 is an explanatory diagram showing an extraction result that includes a parent type.
FIG. 10 is a block diagram showing the main part of an information processing apparatus according to the present invention.
FIG. 11 is a block diagram showing the main part of an information processing apparatus according to another aspect of the present invention.

Hereinafter, embodiments of the present invention will be described with reference to the drawings.

FIG. 1 is a block diagram showing a data type estimation device, which is an example of an information processing apparatus for estimating a data type (hereinafter also simply referred to as a "type"). FIG. 1 shows a data type estimation device 10, a learning numerical sequence data storage device 500, and an estimation target numerical sequence data storage device 600.

The data type estimation device 10 includes a feature vector extraction unit 100, a learning unit 200, an estimation unit 300, and a feature vector management unit 400.

The feature vector extraction unit 100 receives numerical set data that includes a type (specifically, data indicating the type) and additional information, and outputs a feature vector. The feature vector extraction unit 100 includes a statistical value calculation unit 101, an additional data extraction unit 102, and a feature vector generation unit 103.

The statistical value calculation unit 101 calculates statistical values, such as the mean and the variance, of the input numerical set data. The additional data extraction unit 102 extracts the units and the characteristics of accompanying information that go into the feature vector. The feature vector generation unit 103 combines the calculation result of the statistical value calculation unit 101 with the extraction result of the additional data extraction unit 102 to generate a feature vector for the input data.

The learning unit 200 receives numerical set data for learning (a set of numerical data), that is, learning data, and generates a feature vector. The learning unit 200 includes a learning data input unit 201, a data analysis unit 202, and a feature vector output unit 203.

The learning data input unit 201 receives learning data from the learning numerical sequence data storage device 500. The data analysis unit 202 uses the feature vector extraction unit 100 to obtain a feature vector for the learning data. The feature vector output unit 203 outputs the obtained feature vector and the type to the feature vector management unit 400.

The estimation unit 300 estimates the type of numerical set data. The estimation unit 300 includes an estimation target data input unit 301, a data analysis unit 302, a similar feature vector search unit 303, and a result display unit 304.

The estimation target data input unit 301 receives, from the estimation target numerical sequence data storage device 600, the numerical set data whose type is to be estimated. The data analysis unit 302 uses the feature vector extraction unit 100 to generate a feature vector for that numerical set data. The similar feature vector search unit 303 uses the feature vector management unit 400 to extract, from the feature vector storage unit 403, feature vectors that are close in distance to the generated feature vector. The result display unit 304 displays the estimation result of the similar feature vector search unit 303 on a display unit (not shown). A feature vector stored in the feature vector storage unit 403 via the learning unit 200 is referred to as a learned feature vector.

The feature vector management unit 400 manages feature vectors. The feature vector management unit 400 includes a feature vector hierarchization storage unit 401, a feature vector comparison unit 402, and a feature vector storage unit 403.

The feature vector hierarchization storage unit 401 collates the type received from the feature vector output unit 203 with the contents of the feature vector storage unit 403, identifies the layer in which the feature vector should be stored, and stores the feature vector in the feature vector storage unit 403. The feature vector comparison unit 402 calculates the distance between the feature vector received from the similar feature vector search unit 303 and each learned feature vector stored in the feature vector storage unit 403, and extracts the types of the feature vectors that are close in distance, assigning each a probability according to its distance. If the extracted types share a common type (parent type), that parent type is extracted as well.

At least the statistical value calculation unit 101, the additional data extraction unit 102, the feature vector generation unit 103, the data analysis unit 202, the feature vector output unit 203, the data analysis unit 302, the similar feature vector search unit 303, the feature vector hierarchization storage unit 401, and the feature vector comparison unit 402 can be realized by one or more CPUs (Central Processing Units) executing processing based on a program stored in a program storage unit. However, these and the other blocks (excluding the storage units) may instead be realized by hardware (individual circuits or an LSI (Large Scale Integration)).

Next, the operation of the data type estimation device 10 will be described.

The description here uses two examples: learning for the estimation of numerical set data using the learning data shown in FIG. 2, and estimating numerical set data using the estimation target data shown in FIG. 3.

The learning data illustrated in FIG. 2 is numerical set data on the systolic blood pressure of adult males in their twenties. The estimation target data shown in FIG. 3 is numerical set data related to blood pressure whose data type is unknown.

The description also takes the use of tabular data as an example. Hereinafter, numerical set data may be referred to as numerical sequence data. The input data may be a spreadsheet software file, a database table, an XML (eXtensible Markup Language) document, a CSV (Comma-Separated Values) document, an HTML (Hypertext Markup Language) document, or the like. However, any input format is acceptable as long as the data can be decomposed into a set of numerical values and additional data.
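As an illustration of decomposing tabular input into a set of numerical values and additional data, the following is a minimal sketch for a CSV document. The header convention (a unit written in parentheses after the column name), the `ID_COLUMNS` names, the function name `decompose`, and the sample rows are all assumptions made for this example; the source does not specify any of them.

```python
import csv
import io
import re

# Hypothetical header names that are treated as individual identifiers.
ID_COLUMNS = {"id", "name"}


def decompose(csv_text, column_index):
    """Split one CSV column into numeric values and simple additional data."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    values = [float(row[column_index]) for row in body]
    # Assumed convention: the unit appears in parentheses at the end of the header.
    match = re.search(r"\(([^)]+)\)\s*$", header[column_index])
    unit = match.group(1) if match else None
    has_id = any(h.strip().lower() in ID_COLUMNS for h in header)
    return values, {"unit": unit, "has_individual_id": has_id}


# Made-up sample rows, used only to exercise the function.
SAMPLE = "name,systolic (mmHg)\nA,118\nB,125\nC,121\n"
values, extra = decompose(SAMPLE, 1)
```

Other input formats (database tables, XML, HTML) would need their own parsers, but could feed the same pair of outputs.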

In the present embodiment, the data type estimation device 10 executes learning phase processing and estimation phase processing, in which the actual estimation is performed. In the learning phase, for numerical set data whose data type is known, the data type estimation device 10 calculates statistical values and extracts additional data to create a feature vector, and stores that feature vector hierarchically based on the given data type.

FIG. 4 is a flowchart showing processing in the learning phase. FIG. 4(A) shows the processing of the learning unit 200. FIG. 4(B) shows the processing of the feature vector extraction unit 100.

In the learning phase, the learning data input unit 201 of the learning unit 200 receives numerical sequence data for learning from the learning numerical sequence data storage device 500 as input data (step S11). The data analysis unit 202 uses the feature vector extraction unit 100 to obtain a feature vector for the learning data (step S12). The feature vector extraction unit 100 executes the processing shown in FIG. 4(B).

The statistical value calculation unit 101 calculates statistical values of the input numerical sequence data (step B11). The statistical values are, for example, the mean, variance, kurtosis, skewness, and distribution type (for example, a normal distribution, a Poisson distribution, or a long-tailed distribution).
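The statistical values named in step B11 can be sketched as follows. This is a minimal illustration using population moments; the function name `summary_statistics` is hypothetical, the choice of excess kurtosis is an assumption, and the distribution-type estimate mentioned in the text is not covered here.

```python
def summary_statistics(values):
    """Return mean, variance, skewness, and excess kurtosis of a list of numbers."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    # Population central moments (divide by n, not n - 1).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    # Skewness and kurtosis are undefined for constant data; 0.0 is used as a placeholder.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0  # excess kurtosis (assumption)
    return {"mean": mean, "variance": m2, "skewness": skewness, "kurtosis": kurtosis}
```

Whether sample or population moments are more suitable depends on the data volume; the source does not say which is used.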

The additional data extraction unit 102 extracts the unit of the numerical values in the numerical sequence data (for example, m, g, yen, or °C) and the characteristics of information written alongside the numerical sequence data (additional data) (step B12). In addition, if the numerical sequence data contains data that can identify an individual, such as a name or an ID, the additional data extraction unit 102 sets the individual identification information to "present". Other possible additional data includes, for example, latitude and longitude, and time information. In the example shown in FIG. 2, "mmHg" is extracted as the unit, and the individual identification information is set to "present".

The feature vector generation unit 103 combines the calculation result of the statistical value calculation unit 101 with the extraction result of the additional data extraction unit 102 to generate a feature vector for the input data, and supplies it to the caller (in this case, the data analysis unit 202) (step B13).
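One way to picture step B13 is to concatenate the statistical values with an encoding of the additional data. The sketch below assumes a one-hot encoding of the unit and a 0/1 flag for individual identification information; the `KNOWN_UNITS` list, the dictionary keys, and the function name `build_feature_vector` are hypothetical, and the source does not state how additional data is encoded.

```python
# Hypothetical list of units the encoder knows about.
KNOWN_UNITS = ["mmHg", "m", "g", "yen", "degC"]


def build_feature_vector(stats, extra):
    """Concatenate statistical values with encoded additional data."""
    numeric = [stats["mean"], stats["variance"], stats["skewness"], stats["kurtosis"]]
    # One-hot encoding of the unit (assumption); unknown units map to all zeros.
    unit_flags = [1.0 if extra.get("unit") == u else 0.0 for u in KNOWN_UNITS]
    # 1.0 when individual identification information is "present".
    id_flag = [1.0 if extra.get("has_individual_id") else 0.0]
    return numeric + unit_flags + id_flag


# Placeholder inputs, not taken from the figures.
example_stats = {"mean": 121.0, "variance": 30.0, "skewness": 0.1, "kurtosis": -0.2}
example_extra = {"unit": "mmHg", "has_individual_id": True}
example_vector = build_feature_vector(example_stats, example_extra)
```

Keeping the additional-data elements in fixed positions makes it straightforward to weight them separately from the statistical values when distances are computed.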

FIG. 5 is an explanatory diagram for describing the feature vector generated by the feature vector generation unit 103. The right side of the equation shown in FIG. 5 shows the values that the statistical value calculation unit 101 and the additional data extraction unit 102 obtained from the numerical sequence data illustrated in FIG. 3, collected into a multidimensional quantity (vector).

The feature vector output unit 203 outputs the obtained feature vector and the type to the feature vector management unit 400.

In the feature vector storage unit 403, feature vectors are stored hierarchically (clustered) with respect to data type. The hierarchization is performed by the feature vector hierarchization storage unit 401: when it stores a feature vector, it stores the given data type in hierarchized form. At that time, if there are two input data sets such as "blood pressure of males in their twenties" and "blood pressure of females in their twenties", the common type "blood pressure of adults in their twenties" may be stored as a parent type. Clustering methods include collating against a dictionary for a specific field and performing cluster analysis on the feature vectors. The feature vector hierarchization storage unit 401 may use any known hierarchization method, but it is preferable to select one that yields desirable estimation results.

The feature vector hierarchization storage unit 401 in the feature vector management unit 400 reads, from the feature vector storage unit 403, hierarchy information indicating the hierarchical structure of the data types (step S13).

The feature vector hierarchization storage unit 401 identifies the layer in which the feature vector should be stored, based on the type hierarchy of the numerical sequence data used as learning data and the hierarchical structure read from the feature vector storage unit 403 (step S14). The feature vector hierarchization storage unit 401 then stores the feature vector in the identified layer in the feature vector storage unit 403 (step S15).
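Steps S13 to S15 can be pictured as walking a tree of types from the broadest concept down to the narrowest and attaching the feature vector at the node that is reached. The following is a minimal in-memory sketch; the class `TypeNode`, the function `store`, the type path names, and the two-element vectors are all hypothetical placeholders, and the source does not prescribe a storage layout.

```python
class TypeNode:
    """One layer in the type hierarchy, holding the vectors stored at that layer."""

    def __init__(self, name):
        self.name = name
        self.children = {}
        self.vectors = []


def store(root, type_path, vector):
    """Find (or create) the layer for type_path and store the vector there."""
    node = root
    for name in type_path:
        # Reuse an existing layer when the type is already known; otherwise add one.
        node = node.children.setdefault(name, TypeNode(name))
    node.vectors.append(vector)
    return node


root = TypeNode("")
# Placeholder vectors; real ones would come from the feature vector generation step.
store(root, ["pressure", "blood pressure", "systolic", "male", "20s"], [121.0, 30.0])
store(root, ["pressure", "blood pressure", "systolic", "male", "30s"], [124.0, 35.0])
```

Because sibling types share their ancestor nodes, the common parent of two stored types can later be recovered from the tree itself.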

The left side of the equation shown in FIG. 5 shows the types assigned to the numerical sequence data illustrated in FIG. 3 in hierarchized form. As shown in FIG. 5, the types are arranged in order from a broader concept ("pressure" in this example) to a narrower concept ("20s" in this example).

FIG. 6 is an explanatory diagram for describing the hierarchical structure stored in the feature vector hierarchization storage unit 401. In the example shown in FIG. 6, the hierarchy is represented as a tree running from the broadest concept to the narrowest concept. In the example shown in FIGS. 5 and 6, the layer below "male" is identified in the processing of step S14.

FIG. 7 is a flowchart showing processing in the estimation phase. FIG. 7(A) shows the processing of the estimation unit 300. FIG. 7(B) shows the processing of the feature vector extraction unit 100.

In the estimation phase, the estimation target data input unit 301 of the estimation unit 300 receives the numerical sequence data to be estimated from the estimation target numerical sequence data storage device 600 as input data (step S21). The data analysis unit 302 uses the feature vector extraction unit 100 to obtain a feature vector for that numerical sequence data (step S22). The feature vector extraction unit 100 executes the processing shown in FIG. 7(B), which is the same as the processing shown in FIG. 4(B).

Next, the similar feature vector search unit 303 uses the feature vector management unit 400 to extract, from the feature vector storage unit 403, learned feature vectors that are close in distance to the generated feature vector.

Specifically, the feature vector comparison unit 402 reads a list of feature vectors from the feature vector storage unit 403 (step S23). The feature vector comparison unit 402 receives the feature vector obtained in step S22 via the similar feature vector search unit 303, and calculates the distance between that feature vector and each learned feature vector in the list (step S24). At this time, rather than treating every element of the feature vector equally, the feature vector comparison unit 402 preferably gives elements such as the unit and the presence or absence of individual identification information more weight than the statistical values. For example, if the type is blood pressure, the unit will never be a length (m) or a weight (kg), so it is preferable to make a difference in unit (any unit other than mmHg) have a large effect on the distance.
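The weighting described for step S24 can be sketched as a weighted Euclidean distance in which the additional-data elements carry larger weights than the statistical values. The vector layout, the weight values, and the function name `weighted_distance` below are assumptions chosen for illustration; the source does not give a specific distance formula or weights.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two feature vectors of equal length."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("vectors and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Assumed layout: [scaled mean, scaled variance, unit is mmHg, has individual ID].
WEIGHTS = [1.0, 1.0, 25.0, 4.0]  # unit and ID flags weigh more than statistics

query = [0.5, 0.2, 1.0, 1.0]
same_unit = [0.6, 0.2, 1.0, 1.0]    # slightly different mean, same unit
other_unit = [0.5, 0.2, 0.0, 1.0]   # identical statistics, different unit

d_same = weighted_distance(query, same_unit, WEIGHTS)
d_other = weighted_distance(query, other_unit, WEIGHTS)
```

With these weights, a mismatch in unit pushes a candidate much further away than a small difference in the statistics does, which is the behaviour the text asks for.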

When handling target data that involves, for example, foreign exchange (yen and dollars) or electric power (W and VA), it is not always possible to reduce each type to a single unit. In such cases, the unit may be excluded from the comparison, or each unit may be handled by weighting it. Similarly, the presence or absence of individual identification information, position information, and time information is also adjusted, for example by weighting it relative to the calculated statistical values, so that preferable estimation results are obtained.

The feature vector comparison unit 402 extracts the n feature vectors with the smallest distances (step S25), where n is a predetermined positive integer. The feature vector comparison unit 402 then outputs the n feature vectors to the similar feature vector search unit 303, together with a probability corresponding to each distance. If the n feature vectors share a common type (parent type), the feature vector comparison unit 402 extracts that parent type as well.
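Step S25 can be sketched as sorting candidates by distance, keeping the first n, and taking the shared leading part of their type paths as the parent type. The mapping from distance to probability is not given in the source; `1 / (1 + d)` below is purely an assumed placeholder, and the function names, type paths, and distances are hypothetical.

```python
def nearest_types(candidates, n):
    """Return the n closest (type_path, score) pairs from (type_path, distance) pairs."""
    ranked = sorted(candidates, key=lambda item: item[1])[:n]
    # Assumed conversion: each score depends only on that candidate's own distance.
    return [(path, 1.0 / (1.0 + distance)) for path, distance in ranked]


def common_parent(paths):
    """Return the longest shared leading part of the given type paths."""
    prefix = []
    for parts in zip(*paths):
        if all(p == parts[0] for p in parts):
            prefix.append(parts[0])
        else:
            break
    return tuple(prefix)


# Made-up candidates: (type path, distance to the query vector).
CANDIDATES = [
    (("pressure", "atmospheric"), 3.0),
    (("pressure", "blood pressure", "systolic", "male", "30s"), 0.12),
    (("pressure", "blood pressure", "systolic", "male", "20s"), 0.10),
]

top = nearest_types(CANDIDATES, 2)
parent = common_parent([path for path, _ in top])
```

Because each score is computed from a single distance and is not normalised across candidates, the scores of the extracted candidates need not sum to 1.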

Various methods of measuring the distance between vectors are known. For example, methods that treat the angle between vectors (cosine similarity) as the distance, of which the MinHash method is given as a representative, can be processed at high speed and are therefore suitable for handling large amounts of numerical set data.
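For reference, an angle-based distance can be computed directly as one minus the cosine similarity. The sketch below performs the exact calculation and does not implement MinHash or any other hashing-based approximation; the function name `cosine_distance` is hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)
```

Note that this measure ignores vector magnitude, so two vectors pointing in the same direction have distance 0 even if their means differ by a constant factor; scaling of the elements therefore matters when it is applied to feature vectors.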

In the present embodiment, the feature vector comparison unit 402 calculates the distances and probabilities, but these functions may instead be included in the similar feature vector search unit 303.

The result display unit 304 presents the n extracted feature vectors to the user by displaying them on the display unit (step S26).

FIG. 8 is an explanatory diagram showing an example (display example) of the extracted feature vectors and related information. As shown in FIG. 8, the types are hierarchized. Because each probability is calculated from an individual distance, the probabilities can sum to more than 1.

In the example shown in FIG. 8, "pressure / blood pressure / systolic / male" is the common type (parent type). In such a case, the feature vector comparison unit 402 includes "pressure / blood pressure / systolic / male" in the extraction result as the parent type in the processing of step S25. When the extraction result is displayed, the parent type is preferably displayed at the top (see FIG. 9).

FIG. 9 is an explanatory diagram showing an extraction result that includes a parent type. In the example shown in FIG. 9, the difference between the distance for "20s" and the distance for "30s" is slight. However, since it can be estimated with near certainty that the data is male blood pressure, the feature vector comparison unit 402 aggregates the results using the type hierarchy in order to improve estimation accuracy. If the types were not hierarchized, the relationship between feature vectors that are close in distance could not be identified, and an encompassing type could not be presented.

As described above, in the present embodiment, even if the user has no knowledge of the target numerical set data, the data type estimation device 10 extracts the statistical values of the numerical set data and additional information such as units as a feature vector, and presents the result of comparing that feature vector with the feature vectors of learned data. The user can therefore easily estimate the data type.

Because numerical set data consists of numerical values, it never matches the learning data exactly. However, by calculating the distances between feature vectors and collating them with the hierarchical structure of the learned feature data, the data type estimation device 10 can extract the types corresponding to learned feature vectors that are close in distance and can estimate the common type (parent type) of the extracted types with high accuracy.

In this embodiment, blood pressure data is used as the numerical set data, but the present invention can also be applied to other kinds of numerical set data. For example, when referring to past sales data, even if the specifications for the table structure and types used at that time are difficult to obtain, learning the numerical data contained in current sales data in advance makes it possible to estimate automatically which column of the numerical set data is sales, which column is the discount amount, and so on.

FIG. 10 is a block diagram showing the main part of an information processing apparatus according to the present invention. The information processing apparatus 1A shown in FIG. 10 includes: extraction means 2 (realized in the embodiment by the statistical value calculation unit 101 and the additional data extraction unit 102) that calculates statistical values from numerical set data for learning and extracts additional data attached to that numerical set data; feature vector generation means 3 (realized in the embodiment by the feature vector generation unit) that generates a feature vector from the statistical values and the extracted additional data; and feature vector hierarchization storage means 4 (realized in the embodiment by the feature vector hierarchization storage unit 401) that stores the feature vector in a storage unit 5 (realized in the embodiment by the feature vector storage unit 403) in a state hierarchized by the type of the numerical set data.
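The learning-side flow of FIG. 10 can be sketched as follows. The choice of statistics (mean, population standard deviation, minimum, maximum), the numeric encoding of the unit, and the nested-dictionary storage layout are all assumptions made for illustration; the apparatus may use different statistics and a different storage structure.

```python
import statistics
from typing import Dict, List, Sequence

# Hypothetical numeric codes for the unit attached to the data (additional data).
UNIT_CODES: Dict[str, float] = {"mmHg": 1.0, "yen": 2.0}


def make_feature_vector(values: Sequence[float], unit: str) -> List[float]:
    """Build a feature vector from assumed statistics plus an encoded unit."""
    return [
        statistics.mean(values),
        statistics.pstdev(values),
        min(values),
        max(values),
        UNIT_CODES.get(unit, 0.0),  # 0.0 stands for an unknown unit
    ]


def store_hierarchically(store: dict, type_path: Sequence[str], vector: List[float]) -> None:
    """Save the vector under nested nodes, one level per label in type_path."""
    node = store
    for label in type_path:
        node = node.setdefault("children", {}).setdefault(label, {})
    node["vector"] = vector


# Made-up learning data: three blood pressure readings in mmHg.
store: dict = {}
vector = make_feature_vector([110, 120, 130], "mmHg")
store_hierarchically(store, ("blood pressure", "male", "20s"), vector)
```

Storing each vector at the node for its type keeps the parent-child relationships available at estimation time, so that nearby leaf types can later be traced back to a shared parent.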

FIG. 11 is a block diagram showing the main part of an information processing apparatus according to another aspect of the present invention. In the information processing apparatus 1B shown in FIG. 11, the extraction means calculates statistical values from target numerical set data whose type is to be estimated and extracts additional data attached to that numerical set data, and the feature vector generation means generates a feature vector from those statistical values and the extracted additional data. The apparatus further includes estimation means 6 (realized in the embodiment by the feature vector comparison unit 402) that calculates the distance between the generated feature vector and a feature vector stored in the storage unit 5 and estimates the type of the target numerical set data based on the calculated distance.

Description of Symbols
1A, 1B Information processing apparatus
2 Extraction means
3 Feature vector generation means
4 Feature vector hierarchization storage means
5 Storage unit
6 Estimation means
10 Data type estimation device
100 Feature vector extraction unit
101 Statistical value calculation unit
102 Additional data extraction unit
103 Feature vector generation unit
200 Learning unit
201 Learning data input unit
202 Data analysis unit
203 Feature vector output unit
300 Estimation unit
301 Estimation target data input unit
302 Data analysis unit
303 Similar feature vector search unit
304 Result display unit
400 Feature vector management unit
401 Feature vector hierarchization storage unit
402 Feature vector comparison unit
403 Feature vector storage unit
500 Learning numerical sequence data storage device
600 Estimation target numerical sequence data storage device

Claims


An information processing method comprising:
calculating statistical values from numerical set data for learning and extracting additional data attached to the numerical set data;
generating a feature vector from the statistical values and the extracted additional data; and
storing the feature vector in a storage unit in a state hierarchized by the type of the numerical set data.

The information processing method according to claim 1, further comprising:
calculating statistical values from target numerical set data whose type is to be estimated and extracting additional data attached to that numerical set data;
generating a feature vector from those statistical values and the extracted additional data; and
calculating the distance between the generated feature vector and a feature vector stored in the storage unit, and estimating the type of the target numerical set data based on the calculated distance.

The information processing method according to claim 2, comprising:
calculating the distance between the generated feature vector and each of a plurality of feature vectors stored in the storage unit; and
outputting a predetermined number of the feature vectors stored in the storage unit that have the smallest distances.

The information processing method according to any one of claims 1 to 3, wherein the numerical set data is tabular data.

An information processing apparatus comprising:
extraction means for calculating statistical values from numerical set data for learning and extracting additional data attached to the numerical set data;
feature vector generation means for generating a feature vector from the statistical values and the extracted additional data; and
feature vector hierarchization storage means for storing the feature vector in a storage unit in a state hierarchized by the type of the numerical set data.

The information processing apparatus according to claim 5, wherein
the extraction means calculates statistical values from target numerical set data whose type is to be estimated and extracts additional data attached to that numerical set data,
the feature vector generation means generates a feature vector from those statistical values and the extracted additional data, and
the apparatus further comprises estimation means for calculating the distance between the generated feature vector and a feature vector stored in the storage unit and estimating the type of the target numerical set data based on the calculated distance.

The information processing apparatus according to claim 6, wherein the estimation means calculates the distance between the generated feature vector and each of a plurality of feature vectors stored in the storage unit, and outputs a predetermined number of the feature vectors stored in the storage unit that have the smallest distances.

An information processing program for causing a computer to execute:
a process of calculating statistical values from numerical set data for learning and extracting additional data attached to the numerical set data;
a process of generating a feature vector from the statistical values and the extracted additional data; and
a process of storing the feature vector in a storage unit in a state hierarchized by the type of the numerical set data.

The information processing program according to claim 8, causing the computer to execute:
a process of calculating statistical values from target numerical set data whose type is to be estimated and extracting additional data attached to that numerical set data;
a process of generating a feature vector from those statistical values and the extracted additional data; and
a process of calculating the distance between the generated feature vector and a feature vector stored in the storage unit and estimating the type of the target numerical set data based on the calculated distance.

The information processing program according to claim 9, causing the computer to execute:
a process of calculating the distance between the generated feature vector and each of a plurality of feature vectors stored in the storage unit; and
a process of outputting a predetermined number of the feature vectors stored in the storage unit that have the smallest distances.