JP5040718B2


Info

Publication number: JP5040718B2
Application number: JP2008040334A
Authority: JP
Inventors: 憲和松村; 健司山西
Original assignee: NEC Corp
Current assignee: NEC Corp
Priority date: 2008-02-21
Filing date: 2008-02-21
Publication date: 2012-10-03
Anticipated expiration: 2028-02-21
Also published as: JP2009199341A

Description

The present invention relates to a spam/event detection apparatus, method, and program, and in particular to techniques for detecting, judging, and removing spam descriptions and event descriptions in text data on the Internet, such as blogs, SNSs (Social Networking Services), and Web news.

With the spread of the Internet, there has been a rapid increase in automatically generated articles on blogs, SNSs, Web news, and the like (hereinafter, spam descriptions) that aim to increase the number of visitors arriving via search engines, display affiliate links, or steer readers to particular sites. Various apparatuses and methods for detecting and removing such spam descriptions are known (see, for example, Patent Documents 1, 2, and 3).

Patent Document 1 describes a communication control method for a Weblog system in which users communicate with one another through comments and trackbacks attached to a Weblog site operated by a Weblog service server. While keeping the Weblog site open, the method gives priority to presenting comments and trackbacks from users who communicate frequently, which makes it possible to exclude one-sided items such as spam.

Patent Document 2 describes a system that judges whether a post is a spam description on the basis of fixed pattern data such as URLs (Uniform Resource Locators) and keywords. When a post is judged to be a spam description, the blog server excludes it from posting.

Patent Document 3 describes a trend analysis method that analyzes the blog article information of blog articles published on blog sites on the Internet, extracts trend keywords, and presents to the user related keywords obtained from the blog article information in which the trend keywords appear.

Patent Document 1: JP 2006-331297 A
Patent Document 2: JP 2007-115173 A
Patent Document 3: JP 2007-233438 A

The following analysis is given by the present invention.

However, even if the method of Patent Document 1 can detect spam in comments and trackbacks, it has no means of judging whether the blog article itself is spam. Moreover, comments and trackbacks accumulate only after a blog article has been published on the Internet, so they are difficult to collect when newly posted articles are acquired (a snapshot crawl) and in most cases cannot be used. In Patent Document 2, new kinds of spam appear every day, so extraction based on fixed patterns has difficulty keeping up with and detecting new spam. In Patent Document 3, extraction at the keyword level does not make it easy to interpret an event, and distinguishing events from spam is difficult.

An object of the present invention is to provide a spam/event detection apparatus, method, and program that accurately detect spam descriptions in text data from blogs, SNSs, and the like.

A spam/event detection apparatus according to one aspect of the present invention comprises data extraction means for classifying accumulated text data to be examined into a plurality of data sets by a predetermined method. For each of the plurality of data sets, the apparatus further comprises: time-series burst scoring means for assigning a score to burst information in the data classified by the data extraction means; data cleansing scoring means for assigning a score to the data classified by the data extraction means on the basis of spam description rules; and spam/event aggregate determination means for judging, from both the score assigned by the time-series burst scoring means and the score assigned by the data cleansing scoring means, whether the data is a spam description or an event description. The spam description rules are rewritten on the basis of information on the spam that the spam/event aggregate determination means has judged to be a spam description.

The spam/event detection apparatus of the present invention may comprise data extraction means for classifying data from which spam descriptions have been removed into a plurality of data sets by a predetermined method, and may comprise a plurality of time-series burst detection means, burst feature extraction means, and spam/event feature determination means, one for each of the plurality of data sets.

The spam/event detection apparatus of the present invention may comprise data extraction means for classifying accumulated text data to be examined into a plurality of data sets by a predetermined method, wherein the data cleansing means removes spam descriptions from each classified data set using the spam description rules corresponding to that data set, and wherein a plurality of data cleansing means, time-series burst detection means, burst feature extraction means, and spam/event feature determination means are provided, one for each of the plurality of data sets.

In the spam/event detection apparatus of the present invention, the spam description rules may be rewritten on the basis of information on the spam that the spam/event feature determination means has judged to be a spam description.

The spam/event detection apparatus of the present invention may comprise, in place of the data cleansing means and for each of the plurality of data sets: time-series burst scoring means for assigning a score to burst information in the data classified by the data extraction means; data cleansing scoring means for assigning a score to the data classified by the data extraction means on the basis of the spam description rules; and spam/event aggregate determination means for judging, from both the score assigned by the time-series burst scoring means and the score assigned by the data cleansing scoring means, whether the data is a spam description or an event description. The spam description rules may then be rewritten on the basis of information on the spam that the spam/event feature determination means and the spam/event aggregate determination means have judged to be a spam description.

In the spam/event detection apparatus of the present invention, the spam/event feature determination means may be omitted, and the spam description rules may be rewritten solely on the basis of information on the spam that the spam/event aggregate determination means has judged to be a spam description.

A spam/event detection method according to another aspect of the present invention is a method by which a spam/event detection apparatus detects spam. The method includes a step of classifying accumulated text data to be examined into a plurality of data sets by a predetermined method. For each of the plurality of data sets, the method further includes: a step of assigning a score to burst information in the classified data; a step of assigning a score to the classified data on the basis of spam description rules; and a step of judging, from both of the scores assigned in the two scoring steps, whether the data is a spam description or an event description. The spam description rules are rewritten on the basis of information on the spam that has been judged to be a spam description in the judging step.

A program according to still another aspect of the present invention causes a computer constituting a spam/event detection apparatus to execute a data extraction process of classifying accumulated text data to be examined into a plurality of data sets by a predetermined method. For each of the plurality of data sets, the program further causes the computer to execute: a time-series burst scoring process of assigning a score to burst information in the data classified by the data extraction process; a data cleansing scoring process of assigning a score to the data classified by the data extraction process on the basis of spam description rules; and a spam/event aggregate determination process of judging, from both the score assigned by the time-series burst scoring process and the score assigned by the data cleansing scoring process, whether the data is a spam description or an event description. The spam description rules are rewritten on the basis of information on the spam that the spam/event aggregate determination process has judged to be a spam description.

According to the present invention, whether text data on the Internet is a spam description or an event description is judged on the basis of the features of bursts in that data, so spam descriptions can be detected accurately.

A spam/event detection apparatus according to an embodiment of the present invention comprises data cleansing means for removing spam descriptions from accumulated text data to be examined using spam description rules, and data extraction means for classifying the data from which spam descriptions have been removed into a plurality of data sets by a predetermined method. For each of the classified data sets, the apparatus also comprises time-series burst detection means for detecting time-series bursts in the cleansed data, burst feature extraction means for extracting the features of each detected burst period, and spam/event feature determination means for judging from the extracted features whether the burst is a spam description or an event description. The spam description rules are rewritten on the basis of information on the spam that the spam/event feature determination means has judged to be a spam description. Event information that the spam/event feature determination means has judged to be an event description is accumulated and output externally.

The spam/event detection apparatus in the embodiments described above and below may be realized by having a computer that constitutes the apparatus execute a program so that each unit and each means functions. In that case, the apparatus may of course be divided into two or more devices, with functions or load distributed among them.

The spam/event detection apparatus according to the embodiment of the present invention detects burst information in text data on the Internet, extracts the features of each burst, and appends those features to the data cleansing rules. Features of previously unknown spam descriptions can thus be added to the cleansing rules immediately and the spam removed. Spam descriptions, including unknown ones, can therefore be eliminated from the text data.

The spam/event detection apparatus according to the embodiment of the present invention also detects burst information in text data on the Internet, extracts a summary sentence or a set of characteristic words as the features of each burst, and stores those features as event information. The system can thus present features only for the burst portions that the user wants to analyze in detail. Because feature analysis is needed only for the burst information, the load that feature analysis places on the system is lighter than when the entire period is analyzed. Summary sentences or words that allow event descriptions in the text data to be detected and interpreted can therefore be presented.

Furthermore, in judging whether burst information is an event description or a spam description, the spam/event detection apparatus according to the embodiment of the present invention uses information sources that report facts, such as Web news, and makes the judgment from both a content-based and a traffic-based perspective. It can therefore determine whether a burst is an event description or a spam description.

Embodiments are described below in more detail with reference to the drawings.

[First Embodiment]
FIG. 1 is a diagram showing the configuration of a spam/event detection apparatus according to the first embodiment of the present invention. Referring to FIG. 1, the spam/event detection apparatus comprises an input device 1, a data storage unit 21, data cleansing means 22, a data cleansing rule storage unit 23, data extraction means 24, extracted data storage units 251, 252, ..., 25n, time-series burst detection means 311, 312, ..., 31n, burst feature extraction means 321, 322, ..., 32n, spam/event feature determination means 331, 332, ..., 33n, event information storage units 341, 342, ..., 34n, and an output device 4.

The input device 1 inputs the data to be analyzed, using, for example, a keyboard for manual data entry, application software that downloads text data such as blogs and SNS posts or articles published on the Web, or application software that forwards system logs accumulated on a server as they are.

The data storage unit 21 stores the data input from the input device 1 as it is.

The data cleansing means 22 removes spam descriptions according to the deletion rules (patterns) accumulated in the cleansing rule storage unit 23. The rules here include, for example, words and sentences that appear frequently in spam descriptions, URLs (Uniform Resource Locators) from which spam descriptions originate, and patterns commonly seen in spam descriptions, such as text written as a bare list of words. Suppose the input text is a blog article. In that case, the data cleansing means 22 identifies and removes spam descriptions on the basis of characteristic words and URLs that appear in spam within blog articles, and patterns such as part-of-speech sequences that do not form valid sentences.
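The rule types described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `CleansingRules` container, the `min_word_hits` threshold, and the regex-based tokenization are all hypothetical, and real Japanese text would need a morphological analyzer rather than `\w+` splitting.

```python
import re
from dataclasses import dataclass, field


@dataclass
class CleansingRules:
    """Hypothetical container for spam description rules (format is an assumption)."""
    spam_words: set = field(default_factory=set)           # words frequent in spam
    spam_url_patterns: list = field(default_factory=list)  # regexes for spam-emitting URLs
    min_word_hits: int = 2                                  # assumed threshold


def is_spam(text: str, rules: CleansingRules) -> bool:
    # Rule 1: the text contains a URL known to emit spam.
    if any(re.search(p, text) for p in rules.spam_url_patterns):
        return True
    # Rule 2: the text contains enough distinct known spam words.
    tokens = set(re.findall(r"\w+", text.lower()))
    return len(tokens & rules.spam_words) >= rules.min_word_hits


def cleanse(articles, rules):
    """Return only the articles that match no spam rule."""
    return [a for a in articles if not is_spam(a, rules)]
```

Keeping the rules in a plain data object, separate from the matching logic, makes it straightforward for a later stage to append newly learned spam features without touching `cleanse` itself.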

The data extraction means 24 takes the data remaining after the data cleansing means 22 has deleted spam descriptions and extracts it separately for each category to be analyzed, or for each group of descriptions in similar categories. A category here can be anything: products, companies, actions, emotions, general terms, and so on. More specifically, when the categories into which the user wants to classify are clear in advance, the data extraction means 24 classifies by rules of the form "if a certain word is present, classify into this category." When the categories are not clear in advance, rules that assign text to categories from word co-occurrence may be created by, for example, a statistical method, and classification may then be performed on the basis of those rules.
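For the case where the categories are known in advance, the keyword rule described above can be sketched as below. The function name and the `category_keywords` mapping are assumptions made for illustration; the statistical co-occurrence variant is not shown.

```python
def extract_by_category(articles, category_keywords):
    """Assign each article to every category whose keywords it contains.

    category_keywords: dict mapping a category name to a list of trigger words
    (a hypothetical rule format). An article may land in several categories.
    """
    buckets = {name: [] for name in category_keywords}
    for article in articles:
        lowered = article.lower()
        for name, keywords in category_keywords.items():
            if any(k.lower() in lowered for k in keywords):
                buckets[name].append(article)
    return buckets
```

Each returned bucket plays the role of one extracted data store, so downstream burst detection can run independently per category.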

The extracted data storage units 251, 252, ..., 25n each store the data for one of the categories extracted by the data extraction means 24. Here it is assumed that the data has been extracted into n categories.

The time-series burst detection means 311, 312, ..., 31n detect bursts in the data extracted for each category. As a specific operation, they detect, for example, a sudden increase in the data in real time. Depending on the category, some topics always burst on particular days of the week, such as weekend holidays, or in particular months, such as long vacations. Examples are programs broadcast on TV and events held regularly, such as on a fixed day every week. Detecting a burst every time such a topic peaks offers the user no benefit. In such cases, burst detection takes weekly or monthly periodicity into account: the means detect bursts that exceed the usual level at times when bursts regularly occur, as well as bursts that appear at times when they do not regularly occur.
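One way to account for weekly periodicity, as described above, is to compare each day's article count with earlier counts from the same phase of the cycle. The sketch below is an assumption-laden illustration: the median baseline, the `ratio` threshold, and `min_history` are choices made here, not parameters given in the patent.

```python
from statistics import median


def detect_bursts(daily_counts, period=7, ratio=2.0, min_history=2):
    """Return indices of days whose count far exceeds the same-phase baseline.

    daily_counts: list of article counts per day for one category.
    period: assumed cycle length in days (7 for weekly periodicity).
    """
    bursts = []
    for day, count in enumerate(daily_counts):
        # Earlier days at the same position in the cycle (e.g. the same weekday).
        history = daily_counts[day % period:day:period]
        if len(history) < min_history:
            continue  # not enough past cycles to form a baseline
        baseline = median(history)
        if count > ratio * max(baseline, 1):
            bursts.append(day)
    return bursts
```

With this baseline, a regular weekly peak is not flagged, while both an unusually large peak on the regular day and a spike on an otherwise quiet day are.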

The burst feature extraction means 321, 322, ..., 32n extract the features of the burst periods detected by the time-series burst detection means 311, 312, ..., 31n, respectively. They are realized by, for example, extracting words specific to the articles in the burst period or words that occur with high frequency. Extracting words specific to articles in the burst period filters out words that are routinely used in the category and therefore unremarkable, so the words that caused the burst are extracted effectively.
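The idea of keeping burst-specific words while discarding words that are routine for the category can be sketched by comparing document frequencies inside and outside the burst period. The scoring formula (a smoothed ratio of document-frequency rates) and the names below are assumptions for illustration; the patent does not specify a formula, and Japanese text would need proper word segmentation.

```python
import re
from collections import Counter


def doc_freq(docs):
    """Count, for each word, the number of documents containing it."""
    counts = Counter()
    for doc in docs:
        counts.update(set(re.findall(r"\w+", doc.lower())))
    return counts


def burst_feature_words(burst_docs, background_docs, top_k=3):
    """Rank words by how much more often they occur in burst-period articles."""
    burst_df = doc_freq(burst_docs)
    back_df = doc_freq(background_docs)
    n_burst = max(len(burst_docs), 1)
    n_back = max(len(background_docs), 1)
    scores = {}
    for word, df in burst_df.items():
        burst_rate = df / n_burst
        back_rate = (back_df[word] + 1) / (n_back + 1)  # add-one smoothing
        scores[word] = burst_rate / back_rate
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

Words such as the category name itself appear at a similar rate in both sets and score near 1, so they fall below words that only surface during the burst.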

The spam/event feature determination means 331, 332, ..., 33n judge whether a burst is a spam description or an event description on the basis of the feature words extracted by the burst feature extraction means 321, 322, ..., 32n, respectively. Specifically, the judgment can be made, for example, by checking whether the feature words appear at roughly the same time in an external source such as Web news. The user can of course also make the judgment by reviewing the extracted feature words or the original text in which they are used. In that case, a system may be built that, each time a burst is detected, notifies the user by e-mail or similar means that a burst was detected and what its features were during that period. When a burst is judged to be an event description, detailed information can be provided by presenting the event content together with a time-series graph of the articles in that category. When a burst is judged to be a spam description, features seen in the spam, such as words, URLs, and patterns, are turned into rules and sent to the cleansing rule storage unit 23, and processing is run again from cleansing, which yields low-noise, high-precision data extraction for the category.
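The external-source check described above can be sketched as follows. This is a hedged illustration: the `news_items` format, the `window_days` tolerance, the `min_ratio` threshold, and the "unknown" fallback for an empty feature list are all assumptions introduced here.

```python
from datetime import date, timedelta


def judge_burst(feature_words, burst_start, burst_end, news_items,
                window_days=2, min_ratio=0.5):
    """Judge a burst as "event" or "spam" from external news coverage.

    news_items: list of (date, text) pairs from an external source such as
    Web news (hypothetical format).
    """
    if not feature_words:
        return "unknown"  # nothing to check; leave the decision to a human
    lo = burst_start - timedelta(days=window_days)
    hi = burst_end + timedelta(days=window_days)
    # News texts published around the burst period.
    nearby = [text.lower() for day, text in news_items if lo <= day <= hi]
    confirmed = [w for w in feature_words if any(w.lower() in t for t in nearby)]
    if len(confirmed) / len(feature_words) >= min_ratio:
        return "event"
    return "spam"
```

A burst whose feature words are echoed by news from the same days is treated as an event; one whose words appear nowhere in the external source is treated as spam.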

The event information storage units 341, 342, ..., 34n each store the event information that the corresponding spam/event feature determination means 331, 332, ..., 33n has judged to be an event.

The output device 4 is a display device such as a monitor, or a printing device such as a printer, that outputs the event information and other data held in the event information storage units 341, 342, ..., 34n.

Next, the operation of the spam/event detection apparatus according to the first embodiment of the present invention is described with reference to FIG. 1 and the flowchart of FIG. 2. FIG. 2 is a flowchart showing the operation of the spam/event detection apparatus according to the first embodiment of the present invention.

The input device 1 sends the input text data to the data storage unit 21, and the data storage unit 21 stores the text data (step A1 in FIG. 2).

The data cleansing means 22 refers to the cleansing rule storage unit 23 and removes from the text data stored in the data storage unit 21 any spam descriptions that match the current spam description rules (step A2 in FIG. 2).

The data extraction means 24 extracts, from the data remaining after spam removal, the text relating to each category that the user wants to analyze or to each automatically extracted group of similar categories, and stores it in the extracted data storage units 251, 252, ..., 25n, respectively (step A3 in FIG. 2).

The time-series burst detection means 311, 312, ..., 31n each detect bursts over time from the article counts of the data for the corresponding extracted category (step A4 in FIG. 2).

The burst feature extraction means 321, 322, ..., 32n each extract as features, from the extracted data for a detected burst period, the keywords, keyword groups, summary sentences, and so on that appear characteristically only in the burst period (step A5 in FIG. 2).

The spam/event feature determination means 331, 332, ..., 33n each judge, from the feature information for a burst period, whether the descriptions in that period were spam descriptions or event descriptions. For a spam description, a rule for removing the identified spam is created and stored in the cleansing rule storage unit 23, and the data cleansing of step A2 is executed again. For an event description, the feature information is stored as event information in the event information storage units 341, 342, ..., 34n, respectively (step A6 in FIG. 2).
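The feedback from step A6 back to step A2 can be sketched as a loop that keeps cleansing until no new spam features are learned. Everything here is a simplified assumption: rules are reduced to a set of spam words, a single word hit marks an article as spam, and `analyze_bursts` is a hypothetical callable standing in for steps A3 to A6 that returns `(verdict, feature_words)` pairs.

```python
def run_pipeline(articles, spam_words, analyze_bursts, max_rounds=10):
    """Repeat cleansing and burst analysis until no new spam words are found.

    analyze_bursts(cleaned) -> list of ("spam" | "event", [feature words]);
    it is a placeholder for burst detection, feature extraction, and judgment.
    """
    spam_words = set(spam_words)
    cleaned, events = list(articles), []
    for _ in range(max_rounds):
        # Step A2 (simplified): drop articles containing any known spam word.
        cleaned = [a for a in articles if not (set(a.lower().split()) & spam_words)]
        verdicts = analyze_bursts(cleaned)
        events = [words for kind, words in verdicts if kind == "event"]
        new_spam = {w for kind, words in verdicts if kind == "spam" for w in words}
        new_spam -= spam_words
        if not new_spam:
            break  # no rule changes, so another cleansing pass would be identical
        spam_words |= new_spam  # rewrite the rules, then cleanse again
    return cleaned, events, spam_words
```

The `max_rounds` cap is a safety measure added for this sketch so that a misbehaving analyzer cannot cause an endless loop.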

The output device 4 outputs the event information stored in the event information storage units 341, 342, ..., 34n that caused each burst, along with items such as a time-series graph of the number of blog articles for the category annotated with that event information (step A7 in FIG. 2).

Next, the operation of each unit is described using a concrete example. Suppose that all blog articles in existence are input and the blogs relating to a certain company A are extracted. FIGS. 3 and 4 are diagrams showing examples of how the number of blog articles about company A changes over time. In FIG. 3, two bursts can be seen around November 2007. The time-series burst detection means 311, 312, ..., 31n detect bursts of this kind, and the burst feature extraction means 321, 322, ..., 32n extract feature words for each of the two bursts. In FIG. 3, the features of the first burst point to a scandal, with phrases such as "XX incident" and "mistake caused by ...". Because these words also appear at the same time in external sources such as Web news and TV, the spam/event feature determination means can judge the burst to be an event description.

In contrast, suppose that the features extracted for the second burst are words that do not appear in external sources at all, carry no meaning, and are unrelated to the category. In this case, the spam/event feature determination means 331, 332, ..., 33n judge the burst to be a spam description and add to the cleansing rule storage unit 23 a rule stating that descriptions containing many of the extracted feature words are spam. When data cleansing is then run again, the burst disappears, as shown in FIG. 4. Spam descriptions are generally generated automatically by scattering easily searched words through the text, so when the data is narrowed down to a category they show up as bursts of this kind.

FIG. 5 is a diagram showing an example of the change over time when a category whose posting activity is strongly periodic is analyzed. As shown in FIG. 5, bursts occur at a fixed cycle, such as weekly, and the apparatus detects those that are larger than the other bursts. A TV program such as a drama airs on a fixed day of the week, and the volume of blog posts and similar writing rises immediately afterward. Looking at the features of a burst that rises abnormally high among these, special events at that time, such as "guest performer B" and "surprise," are extracted, as in FIG. 5. Bursts that occur off the regular cycle may be caused by events involving cast members, by spam descriptions, and so on, and can be judged by the spam/event feature determination means 331, 332, ..., 33n.

The spam/event detection apparatus described above classifies the input text by the categories the user wants or by similar topics, automatically detects spam descriptions and event descriptions within the text of each category, and is configured recursively so that data cleansing is run again whenever a spam description is found. Event descriptions are stored as high-value-added information. In this way, high-precision analysis, including the removal of previously unknown spam descriptions, can be achieved without manual effort. High-value-added text reporting that includes event information for each category also becomes possible.

[Second Embodiment]
FIG. 6 is a diagram showing the configuration of a spam/event detection apparatus according to the second embodiment of the present invention. Referring to FIG. 6, the spam/event detection apparatus differs from the configuration shown in FIG. 1 in that it does not have the data cleansing means 22 and the cleansing rule storage unit 23 of FIG. 1. It also differs in that it has data cleansing means 351, 352, ..., 35n, which remove spam descriptions, and cleansing rule storage units 361, 362, ..., 36n.

The data cleansing means 351, 352, ... 35n and the cleansing rule storage units 361, 362, ... 36n have the same functions as the data cleansing means 22 and the cleansing rule storage unit 23 of FIG. 1, respectively, but differ in that data cleansing is performed on the data extracted for each category.

Next, the operation of the spam event detection apparatus according to the second embodiment of the present invention will be described with reference to FIG. 6 and the flowchart of FIG. 7. FIG. 7 is a flowchart showing the operation of the spam event detection apparatus according to the second embodiment of the present invention. The processing in steps A1 and A4 to A7 of FIG. 7 is the same as in the steps with the same reference signs in FIG. 2, so its description is omitted.

In the first embodiment, data cleansing, which is the process of removing spam descriptions, is performed before the data are extracted into categories. In the present embodiment, by contrast, after step A1 the data extraction means 24 extracts data for each category (step B1 in FIG. 7). Data cleansing is then performed on the data of each category by the data cleansing means 351, 352, ... 35n using the cleansing rule storage units 361, 362, ... 36n (step B2 in FIG. 7).

The spam event detection apparatus of the second embodiment differs from the first embodiment in whether the data cleansing means runs before or after the data extraction means. When it runs before, as in the first embodiment, the cleansing rules are common to all categories, which is efficient in terms of management and processing. However, false detections, in which a description that is not spam is judged to be spam, also increase. In the second embodiment, by contrast, each category has its own cleansing rules, so the management and processing load is larger, but highly accurate data cleansing can be achieved.
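The trade-off between one common rule set and per-category rule sets can be illustrated with a small sketch. The keyword rules, the category names, and the sample articles are invented for illustration, and the specification does not fix the form of a cleansing rule. The point is only that a keyword that signals spam in one category may be legitimate in another.

```python
def cleanse(articles, rules):
    """Drop every article that contains any rule keyword (case-insensitive)."""
    return [a for a in articles
            if not any(keyword in a.lower() for keyword in rules)]

# Hypothetical rule sets: one shared by all categories, one per category.
COMMON_RULES = ["click here", "loan"]
CATEGORY_RULES = {
    "drama": ["click here", "loan"],
    "finance": ["click here"],  # "loan" is ordinary vocabulary here
}

extracted = {
    "drama": ["Great episode last night", "Cheap loan, click here"],
    "finance": ["Bank A cut its loan rates", "Click here to win"],
}

common = {c: cleanse(a, COMMON_RULES) for c, a in extracted.items()}
per_category = {c: cleanse(a, CATEGORY_RULES[c]) for c, a in extracted.items()}
```

With the common rules, the legitimate finance article about loan rates is removed as spam. With the per-category rules it survives, at the cost of maintaining one rule set per category.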

Thus, the spam event detection apparatus of the second embodiment provides the effects of the first embodiment and, in addition, is configured to hold dedicated cleansing rules for each category. Consequently, in extracting the text of each category, malfunctions such as errors in spam description removal decrease, and spam descriptions can be removed with high accuracy.

[Third Embodiment]
FIG. 8 is a diagram showing the configuration of a spam event detection apparatus according to the third embodiment of the present invention. Referring to FIG. 8, the spam event detection apparatus differs from the configuration shown in FIG. 6 in the following points: it has time-series burst detection score assigning means 371, 372, ... 37n in place of the time-series burst detection means 311, 312, ... 31n of FIG. 6; it has data cleansing score assigning means 381, 382, ... 38n in place of the data cleansing means 351, 352, ... 35n of FIG. 6; it has spam event aggregation determination means 391, 392, ... 39n; and it does not have the spam event feature determination means 331, 332, ... 33n of FIG. 6.

The time-series burst detection score assigning means 371, 372, ... 37n perform burst detection over the entire time range of each set of data extracted per category in the extracted data storage units 251, 252, ... 25n, and convert the burst information at each time into a score.
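One simple way to turn burst information at each time into a score is sketched below. The specification does not say how the score is computed, so the standardized deviation from the mean used here, and the name `burst_scores`, are assumptions chosen only to make the idea concrete.

```python
from statistics import mean, pstdev

def burst_scores(counts):
    """Score each time step by how far its article count rises above the
    mean, in units of standard deviation; values below the mean score 0."""
    mu = mean(counts)
    sigma = pstdev(counts)
    if sigma == 0:
        # A perfectly flat series contains no burst at all.
        return [0.0] * len(counts)
    return [max(0.0, (c - mu) / sigma) for c in counts]

# Article counts per day for one category; the last day is a burst.
scores = burst_scores([10, 10, 10, 10, 50])
```

A larger burst receives a larger score, which is the only property the later aggregation step relies on.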

The data cleansing score assigning means 381, 382, ... 38n take each set of data extracted per category in the extracted data storage units 251, 252, ... 25n and assign a score to every article that matches a rule stored in the corresponding cleansing rule storage unit 361, 362, ... 36n.
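Rule-based scoring of this kind can be sketched as a list of patterns, each carrying a score, with an article receiving the sum of the scores of the rules it matches. The patterns and weights below are invented for illustration, as the specification does not list concrete rules.

```python
import re

# Hypothetical cleansing rules: (pattern, score given when it matches).
RULES = [
    (re.compile(r"click here", re.IGNORECASE), 3),
    (re.compile(r"https?://\S+"), 1),
    (re.compile(r"free", re.IGNORECASE), 2),
]

def cleansing_score(article, rules=RULES):
    """Sum the scores of all rules whose pattern occurs in the article."""
    return sum(score for pattern, score in rules if pattern.search(article))

spammy = cleansing_score("Click here for FREE stuff http://x.example")
normal = cleansing_score("Watched the drama tonight")
```

Keeping the score per rule, rather than a plain match or no-match flag, lets strong and weak spam indicators contribute differently to the later determination.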

The spam event aggregation determination means 391, 392, ... 39n determine, for the articles in a burst period, whether each is a spam description or an event description, based on the scores from the burst detection score assigning means 371, 372, ... 37n and the scores from the data cleansing score assigning means 381, 382, ... 38n.

The cleansing rule storage units 361, 362, ... 36n update their deletion rules based on the information that the spam event aggregation determination means 391, 392, ... 39n have judged to be spam descriptions.

Next, the operation of the spam event detection apparatus according to the third embodiment of the present invention will be described with reference to FIG. 8 and the flowchart of FIG. 9. FIG. 9 is a flowchart showing the operation of the spam event detection apparatus according to the third embodiment of the present invention.

The processing in steps A1, B1, A5, and A7 of FIG. 9 is the same as in the steps with the same reference signs in FIG. 7, so its description is omitted.

In the second embodiment, whether a description is a spam description or an event description is judged from the features of the burst period. In the present embodiment, by contrast, after step B1 the burst detection score assigning means 371, 372, ... 37n score the data extracted by the data extraction means 24 so that a larger degree of burst receives a higher score (step C2 in FIG. 9).

Also after step B1, the data cleansing score assigning means 381, 382, ... 38n give a per-rule score to each article, among the data extracted by the data extraction means 24, that matches a rule stored in the cleansing rule storage units 361, 362, ... 36n (step C1 in FIG. 9).

The spam event aggregation determination means 391, 392, ... 39n then evaluate these scores together and determine whether each description is an event description or a spam description (step C3 in FIG. 9).

Steps A5 and A7 are then executed.

The spam event detection apparatus of the third embodiment differs from the first and second embodiments in whether a spam or event description is judged from the features of the burst period or from the combined score of the burst detection means and the data cleansing means. For the burst detection score, the larger the burst, the higher the score given to each article within the burst period. The data cleansing means likewise has a score assigned to each cleansing rule in advance and gives each article the sum of the scores of the rules it matches. The spam event aggregation determination means 391, 392, ... 39n judge that a description is most likely a spam description when both the burst detection score and the cleansing score are high. When the burst detection score is high and the cleansing score is low, they judge that the description is more likely an event description. For example, making the determination as shown in FIG. 10 provides a guideline for the judgment. When the cleansing score is high but the description is likely to be a normal description, the cleansing rules can be reviewed.
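The combination of the two scores can be sketched as a small decision table. The thresholds, the label strings, and the reading of "high cleansing score but probably normal" as the low-burst case are assumptions; the actual table of FIG. 10 is not reproduced in this text.

```python
def judge(burst_score, cleansing_score, burst_th=1.0, cleansing_th=3):
    """Classify one article from its two scores (hypothetical thresholds)."""
    high_burst = burst_score >= burst_th
    high_cleansing = cleansing_score >= cleansing_th
    if high_burst and high_cleansing:
        return "spam"            # both scores high: most likely spam
    if high_burst:
        return "event"           # burst without rule matches: likely an event
    if high_cleansing:
        return "normal, review rules"  # rules fire although nothing bursts
    return "normal"

labels = [judge(2.0, 6), judge(2.0, 0), judge(0.0, 6), judge(0.0, 0)]
```

Returning a separate label for the "rules fire but no burst" case reflects the remark that such articles are a cue to revisit the cleansing rules rather than a confident spam verdict.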

Thus, the spam event detection apparatus of the third embodiment provides the effects of the second embodiment and, in addition, allows a spam description to be distinguished from an event description by score. Consequently, a clearer judgment can be made and the likelihood of each kind of description can be presented.

[Fourth Embodiment]
FIG. 11 is a diagram showing the configuration of a spam event detection apparatus according to the fourth embodiment of the present invention. Referring to FIG. 11, the spam event detection apparatus has, in addition to the configuration shown in FIG. 8, spam event feature determination means 331, 332, ... 33n placed after the burst feature extraction means 321, 322, ... 32n, respectively.

The functions of the spam event feature determination means 331, 332, ... 33n were described in the first embodiment and are omitted here.

Next, the operation of the spam event detection apparatus according to the fourth embodiment of the present invention will be described with reference to FIGS. 11 and 12. FIG. 12 is a flowchart showing the operation of the spam event detection apparatus according to the fourth embodiment of the present invention. The processing in steps A1, B1, C1, C2, C3, A5, and A7 of FIG. 12 is the same as in the steps with the same reference signs in FIG. 9, so its description is omitted.

In the third embodiment, whether a description is spam or an event is judged only by considering the scores of two determinations: a traffic-based determination, namely burst detection, and a content-based determination, namely rule-based spam detection. In the present embodiment, by contrast, after step A5, the features that the burst feature extraction means 321, 322, ... 32n extracted from the descriptions judged to be event descriptions by the spam event aggregation determination means 391, 392, ... 39n are judged again by the spam event feature determination means 331, 332, ... 33n (step D1 in FIG. 12). If the result is a spam description, the cleansing rules are updated again and data cleansing is performed (step C1 in FIG. 12).

The spam event detection apparatus of the fourth embodiment differs from the first, second, and third embodiments in that it makes a two-stage determination, running the determination means of the first and second embodiments after the determination means of the third embodiment. In the third embodiment, for a description with a low cleansing score and a high burst score, for example, it is difficult to judge whether it is an unknown spam description or an event description. Judging again with the spam event feature determination means 331, 332, ... 33n therefore makes it possible to identify unknown spam descriptions as well. This double determination achieves highly accurate spam and event determination.
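The two-stage flow can be sketched as follows. The second-stage feature test used here (one URL dominating the burst-period articles) is purely a stand-in: this section does not restate the first embodiment's feature determination, so `dominant_url`, the `share` threshold, and the way a new rule is recorded are all hypothetical.

```python
import re
from collections import Counter

URL = re.compile(r"https?://\S+")

def dominant_url(burst_articles, share=0.5):
    """Return a URL found in more than `share` of the articles, else None."""
    urls = Counter(u for a in burst_articles for u in set(URL.findall(a)))
    if not urls:
        return None
    url, n = urls.most_common(1)[0]
    return url if n / len(burst_articles) > share else None

def two_stage(stage_one_label, burst_articles, rules):
    """Re-examine only what stage one called an event; feed spam back."""
    if stage_one_label != "event":
        return stage_one_label
    url = dominant_url(burst_articles)
    if url is None:
        return "event"
    rules.append(url)  # update the cleansing rules with the new spam feature
    return "spam"

rules = []
burst = ["New phone! http://spam.example/a",
         "Buy now http://spam.example/a",
         "I like the new phone"]
verdict = two_stage("event", burst, rules)
```

Here a burst that matched no existing rule, and so looked like an event after the score-based stage, is reclassified as spam by the second stage, and the feature that exposed it is added to the rule list for the next cleansing pass.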

Thus, the spam event detection apparatus of the fourth embodiment provides the effects of the third embodiment and, in addition, performs a determination based on the features of the burst period after the spam and event determination by burst detection and rule matching. Consequently, a more accurate determination is possible.

The present invention is applicable to the analysis of text that contains much noise (spam such as affiliate articles) and many events (new product launches, TV commercial airings, introductions on TV programs, and so on), such as the blogs, SNSs, and news articles often used for marketing research and brand image research. It is also applicable to community service businesses that operate blogs, SNSs, and the like, and to portal service businesses that provide information such as news.

The disclosures of the patent documents and the like cited above are incorporated herein by reference. Within the framework of the entire disclosure (including the claims) of the present invention, and based on its basic technical concept, the embodiments and examples may be changed and adjusted. Various combinations and selections of the disclosed elements are possible within the framework of the claims of the present invention. That is, the present invention naturally includes the various variations and modifications that a person skilled in the art could make in accordance with the entire disclosure, including the claims, and the technical concept.

FIG. 1 is a diagram showing the configuration of the spam event detection apparatus according to the first embodiment of the present invention.
FIG. 2 is a flowchart showing the operation of the spam event detection apparatus according to the first embodiment of the present invention.
FIG. 3 is a first diagram showing an example of the temporal transition of the number of blog articles about company A.
FIG. 4 is a second diagram showing an example of the temporal transition of the number of blog articles about company A.
FIG. 5 is a diagram showing an example of the temporal transition when a category whose postings are strongly periodic is analyzed.
FIG. 6 is a diagram showing the configuration of the spam event detection apparatus according to the second embodiment of the present invention.
FIG. 7 is a flowchart showing the operation of the spam event detection apparatus according to the second embodiment of the present invention.
FIG. 8 is a diagram showing the configuration of the spam event detection apparatus according to the third embodiment of the present invention.
FIG. 9 is a flowchart showing the operation of the spam event detection apparatus according to the third embodiment of the present invention.
FIG. 10 is a diagram showing an example of the determination rules in the determination means according to the third embodiment of the present invention.
FIG. 11 is a diagram showing the configuration of the spam event detection apparatus according to the fourth embodiment of the present invention.
FIG. 12 is a flowchart showing the operation of the spam event detection apparatus according to the fourth embodiment of the present invention.

Explanation of symbols

1 Input device
4 Output device
21 Data storage unit
22 Data cleansing means
23 Cleansing rule storage unit
24 Data extraction means
251, 252, ... 25n Extracted data storage units
311, 312, ... 31n Time-series burst detection means
321, 322, ... 32n Burst feature extraction means
331, 332, ... 33n Spam event feature determination means
341, 342, ... 34n Event information storage units
351, 352, ... 35n Data cleansing means
361, 362, ... 36n Cleansing rule storage units
371, 372, ... 37n Time-series burst detection score assigning means
381, 382, ... 38n Data cleansing score assigning means
391, 392, ... 39n Spam event aggregation determination means

Claims

A spam event detection apparatus comprising:
data extraction means for classifying accumulated text data to be subjected to detection into a plurality of data sets by a predetermined method; and,
for each of the plurality of data sets,
time-series burst score assigning means for assigning a score to burst information for the data classified by the data extraction means,
data cleansing score assigning means for assigning a score, based on spam description rules, to the data classified by the data extraction means, and
spam event aggregation determination means for determining whether a description is a spam description or an event description based on both the score assigned by the time-series burst score assigning means and the score assigned by the data cleansing score assigning means,
wherein the spam description rules are rewritten based on information on spam that the spam event aggregation determination means has determined to be a spam description.

The spam event detection apparatus according to claim 1, further comprising, for each of the plurality of data sets:
burst feature extraction means for extracting features of a burst period from the information that the spam event aggregation determination means has determined to be the event description; and
spam event feature determination means for determining whether a description is a spam description or an event description based on the features extracted by the burst feature extraction means,
wherein the spam description rules are rewritten based on information on spam that the spam event feature determination means has determined to be a spam description.

A spam event detection method by which a spam event detection apparatus detects spam, the method comprising:
a step of classifying accumulated text data to be subjected to detection into a plurality of data sets by a predetermined method; and,
for each of the plurality of data sets,
a step of assigning a score to burst information for the classified data,
a step of assigning a score to the classified data based on spam description rules, and
a step of determining whether a description is a spam description or an event description based on both of the scores assigned in the two score assigning steps,
wherein the spam description rules are rewritten based on information on spam determined to be a spam description in the step of determining, based on both scores, whether a description is a spam description or an event description.

The spam event detection method according to claim 3, further comprising, for each of the plurality of data sets:
a step of extracting features of a burst period from the information determined to be an event description in the step of determining whether a description is a spam description or an event description; and
a step of determining whether a description is a spam description or an event description based on the extracted features,
wherein the spam description rules are rewritten based on information on spam determined to be a spam description in the step of determining, based on the extracted features, whether a description is a spam description or an event description.

A program that causes a computer constituting a spam event detection apparatus to execute:
a data extraction process of classifying accumulated text data to be subjected to detection into a plurality of data sets by a predetermined method; and,
for each of the plurality of data sets,
a time-series burst score assigning process of assigning a score to burst information for the data classified by the data extraction process,
a data cleansing score assigning process of assigning a score, based on spam description rules, to the data classified by the data extraction process, and
a spam event aggregation determination process of determining whether a description is a spam description or an event description based on both the score assigned by the time-series burst score assigning process and the score assigned by the data cleansing score assigning process,
wherein the program rewrites the spam description rules based on information on spam that the spam event aggregation determination process has determined to be a spam description.

The program according to claim 5, further causing the computer to execute, for each of the plurality of data sets:
a burst feature extraction process of extracting features of a burst period from the information that the spam event aggregation determination process has determined to be the event description; and
a spam event feature determination process of determining whether a description is a spam description or an event description based on the features extracted in the burst feature extraction process,
wherein the program rewrites the spam description rules based on information on spam that the spam event feature determination process has determined to be a spam description.