JP2011053734A


Info

Publication number: JP2011053734A
Application number: JP2009199345A
Authority: JP
Inventors: Tatsuya Shindo; 達也進藤
Original assignee: Individual
Current assignee: Individual
Priority date: 2009-08-31
Filing date: 2009-08-31
Publication date: 2011-03-17

Abstract

PROBLEM TO BE SOLVED: To make in-site search easy to provide while reducing the load on the site operator, in an in-site search system that provides a search service for content in a specific site.

SOLUTION: Attributes to be extracted from the source code of an HTML document that introduces a product are set separately as character string types and numeric types. Either the content itself or the URL of the content is designated as the extraction source. The description related to the attribute is identified in the source code, and an extraction condition is set as a regular expression so that the attribute value, which is a character string or a numeric value, is stored in a reference variable such as $1. Attribute information extracted under the condition is stored in a database and used for in-site search via the Internet.

COPYRIGHT: (C)2011, JPO&INPIT

Description

Translated from Japanese

The present invention relates to an in-site search system in which an in-site search server provides a search service for content in a specific site.

Patent Document 1 (JP 2002-108906 A, "Web Page for Information Search") discloses a technique whose object is "to provide an information search Web page that enables information search on a Web site to be performed comfortably and smoothly." It is used, for example, for searching for products in a virtual shopping mall.

In this example, it is assumed that an information database on the products is prepared in advance and that a search means is provided on the site operator's server. The burden on the site operator is therefore large.

Patent Document 2 (JP 2004-264928 A, "Web Site Search Method and Apparatus, Web Site Search Program, and Recording Medium Recording the Program") discloses a technique whose object is "to visualize the in-site structure for the searcher in a site search and to provide easy-to-understand in-site navigation." For this purpose, Web pages are collected in advance and associated with an estimated site tree structure.

However, the estimated site tree structure is based on the links between pages and does not reflect the content of the pages. It is therefore insufficient for purposes such as searching product information.

Patent Document 1: JP 2002-108906 A
Patent Document 2: JP 2004-264928 A

The problem to be solved is to reduce the burden on the site operator and to make in-site search easy to implement.

The in-site search server according to the present invention is an in-site search server that is connected via the Internet to a site server providing public information as tagged document files, and that is further connected to a system administrator terminal, the in-site search server comprising the following elements:
(1) an attribute extraction condition registration unit that receives, from the system administrator terminal, an attribute extraction condition for extracting attribute information on an arbitrary attribute from the public information provided by the site server;
(2) an attribute extraction condition storage unit that stores the received attribute extraction condition;
(3) a document collection unit that collects, via the Internet, the tagged document files published on the electronic commerce site server;
(4) a collected document storage unit that stores the collected tagged document files;
(5) an attribute extraction unit that extracts attribute information from each tagged document file according to the attribute extraction condition; and
(6) a collected document database that stores the extracted attribute information for each tagged document.

The attribute extraction condition includes a condition specifying whether the extraction source data is the description code in the tagged document file or the URL of the tagged document file.
The in-site search server further has a collected document list that stores the URLs of the collected tagged document files.
When the attribute extraction condition specifies the description code in the tagged document file as the extraction source data, the attribute information is extracted from that description code; when it specifies the URL of the tagged document file as the extraction source data, the attribute information is extracted from that URL.
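As a non-limiting illustration, the choice of extraction source might be sketched in Python as follows. The function name `extract_from_source`, the labels "url" and "content", the file name, and the markup are assumptions made for this sketch and do not appear in the specification.

```python
import re
from typing import Optional


def extract_from_source(source_kind: str, pattern: str,
                        url: str, html: str) -> Optional[str]:
    """Apply the pattern to the URL or to the document's description code.

    source_kind is "url" or "content" (labels assumed for this sketch).
    """
    text = url if source_kind == "url" else html
    match = re.search(pattern, text)
    return match.group(1) if match else None


# Illustrative page; the file name and markup are hypothetical.
page_url = "http://www.example.com/shop/wine/france/item001.html"
page_html = '<table><tr><th>Color</th><td class="color">Red</td></tr></table>'

# Attribute taken from the URL of the tagged document file.
country = extract_from_source("url", r"/shop/wine/([^/]+)/", page_url, page_html)
# Attribute taken from the description code in the tagged document file.
color = extract_from_source("content", r'<td class="color">([^<]+)</td>',
                            page_url, page_html)
```

Extracting from the URL can be useful when a site encodes a category in its path, while extracting from the description code covers attributes that only appear in the page body.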

The attribute extraction condition includes a regular expression that specifies a description pattern and a reference part within the description pattern, and the name of a reference variable that stores the description code of the reference part specified by the regular expression.
The attribute extraction unit performs a matching determination: it determines whether the description pattern specified by the regular expression exists anywhere in the description code of the tagged document file and, if the pattern exists, stores the description code of the reference part specified by the regular expression in the reference variable. The reference variable value is then included in the attribute information as an attribute value.
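The matching determination can be approximated with a capturing group in a Python regular expression, where group 1 plays the role of the reference variable $1. This is a minimal sketch under stated assumptions: the helper `extract_attribute`, the HTML fragment, and the patterns are hypothetical.

```python
import re
from typing import Optional


def extract_attribute(pattern: str, group: int, source: str) -> Optional[str]:
    """Search the whole source for the pattern and return the captured part.

    `group` plays the role of the reference variable ($1 -> 1, $2 -> 2).
    Returns None when the description pattern does not occur.
    """
    match = re.search(pattern, source, flags=re.DOTALL)
    if match is None:
        return None
    return match.group(group)


# Hypothetical HTML fragment and patterns, for illustration only.
html = "<tr><th>Producer</th><td>Domaine Example</td></tr>"
producer_pattern = r"<th>Producer</th>\s*<td>(.*?)</td>"

producer = extract_attribute(producer_pattern, 1, html)
missing = extract_attribute(r"<th>Region</th>\s*<td>(.*?)</td>", 1, html)
```

When the pattern is absent, the sketch returns `None`, so no attribute value is recorded for that document.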

The attribute extraction condition registration unit receives character string type attribute extraction conditions and numeric type attribute extraction conditions.
For a character string type attribute extraction condition, the attribute extraction unit includes the description code of the reference part (the reference variable value) as is, as an attribute value, in the character string attribute information. For a numeric type attribute extraction condition, it converts the description code of the reference part (the reference variable value) into binary code and includes the converted binary code, as an attribute value, in the numeric attribute information.
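The difference between the two types can be illustrated with a short Python sketch. It assumes that "binary code" means a machine integer and that thousands separators should be removed before conversion; neither detail is stated in the specification, and the function name `to_attribute_value` is hypothetical.

```python
from typing import Union


def to_attribute_value(captured: str, value_type: str) -> Union[str, int]:
    """Convert a captured reference-variable value according to its type.

    "string":  the captured text is used as is.
    "numeric": the text is converted to an int so it can be compared by value.
    Removing "," separators is an assumption of this sketch.
    """
    if value_type == "string":
        return captured
    if value_type == "numeric":
        return int(captured.replace(",", ""))
    raise ValueError(f"unknown attribute type: {value_type}")


taste = to_attribute_value("Dry", "string")
price = to_attribute_value("12,800", "numeric")
vintage = to_attribute_value("2005", "numeric")
```

Storing numeric attributes as numbers rather than text is what makes range conditions such as "price at most 20,000" straightforward to evaluate.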

The regular expression of the attribute extraction condition specifies a plurality of reference parts, and the plurality of reference variable names of the attribute extraction condition are divided into hierarchical layers.
The attribute extraction unit generates attribute information in which the attribute values (the reference variable values) are distributed to the respective layers.

Alternatively, the regular expression of the attribute extraction condition specifies a plurality of reference parts, and the plurality of reference variable names of the attribute extraction condition are parallel.
The attribute extraction unit generates attribute information in which the attribute values (the reference variable values) are listed in a single layer.
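The hierarchical and parallel arrangements of reference variables can be contrasted in a small Python sketch. The function `build_attribute_info`, the dictionary and list representations, and the "Grapes" line are assumptions for illustration; only the breadcrumb example follows the wine page described in this specification.

```python
import re
from typing import Dict, List, Union


def build_attribute_info(match: "re.Match[str]", variables: List[int],
                         hierarchical: bool) -> Union[Dict[int, str], List[str]]:
    """Collect the captured values of the given reference variables.

    hierarchical=True : one value per layer, keyed by layer number (1, 2, ...).
    hierarchical=False: all values listed side by side in a single layer.
    """
    values = [match.group(v) for v in variables]
    if hierarchical:
        return {layer: value for layer, value in enumerate(values, start=1)}
    return values


# Hierarchical case: $1, $2, $3 map to layers 1, 2, 3 of a breadcrumb list.
breadcrumb = "Category Top > France > Burgundy > Vosne-Romanee"
region_match = re.search(r"Category Top > (.+?) > (.+?) > (.+)$", breadcrumb)
region = build_attribute_info(region_match, [1, 2, 3], hierarchical=True)

# Parallel case: $1 and $2 are values of equal rank (hypothetical markup).
grapes_text = "Grapes: Pinot Noir, Gamay"
grapes_match = re.search(r"Grapes: ([^,]+), ([^,]+)$", grapes_text)
grapes = build_attribute_info(grapes_match, [1, 2], hierarchical=False)
```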

The in-site search server further comprises:
a search request reception unit that receives, via the Internet, a search request including a search condition on an attribute;
a search condition determination unit that identifies the search condition from the received search request;
a document search execution unit that searches the collected document database for tagged documents whose attribute information satisfies the search condition;
a list window generation unit that generates a list window listing the titles of the retrieved tagged documents; and
a document search result screen transmission unit that returns a document search result screen including the list window.

The search request is a predetermined URL of the in-site search server, and the search condition is a parameter appended to that URL.
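By way of illustration only, identifying search conditions from URL parameters might look like the following Python sketch. The host name `search.example.net` and the parameter names `site`, `taste`, and `price_max` are hypothetical; the specification does not define the actual URL format.

```python
from typing import Dict
from urllib.parse import parse_qs, urlsplit


def parse_search_conditions(request_url: str) -> Dict[str, str]:
    """Return the query parameters of a search request URL as conditions.

    Only the first value of each parameter is kept in this sketch.
    """
    query = urlsplit(request_url).query
    return {name: values[0] for name, values in parse_qs(query).items()}


# Hypothetical search request sent to the in-site search server.
request_url = ("http://search.example.net/search"
               "?site=example&taste=dry&price_max=5000")
conditions = parse_search_conditions(request_url)
```

Because the conditions travel in the URL, a search window on the electronic commerce site needs nothing more than an ordinary link or form pointing at the in-site search server.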

The in-site search server further comprises an attribute value list generation unit that generates, for the arbitrary attribute, a list of the attribute values contained in the attribute information of all the tagged documents, and an attribute value list storage unit that stores the generated list of attribute values.
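One way to picture attribute value list generation is to gather the distinct values of an attribute across all collected documents. The record layout, the sample data, and the function name `build_attribute_value_list` below are assumptions for this sketch, not the format used by the collected document database.

```python
from typing import Dict, List

# Hypothetical contents of a collected document database.
collected_documents: List[Dict] = [
    {"url": "http://www.example.com/shop/wine/item001.html",
     "attributes": {"taste": "dry", "color": "red"}},
    {"url": "http://www.example.com/shop/wine/item002.html",
     "attributes": {"taste": "sweet", "color": "white"}},
    {"url": "http://www.example.com/shop/wine/item003.html",
     "attributes": {"taste": "dry"}},
]


def build_attribute_value_list(documents: List[Dict], attribute: str) -> List[str]:
    """Return the sorted distinct values of one attribute over all documents."""
    values = {doc["attributes"][attribute]
              for doc in documents
              if attribute in doc["attributes"]}
    return sorted(values)


# One list per attribute, as an attribute value list storage unit might hold.
attribute_value_lists = {
    name: build_attribute_value_list(collected_documents, name)
    for name in ("taste", "color")
}
```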

The in-site search server further comprises an attribute window generation unit that generates, for the arbitrary attribute, an attribute window displaying the list of attribute values.
The document search result screen transmission unit includes the attribute window in the document search result screen.

The site server is an electronic commerce site server that supports product transactions via the Internet, and the tagged document constitutes a screen that displays content relating to a product offered through electronic commerce.

The tagged document is a structured document.

The tagged document is a PDF document.

The in-site search service method according to the present invention is an in-site search service method performed by an in-site search server that is connected via the Internet to a site server providing public information as tagged document files, and that is further connected to a system administrator terminal, the method comprising the following steps:
(1) an attribute extraction condition registration step of receiving, from the system administrator terminal, an attribute extraction condition for extracting attribute information on an arbitrary attribute from the public information provided by the site server;
(2) an attribute extraction condition storage step of storing the received attribute extraction condition;
(3) a document collection step of collecting, via the Internet, the tagged document files published on the electronic commerce site server;
(4) a collected document storage step of storing the collected tagged document files;
(5) an attribute extraction step of extracting attribute information from each tagged document file according to the attribute extraction condition; and
(6) a collected document database creation step of storing the extracted attribute information for each tagged document.

The in-site search service method further comprises:
a search request reception step of receiving, via the Internet, a search request including a search condition on an attribute;
a search condition determination step of identifying the search condition from the received search request;
a document search execution step of searching the collected document database for tagged documents whose attribute information satisfies the search condition;
a list window generation step of generating a list window listing the titles of the retrieved tagged documents; and
a document search result screen transmission step of returning a document search result screen including the list window.

The program according to the present invention causes a computer serving as an in-site search server, which is connected via the Internet to a site server providing public information as tagged document files and is further connected to a system administrator terminal, to execute the following procedures:
(1) an attribute extraction condition registration procedure of receiving, from the system administrator terminal, an attribute extraction condition for extracting attribute information on an arbitrary attribute from the public information provided by the site server;
(2) an attribute extraction condition storage procedure of storing the received attribute extraction condition;
(3) a document collection procedure of collecting, via the Internet, the tagged document files published on the electronic commerce site server;
(4) a collected document storage procedure of storing the collected tagged document files;
(5) an attribute extraction procedure of extracting attribute information from each tagged document file according to the attribute extraction condition; and
(6) a collected document database creation procedure of storing the extracted attribute information for each tagged document.

The program further causes the computer serving as the in-site search server to execute:
a search request reception procedure of receiving, via the Internet, a search request including a search condition on an attribute;
a search condition determination procedure of identifying the search condition from the received search request;
a document search execution procedure of searching the collected document database for tagged documents whose attribute information satisfies the search condition;
a list window generation procedure of generating a list window listing the titles of the retrieved tagged documents; and
a document search result screen transmission procedure of returning a document search result screen including the list window.

Because each attribute is extracted from the collected documents according to the registered attribute extraction conditions, there is no need to prepare a database of product information or the like separately from the Web pages. Because no modification or development is required on the electronic commerce site side, the search function can be introduced in a short time.

Because attribute extraction can be automated, the maintenance load, such as updating the database, is reduced. Inconsistencies between the content of the Web pages and the content of the database also do not arise.

Because the attributes to be extracted from the Web pages can be chosen freely, search items can be set to suit the purpose of the site, such as product specifications and prices.

Because the search service is provided from an external server via the Internet, the load on the site operator's server is reduced.

FIG. 1 is a diagram showing a network outline (part 1) of the in-site search system.
FIG. 2 is a diagram showing a network outline (part 2) of the in-site search system.
FIG. 3 is a diagram showing an example of the content structure of the electronic commerce site.
FIG. 4 is a diagram showing a display example of an HTML document of the electronic commerce site.
FIG. 5 is a diagram showing an example (1/7) of HTML document source code of the electronic commerce site.
FIG. 6 is a diagram showing an example (2/7) of HTML document source code of the electronic commerce site.
FIG. 7 is a diagram showing an example (3/7) of HTML document source code of the electronic commerce site.
FIG. 8 is a diagram showing an example (4/7) of HTML document source code of the electronic commerce site.
FIG. 9 is a diagram showing an example (5/7) of HTML document source code of the electronic commerce site.
FIG. 10 is a diagram showing an example (6/7) of HTML document source code of the electronic commerce site.
FIG. 11 is a diagram showing an example (7/7) of HTML document source code of the electronic commerce site.
FIG. 12 is a diagram showing the configuration of the in-site search server related to attribute extraction condition registration.
FIG. 13 is a diagram showing an example of the attribute extraction condition registration screen (character string type 1).
FIG. 14 is a diagram showing an example of the attribute extraction condition registration screen (character string type 2).
FIG. 15 is a diagram showing an example of the attribute extraction condition registration screen (character string type 3).
FIG. 16 is a diagram showing an example of the attribute extraction condition registration screen (character string type 4).
FIG. 17 is a diagram showing an example of the attribute extraction condition registration screen (numeric type 1).
FIG. 18 is a diagram showing an example of the attribute extraction condition registration screen (numeric type 2).
FIG. 19 is a diagram showing an example of the attribute extraction condition registration screen (character string type 5).
FIG. 20 is a diagram showing an example of the attribute name table.
FIG. 21 is a diagram showing an example of the attribute extraction condition storage unit.
FIG. 22 is a diagram showing an example of the attribute extraction condition of character string type 1.
FIG. 23 is a diagram showing an example of the attribute extraction condition of character string type 2.
FIG. 24 is a diagram showing an example of the attribute extraction condition of character string type 3.
FIG. 25 is a diagram showing an example of the attribute extraction condition of character string type 4.
FIG. 26 is a diagram showing an example of the attribute extraction condition of character string type 5.
FIG. 27 is a diagram showing an example of the attribute extraction condition of numeric type 1.
FIG. 28 is a diagram showing an example of the attribute extraction condition of numeric type 2.
FIG. 29 is a diagram showing the in-site document collection and organization processing flow of the in-site search server.
FIG. 30 is a diagram showing the configuration of the in-site search server related to in-site document collection and organization.
FIG. 31 is a diagram showing an example of the collected document list.
FIG. 32 is a diagram showing an example of the collected document database.
FIG. 33 is a diagram showing the attribute extraction processing flow.
FIG. 34 is a diagram showing the configuration of the attribute extraction unit.
FIG. 35 is a diagram showing the character string type attribute information storage processing flow.
FIG. 36 is a diagram showing the hierarchical selector expansion processing flow.
FIG. 37 is a diagram showing the numeric type attribute information storage processing flow.
FIG. 38 is a diagram showing an example of the attribute value list storage unit.
FIG. 39 is a diagram showing the attribute value list generation processing flow.
FIG. 40 is a diagram showing the in-layer attribute value extraction processing flow.
FIG. 41 is a diagram showing an example of the search window of the electronic commerce site.
FIG. 42 is a diagram showing an example of the source code of the search window of the electronic commerce site.
FIG. 43 is a diagram showing an example of the document search result screen.
FIG. 44 is a diagram showing the document search service processing flow of the in-site search server.
FIG. 45 is a diagram showing the configuration of the in-site search server related to the document search service.
FIG. 46 is a diagram showing a display example of an HTML document of the electronic commerce site according to Embodiment 2.
FIG. 47 is a diagram showing an example of HTML document source code of the electronic commerce site according to Embodiment 2.
FIG. 48 is a diagram showing an example of the attribute extraction condition registration screen (character string type 1) according to Embodiment 2.
FIG. 49 is a diagram showing an example of the attribute extraction condition registration screen (numeric type 1) according to Embodiment 2.
FIG. 50 is a diagram showing an example of the attribute extraction condition of character string type 1 according to Embodiment 2.
FIG. 51 is a diagram showing an example of the attribute extraction condition of numeric type 1 according to Embodiment 2.
FIG. 52 is a diagram showing the date conversion processing flow.
FIG. 53 is a diagram showing an example of the collected document database according to Embodiment 2.

The in-site search system according to the present invention provides a search service for content in a specific site by means of an in-site search server attached externally via the Internet.

FIG. 1 is a diagram showing a network outline (part 1) of the in-site search system. The in-site search server 101 is connected via the Internet to the system administrator terminal 102 and the electronic commerce site server 103. The electronic commerce site server 103 is in turn connected via the Internet to the site user terminal 104. The electronic commerce site server 103 is a server operated by a mail-order business that runs a mail-order website and conducts electronic commerce. The site user terminal 104 is a terminal used by a user who browses the mail-order website and purchases products. The in-site search server 101 is, for example, an ASP (application service provider) server that rents out, in an installed state, application software providing a search service within a mail-order website. The system administrator terminal 102 is a terminal for setting information needed for operation, such as the attribute extraction conditions described later. The system administrator terminal 102 and the site user terminal 104 each have a browser that displays HTML documents (an example of tagged documents, of structured documents, of content, and of Web pages) and sends back instructed events.

Besides the configuration of FIG. 1, a configuration in which the system administrator terminal 102 and the in-site search server 101 are connected over a private network is also conceivable. FIG. 2 is a diagram showing a network outline (part 2) of the in-site search system. In this example, the system administrator terminal 102 is connected over a LAN to the in-site search server 101 installed in the same facility.

Next, the content provided on the electronic commerce site server 103 is illustrated. FIG. 3 is a diagram showing an example of the content structure of the electronic commerce site. Under www.example.com/shop/wine/ are introduction screens for wine, a product handled by the mail-order business.

FIG. 4 is a diagram showing a display example of an HTML document of the electronic commerce site. "Category Top > France > Burgundy > Vosne-Romanée" is a breadcrumb list. It expresses the position of the page within the hierarchy of the Web site as a list of links to the higher-level pages. In addition, an image of the product is displayed together with items such as product name, price, producer, vintage, color, taste, quantity, and product description. The page is configured so that entering a quantity and clicking the "To shopping cart" icon proceeds to the purchase procedure. In this example, the price, producer, vintage, color, taste, and product description indicate attributes of the product and are information that can serve as items when searching for products. In particular, producer and taste are character string type attributes, while price and vintage are numeric type attributes.

FIGS. 5 to 11 are diagrams showing an example of the HTML document source code of the electronic commerce site. This example corresponds to the display example of FIG. 4.

Here, an outline of the operation of the in-site search system is described.
(A) Attribute extraction condition registration
As preprocessing, the conditions for extracting attribute information are registered from the system administrator terminal 102 to the in-site search server 101. An example is a condition for extracting attributes such as price and taste from the HTML code of a wine introduction screen.
(B) In-site document collection and organization
Periodically, the in-site search server 101 collects documents that are public information from the electronic commerce site server 103, extracts attribute information from the collected documents, and stores it in a database. For example, the information is organized so that each wine introduction screen can be searched by attributes such as price and taste.
(C) Document search service
In response to a search request from the site user terminal 104, the in-site search server 101 searches the database for the desired documents and provides them as a list of retrieved documents. Because the search request from the site user terminal 104 originates from a search window included in a screen provided by the electronic commerce site server 103, the site user experiences the search as part of the same sequence of screens served by the electronic commerce site server 103, without being aware of accessing the in-site search server 101. For example, the user requests a wine search from the mail-order business's screen and obtains a list of wine search results within the mail-order site from the remote in-site search server.
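The three phases (A) to (C) can be pictured end to end with a minimal Python sketch. Everything concrete in it is an assumption made for illustration: the patterns, the `class` attributes in the markup, the page contents, and the in-memory dictionaries that stand in for crawling the site server and for the collected document database.

```python
import re
from typing import Dict, List

# (A) Registered extraction conditions (hypothetical names and patterns).
CONDITIONS = [
    {"name": "taste", "type": "string",
     "pattern": r'<td class="taste">([^<]+)</td>'},
    {"name": "price", "type": "numeric",
     "pattern": r'<td class="price">([\d,]+) yen</td>'},
]

# Stand-in for the pages a crawler would collect from the site server.
SITE_PAGES = {
    "http://www.example.com/shop/wine/item001.html":
        '<td class="taste">dry</td><td class="price">4,800 yen</td>',
    "http://www.example.com/shop/wine/item002.html":
        '<td class="taste">sweet</td><td class="price">2,500 yen</td>',
    "http://www.example.com/shop/wine/item003.html":
        '<td class="taste">dry</td><td class="price">12,000 yen</td>',
}


def extract_attributes(html: str) -> Dict:
    """(B) Apply every registered condition to one collected page."""
    attributes: Dict = {}
    for cond in CONDITIONS:
        match = re.search(cond["pattern"], html)
        if match is None:
            continue  # pattern absent: no value for this attribute
        value = match.group(1)
        if cond["type"] == "numeric":
            attributes[cond["name"]] = int(value.replace(",", ""))
        else:
            attributes[cond["name"]] = value
    return attributes


# (B) Build the collected document database, keyed by page URL.
database = {url: extract_attributes(html) for url, html in SITE_PAGES.items()}


def search(taste: str, price_max: int) -> List[str]:
    """(C) Return URLs of pages whose attributes satisfy the conditions."""
    return sorted(
        url for url, attrs in database.items()
        if attrs.get("taste") == taste
        and attrs.get("price", price_max + 1) <= price_max
    )


results = search(taste="dry", price_max=5000)
```

In this sketch only the first page is both dry and priced at or below 5,000, so it is the only URL returned.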

The (A) attribute extraction condition registration, (B) in-site document collection and organization, and (C) document search service described above are explained in order. First, (A) attribute extraction condition registration is described.

FIG. 12 is a diagram showing the configuration of the in-site search server related to attribute extraction condition registration.
The attribute extraction condition registration unit 1201 is configured to transmit an attribute extraction condition registration screen to the system administrator terminal 102 and to receive the attribute extraction conditions that are entered on the screen displayed in the browser of the system administrator terminal 102 and sent back. It then stores each attribute name in the attribute name table 1202 and each attribute extraction condition in the attribute extraction condition storage unit 1203.

An example of the attribute extraction condition registration screen follows. FIG. 13 is a diagram showing an example of the attribute extraction condition registration screen (character string type 1). This example handles five character-string-type attributes and five numeric-type attributes. The tags select among the attribute extraction condition registration screens for character string type 1 to character string type 5 and numeric type 1 to numeric type 5. The attribute name is the name that identifies the attribute. The extraction source data indicates the data source that contains the attribute. Selecting "URL" sets the condition that the extraction source data is the URL of the content (for example, an HTML document file), so the attribute value is obtained from the URL string of the content. Selecting "content" sets the condition that the extraction source data is the code written in the content (the HTML document file), so the attribute value is obtained from the source code of the content.

In the extraction condition, a comparison condition for identifying the attribute value within the extraction source data is written as a regular expression. When the attribute has a hierarchical structure, the first-layer to fifth-layer selectors are used to distinguish the layers. A reference variable name ($1, $2, $3, and so on) is entered in each selector. A reference variable refers to a character string enclosed in "()" in the regular expression; the code matched by each pair of parentheses can be referenced as $1, $2, $3, ... in the order in which the parentheses appear. Reference variables are scalar variables held in memory by the program that performs the matching. The present invention makes use of this function: the regular expression is written so that the attribute value becomes a reference variable value, and that reference variable is designated in a selector. In this example, $2, $4, $6, and $8, corresponding to the second, fourth, sixth, and eighth parentheses, are used in order from the highest-level attribute value, while $1, $3, $5, and $7, corresponding to the first, third, fifth, and seventh parentheses, are not used.
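The relationship between parentheses, reference variables, and selectors can be illustrated with a minimal Python sketch. The breadcrumb markup, the pattern, and the helper `resolve` are hypothetical and stand in for the code of FIG. 5, which is not reproduced here; Python capture groups play the role of the reference variables $1, $2, and so on.

```python
import re

# Hypothetical breadcrumb markup (assumption; not the actual code of FIG. 5).
html = ('<a href="/fr/">France</a> &gt; '
        '<a href="/fr/bourgogne/">Burgundy</a> &gt; '
        '<a href="/fr/bourgogne/vosne/">Vosne-Romanee</a>')

# Odd-numbered groups capture link targets (unused);
# even-numbered groups capture the attribute values of each layer.
pattern = (r'<a href="([^"]*)">([^<]*)</a> &gt; '
           r'<a href="([^"]*)">([^<]*)</a> &gt; '
           r'<a href="([^"]*)">([^<]*)</a>')

# First-layer to third-layer selectors, highest layer first.
selectors = ["$2", "$4", "$6"]


def resolve(selector, match):
    """Map a reference variable name such as "$2" to the text of capture group 2."""
    return match.group(int(selector.lstrip("$")))


match = re.search(pattern, html)
layers = [resolve(s, match) for s in selectors]
print(layers)
```

Only the groups named in the selectors contribute attribute values; the remaining groups exist solely so that the pattern can describe the surrounding markup.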

As described above, the extraction condition uses a regular expression to specify the description pattern related to the attribute and the reference portions within that pattern. Each selector is set to the name of the reference variable that stores the code corresponding to a reference portion specified by the regular expression.

Finally, pressing the setting button sends the attribute name and the attribute extraction condition back to the in-site search server 101.

FIG. 14 is a diagram showing an example of the attribute extraction condition registration screen (character string type 2). FIG. 15 is a diagram showing an example of the attribute extraction condition registration screen (character string type 3). When the attribute value is not hierarchical (a single layer), a reference variable is set only in the first-layer selector.

FIG. 16 is a diagram showing an example of the attribute extraction condition registration screen (character string type 4). When several attribute values are listed in parallel, several reference variables are entered in the first-layer selector, separated by commas (an example of a parallel delimiter). In this example, the parallel attribute values are referenced using $1, $3, $5, $7, and $9, corresponding to the first, third, fifth, seventh, and ninth parentheses. $2, $4, $6, and $8, corresponding to the second, fourth, sixth, and eighth parentheses, are not used.

FIG. 17 is a diagram showing an example of the attribute extraction condition registration screen (numeric type 1). FIG. 18 is a diagram showing an example of the attribute extraction condition registration screen (numeric type 2). For a numeric-type attribute, either the numeric type or the date type can be specified as the extraction format. The numeric type is extracted as a single number, and the date type is extracted as three numbers: year, month, and day. This embodiment describes the numeric type; the date type is described in the second embodiment. In addition, a trailing fixed string and comma separation can be specified as the display format. The display format is not an attribute extraction condition but a condition on display; it is accepted on the same screen only for convenience.

FIG. 19 is a diagram showing an example of the attribute extraction condition registration screen (character string type 5). An extraction condition can also be set for text consisting of several sentences, such as a product description.

FIG. 20 is a diagram showing an example of the attribute name table. The attribute name table 1202 stores the attribute names of the character string type 1 to character string type 5 attributes and of the numeric type 1 to numeric type 5 attributes received on the attribute extraction condition registration screen.

FIG. 21 is a diagram showing an example of the attribute extraction condition storage unit. The attribute extraction condition storage unit 1203 stores the attribute extraction conditions of the character string type 1 to character string type 5 attributes and of the numeric type 1 to numeric type 5 attributes received on the attribute extraction condition registration screen.

FIGS. 22 to 26 are diagrams showing examples of the attribute extraction conditions for character string type 1 to character string type 5. For each character-string-type attribute, the extraction source data, the extraction condition (regular expression), and the first-layer to fifth-layer selectors received on the attribute extraction condition registration screen are stored. The valid flag is set to ON when the attribute extraction condition has been received and to OFF when it has not. A selector for which no reference variable was set stores a code indicating that it is unset.

FIGS. 27 and 28 are diagrams showing examples of the attribute extraction conditions for numeric type 1 and numeric type 2. As with the character string type, the extraction source data, the extraction condition (regular expression), and the first-layer to third-layer selectors received on the attribute extraction condition registration screen are stored for each numeric-type attribute. Likewise, the valid flag is set to ON when the attribute extraction condition has been received and to OFF when it has not, and a selector for which no reference variable was set stores a code indicating that it is unset.

Next, (B) in-site document collection and organization is described. This processing is executed at a timing suited to reflecting content updates on the electronic commerce site server 103. For example, where content is updated daily, the in-site document collection and organization processing is performed once a day. It is desirable to run the processing immediately after a content update, so starting it automatically on a regular schedule is effective.

FIG. 29 is a diagram showing the in-site document collection and organization processing flow of the in-site search server. In crawling (document collection unit processing) (S2901), documents (examples of content) below the starting URL are collected. For each collected document, duplication processing (S2903), document filtering processing (S2904), and attribute extraction processing (S2905) are repeated (S2902) to generate a collected document database corresponding to the electronic commerce site server 103. When all collected documents have been entered into the database (S2906), attribute value list generation processing (S2907) generates the lists of attribute values.

The module configuration and data flow are shown below, and each process is described in detail. FIG. 30 is a diagram showing the configuration of the in-site search server related to in-site document collection and organization. The crawler (document collection unit) 3001, which performs crawling (document collection unit processing) (S2901), collects the document files located below a predetermined starting URL. The crawler (document collection unit) 3001 is implemented with well-known conventional technology. The URL and acquisition date and time of each collected document are stored in the collected document list 3002, and the collected document file is stored in the collected document file storage unit 3003. When a document file is to be read, it can be identified by its URL in the collected document list 3002.

FIG. 31 is a diagram showing an example of the collected document list. The collected document list 3002 stores, for each collected document, the collected document URL in association with the acquisition date.

In the duplication processing (S2903) by the duplication unit 3004, the collected document URL and acquisition date of the collected document are read from the collected document list 3002 and stored in the collected document database 3007 as management data for that document.

The configuration of the collected document database 3007 is as follows. FIG. 32 is a diagram showing an example of the collected document database. Management data is provided for each collected document, and each management data record stores the collected document URL, acquisition date, title, body text, character string type 1 to character string type 5 attribute information, and numeric type 1 to numeric type 5 attribute information.

In the document filtering processing (S2904) by the document filter 3005, the collected document file is identified by the collected document URL and read, and the title and body text (the character strings displayed by the document) are extracted. The document filter 3005 is also implemented with conventional technology. The extracted title and body text are written to the management data of the collected document in the collected document database 3007.

In the attribute extraction processing (S2905) by the attribute extraction unit 3006, attribute information is extracted from the collected documents in the collected document file storage unit 3003 according to the attribute extraction conditions in the attribute extraction condition storage unit 1203. In doing so, matching processing is performed to obtain the reference variable values according to the regular expression. The processing flow and module configuration are described in detail below with reference to the drawings.

FIG. 33 is a diagram showing the attribute extraction processing flow. FIG. 34 is a diagram showing the configuration of the attribute extraction unit. The following processing is repeated for each attribute extraction condition (S3301). When the valid flag (S3302) is OFF, extraction is not performed for that attribute. When the valid flag (S3302) is ON, the reference variable count determination processing (S3303) of the reference variable count determination unit 3401 determines the number of reference variables. Specifically, it finds the maximum of the trailing numbers (1, 2, 3, ...) of the reference variable names contained in all selectors. In the example of FIG. 22, "$2" is set in the first-layer selector, "$4" in the second-layer selector, "$6" in the third-layer selector, and "$8" in the fourth-layer selector, so the largest trailing number among (2, 4, 6, 8), namely 8, is the number of reference variables. In the example of FIG. 25, "$1,$3,$5,$7,$9" is set in the first-layer selector, so the largest trailing number among (1, 3, 5, 7, 9), namely 9, is the number of reference variables.

As another method for the reference variable count determination processing (S3303), the number of reference variables can also be obtained by counting the pairs of round parentheses ("(" and ")") contained in the extraction condition (regular expression) of the attribute extraction condition.
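Both ways of determining the number of reference variables can be sketched in Python as follows. The function names are hypothetical, and an unset selector is represented here by an empty string, which is an assumption (the text only says that a code indicating "unset" is stored). The second function relies on the `groups` attribute of a compiled Python pattern instead of counting parenthesis characters by hand.

```python
import re


def count_from_selectors(selectors):
    """Largest reference variable index across all selectors (e.g. "$1,$3" gives 3)."""
    indices = [int(n) for s in selectors for n in re.findall(r"\$(\d+)", s)]
    return max(indices, default=0)


def count_from_pattern(pattern):
    """Number of capturing groups as reported by Python's regex compiler."""
    return re.compile(pattern).groups


# Selector layout in the style of FIG. 22 (unset selectors as empty strings).
print(count_from_selectors(["$2", "$4", "$6", "$8", ""]))   # 8
# Selector layout in the style of FIG. 25 (parallel values in one selector).
print(count_from_selectors(["$1,$3,$5,$7,$9"]))             # 9
# Hypothetical pattern with two capturing groups.
print(count_from_pattern(r"(\d+)-(\d+)"))                   # 2
```

Letting the regex compiler report the group count avoids miscounting escaped parentheses such as `\(`, which a plain character count would include.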

Next, the target character string determination processing (S3304) of the target character string determination unit 3402 identifies the character string to be matched against the regular expression. First, the extraction source data contained in the attribute extraction condition is obtained. When the extraction source data designates the content, the source code of the collected document file identified by the collected document URL is taken as the target character string. When the extraction source data designates the URL, the collected document URL itself is taken as the target character string.
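The choice of target character string amounts to a two-way branch, sketched below in Python. The function `select_target`, the `load_source` callback, and the example URL are hypothetical names introduced for illustration; the in-memory dictionary stands in for the collected document file storage unit.

```python
from typing import Callable


def select_target(source_kind: str, url: str,
                  load_source: Callable[[str], str]) -> str:
    """Return the string that the regular expression is matched against."""
    if source_kind == "content":
        # Match against the source code of the collected document file.
        return load_source(url)
    if source_kind == "URL":
        # Match against the collected document URL itself.
        return url
    raise ValueError(f"unknown extraction source: {source_kind!r}")


# Hypothetical stand-in for the stored document files, keyed by URL.
stored_files = {"http://shop.example/wine/w0059.html": "<html>...</html>"}
doc_url = "http://shop.example/wine/w0059.html"

print(select_target("URL", doc_url, stored_files.__getitem__))
print(select_target("content", doc_url, stored_files.__getitem__))
```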

The matching determination unit 3403 reads the extraction condition (regular expression) contained in the attribute extraction condition (S3305) and performs matching with the regular expression, the number of reference variables, and the target character string as inputs (S3306). The matching processing finds the location that matches the description pattern given by the regular expression and stores the code corresponding to each reference portion in a reference variable. The reference variables are held in a variable storage area (not shown) in the attribute extraction unit 3006. The matching processing itself is implemented with conventional technology; for example, the Perl language provides a match function based on regular expressions.

Regular expressions are briefly explained here. "\" escapes a special character. "." matches any single character. "*" indicates zero or more repetitions. "?" indicates zero or one repetition. "^" inside "[]" negates the character class. The portion enclosed in "()" is stored in the reference variables $1, $2, $3, ....
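As a minimal illustration of these constructs, the following Python sketch applies them to a hypothetical price line; the text and the pattern are assumptions and are not taken from the figures. Python's `match.group(1)` corresponds to the reference variable $1.

```python
import re

# Hypothetical fragment of a product page (assumption).
text = "price: <b>3,150</b> yen (tax incl.)"

# " ?"      zero or one space
# "[^<]*"   zero or more characters other than "<" (negated class)
# "( ... )" the enclosed part becomes reference variable $1
# "\(" "\)" escaped literal parentheses
# ".*"      zero or more arbitrary characters
pattern = r"price: ?<b>([^<]*)</b> yen \(.*\)"

match = re.search(pattern, text)
price_text = match.group(1)
print(price_text)
```

The negated class `[^<]*` keeps the captured value from running past the closing tag, which is a common way to bound a capture inside markup.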

For the character string type 1 attribute "origin", the reference variable values are obtained from the code at 501 to 504 in FIG. 5. Similarly, values are obtained from the code at 703 and 704 in FIG. 7 for the character string type 2 attribute "producer"; from the code at 707 and 708 for the character string type 3 attribute "color"; from the code at 801 and 802 in FIG. 8 for the character string type 4 attribute "taste"; from the code running from 901 in FIG. 9 to 1001 in FIG. 10 for the character string type 5 attribute "product description"; from the code at 701 and 702 in FIG. 7 for the numeric type 1 attribute "price"; and from the code at 705 and 706 for the numeric type 2 attribute "vintage".

After matching, if the attribute type is the character string type (S3307), the character-string-type attribute information generation unit 3404 performs character-string-type attribute information generation processing (S3308); if the attribute type is the numeric type (S3307), the numeric-type attribute information generation unit 3405 performs numeric-type attribute information generation processing (S3309). Details are given later. The processing ends once it has been performed for all attribute extraction conditions (S3310).

The character-string-type attribute information generation processing (S3308) is described next. With the character string type 1 attribute extraction condition shown in FIG. 22, for example, reference variables are set in the selectors from the first layer to the fourth layer, so the attribute values are also hierarchical. As in "France/Burgundy/Vosne-Romanée" in the character string type 1 attribute information of FIG. 32, the first-layer, second-layer, and third-layer attribute values are arranged with the layer delimiter code (/) between them. With the character string type 4 attribute extraction condition shown in FIG. 25, several reference variables are set in parallel in the first-layer selector, so several attribute values are listed. As in "full-bodied delicate fruit-rich" in the character string type 4 attribute information of FIG. 32, several first-layer attribute values are arranged with the parallel delimiter code (a space) between them. In other words, attribute information listing the attribute values in a single layer is generated.

FIG. 35 is a diagram showing the character-string-type attribute information storage processing flow. The following processing is repeated for each layer selector in order from the top (first layer, second layer, third layer, ...) (S3501). In the layer selector expansion processing (S3502), the reference variable values for the selector of the current layer are expanded into the character-string-type attribute information; this is detailed with FIG. 36. It is then determined whether a reference variable is set in the next layer selector (S3503). If one is set, the layer delimiter code (/) is written (S3504) and processing moves to the selector of the next layer (S3501). If none is set, the processing ends.

FIG. 36 is a diagram showing the layer selector expansion processing flow. The following processing is repeated for each reference variable set in the selector (S3601). The reference variable value is read and written to the character-string-type attribute information as an attribute value (S3602). If a next reference variable is set (S3603), the parallel delimiter code (a space) is written (S3604) and processing moves to the next reference variable. If no next reference variable is set, the processing ends.
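The two nested loops of FIG. 35 and FIG. 36 can be sketched in Python as below. The function names are hypothetical, the captured values are passed in as a tuple such as the one returned by `match.groups()`, and an unset selector is represented by an empty string (an assumption).

```python
LAYER_SEP = "/"      # layer delimiter code
PARALLEL_SEP = " "   # parallel delimiter code


def expand_selector(selector, groups):
    """Join the values of every reference variable in one selector (FIG. 36 style)."""
    names = [name.strip() for name in selector.split(",")]
    # "$1" refers to groups[0], "$2" to groups[1], and so on.
    return PARALLEL_SEP.join(groups[int(name.lstrip("$")) - 1] for name in names)


def build_attribute_info(selectors, groups):
    """Walk the selectors from the top layer and stop at the first unset one (FIG. 35 style)."""
    parts = []
    for selector in selectors:
        if not selector:
            break
        parts.append(expand_selector(selector, groups))
    return LAYER_SEP.join(parts)


# Hierarchical case: even-numbered groups hold the layer values.
hier_groups = ("/fr/", "France", "/fr/b/", "Burgundy", "/fr/b/v/", "Vosne-Romanee")
print(build_attribute_info(["$2", "$4", "$6", "", ""], hier_groups))

# Parallel case: several reference variables in the first-layer selector.
par_groups = ("full-bodied", "x", "delicate", "y", "fruit-rich")
print(build_attribute_info(["$1,$3,$5", "", "", "", ""], par_groups))
```

Because the space serves as the parallel delimiter, this sketch assumes that individual attribute values contain no spaces themselves.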

Next, the numeric-type attribute information generation processing (S3309) is described. When the extraction type is the numeric type, the character code is converted into binary code. The case where the extraction type is the date type is described later in the second embodiment.

FIG. 37 is a diagram showing the numeric-type attribute information storage processing flow. The date conversion processing (S3705) performed when the extraction format is the date type (S3701) is described in the second embodiment. For the numeric type, the reference variable is read from the first-layer selector (S3702) and its value is converted to binary (S3703). The converted binary data is then written to the numeric-type attribute information as the attribute value (S3704).
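A minimal Python sketch of the conversion step follows. The function name `to_number` is hypothetical, and stripping thousands separators before conversion is an assumption; the text states only that the character code is converted to binary.

```python
def to_number(raw: str) -> int:
    """Convert a captured numeric string into an integer value.

    Assumption: thousands separators and surrounding whitespace are removed
    before conversion.
    """
    cleaned = raw.strip().replace(",", "")
    return int(cleaned)


print(to_number("3,150"))   # a price captured as text
print(to_number(" 2005 "))  # a vintage year captured as text
```

`int()` raises `ValueError` for non-numeric input, so a caller can detect a captured string that is not a number and skip that attribute.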

This concludes the description of generating the collected document database.

Next, the attribute value list generation processing (S2907) by the attribute value list generation unit 3008 is described. This processing generates the lists of attribute values displayed by the search service. That is, all attribute values stored in the collected document database 3007 are listed with duplicates removed. In particular, when the attribute values form a hierarchy, the attribute value list storage unit 3009 is generated so as to define the relationship between upper and lower attribute values. The attribute value list generation processing (S2907) is performed for each of the character string type 1 to character string type 5 attributes and each of the numeric type 1 to numeric type 5 attributes.

FIG. 38 is a diagram showing an example of the attribute value list storage unit. This example shows a first-layer attribute value list 3801, a second-layer attribute value list 3802, and a third-layer attribute value list 3803. When the hierarchy extends to the fifth layer, fourth-layer and fifth-layer attribute value lists are also provided. When the attribute is not hierarchical, the first-layer attribute value list 3801 alone suffices.

Each list stores, for each attribute value, the upper-layer code, the code within the current layer, and the attribute value of the current layer in association with one another. The upper-layer code is the code of the parent in the layer above. In the first-layer attribute value list 3801, for example, there is no upper layer, so the list records that there is no parent. Codes such as A01 and A02 are stored as identification codes within the first layer, together with the corresponding first-layer attribute values "France", "Italy", and so on. In the second-layer attribute value list 3802, an upper layer (the first layer) exists, so the first-layer code of the parent is stored as the upper-layer code. For example, the attribute value "Bordeaux", whose second-layer code is B01, has the parent attribute value "France", so the first-layer code A01 corresponding to "France" is stored as its upper-layer code. The third-layer attribute value list 3803 likewise stores, as the upper-layer code, the code identifying the parent attribute value.

FIG. 39 is a diagram showing the attribute value list generation processing flow. Starting from the top layer (S3901), the in-layer attribute value extraction processing (S3902) extracts the attribute values of the current layer and registers them in that layer's attribute value list. If the layer contained any attribute values (S3903), processing moves to the next lower layer; if it contained none, the processing ends.

FIG. 40 is a diagram showing the in-layer attribute value extraction processing flow. The following processing is repeated for each collected document managed in the collected document database 3007 (S4001). The attribute value of the current layer is read from the relevant attribute information in the management data of the collected document (S4002). For example, when the attribute value for the collected document w0059.html is read in the second-layer in-layer attribute value extraction during attribute value list generation for the character string type 1 attribute, the attribute value "Burgundy", delimited by the first and second layer delimiter codes (/), is read from the character string type 1 attribute information shown in FIG. 32. Next, the upper-layer code corresponding to the upper-layer attribute value is identified (S4003). In this example, the first-layer attribute value, that is, "France", which precedes the first layer delimiter code (/), is read, and the layer code "A01" corresponding to "France" is read from the first-layer attribute value list 3801. It is then determined whether that upper-layer code and the attribute value of the current layer are already registered in the attribute value list of the current layer (S4005). In this example, it is determined whether the second-layer attribute value list 3802 contains a record of "A01", "(any)", "Bordeaux". If such a record already exists, processing moves to the next collected document. If not, a new code for the current layer is assigned and stored together with the upper-layer code and the attribute value of the current layer (S4006). In this example, the new layer code "B01" is assigned to the pair of upper-layer code "A01" and layer attribute value "Bordeaux". The processing ends once all collected documents have been processed (S4008).

As another example, when the first-layer in-layer attribute value extraction process reads the attribute value for the collected document w0059.html while generating the attribute value list for the character string type 4 attribute, it reads “full-body delicate fruit-rich” from the character string type 4 attribute information shown in FIG. 32. The following processing (S4004 to S4006) is then performed for each of the attribute values “full-body”, “delicate”, and “fruit-rich”, which are separated by the parallel delimiter code (a space). If the attribute has only a single layer and therefore no upper layer, the upper layer code is treated as “none”.
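The layer-by-layer list generation of FIG. 39 and FIG. 40 can be sketched roughly as follows. This is only an illustration under stated assumptions: the function name, the in-memory dictionaries, and the code numbering scheme (one letter per layer plus a two-digit sequence number, inferred from the examples “A01” and “B01”) are not taken from the specification, and an attribute is assumed to use either the hierarchy delimiter or the parallel delimiter, not both.

```python
import string

NO_UPPER = "none"  # upper layer code used when there is no upper layer


def build_attribute_value_lists(attribute_infos, hier_sep="/", parallel_sep=None):
    """Return one dict per layer: {(upper_layer_code, value): layer_code}.

    attribute_infos holds one attribute information string per collected
    document, e.g. "France/Bordeaux". All names here are hypothetical.
    """
    lists = []
    # S3901: process layers from the top downward.
    for layer, prefix in enumerate(string.ascii_uppercase):
        current = {}
        for info in attribute_infos:                      # S4001
            parts = info.split(hier_sep)
            if layer >= len(parts):
                continue                                  # no value in this layer
            # S4003: resolve the upper layer code by walking down from layer 0.
            upper = NO_UPPER
            for depth in range(layer):
                upper = lists[depth][(upper, parts[depth])]
            # S4002: read the value(s) of this layer.
            if parallel_sep is None:
                values = [parts[layer]]
            else:
                values = parts[layer].split(parallel_sep)
            for value in values:
                if not value:
                    continue
                key = (upper, value)
                if key not in current:                    # S4005
                    # S4006: allocate a new layer code (assumed numbering scheme).
                    current[key] = f"{prefix}{len(current) + 1:02d}"
        if not current:                                   # S3903: nothing found, stop
            break
        lists.append(current)
    return lists


origin_lists = build_attribute_value_lists(
    ["France/Bordeaux", "France/Burgundy", "Italy/Tuscany", "France/Bordeaux"]
)
taste_lists = build_attribute_value_lists(
    ["full-body delicate fruit-rich"], parallel_sep=" "
)
```

With the sample data above, `origin_lists` holds two layers and `taste_lists` holds a single layer whose upper layer code is always `"none"`.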

Finally, (c) the document search service is described. When a search is requested, on the browser of the site user terminal 104, from the search window included in a screen provided by the site of the electronic commerce site server 103, the document search service of the in-site search server 101 is started.

FIG. 41 shows an example of the search window of the electronic commerce site. In this example, entering a search keyword and clicking the search button causes the browser to access the search URL with parameters attached. This starts the document search service of the in-site search server 101.

FIG. 42 shows an example of the source code of the search window of the electronic commerce site. For example, when a search is requested with “ロマネ” (Romanée) as a free keyword, the in-site search server 101 is accessed with the parameterized search URL “http://bizsearchasp.accelatech.com/bizasp/index.php?q=%83%8D%83%7D%83l&corpId=atc000001&en=1”. Here, “http://bizsearchasp.accelatech.com/bizasp/index.php” is the search URL, “q=%83%8D%83%7D%83l” is the encoded search condition, and “corpId=atc000001” is the company ID that identifies the electronic commerce site server 103. The search URL is an example of a search request received by the in-site search server 101.
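As a rough illustration, the parameterized search URL above can be split into its parts with Python's standard `urllib.parse` module. The specification does not name the character encoding; Shift_JIS is assumed here because the bytes `%83%8D%83%7D%83l` decode to “ロマネ” under that encoding. The function name and the returned dictionary keys are hypothetical, and the `en` parameter is left uninterpreted because the text does not explain it.

```python
from urllib.parse import parse_qs, quote, urlsplit

SEARCH_REQUEST = (
    "http://bizsearchasp.accelatech.com/bizasp/index.php"
    "?q=%83%8D%83%7D%83l&corpId=atc000001&en=1"
)


def parse_search_request(url, encoding="shift_jis"):
    """Split a parameterized search URL into search URL, keyword, and company ID."""
    parts = urlsplit(url)
    # parse_qs percent-decodes each value using the given encoding (assumed Shift_JIS).
    params = parse_qs(parts.query, encoding=encoding)
    return {
        "search_url": f"{parts.scheme}://{parts.netloc}{parts.path}",
        "keyword": params["q"][0],
        "corp_id": params["corpId"][0],
    }


request = parse_search_request(SEARCH_REQUEST)
# Encoding the keyword the other way round reproduces the q parameter.
encoded_keyword = quote("ロマネ", encoding="shift_jis")
```

Here `request["keyword"]` is the free keyword and `request["corp_id"]` is the value a server could use to look up the corresponding electronic commerce site.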

As the result of the document search service, the in-site search server 101 returns a document search result screen to the site user terminal 104. FIG. 43 shows an example of the document search result screen. Reference numeral 4301 denotes a condition window that shows the search conditions and display conditions, 4302 denotes a list window that lists the titles and other details of the retrieved documents, and 4303 denotes an attribute window that displays the attribute value list of each attribute.

The list window 4302 is configured to display the title and the beginning of the body text of each retrieved document. Clicking any document title in the browser accesses the URL of that document. The site user terminal 104 can thus access the desired document on the electronic commerce site server 103.

The attribute window 4303 is configured to display the attribute values of each attribute together with the number of retrieved documents that have each attribute value. Clicking any attribute value accesses the search URL again with a parameter that sets that attribute value as the search condition for its attribute. The retrieved documents can thus be narrowed down.

The operation of the document search service is described next. FIG. 44 shows the document search service processing flow of the in-site search server. FIG. 45 shows the configuration of the in-site search server related to the document search service.

In the search request reception process (S4401), the search request reception unit 4501 waits for access to the search URL and accepts access to the parameterized search URL from the site user terminal 104. When access is accepted, the electronic commerce site determination unit 4502 identifies the electronic commerce site server 103 in the electronic commerce site determination process (S4402). Specifically, it obtains the company ID from the parameters included in the search URL and identifies the electronic commerce site server 103 corresponding to that company ID. Next, in the search condition determination process (S4403), the search condition determination unit 4503 identifies the search conditions, likewise from the parameters included in the search URL. Then, in the document search execution process (S4404), the document search execution unit 4504 searches the collected document database corresponding to the electronic commerce site server 103 for documents that satisfy the search conditions. For a free keyword, a document is judged to match if its body text contains a portion that matches the keyword. The URLs of the matching documents (collected document URLs) are identified as the search result. In the list window generation process (S4405), the list window generation unit 4505 obtains the title and body text corresponding to the URL of each retrieved document from the collected document database 3007 and generates a window that displays a list consisting of each title and the beginning of the body text. The window is configured so that clicking (selecting) a document title causes the browser to access the URL of that document. In the attribute window generation process (S4406), the attribute window generation unit 4506 generates the attribute window so that it displays the list of attribute values stored for each attribute in the attribute value list storage unit 3009. In addition, for each attribute value, the search is run again with an added condition that sets that value for its attribute, yielding the number of documents that satisfy the AND of the attribute value and the free keyword. That count is displayed alongside the corresponding attribute value; in this example it is shown as a number in parentheses. The screen is also configured so that, when an attribute value is selected by clicking, the value is added to the parameters as the condition for that attribute and the browser accesses the search URL with those parameters. In the document search result screen reply process (S4407), the document search result screen reply unit 4507 generates a condition window that displays the search conditions and display conditions, and returns a document search result screen consisting of the condition window, the list window, and the attribute window to the site user terminal 104.
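The document search execution (S4404) and the per-value document counts of the attribute window (S4406) can be sketched as below. This is a minimal in-memory illustration, not the patented implementation: the `CollectedDocument` class, the function names, the sample documents, and their URLs (other than the file name w0059.html) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CollectedDocument:
    url: str
    title: str
    body: str
    # attribute name -> set of attribute values held by this document
    attributes: dict = field(default_factory=dict)


def search(documents, keyword, conditions=None):
    """S4404: keep documents whose body contains the keyword and whose
    attributes satisfy every condition (AND)."""
    conditions = conditions or {}
    hits = []
    for doc in documents:
        if keyword and keyword not in doc.body:
            continue
        if all(value in doc.attributes.get(name, set())
               for name, value in conditions.items()):
            hits.append(doc)
    return hits


def attribute_window(documents, keyword, attribute_value_lists):
    """S4406: for each attribute value, re-run the search with that value
    added as a condition and record the number of matching documents."""
    return {
        name: {value: len(search(documents, keyword, {name: value}))
               for value in values}
        for name, values in attribute_value_lists.items()
    }


docs = [
    CollectedDocument("http://www.example.com/shop/w0059.html", "Wine A",
                      "Romanée red wine", {"color": {"red"}}),
    CollectedDocument("http://www.example.com/shop/w0060.html", "Wine B",
                      "Romanée white wine", {"color": {"white"}}),
    CollectedDocument("http://www.example.com/shop/w0061.html", "Wine C",
                      "Chablis white wine", {"color": {"white"}}),
]
counts = attribute_window(docs, "Romanée", {"color": ["red", "white"]})
```

Re-running the search once per attribute value mirrors the description directly; a production system would more likely compute all counts in a single pass over the hits.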

In the above example, the initial search window accepts a free keyword as shown in FIG. 41, but it is also effective to configure it to accept a condition for each attribute. For example, for the character string type 3 attribute named “color”, the user can be asked to choose red or white, and a search condition specifying the color can be included in the parameters. For an attribute with a hierarchy, such as the character string type 1 attribute named “origin”, the document search is performed with, for example, the attribute value “France” as the search condition for its layer, and, when the attribute window is generated, the lower-layer attribute values whose parent is “France”, such as “Bordeaux”, are looked up and displayed as a list.

It is also effective to provide, on the document search result screen described above, a search window for setting search conditions again, so that a free keyword or a condition for each attribute is accepted and the search is repeated under that condition.

For numeric-type attributes, searches that exploit their numeric nature, such as magnitude comparison and range specification, are effective.

This embodiment describes an example in which the extraction source data is a URL, and an example in which the extraction format of a numeric-type attribute is the date type.

FIG. 46 shows a display example of an HTML document of the electronic commerce site according to the second embodiment. This document is an example of an HTML document under www.example.com/press/ in the content structure of FIG. 2. FIG. 47 shows an example of the HTML document source code corresponding to FIG. 46.

FIG. 48 shows an example of the attribute extraction condition registration screen (character string type 1) according to the second embodiment. This example shows the condition settings used when directory names contained in the URL of a document are extracted and used as an attribute. The first-layer selector refers to the first-level directories, such as “shop” and “press”, and the second-layer selector refers to the second-level directories, such as “2009” and “2008”.

FIG. 49 shows an example of the attribute extraction condition registration screen (numeric type 1) according to the second embodiment. Within the portion of the HTML document that states the release date, the first-layer selector refers to the year, the second-layer selector refers to the month, and the third-layer selector refers to the day.

FIGS. 50 and 51 show the attribute extraction conditions for character string type 1 and numeric type 1 in the second embodiment.

For character string type 1 of the second embodiment, because the extraction source data is a URL, the target character string determination process (S3304) of the target character string determination unit 3402 of the first embodiment determines the collected document URL itself to be the target character string, and the matching determination (S3306) by the matching determination unit 3403 applies the regular expression to the collected document URL.
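Matching a regular expression against the collected document URL to pull out directory names might look like the sketch below. The actual extraction conditions are given in FIG. 50 and are not reproduced in the text, so both patterns, the function name, and the sample file names are assumptions; the captured group plays the role of the reference variable $1, and layers are joined with the hierarchy delimiter code (/).

```python
import re

# Hypothetical extraction conditions: each pattern captures one directory
# level of the URL in group 1 (the counterpart of reference variable $1).
FIRST_LAYER = re.compile(r"^https?://[^/]+/([^/]+)/")
SECOND_LAYER = re.compile(r"^https?://[^/]+/[^/]+/([^/]+)/")


def extract_directory_attribute(collected_document_url):
    """Return the directory attribute, with layers joined by '/'."""
    values = []
    for pattern in (FIRST_LAYER, SECOND_LAYER):
        match = pattern.search(collected_document_url)
        if match is None:
            break  # no value in this layer, so no lower layers either
        values.append(match.group(1))
    return "/".join(values)


press_attribute = extract_directory_attribute(
    "http://www.example.com/press/2009/release01.html"
)
shop_attribute = extract_directory_attribute(
    "http://www.example.com/shop/index.html"
)
```

Requiring a trailing `/` after each captured segment keeps a file name such as `index.html` from being mistaken for a directory.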

For numeric type 1 of the second embodiment, the extraction condition is determined to be the date type in S3701 of FIG. 37, and the date conversion process (S3705) is performed.

In the date conversion process (S3705), the character codes entered separately for the year, month, and day are converted into a single binary code. This binary code is a number whose first and second decimal digits (counting from the least significant digit) represent the day, whose third and fourth digits represent the month, and whose fifth through eighth digits represent the year.

FIG. 52 shows the date conversion processing flow. The reference variable is read from the first-layer selector (S5201), and the first-layer reference variable value (character code) is converted to binary (S5202). Likewise, the reference variable is read from the second-layer selector (S5203), and the second-layer reference variable value (character code) is converted to binary (S5204). The reference variable is then read from the third-layer selector (S5205), and the third-layer reference variable value (character code) is converted to binary (S5206). Next, the sum (first binary value × 10000) + (second binary value × 100) + (third binary value) is calculated, with the multipliers expressed in decimal (S5207). Finally, the sum is written to the attribute information (S5208).
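The arithmetic of the date conversion is small enough to show directly. The sketch below assumes the three reference variable values arrive as digit strings; the function names are hypothetical, and the range check is included only to illustrate why a single number suits the magnitude and range searches mentioned for numeric-type attributes.

```python
def convert_date(year_text, month_text, day_text):
    """Combine year, month, and day strings into one integer (S5201 to S5207)."""
    year = int(year_text)     # S5202: first-layer value to a number
    month = int(month_text)   # S5204: second-layer value to a number
    day = int(day_text)       # S5206: third-layer value to a number
    return year * 10000 + month * 100 + day  # S5207


def in_date_range(value, start, end):
    """Range search on converted dates: plain integer comparison suffices."""
    return start <= value <= end


release_date = convert_date("2009", "08", "31")
```

Because the year occupies the most significant digits, ordering the integers orders the dates, so `in_date_range(release_date, 20090101, 20091231)` tests whether the release falls in 2009.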

An example of the collected document database 3007 obtained by the above processing follows. FIG. 53 shows an example of the collected document database according to the second embodiment.

The use of attribute information by the document search service is the same as in the first embodiment.

The above description uses an HTML document as an example of a tagged document, but the invention is also effective for other structured documents (documents that indicate their structure by attaching tags to the text), such as SGML documents and XML documents. These structured documents are written in a markup language. It is also effective for other tagged documents, such as PDF documents.

The in-site search server 101 is a computer, and each of its elements can execute its processing by means of a program. The program can be stored on a storage medium and read from the storage medium by the computer.

101 In-site search server
102 System administrator terminal
103 Electronic commerce site server
104 Site user terminal
1201 Attribute extraction condition registration unit
1202 Attribute name table
1203 Attribute extraction condition storage unit
3001 Crawler (document collection unit)
3002 Collected document list
3003 Collected document file storage unit
3004 Duplication unit
3005 Document filter
3006 Attribute extraction unit
3007 Collected document database
3008 Attribute value list generation unit
3009 Attribute value list storage unit
3401 Reference variable count determination unit
3402 Target character string determination unit
3403 Matching determination unit
3404 Character string type attribute information generation unit
3405 Numeric type attribute information generation unit
4501 Search request reception unit
4502 Electronic commerce site determination unit
4503 Search condition determination unit
4504 Document search execution unit
4505 List window generation unit
4506 Attribute window generation unit
4507 Document search result screen reply unit