JPH09160849A

Info

Publication number: JPH09160849A
Application number: JP7315623A
Authority: JP
Inventors: Masaaki Naoi; 昌明直井; Akira Oi; 明大井; Masaji Muranaka; 正次村中; Hiroshi Cho; 宏張
Original assignee: N T T SOFTWARE KK; Nippon Telegraph and Telephone Corp; NTT Software Corp
Current assignee: N T T SOFTWARE KK; Nippon Telegraph and Telephone Corp; NTT Software Corp
Priority date: 1995-12-04
Filing date: 1995-12-04
Publication date: 1997-06-20

Abstract

(57) [Abstract]
[Problem] To provide a communication network fault management system that has high distributed-processing and real-time-processing capability and that can be configured more flexibly and maintained more easily.
[Solution] The event recognition, primary isolation, impact analysis, action, and test processing required for fault management of a communication network are assigned to an event recognition autonomous agent group 10, a primary isolation autonomous agent group 20, an impact analysis autonomous agent group 30, an action autonomous agent 40, and a test autonomous agent group 50, respectively. The autonomous agents or autonomous agent groups exchange information through the corresponding blackboards 61 to 65, the blackboard management autonomous agents 71 to 75, and the fault management blackboard 66. This makes the processing independent and distributed and shortens the processing time of the system as a whole.

Description

Translated from Japanese

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a communication network fault management system based on autonomous agent control.

[0002]

[Prior Art] In a conventional communication network fault management system based on sequential processing, when a fault is reported, the system first analyzes that fault and identifies where it occurred and what it is. If no other fault is present, the system moves on to the next processing: analyzing the impact on the other nodes of the communication network affected by the fault (nodes being the equipment that makes up the network, such as workstations, routers, and ATM switches), searching for an appropriate recovery action, and so on. If another fault is present, the system analyzes it in the same way and identifies its location and nature, regardless of whether it is related to the fault already analyzed, because identifying such a relationship is very difficult.

[0003]

[Problem to Be Solved by the Invention] Faults in a communication network occur asynchronously. Even two or more related faults that share the same cause are therefore unlikely to produce fault information in the same time window, because the nodes that make up the network are used at different times. A conventional system consequently repeats the analysis for each of the related faults. This wasted work makes processing take a great deal of time; in other words, real-time processing is difficult.

[0004] In sequential processing it is also very difficult to classify faults into patterns. Moreover, because processing proceeds one step at a time exactly as programmed, when a failure caused by a program defect or by insufficient knowledge occurs at some step, the system either stalls and cannot continue or spends a great deal of time on processing, so the problem cannot be resolved efficiently.

[0005] Thus, in a conventional expert system, processing speed is not much of a concern for a single fault, but when multiple faults are reported asynchronously it becomes almost impossible to present a highly reliable inference result in a short time. In addition, when knowledge is insufficient or a system error occurs, the system stops processing for a long time or ceases to function entirely.

[0006] An object of the present invention is to provide a communication network fault management system that has high distributed-processing and real-time-processing capability and that can be configured more flexibly and maintained more easily.

[0007]

[Means for Solving the Problem] To solve the above problems, the present invention introduces autonomous agents and eliminates central control from system processing as far as possible. As many autonomous agents capable of the same job as the execution environment allows are prepared, and multiple autonomous agents compete to carry out the processing. This considerably reduces the effect that failures caused by partial lack of knowledge about the processing target, or by partial defects in a program, have on the processing of the system as a whole. Because functions can be distributed in a system built from autonomous agents, the processing time per autonomous agent can be shortened substantially, which in turn reduces the processing time of the whole system.

[0008] Accordingly, claim 1 of the present invention proposes a communication network fault management system based on autonomous agent control, comprising a plurality of autonomous agents or autonomous agent groups among which the communication network fault management functions are distributed, a plurality of blackboards each corresponding to one of those autonomous agents or autonomous agent groups, and a plurality of blackboard management autonomous agents each managing one of those blackboards.

[0009] According to the invention of claim 1, the communication network fault management functions are distributed among a plurality of autonomous agents or autonomous agent groups, and corresponding blackboards and blackboard management autonomous agents are provided. The various functions required for fault management can therefore be processed independently and in a distributed manner, which shortens processing time and yields a configuration that is flexible and easy to maintain.

[0010] Claim 2 proposes a communication network fault management system based on autonomous agent control, comprising: an event recognition autonomous agent group that recognizes fault events occurring in the communication network and user complaints about the services the network provides or about its component devices, and that manages this information collectively; an event recognition blackboard through which the event recognition autonomous agent group exchanges information with other autonomous agents or autonomous agent groups; a primary isolation autonomous agent group that analyzes the events notified by the event recognition autonomous agent group and identifies the fault cause and fault location; a primary isolation blackboard through which the primary isolation autonomous agent group exchanges information with other autonomous agents or autonomous agent groups; an impact analysis autonomous agent group that analyzes, from the fault cause and fault location identified by the primary isolation autonomous agent group or from the events notified by the event recognition autonomous agent group, the impact on the network's component devices, on the services the network provides, and on users; an impact analysis blackboard through which the impact analysis autonomous agent group exchanges information with other autonomous agents or autonomous agent groups; an action autonomous agent that searches for an operation to repair the fault at the identified fault cause and fault location and either executes it automatically or presents it to the network administrator; an action blackboard through which the action autonomous agent exchanges information with other autonomous agents or autonomous agent groups; a test autonomous agent group that searches for and executes an appropriate test in response to requests from other autonomous agents or autonomous agent groups; a test blackboard through which the test autonomous agent group exchanges information with other autonomous agents or autonomous agent groups; a blackboard management autonomous agent for each blackboard that manages the information exchanged on it; and a fault management blackboard for exchanging information common to the system.

[0011] According to the invention of claim 2, the event recognition autonomous agent group recognizes fault events occurring in the communication network and user complaints about the services the network provides or about its component devices, and writes them to the event recognition blackboard. The blackboard management autonomous agent for the event recognition blackboard writes the information written there to the fault management blackboard. The information from the event recognition autonomous agent group written to the fault management blackboard is written to the primary isolation blackboard by the blackboard management autonomous agent for the primary isolation blackboard. Events written to the primary isolation blackboard are read and analyzed by the primary isolation autonomous agent group, which identifies the fault cause and fault location; these are written to the fault management blackboard via the primary isolation blackboard and its blackboard management autonomous agent. The events from the event recognition autonomous agent group and the fault causes and fault locations from the primary isolation autonomous agent group that have been written to the fault management blackboard are written to the impact analysis blackboard by the blackboard management autonomous agent for the impact analysis blackboard. The impact analysis autonomous agent group reads the events, fault causes, and fault locations written to the impact analysis blackboard and analyzes the impact on the network's component devices, on the services the network provides, and on users; the result is written to the fault management blackboard via the impact analysis blackboard and its blackboard management autonomous agent. The fault causes and fault locations from the primary isolation autonomous agent group written to the fault management blackboard are also written to the action blackboard by the blackboard management autonomous agent for the action blackboard. The action autonomous agent reads the fault causes and fault locations written to the action blackboard, searches for an operation to repair the fault, and either executes it or presents it to the network administrator. In addition, the events, fault causes, fault locations, and other information written to the fault management blackboard are written to the test blackboard by the blackboard management autonomous agent for the test blackboard. The test autonomous agents read this information from the test blackboard, search for an appropriate test on that basis, and execute it.

[0012] Claim 3 proposes the communication network fault management system based on autonomous agent control according to claim 2, wherein: the event recognition autonomous agent group comprises a monitoring monitor autonomous agent that recognizes fault events occurring in the communication network and fault events from a monitoring monitor that watches the state of the services the network provides, a user report autonomous agent that recognizes user complaints about the services the network provides or about its component devices, and an event conversion autonomous agent that converts those fault events and complaints into events; the primary isolation autonomous agent group and the impact analysis autonomous agent group each comprise a rule-based reasoning autonomous agent and a memory-based reasoning autonomous agent; and the test autonomous agent group comprises a test search autonomous agent that searches its knowledge for a requested test and describes concrete test content in a form the system can readily interpret, a test execution scheduler autonomous agent that determines an efficient test order when multiple tests are registered, and a test execution autonomous agent that executes the tests scheduled by the test execution scheduler autonomous agent one after another.

[0013] According to the invention of claim 3, the monitoring monitor autonomous agent of the event recognition autonomous agent group recognizes fault events occurring in the communication network and fault events from the monitoring monitor that watches the state of the services the network provides; the user report autonomous agent recognizes user complaints about the services the network provides or about its component devices; and the event conversion autonomous agent converts those fault events and complaints into events and writes them to the event recognition blackboard. The rule-based reasoning autonomous agent and the memory-based reasoning autonomous agent of the primary isolation autonomous agent group analyze the events and identify the fault cause and fault location. The rule-based reasoning autonomous agent and the memory-based reasoning autonomous agent of the impact analysis autonomous agent group analyze, on the basis of the events, fault causes, and fault locations, the impact on the network's component devices, on the services the network provides, and on users. In the test autonomous agent group, the test search autonomous agent describes the concrete test content, the test execution scheduler autonomous agent determines an efficient test order, and the test execution autonomous agent executes the scheduled tests one after another.

[0014] Claim 4 proposes the communication network fault management system based on autonomous agent control according to claim 3, wherein the rule-based reasoning autonomous agent of the primary isolation autonomous agent group, letting x be the certainty factor of a fault derived from a first event and y the certainty factor derived from a second event, defines the combined certainty factor C of the fault by

C = x^{1/2} + y^{1/2} - (xy)^{1/2}

and infers faults on that basis.

[0015] According to the invention of claim 4, the certainty factor of a fault can be estimated from multiple events, which allows faults to be inferred more accurately.

[0016] Claim 5 proposes the communication network fault management system based on autonomous agent control according to claim 2 or 3, wherein the primary isolation autonomous agent group, using a constant with a value from 0 to 1 inclusive, computes the fault output threshold at a given time from the certainty factors of multiple faults by

fault output threshold = (highest certainty factor) × (constant)

and takes the faults that exceed this threshold as the primary isolation result written to the fault management blackboard.

[0017] According to the invention of claim 5, when multiple faults occur, the faults that should be processed can be selected accurately on the basis of their certainty factors.
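The threshold rule of claim 5 can be illustrated with a short Python sketch. The function name, the dictionary layout, and the fault labels are hypothetical; only the formula (highest certainty factor times a constant in [0, 1]) and the "exceeds the threshold" test come from the text.

```python
def select_faults(certainties: dict[str, float], constant: float) -> dict[str, float]:
    """Keep the faults whose certainty factor exceeds max(certainty) * constant."""
    if not 0.0 <= constant <= 1.0:
        raise ValueError("constant must lie in [0, 1]")
    if not certainties:
        return {}
    threshold = max(certainties.values()) * constant
    # "Exceeds" is read as strictly greater than; this reading is an assumption.
    return {fault: c for fault, c in certainties.items() if c > threshold}


candidates = {"link_down": 0.9, "port_error": 0.5, "fan_alarm": 0.2}
selected = select_faults(candidates, constant=0.5)
```

With these illustrative values the threshold is 0.45, so `link_down` and `port_error` are reported and `fan_alarm` is suppressed. Under the strict reading, a constant of exactly 1 would suppress every fault, which suggests the constant is meant to be chosen below 1 in practice.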

[0018] Claim 6 proposes the communication network fault management system based on autonomous agent control according to claim 2 or 3, wherein the test autonomous agent group accesses a test result database and, when no test result identical to the test request exists, performs the test; when such a result exists, it defines the validity of the test result, with α and β as constants, by

validity = 1 - 1 / (1 + α·exp(-β·(elapsed time)))

and rejects the test request if the validity is greater than a preset validity threshold, or, if it is smaller, deletes the test result from the test result database and performs the test.

[0019] According to the invention of claim 6, a test whose previously obtained result is still highly valid need not be repeated, which reduces the processing load accordingly.

[0020]

[Embodiments of the Invention] In the present invention, the processing required for fault management of a communication network is divided into subtasks such as event recognition, primary isolation, impact analysis, action, and test, and each subtask is assigned to an autonomous agent or autonomous agent group of the appropriate kind. All information exchange between autonomous agents or autonomous agent groups, and between the autonomous agents within each group, takes place through blackboards.

[0021] Each autonomous agent or autonomous agent group accesses the blackboard to which it belongs and either obtains the information it needs or requests that the information be provided. For the information obtained from the blackboard, the autonomous agent or autonomous agent group uses the knowledge in its own knowledge database to check whether processing is necessary. If it is, the agent performs the processing and outputs the result to the blackboard.
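The read, check, process, and write cycle described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Blackboard` and `Agent` classes, the entry fields (`id`, `kind`, `payload`), and the dictionary used as a knowledge base are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Blackboard:
    entries: list[dict] = field(default_factory=list)

    def post(self, entry: dict) -> None:
        self.entries.append(entry)

    def read(self, kind: str) -> list[dict]:
        # Return a new list so callers may post while iterating over it.
        return [e for e in self.entries if e["kind"] == kind]


@dataclass
class Agent:
    name: str
    consumes: str                      # kind of entry this agent reads
    produces: str                      # kind of entry this agent writes
    knowledge: dict[str, str]          # stand-in for the agent's knowledge DB
    handled: set = field(default_factory=set)

    def step(self, board: Blackboard) -> int:
        """Run one autonomous cycle; return the number of results posted."""
        posted = 0
        for entry in board.read(self.consumes):
            if entry["id"] in self.handled:
                continue
            self.handled.add(entry["id"])
            conclusion = self.knowledge.get(entry["payload"])
            if conclusion is None:     # knowledge says no processing is needed
                continue
            board.post({"id": f"{self.name}:{entry['id']}",
                        "kind": self.produces, "payload": conclusion})
            posted += 1
        return posted


board = Blackboard()
board.post({"id": "e1", "kind": "event", "payload": "no_carrier"})
isolator = Agent("isolator", "event", "fault", {"no_carrier": "link_down"})
first = isolator.step(board)
second = isolator.step(board)
```

Because each agent only reads and writes its blackboard, several such agents can run the same `step` loop independently, which is the property the text relies on for distribution.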

[0022] Each autonomous agent or autonomous agent group performs the processing described above autonomously and in parallel, thereby realizing fault management in the communication network. In addition, so that the autonomous agents or autonomous agent groups can respond more flexibly to the processing requirements of the system, action rules that control the behavior of each autonomous agent are set.

[0023] FIG. 1 shows an example of an embodiment of the communication network fault management system of the present invention. In the figure, 10 is an event recognition autonomous agent group, 20 is a primary isolation autonomous agent group, 30 is an impact analysis autonomous agent group, 40 is an action autonomous agent, 50 is a test autonomous agent group, 61 is an event recognition blackboard, 62 is a primary isolation blackboard, 63 is an impact analysis blackboard, 64 is an action blackboard, 65 is a test blackboard, 66 is a fault management blackboard, and 71, 72, 73, 74, and 75 are blackboard management autonomous agents; together these constitute the communication network fault management system 100. Further, 200 is a communication network, 300 is a network configuration database (DB), and 400 is a knowledge database (DB).

[0024] The event recognition autonomous agent group 10 recognizes fault events occurring in the communication network, fault events from a monitoring monitor that watches the state of the services the network provides, and user complaints about the services the network provides or about its component devices; it converts this information into events and manages them collectively. As shown in FIG. 2, it comprises an event conversion autonomous agent 11, a user report autonomous agent 12, and a monitoring monitor autonomous agent 13.

[0025] The primary isolation autonomous agent group 20 analyzes the events notified by the event recognition autonomous agent group 10 and identifies the fault cause and fault location. As shown in FIG. 3, it comprises a rule-based reasoning autonomous agent 21 and a memory-based reasoning autonomous agent 22.

[0026] The impact analysis autonomous agent group 30 analyzes, from the fault cause and fault location identified by the primary isolation autonomous agent group 20 or from the events notified by the event recognition autonomous agent group 10, the impact on the network's component devices, on the services the network provides, and on users. As shown in FIG. 4, like the primary isolation autonomous agent group 20, it comprises a rule-based reasoning autonomous agent 31 and a memory-based reasoning autonomous agent 32.

[0027] The action autonomous agent 40 searches for an operation to repair the fault at the identified fault cause and fault location, and either executes it automatically or presents it to the network administrator.

[0028] The test autonomous agent group 50 searches for and executes an appropriate test in response to requests from other autonomous agents. As shown in FIG. 5, it comprises a test search autonomous agent 51 that searches its knowledge for a requested test and describes concrete test content in a form the system can readily interpret, a test execution scheduler autonomous agent 52 that determines an efficient test order when multiple tests are registered, and a test execution autonomous agent 53 that executes the tests scheduled by the test execution scheduler autonomous agent 52 one after another.

[0029] The event recognition blackboard 61, primary isolation blackboard 62, impact analysis blackboard 63, action blackboard 64, and test blackboard 65 serve for information exchange between, respectively, the event recognition autonomous agent group 10, the primary isolation autonomous agent group 20, the impact analysis autonomous agent group 30, the action autonomous agent 40, and the test autonomous agent group 50 on the one hand and other autonomous agents or autonomous agent groups on the other. Each is managed by its own dedicated blackboard management autonomous agent 71 to 75. The fault management blackboard 66 serves for the exchange of information common to the system.
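One way to picture how the dedicated blackboard managers move information between a local blackboard and the shared fault management blackboard 66 is the Python sketch below. The tuple entries, the `relay` helper, and the `publishes`/`subscribes` sets are assumptions introduced for illustration; the source does not specify how the managers select or copy information.

```python
from dataclasses import dataclass


def relay(src: list, dst: list, kinds: frozenset) -> int:
    """Copy entries of the given kinds from src to dst, skipping duplicates."""
    copied = 0
    for entry in src:
        if entry[0] in kinds and entry not in dst:
            dst.append(entry)
            copied += 1
    return copied


@dataclass
class BlackboardManager:
    local: list            # e.g. blackboard 61 or 62
    shared: list           # fault management blackboard 66
    publishes: frozenset   # kinds pushed from local to shared
    subscribes: frozenset  # kinds pulled from shared to local

    def sync(self) -> None:
        relay(self.local, self.shared, self.publishes)
        relay(self.shared, self.local, self.subscribes)


event_board = [("event", "no_carrier")]
isolation_board: list = []
fault_mgmt_board: list = []

manager_71 = BlackboardManager(event_board, fault_mgmt_board,
                               frozenset({"event"}), frozenset())
manager_72 = BlackboardManager(isolation_board, fault_mgmt_board,
                               frozenset({"fault"}), frozenset({"event"}))

manager_71.sync()                                  # event reaches blackboard 66
manager_72.sync()                                  # event reaches blackboard 62
isolation_board.append(("fault", "link_down"))     # result of primary isolation
manager_72.sync()                                  # fault reaches blackboard 66
```

In this sketch an agent group never touches another group's blackboard directly; only the managers cross the boundary, which mirrors the independence the text attributes to the distributed blackboards.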

[0030] Distributing the blackboards used for information exchange in this way increases the independence of the autonomous agents or autonomous agent groups and makes the system configuration flexible and easy to maintain.

[0031] In this system, defining a time function for managing test results also makes it possible to handle efficiently the same test requested asynchronously by multiple autonomous agents or autonomous agent groups. That is, when another autonomous agent or autonomous agent group requests a test, the test autonomous agent group 50 accesses the test result database, which stores test results, and searches for the result of a test identical to the one requested. If no identical test result exists, it performs the test; if one exists, it proceeds according to the validity of the test result, as described below.

[0032] A communication network is highly dynamic, and the validity of a test result declines as time passes. To reflect this, the validity of a test is defined, with α and β as constants, by

validity = 1 - 1 / (1 + α·exp(-β·(elapsed time)))

[0033] If this value is greater than a preset validity threshold, the test request is rejected; if it is smaller, the corresponding test result is automatically deleted from the test result database and the test is performed again. This allows tests with identical content in an identical environment to be skipped, which shortens processing time.
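The validity function and the accept-or-rerun decision can be sketched in Python as follows. The function names, the in-memory dictionary standing in for the test result database, and the numeric values of α, β, and the threshold are assumptions chosen for illustration; the formula itself is the one given in the text.

```python
import math
from typing import Callable


def validity(elapsed: float, alpha: float, beta: float) -> float:
    """Validity of a stored test result: 1 - 1 / (1 + alpha * exp(-beta * elapsed))."""
    return 1.0 - 1.0 / (1.0 + alpha * math.exp(-beta * elapsed))


def handle_request(results: dict, test_id: str, now: float,
                   alpha: float, beta: float, threshold: float,
                   run_test: Callable[[str], str]) -> tuple[str, str]:
    """Reject the request if a stored result is still valid, else (re)run the test."""
    stored = results.get(test_id)
    if stored is not None:
        finished_at, outcome = stored
        if validity(now - finished_at, alpha, beta) > threshold:
            return "rejected", outcome
        # The text does not say what happens at exact equality;
        # treating it as stale is an assumption.
        del results[test_id]
    outcome = run_test(test_id)
    results[test_id] = (now, outcome)
    return "executed", outcome


executions: list[str] = []
run = lambda test_id: executions.append(test_id) or "ok"
db: dict = {}
r1 = handle_request(db, "ping_router_A", 0.0, 9.0, 0.1, 0.5, run)
r2 = handle_request(db, "ping_router_A", 1.0, 9.0, 0.1, 0.5, run)
r3 = handle_request(db, "ping_router_A", 100.0, 9.0, 0.1, 0.5, run)
```

With α = 9 a fresh result starts at validity 0.9 and decays toward 0, so the second request shortly after the first is rejected while the third, much later, triggers a new run.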

[0034] Defining autonomous agent action rules also allows the behavior or functions of an autonomous agent to be controlled or extended more flexibly, which makes the system configuration still more flexible and easier to maintain.

【００３５】以下、一次切り分け自律エージェント群２
０のルールベース推論自律エージェント２１で用いる２
つの自律エージェントアクションルールについて述べ
る。The following describes two autonomous agent action rules used by the rule-based reasoning autonomous agent 21 of the primary isolation autonomous agent group 20.

【００３６】（１）ルールベース推論自律エージェント
間の推論結果統合ルール一次切り分け自律エージェント群２０の場合、少なくと
も１つ存在するルールベース推論自律エージェント２１
は各黒板６１〜６５，６６から処理する事象を獲得し、
処理を行う。(1) Rule for integrating inference results among rule-based reasoning autonomous agents: In the primary isolation autonomous agent group 20, the rule-based reasoning autonomous agent 21, of which at least one exists, acquires the events to be processed from the blackboards 61 to 65 and 66 and processes them.

【００３７】前述したように通信ネットワークにおいて
発生する事象の多くは同一の故障に基づくケースが多
い。例えば、ある事象から推論された、ルール対応に定
められた低い確信度を持つ故障であっても、他の事象か
ら同じ故障が結論として推論された場合、その故障の確
信度をより高く設定し直さなくてはならない。そこで、
故障の第１の事象による確信度をｘ、第２の事象による
確信度をｙとして、故障の合成された確信度Ｃを、Ｃ＝（ｘ）^1/2＋（ｙ）^1/2−（ｘｙ）^1/2 により定義し（但し、０≦確信度≦１）、これに基づい
て故障を推論する。As described above, many of the events that occur in a communication network stem from the same fault. For example, even if a fault inferred from one event has only the low certainty factor assigned by its rule, its certainty factor must be raised when the same fault is also inferred as the conclusion from another event. Therefore, letting x be the certainty factor of the fault from the first event and y the certainty factor from the second event, the combined certainty factor C of the fault is defined as C = x^(1/2) + y^(1/2) − (xy)^(1/2) (where 0 ≦ certainty factor ≦ 1), and the fault is inferred based on this.
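
The combination rule can be illustrated with a short Python sketch. The function name `combine_certainty` and the sample values are assumptions made for this example only. Algebraically, x^(1/2) + y^(1/2) − (xy)^(1/2) equals 1 − (1 − x^(1/2))(1 − y^(1/2)), so the combined value stays within the range 0 to 1 and is never lower than either input.

```python
import math


def combine_certainty(x, y):
    """Combined certainty factor C = sqrt(x) + sqrt(y) - sqrt(x*y)."""
    if not (0.0 <= x <= 1.0 and 0.0 <= y <= 1.0):
        raise ValueError("certainty factors must lie in [0, 1]")
    return math.sqrt(x) + math.sqrt(y) - math.sqrt(x * y)


# Two events each support the same fault with a low certainty factor of 0.25.
print(combine_certainty(0.25, 0.25))  # 0.5 + 0.5 - 0.25 = 0.75
```

In this example two weak pieces of evidence (0.25 each) combine into a certainty factor of 0.75 for the shared fault.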

【００３８】（２）故障の出力制限ルール一次切り分け自律エージェント群２０の場合、ルールベ
ース推論自律エージェント２１から確信度別に複数の故
障が推論結果として得られる。この複数の故障の中には
別の故障に付随して発生したものもあり、また、装置が
動作しているかどうか分からないといった状態の確認が
できないものもある。(2) Fault output restriction rule: In the primary isolation autonomous agent group 20, the rule-based reasoning autonomous agent 21 yields multiple faults as inference results, each with its own certainty factor. Some of these faults occurred as a side effect of another fault, and for others the state cannot be confirmed, for example when it is unknown whether the device is operating.

【００３９】このように、個々の故障に対して推論され
た確信度にはばらつきがあるため、少なくとも１つのル
ールベース推論自律エージェント２１から得られた複数
の故障のうち最も高い確信度を持つ障害を基準とし、定
数を０以上１以下の値として、その時点での故障出力の
しきい値を、故障出力のしきい値＝（最も高い確信度）＊（定数）により計算し、該しきい値を越える故障だけを一次切り
分け黒板６２から障害管理黒板６６へ一次切り分けの結
果として登録する。これによって、影響分析自律エージ
ェント群３０等の故障を処理対象とする自律エージェン
トの負荷を軽減することができる。Thus, because the certainty factors inferred for individual faults vary, the fault with the highest certainty factor among the faults obtained from the at least one rule-based reasoning autonomous agent 21 is taken as the reference. With a constant set to a value between 0 and 1 inclusive, the fault output threshold at that time is calculated as fault output threshold = (highest certainty factor) × (constant), and only the faults exceeding this threshold are registered from the primary isolation blackboard 62 to the fault management blackboard 66 as the result of primary isolation. This reduces the load on the autonomous agents that process faults, such as the impact analysis autonomous agent group 30.
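
The output restriction rule can be sketched in Python as follows. The function name `filter_faults`, the dictionary representation, and the fault names are hypothetical and chosen only for illustration. "Exceeding" is read here as a strict comparison, so under this reading a constant of exactly 1 would filter out every fault, including the reference one.

```python
def filter_faults(faults, constant):
    """Keep faults whose certainty factor exceeds (highest certainty) * constant.

    faults maps a fault name to its certainty factor.
    """
    if not 0.0 <= constant <= 1.0:
        raise ValueError("constant must lie in [0, 1]")
    if not faults:
        return {}
    threshold = max(faults.values()) * constant
    return {name: cf for name, cf in faults.items() if cf > threshold}


# Hypothetical inference results from the rule-based reasoning agents.
candidates = {"line_down": 0.9, "card_fault": 0.5, "power_unknown": 0.3}
reported = filter_faults(candidates, 0.5)  # threshold = 0.9 * 0.5 = 0.45
```

Here only `line_down` and `card_fault` would be passed on to the fault management blackboard; `power_unknown` falls below the threshold and is suppressed.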

【００４０】複数のルールベース推論自律エージェント
が実装された本発明のシステムによる処理結果及び従来
のエキスパートシステムによる処理結果の一例を、一次
切り分けを例にとって下記表１、２に示す。Tables 1 and 2 below show, taking primary isolation as an example, processing results of the system of the present invention with multiple rule-based reasoning autonomous agents implemented and processing results of a conventional expert system.

【００４１】[0041]

【表１】[Table 1]

【表２】この例によれば、本発明のシステムの方が従来のシステ
ムより処理時間を半分以下に短縮可能なことが分かる。
なお、処理時間の短縮率は障害の発生数によってさらに
大きくすることが可能である。[Table 2] This example shows that the system of the present invention can cut the processing time to less than half that of the conventional system. The reduction in processing time can become even larger depending on the number of faults that occur.

【００４２】[0042]

【発明の効果】本発明によれば、通信ネットワーク障害
管理の機能を複数の自律エージェントもしくは自律エー
ジェント群に分散化し、これらに対応する複数の黒板及
び黒板管理自律エージェントを設けたことにより、障害
管理に必要な各種の処理を独立・分散して処理すること
が可能となり、処理時間の短縮や柔軟でかつメンテナン
スし易い構成の実現を図ることができる。According to the present invention, the functions of communication network fault management are distributed among a plurality of autonomous agents or autonomous agent groups, and a plurality of blackboards and blackboard management autonomous agents corresponding to them are provided. As a result, the various kinds of processing required for fault management can be carried out independently and in a distributed manner, which shortens the processing time and yields a configuration that is flexible and easy to maintain.

【００４３】また、本発明によれば、システムに実装さ
れている計算機パワーが許す限り、複数のルールベース
推論自律エージェントを用意することができ、これによ
って複数事象発生の際の処理速度を向上でき、より現実
的な問題解決が可能になるとともに、多くの障害情報の
中からノイズを取り除くことが可能となり、短時間で信
頼性の高い推論結果を提示できる。Further, according to the present invention, as many rule-based reasoning autonomous agents can be provided as the computing power installed in the system allows. This improves the processing speed when multiple events occur and makes more realistic problem solving possible. It also allows noise to be removed from a large volume of fault information, so that highly reliable inference results can be presented in a short time.

【００４４】また、自律エージェントもしくは自律エー
ジェント群の独立化に伴い、知識不足や自律エージェン
トもしくは自律エージェント群の障害によるシステム全
体へのダメージを大幅に軽減でき、その分、ロバスト性
を高めることができ、さらにネットワーク構成の変更に
伴う機能の追加や変更を容易にできる等の利点がある。Further, because the autonomous agents or autonomous agent groups are independent of one another, damage to the entire system caused by insufficient knowledge or by the failure of an autonomous agent or autonomous agent group is greatly reduced, which improves robustness accordingly. There is also the advantage that functions can easily be added or changed when the network configuration changes.

【図面の簡単な説明】[Brief description of the drawings]

【図１】本発明のシステムの実施の形態の一例を示す構
成図FIG. 1 is a configuration diagram showing an example of an embodiment of a system of the present invention.

【図２】事象認識自律エージェント群の詳細を示す構成
図FIG. 2 is a configuration diagram showing details of an event recognition autonomous agent group.

【図３】一次切り分け自律エージェント群の詳細を示す
構成図FIG. 3 is a configuration diagram showing details of a primary isolation autonomous agent group.

【図４】影響分析自律エージェント群の詳細を示す構成
図FIG. 4 is a configuration diagram showing details of an impact analysis autonomous agent group.

【図５】試験自律エージェント群の詳細を示す構成図FIG. 5 is a configuration diagram showing details of a test autonomous agent group.

【符号の説明】[Explanation of symbols]

１０…事象認識自律エージェント群、２０…一次切り分
け自律エージェント群、３０…影響分析自律エージェン
ト群、４０…措置自律エージェント、５０…試験自律エ
ージェント群、６１…事象認識黒板、６２…一次切り分
け黒板、６３…影響分析黒板、６４…措置黒板、６５…
試験黒板、６６…障害管理黒板、７１〜７５…黒板管理
自律エージェント、１００…通信ネットワーク障害管理
システム、２００…通信ネットワーク、３００…ネット
ワーク構成データベース、４００…知識データベース。10 ... Event recognition autonomous agent group, 20 ... Primary isolation autonomous agent group, 30 ... Impact analysis autonomous agent group, 40 ... Measures autonomous agent, 50 ... Test autonomous agent group, 61 ... Event recognition blackboard, 62 ... Primary isolation blackboard, 63 ... Impact analysis blackboard, 64 ... Measures blackboard, 65 ... Test blackboard, 66 ... Fault management blackboard, 71 to 75 ... Blackboard management autonomous agents, 100 ... Communication network fault management system, 200 ... Communication network, 300 ... Network configuration database, 400 ... Knowledge database.

フロントページの続き
(51)Int.Cl.⁶ 識別記号 庁内整理番号 ＦＩ 技術表示箇所 Ｇ０６Ｆ 15/00 ３２０ Ｇ０６Ｆ 15/00 ３２０Ａ 15/16 ４７０ 15/16 ４７０Ｕ Ｈ０４Ｌ 12/24 9466−5Ｋ Ｈ０４Ｌ 11/08 12/26
(72)発明者 村中正次 東京都新宿区西新宿３丁目19番２号 日本電信電話株式会社内
(72)発明者 張宏 神奈川県横浜市中区山下町223番１ エヌ・ティ・ティ・ソフトウェア株式会社内
Continuation of front page
(51) Int.Cl.⁶ Identification code, Internal reference number, FI, Technical display location: G06F 15/00 320 G06F 15/00 320A 15/16 470 15/16 470U H04L 12/24 9466-5K H04L 11/08 12/26
(72) Inventor Masatsugu Muranaka, c/o Nippon Telegraph and Telephone Corporation, 3-19-2 Nishishinjuku, Shinjuku-ku, Tokyo
(72) Inventor Hiroshi Zhang, c/o NTT Software Corporation, 223-1 Yamashita-cho, Naka-ku, Yokohama-shi, Kanagawa

Claims


【特許請求の範囲】[Claims]

【請求項１】通信ネットワーク障害管理の機能を分散
化した複数の自律エージェントもしくは自律エージェン
ト群と、該複数の自律エージェントもしくは自律エージェント群
にそれぞれ対応する複数の黒板と、該複数の黒板をそれぞれ管理する複数の黒板管理自律エ
ージェントとからなることを特徴とする自律エージェン
ト制御による通信ネットワーク障害管理システム。1. A communication network fault management system based on autonomous agent control, comprising: a plurality of autonomous agents or autonomous agent groups among which the functions of communication network fault management are distributed; a plurality of blackboards respectively corresponding to the plurality of autonomous agents or autonomous agent groups; and a plurality of blackboard management autonomous agents that respectively manage the plurality of blackboards.

【請求項４】一次切り分け自律エージェント群のルー
ルベース推論自律エージェントは故障の第１の事象によ
る確信度をｘ、第２の事象による確信度をｙとして、故
障の合成された確信度Ｃを、Ｃ＝（ｘ）^1/2＋（ｙ）^1/2−（ｘｙ）^1/2 により定義し、これに基づいて故障を推論することを特
徴とする請求項３記載の自律エージェント制御による通
信ネットワーク障害管理システム。4. The communication network fault management system based on autonomous agent control according to claim 3, wherein the rule-based reasoning autonomous agent of the primary isolation autonomous agent group, letting x be the certainty factor of a fault from a first event and y the certainty factor from a second event, defines the combined certainty factor C of the fault as C = x^(1/2) + y^(1/2) − (xy)^(1/2), and infers the fault based on this.

【請求項５】一次切り分け自律エージェント群は定数
を０以上１以下の値として、複数の故障の確信度からそ
の時点での故障出力のしきい値を、故障出力のしきい値＝（最も高い確信度）＊（定数）により計算し、該しきい値を越える故障を障害管理黒板
への一次切り分け結果とすることを特徴とする請求項２
または３記載の自律エージェント制御による通信ネット
ワーク障害管理システム。5. The communication network fault management system based on autonomous agent control according to claim 2 or 3, wherein the primary isolation autonomous agent group, with a constant set to a value between 0 and 1 inclusive, calculates the fault output threshold at that time from the certainty factors of a plurality of faults as fault output threshold = (highest certainty factor) × (constant), and takes the faults exceeding the threshold as the primary isolation result sent to the fault management blackboard.

【請求項６】試験自律エージェント群は試験結果デー
タベースにアクセスして試験要求と同一の試験結果がな
い時は該当試験を実施し、ある時はα、βを定数とし
て、試験結果の有効度を、有効度＝１−１／（１＋α・ｅｘｐ（−β・経過時
間））により定義し、これが予め設定された有効度のしきい値
より大きい場合は試験要求を拒否し、小さい場合は試験
結果データベースから前記試験結果を削除するとともに
試験を実施することを特徴とする請求項２または３記載
の自律エージェント制御による通信ネットワーク障害管
理システム。6. The communication network fault management system based on autonomous agent control according to claim 2 or 3, wherein the test autonomous agent group accesses the test result database and, when no test result identical to the requested test exists, performs the test, and, when one exists, defines the validity of the test result, with α and β as constants, as validity = 1 − 1/(1 + α·exp(−β·elapsed time)), rejects the test request if this is larger than a preset validity threshold, and, if it is smaller, deletes the test result from the test result database and performs the test.