Technical Field
The present invention relates to the field of the Internet, and more particularly to a high-performance computing Internet network monitoring method and system.
Background
As high-performance computing clusters continue to grow in scale, the high-performance computing interconnection network is also becoming larger and more complex. Monitoring such a network means obtaining, efficiently and in real time, the online status of the communication boards in the network, the configuration information of the interconnected communication boards, the link connection status, the link bandwidth performance, the link communication quality, and so on. Automated network monitoring software is indispensable for a complete view of the network running status of an entire high-performance computing cluster. However, with the technology currently in use, monitoring data collection in the high-performance computing interconnection network is not sufficiently real-time and fault location is not sufficiently efficient, so monitoring of the network lags behind and cannot reflect problems in the network in a timely manner.
Summary of the Invention
To overcome the shortcomings of the prior art, in which monitoring data collection in the high-performance computing interconnection network is not sufficiently real-time and fault location is not sufficiently efficient, the present invention provides a high-performance computing Internet network monitoring method and system.
To solve the above technical problem, the technical solution of the present invention is as follows:
A high-performance computing Internet network monitoring method, comprising the following steps:
Step S101: deploy a communication chip status register collection program on the monitoring data collection node;
Step S102: start the collection program and determine whether it started successfully; if it did, perform step S103; if startup failed, perform step S101 again;
Step S103: the collection program periodically collects the information of the status registers on the monitored communication chips;
Step S104: format the status register information collected by the collection program into a standard format and store it in a database;
Step S105: the middle-layer data processing end analyzes whether the data in the database contains early-warning or alarm information; if it does, perform step S106; if it does not, proceed directly to step S107;
Step S106: the middle-layer data processing end pushes the early-warning and alarm information to the terminal;
Step S107: the display front end obtains the status register data from the middle-layer data processing end and visualizes the data.
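The standard formatting in step S104 can be pictured as decoding a raw register word into a fixed record. The following is a minimal Python sketch under stated assumptions: the source does not define any register layout, so the bit positions, the field names, and the `RegisterRecord` and `format_register` names are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical bit layout of one 32-bit status register. This layout is an
# assumption for illustration; the source does not specify a register format.
LINK_UP_BIT = 0          # bit 0: 1 means the link is up
ERROR_COUNT_SHIFT = 8    # bits 8..15: error counter
ERROR_COUNT_MASK = 0xFF
SPEED_SHIFT = 16         # bits 16..23: link speed in Gbit/s
SPEED_MASK = 0xFF


@dataclass
class RegisterRecord:
    """Standardized form of one status register sample (hypothetical fields)."""
    chip_id: str
    collected_at: str    # ISO 8601 timestamp in UTC
    link_up: bool
    error_count: int
    speed_gbps: int


def format_register(chip_id: str, raw: int, when: datetime) -> RegisterRecord:
    """Decode a raw register word into a standardized record (step S104)."""
    return RegisterRecord(
        chip_id=chip_id,
        collected_at=when.astimezone(timezone.utc).isoformat(),
        link_up=bool((raw >> LINK_UP_BIT) & 1),
        error_count=(raw >> ERROR_COUNT_SHIFT) & ERROR_COUNT_MASK,
        speed_gbps=(raw >> SPEED_SHIFT) & SPEED_MASK,
    )
```

With this assumed layout, `format_register("chip-0", 0x00640301, datetime.now(timezone.utc))` yields a record with the link up, an error count of 3, and a speed of 100 Gbit/s. Storing records in one fixed shape is what lets the later steps query the database without knowing chip-specific register details.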
Preferably, the middle-layer data processing end connects to the database through a timer function, periodically processes the collected data, filters out early-warning and alarm information, and then pushes that information out.
Preferably, the middle-layer data processing end uses RESTful API functions to convert the collected data to JSON for the display front end to call.
Preferably, the monitoring data collection node and the database must be in the same local area network.
Preferably, the monitoring data collection nodes are deployed in a distributed manner.
Preferably, the middle-layer data processing end and the display front end use the B/S (browser/server) model.
A high-performance computing Internet network monitoring system, based on the method described above, comprising a bottom-layer data collection end, a database, a middle-layer data processing end, and a display front end;
The bottom-layer data collection end includes a status register collection program, and the collection program is deployed in the communication chip status register. After the collection program is started, the system determines whether it started successfully. If startup failed, the collection program is restarted; if startup succeeded, the collection program periodically collects the information of the status registers on the monitored communication chips, and the collected status register information is formatted into a standard format and stored in the database. The middle-layer data processing end analyzes whether the data in the database contains early-warning or alarm information. If it does, the middle-layer data processing end pushes the early-warning and alarm information to the terminal; if it does not, the display front end obtains the status register data from the middle-layer data processing end and visualizes the data.
Compared with the prior art, the beneficial effects of the technical solution of the present invention are:
The present invention solves the problems that monitoring data collection in the high-performance computing interconnection network is not sufficiently real-time and fault location is not sufficiently efficient. It actively pushes early-warning and alarm information and provides a unified monitoring interface that is intuitive and efficient. Because the chip status information is standardized and formatted, system administrators can locate fault points more directly and efficiently. Because early-warning and alarm information is pushed actively according to the collected data, human oversight is avoided and the solution is stable and reliable.
Brief Description of the Drawings
Figure 1 is a flowchart of the method of the present invention.
Figure 2 is a structural diagram of the system of the present invention.
Detailed Description
The accompanying drawings are for illustrative purposes only and should not be construed as limiting this patent;
To better illustrate the embodiments, some components in the drawings may be omitted, enlarged, or reduced, and do not represent the dimensions of the actual product;
Those skilled in the art will understand that some well-known structures and their descriptions may be omitted from the drawings.
The technical solution of the present invention is further described below with reference to the accompanying drawings and embodiments.
Embodiment 1
As shown in Figure 1, a high-performance computing Internet network monitoring method comprises the following steps:
Step S101: deploy a communication chip status register collection program on the monitoring data collection node;
Step S102: start the collection program and determine whether it started successfully; if it did, perform step S103; if startup failed, perform step S101 again;
Step S103: the collection program periodically collects the information of the status registers on the monitored communication chips;
Step S104: format the status register information collected by the collection program into a standard format and store it in a database;
Step S105: the middle-layer data processing end analyzes whether the data in the database contains early-warning or alarm information; if it does, perform step S106; if it does not, proceed directly to step S107;
Step S106: the middle-layer data processing end pushes the early-warning and alarm information to the terminal;
Step S107: the display front end obtains the status register data from the middle-layer data processing end and visualizes the data.
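The collection side of the flow (steps S101 to S104) might be sketched as follows in Python. This is an illustration under assumptions, not the patented implementation: `read_registers` is a placeholder for the actual register access, the SQLite table `register_status` and its columns are hypothetical, and the bounded retry count is an added safeguard that the source does not mention (the source simply returns to step S101 on failure).

```python
import sqlite3
import time
from typing import Callable, Dict


def start_collector(start: Callable[[], bool], max_attempts: int = 3) -> bool:
    """Steps S101/S102: deploy and start again until startup succeeds.

    The attempt limit is an assumption added so the sketch cannot loop forever.
    """
    for _ in range(max_attempts):
        if start():
            return True
    return False


def read_registers() -> Dict[str, int]:
    """Placeholder register access (hypothetical): chip ID -> raw register value."""
    return {"chip-0": 0x1, "chip-1": 0x0}


def collect(db: sqlite3.Connection, read: Callable[[], Dict[str, int]],
            period_s: float, cycles: int) -> None:
    """Steps S103/S104: sample the registers periodically and store each sample."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS register_status "
        "(chip_id TEXT, collected_at REAL, raw_value INTEGER)"
    )
    for i in range(cycles):
        now = time.time()
        rows = [(chip_id, now, raw) for chip_id, raw in read().items()]
        db.executemany("INSERT INTO register_status VALUES (?, ?, ?)", rows)
        db.commit()
        if i < cycles - 1:
            time.sleep(period_s)  # wait one collection period before the next sample
```

A caller would typically run `start_collector(...)` first and enter `collect(...)` only when it returns `True`, which mirrors the branch in step S102.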
As a preferred embodiment, the middle-layer data processing end connects to the database through a timer function, periodically processes the collected data, filters out early-warning and alarm information, and then pushes that information out.
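One way to picture the timer-driven filtering described above is the Python sketch below. It assumes a hypothetical SQLite table `register_status` with an auto-incrementing `id`, a `chip_id`, and an `error_count` column; the two thresholds and the `push` callback are likewise assumptions, since the source does not say how early warnings and alarms are distinguished or delivered.

```python
import sqlite3
import threading
from typing import Callable, List, Tuple

ERROR_WARN = 10     # hypothetical early-warning threshold
ERROR_ALARM = 100   # hypothetical alarm threshold

Alert = Tuple[str, str, int]  # (level, chip_id, error_count)


def scan_once(db: sqlite3.Connection, last_id: int) -> Tuple[List[Alert], int]:
    """Step S105: classify rows newer than last_id; return alerts and the new cursor."""
    rows = db.execute(
        "SELECT id, chip_id, error_count FROM register_status "
        "WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    alerts: List[Alert] = []
    for row_id, chip_id, errors in rows:
        if errors >= ERROR_ALARM:
            alerts.append(("alarm", chip_id, errors))
        elif errors >= ERROR_WARN:
            alerts.append(("warning", chip_id, errors))
        last_id = row_id
    return alerts, last_id


class AlertScanner:
    """Re-arms a threading.Timer after every scan (steps S105 and S106)."""

    def __init__(self, db_path: str, push: Callable[[Alert], None],
                 period_s: float) -> None:
        self.db_path = db_path
        self.push = push
        self.period_s = period_s
        self.last_id = 0
        self._stopped = threading.Event()

    def start(self) -> None:
        self._schedule()

    def stop(self) -> None:
        self._stopped.set()

    def _schedule(self) -> None:
        timer = threading.Timer(self.period_s, self._tick)
        timer.daemon = True
        timer.start()

    def _tick(self) -> None:
        if self._stopped.is_set():
            return
        db = sqlite3.connect(self.db_path)  # open the connection in the timer thread
        try:
            alerts, self.last_id = scan_once(db, self.last_id)
        finally:
            db.close()
        for alert in alerts:
            self.push(alert)  # step S106: hand each alert to the push channel
        self._schedule()
```

For example, `AlertScanner("monitor.db", print, 5.0).start()` would print new alerts every five seconds, with `monitor.db` being a hypothetical database file. Tracking `last_id` keeps each row from being reported more than once.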
As a preferred embodiment, the middle-layer data processing end uses RESTful API functions to convert the collected data to JSON for the display front end to call.
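A minimal sketch of such a JSON endpoint, using only the Python standard library, could look like the following. The route `/api/status`, the database file name `monitor.db`, and the table and column names are assumptions made for illustration; the source names neither a framework nor a URL scheme.

```python
import json
import sqlite3
from http.server import BaseHTTPRequestHandler, HTTPServer

DB_PATH = "monitor.db"  # hypothetical database file


def latest_status_json(db: sqlite3.Connection, limit: int = 100) -> str:
    """Serialize the most recent status rows as a JSON array of objects."""
    rows = db.execute(
        "SELECT chip_id, collected_at, link_up, error_count "
        "FROM register_status ORDER BY collected_at DESC LIMIT ?", (limit,)
    ).fetchall()
    payload = [
        {"chip_id": chip_id, "collected_at": collected_at,
         "link_up": bool(link_up), "error_count": error_count}
        for chip_id, collected_at, link_up, error_count in rows
    ]
    return json.dumps(payload)


class StatusHandler(BaseHTTPRequestHandler):
    """Answers GET /api/status with the JSON payload for the display front end."""

    def do_GET(self) -> None:
        if self.path != "/api/status":
            self.send_error(404)
            return
        db = sqlite3.connect(DB_PATH)
        try:
            body = latest_status_json(db).encode("utf-8")
        finally:
            db.close()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Run the endpoint until interrupted."""
    HTTPServer(("127.0.0.1", port), StatusHandler).serve_forever()
```

A browser-based front end could then fetch `http://127.0.0.1:8000/api/status` periodically and render the returned array (step S107). Keeping the serialization in `latest_status_json` separate from the HTTP handler makes the JSON shape easy to check without starting a server.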
As a preferred embodiment, the monitoring data collection node and the database must be in the same local area network.
As a preferred embodiment, the monitoring data collection nodes are deployed in a distributed manner.
As a preferred embodiment, the middle-layer data processing end and the display front end use the B/S (browser/server) model.
Embodiment 2
As shown in Figure 2, a high-performance computing Internet network monitoring system, based on the method described above, comprises a bottom-layer data collection end 1, a database 2, a middle-layer data processing end 3, and a display front end 4;
The bottom-layer data collection end 1 includes a status register collection program, and the collection program is deployed in the communication chip status register. After the collection program is started, the system determines whether it started successfully. If startup failed, the collection program is restarted; if startup succeeded, the collection program periodically collects the information of the status registers on the monitored communication chips, and the collected status register information is formatted into a standard format and stored in the database 2. The middle-layer data processing end 3 analyzes whether the data in the database 2 contains early-warning or alarm information. If it does, the middle-layer data processing end 3 pushes the early-warning and alarm information to the terminal; if it does not, the display front end 4 obtains the status register data from the middle-layer data processing end 3 and visualizes the data.
The terms describing positional relationships in the accompanying drawings are for illustrative purposes only and should not be construed as limiting this patent;
Obviously, the above embodiments of the present invention are merely examples given to illustrate the present invention clearly, and are not intended to limit its implementations. Those of ordinary skill in the art can make other changes or modifications in different forms on the basis of the above description. It is neither necessary nor possible to enumerate all implementations here. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the claims of the present invention.
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201910402272.7A / CN110289981A (en) | 2019-05-14 | 2019-05-14 | A high-performance computing Internet network monitoring method and system |
| Publication Number | Publication Date |
|---|---|
| CN110289981A (en) | 2019-09-27 |
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201910402272.7A (Pending) / CN110289981A (en) | A high-performance computing Internet network monitoring method and system | 2019-05-14 | 2019-05-14 |
| Country | Link |
|---|---|
| CN (1) | CN110289981A (en) |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113553228A (en)* | 2021-06-21 | 2021-10-26 | 华东计算技术研究所(中国电子科技集团公司第三十二研究所) | Lightweight computer state monitoring system and method |
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5761200A (en)* | 1993-10-27 | 1998-06-02 | Industrial Technology Research Institute | Intelligent distributed data transfer system |
| CN102647736A (en)* | 2012-04-19 | 2012-08-22 | 华为技术有限公司 | A device status information acquisition system and communication method |
| CN103500133A (en)* | 2013-09-17 | 2014-01-08 | 华为技术有限公司 | Fault locating method and device |
| CN108563550A (en)* | 2018-04-23 | 2018-09-21 | 上海达梦数据库有限公司 | A kind of monitoring method of distributed system, device, server and storage medium |
| CN108710347A (en)* | 2018-04-16 | 2018-10-26 | 佛山市顺德区中山大学研究院 | A kind of monitoring cloud platform |
| WO2019058615A1 (en)* | 2017-09-21 | 2019-03-28 | 株式会社東芝 | Industrial plant monitoring device and distributed control system |
| CN109586999A (en)* | 2018-11-12 | 2019-04-05 | 深圳先进技术研究院 | A kind of container cloud platform condition monitoring early warning system, method and electronic equipment |
| Title |
|---|
| Zhao Zhe et al.: "Network monitoring system based on Zabbix", Computer Technology and Development* |
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 2019-09-27 |