DoE selected InfoBeyond to develop a Diagnosis Tool to detect network soft failures

Tuesday, February 19, 2019

Organizations are moving their services from local access to online platforms, such as globally accessible High Performance Computing (HPC) systems and clouds. However, remote users may experience unusual degradation of network performance. Broadly, network performance problems fall into two main categories:

1. Hard failures: Hard failures correspond to the inability to transfer any data between users and the network. They can be easily identified and resolved because the loss of connectivity is immediately noticeable.

2. Soft failures: Soft failures are characterized by degraded performance. They are more difficult to diagnose, and diagnosis usually relies on collecting symptoms from the network.



Soft failures can have many causes, such as misconfiguration. However, there is no fully automated tool that helps network users find the complicated network problems that degrade the performance of network applications. To fill this gap, DoE has selected InfoBeyond to develop an Automated Soft-Failure Diagnostic Tool Using Machine Learning (DiagSoftfailure) for Network Users, which infers the location and root cause of network failures that result in performance degradation. It applies Big Data and Machine Learning to analyze network behavior and identify both known and unknown soft failures in the network.

In general, DiagSoftfailure provides the capabilities for automated network soft-failure diagnosis. It is designed specifically as an automated, user-focused network diagnosis tool that helps network users actively find the performance problems that cause an application to run slower than expected. DiagSoftfailure is expected to have several features:

• User-focused diagnosis that requires no cooperation from the network manager;

• An adaptive network signature that is robust against data inconsistency and the high dimensionality of network behavior data, ensuring high diagnosis accuracy;

• The capability to identify unknown faults by combining supervised and unsupervised machine learning;

• No changes required to the OS kernel, which allows implementation flexibility;

• A diagnosis report that groups test results into categories in a comprehensive format that novice users can understand.
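As a rough illustration of how supervised and unsupervised learning can be combined to flag unknown faults, the following minimal Python sketch may help. It is not InfoBeyond's method, which this announcement does not describe. It assumes two hypothetical normalized features (loss rate and delay), invented fault labels, and an arbitrary threshold factor of 3. A label-free nearest-neighbor distance check acts as the unsupervised novelty detector, and a nearest-centroid classifier acts as the supervised step.

```python
import math
from collections import defaultdict

# Hypothetical, made-up training data: (normalized loss rate, normalized delay).
SAMPLES = [(0.90, 0.10), (0.80, 0.20), (0.85, 0.15),
           (0.10, 0.90), (0.20, 0.80), (0.15, 0.85)]
LABELS = ["packet_loss"] * 3 + ["high_latency"] * 3  # hypothetical fault names


def fit(samples, labels, factor=3.0):
    """Return per-class centroids (supervised) and a novelty threshold (unsupervised)."""
    # Supervised part: average the samples of each labeled fault class.
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    centroids = {
        y: tuple(sum(col) / len(xs) for col in zip(*xs))
        for y, xs in grouped.items()
    }
    # Unsupervised part: labels are ignored. Measure how far each sample is
    # from its nearest neighbor and scale the largest gap by an assumed factor.
    nearest = [
        min(math.dist(x, other) for j, other in enumerate(samples) if j != i)
        for i, x in enumerate(samples)
    ]
    return centroids, factor * max(nearest)


def diagnose(x, samples, centroids, threshold):
    """Classify x as a known fault, or report 'unknown' if it looks novel."""
    if min(math.dist(x, s) for s in samples) > threshold:
        return "unknown"
    return min(centroids, key=lambda y: math.dist(x, centroids[y]))


CENTROIDS, THRESHOLD = fit(SAMPLES, LABELS)
print(diagnose((0.88, 0.12), SAMPLES, CENTROIDS, THRESHOLD))
print(diagnose((0.50, 0.50), SAMPLES, CENTROIDS, THRESHOLD))
```

With this toy data, an observation close to a labeled cluster is assigned that cluster's fault name, while an observation far from every training sample is reported as unknown. A real tool would need far richer features and a validated threshold.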

In this effort, DiagSoftfailure will be developed as a tool that helps users find the root cause of network performance degradation.

Users or network administrators can then rapidly resolve the network problem, improve connection speeds, and alleviate user dissatisfaction. By automating diagnosis, the tool reduces the manpower required for network diagnosis as much as possible, and can therefore substantially reduce the cost of network operation and system maintenance.

