PhD graduated
Team : NPA
Departure date : 02/02/2011

Supervision : Serge FDIDA

Conception de Mécanismes d'Amélioration de la Gestion d'Incidents dans les Réseaux IP (Design of Mechanisms to Improve Incident Management in IP Networks)

Operator IP networks carry most of the world's data traffic every day and must therefore be increasingly reliable. However, these networks are often subject to incidents that arise from maintenance work or unexpected failures. Many of these incidents are unavoidable, mainly because their origins are external to the network operators. Moreover, when they occur, the network can suffer considerable damage. It is therefore important to develop tools that prevent network incidents or at least limit their impact on the network. In this context, automatic procedures can help accelerate troubleshooting and maintenance work and thus reduce the overall downtime of the network. The main focus of this thesis is the automatic detection of IP network incidents. To reach this goal, we need a deep understanding of these incidents and their effects on the network. Network operators use trouble tickets to track all the steps of troubleshooting and maintenance activities. The history of trouble tickets carries valuable information for network management. Tickets are text documents that store the description (and the cause) of incidents that required operator intervention. The effects of these incidents are observable through alarm messages that come from different sources (for instance, SNMP, router syslogs, or routing protocols); we focus on routing alarm messages. Our key observation is that operators already use trouble ticketing systems to record all events that require their intervention. Hence, we can use the history of trouble tickets combined with intradomain routing messages to train a classifier. We can then apply this classifier online to process intradomain routing messages and automatically single out the critical events.
As a first step, we propose Troubleminer, a mechanism based on document clustering techniques to (1) automatically extract the causes of network incidents from tickets and (2) organize a collection of trouble tickets into a hierarchy that network operators can easily use. Then, we develop a heuristic to correlate trouble tickets with routing instability events in two operational networks: a VPN provider and the Internet2 backbone network. We find that 4% (VPN operator) and 23% (Internet2) of routing events in these networks are critical, meaning that they coincide with trouble tickets. Finally, we show the feasibility of detecting critical routing events by means of the k-NN and Random Forest algorithms. Our results show that we can accurately pinpoint approximately 70% of critical events in both networks.
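The labelling and classification steps described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `Ticket` and `RoutingEvent` records, the time-overlap rule with a `slack` margin, the two numeric features, and the sample data are all assumptions made for illustration. An event is marked critical when it coincides with a ticket's time window, and a plain k-NN majority vote then classifies a new event. Random Forest, the other algorithm named in the abstract, is not shown.

```python
from collections import Counter
from dataclasses import dataclass
from math import dist


@dataclass
class Ticket:
    start: float  # ticket opening time in seconds (hypothetical field)
    end: float    # ticket closing time in seconds (hypothetical field)


@dataclass
class RoutingEvent:
    start: float
    end: float
    features: tuple  # assumed features: (duration in s, number of routing messages)


def is_critical(event, tickets, slack=300.0):
    """Assumed rule: an event is critical if it overlaps a ticket window widened by `slack`."""
    return any(event.start <= t.end + slack and event.end >= t.start - slack
               for t in tickets)


def knn_predict(train, query, k=3):
    """Majority vote among the k labelled events closest to `query` (Euclidean distance)."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up history: one ticket and six routing events.
tickets = [Ticket(1000.0, 2000.0)]
history = [
    RoutingEvent(900.0, 1100.0, (200.0, 50.0)),
    RoutingEvent(1500.0, 1900.0, (400.0, 80.0)),
    RoutingEvent(1950.0, 2250.0, (300.0, 60.0)),
    RoutingEvent(5000.0, 5005.0, (5.0, 2.0)),
    RoutingEvent(7000.0, 7010.0, (10.0, 3.0)),
    RoutingEvent(9000.0, 9008.0, (8.0, 4.0)),
]

# Offline step: label past events using the ticket history.
train = [(e.features, is_critical(e, tickets)) for e in history]

# Online step: classify a new event from its features alone.
print(knn_predict(train, (350.0, 70.0)))
print(knn_predict(train, (6.0, 3.0)))
```

In this sketch the features are used unscaled; a real deployment would likely normalize them and choose `k` and `slack` by validation against the ticket history.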

Defence : 02/02/2011 - 10h30 - Site Jussieu 25-26/105

Jury members :

Damien MAGONI, Professor, Université de Bordeaux [Reviewer]
Philippe OWEZARSKI, Researcher, CNRS [Reviewer]
Patrick GALLINARI, Professor, UPMC Sorbonne Universités
Noëmie SIMONI, Professor, ENST Paris
Olivier FESTOR, Researcher, INRIA
Mickael MEULLE, Researcher, Orange Labs R&D (France Telecom R&D)
Serge FDIDA, Professor, UPMC Sorbonne Universités

2007-2013 Publications