PhD graduate
Team: BD
Departure date: 08/31/2013

Supervision: Stéphane GANÇARSKI

Web Archives Quality : modeling and optimization

The Web has become the most important medium for spreading information of great cultural, scientific, or economic value. Archiving the Web, or at least part of it, is crucial to preserve useful information for future generations of researchers, writers, historians, and others. However, archivists face a major challenge in maintaining the quality of the collected data, which should reflect the real Web. This thesis aims to improve the quality of Web archives. We focus on two quality measures that are highly relevant for assessing Web archives: temporal completeness and temporal coherence. We propose a new Web archiving approach based on the visual aspect of pages, which detects changes in the same way that users perceive them. We then propose a method to evaluate the importance of the detected changes. We model the importance of changes with patterns, through the PPaC model (Pattern of Pages Changes). Unlike existing models based on the average rate of change, PPaC better predicts the periods in which important changes are expected to occur on Web pages. Based on PPaC, we propose several crawling strategies that aim to improve temporal completeness, temporal coherence, or both. These strategies have been implemented and tested on both simulated and real pages. The results show that the PPaC model, based on the importance of changes, is a useful instrument for significantly improving the quality of archives.

Defence: 11/18/2011

Jury members:

Serge Abiteboul, Research Director at INRIA-Saclay [Reviewer]
Vassilis Christophides, Professor at FORTH-ICS [Reviewer]
Elisabeth Murisasco, Professor at USTV
Bernd Amann, Professor at UPMC
Julien Masanès, Director of Internet Memory Foundation
Jérôme Mainka, Research Director at Antidot
Stéphane Gançarski, Associate Professor (HDR) at UPMC

2010-2012 Publications

