Supervision : Stéphane GANÇARSKI
Web Page Segmentation, Evaluation and Applications
Web pages are becoming more complex than ever, as they are usually not designed manually but generated by Content Management Systems (CMS). Thus, analyzing them, i.e. automatically identifying and classifying different elements from Web pages, such as main content, menus, user comments, advertising among others, becomes difficult. A solution to this issue is provided by Web page segmentation. Web page segmentation refers to the process of dividing a Web page into visually and semantically coherent segments called blocks.
The quality of a Web page segmenter is measured by its correctness (or precision) and its genericity, i.e. the variety of Web page types it is able to segment. Our research focuses on enhancing this quality and on measuring it in a fair and accurate way, so that state-of-the-art segmenters can be compared.
We first propose a conceptual model for segmentation, together with Block-o-Matic (BoM), a Web page segmenter that takes both precision and genericity into account. We then propose an evaluation model that takes the content as well as the geometry of blocks into account in order to measure the correctness of a segmentation algorithm against a predefined ground truth. The quality of four state-of-the-art algorithms (including BoM) is experimentally tested on five types of pages (blog, enterprise, forum, picture and wiki). Our evaluation framework can test any segmenter: it lets us measure segmenter quality and make observations about correctness. The results show that BoM performs best among the four segmentation algorithms tested, and also that the performance of a segmenter depends on the type of page being segmented.
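To make the idea of comparing a segmentation with a ground truth on both geometry and content more concrete, here is a minimal sketch in Python. It is not the evaluation model of the thesis: the Block class, the iou and word_overlap helpers, the correct_ratio function and the 0.8 thresholds are all assumptions chosen for illustration. A computed block is counted as matching a ground-truth block when the two rectangles overlap enough and their text is similar enough.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """Hypothetical block: a rectangle (top-left corner and size) plus its text."""
    x: float
    y: float
    width: float
    height: float
    text: str = ""

    def area(self) -> float:
        return self.width * self.height


def iou(a: Block, b: Block) -> float:
    """Intersection-over-union of two rectangles, in [0, 1]."""
    overlap_w = min(a.x + a.width, b.x + b.width) - max(a.x, b.x)
    overlap_h = min(a.y + a.height, b.y + b.height) - max(a.y, b.y)
    if overlap_w <= 0 or overlap_h <= 0:
        return 0.0  # rectangles are disjoint or only touch
    inter = overlap_w * overlap_h
    return inter / (a.area() + b.area() - inter)


def word_overlap(a: Block, b: Block) -> float:
    """Jaccard similarity between the word sets of two blocks."""
    wa = set(a.text.lower().split())
    wb = set(b.text.lower().split())
    if not wa and not wb:
        return 1.0  # two blocks without text are treated as identical in content
    return len(wa & wb) / len(wa | wb)


def correct_ratio(computed, truth, geo_min=0.8, text_min=0.8):
    """Fraction of ground-truth blocks matched by at least one computed block.

    The thresholds are illustrative assumptions, not values from the thesis.
    """
    if not truth:
        return 0.0
    matched = sum(
        1
        for t in truth
        if any(
            iou(c, t) >= geo_min and word_overlap(c, t) >= text_min
            for c in computed
        )
    )
    return matched / len(truth)


# Ground truth: a menu on top of a main content area.
truth = [
    Block(0, 0, 100, 20, "site menu"),
    Block(0, 20, 100, 80, "main article text"),
]
# The segmenter finds the menu exactly but only half of the content area.
computed = [
    Block(0, 0, 100, 20, "site menu"),
    Block(0, 20, 100, 40, "main article"),
]
print(correct_ratio(computed, truth))
```

In this toy example only the menu block is matched, so half of the ground-truth blocks count as correctly segmented. A fuller measure would also need to distinguish blocks that are split, merged or missed, which this sketch does not attempt.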
We present two applications of BoM. Pagelyzer uses BoM to compare two versions of a Web page and decide whether they are similar. It is our team's main contribution to the European project SCAPE (FP7-IP). We also developed a tool for migrating Web pages from HTML4 to HTML5 in the context of Web archives.
Defence : 01/22/2015 - 10h - Site Jussieu - Salle Jean-Louis Laurière - 25-26/101
Jury members :
MURISASCO Elisabeth (Professor, Université de Toulon) [Reviewer]
RUKOZ Marta (Professor, Université de Paris Ouest Nanterre) [Reviewer]
BOUGANIM Luc (Research Director, Inria Rocquencourt)
SENELLART Pierre (Professor, Télécom ParisTech)
CORD Matthieu (Professor, UPMC)
GANÇARSKI Stéphane (Associate Professor with HDR, UPMC)
- A. Sanoja, S. Gançarski : “Block-based Migration from HTML4 Standard to HTML5 Standard in the Context of Web Archives”, SCTC16, Caracas, Venezuela (2016)
- A. Sanoja : “Segmentation des Pages Web, Évaluation et Applications”, thesis, defence 01/22/2015, supervision Gançarski, Stéphane (2015)
- A. Sanoja, S. Gançarski : “Web page segmentation evaluation”, 30th Annual ACM Symposium on Applied Computing, Salamanca, Spain (2015)
- A. Sanoja, S. Gançarski : “Block-o-Matic: A web page segmentation framework”, Multimedia Computing and Systems (ICMCS), 2014 International Conference on, Marrakesh, Morocco, pp. 595-600, (IEEE) (2014)
- A. Sanoja, S. Gançarski : “Block-o-Matic: a Web Page Segmentation Tool and its Evaluation”, 29e journées "Base de données avancées", BDA'13, Nantes, France (2013)
- A. Sanoja, S. Gançarski : “Yet Another Hybrid Segmentation Tool”, iPRES 2012 – 9th International Conference on Preservation of Digital Objects, Toronto, Canada (2012)