Continuous top-k queries over real-time web streams
In the era of the real-time web, online news media aggregators, like Google News, and social media sites, like Twitter, strive for effective and efficient filtering of large volumes of information for millions of users. Given the vast diversity and burstiness of information published on the web each minute, filtering, ranking and delivering content streams to interested users becomes a challenging task. Moreover, feedback signals on the content, such as clicks or shares, provide useful information on content importance, but also require more complex manipulation techniques for filtering these streams. Scoring functions in this context consider both static and dynamic ranking factors, such as profile relevance, recency of information and dynamic feedback signals. Existing works for the real-time web fail to handle such dynamic scoring functions in an online way and rather adapt an approach of iterative execution of snapshot queries. In this thesis, we are interested in efficient evaluation techniques of continuous top-k queries over text and feedback streams featuring generalized scoring functions which capture dynamic ranking aspects. As a first contribution, we generalize state of the art continuous top-k query models, by introducing a general family of non-homogeneous scoring functions combining query-independent item importance with query-dependent content relevance and continuous score decay reflecting information freshness. Our second contribution consists in the definition and implementation of efficient in-memory data structures for indexing and evaluating this new family of continuous top-k queries. Our experiments show that our solution is scalable and outperforms other existing state of the art solutions, when restricted to homogeneous functions. Going a step further, in the second part of this thesis we consider the problem of incorporating dynamic feedback signals to the original scoring function and propose a new general real- time query evaluation framework with a family of new algorithms for efficiently processing continuous top-k queries with dynamic feedback scores in a real-time web context. Finally, putting together the outcomes of these works, we present MeowsReader, a real-time news ranking and filtering prototype which illustrates how a general class of continuous top-k queries offers a suitable abstraction for modelling and implementing continuous online information filtering applications combining keyword search and real-time web activity.
Defence : 09/17/2015 - 14h - Site Jussieu 25-26/105 Jury members : Mme. Evaggelia Pitoura, Professeur, University of Ioannina [Rapporteur]
Mme Sihem Amer-Yahia, Directrice de Recherche, Laboratoire d’Informatique de Grenoble [Rapporteur]
M. Bernd Amann, Professeur, Université Pierre & Marie Curie
M. Vassilis Christophides, Professeur, University of Crete
M. Themis Palpanas, Professeur, LIPADE - Paris Descartes Univ.
M. Dan Vodislav, Professeur, Université de Cergy-Pontoise
M. Ludovic Denoyer, Professeur, Université Pierre & Marie Curie
Z. Hmedeh, N. Vouzoukidou, N. Travers, V. Christophides, C. du Mouza, M. Scholl : “Characterizing Web Syndication Behavior and Content”, WISE'11, The 11th International Conference on Web Information System Engineering, LNCS, Sidney, Australia, pp. 29-42 (2011)