Projects BD

Team : BD

  • EPIQUE - Reconstructing the long terms evolution of science through large scale analysis of science productions

    The evolution of scientific knowledge is directly related to the history of humanity. Document archives and bibliographic sources like the “Web Of Science” or PubMed contain a valuable source for the analysis and reconstruction of this evolution. The proposed project takes as starting point the contributions of D. Chavalarias and J.P. Cointet about the analysis of the dynamicity of evolutive corpora and the automatic construction of “phylomemetic” topic lattices (as an analogy with genealogic trees of natural species). Currently existing tools are limited to the processing of medium sized collections and a non interactive usage. The expected project outcome is situated at the crossroad between Computer science and Social sciences. Our goal is to develop new highly performant tools for building phylomemetic maps of science by exploiting recent technologies for parallelizing tasks and algorithms on complex and voluminous data. These tools are conceived and validated in collaboration with experts in philosophy and history of science over large scientific archives. Keywords : quantitative epistemology, evolution of science, topic detection, temporal alignment, big data processing, data science, big data

    Project leader : Bernd AMANN
    More details here
  • experimaestro - Computer science experiment scheduler and manager

    Experimaestro is an experiment manager based on a server that contains a job scheduler (job dependencies, locking mechanisms) and a framework to describe the experiments with JavaScript or in Java.

    Project leader : Benajmin PIWOWARSKI
    More details here
  • SPARQL on Spark - SPARQL query processing with Apache Spark

    A common way to achieve scalability for processing SPARQL queries over large RDF data sets is to choose map-reduce frameworks like Hadoop or Spark. Processing complex SPARQL queries generating large join plans over distributed data partitions is a major challenge in these shared nothing architectures. In this article we are particularly interested in two representative distributed join algorithms, partitioned join and broadcast join, which are deployed in map-reduce frameworks for the evaluation of complex distributed graph pattern join plans. We compare five SPARQL graph pattern evaluation implementations on top of Apache Spark to illustrate the importance of cautiously choosing the physical data storage layer and of the possibility to use both join algorithms to take account of the existing predefined data partitionings. Our experimentations with different SPARQL benchmarks over real-world and synthetic workloads emphasize that hybrid join plans introduce more flexibility and often can achieve better performance than join plans using a single kind of join implementation.

    Project leader : Hubert NAACKE
    More details here
  • BOM - Block-o-Matic!

    Block-o-Matic is a web page segmentation algorithm based on an hybrid approach for scanned document segmentation and visual-based content segmentation. A web page is associated with three structures: the DOM tree, the content structure and the logical structure. The DOM tree represents the HTML elements of a page, the geometric structure organizes the content based on a category and its geometry and finally the logical structure is the result of mapping content structure on the basis of the human-perceptible meaning that conforms the blocks. The segmentation process is divided in three phases: analysis, understanding and reconstruction of a web page. An evaluation method is proposed in order to perform the evaluation of web page segmentations based on a ground truth of 400 pages classified into 16 categories. A set of metrics are presented based on geometric properties of blocks. Satisfactory results are reached when comparing to other algorithms following the same approach.

    Project leader : Andrès SANOJA
    More details here
 Mentions légales
Site map |