Joint Inference in Information Extraction and Social Network Analysis
Speaker: Andrew McCallum (University of Massachusetts Amherst)
In this talk I will describe recent research at the intersection of information extraction, data mining and social network analysis. In particular, I will focus on how such a combination can be made both robust and scalable. I will show that the typical brittle cascading of errors from information extraction to data mining can be avoided with unified probabilistic inference in graphical models, and that these models can be made efficient with recent methods of approximate inference and learning. After briefly introducing conditional random fields, I will demonstrate their use in joint models of extraction, entity resolution, and sequence alignment.
I will then describe two methods of integrating textual data into a particular type of data mining: social network analysis. In one model, we discover role-similarity between entities by examining not only network connectivity, but also the words communicated on those edges; I'll demonstrate this method on a large corpus of email data subpoenaed as part of the Enron investigation. In another model, we discover groups of entities and the "topical" conditions under which different groupings arise; I'll demonstrate this on coalition discovery from many years' worth of voting records in the U.S. Senate and the U.N. I'll conclude with further examples of graphical models successfully applied to relational data, as well as a discussion of their applicability to trend analysis, expert-finding and bibliometrics.
Joint work with colleagues at UMass: Charles Sutton, Aron Culotta, Chris Pal, Ben Wellner, Michael Hay, Xuerui Wang, Natasha Mohanty, David Mimno, Gideon Mann, Wei Li, and Andres Corrada.
Bio: Andrew McCallum is an Associate Professor and Director of the 15-person Information Extraction and Synthesis Laboratory in the Computer Science Department at the University of Massachusetts Amherst. He was previously Vice President of Research and Development at WhizBang Labs, a company that used machine learning for information extraction from the Web. In the late 1990s he was a Research Scientist and Coordinator at Justsystem Pittsburgh Research Center, where he spearheaded the creation of CORA, an early research paper search engine that used machine learning for spidering, extraction, classification and citation analysis. McCallum received his PhD from the University of Rochester in 1995, followed by a post-doctoral fellowship at Carnegie Mellon University.
He is the recipient of two NSF ITR awards, the UMass NSM Distinguished Research Award, the UMass Lilly Teaching Fellowship, and the IBM Faculty Partnership Award. He is the Program Co-chair for the International Conference on Machine Learning (ICML) 2008, a member of the boards of the International Machine Learning Society and the CRA Community Computing Consortium, and a member of the editorial board of the Journal of Machine Learning Research. He has given tutorials or invited talks on information extraction at NIPS, KDD, ACL, MSRA, and elsewhere.
For the past ten years, McCallum has been active in research on statistical machine learning applied to text, especially information extraction, co-reference, information integration, document classification, clustering, finite state models, semi-supervised learning, and social network analysis. New work on search and bibliometric analysis of open-access research literature can be found at http://rexa.info.
Thomas.Baerecke (at) lip6.fr