Wikification and Beyond: The Challenges of Entity and Concept Grounding

Dan Roth (UIUC), Heng Ji (RPI), Ming-Wei Chang (MSR), Taylor Cassidy (ARL, IBM)

Contextual disambiguation and grounding of concepts and entities in natural language text are essential to moving forward in many natural language understanding tasks and are fundamental to many applications. The Wikification task aims at automatically identifying concept mentions appearing in a text document and linking them to (or “grounding them in”) concept referents in a knowledge base (KB) (e.g., Wikipedia). For example, consider the sentence, "The Times report on Blumenthal (D) has the potential to fundamentally reshape the contest in the Nutmeg State." A Wikifier should identify the key entities and concepts (Times, Blumenthal, D and the Nutmeg State), and disambiguate them by mapping them to an encyclopedic resource, revealing, for example, that “D” here represents the Democratic Party, and that “the Nutmeg State” refers to Connecticut. Wikification may benefit both human end-users and Natural Language Processing (NLP) systems. When a document is Wikified, a reader can more easily comprehend it, as information about related topics and relevant enriched knowledge from a KB is readily accessible. From a system-to-system perspective, a Wikified document conveys the meanings of its key concepts and entities by grounding them in an encyclopedic resource or a structurally rich ontology.
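To make the example sentence concrete, the following is a minimal sketch of what a Wikifier's output might look like. It uses a hand-built, hypothetical surface-form dictionary (`TOY_KB`) and plain string matching; real systems do not work this way, and the specific targets chosen for "Times" and "Blumenthal" are assumptions made for illustration, since the text above only states the referents of "D" and "the Nutmeg State".

```python
import re

# Hypothetical toy knowledge base: surface form -> Wikipedia-style title.
# The targets for "Times" and "Blumenthal" are illustrative assumptions.
TOY_KB = {
    "Times": "The_New_York_Times",
    "Blumenthal": "Richard_Blumenthal",
    "D": "Democratic_Party_(United_States)",
    "the Nutmeg State": "Connecticut",
}


def wikify(text, kb=TOY_KB):
    """Return sorted (start, end, mention, title) tuples for KB surface forms in text."""
    links = []
    # Try longer surface forms first so they take precedence over shorter ones.
    for surface in sorted(kb, key=len, reverse=True):
        # Match the surface form only when it is not part of a longer word.
        pattern = r"(?<!\w)" + re.escape(surface) + r"(?!\w)"
        for m in re.finditer(pattern, text):
            # Skip matches that overlap a span that has already been linked.
            if any(m.start() < end and start < m.end() for start, end, _, _ in links):
                continue
            links.append((m.start(), m.end(), surface, kb[surface]))
    return sorted(links)


sentence = ("The Times report on Blumenthal (D) has the potential to "
            "fundamentally reshape the contest in the Nutmeg State.")

for start, end, mention, title in wikify(sentence):
    print(f"{mention!r} [{start}:{end}] -> {title}")
```

A fixed dictionary cannot resolve ambiguity: "D" or "Times" would be linked the same way in every document, which is why contextual disambiguation is the core of the task.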

The primary goals of this tutorial are to review the framework of Wikification and motivate it as a broad paradigm for cross-source linking for knowledge enrichment. We will present and discuss multiple dimensions of the task definition, present the basic building blocks of a state-of-the-art Wikifier system, share some key lessons learned from the analysis of evaluation results, and discuss recently proposed ideas for advancing work in this area along with some of the key challenges. We will also suggest some research questions brought up by new applications, including interactive Wikification, social media, and censorship. The tutorial will be useful for both senior and junior researchers with interests in cross-source information extraction and linking, knowledge acquisition, and the use of acquired knowledge in natural language processing and information extraction. We will try to provide a concise roadmap of recent perspectives and results, as well as point to some of our Wikification resources that are available to the research communities.


After briefly motivating and introducing the general task, the first part of the tutorial will be a methodological presentation of a skeletal Wikification system that will allow us to focus on some of the key challenges and computational directions. We will then describe in detail some of the obstacles, including the scarcity of supervision signals, issues related to mention detection and candidate selection in different scenarios, and issues that arise when dealing with diverse text genres. Advanced methods that address these obstacles will be surveyed carefully. We will conclude with a discussion of some key remaining challenges and future work.

  1. Motivation and Task Definition [30 minutes] We will describe the notion of Wikification as a generic cross-source linking problem, motivate the task from both human reader and system perspectives, and exemplify some applications. We will then lay out multiple dimensions of the task definition and illustrate how different settings are appropriate for different applications and what impact they might have on computational approaches.

  2. A Skeletal View of Wikification Systems [45 minutes] We will present the general architecture of a systematic approach to Wikification, and use it to survey existing approaches from a fairly unified perspective. In doing so, we will address the key computational steps (mention identification, candidate identification, and decision making) and, within these, key issues such as knowledge representation, local and global context analysis, relevant statistical features, the role of machine learning, and the utilization of unlabeled data.

  3. Key Challenges and Recent Advances [35 minutes] We will address some of the key challenges facing high-performing end-to-end Wikification approaches once the basic algorithmic solutions are in place. In doing that we will touch upon all stages of the pipeline: mention identification, candidate generation, ranking of candidates and the identification of concepts and entities that are outside the knowledge base. We will discuss solutions that advance joint modeling of some of these computational steps, joint inference of Wikification with an application (e.g., coreference) or an additional process (e.g., relation identification) and new training models, and exhibit their impact on various steps in the Wikification pipeline.

  4. New Tasks, Trends and Applications [30 minutes] In this part of the tutorial we will address some of the new challenges that arise from extending the Wikification task to new settings – social media, cross-lingual Wikification, censored data, etc. We will present some of the solutions that have started to emerge in this area (e.g., to deal with short and noisy text), along with some recent and interesting applications.

  5. What's Next? [10 minutes] We will conclude with a discussion of some of the open issues in this domain. These include the challenge of dealing with multiple knowledge bases of different levels of quality, difficulties that arise when interacting with users at multiple levels of expertise and those that result from using cross-genre data. We will also provide pointers to resources, including data sets, software and on-line demos.
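As a rough illustration of the candidate generation, ranking, and out-of-KB decisions mentioned in parts 2 and 3 of the outline, the sketch below scores each candidate by a prior plus its word overlap with the mention's context, and returns no link when the best score falls below a threshold. Everything in it is hypothetical: the `Candidate` fields, the `CANDIDATES` index, the prior values, the scoring function, and the threshold are made-up toy choices, not the method of any particular system covered in the tutorial.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    title: str
    prior: float        # toy stand-in for P(title | mention); values are made up
    context: frozenset  # toy bag of words describing the KB entry


# Hypothetical candidate index for a single ambiguous mention.
CANDIDATES = {
    "D": [
        Candidate("Democratic_Party_(United_States)", 0.3,
                  frozenset({"party", "senate", "contest", "state"})),
        Candidate("D_(programming_language)", 0.4,
                  frozenset({"programming", "language", "compiler"})),
    ],
}


def score(cand, context_words, weight=1.0):
    """Toy local score: prior plus weighted count of shared context words."""
    return cand.prior + weight * len(cand.context & context_words)


def link(mention, context_words, index=CANDIDATES, nil_threshold=0.5):
    """Return the best-scoring title, or None when the mention is judged out-of-KB."""
    cands = index.get(mention, [])      # candidate generation
    if not cands:
        return None                     # no candidates at all
    best = max(cands, key=lambda c: score(c, context_words))  # ranking
    # Low-confidence decisions are treated as "not in the knowledge base".
    return best.title if score(best, context_words) >= nil_threshold else None


sentence = ("The Times report on Blumenthal (D) has the potential to "
            "fundamentally reshape the contest in the Nutmeg State.")
context = set(re.findall(r"\w+", sentence.lower()))

print(link("D", context))
print(link("D", {"compiler"}))
print(link("D", set()))
```

In this toy setup the same mention resolves differently depending on its context, and with no supporting context the prior alone does not clear the threshold, so no link is produced.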

Tutorial Instructors:

Name: Dan Roth
Affiliation: University of Illinois at Urbana-Champaign
Email address: danr@illinois.edu
Website: http://L2R.cs.uiuc.edu/
Research Interests and Area of Expertise:
Dan Roth is a Professor in the Department of Computer Science at the University of Illinois at Urbana-Champaign and the Beckman Institute for Advanced Science and Technology (UIUC) and a University of Illinois Scholar. He is a fellow of AAAI, ACL and the ACM. Roth has published broadly in machine learning, natural language processing, knowledge representation and reasoning, and has received several paper, teaching and research awards. He has developed several machine learning based natural language processing systems that are widely used in the computational linguistics community and in industry, and has presented invited talks and tutorials at several major conferences. Over the last few years he has worked on Entity Linking and Wikification. He has given several tutorials at ACL/NAACL/ECL and other forums.

Name: Heng Ji
Affiliation: Rensselaer Polytechnic Institute
Email address: jih@rpi.edu
Website: http://nlp.cs.rpi.edu/hengji.html
Research Interests and Area of Expertise:
Heng Ji is the Edward G. Hamilton Development Chair Associate Professor in the Computer Science Department at Rensselaer Polytechnic Institute. Her research interests focus on Natural Language Processing, especially on Cross-source Information Extraction and Knowledge Base Population. She coordinated the NIST TAC Knowledge Base Population task in 2010, 2011 and 2014 and has published several papers on entity linking and Wikification.

Name: Ming-Wei Chang
Affiliation: Microsoft Research
Email address: minchang@microsoft.com
Website: http://research.microsoft.com/en-us/um/people/minchang/
Research Interests and Area of Expertise:
Ming-Wei Chang is a researcher at Microsoft Research. His research interests are in machine learning and natural language understanding. He currently focuses on using large-scale structured and unstructured data for semantic understanding. Specifically, he is interested in developing algorithms for entity linking that are effective for short and noisy text.

Name: Taylor Cassidy
Affiliation: U.S. Army Research Laboratory & IBM Research
Email address: taylor.cassidy.ctr@mail.mil
Website: http://www.linkedin.com/pub/taylor-cassidy/10/b/160
Research Interests and Area of Expertise:
Taylor Cassidy is a Postdoctoral Researcher at the U.S. Army Research Laboratory & IBM Research. His research interests include Cross-lingual Entity Linking and Wikification for social media.