DC-area Information Retrieval Experts (DIRE)
Fall 2012 Meeting
Time and Place
Wednesday, November 28, 2012
10:00am to 3:30pm
Graduate Center, Columbia Campus
8890 McGaw Road
Columbia, MD 21045
Location and hours
- 10:00: opening, coffee & donuts, meet and greet
- 10:15 - 10:45: James Mayfield, Johns Hopkins
Cold Start Knowledge Base Construction
In the past 3 years, the NIST Text Analysis Conference Knowledge Base
Population (TAC-KBP) track has conducted tests of components (i.e.,
entity linking and slot filling) that are required to automatically
build knowledge bases from unstructured text. This year a new task
was proposed that directly assesses the fidelity of automatically
induced KBs. The task is called Cold Start Knowledge Base
Population (cold start implies building a KB from scratch).
In this talk we describe the task, explain how it is feasible to
evaluate KBs generated from different teams, and present a novel
evaluation paradigm. We also describe the submission from the JHU
HLTCOE. If pressed, we might consider giving a *live demo* using a KB
constructed from 26,000 articles from the Washington Post.
The KELVIN system learned 194k assertions, including that Warren
Buffett has a son, Howard, is 79 years old, and resides in Nebraska;
it also identified seven companies in which he owns stock. The
system also learns which high school Barbara Mikulski attended, that
Freeman Hrabowski is an employee of UMBC, and that the Applied Physics
Laboratory is a subsidiary of the Johns Hopkins University.
The system isn't perfect: it thinks Jill Biden is married to Jill
Biden and that "MacBook Air" is a subsidiary of Apple.
- 10:45 - 11:15: Lana Yeganova, NCBI/NLM
Finding Biomedical Categories in Medline
There are several human-defined ontologies relevant to Medline. However, Medline is a fast-growing collection of biomedical documents, which makes these ontologies difficult to update and expand. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations.
In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms, potential biomedical categories.
We find that both approaches produce reasonable terms as potential categories. We also find that there is a significant agreement between the two sets of terms. The overlap between the two methods improves our confidence regarding categories predicted by these independent methods.
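The pattern-based method in the abstract above can be illustrated with a small Python sketch. This is not the speakers' system: the talk describes patterns learned by an alignment-based technique, whereas the two regular expressions below are hand-written stand-ins, and the sentences are made-up examples rather than Medline text. The names `PATTERNS`, `count_hypernyms`, and `SENTENCES` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical hand-written patterns standing in for the learned ones.
# Each pattern names the hypernym group "hyper" and the hyponym group "hypo".
PATTERNS = [
    re.compile(r"\b(?P<hyper>\w+) such as (?P<hypo>\w+)\b"),
    re.compile(r"\b(?P<hypo>\w+) and other (?P<hyper>\w+)\b"),
]


def count_hypernyms(sentences):
    """Count how often each word appears in the hypernym slot of a pattern."""
    counts = Counter()
    for sentence in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(sentence):
                counts[match.group("hyper").lower()] += 1
    return counts


# Made-up sentences for illustration only (not drawn from Medline).
SENTENCES = [
    "Patients received antibiotics such as penicillin.",
    "Resistance to antibiotics such as vancomycin is rising.",
    "Mutations in enzymes such as kinases were observed.",
    "Aspirin and other drugs were compared.",
]

counts = count_hypernyms(SENTENCES)
# Frequent hypernyms are the candidate categories.
print(counts.most_common())
```

A real pipeline would match noun phrases rather than single words and would apply a frequency threshold over the whole corpus; both are omitted here to keep the sketch short.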
- 11:15 - 11:45: Henry Feild, UMass Amherst
CrowdLogger as a Community Platform for Searcher Behavior Experiments
Searcher behavior can be mined to many ends. One common task is to use it to understand user habits, such as their tendency to re-find information or switch search engines. Another is to use it to evaluate multiple retrieval algorithms or interfaces. In both situations, researchers would ideally be able to request feedback from users, e.g., “Why did you switch search engines?” or “Which system do you prefer and why?”
User studies are a great way to conduct such tasks, but they are also time consuming. Whether in a controlled lab setting or in-situ, researchers conducting studies need to develop and configure logging software and recruit subjects. If any sense of user search history is needed, the study must be extended in order to collect that data. Reproducibility is also an issue due to differences in user populations and the software.
To reduce the severity of these issues, we propose a community-shared platform on which to evaluate new retrieval algorithms and search tools and to explore user behavior. The platform software should be easy to install, have a large user base, provide secure and private logging, and expose an API that allows researchers to conduct experiments in-situ. The software should also allow researchers to use it out-of-network for in-lab studies.
We believe that CrowdLogger is a natural base for this community-shared platform. CrowdLogger is an extension for the Firefox and Chrome web browsers that logs search behavior locally and has mechanisms for aggregating sensitive data across users privately (which is useful if an algorithm depends on, e.g., query rewrites entered by other users). However, there are many additional challenges that we must overcome, including:
- data management across multiple experiments
- an API that allows researchers sufficient control over accessing user data and implementing experiments
- controlling what data is shared with researchers (just feedback data? or queries, etc., too?)
- incentivizing users to download the extension and participate in experiments
In this talk, we will give an overview of the proposed system and some initial ideas about how to address the challenges listed above. We hope to get feedback from the DIRE community to improve the development of the platform.
- 11:45 - 12:15: Clay Fink, Johns Hopkins
Can You Accurately Gauge Sentiment With Social Media? A Comparison of Twitter Sentiment to Polling Data from the Nigerian Presidential Election of 2011
To what extent can social media be used to augment traditional opinion polling as a means of gauging political sentiment in the developing world? We investigated how well Twitter captured public opinion during the run-up to the 2011 Nigerian Presidential election by comparing candidate mentions and extracted sentiment to official election returns and polling data. We found significant correlations between mentions of candidates and election results, indicating that Twitter mirrored the regional trends in the data. However, Twitter was less accurate in estimating mean levels of support. Analysis of sentiment in tweets mirrored regional trends less accurately and showed a strong negativity bias against the incumbent president, although some correct geographic trends were still evident.
This talk will focus on the sentiment extraction portion of this work. Extracting sentiment from Twitter is a particularly difficult task because of the short length of documents, the endless variety of ways people have of expressing themselves in 140 characters or fewer, and the multiplicity of topics discussed. Using textual features alone, we obtained accuracies of 80% or more in predicting the sentiment of tweets mentioning candidates for the Nigerian presidency. In this talk, we will describe the use of Amazon Mechanical Turk for obtaining reliable sentiment annotations; engineering features for training sentiment classifiers; and the results of our machine learning experiments. We will also briefly discuss the methodology used for comparing extracted sentiment to polling data and the election results.
This was work done by Clay Fink, Nathan Bos, Jonathon Kopecky, and Edwina Liu, all at APL, under a grant from the Office of Naval Research.
- 12:15 - 1:15: pizza lunch (on site)
- 1:15 - 1:45: Doug Oard, University of Maryland
Learning from the Voice of Apollo
Between 1968 and 1975, a total of 33 Americans rode 15 rockets into space. Twelve of them walked on the moon. It was one of the most extensively documented events in human history. Paradoxically, nobody knows all that happened during those missions. Why? Because most of what happened during a mission happened here on Earth. The three people who flew in space on each mission were the tip of a very large pyramid. We also have tens of thousands of hours of recorded conversations among people here on the ground that nobody has heard for more than 40 years. Why should we care? Because today we continue to coordinate life-critical events in similar ways. What happened in the control center of the Fukushima nuclear power plant in the hours immediately following the tsunami? Probably it was similar in some ways to the literally dozens of life-critical and mission-critical events in Apollo, Skylab, and the Apollo-Soyuz Test Project (ASTP). But with one difference: we can now study those programs from 40 years ago, and learn from them. Why should an information retrieval researcher care? Because we hold the keys to addressing this challenge. You can’t listen to tens of thousands of hours of audio; you need search technology to take you to what you need to hear. And even if you could, you couldn’t possibly read the literal mountain of documents that would be needed to provide context and elaboration to what you would hear. But naming the challenge is just the first step. In this talk I will describe the approach that we are taking in our new NSF grant to gain physical access to this unique content, to develop methods for providing intellectual access to it, and to develop test collections to characterize how well we are doing at that. This is joint work with John Hansen at the University of Texas at Dallas.
- 1:45 - 2:15: Rezarta Islamaj, NCBI/NLM
The NCBI disease corpus: an annotated corpus for disease name recognition and normalization
Information in biomedical literature is only valuable if efficient and reliable ways of accessing and analyzing that information are available. In this regard, text-mining tools and natural language processing techniques have been developed to automatically detect important biomedical concepts such as diseases. To accelerate text-mining research in disease name recognition and normalization, we have developed a large corpus of 793 fully annotated PubMed abstracts. Each PubMed abstract was annotated by two human annotators with disease mentions, as well as their corresponding concepts in Medical Subject Headings (MeSH®) or Online Mendelian Inheritance in Man (OMIM). Fourteen annotators were randomly paired, and differing annotations were discussed to reach consensus. The product, the NCBI disease corpus, contains 6,892 disease mentions mapped to 790 unique disease concepts, and is a valuable resource for mining disease-related information in biomedical text.
- 2:15 - 2:45: Ben Carterette, University of Delaware
Building Test Collections for Whole-Session Evaluation of IR Systems
Information Retrieval has traditionally focused on building search engines that serve the best results for a single user query. But much user interaction with systems is iterative, with users reformulating queries based on what the engine has shown them before. We call a sequence of reformulations towards satisfying a user's information need a /session/. Systems that are optimized to perform over a session may be able to gauge how the user is progressing, better respond to user actions, and intervene when necessary to help the user accomplish their goal.
We will discuss the TREC Session track, which has the goal of producing test collections that can be used to optimize and evaluate systems over entire sessions. The track, which has been running since 2010, has so far produced test collections for evaluating only the final query in sessions of varying length. We will also present plans for extending these test collections to whole-session evaluation; the plans emerged from discussion at a week-long NII Shonan workshop on the topic.
- 2:45 - 3:00: Charles Nicholas, UMBC
Experience with Normalized Compression Distance
The Normalized Compression Distance, or NCD, is a versatile and intuitively appealing measure of object similarity. The idea behind NCD is simple: concatenate the two objects to be compared, and compute the length of the resulting object when compressed. If the length of this compressed object is less than what would be expected when looking at the lengths of the individual objects when compressed, the objects share substrings in a way that the compression algorithm can exploit.
However, the overhead imposed by compression often causes NCD to violate the reflexivity and symmetry properties of metrics. Furthermore, a naive implementation of NCD does not scale well, since compression is a relatively expensive operation. We propose the dzd measure, which appeals to the same intuition as NCD but also satisfies the reflexivity, symmetry, and triangle inequality properties of metrics. The dzd measure is also easier to compute than NCD, in that partial results can be computed once and saved for future use. We tested both measures against a private malware collection of many thousands of specimens. The dzd measure consistently agrees with NCD when tested using random pairs of objects drawn from this collection, but an unoptimized implementation of dzd seems to be faster by a factor of 2.5.
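The NCD idea described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the speaker's implementation: it assumes zlib as the compressor and the commonly used formulation NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C gives the compressed length. The function names and sample inputs are hypothetical. The dzd measure is not shown because the abstract does not define it.

```python
import hashlib
import zlib


def compressed_len(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compression level 9)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance under the assumed formulation."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)  # compress the concatenation of both objects
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample objects: two near-duplicate texts and one unrelated blob.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = text_a.replace(b"lazy", b"sleepy")
# Deterministic, effectively incompressible bytes built from SHA-256 digests.
noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(30))

print("similar pair:  ", ncd(text_a, text_b))
print("unrelated pair:", ncd(text_a, noise))
print("object vs self:", ncd(text_a, text_a))
```

Because the compressor adds its own overhead, the distance from an object to itself computed this way is generally not exactly zero, which is the kind of metric violation the abstract mentions. Each call also recompresses its inputs from scratch, which is why a naive all-pairs comparison scales poorly.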
- 3:00: discussion and wrap-up