The Project

The UK Grand Challenge for Computing Research called Memories for Life requires technology for searching and efficiently organising large amounts of information about individuals, organisations and events, and for extracting knowledge from language and multimedia sources. This proposal is intended as a partial answer to that challenge. The basic research effort is to create and extend the technology needed not only to analyse large-scale unstructured or semi-structured data from various sources, but also to time-stamp each element of information, estimate its importance, and attempt to corroborate it across multiple resources. The end result would normally refer to a particular person, company or institution and would present the relevant extracted information in an understandable and intuitive manner, typically as a time-ordered graph, since time is the dimension that gives overall coherence to descriptions of lives and of the sets of events that constitute companies, institutions and similar bodies. This technology, together with the ability to identify named entities unambiguously (such as persons, organisations and events) and to cross-reference possible relations and interactions between them automatically, will allow major improvements to current search and information management technology: time ordering will disambiguate competing person-graphs while also making event-graph information intuitively comprehensible. This research proposal addresses two specific problems, the first of which must be solved before the second:

i). Disambiguation of similarly named entities. It is currently very difficult to distinguish web search results for different persons who share a name (polysemy), say George H.W. Bush and George W. Bush. The problem gets worse if one or two people or organisations are very popular compared to the rest: imagine trying to find information about the other 25 persons named George Bush from Texas. A related problem is recognising material about the same person or entity whose name changes over time (polymorphism), for example recognising that "Norma Mortenson", "Norma Jean Baker" and "Marilyn Monroe" refer to the same person. Minimal research has been performed on named entity disambiguation to date, though see.

ii). Presenting relevant, timely information for particular persons, organisations or events in a structured manner. Current search technology often presents results in a counter-intuitive order. Recently created but vital pieces of information are often ranked too low in the results returned from conventional web searches. This effect is an inevitable feature of today's popularity-based ranking techniques (as used by Google, for example), which assign importance according to the number of links pointing to a particular piece of information and hence favour established (but less recent) pages. Additionally, information overload generally occurs when the number of items exceeds 7 to 20, so a better structured way of presenting results should be sought to reduce this overload [68]. In a personalised search context, where a search history is available, awareness of the temporal relations associated with a particular text can help to identify unseen or novel information and give it higher priority than information that has already been read. For example, if you are looking at information about the US Presidential Elections, you would want new, up-to-date information listed at the top, and if you are looking for papers on some topic, you would want the newest research papers first.
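
As an illustration of the ranking issue described in (ii), the following is a minimal sketch of a re-ranking step that discounts a link-based popularity signal by the age of a page and boosts pages absent from the user's search history. It is not the method of this proposal: the Result type, the rerank function, the exponential half-life decay and the boost factor are all hypothetical choices made for the example, and each page's publication date is assumed to have been extracted already.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Result:
    url: str
    link_count: int   # stand-in for a link-based popularity signal (assumption)
    published: date   # timestamp assumed to be already extracted from the page


def rerank(results, seen_urls, today, half_life_days=30.0, novelty_boost=2.0):
    """Order results by popularity, discounted by age and boosted if unseen.

    All weights are illustrative assumptions, not values from the proposal.
    """
    def score(r):
        age_days = max((today - r.published).days, 0)
        recency = 0.5 ** (age_days / half_life_days)   # halves every half_life_days
        popularity = 1.0 + math.log1p(r.link_count)    # damp very large link counts
        novelty = 1.0 if r.url in seen_urls else novelty_boost
        return popularity * recency * novelty

    return sorted(results, key=score, reverse=True)


if __name__ == "__main__":
    today = date(2006, 6, 1)
    pages = [
        Result("http://example.org/established", 10000, date(2004, 6, 1)),
        Result("http://example.org/recent", 10, date(2006, 5, 31)),
    ]
    for r in rerank(pages, seen_urls=set(), today=today):
        print(r.url)
```

With these assumed weights, a heavily linked page that is two years old falls below a lightly linked page published the previous day, which is the behaviour a purely link-count ordering would not give.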

We shall address two issues in Natural Language Processing (NLP) and Information Retrieval (IR) that have been largely neglected and whose effective solution will enable us to solve (i) and (ii) above: (a) the automated extraction of temporal information and (b) the intuitive generation and presentation of timelines from textual data associated with some named entity, so as to enable the disambiguation of similarly named entities. An effective solution to these issues needs a closely integrated mixture of NLP, Information Extraction (IE) and IR techniques. Use of temporal information is desirable from a user perspective as well: a recent US ARDA NORRC workshop showed that analysts want a system capable of timestamping information and presenting it in a timeline, with clear indications of which information is most pertinent to the task at hand.
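
Issues (a) and (b) can be pictured with a deliberately small sketch: pull explicit four-digit years out of sentences about an entity and group the sentences into a time-ordered list. This is only a toy under stated assumptions. Real temporal extraction must also handle full dates, relative expressions such as "last year" and document creation times, none of which is attempted here, and the names YEAR and build_timeline are hypothetical.

```python
import re
from collections import defaultdict

# Matches years 1000-2099 written as bare four-digit numbers (a crude assumption).
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def build_timeline(sentences):
    """Return a list of (year, [sentences]) pairs sorted by year.

    A sentence mentioning several years is listed under each of them;
    sentences with no explicit year are ignored.
    """
    by_year = defaultdict(list)
    for sentence in sentences:
        years = {int(y) for y in YEAR.findall(sentence)}
        for year in sorted(years):
            by_year[year].append(sentence)
    return sorted(by_year.items())


if __name__ == "__main__":
    text = [
        "Marilyn Monroe died in 1962.",
        "She was born in 1926.",
        "No date is given in this sentence.",
    ]
    for year, events in build_timeline(text):
        print(year, events)
```

Ordering by extracted time rather than by input order is what lets the resulting list be read as a timeline; a fuller system would also record the source of each sentence so that an event can be corroborated across resources.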

The People

Principal Investigator: Yorick Wilks, Computer Science Department, University of Sheffield
Research Fellows: Roberta Catizone and Angelo Dalli, Computer Science Department, University of Sheffield

Publications

[2006] Catizone, R., Dalli, A. and Wilks, Y. Evaluating Automatically Generated Timelines from the Web, LREC 2006, Genoa, Italy

[2006] Dalli, A. Spatio-Temporal Analysis of News and Blogs, WWW2006, Edinburgh