Seminars

Scheduled Speakers

6 June 2013 - Daniel Preotiuc / Roland Roller / Douwe Gelling (The University of Sheffield)

18 July 2013 - Internal Paper Presentations

3 October 2013 - Barry Haddow (University of Edinburgh)

10 October 2013 - Katja Markert (Leeds University)

7 November 2013 - Serge Sharoff (Leeds University)

21 November 2013 - Matthew Rowe (Lancaster University)

2012 - 2013

30th May, 2013 Internal Paper Presentations

Nikolaos Aletras - Representing Topics Using Images

Topics generated automatically, e.g. using LDA, are now widely used in Computational Linguistics. Topics are normally represented as a set of keywords, often the n terms in a topic with the highest marginal probabilities. We introduce an alternative approach in which topics are represented using images. Candidate images for each topic are retrieved from the web by querying a search engine using the top n terms. The most suitable image is selected from this set using a graph-based algorithm which makes use of textual information from the metadata associated with each image and features extracted from the images themselves. We show that the proposed approach significantly outperforms several baselines and can provide images that are useful to represent a topic.

Abdulaziz Alamri

Abdulaziz has worked in the IT field for some time, and part of his experience is in NLP. In this talk, he will introduce himself and briefly describe his previous work experience in NLP.

16th May, 2013 Ayman AlHelbawy The University of Sheffield - Collective Named Entity Disambiguation Using HMMs

In this paper we present a novel approach to disambiguate textual mentions of named entities against the Wikipedia knowledge base.

The conditional dependencies between different named entities across Wikipedia are represented as a Markov network. In our approach, named entities are treated as hidden variables and textual mentions as observations. The number of states and observations is huge, and naively using the Viterbi algorithm to find the hidden state sequence that emits the query observation sequence is computationally infeasible given a state space of this size. We build on an observation that is specific to the disambiguation problem: for each textual mention there is a disambiguation list of reference knowledge base named entities. We therefore propose an approach that uses a tailored approximation to reduce the size of the state space, making the Viterbi algorithm feasible.

Results show good improvement in disambiguation accuracy relative to the baseline approach and to some state-of-the-art approaches. Also, our approach shows how, with suitable approximations, HMMs can be used in such large-scale state space problems.
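The idea of restricting the search to each mention's disambiguation list can be illustrated with a minimal sketch. This is not the speaker's implementation: the abstract does not spell out the approximation, so the sketch simply assumes that each mention may only take entities from its own candidate list, and all probability tables and entity names below are hypothetical toy values.

```python
import math


def viterbi_with_candidates(mentions, candidates, start_p, trans_p, emit_p, floor=1e-6):
    """Return the best entity sequence, searching only each mention's candidate list."""

    def lp(table, key):
        # Unseen events get a small floor probability instead of log(0).
        return math.log(table.get(key, floor))

    # best[i][e]: best log-score of a path that labels mention i with entity e.
    first = mentions[0]
    best = [{e: lp(start_p, e) + lp(emit_p, (e, first)) for e in candidates[first]}]
    back = [{}]
    for i in range(1, len(mentions)):
        best.append({})
        back.append({})
        for e in candidates[mentions[i]]:
            # Only the previous mention's candidates are considered as predecessors.
            prev = max(best[i - 1], key=lambda p: best[i - 1][p] + lp(trans_p, (p, e)))
            best[i][e] = (
                best[i - 1][prev] + lp(trans_p, (prev, e)) + lp(emit_p, (e, mentions[i]))
            )
            back[i][e] = prev

    # Follow back-pointers from the best final entity.
    entity = max(best[-1], key=best[-1].get)
    path = [entity]
    for i in range(len(mentions) - 1, 0, -1):
        entity = back[i][entity]
        path.append(entity)
    return path[::-1]


# Hypothetical toy model: two mentions, with made-up probabilities.
mentions = ["Jordan", "Bulls"]
candidates = {
    "Jordan": ["Michael_Jordan", "Jordan_(country)"],
    "Bulls": ["Chicago_Bulls"],
}
start_p = {"Michael_Jordan": 0.5, "Jordan_(country)": 0.5}
trans_p = {
    ("Michael_Jordan", "Chicago_Bulls"): 0.9,
    ("Jordan_(country)", "Chicago_Bulls"): 0.1,
}
emit_p = {
    ("Michael_Jordan", "Jordan"): 0.4,
    ("Jordan_(country)", "Jordan"): 0.6,
    ("Chicago_Bulls", "Bulls"): 1.0,
}

path = viterbi_with_candidates(mentions, candidates, start_p, trans_p, emit_p)
print(path)
```

With this restriction, the work per mention grows with the product of two candidate-list sizes rather than with the square of the full entity inventory, which is what makes decoding tractable in the sketch.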

9th May, 2013 Daniel Beck The University of Sheffield - Minimizing Annotation Costs in Quality Estimation

Quality Estimation (QE) models provide quality feedback for new, unseen machine-translated (MT) texts without relying on reference translations. These models are usually built by applying supervised machine learning techniques to datasets composed of human-evaluated machine translations. Since QE is a task-specific problem, QE models should, ideally, be specifically tailored to their end task, taking into account the annotators, language pairs and MT systems, among other features. However, building task-specific models leads to a large increase in annotation costs. In this talk, I will show some approaches to tackle this issue, including strategies to reduce the annotation effort and to reuse different datasets by applying domain adaptation techniques.

2nd May, 2013 Wilker Aziz University of Wolverhampton - Exact Optimisation and Sampling for Statistical Machine Translation

In this talk I will present the OS* algorithm (Dymetman et al., 2012) and show how it can be used to perform exact optimisation and sampling for SMT.

OS* is a tractable form of adaptive rejection sampling that can also be used for optimisation.

The contributions of this research go beyond the exactness aspect of OS*. Sampling has many applications in SMT: it enables one to better explore the space of likely solutions, it is less prone to outliers than optimisation, and it is relevant to topics such as minimum error rate training, minimum Bayes risk decoding, and consensus decoding.

Topics relevant to this talk are: SMT, SMT decoding, automata theory, complexity theory, optimisation and sampling.

25th April, 2013 Eric Atwell Leeds University - Natural Language Processing Working Together with Arabic and Islamic Studies

You may ask: is the Qur'an a suitable dataset for Computing research? Text Analytics touches many domains and applications; but in general NLP research involves Machine Learning from a domain-specific corpus of text documents enriched with linguistic and semantic tags. Ideally, we want a domain where: a source text corpus is freely available, with no IPR or privacy restrictions; a large expert community exists, which has already developed standard "tagging schemes" or linguistic analyses and ontologies for the domain, such as Ibn Kathir's Tafsir ("gold standard" commentary); and a large user group exists, to assist with linguistic and semantic tagging, and to evaluate our systems and results, and also to use the text analytics tools we deliver, so that our research has Impact. The corpus which best meets these research criteria is the Qur'an: the source Classical Arabic text is freely available; Qur'anic scholars over the past thousand years have developed a rich tradition of Arabic linguistics to formally describe the language and meaning of the Quran; and billions of Muslims worldwide constitute the largest user-group ever for a single text corpus.

Our Arabic Language Computing research group at Leeds University developed:

  • the first free-to-download Arabic corpus, the Corpus of Contemporary Arabic;
  • the first open-source concordance tool for analysis of Arabic corpus texts, aConCorde;
  • the Standard Arabic Linguistics Morphological Analysis tag set Gold Standard SALMA-tagged sample corpus, expounding traditional fine-grained morphological features;
  • online Arabic lexical resource for Arabic root-meaning search in classical Arabic dictionaries;
  • Quran "search for a concept" tool and website, Qurany, for search in Quran and Hadith;
  • the Quranic Arabic Corpus, the first online resource which shows the Arabic "irab" morphology and grammar in the Quran, including word-by-word morphology and English gloss, and Ontology
  • tools for text mining the Quran including verse similarity, lemma concordance and collocation
  • A database of 25,000 pronoun anaphoric co-references, QurAna
  • A database of 7,600 pairs of semantically related verses, QurSim
  • web concordance, collocation and analysis tools for Querying Arabic Corpora including 170-million-word lemmatised Arabic Web Corpus, Arabic Wikipedia, Corpus of Contemporary Arabic, and specialist Arabic corpora;
  • the Leeds Arabic Discourse Treebank;
  • Web-as-Corpus resources for Islamic Studies

    We have just begun an EPSRC-funded project, Natural Language Processing Working Together with Arabic and Islamic Studies. We will bring together these different levels of linguistic annotation in a single multi-layer corpus, and add phonetic and prosodic annotations to capture Tajweed or Quranic recitation. We will also target research communities in: NLP and Artificial Intelligence; Arabic Language and Literature; Qur'anic and Islamic Studies; Religious Studies and Theology; Corpus Linguistics and Digital Humanities; Lexicography; and Linguistics and Phonetics.

    Join us at the WACL'2 Workshop on Arabic Corpus Linguistics, 22 July 2013, Lancaster University.

    18th April, 2013 Sebastian Riedel University College London - Relation Extraction with Matrix Factorization and Universal Schemas

    The ambiguity and variability of language make it difficult for computers to analyse, mine, and base decisions on. This has motivated machine reading: automatically converting text into semantic representations. At the heart of machine reading is relation extraction: predicting relations between entities, such as employeeOf(Person,Company). Machine learning approaches to this task require either manual annotation or, for distant supervision, existing databases of the same schema (=set of relations). Yet, for many interesting questions (who criticised whom?) pre-existing databases and schemas are insufficient. For example, there is no criticized(Person,Person) relation in Freebase. Moreover, the incomplete nature of any schema severely limits any global reasoning we could use to improve our extractions.

    In this talk I will first present some earlier work we have done in distantly supervised extraction. Then I will show that the need for pre-existing datasets can be avoided by using what we call a "universal schema": the union of all involved schemas (surface form predicates such as "X-was-criticized-by-Y", and relations in the schemas of pre-existing databases). This extended schema allows us to answer new questions not yet supported by any structured schema, and to answer old questions more accurately. For example, if we learn to accurately predict the surface form relation "X-is-scientist-at-Y", this can help us to better predict the Freebase employee(X,Y) relation.

    To populate a database with such a schema we present a family of matrix factorization models that predict affinity between database tuples and relations. We show that this achieves substantially higher accuracy than the traditional classification approach. More importantly, by operating simultaneously on relations observed in text and in pre-existing structured DBs, we are able to reason about unstructured and structured data in mutually-supporting ways. By doing so our approach outperforms state-of-the-art distant supervision.

    21st March, 2013 Yorick Wilks Florida Institute of Human and Machine Cognition - Can metaphor processing move to a large and empirical scale?

    The paper describes part of the current US effort on metaphor recognition and interpretation, and in particular the CMU/IHMC project METAL. The paper also presents an experimental algorithm to detect conventionalised metaphors implicit in the lexical data in a resource like WordNet, where metaphors are coded into the senses and so would never be detected by any algorithm based on the violation of preferences, since there would always be a constraint satisfied by such senses. We report an implementation of this algorithm, which was first implemented with WordNet and the (limited) preference constraints in VerbNet. We then transformed WordNet in a systematic way so as to produce far more extensive constraints based on its content, and with this data we reimplemented the detection algorithm and got a substantial improvement in recall. We suggest that this algorithm could contribute to the core detection pipeline of the METAL project at CMU. The new WordNet data is of wider significance because it also produces adjective constraints, unlike any existing lexical resource, and can be applied to any language with a WordNet for it.

    14th March, 2013 Internal Paper Presentations

    Nikolaos Aletras - Evaluating Topic Coherence Using Distributional Semantics

    This paper introduces distributional semantic similarity methods for automatically measuring the coherence of a set of words generated by a topic model. We construct a semantic space to represent each topic word by making use of Wikipedia as a reference corpus to identify context features and collect frequencies. Relatedness between topic words and context features is measured using variants of Pointwise Mutual Information (PMI). Topic coherence is determined by measuring the distance between these vectors computed using a variety of metrics. Evaluation on three data sets shows that the distributional-based measures outperform the state-of-the-art approach for this task.
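The pipeline described above (PMI-weighted context vectors for each topic word, then a vector comparison) can be sketched roughly as follows. This is an illustrative toy, not the paper's method: it assumes document-level co-occurrence in a tiny hand-made corpus instead of Wikipedia, uses positive PMI as one possible PMI variant, and uses mean pairwise cosine similarity as one possible metric. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_vectors(documents, topic_words):
    """Map each topic word to a {context_word: positive PMI} vector."""
    n_docs = len(documents)
    doc_sets = [set(doc.split()) for doc in documents]
    # Document frequency of every word in the reference corpus.
    df = Counter(word for words in doc_sets for word in words)
    vectors = {}
    for topic_word in topic_words:
        vector = {}
        for context in df:
            if context == topic_word:
                continue
            joint = sum(1 for words in doc_sets if topic_word in words and context in words)
            if joint == 0:
                continue
            pmi = math.log((joint * n_docs) / (df[topic_word] * df[context]))
            if pmi > 0:  # keep only positive associations
                vector[context] = pmi
        vectors[topic_word] = vector
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def coherence(documents, topic_words):
    """Mean pairwise cosine similarity between the topic words' PMI vectors."""
    vectors = pmi_vectors(documents, topic_words)
    pairs = list(combinations(topic_words, 2))
    return sum(cosine(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Hypothetical six-document reference corpus.
docs = [
    "cat dog pet animal",
    "dog pet vet animal",
    "cat pet vet",
    "stock market bank money",
    "bank money loan market",
    "stock loan money",
]
coherent_score = coherence(docs, ["cat", "dog", "pet"])
mixed_score = coherence(docs, ["cat", "bank", "loan"])
print(coherent_score, mixed_score)
```

On this toy corpus the topic drawn from a single theme scores higher than the topic that mixes two themes, which is the behaviour a coherence measure is meant to capture.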

    Dominic Rout - Reliably Evaluating Summaries of Twitter Timelines

    The primary view of the Twitter social network service for most users is the timeline, a heterogeneous, reverse chronological list of posts from all connected authors. Previous tweet ranking and summarisation work has heavily relied on retweets as a gold standard for evaluation. The author argues that this is unsatisfactory, since retweets only account for certain kinds of post relevance. The focus of the talk is work-in-progress on designing a user study, through which to create a gold standard for evaluating automatically generated summaries of personal timelines.

    7th March, 2013 Samia Touileb University of Bergen - Inducing local grammars from n-grams

    With the increase of information in blogs, there is a pressing need to develop tools for extracting statements that characterize the content of blog posts (e.g. to highlight different opinions). In this talk we will present our ideas for using grammar induction to create statement extraction templates that capture the typical expressions around a concept. We have evaluated two algorithms (ADIOS [Solan et al., 2005], ABL [Van Zaanen, 2001]) on input data comprising n-grams around the concept "climate change" (generated from a blog search engine).

    28th February, 2013 Andrew Salway Uni Computing, Bergen - Key Statement Extraction in the NTAP project

    Language technologies have an important role to play in making social media a more accessible information source, and in enabling social scientists to better understand how, through social media, organisations and individuals influence public opinion on important and complex issues. The first part of this talk will give an overview of the NTAP project (2012-15, www.ntap.no) which is synthesising language processing and network visualization in order to map the distribution, flow and development of information/opinions in the blogosphere. A distinctive feature of our approach is the treatment of text content as key statements, rather than as keywords, which elucidates the diverse aspects and viewpoints on an issue. By associating key statements with blogs and time-stamps, we hope to be able to track the diffusion of a statement (e.g. "climate change is caused by humans") along with statements related to it. The second part of the talk will present and discuss early results for extracting key statements from the blogosphere using relatively portable methods, with the example of statements about the causes and effects of climate change.

    21st February, 2013 Marcelo Amancio The University of Sheffield - Automatic Text Adaptation

    Text Adaptation is one of the activities that writers use to improve text comprehension and text readability for certain audiences. Two main techniques are usually used. One is Text Elaboration, which adds complementary information to the text, and the other is Text Simplification, which rewrites the text using simpler grammar and vocabulary. My talk will present my former work in Text Elaboration and my initial approach to Text Simplification within the context of my PhD work.

    29th November, 2012 Samuel Fernando The University of Sheffield - Comparing taxonomies for organising collections of documents

    There is a demand for taxonomies to organise large collections of documents into categories for browsing and exploration. This paper examines four existing taxonomies that have been manually created, along with two methods for deriving taxonomies automatically from data items. We use these taxonomies to organise items from a large online cultural heritage collection. We then present two human evaluations of the taxonomies. The first measures the cohesion of the taxonomies to determine how well they group together similar items under the same concept node. The second analyses the concept relations in the taxonomies. The results show that the manual taxonomies have high-quality, well-defined relations. However, the novel automatic method is found to generate very high cohesion.

    15th November, 2012 Roland Roller The University of Sheffield - Presentation of my former work

    In my talk I would like to present my former work, in particular my work at NTT Communication Science Laboratories and DFKI. First I would like to introduce the influence model and my extension of user turn segmentation. Both models utilize the effect of speech entrainment to improve the language model in polylogue. Furthermore, I will present the SpeechEval project, a corpus-based user simulation to evaluate spoken dialogue systems.

    8th November, 2012 Mark Steedman The University of Edinburgh - The Future of Semantic Parser Induction

    There has recently been some interest in the task of inducing grammar-based "semantic parsers" from sets of paired strings and meaning representations, following pioneering work by Zettlemoyer and Collins (2005). Work of this kind is currently limited by the paucity of datasets for training. The talk reviews the state of the art in this field, then proposes a way to semi-automatically generate much larger datasets, on the same order of magnitude as syntactic treebanks, using linguistic knowledge that has only recently begun to become available, for use in inducing semantic parsers for under-resourced languages for possible applications of semantic parsing in statistical machine translation.

    1st November, 2012 Dominic Rout The University of Sheffield - Drowning in Tweets: Automatic Summarisation of Twitter's home timelines

    Social networks such as Twitter present vast oceans of information in which it's easy for the average user to drown. Where content is generated by absolutely anyone in no time, it's easy to see why the number of incoming tweets can quickly become too much to handle. This talk discusses the problem of 'information overload' on social network services. We present a study that helped to demonstrate how twitterers at The University of Sheffield are interested in only a fraction of the content to which they are exposed. We also provide a background and describe the state of the art in personalised timeline summarisation for Twitter users.

    This presentation is given as part of the speaker's PhD research programme and discusses his ongoing work.

    25th October, 2012 Vasileios Lampos The University of Sheffield - Detecting Events and Patterns in the Social Web with Statistical Learning

    A vast amount of textual web streams is influenced by events or phenomena emerging in the real world. The Social Web forms an excellent modern paradigm, where unstructured user generated content is published on a regular basis and on most occasions is freely distributed. The main purpose of this talk is to present methods that enable us to automatically extract useful conclusions from this raw information in both supervised and unsupervised learning scenarios. Our input data stream will be the micro-blogging service of Twitter, and the presented applications will include the 'nowcasting' of Influenza-like illness rates as well as collective mood analysis for the UK.

    Selected Publications

    [1] V. Lampos and N. Cristianini. Nowcasting Events from the Social Web with Statistical Learning. ACM TIST 3(4), no. 72, 2012.

    [2] V. Lampos. Detecting Events and Patterns in Large-Scale User Generated Textual Streams with Statistical Learning Methods. PhD Thesis, University of Bristol, 2012.

    18th October, 2012 Oier Lopez De Lacalle Lekuona University of Cambridge Visiting Scholar - Domain Specific Word Sense Disambiguation

    Word Sense Disambiguation (WSD), in its broader sense, can be considered as the task of determining the sense of every word occurring in a context. Computationally, it can be seen as a classification problem, where the senses are the classes, the context provides the evidence, and each occurrence of a word is assigned to one or more possible classes based on the evidence. WSD is often described as an "AI-complete" problem, whose solution presupposes a solution to complete Natural Language Understanding (NLU).

    State-of-the-art methods which acquire linguistic knowledge via hand-tagged text mainly suffer from two drawbacks, called the data-sparseness and the domain shift problems. This is especially noticeable in WSD, where there is a lack of training examples. The domain shift problem involves potential changes in word sense distribution and context distribution. This makes it more difficult to estimate robust, high-performance models, and causes a degradation in performance when porting from one domain to another.

    This work explores domain adaptation issues for WSD systems based on features induced with Singular Value Decomposition (SVD) and the use of unlabeled data. The use of SVD and unlabeled data might help to mitigate the data sparseness problem, and make it possible to port WSD systems across domains. SVD finds a condensed representation and significantly reduces the dimensionality of the feature space. This representation captures indirect and high-order associations by finding linear combinations over features and occurrences of target words. This work presents how to induce the reduced feature space, and shows how it can help adapt a generic WSD system to specific domains.
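As a rough illustration of how SVD can induce a reduced feature space that captures indirect associations, consider the sketch below. It is a toy under stated assumptions, not the system described in the talk: the occurrence-by-feature matrix is made up, the helper names are hypothetical, and NumPy's `np.linalg.svd` stands in for whatever SVD implementation was actually used.

```python
import numpy as np


def fit_svd_projection(X, k):
    """Learn an (n_features, k) projection from an occurrence-by-feature matrix X."""
    # X = U S Vt; the top-k right singular vectors span the reduced feature space.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:k].T


def project(X, projection):
    """Map occurrences (rows of X) into the reduced feature space."""
    return X @ projection


# Hypothetical data: rows are word occurrences (labeled and unlabeled together),
# columns are three context features f0, f1, f2.
X = np.array(
    [
        [1.0, 1.0, 0.0],  # occurrence with f0 and f1
        [0.0, 1.0, 1.0],  # occurrence with f1 and f2
        [1.0, 0.0, 0.0],  # occurrence with f0 only
        [0.0, 0.0, 1.0],  # occurrence with f2 only
    ]
)

projection = fit_svd_projection(X, k=1)
Z = project(X, projection)

# Rows 2 and 3 share no feature in the original space...
print(X[2] @ X[3])
# ...but land on the same side of the latent dimension after projection,
# because f0 and f2 are linked indirectly through f1.
print(Z[2, 0] * Z[3, 0])
```

In the original space the last two occurrences are orthogonal, while in the one-dimensional reduced space they have the same coordinate, so a classifier trained on one can generalise to the other. Fitting the projection on unlabeled occurrences from the target domain is one way such a representation could be adapted across domains.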

    11th October, 2012 Diana McCarthy University of Cambridge Visiting Scholar - Compositionality modelling and non-compositionality detection with distributional semantics

    Distributional similarity has been used as a proxy for modelling lexical semantics for nearly two decades. There is now a significant and growing interest in moving these models from lexical to phrasal semantics. For just under one decade, many computational linguistics researchers have applied distributional semantics to the task of detecting non-compositionality of candidate multiwords. In this talk, I will give an overview of my work in this area. I will focus on the more recent work I have collaborated on, with Siva Reddy and colleagues, which borrows techniques from the state-of-the-art phrasal compositional models for non-compositionality detection. Ultimately, these distributional models of phrasal semantics will need to be extended to incorporate non-compositionality.

    4th October, 2012 Rob Gaizauskas The University of Sheffield - Applying ISO-Space to Healthcare Facility Design Evaluation Reports

    This paper describes preliminary work on the spatial annotation of textual reports about healthcare facility design to support the long-term goal of linking report content to a three-dimensional building model. Emerging semantic annotation standards enable formal description of multiple types of discourse information. In this instance, we investigate the application of a spatial semantic annotation standard at the building-interior level, where most prior applications have been at inter-city or street level. Working with a small corpus of design evaluation documents, we have begun to apply the ISO-Space specification to annotate spatial information in healthcare facility design evaluation reports. These reports present an opportunity to explore semantic annotation of spatial language in a novel situation. We describe our application scenario, report on the sorts of spatial language found in design evaluation reports, discuss issues arising when applying ISO-Space to building-level entities and propose possible extensions to ISO-Space to address the issues encountered.

    27th September, 2012 Kashif Shah The University of Sheffield - Weighting parallel data for model adaptation in SMT

    Statistical Machine Translation (SMT) systems use parallel texts as training material for creating the translation model, and monolingual corpora for target language modeling. The performance of an SMT system heavily depends upon the quality and quantity of available data. In order to train the translation model, parallel texts are collected from various sources and domains. These corpora are usually concatenated, word alignments are calculated, phrases are extracted and their translation probabilities are estimated. This means that the corpora are not weighted according to their importance to the domain of the translation task. Therefore, it is the domain of the training resources that influences the translations that are selected among several choices. This is in contrast to the training of the language model, for which well known techniques are used to weight the various sources of texts. We have proposed novel methods to automatically weight the heterogeneous data to adapt the translation model. I will present the underlying architecture of the proposed techniques along with experiments and results.

    2011 - 2012

    28th June, 2012 Chris Daniels The University of Sheffield (CICS) - To talk to a person, press one: An insider's view of the Automated University Switchboard Speaker

    The Automated Switchboard was the first general use of Speech Self Service at the University. Naturally the new service would be met with both interest and resistance. This talk will provide an anecdotal account of the design, development and evaluation processes used in its implementation. It will discuss the development tools and grammar design in addition to questions beyond the technical surrounding the social and political elements of replacing a human operator with an automated system at the University.

    21st June, 2012 Yang Feng The University of Sheffield - Left-to-Right Tree-to-String Decoding with Prediction

    Decoding algorithms for syntax-based machine translation suffer from high computational complexity, a consequence of intersecting a language model with a context-free grammar. Left-to-right decoding, which generates the target string in order, can improve decoding efficiency by simplifying the language model evaluation. This paper presents a novel left-to-right decoding algorithm for tree-to-string translation, using a bottom-up parsing strategy and dynamic future cost estimation for each partial translation. Our method outperforms previously published tree-to-string decoders, including a competing left-to-right method.

    30th May, 2012 Douwe Gelling The University of Sheffield - Using Senses in HMM Word Alignment

    Some of the most used models for statistical word alignment are the IBM models. Although these models generate acceptable alignments, they do not exploit the rich information found in lexical resources, and as such have no reasonable means to choose better translations for specific senses.

    We try to address this issue by extending the IBM HMM model with an extra hidden layer which represents the senses a word can take, allowing similar words to share similar output distributions. We test a preliminary version of this model on English-French data. We compare different ways of generating senses and assess the quality of the alignments relative to the IBM HMM model, as well as the generated sense probabilities, in order to gauge the usefulness in Word Sense Disambiguation.

    28th May, 2012 Judita Preiss The University of Sheffield - Identifying Comparable Corpora Using LDA

    Parallel corpora have applications in many areas of Natural Language Processing, but are very expensive to produce. Much information can be gained from comparable texts, and we present an algorithm which, given any bodies of text in multiple languages, uses existing named entity recognition software and a topic detection algorithm to generate pairs of comparable texts without requiring a parallel corpus training phase. We evaluate the system's performance firstly on data from the online newspaper domain, and secondly on Wikipedia cross-language links.

    10th May, 2012 Federico Sangati (University of Edinburgh) - Accurate Parsing with Compact Tree-Substitution Grammars: Double-DOP

    I will mainly present my EMNLP 2011 paper describing a novel approach to Data-Oriented Parsing (DOP). Like other DOP models, the parser utilizes syntactic fragments of arbitrary size from a treebank to analyse new sentences, but, crucially, it uses only those which are encountered at least twice in the training data. This follows the general assumption of considering a syntactic construction linguistically relevant if there is some empirical evidence about its reusability in a representative treebank. This criterion allows us to work with a relatively small but representative set of fragments, which can be employed as the symbolic backbone of several probabilistic generative models. For parsing we define a transform-backtransform approach that allows us to use standard PCFG technology, making our results easily replicable. According to standard Parseval metrics, our best model is on par with other state-of-the-art parsers, while offering some complementary benefits: a simple generative probability model, and an explicit representation of the larger units of grammar.

    In the final part of the talk I will introduce my current parsing framework: an efficient and accurate incremental Double-DOP parser which only utilizes lexicalized recurring fragments.

    3rd May, 2012 Internal Paper Presentations

    Daniel Preotiuc - Real Time Analysis of Social Media Text

    The emergence of online social networks (OSNs) and the accompanying availability of large amounts of data pose a number of new natural language processing (NLP) and computational challenges. Data from OSNs is different to data from traditional sources (e.g. newswire). The texts are short, noisy and conversational. Another important issue is that data arrives in real-time streams, needing immediate analysis that is grounded in time and context.

    I will describe a new open-source framework for efficient text processing of streaming OSN data. I will present the current state of its development as well as some novel contributions to tackle two important issues: social network user location and recall oriented information retrieval.

    Jing Li - Biologically-inspired Building Recognition

    Building recognition has attracted much attention in computer vision research. However, existing building recognition systems have the following problems: 1) extracted features are not biologically-related to human visual perception; 2) features are usually of high dimensionality, resulting in the curse of dimensionality; 3) semantic gap between low-level visual features and high-level image concepts; and 4) limited challenges set by published databases. To this end, we propose a biologically-inspired building recognition scheme and create a new building image database to address the aforementioned problems. The scheme is based on biologically-inspired features that can model the process of human visual perception. To deal with the curse of dimensionality, the dimensionality of extracted features is reduced by linear discriminant analysis (LDA). To fill the semantic gap, a relevance feedback-based support vector machine (SVM) is applied for classification.

    29th March, 2012 Massimo Poesio (The University of Essex) - Rethinking anaphora

    Current models of the anaphora resolution task achieve mediocre results for all but the simpler aspects of the task such as coreference proper (i.e. linking proper names into coreference chains). One of the reasons for this state of affairs is the drastically simplified picture of the task at the basis of existing annotated resources and models, e.g. the assumption that human subjects by and large agree on anaphoric judgments. In this talk I will present the current state of our efforts to collect more realistic judgments about anaphora through the Phrase Detectives online game, and to develop models of anaphora resolution that do not rely on the total agreement assumption.

    Joint work with Jon Chamberlain and Udo Kruschwitz

    15th March, 2012 Internal Paper Presentations

    Nikos Aletras - Computing Similarity between Cultural Heritage Items using Multimodal Features

    A significant amount of information about Cultural Heritage artefacts is now available in digital format and has been made available in digital libraries. Being able to identify items that are similar would be useful for search and navigation through these data sets. Information about items in these repositories is often multimodal, such as pictures of the artefact and an accompanying textual description. This paper explores the use of information from these various media for computing similarity between Cultural Heritage artefacts. Results show that combining information from images and text produces better estimates of similarity than when only a single medium is considered.

    Mark Hall - Enabling the Discovery of Digital Cultural Heritage Objects through Wikipedia

    Over the past years large digital cultural heritage collections have become increasingly available. While these provide adequate search functionality for the expert user, this may not offer the best support for non-expert or novice users. In this paper we propose a novel mechanism for introducing new users to the items in a collection by allowing them to browse Wikipedia articles, which are augmented with items from the cultural heritage collection. Using Europeana as a case-study we demonstrate the effectiveness of our approach for encouraging users to spend longer exploring items in Europeana compared with the existing search provision.

    8th March, 2012 Robert Villa (The University of Sheffield / Information School) - Can an Intermediary Collection Help Users Search Image Databases Without Annotations?

    Developing methods for searching image databases is a challenging and ongoing area of research. A common approach is to use manual annotations, although generating annotations can be expensive in terms of time and money. Content-based search techniques which extract visual features from image data can be used, but users are typically forced to express their information need using example images, or through sketching interfaces. This can be difficult if no visual example of the information need is available, or when the information need cannot be easily drawn.

    In this talk an alternative approach is considered, where a final content-based image search is mediated by an intermediate database which contains annotated images. A user can search by conventional text means in the intermediate database, as a way of finding visual examples of their information need. The visual examples can then be used to search a database that lacks annotations. Experiments which investigated this idea, culminating in a small user study, will be discussed in this talk.

    19th January, 2012 Maria Liakata (Aberystwyth University / European Bioinformatics Institute (EMBL-EBI)), Cambridge - Towards reasoning with scientific articles: identifying conceptualisation zones and beyond

    Scholarly biomedical publications report on the findings of a research investigation. Scientists use a well-established discourse structure to relate their work to the state of the art, express their own motivation and hypotheses and report on their methods, results and conclusions. Here I discuss our approach and results from automatically annotating the scientific discourse at the sentence level in terms of eleven categories, which we call the Core Scientific Concepts. I will present applications of this work in extractive summarisation and its implications in improving our automatic understanding of scientific articles.

    8th December, 2011 Sascha Kriewel (Universität Duisburg-Essen) - Introduction to Daffodil / ezDL

    Daffodil was created to provide strategic support through high-level search functions to users of Digital Libraries. It is based on ideas by Marcia Bates with the goal of supporting the entire scientific workflow.

    The agent-based architecture of the backend can be easily extended to add new services and a tool-based user client can be configured into different perspectives for specific tasks. Since 2009 the software has been undergoing re-implementation as ezDL (easy access to Digital Libraries). ezDL is currently used within several running projects and provides a platform for user-based evaluations, e.g. within the INEX iTrack.

    17th November, 2011 Elaine Toms (The University of Sheffield) - Designing the next generation information appliance

    Finding information has been all about plugging keywords into a search box and scanning a ranked list of items where the ranking has been based on a mysterious and somewhat magical query-keyword match with a set of documents. This has led to unreasonable expectations about the power of the search box, and disappointment in results particularly in workplace settings where outputs have productivity, profit, and performance implications. How do we move beyond this simple "bag of words" approach? The problem has both algorithmic and interface issues that are tightly inter-related. In this talk, I will discuss two studies, one in which we considered the interface problem, and one in which we started from the beginning -- the requirements for an application, rather than from the source of documents to be used.

    10th November, 2011 Ahmet Aker (The University of Sheffield) - Conceptual Modelling for Multi-Document Summarization

    I will talk about the paper I presented at ACL 2010 (see the abstract of the paper below). I will also discuss my current experiments and ask for your feedback. I hope that with these new experiments I can finalize my PhD.

    This paper presents a novel approach to automatic captioning of geo-tagged images by summarizing multiple web-documents that contain information related to an image's location. The summarizer is biased by dependency pattern models towards sentences which contain features typically provided for different scene types such as those of churches, bridges, etc. Our results show that summaries biased by dependency pattern models lead to significantly higher ROUGE scores than both n-gram language models reported in previous work and also Wikipedia baseline summaries. Summaries generated using dependency patterns are also more readable than those generated without them.

    3rd November, 2011 Ayman Alhelbawy (The University of Sheffield) - Disambiguating Named Entities against a Reference Knowledge Base

    The task of Named Entity Linking, as defined in the recent NIST knowledge base population evaluation, aims at associating named entities with a corresponding explanatory document - a document that contains information about that entity - in a given document collection. There are two main challenges in this task. The first challenge is the ambiguity of the named entity: the same named entity string can occur in different contexts with different meanings. Also, a named entity may be denoted using various forms such as acronyms and nicknames. The second challenge is to decide when the named entity is not found in the document collection, in which case it is linked to "NIL". A survey of some methodologies that have been used to perform the entity linking task is presented, in addition to the baseline approach. Data sets used for training and evaluation will also be explored. Finally, the evaluation metrics used and some state-of-the-art results will be presented.

    20th October, 2011 Udo Kruschwitz (University of Essex) - Exploiting Implicit Feedback: From Search to Adaptive Search

    This talk will give an overview of the information retrieval work we conduct in the Language and Computation Group at the University of Essex on building adaptive domain models that can assist in searching or navigating document collections. We are particularly interested in searching local Web sites, digital libraries and other collections. Such collections are different from the Web in that spamming is not an issue, searchers are less heterogeneous and often there is only a single document satisfying an information need. The underlying assumption of our work is that we can use implicit feedback such as queries submitted, documents clicked on etc. to build domain models that assist other users with similar requests in finding the relevant documents quickly. Our ongoing work is about applying different algorithms in the construction and automatic adaptation of domain models but also about finding ways to evaluate these models.

    13th October, 2011 Mark Stevenson (The University of Sheffield) - Disambiguation of Medline Abstracts using Topic Models

    Topic models are an established technique for generating information about the subjects discussed in collections of documents. Latent Dirichlet Allocation (LDA) is a widely applied topic model. We apply LDA to a corpus of Medline abstracts and compare the topics that are generated against manually curated labels, Medical Subject Headings (MeSH) codes.

    The models generated by LDA consist of sets of terms associated with each topic and these are used to provide context for a Word Sense Disambiguation (WSD) system. It is found that using this context leads to a statistically significant improvement in the performance of a graph-based WSD system when applied to a standard evaluation resource in the biomedical domain.

    Information about the topic of a document has already been shown to be useful for WSD of Medline abstracts. Previous approaches have relied on using MeSH codes but these have to be added manually. We demonstrate that information about the topic of abstracts can be identified without the need for manual annotation, by using an unsupervised technique, and can also be used to improve WSD performance.

    6th October, 2011 Chris Dyer (Carnegie Mellon University) - Unsupervised Word Alignment and Part of Speech Induction with Undirected Models

    This talk explores unsupervised learning in undirected graphical models for two problems in natural language processing. Undirected models can incorporate arbitrary, non-independent features computed over random variables, thereby overcoming the inherent limitation of directed models, which require that features factor according to the conditional independencies of an acyclic generative process. Using word alignment (finding lexical correspondences in parallel texts) and bilingual part-of-speech induction (jointly learning syntactic categories for two languages from parallel data) as case studies, we show that relaxing the acyclicity requirement lets us formulate more succinct models that make fewer counterintuitive independence assumptions. Experiments confirm that our undirected alignment model yields consistently better performance than directed model baselines, according to both intrinsic and extrinsic measures. With POS tagging, we find more tentative results. Analysis reveals that our parameter learner tends to get caught in shallow local optima corresponding to poor tagging solutions. Switching to an alternative learning objective (contrastive estimation; Smith and Eisner, 2005) improves the stability and performance, but it suggests that non-convex objectives may be a larger problem in undirected models than with directed models.

    Joint work with Noah Smith, Desai Chen, Shay Cohen, Jon Clark, and Alon Lavie

    15th September, 2011 Rao Nawab (The University of Sheffield) - External Plagiarism Detection using Information Retrieval and Sequence Alignment

    This talk describes the University of Sheffield entry for the 3rd International Competition on Plagiarism Detection which attempted the monolingual external plagiarism detection task. A three-stage framework was used: preprocessing and indexing, candidate document selection (using an Information Retrieval based approach) and detailed analysis (using the Running Karp-Rabin Greedy String Tiling algorithm). The submitted system obtained an overall performance of 0.0804, precision of 0.2780, recall of 0.0885 and granularity of 2.18 in the formal evaluation.
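
    For context, the PAN plagiarism detection competitions combine precision, recall and granularity into a single overall score by dividing the F1 measure by log2(1 + granularity). The sketch below assumes that definition (the function name is hypothetical) and recomputes the overall score from the three component figures quoted above.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Overall score: F1 discounted by the logarithm of the granularity.

    Assumes the PAN definition: F1 / log2(1 + granularity).
    """
    if precision + recall == 0:
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Component figures reported in the abstract above.
score = plagdet(precision=0.2780, recall=0.0885, granularity=2.18)
print(round(score, 4))  # prints 0.0804
```

    A granularity of 1 (each plagiarised passage detected exactly once) leaves the F1 measure unchanged; larger values penalise fragmented detections.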

    2010 - 2011

    6th July, 2011 Paola Velardi (Universita di Roma) - A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch

    In this talk I present a novel graph-based approach aimed at learning a lexical taxonomy automatically, starting from a domain corpus and the Web. Unlike many taxonomy learning approaches in the literature, the algorithm learns both concepts and relations entirely from scratch via the automated extraction of terms, definitions and hypernyms. This results in a very dense, cyclic and possibly disconnected hypernym graph. The algorithm then induces a taxonomy from the graph via optimal branching. Experiments show high-quality results, both when building brand-new taxonomies and when reconstructing WordNet sub-hierarchies.

    This research is the result of joint work with Roberto Navigli and Stefano Faralli.

    R. Navigli, P. Velardi. Learning Word-Class Lattices for Definition and Hypernym Extraction. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics (ACL 2010), Uppsala, Sweden, July 11-16, 2010.

    R. Navigli, P. Velardi, S. Faralli. A Graph-based Algorithm for Inducing Lexical Taxonomies from Scratch. To appear in Proc. of the 22nd International Joint Conference on Artificial Intelligence (IJCAI 2011), Barcelona, Spain, July 19-22, 2011.

    30th June, 2011 Ann Copestake (University of Cambridge) - Formal semantics and dependency structures

    Logical representations and dependency structures are both used to describe aspects of the meaning of natural language sentences, but are formally very different. In this talk, I will show that one widely used form of logical representation can be transformed into graph structures comparable to dependency representations without loss of information. This has some significant practical advantages for language processing.

    23rd June, 2011 Peter Wallis (The University of Sheffield) - Engineering Spoken Dialogue Systems

    Having a conversation with a machine has many commercial applications and has a certain sex appeal for the students. What is more, it is a grand challenge that could provide a unifying theme for much of the departmental research. The dialogue manager is, I believe, where there is the greatest opportunity for improvement in spoken dialogue systems, and in this talk I contrast my approach with POMDPs. Partially Observable Markov Decision Processes are an elegant approach to the problem of structuring conversation, but it is not clear that the work being done on them will lead to useful systems. I argue for an agent-based approach to dialogue and provide a set of algorithms from the literature.

    9th June, 2011 Internal Research Student Presentations

    Niraj Aswani - Evolving a General Framework for Text Alignment: Case Studies with Two South Asian Languages

    A gold standard is an essential requirement for automatic evaluation of text alignment algorithms, and approaches such as semi-automatic or incremental learning can be used to speed up the process of creating one. In this talk, I will describe a general framework for text alignment that supports manual creation of a gold standard while in the background updating the language resources used to suggest an initial alignment. In particular, the talk will cover a case study of developing language resources for the English-Hindi language pair. Our focus is on the South Asian languages that are similar to Hindi and for which resources are scarce. I will demonstrate the generality of the approach by adapting the resources for the English-Gujarati language pair.

    Danica Damljanovic - Usability Enhancement Methods in Natural Language Interfaces for Querying Ontologies

    Recent years have seen a tremendous increase of structured data on the Web, with the Linked Open Data project encouraging publication of even more. This massive amount of data requires effective exploitation, which is now a big challenge largely because of the complexity and syntactic unfamiliarity of the underlying triple models and the query languages built on top of them. Natural Language Interfaces are increasingly relevant for information systems fronting rich structured data stores such as RDF and OWL repositories, largely because they are perceived as intuitive for humans. Many NLIs to ontologies have been developed; however, little work has been done in testing the usability of these systems and the usability enhancement methods which can improve their performance. In this paper, we assess the effect of these methods through two user-centric studies of two systems: QuestIO and FREyA. The first study assesses the usability of QuestIO, which is fully automatic, in comparison to the traditional ways of search. The second one assesses the usability of FREyA, which involves the user in the loop, with special emphasis on feedback. Our results highlight the expressiveness of the language supported by QuestIO and FREyA, and also the importance of feedback, which is shown to improve the overall usability and user experience. In addition, the combination of feedback and clarification dialogs in FREyA is shown to outperform the state-of-the-art systems.

    2nd June 2011 Piek Vossen (Vrije Universiteit Amsterdam) - The KYOTO project: a cross-lingual platform for open text mining

    The European-Asian project KYOTO developed a platform for mining concepts and events from text across different languages. It uses a layered stand-off representation of text that is shared by 7 languages: English, Dutch, Italian, Spanish, Basque, Chinese and Japanese. The KYOTO Annotation Format (KAF) distinguishes separate layers for structural and semantic aspects of the text that can be stacked on top of each other and that can be extended easily. Once a structural representation of the text in KAF is created, semantic layers are added using modules that work the same for all the languages, creating an interoperable semantic interpretation of the text. The semantic layers are based on wordnet concepts linked to a shared ontology and named entities.
    From the semantically annotated text, KYOTO derives on the one hand terminology databases with concepts that are anchored to the wordnets and through these to the ontology and on the other hand events with participants that are mentioned in the text, which are instantiations of these concepts. The detection of the latter is helped through the conceptual database. Ultimately, every word and expression in the text is connected to the ontology. Likewise, events and their participants are mined by defining patterns using constraints in the shared ontology, e.g. physical_object in object position of a change_of_integrity process. Such patterns can be applied to text in any language, since the structural units in these languages are mapped to the same concept structure. Mined events are related to time and places, detected as named entities. This makes potential facts out of events: they took place at some point of time in some place. These potential facts help the development of applications that can group all events that took place in the same area in the same period and that may be semantically related or show some conceptual coherence. KYOTO carried out first evaluations of the precision and recall of such an open-event mining approach and developed a semantic search application that exploits the rich data. Such a search system bridges the gap between rich text mining and comprehensive search on text indexes.

    19th May 2011 Internal Research Student Presentations

    Kumutha Swampillai - Overview of Research Topic

    Douwe Gelling - Overview of Research Topic

    12th May 2011 Leon Derczynski (The University of Sheffield) - Processing Temporal Relations

    Language requires a description of time in order to allow us to describe change, to plan, and to discuss history. Temporal information extraction has been a persistently difficult task over the past decade. I will discuss my PhD research in this area and outline a partially data-driven method to extract temporal relations from natural language text, with good results.

    5th May, 2011 David Weir (University of Sussex) - Exploiting Distributional Semantics: exploring asymmetry and non-standard contextual features

    The distributional hypothesis asserts that words that occur in similar contexts tend to have similar meanings. A growing body of research has been concerned with exploiting the connection between language use and meaning, and much of this work has involved measuring the distributional similarity of words based on the extent to which they share similar contexts. In this talk I look at two particular aspects of how distributional similarity can be measured: the value of asymmetry and the choice of co-occurrence features. These issues will be considered in the contexts of various applications, including cross-domain sentiment analysis and detection of non-compositionality.
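
    As a rough illustration of why asymmetry can matter, the sketch below compares a symmetric measure (cosine) with a simple asymmetric one that asks how much of one word's context weight is shared with another word. The toy counts, the function names and the choice of these two measures are assumptions made for illustration; they are not necessarily the measures discussed in the talk.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Symmetric similarity: cosine of the angle between two context vectors."""
    dot = sum(a[f] * b[f] for f in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def inclusion(a: Counter, b: Counter) -> float:
    """Asymmetric similarity: share of a's context weight that b also has."""
    total = sum(a.values())
    shared = sum(v for f, v in a.items() if f in b)
    return shared / total if total else 0.0


# Toy co-occurrence counts (hypothetical): features are context words.
dog = Counter({"bark": 4, "pet": 3, "eat": 2})
animal = Counter({"pet": 2, "eat": 5, "wild": 4, "zoo": 3, "bark": 1})

print(cosine(dog, animal) == cosine(animal, dog))      # True: order is irrelevant
print(inclusion(dog, animal), inclusion(animal, dog))  # direction matters
```

    In this toy data every context of "dog" is also a context of "animal", but not the other way round, which is the kind of directional relationship a symmetric measure cannot express.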

    14th April, 2011 Paul Rayson (Lancaster University) - Extreme NLP - Co-presenting with Will Simm, Scott Piao and Maria-Angela Ferrario

    In this talk, we will describe Natural Language Processing research and applications which can be loosely described as 'Extreme NLP'. At Lancaster, there are a number of projects which apply NLP techniques in extreme or harsh circumstances and to controversial or challenging topics. For example, we will describe the problems faced when applying corpus-based NLP methods and tools to historical data (Early Modern English) and to online varieties of language (social networks, emails, blogs). Short texts, informal messages and high volumes of data cause multiple issues for existing tools trained on modern standard varieties of language. The novel application areas, such as online child protection, crime, environmental issues, serendipity, etc., also mean that it is sometimes difficult to be precise about the exact techniques that are employed.

    7th April, 2011 Edward Grefenstette (University of Oxford) - Categorical Compositionality for Distributional Semantics, Without Tears

    Coecke, Sadrzadeh, and Clark (arXiv:1003.4394v1 [cs.CL]) developed a compositional model of meaning for distributional semantics, in which each word in a sentence has a meaning vector and the distributional meaning of the sentence is a function of the tensor products of the word vectors. Abstractly speaking, this function is the morphism corresponding to the grammatical structure of the sentence in the category of finite dimensional vector spaces. In this paper, we provide a concrete method for implementing this linear meaning map, by constructing a corpus-based vector space for the type of sentence. Our construction method is based on structured vector spaces whereby meaning vectors of all sentences, regardless of their grammatical structure, live in the same vector space. Our proposed sentence space is the tensor product of two noun spaces, in which the basis vectors are pairs of words each augmented with a grammatical role. This enables us to compare meanings of sentences by simply taking the inner product of their vectors.
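
    The final comparison step can be illustrated with plain vectors. The sketch below is only a toy under stated assumptions: it builds two vectors in the tensor product of a small noun space with itself and checks that their inner product factorises into the product of the component inner products. The vectors and helper names are hypothetical, and the corpus-based construction of the sentence space described in the paper is not reproduced here.

```python
from itertools import product


def tensor(u, v):
    """Tensor (Kronecker) product of two vectors, flattened to a list."""
    return [x * y for x, y in product(u, v)]


def inner(u, v):
    """Standard inner product of two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(u, v))


# Hypothetical noun vectors over a two-dimensional noun space N.
subj_1, obj_1 = [1, 2], [3, 1]
subj_2, obj_2 = [2, 0], [1, 4]

s1 = tensor(subj_1, obj_1)  # lives in the 4-dimensional space N (x) N
s2 = tensor(subj_2, obj_2)

# <a (x) b, c (x) d> = <a, c> * <b, d>
print(inner(s1, s2), inner(subj_1, subj_2) * inner(obj_1, obj_2))  # prints 14 14
```

    Because all such vectors live in the same space regardless of which words produced them, a single inner product suffices to compare them.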

    31st March, 2011 Alexander Clark (Royal Holloway University of London) - Distributional Lattice Grammars: a learnable representation for syntax

    A central problem for NLP is grammar induction: the development of unsupervised learning algorithms for syntax. In this paper we present a lattice-theoretic representation for natural language syntax, called Distributional Lattice Grammars.

    These representations are objective or empiricist, based on a generalisation of distributional learning, and are capable of representing all regular languages, some but not all context-free languages and some non-context-free languages. We present a simple algorithm for learning these grammars together with a complete self-contained proof of the correctness and efficiency of the algorithm, and we discuss the relevance of this work to the problems of theoretical linguistics.

    17th March, 2011 Stephen Clark (University of Cambridge) - Practical Linguistic Steganography using Synonym Substitution - joint work with Ching-Yun (Frannie) Chang

    Linguistic Steganography is concerned with hiding information in a natural language text, for the purposes of sending secret messages. A related area is natural language watermarking, in which information is added to a text in order to identify it, for example for the purposes of copyright. Linguistic Steganography algorithms hide information by manipulating properties of the text, for example by replacing some words with their synonyms. Unlike image-based steganography, linguistic steganography is in its infancy with little existing work. In this talk I will motivate the problem, in particular as an interesting application for NLP and especially generation. Linguistic steganography is a difficult NLP problem because any change to the cover text must retain the meaning and style of the original, in order to prevent detection by an adversary.

    Our method embeds information in the cover text by replacing words in the text with appropriate substitutes, making the task similar to the standard lexical substitution task. We use the Google n-gram data to determine if a substitution is acceptable, obtaining promising results from an evaluation in which human judges are asked to rate the acceptability of sentences.
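
    To make the embedding idea concrete, here is a toy sketch of synonym-substitution coding. It assumes a tiny hand-made table of two-word synonym sets in which the position of a word within its set encodes one bit. The table, the function names and the encoding scheme are illustrative assumptions, and the n-gram acceptability check described above is omitted.

```python
# Hypothetical synonym sets: index 0 encodes bit 0, index 1 encodes bit 1.
SYNSETS = [("big", "large"), ("quick", "fast"), ("start", "begin")]
CODE = {word: (pair, bit) for pair in SYNSETS for bit, word in enumerate(pair)}


def embed(cover: str, bits: str) -> str:
    """Hide bits in the cover text by choosing among synonyms."""
    remaining = list(bits)
    out = []
    for word in cover.split():
        if word in CODE and remaining:
            pair, _ = CODE[word]
            word = pair[int(remaining.pop(0))]
        out.append(word)
    if remaining:
        raise ValueError("cover text has too few substitutable words")
    return " ".join(out)


def extract(stego: str) -> str:
    """Read the hidden bits back from the substitutable words."""
    return "".join(str(CODE[w][1]) for w in stego.split() if w in CODE)


stego = embed("we start with a big and quick test", "101")
print(stego)           # we begin with a big and fast test
print(extract(stego))  # 101
```

    In this sketch the receiver needs the same synonym table as the sender, and the cover text should contain exactly as many substitutable words as there are bits; a practical scheme would also have to signal the message length.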

    10th March, 2011 Internal Research Student Presentations

    Xingyi Song - Overview of research topic

    Daniel Preotiuc - Overview of research topic

    Samuel Fernando - Enriching knowledge bases from Wikipedia

    Lexical knowledge bases, such as WordNet, have been shown to be useful in a wide range of language processing applications. Enriching such resources using the usual manual approach is costly. This thesis explores methods for enriching WordNet using information from Wikipedia.

    The approach consists of mapping concepts in WordNet to corresponding articles in Wikipedia. This is done using a three-stage approach. First, a set of possible candidate articles is retrieved for each WordNet concept. Second, text similarity scores are used to select the best match from the candidate articles. Finally, the mappings are refined using information from Wikipedia links to give a set of high quality matches. Evaluation reveals that this approach generates mappings of accuracy over 90%.
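
    The second stage, selecting the best candidate by text similarity, can be sketched as follows. Word-overlap (Jaccard) similarity is used purely as an assumed stand-in for whatever similarity scores the thesis uses, and the gloss, the candidate snippets and the function names are hypothetical.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two texts (an assumed stand-in measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def best_match(gloss: str, candidates: dict[str, str]) -> str:
    """Return the candidate article title whose text best matches the gloss."""
    return max(candidates, key=lambda title: jaccard(gloss, candidates[title]))


# Hypothetical concept gloss and candidate article snippets.
gloss = "a financial institution that accepts deposits"
candidates = {
    "Bank": "a bank is a financial institution that accepts deposits and makes loans",
    "Bank (geography)": "a bank is the land alongside a body of water",
}
print(best_match(gloss, candidates))  # Bank
```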

    This information is then used to enrich relations in WordNet using Wikipedia links. The enriched WordNet is then used with a knowledge based Word Sense Disambiguation system, and evaluated on Semeval 2007 test data. Using WordNet alone gives accuracy of 70%, but with the enriched WordNet the performance is boosted to 84% correct disambiguation, rivalling state-of-the-art performance on this data set.

    3rd March, 2011 John Carroll (University of Sussex) - Text Mining from User-Generated Content

    Over the past five years or so, technology has made it possible for members of the general public to create and publish digital media content, for example in the form of video, audio, or text. Being able to process such content automatically to derive relevant information from it will be of great societal and commercial benefit. In this talk I will present a number of research and commercial applications which I and collaborators are developing, in which we process digital text from sources as diverse as mobile phone text messages, non-native language learner essays, and primary care medical notes. These applications involve a number of language processing challenges, and I will outline how we have overcome them.

    24th February, 2011 Leon Derczynski (The University of Sheffield) - ESSLLI course - Word Senses

    In an introduction to the tasks of word sense disambiguation and word sense induction, we will discuss a wide range of techniques for the two tasks, from fundamental concepts to state of the art. Further, we survey tools for the development of systems able to participate in past and current evaluation exercises for WSD and WSI (ref: Semeval).

    17th February, 2011 Lucia Specia (University of Wolverhampton) - Quality Estimation for Machine Translation

    One of the most popular ways to incorporate Machine Translation (MT) into the human translation workflow is to have humans checking and post-editing the output of MT systems. However, the post-editing of a proportion of the translated segments may require more effort than translating those segments from scratch, without the aid of an MT system. In this talk I will introduce some of my work on quality estimation for MT: the task of predicting the quality of sentences produced by machine translation systems, where "quality" is defined in terms of post-editing effort. A quality estimation system can be used to filter out bad quality translations to prevent human translators spending time post-editing them. I will present the outcomes of experiments with different ways of estimating quality which demonstrate that it is possible to predict post-editing effort using standard machine learning techniques with a relatively small number of training examples and a number of shallow features.

    10th February, 2011 Rao Nawab (The University of Sheffield) - Automatic Plagiarism Detection

    The task of plagiarism detection using automatic methods has attracted the attention of the academic, commercial and publishing communities. The main objective of my PhD thesis is to explore the problem of automatically detecting extrinsic plagiarism (when the plagiarized text is created by paraphrasing) using IR and NLP techniques.

    The first part of my talk will give an overview of a two-stage framework for my PhD thesis: 1) candidate document selection stage and 2) detailed analysis stage. The aim of the first stage is to reduce the search space, whereas that of the second stage is to identify the suspicious-source sections from the reduced search space. The second part of my talk will present my current work on the candidate document selection stage and a brief summary of the results. Suggestions and feedback from the group will be of great value to me.

    3rd February, 2011 Adam Kilgarriff (Lexical Computing Ltd.) - Using Corpora Without the Pain

    Corpora are large objects and querying them efficiently is non-trivial. There are substantial costs to building them, storing them, maintaining them, and building and maintaining software to access them. We propose a model where this work is done by a corpus specialist and NLP systems then use corpora via web services or (if there is a local installation) a command-line API. Our corpus tool is fast, even for billion-word corpora, and offers a wide range of queries via its web API. We have large corpora available for twenty-six languages, and are experts in preparing large corpora from the web, with particular expertise in web text cleaning and de-duplication. To increase our coverage of the world's languages, we have a 'corpus factory' programme. For English, we are building corpora that are both bigger and more richly marked up than others available. The 'big corpus' thread is BiWeC (BIg WEb Corpus) for which we currently have 5.5 billion words fully encoded. The 'more richly marked up' thread is the New Model Corpus, which we are setting up as a collaborative project for multiple annotation. The combination of the API model, the corpora, and the tools, will allow many NLP researchers to use bigger and better corpora in more sophisticated ways than would otherwise be possible.

    27th January, 2011 Leon Derczynski (The University of Sheffield) - Review of courses from ESSLLI 2010

    Last year, I attended the first week of the European Summer School for Logic, Language and Information. In this talk I will briefly recap two of the classes I took there.
    Class 1 - Focus. An introduction to the phenomena and theories of focus at the levels of phonetics, phonology, syntax, semantics and pragmatics, and the interfaces between them. Common grammatical and contextual environments that trigger focus are surveyed. We will look in detail at the most prominent accounts of the semantics of focus and consider how they are applied in particular cases. Additional topics include issues of grammatical representation including scope; focus in the pragmatics of the question-answer relation, and the hypothesis that focus and question phrases have a single compositional semantics.

    Class 2 - Word Sense Disambiguation and Induction. We introduce the audience to a wide range of techniques for the two tasks; in addition, we provide tools for the development of systems able to participate in past and current evaluation exercises for WSD and WSI (ref: SemEval).

    13th January, 2011 Diana Maynard (The University of Sheffield) - The National Archives: The GATE-way to Government Transparency

    In this talk I will describe work we are undertaking in a short project for the National Archives, improving access to the huge volumes of information they are making available as part of the data.gov.uk initiative publishing government-related material in open and accessible forms as linked data. Together with our partners Ontotext, we have developed tools to import, store and index structured data in a scalable semantic repository, making links from regularly crawled web archive data into this repository storing hundreds of millions of documents, and enabling search via semantic annotation. Document annotation is first carried out using GATE, and then indexed via MIMIR, a new massively scalable multiparadigm index that forms part of the GATE and Ontotext product family.

    9th December, 2010 Bill Byrne (University of Cambridge) - Hierarchical Phrase-based Translation with Weighted Finite State Transducers

    I will present recent work in statistical machine translation which uses Weighted Finite-State Transducers (WFSTs) to implement a variety of search and estimation algorithms. I will describe HiFST, a lattice-based decoder for hierarchical phrase-based statistical machine translation. The decoder is implemented with standard WFST operations as an alternative to the well-known cube pruning procedure. I will discuss how improved modelling in translation results from the efficient representation of translation hypotheses and their derivations and scores under translation grammars. We find that the use of WFSTs in translation leads to fewer search errors, better parameter optimisation, improved translation performance, and the ability to extract useful confidence measures under the translation grammar.

    8th November, 2010 John Tait (Information Retrieval Facility) - (Slides)

    7th October, 2010 Danica Damljanovic (The University of Sheffield) - Natural Language Interfaces to Conceptual Models

    Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. We study the usability of NLIs from two perspectives: that of the developer who is customising the NLI system, and that of the end-user who uses it for querying. We investigate whether usability methods such as feedback and clarification dialogs can increase the usability for end users and reduce the customisation effort for the developers. To that end, we have developed FREyA - an interactive NLI to ontologies which will be described and demoed during this talk.

    2009 - 2010

    5th August, 2010 David Guthrie (The University of Sheffield) - Storing the Web in Memory: Space Efficient Language Models using Minimal Perfect Hashing

    The availability of text on the web and of very large text collections, such as the Gigaword corpus of newswire and the Google Web1T 1-5gram corpus, has made it possible to build language models incorporating counts of billions of n-grams. In this talk we present novel methods for efficiently storing these large models. We introduce three novel data structures that take advantage of the distribution of n-grams in corpora and make use of various numbers of minimal perfect hashes to compactly store language models containing full frequency counts of billions of n-grams. Our methods use significantly less space than all known approaches and have retrieval speed faster than current language modelling toolkits.

    22nd July, 2010 Alberto Diaz (Universidad Complutense de Madrid) -

    In the talk I'll give a short introduction to my research group (members and high-level details about the main research areas), and afterwards I'll describe my research lines and projects in more detail. In particular, I'll talk about personalization for digital newspapers through user modelling and text classification tasks, and about text processing for biomedical documents, including text summarization and ICD-9-CM indexing tasks.

    8th July, 2010 Laura Plaza (The University of Sheffield Visiting Researcher) - Improving Summarization of Biomedical Documents using Word Sense Disambiguation

    We describe a concept-based summarization system for biomedical documents and show that its performance can be improved using Word Sense Disambiguation. The system represents the documents as graphs formed from concepts and relations from the UMLS. A degree-based clustering algorithm is applied to these graphs to discover different themes or topics within the document. To create the graphs, the MetaMap program is used to map the text onto concepts in the UMLS Metathesaurus. This paper shows that applying a graph-based Word Sense Disambiguation algorithm to the output of MetaMap improves the quality of the summaries that are generated.

    24th June, 2010 Ronald Denaux (University of Leeds) -

    Ronald will first present his work on involving domain experts in ontology engineering through the use of the Rabbit controlled natural language, a tailored ontology engineering methodology and a tailored user interface based on Protege (all in the context of the Confluence project, a collaboration between the Ordnance Survey and the University of Leeds). In the second part, Ronald will present his current work on Multi-perspective Ontology Engineering where he is investigating a mechanism for capturing the perspective of ontology authors in order to enhance tool support for ontology creation and reuse. In particular, Ronald is working on formalising the purpose of ontologies and eliciting the goals of ontology authors through dialogue games (the second part is in the context of Ronald's PhD).

    17th June, 2010 Hector Llorens (University of Sheffield Visiting Researcher) - Temporal information extraction using semantic roles and semantic networks

    In recent years, there has been intensive research on the temporal elements of natural language text. The TimeML scheme has recently been adopted as the standard for annotating temporal expressions (TIMEX3), events (EVENT), and their relations ([T,A,S]LINK). This research analyzes the advantages of applying semantic information to the automatic annotation of TimeML elements. For that purpose, a system addressing the automatic annotation of TimeML elements is presented. The system implements an approach which uses semantic roles and semantic networks as additional information extending classic approaches based on morphosyntactic information. A multilingual analysis, carried out by evaluating the system for Spanish, demonstrated that the approach is valid for different languages, achieving results of the same quality and an improvement over classic approaches. In the talk, I will include an "application proposal" which I intend to develop during my stay there and which will be the application of my thesis. Suggestions and feedback from you and your group on my current and future work will be of great value to me.

    30th April, 2010 Atefeh Farzindar (NLP Technologies Inc) - Successful cooperation between the university and industry

    NLP Technologies and RALI (Applied Research in Computational Linguistics, Université de Montréal) have developed an automated monitoring system for the automatic summarization and translation of legal decisions. During this seminar, Atefeh Farzindar will discuss the successful cooperation between the university and industry leaders, a milestone in applied research and technology transfer. Experience shows that when industry players combine their strengths and work alongside university experts with the same vision, the result is far greater than what can be achieved separately. She will present her experience with domain-based technologies in the legal and military fields.

    22nd April, 2010 Miles Osborne (University of Edinburgh) - What is happening now? Finding events in Massive Message Streams

    Social Media (e.g. Twitter, Blogs, Forums, Facebook) has exploded over the last few years. Facebook is now the most visited site on the Web, with Blogger being the 7th and Twitter the 13th. These sites contain the aggregated beliefs and opinions of millions of people on an epic range of topics, and in a large number of languages. Twitter in particular is an example of a massive message stream and finding events embedded in it poses hard engineering challenges. I will explain how we use a variant of Locality Sensitive Hashing to find new stories as they break. The approach scales well, easily dealing with the more than 1 million Tweets a day we process and only needing a single processor. For June 2009, the fastest growing stories all concerned deaths of one kind or another.
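
    To give a flavour of how Locality Sensitive Hashing can flag new stories, here is a minimal sketch, assuming a simple sign-projection (SimHash-style) signature over bag-of-words counts. It is not the variant used by the speaker; the names signature and FirstStoryDetector are hypothetical. Messages with similar wording tend to receive the same bit signature, so a message whose signature bucket is empty is treated as a candidate new story.

```python
import hashlib
from collections import Counter, defaultdict


def signature(text: str, bits: int = 16) -> int:
    """Sign-projection signature of a message's bag of words.

    Each bit is the sign of a pseudo-random +1/-1 projection of the token
    counts; the projection weights are derived from a hash of (bit, token),
    so no vocabulary needs to be stored.
    """
    counts = Counter(text.lower().split())
    sig = 0
    for bit in range(bits):
        total = 0
        for token, count in counts.items():
            digest = hashlib.md5(f"{bit}:{token}".encode("utf-8")).digest()
            total += count if digest[0] & 1 else -count
        if total >= 0:
            sig |= 1 << bit
    return sig


class FirstStoryDetector:
    """Flags a message as new when its signature bucket is still empty."""

    def __init__(self, bits: int = 16):
        self.bits = bits
        self.buckets = defaultdict(list)

    def add(self, text: str) -> bool:
        """Store the message; return True if no earlier message shares its bucket."""
        sig = signature(text, self.bits)
        is_new = sig not in self.buckets
        self.buckets[sig].append(text)
        return is_new
```

    Each message costs one signature computation and one dictionary lookup, independent of how many messages came before. A single hash table misses many near-duplicates, so practical systems typically use several tables with different projections and compare the new message against the candidates found in each.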

    15th April, 2010 Peter Wallis (University of Sheffield) - Conversation in Context: what should a robot companion say?

    Language as used by humans is a truly amazing thing with multiple roles in our lives. Academics have tended to focus on the way languages convey meaning, and disciplines that come new to the problem such as computer science tend to start with reference semantics and progress to models of meaning that look mathematical and hence solidly academic. Language as used is however beautifully messy. People sing, they lie and swear, they use metaphor and poetry, play word games and talk to themselves. Is there a better way to look at language? Interdisciplinary research is hard not only because each discipline has its own terminology, but also because they usually have different interests. Those of us interested in spoken language interfaces (computer science) however have a shared interest with applied linguistics in how language works in situ. This paper outlines a theory about how language works from applied linguistics and shows how the theory can be used to guide the design of a robot companion.

    25th March, 2010 Adam Funk (University of Sheffield) - Ontology-Based Categorization of Web Services with Machine Learning

    We discuss the problem of categorizing web services according to a shallow ontology for presentation on a specialist portal. We treat it as a text classification problem and apply first information extraction techniques (using keywords and rules), then machine learning (ML), and finally a combined approach in which ML has priority over keywords. The techniques are evaluated according to standard measures for flat categorization as well as the Balanced Distance Metric for ontological classification and compared with related work in web service categorization. The ML and combined categorization results are good and the system is designed to take users' contributions through the portal's Web 2.0 features as additional training data.

    18th March, 2010 Elena Lloret (University of Alicante) - Text Summarization and its Applications in NLP Tasks

    Text Summarization, which aims to condense the information contained in one or more documents and present it in a more concise way, can be very useful for helping users to manage the large amounts of information available due to the rapid growth of the Internet. In this talk, I will present the Natural Language Processing and Information Systems Research Group of the University of Alicante (Spain), and next I will focus on Text Summarization as the research topic of my PhD. I will describe a knowledge-based approach to generate extractive summaries, and how this approach has been successfully applied to neighbouring NLP tasks, such as Question Answering, Sentiment Analysis or Text Classification. Finally, some issues concerning the difficult task of evaluating summaries will also be outlined, suggesting preliminary ideas for new directions in evaluation.

    26th February, 2010 René Witte (Concordia University in Montréal) - Software Engineering and Natural Language Processing: Friends or Foes?

    This talk will investigate some connections between software engineering (SE) and natural language processing (NLP). It will attempt to answer questions such as "Why do software engineers use natural language artifacts everywhere, but no NLP?" and "Why, after more than 10 years of modern NLP research, do we still not have the most basic NLP functionalities integrated into our desktops?". In the first part, we examine NLP for SE: Documents written in natural languages constitute a major part of the artifacts produced during the software engineering lifecycle. Especially during software maintenance or reverse engineering, semantic information conveyed in these documents can provide important knowledge for the software engineer. However, while source code artifacts are well-managed by today's software development tools, documents are not integrated on a semantic level with their corresponding code artifacts. This results in a number of problems, like the loss of traceability links between code and its documentation (requirements specifications, user guides, design documents). We show how natural language processing approaches can be used to retrieve semantic information from software documents and connect them with source code using ontology alignment techniques. The second part of the talk will investigate the integration of existing NLP techniques (such as summarization or question-answering) into end-user desktop programs (such as email clients or word processors). This work is motivated by the observation that none of the impressive advances in NLP and text mining over the last decade has materialized in the tools and desktop environments in use today. The "Semantic Assistants" project aims to provide effective means for the integration of natural language processing services into existing applications, using an open service-oriented architecture based on OWL ontologies and W3C Web services.

    25th February, 2010 Claude Roux (Xerox Research Labs) - TBA

    4th February, 2010 Peter Wallis (University of Sheffield) - High Recall Search in Practice

    Internet search engines do an amazing thing, but what they can do well has coloured our view of the general problem of search. There are cases where a search engine would be better if the searcher knew he or she had found everything relevant, but how often and how significant these cases are is an open question. One popular notion is that high recall is not that useful as we can get by without it. Although this is sound reasoning, it does not mean there is not an opportunity to be had - Xerox faced this marketing problem with the photocopier and jet engines had been in use for quite a while before the advantages were quantifiable. One situation where the need for high recall is acknowledged is defence intelligence. Defence has both the will and resources to develop bespoke systems for their particular needs and in this talk I describe, in some detail, the needs of the "Health Intelligence" community. I go on to describe how we addressed these needs using an Information Extraction system based on a library of "Fact Extractors".

    17th December, 2009 Jose Iria (University of Sheffield) - Machine Learning Approaches to Text and Multimedia Mining

    Today's search engines are able to retrieve and index several billion web pages, but the analysis that they perform on the content of these pages is still very shallow -- as is, consequently, the functionality that they are able to offer the user. What if these search engines could, for example, extract the factual content from the pages they retrieve, classify the pictures that accompany the text, disambiguate namesakes or mine opinions expressed in the pages? Undoubtedly, this would open a world of possibilities in terms of new functionalities and enhanced user experience, fueled by richer underlying data models. In this talk, I will describe my research, spanning a number of years, on these topics. The common denominator in the several approaches that I will present is the fact that they rely heavily on machine learning techniques, to train systems to classify and extract target information. The talk will also overview real-world applications of the systems originating from the research -- for instance, in one case we trained one of our systems to extract information from a collection of jet engine reports provided by Rolls-Royce, resulting in a positive impact on the way their engineers search for information in the course of their work.

    15th December, 2009 Donia Scott (University of Sussex) - Summarisation and Visualization of Electronic Health Records

    10th December, 2009 Roberto Navigli (Universita di Roma "La Sapienza") - Comparing Graph Connectivity Measures for Word Sense Disambiguation

    Word sense disambiguation (WSD), the task of identifying the intended meanings (i.e. senses) for words in context, has been a long-standing research objective for Natural Language Processing. While supervised systems typically achieve better performance, they require large amounts of sense-tagged training instances. An alternative solution is that of adopting knowledge-based approaches that exploit existing knowledge resources to perform WSD and do not need annotated training sets. In this talk, we present an objective comparison of graph-based algorithms for alleviating the data requirements for large-scale WSD. Under this framework, finding the right sense for a given word amounts to identifying the most "important" node among the set of graph nodes representing its senses. We present a variety of measures that exploit the connectivity of graph structures, thereby identifying the most relevant word senses. We assess their performance on standard datasets, and show that the best measures perform comparably to state-of-the-art systems. We also provide interesting insights into the relevance of the underlying knowledge resource on WSD performance.
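
    The idea of picking the most "important" sense node can be illustrated with the simplest connectivity measure, degree centrality. The sketch below is a toy example under stated assumptions: the sense identifiers and edges are invented for illustration rather than taken from a real knowledge resource, and the function names are hypothetical. It does not reproduce the measures or graph construction compared in the talk.

```python
from collections import defaultdict


def build_graph(edges):
    """Undirected graph as an adjacency map: node -> set of neighbours."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def degree_centrality(graph, node) -> float:
    """Fraction of the other nodes in the graph that `node` is linked to."""
    n = len(graph)
    if n <= 1:
        return 0.0
    return len(graph.get(node, ())) / (n - 1)


def best_sense(graph, candidate_senses):
    """Choose the candidate sense with the highest degree centrality."""
    return max(candidate_senses, key=lambda s: degree_centrality(graph, s))


# Toy sense graph for a sentence mentioning "bank", "money" and "loan".
# Sense labels and links are illustrative only.
edges = [
    ("bank#finance", "money#1"),
    ("bank#finance", "loan#1"),
    ("money#1", "loan#1"),
    ("bank#river", "water#1"),
]
graph = build_graph(edges)
chosen = best_sense(graph, ["bank#finance", "bank#river"])
```

    In this toy graph the financial sense is linked to two of the four other nodes and the river sense to only one, so the financial sense is selected.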

    26th November, 2009 Serge Sharoff (University of Leeds) - Classifying the Web into Domains and Genres

    The jungle metaphor is quite common in corpus studies. The subtitle of David Lee's seminal paper on genre classification is 'navigating a path through the BNC jungle'. According to Adam Kilgarriff, the BNC is a jungle only when compared to smaller Brown-type corpora, while it looks more like an English garden when compared to the Web. At the moment we know little about the domains and genres of webpages. In the seminar I'm going to talk about approaches to understand the composition of the Web as a corpus.

    19th November, 2009 Luke Zettlemoyer (University of Edinburgh) - Learning to Follow Orders: Reinforcement Learning for Mapping Instructions to Actions

    In this talk, I will address the problem of relating linguistic analysis and control --- specifically, mapping natural language instructions to executable actions. I will present a reinforcement learning algorithm for inducing these mappings by interacting with virtual computer environments and observing the outcome of the executed actions. This technique has enabled automation of tasks that until now have required human participation --- for example, automatically configuring software by consulting how-to guides. Our results demonstrate that this method can rival supervised learning techniques while requiring few or no annotated training examples.

    29th October, 2009 Allan Ramsay (University of Manchester) - Using English to Express Commonsense Rules

    The talk will discuss some issues arising from an attempt to provide natural language access to a body of simple information about diet and its effect on various common medical conditions. Expressing this knowledge in natural language has a number of advantages. It also raises a number of difficult issues. I will outline the reasons why it seemed like a good idea and the reasons why it is difficult, and sketch our solution to these problems.

    15th October, 2009 Diana Maynard (University of Sheffield) - Using Lexico-Syntactic Patterns for Ontology Enrichment: the case of ODd SOFAS

    This talk describes the use of information extraction techniques involving lexico-syntactic patterns to generate ontological information from unstructured text and augment an existing ontology with new entities. We refine the patterns using a term extraction tool and some semantic restrictions derived from WordNet and VerbNet, in order to prevent the overgeneration that typically occurs with general patterns. We present two applications developed in GATE and available as plugins for the NeOn Toolkit: one for general use on all kinds of text, and one for specific use in the fisheries domain. Both make use of a new plugin for GATE which generates ontologies on the fly. Furthermore, we integrate support for ontology lifecycle development via a change log mechanism that enables logging of ontology versions and application of changes from one version to another.

    1st October, 2009 Trevor Cohn (University of Sheffield) - Bayesian Non-Parametric Models for Parsing and Translation (Slides)

    Many natural language processing tasks require inference over partially observed input data. Traditionally these models are trained using the expectation maximisation (EM) algorithm. However, for many models EM finds poor or degenerate solutions. Bayesian methods provide an elegant and theoretically principled way to address these problems, by including a prior over the model and integrating over uncertain events. In this talk I'll describe how we developed non-parametric Bayesian models for two related tasks: 1) learning a tree substitution grammar (DOP) for syntactic parsing and 2) learning a grammar-based machine translation model. The models learn compact and simple grammars, uncovering latent linguistic structures and in doing so outperform competitive baselines.

    2008 - 2009

    14th May, 2009 Sivaji Bandyopadhyay (Jadavpur University, India) - Emotion Analysis in Blog texts

    Emotion analysis on blog texts is being carried out for a less privileged language like Bengali. A set of six attitude types, namely happy, sad, anger, fear, disgust and surprise, has been selected for this emotion detection task to allow reliable and semi-automatic annotation of the blog texts. An automatic classifier has been applied to recognize the six basic attitude types for the individual words of a sentence. Different scoring strategies have been applied to identify the sentence-level emotion type from the acquired word-level emotion information. Unsupervised techniques have been applied to the classified test output to improve accuracy. The same method has been applied to the English SemEval 2007 Affect Sensing corpus, where it gave satisfactory performance.

    7th May, 2009 Leon Derczynski (University of Sheffield) - Sequencing of Events and Their Durations Based on Event Descriptions (Slides)

    Temporal Information Extraction is the elicitation of accurate data on events in a discourse. This specifies both tense and aspect of actions, both explicitly given by text and implicit from world knowledge. Events can occur at any point along a timeline, and are often only loosely specified in terms of upper and/or lower bounds relative to other events. Being able to identify and annotate times in discourse enables us to build a richer representation of the knowledge present in text. Given a document - for example, a news article - only a subset of facts within that document ever hold true at any one time. For example, we cannot concurrently assert "The silver and black Scott bike was chained to railings" and "An hour later it was gone". Extracting and temporally linking information is the only way to know which sets of facts hold true at the same time. A brief summary of literature and models surrounding tense and temporal location will be presented, followed by a review of recent work in the field. We will look at the normalisation of temporal data (anchoring vague expressions to a fixed interval on an absolute time scale), how events in text relate to each other and ways of reasoning about them, and different representations of temporal data - logical, textual and visual.

    30th April, 2009 Marta Sabou (Open University) - Exploiting Semantic Web Ontologies: An Experimental Report (Slides)

    As a side effect of the Semantic Web research activities, a large collection of ontologies is now available online constituting one of the largest and most heterogeneous knowledge sources in the history of AI. In this talk we report on the characteristics of this novel source and on its successful use for relation discovery. Our experiments show that, in the context of an ontology matching task, relations between the concepts of two ontologies can be discovered with a precision of 70% when using online ontologies. We conclude by exploring the potential of this novel knowledge resource for language technology applications.

    16th April, 2009 Kumutha Swampillai (University of Sheffield) - Inter-Sentential and Intra-Sentential Relations in IE Corpora

    Some information extraction systems are limited to extracting binary relations from single sentences. This constraint means that relations occurring across sentence boundaries cannot possibly be extracted by such systems. We examine the distribution of inter-sentential and intra-sentential relations in the MUC6 and ACE03 corpora. It was found that inter-sentential relations constitute 31.4% and 9.4% of the total number of relations in MUC6 and ACE03 respectively. These results show a 69.6% and a 90.6% recall upper bound of single sentence approaches to relation extraction. As such, any comprehensive approach to relation extraction will have to treat linguistic units larger than a sentence.

    2nd April, 2009 Danica Damljanovic (University of Sheffield) - Natural Language Interfaces to Conceptual Models: Usability and Performance (Slides)

    Accessing structured data in the form of ontologies currently requires the use of formal query languages (e.g., SeRQL or SPARQL) which pose significant difficulties for non-expert users. One way to lower the learning overhead and make ontology queries more straightforward is through a Natural Language Interface (NLI). While there are existing NLIs to structured data with reasonable performance, they tend to require expensive customisation to each new domain or ontology. Additionally, they often require specific adherence to a pre-defined syntax which, in turn, means that users still have to undergo training. Many methods are under development to reduce this training, and increase the usability of NLIs. We have developed Question-based Interface to Ontologies (QuestIO) which translates Natural Language text-based queries to SeRQL/SPARQL queries, which are then executed against the given ontology/knowledge base and the results are shown to the user. Customisation of this system is performed automatically from the ontology vocabulary. QuestIO is quite flexible in terms of complexity and syntax of the supported queries, as both keyword-based searches and full-blown questions are supported. However, in the user-centric evaluation of this system we noticed that performance was degraded because the users did not get sufficient help from the interaction with the system. In this talk, we present three methods for assisting the user while interacting with the system (feedback, creating personalised vocabulary, and query refinement) and show how these can be used in combination to improve the usability of NLIs to conceptual models.

    19th March, 2009 Peter Wallis (University of Sheffield) - Social Engagement with Robots and Agents (SERA) (Slides)

    Getting people to engage with robotic and virtual artifacts is easy, but keeping them engaged over time is hard: robots and agents lack some fundamental capabilities which can be summarized as sociability. The research community has realized the problem, but approaches, so far, have been dispersed and disjoint. If robots and agents are to become companions in people's lives, they will have to blend into these lives seamlessly. SERA is innovative in that it addresses sociability holistically, by advancing knowledge about what sociability in robots and agents entails, by developing methodology to analyze and evaluate it, and by making available research resources and platforms. SERA will, to this purpose, undertake real-life extended field studies of users' engagement with robotic devices. Sociability also has to be built into robot and agent architectures from scratch and the goal here is to implement an architecture that caters for both background (cultural, normative etc.) and situational individual (theory of mind, adaptivity, responsiveness) practices and needs of users, with the guiding principle of pervasive affectivity. Assistive robots and agents that are to become true companions have to be versatile in functionality and identity (style, personality) depending on the service they are required to deliver, such as (reactive) social mediators, as (in turn reactive and proactive) information assistants, or as (proactive) coaches or monitors e.g. with health-related tasks. SERA will develop pilots of such intertwined interactive service applications for a robotic device.

    12th March, 2009 Chris Huyck (Middlesex University) - A Psycholinguistic Model of Natural Language Parsing Implemented in Simulated Neurons (Slides)

    One of the central activities in natural language processing is parsing. There is a wide range of engineering solutions to parsing but none perform at human levels. The understanding of how humans process language is far from complete, but there is little doubt that humans use their neurons for all mental activities including parsing. There are several psychological models of parsing, but this talk will describe the first neuro-psychological model of parsing. That is, the parser is implemented entirely in simulated neurons. It makes use of Hebb's Cell Assembly hypothesis to form the basis of memories including words, clauses and sentences. Neural parsers require variable binding, and this parser binds via short-term potentiation. The parser produces correct semantic output. As neural cycles have an associated time, time can be measured, and the parser parses in times similar to humans. Prepositional phrase attachment ambiguities are resolved based on the semantics of the sentence. Finally, the parser is embedded in a functioning agent.

    5th March, 2009 Monica Schraefel (University of Southampton) - The Path to Joyful Interaction or Why doesn't your computer make you happy?

    The common computing interaction paradigm is task oriented and task silo'd. We go to a specific application that supports a specific task and do that specific thing. There is some boundary crossing within applications: calendars and address books share data; email is forced into being as flexible as a paper notebook; spreadsheets can be linked into word processing documents. Yet perhaps not too many would say they feel particularly empowered by their computers, or that their quality of life is enhanced by interacting with these machines. There are at least several ways in which we might consider why this lack of joy and delight is the more usual experience of computers in our world. One may be this sense of having to do too many things FOR the computer in order for it to do things for us. Another may be that even when it has the information, it does not DO what we want with it. It is functionally obtuse. Another may be that the cost of trying to explain what to do is simply too high for the benefit that might accrue. In the past year or so, a few of us have been looking at some of these problems that appear to be quite lightweight issues, and yet have been substantial roadblocks towards delightful computing. We have been prototyping some approaches to explore new interactions and new types of services that might be both practically effective in freeing us from serving the computer to get on with our own missions, and may, in so doing, serve to enhance our quality of life along the way. In this talk, I'll go over some of these projects, the motivation behind them and how far we've gotten on the path to joyful computing and the perfect digital assistant.

    26th February, 2009 Mark Stevenson (University of Sheffield) - Disambiguation of Biomedical Text Slides

    Like text in other domains, biomedical documents contain a range of terms with more than one possible meaning. These ambiguities form a significant obstacle to the automatic processing of these texts. Previous approaches to resolving this problem have made use of a variety of knowledge sources including the context in which the ambiguous term is used and domain-specific resources (such as UMLS). We compare a range of knowledge sources which have been previously used and introduce a novel one: MeSH terms. The best performance is obtained using linguistic features in combination with MeSH terms. Performance exceeds previously reported results on a standard test set. Our approach is supervised and therefore relies on annotated training examples. A novel approach to automatically acquiring additional training data, based on the relevance feedback technique from Information Retrieval, is presented. Applying this method to generate additional training examples is shown to lead to a further increase in performance.

    19th February, 2009 Mark A. Greenwood (University of Sheffield) - IR4QA: An Unhappy Marriage Slides

    Over a decade of recent question answering (QA) research has relied on using off-the-shelf information retrieval (IR) engines in order to find relevant documents from which exact answers can be extracted. In this talk I will explain why most QA systems follow this approach and summarise the recent research into what has become known as IR4QA. It is becoming increasingly clear, however, that the use of IR within QA systems is nothing more than a marriage of convenience: in general, QA researchers don't want to develop IR engines and IR researchers are not interested in the QA task. I believe that this marriage is doomed and will never lead to the production of high performance QA systems. The second half of the talk will highlight the main problems inherent in modern QA systems which use IR engines and suggest some possible avenues that QA research may take in the future.

    12th February, 2009 Ehud Reiter (University of Aberdeen) - BabyTalk: Generating English Summaries of Clinical Data Slides

    I will give an overview of the BabyTalk project, whose goal is to generate English summaries of complex clinical data from a neonatal intensive care unit, for doctors, nurses, parents, and other family members. BabyTalk is based on the hypothesis that a textual summary of the most important information in a data set can in some cases be more useful than a visualisation which presents all of the data, or an expert system which explicitly gives advice based on the data. I will primarily focus on NLP challenges in BabyTalk, such as generating good narratives and effectively communicating temporal information. I will also present the results of our first evaluation, which were mixed but overall quite encouraging.

    5th February, 2009 Julien Bourdon (Kyoto University) - Language Grid: An Infrastructure for Intercultural Collaboration Slides

    The Language Grid is an on-line multilingual service platform which enables easy registration and sharing of language services such as on-line dictionaries, bilingual corpora, and machine translations. Unlike existing machine translation systems, the Language Grid allows users to register and combine user-created dictionaries and bilingual corpora with existing machine translations to realize user-oriented translation programs with greater accuracy. The main goals of this project are to combine the existing standard language services provided by linguistic professionals and to assist users in creating new language services for their own purposes by permitting them to add their own language resources to the ones made by professionals. Currently, services such as translators, dictionaries, parallel texts, morphological analysers and concept dictionaries, available in 10 languages, are deployed on the Language Grid. The Language Grid is used for applications such as multilingual collaboration in NPOs and intercultural coexistence in Japanese schools or hospitals.

    4th December, 2008 Diana McCarthy (University of Sussex) - Evaluating Lexical Inventories and Disambiguation Systems with Lexical Substitution Slides

    There has been a surge of interest within Computational Linguistics over the last decade in methods for word sense disambiguation (WSD). A major catalyst has been the series of SENSEVAL evaluation exercises which have provided standard datasets for the field. Whilst researchers believe that WSD will ultimately prove useful for applications which need some degree of semantic interpretation, the jury is still out on this point. One significant problem is that there is no clear choice of inventory for any given task, other than the use of a parallel corpus for a specific language pair for a machine translation application. Many of the evaluation datasets produced, certainly in English, have used WordNet. Whilst WordNet is a useful resource, it would be beneficial if systems using other inventories could enter the WSD arena without the need for mappings between the inventories which may mask results. This is particularly important since there is no consensus that WordNet sense distinctions are the right ones to make for any given application. As well as the work in disambiguation, there is a growing interest in automatic acquisition of inventories of word meaning. It would be useful to investigate the merits of predefined inventories themselves, aside from their use for disambiguation, and compare these with inventories which have been acquired automatically. In this talk I will discuss these issues and some results in the context of the English Lexical Substitution Task, organised by myself and Roberto Navigli (University of Rome, "La Sapienza") last year under the auspices of SEMEVAL.

    27th November, 2008 David Guthrie (University of Sheffield) - Unsupervised Detection of Anomalous Text Slides, PhD Thesis

    Situations abound that rely on the ability of computers to detect differences from what is normal or expected. Credit card companies identify possible fraud by detecting spending patterns that differ from what is 'normal' for a given cardholder and network analysts detect possible attacks by spotting network traffic that is out of the ordinary. The focus for this talk is the development of unsupervised technologies to similarly detect anomalies in text. We use the term "anomalous" to refer to text that is irregular, or unusual, with respect to the writing style in the majority of a text. In this talk we show that identifying such abnormalities in text can be viewed as a type of outlier detection because these anomalies will deviate significantly from their surrounding context. We consider segments of text which are anomalous with respect to topic (i.e. about a different subject), author (written by a different person), or genre (written for a different audience or from a different source) and experiment with whether it is possible to identify these anomalous segments automatically. Several different innovative approaches to this problem are introduced and we present results over large document collections, created to contain randomly inserted anomalous segments.

    18th November, 2008 Seemab Latif (University of Manchester) - Novel Automatic Technique for Linguistic Quality Assessment of Students' Essays Using Automatic Summarizers Slides

    In this seminar, I will be talking about the experiments that have addressed the calculation of inter-annotator inconsistency in selecting the content in both manual and automatic summarization of sample TOEFL essays. A new finding is that the linguistic quality of the source essay has a very strong positive correlation with the degree of disagreement among human assessors as to what should be included in a summary. This leads to a fully automated essay evaluation technique based on the degree of disagreement among automated summarizers. ROUGE evaluation is used to measure the degree of inconsistency among the participants (human summarizers and automatic summarizers). This automated essay evaluation technique is potentially an important contribution with wider significance.

    6 November, 2008 Niraj Aswani (University of Sheffield) - Tools for Alignment Tasks Slides

    For some tasks, such as text alignment and cross-document co-reference resolution, one would need to refer to more than one document at the same time. Hence, a need arises for Processing Resources (PRs) which can accept more than one document as parameters. For example, given two documents, a source and a target, a Sentence Alignment PR would need to refer to both of them to identify which sentence of the source document aligns with which sentence of the target document. Similarly, for cross-document co-reference resolution, the respective PR would need to access both documents simultaneously. The standard behaviour of the GATE PRs conflicts with the above-mentioned requirements. GATE PRs process one document at a time. Even a corpus pipeline, which accepts a corpus as input, considers only one document at a time. Having said this, it is not impossible to make PRs accept more than one document, but this would require a lot of re-engineering. Recently, we have introduced a few new resources in GATE (e.g. CompoundDocument, CompositeDocument, AlignmentEditor etc.) to address these issues. In this short presentation, I will describe these components and show how to use them.

    28 October, 2008 Rob Gaizauskas (University of Sheffield) - Generating Image Captions using Topic Focused Multi-document Summarization Slides

    In the near future digital cameras will come standardly equipped with GPS and compass and will automatically add global position and direction information to the metadata of every picture taken. Can we use this information, together with information from geographical information systems and the Web more generally, to caption images automatically? This challenge is being pursued in the TRIPOD project and in this talk I will address one of the subchallenges this topic raises: given a set of toponyms automatically generated from geo-data associated with an image, can we use these toponyms to retrieve documents from the Web and to generate an appropriate caption for the image?

    We begin by assuming the toponyms name the principal objects or scene contents in the image. Using web resources (e.g. Wikipedia) we attempt to determine the types of these things -- is this a picture of a church? a mountain? a city? We have constructed a taxonomy of such image content types using on-line image collections and for each such type we have constructed several collections of texts describing that type. For example, we have a collection of captions describing churches and a collection of Wiki pages describing churches. The intuition here is that these collections are examples of, e.g. the sorts of things people say in captions of churches. These collections can then be used to derive models of objects or scene types which can be used to bias or focus multi-document summaries of new images of things of the same type.

    In the talk I report results of work we have carried out to explore the hypothesis underlying this approach, namely that brief multidocument summaries generated as image captions by using models of object/scene types to bias or focus content selection will be superior to generic multidocument summaries generated for this purpose. I describe how we have constructed an image content taxonomy, how we have derived text collections for object/scene types, how we have derived object/scene type models from these collections and how these have been used in multi-document summarization. I also discuss the issue of how to evaluate the resulting captions and present preliminary results from one sort of evaluation.

    21 October, 2008
    Leon Derczynski (University of Sheffield) - A Data Driven Approach to Query Expansion in Question Answering Slides

    Automated answering of natural language questions is an interesting and useful problem to solve. Question answering (QA) systems often perform information retrieval at an initial stage. Information retrieval (IR) performance, provided by engines such as Lucene, places a bound on overall system performance. For example, no answer-bearing documents are retrieved at low ranks for almost 40% of questions. In this paper, answer texts from previous QA evaluations held as part of the Text REtrieval Conferences (TREC) are paired with queries and analysed in an attempt to identify performance-enhancing words. These words are then used to evaluate the performance of a query expansion method. Data-driven extension words were found to help in over 70% of difficult questions. These words can be used to improve and evaluate query expansion methods. Simple blind relevance feedback (RF) was correctly predicted as unlikely to help overall performance, and a possible explanation is provided for its low value in IR for QA.
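
    As an illustration of the blind relevance feedback idea mentioned in this abstract, here is a minimal sketch in Python. It is not the system described in the talk: the function name, the stopword list, the example query and the example documents are all invented for illustration, and a real setup would take the top-ranked documents from an IR engine rather than from a hard-coded list.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the talk).
STOPWORDS = {"the", "in", "was", "to", "a", "of"}

def expand_query(query_terms, top_docs, n_terms=1):
    """Blind relevance feedback: assume the top-ranked documents are relevant
    and append their most frequent non-query, non-stopword terms to the query."""
    query_set = set(query_terms)
    counts = Counter()
    for doc in top_docs:
        for term in re.findall(r"\w+", doc.lower()):
            if term not in query_set and term not in STOPWORDS:
                counts[term] += 1
    # Sort by descending frequency, then alphabetically, so ties are deterministic.
    ranked = sorted(counts, key=lambda t: (-counts[t], t))
    return list(query_terms) + ranked[:n_terms]

# Hypothetical query and hypothetical top-ranked documents.
query = ["who", "invented", "telephone"]
docs = [
    "alexander graham bell invented the telephone",
    "bell patented the telephone in 1876",
    "the telephone patent was awarded to bell",
]
expanded = expand_query(query, docs)
print(expanded)
```

    The expansion is "blind" because no one checks whether the top-ranked documents are actually relevant; if they contain no answer-bearing text, the added terms can pull the query further off target.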

    Mark A. Greenwood (University of Sheffield) - Evaluation of Automatically Reformulated Questions in Question Series Slides

    Having gold standards allows us to evaluate new methods and approaches against a common benchmark. In this paper we describe a set of gold standard question reformulations and associated reformulation guidelines that we have created to support research into automatic interpretation of questions in TREC question series, where questions may refer anaphorically to the target of the series or to answers to previous questions. We also assess various string comparison metrics for their utility as evaluation measures of the proximity of an automated system's reformulations to the gold standard. Finally we show how we have used this approach to assess the question processing capability of our own QA system and to pinpoint areas for improvement.
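
    To make the notion of a string comparison metric concrete, the sketch below scores a system reformulation against a gold-standard one using token-level Jaccard overlap. This particular metric is only one plausible choice and is not necessarily among those assessed in the talk; the function name and the example questions are invented for illustration.

```python
import re

def jaccard_similarity(system, gold):
    """Token-level Jaccard overlap between two strings (1.0 = identical token sets)."""
    sys_tokens = set(re.findall(r"\w+", system.lower()))
    gold_tokens = set(re.findall(r"\w+", gold.lower()))
    if not sys_tokens and not gold_tokens:
        return 1.0  # two empty strings are treated as identical
    return len(sys_tokens & gold_tokens) / len(sys_tokens | gold_tokens)

# Hypothetical example: the system failed to resolve the anaphor "it".
gold = "When was the comet discovered?"
system = "When was it discovered?"
score = jaccard_similarity(system, gold)
print(score)
```

    A set-based measure like this ignores word order, so it would typically be compared with order-sensitive measures such as edit distance before being adopted as an evaluation metric.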

    14 October, 2008 - Jordi Poveda (UPC Catalunya) - A Combination of Machine Learning Methods for the Recognition of Temporal Expressions Slides

    Time expression recognition and representation of the time information they convey in a suitable normalized form is a central part of Information Extraction (IE), for it paves the way for the extraction of events and temporal relations. The most common approach to time expression recognition in the past has been the use of handmade extraction rules (grammars), which also served as the basis for normalization. Our aim is to explore the possibilities afforded by applying machine learning techniques to the recognition of time expressions, in order to see where it stands in relation to grammar-based approaches. We focus on recognizing the appearances of time expressions in text (not normalization) and transform the problem into one of chunking, where the aim is to correctly assign IOB tags to tokens. We will explain the knowledge representation used and compare the results obtained in our experiments with two different supervised methods, one statistical (support vector machines) and one of rule induction (FOIL), where the superiority of SVMs is revealed. Next, we will present a semi-supervised approach, based on bootstrapping, to the extraction of time expression mentions in large unlabelled corpora. The only supervision is in the form of seed examples, hence it becomes necessary to resort to heuristics to rank and filter out spurious patterns and candidate time expressions. We will summarize our preliminary results with this bootstrapping architecture, which is currently in a testing and improvement stage. The ultimate benefit of developing an end-to-end machine-learning-based framework for information extraction is that it can be carried to new domains and tasks with little customization.
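
    The reduction of time expression recognition to chunking can be illustrated with a short Python sketch that converts annotated spans into per-token IOB tags, the labels a supervised tagger would then be trained to predict. The function name, the TIMEX label and the example sentence are assumptions made for illustration and are not taken from the talk.

```python
def spans_to_iob(tokens, spans, label="TIMEX"):
    """Convert (start, end) token spans, end exclusive, into IOB tags.

    B-<label> marks the first token of an expression, I-<label> marks the
    following tokens inside it, and O marks tokens outside any expression."""
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags

# Hypothetical sentence with two time expressions.
tokens = ["We", "met", "on", "14", "October", "2008", "and", "again", "yesterday", "."]
spans = [(3, 6), (8, 9)]  # "14 October 2008" and "yesterday"
tags = spans_to_iob(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

    With the data in this form, each token becomes one classification instance, which is what allows standard classifiers such as SVMs to be applied to the recognition task.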