Current Trends in Language Technology and Linguistic Modeling

Associated event of TLT 13

"Language Technology and Linguistic Modeling" is a topic of immediate relevance to treebanks and linguistic theories, and it was therefore selected for this year's associated event of TLT 13 in Tübingen. The organizing institution in Tübingen is well known in this field, hosting and maintaining prominent resources such as the Tübingen Treebanks of German (the TüBa family) and GermaNet, a resource widely used in language technology. Accordingly, this year's associated event presents speakers who work in this field and share a Tübingen background: on the occasion of the 60th birthday of the local TLT chair, Erhard Hinrichs, some of his former collaborators return to Tübingen to present their current work. TLT 13 participants especially (but not exclusively) are cordially invited to join this associated event at no cost.

Program: Thursday, December 11

09:30 - 10:00 Registration for TLT
10:00 - 12:00
  • Adam Przepiórkowski: "Towards a scalable combinatorial dictionary that has it all"
  • Manfred Sailer & Frank Richter: "Idiomatic negation marking. The case of German 'den Teufel tun'"
  • Yannick Versley: "Firing the linguists doesn't help: On the interdisciplinarity in Computational Linguistics"
12:00 - 14:00 Lunch Break
14:00 - 15:20
  • Paula Monachesi: "Language technology and social media for Future Cities"
  • Shuly Wintner: "The features of translationese: computational approaches to translation studies"
15:20 - 15:50 Coffee Break
15:50 - 17:30
  • Petya Osenova and Kiril Simov: "Semantic Resources for Knowledge Rich NLP"
  • Sandra Kübler: "From Text to Underlying Opinions"
  • Erhard Hinrichs: Concluding Remarks

Abstracts

Adam Przepiórkowski: "Towards a scalable combinatorial dictionary that has it all"

In this talk I will describe Walenty (http://zil.ipipan.waw.pl/Walenty), a large valence dictionary of Polish developed at the Institute of Computer Science of the Polish Academy of Sciences, which already has a number of novel features compared to other such dictionaries, including a very rich phraseological component. I will also argue that the time has come to start fulfilling the vision of a dictionary that exhaustively describes the paradigmatic, syntactic, semantic, and pragmatic aspects of all single- and multi-word units of a given language. Such a vision was perhaps naive in the late 1960s and early 1970s, when Mel'čuk, Zholkovsky, and Apresjan formulated it, but it is now becoming feasible due to 1) the availability of large-scale basic lexical resources such as wordnets, morphosyntactic valence dictionaries, and phraseological lexica, 2) the dynamic development of methods for acquiring more complex linguistic knowledge from such resources and from corpora, and 3) the possibility of building tools that significantly support manual lexicographic work at all stages, from the collection of relevant data to the verification of the coherence and correctness of the linguistic information added by humans.

Manfred Sailer & Frank Richter: "Idiomatic negation marking. The case of German 'den Teufel tun'"

The idiomatic German negator "einen/den Teufel tun" (literally: do a/the devil) is followed by an infinitival or a conjoined VP. As shown in (1), this expression licenses strong NPIs such as "ein Sterbenswörtchen sagen" ('say a dying word') in its second part, just as an n-word does in its scope.

(1) Ich werde einen/den Teufel tun und ein Sterbenswörtchen sagen/, ein Sterbenswörtchen zu sagen.
I will do the devil and say a dying word/ to say a dying word
‘I will certainly not say a dying word.’

When the expression is used in the scope of negation, speaker judgments differ quite drastically. While some speakers reject (2) ("group 1"), others have an interpretation which ignores one of the negations ("group 2").

(2) Ich glaube nicht, dass Alex einen Teufel tun wird, dir zu helfen.
I don't think that Alex will do the devil to help you
Speaker group 1: *
Speaker group 2: 'I think Alex will certainly not help you'

Based on corpus data and on introspective judgments, we will present a detailed picture of both the NPI licensing properties of "einen Teufel tun" and of its interaction with other negative operators. We will explore the relevance of these observations for current theories of polarity items.

Yannick Versley: "Firing the linguists doesn't help: On the interdisciplinarity in Computational Linguistics"

Frederick Jelinek, recipient of the ACL's Lifetime Achievement Award and (in)famous for having stated that "Every time we fired a linguist, recognition rates improved", likened the relation between linguistics and natural language processing to that between physicists and engineers. Engineering requires a good knowledge of the physics involved, and, conversely, much of today's physics is quite engineering-heavy. Against this backdrop, I will probe further into the interactions between linguistics and computational science across the multiple 'paradigms' that computational linguistics has seen, ranging from finite-state approaches to current approaches involving convolutional neural networks, and investigate how these paradigms influence the interaction between the disciplines.

Shuly Wintner: "The features of translationese: computational approaches to translation studies"

Translation is a text production mode that imposes cognitive (and cultural) constraints on the text producer. The product of this process, known as *translationese*, reflects these constraints; translated texts are therefore ontologically different from texts written originally in the same language. Many of the special properties of translationese are believed to be universal, in that they are manifest in any translated text regardless of the source and target languages.

In this work we test several Translation Studies hypotheses using a computational methodology based on supervised machine learning. Casting the problem in the paradigm of authorship attribution, we define dozens of classifiers that implement various linguistically informed features reflecting translation universals. While the practical task of distinguishing original from translated texts is easy, we focus not on improving the accuracy of classification, but rather on designing linguistically meaningful features and assessing their contribution to the task. We demonstrate that some feature sets are indeed good indicators of translationese, thereby corroborating some hypotheses, whereas others perform much worse (sometimes at chance level), indicating that some 'universal' assumptions have to be reconsidered.
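The classification setup described above can be illustrated with a deliberately tiny sketch (this is not code from the talk): texts are represented by function-word frequencies, one of the classic indicators used in authorship attribution, and labeled with a nearest-centroid rule. The toy sentences, the word list, and the function names are all invented for illustration; the actual experiments use large corpora and much richer, linguistically informed feature sets.

```python
from collections import Counter

# A handful of English function words; frequency profiles of such words
# are a classic feature type in authorship-attribution-style setups.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "it", "as"]

def features(text):
    """Relative frequency of each function word in a whitespace-tokenized text."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    n = max(len(tokens), 1)
    return [counts[w] / n for w in FUNCTION_WORDS]

def classify(text, labeled_texts):
    """Nearest-centroid classification: return the label whose mean feature
    vector is closest (Euclidean distance) to the input text's features."""
    v = features(text)

    def centroid(texts):
        vecs = [features(t) for t in texts]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def dist_to(label):
        c = centroid(labeled_texts[label])
        return sum((a - b) ** 2 for a, b in zip(v, c)) ** 0.5

    return min(labeled_texts, key=dist_to)

# Toy "corpora" (invented). With one text per class, a training text is
# always assigned its own label (distance zero to its centroid).
corpora = {
    "original":   ["it was the best of times and it was the worst of times"],
    "translated": ["one must admit that such a state of affairs is regrettable"],
}
print(classify(corpora["original"][0], corpora))  # → original
```

The real research question, as the abstract notes, is not this classification step itself but which feature sets separate the classes and what that says about translation universals.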

Petya Osenova and Kiril Simov: "Semantic Resources for Knowledge Rich NLP"

Semantic Web developments have sped up the enhancement of knowledge-based resources, tools, and applications. The use of lexical databases, such as wordnets and thesauri, in various NLP applications has become best practice. These databases either complement the information in other language resources (treebanks, corpora, etc.) or support word sense disambiguation and machine translation, thus improving language understanding and generation.
At the same time, a number of ontologies have emerged in recent years: linguistic ontologies (SIMPLE, GOLD, etc.) and domain ontologies (MeSH, the GENIA Ontology, etc.). Last but not least, Linked Open Data entered the NLP scene with its wealth of structured factual knowledge in various domains (DBpedia, Freebase, etc.). These data have become thematically interlinked through ontologies and searchable through the addition of SPARQL endpoints. Thus the need inevitably arose to find ways of communicating between, and utilizing, the variety of existing large knowledge resources. For example, the lemon lexicon model for ontologies was created in the MONNET project, and SPARQL was established as the language for querying Linked Open Data.
The talk will discuss the aspects mentioned above, with a focus on a small language: Bulgarian.
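As a concrete illustration of the SPARQL-endpoint access mentioned above (a generic sketch, not tied to the talk's own resources), a query can be sent to a public endpoint such as DBpedia's as an ordinary HTTP GET request with the query URL-encoded. The query text and the function name below are invented for illustration:

```python
import urllib.parse

def build_sparql_request(endpoint, query):
    """URL-encode a SPARQL query as an HTTP GET request asking for JSON results."""
    params = {"query": query, "format": "application/sparql-results+json"}
    return endpoint + "?" + urllib.parse.urlencode(params)

# A small hypothetical query: Bulgarian-language labels of cities
# (dbo: and rdfs: are default prefixes on the DBpedia endpoint).
query = """
SELECT ?city ?label WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  FILTER (lang(?label) = "bg")
} LIMIT 5
"""

url = build_sparql_request("https://dbpedia.org/sparql", query)
# urllib.request.urlopen(url) would execute the query; we stop at URL
# construction to keep the example network-free.
print(url[:34])  # → https://dbpedia.org/sparql?query=%
```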

Sandra Kübler: "From Text to Underlying Opinions"

In this talk, I will describe work on a current project in which we investigate how to detect the opinions underlying a text automatically. I will concentrate on identifying sentences that carry opinion and on predicting user ratings in reviews, a hard problem because the distribution of opinions is extremely skewed. I will discuss feature selection methods as well as the use of linguistic knowledge, which requires syntactic and semantic analysis.
