Cross-linguistic Semantic and Syntactic Representation

This joint PhD project is based at the Hebrew University of Jerusalem, with a 12-month stay at the University of Melbourne.

Supervision Team: Dr Omri Abend, Hebrew University of Jerusalem; Dr Lea Frermann, University of Melbourne

Project Description:

The technological and theoretical importance of cross-linguistic applicability in semantic and syntactic representation has long been recognized, but achieving it has proved extremely difficult. The project will make progress towards defining a semantic and syntactic scheme that can be applied consistently across languages, building on two major bodies of work:

  1. At the lexical level, we will build on the expanding body of work on mapping the semantic spaces of different languages [17, 18, 19]. Despite the considerable success of these methods, the interest they have attracted in the research community [20], and their value for downstream applications such as the automatic compilation of multilingual dictionaries, the existing approaches make simplistic assumptions about the nature of the mapping between the semantic spaces of different languages [21].
  2. At the sentence level, we will build on the Universal Dependencies scheme [22] for syntactic representation and the UCCA scheme [23] for semantic representation. Both approaches build on work in linguistic typology and have been applied to a number of languages. However, their categories remain coarse-grained, and the relation between the sentence and lexical levels remains largely unexplored.
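
The lexical-level mapping work cited above typically rests on the assumption the project aims to relax: that a single linear, often orthogonal, transformation aligns the embedding spaces of two languages. A minimal NumPy sketch of that baseline (the orthogonal Procrustes solution via SVD; the toy data are illustrative, not a real bilingual lexicon) makes the assumption concrete:

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Find the orthogonal matrix W minimising ||XW - Y||_F, where the
    rows of X and Y are embeddings of translation pairs.  This single
    global linear map is the simplistic assumption criticised in [21]."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy example: a "target" space that is an exact rotation of the source.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                  # source-language embeddings
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # hidden orthogonal rotation
Y = X @ Q                                      # target-language embeddings
W = orthogonal_procrustes(X, Y)
print(np.allclose(X @ W, Y))                   # True: recovers the rotation
```

When the two spaces really are related by a rotation, this recovers it exactly; the project's premise is that real language pairs are not, which is what motivates the non-linear approach described below.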

Studying Cross-linguistic Alignment and Divergence Patterns through Parallel Corpora: The development of the Universal Dependencies (UD) and UCCA annotation schemes provides a basis for in-depth statistical studies of cross-linguistic syntactic divergences based on data from parallel corpora. This constitutes an improvement over traditional feature-based studies that treat languages as vectors of categorical features (as languages are represented, e.g., in databases such as WALS or AutoTyp). However, existing studies are mostly based on summary statistics over parallel corpora, such as relative frequencies of different word-order patterns, and do not capture the fine-grained cross-linguistic mappings that matter both for linguistic typology and for practical NLP applications. For example, this methodology cannot directly detect that English nominal compounds and nominal-modification constructions are often translated with Russian adjectival-modification constructions, or that English adjectival-modification and nominal-modification constructions routinely give rise to Korean relative clauses.
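
To make the limitation concrete, the kind of summary statistic such studies compute can be sketched as a simple count over a UD treebank, e.g., the relative frequency of adjective-before-noun order among `amod` dependents. The minimal CoNLL-U handling below is an illustrative assumption, not the actual tooling of any cited study:

```python
from collections import Counter

def adj_noun_order_freq(conllu_text):
    """Relative frequency of ADJ-before-NOUN order among amod dependents:
    the kind of word-order summary statistic the text argues is too coarse.
    Parses the 10-column tab-separated CoNLL-U format directly (a sketch)."""
    counts = Counter()
    sent = []
    for line in conllu_text.splitlines() + [""]:
        if line and not line.startswith("#"):
            cols = line.split("\t")
            if cols[0].isdigit():               # skip multiword-token ranges
                sent.append(cols)
        elif sent:                               # blank line ends a sentence
            for tok in sent:
                if tok[7] == "amod":             # DEPREL column
                    head = int(tok[6])           # HEAD column (1-based)
                    order = "ADJ<NOUN" if int(tok[0]) < head else "NOUN<ADJ"
                    counts[order] += 1
            sent = []
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

example = """\
1\tred\tred\tADJ\t_\t_\t2\tamod\t_\t_
2\tcar\tcar\tNOUN\t_\t_\t0\troot\t_\t_
"""
print(adj_noun_order_freq(example))   # {'ADJ<NOUN': 1.0}
```

Such per-language frequencies say nothing about which construction a given source phrase maps to in a particular translation, which is exactly the gap the example about Russian and Korean points out.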

Preliminary work in Omri’s lab has manually word-aligned a subset of the Parallel Universal Dependencies corpus collection and conducted a quantitative and qualitative study based on it. The proposed project will not only extend the analysis to additional language pairs and to the use of UCCA categories, but also refine the representation with finer-grained distinctions, drawing on other sentence-level schemes such as AMR [24]. Moreover, the project will extend the analysis to include differences in the lexical semantics of the two languages, using an induced mapping between their distributional spaces.
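
One way such fine-grained divergence statistics could be computed from word-aligned parallel UD data is to tally how source-language dependency relations map to target-language relations under the alignment. The data format below is a hypothetical simplification for illustration only:

```python
from collections import Counter

def relation_correspondences(src_deprels, tgt_deprels, alignment):
    """Tally how source UD relations correspond to target UD relations
    under a word alignment.  `alignment` is a list of (src_idx, tgt_idx)
    token-index pairs; the flat per-token lists are an illustrative
    assumption, not the project's actual data model."""
    counts = Counter()
    for i, j in alignment:
        counts[(src_deprels[i], tgt_deprels[j])] += 1
    return counts

# Toy example: an English nominal compound aligned to a Russian
# adjectival modifier -- the divergence pattern mentioned in the text.
en = ["compound", "root"]          # "stone wall"
ru = ["amod", "root"]              # "kamennaya stena"
align = [(0, 0), (1, 1)]
print(relation_correspondences(en, ru, align).most_common())
# [(('compound', 'amod'), 1), (('root', 'root'), 1)]
```

Aggregated over a corpus, the off-diagonal cells of such a contingency table (e.g., `compound` → `amod`) are precisely the systematic divergences that summary word-order statistics cannot reveal.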

Richer Mappings of Distributional Spaces across Languages: A complementary approach to studying semantic mappings across languages through the alignment of parallel corpora is to align the vector-space representations induced from monolingual data in each language. We will go beyond current approaches, which attempt to find a global mapping between distributional spaces, mostly in the form of orthogonal linear transformations. Instead, we will adopt a non-linear approach based on topological data analysis.
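
As one concrete instance of a topological comparison (an illustrative sketch under the stated assumptions, not the project's chosen method), the 0-dimensional persistent homology of a point cloud can be read off the edge lengths of its Euclidean minimum spanning tree. Two spaces related by a global rotation then have identical persistence diagrams, while a genuinely non-isometric deformation does not:

```python
import numpy as np

def zero_dim_persistence(points):
    """Death times of the 0-dimensional persistent homology classes of a
    point cloud: exactly the edge lengths of its Euclidean minimum
    spanning tree, computed here with Prim's algorithm."""
    dists = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    n = len(points)
    in_tree = np.zeros(n, dtype=bool)
    in_tree[0] = True
    best = dists[0].copy()   # cheapest edge from the tree to each node
    deaths = []
    for _ in range(n - 1):
        best[in_tree] = np.inf
        j = int(np.argmin(best))         # nearest node outside the tree
        deaths.append(best[j])
        in_tree[j] = True
        best = np.minimum(best, dists[j])
    return np.sort(np.array(deaths))

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))             # one "embedding space"
B = A @ np.diag([1.0, 1.0, 3.0])         # a non-isometric deformation
# A global rotation of A would leave this gap at 0; the deformation does not.
gap = np.abs(zero_dim_persistence(A) - zero_dim_persistence(B)).max()
print(gap > 0)                           # True
```

This invariance to rotations, combined with sensitivity to genuine distortions, is what makes topological summaries a natural diagnostic for where a global linear map between two distributional spaces must fail.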

The project will also study how syntactic and lexical differences between languages relate to one another, with the goal of understanding how both types of difference shape the geometry and topology of the embedding spaces of different languages.


[17] Tomas Mikolov, Quoc Le, and Ilya Sutskever. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.
[18] Mikel Artetxe, Gorka Labaka, and Eneko Agirre. A robust self-learning method for fully unsupervised cross-lingual mappings of word embeddings. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 789–798, 2018.
[19] Alexis Conneau, Guillaume Lample, Marc’Aurelio Ranzato, Ludovic Denoyer, and Hervé Jégou. Word translation without parallel data. arXiv preprint arXiv:1710.04087, 2017.
[20] Sebastian Ruder, Ivan Vulić, and Anders Søgaard. A survey of cross-lingual word embedding models. Journal of Artificial Intelligence Research, 65:569–631, 2019.
[21] Anders Søgaard, Sebastian Ruder, and Ivan Vulić. On the limitations of unsupervised bilingual dictionary induction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018.
[22] Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. Universal dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portorož, Slovenia, May 2016. European Language Resources Association (ELRA).
[23] Omri Abend and Ari Rappoport. Universal Conceptual Cognitive Annotation (UCCA). In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 228–238, 2013.
[24] Laura Banarescu, Claire Bonial, Shu Cai, Madalina Georgescu, Kira Griffitt, Ulf Hermjakob, Kevin Knight, Philipp Koehn, Martha Palmer, and Nathan Schneider. Abstract Meaning Representation for sembanking. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 178–186, 2013.