
Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages

Bhowmik, Kowshik


2022, PhD, University of Cincinnati, Engineering and Applied Science: Computer Science and Engineering.
Distributed representations of words, or word embeddings, have been successfully utilized in many Natural Language Processing (NLP) tasks. However, not all monolingual embedding spaces are trained with the same amount of data. Interest in transferring knowledge across languages, especially from resource-rich languages to low-resource ones, has given rise to cross-lingual word embeddings (CLWEs). CLWEs represent words belonging to different languages in a shared semantic space. In this joint embedding space, the vector representations of semantically equivalent words lie close to one another, irrespective of the language they belong to. CLWEs form the basis of Bilingual Lexicon Induction (BLI) and Machine Translation (MT), as they make it possible to compare word meanings across languages. The similar geometric arrangement of similar concepts in the monolingual word embeddings of different languages has led to learning a linear, and more specifically an orthogonal, transformation from one embedding space to another. Mapping-based methods of learning CLWEs hinge on the premise that there is an invariance among languages that makes their embedding spaces isomorphic. This assumption weakens significantly for language pairs that are etymologically distant and/or disparate in terms of their available resources. The extent to which the assumption holds has been used to measure the degree of isomorphism between pairs of monolingual embedding spaces, and has also been used to measure their typological distance. In this dissertation, we propose to first cluster a set of monolingual embedding spaces based on their pairwise degrees of isomorphism. We present a qualitative analysis of the comparative impact of the typological relations among the languages and of the size of the embedding spaces. The goal is to determine the combination of clustering algorithm and measure of isomorphism that is able to cluster related languages together.
Low-resource languages in a cluster can then leverage related, richer-resource languages to obtain a better representation in the cross-lingual embedding space.
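The orthogonal mapping mentioned in the abstract is commonly learned by solving the orthogonal Procrustes problem over a seed dictionary of translation pairs. The following is a minimal sketch under that assumption, not the dissertation's own implementation: the function names `learn_orthogonal_map` and `translate`, the toy random embeddings, and the placeholder target words are all hypothetical and serve only to illustrate the idea.

```python
import numpy as np


def learn_orthogonal_map(X, Y):
    """Return the orthogonal W minimizing ||X @ W - Y||_F.

    X and Y hold source and target embeddings of seed translation
    pairs, one pair per row. The closed-form solution is U @ Vt,
    where U, S, Vt is the SVD of X.T @ Y.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt


def translate(query_vec, W, target_matrix, target_words):
    """Map a source vector into the target space and return the
    target word with the highest cosine similarity."""
    mapped = query_vec @ W
    mapped = mapped / np.linalg.norm(mapped)
    normed = target_matrix / np.linalg.norm(target_matrix, axis=1, keepdims=True)
    return target_words[int(np.argmax(normed @ mapped))]


# Toy data (assumption): the target space is an exact rotation of the
# source space, i.e. the two spaces are perfectly isomorphic.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # hidden orthogonal matrix
X = rng.normal(size=(6, 4))                   # source embeddings
Y = X @ Q                                     # target embeddings
target_words = ["t0", "t1", "t2", "t3", "t4", "t5"]  # placeholder labels

W = learn_orthogonal_map(X, Y)
guess = translate(X[2], W, Y, target_words)
```

In this idealized setting the learned `W` recovers the hidden rotation and nearest-neighbor retrieval returns the aligned word. For real language pairs, and especially for distant or low-resource ones, the residual of this fit is nonzero, which is one informal way to see why the isomorphism assumption weakens.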
Anca Ralescu, Ph.D. (Committee Member)
Kenneth Berman, Ph.D. (Committee Member)
Dan Ralescu, Ph.D. (Committee Member)
James Lee (Committee Member)
David Musgrave, Ph.D. (Committee Member)
Chia Han, Ph.D. (Committee Member)
127 p.

Recommended Citations


  • Bhowmik, K. (2022). Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages [Doctoral dissertation, University of Cincinnati]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1668619714623125

    APA Style (7th edition)

  • Bhowmik, Kowshik. Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages. 2022. University of Cincinnati, Doctoral dissertation. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=ucin1668619714623125.

    MLA Style (8th edition)

  • Bhowmik, Kowshik. "Leveraging Degree of Isomorphism to Improve Cross-Lingual Embedding Space for Low-Resource Languages." Doctoral dissertation, University of Cincinnati, 2022. http://rave.ohiolink.edu/etdc/view?acc_num=ucin1668619714623125

    Chicago Manual of Style (17th edition)