The Shape of Word Embeddings: Quantifying Non-Isometry With Topological Data Analysis

  • Institute of Science and Technology Austria

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

Word embeddings represent language vocabularies as clouds of d-dimensional points. We investigate how information is conveyed by the general shape of these clouds, instead of representing the semantic meaning of each token. Specifically, we use the notion of persistent homology from topological data analysis (TDA) to measure the distances between language pairs from the shape of their unlabeled embeddings. These distances quantify the degree of non-isometry of the embeddings. To distinguish whether these differences are random training errors or capture real information about the languages, we use the computed distance matrices to construct language phylogenetic trees over 81 Indo-European languages. Careful evaluation shows that our reconstructed trees exhibit strong and statistically significant similarities to the reference.
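
The idea of comparing unlabeled point clouds by shape can be illustrated with a minimal sketch. It is not the authors' pipeline: it covers only 0-dimensional persistent homology of a Vietoris-Rips filtration (whose death times coincide with minimum-spanning-tree edge lengths), and it compares two clouds with a toy dissimilarity rather than a standard diagram distance such as bottleneck or Wasserstein. The names `h0_death_times`, `shape_distance`, and `rotate` are hypothetical helpers introduced here for illustration.

```python
import math
import random


def h0_death_times(points):
    """Sorted death times of 0-dimensional persistence classes.

    In a Vietoris-Rips filtration every connected component is born at
    scale 0 and dies when it merges with another one. Those merge scales
    are exactly the edge lengths of a minimum spanning tree, computed
    here with Prim's algorithm in O(n^2).
    """
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    deaths = []
    for step in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if step > 0:  # the first point starts the tree and adds no edge
            deaths.append(best[u])
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return sorted(deaths)


def shape_distance(cloud_a, cloud_b):
    """Toy dissimilarity (an assumption of this sketch): the largest gap
    between the sorted H0 death times of two equally sized clouds."""
    da, db = h0_death_times(cloud_a), h0_death_times(cloud_b)
    if len(da) != len(db):
        raise ValueError("clouds must have the same number of points")
    return max((abs(x - y) for x, y in zip(da, db)), default=0.0)


def rotate(points, angle):
    """Rotate 2-D points about the origin (an isometry)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


rng = random.Random(0)
cloud = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
rotated = rotate(cloud, 0.7)                  # isometric copy
stretched = [(3.0 * x, y) for x, y in cloud]  # non-isometric copy

print(shape_distance(cloud, rotated))    # zero up to floating-point error
print(shape_distance(cloud, stretched))  # strictly positive
```

Because the death times depend only on pairwise distances, the rotated copy is indistinguishable from the original, while the stretched copy is not. That is the sense in which such a distance responds to non-isometry without using any labels on the points.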

Original language: English
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 12080-12099
Number of pages: 20
ISBN (Electronic): 9798891761681
DOIs
State: Published - 2024
Event: 2024 Findings of the Association for Computational Linguistics, EMNLP 2024 - Hybrid, Miami, United States
Duration: Nov 12 2024 → Nov 16 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

Conference

Conference: 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 11/12/24 → 11/16/24
