HYBRID DISTANCE-STATISTICAL-BASED PHRASE ALIGNMENT FOR ANALYZING PARALLEL TEXTS IN STANDARD MALAY AND MALAY DIALECTS

Authors

  • Jasmina Khaw Yen Min, Universiti Tunku Abdul Rahman
  • Tien Ping Tan
  • Bali Ranaivo-Malancon

DOI:

https://doi.org/10.22452/mjcs.vol37no1.5

Keywords:

Malay dialects; Parallel text; Word alignment

Abstract

Parallel text corpora are essential resources in linguistics and natural language processing, especially in translation and multilingual information retrieval. The publicly available parallel text corpora are limited to certain genres, types and domains. Furthermore, parallel dialect texts are scarce, even though they are important for the analysis and study of a dialect. Collecting parallel dialect text is challenging because dialects typically appear in the form of speech and very few dialectal texts exist. Moreover, most dialects have no standard orthography. The contributions of this paper are threefold. First, the paper describes a methodology for acquiring a parallel text corpus of Standard Malay and Malay dialects, particularly Kelantan Malay and Sarawak Malay. Second, we propose a hybrid distance-based and statistical-based alignment algorithm to align words and phrases in the parallel text. The results show that the precision and recall values of the proposed alignment algorithm exceed 95% and are higher than those of the state-of-the-art GIZA++. Third, the alignments obtained were compared to identify the lexical similarities and differences between Standard Malay and the two studied Malay dialects, contributing valuable insights into the linguistic variations within the Malay language family.
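The abstract reports precision and recall for the alignment algorithm but does not spell out how they are computed. As a minimal sketch, assuming each alignment is a set of (source index, target index) links scored against a manually annotated gold alignment, the two metrics could be calculated as below. The function name `alignment_precision_recall` and the toy data are hypothetical and are not taken from the paper.

```python
def alignment_precision_recall(predicted, gold):
    """Score predicted alignment links against gold links.

    Both arguments are iterables of (source_index, target_index) pairs.
    This link-level definition is an assumption, not the paper's stated method.
    """
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # links present in both sets
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Toy example: four gold links, three predicted links, two of them correct.
gold = {(0, 0), (1, 1), (2, 2), (3, 3)}
predicted = {(0, 0), (1, 1), (2, 3)}
precision, recall = alignment_precision_recall(predicted, gold)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Treating links as a set keeps the score independent of the order in which the aligner emits them; phrase-level alignments would need a different unit of comparison.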

Published

2024-01-31

How to Cite

Yen Min, J. K., Tan, T. P., & Ranaivo-Malancon, B. (2024). HYBRID DISTANCE-STATISTICAL-BASED PHRASE ALIGNMENT FOR ANALYZING PARALLEL TEXTS IN STANDARD MALAY AND MALAY DIALECTS. Malaysian Journal of Computer Science, 37(1), 1–25. https://doi.org/10.22452/mjcs.vol37no1.5