MOTEC: The Malay Offensive Text Classification using Extra Tree and Language Standardization

Authors

  • Fairuz Amalina, Department of Computer System and Technology, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
  • Faiz Zaki, Department of Computer System and Technology, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
  • Hamza H. M. Altarturi, International Center for Living Aquatic Resource Management, WorldFish, 11960 Pulau Pinang, Malaysia
  • Hazim Hanif, Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
  • Nor Badrul Anuar, Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia; Visiting Professor, Institute of Informatics & Computing in Energy, Universiti Tenaga Nasional, Kajang, Malaysia

DOI:

https://doi.org/10.22452/mjcs.vol38no1.4

Keywords:

Natural Language Processing, Machine Learning, Extra Tree, Offensive Text, Text Classification, Decision Tree, Ensemble Method, Malay Language

Abstract

Cyberbullying has increased globally, and offensive text is a significant contributor. Detecting offensive text in Malay is challenging because of non-standard Malay text, unique social media writing styles, a lack of standardization, and limited resources. This study proposes the Malay Offensive Text Classification (MOTEC) framework to address these challenges. The MOTEC framework incorporates a Malay standardization preprocessing task that uses three specialized dictionaries: (a) abbreviations, (b) noisy text, and (c) Malaysian dialects. This approach enhances data quality by converting non-standard text into standardized Malay sentences before classification. For feature extraction, the framework employs Term Frequency-Inverse Document Frequency (TF-IDF), a statistical method that evaluates the importance of words in a document relative to a collection of documents. An Extra Tree classifier then performs the classification. Evaluated on a private dataset collected from Twitter, the MOTEC framework achieved a classification accuracy of 94%, significantly outperforming other studies, which reported an accuracy of 84%. The MOTEC framework substantially improves the classification of offensive Malay text by enhancing accuracy, reducing execution time, and improving data quality through effective language standardization.
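
The two preprocessing stages described in the abstract, dictionary-based standardization followed by TF-IDF weighting, can be illustrated with a minimal sketch. This is not the authors' implementation: the dictionary entries, the example sentences, and the function names `standardize` and `tf_idf` are all assumptions made for illustration, and the TF-IDF formula shown is one common variant among several.

```python
import math
import re
from collections import Counter

# Hypothetical miniature dictionaries (illustrative entries only; the
# framework's actual dictionaries are not reproduced on this page).
ABBREVIATIONS = {"yg": "yang", "sgt": "sangat", "tak": "tidak"}
NOISY_TEXT = {"bodohhh": "bodoh"}
DIALECTS = {"hang": "kamu"}


def standardize(text):
    """Lowercase, tokenize, and map each token through the dictionaries."""
    tokens = re.findall(r"[a-z]+", text.lower())
    standardized = []
    for token in tokens:
        # The first dictionary that contains the token supplies its replacement.
        for lookup in (ABBREVIATIONS, NOISY_TEXT, DIALECTS):
            if token in lookup:
                token = lookup[token]
                break
        standardized.append(token)
    return standardized


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes tf = count / document length and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up example sentences, not taken from the study's dataset.
tweets = ["Hang ni bodohhh sgt", "Cuaca hari ini sgt baik"]
docs = [standardize(tweet) for tweet in tweets]
vectors = tf_idf(docs)
```

In this sketch a term that occurs in every document, such as "sangat", receives a weight of zero, while a term confined to one document keeps a positive weight. The resulting vectors could then be passed to an extremely randomized trees classifier, for example scikit-learn's `ExtraTreesClassifier`, although the page does not state which implementation the study used.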

Published

2025-03-30

How to Cite

Amalina, F., Zaki, F., Altarturi, H. H. M., Hanif, H., & Anuar, N. B. (2025). MOTEC: The Malay Offensive Text Classification using Extra Tree and Language Standardization. Malaysian Journal of Computer Science, 38(1), 81–98. https://doi.org/10.22452/mjcs.vol38no1.4