MOTEC: The Malay Offensive Text Classification using Extra Tree and Language Standardization

Fairuz Amalina; Faiz Zaki; Hamza H. M. Altarturi; Hazim Hanif; Nor Badrul Anuar

doi:10.22452/mjcs.vol38no1.4

Authors

Fairuz Amalina, Department of Computer System and Technology, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
Faiz Zaki, Department of Computer System and Technology, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
Hamza H. M. Altarturi, International Center for Living Aquatic Resource Management, WorldFish, 11960 Pulau Pinang, Malaysia
Hazim Hanif, Department of Software Engineering, Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
Nor Badrul Anuar (Corresponding Author), Centre of Research for Cyber Security and Network (CSNET), Faculty of Computer Science and Information Technology, Universiti Malaya, 50603 Kuala Lumpur, Malaysia; Visiting Professor, Institute of Informatics & Computing in Energy, Universiti Tenaga Nasional, Kajang, Malaysia

DOI:

https://doi.org/10.22452/mjcs.vol38no1.4

Keywords:

Natural Language Processing, Machine Learning, Extra Tree, Offensive Text, Text Classification, Decision Tree, Ensemble Method, Malay Language

Abstract

Cyberbullying has increased globally, and offensive text is a significant contributor. Detecting offensive text in Malay is challenging because of non-standard Malay text, distinctive social media writing styles, a lack of standardization, and limited resources. This study proposes the Malay Offensive Text Classification (MOTEC) framework to address these challenges. The MOTEC framework incorporates a Malay standardization preprocessing task that uses three specialized dictionaries: (a) abbreviations, (b) noisy text, and (c) Malaysian dialects. This approach enhances data quality by converting non-standard text into standardized Malay sentences before classification. For feature extraction, the framework employs Term Frequency-Inverse Document Frequency (TF-IDF), a statistical method that evaluates the importance of a word in a document relative to a collection of documents. An Extra Tree classifier then performs the classification. Evaluated on a private dataset collected from Twitter, the MOTEC framework achieved a classification accuracy of 94%, significantly outperforming other studies, which reported an accuracy of 84%. The MOTEC framework substantially improves the classification of offensive Malay text by enhancing accuracy, reducing execution time, and improving data quality through effective language standardization.
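
The two preprocessing stages described above, dictionary-based standardization followed by TF-IDF weighting, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three dictionaries hold a handful of illustrative entries chosen here, the function names `standardize` and `tfidf` are hypothetical, and the weighting uses the textbook form tf * log(N / df), which may differ from the variant used in the study.

```python
import math
import re
from collections import Counter

# Illustrative toy entries only (assumptions); the study's actual
# dictionaries are not reproduced here.
ABBREVIATIONS = {"yg": "yang", "dgn": "dengan", "x": "tidak"}
NOISY_TEXT = {"sgt": "sangat", "bodo": "bodoh"}
DIALECTS = {"hang": "awak", "demo": "awak"}


def standardize(text, dictionaries=(ABBREVIATIONS, NOISY_TEXT, DIALECTS)):
    """Lowercase, tokenize, and replace each token found in a dictionary."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    standardized = []
    for token in tokens:
        for dictionary in dictionaries:
            if token in dictionary:
                token = dictionary[token]
                break  # first matching dictionary wins
        standardized.append(token)
    return " ".join(standardized)


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


raw_posts = ["Hang ni bodo sgt", "Awak baik dgn semua"]
clean_posts = [standardize(post) for post in raw_posts]
weights = tfidf(clean_posts)
```

In this sketch, a term that appears in every document (here "awak" after standardization) receives a weight of zero, while terms specific to one post receive positive weights. The resulting vectors would then be passed to a tree-ensemble classifier; one readily available option, assumed here rather than taken from the paper, is scikit-learn's `ExtraTreesClassifier`.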
