Improving AI Technology to Understand Colloquialisms

BY GINA MANTICA

Many people interact with natural language processing systems every day without realizing it, whether by asking Siri for help or talking with a chatbot at an online store. Natural language processing (NLP) is a field of artificial intelligence (AI) that aims to build computer systems that can understand, process, and learn about language. But the way people communicate online is often very different from formal usage of the language, and NLP systems often do not recognize these colloquialisms. Researchers are developing NLP models that can recognize and process colloquialisms by expanding the datasets used to train these models.

Artificial Intelligence Research (AIR) Initiative member Derry Wijaya, an Assistant Professor in Computer Science, worked with researchers at Kata.ai Research and Universitas Indonesia to develop a dataset of colloquial Indonesian words that can be used for testing NLP models. The findings were published recently in the Findings of the Association for Computational Linguistics (ACL-IJCNLP 2021).

Derry Wijaya, Hariri Institute Research Fellow and Assistant Professor in Computer Science, helped develop a dataset of colloquial words that can be used for testing natural language processing models.

To create a dataset of colloquial Indonesian language, Wijaya and colleagues turned to social media. Shortening “thanks” to “thx”, typing “2” instead of “to”, and removing vowels, as in “lzy” for “lazy”, are all examples of colloquial transformations. The resulting words are common on social media, yet natural language processing systems cannot understand them. The researchers collected tweets and made a list of the most frequently used Indonesian words. They identified colloquial language in the list by finding words that are not part of the Indonesian dictionary.
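The collection step described above can be sketched in a few lines of Python. This is only an illustration of the general idea (count frequent tokens, then keep those missing from a formal word list), not the researchers' actual pipeline. The function name `colloquial_candidates`, the tokenization rules, and the toy English data are all assumptions made for this example.

```python
import re
from collections import Counter


def colloquial_candidates(tweets, dictionary, top_n=1000):
    """Return (word, count) pairs for frequent tokens absent from `dictionary`.

    Hypothetical helper: `dictionary` stands in for a formal word list,
    and the cleanup rules below are a simplification.
    """
    counts = Counter()
    for tweet in tweets:
        # Lowercase, then blank out URLs, @mentions, and #hashtags.
        text = re.sub(r"https?://\S+|[@#]\w+", " ", tweet.lower())
        # Keep alphanumeric tokens so forms like "2" survive.
        counts.update(re.findall(r"[a-z0-9]+", text))
    # Walk the most frequent tokens and keep the out-of-dictionary ones.
    return [(w, c) for w, c in counts.most_common(top_n) if w not in dictionary]


# Toy data for illustration only.
tweets = ["thx for the help", "Thx a lot @friend", "thanks for the help http://x.co"]
formal_words = {"thanks", "for", "the", "help", "a", "lot"}
print(colloquial_candidates(tweets, formal_words))
```

In this toy run only “thx” is flagged, because every other token appears in the formal word list. A real pipeline would also have to deal with names, misspellings, and foreign words, which this sketch would flag as well.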

Indonesian linguists then determined how these colloquial words on Twitter are created from their formal counterparts in the Indonesian dictionary. The linguists’ annotations detailing the different ways of creating colloquialisms were used to train the NLP models. A model trained on information about how colloquial language is derived from formal language might be able to perform these transformations itself.
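To make the idea of an annotated transformation concrete, here is a minimal Python sketch that labels a formal/colloquial word pair with a rule that explains it. The label names and the two rules are hypothetical simplifications chosen for this example. They are not the linguists' annotation scheme, and the English words stand in for Indonesian ones.

```python
VOWELS = set("aeiou")


def drop_vowels(word):
    """Keep the first letter and remove every later vowel."""
    return word[:1] + "".join(ch for ch in word[1:] if ch not in VOWELS)


def label_transformation(formal, colloquial):
    """Return a hypothetical label for how `colloquial` derives from `formal`."""
    if colloquial == drop_vowels(formal):
        return "vowel_removal"
    # A strict prefix of the formal word counts as shortening.
    if formal.startswith(colloquial) and len(colloquial) < len(formal):
        return "shortening"
    return "other"


# Toy pairs for illustration only.
for formal, colloquial in [("lazy", "lzy"), ("professor", "prof"), ("to", "2")]:
    print(formal, colloquial, label_transformation(formal, colloquial))
```

Pairs like “to” and “2” fall into the catch-all label here, which hints at why hand-written rules only go so far and why annotated examples are useful for training a model.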

Wijaya and colleagues tested their dataset using one of the most popular NLP models, known as the transformer model. Sure enough, the dataset enables the transformer model to learn and recognize some colloquial Indonesian word transformations, including the removal of vowels and shortening of words. But other transformations are harder for the model to recognize, like suffixes or similar additions to words. This might be because the context of a sentence can affect the meaning of the colloquial word. “For the model to do better, it needs to understand the context of the word within a sentence,” said Wijaya.

The dataset’s success in training the transformer model is encouraging, and other languages with lots of colloquialisms, like Korean and Arabic, could benefit from AI technology that can process and understand colloquial language. “The issue of chatbots or other AI technology understanding colloquial language is universal, and we will see more and more of it as social media becomes more prevalent all over the world,” said Wijaya. “These resources can train AI models to process colloquial words that they aren’t used to.”

