Realistic machine translation of Gujarati, Somali, & Kazakh
BY: GINA MANTICA
In today’s virtual world, equal access to online resources is more important than ever. But online texts aren’t available in all of the world’s roughly 7,000 languages, and most machine translation programs are limited to Arabic, Chinese, Japanese, and several European languages. Researchers from Boston University and Institut Teknologi Bandung developed an innovative method to automatically translate low-resource languages, that is, languages that lack sufficient human-translated data from books, articles, dictionaries, databases, or other texts.
Derry Wijaya, a Hariri Institute Research Fellow and Assistant Professor in Computer Science at Boston University, recently published the findings in a preprint on arXiv. The research team kept its methodology simple so that the translation algorithm would be easy to implement, and it factored in the cost of the required computing power relative to average salaries in countries where three low-resource languages are spoken: Gujarati, Somali, and Kazakh.

Wijaya’s team used Wikipedia pages to build the algorithm’s training data set, the information used to teach the machine how to translate from one language to another. Wikipedia pages are written in over 300 languages, including some low-resource languages, and are freely accessible online. The researchers compared sentences on pages written in Gujarati, Somali, and Kazakh to sentences on similar pages written in English. The team selected sentences from the foreign-language and English pages that share some overlapping words, and used dictionaries to translate other words to English. The resulting “Frankenstein” sentences, with some words in English and other words in the low-resource language, were used alongside the sentences extracted from Wikipedia to augment the data for training the machine. “We wanted to make our method as generalizable as possible so it can be applied to any kind of low-resource language,” said Wijaya.
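As a rough illustration of the idea, and not the team’s actual implementation, the Python sketch below swaps dictionary words into a sentence to produce a mixed “Frankenstein” sentence, then keeps the English sentence that shares enough words with it. The function names, the 0.5 overlap threshold, and the toy vocabulary are all assumptions made for this example; the source-language tokens are invented placeholders, not real Gujarati, Somali, or Kazakh words.

```python
def translate_with_dictionary(sentence, dictionary):
    """Replace every word found in the dictionary with its English entry.

    Words missing from the dictionary are left as they are, so the result
    mixes English words with untranslated source-language words.
    """
    return [dictionary.get(word, word) for word in sentence.lower().split()]


def overlap_score(mixed_tokens, english_sentence):
    """Fraction of tokens in the mixed sentence that also appear in the English one."""
    if not mixed_tokens:
        return 0.0
    english_tokens = set(english_sentence.lower().split())
    shared = sum(1 for token in mixed_tokens if token in english_tokens)
    return shared / len(mixed_tokens)


def mine_pairs(source_sentences, english_sentences, dictionary, threshold=0.5):
    """Pair each source sentence with its best-overlapping English sentence.

    Returns (source sentence, mixed sentence, English sentence) triples.
    The threshold is a hypothetical cut-off chosen for illustration.
    """
    pairs = []
    for source in source_sentences:
        mixed = translate_with_dictionary(source, dictionary)
        best = max(
            english_sentences,
            key=lambda english: overlap_score(mixed, english),
            default=None,
        )
        if best is not None and overlap_score(mixed, best) >= threshold:
            pairs.append((source, " ".join(mixed), best))
    return pairs


# Toy data: the source tokens are invented placeholders, not real words.
toy_dictionary = {"zub": "cat", "mek": "drinks", "lor": "milk"}
toy_source = ["zub mek lor tavi"]
toy_english = ["the cat drinks milk daily", "a dog sleeps outside"]

for source, mixed, english in mine_pairs(toy_source, toy_english, toy_dictionary):
    print(source, "|", mixed, "|", english)
```

In this toy run, three of the four words are translated by the dictionary and the fourth stays in the source language, which is what makes the mixed sentence usable as extra training data alongside the matched English sentence.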
The researchers’ simple use of dictionaries to translate Wikipedia pages proved successful: the method outperforms existing automatic translation programs for all three languages. The team’s algorithm creates clear sentences when translating from English to Somali or Gujarati, and vice versa, despite large differences in script, complexity, and word order between these languages and English. The method performs somewhat worse when translating from English to Kazakh, though it still improves on previous methods. This might be because many Kazakh words look quite similar, due to the language’s morphological complexity. The algorithm might therefore need more examples of Kazakh translations in the training data set to perform as well.

When testing the algorithm’s translation abilities, the team also took into account how many graphics processing units, or GPUs, people in countries where Gujarati, Somali, and Kazakh are spoken can afford. While academics in the United States may have access to four GPUs at a time, the average Gujarati speaker may only have access to one. However, just one GPU is enough for successful machine translation. “We can do a lot even if we don’t have a lot of resources,” said Wijaya. “I want people in other countries to use our system and be able to benefit from it.” The researchers’ realistic approach to the algorithm’s development and testing helps ensure that it is affordable and accessible to the people who might use it.
Wijaya and colleagues were excited to find that adding GPUs can significantly improve the algorithm’s translations. But the team is wary of the environmental costs associated with using more GPUs. That is why it is important to factor in the computing resources used when reporting results, said Wijaya.