82531

Author(s): 

Number of authors: 

2

Publication details

Publication type: 

Conference paper

Title: 

The Impact of Multilinguality and Tokenization on Statistical Machine Translation

Electronic publication: 

Yes

ISBN/ISSN: 

2305-7254

DOI: 

10.23919/fruct61870.2024.10516416

Conference name: 

  • 2024 35th Conference of Open Innovations Association (FRUCT)

Source title: 

  • Proceedings of the 35th Conference of Open Innovations Association (FRUCT 2024)

City: 

  • Tampere

Publisher: 

  • IEEE

Year of publication: 

2024

Pages: 

149-157
Abstract
Multilingual neural machine translation systems have achieved state-of-the-art translation quality, especially for low-resource languages, yet statistical machine translation systems have not been trained and examined in a similar multilingual setup. This work defines a multilingual statistical machine translation system as a many-to-one system capable of translating from any of a predefined set of source languages into a single target language. We study how the multilingual setting affects translation quality compared to a regular one-to-one machine translation system. We also examine how this setting affects related languages with different amounts of training data. The research covers multiple languages from different language families. The impact of different tokenizers and preprocessing methods is studied as well. Specifically, we compare the default Moses tokenizer with the SentencePiece tokenizer, as well as with dedicated Chinese and Japanese word splitters. We also investigate the impact of lowercasing and conduct our experiments on data of different sizes. We find that multilinguality gives a small gain across all of the metrics. Languages with a sufficient amount of good-quality training data do not affect the translation quality of related languages with lower-quality data. The SentencePiece tokenizer shows lower BLEU scores on average, but outperforms the other tokenizers on the chrF++ and METEOR metrics. Lowercasing increases the scores on all metrics in all scenarios.

Bibliographic reference: 

Asvarov A., Grabovoy A.V. The Impact of Multilinguality and Tokenization on Statistical Machine Translation / Proceedings of the 35th Conference of Open Innovations Association (FRUCT 2024). Tampere: IEEE, 2024. pp. 149-157.