Determining the Language of a Text

Determining the language in which a text was written is one of the most important tasks in the automated processing of documents. Classification, sentiment analysis, and fake news or spam detection cannot be done without knowing the language in which the text, tweet or review was written.

This is especially relevant in Switzerland, which has four official languages. Forum entries and product reviews here are often written in a mix of several languages. Examples are the Swiss Administration and the popular online shop Digitec. Many organizations also write documents in a variety of languages: the official corporate language is often English, but people frequently find it easier to write a document in their native language.

The Model

We therefore decided to implement language detection for English, German, French, Italian, Spanish and Romansh. As training and test material we used the Wikipedia dumps for the respective languages. The Romansh Wikipedia is small compared to the dumps of the other languages, so we extracted additional training and test data from the website of the Romansh Radio and Television (RTR) station.

The machine learning model was implemented with Keras and tested with TensorFlow as the backend. For English, German, French, Italian and Romansh, 75’000 text extracts from Wikipedia and RTR.ch were used, because the Romansh Wikipedia is much smaller than those of the other languages. For English, German, French, Italian and Spanish, 200’000 chunks of text from Wikipedia were used.

Accuracy

The accuracy of the predictions for Romansh is 94.65%. The accuracy for English, German, French, Italian and Spanish is over 99%. The model consists of an embedding layer with 10 dimensions as input, a hidden layer with 4 LSTM units and an output layer with 5 units and softmax activation. The maximum length of an input text is 200 words. Longer texts can of course be analysed, but the first 200 words are enough to detect the language.
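The architecture described above can be sketched in Keras roughly as follows. This is a minimal illustration and not the published implementation: the layer sizes (10, 4, 5) and the input length of 200 come from the description, while the vocabulary size, the optimizer, the loss function and the layer names are assumptions chosen only to make the example runnable.

```python
import numpy as np
from tensorflow import keras

MAX_WORDS = 200      # maximum input length in words (from the article)
EMBEDDING_DIM = 10   # embedding dimensions (from the article)
LSTM_UNITS = 4       # LSTM units in the hidden layer (from the article)
NUM_LANGUAGES = 5    # output units (from the article)
VOCAB_SIZE = 20_000  # assumption: the article does not state the vocabulary size


def build_model():
    """Build an embedding -> LSTM -> softmax classifier over word ids."""
    model = keras.Sequential([
        keras.Input(shape=(MAX_WORDS,), dtype="int32"),
        keras.layers.Embedding(input_dim=VOCAB_SIZE,
                               output_dim=EMBEDDING_DIM,
                               name="embedding"),
        keras.layers.LSTM(LSTM_UNITS, name="lstm"),
        keras.layers.Dense(NUM_LANGUAGES, activation="softmax", name="output"),
    ])
    # Assumption: optimizer and loss are not given in the article.
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model


model = build_model()

# Two dummy "texts", each already encoded as 200 integer word ids.
dummy_ids = np.random.randint(0, VOCAB_SIZE, size=(2, MAX_WORDS))
probabilities = model.predict(dummy_ids, verbose=0)  # one row of 5 scores per text
```

Each row of `probabilities` holds one score per language and sums to 1. With these sizes the LSTM and output layers are very small, so almost all of the model's parameters sit in the embedding table.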

The REST Interface

As an interface to the model, we have implemented a simple REST service. The model can therefore be easily integrated into any application architecture as a component. Several texts can be posted to the REST service at once for analysis.
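To illustrate how several texts might be analysed in a single request, here is a small sketch using only the Python standard library. It does not reproduce the actual service: the `/detect` path, the `texts` and `results` field names, the port and the `detect_language` stub are all hypothetical stand-ins, and the real interface may look different.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def detect_language(text):
    """Hypothetical placeholder for the trained model; always answers 'unknown'."""
    return "unknown"


def handle_batch(body, detect=detect_language):
    """Map a JSON body {"texts": [...]} to a JSON response with one result per text."""
    payload = json.loads(body)
    results = [{"text": t, "language": detect(t)} for t in payload["texts"]]
    return json.dumps({"results": results}).encode("utf-8")


class DetectHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/detect":  # hypothetical endpoint path
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            response = handle_batch(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON or a missing "texts" field.
            self.send_error(400, "expected a JSON body with a 'texts' list")
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(port=8080):
    """Start the sketch server on localhost; blocks until interrupted."""
    HTTPServer(("localhost", port), DetectHandler).serve_forever()
```

Calling `serve()` starts the sketch locally. Keeping the batch logic in `handle_batch`, separate from the HTTP handler, means the model call can be swapped in or tested without opening a socket.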

The software is also available free of charge on GitHub under the Apache License 2.0, with basic support.

About the author: Thomas studied computational linguistics and philosophy and graduated with a PhD in computer science. He has worked as a consultant for natural language processing and application development for major Swiss banks. Thomas is the founder of ipublia. He lives with his family in Zürich.
