Language detection for English, German, French, Italian, Spanish and Romansh.

The download includes the software and support via our ticket system.

Description

Determining the language in which a text was written is one of the most important tasks in the automated processing of documents. Classification, sentiment analysis, fake news detection or spam analysis cannot be carried out without knowing the language in which the text, tweet or review was written. This model determines that language.

Model

The machine learning model was implemented with Keras and tested with TensorFlow as the backend. For English, German, French, Italian and Romansh, 75’000 text extracts from Wikipedia and RTR.ch were used, because the Romansh Wikipedia is much smaller than those of the other languages. For English, German, French, Italian and Spanish, 200’000 chunks of text from Wikipedia were used.

Applications

Language detection is a prerequisite for the automated processing of documents and is especially relevant in Switzerland, which has four official languages. Here, forum entries and product reviews are often written in a mix of several languages.

Accuracy

The accuracy of the predictions for Romansh is 94.65%. The accuracy for English, German, French, Italian and Spanish is over 99%. The model consists of an embedding layer with 10 dimensions as input, a hidden layer with 4 LSTM units and an output layer with 5 units and softmax activation. The maximum length of an input text is 200 words. Longer texts can be analysed, but the first 200 words are enough to detect the language.
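To give a feel for how small this architecture is, the following sketch counts the trainable parameters of the three layers and truncates a text to its first 200 words. It uses the standard parameter formulas for embedding, LSTM and dense layers. The vocabulary size is not stated here, so `vocab_size` is an assumed placeholder, and the whitespace-based word split is an assumption as well, not necessarily the tokenization used by the model.

```python
MAX_WORDS = 200      # maximum input length stated above
EMBEDDING_DIM = 10   # embedding dimensions stated above
LSTM_UNITS = 4       # LSTM units stated above
OUTPUT_UNITS = 5     # output units (one per language) stated above


def truncate_words(text, max_words=MAX_WORDS):
    """Return at most the first max_words whitespace-separated words.

    Assumption: words are split on whitespace; the real tokenizer may differ.
    """
    return text.split()[:max_words]


def embedding_params(vocab_size, dim=EMBEDDING_DIM):
    """One vector of length dim per vocabulary entry."""
    return vocab_size * dim


def lstm_params(input_dim=EMBEDDING_DIM, units=LSTM_UNITS):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim=LSTM_UNITS, units=OUTPUT_UNITS):
    """Weight matrix plus one bias per output unit."""
    return input_dim * units + units


if __name__ == "__main__":
    vocab_size = 50_000  # hypothetical value, for illustration only
    print("embedding:", embedding_params(vocab_size))
    print("lstm:", lstm_params())    # 240
    print("dense:", dense_params())  # 25
```

Under these formulas the LSTM and output layers together hold only 265 parameters, so almost all of the model's size would come from the embedding table.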

Interface

As an interface to the model, we have implemented a simple REST service. The model can therefore be integrated easily as a component into any application architecture. Several texts can be posted to the REST service at once for analysis.
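Posting several texts in one request could look roughly like the sketch below, which uses only the Python standard library. The URL, the path and the JSON field name `texts` are hypothetical placeholders, as is the assumption that the service answers with JSON. The actual endpoint and payload format are described in the documentation on GitHub.

```python
import json
import urllib.request

# Hypothetical endpoint; replace with the URL of your running service.
SERVICE_URL = "http://localhost:5000/detect"


def build_request(texts, url=SERVICE_URL):
    """Build a POST request that carries several texts in one JSON body.

    Assumed payload shape: {"texts": ["...", "..."]}.
    """
    body = json.dumps({"texts": list(texts)}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def detect_languages(texts, url=SERVICE_URL, timeout=10):
    """Send the texts to the service and return the decoded JSON response."""
    request = build_request(texts, url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running service at SERVICE_URL.
    print(detect_languages(["Guten Tag", "Bonjour", "Bun di"]))
```

Keeping the request construction separate from the network call makes the payload easy to inspect and adapt once the real format is known.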

Installation

For installation and usage of the software refer to the documentation on GitHub.

License

ipublia/sentiment-analysis is licensed under the Apache License 2.0.

A permissive license whose main conditions require preservation of copyright and license notices. Contributors provide an express grant of patent rights. Licensed works, modifications, and larger works may be distributed under different terms and without source code.

The software is available free of charge and without premium support on GitHub.