TIL: How to improve language detection using langdetect

Welcome to the second TIL ✌🏻, where I share small lessons I’ve learnt. These generally aren’t substantial enough to be their own fully-featured blog post, but they are worth sharing in case someone else is looking at the same problem.

Today I’ll be telling you how you can sometimes improve language detection in a Python package called langdetect. If you read the last TIL, you’ll know that I have been refactoring a Python application that generates a deck from a given list of words.

One of the problems I noticed in testing was that many words that are ambiguous or have origins in other languages were often translated incorrectly. This was partly a result of langdetect’s classification and partly down to my very simple translation generator:

def __translate_word(self, word):
    translation_dict = {"en": "", "de": "", "artikle": ""}

    # detect once, using langdetect's default detection over all of its languages
    language = langdetect.detect(word)
    if language == 'en':
        translate_api_url = self.CONFIG.TRANSLATE_API_URL.format('en', 'de')
        translation_dict["en"] = word
        translation_dict["de"] = self.__request_translation(translate_api_url)
    elif language == 'de':
        translate_api_url = self.CONFIG.TRANSLATE_API_URL.format('de', 'en')
        translation_dict["de"] = word
        translation_dict["en"] = self.__request_translation(translate_api_url)
    else:
        translate_api_url = self.CONFIG.TRANSLATE_API_URL.format('de', 'en')
        translation_dict["de"] = word
        translation_dict["en"] = self.__request_translation(translate_api_url)

    translation_dict["artikle"] = self.__identify_article(translation_dict["de"])
    return translation_dict

One of the solutions I considered was to use Google’s detect API to determine which language was being used, but this would likely have similar issues to langdetect.

What I needed to do was reduce the problem space. I know that we will only ever use two languages, English and German. So instead of taking the returned language and trying to work out whether the original word is a borrowed one used in English or German (or both), we can tell langdetect that every word we provide will be either German or English.

In langdetect we do this by creating our own DetectorFactory and then, on the detector it produces, specifying which languages we want to consider and the prior probability assigned to each language. This can be helpful if you know you’re more likely to enter German words than English ones, or vice versa. A new factory starts out empty:

# The __init__ from langdetect.DetectorFactory
def __init__(self):
    self.word_lang_prob_map = {}
    self.langlist = []

Before we can use our factory, though, we need to load profiles with DetectorFactory.load_profile(). For this code I used the default profiles, but if you need a language that isn’t supported, you can add your own. Ultimately we end up with something like this:

import langdetect

def init_detector_factory():
    # instantiate the DetectorFactory
    factory = langdetect.detector_factory.DetectorFactory()

    # load the default profiles
    factory.load_profile(langdetect.detector_factory.PROFILES_DIRECTORY)

    # create a detector from the factory
    detector = factory.create()

    # set the prior probabilities; only the languages listed here are considered
    detector.set_prior_map({"en": 0.5, "de": 0.5})

    return detector
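
To see why set_prior_map() narrows things down to English and German, here is a rough sketch of the idea in plain Python. It is a simplified illustration rather than langdetect’s actual code, and normalise_priors and the short language list are made up for the example. As far as I can tell from reading the library, languages missing from the map get a prior of zero and the remaining values are scaled so they sum to 1:

```python
def normalise_priors(langlist, prior_map):
    """Return one prior per language in langlist, scaled to sum to 1.

    Simplified illustration of the idea behind set_prior_map();
    this is not langdetect's implementation.
    """
    # Languages that are not in prior_map get a prior of 0.
    priors = [float(prior_map.get(lang, 0.0)) for lang in langlist]
    if any(p < 0 for p in priors):
        raise ValueError("priors must be non-negative")
    total = sum(priors)
    if total <= 0:
        raise ValueError("at least one prior must be greater than zero")
    return [p / total for p in priors]


# Hypothetical language list; the default profiles cover many more.
langs = ["en", "de", "fr", "nl"]
print(normalise_priors(langs, {"en": 0.5, "de": 0.5}))  # [0.5, 0.5, 0.0, 0.0]
print(normalise_priors(langs, {"en": 1, "de": 3}))      # [0.25, 0.75, 0.0, 0.0]
```

A language with a zero prior can never come out on top, which is what lets us ignore everything except English and German.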

Now all we need to do is call init_detector_factory() from our __translate_word() method:

def __translate_word(self, word):
    translation_dict = {"en": "", "de": "", "artikle": ""}

    # init_detector_factory() returns a detector, not a factory
    detector = init_detector_factory()
    detector.append(word)

    # call detect() once and reuse the result
    language = detector.detect()
    if language == 'en':
        translate_api_url = self.CONFIG.TRANSLATE_API_URL.format('en', 'de')
        translation_dict["en"] = word
        translation_dict["de"] = self.__request_translation(translate_api_url)
    elif language == 'de':
        translate_api_url = self.CONFIG.TRANSLATE_API_URL.format('de', 'en')
        translation_dict["de"] = word
        translation_dict["en"] = self.__request_translation(translate_api_url)
    else:
        translate_api_url = self.CONFIG.TRANSLATE_API_URL.format('de', 'en')
        translation_dict["de"] = word
        translation_dict["en"] = self.__request_translation(translate_api_url)

    translation_dict["artikle"] = self.__identify_article(translation_dict["de"])
    return translation_dict

The above example isn’t optimal. We only need to create the factory and load the profiles once; then, each time we need to check a new word, we create a new detector from the factory and set the probabilities, like so:

import langdetect

words = ["der Lenz", "the Uncle", "woman", "Wasser"]

# instantiate the DetectorFactory
factory = langdetect.detector_factory.DetectorFactory()

# load the default profiles
factory.load_profile(langdetect.detector_factory.PROFILES_DIRECTORY)

for word in words:
    # a fresh detector per word, restricted to English and German
    detector = factory.create()
    detector.set_prior_map({"en": 0.5, "de": 0.5})
    detector.append(word)
    print(f"{word} - {detector.detect()}")

But hopefully it gives you an idea of how you can restrict langdetect to a fixed set of languages and specify the probability of each one appearing.
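
If you expect more German than English, the same pattern can be wrapped in a small helper that takes the priors as a parameter. This is only a minimal sketch and not code from the application: detect_language and its default weights are made up for illustration, and it assumes you pass in a DetectorFactory whose profiles have already been loaded.

```python
def detect_language(factory, word, priors=None):
    """Detect the language of word, restricted to the languages in priors.

    factory is assumed to be a langdetect DetectorFactory with its
    profiles already loaded; priors maps language codes to weights.
    """
    if priors is None:
        # Illustrative default: German is three times as likely as English.
        priors = {"de": 0.75, "en": 0.25}

    detector = factory.create()     # a fresh detector for every word
    detector.set_prior_map(priors)  # weights for the languages we care about
    detector.append(word)
    return detector.detect()
```

With the factory from the loop above, detect_language(factory, "Wasser") uses the German-heavy default, while detect_language(factory, "woman", {"en": 0.5, "de": 0.5}) goes back to an even split. The langdetect README also notes that detection is non-deterministic unless you set DetectorFactory.seed before the first detection, which is worth doing if you want repeatable results for short words.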