Language detection module?

Dinu C. Gherman gherman at darwin.in-berlin.de
Thu Oct 21 09:53:43 EDT 1999


Hello,

Is there already something like a function to which I can pass
an arbitrary string and which will tell me whether it is written
in English, French, German, etc.?

I imagine this could be implemented rather simply with some
dicts containing common prefixes and suffixes (as well as the
most frequently used words like 'you', 'me', etc.) of the
respective natural languages. One could then calculate some
likelihood for the text to be in each of these language classes
and return the most likely one, or a list of them. I'm not sure,
though, how accents would be represented in a "portable" way
(across multiple platforms), maybe in HTML...?
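
As a rough sketch of that last step only, assuming per-language
hit counts are already available: the counts can be normalised
into shares and sorted, which gives both "the most likely one"
and "a list of them". The name rank_languages and the sample
counts are hypothetical, and the snippet is written for a current
Python 3 interpreter rather than the Python of this post.

```python
def rank_languages(counts):
    """Turn raw hit counts into (language, share) pairs, best first."""
    total = sum(counts.values())
    if total == 0:
        return []  # no evidence for any language
    shares = [(lang, hits / total) for lang, hits in counts.items()]
    # Highest share first; ties are broken alphabetically by language code.
    shares.sort(key=lambda pair: (-pair[1], pair[0]))
    return shares


counts = {'en': 3, 'fr': 1, 'de': 0}  # hypothetical hit counts
ranking = rank_languages(counts)
print(ranking[0][0])  # most likely language
print(ranking)        # full ranked list
```

An empty list is returned when nothing matched at all, so the
caller can tell "no idea" apart from a genuine tie.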

While writing this I started a small experiment to code up what
I have in mind. Below you'll find what came out of it. Is there
anything more sophisticated out there, perhaps with a better
scoring/weighting method, support for combinations of words, or
even handling of accents?
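
On the accents question, one possible approach (an assumption,
not something the script below does) is to decode the text to
Unicode, turn any HTML entities into real characters, and then
strip the combining marks so that plain-ASCII dictionaries still
match. The sketch uses the unicodedata and html modules of a
current Python 3; fold_accents is a hypothetical helper name.

```python
import html
import unicodedata


def fold_accents(text):
    """Return text with combining accents removed, e.g. 'é' -> 'e'."""
    # NFKD splits an accented letter into base letter + combining mark.
    decomposed = unicodedata.normalize('NFKD', text)
    # Keep everything that is not a combining mark.
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold_accents('Je suis très fatigué'))
# HTML entities are first converted to real characters, then folded.
print(fold_accents(html.unescape('fatigu&eacute;')))
```

Folding throws information away, though: an accented letter is
itself a hint about the language, so it may be worth scoring the
unfolded text as well.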

Regards,

Dinu

-- 
Dinu C. Gherman
................................................................
Food for Echelon: Delta Force, SEAL, virtual, WASS, WID, Dolch,
secure shell, screws, Black-Ops, O/S, Area51, SABC, basement, 
ISWG, $@, data-haven, NSDD, black-bag, rack, TEMPEST, Goodwin, 
rebels, ID, MD5, IDEA, garbage, market, beef, Stego, ISAF, NARF, 
Manfurov, Kvashnin, Marx, Abdurahmon, snullen, Pseudonyms, MITM, 
Gray Data, VLSI, Leitrim... -- Visit http://www.hacktivism.org


# langdetect.py -- Detect the natural language of a written text.

import string

en, fr, de = 'en', 'fr', 'de'

wordDict = {
    'i':en, 'you':en, 'me':en, 'the':en, 'a':en, 
    'moi':fr, 'je':fr, 'toi':fr, 'vous':fr, 'sur':fr, 'en':fr,
    'sie':de, 'ich':de, 'um':de, 'an':de, 'ab':de}

prefixDict = {
    'off':en, 'to':en, 'under':en, 'in':en, 'thou':en,
    'mont':fr, 'contr':fr, 'mal':fr,
    'ver':de, 'zu':de, 'los':de, 'gut':de}

suffixDict = {
    'son':en, 'day':en, 'ing':en, 'ly':en, 'ght':en,
    'ique':fr, 'tude':fr, 'ont':fr, 'nal':fr,
    'tung':de, 'heim':de, 'zeug':de}

punct = """.,!?"()[]{}!§$%&/*+#"""
trans = string.maketrans(punct, ' '*len(punct))


def detectLanguage(text):
    # Normalise: lower-case, turn punctuation into spaces, then split
    # on runs of whitespace so that no empty "words" are produced.
    inp0 = string.lower(text)
    inp1 = string.translate(inp0, trans)
    inp3 = string.split(inp1)

    res = {en:0, fr:0, de:0}
    explain = {en:[], fr:[], de:[]}

    for word in inp3:
        try:
            v = wordDict[word]
            res[v] = res[v] + 1
            explain[v].append(word)
        except KeyError:
            pass

        # Credit the language of every known prefix the word starts with.
        for p in prefixDict.keys():
            if word[:len(p)] == p:
                v = prefixDict[p]
                res[v] = res[v] + 1
                explain[v].append(word)

        # Credit the language of every known suffix the word ends with.
        for s in suffixDict.keys():
            if word[-len(s):] == s:
                v = suffixDict[s]
                res[v] = res[v] + 1
                explain[v].append(word)

    return res, explain


for phrase in ("I am in a good mood today.", 
        "Je suis en plaine forme.",
        "Ich bin heute gut drauf."):
    result, explain = detectLanguage(phrase)
    print "Input:", phrase
    print "Hypothesis:", result       
    print "Reasons:", explain
    print


# Should print something like this:
#
# Input: I am in a good mood today.
# Hypothesis: {'en': 5, 'fr': 0, 'de': 0}
# Reasons: {'en': ['i', 'in', 'a', 'today', 'today'], 
#           'fr': [], 
#           'de': []}
#
# Input: Je suis en plaine forme.
# Hypothesis: {'en': 0, 'fr': 2, 'de': 0}
# Reasons: {'en': [], 'fr': ['je', 'en'], 'de': []}
#
# Input: Ich bin heute gut drauf.
# Hypothesis: {'en': 0, 'fr': 0, 'de': 2}
# Reasons: {'en': [], 'fr': [], 'de': ['ich', 'gut']}




