Language detection module?
Dinu C. Gherman
gherman at darwin.in-berlin.de
Thu Oct 21 09:53:43 EDT 1999
Hello,
is there already something like a function to which I can pass an
arbitrary string and which will tell me whether it is written in
English, French, German, etc.?
I imagine this could be implemented rather simply with some
dicts containing common prefixes and suffixes (as well as the most
frequently used words like 'you', 'me', etc.) of the respective
natural language. One could then calculate a likelihood for
the text to be in each of these language classes and return
the most likely one, or a list of them. I'm not sure, though,
how accents would be represented in a "portable" way (across
multiple platforms), maybe in HTML...?
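On the question of portable accents, here is a minimal sketch in current Python, assuming the input may carry accents either as HTML entities (as suggested above) or as Unicode characters. The helper name normalize_text is hypothetical and not part of the script further down; html.unescape and unicodedata.normalize are standard-library calls.

```python
import html
import unicodedata


def normalize_text(text):
    """Return (composed, stripped) forms of text.

    Hypothetical helper for illustration only.
    composed: HTML entities decoded, accents kept in canonical NFC form.
    stripped: the same text with combining accent marks removed.
    """
    # "&eacute;" becomes "é"; text without entities passes through unchanged.
    decoded = html.unescape(text)
    # NFC gives each accented character a single canonical spelling.
    composed = unicodedata.normalize("NFC", decoded)
    # NFD separates base letters from combining marks, which are then dropped.
    decomposed = unicodedata.normalize("NFD", composed)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return composed, stripped
```

Either form could be used as dictionary keys, as long as the word lists and the input text are normalized the same way. Keeping the accents probably helps detection, since a word like 'été' is a strong hint for French.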
While writing this I started a small experiment to code up what
I have in mind. Below you'll find what came out of it. Is there
anything more sophisticated out there, including a better scoring
or weighting method, maybe also for combinations of words, perhaps
even handling accents?
Regards,
Dinu
--
Dinu C. Gherman
# langdetect.py -- Detect the natural language of a written text.

en, fr, de = 'en', 'fr', 'de'

# Frequently used whole words, mapped to their language.
wordDict = {
    'i':en, 'you':en, 'me':en, 'the':en, 'a':en,
    'moi':fr, 'je':fr, 'toi':fr, 'vous':fr, 'sur':fr, 'en':fr,
    'sie':de, 'ich':de, 'um':de, 'an':de, 'ab':de}

# Common word beginnings.
prefixDict = {
    'off':en, 'to':en, 'under':en, 'in':en, 'thou':en,
    'mont':fr, 'contr':fr, 'mal':fr,
    'ver':de, 'zu':de, 'los':de, 'gut':de}

# Common word endings.
suffixDict = {
    'son':en, 'day':en, 'ing':en, 'ly':en, 'ght':en,
    'ique':fr, 'tude':fr, 'ont':fr, 'nal':fr,
    'tung':de, 'heim':de, 'zeug':de}

# Punctuation is replaced by spaces before the text is split into words.
punct = """.,!?"()[]{}§$%&/*+#"""
trans = str.maketrans(punct, ' '*len(punct))


def detectLanguage(text):
    "Return (scores, reasons): match counts and matching words per language."

    words = text.lower().translate(trans).split()

    res = {en:0, fr:0, de:0}
    explain = {en:[], fr:[], de:[]}

    for word in words:
        # Whole-word match.
        if word in wordDict:
            v = wordDict[word]
            res[v] = res[v] + 1
            explain[v].append(word)

        # Prefix match: score the language of the matching prefix.
        for p, v in prefixDict.items():
            if word.startswith(p):
                res[v] = res[v] + 1
                explain[v].append(word)

        # Suffix match: score the language of the matching suffix.
        for s, v in suffixDict.items():
            if word.endswith(s):
                res[v] = res[v] + 1
                explain[v].append(word)

    return res, explain


for phrase in ("I am in a good mood today.",
               "Je suis en pleine forme.",
               "Ich bin heute gut drauf."):
    result, explain = detectLanguage(phrase)
    print("Input:", phrase)
    print("Hypothesis:", result)
    print("Reasons:", explain)
    print()

# Should print something like this:
#
# Input: I am in a good mood today.
# Hypothesis: {'en': 5, 'fr': 0, 'de': 0}
# Reasons: {'en': ['i', 'in', 'a', 'today', 'today'],
#           'fr': [],
#           'de': []}
#
# Input: Je suis en pleine forme.
# Hypothesis: {'en': 0, 'fr': 2, 'de': 0}
# Reasons: {'en': [], 'fr': ['je', 'en'], 'de': []}
#
# Input: Ich bin heute gut drauf.
# Hypothesis: {'en': 0, 'fr': 0, 'de': 2}
# Reasons: {'en': [], 'fr': [], 'de': ['ich', 'gut']}
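
The script reports raw match counts per language, while the message also talks about returning the most likely language or a list of them. A minimal sketch of that last step follows, assuming the scores are a dict shaped like the first value returned by detectLanguage. The names rankLanguages and mostLikely are hypothetical, and the share of matches is only a crude stand-in for a real likelihood.

```python
def rankLanguages(scores):
    """Return (language, share) pairs, best first.

    Hypothetical helper. `scores` is assumed to map language codes to
    match counts, e.g. {'en': 5, 'fr': 0, 'de': 0}.
    """
    total = sum(scores.values())
    if total == 0:
        # No word, prefix or suffix matched, so there is nothing to rank.
        return []
    # sorted() is stable, so tied languages keep their original order.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(lang, count / total) for lang, count in ranked]


def mostLikely(scores):
    """Return the best-scoring language code, or None without evidence."""
    ranked = rankLanguages(scores)
    return ranked[0][0] if ranked else None
```

Returning None for an all-zero score avoids silently picking whichever language happens to come first in the dict when the text matched nothing.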