MeCab UTF-8 Decoding Problem

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Jun 29 12:20:24 EDT 2013


On Sat, 29 Jun 2013 04:29:23 -0700, fobos3 wrote:

> Hi,
> 
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
> 
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
> 
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'

I see from the traceback below that you are using Python 2.7.

Here you are using a byte string rather than Unicode. In Python 2, a
plain string literal holds whatever bytes your source file contains for
it, so you get UTF-8 only if your editor really saved the file as UTF-8.
The coding declaration tells Python how to read the file; it does not
re-encode byte strings for you.

Even if the bytes are UTF-8, it is still better to use an explicit
Unicode string, and explicitly encode into bytes using whatever encoding
MeCab expects you to use, say:

text = u'MeCabで遊んでみよう!'.encode('utf-8')
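
To see why the choice of encoding matters, here is a rough sketch that
encodes the same Unicode string two ways. EUC-JP is only my example of
an alternative Japanese encoding; I am not claiming it is what MeCab
uses.

```python
# -*- coding: utf-8 -*-
# Sketch: the same Unicode text becomes different bytes depending on the
# encoding chosen, which is why the choice should be explicit.
text = u'MeCabで遊んでみよう!'

utf8_bytes = text.encode('utf-8')
eucjp_bytes = text.encode('euc_jp')  # assumed alternative, for comparison only

# Same characters, different byte lengths and contents.
print(len(text))
print(len(utf8_bytes))
print(len(eucjp_bytes))

# Decoding with the matching codec gets the original text back.
print(utf8_bytes.decode('utf-8') == text)
print(eucjp_bytes.decode('euc_jp') == text)
```

Whichever encoding you pick has to match what the program on the other
end expects, otherwise it will misread the bytes.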

By the way, what makes you think that MeCab expects, and returns, text 
encoded using UTF-8?


> result = tagger.parse(text)
> print result
> 
> result = result.decode('utf-8')
> print result
> 
> And here is the output:
> 
> MeCab �� �� ��んで�� �� ��う!

MeCab has returned a bunch of bytes, representing some text in some 
encoding. When you print those bytes, your terminal uses whatever its 
default encoding is (probably UTF-8, on a Linux system) and tries to make 
sense of the bytes, using � for any byte it cannot make sense of. This is 
good evidence that MeCab is *not* actually using UTF-8.
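
You can reproduce roughly what the terminal is doing without involving
MeCab at all. In this sketch I use EUC-JP as a stand-in for "some
encoding that is not UTF-8"; that is an assumption for illustration,
not a diagnosis.

```python
# -*- coding: utf-8 -*-
# Sketch: decode non-UTF-8 bytes as if they were UTF-8, the way a UTF-8
# terminal effectively does when asked to display them.
text = u'遊んでみよう'
other_bytes = text.encode('euc_jp')  # assumed "wrong" encoding, for illustration

# With errors set to 'replace', bytes that cannot be decoded become
# U+FFFD (the replacement character) instead of raising an exception.
shown = other_bytes.decode('utf-8', 'replace')
print(repr(shown))
```

The result contains U+FFFD wherever the bytes are not valid UTF-8, which
is the same symptom as in your output.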

And sure enough, when you try to decode it manually:


> Traceback (most recent call last):
>   File "test.py", line 11, in <module>
>     result = result.decode('utf-8')
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte

Assuming that the bytes being returned are *supposed* to be encoded in 
UTF-8, it's possible that MeCab is simply buggy and cannot produce proper 
UTF-8 encoded byte strings. This wouldn't surprise me -- after all, using 
*byte strings* as non-ASCII text strongly suggests that the author 
doesn't understand Unicode very well.

But perhaps more likely, MeCab isn't using UTF-8 at all. What does the 
documentation say?
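
If the documentation doesn't say, one crude way to narrow it down is to
try a few likely codecs in strict mode and see which ones decode the
bytes without complaint. The helper name and the candidate list below
are my own invention, and a successful decode only means the encoding
is plausible, not that it is correct.

```python
# -*- coding: utf-8 -*-
def guess_encodings(data, candidates=('utf-8', 'euc_jp', 'shift_jis')):
    """Return the candidate codecs that decode `data` without error.

    Hypothetical helper; the candidate list is a guess at encodings
    commonly used for Japanese text.
    """
    working = []
    for name in candidates:
        try:
            data.decode(name)  # strict by default: raises on invalid bytes
        except UnicodeDecodeError:
            continue
        working.append(name)
    return working


# Stand-in for the bytes a program might return; here they are EUC-JP.
sample = u'遊んでみよう'.encode('euc_jp')
print(guess_encodings(sample))
```

Pass it the byte string that tagger.parse() returned and look at the
decoded text for each surviving candidate to see which one is readable.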

A third possibility is that the string you feed to MeCab is already
mangled before MeCab ever sees it, because of the implicit conversion
from characters to bytes when you create it. Change the line

text = 'MeCab ...'

to use an explicit Unicode string and encode, as above, and maybe the 
error will go away.
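
Putting the pieces together, something like the following keeps
everything as Unicode on your side and only deals in bytes at the
boundary. parse_unicode and EchoTagger are names I made up; the sketch
assumes tagger.parse() takes and returns byte strings, as it appears to
in your script, and that you know which encoding MeCab is using.

```python
# -*- coding: utf-8 -*-
def parse_unicode(tagger, text, encoding='utf-8'):
    """Encode `text`, pass the bytes to tagger.parse(), decode the result.

    Hypothetical helper, not part of MeCab. Assumes tagger.parse() takes
    and returns byte strings, and that `encoding` matches what the
    tagger really uses.
    """
    raw = tagger.parse(text.encode(encoding))
    return raw.decode(encoding)


class EchoTagger(object):
    """Stand-in for MeCab.Tagger so the sketch runs without MeCab."""

    def parse(self, data):
        return data  # a real tagger would return the segmented text


result = parse_unicode(EchoTagger(), u'MeCabで遊んでみよう!')
print(repr(result))
```

In your script you would pass MeCab.Tagger("-Owakati") in place of the
stand-in. If the decode step still raises, the encoding argument is the
thing to question.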



-- 
Steven


