MeCab UTF-8 Decoding Problem
Steven D'Aprano
steve+comp.lang.python at pearwood.info
Sat Jun 29 12:20:24 EDT 2013
On Sat, 29 Jun 2013 04:29:23 -0700, fobos3 wrote:
> Hi,
>
> I am trying to use a program called MeCab, which does syntax analysis on
> Japanese text. The problem I am having is that it returns a byte string
> and if I try to print it, it prints question marks for almost all
> characters. However, if I try to use .decode, it throws an error. Here
> is my code:
>
> #!/usr/bin/python
> # -*- coding:utf-8 -*-
>
> import MeCab
> tagger = MeCab.Tagger("-Owakati")
> text = 'MeCabで遊んでみよう!'
I see from below that you are using Python 2.7.
Here you are using a byte string rather than Unicode. The actual bytes
that you get *may* be indeterminate. I don't think Python guarantees that,
just because the source file is declared as UTF-8, *implicit* encoding
into bytes will necessarily use UTF-8.
Even if it does, it is still better to use an explicit Unicode string,
and explicitly encode into bytes using whatever encoding MeCab expects
you to use, say:
    text = u'MeCabで遊んでみよう!'.encode('utf-8')
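To make the difference between characters and bytes concrete, here is a minimal sketch of what the explicit encode produces. It assumes UTF-8 is the encoding you want, which is still a guess as far as MeCab is concerned, and it is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Start from a Unicode string and encode it explicitly.
text = u'MeCabで遊んでみよう!'
utf8_bytes = text.encode('utf-8')  # assumption: UTF-8 is the wanted encoding

# Each of the 7 Japanese characters takes 3 bytes in UTF-8,
# and each of the 6 ASCII characters takes 1.
print(len(text))        # 13 characters
print(len(utf8_bytes))  # 27 bytes

# Decoding with the same codec gives back the original text.
round_tripped = utf8_bytes.decode('utf-8')
print(round_tripped == text)
```

The point is that you choose the encoding at one visible place in the code instead of relying on whatever the literal happens to contain.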
By the way, what makes you think that MeCab expects, and returns, text
encoded using UTF-8?
> result = tagger.parse(text)
> print result
>
> result = result.decode('utf-8')
> print result
>
> And here is the output:
>
> MeCab �� �� ��んで�� �� ��う!
MeCab has returned a bunch of bytes, representing some text in some
encoding. When you print those bytes, your terminal uses whatever its
default encoding is (probably UTF-8, on a Linux system) and tries to make
sense of the bytes, using � for any byte it cannot make sense of. This is
good evidence that MeCab is *not* actually using UTF-8.
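You can reproduce that kind of display without a terminal. The sketch below assumes, purely for illustration, that the bytes are really EUC-JP; I am not claiming that is what MeCab gave you. Decoding with the 'replace' error handler stands in for what a UTF-8 terminal does with bytes it cannot interpret.

```python
# -*- coding: utf-8 -*-
text = u'遊んでみよう'

# Hypothetical scenario: the bytes are EUC-JP, but the reader assumes UTF-8.
euc_bytes = text.encode('euc_jp')

# The 'replace' handler substitutes U+FFFD for anything undecodable,
# roughly as a terminal would when showing the raw bytes.
shown = euc_bytes.decode('utf-8', 'replace')
print(repr(shown))

# Strict decoding, which is what a plain .decode('utf-8') does, raises instead.
strict_failed = False
try:
    euc_bytes.decode('utf-8')
except UnicodeDecodeError as exc:
    strict_failed = True
    print('strict decode failed: %s' % exc)
```

So replacement characters on screen and a UnicodeDecodeError from .decode are two views of the same mismatch between the bytes and the assumed encoding.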
And sure enough, when you try to decode it manually:
> Traceback (most recent call last):
> File "test.py", line 11, in <module>
> result = result.decode('utf-8')
> File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
> return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode bytes in position 6-7:
> invalid continuation byte
Assuming that the bytes being returned are *supposed* to be encoded in
UTF-8, it's possible that MeCab is simply buggy and cannot produce proper
UTF-8 encoded byte strings. This wouldn't surprise me -- after all, using
*byte strings* as non-ASCII text strongly suggests that the author
doesn't understand Unicode very well.
But perhaps more likely, MeCab isn't using UTF-8 at all. What does the
documentation say?
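If the documentation is no help, one rough diagnostic is to try a few likely candidates and see which ones decode at all. This is only a sketch: try_decodings is a helper name I made up, and the list of candidate codecs is my guess at the usual suspects for Japanese text.

```python
# -*- coding: utf-8 -*-
def try_decodings(raw, candidates=('utf-8', 'euc_jp', 'shift_jis', 'iso2022_jp')):
    """Return {codec: decoded text} for each candidate that decodes cleanly.

    Hypothetical helper; the candidate list is an assumption, not
    something taken from the MeCab documentation.
    """
    results = {}
    for codec in candidates:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeError:
            pass  # this codec cannot interpret the bytes
    return results


# Stand-in for the bytes returned by tagger.parse(): known EUC-JP data.
sample_text = u'遊んで'
sample = sample_text.encode('euc_jp')
decoded = try_decodings(sample)
print(sorted(decoded))
```

A codec that decodes without error is not necessarily the right one, since some byte sequences are valid in more than one encoding, so look at the decoded text before trusting it.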
A third possibility is that the string you feed to MeCab is simply
mangled beyond recognition due to the way you create it using the
implicit encoding from chars to bytes. Change the line
    text = 'MeCab ...'
to use an explicit Unicode string and encode, as above, and maybe the
error will go away.
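Put together, the idea is to keep Unicode everywhere in your own code and convert only at the boundary. A minimal sketch follows, assuming that parse accepts and returns byte strings (as tagger.parse appears to in your Python 2.7 session) and that both directions use the same encoding. parse_unicode and fake_parse are names I invented; fake_parse only exists so the example runs without MeCab installed.

```python
# -*- coding: utf-8 -*-
def parse_unicode(parse, text, encoding='utf-8'):
    """Encode text, hand the bytes to parse, and decode what comes back.

    Hypothetical wrapper: assumes parse takes and returns byte strings
    in the same encoding.
    """
    raw = parse(text.encode(encoding))
    return raw.decode(encoding)


def fake_parse(data):
    # Stand-in for tagger.parse; it just echoes the bytes with a newline.
    return data + b'\n'


result = parse_unicode(fake_parse, u'MeCabで遊んでみよう!')
print(repr(result))
```

With the real library you would pass tagger.parse in place of fake_parse, and the encoding argument is the single place to change once you know what MeCab actually expects.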
--
Steven