split lines from stdin into a list of unicode strings
Peter Otten
__peter__ at web.de
Thu Aug 29 09:15:25 EDT 2013
Kurt Mueller wrote:
> I have to say that I am a bit disappointed by the chardet library.
> The encoding for the single character 'ü'
> is detected as {'confidence': 0.99, 'encoding': 'EUC-JP'},
> whereas "file" says:
> $ echo "ü" | file -i -
> /dev/stdin: text/plain; charset=utf-8
> $
>
> "ü" is a character I use very often, as it is in my name: "Müller":-)
You cannot determine an encoding from a single letter.
Why should "ü" be more likely than "端"? The only thing you can blame chardet
for is that its confidence rating is a flat-out lie...
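
To make the ambiguity concrete, here is a minimal sketch. It assumes Python 3
syntax (the thread itself may be about Python 2), and the names `raw` and
`decodings` are only illustrative. It decodes the two bytes that UTF-8 uses
for "ü" with a few candidate codecs, all of which accept them:

```python
# The UTF-8 encoding of "ü" is the two-byte sequence 0xC3 0xBC.
raw = "ü".encode("utf-8")  # b'\xc3\xbc'

# Decode the very same bytes with several candidate encodings.
# None of them raises an error, so the bytes alone cannot settle the question.
decodings = {enc: raw.decode(enc) for enc in ("utf-8", "euc_jp", "latin-1")}

for enc, text in decodings.items():
    print(enc, ascii(text), text)
```

Each decoding is valid in its own encoding, which is why a detector has
nothing but prior assumptions to go on for such a short input.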
For "Müller", on the other hand, you could probably come up with a (simple)
heuristic: "ü" is more likely to be surrounded by ASCII letters than
"端".
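
A rough sketch of what such a heuristic could look like follows. This is not
what chardet does. The function names (`score`, `guess_encoding`), the
candidate list, and the +1/-1 scoring are all assumptions made up for
illustration, and Python 3 is assumed. A non-ASCII character next to an ASCII
letter counts in favour of a decoding if it is itself a Latin letter, and
against it otherwise:

```python
import string
import unicodedata

ASCII_LETTERS = set(string.ascii_letters)


def is_latin_letter(ch):
    # Unicode names such as "LATIN SMALL LETTER U WITH DIAERESIS" start with "LATIN".
    return ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN")


def score(text):
    """Hypothetical plausibility score for one candidate decoding."""
    total = 0
    for i, ch in enumerate(text):
        if ord(ch) < 128:
            continue  # only non-ASCII characters carry information here
        prev_ch = text[i - 1] if i > 0 else ""
        next_ch = text[i + 1] if i + 1 < len(text) else ""
        if prev_ch in ASCII_LETTERS or next_ch in ASCII_LETTERS:
            # A Latin letter among ASCII letters is plausible; anything else is not.
            total += 1 if is_latin_letter(ch) else -1
    return total


def guess_encoding(data, candidates=("utf-8", "euc_jp")):
    """Return the candidate whose decoding scores highest, or None if none decode."""
    best = None
    for enc in candidates:
        try:
            text = data.decode(enc)
        except UnicodeDecodeError:
            continue
        s = score(text)
        if best is None or s > best[0]:
            best = (s, enc)
    return best[1] if best else None


print(guess_encoding("Müller".encode("utf-8")))
```

For the lone "ü" both decodings score 0, so the function simply returns the
first candidate in the list; the context supplied by "Müller" is what breaks
the tie.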