reading from file

Thu Jun 11 16:45:11 EDT 2009

On Jun 11, 4:24 pm, Sydoruk Yaroslav <sw... at mirohost.net> wrote:
> Hello all,
>
> In a text file aword.txt, there is a string:
>     "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc".
>
> There is a first script:
> f = open ("aword.txt", "r")
> for line in f:
>     print chardet.detect(line)
>     b = line.decode('cp1251')
>     print b
>
> _RESULT_
> {'confidence': 1.0, 'encoding': 'ascii'}
> \xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc
>
> There is a second script:
> line = "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"
> print chardet.detect(line)
> b = line.decode('cp1251')
> print b
>
> _RESULT_
> {'confidence': 0.98999999999999999, 'encoding': 'windows-1251'}
> как+позвонить
>
> Why is the string detected as ascii when it is read from a file,
> but as cp1251 when it is defined directly in the script?
> How do I solve this problem?
>
> --
> Only one 0_o

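Is the string in your text file literally the characters
"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc", typed in as plain
text?  If so, each "\xea" in the file is four separate ASCII characters
(a backslash, an "x", and two hex digits), not a single byte.  '\x'
escapes are only interpreted in string literals in Python source, so
what you read back from the file really is plain ASCII, and chardet is
right to report it as such.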
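
To make the difference concrete, here is a small sketch comparing the
two cases.  It is written in Python 3 syntax, whereas the scripts in
this thread are Python 2, and the names literal, real and is_ascii are
only illustrative; it also assumes the file holds the escapes as typed
text, which is a guess about your file.

```python
# Python 3 sketch; names are illustrative, not from the original scripts.

# What the file holds if the escapes were typed in as plain text:
# a backslash, an 'x', and two hex digits for each character.
literal = rb"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"

# What a string literal in a script produces: one byte per escape.
real = b"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"


def is_ascii(data):
    """Return True if every byte in data is below 0x80."""
    return all(byte < 0x80 for byte in data)


print(len(literal), is_ascii(literal))  # 49 True
print(len(real), is_ascii(real))        # 13 False
```

The 49-byte version contains nothing above 0x7f, so any detector will
call it ASCII; only the 13-byte version carries cp1251 data.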

As an experiment:

(t)jeff@marvin:~/t$ cat test.py
import chardet

s = "\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc"
with open('test.txt', 'w') as f:
    print >>f, s

print chardet.detect(open('test.txt').read())
(t)jeff@marvin:~/t$ python test.py
{'confidence': 0.98999999999999999, 'encoding': 'windows-1251'}
(t)jeff@marvin:~/t$
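
If the file does turn out to contain the escapes as literal text, one
possible way to get the real bytes back is to run the line through the
'unicode_escape' codec and then re-encode it as latin-1.  The following
is only a sketch under that assumption, in Python 3 syntax (in
Python 2, line.decode('string_escape') should do the same job in one
step); unescape_bytes is a hypothetical helper name, not something from
chardet or the standard library.

```python
# Python 3 sketch; unescape_bytes is a hypothetical helper name.

def unescape_bytes(raw):
    """Turn literal backslash escapes such as b'\\xea' into single bytes.

    'unicode_escape' maps the text \\xea to code point U+00EA, and
    latin-1 maps code points 0-255 back to bytes one to one.
    """
    return raw.decode("unicode_escape").encode("latin-1")


# Stand-in for a line read from the file in binary mode, assuming the
# file stores the escapes as typed text followed by a newline.
line = rb"\xea\xe0\xea+\xef\xee\xe7\xe2\xee\xed\xe8\xf2\xfc" + b"\n"

recovered = unescape_bytes(line.strip())
text = recovered.decode("cp1251")

print(len(recovered))  # 13
```

After this step, recovered holds the same 13 bytes as the string in
your second script, so chardet and decode('cp1251') should treat both
the same way.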

HTH,

Jeff
mcjeff.blogspot.com