What happens when python seeks a text file

李嘉鹏 lijpbasin at 126.com
Mon Jul 27 19:21:21 EDT 2015


Hi, I tried using seek to reverse a text file after reading about the
subject in the documentation:

https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

https://docs.python.org/3/library/io.html#io.TextIOBase.seek

The script "reverse_text_by_seek3.py" produces the expected result on the
UTF-8 encoded text file "Moon-utf8.txt" (several lines of Chinese characters):

    $ ./reverse_text_by_seek3.py Moon-utf8.txt
    [0, 10, 11, 27, 28, 44, 60, 76, 92]
    低头思故乡
    举头望明月
    疑似地上霜
    床前明月光
    
    李白(唐)
    
    静夜思

or

    $ ./reverse_text_by_seek3.py Moon-utf8.txt seek
    [0, 10, 11, 27, 28, 44, 60, 76, 92]
    低头思故乡
    举头望明月
    疑似地上霜
    床前明月光
    
    李白(唐)
    
    静夜思
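
Since the attached script is not reproduced inline, here is a minimal sketch
of the approach, assuming it records tell() positions while reading and then
seeks back to them in reverse order. It is an illustrative reconstruction,
not the attached file: the function name `reverse_lines` and the `reseek`
flag (standing in for the "seek" command-line argument) are hypothetical.

```python
def reverse_lines(path, encoding, reseek=False):
    """Read a text file backwards, line by line, using tell() and seek().

    Hypothetical reconstruction; reseek=True imitates the "seek" argument.
    """
    with open(path, encoding=encoding) as f:
        # Forward pass: record the text-mode position before each line
        # and once more at end of file.
        positions = [f.tell()]
        while f.readline():
            if reseek:
                # Seek to the position just reported by tell().
                f.seek(f.tell())
            positions.append(f.tell())
        # Backward pass: revisit the recorded positions in reverse order.
        lines = []
        for pos in reversed(positions):
            f.seek(pos)
            lines.append(f.readline())
    return positions, lines
```

The first line returned is empty because the last recorded position is the
end of the file, which would explain why nine positions are printed for
eight lines.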

However, an exception is raised if a file with the same content encoded in
GBK is provided:

    $ ./reverse_text_by_seek3.py Moon-gbk.txt
    [0, 7, 8, 19, 21, 32, 42, 53, 64]
    低头思故乡
    举头望明月
    Traceback (most recent call last):
      File "./reverse_text_by_seek3.py", line 21, in <module>
        print(f.readline(), end="")
    UnicodeDecodeError: 'gbk' codec can't decode byte 0xaa in position 8: illegal multibyte sequence

Everything works fine again, however, when a seek operation is applied after
each readline invocation:

    $ ./reverse_text_by_seek3.py Moon-gbk.txt seek
    [0, 7, 8, 19, 20, 31, 42, 53, 64]
    低头思故乡
    举头望明月
    疑似地上霜
    床前明月光
    
    李白(唐)
    
    静夜思

Some of the printed positions also differ between the two GBK runs (21 and
32 without the extra seek, 20 and 31 with it).
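
One way to check which list of positions falls on line boundaries is to
compute the byte offset of each line start independently of tell(). The
sketch below does this for a string assumed to match the attached files; the
helper name `line_start_offsets` is hypothetical, and the full-width
parentheses and trailing newline in `poem` are assumptions, chosen because
they reproduce the UTF-8 positions shown above.

```python
def line_start_offsets(text, encoding):
    """Return the byte offset of every line start, plus the end of the data."""
    offsets = [0]
    for line in text.splitlines(keepends=True):
        # Each line advances the offset by its encoded length in bytes.
        offsets.append(offsets[-1] + len(line.encode(encoding)))
    return offsets


# Assumed file content (full-width parentheses, newline after every line).
poem = "静夜思\n\n李白（唐）\n\n床前明月光\n疑似地上霜\n举头望明月\n低头思故乡\n"

print(line_start_offsets(poem, "utf-8"))
# [0, 10, 11, 27, 28, 44, 60, 76, 92]
print(line_start_offsets(poem, "gbk"))
# [0, 7, 8, 19, 20, 31, 42, 53, 64]
```

Under these assumptions, the positions from the GBK run with the extra seek
coincide with the byte offsets of line starts, whereas 21 and 32, if read as
byte offsets, would point at the second byte of a two-byte character.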

I also wrote a Python 2 counterpart, "reverse_text_by_seek2.py", which decodes
the lines when printing rather than when reading; with it, no exception occurs.
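
The decode-on-print idea can be sketched in Python 3 as well by opening the
file in binary mode, where tell() returns plain byte offsets. This is an
assumption about what the Python 2 script does, not a copy of it;
`reverse_lines_binary` is a hypothetical name.

```python
def reverse_lines_binary(path, encoding):
    """Read a file backwards by byte offset and decode each line afterwards.

    Hypothetical sketch of the "decode upon printing" approach.
    """
    with open(path, "rb") as f:
        # In binary mode tell() is a real byte offset into the file.
        offsets = [f.tell()]
        while f.readline():
            offsets.append(f.tell())
        lines = []
        for offset in reversed(offsets):
            f.seek(offset)
            # readline() splits on b"\n"; decoding happens only here.
            lines.append(f.readline().decode(encoding))
    return offsets, lines
```

Splitting on b"\n" before decoding is safe for both UTF-8 and GBK, because
the byte 0x0A never occurs inside a multibyte character in either encoding.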

I'm doing this just for fun, not for anything useful. Can anyone reproduce the
above results? What's really happening here? Is it a bug?

Other information:

    Distribution: Arch Linux
    Python3 package: 3.4.3-2 (official)
    Python2 package: 2.7.10-1 (official)

    $ uname -rvom
    4.1.2-2-ARCH #1 SMP PREEMPT Wed Jul 15 08:30:32 UTC 2015 x86_64 GNU/Linux

    $ env | grep -e LC -e LANG
    LC_ALL=en_US.UTF-8
    LC_COLLATE=C
    LANG=en_US.UTF-8
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reverse_text_by_seek3.py
Type: application/octet-stream
Size: 552 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20150728/7343d79f/attachment.obj>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Moon-gbk.txt
URL: <http://mail.python.org/pipermail/python-list/attachments/20150728/7343d79f/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: Moon-utf8.txt
URL: <http://mail.python.org/pipermail/python-list/attachments/20150728/7343d79f/attachment-0001.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: reverse_text_by_seek2.py
Type: application/octet-stream
Size: 542 bytes
Desc: not available
URL: <http://mail.python.org/pipermail/python-list/attachments/20150728/7343d79f/attachment-0001.obj>

