Proper use of the codecs module.

Fri Aug 16 15:12:02 EDT 2013

On Fri, 16 Aug 2013 10:02:08 -0400, Andrew wrote:

> I have a mixed binary/text file[0], and the text portions use a
> radically nonstandard character set. I want to read them easily given
> information about the character encoding and an offset for the beginning
> of a string.

"Mixed binary/text" is not a helpful model to use. You are better off 
thinking of the file as "binary", where some of the fields happen to 
contain text encoded with some custom codec.

If you try opening the file in text mode, you'll very likely break the 
binary parts (e.g. newline translation turns the two bytes 0x0D 0x0A 
into the single character "\n", and bytes that are invalid in the chosen 
encoding raise an error). So best to stick to binary mode only, extract 
the "text" portions of the file, then explicitly decode them.
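
To make the newline problem concrete, here is a minimal sketch. It uses 
an in-memory io.BytesIO as a stand-in for a real file, and latin-1 only 
because it accepts every byte value; both are my assumptions for the 
demonstration, not anything from your file format.

```python
import io

# Four bytes of "binary" data that happen to contain CR LF.
payload = b"\x01\x0d\x0a\x02"

# Binary mode: the bytes come back exactly as stored.
as_bytes = io.BytesIO(payload).read()

# Text mode with default settings: CR LF is translated to "\n".
text_stream = io.TextIOWrapper(io.BytesIO(payload), encoding="latin-1")
as_text = text_stream.read()

print(len(as_bytes), len(as_text))
```

The binary read keeps all four bytes, while the text read hands back 
only three characters, so any offsets computed from the text version 
would be off by one from that point on.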

> The descriptions of the codecs module and codecs.register() in
> particular seem to suggest that this is already supported in the
> standard library. However, I can't find any examples of its proper use.
> Most people who use the module seem to want to read utf files in python
> 2.x.[1] I would like to know how to correctly set up a new codec for
> reading files that have nonstandard encodings.

I suggest you look at the source code for the dozens of codecs in the 
standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

(Adjust for your installation location as required.)
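
As a rough sketch of the pattern those modules follow, here is a small 
charmap-style codec registered through codecs.register(). The codec name 
"demo_charmap" and the byte-to-character table are made up for 
illustration; you would substitute the real table for your file format.

```python
import codecs

# Hypothetical table: bytes 0x00-0x19 are 'A'-'Z', byte 0x1A is a space.
# U+FFFE marks every other byte value as undefined.
_DECODING_TABLE = (
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ "
    + "\ufffe" * (256 - 27)
)
_ENCODING_TABLE = codecs.charmap_build(_DECODING_TABLE)


def _decode(data, errors="strict"):
    # Returns a (str, number_of_bytes_consumed) tuple.
    return codecs.charmap_decode(data, errors, _DECODING_TABLE)


def _encode(text, errors="strict"):
    # Returns a (bytes, number_of_characters_consumed) tuple.
    return codecs.charmap_encode(text, errors, _ENCODING_TABLE)


def _search(name):
    # Search function: return a CodecInfo for our name, None otherwise.
    if name == "demo_charmap":
        return codecs.CodecInfo(
            name="demo_charmap", encode=_encode, decode=_decode
        )
    return None


codecs.register(_search)

decoded = b"\x07\x04\x0b\x0b\x0e".decode("demo_charmap")
```

Once the search function is registered, bytes.decode() and str.encode() 
accept the new name like any built-in encoding. The sketch leaves out 
the incremental and stream classes that the standard library modules 
also define, so it covers explicit decoding of extracted byte strings 
but not open(..., encoding=...).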

> I have two other related questions:
> 
> How does seek() work on a file opened in text mode? Does it seek to a
> character offset or to a byte offset? I need the latter behavior. If I
> can't get it I will have to find a different approach.

For text files, seek() is only legal for offset zero and for offsets 
that tell() has returned (tell() gives back an opaque number, not a 
character count). This is not enforced, so you can get nasty rubbish 
like this:

py> f = open('/tmp/t', 'w', encoding='utf-32')
py> f.write('hello world')
11
py> f.close()
py> f = open('/tmp/t', 'r', encoding='utf-32')
py> f.read(1)
'h'
py> f.tell()
8
py> f.seek(3)
3
py> f.read(1)
'栀'

So I prefer not to seek in text files if I can help it.
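
In binary mode, by contrast, seek() takes plain byte offsets, which is 
the behaviour you asked for. A minimal sketch, again using io.BytesIO in 
place of a real file opened with 'rb', and spelling out little-endian 
UTF-32 so the byte layout is unambiguous (my choice for the example):

```python
import codecs
import io

# 4-byte BOM, then 4 bytes per character.
raw = codecs.BOM_UTF32_LE + "hello world".encode("utf-32-le")
f = io.BytesIO(raw)

# Byte offset of the 7th character: skip the BOM and six characters.
f.seek(4 + 4 * 6)
chunk = f.read(4 * 5)            # five characters' worth of bytes
word = chunk.decode("utf-32-le")
print(word)
```

Because the offset and length are counted in bytes and the decoding 
happens afterwards, there is no way to land in the middle of a character 
unless the offset itself is wrong.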

> The files I'm working with use a nonstandard end-of-string character in
> the same fashion as C null-terminated strings. Is there a builtin
> function that will read a file "from seek position until seeing EOS
> character X"? The methods I see for this online seem to amount to
> reading one character at a time and checking manually, which seems
> nonoptimal to me.

How do you think such a built-in function would work, if not inspect each 
character until the EOS character is seen? :-)

There is no such built-in function though. By default, Python files are 
buffered, so it won't literally read one character from disk at a time. 
The actual disk IO will read a bunch of bytes into a memory buffer, and 
then read from the buffer.
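
If you want to avoid the per-byte Python loop, one option is to read in 
chunks and let bytes.find() do the scanning. The helper below is a 
sketch under a few assumptions of mine: the name read_until is made up, 
the terminator is a single byte, and the file is seekable and opened in 
binary mode.

```python
import io


def read_until(f, terminator, chunk_size=4096):
    """Return bytes from the current position up to the terminator.

    The terminator is not included in the result, and the file is left
    positioned just past it. If EOF is reached first, everything that
    was read is returned.
    """
    if len(terminator) != 1:
        raise ValueError("terminator must be a single byte")
    start = f.tell()
    parts = []
    consumed = 0
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break                      # EOF without a terminator
        idx = chunk.find(terminator)
        if idx != -1:
            parts.append(chunk[:idx])
            consumed += idx + 1        # count the terminator as well
            f.seek(start + consumed)   # undo any over-read
            break
        parts.append(chunk)
        consumed += len(chunk)
    return b"".join(parts)


demo = io.BytesIO(b"junkHELLO\xffWORLD\xff")
demo.seek(4)
first = read_until(demo, b"\xff", chunk_size=3)
second = read_until(demo, b"\xff", chunk_size=3)
```

The tiny chunk_size in the demo is only there to exercise the case where 
a string spans several reads. The returned bytes can then be passed to 
.decode() with whatever codec the field uses.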

-- 
Steven