Proper use of the codecs module.

Fri Aug 16 16:16:31 EDT 2013

On 16 Aug 2013 19:12:02 GMT, Steven D'Aprano wrote:

> If you try opening the file in text mode, you'll very likely break the 
> binary parts (e.g. converting the two bytes 0x0D0A to a single byte 
> 0x0A). So best to stick to binary only, extract the "text" portions of 
> the file, then explicitly decode them.

Okay, I'll do that. Given what you said about seek() and text mode below, I
have no choice anyway. 
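
For the record, a minimal sketch of the binary-only approach as I understand it. The helper name read_text_field, the offsets, and the Latin-1 encoding are placeholders of mine, not anything from the real file format; io.BytesIO stands in for a file opened with open(path, "rb").

```python
import io

def read_text_field(stream, offset, length, encoding="latin-1"):
    """Seek to a byte offset in a binary stream, read `length` bytes,
    and decode only those bytes. (Hypothetical helper for illustration.)"""
    stream.seek(offset)          # in binary mode this is a plain byte offset
    raw = stream.read(length)    # bytes come back untouched, no newline translation
    return raw.decode(encoding)  # explicit decode of just the "text" portion

# Made-up sample data: two binary bytes, a CR LF pair, then a text field.
data = b"\x00\x01\r\nHello\xe9\xff"
stream = io.BytesIO(data)        # stand-in for open(path, "rb")
text = read_text_field(stream, 4, 6)
```

Because nothing is opened in text mode, the 0x0D 0x0A pair in the binary part is never rewritten.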

>> I would like to know how to correctly set up a new codec for
>> reading files that have nonstandard encodings.
> 
> I suggest you look at the source code for the dozens of codecs in the 
> standard library. E.g. /usr/local/lib/python3.3/encodings/palmos.py

I'll do that too. My thanks for the pointer.
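
From a first skim, the charmap-style codecs in the standard library seem to boil down to something like the sketch below. The codec name "democharmap" and the mapping (Latin-1 with byte 0x80 remapped to the euro sign) are invented for illustration; a real table would come from the file format's own character set.

```python
import codecs

# Hypothetical mapping: identical to Latin-1 except that byte 0x80
# decodes to the euro sign. Index in the string = byte value.
decoding_table = "".join(
    "\u20ac" if i == 0x80 else chr(i) for i in range(256)
)
# Build the reverse (str -> bytes) map from the decoding table.
encoding_table = codecs.charmap_build(decoding_table)

class Codec(codecs.Codec):
    def encode(self, input, errors="strict"):
        return codecs.charmap_encode(input, errors, encoding_table)

    def decode(self, input, errors="strict"):
        return codecs.charmap_decode(input, errors, decoding_table)

def search(name):
    """Codec search function: return a CodecInfo for our name, else None."""
    if name == "democharmap":
        codec = Codec()
        return codecs.CodecInfo(
            name="democharmap",
            encode=codec.encode,
            decode=codec.decode,
        )
    return None

codecs.register(search)

decoded = b"\x80abc".decode("democharmap")
encoded = "\u20ac".encode("democharmap")
```

This sketch leaves out the incremental and stream classes that the standard-library codecs also define, so it only covers bytes.decode() and str.encode() on data that is already in memory.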

>> How does seek() work on a file opened in text mode? Does it seek to a
>> character offset or to a byte offset? I need the latter behavior. If I
>> can't get it I will have to find a different approach.
> 
> For text files, seek() is only legal for offsets that tell() can return, 
> but this is not enforced, so you can get nasty rubbish like this:
> 
> <snip evil>
> 
> So I prefer not to seek in text files if I can help it.

If I'm understanding the above right, it seeks to a byte offset, but that
behavior is undocumented and not guaranteed, so it shouldn't be relied on.
That would actually work for me in theory (because I have exact byte offsets
to work with), but I think I'll avoid it anyway, on the grounds that relying
on undocumented behavior is bad.

>> The files I'm working with use a nonstandard end-of-string character in
>> the same fashion as C null-terminated strings. Is there a builtin
>> function that will read a file "from seek position until seeing EOS
>> character X"? The methods I see for this online seem to amount to
>> reading one character at a time and checking manually, which seems
>> nonoptimal to me.
> 
> How do you think such a built-in function would work, if not inspect each 
> character until the EOS character is seen? :-)

I don't know, but I'm assuming it wouldn't involve a function call to
file.read(1) for each character, and that's what Google keeps handing me.
Such an approach fills me with horror. :-) I suppose there's nothing
stopping me from reading some educated guess at the length of the string
and then stepping through the result. Or I'll look at the readline() source
and see how it does its thing.
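
The "read a chunk and step through it" idea could look roughly like this, letting bytes.find() do the scanning instead of calling read(1) per character. The function name read_until, the chunk size, and the 0xFF terminator in the demo are assumptions of mine; the sketch also assumes a single-byte terminator and a seekable stream opened in binary mode.

```python
import io

def read_until(stream, terminator, chunk_size=4096):
    """Read from the current position up to (not including) a single-byte
    terminator, leaving the stream positioned just after it. If EOF is hit
    first, return whatever was read."""
    start = stream.tell()
    parts = []
    consumed = 0                      # bytes scanned without finding the terminator
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:                 # EOF before any terminator
            return b"".join(parts)
        idx = chunk.find(terminator)
        if idx != -1:
            parts.append(chunk[:idx])
            # Rewind to just past the terminator; we may have over-read.
            stream.seek(start + consumed + idx + 1)
            return b"".join(parts)
        parts.append(chunk)
        consumed += len(chunk)

# Made-up data with a hypothetical 0xFF end-of-string byte.
stream = io.BytesIO(b"abc\xffdefgh\xffrest")
first = read_until(stream, b"\xff", chunk_size=2)
second = read_until(stream, b"\xff", chunk_size=2)
position = stream.tell()
tail = read_until(stream, b"\xff", chunk_size=2)
```

The tiny chunk_size in the demo is only there to exercise the case where a string spans several chunks.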

> There is no such built-in function though. By default, Python files are 
> buffered, so it won't literally read one character from disk at a time. 
> The actual disk IO will read a bunch of bytes into a memory buffer, and 
> then read from the buffer.

I'd guessed as much, but assumed there was still ridiculous function call
overhead involved in the repeated read(1) method above. Of course, trying
to avoid said overhead is premature optimization; my interest in doing so
is more aesthetic than anything else. 

Thanks for the help.

-- 

Andrew