[Python-Dev] Bytes path support

Fri Aug 22 19:09:21 CEST 2014

On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>>     What encoding does have a text file (an HTML, to be precise) with
>>> text in utf-8, ads in cp1251 (ad blocks were included from different
>>> files) and comments in koi8-r?
>>>     Well, I must admit the HTML was rather an exception, but having a
>>> text file with some strange characters (binary strings, or paragraphs
>>> in different encodings) is not that exceptional.
>> That's not a text file. That's a binary file containing (hopefully
>> delimited, and documented) sections of encoded text in different
>> encodings.
>     Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
> Oleg.

I was not saying that your file fails some definition of "text file" 
taken from the Python3 documentation, only that it fails a common sense 
definition of "text file".

Looking at it from Python3, though, it is clear that when a file is 
opened in "text" mode, an encoding may be specified or will be assumed. 
That is one encoding, applying to the whole file, not three encodings 
with declarations of when to switch between them. So I think, in 
general, Python3 assumes a definition of text file that matches my 
"common sense" definition. Also, if it is an HTML file, I doubt the 
browser will use multiple encodings when interpreting it. So it is not 
clear that a file containing text in several encodings, but served 
under a single encoding, is of practical use for its intended purpose, 
unless there is JavaScript or some other programming in the browser 
that re-encodes the data.

On the other hand, Python3 provides various facilities for working with 
such files.

The first I'll mention is the one that follows from my description of 
what your file really is: Python3 allows opening files in binary mode, 
and then decoding various sections of it using whatever encoding you 
like, using the bytes.decode() operation on various sections of the 
file. Determination of which sections are in which encodings is beyond 
the scope of this description of the technique, and is application 
dependent.
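
As a rough illustration of this first technique, here is a minimal 
sketch. It assumes the sections are separated by a known ASCII 
delimiter and that the encoding of each section is known in advance. 
The delimiter, the encoding list, and the function names are all made 
up for the example; real files will need their own way of finding 
section boundaries.

```python
# Hypothetical layout: sections separated by a known ASCII delimiter,
# each section written in a known encoding. Both are assumptions.
DELIM = b"\n--section--\n"
ENCODINGS = ["utf-8", "cp1251", "koi8-r"]


def decode_sections(raw, encodings=ENCODINGS, delim=DELIM):
    """Split raw bytes on the delimiter and decode each part separately."""
    parts = raw.split(delim)
    return [part.decode(enc) for part, enc in zip(parts, encodings)]


def read_mixed_file(path):
    """Open the file in binary mode so no single encoding is assumed."""
    with open(path, "rb") as f:
        return decode_sections(f.read())


# Build an in-memory sample instead of touching the filesystem.
sample = DELIM.join([
    "текст".encode("utf-8"),
    "реклама".encode("cp1251"),
    "комментарий".encode("koi8-r"),
])
decoded = decode_sections(sample)
```

The point is only that bytes.decode() is applied per section, with a 
different encoding each time, after the file has been read as bytes.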

The second is to specify an error handler that, like you, is trained to 
recognize the other encodings and convert them appropriately. I'm not 
aware that such an error handler has been written, or could be; I do 
not have your training.
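
For what the mechanism looks like, here is a deliberately naive sketch 
using codecs.register_error(). It is not the "trained" handler 
described above: it assumes every byte sequence that is invalid as 
UTF-8 is cp1251, so it cannot tell cp1251 from koi8-r, and it will 
miss any cp1251 bytes that happen to form valid UTF-8. The handler 
name "cp1251_fallback" is made up for the example.

```python
import codecs


def cp1251_fallback(exc):
    """Decode bytes that are invalid UTF-8 as cp1251 (an assumption)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # errors="replace" covers the few byte values cp1251 leaves undefined.
    return bad.decode("cp1251", errors="replace"), exc.end


# Register the handler under a hypothetical name.
codecs.register_error("cp1251_fallback", cp1251_fallback)

raw = "текст ".encode("utf-8") + "реклама".encode("cp1251")
text = raw.decode("utf-8", errors="cp1251_fallback")
```

The same handler name can be passed as the errors argument to open() 
in text mode. Deciding which fallback encoding applies where is the 
hard part, and this sketch does not attempt it.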

The third is to specify UTF-8 with the "surrogateescape" error 
handler. This allows bytes that are not valid UTF-8 to be loaded into 
memory. You, or algorithms as smart as you, could then detect and 
manipulate the resulting "lone surrogate" codes in meaningful ways, or 
simply let them ride along without interpretation and be emitted as 
the original bytes into other files (when written with the same error 
handler).
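
A small sketch of this third technique, working on bytes in memory 
rather than a real file. The sample data and variable names are made 
up; the assumption that the undecodable bytes are koi8-r is only for 
the demonstration.

```python
# UTF-8 text followed by koi8-r bytes that are not valid UTF-8.
raw = "текст ".encode("utf-8") + "реклама".encode("koi8-r")

# Each undecodable byte 0xNN becomes the lone surrogate U+DCNN.
# For a file: open(path, encoding="utf-8", errors="surrogateescape")
text = raw.decode("utf-8", errors="surrogateescape")

# Detect the escaped bytes: they fall in the range U+DC80..U+DCFF.
escaped = [ch for ch in text if "\udc80" <= ch <= "\udcff"]

# Recover the original bytes behind the surrogates and, assuming we
# somehow know they were koi8-r, decode them properly.
recovered = bytes(ord(ch) - 0xDC00 for ch in escaped).decode("koi8-r")

# Letting them "ride along": encoding with the same handler restores
# the original bytes exactly.
round_tripped = text.encode("utf-8", errors="surrogateescape")
```

Note that encoding such a string with the default strict handler 
raises UnicodeEncodeError, so the output side has to opt in as well.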

There may be other techniques that I am not aware of.

Glenn