[Python-Dev] Bytes path support
Glenn Linderman
v+python at g.nevcal.com
Fri Aug 22 19:09:21 CEST 2014
On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>> What encoding does have a text file (an HTML, to be precise) with
>>> text in utf-8, ads in cp1251 (ad blocks were included from different
>>> files) and comments in koi8-r?
>>> Well, I must admit the HTML was rather an exception, but having a
>>> text file with some strange characters (binary strings, or paragraphs
>>> in different encodings) is not that exceptional.
>> That's not a text file. That's a binary file containing (hopefully
>> delimited, and documented) sections of encoded text in different
>> encodings.
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
> Oleg.
I was not saying that your file fails some definition of "text file"
taken from the Python3 documentation, only that it fails a common sense
definition of "text file".
Looking at it from Python3, though, it is clear that when opening a file
in "text" mode, an encoding may be specified or will be assumed. That
is one encoding, applying to the whole file, not 3 encodings, with
declarations on when to switch between them. So I think, in general,
Python3's notion of a text file matches my "common sense" definition.
Also, if it is an HTML file, I doubt the browser will use multiple
different encodings when interpreting it. So it is not clear that a file
containing text in multiple encodings, but served with only a single
declared encoding, is of practical use for its intended purpose, unless
there is JavaScript or some other programming in the browser that
re-encodes the data.
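To illustrate the "one encoding per file" point, here is a minimal
sketch. The sample bytes and the temporary file are made up for the
example; the assumption is simply a file whose first part is UTF-8 and
whose second part is cp1251.

```python
import os
import tempfile

# Made-up sample: valid UTF-8 followed by cp1251 bytes that are not valid UTF-8.
raw = "Текст ".encode("utf-8") + "реклама".encode("cp1251")

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(raw)

try:
    # Text mode applies a single encoding to the whole file.
    with open(path, encoding="utf-8") as f:
        f.read()
    strict_failed = False
except UnicodeDecodeError:
    # The cp1251 part cannot be decoded as UTF-8 under the default strict handler.
    strict_failed = True
finally:
    os.remove(path)
```

With the default error handling, the read fails as soon as it reaches
the cp1251 bytes.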
On the other hand, Python3 provides various facilities for working with
such files.
The first I'll mention is the one that follows from my description of
what your file really is: Python3 allows opening files in binary mode,
and then decoding individual sections of the file with bytes.decode(),
using whatever encoding you like for each section. Determination of
which sections are in which encodings is beyond the scope of this
description of the technique, and is application dependent.
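A minimal sketch of this first technique follows. The blank-line
delimiter, the list of encodings, and the helper name decode_sections()
are all assumptions made for the example; a real file would need its own
way of locating sections and identifying their encodings.

```python
import os
import tempfile

# Assumed layout (hypothetical): sections separated by a blank line,
# with the encoding of each section known in advance.
ENCODINGS = ["utf-8", "cp1251", "koi8-r"]
TEXTS = ["Основной текст", "Рекламный блок", "Комментарий"]

def decode_sections(path, encodings, delimiter=b"\n\n"):
    """Read the file as bytes and decode each delimited section separately."""
    with open(path, "rb") as f:
        data = f.read()
    chunks = data.split(delimiter)
    return [chunk.decode(enc) for chunk, enc in zip(chunks, encodings)]

# Build a sample mixed-encoding file to run the helper against.
sample = b"\n\n".join(t.encode(e) for t, e in zip(TEXTS, ENCODINGS))
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(sample)

try:
    decoded = decode_sections(path, ENCODINGS)
finally:
    os.remove(path)
```

Each section comes back as an ordinary str, decoded with its own
encoding.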
The second is to specify an error handler that, like you, is trained to
recognize the other encodings and convert them appropriately. I'm not
aware that such an error handler has been or could be written, myself
not having your training.
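The hook for such a handler does exist: codecs.register_error(). The
sketch below only shows the mechanism, and is far simpler than a handler
that recognizes several encodings. It assumes that every byte sequence
that is not valid UTF-8 is cp1251, which is a naive heuristic; the
handler name "cp1251_fallback" is made up for the example.

```python
import codecs

def cp1251_fallback(exc):
    """Decode bytes that are not valid UTF-8 as cp1251 (a naive heuristic)."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position at which to resume decoding.
    return bad.decode("cp1251", errors="replace"), exc.end

# Register the handler under a hypothetical name.
codecs.register_error("cp1251_fallback", cp1251_fallback)

mixed = "Текст ".encode("utf-8") + "реклама".encode("cp1251")
result = mixed.decode("utf-8", errors="cp1251_fallback")
```

A name registered this way can also be passed as the errors argument of
open(). Note that cp1251 bytes which happen to form valid UTF-8 would
never reach the handler, so this cannot be relied on in general.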
The third is to specify UTF-8 with the surrogateescape error handler.
This allows non-UTF-8 codes to be loaded into memory. You could detect
and manipulate the resulting "lone surrogate" codes in meaningful ways,
as could algorithms as smart as you, if such can be developed. Or the
codes could simply be allowed to ride along without interpretation, and
be emitted as the original bytes into other files.
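A minimal sketch of this third technique follows. The sample bytes and
file names are made up, and the last step assumes it is already known
that the escaped bytes are cp1251.

```python
import os
import tempfile

# Made-up sample: valid UTF-8 followed by cp1251 bytes that are not valid UTF-8.
raw = "Текст ".encode("utf-8") + "реклама".encode("cp1251")

with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "mixed.txt")  # hypothetical file names
    dst = os.path.join(tmp, "copy.txt")
    with open(src, "wb") as f:
        f.write(raw)

    # Each undecodable byte 0xNN is loaded as the lone surrogate U+DCNN.
    with open(src, encoding="utf-8", errors="surrogateescape", newline="") as f:
        text = f.read()

    # Writing with the same handler turns the surrogates back into the original bytes.
    with open(dst, "w", encoding="utf-8", errors="surrogateescape", newline="") as f:
        f.write(text)
    with open(dst, "rb") as f:
        copied = f.read()

# Pick out the escaped bytes and, assuming they are cp1251, decode them.
escaped = "".join(ch for ch in text if "\udc80" <= ch <= "\udcff")
recovered = escaped.encode("utf-8", errors="surrogateescape").decode("cp1251")
```

The copy is byte-for-byte identical to the source, and the lone
surrogates can be interpreted later, or never.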
There may be other techniques that I am not aware of.
Glenn