[Python-Dev] Bytes path support
Chris Barker
chris.barker at noaa.gov
Fri Aug 22 20:51:20 CEST 2014
On Fri, Aug 22, 2014 at 10:09 AM, Glenn Linderman <v+python at g.nevcal.com>
wrote:
> What encoding does have a text file (an HTML, to be precise) with
> text in utf-8, ads in cp1251 (ad blocks were included from different
> files) and comments in koi8-r?
> Well, I must admit the HTML was rather an exception, but having a
> text file with some strange characters (binary strings, or paragraphs
> in different encodings) is not that exceptional.
>
> That's not a text file. That's a binary file containing (hopefully
> delimited, and documented) sections of encoded text in different
> encodings.
>
> Allow me to disagree. For me, this is a text file which I can (and
> do) view with a pager, edit with a text editor, list on a console,
> search with grep and so on. If it is not a text file by strict Python3
> standards then these standards are too strict for me. Either I find a
> simple workaround in Python3 to work with such texts or find a different
> tool. I cannot avoid such files because my reality is much more complex
> than strict text/binary dichotomy in Python3.
>
> First -- we're getting OT here -- this thread was about file and path
names, not the contents of files. But I suppose I brought that in when I
talked about writing file names to files...
The first I'll mention is the one that follows from my description of what
> your file really is: Python3 allows opening files in binary mode, and then
> decoding various sections of it using whatever encoding you like, using the
> bytes.decode() operation on various sections of the file. Determination of
> which sections are in which encodings is beyond the scope of this
> description of the technique, and is application dependent.
>
right -- and you would have wanted to open such file in binary mode with
py2 as well, but in that case, you's have the contents in py2 string
object, which has a few more convenient ways to work with text (at least
ascii-compatible) than the py3 bytes object does.
The third is to specify the UTF-8 with the surrogate escape error handler.
> This allows non-UTF-8 codes to be loaded into memory. You, or algorithms as
> smart as you, could perhaps be developed to detect and manipulate the
> resulting "lone surrogate" codes in meaningful ways, or could simply allow
> them to ride along without interpretation, and be emitted as the original,
> into other files.
>
Just so I'm clear here -- if you write that back out, encoded as utf-8 --
you'll get the exact same binary blob out as came in?
I wonder if this would make it hard to preserve byte boundaries, though.
By the way, IIUC correctly, you can also use the python latin-1 decoder --
anything latin-1 will come through correctly, anything not valid latin-1
will come in as garbage, but if you re-encode with latin-1 the original
bytes will be preserved. I think this will also preserve a 1:1 relationship
between character count and byte count, which could be handy.
-Chris
--
Christopher Barker, Ph.D.
Oceanographer
Emergency Response Division
NOAA/NOS/OR&R (206) 526-6959 voice
7600 Sand Point Way NE (206) 526-6329 fax
Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker at noaa.gov
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-dev/attachments/20140822/87d8b65a/attachment.html>
More information about the Python-Dev
mailing list