[Python-Dev] Bytes path support

Fri Aug 22 20:50:05 CEST 2014

On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
> >On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
> >>On 8/22/2014 8:51 AM, Oleg Broytman wrote:
> >>>    What encoding does have a text file (an HTML, to be precise) with
> >>>text in utf-8, ads in cp1251 (ad blocks were included from different
> >>>files) and comments in koi8-r?
> >>>    Well, I must admit the HTML was rather an exception, but having a
> >>>text file with some strange characters (binary strings, or paragraphs
> >>>in different encodings) is not that exceptional.
> >>That's not a text file. That's a binary file containing (hopefully
> >>delimited, and documented) sections of encoded text in different
> >>encodings.
> >    Allow me to disagree. For me, this is a text file which I can (and
> >do) view with a pager, edit with a text editor, list on a console,
> >search with grep and so on. If it is not a text file by strict Python3
> >standards then these standards are too strict for me. Either I find a
> >simple workaround in Python3 to work with such texts or find a different
> >tool. I cannot avoid such files because my reality is much more complex
> >than strict text/binary dichotomy in Python3.
> 
> I was not declaring your file not to be a "text file" from any
> definition obtained from Python3 documentation, just from a common
> sense definition of "text file".

   And in my opinion those files are perfect text. The files consist of
lines separated by EOL characters (not necessarily the EOL characters of
my OS, because the file could have been produced on a different OS),
lines consist of words, and words of characters.

> Looking at it from Python3, though, it is clear that when opening a
> file in "text" mode, an encoding may be specified or will be
> assumed.  That is one encoding, applying to the whole file, not 3
> encodings, with declarations on when to switch between them. So I
> think, in general, Python3 assumes or defines a definition of text
> file that matches my "common sense" definition.

   I don't have problems with Python3 text. I have problems with Python3
trying to get rid of byte strings and treating bytes as strictly non-text.

> On the other hand, Python3 provides various facilities for working
> with such files.
> 
> The first I'll mention is the one that follows from my description
> of what your file really is: Python3 allows opening files in binary
> mode, and then decoding various sections of it using whatever
> encoding you like, using the bytes.decode() operation on various
> sections of the file. Determination of which sections are in which
> encodings is beyond the scope of this description of the technique,
> and is application dependent.

   This is perhaps the most promising approach. If I can open a text
file in binary mode, iterate over it line by line, split every line of
non-ascii bytes with .split() and process the pieces, that'd satisfy my
needs.
   But still there are dragons. If I read a filename from such a file I
read it as bytes, not str, so I can only use low-level APIs to
manipulate those filenames. Pity.
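
   A minimal sketch of that binary-mode approach, assuming the lines end
in "\n" or "\r\n" and the words are separated by ASCII whitespace. The
function name words_per_line and the in-memory sample are made up for
illustration; a real file opened with open(path, "rb") iterates the same
way.

```python
import io

def words_per_line(binary_stream):
    """Yield a list of bytes 'words' for every line of a binary stream."""
    for raw_line in binary_stream:        # binary streams split lines on b"\n"
        line = raw_line.rstrip(b"\r\n")   # tolerate EOLs from another OS
        yield line.split()                # splits on ASCII whitespace only

# Hypothetical sample: one line in utf-8, the next one in cp1251.
sample = io.BytesIO(
    "привет мир\r\n".encode("utf-8") + "привет мир\n".encode("cp1251")
)
rows = list(words_per_line(sample))
```

   Every word stays bytes, so nothing is decoded until the encoding of a
particular piece is known.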

   Let's look at a perfectly normal situation I am quite often in. A person
sent me a directory full of MP3 files. The transport doesn't matter; it
could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
matters is that filenames and content are in alien encodings. Most often
it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
utf8. There is a playlist among the files -- a text file that lists MP3
files, every file on a single line; usually with full paths
("C:\Audio\some.mp3").
   Now I want to read filenames from the file and process the filenames
(strip paths) and the files (verify that the files exist, or renumber
them, or extract ID3 tags [Russian ID3 tags, whatever the ID3 standard
says, are also in cp1251 or utf-8]... whatever). I don't know the
encoding of the playlist, but I know it corresponds to the encoding of
the filenames, so I can expect those files to exist on my filesystem;
they have strange-looking unreadable names, but they exist.
   This is just a small example of why I do want to process filenames
from a text file in an alien encoding, without knowing the encoding in
advance.
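
   A rough sketch of the playlist case done entirely in bytes, assuming a
POSIX filesystem where filenames are plain byte strings and an encoding
(such as cp1251, koi8-r or utf-8) in which the backslash byte never
appears inside a character. The helper names playlist_basenames and
missing_files are hypothetical; ntpath and os.path are standard library
modules whose functions accept bytes as well as str.

```python
import ntpath
import os

def playlist_basenames(playlist_bytes):
    """Return the file-name part of every non-empty playlist line, as bytes."""
    names = []
    for raw_line in playlist_bytes.splitlines():
        line = raw_line.strip()
        if line:
            # ntpath parses "C:\..." paths on any OS and works on bytes.
            names.append(ntpath.basename(line))
    return names

def missing_files(directory, names):
    """Return the names (bytes) that do not exist inside `directory`."""
    directory = os.fsencode(directory)    # accept a str or bytes directory
    return [n for n in names
            if not os.path.exists(os.path.join(directory, n))]

# Hypothetical playlist content in cp1251; the encoding is never consulted.
playlist = "C:\\Audio\\песня.mp3\r\n".encode("cp1251")
names = playlist_basenames(playlist)
```

   The names never pass through str, so the encoding does not have to be
known as long as the playlist and the filenames agree byte for byte.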

> The second is to specify an error handler, that, like you, is
> trained to recognize the other encodings and convert them
> appropriately. I'm not aware that such an error handler has been or
> could be written, myself not having your training.
> 
> The third is to specify the UTF-8 with the surrogate escape error
> handler. This allows non-UTF-8 codes to be loaded into memory. You,
> or algorithms as smart as you, could perhaps be developed to detect
> and manipulate the resulting "lone surrogate" codes in meaningful
> ways, or could simply allow them to ride along without
> interpretation, and be emitted as the original, into other files.

   Yes, these are different workarounds.
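
   For reference, a small sketch of the third workaround (UTF-8 with the
surrogateescape error handler), assuming the data is simply carried
through and written back unchanged. The sample bytes are invented for
illustration; the same errors="surrogateescape" argument can be passed
to open() in text mode.

```python
# Hypothetical mixed content: cp1251 bytes followed by valid utf-8.
raw = "имя".encode("cp1251") + b" / " + "имя".encode("utf-8")

# Bytes that are not valid utf-8 become lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")

# Count the escaped bytes that are riding along undecoded.
escaped = sum(1 for ch in text if "\udc80" <= ch <= "\udcff")

# Encoding with the same handler restores the original bytes exactly.
restored = text.encode("utf-8", errors="surrogateescape")
```

   The valid utf-8 part is readable as ordinary str, while the cp1251
part survives the round trip without being interpreted.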

Oleg.
-- 
     Oleg Broytman            http://phdru.name/            phd at phdru.name
           Programmers don't die, they just GOSUB without RETURN.