Python3 - encoding issues

Sat Nov 28 21:32:09 EST 2009

Hello,

First I must beg the pardon of those of you whose mailboxes were
flooded by my last announcement of depikt. I do not receive emails from
this list myself, and when I made my corrections and posted each of the
slightly improved versions, I was not aware of the extra emails that
this produces. Sorry!

I read here recently that some regard Python 3 as worse at encoding
issues than earlier versions. For me, a German, quite the contrary is
true. The automatic conversion without an exception in versions before 3
has caused pain upon pain over the last years. Only a few weeks ago
pygtk suddenly returned UTF-8 bytes instead of unicode, and my
software delivered a lot of garbled, automatically written emails
before I saw the mess. Python 3 would have raised exceptions. However,
the translation of my software to 3 has only just begun.
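
A minimal sketch of the difference described above, assuming Python 3
and a made-up sample word: mixing str and bytes raises a TypeError
instead of being converted silently.

```python
text = "Grüße"                   # str: a sequence of Unicode code points
data = text.encode("utf-8")      # bytes: the UTF-8 encoded form

try:
    combined = text + data       # no implicit conversion is attempted in Python 3
    mixing_raised = False
except TypeError:
    mixing_raised = True         # the mistake surfaces immediately

# The conversion has to be spelled out explicitly, in one direction or the other.
decoded = data.decode("utf-8")
char_count, byte_count = len(text), len(data)   # the umlaut and the sharp s take two bytes each
```

The two lengths differ because len counts code points for str and
bytes for bytes.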

Now there is a concept of two separate worlds, str and bytes, and I
have decided to use bytes for my software. That is the representation
output needs anyway, and with depikt and a modified apsw (file reads
deliver bytes anyway) or other database APIs (internally they all
understand UTF-8) I can get UTF-8 for all input too.
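
As a rough illustration of the bytes-only approach, here is a sketch
that uses only the standard library; the scratch file and the sample
value are assumptions made for this example, and depikt and apsw are
not involved. Files opened in binary mode accept and return bytes
unchanged, so text is encoded once at the edge of the program.

```python
import os
import tempfile

payload = "Zürich".encode("utf-8")    # encode once, at the edge of the program

# Hypothetical scratch file, used only for this illustration.
fd, path = tempfile.mkstemp()
os.close(fd)
try:
    with open(path, "wb") as out:     # binary mode: bytes go out unchanged
        out.write(payload)
    with open(path, "rb") as src:     # binary mode: bytes come back unchanged
        roundtrip = src.read()
finally:
    os.remove(path)
```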

This means that I do not have the standard string methods, but
substitutes are easily made. Not as a subclass of bytes, since that
would not have the b"...." literal initialization; thus only in the
form of functions. Here are some of my utools:

u0 = "".encode('utf-8')    # the empty bytes value used by these helpers
def u(s):
    # numbers and type objects are converted to their str form first
    if type(s) in (int, float, type): s = str(s)
    if type(s) == str: return s.encode("utf-8")
    if type(s) == bytes: # we keep the two worlds cleanly separated
        raise TypeError(b"argument is bytes already")
    raise TypeError(b"Bad argument for utf-encoding")

def u_startswith(s, test):
    # compare the leading slice: always returns a bool, needs no bare
    # except, and does not search the whole of s
    return s[:len(test)] == test

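def u_endswith(s, test):
    # s[-len(test):] would be all of s for an empty test, so slice from
    # an explicit start position instead
    if len(test) > len(s): return False
    return s[len(s) - len(test):] == test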

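def u_split(s, splitter):
    if not splitter: raise ValueError(b"empty splitter")
    ret = []
    while s and splitter in s:
        if u_startswith(s, splitter):    # skip leading and repeated splitters
            s = s[len(splitter):]; continue
        pos = s.index(splitter)
        ret.append(s[:pos])    # += would add the integer byte values instead
        s = s[pos + len(splitter):]    # go on behind the splitter
    return ret + [s]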

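def u_join(joiner, l):
    l = list(l)
    # fold the first two items together until at most one is left
    while len(l) > 1:
        l = [l[0]+joiner+l[1]]+l[2:]
    if l: return l[0]    # the joined bytes, not a one-element list
    return u0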

(Not all of them have the standard signatures.) Writing them is
trivial. Note u0: unfortunately b"" does not work at all as I expected,
as I had to learn the hard way.
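
For comparison with the standard signatures mentioned above, here is a
small sketch of how the built-in bytes methods of Python 3 behave on the
same kind of input. The sample values are made up for illustration.
Arguments have to be bytes as well, so the two worlds stay separated
here too.

```python
data = "Grüße, Welt".encode("utf-8")    # made-up sample value

starts = data.startswith("Grüße".encode("utf-8"))
ends = data.endswith(b"Welt")
parts = data.split(b", ")
joined = b", ".join(parts)

# Unlike a splitter that skips repeated separators, the built-in split
# keeps the empty pieces between them.
with_empties = b",a,,b".split(b",")

# A str argument is rejected instead of being converted silently.
try:
    data.split(", ")
    mixed_rejected = False
except TypeError:
    mixed_rejected = True
```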

Looking more closely at these functions, one sees that they use only
the sequence protocol. "index" is in the sequence protocol too now; the
library reference still has to be updated there. Thus all of these, and
many more string methods, could be added to the sequence protocol
without much work, and then nobody would have to write all this. This
does not only affect string-like objects: split and join for lists
could open up interesting possibilities, for example for list
representations of trees.
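
To make the idea concrete, here is a minimal sketch of split and join
written against nothing but len, slicing, == and +, so that the same
code works for lists, tuples, bytes and str. The names seq_find,
seq_split and seq_join are hypothetical and exist only for this
illustration; they are not an existing or proposed API.

```python
def seq_find(seq, sub, start=0):
    """Return the first index >= start where sub occurs as a contiguous run, or -1."""
    n = len(sub)
    for i in range(start, len(seq) - n + 1):
        if seq[i:i + n] == sub:
            return i
    return -1


def seq_split(seq, sep):
    """Split seq at every occurrence of the subsequence sep."""
    if len(sep) == 0:
        raise ValueError("empty separator")
    parts = []
    start = 0
    while True:
        pos = seq_find(seq, sep, start)
        if pos < 0:
            parts.append(seq[start:])    # the rest after the last separator
            return parts
        parts.append(seq[start:pos])
        start = pos + len(sep)


def seq_join(sep, parts):
    """Concatenate parts, placing sep between neighbouring items."""
    parts = list(parts)
    if not parts:
        return sep[:0]                   # empty sequence of the same type as sep
    result = parts[0]
    for part in parts[1:]:
        result = result + sep + part
    return result


# A flat list with 0 as separator (made-up data), split into groups and rebuilt.
flat = [1, 2, 0, 3, 0, 4]
groups = seq_split(flat, [0])
rebuilt = seq_join([0], groups)
```

The separator has to be of the same type as the sequence, because a
list slice never compares equal to a tuple or to bytes.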

Does anybody want to make a PEP from this (I won't do so)?

Joost Behrends