Fwd: Lossless bulletproof conversion to unicode (backslashing) (fwd)

Fri May 29 05:46:01 EDT 2015

On Fri, May 29, 2015 at 11:41 AM, Laura Creighton <lac at openend.se> wrote:
> In a message of Fri, 29 May 2015 11:05:07 +0300, anatoly techtonik writes:
>
>>Added Mailman to my suxx tracker:
>>https://github.com/techtonik/suxx-tracker#mailman
>
> You are damning the wrong piece of software -- this is not a problem
> with mailman; mailman doesn't care at all what software you use to
> read mail and reply to it with.  The problem is with the various
> readers and repliers that people are using.  In particular, people on
> the other side of the usenet -> python-list gateway may not be seeing
> this as mail at all, or sending their replies as mail.

Sounds legit. But the middle "ux" in suxx stands for user experience,
and Mailman still doesn't improve it. If Mailman could subscribe
me automatically to the thread I am starting, that would resolve
all the problems.

> But back to your original problem.
>
> I still don't understand why you need to go from some lossless
> representation of your filename, back to the original.

It just so happens that the only way to get the graph out of SCons
is to print its tree representation. That worked fine until we
switched from StringIO to its unicode equivalent, io.StringIO.
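
To illustrate the kind of failure (a sketch only, not the actual SCons
code; write_node is a hypothetical helper and the file name is made up):
io.StringIO accepts only text, so a byte-string node name is rejected,
and a strict 'ascii' decode of UTF-8 Cyrillic bytes fails as well.

```python
import io


def write_node(buf, name):
    """Hypothetical helper: try to write a node name to a text buffer."""
    try:
        buf.write(name)
        return "ok"
    except TypeError:
        # io.StringIO refuses byte strings on both Python 2 and Python 3.
        return "TypeError"


# Made-up node name: the Cyrillic word for "file", encoded as UTF-8.
name = b"\xd1\x84\xd0\xb0\xd0\xb9\xd0\xbb"

status = write_node(io.StringIO(), name)

try:
    name.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # Every byte here is >= 0x80, so a strict ASCII decode cannot succeed.
    ascii_ok = False
```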

Dumping binary data in text form is a very common and reliable
way to back up and process data. From SQL dumps to SVN dumps,
all these formats are convenient to store, transmit and process.

> You start
> with the binary version of the filename  -- a series of bytes which
> turns out to be good Cyrillic text, but could be anything.

Right, good Cyrillic text in utf-8. Python 2.x uses 'ascii' as its
default encoding; if it used 'utf-8' instead, there wouldn't be an
issue. For now. But I realize that this is not enough: I want 100%
protection from unwanted crashes and data loss, so I want to
backslash-escape non-utf-8 bytes when converting the data to unicode.
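
Roughly the behaviour I am after, as a sketch under my own assumptions
rather than a finished solution: valid UTF-8 is decoded normally, every
undecodable byte becomes a literal \xNN, and literal backslashes are
doubled so the result stays unambiguous. The handler name
"backslash_bytes" and the functions to_unicode / from_unicode are
names made up for this example; codecs.register_error is the standard
hook for custom error handlers.

```python
import codecs
import re


def _backslash_invalid(exc):
    """Decode error handler: replace each undecodable byte with \\xNN."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    return u"".join(u"\\x%02x" % b for b in bad), exc.end


# "backslash_bytes" is a name chosen for this sketch.
codecs.register_error("backslash_bytes", _backslash_invalid)


def to_unicode(raw):
    """Bytes -> unicode without raising and without losing information."""
    # Double literal backslashes first, so they cannot be confused with
    # the \xNN escapes produced by the error handler.
    return raw.replace(b"\\", b"\\\\").decode("utf-8", "backslash_bytes")


_ESCAPE = re.compile(br"\\(\\|x([0-9a-f]{2}))")


def from_unicode(text):
    """Inverse of to_unicode: recover the original bytes."""
    def unescape(match):
        if match.group(2) is None:
            return b"\\"  # a doubled backslash
        return bytes(bytearray([int(match.group(2), 16)]))
    return _ESCAPE.sub(unescape, text.encode("utf-8"))


# b"\xd0\xbf" is a valid Cyrillic letter; b"\xff" is never valid UTF-8.
sample = b"\xd0\xbf\xff"
assert from_unicode(to_unicode(sample)) == sample
```

Doubling the backslashes is what keeps the dump reversible: without it,
the raw byte 0xff and the four literal characters \xff would produce the
same text.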

> You store
> that as the first so many bytes of your file. If ever you need to have
> the original representation of your filename, you already have it,
> right there, by reading the first so many bytes of your file.  Why
> care about what the user sees as a filename?

Not sure that I understand. I don't store anything in a file. The build
graph is a representation of the filesystem structure with entries that
may or may not exist. A Node in the build graph can also be a string
that is never written to disk. When I dump the graph, I have no idea how
I will process it, but when I need to identify some Node, grep for it,
or find a reference to it, I want its representation (which may as well
serve as an ID) to be preserved, to avoid conflicts and wrong
interpretation due to data loss.

Hopefully my user story is clear now. Can you tell me how I can
do this bulletproof unicode conversion in Python 2? =)
-- 
anatoly t.