[Python-Dev] Multilingual programming article on the Red Hat Developer blog

Wed Sep 17 03:14:15 CEST 2014

On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmurray at bitdance.com> wrote:
> Yes.  I thought you were saying that one could not treat the string with
> smuggled bytes as if it were a string.  (It's a string that can't be
> encoded unless you use the surrogateescape error handler, but it is
> still a string from Python's POV, which is the point of the error
> handler).
>
> Or, to put it another way, your implication was that there were no
> string operations that could be usefully applied to a string containing
> smuggled bytes, but that is not the case.  (I may well have read an
> implication that was not there; if so I apologize and you can ignore the
> rest of this :)

Ahh, I see where we are getting confused. What I said was that you
can't treat the string as a *pure* Unicode string. Parts of it are
Unicode text, parts of it aren't.
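
To make that concrete, here is a rough sketch. The header line is a made-up example, and I'm assuming the data was decoded as UTF-8 with the surrogateescape handler. The result is a str that ordinary string methods work on, but it can't be encoded without the same handler:

```python
# Hypothetical input: 0xE9 is Latin-1 "e-acute", which is not valid UTF-8 here.
raw = b"Subject: caf\xe9 menu"

# The undecodable byte is smuggled in as the lone surrogate U+DCE9.
text = raw.decode("utf-8", errors="surrogateescape")

# Ordinary string operations still work on the result.
name, _, value = text.partition(": ")
print(name)          # the ASCII part is plain text
print(ascii(value))  # ascii() shows the smuggled byte as an escape

# A strict encode refuses the lone surrogate...
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# ...but the same handler turns it back into the original byte.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(strict_ok, round_trip == raw)
```

So it's a string from Python's point of view, just not one that is Unicode text all the way through.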

> Basically, we are pretending that each smuggled
> byte is a single character for string parsing purposes...but they don't
> match any of our parsing constants.  They are all "any character" matches
> in the regexes and what have you.

This is slightly iffy, as you can't be sure that one byte represents
one character, but as long as you don't much care about that, it's not
going to be an issue. I'm fairly sure you're never going to find an
encoding in which one unknown byte represents two characters, but
there are cases where it takes more than one byte to make up a
character (or the bytes are just shift codes or something). Does that
ever throw off your regexes? It wouldn't be an issue for a .* between
two character markers, but if you ever say .{5} then it might match
incorrectly, because it would be counting smuggled bytes rather than
characters.
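
A quick sketch of what I mean, assuming (purely for illustration) UTF-8 data that got decoded as ASCII with surrogateescape, and angle brackets standing in for the two markers:

```python
import re

# Hypothetical data: a five-character word between two ASCII markers.
raw = "<na\u00efve>".encode("utf-8")   # the i-diaeresis takes two bytes in UTF-8

proper = raw.decode("utf-8")                                # 5 characters inside
smuggled = raw.decode("ascii", errors="surrogateescape")   # 6 code points inside

# A .* between the markers is happy either way.
inner = re.search(r"<(.*)>", smuggled).group(1)
print(len(inner))

# A counted repeat sees each smuggled byte as one "character".
match_proper = re.fullmatch(r"<.{5}>", proper)
match_smuggled = re.fullmatch(r"<.{5}>", smuggled)
print(match_proper is not None, match_smuggled is not None)
```

The counted match only goes wrong when the count is meant in characters and the unknown part is a multi-byte encoding.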

I think we're in agreement here, just using different words. :)

ChrisA