[issue17445] Return the type you accept

Greg Ward report at bugs.python.org
Mon Mar 18 19:50:27 CET 2013


Greg Ward added the comment:

Replying to Terry Reedy:
> So a dual string/bytes function would not be completely trivial.

Correct. I have one working, but it makes my eyes bleed. I feel ashamed to have written it.

> Greg, can you convert bytes to strings, or strings to bytes

Nope. Here is the hypothetical use case: I have a text file written in Polish, encoded in ISO-8859-2, committed to a Mercurial repository. (Or saved in a filesystem somewhere: doesn't really matter, except that Mercurial repositories are immutable, long-term, and *must* *not* *lose* *data*.) Then I decide I should play nicely with the rest of the world and transcode to UTF-8, so I commit a new rev in UTF-8.

Years later, I need to look at the diff between those two old revisions. Rev 1 is a pile of ISO-8859-2 bytes, and rev 2 is a pile of UTF-8 bytes. The output of diff looks like

  - blah blah [iso-8859-2 bytes] blah
  + blah blah [utf-8 bytes] blah

Note this: the output of diff has some lines that are iso-8859-2 bytes and some that are utf-8 bytes. *There is no single encoding* that applies.

Note also that diff output must contain the exact original bytes, so that it can be consumed by patch. Diffs are read both by humans and by machines.
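
To make the mixed-encoding point concrete, here is a small sketch. The sample line and the Polish word in it are made up for illustration; only the two encodings come from the use case above.

```python
# Hypothetical sample line; "żółć" is just an illustrative Polish word.
text = "blah blah żółć blah\n"

rev1 = text.encode("iso-8859-2")   # bytes as stored in the old revision
rev2 = text.encode("utf-8")        # bytes as stored in the new revision

# A stand-in for diff output: one removed line and one added line.
diff_output = b"- " + rev1 + b"+ " + rev2


def decodes_to_original(data, encoding):
    """Return True only if `data` decodes cleanly back to `text`."""
    try:
        return data.decode(encoding) == text
    except UnicodeDecodeError:
        return False


def decodes_cleanly(data, encoding):
    """Return True if `data` can be decoded at all with `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


# Each revision is fine under its own encoding and wrong under the other.
print(decodes_to_original(rev1, "iso-8859-2"), decodes_to_original(rev1, "utf-8"))
print(decodes_to_original(rev2, "utf-8"), decodes_to_original(rev2, "iso-8859-2"))

# The combined diff output is not valid UTF-8 at all.
print(decodes_cleanly(diff_output, "utf-8"))
```

Decoding the whole diff with either encoding therefore either fails outright or silently corrupts one of the two lines, which is why the bytes have to pass through untouched.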

> Otherwise, I think it might be better to write a new function 
> 'unified_diff_bytes' that did exactly what you want than to try to 
> make unified_diff accept sequences of bytes.

Good idea. That might be much less revolting than what I have now. I'll give it a shot.
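
One way such a function might look, as a rough sketch only and not the code referred to above: decode every byte string with the "surrogateescape" error handler so that non-ASCII bytes survive a round trip through str, run the ordinary difflib.unified_diff, and encode the result back. The name unified_diff_bytes comes from the suggestion quoted above; its signature here is an assumption, and the sample revisions are invented.

```python
import difflib


def unified_diff_bytes(a, b, fromfile=b"", tofile=b"", n=3, lineterm=b"\n"):
    """Hypothetical bytes-in, bytes-out wrapper around difflib.unified_diff.

    `a` and `b` are lists of bytes lines in any encoding, or in several.
    """
    def to_str(data):
        # ASCII bytes decode normally; every other byte becomes a lone
        # surrogate code point that encodes back to the same byte.
        return data.decode("ascii", "surrogateescape")

    def to_bytes(s):
        return s.encode("ascii", "surrogateescape")

    lines = difflib.unified_diff(
        [to_str(line) for line in a],
        [to_str(line) for line in b],
        fromfile=to_str(fromfile),
        tofile=to_str(tofile),
        n=n,
        lineterm=to_str(lineterm),
    )
    for line in lines:
        yield to_bytes(line)


# Usage with two invented one-line revisions in different encodings.
old = ["blah blah żółć blah\n".encode("iso-8859-2")]
new = ["blah blah żółć blah\n".encode("utf-8")]
for line in unified_diff_bytes(old, new, b"rev1", b"rev2"):
    print(line)
```

The "-" line comes out carrying the exact ISO-8859-2 bytes and the "+" line the exact UTF-8 bytes, so the result stays usable as input to patch. No encoding is ever guessed; the str values exist only inside the wrapper.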

----------

_______________________________________
Python tracker <report at bugs.python.org>
<http://bugs.python.org/issue17445>
_______________________________________

