[Tutor] Logical error?

Sun May 4 05:54:22 CEST 2014

>> A small note about performance here. If your log files are very large
>> (say, hundreds of thousands or millions of lines) you will find that
>> this part is *horribly horrible slow*. There's two problems, a minor and
>> a major one.
>
> Ah, actually I was mistaken about that. I forgot that for built-in
> lists, += augmented assignment is equivalent to calling list.extend(),
> so it actually does make the modifications in place. So I was wrong to
> say:

No problem.  Linus's Law wins again.  :P
(http://en.wikipedia.org/wiki/Linus's_Law)  Thank goodness for public
mailing lists where we can share our successes and learning
experiences together!

With regard to the earlier part about doing the decoding at the call
to open(), rather than on each individual line: next time you'll want
to make the point that it's better to do so at open() time not because
it's more efficient, but because it's more correct.  Correctness needs
to be the winning argument here.
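
For concreteness, here is a rough sketch of what decoding at open() time
looks like.  The sample text and the temporary file are made up purely
for illustration, and I'm using io.open() on the assumption that the code
should run on both Python 2 and Python 3 (on Python 3 the built-in open()
accepts the same encoding argument).

```python
import io
import os
import tempfile

# Hypothetical sample data: two log lines containing non-ASCII text.
text = u"caf\u00e9 opened\nna\u00efve request\n"

# Write the sample as UTF-16 to a temporary file (a stand-in for a real log).
fd, path = tempfile.mkstemp(suffix=".log")
os.close(fd)
with io.open(path, "w", encoding="utf-16", newline="") as f:
    f.write(text)

# Decode once, at open() time; iterating then yields already-decoded lines.
with io.open(path, encoding="utf-16") as f:
    lines = [line.rstrip(u"\n") for line in f]

os.remove(path)
print(lines)
```

The file object does the line splitting after decoding, so the program
never has to look for newlines in raw bytes.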

Encoding is a property of the entire file, not a property of
individual lines.  In fact, we can get into trouble by doing the
decoding piecewise across lines because certain encodings are
multibyte in nature.  What this means is that what might look like a
newline in the uninterpreted bytes of a file may be deceptive: that
"newline" byte might actually be part of a multibyte character!

Let's see if we can construct an example to demonstrate.

########################################################################
for encoding in ('utf-8', 'utf-16', 'utf-32'):
  for i in range(0x110000):
    aChar = chr(i)  ## Python 3 spelling of unichr()
    try:
      someBytes = aChar.encode(encoding)
    except UnicodeEncodeError:
      ## Lone surrogates (U+D800 through U+DFFF) can't be encoded.
      ## This is toy code, and we're just exploring, so skip them.
      continue
    if b'\n' in someBytes:
      print("%r contains a newline in its bytes encoded with %s" %
            (aChar, encoding))
########################################################################

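This toy code goes through all possible Unicode code points and
encodes each one with three different codecs.  We look to see if any of
the encoded characters have newline bytes in them, and report.  Try
running it.  Notice how many characters start being reported once we
get to utf-16 and utf-32.  :P  (utf-8 is designed so that the newline
byte only ever stands for an actual newline, so it reports just the
one.)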

Hopefully, this makes the point clearer: we must not try to decode
individual lines.  By that time, the damage has been done: breaking
the file into lines by looking naively for newline bytes is invalid
when the encoded form of other characters can itself contain a
newline byte.
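
To make the damage concrete, here is a small sketch under a couple of
assumptions: the two-line "file" below is invented for illustration, and
I picked utf-16-le (no byte order mark) so the bytes are easy to follow.
The first line contains U+010A, whose utf-16-le encoding is the byte
pair 0x0A 0x01.

```python
# Hypothetical two-line "file": the first line starts with U+010A.
text = u"\u010aa\nb\n"
raw = text.encode("utf-16-le")

# Wrong: split the raw bytes on the newline byte, then decode each piece.
pieces = raw.split(b"\n")
broken = []
for piece in pieces:
    try:
        broken.append(piece.decode("utf-16-le"))
    except UnicodeDecodeError:
        # This piece is not valid utf-16-le on its own.
        broken.append(None)

# Right: decode everything first, then split the text into lines.
correct = raw.decode("utf-16-le").splitlines()

print(pieces)
print(broken)
print(correct)
```

The decode-first path gives back the two original lines, while the
byte-level split produces four pieces, some of which cannot be decoded
at all.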