Fastest way to calculate leading whitespace

dasacc22 dasacc22 at gmail.com
Sun May 9 14:59:23 EDT 2010


On May 9, 8:28 am, John Machin <sjmac... at lexicon.net> wrote:
> dasacc22 <dasacc22 <at> gmail.com> writes:
>
> > You presume entirely too much. I have a preprocessor that normalizes
> > documents while performing other, more complex operations. There's
> > nothing buggy about what I'm doing.
>
> Are you sure?
>
> Your "solution" calculates (the number of leading whitespace characters) + (the
> number of TRAILING whitespace characters).
>
> Problem 1: including TRAILING whitespace.
> Example: "content" + 3 * " " + "\n" has 4 leading spaces according to your
> reckoning; should be 0.
> Fix: use lstrip() instead of strip()
>
> Problem 2: assuming all whitespace characters have *effective* width the same as
> " ".
> Examples: TAB has width 4 or 8 or whatever you want it to be. There are quite a
> number of whitespace characters, even when you stick to ASCII. When you look at
> Unicode, there are heaps more. Here's a list of BMP characters such that
> character.isspace() is True, showing the Unicode codepoint, the Python repr(),
> and the name of the character (other than for control characters):
>
> U+0009 u'\t' ?
> U+000A u'\n' ?
> U+000B u'\x0b' ?
> U+000C u'\x0c' ?
> U+000D u'\r' ?
> U+001C u'\x1c' ?
> U+001D u'\x1d' ?
> U+001E u'\x1e' ?
> U+001F u'\x1f' ?
> U+0020 u' ' SPACE
> U+0085 u'\x85' ?
> U+00A0 u'\xa0' NO-BREAK SPACE
> U+1680 u'\u1680' OGHAM SPACE MARK
> U+2000 u'\u2000' EN QUAD
> U+2001 u'\u2001' EM QUAD
> U+2002 u'\u2002' EN SPACE
> U+2003 u'\u2003' EM SPACE
> U+2004 u'\u2004' THREE-PER-EM SPACE
> U+2005 u'\u2005' FOUR-PER-EM SPACE
> U+2006 u'\u2006' SIX-PER-EM SPACE
> U+2007 u'\u2007' FIGURE SPACE
> U+2008 u'\u2008' PUNCTUATION SPACE
> U+2009 u'\u2009' THIN SPACE
> U+200A u'\u200a' HAIR SPACE
> U+200B u'\u200b' ZERO WIDTH SPACE
> U+2028 u'\u2028' LINE SEPARATOR
> U+2029 u'\u2029' PARAGRAPH SEPARATOR
> U+202F u'\u202f' NARROW NO-BREAK SPACE
> U+205F u'\u205f' MEDIUM MATHEMATICAL SPACE
> U+3000 u'\u3000' IDEOGRAPHIC SPACE
>
> Hmmm, looks like all kinds of widths, from zero upwards.
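
For reference, a list like the one quoted above can be regenerated with a few
lines. This is only a sketch, written for Python 3 (the quoted output uses
Python 2 u'' literals). The function name bmp_whitespace is made up for
illustration, and the "?" placeholder for unnamed control characters is an
assumption about how the quoted list was produced. The exact set of characters
depends on the Unicode database shipped with the interpreter, so it may not
match the quoted list entry for entry.

```python
import unicodedata


def bmp_whitespace():
    """Return (codepoint, repr, name) for each BMP character where isspace() is True."""
    rows = []
    for cp in range(0x10000):  # Basic Multilingual Plane: U+0000..U+FFFF
        ch = chr(cp)
        if ch.isspace():
            # Control characters have no Unicode name, so fall back to "?".
            name = unicodedata.name(ch, "?")
            rows.append(("U+%04X" % cp, repr(ch), name))
    return rows


if __name__ == "__main__":
    for row in bmp_whitespace():
        print(" ".join(row))
```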

Unfortunately, I paired the solution with an example string that would never
reach it in the state I typed it in, trailing whitespace included.

This is my fault.
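
To make Problem 1 and Problem 2 from the quoted message concrete, here is a
small sketch. The helper names are hypothetical, and the default tab size of 8
is an assumption. It compares the strip() count with the lstrip() fix, then
measures leading width in columns rather than in characters.

```python
def buggy_count(line):
    """strip() trims both ends, so trailing whitespace is counted too."""
    return len(line) - len(line.strip())


def leading_ws_count(line):
    """Number of leading whitespace characters; trailing ones are ignored."""
    return len(line) - len(line.lstrip())


def leading_ws_width(line, tabsize=8):
    """Leading width in columns, with tabs expanded to `tabsize` stops."""
    expanded = line.expandtabs(tabsize)
    return len(expanded) - len(expanded.lstrip())


example = "content" + 3 * " " + "\n"
print(buggy_count(example))       # 4: three spaces plus the newline
print(leading_ws_count(example))  # 0
print(leading_ws_width("\t  x"))  # 10: one tab (8 columns) plus two spaces
```

Note that expandtabs() only deals with TAB. Every other whitespace character
in the quoted list still counts as one column here, whatever its real width.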


