Fastest way to calculate leading whitespace

John Machin sjmachin at lexicon.net
Sun May 9 09:28:14 EDT 2010


dasacc22 <dasacc22 <at> gmail.com> writes:

> 
> You presume entirely too much. I have a preprocessor that normalizes
> documents while performing other more complex operations.  There's
> nothing buggy about what I'm doing

Are you sure?

Your "solution" calculates (the number of leading whitespace characters) + (the
number of TRAILING whitespace characters).

Problem 1: including TRAILING whitespace.
Example: "content" + 3 * " " + "\n" has 4 leading whitespace characters
according to your reckoning (3 trailing spaces plus the newline); should be 0.
Fix: use lstrip() instead of strip().
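
A minimal sketch of the difference. The original "solution" is not quoted in this message, so the strip()-based function below is only an assumption about what it looked like; both function names are made up for illustration.

```python
def leading_ws_with_strip(line):
    # Assumed shape of the original approach: strip() removes whitespace
    # from BOTH ends, so trailing whitespace is counted as well.
    return len(line) - len(line.strip())


def leading_ws_with_lstrip(line):
    # lstrip() removes whitespace from the left end only.
    return len(line) - len(line.lstrip())


line = "content" + 3 * " " + "\n"
print(leading_ws_with_strip(line))   # 4 (three trailing spaces plus the newline)
print(leading_ws_with_lstrip(line))  # 0
```

Note that this still counts characters, not columns, which is what Problem 2 is about.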

Problem 2: assuming all whitespace characters have *effective* width the same as
" ".
Examples: TAB has width 4 or 8 or whatever you want it to be. There are quite a
number of whitespace characters, even when you stick to ASCII. When you look at
Unicode, there are heaps more. Here's a list of BMP characters such that
character.isspace() is True, showing the Unicode codepoint, the Python repr(),
and the name of the character (other than for control characters):

U+0009 u'\t' ?
U+000A u'\n' ?
U+000B u'\x0b' ?
U+000C u'\x0c' ?
U+000D u'\r' ?
U+001C u'\x1c' ?
U+001D u'\x1d' ?
U+001E u'\x1e' ?
U+001F u'\x1f' ?
U+0020 u' ' SPACE
U+0085 u'\x85' ?
U+00A0 u'\xa0' NO-BREAK SPACE
U+1680 u'\u1680' OGHAM SPACE MARK
U+2000 u'\u2000' EN QUAD
U+2001 u'\u2001' EM QUAD
U+2002 u'\u2002' EN SPACE
U+2003 u'\u2003' EM SPACE
U+2004 u'\u2004' THREE-PER-EM SPACE
U+2005 u'\u2005' FOUR-PER-EM SPACE
U+2006 u'\u2006' SIX-PER-EM SPACE
U+2007 u'\u2007' FIGURE SPACE
U+2008 u'\u2008' PUNCTUATION SPACE
U+2009 u'\u2009' THIN SPACE
U+200A u'\u200a' HAIR SPACE
U+200B u'\u200b' ZERO WIDTH SPACE
U+2028 u'\u2028' LINE SEPARATOR
U+2029 u'\u2029' PARAGRAPH SEPARATOR
U+202F u'\u202f' NARROW NO-BREAK SPACE
U+205F u'\u205f' MEDIUM MATHEMATICAL SPACE
U+3000 u'\u3000' IDEOGRAPHIC SPACE
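
A list like the one above can be regenerated with something along these lines. This is a sketch written for Python 3 (so the reprs come out without the u prefix), and bmp_whitespace is a name invented here. The exact set depends on the Python version and the Unicode database it ships with, so the output may not match the list above line for line (newer versions may not report U+200B, for instance).

```python
import unicodedata


def bmp_whitespace():
    """Yield (codepoint, repr, name) for each BMP character whose isspace() is True."""
    for cp in range(0x10000):
        ch = chr(cp)
        if ch.isspace():
            # Control characters have no Unicode name, so fall back to "?".
            yield cp, repr(ch), unicodedata.name(ch, "?")


for cp, rep, name in bmp_whitespace():
    print("U+%04X %s %s" % (cp, rep, name))
```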

Hmmm, looks like all kinds of widths, from zero upwards.
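
If what you actually want is the indentation column rather than a character count, one possible sketch is to look only at leading spaces and TABs and expand the TABs to a chosen tab size. The function name indent_width and the default tabsize of 8 are assumptions for illustration; any other whitespace character is deliberately treated as content here, because its effective width is not well defined.

```python
def indent_width(line, tabsize=8):
    """Return the column of the first character that is not a space or TAB."""
    # Strip only spaces and TABs; other whitespace characters stay put.
    rest = line.lstrip(" \t")
    prefix = line[:len(line) - len(rest)]
    # expandtabs() pads each TAB out to the next multiple of tabsize.
    return len(prefix.expandtabs(tabsize))


print(indent_width("\tx"))                 # 8
print(indent_width("  \tx", tabsize=4))    # 4
print(indent_width("content   \n"))        # 0
```

Whether a line starting with NO-BREAK SPACE or the like should count as indented is a policy decision this sketch does not try to make.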






