Weird problem matching with REs

Andrew Berg bahamutzero8825 at gmail.com
Sun May 29 12:16:35 EDT 2011


On 2011.05.29 10:48 AM, John S wrote:
> Dots don't match end-of-line-for-your-current-OS is how I think of
> it.
IMO, the docs should say the dot matches any character except a line
feed ('\n'), since that is more accurate.
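
A quick way to check this is the sketch below. It is only a minimal
illustration using the standard re module; the variable names are made
up for the example.

```python
import re

dot = re.compile(r".")
dot_all = re.compile(r".", re.DOTALL)

# Without flags, '.' refuses only the line feed.
matches_lf = dot.match("\n") is not None             # False
# A carriage return is an ordinary character as far as '.' is concerned.
matches_cr = dot.match("\r") is not None             # True
# With re.DOTALL, '.' matches the line feed as well.
matches_lf_dotall = dot_all.match("\n") is not None  # True

print(matches_lf, matches_cr, matches_lf_dotall)
```

So on a file with Windows line endings ('\r\n'), the dot still consumes
the '\r' and stops only at the '\n'.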
> True, malformed
> HTML can throw you off, but they can also throw a parser off.
That was part of my point. html.parser.HTMLParser from the standard
library will definitely not work on x264.nl's broken HTML, and fixing it
requires lxml (I'm working with Python 3; I've looked into
BeautifulSoup, and it does not work with Python 3 at all). Admittedly,
fixing x264.nl's HTML only requires one or two lines of code, but really
nasty HTML might require quite a bit of work.
> In your case, and because x264 might change their HTML, I suggest the
> following code, which works great on my system. YMMV. I changed your
> newline matches to use \s and put some capturing parentheses around
> the date, so you could grab it.
I've been meaning to learn how to use parenthesis groups.
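
For reference, here is a minimal sketch of a capturing group combined
with \s. The sample text and the pattern are hypothetical, not the
actual x264.nl markup or the code from the earlier message.

```python
import re

# Hypothetical snippet; the real markup on x264.nl may differ.
sample = "<b>Latest build</b>\r\n  (2011-05-28)"

# \s+ absorbs any mix of spaces, '\r' and '\n' between the two parts.
# The inner, unescaped parentheses form group 1 and capture the date;
# the escaped \( and \) match literal parentheses in the text.
pattern = re.compile(r"Latest build</b>\s+\((\d{4}-\d{2}-\d{2})\)")

match = pattern.search(sample)
build_date = match.group(1) if match else None
print(build_date)
```

match.group(0) would return the whole matched text, while
match.group(1) returns only what the first pair of capturing
parentheses matched.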
> Also, be sure to
> use a raw string when composing REs, so you don't run into backslash
> issues.
How would I do that when grabbing strings from a config file (via the
configparser module)? Or rather, if I have a predefined variable
containing a string, how do I change it into a raw string?
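
For what it's worth, the r prefix only affects how Python reads a string
literal in source code, so text read at run time should not need any
conversion: its backslashes are already literal backslashes. A minimal
sketch, assuming a hypothetical [patterns] section and a made-up date
pattern (interpolation is switched off so that a '%' in a pattern would
not be treated specially):

```python
import configparser
import re

# Hypothetical config text; in practice this would come from a file on disk.
# The raw literal here only stands in for the bytes the file would contain.
config_text = r"""
[patterns]
date = (\d{4})-(\d{2})-(\d{2})
"""

parser = configparser.ConfigParser(interpolation=None)
parser.read_string(config_text)

# The value arrives with its backslashes intact; nothing needs converting.
date_pattern = parser["patterns"]["date"]
same_as_raw_literal = date_pattern == r"(\d{4})-(\d{2})-(\d{2})"

found = re.search(date_pattern, "built 2011-05-29")
parts = found.groups() if found else None
print(same_as_raw_literal, parts)
```

The same applies to any string already held in a variable: by the time
it exists as a str object, there are no escape sequences left to
process.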


