[Tutor] About distinguishing elements in a string/regular expr
Kent Johnson
kent37 at tds.net
Thu Sep 7 13:47:55 CEST 2006
Xiao Yu Michael Yang wrote:
> Hi tutors,
>
> I am currently working on a project that identifies languages of html
> documents, using Python, of course.
You might be interested in http://chardet.feedparser.org/ which seems to
work directly on HTML.
> Just wondering, given a string:
>
> str = "<html> title this is french 77 992 / <aaabbbccc> </html>"
Note that str is the name of the built-in string type and not a good
choice for a variable name.
> what is the python expression for:
>
> 1. r = return_anything_that's_within<> (str), i.e. it should give "html,
> aaabbbccc, html"
You can do this with a regular expression:
In [1]: s = "<html> title this is french 77 992 / <aaabbbccc> </html>"
In [2]: import re
In [3]: re.findall('<.*?>', s)
Out[3]: ['<html>', '<aaabbbccc>', '</html>']
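If you want only the names, without the angle brackets, a capture group is one way to get them. This is a minimal sketch that assumes simple tags like the ones in your example (no attributes or comments); the variable name "names" is just for illustration:

```python
import re

s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

# With one capture group in the pattern, findall returns only the
# captured text. '/?' consumes the slash of a closing tag so it is
# not included in the result.
names = re.findall(r'</?(.*?)>', s)
print(names)  # ['html', 'aaabbbccc', 'html']
```

For real-world HTML a proper parser is likely to be more robust than a regular expression.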
If you are trying to strip the tags from the HTML, try one of these:
http://www.oluyede.org/blog/2006/02/13/html-stripper/
http://www.aminus.org/rbre/python/cleanhtml.py
>
> 2. r = remove_all_numbers(str), (what is the python expression for
> 'is_int') i.e. it removes "77" and "992"
What should r look like here? Is it the string s with digits removed, or
some kind of list?
s.isdigit() returns True if s is a non-empty string consisting only of digits.
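If r is meant to be the original string with the numbers taken out, here are two possible sketches. Both are guesses at what you want; the names "without_numbers" and "digits_removed" are made up for this example:

```python
import re

s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

# Option A: split on whitespace, drop the tokens that consist only of
# digits, and join the remaining tokens with single spaces.
words = [w for w in s.split() if not w.isdigit()]
without_numbers = ' '.join(words)

# Option B: delete every run of digits with a regular expression.
# The spaces that surrounded the numbers are left in place.
digits_removed = re.sub(r'\d+', '', s)

print(without_numbers)
print(digits_removed)
```

Option A only removes whole tokens, so a mixed token such as "abc123" would be kept; Option B removes digits wherever they occur.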
>
> 3. dif = listA_minus_listB(str, r), i.e. should return ['77', '992'],
> using the above 'r' value.
You seem to be confused about strings vs lists. s is a string, not a
list. If you have two lists a and b and you want a new list containing
everything in a not in b, use a list comprehension:
[aa for aa in a if aa not in b]
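As a small illustration with made-up lists (the contents of a and b here are assumed, built from the words in your example string):

```python
# Hypothetical input lists for illustration.
a = ['title', 'this', 'is', 'french', '77', '992']
b = ['title', 'this', 'is', 'french']

# Keep every item of a that does not appear in b, preserving order.
diff = [aa for aa in a if aa not in b]
print(diff)  # ['77', '992']

# For long lists, testing membership against a set is usually faster.
b_set = set(b)
diff_fast = [aa for aa in a if aa not in b_set]
```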
If what you are looking for is all the number strings from s, you can
use a regular expression again:
In [4]: re.findall(r'\d+', s)
Out[4]: ['77', '992']
Kent