[Tutor] About distinguishing elements in a string/regular expr

Thu Sep 7 13:47:55 CEST 2006

Xiao Yu Michael Yang wrote:
> Hi tutors,
> 
>    I am currently working on a project that identifies languages of html 
> documents, using Python, of course. 

You might be interested in http://chardet.feedparser.org/ which seems to 
work directly on HTML.

> Just wondering, given a string:
> 
>    str = "<html> title this is french 77 992 / <aaabbbccc> </html>"

Note that str is the name of the built-in string type, so using it as a 
variable name hides the built-in. A different name, such as s, is a 
better choice.

> what is the python expression for:
> 
> 1. r = return_anything_that's_within<> (str), i.e. it should give "html,
> aaabbbccc, html"

You can do this with a regular expression:
In [1]: s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

In [2]: import re

In [3]: re.findall('<.*?>', s)
Out[3]: ['<html>', '<aaabbbccc>', '</html>']
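
If you want only the names, without the angle brackets or the leading 
slash of a closing tag, a capturing group is one way to get there. This 
is a minimal sketch, not a real HTML parser; the variable name tag_names 
is just for illustration, and tags with attributes would come back with 
the attributes included.

```python
import re

s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

# When the pattern contains one group, findall() returns only the group.
# '</?' matches the opening bracket plus an optional slash, so closing
# tags give the same name as opening tags.
tag_names = re.findall(r'</?(.*?)>', s)
print(tag_names)  # ['html', 'aaabbbccc', 'html']
```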

If you are trying to strip the tags from the HTML, try one of these:
http://www.oluyede.org/blog/2006/02/13/html-stripper/
http://www.aminus.org/rbre/python/cleanhtml.py

> 
> 2. r = remove_all_numbers(str), (what is the python expression for
> 'is_int') i.e. it removes "77" and "992"

What should r look like here? Is it the string s with digits removed, or 
some kind of list?

s.isdigit() returns True if s is a non-empty string made up only of 
digits.
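
Depending on the answer, something like the following might be close to 
what you want. It is only a sketch under the assumption that "numbers" 
means whitespace-separated words made up entirely of digits; the names 
tokens, non_numbers and cleaned are made up for the example.

```python
s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

# Split on whitespace, then keep every word that is not all digits.
tokens = s.split()
non_numbers = [t for t in tokens if not t.isdigit()]

# If a string is wanted rather than a list, join the words back together.
cleaned = " ".join(non_numbers)
print(cleaned)  # <html> title this is french / <aaabbbccc> </html>
```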

> 
> 3. dif = listA_minus_listB(str, r), i.e. should return ['77', '992'],
> using the above 'r' value.

You seem to be confused about strings vs lists. s is a string, not a 
list. If you have two lists a and b and you want a new list containing 
everything in a that is not in b, use a list comprehension:
[ aa for aa in a if aa not in b ]
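
Applied to your example, that could look like the sketch below. It 
assumes a is the list of whitespace-separated words in s and b is the 
same list with the all-digit words removed; both are my guesses at what 
you mean by the two lists.

```python
s = "<html> title this is french 77 992 / <aaabbbccc> </html>"

a = s.split()                           # all words in s
b = [t for t in a if not t.isdigit()]   # words that are not numbers

# Everything in a that is not in b, in the original order.
diff = [aa for aa in a if aa not in b]
print(diff)  # ['77', '992']
```

For long lists, testing membership against set(b) instead of b avoids 
scanning the whole list for every element.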

If what you are looking for is all the number strings from s, you can 
use a regular expression again:
In [4]: re.findall(r'\d+', s)
Out[4]: ['77', '992']

Kent