help with (x)html / xml encoding...
lt
glt2010pas.de.spam at yahoo.fr
Thu Mar 20 18:03:13 EST 2003
Hello,
I'm looking for a way to extract the encoding from a file retrieved by urllib.
I'm planning to create a "restricted" parser that will only examine <?
and <meta tags, to check for:
<meta http-equiv="content-type" content="text/html; charset=xxxencodingxxx">
or
<?xml version="1.0" encoding="'xxxencodingxxx'"?>
Do you think that is enough? How would you do it?
My solution is below. Please feel free to comment on this code, I *really* need
to improve my Python! (inspired by sgmllib.py)
s is the string to check, more precisely the string handed over during
parsing by an SGMLParser object that only implements handle_pi and start_meta:
import re

# The value may be single-quoted, double-quoted, or an unquoted token.
_encoding = re.compile(
    r'\s*encoding\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~\'"]*)')
_charset = re.compile(
    r'\s*charset\s*=\s*'
    r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~\'"]*)')

def get_encoding(s):
    encoding = None
    # Try the XML declaration form first, then the <meta> charset form.
    search = _encoding.search(s)
    if not search:
        search = _charset.search(s)
    if not search:
        return encoding
    encoding = search.group(1)
    # Strip matching outer quotes, including nested ones such as "'utf-8'".
    while encoding[:1] == '\'' == encoding[-1:] or \
          encoding[:1] == '"' == encoding[-1:]:
        encoding = encoding[1:-1]
    return encoding
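
For comparison, here is a rough sketch of the whole "restricted parser" idea in one piece. It is not the sgmllib version described above: it assumes Python 3, where sgmllib no longer exists, and uses html.parser.HTMLParser instead. The names EncodingSniffer and sniff_encoding, the simplified regular expression, and the 4096-byte limit are all made up for illustration.

```python
import re
from html.parser import HTMLParser

# Simplified pattern (an assumption, not the original regexes): accepts
# encoding=... or charset=..., skipping any leading quote characters.
_declared = re.compile(r'''(?:encoding|charset)\s*=\s*["']*([-\w.:+]+)''',
                       re.IGNORECASE)


class EncodingSniffer(HTMLParser):
    """Hypothetical restricted parser: only looks at <? ... > and <meta ...>."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def _check(self, s):
        # Keep the first declaration found and ignore later ones.
        if self.encoding is None:
            match = _declared.search(s)
            if match:
                self.encoding = match.group(1)

    def handle_pi(self, data):
        # data is the text between "<?" and ">", e.g.
        # 'xml version="1.0" encoding="utf-8"?'
        self._check(data)

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names.
        if tag == "meta":
            for name, value in attrs:
                if name == "content" and value:
                    self._check(value)


def sniff_encoding(raw, default=None):
    """Return the declared encoding of the bytes in raw, or default."""
    parser = EncodingSniffer()
    # latin-1 maps every byte to a character, so this decode cannot fail.
    # Only the first 4096 bytes are examined (an arbitrary limit).
    parser.feed(raw[:4096].decode("latin-1"))
    return parser.encoding or default
```

With this, sniff_encoding(data) can be called directly on the bytes read from the urllib response. It only reads declarations inside the document and does not look at the HTTP Content-Type header.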
Thanks...