help with (x)html / xml encoding...

lt glt2010pas.de.spam at yahoo.fr
Thu Mar 20 18:03:13 EST 2003


hello,

I'm looking for a way to extract the encoding from a file retrieved with urllib.
I'm planning to create a "restricted" parser that only examines <? and <meta
tags, to check for:

<meta http-equiv="content-type" content="text/html; charset=xxxencodingxxx">
or
<?xml version="1.0" encoding="'xxxencodingxxx'"?>

Do you think that is enough? How would you do it?

My solution is below. Please feel free to comment on this code, I *really* need
to improve my Python! (It is inspired by sgmllib.py.)
s is the string to check, more precisely the string handed over during parsing
by an SGMLParser object that will only implement handle_pi and start_meta.

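import re

# A value is either quoted ('...' or "...") or a bare run of token characters.
# Quotes are left out of the bare form so a trailing quote is never captured.
_value = r'(\'[^\']*\'|"[^"]*"|[-a-zA-Z0-9./:;+*%?!&$\(\)_#=~]*)'
_encoding = re.compile(r'\s*encoding\s*=\s*' + _value)
_charset = re.compile(r'\s*charset\s*=\s*' + _value)

def get_encoding(s):
    """Return the encoding declared in s, or None if there is none."""
    # Try the XML declaration form first, then the <meta> charset form.
    search = _encoding.search(s) or _charset.search(s)
    if not search:
        return None
    encoding = search.group(1)
    # Strip matching quotes, including nested ones such as "'utf-8'".
    while encoding[:1] == '\'' == encoding[-1:] or \
            encoding[:1] == '"' == encoding[-1:]:
        encoding = encoding[1:-1]
    return encoding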
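
For comparison, here is a rough sketch of the restricted parser itself, the part
that only looks at <? and <meta tags. It assumes Python 3, where sgmllib no
longer exists, so html.parser.HTMLParser stands in for SGMLParser. The names
EncodingSniffer, sniff_encoding and _DECLARED are made up for illustration, and
the pattern is a simplified stand-in for the two regexes above, not a
replacement for them.

```python
import re
from html.parser import HTMLParser

# Hypothetical, simplified pattern: encoding=... or charset=..., quoted or bare.
_DECLARED = re.compile(
    r'''(?:encoding|charset)\s*=\s*["']?([-\w.:+]+)''', re.IGNORECASE)


class EncodingSniffer(HTMLParser):
    """Examines only <?...> instructions and <meta> tags; ignores the rest."""

    def __init__(self):
        super().__init__()
        self.encoding = None

    def _check(self, text):
        # Keep the first declaration found and ignore later ones.
        if self.encoding is None and text:
            match = _DECLARED.search(text)
            if match:
                self.encoding = match.group(1)

    def handle_pi(self, data):
        # For an XML declaration, data looks like:
        #   xml version="1.0" encoding="utf-8"?
        self._check(data)

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased; values may be None.
        if tag == "meta":
            attrs = dict(attrs)
            if (attrs.get("http-equiv") or "").lower() == "content-type":
                self._check(attrs.get("content"))


def sniff_encoding(text):
    """Return the declared encoding in text, or None if nothing is declared."""
    parser = EncodingSniffer()
    parser.feed(text)
    return parser.encoding
```

sniff_encoding expects a str. One possible way to get there from the bytes that
urllib returns is to decode them as latin-1 first, which never fails and leaves
the ASCII declaration intact; that step is an assumption and not part of the
code above.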

thanks...





