[Tutor] Entity to UTF-8

Paul Tremblay phthenry@earthlink.net
Wed Apr 30 20:01:16 2003


You probably know this already, but I thought I'd offer it
anyway.

Your code has the lines:


patt = '&#([^;]+);'

ustr = re.sub(patt, ToUTF8, ustr)

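I believe this can be inefficient if the substitution runs many times,
because Python has to compile the regular expression (or at least look
it up in the re module's cache) on each call.  Compiling the pattern
once should be more efficient: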

patt = re.compile(r'&#([^;]+);')

ustr = patt.sub(ToUTF8, ustr)
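
For what it's worth, here is a small self-contained sketch of the
compiled-pattern idea. It assumes a current Python 3 interpreter, so
chr() and int() stand in for unichr() and long(). The name to_char and
the sample string are made up for illustration. The pattern is also
slightly stricter than the one in your code: it accepts digits only, so
a hex entity such as &#x3B1; is left alone instead of raising a
ValueError inside the callback.

```python
import re

# Compile once; the parentheses capture the digits between "&#" and ";".
patt = re.compile(r'&#([0-9]+);')

def to_char(matchobj):
    # Hypothetical helper name; chr() is the Python 3 spelling of unichr().
    return chr(int(matchobj.group(1)))

sample = 'alpha is &#945; and &#62; is greater-than'
result = patt.sub(to_char, sample)
# result is now 'alpha is \u03b1 and > is greater-than'
```

The capturing group matters: without it, matchobj.group(1) raises an
error because the pattern has no group 1.
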

I am struggling with unicode myself, so I am going to test out your code
and see if it helps me.

Paul




On Wed, Apr 30, 2003 at 11:53:10AM +0800, Ezequiel, Justin wrote:
> From: "Ezequiel, Justin" <j.ezequiel@spitech.com>
> To: "'tutor@python. org' (E-mail)" <tutor@python.org>
> Subject: [Tutor] Entity to UTF-8
> Date: Wed, 30 Apr 2003 11:53:10 +0800
> 
> Greetings.
> 
> I need to convert entities (&#945;) in an XML file to the actual UTF-8 characters (α).
> Currently, I am using this bit of code (prototype to see if I can actually do it).
> This seems to work just fine but I was wondering if there are other ways of doing this.
> 
> ##--------------------------------------------------
> import codecs
> import re
> 
> (utf8_encode, utf8_decode, utf8_reader, utf8_writer) = codecs.lookup("utf-8")
> patt = '&#([^;]+);'
> 
> def ToUTF8(matchobj):
>     return unichr(long(matchobj.group(1)))
> 
> def GetUTF8(pth):
>     infile = utf8_reader(open(pth))
>     readstr = infile.read()
>     infile.close()
>     return readstr
> 
> def WriteUTF8(pth, str):
>     outf = utf8_writer(open(pth, 'w'))
>     outf.write(str)
>     outf.close()
> 
> ustr = GetUTF8('input.htm')
> 
> ustr = re.sub(patt, ToUTF8, ustr)
> 
> WriteUTF8('output.htm', ustr)
> ##--------------------------------------------------
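
The two file helpers above can probably be written more briefly. This is
only a sketch, assuming Python 3, where the built-in open() takes an
encoding argument. In the Python 2 of this thread, codecs.open(pth, 'r',
'utf-8') plays the same role. The lower-case function names and the
temporary file are placeholders, not part of the original script.

```python
import os
import tempfile

def get_utf8(pth):
    # Bytes are decoded from UTF-8 as they are read.
    with open(pth, 'r', encoding='utf-8') as infile:
        return infile.read()

def write_utf8(pth, text):
    # Text is encoded to UTF-8 as it is written.
    with open(pth, 'w', encoding='utf-8') as outf:
        outf.write(text)

# Round trip through a temporary file to check both helpers.
tmp_path = os.path.join(tempfile.mkdtemp(), 'output.htm')
write_utf8(tmp_path, '<P>\u03b1</P>')
round_trip = get_utf8(tmp_path)
```

The with blocks close the files even if reading or writing fails, which
the explicit close() calls would not do.
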
> 
> sample input file (actual production files would be XML):
> <HTML>
> <HEAD>
> <META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
> <TITLE>TESTING</TITLE>
> </HEAD>
> <BODY>
> <P>&#65279; &#1103; &#1078; &#1097; &#1102; &#1092; &#1081; &#1073; &#8936; &#8995; &#62; &#9742; &#945;</P>
> <P>&#65279; &#1103; &#1078; &#1097; &#1102; &#1092; &#1081; &#1073; &#8936; &#8995; &#62; &#9742; &#945;</P>
> </BODY>
> </HTML>
> 
> sample output file:
> <HTML>
> <HEAD>
> <META http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
> <TITLE>TESTING</TITLE>
> </HEAD>
> <BODY>
> <P>﻿ я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
> <P>﻿ я ж щ ю ф й б ⋨ ⌣ > ☎ α</P>
> </BODY>
> </HTML>
> 
> Can you point me to resources/tutorials if any for this?
> Is there a HowTo for the codecs module?
> Maybe there are other modules I should look at (XML?).
> 
> Actual (production) input files would most likely have &alpha; instead of &#945; but &#x3B1; is also possible.
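
One possible way to cover all three forms (&#945;, &#x3B1; and &alpha;)
with a single substitution is sketched below. It assumes Python 3, where
the entity table lives in html.entities (the module was called
htmlentitydefs in Python 2). The names entity_patt, KEEP and
entity_to_char are invented for this example. The choice to leave
&amp;, &lt;, &gt; and &quot; untouched is my own assumption, made
because decoding them could break the XML.

```python
import re
from html.entities import name2codepoint

# One pattern for decimal (&#945;), hex (&#x3B1;) and named (&alpha;) forms.
entity_patt = re.compile(
    r'&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);')

# Named entities that are markup in XML; these are left as they are.
KEEP = ('amp', 'lt', 'gt', 'quot')

def entity_to_char(matchobj):
    body = matchobj.group(1)
    if body[:2] in ('#x', '#X'):
        return chr(int(body[2:], 16))      # hex reference
    if body.startswith('#'):
        return chr(int(body[1:]))          # decimal reference
    if body in name2codepoint and body not in KEEP:
        return chr(name2codepoint[body])   # known named entity
    return matchobj.group(0)               # unknown or protected: unchanged

decoded = entity_patt.sub(entity_to_char, '&#945; &#x3B1; &alpha; &amp;')
# decoded is now '\u03b1 \u03b1 \u03b1 &amp;'
```

Note that numeric references such as &#62; are still decoded, as in the
original script, so the output may need re-escaping if it has to stay
well-formed XML.
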
> 
> BTW, is there a built-in method to convert a Hex string ('3B1') to a long (945)?
> I am currently using my own function (am too embarrassed to post it here).
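
On the hex question: the built-in int() accepts a base as its second
argument (long() does the same in Python 2), so a hand-written
conversion function should not be needed. A minimal sketch, with
variable names of my own choosing:

```python
# Parse a hexadecimal string; the second argument is the base.
code_point = int('3B1', 16)

# Going the other way, format the number back as upper-case hex.
hex_string = format(code_point, 'X')
```

Here code_point is 945 and hex_string is '3B1'.
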
> 

-- 

************************
*Paul Tremblay         *
*phthenry@earthlink.net*
************************