[Tutor] HTML encoding of character sets...

Wed May 3 19:15:43 CEST 2006

Hi Frank,

A couple of questions / issues here. Not really an answer, but hopefully 
a start in the right direction.

Why do you need to use entity escapes at all? If you have a correct 
charset declaration you can use whatever encoding you like for the web 
page - latin-1, koi8-r, shift-jis, etc.

If you pick utf-8 for the encoding, you can use it for all your pages 
rather than using a different encoding for each language.
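
As a rough sketch of that idea (Python 3 syntax; the helper name 
page_bytes and the sample strings are made up for illustration), one 
template can serve every language as long as the declared charset 
matches the bytes actually written. A meta tag is used here; an HTTP 
Content-Type header is another way to declare it.

```python
import html

def page_bytes(title, body_text, encoding='utf-8'):
    """Return a complete HTML page as bytes in the given encoding."""
    page = (
        '<html>\n'
        '<head>\n'
        # The declared charset must match the encoding used below.
        '<meta http-equiv="Content-Type" content="text/html; charset=%s">\n'
        '<title>%s</title>\n'
        '</head>\n'
        '<body><p>%s</p></body>\n'
        '</html>\n'
    ) % (encoding, html.escape(title), html.escape(body_text))
    return page.encode(encoding)

# The same function handles Russian and French without entity escapes.
russian = page_bytes('Загрузка', 'Пожалуйста, подождите')
french = page_bytes('Chargement', 'Préparation de la page')

# To save a page: open('page.html', 'wb').write(russian)
```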

The string you read with open('russian.txt', 'r').read() is already 
encoded in the encoding used by the file 'russian.txt'. It is critical 
that you know this encoding.
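
To see why the encoding matters, here is a small illustration (Python 3 
syntax; the four sample bytes are an assumption chosen for the demo, not 
taken from your file). The same bytes turn into completely different 
text depending on which codec is used to read them:

```python
# Four raw bytes, as they might come from open('russian.txt', 'rb').read()
raw = b'\xc8\xe4\xe5\xf2'

as_cp1251 = raw.decode('cp1251')    # 'Идет'
as_koi8r = raw.decode('koi8-r')     # 'хДЕР'
as_latin1 = raw.decode('latin-1')   # 'Èäåò'

for name, text in [('cp1251', as_cp1251),
                   ('koi8-r', as_koi8r),
                   ('latin-1', as_latin1)]:
    print(name, text)
```

None of these calls raises an error, so a wrong guess fails silently and 
only shows up as garbage on the page.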

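You can convert encoded text to Unicode with text.decode('koi8-r') etc. 
decode() goes to Unicode, encode() goes away from Unicode. Your 
traceback comes from calling encode() on a string that is already 
encoded: Python first tries to turn it into Unicode with the default 
'ascii' codec, and that fails on the first non-ASCII byte (0xc8).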
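
A minimal sketch of the two directions (Python 3 syntax, where the 
bytes/str split makes the direction explicit; the function name 
transcode and the sample word are just for illustration):

```python
def transcode(raw, source_encoding, target_encoding='utf-8'):
    """Re-encode bytes from one encoding to another via Unicode."""
    text = raw.decode(source_encoding)    # bytes -> Unicode text
    return text.encode(target_encoding)   # Unicode text -> bytes

# Build some koi8-r bytes to stand in for the contents of a file.
koi8_bytes = 'Привет'.encode('koi8-r')

utf8_bytes = transcode(koi8_bytes, 'koi8-r')

print(len(koi8_bytes))   # 6, one byte per Cyrillic letter in koi8-r
print(len(utf8_bytes))   # 12, two bytes per Cyrillic letter in utf-8
```

In Python 2.4 the calls are the same, except that the result of decode() 
is a unicode object and the input is an ordinary str.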

Once you have Unicode text you can encode it to whatever encoding you 
want. Or you can use the codepoint2name dictionary in the htmlentitydefs 
module to convert to entities. This recipe shows one way to do it:
http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/440563
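
This is not the linked recipe, just a small sketch of the same idea in 
Python 3 syntax (the module is html.entities there; in Python 2 it is 
htmlentitydefs). The function name to_entities is made up. It only 
touches non-ASCII characters, so '&', '<' and '>' still need escaping 
separately.

```python
from html.entities import codepoint2name

def to_entities(text, named=True):
    """Replace every non-ASCII character with an HTML entity."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                           # plain ASCII stays as-is
        elif named and cp in codepoint2name:
            out.append('&%s;' % codepoint2name[cp])  # e.g. &eacute;
        else:
            out.append('&#%d;' % cp)                 # numeric fallback
    return ''.join(out)

print(to_entities('café'))                 # caf&eacute;
print(to_entities('á', named=False))       # &#225;
print(to_entities('Идет', named=False))    # &#1048;&#1076;&#1077;&#1090;

# If numeric entities are all you need, the built-in error handler
# does the same job in one call:
print('Идет'.encode('ascii', 'xmlcharrefreplace'))
```

The last line prints a bytes object holding the same numeric entities.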

HTH
Kent

Frank Moore wrote:
> Hi,
> 
> I need to do some encoding of text that will be used in a web page.
> The text has been translated into 16 different languages.
> I've managed the manual translation of some of the more regular 
> languages (French, Spanish, Italian, etc.) by
> replacing characters like 'á' with the numeric entity &#225; etc.
> This works when you only have a few characters like this in the text and 
> they visually stand out.
> However, I now have to move on to other languages like Arabic, Russian, 
> Chinese, Hebrew, Japanese, Korean, Hindi and Polish.
> In these languages, the sheer volume of characters that need to be 
> encoded is huge.
> For instance, the following text is a title on a web page (in Russian), 
> asking the user to wait for the page to load:
> 
> Идет загрузка, пожалуйста, подождите…
> 
> It obviously looks like garbage unless you have your email reader set to 
> a Russian text encoding.
> But even if it appears correctly, the sheer number of characters that I 
> will need to numerically encode is massive.
> 
> Does anyone know how I can automate this process?
> I want to be able to read a string out of a translation file, pass it to 
> a Python script and get back a list or string of numeric entities
> that I can then bury in my HTML.
> 
> I had a play with a snippet of code from the Unicode chapter of 'Dive 
> Into Python' (http://diveintopython.org/xml_processing/unicode.html)
> but get the following error:
> 
> text = open('russian.txt', 'r').read()
> converted_text = text.encode('koi8-r')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in ?
>   File "c:\Python24\lib\encodings\koi8_r.py", line 18, in encode
>     return codecs.charmap_encode(input,errors,encoding_map)
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc8 in position 0: ordinal not in range(128)
> 
> Anybody got any ideas?
> 
> Many thanks,
> Frank.