Html character entity conversion

Tue Aug 1 04:10:18 EDT 2006

Pak (or Andrei, whichever is your first name),

      My proposal below:

----- Original Message -----
From: <pak.andrei at gmail.com>
Newsgroups: comp.lang.python
To: <python-list at python.org>
Sent: Sunday, July 30, 2006 8:52 PM
Subject: Re: Html character entity conversion

> danielx wrote:
> > pak.andrei at gmail.com wrote:
> > > Here is my script:
> > >
> > > from mechanize import *
> > > from BeautifulSoup import *
> > > import StringIO
> > > b = Browser()
> > > f = b.open("http://www.translate.ru/text.asp?lang=ru")
> > > b.select_form(nr=0)
> > > b["source"] = "hello python"
> > > html = b.submit().get_data()
> > > soup = BeautifulSoup(html)
> > > print  soup.find("span", id = "r_text").string
> > >
> > > OUTPUT:
> > > &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;
> > > ----------
> > > In russian it looks like:
> > > "привет питон"
> > >
> > > How can I translate this using standard Python libraries??
> > >
> > > --
> > > Pak Andrei, http://paxoblog.blogspot.com, icq://97449800
> >

I've been proposing solutions of late using a stream editor I recently wrote, realizing each time how well it works in a variety of
different situations. I can only hope I am not beginning to get on people's nerves (Here he comes again with his damn thing!).
      I base the following on proposals others have made so far, because I haven't used Unicode and know little about it. If
nothing else, I do think this is a rather elegant way to translate the ampersand codes to Unicode strings. Having to read them
through an 'eval', though, doesn't seem to be the ultimate solution. I couldn't assign a unicode string to a variable so that it
would print text as Claudio proposed.

Here is my htm example:

>>> htm = StringIO.StringIO ('''
<htm>
  <!-- Examen -->
  <head><title>Deuxième question</title></head>
    <body bgcolor="#beb4a0" text="#000082" etc. >
      <b>L´élève doit lire et traduire:</b> &#1087;&#1088;&#1080;&#1074;&#1077;&#1090; &#1087;&#1080;&#1090;&#1086;&#1085;<br>
    </body>
</htm> ''')

And here is my SE hack:

>>> import SE    # Available at the Cheese Shop
>>> Ampersand_Filter = SE.SE (' <EAT> "~&#[0-9]+;~==(10)" ')
>>> for line in htm:
            line = line.rstrip ('\n')   # Drop the line ending, once only
            ampersand_codes = Ampersand_Filter (line)
            # A newline-separated list of the ampersand codes found in the current line
            if ampersand_codes:
               # From it we build the substitution definitions for the current line
               substitutions = ''
               for code in ampersand_codes.split ('\n')[:-1]:
                  substitutions = '%s%s=\\u%04x\n' % (substitutions, code, int (code [2:-1]))
               # Once the list is complete, make a custom Editor just for the current line
               Line_Unicoder = SE.SE (substitutions)
               unicode_line = Line_Unicoder (line)
               # eval turns the \uXXXX escapes into characters; it breaks on a line containing a double quote
               print eval ('u"%s"' % unicode_line)
            else:
               print line

<htm>
  <!-- Examen -->
  <head><title>Deuxième question</title></head>
    <body bgcolor="#beb4a0" text="#000082" etc. >
      <b>L´élève doit lire et traduire:</b> привет питон<br>
    </body>
</htm>

This is a textbook example of dynamic substitutions. Typically SE compiles static substitution lists. But with 2**16 (?) Unicode
characters, building a static list would be absurd, if possible at all. So we dynamically make custom substitutions for each line
after extracting the ampersand escapes that may be there.
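
For comparison, here is a minimal sketch of the same dynamic substitution using only the standard library. It assumes Python 3
rather than the Python 2 of the session above, and the names NUMERIC_REF and decode_numeric_refs are illustrative only, not part
of SE. Instead of building a substitution list per line, it hands each match to a callback that computes the replacement.

```python
import re

# Matches decimal numeric character references such as &#1087;
NUMERIC_REF = re.compile(r'&#([0-9]+);')

def decode_numeric_refs(text):
    """Replace each decimal reference with the character it names."""
    def replace(match):
        code_point = int(match.group(1))
        try:
            return chr(code_point)
        except (ValueError, OverflowError):
            # Not a valid code point: leave the reference untouched
            return match.group(0)
    return NUMERIC_REF.sub(replace, text)

sample = ('&#1087;&#1088;&#1080;&#1074;&#1077;&#1090; '
          '&#1087;&#1080;&#1090;&#1086;&#1085;')
decoded = decode_numeric_refs(sample)   # 'привет питон'
```

Because the replacement is computed from the matched number, no 'eval' is needed and quotes or backslashes in the line do no harm.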

Next we would like to convert the regular named ampersand escapes as well and also strip the tags. That is a simple matter of
preprocessing the file.

>>> Legibilizer = SE.SE ('htm2iso.se "~<(.|\n)*?>~=" "~<!--(.|\n)*?-->~=" ')

'htm2iso.se' is a substitution definition file that maps the standard named ampersand escapes to characters. It is included in the
SE package. You can name as many definition files as you want. In a definition string the name of a file is equivalent to its
contents.

>>> htm.seek (0)
>>> htm_no_tags = Legibilizer (htm.read ())
>>> for line in htm_no_tags.split ('\n'):
         if line.strip () == '': continue
         ampersand_codes = Ampersand_Filter (line)
         ... (same as above)

  Deuxième question
      L'élève doit lire et traduire: привет питон
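
The preprocessing step above can also be sketched with the standard library alone. This is an approximation under assumptions,
not a replacement for htm2iso.se: it assumes Python 3, uses html.unescape for both named and numeric escapes, and strips tags
with simple regular expressions that will not cope with every kind of markup. The name legible_text is made up for this example.

```python
import html
import re

# Comments go first because a comment may itself contain '>'
COMMENT = re.compile(r'<!--.*?-->', re.DOTALL)
# [^>] also matches newlines, so a tag may span several lines
TAG = re.compile(r'<[^>]*>')

def legible_text(markup):
    """Strip comments and tags, decode the escapes, drop blank lines."""
    without_comments = COMMENT.sub('', markup)
    without_tags = TAG.sub('', without_comments)
    # Decode last, so that an escaped '<' is never mistaken for a tag
    text = html.unescape(without_tags)
    lines = [line.strip() for line in text.splitlines()]
    return '\n'.join(line for line in lines if line)

markup = '''<htm>
  <!-- Examen -->
  <head><title>Deuxi&egrave;me question</title></head>
    <body>
      <b>&Eacute;l&egrave;ve:</b> &#1087;&#1080;&#1090;&#1086;&#1085;<br>
    </body>
</htm>'''
result = legible_text(markup)
```

Unlike the SE session, this sketch also trims the leading indentation of each line.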

Whether this serves your purpose I don't really know. How you can use it other than to read it in the IDLE window, I don't know
either. I tried to copy it out, but it doesn't survive the operation and the paste has question marks or squares in place of the
Russian letters.
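
One way around the copy-and-paste problem, offered as a guess since I can't confirm the cause on your machine, is to write the
text to a file with an explicit encoding instead of going through the clipboard. The sketch below assumes Python 3; the helper
names and the file name are hypothetical.

```python
import os
import tempfile

def save_utf8(text, path):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

def load_utf8(path):
    """Read the file back, decoding it as UTF-8."""
    with open(path, 'r', encoding='utf-8') as handle:
        return handle.read()

# The two Russian words from the example, written with escapes
text = '\u043f\u0440\u0438\u0432\u0435\u0442 \u043f\u0438\u0442\u043e\u043d'

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), 'translation.txt')
save_utf8(text, path)
restored = load_utf8(path)
```

Any editor or browser that is told the file is UTF-8 should then show the Russian letters.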

Regards

Frederic