newbie - HTML character codes

Wed Dec 13 09:11:43 EST 2006

ardief wrote:
[...]
> And I want the HTML char codes to turn into their equivalent plain
> text. I've looked at the newsgroup archives, the cookbook, the web in
> general and can't manage to sort it out. I thought doing something like
> this -
> 
> file = open('filename', 'r')

It's not a good idea to use 'file' as a variable name, since you are
shadowing the builtin type of the same name.

> ofile = open('otherfile', 'w')
> 
> done = 0
> 
> while not done:
>    line = file.readline()
>    if 'THE END' in line:
>        done = 1
>    elif '—' in line:
>        line.replace('—', '--')

The replace method doesn't modify the 'line' string; it returns a new
string, so you have to assign the result back to 'line'.
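
A quick illustration, as a minimal sketch. The sample string and the
'&mdash;' entity are assumptions made up for the example, not taken from
your file:

```python
line = 'one&mdash;two\n'

# Calling replace() and discarding the result leaves 'line' untouched,
# because strings are immutable.
line.replace('&mdash;', '--')
unchanged = line

# Rebinding the name to the returned string is what makes the change stick.
line = line.replace('&mdash;', '--')
```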

>        ofile.write(line)
>    else:
>        ofile.write(line)

This should work (untested):

    infile  = open('filename', 'r')
    outfile = open('otherfile', 'w')

    for line in infile:
        outfile.write(line.replace('—', '--'))
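
Note that the loop above copies the whole file. If stopping at the
'THE END' line matters, as in your original loop, something along these
lines might do. It is only a sketch: the function name convert_dashes is
hypothetical, '&mdash;' is my guess at the entity you are after, and the
with statement assumes a reasonably recent Python.

```python
def convert_dashes(src_path, dst_path, entity='&mdash;', marker='THE END'):
    """Copy src_path to dst_path, replacing entity with '--'.

    Copying stops before the first line containing marker, which mirrors
    the 'done' flag in the original while loop.
    """
    # The with statement closes both files even if an error occurs.
    with open(src_path, 'r') as infile, open(dst_path, 'w') as outfile:
        for line in infile:
            if marker in line:
                break
            outfile.write(line.replace(entity, '--'))
```

Calling convert_dashes('filename', 'otherfile') would then behave like
the original loop, minus the bug.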

But I think the best approach is to use an existing application or library
that solves the problem.  recode(1) can easily convert to and from HTML
entities:

    recode html..utf-8 filename
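
If you would rather stay inside Python, the standard library can decode
every entity at once instead of one replace per code. This is a sketch
under the assumption of Python 3.4 or later, where html.unescape exists.
The helper name unescape_file and the UTF-8 encoding are my own choices
for illustration:

```python
import html


def unescape_file(src_path, dst_path, encoding='utf-8'):
    """Write a copy of src_path with HTML character references decoded."""
    with open(src_path, 'r', encoding=encoding) as infile, \
         open(dst_path, 'w', encoding=encoding) as outfile:
        for line in infile:
            # html.unescape handles named (&amp;) and numeric (&#8212;)
            # references alike.
            outfile.write(html.unescape(line))
```

Unlike the replace approach, this turns the entities into the real
characters rather than ASCII substitutes such as '--', so the output file
needs an encoding that can represent them.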

Best regards.
-- 
Roberto Bonvallet