encoding problem with BeautifulSoup - problem when writing parsed text to file

Greg gregor.hochschild at googlemail.com
Thu Oct 6 00:39:17 EDT 2011


Brilliant! It worked. Thanks!

Here is the final code for those who are struggling with similar
problems:

## open and decode file
# In this case, the encoding comes from the charset attribute in a meta
# tag, e.g. <meta charset="iso-8859-2">
fileBytes = open(filePath, "rb").read()
fileContent = fileBytes.decode("iso-8859-2")
fileSoup = BeautifulSoup(fileContent)

## Do some BeautifulSoup magic and preserve unicode; assume the result
## is saved in 'text' ##

## write extracted text to file
f = open(outFilePath, 'w')
f.write(text.encode('utf-8'))
f.close()
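The same decode-process-encode flow can also be written with codecs.open, which takes the encoding as an argument so you never touch raw bytes yourself. A self-contained sketch (the file names are made up for illustration):

```python
import codecs

# Hypothetical paths; the encodings mirror the snippet above.
in_path = "page.html"
out_path = "out.txt"

# Create a sample ISO-8859-2 input file so the sketch is runnable as-is.
with open(in_path, "wb") as f:
    f.write(u"Branie zak\u0142adnik\u00f3w".encode("iso-8859-2"))

# Decode on read ...
with codecs.open(in_path, "r", "iso-8859-2") as f:
    file_content = f.read()

# ... do the BeautifulSoup processing on unicode here (passed through) ...
text = file_content

# ... and encode on write.
with codecs.open(out_path, "w", "utf-8") as f:
    f.write(text)
```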



On Oct 5, 11:40 pm, Steven D'Aprano <steve
+comp.lang.pyt... at pearwood.info> wrote:
> On Wed, 05 Oct 2011 16:35:59 -0700, Greg wrote:
> > Hi, I am having some encoding problems when I first parse stuff from a
> > non-english website using BeautifulSoup and then write the results to a
> > txt file.
>
> If you haven't already read this, you should do so:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> > I have the text both as a normal (text) and as a unicode string (utext):
> > print repr(text)
> > 'Branie zak\xc2\xb3adnik\xc3\xb3w'
>
> This is pretty much meaningless, because we don't know how you got the
> text and what it actually is. You're showing us a bunch of bytes, with no
> clue as to whether they are the right bytes or not. Considering that your
> Unicode text is also incorrect, I would say it is *not* right and your
> description of the problem is 100% backwards: the problem is not
> *writing* the text, but *reading* the bytes and decoding them.
>
> You should do something like this:
>
> (1) Inspect the web page to find out what encoding is actually used.
>
> (2) If the web page doesn't know what encoding it uses, or if it uses
> bits and pieces of different encodings, then the source is broken and you
> shouldn't expect much better results. You could try guessing, but you
> should expect mojibake in your results.
>
> http://en.wikipedia.org/wiki/Mojibake
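[As it happens, the byte string in Greg's repr() can be reproduced by exactly this kind of mistake. A sketch, assuming the page really was ISO-8859-2 on the wire:]

```python
# ISO-8859-2 bytes decoded as Latin-1 and re-encoded as UTF-8 -- classic
# mojibake: \xb3 (ł) becomes \xc2\xb3 (superscript 3 in UTF-8).
original = b'Branie zak\xb3adnik\xf3w'   # assumed bytes as served
mangled = original.decode('latin-1').encode('utf-8')
print(repr(mangled))                     # same bytes Greg printed
```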
>
> (3) Decode the web page into Unicode text, using the correct encoding.
>
> (4) Do all your processing in Unicode, not bytes.
>
> (5) Encode the text into bytes using UTF-8 encoding.
>
> (6) Write the bytes to a file.
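[Condensed into code, steps (3) through (6) look roughly like this -- a sketch reusing the byte string from this thread, with upper-casing standing in for the real processing:]

```python
raw_bytes = b'Branie zak\xb3adnik\xf3w'  # (1)/(2): encoding is iso-8859-2
text = raw_bytes.decode('iso-8859-2')    # (3): decode bytes -> unicode
processed = text.upper()                 # (4): process as unicode only
out_bytes = processed.encode('utf-8')    # (5): encode unicode -> UTF-8
with open('out.txt', 'wb') as f:         # (6): write the bytes
    f.write(out_bytes)
```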
>
> [...]
>
> > Now I am trying to save this to a file but I never get the encoding
> > right. Here is what I tried (+ lots of different things with encode,
> > decode...):
> > outFile=codecs.open( filePath, "w", "UTF8" )
> > outFile.write(utext)
> > outFile.close()
>
> That's the correct approach, but it won't help you if utext contains the
> wrong characters in the first place. The critical step is taking the
> bytes in the web page and turning them into text.
>
> How are you generating utext?
>
> --
> Steven