nntplib encoding problem

Sun Feb 27 21:12:20 EST 2011

On 28/02/2011 01:31, Laurent Duchesne wrote:
> Hi,
>
> I'm using python 3.2 and got the following error:
>
>>>> nntpClient = nntplib.NNTP_SSL(...)
>>>> nntpClient.group("alt.binaries.cd.lossless")
>>>> nntpClient.over((534157,534157))
> ... 'subject': 'Myl\udce8ne Farmer - Anamorphosee (Japan Edition) 1995
> [02/41] "Back.jpg" yEnc (1/3)' ...
>>>> overview = nntpClient.over((534157,534157))
>>>> print(overview[1][0][1]['subject'])
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
> position 3: surrogates not allowed
>
> I'm not sure if I should report this as a bug in nntplib or if I'm doing
> something wrong.
>
> Note that I get the same error if I try to write this data to a file:
>
>>>> h = open("output.txt", "a")
>>>> h.write(overview[1][0][1]['subject'])
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> UnicodeEncodeError: 'utf-8' codec can't encode character '\udce8' in
> position 3: surrogates not allowed
>
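It looks like the subject was originally encoded as Latin-1 (or
similar) (b'Myl\xe8ne Farmer - Anamorphosee (Japan Edition) 1995
[02/41] "Back.jpg" yEnc (1/3)') but has been decoded as UTF-8 with
"surrogateescape" passed as the "errors" parameter. The byte 0xE8 is
not valid UTF-8 in that position, so it came through as the lone
surrogate '\udce8', which a strict UTF-8 encode then rejects.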

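As a rough illustration of that decode step (a sketch only: the byte
string below is a shortened, hypothetical stand-in for the raw header,
and the assumption that the data was decoded this way is inferred from
the output above):

```python
# Hypothetical stand-in for the raw header bytes (shortened).
raw = b'Myl\xe8ne Farmer'

# 0xE8 followed by 'n' is not a valid UTF-8 sequence, so
# "surrogateescape" maps the byte to the lone surrogate U+DCE8.
decoded = raw.decode("utf-8", "surrogateescape")
assert decoded == 'Myl\udce8ne Farmer'

# A strict UTF-8 encode refuses the lone surrogate, which is the
# error seen when printing or writing the subject.
error_reason = None
error_position = None
try:
    decoded.encode("utf-8")
except UnicodeEncodeError as exc:
    error_reason = exc.reason
    error_position = exc.start
```

The position recorded here matches the "position 3" in the traceback.
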
You can get the "correct" Unicode by encoding as UTF-8 with
"surrogateescape" and then decoding as Latin-1:

     overview[1][0][1]['subject'].encode("utf-8", "surrogateescape").decode("latin-1")
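
Wrapped up as a small helper, that round trip might look like the
sketch below. The name recover_text and its original_encoding
parameter are made up for illustration, and Latin-1 is only a guess
at what the poster's client used:

```python
def recover_text(mangled, original_encoding="latin-1"):
    """Undo a UTF-8/surrogateescape decode, then redecode the raw bytes."""
    # "surrogateescape" turns each lone surrogate back into its original byte.
    raw = mangled.encode("utf-8", "surrogateescape")
    # Latin-1 maps every byte to a character, so this step cannot fail.
    return raw.decode(original_encoding)

# Shortened, hypothetical version of the subject from the traceback.
mangled = 'Myl\udce8ne Farmer - Anamorphosee'
recovered = recover_text(mangled)
assert recovered == 'Myl\xe8ne Farmer - Anamorphosee'
```

Note that a subject which really was valid UTF-8 would be garbled by
the Latin-1 decode, so this is only appropriate when the string
contains surrogates in the first place.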