Stripping unencodable characters from a string

Paul Moore p.f.moore at gmail.com
Tue May 5 15:24:56 EDT 2015


On Tuesday, 5 May 2015 20:01:04 UTC+1, Dave Angel wrote:
> On 05/05/2015 02:19 PM, Paul Moore wrote:
> 
> You need to specify that you're using Python 3.4 (or whichever) when 
> starting a new thread.

Sorry. 2.6, 2.7, and 3.3+. It's for use in a cross-version library.

> If you're going to take charge of the encoding of the file, why not just 
> open the file in binary, and do it all with
>      file.write(data.encode( myencoding, errors='replace') )

I don't have control of the encoding of the file. It's typically sys.stdout, which is already open. I can't replace sys.stdout (because the main program which calls my library code wouldn't like me messing with global state behind its back). And sys.stdout isn't open in binary mode.

> I can't see the benefit of two encodes and a decode just to write a 
> string to the file.

Nor can I - that's my point. But if all I have is an open text-mode file with the "strict" error mode, I have to incur one encode, and I have to make sure that no characters are passed to that encode which can't be encoded.
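
For concreteness, here is a rough sketch of the encode/decode dance I mean. The helper name encodable_text and the ASCII fallback are made up for illustration, and the TextIOWrapper is only a stand-in for whatever strict text-mode stream the caller hands over:

```python
import io

def encodable_text(text, stream, errors="replace"):
    """Return text with characters the stream cannot encode replaced or dropped."""
    # Assumption: a stream with no usable .encoding attribute is treated as ASCII.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # errors is passed positionally: str.encode() only accepts keyword
    # arguments from Python 2.7 onwards, and 2.6 has to work too.
    return text.encode(encoding, errors).decode(encoding)

# Stand-in for a strict text-mode stream such as sys.stdout.
out = io.TextIOWrapper(io.BytesIO(), encoding="ascii", errors="strict")
out.write(encodable_text(u"caf\xe9\n", out))
```

Passing "ignore" instead of "replace" strips the offending characters rather than substituting "?". The stream's own encode on write is the one unavoidable encode; the extra encode and decode here are the overhead being discussed.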

If there were a codec method to identify unencodable characters, that might be an alternative (although it's quite possible that the encode/decode dance would be faster anyway, as it's mostly in C - not that performance is key here).
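
The nearest thing I know of is not a codec method at all: UnicodeEncodeError carries start and end attributes, so the offending positions can be found by attempting the encode and catching the failure. A minimal sketch, with unencodable_positions being a hypothetical name and no claim that it beats the round trip:

```python
def unencodable_positions(text, encoding):
    """Return the indexes of characters in text that encoding cannot represent."""
    positions = []
    start = 0
    while start < len(text):
        try:
            # Encode the remainder; success means nothing else is bad.
            text[start:].encode(encoding)
            break
        except UnicodeEncodeError as exc:
            # exc.start and exc.end are relative to the slice that was encoded.
            positions.extend(range(start + exc.start, start + exc.end))
            start += exc.end
    return positions
```

This re-encodes the tail of the string after every failure, so it does more work than a single encode with an error handler when the text contains many bad characters.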

> Alternatively, there's probably a way to open the file using 
> codecs.open(), and reassign it to sys.stdout.

As I said, I have to work with the file (sys.stdout or whatever) that I'm given. I can't reopen or replace it.

Paul


