Stripping characters from windows clipboard with win32clipboard from excel

Thu Sep 12 20:19:30 EDT 2013

On Thu, 12 Sep 2013 16:01:20 -0700, stephen.boulet wrote:

> I have an excel file. When I select cells, copy from excel, and then use
> win32clipboard to get the contents of the clipboard, I have a 131071
> character string.

How exactly are you using win32clipboard, and what exact result are you 
getting? Start with a smaller selection of cells, say, two cells, so the 
amount of data is manageable.

I doubt very much you are getting 131071 *characters*. Perhaps you are 
getting that many *bytes*. Or perhaps you are getting HTML-formatted 
text. Without seeing what is in the clipboard, who can tell? Remember 
that the clipboard can contain multiple versions of the same data.

> When I save the file as a text file, and use the python 3.3 open command
> to read its contents, I only have 80684 characters.

How exactly are you using the open function? The result you get may 
differ drastically depending on what you do.

> Excel (and  other programs too) appends a bunch of b'\x00' (or similar)
> characters.

Append to what? The .xls file? The text file? What tool are you using to 
see this?

> Is there a pythonic way to strip these out?

It's hard to tell from the limited description, but my guess is that you 
are misinterpreting what you are seeing. Part of the evidence for this is 
that you are conflating bytes and characters, e.g. above where you refer 
to the NUL *byte* b'\x00' as a character. Bytes are not characters. Scrub 
that out of your brain. They never were, not even back in the old ASCII 
days (may they soon be forgotten), despite what many people think.

My guess is that when you save the spreadsheet as text, either Excel by 
default, or possibly because you have ticked a checkbox, have set it to 
save using the UTF-16 encoding. That's a good thing (not ideal, ideally 
they ought to use UTF-8, but UTF-16 is not a bad choice).

But when you open the file in Python, Python defaults to UTF-8, which 
means you get an bunch of extraneous NULs when opening the file in text 
mode, or b'\x00' null bytes in binary mode. For example:

py> with open('/tmp/rubbish', 'w', encoding='utf-16be') as f:
...     f.write('hello world blah blah blah\n')
...
27
py> with open('/tmp/rubbish', 'r') as f:   # default encoding is UTF-8
...     f.read()
...
'\x00h\x00e\x00l\x00l\x00o\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00b\x00l
\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00 \x00b\x00l\x00a\x00h\x00\n'

If you look carefully, you will see that every character appears to be 
preceded by the NUL control character, \x00. But that's because you've 
used the wrong encoding to decode the bytes in the file back to 
characters. The right way to do this is:

py> with open('/tmp/rubbish', 'r', encoding='utf-16be') as f:
...     f.read()
...
'hello world blah blah blah\n'

Which encoding should you use? Unfortunately, there is no clean way for 
text files to record which encoding to use inside the file itself, so 
it's often impossible to know for sure. This is why everyone should move 
towards using UTF-8 everywhere. But I digress. 

The fact that you show the NULs as *bytes* b'\x00' rather than characters 
'\x00' suggests that you might be reading the file in binary mode. 
Presumably you did this because you got an error when trying to read it 
in text mode. If you got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: 
invalid start byte

then you should go back to text mode, but use 'utf-16' as the codec:

open(filename, 'w', encoding='utf-16')

Otherwise, if the NULs appear before the letters, as in my example, use 
'utf-16be'. If they appear after the letters, use 'utf-16le'. The 
existence of three oh-so-very-slightly different versions of UTF-16 is 
why it is a sub-optimal encoding, and it is oh-so-bloody-typical that 
Microsoft screwed it up for everyone by picking it as their default 
implementation. If they had sensibly used UTF-8, you wouldn't be having 
this problem.

Of course, it is possible I've misdiagnosed your problem. I've had to 
guess a lot because you didn't show us what you actually did to get the 
results you say you got.

-- 
Steven