[Tutor] bogus characters in a windows file

Thu Feb 9 21:57:40 CET 2012

> 
>> I'm reading a file output by the system utility WMIC in Windows (so I can track CPU usage by process ID), and the text file WMIC outputs seems to contain extra characters I've not seen before.
>> 
>> I use os.system('WMIC /OUTPUT:c:\cpu.txt PROCESS GET ProcessId') to output the file and parse file c:\cpu.txt
> 
> First mistake.  If you use backslash inside a python literal string, you need to do one of two things:
>       1) use a raw string
>       2) double the backslash
> It so happens that \c is not a python escape sequence, so you escaped this particular bug.

Lucked out on that one ... slipped under my radar. I was just cutting and pasting some code from the documentation to WMIC ;-)
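
For reference, a minimal sketch of the three spellings suggested above. The last path (c:\temp.txt) is hypothetical and not from this thread; it is there only to show a case where the unescaped form does go wrong, because \t is a real escape sequence.

```python
# Three ways to spell the same Windows path safely.
raw = r'c:\cpu.txt'        # raw string: the backslash is kept literally
doubled = 'c:\\cpu.txt'    # doubled backslash
forward = 'c:/cpu.txt'     # forward slash, which open() accepts on Windows

print(raw == doubled)      # the first two are the same string

# Hypothetical path, not from the thread: \t IS an escape sequence (a tab),
# so the unescaped literal silently loses its backslash.
broken = 'c:\temp.txt'
print(repr(broken))
print('\t' in broken)
```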

> 
>> The first few lines of the file look like this in notepad
>> 
>> ProcessId
>> 0
>> 4
>> 568
>> 624
>> 648
>> 
>> 
>> I input the data with the lines
>> 
>> infile = open('c:\cpu.txt','r')
> Same thing.  You should either make it r'c:\cpu.txt'   or   'c:\\cpu.txt'  or  even 'c:/cpu.txt'
>> infile.readline()
>> infile.readline()
>> infile.readline()
>> 
> OK, so you throw away the first 3 lines of the file.
> 
>> the readline()s yield the following output
>> 
>> '\xff\xfeP\x00r\x00o\x00c\x00e\x00s\x00s\x00I\x00d\x00 \x00 \x00\r\x00\n'
>> '\x000\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
>> '\x004\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00\r\x00\n'
>> 
> Now, how did you get those bytes displayed?  They've already been thrown out.

Simply run the readline() commands at the Python interpreter prompt (or in IDLE if you like). The results are not thrown away ... they are echoed to the screen.

>> Now for the first line the title 'ProcessId' is in this string, but the individual characters are separated by '\x00', and at least for the first line of the file there is an extra '\xff\xfe'. For subsequent lines it's just '\x00'. Now I can just replace the '\x**' with '', but that seems a bit inelegant. I've tried the 'rU' and 'rb' mode options on open(), but to no effect.
>> 
>> Does anybody know what the rubbish characters are and what has caused them? I'm using the latest Enthought python if that matters.
> It matters, but it'd save each of us lots of trouble if you told us what version that was;  especially which version of Python.  The latest Enthought I see is called EPD 7.2.  But after 10 minutes on the site, I can't see whether there actually is a Python on there or not.  It seems to be just a bunch of libraries for Python.  But whether they're for CPython, IronPython, or something else, who knows?

My fault. It's Python 2.7.1 ... IPython interpreter.

> 
> 
> I don't see any rubbish characters.  What I see is UTF-16 encoded text, displayed as though it were a byte string.  The first two bytes are the BOM (byte order mark), commonly put at the beginning of a file encoded in UTF-16.  The remaining pairs of bytes are the UTF-16 encodings of ordinary characters.  Notepad would recognize the UTF-16 encoding and display the characters correctly.  Perhaps you need to do the same.
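
A minimal sketch of the point above, assuming the file is UTF-16 little-endian with a BOM (which the leading '\xff\xfe' suggests). The sample text and its padding are illustrative, not the exact bytes WMIC wrote. It also shows why every readline() result after the first began with a stray '\x00'.

```python
import codecs

# Rebuild bytes shaped like the WMIC output: a UTF-16 little-endian
# byte order mark followed by two bytes per character.
text = u'ProcessId  \r\n0    \r\n4    \r\n'
raw = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

# Splitting on the single byte '\n' cuts each newline's two-byte unit
# ('\n\x00') in half, so every piece after the first starts with the
# orphaned '\x00'.
pieces = raw.split(b'\n')
print(repr(pieces[1][:3]))

# Decoding the whole stream first, then splitting, gives clean text.
# The 'utf-16' codec reads the BOM to pick the byte order and drops it.
decoded = raw.decode('utf-16')
lines = [line.strip() for line in decoded.splitlines()]
print(lines)
```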

Yes, well, this was the insight I was after. At one stage I was using a distribution compiled for Unicode (so I'm guessing I would never have seen this problem then), but it seems the latest distribution from Enthought is non-Unicode (I've sent them an email to confirm this ... but that's what it looks like). This is the first time I've explicitly faced Unicode input from a text file, so the \x00 stuff was unfamiliar, as were the details of how Unicode works and displays itself in a normal string. Mostly I've seen Unicode strings in Python as u'string' and never paid much attention (unless I passed them as a file name to open() ... when they caused all sorts of grief until I realised I needed to change their type to str with str()).

Since this is a one-off to get one of my PhD students out of a hole, I might just filter out the \x** characters explicitly, since the remainder looks OK.
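
As an alternative to filtering, here is a rough sketch that decodes the file while reading it. The function name read_process_ids is made up for illustration, and it assumes the file holds a header line followed by one integer per line, as in the sample shown earlier. io.open is available in Python 2.6+ and is the built-in open in Python 3.

```python
import io

def read_process_ids(path):
    """Return the process IDs listed in a UTF-16 WMIC output file."""
    # io.open decodes while reading; the 'utf-16' codec uses the BOM
    # at the start of the file to pick the byte order.
    with io.open(path, encoding='utf-16') as infile:
        lines = [line.strip() for line in infile]
    # Skip the 'ProcessId' header and any blank lines.
    return [int(line) for line in lines[1:] if line]
```

It would be called as read_process_ids(r'c:\cpu.txt'), with the raw string keeping the backslash intact.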

As background, the reason for this is to manage a stand-alone science code developed elsewhere, to ensure that CPU usage doesn't go out of control. We're doing thousands of runs with this code (Monte Carlo simulation), launching the code for each simulation with os.system(). Occasionally a simulation goes into an infinite loop, which stalls the Monte Carlo run, so we just want to be able to kill that simulation and go on to the next one. We do this sort of stuff on *NIX all the time using the unix command 'ps', but because the executable we need to use is somebody else's we are stuck on Windows ... and WMIC looks like the easiest, quickest way to achieve this sort of process control on Windows. If anybody has any other ideas for doing this directly from Python in a way that might be platform independent (being able to set some CPU limits on a popen call, for instance), I'd be interested, but looking on the web most of the solutions look rather difficult.
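
One possible platform-independent approach, sketched under assumptions: launch each simulation with subprocess.Popen instead of os.system() and kill it if it runs too long. Note that this is a wall-clock timeout, not a CPU limit, so it only fits if a hung run can be recognised by elapsed time. The name run_with_timeout and its parameters are hypothetical. Popen.kill() exists from Python 2.6 onwards; Python 3.3+ also accepts a timeout argument on Popen.wait() and Popen.communicate(), which would shorten this further.

```python
import subprocess
import sys
import time

def run_with_timeout(args, timeout_s, poll_s=0.1):
    """Run a command and kill it if it exceeds timeout_s seconds of wall-clock time.

    Returns the exit code, or None if the process had to be killed.
    """
    proc = subprocess.Popen(args)
    deadline = time.time() + timeout_s
    while proc.poll() is None:          # None means still running
        if time.time() > deadline:
            proc.kill()                 # TerminateProcess on Windows, SIGKILL on POSIX
            proc.wait()                 # collect the exit status of the killed child
            return None
        time.sleep(poll_s)
    return proc.returncode

if __name__ == '__main__':
    # Stand-in for a runaway simulation: a child Python that sleeps for 30 s.
    hung = [sys.executable, '-c', 'import time; time.sleep(30)']
    print(run_with_timeout(hung, timeout_s=0.5))
```

The Monte Carlo driver would then treat a None result as a failed run and move on to the next simulation.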

Prof Garry Willgoose
The University of Newcastle, Callaghan, Australia