Preprocessing a not-quite-fixed-width file before parsing

Thomas Passin list1 at tompassin.net
Thu Nov 24 20:59:39 EST 2022


On 11/24/2022 9:06 AM, Loris Bennett wrote:
> Thomas Passin <list1 at tompassin.net> writes:
> 
>> On 11/23/2022 11:00 AM, Loris Bennett wrote:
>>> Hi,
>>> I am using pandas to parse a file with the following structure:
>>> Name       fileset    type             KB      quota      limit
>>> in_doubt    grace |    files   quota    limit in_doubt    grace
>>> shortname  sharedhome USR        14097664  524288000  545259520          0     none |   107110       0        0        0     none
>>> gracedays  sharedhome USR       774858944  524288000  775946240          0   5 days |  1115717       0        0        0     none
>>> nametoolong sharedhome USR        27418496  524288000  545259520          0     none |    11581       0        0        0     none
>>> I was initially able to use
>>>     df = pandas.read_csv(file_name, delimiter=r"\s+")
>>> because all the values for 'grace' were 'none'.  Now, however,
>>> non-"none" values have appeared and this fails.
>>> I can't use
>>>     pandas.read_fwf
>>> even with an explicit colspec, because the names in the first column
>>> which are too long for the column will displace the rest of the data to
>>> the right.
>>> The report which produces the file could in fact also generate a
>>> properly delimited CSV file, but I have a lot of historical data in the
>>> readable but poorly parsable format above that I need to deal with.
>>> If I were doing something similar in the shell, I would just pipe
>>> the file through sed or something to replace '5 days' with, say,
>>> '5_days'.
>>> How could I achieve a similar sort of preprocessing in Python, ideally
>>> without having to generate a lot of temporary files?
>>
>> This is really annoying, isn't it?  A space-separated line with spaces
>> in data entries.   If the example you give is typical, I don't think
>> there is a general solution.  If you know there are only certain
>> values like that, then you could do a search-and-replace for them in
>> Python just like the example you gave for "5 days".
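>>
>> For example, before splitting the line:
>>
>>     line = line.replace('5 days', '5_days')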
>>
>> If you know that the field that might contain entries with spaces is
>> always the same one, e.g., the one just before the "|" marker, you
>> could make use of that, though it could be tricky.
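>>
>> An untested sketch, assuming each line has exactly one "|" and that
>> only the last field before it (the first "grace" column) can contain
>> spaces:
>>
>>     left, right = line.split('|')
>>     tokens = left.split()
>>     # The header defines 8 fields to the left of the "|"; any
>>     # extra tokens belong to the multi-word grace value.
>>     if len(tokens) > 8:
>>         tokens[7:] = ['_'.join(tokens[7:])]
>>     line = ' '.join(tokens) + ' | ' + right.strip()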
>>
>> I don't know how many files like this you will need to process, nor
>> how many rows they might contain.  If I were to tackle this job, I
>> would probably do some quality checking first.  Using this example
>> file, figure out how many fields there are supposed to be.  First,
>> split the file into lines:
>>
>> with open("filename") as f:
>>     lines = f.readlines()
>>
>> # Check the space-separated fields defined in the first row:
>> fields = lines[0].split()
>> num_fields = len(fields)
>> print(num_fields)   # e.g., 100
>>
>> # Find lines that have the wrong number of fields
>> bad_lines = []
>> for line in lines:
>>     fields = line.split()
>>     if len(fields) != num_fields:
>>         bad_lines.append(line)
>>
>> print(len(bad_lines))
>>
>> # Inspect a sample
>> for line in bad_lines[:10]:
>>     print(line)
>>
>> This will give you an idea of how many problem lines there are, and
>> whether they can all be fixed by a simple replacement.  If they can,
>> and this is the only file you need to handle, just fix it up and run
>> it.  I would replace the spaces with tabs or commas.  Splitting a line
>> on whitespace (split()) takes care of the issue of having a variable
>> number of spaces, so that's easy enough.
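>>
>> For example:
>>
>>     >>> '  14097664  524288000     none '.split()
>>     ['14097664', '524288000', 'none']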
>>
>> If you will need to handle many files, and you can automate the fixes
>> - possibly with a regular expression - then you should preprocess each
>> file before giving it to pandas.  Something like this:
>>
>> def fix_line(line):
>>     """Test the line for field errors and fix any errors found."""
>>     # ....
>>     return fixed
>>
>> # For each file
>> with open("filename") as f:
>>     lines = f.readlines()
>>
>> fixed_lines = []
>> for line in lines:
>>     fixed = fix_line(line)
>>     fields = fixed.split()
>>     tabified = '\t'.join(fields)  # Could be done by fix_line()
>>     fixed_lines.append(tabified + '\n')  # writelines() adds no newlines
>>
>> # Now use a StringIO to feed the data to pandas
>> # From memory, some details may not be right
>> import io
>>
>> f = io.StringIO()
>> f.writelines(fixed_lines)
>> f.seek(0)  # rewind to the start before reading from it
>>
>> # Give f to pandas as if it were an external file
>> df = pandas.read_csv(f, delimiter='\t')
>>
> 
> Thanks to both Gerard and Thomas for the pointer to StringIO.  I ended up
> just reading the file line-by-line, using a regex to replace
> 
>    '<n> <units> |'
> 
> with
> 
>    '<n><units> |'
> 
> and writing the new lines to a StringIO, which I then passed to
> pandas.read_csv.
> 
> The wrapper approach looks interesting, but I will need to read up more
> on context managers before adding it to my own code; otherwise I may
> not understand it in a month's time.

Glad that StringIO works for you here.  I seem to remember that after 
writing to a StringIO, you have to seek to 0 before reading from it. 
Better check that point!
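
For reference, the regex approach might look roughly like this
(untested; it assumes a single space between the number and the unit,
and that only the grace field to the left of the "|" is affected):

import io
import re

import pandas

buf = io.StringIO()
with open(file_name) as f:
    for line in f:
        # Join e.g. "5 days |" into "5days |" so that splitting on
        # whitespace gives the same number of fields in every row.
        buf.write(re.sub(r'(\d+) (\w+)( +\|)', r'\1\2\3', line))
buf.seek(0)  # rewind before handing the buffer to pandas
df = pandas.read_csv(buf, delimiter=r'\s+')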



