Preprocessing not quite fixed-width file before parsing

Loris Bennett loris.bennett at fu-berlin.de
Thu Nov 24 09:06:17 EST 2022


Thomas Passin <list1 at tompassin.net> writes:

> On 11/23/2022 11:00 AM, Loris Bennett wrote:
>> Hi,
>> I am using pandas to parse a file with the following structure:
>> Name       fileset    type             KB      quota      limit
>> in_doubt    grace |    files   quota    limit in_doubt    grace
>> shortname  sharedhome USR        14097664  524288000  545259520          0     none |   107110       0        0        0     none
>> gracedays  sharedhome USR       774858944  524288000  775946240          0   5 days |  1115717       0        0        0     none
>> nametoolong sharedhome USR        27418496  524288000  545259520          0     none |    11581       0        0        0     none
>> I was initially able to use
>>    df = pandas.read_csv(file_name, delimiter=r"\s+")
>> because all the values for 'grace' were 'none'.  Now, however,
>> non-"none" values have appeared and this fails.
>> I can't use
>>    pandas.read_fwf
>> even with an explicit colspec, because the names in the first column
>> which are too long for the column will displace the rest of the data to
>> the right.
>> The report which produces the file could in fact also generate a
>> properly delimited CSV file, but I have a lot of historical data in the
>> readable but poorly parsable format above that I need to deal with.
>> If I were doing something similar in the shell, I would just pipe
>> the
>> file through sed or something to replace '5 days' with, say '5_days'.
>> How could I achieve a similar sort of preprocessing in Python, ideally
>> without having to generate a lot of temporary files?
>
> This is really annoying, isn't it?  A space-separated line with spaces
> in data entries.   If the example you give is typical, I don't think 
> there is a general solution.  If you know there are only certain
> values like that, then you could do a search-and-replace for them in
> Python just like the example you gave for "5 days".
>
> If you know that the field that might contain entries with spaces is
> the same one, e.g., the one just before the "|" marker, you could make
> use of that. But it could be tricky.
>
> I don't know how many files like this you will need to process, nor
> how many rows they might contain. If I were to tackle this job, I
> would probably do some quality checking first.  Using this example
> file, figure out how many fields there are supposed to be.  First,
> split the file into lines:
>
> with open("filename") as f:
>     lines = f.readlines()
>
> # Check space-separated fields defined in first row:
> fields = lines[0].split()
> num_fields = len(fields)
> print(num_fields)   # e.g., 100
>
> # Find lines that have the wrong number of fields
> bad_lines = []
> for line in lines:
>     fields = line.split()
>     if len(fields) != num_fields:
>         bad_lines.append(line)
>
> print(len(bad_lines))
>
> # Inspect a sample
> for line in bad_lines[:10]:
>     print(line)
>
> This will give you an idea of how many problem lines there are, and
> if they can all be fixed by a simple replacement.  If they can and
> this is the only file you need to handle, just fix it up and run it.
> I would replace the spaces with tabs or commas.  Splitting a line on
> spaces (split()) takes care of the issue of having a variable number
> of spaces, so that's easy enough.
>
> If you will need to handle many files, and you can automate the fixes
> - possibly with a regular expression - then you should preprocess each 
> file before giving it to pandas.  Something like this:
>
> def fix_line(line):
>     """Test line for field errors and fix errors if any."""
>     # ....
>     return fixed
>
> # For each file
> with open("filename") as f:
>     lines = f.readlines()
>
> fixed_lines = []
> for line in lines:
>     fixed = fix_line(line)
>     fields = fixed.split()
>     tabified = '\t'.join(fields) # Could be done by fix_line()
>     fixed_lines.append(tabified)
>
> # Now use an io.StringIO to feed the data to pandas
> import io
> f = io.StringIO()
> f.write('\n'.join(fixed_lines))
> f.seek(0)   # rewind so pandas reads from the start
>
> # Give f to pandas.read_csv as if it were an external file
> # ...
>
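Thomas's idea of exploiting the '|' marker can be sketched like this (the
sample line is illustrative, and the assumption is that only the field just
before the '|' may contain spaces):

```python
# Sketch of the '|'-marker idea: on each data line, everything between
# the seventh whitespace-separated token and the '|' belongs to the
# 'grace' field, so it can be rejoined into a single token.
line = ("gracedays  sharedhome USR  774858944  524288000  775946240"
        "          0   5 days |  1115717       0        0        0     none")

left, right = line.split("|", 1)
tokens = left.split()
# The first seven columns are single tokens; the remainder is 'grace'.
fixed_left = tokens[:7] + ["_".join(tokens[7:])]
print(fixed_left[-1])  # -> 5_days
```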

Thanks to both Gerard and Thomas for the pointer to StringIO.  I ended up
just reading the file line-by-line, using a regex to replace

  '<n> <units> |'

with 

  '<n><units> |'

and writing the new lines to a StringIO, which I then passed to
pandas.read_csv.
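
A minimal sketch of that approach (the sample data, the single-line header,
the made-up column names after the '|', and the unit alternation in the
regex are my reconstruction, not Loris's actual code):

```python
import io
import re

import pandas as pd

# Hypothetical sample in the report's format.  The real report wraps its
# header over two lines; here it is flattened to one line, with invented
# names for the repeated columns after the '|' to keep them distinct.
raw = """\
Name fileset type KB quota limit in_doubt grace sep files fquota flimit fin_doubt fgrace
gracedays sharedhome USR 774858944 524288000 775946240 0 5 days | 1115717 0 0 0 none
shortname sharedhome USR 14097664 524288000 545259520 0 none | 107110 0 0 0 none
"""

# Join '<n> <units>' before the '|' marker, e.g. '5 days |' -> '5days |',
# so every row has the same number of whitespace-separated fields.  The
# unit alternation is a guess at the report's grace units; 'none' must
# not match, or 'in_doubt' and 'grace' would be merged.
fixed = [re.sub(r"(\d+)\s+(days|hours|mins)\s+\|", r"\1\2 |", line)
         for line in raw.splitlines()]

buf = io.StringIO("\n".join(fixed))
df = pd.read_csv(buf, sep=r"\s+")
print(df["grace"].tolist())  # -> ['5days', 'none']
```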

The wrapper approach looks interesting, but it looks like I need to read
up more on context managers before adding that to my own code, otherwise
I may not understand it in a month's time.

Cheers,

Loris

-- 
This signature is currently under construction.


More information about the Python-list mailing list