Preprocessing a not-quite-fixed-width file before parsing

Loris Bennett loris.bennett at fu-berlin.de
Fri Nov 25 02:09:52 EST 2022


Thomas Passin <list1 at tompassin.net> writes:

> On 11/24/2022 9:06 AM, Loris Bennett wrote:
>> Thomas Passin <list1 at tompassin.net> writes:
>> 
>>> On 11/23/2022 11:00 AM, Loris Bennett wrote:
>>>> Hi,
>>>> I am using pandas to parse a file with the following structure:
>>>> Name       fileset    type             KB      quota      limit   in_doubt    grace |    files   quota    limit in_doubt    grace
>>>> shortname  sharedhome USR        14097664  524288000  545259520          0     none |   107110       0        0        0     none
>>>> gracedays  sharedhome USR       774858944  524288000  775946240          0   5 days |  1115717       0        0        0     none
>>>> nametoolong sharedhome USR        27418496  524288000  545259520          0     none |    11581       0        0        0     none
>>>> I was initially able to use
>>>>     df = pandas.read_csv(file_name, delimiter=r"\s+")
>>>> because all the values for 'grace' were 'none'.  Now, however,
>>>> non-"none" values have appeared and this fails.
>>>> I can't use
>>>>     pandas.read_fwf
>>>> even with an explicit colspec, because the names in the first column
>>>> which are too long for the column will displace the rest of the data to
>>>> the right.
>>>> The report which produces the file could in fact also generate a
>>>> properly delimited CSV file, but I have a lot of historical data in the
>>>> readable but poorly parsable format above that I need to deal with.
>>>> If I were doing something similar in the shell, I would just pipe
>>>> the
>>>> file through sed or something to replace '5 days' with, say '5_days'.
>>>> How could I achieve a similar sort of preprocessing in Python, ideally
>>>> without having to generate a lot of temporary files?
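
What I had in mind was roughly this in-memory stand-in for the sed pipe;
a minimal sketch, assuming 'days' is the only multi-word value and using
a made-up file name:

import re

with open("quota_report.txt") as f:   # hypothetical file name
    text = f.read()

# In-memory equivalent of: sed 's/\([0-9]\+\) days/\1_days/g'
text = re.sub(r"(\d+) days\b", r"\1_days", text)
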
>>>
>>> This is really annoying, isn't it?  A space-separated line with spaces
>>> in data entries.   If the example you give is typical, I don't think
>>> there is a general solution.  If you know there are only certain
>>> values like that, then you could do a search-and-replace for them in
>>> Python just like the example you gave for "5 days".
>>>
>>> If you know that the field that might contain entries with spaces is
>>> the same one, e.g., the one just before the "|" marker, you could make
>>> use of that. But it could be tricky.
>>>
>>> I don't know how many files like this you will need to process, nor
>>> how many rows they might contain. If I were to tackle this job, I
>>> would probably do some quality checking first.  Using this example
>>> file, figure out how many fields there are supposed to be.  First,
>>> split the file into lines:
>>>
>>> with open("filename") as f:
>>>     lines = f.readlines()
>>>
>>> # Check space-separated fields defined in first row:
>>> fields = lines[0].split()
>>> num_fields = len(fields)
>>> print(num_fields)   # e.g., 100
>>>
>>> # Find lines that have the wrong number of fields
>>> bad_lines = []
>>> for line in lines:
>>>     fields = line.split()
>>>     if len(fields) != num_fields:
>>>         bad_lines.append(line)
>>>
>>> print(len(bad_lines))
>>>
>>> # Inspect a sample
>>> for line in bad_lines[:10]:
>>>     print(line)
>>>
>>> This will give you an idea of how many problem lines there are, and
>>> whether they can all be fixed by a simple replacement.  If they can,
>>> and this is the only file you need to handle, just fix it up and run it.
>>> I would replace the spaces with tabs or commas.  Splitting a line on
>>> spaces (split()) takes care of the issue of having a variable number
>>> of spaces, so that's easy enough.
>>>
>>> If you will need to handle many files, and you can automate the fixes
>>> - possibly with a regular expression - then you should preprocess each
>>> file before giving it to pandas.  Something like this:
>>>
>>> def fix_line(line):
>>>     """Test line for field errors and fix errors if any."""
>>>     # ....
>>>     return fixed
>>>
>>> # For each file
>>> with open("filename") as f:
>>>     lines = f.readlines()
>>>
>>> fixed_lines = []
>>> for line in lines:
>>>     fixed = fix_line(line)
>>>     fields = fixed.split()
>>>     tabified = '\t'.join(fields) + '\n' # Could be done by fix_line()
>>>     fixed_lines.append(tabified)
>>>
>>> # Now use an io.StringIO to feed the data to pandas
>>> # From memory, some details may not be right
>>> import io
>>> f = io.StringIO()
>>> f.writelines(fixed_lines)
>>>
>>> # Give f to pandas as if it were an external file
>>> # ...
>>>
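
Filling in the details Thomas flagged as "from memory": writelines()
adds no newlines, so each fixed line needs its trailing '\n', and the
buffer has to be rewound with seek(0) before pandas reads it.  A minimal
sketch, with fix_line() stubbed out and a made-up file name:

import io
import pandas

def fix_line(line):
    """Placeholder for the per-line repair sketched above."""
    return line

with open("quota_report.txt") as f:   # hypothetical file name
    lines = f.readlines()

fixed_lines = []
for line in lines:
    fields = fix_line(line).split()
    fixed_lines.append('\t'.join(fields) + '\n')  # writelines() adds no '\n'

buf = io.StringIO()
buf.writelines(fixed_lines)
buf.seek(0)   # rewind so read_csv() starts at the top

df = pandas.read_csv(buf, sep='\t')
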
>> Thanks to both Gerard and Thomas for the pointer to StringIO.  I ended
>> up just reading the file line-by-line, using a regex to replace
>>    '<n> <units> |'
>> with
>>    '<n><units> |'
>> and writing the new lines to a StringIO, which I then passed to
>> pandas.read_csv.
>> The wrapper approach looks interesting, but it looks like I need to
>> read up more on context managers before adding it to my own code;
>> otherwise I may not understand it in a month's time.
>
> Glad that StringIO works for you here. I seem to remember that after
> writing to the StringIO, you have to seek to 0 before reading from
> it. Better check that point!

Stefan (whom I forgot to thank: apologies, Stefan!) mentioned seek(0),
so fortunately I was primed when I read the Python documentation for
io.StringIO.
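
Roughly, the preprocessing looks like this; a sketch rather than the
exact code, with the file name and the unit alternation made up (only
'days' appears in the sample above):

import io
import re
import pandas

buf = io.StringIO()
with open("quota_report.txt") as f:   # hypothetical file name
    for line in f:
        # Join '<n> <units>' before the '|', e.g. '5 days |' -> '5days |'
        buf.write(re.sub(r"(\d+)\s+(days?|hours?)\s+\|", r"\1\2 |", line))

buf.seek(0)   # Stefan's point: rewind before handing to pandas
df = pandas.read_csv(buf, sep=r"\s+")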

Cheers,

Loris

-- 
This signature is currently under construction.

