Reading by positions plain text files

Mon Dec 13 18:29:52 EST 2010

On Dec 12, 11:21 pm, Dennis Lee Bieber <wlfr... at ix.netcom.com> wrote:
> On Sun, 12 Dec 2010 07:02:13 -0800 (PST), javivd
> <javiervan... at gmail.com> declaimed the following in
> gmane.comp.python.general:
>
>
>
> > f = open(r'c:\somefile.txt', 'w')
>
> > f.write('0123456789\n0123456789\n0123456789')
>
>         Not the most explanatory sample data... It would be better if the
> records had different contents.
>
> > f.close()
>
> > f = open(r'c:\somefile.txt', 'r')
>
> > for line in f:
>
>         Here you extract one "line" from the file
>
> >     f.seek(3,0)
> >     print f.read(1) #just to know if it's printing the right column
>
>         And here you ignored the entire line you read, seeking to the fourth
> byte from the beginning of the file, and reading just one byte from it.
>
>         I have no idea of how seek()/read() behaves relative to line
> iteration in the for loop... Given the small size of the test data set
> it is quite likely that the first "for line in f" resulted in the entire
> file being read into a buffer, and that buffer scanned to find the line
> ending and return the data preceding it; then the buffer position is set
> to after that line ending so the next "for line" continues from that
> point.
>
>         But in a situation with a large data set, or an unbuffered I/O
> system, the seek()/read() could easily result in resetting the file
> position used by the "for line", so that the second call returns
> "456789\n"... And all subsequent calls too, resulting in an infinite
> loop.
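
A minimal sketch of the alternative, written in Python 3 syntax rather than the 2.x of the post: each `line` already holds the whole record, so the fourth column can be taken by indexing the string, with no seek() at all. The helper name `column_of` and the in-memory sample are assumptions made for illustration; the sample records differ from one another so that the result is visibly per-line.

```python
import io

def column_of(lines, index):
    """Return the character at 0-based position `index` of each line."""
    result = []
    for line in lines:
        record = line.rstrip("\n")   # drop the line ending before indexing
        if index < len(record):
            result.append(record[index])
        else:
            result.append(None)      # line too short for the requested column
    return result

# io.StringIO stands in for an open text file (an assumption of this sketch).
sample = io.StringIO("abcdefghij\nklmnopqrst\n0123456789")
print(column_of(sample, 3))   # → ['d', 'n', '3']
```

Because the file position is never moved by hand, the line iterator and the column lookup cannot interfere with each other.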
>
>         Presuming the assignment requires pulling multiple selected fields
> from individual records, where each record is of the same
> format/spacing, AND that the field selection can not be preprogrammed...
>
> Sample data file (use fixed width font to view):
> -=-=-=-=-=-
> Wulfraed       09Ranger  1915
> Bask Euren     13Cleric  1511
> Aethelwulf     07Mage    0908
> Cwiculf        08Mage    1008
> -=-=-=-=-=-
>
> Sample format definition file:
> -=-=-=-=-=-
> Name    0-14
> Level   15-16
> Class   17-24
> THAC0   25-26
> Armor   27-28
> -=-=-=-=-=-
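
The column ranges in this definition are inclusive on both ends, while Python slices exclude the end index, so "0-14" corresponds to `line[0:15]`. A small sketch of that mapping, in Python 3 syntax; the helper name `to_slice` is hypothetical and the record is rebuilt with `ljust` only to make the padding explicit.

```python
def to_slice(columns):
    """Convert an inclusive 'start-end' (or single 'start') spec to a slice."""
    parts = columns.strip().split("-")
    start = int(parts[0])
    end = int(parts[-1])          # a single column means end == start
    return slice(start, end + 1)  # +1 because slice ends are exclusive

# First sample record, padded to the widths given in the format definition.
record = "Wulfraed".ljust(15) + "09" + "Ranger".ljust(8) + "19" + "15"

print(repr(record[to_slice("0-14")]))    # the 15-character Name field
print(repr(record[to_slice("15-16")]))   # the Level field
print(repr(record[to_slice("27")]))      # a single column
```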
>
> Code to process (Python 2.5, with minimal error handling):
> -=-=-=-=-=-
>
> class Extractor(object):
>     def __init__(self, formatFile):
>         ff = open(formatFile, "r")
>         self._format = {}
>         self._length = 0
>         for line in ff:
>             form = line.split("\t") #file must be tab separated
>             if len(form) != 2:
>                 print "Invalid file format definition: %s" % line
>                 continue
>             name = form[0]
>             columns = form[1].split("-")
>             if len(columns) == 1:   #single column definition
>                 start = int(columns[0])
>                 end = start
>             elif len(columns) == 2:
>                 start = int(columns[0])
>                 end = int(columns[1])
>             else:
>                 print "Invalid column definition: %s" % form[1]
>                 continue
>             self._format[name] = (start, end)
>             self._length = max(self._length, end + 1) #end is an inclusive index
>         ff.close()
>
>     def __call__(self, line):
>         data = {}
>         if len(line) < self._length:
>             print "Data line is too short for required format: ignored"
>         else:
>             for (name, (start, end)) in self._format.items():
>                 data[name] = line[start:end+1]
>         return data
>
> if __name__ == "__main__":
>     FORMATFILE = "SampleFormat.tsv"
>     DATAFILE = "SampleData.txt"
>
>     characterExtractor = Extractor(FORMATFILE)
>
>     df = open(DATAFILE, "r")
>     for line in df:
>         fields = characterExtractor(line)
>         for (name, value) in fields.items():
>             print "Field name: '%s'\t\tvalue: '%s'" % (name, value)
>         print
>
>     df.close()
> -=-=-=-=-=-
>
> Output from running above code:
> -=-=-=-=-=-
> Field name: 'Armor'             value: '15'
> Field name: 'THAC0'             value: '19'
> Field name: 'Level'             value: '09'
> Field name: 'Class'             value: 'Ranger  '
> Field name: 'Name'              value: 'Wulfraed       '
>
> Field name: 'Armor'             value: '11'
> Field name: 'THAC0'             value: '15'
> Field name: 'Level'             value: '13'
> Field name: 'Class'             value: 'Cleric  '
> Field name: 'Name'              value: 'Bask Euren     '
>
> Field name: 'Armor'             value: '08'
> Field name: 'THAC0'             value: '09'
> Field name: 'Level'             value: '07'
> Field name: 'Class'             value: 'Mage    '
> Field name: 'Name'              value: 'Aethelwulf     '
>
> Field name: 'Armor'             value: '08'
> Field name: 'THAC0'             value: '10'
> Field name: 'Level'             value: '08'
> Field name: 'Class'             value: 'Mage    '
> Field name: 'Name'              value: 'Cwiculf        '
> -=-=-=-=-=-
>
>         Note that string fields have not been trimmed, and numeric fields
> are still in text format... The format definition file would need to be
> expanded to include a "string", "integer", "float" (and "Boolean"?) code
> in order for the extractor to do proper type conversions.
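
One possible shape for that extension, sketched in Python 3 syntax: a third tab-separated column in the format definition names the type, and a lookup table maps it to a converter. The three-column layout, the type codes, and the names `CONVERTERS`, `parse_format`, and `extract` are all assumptions of this sketch and not part of the code above.

```python
# Hypothetical type codes mapped to converter functions.
CONVERTERS = {
    "string": str.strip,   # trims the padding from text fields
    "integer": int,
    "float": float,
}

def parse_format(format_lines):
    """Parse 'name<TAB>start-end<TAB>type' lines into a list of field specs."""
    fields = []
    for raw in format_lines:
        parts = raw.rstrip("\n").split("\t")
        if len(parts) != 3:
            raise ValueError("Invalid format definition: %r" % raw)
        name, columns, type_code = parts
        bounds = columns.split("-")
        if len(bounds) not in (1, 2):
            raise ValueError("Invalid column definition: %r" % columns)
        start = int(bounds[0])
        end = int(bounds[-1])          # single column: end == start
        fields.append((name, start, end, CONVERTERS[type_code]))
    return fields

def extract(line, fields):
    """Slice one fixed-width record and convert each field to its type."""
    record = line.rstrip("\n")
    needed = max(end for (_, _, end, _) in fields) + 1
    if len(record) < needed:
        raise ValueError("Data line is too short for required format")
    return {name: convert(record[start:end + 1])
            for (name, start, end, convert) in fields}

FORMAT = [
    "Name\t0-14\tstring",
    "Level\t15-16\tinteger",
    "Class\t17-24\tstring",
    "THAC0\t25-26\tinteger",
    "Armor\t27-28\tinteger",
]

fields = parse_format(FORMAT)
# First sample record, padded to the widths given in the format definition.
record = "Wulfraed".ljust(15) + "09" + "Ranger".ljust(8) + "19" + "15"
print(extract(record, fields))
```

For the first sample record this yields 'Wulfraed' and 'Ranger' without padding, and 9, 19 and 15 as integers. Raising an exception on a bad line, instead of printing and continuing as the code above does, is a design choice of the sketch.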
>
> --
>         Wulfraed                 Dennis Lee Bieber         AF6VN
>         wlfr... at ix.netcom.com    HTTP://wlfraed.home.netcom.com/

Clearly it's working. Although this code is beyond my Python knowledge
(I don't get along with classes; maybe it's a good moment to learn
about them...), I'll dig into it.

Thanks a lot! It really helps...

J