ignore specific data

Bengt Richter bokr at oz.net
Tue Nov 22 02:47:05 EST 2005


On 21 Nov 2005 13:59:12 -0800, pkilambi at gmail.com wrote:

>I tried the solutions you provided. They are not as robust as I
>thought they would be.
>Maybe I should state the problem more clearly.
>
>Here it goes.
>
>I have a bunch of documents, and each document has a header which is
>common to all files. I read each file, process it, and compute the
>frequency of words in each file. Now I want to ignore the header in
>each file. That is easy if the header is always at the top, but
>apparently it's not; it could be at the bottom as well. So I want a
>function which goes through the file content, ignores the common
>header, and returns the remaining text to compute the frequencies. Also,
>the header is not just one line. It includes licences and all other
>stuff, and may be 50 to 60 lines long. This "remove_header" has to be
>very efficient, as the files may be huge. As this is a very small
>part of the whole problem, I don't want it to slow down my entire
>code.
>
Does this "header" have a fixed constant-string beginning and a similarly
fixed end, with possibly variable text between? And can there be
multiple headers (i.e., header+ instead of header)?

Assuming this is a grammar[1] of your file:

    datafile: [leading_string] header+ [trailing_string]
    header: header_start header_middle header_end

0) Is this a text file of lines, or something else?
1) Is header_start a fixed constant string?
2) Does header_start begin with the first character of a line?
3) Does it end at the end of the same line, or 3a) of a subsequent line?
4) Does header_end begin at the beginning of a line?
4a) Does it end at the end of the same line (like 3),
4b) or of a subsequent line (like 3a)?
5) Can we ignore header_middle as never containing header_end in any
   form (e.g., in quotes or comments)?
6) Anything else you can think of ;-)
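
If the answers turn out to be the simple ones (a text file of lines, with
header_start and header_end each being a fixed string that occupies a whole
line of its own), a minimal sketch of "remove_header" might look like the
following. The two marker strings are placeholders I made up, not your real
ones, and the function name is just for illustration. It makes a single pass
and never holds more than one line in memory, so file size should not matter
much.

```python
# Hypothetical markers: the real header_start/header_end strings are unknown.
HEADER_START = "BEGIN LICENSE HEADER"
HEADER_END = "END LICENSE HEADER"

def remove_headers(lines, start=HEADER_START, end=HEADER_END):
    """Yield only the lines that lie outside any start..end header block."""
    in_header = False
    for line in lines:
        stripped = line.rstrip("\r\n")
        if in_header:
            # Drop everything up to and including the end-marker line.
            if stripped == end:
                in_header = False
        elif stripped == start:
            # Drop the start-marker line and begin skipping.
            in_header = True
        else:
            yield line

sample = [
    "intro text\n",
    "BEGIN LICENSE HEADER\n",
    "licence stuff\n",
    "END LICENSE HEADER\n",
    "body text\n",
]
kept = list(remove_headers(sample))
```

Since it accepts any iterable of lines, an open file object can be passed in
directly. It handles a header at the top, at the bottom, or several of them,
but only under the assumptions stated above.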


[1] Using [x] to mean optional x, and some_name to mean a string composed
by the rules given in "some_name: ..." (or described in prose, as here ;-).
(BTW, some_name means exactly one, [some_name] zero or one, some_name*
zero or more, and some_name+ one or more.) What's needed is the final
resolution to actual constants or patterns of primitives. Can you define

    header_start: "The actual fixed constant character string defining the header"
    header_end: "whatever?"
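
Once those two constants are known, the "header" rule could be turned almost
literally into a regular expression. This is only a sketch: the two strings
below are stand-ins I invented, and it assumes the answer to (5) is yes, i.e.
header_middle never contains header_end. It also reads the whole text into
memory, which may or may not be acceptable for huge files.

```python
import re

# Placeholder constants; substitute the real strings once they are known.
HEADER_START = "<<header-start>>"
HEADER_END = "<<header-end>>"

# header: header_start header_middle header_end
# The middle is matched non-greedily so that each header stops at the
# first header_end that follows its header_start.
HEADER_RE = re.compile(
    re.escape(HEADER_START) + r".*?" + re.escape(HEADER_END),
    re.DOTALL,  # let "." match newlines, since headers span many lines
)

def strip_headers(text):
    """Return text with every header occurrence removed."""
    return HEADER_RE.sub("", text)

cleaned = strip_headers("abc <<header-start>>x\ny<<header-end>> def")
```

Unlike a line-by-line scan, this does not care whether the markers start at
the beginning of a line.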

Regards,
Bengt Richter


