Sniffing Text Files

David Pratt fairwinds at eastlink.ca
Fri Sep 23 10:13:44 EDT 2005


Thanks Mike for your reply.  I was not aware of libmagic and will look 
to see what it provides.  As for your first suggestion, that is what 
I have been looking at: probably a combination of regex and readlines or 
similar, but I am trying to get a better sense of the best approach.  
I can't rely on file extensions in this case, so trusting that the 
content matches the extension would not be wise.  Mime types can be 
helpful but don't always give you the full story either, hence the need 
to sniff in the first place so I can apply the right process to each 
file.  As it stands I am filtering by mime type ahead of the importing 
process to limit the possibilities.
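
To make the idea concrete, here is a rough sketch of the kind of sniffer 
under discussion: read only the first chunk of the file and apply a few 
cheap checks.  The function name sniff_format, the 4096-byte chunk size, 
the returned labels and the heuristics themselves are all assumptions 
made for illustration; real files would need rules tuned to the actual 
formats.

```python
def sniff_format(path, chunk_size=4096):
    """Guess a text format from the first chunk_size bytes of a file.

    Hypothetical helper: the labels and heuristics are illustrative only.
    """
    # Read a bounded prefix so large files are never loaded whole.
    with open(path, "rb") as f:
        head = f.read(chunk_size)

    # Assumption: the data is UTF-8 compatible; bad bytes are replaced.
    text = head.decode("utf-8", errors="replace").lstrip("\ufeff")

    lines = text.splitlines()
    # A full chunk probably ends mid-line, so drop the partial last line.
    if len(head) == chunk_size and len(lines) > 1:
        lines = lines[:-1]
    lines = [ln for ln in lines if ln.strip()]
    if not lines:
        return "unknown"

    # XML: the first non-blank content opens with a tag or declaration.
    if lines[0].lstrip().startswith("<"):
        return "xml"

    # Tab delimited: every sampled line has the same non-zero tab count.
    tab_counts = {ln.count("\t") for ln in lines}
    if len(tab_counts) == 1 and tab_counts.pop() > 0:
        return "tab-delimited"

    # Pipe tokens: pipes appear somewhere in the sample.
    if any("|" in ln for ln in lines):
        return "pipe-token"

    return "unknown"
```

The result could then be used to pick an importer, for example 
importers[sniff_format(path)](path), where importers is a hypothetical 
dict mapping each label to a parsing function.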

Regards,
David


On Friday, September 23, 2005, at 02:01 AM, Mike Meyer wrote:

> David Pratt <fairwinds at eastlink.ca> writes:
>
>> Hi. I have files that I will be importing in at least four different
>> plain text formats: one is tab delimited, a couple are token based
>> and use pipes (but are not delimited with pipes), and another is
>> xml. There will likely be others as well, but the data needs to
>> be extracted and rewritten to a single format. The files can be fairly
>> large (several MB) so I do not want to read the whole file into
>> memory. What approach would be recommended for sniffing the files for
>> the different text formats? I realize the CSV module has a sniffer but
>> it is more or less limited to delimited files.  I have
>> a couple of ideas on what I could do but I am interested in hearing
>> from others on how they might handle something like this so I can
>> determine the best approach to take. Many thanks.
>
> With GB memory machines being common, I wouldn't think twice about
> slurping a couple of meg into RAM to examine. But if that's too much,
> how about simply reading in the first <chunk> bytes, and checking that
> for the characters you want? <chunk> should be large enough to reveal
> what you need, but small enough that you're comfortable reading it
> in. I'm not sure that there aren't funny interactions between read and
> readline, so do be careful with that.
>
> Another approach to consider is libmagic. Google turns up a number of
> links to Python wrappers for it.
>
>       <mike
> -- 
> Mike Meyer <mwm at mired.org>			http://www.mired.org/home/mwm/
> Independent WWW/Perforce/FreeBSD/Unix consultant, email for more 
> information.


