Sniffing Text Files

David Pratt fairwinds at eastlink.ca
Fri Sep 23 10:51:17 EDT 2005


Hi Steven. Thank you for your detailed response.  The code will be 
executed on a web server with limited memory, hence the desire to keep 
file loading in check. I like the scoring approach you have suggested 
for picking the best guess. It stays fairly modular with respect to how 
detailed you want to be about adding tests specific to a particular 
format (which would increase the certainty of choosing it correctly).  
I wish I had more control over the files I may receive, but I have to 
assume the worst. File extensions do not always tell the true story, 
and sometimes they are left off entirely.  MIME types are not always 
interpreted properly either, and I am restricting these before the 
sniffing stage so that certain types of files never get that far.  I 
think what I might do is read the first x lines with readlines(). A 
sample of up to the first 100 lines should probably be good enough to 
generate decent scores for each type.
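
A minimal sketch of that sampling step, assuming a hypothetical helper 
named read_sample (not from the original thread). Note that readlines() 
accepts an optional size hint measured in bytes/characters rather than a 
line count, so the sketch uses itertools.islice to stop after a fixed 
number of lines instead of loading the whole file:

```python
from itertools import islice


def read_sample(filename, max_lines=100):
    """Return up to max_lines lines from the start of filename.

    read_sample and max_lines are illustrative names, not part of the
    quoted sniff() function.
    """
    # Iterating over the file object reads it lazily, so the rest of
    # the file is never built into a list in memory.
    with open(filename, "r") as fp:
        return list(islice(fp, max_lines))
```

The returned list can then be scored line by line in the same way the 
quoted function scores the output of fp.readlines().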

Regards,
David

> def sniff(filename):
>     """Return one of "xml", "csv", "txt" or "tkn", or "???"
>     if it can't decide the file type.
>     """
>     fp = open(filename, "r")
>     scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
>     for line in fp.readlines():
>         if not line.strip():  # lines keep their "\n", so test for blank
>             continue
>         if line[0] == "<":
>             scores["xml"] += 1
>         if '\t' in line:
>             scores["txt"] += 1
>         if ',' in line:
>             scores["csv"] += 1
>         if SOMETOKEN in line:  # SOMETOKEN: placeholder for the tkn marker
>             scores["tkn"] += 1
>         # Pick the best guess:
>         L = [(score, name) for (name, score) in scores.items()]
>         L.sort()
>         L.reverse()
>         # L is now sorted from highest down to lowest by score.
>         best_guess = L[0]
>         second_best_guess = L[1]
>         if best_guess[0] > 10*second_best_guess[0]:
>             fp.close()
>             return best_guess[1]
>     fp.close()
>     return "???"
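
The scoring idea in the quoted function can also be arranged so that 
each format's test lives in one table, which keeps it easy to add 
format-specific checks. The following is a rough sketch under stated 
assumptions, not the code from the thread: the names RULES, score_lines 
and pick_best, the "@@" token marker and the ratio of 2 are all 
hypothetical. Unlike the quoted function, it decides only after the 
whole sample has been scored.

```python
# Hypothetical rule table: format name -> test applied to each line.
# "@@" stands in for SOMETOKEN, which the quoted code leaves undefined.
RULES = {
    "xml": lambda line: line.startswith("<"),
    "csv": lambda line: "," in line,
    "txt": lambda line: "\t" in line,
    "tkn": lambda line: "@@" in line,
}


def score_lines(lines, rules=RULES):
    """Count, per format, how many non-blank lines pass that format's test."""
    scores = dict.fromkeys(rules, 0)
    for line in lines:
        if not line.strip():
            continue  # blank lines carry no evidence either way
        for name, test in rules.items():
            if test(line):
                scores[name] += 1
    return scores


def pick_best(scores, ratio=2):
    """Return the top-scoring name if it beats the runner-up by more than
    ratio times, else "???". Assumes at least two formats are scored."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score > 0 and best_score > ratio * second_score:
        return best
    return "???"


sample = ["<a>\n", "<b>text</b>\n", "\n", "</a>\n"]
print(pick_best(score_lines(sample)))  # prints: xml
```

Adding a new format then means adding one entry to the table, and the 
ratio can be tuned to trade certainty against the number of "???" 
results.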


