Sniffing Text Files
David Pratt
fairwinds at eastlink.ca
Fri Sep 23 10:51:17 EDT 2005
Hi Steven. Thank you for your detailed response. The code will be
executed on a web server with limited memory, hence the desire to keep
file loading in check. I like the approach you have suggested of scoring
each format to give the best guess. It keeps things fairly modular with
respect to how detailed you want to be about adding checks specific to
a particular format (which would increase the certainty of choosing it
correctly). I wish I had more control over the files I may receive, but
I have to assume the worst. File extensions do not always tell the
true story, and sometimes they are left off. MIME types are not
always interpreted properly either, and I am restricting these before
getting to the sniffing stage to keep certain types of files from
getting that far. I think what I might do is read the first x lines
with readlines(). I think a sample of up to the first 100 lines should
probably be good enough to generate a decent score for each type.
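
A minimal sketch of that sampling step, assuming a current Python 3
interpreter. Since readlines() with no argument reads the whole file,
this version slices the file iterator instead, so only the sampled lines
are held in memory. The names read_sample and max_lines are illustrative
and not part of the code quoted below.

```python
from itertools import islice


def read_sample(filename, max_lines=100):
    """Return at most max_lines lines from the start of the file."""
    # Iterating over the file object reads lazily, so islice stops
    # after max_lines lines instead of loading the whole file.
    # errors="replace" is an assumption: it keeps odd bytes in an
    # unknown upload from raising a decoding error while sniffing.
    with open(filename, "r", errors="replace") as fp:
        return list(islice(fp, max_lines))
```

The returned list can then be fed to whatever per-line scoring is used,
in place of looping over the whole file.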
Regards,
David
> def sniff(filename):
>     """Return one of "xml", "csv", "txt" or "tkn", or "???"
>     if it can't decide the file type.
>     """
>     fp = open(filename, "r")
>     scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
>     for line in fp.readlines():
>         if not line:
>             continue
>         if line[0] == "<":
>             scores["xml"] += 1
>         if '\t' in line:
>             scores["txt"] += 1
>         if ',' in line:
>             scores["csv"] += 1
>         # SOMETOKEN is a placeholder for the marker of the "tkn" format.
>         if SOMETOKEN in line:
>             scores["tkn"] += 1
>     # Pick the best guess:
>     L = [(score, name) for (name, score) in scores.items()]
>     L.sort()
>     L.reverse()
>     # L is now sorted from highest down to lowest by score.
>     best_guess = L[0]
>     second_best_guess = L[1]
>     if best_guess[0] > 10*second_best_guess[0]:
>         fp.close()
>         return best_guess[1]
>     fp.close()
>     return "???"
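
The modularity mentioned above could be sketched with a table of
per-format checks, so that adding a more specific test for one format
means adding one entry to its list. This is only an illustration under
assumptions: RULES, score_lines, pick_format and the "%%" token marker
are made-up names and values, not part of the quoted code.

```python
# Hypothetical rule table: each format maps to a list of per-line checks.
# The "%%" marker for "tkn" is a stand-in for the real token.
RULES = {
    "xml": [lambda line: line.startswith("<")],
    "txt": [lambda line: "\t" in line],
    "csv": [lambda line: "," in line],
    "tkn": [lambda line: "%%" in line],
}


def score_lines(lines, rules=RULES):
    """Count, per format, how many checks fire across the given lines."""
    scores = {name: 0 for name in rules}
    for line in lines:
        for name, checks in rules.items():
            scores[name] += sum(1 for check in checks if check(line))
    return scores


def pick_format(scores, margin=10):
    """Return the top format if it beats the runner-up by more than
    margin times its score, otherwise "???"."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    if best_score > margin * second_score:
        return best
    return "???"
```

For example, pick_format(score_lines(sample)) gives a guess for a list of
sampled lines, and returns "???" when no format clearly dominates.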