Sniffing Text Files

Fri Sep 23 08:34:35 EDT 2005

On Fri, 23 Sep 2005 01:20:49 -0300, David Pratt wrote:

> Hi. I have files that I will be importing in at least four different 
> plain text formats, one of them being tab delimited format, a couple 
> being token based uses pipes (but not delimited with pipes), another 
> being xml. There will likely be others as well but the data needs to be 
> extracted and rewritten to a single format. The files can be fairly 
> large (several MB) so I do not want to read the whole file into memory. 

Why ever not? On modern machines, "several MB" counts as small files. Let
your operating system worry about memory, at least until you get to really
big (several hundred megabytes) files.
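
For comparison, here is a rough sketch of both approaches. The function names are made up for illustration. Reading everything at once is the simple choice for files of this size, and iterating over the file object is there if you do want to keep only one line in memory at a time.

```python
def read_whole_file(path):
    """Read the entire file into one string; fine for a few MB."""
    with open(path, "r") as fp:
        return fp.read()


def count_lines_lazily(path):
    """Iterate over the file object, holding one line at a time."""
    count = 0
    with open(path, "r") as fp:
        for _line in fp:
            count += 1
    return count
```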

> What approach would be recommended for sniffing the files for the 
> different text formats. 

In no particular order:

(1) Push the problem onto the user: they specify what sort of file they
think it is. If they tell your program the file is XML when it is in fact
a CSV file, your XML importer will report back that the input file is
a broken XML file.

(2) Look at the file extension (.xml, .csv, .txt, etc) and assume that it
is correct. If the user gives you an XML file called "data.csv", you can
hardly be blamed for treating it wrong. This behaviour is more accepted
under Windows than Linux or Macintosh.

(3) Use the Linux command "file" to determine the contents of the file.
There may be equivalents on other OSes.

(4) Write your own simple scanner that tries to determine if the file is
xml, csv, tab-delimited text, etc. A basic example:

(Will need error checking and hardening)

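# Placeholder: set this to whatever marker identifies your token-based
# format. The pipe character is only an assumption.
SOMETOKEN = "|"

def sniff(filename):
    """Return one of "xml", "csv", "txt" or "tkn", or "???"
    if it can't decide the file type.
    """
    scores = {"xml": 0, "csv": 0, "txt": 0, "tkn": 0}
    # Iterating over the file object reads one line at a time, so the
    # whole file is never held in memory.
    with open(filename, "r") as fp:
        for line in fp:
            if not line.strip():
                continue  # blank lines tell us nothing
            if line[0] == "<":
                scores["xml"] += 1
            if '\t' in line:
                scores["txt"] += 1
            if ',' in line:
                scores["csv"] += 1
            if SOMETOKEN in line:
                scores["tkn"] += 1
            # Pick the best guess so far:
            L = [(score, name) for (name, score) in scores.items()]
            L.sort()
            L.reverse()
            # L is now sorted from highest down to lowest by score.
            best_guess = L[0]
            second_best_guess = L[1]
            # Stop early once one format clearly dominates. While every
            # other score is still zero, this fires on the first match.
            if best_guess[0] > 10*second_best_guess[0]:
                return best_guess[1]
    return "???"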

Note that the above code really isn't good enough for production work, but
it should give you an idea how to proceed.
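
Options (2) and (3) can be sketched in much the same spirit. Treat this as an illustration only: the extension table and both function names are invented for the example, and the second function assumes a Unix-like system where the "file" command is on the PATH (it returns None otherwise).

```python
import os
import shutil
import subprocess

# Hypothetical mapping from file extension to a format label;
# adjust it to the formats you actually import.
EXTENSION_TO_FORMAT = {".xml": "xml", ".csv": "csv", ".txt": "txt"}


def guess_from_extension(filename):
    """Option (2): trust the extension; return "???" if it is unknown."""
    ext = os.path.splitext(filename)[1].lower()
    return EXTENSION_TO_FORMAT.get(ext, "???")


def describe_with_file_command(filename):
    """Option (3): return the description printed by the "file" command,
    or None if the command is missing or exits with an error."""
    if shutil.which("file") is None:
        return None
    result = subprocess.run(
        ["file", "-b", filename],  # -b: print the description only
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout.strip()
```

The "file" command prints a free-form description rather than a fixed label, so you would still have to match that text against the formats you care about.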

Hope that helps.

-- 
Steven.