converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

Fri May 18 20:04:06 EDT 2007

py_genetic wrote:
> Hello,
> 
> I'm importing large text files of data using csv.  I would like to add
> some more auto sensing abilities.  I'm considering sampling the data
> file and doing some fuzzy logic scoring on the attributes (columns in a
> database/csv file, e.g. height, weight, income etc.) to determine the
> most efficient 'type' to convert the attribute column into for further
> processing and efficient storage...
> 
> Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
> there', '100,000,000,000'], [next row...] ....]
> 
> Aside from a missing attribute designator, we can assume that the same
> type of data continues through a column.  For example, a string, int8,
> int16, float etc.
> 
> 1. What is the most efficient way in python to test whether a string
> can be converted into a given numeric type, or left alone if it's
> really a string like 'A' or 'hello'?  Speed is key.  Any thoughts?
> 
> 2. Is there anything out there already which deals with this issue?
> 
> Thanks,
> Conor
> 

This is untested, but here is an outline to do what you want.

First convert rows to columns:

columns = zip(*rows)
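
To make the transposition concrete, here is a small sketch. The sample rows are made up for illustration and are not from your file. Under Python 3, zip returns an iterator, so the sketch wraps it in list() in case the columns need to be walked more than once.

```python
# Hypothetical sample rows (illustrative only, not from the original data file).
rows = [
    ['8', '2.33', 'A'],
    ['12', '1.5', 'BB'],
]

# zip(*rows) groups the i-th element of every row together, i.e. it transposes.
# In Python 3 zip returns an iterator, so list() materializes the columns.
columns = list(zip(*rows))
print(columns)  # [('8', '12'), ('2.33', '1.5'), ('A', 'BB')]
```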

Okay, that was a lot of typing. Now, you should run down the columns, 
testing with the most restrictive type and working to less restrictive 
types. You will also need to keep in mind the potential for commas in 
your numbers--so you will need to write your own converters, determining 
for yourself what literals map to what values. Only you can decide what 
you really want here. Here is a minimal idea of how I would do it:

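def make_int(astr):
    # An empty field (missing attribute) is treated as 0.
    if not astr:
        return 0
    # Strip thousands separators first; int() raises ValueError on failure.
    return int(astr.replace(',', ''))

def make_float(astr):
    # An empty field (missing attribute) is treated as 0.0.
    if not astr:
        return 0.0
    # Strip thousands separators first; float() raises ValueError on failure.
    return float(astr.replace(',', ''))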

make_str = lambda s: s

Now you can put the converters in a list, remembering to order them.

converters = [make_int, make_float, make_str]
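
To see why the order matters, here is a rough sketch of how the two numeric converters behave on the kinds of values in your example row. The try_convert helper is hypothetical and exists only for this demonstration; the converters are restated so the snippet runs on its own.

```python
def make_int(astr):
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

def try_convert(converter, value):
    """Hypothetical helper: return the converted value, or None on failure."""
    try:
        return converter(value)
    except ValueError:
        return None

print(try_convert(make_int, '100,000,000,000'))  # 100000000000
print(try_convert(make_int, '2.33'))             # None (not an int literal)
print(try_convert(make_float, '2.33'))           # 2.33
print(try_convert(make_float, 'A'))              # None
print(try_convert(make_int, ''))                 # 0 (empty field)
```

Note that float() also accepts literals such as 'nan' and 'inf', so a text column containing only those words would be sensed as float. Whether that is acceptable is one of the decisions only you can make.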

Now, go down the columns checking, moving to the next, less restrictive, 
converter when a particular converter fails. We assume that the make_str 
identity function will never fail. We could leave it out and have a 
flag, etc., for efficiency, but that is left as an exercise.

new_columns = []
for column in columns:
    for converter in converters:
        try:
            new_column = [converter(v) for v in column]
            break
        except ValueError:
            # This converter rejected a value; try the next, less restrictive one.
            continue
    new_columns.append(new_column)
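
As for the exercise of dropping make_str, one possible approach (a sketch, not the only way) is to wrap the inner loop in a function and fall through to the original strings when no numeric converter fits. The name convert_column and the sample rows are my own inventions for illustration.

```python
def make_int(astr):
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

def convert_column(column, converters=(make_int, make_float)):
    """Convert with the first converter that accepts every value in the column."""
    for converter in converters:
        try:
            return [converter(v) for v in column]
        except ValueError:
            continue  # a value was rejected; try the less restrictive converter
    return list(column)  # nothing numeric fit: keep the strings untouched

# Hypothetical sample rows, including an empty (missing) field.
rows = [['8', '2.33', 'A', '100,000'],
        ['', '4', 'BB', '7']]
new_columns = [convert_column(c) for c in zip(*rows)]
print(new_columns)
# [[8, 0], [2.33, 4.0], ['A', 'BB'], [100000, 7]]
```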

For no reason at all, convert back to rows:

new_rows = zip(*new_columns)

You must decide for yourself how to deal with ambiguities. For example, 
will '1.0' be a float or an int? The above assumes you want all values 
in a column to have the same type. Reordering the loops can give mixed 
types in columns, but would not fulfill your stated requirements. Some 
things are not as efficient as they might be (for example, the clumsy 
make_str could be eliminated). But adding tests to improve efficiency 
would cloud the logic.
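
For comparison only, here is roughly what reordering the loops looks like: each value is converted on its own, so a single column can end up with mixed types. The convert_value name is hypothetical, and this does not meet the one-type-per-column requirement; it just shows the '1.0' ambiguity in action.

```python
def make_int(astr):
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

def convert_value(value, converters=(make_int, make_float)):
    """Try each converter on a single value; fall back to the string itself."""
    for converter in converters:
        try:
            return converter(value)
        except ValueError:
            pass
    return value

column = ['1', '1.0', 'A', '']
mixed = [convert_value(v) for v in column]
print(mixed)  # [1, 1.0, 'A', 0]
```

Here '1' becomes an int while '1.0' becomes a float, because int('1.0') raises ValueError and the float converter picks it up.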

James