converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

John Machin sjmachin at lexicon.net
Fri May 18 23:29:54 EDT 2007


On 19/05/2007 10:04 AM, James Stroud wrote:
> py_genetic wrote:
>> Hello,
>>
>> I'm importing large text files of data using csv.  I would like to add
>> some more auto-sensing abilities.  I'm considering sampling the data
>> file and doing some fuzzy logic scoring on the attributes (columns in a
>> database/csv file, e.g. height, weight, income, etc.) to determine the
>> most efficient 'type' to convert each column into for further
>> processing and efficient storage...
>>
>> Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
>> there', '100,000,000,000'], [next row...] ...]
>>
>> Aside from a missing attribute designator, we can assume that the same
>> type of data continues through a column.  For example, a string, int8,
>> int16, float, etc.
>>
>> 1. What is the most efficient way in Python to test whether a string
>> can be converted into a given numeric type, or left alone if it's
>> really a string like 'A' or 'hello'?  Speed is key.  Any thoughts?
>>
>> 2. Is there anything out there already which deals with this issue?
>>
>> Thanks,
>> Conor
>>
> 
> This is untested, but here is an outline to do what you want.
> 
> First convert rows to columns:
> 
> 
> columns = zip(*rows)
> 
> 
> Okay, that was a lot of typing. Now, you should run down the columns, 
> testing with the most restrictive type and working to less restrictive 
> types. You will also need to keep in mind the potential for commas in 
> your numbers--so you will need to write your own converters, determining 
> for yourself what literals map to what values. Only you can decide what 
> you really want here. Here is a minimal idea of how I would do it:
> 
> 
> def make_int(astr):
>   if not astr:
>     return 0
>   else:
>     return int(astr.replace(',', ''))
> 
> def make_float(astr):
>   if not astr:
>     return 0.0
>   else:
>     return float(astr.replace(',', ''))
> 
> make_str = lambda s: s
> 
> 
> Now you can put the converters in a list, remembering to order them.
> 
> 
> converters = [make_int, make_float, make_str]
> 
> 
> Now, go down the columns checking each one, moving to the next, less 
> restrictive converter when a particular converter fails. We assume that 
> the make_str identity function will never fail. We could leave it out 
> and have a flag, etc., for efficiency, but that is left as an exercise.
> 
> 
> new_columns = []
> for column in columns:
>   for converter in converters:
>     try:
>       new_column = [converter(v) for v in column]
>       break
>     except (ValueError, TypeError):
>       continue
>   new_columns.append(new_column)
> 
> 
> For no reason at all, convert back to rows:
> 
> 
> new_rows = zip(*new_columns)
> 
> 
> You must decide for yourself how to deal with ambiguities. For example, 
> will '1.0' be a float or an int? The above assumes you want all values 
> in a column to have the same type. Reordering the loops can give mixed 
> types in columns, but would not fulfill your stated requirements. Some 
> things are not as efficient as they might be (for example, the clumsy 
> make_str could be eliminated), but adding tests to improve efficiency 
> would cloud the logic.
> 

[apologies in advance if this appears more than once]

This approach is quite reasonable, IF:
(1) the types involved follow a simple "ladder" hierarchy [ints pass the 
float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are 
actually instances of the intended type.

Constraint (1) falls apart if you need dates. Consider 31/12/99, 
31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's 
before you allow for dates in three different orders (dmy, mdy, ymd).
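
For instance, a usable date "converter" has to try several explicit
formats rather than rely on a single cast.  A minimal sketch (the format
list and the make_date name are only for illustration, not what I
actually use):

import datetime

# Illustrative only: try a handful of explicit formats in turn.  A real
# list would be driven by the orderings (dmy, mdy, ymd) and separators
# you expect in the data.
DATE_FORMATS = ['%d/%m/%y', '%d/%m/%Y', '%d%m%y', '%d%m%Y', '%d%b%y']

def make_date(astr):
    for fmt in DATE_FORMATS:
        try:
            return datetime.datetime.strptime(astr, fmt).date()
        except ValueError:
            pass
    raise ValueError('not a date: %r' % astr)

Note that 311299 and 31121999 also sail through the int test, which is
exactly why the simple ladder ordering can't settle the question on its
own.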

Constraint (2) just falls apart -- with user-supplied data, there seem 
to be no rules but Rafferty's and no laws but Murphy's.

The approach that I've adopted is to test the values in a column for all 
types, and choose the non-text type that has the highest success rate 
(provided the rate is greater than some threshold, e.g. 90%; otherwise 
it's text).
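
In outline, something like this (the function name, the exception
handling and the 90% default are my illustration here, not the real
code):

def guess_column_type(values, converters, threshold=0.90):
    # converters: the non-text converters to try, e.g.
    # [make_int, make_float, make_date] from earlier in the thread.
    # Returns the converter with the best success rate over the sampled
    # values, or None if nothing clears the threshold (i.e. leave the
    # column as text).
    best, best_rate = None, 0.0
    for conv in converters:
        successes = 0
        for v in values:
            try:
                conv(v)
                successes += 1
            except (ValueError, TypeError):
                pass
        rate = successes / float(len(values) or 1)
        if rate > best_rate:
            best, best_rate = conv, rate
    if best_rate < threshold:
        return None
    return best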

For large files, taking a 1/N sample can save a lot of time with little 
chance of misdiagnosis.
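
The simplest way to take such a sample is to look at only every Nth
record, e.g. (Python 2 style file handling; the helper name is just for
illustration):

import csv
import itertools

def sample_rows(filename, n):
    # Take rows 0, n, 2n, ... for the type-guessing pass; the reader
    # still scans the whole file, but only 1 row in n is kept and tested.
    reader = csv.reader(open(filename, 'rb'))
    return list(itertools.islice(reader, 0, None, n))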

Example: file of 1,079,000 records, with 15 columns, ultimately 
diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order), 
and [no kidding] 1 x date (ymd order). Using N==101 took about 15 
seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes 
about 900 seconds. The "converter" function for dates is written in C.

Cheers,
John



