converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2
John Machin
sjmachin at lexicon.net
Fri May 18 23:29:54 EDT 2007
On 19/05/2007 10:04 AM, James Stroud wrote:
> py_genetic wrote:
>> Hello,
>>
>> I'm importing large text files of data using csv. I would like to add
>> some more auto-sensing abilities. I'm considering sampling the data
>> file and doing some fuzzy-logic scoring on the attributes (columns in a
>> database / csv file, e.g. height, weight, income, etc.) to determine the
>> most efficient 'type' to convert each attribute column into for further
>> processing and efficient storage...
>>
>> Example row from sampled file data: [['8', '2.33', 'A', 'BB', 'hello
>> there', '100,000,000,000'], [next row...], ...]
>>
>> Aside from a missing attribute designator, we can assume that the same
>> type of data continues down a column: for example, a string, int8,
>> int16, float, etc.
>>
>> 1. What is the most efficient way in Python to test whether a string
>> can be converted into a given numeric type, or left alone if it's
>> really a string like 'A' or 'hello'? Speed is key. Any thoughts?
>>
>> 2. Is there anything out there already which deals with this issue?
>>
>> Thanks,
>> Conor
>>
>
> This is untested, but here is an outline to do what you want.
>
> First convert rows to columns:
>
>
> columns = zip(*rows)
>
>
> Okay, that was a lot of typing. Now, you should run down the columns,
> testing with the most restrictive type and working to less restrictive
> types. You will also need to keep in mind the potential for commas in
> your numbers--so you will need to write your own converters, determining
> for yourself what literals map to what values. Only you can decide what
> you really want here. Here is a minimal idea of how I would do it:
>
>
> def make_int(astr):
>     if not astr:
>         return 0
>     else:
>         return int(astr.replace(',', ''))
>
> def make_float(astr):
>     if not astr:
>         return 0.0
>     else:
>         return float(astr.replace(',', ''))
>
> make_str = lambda s: s
>
>
> Now you can put the converters in a list, remembering to order them.
>
>
> converters = [make_int, make_float, make_str]
>
>
> Now, go down the columns, moving to the next, less restrictive
> converter whenever the current converter fails. We assume that the
> make_str identity function will never fail. We could leave it out and
> use a flag, etc., for efficiency, but that is left as an exercise.
>
>
> new_columns = []
> for column in columns:
>     for converter in converters:
>         try:
>             new_column = [converter(v) for v in column]
>             break
>         except ValueError:
>             continue
>     new_columns.append(new_column)
>
>
> For no reason at all, convert back to rows:
>
>
> new_rows = zip(*new_columns)
>
>
> You must decide for yourself how to deal with ambiguities. For example,
> will '1.0' be a float or an int? The above assumes you want all values
> in a column to have the same type. Reordering the loops can give mixed
> types in columns, but would not fulfill your stated requirements. Some
> things are not as efficient as they might be (for example, eliminating
> the clumsy make_str). But adding tests to improve efficiency would cloud
> the logic.
>
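For reference, the quoted outline runs end-to-end like this (the sample
rows are invented for illustration; the converters and loop follow the
outline above):

```python
# End-to-end run of the quoted outline: per-column type inference via a
# most-restrictive-first list of converters.

def make_int(astr):
    return 0 if not astr else int(astr.replace(',', ''))

def make_float(astr):
    return 0.0 if not astr else float(astr.replace(',', ''))

make_str = lambda s: s

converters = [make_int, make_float, make_str]  # most restrictive first

rows = [['8', '2.33', 'A'],
        ['100,000', '7', 'BB']]

columns = list(zip(*rows))          # rows -> columns
new_columns = []
for column in columns:
    for converter in converters:
        try:
            new_column = [converter(v) for v in column]
            break                   # first converter that works wins
        except ValueError:
            continue
    new_columns.append(new_column)

new_rows = list(zip(*new_columns))  # columns -> rows
# first column becomes ints, second floats, third stays text
```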
[apologies in advance if this appears more than once]
This approach is quite reasonable, IF:
(1) the types involved follow a simple "ladder" hierarchy [ints pass the
float test, floats pass the str test]
(2) the supplier of the data has ensured that all values in a column are
actually instances of the intended type.
Constraint (1) falls apart if you need dates. Consider 31/12/99,
31/12/1999, 311299 [int?], 31121999 [int?], 31DEC99, ... and that's
before you allow for dates in three different orders (dmy, mdy, ymd).
Constraint (2) just falls apart -- with user-supplied data, there seem
to be no rules but Rafferty's and no laws but Murphy's.
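A pure-Python date "converter" might try a list of candidate formats in
turn (the format list below is an illustrative assumption, not the
actual C converter mentioned later):

```python
from datetime import datetime

# Illustrative date converter: try several candidate formats in a fixed
# order and raise ValueError if none match. The format list is an
# assumption covering a few of the shapes mentioned above.
DATE_FORMATS = ['%d/%m/%Y', '%d/%m/%y', '%d%m%Y', '%d%b%y']

def make_date(astr):
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(astr, fmt).date()
        except ValueError:
            pass
    raise ValueError('not a date: %r' % astr)
```

Note that this handles only one field order (dmy); supporting mdy and
ymd as well means more candidate formats and more ambiguity to resolve.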
The approach that I've adopted is to test the values in a column against
all candidate types and choose the non-text type with the highest
success rate (provided that rate exceeds some threshold, e.g. 90%;
otherwise the column is diagnosed as text).
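As a sketch (my naming; the actual code is not shown here), the
per-column vote might look like:

```python
# Sketch of the "highest success rate wins" diagnosis. Converter names
# and the candidate list are illustrative assumptions.

def int_maker(s):
    return int(s.replace(',', ''))

def float_maker(s):
    return float(s.replace(',', ''))

# Candidate non-text types, most specific first so that a column of
# ints (which also pass the float test) is diagnosed as int.
CANDIDATES = [('int', int_maker), ('float', float_maker)]

def diagnose(column, threshold=0.90):
    """Return the name of the best-scoring type for a column."""
    best_name, best_rate = 'text', 0.0
    for name, conv in CANDIDATES:
        hits = 0
        for value in column:
            try:
                conv(value)
                hits += 1
            except ValueError:
                pass
        rate = hits / float(len(column))
        if rate >= threshold and rate > best_rate:
            best_name, best_rate = name, rate
    return best_name
```

Because the comparison is strict (`rate > best_rate`), ties go to the
earlier, more specific candidate.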
For large files, taking a 1/N sample can save a lot of time with little
chance of misdiagnosis.
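With the rows in memory, taking every Nth record is just an extended
slice (function name assumed for illustration):

```python
# Illustrative 1-in-N sampling before type diagnosis: a fixed-stride
# slice is cheap and, on a large file, unlikely to change the verdict.
def sample(rows, n):
    return rows[::max(1, n)]

rows = [[str(i), str(i / 2.0)] for i in range(1000)]
sampled = sample(rows, 101)   # ~10 rows instead of 1000
```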
Example: file of 1,079,000 records, with 15 columns, ultimately
diagnosed as being 8 x text, 3 x int, 1 x float, 2 x date (dmy order),
and [no kidding] 1 x date (ymd order). Using N==101 took about 15
seconds [Python 2.5.1, Win XP Pro SP2, 3.2GHz dual-core]; N==1 takes
about 900 seconds. The "converter" function for dates is written in C.
Cheers,
John