converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2
James Stroud
jstroud at mbi.ucla.edu
Fri May 18 20:04:06 EDT 2007
py_genetic wrote:
> Hello,
>
> I'm importing large text files of data using csv. I would like to add
> some more auto-sensing abilities. I'm considering sampling the data
> file and doing some fuzzy logic scoring on the attributes (columns in a
> database/CSV file, e.g. height, weight, income, etc.) to determine the
> most efficient 'type' to convert the attribute column into for further
> processing and efficient storage...
>
> Example row from sampled file data: [['8', '2.33', 'A', 'BB',
> 'hello there', '100,000,000,000'], [next row...] ...]
>
> Aside from a missing attribute designator, we can assume that the same
> type of data continues through a column. For example, a string, int8,
> int16, float, etc.
>
> 1. What is the most efficient way in Python to test whether a string
> can be converted into a given numeric type, or left alone if it's
> really a string like 'A' or 'hello'? Speed is key. Any thoughts?
>
> 2. Is there anything out there already which deals with this issue?
>
> Thanks,
> Conor
>
This is untested, but here is an outline to do what you want.

First convert rows to columns:

    columns = zip(*rows)

Okay, that was a lot of typing. Now, you should run down the columns,
testing with the most restrictive type and working to less restrictive
types. You will also need to keep in mind the potential for commas in
your numbers--so you will need to write your own converters, determining
for yourself what literals map to what values. Only you can decide what
you really want here. Here is a minimal idea of how I would do it:
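
    def make_int(astr):
        # An empty field is treated as 0.
        if not astr:
            return 0
        else:
            # Strip thousands separators before converting.
            return int(astr.replace(',', ''))

    def make_float(astr):
        # An empty field is treated as 0.0.
        if not astr:
            return 0.0
        else:
            return float(astr.replace(',', ''))

    # Identity converter: the fallback that accepts any string.
    make_str = lambda s: s
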
Now you can put the converters in a list, remembering to order them
from most restrictive to least restrictive.

    converters = [make_int, make_float, make_str]

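To see why the order matters, here is a small sketch that reports which converter first accepts a single value. The first_fit helper is hypothetical (it is not part of the outline above), and the sample values are taken from the row in the question.

```python
def make_int(astr):
    # Empty fields become 0; thousands separators are stripped first.
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    # Empty fields become 0.0; thousands separators are stripped first.
    return float(astr.replace(',', '')) if astr else 0.0

def make_str(astr):
    # Identity fallback: never fails.
    return astr

converters = [make_int, make_float, make_str]

def first_fit(value):
    """Hypothetical helper: name of the first converter that accepts value."""
    for converter in converters:
        try:
            converter(value)
            return converter.__name__
        except ValueError:
            continue

for value in ['8', '2.33', 'A', '100,000,000,000', '']:
    print(repr(value), '->', first_fit(value))
```

This should report make_int for '8', '100,000,000,000' and the empty string, make_float for '2.33', and make_str for 'A'. If make_float came first, '8' would become 8.0 and make_int would never be reached. Note also that float() accepts literals such as 'nan' and 'inf', so you may want your own converter to reject those.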
Now, go down the columns, moving to the next, less restrictive,
converter whenever a particular converter fails. We assume that the
make_str identity function will never fail. We could leave it out and
use a flag, etc., for efficiency, but that is left as an exercise.

    new_columns = []
    for column in columns:
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except ValueError:
                # This converter rejected a value; try the next one.
                continue
        new_columns.append(new_column)

For no reason at all, convert back to rows:

    new_rows = zip(*new_columns)

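Putting the pieces together, a self-contained sketch might look like the following. The convert_rows wrapper and the sample rows are made up for illustration (the rows are loosely modelled on the one in the question), and only ValueError is caught, on the assumption that every field is a string as delivered by csv.

```python
def make_int(astr):
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

def make_str(astr):
    return astr

# Ordered from most restrictive to least restrictive.
converters = [make_int, make_float, make_str]

def convert_rows(rows):
    """Convert each column with the first converter that accepts all its values."""
    new_columns = []
    for column in zip(*rows):          # rows -> columns
        for converter in converters:
            try:
                new_column = [converter(v) for v in column]
                break
            except ValueError:
                continue
        new_columns.append(new_column)
    return [list(row) for row in zip(*new_columns)]   # columns -> rows

# Made-up sample data.
rows = [
    ['8', '2.33', 'A', '100,000,000,000'],
    ['12', '7', 'BB', ''],
    ['3', '', 'hello there', '42'],
]
for row in convert_rows(rows):
    print(row)
# [8, 2.33, 'A', 100000000000]
# [12, 7.0, 'BB', 0]
# [3, 0.0, 'hello there', 42]
```

The second column ends up as floats because a single '2.33' rules out make_int for the whole column. Keep in mind that zip(*rows) silently truncates to the shortest row, so ragged input would need padding first.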
You must decide for yourself how to deal with ambiguities. For example,
will '1.0' be a float or an int? The above assumes you want all values
in a column to have the same type. Reordering the loops can give mixed
types in columns, but would not fulfill your stated requirements. Some
things are not as efficient as they might be (for example, the clumsy
make_str could be eliminated), but adding tests to improve efficiency
would cloud the logic.
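For comparison, here is a rough sketch of the reordered, per-value variant, which allows mixed types within a column. The convert_value helper is hypothetical; it also shows one way to drop make_str, by falling back to the original string when no numeric converter fits.

```python
def make_int(astr):
    return int(astr.replace(',', '')) if astr else 0

def make_float(astr):
    return float(astr.replace(',', '')) if astr else 0.0

# No make_str here: the fallback is handled in convert_value.
converters = [make_int, make_float]

def convert_value(astr):
    """Per-value conversion; returns the original string if nothing fits."""
    for converter in converters:
        try:
            return converter(astr)
        except ValueError:
            continue
    return astr

column = ['1', '1.0', '1,000', 'n/a']
print([convert_value(v) for v in column])
# [1, 1.0, 1000, 'n/a']
```

With these converters '1.0' becomes a float, because int() rejects it. The column-wise approach would instead leave this whole column as strings, since 'n/a' fits neither numeric converter.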
James