converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

Sun May 20 22:22:21 EDT 2007

On May 18, 7:07 pm, py_genetic <conor.robin... at gmail.com> wrote:

> Hello,
>
> I'm importing large text files of data using csv.  I would like to add
> some more auto-sensing abilities.  I'm considering sampling the data
> file and doing some fuzzy logic scoring on the attributes (columns in a
> database/csv file, e.g. height, weight, income, etc.) to determine the
> most efficient 'type' to convert the attribute column into for further
> processing and efficient storage...
>
> Example row from sampled file data: [ ['8', '2.33', 'A', 'BB', 'hello
> there', '100,000,000,000'], [next row...] ....]
>
> Aside from a missing attribute designator, we can assume that the same
> type of data continues through a column.  For example, a string, int8,
> int16, float, etc.
>
> 1. What is the most efficient way in Python to test whether a string
> can be converted into a given numeric type, or left alone if it's
> really a string like 'A' or 'hello'?  Speed is key.  Any thoughts?
>
> 2. Is there anything out there already which deals with this issue?

There are several replies to your immediate column type-guessing
problem, so I'm not going to address that. Once you decide on the
converters for each column, you have to pass the dataset through them
(and optionally rearrange or omit some of the columns). That's easy to
hardcode for a few datasets with the same or similar structure, but it
soon gets tiring.
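
For a single fixed layout, the hardcoded version might look like the
sketch below. The names `converters`, `convert_rows` and `sample` are
made up for illustration, and the column types are assumed to be
already known.

```python
# Hypothetical per-column converters, picked by hand for one known layout.
converters = [int, float, str]


def convert_rows(rows, converters):
    """Yield each row with converters[i] applied to its i-th field."""
    for row in rows:
        # zip() stops at the shorter sequence, so trailing fields that
        # have no converter are silently dropped.
        yield [convert(field) for convert, field in zip(converters, row)]


sample = [['8', '2.33', 'A'], ['9', '4.0', 'BB']]
converted = list(convert_rows(sample, converters))
```

Reordering or skipping columns means editing this loop for every new
file layout, which is the repetitive part.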

I had a similar task recently, so I wrote a general and efficient (at
least as far as pure Python goes) row transformer that does the
repetitive work. Below are some examples from an IPython session; let
me know if this might be useful and I'll post it here or at the
Cookbook.

George

#======= RowTransformer examples ============================

In [1]: from transrow import RowTransformer
In [2]: rows = [row.split(',') for row in "1,3.34,4-3.2j,John", "4,4,4,4", "0,-1.1,3.4,None"]
In [3]: rows
Out[3]:
[['1', '3.34', '4-3.2j', 'John'],
 ['4', '4', '4', '4'],
 ['0', '-1.1', '3.4', 'None']]

# adapt the first three columns; the rest are omitted
In [4]: for row in RowTransformer([int,float,complex])(rows):
   ...:     print row
   ...:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4.0, (4+0j)]
[0, -1.1000000000000001, (3.3999999999999999+0j)]

# return the 2nd column as float, followed by the 4th column as is
In [5]: for row in RowTransformer({1:float, 3:None})(rows):
   ....:     print row
   ....:
[3.3399999999999999, 'John']
[4.0, '4']
[-1.1000000000000001, 'None']

# return the 3rd column as complex, followed by the 1st column as int
In [6]: for row in RowTransformer([(2,complex),(0,int)])(rows):
   ....:     print row
   ....:
[(4-3.2000000000000002j), 1]
[(4+0j), 4]
[(3.3999999999999999+0j), 0]

# return the first three columns, adapted by eval()
# XXX: use eval() only for trusted data
In [7]: for row in RowTransformer(include=range(3), default_adaptor=eval)(rows):
   ....:     print row
   ....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]

# equivalent to the previous
In [8]: for row in RowTransformer(default_adaptor=eval, exclude=[3])(rows):
   ....:     print row
   ....:
[1, 3.3399999999999999, (4-3.2000000000000002j)]
[4, 4, 4]
[0, -1.1000000000000001, 3.3999999999999999]
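
The source of the transrow module is not included in this post. As a
rough guide to what the examples above imply, here is a minimal sketch
that reproduces only the behaviour visible in the session. The class
name `SketchRowTransformer` and all of its internals are assumptions,
not the actual implementation, and no attempt is made at the efficiency
mentioned above. It runs on Python 3 as well as Python 2.

```python
class SketchRowTransformer(object):
    """Hypothetical stand-in for RowTransformer; not the transrow source."""

    def __init__(self, adaptors=None, include=None, exclude=(),
                 default_adaptor=None):
        if adaptors is None:
            # No explicit adaptors: use `include` if given, else decide per row.
            pairs = None if include is None else [
                (index, default_adaptor) for index in include]
        elif isinstance(adaptors, dict):
            # {column index: adaptor}; columns come out in index order.
            pairs = sorted(adaptors.items(), key=lambda item: item[0])
        else:
            # A sequence of adaptors (applied by position) and/or
            # (column index, adaptor) pairs, kept in the given order.
            pairs = [item if isinstance(item, tuple) else (position, item)
                     for position, item in enumerate(adaptors)]
        self._pairs = pairs
        self._exclude = frozenset(exclude)
        self._default = default_adaptor

    def __call__(self, rows):
        """Yield one transformed list per input row."""
        for row in rows:
            pairs = self._pairs
            if pairs is None:
                # Every column of this row, each with the default adaptor.
                pairs = [(index, self._default) for index in range(len(row))]
            # An adaptor of None means "pass the field through unchanged".
            yield [row[index] if adapt is None else adapt(row[index])
                   for index, adapt in pairs if index not in self._exclude]


rows = [line.split(',') for line in
        ("1,3.34,4-3.2j,John", "4,4,4,4", "0,-1.1,3.4,None")]
for row in SketchRowTransformer({1: float, 3: None})(rows):
    print(row)
```

The sketch resolves the column/adaptor pairs once in the constructor
where it can, so the per-row work is a single list comprehension. Only
the `exclude`-without-`include` case has to look at each row's length.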