converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

Neil Cerutti horpner at yahoo.com
Mon May 21 08:39:23 EDT 2007


On 2007-05-20, John Machin <sjmachin at lexicon.net> wrote:
> On 19/05/2007 3:14 PM, Paddy wrote:
>> On May 19, 12:07 am, py_genetic <conor.robin... at gmail.com> wrote:
>>> Hello,
>>>
>>> I'm importing large text files of data using csv.  I would like to add
>>> some more auto-sensing abilities.  I'm considering sampling the data
>>> file and doing some fuzzy logic scoring on the attributes (columns in a
>>> database / csv file, e.g. height, weight, income, etc.) to determine the
>>> most efficient 'type' to convert the attribute column into for further
>>> processing and efficient storage...
>>>
>>> Example row from sampled file data: [ ['8', '2.33', 'A', 'BB',
>>> 'hello there', '100,000,000,000'], [next row...] ....]
>>>
>>> Aside from a missing attribute designator, we can assume that the same
>>> type of data continues through a column.  For example, a string, int8,
>>> int16, float etc.
>>>
>>> 1. What is the most efficient way in python to test whether a string
>>> can be converted into a given numeric type, or left alone if it's
>>> really a string like 'A' or 'hello'?  Speed is key.  Any thoughts?
>>>
>>> 2. Is there anything out there already which deals with this issue?
>>>
>>> Thanks,
>>> Conor
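
On question 1, one common approach is to try each conversion in turn
and fall back to the original string. The following is a minimal sketch
and has not been benchmarked; the names to_best_type and convert_row are
made up for illustration and do not come from any library.

```python
def to_best_type(text):
    """Return text as an int if possible, else a float, else unchanged."""
    stripped = text.strip()
    for convert in (int, float):
        try:
            return convert(stripped)
        except ValueError:
            # Not parseable as this numeric type; try the next one.
            pass
    return text


def convert_row(row):
    """Apply to_best_type to every field of one csv row (a list of str)."""
    return [to_best_type(field) for field in row]


row = ['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000']
print(convert_row(row))
# [8, 2.33, 'A', 'BB', 'hello there', '100,000,000,000']
```

Note that '100,000,000,000' stays a string, because int() and float() do
not accept commas as thousands separators; whether to strip them first
is a separate decision about the data. float() also accepts strings
such as 'nan' and 'inf', which may or may not be wanted.
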
>> 
>> You might try investigating what can generate your data. With luck,
>> it could turn out that the data generator is methodical and column
>> data-types are consistent and easily determined by testing the
>> first or second row. At worst, you will get to know how much you
>> must check for human errors.
>> 
>
> Here you go, Paddy, the following has been generated very methodically; 
> what data type is the first column? What is the value in the first 
> column of the 6th row likely to be?
>
> "$39,082.00","$123,456.78"
> "$39,113.00","$124,218.10"
> "$39,141.00","$124,973.76"
> "$39,172.00","$125,806.92"
> "$39,202.00","$126,593.21"
>
> N.B. I've kindly given you five lines instead of one or two :-)

My experience with Excel-related mistakes leads me to think that
column one contains dates that somehow got misformatted on
export.
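
A quick way to check that guess is to strip the currency formatting and
read the numbers as Excel serial dates. This is only a sketch: it
assumes Excel's 1900 date system, where (for any date after February
1900) serial N falls N days after 1899-12-30. The helper names are
invented for illustration.

```python
from datetime import date, timedelta

# Assumption: Excel 1900 date system; valid for serials after Feb 1900.
EXCEL_EPOCH = date(1899, 12, 30)


def currency_to_serial(text):
    """Turn a string like '$39,082.00' into the integer 39082."""
    return int(float(text.replace('$', '').replace(',', '')))


def serial_to_date(serial):
    """Interpret an integer as an Excel serial date."""
    return EXCEL_EPOCH + timedelta(days=serial)


column_one = ["$39,082.00", "$39,113.00", "$39,141.00",
              "$39,172.00", "$39,202.00"]
dates = [serial_to_date(currency_to_serial(v)) for v in column_one]
for value, day in zip(column_one, dates):
    print(value, "->", day.isoformat())
# $39,082.00 -> 2006-12-31
# $39,113.00 -> 2007-01-31
# $39,141.00 -> 2007-02-28
# $39,172.00 -> 2007-03-31
# $39,202.00 -> 2007-04-30
```

Under that assumption the five values are consecutive month-ends, so
the sixth row would presumably be 2007-05-31, which would show up in
the file as "$39,233.00".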

-- 
Neil Cerutti


