converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

John Machin sjmachin at lexicon.net
Sun May 20 08:12:30 EDT 2007


On 20/05/2007 8:52 PM, Paddy wrote:
> On May 20, 2:16 am, John Machin <sjmac... at lexicon.net> wrote:
>> On 19/05/2007 3:14 PM, Paddy wrote:
>>
>>
>>
>>> On May 19, 12:07 am, py_genetic <conor.robin... at gmail.com> wrote:
>>>> Hello,
>>>> I'm importing large text files of data using csv.  I would like to add
>>>> some more auto sensing abilities.  I'm considering sampling the data
>>>> file and doing some fuzzy logic scoring on the attributes (columns in a
>>>> database / csv file, e.g. height, weight, income etc.) to determine the
>>>> most efficient 'type' to convert the attribute column into for further
>>>> processing and efficient storage...
>>>> Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
>>>> there', '100,000,000,000'], [next row...] ....]
>>>> Aside from a missing attribute designator, we can assume that the same
>>>> type of data continues through a column.  For example, a string, int8,
>>>> int16, float etc.
>>>> 1. What is the most efficient way in Python to test whether a string
>>>> can be converted into a given numeric type, or left alone if it's
>>>> really a string like 'A' or 'hello'?  Speed is key.  Any thoughts?
>>>> 2. Is there anything out there already which deals with this issue?
>>>> Thanks,
>>>> Conor
>>> You might try investigating what can generate your data. With luck,
>>> it could turn out that the data generator is methodical and column
>>> data-types are consistent and easily determined by testing the
>>> first or second row. At worst, you will get to know how much you
>>> must check for human errors.
>> Here you go, Paddy, the following has been generated very methodically;
>> what data type is the first column? What is the value in the first
>> column of the 6th row likely to be?
>>
>> "$39,082.00","$123,456.78"
>> "$39,113.00","$124,218.10"
>> "$39,141.00","$124,973.76"
>> "$39,172.00","$125,806.92"
>> "$39,202.00","$126,593.21"
>>
>> N.B. I've kindly given you five lines instead of one or two :-)
>>
>> Cheers,
>> John
> 
> John,
> I've had cases where some investigation of the source of the data has
> completely removed any ambiguity. I've found that data was generated
> from one or two sources and been able to know what every field type is
> by just examining a field that I have determined will tell me the
> source program that generated the data.

The source program that produced my sample dataset was Microsoft Excel 
(or OOo Calc or Gnumeric); it was induced to perform a "save as CSV" 
operation. Does that help you determine the true nature of the first column?
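
One way to probe that question is sketched below in Python. The working assumption, which is not stated anywhere in this thread, is that the first column might hold spreadsheet date serial numbers that happened to be displayed with a currency format. The helper names `strip_currency` and `serial_to_date` are made up for illustration, and the 1899-12-30 epoch is the usual convention for Excel's 1900 date system (it holds for dates from March 1900 onwards).

```python
from datetime import date, timedelta

# First-column values copied from the five sample lines above.
FIRST_COLUMN = ["$39,082.00", "$39,113.00", "$39,141.00", "$39,172.00", "$39,202.00"]

# Assumption: Excel's 1900 date system, where serial N maps to
# 1899-12-30 + N days (valid for dates from March 1900 onwards).
EXCEL_EPOCH = date(1899, 12, 30)


def strip_currency(text):
    """Turn a string such as '$39,082.00' into the float 39082.0."""
    return float(text.replace("$", "").replace(",", ""))


def serial_to_date(serial):
    """Interpret a whole-number spreadsheet serial as a calendar date."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


as_dates = [serial_to_date(strip_currency(cell)) for cell in FIRST_COLUMN]
for cell, day in zip(FIRST_COLUMN, as_dates):
    print(cell, "->", day.isoformat())
```

Under that assumption the five values come out as consecutive month-end dates, 2006-12-31 through 2007-04-30. Nothing in the first row or two, taken alone, would suggest that reading.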


> 
> I have also found that the flow generating some data is subject to
> hand editing so have had to both put in extra checks in my reader, and
> on some occasions created specific editors to replace hand edits by
> checked assisted hand edits.
> I stand by my statement: "Know the source of your data"; it's less
> likely to bite!
> 

My dataset has a known source, and furthermore meets your "lucky" 
criteria (methodically generated, column type is consistent) -- I'm 
waiting to hear from you about the "easily determined" part :-)
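
For reference, a minimal sketch of the kind of per-cell test the original question asks about: try `int`, then `float`, and otherwise leave the string alone. The function names `convert` and `column_type` are hypothetical, and this is only one simple approach, not something proposed by anyone in the thread. Note that Python's `int()` and `float()` also accept surrounding whitespace and strings such as 'nan' and 'inf', which may or may not be what you want.

```python
def convert(text):
    """Return int(text) if possible, else float(text), else the original string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def column_type(cells):
    """Report the narrowest of int, float, str that fits every cell in a column."""
    kinds = {type(convert(cell)) for cell in cells}
    if not kinds:
        return str  # an empty column gives no evidence; fall back to str
    if kinds == {int}:
        return int
    if kinds <= {int, float}:
        return float
    return str


# The example row from the original question.
sample_row = ['8', '2.33', 'A', 'BB', 'hello there', '100,000,000,000']
print([convert(cell) for cell in sample_row])

# The first column of the five-line sample dataset.
first_column = ["$39,082.00", "$39,113.00", "$39,141.00", "$39,172.00", "$39,202.00"]
print(column_type(first_column).__name__)
```

With this sketch '100,000,000,000' stays a string because of the thousands separators, and the first column of the sample dataset is reported as `str`. A consistent, methodically generated column is therefore not automatically an easily typed one.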

Cheers,
John



