converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

John Machin sjmachin at lexicon.net
Sun May 20 18:49:39 EDT 2007


On May 21, 2:04 am, Paddy <paddy3... at googlemail.com> wrote:
> On May 20, 1:12 pm, John Machin <sjmac... at lexicon.net> wrote:
>
>
>
> > On 20/05/2007 8:52 PM, Paddy wrote:
>
> > > On May 20, 2:16 am, John Machin <sjmac... at lexicon.net> wrote:
> > >> On 19/05/2007 3:14 PM, Paddy wrote:
>
> > >>> On May 19, 12:07 am, py_genetic <conor.robin... at gmail.com> wrote:
> > >>>> Hello,
> > >>>> I'm importing large text files of data using csv.  I would like to add
> > >>>> some more auto sensing abilities.  I'm considering sampling the data
> > >>>> file and doing some fuzzy logic scoring on the attributes (cols in a
> > >>>> database/CSV file, e.g. height, weight, income etc.) to determine the
> > >>>> most efficient 'type' to convert the attribute col into for further
> > >>>> processing and efficient storage...
> > >>>> Example row from sampled file data: [ ['8','2.33', 'A', 'BB', 'hello
> > >>>> there' '100,000,000,000'], [next row...] ....]
> > >>>> Aside from a missing attribute designator, we can assume that the same
> > >>>> type of data continues through a col.  For example, a string, int8,
> > >>>> int16, float etc.
> > >>>> 1. What is the most efficient way in python to test whether a string
> > >>>> can be converted into a given numeric type, or left alone if it's
> > >>>> really a string like 'A' or 'hello'?  Speed is key?  Any thoughts?
> > >>>> 2. Is there anything out there already which deals with this issue?
> > >>>> Thanks,
> > >>>> Conor
> > >>> You might try investigating what can generate your data. With luck,
> > >>> it could turn out that the data generator is methodical and column
> > >>> data-types are consistent and easily determined by testing the
> > >>> first or second row. At worst, you will get to know how much you
> > >>> must check for human errors.
> > >> Here you go, Paddy, the following has been generated very methodically;
> > >> what data type is the first column? What is the value in the first
> > >> column of the 6th row likely to be?
>
> > >> "$39,082.00","$123,456.78"
> > >> "$39,113.00","$124,218.10"
> > >> "$39,141.00","$124,973.76"
> > >> "$39,172.00","$125,806.92"
> > >> "$39,202.00","$126,593.21"
>
> > >> N.B. I've kindly given you five lines instead of one or two :-)
>
> > >> Cheers,
> > >> John
>
> > > John,
> > > I've had cases where some investigation of the source of the data has
> > > completely removed any ambiguity. I've found that data was generated
> > > from one or two sources and been able to know what every field type is
> > > by just examining a field that I have determined will tell me the
> > > source program that generated the data.
>
> > The source program that produced my sample dataset was Microsoft Excel
> > (or OOo Calc or Gnumeric); it was induced to perform a "save as CSV"
> > operation. Does that help you determine the true nature of the first column?
>
> > > I have also found that the flow generating some data is subject to
> > > hand editing so have had to both put in extra checks in my reader, and
> > > on some occasions created specific editors to replace hand edits by
> > > checked assisted hand edits.
> > > I stand by my statement; "Know the source of your data", it's less
> > > likely to bite!
>
> > My dataset has a known source, and furthermore meets your "lucky"
> > criteria (methodically generated, column type is consistent) -- I'm
> > waiting to hear from you about the "easily determined" part :-)
>
> > Cheers,
> > John
>
> John,
> Open up your Excel spreadsheet and check what the format is for the
> column. It's not a contest. If you KNOW what generated the data then
> USE that knowledge. It would be counter-productive to do otherwise
> surely?
>
> (I know, don't call you Shirley :-)
>

... and I won't call you Patsy more than this once :-)

Patsy, re-read. The scenario is that I don't have the Excel
spreadsheet; I have a CSV file. The format is rather obviously
"currency" but that is not correct. The point is that (1) it was
methodically [mis-]produced by a known source [your criteria] but
(2) the correct type of column 1 can't be determined by inspection of
a value or two.

Yeah, it's not a contest, but I was kinda expecting that you might
have taken first differences of column 1 by now ...

Cheers,
John
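
The "first differences" hint above can be tried directly on the five
sample rows. What follows is only a sketch, not anything posted in the
thread: the helper names (strip_currency, excel_serial_to_date) are
made up for illustration, and the date conversion assumes Excel's
default 1900 date system, in which serial numbers count days from
1899-12-30.

```python
from datetime import date, timedelta

# First column of the five sample rows, exactly as it appears in the CSV.
column_1 = ["$39,082.00", "$39,113.00", "$39,141.00", "$39,172.00", "$39,202.00"]


def strip_currency(text):
    """Hypothetical helper: drop '$' and thousands separators, return a float."""
    return float(text.replace("$", "").replace(",", ""))


values = [strip_currency(v) for v in column_1]

# First differences: each value minus the one before it.
first_differences = [b - a for a, b in zip(values, values[1:])]
print(first_differences)

# Assumption: Excel's 1900 date system, where serial N is N days after 1899-12-30.
EXCEL_EPOCH = date(1899, 12, 30)


def excel_serial_to_date(serial):
    """Hypothetical helper: interpret a number as an Excel day serial."""
    return EXCEL_EPOCH + timedelta(days=int(serial))


as_dates = [excel_serial_to_date(v) for v in values]
for raw, d in zip(column_1, as_dates):
    print(raw, "->", d.isoformat())
```

The differences come out as 31, 28, 31 and 30, which look like month
lengths rather than anything to do with money. Under the stated
assumption, the values map to consecutive month-end dates, which would
suggest a date column that was saved with a currency format.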








