Converting a string to the most probable type

Mon Mar 10 11:24:30 EDT 2008

On Mar 7, 6:12 pm, John Machin <sjmac... at lexicon.net> wrote:
> On Mar 8, 1:53 am, "hvendelbo.... at googlemail.com"
>
>
>
>
>
> <hvendelbo.... at googlemail.com> wrote:
> > On Mar 6, 9:17 pm, Luis M. González <luis... at gmail.com> wrote:
>
> > > On 6 mar, 11:27, Pierre Quentel <quentel.pie... at wanadoo.fr> wrote:
>
> > > > Hi,
>
> > > > I would like to know if there is a module that converts a string to a
> > > > value of the "most probable type" ; for instance :
> > > > - if the string is "abcd" the value is the same string "abcd"
> > > > - string "123" : value = the integer 123
> > > > - string "-1.23" (or "-1,23" if the locale for decimals is ,) : value
> > > > = the float -1.23
> > > > - string "2008/03/06" (the format is also locale-dependant) : value =
> > > > datetime.date(2008,03,06)
>
> > > > Like in spreadsheets, special prefixes could be used to force the
> > > > type : for instance '123 would be converted to the *string* "123"
> > > > instead of the *integer* 123
>
> > > > I could code it myself, but this wheel is probably already invented
>
> > > > Regards,
> > > > Pierre
> > > >>> def convert(x):
>
> > >         if '.' in x:
> > >                 try: return float(x)
> > >                 except ValueError: return x
> > >         else:
> > >                 try: return int(x)
> > >                 except: return x
>
> > > >>> convert('123')
> > > 123
> > > >>> convert('123.99')
> > > 123.98999999999999
> > > >>> convert('hello')
>
> > > 'hello'
>
> > Neat solution. The real challenge though is whether to support
> > localised dates, these are all valid:
> > 20/10/01
> > 102001
> > 20-10-2001
> > 20011020
>
> Neat solution doesn't handle the case of using dots as date separators
> e.g. 20.10.01 [they are used in dates in some locales and  the
> location of . on the numeric keypad is easier on the pinkies than / or
> -]
>
> I'm a bit dubious about the utility of "most likely format" for ONE
> input.
>
> I've used a brute-force approach when inspecting largish CSV files
> (with a low but non-zero rate of typos etc) with the goal of
> determining what is the most likely type of data in each column.
> E.g 102001 could be a valid MMDDYY date, but not a valid DDMMYY or
> YYMMDD date. 121212 could be all of those. Both qualify as int, float
> and text. A column with 100% of entries qualifying as text, 99.999% as
> float, 99.99% as integer, 99.9% as DDMMYY, and much lower percentages
> as MMDDYY and YYMMDD would be tagged as DDMMYY. The general rule is:
> pick the type whose priority is highest and whose score exceeds a
> threshold. Priorities: date > int > float > text. Finding the date
> order works well with things like date of birth where there is a wide
> distribution of days and years. However a field (e.g. date interest
> credited to bank account) where the day is always 01 and the year is
> in 01 to 08 would give the same scores for each of 3 date orders ...
> eye-balling the actual data never goes astray.- Hide quoted text -
>
> - Show quoted text -

In the case where dots are used as a date separator, count the number
of dots (you should also count commas).  If a single comma appears and
is preceeded by only numbers or numbers with decimals, assume "foreign
float".  If a single decimal appears and is preceeded by only numbers
or numbers with commas, assume "float".  If 2 decimals appear and each
field is 2 or less characters in length and numeric, assume date.  If
2 decimals appear and the first 2 fields are 2 or less characters in
length and numeric and the last field is 4 characters in length and
numeric, assume date.

There are things you can do, but you must be wary of the fact that it
may not always be 100% perfect.