Converting a string to the most probable type

John Machin sjmachin at lexicon.net
Fri Mar 7 18:12:57 EST 2008


On Mar 8, 1:53 am, "hvendelbo.... at googlemail.com"
<hvendelbo.... at googlemail.com> wrote:
> On Mar 6, 9:17 pm, Luis M. González <luis... at gmail.com> wrote:
>
>
>
> > On 6 mar, 11:27, Pierre Quentel <quentel.pie... at wanadoo.fr> wrote:
>
> > > Hi,
>
> > > I would like to know if there is a module that converts a string to a
> > > value of the "most probable type" ; for instance :
> > > - if the string is "abcd" the value is the same string "abcd"
> > > - string "123" : value = the integer 123
> > > - string "-1.23" (or "-1,23" if the locale for decimals is ,) : value
> > > = the float -1.23
> > > - string "2008/03/06" (the format is also locale-dependent) : value =
> > > datetime.date(2008,03,06)
>
> > > Like in spreadsheets, special prefixes could be used to force the
> > > type : for instance '123 would be converted to the *string* "123"
> > > instead of the *integer* 123
>
> > > I could code it myself, but this wheel is probably already invented
>
> > > Regards,
> > > Pierre
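> > >>> def convert(x):
> >         # Try float if there is a decimal point, else int;
> >         # return the string unchanged if the conversion fails.
> >         if '.' in x:
> >                 try: return float(x)
> >                 except ValueError: return x
> >         else:
> >                 try: return int(x)
> >                 except ValueError: return x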
>
> > >>> convert('123')
> > 123
> > >>> convert('123.99')
> > 123.98999999999999
> > >>> convert('hello')
>
> > 'hello'
>
> Neat solution. The real challenge though is whether to support
> localised dates, these are all valid:
> 20/10/01
> 102001
> 20-10-2001
> 20011020

The neat solution doesn't handle dots used as date separators,
e.g. 20.10.01 [they are used in dates in some locales, and the . key
on the numeric keypad is easier on the pinkies than / or -].
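
To illustrate the separator point, here is a minimal sketch that accepts
. / and - between the fields. The helper name parse_dmy, the set of
separators, and the fixed day-month-year order are assumptions made for
the example, not something proposed earlier in the thread.

```python
from datetime import datetime

# Assumed separators; other locales may need more.
SEPARATORS = "./-"

def parse_dmy(text):
    """Return a date for DD<sep>MM<sep>YY or DD<sep>MM<sep>YYYY, else None."""
    for sep in SEPARATORS:
        parts = text.split(sep)
        if len(parts) != 3:
            continue  # this separator does not split the text into 3 fields
        # Assumption: a 4-character last field is a full year, otherwise 2-digit.
        year_fmt = "%Y" if len(parts[2]) == 4 else "%y"
        try:
            return datetime.strptime(text, sep.join(["%d", "%m", year_fmt])).date()
        except ValueError:
            return None  # three fields, but not a valid calendar date
    return None
```

With this sketch, parse_dmy("20.10.01") and parse_dmy("20-10-2001") give
the same date, while "123.99" and "hello" give None, so the caller can
fall back to the float/int/string logic.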

I'm a bit dubious about the utility of "most likely format" for ONE
input.

I've used a brute-force approach when inspecting largish CSV files
(with a low but non-zero rate of typos etc) with the goal of
determining what is the most likely type of data in each column.
E.g. 102001 could be a valid MMDDYY date, but not a valid DDMMYY or
YYMMDD date. 121212 could be all of those. Both qualify as int, float
and text. A column with 100% of entries qualifying as text, 99.999% as
float, 99.99% as integer, 99.9% as DDMMYY, and much lower percentages
as MMDDYY and YYMMDD would be tagged as DDMMYY. The general rule is:
pick the type whose priority is highest and whose score exceeds a
threshold. Priorities: date > int > float > text. Finding the date
order works well with things like date of birth where there is a wide
distribution of days and years. However a field (e.g. date interest
credited to bank account) where the day is always 01 and the year is
in 01 to 08 would give the same scores for each of 3 date orders ...
eye-balling the actual data never goes astray.
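
A rough sketch of the column-scoring idea described above, not the code
I actually used. The function names, the three six-digit date layouts,
and the 0.99 threshold are all assumptions for illustration; it also
assumes the column is a non-empty list of strings.

```python
from datetime import datetime

# Assumed six-digit date layouts; a real file may need others.
DATE_FORMATS = {"DDMMYY": "%d%m%y", "MMDDYY": "%m%d%y", "YYMMDD": "%y%m%d"}

def parses(func, text):
    """True if func(text) succeeds without raising ValueError."""
    try:
        func(text)
        return True
    except ValueError:
        return False

def is_date(text, fmt):
    """True if text is exactly six digits forming a valid date in layout fmt."""
    return (len(text) == 6 and text.isdigit()
            and parses(lambda s: datetime.strptime(s, fmt), text))

def score_column(values):
    """Return the fraction of values that qualify as each candidate type."""
    n = len(values)
    scores = {name: sum(is_date(v, fmt) for v in values) / n
              for name, fmt in DATE_FORMATS.items()}
    scores["int"] = sum(parses(int, v) for v in values) / n
    scores["float"] = sum(parses(float, v) for v in values) / n
    scores["text"] = 1.0  # every value qualifies as text
    return scores

def guess_column_type(values, threshold=0.99):
    """Pick the highest-priority type whose score reaches the threshold."""
    scores = score_column(values)
    # Priority: date > int > float > text. Among date layouts take the best
    # score; ties fall to the first layout listed, so inspect the scores.
    best_date = max(DATE_FORMATS, key=lambda name: scores[name])
    for name in (best_date, "int", "float"):
        if scores[name] >= threshold:
            return name
    return "text"
```

For a column such as ["250199", "130575", "311201", "010203", "290288"]
only the DDMMYY layout accepts every entry, so that is the guess. For
["010101", "010203", "010308"] all three layouts score the same, which
is exactly the case where looking at the data yourself is needed.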


