converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

James Stroud jstroud at mbi.ucla.edu
Sun May 20 03:47:12 EDT 2007


John Machin wrote:
>Against that background, please explain to me how I can use 
> "results from previous tables as priors".
> 
> Cheers,
> John

It depends on how you want to model your probabilities, but, as an 
example, you might find the following frequencies of column types in all 
tables you have parsed from this organization: 35% Strings, 25% Floats, 
20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. Let's say that you have 
also used prior counting statistics to find that there is a 2% error 
rate in the columns (2% of the values of a typical Float column fail to 
cast to Float, 2% of values in Int columns fail to cast to Int, and 
so on, though these need not all be equal). Let's also say that for 
non-Int columns, 1% of randomly selected cells cast to Int.

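These percentages could be converted to probabilities, and these 
probabilities could be used as priors in a Bayesian scheme to determine a 
column type. Let's say you take one cell randomly and it can be cast to 
an Int. What is the probability that the column is an Int? (See 
<http://tinyurl.com/2bdn38>.) Here H is the hypothesis that the column 
is an Int column, H' is the hypothesis that it is not, and D is the 
observation that the sampled cell casts to Int.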

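P_1(H) = 0.20         --> Prior (20% of prior columns are Int columns)
P(D|H) = 0.98         --> 98% of cells in an Int column cast to Int
P(D|H') = 0.01        --> 1% of cells in non-Int columns cast to Int

P_1(H|D) = P(D|H) P_1(H) / [P(D|H) P_1(H) + P(D|H') (1 - P_1(H))]
P_1(H|D) = 0.9607843  --> Posterior & New Prior "P_2(H)"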
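
As a check on the arithmetic, here is a minimal sketch of that single 
update in Python. The function name posterior and its parameter names are 
my own choices for illustration; the three numbers are the ones given 
above.

```python
def posterior(prior, p_d_given_h, p_d_given_not_h):
    """Bayes' rule for a binary hypothesis H after observing evidence D."""
    # Total probability of seeing D, whether or not H is true.
    evidence = prior * p_d_given_h + (1.0 - prior) * p_d_given_not_h
    return prior * p_d_given_h / evidence


# H: the column is an Int column; D: a randomly chosen cell casts to Int.
p = posterior(prior=0.20, p_d_given_h=0.98, p_d_given_not_h=0.01)
print(round(p, 7))  # 0.9607843
```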


Now with one test positive for Int, you are getting pretty certain you 
have an Int column. Now we take a second cell randomly from the same 
column and find that it too casts to Int.

P_2(H) = 0.9607843    --> Confidence it's an Int column from round 1
P(D|H) = 0.98
P(D|H') = 0.01

P_2(H|D) = 0.9995836


Yikes! But I'm still not convinced it's an Int because I haven't even had 
to wait a millisecond to get the answer. Let's burn some more clock cycles.

Let's say we really have an Int column and get "lucky" with our tests 
(P = 0.98**4 = 92% chance that all four cells cast) and find two more 
random cells successfully cast to Int:

P_4(H) = 0.9999957    --> Confidence it's an Int column from round 3
P(D|H) = 0.98
P(D|H') = 0.01

P_4(H|D) = 0.9999999
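
The repeated updating can be sketched as a loop that samples cells and 
feeds each posterior back in as the next prior. This is only an 
illustration under stated assumptions: the helper names (casts_to_int, 
update, int_column_belief) and the example column are hypothetical, and 
the handling of a failed cast (using the complements 0.02 and 0.99) is my 
extension of the numbers above, not something worked through in the 
rounds shown.

```python
import random

P_CAST_IF_INT = 0.98      # 2% of cells in an Int column fail to cast
P_CAST_IF_NOT_INT = 0.01  # 1% of cells in non-Int columns cast to Int


def casts_to_int(cell):
    """Return True if the string can be converted with int()."""
    try:
        int(cell)
    except ValueError:
        return False
    return True


def update(prior, cast_ok):
    """One Bayesian update; a failed cast uses the complementary likelihoods."""
    if cast_ok:
        like_h, like_not_h = P_CAST_IF_INT, P_CAST_IF_NOT_INT
    else:
        # Assumption: a failure is simply "not D", so use 1 - P(D|.).
        like_h, like_not_h = 1 - P_CAST_IF_INT, 1 - P_CAST_IF_NOT_INT
    return prior * like_h / (prior * like_h + (1 - prior) * like_not_h)


def int_column_belief(column, n_samples=4, prior=0.20, rng=random):
    """Sample cells without replacement; return the belief after each one."""
    beliefs = []
    for cell in rng.sample(column, n_samples):
        prior = update(prior, casts_to_int(cell))  # posterior becomes new prior
        beliefs.append(prior)
    return beliefs


column = [str(n) for n in range(100)]  # hypothetical all-Int column
for round_no, belief in enumerate(int_column_belief(column), start=1):
    print(round_no, "%.7f" % belief)
```

With every sampled cell casting, the printed beliefs follow the four 
rounds above, apart from rounding in the last digit. In practice you 
would stop sampling once the belief crosses whatever threshold you are 
comfortable with.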


I don't know about you, but after only four positives, my calculator ran 
out of significant digits, so I am at least 99.99999% convinced it's 
an Int column, and I'm going to stop wasting CPU cycles and move on to 
test the next column. How do you know it's not a Float? Well, given 
floats with only one decimal place, you would expect only 1/10th could 
be cast to Int (were the tenths-decimal place to vary randomly). You 
could generate a similar statistical model to convince yourself with 
vanishing uncertainty that a column that tests positive for Int four 
times in a random sample is not actually a Float (even with only one 
decimal place known).
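
One way such a model might look, as a rough sketch: compare just the two 
hypotheses "Int column" and "Float column", using the 20% and 25% prior 
frequencies and the 0.98 and 1/10 per-cell probabilities mentioned above. 
Restricting the comparison to two hypotheses is my simplification, and 
the function name posterior_int_vs_float is hypothetical. Note also that 
"casts to Int" here means the value is integral (like 3.0); Python's 
int('3.0') raises ValueError, so a real test would need something looser.

```python
def posterior_int_vs_float(n_positive, prior_int=0.20, prior_float=0.25,
                           p_cast_if_int=0.98, p_cast_if_float=0.10):
    """P(Int column | n_positive sampled cells all cast), Int vs Float only."""
    # Unnormalized weight of each hypothesis: prior times likelihood of
    # seeing n_positive independent successful casts.
    weight_int = prior_int * p_cast_if_int ** n_positive
    weight_float = prior_float * p_cast_if_float ** n_positive
    return weight_int / (weight_int + weight_float)


for n in range(1, 5):
    print(n, "%.7f" % posterior_int_vs_float(n))
```

Each extra positive multiplies the odds in favor of Int by 0.98/0.10, 
so the Float hypothesis fades quickly even though it starts with the 
larger prior.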


James


