converting strings to most their efficient types '1' --> 1, 'A' ---> 'A', '1.2'---> 1.2
James Stroud
jstroud at mbi.ucla.edu
Sun May 20 03:47:12 EDT 2007
John Machin wrote:
>Against that background, please explain to me how I can use
> "results from previous tables as priors".
>
> Cheers,
> John
It depends on how you want to model your probabilities, but, as an
example, you might find the following frequencies of columns in all
tables you have parsed from this organization: 35% Strings, 25% Floats,
20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. Let's say that you have
also used prior counting statistics to find that there is a 2% error
rate in the columns (2% of the values of a typical Float column fail to
cast to Float, 2% of values in Int columns fail to cast to Int, and
so on, though these need not all be equal). Let's also say that for
non-Int columns, 1% of cells randomly selected cast to Int.
These percentages could be converted to probabilities, and these
probabilities could be used as priors in a Bayesian scheme to determine a
column type. Let's say you take one cell randomly and it can be cast to
an Int. What is the probability that the column is an Int? (See
<http://tinyurl.com/2bdn38>.)
P_1(H) = 0.20 --> Prior (20% prior columns are Int columns)
P(D|H) = 0.98 --> Cell from an Int column casts to Int (2% error rate)
P(D|H') = 0.01 --> Cell from a non-Int column casts to Int
P_1(H|D) = 0.9607843 --> Posterior & New Prior "P_2(H)"
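For anyone who wants to check the arithmetic, here is a minimal sketch of that single update in Python. The function name posterior and its argument names are just my own labels for illustration; the numbers are the ones from the example above.

```python
def posterior(prior, p_d_given_h, p_d_given_not_h):
    """Bayes' rule for a yes/no hypothesis H after observing evidence D."""
    # Total probability of seeing D, whether or not H is true.
    evidence = prior * p_d_given_h + (1.0 - prior) * p_d_given_not_h
    return prior * p_d_given_h / evidence


# H: the column is an Int column; D: one random cell casts to Int.
p1 = posterior(prior=0.20, p_d_given_h=0.98, p_d_given_not_h=0.01)
print(round(p1, 7))  # 0.9607843
```

That is just 0.196 / (0.196 + 0.008) = 0.196 / 0.204.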
Now with one test positive for Int, you are getting pretty certain you
have an Int column. Now we take a second cell randomly from the same
column and find that it too casts to Int.
P_2(H) = 0.9607843 --> Confidence it's an Int column from round 1
P(D|H) = 0.98
P(D|H') = 0.01
P_2(H|D) = 0.9995836
Yikes! But I'm still not convinced it's an Int because I haven't even had
to wait a millisecond to get the answer. Let's burn some more clock cycles.
Let's say we really have an Int column and get "lucky" with our tests (P
= 0.98**4 = 92% chance) and find two more random cells successfully cast
to Int:
P_4(H) = 0.9999957 --> Posterior from round 3
P(D|H) = 0.98
P(D|H') = 0.01
P_4(H|D) = 0.9999999
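The whole chain of updates can be sketched as a loop where each posterior becomes the next prior. This assumes the 98% figure for Int columns and the 1% figure stated earlier for non-Int columns at every round; the names update and chain are hypothetical helpers of mine, not anything from a library.

```python
def update(prior, p_d_given_h=0.98, p_d_given_not_h=0.01):
    """One Bayesian update after a sampled cell casts to Int."""
    evidence = prior * p_d_given_h + (1.0 - prior) * p_d_given_not_h
    return prior * p_d_given_h / evidence


def chain(prior, n_positives):
    """Feed each posterior back in as the prior for the next positive test."""
    beliefs = []
    for _ in range(n_positives):
        prior = update(prior)
        beliefs.append(prior)
    return beliefs


beliefs = chain(0.20, 4)
for round_number, belief in enumerate(beliefs, start=1):
    print(f"after {round_number} positive(s): {belief:.7f}")
```

The printed values should agree with the figures above up to rounding in the last digit. In odds form each positive test multiplies the odds by 0.98 / 0.01 = 98, which is why the confidence climbs so quickly.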
I don't know about you, but after only four positives, my calculator ran
out of significant digits and so I am at least 99.99999% convinced it's
an Int column and I'm going to stop wasting CPU cycles and move on to
test the next column. How do you know it's not a Float? Well, given
Floats with only one decimal place, you would expect only 1/10th could
be cast to Int (were the tenths decimal place to vary randomly). You
could generate a similar statistical model to convince yourself with
vanishing uncertainty that a column that tests positive for Int four
times in a random sample is not actually a Float (even with only one
decimal place known).
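A rough sketch of such a model, comparing only the Int and Float hypotheses, might look like this. The assumptions are mine: it reuses the 20% and 25% column frequencies as priors, takes 0.98 as the chance that an Int-column cell casts to Int and 0.10 as the chance that a one-decimal Float does, and ignores the other column types entirely. The function name int_vs_float is hypothetical.

```python
def int_vs_float(n_positives, prior_int=0.20, prior_float=0.25,
                 p_cast_int=0.98, p_cast_float=0.10):
    """P(column is Int | n cells in a row cast to Int), Int vs Float only."""
    # Unnormalized weight of each hypothesis: prior times likelihood.
    weight_int = prior_int * p_cast_int ** n_positives
    weight_float = prior_float * p_cast_float ** n_positives
    return weight_int / (weight_int + weight_float)


for n in range(5):
    print(n, int_vs_float(n))
```

Under these assumptions, four positives in a row leave well under a 0.1% chance that the column is really a Float, and every further positive shrinks that by roughly another factor of ten. What counts as "casts to Int" is up to your parser; Python's own int('1.0') raises ValueError, so a value like 1.0 would need to be tested via float first.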
James