converting strings to their most efficient types: '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

James Stroud jstroud at mbi.ucla.edu
Sun May 20 07:59:58 EDT 2007


John Machin wrote:
> The model would have to be a lot more complicated than that. There is a 
> base number of required columns. The kind suppliers of the data randomly 
> add extra columns, randomly permute the order in which the columns 
> appear, and, for date columns

I'm going to ignore this because these things have absolutely no effect 
on the analysis whatsoever. Random order of columns? How could this 
influence any statistics--counting, Bayesian, or otherwise?

> randomly choose the day-month-year order,
> how much punctuation to sprinkle between the digits, and whether to 
> append some bonus extra bytes like " 00:00:00".

I absolutely do not understand how bonus bytes or any of the above would 
selectively hurt any single type of statistics--if your converter 
doesn't recognize a value, it doesn't recognize it, and it will fail 
under every circumstance and skew any and all statistical analyses 
equally. Under such conditions, I want very robust analysis--probably 
more robust than simple counting statistics. And I definitely want 
something more efficient.
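
For concreteness, here is a minimal sketch of the kind of converter 
under discussion (the name convert and the fallback order are my 
illustration, not code from this thread):

    def convert(value):
        # Try the most restrictive cast first, so '1' becomes 1 rather
        # than 1.0. A real version would try date formats before int,
        # since a string like '20070520' also casts cleanly to int.
        for cast in (int, float):
            try:
                return cast(value)
            except ValueError:
                pass
        return value  # nothing matched: 'A' stays 'A'

So convert('1') == 1, convert('1.2') == 1.2, and convert('A') == 'A', 
per the subject line.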

> Past stats on failure to cast are no guide to the future

Not true when using Bayesian statistics (or any type of inference, for 
that matter). For example, where did you get the 90% cutoff? From 
experience? I thought past stats were no guide to future expectations?

> ... a sudden
> change in the failure rate can be caused by the kind folk introducing a 
> new null designator i.e. outside the list ['', 'NULL', 'NA', 'N/A', 
> '#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]

Using the rough model, and having no idea that they threw in a few 
weird designators (so that you might suspect a 20% failure rate instead 
of the 2% I modeled previously), the *low probability of false 
positives* (say 5% of the non-Int columns evaluate to integer--after 
you've eliminated dates, because you remembered to test more 
restrictive types first) would still *drive the statistics*. Remember, 
the posteriors become priors after the first test. (Here H is the 
hypothesis "this column holds Ints" and D is the datum "this value 
casts to int".)

P_1(H) = 0.2    (Just a guess; it'll wash out after about 3 tests.)
P(D|H) = 0.8    (Are you sure they have it together enough to pay you?)
P(D|H') = 0.05  (5% of the names, salaries, etc., evaluate to integer?)

Let's model failures, since the companies you work with have bad 
typists. We have to reverse the probabilities for this:

Pf_1(H) = 0.2   (Only if this is round 1.)
Pf(D|H) = 0.2   (We *guess* 20% of true Ints fail--carpal tunnel, ennui, etc.)
Pf(D|H') = 0.80 (80% of values in non-Int columns fail the int cast.)

You might take issue with Pf(D|H) = 0.2. I encourage you to try a range 
of values here to see what the posteriors look like. You'll find that 
this is not as important as the *low false positive rate*.
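
To put numbers on this, a sketch of one round of Bayes' rule with the 
likelihoods above (the name bayes_update is mine):

    def bayes_update(prior, p_d_given_h, p_d_given_not_h):
        # P(H|D) = P(D|H)P(H) / [P(D|H)P(H) + P(D|H')P(H')]
        numerator = p_d_given_h * prior
        return numerator / (numerator + p_d_given_not_h * (1.0 - prior))

    PASS = (0.8, 0.05)  # P(D|H), P(D|H') when a value casts to int
    FAIL = (0.2, 0.80)  # Pf(D|H), Pf(D|H') when a value fails the cast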

For example, let's not stop until we are 99.9% sure one way or the other. 
With this cutoff, let's suppose this deplorable display of typing integers:

    pass-fail-fail-pass-pass-pass

which might be expected from the above very pessimistic priors (maybe 
you got data from the _Apathy_Coalition_ or the _Bad_Typists_Union_ or 
the _Put_a_Quote_Around_Every_5th_Integer_League_):

P_1(H|D) = 0.800     (pass)
P_2(H|D) = 0.500     (fail)
P_3(H|D) = 0.200     (fail--don't stop, not 99.9% sure)
P_4(H|D) = 0.800     (pass)
P_5(H|D) = 0.9846153 (pass--not there yet)
P_6(H|D) = 0.9990243 (pass--got it!)
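
Those posteriors fall straight out of the sketch above:

    posterior = 0.2  # P_1(H), the starting prior
    for outcome in ("pass", "fail", "fail", "pass", "pass", "pass"):
        posterior = bayes_update(posterior,
                                 *(PASS if outcome == "pass" else FAIL))
        print("%s -> %.5f" % (outcome, posterior))
    # pass -> 0.80000, fail -> 0.50000, fail -> 0.20000,
    # pass -> 0.80000, pass -> 0.98462, pass -> 0.99902 (> 0.999: got it)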

Now this is with 5% of all salaries, names of people, addresses, favorite 
colors, etc., evaluating to integers. (Pausing while I remember fondly 
Uncle 41572--such a nice guy...funny name, though.)

> There is also the problem of first-time-participating organisations -- 
> in police parlance, they have no priors :-)

Yes, because they teleported from Alpha Centauri, where organizations are 
fundamentally different from those here on Earth, and we cannot make any 
reasonable assumptions about them--like that they will indeed cough up 
money when the time comes, or that they speak a dialect of an Earth 
language, or that they even generate spreadsheets for us to parse.

James


