converting strings to their most efficient types '1' --> 1, 'A' --> 'A', '1.2' --> 1.2

John Machin sjmachin at lexicon.net
Sun May 20 05:40:48 EDT 2007


On 20/05/2007 5:47 PM, James Stroud wrote:
> John Machin wrote:
>> Against that background, please explain to me how I can use "results 
>> from previous tables as priors".
>>
>> Cheers,
>> John
> 
> It depends on how you want to model your probabilities, but, as an 
> example, you might find the following frequencies of columns in all 
> tables you have parsed from this organization:  35% Strings, 25% Floats, 
> 20% Ints, 15% Date MMDDYYYY, and 5% Date YYMMDD. 

The model would have to be a lot more complicated than that. There is a 
base number of required columns. The kind suppliers of the data randomly 
add extra columns, randomly permute the order in which the columns 
appear, and, for date columns, randomly choose the day-month-year order, 
how much punctuation to sprinkle between the digits, and whether to 
append some bonus extra bytes like " 00:00:00".
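
To make the date problem concrete, here is a rough sketch of what a parser has to cope with. It assumes 8-digit dates only; the layout names, the year window, and the function name candidate_dates are all made up for illustration and are not part of any real pipeline.

```python
import re
from datetime import date

# Assumed plausibility window for years; purely illustrative.
MIN_YEAR, MAX_YEAR = 1900, 2100

def candidate_dates(raw):
    """Return {layout_name: date} for every layout that yields a valid date."""
    text = raw.strip()
    # Drop the bonus midnight timestamp if the supplier appended one.
    if text.endswith(" 00:00:00"):
        text = text[:-len(" 00:00:00")]
    # Discard whatever punctuation was sprinkled between the digits.
    digits = re.sub(r"\D", "", text)
    if len(digits) != 8:
        return {}
    # Each layout maps to (day, month, year) substrings.
    layouts = {
        "DMY": (digits[0:2], digits[2:4], digits[4:8]),
        "MDY": (digits[2:4], digits[0:2], digits[4:8]),
        "YMD": (digits[6:8], digits[4:6], digits[0:4]),
    }
    results = {}
    for name, (d, m, y) in layouts.items():
        if not MIN_YEAR <= int(y) <= MAX_YEAR:
            continue
        try:
            results[name] = date(int(y), int(m), int(d))
        except ValueError:
            pass  # e.g. month 20 or day 31 in a 30-day month
    return results
```

A value like "05-06-2007" comes back with two candidates (DMY and MDY), so a single cell cannot settle the order; the whole column has to be examined.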

> Let's say that you have 
> also used prior counting statistics to find that there is a 2% error 
> rate in the columns (2% of the values of a typical Float column fail to 
> cast to Float, 2% of values in Int columns fail to cast to Int, and 
> so on, though these need not all be equal). Let's also say that for 
> non-Int columns, 1% of cells randomly selected cast to Int.

Past stats on failure to cast are no guide to the future ... a sudden 
change in the failure rate can be caused by the kind folk introducing a 
new null designator, i.e. one outside the list ['', 'NULL', 'NA', 'N/A', 
'#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN', etc etc etc]
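
As a minimal sketch of the conversion in the subject line with null handling bolted on: the set below holds only the designators named above (the real list is longer), and the name convert, the whitespace stripping, and the case-insensitive match are assumptions for illustration.

```python
# Only the designators listed above; a real list would need more entries.
NULL_DESIGNATORS = {'', 'NULL', 'NA', 'N/A', '#N/A!', 'UNK', 'UNKNOWN', 'NOT GIVEN'}

def convert(value, nulls=NULL_DESIGNATORS):
    """Map a string to None, int, float, or the (stripped) string itself."""
    text = value.strip()
    # Assumption: null designators are matched case-insensitively.
    if text.upper() in nulls:
        return None
    # Try the narrowest type first, then fall back.
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text
```

A designator that is not in the set, say '-', falls straight through to the string branch, which is exactly how a new null marker shows up as a sudden jump in the cast-failure rate. Note also that float() accepts strings such as 'nan' and 'inf', so those would need separate treatment.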


There is also the problem of first-time-participating organisations -- 
in police parlance, they have no priors :-)

So, all in all, Bayesian inference doesn't seem much use in this scenario.

> 
> These percentages could be converted to probabilities and these 
> probabilities could be used as priors in a Bayesian scheme to determine a 
> column type. Let's say you take one cell randomly and it can be cast to 
> an Int. What is the probability that the column is an Int? (See 
> <http://tinyurl.com/2bdn38>.)

That's fancy -- a great improvement on the slide rule and squared paper :-)
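
For reference, the arithmetic behind James's question can be sketched with his example figures: a 20% prior for Int columns, a 2% error rate (so an assumed 98% chance that a cell from an Int column casts to Int), and a 1% chance that a cell from any non-Int column casts to Int. The function name posterior_int is made up for illustration.

```python
def posterior_int(prior_int, p_cast_given_int, p_cast_given_other):
    """P(column is Int | one randomly chosen cell casts to Int), by Bayes' rule."""
    # Total probability that a random cell casts to Int, over both column kinds.
    evidence = (prior_int * p_cast_given_int
                + (1 - prior_int) * p_cast_given_other)
    return prior_int * p_cast_given_int / evidence

# James's example numbers: 0.196 / (0.196 + 0.008)
p = posterior_int(prior_int=0.20, p_cast_given_int=0.98, p_cast_given_other=0.01)
print(round(p, 3))  # 0.961
```

The result is only as good as the three inputs, which is the objection raised above: the priors and failure rates are not stable from one delivery to the next.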

Cheers,
John


