converting strings to their most efficient types '1' --> 1, 'A' ---> 'A', '1.2' ---> 1.2

Mon May 21 18:57:17 EDT 2007

py_genetic wrote:
> Using a Bayesian method was my initial thought as well.  The key to
> this method, I feel, is getting a solid random sample of the entire
> file without having to load the whole beast into memory.

If you feel the first 1000 rows alone are representative, then you can 
take a random sample from the first 200-1000 rows, depending on how good 
you think the typing was. At 10,000 bytes per row, reading 1000 rows 
means reading only 10MB. I just timed reading a 28MB file 
(source tgz file of Open Office 2.0.1) from a local drive at about 1s. 
As I hope I have demonstrated, you will need only a small random sample 
from these 200-1000 rows for testing (maybe 5-15 rows, depending on 
quality and priors).
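
A minimal sketch of that kind of sampling, assuming Python 3 and rows 
that arrive from any iterator (a csv.reader over an open file, for 
example). The name sample_head and its default sizes are hypothetical, 
chosen only to mirror the numbers above:

```python
import itertools
import random


def sample_head(rows, head=1000, k=15, seed=None):
    """Pick k random rows from the first `head` rows of an iterable.

    Only the first `head` rows are ever read, so the rest of a large
    file is never loaded into memory.
    """
    rng = random.Random(seed)
    head_rows = list(itertools.islice(rows, head))
    # Guard against files shorter than the requested sample size.
    return rng.sample(head_rows, min(k, len(head_rows)))


# Possible usage (the file name is a placeholder):
#   import csv
#   with open("table.csv", newline="") as f:
#       sample = sample_head(csv.reader(f), head=1000, k=15)
```

Because islice stops after `head` rows, the cost is bounded by the size 
of the head of the file, not the whole file.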

> What are your thoughts on other techniques?  For example, training a
> neural net and feeding it a sample; this might be nice and very fast,
> since after training (we would have to create a good global training
> set) we could just do a quick transform on a column sample and average
> the probabilities of the output units (one output unit for each type).
> The question here would be encoding; any ideas?  A binary
> representation of the vars?
> Furthermore, naive Bayes, decision trees, etc.?

I think these latter ideas require more characterization of your input 
than is necessary, especially with judicious ordering of simple 
converter tests. What properties, aside from passing and failing certain 
converter tests, would you use for the data types? For example, you 
might use the fraction of each string's characters that are digits as a property 
to classify integers. But how many CPU cycles would you spend counting 
these digits and calculating the fractions for your sample? Do you 
expect that your data quality is poor enough to warrant expending CPU 
cycles to quantify the several properties that might characterize each type?
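
To make the cost concrete, here is a rough sketch of one such property, 
assuming Python 3. The function name digit_fraction is hypothetical, and 
only the ASCII digits 0-9 are counted:

```python
def digit_fraction(s):
    """Return the fraction of characters in s that are ASCII digits."""
    if not s:
        return 0.0  # avoid dividing by zero on empty fields
    return sum(c in "0123456789" for c in s) / len(s)
```

Every character of every sampled string has to be inspected, and the 
result is still only a score that a classifier must interpret, whereas a 
converter test gives a direct pass or fail.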

However, you might need some advanced learning tools if you want to 
automatically decide whether a column is a last name or a first name. 
I'm guessing this is not required and you would only want to know 
whether to make such a column an int or a string, in which case the 
testing is pretty straightforward and quick (number of tests per column 
< 10).
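
One way such ordered converter tests could look, as a sketch only and 
assuming Python 3. The names CONVERTERS, guess_converter and 
convert_column are hypothetical, and the list covers just the int, float 
and string cases from the subject line:

```python
# Hypothetical converter list, ordered from most to least restrictive.
# str() accepts anything, so it acts as the fallback.
CONVERTERS = [("int", int), ("float", float), ("str", str)]


def guess_converter(sample, converters=CONVERTERS):
    """Return (name, func) for the first converter that accepts every sampled string."""
    sample = list(sample)
    for name, convert in converters:
        try:
            for value in sample:
                convert(value)
        except ValueError:
            continue  # at least one value failed; try the next converter
        return name, convert
    return "str", str


def convert_column(column, sample_size=15):
    """Guess a type from the first few values, then convert the whole column."""
    column = list(column)
    name, convert = guess_converter(column[:sample_size])
    try:
        return [convert(value) for value in column]
    except ValueError:
        # The sample was misleading; leave the column as strings.
        return column
```

Note that int() and float() tolerate surrounding whitespace, and float() 
also accepts strings such as 'nan' and 'inf', so stricter tests may be 
needed depending on the data. The sample here is simply the first few 
values; a random sample could be passed to guess_converter instead.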

I've never converted tables from organizations on a routine basis, but I 
have a feeling that the quality of these tables is not as poor as one 
might fear, especially given reasonable foreknowledge of how data types 
are typically encoded.

James