[scikit-learn] data augmentation following the underlying feature values distributions and correlations

Thomas Evangelidis tevang3 at gmail.com
Mon Dec 18 09:19:13 EST 2017


Greetings,

I want to augment my training set while preserving the correlations
between feature values. More specifically, my features are NMR resonances
of the nuclei of a single amino acid. For example, for glutamic acid I
have the following feature values for each observation:

[CA, HA, CB, HB, CG, HG]

where CA is the resonance of the alpha carbon, HA the resonance of the
alpha proton, and so forth. The complication here is that these feature
values are not independent. HA is covalently bonded to CA, CB to CA, and so
on. Therefore if I sample a random CA value from the distribution of
experimental values of CA, I cannot pick ANY HA VALUE from the respective
experimental distribution, simply because CA and HA are correlated. The
same applies to CA and CB, CB and HB, CB and CG, CG and HG. Is there any
algorithm that can generate [CA, HA, CB, HB, CG, HG] feature vectors that
comply with the atom distributions and their correlations? I saw that
Gaussian Mixture Models have a function to generate random samples from the
fitted Gaussian distribution (sklearn.mixture.GaussianMixture.sample) but
it is not clear if these samples will retain the correlations between the
features (nuclei in this case). If there is no such algorithm in
scikit-learn, could you please point me to another Python library that
does this?
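
To make the question concrete, here is a minimal sketch of the check I
have in mind. It is only an illustration under stated assumptions: the data
are synthetic (zero-mean Gaussian with a made-up covariance in which
neighbouring columns correlate at 0.8), not real chemical shifts, and a
single mixture component is assumed. It fits GaussianMixture with
covariance_type="full", draws new samples with sample(), and compares the
correlation matrices. For contrast, it also resamples each column
independently, which is the naive approach described above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
names = ["CA", "HA", "CB", "HB", "CG", "HG"]

# ASSUMPTION: synthetic stand-in for the experimental data. The covariance
# 0.8**|i-j| is made up for illustration and is not derived from NMR data.
idx = np.arange(len(names))
true_cov = 0.8 ** np.abs(idx[:, None] - idx[None, :])
X = rng.multivariate_normal(np.zeros(len(names)), true_cov, size=5000)

# covariance_type="full" lets each component carry a full covariance
# matrix, so off-diagonal terms (feature correlations) are modelled.
gmm = GaussianMixture(n_components=1, covariance_type="full", random_state=0)
gmm.fit(X)
X_gmm, _ = gmm.sample(5000)  # returns (samples, component labels)

# Naive alternative: shuffle every column on its own, which keeps each
# marginal distribution but destroys the dependence between columns.
X_indep = np.column_stack(
    [rng.permutation(X[:, j]) for j in range(X.shape[1])]
)

corr_orig = np.corrcoef(X, rowvar=False)
diff_gmm = np.abs(corr_orig - np.corrcoef(X_gmm, rowvar=False)).max()
diff_indep = np.abs(corr_orig - np.corrcoef(X_indep, rowvar=False)).max()

print("max |corr difference|, GMM samples:         ", diff_gmm)
print("max |corr difference|, independent sampling:", diff_indep)
```

On real data that are multimodal or non-Gaussian, more than one component
would presumably be needed (the number could be chosen, for example, by
comparing gmm.bic(X) across candidate values), and covariance_type="diag"
or "spherical" would not model correlations within a component.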

Thanks in advance.
Thomas


-- 

======================================================================

Dr Thomas Evangelidis

Post-doctoral Researcher
CEITEC - Central European Institute of Technology
Masaryk University
Kamenice 5/A35/2S049,
62500 Brno, Czech Republic

email: tevang at pharm.uoa.gr

          tevang3 at gmail.com


website: https://sites.google.com/site/thomasevangelidishomepage/