Archive for the ‘Machine Learning’ Category
Paper Summary: Generative Models for Chemical Structures
Generative Models for Chemical Structures, David White, Richard C. Wilson. Journal of Chemical Information and Modeling Article ASAP
An interesting paper published in JCIM. The authors build a Gaussian mixture model (GMM) from properties of compounds that are active against given targets, and use the model to generate new molecules that are likely to be active. Each compound is represented by properties extracted from its graph representation, and PCA is applied to reduce the dimensionality. They then sample from the fitted GMM and map the samples back to molecules.
Testing the method on DUD data sets, the authors show that the generated molecules are similar to the compounds in the input sets, and docking results suggest that they are likely to be active against the target of the input molecules.
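The reduce, fit, sample and map-back steps can be sketched in MATLAB on synthetic data. This is only an illustration under stated assumptions: the descriptor matrix X is random stand-in data rather than the graph-derived properties used in the paper, the values of k and nComp are arbitrary choices, and the final step of turning a descriptor vector into an actual molecule is not shown. It relies on pca, fitgmdist and random from the Statistics and Machine Learning Toolbox.

```matlab
rng(0);                                % reproducible run

% Hypothetical stand-in for per-compound descriptors (200 compounds, 10 features),
% drawn from two clusters so that a mixture model has something to fit.
X = [randn(100, 10); randn(100, 10) + 3];

% Reduce dimensionality with PCA, keeping the first k components.
k = 3;                                 % assumed value, not from the paper
[coeff, score, ~, ~, ~, mu] = pca(X);
Z = score(:, 1:k);

% Fit a Gaussian mixture model in the reduced space.
nComp = 2;                             % assumed number of mixture components
gm = fitgmdist(Z, nComp, 'RegularizationValue', 1e-6, 'Replicates', 5);

% Draw new points from the mixture and map them back to descriptor space.
Znew = random(gm, 50);
Xnew = Znew * coeff(:, 1:k)' + mu;

disp(size(Xnew));                      % 50 generated descriptor vectors, 10 features each
```

Sampling in the reduced space keeps the mixture model small; the price is that anything outside the first k principal components is lost when mapping back.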
Add noise to data
There are two easy ways to add noise: by scaling the original data, or by masking the data with noise.
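Written out for a signal with samples $y_i$, the two schemes look like this. The symbols $s$ (relative noise level) and $\sigma$ (absolute noise level) are notation introduced here for illustration.

```latex
\begin{align}
  \tilde{y}_i &= y_i \,(1 + s\, n_i), & n_i &\sim \mathcal{N}(0, 1) && \text{(scaling: noise proportional to the signal)} \\
  \tilde{y}_i &= y_i + \sigma\, n_i,  & n_i &\sim \mathcal{N}(0, 1) && \text{(masking: noise independent of the signal)}
\end{align}
```

In the first form the noise standard deviation at each point is $s\,|y_i|$, so it vanishes wherever the signal is zero; in the second it is $\sigma$ everywhere.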
First, for a simple function $y = \sin(x)$, the following MATLAB code adds 10% noise to it.
    N = 100;
    x = linspace(-pi, pi, N);
    y = sin(x);
    plot(x, y, 'r');
    hold on;

    % add 10% Gaussian noise scaled by the signal itself
    scale = 0.1;
    n1 = randn(1, N);                 % noise with mean = 0 and std = 1
    y1 = y + n1 .* y * scale;
    plot(x, y1, 'g');

    % mask the signal with noise: mean = 0, std = 10% of the peak amplitude
    n2 = scale * max(abs(y)) * randn(1, N);
    y2 = y + n2;
    plot(x, y2, 'b');

    % of course we can combine the two
    y3 = y1 + n2;
    plot(x, y3, 'm');
The final result looks like this:

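As a quick numerical check of the difference between the two ways of adding noise, the sketch below measures the noise that was actually added in each case. The sample size, the thresholds 0.05 and 0.95, and the names res1, res2, nearZero and nearPeak are choices made for this illustration; the 10% level is taken from the example above.

```matlab
rng(1);                          % reproducible run
N = 100000;                      % large sample so the estimates are stable
x = linspace(-pi, pi, N);
y = sin(x);
scale = 0.1;

res1 = (y + scale * y .* randn(1, N)) - y;   % noise added by scaling
res2 = (y + scale * randn(1, N)) - y;        % noise added by masking

nearZero = abs(y) < 0.05;        % samples where the signal is close to zero
nearPeak = abs(y) > 0.95;        % samples near the peaks

fprintf('scaling: std near zero %.4f, near peak %.4f\n', std(res1(nearZero)), std(res1(nearPeak)));
fprintf('masking: std near zero %.4f, near peak %.4f\n', std(res2(nearZero)), std(res2(nearPeak)));
```

The scaled noise should come out much smaller near the zero crossings than near the peaks, while the masking noise should be close to 0.1 in both regions.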
We can also add noise to more complicated synthetic data, for example the famous swiss roll data set [1] from manifold learning. First, we generate the data set with this function:
$x = t\cos t,\quad y = t\sin t,\quad z = 20u,\quad t = \frac{3\pi}{2}(1 + 2r)$, with $r \in [0, 1]$ and $u$ drawn uniformly from $[0, 1]$.
A scatter plot of $(x, y, z)$ gives us the swiss roll data set. For example, the following MATLAB code creates this figure.
    N = 500;
    r = linspace(0, 1, N);
    t = (3*pi/2) * (1 + 2*r);
    x = t .* cos(t);
    y = t .* sin(t);
    z = 20 * rand(1, N);
    scatter3(x, y, z, 12, t, 'filled');
Now we add noise whose standard deviation is 2% of the smallest dimension of the bounding box enclosing the data (as discussed in [2]):
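    % smaller side of the bounding box in the x-y plane
    mindim = min(max(y) - min(y), max(x) - min(x));
    % Gaussian noise with std = 2% of that side, added to x and y
    x = x + 0.02 * mindim * randn(1, N);
    y = y + 0.02 * mindim * randn(1, N);
    scatter3(x, y, z, 12, t, 'filled');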
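The same recipe can be wrapped in a small helper for a data matrix with any number of coordinates. This is a sketch, not code from [2]: the names extent and addBoxNoise are hypothetical, the data are stored one coordinate per row, and, unlike the snippet above, both the bounding box and the noise cover every coordinate, including z.

```matlab
rng(2);                                          % reproducible run

% Side lengths of the bounding box, one per coordinate (row) of X.
extent = @(X) max(X, [], 2) - min(X, [], 2);

% Add Gaussian noise whose standard deviation is frac times the smallest side.
addBoxNoise = @(X, frac) X + frac * min(extent(X)) * randn(size(X));

% Usage on a swiss roll built as in the text.
N = 500;
r = linspace(0, 1, N);
t = (3*pi/2) * (1 + 2*r);
X = [t .* cos(t); t .* sin(t); 20 * rand(1, N)];   % 3-by-N data matrix
Xn = addBoxNoise(X, 0.02);

scatter3(Xn(1, :), Xn(2, :), Xn(3, :), 12, t, 'filled');
```

Tying the noise level to the bounding box makes the fraction comparable across data sets with different scales.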
1. Tenenbaum, J.B., de Silva, V. & Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319-2323 (2000).
2. Balasubramanian, M., Schwartz, E.L., Tenenbaum, J.B., de Silva, V. & Langford, J.C. The Isomap Algorithm and Topological Stability. Science 295, 7a (2002).


