X-Informatics

Finding Structures in the Unstructured


Paper Summary: Generative Models for Chemical Structures


Generative Models for Chemical Structures, David White, Richard C. Wilson. Journal of Chemical Information and Modeling Article ASAP

An interesting paper published in JCIM. The authors built a Gaussian mixture model (GMM) from properties of compounds known to be active against a target, and used the model to generate new molecules that are likely to be active as well. Each compound is described by properties extracted from its graph representation, and PCA is applied to reduce the dimensionality. They then sample from the fitted GMM and map each sample back to a molecule.
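To make the sample-and-map-back step concrete, here is a minimal Python sketch using only the standard library. It is not the authors' implementation: the diagonal covariances, the two toy components, the PCA mean and axes, and the names `sample_gmm` and `pca_inverse` are all assumptions made for illustration. The last step, turning a descriptor vector back into an actual molecule, is not shown.

```python
import random

def sample_gmm(weights, means, stds, rng):
    """Draw one point from a Gaussian mixture with diagonal covariances."""
    # Pick a component with probability proportional to its weight.
    k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    # Sample each coordinate independently around that component's mean.
    point = [rng.gauss(m, s) for m, s in zip(means[k], stds[k])]
    return k, point

def pca_inverse(z, mean, components):
    """Map reduced coordinates z back to the original descriptor space."""
    # x = mean + sum_j z[j] * components[j]
    x = list(mean)
    for zj, axis in zip(z, components):
        for i, a in enumerate(axis):
            x[i] += zj * a
    return x

# Hypothetical toy model: 2 components in a 2-D PCA space, 3-D descriptors.
weights = [0.7, 0.3]
means = [[-1.0, 0.0], [2.0, 1.0]]
stds = [[0.5, 0.2], [0.3, 0.3]]
pca_mean = [10.0, 5.0, 1.0]
pca_axes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # orthonormal rows

rng = random.Random(0)
samples = [sample_gmm(weights, means, stds, rng) for _ in range(1000)]
descriptors = [pca_inverse(z, pca_mean, pca_axes) for _, z in samples]
share_first = sum(1 for k, _ in samples if k == 0) / len(samples)
print(f"share drawn from component 0: {share_first:.2f}")
```

With a full covariance matrix per component the sampling step would need a Cholesky factor instead of independent per-coordinate draws; the diagonal case is used here only to keep the sketch short.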

Testing the method on DUD data sets, the authors showed that the generated molecules are similar to the compounds in the input sets, and docking results suggest that they are likely to be active against the target of the input molecules.

Generative Models for Chemical Structures – Journal of Chemical Information and Modeling (ACS Publications)


Written by djiao

July 21, 2010 at 11:40 am

Add noise to data


There are two easy ways to add noise: scale the original data by a random factor (multiplicative noise), or superimpose noise on the data (additive noise).
First, for a simple function y=\sin(x), the following MATLAB code adds 10% noise to it.

N = 100;
x = linspace(-pi, pi, N);
y = sin(x);
plot(x, y, 'r');
hold on;

% multiplicative Gaussian noise: perturb each sample by about 10% of its own value
scale = 0.1;
n1 = randn(1, N); % noise with mean=0 and std=1;
y1 = y + n1.*y*scale;
plot(x, y1, 'g');

% mask the signal with additive noise
n2 = scale*max(abs(y))*randn(1,N); % noise with mean=0 and std=10% of the peak amplitude
y2 = y + n2;
plot(x, y2, 'b');

% Of course we can combine the two
y3 = y1 + n2;
plot(x, y3, 'm');

The final result looks like this:
[Figure: y=sin(x) with the noisy versions y1, y2 and y3]
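The difference between the two noise models is easy to check numerically. The following is a rough Python sketch (standard library only, not a translation of the MATLAB code line by line); the function name `noisy_sine`, the sample count and the seed are arbitrary choices for illustration. It assumes the additive noise has a standard deviation of 10% of the peak amplitude.

```python
import math
import random
import statistics

def noisy_sine(n, scale, rng):
    """Return x, the clean sine, and its multiplicative and additive noisy versions."""
    x = [-math.pi + 2 * math.pi * i / (n - 1) for i in range(n)]
    y = [math.sin(v) for v in x]
    peak = max(abs(v) for v in y)
    # Multiplicative: each sample is perturbed in proportion to its own value.
    y1 = [v + rng.gauss(0.0, 1.0) * v * scale for v in y]
    # Additive: every sample gets noise with the same std, scale * peak.
    y2 = [v + rng.gauss(0.0, 1.0) * scale * peak for v in y]
    return x, y, y1, y2

rng = random.Random(1)
x, y, y1, y2 = noisy_sine(2001, 0.1, rng)
mid = len(x) // 2  # x is (numerically) 0 here, so sin(x) is about 0
err1 = [a - b for a, b in zip(y1, y)]
err2 = [a - b for a, b in zip(y2, y)]
print("multiplicative error at the zero crossing:", err1[mid])
print("std of multiplicative error:", statistics.pstdev(err1))
print("std of additive error:", statistics.pstdev(err2))
```

The multiplicative noise vanishes wherever the signal crosses zero and is largest at the peaks, while the additive noise has the same spread everywhere. Which one is appropriate depends on the measurement process being imitated.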

We can also add noise to a more complicated synthetic data set, for example the famous Swiss roll data set [1] from manifold learning. First, we generate the data set with these equations:
t=\frac{3}{2}\cdot\pi\cdot(1+2r),\quad\text{where } r\ge 0
x=t\cdot\cos(t)
y=t\cdot\sin(t)
z\in(z_1, z_2),\quad\text{where } z_1, z_2\in\mathbb{R}
A scatter plot of (x, y, z) then gives the Swiss roll. For example, the following MATLAB code creates this figure.

N = 500;
r = linspace(0,1,N);
t = (3*pi/2)*(1+2*r);
x = t.*cos(t);
y = t.*sin(t);
z = 20*rand(1,N);
scatter3(x, y, z, 12, t, 'filled');

swiss roll data without noise
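For readers without MATLAB, the same construction can be sketched in plain Python. This is only an illustration of the equations above; the function name `swiss_roll` and the fixed seed are assumptions, and no plotting is done. As a sanity check it verifies that, in the x-y plane, the distance of each point from the origin equals t, which is what makes the curve a spiral.

```python
import math
import random

def swiss_roll(n, height, rng):
    """Generate n points on a Swiss roll, following the equations above."""
    points, ts = [], []
    for i in range(n):
        r = i / (n - 1)                      # r runs evenly from 0 to 1
        t = 1.5 * math.pi * (1 + 2 * r)      # t runs from 3*pi/2 to 9*pi/2
        points.append((t * math.cos(t), t * math.sin(t), height * rng.random()))
        ts.append(t)
    return points, ts

points, ts = swiss_roll(500, 20.0, random.Random(0))
# In the x-y plane the distance from the origin equals t.
radii = [math.hypot(px, py) for px, py, _ in points]
print(max(abs(rad - t) for rad, t in zip(radii, ts)))
```

The value of t is kept alongside the points because it is the natural colour for the scatter plot, as in the `scatter3` call.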

Now we add noise. The standard deviation of the noise is 2% of the smallest dimension of the bounding box enclosing the data (as discussed in [2]):

% smaller side of the bounding box in the x-y plane, where the noise is added
mindim = min(max(y)-min(y), max(x)-min(x));
x = x+0.02*mindim*randn(1,N); % std = 2% of mindim
y = y+0.02*mindim*randn(1,N);
scatter3(x, y, z, 12, t, 'filled');

swiss roll data with noise
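The rule "noise standard deviation = a fraction of the smallest bounding-box side" can be written as a small helper. The sketch below is in plain Python with hypothetical names (`bbox_noise_std`, `add_noise`) and a toy data set chosen so the numbers are easy to follow. Unlike the MATLAB code above, which only compares the x and y extents, it takes the minimum over all coordinates; that is one possible reading of the rule, not necessarily the one intended in [2].

```python
import math
import random

def bbox_noise_std(points, fraction):
    """Noise std as a fraction of the smallest bounding-box side."""
    sides = [max(c) - min(c) for c in zip(*points)]
    return fraction * min(sides)

def add_noise(points, sigma, rng, dims=(0, 1)):
    """Add zero-mean Gaussian noise with std sigma to the chosen coordinates."""
    return [tuple(v + rng.gauss(0.0, sigma) if i in dims else v
                  for i, v in enumerate(p)) for p in points]

# Hypothetical toy data: four corners of a 10 x 4 x 6 box.
box = [(0.0, 0.0, 0.0), (10.0, 4.0, 6.0), (10.0, 0.0, 6.0), (0.0, 4.0, 0.0)]
sigma = bbox_noise_std(box, 0.02)   # 2% of the shortest side (4.0)
noisy = add_noise(box, sigma, random.Random(0))
print(sigma)
```

By default only the first two coordinates are perturbed, matching the post, where z is left untouched.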

1. Tenenbaum, J.B., de Silva, V. & Langford, J.C. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290, 2319-2323 (2000).
2. Balasubramanian, M., Schwartz, E.L., Tenenbaum, J.B., de Silva, V. & Langford, J.C. The Isomap Algorithm and Topological Stability. Science 295, 7a (2002).


Written by djiao

April 13, 2010 at 2:29 pm

Posted in Machine Learning, Programming

