Chapter 5. Unsupervised Learning#
Contents#
Clustering (Bishop, Chapter 9):
\(k\)-means clustering
mixtures of Gaussians
self-organizing maps (SOM)
Subspace mapping (Bishop, Chapter 12):
principal component analysis (PCA)
singular value decomposition (SVD)
independent component analysis (ICA)
Neural Mechanisms
Receptive field formation
As we grow, we learn that there are different things and creatures in the world, such as plants and animals, and in more detail, dogs, cats and humans. What is remarkable is that most of this learning happens spontaneously, without explicit teaching about what a dog is, or labels specifying which is a dog and which is a cat. Learning categories without explicit labels is an example of unsupervised learning. But how can we define categories without category labels?
The key in unsupervised learning is to find a certain structure in the distribution \(p(\b{x})\) that produced the observed data \(\{\b{x}_1,...,\b{x}_N\}\).
Typical cases are:
Dividing into clusters:
\(k\)-means clustering
mixtures of Gaussians
self-organizing maps (SOM)
Decomposing into components:
principal component analysis (PCA)
singular value decomposition (SVD)
independent component analysis (ICA)
K-means Clustering#
The most basic method of clustering is \(K\)-means clustering, which divides a data set \(\{\b{x}_1,…,\b{x}_N\}\) into \(K\) clusters.
We define prototypes \(\b{\mu}_k\) for \(k=1,...,K\) clusters and specify the assignment of each data point by binary indicator variables \(r_{nk}\in\{0,1\}\).
A good clustering is achieved by minimizing the distortion measure
$$J = \sum_{n=1}^N \sum_{k=1}^K r_{nk}\,\|\b{x}_n-\b{\mu}_k\|^2.$$
We do that by repeating the following steps:
For the current prototypes \(\b{\mu}_k\), re-assign data points.
For each \(\b{x}_n\), find the nearest \(\b{\mu}_k\) and set \(r_{nk} = 1\) and \(r_{nj} = 0\) for \(j\ne k\).
For the current assignment \(r_{nk}\), update the prototypes by
$$\b{\mu}_k = \frac{\sum_n r_{nk}\b{x}_n}{\sum_n r_{nk}}.$$
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
# Load data from a text file
X = np.loadtxt('data/faithful.txt')
N, D = X.shape
plt.scatter(X[:,0], X[:,1])
plt.xlabel('eruption time (min)')
plt.ylabel('waiting time (min)');
K = 2 # number of clusters
# Initial guess of prototypes
Mu = X[np.random.randint(0, N, K),:] # pick K points randomly
plt.scatter(X[:,0], X[:,1], marker='.')
plt.scatter(Mu[:,0], Mu[:,1], c=range(K), marker='+', s=100)
plt.show()
Y = np.zeros(N, dtype=int) # cluster label
R = np.zeros((N,K), dtype=bool) # assignment matrix
for t in range(5):
# Update assignment
for n in range(N):
# check the distances
dist = [ np.dot(X[n]-Mu[k], X[n]-Mu[k]) for k in range(K)]
# find the nearest mean
Y[n] = np.argmin(dist)
R[n,:] = np.zeros(K)
R[n,Y[n]] = 1
# show assignment
plt.scatter(X[:,0], X[:,1], c=Y, marker='.')
# Update the means
for k in range(K):
Mu[k] = np.mean(X[R[:,k]], axis=0)
# plot the new means
plt.scatter(Mu[:,0], Mu[:,1], c=range(K), marker='+', s=100)
plt.show()
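As a quick check (an addition to the original code), we can evaluate the distortion measure \(J\) after the updates; it should decrease at every iteration.
# distortion J = sum_n sum_k r_nk ||x_n - mu_k||^2, using X, R, Mu, K from above
J = sum(np.sum((X[R[:,k]] - Mu[k])**2) for k in range(K))
print("distortion J =", J)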
Mixtures of Gaussians#
It is often the case that clusters overlap and the assignment is probabilistic. The mixture of Gaussians is a probabilistic extension of \(K\)-means clustering.
A Gaussian mixture distribution has the form
$$p(\b{x}) = \sum_{k=1}^K \pi_k \mathcal{N}(\b{x}|\b{\mu}_k,\Sigma_k)$$
where \(\b{\mu}_k\) and \(\Sigma_k\) are the mean and the covariance of the \(k\)-th Gaussian and \(\pi_k\) is the mixture probability.
# sample from a Gaussian mixture distribution
def gaussmix(pi, mu, sigma):
K = len(pi)
z = np.random.multinomial(1, pi) # binary stochastic variable
k = list(z).index(1) # the index of z_k=1
x = np.random.multivariate_normal(mu[k], sigma[k])
return x, k
pi = [0.3, 0.7] # mixture probability
mu = [[0,0], [5,3]] # means
sigma = [[[1,-1],[-1,2]], [[2,1],[1,1]]] # variances
N = 1000
X = np.zeros((N,2))
Y = np.zeros(N, dtype=int)
for n in range(N):
X[n,:], Y[n] = gaussmix(pi, mu, sigma)
plt.scatter(X[:,0], X[:,1], c=Y, marker='.');
Expectation-maximization (EM) Algorithm#
We consider a binary stochastic variable
$$\b{z} = (z_1,...,z_K), \quad z_k\in\{0,1\}, \quad \sum_{k=1}^K z_k=1,$$
indicating which Gaussian a data point belongs to, with prior probability \(p(z_k=1)=\pi_k\).
For fitting the parameters \(\pi_k\), \(\mu_k\) and \(\Sigma_k\) \((k=1,...,K)\), we repeat the following steps, called the expectation-maximization (EM) algorithm:
E (expectation) step#
For the current parameters \((\pi_k,\b{\mu}_k,\Sigma_k)\), evaluate the posterior distribution of \(\b{z}\) given \(\b{x}\), called the responsibility:
$$r_{nk} = p(z_k=1|\b{x}_n) = \frac{\pi_k \mathcal{N}(\b{x}_n|\b{\mu}_k,\Sigma_k)}{\sum_{j=1}^K \pi_j \mathcal{N}(\b{x}_n|\b{\mu}_j,\Sigma_j)}.$$
M (maximization) step#
For the current responsibility, estimate new parameters:
$$N_k = \sum_{n=1}^N r_{nk}, \quad
\pi_k = \frac{N_k}{N}, \quad
\b{\mu}_k = \frac{1}{N_k}\sum_{n=1}^N r_{nk}\b{x}_n, \quad
\Sigma_k = \frac{1}{N_k}\sum_{n=1}^N r_{nk}(\b{x}_n-\b{\mu}_k)(\b{x}_n-\b{\mu}_k)^T.$$
X = np.loadtxt('data/faithful.txt')
N, D = X.shape
# Initial means, covariance, responsibility
K = 2 # number of clusters
Pi = np.ones(K)/K # cluster probability
Mu = X[np.random.randint(0, N, K),:] # pick K points randomly
Sig = np.repeat(np.cov(X.T).reshape(1,D,D), K, axis=0) # covariance for entire data
plt.scatter(X[:,0], X[:,1], marker='.')
plt.scatter(Mu[:,0], Mu[:,1], c=range(K), marker='+', s=100)
plt.show()
R = np.zeros((N,K)) # responsibility matrix
pr = np.zeros(K) # data probability for each cluster
Lambda = np.zeros((K,D,D)) # inverse covariance
detSig = np.zeros(K) # sqrt(det(Sig))
for t in range(15):
# Expectation step
for k in range(K):
Lambda[k] = np.linalg.inv(Sig[k]) # inverse covariance
detSig[k] = np.sqrt(np.linalg.det(Sig[k]))
for n in range(N):
# check the distances
for k in range(K):
dx = X[n,:] - Mu[k,:] # deviation from mean
pr[k] = Pi[k]*np.exp(-dx@Lambda[k]@dx.T/2)/detSig[k]
# responsibility
R[n,:] = pr/np.sum(pr) # responsibility p(z)
# show assignment
plt.scatter(X[:,0], X[:,1], c=np.dot(R,np.arange(K)), marker='.')
# Maximization step
num = np.sum(R, axis=0); # effective numbers for each class
Pi = num/N # class prior
for k in range(K):
Mu[k,:] = np.sum(R[:,k]*X.T, axis=1)/num[k]
dX = X - Mu[k,:]
Sig[k] = R[:,k]/num[k]*dX.T@dX # cluster covariance
# plot the new means
plt.scatter(Mu[:,0], Mu[:,1], c=range(K), marker='+', s=100)
plt.show()
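The EM algorithm never decreases the data log-likelihood, so it is useful to monitor it. A minimal sketch using the fitted Pi, Mu and Sig from above (this check is an addition, not part of the original code):
# log-likelihood: sum_n log sum_k pi_k N(x_n | mu_k, Sig_k)
ll = 0
for n in range(N):
    p = 0
    for k in range(K):
        dx = X[n,:] - Mu[k,:]
        p += Pi[k]*np.exp(-dx@np.linalg.inv(Sig[k])@dx/2)/((2*np.pi)**(D/2)*np.sqrt(np.linalg.det(Sig[k])))
    ll += np.log(p)
print("log-likelihood =", ll)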
Dirichlet Process Mixture#
One problem with \(K\)-means and mixtures of Gaussians is that the number of clusters \(K\) has to be assumed beforehand.
There is an extension called Dirichlet process mixture that decides the number of clusters based on the data.
See Kevin Murphy (2012) Ch. 25.2 for details.
scikit-learn#
scikit-learn is a popular Python library of supervised and unsupervised learning algorithms, with excellent documentation, examples, and tutorials.
from sklearn import mixture
mixture.BayesianGaussianMixture?
# Create a Dirichlet process mixture of Gaussians
dpm = mixture.BayesianGaussianMixture(n_components=5, max_iter=100)
# Fit to data
dpm.fit(X)
Mu = dpm.means_
Pi = dpm.weights_
#K = dpm.n_components
print("pi =", dpm.weights_)
# class label
Y = dpm.predict(X)
plt.scatter(X[:,0], X[:,1], c=Y)
#plt.scatter(Mu[:,0], Mu[:,1], c=range(K), marker='+',s=300);
pi = [0.00398612 0.00396987 0.02808266 0.34934285 0.6146185 ]
Principal Component Analysis#
Grasping the distribution of high-dimensional data is not easy for human eyes. We often try to find a low-dimensional projection of the data that characterizes the distribution.
Principal component analysis (PCA) finds the directions of the data distribution with the largest variance.
Consider a projection of a \(D\)-dimensional vector \(\b{x} = (x_1,...,x_D)^T\) to an \(M\)-dimensional vector \(\b{y} = (y_1,...,y_M)^T\) by
$$\b{y} = V\b{x}$$
where \(V = (\b{v}_1,...,\b{v}_M)^T\) and \(||\b{v}_m||=1\).
For a data set \(X=(\b{x}_1,...,\b{x}_N)^T\) with zero mean (mean subtracted), we try to find the projection \(V\) that makes the variance of \(\b{y}\) the largest.
Using the data covariance
$$C = \frac{1}{N}\sum_{n=1}^N \b{x}_n\b{x}_n^T = \frac{1}{N}X^T X,$$
the covariance of the projection \(\b{y}\) is given as \(V C V^T\).
After solving the eigenvalue problem \(C\b{v}=\lambda \b{v}\), the variance of the projected data is maximized by \(V=(\b{v}_1,...,\b{v}_M)^T\) made of the eigenvectors with the largest eigenvalues \(\lambda_1,...,\lambda_M\).
# sklearn.datasets offers a wide range of public datasets
from sklearn import datasets
# the iris dataset
iris = datasets.load_iris()
X = iris.data # flower features
T = iris.target # flower types
N, D = X.shape
print(N, D)
150 4
# plot the data in 3D
from mpl_toolkits.mplot3d import Axes3D # for 3D plotting
fig = plt.figure()
ax = fig.add_subplot(111,projection='3d')
ax.scatter(X[:,0], X[:,1], X[:,2], c=T);
# Data covariance
X = X - np.mean(X, axis=0) # subtract the mean
C = X.T@X/N
# eigenvalue L[i] and normal eigenvector V[:,i]
L, Vt = np.linalg.eigh( C) # for symmetric matrix
#L, V = linalg.eig( C)
# in matrix form: C*V=V*L, i.e. C=V*L*V' from V'*V=I
print(L) # eigenvalues
print(Vt) # columns are eigenvectors
[0.02367619 0.0776881 0.24105294 4.20005343]
[[ 0.31548719 0.58202985 0.65658877 -0.36138659]
[-0.3197231 -0.59791083 0.73016143 0.08452251]
[-0.47983899 -0.07623608 -0.17337266 -0.85667061]
[ 0.75365743 -0.54583143 -0.07548102 -0.3582892 ]]
ind = np.argsort(-L) # largest first
L = L[ind] # reorder
V = Vt.T[ind]
print(V)
plt.plot(L)
plt.xlabel("index m")
plt.ylabel("$\lambda_m$");
[[-0.36138659 0.08452251 -0.85667061 -0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]
[ 0.58202985 -0.59791083 -0.07623608 -0.54583143]
[ 0.31548719 -0.3197231 -0.47983899 0.75365743]]
# Projection of data to the PC space
Z = X@V.T
# First two PC
plt.scatter(Z[:,0], Z[:,1], c=T);
# Last two PC
plt.scatter(Z[:,2], Z[:,3], c=T);
Singular value decomposition (SVD)#
Singular value decomposition (SVD) represents an \(N\times D\) matrix \(X\) (\(N>D\)) by a weighted sum of products of column vectors \(\b{u}_i \in \Re^N\) and row vectors \(\b{v}_i^T \in \Re^D\):
$$X = \sum_{i=1}^D s_i \b{u}_i \b{v}_i^T$$
where \(\b{v}_i\) are the eigenvectors of \(X^T X\) and \(s_i= \sqrt{\lambda_i}\) are given by their corresponding eigenvalues (\(\lambda_1 \ge ... \ge\lambda_D\)).
In matrix form, SVD is represented as
$$X = U S V^T$$
where \(U=(\b{u}_1,...,\b{u}_D)\), \(S=\mathrm{diag}(s_1,...,s_D)\), and \(V=(\b{v}_1,...,\b{v}_D)\).
Because \(V\) is the same as that for PCA, the SVD algorithm is often used for PCA.
# X = U*S*V' where S is a diagonal matrix
# i.e. C = X'*X/N = V*S^2/N*V'
U, S, Vs = np.linalg.svd(X)
# See if they match those derived from covariance matrix
print(S**2/N)
print(L)
print(Vs)
print(V)
[4.20005343 0.24105294 0.0776881 0.02367619]
[4.20005343 0.24105294 0.0776881 0.02367619]
[[ 0.36138659 -0.08452251 0.85667061 0.3582892 ]
[-0.65658877 -0.73016143 0.17337266 0.07548102]
[ 0.58202985 -0.59791083 -0.07623608 -0.54583143]
[ 0.31548719 -0.3197231 -0.47983899 0.75365743]]
[[-0.36138659 0.08452251 -0.85667061 -0.3582892 ]
[ 0.65658877 0.73016143 -0.17337266 -0.07548102]
[ 0.58202985 -0.59791083 -0.07623608 -0.54583143]
[ 0.31548719 -0.3197231 -0.47983899 0.75365743]]
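The rank-1 expansion also gives a low-rank approximation of the data by keeping only the largest singular values. A minimal sketch (added here; the number of retained components is chosen just for illustration):
# low-rank reconstruction X ~ U[:,:M] diag(S[:M]) Vs[:M,:]
M = 2   # number of components kept (illustrative choice)
Xr = U[:,:M] @ np.diag(S[:M]) @ Vs[:M,:]
print("relative reconstruction error:", np.linalg.norm(X - Xr)/np.linalg.norm(X))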
Online PCA#
It has been shown that a simple linear neural network can perform computation similar to PCA (Sanger 1989).
Let us consider a two-layer linear network
$$\b{y} = W\b{x}$$
with input \(\b{x}=(x_1,...,x_D)^T\), output \(\b{y}=(y_1,...,y_M)^T\), and \(M\times D\) connection weights \(W\) (\(M<D\)).
Consider the generalized Hebbian algorithm (Sanger 1989). In component form, it is
$$\Delta w_{mi} = \alpha\Big(y_m x_i - y_m \sum_{j\le m} w_{ji} y_j\Big).$$
In matrix form, it is represented as
$$\Delta W = \alpha\big(\b{y}\b{x}^T - LT[\,\b{y}\b{y}^T]\,W\big)$$
where \(LT[\ ]\) takes the lower triangle (including the diagonal) of a matrix.
It has been shown that the rows of the matrix \(W\) converge to the \(M\) eigenvectors with the largest eigenvalues of the covariance matrix \(E[\b{x}\b{x}^T]\) (Sanger 1989).
def gha(X, W, alpha=0.01):
"""Generalized Hebbian Alogrithm by Sanger (1989)"""
N, D = X.shape
for n in range(N):
y = W@X[n,:]
W += alpha*(np.outer(y, X[n,:]) - np.tril(np.outer(y,y))@W)
return W
# Iris example
M = 2
W = np.random.randn(M*D).reshape(M,D)
for k in range(10):
W = gha(X, W, alpha=0.01)
print(W)
[[ 0.40847726 0.02543892 0.82847181 0.38527263]
[ 0.22837299 0.7854519 -0.3581771 -0.23751581]]
[[ 0.39275802 -0.01229017 0.83327471 0.39108534]
[ 0.24045046 0.57495399 -0.22335685 -0.13531064]]
[[ 0.39272004 -0.01236537 0.83328719 0.39109547]
[ 0.29265477 0.56311943 -0.19332948 -0.10578299]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.35545637 0.59137574 -0.18390082 -0.09073656]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.41942491 0.63013499 -0.18161263 -0.08127495]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.47732567 0.66631545 -0.18156184 -0.07460747]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.52448065 0.69375354 -0.18163711 -0.06965417]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.55967733 0.71133026 -0.18119089 -0.06596344]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.58436928 0.72086435 -0.18029133 -0.06329361]]
[[ 0.39271996 -0.01236553 0.83328722 0.39109549]
[ 0.60108009 0.72500194 -0.17920695 -0.06145073]]
W/np.linalg.norm(W, axis=1, keepdims=True)
array([[ 0.39238556, -0.012355 , 0.83257769, 0.39076248],
[ 0.62571229, 0.75471245, -0.18655084, -0.06396898]])
# Projection of data to the PC space
Z = X@W.T
# First two PC
plt.scatter(Z[:,0], Z[:,1], c=T);
Infomax principle#
The infomax principle is to find a mapping from the input \(\b{x}\) to the output \(\b{y}\) so that the mutual information
$$I(\b{y};\b{x}) = H(\b{y}) - H(\b{y}|\b{x})$$
is maximized.
PCA is an example of infomax by a linear transformation
$$\b{y} = V\b{x}$$
while keeping the norms of the row vectors of the matrix \(V\) equal to one.
For a deterministic mapping or a mapping with homogeneous additive noise, \(H(\b{y}|\b{x})\) is constant, so that maximization of the information transfer \(I(\b{y};\b{x})\) is equivalent to maximization of the output entropy \(H(\b{y})\).
Note that \(H(\b{y})\le\sum_i H(y_i)\) and the equality is achieved when the output components are independent: \(p(\b{y}) = \prod_i p(y_i).\)
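To see the link between infomax and PCA concretely: for a one-dimensional linear projection of Gaussian data, \(H(y)=\frac{1}{2}\log(2\pi e\,\b{v}^TC\b{v})\), so under the unit-norm constraint the entropy is largest along the first principal component. A small numerical check (an addition, assuming the covariance C, eigenvalues L, and eigenvectors V from the iris example above are still defined):
# entropy of the 1D projection y = v^T x for Gaussian data with covariance C
for m in range(len(L)):
    v = V[m]                      # m-th eigenvector (unit norm)
    var = v @ C @ v               # projected variance
    H = 0.5*np.log(2*np.pi*np.e*var)
    print(f"PC{m+1}: variance = {var:.3f}, entropy = {H:.3f}")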
Independent component analysis (ICA)#
We can listen to a particular speaker's voice or the sound of an instrument within a mixture of sounds. This is called the cocktail party effect or blind source separation.
For \(D\) independent signal sources \(\b{s}=(s_1,...,s_D)^T\), we consider the mixture signal
$$x_i = \sum_{j=1}^D a_{ij} s_j,$$
or in vector form
$$\b{x} = A\b{s}.$$
Independent component analysis (ICA) tries to recover the original signals by
$$\b{y} = W\b{x}$$
where the components of \(\b{y}=(y_1,...,y_D)^T\) are independent, i.e.,
$$p(\b{y}) = \prod_{i=1}^D p(y_i).$$
When the amplitude of \(y_i\) is bounded, the output entropy \(H(\b{y})\le\sum_i H(y_i)\) is maximized when \(y_i\) are independent.
ICA by Infomax#
If the signal distribution \(p(s_i)\) is supergaussian, having a sharp peak and long tails, ICA is achieved by a network with a saturating output function, such as
$$\b{y} = \tanh(W\b{x}),$$
where \(\tanh(u)=\frac{e^u-e^{-u}}{e^u+e^{-u}}\) is a sigmoid function within \((-1,1)\) applied component-wise.
In this case, the derivative of the output entropy is given by (Bell & Sejnowski 1995)
$$\frac{\partial H(\b{y})}{\partial W} = (W^T)^{-1} - \b{y}\b{x}^T.$$
The entropy is maximized by the natural gradient method (Amari 1998),
$$\Delta W = \alpha\frac{\partial H(\b{y})}{\partial W}W^T W = \alpha(W - \b{y}\b{u}^T W),$$
where \(\b{u}=W\b{x}\), as implemented in the code below.
def ica(X, W=None, alpha=0.01, online=True):
"""ICA by max entropy with tanh()"""
N, D = X.shape
if W is None:
W = np.eye(D) # initial guess
if online:
for n in range(N):
u = W@X[n,:]
y = np.tanh(u)
W += alpha*(W - np.outer(y,u@W))
else: # batch update
U = X@W.T
Y = np.tanh(U)
W += alpha*(W - Y.T@U@W/N)
return W
# Make mixed signals of two decaying tones
D = 2 # channels
N = 2000 # signal length
# Source signal
S = np.zeros((N, D))
for i in range(D):
T = 200 # tone length
M = 10 # number of tones
t = np.arange(T)
s = np.exp(-t/50)*np.sin(0.5*(2+i)*t) # single tone
ts = np.random.randint(0, N-T, M) # start timing
for m in range(M):
S[ts[m]:ts[m]+T,i] += s # add shifted tone
plt.subplot(2, 1, 1)
plt.plot(S)
plt.ylabel("S");
# Mixed signal
A = np.random.random((D,D)) # random matrix
print(A)
X = S@A.T
plt.subplot(2, 1, 2)
plt.plot(X)
plt.ylabel("X");
[[0.34933416 0.94615931]
[0.66340224 0.09287904]]
# Apply ICA to the mixture
W = np.eye(D)
for k in range(10):
W = ica(X, W, alpha=0.001)
print(W@A)
U = X@W.T
Y = np.tanh(U)
[[ 1.29841536 4.48910639]
[ 3.69102665 -0.0641114 ]]
[[ 0.54155523 7.79805789]
[ 8.4933997 -0.4462112 ]]
[[-0.04142324 8.5485093 ]
[ 9.99179923 -0.06602168]]
[[-0.07492247 8.63599764]
[10.19204851 -0.04077074]]
[[-0.07637662 8.64519207]
[10.21482833 -0.03982083]]
[[-0.07646214 8.64614761]
[10.21736766 -0.03977788]]
[[-0.07646929 8.64624673]
[10.21765007 -0.03977498]]
[[-0.07647002 8.646257 ]
[10.21768147 -0.03977471]]
[[-0.0764701 8.64625807]
[10.21768496 -0.03977468]]
[[-0.07647011 8.64625818]
[10.21768535 -0.03977467]]
# plot the original and separated signals
plt.subplot(2, 1, 1)
plt.plot(S)
plt.ylabel("S")
plt.subplot(2, 1, 2)
plt.plot(U)
plt.ylabel("U");
# plot 2D signal distributions
plt.subplot(1, 4, 1)
plt.plot(S[:,0], S[:,1], '.')
plt.axis('square')
plt.title("S")
plt.subplot(1, 4, 2)
plt.plot(X[:,0], X[:,1], '.')
plt.axis('square')
plt.title("X")
plt.subplot(1, 4, 3)
plt.plot(U[:,0], U[:,1], '.')
plt.axis('square')
plt.title("U")
plt.subplot(1, 4, 4)
plt.plot(Y[:,0], Y[:,1], '.')
plt.axis('square')
plt.title("Y");
# see how the mixing and separation weights are aligned to mixed data
plt.plot(X[:,0], X[:,1], '.')
plt.axis('square')
plt.title("X")
# estimated mixing vectors
a = np.linalg.inv(W)
a = a/np.max(abs(a))
plt.plot([0, a[0,0]], [0, a[0,1]])
plt.plot([0, a[1,0]], [0, a[1,1]])
# separating vectors
w = W/np.max(W)
plt.plot([0,w[0,0]], [0,w[0,1]])
plt.plot([0,w[1,0]], [0,w[1,1]]);
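For comparison (an addition, not part of the original notebook), scikit-learn provides FastICA, a different ICA algorithm based on maximizing non-Gaussianity; applied to the same mixture X, it should recover the sources up to scaling and permutation.
from sklearn.decomposition import FastICA
# estimate the sources from the mixed signal X
fica = FastICA(n_components=D)
U2 = fica.fit_transform(X)    # columns are the estimated sources
plt.plot(U2)
plt.ylabel("U (FastICA)");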
Playing sound#
Do you want to hear the mixed and separated sounds?
You can use the IPython.display library to play sounds.
https://ipython.readthedocs.io/en/stable/api/generated/IPython.display.html
import IPython.display as ipd
# Source sounds
for i in range(D):
ipd.display(ipd.Audio(S[:,i], rate=4000)) # 4kHz sampling
# Mixed sounds
for i in range(D):
ipd.display(ipd.Audio(X[:,i], rate=4000))
# Separated sounds
for i in range(D):
ipd.display(ipd.Audio(U[:,i], rate=4000))
Receptive fields of visual cortex neurons#
Cat visual cortex (Hubel & Wiesel)
Self-organization (von der Malsburg)
by Infomax (Linsker)
by SOM (Obermayer et al.)
by PCA (Sanger et al.)
by ICA (Bell & Sejnowski)
Replicating receptive field tuning#
Sanger (1989) trained a linear feedforward network with patches from natural images as inputs. The principal components learned by generalized Hebbian learning had input weights similar to the receptive field properties of visual cortical neurons.
References#
Bishop CM (2006) Pattern Recognition and Machine Learning. Springer. https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/
Chapter 9: Mixture models and EM
Chapter 12: Continuous latent variables
Murphy K (2012) Machine learning: A probabilistic perspective. MIT press. https://probml.github.io/pml-book/
Chapter 25.2: Dirichlet process mixture models
Self Organization#
Hubel DH, Wiesel TN (1959). Receptive fields of single neurones in the cat’s striate cortex. Journal of Physiology, 148, 574-91. http://doi.org/10.1113/jphysiol.1959.sp006308
Hubel DH, Wiesel TN (1962). Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex. Journal of Physiology, 160, 106-154. http://doi.org/10.1113/jphysiol.1962.sp006837
von der Malsburg C (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85-100. https://doi.org/10.1007/BF00288907
Kohonen T (1982). Self-Organized Formation of Topologically Correct Feature Maps. Biol Cybern, 43, 59-69. https://doi.org/10.1007/BF00337288
Olshausen BA, Field DJ (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607-9. https://doi.org/10.1038/381607a0
Rao RP, Ballard DH (1999). Predictive coding in the visual cortex: a functional interpretation of some extra-classical receptive-field effects. Nat Neurosci, 2, 79-87. https://doi.org/10.1038/4580
PCA#
Sanger TD (1989). Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2, 459-473. http://doi.org/10.1016/0893-6080(89)90044-0
Turk M, Pentland A (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3, 71-86. http://doi.org/10.1162/jocn.1991.3.1.71
Infomax and ICA#
Linsker R (1986). From basic network principles to neural architecture. Proceedings of the National Academy of Sciences, USA, 83, 7508-7512, 8390-8394, 8779-8783. http://doi.org/10.1073/pnas.83.19.7508
Bell AJ, Sejnowski TJ (1995). An information-maximization approach to blind separation and blind deconvolution. Neural Comput, 7, 1129-1159. http://doi.org/10.1162/neco.1995.7.6.1129
Bell AJ, Sejnowski TJ (1997). The “independent components” of natural scenes are edge filters. Vision Research, 37, 3327-3338. http://doi.org/10.1016/s0042-6989(97)00121-1
Amari S (1998). Natural gradient works efficiently in learning. Neural Comput, 10, 251-276. http://doi.org/10.1162/089976698300017746