# Adaptive Algorithms for PCA - CNEL

## Review of Principal Component Analysis

PCA solves the matrix equation $RV = V\Lambda$, i.e., it diagonalizes the covariance matrix $R$ of the input signal:

$$V^T R V = \Lambda$$

The eigenvectors become eigenfilters, and they span the frequency spectrum of the input signal. This is true only if the dimensionality of the data is very high (it follows from the spectral decomposition property of eigenvectors).

The eigenvectors, being linearly independent, form an optimal basis for the vector space. This is the motivation for using eigenvectors in data compression (the Karhunen-Loève Transform, KLT).

## How to solve PCA using adaptive structures?

Although there are many numerical techniques (e.g., the SVD) for solving PCA, real-time applications call for iterative algorithms that solve PCA using one sample of data at a time.
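Before moving to adaptive solutions, the batch eigendecomposition view can be checked numerically in a few lines (an illustrative NumPy sketch; the synthetic data and its scaling are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic zero-mean data with distinct variances along each axis
X = rng.standard_normal((5000, 3)) * np.array([2.0, 1.0, 0.5])
R = np.cov(X, rowvar=False)        # covariance matrix of the input
lam, V = np.linalg.eigh(R)         # solves R V = V Lambda (R is symmetric)
# the eigenvectors diagonalize the covariance: V^T R V = Lambda
D = V.T @ R @ V
```

`np.linalg.eigh` exploits the symmetry of the covariance matrix and returns eigenvalues in ascending order; the diagonal of `D` matches `lam` up to numerical precision.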

MATLAB programmers simply use the `eig` function.

## Method I: Minimization of the mean square reconstruction error

The desired response is the input itself. With an $N$-dimensional input $x$ and $M < N$ outputs, the network computes

$$y = W^T x, \qquad \hat{x} = W y$$
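Method I can be sketched as stochastic gradient descent on the instantaneous reconstruction error $\|x - W W^T x\|^2$ (an illustrative NumPy sketch; the data, dimensions, and step size are assumptions, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)
# 4-D data whose principal subspace is spanned by the first two axes
X = rng.standard_normal((5000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
N, M = 4, 2                      # N inputs, M < N compressed outputs
W = rng.standard_normal((N, M)) * 0.1
eta = 0.005                      # step size (illustrative choice)
for x in X:
    y = W.T @ x                  # compress: y = W^T x
    e = x - W @ y                # reconstruction error e = x - W y
    # stochastic gradient of J = ||x - W W^T x||^2 with respect to W
    W += eta * (np.outer(e, y) + np.outer(x, e @ W))
# W converges to a rotation of the leading eigenvectors: it spans the
# principal subspace, not necessarily the individual eigenvectors
Q, _ = np.linalg.qr(W)           # orthonormal basis for span(W)
P = Q @ Q.T                      # projector onto the learned subspace
```

Because the minimum is invariant to right-multiplication by an orthogonal matrix, only the subspace (the projector `P`) is identifiable, which matches the convergence result stated below.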

The cost function is

$$J = E\left[\|x - \hat{x}\|^2\right]$$

Minimize this cost with a gradient method under the constraint $W^T W = I$. The weight matrix will converge to a rotation of the eigenvector matrix, $W = VT$, since $T = V^T W$ is a square orthogonal matrix.

## Method II: Hebbian and anti-Hebbian learning

If there are two units (neurons) A and B, and there is a connection (synapse) between them, adjust the strength of the connection (weight $w$) in proportion to the product of their activations. If $x$ denotes the input excitation and $y$ denotes the output of the neuron, then the synaptic weights $w$ are updated using

$$w(n+1) = w(n) + \eta\, y(n)\, x(n)$$

Note that the weight vector $w$ can easily blow up to infinity if there is no restriction on its norm.

*(Figure: a single linear neuron with inputs $x_1(n), x_2(n), \ldots, x_m(n)$ and output $y(n)$.)*

Let us define a cost function

$$J = \frac{W^T R W}{W^T W} = \frac{E\left[W^T X X^T W\right]}{W^T W} = \frac{E\left[y^2\right]}{W^T W}$$

This cost function is nothing but the variance of the output divided by the squared norm of the weight vector that produced it. We are interested in finding the weight vector that produces the maximum output variance under the constraint of unit norm for $W$. Setting the gradient to zero,

$$\frac{\partial J}{\partial W} = \frac{RW\,(W^T W) - W\,(W^T R W)}{(W^T W)^2} = 0$$

$$\Rightarrow\quad RW = \lambda W, \qquad \lambda = \frac{W^T R W}{W^T W}$$

So, if we maximize the output variance subject to the constraint of unit norm, then $W$ is nothing but the principal eigenvector, and the eigenvalue itself is the variance! In the case of the Hebbian rule, we just maximize the output variance without putting any restriction on the norm:

$$J = W^T R W$$
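The Rayleigh-quotient result is easy to verify numerically (an illustrative sketch; the random matrix is an assumption):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((3, 3))
R = A @ A.T                        # a symmetric positive semi-definite "R"
lam, V = np.linalg.eigh(R)         # eigenvalues in ascending order
v = V[:, -1]                       # principal eigenvector (unit norm)
# at the maximizer, J(w) = w^T R w / (w^T w) equals the largest eigenvalue
J_max = v @ R @ v
# and no randomly chosen direction exceeds it
J_rand = max(w @ R @ w / (w @ w) for w in rng.standard_normal((1000, 3)))
```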

Thus, the Hebbian rule will still give us the principal component, but it is an unstable algorithm.

## Anti-Hebbian learning

Hebbian learning effectively correlates input and output. A simple minus sign in the learning rule gives a decorrelator:

$$w(n+1) = w(n) - \eta\, y(n)\, x(n)$$

A simple learning rule that uses both the Hebbian and anti-Hebbian rules is the LMS!

$$
\begin{aligned}
w(n+1) &= w(n) + \eta\, e(n)\, x(n) \\
       &= w(n) + \eta\, \big(d(n) - y(n)\big)\, x(n) \\
       &= w(n) + \underbrace{\eta\, d(n)\, x(n)}_{\text{Hebbian}} - \underbrace{\eta\, y(n)\, x(n)}_{\text{anti-Hebbian}}
\end{aligned}
$$

## Stabilizing Hebb's rule: Oja's rule

The simplest way to stabilize Hebb's rule is to normalize the weights after every iteration.
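The divergence of the unnormalized Hebbian rule can be seen in a few lines (an illustrative sketch; the step size and data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)
w = np.array([0.1, 0.1])
eta = 0.01
norms = [np.linalg.norm(w)]
for _ in range(500):
    x = rng.standard_normal(2)
    y = w @ x
    w = w + eta * y * x          # pure Hebbian update, no norm constraint
    norms.append(np.linalg.norm(w))
# on average each step multiplies the norm along the principal direction
# by (1 + eta * lambda), so ||w|| grows geometrically and diverges
```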

$$\tilde{w}(n+1) = w(n) + \eta\, x(n)\, y(n), \qquad w(n+1) = \frac{\tilde{w}(n+1)}{\|\tilde{w}(n+1)\|}$$

Assuming $\|w(n)\| = 1$,

$$\|\tilde{w}(n+1)\|^2 = \tilde{w}^T(n+1)\,\tilde{w}(n+1) = 1 + 2\eta\, y^2(n) + O(\eta^2)$$

$$w(n+1) = \tilde{w}(n+1)\left[1 + 2\eta\, y^2(n) + O(\eta^2)\right]^{-1/2} = \left[w(n) + \eta\, x(n)\, y(n)\right]\left[1 - \eta\, y^2(n) + O(\eta^2)\right]$$

Expanding and keeping terms up to first order in $\eta$,

$$w(n+1) = w(n) + \eta\, x(n)\, y(n) - \eta\, w(n)\, y^2(n) - \eta^2\, y^3(n)\, x(n)$$

$$w(n+1) \approx w(n) + \eta\, y(n)\left[x(n) - y(n)\, w(n)\right]$$

So, Oja's rule will give us the first eigenvector for a small enough step-size. This was the first learning algorithm for PCA.

## How do we compute the other eigenvectors?

A deflation procedure is adopted to compute the other eigenvectors. Deflation is similar to the Gram-Schmidt procedure!

- **Step I:** Find the first principal component $v_1$ using Oja's rule.
- **Step II:** Compute the projection of the first eigenvector on the input, $y = v_1^T x$.
- **Step III:** Generate the modified input $\tilde{x} = x - v_1 y = x - v_1 v_1^T x$, whose covariance is $R_{\tilde{x}} = (I - v_1 v_1^T)\, R_x$.
- **Step IV:** Repeat Oja's rule on the modified data.

These steps can be repeated to generate all the eigenvectors.

## Putting it all together: Oja's rule + deflation = Sanger's rule

Sanger extended Oja's rule to extract multiple eigenvectors using the deflation procedure.
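Oja's rule and the deflation steps can be sketched together (illustrative NumPy code; the helper name `oja`, the synthetic data, and the step size are assumptions):

```python
import numpy as np

def oja(X, eta=0.005, seed=0):
    """Run Oja's rule over the rows of X; return the learned weight vector."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(X.shape[1])
    w /= np.linalg.norm(w)
    for x in X:
        y = w @ x
        w = w + eta * y * (x - y * w)    # Oja's rule
    return w / np.linalg.norm(w)

rng = np.random.default_rng(5)
# data whose leading eigenvectors are the first two coordinate axes
X = rng.standard_normal((5000, 3)) * np.array([3.0, 2.0, 0.5])
v1 = oja(X)                        # Step I: first principal component
Xd = X - np.outer(X @ v1, v1)      # Steps II-III: x_tilde = x - v1 (v1^T x)
v2 = oja(Xd)                       # Step IV: Oja's rule on the deflated data
```

Since every deflated sample is orthogonal to `v1`, Oja's rule on `Xd` converges to the second eigenvector (up to sign), and `v1` and `v2` come out nearly orthogonal.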

- The rule is simple to implement.
- The algorithm is on-line.

*(Figure: a single-layer linear network with inputs $x_1(n), x_2(n), \ldots, x_M(n)$, weights $w_{j1}(n), w_{j2}(n), \ldots, w_{jM}(n)$, and outputs $y_j(n)$.)*

$w_{ji}$ represents the scalar weight between input node $i$ and output node $j$. The network has $M$ inputs and $L$ outputs:

$$y_j(n) = \sum_{i=1}^{M} w_{ji}(n)\, x_i(n), \qquad j = 1, 2, \ldots, L$$

$$\Delta w_{ji}(n) = \eta\, y_j(n)\left[x_i(n) - \sum_{k=1}^{j} w_{ki}(n)\, y_k(n)\right], \qquad i = 1, \ldots, M, \quad j = 1, \ldots, L$$
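In matrix form, the deflating sum for output $j$ over outputs $k \le j$ is the lower-triangular part of $y y^T$, which gives a compact implementation of Sanger's rule (an illustrative NumPy sketch; the data, dimensions, and step size are assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
# 4-D data whose leading eigenvectors are the first two coordinate axes
X = rng.standard_normal((8000, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
M, L = 4, 2                            # M inputs, L outputs
W = rng.standard_normal((L, M)) * 0.1  # row j holds the weights of output j
eta = 0.002
for x in X:
    y = W @ x                          # y_j = sum_i w_ji x_i
    # Sanger's rule: output j is deflated by outputs k = 1..j, i.e. the
    # lower-triangular (including diagonal) masking of y y^T
    W += eta * (np.outer(y, x) - np.tril(np.outer(y, y)) @ W)
# row 0 converges to +/- the first eigenvector, row 1 to +/- the second,
# each with norm close to 1, as in Oja's rule
```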