Information Theory School (MLTrain)
Alex Dimakis, Murat Kocaoglu

Outline:
- Introduction to deep learning and TensorFlow (TF intro Jupyter notebook, MNIST Jupyter notebook)
- Intro to Deep Generative Models (GANs and VAEs)
- Deep learning for Inverse Problems and Compressive Sensing

Types of neural nets: Discriminators. A discriminator takes an input (say, an image) and outputs a probability for each label, e.g. Pr(cat) = 0.7, Pr(grumpy) = 0.2, Pr(dog) = 0.02, Pr(banana) = 0.01.

Types of neural nets: Generators. A generator takes random noise z as input and produces a sample G(z), e.g. an image.

MLP architectures: warmup. A single linear neuron computes f(x) = w(1)*x(1) + w(2)*x(2). The target we will try to learn is the XOR function:

x1 x2 Y
0  0  0
0  1  1
1  0  1
1  1  0

Depth-1 model (warmup). Let's start with a depth-1 model: f(x) = w^T x + b, with w in R^2 and b in R. The unknown parameters are theta = {w1, w2, b}.

Inference exercise: for weights theta = {w1=1, w2=1, b=2}, run this model and make a prediction when the input features are x = [x1=2, x2=3]. f(x) = ?
Answer: f(x) = w1*x1 + w2*x2 + b = 1*2 + 1*3 + 2 = 7. Notice that for general weights the output on this input is f(x) = w1*2 + w2*3 + b, so if the true value is f*(x) = 1, the squared loss is a function of the weights.

Model with 1 hidden layer. The hidden layer computes h = W^T x + c and the output layer computes f(x) = w^T h + b, where W (entries W11, W12, W21, W22) and c = [c1, c2]^T parameterize the hidden layer, and w = [w1, w2]^T and b parameterize the output. As a pipeline: x -> (W, c) -> h -> (w, b) -> f(x).

MLP architectures: warmup (vector notation). The single linear neuron f(x) = w(1)*x(1) + w(2)*x(2) can be written compactly as f(x) = w^T x, with x = [x(1) x(2)]^T and w = [w(1) w(2)]^T.
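The depth-1 inference exercise above is easy to check in code; a minimal NumPy sketch (illustrative only, not course material):

```python
import numpy as np

# depth-1 (linear) model: f(x) = w^T x + b
def f(x, w, b):
    return float(np.dot(w, x) + b)

w = np.array([1.0, 1.0])   # w1 = 1, w2 = 1
b = 2.0
x = np.array([2.0, 3.0])   # x1 = 2, x2 = 3
print(f(x, w, b))          # 1*2 + 1*3 + 2 = 7.0
```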
MLP architectures: warmup. A single nonlinear neuron computes f(x) = g( w(1)*x(1) + w(2)*x(2) ), i.e. it applies an activation function g to the linear combination; the target is still the XOR table above. (For g(z) = 1/(1+exp(-z)) this is also known as logistic regression.)
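The single nonlinear neuron can be sketched the same way; here with the logistic activation just mentioned (a throwaway illustration):

```python
import math

def sigmoid(z):
    # logistic activation g(z) = 1 / (1 + exp(-z))
    return 1.0 / (1.0 + math.exp(-z))

def neuron(x, w):
    # f(x) = g( w(1)*x(1) + w(2)*x(2) )
    return sigmoid(w[0] * x[0] + w[1] * x[1])

print(neuron([0.0, 0.0], [1.0, 1.0]))  # sigmoid(0) = 0.5
```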
MLP architectures. Inputs x(1), ..., x(4) feed a hidden layer of units h0(1), h0(2), h0(3) (with weights w1(1), ..., w1(4) into the first hidden unit, and similarly for the others), and the hidden layer feeds the output y.

MLP architectures: add activation functions. Each hidden unit applies a nonlinearity g(z) to its weighted input, e.g. g(z) = max(0, z), the Rectified Linear Unit (ReLU).
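A forward pass through such a one-hidden-layer MLP with ReLU activations might look as follows (the sizes and random weights are my own illustrative choices):

```python
import numpy as np

def relu(z):
    # g(z) = max(0, z), applied elementwise
    return np.maximum(0.0, z)

def mlp_forward(x, W, c, w, b):
    # hidden layer h = g(W^T x + c), output y = w^T h + b
    h = relu(W.T @ x + c)
    return float(w @ h + b)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))          # 4 inputs -> 3 hidden units
c = np.zeros(3)
w = rng.normal(size=3)
b = 0.0
x = np.array([1.0, 0.0, -1.0, 2.0])
print(mlp_forward(x, W, c, w, b))
```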
Example: learning using gradients. The target is f*(x1, x2) = x1 XOR x2:

x1 x2 f*(x1,x2)
0  0  0
0  1  1
1  0  1
1  1  0

Let's start with a depth-1 model: f(x) = w^T x + b, w in R^2, b in R. The unknown parameters are theta = {w1, w2, b}, to be learned from data. What are d (the number of parameters) and n (the number of data points) here? What is a sensible loss function? A natural choice is the average squared loss over the dataset, Loss = J(theta). (See the example in Ch. 6 of the Goodfellow et al. book.)

Exercise: compute 1. the predictions f(x) and 2. the loss J(theta), for theta = {w1=0, w2=1, b=1}, using the dataset shown in the table.
Answer: the predictions on the four inputs are f(x) = [1, 2, 1, 2], and the loss is J(theta) = (1/4) * ( (0-1)^2 + (1-2)^2 + (1-1)^2 + (0-2)^2 ) = (1/4) * (1 + 1 + 0 + 4) = 6/4.
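One way to check this answer numerically (a quick NumPy snippet, not from the slides):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)   # XOR targets

w = np.array([0.0, 1.0])                  # w1 = 0, w2 = 1
b = 1.0

preds = X @ w + b                         # [1, 2, 1, 2]
loss = np.mean((y - preds) ** 2)          # (1/4)(1 + 1 + 0 + 4) = 1.5
print(preds, loss)
```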
How to train. Same XOR dataset, same depth-1 model f(x) = w^T x + b with theta = {w1, w2, b}.

Exercise: 1. Write the loss function for this dataset. 2. Compute the gradient of the loss function. 3. For theta = {w1=0, w2=1, b=1}, do a gradient step using step size eta = 0.5. Is the point theta = (0, 1, 1) a local minimum?

1. The loss function is
J(theta) = (1/4) * [ (0 - ((0,0).(w1,w2) + b))^2 + (1 - ((0,1).(w1,w2) + b))^2 + (1 - ((1,0).(w1,w2) + b))^2 + (0 - ((1,1).(w1,w2) + b))^2 ]
         = (1/4) * [ (0 - b)^2 + (1 - w2 - b)^2 + (1 - w1 - b)^2 + (0 - w1 - w2 - b)^2 ].
Now compute the gradient of the loss and move the model!
Here d = 3 and n = 4, and
J(theta) = (1/4) * [ (0 - b)^2 + (1 - w2 - b)^2 + (1 - w1 - b)^2 + (0 - w1 - w2 - b)^2 ].
The gradient is a vector: the first coordinate is dJ/dw1, the second is dJ/dw2, the third is dJ/db. Differentiating term by term:
dJ/dw1 = (2/4) * (2*w1 + w2 + 2*b - 1)
dJ/dw2 = (2/4) * (w1 + 2*w2 + 2*b - 1)
dJ/db  = (2/4) * (2*w1 + 2*w2 + 4*b - 2)
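Plugging theta = (0, 1, 1) into these formulas answers part 3 of the exercise: the gradient there is nonzero, so the point is not a stationary point and hence not a local minimum. A quick check:

```python
import numpy as np

def grad_J(w1, w2, b):
    # gradient of J(theta) = (1/4)[(0-b)^2 + (1-w2-b)^2 + (1-w1-b)^2 + (0-w1-w2-b)^2]
    dw1 = 0.5 * (2 * w1 + w2 + 2 * b - 1)
    dw2 = 0.5 * (w1 + 2 * w2 + 2 * b - 1)
    db = 0.5 * (2 * w1 + 2 * w2 + 4 * b - 2)
    return np.array([dw1, dw2, db])

theta = np.array([0.0, 1.0, 1.0])   # (w1, w2, b)
g = grad_J(*theta)                  # [1.0, 1.5, 2.0], nonzero
theta_new = theta - 0.5 * g         # one gradient step with eta = 0.5
print(g, theta_new)                 # step lands at (-0.5, 0.25, 0.0)
```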
The problem with linear models. For the XOR dataset and the model f(x) = w^T x + b with theta = {w1, w2, b}: you can train this in TF or by hand (set the gradient to zero) and find the optimal parameters, w* = [0, 0] and b* = 1/2. Now run this model for some input, e.g. x = [1, 0]: the prediction is 1/2 no matter what the input is. We can prove that linear networks of any depth will not learn the XOR, but nonlinear networks with one hidden layer will.
1 hidden layer, linear: h = W^T x + c, y = w^T h + b. Why can this not learn the XOR? Substituting, y = w^T (W^T x + c) + b = (W w)^T x + (w^T c + b), i.e. a linear model with effective weights W w and effective bias w^T c + b. A deep linear model with any number of layers is just a linear model end to end, so on XOR it can only produce the constant prediction. (See the example in Ch. 6 of the Goodfellow et al. book.)

1 hidden layer with ReLU: h = g(W^T x + c), y = w^T h + b, with g(z) = max(z, 0) the ReLU link function. Try W = [1, 1; 1, 1], c = [0, -1]^T, w = [1, -2]^T and b = 0. Then h(0,0) = g([0, -1]^T) = [0, 0]^T, so y = 0; and h(0,1) = g([1, 0]^T) = [1, 0]^T, so f(0,1) = 1. Checking the remaining inputs the same way, it has learned all 4 inputs correctly!

The message here is that a network with one hidden layer and linear activations cannot learn the XOR function; in fact a linear network with ANY number of hidden layers cannot (any number of linear layers still gives a linear function). But a SINGLE hidden layer with ReLU activations can learn the XOR function, as this example shows. Furthermore, networks with only one hidden layer and ReLU activations can learn ANY function (but they may be exponentially wide).
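The ReLU construction above is easy to verify on all four inputs, using exactly the slide's weights:

```python
import numpy as np

# one hidden layer with ReLU: h = g(W^T x + c), y = w^T h + b
W = np.array([[1.0, 1.0],
              [1.0, 1.0]])
c = np.array([0.0, -1.0])
w = np.array([1.0, -2.0])
b = 0.0

def net(x):
    h = np.maximum(0.0, W.T @ x + c)   # ReLU activation
    return float(w @ h + b)

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, net(np.array(x, dtype=float)))
# (0,0)->0, (0,1)->1, (1,0)->1, (1,1)->0: exactly XOR
```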
Convolutional Neural Networks: how to re-use weights to get more efficient models.

CNNs: the idea of convolutional filters. Instead of fully connecting the input to the hidden layer, slide one small filter (w1, w2) across the input: h0(1) sees (x(1), x(2)), h0(2) sees (x(2), x(3)), h0(3) sees (x(3), x(4)), all with the same weights w1, w2. This lets you reuse weights, avoid full connectivity, and reduce the number of parameters you have to train, and it provides translational invariance. This makes sense for images.
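The weight-sharing idea in one dimension (a minimal "valid" convolution matching the wiring above; the filter values are arbitrary):

```python
import numpy as np

def conv1d_valid(x, w):
    # slide the SAME filter w across x: the weights are shared across positions
    k = len(w)
    return np.array([np.dot(w, x[i:i + k]) for i in range(len(x) - k + 1)])

x = np.array([1.0, 2.0, 3.0, 4.0])   # inputs x(1)..x(4)
w = np.array([1.0, -1.0])            # one shared filter (w1, w2)
h = conv1d_valid(x, w)               # hidden units h0(1), h0(2), h0(3)
print(h)                             # [-1. -1. -1.]
```

A fully connected layer here would need 4*3 = 12 weights; the convolutional layer gets by with 2.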
DCGAN architecture: based on https://github.com/carpedm20/DCGAN-tensorflow

How deconvolution layers work: a deconvolution layer (aka transposed convolution) runs a convolution in reverse, and is used to upsample in DCGAN-style generators. From https://github.com/vdumoulin/conv_arithmetic (see for more fun animations).

Why do we need hidden layers and nonlinearities?
Why hidden layers? In machine learning you are given a vector x of d features; you often expand it into, say, d^2 features and get better performance. This can be done in three ways: 1. problem-specific feature engineering; 2. embedding into a very general space and using kernels; 3. learning the features from the data. You can do the third if the model has end-to-end differentiation.
Part 2: Unsupervised Learning. Deep Generative Models (GANs and VAEs).
Using deep generative models to solve inverse problems (Compressed Sensing).

Learning a generative model. We want to learn a probability distribution over images, and to be able to sample a random face image. This is an unsupervised problem: no labels, just data. The standard way: maximize the likelihood of the data over all generative models. An alternative way: adversarial training.

Understanding standard training. Let z(1) = Uniform{1,2,3,4,5,6} and z(2) = Uniform{1,2,3,4,5,6} (independent), i.e. two fair dice, and let the generative model be g(z) = w1*z(1) + w2*z(2).

Say the data is D = [12]. What is the chance to see D under weights w1 = 1, w2 = 1? Pr(D | [1,1]) = 1/36, since only z = (6,6) sums to 12. What if the data is D = [12, 15]? Then Pr(D | [1,1]) = 1/36 * 0 = 0, since 15 is impossible under these weights. If instead w1 = 2 and w2 = 1, what is Pr(12 | [2,1])?
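The dice likelihoods can be computed by brute-force enumeration; the last line answers the slide's question (a small Python check):

```python
from itertools import product
from fractions import Fraction

def likelihood(value, w1, w2):
    # probability that w1*z(1) + w2*z(2) equals `value`, for two fair dice
    hits = sum(1 for z1, z2 in product(range(1, 7), repeat=2)
               if w1 * z1 + w2 * z2 == value)
    return Fraction(hits, 36)

print(likelihood(12, 1, 1))  # 1/36: only (6, 6)
print(likelihood(15, 1, 1))  # 0: impossible under w = [1, 1]
print(likelihood(12, 2, 1))  # 1/12: (3,6), (4,4), (5,2)
```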
Understanding standard training. Standard unsupervised training: maximize the likelihood of the data, i.e. choose the parameters w that maximize Pr[D | w]. (A non-convex, tricky problem to solve.)

Adversarial training. "There are many interesting recent developments in deep learning, probably too many for me to describe them all here. The most important one, in my opinion, is adversarial training (also called GAN for Generative Adversarial Networks). This, and the variations that are now being proposed, is the most interesting idea in the last 10 years in ML, in my opinion." -- Yann LeCun

Adversarial training. Start from a simple distribution z ~ Uniform in R^100. Transform z through an MLP to get G(z), an artificial image. Create a second network D(x) that takes an image input and produces a scalar: D(x) is trying to become the probability that the image came from the data rather than from G. Train D to maximize the probability of the correct label.
Train D on both real images and samples from G. Train G to minimize log(1 - D(G(z))), or, better-behaved in practice, to maximize log D(G(z)).

After adversarial training, this is a machine that can dream up new faces: put in a random z and run G(z).
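A small numerical aside (my own illustration, not from the slides) on why the max log D(G(z)) form is preferred in practice: write p = D(G(z)); when D confidently rejects a generated sample (p near 0), the gradient of log(1 - p) with respect to p stays bounded, while the gradient of log p blows up like 1/p, giving G a much stronger signal early in training.

```python
# p = D(G(z)): the discriminator's output on a generated sample.
#   "saturating" loss:     minimize log(1 - p) -> |d/dp| = 1 / (1 - p)
#   "non-saturating" loss: maximize log(p)     -> |d/dp| = 1 / p
for p in [0.01, 0.5, 0.99]:
    print(p, 1.0 / (1.0 - p), 1.0 / p)
# at p = 0.01 the saturating gradient magnitude is about 1,
# while the non-saturating one is about 100
```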
Current state of the art: BEGANs. Here the discriminator is an autoencoder that encodes real images well and generated images poorly.

You can travel in z space too: take z1 = [1, 0, 0, ...] and z2 = [1, 2, 3, ...] in R^100 and map them (and the points between them) through G into image space R^13000. BEGANs produce amazing images.

Ok, modern deep generative models produce amazing pictures. But what can we do with them?

1. Improve compressed sensing: Compressed Sensing using Generative Models, Ashish Bora, Ajil Jalal, Eric Price, Alexandros G. Dimakis, ICML 2017. Code and demo: https://github.com/AshishBora/csgm
2. Using GANs to defend from adversarial examples: The Robust Manifold Defense: Adversarial Training using Generative Models, A. Ilyas, A. Jalal, E. Asteri, C. Daskalakis, A. G. Dimakis. https://arxiv.org/abs/1712.09196
3. Using GANs to sample from counterfactual distributions (causality through GANs): M. Kocaoglu, C. Snyder, A. G. Dimakis and S. Vishwanath, CausalGAN: Learning Causal Implicit Generative Models with Adversarial Training, ICLR 2018. https://arxiv.org/abs/1709.02023 Code and demo: https://github.com/mkocaoglu/CausalGAN

Compressed sensing. You observe y = A x*, with x* in R^n, y in R^m, and n > m, i.e. m (noisy) linear observations of an unknown vector x* in R^n. Goal: recover x* from y. This is ill-posed: there are many possible x* that explain the measurements, since we have m linear equations in n unknowns. This is high-dimensional statistics: the number of parameters n exceeds the number of samples m, so we must make some assumption, namely that x* is "natural" in some sense.

Standard assumption: x* is k-sparse, |x*|_0 = k. The noiseless compressed-sensing optimal recovery problem is: minimize |x|_0 subject to Ax = y. This is NP-hard, so relax it to basis pursuit: minimize |x|_1 subject to Ax = y. Under what conditions is the relaxation tight?

Question: for which measurement matrices A does the basis-pursuit solution x1 equal x*? [Donoho; Candes and Tao; Candes, Romberg and Tao] If A satisfies an RIP/REC/NSP condition, then x* = x1. Also: if A is created with random iid N(0, 1/m) entries and m = O(k log(n/k)), then with high probability it satisfies RIP/REC. So a random measurement matrix A with enough measurements suffices for the LP relaxation to produce the exact unknown sparse vector x*.
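As a concrete toy of sparse recovery, here is a minimal iterative soft-thresholding (ISTA) solver for the Lasso form of this problem, in plain NumPy. The sizes, support, and regularization weight are arbitrary illustrative choices, not from the slides:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(A, y, lam=0.05, steps=500):
    # minimize (1/2)||Ax - y||^2 + lam * ||x||_1 by proximal gradient descent
    eta = 1.0 / np.linalg.norm(A, 2) ** 2      # step size 1/L, L = ||A||_2^2
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = soft_threshold(x - eta * A.T @ (A @ x - y), eta * lam)
    return x

rng = np.random.default_rng(0)
n, m, k = 100, 40, 3
A = rng.normal(size=(m, n)) / np.sqrt(m)       # iid N(0, 1/m) measurements
x_true = np.zeros(n)
x_true[[5, 17, 60]] = [1.0, -2.0, 1.5]         # k = 3 nonzeros
y = A @ x_true
x_hat = ista(A, y)
print(np.linalg.norm(x_hat - x_true))          # recovery error, should be small
```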
Sparsity in compressed sensing. Q1: when do you want to recover some unknown vector by observing linear measurements of its entries? For images, each measurement is a sum over values of pixels. Real images are not sparse (except the night-time sky), but they can be sparse in a known basis, i.e. x = D x* with x* sparse and D a DCT or wavelet basis.

1. Sparsity in a basis is a decent model for natural images. 2. But we now have much better, data-driven models for natural images: VAEs and GANs. 3. Idea: take sparsity out of compressed sensing and replace it with a GAN. 4. Ok, but how to do that?
Generative model. Assume x* is in the range of a good generative model G, i.e. x* = G(z*) for some z* in R^k. We observe y = A x* + noise. How do we recover x* = G(z*) given the noisy linear measurements? What happened to the sparsity k? Here k is the dimension of the latent code z*, playing the role that the sparsity level played before.

Ok, so we are replacing sparsity with a neural network. To recover before, we were using Lasso. What is the recovery algorithm now?

Recovery algorithm, step 1: inverting a GAN. Given a target image x1, how do we invert the GAN, i.e. find a z1 such that G(z1) is very close to x1?
Just define a loss J(z) = ||G(z) - x1|| and do gradient descent on z (with the network weights fixed). Related work: Creswell and Bharath (2016); Donahue, Krahenbuhl and Darrell (2016); Dumoulin et al., Adversarially Learned Inference; Lipton and Tripathi (2017).

Recovery algorithm, step 2: inpainting. Given a target image x1, we observe only some pixels. How do we invert the GAN now, i.e. find a z1 such that G(z1) is very close to x1 on the observed pixels? Just define a loss J(z) = ||A G(z) - A x1||, where A selects the observed pixels, and do gradient descent on z (network weights fixed).

Recovery algorithm, step 3: super-resolution. Given a target image x1, we observe blurred pixels. How do we invert the GAN, i.e. find a z1 such that G(z1) is very close to x1 after it has been blurred? Same recipe: define J(z) = ||A G(z) - A x1||, with A the blurring operator, and do gradient descent on z (network weights fixed).
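A toy version of this inversion loop, with a fixed random one-layer "generator" standing in for a trained GAN (entirely illustrative: the real G is a deep network, and nothing here is from the course code):

```python
import numpy as np

rng = np.random.default_rng(1)
Wg = rng.normal(size=(50, 5)) / np.sqrt(5)  # fixed "generator" weights, z in R^5 -> image in R^50

def G(z):
    return np.tanh(Wg @ z)                  # stand-in for a trained generator

z_true = rng.normal(size=5)
x1 = G(z_true)                              # target image

# invert the GAN: gradient descent on J(z) = ||G(z) - x1||^2, weights fixed
z = np.zeros(5)
for _ in range(5000):
    g_z = G(z)
    grad = 2 * Wg.T @ ((1 - g_z ** 2) * (g_z - x1))  # chain rule through tanh
    z -= 0.02 * grad

print(np.linalg.norm(G(z) - x1))            # reconstruction error after descent
```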
Recovery from linear measurements. Given y = A x*, our algorithm is: do gradient descent in z space to satisfy the measurements, obtaining useful gradients through the measurements via backprop.

Comparison to Lasso (figure): m = 500 random Gaussian measurements, n = 13k-dimensional vectors.

Related work: there is significant prior work on structure beyond sparsity. Model-based CS (Baraniuk et al., Cevher et al., Hegde et al., Gilbert et al., Duarte and Eldar). Projections on manifolds: Baraniuk and Wakin (2009), Random projections of smooth manifolds; Eftekhari and Wakin (2015). Deep network models: Mousavi, Dasarathy, Baraniuk; Chang, Li, Poczos, Kumar, and Sankaranarayanan, ICCV 2017.

On-going work: our algorithm works even for non-linear measurements.
Recovery from nonlinear measurements. This recovery method can be applied even for a non-linear, differentiable measurement box A, and even for a mixture of losses: approximate my face but also amplify a mustache-detector loss.

Using nonlinear measurements: take a target image x, pass it through A (say, a gender detector) to get y, and recover as before by gradient descent in z space.
Ok, this recovery algorithm works well for few measurements (e.g. 10x less than lasso). Can we prove something about the quality of recovery? A: Yes, under ML decoder and a few Gaussian iid measurements (ideally m=O(k)) Gaussian matrices have whp. the SREC property, if Number of needed measurements m = k logL (L: Lipschitz constant of the GAN) MLTrain Main results The first and second term are essentially necessary. The third term is the extra penalty for gradient descent sub-optimality. MLTrain Main results
Representation error noise The first and second term are essentially necessary. The third term is the extra penalty for gradient descent sub-optimality. MLTrain optimization error Proof technology Architecture of compressed sensing proofs for Lasso: Lemma 1: A random Gaussian measurement matrix has RIP/REC whp for m = k log n/k rows Lemma 2: Lasso works for matrices that have RIP/REC. Lasso recovers a xhat close to x*
MLTrain Proof technology For a generative model defining a subset of images S: Lemma 1: A random Gaussian measurement matrix has S-REC whp for m= O ( k logL ) measurements. Lemma 2: if A has S-REC, The optimum of the squared loss minimization recovers zhat close to z* MLTrain Proof technology Why is the Restricted Eigenvalue Condition (REC) needed? Lasso solves: If there is a sparse vector x in the nullspace of A then this fails. MLTrain Proof technology
REC: all approximately k-sparse vectors x are far from the nullspace:

  || Ax ||_2  >=  gamma || x ||_2   for all approximately k-sparse x

A vector x is approximately k-sparse if there exists a set S of k coordinates carrying most of its mass, i.e. || x_{S^c} ||_1 <= || x_S ||_1.

Proof technology

Unfortunate coincidence: the difference of two k-sparse vectors is 2k-sparse, but the difference of two natural images is not a natural image. The correct way to state REC (the one that generalizes to our S-REC) is therefore pairwise: for any two k-sparse vectors x1, x2, their difference is far from the nullspace:

  || A(x1 - x2) ||_2  >=  gamma || x1 - x2 ||_2
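The "unfortunate coincidence" for sparse vectors is easy to check numerically; a tiny illustrative demo (dimensions made up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 20, 3

def random_k_sparse():
    """A vector in R^n with exactly k random nonzero coordinates."""
    x = np.zeros(n)
    x[rng.choice(n, k, replace=False)] = rng.normal(size=k)
    return x

x1, x2 = random_k_sparse(), random_k_sparse()
diff = x1 - x2
print(np.count_nonzero(diff) <= 2 * k)   # True: the difference is 2k-sparse
```

So for Lasso it suffices to require REC over 2k-sparse vectors; for natural images there is no such closure under differences, which is what the pairwise S-REC formulation fixes.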
Proof technology

Our Set-Restricted Eigenvalue Condition (S-REC): for any set S, a matrix A satisfies S-REC(S, gamma, delta) if for all x1, x2 in S

  || A(x1 - x2) ||_2  >=  gamma || x1 - x2 ||_2  -  delta

that is, the difference of two natural images is far from the nullspace of A.

Lemma 1: If the set S is the range of a generative model, then m = O(k log L) measurements suffice to make a Gaussian iid matrix S-REC whp.
Lemma 2: If the matrix has S-REC, then the squared loss optimizer zhat must be close to z*.

Extensions and on-going work: connections to adversarial examples
Adversarial examples

[Figure: an input image before and after a small perturbation; the classifier's output moves from Pr(cat) = 0.01 to Pr(cat) = 0.998.] Move the input x to maximize the "catness" of x while keeping it close to the original x.

What happened:
1. We moved in the direction pointed by the cat classifier.
2. We left the manifold of natural images.
[Figure: the manifold of cats, with the adversarial point landing among "sort of cats" off the manifold.]

Difference from before? In our previous work we were doing gradient descent in z-space, so staying in the range of the generator.
This implies that there are no adversarial examples in the range of the generator. It also suggests a way to defend classifiers if we have a GAN for the domain: simply project onto the range of the generator before classifying (we have a preprint on that).

[Figure: the generator maps latent vectors z1 = [1,0,0,...], z2 = [1,2,3,...] in R^100 to images G(z1), G(z2) in R^13000.]
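The projection defense can be sketched in a few lines. For tractability this toy uses a linear "generator" G(z) = Wz, so projecting onto its range by gradient descent in z-space provably converges; a real GAN would use the same z-space descent without that guarantee. All shapes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
k, n = 4, 30
W = rng.normal(size=(n, k)) / np.sqrt(k)    # toy linear "generator": G(z) = W z

def project_to_range(x, steps=1000):
    """Gradient descent in z-space on ||G(z) - x||^2; output stays in range(G)."""
    lr = 0.45 / np.linalg.norm(W, 2) ** 2   # step below 1/lambda_max(W^T W)
    z = np.zeros(k)
    for _ in range(steps):
        z -= lr * 2.0 * W.T @ (W @ z - x)
    return W @ z

x_natural = W @ rng.normal(size=k)            # an image in the range
x_adv = x_natural + 0.5 * rng.normal(size=n)  # adversarial off-manifold push
x_def = project_to_range(x_adv)               # project back before classifying
```

The projection discards the perturbation's component outside range(G), which is exactly the part that "left the manifold" in the adversarial attack above.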
Training GANs on ImageNet is hard

Conclusions and outlook

- Defined compressed sensing for images coming from generative models. Performs very well for few measurements; Lasso is more accurate for many measurements. Ideas: better loss functions, combination with Lasso, using the discriminator in reconstruction.
- The theory of compressed sensing nicely extends to S-REC and recovery approximation bounds.
- The algorithm can be applied to non-linear measurements, and can solve general inverse problems for differentiable measurements. Plug and play different differentiable boxes!
- Better generative models (e.g. for MRI datasets) can be useful.
- The idea of differentiable compression seems quite general.

Code: https://github.com/AshishBora/csgm
(we give trained GANs)

References: related recent work

Mardani, Morteza, et al. "Deep generative adversarial networks for compressed sensing automates MRI." arXiv preprint arXiv:1706.00051 (2017).
Lucas, Alice, Michael Iliadis, Rafael Molina, and Aggelos K. Katsaggelos. "Using deep neural networks for inverse problems in imaging." IEEE Signal Processing Magazine (2018).
Fletcher, Alyson K., and Sundeep Rangan. "Inference in deep networks in high dimensions." arXiv preprint arXiv:1706.06549 (2017).
Asim, Muhammad, Fahad Shamshad, and Ali Ahmed. "Solving Bilinear Inverse Problems using Deep Generative Priors." arXiv preprint arXiv:1802.04073 (2018).
Kabkab, Maya, Pouya Samangouei, and Rama Chellappa. "Task-Aware Compressed Sensing with Generative Adversarial Networks." arXiv preprint arXiv:1802.01284 (2018).
Shah, Viraj, and Chinmay Hegde. "Solving Linear Inverse Problems Using GAN Priors: An Algorithm with Provable Guarantees." arXiv preprint arXiv:1802.08406 (2018).
Mixon, Dustin G., and Soledad Villar. "SUNLayer: Stable denoising with generative networks." arXiv preprint arXiv:1803.09319 (2018).
Hand, Paul, and Vladislav Voroninski. "Global Guarantees for Enforcing Deep Generative Priors by Empirical Risk." arXiv preprint arXiv:1705.07576 (2017).
Samangouei, Pouya, Maya Kabkab, and Rama Chellappa. "Defense-GAN: Protecting classifiers against adversarial attacks using generative models." arXiv preprint arXiv:1805.06605 (2018).
fin

CausalGAN

Work with Murat Kocaoglu and Chris Snyder. Assume a causal structure on attributes (gender, mustache, long hair, etc.). Create a machine that can sample conditional and interventional samples: we call that an implicit causal generative model. Adversarial training. The causal generator seems to allow configurations never seen in the dataset (e.g. women with mustaches).

CausalGAN

[Figure: a causal graph over attributes (Gender, Mustache, Age, Bald, Glasses, ...) feeding, together with extra random bits z, into the image generator G(z).]

CausalGAN

What is the difference between good old probability conditioning vs intervention? Conditioning on Bald=1 vs Intervention (Bald=1)
Conditioning on Bald=1 gives only male samples; intervening also produces female samples.

CausalGAN

Conditioning on Mustache=1 vs Intervention (Mustache=1)
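The conditioning vs intervention distinction can be simulated on a two-variable toy model. The causal structure (Male -> Bald) follows the slides; the probabilities are made-up numbers for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200_000

# Toy causal model: Male -> Bald, with baldness far likelier for men.
male = rng.random(N) < 0.5
bald = rng.random(N) < np.where(male, 0.30, 0.01)

# Conditioning on Bald=1: select the existing bald samples.
# Since Bald is downstream of Male, being bald is evidence of being male.
p_cond = male[bald].mean()

# Intervention do(Bald=1): cut the Male -> Bald edge and force Bald=1 for
# everyone. The marginal of Male is untouched, so it stays ~50/50.
p_do = male.mean()

print(round(p_cond, 2), round(p_do, 2))
```

The conditional probability of Male shoots up near 1, while the interventional one stays at 1/2: exactly the bald-women effect the CausalGAN samples show.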