Batch Learning
Introduction
In the previous article, we ran one epoch of the gradient descent algorithm.
In this article, we will explore a new idea to stabilize this algorithm: batch learning.
What is a Batch?
A batch corresponds to multiple elements of data input taken at once. The main goal is to modify the way our weights are updated so that each update is more robust.
In a previous article, we talked about the direction to follow in order to update the weights.
With batch learning, we want to update our weights according to the average direction on the batch of data input; let us call this direction $\delta w_{avg}$. We can still use the same formula to update the weights as before, with this simple replacement:

$$\hat{w} = w - \alpha \cdot \delta w_{avg}$$

We are able to modify our gradient descent algorithm so that it is applied on a batch of data input:
- pick a batch of data input in the dataset (this just means picking several data input)
- run the forward pass for the model on each element of the batch
- use the Loss function to compute the error between the result produced by the model and the expectation given by the data output, for each (data input, data output) pair of the batch
- run the backward pass to compute:
  - the learning flow for each element of the batch
  - the derivative of the Loss function according to W for each element of the batch
- update the weights of the model with the average direction $-\delta w_{avg}$

Do this for all the data input in the dataset: we call that an epoch. Repeat for several epochs, as sketched below.
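Here is a minimal sketch of this loop in Python. The `forward` and `backward` helpers stand for the forward and backward passes of the previous articles; their exact names and signatures are assumptions made for the illustration, not something defined in this walkthrough.

```python
import numpy as np

def train_epoch(dataset, w, alpha, batch_size, forward, backward):
    """One epoch of gradient descent with batch learning (sketch)."""
    for start in range(0, len(dataset), batch_size):
        # Pick a batch of (data input, data output) in the dataset.
        batch = dataset[start:start + batch_size]
        n = len(batch)
        delta_w_avg = np.zeros_like(w)          # average direction on the batch
        for x, y_truth in batch:
            o = forward(x, w)                   # forward pass, element by element
            delta_w = backward(o, y_truth, w)   # dLoss/dW for this element
            delta_w_avg += delta_w / n          # accumulate the average direction
        w = w - alpha * delta_w_avg             # w_hat = w - alpha * delta_w_avg
    return w
```

Calling this function once goes through the whole dataset, that is one epoch; calling it several times in a row runs several epochs.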
The New Forward Pass
There is no new forward pass: we keep running the forward pass as before, simply keeping in mind that the different elements of the batch will be grouped together for learning.
For $i$ in $1, 2, \dots, n$, the indices of the batch elements, we compute $model(x_i)$.
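In code, this is just the existing forward pass looped over the batch elements. A tiny sketch, where `model` is only a placeholder standing for the forward function already built in the previous articles:

```python
def model(x):
    # Placeholder: stands for the real forward pass of the previous articles.
    return sum(x)

# The n data input picked for the batch (illustrative values).
batch_inputs = [(100, 2000, 100), (200, 0, 0), (0, 2000, 3000)]

# One independent forward pass per element: o_i = model(x_i), kept grouped for later.
outputs = [model(x) for x in batch_inputs]
```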
The New Loss Function
Since its introduction in the Loss function article, the Loss function has been used to systematically compare the results produced by the model with the expectations.
Now, we want to compare the results and the expectations on multiple elements of a batch, each element of the batch being a (data input, data output) pair.
One simple idea is to compute the average loss on these elements.
Let us take the Loss function used in the previous articles as an example: we had picked an $x$ from the data input and a $y_{truth}$ from the data output and computed:

$$loss = Loss(model(x), y_{truth})$$

Now we consider a batch, so we have several elements (let us say $n$ elements):
- $x_1, x_2, \dots, x_n$
- $y_{truth,1}, y_{truth,2}, \dots, y_{truth,n}$
We compute:
$$loss_1 = Loss(model(x_1), y_{truth,1})$$
$$loss_2 = Loss(model(x_2), y_{truth,2})$$
$$...$$
$$loss_n = Loss(model(x_n), y_{truth,n})$$

We introduce $Loss_{avg}$ as the average of the errors on the batch:

$$Loss_{avg} = \frac{1}{n} \cdot \big(Loss(model(X_1), Y_{truth,1}) + Loss(model(X_2), Y_{truth,2}) + ... + Loss(model(X_n), Y_{truth,n})\big)$$

and we evaluate this function on real values:

$$loss_{avg} = \frac{1}{n} \cdot (loss_1 + loss_2 + ... + loss_n)$$

What is interesting to note is that for any $X_i$ we have:

$$\frac{\partial Loss_{avg}}{\partial X_i} = \frac{1}{n} \cdot \frac{\partial Loss}{\partial X_i}$$

And we recall that $\frac{\partial Loss}{\partial X_i}$ is used to compute the learning flow.

This means that the new $Loss_{avg}$ has exactly the same impact on learning as $Loss_{learning}$, with:

$$Loss_{learning}(X, Y_{truth}) = \frac{1}{n} \cdot Loss(X, Y_{truth})$$

In fact $Loss_{avg}$ is just a global indicator that shows the average error at the end of the forward pass. But what is really propagated during the training phase will be $Loss_{learning}$.
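As a small sanity check, here is a short NumPy sketch with the quadratic Loss of this walkthrough, using the numbers of the example further down ($n = 3$). It shows the two quantities side by side: $loss_{avg}$ as a global indicator, and the $\frac{1}{n}$-scaled derivatives that are actually propagated.

```python
import numpy as np

def loss(o, y_truth):
    # Loss used in this walkthrough: 0.5 * (o - y_truth)^2
    return 0.5 * (o - y_truth) ** 2

def d_loss(o, y_truth):
    # Derivative of the Loss with respect to its input X, evaluated at (o, y_truth).
    return o - y_truth

outputs  = np.array([0.0, 1.0, 0.0])   # model outputs on the batch (example below)
y_truths = np.array([0.0, 1.0, 1.0])   # expectations
n = len(outputs)

loss_avg = np.mean(loss(outputs, y_truths))        # global indicator: 1/6
deltas_learning = d_loss(outputs, y_truths) / n    # what is propagated: (1/n) * dLoss/dX

print(loss_avg)          # ≈ 0.1667
print(deltas_learning)   # [ 0.  0. -0.3333...]
```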
The New Backward Pass
There is no new backward pass: we keep running the backward pass as before, simply keeping in mind that the different elements of the batch will be grouped together for learning.
For $i$ in $1, 2, \dots, n$, the indices of the batch elements, we compute:

$$\frac{\partial Loss_{avg}}{\partial X_i}$$

and

$$\frac{\partial Loss_{avg}}{\partial W_i}$$

Thanks to the previous paragraph, we know it comes down to computing:

$$\frac{\partial Loss_{learning}}{\partial X_i} = \frac{1}{n} \cdot \frac{\partial Loss}{\partial X_i}$$

and

$$\frac{\partial Loss_{learning}}{\partial W_i} = \frac{1}{n} \cdot \frac{\partial Loss}{\partial W_i}$$

When we evaluate these two functions we have:

$$\delta_{learning} = \frac{1}{n} \cdot \delta$$

and

$$\delta w_{learning} = \frac{1}{n} \cdot \delta w$$

The State so far
Let us summarize the status so far. We are trying to learn on a batch of (data input, data output). We have to apply the training phase, which is nearly the same as before.
Let us concentrate on one $L_k$ layer that declares $W_k$ weights. Suppose that our batch has $n$ elements:
- During the forward pass we compute multiple $o_k$ results, let us say: $o_{k,1}, o_{k,2}, \dots, o_{k,n}$.
- During the backward pass we compute multiple $\delta_{k,learning}$ for the learning flow: $\delta_{k,learning,1}, \delta_{k,learning,2}, \dots, \delta_{k,learning,n}$.
- During the backward pass we also compute multiple $\delta w_{k,learning}$: $\delta w_{k,learning,1}, \delta w_{k,learning,2}, \dots, \delta w_{k,learning,n}$.

What is common to all these steps is that they are fully independent inside the batch: $o_{k,1}, o_{k,2}, \dots, o_{k,n}$ are fully independent, $\delta_{k,learning,1}, \delta_{k,learning,2}, \dots, \delta_{k,learning,n}$ are fully independent, and $\delta w_{k,learning,1}, \delta w_{k,learning,2}, \dots, \delta w_{k,learning,n}$ are fully independent.
The fact that they are fully independent is really interesting in terms of computing: we can fully parallelize their computation inside the current step (forward or backward).
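To illustrate this independence, here is a small NumPy sketch of a batched forward pass through a weighted-sum layer like $L_2$ (the weights value is the one used in the example later in this article). Each row of the result only depends on its own row of inputs, so the $n$ computations could run in parallel:

```python
import numpy as np

# Batch of n = 3 data input (rows) and the L2 weights from the example below.
X = np.array([[100.0, 2000.0,  100.0],
              [200.0,    0.0,    0.0],
              [  0.0, 2000.0, 3000.0]])
w2 = np.array([1.0 / 200.0, -3000.0 / 11600000.0, 1.0 / 5800.0])

o2 = X @ w2                  # one batched operation, rows stay independent
o3 = np.maximum(o2, 0.0)     # L3: keep the positive part, element by element
print(o3)                    # ≈ [0., 1., 0.]
```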
Let us talk about our final goal, which is to update the weights according to the average direction $-\delta w_{avg}$.
For now, this average is clearly out of reach. Indeed, every batch element in the forward pass is isolated from the others, and the same holds in the backward pass. There is just one modification in the backward pass: the $\frac{1}{n}$ coefficient in $Loss_{learning}$. But clearly this is not sufficient to say we are about to compute an average direction.
The last place where things can come together is the weight update.
Update the Weights: the New Rule
Let us recall the update formula for the weights:
$$\hat{w} = w - \alpha \cdot \delta w_{avg}$$

In order to use the update formula, we must compute the explicit formula for $\frac{\partial Loss_{avg}}{\partial W}$ beforehand.
We need to think about the backward pass once more, to understand how $W$ impacts the final Loss: $Loss_{avg}$.
Let us consider the $L_k$ layer example we introduced in the previous paragraph. We try to compute an explicit formula for:

$$\frac{\partial Loss_{avg}}{\partial W_k}$$

Until now, the values for $w_k$ have not been updated, they are still the same… This means that the results obtained through the forward pass were not that independent: $o_{k,1}$ was computed with $w_k$ as the value for $W_k$. But $o_{k,2}$ was computed with the same $w_k$ value for $W_k$… and $o_{k,n}$ was computed with the same $w_k$ value for $W_k$ as well.
Thus $w_k$ has impacted all the batch outputs that depend on $W_k$: $o_{k,1}, o_{k,2}, \dots, o_{k,n}$. Said differently, $W_k$ impacts $L_k(X_{k,1}, W_k)$, $L_k(X_{k,2}, W_k)$ … and $L_k(X_{k,n}, W_k)$.
And by definition: $L_k(X_{k,1}, W_k)$ impacts $Loss_{avg}$, $L_k(X_{k,2}, W_k)$ impacts $Loss_{avg}$ … and $L_k(X_{k,n}, W_k)$ impacts $Loss_{avg}$.
This comes down to the realization that the same $w_k$ value is responsible for the final $loss_{avg}$ through the different batch outputs computed during the forward pass. So in order to get the impact of $W_k$ on $Loss_{avg}$, we must simply add the impacts through the different batch outputs:
$$\frac{\partial Loss_{avg}}{\partial W_k} = \frac{\partial Loss_{avg}}{\partial W_{k,1}} + \frac{\partial Loss_{avg}}{\partial W_{k,2}} + ... + \frac{\partial Loss_{avg}}{\partial W_{k,n}}$$

And we know from a previous paragraph that:

$$\frac{\partial Loss_{avg}}{\partial W} = \frac{\partial Loss_{learning}}{\partial W}$$

This helps us to obtain:

$$\frac{\partial Loss_{avg}}{\partial W_k} = \frac{\partial Loss_{avg}}{\partial W_{k,1}} + \frac{\partial Loss_{avg}}{\partial W_{k,2}} + ... + \frac{\partial Loss_{avg}}{\partial W_{k,n}}$$
$$= \frac{\partial Loss_{learning}}{\partial W_{k,1}} + \frac{\partial Loss_{learning}}{\partial W_{k,2}} + ... + \frac{\partial Loss_{learning}}{\partial W_{k,n}}$$
$$= \frac{1}{n} \cdot \left(\frac{\partial Loss}{\partial W_{k,1}} + \frac{\partial Loss}{\partial W_{k,2}} + ... + \frac{\partial Loss}{\partial W_{k,n}}\right)$$

We finally obtained what we were looking for:

$\frac{\partial Loss_{avg}}{\partial W_k}$ is the average direction of $\frac{\partial Loss}{\partial W_k}$.
Though, we will keep in mind that:

$$\frac{\partial Loss_{avg}}{\partial W_k} = \frac{\partial Loss_{learning}}{\partial W_{k,1}} + \frac{\partial Loss_{learning}}{\partial W_{k,2}} + ... + \frac{\partial Loss_{learning}}{\partial W_{k,n}}$$

And we evaluate this function:

$$\delta w_{k,avg} = \delta w_{k,learning,1} + \delta w_{k,learning,2} + ... + \delta w_{k,learning,n}$$
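As a quick sketch of this new rule in code (the arrays are illustrative values, and the per-element $\delta w_{k,learning,i}$ are assumed to have already been scaled by $\frac{1}{n}$ during the backward pass):

```python
import numpy as np

alpha = 1e-7                          # learning rate used in this walkthrough
w_k = np.array([0.2, -0.1, 0.3])      # illustrative current weights of layer k

# Per-element learning directions, already containing the 1/n factor.
delta_w_learning = [np.array([ 0.01, 0.00, -0.02]),
                    np.array([ 0.00, 0.03,  0.01]),
                    np.array([-0.02, 0.00,  0.00])]

delta_w_avg = sum(delta_w_learning)   # average direction for the batch
w_k_hat = w_k - alpha * delta_w_avg   # w_hat = w - alpha * delta_w_avg
print(w_k_hat)
```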
Example

In this example, we will start a new training phase from scratch (compared to the “Example: what we do…” in the previous article), but this time with a batch size of 3. The choice is easy here because we only have 3 data input; it may be trickier in the general case. Yet, there is no magical formula and it will be up to the developer to decide on this parameter. We use the same very small learning rate as before: $\alpha = 10^{-7}$.
Data
Same data as in the first article.
| data input | data output (expectation) |
|---|---|
| (100 broccoli, 2000 Tagada strawberries, 100 workout hours) | (bad shape) |
| (200 broccoli, 0 Tagada strawberries, 0 workout hours) | (good shape) |
| (0 broccoli, 2000 Tagada strawberries, 3000 workout hours) | (good shape) |
Model
Same model as in the weights article.
$$L_1(X_1) = X_1 \quad \text{with } X_1 = (X_{11}, X_{12}, X_{13})$$
$$L_2(X_2, W_2) = W_2 \cdot X_2 = W_{21} \cdot X_{21} + W_{22} \cdot X_{22} + W_{23} \cdot X_{23} \quad \text{with } X_2 = (X_{21}, X_{22}, X_{23}) \text{ and } W_2 = (W_{21}, W_{22}, W_{23})$$
$$L_3(X_3) = X_3 \text{ if } X_3 \geq 0 \text{ else } 0$$
$$model(X) = L_3(L_2(L_1(X))) \quad \text{with } X = (X_1, X_2, X_3)$$
$$Loss(X_4, Y_{truth}) = \frac{1}{2} (X_4 - Y_{truth})^2$$

We use the initial values for our $L_2$ weights:

$$w_2 = \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right)$$
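To make the example easier to follow, here is a small NumPy sketch of this model and Loss (the function names and structure are mine, just for illustration):

```python
import numpy as np

w2 = np.array([1.0 / 200.0, -3000.0 / 11600000.0, 1.0 / 5800.0])

def L1(x):
    return x                          # identity layer

def L2(x, w):
    return float(np.dot(w, x))        # weighted sum: W21*X21 + W22*X22 + W23*X23

def L3(x):
    return x if x >= 0 else 0.0       # keep the positive part

def model(x, w=w2):
    return L3(L2(L1(np.asarray(x, dtype=float)), w))

def loss_fn(o, y_truth):
    return 0.5 * (o - y_truth) ** 2   # Loss(X4, Ytruth) = 1/2 * (X4 - Ytruth)^2

print(model([100, 2000, 100]), model([200, 0, 0]), model([0, 2000, 3000]))
# ≈ 0.0, 1.0, 0.0
```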
Run the Forward Pass

| x | o1 = L1(x) | o2 = L2(o1) | o3 = L3(o2) |
|---|---|---|---|
| (100, 2000, 100) | (100, 2000, 100) | (0) | (0) |
| (200, 0, 0) | (200, 0, 0) | (1) | (1) |
| (0, 2000, 3000) | (0, 2000, 3000) | (0) | (0) |

| o3 = model(x) | ytruth = expected result | loss = Loss(o3, ytruth) | correct? |
|---|---|---|---|
| (0) | (0) | (0) | yes |
| (1) | (1) | (0) | yes |
| (0) | (1) | (0.5) | no |
Run the Backward Pass
$$\delta_4 = o_3 - y_{truth}$$
$$\delta_3 = \delta_4 \text{ if } o_2 \geq 0 \text{ else } 0$$
$$\delta_2 = \delta_3 \cdot w_2 \quad \text{with } w_2 = \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right)$$
$$\delta w_2 = \delta_3 \cdot o_1$$
$$\delta_1 = \delta_2$$
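These formulas translate almost line by line into code. A sketch for one batch element, where the $\frac{1}{3}$ of $Loss_{learning}$ is folded into $\delta_4$ exactly as in the walkthrough below (the helper name and signature are mine):

```python
import numpy as np

def backward(o1, o2, o3, y_truth, w2, n):
    delta4 = (o3 - y_truth) / n            # (1/n) * dLoss/dX4 evaluated at (o3, y_truth)
    delta3 = delta4 if o2 >= 0 else 0.0    # the flow only goes through L3 where o2 >= 0
    delta2 = delta3 * w2                   # learning flow sent to the previous layer
    delta_w2 = delta3 * np.asarray(o1, dtype=float)  # direction for the L2 weights
    delta1 = delta2
    return delta1, delta_w2

w2 = np.array([1.0 / 200.0, -3000.0 / 11600000.0, 1.0 / 5800.0])
_, dw = backward(o1=[0, 2000, 3000], o2=0.0, o3=0.0, y_truth=1.0, w2=w2, n=3)
print(dw)   # ≈ [0., -666.67, -1000.], i.e. -(0, 2000/3, 1000)
```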
Run the Training Phase on the Batch

- pick data input:

$$x_1 = (100, 2000, 100)$$
$$x_2 = (200, 0, 0)$$
$$x_3 = (0, 2000, 3000)$$

- run the forward pass:

$$o_{3,1} = model(x_1) = model((100, 2000, 100)) = (0)$$
$$o_{3,2} = model(x_2) = model((200, 0, 0)) = (1)$$
$$o_{3,3} = model(x_3) = model((0, 2000, 3000)) = (0)$$

- compute $loss_{avg}$:

$$loss_1 = Loss(o_{3,1}, y_{truth,1}) = (0)$$
$$loss_2 = Loss(o_{3,2}, y_{truth,2}) = (0)$$
$$loss_3 = Loss(o_{3,3}, y_{truth,3}) = (0.5)$$
$$loss_{avg} = \frac{1}{3} \cdot (0 + 0 + 0.5) = \frac{1}{6}$$

- run the backward pass:

$$\delta_{4,1} = \frac{\partial Loss_{avg}}{\partial X_{4,1}}(o_{3,1}, y_{truth,1}) = \frac{\partial Loss_{learning}}{\partial X_{4,1}}(o_{3,1}, y_{truth,1}) = \frac{1}{3} \cdot (o_{3,1} - y_{truth,1}) = \frac{1}{3} \cdot ((0) - (0)) = (0)$$
$$\delta_{4,2} = \frac{\partial Loss_{avg}}{\partial X_{4,2}}(o_{3,2}, y_{truth,2}) = \frac{\partial Loss_{learning}}{\partial X_{4,2}}(o_{3,2}, y_{truth,2}) = \frac{1}{3} \cdot (o_{3,2} - y_{truth,2}) = \frac{1}{3} \cdot ((1) - (1)) = (0)$$
$$\delta_{4,3} = \frac{\partial Loss_{avg}}{\partial X_{4,3}}(o_{3,3}, y_{truth,3}) = \frac{\partial Loss_{learning}}{\partial X_{4,3}}(o_{3,3}, y_{truth,3}) = \frac{1}{3} \cdot (o_{3,3} - y_{truth,3}) = \frac{1}{3} \cdot ((0) - (1)) = -\left(\frac{1}{3}\right)$$

$$\delta_{3,1} = \delta_{4,1} \text{ if } o_{2,1} \geq 0 \text{ else } 0 = (0) \text{ if } (0) \geq 0 \text{ else } 0 = (0)$$
$$\delta_{3,2} = \delta_{4,2} \text{ if } o_{2,2} \geq 0 \text{ else } 0 = (0) \text{ if } (1) \geq 0 \text{ else } 0 = (0)$$
$$\delta_{3,3} = \delta_{4,3} \text{ if } o_{2,3} \geq 0 \text{ else } 0 = -\left(\frac{1}{3}\right) \text{ if } (0) \geq 0 \text{ else } 0 = -\left(\frac{1}{3}\right)$$

$$\delta_{2,1} = \delta_{3,1} \cdot w_2 = (0) \cdot \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right) = (0, 0, 0)$$
$$\delta_{2,2} = \delta_{3,2} \cdot w_2 = (0) \cdot \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right) = (0, 0, 0)$$
$$\delta_{2,3} = \delta_{3,3} \cdot w_2 = -\left(\frac{1}{3}\right) \cdot \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right)$$

$$\delta w_{2,1} = \delta_{3,1} \cdot o_{1,1} = (0) \cdot (100, 2000, 100) = (0, 0, 0)$$
$$\delta w_{2,2} = \delta_{3,2} \cdot o_{1,2} = (0) \cdot (200, 0, 0) = (0, 0, 0)$$
$$\delta w_{2,3} = \delta_{3,3} \cdot o_{1,3} = -\left(\frac{1}{3}\right) \cdot (0, 2000, 3000) = -\left(0, \frac{2000}{3}, 1000\right)$$

$$\delta_{1,1} = \delta_{2,1} = (0, 0, 0)$$
$$\delta_{1,2} = \delta_{2,2} = (0, 0, 0)$$
$$\delta_{1,3} = \delta_{2,3} = -\left(\frac{1}{3}\right) \cdot \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right)$$

- update the weights of the model:

We use the new rule we saw in the previous paragraph:

$$\delta w_{2,avg} = \delta w_{2,learning,1} + \delta w_{2,learning,2} + \delta w_{2,learning,3}$$
$$\hat{w}_2 = w_2 - \alpha \cdot \delta w_{2,avg} = w_2 - \alpha \cdot (\delta w_{2,learning,1} + \delta w_{2,learning,2} + \delta w_{2,learning,3})$$
$$= \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right) - 10^{-7} \cdot \left((0, 0, 0) + (0, 0, 0) - \left(0, \frac{2000}{3}, 1000\right)\right)$$
$$= \left(\frac{1}{200}, -\frac{3000}{11600000}, \frac{1}{5800}\right) + \left(0, \frac{0.0002}{3}, 0.0001\right)$$
$$= \left(\frac{1}{200}, \frac{0.0002}{3} - \frac{3000}{11600000}, 0.0001 + \frac{1}{5800}\right)$$

Let us keep in mind the new values we computed for $w_2$:

$$w_2 = \left(\frac{1}{200}, \frac{0.0002}{3} - \frac{3000}{11600000}, 0.0001 + \frac{1}{5800}\right)$$
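As a quick check, a few lines of NumPy reproduce this update and the outputs of the next paragraph (reusing the values above; the code structure is mine):

```python
import numpy as np

w2 = np.array([1.0 / 200.0, -3000.0 / 11600000.0, 1.0 / 5800.0])
alpha = 1e-7

# Per-element directions computed during the backward pass above.
delta_w2 = [np.zeros(3),
            np.zeros(3),
            -np.array([0.0, 2000.0 / 3.0, 1000.0])]

w2_new = w2 - alpha * sum(delta_w2)   # w2_hat = w2 - alpha * delta_w2_avg

X = np.array([[100.0, 2000.0,  100.0],
              [200.0,    0.0,    0.0],
              [  0.0, 2000.0, 3000.0]])
print(np.maximum(X @ w2_new, 0.0))    # ≈ [0.14, 1.0, 0.43]
```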
We have just run one epoch of the gradient descent algorithm on our whole dataset with a batch size of 3. Let us stop our algorithm now and check the new results when we run a new forward pass on every data input of our dataset.
Run a New Forward Pass
| x | o1 = L1(x) | o2 = L2(o1) | o3 = L3(o2) |
|---|---|---|---|
| (100, 2000, 100) | (100, 2000, 100) | (0.14) | (0.14) |
| (200, 0, 0) | (200, 0, 0) | (1) | (1) |
| (0, 2000, 3000) | (0, 2000, 3000) | (0.43) | (0.43) |
As in the previous article, we add a new column to show the result aligned with (0) or (1), so that we can compare it with the expectations. Let us use the same threshold:
- values < 0.5 will be transformed to 0,
- values ≥ 0.5 will be transformed to 1
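In code, this thresholding is a one-liner (a trivial sketch):

```python
def to_class(o, threshold=0.5):
    # Map the raw model output to the (0) / (1) format of the expectations.
    return 1 if o >= threshold else 0

print([to_class(o) for o in (0.14, 1.0, 0.43)])   # [0, 1, 0]
```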
Now we have:
| x | o3 = model(x) | result |
|---|---|---|
| (100, 2000, 100) | (0.14) | (0) |
| (200, 0, 0) | (1) | (1) |
| (0, 2000, 3000) | (0.43) | (0) |
and:
| o3 = model(x) | result | ytruth | loss = Loss(o3, ytruth) | correct? |
|---|---|---|---|---|
| (0.14) | (0) | (0) | (0.01) | yes |
| (1) | (1) | (1) | (0) | yes |
| (0.43) | (0) | (1) | (0.16) | no |
With this small learning rate, our model produces a wrong result for the last data input, whereas in the previous article the learning had fixed the third data input.
We compare the results obtained here, (0.14), (1), (0.43), to the results obtained in the previous article, (0.43), (1), (1.3). We see that the results are more “moderate” with the batch learning algorithm: the push to fix the result on the last data input has been compensated by the fact that there is nothing to learn on the two other data input. This goes along with a more “robust” learning over several epochs.
Conclusion
In this article we studied an upgraded version of the gradient descent algorithm with batch learning. This new algorithm is more robust.
This article also concludes our deep learning meta walkthrough. We will now open a new chapter to better understand the learning flow we introduced in the backward pass article.
We will also speak about the different layers we need in order to build a real deep learning model: let us explore the first of them in the next article.