NPTEL Deep Learning – IIT Ropar Week 4 Assignment Answers 2025
1. Using the Adam optimizer with β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸, what would be the bias-corrected first moment estimate after the first update if the initial gradient is 4?
- 0.4
- 4.0
- 3.6
- 0.44
Answer :-
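As a quick check, Adam's first moment starts at m₀ = 0 and is updated as m₁ = β₁·m₀ + (1 − β₁)·g₁, then bias-corrected by dividing by (1 − β₁ᵗ). A minimal sketch of that arithmetic, using the values given in the question:

```python
# Bias-corrected first moment after Adam's first update (values from the question).
beta1 = 0.9
g1 = 4.0     # gradient at step 1
m0 = 0.0     # first moment is initialised to zero

m1 = beta1 * m0 + (1 - beta1) * g1   # raw first moment
m1_hat = m1 / (1 - beta1 ** 1)       # bias correction at t = 1

print(round(m1, 3), round(m1_hat, 3))   # 0.4 4.0
```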
2. In a mini-batch gradient descent algorithm, if the total number of training samples is 50,000 and the batch size is 100, how many iterations are required to complete 10 epochs?
- 5,000
- 50,000
- 500
- 5
Answer :-
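One way to verify the count: each epoch requires N / B parameter updates when the batch size divides the dataset evenly, so the total is (N / B) × epochs. A short sketch with the numbers from the question:

```python
# Iterations needed for mini-batch gradient descent (numbers from the question).
num_samples = 50_000
batch_size = 100
num_epochs = 10

iters_per_epoch = num_samples // batch_size   # 500 (batch size divides the data evenly)
total_iters = iters_per_epoch * num_epochs    # 5,000

print(iters_per_epoch, total_iters)   # 500 5000
```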
3. In a stochastic gradient descent algorithm, the learning rate starts at 0.1 and decays exponentially with a decay rate of 0.1 per epoch. What will be the learning rate after 5 epochs?
Answer :-
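No options are listed for this one, and the result depends on the schedule assumed. Under the usual exponential-decay schedule lr_t = lr₀ · e^(−k·t), the value works out as sketched below (the exact formula is an assumption, since the question does not spell it out):

```python
import math

# Exponentially decayed learning rate, assuming lr_t = lr0 * exp(-k * t).
lr0 = 0.1   # initial learning rate
k = 0.1     # decay rate per epoch
t = 5       # epochs elapsed

lr_t = lr0 * math.exp(-k * t)
print(round(lr_t, 4))   # ≈ 0.0607
```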
4. In the context of Adam optimizer, what is the purpose of bias correction?
- To prevent overfitting
- To speed up convergence
- To correct for the bias in the estimates of first and second moments
- To adjust the learning rate
Answer :-
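The reason a correction is needed: m and v are initialised to zero, so the raw exponential moving averages are biased toward zero during the first few steps. Dividing by (1 − βᵗ) undoes that bias, as the small sketch below illustrates for a constant gradient of 1 (illustrative values only):

```python
# Raw vs. bias-corrected first moment in Adam for a constant gradient of 1.0.
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0   # raw EMA, starts far below the true mean of 1.0
    m_hat = m / (1 - beta1 ** t)        # bias-corrected estimate recovers 1.0
    print(t, round(m, 4), round(m_hat, 4))
```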
5. The figure below shows the contours of a surface.

Suppose that a man walks from −1 to +1 along both the horizontal (x) axis and the vertical (y) axis. The statement that the man would have seen the slope change more rapidly along the x-axis than along the y-axis is:
- True
- False
- Cannot say
Answer :-
6. What is the primary benefit of using Adagrad compared to other optimization algorithms?
- It converges faster than other optimization algorithms.
- It is more memory-efficient than other optimization algorithms.
- It is less sensitive to the choice of hyperparameters (learning rate).
- It is less likely to get stuck in local optima than other optimization algorithms.
Answer :-
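For context, Adagrad divides a single global learning rate by the square root of the per-parameter sum of squared gradients, so each parameter gets its own effective step size and the method is comparatively insensitive to the initial learning rate. A minimal sketch (all values illustrative):

```python
import numpy as np

# Adagrad: each parameter's step is scaled by its accumulated squared gradients.
lr, eps = 0.1, 1e-8
w = np.array([1.0, 1.0])
grad_sq_sum = np.zeros_like(w)

for t in range(3):
    grad = np.array([1.0, 0.1])                      # illustrative constant gradient
    grad_sq_sum += grad ** 2                         # per-parameter accumulator
    step = lr * grad / (np.sqrt(grad_sq_sum) + eps)  # per-parameter effective step
    w -= step
    print(t, step)   # the effective step shrinks as gradients accumulate
```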
7. What are the benefits of using stochastic gradient descent compared to vanilla gradient descent?
- SGD converges more quickly than vanilla gradient descent.
- SGD is computationally efficient for large datasets.
- SGD theoretically guarantees that the descent direction is optimal.
- SGD experiences less oscillation compared to vanilla gradient descent.
Answer :-
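The key contrast: vanilla (batch) gradient descent computes the gradient over the entire dataset before taking one step, while SGD takes a cheap, noisy step after every single sample. A rough sketch on a toy least-squares problem (data and names are illustrative, not from the assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
w = np.zeros(3)
lr = 0.01

# Vanilla gradient descent: one update uses the gradient over ALL samples.
grad_full = 2 * X.T @ (X @ w - y) / len(X)
w_batch = w - lr * grad_full

# Stochastic gradient descent: one cheap, noisy update per individual sample.
w_sgd = w.copy()
for i in range(len(X)):
    xi, yi = X[i], y[i]
    grad_i = 2 * xi * (xi @ w_sgd - yi)
    w_sgd -= lr * grad_i

print(w_batch, w_sgd)
```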
8. What is the role of activation functions in deep learning?
- Activation functions transform the output of a neuron into a non-linear function, allowing the network to learn complex patterns.
- Activation functions make the network faster by reducing the number of iterations needed for training.
- Activation functions are used to normalize the input data.
- Activation functions are used to compute the loss function.
Answer :-
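Activation functions introduce non-linearity: stacking linear layers without one collapses to a single linear map, while inserting, say, a ReLU between them is what lets depth add expressive power. A tiny illustration (weights and inputs are arbitrary):

```python
import numpy as np

# Two linear layers with no activation collapse to one linear map...
W1 = np.array([[1.0, -1.0], [2.0, 1.0]])
W2 = np.array([[0.5, 1.0], [-1.0, 0.5]])
x = np.array([1.0, 2.0])
linear_stack = W2 @ (W1 @ x)          # identical to (W2 @ W1) @ x

# ...while a ReLU between them makes the composition non-linear.
relu = lambda z: np.maximum(z, 0.0)
nonlinear_stack = W2 @ relu(W1 @ x)

print(linear_stack, nonlinear_stack)   # the two outputs differ
```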
9. What is the advantage of using mini-batch gradient descent over batch gradient descent?
- Mini-batch gradient descent is more computationally efficient than batch gradient descent.
- Mini-batch gradient descent leads to a more accurate estimate of the gradient than batch gradient descent.
- Mini-batch gradient descent gives us a better solution.
- Mini-batch gradient descent can converge faster than batch gradient descent.
Answer :-
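Mini-batches sit between the two extremes: each epoch gives many cheap updates (N / B of them) rather than one expensive full-batch step, while averaging over B samples keeps the gradient estimate less noisy than pure SGD. A sketch of one epoch of mini-batch updates on a toy least-squares problem (illustrative throughout):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
w = np.zeros(3)
lr, batch_size = 0.01, 100

# One epoch of mini-batch gradient descent: 1000 / 100 = 10 updates,
# versus the single update a full-batch step would give.
perm = rng.permutation(len(X))
for start in range(0, len(X), batch_size):
    idx = perm[start:start + batch_size]
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)
    w -= lr * grad

print(w)
```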
10. In the Nesterov Accelerated Gradient (NAG) algorithm, the gradient is computed at:
- The current position
- A “look-ahead” position
- The previous position
- The average of current and previous positions
Answer :-
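In NAG the accumulated update u is applied first and the gradient is then evaluated at the look-ahead point w − γ·u, i.e. u_t = γ·u_{t−1} + η·∇f(w_{t−1} − γ·u_{t−1}) and w_t = w_{t−1} − u_t. A minimal sketch of one step (the objective is an illustrative quadratic):

```python
import numpy as np

# One Nesterov Accelerated Gradient step on an illustrative quadratic f(w) = 0.5 * w.A.w.
A = np.diag([1.0, 10.0])
grad = lambda w: A @ w                  # gradient of the quadratic

w = np.array([1.0, 1.0])                # current parameters
u = np.zeros_like(w)                    # accumulated update (momentum buffer)
lr, gamma = 0.01, 0.9

w_lookahead = w - gamma * u             # gradient is taken at the look-ahead point...
u = gamma * u + lr * grad(w_lookahead)  # ...not at the current position
w = w - u

print(w)
```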