NPTEL Reinforcement Learning Week 6 Assignment Answers 2025

1. Which of the following are true?

  • Dynamic programming methods use full backups and bootstrapping.
  • Temporal-Difference methods use sample backups and bootstrapping.
  • Monte-Carlo methods use sample backups and bootstrapping.
  • Monte-Carlo methods use full backups and no bootstrapping.
Answer :- 
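
For reference, the standard one-step backup forms (Sutton and Barto notation, not taken from the assignment itself) make the full-vs-sample-backup and bootstrapping distinctions concrete:

```latex
% Dynamic programming: full backup over all successors, bootstrapping on V
V(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma V(s')\bigr]

% TD(0): sample backup (one observed transition), bootstrapping on V
V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\bigr]

% Monte Carlo: sample backup using the full return G_t, no bootstrapping
V(s_t) \leftarrow V(s_t) + \alpha\,\bigl[G_t - V(s_t)\bigr]
```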

2. Consider the following statements:

(i) TD(0) methods use an unbiased sample of the return.
(ii) TD(0) methods use a sample of the reward from the distribution of rewards.
(iii) TD(0) methods use the current estimate of the value function.

Which of the above statements is/are true?

  • (i), (ii)
  • (i), (iii)
  • (ii), (iii)
  • (i), (ii), (iii)
Answer :- 

3. Consider an MDP with two states, A and B. Given the single trajectory shown below (in the pattern state, reward, next state, …), use on-policy TD(0) updates to estimate the values of the two states.

A, 3, B, 2, A, 5, B, 2, A, 4, END

Assume a discount factor γ=1, a learning rate α=1, and initial state values of zero. What are the estimated values of the two states at the end of the sampled trajectory? (Note: You are not asked to compute the true values for the two states.)

  • V(A)=2,V(B)=10
  • V(A)=8,V(B)=7
  • V(A)=4,V(B)=12
  • V(A)=12,V(B)=7
Answer :- 
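
For the arithmetic, a minimal Python sketch of on-policy TD(0) over this trajectory (γ, α, the zero initialisation, and the trajectory itself come from the question; the terminal state END is treated as having value 0):

```python
# TD(0) evaluation of the single trajectory in question 3:
# A --r=3--> B --r=2--> A --r=5--> B --r=2--> A --r=4--> END
gamma, alpha = 1.0, 1.0
V = {"A": 0.0, "B": 0.0, "END": 0.0}  # terminal state value fixed at 0

transitions = [("A", 3, "B"), ("B", 2, "A"), ("A", 5, "B"),
               ("B", 2, "A"), ("A", 4, "END")]

for s, r, s_next in transitions:
    # TD(0) update: V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]
    V[s] += alpha * (r + gamma * V[s_next] - V[s])
    print(f"after ({s}, {r}, {s_next}): V(A)={V['A']}, V(B)={V['B']}")
```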

4. Which of the following statements are true for SARSA?

  • It is a TD method.
  • It is an off-policy algorithm.
  • It uses bootstrapping to approximate full return.
  • It always selects the greedy action choice.
Answer :- 
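
For reference, the SARSA update in its usual textbook form, where a_{t+1} is the action actually taken in s_{t+1} by the behaviour policy:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\bigl[r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)\bigr]
```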

5. Assertion: In Expected-SARSA, we may select actions off-policy.
Reason: In the update rule for Expected-SARSA, we use the estimated expected value of the next state under the policy π rather than directly using the estimated value of the next state that is sampled on-policy.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion.
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
  • Assertion is true but Reason is false.
  • Assertion is false but Reason is true.
Answer :- 
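
For reference, the Expected SARSA update in its usual textbook form, where the bootstrap term is an expectation over the policy π rather than the sampled next action:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\Bigl[r_{t+1} + \gamma \sum_{a'} \pi(a' \mid s_{t+1})\, Q(s_{t+1}, a') - Q(s_t, a_t)\Bigr]
```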

6. Assertion: Q-learning can use asynchronous samples from different policies to update Q values.
Reason: Q-learning is an off-policy learning algorithm.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion.
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
  • Assertion is true but Reason is false.
  • Assertion is false but Reason is true.
Answer :- 
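
For reference, the Q-learning update in its usual textbook form; the max over next actions is what makes the target independent of the behaviour policy:

```latex
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\,\bigl[r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t)\bigr]
```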

7. Suppose, for a two-player game that we have modeled as an MDP, instead of learning a policy over the MDP directly, we separate the deterministic and stochastic results of playing an action to create ‘after-states’ (as discussed in the lectures). Consider the following statements:

(i) The set of states that make up ‘after-states’ may be different from the original set of states for the MDP.
(ii) The set of ‘after-states’ could be smaller than the original set of states for the MDP.

Which of the above statements is/are True?

  • Only (i)
  • Only (ii)
  • Both (i) and (ii)
  • Neither (i) nor (ii)
Answer :- 

8. Assertion: Rollout algorithms take advantage of the policy improvement property.
Reason: Rollout algorithms select the action with the highest estimated value.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion.
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
  • Assertion is true but Reason is false.
  • Assertion and Reason are both false.
Answer :- 
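
As a rough illustration (not from the assignment), a rollout algorithm estimates each action's value by simulating trajectories with a fixed base policy and then acts greedily with respect to those estimates; that greedy step is where the one-step policy-improvement argument enters. The `env.step(state, action)` simulator interface and `base_policy` below are hypothetical:

```python
def rollout_action_value(env, state, action, base_policy, n_sims=100, gamma=1.0):
    """Estimate Q(state, action) by Monte Carlo rollouts: take `action` first,
    then follow `base_policy` until the episode ends (hypothetical simulator)."""
    total = 0.0
    for _ in range(n_sims):
        s, r, done = env.step(state, action)  # assumed: returns (next_state, reward, done)
        ret, discount = r, gamma
        while not done:
            a = base_policy(s)
            s, r, done = env.step(s, a)
            ret += discount * r
            discount *= gamma
        total += ret
    return total / n_sims

def rollout_policy(env, state, actions, base_policy):
    # Act greedily with respect to the rollout estimates of the base policy's
    # action values; this greedification gives the improvement over the base policy.
    return max(actions, key=lambda a: rollout_action_value(env, state, a, base_policy))
```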

9. Consider the CliffWorld environment discussed in lecture:

Suppose we use an ϵ-greedy policy for exploration with ϵ = 0.1. Select the correct option(s):

  • Q-Learning finds the optimal (red) path.
  • Q-Learning finds the safer (blue) path.
  • SARSA finds the optimal (red) path.
  • SARSA finds the safer (blue) path.
Answer :- 

10. Which of the following are true for TD(0)? (Assume that the environment is truly Markov.)

  • It uses the full return to update the value of states.
  • Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given a finite number of samples.
  • Both TD(0) and Monte-Carlo policy evaluation converge to the same value function, given an infinite number of samples.
  • TD error is given by “δ = v_new(s_t, a_t) − v_old(s_t, a_t)”
Answer :- 
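
For comparison, the TD(0) error for state values is conventionally defined as:

```latex
\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)
```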