NPTEL Reinforcement Learning Week 2 Assignment Answers 2025
1. Which of the following is true of the UCB algorithm?

  • The action with the highest Q value is chosen at every iteration
  • After a very large number of iterations, the confidence intervals of unselected actions will not change much
  • The true expected value of an action always lies within its estimated confidence interval.
  • With a small probability ϵ, we select a random action to ensure adequate exploration of the action space.
Answer :- 

2.

  • Sub-optimal arms would be chosen more frequently
  • Sub-optimal arms would be chosen less frequently
  • Makes no change to the frequency of picking sub-optimal arms.
  • Sub-optimal arms could be chosen less or more frequently, depending on the samples.
Answer :- 

3. In a 4-arm bandit problem, after 100 iterations of the UCB algorithm, the Q-value estimates are Q100(1) = 1.73, Q100(2) = 1.83, Q100(3) = 1.89, Q100(4) = 1.55, and the number of times each arm has been sampled is n1 = 25, n2 = 20, n3 = 30, n4 = 15. Which arm will be sampled in the next trial?

  • Arm 1
  • Arm 2
  • Arm 3
  • Arm 4
Answer :- 
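Assuming the standard UCB1 selection rule, where arm i's score after t steps is Q_t(i) + sqrt(2 ln t / n_i), the scores in question 3 can be checked with a short computation (a sketch; the exact bonus constant may differ in the course's formulation):

```python
import math

# Estimates and sample counts after t = 100 iterations (from the question)
t = 100
Q = {1: 1.73, 2: 1.83, 3: 1.89, 4: 1.55}
n = {1: 25, 2: 20, 3: 30, 4: 15}

# UCB1 score: empirical mean plus an exploration bonus that shrinks
# as an arm is sampled more often
ucb = {arm: Q[arm] + math.sqrt(2 * math.log(t) / n[arm]) for arm in Q}
next_arm = max(ucb, key=ucb.get)
```

UCB then samples the arm whose score `ucb[arm]` is largest, not the one with the largest raw Q estimate.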

4. We need 8 rounds of median elimination to get an (ϵ, δ)-PAC arm. Approximately how many samples would have been required using the naive (ϵ, δ)-PAC algorithm, given (ϵ, δ) = (1/2, 1/e)? (Choose the value closest to the correct answer.)

  • 15000
  • 10000
  • 500
  • 20000
Answer :- 
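One way to sanity-check question 4: since each round of median elimination halves the arm set, 8 rounds suggest n = 2^8 = 256 arms, and the naive (ϵ, δ)-PAC algorithm (in the Even-Dar et al. formulation) samples every arm (4/ϵ²) ln(2n/δ) times. A sketch under those assumptions; the constant in the per-arm sample count varies between presentations, so treat the total as indicative rather than exact:

```python
import math

# Assumptions: 8 halving rounds => n = 2**8 arms; naive PAC samples
# each arm (4 / eps**2) * ln(2 * n / delta) times (Even-Dar et al.)
n = 2 ** 8
eps, delta = 0.5, 1 / math.e

samples_per_arm = (4 / eps ** 2) * math.log(2 * n / delta)
total_samples = n * samples_per_arm
```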

5.

Which of these equalities/inequalities are correct?

  • i and iii
  • ii and iv
  • i, ii, iii
  • i, ii, iii, iv
Answer :- 

6.

Answer :- 

7. In the median-elimination method for (ϵ, δ)-PAC bounds, we claim that for every phase l, Pr[A ≤ B + ϵ_l] > 1 − δ_l. (S_l is the set of arms remaining in the l-th phase.)

Consider the following statements:

(i) A is the maximum of the rewards of the true best arm in S_l, i.e. in the l-th phase
(ii) B is the maximum of the rewards of the true best arm in S_{l+1}, i.e. in the (l+1)-th phase
(iii) B is the minimum of the rewards of the true best arm in S_{l+1}, i.e. in the (l+1)-th phase
(iv) A is the minimum of the rewards of the true best arm in S_l, i.e. in the l-th phase
(v) A is the maximum of the rewards of the true best arm in S_{l+1}, i.e. in the (l+1)-th phase
(vi) B is the maximum of the rewards of the true best arm in S_l, i.e. in the l-th phase

Which of the statements above are correct?

  • i and ii
  • iii and iv
  • v and vi
  • i and iii
Answer :- 

8. Which of the following statements is NOT true about Thompson Sampling or Posterior Sampling?

  • After each sample is drawn, the q∗ distribution for that sampled arm is updated to be closer to the true distribution.
  • Thompson sampling has been shown to generally give better regret bounds than UCB.
  • In Thompson sampling, we do not need to eliminate arms each round to get good sample complexity
  • The algorithm requires that we use Gaussian priors to represent distributions over q∗ values for each arm
Answer :- 
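As a concrete illustration that Thompson sampling does not require Gaussian priors, here is a minimal Beta-Bernoulli sketch (Bernoulli rewards with conjugate Beta posteriors; the arm probabilities and step count are made up for the example):

```python
import random

def thompson_bernoulli(true_probs, steps, seed=0):
    """Thompson sampling with Beta(1, 1) priors on Bernoulli arms."""
    rng = random.Random(seed)
    # (alpha, beta) posterior parameters per arm, starting from uniform priors
    alpha = [1.0] * len(true_probs)
    beta = [1.0] * len(true_probs)
    pulls = [0] * len(true_probs)
    for _ in range(steps):
        # Sample a plausible mean from each arm's posterior and play the best
        samples = [rng.betavariate(alpha[i], beta[i]) for i in range(len(true_probs))]
        arm = samples.index(max(samples))
        reward = 1 if rng.random() < true_probs[arm] else 0
        # Conjugate update: successes bump alpha, failures bump beta
        alpha[arm] += reward
        beta[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_bernoulli([0.3, 0.7], steps=500)
```

Because each posterior update uses the conjugate Beta-Bernoulli pair, no Gaussian assumption is involved, and the arm with the higher true mean ends up pulled far more often.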

9. Assertion: The confidence bound of each arm in the UCB algorithm cannot increase with iterations. Reason: The n_j term in the denominator ensures that the confidence bound remains the same for unselected arms and decreases for the selected arm.

  • Assertion and Reason are both true and Reason is a correct explanation of Assertion
  • Assertion and Reason are both true and Reason is not a correct explanation of Assertion
  • Assertion is true and Reason is false
  • Both Assertion and Reason are false
Answer :- 
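The Reason in question 9 can be probed numerically: in the UCB1 bonus sqrt(2 ln t / n_j), the total step count t keeps growing even for arms that are never selected, so it is worth checking how the bonus behaves while n_j stays fixed (a sketch assuming the UCB1 form of the bonus; the counts are made up):

```python
import math

def ucb_bonus(t, n_j):
    """UCB1 exploration bonus for an arm sampled n_j times after t steps."""
    return math.sqrt(2 * math.log(t) / n_j)

# An unselected arm: n_j stays at 10 while the total iteration count t grows
bonus_early = ucb_bonus(100, 10)
bonus_late = ucb_bonus(1000, 10)
```

Comparing the two values shows whether the bonus of an unselected arm really stays constant as t increases, which is exactly what the Reason claims.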

10.

Answer :- 