Machine Learning (ML) Reinforcement Learning Practice Questions
In the standard Reinforcement Learning framework, what do we call the entity that makes decisions and learns from the feedback provided by its surroundings?
The Agent is the decision-maker that interacts with the environment. It observes the current state, takes an action, and updates its strategy based on the resulting rewards it receives.
Consider a robot learning to walk. The table below represents its interaction loop. Which entry correctly identifies the Reward?
Component | Example from Robot Scenario
State | Joint angles and balance sensor data
Action | Sending voltage to a leg motor
Reward | ?
Option 2 is correct. Let's evaluate the components:
Numerical Feedback: The reward is a signal (scalar) that tells the agent how well it is doing; moving forward is the goal.
Option 1: This defines the "Action Space" available to the robot.
Option 3: Visual data is part of the "State" or "Observation."
Option 4: The floor is part of the "Environment" the robot interacts with.
How does the "Trial and Error" nature of Reinforcement Learning distinguish it from Supervised Learning?
Option 4 is correct. Here is the distinction:
Discovery: Unlike supervised learning, where the correct answer is provided, an RL agent must try various actions to see which leads to the highest reward.
Option 1: RL does not use labels; it uses rewards from the environment.
Option 2: The agent starts with no knowledge and must learn the path.
Option 3: This describes Unsupervised Learning (Clustering).
An AI is being trained to play a video game. It receives +100 points for winning a level and -1 point for every second that passes. What behavior is the -1 point penalty intended to encourage?
Option 1 is correct. This is a common technique in RL reward design:
Efficiency: By penalizing time, the agent learns that "faster is better" to maximize the total cumulative score.
Option 2: Time penalties actually discourage excessive exploration of irrelevant areas.
Option 3: While survival is important, a time penalty specifically targets speed.
Option 4: Pausing would accumulate more negative points, discouraging inaction.
In Reinforcement Learning, the term "State" refers to:
Option 2 is correct. The state is a fundamental concept:
Context: The state ($S_t$) provides the agent with all the necessary information about the environment at a specific moment to make a decision.
Option 1: This is often referred to as the "Terminal State" or "Goal."
Option 3: This refers to the RL agent's "Learning Algorithm" (e.g., Q-Learning).
Option 4: This is defined as the "Return" or "Total Reward."
In Reinforcement Learning, the strategy that an agent uses to determine its next action based on the current state is known as a:
Option 4 is correct. Here is the definition of a Policy:
Strategy: A policy (denoted as π) is a mapping from states to actions. It defines the agent's behavior at any given time.
Option 1: The reward signal defines the goal, not the strategy to reach it.
Option 2: The value function estimates how good a state is, rather than picking the action directly.
Option 3: State space is the set of all possible situations the agent can be in.
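As a minimal sketch of this idea, a tabular policy can be written as a plain Python mapping from states to actions; the state and action names below are hypothetical and chosen only for readability.

```python
# A minimal sketch of a policy: a mapping from states to actions.
# The state and action names are hypothetical, for illustration only.
policy = {
    "standing": "step_forward",
    "leaning_left": "shift_weight_right",
    "leaning_right": "shift_weight_left",
}

def select_action(state: str) -> str:
    """Return the action the policy prescribes for the given state."""
    return policy[state]

print(select_action("standing"))  # -> "step_forward"
```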
What is the primary difference between a Deterministic Policy and a Stochastic Policy?
Option 1 is correct. This describes how actions are selected:
Mapping: A deterministic policy always provides the same action for a specific state, while a stochastic policy gives a probability distribution over a set of actions.
Option 2: Both types of policies rely on states and the ultimate goal of maximizing rewards.
Option 3: Both types can be applied to any domain depending on the complexity of the environment.
Option 4: Both can be updated (learned) over time as the agent improves.
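The distinction can be sketched in a few lines of Python; the state, actions, and probabilities below are illustrative assumptions, not part of the question.

```python
import random

# Deterministic policy: the same state always maps to the same action.
deterministic_policy = {"at_fork": "go_left"}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_policy = {"at_fork": {"go_left": 0.7, "go_right": 0.3}}

def act_deterministic(state):
    return deterministic_policy[state]

def act_stochastic(state):
    actions, probs = zip(*stochastic_policy[state].items())
    return random.choices(actions, weights=probs, k=1)[0]

print(act_deterministic("at_fork"))  # always "go_left"
print(act_stochastic("at_fork"))     # "go_left" about 70% of the time
```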
A "Value Function" V(s) is used by an agent to estimate:
Option 2 is correct. Value functions look at the "long-term" potential:
Future Return: It predicts the cumulative future reward an agent can expect to receive starting from a particular state.
Option 1: This describes the "Immediate Reward," which is only one part of the value.
Option 3: This is determined by the "Action Space" of the environment.
Option 4: Distance is a specific feature, but the value function generalizes this into a numerical "goodness" score.
Consider the table below representing an agent's knowledge of two different states. If the agent follows a "Greedy" policy, which state will it move toward?
State | Expected Future Reward (Value) | Immediate Reward
State A | +50 | +2
State B | +10 | +10
Option 4 is correct. Greedy behavior in RL works as follows:
Maximization: A greedy agent always chooses the action or state that it believes has the highest long-term value, not just the highest immediate reward.
Option 1: Choosing only for immediate gain is "short-sighted" and not the standard definition of a value-based greedy choice.
Option 2: The number of actions does not determine the greedy choice.
Option 3: Greedy policies are entirely dependent on maximizing estimated values.
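The greedy choice over the table above can be written directly in code. This sketch copies the numbers from the question and keys the selection on long-term value rather than immediate reward.

```python
# Values copied from the table: estimated long-term value and immediate reward.
states = {
    "State A": {"value": 50, "immediate": 2},
    "State B": {"value": 10, "immediate": 10},
}

# A greedy (value-based) agent picks the state with the highest estimated value.
greedy_choice = max(states, key=lambda s: states[s]["value"])
print(greedy_choice)  # -> "State A"

# A purely short-sighted agent would instead key on the immediate reward.
myopic_choice = max(states, key=lambda s: states[s]["immediate"])
print(myopic_choice)  # -> "State B"
```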
What is the main purpose of the Q-Function (Action-Value Function), often denoted as Q(s, a)?
Option 3 is correct. This is a core concept in algorithms like Q-Learning:
Action-Specific: Unlike the $V(s)$ function which evaluates the state, $Q(s, a)$ evaluates the specific "action" taken while in that state.
Option 1: This describes a "Frequency Table" or "State Counter."
Option 2: This describes the "State Space" definition.
Option 4: This describes Dimensionality Reduction, not RL value estimation.
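In tabular methods, the Q-function is often nothing more than a lookup table from (state, action) pairs to estimated returns. The sketch below uses hypothetical states, actions, and values purely for illustration.

```python
# A minimal tabular Q-function: Q[(state, action)] -> estimated return.
# All entries here are hypothetical placeholder values.
Q = {
    ("maze_entrance", "go_left"): 2.5,
    ("maze_entrance", "go_right"): -0.5,
    ("maze_entrance", "wait"): 0.0,
}

def best_action(state, actions):
    """Pick the action with the highest Q-value for this state."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(best_action("maze_entrance", ["go_left", "go_right", "wait"]))  # -> "go_left"
```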
An agent has found a path in a maze that gives a reward of +10. Instead of taking that path again, the agent decides to try a different, unknown path to see if it leads to a reward of +100. This decision is an example of:
Option 3 is correct. Here is the logic of the trade-off:
Exploration: This involves trying unfamiliar actions to gather more information about the environment, potentially discovering better rewards.
Option 1: Exploitation would be choosing the known +10 path to guarantee a reward.
Option 2: Generalization refers to applying learned knowledge to new, similar states.
Option 4: Overfitting is a supervised learning concept where a model learns noise instead of the signal.
In the ε-greedy (epsilon-greedy) strategy, what happens as the value of ε increases?
Option 2 is correct. Epsilon controls the randomness:
Randomness: $\epsilon$ is the probability of exploring, so a larger $\epsilon$ means more random, exploratory actions. If $\epsilon = 0.1$, the agent explores 10% of the time and exploits 90% of the time.
Option 1: This would happen if $\epsilon$ decreased toward zero.
Option 3: Exploration is a key part of the learning process, not the end of it.
Option 4: $\epsilon$ is an agent hyperparameter and does not change the environment's rewards.
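A minimal ε-greedy selector might look like the sketch below, assuming a tabular Q-function like the one shown earlier; raising epsilon directly raises the fraction of random, exploratory actions.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon explore (random action); otherwise exploit."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit

# With epsilon = 0.1 the agent explores roughly 10% of the time;
# pushing epsilon toward 1.0 makes its behaviour almost entirely random.
```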
Why is it usually a bad idea for an agent to perform 100% Exploitation from the very beginning of training?
Option 4 is correct. This is the "Local Optima" problem:
Missing Information: Without exploration, the agent only knows what it has already tried. It may settle for a mediocre strategy because it never discovered the superior one.
Option 1: Exploitation speed is determined by hardware, not the strategy type.
Option 2: Exploitation (choosing the max value) is actually computationally simpler than exploring.
Option 3: Exploitation is based on remembering and using known rewards.
Review the comparison table of two different agent behaviors:
Behavior | Description | Priority
Strategy X | Uses current knowledge to get the best reward. | Short-term gain.
Strategy Y | Gathers new information about the environment. | Long-term improvement.
Based on this table, which statement is true?
Option 1 is correct. The table perfectly summarizes the trade-off:
Comparison: Exploitation (X) maximizes immediate performance, while Exploration (Y) invests in information that may lead to better future performance.
Option 2: The definitions are reversed in this option.
Option 3: This trade-off is a unique characteristic of Reinforcement Learning.
Option 4: Strategy Y is actually most useful when the environment is unknown.
What is the common practice regarding the ε (epsilon) value as an agent trains over millions of steps?
Option 3 is correct. This is known as "Epsilon Decay":
Transition: Early on, the agent knows nothing, so it should explore. As it learns, it should "decay" epsilon to exploit its high-quality knowledge.
Option 1: Increasing it would make a smart agent act randomly and lose its high rewards.
Option 2: Constant 1.0 would mean the agent acts randomly forever and never applies what it learned.
Option 4: This would prevent the agent from ever finding better paths after the first one.
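One common implementation is exponential epsilon decay with a floor, so the agent never stops exploring entirely; the hyperparameter values below are arbitrary placeholders.

```python
# Illustrative hyperparameters; real values are tuned per problem.
epsilon_start, epsilon_min, decay_rate = 1.0, 0.05, 0.999

epsilon = epsilon_start
for step in range(1_000_000):
    # ... select an action with an epsilon-greedy rule using `epsilon` ...
    epsilon = max(epsilon_min, epsilon * decay_rate)  # explore less as training progresses
```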
In Reinforcement Learning, the Discount Factor (γ), usually a value between 0 and 1, is used to:
Option 2 is correct. The discount factor balances short-term and long-term goals:
Time Value: A low γ (near 0) makes the agent "myopic" (short-sighted), focusing only on immediate rewards, while a high γ (near 1) makes it "farsighted."
Option 1: This is a physical constraint, not a mathematical discount.
Option 3: This describes data splitting in Supervised Learning.
Option 4: The action space size is independent of the reward discount.
Which of the following best describes the Markov Property, which is the foundation of a Markov Decision Process (MDP)?
Option 4 is correct. This is the "memoryless" property of MDPs:
Independence: It assumes the current state $S_t$ contains all the relevant information from the past needed to predict the next state $S_{t+1}$.
Options 1 & 2: These contradict the Markov property by requiring full historical memory.
Option 3: MDPs are probabilistic, but they assume a structured transition model based on actions.
An agent is training with a Discount Factor ($\gamma$) of 0.0. How will this agent behave during its task?
Option 1 is correct. Mathematically, γ=0 zeroes out all future rewards:
Myopia: The agent becomes completely focused on the "now," ignoring any potential high-value outcomes that require more than one step to reach.
Option 2: This would require a high γ (e.g., 0.99).
Option 3: The agent still tries to maximize the current reward, so it isn't moving randomly.
Option 4: The agent still values the immediate reward ($R_{t+1}$), so it will still act.
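The effect of γ is easy to see by computing the discounted return for the same reward sequence under different discount factors; the reward list below is an arbitrary example.

```python
def discounted_return(rewards, gamma):
    """G = r1 + gamma * r2 + gamma**2 * r3 + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [1, 0, 0, 100]                 # a big payoff arrives three steps later
print(discounted_return(rewards, 0.0))   # 1.0        -> myopic: the future is ignored
print(discounted_return(rewards, 0.9))   # about 73.9 -> farsighted: the future dominates
```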
Consider the following MDP transition table for a simple grid world:
Current State | Action | Next State | Probability | Reward
Square 1 | Move Right | Square 2 | 0.8 | +1
Square 1 | Move Right | Square 1 (Slip) | 0.2 | -1
What does the 0.2 Probability represent in this environment?
Option 3 is correct. This represents a stochastic environment:
Uncertainty: In many RL tasks, an action doesn't always lead to the intended result; the probability defines the dynamics of the world.
Option 1: Policy is the agent's choice; here, the agent already chose to move right.
Option 2: Discount factors are constant values, not transition probabilities.
Option 4: The value function is an estimate of total reward, not a probability of a single transition.
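A stochastic transition like the one in the table can be simulated by sampling the next state according to its probability. The sketch below copies the numbers from the table into a small transition model.

```python
import random

# Transition model from the table: (state, action) -> [(next_state, prob, reward), ...]
transitions = {
    ("Square 1", "Move Right"): [
        ("Square 2", 0.8, +1),   # intended move succeeds
        ("Square 1", 0.2, -1),   # slip: the agent stays where it was
    ],
}

def step(state, action):
    """Sample a next state and reward according to the transition probabilities."""
    outcomes = transitions[(state, action)]
    i = random.choices(range(len(outcomes)), weights=[p for _, p, _ in outcomes])[0]
    next_state, _, reward = outcomes[i]
    return next_state, reward

print(step("Square 1", "Move Right"))  # slips back roughly 20% of the time
```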
In a Model-Free Reinforcement Learning algorithm, how does the agent learn to interact with the environment?
Option 2 is correct. This is how most popular algorithms like Q-Learning work:
Experience: The agent doesn't try to "understand" the rules of the world; it just learns which actions work through trial and error.
Option 1: This describes "Model-Based" RL.
Option 3: This avoids the learning process entirely and is often impossible for complex tasks.
Option 4: This describes Supervised Learning.
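Q-Learning is the canonical model-free example: the agent never learns the environment's transition probabilities, it only updates Q-values from sampled experience. The sketch below follows the standard tabular update rule; the learning rate and discount factor are illustrative.

```python
from collections import defaultdict

alpha, gamma = 0.1, 0.95     # illustrative learning rate and discount factor
Q = defaultdict(float)       # Q[(state, action)] defaults to 0.0

def q_learning_update(state, action, reward, next_state, actions):
    """Tabular Q-Learning: nudge Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```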
Quick Recap of Machine Learning (ML) Reinforcement Learning Concepts
If you are not clear on the concepts of Reinforcement Learning, you can quickly review them here before practicing the exercises. This recap highlights the essential points and logic to help you solve problems confidently.
Foundations of Reinforcement Learning Concepts
Reinforcement Learning (RL) is a machine learning paradigm where a system called an agent learns to make decisions by interacting with an environment. Instead of learning from labeled examples, the agent learns from experience by receiving rewards or penalties for its actions. The goal is to learn a strategy, called a policy, that maximizes total reward over time.
Core Elements of Reinforcement Learning Systems
Component | Description
Agent | The learner or decision maker
Environment | The system the agent interacts with
State (S) | The current situation of the agent
Action (A) | A choice the agent can make
Reward (R) | Feedback from the environment
The interaction cycle is $s_t \to a_t \to r_t \to s_{t+1}$: the agent observes a state, takes an action, receives a reward, and moves to a new state.
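That cycle maps directly onto a standard agent-environment loop. The sketch below assumes hypothetical env and agent objects with reset/step and act/learn methods; it is a schematic, not the interface of any particular library.

```python
# A generic agent-environment loop; `env` and `agent` are hypothetical objects
# standing in for any concrete implementation.
def run_episode(env, agent):
    state = env.reset()                                  # observe the initial state
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                        # choose a_t from the policy
        next_state, reward, done = env.step(action)      # receive r_t and s_{t+1}
        agent.learn(state, action, reward, next_state)   # update the agent's knowledge
        state = next_state
        total_reward += reward
    return total_reward
```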
Markov Decision Process and Environment Modeling
Reinforcement Learning problems are modeled using a Markov Decision Process (MDP):
MDP = (S, A, P, R, γ)
Symbol | Meaning
S | All possible states
A | All possible actions
P(s'|s,a) | Probability of transitioning to the next state
R(s,a) | Reward function
γ | Discount factor for future rewards
The Markov property means the future depends only on the current state, not the full history.
Return Function and Discounted Reward Optimization
The agent seeks to maximize the total discounted reward, called the return:
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots$
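In practice, the return for every time step of an episode is usually computed in a single backward sweep using the recursion $G_t = R_{t+1} + \gamma G_{t+1}$. The sketch below applies that recursion to an arbitrary example reward list.

```python
def returns_from_rewards(rewards, gamma):
    """Compute G_t for every step by sweeping the episode backwards."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G            # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(G)
    return list(reversed(out))

print(returns_from_rewards([0, 0, 1], gamma=0.9))  # about [0.81, 0.9, 1.0]
```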
Policy, State Value, and Action Value Functions
A policy π(a|s) defines how the agent behaves in a given state.
The state-value function is $V^{\pi}(s) = \mathbb{E}[G_t \mid s_t = s]$.
The action-value function is $Q^{\pi}(s,a) = \mathbb{E}[G_t \mid s_t = s, a_t = a]$.
Bellman Optimality Equation
The Bellman equation expresses recursive optimal decision making:
$V(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, V(s') \right]$
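Applying the right-hand side of this equation repeatedly as an update rule gives value iteration. The sketch below assumes the MDP is given as small dictionaries in the same shape as the (S, A, P, R, γ) tuple above; it is a schematic, not a production solver.

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    """P[(s, a)] -> list of (next_state, prob); R[(s, a)] -> immediate reward."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman update: best one-step lookahead over all actions.
            best = max(
                R[(s, a)] + gamma * sum(p * V[s2] for s2, p in P[(s, a)])
                for a in actions
            )
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            return V
```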
Exploration vs. Exploitation Strategy
An RL agent must balance:
Exploration — trying new actions
Exploitation — choosing the best-known action
A common strategy is ε-greedy, where the agent selects a random action with probability $\epsilon$ to keep learning.
Major Categories of Reinforcement Learning Algorithms
Type | Description
Model-Based | Learns how the environment behaves
Model-Free | Learns directly from experience
Value-Based | Optimizes V or Q values
Policy-Based | Optimizes the policy directly
Actor-Critic | Uses both value and policy learning
Real World Applications of Reinforcement Learning
Game playing such as Chess, Go, and video games
Robotics and autonomous systems
Self-driving vehicles
Financial trading and portfolio management
Recommendation systems
Industrial process control
Summary of Reinforcement Learning
Reinforcement Learning teaches machines how to make decisions by interacting with an environment and learning from rewards. Using states, actions, policies, and value functions, the agent gradually improves its behavior to achieve long-term success.
Key Takeaways for Reinforcement Learning
Reinforcement Learning learns from rewards instead of labeled data
It is modeled using Markov Decision Processes
Policies determine how actions are chosen
Value and Q functions evaluate long-term success
Bellman equations define optimal decision making
About This Exercise: Reinforcement Learning
Reinforcement Learning is a unique and powerful type of machine learning where an agent learns by interacting with an environment and receiving feedback in the form of rewards or penalties. In this Solviyo exercise, you will explore how reinforcement learning works through interactive MCQs and real-world inspired scenarios.
Unlike supervised or unsupervised learning, reinforcement learning focuses on decision-making over time. The goal is to learn a strategy, or policy, that maximizes long-term rewards. This approach is widely used in robotics, game-playing AI, recommendation systems, and autonomous systems.
What You’ll Learn in Reinforcement Learning
How agents interact with environments in reinforcement learning
The role of rewards, penalties, and feedback
How actions influence future outcomes
Key terms like states, actions, and policies
Real-world examples such as self-driving cars and game AI
How Reinforcement Learning Works
Reinforcement learning models improve by trial and error. An agent takes actions, observes the results, and adjusts its behavior based on the reward received. Over time, the system learns which actions lead to the best outcomes.
In this exercise, you will practice understanding concepts such as exploration vs exploitation, delayed rewards, and optimal decision-making strategies through MCQs designed for clear conceptual learning.
Why Practice Reinforcement Learning MCQs
Reinforcement learning can be difficult to grasp without structured practice. Solviyo’s MCQs help break down complex ideas into easy-to-understand questions that connect theory with real-world AI behavior.
These exercises also help prepare you for machine learning exams, AI interviews, and advanced topics such as deep reinforcement learning and autonomous systems.
Who Should Practice This Topic
Students learning machine learning and artificial intelligence
Beginners exploring how AI systems make decisions
Aspiring ML engineers and robotics enthusiasts
Professionals preparing for AI or ML assessments
Why Learn Reinforcement Learning on Solviyo
Solviyo provides structured reinforcement learning MCQ exercises that focus on building real understanding. With clear explanations and practical scenarios, you will learn how intelligent systems improve their decisions through feedback and experience.
Practicing reinforcement learning on Solviyo gives you a strong foundation for advanced AI topics, including robotics, game AI, and autonomous systems.
Start Practicing Reinforcement Learning Today
Explore the world of intelligent decision-making with Solviyo’s interactive reinforcement learning exercises. Practice consistently, track your progress, and build confidence in one of the most exciting areas of machine learning.