To illustrate the core concepts of Reinforcement Learning, we’ll use a very simplified example in Python. Imagine an agent trying to learn the best way to navigate a small grid world to reach a goal.
1. The Environment
Our environment will be a 1D grid with a starting point and a goal; the reward function defined below adds a small penalty for every step the agent takes.
# Define the environment
environment = ['S', '-', '-', 'G'] # S: Start, -: Empty, G: Goal
goal_position = 3
start_position = 0
current_position = start_position
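If you want to see the agent on the grid while experimenting, a small helper like the following can render its position (this helper is an illustrative addition, not part of the original listing; the demonstration at the end builds the same view inline):

def render(position):
    # Copy the grid and mark the agent's cell with 'A'
    grid = environment.copy()
    grid[position] = 'A'
    print(grid)

render(current_position)  # ['A', '-', '-', 'G']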
2. The Agent
Our agent can move left or right. Its initial policy simply picks one of those actions at random.
# Define the agent's possible actions
actions = ['left', 'right']
# A simple policy (initially random)
import random
def choose_action(position):
    return random.choice(actions)
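Because the initial policy is random, calling it simply returns one of the two actions with equal probability:

print(choose_action(start_position))  # 'left' or 'right', chosen at random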
3. States and Actions
The state is the agent’s current position on the grid. The actions are ‘left’ and ‘right’.
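As a small illustration (the states list here is an addition for clarity, not part of the original listing), the state space is just the set of grid indices, and the action space is the two moves defined above:

# States are the grid indices; actions were defined in step 2
states = list(range(len(environment)))
print(states)   # [0, 1, 2, 3]
print(actions)  # ['left', 'right']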
4. Reward
The agent receives a positive reward when it reaches the goal and a small negative reward for every other step, which encourages it to reach the goal as quickly as possible.
# Define the reward function
def get_reward(position):
    if position == goal_position:
        return 10  # Positive reward for reaching the goal
    else:
        return -1  # Small negative reward for each step
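For example, the reward function returns the step penalty everywhere except at the goal:

print(get_reward(1))              # -1 (ordinary step)
print(get_reward(goal_position))  # 10 (goal reached)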
5. Learning (Simplified Q-Learning Idea)
We’ll use the basic idea behind Q-learning: the agent learns a “Q-value” for each state-action pair, representing the expected future reward of taking that action in that state. Initially, all Q-values are zero.
# Initialize Q-values (state, action)
q_table = {}
for i in range(len(environment)):
    q_table[i] = {'left': 0, 'right': 0}
# Learning parameters
learning_rate = 0.1
discount_factor = 0.9
epsilon = 0.1 # Exploration-exploitation trade-off
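The update applied inside the loop below is the standard tabular Q-learning rule, which is exactly what the code computes with learning_rate and discount_factor:

Q(s, a) ← (1 − learning_rate) · Q(s, a) + learning_rate · (reward + discount_factor · max over a' of Q(s', a'))

Here s is the state the action was taken in and s' is the state reached afterwards; at the goal the max term is treated as 0, because there is no future reward left to collect.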
6. The Learning Loop
The agent will interact with the environment for several episodes (attempts to reach the goal).
def learn(episodes=100):
    global current_position, q_table
    for episode in range(episodes):
        current_position = start_position
        done = False
        while not done:
            # Exploration-exploitation
            if random.random() < epsilon:
                action = choose_action(current_position)  # Explore
            else:
                # Exploit: Choose the action with the highest Q-value
                action = max(q_table[current_position], key=q_table[current_position].get)
            # Take the action and observe the next state and reward
            old_position = current_position
            if action == 'left':
                current_position = max(0, current_position - 1)
            elif action == 'right':
                current_position = min(len(environment) - 1, current_position + 1)
            reward = get_reward(current_position)
            # Update Q-value (simplified)
            old_q_value = q_table[old_position][action]
            max_next_q = max(q_table[current_position].values()) if current_position != goal_position else 0
            new_q_value = (1 - learning_rate) * old_q_value + learning_rate * (reward + discount_factor * max_next_q)
            q_table[old_position][action] = new_q_value
            if current_position == goal_position:
                done = True
        if (episode + 1) % 20 == 0:
            print(f"Episode {episode + 1}: Reached goal")
# Run the learning process
learn(200)
print("\nLearned Q-values:")
print(q_table)
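To read the table at a glance, you can also derive the greedy policy directly (an optional addition, not in the original listing; the goal state's tie between its two zero values resolves arbitrarily):

# Best action per state according to the learned Q-values
policy = {state: max(q_values, key=q_values.get) for state, q_values in q_table.items()}
print(policy)  # e.g. {0: 'right', 1: 'right', 2: 'right', 3: 'left'}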
# Demonstrate the learned policy
current_position = start_position
print("\nDemonstrating learned policy:")
print(environment)
for _ in range(10):  # Limit steps for demonstration
    action = max(q_table[current_position], key=q_table[current_position].get)
    print(f"Agent at position {current_position}, choosing action: {action}")
    if action == 'left':
        current_position = max(0, current_position - 1)
    elif action == 'right':
        current_position = min(len(environment) - 1, current_position + 1)
    print(environment[:current_position] + ['A'] + environment[current_position+1:])  # A for Agent
    if current_position == goal_position:
        print("Reached Goal!")
        break
7. Output of the Learning Process
After training, the q_table will contain the learned Q-values. The demonstration will show the agent following the learned policy to reach the goal.
Episode 20: Reached goal
Episode 40: Reached goal
Episode 60: Reached goal
Episode 80: Reached goal
Episode 100: Reached goal
Episode 120: Reached goal
Episode 140: Reached goal
Episode 160: Reached goal
Episode 180: Reached goal
Episode 200: Reached goal
Learned Q-values:
{0: {'left': 0.0, 'right': 4.685592048801086}, 1: {'left': 0.5206213387556763, 'right': 6.317325599446087}, 2: {'left': 1.792402877711148, 'right': 8.130369554940097}, 3: {'left': 0, 'right': 0}}
Demonstrating learned policy:
['S', '-', '-', 'G']
Agent at position 0, choosing action: right
['S', 'A', '-', 'G']
Agent at position 1, choosing action: right
['S', '-', 'A', 'G']
Agent at position 2, choosing action: right
['S', '-', '-', 'A']
Reached Goal!
Explanation of the Code
- Environment: A simple list representing the grid.
- Agent: Makes decisions (left or right).
- Q-table: Stores the learned values for each state-action pair.
- Reward: Positive for reaching the goal, negative for each step.
- Learning: The learn function simulates episodes of the agent interacting with the environment. It uses a basic Q-learning update rule (simplified for clarity) to adjust the Q-values based on the rewards received.
- Exploration vs. Exploitation: The agent sometimes chooses a random action (exploration) to discover new possibilities and sometimes chooses the action with the highest current Q-value (exploitation) to make progress (see the standalone sketch after this list).
- Demonstration: After learning, the agent follows the policy derived from the learned Q-values to reach the goal.
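The epsilon-greedy choice used inside learn can also be written as a standalone helper, which makes the exploration-exploitation trade-off explicit (this helper is an illustrative addition, not part of the original listing):

def epsilon_greedy(position, eps=epsilon):
    # With probability eps pick a random action (explore),
    # otherwise pick the action with the highest current Q-value (exploit)
    if random.random() < eps:
        return random.choice(actions)
    return max(q_table[position], key=q_table[position].get)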
This is a highly simplified example. Real-world Reinforcement Learning problems involve much more complex environments, states, actions, and learning algorithms, often utilizing deep neural networks to handle large state and action spaces.
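As a pointer in that direction, here is a minimal, hypothetical sketch (assuming PyTorch is available) of what replacing the Q-table with a small neural network could look like: the network maps a one-hot encoded state to one Q-value estimate per action. It covers only the function-approximation piece, not a full deep Q-learning training loop.

import torch
import torch.nn as nn

n_states = len(environment)   # 4 grid cells
n_actions = len(actions)      # 'left', 'right'

# A tiny Q-network: one-hot state in, one Q-value estimate per action out
q_network = nn.Sequential(
    nn.Linear(n_states, 16),
    nn.ReLU(),
    nn.Linear(16, n_actions),
)

def q_values(position):
    # One-hot encode the grid position and query the network
    state = torch.zeros(n_states)
    state[position] = 1.0
    return q_network(state)

print(q_values(0))  # two Q-value estimates (untrained, so essentially arbitrary)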