DQN vs PPO: Which Reinforcement Learning Algorithm Lands Better?

Using the LunarLander-v3 environment from Gymnasium (the maintained successor to OpenAI Gym), I trained two AI agents from scratch: one using Deep Q-Network (DQN), a value-based method that learns which actions lead to the most reward, and the other using Proximal Policy Optimization (PPO), a policy-gradient method that directly learns which action to take in each state. I tracked their learning curves, reward scores, fuel usage, and even visual behaviors to determine which agent could not only land but do so gracefully, efficiently, and reliably.

This week in Project52, I took on one of the most exciting challenges yet: a direct face-off between two of the most powerful reinforcement learning (RL) algorithms — Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). My mission? Train two AI agents to master the art of lunar landing, then compare their performance.

In this detailed breakdown, I'll walk through what these algorithms are, how they differ, why they behave the way they do, and which one turned out to be more effective for this kind of task. Buckle up — this is a deep dive into reinforcement learning, explained from the ground up.

🚀 The Mission: Land the Lander

I used the LunarLander-v3 environment, where a lander must safely touch down between two flags on a moon-like surface. The agent has 4 possible actions:

  • Fire left engine

  • Fire right engine

  • Fire main engine (downward thrust)

  • Do nothing

The environment provides an 8-dimensional state:

  • X and Y positions

  • X and Y velocities

  • Angle

  • Angular velocity

  • Left leg contact

  • Right leg contact

The goal: land softly with both legs touching down, using minimal fuel.
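
To make that concrete, here is a minimal sketch of setting up the environment and running one episode with random actions. It assumes Gymnasium with the Box2D extra installed, and it is for illustration only, not my exact training script:

```python
import gymnasium as gym

# LunarLander-v3 needs the Box2D extra: pip install "gymnasium[box2d]"
env = gym.make("LunarLander-v3")

print(env.observation_space)  # Box with 8 dimensions (positions, velocities, angle, leg contacts)
print(env.action_space)       # Discrete(4): do nothing, left engine, main engine, right engine

# One episode with purely random actions, just to see the reward signal
obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # pick a random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Random policy episode return: {total_reward:.1f}")
env.close()
```

A random agent usually finishes an episode deep in negative reward, which is the baseline both trained agents have to beat.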

🧐 What is Reinforcement Learning?

Reinforcement Learning (RL) is a training method where an agent learns to make decisions by interacting with an environment. It receives rewards for good actions and penalties for bad ones. Over time, it learns which actions lead to better long-term outcomes.

Imagine training a dog with treats: you give it a reward when it does something right, and it eventually learns what behavior earns the treat. That’s the basic idea behind RL.

In this case, the AI agent is the "dog", and the lander environment is the "world" it must navigate.

🔄 Algorithm 1: Deep Q-Network (DQN)

✅ How It Works

DQN is a value-based RL algorithm. It learns a function called the Q-function, which estimates the expected total (discounted) future reward for taking a given action in a given state and acting well afterwards.

✨ What is a Q-Value?

A Q-value answers the question:

“If I’m in this situation (state), and I take this action, how much total reward can I expect in the future?”

DQN tries to learn Q-values for every (state, action) pair. When acting greedily, it then picks the action with the highest Q-value.
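
As an illustration (using PyTorch purely as a sketch, not my actual code), a Q-network for this task maps the 8-dimensional state to 4 Q-values, one per action, and greedy action selection is just an argmax:

```python
import torch
import torch.nn as nn

# A small Q-network: 8-dimensional state in, one Q-value per discrete action out
q_net = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 4),  # Q(s, a) for each of the 4 actions
)

state = torch.randn(1, 8)                      # a dummy state, just for illustration
q_values = q_net(state)                        # shape (1, 4)
greedy_action = q_values.argmax(dim=1).item()  # the action with the highest Q-value
```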

🔧 Training Process

  1. Experience Replay: DQN stores its gameplay experiences (state, action, reward, next state) in a replay buffer.

  2. Mini-batch Training: It samples random experiences from this buffer to train a neural network.

  3. Target Network: A separate, periodically updated copy of the Q-network provides the learning targets, which helps stabilize training.

  4. ε-greedy Exploration: It usually picks the best action, but sometimes tries a random one to explore (a sketch of one training step follows this list).
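
Putting those pieces together, one training step might look like the sketch below. The batch layout, the q_target network, and the epsilon_greedy helper are illustrative assumptions, not the exact code I ran:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99   # discount factor
EPSILON = 0.1  # exploration rate (usually decayed over training)

def dqn_update(q_net, q_target, optimizer, batch):
    # batch: tensors sampled uniformly from the replay buffer
    # (actions as int64, dones as 0/1 floats)
    states, actions, rewards, next_states, dones = (
        batch["states"], batch["actions"], batch["rewards"],
        batch["next_states"], batch["dones"],
    )

    # Q-values the online network assigns to the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target from the frozen target network: r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        next_q = q_target(next_states).max(dim=1).values
        target = rewards + GAMMA * (1.0 - dones) * next_q

    loss = F.mse_loss(q_sa, target)  # many implementations use a Huber loss instead
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def epsilon_greedy(q_net, state, epsilon=EPSILON):
    # With probability epsilon take a random action, otherwise the greedy one
    if torch.rand(1).item() < epsilon:
        return torch.randint(0, 4, (1,)).item()
    return q_net(state.unsqueeze(0)).argmax(dim=1).item()
```

Calling dqn_update on batches sampled from the replay buffer, and periodically copying q_net's weights into q_target, covers points 1 through 3; epsilon_greedy covers point 4.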

🚫 Limitations

  • Only works with discrete action spaces (can’t handle continuous actions).

  • Can be less stable in noisy environments, partly because the max in the Q-update tends to overestimate values.

🧹 Algorithm 2: Proximal Policy Optimization (PPO)

✅ How It Works

PPO is a policy-based algorithm. Instead of learning the value of actions, it directly learns a policy — a function that tells it which action to take in each state.

✨ What is a Policy?

A policy is the agent's decision-making rule: a function from State → Action. PPO builds a neural network that outputs a probability distribution over actions.

So instead of picking the best action like DQN, it samples an action from its learned distribution.
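
A minimal PyTorch-style sketch of such a policy network, again for illustration only:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Policy network: 8-dimensional state in, logits for the 4 discrete actions out
policy_net = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 4),
)

state = torch.randn(1, 8)                     # dummy state for illustration
dist = Categorical(logits=policy_net(state))
action = dist.sample()                        # sampled, not argmax
log_prob = dist.log_prob(action)              # saved for the PPO update later
```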

💲 What Are Rollouts?

PPO plays full episodes and records what happened:

  • States visited

  • Actions taken

  • Rewards received

  • Whether the episode ended in a crash or landing

This full history is called a rollout.
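
Collecting a rollout might look like the sketch below, reusing the hypothetical policy_net from above with a Gymnasium environment; the buffer layout is an assumption for illustration:

```python
import torch
from torch.distributions import Categorical

def collect_rollout(env, policy_net, max_steps=2048):
    # Buffers for everything PPO needs from this batch of experience
    states, actions, log_probs, rewards, dones = [], [], [], [], []

    obs, info = env.reset()
    for _ in range(max_steps):
        state = torch.as_tensor(obs, dtype=torch.float32)
        dist = Categorical(logits=policy_net(state))
        action = dist.sample()

        next_obs, reward, terminated, truncated, info = env.step(action.item())

        states.append(state)
        actions.append(action)
        log_probs.append(dist.log_prob(action).detach())
        rewards.append(reward)
        dones.append(terminated or truncated)

        obs = next_obs
        if terminated or truncated:
            obs, info = env.reset()  # start a new episode within the same rollout

    return states, actions, log_probs, rewards, dones
```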

✂️ What is a Clipped Loss Function?

PPO doesn’t want to make wild updates. So it limits how much the policy can change in a single update by clipping the ratio between the new and old action probabilities. If an update would push the policy too far, the extra change simply stops contributing to the objective.

This makes PPO more stable and less likely to forget everything it learned.
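
The clipped objective itself is short. In the sketch below, old_log_probs come from the policy that collected the rollout, new_log_probs from the policy being updated, and the advantages are assumed to be precomputed (for example with GAE); the names are illustrative:

```python
import torch

CLIP_EPS = 0.2  # how far the new policy is allowed to move from the old one

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages):
    # Probability ratio between the updated policy and the one that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate objectives
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages

    # Take the pessimistic minimum and negate, since we minimize the loss
    return -torch.min(unclipped, clipped).mean()
```

Taking the minimum keeps the update pessimistic: the policy only gets credit for improvements that stay inside the range allowed by CLIP_EPS.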

🌎 Environment Types

🔹 Low-Dimensional, Discrete (DQN excels)

  • Inputs are small vectors (8 numbers)

  • Actions are fixed choices (left, right, main, nothing)

Examples:

  • Lunar Lander

  • CartPole

  • Atari Pong (discrete actions, though its inputs are image frames rather than a small state vector)

🌌 High-Dimensional, Continuous (PPO shines)

  • Inputs are images, joint angles, video frames

  • Actions can be any real number (like steering angles)

Examples:

  • Robotics

  • Drone flight

  • Self-driving cars

📊 The Results: DQN vs PPO

I trained both models for 100,000 timesteps each.
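
One straightforward way to reproduce a run like this is with the Stable-Baselines3 implementations of both algorithms; the snippet below is a sketch of the setup, not my exact training and logging script:

```python
import gymnasium as gym
from stable_baselines3 import DQN, PPO

env = gym.make("LunarLander-v3")

# Value-based agent
dqn_model = DQN("MlpPolicy", env, verbose=0)
dqn_model.learn(total_timesteps=100_000)
dqn_model.save("dqn_lunarlander")

# Policy-gradient agent
ppo_model = PPO("MlpPolicy", env, verbose=0)
ppo_model.learn(total_timesteps=100_000)
ppo_model.save("ppo_lunarlander")
```

Here's what I observed: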

🏋️ Reward

  • DQN reached positive rewards early and showed a smooth upward trend.

  • It quickly learned to control the lander, stabilize its descent, and land between the flags.

  • PPO, on the other hand, had a rough start. Its reward trajectory showed more volatility and longer periods of negative scores.

  • PPO required more training to converge on a usable landing strategy. Even then, its policy would occasionally pick suboptimal actions due to its stochastic nature.

⛽ Fuel Usage

  • DQN’s action selection became more fuel-efficient over time.

  • Early in training, both models fired thrusters excessively, but DQN began minimizing its burns as it learned a stable hover technique.

  • PPO, although improving later, often sacrificed fuel efficiency in exchange for maintaining stability.

🧼 Bounces

  • The bounce count remained near zero for both algorithms.

  • On its own, a near-zero bounce count is ambiguous: it can reflect clean soft landings or crashes that end the episode before any bounce registers.

  • Since I observed successful landings visually, I can infer the agents often landed smoothly without bouncing.

📊 Visual Behavior

  • At step 20,000, DQN was already stabilizing mid-air.

  • By 50,000 steps, it was consistently hovering, rotating upright, and descending slowly.

  • PPO was still erratic at step 50,000. It frequently overcorrected or drifted horizontally.

  • Only after 80,000+ steps did PPO begin to show signs of coherent strategy.

⌛ Stability and Generalization

  • DQN converged quickly and retained its policy well.

  • PPO had better long-term flexibility, but its exploration sometimes led it to temporarily forget good policies.

🏆 Final Verdict

Category              Winner
Reward Stability      DQN
Convergence Speed     DQN
Fuel Efficiency       DQN
Bounce Control        Tie
Policy Stability      PPO
Future Scalability    PPO

Winner: DQN for Lunar Lander

In this specific environment — one with low-dimensional inputs, discrete actions, and dense rewards — DQN outperformed PPO across the board.

PPO remains the more general-purpose algorithm: it scales better to complex environments with high-dimensional observations or continuous action spaces. But in this contest, simplicity and specialization won.

🚀 What This Taught Me

  • Not all RL algorithms fit all problems

  • DQN is highly effective for small, structured environments

  • PPO is powerful, but may need more time and tuning

  • Always track performance visually and analytically

  • Logs, graphs, and side-by-side comparisons reveal insights no reward number alone can

📄 Final Thoughts

This project gave me a deeper appreciation for how different RL algorithms behave under similar conditions. While DQN showed superior performance in this controlled and structured environment, PPO reminded me of the importance of flexibility and long-term adaptability. The key takeaway? No single algorithm is universally best — it always depends on the nature of the problem. Understanding how each algorithm learns, explores, and optimizes is just as important as the final result. And seeing it all unfold through real-time visuals, reward tracking, and fuel analytics made the learning all the more insightful.

-Atul Verma
Creator, Project52 🚀