DQN vs PPO: Which Reinforcement Learning Algorithm Lands Better?

Using the LunarLander-v3 environment from Gymnasium (the maintained successor to OpenAI Gym), I trained two AI agents from scratch: one using Deep Q-Network (DQN), a value-based method that learns which actions lead to the most reward, and the other using Proximal Policy Optimization (PPO), a policy-gradient method that directly learns which action to take in each state. I tracked their learning curves, reward scores, fuel usage, and even visual behaviors to determine which agent could not only land but do so gracefully, efficiently, and reliably.

This week in Project52, I took on one of the most exciting challenges yet: a direct face-off between two of the most powerful reinforcement learning (RL) algorithms — Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). My mission? Train two AI agents to master the art of lunar landing, then compare their performance.

In this detailed breakdown, I'll walk through what these algorithms are, how they differ, why they behave the way they do, and which one turned out to be more effective for this kind of task. Buckle up — this is a deep dive into reinforcement learning, explained from the ground up.

🚀 The Mission: Land the Lander

I used the LunarLander-v3 environment, where a lander must safely touch down between two flags on a moon-like surface. The agent has 4 possible actions:

  • Fire left engine

  • Fire right engine

  • Fire main engine (downward thrust)

  • Do nothing

The environment provides an 8-dimensional state:

  • X and Y positions

  • X and Y velocities

  • Angle

  • Angular velocity

  • Left leg contact

  • Right leg contact

The goal: land softly with both legs touching down, using minimal fuel.
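
To make that concrete, here is a minimal sketch of setting up the environment and running one episode with random actions. It assumes Gymnasium with the Box2D extra installed, and it is for illustration only, not my exact training script:

```python
import gymnasium as gym

# LunarLander-v3 needs the Box2D extra: pip install "gymnasium[box2d]"
env = gym.make("LunarLander-v3")

print(env.observation_space)  # Box with 8 dimensions (positions, velocities, angle, leg contacts)
print(env.action_space)       # Discrete(4): do nothing, left engine, main engine, right engine

# One episode with purely random actions, just to see the reward signal
obs, info = env.reset(seed=0)
done = False
total_reward = 0.0
while not done:
    action = env.action_space.sample()  # pick a random action
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated

print(f"Random policy episode return: {total_reward:.1f}")
env.close()
```

A random agent usually finishes an episode deep in negative reward, which is the baseline both trained agents have to beat.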

🧐 What is Reinforcement Learning?

Reinforcement Learning (RL) is a training method where an agent learns to make decisions by interacting with an environment. It receives rewards for good actions and penalties for bad ones. Over time, it learns which actions lead to better long-term outcomes.

Imagine training a dog with treats: you give it a reward when it does something right, and it eventually learns what behavior earns the treat. That’s the basic idea behind RL.

In this case, the AI agent is the "dog", and the lander environment is the "world" it must navigate.

🔄 Algorithm 1: Deep Q-Network (DQN)

✅ How It Works

DQN is a value-based RL algorithm. It learns a function called the Q-function, which estimates the expected total (discounted) future reward for taking a given action in a given state and acting well afterwards.

✨ What is a Q-Value?

A Q-value answers the question:

“If I’m in this situation (state), and I take this action, how much total reward can I expect in the future?”

DQN tries to learn Q-values for every (state, action) pair. When acting greedily, it then picks the action with the highest Q-value.
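
As an illustration (using PyTorch purely as a sketch, not my actual code), a Q-network for this task maps the 8-dimensional state to 4 Q-values, one per action, and greedy action selection is just an argmax:

```python
import torch
import torch.nn as nn

# A small Q-network: 8-dimensional state in, one Q-value per discrete action out
q_net = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 4),  # Q(s, a) for each of the 4 actions
)

state = torch.randn(1, 8)                      # a dummy state, just for illustration
q_values = q_net(state)                        # shape (1, 4)
greedy_action = q_values.argmax(dim=1).item()  # the action with the highest Q-value
```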

🔧 Training Process

  1. Experience Replay: DQN stores its gameplay experiences (state, action, reward, next state) in a replay buffer.

  2. Mini-batch Training: It samples random experiences from this buffer to train a neural network.

  3. Target Network: A separate, periodically updated copy of the Q-network provides the learning targets, which helps stabilize training.

  4. ε-greedy Exploration: It usually picks the best action, but sometimes tries a random one to explore (a sketch of one training step follows this list).
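
Putting those pieces together, one training step might look like the sketch below. The batch layout, the q_target network, and the epsilon_greedy helper are illustrative assumptions, not the exact code I ran:

```python
import torch
import torch.nn.functional as F

GAMMA = 0.99   # discount factor
EPSILON = 0.1  # exploration rate (usually decayed over training)

def dqn_update(q_net, q_target, optimizer, batch):
    # batch: tensors sampled uniformly from the replay buffer
    # (actions as int64, dones as 0/1 floats)
    states, actions, rewards, next_states, dones = (
        batch["states"], batch["actions"], batch["rewards"],
        batch["next_states"], batch["dones"],
    )

    # Q-values the online network assigns to the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target from the frozen target network: r + gamma * max_a' Q_target(s', a')
    with torch.no_grad():
        next_q = q_target(next_states).max(dim=1).values
        target = rewards + GAMMA * (1.0 - dones) * next_q

    loss = F.mse_loss(q_sa, target)  # many implementations use a Huber loss instead
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def epsilon_greedy(q_net, state, epsilon=EPSILON):
    # With probability epsilon take a random action, otherwise the greedy one
    if torch.rand(1).item() < epsilon:
        return torch.randint(0, 4, (1,)).item()
    return q_net(state.unsqueeze(0)).argmax(dim=1).item()
```

Calling dqn_update on batches sampled from the replay buffer, and periodically copying q_net's weights into q_target, covers points 1 through 3; epsilon_greedy covers point 4.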

🚫 Limitations

  • Only works with discrete action spaces (can’t handle continuous actions).

  • Can be less stable in noisy environments, partly because the max in the Q-update tends to overestimate values.

🧹 Algorithm 2: Proximal Policy Optimization (PPO)

✅ How It Works

PPO is a policy-based algorithm. Instead of learning the value of actions, it directly learns a policy — a function that tells it which action to take in each state.

✨ What is a Policy?

A policy is the agent's decision-making rule: a function from State → Action. PPO builds a neural network that outputs a probability distribution over actions.

So instead of picking the best action like DQN, it samples an action from its learned distribution.
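
A minimal PyTorch-style sketch of such a policy network, again for illustration only:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

# Policy network: 8-dimensional state in, logits for the 4 discrete actions out
policy_net = nn.Sequential(
    nn.Linear(8, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 4),
)

state = torch.randn(1, 8)                     # dummy state for illustration
dist = Categorical(logits=policy_net(state))
action = dist.sample()                        # sampled, not argmax
log_prob = dist.log_prob(action)              # saved for the PPO update later
```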

💲 What Are Rollouts?

PPO plays full episodes and records what happened:

  • States visited

  • Actions taken

  • Rewards received

  • Whether the episode ended in a crash or landing

This full history is called a rollout.
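
Collecting a rollout might look like the sketch below, reusing the hypothetical policy_net from above with a Gymnasium environment; the buffer layout is an assumption for illustration:

```python
import torch
from torch.distributions import Categorical

def collect_rollout(env, policy_net, max_steps=2048):
    # Buffers for everything PPO needs from this batch of experience
    states, actions, log_probs, rewards, dones = [], [], [], [], []

    obs, info = env.reset()
    for _ in range(max_steps):
        state = torch.as_tensor(obs, dtype=torch.float32)
        dist = Categorical(logits=policy_net(state))
        action = dist.sample()

        next_obs, reward, terminated, truncated, info = env.step(action.item())

        states.append(state)
        actions.append(action)
        log_probs.append(dist.log_prob(action).detach())
        rewards.append(reward)
        dones.append(terminated or truncated)

        obs = next_obs
        if terminated or truncated:
            obs, info = env.reset()  # start a new episode within the same rollout

    return states, actions, log_probs, rewards, dones
```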

✂️ What is a Clipped Loss Function?

PPO doesn’t want to make wild updates. So it limits how much the policy can change in a single update by clipping the ratio between the new and old action probabilities. If an update would push the policy too far, the extra change simply stops contributing to the objective.

This makes PPO more stable and less likely to forget everything it learned.
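
The clipped objective itself is short. In the sketch below, old_log_probs come from the policy that collected the rollout, new_log_probs from the policy being updated, and the advantages are assumed to be precomputed (for example with GAE); the names are illustrative:

```python
import torch

CLIP_EPS = 0.2  # how far the new policy is allowed to move from the old one

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages):
    # Probability ratio between the updated policy and the one that collected the data
    ratio = torch.exp(new_log_probs - old_log_probs)

    # Unclipped and clipped surrogate objectives
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - CLIP_EPS, 1.0 + CLIP_EPS) * advantages

    # Take the pessimistic minimum and negate, since we minimize the loss
    return -torch.min(unclipped, clipped).mean()
```

Taking the minimum keeps the update pessimistic: the policy only gets credit for improvements that stay inside the range allowed by CLIP_EPS.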

🌎 Environment Types

🔹 Low-Dimensional, Discrete (DQN excels)

  • Inputs are small vectors (8 numbers)

  • Actions are fixed choices (left, right, main, nothing)

Examples:

  • Lunar Lander

  • CartPole

  • Atari Pong (discrete actions, though its inputs are image frames rather than a small state vector)

🌌 High-Dimensional, Continuous (PPO shines)

  • Inputs are images, joint angles, video frames

  • Actions can be any real number (like steering angles)

Examples:

  • Robotics

  • Drone flight

  • Self-driving cars

📊 The Results: DQN vs PPO

I trained both models for 100,000 timesteps each.
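
One straightforward way to reproduce a run like this is with the Stable-Baselines3 implementations of both algorithms; the snippet below is a sketch of the setup, not my exact training and logging script:

```python
import gymnasium as gym
from stable_baselines3 import DQN, PPO

env = gym.make("LunarLander-v3")

# Value-based agent
dqn_model = DQN("MlpPolicy", env, verbose=0)
dqn_model.learn(total_timesteps=100_000)
dqn_model.save("dqn_lunarlander")

# Policy-gradient agent
ppo_model = PPO("MlpPolicy", env, verbose=0)
ppo_model.learn(total_timesteps=100_000)
ppo_model.save("ppo_lunarlander")
```

Here's what I observed: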

🏋️ Reward

  • DQN reached positive rewards early and showed a smooth upward trend.

  • It quickly learned to control the lander, stabilize its descent, and land between the flags.

  • PPO, on the other hand, had a rough start. Its reward trajectory showed more volatility and longer periods of negative scores.

  • PPO required more training to converge on a usable landing strategy. Even then, its policy would occasionally pick suboptimal actions due to its stochastic nature.

⛽ Fuel Usage

  • DQN’s action selection became more fuel-efficient over time.

  • Early in training, both models fired thrusters excessively, but DQN began minimizing its burns as it learned a stable hover technique.

  • PPO, although improving later, often sacrificed fuel efficiency in exchange for maintaining stability.

🧼 Bounces

  • The bounce count remained near zero for both algorithms.

  • On its own, a near-zero bounce count is ambiguous: it can reflect clean soft landings or crashes that end the episode before any bounce registers.

  • Since I observed successful landings visually, I can infer the agents often landed smoothly without bouncing.

📊 Visual Behavior

  • At step 20,000, DQN was already stabilizing mid-air.

  • By 50,000 steps, it was consistently hovering, rotating upright, and descending slowly.

  • PPO was still erratic at step 50,000. It frequently overcorrected or drifted horizontally.

  • Only after 80,000+ steps did PPO begin to show signs of coherent strategy.

⌛ Stability and Generalization

  • DQN converged quickly and retained its policy well.

  • PPO had better long-term flexibility, but its exploration sometimes led it to temporarily forget good policies.

🏆 Final Verdict

Category              Winner
Reward Stability      DQN
Convergence Speed     DQN
Fuel Efficiency       DQN
Bounce Control        Tie
Policy Stability      PPO
Future Scalability    PPO

Winner: DQN for Lunar Lander

In this specific environment — one with low-dimensional inputs, discrete actions, and dense rewards — DQN outperformed PPO across the board.

PPO remains the more general-purpose algorithm: it scales better to complex environments with high-dimensional observations or continuous action spaces. But in this contest, simplicity and specialization won.

🚀 What This Taught Me

  • Not all RL algorithms fit all problems

  • DQN is highly effective for small, structured environments

  • PPO is powerful, but may need more time and tuning

  • Always track performance visually and analytically

  • Logs, graphs, and side-by-side comparisons reveal insights no reward number alone can

📄 Final Thoughts

This project gave me a deeper appreciation for how different RL algorithms behave under similar conditions. While DQN showed superior performance in this controlled and structured environment, PPO reminded me of the importance of flexibility and long-term adaptability. The key takeaway? No single algorithm is universally best — it always depends on the nature of the problem. Understanding how each algorithm learns, explores, and optimizes is just as important as the final result. And seeing it all unfold through real-time visuals, reward tracking, and fuel analytics made the learning all the more insightful.

-Atul Verma
Creator, Project52 🚀