RL with MuJoCo: Teaching models to walk, balance, and hop!
Posted: May 2, 2025 by Richard Tang
Hey everyone!
Since the start of my academic career in computer science, I've focused on computer vision, which has opened many doors I'm grateful for. However, as I finish my first year of college, it has become increasingly clear that specializing only in computer vision can be limiting. So I wanted to expand my skill set to other machine learning topics, one of which is reinforcement learning. Over the last two weeks, I've been learning reinforcement learning from scratch.
In my experience, reinforcement learning has one of the most challenging yet most rewarding learning curves out there. Much of the fundamental theory is unintuitive and challenges assumptions you carry over from areas like computer vision or natural language processing. However, through code implementations, projects, and research papers, it has become much more natural and understandable. Over the last nine days, I've been building a repository that implements the Twin Delayed Deep Deterministic Policy Gradient (TD3) and Soft Actor-Critic (SAC) algorithms to teach MuJoCo models from the Gymnasium library (the maintained successor to OpenAI's Gym) to walk, balance, and hop.
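To give a flavor of what TD3 adds over a vanilla actor-critic, here is a minimal PyTorch sketch of its clipped double-Q target with target-policy smoothing. The function and hyperparameter names are illustrative assumptions on my part, not necessarily how the repository organizes things.

```python
import torch

# Sketch of TD3's target computation (illustrative names and default
# hyperparameters, not necessarily the repository's exact structure).
def td3_target(reward, next_obs, done, actor_target, critic1_target, critic2_target,
               gamma=0.99, policy_noise=0.2, noise_clip=0.5, max_action=1.0):
    """Clipped double-Q target with target-policy smoothing."""
    with torch.no_grad():
        # Target-policy smoothing: add clipped Gaussian noise to the target action.
        next_action = actor_target(next_obs)
        noise = (torch.randn_like(next_action) * policy_noise).clamp(-noise_clip, noise_clip)
        next_action = (next_action + noise).clamp(-max_action, max_action)

        # Clipped double-Q: take the minimum of the two target critics
        # to curb overestimation bias.
        q1 = critic1_target(next_obs, next_action)
        q2 = critic2_target(next_obs, next_action)
        target_q = reward + gamma * (1.0 - done) * torch.min(q1, q2)
    return target_q
```

Both critics are then regressed toward this shared target, and the actor is updated less frequently than the critics (the "delayed" part of TD3).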
The repository features a generalizable framework with classes for both single-environment and parallel (vectorized) simulation, so it can train any MuJoCo model from the Gymnasium library. For simple tasks like balancing a pendulum, a single simulation is more than sufficient. For more complicated tasks like hopping and balancing, or running on four legs, parallel simulations generate far more samples, which in turn provides more diverse experience to learn from. I built both because reinforcement learning generally benefits from large, diverse batches of experience, something I learned the hard way when I first trained Hopper-v5.
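As a rough illustration of the difference, here is a minimal sketch of a single environment versus a vectorized batch of environments using Gymnasium's vector API. The environment ID and the number of workers are just example values, not the repository's actual configuration.

```python
import gymnasium as gym

# Single simulation: one Hopper-v5 instance, one transition per step.
env = gym.make("Hopper-v5")
obs, info = env.reset(seed=0)
obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
env.close()

# Parallel simulations: 8 Hopper-v5 instances stepped together, so each
# call to step() yields 8 transitions to add to the replay buffer.
envs = gym.vector.AsyncVectorEnv([lambda: gym.make("Hopper-v5") for _ in range(8)])
obs, infos = envs.reset(seed=0)          # obs has shape (8, obs_dim)
actions = envs.action_space.sample()     # a batch of 8 actions
obs, rewards, terminations, truncations, infos = envs.step(actions)
envs.close()
```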
Ant-v5 learning to crawl/walk forward using its legs

As you can see above, the model learns to run forward on its four legs. This success replicates easily across similar models that hop or balance pendulums. However, from my experiments with training models to stand up or reach for objects, reinforcement learning gets harder for robotics/humanoid models because rewards are sparse relative to the size of the state space.
To contextualize this, think of a ball pit filled entirely with blue, red, and green balls. At any given moment, you are tasked with finding balls of a specific color, and you must do so as quickly as possible. You have no idea what a red, blue, or green ball looks like, but a machine tells you whether the ball you picked is the correct color. Under normal circumstances, the three colors are evenly distributed in the pit, so as you draw balls at random, you build an understanding of the different colors. Eventually, you learn to pick the correct color rather than choosing at random.
Now imagine the ball pit doubles in volume and its distribution shifts to 99% blue, 0.5% red, and 0.5% green. Under these circumstances, most of the balls you pull out at random will be blue, even when you are tasked with finding red or green. The pit is so large and the distribution so uneven that you may never learn what a red or green ball looks like. Similarly, (some) humanoid tasks offer so few rewarding states beyond a certain point, relative to the state space, that progress stalls and training takes far longer. In theory, it should be possible to eventually train humanoid MuJoCo models to stand up or walk, but I can't afford to run 20,000 episodes in the cloud (yet).

Humanoid model struggling to find ways to stand up
I look forward to sharing more progress from my reinforcement learning experiments, as well as other non-AI projects! If you're feeling generous, please show some GitHub love with a star on the repository! I know it's a bit vain, but come on, I need something to validate those sleepless nights :,)
Category: Blog, Digital Defiance, Project Updates | Tags: AI, project update, reinforcement learning