Abstract

Skill-based reinforcement learning (SBRL) enables rapid adaptation in environments with sparse rewards by pretraining a skill-conditioned policy. Effective skill learning requires jointly maximizing both exploration and skill diversity, yet existing methods often struggle to optimize these two conflicting objectives simultaneously. In this work, we propose a new method, Adaptive Multi-objective Projection for balancing Exploration and skill Diversification (AMPED), which explicitly addresses both exploration and skill diversification. We begin by conducting extensive ablation studies to identify a set of objectives that effectively capture exploration and skill diversity, respectively. During the skill pretraining phase, AMPED applies a gradient surgery technique to balance the exploration and skill-diversity objectives, mitigating conflicts and reducing reliance on heuristic tuning. In the subsequent fine-tuning phase, AMPED incorporates a skill selector module that dynamically selects suitable skills for downstream tasks based on task-specific performance signals. Our approach surpasses SBRL baselines across various benchmarks. These results highlight the importance of explicitly harmonizing exploration and diversity, and demonstrate the effectiveness of AMPED in enabling robust and generalizable skill learning.
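
The abstract does not spell out how the skill selector works internally; as a hedged illustration of choosing skills from task-specific performance signals, the sketch below scores each candidate skill vector by a running average of the downstream return it achieves and picks skills with an epsilon-greedy rule. The class name, epsilon value, and return-averaging scheme are illustrative assumptions, not AMPED's actual module.

```python
import random
from collections import defaultdict


class SkillSelector:
    """Toy epsilon-greedy selector over a fixed set of skill vectors.

    Illustrative stand-in for a performance-driven skill selector: each
    skill is scored by a running mean of the downstream return it
    achieved, and skills are chosen greedily with occasional exploration.
    """

    def __init__(self, skills, epsilon: float = 0.1):
        self.skills = list(skills)         # candidate skill vectors z
        self.epsilon = epsilon
        self.returns = defaultdict(float)  # running mean return per skill index
        self.counts = defaultdict(int)

    def select(self) -> int:
        # Explore occasionally (or when nothing has been tried yet).
        if not self.counts or random.random() < self.epsilon:
            return random.randrange(len(self.skills))
        return max(range(len(self.skills)), key=lambda i: self.returns[i])

    def update(self, skill_idx: int, episode_return: float) -> None:
        # Incremental mean of downstream returns observed for this skill.
        self.counts[skill_idx] += 1
        n = self.counts[skill_idx]
        self.returns[skill_idx] += (episode_return - self.returns[skill_idx]) / n
```

During fine-tuning, one would call `select()` to pick the skill vector that conditions the policy for an episode on the downstream task, then feed the resulting episode return back through `update()`.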

Contributions

We introduce AMPED to address the dual objectives of exploration and skill diversity in SBRL. Our framework unifies entropy-based exploration with contrastive skill separation, explicitly resolves their gradient conflicts via PCGrad for more stable updates, and employs a skill selector to adaptively deploy skills during fine-tuning. Empirically, we show that:

  1. Eliminating exploration-diversity gradient interference is crucial (a minimal gradient-surgery sketch follows this list).
  2. Combining AnInfoNCE-inspired diversity losses with RND-driven entropy bonuses yields a robust balance between competing incentives.
  3. Implementing our skill selector meaningfully boosts downstream performance.
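
Items 1 and 2 above can be made concrete with a short sketch. The snippet below assumes the two intrinsic losses (e.g., an RND prediction error for exploration and an InfoNCE-style skill-discrimination loss for diversity) have already been computed as scalar tensors, and shows only the PCGrad-style surgery step: when the two flattened gradients conflict, each is projected onto the normal plane of the other before the update. The helper names and the plain unweighted sum are illustrative choices, not AMPED's exact implementation.

```python
import torch


def pcgrad_combine(g_a: torch.Tensor, g_b: torch.Tensor) -> torch.Tensor:
    """PCGrad-style surgery for two objectives: if the flattened gradients
    conflict (negative inner product), project each onto the normal plane
    of the *other original* gradient before summing."""
    dot = torch.dot(g_a, g_b)
    if dot < 0.0:
        g_a_proj = g_a - (dot / g_b.norm().pow(2)) * g_b
        g_b_proj = g_b - (dot / g_a.norm().pow(2)) * g_a
        return g_a_proj + g_b_proj
    return g_a + g_b


def flat_grad(loss: torch.Tensor, params) -> torch.Tensor:
    """Flatten the gradient of `loss` w.r.t. `params` into a single vector."""
    grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
    return torch.cat([
        (g if g is not None else torch.zeros_like(p)).reshape(-1)
        for g, p in zip(grads, params)
    ])


def surgery_step(policy, optimizer, explore_loss, diversity_loss):
    """One pretraining update that combines the exploration and diversity
    gradients with gradient surgery instead of a naive weighted sum."""
    params = [p for p in policy.parameters() if p.requires_grad]
    g_explore = flat_grad(explore_loss, params)
    g_diverse = flat_grad(diversity_loss, params)
    combined = pcgrad_combine(g_explore, g_diverse)

    # Write the combined flat gradient back into .grad and take a step.
    optimizer.zero_grad()
    offset = 0
    for p in params:
        n = p.numel()
        p.grad = combined[offset:offset + n].view_as(p).clone()
        offset += n
    optimizer.step()
```

Compared with a fixed weighted sum of the two losses, the projection intervenes only when the gradients actually conflict, which is what reduces reliance on heuristic tuning of the objective weights.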

Visualization of Skills with AMPED

Representative skills learned during pretraining with our approach reveal diverse and specialized behaviors that emerge solely from intrinsic objectives, with no extrinsic reward or expert data.

Walker

Getting up from ground

Walker recovery animation

Stepping forward

Walker locomotion animation

Backward somersault

Walker acrobatic animation

Walker skills include rising from a supine position, stepping forward, and performing a backward somersault.

Quadruped

Upside-down recovery

Quadruped recovery animation

Backward somersault

Quadruped flip animation

Clockwise rotation

Quadruped rotation animation

Quadruped skills demonstrate self-righting, acrobatic flips, and rotational maneuvers.

Jaco

Left reach & grasp

Jaco left reach animation

Right reach & grasp

Jaco right reach animation

Upward lifting

Jaco lifting animation

Jaco skills capture precise arm motions such as leftward reaching, rightward grasping, and upward lifting toward a target.

Performance Results

Performance comparison graph showing AMPED outperforming baselines across four metrics

Aggregated expert-normalized performance on 12 URLB downstream tasks after 100k finetuning steps, averaged over 10 random seeds. Four metrics—median, IQM, arithmetic mean, and optimality gap—are plotted using the evaluation protocol introduced by Agarwal et al. Our method (gray) achieves the highest median, IQM, and mean scores and the smallest optimality gap, outperforming the previous state-of-the-art APT (pink) and other baselines.
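
For reference, the four aggregate metrics can be computed from a matrix of expert-normalized scores with plain NumPy/SciPy, as in the simplified sketch below. Here every run-task score is pooled before aggregation, whereas the full protocol of Agarwal et al. also reports stratified-bootstrap confidence intervals and may average over runs per task first; the threshold `gamma = 1.0` for the optimality gap is the usual convention for expert-normalized scores.

```python
import numpy as np
from scipy import stats


def aggregate_metrics(scores: np.ndarray, gamma: float = 1.0) -> dict:
    """Aggregate expert-normalized scores of shape (num_runs, num_tasks).

    median / mean : over all pooled run-task scores
    IQM           : interquartile mean, i.e. the mean of the middle 50%
    optimality gap: average shortfall below gamma (scores above gamma
                    are clipped, so exceeding gamma is not rewarded)
    """
    flat = scores.reshape(-1)
    return {
        "median": float(np.median(flat)),
        "mean": float(np.mean(flat)),
        "iqm": float(stats.trim_mean(flat, proportiontocut=0.25)),
        "optimality_gap": float(np.mean(gamma - np.minimum(flat, gamma))),
    }


# Example: 10 seeds x 12 URLB downstream tasks (random placeholder scores).
scores = np.random.uniform(0.2, 1.1, size=(10, 12))
print(aggregate_metrics(scores))
```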