Project setup

Dataset, task, and environment

Dataset

LIBERO demonstration data was converted into local LeRobot datasets with agent-view RGB, wrist RGB, robot state, actions, and task IDs for multitask conditioning.
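
A minimal sketch of the conversion, assuming LIBERO's robomimic-style HDF5 layout (data/demo_*/obs/agentview_rgb, eye_in_hand_rgb, joint and gripper states) and lerobot's LeRobotDataset.create / add_frame / save_episode API. File names, shapes, fps, and the repo id are placeholders, and the exact signatures differ between lerobot releases.

```python
import h5py
import numpy as np
from lerobot.common.datasets.lerobot_dataset import LeRobotDataset  # import path varies by release

# Feature schema: two RGB views, proprioceptive state, and actions.
# Shapes are placeholders; the per-episode task string carries the task ID.
FEATURES = {
    "observation.images.agentview": {"dtype": "video", "shape": (256, 256, 3), "names": ["height", "width", "channels"]},
    "observation.images.wrist": {"dtype": "video", "shape": (256, 256, 3), "names": ["height", "width", "channels"]},
    "observation.state": {"dtype": "float32", "shape": (9,), "names": None},
    "action": {"dtype": "float32", "shape": (7,), "names": None},
}

dataset = LeRobotDataset.create(repo_id="local/libero_spatial", fps=20, features=FEATURES)

with h5py.File("black_bowl_on_plate_demo.hdf5", "r") as f:  # placeholder filename
    for demo_key in f["data"]:
        demo = f["data"][demo_key]
        state = np.concatenate([demo["obs/joint_states"], demo["obs/gripper_states"]], axis=-1)
        for t in range(demo["actions"].shape[0]):
            dataset.add_frame({
                "observation.images.agentview": demo["obs/agentview_rgb"][t],
                "observation.images.wrist": demo["obs/eye_in_hand_rgb"][t],
                "observation.state": state[t].astype(np.float32),
                "action": demo["actions"][t].astype(np.float32),
            })
        # Newer lerobot releases take the task string inside add_frame instead.
        dataset.save_episode(task="pick up the black bowl and place it on the plate")
```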

Task family

The experiments focus on LIBERO-Spatial bowl-placement tasks. The single-task track starts with placing the black bowl on the plate; the multitask track covers all ten spatial variants.

Environment

Evaluation runs in LIBERO's off-screen simulator from the benchmark's pruned initial states rather than demonstration starts, so reported success reflects benchmark rollout behavior.
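
A sketch of that evaluation loop, following the method names in LIBERO's public examples (get_benchmark_dict, get_task_init_states, OffScreenRenderEnv). The horizon and the policy_act wrapper are placeholders.

```python
import os
import numpy as np
from libero.libero import benchmark, get_libero_path
from libero.libero.envs import OffScreenRenderEnv

task_suite = benchmark.get_benchmark_dict()["libero_spatial"]()
task = task_suite.get_task(0)                      # task 00 of the 10 spatial variants
init_states = task_suite.get_task_init_states(0)   # pruned benchmark initial states, not demo starts

env = OffScreenRenderEnv(
    bddl_file_name=os.path.join(get_libero_path("bddl_files"), task.problem_folder, task.bddl_file),
    camera_heights=256,
    camera_widths=256,
)
env.seed(0)

def policy_act(obs):
    # Hypothetical wrapper around the trained ACT policy; stands in here.
    return np.zeros(7)

successes = 0
for episode in range(50):  # 50 eval episodes per task
    env.reset()
    obs = env.set_init_state(init_states[episode % len(init_states)])
    done = False
    for _ in range(600):  # horizon is a placeholder
        obs, reward, done, info = env.step(policy_act(obs))
        if done:          # LIBERO signals task success via `done`
            break
    successes += int(done)

print(f"success: {successes}/50 ({successes / 50:.0%})")
```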

At a glance: 62,250 multitask frames, 10 spatial tasks, 2 RGB camera views, 50 evaluation episodes per benchmark task.

Section 1

Single-task policy

Goal: make one ACT policy reliably solve the black-bowl placement task, then use rollout failures to decide which training changes are worth testing.

Best result: 82% (41 / 50 benchmark episodes)

Single-task thinking process

From rollout review to targeted ablation

  1. Get ACT training and evaluation running end to end on converted LeRobot data.
  2. Review failures instead of only watching loss curves.
  3. Notice that many failures happen around grasp acquisition and release timing.
  4. Oversample gripper-transition windows to put more training signal on those moments (see the sampler sketch below).
  5. Reduce ACT KL weight from 10.0 to 5.0 and verify on benchmark initial states.

Configuration                    Success
Initial stable baseline          44%
After transition oversampling    62%
After KL weight 10.0 -> 5.0      82%
After KL weight 5.0 -> 2.0       76%
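
Step 4 can be implemented as per-frame weights in a WeightedRandomSampler: frames within a small window of a gripper open/close flip get boosted. A sketch with illustrative window and boost values, not the tuned ones; the toy gripper track and the commented-out loader stand in for the real dataset.

```python
import numpy as np
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler

def gripper_transition_weights(gripper_cmds, window=8, boost=4.0):
    """Per-frame sampling weights that upweight frames within `window` steps
    of a gripper open/close flip. `window` and `boost` are illustrative."""
    cmds = np.sign(np.asarray(gripper_cmds))      # binarize open/close commands
    flips = np.flatnonzero(np.diff(cmds) != 0)    # indices where the command changes
    weights = np.ones(len(cmds))
    for t in flips:
        lo, hi = max(0, t - window), min(len(cmds), t + window + 1)
        weights[lo:hi] = boost
    return weights

# Toy gripper command track: closed (-1) -> open (1) -> closed (-1).
all_gripper_cmds = np.concatenate([-np.ones(100), np.ones(60), -np.ones(40)])
weights = gripper_transition_weights(all_gripper_cmds)
sampler = WeightedRandomSampler(
    torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(weights),
    replacement=True,
)
# loader = DataLoader(lerobot_dataset, batch_size=64, sampler=sampler)  # weights align 1:1 with frames
```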

Before the KL trick

Transition sampling, default KL

Success plateaued at 31 / 50 (62%). The remaining failures clustered around grasp acquisition and placement, where the policy was most sensitive.

After the KL trick

Same sampling, lower KL

Reducing the KL penalty gave the policy enough action flexibility to reach 41 / 50.
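
The change itself is a single scalar in the policy config. A sketch assuming lerobot's ACTConfig, whose kl_weight default of 10.0 matches the baseline; the import path varies by release.

```python
from lerobot.common.policies.act.configuration_act import ACTConfig  # path varies by lerobot release

# ACT trains a CVAE: loss = L1(action_chunk, prediction) + kl_weight * KL(q(z | a, s) || N(0, I)).
# Lowering kl_weight loosens the latent bottleneck, so the decoder can commit to
# more varied action chunks around grasp and release instead of averaging them out.
config = ACTConfig(kl_weight=5.0)  # default is 10.0; pushing further to 2.0 regressed here
```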

Section 2

Multi-task policy

Goal: train one task-conditioned ACT policy across all ten LIBERO-Spatial tasks and understand why weak tasks do not improve just by sampling them more often.

Best mean: 69.4% (10-task benchmark average)

Multi-task thinking process

From task imbalance hypothesis to architecture ablations

  1. Add task IDs so one shared policy can condition on the requested spatial task (see the conditioning sketch below).
  2. Check per-task success instead of only the benchmark mean.
  3. Identify the weak tasks (04, 05, and 09) as tempting targets for reweighting.
  4. Try task-level oversampling and compare against the original task-conditioned baseline.
  5. Observe that oversampling collapses the mean, so the next direction becomes cleaner architecture ablations.

Variant                      Mean success
Task-conditioned baseline    69.4%
Task oversampling attempt    22.0%
Balanced task batches        61.4%
Shared task embedding        65.8%
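
Step 1 of the list above can be as simple as appending a one-hot task ID to the state input; the "shared task embedding" row swaps the one-hot for a learned embedding. A minimal sketch of the one-hot variant (the exact conditioning mechanism is not spelled out here beyond task IDs):

```python
import torch
import torch.nn.functional as F

NUM_TASKS = 10  # LIBERO-Spatial variants

def condition_on_task(state: torch.Tensor, task_id: torch.Tensor) -> torch.Tensor:
    """Concatenate a one-hot task ID onto the proprioceptive state so one
    shared policy can be steered to any of the ten spatial tasks."""
    one_hot = F.one_hot(task_id, num_classes=NUM_TASKS).to(state.dtype)
    return torch.cat([state, one_hot], dim=-1)

# state: (batch, state_dim), task_id: (batch,) int64 in [0, 9]
state = torch.randn(8, 9)
task_id = torch.randint(0, NUM_TASKS, (8,))
conditioned = condition_on_task(state, task_id)  # shape (8, 9 + 10)
```

A learned alternative replaces F.one_hot with torch.nn.Embedding(NUM_TASKS, dim), which is what the shared-embedding ablation corresponds to.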

Baseline per-task behavior

Task  Scenario                     Success
02    from table center            0.94
03    on cookie box                0.92
09    on wooden cabinet            0.86
00    between plate and ramekin    0.82
05    on ramekin                   0.16
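
The per-task view comes from grouping rollout outcomes by task rather than pooling them into one mean; a small sketch, with episode_results standing in for the eval log:

```python
from collections import defaultdict

# Placeholder for the eval log: one (task_id, success) pair per episode,
# 50 episodes per task in the real runs.
episode_results = [(0, True), (0, True), (5, False), (5, False), (9, True)]

per_task = defaultdict(list)
for task_id, success in episode_results:
    per_task[task_id].append(int(success))

rates = {t: sum(v) / len(v) for t, v in sorted(per_task.items())}
for t, r in rates.items():
    print(f"task {t:02d}: {r:.2f}")
print(f"benchmark mean: {sum(rates.values()) / len(rates):.3f}")  # the mean alone hides outliers like task 05
```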

Before the oversampling trick

Task-conditioned baseline

The baseline has a strong mean, but task 05 remains a hard failure case.

After the oversampling trick

Oversampling weak tasks

The plausible fix backfires: benchmark mean drops from 69.4% to 22.0%.
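
For reference, the attempt that backfired is the task-level cousin of the transition sampler from Section 1: per-frame weights keyed on task ID instead of gripper events. A sketch with placeholder names and illustrative boost factors, not the exact ones tried:

```python
import numpy as np
import torch
from torch.utils.data import WeightedRandomSampler

# Per-frame task labels across the dataset; this toy array stands in for them.
frame_task_ids = np.repeat(np.arange(10), 100)

BOOST = {4: 3.0, 5: 3.0, 9: 3.0}  # the weak tasks flagged in step 3
weights = np.array([BOOST.get(t, 1.0) for t in frame_task_ids])
sampler = WeightedRandomSampler(
    torch.as_tensor(weights, dtype=torch.double),
    num_samples=len(weights),
    replacement=True,
)
```

One plausible reading of the collapse is distributional: skewing the sampler this hard starves the strong tasks of coverage, and the regressions there outweigh any gains on the boosted tasks, consistent with the drop from 69.4% to 22.0%.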