File structure
Each physical skill lives in a directory under `~/skills/` on the robot:
Episode format (HDF5)
Each episode is stored as an HDF5 file with the following structure:

| Dataset | Shape | Description |
|---|---|---|
| `action` | (T, action_dim) | Leader arm commands recorded at each timestep |
| `qpos` | (T, num_joints) | Follower arm joint positions |
| `qvel` | (T, num_joints) | Follower arm joint velocities |
| `images/main_camera` | (T, 480, 640, 3) | Main camera RGB frames |
| `images/arm_camera` | (T, 480, 640, 3) | Wrist camera RGB frames |
Here, `T` is the number of timesteps in the episode and `action_dim` is typically 6 (joint positions) or 10 (6 joints + 2 base velocity + 2 reserved).
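As a minimal sketch of reading this layout (assuming the standard `h5py` API; the file path and helper name are illustrative, not part of the recorder):

```python
import h5py


def load_episode(path):
    """Load one recorded episode into in-memory numpy arrays."""
    with h5py.File(path, "r") as f:
        return {
            "action": f["action"][:],        # (T, action_dim) leader commands
            "qpos": f["qpos"][:],            # (T, num_joints) follower joint positions
            "qvel": f["qvel"][:],            # (T, num_joints) follower joint velocities
            # Camera frames are stored as nested datasets under images/
            "main_camera": f["images/main_camera"][:],  # (T, 480, 640, 3) RGB
            "arm_camera": f["images/arm_camera"][:],    # (T, 480, 640, 3) RGB
        }
```

Slicing with `[:]` materializes each dataset as a numpy array; for long episodes you may prefer to index lazily instead of loading whole image stacks.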
Additional fields for mobile tasks
When recording with base movement enabled, the episode also includes:

| Dataset | Shape | Description |
|---|---|---|
| `cmd_vel` | (T, 2) | Base velocity commands (linear x, angular z) |
| `odom` | (T, ...) | Odometry readings from `/odom` |
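Because these datasets exist only for mobile recordings, episode consumers should probe for them rather than assume they are present. A sketch using `h5py` (the helper name is illustrative):

```python
import h5py


def has_base_data(path):
    """Return True if the episode was recorded with base movement enabled."""
    with h5py.File(path, "r") as f:
        # Optional datasets can be tested with the `in` operator on the file object.
        return "cmd_vel" in f and "odom" in f
```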
Recording parameters
The recorder captures data at 30 Hz by default with these settings (from `recorder.yaml`):
| Parameter | Value |
|---|---|
| Data frequency | 30 Hz |
| Image resolution | 640 × 480 |
| Max timesteps per episode | 1800 (60 seconds at 30 Hz) |
| Camera topics | /mars/main_camera/left/image_raw, /mars/arm/image_raw |
| Arm state topic | /mars/arm/state |
| Leader command topic | /mars/arm/commands |
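The episode cap in the table is derived directly from the other settings: 30 Hz for 60 seconds gives 1800 timesteps. A sketch making that relationship explicit (the dictionary keys are illustrative, not the actual `recorder.yaml` schema):

```python
# Hypothetical view of the recorder settings; key names are illustrative.
RECORDER = {
    "data_frequency_hz": 30,
    "image_resolution": (640, 480),
    "max_episode_seconds": 60,
}

# Timestep cap = capture rate x maximum episode duration.
max_timesteps = RECORDER["data_frequency_hz"] * RECORDER["max_episode_seconds"]
```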
Metadata file
Each skill directory contains a `metadata.json` that evolves as you progress through the pipeline:
After creating the skill:
The `execution` block tells the BehaviorServer everything it needs to load and run the policy: which checkpoint to use, the action dimensionality, the maximum execution duration, and the arm pose to move to before starting inference.
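A sketch of how a consumer might read those settings (the field names inside `execution` are hypothetical; only the presence of a checkpoint, action dimensionality, time limit, and starting arm pose is described above):

```python
import json


def load_execution_config(metadata_path):
    """Read the execution block from a skill's metadata.json."""
    with open(metadata_path) as f:
        meta = json.load(f)
    # Field names below are illustrative, not the actual schema.
    execution = meta["execution"]
    return {
        "checkpoint": execution["checkpoint"],
        "action_dim": execution["action_dim"],
        "max_duration_s": execution["max_duration_s"],
        "start_pose": execution["start_pose"],
    }
```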
Normalization statistics
The training pipeline computes per-feature normalization statistics (mean and standard deviation) from your dataset and saves them in `dataset_stats.pt`. During inference, the policy uses these stats to normalize observations and unnormalize action outputs, ensuring consistency between what the model saw during training and what it sees at runtime.
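The normalization itself is standard z-scoring. A minimal NumPy sketch (in practice the stats would come from `dataset_stats.pt`; the epsilon guard and function names here are illustrative):

```python
import numpy as np


def normalize(x, mean, std, eps=1e-8):
    """Map raw observations into the zero-mean, unit-variance space seen in training."""
    return (x - mean) / (std + eps)


def unnormalize(a, mean, std, eps=1e-8):
    """Map normalized policy outputs back to raw action units."""
    return a * (std + eps) + mean
```

The two functions are inverses, so a value round-trips through `unnormalize(normalize(x, ...), ...)` unchanged, which is exactly the consistency guarantee described above.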
