Technical Specification

Modalitynet Technical Specification v1.0

Introduction to the Three Datasets

Version	Date
v0.51	20260123
v1.0	20260511

Disclaimer

This document contains proprietary and confidential information of Noitom Robotics and is intended for authorized academic and/or technical review only. No part of this document may be reproduced, disclosed, distributed, or otherwise made available to any third party, in whole or in part, without the prior written consent of Noitom Robotics. Any unauthorized use of this document is strictly prohibited.

Feature	HiPHI Motion with Object & Vision (HiPHI-MOV)	HiPHI Omni-Modality (HiPHI-OM)	In-The-Wild (ITW)
Primary Goal	Locomotion-Manipulation	Ground-truth Precision	Real-world Robustness
Environment	Transitionary / Complex	Controlled Lab	Unconstrained / Unstructured / Stochastic
Key Sensor	Synchronized RGB-D + MoCap	Optical + IMU Fusion	Portable/Sparse Arrays
Action Scale	Hierarchical (Micro to Macro)	Atomic & Meso-level	Ecological/Naturalistic
Ontology	Morphological Abstraction	Cross-Ontology Mapping	Cross-Ontology Annotation

Overview

High Precision Human Interaction Motion With Object & Vision (HiPHI-MOV) Dataset: The HiPHI-MOV Dataset is a human-centric, high-fidelity multimodal corpus engineered for developing robust locomotion and whole-body loco-manipulation policies. It includes full-body motion capture, tracking of interacting objects, and third-person RGB-D visual data. HiPHI-MOV provides a synchronized data stream that co-registers ground-truth full-body kinematic trajectories, captured via high-frequency motion capture, with ego-centric and exo-centric visual observations. This structured hierarchy enables the modeling of complex robotic behaviors, ranging from low-level motor primitives (joint-space dynamics) to high-level environmental affordances (scene-contextual navigation).

High Precision Human Interaction Omni-Modality (HiPHI-OM) Dataset: The HiPHI-OM Dataset is a human-centric, high-fidelity, omni-modal repository acquired within a highly instrumented laboratory environment. It includes full-body and fine-grained hand motion capture, hand-level tactile sensing, precise tracking of interacting objects, egocentric RGB-D visual data, and third-person RGB-D visual data. By utilizing synchronized, high-precision sensor arrays, HiPHI-OM provides ground-truth level data for anthropocentric modeling with minimal aleatoric uncertainty. The dataset is designed to be ontology-agnostic, allowing for the decoupling of raw sensor data from specific semantic frameworks to maximize cross-domain generalization and longitudinal utility. Morphologically, the dataset supports a hierarchical structure, encompassing both micro-level kinematic primitives and meso-level sequential task planning.

In-the-Wild (ITW) Dataset: The In-the-Wild (ITW) Dataset constitutes a human-centric, life-scale, diverse, open-world, multimodal repository of stochastic real-world scenarios designed to advance humanoid robotics and embodied intelligence (EI). It includes sparse-body motion capture sensors and egocentric RGB-D visual data. Departing from traditional laboratory-constrained acquisition, ITW captures ecologically valid human behaviors and interaction dynamics within unconstrained environments. By integrating high-variance environmental noise and long-tail edge cases into the training distribution, ITW facilitates the generalization of laboratory-optimized algorithms toward industrial deployment. When integrated with the HiPHI-OM dataset, it provides a comprehensive cross-domain corpus spanning diverse operational scenarios and sensory modalities.

1. Technical Specification: HiPHI Motion with Object & Vision (HiPHI-MOV) Dataset

Multimodal Sensor Fusion and State Estimation: The HiPHI-MOV dataset provides a high-dimensional data stream that integrates 6-DOF full-body kinematics, high-fidelity hand-joint trajectories (provisioned), and synchronized RGB-D environmental telemetry. This deep fusion of proprioceptive and exteroceptive data supports the development of integrated perception–action loops, enabling models to learn the spatial relationships between body pose and environmental affordances.
Multi-Scale Hierarchical Interpretability: Designed as a structured benchmark for embodied intelligence, the dataset spans three distinct semantic layers: atomic-level action ontologies (kinematic primitives), mesoscopic task planning (sequential logic), and macro-level scene distributions (environmental context). This hierarchy allows for scientific interpretability in model performance, isolating whether failures occur at the motor-control, tactical, or strategic level.
Sim-to-Real Alignment and Zero-Shot Transfer: The dataset is engineered to facilitate sim-to-real alignment, providing the high-precision ground truth necessary to bridge the gap between virtual simulations and physical deployment. By capturing a diverse range of locomotion-manipulation tasks, HiPHI-MOV supports the evaluation of zero-shot transfer capabilities, allowing autonomous control algorithms to generalize to novel environments without additional fine-tuning.
Cross-Ontology Compatibility and Morphological Abstraction: Leveraging advanced motion retargeting frameworks, the HiPHI-MOV dataset decouples captured human motion from specific hardware constraints. This ensures cross-ontology compatibility, where high-fidelity motion data can be seamlessly mapped onto humanoid platforms with disparate degrees of freedom (DoF) and varied mechanical configurations. This abstraction is vital for creating platform-independent foundation models for robotic locomotion.

File Structure

Dataset Folder

File Name	Description	Motion without Object	Motion with Object
motion_actor.bvh	Human motion data	✔	✔
task_info.json	Task information for this collection	✔	✔
config.json	Relationship between motion bvh and object		✔
prop_Object.csv	Object motion information		✔

*Video data will be provided soon.

Object Model Folder (only available for motion data with object)

File Name	Description
Object.obj	Object model (Y-up orientation)
Object.csv	Object weight (kg)

2. Technical Specification: HiPHI Omni-Modality (HiPHI-OM) Dataset

Human-Centered Hierarchical Modeling: The dataset captures anthropocentric behaviors across multiple granularities, from atomic-level meta-actions to medium- and long-range task sequences. By maintaining an ontology-agnostic underlying structure, HiPHI-OM facilitates high-level cross-ontology generalization, allowing the same raw behavioral data to be effectively mapped to diverse semantic frameworks and research objectives.
Sub-millimeter Spatiotemporal Synchronization: The system achieves high-precision pose capture through a robust human–computer interaction (HRI) pipeline. By fusing optical markers with inertial measurement units (IMUs), the infrastructure ensures stable and accurate tracking during high-speed, high-acceleration motions. This hybrid approach minimizes occlusion artifacts and latency, providing high-fidelity recordings of complex human kinematics.
Multimodal, High-Dimensional Signal Registration: HiPHI-OM serves as a "ground-truth-level" repository, providing synchronized signals across visual, tactile, and spatial domains. While current releases focus on high-precision target positioning and tactile feedback, the architecture is designed for "full-modal" expansion, with integrated force, audio, and thermal telemetry scheduled for subsequent release cycles within the controlled environment.
Cross-Ontology Compatibility and Morphological Mapping: A core strength of the dataset is its hardware-agnostic nature, achieved through advanced motion retargeting technology. This allows for the seamless translation of human motion data to humanoid robots with heterogeneous degrees of freedom (DoF) and varying physical proportions. By decoupling the data from specific robotic platforms, HiPHI-OM ensures that the learned policies are robust across a wide spectrum of robotic morphologies.

File Structure

Data Structure

File Name	Description
config.json	Metadata and description of the data in this collection
task_info.json	Task information for this collection
camera_params/	Intrinsic and extrinsic parameters for all cameras
head_stereo_depth.csv	Index, timestamp, and PNG path for depth images from the head-mounted depth camera
head_stereo_depth/	Depth maps from the head-mounted depth camera
head_stereo.csv	Per-frame timestamps for the video from the head-mounted depth camera
head_stereo.mp4	Video from the head-mounted depth camera
head_wide.csv	Per-frame timestamps for the video from the head-mounted wide-angle camera
head_wide.mp4	Video from the head-mounted wide-angle camera
fixed1_stereo_depth.csv	Index, timestamp, and PNG path for depth images from fixed camera 1
fixed1_stereo_depth/	Depth maps from fixed camera 1
fixed1_stereo.csv	Per-frame timestamps for the video from fixed camera 1
fixed1_stereo.mp4	Video from fixed camera 1
fixed1_wide.csv	Per-frame timestamps for the wide-angle video from fixed camera 1
fixed1_wide.mp4	Wide-angle video from fixed camera 1
fixed2_stereo_depth.csv	Index, timestamp, and PNG path for depth images from fixed camera 2
fixed2_stereo_depth/	Depth maps from fixed camera 2
fixed2_stereo.csv	Per-frame timestamps for the video from fixed camera 2
fixed2_stereo.mp4	Video from fixed camera 2
fixed2_wide.csv	Per-frame timestamps for the wide-angle video from fixed camera 2
fixed2_wide.mp4	Wide-angle video from fixed camera 2
hand_pressure_data.h5	Palm pressure data (all finger joints and palm regions: upper, mid, lower-mid, lower, base)
tracker_sixdof_data.h5	6DOF data for full-body trackers, hands, and props
human_bones.h5	6DOF data for the motion-capture subject’s skeleton
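
The table above can double as a checklist when a collection is downloaded. The following is a minimal sketch that reports which of the listed files and folders are absent. It assumes that all entries sit directly inside one scene folder, which this document does not state explicitly; the helper name `missing_entries` is hypothetical.

```python
from pathlib import Path

# File and folder names copied from the Data Structure table.
EXPECTED_FILES = [
    "config.json",
    "task_info.json",
    "hand_pressure_data.h5",
    "tracker_sixdof_data.h5",
    "human_bones.h5",
]
EXPECTED_DIRS = ["camera_params"]

# Each of the three camera channels contributes the same set of entries.
for cam in ("head", "fixed1", "fixed2"):
    EXPECTED_FILES += [
        f"{cam}_stereo_depth.csv",
        f"{cam}_stereo.csv",
        f"{cam}_stereo.mp4",
        f"{cam}_wide.csv",
        f"{cam}_wide.mp4",
    ]
    EXPECTED_DIRS.append(f"{cam}_stereo_depth")


def missing_entries(scene_dir):
    """Return the sorted table entries that are absent from scene_dir.

    Assumption: every entry lives directly under scene_dir.
    Missing folders are reported with a trailing slash.
    """
    root = Path(scene_dir)
    missing = [n for n in EXPECTED_FILES if not (root / n).is_file()]
    missing += [n + "/" for n in EXPECTED_DIRS if not (root / n).is_dir()]
    return sorted(missing)
```

An empty list from `missing_entries(path)` means every entry in the table was found.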

Obj Model Folder

File Name	Description
Obj.fbx	Object model in Autodesk FBX format. Unlike STL, an FBX file can carry textures, animation data, and other attributes in addition to geometry.
Obj.stl	Object model in STL format, which represents only surface geometry as a mesh of connected triangles and is widely used for 3D printing.

Data formats for various file types

Video data

Each scene contains six video files from three channels: one head-mounted channel and two fixed-camera channels. Each channel provides two videos, one stereo (binocular) and one wide-angle. Each video is an MP4 file accompanied by a CSV file of the same name that records the timestamp of every frame. For example, head_stereo.csv records the timestamp of each frame of head_stereo.mp4.
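
Because every stream carries its own per-frame timestamps, frames from different cameras can be paired by nearest timestamp. The sketch below shows one way to do this. It assumes the timestamps have already been loaded from the CSV files into sorted lists of seconds (the CSV column layout is not specified in this document), and the function names are hypothetical.

```python
from bisect import bisect_left


def nearest_frame(timestamps, t):
    """Index of the entry in the sorted list `timestamps` closest to `t`."""
    if not timestamps:
        raise ValueError("empty timestamp list")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier frame.
    if t - timestamps[i - 1] <= timestamps[i] - t:
        return i - 1
    return i


def align(reference, other):
    """For each reference timestamp, the index of the nearest frame in `other`."""
    return [nearest_frame(other, t) for t in reference]
```

For example, `align(head_ts, fixed1_ts)` yields, for each head-camera frame, the index of the closest fixed-camera frame. A maximum allowed time difference could be added if unmatched frames should be dropped.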

Depth data

Each scene contains three channels of depth data: one head-mounted channel and two fixed-camera channels. The frames from each depth camera are stored together in one folder, with one PNG file per frame. The CSV file of the same name records the timestamp of each frame; for example, head_stereo_depth.csv records the timestamp of each frame in the head_stereo_depth/ directory. Each PNG stores 16-bit depth values, and the following parsing script can be used to read one frame.

Parse code: read_png_16bit.py

Output result example:

>python read_png_16bit.py dataset\3\3_1_1760508893\depth_fixed\depth_0_1760508893407.png
Image data type: uint16
Image shape (height, width): (1280, 720)
Pixel value range: [0, 4999]
Pixel values: (0, 0): 2229 (0, 1): 895 (0, 2): 2995 (0, 3): 4374 (0, 4): 3547 (0, 5): 1692 (0, 6): 4714 (0, 7): 1647 (0, 8): 3925 (0, 9): 3390 (0, 10): 2282 (0, 11): 862 (0, 12): 2801 (0, 13): 1817 (0, 14): 3244 (0, 15): 1869 ......
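
The read_png_16bit.py script is referenced but not reproduced in this document. As a stand-in, the sketch below decodes a 16-bit grayscale PNG using only the Python standard library. It assumes the depth frames are non-interlaced, single-channel PNGs (this document states only that they are 16-bit), and the function name `read_png_16bit` is borrowed from the script name for illustration rather than taken from its source. An image library would normally do this in one call; the manual decoder is used here only to keep the example free of dependencies.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _paeth(a, b, c):
    """Paeth predictor from the PNG specification."""
    p = a + b - c
    pa, pb, pc = abs(p - a), abs(p - b), abs(p - c)
    if pa <= pb and pa <= pc:
        return a
    return b if pb <= pc else c


def read_png_16bit(path):
    """Decode a non-interlaced 16-bit grayscale PNG into rows of ints."""
    with open(path, "rb") as f:
        blob = f.read()
    if blob[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")

    # Walk the chunk list: 4-byte length, 4-byte type, data, 4-byte CRC.
    pos, idat, header = 8, bytearray(), None
    while pos < len(blob):
        length, kind = struct.unpack(">I4s", blob[pos:pos + 8])
        body = blob[pos + 8:pos + 8 + length]
        if kind == b"IHDR":
            header = struct.unpack(">IIBBBBB", body)
        elif kind == b"IDAT":
            idat += body
        pos += 12 + length
    if header is None:
        raise ValueError("missing IHDR chunk")
    width, height, depth, color, _, _, interlace = header
    if (depth, color, interlace) != (16, 0, 0):
        raise ValueError("expected non-interlaced 16-bit grayscale")

    raw = zlib.decompress(bytes(idat))
    bpp, stride = 2, width * 2          # two bytes per pixel
    prev, rows = bytearray(stride), []
    for y in range(height):
        start = y * (stride + 1)        # each scanline has a filter byte
        ftype = raw[start]
        line = bytearray(raw[start + 1:start + 1 + stride])
        for i in range(stride):         # undo the per-scanline filter
            a = line[i - bpp] if i >= bpp else 0
            b = prev[i]
            c = prev[i - bpp] if i >= bpp else 0
            pred = (0, a, b, (a + b) // 2, _paeth(a, b, c))[ftype]
            line[i] = (line[i] + pred) & 0xFF
        # PNG stores 16-bit samples big-endian.
        rows.append(list(struct.unpack(">%dH" % width, bytes(line))))
        prev = line
    return rows
```

With `rows = read_png_16bit(path)`, the image height is `len(rows)`, the width is `len(rows[0])`, and `rows[r][c]` is the raw 16-bit value at row `r`, column `c`. The pure-Python loop is slow on full-size frames, so a compiled image library is the practical choice for bulk processing.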

H5 File

All structured data other than video data and depth data is stored in H5 format.

Each H5 file contains a dataset named data, which consists of multiple records, each corresponding to a single frame. Every record includes three fields: index (an np.int64 indicating the frame number, starting from 0 and increasing sequentially), timestamp (an np.float64 representing the time, where the integer part denotes seconds and the three decimal places indicate milliseconds), and elements, whose structure varies depending on the data type and is described in detail below.
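
The timestamp layout described above (whole seconds plus three decimal places of milliseconds) can be handled as sketched below. The two timestamps in the usage note come from the tracker output example later in this document; the function names are hypothetical, and the computed rate is only an average over that excerpt, not a stated capture rate.

```python
def split_timestamp(ts):
    """Split a float timestamp into (whole seconds, milliseconds)."""
    # round() rather than int(): a value such as x.635 may be stored
    # as x.634999..., which int() would truncate to 634 ms.
    total_ms = round(ts * 1000)
    return divmod(total_ms, 1000)


def mean_frame_rate(first_ts, last_ts, first_index, last_index):
    """Average frames per second between two records."""
    intervals = last_index - first_index
    elapsed = last_ts - first_ts
    if intervals <= 0 or elapsed <= 0:
        raise ValueError("need two distinct records in ascending order")
    return intervals / elapsed
```

For instance, `split_timestamp(1760512240.635)` gives `(1760512240, 635)`, and `mean_frame_rate(1760512240.635, 1760512241.746, 0, 101)` divides 101 frame intervals by the 1.111 s between those two records.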

Position and attitude data of the tracker

In each scene, trackers record the 6DOF pose (position and orientation) of key objects and body segments with sub-millimeter accuracy, as listed in the following table:

Name	Meaning
fixed1_cam	Fixed Camera 1
fixed2_cam	Fixed Camera 2
Head	Head
Spine	Back
Hips	Hips
RightUpLeg	Right thigh
RightFoot	Right foot
LeftUpLeg	Left thigh
LeftFoot	Left foot
RightHand	Back of right hand
RightHandThumb2	Right thumb tip
RightHandThumb1	Right thumb base
RightHandIndex2	Right index finger tip
RightHandIndex1	Right index finger base
RightHandMiddle2	Right middle finger tip
RightHandMiddle1	Right middle finger base
RightHandRing2	Right ring finger tip
RightHandRing1	Right ring finger base
RightHandPinky2	Right little finger tip
RightHandPinky1	Right little finger base
LeftHand	Back of left hand
LeftHandThumb2	Left thumb tip
LeftHandThumb1	Left thumb base
LeftHandIndex2	Left index finger tip
LeftHandIndex1	Left index finger base
LeftHandMiddle2	Left middle finger tip
LeftHandMiddle1	Left middle finger base
LeftHandRing2	Left ring finger tip
LeftHandRing1	Left ring finger base
LeftHandPinky2	Left little finger tip
LeftHandPinky1	Left little finger base
TBD	Other props

These data are stored in the tracker_sixdof_data.h5 file, with sub-millimeter positional accuracy.

Parse code: read_sixdof_data_h5_1.py

Output result example:

>python read_sixdof_data_h5.py dataset\3\3_2_1760512240\tracker_sixdof_data.h5
Successfully opened file: dataset\3\3_2_1760512240\tracker_sixdof_data.h5
File contains 102 frames of data
[0] 6DOF, 0, 1760512240.635: Head(pos: [4.700, 9.741, 0.208], rot: [6.931, 2.250, 8.570, 7.848]), Spine2(pos: [9.661, 1.140, 5.715], rot: [1.364, 2.656, 6.938, 9.754]), LeftArm(pos: [6.545, 3.506, 4.913], rot: [6.109, 4.777, 8.283, 7.738]), LeftForeArm(pos: [8.475, 4.751, 4.370], rot: [7.183, 1.154, 1.972, 3.395]), RightArm(pos: [6.025, 1.487, 5.187], rot: [5.980, 0.864, 3.507, 7.056]), RightForeArm(pos: [8.127, 6.362, 0.697], rot: [7.863, 4.106, 2.043, 4.971]), RightHand(pos: [5.904, 8.878, 7.089], rot: [0.521, 6.584, 7.460, 4.837]), RightHandThumb2(pos: [8.239, 4.211, 4.194], rot: [3.032, 3.650, 3.832, 2.387]), RightHandThumb1(pos: [3.484, 5.090, 9.470], rot: [6.862, 9.581, 1.757, 1.394]), RightHandIndex2(pos: [6.113, 2.014, 9.816], rot: [4.125, 1.529, 4.666, 8.462]), RightHandIndex1(pos: [1.576, 6.268, 0.611], rot: [2.259, 4.032, 7.302, 7.763]), RightHandMiddle2(pos: [6.777, 4.669, 2.550], rot: [5.158, 8.411, 9.693, 3.566]), RightHandMiddle1(pos: [8.038, 1.110, 9.366], rot: [9.309, 8.405, 1.692, 2.564]), RightHandRing2(pos: [2.042, 3.063, 2.148], rot: [1.387, 7.023, 6.425, 4.465]), RightHandRing1(pos: [9.300, 7.093, 1.573], rot: [4.383, 3.871, 8.150, 6.244]), RightHandPinky2(pos: [5.530, 6.196, 2.384], rot: [3.299, 4.086, 1.726, 2.509]), RightHandPinky1(pos: [2.016, 7.864, 3.604], rot: [3.591, 4.292, 5.464, 8.199]), LeftHand(pos: [1.252, 2.116, 0.793], rot: [5.515, 5.732, 7.438, 6.570]), LeftHandThumb2(pos: [7.785, 7.563, 8.329], rot: [9.511, 3.618, 7.574, 8.522]), LeftHandThumb1(pos: [6.129, 5.274, 2.667], rot: [1.951, 5.426, 4.259, 8.812]), LeftHandIndex2(pos: [0.070, 0.762, 6.821], rot: [5.819, 2.508, 5.206, 3.099]), LeftHandIndex1(pos: [9.334, 2.247, 4.293], rot: [4.109, 4.035, 2.718, 5.930]), LeftHandMiddle2(pos: [4.786, 5.211, 0.040], rot: [5.388, 4.672, 1.993, 9.700]), LeftHandMiddle1(pos: [9.905, 9.530, 3.052], rot: [8.022, 7.669, 7.746, 1.762]), LeftHandRing2(pos: [4.389, 7.090, 3.820], rot: [9.966, 1.297, 9.525, 9.557]), LeftHandRing1(pos: [3.045, 9.463, 0.549], rot: [3.151, 5.749, 4.670, 0.488]), LeftHandPinky2(pos: [3.227, 0.974, 7.268], rot: [3.382, 7.960, 4.778, 5.802]), LeftHandPinky1(pos: [5.065, 3.246, 0.746], rot: [7.747, 3.241, 0.531, 1.255]), bosch(pos: [6.309, 5.763, 1.114], rot: [4.260, 0.026, 2.476, 4.636]), plug_in(pos: [4.470, 0.403, 3.455], rot: [1.058, 1.706, 9.969, 4.323]), plug(pos: [4.142, 6.197, 0.687], rot: [9.169, 7.165, 4.744, 5.936]), mouse(pos: [9.885, 8.997, 8.870], rot: [3.320, 8.982, 6.344, 3.425]), bottle_cap(pos: [6.199, 2.856, 4.770], rot: [2.192, 5.186, 2.645, 6.752])......
[101] 6DOF, 101, 1760512241.746: Head(pos: [3.540, 2.404, 1.593], rot: [5.080, 7.258, 6.355, 1.038]), Spine2(pos: [9.579, 2.308, 1.847], rot: [0.059, 8.118, 3.501, 6.526]), LeftArm(pos: [8.999, 0.246, 9.970], rot: [4.544, 0.580, 4.456, 1.992]), LeftForeArm(pos: [2.368, 6.548, 0.357], rot: [2.181, 4.856, 4.108, 3.070]), RightArm(pos: [2.816, 1.743, 8.460], rot: [2.325, 1.201, 6.576, 1.554]), RightForeArm(pos: [1.738, 7.106, 5.225], rot: [0.179, 1.703, 4.765, 0.839]), RightHand(pos: [9.955, 4.232, 4.448], rot: [4.717, 2.154, 9.040, 7.465]), RightHandThumb2(pos: [5.607, 9.120, 0.200], rot: [1.519, 6.191, 5.054, 1.281]), RightHandThumb1(pos: [0.790, 5.465, 3.948], rot: [8.809, 5.387, 0.557, 3.543]), RightHandIndex2(pos: [8.374, 2.914, 9.005], rot: [1.964, 3.172, 4.835, 4.045]), RightHandIndex1(pos: [0.824, 2.716, 6.149], rot: [5.559, 5.185, 0.850, 2.025]), RightHandMiddle2(pos: [2.019, 8.951, 5.565], rot: [5.268, 5.069, 3.224, 5.385]), RightHandMiddle1(pos: [9.861, 8.248, 1.728], rot: [4.456, 5.168, 4.466, 8.939]), RightHandRing2(pos: [9.052, 6.687, 9.520], rot: [8.367, 5.385, 6.231, 6.692]), RightHandRing1(pos: [3.076, 7.954, 6.442], rot: [6.692, 2.316, 9.451, 1.908]), RightHandPinky2(pos: [5.206, 9.323, 3.775], rot: [6.185, 5.471, 0.940, 3.120]), RightHandPinky1(pos: [7.816, 2.848, 1.227], rot: [9.758, 4.951, 5.415, 1.955]), LeftHand(pos: [2.166, 8.124, 9.284], rot: [6.120, 3.479, 8.747, 6.665]), LeftHandThumb2(pos: [8.513, 2.999, 5.465], rot: [0.832, 7.974, 1.544, 6.478]), LeftHandThumb1(pos: [1.582, 0.129, 6.390], rot: [9.830, 5.092, 1.790, 5.468]), LeftHandIndex2(pos: [9.805, 7.059, 1.011], rot: [0.682, 6.371, 2.648, 5.015]), LeftHandIndex1(pos: [8.539, 7.739, 3.178], rot: [7.412, 4.053, 9.804, 5.553]), LeftHandMiddle2(pos: [3.523, 3.369, 8.865], rot: [2.969, 1.512, 1.691, 6.769]), LeftHandMiddle1(pos: [7.458, 3.199, 1.939], rot: [4.782, 2.153, 2.341, 2.070]), LeftHandRing2(pos: [7.223, 9.727, 6.515], rot: [5.988, 3.272, 0.142, 0.802]), LeftHandRing1(pos: [1.205, 6.015, 7.121], rot: [0.597, 5.315, 8.537, 9.457]), LeftHandPinky2(pos: [1.071, 4.223, 8.325], rot: [7.055, 8.880, 2.564, 0.416]), LeftHandPinky1(pos: [7.118, 6.949, 3.440], rot: [6.550, 6.910, 8.168, 9.856]), bosch(pos: [5.119, 9.046, 4.639], rot: [1.326, 0.504, 2.155, 0.527]), plug_in(pos: [3.005, 6.233, 0.248], rot: [1.379, 9.225, 0.698, 1.055]), plug(pos: [5.289, 7.213, 8.430], rot: [3.459, 0.358, 6.354, 7.308]), mouse(pos: [7.984, 4.799, 1.079], rot: [6.441, 7.718, 8.183, 1.362]), bottle_cap(pos: [1.310, 0.743, 8.835], rot: [5.751, 7.284, 8.748, 0.109])
================================================================================
Summary:
  File Path: dataset\3\3_2_1760512240\tracker_sixdof_data.h5
  Total frames: 102
  Elements per frame: 33
  Index range difference: (last_index + 1) - first_index = (101 + 1) - 0 = 102
  Index continuity: Normal
================================================================================
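
The summary above reports an index range difference and an index continuity status. How the parse script computes them is not shown, so the following is only a plausible equivalent: it evaluates the same `(last_index + 1) - first_index` expression and additionally checks that consecutive indices differ by exactly one. The function name is hypothetical.

```python
def index_summary(indices):
    """Return (span, is_continuous) for a sequence of frame indices.

    span is (last_index + 1) - first_index, as printed in the summary.
    is_continuous is True when every index is its predecessor plus one.
    """
    if not indices:
        return 0, True
    span = (indices[-1] + 1) - indices[0]
    continuous = all(b - a == 1 for a, b in zip(indices, indices[1:]))
    return span, continuous
```

For the 102-frame file above, `index_summary(list(range(102)))` returns `(102, True)`. A dropped frame, as in `index_summary([0, 1, 3])`, returns `(4, False)`: the span no longer matches the number of records.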

Human skeletal motion data

Human skeletal motion data (full body and fingers), computed from the tracker data, is stored in the file human_bones.h5.

Hierarchy

In addition to the "data" dataset, the root of this file contains a "Skeleton" group that defines the bone lengths and parent-child connections of the human body, equivalent to the HIERARCHY section of the BVH format. All bones are stored as sibling sub-groups directly under "Skeleton" (a flat layout); the hierarchy itself is encoded in each bone's attributes. An example is as follows:

/Skeleton
  (attr) unit = "cm"
  /Hips
    (attr) parent = ""
    (attr) offset = [0.0, 10.0, 0.0]
    (attr) rotation_type = "quaternion"
    (attr) channels = ["w","x","y","z"]
    (attr) children = ["LeftUpLeg","RightUpLeg","Spine"]
  ......
  /LeftFoot
    (attr) parent = "LeftLeg"
    (attr) offset = [0.0, -40.0, 0.0]
    (attr) rotation_type = "quaternion"
    (attr) channels = ["w","x","y","z"]
    (attr) children = ["LeftFoot_End"]
  /LeftFoot_End
    (attr) parent = "LeftFoot"
    (attr) offset = [0.0, -2.0, 0.0]
    (attr) rotation_type = "none"
    (attr) channels = []
    (attr) children = []
    (attr) is_end = true
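
Since the bones are stored side by side and linked only through their parent and children attributes, a reader typically has to rebuild the tree. The sketch below does this on a hand-written dictionary that stands in for the attributes; it is a truncated toy skeleton (RightUpLeg and Spine are treated as leaves here, which is not true of the real skeleton), and the function names are hypothetical.

```python
# Toy stand-in for the per-bone attributes. In the real file these
# values would be read from the bone sub-groups; this excerpt is
# deliberately truncated for illustration.
BONES = {
    "Hips": {"parent": "", "children": ["LeftUpLeg", "RightUpLeg", "Spine"]},
    "LeftUpLeg": {"parent": "Hips", "children": ["LeftLeg"]},
    "LeftLeg": {"parent": "LeftUpLeg", "children": ["LeftFoot"]},
    "LeftFoot": {"parent": "LeftLeg", "children": ["LeftFoot_End"]},
    "LeftFoot_End": {"parent": "LeftFoot", "children": []},
    "RightUpLeg": {"parent": "Hips", "children": []},  # truncated
    "Spine": {"parent": "Hips", "children": []},       # truncated
}


def find_root(bones):
    """Check parent/children agreement and return the single root name."""
    roots = [name for name, bone in bones.items() if bone["parent"] == ""]
    if len(roots) != 1:
        raise ValueError(f"expected exactly one root, found {roots}")
    for name, bone in bones.items():
        for child in bone["children"]:
            if bones[child]["parent"] != name:
                raise ValueError(f"{child} does not name {name} as parent")
        parent = bone["parent"]
        if parent and name not in bones[parent]["children"]:
            raise ValueError(f"{parent} does not list {name} as a child")
    return roots[0]


def depth_first(bones, root):
    """Bone names in parent-before-child order, as in a BVH HIERARCHY."""
    order, stack = [], [root]
    while stack:
        name = stack.pop()
        order.append(name)
        # Reverse so the first listed child is visited first.
        stack.extend(reversed(bones[name]["children"]))
    return order
```

`depth_first(BONES, find_root(BONES))` visits Hips first, then the left leg chain down to LeftFoot_End, then RightUpLeg and Spine.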

Data

The elements field of each record contains all the bones of the human body for one frame. Each bone entry has the following fields:

name (string, e.g. "Hips")
position (float32, shape=[3], e.g. x,y,z)
rotation (float32, shape=[4], e.g. w,x,y,z)
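
In the output example for this file, child bones carry small positions (a few centimeters) while Hips carries a large one, which suggests that positions and rotations are expressed relative to the parent bone, as in BVH. Under that assumption, which this document does not confirm, world-space poses can be obtained by composing transforms up the parent chain. The sketch below uses plain tuples and (w, x, y, z) quaternions; all function names are hypothetical.

```python
def quat_mul(q, r):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    w1, x1, y1, z1 = q
    w2, x2, y2, z2 = r
    return (
        w1 * w2 - x1 * x2 - y1 * y2 - z1 * z2,
        w1 * x2 + x1 * w2 + y1 * z2 - z1 * y2,
        w1 * y2 - x1 * z2 + y1 * w2 + z1 * x2,
        w1 * z2 + x1 * y2 - y1 * x2 + z1 * w2,
    )


def rotate(q, v):
    """Rotate 3-vector v by unit quaternion q (computes q * v * q^-1)."""
    w, x, y, z = q
    conjugate = (w, -x, -y, -z)
    _, rx, ry, rz = quat_mul(quat_mul(q, (0.0, *v)), conjugate)
    return (rx, ry, rz)


def world_pose(name, parents, local):
    """World (position, rotation) of a bone.

    parents maps bone name -> parent name ("" for the root).
    local maps bone name -> (position, rotation), assumed parent-relative.
    """
    pos, rot = local[name]
    parent = parents[name]
    if not parent:
        return tuple(pos), tuple(rot)
    p_pos, p_rot = world_pose(parent, parents, local)
    offset = rotate(p_rot, pos)  # child offset expressed in world axes
    world_pos = tuple(a + b for a, b in zip(p_pos, offset))
    return world_pos, quat_mul(p_rot, rot)
```

The recursion recomputes parent poses for every bone, which is fine for a single frame; caching the parent results would be the natural optimisation when processing whole sequences.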

Parse code: read_sixdof_data_h5_2.py

Output result example:

>python read_sixdof_data_h5.py dataset\1\1_41_1760493249\human_bone_data.h5
Successfully opened file: dataset\1\1_41_1760493249\human_bone_data.h5
File contains 68 frames of data
[0] 6DOF, 0, 1760493249.683: Hips(pos: [7.421, 96.760, -150.097], rot: [0.963, -0.018, 0.266, 0.038]), RightUpLeg(pos: [-10.965, 0.397, -2.292], rot: [0.994, 0.059, -0.091, -0.029]), RightLeg(pos: [-0.061, -45.043, -0.383], rot: [0.997, 0.025, -0.057, -0.034]), RightFoot(pos: [-0.603, -42.091, -2.064], rot: [-0.990, 0.050, 0.127, -0.036]), LeftUpLeg(pos: [10.935, -0.395, 2.304], rot: [0.996, 0.068, -0.052, -0.021]), LeftLeg(pos: [0.032, -45.075, 0.384], rot: [-0.983, 0.028, 0.175, -0.041]), LeftFoot(pos: [0.893, -42.288, 1.950], rot: [0.997, -0.024, -0.041, -0.068]), Spine(pos: [0.006, 8.119, -0.013], rot: [1.000, 0.001, 0.010, -0.005]), Spine1(pos: [0.005, 17.979, -0.008], rot: [1.000, 0.001, 0.010, -0.008]), Spine2(pos: [0.003, 12.759, 0.002], rot: [0.999, 0.002, 0.048, -0.010]), Neck(pos: [0.001, 19.140, 0.002], rot: [1.000, 0.008, 0.000, -0.002]), Neck1(pos: [0.000, 4.250, 0.000], rot: [1.000, 0.009, 0.000, -0.002]), Head(pos: [0.001, 4.250, -0.001], rot: [0.944, 0.006, -0.330, -0.003]), RightShoulder(pos: [-2.902, 13.341, -0.006], rot: [-0.998, 0.047, -0.040, 0.020]), RightArm(pos: [-16.056, -0.001, -0.008], rot: [0.728, -0.144, 0.530, 0.411]), RightForeArm(pos: [-27.998, -0.035, 0.031], rot: [0.743, -0.130, 0.577, -0.312]), RightHand(pos: [-25.992, 0.018, -0.012], rot: [0.870, 0.308, 0.097, -0.373]), RightHandThumb1(pos: [-1.937, -0.484, 2.518], rot: [0.889, 0.345, 0.277, -0.117]), RightHandThumb2(pos: [-3.872, 0.000, 0.000], rot: [0.997, 0.000, -0.000, 0.083]), RightHandThumb3(pos: [-2.690, 0.000, 0.000], rot: [0.998, 0.000, 0.000, 0.055]), RightInHandIndex(pos: [-3.389, 0.535, 2.080], rot: [1.000, 0.000, 0.000, 0.000]), RightHandIndex1(pos: [-5.485, -0.096, 1.051], rot: [0.986, 0.014, 0.130, 0.107]), RightHandIndex2(pos: [-3.806, 0.000, 0.000], rot: [0.996, 0.000, -0.000, 0.088]), RightHandIndex3(pos: [-2.158, 0.000, 0.000], rot: [0.998, 0.000, -0.000, 0.062]), RightInHandMiddle(pos: [-3.556, 0.544, 0.796], rot: [1.000, 0.000, 0.000, 0.000]), RightHandMiddle1(pos: [-5.441, -0.088, 0.330], rot: [0.999, 0.000, 0.000, 0.054]), RightHandMiddle2(pos: [-4.153, 0.000, 0.000], rot: [0.999, -0.000, -0.000, 0.054]), RightHandMiddle3(pos: [-2.603, 0.000, 0.000], rot: [-0.999, -0.000, -0.000, -0.036]), RightInHandRing(pos: [-3.539, 0.566, -0.136], rot: [1.000, 0.000, 0.000, 0.000]), RightHandRing1(pos: [-4.873, -0.023, -0.504], rot: [0.995, -0.005, -0.087, 0.055]), RightHandRing2(pos: [-3.619, 0.000, 0.000], rot: [0.998, 0.000, 0.000, 0.055]), RightHandRing3(pos: [-2.511, 0.000, 0.000], rot: [0.999, 0.000, 0.000, 0.037]), RightInHandPinky(pos: [-3.324, 0.494, -1.264], rot: [1.000, 0.000, 0.000, 0.000]), RightHandPinky1(pos: [-4.354, -0.023, -1.147], rot: [0.985, -0.004, -0.174, 0.021]), RightHandPinky2(pos: [-2.898, 0.000, 0.000], rot: [0.999, 0.000, -0.000, 0.032]), RightHandPinky3(pos: [-1.831, 0.000, 0.000], rot: [1.000, -0.000, -0.000, 0.021]), LeftShoulder(pos: [2.899, 13.338, -0.013], rot: [0.993, -0.067, 0.078, 0.049]), LeftArm(pos: [16.100, -0.000, -0.008], rot: [0.692, 0.144, -0.004, -0.707]), LeftForeArm(pos: [28.000, -0.000, -0.000], rot: [0.998, -0.016, -0.049, -0.020]), LeftHand(pos: [26.000, 0.001, 0.004], rot: [0.996, 0.072, -0.057, 0.001]), LeftHandThumb1(pos: [1.937, -0.484, 2.518], rot: [0.879, 0.411, -0.238, 0.039]), LeftHandThumb2(pos: [3.872, 0.000, 0.000], rot: [0.992, 0.000, 0.000, -0.124]), LeftHandThumb3(pos: [2.690, 0.000, 0.000], rot: [0.997, 0.000, -0.000, -0.082]), LeftInHandIndex(pos: [3.389, 0.535, 2.080], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandIndex1(pos: [5.485, -0.096, 1.051], rot: [0.965, 0.030, -0.127, -0.228]), LeftHandIndex2(pos: [3.806, 0.000, 0.000], rot: [0.982, 0.000, 0.000, -0.189]), LeftHandIndex3(pos: [2.158, 0.000, 0.000], rot: [0.991, 0.000, -0.000, -0.133]), LeftInHandMiddle(pos: [3.556, 0.544, 0.796], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandMiddle1(pos: [5.441, -0.088, 0.330], rot: [0.970, 0.000, 0.000, -0.244]), LeftHandMiddle2(pos: [4.153, 0.000, 0.000], rot: [0.970, -0.000, 0.000, -0.244]), LeftHandMiddle3(pos: [2.603, 0.000, 0.000], rot: [0.987, 0.000, 0.000, -0.164]), LeftInHandRing(pos: [3.539, 0.566, -0.136], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandRing1(pos: [4.873, -0.023, -0.504], rot: [-0.963, 0.022, -0.084, 0.256]), LeftHandRing2(pos: [3.619, 0.000, 0.000], rot: [0.966, 0.000, 0.000, -0.257]), LeftHandRing3(pos: [2.511, 0.000, 0.000], rot: [0.985, 0.000, 0.000, -0.172]), LeftInHandPinky(pos: [3.324, 0.494, -1.264], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandPinky1(pos: [4.354, -0.023, -1.147], rot: [0.957, -0.041, 0.169, -0.232]), LeftHandPinky2(pos: [2.898, 0.000, 0.000], rot: [0.937, 0.000, 0.000, -0.350]), LeftHandPinky3(pos: [1.831, 0.000, 0.000], rot: [0.972, 0.000, 0.000, -0.236])......

5[67] 6DOF, 67, 1760493250.952: Hips(pos: [7.361, 96.789, -149.369], rot: [0.994, -0.017, 0.111, 0.011]), RightUpLeg(pos: [-11.183, 0.213, -1.646], rot: [0.996, 0.048, -0.067, -0.024]), RightLeg(pos: [-0.067, -45.055, -0.275], rot: [1.000, 0.006, 0.019, -0.021]), RightFoot(pos: [-0.324, -42.057, -1.510], rot: [-0.997, 0.029, 0.067, -0.029]), LeftUpLeg(pos: [11.191, -0.227, 1.653], rot: [0.996, 0.085, -0.037, -0.000]), LeftLeg(pos: [0.052, -45.053, 0.276], rot: [-0.995, 0.033, 0.093, -0.032]), LeftFoot(pos: [0.561, -42.133, 1.446], rot: [-0.997, 0.039, -0.020, 0.056]), Spine(pos: [0.003, 8.123, -0.000], rot: [1.000, 0.003, 0.007, -0.002]), Spine1(pos: [-0.001, 17.982, -0.001], rot: [1.000, 0.003, 0.007, -0.003]), Spine2(pos: [-0.004, 12.762, -0.003], rot: [0.999, 0.002, 0.031, -0.004]), Neck(pos: [0.001, 19.140, 0.001], rot: [1.000, 0.003, -0.000, 0.001]), Neck1(pos: [0.000, 4.250, 0.000], rot: [1.000, 0.004, -0.000, 0.001]), Head(pos: [0.000, 4.250, -0.001], rot: [0.988, 0.003, -0.155, 0.000]), RightShoulder(pos: [-2.900, 13.338, -0.011], rot: [-0.998, 0.053, -0.019, 0.023]), RightArm(pos: [-16.116, 0.002, -0.001], rot: [0.754, -0.098, 0.489, 0.427]), RightForeArm(pos: [-28.013, 0.001, -0.008], rot: [0.714, 0.011, 0.648, -0.264]), RightHand(pos: [-25.988, 0.005, -0.018], rot: [0.892, 0.273, 0.084, -0.350]), RightHandThumb1(pos: [-1.937, -0.484, 2.518], rot: [0.888, 0.343, 0.283, -0.120]), RightHandThumb2(pos: [-3.872, 0.000, 0.000], rot: [-0.997, -0.000, 0.000, -0.080]), RightHandThumb3(pos: [-2.690, 0.000, 0.000], rot: [0.999, 0.000, 0.000, 0.053]), RightInHandIndex(pos: [-3.389, 0.535, 2.080], rot: [1.000, 0.000, 0.000, 0.000]), RightHandIndex1(pos: [-5.485, -0.096, 1.051], rot: [0.986, 0.014, 0.130, 0.105]), RightHandIndex2(pos: [-3.806, 0.000, 0.000], rot: [0.996, 0.000, -0.000, 0.087]), RightHandIndex3(pos: [-2.158, 0.000, 0.000], rot: [0.998, 0.000, 0.000, 0.061]), RightInHandMiddle(pos: [-3.556, 0.544, 0.796], rot: [1.000, 0.000, 0.000, 0.000]), 
RightHandMiddle1(pos: [-5.441, -0.088, 0.330], rot: [0.998, -0.000, -0.000, 0.063]), RightHandMiddle2(pos: [-4.153, 0.000, 0.000], rot: [0.998, 0.000, 0.000, 0.063]), RightHandMiddle3(pos: [-2.603, 0.000, 0.000], rot: [0.999, 0.000, 0.000, 0.042]), RightInHandRing(pos: [-3.539, 0.566, -0.136], rot: [1.000, 0.000, 0.000, 0.000]), RightHandRing1(pos: [-4.873, -0.023, -0.504], rot: [-0.995, 0.005, 0.087, -0.055]), RightHandRing2(pos: [-3.619, 0.000, 0.000], rot: [-0.998, -0.000, -0.000, -0.055]), RightHandRing3(pos: [-2.511, 0.000, 0.000], rot: [0.999, 0.000, 0.000, 0.037]), RightInHandPinky(pos: [-3.324, 0.494, -1.264], rot: [1.000, 0.000, 0.000, 0.000]), RightHandPinky1(pos: [-4.354, -0.023, -1.147], rot: [-0.985, 0.002, 0.174, -0.012]), RightHandPinky2(pos: [-2.898, 0.000, 0.000], rot: [1.000, 0.000, 0.000, 0.018]), RightHandPinky3(pos: [-1.831, 0.000, 0.000], rot: [1.000, 0.000, 0.000, 0.012]), LeftShoulder(pos: [2.901, 13.342, 0.010], rot: [0.995, -0.068, 0.058, 0.048]), LeftArm(pos: [16.098, 0.000, -0.000], rot: [0.697, 0.145, 0.002, -0.702]), LeftForeArm(pos: [27.999, -0.002, 0.000], rot: [0.999, -0.023, -0.031, -0.015]), LeftHand(pos: [26.000, -0.000, -0.001], rot: [0.996, 0.068, -0.057, 0.003]), LeftHandThumb1(pos: [1.937, -0.484, 2.518], rot: [0.879, 0.412, -0.239, 0.035]), LeftHandThumb2(pos: [3.872, 0.000, 0.000], rot: [0.992, -0.000, 0.000, -0.125]), LeftHandThumb3(pos: [2.690, 0.000, 0.000], rot: [0.997, 0.000, -0.000, -0.083]), LeftInHandIndex(pos: [3.389, 0.535, 2.080], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandIndex1(pos: [5.485, -0.096, 1.051], rot: [0.964, 0.031, -0.127, -0.232]), LeftHandIndex2(pos: [3.806, 0.000, 0.000], rot: [0.981, 0.000, 0.000, -0.192]), LeftHandIndex3(pos: [2.158, 0.000, 0.000], rot: [0.991, 0.000, -0.000, -0.135]), LeftInHandMiddle(pos: [3.556, 0.544, 0.796], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandMiddle1(pos: [5.441, -0.088, 0.330], rot: [-0.970, -0.000, 0.000, 0.244]), LeftHandMiddle2(pos: [4.153, 0.000, 0.000], rot: 
[0.970, 0.000, 0.000, -0.244]), LeftHandMiddle3(pos: [2.603, 0.000, 0.000], rot: [0.987, 0.000, 0.000, -0.163]), LeftInHandRing(pos: [3.539, 0.566, -0.136], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandRing1(pos: [4.873, -0.023, -0.504], rot: [-0.963, 0.022, -0.084, 0.253]), LeftHandRing2(pos: [3.619, 0.000, 0.000], rot: [0.967, 0.000, 0.000, -0.254]), LeftHandRing3(pos: [2.511, 0.000, 0.000], rot: [0.985, 0.000, 0.000, -0.170]), LeftInHandPinky(pos: [3.324, 0.494, -1.264], rot: [1.000, 0.000, 0.000, 0.000]), LeftHandPinky1(pos: [4.354, -0.023, -1.147], rot: [0.959, -0.040, 0.169, -0.225]), LeftHandPinky2(pos: [2.898, 0.000, 0.000], rot: [0.941, 0.000, -0.000, -0.339]), LeftHandPinky3(pos: [1.831, 0.000, 0.000], rot: [0.974, 0.000, -0.000, -0.228])

================================================================================
Summary:
  File Path: dataset\1\1_41_1760493249\human_bone_data.h5
  Total frames: 68
  Elements per frame: 59
  Index range difference: (last_index + 1) - first_index = (67 + 1) - 0 = 68
  Index continuity: Normal
================================================================================
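The summary fields above can be derived from the list of frame indices alone. The following is a minimal sketch of such a check, not the logic of the shipped parse script; the function name `check_index_continuity` and the keys of the returned dictionary are illustrative assumptions.

```python
def check_index_continuity(indices):
    """Summarize a list of frame indices in the style of the summary above.

    `indices` is assumed to be the per-frame index values in file order.
    """
    if not indices:
        return {"total_frames": 0, "index_range": 0, "continuous": True}

    total_frames = len(indices)
    # Same formula as the summary: (last_index + 1) - first_index
    index_range = (indices[-1] + 1) - indices[0]
    # Continuous means every index is exactly one greater than its predecessor.
    consecutive = all(b - a == 1 for a, b in zip(indices, indices[1:]))
    return {
        "total_frames": total_frames,
        "index_range": index_range,
        "continuous": consecutive and index_range == total_frames,
    }


if __name__ == "__main__":
    print(check_index_continuity(list(range(68))))
```

A mismatch between `total_frames` and `index_range` would indicate dropped or duplicated frames.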

Palm Pressure Data

Palm pressure data is stored in the hand_pressure.h5 file. Each palm has 129 pressure points, and each point takes an integer value from 0 to 255.

Hand Pressure Point Map:

Each element contains one frame of palm pressure data for one hand:

name (string: "left" or "right")
value (uint8, shape=[129])
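As an illustration of working with one such element, the sketch below validates the shape of a `value` array and rescales it to the range [0, 1]. It assumes the element has already been read into a NumPy array; the HDF5 layout itself is not shown here, and the helper name `normalize_pressure` is hypothetical.

```python
import numpy as np

NUM_POINTS = 129  # pressure points per palm, as stated above


def normalize_pressure(value):
    """Convert one palm's raw readings (0..255) to float32 values in [0, 1]."""
    arr = np.asarray(value)
    if arr.shape != (NUM_POINTS,):
        raise ValueError(f"expected shape ({NUM_POINTS},), got {arr.shape}")
    if arr.min() < 0 or arr.max() > 255:
        raise ValueError("pressure values must lie in 0..255")
    return arr.astype(np.float32) / 255.0


if __name__ == "__main__":
    # Synthetic element; real data would come from the pressure file.
    element = {"name": "left", "value": np.zeros(NUM_POINTS, dtype=np.uint8)}
    element["value"][0] = 255
    scaled = normalize_pressure(element["value"])
    print(element["name"], scaled.shape, scaled.dtype)
```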

Parse code: read_hand_pressure_h5.py

Output result example:

output

>python read_hand_pressure_h5.py dataset\3\3_2_1760512240\hand_pressure_data.h5
Successfully opened file: dataset\3\3_2_1760512240\hand_pressure_data.h5
File contains 68 frames of pressure data

4[0] HandPressure, 0, 1760512240.635: left([86, 17, 246, 100, 167, 183, 120, 221, 1, 75, 80, 241, 219, 47, 29, 84, 226, 203, 102, 161, 82, 64, 19, 161, 149, 211, 97, 156, 208, 148, 67, 188, 115, 131, 151, 120, 59, 98, 37, 56, 1, 112, 187, 152, 249, 229, 88, 22, 168, 224, 241, 206, 130, 18, 139, 181, 137, 15, 114, 111, 72, 154, 244, 230, 42, 2, 179, 105, 56, 120, 239, 75, 228, 75, 130, 182, 91, 152, 255, 85, 120, 178, 125, 207, 187, 5, 80, 31, 88, 74, 52, 160, 175, 139, 114, 157, 77, 168, 130, 116, 172, 103, 197, 152, 237, 239, 209, 124, 68, 163, 35, 185, 39, 21, 74, 147, 66, 53, 157, 168, 13, 168, 184, 56, 156, 21, 18, 219, 8], len=129), right([185, 113, 48, 184, 119, 176, 247, 226, 169, 233, 150, 36, 76, 157, 23, 171, 85, 3, 0, 56, 210, 205, 164, 139, 54, 8, 14, 8, 61, 185, 43, 82, 109, 190, 101, 171, 78, 23, 110, 244, 224, 67, 188, 62, 139, 194, 221, 165, 229, 215, 231, 120, 221, 233, 46, 190, 82, 178, 26, 86, 223, 178, 230, 161, 200, 197, 75, 40, 14, 194, 64, 177, 17, 25, 41, 234, 74, 76, 153, 178, 42, 108, 188, 235, 165, 147, 84, 125, 216, 106, 3, 28, 210, 100, 138, 16, 87, 238, 72, 209, 103, 79, 98, 109, 1, 106, 155, 78, 140, 221, 18, 231, 176, 127, 244, 143, 240, 77, 86, 210, 109, 116, 128, 172, 81, 218, 123, 229, 51], len=129)......

5[67] HandPressure, 67, 1760512241.746: left([111, 134, 161, 38, 190, 204, 79, 248, 252, 35, 160, 126, 137, 243, 216, 127, 131, 9, 185, 228, 218, 174, 194, 30, 87, 229, 170, 59, 98, 239, 37, 31, 32, 112, 43, 170, 54, 83, 117, 100, 129, 27, 7, 110, 79, 34, 96, 180, 163, 99, 185, 104, 44, 46, 130, 35, 50, 139, 90, 183, 64, 110, 185, 34, 42, 142, 154, 112, 216, 240, 21, 19, 140, 199, 4, 140, 209, 108, 2, 51, 89, 136, 211, 31, 135, 60, 243, 68, 4, 120, 125, 226, 235, 19, 57, 154, 97, 198, 102, 179, 78, 210, 165, 3, 30, 206, 161, 47, 197, 101, 18, 43, 68, 50, 228, 126, 74, 67, 248, 186, 164, 125, 182, 27, 184, 203, 209, 30, 46], len=129), right([108, 216, 41, 202, 192, 184, 200, 129, 236, 64, 110, 226, 41, 110, 14, 82, 5, 220, 17, 201, 186, 201, 160, 99, 88, 84, 19, 231, 103, 84, 50, 138, 56, 80, 14, 189, 184, 81, 255, 49, 159, 152, 90, 78, 123, 155, 240, 45, 68, 76, 157, 154, 231, 152, 107, 172, 222, 150, 1, 120, 187, 246, 4, 36, 156, 147, 202, 204, 20, 1, 167, 204, 183, 57, 166, 2, 139, 194, 182, 144, 44, 139, 115, 132, 123, 215, 74, 151, 24, 58, 57, 97, 77, 68, 72, 184, 78, 96, 162, 212, 71, 65, 58, 97, 54, 37, 131, 222, 253, 245, 177, 147, 94, 93, 35, 158, 146, 69, 131, 242, 71, 83, 77, 193, 144, 229, 241, 56, 35], len=129)

================================================================================
Summary:
  Path: dataset\3\3_2_1760512240\hand_pressure_data.h5
  Total frames: 68
  Elements per frame: 2
  Index range difference: (last_index + 1) - first_index = (67 + 1) - 0 = 68
  Index continuity: Normal
================================================================================

Data Player

This software visualizes the various data points for each recorded entry.

Windows	cage_qt3d_viewer_v2_4_6_win.zip
Linux	cage_qt3d_viewer_v2_4_6_ubuntu.zip

After startup, open trackers_sixdof.h5 in the corresponding data directory; the program automatically loads the other data files. The figure below shows the viewer running:

FAQ

Explain the system's coordinate system.

The tracker's 6DoF and human skeleton motion data share the same world coordinate system.

The origin of the world coordinate system is typically placed on the ground with the Y-axis pointing upward, as shown in the figure above. This is the coordinate system of the optical environment.

The tracker's 6DoF data (trackers_sixdof.h5) represents the coordinate pose of its own model within the world coordinate system, with length units in meters.

The root node Hips in the human skeleton motion data (human_bones.h5) provides 6DoF data in the world coordinate system. Subsequent child bone data represent coordinate poses relative to their parent nodes, with length units in centimeters.
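Because child bones are expressed relative to their parents, obtaining a world-space pose means composing transforms down the chain from Hips. The sketch below shows one step of that composition. It rests on two assumptions that the text above does not confirm: the four `rot` values are a unit quaternion in (w, x, y, z) order, and a child's world rotation is the parent's world rotation multiplied by the child's local rotation. The function names are illustrative.

```python
import numpy as np

CM_TO_M = 0.01  # skeleton lengths are in centimeters, tracker lengths in meters


def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])


def quat_rotate(q, v):
    """Rotate vector v by unit quaternion q = (w, x, y, z)."""
    w, u = q[0], np.asarray(q[1:], dtype=float)
    v = np.asarray(v, dtype=float)
    t = 2.0 * np.cross(u, v)
    return v + w * t + np.cross(u, t)


def child_to_world(parent_pos, parent_rot, local_pos, local_rot):
    """Compose a parent's world pose with a child's parent-relative pose."""
    parent_rot = np.asarray(parent_rot, dtype=float)
    local_rot = np.asarray(local_rot, dtype=float)
    # Sample values are rounded to three decimals, so renormalize first.
    parent_rot = parent_rot / np.linalg.norm(parent_rot)
    local_rot = local_rot / np.linalg.norm(local_rot)
    world_pos = np.asarray(parent_pos, dtype=float) + quat_rotate(parent_rot, local_pos)
    world_rot = quat_mul(parent_rot, local_rot)
    return world_pos, world_rot


if __name__ == "__main__":
    # Hips and RightUpLeg values taken from frame 0 of the sample output.
    hips_pos, hips_rot = [7.421, 96.760, -150.097], [0.963, -0.018, 0.266, 0.038]
    leg_pos, leg_rot = [-10.965, 0.397, -2.292], [0.994, 0.059, -0.091, -0.029]
    pos_cm, rot = child_to_world(hips_pos, hips_rot, leg_pos, leg_rot)
    print("RightUpLeg world position (m):", pos_cm * CM_TO_M)
```

Repeating this step for each bone, parent before child, yields world poses for the whole skeleton. Multiplying by `CM_TO_M` brings the result into the same units as the tracker data.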

What are the different forms of trackers, and what do their coordinate systems look like?

A tracker is a combination of an optical rigid body and an IMU inertial sensor. Each tracker has its own name, as detailed in the "Tracker List". Currently, there are six distinct configurations. The model files, coordinate systems, and optical point topologies for each tracker configuration are defined as follows:

Type Name	Model file	Diagram
PWR_M_PN3	PWR_M_PN3_V2.stl
PWR_K_PNS	PWR_K_PNS.stl
PWR_K_Link_V2	PWR_K_Link_V2.stl
PWR_H_LinkHand	PWR_H_LinkHand.stl
PWR_M_FingerA	PWR_M_FingerA.stl
PWR_M_FingerB	PWR_M_FingerB.stl

Where are these trackers used?

These trackers fall into two main categories: wireless trackers and wired trackers.

Wireless trackers are used for tracking props.
Wired trackers are used for tracking body parts.

Wireless trackers

PWR_M_PN3: A small wireless tracker attached to smaller props for tracking purposes.
PWR_K_PNS: A large wireless tracker attached to larger props such as tables, boxes, and the two fixed-position cameras.

Props equipped with wireless trackers have their model files aligned so that their origin and orientation match those of the tracker exactly. After the prop model file is imported, no coordinate conversion is required: the prop can be driven directly by the tracker's 6DoF data.
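Driving a prop from tracker data then amounts to applying the tracker's pose to the model's vertices. The following sketch illustrates this under stated assumptions: the tracker rotation is a unit quaternion in (w, x, y, z) order, and the model vertices are already expressed in meters (rescale them first if the model file uses another unit). The helper names are hypothetical.

```python
import numpy as np


def pose_to_matrix(pos, quat_wxyz):
    """Build a 4x4 homogeneous transform from a position and a (w, x, y, z) quaternion."""
    q = np.asarray(quat_wxyz, dtype=float)
    w, x, y, z = q / np.linalg.norm(q)
    rotation = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ])
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = pos
    return transform


def transform_vertices(transform, vertices):
    """Apply a 4x4 transform to an (N, 3) array of model-space vertices."""
    v = np.asarray(vertices, dtype=float)
    return v @ transform[:3, :3].T + transform[:3, 3]


if __name__ == "__main__":
    # Illustrative pose and vertices, not values from the dataset.
    tracker_pose = pose_to_matrix([0.5, 0.8, -0.2], [1.0, 0.0, 0.0, 0.0])
    model_vertices = np.array([[0.0, 0.0, 0.0], [0.0, 0.12, 0.0]])
    print(transform_vertices(tracker_pose, model_vertices))
```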

Prop Model File

The following table shows the model file for the cola can in the sample data.

Tool Name	Model file	Diagram
cola_modern_330	cola_modern_330_chip.stl

Wired trackers

PWR_H_LinkHand: A wired tracker mounted on the back of each hand.
PWR_M_FingerA and PWR_M_FingerB: Both are wired trackers mounted on fingers, with one at the fingertip and one at the finger base for each finger.

Hand tracker diagram:

Tracker name and model correspondence

Each recording session includes 6DoF data from at least 31 trackers. The tracker-to-model correspondence listed below is the same across all recordings. (Additional prop trackers may be included depending on the scene.)

python

# Tracker name -> tracker type (the variable name is illustrative).
TRACKER_MODELS = {
    "Head": "PWR_K_Link_V2",             # Head
    "Spine": "PWR_K_Link_V2",            # Back
    "Hips": "PWR_K_Link_V2",             # Hip
    "RightUpLeg": "PWR_K_Link_V2",       # Right thigh
    "RightFoot": "PWR_K_Link_V2",        # Right foot
    "LeftUpLeg": "PWR_K_Link_V2",        # Left thigh
    "LeftFoot": "PWR_K_Link_V2",         # Left foot
    "RightHand": "PWR_H_LinkHand",       # Back of the right hand
    "RightHandThumb2": "PWR_M_FingerB",  # Tip of the right thumb
    "RightHandThumb1": "PWR_M_FingerA",  # Base of the right thumb
    "RightHandIndex2": "PWR_M_FingerA",  # Tip of the right index finger
    "RightHandIndex1": "PWR_M_FingerB",  # Base of the right index finger
    "RightHandMiddle2": "PWR_M_FingerB", # Tip of the right middle finger
    "RightHandMiddle1": "PWR_M_FingerA", # Base of the right middle finger
    "RightHandRing2": "PWR_M_FingerA",   # Tip of the right ring finger
    "RightHandRing1": "PWR_M_FingerB",   # Base of the right ring finger
    "RightHandPinky2": "PWR_M_FingerB",  # Tip of the right little finger
    "RightHandPinky1": "PWR_M_FingerA",  # Base of the right little finger
    "LeftHand": "PWR_H_LinkHand",        # Back of the left hand
    "LeftHandThumb2": "PWR_M_FingerB",   # Tip of the left thumb
    "LeftHandThumb1": "PWR_M_FingerA",   # Base of the left thumb
    "LeftHandIndex2": "PWR_M_FingerA",   # Tip of the left index finger
    "LeftHandIndex1": "PWR_M_FingerB",   # Base of the left index finger
    "LeftHandMiddle2": "PWR_M_FingerB",  # Tip of the left middle finger
    "LeftHandMiddle1": "PWR_M_FingerA",  # Base of the left middle finger
    "LeftHandRing2": "PWR_M_FingerA",    # Tip of the left ring finger
    "LeftHandRing1": "PWR_M_FingerB",    # Base of the left ring finger
    "LeftHandPinky2": "PWR_M_FingerB",   # Tip of the left little finger
    "LeftHandPinky1": "PWR_M_FingerA",   # Base of the left little finger
    "fixed1_cam": "PWR_K_PNS",           # Fixed Camera 1
    "fixed2_cam": "PWR_K_PNS",           # Fixed Camera 2
}

Name	Meaning
RightHandPinky2	Right little finger tip
RightHandPinky1	Base of the right little finger
LeftHand	Back of left hand
LeftHandThumb2	Left thumb tip
LeftHandThumb1	Left thumb base
LeftHandIndex2	Left index finger tip
LeftHandIndex1	Left index finger root
LeftHandMiddle2	Left middle finger tip
LeftHandMiddle1	Root of the left middle finger
LeftHandRing2	Left ring finger tip
LeftHandRing1	Base of the left ring finger
LeftHandPinky2	Left little finger tip
LeftHandPinky1	Base of the left little finger
TBD	Other Props

How is the camera tracked, and how are external reference information and coordinate information defined?

The setup includes three camera channels totaling six cameras: one head-mounted channel and two fixed-position channels. Each channel comprises one RealSense D435 camera and one USB wide-angle camera, connected using identical structural components as shown below.

Camera Connection Diagram

The camera coordinate system defaults to right-down-front (X right, Y down, Z forward).

Camera intrinsic and extrinsic parameters are defined in the camera_params/ directory. The trackers corresponding to the three-camera system are listed in the table below:

Camera Position	Tracker name
Head-mounted camera	Head
Fixed Camera 1	fixed1_cam
Fixed Camera 2	fixed2_cam

Retrieve the 6DoF data for the corresponding tracker name from trackers_sixdof.h5, then apply the corresponding camera's intrinsic and extrinsic parameters to complete the camera's reprojection.
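The reprojection step can be sketched as follows. This is a simplified pinhole model rather than the dataset's reference implementation. It assumes that the extrinsic parameters give the camera pose in the tracker frame as a 4x4 matrix (`T_tracker_cam`), that the tracker's 6DoF pose has been converted to a 4x4 world transform (`T_world_tracker`), that the intrinsics are a 3x3 matrix `K`, and that lens distortion is ignored. All of these names are illustrative; check the files in camera_params/ for the actual conventions.

```python
import numpy as np


def project_world_point(p_world, T_world_tracker, T_tracker_cam, K):
    """Project a 3D world point to pixel coordinates, or None if it is behind the camera."""
    # Camera pose in the world frame, then its inverse to map world -> camera.
    T_world_cam = np.asarray(T_world_tracker, dtype=float) @ np.asarray(T_tracker_cam, dtype=float)
    T_cam_world = np.linalg.inv(T_world_cam)

    p_h = np.append(np.asarray(p_world, dtype=float), 1.0)
    x, y, z = (T_cam_world @ p_h)[:3]
    if z <= 0:
        return None  # not in front of a right-down-front camera

    K = np.asarray(K, dtype=float)
    u = K[0, 0] * x / z + K[0, 2]
    v = K[1, 1] * y / z + K[1, 2]
    return np.array([u, v])


if __name__ == "__main__":
    # Illustrative intrinsics and identity poses, not calibration values from the dataset.
    K = np.array([[600.0, 0.0, 320.0], [0.0, 600.0, 240.0], [0.0, 0.0, 1.0]])
    identity = np.eye(4)
    print(project_world_point([0.1, 0.0, 2.0], identity, identity, K))
```

If the extrinsics turn out to be stored in the opposite direction (tracker pose in the camera frame), invert `T_tracker_cam` before composing.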

Sample Data

Description	Data
Move the Coke on the table	HiPHI-OM-move-cola.zip

3. Technical Specification: In-The-Wild (ITW) Dataset

Ecological Validity and Scene Generalization: ITW comprises a high-entropy corpus captured across diverse unconstrained environments, including residential, hospitality, retail, and logistics sectors. By incorporating stochastic variables such as non-uniform lighting, dynamic occlusions, and unstructured spatial layouts, the dataset exposes models to the long-tail edge cases of real-world deployment, significantly enhancing policy robustness against environmental distribution shifts.
Capture of Unconstrained Behavioral Dynamics: Unlike scripted laboratory protocols, ITW prioritizes the recording of naturalistic human-object interactions and operational logic. This focuses the training signal on the inherent "common sense" of human motion (how humans prioritize tasks and navigate social spaces), which allows for the development of humanoid agents that exhibit more intuitive and predictable behaviors in shared environments.
High-Throughput Distributed Acquisition: The dataset utilizes a decentralized collection strategy involving portable, low-profile sensing arrays. This methodology allows for massive parallelization of data acquisition, achieving a throughput several times higher than traditional teleoperation or laboratory-bound methods. This scalability is critical for the generation of the high-volume datasets required for foundation model training in the embodied AI space.
Cross-Ontology Annotation and Compliance Pipeline: The ITW framework includes an end-to-end pipeline for data desensitization (anonymization), compliance auditing, and post-processing. A specialized toolchain enables the semantic annotation of unstructured data, ensuring it remains compatible across diverse ontologies. This allows real-world behavioral "noise" to be translated into structured training signals that are usable across various robotic morphologies and task-planning architectures.

File Structure

File Name	Description
camera_params/	Intrinsic parameters for the head camera
config.json	Metadata and description of the data in this collection
depth_head.mkv	Head depth video
depth_head.csv	Timestamps for the head depth images
hands_keypoint_3d.json	3D hand keypoint data
head_hands_sixdof.csv	6DOF data for the head and both wrists. The first frame of the head-mounted camera corresponds to the origin of the world coordinate system, and the wrist position is represented as relative information with respect to the head-mounted camera.
task_info.json	Task information for this collection
rgb_head.csv	Per-frame timestamps for the head RGB video
rgb_head.mp4	Head RGB video
mic.wav	Audio recording

Depth Data

Extract each depth PNG image from depth_head.mkv according to the information in depth_head.csv.

Each PNG contains one frame of 16-bit depth data.

Parse code: read_png_16bit.py

Output result example:

output

>python read_png_16bit.py dataset\3\3_1_1760508893\depth_fixed\depth_0_1760508893407.png
Image data type: uint16
Image shape (height, width): (480, 640)
Pixel value range: [0, 4999]
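Once a frame is loaded as a uint16 array, it usually needs to be converted to metric depth. The sketch below does this under two assumptions that this specification does not state: one raw unit equals one millimeter (`depth_scale=0.001`), and a raw value of 0 marks a pixel with no measurement. The function name is hypothetical; verify the scale against the camera parameters before relying on it.

```python
import numpy as np


def depth_to_meters(depth_u16, depth_scale=0.001):
    """Convert a 16-bit depth frame to float32 meters, with NaN for missing pixels."""
    depth = np.asarray(depth_u16)
    if depth.dtype != np.uint16:
        raise TypeError(f"expected uint16 depth data, got {depth.dtype}")
    meters = depth.astype(np.float32) * depth_scale
    meters[depth == 0] = np.nan  # assumed: 0 means "no depth measured"
    return meters


if __name__ == "__main__":
    # Synthetic frame with the same shape as the example output above.
    frame = np.zeros((480, 640), dtype=np.uint16)
    frame[100:200, 100:200] = 1500
    meters = depth_to_meters(frame)
    print(meters.shape, meters.dtype, np.nanmax(meters))
```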

Hands Keypoint 3d Visualization

Data Structure of `hands_keypoint_3d.json`

Each video directory contains a `hands_keypoint_3d.json` file storing per-frame 3D hand keypoints and MANO parameters. Top-level schema:

json

{
  "metadata": { "source": "depth_fusion", ... },
  "quality_exclusion": { "excluded": false, "reasons": [] },
  "frames": { "<timestamp>": { ... }, ... },
  "quality_summary": { ... }
}

quality_exclusion: Indicates whether the entire video should be excluded (e.g., due to excessive missing data).
quality_summary: Aggregated statistics such as total frames and confidence distribution.

Frame-level schema (frames["<timestamp>"]):

json

{
  "excluded": false,  // Frame-level exclusion flag
  "exclude_reason": "",  // Reason for exclusion (e.g., "tail")
  "hands": [
    {
      "is_right": true,  // true = right hand, false = left hand
      "confidence": "high",  // "high" | "low" | null
      "keypoints_3d_cam_m": {  // 3D coordinates of 21 joints (camera frame, meters)
        "thumb_cmc": [x, y, z],
        ...
        "pinky_tip": [x, y, z]
      },
      "mano_parameters": {
        "global_orient": [ax, ay, az],  // Axis-angle rotation vector (rotvec, radians), shape (3,)
        "transl": [tx, ty, tz],  // Wrist position in camera coordinates (meters)
        "betas": [b0, ..., b9],  // MANO shape parameters (10D)
        "hand_pose": [[[r00,...],...],...],  // Rotation matrices for 15 joints, shape (15, 3, 3)
        "hand_size_scale": 1.01  // Per-hand scale factor relative to MANO output
      },
      "wrist_6dof": [tx, ty, tz, rx, ry, rz]  // [translation (m); axis-angle rotation (rotvec, radians)]
    },
    ...
  ]
}

Joint Order (21):

wrist, thumb_cmc, thumb_mcp, thumb_ip, thumb_tip, index_mcp, index_pip, index_dip, index_tip, middle_mcp, middle_pip, middle_dip, middle_tip, ring_mcp, ring_pip, ring_dip, ring_tip, pinky_mcp, pinky_pip, pinky_dip, pinky_tip
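For array-based processing it is often convenient to stack the named keypoints into a (21, 3) array that follows this joint order. The sketch below assumes the keys of keypoints_3d_cam_m use exactly the joint names listed above; any joint missing from the dictionary is left as NaN instead of raising an error. The function name is illustrative.

```python
import numpy as np

JOINT_ORDER = [
    "wrist",
    "thumb_cmc", "thumb_mcp", "thumb_ip", "thumb_tip",
    "index_mcp", "index_pip", "index_dip", "index_tip",
    "middle_mcp", "middle_pip", "middle_dip", "middle_tip",
    "ring_mcp", "ring_pip", "ring_dip", "ring_tip",
    "pinky_mcp", "pinky_pip", "pinky_dip", "pinky_tip",
]


def keypoints_to_array(keypoints):
    """Stack a {joint_name: [x, y, z]} mapping into a (21, 3) array in JOINT_ORDER."""
    out = np.full((len(JOINT_ORDER), 3), np.nan, dtype=np.float64)
    for row, name in enumerate(JOINT_ORDER):
        if name in keypoints:
            out[row] = keypoints[name]
    return out


if __name__ == "__main__":
    # Synthetic keypoints; real values come from hands_keypoint_3d.json.
    fake = {name: [0.01 * i, 0.0, 0.4] for i, name in enumerate(JOINT_ORDER)}
    print(keypoints_to_array(fake).shape)
```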

Usage Notes:

Coordinate System. All 3D coordinates are in the OpenCV camera frame: +X right, +Y down, +Z forward; unit is meters.
Confidence Filtering. Only hands with confidence == "high" should be used for training/evaluation. Low-confidence hands are kept for completeness but are excluded from temporal smoothing.
Tail Frames. Frames marked with excluded == true and exclude_reason == "tail" correspond to the last 2 seconds of the video and should be discarded due to unstable end-of-recording quality. Equivalently, drop any frame whose timestamp is greater than max_timestamp - 2.0 seconds.
MANO Parameters. hand_pose is stored as 15×3×3 rotation matrices (not axis-angle). Temporal smoothing is applied in axis-angle (Lie algebra) space and the result is converted back to matrices, which preserves the SO(3) orthogonality constraint. betas is smoothed by a simple Gaussian filter.
Left-Hand Handling. Only the MANO right-hand model (MANO_RIGHT.pkl) is shipped, so left hands are reconstructed by running the right-hand model and mirroring along the X-axis (verts[:, 0] *= -1, joints[:, 0] *= -1). This must be done after MANO forward but before comparing the result with keypoints_3d_cam_m. See the inline comment in the example below.
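The filtering and mirroring rules in these notes can be sketched as follows. This is an illustrative helper, not code from the shipped example script: it assumes the JSON has already been parsed into a dictionary with the schema described above, it relies on the excluded flags rather than parsing timestamps, and it does not run MANO itself. The mirroring function only shows the X-axis flip applied to an (N, 3) array of MANO output points. Both function names are hypothetical.

```python
import numpy as np


def iter_usable_hands(data):
    """Yield (timestamp, hand) pairs that pass the exclusion and confidence rules."""
    if data.get("quality_exclusion", {}).get("excluded", False):
        return  # the whole video is excluded
    for timestamp, frame in data.get("frames", {}).items():
        if frame.get("excluded", False):
            continue  # covers tail frames and any other frame-level exclusion
        for hand in frame.get("hands", []):
            if hand.get("confidence") == "high":
                yield timestamp, hand


def mirror_x(points):
    """Return a copy of an (N, 3) array with the X coordinate negated."""
    mirrored = np.array(points, dtype=np.float64, copy=True)
    mirrored[:, 0] *= -1
    return mirrored


if __name__ == "__main__":
    # Synthetic data following the schema above.
    data = {
        "quality_exclusion": {"excluded": False, "reasons": []},
        "frames": {
            "t0": {"excluded": False, "hands": [{"is_right": False, "confidence": "high"}]},
            "t1": {"excluded": False, "hands": [{"is_right": True, "confidence": "low"}]},
            "t2": {"excluded": True, "exclude_reason": "tail", "hands": []},
        },
    }
    for timestamp, hand in iter_usable_hands(data):
        side = "right" if hand["is_right"] else "left"
        print(timestamp, side)
```

For a left hand (is_right == false), `mirror_x` would be applied to the vertices and joints produced by the right-hand model before they are compared with keypoints_3d_cam_m.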

Visualization Example

Please note: if the original RGB frame is already undistorted, do not apply undistortion again.

Parse code: example_kp_vis.py

bash

python example_visualize_mano_kp.py \
  --folder PATH_DATA

6DoF data of head and wrist

The 6DoF poses of the head and wrists are computed by a SLAM algorithm and recorded in head_hands_sixdof.csv.

Wrist 6DoF data exists only when the data collector wears a wrist QR code bracelet.

Citation

If you use the data from this website, please cite this work as

citation

Noitom Robotics Team, "ModalityNet: The Art of Modalities in Human-Centric Data", Noitom Robotics Blog, 2026.

Or use the BibTeX citation:

citation

@article{noitomrobotics2026modalitynet,
  author  = {Noitom Robotics Team},
  title   = {ModalityNet: The Art of Modalities in Human-Centric Data},
  journal = {Noitom Robotics Blog},
  year    = {2026},
  note    = {https://modalitynet.com},