
We are building the largest human-centric omni-modal dataset for embodied AI

Available categories include Motion with Object & Vision (MOV), Omni-Modality (OM), and In-the-Wild (ITW).

To explore our ModalityNet datasets, please register or log in.


Our Thesis

ModalityNet is a World Compiler

Physical AI won't be unlocked by data volume alone. The data also has to be made learnable.
That second layer is what we build.

Physical Interaction → ? → Learnable Representation → Physical AI

Throughout computing, progress has come from abstraction layers: compilers turn source code into executable programs; operating systems turn hardware into programmable platforms. Physical AI needs a comparable layer — one that turns physical interaction into machine-learnable representations.

That layer is the World Compiler. Its job isn't to collect data, but to organize reality: to synchronize modalities, reconstruct physical states and behaviors, and convert continuous human experience into representations that models can learn from, plan with, and train on.

Scaling data volume matters, and we keep pushing it — but volume alone won't get there. The world already produces vast embodied intelligence that simply isn't in a form machines can learn from. ModalityNet is our implementation of the World Compiler: compact, fully structured corpora that raise the learnability of data at real-world scale, so volume and learnability scale together.

Models evolve. Reality does not. A World Compiler sits beneath every architecture — transformers, diffusion, RL, world models, VLA — and stays valuable no matter which paradigm wins.

Read the blueprint

Data Overview

Data at a Glance

Overview of current-year data production capacity at factory sites A1 and A2. Factory sites B1, C1, and D1 are scheduled to come online in 2026, increasing capacity. Capacity figures are updated monthly.

HiPHI-MOV

High Precision Human Interaction

Motion with Object & Vision

0+ hrs

HiPHI-OM

High Precision Human Interaction

Omni-Modality

0+ hrs

ITW

In-the-Wild

Continuous multi-environment capture

0+ hrs

OMNI-MODAL DATASETS BUILT FOR ROBOTICS

Multimodal Sensor Fusion and State Estimation
Multi-Scale Hierarchical Interpretability
Sim-to-Real Alignment and Zero-Shot Transfer
Cross-Ontology Compatibility and Morphological Abstraction

Explore HiPHI-MOV

HIGH PRECISION HUMAN INTERACTION: MOTION WITH OBJECT & VISION (HIPHI-MOV)

Build whole-body intelligence with context—full-body motion aligned with video in large, unrestricted spaces.

The HiPHI-MOV Dataset is a human-centric, high-fidelity multimodal corpus engineered for developing robust locomotion and whole-body loco-manipulation policies. It includes full-body motion capture, tracking of interacting objects, and side-view RGB-D data. Full-body motion is modeled and output as a body BVH file with 21 end-effector 6-DOF poses. Acquisition takes place in large studios equipped with hybrid optical-inertial motion capture systems. HiPHI-MOV is intentionally designed for whole-body behavior, with palm-level loco-manipulation tasks. The manipulated objects are relatively large items, such as tables, chairs, and boxes. Its structured hierarchy enables the modeling of complex robotic behaviors, ranging from low-level motor primitives (joint-space dynamics) to high-level environmental affordances (scene-contextual navigation).
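As a rough illustration of what "21 end-effector 6-DOF poses" means for a training pipeline, the sketch below models one frame as 21 poses with six values each and flattens it into a 126-value feature vector. The class name, field names, units, and Euler-angle rotation convention are assumptions made for this example; they are not the published HiPHI-MOV schema.

```python
from dataclasses import dataclass
from typing import List

NUM_END_EFFECTORS = 21  # count stated in the dataset description


@dataclass(frozen=True)
class Pose6DOF:
    """Hypothetical 6-DOF pose: 3 translation + 3 rotation values.

    Units and rotation convention are assumptions (meters, Euler degrees).
    """
    x: float
    y: float
    z: float
    rx: float
    ry: float
    rz: float

    def as_list(self) -> List[float]:
        return [self.x, self.y, self.z, self.rx, self.ry, self.rz]


def flatten_frame(poses: List[Pose6DOF]) -> List[float]:
    """Flatten one frame of end-effector poses into a single feature vector."""
    if len(poses) != NUM_END_EFFECTORS:
        raise ValueError(
            f"expected {NUM_END_EFFECTORS} poses, got {len(poses)}"
        )
    flat: List[float] = []
    for pose in poses:
        flat.extend(pose.as_list())
    return flat


# One placeholder frame with every end effector at the origin.
frame = [Pose6DOF(0.0, 0.0, 0.0, 0.0, 0.0, 0.0) for _ in range(NUM_END_EFFECTORS)]
vector = flatten_frame(frame)
print(len(vector))  # 21 poses x 6 values = 126
```

A fixed-length vector per frame is one convenient input shape for sequence models; a real loader would read these values from the delivered files rather than construct them by hand.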

HIGH PRECISION HUMAN INTERACTION: OMNI-MODALITY (HIPHI-OM)

Teach robot hands true dexterity—millimeter finger motion plus pressure, captured for industrial precision.

The HiPHI-OM Dataset is a human-centric, high-fidelity, omni-modal repository acquired in a highly instrumented laboratory environment. It includes full-body and fine-grained hand motion capture, hand-level tactile sensing, precise tracking of interacting objects, egocentric RGB-D visual data, third-person RGB-D visual data, audio, and temperature measurements. Using synchronized, high-precision sensor arrays, HiPHI-OM provides ground-truth-level data for anthropocentric modeling with minimal aleatoric uncertainty. The dataset is designed to be ontology-agnostic, decoupling raw sensor data from specific semantic frameworks to maximize cross-domain generalization and longitudinal utility. Morphologically, the dataset supports a hierarchical structure, encompassing both micro-level kinematic primitives and meso-level sequential task planning.

Human-Centered Hierarchical Modeling
Sub-millimeter Spatiotemporal Synchronization
Multimodal, High-Dimensional Signal Registration
Cross-Ontology Compatibility and Morphological Mapping

Explore HiPHI-OM

Ecological Validity and Scene Generalization
Capture of Unconstrained Behavioral Dynamics
High-Throughput Distributed Acquisition
Cross-Ontology Annotation and Compliance Pipeline

Explore ITW

IN-THE-WILD (ITW)

Train for reality, not the lab—natural human behavior across diverse real-world environments.

The ITW Dataset is a human-centric, life-scale, open-world, multimodal repository of diverse, stochastic real-world scenarios designed to advance humanoid robotics and embodied intelligence (EI). It includes data from sparse-body motion capture sensors and egocentric RGB-D visual data. Departing from traditional laboratory-constrained acquisition, ITW captures ecologically valid human behaviors and interaction dynamics within unconstrained environments. By integrating high-variance environmental noise and long-tail edge cases into the training distribution, ITW facilitates the generalization of laboratory-optimized algorithms toward industrial deployment. When integrated with the HiPHI-OM high-precision dataset, it provides a comprehensive cross-domain corpus spanning diverse operational scenarios and sensory modalities.

Why Omni-Modal

Finger-level truth (not approximations)

Millimeter-scale hand/finger kinematics plus pressure/contact signals, so models can learn real grasp dynamics—not just pose trajectories.

Synchronized multi-modal ground truth

Motion + multi-view video (and additional signals where applicable) captured in time alignment, enabling strong visual grounding and cross-modal learning.
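To make "captured in time alignment" concrete, here is a minimal sketch of one common way to pair samples from two streams that run at different rates: nearest-timestamp matching with a tolerance. The function names, the sample timestamps, and the `max_gap` tolerance are all hypothetical; this is not a description of ModalityNet's actual synchronization pipeline.

```python
from bisect import bisect_left
from typing import List, Optional, Sequence


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the entry in sorted `timestamps` closest to `t`."""
    if not timestamps:
        raise ValueError("timestamps must not be empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    # Ties go to the earlier sample.
    return i if (after - t) < (t - before) else i - 1


def align(
    motion_ts: Sequence[float],
    video_ts: Sequence[float],
    max_gap: float,
) -> List[Optional[int]]:
    """For each motion sample, the nearest video frame index,
    or None when the nearest frame is more than `max_gap` seconds away."""
    pairs: List[Optional[int]] = []
    for t in motion_ts:
        j = nearest_index(video_ts, t)
        pairs.append(j if abs(video_ts[j] - t) <= max_gap else None)
    return pairs


# Hypothetical timestamps in seconds: 100 Hz motion samples, ~30 fps video.
motion_ts = [0.00, 0.01, 0.02, 0.03, 0.50]
video_ts = [0.000, 0.033, 0.066]
pairs = align(motion_ts, video_ts, max_gap=0.02)
print(pairs)  # [0, 0, 1, 1, None]
```

The last motion sample has no video frame within the tolerance, so it is left unpaired rather than matched to a stale frame. Hardware-synchronized capture reduces how often that fallback is needed.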

Coverage across the full realism spectrum

Controlled “factory-grade” precision, large-space motion-with-vision, and truly natural “in-the-wild” behavior—so training data spans clean labels and messy real-world variance.

Built for scale, consistency, and deployment

Repeatable acquisition pipelines, standardized calibration/QA, and dataset structure designed for model training workflows—so you get reliable data, not one-off demos.

Trusted By Pioneers in Robotics and AI

From academic institutions to global Fortune 500 companies, our data and acquisition pipelines support the current and future development of humanoid robotics and embodied AI.

Want to partner with us?

We collaborate with teams pushing the edge of embodied AI.
