Yurun Chen1,2,   Tianyuan Gao3,   Yizhong Ge1,   Shikun Ban4
Yizhou Wang3,   Hongkai Xiong2,5✉,   Wenjun Zeng1,6✉,   Wentao Zhu1,6✉
1Eastern Institute of Technology, Ningbo   2Shanghai Jiao Tong University   3Peking University
4Carnegie Mellon University   5East China Normal University   6Ningbo Institute of Digital Twin
✉ Corresponding authors

TL;DR. We build a framework for a humanoid robot to distinguish itself from humans or similar robots through proprioceptive-visual correspondence, without any predefined kinematic model or identity label. This self-other distinction then bootstraps a predictive self-model, enabling various real-world applications.

Overview

Distinguishing self from others is a fundamental capability for social intelligence. Before an agent can imitate a demonstrator, coordinate with a partner, or simply avoid colliding with a bystander, it must resolve a prior question: Which body is mine?

Cognitive science points to proprioceptive-visual correspondence as a key mechanism: the body whose observed motion matches what the agent feels itself doing is its own.

We therefore frame self-other distinction as an operational self-instance assignment problem:

From several visible bodies, the robot must select the candidate whose configuration matches its proprioceptive state, without identity labels or prior knowledge of its morphology.
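As a minimal sketch of this assignment step, assume the proprioceptive state and each visible body have already been encoded into a shared embedding space (the encoders are not shown, and `select_self` is a hypothetical name, not the framework's API). Selection then reduces to picking the candidate with the highest cosine similarity:

```python
import numpy as np

def select_self(proprio_emb, candidate_embs):
    """Return the index of the candidate that best matches proprioception.

    proprio_emb:    (D,) embedding of the robot's proprioceptive state.
    candidate_embs: (N, D) embeddings of the N bodies visible in the image.
    Both embeddings are assumed to come from encoders trained elsewhere.
    """
    p = proprio_emb / np.linalg.norm(proprio_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ p  # cosine similarity of each candidate to proprioception
    return int(np.argmax(scores)), scores

# Toy example with three visible bodies (values are illustrative only).
proprio = np.array([1.0, 0.0, 0.5])
candidates = np.array([
    [0.0, 1.0, 0.0],    # unrelated body
    [2.0, 0.1, 1.0],    # nearly parallel to the proprioceptive embedding
    [-1.0, 0.0, 0.2],   # unrelated body
])
self_index, scores = select_self(proprio, candidates)
```

No identity label enters the computation; the only signal is how well each candidate agrees with the proprioceptive embedding.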

Self-other distinction problem setting

Humanoid robots are entering social environments, where they coexist with humans and morphologically identical peers. To act effectively in such settings, a robot needs not only self-other distinction, to identify which body in the scene is its own, but also self-modeling, to acquire a predictive representation of that body and how it changes with action.

Self-other distinction

Self-other distinction task

Self-modeling

Self-modeling task

From only proprioceptive states and visual observations, and without any predefined kinematic model or identity label, our framework achieves robust self-other distinction, learns a self-model that predicts 3D body occupancy, and supports downstream tasks including target reaching, collision-aware motion planning, and human-to-robot motion retargeting.

Pipeline from proprioceptive-visual correspondence to self-modeling
Pipeline overview

Results

Self-other distinction

We train our self-other distinction model with self-supervised contrastive learning, exploiting the intrinsic correspondence between proprioception and vision. Across diverse poses, the model consistently distinguishes self from others, selecting the robot rather than the human distractor.
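For readers unfamiliar with contrastive objectives, the following is a generic InfoNCE-style loss, not the exact loss used here: it omits the attention guidance, and the temperature value and function name are assumptions. Row `i` of each batch is treated as a matching proprioception-vision pair, and all other rows act as negatives.

```python
import numpy as np

def info_nce(proprio_embs, visual_embs, temperature=0.1):
    """Generic InfoNCE loss over a batch of paired embeddings.

    proprio_embs, visual_embs: (B, D) arrays; row i of each is a positive pair.
    """
    p = proprio_embs / np.linalg.norm(proprio_embs, axis=1, keepdims=True)
    v = visual_embs / np.linalg.norm(visual_embs, axis=1, keepdims=True)
    logits = (p @ v.T) / temperature                      # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))            # positives lie on the diagonal

rng = np.random.default_rng(0)
visual = rng.normal(size=(8, 16))
# Proprioceptive embeddings that nearly match their visual partners.
aligned_loss = info_nce(visual + 0.01 * rng.normal(size=(8, 16)), visual)
# Reversing the batch pairs every proprioceptive row with the wrong body.
shuffled_loss = info_nce(visual[::-1], visual)
```

The loss is low when each proprioceptive embedding is closest to its own visual embedding and high when the pairing is wrong, which is the property self-other distinction relies on.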

Self-modeling

Using the selected robot masks as supervision, we learn a self-model that predicts the 3D body occupancy field from proprioception. Its predictions are coherent across poses: the visualized point cloud tracks the robot and preserves the self-other boundary.

From self-model to physical interaction

Target reaching, collision-aware motion planning, and human-to-robot motion retargeting tasks
Each real-world task can be cast as a spatial constraint on the robot’s body: a target to reach, an obstacle to avoid, or a demonstrated pose to imitate. Because the self-model provides a differentiable mapping from joint configurations to 3D body occupancy, these constraints translate into gradients over joint angles. We test this on three tasks: target reaching, collision-aware motion planning, and human-to-robot motion retargeting.
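One way to write this down, in notation introduced here for illustration rather than taken from the paper: let $f_\phi(\mathbf{x}, \mathbf{q}) \in [0, 1]$ denote the occupancy the self-model predicts at a 3D point $\mathbf{x}$ for joint configuration $\mathbf{q}$, and let $\mathcal{L}$ be a task-specific cost on that occupancy. Each task then becomes an optimization over joint angles, which plain gradient descent with an assumed step size $\eta$ would solve as:

```latex
\mathbf{q}^{\star} = \arg\min_{\mathbf{q}} \; \mathcal{L}\big(f_\phi(\cdot, \mathbf{q})\big),
\qquad
\mathbf{q}_{k+1} = \mathbf{q}_k - \eta \, \nabla_{\mathbf{q}} \mathcal{L}\big(f_\phi(\cdot, \mathbf{q}_k)\big)
```

Only $\mathcal{L}$ changes between tasks: a distance to a target for reaching, an overlap penalty with obstacles for collision avoidance, or distances to demonstrated keypoints for retargeting.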
Target reaching
Side view
Front view

A human moves a night-light to different positions, and the robot optimizes its seven left-arm joint angles to bring the center of a hand-specific self-model to the target.
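The optimization loop can be sketched as follows. Everything here is a toy stand-in: `hand_center` is a planar seven-link chain rather than the learned hand-specific self-model, the link lengths, step size, and iteration count are assumptions, and gradients are taken by finite differences instead of backpropagation.

```python
import numpy as np

LINK_LENGTHS = np.full(7, 0.1)  # hypothetical link lengths, in meters

def hand_center(q):
    """Toy stand-in for the hand-specific self-model: end point of a planar chain."""
    angles = np.cumsum(q)
    return np.array([np.sum(LINK_LENGTHS * np.cos(angles)),
                     np.sum(LINK_LENGTHS * np.sin(angles))])

def reach_loss(q, target):
    """Squared distance between the predicted hand center and the target."""
    return float(np.sum((hand_center(q) - target) ** 2))

def numerical_grad(f, q, eps=1e-5):
    """Central finite-difference gradient of f at q."""
    grad = np.zeros_like(q)
    for i in range(q.size):
        step = np.zeros_like(q)
        step[i] = eps
        grad[i] = (f(q + step) - f(q - step)) / (2 * eps)
    return grad

def reach(target, q0, lr=0.3, steps=2000):
    """Gradient descent on the seven joint angles."""
    q = q0.copy()
    for _ in range(steps):
        q -= lr * numerical_grad(lambda x: reach_loss(x, target), q)
    return q

target = np.array([0.3, 0.4])      # illustrative target position
q_init = np.full(7, 0.1)           # illustrative starting configuration
q_final = reach(target, q_init)
final_error = np.linalg.norm(hand_center(q_final) - target)
```

With a learned, differentiable self-model, `numerical_grad` would be replaced by automatic differentiation through the network.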

Collision-aware motion planning
Side view
Front view

The target is behind a board with a circular aperture, so a direct trajectory would collide. The robot uses learned body and hand occupancy inside a motion planner to find a collision-free path.
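To illustrate how occupancy can serve as a collision check, the sketch below scores two candidate paths against a board with an aperture. It is a 2D toy under stated assumptions, not the planner used here: `body_occupancy` is a disc standing in for the learned occupancy, the geometry is invented, and a real planner would search over paths rather than compare two hand-written ones.

```python
import numpy as np

def body_occupancy(points, body_center, radius=0.05):
    """Toy stand-in for learned occupancy: 1 inside a disc around the body center."""
    return (np.linalg.norm(points - body_center, axis=1) < radius).astype(float)

def path_collision_cost(waypoints, obstacle_points, samples_per_segment=50):
    """Sum the occupancy evaluated at obstacle points along a piecewise-linear path."""
    cost = 0.0
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        for t in np.linspace(0.0, 1.0, samples_per_segment):
            center = (1 - t) * a + t * b
            cost += body_occupancy(obstacle_points, center).sum()
    return cost

# Board along x = 0.5 with an opening for y roughly between 0.35 and 0.65.
ys = np.linspace(0.0, 1.0, 101)
ys = ys[(ys < 0.35) | (ys > 0.65)]
board = np.stack([np.full_like(ys, 0.5), ys], axis=1)

start, goal = np.array([0.0, 0.1]), np.array([1.0, 0.1])
aperture_center = np.array([0.5, 0.5])
direct = np.array([start, goal])                            # straight through the board
through_aperture = np.array([start, aperture_center, goal])

direct_cost = path_collision_cost(direct, board)
detour_cost = path_collision_cost(through_aperture, board)
```

The direct path accumulates a positive cost where the body overlaps the board, while the path through the aperture has zero cost, so a planner minimizing this term would prefer the latter.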

Human-to-robot motion retargeting
Human demonstration
Robot retargeting

Given a human demonstration, we extract 3D keypoints for body parts such as hands and feet, map them to robot-compatible targets, and optimize the 29 robot joints so that each predicted part reaches its target. The resulting robot joint sequence reproduces the human motion. The robot is externally supported because the self-model captures body geometry rather than physics.

Methods

Self-other distinction method
Self-other distinction. The model compares proprioceptive and visual embeddings, selects the self-mask, and trains the alignment with attention-guided contrastive learning.
Kinematics-free self-modeling method
Kinematics-free self-modeling. The self-model uses the selected self-mask as supervision to learn a pose-conditioned density and visibility field through bounded volumetric mask rendering.
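For intuition about volumetric mask rendering, here is a generic sketch of how a density field can be turned into a per-ray mask value by accumulating opacity along the ray. It is not the method's implementation: the pose conditioning and the visibility field are omitted, sampling between fixed `near` and `far` distances is an assumed reading of "bounded", and `sphere_density` is a toy field in place of a learned one.

```python
import numpy as np

def render_mask(density_fn, ray_origins, ray_dirs, near=0.0, far=2.0, n_samples=64):
    """Accumulated opacity per ray: mask = 1 - exp(-sum(sigma * delta))."""
    t = np.linspace(near, far, n_samples)                 # sample depths, shape (S,)
    delta = (far - near) / (n_samples - 1)                # spacing between samples
    points = ray_origins[:, None, :] + t[None, :, None] * ray_dirs[:, None, :]  # (R, S, 3)
    sigma = density_fn(points.reshape(-1, 3)).reshape(len(ray_origins), n_samples)
    return 1.0 - np.exp(-(sigma * delta).sum(axis=1))

def sphere_density(points, center=np.array([0.0, 0.0, 1.0]), radius=0.3, sigma=50.0):
    """Toy density field: constant density inside a sphere, zero outside."""
    return np.where(np.linalg.norm(points - center, axis=1) < radius, sigma, 0.0)

origins = np.zeros((2, 3))
dirs = np.array([[0.0, 0.0, 1.0],    # ray that passes through the sphere
                 [1.0, 0.0, 0.0]])   # ray that misses it
mask = render_mask(sphere_density, origins, dirs)
```

In training, rendered values like `mask` would be compared against the selected self-mask pixel by pixel, so the density field is supervised from 2D masks alone.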

Summary Video