Perception has been linked to “semantics” for a long time. When the researchers refer to the term of “multi-modal”, they usually mean “vision-language”. Despite that the integration of vision and language is indeed exciting, we pay more attention on those modalities that are less explored but equally, if not more, important. For example, force, temperature, position, etc. After all, if perception is to understand the world by sensory signals, then any sensors that can measure the world can provide useful information.