Bala Kumaravel

I am a Senior Researcher in the Interactive Multimodal AI Systems group at Microsoft Research, Redmond. I work on leveraging generative AI models (multimodal large language models and diffusion models) to enhance user productivity and collaboration in business-critical applications. I am particularly interested in customizing, fine-tuning, and aligning generative AI models for specific end-user applications.

Over the years I’ve worked on projects spanning multimodal copilots that accelerate productivity and collaboration in business-critical workflows; unified, natively multimodal AI copilots for Microsoft Office AI that work across Word, PowerPoint, and Excel, among other formats; generative pipelines and creative tooling for Bing Creative Ads; live AI agents that assist players in games such as Minecraft; vision perception systems that enable spatial understanding in AR/VR and robotics; and generative approaches that improve meeting experiences through multimodal understanding and content generation.

Before joining Microsoft, I completed my Ph.D. at the University of California, Berkeley, where I was advised by Prof. Björn Hartmann. My research at Berkeley focused on Virtual and Augmented Reality, exploring applications ranging from AR/VR-assisted robotics interactions to enhanced learning experiences. Before that, I completed my Bachelor’s at the Indian Institute of Technology, Madras, where my Bachelor’s thesis won the best interdisciplinary thesis project among all engineering departments and the best thesis in the department.

During my Ph.D., I spent time at various places and worked with amazing collaborators across Microsoft, Adobe, and Autodesk: Cuong Nguyen, Stephen DiVerdi, Fraser Anderson, Tovi Grossman, George Fitzmaurice, and Andy Wilson.

If you’re exploring multimodal LLMs, diffusion models, or embodied AI for enhancing human-AI interaction, I’d love to connect.

news

Jul 15, 2025	Our work, ‘Grounding Task Assistance with Multimodal Cues from a Single Demonstration’, was accepted and presented at ACL’25 Findings.
Oct 21, 2024	I will be speaking at a panel discussion at the IEEE International Symposium on Emerging Metaverse on Oct 21st, 2024.
Oct 16, 2024	We presented our work on BlendScape and SpaceBlender at UIST 2024. BlendScape won an Honorable Mention Award at UIST 2024. Check out the works at BlendScape and SpaceBlender.
May 11, 2024	We presented our work on SharedNeRF at CHI 2024. SharedNeRF won an Honorable Mention Award at CHI 2024. Check out the work at SharedNeRF.
Mar 16, 2024	Moved to the Interactive Multimodal AI Systems team at Microsoft Research, Redmond

selected publications

2025

Out of Sight, Not Out of Context? Egocentric Spatial Reasoning in VLMs Across Disjoint Frames

Sahithya Ravi, Gabriel Sarch, Vibhav Vineet, Andrew D Wilson, and Balasaravanan Thoravi Kumaravel

arXiv preprint arXiv:2505.24257, 2025

Grounding Task Assistance with Multimodal Cues from a Single Demonstration

Gabriel Sarch, Balasaravanan Thoravi Kumaravel, Sahithya Ravi, Vibhav Vineet, and Andrew D Wilson

In Findings of the Association for Computational Linguistics: ACL 2025, Jul 2025

A person’s demonstration often serves as a key reference for others learning the same task. However, RGB video, the dominant medium for representing these demonstrations, often fails to capture fine-grained contextual cues such as intent, safety-critical environmental factors, and subtle preferences embedded in human behavior. This sensory gap fundamentally limits the ability of Vision Language Models (VLMs) to reason about why actions occur and how they should adapt to individual users. To address this, we introduce MICA (Multimodal Interactive Contextualized Assistance), a framework that improves conversational agents for task assistance by integrating eye gaze and speech cues. MICA segments demonstrations into meaningful sub-tasks and extracts keyframes and captions that capture fine-grained intent and user-specific cues, enabling richer contextual grounding for visual question answering. Evaluations on questions derived from real-time chat-assisted task replication show

2024

BlendScape: Enabling End-User Customization of Video-Conferencing Environments through Generative AI

Shwetha Rajaram, Nels Numan, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D Wilson

In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, Oct 2024

Today’s video-conferencing tools support a rich range of professional and social activities, but their generic meeting environments cannot be dynamically adapted to align with distributed collaborators’ needs. To enable end-user customization, we developed BlendScape, a rendering and composition system for video-conferencing participants to tailor environments to their meeting context by leveraging AI image generation techniques. BlendScape supports flexible representations of task spaces by blending users’ physical or digital backgrounds into unified environments and implements multimodal interaction techniques to steer the generation. Through an exploratory study with 15 end-users, we investigated whether and how they would find value in using generative AI to customize video-conferencing environments. Participants envisioned using a system like BlendScape to facilitate collaborative activities in the future, but required further controls to mitigate distracting or unrealistic visual elements. We implemented scenarios to demonstrate BlendScape’s expressiveness for supporting environment design strategies from prior work and propose composition techniques to improve the quality of environments.
SpaceBlender: Creating Context-Rich Collaborative Spaces Through Generative 3D Scene Blending

Nels Numan, Shwetha Rajaram, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D Wilson

In Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, Oct 2024

There is increased interest in using generative AI to create 3D spaces for Virtual Reality (VR) applications. However, today’s models produce artificial environments, falling short of supporting collaborative tasks that benefit from incorporating the user’s physical context. To generate environments that support VR telepresence, we introduce SpaceBlender, a novel pipeline that utilizes generative AI techniques to blend users’ physical surroundings into unified virtual spaces. This pipeline transforms user-provided 2D images into context-rich 3D environments through an iterative process consisting of depth estimation, mesh alignment, and diffusion-based space completion guided by geometric priors and adaptive text prompts. In a preliminary within-subjects study, where 20 participants performed a collaborative VR affinity diagramming task in pairs, we compared SpaceBlender with a generic virtual environment and a state-of-the-art scene generation framework, evaluating its ability to create virtual spaces suitable for collaboration. Participants appreciated the enhanced familiarity and context provided by SpaceBlender but also noted complexities in the generative environments that could detract from task focus. Drawing on participant feedback, we propose directions for improving the pipeline and discuss the value and design of blended spaces for different scenarios.
SharedNeRF: Leveraging Photorealistic and View-dependent Rendering for Real-time and Remote Collaboration

Mose Sakashita, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew David Wilson

In Proceedings of the CHI Conference on Human Factors in Computing Systems, May 2024

Collaborating around physical objects necessitates examining different aspects of design or hardware in detail when reviewing or inspecting physical artifacts or prototypes. When collaborators are remote, coordinating the sharing of views of their physical environment becomes challenging. Video-conferencing tools often do not provide the desired viewpoints for a remote viewer. While RGB-D cameras offer 3D views, they lack the necessary fidelity. We introduce SharedNeRF, designed to enhance synchronous remote collaboration by leveraging the photorealistic and view-dependent nature of Neural Radiance Field (NeRF). The system complements the higher visual quality of the NeRF rendering with the instantaneity of a point cloud and combines them through carefully accommodating the dynamic elements within the shared space, such as hand gestures and moving objects. The system employs a head-mounted camera for data collection, creating a volumetric task space on the fly and updating it as the task space changes. In our preliminary study, participants successfully completed a flower arrangement task, benefiting from SharedNeRF’s ability to render the space in high fidelity from various viewpoints.
BlendScape: Enabling Unified and Personalized Video-Conferencing Environments through Generative AI

Shwetha Rajaram, Nels Numan, Balasaravanan Thoravi Kumaravel, Nicolai Marquardt, and Andrew D Wilson

Jul 2024

2023

StreamFunnel: Facilitating Communication Between a VR Streamer and Many Spectators

Haohua Lyu, Cyrus Vachha, Qianyi Chen, Balasaravanan Thoravi Kumaravel, and Bjoern Hartmann

Jul 2023

2022

Shaping the new future of work through mixed reality

Balasaravanan Thoravi Kumaravel

Jul 2022

Interactive Cross-Dimensional Media for Collaboration and Guidance in Mixed Reality Environments

Balasaravanan Thoravi Kumaravel

University of California, Berkeley, Jul 2022

Modeling and Influencing Human Attentiveness in Autonomy-to-Human Perception Hand-offs

Yash Vardhan Pant, Balasaravanan Thoravi Kumaravel, Ameesh Shah, Erin Kraemer, Marcell Vazquez-Chanlatte, Kshitij Kulkarni, Bjoern Hartmann, and Sanjit A Seshia

In 2022 IEEE 25th International Conference on Intelligent Transportation Systems (ITSC), Jul 2022

DreamStream: Immersive and Interactive Spectating in VR

Balasaravanan Thoravi Kumaravel and Andrew D Wilson

In Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems, Jul 2022

2020

TransceiVR: Bridging asymmetrical communication between VR users and external collaborators

Balasaravanan Thoravi Kumaravel, Cuong Nguyen, Stephen DiVerdi, and Bjoern Hartmann

In Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology, Jul 2020

2019

TutoriVR: A Video-Based Tutorial System for Design Applications in Virtual Reality

Balasaravanan Thoravi Kumaravel, Cuong Nguyen, Stephen DiVerdi, and Bjoern Hartmann

In CHI ’19: Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, Jul 2019

Loki: Facilitating remote instruction of physical tasks using bi-directional mixed-reality telepresence

Balasaravanan Thoravi Kumaravel, Fraser Anderson, George Fitzmaurice, Bjoern Hartmann, and Tovi Grossman

In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, Jul 2019