Research

The future of ubiquitous medicine is reasoning foundation models trained on clinical data.

I'm interested in what these vision and vision-language models actually learn, and why they fail. Because medical data is inherently limited, models can become overly prior-driven: at inference they often regress toward the mean, leaning on learned dataset priors rather than the biological signal in the input. I want to study these failure modes to separate causal signal from spurious correlations.

2026
ICML 2026

Scaling Vision Transformers for Functional MRI with Flat Maps

Connor Lane, Mihir Tripathy, Leema Krishna Murali, Ratna Sagari Grandhi, Shamus Sim Zi Yang, Sam Gijsen, Debojyoti Das, Manish Ram, Utkarsh Kumar Singh, Cesar Kadir Torrico Villanueva, Yuxiang Wei, Will Beddow, Gianfranco Cortes, Suin Cho, Daniel Z. Kaplan, Benjamin Warner, Tanishq M. Abraham, Paul S. Scotti

Spectrum of fMRI data representations for volume, flat map, and parcellation patching

We introduce CortexMAE, a family of self-supervised masked autoencoders for fMRI trained on 2.1K hours of open brain activity data. The work adapts Vision Transformers to functional MRI by converting 3D volumes into cortical flat maps, compares flat-map, parcellation, and volume representations, and releases Brainmarks, an open benchmark suite for evaluating fMRI foundation models across trait prediction and cognitive state decoding.
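As a rough illustration of the masked-autoencoder setup on flat maps, the sketch below splits a 2D flat-map frame into non-overlapping patches and hides a random subset from the encoder. The frame size, patch size, and mask ratio are placeholder assumptions for illustration, not CortexMAE's actual settings, and the function names are hypothetical.

```python
import numpy as np

def patchify(flat_map: np.ndarray, patch: int) -> np.ndarray:
    """Split an (H, W) frame into non-overlapping patches, one flattened patch per row."""
    h, w = flat_map.shape
    assert h % patch == 0 and w % patch == 0, "frame must tile evenly"
    grid = flat_map.reshape(h // patch, patch, w // patch, patch)
    # Reorder to (row block, col block, row in patch, col in patch), then flatten.
    return grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch)

def random_mask(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Return sorted (visible_idx, masked_idx) for one frame."""
    num_masked = int(round(num_patches * mask_ratio))
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

rng = np.random.default_rng(0)
flat_map = rng.standard_normal((32, 64))      # assumed toy flat-map frame
patches = patchify(flat_map, patch=8)         # 4 x 8 grid of 8x8 patches
visible_idx, masked_idx = random_mask(len(patches), mask_ratio=0.75, rng=rng)

# Only the visible patches would be fed to the encoder; the decoder is
# trained to reconstruct the masked ones.
encoder_input = patches[visible_idx]
```

A volume or parcellation representation would change only how the input is tokenized; the masking step stays the same.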

2025
Algonauts 2025 – 4th Place

Predicting Brain Responses To Natural Movies With Multimodal LLMs

Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti

Algonauts 2025 model architecture diagram

For our Algonauts 2025 submission, we built a multimodal encoding pipeline that fused features from V-JEPA2 (video), Whisper (audio), Llama 3.2 (text), InternVL3 (vision-language), and Qwen2.5-Omni (vision-language-audio). We linearly projected these model features into a shared latent space, temporally aligned them to fMRI responses, and mapped them to cortical parcels with a lightweight architecture that combines a shared group head with subject-specific residual heads. We then trained hundreds of model variants and stitched parcel-specific subject ensembles for final submission, achieving a mean Pearson correlation of 0.2085 on withheld out-of-distribution movies and placing 4th in the challenge.
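The shared group head with subject-specific residual heads, and the Pearson scoring, can be sketched roughly as follows. This is a minimal linear version under assumed toy dimensions; the class name, the zero initialization of the residuals, and all sizes are illustrative assumptions rather than the submission's actual code.

```python
import numpy as np

class SharedPlusResidualHead:
    """Linear readout: one group-level map plus a per-subject correction."""

    def __init__(self, latent_dim, num_parcels, num_subjects, rng):
        self.shared = rng.standard_normal((latent_dim, num_parcels)) * 0.01
        # Assumed init: residuals start at zero, so every subject begins
        # from the group prediction and learns only its deviation from it.
        self.residual = np.zeros((num_subjects, latent_dim, num_parcels))

    def predict(self, latents, subject):
        # latents: (time, latent_dim) -> predictions: (time, num_parcels)
        return latents @ (self.shared + self.residual[subject])

def pearson_per_parcel(pred, target):
    """Pearson r over time, computed independently for each parcel (column)."""
    pred_c = pred - pred.mean(axis=0)
    targ_c = target - target.mean(axis=0)
    num = (pred_c * targ_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (targ_c ** 2).sum(axis=0))
    return num / den

rng = np.random.default_rng(0)
head = SharedPlusResidualHead(latent_dim=8, num_parcels=5, num_subjects=4, rng=rng)
latents = rng.standard_normal((100, 8))               # toy fused features
pred = head.predict(latents, subject=2)
target = pred + 0.5 * rng.standard_normal(pred.shape)  # synthetic "fMRI"
score = pearson_per_parcel(pred, target).mean()        # mean r across parcels
```

Splitting the readout this way lets most parameters be learned from all subjects' data, while the residual heads absorb individual differences.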

Biological Psychiatry 2025

Transformer-Based Foundation Model for Functional Neuroimaging

Teddy Akiki, Mihir Tripathy, Xue Zhang, Adam Pines, Josue Ortega Caro, Syed Rizvi, Christopher Averill, David van Dijk, Chadi Abdallah, Leanne Williams

This work presents a transformer-based foundation model for functional neuroimaging. Published in Biological Psychiatry, Volume 97, Issue 9, pages S160-S161.

2024
ICML 2024

MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, Tanishq Mathew Abraham

MindEye2 architecture diagram

In MindEye2, we pretrained one shared model across 7 NSD subjects and fine-tuned on a held-out 8th subject with as little as one scan session (about 2.5% of the full data). The core alignment step uses subject-specific ridge mappings from 13k-18k voxel inputs into a 4096-dimensional shared latent space, followed by a shared MLP backbone, diffusion prior, and retrieval/low-level submodules. We decode into OpenCLIP ViT-bigG/14 image space and reconstruct pixels with a fine-tuned SDXL unCLIP pipeline (plus caption-guided refinement), reaching state-of-the-art retrieval and reconstruction performance while preserving strong performance in the 1-hour regime.
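The idea behind the subject-specific alignment step can be sketched as below: each subject has a different voxel count, but its own ridge map lands in a latent space of the same width, so everything downstream can be shared. This sketch uses closed-form ridge regression against placeholder targets, which is a simplification and may differ from how the mappings are trained in the paper; the dimensions, subject names, and `alpha` value are toy assumptions.

```python
import numpy as np

def fit_ridge(X: np.ndarray, Y: np.ndarray, alpha: float) -> np.ndarray:
    """Closed-form ridge: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)

rng = np.random.default_rng(0)
latent_dim = 16                                   # stands in for 4096
n_trials = 50
voxel_counts = {"subj01": 130, "subj02": 180}     # stands in for 13k-18k voxels
targets = rng.standard_normal((n_trials, latent_dim))  # placeholder latent targets

data, weights, shared = {}, {}, {}
for name, n_vox in voxel_counts.items():
    X = rng.standard_normal((n_trials, n_vox))    # synthetic voxel responses
    data[name] = X
    weights[name] = fit_ridge(X, targets, alpha=10.0)
    # Every subject now lives in the same (n_trials, latent_dim) space,
    # so a single shared backbone could consume any of them.
    shared[name] = X @ weights[name]
```

Because only the ridge map is subject-specific, adapting to a new subject mainly means fitting one new input map, which is what makes the low-data regime workable.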