Research

The future of ubiquitous medicine is reasoning foundation models trained on clinical data.

I'm interested in what these vision and vision-language models actually learn, and why they fail. Because medical data is inherently limited, models can become overly prior-driven: at inference they often regress toward the mean, leaning on learned dataset priors rather than the raw biological signal in the input. I want to study these failure modes to separate causal signal from spurious correlations.

2025

Algonauts 2025 – 4th Place

Predicting Brain Responses To Natural Movies With Multimodal LLMs

Cesar Kadir Torrico Villanueva, Jiaxin Cindy Tu, Mihir Tripathy, Connor Lane, Rishab Iyer, Paul S. Scotti

Algonauts 2025 model architecture diagram

For our Algonauts 2025 submission, we built a multimodal encoding pipeline that fused features from V-JEPA2 (video), Whisper (audio), Llama 3.2 (text), InternVL3 (vision-language), and Qwen2.5-Omni (vision-language-audio). We linearly projected these model features into a shared latent space, temporally aligned them to fMRI responses, and mapped them to cortical parcels with a lightweight architecture that combines a shared group head with subject-specific residual heads. We then trained hundreds of model variants and stitched parcel-specific subject ensembles for the final submission, achieving a mean Pearson correlation of 0.2085 on held-out, out-of-distribution movies and placing 4th in the challenge.
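The group-plus-residual head and the parcel-wise stitching can be illustrated with a minimal NumPy sketch. This is not the submission code: the array sizes, the function names (`init_heads`, `predict`, `pearson_per_parcel`, `stitch_by_parcel`), the additive form of the residual, and the pick-the-single-best-variant rule are all assumptions made for illustration, and the features are random stand-ins for the projected model features.

```python
import numpy as np

# All sizes below are hypothetical and far smaller than a real setup.
N_TIMEPOINTS, LATENT_DIM, N_PARCELS, N_SUBJECTS = 200, 32, 10, 4


def init_heads(latent_dim, n_parcels, n_subjects, rng):
    """Return a shared group head and zero-initialised per-subject residuals."""
    shared = rng.normal(scale=0.1, size=(latent_dim, n_parcels))
    residual = np.zeros((n_subjects, latent_dim, n_parcels))
    return shared, residual


def predict(features, subject, shared, residual):
    """Parcel predictions = features @ (group weights + subject residual)."""
    return features @ (shared + residual[subject])


def pearson_per_parcel(pred, target):
    """Pearson correlation over time, computed separately for each parcel."""
    pred_c = pred - pred.mean(axis=0)
    targ_c = target - target.mean(axis=0)
    num = (pred_c * targ_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (targ_c ** 2).sum(axis=0))
    return num / den


def stitch_by_parcel(preds, val_scores):
    """For each parcel, keep the prediction of the best-scoring variant.

    preds:      (n_models, n_timepoints, n_parcels)
    val_scores: (n_models, n_parcels) validation correlations
    """
    best = val_scores.argmax(axis=0)          # best model index per parcel
    parcels = np.arange(preds.shape[2])
    return preds[best, :, parcels].T          # (n_timepoints, n_parcels)


rng = np.random.default_rng(0)
features = rng.normal(size=(N_TIMEPOINTS, LATENT_DIM))  # synthetic latent features
shared, residual = init_heads(LATENT_DIM, N_PARCELS, N_SUBJECTS, rng)
pred_subject0 = predict(features, 0, shared, residual)
```

With zero residuals every subject gets the group prediction; training would move each subject's residual away from zero only where that subject differs from the group.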

arXiv GitHub

2024

ICML 2024

MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

Paul S. Scotti, Mihir Tripathy, Cesar Kadir Torrico Villanueva, Reese Kneeland, Tong Chen, Ashutosh Narang, Charan Santhirasegaran, Jonathan Xu, Thomas Naselaris, Kenneth A. Norman, Tanishq Mathew Abraham

In MindEye2, we pretrained one shared model across 7 NSD subjects and fine-tuned on a held-out 8th subject with as little as one scan session (about 2.5% of full data). The core alignment step uses subject-specific ridge mappings from 13k-18k voxel inputs into a 4096-dimensional shared latent, followed by a shared MLP backbone, diffusion prior, and retrieval/low-level submodules. We decode into OpenCLIP ViT-bigG/14 image space and reconstruct pixels with a fine-tuned SDXL unCLIP pipeline (plus caption-guided refinement), reaching state-of-the-art retrieval/reconstruction while preserving strong performance in the 1-hour regime.
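The alignment idea, where each subject has a different voxel count but all subjects land in one latent space, can be sketched as follows. This is a toy under stated assumptions rather than the MindEye2 training code: the ridge weights are fit in closed form against synthetic latent targets so the example is self-contained, the dimensions are scaled down (16 instead of 4096, 130 and 180 voxels instead of 13k-18k), and `fit_ridge`, the subject labels, and `alpha` are hypothetical names and values.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ Y)


rng = np.random.default_rng(0)
SHARED_DIM = 16                                 # stand-in for the 4096-d latent
N_TRIALS = 300
voxel_counts = {"subj01": 130, "subj02": 180}   # stand-ins for 13k-18k voxels

# Synthetic "true" latents that both subjects' voxels are generated from.
latent_targets = rng.normal(size=(N_TRIALS, SHARED_DIM))

weights, shared_latents = {}, {}
for subj, n_vox in voxel_counts.items():
    # Each subject sees the latents through its own random mixing plus noise.
    mixing = rng.normal(size=(SHARED_DIM, n_vox))
    voxels = latent_targets @ mixing + 0.1 * rng.normal(size=(N_TRIALS, n_vox))
    # Subject-specific map: (n_vox -> SHARED_DIM), one weight matrix per subject.
    weights[subj] = fit_ridge(voxels, latent_targets, alpha=1.0)
    shared_latents[subj] = voxels @ weights[subj]
```

Because every subject's output has the same width, everything downstream of this step can be shared across subjects, and adapting to a new subject mainly requires fitting a new input mapping.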

arXiv Project Page GitHub