The future of ubiquitous medicine is reasoning foundation models trained on clinical data.
I'm interested in what these vision and vision-language models actually learn—and why they fail. Since medical data is inherently limited, models can become overly prior-driven: at inference they often regress toward the mean, leaning on learned dataset priors rather than raw biological signal. I want to study these failure modes to separate causal signal from spurious correlations.
We introduce CortexMAE, a family of self-supervised masked autoencoders for fMRI trained on 2.1K hours of open brain activity data. The work adapts Vision Transformers to functional MRI by converting 3D volumes into cortical flat maps, compares flat-map, parcellation, and volume representations, and releases Brainmarks, an open benchmark suite for evaluating fMRI foundation models across trait prediction and cognitive state decoding.
For our Algonauts 2025 submission, we built a multimodal encoding pipeline that fused features from V-JEPA2 (video), Whisper (audio), Llama 3.2 (text), InternVL3 (vision-language), and Qwen2.5-Omni (vision-language-audio). We linearly projected these model features into a shared latent space, temporally aligned them to fMRI responses, and mapped them to cortical parcels with a lightweight architecture that combines a shared group head with subject-specific residual heads. We then trained hundreds of model variants and stitched parcel-specific subject ensembles for final submission, achieving a mean Pearson correlation of 0.2085 on withheld out-of-distribution movies and placing 4th in the challenge.
Published in Biological Psychiatry, Volume 97, Issue 9, pages S160-S161, presenting a transformer-based foundation model approach for functional neuroimaging.
In MindEye2, we pretrained one shared model across 7 NSD subjects and fine-tuned on a held-out 8th subject with as little as one scan session (about 2.5% of full data). The core alignment step uses subject-specific ridge mappings from 13k-18k voxel inputs into a 4096-dimensional shared latent, followed by a shared MLP backbone, diffusion prior, and retrieval/low-level submodules. We decode into OpenCLIP ViT-bigG/14 image space and reconstruct pixels with a fine-tuned SDXL unCLIP pipeline (plus caption-guided refinement), reaching state-of-the-art retrieval/reconstruction while preserving strong performance in the 1-hour regime.