The future of ubiquitous medicine is reasoning foundation models trained on clinical data.
I'm interested in what these vision and vision-language models actually learn, and why they fail. Since medical data is inherently limited, models can become overly prior-driven: at inference they often regress toward the mean, leaning on learned dataset priors rather than the raw biological signal. I want to study these failure modes to separate causal signal from spurious correlations.
For our Algonauts 2025 submission, we built a multimodal encoding pipeline that fused features from V-JEPA2 (video), Whisper (audio), Llama 3.2 (text), InternVL3 (vision-language), and Qwen2.5-Omni (vision-language-audio). We linearly projected these model features into a shared latent space, temporally aligned them to fMRI responses, and mapped them to cortical parcels with a lightweight architecture that combines a shared group head with subject-specific residual heads. We then trained hundreds of model variants and stitched parcel-specific subject ensembles for final submission, achieving a mean Pearson correlation of 0.2085 on withheld out-of-distribution movies and placing 4th in the challenge.
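To make the final mapping stage concrete, here is a minimal PyTorch sketch of a shared group head plus subject-specific residual heads sitting on top of per-modality linear projections. The class name, feature dimensions, parcel count, and subject count are illustrative placeholders, not our actual challenge code.

```python
# Minimal sketch (hypothetical shapes/names): per-modality linear projections into a
# shared latent, a shared group head predicting all cortical parcels, and a
# subject-specific residual head added on top.
import torch
import torch.nn as nn

class SharedPlusResidualEncoder(nn.Module):
    def __init__(self, feat_dims: dict, latent_dim: int, n_parcels: int, n_subjects: int):
        super().__init__()
        # One linear projection per feature extractor (e.g. "video", "audio", "text").
        self.proj = nn.ModuleDict(
            {name: nn.Linear(d, latent_dim) for name, d in feat_dims.items()})
        # Shared group head: latent -> parcel responses, trained across all subjects.
        self.group_head = nn.Linear(latent_dim, n_parcels)
        # Subject-specific residual heads capture idiosyncratic deviations.
        self.residual_heads = nn.ModuleList(
            [nn.Linear(latent_dim, n_parcels) for _ in range(n_subjects)])

    def forward(self, feats: dict, subject: int) -> torch.Tensor:
        # feats[name]: (time, feat_dims[name]), already aligned to the fMRI TRs.
        z = torch.stack([self.proj[name](x) for name, x in feats.items()]).sum(0)
        return self.group_head(z) + self.residual_heads[subject](z)

# Toy usage: two subjects, 1000 parcels, three modalities over 8 TRs.
model = SharedPlusResidualEncoder(
    {"video": 1024, "audio": 768, "text": 3072},
    latent_dim=512, n_parcels=1000, n_subjects=2)
pred = model({"video": torch.randn(8, 1024),
              "audio": torch.randn(8, 768),
              "text": torch.randn(8, 3072)}, subject=0)  # -> (8, 1000)
```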
In MindEye2, we pretrained one shared model across 7 NSD subjects and fine-tuned on a held-out 8th subject with as little as one scan session (about 2.5% of full data). The core alignment step uses subject-specific ridge mappings from 13k-18k voxel inputs into a 4096-dimensional shared latent, followed by a shared MLP backbone, diffusion prior, and retrieval/low-level submodules. We decode into OpenCLIP ViT-bigG/14 image space and reconstruct pixels with a fine-tuned SDXL unCLIP pipeline (plus caption-guided refinement), reaching state-of-the-art retrieval/reconstruction while preserving strong performance in the 1-hour regime.
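For intuition about the alignment step, the sketch below shows how per-subject ridge mappings can take differently sized voxel inputs into one shared latent space. The sizes are toy stand-ins for the 13k-18k voxels and 4096-dimensional latent, and the closed-form ridge fit with random targets is purely illustrative of the shape bookkeeping, not the training procedure.

```python
# Minimal sketch (toy sizes): subject-specific ridge maps from voxel space into a
# common latent, so data from any subject lands in the space the shared backbone consumes.
import numpy as np

def ridge_weights(X: np.ndarray, Z: np.ndarray, lam: float = 1e2) -> np.ndarray:
    """Ridge solution W minimizing ||XW - Z||^2 + lam*||W||^2, via the dual form
    (efficient when trials << voxels)."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Z)

rng = np.random.default_rng(0)
latent_dim = 256                                     # stands in for the 4096-d shared latent
voxel_counts = {"subj01": 2000, "subj08": 3000}      # real inputs: ~13k-18k voxels per subject
Z_targets = rng.standard_normal((100, latent_dim))   # shared-space targets for 100 trials

maps = {}
for subj, n_vox in voxel_counts.items():
    X = rng.standard_normal((100, n_vox))            # that subject's voxel patterns
    maps[subj] = ridge_weights(X, Z_targets)         # (n_vox, latent_dim)

# New data from either subject now projects into the same shared latent:
z = rng.standard_normal((1, 2000)) @ maps["subj01"]  # -> (1, latent_dim)
```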