Neuro-Vision to Language: Image Reconstruction and Language enabled Interaction via Brain Recordings
arXiv (2024)
Abstract
Decoding non-invasive brain recordings is crucial for advancing our
understanding of human cognition, yet faces challenges from individual
differences and complex neural signal representations. Traditional methods
require custom models and extensive trials, and lack interpretability in visual
reconstruction tasks. Our framework integrates 3D brain structures
with visual semantics using a Vision Transformer 3D. The unified feature extractor
aligns fMRI features with multiple levels of visual embeddings efficiently,
removing the need for individual-specific models and allowing extraction from
single-trial data. This extractor consolidates multi-level visual features into
one network, simplifying integration with Large Language Models (LLMs).
Additionally, we have enhanced the fMRI dataset with various fMRI-image related
textual data to support multimodal large model development. The integration
with LLMs enhances decoding capabilities, enabling tasks like brain captioning,
question-answering, detailed descriptions, complex reasoning, and visual
reconstruction. Our approach not only shows superior performance across these
tasks but also precisely identifies and manipulates language-based concepts
within brain signals, enhancing interpretability and providing deeper neural
process insights. These advances significantly broaden non-invasive brain
decoding applicability in neuroscience and human-computer interaction, setting
the stage for advanced brain-computer interfaces and cognitive models.
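
As a rough illustration of the two ideas the abstract names (tokenizing a 3D brain volume for a Vision Transformer 3D, and aligning fMRI features with several levels of visual embeddings), here is a minimal sketch. It is not the authors' implementation: the abstract does not specify the patch size, the loss, or the embedding levels, so the cubic patching scheme, the cosine-based loss, and the function names `patchify_3d`, `cosine_alignment_loss`, and `multi_level_loss` are all assumptions made for this example.

```python
import math
from itertools import product


def patchify_3d(volume, p):
    """Split a D x H x W nested-list volume into flattened p x p x p patches.

    Assumption: non-overlapping cubic patches, as in a typical 3D ViT tokenizer.
    """
    d, h, w = len(volume), len(volume[0]), len(volume[0][0])
    if d % p or h % p or w % p:
        raise ValueError("each volume dimension must be divisible by the patch size")
    tokens = []
    # Walk over the origin corner of every patch, then flatten its voxels.
    for z0, y0, x0 in product(range(0, d, p), range(0, h, p), range(0, w, p)):
        tokens.append([
            volume[z][y][x]
            for z in range(z0, z0 + p)
            for y in range(y0, y0 + p)
            for x in range(x0, x0 + p)
        ])
    return tokens


def cosine_alignment_loss(pred, target):
    """Return 1 - cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(pred, target))
    norm = math.sqrt(sum(a * a for a in pred)) * math.sqrt(sum(b * b for b in target))
    return 1.0 - dot / norm


def multi_level_loss(preds, targets):
    """Average the alignment loss over several embedding levels (hypothetical)."""
    losses = [cosine_alignment_loss(p, t) for p, t in zip(preds, targets)]
    return sum(losses) / len(losses)


# Toy 4 x 4 x 4 "volume" whose voxel value encodes its own position.
toy_volume = [[[z * 16 + y * 4 + x for x in range(4)] for y in range(4)] for z in range(4)]
toy_tokens = patchify_3d(toy_volume, 2)  # 8 tokens of 8 voxels each
```

In a real model each token would pass through a learned linear projection and transformer layers before the alignment loss is applied. The sketch stops at tokenization and the loss because the abstract gives no further detail.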