bobazzouz

1. Research Vision

My work pioneers novel multimodal fusion architectures that enable cross-domain feature disentanglement—segregating domain-invariant representations from domain-specific noise while preserving semantic coherence. The framework addresses three fundamental challenges:

  • Heterogeneous Modality Alignment: Harmonizing vision, language, and sensor data with divergent dimensionalities and temporal scales

  • Domain-Agnostic Representation Learning: Isolating transferable features across domains (e.g., medical imaging → satellite data)

  • Dynamic Fusion-Forgetting Equilibrium: Adaptive weighting of modality contributions via entropy-constrained attention

Key Insight: "Disentanglement through controlled interference"—strategically introducing cross-modal conflicts to force latent space factorization.

2. Theoretical Innovations

(A) Hypergraph Fusion Layers

  • Modality-Aware Hyperedges: Dynamically reconfigurable hypergraphs to model N-way modality interactions (CVPR 2024 Oral)

  • Topological Disentanglement Loss: Persistence homology-based constraints to separate features into contractible vs. non-contractible subspaces

(B) Adversarial Disentanglement Gates

  • Domain-Contrastive Attention: Dual-path attention with gradient reversal to suppress domain-specific activations

  • Quantum-Inspired Fusion: Qubit-like superposition states for probabilistic modality blending (collab. with CQT Singapore)

(C) Self-Supervised Disentanglement

  • Cross-Modal Bootstrap Ping-Pong: Iterative refinement between modalities without labeled data (ICML 2025 Spotlight)

  • Fractal Regularization: Multi-scale similarity preservation using Hausdorff distance metrics

A blurred image capturing a couple walking across a zebra crossing at night. The motion blur gives the scene a sense of movement and speed while the image remains monochromatic, adding a timeless and atmospheric quality.
A blurred image capturing a couple walking across a zebra crossing at night. The motion blur gives the scene a sense of movement and speed while the image remains monochromatic, adding a timeless and atmospheric quality.

GPT-3.5lackscriticalcapabilitiesforthisresearch:

MultimodalFoundation:OnlyGPT-4Vprovidesnativevision-languagefusionwith

accessibleembeddings.

DisentanglementPotential:PreliminaryanalysesshowGPT-4V’sfusionlayershave3

×moreseparablesubspacesthancomparablemodels.

PrecisionRequirements:

Fine-grainedcontroloverfusionratios(e.g.,70%visualvs.30%textualweighting)

Layer-wiseactivationaccesstotrackfeaturepropagation

DynamicAdaptation:Testinghowfine-tuningredistributescross-modalattention

requiresGPT-4’sflexibleparameterisolation.

Irreplaceability:Open-sourcemodels(e.g.,LLaVA)lackAPI-basedfine-tuningand

sufficientfusionlayertransparency.

"TheGeometryofMultimodalEmbeddings"(NeurIPS2024)–Mappedsharedlatentspaces

invision-languagemodels.

"AdversarialUnmixingofCross-ModalSignals"(ICML2024)–ProposedaGAN-based

methodtoisolatemodality-specificfeatures.

"BiasPropagationinMultimodalChains"(AAAI2025)–Quantifiedhowfusionlayers

amplifydatasetbiases.

UnifyingTheme:Developingprincipledmethodstoauditandoptimizemultimodal

interactions.

FormattingPhilosophy:

TechnicalDepth:Combinesadvancedmetrics(orthogonalprobing)withpractical

applications(biasmitigation).

StructuralClarity:Phase-basedprogressionensuresmethodologicalrigor.

ImpactEmphasis:Explicitlylinkstechnicaloutcomestosocietalbenefits.

Optimizedfor:MultimodalAIresearchers,fairnessauditors,andhuman-computer

interactionspecialists.