Add MLCD Model #36181

Open
2 tasks done
tanhuajie opened this issue Feb 13, 2025 · 0 comments · May be fixed by #36182

tanhuajie commented Feb 13, 2025

Model description

The MLCD models were released by the DeepGlint-AI team in the unicom repository. The project focuses on building foundational visual models for multimodal large language models, trains on large-scale datasets such as LAION400M and COYO700M, and uses sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used as vision towers for multimodal large language models such as LLaVA.

The 🔥MLCD-ViT-bigG🔥 series is a state-of-the-art vision transformer enhanced with 2D Rotary Position Embedding (RoPE2D). In the results below it outperforms the CLIP, SigLIP, and DFN5B vision towers on the document understanding and visual question answering benchmarks ChartQA, DocVQA, InfoVQA, and OCRBench. Developed by DeepGlint AI, the model is designed to handle complex visual-language interactions.
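For readers unfamiliar with RoPE2D, here is a minimal sketch of one common way to build a 2D rotary position embedding. It is not taken from the MLCD code. It assumes the feature vector is split in half, with the first half rotated by the patch row index and the second half by the column index. The function names (`rotate_pairs`, `rope_2d`, `dot`) and the `base` value are hypothetical choices for illustration.

```python
import math


def rotate_pairs(vec, pos, base=10000.0):
    """1D rotary embedding: rotate each consecutive (x, y) pair by pos * freq."""
    dim = len(vec)
    assert dim % 2 == 0, "need an even number of features"
    out = []
    for i in range(0, dim, 2):
        # Frequency decreases geometrically with the pair index (assumed schedule).
        freq = base ** (-i / dim)
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def rope_2d(vec, row, col, base=10000.0):
    """Assumed 2D variant: first half encodes the row, second half the column."""
    dim = len(vec)
    assert dim % 4 == 0, "need a feature size divisible by 4"
    half = dim // 2
    return rotate_pairs(vec[:half], row, base) + rotate_pairs(vec[half:], col, base)


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k = [0.5, -1.0, 2.0, 0.0, 1.5, 1.0, -2.0, 3.0]

# Both pairs of patches are offset by (2 rows, 3 columns) from each other.
score_a = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
score_b = dot(rope_2d(q, 12, 13), rope_2d(k, 10, 10))
```

Because each pair is only rotated, the score between a query and a key depends on their relative row and column offset rather than their absolute positions, so `score_a` and `score_b` agree up to floating-point error.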

Tips:

Result:

| Vision Tower | RoPE2D | ChartQA | DocVQA | InfoVQA | OCRBench | MMMU |
|---|---|---|---|---|---|---|
| CLIP (ViT-L-14-336px) | × | 66.52 | 75.21 | 38.88 | 525.00 | 44.20 |
| SigLIP (ViT-SO400M-384px) | × | 69.28 | 76.71 | 41.38 | 554.00 | 46.78 |
| DFN5B (ViT-H-14-378px) | × | 64.36 | 70.87 | 38.59 | 473.00 | 48.00 |
| MLCD (ViT-L-14-336px) | × | 67.84 | 76.46 | 43.48 | 531.00 | 44.30 |
| MLCD (ViT-bigG-14-336px) | ✓ | 71.07 | 79.63 | 44.38 | 572.00 | 46.78 |
| MLCD (ViT-bigG-14-448px) | ✓ | 73.80 | 83.34 | 46.59 | 582.00 | 46.00 |

Open source status

  • The model implementation is available
  • The model weights are available

Provide useful links for the implementation

No response

tanhuajie linked a pull request on Feb 13, 2025 that will close this issue
qubvel added the Vision label on Feb 13, 2025