- OCR_VG
- Cross-Modal Search Based on MiniCPMV2.0
- Complex Agent Project
- MBTI Role Playing
- Hybrid Modality Fine-Tuning
- Running RAG with 4GB Memory
- AWQ Quantization for MiniCPMV2.6
- Cold Start to Acquire Agent Data
All of the above projects are original works. Feel free to use them, but please respect my intellectual property rights and give a star if you find them useful.
This project combines OCR and localization (visual grounding) while taking page layout into account. It lives in the OCR_VG folder; see the Text Recognition and Localization Tutorial.
Using multi-vector representations and contrastive learning, this project trains an end-to-end cross-modal retrieval model that can understand dense text and complex tables (a minimal scoring sketch follows the results list below). Model Link
- Input 20 candidate images:
- Input query text for search:
- Obtain the image most similar to the query.
- Experimental results: on 300 validation image-text pairs, Top-1 match accuracy is 96%.
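The two core ideas, multi-vector (late-interaction) scoring and a contrastive training objective, can be sketched as below. This is a minimal illustration with made-up function names and tensor shapes, not the project's training code; the real model builds its embeddings on top of MiniCPMV2.0.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(text_emb, image_emb, temperature=0.05):
    """Symmetric InfoNCE contrastive loss over a batch of matched text/image embeddings."""
    text_emb = F.normalize(text_emb, dim=-1)     # (B, D)
    image_emb = F.normalize(image_emb, dim=-1)   # (B, D)
    logits = text_emb @ image_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

def maxsim_score(query_vecs, doc_vecs):
    """Multi-vector (late-interaction) score: each query vector is matched against
    its best document vector, and the maxima are summed."""
    query_vecs = F.normalize(query_vecs, dim=-1)  # (Nq, D)
    doc_vecs = F.normalize(doc_vecs, dim=-1)      # (Nd, D)
    return (query_vecs @ doc_vecs.t()).max(dim=-1).values.sum()

def retrieve_top1(query_vecs, candidate_doc_vecs):
    """Return the index of the candidate image whose multi-vector score is highest."""
    scores = torch.stack([maxsim_score(query_vecs, d) for d in candidate_doc_vecs])
    return int(scores.argmax())
```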
See the Feishu Document
To quickly build an Agent, I have developed a tool for generating agent training data using large models, saving you 95% of the time. This includes data generation in qwen (react) and minicpm formats.
Zero-modification data generation example; for the generation code, click here.
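For orientation, a single ReAct-style sample looks roughly like this (the field names, tool name, and file name are illustrative; the tool generates data in the qwen (react) and minicpm formats described in the repo):

```python
import json

# Illustrative only: one ReAct-style sample of the kind the generator produces.
sample = {
    "query": "What will the weather be in Beijing tomorrow?",
    "react": (
        "Thought: I need a weather tool to get tomorrow's forecast for Beijing.\n"
        'Action: weather_search\n'
        'Action Input: {"city": "Beijing", "date": "tomorrow"}\n'
        "Observation: Sunny, 18-27°C.\n"
        "Thought: I now know the final answer.\n"
        "Final Answer: Tomorrow Beijing will be sunny, around 18-27°C."
    ),
}

# Training data is usually stored one JSON object per line (jsonl).
with open("react_train.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(sample, ensure_ascii=False) + "\n")
```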
Using MiniCPMV2.6, I implemented the project from the AutoPlan paper, which can plan and execute complex tasks (a simplified control-flow sketch follows the list below).
- Input query:
- Obtain task decomposition
- Obtain task execution
- Obtain final answer
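The overall control flow mirrors this plan-then-execute pattern. The sketch below is a simplified illustration only; the chat function, prompts, and tool registry are placeholders, not the project's actual code.

```python
def auto_plan(query, chat, tools):
    """Simplified plan-then-execute loop in the spirit of AutoPlan.

    `chat(prompt) -> str` stands in for a MiniCPMV2.6 call and `tools` maps
    tool names to callables; both are placeholders, not the project's API.
    """
    # 1. Task decomposition: ask the model for an ordered list of sub-tasks.
    plan = chat(f"Decompose the following task into numbered sub-tasks:\n{query}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Task execution: solve each sub-task, with the tool list exposed in the prompt.
    notes = []
    for task in subtasks:
        result = chat(f"Sub-task: {task}\nAvailable tools: {list(tools)}\nSolve it.")
        notes.append(f"{task} -> {result}")

    # 3. Final answer: summarize the intermediate results.
    return chat("Summarize these results into a final answer:\n" + "\n".join(notes))
```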
See the Feishu Document
Unlike Peking University's Chatlaw team, which trains a separate model for each personality, I achieved seamless switching between 16 personalities with a single 2B model (multi-personality role playing).
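At inference time, switching personas is just a matter of the conversation template. The sketch below is hypothetical (the prompt wording is illustrative, not the repo's training prompt); the fine-tuned 2B model is what makes the switch actually hold across all 16 MBTI types.

```python
MBTI_TYPES = [
    "INTJ", "INTP", "ENTJ", "ENTP", "INFJ", "INFP", "ENFJ", "ENFP",
    "ISTJ", "ISFJ", "ESTJ", "ESFJ", "ISTP", "ISFP", "ESTP", "ESFP",
]

def build_messages(personality, user_input):
    """Build a chat where the system prompt selects one of the 16 personalities."""
    assert personality in MBTI_TYPES
    return [
        {"role": "system",
         "content": f"You are a role-playing assistant with an {personality} personality."},
        {"role": "user", "content": user_input},
    ]

# Switching personality is just switching the system prompt on the same model:
messages = build_messages("INFP", "How was your weekend?")
```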
The official MiniCPMV fine-tuning only supports training on text-image pairs. This project modifies the training code so that both pure-text and text-image samples can be used; it is located in the MiniCPM_Series_Tutorial/ft_language_replace_file folder.
You can access the Hybrid Modality Fine-Tuning Tutorial
Alignment training degrades the language modality of a multimodal model (MLLM): its ability to answer pure-text inputs drops, which is often called the "alignment tax" (essentially another form of catastrophic forgetting). A simple way to mitigate catastrophic forgetting is to mix the original data back in; for the loss of language ability in an MLLM, that means mixing in pure-language data. Which language data to mix, and in what proportion, is not the focus of this article, and I cannot fully solve that problem either. In practice, an MLLM does not need to be a jack-of-all-trades in language; it needs to keep basic Q&A and strong domain-specific responses while retaining excellent multimodal capabilities.
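The data-mixing idea itself is simple. A minimal sketch, assuming illustrative field names and an arbitrary mixing ratio (the actual code change lives in ft_language_replace_file):

```python
import random

def build_mixed_dataset(image_text_samples, text_only_samples, text_ratio=0.3):
    """Mix pure-text samples into the image-text fine-tuning set.

    A text-only sample simply carries no image, so the collator and the model
    forward pass must tolerate `image=None` -- that is the behavior this
    project's modified training code enables.
    """
    n_text = min(len(text_only_samples), int(len(image_text_samples) * text_ratio))
    mixed = list(image_text_samples) + random.sample(text_only_samples, n_text)
    random.shuffle(mixed)
    return mixed

# Illustrative sample formats:
# {"image": "page_001.jpg", "conversations": [...]}  # text-image pair
# {"image": None,           "conversations": [...]}  # pure text
```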
There isn't much to explain here: this project runs RAG with very little memory (around 4GB).
Tutorial available at RAG
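The core retrieval-then-generate loop of a memory-frugal RAG pipeline can be sketched as below. This is illustrative only; `embed` and `generate` stand in for whatever small embedding model and quantized LLM the tutorial actually uses.

```python
import numpy as np

def top_k(query_vec, chunk_vecs, k=3):
    """Cosine-similarity retrieval over a small in-memory index (no vector DB needed)."""
    q = query_vec / np.linalg.norm(query_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    return np.argsort(-(c @ q))[:k]

def rag_answer(query, chunks, embed, generate, k=3):
    """`embed: str -> np.ndarray` and `generate: str -> str` are placeholders."""
    chunk_vecs = np.stack([embed(c) for c in chunks])
    context = "\n".join(chunks[i] for i in top_k(embed(query), chunk_vecs, k))
    return generate(f"Answer using only the context below.\n{context}\nQuestion: {query}")
```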
Since the bnb-quantized MiniCPMV2.6 cannot be loaded by vLLM, I adapted AutoAWQ. I have already submitted a PR to AutoAWQ; once it is merged, this will be directly usable.
Feishu Tutorial Link. Usage steps are as follows:
- Get my AutoAWQ branch and install it:

  ```bash
  git clone https://github.com/LDLINGLINGLING/AutoAWQ
  cd AutoAWQ
  git checkout minicpmv2.6
  pip install -e .
  ```
- Use the `modeling_minicpmv.py` file in the `MiniCPM_Series_Tutorial/MiniCPMV2_6_awq` directory to replace the same-named file in your MiniCPMV2.6 model save path.
- Modify `model_path` in `MiniCPM_Series_Tutorial/MiniCPMV2_6_awq/quantize.py` to your MiniCPMV2.6 save path.
- Run `quantize.py`.
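For orientation, the quantization step does roughly what standard AutoAWQ usage looks like below. This is a sketch only: the paths and quant-config values are illustrative, and the multimodal-specific handling lives in the minicpmv2.6 branch's `quantize.py`, which may differ.

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "/path/to/MiniCPM-V-2_6"        # your MiniCPMV2.6 save path
quant_path = "/path/to/MiniCPM-V-2_6-awq"    # output directory for the AWQ model
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoAWQForCausalLM.from_pretrained(model_path, trust_remote_code=True)

model.quantize(tokenizer, quant_config=quant_config)  # uses default calibration data
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```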
After obtaining the AWQ-quantized MiniCPMV2.6 model, you can deploy it with vLLM exactly as before; VRAM usage drops from 16GB to 7GB.
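For reference, a minimal vLLM offline-inference call against the quantized checkpoint might look like this (the path and sampling parameters are illustrative; image inputs follow vLLM's usual multimodal prompt format and are omitted here):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/MiniCPM-V-2_6-awq",  # the quantized output directory
    quantization="awq",
    trust_remote_code=True,
    max_model_len=2048,
)
params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Introduce MiniCPMV2.6 in one paragraph."], params)
print(outputs[0].outputs[0].text)
```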