
RuntimeError: Mismatch in Input Channels for Convolution Layer #4

Open
norbob opened this issue Jan 8, 2025 · 3 comments
norbob commented Jan 8, 2025

During model training, a runtime error occurs due to a channel mismatch in a convolutional layer:

RuntimeError: Given groups=1, weight of size [3072, 32, 2, 2], expected input[4, 33, 32, 32] to have 32 channels, but got 33 channels instead

This error indicates that the model expects input with 32 channels but receives input with 33 channels instead, causing the training process to fail.
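For illustration, the same class of error can be reproduced with a standalone Conv2d whose weight shape matches the one in the traceback; this is a minimal sketch, not code from the training pipeline:

```python
import torch
import torch.nn as nn

# Weight of size [3072, 32, 2, 2]: 3072 output channels, 32 expected
# input channels, 2x2 kernel.
conv = nn.Conv2d(in_channels=32, out_channels=3072, kernel_size=2, stride=2)

# A batch of 4 tensors with 33 channels instead of the expected 32.
x = torch.randn(4, 33, 32, 32)
conv(x)  # RuntimeError: ... expected input[4, 33, 32, 32] to have 32 channels, but got 33 channels instead
```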

Already Checked

The dataset consists of 512x512 images with 24-bit depth (3 RGB channels), and all images have been verified to meet these criteria.
The image sizes and channel counts were re-verified with a script, and they match the expected specifications (a sketch of such a check is shown below).
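For reference, a check along these lines can be done with Pillow; the directory and file pattern below are placeholders for the actual dataset layout:

```python
from pathlib import Path
from PIL import Image

# Placeholder path and extension: adjust to the actual dataset layout.
for path in sorted(Path("dataset").rglob("*.png")):
    with Image.open(path) as img:
        # 24-bit RGB images report mode "RGB"; size is (width, height).
        if img.size != (512, 512) or img.mode != "RGB":
            print(f"unexpected image: {path} size={img.size} mode={img.mode}")
```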

Question

What steps can be taken to identify the root cause of this channel mismatch? Could this issue be related to model configuration or preprocessing steps? Any guidance on resolving this would be greatly appreciated.

@Passenger12138
Owner

From the error message, it seems that the issue is caused by a mismatch in the model architecture. We can investigate this from the following two perspectives:

  1. Model-related issue
    The error indicates that there might be a mismatch in the input layer. Specifically, the v1 version of CogVideoX-I2V uses a Conv2d for patch_embed.proj, while the v1.5 version uses a Linear layer. This error should not occur if you are using the v1.5 model. Please ensure that you are using the correct model from the Huggingface repository (a sketch for checking which variant is loaded follows this list):
    https://huggingface.co/THUDM/CogVideoX1.5-5B-I2V.

  2. Data-related issue
    It also seems that you might be using image data to fine-tune the i2v model. Please note that the i2v model is designed for video generation tasks, and it is not feasible to use image data for fine-tuning. Make sure that your dataset consists of video data.
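To confirm which variant is actually loaded, one option is to inspect the patch embedding directly. This sketch assumes the diffusers implementation of CogVideoX, where the transformer exposes patch_embed.proj (nn.Conv2d in v1.0 checkpoints, nn.Linear in v1.5):

```python
import torch.nn as nn
from diffusers import CogVideoXTransformer3DModel

# Load only the transformer component of the checkpoint.
transformer = CogVideoXTransformer3DModel.from_pretrained(
    "THUDM/CogVideoX1.5-5B-I2V", subfolder="transformer"
)

# v1.0 checkpoints use a Conv2d patch projection; v1.5 uses a Linear layer.
proj = transformer.patch_embed.proj
print(type(proj))
assert isinstance(proj, nn.Linear), "patch_embed.proj is not Linear; likely a v1.0 checkpoint"
```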

Additionally, to better assist you in diagnosing the issue, could you please provide the full traceback or more screenshots of the error? This will help me pinpoint the exact cause of the problem.

For your information, if you are trying to perform text-to-video (T2V) tasks, I plan to release LoRA fine-tuning and full fine-tuning support for CogVideoX 1.5 in the next version. This will allow you to fine-tune the model for T2V tasks effectively.

@wujiafu007

Thank you for your response; the issue has been resolved.

@norbob
Author

norbob commented Jan 10, 2025

Thanks for the response! I wasn't aware that images can't be used. Is it generally the case that still images cannot be used as a dataset for LoRAs in video models? I'm a newcomer, and it's really hard to grasp these connections right away.
