Running problem on Linux server #342

Open
looperalt opened this issue Dec 23, 2024 · 15 comments
@looperalt

When I run iBVPnet on a Linux server, the following problems occur:
TypeError: Binding inputs to tf.function wrapped_fn failed due to Can not cast TensorSpec(shape=(1, 1024, 1365, 4), dtype=tf.float32, name=None) to TensorSpec(shape=(None, None, None, 3), dtype=tf.float32, name=None). Received args: (array([[[[149. , 155. , 163. , 0. ]
and
ValueError: ('train', 'No files in file list')

@yahskapar
Collaborator

Hi @looperalt,

My guess is something went wrong when trying to preprocess the dataset - did you check your terminal before that last error message to see if preprocessing was actually successful (e.g., the progress bar for preprocessing showed up, was completed, etc)? Remember, you have to preprocess the dataset for the first time and anytime you change key parameters that would affect preprocessing in your config file (refer to the example config files for more details). Also, note the project README and how datasets are expected to be organized (in most cases, the way they are downloaded).

If it seems like preprocessing somehow starts and then abruptly stops, you could try seeing if the issue is the default multi_process_quota configuration here. Try setting that to a lower value (perhaps 1 to begin with).

@yahskapar yahskapar self-assigned this Dec 23, 2024
@looperalt
Author

Yes, a progress bar appears during preprocessing, but when the progress bar reaches the end, this issue occurs. Strangely, when I run the code locally on Windows there is no problem; the issue only arises on the remote Linux server. I have followed the process outlined in the README file without any operational errors, and I am puzzled as to why the results differ between systems. I would like to know whether you have encountered this issue and how to adjust the parameters when running this code on a remote Linux server.

@yahskapar
Collaborator

yahskapar commented Dec 23, 2024

@looperalt,

If you can run it locally on a Windows machine but somehow not on a Linux server, are you sure there isn't something going wrong with how your dataset (whether the dataset itself or the preprocessed folder) is being pointed to in the Linux case? I guess if the Linux remote server has too much CPU usage going on, that could also lead to stuck or dead processes that prevent successful preprocessing, but if you use top and similar commands you may find that isn't likely and that you should be fine in that regard.

Feel free to share the config you're trying to run here, perhaps there is something up with the file paths that I can identify.

@looperalt
Author

When running on the Linux server, the progress bar advances normally, but once it is full the program displays ValueError: ('train', 'No files in file list') and exits, indicating that the dataset was read but not preprocessed or saved. This proves that the dataset path is not the issue, and the cause might be the process being killed for using too much memory. Can this issue be resolved? I don't know if you have experience in dealing with such issues. I will try to share my configuration file with you tomorrow. Thank you for your answer, and I am very grateful.

@yahskapar
Collaborator

If it was read and not preprocessed or saved, again, try two things: 1) adjust multi_process_quota to a lower value, perhaps 1 to begin with, and see if that helps, and 2) check your preprocessed data save path itself and make sure you actually have permissions to write to it.

If 1) does not make a difference at all, let me know, and we can dig into other things that may be specific to your situation and causing issues. I should note, the majority of toolbox users (i.e., hundreds of people, myself included) use Linux and the default multi-process setting without any issue, so troubleshooting with respect to your particular remote server is the way to go.
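For the second check, a minimal sketch of a write test in Python is shown here. The function name `can_write_to` and the `save_path` value are illustrative assumptions, not part of the toolbox; substitute the preprocessed data save path from your own config.

```python
import os
import tempfile


def can_write_to(directory: str) -> bool:
    """Return True if a file can be created and written inside `directory`."""
    if not os.path.isdir(directory):
        return False
    try:
        # Create a real temporary file: this exercises the same permissions
        # (and quota limits) that saving preprocessed files would need.
        # The file is deleted automatically when the block exits.
        with tempfile.NamedTemporaryFile(dir=directory, suffix=".tmp") as handle:
            handle.write(b"write test")
            handle.flush()
        return True
    except OSError:
        return False


# Hypothetical placeholder: replace with your preprocessed data save path.
save_path = tempfile.gettempdir()
print(save_path, "writable:", can_write_to(save_path))
```

Unlike a bare permission-bit check, actually writing a file also catches read-only mounts and exhausted quotas.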

@looperalt
Author

I have tried the methods you suggested, continuously adjusting the parameters of the multi-process function, but unfortunately, the error still persists.
Preprocessing dataset...
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [04:46<00:00, 71.68s/it]
Traceback (most recent call last):
File "main.py", line 177, in <module>
train_data_loader = train_loader(
File "/03/Datasets/rppgt/dataset/data_loader/iBVPLoader.py", line 50, in __init__
super().__init__(name, data_path, config_data)
File "/03/Datasets/rppgt/dataset/data_loader/BaseLoader.py", line 68, in __init__
self.preprocess_dataset(self.raw_data_dirs, config_data.PREPROCESS, config_data.BEGIN, config_data.END)
File "/03/Datasets/rppgt/dataset/data_loader/BaseLoader.py", line 209, in preprocess_dataset
self.build_file_list(file_list_dict) # build file list
File "/03/Datasets/rppgt/dataset/data_loader/BaseLoader.py", line 531, in build_file_list
raise ValueError(self.dataset_name, 'No files in file list')
ValueError: ('train', 'No files in file list')

@yahskapar
Collaborator

Ok, so we've ruled out multi-processing as far as too many processes being the issue.

Here's a few more things to try:

  1. The preprocessed dataset path on the remote server, CACHED_PATH, can you actually write to it on the remote server? Are you able to make files in it (e.g., touch example.txt) and modify those files? It may help to share your config file at this point just to check for anything subtle (e.g., anything weird in the cached file path itself).

  2. Assuming you're trying to use the iBVP dataset based on the error message you pasted, add a simple print statement on the line here, with the same indentation level as the else statement above the line. Print out some debug details about the output of read_video(); for example, you can use np.shape as a quick check to make sure that at least the video read is being done successfully. Are sane results returned (e.g., number of frames with dimensions)?

  3. Can you also try re-running the repo setup instructions? This should delete your existing conda environment and make a new one. I would carefully inspect the setup outputs and make sure no errors appeared - depending on the error, there could be a subtle effect on preprocessing with respect to things like face detection.
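As a rough illustration of the debug print in step 2, the sketch below summarizes an array of frames. The helper name `describe_frames` is hypothetical, and the zero-filled array only stands in for the real read_video() output.

```python
import numpy as np


def describe_frames(frames) -> str:
    """Summarize shape, dtype, and value range of a stack of video frames."""
    arr = np.asarray(frames)
    summary = f"shape={np.shape(arr)}, dtype={arr.dtype}"
    if arr.size > 0:
        # .item() converts NumPy scalars to plain Python numbers for printing.
        summary += f", min={arr.min().item()}, max={arr.max().item()}"
    return summary


# Stand-in for the real read_video() output: 2 frames of 4x6 pixels, 3 channels.
fake_frames = np.zeros((2, 4, 6, 3), dtype=np.uint8)
print(describe_frames(fake_frames))
```

Printing the dtype and channel count alongside the shape is cheap and can expose mismatches that a frame count alone would hide.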

All the best,

Akshay

@looperalt
Author

I have confirmed that my preprocessing folder has write permissions. In fact, there is something very strange: when I first tried to run the code, .npy files appeared in the preprocessing folder, but when I closed the program and ran it again, the .npy files never appeared again. This is a very confusing situation for me.
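One way to see whether a given run actually produced output is to list the .npy files under the preprocessing folder before and after the run. This is a minimal sketch; `list_npy_files` is an illustrative name and the path shown is a placeholder, not a value from this thread.

```python
from pathlib import Path


def list_npy_files(cached_dir: str) -> list:
    """Return the sorted .npy files found anywhere under `cached_dir`."""
    root = Path(cached_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.npy"))


# Hypothetical placeholder: replace with your preprocessing folder.
files = list_npy_files("/path/to/preprocessed")
print(len(files), ".npy files found")
for path in files[:5]:
    # File size helps distinguish real outputs from empty or truncated files.
    print(path, path.stat().st_size, "bytes")
```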

@yahskapar
Collaborator

I would troubleshoot that a bit more, that does sound strange and it's really hard for me to tell what the issue might be since it sounds quite specific to your remote server / your environment. Is it safe to assume, using df -h or similar, you aren't running out of storage space on your remote server somehow?
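The same storage check can be done from Python with the standard library, which is handy when logging it next to the preprocessing run. A small sketch, with `free_gib` as an illustrative helper name:

```python
import shutil


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that contains `path`."""
    usage = shutil.disk_usage(path)
    return usage.free / (1024 ** 3)


# "." is a placeholder: point this at the preprocessed data folder.
print(f"{free_gib('.'):.1f} GiB free")
```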

@looperalt
Author

When I run this code, every CPU core is fully loaded. The server I use has 16 cores and 80 GB of memory, and a direct run shows CPU usage above 1600%. After a while the program exits with this error. Could this be related to my memory size and CPU performance?

@looperalt
Author

When I run the code on a different server, I get the following error:
Traceback (most recent call last):
File "/root/miniconda3/envs/ljq/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/root/miniconda3/envs/ljq/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/autodl-tmp/rppgt/dataset/data_loader/iBVPLoader.py", line 164, in preprocess_dataset_subprocess
frames_clips, bvps_clips = self.preprocess(frames, bvps, config_preprocess)
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 225, in preprocess
frames = self.crop_face_resize(
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 374, in crop_face_resize
face_region_all.append(self.face_detection(frames[detection_freq * idx], backend, use_larger_box, larger_box_coef))
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 287, in face_detection
face_zone = detector.detectMultiScale(frame)
cv2.error: OpenCV(4.5.2) /tmp/pip-req-build-13uokl4r/opencv/modules/objdetect/src/cascadedetect.cpp:1389: error: (-215:Assertion failed) scaleFactor > 1 && _image.depth() == CV_8U in function 'detectMultiScale'
The following error is still displayed when the program exits:
Traceback (most recent call last):
File "main.py", line 177, in <module>
train_data_loader = train_loader(
File "/root/autodl-tmp/rppgt/dataset/data_loader/iBVPLoader.py", line 50, in __init__
super().__init__(name, data_path, config_data)
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 68, in __init__
self.preprocess_dataset(self.raw_data_dirs, config_data.PREPROCESS, config_data.BEGIN, config_data.END)
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 209, in preprocess_dataset
self.build_file_list(file_list_dict) # build file list
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 531, in build_file_list
raise ValueError(self.dataset_name, 'No files in file list')
ValueError: ('train', 'No files in file list')
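A note on the OpenCV assertion above: it names two conditions, scaleFactor > 1 and an 8-bit image depth (CV_8U). Since detectMultiScale is called with only the frame, scaleFactor keeps its OpenCV default of 1.1, so the image depth is the condition worth checking first. The sketch below converts a frame to uint8 before detection. It is an assumption-laden illustration rather than the toolbox's own fix: `to_uint8` is a hypothetical helper, and it assumes pixel values are already on a 0-255 scale.

```python
import numpy as np


def to_uint8(frame: np.ndarray) -> np.ndarray:
    """Return `frame` as uint8, assuming pixel values are on a 0-255 scale."""
    if frame.dtype == np.uint8:
        return frame
    # Round, then clip so out-of-range values cannot wrap around on the cast.
    return np.clip(np.rint(frame), 0, 255).astype(np.uint8)


# Example with a small float32 frame, similar in dtype to the failing input.
float_frame = np.array([[149.2, 155.7, 163.0]], dtype=np.float32)
print(to_uint8(float_frame).dtype)
```

If the frames were instead normalized to a 0-1 range, they would need rescaling before this cast, so it is worth printing the value range first.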

@looperalt
Author

The following error is displayed when using RetinaFace (RF) as the face processing module:
Traceback (most recent call last):
File "/root/miniconda3/envs/ljq/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/root/miniconda3/envs/ljq/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/autodl-tmp/rppgt/dataset/data_loader/iBVPLoader.py", line 164, in preprocess_dataset_subprocess
frames_clips, bvps_clips = self.preprocess(frames, bvps, config_preprocess)
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 225, in preprocess
frames = self.crop_face_resize(
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 374, in crop_face_resize
face_region_all.append(self.face_detection(frames[detection_freq * idx], backend, use_larger_box, larger_box_coef))
File "/root/autodl-tmp/rppgt/dataset/data_loader/BaseLoader.py", line 303, in face_detection
res = RetinaFace.detect_faces(frame)
File "/root/miniconda3/envs/ljq/lib/python3.8/site-packages/retinaface/RetinaFace.py", line 90, in detect_faces
net_out = model(im_tensor)
File "/root/miniconda3/envs/ljq/lib/python3.8/site-packages/tensorflow/python/util/traceback_utils.py", line 153, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/root/miniconda3/envs/ljq/lib/python3.8/site-packages/tensorflow/python/eager/polymorphic_function/function_spec.py", line 406, in bind_function_inputs
raise TypeError(
TypeError: Binding inputs to tf.function wrapped_fn failed due to Can not cast TensorSpec(shape=(1, 1024, 1365, 4), dtype=tf.float32, name=None) to TensorSpec(shape=(None, None, None, 3), dtype=tf.float32, name=None). Received args: (array([[[[149. , 155. , 163. , 0. ],
[149. , 155. , 163. , 0. ],
[149. , 155. , 163. , 0. ],
...,
[123. , 129. , 137. , 0. ],
[123. , 129. , 137. , 0. ],
[123. , 129. , 137. , 0. ]],

    [[149.20312 , 155.20312 , 163.20312 ,   0.      ],
     [149.20312 , 155.20312 , 163.20312 ,   0.      ],
     [149.20312 , 155.20312 , 163.20312 ,   0.      ],
     ...,
     [123.      , 129.      , 137.      ,   0.      ],
     [123.      , 129.      , 137.      ,   0.      ],
     [123.      , 129.      , 137.      ,   0.      ]],

    [[149.67188 , 155.67188 , 163.67188 ,   0.      ],
     [149.67188 , 155.67188 , 163.67188 ,   0.      ],
     [149.67188 , 155.67188 , 163.67188 ,   0.      ],
     ...,
     [123.      , 129.      , 137.      ,   0.      ],
     [123.      , 129.      , 137.      ,   0.      ],
     [123.      , 129.      , 137.      ,   0.      ]],

    ...,

    [[146.32812 , 128.32812 , 107.328125,   0.      ],
     [146.32812 , 128.32812 , 107.328125,   0.      ],
     [146.32812 , 128.32812 , 107.328125,   0.      ],
     ...,
     [ 41.58423 ,  38.58423 ,  42.58423 ,   0.      ],
     [ 39.55542 ,  36.55542 ,  40.55542 ,   0.      ],
     [ 38.      ,  35.      ,  39.      ,   0.      ]],

    [[146.79688 , 128.79688 , 107.796875,   0.      ],
     [146.79688 , 128.79688 , 107.796875,   0.      ],
     [146.79688 , 128.79688 , 107.796875,   0.      ],
     ...,
     [ 41.972412,  38.972412,  42.972412,   0.      ],
     [ 39.723877,  36.723877,  40.723877,   0.      ],
     [ 38.      ,  35.      ,  39.      ,   0.      ]],

    [[147.      , 129.      , 108.      ,   0.      ],
     [147.      , 129.      , 108.      ,   0.      ],
     [147.      , 129.      , 108.      ,   0.      ],
     ...,
     [ 42.140625,  39.140625,  43.140625,   0.      ],
     [ 39.796875,  36.796875,  40.796875,   0.      ],
     [ 38.      ,  35.      ,  39.      ,   0.      ]]]],
  dtype=float32),) and kwargs: {} for signature: (args_0: TensorSpec(shape=(None, None, None, 3), dtype=tf.float32, name=None), /).
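The TypeError above reports a received shape of (1, 1024, 1365, 4) against an expected (None, None, None, 3), and every printed pixel has 0 as its fourth value. One possible workaround, sketched here under the assumption that the fourth channel is an unused alpha or padding channel, is to keep only the first three channels before face detection. `drop_extra_channels` is a hypothetical helper, not a toolbox function.

```python
import numpy as np


def drop_extra_channels(frames: np.ndarray) -> np.ndarray:
    """Keep only the first three channels of a (..., H, W, C) array."""
    if frames.ndim < 3 or frames.shape[-1] < 3:
        raise ValueError(f"expected at least 3 channels, got shape {frames.shape}")
    # Slicing the last axis leaves 3-channel input unchanged.
    return frames[..., :3]


# Small stand-in with the same layout as the failing input (N, H, W, 4).
rgba = np.zeros((1, 8, 10, 4), dtype=np.float32)
print(drop_extra_channels(rgba).shape)
```

Whether dropping the channel is safe depends on how the video was decoded, so confirming that the fourth channel really is constant across frames would be a sensible first step.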

@yahskapar
Collaborator

yahskapar commented Dec 29, 2024

I suspect this has something to do with your file path for the Haar Cascade model or perhaps related libraries that were installed (e.g., OpenCV).

  1. Try adding the following code just after this line:
if detector.empty():
    print("Failed to load Haar cascade!")

That, or something similar to check if the detector was initialized properly, should tell us if the issue is with the detector initialization itself. My guess at this point is that somehow the Haar Cascade .xml that is inside the repo isn't being picked up properly. You could try setting that model path to be an absolute path (i.e., specific to your machine) rather than the current relative path here.

  2. You can also try reinstalling or upgrading OpenCV, but generally speaking it seems like your OpenCV is fine based on the version reported in the error itself (4.5.2).
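To try the absolute-path suggestion from the first item, a minimal sketch could resolve the model path up front and fail loudly if the file is missing. The helper name `resolve_model_path`, the base directory, and the file name used here are assumptions for illustration; use the path your copy of the repo actually references.

```python
import os


def resolve_model_path(relative_path: str, base_dir: str) -> str:
    """Return an absolute path to the model file, or raise if it is missing."""
    absolute = os.path.abspath(os.path.join(base_dir, relative_path))
    if not os.path.isfile(absolute):
        raise FileNotFoundError(f"Haar cascade file not found: {absolute}")
    return absolute


# Assumed file name and base directory, for illustration only.
try:
    model_path = resolve_model_path("haarcascade_frontalface_default.xml", os.getcwd())
    print("Using cascade at", model_path)
except FileNotFoundError as err:
    print(err)
```

The resolved path can then be passed to cv2.CascadeClassifier, followed by the detector.empty() check shown above, which separates a missing file from a file that exists but fails to load.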

@sajjadk442

sajjadk442 commented Jan 2, 2025

I'm facing the same issue, but for me it's ValueError: ('test', 'No files in file list'). The same project works fine when I run it locally, but when I upload it to Drive and run the code from Google Colab, it shows this error. I have triple-checked the paths, and the files are being loaded into memory, since RAM usage goes up. After the progress bar reaches 100%, it throws this error and exits the program.

I have changed multi_process_quota to 1, but I still get the same error.
The Haar cascade file is there in the dataset dir; I have double-checked.

@yahskapar
Collaborator

@sajjadk442,

I'd recommend making a separate issue regarding this with more details, but I should stress this toolbox was never designed nor tested to run out-of-the-box on Google Colab, so you may have to figure this out on your own unless someone else has also done this successfully and can help.
