
Feature request: Allow to specify onRun timeouts in the GUI #174

Open
patrickjahns opened this issue Feb 12, 2025 · 5 comments
@patrickjahns

Describe the problem

With #173, default timeouts for starting/stopping cinema4d have been added.
While exploring deadline-cloud over the past weeks, we noticed that on some projects cinema4d would render a scene, report a progress of 100%, and then fail "to finish" the render. For some reason it just "gets stuck" - manually cancelling that task and requeuing it has resolved the problem each time. This sometimes occurred during overnight jobs, leaving workers "stuck" for hours at a time - which is wasted rendering time.

Proposed Solution

Adding a timeout to the onRun action makes it possible to automatically cancel and retry the individual task after a certain period of time.
As we sometimes have frames that render for several minutes, it would be great to be able to specify the timeout for that render task in the cinema4d-submitter GUI:

  script:
    embeddedFiles:
    - name: runData
      filename: run-data.yaml
      type: TEXT
      data: |
        frame: {{Task.Param.Frame}}
    actions:
      onRun:
        command: cinema4d-openjd
        args:
        - daemon
        - run
        - --connection-file
        - '{{ Session.WorkingDirectory }}/connection.json'
        - --run-data
        - file://{{ Task.File.runData }}
        cancelation:
          mode: NOTIFY_THEN_TERMINATE
        timeout: 180 

Specifically, add this parameter and make it configurable in the GUI:

        timeout: 180 
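
One possible shape for this, sketched under the assumption of a hypothetical OnRunTimeoutSeconds job parameter added by the submitter (the parameter name is made up here, and whether the timeout field accepts a parameter reference or needs the submitter to substitute a literal integer depends on the OpenJD template schema):

  parameterDefinitions:
  - name: OnRunTimeoutSeconds   # hypothetical parameter, set via the submitter GUI
    type: INT
    default: 180
    description: Maximum time in seconds the onRun render action may take before it is cancelled

  # ...and in the step's onRun action (command/args as generated today):
        cancelation:
          mode: NOTIFY_THEN_TERMINATE
        timeout: '{{Param.OnRunTimeoutSeconds}}'   # or a literal value written by the submitter

Exposing the value as a job parameter would also allow it to be adjusted per job at submission time instead of requiring a template edit.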

Example Use Cases

As outlined before: be able to add a render timeout via the submitter GUI, so that problems with cinema4d "hanging" can be caught automatically and the task retried several times.

@patrickjahns patrickjahns added enhancement New feature or request needs triage A new issue that needs a first look labels Feb 12, 2025
@karthikbekalp
Contributor

karthikbekalp commented Feb 12, 2025

Hi @patrickjahns ,

Thank you for raising this feature request. I will discuss this enhancement with my team and evaluate any potential side effects it may introduce.

I noticed that you also mentioned an issue where the worker shows 100% progress but fails to complete the render. Could you please share the worker/session logs for the workers that got stuck in a new GitHub issue if you don't mind?

If you can provide any reproduction steps, it would be greatly appreciated. This will help us identify and fix the root cause of the workers getting stuck and prevent wasted time. :)

@patrickjahns
Author

Hello @karthikbekalp ,

Thank you very much for the quick response 🙏

I noticed that you also mentioned an issue where the worker shows 100% progress but fails to complete the render. Could you please share the worker/session logs for the workers that got stuck in a new GitHub issue if you don't mind?
If you can provide any reproduction steps, it would be greatly appreciated. This will help us identify and fix the root cause of the workers getting stuck and prevent wasted time. :)

I did check on the jobs that had the symptoms; however, the logs are no longer available - I will keep an eye out and get them for you.
In the meantime I can describe the problem a bit more - though I do not know if there is much that can be done on the deadline team side, besides allowing the timeout parameter to be defined/modified (without needing to edit the job template file manually).

Our observation:

For us the symptom appears to be that internally cinema4d never properly finishes the job (the render function never returns: https://github.com/aws-deadline/deadline-cloud-for-cinema-4d/blob/mainline/src/deadline/cinema4d_adaptor/Cinema4DClient/cinema4d_handler.py#L97-L104).

Example log to illustrate the issue
  • the process starts and runs until it reports:
....
2025/02/12 00:45:53+01:00 Running command C:\Python311\Scripts\cinema4d-openjd.EXE daemon run --connection-file C:\ProgramData\Amazon\OpenJD\session-da33b625183e47638caca02b919054dbv1l2ovgh/connection.json --run-data file://C:\ProgramData\Amazon\OpenJD\session-da33b625183e47638caca02b919054dbv1l2ovgh\embedded_fileszpp0p8jw\run-data.yaml
2025/02/12 00:45:53+01:00 Command started as pid: 13064
2025/02/12 00:45:53+01:00 Output:
2025/02/12 00:45:53+01:00 INFO: Applying user-level configuration: C:\Users\deadline-worker\.openjd\adaptors\runtime\configuration.json
2025/02/12 00:45:53+01:00 INFO: Applying user-level configuration: C:\Users\deadline-worker\.openjd\adaptors\Cinema4DAdaptor\Cinema4DAdaptor.json
2025/02/12 00:45:53+01:00 ADAPTOR_OUTPUT: 
2025/02/12 00:45:53+01:00 ADAPTOR_OUTPUT: STDOUT: Performing action: {"name": "frame", "args": {"frame": 4}}
2025/02/12 00:45:53+01:00 ADAPTOR_OUTPUT: STDOUT: Performing action: {"name": "start_render", "args": {"frame": 4}}
2025/02/12 00:46:00+01:00 ADAPTOR_OUTPUT: STDOUT: Rendering frame 4 at <Wed Feb 12 00:46:00 2025>
2025/02/12 00:46:20+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 0
2025/02/12 00:46:25+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 16
2025/02/12 00:46:29+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 25
2025/02/12 00:46:35+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 33
2025/02/12 00:46:40+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 41
2025/02/12 00:46:41+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 50
2025/02/12 00:46:44+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 58
2025/02/12 00:46:48+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 66
2025/02/12 00:46:50+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 75
2025/02/12 00:46:53+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 83
2025/02/12 00:46:54+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 91
2025/02/12 00:46:55+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 100
  • Cinema4d hangs here
  • CPU usage goes down
  • GPU usage stops and memory gets deallocated

In a normal case, the following log lines would appear, but when the job is stuck, it stays at the log output above "forever" (until we interrupt it):

2025/02/12 00:46:56+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 100
2025/02/12 00:46:56+01:00 ADAPTOR_OUTPUT: STDOUT: ALF_PROGRESS 100
2025/02/12 00:46:57+01:00 ADAPTOR_OUTPUT: STDOUT: Finished Rendering
2025/02/12 00:46:57+01:00 INFO: Done Cinema4DAdaptor main
2025/02/12 00:46:57+01:00 Process pid 13064 exited with code: 0 (unsigned) / 0x0 (hex)
2025/02/12 00:46:57+01:00 ----------------------------------------------
2025/02/12 00:46:57+01:00 Uploading output files to Job Attachments
2025/02/12 00:46:57+01:00 ----------------------------------------------
....

We have also let the cinema4d render run "locally" (via the c4d inbuilt render view dialog) on the same machine (and on different machines) and we observe the same "problems": at random frames the render job just stops, sits "idle", and wastes time "doing nothing". We have reached out to Maxon support regarding this issue - however, since this is not a crash and the problem cannot be pinned to specific frames and occurs randomly, their suggestion is to just try the usual "voodoo" (update drivers, downgrade drivers, patch Windows, change c4d versions, etc.).
However, when forcefully stopping cinema4d and picking up at the same frame, it renders fine.

On CMF fleets we could try to create AMIs with pinned drivers and pinned cinema4d versions - however, that does not seem like a favorable or scalable solution in the long run. So we would rather just "retry" long-running jobs.

We currently edit the adaptor_cinema4d_job_template.yaml (https://github.com/aws-deadline/deadline-cloud-for-cinema-4d/blob/mainline/src/deadline/cinema4d_submitter/adaptor_cinema4d_job_template.yaml#L117) and add a timeout: xxx to the onRun action.
For us this improves the "reliability" of deadline, as it can kill cinema4d in such a scenario and try rendering the frame again. For now we look at the average frame rendering time from previous tests of that scene and just multiply it by 3.
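
As a concrete, made-up example of that rule of thumb: if previous tests of a scene averaged around 240 seconds per frame, the manual edit to the generated onRun action looks roughly like this:

      onRun:
        # command/args unchanged from the generated template
        cancelation:
          mode: NOTIFY_THEN_TERMINATE
        timeout: 720   # ~240 s average frame time x 3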

Since the render time is specific to a project, we would like to be able to specify the timeout in the GUI, so we can safely let large jobs run without having to manually cancel/interrupt and requeue jobs where cinema4d hangs.

I hope the description helps to understand the problem we are facing - for now our solution is to use an inbuilt openjd/deadline feature to mitigate "frustrating issues" with the vendor/upstream software - it would just be nice to "have a button" for that existing feature ;-)

@karthikbekalp
Contributor

Thanks for your reply. I discussed this with my team, and it looks like we implemented a similar solution for Nuke: https://github.com/aws-deadline/deadline-cloud-for-nuke/pull/143/files

But I think all DCCs can benefit from a feature like this. Our team will actively prioritize this feature request. :)

@karthikbekalp karthikbekalp removed the needs triage A new issue that needs a first look label Feb 13, 2025
@patrickjahns
Author

@karthikbekalp

Thank you for your quick response - let me know if there is anything we can help with here.

@karthikbekalp
Contributor

Thanks @patrickjahns.

Here's the draft PR for this feature: aws-deadline/deadline-cloud#605

If you have any feedback or suggestions that can improve this feature, feel free to share them. :)

@karthikbekalp karthikbekalp self-assigned this Feb 19, 2025