This repository has been archived by the owner on Sep 19, 2022. It is now read-only.

kubeflow / pytorch-operator Public archive

Fork 143
Star 307

Issues: kubeflow/pytorch-operator

55 Open 92 Closed

Issues list

unable to build image for ppc64le

#365 opened Nov 27, 2021 by gajanankulkarni-18

PytorchJob DDP training will stop if I delete a worker pod

#364 opened Nov 20, 2021 by Shuai-Xie

run https://github.com/kubeflow/pytorch-operator/blob/master/sdk/python/test/test_e2e.py failed

#363 opened Nov 19, 2021 by sxl1993

Multi-gpu in a single pod

#362 opened Nov 19, 2021 by wallarug

service label mismatches selector, which result in inconsistency (labels: kind/bug)

#360 opened Nov 5, 2021 by konnase

The training hangs after reloading one of master/worker pods (labels: area/engprod, kind/question)

#359 opened Oct 28, 2021 by dmitsf

Can I freeze pytorchjob training pods and migrate them to other nodes?

#356 opened Sep 22, 2021 by Shuai-Xie

Pytorch version may have an effect on the training reproduction

#355 opened Sep 21, 2021 by Shuai-Xie

Different DDP training results of PytorchJob and Bare Metal

#354 opened Sep 18, 2021 by Shuai-Xie

Can PytorchJob skip or cancel the init container?

#352 opened Sep 15, 2021 by SeibertronSS

volcano change the PodGroup CRD APIGroup to volcano.sh

#351 opened Sep 7, 2021 by qiankunli

container "pytorch" is waiting to start: PodInitializing (labels: kind/bug)

#348 opened Aug 15, 2021 by gogogwwb

Upgrade to v1 CRDs

#347 opened Aug 12, 2021 by mcristina422

[feat] Support PyTorch 1.9

#346 opened Aug 4, 2021 by gaocegege

PytorchJob replicas has different node affinity behaviors compared with Deployment

#344 opened Jul 21, 2021 by Shuai-Xie

Worker template should be configurable.

#335 opened May 27, 2021 by MartinForReal

'host not found' error occurs during PyTorch distributed learning (labels: kind/feature)

#333 opened Apr 30, 2021 by JGoo1

NCCL "Connection Refused" for Worker Pods

#332 opened Apr 26, 2021 by twolffpiggott

whether multi-gpu-per-pod setup be supported in PytorchJob

#331 opened Apr 25, 2021 by tingweiwu

can I use PyTorchJobClient inside a pod of the cluster?

#330 opened Apr 9, 2021 by omlomloml

Mnist dataset server is down

#325 opened Mar 17, 2021 by Jeffwan

Operator has invalid memory address error on specific pytorchjob spec

#321 opened Feb 22, 2021 by ca-scribner

Unable to spawn PyTorchJob due to image alpine dependency of pytorch-operator (labels: kind/bug)

#319 opened Feb 11, 2021 by asahalyft

Is python sdk still being maintained?

#317 opened Feb 3, 2021 by ca-scribner

dist.init_process_group stuck

#313 opened Dec 22, 2020 by ravenj73
