how to check Federated Learning Job is working? #419

Sergiossrr · 2023-10-12T02:02:17Z

What happened:
I followed "Using Federated Learning Job in Surface Defect Detection Scenario". The last step says: "After the job completed, you will find the model generated on the directory /model in $EDGE1_NODE and $EDGE2_NODE."
How can I check whether the job has completed or is still running? Running kubectl get federatedlearningjob surface-defect-detection shows only NAME and AGE.

Environment:
openEuler 22.03 LTS
kubernetes v1.21.1
kubeedge v1.14.2
edgemesh v1.14.0
sedna v0.6.0


Sergiossrr · 2023-10-12T02:13:25Z

@MooreZheng @JoeyHwong-gk please give me some suggestions

JoeyHwong-gk · 2023-10-13T01:01:33Z

Hi there,

In the provided example, the federated learning job simulates independent training on two separate edge nodes, each using its own training data. The weights of these independently trained models are then combined in the cloud, which is the aggregation step that makes the job federated.

After job completion, you can find the merged model's weights in the /model directory on either edge node ($EDGE1_NODE or $EDGE2_NODE). This merged model results from federated learning, combining contributions from both nodes. You can compare this merged model with models trained individually to observe differences in performance.

Hope this helps! Feel free to ask if you have further questions.
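
One way to answer the original question without waiting is sketched below, with two assumptions stated up front: that the job resource carries a status section at all (what it contains depends on the Sedna version), and that the model directory is /model on the edge node, as the guide says. fl_job_status and model_ready are hypothetical helper names, and KUBECTL is only an override hook for dry runs.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Sedna. KUBECTL can be overridden for dry runs.

# Print whatever the FederatedLearningJob resource reports under .status.
# The content of that section is an assumption and may be empty.
fl_job_status() {
    job="$1"
    namespace="${2:-default}"
    ${KUBECTL:-kubectl} get federatedlearningjob "${job}" -n "${namespace}" \
        -o jsonpath='{.status}'
}

# Run on an edge node: succeeds once the model directory exists and is non-empty.
model_ready() {
    dir="${1:-/model}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}
```

For example, fl_job_status surface-defect-detection can be run on the cloud node, and model_ready && echo "model files present" on $EDGE1_NODE or $EDGE2_NODE.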

Sergiossrr · 2023-10-13T02:48:28Z

Thanks for your reply @JoeyHwong-gk. Besides, I have two questions.

Q1: Which version is recommended for the train and aggregation images? I use isula instead of docker. After pulling the images directly with isula pull images:v0.3.0, the pods' status is OutOfMemory, and after pulling v0.5.0, the pods' status is CrashLoopBackOff.
Q2: How can I check that the job is running? In other words, how can I check the logs while the job is running? Or is the only option to wait until the model's weights appear in the /model directory?

JoeyHwong-gk · 2023-10-13T06:19:09Z

Q1: Which version is recommended for the train and aggregation images? I use isula instead of docker. After pulling the images directly with isula pull images:v0.3.0, the pods' status is OutOfMemory, and after pulling v0.5.0, the pods' status is CrashLoopBackOff.
Q2: How can I check that the job is running? In other words, how can I check the logs while the job is running? Or is the only option to wait until the model's weights appear in the /model directory?

For Q1:
I cannot confirm the cause of the issues without more information. The containers are automatically pushed to Docker Hub, and all versions should be available. However, I strongly recommend using the latest version, v0.5.1, as it might contain crucial bug fixes and improvements.

For Q2:
Certainly, you can check the running logs from any node (both the cloud-side server and the edge nodes) without waiting for completion. The logs can provide real-time information about the job's progress and any errors that might occur during execution. You don't need to wait until the model's weights appear in the /model directory to access the logs.

Feel free to review the logs for insights into the job's status and any potential issues. If you encounter specific errors in the logs, please provide those details for further assistance.
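
A minimal sketch of the log check described above, assuming kubectl is pointed at the cluster. The helper names fl_pods and fl_logs and the KUBECTL override variable are made up for illustration and are not part of Sedna. The label keys and the train value come from the train pod manifest in this thread; any other worker-type value would be an assumption.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Sedna. KUBECTL can be overridden for dry runs.

# List the pods that belong to a FederatedLearningJob, selected by label.
fl_pods() {
    job="$1"
    ${KUBECTL:-kubectl} get pods -o wide \
        -l "federatedlearningjob.sedna.io/name=${job}"
}

# Follow the logs of the job's workers of one type ("train" is seen in this thread).
fl_logs() {
    job="$1"
    worker_type="${2:-train}"
    ${KUBECTL:-kubectl} logs -f --tail=50 \
        -l "federatedlearningjob.sedna.io/name=${job},federatedlearningjob.sedna.io/worker-type=${worker_type}"
}
```

For example, fl_pods surface-defect-detection lists the job's pods with their node and status, and fl_logs surface-defect-detection follows the train workers' logs. Selecting by label avoids copying the generated pod name suffix.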

JoeyHwong-gk · 2023-10-13T06:22:07Z

/assign @jaypume

Sergiossrr · 2023-10-18T09:00:39Z

Thanks for your reply @JoeyHwong-gk @jaypume.

For Q1:
The train pods still do not work.
When I pull images:v0.5.1, the error message is "fetch and parse manifest failed". After switching to v0.4.0, the aggregation pod is running, but the train pod fails to start and its status is ImagePullBackOff.

Relevant useful information is as follows:
kubectl logs surface-defect-detection-train-llw2c
container "train-worker" in pod "surface-defect-detection-train-llw2c" is waiting to start: trying and failing to pull image

On EDGE1_NODE, isula ps

CONTAINER ID    IMAGE                                                                   COMMAND                 CREATED         STATUS          PORTS   NAMES                           
72a491b1f348    kubeedge/pause:3.1                                                      "/pause"                27 minutes ago  Up 27 minutes           k8s_POD_surface-defect-detection-train-n6t5n_default_b6c0235c-f382-4740-a22a-1974910049d0_0
d8131b19f7df    f64c26f478a3d054ef86062baf4692e3188297632a5487c70d1ff03398d73a1c        "sedna-lc"              4 hours ago     Up 4 hours              k8s_lc_lc-hwxx9_sedna_f59e93da-c6cd-41c6-8a52-581a43dee5f7_0
61209b5d9b6a    kubeedge/pause:3.1                                                      "/pause"                4 hours ago     Up 4 hours              k8s_POD_lc-hwxx9_sedna_f59e93da-c6cd-41c6-8a52-581a43dee5f7_0
5bee722df55d    f51171e9ee03b0367d8914f615a3b46583411b4dfc6d4f09011c90aea02845f5        "edgemesh-agent"        4 hours ago     Up 4 hours              k8s_edgemesh-agent_edgemesh-agent-w79rf_kubeedge_25770be1-8420-4a19-8add-e80b8bf98f38_0
d7f50545bb10    kubeedge/pause:3.1                                                      "/pause"                4 hours ago     Up 4 hours              k8s_POD_edgemesh-agent-w79rf_kubeedge_25770be1-8420-4a19-8add-e80b8bf98f38_0
eb111d22af29    a6c0cb5dbd21197123942b3469a881f936fd7735f2dc9a22763b6f777f24345e        "/opt/bin/flanneld..."  7 hours ago     Up 7 hours              k8s_kube-flannel-edge_kube-flannel-edge-ds-4xgjx_kube-flannel_f43bd8eb-171c-4882-b05a-a1de33b9bdc0_7
2a96a3e38f14    kubeedge/pause:3.1                                                      "/pause"                7 hours ago     Up 7 hours              k8s_POD_kube-flannel-edge-ds-4xgjx_kube-flannel_f43bd8eb-171c-4882-b05a-a1de33b9bdc0_0
e4598be3a11c    5dade4ce550b85d4a56054bc8d74e72350f46613129145c28dd7fa39ccf2c6be        "/docker-entrypoin..."  10 hours ago    Up 10 hours             k8s_mqtt_mqtt-kubeedge_default_d2c774f6-c412-4a38-8aba-08f953c2009c_0
a2ceb1ec5118    kubeedge/pause:3.1                                                      "/pause"                10 hours ago    Up 10 hours             k8s_POD_mqtt-kubeedge_default_d2c774f6-c412-4a38-8aba-08f953c2009c_0

And the output of kubectl edit pod surface-defect-detection-train-llw2c:

apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: "2023-10-17T15:48:13Z"
  generateName: surface-defect-detection-train-
  labels:
    federatedlearningjob.sedna.io/name: surface-defect-detection
    federatedlearningjob.sedna.io/uid: 0b8097ad-83cd-4be6-afc8-ec1b27a60b3e
    federatedlearningjob.sedna.io/worker-type: train
  name: surface-defect-detection-train-llw2c
  namespace: default
  ownerReferences:
  - apiVersion: sedna.io/v1alpha1
    blockOwnerDeletion: true
    controller: true
    kind: FederatedLearningJob
    name: surface-defect-detection
    uid: 0b8097ad-83cd-4be6-afc8-ec1b27a60b3e
  resourceVersion: "71706"
  uid: 9008081f-f045-4c28-8eb4-e483eff2be87
spec:
  containers:
  - env:
    - name: batch_size
      value: "32"
    - name: learning_rate
      value: "0.001"
    - name: epochs
      value: "2"
    - name: DATA_PATH_PREFIX
      value: /home/data
    - name: PARTICIPANTS_COUNT
      value: "2"
    - name: NAMESPACE
      value: default
    - name: MODEL_NAME
      value: surface-defect-detection-model
    - name: DATASET_NAME
      value: edge2-surface-defect-detection-dataset
    - name: LC_SERVER
      value: http://localhost:9100
    - name: AGG_PORT
      value: "7363"
    - name: AGG_IP
      value: surface-defect-detection-aggregation.default
    - name: WORKER_NAME
      value: trainworker-c98rq
    - name: TRAIN_DATASET_URL
      value: /home/data/data/2.txt
    - name: JOB_NAME
      value: surface-defect-detection
    - name: TRANSMITTER
      value: ws
    - name: MODEL_URL
      value: /home/data/model
    image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0
    imagePullPolicy: IfNotPresent
    name: train-worker
    resources:
      limits:
        memory: 2Gi
      requests:
        memory: 2Gi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /home/data/
      name: sedna-default-volume-name
    - mountPath: /home/data/data/
      name: dataz
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-ffhfg
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostNetwork: true
  nodeName: kubenode2
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: OnFailure
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - hostPath:
      path: /
      type: Directory
    name: sedna-default-volume-name
  - hostPath:
      path: /data/
      type: Directory
    name: dataz
  - name: kube-api-access-ffhfg
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    message: 'containers with unready status: [train-worker]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    message: 'containers with unready status: [train-worker]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2023-10-17T14:50:20Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - image: kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0
    imageID: ""
    lastState: {}
    name: train-worker
    ready: false
    restartCount: 0
    started: false
    state:
      waiting:
        message: Back-off pulling image "kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0"
        reason: ImagePullBackOff
  hostIP: 192.168.xxx.xxx
  phase: Pending
  podIP: 192.168.xxx.xxx
  podIPs:
  - ip: 192.168.xxx.xxx
  qosClass: Burstable
  startTime: "2023-10-17T14:50:20Z"

JoeyHwong-gk · 2023-10-19T06:46:45Z

I apologize for the inconvenience you faced with pulling the images. It's possible that network issues caused the problem. As an alternative, I recommend trying to build the containers directly using the build_image.sh script. This way, you can bypass potential network-related problems and create the containers locally.
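
Before rebuilding, it may help to confirm what the kubelet reports and whether the image is already on the edge node. The sketch below is not an official Sedna procedure: pod_waiting_reason and image_present are hypothetical helpers, the jsonpath fields are the ones visible in the pod status above, and the parsing of isula images assumes a Docker-style listing with REPOSITORY and TAG as the first two columns. Because the pod uses imagePullPolicy: IfNotPresent, an image with exactly the same name and tag that is already in the node's local store should be used without a pull.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Sedna. KUBECTL can be overridden for dry runs.

# Print "<container>: <reason>: <message>" for each container status of a pod.
pod_waiting_reason() {
    pod="$1"
    namespace="${2:-default}"
    ${KUBECTL:-kubectl} get pod "${pod}" -n "${namespace}" \
        -o jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.state.waiting.reason}{": "}{.state.waiting.message}{"\n"}{end}'
}

# Run on the edge node: succeeds if <repository>:<tag> is in the local image store.
# Assumes a tagged image reference and a listing whose first two columns are
# REPOSITORY and TAG. Pass another runtime CLI as $2 if isula is not used.
image_present() {
    image="$1"
    runtime="${2:-isula}"
    repo="${image%:*}"   # everything before the last colon
    tag="${image##*:}"   # everything after the last colon
    "${runtime}" images |
        awk -v r="$repo" -v t="$tag" '$1 == r && $2 == t { found = 1 } END { exit !found }'
}
```

For example, pod_waiting_reason surface-defect-detection-train-llw2c on the cloud node, then image_present kubeedge/sedna-example-federated-learning-surface-defect-detection-train:v0.4.0 on the edge node where the pod is scheduled.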

dsj-kaiyue · 2023-12-01T13:04:26Z

Q1: Which version is recommended for the train and aggregation images? I use isula instead of docker. After pulling the images directly with isula pull images:v0.3.0, the pods' status is OutOfMemory, and after pulling v0.5.0, the pods' status is CrashLoopBackOff.
Q2: How can I check that the job is running? In other words, how can I check the logs while the job is running? Or is the only option to wait until the model's weights appear in the /model directory?

For Q1: I cannot confirm the cause of the issues without more information. The containers are automatically pushed to Docker Hub, and all versions should be available. However, I strongly recommend using the latest version, v0.5.1, as it might contain crucial bug fixes and improvements.

For Q2: Certainly, you can check the running logs from any node (both the cloud-side server and the edge nodes) without waiting for completion. The logs can provide real-time information about the job's progress and any errors that might occur during execution. You don't need to wait until the model's weights appear in the /model directory to access the logs.

Feel free to review the logs for insights into the job's status and any potential issues. If you encounter specific errors in the logs, please provide those details for further assistance.

Hello, I encountered the same issue while trying example 4: Collaboratively Train Yolo-v5 Using MistNet on the COCO128 Dataset. The pods on the edge node show CrashLoopBackOff, and when I run kubectl describe pods yolo-v5-train-897dd, the Events section shows none. I am using Docker image version v0.4.3. Can you tell me how to resolve this issue, please?
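
When the Events section is empty, the crashed container's own output is usually the next place to look. A minimal sketch, assuming kubectl logs can reach the edge pod in this setup; crash_logs and last_exit are hypothetical helper names, and the pod name is the one from the comment above.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Sedna. KUBECTL can be overridden for dry runs.

# Logs of the previous (crashed) container instance of a pod.
crash_logs() {
    pod="$1"
    namespace="${2:-default}"
    ${KUBECTL:-kubectl} logs "${pod}" -n "${namespace}" --previous --tail=100
}

# Reason and exit code recorded for the last terminated container instance,
# for example a reason of OOMKilled when the memory limit was exceeded.
last_exit() {
    pod="$1"
    namespace="${2:-default}"
    ${KUBECTL:-kubectl} get pod "${pod}" -n "${namespace}" \
        -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}{" "}{.status.containerStatuses[*].lastState.terminated.exitCode}'
}
```

For example, crash_logs yolo-v5-train-897dd followed by last_exit yolo-v5-train-897dd. Posting that output in the thread gives maintainers something concrete to diagnose.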

Sergiossrr added the kind/bug label Oct 12, 2023

kubeedge-bot assigned jaypume Oct 13, 2023
