[v1.19.x] hmem/synapseai: Refine the error handling and warning #9579

Merged: 1 commit, Nov 17, 2023

shijin-aws
Contributor

@shijin-aws shijin-aws commented Nov 16, 2023

backport #9523

Currently, synapseai_init returns FI_ENODATA when there is no
synapseai device on the platform. This causes warnings to be
printed on any non-synapseai platform. This patch improves
this by making synapseai_init return ENOSYS when all the API
calls succeed but zero devices are detected.

Signed-off-by: Shi Jin <[email protected]>
(cherry picked from commit f66e3f7)
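
To illustrate the convention described in the commit message, here is a minimal sketch in C. It is not the actual libfabric code: `sketch_hmem_init` and `sketch_report_init` are hypothetical names, the device probe is reduced to two integer inputs, and the sketch assumes that errors are returned as negative errno values and that the caller treats ENOSYS as "interface not present" without printing a warning.

```c
#include <errno.h>
#include <stdio.h>

/*
 * Hypothetical init routine (illustration only, not the real synapseai_init).
 * api_status:  0 if every device-query call succeeded, negative errno otherwise.
 * num_devices: number of devices reported by the (assumed) query.
 */
int sketch_hmem_init(int api_status, int num_devices)
{
    if (api_status != 0)
        return api_status;   /* a real failure: pass the error through */
    if (num_devices == 0)
        return -ENOSYS;      /* calls worked, but no hardware is present */
    return 0;                /* at least one device is usable */
}

/*
 * Hypothetical caller. Returns 1 if a warning was printed, 0 otherwise.
 * Assumption: -ENOSYS means "not supported here" and stays quiet.
 */
int sketch_report_init(const char *iface_name, int ret)
{
    if (ret == 0 || ret == -ENOSYS)
        return 0;
    fprintf(stderr, "warning: failed to initialize %s: %d\n", iface_name, ret);
    return 1;
}
```

Under these assumptions, `sketch_report_init("synapseai", sketch_hmem_init(0, 0))` prints nothing, while passing `-EIO` as the API status produces a warning. Only genuine failures are surfaced to users on platforms without the device.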
@shijin-aws shijin-aws requested a review from j-xiong November 16, 2023 18:12
@shijin-aws
Contributor Author

tcp provider test has a failure in AWS CI.

server_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.36.135 'timeout 360 /bin/bash --login -c '"'"'/home/ubuntu/PortaFiducia/build/libraries/libfabric/pr9579-debug/install/fabtests/bin/fi_cq_data -e rdm -o senddata -p tcp -s 172.31.36.135'"'"''

client_command: ssh -n -o StrictHostKeyChecking=no -o ConnectTimeout=30 -o BatchMode=yes 172.31.43.66 'timeout 360 /bin/bash --login -c '"'"'/home/ubuntu/PortaFiducia/build/libraries/libfabric/pr9579-debug/install/fabtests/bin/fi_cq_data -e rdm -o senddata -p tcp -s 172.31.43.66 172.31.36.135'"'"''
client_stdout:
Posting send with CQ data: 0x123456789abcdef
Done

client returncode: 0
server_stdout:
Waiting for CQ data from client
remote_cq_data: success
fi_cq_data_entry.len verify: success
timeout: the monitored command dumped core

server returncode: 255

Not related to this PR. Do you know if it is a known issue? @j-xiong

@j-xiong
Contributor

j-xiong commented Nov 16, 2023

@shijin-aws not aware of this. Is it failing for all the builds or just a specific one?

@shijin-aws
Contributor Author

Just a specific one.

@shijin-aws
Contributor Author

bot:aws:retest

@j-xiong
Contributor

j-xiong commented Nov 17, 2023

The ucx failure is a known issue and is unrelated.

@shijin-aws shijin-aws merged commit 19f79b7 into ofiwg:v1.19.x Nov 17, 2023
8 checks passed