Internal (high level) library error when submitting dml::mem_copy-task with too large byte size #51

Open
adbexy opened this issue Dec 10, 2024 · 0 comments

I get an unexpected internal library error when executing a dml::mem_copy whose byte size exceeds the max_transfer_size specified in the accel-config configuration file.

How to reproduce

(see below for system information)

  • Unpack the attached archive and cd into the contained directory.
  • Run sudo bash ./setup.sh to set up the environment.
  • Compile:
    cmake .
    cmake --build .
  • Run sudo ./program > result

For information on compile options, issued operations, execution path, etc., please refer to the attached source code.

Error description

The first few submissions, whose byte size is below the max_transfer_size set in config.conf, succeed. Every submission whose byte size is greater than max_transfer_size then fails with an internal library error (dml::status_code::error).
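To compare a submission size against the limit the driver actually applies, the value can be read back at runtime. The following is a minimal sketch, not part of the attached reproducer. It assumes the idxd driver exposes the work queue limit as a plain decimal number in a sysfs file such as /sys/bus/dsa/devices/wq0.0/max_transfer_size (the device and queue numbers are an assumption and depend on the configuration). The helper name read_max_transfer_size is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: reads one unsigned decimal value from a text file.
// The intended input is a work queue attribute file, for example
// /sys/bus/dsa/devices/wq0.0/max_transfer_size (path is an assumption).
// Returns false if the file cannot be opened or does not start with a number.
inline bool read_max_transfer_size(const std::string& path, std::uint64_t& out) {
    assert(!path.empty());
    std::ifstream in(path);
    std::uint64_t value = 0;
    if (!(in >> value)) {
        return false;  // missing file, no permission, or unexpected content
    }
    out = value;
    return true;
}
```

The returned value can then be logged next to each submitted byte size, which makes it easier to see which submissions cross the limit.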

According to https://intel.github.io/DML/documentation/api_docs/high_level_api.html#operation-status-values, I would expect a more specific status such as dml::status_code::bad_size (Invalid byte size was specified).
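The expected behavior can be illustrated with a caller-side guard that runs before each submission. This is only a sketch of the idea and not how DML is implemented. The enum sketch_status is a hypothetical stand-in for the two relevant dml::status_code values, and max_transfer_size is assumed to be the limit configured for the work queue.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the relevant status values; this is not DML's enum.
enum class sketch_status { ok, bad_size };

// Reports bad_size when the requested byte size exceeds the work queue limit,
// so an oversized request is rejected with a specific status before submission.
inline sketch_status check_transfer_size(std::size_t byte_size,
                                         std::uint64_t max_transfer_size) {
    assert(max_transfer_size > 0);  // a zero limit would reject every request
    return byte_size > max_transfer_size ? sketch_status::bad_size
                                         : sketch_status::ok;
}
```

A caller would run this check first and only submit the dml::mem_copy when it returns sketch_status::ok.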

Use Case

Make the library more robust against this type of error, and make debugging easier. It is not intuitive to re-check the size of submissions when the API returns an internal error, especially when there is a more appropriate error like dml::status_code::bad_size.
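Until the library reports a more specific status, one possible workaround on the caller side is to split a large copy into pieces that each fit within max_transfer_size. The sketch below assumes this is acceptable for the workload. The submit parameter is a hypothetical callable standing in for one dml::mem_copy submission, and memcpy_submit is a CPU-only placeholder so that the example runs without a DSA device.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Outcome of a chunked copy: overall success and number of chunks submitted.
struct chunked_result {
    bool ok;
    std::size_t chunks;
};

// Copies `size` bytes in pieces of at most `max_transfer_size` bytes.
// `submit(dst, src, n)` is a hypothetical callable that performs one copy and
// returns true on success; in real code it would wrap one DML submission.
template <typename Submit>
chunked_result chunked_copy(std::uint8_t* dst, const std::uint8_t* src,
                            std::size_t size, std::size_t max_transfer_size,
                            Submit submit) {
    assert(max_transfer_size > 0);
    chunked_result result{true, 0};
    for (std::size_t offset = 0; offset < size;) {
        const std::size_t n = std::min(max_transfer_size, size - offset);
        if (!submit(dst + offset, src + offset, n)) {
            result.ok = false;  // stop at the first failed chunk
            break;
        }
        ++result.chunks;
        offset += n;
    }
    return result;
}

// CPU placeholder for a single submission (assumption: stands in for DML).
inline bool memcpy_submit(std::uint8_t* dst, const std::uint8_t* src,
                          std::size_t n) {
    std::memcpy(dst, src, n);
    return true;
}
```

This only hides the problem on the caller side. It does not replace a clearer status code from the library.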

System Information

OS Info

OS name

PRETTY_NAME="Debian GNU/Linux trixie/sid"
NAME="Debian GNU/Linux"
VERSION_CODENAME=trixie
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Kernel version

6.6.13-amd64

accel-config version

4.1.3.git71676025

CPU model

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             192
On-line CPU(s) list:                0-191
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Xeon(R) CPU Max 9468
CPU family:                         6
Model:                              143

DML version

Latest (1.2.0), Date: 2024-12-09
Commit Hash: f59ed47

DSA_reproduce_error.zip
