Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

llama : suppress conversion from 'size_t' to 'int' #9046

Merged
merged 2 commits into from
Oct 16, 2024

Conversation

danbev
Copy link
Collaborator

@danbev danbev commented Aug 15, 2024

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC:

src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data

This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows.

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.
@ggerganov ggerganov merged commit 9e04102 into ggml-org:master Oct 16, 2024
52 checks passed
@ngxson
Copy link
Collaborator

ngxson commented Oct 16, 2024

Maybe I'm missing something here, but I don't really get why this change is needed. In fact, I find this make the code a bit less readable than before.

Basically we have this pattern in countless places in the code base: for (size_t i = 0; i < vector.size(); i++)

So why we only apply changes to this file and not the whole code base?

Also, if we want the safest route possible, maybe consider using the return type of vector.size() (a.k.a ::size_type)

@danbev
Copy link
Collaborator Author

danbev commented Oct 17, 2024

Maybe I'm missing something here, but I don't really get why this change is needed. In fact, I find this make the code a bit less readable than before.

The motivation for this change was when working on Windows (using MSVC) the number of compiler warnings generated was/is extensive, and at the time my goal was to try to minimize the warnings as much as possible. I'm no longer working a Windows system nor have access to one so it less of a priority for me now, but I do think that for those that are embedding llama.cpp it would be nice to reduce these warnings.

So why we only apply changes to this file and not the whole code base?

I was doing this in smaller/isolated changes as I was making this changes in between my other tasks and I was not able to dedicate a larger portion of time to this.

@ggerganov
Copy link
Member

So why we only apply changes to this file and not the whole code base?

Also, if we want the safest route possible, maybe consider using the return type of vector.size() (a.k.a ::size_type)

In this case, the try_add_bigram function expects signed integers:

https://github.com/ggerganov/llama.cpp/blob/66c2c93082289325199ae1f773f3b0ab2e399a47/src/llama-vocab.cpp#L292-L295

The alternative would be:

        for (size_t i = 1; i < symbols.size(); ++i) {
            try_add_bigram((int) i - 1, (int) i);
        }

Also, if we want the safest route possible, maybe consider using the return type of vector.size() (a.k.a ::size_type)

It's not always safe because one can easily underflow by mistake when using size_t. Overall, it's tricky and I haven't figured out the best practice for writing .size() loops. There always seem to be some deficiencies in each pattern.

@ngxson
Copy link
Collaborator

ngxson commented Oct 18, 2024

Ok thanks, that makes sense now.

I thought that i in the loop was used to access the array by index. If that was the case, it make no sense to change i from a size_t to int. But if it's consumed by try_add_bigram, then it's ok to change it here.

drollings pushed a commit to drollings/llama.cpp that referenced this pull request Oct 18, 2024
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
dsx1986 pushed a commit to dsx1986/llama.cpp that referenced this pull request Oct 29, 2024
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 15, 2024
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
arthw pushed a commit to arthw/llama.cpp that referenced this pull request Nov 18, 2024
* llama : suppress conversion from 'size_t' to 'int'

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the
following warnings that are generated on Windows when using MSVC:

```console
src\llama-vocab.cpp(211,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument':
    conversion from 'size_t' to 'int', possible loss of data
```

This is done by adding a cast for the size_t returned from
symbols.size(). I believe this is safe as it seems unlikely that
symbols, which stores an entry for each UTF8 character, would become
larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that
are currently generated when building on Windows.

* squash! llama : suppress conversion from 'size_t' to 'int'

Move cast into for loop.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants