llama : suppress conversion from 'size_t' to 'int' #9046

danbev · 2024-08-15T18:03:49Z

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC:

src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data
src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data

This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX.

The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows.

I have read the contributing guidelines
Self-reported review complexity:
- Low
- Medium
- High

This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC: ```console src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data ``` This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX. The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows.

src/llama-vocab.cpp

Move cast into for loop.

ngxson · 2024-10-16T21:19:23Z

Maybe I'm missing something here, but I don't really get why this change is needed. In fact, I find this make the code a bit less readable than before.

Basically we have this pattern in countless places in the code base: for (size_t i = 0; i < vector.size(); i++)

So why we only apply changes to this file and not the whole code base?

Also, if we want the safest route possible, maybe consider using the return type of vector.size() (a.k.a ::size_type)

danbev · 2024-10-17T05:45:44Z

Maybe I'm missing something here, but I don't really get why this change is needed. In fact, I find this make the code a bit less readable than before.

The motivation for this change was when working on Windows (using MSVC) the number of compiler warnings generated was/is extensive, and at the time my goal was to try to minimize the warnings as much as possible. I'm no longer working a Windows system nor have access to one so it less of a priority for me now, but I do think that for those that are embedding llama.cpp it would be nice to reduce these warnings.

So why we only apply changes to this file and not the whole code base?

I was doing this in smaller/isolated changes as I was making this changes in between my other tasks and I was not able to dedicate a larger portion of time to this.

ggerganov · 2024-10-17T07:33:44Z

So why we only apply changes to this file and not the whole code base?

Also, if we want the safest route possible, maybe consider using the return type of vector.size() (a.k.a ::size_type)

In this case, the try_add_bigram function expects signed integers:

https://github.com/ggerganov/llama.cpp/blob/66c2c93082289325199ae1f773f3b0ab2e399a47/src/llama-vocab.cpp#L292-L295

The alternative would be:

        for (size_t i = 1; i < symbols.size(); ++i) {
            try_add_bigram((int) i - 1, (int) i);
        }

Also, if we want the safest route possible, maybe consider using the return type of vector.size() (a.k.a ::size_type)

It's not always safe because one can easily underflow by mistake when using size_t. Overall, it's tricky and I haven't figured out the best practice for writing .size() loops. There always seem to be some deficiencies in each pattern.

ngxson · 2024-10-18T14:10:54Z

Ok thanks, that makes sense now.

I thought that i in the loop was used to access the array by index. If that was the case, it make no sense to change i from a size_t to int. But if it's consumed by try_add_bigram, then it's ok to change it here.

* llama : suppress conversion from 'size_t' to 'int' This commit updates llm_tokenizer_spm.tokenize to suppress/remove the following warnings that are generated on Windows when using MSVC: ```console src\llama-vocab.cpp(211,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data src\llama-vocab.cpp(517,1): warning C4267: 'argument': conversion from 'size_t' to 'int', possible loss of data ``` This is done by adding a cast for the size_t returned from symbols.size(). I believe this is safe as it seems unlikely that symbols, which stores an entry for each UTF8 character, would become larger than INT_MAX. The motivation for this change is to reduce the number of warnings that are currently generated when building on Windows. * squash! llama : suppress conversion from 'size_t' to 'int' Move cast into for loop.

ggerganov reviewed Aug 16, 2024

View reviewed changes

src/llama-vocab.cpp Outdated Show resolved Hide resolved

squash! llama : suppress conversion from 'size_t' to 'int'

ed4fcf9

Move cast into for loop.

ggerganov approved these changes Oct 16, 2024

View reviewed changes

ggerganov merged commit 9e04102 into ggml-org:master Oct 16, 2024
52 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

llama : suppress conversion from 'size_t' to 'int' #9046

llama : suppress conversion from 'size_t' to 'int' #9046

danbev commented Aug 15, 2024

ngxson commented Oct 16, 2024

danbev commented Oct 17, 2024

ggerganov commented Oct 17, 2024

ngxson commented Oct 18, 2024

llama : suppress conversion from 'size_t' to 'int' #9046

llama : suppress conversion from 'size_t' to 'int' #9046

Conversation

danbev commented Aug 15, 2024

ngxson commented Oct 16, 2024

danbev commented Oct 17, 2024

ggerganov commented Oct 17, 2024

ngxson commented Oct 18, 2024