LM Enforcer causes hung generation and what is the Sampler setting #486 #110

Open
remichu-ai opened this issue Jun 2, 2024 · 5 comments

remichu-ai commented Jun 2, 2024

I have been using LM Enforcer for a while for function calling with exllamav2, and once in a while it causes the exllama generation to hang.

Previously I just attributed it to the model not being smart enough for function calling. However, I can now reliably reproduce the issue with a specific prompt and model. The strange thing is that the generation without lm enforcer is correct:

The prompt:
````
conversation....
coordinator_agent:

```json
````

Correct result without using lm enforcer, just normal generation:

```json
{
  "functions_calling": [
    {
      "reason": "The manager_agent has confirmed that they can speak English, which addresses the user's question directly.",
      "name": "QuestionAnswered",
      "arguments": {
        "question_answered": "True"
      }
    }
  ]
}
```

Could it be due to my sampler settings?

```python
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.1
settings.top_k = 50
settings.top_p = 0.9
settings.min_p = 0.06
settings.token_repetition_penalty = 1.01
settings.temperature_last = False
```

The model above is wizardlm 8x22b, which is quite good at function calling; as can be seen, the raw response without lm enforcer is correct.

Any advice is appreciated. Currently, I suspect it has to do with the sampling settings, as in most generations I get a correct function_calling response with the same lm enforcer setup.

noamgat (Owner) commented Jun 3, 2024

Interesting. I wonder if it could be related to whether the token filtering in exllamav2 happens before or after the softmax is applied to the logits. If it is applied afterwards, the min_p requirement may prove impossible to satisfy in some cases - if all of the tokens with p > 0.06 after the softmax are marked illegal by LMFE.

By hang, do you mean freeze, or crash?
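
(A minimal illustrative sketch of the min_p ordering concern above, using made-up logits and following the comment's simplified reading of min_p as a plain probability cutoff; this is not exllamav2's actual sampling code:)

```python
# Toy example: min_p computed on the full-vocabulary softmax, with the LMFE
# mask applied afterwards. Every token that clears min_p is illegal, so
# nothing survives. All numbers here are invented for illustration.
import torch

logits = torch.tensor([8.0, 7.5, 2.0, 1.5, 1.0])  # toy 5-token vocabulary
allowed = [2, 3]                                   # token ids LMFE would permit

probs = torch.softmax(logits, dim=-1)              # ~[0.62, 0.38, 0.002, 0.001, 0.0006]
min_p_mask = probs > 0.06                          # min_p as a simple cutoff, per the comment
legal_mask = torch.zeros_like(probs, dtype=torch.bool)
legal_mask[allowed] = True

print((min_p_mask & legal_mask).any())             # tensor(False): no sampleable token left
```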

remichu-ai (Author) commented:

By hang I mean the terminal just kind of stops responding. I can't kill it with Ctrl+C but need to force close the terminal. If you have any idea on what I can try, just let me know.

The strange thing is that without lm enforcer, it generates the function calling correctly, meaning the needed tokens are also the most likely tokens. However, it hangs when I add lm enforcer.

In addition, the same code works when I am using another model (same prompt, same lm enforcer), so I don't know whether it is an issue with the way I defined the lm enforcer or not.

Currently the way I define the lm enforcer is a bit long-winded:

  • Create pydantic models of the functions to call, dynamically based on the list of functions
  • Put all of them into one parent pydantic object, e.g. function_calling: Union[the list of pydantic models]
  • Get the schema of this parent model
  • Replace all the $ref entries in the schema with the actual referenced objects so the schema doesn't contain any $ref
  • Create the lm enforcer from this schema

Do you see any problem with how I create the token enforcer? (A rough sketch of this flow is shown below.)
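
For reference, a rough sketch of the flow described above, not the issue author's actual code: the two function models (QuestionAnswered, AskFollowUp) are hypothetical stand-ins for the dynamically generated ones, and it assumes pydantic v2 plus lm-format-enforcer's JsonSchemaParser.

```python
from typing import List, Union

from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel, create_model


# Hypothetical function models standing in for the dynamically created ones.
class QuestionAnswered(BaseModel):
    reason: str
    name: str = "QuestionAnswered"
    arguments: dict


class AskFollowUp(BaseModel):
    reason: str
    name: str = "AskFollowUp"
    arguments: dict


# Steps 1-2: wrap all function models into one parent pydantic object.
FunctionCallRequest = create_model(
    "FunctionCallRequest",
    functions_calling=(List[Union[QuestionAnswered, AskFollowUp]], ...),
)

# Step 3: get the JSON schema of the parent model (pydantic v2).
schema = FunctionCallRequest.model_json_schema()


# Step 4: replace every $ref with the referenced definition so the schema
# is fully inlined (assumes no recursive models).
def inline_refs(node, defs):
    if isinstance(node, dict):
        if "$ref" in node:
            name = node["$ref"].split("/")[-1]      # e.g. "#/$defs/QuestionAnswered"
            return inline_refs(defs[name], defs)
        return {k: inline_refs(v, defs) for k, v in node.items() if k != "$defs"}
    if isinstance(node, list):
        return [inline_refs(v, defs) for v in node]
    return node


flat_schema = inline_refs(schema, schema.get("$defs", {}))

# Step 5: build the enforcer's parser from the flattened schema.
parser = JsonSchemaParser(flat_schema)
```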

remichu-ai (Author) commented:

I streamed the tokens to see what happens under the hood, and it turns out the hang just means the generation is very, very slow under certain prompts.

I tracked the length of the pass_tokens list and it hit almost 32k possible tokens. Do you have a recommendation for the generation settings in this case?

Also, I thought top_k would kick in and limit it to the top 50 tokens?
sampler.py

```python
        if len(filters) > 0:

            pass_tokens = None
            end_tokens = None
            for f in filters:

                pt, et = f.next()
                if pt is not None: pass_tokens = pt if pass_tokens is None else pass_tokens & pt
                if et is not None: end_tokens = et if end_tokens is None else end_tokens | et

            pass_tokens_list = list(pass_tokens)
            pass_tokens_list_len = len(pass_tokens_list)       # 31855 possible tokens
            pass_token_text = []
            # print(type(tokenizer.id_to_piece_with_special))
            for tok_id in pass_tokens_list:
                # print(tok_id)
                # print(type(tok_id))
                # print(tokenizer.id_to_piece_with_special[tok_id])
                pass_token_text.append(tokenizer.id_to_piece_with_special[tok_id])
```
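
(For context, a rough, library-agnostic sketch, not exllamav2's actual code, of why top_k does not shrink this step: the filter's full pass set has to be turned into a mask before sampling, and top_k only narrows the choice afterwards:)

```python
# Sketch: the filter mask is built from the whole allowed set first; top_k is
# applied to the already-masked logits. A ~32k-token pass set therefore still
# has to be materialized and applied on every step. Names here are illustrative.
import torch


def sample_with_filter(logits: torch.Tensor, pass_tokens: set, top_k: int = 50) -> int:
    """logits: (vocab_size,) raw next-token scores; pass_tokens: ids allowed by the filter."""
    vocab_size = logits.shape[-1]

    # 1. Mask everything the filter disallows - this touches the entire pass set.
    mask = torch.full((vocab_size,), float("-inf"))
    idx = torch.tensor(sorted(pass_tokens), dtype=torch.long)
    mask[idx] = 0.0
    filtered = logits + mask

    # 2. Only now does top_k apply, on the masked logits.
    k = min(top_k, idx.numel())
    top_vals, top_idx = torch.topk(filtered, k)
    probs = torch.softmax(top_vals, dim=-1)
    return int(top_idx[torch.multinomial(probs, 1)])
```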

thigger commented Jun 6, 2024

I think I am experiencing the same bug: TabbyAPI, Exllama 0.1.4, A6000 48GB, Command-R, Windows 10
Using the json_schema option in the API, temperature 0.0
I've not had issues using other models (Mixtral 8x7b, Phi-3-medium-128k) but generation appears to hang occasionally when using json_schema with Command-R. I can't see exactly what's happening but the CUDA use on the GPU drops to zero except for a brief blip (up to 10%) every ~50 seconds.

My case is similar in that the model produces valid JSON if json_schema is not enforced.
I'm assuming this is likely to be an lm enforcer issue but appreciate that it could be at other levels!
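
(For reproduction purposes, a minimal sketch of the kind of request described above. It is not a verified TabbyAPI call: it assumes the OpenAI-compatible /v1/chat/completions endpoint and that json_schema is accepted as a top-level request field as the comment suggests; the host, key, and schema are placeholders.)

```python
import requests

# Placeholder schema; the real one would describe the expected JSON output.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",        # placeholder host/port
    headers={"Authorization": "Bearer YOUR_API_KEY"},    # placeholder key
    json={
        "messages": [{"role": "user", "content": "Reply in JSON."}],
        "temperature": 0.0,
        "json_schema": schema,   # the option referred to in the comment
    },
    timeout=600,
)
print(resp.json())
```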

remichu-ai (Author) commented:

You can also refer to the issue I created on the exllamav2 GitHub. I closed that issue, but it seems there is no solution at the moment.
