LM Enforcer causes hung generation and what is the Sampler setting #486 #110

Open
remichu-ai opened this issue Jun 2, 2024 · 5 comments

remichu-ai commented Jun 2, 2024

I have been using LM Enforcer for a while for function calling with exllamav2, and once in a while it causes the exllama generation to hang.

Previously I just attributed it to the model not being smart enough for function calling. However, I can now reliably reproduce the issue with a specific prompt and model. The strange thing is that the generation without lm enforcer is correct:

The prompt:
````
conversation....
coordinator_agent:

```json
````

Correct result without using lm enforcer, just normal generation:

```json
{
  "functions_calling": [
    {
      "reason": "The manager_agent has confirmed that they can speak English, which addresses the user's question directly.",
      "name": "QuestionAnswered",
      "arguments": {
        "question_answered": "True"
      }
    }
  ]
}
```

Could it be due to my sampler settings?

```python
settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.1
settings.top_k = 50
settings.top_p = 0.9
settings.min_p = 0.06
settings.token_repetition_penalty = 1.01
settings.temperature_last = False
```

The model above is wizardlm 8x22b, which is quite good at function calling; as can be seen, the raw response without lm enforcer is correct.

Any advice is appreciated. Currently, I suspect it has to do with the sampling settings, as in most generations I get a correct function_calling response with the same lm enforcer setup.

noamgat (Owner) commented Jun 3, 2024

Interesting. I wonder if it could be related to whether the token filtering in exllamav2 happens before or after the softmax is applied to the logits. If it is applied afterwards, the min_p requirement may prove impossible to satisfy in some cases - if all of the tokens with p > 0.06 after the softmax are marked illegal by LMFE.

By hang, do you mean freeze, or crash?
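
(A minimal illustrative sketch of the min_p ordering concern above, using made-up logits and following the comment's simplified reading of min_p as a plain probability cutoff; this is not exllamav2's actual sampling code:)

```python
# Toy example: min_p computed on the full-vocabulary softmax, with the LMFE
# mask applied afterwards. Every token that clears min_p is illegal, so
# nothing survives. All numbers here are invented for illustration.
import torch

logits = torch.tensor([8.0, 7.5, 2.0, 1.5, 1.0])  # toy 5-token vocabulary
allowed = [2, 3]                                   # token ids LMFE would permit

probs = torch.softmax(logits, dim=-1)              # ~[0.62, 0.38, 0.002, 0.001, 0.0006]
min_p_mask = probs > 0.06                          # min_p as a simple cutoff, per the comment
legal_mask = torch.zeros_like(probs, dtype=torch.bool)
legal_mask[allowed] = True

print((min_p_mask & legal_mask).any())             # tensor(False): no sampleable token left
```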

remichu-ai (Author) commented:

By hang I mean the terminal just kind of stops responding. I can't kill it with Ctrl+C but need to force close the terminal. If you have any idea on what I can try, just let me know.

The strange thing is that without lm enforcer, it generates the function calling correctly, meaning the needed tokens are also the most likely tokens. However, it hangs when I add lm enforcer.

In addition, the same code works when I am using another model (same prompt, same lm enforcer), so I don't know whether it is an issue with the way I defined the lm enforcer or not.

Currently the way I define the lm enforcer is a bit long-winded:

  • Create pydantic models of the functions to call, dynamically based on the list of functions
  • Put all of them into one parent pydantic object, e.g. function_calling: Union[the list of pydantic models]
  • Get the schema of this parent model
  • Replace all the $ref entries in the schema with the actual referenced objects so the schema doesn't contain any $ref
  • Create the lm enforcer from this schema

Do you see any problem with how I create the token enforcer? (A rough sketch of this flow is shown below.)
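
For reference, a rough sketch of the flow described above, not the issue author's actual code: the two function models (QuestionAnswered, AskFollowUp) are hypothetical stand-ins for the dynamically generated ones, and it assumes pydantic v2 plus lm-format-enforcer's JsonSchemaParser.

```python
from typing import List, Union

from lmformatenforcer import JsonSchemaParser
from pydantic import BaseModel, create_model


# Hypothetical function models standing in for the dynamically created ones.
class QuestionAnswered(BaseModel):
    reason: str
    name: str = "QuestionAnswered"
    arguments: dict


class AskFollowUp(BaseModel):
    reason: str
    name: str = "AskFollowUp"
    arguments: dict


# Steps 1-2: wrap all function models into one parent pydantic object.
FunctionCallRequest = create_model(
    "FunctionCallRequest",
    functions_calling=(List[Union[QuestionAnswered, AskFollowUp]], ...),
)

# Step 3: get the JSON schema of the parent model (pydantic v2).
schema = FunctionCallRequest.model_json_schema()


# Step 4: replace every $ref with the referenced definition so the schema
# is fully inlined (assumes no recursive models).
def inline_refs(node, defs):
    if isinstance(node, dict):
        if "$ref" in node:
            name = node["$ref"].split("/")[-1]      # e.g. "#/$defs/QuestionAnswered"
            return inline_refs(defs[name], defs)
        return {k: inline_refs(v, defs) for k, v in node.items() if k != "$defs"}
    if isinstance(node, list):
        return [inline_refs(v, defs) for v in node]
    return node


flat_schema = inline_refs(schema, schema.get("$defs", {}))

# Step 5: build the enforcer's parser from the flattened schema.
parser = JsonSchemaParser(flat_schema)
```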

remichu-ai (Author) commented:

I streamed the tokens to see what happens under the hood, and it turns out the hang just means the generation is very, very slow under certain prompts.

I tracked the length of the pass_tokens list and it hit almost 32k possible tokens. Do you have a recommendation for the generation settings in this case?

Also, I thought top_k would kick in and limit it to the top 50 tokens?
sampler.py

```python
        if len(filters) > 0:

            pass_tokens = None
            end_tokens = None
            for f in filters:

                pt, et = f.next()
                if pt is not None: pass_tokens = pt if pass_tokens is None else pass_tokens & pt
                if et is not None: end_tokens = et if end_tokens is None else end_tokens | et

            pass_tokens_list = list(pass_tokens)
            pass_tokens_list_len = len(pass_tokens_list)       # 31855 possible tokens
            pass_token_text = []
            # print(type(tokenizer.id_to_piece_with_special))
            for tok_id in pass_tokens_list:
                # print(tok_id)
                # print(type(tok_id))
                # print(tokenizer.id_to_piece_with_special[tok_id])
                pass_token_text.append(tokenizer.id_to_piece_with_special[tok_id])
```
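
(For context, a rough, library-agnostic sketch, not exllamav2's actual code, of why top_k does not shrink this step: the filter's full pass set has to be turned into a mask before sampling, and top_k only narrows the choice afterwards:)

```python
# Sketch: the filter mask is built from the whole allowed set first; top_k is
# applied to the already-masked logits. A ~32k-token pass set therefore still
# has to be materialized and applied on every step. Names here are illustrative.
import torch


def sample_with_filter(logits: torch.Tensor, pass_tokens: set, top_k: int = 50) -> int:
    """logits: (vocab_size,) raw next-token scores; pass_tokens: ids allowed by the filter."""
    vocab_size = logits.shape[-1]

    # 1. Mask everything the filter disallows - this touches the entire pass set.
    mask = torch.full((vocab_size,), float("-inf"))
    idx = torch.tensor(sorted(pass_tokens), dtype=torch.long)
    mask[idx] = 0.0
    filtered = logits + mask

    # 2. Only now does top_k apply, on the masked logits.
    k = min(top_k, idx.numel())
    top_vals, top_idx = torch.topk(filtered, k)
    probs = torch.softmax(top_vals, dim=-1)
    return int(top_idx[torch.multinomial(probs, 1)])
```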

thigger commented Jun 6, 2024

I think I am experiencing the same bug: TabbyAPI, Exllama 0.1.4, A6000 48GB, Command-R, Windows 10
Using the json_schema option in the API, temperature 0.0
I've not had issues using other models (Mixtral 8x7b, Phi-3-medium-128k) but generation appears to hang occasionally when using json_schema with Command-R. I can't see exactly what's happening but the CUDA use on the GPU drops to zero except for a brief blip (up to 10%) every ~50 seconds.

My case is similar in that the model produces valid JSON if json_schema is not enforced.
I'm assuming this is likely to be an lm enforcer issue but appreciate that it could be at other levels!
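
(For reproduction purposes, a minimal sketch of the kind of request described above. It is not a verified TabbyAPI call: it assumes the OpenAI-compatible /v1/chat/completions endpoint and that json_schema is accepted as a top-level request field as the comment suggests; the host, key, and schema are placeholders.)

```python
import requests

# Placeholder schema; the real one would describe the expected JSON output.
schema = {
    "type": "object",
    "properties": {"answer": {"type": "string"}},
    "required": ["answer"],
}

resp = requests.post(
    "http://localhost:5000/v1/chat/completions",        # placeholder host/port
    headers={"Authorization": "Bearer YOUR_API_KEY"},    # placeholder key
    json={
        "messages": [{"role": "user", "content": "Reply in JSON."}],
        "temperature": 0.0,
        "json_schema": schema,   # the option referred to in the comment
    },
    timeout=600,
)
print(resp.json())
```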

remichu-ai (Author) commented:

You can also refer to the issue I created on the exllamav2 GitHub. I closed that issue, but it seems there is no solution at the moment.
