Changing README table to markdown
noamgat committed Oct 17, 2023
1 parent 6f66a9c commit 11e3fd1
Showing 1 changed file with 17 additions and 16 deletions.
README.md: 33 changes (17 additions & 16 deletions)
@@ -128,21 +128,22 @@ Using this library guarantees that the output will match the format, but it does
To help you understand how aggressively the format enforcement steered the generation, if you pass ```output_scores=True``` and ```return_dict_in_generate=True``` in the ```kwargs``` to ```generate_enforced()``` (these are existing optional parameters in the ```transformers``` library), you will also get a token-by-token dataframe showing which token was selected, its score, and which token would have been chosen had the format enforcement not been applied. If you see that the format enforcer forced the language model to select tokens with very low weights, that is a likely contributor to poor results. Try modifying the prompt to guide the language model, so that the format enforcer does not have to be so aggressive.
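
For reference, a call passing these flags might look roughly like the following. This is a minimal sketch, not the library's documented usage: the import paths, the ```RegexParser``` name, and the exact ```generate_enforced()``` signature are assumptions here, the model id is a placeholder, and the regular expression is the one from the example below.

```python
# Sketch only: import paths, parser name, and the generate_enforced signature
# are assumptions; check the library documentation for the exact API.
from transformers import AutoModelForCausalLM, AutoTokenizer
from lmformatenforcer import RegexParser  # assumed import path
from lmformatenforcer.integrations.transformers import generate_enforced  # assumed import path

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

parser = RegexParser(r"Michael Jordan was Born in (\d)+.")
inputs = tokenizer("When was Michael Jordan born?", return_tensors="pt")

# output_scores / return_dict_in_generate are standard transformers generate()
# kwargs; with them, generate_enforced also returns the per-token diagnostics.
result = generate_enforced(
    model,
    tokenizer,
    parser,
    **inputs,
    max_new_tokens=30,
    output_scores=True,
    return_dict_in_generate=True,
)
```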

Example using the regular expression format ``` Michael Jordan was Born in (\d)+.```
```
generated_token generated_token_idx generated_score leading_token leading_token_idx leading_score
0 ▁ 29871 1.000000 ▁ 29871 1.000000
1 Michael 24083 0.000027 ▁Sure 18585 0.959473
2 ▁Jordan 18284 1.000000 ▁Jordan 18284 1.000000
3 ▁was 471 1.000000 ▁was 471 1.000000
4 ▁Born 19298 0.000008 ▁born 6345 1.000000
5 ▁in 297 0.994629 ▁in 297 0.994629
6 ▁ 29871 0.982422 ▁ 29871 0.982422
7 1 29896 1.000000 1 29896 1.000000
8 9 29929 1.000000 9 29929 1.000000
9 6 29953 1.000000 6 29953 1.000000
10 3 29941 1.000000 3 29941 1.000000
11 . 29889 0.999512 . 29889 0.999512
12 </s> 2 0.981445 </s> 2 0.981445
```

idx | generated_token | generated_token_idx | generated_score | leading_token | leading_token_idx | leading_score
:------------ | :-------------| :-------------| :------------- | :------------ | :-------------| :-------------
0 | ▁ | 29871 | 1.000000 | ▁ | 29871 | 1.000000
1 | Michael | 24083 | 0.000027 | ▁Sure | 18585 | 0.959473
2 | ▁Jordan | 18284 | 1.000000 | ▁Jordan | 18284 | 1.000000
3 | ▁was | 471 | 1.000000 | ▁was | 471 | 1.000000
4 | ▁Born | 19298 | 0.000008 | ▁born | 6345 | 1.000000
5 | ▁in | 297 | 0.994629 | ▁in | 297 | 0.994629
6 | ▁ | 29871 | 0.982422 | ▁ | 29871 | 0.982422
7 | 1 | 29896 | 1.000000 | 1 | 29896 | 1.000000
8 | 9 | 29929 | 1.000000 | 9 | 29929 | 1.000000
9 | 6 | 29953 | 1.000000 | 6 | 29953 | 1.000000
10 | 3 | 29941 | 1.000000 | 3 | 29941 | 1.000000
11 | . | 29889 | 0.999512 | . | 29889 | 0.999512
12 | ```</s>``` | 2 | 0.981445 | ```</s>``` | 2 | 0.981445


You can see that the model "wanted" to start the answer with ```Sure```, but the format enforcer forced it to use ```Michael``` - there was a big score gap at token 1. Afterwards, almost all of the leading tokens were within the allowed token set, meaning the model likely did not hallucinate due to the token forcing. The only exception was timestep 4 - " Born" was forced while the LLM wanted to choose "born". This is a hint for the prompt engineer to change the prompt to use a lowercase b instead.
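
To spot such overrides programmatically, a small helper along these lines can surface the timesteps where the enforcer forced a token the model scored very low. This is only a sketch: the column names come from the table above, but how you obtain the dataframe from the generation result, and the chosen threshold, are assumptions.

```python
import pandas as pd

def flag_forced_tokens(df: pd.DataFrame, threshold: float = 0.01) -> pd.DataFrame:
    """Return timesteps where the enforcer forced a token the model gave a very low score."""
    forced = df[df["generated_score"] < threshold]
    return forced[["generated_token", "generated_score", "leading_token", "leading_score"]]

# On the table above, this flags timesteps 1 ("Michael" vs "Sure") and
# 4 ("Born" vs "born") - exactly the places worth revisiting in the prompt.
```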
