
Lm eval upgraded to 0.4.7 #1692

Open
wants to merge 17 commits into main
Conversation

@12010486 (Contributor) commented Jan 10, 2025

Upgraded run_lm_eval.py

Results

SW: 1.19.0
Optimum Habana: 1.16.0dev
HW: Gaudi2

From Examples

  1. Evaluate Llama 7B on Gaudi on task PiQA, using the BF16 data type:
python run_lm_eval.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--use_hpu_graphs \
--use_kv_cache \
--bf16 \
--batch_size=1 \
--tasks piqa \
-o eval.json

Previous results:

"results": {
    "piqa": {
      "acc": 0.780739934711643,
      "acc_stderr": 0.009653357463605322,
      "acc_norm": 0.7861806311207835,
      "acc_norm_stderr": 0.009565994206915602
    }

This PR

"results": {
   "piqa": {
     "alias": "piqa",
     "acc,none": 0.780739934711643,
     "acc_stderr,none": 0.00965335746360534,
     "acc_norm,none": 0.7910772578890098,
     "acc_norm_stderr,none": 0.009485227030105051
   }
  2. Evaluate Llama 70B on 8 Gaudi2 cards on task WinoGrande, using the BF16 data type:
deepspeed --num_gpus 8 run_lm_eval.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--use_hpu_graphs \
--use_kv_cache \
--bf16 \
--batch_size=1 \
--tasks winogrande \
-o eval.json

Previous results:

"results": {
    "winogrande": {
      "acc": 0.8034727703235991,
      "acc_stderr": 0.011168120593569576
    }

This PR

"results": {
   "winogrande": {
     "alias": "winogrande",
     "acc,none": 0.7774269928966061,
     "acc_stderr,none": 0.011690933809712669
   }

From Running with FP8

  1. Example to measure the tensor quantization statistics on Llama-2-70b:
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \
--use_deepspeed --world_size 8 run_lm_eval.py \
-o acc_70b_bs1_measure.txt \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--use_flash_attention \
--flash_attention_recompute \
--bf16 \
--batch_size 1

Previous results:

"results": {
	"winogrande": {
      "acc": 0.7971586424625099,
      "acc_stderr": 0.011301439925936645
    },
	"piqa": {
      "acc": 0.8166485310119695,
      "acc_stderr": 0.0090282839846894,
      "acc_norm": 0.8226332970620239,
      "acc_norm_stderr": 0.00891219356474512
    },
	"lambada_openai": {
      "ppl": 2.6662092249764062,
      "ppl_stderr": 0.044583268655204886,
      "acc": 0.7915777217155056,
      "acc_stderr": 0.005658885722145787
    },
    "hellaswag": {
      "acc": 0.6526588329018124,
      "acc_stderr": 0.004751522127418457,
      "acc_norm": 0.8431587333200558,
      "acc_norm_stderr": 0.0036290784658809705
    } 

This PR

"results": {
    "hellaswag": {
      "alias": "hellaswag",
      "acc,none": 0.6470822545309699,
      "acc_stderr,none": 0.0047690075450822905,
      "acc_norm,none": 0.8378809002190799,
      "acc_norm_stderr,none": 0.0036780679944243455
    },
    "lambada_openai": {
      "alias": "lambada_openai",
      "perplexity,none": 2.6694269472300975,
      "perplexity_stderr,none": 0.04551856980551172,
      "acc,none": 0.7933242771201242,
      "acc_stderr,none": 0.0056413387353956
    },
    "piqa": {
      "alias": "piqa",
      "acc,none": 0.8204570184983678,
      "acc_stderr,none": 0.008954834329201175,
      "acc_norm,none": 0.8307943416757345,
      "acc_norm_stderr,none": 0.008747815763315788
    },
    "winogrande": {
      "alias": "winogrande",
      "acc,none": 0.7790055248618785,
      "acc_stderr,none": 0.01166122363764341
    }
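
One visible difference in the output format is that lm-eval 0.4.x adds an "alias" field and suffixes every metric with its filter name (",none" by default), so any code consuming eval.json has to read the new keys. A minimal sketch of reading both the old and the new key style (the file name eval.json simply follows the commands above):

import json

with open("eval.json") as f:
    piqa = json.load(f)["results"]["piqa"]

# 0.4.x keys carry a ",none" filter suffix; pre-upgrade outputs used bare metric names.
acc = piqa.get("acc,none", piqa.get("acc"))
acc_norm = piqa.get("acc_norm,none", piqa.get("acc_norm"))
print(acc, acc_norm)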

@12010486 requested a review from regisss as a code owner on January 10, 2025 11:30
@yafshar (Contributor) commented Jan 14, 2025

@12010486 please run make style and fix the formatting!

12010486 and others added 5 commits January 15, 2025 10:41
@12010486 (Contributor, Author) commented Jan 15, 2025

Hi @yafshar, apologies for the missing make style. From a formatting point of view, everything should now be in place. By the way, make style actually edited 51 more files, but I saw that Iman already took care of those in #1693. Should I merge his PR into mine now, or is it better to wait for it to be merged into main?

args: argparse.Namespace,
options: GenerationConfig,
) -> None:
super().__init__(pretrained=args.model_name_or_path, device=args.device)
@alexey-belyakov (Contributor) commented Jan 15, 2025
Here we initialize the base class with the pretrained model as a string, and HFLM instantiates that model. After that we just override the instance:
self._model = model
To me, this looks like overhead. We could try passing our model instance instead of a string, or use another class as a parent.

@yafshar (Contributor) commented Jan 16, 2025

I agree, it sounds like the parent constructor HFLM executes some extra code that is not necessary for Habana. Maybe instead of the full parent class HFLM(TemplateLM), only use TemplateLM:

from lm_eval.models.huggingface import TemplateLM

and instead of

super().__init__(pretrained=args.model_name_or_path, device=args.device)

use

TemplateLM.__init__(self)

Contributor
If you do this, you should also call _get_backend.

@12010486 (Contributor, Author) commented Jan 17, 2025
So, two options here, based on your input.

  1. Instead of a string (args.model_name_or_path), pass the model in the init, like:
    super().__init__(pretrained=model, device=args.device)
    This automatically excludes all the CUDA code, and it works on both 1 HPU and 8 HPUs. Downside: warnings during initialization.
  2. Change the parent class from HFLM to TemplateLM. This might take longer, since multiple functions would need to be redefined, but I'm working on this as well.

@yafshar, @alexey-belyakov, I'd appreciate any further input on this.
With either option, I can prune the code that may now be redundant. For now I'm pursuing both in parallel (see the sketch of option 1 below).
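
For illustration, option 1 might look roughly like the sketch below. The class name and the extra batch_size keyword are assumptions made for the sketch, not the exact code in this PR; HFLM in lm-eval 0.4.x does accept an already-instantiated transformers model as pretrained:

from lm_eval.models.huggingface import HFLM


class HabanaLMEval(HFLM):  # hypothetical class name, for illustration only
    def __init__(self, model, args, options):
        # Option 1: pass the already-instantiated HPU model instead of a string so that
        # HFLM does not re-instantiate it and the CUDA-specific setup paths are skipped.
        super().__init__(pretrained=model, device=args.device, batch_size=args.batch_size)
        self._options = options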

raise NotImplementedError()

def find_bucket(self, length):
def find_bucket(self, length: int) -> list[int]:
Contributor
Suggested change
def find_bucket(self, length: int) -> list[int]:
def find_bucket(self, length: int) -> int:
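
For context on the suggestion: the helper returns the single smallest bucket that fits the given length, hence the int return type. A minimal sketch of that behavior, assuming self.buckets is a sorted list of candidate sizes (an assumption for the sketch, not the actual implementation in the PR):

def find_bucket(self, length: int) -> int:
    # Return the smallest configured bucket size that can hold `length` tokens.
    # Assumes self.buckets is a sorted list of ints (illustrative assumption).
    return next(bucket for bucket in self.buckets if bucket >= length)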

json.dump(results, open(args.output_file, "w"), indent=2)
print(json.dumps(results, indent=2))

json.dump(
Contributor
Suggested change
json.dump(
json_str = json.dumps(results, indent=2, default=utils.handle_non_serializable, ensure_ascii=False)
with open(args.output_file, "w") as output_file:
    output_file.write(json_str)
if args.show_config:
    print(json_str)

@@ -86,22 +87,35 @@ def setup_lm_eval_parser():
default=["hellaswag", "lambada_openai", "piqa", "winogrande"],
)
parser.add_argument("--limit_iters", type=int, help="limit examples to run that many iterations", default=None)
parser.add_argument(
Contributor
There is https://github.com/EleutherAI/lm-evaluation-harness/blob/4c26a9c176ecfb40b369503ce211e356bb800fdc/lm_eval/evaluator.py#L388 in the interface. Is it similar to what you did here? If it is, maybe we can eliminate this and the related code?

Contributor Author
I believe we should look at simple_evaluate rather than evaluate, but in any case, in the latest version of the test harness there are several args related to what to print and how; see from this line onward: https://github.com/EleutherAI/lm-evaluation-harness/blob/4c26a9c176ecfb40b369503ce211e356bb800fdc/lm_eval/__main__.py/#L410

I took only one of them and tried to use it the same way it is expected to be used in the original repo, if I got that right. The real reason I added it is that the script now prints a lot of info, and I was not able to see the HPU-related information without this flag.
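
For reference, the flag under discussion is presumably registered along these lines (a sketch mirroring the harness's own --show_config CLI option; the exact help text and wiring in this PR may differ):

parser.add_argument(
    "--show_config",
    action="store_true",
    default=False,
    help="If set, print the full results dictionary, including the run configuration, at the end of evaluation.",
)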

3 participants