
Lm eval upgraded to 0.4.7 #1692

Open
wants to merge 17 commits into main
Conversation

@12010486 (Contributor) commented Jan 10, 2025

Upgraded run_lm_eval.py

Results

SW: 1.19.0
Optimum Habana: 1.16.0dev
HW: Gaudi2

From Examples

  1. Evaluate Llama 7B on Gaudi on task PiQA, using the BF16 data type:
python run_lm_eval.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--use_hpu_graphs \
--use_kv_cache \
--bf16 \
--batch_size=1 \
--tasks piqa \
-o eval.json

Previous results:

"results": {
    "piqa": {
      "acc": 0.780739934711643,
      "acc_stderr": 0.009653357463605322,
      "acc_norm": 0.7861806311207835,
      "acc_norm_stderr": 0.009565994206915602
    }

This PR

"results": {
   "piqa": {
     "alias": "piqa",
     "acc,none": 0.780739934711643,
     "acc_stderr,none": 0.00965335746360534,
     "acc_norm,none": 0.7910772578890098,
     "acc_norm_stderr,none": 0.009485227030105051
   }
  2. Evaluate Llama 70B on 8 Gaudi2 cards on task WinoGrande, using the BF16 data type:
deepspeed --num_gpus 8 run_lm_eval.py \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--use_hpu_graphs \
--use_kv_cache \
--bf16 \
--batch_size=1 \
--tasks winogrande \
-o eval.json

Previous results:

"results": {
    "winogrande": {
      "acc": 0.8034727703235991,
      "acc_stderr": 0.011168120593569576
    }

This PR

"results": {
   "winogrande": {
     "alias": "winogrande",
     "acc,none": 0.7774269928966061,
     "acc_stderr,none": 0.011690933809712669
   }

From Running with FP8

  1. Example to measure the tensor quantization statistics on Llama-2-70b:
QUANT_CONFIG=./quantization_config/maxabs_measure.json python ../gaudi_spawn.py \
--use_deepspeed --world_size 8 run_lm_eval.py \
-o acc_70b_bs1_measure.txt \
--model_name_or_path meta-llama/Llama-2-70b-hf \
--attn_softmax_bf16 \
--use_hpu_graphs \
--trim_logits \
--use_kv_cache \
--bucket_size=128 \
--bucket_internal \
--use_flash_attention \
--flash_attention_recompute \
--bf16 \
--batch_size 1

Previous results:

"results": {
	"winogrande": {
      "acc": 0.7971586424625099,
      "acc_stderr": 0.011301439925936645
    },
	"piqa": {
      "acc": 0.8166485310119695,
      "acc_stderr": 0.0090282839846894,
      "acc_norm": 0.8226332970620239,
      "acc_norm_stderr": 0.00891219356474512
    },
	"lambada_openai": {
      "ppl": 2.6662092249764062,
      "ppl_stderr": 0.044583268655204886,
      "acc": 0.7915777217155056,
      "acc_stderr": 0.005658885722145787
    },
    "hellaswag": {
      "acc": 0.6526588329018124,
      "acc_stderr": 0.004751522127418457,
      "acc_norm": 0.8431587333200558,
      "acc_norm_stderr": 0.0036290784658809705
    } 

This PR

"results": {
    "hellaswag": {
      "alias": "hellaswag",
      "acc,none": 0.6470822545309699,
      "acc_stderr,none": 0.0047690075450822905,
      "acc_norm,none": 0.8378809002190799,
      "acc_norm_stderr,none": 0.0036780679944243455
    },
    "lambada_openai": {
      "alias": "lambada_openai",
      "perplexity,none": 2.6694269472300975,
      "perplexity_stderr,none": 0.04551856980551172,
      "acc,none": 0.7933242771201242,
      "acc_stderr,none": 0.0056413387353956
    },
    "piqa": {
      "alias": "piqa",
      "acc,none": 0.8204570184983678,
      "acc_stderr,none": 0.008954834329201175,
      "acc_norm,none": 0.8307943416757345,
      "acc_norm_stderr,none": 0.008747815763315788
    },
    "winogrande": {
      "alias": "winogrande",
      "acc,none": 0.7790055248618785,
      "acc_stderr,none": 0.01166122363764341
    }
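
One visible difference in the output format is that lm-eval 0.4.x adds an "alias" field and suffixes every metric with its filter name (",none" by default), so any code consuming eval.json has to read the new keys. A minimal sketch of reading both the old and the new key style (the file name eval.json simply follows the commands above):

import json

with open("eval.json") as f:
    piqa = json.load(f)["results"]["piqa"]

# 0.4.x keys carry a ",none" filter suffix; pre-upgrade outputs used bare metric names.
acc = piqa.get("acc,none", piqa.get("acc"))
acc_norm = piqa.get("acc_norm,none", piqa.get("acc_norm"))
print(acc, acc_norm)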

@12010486 requested a review from regisss as a code owner on January 10, 2025 11:30
@yafshar (Contributor) commented Jan 14, 2025

@12010486 please run make style and fix the formatting!

12010486 and others added 5 commits January 15, 2025 10:41
@12010486 (Contributor, Author) commented Jan 15, 2025

Hi @yafshar, apologies for the missing make style. From a formatting point of view, everything should now be in place. By the way, make style actually edited 51 more files, but I saw that Iman already took care of those in #1693. Should I merge his PR into mine now, or is it better to wait for it to be merged into main?

args: argparse.Namespace,
options: GenerationConfig,
) -> None:
super().__init__(pretrained=args.model_name_or_path, device=args.device)
@alexey-belyakov (Contributor) commented Jan 15, 2025
Here we initialize the base class with the pretrained model as a string, and HFLM instantiates that model. After that we just override the instance:
self._model = model
To me, this looks like overhead. We could try passing our model instance instead of a string, or use another class as a parent.

@yafshar (Contributor) commented Jan 16, 2025

I agree, it sounds like the parent constructor HFLM executes some extra code that is not necessary for Habana. Maybe instead of the full parent class HFLM(TemplateLM), only use TemplateLM:

from lm_eval.models.huggingface import TemplateLM

and instead of

super().__init__(pretrained=args.model_name_or_path, device=args.device)

use

TemplateLM.__init__(self)

Contributor
If you do this, you should also call _get_backend.

@12010486 (Contributor, Author) commented Jan 17, 2025
So, two options here, based on your input.

  1. Instead of a string (args.model_name_or_path), pass the model in the init, like:
    super().__init__(pretrained=model, device=args.device)
    This automatically excludes all the CUDA code, and it works on both 1 HPU and 8 HPUs. Downside: warnings during initialization.
  2. Change the parent class from HFLM to TemplateLM. This might take longer, since multiple functions would need to be redefined, but I'm working on this as well.

@yafshar, @alexey-belyakov, I'd appreciate any further input on this.
With either option, I can prune the code that may now be redundant. For now I'm pursuing both in parallel (see the sketch of option 1 below).
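
For illustration, option 1 might look roughly like the sketch below. The class name and the extra batch_size keyword are assumptions made for the sketch, not the exact code in this PR; HFLM in lm-eval 0.4.x does accept an already-instantiated transformers model as pretrained:

from lm_eval.models.huggingface import HFLM


class HabanaLMEval(HFLM):  # hypothetical class name, for illustration only
    def __init__(self, model, args, options):
        # Option 1: pass the already-instantiated HPU model instead of a string so that
        # HFLM does not re-instantiate it and the CUDA-specific setup paths are skipped.
        super().__init__(pretrained=model, device=args.device, batch_size=args.batch_size)
        self._options = options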

raise NotImplementedError()

def find_bucket(self, length):
def find_bucket(self, length: int) -> list[int]:
Contributor
Suggested change
def find_bucket(self, length: int) -> list[int]:
def find_bucket(self, length: int) -> int:
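
For context on the suggestion: the helper returns the single smallest bucket that fits the given length, hence the int return type. A minimal sketch of that behavior, assuming self.buckets is a sorted list of candidate sizes (an assumption for the sketch, not the actual implementation in the PR):

def find_bucket(self, length: int) -> int:
    # Return the smallest configured bucket size that can hold `length` tokens.
    # Assumes self.buckets is a sorted list of ints (illustrative assumption).
    return next(bucket for bucket in self.buckets if bucket >= length)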

json.dump(results, open(args.output_file, "w"), indent=2)
print(json.dumps(results, indent=2))

json.dump(
Contributor
Suggested change
json.dump(
json_str = json.dumps(results, indent=2, default=utils.handle_non_serializable, ensure_ascii=False)
with open(args.output_file, "w") as output_file:
    output_file.write(json_str)
if args.show_config:
    print(json_str)

@@ -86,22 +87,35 @@ def setup_lm_eval_parser():
default=["hellaswag", "lambada_openai", "piqa", "winogrande"],
)
parser.add_argument("--limit_iters", type=int, help="limit examples to run that many iterations", default=None)
parser.add_argument(
Contributor
There is https://github.com/EleutherAI/lm-evaluation-harness/blob/4c26a9c176ecfb40b369503ce211e356bb800fdc/lm_eval/evaluator.py#L388 in the interface. Is it similar to what you did here? If it is, maybe we can eliminate this and the related code?

Contributor Author
I believe we should look at simple_evaluate rather than evaluate, but in any case, in the latest version of the test harness there are several args related to what to print and how; see from this line onward: https://github.com/EleutherAI/lm-evaluation-harness/blob/4c26a9c176ecfb40b369503ce211e356bb800fdc/lm_eval/__main__.py/#L410

I took only one of them and tried to use it the same way it is expected to be used in the original repo, if I got that right. The real reason I added it is that the script now prints a lot of info, and I was not able to see the HPU-related information without this flag.
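
For reference, the flag under discussion is presumably registered along these lines (a sketch mirroring the harness's own --show_config CLI option; the exact help text and wiring in this PR may differ):

parser.add_argument(
    "--show_config",
    action="store_true",
    default=False,
    help="If set, print the full results dictionary, including the run configuration, at the end of evaluation.",
)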

3 participants