Commit

Merge pull request #420 from vinjn/fix-371-enc-is-not-defined
Move enc to global namespace to fix #371
karpathy authored Feb 27, 2024
2 parents a022d02 + dccf362 commit 325be85
Showing 1 changed file with 2 additions and 1 deletion.
3 changes: 2 additions & 1 deletion data/openwebtext/prepare.py
@@ -16,6 +16,8 @@
 # it is better than 1 usually though
 num_proc_load_dataset = num_proc
 
+enc = tiktoken.get_encoding("gpt2")
+
 if __name__ == '__main__':
     # takes 54GB in huggingface .cache dir, about 8M documents (8,013,769)
     dataset = load_dataset("openwebtext", num_proc=num_proc_load_dataset)
@@ -38,7 +40,6 @@
     # })
 
     # we now want to tokenize the dataset. first define the encoding function (gpt2 bpe)
-    enc = tiktoken.get_encoding("gpt2")
     def process(example):
         ids = enc.encode_ordinary(example['text']) # encode_ordinary ignores any special tokens
         ids.append(enc.eot_token) # add the end of text token, e.g. 50256 for gpt2 bpe
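
The commit does not spell out why moving `enc` matters. One plausible reading, given as an assumption and not as the author's stated reasoning, is that worker processes started with the "spawn" method re-import the script under a name other than `__main__`, so anything assigned inside the `if __name__ == '__main__':` block never exists in the worker. The sketch below imitates that re-import with `importlib` instead of real multiprocessing. `fake_prepare`, `GLOBAL_ENC`, and `GUARDED_ENC` are hypothetical stand-ins, and plain strings replace the tiktoken encoder.

```python
import importlib.util
import tempfile
import textwrap
from pathlib import Path

# Hypothetical stand-in for prepare.py; strings replace the real encoder.
SCRIPT = textwrap.dedent('''
    GLOBAL_ENC = "module level"        # assigned on every import

    if __name__ == "__main__":
        GUARDED_ENC = "main guard"     # assigned only when run as a script
''')


def names_seen_on_import(source):
    """Import `source` as a non-main module and return the names it defines."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "fake_prepare.py"
        path.write_text(source)
        spec = importlib.util.spec_from_file_location("fake_prepare", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)  # __name__ is "fake_prepare" here
        return {name for name in vars(module) if not name.startswith("__")}


seen = names_seen_on_import(SCRIPT)
print("GLOBAL_ENC" in seen)    # True: module-level names survive the import
print("GUARDED_ENC" in seen)   # False: the guarded block never ran
```

Under that assumption, a `process` function that looks up `enc` at call time only finds it in a worker if `enc` is assigned at module level, which is what the diff above does.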
