refactor
joshuawe authored and jettjaniak committed Apr 5, 2024
1 parent 7405f72 commit fa1f52e
Showing 1 changed file with 8 additions and 7 deletions.
15 changes: 8 additions & 7 deletions scripts/tokenize_dataset.py
@@ -57,17 +57,18 @@
     text_docs = input_dataset[args.column_name]
 else:
     if len(input_dataset.column_names) > 1:
-        raise ValueError("There are more than one column in the specified dataset")
+        raise ValueError("There is more than one column in the specified dataset")
     text_docs = input_dataset[input_dataset.column_names[0]]

+tokenized_dataset = tokenize_dataset(
+    text_docs,
+    tokenizer,
+    context_size=args.context_size,
+    batch_size=args.batch_size,
+)
 output_dataset = Dataset.from_dict(
     {
-        "tokens": tokenize_dataset(
-            text_docs,
-            tokenizer,
-            context_size=args.context_size,
-            batch_size=args.batch_size,
-        )
+        "tokens": tokenized_dataset,
     }
 )
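
The column-selection branch at the top of this hunk can be exercised in isolation. The following is a minimal sketch, not part of the commit: `select_text_column` is a hypothetical helper, and a plain dict stands in for the dataset object as an assumption so the example runs without extra dependencies.

```python
from typing import Mapping, Optional, Sequence


def select_text_column(
    dataset: Mapping[str, Sequence[str]],
    column_name: Optional[str] = None,
) -> Sequence[str]:
    """Pick the column to tokenize (hypothetical helper, not in the commit)."""
    if column_name:
        # An explicit column name always wins.
        return dataset[column_name]
    column_names = list(dataset.keys())
    if len(column_names) > 1:
        # Same message as the corrected line in the diff.
        raise ValueError("There is more than one column in the specified dataset")
    return dataset[column_names[0]]


# Plain dicts stand in for the real dataset here (an assumption).
single = {"text": ["hello world", "foo bar"]}
multi = {"text": ["a"], "label": ["b"]}

docs = select_text_column(single)
named = select_text_column(multi, column_name="label")
```

Calling `select_text_column(multi)` without a column name raises the `ValueError`, which mirrors the behavior of the script when the dataset has several columns and no column name is given.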
