backport tokenize-independently-then-copy from marin #783

Merged on Nov 6, 2024 (13 commits)
2 changes: 1 addition & 1 deletion config/gpt2_small_fast_pile.yaml
@@ -1,4 +1,4 @@
-data: !include data/pile_source_old.yaml
+data: !include data/pile_mixture.yaml
 model:
   type: gpt2
   hidden_dim: 768
6 changes: 3 additions & 3 deletions src/levanter/data/text.py
@@ -35,7 +35,7 @@
 from levanter.store.cache import CacheOptions, TreeCache
 from levanter.store.jagged_array import JaggedArrayStore
 from levanter.store.tree_store import TreeStore
-from levanter.utils.fsspec_utils import fsspec_expand_glob
+from levanter.utils.fsspec_utils import expand_glob
 from levanter.utils.hf_utils import num_cpus_used_by_tokenizer


@@ -508,7 +508,7 @@ def urls_for_split(self, split):
         else:
             raise ValueError(f"Unknown split {split}")

-        urls = [globbed for url in urls for globbed in fsspec_expand_glob(url)]
+        urls = [globbed for url in urls for globbed in expand_glob(url)]
         return urls


@@ -625,7 +625,7 @@ def _prepare_supervised_example(ex: dict, tokenizer: PreTrainedTokenizerBase) ->
 def mk_supervised_dataset(config: LMSupervisedDatasetConfig, tokenizer: PreTrainedTokenizerBase):
     import levanter.data

-    validation_urls = [url for url_pat in config.validation_urls for url in fsspec_expand_glob(url_pat)]
+    validation_urls = [url for url_pat in config.validation_urls for url in expand_glob(url_pat)]
     dataset = levanter.data.datasource_from_jsonl(validation_urls)

     input_field = config.input_field
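
Both call sites above use the same pattern: a nested comprehension that expands each URL pattern and flattens the results into one list. A minimal sketch of that pattern follows. It is not Levanter's implementation: `expand_glob_local` is a hypothetical local-filesystem stand-in built on the standard-library `glob` module, and its pass-through of non-glob strings and its sorted output are assumptions made for this illustration.

```python
import glob
from typing import Iterator, List


def expand_glob_local(url: str) -> Iterator[str]:
    """Hypothetical stand-in for expand_glob (local paths only).

    Assumption: a string without glob metacharacters is yielded unchanged,
    even if no such file exists.
    """
    if any(ch in url for ch in "*?["):
        # Sort so the expansion order is deterministic across runs.
        yield from sorted(glob.glob(url))
    else:
        yield url


def flatten_urls(patterns: List[str]) -> List[str]:
    """Expand every pattern and flatten the results, as in the diff above."""
    return [globbed for url in patterns for globbed in expand_glob_local(url)]
```

With this sketch, a pattern that matches nothing contributes no entries, so an empty result list may mean that a pattern was mistyped.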