Hierarchical tokenizers #27

Merged: 28 commits, Jul 29, 2024

Changes from 1 commit

Commits (28):
2338ac0  work on tokenizer config (nleroy917, Jun 12, 2024)
47a33ae  update tests (nleroy917, Jun 13, 2024)
78ee6a0  work on hc universes (nleroy917, Jun 13, 2024)
6500c77  add tests githu action (nleroy917, Jun 13, 2024)
983075a  work on README and codecov (nleroy917, Jun 13, 2024)
7f0b654  switch working directory (nleroy917, Jun 13, 2024)
b912538  update tokenizer config (nleroy917, Jun 14, 2024)
bf39b5c  realy update documentation (nleroy917, Jun 14, 2024)
ae37dab  fix doc tests (nleroy917, Jun 14, 2024)
9f04a08  docs for common/utils (nleroy917, Jun 14, 2024)
c1121b5  work on documentation (nleroy917, Jun 14, 2024)
a2af7c1  work on meta-token tokenizer (nleroy917, Jun 15, 2024)
3b52388  finish making meta tokenizer... hopefully (nleroy917, Jun 16, 2024)
74518ef  basic implementation of the meta tokenizer (nleroy917, Jun 16, 2024)
8ba4294  update test data (nleroy917, Jun 16, 2024)
797b204  flush out the meta tokenizer (nleroy917, Jun 16, 2024)
8342c2e  add python bindings to the meta tokenizer (nleroy917, Jun 25, 2024)
b9cb071  remove gtokens (nleroy917, Jun 25, 2024)
a0bc942  add token param (nleroy917, Jun 25, 2024)
0c9889b  add dynamic tokenizer builder (nleroy917, Jun 25, 2024)
cac5261  update tests for saving tokens.gtok (nleroy917, Jun 25, 2024)
df7df86  WIP TokenizerBuilder (nleroy917, Jun 26, 2024)
429b69e  meta tokenizer updates (nleroy917, Jun 26, 2024)
4e831c2  add export functionality to tokenizers (nleroy917, Jun 27, 2024)
e70cf13  small tweaks (nleroy917, Jul 22, 2024)
c8abb9b  bindings (nleroy917, Jul 22, 2024)
27fe07e  add module paths to classes (nleroy917, Jul 29, 2024)
f636780  bump version and changelog (nleroy917, Jul 29, 2024)
Commit: fix doc tests (nleroy917, Jun 14, 2024)
ae37dabdb37edc4736fae91448b88e6fa1ab7e9c
gtars/src/io/mod.rs (6 changes: 3 additions & 3 deletions)
@@ -11,14 +11,14 @@
 //! use gtars::io::write_tokens_to_gtok;
 //!
 //! let ids = vec![42, 101, 999];
-//! write_tokens_to_gtok("tokens.gtok".as_str(), &ids);
+//! write_tokens_to_gtok("tokens.gtok", &ids);
 //! ```
 //! ### Read tokens from disk
 //! ```rust
 //! use gtars::io::read_tokens_from_gtok;
-//! let ids = read_tokens_from_gtoK("tokens.gtok".to_str());
+//! let ids = read_tokens_from_gtok("tokens.gtok").unwrap();
 //!
-//! println!(ids); // [42, 101, 999]
+//! println!("{:?}", ids); // [42, 101, 999]
 //! ```
 pub mod gtok;
 pub mod consts;
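
Read together, the corrected doc tests describe a simple write-then-read round trip through the .gtok format. Below is a minimal standalone sketch of that round trip; the `main` wrapper is illustrative, and the calls mirror the fixed doc tests (which show the read returning a `Result`):

```rust
use gtars::io::{read_tokens_from_gtok, write_tokens_to_gtok};

fn main() {
    // Write a handful of token ids to disk in the .gtok format.
    let ids = vec![42, 101, 999];
    write_tokens_to_gtok("tokens.gtok", &ids);

    // Read them back; per the fixed doc test, this returns a Result.
    let recovered = read_tokens_from_gtok("tokens.gtok").unwrap();
    println!("{:?}", recovered); // [42, 101, 999]
}
```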
gtars/src/lib.rs (12 changes: 7 additions & 5 deletions)
@@ -10,17 +10,19 @@
 //! ## Examples
 //! ### Create a tokenizer and tokenize a bed file
 //! ```rust
-//! use gtars::tokenizers::TreeTokenizer;
+//! use std::path::Path;
+//!
+//! use gtars::tokenizers::{Tokenizer, TreeTokenizer};
 //! use gtars::common::models::RegionSet;
 //!
-//! let path_to_bed_file = "path/to/screen.bed";
+//! let path_to_bed_file = "tests/data/peaks.bed";
 //! let tokenizer = TreeTokenizer::try_from(Path::new(path_to_bed_file)).unwrap();
 //!
-//! let path_to_tokenize_bed_fil = "path/to/peaks.bed";
+//! let path_to_tokenize_bed_file = "tests/data/to_tokenize.bed";
 //! let rs = RegionSet::try_from(Path::new(path_to_tokenize_bed_file)).unwrap();
 //!
 //! let tokenized_regions = tokenizer.tokenize_region_set(&rs);
-//! println!(tokenized_regions.ids);
+//! println!("{:?}", tokenized_regions.ids);
 //! ```
 //!
 //! You can save the result of this tokenization to a file for later use in machine learning model training:
@@ -29,7 +31,7 @@
 //! use gtars::io::write_tokens_to_gtok;
 //!
 //! let ids = vec![42, 101, 999];
-//! write_tokens_to_gtok("tokens.gtok".as_str(), &ids);
+//! write_tokens_to_gtok("tokens.gtok", &ids);
 //! ```
 pub mod ailist;
 pub mod common;
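
The lib.rs docs note that tokenized ids can be saved for later use in model training. Stitching the two fixed doc tests together gives the full flow; this is a sketch assuming the tests/data paths from this PR exist relative to where the program runs, and that `tokenized_regions.ids` is the same id vector `write_tokens_to_gtok` accepts:

```rust
use std::path::Path;

use gtars::common::models::RegionSet;
use gtars::io::write_tokens_to_gtok;
use gtars::tokenizers::{Tokenizer, TreeTokenizer};

fn main() {
    // Build a tokenizer from a universe BED file (the vocabulary of regions).
    let tokenizer = TreeTokenizer::try_from(Path::new("tests/data/peaks.bed")).unwrap();

    // Load the regions to tokenize and run them through the tokenizer.
    let rs = RegionSet::try_from(Path::new("tests/data/to_tokenize.bed")).unwrap();
    let tokenized_regions = tokenizer.tokenize_region_set(&rs);

    // Persist the resulting ids in .gtok format for later model training.
    write_tokens_to_gtok("tokens.gtok", &tokenized_regions.ids);
}
```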
gtars/src/tokenizers/mod.rs (12 changes: 7 additions & 5 deletions)
@@ -7,17 +7,19 @@
 //! ## Example
 //! ### Create a tokenizer and tokenize a bed file
 //! ```rust
-//! use gtars::tokenizers::TreeTokenizer;
+//! use std::path::Path;
+//!
+//! use gtars::tokenizers::{Tokenizer, TreeTokenizer};
 //! use gtars::common::models::RegionSet;
 //!
-//! let path_to_bed_file = "path/to/screen.bed";
+//! let path_to_bed_file = "tests/data/peaks.bed.gz";
 //! let tokenizer = TreeTokenizer::try_from(Path::new(path_to_bed_file)).unwrap();
 //!
-//! let path_to_tokenize_bed_fil = "path/to/peaks.bed";
-//! let let rs = RegionSet::try_from(Path::new(path_to_tokenize_bed_file)).unwrap();
+//! let path_to_tokenize_bed_file = "tests/data/to_tokenize.bed";
+//! let rs = RegionSet::try_from(Path::new(path_to_tokenize_bed_file)).unwrap();
 //!
 //! let tokenized_regions = tokenizer.tokenize_region_set(&rs);
-//! println!(tokenized_regions.ids);
+//! println!("{:?}", tokenized_regions.ids);
 //! ```
 pub mod cli;
 pub mod config;
gtars/tokens.gtok (binary file added; contents not shown)