Develop word2vec.list() #23

Open: wants to merge 32 commits into base: dev-list

Commits (32)
03065ee
Add test file
koheiw Nov 20, 2023
ad51374
Change to serialized tokens
koheiw Nov 20, 2023
11dfd0c
Move header file
koheiw Nov 20, 2023
2323170
Disable parameters for file inputs
koheiw Nov 21, 2023
ca8c062
Remove code for file inputs
koheiw Nov 24, 2023
b186742
Remove method for character
koheiw Nov 30, 2023
ffd3b92
Change tokens from string to int
koheiw Nov 30, 2023
267eed9
Remove vocabulary
koheiw Dec 6, 2023
2d46347
Set frequency
koheiw Feb 9, 2024
edd824d
Remove stopwords where types == ""
koheiw Feb 10, 2024
112bcc9
Don't remove any words
koheiw Feb 10, 2024
e1de9dc
Disable save() and load()
koheiw Feb 10, 2024
a369ead
Remove progress bar and callback for vocabulary
koheiw Feb 11, 2024
a4898db
Improve handling of sentence lengths
koheiw Feb 11, 2024
621bee7
Use for loop
koheiw Feb 11, 2024
3b19210
Clean up the code
koheiw Feb 11, 2024
32a3f90
Fix word and document index
koheiw Feb 11, 2024
b249975
Tidy up
koheiw Feb 11, 2024
562d1a1
Tidy up
koheiw Feb 11, 2024
f6a3dc2
Tidy up
koheiw Feb 11, 2024
1b6be16
Tidy up
koheiw Feb 11, 2024
2045cdb
Tidy up
koheiw Feb 11, 2024
4a90483
Build
koheiw Feb 12, 2024
b15313d
Update
koheiw Feb 12, 2024
4c79ddc
Update
koheiw Feb 12, 2024
708fd05
Update
koheiw Feb 12, 2024
276f3cb
Remove files for file IO
koheiw Mar 4, 2024
a3b4991
Convert character to integer
koheiw Mar 4, 2024
01a79a2
Merge branch 'dev-list' into dev-list
koheiw Mar 5, 2024
e05f809
Remove mapper.cpp from Makevars
koheiw Aug 2, 2024
b063d20
Build
koheiw Aug 2, 2024
0151da5
Add word2vec.tokens
koheiw Aug 2, 2024
4 changes: 2 additions & 2 deletions DESCRIPTION
@@ -13,8 +13,8 @@ Description: Learn vector representations of words by continuous bag of words an
URL: https://github.com/bnosac/word2vec
License: Apache License (>= 2.0)
Encoding: UTF-8
-RoxygenNote: 7.2.3
+RoxygenNote: 7.3.1
Depends: R (>= 2.10)
-Imports: Rcpp (>= 0.11.5), stats
+Imports: Rcpp (>= 0.11.5), stats, fastmatch
LinkingTo: Rcpp, RcppProgress
Suggests: udpipe
1 change: 1 addition & 0 deletions NAMESPACE
@@ -7,6 +7,7 @@ S3method(predict,word2vec_trained)
S3method(summary,word2vec)
S3method(summary,word2vec_trained)
S3method(word2vec,list)
+S3method(word2vec,tokens)
export(doc2vec)
export(read.word2vec)
export(read.wordvectors)
4 changes: 0 additions & 4 deletions R/RcppExports.R
@@ -21,7 +21,3 @@ w2v_nearest_vector <- function(ptr, x, top_n = 10L, min_distance = 0.0) {
.Call('_word2vec_w2v_nearest_vector', PACKAGE = 'word2vec', ptr, x, top_n, min_distance)
}

-w2v_read_binary <- function(modelFile, normalize, n) {
-.Call('_word2vec_w2v_read_binary', PACKAGE = 'word2vec', modelFile, normalize, n)
-}
-
29 changes: 24 additions & 5 deletions R/word2vec.R
@@ -156,13 +156,14 @@ word2vec <- function(x,
#' modelb <- word2vec(x = txt, dim = 15, iter = 20, split = c(" \n\r", "\n\r"))
#' all.equal(as.matrix(modela), as.matrix(modelb))
#' \dontshow{\} # End of main if statement running only if the required packages are installed}
-word2vec.list <- function(x,
+word2vec.tokens <- function(x,
type = c("cbow", "skip-gram"),
dim = 50, window = ifelse(type == "cbow", 5L, 10L),
iter = 5L, lr = 0.05, hs = FALSE, negative = 5L, sample = 0.001, min_count = 5L,
stopwords = integer(),
threads = 1L,
...){
+
#x <- lapply(x, as.character)
type <- match.arg(type)
stopwords <- as.integer(stopwords)
@@ -182,17 +183,35 @@ word2vec.list <- function(x,
iter <- as.integer(iter)
lr <- as.numeric(lr)
skipgram <- as.logical(type %in% "skip-gram")
-encoding <- "UTF-8"
-model <- w2v_train(x, stopwords,
-modelFile = model,
-minWordFreq = min_count,
+
+model <- w2v_train(x, attr(x, "types"), minWordFreq = min_count,
size = dim, window = window, #expTableSize = expTableSize, expValueMax = expValueMax,
sample = sample, withHS = hs, negative = negative, threads = threads, iterations = iter,
alpha = lr, withSG = skipgram, ...)
model$data$stopwords <- stopwords
model
}

+#' @export
+word2vec.list <- function(x, ...){
+if (!is.character(attr(x, "types"))) {
+x <- serialize(x, stopwords)
+class(x) <- "tokens"
+}
+word2vec(x, ...)
+}
+
+serialize <- function(x, stopwords) {
+vocaburary <- unique(unlist(x, use.names = FALSE))
+vocaburary <- setdiff(vocaburary, stopwords)
+x <- lapply(x, function(x) {
+v <- fastmatch::fmatch(x, vocaburary)
+v[is.na(v)] <- 0L
+return(v)
+})
+attr(x, "types") <- vocaburary
+return(x)
+}

#' @title Get the word vectors of a word2vec model
#' @description Get the word vectors of a word2vec model as a dense matrix.
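
The `serialize()` helper added in this file maps every token to an integer index into a `types` vocabulary and codes stopwords as 0, so the trainer receives integers instead of strings. A minimal sketch of the same idea in C++ follows. `SerializedTokens` and `serialize_tokens` are hypothetical names used only for illustration; they are not part of the package, and the conversion in this PR is done in R with `fastmatch::fmatch`.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical container (not from the package): integer-coded documents
// plus the table that maps ids back to words.
struct SerializedTokens {
    std::vector<std::vector<int>> docs;  // 0 marks a removed (stopword) token
    std::vector<std::string> types;      // types[id - 1] is the word for id >= 1
};

// Assign 1-based ids in order of first appearance, skipping stopwords.
// Stopwords are coded as 0 so each document keeps its original length.
SerializedTokens serialize_tokens(
    const std::vector<std::vector<std::string>>& texts,
    const std::unordered_set<std::string>& stopwords) {
    SerializedTokens out;
    std::unordered_map<std::string, int> index;  // word -> 1-based id
    out.docs.reserve(texts.size());
    for (const auto& text : texts) {
        std::vector<int> ids;
        ids.reserve(text.size());
        for (const auto& word : text) {
            if (stopwords.count(word) > 0) {
                ids.push_back(0);
                continue;
            }
            auto it = index.find(word);
            if (it == index.end()) {
                // First time this word is seen: append it to the vocabulary.
                out.types.push_back(word);
                it = index.emplace(word, static_cast<int>(out.types.size())).first;
            }
            ids.push_back(it->second);
        }
        assert(ids.size() == text.size());
        out.docs.push_back(std::move(ids));
    }
    return out;
}
```

Coding removed words as 0 rather than dropping them keeps each document at its original length, which mirrors the `v[is.na(v)] <- 0L` step in the R helper.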
6 changes: 3 additions & 3 deletions man/word2vec.list.Rd → man/word2vec.tokens.Rd

Some generated files are not rendered by default.

1 change: 0 additions & 1 deletion src/Makevars
@@ -2,7 +2,6 @@ PKG_LIBS = -pthread
PKG_CPPFLAGS = -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib

SOURCES = word2vec/lib/huffmanTree.cpp \
-word2vec/lib/mapper.cpp \
word2vec/lib/nsDistribution.cpp \
word2vec/lib/trainer.cpp \
word2vec/lib/trainThread.cpp \
1 change: 0 additions & 1 deletion src/Makevars.win
@@ -2,7 +2,6 @@ PKG_LIBS = -pthread
PKG_CPPFLAGS = -pthread -DSTRICT_R_HEADERS -I./word2vec/include -I./word2vec/lib

SOURCES = word2vec/lib/huffmanTree.cpp \
-word2vec/lib/mapper.cpp \
word2vec/lib/nsDistribution.cpp \
word2vec/lib/trainer.cpp \
word2vec/lib/trainThread.cpp \
14 changes: 0 additions & 14 deletions src/RcppExports.cpp
@@ -88,27 +88,13 @@ BEGIN_RCPP
return rcpp_result_gen;
END_RCPP
}
-// w2v_read_binary
-Rcpp::NumericMatrix w2v_read_binary(const std::string modelFile, bool normalize, std::size_t n);
-RcppExport SEXP _word2vec_w2v_read_binary(SEXP modelFileSEXP, SEXP normalizeSEXP, SEXP nSEXP) {
-BEGIN_RCPP
-Rcpp::RObject rcpp_result_gen;
-Rcpp::RNGScope rcpp_rngScope_gen;
-Rcpp::traits::input_parameter< const std::string >::type modelFile(modelFileSEXP);
-Rcpp::traits::input_parameter< bool >::type normalize(normalizeSEXP);
-Rcpp::traits::input_parameter< std::size_t >::type n(nSEXP);
-rcpp_result_gen = Rcpp::wrap(w2v_read_binary(modelFile, normalize, n));
-return rcpp_result_gen;
-END_RCPP
-}

static const R_CallMethodDef CallEntries[] = {
{"_word2vec_w2v_train", (DL_FUNC) &_word2vec_w2v_train, 17},
{"_word2vec_w2v_dictionary", (DL_FUNC) &_word2vec_w2v_dictionary, 1},
{"_word2vec_w2v_embedding", (DL_FUNC) &_word2vec_w2v_embedding, 2},
{"_word2vec_w2v_nearest", (DL_FUNC) &_word2vec_w2v_nearest, 4},
{"_word2vec_w2v_nearest_vector", (DL_FUNC) &_word2vec_w2v_nearest_vector, 4},
-{"_word2vec_w2v_read_binary", (DL_FUNC) &_word2vec_w2v_read_binary, 3},
{NULL, NULL, 0}
};

49 changes: 3 additions & 46 deletions src/rcpp_word2vec.cpp
@@ -5,7 +5,6 @@
#include <iostream>
#include <iomanip>
#include "word2vec.hpp"
-#include "wordReader.hpp"
#include <unordered_map>

// [[Rcpp::depends(RcppProgress)]]
@@ -82,29 +81,6 @@ Rcpp::List w2v_train(Rcpp::List texts_,
if (verbose) { // NOTE: consider removing progress bar
Progress p(100, true);
trained = model->train(trainSettings, corpus,
-//trainFile, stopWordsFile, // NOTE: remove
-// [&p] (float _percent) {
-// p.update(_percent / 2);
-// /*
-// std::cout << "\rParsing train data... "
-// << std::fixed << std::setprecision(2)
-// << _percent << "%" << std::flush;
-// */
-// },
-// [&vocWords, &trainWords, &totalWords] (std::size_t _vocWords, std::size_t _trainWords, std::size_t _totalWords) {
-// /*
-// Rcpp::Rcerr << std::endl
-// << "Finished reading data: " << std::endl
-// << "Vocabulary size: " << _vocWords << std::endl
-// << "Train words: " << _trainWords << std::endl
-// << "Total words: " << _totalWords << std::endl
-// << "Start training" << std::endl
-// << std::endl;
-// */
-// vocWords = _vocWords;
-// trainWords = _trainWords;
-// totalWords = _totalWords;
-// },
[&p] (float _alpha, float _percent) {
/*
std::cout << '\r'
@@ -119,26 +95,8 @@ Rcpp::List w2v_train(Rcpp::List texts_,
p.update(_percent);
}
);
-//std::cout << std::endl;
} else {
-trained = model->train(trainSettings, corpus,
-//trainFile, stopWordsFile, // NOTE: remove
-// nullptr,
-// [&vocWords, &trainWords, &totalWords] (std::size_t _vocWords, std::size_t _trainWords, std::size_t _totalWords) {
-// /*
-// Rcpp::Rcerr << std::endl
-// << "Finished reading data: " << std::endl
-// << "Vocabulary size: " << _vocWords << std::endl
-// << "Train words: " << _trainWords << std::endl
-// << "Total words: " << _totalWords << std::endl
-// << "Start training" << std::endl
-// << std::endl;
-// */
-// vocWords = _vocWords;
-// trainWords = _trainWords;
-// totalWords = _totalWords;
-// },
-nullptr);
+trained = model->train(trainSettings, corpus, nullptr);
}
Rcpp::Rcout << "Training done\n";
//return Rcpp::List::create();
@@ -313,6 +271,8 @@ Rcpp::List w2v_nearest_vector(SEXP ptr,
return out;
}

+/* NOTE: temporarily disabled
+
// [[Rcpp::export]]
Rcpp::NumericMatrix w2v_read_binary(const std::string modelFile, bool normalize, std::size_t n) {
try {
@@ -416,9 +376,6 @@ Rcpp::NumericMatrix w2v_read_binary(const std::string modelFile, bool normalize,
return embedding_default;
}

-/* NOTE: temporarily disabled
-
-
// [[Rcpp::export]]
Rcpp::List d2vec(SEXP ptr, Rcpp::StringVector x, std::string wordDelimiterChars = " \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r") {
Rcpp::XPtr<w2v::w2vModel_t> model_w2v(ptr);
82 changes: 0 additions & 82 deletions src/word2vec/include/mapper.hpp

This file was deleted.