tools for preparing seq2seq data #114

bobbyjaros · 2016-04-26T23:03:01Z

Adds nnparse.exe and SeqToSeqData.scala, which together convert paired text files into the formatted matrices consumed by SeqToSeq.

[Just a cleaner version of #77 (which had some extra mods unrelated to this PR)]

newparse can optionally output paragraphids and sentenceids for each token:

    p1 s1 w1
    p2 s2 w2
    p3 s3 w3
    p4 s4 w4
    p5 s5 w5
    p6 s6 w6

nnparse uses this functionality in a very simple form: it assumes each newline denotes a paragraph and each ". " or "? " or "! " denotes a new sentence.
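The splitting rule above can be sketched in a few lines of Python. This is only an illustration of the rule, not nnparse's implementation: the function name `tag_tokens`, the 0-based ids, the sentence counter that keeps running across paragraphs, and whitespace tokenization are all assumptions, and it emits raw tokens where the real output holds word ids.

```python
import re

# Sentence boundary as described above: a space preceded by ".", "?" or "!".
_SENTENCE_END = re.compile(r"(?<=[.?!]) ")


def tag_tokens(text):
    """Return (paragraph_id, sentence_id, token) triples for `text`.

    Assumptions (not taken from nnparse itself): ids are 0-based, sentence
    ids keep counting across paragraphs, tokens are split on whitespace,
    and blank lines still consume a paragraph id.
    """
    rows = []
    sentence_id = 0
    # Each newline-delimited line is treated as one paragraph.
    for paragraph_id, line in enumerate(text.split("\n")):
        for sentence in _SENTENCE_END.split(line):
            tokens = sentence.split()
            if not tokens:
                continue  # skip empty fragments, e.g. from blank lines
            rows.extend((paragraph_id, sentence_id, tok) for tok in tokens)
            sentence_id += 1
    return rows
```

Calling `tag_tokens("Hi there. How are you?\nFine!")` yields one triple per token, with the paragraph id changing at the newline and the sentence id changing after ". " and at the start of the next line.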

SeqToSeqData.scala starts with the output of nnparse.exe: two paired files, each in this format:

    p1 s1 w1
    p2 s2 w2
    p3 s3 w3
    p4 s4 w4
    p5 s5 w5
    p6 s6 w6

For SeqToSeq we assume each line contains one sentence, so the paragraphid (the first column) identifies the sentence and the sentenceid (the second column) is always ignored. The two parsed sentence IMats are paired line by line: the ith line of the src IMat corresponds to the ith line of the dst IMat.

It produces two paired SMats of the following form:

    w00  w01  w02  w03  w04  w05  ...
    w10  w11  w12  w13  w14  w15P ...
    w20  w21  w22  w23P w24  w25P ...
    w30  w31P w32  ...
    w40P w41P w42  ...

where wij is the dictionary index of the i'th word in the j'th sentence, and words with a P suffix are padding symbols. The columns of the two output SMats are still paired: column j of the src output SMat and column j of the dst output SMat correspond to line j of the src input and line j of the dst input respectively.

Furthermore, the sentences are collated into batches of similar lengths. The minibatches are randomly permuted after collation to avoid training bias. See the in-file docs for additional options.
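As a rough illustration of the padding, collation, and permutation steps described above, here is a minimal Python sketch. It is not the code in SeqToSeqData.scala: the names `pad` and `make_batches`, the padding index `PAD = 0`, the `batch_size` and `seed` parameters, sorting by source length first, and padding each batch only to its own longest sentence are all assumptions made for the example.

```python
import random

PAD = 0  # hypothetical padding index; the real tool's padding symbol may differ


def pad(sentence, length, pad_id=PAD):
    """Right-pad a list of word indices to `length`."""
    return sentence + [pad_id] * (length - len(sentence))


def make_batches(src, dst, batch_size, seed=0):
    """Collate paired sentences into shuffled minibatches of similar length.

    `src[j]` and `dst[j]` are lists of dictionary indices for sentence pair j.
    Returns a list of (src_batch, dst_batch) tuples. Each batch is a list of
    padded columns, and column k of src_batch stays paired with column k of
    dst_batch.
    """
    if len(src) != len(dst):
        raise ValueError("src and dst must have the same number of sentences")
    # Sort pair indices by length so each batch holds similar-length sentences.
    order = sorted(range(len(src)), key=lambda j: (len(src[j]), len(dst[j])))
    batches = []
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        src_len = max(len(src[j]) for j in idx)
        dst_len = max(len(dst[j]) for j in idx)
        batches.append((
            [pad(src[j], src_len) for j in idx],
            [pad(dst[j], dst_len) for j in idx],
        ))
    # Permute whole minibatches, not individual sentences, so that each
    # batch keeps its similar-length grouping.
    random.Random(seed).shuffle(batches)
    return batches
```

Shuffling at the batch level keeps the padding overhead low within each batch while still avoiding a short-to-long ordering during training.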

Bobby Jaros added 4 commits December 17, 2015 22:36

Merge remote-tracking branch 'upstream/master' into nnparse

bcf4d0b

Functionality to map indices from src dict to target dict

45de1ca
