tools for preparing seq2seq data #114

Open: bobbyjaros wants to merge 4 commits into master

@bobbyjaros commented Apr 26, 2016

Adds nnparse.exe and SeqToSeqData.scala, which together convert paired text files into the formatted matrices consumed by SeqToSeq.

[Just a cleaner version of #77 (which had some extra mods unrelated to this PR)]

Bobby Jaros added 4 commits December 17, 2015 22:36
newparse can optionally output a paragraphid and a sentenceid for each token:
        p1 s1 w1
        p2 s2 w2
        p3 s3 w3
        p4 s4 w4
        p5 s5 w5
        p6 s6 w6

nnparse uses this functionality in a very simple form: it assumes that each
newline denotes a new paragraph and that each ". " or "? " or "! " denotes
a new sentence.
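
The splitting rule above can be illustrated with a minimal Python sketch. It is not the nnparse implementation: the function name `assign_ids`, the zero-based ids, the document-wide sentence counter, and the whitespace tokenization are all assumptions made for illustration, and it emits the word itself where the real tool would presumably emit a dictionary index.

```python
import re

def assign_ids(text):
    """Return (paragraph_id, sentence_id, word) triples (hypothetical helper)."""
    triples = []
    sentence_id = 0  # assumption: sentence ids run across the whole input
    for paragraph_id, line in enumerate(text.split("\n")):
        # Split on the space that follows ".", "?" or "!", so the
        # punctuation stays attached to the sentence it ends.
        for sentence in re.split(r"(?<=[.?!]) ", line):
            words = sentence.split()
            if not words:
                continue
            triples.extend((paragraph_id, sentence_id, w) for w in words)
            sentence_id += 1
    return triples

triples = assign_ids("Hi there. How are you?\nFine!")
```

Here the first line yields two sentences in paragraph 0 and the second line yields one sentence in paragraph 1.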

SeqToSeqData.scala starts with the output of nnparse.exe: two paired files, each with this format:
         p1 s1 w1
         p2 s2 w2
         p3 s3 w3
         p4 s4 w4
         p5 s5 w5
         p6 s6 w6

(For SeqToSeq we assume each line contains one sentence, so the paragraphid
in the first column identifies the sentence, and the sentenceid in the
second column is always ignored.)

The two parsed sentence IMats are paired line by line: the i'th line of the
src IMat corresponds to the i'th line of the dst IMat.
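
As a rough illustration of that input convention, the sketch below groups rows by the first column and pairs the resulting sentences by position. It is a hedged Python sketch, not code from SeqToSeqData.scala: `read_sentences` is a hypothetical name, the rows are plain tuples rather than IMats, and it assumes rows for the same paragraphid are contiguous.

```python
def read_sentences(rows):
    """Group (paragraph_id, sentence_id, word_id) rows into one word list per paragraph_id."""
    sentences = []
    last_pid = None
    for pid, _sid, wid in rows:  # the sentenceid column is ignored
        if pid != last_pid:
            sentences.append([])  # a new paragraphid starts a new sentence
            last_pid = pid
        sentences[-1].append(wid)
    return sentences

# Made-up word ids, for illustration only.
src_rows = [(0, 0, 11), (0, 0, 12), (1, 0, 13)]
dst_rows = [(0, 0, 21), (1, 0, 22), (1, 1, 23)]

# The i'th src sentence is paired with the i'th dst sentence.
pairs = list(zip(read_sentences(src_rows), read_sentences(dst_rows)))
```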

Produces two paired SMats of the following form:
         w00  w01  w02  w03  w04  w05  ...
         w10  w11  w12  w13  w14  w15P ...
         w20  w21  w22  w23P w24  w25P ...
         w30  w31P w32                 ...
         w40P w41P w42                 ...

where
   wij is the dictionary index of the i'th word in the j'th sentence and
   words with a P suffix are padding symbols.
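
The layout can be sketched in a few lines of Python. This is an illustration under assumptions, not the PR's code: `pad_columns` and `pad_id` are hypothetical names, and a dense list of rows stands in for the SMat.

```python
def pad_columns(sentences, pad_id):
    """Lay sentences out as columns, padding short ones at the bottom with pad_id."""
    height = max(len(s) for s in sentences)
    cols = [s + [pad_id] * (height - len(s)) for s in sentences]
    # Transpose so that row i, column j holds word i of sentence j.
    return [list(row) for row in zip(*cols)]

matrix = pad_columns([[1, 2, 3], [4], [5, 6]], pad_id=0)
```

Each sentence occupies one column, and every column is padded to the length of the longest sentence.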

The columns of the two output SMats are still paired: column j of the
src output SMat and column j of the dst output SMat correspond to line j
of the src input and line j of the dst input respectively.

Furthermore, the sentences are collated into batches of similar lengths.

The minibatches are randomly permuted after collation to avoid training bias.
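
One way to picture collation followed by minibatch permutation is the Python sketch below. It is only a sketch: the name `collate`, the `batch_size` and `seed` parameters, and the choice to sort by source length alone are assumptions, since the grouping criterion is not stated here.

```python
import random

def collate(src, dst, batch_size, seed=0):
    """Reorder paired sentences into length-sorted batches, then shuffle the batches."""
    # Assumption: sentences are grouped by source length only.
    order = sorted(range(len(src)), key=lambda j: len(src[j]))
    batches = [order[k:k + batch_size] for k in range(0, len(order), batch_size)]
    # Permute whole minibatches, so each batch keeps sentences of similar length.
    random.Random(seed).shuffle(batches)
    flat = [j for batch in batches for j in batch]
    # The same index order is applied to both sides, so the pairing is preserved.
    return [src[j] for j in flat], [dst[j] for j in flat]
```

Because src and dst are reordered with the same indices, column j of one output still corresponds to column j of the other.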

See in-file docs for additional options.