Noun phrase detection in JavaScript using neural nets with convnetjs. Part of my bachelor thesis.
It uses deep neural networks inspired by the paper "A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning" by R. Collobert and J. Weston.
Since the neural networks in this paper are suitable for other NLP tasks, this library should be helpful for those tasks as well.
Grab all files from the repository. If you want to use noun phrase detection, download the train.txt and test.txt from https://www.clips.uantwerpen.be/conll2000/chunking/. Extract those files and put them in the folder trainExample/conll2000.
Now host all those files on a server and navigate to the root folder in your browser. If you run Unix with Python installed, you can use server.sh to start a local server and find the example code at localhost:8000.
You will need require.js to use the project: http://requirejs.org/
To use the library just copy these 4 files and load all except the last one with require.js:
- nounphrasejs.js
- getWordWindowConfiguration.js
- getSentenceConfiguration.js
- convnet-min.js
"getWordWindowConfiguration" and "getSentenceConfiguration" are optional and you will most likely only need one of them for your project. Check Network Architectures for more information.
To just classify a given sentence, load a pretrained network from JSON.
nounphrasejs.readTextFile("/jsonNets/wordWindowNounPhrase.txt", function(json) {
    var configuration = getWordWindowConfiguration(json);
    // TODO: Do stuff with the configuration here.
});
Make sure to use either getWordWindowConfiguration or getSentenceConfiguration depending on the type of the saved network. You can use the pretrained network JSON files from the "jsonNets" folder of this repository. Their filename indicates if they are a word window configuration or a sentence configuration.
The actual classification can be done by:
configuration.classifySentence(["The", "blue", "cat", "sat", "on", "a", "mat", "."],
    function(word, wordIndex, result, percentages) {
        alert("Word " + word + " classified as " + result + ".");
    });
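If you would rather work with the whole sentence at once than react to each word, the per-word results can be gathered into an array. The following is a minimal sketch, not part of the library: collectClassifications and fakeConfiguration are hypothetical names, and it assumes that the callback is invoked exactly once per word and that wordIndex starts at 0. The stand-in configuration only exists so the sketch runs on its own; the labels a real network returns may look different.

```javascript
// Hypothetical helper, not part of NounPhraseJS: gathers the per-word results
// of classifySentence into one array and passes it to `done`.
// Assumptions: the callback fires exactly once per word, wordIndex is 0-based.
function collectClassifications(configuration, words, done) {
    var results = new Array(words.length);
    var received = 0;
    if (words.length === 0) {
        done(results);
        return;
    }
    configuration.classifySentence(words, function(word, wordIndex, result, percentages) {
        results[wordIndex] = { word: word, label: result, percentages: percentages };
        received += 1;
        if (received === words.length) {
            done(results);
        }
    });
}

// Stand-in for a real configuration so the sketch is runnable by itself.
// It labels every capitalized word "NP" and everything else "O".
var fakeConfiguration = {
    classifySentence: function(words, callback) {
        words.forEach(function(word, index) {
            var first = word.charAt(0);
            var isCapitalized = first !== first.toLowerCase();
            callback(word, index, isCapitalized ? "NP" : "O", []);
        });
    }
};

var collected = null;
collectClassifications(fakeConfiguration, ["The", "cat", "sat", "."], function(results) {
    collected = results;
});
```

With a configuration loaded from one of the JSON files, you would pass that object instead of fakeConfiguration.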
Check the API documentation for more details on the callback functions. https://github.com/JulianMH/NounPhraseJS/blob/master/API.md
To train a network yourself, loading a training dataset is required.
var dictionary = new Dictionary();
nounphrasejs.readTextFile("/trainExample/wikiWars/train.txt", function(text) {
    var trainCorpus = nounphrasejs.parseTextCorpus(text, dictionary, true);
    // TODO: Do stuff with the text corpus here.
});
The last parameter of parseTextCorpus indicates if new words from the text corpus should be added to the dictionary. You want this to be true for train data and false for test data. If you provide a dictionary of words to use, the parameter should be false in both cases.
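To illustrate why the flag differs between train and test data, here is a small sketch of a word-to-index dictionary. SketchDictionary and its lookup method are hypothetical and only mimic the idea; the library's own Dictionary may be implemented differently. The sketch assumes that unknown words map to a shared index 0 when they are not added.

```javascript
// Hypothetical stand-in for illustration only, not the library's Dictionary.
function SketchDictionary() {
    this.indexByWord = Object.create(null); // no inherited keys such as "constructor"
    this.words = ["<unknown>"];             // index 0 is reserved for unseen words
}

// Returns the index of `word`. New words are registered only if addNewWords is true;
// otherwise they fall back to the shared unknown index 0.
SketchDictionary.prototype.lookup = function(word, addNewWords) {
    if (word in this.indexByWord) {
        return this.indexByWord[word];
    }
    if (!addNewWords) {
        return 0;
    }
    var index = this.words.length;
    this.words.push(word);
    this.indexByWord[word] = index;
    return index;
};

var sketch = new SketchDictionary();

// Train data: new words grow the dictionary.
var trainIds = ["the", "cat", "the"].map(function(word) {
    return sketch.lookup(word, true);
});

// Test data: unseen words must not change the dictionary.
var testIds = ["the", "dog"].map(function(word) {
    return sketch.lookup(word, false);
});
```

Here "dog" never appeared in the train data, so it ends up on the unknown index and the dictionary keeps the size it had after training.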
The dataset is expected to be in a format similar to https://www.clips.uantwerpen.be/conll2000/chunking/. All tags that do not include noun phrase information are simply ignored.
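For reference, the CoNLL-2000 chunking files contain one token per line with three whitespace-separated columns (word, part-of-speech tag, chunk tag), and a blank line between sentences. The sketch below shows how such text could be read while keeping only the noun phrase tags. parseChunkingText is a hypothetical function for illustration, not the library's parseTextCorpus, and mapping every non-NP tag to "O" is an assumption about what "ignored" means here.

```javascript
// Hypothetical reader for CoNLL-2000 style text, for illustration only.
// Keeps B-NP and I-NP chunk tags and treats every other tag as "O".
function parseChunkingText(text) {
    var sentences = [];
    var current = [];
    text.split(/\r?\n/).forEach(function(line) {
        var trimmed = line.trim();
        if (trimmed === "") {
            // A blank line closes the current sentence.
            if (current.length > 0) {
                sentences.push(current);
                current = [];
            }
            return;
        }
        var fields = trimmed.split(/\s+/);
        if (fields.length < 3) {
            return; // skip lines that do not have word, POS tag and chunk tag
        }
        var chunkTag = fields[2];
        var isNounPhrase = chunkTag === "B-NP" || chunkTag === "I-NP";
        current.push({ word: fields[0], label: isNounPhrase ? chunkTag : "O" });
    });
    if (current.length > 0) {
        sentences.push(current);
    }
    return sentences;
}

// A made-up sample in the three-column layout.
var sample = [
    "The DT B-NP",
    "cat NN I-NP",
    "sat VBD B-VP",
    "on IN B-PP",
    "a DT B-NP",
    "mat NN I-NP",
    ". . O",
    "",
    "It PRP B-NP",
    "slept VBD B-VP"
].join("\n");

var parsed = parseChunkingText(sample);
```

In this sample the verb and preposition chunks (B-VP, B-PP) end up with the same "O" label as the punctuation, so only the noun phrase boundaries remain.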
To do the actual training and testing, a NetworkConfiguration object is needed. Call either getWordWindowConfiguration or getSentenceConfiguration depending on which network architecture you prefer.
var options = {};
var configuration = getWordWindowConfiguration(options, dictionary);
You do not need to pass any options. The configuration uses reasonable default values for any missing parameter. For a list of possible parameters, check the API documentation. https://github.com/JulianMH/NounPhraseJS/blob/master/API.md
To train the network by taking 100000 random samples from your training data, call this:
configuration.train(trainCorpus, 100000);
If you want to be able to output training statistics or provide a progress bar, pass a progress callback function.
configuration.train(trainCorpus, 100000, function(index, stats, trainTime) {
    alert("Trained with " + index + " of 100000 samples.");
});
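Reporting on every single sample quickly becomes noisy, so in practice you may want to report only every so often. The sketch below builds such a callback. makeProgressReporter and its parameters are hypothetical names, and it assumes that index counts the samples processed so far; the stats and trainTime arguments are ignored because their shape is described in the API documentation, not here.

```javascript
// Hypothetical helper, not part of NounPhraseJS: returns a progress callback
// that reports only every `reportEvery` samples and once at the very end.
// Assumption: `index` is the number of samples processed so far.
function makeProgressReporter(totalSamples, reportEvery, report) {
    return function(index, stats, trainTime) {
        if (index % reportEvery === 0 || index === totalSamples) {
            var percent = Math.round(100 * index / totalSamples);
            report("Trained with " + index + " of " + totalSamples +
                " samples (" + percent + "%).");
        }
    };
}

// Collect the messages in an array; in a browser you could update a progress bar instead.
var messages = [];
var onProgress = makeProgressReporter(100000, 25000, function(message) {
    messages.push(message);
});

// Simulate the calls a training run might make.
for (var i = 1; i <= 100000; i++) {
    onProgress(i, null, 0);
}
```

The resulting function can be passed as the third argument of configuration.train in place of the inline callback.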
Testing works similarly.
var correctLabels = configuration.test(testCorpus);
You can also pass a progress callback function.
var correctLabels = configuration.test(testCorpus, function(index, correctLabels, predictedLabel, actualLabel, percentages, testTime) {
    alert("Test example number " + index + " was predicted to be "
        + predictedLabel + ", which is " + (predictedLabel == actualLabel ? "correct" : "incorrect") + ".");
});
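The progress callback also makes it possible to keep your own running tally, for example an accuracy value. This is a minimal sketch under the assumption that the callback fires once per test example and that predictedLabel and actualLabel can be compared directly. makeAccuracyTracker is a hypothetical helper, and the label values in the simulated calls are made up.

```javascript
// Hypothetical helper, not part of NounPhraseJS: counts how many predictions
// match the actual label, using only the arguments of the test callback.
function makeAccuracyTracker() {
    var total = 0;
    var correct = 0;
    return {
        callback: function(index, correctLabels, predictedLabel, actualLabel, percentages, testTime) {
            total += 1;
            if (predictedLabel === actualLabel) {
                correct += 1;
            }
        },
        // Fraction of correct predictions so far, or 0 before any example was seen.
        accuracy: function() {
            return total === 0 ? 0 : correct / total;
        }
    };
}

var tracker = makeAccuracyTracker();

// Simulated callback invocations with made-up labels: 3 of 4 are correct.
tracker.callback(0, 1, "B-NP", "B-NP", [], 0);
tracker.callback(1, 1, "O", "I-NP", [], 0);
tracker.callback(2, 2, "O", "O", [], 0);
tracker.callback(3, 3, "I-NP", "I-NP", [], 0);

var accuracy = tracker.accuracy(); // 0.75
```

To use it with a real network, pass tracker.callback as the second argument of configuration.test and read tracker.accuracy() afterwards.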
Check the API documentation for more details on these callback functions. https://github.com/JulianMH/NounPhraseJS/blob/master/API.md
Check the license.txt file for detailed license information.
In general, most of the files are under the MIT License. The WikiWars dataset is under Wikipedia's Creative Commons License. The CoNLL-2000 dataset has no compatible license and thus is not included in this repository.