Meeting with Marten: notes
Wrote a script to convert Gigaword to bare sentences and tokenize them.
Wrote another script to train Word2vec on the resulting file.
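For reference, a minimal sketch of what the training script does (gensim API; the hyperparameters and paths here are assumptions, not necessarily the exact settings I used):
import logging
import gensim

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
# one tokenized sentence per line, streamed from disk
sentences = gensim.models.word2vec.LineSentence('output/gigaword.txt')
model = gensim.models.Word2Vec(sentences, size=100, window=5, min_count=5, workers=16)
model.save('output/gigaword')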
I thought copying the file to an SSD and running the training script from there would improve speed. It did. Running from the regular disk, the program can only use 200-300% CPU (i.e. the equivalent of 2-3 CPUs running at 100%), but with this trick it can reach 3000%.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1925 minhle 20 0 4061164 464940 11880 R 3051 0.7 43:50.84 python3
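The trick itself is just a copy before training; roughly (the paths here are hypothetical):
import os
import shutil

# copy the corpus from the shared (slow) file system to node-local SSD storage
# so the training workers are not starved by disk I/O
src = '/home/minhle/scratch/gigaword.txt'
dst = os.path.join('/local/ssd', os.path.basename(src))
if not os.path.exists(dst):
    shutil.copy(src, dst)
# ... then point the training script at dst instead of src ...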
Started my script, set to run for 12 hours. If that's not enough, we need to ask for a dedicated server.
[minhle@fs0 wsd-with-marten]$ sbatch word2vec_gigaword.job
Submitted batch job 1394958
The job failed. Error: OSError: [Errno 122] Disk quota exceeded.
I moved everything to the scratch folder. Restarted.
[minhle@fs0 wsd-with-marten]$ sbatch word2vec_gigaword.job
Submitted batch job 1395268
The Gigaword word2vec training finished sometime last week. I ran it through WordSim-353 and it got:
- Pearson's r: 0.5095
- Spearman's rho: 0.5169
Comparable to Collobert and Weston (2008), not bad? At least we know that it captured something.
>>> import gensim
>>> model = gensim.models.Word2Vec.load('output/gigaword')
>>> model.wv.evaluate_word_pairs('data/wordsim353_combined.tab')
((0.5095200162518948, 1.0200792645277232e-24), SpearmanrResult(correlation=0.51687561224435263, pvalue=1.6647396296963623e-25), 0.0)
>>> a=model.accuracy('data/questions-words.txt')
2017-05-15 11:12:17,899 : INFO : precomputing L2-norms of word weight vectors
2017-05-15 11:12:20,394 : INFO : capital-common-countries: 73.7% (373/506)
2017-05-15 11:12:29,047 : INFO : capital-world: 67.2% (1703/2536)
2017-05-15 11:12:29,664 : INFO : currency: 23.0% (41/178)
2017-05-15 11:12:35,885 : INFO : city-in-state: 9.3% (170/1822)
2017-05-15 11:12:36,933 : INFO : family: 92.8% (284/306)
2017-05-15 11:12:40,109 : INFO : gram1-adjective-to-adverb: 19.5% (181/930)
2017-05-15 11:12:41,284 : INFO : gram2-opposite: 26.0% (89/342)
2017-05-15 11:12:45,860 : INFO : gram3-comparative: 76.7% (1021/1332)
2017-05-15 11:12:47,913 : INFO : gram4-superlative: 75.0% (450/600)
2017-05-15 11:12:50,492 : INFO : gram5-present-participle: 68.7% (519/756)
2017-05-15 11:12:55,416 : INFO : gram6-nationality-adjective: 80.4% (1162/1445)
2017-05-15 11:13:00,457 : INFO : gram7-past-tense: 57.9% (858/1482)
2017-05-15 11:13:03,889 : INFO : gram8-plural: 72.8% (722/992)
2017-05-15 11:13:06,107 : INFO : gram9-plural-verbs: 58.3% (379/650)
2017-05-15 11:13:06,108 : INFO : total: 57.3% (7952/13877)
Note: this model was trained on a compute node, so you need to log into one to load it back into memory. Otherwise it might throw "ImportError: No module named 'UserString'".
Today I'll adapt gensim to train sense vectors. Found a super helpful thread where someone tried to modify it to reproduce Omer Levy's paper. Let me paste the instructions here for later reference:
- Ensure your Python environment is using your working copy of gensim for both the Python and the `_inner` compiled code (typically `.so` shared libraries) – this might involve invoking setup.py from inside your project directory, or doing a 'pip' install using a local path.
- When your changes to the `.pyx` files seem ready, use `cython` to compile them to `.c` code. (You might need to do this from the root of the project, e.g.: `cython gensim/models/word2vec_inner.pyx`)
- Use the command `python ./setup.py build_ext --inplace`, from the root of your gensim directory, to compile the `.c` to shared libraries. (Depending on how well you did step (1), you might also need to do something like `python ./setup.py install` to also install the shared libraries elsewhere.)
- Run your tests, confirming especially that your changed code (and not some older or elsewhere-installed version) is being run from where you expect it. Debug & repeat (2)-(3) as necessary.
Modified the vocabulary code to recognize and register senses.
Went out to attend Antske's talk.
Finished and checked the Python implementation of the new CBOW. When provided with a `sense_delimiter`, it will only train sense embeddings, leaving word (and context) embeddings untouched.
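To illustrate the idea (a toy sketch only, not the gensim patch; the delimiter string and the simple pull-towards-context update are my own assumptions):
import numpy as np

SENSE_DELIMITER = '---'  # hypothetical delimiter between a word and its sense tag

def split_token(token):
    # 'bank---bank.n.01' -> ('bank', 'bank.n.01'); plain tokens have no sense part
    if SENSE_DELIMITER in token:
        word, sense = token.split(SENSE_DELIMITER, 1)
        return word, sense
    return token, None

rng = np.random.RandomState(0)
word_emb = {'the': rng.rand(5), 'river': rng.rand(5), 'bank': rng.rand(5)}
sense_emb = {'bank.n.01': rng.rand(5)}

def toy_cbow_step(context_tokens, target_token, lr=0.025):
    word, sense = split_token(target_token)
    if sense is None:
        return  # word (and context) embeddings are left untouched in this mode
    context_mean = np.mean([word_emb[split_token(t)[0]] for t in context_tokens], axis=0)
    # only the sense vector receives an update (here: a simple pull towards the context mean)
    sense_emb[sense] += lr * (context_mean - sense_emb[sense])

toy_cbow_step(['the', 'river'], 'bank---bank.n.01')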
It is rather hard to write automated tests to check these requirements...
TODO:
- Implement the C version of CBOW
- Put some kind of warning in Skip-gram part
- What to do with `score_sentence_cbow`?
- How to write automated tests?
BabelNet-WordNet mappings.
Don't know how to deal with the C implementation... For now I use only the Python implementation for sense embeddings. Will it be fast enough?? I'll need to run on full data to find out.
Marten has created a version of disambiguated Wikipedia. I fixed some small bugs and ran the sense embedding script on it. This is after 1.5 hours:
2017-05-30 14:59:13,067 : INFO : PROGRESS: at 0.40% examples, 515 words/s, in_qsize 64, out_qsize 0
2017-05-30 14:59:16,951 : INFO : PROGRESS: at 0.40% examples, 516 words/s, in_qsize 64, out_qsize 0
2017-05-30 14:59:28,178 : INFO : PROGRESS: at 0.40% examples, 516 words/s, in_qsize 63, out_qsize 0
2017-05-30 14:59:32,155 : INFO : PROGRESS: at 0.40% examples, 517 words/s, in_qsize 64, out_qsize 0
2017-05-30 14:59:44,045 : INFO : PROGRESS: at 0.40% examples, 517 words/s, in_qsize 63, out_qsize 0
Apparently I need to complete that C implementation.
Hurraaaaah! I can't believe I managed to do this in an afternoon. The difference between C and Python is drastic.
2017-05-30 17:15:33,480 : INFO : PROGRESS: at 0.95% examples, 92117 words/s, in_qsize 0, out_qsize 0
2017-05-30 17:15:34,536 : INFO : PROGRESS: at 0.96% examples, 92076 words/s, in_qsize 0, out_qsize 0
2017-05-30 17:15:35,546 : INFO : PROGRESS: at 0.98% examples, 92182 words/s, in_qsize 0, out_qsize 0
2017-05-30 17:15:36,584 : INFO : PROGRESS: at 0.99% examples, 92152 words/s, in_qsize 0, out_qsize 0
2017-05-30 17:15:37,636 : INFO : PROGRESS: at 1.00% examples, 92142 words/s, in_qsize 0, out_qsize 0
For now it's good that it runs. I will need to check the resulting embeddings next week. Hopefully they make sense.
[minhle@fs0 wsd-with-marten]$ tail slurm-1411884.out
2017-05-30 17:43:44,800 : INFO : PROGRESS: at sentence #3380000, processed 7838951 words, keeping 799 word types
2017-05-30 17:43:44,898 : INFO : PROGRESS: at sentence #3390000, processed 7862696 words, keeping 799 word types
...
Met Marten, Piek and Jacopo.
Priority decided: Marten will focus on this project.
Talked to Marten: notes.
I examined the synset embeddings and they don't make sense to me, while the word embeddings are still good. Could it be that the synset embeddings weren't trained after all?
--- Example: 09044862-n ---
Synset('united_states.n.01')
Nearest neighbors:
Tomango 0.454
construes 0.444
192-mile 0.432
--``If 0.420
Temping 0.416
20:2020 0.413
Rautbord 0.409
implimenting 0.408
rideable 0.405
Zilar 0.400
>> US
Nearest neighbors:
Canada 0.695
America 0.694
Japan 0.668
Europe 0.657
Germany 0.656
U.S. 0.655
France 0.654
UK 0.625
Australia 0.620
California 0.601
The focus now is to get LSTM running so synset embeddings will need to wait.
Finished data processing code. Now it's time to write the LSTM.
Divided into 3959 batches (1000195 elements each, std=23020, except last batch of 1000008)... Done.
Added 4363547 elements as padding (0.11%).
Found a super useful guide here.
In [38]: # Create input data
...: X = np.random.randn(2, 10, 8)
...:
...: # The second example is of length 6
...: X[1,6:] = 0
...: X_lengths = [10, 6]
...:
...: cell = tf.contrib.rnn.LSTMCell(num_units=64, state_is_tuple=True)
...:
...: outputs, last_states = tf.nn.dynamic_rnn(
...: cell=cell,
...: dtype=tf.float64,
...: sequence_length=X_lengths,
...: inputs=X)
...:
...: result = tf.contrib.learn.run_n(
...: {"outputs": outputs, "last_states": last_states},
...: n=1,
...: feed_dict=None)
...:
...: assert result[0]["outputs"].shape == (2, 10, 64)
...:
...: # Outputs for the second example past length 6 should be 0
...: assert (result[0]["outputs"][1,7,:] == np.zeros(cell.output_size)).all()
...:
Managed to train the LSTM model (and load it back, which is non-trivial!)
Implemented "sampled" softmax of Jean et al. (2015):
Before:
Epoch: 10
batch #100 cost: 7.1678638
batch #200 cost: 7.1870413
Epoch: 10 -> Train cost: 7.264, elapsed time: 6.4 minutes
After:
Epoch: 10
batch #100 cost: 7.0222702
batch #200 cost: 6.9646182
Epoch: 10 -> Train cost: 6.980, elapsed time: 2.9 minutes
So the time was cut by more than half and the NLL is lower (because the softmax is computed over a smaller vocabulary).
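For later reference, a hedged sketch of the idea using TensorFlow's built-in candidate sampling (tf.nn.sampled_softmax_loss, an approximation of the paper's per-batch target vocabulary; the sizes below are made up):
import tensorflow as tf

vocab_size, emb_dim, num_sampled = 1000000, 100, 8192

softmax_w = tf.get_variable('softmax_w', [vocab_size, emb_dim])
softmax_b = tf.get_variable('softmax_b', [vocab_size])
hidden = tf.placeholder(tf.float32, [None, emb_dim])  # LSTM output at the target position
labels = tf.placeholder(tf.int64, [None, 1])          # true target word ids

# during training, the softmax is computed over num_sampled sampled classes
# plus the true class instead of the full vocabulary
loss = tf.reduce_mean(tf.nn.sampled_softmax_loss(
    weights=softmax_w, biases=softmax_b, labels=labels, inputs=hidden,
    num_sampled=num_sampled, num_classes=vocab_size))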
Added resampling of training batches so that each token has an equal chance of becoming the target and the batch order is erased.
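Roughly what the resampling does (a sketch under my own assumptions about the batch layout, not the exact code):
import numpy as np

rng = np.random.RandomState(42)
batch_sizes = np.array([1000195, 1000171, 999983])  # hypothetical token counts per batch
probs = batch_sizes / batch_sizes.sum()

for _ in range(5):
    b = rng.choice(len(batch_sizes), p=probs)  # batches are drawn in random order, weighted by size
    target = rng.randint(batch_sizes[b])       # so every token is equally likely to become the target
    # ... run one training step on batch b with this target position ...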
The batching and padding function runs terribly slowly. The last 1M sentences took almost an hour to complete.
processed 172000000 items, elapsed time: 1037.6 minutes...
processed 173000000 items, elapsed time: 1079.0 minutes...
processed 174000000 items, elapsed time: 1119.9 minutes...
Logged into the node and found my script using all the RAM, a lot of swap, and only 3% CPU: an obvious case of memory thrashing.
%Cpu(s): 0.0 us, 0.1 sy, 0.0 ni, 96.9 id, 2.9 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 99.5/65879032 [|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||]
KiB Swap: 73.3/33554428 [||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||| ]
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
4592 minhle 20 0 82.706g 0.057t 824 D 3.3 92.9 208:45.24 python3
246 root 20 0 0 0 0 S 0.7 0.0 75:19.89 kswapd0
247 root 20 0 0 0 0 S 0.7 0.0 6:33.39 kswapd1
Attempted a dry run and found out that just loading all sentences into memory already causes it to overflow into swap. On disk, the pickle file is 11G; after loading into memory, it becomes 100G. And that was on the login node; the compute nodes have even less RAM. I probably need to sort on disk instead and convert everything into NumPy arrays as soon as I load it into memory.
References
Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2015). On Using Very Large Target Vocabulary for Neural Machine Translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (pp. 1–10). Beijing, China: Association for Computational Linguistics. Retrieved from http://www.aclweb.org/anthology/P15-1001
The script finished. It's surprisingly quick.
[minhle@fs0 wsd-with-marten]$ wc -l output/gigaword.txt
175771829 output/gigaword.txt
[minhle@fs0 wsd-with-marten]$ wc -l output/gigaword.txt.sorted
122811493 output/gigaword.txt.sorted
[minhle@fs0 wsd-with-marten]$ grep "Train co" slurm-1489807.out
Epoch: 1 -> Train cost: 7.144, elapsed time: 35.4 minutes
Epoch: 2 -> Train cost: 6.848, elapsed time: 70.9 minutes
Epoch: 3 -> Train cost: 6.323, elapsed time: 106.3 minutes
Epoch: 4 -> Train cost: 5.966, elapsed time: 141.8 minutes
Epoch: 5 -> Train cost: 5.740, elapsed time: 177.3 minutes
Epoch: 6 -> Train cost: 5.629, elapsed time: 212.7 minutes
Epoch: 7 -> Train cost: 5.549, elapsed time: 248.2 minutes
Epoch: 8 -> Train cost: 5.481, elapsed time: 283.7 minutes
Epoch: 9 -> Train cost: 5.455, elapsed time: 319.2 minutes
Epoch: 10 -> Train cost: 5.395, elapsed time: 354.6 minutes
Monitor dev cost and use it as a stopping criterion.
Reported the timeline following this guide. Looks like the program does spend most of its time on the GPU. I don't know if I can make it any better.
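The tracing itself is short (a self-contained TF 1.x sketch; in the real script the traced op is the training step):
import tensorflow as tf
from tensorflow.python.client import timeline

# a dummy op standing in for the training step
op = tf.reduce_sum(tf.matmul(tf.random_normal([1000, 1000]), tf.random_normal([1000, 1000])))
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
with tf.Session() as sess:
    sess.run(op, options=run_options, run_metadata=run_metadata)
    trace = timeline.Timeline(run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())
# open chrome://tracing and load timeline.json to see which ops run where and for how long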
Started a new experiment with big corpus, big model (more embedding dims, more hidden nodes).
Does it use the GPU? I measured it and it did. Perhaps when Jacopo ran `nvidia-smi`, it showed only his usage.
[minhle@node001 ~]$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 370.28 Driver Version: 370.28 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX TIT... On | 0000:03:00.0 Off | N/A |
| 22% 41C P2 162W / 250W | 11663MiB / 12206MiB | 66% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 19977 C python3 11661MiB |
+-----------------------------------------------------------------------------+
Testing trivial model (#ce39e02a5d34705a069915e29d51ebd7888ea649). It runs much faster indeed.
Epoch: 10 finished, elapsed time: 0.3 minutes
Train cost: 0.123
Dev cost: 11.142, hit@100: 0.0%
Compared to the normal model below, the trivial model is 8.67 times faster, i.e. CPU operations take roughly 11% of the running time. This can of course be improved.
Epoch: 10 finished, elapsed time: 2.6 minutes
Train cost: 7.148
Dev cost: 7.906, hit@100: 0.5%
How to load batches of different shapes into memory and use them in the same variable? I tested `tf.assign()` to see if it copies memory around, and apparently it does.
import tensorflow as tf

var = tf.Variable(0.9)
var2 = tf.Variable(0.0)
# assign() returns an op that copies the value of var into var2 when run
copy_first_variable = var2.assign(var)
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)
print('Initial value: %f' % sess.run(var2))
sess.run(copy_first_variable)
print('After assigning: %f' % sess.run(var2))
add1 = var2.assign_add(1)
sess.run(var2)  # reading var2 does not trigger the add
print('After adding: %f' % sess.run(add1))
# the source variable is unchanged, i.e. the value was copied, not shared
print('Value of the source: %f' % sess.run(var))
Output:
Initial value: 0.000000
After assigning: 0.900000
After adding: 1.900000
Value of the source: 0.900000
An alternative approach is to store everything in a flattened array + batch indices + batch sizes. We can sample the batch info and use it to locate the data.
import numpy as np
import tensorflow as tf

# all data lives in one flat vector; each row of batch_info is (start, n_rows, n_cols)
data = tf.Variable(np.random.rand(10000))
batch_info = tf.Variable(np.array([[0, 100, 20], [2000, 30, 30], [2900, 50, 100], [7900, 30, 70]]), dtype=tf.int32)
i, = tf.train.slice_input_producer([batch_info])
# slice the flat vector and reshape it into the batch's own shape
batch = tf.reshape(data[i[0]:i[0]+i[1]*i[2]], (i[1], i[2]))
# pick a random column of the batch as targets
col = tf.random_uniform((1,), maxval=tf.shape(batch)[1], dtype=tf.int32)
y = batch[:, col[0]]
sess = tf.Session()
init = tf.initialize_all_variables()
sess.run(init)
coord = tf.train.Coordinator()
threads = tf.train.start_queue_runners(sess=sess, coord=coord)
for _ in range(100):
    batch_val, y_val = sess.run([batch, y])
    print('%s\t%s' % (batch_val.shape, y_val.shape))
coord.request_stop()
coord.join(threads)
sess.close()
Output:
(30, 30)
(30, 70)
(100, 20)
(50, 100)
(50, 100)
...
A problem with this approach is that I can't sample batches according to length anymore, so the loss will be distorted (shorter sentences are over-represented). Let's see how it goes...
Hit a wall: tensorflow/tensorflow#9506
I tested a different solution, but int32 isn't allowed on the GPU and float32 doesn't work with `tf.nn.embedding_lookup`.
import numpy as np
import tensorflow as tf

with tf.device("/gpu:0"):
    a = tf.Variable([0, 1, 2], dtype=tf.float32)
    E = tf.Variable(np.random.rand(1000, 50))
    embs = tf.nn.embedding_lookup(E, a, validate_indices=False)
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(a)
Met with Jacopo. Looks like the computer science department can't solve our issue. One solution is to write a short paper (or technical report) on reproducing the Google paper. But first Jacopo would like to give it a shot.
Ran things on the full data and the program started churning out NaNs. Turns out that int32 is not big enough, so I get negative indices.
>>> import numpy as np
>>> train=np.load('output/gigaword-lstm-wsd.train.npz')
>>> idx = train['indices']
>>> idx[-1]
array([-1123920531, 117, 101, 377887591, 3973, 3972], dtype=int32)
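The negative index is just int32 wrap-around; a two-line check:
import numpy as np

offset = np.int32(2**31 - 1)   # largest value an int32 can hold
print(offset + np.int32(1))    # wraps around to -2147483648 (with an overflow warning on recent NumPy)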
I converted everything into int64 and now the preparation script throws MemoryError :-( Now I need to take care of memory usage again.
Plan: use a memory map to perform linearization and serialization at the same time. Write both sentences and vocabs into one vector (they are flattened anyway); indices are written to a 2-D array.
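A sketch of that plan (file names, sizes and the token_id_stream generator are made up for illustration):
import numpy as np

def token_id_stream():
    # hypothetical stand-in for reading the tokenized corpus; yields arrays of token ids
    rng = np.random.RandomState(0)
    for _ in range(100000):
        yield rng.randint(0, 50000, size=rng.randint(5, 100))

num_sentences, total_tokens = 100000, 10**7
data = np.memmap('gigaword-data.int32', dtype=np.int32, mode='w+', shape=(total_tokens,))
indices = np.zeros((num_sentences, 2), dtype=np.int64)  # (start, length) per sentence

pos = 0
for sent_id, token_ids in enumerate(token_id_stream()):
    n = len(token_ids)
    data[pos:pos + n] = token_ids   # written straight to disk, never fully in RAM
    indices[sent_id] = (pos, n)
    pos += n
data.flush()
np.save('gigaword-indices.npy', indices)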
I should have realized that int64 is troublesome, since I had seen this discussion on GitHub earlier. After solving all the problems with saving and loading data, now I get this error:
Input 'strides' of 'StridedSlice' Op has type int32 that does not match type int64 of argument 'begin'
This is how to reproduce it:
In [35]: import tensorflow as tf
...: a=tf.Variable([0,1,2], dtype=tf.int64)
...: i = tf.constant(1, dtype=tf.int64)
...: a[i]
...:
[...]
TypeError: Input 'strides' of 'StridedSlice' Op has type int32 that does not match type int64 of argument 'begin'.
In [36]: import tensorflow as tf
...: a=tf.Variable([0,1,2], dtype=tf.int32)
...: i = tf.constant(1, dtype=tf.int32)
...: a[i]
...:
Out[36]: <tf.Tensor 'strided_slice_22:0' shape=() dtype=int32>
Submitted an issue (#11318).
Rolled back everything related to preloading data. I hate it anyway since the code is complex and awkward. Now I'll pack everything and send it to Jacopo.
The thing is running! Sent to Marten, Jacopo and Rutger. Next week I won't work on WSD so I can spend more time on coreference.
The issue turns out to have been solved 2 months ago, but I ran into another issue.
Got results from Marten. The "Google" model at ~10 epochs got 61% (WSD). This is still lower than the baseline but better than the "large" model.
Jacopo suggested that we should write up a report.
I suggested that we train the "google" model on different sizes of the data to make the case that we need a 100B-word corpus to get Google's results.
Fixed miscellaneous problems. Generalized the code. Fixed WSI.
Submitted training on 1% and 10% of Gigaword. Scheduled to run for at most 2 weeks.
[minhle@fs0 wsd-with-marten]$ sbatch train-lstm-wsd-01pc-data-google-model.job
Submitted batch job 1529915
[minhle@fs0 wsd-with-marten]$ sbatch train-lstm-wsd-10pc-data-google-model.job
Submitted batch job 1529916
Started training WSI on DAS-5 (large model). Scheduled to run for 2 weeks.
[minhle@fs0 wsd-with-marten]$ sbatch train-lstm-wsi.job
Submitted batch job 1532704
All jobs died because of disk space :(( I should have used TensorFlow's support for restarting training.
Implemented a managed session. Started training again. (I couldn't restore )
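The managed session in a nutshell (TF 1.x tf.train.Supervisor; the log directory and the toy loop are placeholders):
import tensorflow as tf

step = tf.train.get_or_create_global_step()
increment = tf.assign_add(step, 1)

# the Supervisor saves checkpoints to logdir every save_model_secs seconds and,
# on restart, automatically restores the latest one before the loop resumes
sv = tf.train.Supervisor(logdir='output/model-checkpoints', save_model_secs=600)
with sv.managed_session() as sess:
    while not sv.should_stop():
        if sess.run(increment) >= 100:  # stand-in for the real training loop
            break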
[minhle@fs0 wsd-with-marten]$ sbatch train-lstm-wsd-01pc-data-google-model.job
Submitted batch job 1538265
[minhle@fs0 wsd-with-marten]$ sbatch train-lstm-wsd-10pc-data-google-model.job
Submitted batch job 1538275
Met Jacopo about my WSI idea. He liked it and, since Marten and I are both super busy, suggested that we could hire a student assistant or supervise a master's thesis on this topic.
Made an outline of the reimplementation paper. There are some numbers we need to fill in.
Helped Marten improve the performance of his evaluation code.
Now I need to measure speedups.
Out of space. The prepared training data was removed a while ago (to get more space, apparently). I need to move EvEn to Cartesius and prepare the data again...
[minhle@fs0 wsd-with-marten]$ nohup python3 -u prepare-lstm-wsd.py output/gigaword.txt output/gigaword-lstm-wsd &
[1] 29077
$ zip EvEn.zip -r Even
Now just wait and wait and wait...
Moved EvEn to Cartesius, got a little more space.
Strengthened reproducibility by versioning the preprocessed data.
Started preprocessing data to measure speedups. I will need an unoptimized version of the batches so it will take some time.
[minhle@fs0 wsd-with-marten]$ ls -l preprocessed-data/40e2c1f
total 0
lrwxrwxrwx 1 minhle minhle 75 Sep 12 16:52 gigaword.txt -> /home/minhle/scratch/wsd-with-marten/preprocessed-data/694cb4d/gigaword.txt
lrwxrwxrwx 1 minhle minhle 82 Sep 12 16:52 gigaword.txt.sorted -> /home/minhle/scratch/wsd-with-marten/preprocessed-data/694cb4d/gigaword.txt.sorted
[minhle@fs0 wsd-with-marten]$ nohup python3 -u prepare-lstm-wsd.py > preprocessed-data/40e2c1f/prepare-lstm-wsd.py.out 2>&1 &
[1] 12220
[minhle@fs0 wsd-with-marten]$ tail -f preprocessed-data/40e2c1f/prepare-lstm-wsd.py.out
nohup: ignoring input
Building vocabulary...
processed 1000000 items, elapsed time: 0.3 minutes...
processed 2000000 items, elapsed time: 0.7 minutes...
processed 3000000 items, elapsed time: 1.0 minutes...
processed 4000000 items, elapsed time: 1.3 minutes...
Crashed due to disk space limit. Mailed Kees asking for more space.
Kees increased my quota to 400G! Now I have enough to run my experiment.
Wasted another morning wrestling with TensorFlow. I knew that deleting packages was a bad idea. Now my TensorFlow installation doesn't work anymore. Emailed Kees asking about installing cuDNN v6.
Managed to run the preprocessing script anyway:
[minhle@fs0 wsd-with-marten]$ nohup python3 -u prepare-lstm-wsd.py > preprocessed-data/40e2c1f/prepare-lstm-wsd.py.out 2>&1 &
[1] 22530
[minhle@fs0 wsd-with-marten]$ tail -f preprocessed-data/40e2c1f/prepare-lstm-wsd.py.out
nohup: ignoring input
Reading vocabulary from preprocessed-data/40e2c1f/gigaword-for-lstm-wsd.index.pkl... Done.
Sentences are already sorted at preprocessed-data/40e2c1f/gigaword.txt.sorted
Dividing and padding...
processed 1000000 items, elapsed time: 0.2 minutes...
processed 2000000 items, elapsed time: 0.4 minutes...
processed 3000000 items, elapsed time: 0.6 minutes...
processed 4000000 items, elapsed time: 0.8 minutes...
processed 5000000 items, elapsed time: 1.0 minutes...
Label propagation:
- http://scikit-learn.org/stable/modules/label_propagation.html
- https://github.com/yamaguchiyuto/label_propagation
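A tiny scikit-learn example of the idea (toy data, not our sense-embedding features; -1 marks unlabeled points):
import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[0.0], [0.1], [1.0], [1.1], [0.5]])
y = np.array([0, -1, 1, -1, -1])  # only the first and third points are labeled

model = LabelPropagation(kernel='rbf', gamma=20)
model.fit(X, y)
print(model.transduction_)  # labels after propagation, for all points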
Implemented a test script.
Turned the test script into a class for performing label propagation.
When I tested it with tiny data, one of the labeled instances changed and the unlabeled instance got its label. Isn't that strange?
input: {'horse': array([0, 1, -1]), 'dog': array([0, 1, -1])}
output: {'horse': array([1, 1, 0]), 'dog': array([1, 1, 0])}
TODO: try different values of parallel_iterations in tf.nn.dynamic_rnn
Started working on experiments that would run on Cartesius: save all models, put results into versioned directories, vary the random seed.
Took me a morning to set up Cartesius. Started the first step: preprocessing Gigaword.
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/process-gigaword.job
Submitted batch job 3718227
[minhle@int1 wsd-dynamic-sense-vector]$ tail -f slurm-3718227.out
Tried parallel training on Cartesius. It could only run one training script at a time; all the rest failed and reported 4 different errors (!): "Blas GEMM launch failed", "OOM when allocating tensor", "failed initializing StreamExecutor for CUDA device", and "Failed to create session".
Compared the preprocessed version of Gigaword on DAS-5 with an (incomplete) preprocessed version on Cartesius and found that they are different.
Tried to parallelize BeautifulSoup, which had a negative effect: processing 10 documents used to take 1.2 minutes and now takes 1.6 minutes. Moreover, processing the full 737 files doesn't take that much time anyway. The bottleneck turns out to be spaCy running its models. The current version uses threads instead of processes, so the purported parallelization actually does nothing.
Why did I get bogged down in this optimization hell? The preprocessing runs for more than a day, but so what? It also creates unstable output, but I didn't find any evidence of that leading to incorrect results. I wasted a day and now I wonder if I created a new bug. Really, I shouldn't have gone down this road.
Submitted the job on Cartesius. It's pending...
and this is the reason:
"This e-mail is to inform you that due to a vulnerability in the system software, the Cartesius system has been brought to an unscheduled maintenance mode. The batch system is being drained which means that your jobs, although they can still be committed to the queue, will not be executed. Please note that jobs that were already dispatched and being executed, have not been canceled.
We are currently in contact with our supplier for a solution. Our estimate is that the system will be up and running again by Tuesday morning, next week. Please find the actual system status on https://userinfo.surfsara.nl/systems/status."
`prepare-lstm-wsd.job` finished. Now try `exp-variation.job`.
It started:
[minhle@int2 wsd-dynamic-sense-vector]$ cat slurm-3725929.out
Started: Sat Nov 4 00:37:02 CET 2017
Running seed 124. Log is written to output/2017-11-04-a19751e/lstm-wsd-gigaword-small_seed-124.log
[minhle@int2 wsd-dynamic-sense-vector]$ tail -f output/2017-11-04-a19751e/lstm-wsd-gigaword-small_seed-124.log
...
Epoch #1:
finished 1000 of 49551 batches, sample batch cost: 7.0678520
[minhle@gcn42 ~]$ nvidia-smi
Sat Nov 4 00:46:51 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26 Driver Version: 375.26 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla K40m On | 0000:02:00.0 Off | 0 |
| N/A 45C P0 119W / 235W | 10916MiB / 11439MiB | 83% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla K40m On | 0000:82:00.0 Off | 0 |
| N/A 36C P0 62W / 235W | 10873MiB / 11439MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| 0 77394 C python3 10912MiB |
| 1 77394 C python3 10869MiB |
+-----------------------------------------------------------------------------+
A big mistake: I didn't set `max_to_keep` (it defaults to 5), so the saver removes my best model but keeps the latest one.
Fixed it. This time I'll run a test job before submitting. The last job took me 4 days and 1% of my budget.
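The fix, roughly (variable and path names are illustrative): keep a separate saver for the best model so the rolling checkpoint saver can never prune it.
import tensorflow as tf

w = tf.Variable(tf.zeros([10, 10]), name='w')

latest_saver = tf.train.Saver(max_to_keep=5)  # rolling checkpoints, old ones pruned automatically
best_saver = tf.train.Saver(max_to_keep=1)    # the best-on-dev model, only overwritten by a better one

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    latest_saver.save(sess, 'output/lstm-wsd-latest', global_step=1)
    best_saver.save(sess, 'output/lstm-wsd-best-model')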
Submitted `test.job`, waiting (Priority)... Done. It works. Now it's time to restart `exp-variation.job`.
Submitted the variation experiments. I divided the job file into two files because in 5 days we can finish at most 5 seeds.
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-variation1.job
Submitted batch job 3742501
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-variation2.job
Submitted batch job 3742545
[minhle@int1 wsd-dynamic-sense-vector]$ squeue | grep minh
3742501 gpu exp-vari minhle PD 0:00 1 (Priority)
3742545 gpu exp-vari minhle PD 0:00 1 (Priority)
Questions:
- why does epoch number start at 1?
- how to use 2 GPUs?
- can the RAM fit 2 instances?
The 5 days have ended. 8 experiments completed successfully, 2 were aborted.
[minhle@int1 wsd-dynamic-sense-vector]$ ls output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed*best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-124-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-16651-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-261-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-293-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-651-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-91-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-9165-best-model.index
output/2017-11-08-ce8a024/lstm-wsd-gigaword-small_seed-961-best-model.index
Now I'll evaluate them:
sbatch cartesius/exp-variation-score.job
Error: one_full_experiment_v2.sh: No such file or directory
--> mailed Marten.
Tried to measure speed with(out) some optimization tricks. Found out that the preprocessed data (for training) doesn't have length information. Need to rerun the preprocessing script again! (Maybe I should work around this... Changing the script would mean more debugging and more waiting...)
I looked at the 280 lines of code in `prepare-lstm-wsd.py` and didn't want to touch any of them. The moment I change anything, I'll go down the rabbit hole and a day will pass very fast.
Started a complex network of jobs on Cartesius. I hope that they all work...
[minhle@int1 wsd-dynamic-sense-vector]$ git checkout a453bc1
Previous HEAD position was a9618a6... make them runnable
HEAD is now at a453bc1... run on Cartesius
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/prepare-lstm-wsd.job
Submitted batch job 3764660
[minhle@int1 wsd-dynamic-sense-vector]$ squeue | grep minh
3764660 normal prepare- minhle R 0:11 1 tcn635
[minhle@int1 wsd-dynamic-sense-vector]$ git checkout a9618a6
Previous HEAD position was a453bc1... run on Cartesius
HEAD is now at a9618a6... make them runnable
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/prepare-lstm-wsd.job
Submitted batch job 3764663
[minhle@int1 wsd-dynamic-sense-vector]$ squeue | grep minh
3764663 normal prepare- minhle R 0:02 1 tcn721
3764660 normal prepare- minhle R 1:25 1 tcn635
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch --dependency=afterok:3764660 cartesius/exp-variation-score.job
Submitted batch job 3764664
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch --dependency=afterok:3764663 cartesius/exp-optimization1.job
Submitted batch job 3764741
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch --dependency=afterok:3764663 cartesius/exp-optimization2.job
Submitted batch job 3764742
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch --dependency=afterok:3764663 cartesius/exp-optimization3.job
Submitted batch job 3764743
[minhle@int1 wsd-dynamic-sense-vector]$ squeue | grep minh
3764669 gpu exp-opti minhle PD 0:00 1 (Dependency)
3764741 gpu exp-opti minhle PD 0:00 1 (Dependency)
3764742 gpu exp-opti minhle PD 0:00 1 (Dependency)
3764743 gpu exp-opti minhle PD 0:00 1 (Dependency)
3764663 normal prepare- minhle R 7:54 1 tcn721
3764660 normal prepare- minhle R 9:17 1 tcn635
Explanation:
- To run the scoring script for the variation experiment, I need to rerun the preparation script at version a453bc1.
- To run the optimization experiments, I need to run (for the first time) the preparation script at version a9618a6.
- I divided the optimization experiments into smaller jobs to avoid hitting the 5-day limit; all of them will start simultaneously.
Somehow my jobs were killed early, actually on the same day I submitted them :-| I'll resubmit and contact the Cartesius people.
[minhle@int1 wsd-dynamic-sense-vector]$ git checkout a453bc1
Previous HEAD position was a9618a6... make them runnable
HEAD is now at a453bc1... run on Cartesius
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/prepare-lstm-wsd.job
Submitted batch job 3771725
[minhle@int1 wsd-dynamic-sense-vector]$ squeue | grep minh
3771725 normal prepare- minhle R 0:07 1 tcn618
[minhle@int1 wsd-dynamic-sense-vector]$ git checkout a9618a6
Checking out files: 100% (47/47), done.
Previous HEAD position was a453bc1... run on Cartesius
HEAD is now at a9618a6... make them runnable
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/prepare-lstm-wsd.job
Submitted batch job 3771726
[minhle@int1 wsd-dynamic-sense-vector]$ squeue | grep minh
3771726 normal prepare- minhle R 0:04 1 tcn619
3771725 normal prepare- minhle R 0:26 1 tcn618
It was killed again; SURFsara hasn't replied.
Tested job files with `--time=1:00` and `--time 1:00` (without the equals sign). They both seem to work properly, and when they are cancelled there is a message, for example:
slurmstepd: *** JOB 3774760 ON tcn559 CANCELLED AT 2017-11-21T15:00:38 DUE TO TIME LIMIT ***
So my preprocessing script must have been killed for a different reason.
Set `dtype=np.int32` for all NumPy arrays. Now the preparation script is running without error while consuming 99% of RAM. I'll need to use a memory map, otherwise I'll have no space for my model.
Marten and I fixed some bugs in Marten's evaluation script.
Tried loading the shuffled dataset; it uses only around 40 GB (of 62.5 GB), but creating models resulted in a memory error. Luckily, the GPU nodes have 96 GB, so I hope that's enough.
Forgot to disable subvocab creation for `sampled_softmax=False`. Did it now. Restarted the experiment, only `exp-optimization1.job` for now.
[minhle@int1 ~]$ squeue | grep minh
3777791 gpu exp-opti minhle R 11:40 1 gcn39
The training did work but super slowly. After 8h38m, it "finished 27000 of 157344 batches" (17% of an epoch). At this speed it will take 19 days to finish 10 epochs. In other words, it will consume all of my budget. I'll use this to estimate
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-optimization2.job
Submitted batch job 3778073
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-optimization3.job
Submitted batch job 3778074
Got these numbers for the variation experiment:
[minhle@int1 output]$ tail -vn +1 results-seed-*/semcor/results.txt
==> results-seed-124/semcor/results.txt <==
912
==> results-seed-16651/semcor/results.txt <==
905
==> results-seed-261/semcor/results.txt <==
914
==> results-seed-293/semcor/results.txt <==
923
==> results-seed-651/semcor/results.txt <==
946
==> results-seed-91/semcor/results.txt <==
918
==> results-seed-9165/semcor/results.txt <==
929
==> results-seed-961/semcor/results.txt <==
911
[minhle@int1 output]$ tail -vn +1 results-seed-*/mun/results.txt
==> results-seed-124/mun/results.txt <==
989
==> results-seed-16651/mun/results.txt <==
984
==> results-seed-261/mun/results.txt <==
989
==> results-seed-293/mun/results.txt <==
994
==> results-seed-651/mun/results.txt <==
1009
==> results-seed-91/mun/results.txt <==
985
==> results-seed-9165/mun/results.txt <==
1004
==> results-seed-961/mun/results.txt <==
990
An experiment failed because I forgot to copy one preprocessed data file to the right folder. Restarted:
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-optimization2.job
Submitted batch job 3778716
Restarted yet again because subvocabs weren't prepared even for sampled softmax.
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-h256p64.job
Submitted batch job 3779735
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-optimization2.job
Submitted batch job 3779738
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-optimization3.job
Submitted batch job 3779739
Canceled h256p64 because the result directory is incorrect. Reran and started data size experiments:
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-data-size.job 01
Submitted batch job 3779915
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-data-size.job 10
Submitted batch job 3779916
[minhle@int1 wsd-dynamic-sense-vector]$ !squeu
squeue | grep minh
3779919 gpu exp-h256 minhle R 0:06 1 gcn46
3779916 gpu exp-data minhle R 1:54 1 gcn18
3779915 gpu exp-data minhle R 1:57 1 gcn8
3779739 gpu exp-opti minhle R 58:37 1 gcn15
3779738 gpu exp-opti minhle R 58:40 1 gcn14
Yet another problem: turns out that I was actually using the full dataset instead of 1% or 10%. Cancelled the jobs and restarted again.
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-data-size.job 01
Submitted batch job 3780005
[minhle@int1 wsd-dynamic-sense-vector]$ sbatch cartesius/exp-data-size.job 10
Submitted batch job 3780006
[minhle@int1 wsd-dynamic-sense-vector]$ !sque
squeue | grep minh
3780006 gpu exp-data minhle R 0:08 1 gcn39
3780005 gpu exp-data minhle R 1:32 1 gcn18
3779919 gpu exp-h256 minhle R 35:06 1 gcn46
3779739 gpu exp-opti minhle R 1:33:37 1 gcn15
3779738 gpu exp-opti minhle R 1:33:40 1 gcn14
Training on 1% of the data takes ~40 min per epoch:
Epoch #1 finished:
Train cost: 8.001
Dev cost: 10.722, hit@100: 0.5%
Saved best model to output/2017-11-23-fb4baa4/lstm-wsd-gigaword_01-pc_google-best-model
Elapsed time: 38.8 minutes
Stopped the data-size experiment because the dev set was different for each data size. It could be misleading when I plot the learning-curve graph.