<!doctype html>
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<title>Gideon Kotzé - Publications</title>
<link rel="shortcut icon" href="favicon.ico" />
<!-- Load CSS -->
<link href="css/style.css" rel="stylesheet" type="text/css" />
<!-- Load Fonts -->
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Droid+Serif:regular,italic,bold,bolditalic" type="text/css" />
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Droid+Sans:regular,bold" type="text/css" />
<!-- Load jQuery library -->
<script type="text/javascript" src="scripts/jquery-1.6.2.min.js"></script>
<!-- Load custom js -->
<script type="text/javascript" src="scripts/panelslide.js"></script>
<script type="text/javascript" src="scripts/custom.js"></script>
<!-- Load topcontrol js -->
<script type="text/javascript" src="scripts/scrolltopcontrol.js"></script>
<!-- Load NIVO Slider -->
<link rel="stylesheet" href="css/nivo-slider.css" type="text/css" media="screen" />
<link rel="stylesheet" href="css/nivo-theme.css" type="text/css" media="screen" />
<script src="scripts/jquery.nivo.slider.pack.js" type="text/javascript"></script>
<script src="scripts/nivo-options.js" type="text/javascript"></script>
<!-- Load fancybox -->
<script type="text/javascript" src="scripts/jquery.fancybox-1.3.4.pack.js"></script>
<script type="text/javascript" src="scripts/jquery.easing-1.3.pack.js"></script>
<script type="text/javascript" src="scripts/jquery.mousewheel-3.0.4.pack.js"></script>
<script type="text/javascript">
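// Toggle a collapsible block (abstract or BibTeX) between the 'hidden' and 'unhidden' CSS classes.
function unhide(divID) {
  var item = document.getElementById(divID);
  if (item) {
    item.className = (item.className === 'hidden') ? 'unhidden' : 'hidden';
  }
}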
</script>
<script type="text/javascript">
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-35431506-1']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
</script>
<link rel="stylesheet" href="css/jquery.fancybox-1.3.4.css" type="text/css" media="screen" />
</head>
<body>
<!--This is the START of the header-->
<div id="topcontrol" style="position: fixed; bottom: 5px; left: 960px; opacity: 1; cursor: pointer;" title="Go to Top"></div>
<div id="header-wrapper">
<div id="header">
<div id="logo"><a href="index.html"><img src="images/logo.png" width="100" height="80" alt="logo" /></a></div>
<div id="rightoflogo"> <a href="index.html"><img src="images/myphoto.jpg" width="95" height="95" alt="logo" /></a> </div>
<div id="header-text">
<h4>Gideon J. Kotzé</h4>
<h6><a href="index.html">Home</a></h6>
</div>
</div>
</div>
<!--END of header-->
<!--This is the START of the menu-->
<div id="menu-wrapper">
<div id="main-menu">
<ul>
<li><a class="selected" href="index.html">Home</a></li>
<li><a href="about.html">About</a></li>
<li><a href="downloads/cv_kotze2021.pdf">Résumé</a></li>
<li><a href="publications.html">Publications -></a></li>
<li><a href="projects.html">Projects</a></li>
<li><a href="links.html">Links</a></li>
<li><a href="downloads.html">Downloads</a></li>
<li><a href="mailto:[email protected]">Contact</a></li>
</ul>
</div>
<!--This is the START of the footer-->
<!-- <div id="footer">
<div id="social-box">
<ul>
<li>
<a href="https://www.linkedin.com/in/gideon-kotz%C3%A9-53142a10/">
<img src="http://www.linkedin.com/img/webpromo/btn_viewmy_160x33.png" width="160" height="33" border="0" alt="View Gideon Kotzé's profile on LinkedIn">
</a>
</li>
</ul>
</div>
</div>
--> <!--END of footer-->
</div>
<!--END of menu-->
<!--This is the START of the content-->
<div id="content">
<!-- <div class="about"> -->
<h5>This is a list of my academic publications.</h5>
<!-- <div class="spacer"></div> -->
<br>
<h4>2022</h4>
<p class="publications">Kapanadze, O., Kotzé, G., and Hanneforth, T. 2022. Building Resources for Georgian Treebanking-based NLP. In: Özgün, A., Zinova, Y. (eds) Language, Logic, and Computation. TbiLLC 2019. <font style="font-style:italic !important">Lecture Notes in Computer Science</font>, vol 13206, pp. 60-78. Springer, Cham. URL: <a href="https://doi.org/10.1007/978-3-030-98479-3_4">https://doi.org/10.1007/978-3-030-98479-3_4</a> <a href="javascript:unhide('abstract_KapanadzeEA:2022');">[Abstract]</a> <a href="javascript:unhide('bibtex_KapanadzeEA:2022');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KapanadzeEA:2022" class="hidden">
<p align="justify">We describe past and present work surrounding the development of treebank related NLP resources for Georgian. In particular, we provide an overview of efforts made in the development of a morphologically and syntactically annotated treebank for this non-configurational language, as well as its application in the development of a syntactic parser. Building on this, we also report ongoing work in utilizing manual and automatic alignment solutions for the creation of a Georgian/German parallel treebank. The end goal is the development of resources and tools for improved computational processing and linguistic analysis of the Georgian language.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_KapanadzeEA:2022" class="hidden">
<p>@Article{KapanadzeEA:2022,</p>
<p class="tab">author = {Kapanadze, Oleg and Kotz\'{e}, Gideon and Hanneforth, Thomas},</p>
<p class="tab">title = {Building Resources for Georgian Treebanking-based NLP},</p>
<p class="tab">journal = {Lecture Notes in Computer Science},</p>
<p class="tab">year = {2022},</p>
<p class="tab">pages = {60--78},</p>
<p class="tab">volume = {13206},</p>
<p class="tab">issue = {},</p>
<p class="tab">publisher = {Springer},</p>
<p class="tab">doi = {10.1007/978-3-030-98479-3_4}</p>
<p>}</p>
</div>
<br>
<h4>2020</h4>
<p class="publications">Kotzé, G. and <a href="https://fwolff.net.za/">Wolff, F.</a> 2020. Exchanging image processing and OCR components in a Setswana digitisation pipeline. <font style="font-style:italic !important">South African Computer Journal</font>. Vol. 32(2), pp. 281-231. South African Institute of Computer Scientists and Information Technologists (SAICSIT). URL: <a href="https://sacj.cs.uct.ac.za/index.php/sacj/article/view/707">https://sacj.cs.uct.ac.za/index.php/sacj/article/view/707</a>. doi: 10.18489/sacj.v32i2.707 <a href="javascript:unhide('abstract_KotzeWolff:2020');">[Abstract]</a> <a href="javascript:unhide('bibtex_KotzeWolff:2020');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KotzeWolff:2020" class="hidden">
<p align="justify">As more natural language processing (NLP) applications benefit from neural network based approaches, it makes sense to re-evaluate existing work in NLP. A complete pipeline for digitisation includes several components handling the material in sequence. Image processing after scanning the document has been shown to be an important factor in final quality. Here we compare two different approaches for visually enhancing documents before Optical Character Recognition (OCR), (1) a combination of ImageMagick and Unpaper and (2) OCRopus. We also compare Calamari, a new line-based OCR package using neural networks, with the well-known Tesseract 3 as the OCR component. Our evaluation on a set of Setswana documents reveals that the combination of ImageMagick/Unpaper and Calamari improves on a current baseline based on Tesseract 3 and ImageMagick/Unpaper with over 30%, achieving a mean character error rate of 1.69 across all combined test data.</p>
</div>
<div id="bibtex_KotzeWolff:2020" class="hidden">
<p>@Article{KotzeWolff:2020,</p>
<p class="tab">author = {Kotz\'{e}, Gideon and Wolff, Friedel},</p>
<p class="tab">title = {Exchanging image processing and OCR components in a Setswana digitisation pipeline},</p>
<p class="tab">journal = {South African Computer Journal},</p>
<p class="tab">year = {2020},</p>
<p class="tab">pages = {281--231},</p>
<p class="tab">volume = {32},</p>
<p class="tab">issue = {2},</p>
<p class="tab">publisher = {South African Institute of Computer Scientists and Information Technologists (SAICSIT)},</p>
<p class="tab">doi = {10.18489/sacj.v32i2.707}</p>
<p>}</p>
</div>
<br>
<h4>2019</h4>
<p class="publications">Hanneforth, T., Kapanadze, O. and Kotzé, G. 2019. Applying Computer Technologies to the Georgian Language: From a Treebank to a Syntactic Parser [Abstract]. Presented at <font style="font-style:italic !important">The Thirteenth International Tbilisi Symposium on Language, Logic and Computation</font>, Batumi, Georgia, 16-20 September 2019. <a href="http://events.illc.uva.nl/Tbilisi/Tbilisi2019/uploaded_files/inlineitem/Kapanadze.pdf">pdf</a>. <a href="javascript:unhide('bibtex_HanneforthEA:2019');">[BibTeX]</a></p>
<div id="bibtex_HanneforthEA:2019" class="hidden">
<p>@InProceedings{HanneforthEA:2019,</p>
<p class="tab">author = {Hanneforth, Thomas and Kapanadze, Oleg and Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Applying Computer Technologies to the Georgian Language: From a Treebank to a Syntactic Parser},</p>
<p class="tab">booktitle = {Thirteenth International Tbilisi Symposium on Language, Logic and Computation},</p>
<p class="tab">year = {2019},</p>
<p class="tab">month = {September},</p>
<p class="tab">address = {Batumi, Georgia},</p>
<p class="tab">url = {http://events.illc.uva.nl/Tbilisi/Tbilisi2019/uploaded_files/inlineitem/Kapanadze.pdf}</p>
<p>}</p>
</div>
<br>
<h4>2017</h4>
<!--LRE2017 -->
<!-- entry -->
<p class="publications">Kotzé, G., Vandeghinste, V., Martens, S. and Tiedemann, J. 2017. Large aligned treebanks for syntax-based machine translation. <font style="font-style:italic !important">Language Resources and Evaluation</font>. Vol. 51(2). Springer. URL: <a href="http://link.springer.com/article/10.1007/s10579-016-9369-0">http://link.springer.com/article/10.1007/s10579-016-9369-0</a>. doi: 10.1007/s10579-016-9369-0 <a href="javascript:unhide('abstract_KotzeEA:2017');">[Abstract]</a> <a href="javascript:unhide('bibtex_KotzeEA:2017');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KotzeEA:2017" class="hidden">
<p align="justify">We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the non-terminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present non-terminal alignment evaluation scores for a variety of tree alignment approaches. Finally, based on the parallel treebanks created by these approaches, we evaluate the MT system itself and compare the scores with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_KotzeEA:2017" class="hidden">
<p>@Article{KotzeEA:2017,</p>
<p class="tab">author = {Kotz\'{e}, Gideon and Vandeghinste, Vincent and Martens, Scott and Tiedemann, J\"{o}rg},</p>
<p class="tab">title = {Large aligned treebanks for syntax-based machine translation},</p>
<p class="tab">journal = {Language Resources and Evaluation},</p>
<p class="tab">year = {2017},</p>
<!-- <p class="tab">number = {57},</p> -->
<p class="tab">pages = {1--34},</p>
<p class="tab">volume = {51},</p>
<p class="tab">issue = {2},</p>
<p class="tab">publisher = {Springer},</p>
<p class="tab">doi = {10.1007/s10579-016-9369-0}</p>
<p>}</p>
</div>
<!--PRASA 2017 -->
<!-- entry -->
<p class="publications">Kotzé, G. and <a href="https://fwolff.net.za/">Wolff, F.</a> 2017. Developing and evaluating a pipeline for Setswana OCR. <font style="font-style:italic !important">Proceedings of the 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech)</font>, pp. 236-241. IEEE. URL: <a href="https://ieeexplore.ieee.org/document/8261154">https://ieeexplore.ieee.org/document/8261154</a>. doi: 10.1109/RoboMech.2017.8261154 <a href="javascript:unhide('abstract_KotzeWolff:2017');">[Abstract]</a> <a href="javascript:unhide('bibtex_KotzeWolff:2017');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KotzeWolff:2017" class="hidden">
<p align="justify">Optical Character Recognition (OCR) plays an important role in the creation of digital language resources. As OCR solutions are often language specific, the availability of models for South African languages also contributes to alleviating the language data scarcity problem. We describe the development of a digitisation pipeline in the context of a multilingual corpus project. We test a recently developed OCR model for the Setswana language against a selection of quality assured texts, while improving our output using image processing software and a newly developed tool, Ontrafel, for post-processing OCR output in PDF files. Each step in the pipeline is shown to improve the output quality when measured against the Character Error Rate metric. Finally, a qualitative analysis provides some insights that may contribute to refining steps or improving the existing OCR model. Apart from the creation of new digital language data for Setswana, we hope that our work stimulates and contributes to further research into high-quality digitisation of South African language resources.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_KotzeWolff:2017" class="hidden">
<p>@InProceedings{KotzeWolff:2017,</p>
<p class="tab">author = {Kotz\'{e}, Gideon and Wolff, Friedel},</p>
<p class="tab">title = {{Developing and evaluating a pipeline for Setswana OCR}},</p>
<p class="tab">booktitle = {2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics (PRASA-RobMech)},</p>
<p class="tab">year = {2017},</p>
<p class="tab">pages = {236--241},</p>
<!-- <p class="tab">number = {57},</p> -->
<p class="tab">publisher = {IEEE},</p>
<p class="tab">volume = {},</p>
<p class="tab">number = {},</p>
<p class="tab">doi = {10.1109/RoboMech.2017.8261154},</p>
<p class="tab">month = {Nov}</p>
<p>}</p>
</div>
<br>
<h4>2016</h4>
<!-- PRASA2016 -->
<!-- entry -->
<!-- <p class="publications">Kotzé, G. Accepted. Refining semi-automatic parallel corpus creation for Zulu/English statistical machine translation. <font style="font-style:italic !important">Proceedings of the 2016 PRASA-RobMech International Conference.</font> Stellenbosch, South Africa.</p> -->
<!--PRASA2016 -->
<!-- entry -->
<p class="publications">Kotzé, G. 2016. Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation. In: <font style="font-style:italic !important">Proceedings of the 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech)</font>, pp. 48-53. IEEE. URL: <a href="http://ieeexplore.ieee.org/document/7813168/">http://ieeexplore.ieee.org/document/7813168/</a>. doi: 10.1109/RoboMech.2016.7813168 <a href="javascript:unhide('abstract_Kotze:2016');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze:2016');">[BibTeX]</a> </p>
<!-- abstract -->
<div id="abstract_Kotze:2016" class="hidden">
<p align="justify">Although their use in training quality machine translation systems has been proven, parallel corpora—large collections of translated texts—are generally hard to come by for the majority of languages. To counteract this fact, a relatively small collection may be processed in more depth by further cleaning and more accurately splitting and aligning the texts. We apply this to an existing English/Zulu parallel corpus that has been used for statistical machine translation experiments. After these preprocessing steps, we run the same experiments for comparative purposes. Our results suggest that compatibility of bitexts, the choice of sentence splitters used on different parts of the text, as well as manual work, may have a notable effect on both the corpus size and on automatic translation quality.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_Kotze:2016" class="hidden">
<p>@InProceedings{Kotze:2016,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Refining semi-automatic parallel corpus creation for Zulu to English statistical machine translation},</p>
<p class="tab">booktitle = {Proceedings of the 2016 Pattern Recognition Association of South Africa and Robotics and Mechatronics International Conference (PRASA-RobMech)},</p>
<p class="tab">year = {2016},</p>
<p class="tab">pages = {48--53},</p>
<p class="tab">publisher = {IEEE},</p>
<p class="tab">isbn = {978-1-5090-3334-8},</p>
<p class="tab">doi = {10.1109/RoboMech.2016.7813168}</p>
<p>}</p>
</div>
<br>
<h4>2015</h4>
<!-- SACJ article -->
<!-- entry -->
<p class="publications">Kotzé, G. and <a href="https://fwolff.net.za/">Wolff, F.</a> 2015. Syllabification and parameter optimisation in Zulu to English machine translation. <font style="font-style:italic !important">South African Computer Journal.</font> No. 57, pp. 1-23. South African Institute of Computer Scientists and Information Technologists (SAICSIT). URL: <a href="https://sacj.cs.uct.ac.za/index.php/sacj/article/view/323">https://sacj.cs.uct.ac.za/index.php/sacj/article/view/323</a>. doi: 10.18489/sacj.v0i57.323 <a href="javascript:unhide('abstract_KotzeWolff:2015');">[Abstract]</a> <a href="javascript:unhide('bibtex_KotzeWolff:2015');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KotzeWolff:2015" class="hidden">
<p align="justify">We present a series of experiments involving the machine translation of Zulu to English using a well-known statistical software system. Due to morphological complexity and relative scarcity of resources, the case of Zulu is challenging. Against a selection of baseline models, we show that a relatively naive approach of dividing Zulu words into syllables leads to a surprising improvement. We further improve on this model through manual configuration changes. Our best model significantly outperforms the baseline models (BLEU measure, at p &lt; 0.001) even when they are optimised to a similar degree, only falling short of the well-known Morfessor morphological analyser that makes use of relatively sophisticated algorithms. These experiments suggest that even a simple optimisation procedure can improve the quality of this approach to a significant degree. This is promising particularly because it improves on a mostly language independent approach—at least within the same language family. Our work also drives the point home that sub-lexical alignment for Zulu is crucial for improved translation quality.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_KotzeWolff:2015" class="hidden">
<p>@Article{KotzeWolff:2015,</p>
<p class="tab">author = {Kotz\'{e}, Gideon and Wolff, Friedel},</p>
<p class="tab">title = {Syllabification and parameter optimisation in Zulu to English machine translation},</p>
<p class="tab">journal = {South African Computer Journal},</p>
<p class="tab">year = {2015},</p>
<p class="tab">number = {57},</p>
<p class="tab">pages = {1--23},</p>
<p class="tab">publisher = {South African Institute of Computer Scientists and Information Technologists (SAICSIT)},</p>
<p class="tab">doi = {10.18489/sacj.v0i57.323}</p>
<p>}</p>
</div>
<br>
<h4>2014</h4>
<!-- PRASA 2014 -->
<!-- entry -->
<p class="publications"><a href="https://fwolff.net.za/">Wolff, F.</a> and Kotzé, G. 2014. Experiments with syllable-based Zulu-English machine translation. In <font style="font-style:italic !important">Puttkammer, M. and Eiselen, R. (eds.): Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium</font>. pp. 217-222. Cape Town, South Africa. <a href="http://www.prasa.org/proceedings/2014/prasa2014-38.pdf">[pdf]</a> <a href="javascript:unhide('abstract_WolffKotze:2014');">[Abstract]</a> <a href="javascript:unhide('bibtex_WolffKotze:2014');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_WolffKotze:2014" class="hidden">
<p align="justify">Due to morphological complexity and scarce resources, machine translation from Zulu to English is challenging. We investigate the possibility of phrase-based statistical machine translation from Zulu to English using syllables as the tokens in the Zulu source text. Initial experiments on a relatively small but multi-domain data set suggest merit in our approach, with our best syllable-based model outperforming the best word-based model by 12,90% using the BLEU evaluation measure. Our syllabification approach is largely language independent, at least within the Bantu language family, and holds promise for similar efforts in related languages.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_WolffKotze:2014" class="hidden">
<p>@InProceedings{WolffKotze:2014,</p>
<p class="tab">author = {Wolff, Friedel and Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Experiments with syllable-based {Zulu-English} machine translation},</p>
<p class="tab">booktitle = {Proceedings of the 2014 PRASA, RobMech and AfLaT International Joint Symposium},</p>
<p class="tab">year = {2014},</p>
<p class="tab">editor = {Puttkammer, M. and Eiselen, R.},</p>
<p class="tab">publisher = {PRASA},</p>
<p class="tab">pages = {217--222},</p>
<p class="tab">address = {Cape Town, South Africa},</p>
<p class="tab">isbn = {978-0-620-62617-0}</p>
<p>}</p>
</div>
<!-- SaLTMiL2014 at LREC2014 -->
<!-- entry -->
<p class="publications">Kotzé, G. and <a href="https://fwolff.net.za/">Wolff, F.</a> 2014. Experiments with syllable-based English-Zulu alignment. In <font style="font-style:italic !important">Proceedings of the SaLTMiL Workshop on free/open-source language resources for the machine translation of less-resourced languages (at LREC 2014)</font>, pp. 7-11, Reykjavík, Iceland. <a href="http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-SALTMIL%20Proceedings.pdf">[pdf]</a> <a href="javascript:unhide('abstract_KotzeWolff:2014');">[Abstract]</a> <a href="javascript:unhide('bibtex_KotzeWolff:2014');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KotzeWolff:2014" class="hidden">
<p align="justify">As a morphologically complex language, Zulu has notable challenges aligning with English. One of the biggest concerns for statistical machine translation is the fact that the morphological complexity leads to a large number of words for which there exist very few examples in a corpus. To address the problem, we set about establishing an experimental baseline for lexical alignment by naively dividing the Zulu text into syllables, resembling its morphemes. A small quantitative as well as a more thorough qualitative evaluation suggests that our approach has merit, although certain issues remain. Although we have not yet determined the effect of this approach on machine translation, our first experiments suggest that an aligned parallel corpus with reasonable alignment accuracy can be created for a language pair, one of which is under-resourced, in as little as a few days. Furthermore, since very little language-specific knowledge was required for this task, our approach can almost certainly be applied to other language pairs and perhaps for other tasks as well.</p>
</div>
<!-- BibTeX -->
<div id="bibtex_KotzeWolff:2014" class="hidden">
<p>@InProceedings{KotzeWolff:2014,</p>
<p class="tab">author = {Kotz\'{e}, Gideon and Wolff, Friedel},</p>
<p class="tab">title = {Experiments with syllable-based {English-Zulu} alignment},</p>
<p class="tab">booktitle = {Proceedings of the SaLTMiL Workshop on free/open-source language resources for the machine translation of less-resourced languages (at LREC 2014)},</p>
<p class="tab">year = {2014},</p>
<p class="tab">pages = {7--11},</p>
<p class="tab">address = {Reykjav\'{i}k, Iceland},</p>
<p class="tab">isbn = {978-2-9517408-8-4},</p>
<p class="tab">url = {http://www.lrec-conf.org/proceedings/lrec2014/workshops/LREC2014Workshop-SALTMIL%20Proceedings.pdf}</p>
<p>}</p>
</div>
<br>
<h4>2013</h4>
<!-- PhD thesis -->
<!-- entry -->
<p class="publications">Kotzé, G. 2013. <font style="font-style:italic !important">Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods.</font> PhD thesis. University of Groningen. URL: <a href="http://hdl.handle.net/11370/316d1696-276f-456d-b1cd-bc81fd4262ab">http://hdl.handle.net/11370/316d1696-276f-456d-b1cd-bc81fd4262ab</a> <a href="downloads/GideonThesis_Electronic.pdf">[pdf]</a> <a href="javascript:unhide('abstract_Kotze:2013');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze:2013');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze:2013" class="hidden">
<p align="justify">Large collections of translated texts—called parallel corpora—are often automatically aligned on word and sentence level to be used as training data for machine translation systems. We may also choose to syntactically analyze the sentences to produce syntax trees. If we do this on both sides and the nodes of the trees are also aligned, the end result is called a parallel treebank.
The best translation systems are statistically based, but in recent years there has been a shift to the incorporation of more linguistically motivated data, which includes the use of parallel treebanks. These are only useful on a very large scale because of the amount of information a system needs about how one language is to be translated into another in order to be effective. Because of this, we investigate techniques for the automatic and accurate alignment of these nodes. Another motive for our research is the fact that parallel treebanks are also useful for other techniques and that as a linguistic resource, remain scientifically interesting. This process is called tree alignment.
We find that a combination of statistical and rule-based techniques, using relatively small sets of training data and few features, is sufficient to produce very accurate alignments. Finally, we also find that when we apply alignments covering a relatively large set of nodes—even though some of them are wrong—on a syntax-based machine translation system, this leads to better translation results than applying alignments that are more accurate but fewer in number.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze:2013" class="hidden">
<p>@PhDThesis{Kotze:2013,</p>
<p class="tab">title = {Complementary Approaches to Tree Alignment: Combining Statistical and Rule-Based Methods},</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">school = {University of Groningen},</p>
<p class="tab">month = {June},</p>
<p class="tab">year = {2013},</p>
<p class="tab">isbn = {978-90-367-6177-2},</p>
<p class="tab">url = {http://hdl.handle.net/11370/316d1696-276f-456d-b1cd-bc81fd4262ab},</p>
<p class="tab">note = {\url{http://hdl.handle.net/11370/316d1696-276f-456d-b1cd-bc81fd4262ab}},</p>
<p>}</p>
</div>
<!-- Springer 2013 -->
<!-- entry -->
<p class="publications">Vandeghinste, V., Martens, S., Kotzé, G., Tiedemann, J., Van den Bogaert, J., De Smet, K., Van Eynde, F., and Van Noord, G. 2013. Parse and Corpus-based Machine Translation. In <font style="font-style:italic !important">Peter Spyns and Jan Odijk (eds.): Essential Speech and Language Technology for Dutch</font>. pp. 305-319. Springer. URL: <a href="https://link.springer.com/chapter/10.1007%2F978-3-642-30910-6_17">https://link.springer.com/chapter/10.1007%2F978-3-642-30910-6_17</a>. <a href="javascript:unhide('abstract_VandeghinsteEA:2013');">[Abstract]</a> <a href="javascript:unhide('bibtex_VandeghinsteEA:2013');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_VandeghinsteEA:2013" class="hidden">
<p align="justify">In this paper the PaCo-MT project is described, in which Parse and Corpus-based Machine Translation has been investigated: a data-driven approach to stochastic syntactic rule-based machine translation. In contrast to the phrase-based statistical machine translation systems (PB-SMT) which are string-based and do not use any linguistic knowledge, an MT engine in a different paradigm was built: a tree-based data-driven system that automatically induces translation rules from a large syntactically analysed parallel corpus. The architecture is presented in detail as well as an evaluation in comparison with our previous work and with the current state-of-the art PB-SMT system Moses.</p>
</div>
<!-- bibtex -->
<div id="bibtex_VandeghinsteEA:2013" class="hidden">
<p>@InBook{VandeghinsteEA:2013,</p>
<p class="tab">author = {Vandeghinste, Vincent and Martens, Scott and Kotz\'{e}, Gideon and Tiedemann, J\"{o}rg and Van den Bogaert, Joachim and De Smet, Koen and Van Eynde, Frank and Van Noord, Gertjan},</p>
<p class="tab">chapter = {Parse and Corpus-based Machine Translation},</p>
<p class="tab">title = {Essential Speech and Language Technology for Dutch},</p>
<p class="tab">pages = {305--319},</p>
<p class="tab">year = {2013},</p>
<p class="tab">isbn = {978-3-642-30909-0},</p>
<p class="tab">editor = {Spyns, P. and Odijk, J.},</p>
<p class="tab">publisher = {Springer},</p>
<p class="tab">url = {https://link.springer.com/chapter/10.1007%2F978-3-642-30910-6_17}</p>
<p>}</p>
</div>
<br>
<h4>2012</h4>
<!-- CLIN2012 -->
<!-- entry -->
<p class="publications">Kotzé, G. 2012. Transformation-based tree-to-tree alignment. In <font style="font-style:italic !important">Computational Linguistics in the Netherlands Journal</font>. Vol. 2, pp. 71-96. URL: <a href="https://www.clinjournal.org/clinj/article/view/17/15">https://www.clinjournal.org/clinj/article/view/17/15</a> <a href="javascript:unhide('abstract_Kotze:2012');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze:2012');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze:2012" class="hidden">
<p align="justify">Previous experiments suggest that a rule-based approach to tree alignment error correction serves to be an effective complement to statistical alignment. We show how, using relatively few features, an implementation of Brill’s Transformation-Based Learning algorithm improves the results of a high precision model of the statistical aligner Lingua-Align. Using our system to correct already tree aligned data, we achieve balanced F-scores of 80.6 on our test set and 85.2 on our development test set. Using it as a tree aligner on word aligned data, our best F-scores using the same model amount to 78.7 and 83.0 respectively. Finally, we apply a pipeline of alignment and error correction tools to create several versions of a large parallel treebank consisting of various domains for Dutch to English for use in a syntax-based MT system. We conclude that transformation-based learning is a promising approach for the large-scale creation of parallel treebanks for various NLP purposes.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze:2012" class="hidden">
<p>@Article{Kotze:2012,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Transformation-based tree-to-tree alignment},</p>
<p class="tab">journal = {Computational Linguistics in the Netherlands Journal},</p>
<p class="tab">volume = {2},</p>
<p class="tab">pages = {71--96},</p>
<p class="tab">year = {2012},</p>
<p class="tab">issn = {2211-4009},</p>
<p class="tab">url = {https://www.clinjournal.org/clinj/article/view/17/15}</p>
<p>}</p>
</div>
<!-- LREC2012 -->
<!-- entry -->
<p class="publications">Kotzé, G., Vandeghinste, V., Martens, S. and Tiedemann, J. 2012. Large aligned treebanks for syntax-based machine translation. In <font style="font-style:italic !important">Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)</font>, pp. 467-473, Istanbul, Turkey. <a href="http://www.lrec-conf.org/proceedings/lrec2012/pdf/924_Paper.pdf">[pdf]</a> <a href="javascript:unhide('abstract_KotzeEA:2012');">[Abstract]</a> <a href="javascript:unhide('bibtex_KotzeEA:2012');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_KotzeEA:2012" class="hidden">
<p align="justify">We present a collection of parallel treebanks that have been automatically aligned on both the terminal and the nonterminal constituent level for use in syntax-based machine translation. We describe how they were constructed and applied to a syntax- and example-based machine translation system called Parse and Corpus-Based Machine Translation (PaCo-MT). For the language pair Dutch to English, we present evaluation scores of both the nonterminal constituent alignments and the MT system itself, and in the latter case, compare them with those of Moses, a current state-of-the-art statistical MT system, when trained on the same data.</p>
</div>
<!-- bibtex -->
<div id="bibtex_KotzeEA:2012" class="hidden">
<p>@InProceedings{KotzeEA:2012,</p>
<p class="tab">author = {Kotz\'{e}, Gideon and Vandeghinste, Vincent and Martens, Scott and Tiedemann, J\"{o}rg},</p>
<p class="tab">title = {Large Aligned Treebanks for Syntax-based Machine Translation},</p>
<p class="tab">booktitle = {Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)},</p>
<p class="tab">pages = {467--473},</p>
<p class="tab">year = {2012},</p>
<p class="tab">address = {Istanbul, Turkey},</p>
<p class="tab">editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Mehmet U\v{g}ur Do\v{g}an and Bente Maegaard and Joseph Mariani and Jan Odijk and Stelios Piperidis},</p>
<p class="tab">publisher = {European Language Resources Association (ELRA)},</p>
<p class="tab">isbn = {978-2-9517408-7-7},</p>
<p class="tab">language = {english},</p>
<p class="tab">url = {http://www.lrec-conf.org/proceedings/lrec2012/pdf/924_Paper.pdf}</p>
<p>}</p>
</div>
<br>
<h4>2011</h4>
<!-- NODALIDA 2011 -->
<!-- entry -->
<p class="publications">Kotzé, G. 2011. Finding Statistically Motivated Features Influencing Subtree Alignment Performance. In <font style="font-style:italic !important">Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa (eds.): Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011</font>, May 11-13, Riga, Latvia. NEALT Proceedings Series, Vol. 11, pp. 332-335. Tartu: Tartu University Library. <a href="http://dspace.utlib.ee/dspace/bitstream/handle/10062/17366/0Kotze_81.pdf">[pdf]</a> <a href="javascript:unhide('abstract_Kotze2011a');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze2011a');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze2011a" class="hidden">
<p align="justify">In this paper, we present results of an ongoing investigation of a manually aligned parallel treebank and an automatic tree aligner. We establish the features that show a significant correlation with alignment performance. We present those features with the biggest correlation scores and discuss their significance, with mention of future applications of these findings.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze2011a" class="hidden">
<p>@InProceedings{Kotze2011a,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Finding Statistically Motivated Features Influencing Subtree Alignment Performance},</p>
<p class="tab">editor = {Bolette Sandford Pedersen and Gunta Ne\v{s}pore and Inguna Skadi\c{n}a},</p>
<p class="tab">booktitle = {Proceedings of the 18th Nordic Conference of Computational Linguistics},</p>
<p class="tab">series = {NEALT Proceedings Series},</p>
<p class="tab">volume = {11},</p>
<p class="tab">publisher = {Tartu University Library},</p>
<p class="tab">pages = {332--335},</p>
<p class="tab">year = {2011},</p>
<p class="tab">address = {Riga, Latvia},</p>
<p class="tab">issn = {1736-8197},</p>
<p class="tab">url = {https://dspace.ut.ee/bitstream/handle/10062/17366/0Kotze_81.pdf}</p>
<p>}</p>
</div>
<!-- ESSLLI 2011 -->
<!-- entry -->
<p class="publications">Kotzé, G. 2011. Improving syntactic tree alignment through rule-based error correction. In <font style="font-style:italic !important">Proceedings of ESSLLI 2011 Student Session</font>, pp. 122-127, Ljubljana, Slovenia. <a href="http://www.stanford.edu/~danlass/esslli2011stus/kotze.pdf">[pdf]</a> <a href="javascript:unhide('abstract_Kotze2011b');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze2011b');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze2011b" class="hidden">
<p align="justify">Automatic alignment of parallel treebanks often displays regular errors that can be corrected by improving the alignment model.
However, if the aligner is statistical, often much more training data is needed to properly address these errors. In some cases, a rule-based approach
to error correction may provide a quick and convenient solution. We present an approach that highlights problematic phenomena which enables us to pinpoint systematic error patterns for which we can devise rules for correction. Finally, we investigate the application of two manually constructed rules on a large parallel treebank.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze2011b" class="hidden">
<p>@InProceedings{Kotze2011b,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Improving syntactic tree alignment through rule-based error correction},</p>
<p class="tab">booktitle = {Proceedings of ESSLLI 2011 Student Session},</p>
<p class="tab">pages = {122--127},</p>
<p class="tab">year = {2011},</p>
<p class="tab">address = {Ljubljana, Slovenia},</p>
<p class="tab">url = {https://web.stanford.edu/~danlass/esslli2011stus/kotze.pdf}</p>
<p>}</p>
</div>
<!-- CORPUS LINGUISTICS ST. PETERSBURG 2011 -->
<!-- entry -->
<p class="publications">Kotzé, G. 2011. Rule-induced error correction of aligned parallel treebanks. In <font style="font-style:italic !important">Proceedings of the International Conference "Corpus Linguistics - 2011"</font>, pp. 35-40, Saint Petersburg, Russia. <a href="http://www.ccl.kuleuven.be/Projects/PACO/CorpusLinguistics_Kotze_final.pdf">[pdf]</a> <a href="javascript:unhide('abstract_Kotze2011c');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze2011c');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze2011c" class="hidden">
<p align="justify">Automatic sub-tree alignment of parallel treebanks often displays regular errors that can be corrected by improving the alignment
model. However, if the aligner is statistical, often much more training data is needed to properly address these errors. In some cases, a rule-based approach to error correction can provide a quick and convenient solution. We present an approach that highlights problematic phenomena which enables us to pinpoint regular error patterns for which we can devise rules for correction.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze2011c" class="hidden">
<p>@InProceedings{Kotze2011c,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Rule-Induced Error Correction of Aligned Parallel Treebanks},</p>
<p class="tab">booktitle = {Proceedings of the International Conference "Corpus Linguistics - 2011"},</p>
<p class="tab">pages = {35--40},</p>
<p class="tab">year = {2011},</p>
<p class="tab">address = {Saint Petersburg, Russia},</p>
<p class="tab">isbn = {978-5-8465-0005-5},</p>
<p class="tab">url = {http://www.ccl.kuleuven.be/Projects/PACO/CorpusLinguistics_Kotze_final.pdf}</p>
<p>}</p>
</div>
<!-- EAMT 2011 -->
<!-- entry -->
<p class="publications">Vandeghinste, V., Van den Bogaert, J., Martens, S., and Kotzé, G. 2011. PaCo-MT: Parse and Corpus-based Machine Translation. In <font style="font-style:italic !important">Forcada, M.L., Depraetere, H., and Vandeghinste, V. (eds.): Proceedings of the 15th International Conference of the European Association for Machine Translation</font>, p. 347. Leuven, Belgium. <a href="http://www.mt-archive.info/EAMT-2011-PaCoMT.pdf">[pdf]</a> <a href="javascript:unhide('abstract_VandeghinsteEA:2011');">[Abstract]</a> <a href="javascript:unhide('bibtex_VandeghinsteEA:2011');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_VandeghinsteEA:2011" class="hidden">
<p align="justify">The PaCo-MT project is building a stochastic example-based transfer system translating from Dutch into English and French, and vice versa. It is a data-driven tree-to-tree based approach towards MT, transducing the input parse tree into a set of target language parse trees without node ordering. This Synchronous Tree Substitution Grammar (limited to regular subtrees) is induced from a subtree-aligned parallel treebank, using a discriminative model for tree alignment. Monolingual parses were created by pre-existing parsers, such as the Alpino parser for Dutch, the Stanford parser for English, and the Berkeley parser for French. A tree-based target language modeler using a probabilistic context-free grammar based on large monolingual treebanks decodes the output forest and determines node ordering.</p>
<br>
<p align="justify">By this approach we aim at combining the strengths of data-driven MT with the strengths of rule-based MT, avoiding the weaknesses of each of these approaches. Results show that although BLEU scores are not yet at par with Moses, long distance movements pose no problems for our approach, and we do not drop important words, yielding a more grammatical output than PBSMT systems.</p>
</div>
<!-- bibtex -->
<div id="bibtex_VandeghinsteEA:2011" class="hidden">
<p>@InProceedings{VandeghinsteEA:2011,</p>
<p class="tab">author = {Vandeghinste, Vincent and Van den Bogaert, Joachim and Martens, Scott and Kotz\'{e}, Gideon},</p>
<p class="tab">title = {PaCo-MT: Parse and Corpus-based Machine Translation},</p>
<p class="tab">booktitle = {Proceedings of the 15th International Conference of the European Association for Machine Translation},</p>
<p class="tab">year = {2011},</p>
<p class="tab">month = {May},</p>
<p class="tab">pages = {347},</p>
<p class="tab">editor = {Forcada, M.L. and Depraetere, H. and Vandeghinste, V.},</p>
<p class="tab">address = {Leuven, Belgium},</p>
<p class="tab">series = {EAMT},</p>
<p class="tab">isbn = {9789081486118},</p>
<p class="tab">url = {http://www.mt-archive.info/EAMT-2011-PaCoMT.pdf}</p>
<p>}</p>
</div>
<br>
<h4>2009</h4>
<!-- TIEDEMANN AND KOTZE RANLP WORKSHOP 2009 -->
<!-- entry -->
<p class="publications">Tiedemann, J. and Kotzé, G. 2009. A Discriminative Approach to Tree Alignment. In <font style="font-style:italic !important">Ilisei, I., Pekar, V. and Bernardini, S. (eds.): Proceedings of the International Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning (in connection with RANLP'09)</font>, pp. 33-39. Borovets, Bulgaria. <a href="https://www.aclweb.org/anthology/W09-4206.pdf">[pdf]</a> <a href="javascript:unhide('abstract_TiedemannKotze2009');">[Abstract]</a> <a href="javascript:unhide('bibtex_TiedemannKotze2009');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_TiedemannKotze2009" class="hidden">
<p align="justify">In this paper we propose a discriminative framework for automatic tree alignment. We use a rich feature set and a log-linear model trained
on small amounts of hand-aligned training data. We include contextual features and link dependencies to improve the results even further. We achieve an overall F-score of almost 80% which is significantly better than other scores reported for this task.</p>
</div>
<!-- bibtex -->
<div id="bibtex_TiedemannKotze2009" class="hidden">
<p>@InProceedings{TiedemannKotze2009,</p>
<p class="tab">author = {Tiedemann, J\"{o}rg and Kotz\'{e}, Gideon},</p>
<p class="tab">title = {A Discriminative Approach to Tree Alignment},</p>
<p class="tab">booktitle = {Proceedings of the International Workshop on Natural Language Processing Methods and Corpora in Translation, Lexicography and Language Learning (in connection with RANLP'09)},</p>
<p class="tab">year = {2009},</p>
<p class="tab">pages = {33--39},</p>
<p class="tab">publisher = {Association for Computational Linguistics},</p>
<p class="tab">address = {Borovets, Bulgaria},</p>
<p class="tab">month = {September},</p>
<p class="tab">editor = {Ilisei, I. and Pekar, V. and Bernardini, S.},</p>
<p class="tab">url = {http://www.aclweb.org/anthology/W09-4206},</p>
<p class="tab">isbn = {978-954-452-010-6}</p>
<p>}</p>
</div>
<!-- TIEDEMANN AND KOTZE TLT 2009 -->
<!-- entry -->
<p class="publications">Tiedemann, J. and Kotzé, G. 2009. Building a Large Machine-Aligned Parallel Treebank. In <font style="font-style:italic !important">Passarotti, M. and Przepiórkowski, A. and Raynaud, S. and Van Eynde, F. (eds.): Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT'08)</font>, pp. 197-208. Milan, Italy. <a href="https://convegni.unicatt.it/meetings_B_1.pdf">[pdf]</a> <a href="javascript:unhide('abstract_TiedemannKotze2009b');">[Abstract]</a> <a href="javascript:unhide('bibtex_TiedemannKotze2009b');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_TiedemannKotze2009b" class="hidden">
<p align="justify">This paper reports on-going work on building a large automatically tree-aligned parallel treebank in the context of a syntax-based machine translation (MT) approach. For this we develop a discriminative tree aligner based on a log-linear model with a rich feature set. We incorporate various language-independent and language-specific features taking advantage of existing tools and annotation. Our initial experiments on a small hand-aligned treebank show promising results even with small amounts of training data. The performance of our approach is well above unsupervised techniques reported elsewhere. This enables us to quickly create training material and alignment models for additional language pairs. In recent work, we aligned more than one million sentence pairs and started our experiments with the extraction of transfer knowledge for our example-based machine translation system.</p>
</div>
<!-- bibtex -->
<div id="bibtex_TiedemannKotze2009b" class="hidden">
<p>@InProceedings{TiedemannKotze2009b,</p>
<p class="tab">author = {Tiedemann, J\"{o}rg and Kotz\'{e}, Gideon},</p>
<p class="tab">title = {{Building a Large Machine-Aligned Parallel Treebank}},</p>
<p class="tab">booktitle = {Proceedings of the 8th International Workshop on Treebanks and Linguistic Theories (TLT'08)},</p>
<p class="tab">year = {2009},</p>
<p class="tab">pages = {197--208},</p>
<p class="tab">topic = {Alignment and Parallel corpora},</p>
<p class="tab">editor = {Passarotti, M. and Przepi\'{o}rkowski, A. and Raynaud, S. and Van Eynde, F.},</p>
<p class="tab">publisher = {EDUCatt, Milano/Italy},</p>
<p class="tab">address = {Milan, Italy},</p>
<p class="tab">isbn = {978-88-8311-712-1},</p>
<p class="tab">url = {https://convegni.unicatt.it/meetings_B_1.pdf}</p>
<p>}</p>
</div>
<br>
<h4>2008</h4>
<!-- LITERATOR 2008 -->
<!-- entry -->
<p class="publications">Kotzé, G. 2008. Development of an Afrikaans wordnet: methodology and integration / Ontwikkeling van 'n Afrikaanse woordnet: metodologie en integrasie. <font style="font-style:italic !important">Literator: Journal of Literary Criticism, Comparative Linguistics and Literary Studies: Human language technology for South African languages</font>: Special Issue 1, Volume 29(1), pp. 163-184. URL: <a href="https://literator.org.za/index.php/literator/article/view/105/89">https://literator.org.za/index.php/literator/article/view/105</a> <a href="javascript:unhide('abstract_Kotze2008');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze2008');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze2008" class="hidden">
<p align="justify">The Afrikaans wordnet is a lexical-conceptual network in the form of an electronic lexical database, developed at the North-West University. In this article, a methodology for a semi-automatic construction of the entries - so-called synonym sets - is investigated. Firstly, a background is given on the nature of a wordnet, as well as "WordNet", on which it is based. Other wordnets, as well as applications of wordnets, are also discussed here. Next, the macrostructure of a wordnet in terms of its integration and compatibility with other wordnets is investigated, after which the proposed methodology is presented with a discussion of the results. Finally, a projection is made to the integration of the Afrikaans wordnet with other resources, which include "WordNet" and an Afrikaans lexical database, called ALEXANDER.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze2008" class="hidden">
<p>@Article{Kotze2008,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Development of an Afrikaans wordnet: methodology and integration / Ontwikkeling van 'n Afrikaanse woordnet: metodologie en integrasie},</p>
<p class="tab">journal = {Literator: Journal of Literary Criticism, Comparative Linguistics and Literary Studies: Human language technology for South African languages: Special Issue 1},</p>
<p class="tab">year = {2008},</p>
<p class="tab">volume = {29},</p>
<p class="tab">number = {1},</p>
<p class="tab">pages = {163--184},</p>
<p class="tab">url = {https://literator.org.za/index.php/literator/article/view/105},</p>
<p class="tab">issn = {0258-2279}</p>
<p>}</p>
</div>
<br>
<h4>2006</h4>
<!-- MASTER'S THESIS -->
<!-- entry -->
<p class="publications">Kotzé, G. 2006. <font style="font-style:italic !important">Building a WordNet for Afrikaans: Preliminary research in the form of an inquiry into the feasibility and optimal methodology for the development of a wordnet database.</font> Master's thesis. Free University of Amsterdam. <a href="downloads/MastersThesis_Kotze2006.pdf">[pdf]</a> <a href="javascript:unhide('abstract_Kotze:Thesis:2006');">[Abstract]</a> <a href="javascript:unhide('bibtex_Kotze:Thesis:2006');">[BibTeX]</a></p>
<!-- abstract -->
<div id="abstract_Kotze:Thesis:2006" class="hidden">
<p align="justify">This thesis is written as part of the preliminary research for a proposed project at the Centre for Text Technology at the North-West University in Potchefstroom, North-West Province, South Africa. In this work a methodology for constructing a wordnet for Afrikaans is proposed, which will be based on the Princeton WordNet that was developed at Princeton University, USA. All relevant concepts are introduced, starting with an analysis of the prototype wordnet, the Princeton WordNet, after which the focus shifts to its extension to multilingual wordnet databases. An investigation is made into the resources and tools available for the proposed project, after which we verify its feasibility. Afterwards, various methodologies for wordnet construction are investigated and analysed. Based on all these analyses, a detailed methodology with a work schedule is proposed for building a core Afrikaans wordnet, also keeping in mind future extension and potential problems. We conclude that the existence of several automatic techniques has greatly improved the process of wordnet construction compared to a few years ago, but that they are still heavily dependent on quality lexical resources and tools. Finally, some suggestions are made for future extensions and applications of the wordnet.</p>
</div>
<!-- bibtex -->
<div id="bibtex_Kotze:Thesis:2006" class="hidden">
<p>@MastersThesis{Kotze:Thesis:2006,</p>
<p class="tab">author = {Kotz\'{e}, Gideon},</p>
<p class="tab">title = {Building a WordNet for Afrikaans: Preliminary research in the form of an inquiry into the feasibility and optimal methodology for the development of a wordnet database},</p>
<p class="tab">school = {Free University of Amsterdam},</p>
<p class="tab">address = {Amsterdam, The Netherlands},</p>
<p class="tab">year = {2006},</p>
<p class="tab">url = {http://www.gideonkotze.co.za/downloads/MastersThesis_Kotze2006.pdf}</p>
<p>}</p>
</div>
</div>
<!--END of content-->
</body>
</html>