new lt-merge command to merge LUs from BEG to END tag
Since we need to unquote when generating before tf-inject, we need to double-quote escaped chars here:

    $ echo '^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^til/til<pr>$ ^x\@y.com/x\@y.com<email>$^»/»<rquot><MERGE_END>$ ^da/da<adv>$' | lttoolbox/lt-merge
    ^ikke/ikke<adv>$ ^«til x\\\@y.com»/«til x\\\@y.com»<MERGED>$ ^da/da<adv>$

    $ echo '^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^veldig/veldig<adv>$[[/]]^»/»<rquot><MERGE_END>$' | lttoolbox/lt-merge
    ^«\[\[tf:i:a\]\]veldig\[\[\/\]\]»/«\[\[tf:i:a\]\]veldig\[\[\/\]\]»<MERGED>$

If we run this between analysis and wblank-attach, then after the `lt-proc -b generator.bin` step we should have e.g.

    ^ikkje<adv>/ikkje$ ^«til x\\\@y.com»<MERGED>/«til x\\\@y.com»$ ^då<adv>/då$

which after `cg-proc -1 -n -g genprefs.bin` would turn into

    ikkje «til x\@y.com» då

Note how \\\@ turned into \@ – we removed one layer of quoting, but this is still in the apertium stream so special chars stay quoted until the final tf-inject.

TODO: We need to be able to pass MERGED stuff unchanged through biltrans and generator, would like to

    <re>.+</re><i><s n="MERGED"/></i>

but . is literal period in re(!) and even ANY_CHAR doesn't seem supported in lt-proc -b. It should be possible to support with a `step_case_override` in `FSTProcessor::biltrans`.

We need an `lt-merge --unmerge` to undo the merge:

    $ echo '^ikkje<adv>/ikkje$ ^«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»<MERGED>/«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»$' | lt-merge --unmerge
    ^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»

which then becomes

    $ echo '^ikkje<adv>/ikkje$ «[[tf:i:a]]s\^å[[/]]»' | cg-proc -1ng nob-nno.genprefs.rlx.bin
    ikkje «[[tf:i:a]]s\^å[[/]]»

which tf-inject is happy to handle.
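The quoting layers in the examples above can be illustrated with a small Python sketch. This is not the lt-merge implementation: the helper names `escape_layer` and `unescape_layer` are hypothetical, and the reserved character set is an assumption inferred from the examples in this message.

```python
import re

# Assumption: characters that get a backslash when a layer of quoting is added.
# Inferred from the examples above, not taken from the lttoolbox source.
RESERVED = set('\\^$@/<>{}[]')


def escape_layer(text):
    """Add one layer of quoting: prefix every reserved char with a backslash."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in text)


def unescape_layer(text):
    """Remove one layer of quoting: drop the backslash of each escaped pair."""
    return re.sub(r'\\(.)', lambda m: m.group(1), text, flags=re.DOTALL)


if __name__ == '__main__':
    once = r'x\@y.com'            # as it appears in the analysed stream
    twice = escape_layer(once)    # as it appears inside the merged LU
    print(twice)
    print(unescape_layer(twice))  # one layer removed again
```

Under these assumptions, the already-quoted `x\@y.com` gains a second layer inside the merged unit, and removing one layer yields the singly-quoted form again.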
Showing 8 changed files with 221 additions and 1 deletion.
@@ -0,0 +1,40 @@
.Dd December 10, 2024
.Dt LT-MERGE 1
.Os Apertium
.Sh NAME
.Nm lt-merge
.Nd lexical merger for Apertium
.Sh SYNOPSIS
.Nm lt-merge
.Op Fl u
.Op Ar input_file Op Ar output_file
.Sh DESCRIPTION
.Nm lt-merge
is the application responsible for merging and unmerging
lexical units.
.Pp
It merges the lexical units from the one tagged <MERGE_BEG> through the one tagged <MERGE_END> into a single lexical unit tagged <MERGED>.
.Sh OPTIONS
.Bl -tag -width Ds
.It Fl u , Fl Fl unmerge
Run in reverse: undo a previous merge.
.It Fl v , Fl Fl version
Display the version number.
.It Fl h , Fl Fl help
Display this help.
.El
\" .Sh FILES
\" .Bl -tag -width Ds
\" .It Ar input_file
\" The input compiled dictionary.
\" .El
.Sh SEE ALSO
.Xr apertium 1 ,
.Xr lt-proc 1
.Sh COPYRIGHT
Copyright \(co 2024 Universitat d'Alacant / Universidad de Alicante.
This is free software.
You may redistribute copies of it under the terms of
.Lk https://www.gnu.org/licenses/gpl.html the GNU General Public License .
.Sh BUGS
Many... lurking in the dark and waiting for you!
@@ -0,0 +1,46 @@
/*
 * Copyright (C) 2024 Universitat d'Alacant / Universidad de Alicante
 *
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License as
 * published by the Free Software Foundation; either version 2 of the
 * License, or (at your option) any later version.
 *
 * This program is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
 * General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with this program; if not, see <https://www.gnu.org/licenses/>.
 */
#include <lttoolbox/fst_processor.h>
#include <lttoolbox/file_utils.h>
#include <lttoolbox/cli.h>
#include <lttoolbox/lt_locale.h>
#include <iostream>


int main(int argc, char *argv[])
{
  LtLocale::tryToSetLocale();
  CLI cli("merge lexical units from the one tagged BEG until END", PACKAGE_VERSION);
  cli.add_file_arg("input_file");
  cli.add_file_arg("output_file");
  cli.add_bool_arg('u', "unmerge", "Undo the merge");
  cli.parse_args(argc, argv);

  auto strs = cli.get_strs();
  bool unmerge = cli.get_bools()["unmerge"];
  InputFile input;
  if (!cli.get_files()[0].empty()) {
    input.open_or_exit(cli.get_files()[0].c_str());
  }
  UFILE* output = openOutTextFile(cli.get_files()[1]);

  FSTProcessor fstp;
  fstp.initBiltrans();
  fstp.quoteMerge(input, output);

  return 0;
}
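The `--unmerge` behaviour described in the commit message can be modelled with a short Python sketch. This is only an illustration of the expected stream transformation, not the C++ code path: the function name `unmerge`, the tokenising regular expressions, and the choice to take the form after the last unescaped `/` are all assumptions.

```python
import re

# Assumption: an LU is '^...$' with backslash-escaped pairs allowed inside;
# anything else up to the next unescaped '^' is treated as blank text.
TOKEN = re.compile(r'\^(?:\\.|[^\\$])*\$|(?:\\.|[^\\^])+', re.DOTALL)
# Runs of text that contain no unescaped '/'.
FIELD = re.compile(r'(?:\\.|[^\\/])+', re.DOTALL)


def unescape_layer(text):
    """Remove one layer of quoting: drop the backslash of each escaped pair."""
    return re.sub(r'\\(.)', lambda m: m.group(1), text, flags=re.DOTALL)


def unmerge(stream):
    """Replace each LU tagged <MERGED> by its form with one quoting layer removed."""
    out = []
    for tok in TOKEN.findall(stream):
        if tok.startswith('^') and '<MERGED>' in tok:
            fields = FIELD.findall(tok[1:-1])
            # Hypothetical choice: use the last '/'-separated field as the form.
            out.append(unescape_layer(fields[-1]))
        else:
            out.append(tok)
    return ''.join(out)


if __name__ == '__main__':
    stream = (r'^ikkje<adv>/ikkje$ '
              r'^«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»<MERGED>'
              r'/«\[\[tf:i:a\]\]s\\\^å\[\[\/\]\]»$')
    print(unmerge(stream))
```

The unmerged text leaves the LU wrapper behind and keeps a single layer of quoting, which matches the form the commit message says tf-inject can handle.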
@@ -0,0 +1,37 @@
# -*- coding: utf-8 -*-
import unittest
from basictest import ProcTest
import unittest

class MergeTest(unittest.TestCase, ProcTest):
    inputs = ['^nochange<n>$']
    expectedOutputs = ['^nochange<n>$']
    procflags = []

    def compileTest(self, tmpd):
        return True # "pass"

    def openProc(self, tmpd):
        return self.openPipe('lt-merge', self.procflags+[])


class SimpleTest(MergeTest):
    inputs = ['^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ ^veldig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$' ]
    expectedOutputs = ['^ikke/ikke<adv>$ ^«så veldig»/«så veldig»<MERGED>$ ^bra/bra<adj>$']


class SingleTest(MergeTest):
    inputs = ['^not/very<useful><MERGE_BEG><MERGE_END>$' ]
    expectedOutputs = ['^not/not<MERGED>$']


class EscapeTest(MergeTest):
    # Using r'' to avoid doubling escapes even more:
    inputs = [r'^ikke/ikke<adv>$ ^«/«<lquot><MERGE_BEG>$^så/så<adv>$ ^ve\[dig/v<adv>$^»/»<rquot><MERGE_END>$ ^bra/bra<adj>$']
    expectedOutputs = [r'^ikke/ikke<adv>$ ^«så ve\\\[dig»/«så ve\\\[dig»<MERGED>$ ^bra/bra<adj>$']


class WordblankTest(MergeTest):
    # Using r'' to avoid doubling escapes even more:
    inputs = [r'^«/«<lquot><MERGE_BEG>$[[tf:i:a]]^ve\/ldig/v<adv>$[[/]]^»/»<rquot><MERGE_END>$']
    expectedOutputs = [r'^«\[\[tf:i:a\]\]ve\\\/ldig\[\[\/\]\]»/«\[\[tf:i:a\]\]ve\\\/ldig\[\[\/\]\]»<MERGED>$']
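The expectations in these tests can be cross-checked against a rough Python model of the merge. It is a sketch under stated assumptions, not a port of `FSTProcessor::quoteMerge`: the names `merge`, `escape_layer`, `TOKEN` and `SURFACE`, the reserved character set, and the rule "concatenate surface forms and intervening blanks, each with one extra quoting layer" are inferred from the test data above.

```python
import re

# Assumption: reserved characters inferred from the test expectations.
RESERVED = set('\\^$@/<>{}[]')
# Assumption: an LU is '^...$'; other text up to the next unescaped '^' is blank.
TOKEN = re.compile(r'\^(?:\\.|[^\\$])*\$|(?:\\.|[^\\^])+', re.DOTALL)
# Surface form: everything before the first unescaped '/'.
SURFACE = re.compile(r'(?:\\.|[^\\/])*', re.DOTALL)


def escape_layer(text):
    """Add one layer of quoting: prefix every reserved char with a backslash."""
    return ''.join('\\' + ch if ch in RESERVED else ch for ch in text)


def merge(stream):
    """Merge LUs from the one tagged <MERGE_BEG> through the one tagged <MERGE_END>."""
    out = []
    pending = None  # list of quoted pieces while inside a BEG..END span
    for tok in TOKEN.findall(stream):
        is_lu = tok.startswith('^')
        if is_lu and pending is None and '<MERGE_BEG>' in tok:
            pending = []
        if pending is None:
            out.append(tok)  # outside a span: pass through unchanged
            continue
        if is_lu:
            pending.append(escape_layer(SURFACE.match(tok[1:-1]).group(0)))
        else:
            pending.append(escape_layer(tok))  # blanks and wordblanks are kept, quoted
        if is_lu and '<MERGE_END>' in tok:
            form = ''.join(pending)
            out.append('^' + form + '/' + form + '<MERGED>$')
            pending = None
    return ''.join(out)


if __name__ == '__main__':
    print(merge('^not/very<useful><MERGE_BEG><MERGE_END>$'))
```

A model like this is mainly useful for working out new expected outputs by hand before adding further test cases; the real tool remains the reference.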