-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathunipen.def
394 lines (380 loc) · 16.6 KB
/
unipen.def
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
.COMMENT ####################################
# UNIPEN 1.0 FORMAT DEFINITION #
####################################
# ----- Copyright (c) 1994, Isabelle Guyon, AT&T Bell Laboratories ------ #
# #
# DISCLAIMER: #
# #
# USER SHALL BE FREE TO USE AND COPY THIS SOFTWARE FREE OF CHARGE OR #
# FURTHER OBLIGATION. #
# #
# THIS SOFTWARE IS NOT OF PRODUCT QUALITY AND MAY HAVE ERRORS OR #
# DEFECTS. #
# #
# PROVIDER GIVES NO EXPRESS OR IMPLIED WARRANTY OF ANY KIND AND ANY #
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR PURPOSE ARE #
# DISCLAIMED. #
# #
# PROVIDER SHALL NOT BE LIABLE FOR ANY DIRECT, INDIRECT, SPECIAL, #
# INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF ANY USE OF THIS #
# SOFTWARE. #
# #
The format is self-defined from 3 basic keywords:
.COMMENT, .RESERVE and .KEYWORD
A - DATA TYPES
----------
.RESERVE [N] Integer or decimal number represented by digits separated by a
dot; may start with a sign; no commas allowed.
.RESERVE [S] String: any combination of keyboard ASCII symbols, except space,
new-line, tabulations and words starting by a dot in the first
column.
.RESERVE [F] Free text: a succession of strings separated by space,
new-line and tab.
.RESERVE [R] Reserved string: a string which has a special meaning for the
UNIPEN format, as defined in the reserved string glossary.
.RESERVE [L] Label: a string enclosed between double quotes which may contain
spaces new-lines or tabulations, all counted as spaces; the
escape character is backslash; inside a label, double quotes
should be replaced by \", backslash by \\, tabulations by \t and
new-lines by \n.
.RESERVE [.] Repeat the last type until a new type is indicated.
.RESERVE [+] Repeat all preceding types any number of times.
.COMMENT B - KEYWORDS
--------
.KEYWORD .KEYWORD [S] [R] [.] [F] Define a new keyword:
keyword, argument types, documentation.
.KEYWORD .RESERVE [S] [F] Define a new reserved string:
reserved string, documentation.
.KEYWORD .COMMENT [F] Comments for human reading, to be
ignored by the machine parser.
.KEYWORD .INCLUDE [S] Name of file to be included as header
(e.g. documentation or lexicon file).
** Do no put the PATH.
** No include file should contain
another include file.
.COMMENT --------- Mandatory declarations ------------------------------------
.KEYWORD .VERSION [N] MANDATORY version number of the format
(current version 1.0).
.KEYWORD .DATA_SOURCE [S] MANDATORY name of institution or
person where the data came from.
.KEYWORD .DATA_ID [S] Name of this database.
.KEYWORD .COORD [R] [.] Declaration of the coordinates used in
.PEN_DOWN and .PEN_UP components, a
subset of: X, Y, T, P, Z, B, RHO,
THETA, PHI, including at least X and Y.
.KEYWORD .HIERARCHY [S] [.] Declaration of segmentation hierarchy
used by .SEGMENT. Examples of arguments
may be: 0, 1, 2, ..., DOCUMENT, TEXT,
PARAGRAPH, PAGE, LINE, SENTENCE, WORD,
FORMULA, CHARACTER, STROKE, SHEET,
GLYPH, DIACRITICAL, GESTURE, KEY, etc.
This list is not limitative.
A typical hierarchy is:
.HIERARCHY SENTENCE WORD CHARACTER.
.COMMENT --------- Data documentation ----------------------------------------
.KEYWORD .DATA_CONTACT [F] Where to reach the person responsible
to answer questions about the database.
.KEYWORD .DATA_INFO [F] Nature and structure of the data.
.KEYWORD .SETUP [F] Data collection recording conditions.
.KEYWORD .PAD [F] Data collection device.
.COMMENT --------- Alphabet -------------------------------------------------
.KEYWORD .ALPHABET [L] [.] Declaration of all characters used in
data labels. In this version of the
UNIPEN format, characters are
restricted to English keyboard ASCII.
"0" "1" "2" "3" "4" "5" "6" "7" "8"
"9" "A" "B" "C" "D" "E" "F" "G" "H"
"I" "J" "K" "L" "M" "N" "O" "P" "Q"
"R" "S" "T" "U" "V" "W" "X" "Y" "Z"
"a" "b" "c" "d" "e" "f" "g" "h" "i"
"j" "k" "l" "m" "n" "o" "p" "q" "r"
"s" "t" "u" "v" "w" "x" "y" "z" "~"
"!" "@" "#" "$" "%" "^" "&" "*" "("
")" "-" "+" "=" "|" "\\" "/" "{" "}"
"?" "[" "]" "\"" ":" "<" ">" "," "."
";" "'" "`" "_" "\ "
A broader set of characters will be
allowed in the next versions.
.KEYWORD .ALPHABET_FREQ [N] [.] Natural frequencies of characters in
the data (need not add up to one).
.COMMENT --------- Lexicon ---------------------------------------------------
.KEYWORD .LEXICON_SOURCE [S] Name of institution or person where
the lexicon came from.
.KEYWORD .LEXICON_ID [S] Name of the lexicon.
.KEYWORD .LEXICON_CONTACT [F] Where to reach the person responsible
to answer questions about the lexicon.
.KEYWORD .LEXICON_INFO [F] Informations about the lexicon.
.KEYWORD .LEXICON [L] [.] Representative set of class labels
found in the database, generally at
the word or character segmentation
level.
.KEYWORD .LEXICON_FREQ [N] [.] Frequencies of lexical entries defined
by .LEXICON. Lexical frequencies
characterize the distribution
from which data samples were drawn
at random. Therefore, the number of
times a lexical entry appears in the
database should be approximately
proportional to the lexical
frequencies. Normalizing such that
all numbers add-up to one is not
necessary.
.COMMENT --------- Data layout -----------------------------------------------
.KEYWORD .X_DIM [N] Width of the bounding box, in pixels
(using the resolution of the input
device, not that of the display).
.KEYWORD .Y_DIM [N] Height of the bounding box, in pixels.
.KEYWORD .H_LINE [N] [.] Distance in pixels between the bottom
of the bounding box and horizontal
guidelines, such as a baseline.
.KEYWORD .V_LINE [N] [.] Distance in pixels between the left
edge of the bounding box and vertical
guidelines.
.COMMENT --------- Unit system ---------------------------------------------
.KEYWORD .X_POINTS_PER_INCH [N] x resolution of the data collection
device (1 inch ~ 2.5 cm).
.KEYWORD .Y_POINTS_PER_INCH [N] y resolution of the data collection
device.
.KEYWORD .Z_POINTS_PER_INCH [N] z (altitude) resolution of the data
collection device.
.KEYWORD .X_POINTS_PER_MM [N] x resolution of the data collection
device (in SI units).
.KEYWORD .Y_POINTS_PER_MM [N] y resolution of the data collection
device.
.KEYWORD .Z_POINTS_PER_MM [N] z (altitude) resolution of the data
collection device.
.KEYWORD .POINTS_PER_GRAM [N] Pressure resolution of the data
collection device.
.KEYWORD .POINTS_PER_SECOND [N] Sampling rate, MANDATORY if T not in
.COORD.
.COMMENT ------- Pen trajectory ----------------------------------------------
.KEYWORD .PEN_DOWN [N] [.] Pen down component: repeated sequences
of coordinates as defined by .COORD,
pen touching the pad surface.
.KEYWORD .PEN_UP [N] [.] Pen up component: same as .PEN_DOWN,
but with the pen not touching the pad
surface.
.KEYWORD .DT [N] Elapsed time measured when pen
coordinates are elided because the pen
was immobile or out of proximity of
the pad sensors.
.COMMENT -------- Data annotations -------------------------------------------
.KEYWORD .DATE [N] [N] [N] Date stamp: month, day, year.
.KEYWORD .STYLE [R] PRINTED, CURSIVE or MIXED.
.KEYWORD .WRITER_ID [S] MANDATORY unique writer identification.
.KEYWORD .COUNTRY [S] Country of origin.
.KEYWORD .HAND [R] Writer hand: L for left, R for right.
.KEYWORD .AGE [N] Writer age, in years.
.KEYWORD .SEX [R] Writer sex: M for male, F for female.
.KEYWORD .SKILL [R] Skill of writer, familiarity with
input device: BAD, OK or GOOD.
.KEYWORD .WRITER_INFO [F] Misc. information about writer.
.KEYWORD .SEGMENT [S] [R] [R] [L] Type of segment, its delineation,
quality and label.
-> First argument: type of segment,
such as the ones declared in
.HIERARCHY (e. g. SENTENCE, WORD,
CHARACTER, etc.).
-> Second argument: segment
delineation by a A[:M]]-[B[:N]],[C]
expression (see reserved string
glossary). Components are numbered in
order of apparition in the present
data set, starting from zero.
The component counter is reset to zero
at each beginning of new file and
each .START_SET. Empty pen streams (or
"components") are NOT counted. If the
segment delineation is NON ambiguous,
the second argument may be either
replaced by ? or omitted,
if ALL following arguments are omitted.
-> Third argument: quality level,
BAD for illegible, OK for regular,
GOOD for superior, ? for unknown.
The quality may be omitted only if the
fourth argument is also omitted.
-> Fourth argument: label (Sentence,
word, character, etc.)
The label may be omitted.
.KEYWORD .START_SET [S] Start a new set; the argument is
the set name.
The component counter is reset to zero
and the lexicon is deleted.
In the absence of .START_SET, the
component counter is automatically
reset to zero at the beginning of each
file and the set name is the file name.
.KEYWORD .START_BOX [.] Erase all components from the previous
data bounding box, and start a new
one (useful for browsing). No argument.
In the absence of .START_BOX,
segmentation points of highest
hierarchy level will be used.
.COMMENT -------- Recognizer documentation -----------------------------------
.KEYWORD .REC_SOURCE [S] MANDATORY name of institution or
person where the recognizer came from.
.KEYWORD .REC_ID [S] MANDATORY recognizer name.
.KEYWORD .REC_CONTACT [F] Where to reach the person responsible
to answer questions about the
recognizer.
.KEYWORD .REC_INFO [F] Nature and structure of the recognizer,
number of free parameters, number of
training examples.
.KEYWORD .IMPLEMENT [F] Recognizer implementation, software
and hardware.
.COMMENT -------- Recognizer declarations ------------------------------------
.KEYWORD .TRAINING_SET [S] [S] [S] [R] [+] Training data set.
-> First argument: data source,
from .DATA_SOURCE.
-> Second argument: database name,
from .DATA_ID.
-> Third argument: data set name,
from .START_SET or the file name.
-> Fourth argument: segment of data
delineated by a [A[:M]]-[B[:N]],[C]
expression (see reserved string
glossary).
The four arguments types may be
repeated any number of times.
.KEYWORD .TEST_SET [S] [S] [S] [R] [+] MANDATORY test set.
Always disjoint from the training set.
Same argument types as for
.TRAINING_SET.
.KEYWORD .ADAPT_SET [S] [S] [S] [R] [+] Writer adaptation set on which the
recognizer was fine tuned to perform
best on a particular writer.
Same argument types as for
.TRAINING_SET.
.KEYWORD .LEXICON_SET [S] [S] [R] [+] Lexicon used by the recognizer.
-> First argument: lexicon source,
from .LEXICON_SOURCE.
-> Second argument: lexicon name,
from .LEXICON_ID.
-> Third argument: a [N]-[N],[N]
expression (see reserved string
glossary) defining the subset of
lexical entries used. The second
argument may be omitted if only one
lexicon is used.
.COMMENT -------- Recognition results ----------------------------------------
.KEYWORD .REC_TIME [R] [N] Delineation of a data segment, and
recognition time.
-> First argument: segment of data
delineated by a [A[:M]]-[B[:N]],[C]
expression (see reserved string
glossary).
-> Second argument: REAL recognition
time (in seconds) it took with the
implementation described in .IMPLEMENT.
.KEYWORD .REC_LABELS [S] [R] [R] [L] [.] Segment type, its delineation, its
acceptance decision and its
recognition labels.
-> First argument: type of segment,
such as the ones declared in
.HIERARCHY (e. g. SENTENCE, WORD,
CHARACTER, etc.).
-> Second argument: segment
delineation, as in .SEGMENT.
-> Third argument: ACCEPT, REJECT
or ? for unknown.
-> Following arguments: labels
(Characters, words, sentences, etc.)
in order of decreasing likelihood
(best guess first).
Labels may be omitted if the second
argument is REJECT.
.KEYWORD .REC_SCORES [S] [R] [N] [N] [.] Segment type, its delineation, its
acceptance score and its recognition
scores.
-> First argument: type of segment,
such as the ones declared in
.HIERARCHY (e. g. SENTENCE, WORD,
CHARACTER, etc.).
-> Second argument: segment delineation
as in .SEGMENT.
-> Third argument: acceptance score,
high => ACCEPT, low => REJECT,
0 if unknown.
-> Following arguments: scores for
the labels given in .REC_LABELS in
order of decreasing values
(highest score = best guess, first).
.COMMENT C - RESERVED STRING GLOSSARY
------------------------
(data types are also reserved, see section I)
.RESERVE [N]-[N],[N] Compact representation of a list of
numbers, used by .LEXICAL_SET,
.SEGMENT, .REC_TIME, .REC_LABELS,
.REC_SCORES. The list:
2, 3, 4, 5, 15, 9, 50, 51, 52 ,53,
54, 55 would be represented as:
2-5,15,9,50-55.
Commas allow non contiguous numbers
(useful for segmentation of delayed
strokes, i dots and t bars).
NO SPACES are allowed in the notation.
.RESERVE [A[:M]]-[B[:N]],[C] More flexible representation to
delineate segments of data
by breaking components. A,B and C are
component numbers, M and N are point
numbers in the component.
Both components and points are 0-base.
L defaults to zero and M to the last
point in the component. Thus the
[N]-[N],[N] expressions are special
cases of a [A[:M]]-[B[:N]],[C]
expression. The example
1:40-3,5,6:0-6:12 delineates component
1 from point 40 to the end,
all of component 2, 3 and 5 and
component 6 from the beginning to
point 12.
NO SPACES are allowed in the notation.
.RESERVE X X position of the pen on the pad
surface, in units of X given by
.X_POINTS_PER_INCH
.RESERVE Y Y position of the pen on the pad
surface, in units of Y given by
.Y_POINTS_PER_INCH
.RESERVE T Time in MILLISECONDS.
.RESERVE P Pressure in units of P given by
.UNITS_PER_GRAM. Use preferably a
linearized pressure in 1000 units
per gram of force and calibrate the
zero as the threshold of pen reaching
the pad surface. Negative pressures
can account for remaining non-
linearities and hysteresis.
.RESERVE Z Altitude above the pad surface, in
units of Z given by .Z_POINTS_PER_INCH.
.RESERVE BUTTON Barrel button states: 0, 1, ...
.RESERVE RHO Rotational angle of the stylus,
measured in degrees from some
nominal orientation of the stylus
(e. g. barrel button on top).
The angle increases with clockwise
rotation as seen from the rear
end of the stylus.
.RESERVE THETA XY angle of the stylus, measured in
degrees, increasing from the X axis
in the counter-clockwise direction.
.RESERVE PHI Z angle of the stylus, increasing
from the pad surface,
in the positive Z direction.
.RESERVE L Left handed.
.RESERVE R Right handed.
.RESERVE M Male.
.RESERVE F Female.
.RESERVE BAD Unskilled writer, illegible writing.
.RESERVE OK Average quality writing, unambiguously
legible.
.RESERVE GOOD Superior quality writing, most
recognizers should get it.
.RESERVE ? Unknown.
.RESERVE PRINTED Printed handwriting style.
.RESERVE CURSIVE Cursive handwriting style.
.RESERVE MIXED Mixed printed and cursive handwriting
style.
.RESERVE ACCEPT Accepted by recognizer.
.RESERVE REJECT Rejected by recognizer.