-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathxmltc.tex
413 lines (375 loc) · 19.6 KB
/
xmltc.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
%--------------------------------------------------------------------
% $Id: xmltc.tex,v 1.26 2008-01-29 12:41:42 stijn Exp $
%--------------------------------------------------------------------
\documentclass[twoside,leqno,onecolumn,pdftex]{article}
\usepackage{LaTex_Template}
\usepackage{url}
\usepackage{xspace}
\newcommand{\error}{\textsf{error}\xspace}
\newcommand{\REMARK}[1]{$\diamond$\textbf{#1}$\diamond$}
%\newcommand{\REMARK}[1]{}
\begin{document}
\title{XML TYPECHECKING}
\author{ V\'eronique Benzaken \\
Universit\'e Paris 11\\
\url{http://www.lri.fr/~benzaken}
\\[2ex]
Giuseppe Castagna \\
C.N.R.S. and Universit\'e Paris 7\\
\url{http://www.pps.jussieu.fr/~gc}
\\[2ex]
Haruo Hosoya \\
The University of Tokyo \\
\url{http://arbre.is.s.u-tokyo.ac.jp/~hahosoya}
\\[2ex]
Benjamin C. Pierce\\
University of Pennsylvania \\
\url{http://www.cis.upenn.edu/~bcpierce}
\\[2ex]
Stijn Vansummeren \\
Hasselt University and transnational University of Limburg\\
\url{http://alpha.uhasselt.be/~lucg5855/}
}
\date{}
\maketitle
\noindent
\begin{synonyms}
None
\end{synonyms}
~\\\\
\begin{definition}
In general, \emph{typechecking} refers to the problem where, given a
program $P$, an input type $\sigma$, and an output type $\tau$, one
must decide whether $P$ is \emph{type-safe}, that is, whether it
produces only outputs of type $\tau$ when run on inputs of type
$\sigma$. In the XML context, typechecking problems mainly arise in
two forms:
\begin{itemize}
\item
\item \emph{XML-to-XML transformations,} where $P$ transforms
XML documents conforming to a given type into XML documents
conforming to another given type; and
\item \emph{XML publishing,} where $P$ transforms relational
databases into XML views of these databases and it is necessary to
check that all generated views conform to a specified type.
\end{itemize}
A \emph{type} for XML documents is typically a regular tree
language, usually expressed as a schema written in a schema language
such as DTD, XML Schema, or a Relax NG (see \underline{XML Types}).
In the XML publishing case, the input type $\sigma$ is a
\underline{relational database schema}, possibly with
\underline{integrity constraints}.
Typechecking problems may or may not be decidable, depending on (1) the
class of programs considered, (2) the class of input types (relational
schemas, DTDs, XML Schemas, Relax NG schema, or perhaps other subclasses
of the regular tree languages), and (3) the class of output types. In
cases where it is decidable, typechecking can be done \emph{exactly}. In
cases where it is undecidable, one must revert to \emph{approximate} or
\emph{incomplete} typecheckers that may return false negatives---i.e., may
reject a program even if it is type-safe. Even when exact typechecking is
possible, approximate typechecking may be preferable as this is often
computationally cheaper than exact typechecking.
In the programming languages literature, typechecking often not only
entails verifying that all outputs are of type $\tau$, but also
requires detecting when the program may abort with a run-time error
on inputs of type $\sigma$~\cite{pierce:types-proglang}. The above
definition encompasses such cases: view run-time errors as a special
result value $\error$ and then typecheck a program against an output
type that does not contain the value $\error$.
\end{definition}
~\\\\
\begin{historicalBG}
Although typechecking is a fundamental and well-studied problem in
the theory of programming languages~\cite{pierce:types-proglang},
the types necessary for XML typechecking (based on regular tree
languages) differ significantly from the conventional data types
usually considered (i.e., lists, records, classes, and so on).
Indeed, although it is possible to encode XML types into
conventional datatypes, this encoding lacks flexibility in the sense
that programs tend to need artificial changes when types
evolve~\cite{HVPRegtypeToplas}. For this reason, Hosoya et
al.~\cite{HVPRegtypeToplas} proposed regular tree languages as the
``right'' notion of types for XML and presented an approximate
typechecker in this context. The typechecker was implemented in the
XML-to-XML transformation language XDuce~\cite{HosoyaPierceTOIT03}
whose approach was later extended to general purpose programming by
CDuce (functional programming) and Xtatic (object-oriented
programming). XDuce's approach also lies at the basis of XQuery's
typechecking algorithm~\cite{XQuery}.
The contemporary study of exact typechecking for XML-to-XML
transformations started with an investigation of relatively simple
transformation
languages~\cite{Milo:PODS99,Murata97,PapakonstantinouVianu00}.
Ironically enough, the fundamentals of exact typechecking for more
advanced transformation languages were already investigated a long
time before XML appeared~\cite{Engelfriet77,Engelfriet85}. These
fundamentals were revived in the XML era by Milo, Suciu, and Vianu
in their seminal work on $k$-pebble tree
transducers~\cite{DBLP:journals/jcss/MiloSV03}, which was later
extended to other transformation
languages~\cite{Tozawa01,Maneth05,Tozawa06}. The computational
complexity of exact typechecking was investigated in
\cite{DBLP:journals/jcss/MartensN07,Maneth07,FrischHosoya07}. Exact
typechecking algorithms for XML publishing scenarios were given by
Alon et al.~\cite{DBLP:journals/tocl/AlonMNSV03}.
\end{historicalBG}
~\\\\
\begin{scientific}
\subsection*{Exact Typechecking}
\paragraph*{XML-to-XML Transformations.}
\label{sec:xml-transformations}
Recall that in this setting, $P$ is a program that should transform
XML documents of a type $\sigma$ into documents of a type $\tau$.
When the languages in which the transformation and the types are
expressed are sufficiently restricted in power, exact typechecking is
possible. There are two major approaches to the construction of an
exact typechecking algorithm: {\em forward inference} and {\em
backward inference}. Forward inference solves the typechecking
problem directly by first computing the image $O$ of the input type
$\sigma$ under the transformation $P$, i.e., $O := \{ P(t) \mid t \in
\sigma \}$, and then checking $O \subseteq
\tau$~\cite{DBLP:journals/jcss/MiloSV03,Murata97,PapakonstantinouVianu00,DBLP:journals/jcss/MartensN07,DBLP:journals/sigmod/Suciu02}.
This approach does not work if $O$ goes beyond \emph{context-free tree
languages} as checking $O \subseteq \tau$ then becomes undecidable.
Sadly, this is already the case when $P$ is written in very simple
transformation languages, such as the \emph{top-down tree transducers}
(this fact is known as folklore; see, e.g.,~\cite{FrischHosoya07}.)
Also, computing $O$ itself becomes undecidable for more advanced
transformation languages.
Backward inference, on the other hand, first computes the pre-image
$I$ of the output type $\tau$ under $P$, i.e., $I := \{ t \mid P(t)\in
\tau \}$, and then checks $\sigma \subseteq I$. Backward inference
often works even when the transformation language is too expressive
for forward inference. The technique has successfully been applied to
a range of formal models of real-world transformation languages like
XSLT, from the top-down and bottom-up tree transducers
\cite{Engelfriet77}, to macro tree transducers
\cite{Engelfriet85,Maneth07,FrischHosoya07}, macro forest transducers
\cite{Perst04}, $k$-pebble tree transducers
\cite{DBLP:journals/jcss/MiloSV03}, tree transducers based on
alternating tree automata~\cite{Tozawa01}, tree transducers dealing
with atomic data values~\cite{DBLP:journals/sigmod/Suciu02}, and high-level tree
transducers~\cite{Tozawa06}.
As mentioned in the definition, static detection of run-time errors
can be phrased as a particular form of typechecking by introducing a
special output value $\error$. Exact typechecking in this form has
been investigated for XQuery programs written in the non-recursive
for-let-where return fragment of XQuery without automatic coercions
but with the various XPath axes; node constructors; value and node
comparisons; and node label and content inspections, in the setting
where the input type $\sigma$ is given by a recursion-free regular
tree language. The crux of decidability here is a small-model
property: if $P(t) = \error$ for some $t$ of type $\sigma$ then there
exists another input $t'$ of type $\sigma$ whose size depends only on
$P$ and $\sigma$ such that $P(t') = \error$. It then suffices to
enumerate all inputs $t' \in \sigma$ up to the maximum size and check
$P(t') = \error$. There are only a finite number of such $t'$ (up to
isomorphism, and $P$ cannot distinguish between isomorphic inputs),
from which decidability follows~\cite{vansummeren:welldef-trees}. This
small model property continues to hold when we extend the above XQuery
fragment with arbitrary primitives satisfying some general niceness
properties; see Vansummeren~\cite{vansummeren:welldef-trees}.
\paragraph*{XML Publishing.}
\label{sec:xml-publishing}
In this setting, the input to the program $P$ is a relational database
$D$. Suppose that $P$ computes its output XML tree by posing simple
select-project-join queries to $D$, nesting the results to these
queries, and constructing new XML elements. Exact typechecking for
programs of this form, when the input type $\sigma$ is a relational
database schema with key and foreign key constraints, and the output
type $\tau$ is a ``star-free'' DTD, is
decidable~\cite{DBLP:journals/tocl/AlonMNSV03}. (See
\cite{DBLP:journals/tocl/AlonMNSV03} for a precise definition and
examples of the concept ``star-free''.) As was the case for detecting
runtime errors in XQuery programs, the crux of decidability here is
again a small model property. Typechecking remains decidable for
output DTDs $\tau$ that are not star-free, but then the queries in $P$
must not use projection. Typechecking unfortunately becomes
undecidable when the output types $\tau$ are given by XML Schemas or
Relax NG Schemas~\cite{DBLP:journals/tocl/AlonMNSV03}. Typechecking
also becomes undecidable when $P$ uses queries more expressive than
select-project-join queries.
\subsection*{Approximate Typechecking}
The expressive power that realistic applications
require of practical transformation languages is often too high to
allow for exact typechecking. In such cases, one must revert to
approximate or incomplete typecheckers that guarantee that all
succesfully checked programs are type-safe, but that may also reject
some type-safe programs. Existing techniques can be grouped into two
categories: type systems and flow analyses.
\paragraph{Type systems}
Many conventional programming languages (such as C and Java) specify
what programs to accept by a {\em type
system}~\cite{pierce:types-proglang}. Typically, such a system
consists of a set of {\em typing rules} that determine the type of
each subexpression of a program. Often, in order to help the typechecker,
the programmer is required to supply {\em type
annotations} on variable declarations and in other specified places.
The pioneer work applying this approach to the XML setting was the
XDuce (``transduce'') language \cite{HosoyaPierceTOIT03},
whose type system is based on regular tree languages. One significant point
in this work is its definition of a natural notion of {\em subtyping}
as the inclusion relation between regular tree languages and
its demonstration of the usefulness of allowing a value of one
type to be viewed as another type with a syntaticaly completely
different structure \cite{HVPRegtypeToplas}. In addition, although
the decision problem for subtyping is known to be EXPTIME-complete,
the ``top-down algorithm'' used in the XDuce implementation is
empirically shown to be efficient
in most cases that actually arise in typechecking
\cite{HVPRegtypeToplas,SudaHosoya05,jacm08}. Such a type system
also needs machinery to reduce the amount of type annotations that
otherwise tends to be a burden to the user, in particular when the
language supports a non-trivial mechanism to manipulate XML documents
such as regular expression patterns \cite{HosoyaPierceRegpatJFP} or
filter expressions \cite{HosoyaFiltersJFP}. A series of works
address this problem by proposing automatic type inference schemes
that have certain precision properties in a sense similar to the exact
typechecking in the previous section
\cite{HosoyaPierceRegpatJFP,HosoyaFiltersJFP,VouillonPLANX06}. These
ideas have further been extended for XML attributes
\cite{HosoyaMurataTCS06} and parameteric polymorphism
\cite{HosoyaFrischCastagna05POPL,VouillonPOPL06}.
CDuce (pronounced ``seduce'') extends XDuce in various ways
\cite{cduce2003}. From a language point of view CDuce embraces
XDuce's approach of a functional language based on regular expression
patterns~\cite{HosoyaPierceRegpatJFP} and extends it with
finer-grained pattern matching, complete two-way compatibility with
programs and libraries in the OCaml programming language, Unicode,
queries, XML Schema validation,
and, above all, higher-order and overloaded functions. XML types are
enriched with general purpose data types, intersection and negation
types, and functional types. Finally, the CDuce type inference algorithm
for patterns is implemented by a new kind of tree automaton and proved
to be optimal~\cite{Frisch04}. Among these extensions he addition of
higher-order functions is significant. Theoretically, this extension
is not trivial first because functions do not fit well in the
framework of finite tree automata, but more deeply because this
entails a definitional cycle: the definition of
typechecking uses subtyping, whose definition then uses the semantics
of types (NB: subtyping is defined as inclusion between the sets
denoted by given two types), whose definition in turn uses
well-typedness of values; the last part depends on typechecking in the
presence of higher-order functions since typechecking of a function
abstraction $\lambda x.e$ requires analysis of its internal expression
$e$. Some solutions are known for breaking this circularity
\cite{FrischLICS02,VouillonPOPL06}. Also, an approach to combine one
of these treatments with high-order functions has been proposed
\cite{VouillonPOPL06}.
Several research groups explore ways of mixing a XDuce-like type
system with an existing popular language. Xtatic
\cite{XtaticExperience} carries out this program for the C\# language, developing
techniques to blend regular expression types with an object-oriented
type system. XJ \cite{harren-raghavachari-shmueli-05} makes a closely
related effort for Java. OCamlDuce \cite{ocamlduce} mixes with OCaml,
proposing a method to intermingle a standard ML type inference
algorithm with XDuce-like typechecking. XHaskell \cite{LuSulzmann04}
is also another instance for Haskell; their approach is, however, to
embed XML types into Haskell typing structures (such as tuples and
disjoint sums) in the style of data-binding, yet support XDuce-like
subtyping in its full flexibility by deploying a coercion technique
\cite{Sulzmann06}.
The formal semantics of XQuery defined by the W3C contains a type
system based on a set of inductive typing rules \cite{XQuery}. Their
first draft was heavily based on XDuce's type system
\cite{FernandezSimeonWadler01}. Later, they switched to a different
one that reflects the object-oriented hierarchical typing structure
adopted by XML Schema.
As byproducts of the above pieces of work, several optimization and
compilation techniques that exploit typing information have been
proposed \cite{Frisch04,LevinPierce05}.
\paragraph{Flow-analysis}
Flow analysis is a static analysis technique that has long been
studied in the programming language community. A series of
investigations has been conducted for adapting flow analysis to
approximate XML typechecking, concurrently to XDuce-related work,
\cite{Bigwig,Kirkegaard06,Moeller2005}. In this approach, the user
needs to write no type annotations for intermediate values like in
XDuce, but instead the static anlyzer completely infers them, thus
providing a more user-friendly system. One potential drawback is that
the specification is rather informal and therefore, when the analyzer
raises an error, the reason can sometimes be unclear; empirically,
however, such false negatives are rare.
Flow analysis is applied first to Bigwig language system, an extension
of Java with an XML-manipulating facility called ``templates''
\cite{Bigwig}. Though this first attempt handles only XHTML types,
they naturally generalize it to arbitrary XML types, calling the
resulting system XAct \cite{Kirkegaard06}. Their techniques are
further extended and applied to static analysis of XSLT
\cite{Moeller2005}.
\end{scientific}
~\\\\
\begin{keyApp}
XML typechecking is a key component of XQuery, the standard XML
query language. As outlined above, XML typechecking in XQuery is
based on a set of inductive typing rules that reflects the
object-oriented hierarchical typing structure adopted by XML
Schema. Different approaches to XML typechecking may be found in
research prototypes like CDuce, OcamlDuce, XDuce, XAct, XHaskell,
and Xtatic for which references may be found below.
\end{keyApp}
~\\\\
% \begin{future}
% Optional. Open problems and discussions.
% \end{future}
% ~\\\\
% \begin{experimental}
% Optional. PROBABLY NOT NEEDED?
% \end{experimental}
% ~\\\\
% \begin{datasets}
% Optional. PROBABLY NOT NEEDED?
% \end{datasets}
% ~\\\\
\begin{URL}
\begin{description}
\item[CDuce: ] \url{http://www.cduce.org}
\item[OCamlDuce: ] \url{http://www.cduce.org/ocaml.html}
\item[XAct: ] \url{http://www.brics.dk/Xact/}
\item[Xduce: ] \url{http://xduce.sourceforge.net}
\item[XHaskell: ] \url{http://taichi.ddns.comp.nus.edu.sg/taichiwiki/XhaskellHomePage}
\item[Xtatic: ] \url{http://www.cis.upenn.edu/~bcpierce/xtatic}
\end{description}
\end{URL}
~\\\\
\begin{crossreference}
XML Types,
Relational Database Schema,
Integrity Constraints
\end{crossreference}
~\\
%bibliograph, took from siam ltexpprt.sty
\def\thebibliography#1{%
%\cleardoublepage
\parindent 0em
\vspace{6pt}
\begin{flushleft}\normalsize{\bf RECOMMENDED READING} \\
\suggestion
\end{flushleft}
\addvspace{3pt}\nopagebreak\list
%% default is no labels, for those not using \cite or BibTeX
% {[\arabic{enumi}]} {\settowidth\labelwidth{[#1]}
{[\arabic{enumi}]}{\settowidth\labelwidth{mm}
\leftmargin\labelwidth
\advance\leftmargin\labelsep
\usecounter{enumi}\@bibsetup}
\def\newblock{\hskip .11em plus .33em minus -.07em}
\sloppy\clubpenalty4000\widowpenalty4000
\sfcode`\.=1000\relax}
\def\@bibsetup{\itemindent=0pt \itemsep=0pt \parsep=0pt
\small}
\def\suggestion{ A gentle introduction to exact typechecking for both
XML-to-XML transformations and XML publishing can be found in
\cite{DBLP:journals/sigmod/Suciu02}. A non-technical presentation of
Regular Expression Types and Patterns and their use in query
languages can be found in the joint DBPL and XSym 2005 invited
talk~\cite{Cas05a}. For a more complete presentation of Regular
Expression Types and Patterns and the associated type-checking and
subtyping algorithms we recommend the reader to refer to the seminal
JFP article by Hosoya, Pierce, and Vouillon~\cite{HVPRegtypeToplas}.
The joint ICALP and PPDP 2005 keynote talk~\cite{CF05} constitutes a
relatively simple survey of the problem of type-checking
higher-order functions and an overview on how to derive subtyping
algorithms semantically: full technical details can be found in an extended version published in the JACM~\cite{jacm08}. } \bibliographystyle{plain}
\bibliography{biblio,hh/haruo.bib}
\end{document}