<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Introduction To LibTidy</title>
<meta http-equiv="Content-Type" content="text/xhtml; charset='us-ascii'" />
<link rel="stylesheet" href="tidy.site.css" type="text/css" />
<meta charset="UTF-8" />
<style type="text/css">
/*<![CDATA[*/
table.info
{
border-style: none
}
div.content
{
text-align: left;
vertical-align: text-top;
width: 100%;
top: 0;
}
div.links
{
vertical-align: text-top;
/* margin-left: 0%;
** margin-right: 2%;
*/
width: 30%;
position: absolute;
top: 0;
left: 70%;
}
code
{
/* font-weight: bold */
}
/*]]>*/
</style>
</head>
<body>
<div class="legacy">
<h1>Legacy Content Notice</h1>
<p>The content on this page is no longer under active maintenance;
however, it remains here for historical interest and to prevent
link rot. Please see the current content here:
<a href="http://www.html-tidy.org/developer/">http://www.html-tidy.org/developer/</a></p>
<p>Links have been updated and some editorial content has been added, but
there is no guarantee of future changes to this page.</p>
</div>
<div class="content">
<h1>What's all this about LibTidy?</h1>
<p>LibTidy, as the name suggests, is a library version of Dave Raggett's
popular HTML Tidy. In fact, one of the motivations for starting the
SourceForge project was to refactor HTML Tidy as a callable library.
Although the command line tool is great, it is difficult and inefficient
to integrate into other software.</p>
<h2 id="requirements">Requirements</h2>
<p>We had several informal requirements for the library:</p>
<dl>
<dt>You Can Get There From Here</dt>
<dd><p>Probably the most important requirement is that the library
be easy to integrate. Because of the almost universal
adoption of C linkage, a C interface may be called from a
great many programming languages. This, and the fact that the code
was already in C and the team was already most comfortable with
C, led to the decision that the library's public interface
should be kept in C.</p>
<p>The other major design decision was to use opaque types in the
public interface. This allows the application to simply pass
an integer around, which minimizes the need to transform data
types between languages.</p>
<p>This strategy has already paid off. It was straightforward
to write very thin library wrappers for C++, Pascal, and COM/ATL.
It was also quick to generate a Perl wrapper using
<a href="http://www.swig.org">SWIG</a>. SWIG wrappers for Python,
Ruby, Java and others should also be possible.</p>
</dd>
<dt>Don't Break Anything</dt>
<dd><p>Of course, Tidy must remain Tidy. It wasn't acceptable to
introduce bugs or drop (many) features. In the end, the body
of test documents proved invaluable to getting things working.</p>
</dd>
<dt>Thread Safe / Reentrant</dt>
<dd><p>Because there are many uses for HTML Tidy, from
content validation and content scraping to conversion to
XHTML, it was important to make LibTidy run reasonably
well within server applications as well as on the client side.</p>
<p>This requirement implies that the library be fully re-entrant
so that it may be used within multi-threaded applications.</p>
</dd>
<dt>Adaptable I/O</dt>
<dd><p>As part of the larger integration strategy, it was decided to
fully abstract all I/O. This means a (relatively) clean
separation between character encoding processing and shovelling
bytes back and forth. Internally, the library reads from
"sources" and writes to "sinks". This abstraction is used for
both markup and configuration "files". Concrete implementations
are provided for file and memory I/O, and new sources and sinks
may be supplied through the public interface.</p>
</dd>
</dl>
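<p>To make the opaque-type idea concrete, here is a minimal sketch of the
general pattern in C. The names (<code>Doc</code>, <code>doc_create</code>
and so on) are hypothetical and are not LibTidy's API. The point is only
that callers hold a handle whose layout they never see, so a binding in
another language needs nothing more than a pointer-sized value.</p>

```c
#include <stdlib.h>
#include <string.h>

/* Public view: callers only ever see a pointer to an incomplete type.
   In a real library this typedef would live in the public header.
   "Doc" is a hypothetical name, not LibTidy's. */
typedef struct DocImpl* Doc;

/* Private view: the layout is known only to the implementation file. */
struct DocImpl {
    char* text;      /* owned copy of the last input */
};

/* Create a document. All state hangs off the handle (no globals),
   which is also what keeps the design re-entrant. */
Doc doc_create(void)
{
    return (Doc)calloc(1, sizeof(struct DocImpl));
}

/* Store a private copy of the input; returns 0 on success, -1 on failure. */
int doc_set_text(Doc doc, const char* text)
{
    size_t len;
    char* copy;

    if (doc == NULL || text == NULL)
        return -1;
    len = strlen(text) + 1;
    copy = (char*)malloc(len);
    if (copy == NULL)
        return -1;
    memcpy(copy, text, len);
    free(doc->text);
    doc->text = copy;
    return 0;
}

/* Read back the stored text; never returns NULL. */
const char* doc_get_text(Doc doc)
{
    return (doc != NULL && doc->text != NULL) ? doc->text : "";
}

/* Free the document and everything it owns. */
void doc_release(Doc doc)
{
    if (doc != NULL) {
        free(doc->text);
        free(doc);
    }
}
```

<p>Because each handle carries its own state, two threads can work on
separate handles without sharing anything.</p>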
<p>We had some prior art to follow as well, most notably Marc-Andre
Lemburg's <a href="http://www.lemburg.com/files/python/mxTidy.html">mxTidy</a>.
In the process of writing a Python wrapper for Tidy, Marc-Andre applied
these principles and built a C library. LibTidy can be seen as a completion
of Marc-Andre's work.</p>
<div class="legacy">
<p>This <strong>Getting Started</strong> content is obsolete. Please
consult the <a href="https://github.com/htacg/tidy-html5/tree/master/README">current
documentation in our source repository</a>, and then skip forward to
<a href="#example"><strong>Example</strong></a>.</p>
</div>
<h2 id="start">Getting Started</h2>
<h3>Get The Source</h3>
<p>The best way to get the lib sources is directly from CVS. If you have
CVS installed (recommended!), just execute the following commands:</p>
<pre>
C:\src> mkdir tidylib
C:\src> cd tidylib
C:\src\tidylib> set TIDYCVSROOT=:pserver:[email protected]:/cvsroot/tidy
C:\src\tidylib> cvs -d %TIDYCVSROOT% login
C:\src\tidylib> cvs -d %TIDYCVSROOT% export -d C:\src\tidylib -r HEAD _
build console htmldoc include src test
</pre>
<p>When CVS prompts you for the password, just hit ENTER. The underscore
(_) above denotes line continuation. Do not type it in, just use one
long command line. The procedure is similar for Unix variants. Just
translate to the appropriate path separator for your file system and
do not use the -d <dir> option. Copy and paste the above into a script
or batch file. For the truly lazy, you can pull a gzipped source tarball
from the Tidy <a href="http://tidy.sourceforge.net/#source">Project
Page</a>.</p>
<h3>Build It</h3>
<p>For an overview of build options, see build/readme.txt. It describes
the overall layout and more info on supported build systems.</p>
<h4>Unix / GNU</h4>
<p>For GNU gcc, just use gmake with <code>build/gmake/Makefile</code>.
The usual target is <code>all</code>. If you want a debug build, use
the <code>debug</code> target. For other Unix compilers, you may have
to set the CC macro to point to your compiler, usually just
<code>cc</code>. The same large number of Unix systems is supported
"out of the box" as with Tidy Classic. Tidy usually does a good job of
automatically identifying the current platform. If not, tweak
platform.h as needed and send us a patch!</p>
<p>If you are using GCC/MinGW, you should use gmake as well.</p>
<p>In addition, there are targets for <code>clean</code> and
<code>install</code>. Be sure to look at the Makefile before using
<code>install</code> to make sure the binaries, headers and library
go where you want. By default, they go to <code>/usr/bin</code>,
<code>/usr/include</code>, and <code>/usr/lib</code>, respectively.
There are macros in the Makefile to customize your installation.</p>
<pre>
make all
</pre>
<h4>Windows / Visual C++</h4>
<p>For VC++, you can use either <code>msvc/Makefile.vc6</code> on
the command line or <code>build/msvc/tidy.dsw</code> in the IDE. As
the names imply, these work with Visual C++ version 6.0. Service
Pack 3 is highly recommended. Makefile.vc6 supports the same targets:
<code>all</code>, <code>debug</code>, <code>clean</code> and
<code>install</code> are all available.</p>
<p><code>nmake /f Makefile.vc6 all</code></p>
<h4>GNU AutoConf/AutoMake</h4>
<p>The input files to drive the GNU AutoConf tool set have been added.
See <code>build/gnuauto/readme.txt</code> for instructions on how to
use GNU build tools with Tidy.</p>
<h2 id="example">Example</h2>
<p>Perhaps the easiest way to understand how to call Tidy is to see
a simple program that uses it. A basic thing to know about the API
is that functions that return an integer use the following values:</p>
<dl>
<dt>0 == Success</dt>
<dd><p>Good to go.</p></dd>
<dt>1 == Warnings, No Errors</dt>
<dd><p>Check error buffer or track error messages for details.</p></dd>
<dt>2 == Errors and Warnings</dt>
<dd><p>By default, Tidy will not produce output. You can force output
with the <a href="http://api.html-tidy.org/tidy/quickref_5.0.0.html#force-output"><code>TidyForceOutput</code></a> option.
As with warnings, check error buffer or track error
messages for details.</p></dd>
<dt><0 == Severe error</dt>
<dd><p>Usually value equals <code>-errno</code>. See errno.h.</p></dd>
</dl>
<p>Also, by default, warning and error messages are sent to <code>stderr</code>.
You can redirect diagnostic output using either <code>tidySetErrorFile()</code>
or <code>tidySetErrorBuffer()</code>. See <code>tidy.h</code> for details.</p>
<pre>
#include <tidy.h>
#include <buffio.h>   // Named tidybuffio.h in tidy-html5
#include <stdio.h>
#include <errno.h>

int main(int argc, char **argv )
{
  const char* input = "<title>Foo</title><p>Foo!";
  TidyBuffer output = {0};
  TidyBuffer errbuf = {0};
  int rc = -1;
  Bool ok;

  TidyDoc tdoc = tidyCreate();                     // Initialize "document"
  printf( "Tidying:\t%s\n", input );

  ok = tidyOptSetBool( tdoc, TidyXhtmlOut, yes );  // Convert to XHTML
  if ( ok )
    rc = tidySetErrorBuffer( tdoc, &errbuf );      // Capture diagnostics
  if ( rc >= 0 )
    rc = tidyParseString( tdoc, input );           // Parse the input
  if ( rc >= 0 )
    rc = tidyCleanAndRepair( tdoc );               // Tidy it up!
  if ( rc >= 0 )
    rc = tidyRunDiagnostics( tdoc );               // Kvetch
  if ( rc > 1 )                                    // If error, force output.
    rc = ( tidyOptSetBool(tdoc, TidyForceOutput, yes) ? rc : -1 );
  if ( rc >= 0 )
    rc = tidySaveBuffer( tdoc, &output );          // Pretty Print

  if ( rc >= 0 )
  {
    if ( rc > 0 )
      printf( "\nDiagnostics:\n\n%s", errbuf.bp );
    printf( "\nAnd here is the result:\n\n%s", output.bp );
  }
  else
    printf( "A severe error (%d) occurred.\n", rc );

  tidyBufFree( &output );
  tidyBufFree( &errbuf );
  tidyRelease( tdoc );
  return rc;
}
</pre>
<p>Look Ma, no temp files!</p>
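<p>The return-code convention listed at the start of this section can be
wrapped in a small helper. The sketch below is not part of LibTidy: the
enum and function names are made up for illustration, and only the integer
ranges come from the list above. It needs no Tidy headers.</p>

```c
/* Hypothetical categories for the documented return-code ranges. */
typedef enum {
    RC_SUCCESS,   /* 0: good to go */
    RC_WARNINGS,  /* 1: warnings, no errors */
    RC_ERRORS,    /* 2: errors and warnings */
    RC_SEVERE     /* < 0: severe error, usually -errno */
} RcClass;

/* Map an integer result onto one of the four documented categories.
   Values above 2 are treated as errors, mirroring the "rc > 1" test
   in the example program. */
RcClass classify_rc(int rc)
{
    if (rc < 0)
        return RC_SEVERE;
    if (rc == 0)
        return RC_SUCCESS;
    if (rc == 1)
        return RC_WARNINGS;
    return RC_ERRORS;
}

/* Short human-readable label for logging. */
const char* describe_rc(int rc)
{
    switch (classify_rc(rc)) {
    case RC_SUCCESS:  return "success";
    case RC_WARNINGS: return "warnings, no errors";
    case RC_ERRORS:   return "errors and warnings";
    case RC_SEVERE:   return "severe error";
    }
    return "unknown";
}

/* By default no output is produced when there are errors, so the
   caller has to enable TidyForceOutput before saving. */
int needs_force_output(int rc)
{
    return rc > 1;
}
```

<p>A caller could log <code>describe_rc(rc)</code> after each step and
consult <code>needs_force_output(rc)</code> before saving.</p>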
<h2 id="appnotes">Application Notes</h2>
<p>Of course, there are functions to parse and save both markup and
configuration files. For the adventurous, it is possible to create
new input sources and output sinks. For example, a URL source could
pull the markup from a given URL.</p>
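<p>The source and sink idea can be pictured as a pair of small structs,
each holding a context pointer and a byte-level callback. The sketch below
is a self-contained model with hypothetical names (<code>ByteSource</code>,
<code>ByteSink</code>, <code>pump</code>); LibTidy's real declarations are
in <code>tidy.h</code> and may differ in detail.</p>

```c
#include <stddef.h>
#include <string.h>

#define END_OF_INPUT (-1)

/* Hypothetical source: the core pulls one byte at a time. */
typedef struct {
    void* data;                       /* implementation-specific context */
    int  (*get_byte)(void* data);     /* returns 0..255 or END_OF_INPUT */
} ByteSource;

/* Hypothetical sink: the core pushes one byte at a time. */
typedef struct {
    void* data;
    void (*put_byte)(void* data, unsigned char b);
} ByteSink;

/* Concrete source that reads from a string in memory. */
typedef struct { const char* text; size_t pos; size_t len; } MemIn;

static int mem_get(void* data)
{
    MemIn* in = (MemIn*)data;
    if (in->pos >= in->len)
        return END_OF_INPUT;
    return (unsigned char)in->text[in->pos++];
}

ByteSource mem_source(MemIn* in, const char* text)
{
    ByteSource src;
    in->text = text;
    in->pos = 0;
    in->len = strlen(text);
    src.data = in;
    src.get_byte = mem_get;
    return src;
}

/* Concrete sink that writes into a fixed, NUL-terminated buffer.
   Bytes that do not fit are silently dropped in this sketch. */
typedef struct { char buf[256]; size_t used; } MemOut;

static void mem_put(void* data, unsigned char b)
{
    MemOut* out = (MemOut*)data;
    if (out->used + 1 < sizeof out->buf) {
        out->buf[out->used++] = (char)b;
        out->buf[out->used] = '\0';
    }
}

ByteSink mem_sink(MemOut* out)
{
    ByteSink dst;
    out->used = 0;
    out->buf[0] = '\0';
    dst.data = out;
    dst.put_byte = mem_put;
    return dst;
}

/* Stand-in for the library core: it only knows the two interfaces.
   Returns the number of bytes read from the source. */
size_t pump(ByteSource* src, ByteSink* dst)
{
    size_t n = 0;
    int c;
    while ((c = src->get_byte(src->data)) != END_OF_INPUT) {
        dst->put_byte(dst->data, (unsigned char)c);
        n++;
    }
    return n;
}
```

<p>A URL source would follow the same shape: its context would hold the
connection, and its callback would return the next byte of the response.</p>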
<p>It is also worth remembering that
an application may instantiate <em><b>any number</b></em> of document and
buffer objects. They are fairly cheap to initialize and destroy (just
memory allocation and zeroing, really), so they may be created
and destroyed locally, as needed. There is no problem keeping them
around a while for keeping state. For example, a server app might
keep a global document as a master configuration. As documents are
parsed, they can copy their configuration data from the master
instance. See <code>tidyOptCopyConfig()</code>. If the master copy is
initialized at startup, no synchronization is necessary.</p>
<h2 id="apidocs">API Docs</h2>
<p>Documentation comments have been added to the Tidy header files, and the
<a href="http://api.html-tidy.org/">API Docs</a> are generated from them using
<a href="http://www.stack.nl/~dimitri/doxygen/">Doxygen</a>.</p>
<div class="legacy">
<p>This <strong>Nightly Build</strong> content is obsolete.</p>
</div>
<h2 id="nightlybuild">Nightly Build</h2>
<p>The build procedures on the SourceForge
Compile Farm have been updated to produce the command line driver
based on the library sources. See
<a href="http://binaries.html-tidy.org">Tidy Binaries</a>.</p>
<div class="legacy">
<p>This <strong>Future Directions</strong> content is obsolete. Please
consult our
<a href="https://github.com/htacg/community/blob/master/roadmap.md">roadmap</a>
in our Community repository.</p>
</div>
<h2 id="future">Future Directions</h2>
<p>The ink isn't dry yet on LibTidy and already folks want more! Well,
waddaya expect? Several ideas have been discussed on the dev mailing
list.</p>
<dl>
<dt>Character Encoding</dt>
<dd><p>Currently, all character encoding support is hard-wired into
the library. This means we do a poor job of supporting many
popular encodings such as GB2312, EUC-KR, and those used for
Eastern European languages and Cyrillic. Text in any of these
encodings must first be transcoded into ISO-10646/Unicode before
Tidy can work with it.</p>
<p>Two basic approaches have been proposed: just use iconv, or
adapt Clark Cooper's XML::Encoding as a callable library.
On the face of it, iconv is preferable. Because it is GPL'ed,
however, the license may be incompatible. Also, there are
transcription issues related to Big5 and other code sets
that may or may not be addressed by iconv. XML::Encoding,
on the other hand, uses the Perl Artistic License and explicitly
supports all alternate transcriptions for Big5 and others. For more
info, see <a
href="http://search.cpan.org/src/COOPERCL/XML-Encoding-1.01/maps/">CPAN</a>
and <a href="https://github.com/htacg/tidy-html5/issues">Tidy Issues</a>.</p>
</dd>
<dt>Error Handling</dt>
<dd>
<br />
<ul>
<li>Categorize errors</li>
<li>Improve message localization</li>
<li>Improve separation of parsing and diagnostics</li>
</ul>
<br />
</dd>
<dt>Content Model</dt>
<dd>
<br />
<ul>
<li>Per-element-and-version attribute support</li>
<li>DTD Internal Subset support</li>
<li>Modular XHTML support (XHTML 1.1)</li>
</ul>
</dd>
</dl>
<p>
<em>Editorial changes on 23 November 2015 by J. Derry</em><br/>
<em>Page last updated on 26 November 2002 by C. Reitzel</em>
</p>
</div>
</body>
</html>