any-dl, a tool for downloading Mediathek video-files
====================================================
Overview
========
The tool any-dl is inspired by, and takes its name from, tools like
youtube-dl, arte-dl, dctp-dl, zdf-dl, ...
These are specialized download tools for videos
from YouTube as well as from TV broadcasting companies.
All of these tools download video files, and to accomplish this task
they need to download and analyze the web pages through which those videos are
presented to the viewer.
Each of these small tools is programmed to work only with certain video
archives, and a lot of work goes into each of them.
any-dl is intended to be generic enough to allow downloads of videos from all
these platforms. For this purpose it provides a Domain Specific Language
(DSL) which defines how the videos of a certain server can be downloaded.
The DSL is designed to allow defining parsers, which describe how to scrape
the archives.
This language is explained below.
As a normal user, however, you will usually not need to know this parser
definition language.
You just need to know how to use any-dl, and this is pretty simple.
So, before the parser definition language is explained,
the usage of any-dl is explained.
In short: any-dl is a program with its own small language
for parsing websites, with a
focus on video download.
If you miss a parser for a certain site, or if you have written
one yourself, please let me know.
Because any-dl delegates stream downloads to external tools,
it makes sense to also have the following tool installed:
- rtmpdump
Compilation / Installation / Setup
==================================
When you read this README file from within the directory
you downloaded (via git or elsewhere), then you
already have unpacked it.
You need to compile and install / setup the tool.
You will need to have OCaml installed, as well as
some libraries.
The library ocamlnet currently (May 2024) is not available for ocaml 5.x.
So you need to use ocaml 4. The currently used version in development is
ocaml 4.14.2.
When using OPAM, you need the following packages:
- pcre
- ocamlnet
- conf-gnutls (gnutls must be installed on your system too)
- xmlm
- yojson
- csv
If you don't use OPAM but instead the package manager from your Linux installation,
the packages might have different names.
On Arch some packages might only be available via AUR.
At the moment any-dl itself has no OPAM-package.
It might be added later.
To compile any-dl, just type "make" at the shell:
$ make
Then the file "any-dl" should be in the current directory.
You can then copy it to $HOME/bin if your PATH variable
includes that directory.
Or you may copy it to /usr/local/bin or /usr/bin,
depending on the Linux/Unix system you are using and
the filesystem standard it follows.
The file "rc-file.adl" does contain needed parser-definitions
for any-dl to work as expected.
This file must be available in one of three places:
- /etc/any-dl.rc
- $XDG_CONFIG_HOME/any-dl.rc ( default: $HOME/.config/any-dl.rc )
- $HOME/.any-dl.rc
So, please copy the file "rc-file.adl" from the local directory where you
built any-dl to one of the three places mentioned above.
This command does it for the default value of the XDG_CONFIG_HOME environment variable:
$ cp rc-file.adl $HOME/.config/any-dl.rc # copy the config file to the XDG-default-dir
But if the XDG_CONFIG_HOME-environment variable is set,
it's better to use it:
$ cp rc-file.adl $XDG_CONFIG_HOME/any-dl.rc # copy the config file to the XDG-dir
( For command-line newbies:
The "$" symbol at the start of the command lines shown here
represents the prompt of the shell; don't type it. )
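The three default locations can be summarized in a few lines of Python. This is only an illustrative sketch, not code taken from any-dl; the function name candidate_rc_files is made up for this example, and the order of the returned list simply follows the list above (the README does not state in which order the files are read).

```python
import os
from pathlib import Path


def candidate_rc_files(env=None):
    """Return the three default any-dl.rc locations named in this README."""
    env = os.environ if env is None else env
    home = Path(env.get("HOME", str(Path.home())))
    # XDG_CONFIG_HOME falls back to $HOME/.config when it is unset or empty
    xdg = Path(env.get("XDG_CONFIG_HOME") or home / ".config")
    return [
        Path("/etc/any-dl.rc"),
        xdg / "any-dl.rc",
        home / ".any-dl.rc",
    ]


if __name__ == "__main__":
    # Show which of the default config files exist on this machine.
    for path in candidate_rc_files():
        state = "found" if path.is_file() else "missing"
        print(f"{state:8} {path}")
```

Running the script prints one line per location, which can help to check where a copied "rc-file.adl" actually ended up.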
If you want to add your own parser definitions, it makes sense
to save the file "rc-file.adl" in the /etc/ directory, as mentioned above,
and then add your own parser definitions in one of the places
inside your HOME directory where an any-dl config file can be placed.
The advantage is that the config file coming with any-dl will
always be in the /etc/ directory, while your local parser definitions
are kept in your local configs in $HOME.
PLEASE, BE AWARE, THAT ANY-DL READS ALL CONFIG-FILES AS IF THEY WERE
CONCATENATED INTO ONE BIG FILE.
So, if you want to add your ADDITIONAL parsers to those that already
exist in "rc-file.adl" ( e.g. copied to /etc/any-dl.rc ),
just edit the config files in your $HOME directory
and write your own definitions into these files.
It's not necessary (and also not recommended) to keep there a copy of
"rc-file.adl" to which you added your parsers.
Write ONLY your own parsers into your local files.
In other words: if you want to add your own parser definitions, place them
solely in one of the default places for config files in $HOME.
Don't copy the contents of "rc-file.adl" stored in /etc/any-dl.rc.
( If you place the config file "rc-file.adl" in $XDG_CONFIG_HOME/any-dl.rc,
you can add your parsers in $HOME/.any-dl.rc. )
( Of course you can also add your additional parsers in the place
where all parsers from "rc-file.adl" are stored... but when you update
to a newer version of any-dl you might accidentally overwrite your own
definitions with the new "rc-file.adl" coming with that version. )
IF YOU WISH TO USE OTHER CONFIG-FILES, you can specify them
with the -f option of any-dl.
If you use the -f option, you can give a filename-path, which is used
as config file then.
BE AWARE: All DEFAULT PLACES of config files WILL THEN BE IGNORED!
If you wish to add more than one config file, you can do it by just
using the -f option more than once.
Usage
=====
You need to provide the url from the video archive,
and give it to any-dl as a command line argument.
Very often you have to quote the url inside of " and "
so that certain symbols are not interpreted by the shell,
from which you start any-dl.
For example on ARTE mediathek, there is
a telecast "Frankreichs mythische Orte", and the URL
of it is:
http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html
If you want to download the video of it, at the shell it will look like this:
$ any-dl "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"
Then any-dl would download the video. :-)
That's all :-)
The same principle holds true for any other archives, for which
a parser definition already is provided.
If there is no such parser defined, any-dl will tell you with an
exception-message.
You then may ask, if there already is a parser for it available,
written by the author of any-dl, or by any other persons.
Or you could learn the parser definition language and program your
own parser for that archive.
If you send the parser you wrote to the author of any-dl,
then in a newer release of any-dl, other people could use it also.
By the way: there are already also some parser definitions, that are not
focussed on certain video archives.
There is the parser "linkextract" as well as "linkextract_xml".
You can use them to pick out html-hyperreferences (typically called "links"
or "references") or links in xml-files.
A certain parser can be picked with the command-line switch "-p":
$ any-dl -p linkextract "http://videos.arte.tv/de/videos/frankreichs-mythische-orte--7167432.html"
will print out all hrefs of the document (and they should all appear as absolute URLs).
The names of all defined/available parsers can be displayed with the "-l" switch:
$ any-dl -l
If a parser has URLs on which it is invoked by default (when not using -p),
they are also displayed by the -l switch.
If you want to write your own parser-definitions, you need the list of
commands. You can get it with the -c switch:
$ any-dl -c
will print a list of all keywords that the lexer/scanner does accept.
That's enough for an introduction.
And here now follows a brief introduction into the parser definition language.
Parser-Definition Language: Intro
=================================
Here is a simple parser definition that picks out all
HTML hyper-references from a webpage and prints them.
parsername "linkextract": ( "" )
start
linkextract;
print;
end
As you can see, the definition gives a parsername
to the parser, and in between "start" and "end"
the commands that define the parser are listed.
A get-command that downloads the URL
(which is given via the command line) is done
implicitly.
Then the commands "linkextract" and "print"
are executed.
So, all links from the document, referred to by the URL
are printed.
The part with the parentheses and quoting symbols binds certain
URLs to this parser, so that a parser can be selected automatically
via the URL. A parser dedicated to a certain URL is thus invoked
to work on the document that has this URL.
Via command line arguments it is possible to select a different parser
instead of using the defaults.
As an example, look at the parser that looks up the
video files of the NDR TV broadcaster in Germany:
# Example-URL: http://www.ndr.de/fernsehen/sendungen/mein_nachmittag/videos/wochenserie361.html
#
parsername "ndr_mediathek_get": ( "http://www.ndr.de" )
start
match( "http://.*?mp4" );
rowselect(0);
store("url");
# download the video
# ------------------
paste("wget ", $url );
system;
end
There you can see, that the parser-name is set to
"ndr_mediathek_get", and the URL, to which this parser is bound by
default is "http://www.ndr.de".
This means that any URL that starts with "http://www.ndr.de"
will be parsed with the "ndr_mediathek_get" parser.
If you give a URL like the one in the example (shown above the parser)
as command line argument to any-dl, then the parser "ndr_mediathek_get"
is invoked to look for the video file.
Again, an implicit get is invoked.
Because the first document must be downloaded in any case,
the first get is done implicitly,
which makes writing the parsers easier.
Stack and named variables
-------------------------
This language is somewhat special in that it is a mix of
a stack-based language and one that allows named variables.
The stack has a size of one value.
Most functions use the stack. They can get their argument from there,
and they can put their results onto the stack.
A one-value stack, which is used to read arguments from and save
results to, behaves like a pipe in a Unix environment.
Something is written to a pipe by one command, and the same thing is read from the pipe by the next.
So, the stack emulates something like a pipe.
(Another analogy would be Perl's built-in variable $_,
but I think the pipe analogy fits the picture better.)
Because this mechanism is sometimes not flexible enough,
any-dl also allows storing data/results in named variables.
The NDR-example explained
-------------------------
The first command does a MATCH with regular expressions
on the contents of the first document.
It does the match on the document, which was downloaded by the implicit
GET-command. This document was put onto the 1-valued-stack.
The match command reads the argument (the document) from the stack,
tries to match for the certain regular expression, and puts the result
onto the stack.
Then from the result (a match is a 2D-matrix, meaning an array of arrays),
the first row (index == 0) is selected with ROWSELECT.
The resulting selection holds an array.
This selection result is put onto the 1-valued-stack.
The stack value (selection result) is stored in the named variable "url" for later use
via the STORE-command.
To come back to the pipe-analogy, it's like a pipe that would look like this
(pseudocode):
GET(<start-url>) | MATCH(<regular_expression>) | ROWSELECT( <index> ) | STORE( <varname> ) | ....
The paste-command pastes the literal string and the contents of the named variable "url"
together, and places the result on the 1-valued stack.
The system command tries to use the system() command (which you may know
from other programming languages, the shell or the system-API)
and as argument uses the value from the stack.
So, if the variable "url" contains the video URL,
the system()-call would look like this one:
system("wget <video-url>");
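To make the data flow tangible, the steps of the NDR parser can be imitated in Python. This is a rough model and not how any-dl is implemented: the sample document and its URLs are invented, Python's re module stands in for PCRE, and pasting a row by joining its elements is an assumption. The final command is only printed instead of being passed to system().

```python
import re

# Hypothetical stand-in for the page fetched by the implicit get.
document = '''
<a href="http://media.example.org/clip_low.mp4">low</a>
<a href="http://media.example.org/clip_high.mp4">high</a>
'''

variables = {}      # the named variables
stack = document    # the one-value stack after the implicit get

# match( "http://.*?mp4" ): one row per hit, column 0 is the whole match
stack = [[m.group(0), *m.groups()] for m in re.finditer(r"http://.*?mp4", stack)]

# rowselect(0): keep only the first row
stack = stack[0]

# store("url"): remember the row under a name
variables["url"] = stack

# paste("wget ", $url): assumption: a row is pasted as its joined elements
stack = "wget " + "".join(variables["url"])

# system; would now run the string on the stack; here it is only shown
print(stack)
```

Each assignment to stack overwrites the previous value, which is exactly the pipe-like behaviour described above; only the value saved with store survives under its name.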
That is the parser language explained by example.
I hope this example shows you what the input language
(parser definition language) is all about.
It's comparatively easy (IMHO), and in this way it is possible to have easy access
to a lot of different video archives, all with the same tool.
So it is not necessary to look for tool updates when some URLs on a
video archive page, or the way they are connected together, change.
If something changes in the way a video URL is presented on one of these
video archives / Mediatheken, then only the corresponding parser definition
needs to be updated. The tool any-dl itself does not need to be changed.
Also, the programmers of all the different tools that provide
video-download functionality, each putting in separate effort to make only
a certain archive accessible, can be freed to do just the basic analysis of
the web pages that provide the videos, and save the effort of programming a tool.
So, one tool and many archives, instead of many tools for some archives.
I think the advantage is obvious.
Now, details about the language will follow.
Language Features:
==================
Parser-Definitions:
parsername "<parser-name>": ( <list-of-urls> )
start
<command_1>
...
<command_n>
end
Example: see above.
<list-of-urls> is a comma-separated list of strings.
Commands all end with a semicolon ( ';' ).
Commands / functions that do not have parameters are used without
parentheses ( '(' and ')' ).
Only when a command / function needs arguments
are these passed inside parentheses ( '(' and ')' ),
which follow the name of the command/function.
Some commands are available with and without parentheses.
An example is the print-command/function.
String quoting at the moment has three different styles:
String-Quoting: " "
String-Quoting: >>> <<<
String-Quoting: _*_ _*_
The language offers a stack of size 1.
That means, that results from one command / function
can be passed as input for the next command/function
and this is default behaviour.
Not all commands / functions do need the stack
for input, and not all do leave something there
as result (and input for following functions/commands).
But if there is the need for transferring a result,
normally no additional variables are needed.
Most often, the data can be transferred from one function/command
to the next one via the 1-valued-stack.
But in certain cases, this is not enough.
For these cases there are also named variables.
To store the current data from the default-stack
under a certain name, the command
store("<variablename>");
will be used.
To copy (restore/recall) the value of the named variable back to the default stack,
the command
recall("<variablename>");
can be used.
In the paste()-command/function, it is possible, to access
named variables via the $-notation, that you might know from
other programming languages, like Perl for example.
In the NDR-parser, it looks like this:
paste("wget ", $url );
This does paste together the literal string "wget "
and the contents of the named variable "url".
The result of paste is stored on the one-valued stack.
And the system-command uses this value as its argument
(and therefore downloads a file with the wget-tool).
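The interplay of the one-value stack with store, recall and paste can be pictured with a toy model in Python. It is a sketch for illustration only: the class OneValueStack and its attribute names are hypothetical, and it assumes that store leaves the stack value in place, which this README does not state explicitly.

```python
class OneValueStack:
    """Toy model: one anonymous slot (tmpvar) plus a table of named variables."""

    def __init__(self, value=None):
        self.tmpvar = value
        self.named = {}

    def store(self, name):
        # assumption: store copies the value and leaves the stack untouched
        self.named[name] = self.tmpvar

    def recall(self, name):
        # copy the named value back onto the stack
        self.tmpvar = self.named[name]

    def paste(self, *parts):
        # a part written as "$name" is replaced by the named variable
        resolved = [self.named[p[1:]] if p.startswith("$") else p for p in parts]
        self.tmpvar = "".join(resolved)


s = OneValueStack("http://example.org/clip.mp4")
s.store("url")
s.paste("wget ", "$url")   # the stack now holds the pasted string
command = s.tmpvar
s.recall("url")            # the stored value is back on the stack
print(command)
print(s.tmpvar)
```

The point of the model is that paste overwrites the single stack slot, while the value saved under "url" stays available for a later recall.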
Startup-sequence:
-----------------
The document(-url) given via command line is loaded automatically.
The loaded document is automatically saved as a named variable (name: "BASEDOC").
Command Line Options:
---------------------
-l              list parser-definitions and related URLs
-p <parsername> selects a certain parser, to be used for all urls.
                The names that can be selected can be listed with
                the -l option, or one can look into the rc-file.
-f              filename for rc-file
-v              verbose output
-vv             very verbose output
-c              show commands of parserdef-language
-s              safe: no download via system invoked
-i              interactive: interactive features enabled
-a              auto-try: try all parsers
-as             auto-try-stop: try all parsers; stop after first success
-u              set the user-agent-string manually
-ir             set the initial referrer from '-' to custom value
-ms             set a sleep-time in a (bulk-) get-command in milli-seconds
                => sleeps only for bulk-get-commands (get that would call a list of documents,
                not for single get-commands)
-sep            set separator-string, which is printed between parser-calls
-help           Display this list of options
--help          Display this list of options
Examples:
---------
1.: Print html-links of a webpage:
If you want to print the href-links of html,
use any-dl with the predefined parser for link-extraction:
$ any-dl -p linkextract <url_list>
List of commands/keywords and a short-description of them:
=========================================================
appendto
Appends tmpvar to a named variable.
If that variable does not exist already, it's internally created as empty match-result
(which means the appendto-command then creates the variable itself with the new data).
"appendto" only works on match-results.
Two match results will be concatenated this way, and be saved in the
named variable automatically. (No store-command is needed after appendto.)
The items will be concatenated as rows, so appending one match-result to
another adds its rows to the match-result in the
named variable.
basename
creates the basename of an url or filename;
the leading filename or URL-path is removed
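The effect of basename can be approximated in Python as follows. This is a sketch under assumptions: the helper name is chosen to mirror the command, the example URL is invented, and how any-dl treats query strings or trailing slashes is not described here, so the sketch simply works on the path part.

```python
import posixpath
from urllib.parse import urlsplit


def basename(url_or_path):
    """Return the last path component of a URL or of a plain file path."""
    path = urlsplit(url_or_path).path   # for a plain path this is the path itself
    return posixpath.basename(path)


print(basename("http://www.example.org/videos/2024/clip.mp4"))
print(basename("/tmp/downloads/clip.mp4"))
```

Both calls print "clip.mp4", the part that remains after the leading URL or directory path is removed.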
call
call a macro.
The macro is working like textual insertion of the commands of the
macro at the place where the "call"-command is used.
csv_read
csv_read reads in a file as csv-file.
The result is placed in tmpvar as Match_result.
csv_save_as
csv_save_as does save a *match_result* to a csv-file.
All data is transformed to have equal number of columns in each row.
Arguments of csv_save_as() are appended into a resulting filename.
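What "equal number of columns in each row" means can be shown with a short Python sketch. It is not the any-dl implementation: the helper names are invented, and filling short rows with empty strings is an assumption, since the README only says that the rows are brought to equal width.

```python
import csv
import io


def pad_rows(rows, fill=""):
    """Extend every row to the width of the longest row."""
    width = max((len(row) for row in rows), default=0)
    return [list(row) + [fill] * (width - len(row)) for row in rows]


def to_csv_text(rows):
    """Render a match-result-like list of rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(pad_rows(rows))
    return buffer.getvalue()


print(to_csv_text([["title", "url", "size"], ["clip"]]))
```

The second row has only one element, so it is written with two empty trailing fields and every line of the output has three columns.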
csv_save
csv_save does save a *match_result* to a csv-file.
All data is transformed to have equal number of columns in each row.
The filename is derived from the used STARTURL.
The character set is reduced to a subset of ASCII.
".csv" is appended automatically.
colselect
selects columns from a match-result
# Example:
# --------
colselect(2);
delete
deletes / removes a variable.
It is not accessible anymore then.
This means: accessing it can result in an error,
because it's like accessing a variable that was not
defined at all.
download
downloads an entity and stores it to a file.
# Examples:
# ---------
download;
download( $filename );
dropcol
drops a column from a match-result
droprow
drops a row from a match-result
dummy
just a dummy command (something like a NOP of processors)
dump
dump an HTML page: deparses the tags, prints tags and data
annotated; data is indented and an underline prepended.
The length of the underline is a multiple (default factor: 2) of the depth
of the nesting in the parse-tree.
Means: the deeper something is wrapped in tags, the higher the indentation.
dump_data
dump the data part of an HTML page: deparses the tags, prints the data part,
and NOT the tags.
Works like un-tagging HTML, or like an html-2-text converter.
emptydummy
just a dummy command (something like a NOP of processors),
but gives back Empty as tmpvar
end
end-keyword for the parser-definition
grep
extract matching elements from data
grepv
extract non-matching elements from data
(grepv: grep -v)
exitparse
exits the parse of one parser.
This means that the URL that is currently being parsed and
worked on will not be investigated further.
But if more than one URL is given via the command line,
then the next URL will be investigated.
This means: even if your parser for one URL exits by accident
(e.g. while you are developing the parser for that URL),
the next one will be worked on.
get
gets a document like html or xml page.
Could also be a file, but not a stream so far.
htmldecode
Decodes HTML quotings (character entities such as &quot;)
back into "normal" characters.
iselectmatch
this is an interactive selectmatch. ("i" for interactive).
Without the "-i" switch on the command line, it behaves
like selectmatch().
But when the "-i" switch is set via command line,
then an interactive menu will be displayed, so that
the user can select an option; this option allows
selecting the row by the selected column-index interactively.
The user selects a number (beginning from 0).
The corresponding column of the selected number
will be used for selection of the row.
If the input is not valid, a default value will be used.
The default value is the value, that is the second arg of
iselectmatch(). It would be the same as a hard coded selection
of a selectmatch().
So, in most cases it would make sense to use iselectmatch()
instead of selectmatch().
# Example:
# --------
iselectmatch( <col_idx>, <matchpat>, <default_pattern>);
linkextract
extracts href-links from html-pages;
an attempt is made to convert relative links into absolute links.
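The idea behind linkextract, collecting href values and resolving relative ones against the URL of the document, can be sketched with Python's standard library. The class and function names and the example.org URLs are made up for this illustration; any-dl's own implementation may differ in detail.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect all href attribute values as absolute URLs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                # urljoin leaves absolute URLs alone and resolves relative ones
                self.links.append(urljoin(self.base_url, value))


def linkextract(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    parser.close()
    return parser.links


page = '<a href="clip.html">one</a> <a href="/de/videos/x.html">two</a> <a href="http://other.example.org/y">three</a>'
for link in linkextract(page, "http://www.example.org/de/index.html"):
    print(link)
```

The first two links are relative and come out prefixed with the host (and, for the first one, the directory) of the base URL; the third is already absolute and is printed unchanged.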
linkextract_xml
extracting href-items of an xml-document
list_variables
displays all named variables.
Prints variable-name only.
(show_variables does also print the contents of the variables)
makeurl
tries to make an url from a string
match
tries to match the given pattern.
PCRE-matches are used.
The result is a matrix, consisting of
rows-of-"column"-elements.
Please note:
For real matches: Col 0 is the whole match, all others are the groups of a match.
For match_results that are just "arrays of arrays" (not coming from a match),
this obviously does not hold.
If you do a match, and want only the selected groups to appear in your
result, use
dropcol(0);
to kick out the whole-match.
# examples:
match("Regexp-String");
match(>>>another "Regex"-String<<<);
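The shape of a match-result, and what dropcol(0) does to it, can be reproduced in Python. This is a model only: Python's re module is used in place of PCRE, the helper names mirror the any-dl commands but are otherwise hypothetical, and the input text is invented.

```python
import re


def match(pattern, text):
    """One row per hit: column 0 is the whole match, the rest are the groups."""
    return [[m.group(0), *m.groups()] for m in re.finditer(pattern, text)]


def dropcol(matchres, index):
    """Remove one column from every row."""
    return [row[:index] + row[index + 1:] for row in matchres]


text = 'src="a.mp4" src="b.webm"'
rows = match(r'src="(\w+)\.(\w+)"', text)
groups_only = dropcol(rows, 0)

for row in rows:
    print(row)
for row in groups_only:
    print(row)
```

The first two printed rows have three columns (whole match, name, extension); after dropping column 0 only the two groups remain in each row.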
mselect
a multiple-select, like select, but the result will be an array
of items (Strings or URLs), not a single element.
# Example:
# --------
mselect(1,2);
parsername
this keyword starts the definition of a parser.
paste
the paste()-command creates a string from strings and variable-names ($-notation).
paste() accepts a list of items, separated by commas (",").
# Example:
# --------
paste( "literal string", $varname, "foo", $bar );
post
post does make a post-request (instead of get-request) to a webserver.
The post-data is stored in named variables; the names of the variables
will be given to post as arguments, e.g.:
post( "name_1", "name_2" );
and the values will be looked up internally.
For that purpose, the post-data has to be stored in named variables,
before the post-command is called, so that the value for a variable
can be looked up by the post-command.
The URL for the post-command is taken from tmpvar.
# Example:
# --------
post("valname_1", "valname_2", "valname_3"); # the values must be set as named variables before.
print
print invoked without parentheses prints the value on the one-val-stack.
print() with parentheses prints strings and variables (denoted by $-notation),
which means it accepts the same parameters as paste() but does not change the
one-val-stack.
print() used on an empty string does end the line automatically.
This means, a new line will be used for further commands.
If you wish to print only a certain string, without line-endings added,
you need to use print_string()
print_string
accepts only one string-argument and prints it.
It prints the plain string, and does not add line-ending automatically.
quote
wraps the one-val-stack value with '"' and '"'.
needed for arguments that are given to other tools,
which will be invoked via system() (which is invoking
a shell).
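Why the quoting matters can be seen by splitting a command line the way a POSIX shell splits words. The sketch below uses Python's shlex module for that purpose; the quote helper merely imitates the wrapping described above, and the file name and URL are invented. Note that plain double-quote wrapping does not protect values that themselves contain '"' or '$'.

```python
import shlex


def quote(value):
    """Imitate the quote command: wrap the value in double quotes."""
    return '"' + value + '"'


filename = "Mein Nachmittag.mp4"   # hypothetical output name containing a space
url = "http://example.org/v.mp4"   # hypothetical video URL

unquoted = "wget -O " + filename + " " + url
quoted = "wget -O " + quote(filename) + " " + url

print(shlex.split(unquoted))   # the file name falls apart into two words
print(shlex.split(quoted))     # the file name stays one argument
```

Without the quotes the shell would hand wget five words, with the quotes four, and only in the second case does the option -O receive the complete file name.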
readline
reads one line from stdin / console.
Without arguments, the input is stored in the TMPVAR,
With argument, the argument is used as variable-name,
and the input is stored in this named variable.
# Examples:
# ---------
readline;
readline("VarnameForInputLine");
recall
get a named value and store it on the one-val-stack.
# Example:
# --------
recall("varname");
rowselect
selects a certain row from a match-result.
# Example:
# --------
rowselect(0);
save
saves a document to a file.
The filename is derived from the url of the document.
The character set is reduced to a subset of ASCII.
save_as
saves a document to a file with filename as argument.
select
selects ONE part of a tmpvar.
Examples: select(0);
select(3);
For rows and columns:
document:
---------
0 selects the document,
1 selects the url of the document
any other value selects the document too
document-array:
---------------
selects document with index (starting at 0)
rows/columns:
-------------
selects ONE ELEMENT from a row or a column.
The row/column must already have been selected with rowselect() or colselect().
select() does NOT allow matches on match-results (which are a matrix internally).
# Example:
# --------
select(2);
selectmatch
selects a row from a match-result by specifying
a column-index and a string-matching-pattern for this certain
element.
So, this is a more advanced rowselect() with additional matching capabilities.
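A row selection by column index and pattern can be pictured like this in Python. The sketch rests on assumptions: the argument order is borrowed from the iselectmatch() example, returning the first matching row (and None when nothing matches) is a guess at the behaviour, and the rows are invented sample data.

```python
import re


def selectmatch(matchres, col_idx, pattern):
    """Return the first row whose element in column col_idx matches pattern."""
    for row in matchres:
        if col_idx < len(row) and re.search(pattern, row[col_idx]):
            return row
    return None   # assumption: no row matched


rows = [
    ["http://example.org/clip_low.mp4", "low"],
    ["http://example.org/clip_hd.mp4", "hd"],
]
print(selectmatch(rows, 1, "hd"))
print(selectmatch(rows, 0, r"low\.mp4$"))
```

Compared with rowselect(1), the parser does not need to know at which position the wanted row appears; it only needs to know which column identifies it.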
show_match
shows a match-result in a certain way; this command is
intended to display matches in a way in which they can be read easily.
It will most often be used in parser development,
but can of course also be used for informing the user
about the steps that any-dl has done (e.g. just be verbose and
display the matches). Normally, though, it is developers
who will be interested in these details.
show_type
just shows the "type" of the value in the one-val-stack.
show_variables
displays all named variables.
Prints variable-name and contents of the variable.
(list_variables does only print the names of the variables)
start
this keyword indicates the start of the keywords section
of a parser definition.
store
store the value from the one-val-stack as named variable.
(use recall() for getting it back to the one-val-stack, or
$-notation in some of the commands that accept this notation).
# Example:
store("varname");
storematch
Stores the tmpvar (must be match-result) to a named variable,
with Row- and Column-Indexes as part of the name:
storematch("MyName"); # stores matchresult as MyName.(col).(row)
(for all col's and row's as indexes of the match-result)
subst
string-substitution.
Uses Pcre.replace internally.
# Example:
# --------
subst("pattern", "subst-string");
system
calls the system() command with the string that is held in
the one-val-stack.
table_to_matchres (experimental feature so far)
converts a html-table to a match-result.
This conversion works for single tables.
So, a selection of a table should be as specific as possible, so that
only one table will be selected with tagselect.
Then the conversion works.
If more than one table has been extracted by tagselect, then
they all will be coerced into ONE match-result.
If that's what is wanted, everything is fine. Otherwise, separate table selections will be necessary.
Use tagselect with "htmlstring"-extractor, like this:
# Example:
# --------
tagselect("table"."id"="foobar" | htmlstring );
table_to_matchres;
csv_save;
tagselect
selects tags and "subtags" from a document tree and gives back
data accordingly.
The selection can be a *list* of tags, and optionally the argument
"args" or the argument "arg" with a key-parameter (of a key-value pair)
that selects the certain argument.
See the command examples below for syntax details.
The selection list first does a selection on the first selector-specification.
Then the result is selected from again, and so on.
Example:
--------
tagselect("table", "a", "img"."align"="top"| dump);
The document is first scanned for table's.
The outermost match is selected. So if a table is inside a table,
the outer tag will be selected, and the whole outer table be selected.
The inner table would just be content of the first one.
No in-depth selection is done.
All found table's then are scanned for <a ...> tags, which
should be the <a href="..."> stuff.
From the found <a ...>-tags any img-tags inside these a-tags
will be selected, if they also are top-aligned.
The result then is dumped to screen/console.
tagselect selects elements from the document tree,
so that a selection picks that certain tag and all its descendants.
That means, for example, that a data-slurp-extraction will show all data from the descendants.
But all other extractors ONLY LOOK UP THE TOPMOST element
(and not the descendants).
The reason is that the selected element normally is what needs to be analyzed,
not necessarily the descendants.
With the "anytags" selector in tagselect (e.g. 'tagselect( anytags | argpairs );' )
ANY tag is selected, so ALL tags are TOPMOST tags, because any descendant also
is detected as a new tag.
This is a depth-first selection, with each element being a top-element.
This way you can access all descendants and analyse them, for example
extract all argpairs from all the tags of the whole document.
# Examples, showing the allowed syntax:
# -------------------------------------
tagselect( "a"| dump ); # dumps all <a ...> tags
tagselect( "br"| dump ); # dumps all <br>-tags
tagselect( "table", "a"| dump ); # <a ...> inside tables will be dumped
tagselect( "img"."src"| dump );  # <img src="..."> will be dumped
tagselect("table", "a", "img"."align"="top"| dump); # all img-tags with "align"="top" will be selected,
# if they appear inside a table; the stuff is dumped to screen
tagselect( "table", "a" | argpairs ); # extract argpairs from the stuff that was selected
tagselect( "table", "a" | arg("href") ); # extract value for the arg with key/name "href" from the stuff that was selected
# the pair-extractors ( "argpairs", "argkeys", "argvals" ) can be used as single-extractor-arguments
# the other selectors select one item only (not pairs) and can be given as list, like this:
tagselect("img"."src" | arg("src"), arg("alt") );
# tagselect used with "anytags"-selector
# --------------------------------------
# the "anytags"-selector selects ANY tags,
# which means that ALL tags from the document are
# picked up in depth-first manner.
# without anytags, a match does pick a tag with all descendants.
# But these descendants will not be extracted with an extractor-pattern!
# --------------------------------------
tagselect( anytags | argpairs ); # shows argpairs of ANY / ALL tags found (depth-first)
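What "anytags | argpairs" yields can be imitated with Python's standard HTML parser, which reports start tags in document order, that is, depth-first. The class name and the sample markup are invented for this sketch, and it models only this one combination of selector and extractor, not tagselect in general.

```python
from html.parser import HTMLParser


class ArgPairs(HTMLParser):
    """Collect the (key, value) attribute pairs of every tag in the document."""

    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for this one tag
        self.pairs.extend(attrs)


markup = (
    '<table id="t"><tr><td>'
    '<a href="x.html"><img src="p.png" align="top"></a>'
    '</td></tr></table>'
)
collector = ArgPairs()
collector.feed(markup)
collector.close()
for key, value in collector.pairs:
    print(key, "=", value)
```

Every tag is treated as a top element here, so the attributes of the nested <a> and <img> tags show up next to those of the enclosing table.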
titleextract
extracts the contents from the <title>-tag of a webpage
and puts the result onto the one-val-stack.
to_string
converts the value of the one-val-stack to a string-representation.
to_matchres
converts the value of the one-val-stack to a value of the same type,