A collection of SOLR queries that reproduce numbers similar to those produced by cdx-summarize, but using a SOLR index that has been built with the warc-indexer from the UKWA webarchive-discovery project, using either the V6 or the V7 schema.xml.
execute_solr_queries.sh --help
usage: execute_solr_queries.sh [queryfile]
without arguments it runs all .q files in the current directory
queryfile : a single json formatted query file to run
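For example (the single query file name below is only illustrative, not a file shipped with the scripts):

    # run every .q file in the current directory
    ./execute_solr_queries.sh

    # run one specific query file (the file name is just an example)
    ./execute_solr_queries.sh domains_per_year.q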
The output of the bash script is a collection of CSV files, which still need to be combined into a single .summary file in order to be fully compatible with cdx-summarize. This remains a todo.
This has only been tested on a small installation of the SOLR backend of SOLRWayback. The queries can probably be optimized.
The same approach to mime-type simplification as in cdx-summarize has been used, but here it is based on the mime-type determined by warc-indexer (the content_type field). To get exactly the same behaviour as cdx-summarize, which operates on CDX(J) files where the mime-type is the one reported by the server, you would need to use the content_type_served field instead. This is deliberately not done, since the extra information reported by Tika, DROID et al. is more precise and more correct.
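As an illustration of the kind of query this involves (a sketch, not one of the .q files verbatim), the following curl call counts documents with an HTML mime-type and sums their sizes per domain using the SOLR JSON Facet API. The SOLR URL, the collection name "solrwayback" and the presence of a content_length field are assumptions about the local setup; adapt them to your installation.

    # Count documents whose determined mime-type (content_type) is HTML and sum
    # their sizes, faceted per domain. The URL, the collection name ("solrwayback")
    # and the content_length field are assumptions about the local setup.
    curl -s 'http://localhost:8983/solr/solrwayback/select' \
      --data-urlencode 'q=content_type:"text/html" OR content_type:"application/xhtml+xml"' \
      --data-urlencode 'rows=0' \
      --data-urlencode 'wt=json' \
      --data-urlencode 'json.facet={domains:{type:terms,field:domain,limit:10000000,facet:{bytes:"sum(content_length)"}}}'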
Currently the scripts limit the queries to 10 million distinct values of the domain field (the limit parameter of the terms facet, as in the sketch above). Whether this is enough depends on the size of the web archive.
The "domain" field in the warc-indexer, as used in the SOLRWayback, holds the private domain name as determined by having one more hierarchical level than the public suffix, as determined by Mozilla's publicsuffix.org list. This means that the subdomains of e.g. ".ac.uk" and ".co.uk" are separated into different "domain" values. This is currently not the same as for cdx-summarize which only takes the second-level domain.