Solr Recipes and Tricks

Michael Berkowski edited this page Nov 1, 2019 · 6 revisions

Sometimes you may want to make Solr do things outside of the procedures described in the Local Development Quickstart. This page collects recipes for some of those tasks.

Most of these recipes use jq, the command-line JSON processor. Install it with your OS package manager, e.g. brew install jq, apt-get install jq, or yum install jq.

All examples use /corename when querying Solr. Substitute the server's real Solr core/index name.

Dump all data to JSON via curl

Connect via SSH to a Geoportal server (or use your local desktop) and run curl to dump all documents to $HOME/server-output.json on that machine. Solr treats this as a query and wraps the search results in other metadata, so jq is used to keep only .response.docs. Increase rows= if the index holds more than 100,000 documents.

Note: This dumps all fields in Solr including some calculated fields, and cannot be used to directly import back into a different Solr index.

# From a geoportal server...
# Extract only the search results and save them to $HOME/server-output.json
$ curl 'http://localhost:8983/solr/corename/search?q=*%3A*&start=0&rows=100000&wt=json' \
    | jq '.response.docs' \
    > ~/server-output.json
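
If the index is too large to fetch comfortably in a single request, the same query can be paged with the start and rows parameters. The following is a minimal sketch, assuming the same localhost URL and /search handler as above. The function names solr_page_url and dump_pages and the output directory layout are illustrative, not part of the Geoportal tooling.

```shell
#!/usr/bin/env bash

# Hypothetical helper: print the query URL for one page of results.
# Arguments: core name, start offset, page size.
solr_page_url() {
  local core="$1" start="$2" rows="$3"
  printf 'http://localhost:8983/solr/%s/search?q=*%%3A*&start=%s&rows=%s&wt=json\n' \
    "$core" "$start" "$rows"
}

# Hypothetical helper: fetch page after page until Solr returns no documents,
# writing each page's .response.docs array to its own file in the directory.
# Arguments: core name, page size, output directory.
dump_pages() {
  local core="$1" rows="$2" outdir="$3" start=0 page count
  mkdir -p "$outdir"
  while :; do
    page="$outdir/page-$start.json"
    curl -s "$(solr_page_url "$core" "$start" "$rows")" \
      | jq '.response.docs' > "$page"
    count=$(jq 'length' "$page" 2>/dev/null)
    # Stop on an empty page (or on a failed request, which leaves no count)
    if [ "${count:-0}" -eq 0 ]; then
      rm -f "$page"
      break
    fi
    start=$((start + rows))
  done
}

# Example usage (not run here):
#   dump_pages corename 10000 ~/solr-pages
#   jq -s 'add' ~/solr-pages/page-*.json > ~/server-output.json
```

The final jq -s 'add' step concatenates the per-page arrays into one array with the same shape as the single-request dump.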

Dump all data directly over SSH

Same as the previous recipe, but run from your workstation over SSH in one shot. curl runs on the server, while jq runs locally and saves the output to $HOME/local-output.json on your workstation.

# From your workstation...
# The \& escapes keep the remote shell from treating & as a background operator.
# jq extracts only the search results, which are saved to $HOME/local-output.json
$ ssh geoportal-server-address.example.edu \
    curl 'http://localhost:8983/solr/corename/search?q=*%3A*\&start=0\&rows=100000\&wt=json' \
    | jq '.response.docs' \
    > ~/local-output.json

Dump all documents for import back to Solr

To import documents into another Solr index, the dump requires some post-processing to remove dynamic (calculated by the index) fields. Fields to remove include:

  • _version_
  • timestamp
  • score
  • solr_bboxtype
  • solr_bboxtype__minX
  • solr_bboxtype__minY
  • solr_bboxtype__maxX
  • solr_bboxtype__maxY

The following dumps documents via SSH to local-output.json on your workstation, ready for import into another Solr index.

# From your workstation...
# Strip the dynamic fields and save the result to $HOME/local-output.json
$ ssh geoportal-server-address.example.edu \
    curl 'http://localhost:8983/solr/corename/search?q=*%3A*\&start=0\&rows=100000\&wt=json' \
    | jq 'del(.response.docs[]["_version_", "score", "timestamp", "solr_bboxtype", "solr_bboxtype__minX", "solr_bboxtype__minY", "solr_bboxtype__maxX", "solr_bboxtype__maxY"])|.response.docs' \
    > ~/local-output.json
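
Because the jq filter above is long and easy to mistype, it can help to wrap it in a shell function. This is only a sketch: the function name strip_dynamic_fields is hypothetical, and the filter is the same one used in the command above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read a raw Solr JSON response on stdin, delete the
# dynamic fields from every document, and print only the array of documents.
strip_dynamic_fields() {
  jq 'del(.response.docs[]["_version_", "score", "timestamp",
        "solr_bboxtype", "solr_bboxtype__minX", "solr_bboxtype__minY",
        "solr_bboxtype__maxX", "solr_bboxtype__maxY"])
      | .response.docs'
}

# Example usage (not run here):
#   ssh geoportal-server-address.example.edu \
#       curl 'http://localhost:8983/solr/corename/search?q=*%3A*\&start=0\&rows=100000\&wt=json' \
#       | strip_dynamic_fields > ~/local-output.json
```

To check the filter without touching a server, pipe a small hand-written response through it, e.g. printf '%s' '{"response":{"docs":[{"id":"abc","_version_":1,"score":1.0}]}}' | strip_dynamic_fields, which should leave only the id key in the single document.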

Load a large JSON file directly into Solr

Solr provides bin/post as a command-line handler to ingest documents. Using JSON produced with the previous method, import the file to a Solr index:

# On your workstation or another Solr host
# "-c development" would load docs to a core named "development"
$ /path/to/solr/bin/post -c corename /path/to/local-output.json
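
After an import, one way to sanity-check the result is to compare the number of documents in the dump file with the number Solr reports for the core. The sketch below assumes the target core was empty before the import and that it exposes the same /search handler on localhost:8983 used elsewhere on this page. The function names are hypothetical.

```shell
#!/usr/bin/env bash

# Hypothetical helper: number of documents in a dump file (a JSON array).
count_dump_docs() {
  jq 'length' "$1"
}

# Hypothetical helper: number of documents Solr reports for a core.
# rows=0 requests only the count (numFound), not the documents themselves.
count_core_docs() {
  curl -s "http://localhost:8983/solr/$1/search?q=*%3A*&rows=0&wt=json" \
    | jq '.response.numFound'
}

# Hypothetical helper: compare the two counts and report the outcome.
# Arguments: path to the dump file, core name.
compare_counts() {
  local dump="$1" core="$2" in_file in_core
  in_file=$(count_dump_docs "$dump")
  in_core=$(count_core_docs "$core")
  if [ "$in_file" = "$in_core" ]; then
    echo "OK: $in_core documents"
  else
    echo "Mismatch: $in_file in file, $in_core in Solr" >&2
    return 1
  fi
}

# Example usage (not run here):
#   compare_counts /path/to/local-output.json corename
```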

Dump documents over SSH DIRECTLY to another Solr core

Combine all of the above to dump documents out of Solr on a remote server and pipe that output DIRECTLY into Solr on your workstation.

# Run from your workstation
# The jq output is piped right into Solr on your workstation ("-" at the end forces bin/post to use stdin)
$ ssh geoportal-server-address.example.edu \
    curl 'http://localhost:8983/solr/corename/search?q=*%3A*\&start=0\&rows=100000\&wt=json' \
    | jq 'del(.response.docs[]["_version_", "score", "timestamp", "solr_bboxtype", "solr_bboxtype__minX", "solr_bboxtype__minY", "solr_bboxtype__maxX", "solr_bboxtype__maxY"])|.response.docs' \
    | /path/to/your/workstation/solr/bin/post -c corename -type application/json -

Retrieve the Geoportal's nightly dump

The Geoportal has a nightly Rake task rake geoportal:export_data which dumps all records to public/data.json in the application root, including suppressed records. Its format is similar to the first example above and requires jq post-processing to remove dynamic fields before it can be used for import.

The dump task runs only once a day, so the records may not be as fresh as querying them directly from Solr.

# From your workstation...
# Strip the dynamic fields and save the result to $HOME/local-output.json
$ curl https://geo.btaa.org/data.json \
    | jq 'del(.response.docs[]["_version_", "score", "timestamp", "solr_bboxtype", "solr_bboxtype__minX", "solr_bboxtype__minY", "solr_bboxtype__maxX", "solr_bboxtype__maxY"])|.response.docs' \
    > ~/local-output.json