SparqlCrawlServer
=================

Version: 1.6
Author: netEstate GmbH, http://www.netestate.de/, [email protected]
License: Apache License version 2 <http://www.apache.org/licenses/>

Sparql Crawl Server is a Java web application that crawls a list of URLs
pointing to RDF documents and executes a SPARQL query on the resulting graphs.
Every document gets its own named graph, with the final document URL as the
graph name. Documents that have already been crawled are cached in memory and
on disk.

The crawler supports RDF/XML, RDFa, N3, and Turtle, and opens at most one
HTTP connection per server IP at a time. It respects the Robot Exclusion
Standard (robots.txt).
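
Because every document lands in its own named graph, a query can address
documents individually via GRAPH. A minimal sketch (the graph URL is
illustrative):

  SELECT ?s ?p ?o
  WHERE {
    GRAPH <http://example.org/data.rdf> { ?s ?p ?o }
  }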

HTTP POST request parameters (UTF-8 encoded):

  query: The SPARQL query.
  graphs: A comma-separated list of URLs in angle brackets: <url1>,<url2>,...
  timeout: The number of seconds to wait for the RDF documents.

HTTP Response: application/sparql-results+json

If the timeout is reached, the SPARQL query is executed against the results
retrieved so far. RDF from HTTP requests that are still in progress is added
to the cache once they finish.
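
A request can be issued with a short script; the following is a minimal Groovy
sketch, where the endpoint URL, query, and graph URLs are illustrative and not
part of the server:

  // Endpoint URL assumes the default deployment described below.
  def endpoint = new URL("http://localhost:8080/SparqlCrawlServer/sparqlcrawl")

  def params = [
      query  : "SELECT ?g ?s ?p ?o WHERE { GRAPH ?g { ?s ?p ?o } } LIMIT 10",
      graphs : "<http://example.org/a.rdf>,<http://example.org/b.ttl>",
      timeout: "10"
  ]

  // Build the UTF-8 encoded form body expected by the servlet.
  def pairs = params.collect { k, v -> k + "=" + URLEncoder.encode(v, "UTF-8") }
  def body = pairs.join("&")

  def conn = endpoint.openConnection()
  conn.doOutput = true   // makes the request a POST
  conn.setRequestProperty("Content-Type",
          "application/x-www-form-urlencoded; charset=UTF-8")
  conn.outputStream.withWriter("UTF-8") { it << body }

  // The response body is application/sparql-results+json.
  println conn.inputStream.getText("UTF-8")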

Build instructions:

Configure the path to your settings file in 

 SparqlCrawlServer/WebContent/WEB-INF/web.xml.
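
The path is presumably passed as a deployment descriptor parameter. As a
purely hypothetical sketch (parameter name and path are assumptions; check
web.xml for the real entry):

  <!-- hypothetical parameter name and path; see web.xml for the real entry -->
  <context-param>
    <param-name>settingsFile</param-name>
    <param-value>/etc/sparqlcrawlserver/settings.groovy</param-value>
  </context-param>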

Build by invoking ant.
Minimum Java version is 7, minimum Servlet API version is 3.0.

Deploy the file SparqlCrawlServer.war in your Servlet container (e.g. Tomcat 7).

The servlet is mapped to the path /sparqlcrawl, so with the default context
path derived from the WAR file name, the endpoint URL is normally
http://<your-server>/SparqlCrawlServer/sparqlcrawl.

Settings file:

The settings file is a Groovy file; see example-settings.groovy.

  cacheLifetimeOk: number of seconds to cache crawled RDF documents.
  cacheLifetimeContentTooLong: number of seconds to cache 'content-too-long'
                               crawl results (see maxRdfXmlSize).
  cacheLifetimeHttpError: number of seconds to cache HTTP error crawl results.
  cacheLifetimeNetworkError: number of seconds to cache network error crawl
                             results.
  cacheLifetimeNotAllowedByRobots: number of seconds to cache 'disallowed'
                                   crawl results.
  cacheLifetimeParseError: number of seconds to cache RDF parse error crawl
                           results.
  cacheLifetimeWrongContentType: number of seconds to cache wrong content type
                                 error crawl results.

  datasetCacheLifetimeComplete: number of seconds to cache completely crawled
                                document lists in memory.
  datasetCacheLifetimeIncomplete: number of seconds to cache incompletely
                                  crawled document lists in memory.

  logFilePath: path to the log file.
  crawlThreads: number of threads used for crawling.
  crawlThreadStackSize: stack size for crawl threads.
  dnsThreads: number of threads for DNS lookups.
  dnsThreadStackSize: stack size for DNS lookup threads.
  dnsTimeout: DNS lookup timeout.
  maxRdfXmlSize: maximum size of crawled RDF documents.
  maxRobotsAge: maximum cache time for robots.txt files.
  maxRobotsSize: maximum allowed size of a robots.txt file.
  modelCachePath: path to the disk cache folder.
  socketTimeout: socket timeout.
  unixOs: whether the underlying OS is Unix-like (true/false).
  userAgent: User-Agent header value for crawl requests.
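
For illustration only, a settings file might look like the following. This is
a sketch assuming plain Groovy assignment syntax; all values are made up, and
example-settings.groovy in the repository is the authoritative template.

  // Illustrative values only -- see example-settings.groovy for the real
  // template. Cache lifetimes are in seconds; units for the remaining
  // timeouts and sizes are assumptions.
  cacheLifetimeOk                 = 86400
  cacheLifetimeContentTooLong     = 86400
  cacheLifetimeHttpError          = 3600
  cacheLifetimeNetworkError       = 3600
  cacheLifetimeNotAllowedByRobots = 86400
  cacheLifetimeParseError         = 86400
  cacheLifetimeWrongContentType   = 86400

  datasetCacheLifetimeComplete    = 600
  datasetCacheLifetimeIncomplete  = 60

  logFilePath          = "/var/log/sparqlcrawlserver.log"
  crawlThreads         = 16
  crawlThreadStackSize = 262144
  dnsThreads           = 8
  dnsThreadStackSize   = 262144
  dnsTimeout           = 5
  maxRdfXmlSize        = 1000000
  maxRobotsAge         = 86400
  maxRobotsSize        = 100000
  modelCachePath       = "/var/cache/sparqlcrawlserver"
  socketTimeout        = 30
  unixOs               = true
  userAgent            = "SparqlCrawlServer/1.6 (+http://www.example.org/bot)"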
