-
Notifications
You must be signed in to change notification settings - Fork 1
michaelbrunnbauer/sparqlcrawlserver
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
SparqlCrawlServer ================= Version: 1.6 Author: netEstate GmbH, http://www.netestate.de/, [email protected] License: Apache License version 2 <http://www.apache.org/licenses/> Sparql Crawl Server is a Java Web application that crawls a list of URLs containing RDF and executes a SPARQL query on the resulting graphs. Every document gets its own named graph with the final document URL as graph name. Already crawled documents are cached in memory and on disk. The crawler supports RDF/XML, RDFa, N3 and Turtle and will open a maximum of one HTTP connection per server IP. It respects the Robot Exclusion Standard. HTTP POST request parameters (UTF-8 encoded): query: The SPARQL query. graphs: A comma separated list of URLs in <>: <url1>,<url2>,... timeout: number of seconds to wait for the RDF documents. HTTP Response: application/sparql-results+json If the timeout is reached, the SPARQL query will be executed with the current results. RDF from unfinished HTTP requests is added to the cache when finished. Build instructions: Configure the path to your settings file in SparqlCrawlServer/WebContent/WEB-INF/web.xml. Build by invoking ant. Minimum Java version is 7.0, minimum Servlet version is 3.0. Deploy the file SparqlCrawlServer.war in your Servlet container (e.g. Tomcat 7). The context-path of the servlet is /sparqlcrawl (so, normally the endpoint URL is http://<your-server>/SparqlCrawlServer/sparqlcrawl). Settings file: The settings file is a groovy file, see example-settings.groovy. cacheLifetimeOk: number of seconds to cache crawled RDF documents. cacheLifetimeContentTooLong: number of seconds to cache 'content-too-long' crawl results. (see maxRdfXmlSize) cacheLifetimeHttpError: number of seconds to cache HTTP error crawl results. cacheLifetimeNetworkError: number of seconds to cache network error crawl results. cacheLifetimeNotAllowedByRobots: number of seconds to cache 'disallowed' crawl results. cacheLifetimeParseError: number of seconds to cache RDF parse error crawl results. cacheLifetimeWrongContentType: number of seconds to cache wrong content type error crawl results. datasetCacheLifetimeComplete: number of seconds to cache completely crawled document lists in memory. datasetCacheLifetimeIncomplete: number of seconds to cache incompletely crawled document lists in memory. logFilePath: path to the log file crawlThreads: number of threads used for crawling. crawlThreadStackSize: stack size for crawl threads dnsThreads: number of threads for dns lookup dnsThreadStackSize: stack size for dns lookup threads dnsTimeout: DNS timeout maxRdfXmlSize: maximum size of RDF documents maxRobotsAge: maximum cache time of robots.txt files. maxRobotsSize: maximum allowed size of a robots.txt file. modelCachePath: path to disk-cache-folder socketTimeout: socket timeout unixOs: is the underlying os unix like? (true/false) userAgent: user-agent header value for crawl requests
About
SPARQL Crawl Server
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published