AlertFeed
=========

Automatically exported from code.google.com/p/alertfeed
The AlertFeed project contains an App Engine implementation of a crawl, index, and search backend for a specific class of online documents that represent alerts of some sort. The design is roughly divided into two complementary parts, CapMirror and CapQuery. This document describes features of these designs, some of which have not yet been implemented.

CapMirror
=========

Objective
---------

To provide a crawl/index platform for CAP (Common Alerting Protocol) data in the App Engine environment.

Overview
--------

CapMirror is a crawl/index pipeline for CAP data. It builds the Datastore that serves as the backend for CapQuery.

Infrastructure
--------------

The entire design runs on App Engine in Python. It relies on the Datastore, cron, and taskqueue APIs.

Detailed Design
---------------

+ A whitelist of URLs pointing to XML indices of CAP URLs. The whitelist is initially hardcoded, but can eventually be edited online via an admin page. It is stored as Feed models in the Datastore.
+ The crawl workflow is described by simple Model subclasses (Crawl and CrawlShard) that are stored using the App Engine Datastore service.
+ A cron job pokes the server to advance the crawl workflow. It performs a small, fixed number of Datastore queries to determine whether to crawl each feed in the whitelist and to track the state of any crawl in progress, and it finishes the Crawl record when all work is complete. Because the crawl state is persistent, the workflow is robust in the presence of timeouts.
+ Task queues execute shards of work. One queue ("crawlpush") is responsible for duplicate elimination (so that a URL is crawled only once per cycle, even if it appears in multiple feeds) and for queuing tasks in the second queue. The second queue ("crawlworker") is responsible for fetching and parsing URLs. (See the sketch at the end of this section.)
+ Two types of files can be handled: CAP files and CAP indices. A CAP file is stored as a CapAlert model in the Datastore. A CAP index is processed by queuing each of its entries in the "crawlpush" queue, allowing recursion.
+ A crawl status page renders the crawl workflow state in a human-readable form. Other admin screens show past crawls and all indexed CAP data. Any errors encountered are saved in the CrawlShard and CapAlert models.
+ The CAP semantics will be implemented, allowing alerts to be created, modified, and expired/retired. (TBD)
+ Data will be purged from the Datastore after some interval (e.g. one year).
+ Each alert will be updated atomically, but there will be no attempt to coordinate the state of multiple alerts. (TBD)
+ Indexing consists of storing indexable attributes (e.g. "category") and of adding derived keys, e.g. a geohash, to enable efficient querying by the front end.

Code Location
-------------

cap_crawl.py (main crawl execution code)

cap_mirror.py (administrative screens)

cap_schema.py (Datastore schema)
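As a concrete illustration of the crawl fan-out described above, here is a minimal, hypothetical sketch assuming the Python 2 App Engine SDK (db, taskqueue, urlfetch). The Feed, CrawlShard, and CapAlert names come from this design; the property names, handler path, and the handle_cap_document() helper are illustrative assumptions, not the actual code in cap_crawl.py::

    # Hypothetical sketch of the cron-driven crawl fan-out; see
    # cap_crawl.py and cap_schema.py for the real implementation.
    from google.appengine.api import taskqueue, urlfetch
    from google.appengine.ext import db

    class Feed(db.Model):
        url = db.StringProperty(required=True)   # whitelisted CAP index URL

    class CrawlShard(db.Model):
        url = db.StringProperty(required=True)   # URL assigned to this shard
        done = db.BooleanProperty(default=False)
        error = db.StringProperty()              # fetch/parse error, if any

    def advance_crawl():
        """Cron entry point: queue every whitelisted feed on "crawlpush".

        The "crawlpush" handler dedupes URLs for the current cycle before
        queuing the actual fetch work on "crawlworker".
        """
        for feed in Feed.all():
            taskqueue.add(queue_name='crawlpush', url='/task/crawlpush',
                          params={'url': feed.url})

    def crawl_worker(url):
        """Body of a "crawlworker" task: fetch one URL, record the outcome."""
        shard = CrawlShard(url=url)
        try:
            response = urlfetch.fetch(url)
            # A CAP file becomes a CapAlert model; a CAP index re-queues
            # each entry on "crawlpush", allowing recursion.
            handle_cap_document(url, response.content)  # hypothetical parser
        except Exception as e:
            shard.error = str(e)
        shard.done = True
        shard.put()

Because the crawl state is persisted in CrawlShard records, a task that times out can simply be retried by the queue without losing the progress of the cycle.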
CapQuery
========

Objective
---------

To provide a scalable serving platform for CAP data as either KML files or CAP indices.

Overview
--------

CapQuery is a query service that provides CAP data formatted as CAP indices (Atom files containing embedded CAP alerts) or as KML. It relies on the App Engine Datastore that is populated by CapMirror.

Infrastructure
--------------

The entire design runs on App Engine in Python. It relies on the Datastore API.

Detailed Design
---------------

+ A user-facing service retrieves the KML (/cap2kml) and CAP index (/cap2atom) views of the alerts.
+ A "Table of Contents" view shows the feeds and some basic statistics, e.g. the number of alerts. (TBD)
+ Search parameters include the feed URL, category, geo bounding box (TBD), and other indexed properties of the CapAlert model. The query parameter namespace is aligned with the CAP standard, using the XML element names in the query string, e.g. "category" and "severity".
+ A flexible query API maps CGI parameters to indexed schema elements and allows common (but not arbitrary) combinations of predicates. (See web_query.py, and the sketch at the end of this section.)
+ If necessary, a query can be split into a Datastore GQL query and a subsequent in-memory filtering of the Datastore results. Common, simple queries are expected to be handled entirely by the Datastore.
+ Serves a static (and one day a self-refreshing) KML file that matches the search criteria. The search parameters are encoded in the URL that is used to refresh.
+ Stable URLs for queries, e.g. http://alert-feed.appspot.com/cap2atom?category=Geo.
+ Serves data only from the most recent crawl. TBD: historical queries, including time series.
+ The original CAP data (XML) is stored in the Datastore (CapAlert.text). It is normalized at query time when inlined into the Atom feed that forms a CAP index (/cap2atom).
+ Both strict and non-conforming parsers are used. TBD: indicate non-conforming CAP to the user, or allow filtering it out.
+ KML is generated at query time (/cap2kml). This is expensive, but we would like to offer customization, e.g. style sheets, to control how CAP maps to KML.
+ *PROBLEM* The size of the Datastore query (measured as the number of models retrieved) is unbounded with respect to the user's query specification. Query sharding and precalculation (during the crawl) are needed to mitigate this.

Code Location
-------------

cap_query.py

web_query.py
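Below is a hypothetical sketch of the CGI-parameter-to-Datastore mapping and the split query described above, again assuming the Python 2 App Engine SDK. The indexed property list, the bbox parameter format, and the lat/lon properties are illustrative assumptions; the actual mapping lives in web_query.py::

    # Hypothetical sketch of translating query-string parameters into a
    # Datastore query over CapAlert, plus an in-memory post-filter.
    from google.appengine.ext import db

    from cap_schema import CapAlert  # Datastore model populated by CapMirror

    # CAP element names assumed to be indexed CapAlert properties.
    INDEXED_PARAMS = ('category', 'severity', 'urgency', 'certainty')

    def build_query(params):
        """Translate e.g. ?category=Geo&severity=Severe into an
        equality-filtered Datastore query over CapAlert."""
        query = db.Query(CapAlert)
        for name in INDEXED_PARAMS:
            value = params.get(name)
            if value:
                query.filter('%s =' % name, value)
        return query

    def run_query(params, limit=100):
        """Run the Datastore part of the query, then post-filter any
        predicates (e.g. a geo bounding box) it cannot express."""
        results = build_query(params).fetch(limit)
        bbox = params.get('bbox')  # assumed "west,south,east,north" format
        if bbox:
            west, south, east, north = map(float, bbox.split(','))
            results = [a for a in results  # lat/lon: assumed properties
                       if west <= a.lon <= east and south <= a.lat <= north]
        return results

Note that fetch(limit) is where the *PROBLEM* bullet bites: a loose query specification can match far more models than any single request can afford to retrieve.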
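In the same spirit, a hedged sketch of query-time KML generation (/cap2kml), where each matching alert becomes a Placemark. The headline/description/lat/lon field names and the CAP-to-KML mapping are assumptions for illustration; the real rendering (and the planned style-sheet customization) belongs to cap_query.py::

    # Hypothetical sketch of query-time KML generation for /cap2kml.
    from xml.sax.saxutils import escape

    KML_DOCUMENT = (u'<?xml version="1.0" encoding="UTF-8"?>\n'
                    u'<kml xmlns="http://www.opengis.net/kml/2.2">'
                    u'<Document>%s</Document></kml>')

    PLACEMARK = (u'<Placemark><name>%s</name><description>%s</description>'
                 u'<Point><coordinates>%s,%s</coordinates></Point></Placemark>')

    def alerts_to_kml(alerts):
        """Render alerts (assumed to expose headline, description, lat,
        and lon) as a single KML document string."""
        placemarks = ''.join(
            PLACEMARK % (escape(a.headline), escape(a.description),
                         a.lon, a.lat)   # KML coordinates are lon,lat
            for a in alerts)
        return KML_DOCUMENT % placemarks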