AlertFeed
=========

Automatically exported from code.google.com/p/alertfeed
The AlertFeed project contains an App Engine implementation of a crawl, index, and search backend for a specific class of online documents that represent alerts of some sort. The design is roughly divided into two complementary parts, CapMirror and CapQuery. This document describes features of these designs, some of which have not yet been implemented.

CapMirror
=========

Objective
---------

To provide a crawl/index platform for CAP (Common Alerting Protocol) data in the App Engine environment.

Overview
--------

CapMirror is a crawl/index pipeline for CAP data. It builds the Datastore that serves as the backend for CapQuery.

Infrastructure
--------------

The entire design runs on App Engine in Python. It relies on the Datastore, cron, and taskqueue APIs.

Detailed Design
---------------

+ A whitelist of URLs pointing to XML indices of CAP URLs. The whitelist is initially hardcoded, but can eventually be edited online via an admin page. It is stored as Feed models in the Datastore.
+ The crawl workflow is described by simple Model subclasses (Crawl and CrawlShard) that are stored using the App Engine Datastore service.
+ A cron job pokes the server to advance the crawl workflow. It performs a small, fixed number of Datastore queries to determine whether to crawl each feed in the whitelist and to track the state of any crawl in progress, and it finishes the Crawl record when all work is complete. Because the crawl state is persistent, the workflow is robust in the presence of timeouts.
+ Task queues execute shards of work. One queue ("crawlpush") is responsible for duplicate elimination (so that a URL is crawled only once per cycle, even if it appears in multiple feeds) and for queuing tasks in the second queue. The second queue ("crawlworker") is responsible for fetching and parsing URLs. (See the sketch at the end of this section.)
+ Two types of files can be handled: CAP files and CAP indices. A CAP file is stored as a CapAlert model in the Datastore. A CAP index is processed by queuing each of its entries in the "crawlpush" queue, allowing recursion.
+ A crawl status page renders the crawl workflow state in a human-readable form. Other admin screens show past crawls and all indexed CAP data. Any errors encountered are saved in the CrawlShard and CapAlert models.
+ The CAP semantics will be implemented, allowing alerts to be created, modified, and expired/retired. (TBD)
+ Data will be purged from the Datastore after some interval (e.g. one year).
+ Each alert will be updated atomically, but there will be no attempt to coordinate the state of multiple alerts. (TBD)
+ Indexing consists of storing indexable attributes (e.g. "category") and of adding derived keys, e.g. a geohash, to enable efficient querying by the front end.

Code Location
-------------

cap_crawl.py (main crawl execution code)

cap_mirror.py (administrative screens)

cap_schema.py (Datastore schema)
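As a concrete illustration of the crawl fan-out described above, here is a minimal, hypothetical sketch assuming the Python 2 App Engine SDK (db, taskqueue, urlfetch). The Feed, CrawlShard, and CapAlert names come from this design; the property names, handler path, and the handle_cap_document() helper are illustrative assumptions, not the actual code in cap_crawl.py::

    # Hypothetical sketch of the cron-driven crawl fan-out; see
    # cap_crawl.py and cap_schema.py for the real implementation.
    from google.appengine.api import taskqueue, urlfetch
    from google.appengine.ext import db

    class Feed(db.Model):
        url = db.StringProperty(required=True)   # whitelisted CAP index URL

    class CrawlShard(db.Model):
        url = db.StringProperty(required=True)   # URL assigned to this shard
        done = db.BooleanProperty(default=False)
        error = db.StringProperty()              # fetch/parse error, if any

    def advance_crawl():
        """Cron entry point: queue every whitelisted feed on "crawlpush".

        The "crawlpush" handler dedupes URLs for the current cycle before
        queuing the actual fetch work on "crawlworker".
        """
        for feed in Feed.all():
            taskqueue.add(queue_name='crawlpush', url='/task/crawlpush',
                          params={'url': feed.url})

    def crawl_worker(url):
        """Body of a "crawlworker" task: fetch one URL, record the outcome."""
        shard = CrawlShard(url=url)
        try:
            response = urlfetch.fetch(url)
            # A CAP file becomes a CapAlert model; a CAP index re-queues
            # each entry on "crawlpush", allowing recursion.
            handle_cap_document(url, response.content)  # hypothetical parser
        except Exception as e:
            shard.error = str(e)
        shard.done = True
        shard.put()

Because the crawl state is persisted in CrawlShard records, a task that times out can simply be retried by the queue without losing the progress of the cycle.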
CapQuery
========

Objective
---------

To provide a scalable serving platform for CAP data as either KML files or CAP indices.

Overview
--------

CapQuery is a query service that provides CAP data formatted as CAP indices (Atom files containing embedded CAP alerts) or as KML. It relies on the App Engine Datastore that is populated by CapMirror.

Infrastructure
--------------

The entire design runs on App Engine in Python. It relies on the Datastore API.

Detailed Design
---------------

+ A user-facing service retrieves the KML (/cap2kml) and CAP index (/cap2atom) views of the alerts.
+ A "Table of Contents" view shows the feeds and some basic statistics, e.g. the number of alerts. (TBD)
+ Search parameters include the feed URL, category, geo bounding box (TBD), and other indexed properties of the CapAlert model. The query parameter namespace is aligned with the CAP standard, using the XML element names in the query string, e.g. "category" and "severity".
+ A flexible query API maps CGI parameters to indexed schema elements and allows common (but not arbitrary) combinations of predicates. (See web_query.py, and the sketch at the end of this section.)
+ If necessary, a query can be split into a Datastore GQL query and a subsequent in-memory filtering of the Datastore results. Common, simple queries are expected to be handled entirely by the Datastore.
+ Serves a static (and one day a self-refreshing) KML file that matches the search criteria. The search parameters are encoded in the URL that is used to refresh.
+ Stable URLs for queries, e.g. http://alert-feed.appspot.com/cap2atom?category=Geo.
+ Serves data only from the most recent crawl. TBD: historical queries, including time series.
+ The original CAP data (XML) is stored in the Datastore (CapAlert.text). It is normalized at query time when inlined into the Atom feed that forms a CAP index (/cap2atom).
+ Both strict and non-conforming parsers are used. TBD: indicate non-conforming CAP to the user, or allow filtering it out.
+ KML is generated at query time (/cap2kml). This is expensive, but we would like to offer customization, e.g. style sheets, to control how CAP maps to KML.
+ *PROBLEM* The size of the Datastore query (measured as the number of models retrieved) is unbounded with respect to the user's query specification. Query sharding and precalculation (during the crawl) are needed to mitigate this.

Code Location
-------------

cap_query.py

web_query.py
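Below is a hypothetical sketch of the CGI-parameter-to-Datastore mapping and the split query described above, again assuming the Python 2 App Engine SDK. The indexed property list, the bbox parameter format, and the lat/lon properties are illustrative assumptions; the actual mapping lives in web_query.py::

    # Hypothetical sketch of translating query-string parameters into a
    # Datastore query over CapAlert, plus an in-memory post-filter.
    from google.appengine.ext import db

    from cap_schema import CapAlert  # Datastore model populated by CapMirror

    # CAP element names assumed to be indexed CapAlert properties.
    INDEXED_PARAMS = ('category', 'severity', 'urgency', 'certainty')

    def build_query(params):
        """Translate e.g. ?category=Geo&severity=Severe into an
        equality-filtered Datastore query over CapAlert."""
        query = db.Query(CapAlert)
        for name in INDEXED_PARAMS:
            value = params.get(name)
            if value:
                query.filter('%s =' % name, value)
        return query

    def run_query(params, limit=100):
        """Run the Datastore part of the query, then post-filter any
        predicates (e.g. a geo bounding box) it cannot express."""
        results = build_query(params).fetch(limit)
        bbox = params.get('bbox')  # assumed "west,south,east,north" format
        if bbox:
            west, south, east, north = map(float, bbox.split(','))
            results = [a for a in results  # lat/lon: assumed properties
                       if west <= a.lon <= east and south <= a.lat <= north]
        return results

Note that fetch(limit) is where the *PROBLEM* bullet bites: a loose query specification can match far more models than any single request can afford to retrieve.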
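In the same spirit, a hedged sketch of query-time KML generation (/cap2kml), where each matching alert becomes a Placemark. The headline/description/lat/lon field names and the CAP-to-KML mapping are assumptions for illustration; the real rendering (and the planned style-sheet customization) belongs to cap_query.py::

    # Hypothetical sketch of query-time KML generation for /cap2kml.
    from xml.sax.saxutils import escape

    KML_DOCUMENT = (u'<?xml version="1.0" encoding="UTF-8"?>\n'
                    u'<kml xmlns="http://www.opengis.net/kml/2.2">'
                    u'<Document>%s</Document></kml>')

    PLACEMARK = (u'<Placemark><name>%s</name><description>%s</description>'
                 u'<Point><coordinates>%s,%s</coordinates></Point></Placemark>')

    def alerts_to_kml(alerts):
        """Render alerts (assumed to expose headline, description, lat,
        and lon) as a single KML document string."""
        placemarks = ''.join(
            PLACEMARK % (escape(a.headline), escape(a.description),
                         a.lon, a.lat)   # KML coordinates are lon,lat
            for a in alerts)
        return KML_DOCUMENT % placemarks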