-
Notifications
You must be signed in to change notification settings - Fork 11
Data Consistency Checks for openstreetmap.org
keepright/keepright
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
/_ _ _ _ _ . _ /__/_ /\ /_'/_'/_/ / / /_// // / _/ Data Consistency Checks for openstreetmap.org ---------------------------------------------- openstreetmap.org (OSM) provides a wiki-style means of creating a world-wide street map where everybody is encouraged to contribute. This is a collection of scripts that will examine part of the OSM database and try to find errors that should be corrected by users. As a result you get ugly lists of errors and are invited to correct them. This document explains how to run data consistency checks on your own database and set up a webpage presenting the results. Online Resources ---------------- The official instance of keepright can be found at http://keepright.at The development instance with all the latest changes can be found at http://osm.mueschelsoft.de/keepright PREREQUISITES ------------- Packages required on Linux: php5 php5-cli apache postgis postgresql >= 8.3 with matching release of postgis (postgresql-8.3-postgis) postgres-client php5-mysql php5-pgsql php5-intl (support for utf-8) php5-idn (support for IDN domain names) mysql-server mysql-client phpMyAdmin phpPgAdmin sun-java7-jre wget wput bzip2 Optional: joe mc You will need both Postgres and MySQL because the checks require GIS functions and the error-presentation scripts rely on MySQL. They will not be recoded to use Postgres because you won't find Postgres on many webhosters. Using sun-java7-jre is not optional. You need java7 by Sun (now Oracle), at least release 6. The checks depend on a copy of the OSM database, split up in parts. Using only a subset of the planet file will result in false-positives because ways are cut in two at the border. To avoid this the splitting is done with overlapping borders - the border regions are included in both adjacent dumps. In the end errors in the overlapping area are discarded. The planet is split up in currently 85 parts, so called 'schemas'. They are processed sequentially and independently. It looks like the osmosis plugin heavily depends on the osmosis version being used. The plugin is tested and works with osmosis_0.42. THE BIG PICTURE --------------- This is the whole process from getting the planet file, running the checks, publishing the check results and collecting user comments. error_view is the resulting table containing all errors. It's the source for the map presentation. backend scripts running on processing servers: main.php update source code loop over all database schemas and process them one by one finally start all over process_schema.php do all that is necessary for processing a single schema: prepeare database (create db tables) diff-update planet file load database with planet file run the checks export & upload results to web server prepareDB.php create database tables, activate postGIS planet.php call osmosis with options for diff-updating a planet file part let osmosis use a custom plugin called 'pl' that creates special dump files osmosis plugin 'pl' PostgreSqlMyDatasetDumpWriter.java create dump files suitable for loading with COPY commands in PostgreSQL the format mainly differs from the current 'snapshot' format in that all geometries are in meters instead of lat/lon it was established before the current 'snapshot' format evolved and cannot be changed with realistic effort any more prepare_helpertables.php update redundant columns prepare_countries.php create structures needed for boundary processing run-checks.php start all the check routines found in config file error_types.php 0010_*.php ... 9999_*.php compare old and new errors, update error states rebuild the error_view table export_errors.php export error_view to dump file webUpdateClient.php upload error_view to web server start procedures on web server for loading the new file communicating with webUpdateServer.php frontend scripts running on web server: report_map.php myText.js, myTextFormat.js main display script including the map and myText layer derived from OpenLayers using an extended version of the Text layer points.php deliver error entries to the client browser selecting errors matching error type selection and current viewport of map comment.php receive user feedback and store it on the webserver's comments table PLANET FILE MANAGEMENT & SQUID ------------------------------ The planet file is split in appriximately 85 rectangular areas called 'schemas' (this wrong term evolved in the early days, because every part of the planet resides in its own database schema). Have a look at config/planet.odg for the splitting layout. When processing a file osmosis will download all diffs since last update and apply them to the schema's planet file. As the planet diffs include updates for the whole planet osmosis includes objects out of scope to the current schema's planet file. That is why cutting the schema's planet file has to be repeated after the diff-update. All of these files are diff-updated individually. That means you always work with the most recent version of each file but you end up downloading the same diff files over and over. That's where the web proxy squid comes into play: Squid caches all web access. It speeds up your downloads and avoids unnecessary traffic (the saving is by a factor of 85 - 1). Setting up squid is quite easy. On Debian/ubuntu Linuxes do something like this: > aptitude install squid change the config file /etc/squid/squid.conf to increase overall cache size to 1000MB if you like (default is 100MB which is a little bit small). Choose cache size big enough to hold all planet diffs that are needed for updating even the oldest schema in the loop (depending on loop cycle time). cache_dir ufs /var/spool/squid 1000 16 256 restart squid >/etc/init.d/squid restart tell your osmosis: add this line to ~/.osmosis: JAVACMD_OPTIONS="-Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 " my ~/.osmosis looks like this: JAVACMD_OPTIONS=" -Xmx2500m -Djava.io.tmpdir=/media/big_harddisk/tmp/ -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128 " The website check (#410) too benefits from an http proxy. Have a look at the respective options in you keepright user config file (~/.keepright). Increasing the cache size to 5000MB for use with the website check seems appropriate. SETTING UP LOCAL DATABASES -------------------------- [Don't skip this section if you already have a local database!] This project uses a modified version of the "simple PostgreSQL schema" as specified in osmosis/script/pgsql_simple_schema.sql, which is part of the source distribution of Osmosis. This means that the base tables are the same, but there are additional columns providing redundancy. This redundancy is used to boost performance of the queries as it can save some joins. For example the ways table has the number of nodes, as well as the id and lat/lon of the first and last node as additional columns; in way_nodes you find lat/lon of the nodes. The downside is, you cannot use a default database. And you have to use a modified version of Osmosis to convert the planet file. A plugin for Osmosis is provided with the sources. It teaches Osmosis a new option --pl that will create dump files with parts of the redundancy needed. This is the short form of an article on the wiki http://wiki.openstreetmap.org/wiki/Mapnik/PostGIS Tuning PostgreSQL configuration for performance of OSM databases is an adventure. Since PostgreSQL 9.0 you can use the pgtune tool. It creates a modified version of your postgresql.conf depending on main memory installed and depending on the usage type you provide (DW seems to be the best matching one). These are setup parameters you could set before starting out manually: >>>Tune database parameters edit /etc/postgresql/8.3/main/postgresql.conf and add/modify these parameters: shared_buffers = 1024MB work_mem = 128MB maintenance_work_mem = 128MB wal_buffers = 512kB checkpoint_segments = 20 max_fsm_pages = 1536000 effective_cache_size = 512MB autovacuum = off >>>assert the auto-vacuum daemon being shut down joe /etc/crontab comment out any auto-vacuum-daemon entry >>>Tune shmmax kernel parameter joe /etc/sysctl.conf edit/add the parameter kernel.shmmax=300000000 after that reboot the machine or simply execute sysctl -w kernel.shmmax=300000000 && /etc/init.d/postgresql-8.3 restart >>>Optionally turn off postgres user authentication for local access joe /etc/postgresql/8.3/main/pg_hba.conf Add this line: local all all trust This is a security risk. You will not need a password when using the command line psql shell. Most probably you'll use phppgadmin and won't need this. >>> Alternatively to turning off local password prompting you may create a .pgpass file joe ~/.pgpass add a line of this form: hostname:port:database:username:password 127.0.0.1:*:*:keepright:yourpasswordhere chmod 0600 ~/.pgpass >>>Create the new user su - postgres createuser keepright Shall the new role be a superuser? (y/n) y You needn't create the postgres database, as the updateDB script will do that automatically. But you have to set the password for the keepright user inside postgres. Still as user postgres start the psql shell: > psql postgres=# ALTER ROLE keepright WITH PASSWORD 'shhh!'; ALTER ROLE just in case the scripts don't work as expected: creating the database and installing postGIS is easy if you're using postgresql>=9.1 CREATE DATABASE osm WITH OWNER = osm; inside the newly created database just run CREATE LANGUAGE plpgsql; -- (should already be there) CREATE EXTENSION postgis; Wondering why the auto-vac-daemon ist shut off? The daemon will start analyzing and vacuuming tables every few hours to keep index performance up on a high level. But this consumes large amounts of IO bandwidth and disturbes normal operation. Vacuuming is done by hand throughout the scripts because there are many temporary tables that need analyzing and the daemon never comes at the right time. Basically it is done once after loading data and then manually after creation of temp tables and adding indexes. For inserting actual data take a look at updateDB.php, planet.php and config: These scripts download a planet dump from the net or diff-update an already existing set of planet excerpts and insert the planet files contents in a database. In config you can define the databases and the coordinates of the areas. Don't forget to adapt the appropriate configuration variables to match your database credentials in ~/.keepright. OSMOSIS_BIN has to point to the location where you have put the osmosis executable. Configuration is split in two parts: config/config is the default file. This file will always be read first and it will be updated via svn to add new setup options. You will want to make settings differ from the standard settings. Therefore you can change the file ~/.keepright which includes only the system-specific settings (this file will be created upon the first run of main.php). ~/.keepright will be read after the built-in config file so any settings made here will overwrite the default. Finally you will have to setup a MySQL database and user for the destination tables needed by the presentation scripts if you want to run a web server. Update ~/.keepright and webconfig.inc.php with the new database credentials. RUNNING THE CHECKS ------------------ First of all you need to specify database credentials in ~/.keepright. Second, take a look at the list of error types in config/error_types.php. Here you may specify which types of checks should be executed. Anything different from zero will enable a job. Assuming you already have a populated database you start checking by calling run-checks.php from the shell: > php run-checks.php 1 20 30 40 will start the checks 20, 30 and 40 on the database schema called 1. Providing check numbers on the command line is optional. If none are given, all checks are run if they are enabled in ~/.keepright. When processing has finished you will have (among others) a newly created table called public.error_view. Here you can find records for all errors that exist. This postgres table will get transferred into MySQL by export_errors.php and webUpdateClient.php. As time goes by you will update your database and maybe errors are getting corrected. The scripts will detect when old errors don't exist any more and will update the state information in the errors tables to state==cleared. VISUALIZING RESULTS ------------------- report_map.php is used for displaying errors on the map. This script displays a slippy map using an exra layer to draw icons. Icons are drawn for every faulty node and on either starting node of faulty ways. They display some hint about the error when hovered. Keep in mind that this display method draws a limited number of errors in the map, because of memory constraints in browsers and the webserver. WRITING YOUR OWN CHECKS ----------------------- Take a look at the existing checks to see how they work. Then take a look at the template file 0000_template.php. If you write a new check you also have to mention it in the config file. Keep in mind that all checks are included using include() inside a while loop that is running inside run-checks.php. Surprisingly that doesn't matter much, with some exceptions: Any checks run in the same scope, you are even allowed to declare functions (inside the while-loop!) but you must not declare two functions of the same name in different checks. Also don't rely on global variables not being used at the beginning of your script (maybe another check did already initialize a variable of the same name). The same is valid for temporary tables you may need. Always check if a table already exists before creating it. At the end of the script drop any tables you have created. Maybe in the future I will change this into an oop-styled buch of classes, but up to now it is working great this way. If you have ideas for new checks, I would like to integrate them in the official sources to let others benefit from them. So please let me know! And please let me assign a unique check numer for your checks to avoid collisions. SCHEMA SPLITTING WORKFLOW -------------------------- when planet schema files grow beyond certain limits it is necessary to further split them. e.g. splitting old schema '1' into '86' and '87' determine optimum splitting boundaries to achieve equally sized files of up to 4GB uncompressed xml data: /home/harald/OSM/osmosis-0.36/bin/osmosis --rx 1.osm --tee 2 --bb left=-30 top=85 right=1.8 bottom=52.3 idTrackerType=BitSet completeWays=yes completeRelations=yes --wx 86.osm --bb left=-30 top=52.3 right=1.8 bottom=49.3 idTrackerType=BitSet completeWays=yes completeRelations=yes --wx 87.osm > log 2>&1 & update config blocks in config/schemas.php use planet.php to get boundaries including padding: php planet.php --cut 1.osm 86 87 again use osmosis to split files using definitive boundaries in planet directory copy config directory of old planet file to new directories in 0130_islands.php check that there is a starting point in every schema, add some as required copy old error records to both new schemas: first duplicate errors into secondary new schema: insert into public.errors(error_id, error_type, object_type, object_id, state, first_occurrence, last_checked, lat, lon, "schema", msgid, txt1, txt2, txt3, txt4, txt5) SELECT error_id, error_type, object_type, object_id, state, first_occurrence, last_checked, lat, lon, '87', msgid, txt1, txt2, txt3, txt4, txt5 FROM public.errors where "schema"='1'; last move errors from old schema into primary new schema: update public.errors set "schema"='86' where "schema"='1'; update make.sh to include the new schemas and exclude the old ones run checks on webserver db: copy comments from old schema to new ones: INSERT INTO `comments`(`schema`, `error_id`, `state`, `comment`, `timestamp`, `ip`, `user_agent`) SELECT '86', `error_id`, `state`, `comment`, `timestamp`, `ip`, `user_agent` FROM `comments` WHERE `schema`='1' UPDATE comments SET `schema`='87' WHERE `schema`='1'; update table schemata: drop old schema line and add new schema with updated boundaries delete old planet file, old error_view table from webserver, old config file sections RUNNING ON WINDOWS ------------------ install packages from these locations: PHP http://windows.php.net/download/ PostgreSQL http://www.enterprisedb.com/products-services-training/pgdownload#windows choose PostgreSQL 9.2 or later, choose the x64 flavor PostGIS http://postgis.refractions.net/download/windows/ choose version 2 or later install PostGIS using application stack builder shipping with PostgreSQL bzip2 http://sourceforge.net/projects/gnuwin32/files/bzip2/1.0.5/bzip2-1.0.5-bin.zip/download SVN client (Apache SVN) http://www.sliksvn.com/en/download Java Runtime Environment http://www.oracle.com/technetwork/java/javase/downloads/ make sure you catch the 64 bit version of Java Add some directories to PATH environment variable pointing to the programs you just installed (change paths accordingly to match your environment): C:\Program Files\SlikSvn\bin;C:\Program Files (x86)\bzip2\bin;C:\Program Files\php5;C:\Program Files (x86)\Java\jre7\bin Create an OSM directory (eg. C:\OSM\) and check out the source files. In a cmd window type C: cd \ md OSM cd OSM mkdir keepright svn co svn://svn.code.sf.net/p/keepright/code/ keepright Create a copy of the config.php file you find in C:\OSM\keepright\config\config.php.template and rename it to userconfig.php in the same directory. This file replaces the ~/.keepright file used on Linux environments and is never overwritten by svn updates. Open the file in your favourite text editor and change the paths accordingly. Create a file called C:\users\<your user name>\osmosis.bat and give it the following content: set JAVACMD=C:\Program Files\Java\jre7\bin\java.exe set JAVACMD_OPTIONS=-Xmx2500m -Djava.io.tmpdir=C:\temp\ increase the memory limit in case osmosis should crash. in case you want to use a http proxy add this part to your JAVACMD_OPTIONS " -Dhttp.proxyHost=<your.proxy.host> -Dhttp.proxyPort=3128 " in OSM\osmosis\bin\osmosis.bat you may have to change the next-to last line to include your JAVACMD in apos (") in case it contains spaces just like this: SET EXEC="%JAVACMD%" %JAVACMD_OPTIONS% -cp "%PLEXUS_CP%" -Dapp.home="%MYAPP_HOME%" -Dclassworlds.conf="%MYAPP_HOME%\config\plexus.conf" %MAINCLASS% %OSMOSIS_OPTIONS% %* Create/modify your php.ini file to enable the PostgreSQL-Extension If you downloaded the non-installer version of php you need to create a php.ini file yourself. Just copy and rename php.ini-production in your php directory to php.ini. Find the extension_dir setting and point it to the ext directory in your php installation: extension_dir = "C:\Programme\php5\ext" find and uncomment the line loading the PostgreSQL-Extension: extension=php_pgsql.dll LEGAL STUFF ----------- Sources are licensed under GPLv2. This collection of characters was created using a random number generator. I don't think these files are useful for anything or anyone. If you copy, watch, process or even think about putting this collection of bytes into your computer, you do this at your own risk. Don't blame me. IMPRESSUM --------- This work is done without commercial background, just for my personal pleasure. I would be very happy if it was helpful for the OSM Project. If you like to contact me, my mailbox at the austrian server of gmx is labelled keepright
About
Data Consistency Checks for openstreetmap.org
Resources
Stars
Watchers
Forks
Releases
No releases published
Packages 0
No packages published