release for publication
Welcome to CogStack
Introduction
CogStack is a lightweight, distributed, fault-tolerant database processing architecture, intended to make NLP processing and
preprocessing easier in resource-constrained environments. It makes use of the Spring Batch framework to provide a fully configurable
pipeline, with the goal of generating annotated JSON documents that can be readily indexed into Elasticsearch, or pushed back to a database.
In the parlance of the batch processing domain,
it uses the partitioning concept to create 'partition step' metadata for a DB table. This metadata is persisted in the
Spring database schema, after which each partition can be executed locally or farmed out remotely via a JMS middleware
server (only ActiveMQ is supported at this time). Remote worker JVMs then retrieve metadata descriptions of work units.
The outcome of processing is then persisted in the database, allowing robust tracking and simple restart of failed partitions.
Why does this project exist / why is batch processing difficult?
CogStack is a range of technologies designed to support modern, open source healthcare analytics within the NHS. It
chiefly comprises the Elastic stack (Elasticsearch, Kibana, etc.), GATE, Bioyodie and Biolark (clinical natural language processing for
entity extraction), OCR, clinical text de-identification, and Apache Tika for MS Office to text conversion.
When processing very large datasets (tens to hundreds of millions of rows of data), it is likely that some rows will present certain
difficulties for different processes. These problems are typically hard to predict - for example,
some documents may have very long sentences, an unusual sequence of characters, or machine-only content. Such circumstances can
create a range of problems for NLP algorithms, and thus a fault-tolerant batch framework is required to ensure robust, consistent
processing.
Installation
We're not quite at a regular release cycle yet, so if you want a stable version, I suggest downloading v1.0.0 from the release
page. However, if you want more features and (potentially) fewer bugs, it's best to build from source on the master branch.
To build from source:
- Install Tesseract and ImageMagick (both can be installed via apt-get on Ubuntu; see the example below)
- Run the following:
gradlew clean build
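On Ubuntu, for example, the system dependencies can typically be installed with the standard packages (package names may vary by release):
sudo apt-get install tesseract-ocr imagemagick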
Quick Start Guide
The absolute easiest way to get up and running with CogStack is to use Docker. Docker can provide
lightweight virtualisation of the variety of microservices that CogStack makes use of. When coupled with the Docker Compose
orchestration technology, all of the components required to use CogStack can be set up with a few
simple commands.
First, ensure you have Docker v1.13 or above installed.
Elasticsearch in docker requires the following to be set on the host:
sudo sysctl -w vm.max_map_count=262144
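Note that this setting does not survive a reboot. To make it permanent, you can append it to /etc/sysctl.conf:
echo 'vm.max_map_count=262144' | sudo tee -a /etc/sysctl.conf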
Now you need to build the required Docker containers. Fortunately, the Gradle build file can do this for you.
From the CogStack top level directory:
gradlew buildSimpleContainers
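Once the build completes, you can confirm the images were created with:
docker images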
Assuming the containers have been built successfully, simply navigate to
cd docker-cogstack/compose-ymls/simple/
And type
docker-compose up
All of the Docker containers should now be up and communicating with each other. You can view their status with
docker ps -a
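If any container fails to start, its logs are usually the best place to look. From the same compose directory:
docker-compose logs -f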
That's it!
"But that's what?", I hear you ask?
The high level workflow of CogStack is as follows:
- Read a row of the table into the CogStack software
- Process the columns of the row with inbuilt Processors
- Construct a JSON document that represents the table row plus any new data arising from processing (such as NLP webservice output); a hypothetical example is shown below
- Index the JSON into an Elasticsearch cluster
- Visualise the results with Kibana
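For illustration only, such a document might look something like the following (the field names here are hypothetical, not CogStack's actual output schema):
{
  "primaryKeyFieldValue": 1,
  "srcTableName": "tblinputdocs",
  "observation_text": "Patient reports a dry mouth and ...",
  "nlp_annotations": [
    { "start": 17, "end": 26, "concept": "dry mouth" }
  ]
}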
To understand what's going on, we need to delve into what each of the components is doing. Let's start with the container called
'some-postgres'. Let's assume this is a database that contains a table that we want to process somehow. In fact, this example database already
contains some example data. If you have some database browsing software, you should be able to connect to it with the following JDBC configuration:
source.JdbcPath = jdbc:postgresql://localhost:5432/cogstack
source.Driver = org.postgresql.Driver
source.username = cogstack
source.password = mysecretpassword
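If you don't have a GUI client to hand, the same check can be done with the psql command line tool (assuming you have it installed):
PGPASSWORD=mysecretpassword psql -h localhost -p 5432 -U cogstack -d cogstack -c 'SELECT * FROM tblinputdocs;'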
You should see a table called 'tblinputdocs' in the 'cogstack' database, containing four rows of dummy data. This table is now constantly
being scanned and indexed into Elasticsearch. If you know how to use the Kibana tool (served at localhost:5601),
you can visualise the data in the cluster.
Now bring the compose configuration down, from the same compose directory as before:
docker-compose down
This is the most basic configuration, and really doesn't do much other than convert a database table/view into an Elasticsearch index.
For more advanced use cases and configurations, check out the integration tests below.
Integration Tests
Although CogStack has unit tests where appropriate, the nature of the project is such that the real value of testing comes
from the integration tests. Consequently, CogStack has an extensive suite of them.
To run the integration tests, ensure the required external services are available
(these also give a good idea of how CogStack is configured). These services are PostgreSQL, Biolark, Bioyodie and Elasticsearch. The easiest
way to get these going is with Docker. Once you have Docker installed, CogStack will handily
build the containers you need for you (apart from Elasticsearch, where the official image will suffice). To build the containers:
From the CogStack top level directory:
gradlew buildAllContainers
Note that Biolark and Bioyodie are external applications. Building their containers (and subsequently running their integration tests) may require you to
meet their licensing conditions. Please check with Tudor Groza (Biolark) and Angus Roberts/Genevieve Gorrell (Bioyodie) if in doubt.
Assuming the containers have been built successfully, navigate to
cd docker-cogstack/compose-ymls/nlp/
And type
docker-compose up
to launch all of the external services.
All being well, you should now be able to run the integration tests. Each of these demonstrates a different facet of CogStack's functionality.
Each integration test follows the same pattern:
- Generate some dummy data for processing, by using an integration test execution listener
- Activate a configuration appropriate for the data and run cogstack
- Verify results
All integration tests for Postgres can be run by using:
gradlew postgresIntegTest
If you're new to CogStack, however, you might find it more informative to run them individually and inspect the results after each one. For example,
to run a single test:
gradlew -DpostgresIntegTest.single=<integration test class name> -i postgresIntegTest
Available classes for integration tests are in the package
src/integration-test/java/uk/ac/kcl/it/postgres
For example, to load the postgres database with some dummy Word files, process them with Tika, and load the results into an Elasticsearch index called <test_index2> and a postgres output table:
gradlew -DpostgresIntegTest.single=TikaWithoutScheduling -i postgresIntegTest
Then point your browser to localhost:5601 to see the results in Kibana.
A note on SQL Server
Microsoft have recently made SQL Server available on Linux, with an official Docker container. This is good news, as
most NHS Trusts use SQL Server for most of their systems. To run this container:
docker run -e 'ACCEPT_EULA=Y' -e 'SA_PASSWORD=yourStrong(!)Password' -p 1433:1433 -d microsoft/mssql-server-linux
...noting their licence conditions. This container will then allow you to run the integration tests for SQL Server:
gradlew sqlServerIntegTest
Single tests can be run in the same fashion as for Postgres, substituting the syntax as appropriate, e.g.:
gradlew -DsqlServerIntegTest.single=TikaWithoutScheduling -i sqlServerIntegTest
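By analogy with the Postgres configuration above, a JDBC configuration pointing at this SQL Server container might look like the following (illustrative values, matching the docker run command above):
source.JdbcPath = jdbc:sqlserver://localhost:1433
source.Driver = com.microsoft.sqlserver.jdbc.SQLServerDriver
source.username = SA
source.password = yourStrong(!)Password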
A note on GATE
Applications that require GATE generally need to be configured to point to the GATE installation directory (otherwise they would need to include a rather large number of plugins on their classpath). To do this in CogStack, set the appropriate properties as detailed in gate.* .
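For example, such a configuration might look something like this (the property names below are illustrative only; consult the gate.* entries in the example configuration files for the actual keys):
# illustrative only - check the real gate.* keys in your configuration
gate.gateHome = /opt/gate
gate.gateApp = /opt/gate/application.xgapp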
Acceptance Tests
The accompanying manuscript for this work describes some artificially generated pseudo-documents containing misspellings and
other string mutations, used to validate the de-identification algorithm without requiring access to real world
data. These results can be replicated (subject to random number generation) by using the acceptance test package.
To reproduce the results described in the manuscript, simply run the following command:
gradlew -DacceptTest.single=ElasticGazetteerAcceptanceTest -i acceptTest
To reconfigure this test class for the different conditions described in the manuscript, you will need to alter the parameters inside the
elasticgazetteer_test.properties
file, which describes the potential options. For efficiency, it is recommended to do this from inside an IDE.
...
Version 1.0.0 released!
The code's been in production for a while now, without major disaster. Welcome to 1.0.0...