
Commit
Merge pull request #405 from matthiaskoenig/develop
pkdb-v0.6.0
matthiaskoenig authored Jun 24, 2019
2 parents c183053 + 1022a4a commit 0a2c003
Showing 111 changed files with 3,398 additions and 1,886 deletions.
5 changes: 3 additions & 2 deletions .env.local
@@ -7,10 +7,11 @@
PKDB_DOCKER_COMPOSE_YAML=docker-compose-develop.yml
PKDB_DJANGO_CONFIGURATION=local

PKDB_API_BASE="http://0.0.0.0:8000"
PKDB_API_BASE=http://0.0.0.0:8000
FRONTEND_BASE=http://0.0.0.0:8080

PKDB_SECRET_KEY="cgasj6yjpkagcgasj6yjpkagcgasj6yjpkag"
PKDB_DEFAULT_PASSWORD="pkdb"
PKDB_ADMIN_PASSWORD=pkdb_admin

PKDB_DB_NAME=postgres
PKDB_DB_PASSWORD=postgres
1 change: 1 addition & 0 deletions .gitignore
@@ -16,6 +16,7 @@ __pycache__/
# Environments
# ----------------------------
.env
.env.production
frontend/.env.production

# ----------------------------
2 changes: 1 addition & 1 deletion .travis.yml
@@ -12,7 +12,7 @@ env:
PKDB_API_BASE="http://0.0.0.0:8000"

PKDB_SECRET_KEY="cgasj6yjpkagcgasj6yjpkagcgasj6yjpkag"
PKDB_DEFAULT_PASSWORD="pkdb"
PKDB_ADMIN_PASSWORD="pkdb"

PKDB_DB_NAME=postgres
PKDB_DB_PASSWORD=postgres
172 changes: 139 additions & 33 deletions CURATION.md
@@ -20,6 +20,32 @@ the latest data on file changes therefore allowing rapid iteration of curation and
validation. Information on how to setup the `watch_study` script is provided
in https://github.com/matthiaskoenig/pkdb_data

As an initial setup, create a virtual environment with `pkdb_data` installed.
```
git clone https://github.com/matthiaskoenig/pkdb_data.git
cd pkdb_data
mkvirtualenv pkdb_data --python=python3.6
pip install -e .
```
The next step is to export your credentials via environment variables.
Create a `.env` file with the following content:
```
API_BASE=https://develop.pk-db.com
USER=PKDB_USERNAME
PASSWORD=PKDB_PASSWORD
```
Then export the variables via
```
set -a && source .env
```
To check the environment variables use
```
echo $API_BASE
echo $USER
echo $PASSWORD
```

To watch a given study use
```
# activate virtualenv with watch_study script
workon pkdb_data
@@ -31,6 +57,9 @@ workon pkdb_data
(pkdb_data) watch_study -s $STUDYFOLDER
```

Here is the example output for the successfully uploaded study `Renner2007`. On file changes the study will be re-uploaded.
![Interactive curation with watch_study](./docs/curation/watch_study.png "Interactive curation with watch_study")

## 1. PDF, Reference, Figures & Tables

For upload, a certain folder structure and organisation of the `$STUDYFOLDER` is expected.
@@ -39,8 +68,8 @@ The first step is to create the folder and the basic files in the folder
- folder name is `STUDYNAME`, e.g., `Albert1974`
- folder contains the pdf as `STUDYNAME.pdf`, e.g., `Albert1974.pdf`
- folder contains additional files associated with the study, i.e.,
- tables, named `STUDYNAME_Tab[1-9]*.png`, e.g., `Albert1974_Tab1.png`
- figures, named `STUDYNAME_Fig[1-9]*.png`, e.g., `Albert1974_Fig2.png`
- tables, named `STUDYNAME_Tab[1-9]*.png`, e.g., `Albert1974_Tab2.png`
- figures, named `STUDYNAME_Fig[1-9]*.png`, e.g., `Albert1974_Fig1.png`
- excel file, named `STUDYNAME.xlsx`, e.g., `Albert1974.xlsx`

In addition, the folder can contain data files,
@@ -52,6 +81,7 @@ Information about the study is stored in the `study.json` which we will create
in the following steps. Information about the reference for the study is stored
in the `reference.json` (this file is created automatically and should not be altered).

![Overview study files](./docs/curation/curation_files.png "Overview study files")

## 2. Initial study information (`study.json`)

@@ -72,12 +102,9 @@ contains all the relevant information
"access": "public || private",
"creator": "mkoenig",
"curators": [
"mkoenig",
"janekg"
["mkoenig", 0.5]
],
"collaborators": [],
"substances": [],
"keywords": [],
"descriptions": [],
"comments": [],
"groupset": {
@@ -106,9 +133,10 @@ contains all the relevant information
* Fill in the basic information for the study, i.e., the `name` field with the `$STUDYNAME`; the `creator`, `curators` and `collaborators` fields with existing users (the creator is a single user, whereas curators and collaborators are lists of users); and the `sid` and `reference` fields with the `PubmedId` of the study.
* Substances which are used in the study should be listed in the `substances`. Substances must be existing substances which can be looked up at https://develop.pk-db.com/#/curation
* `keywords` relevant for the study should be mentioned in the `keywords` list. Keywords must be existing keywords which can be looked up at https://develop.pk-db.com/#/curation
* The `reference` field is optional. If no pubmed entry exists for the publication, a `reference.json` should be built manually (please ask what to do in such a case).
* The `access` field provides information on who can see the study. `public` provides access to everyone, `private` only to the `creator`, `curators` and `collaborators`.
* The `curators` field is a list consisting of either curator names (e.g. `mkoenig`) or a curator name with a curation score between 0.0 and 5.0 (e.g. `[mkoenig, 3.5]`).
* The `licence` field provides information on the licence of the publication. This is either `open` in case of Open Access publications or `closed` otherwise. Images and the PDF are only shown publicly if the publication is Open Access.

After this initial information is created in the `study.json` we can start running the `watch_study` script.
```
@@ -121,7 +149,8 @@ Descriptions are quotations from the publication to substantiate and support the
`comments` provide the possibility to store information from individual curators. Examples of comments are noting incorrect units, missing data, or strange conversions of data.


## 3. Curation of groups
## 3. Curation of groups and individuals
### groups
The next step is the curation of the `group` information, i.e., which groups were studied. The information is stored in the `groupset` of the following form.
Retrieve group information from the publication and copy it into the descriptions of the groupset. The top group containing all subjects must be called `all`.

@@ -131,7 +160,7 @@ An overview over the available `characteristica` and possible choices is available
```json
{
"groupset": {
"description": [
"descriptions": [
"The subjects were six healthy volunteers, three males and three females, aged 21.0 ± 0.9 years (range 20 to 22 years) and weighing 63 ± 11 kg (range 50 to 76 kg). All were nonsmokers. Subjects abstained from all methylxanthine containing foods and beverages during the entire period of the study. Compliance with this requirement was assessed by questioning at each presentation for blood sampling or urine delivery."
],
"comments": [],
@@ -141,19 +170,19 @@ An overview over the available `characteristica` and possible choices is available
"name": "all",
"characteristica": [
{
"category": "species",
"measurement_type": "species",
"choice": "homo sapiens"
},
{
"category": "healthy",
"measurement_type": "healthy",
"choice": "Y"
},
{
"category": "overnight fast",
"measurement_type": "overnight fast",
"choice": "Y"
},
{
"category": "fasted",
"measurement_type": "fasted",
"min": "10",
"max": "14",
"unit": "hr"
@@ -173,7 +202,7 @@ All available fields for characteristica on group are:
All available fields for characteristica on group are:
```json
{
"category": "categorial",
"measurement_type": "categorial",
"choice": "categorial|string",
"count": "int",
"mean": "double",
@@ -186,9 +215,39 @@ All available fields for characteristica on group are:
"unit": "categorial"
}
```
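A characteristica entry thus carries either a categorical `choice`, a single `value`, or summary statistics (`mean`, `median`, `min`, `max`, `sd`, `se`, `cv`) together with a `unit`. A small sketch of how these variants can be distinguished (the helper is an illustrative assumption, not PK-DB's actual validation code):

```python
# Illustrative helper (an assumption, not PK-DB code): classify a
# characteristica by which of its fields are present.
def characteristica_kind(c):
    if "choice" in c:
        return "categorical"
    if "value" in c:
        return "single value"
    if any(key in c for key in ("mean", "median", "min", "max", "sd", "se", "cv")):
        return "statistics"
    return "unknown"

print(characteristica_kind({"measurement_type": "species", "choice": "homo sapiens"}))  # categorical
print(characteristica_kind({"measurement_type": "fasted", "min": "10", "max": "14", "unit": "hr"}))  # statistics
```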
### individuals
Individuals are curated very similarly to groups, with the exception that individuals must belong
to a given group, i.e., the `group` attribute must be set. Individuals are most often defined based on spreadsheet mappings.
See, for instance, the individuals below, which are defined via a table.

```json
"individuals": [
{
"name": "col==subject",
"group": "col==group",
"source": "Akinyinka2000_Tab1.csv",
"format": "TSV",
"figure": "Akinyinka2000_Tab1.png",
"characteristica": [
{
"measurement_type": "age",
"value": "col==age",
"unit": "yr"
},
{
"measurement_type": "weight",
"value": "col==weight",
"unit": "kg"
},
{
"measurement_type": "sex",
"choice": "col==sex"
}
]
}
]
```
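The `col==` prefix maps a field to a column of the source spreadsheet, so a single JSON template expands to one individual per table row. A sketch of how such a mapping could be resolved (the `resolve` helper and the example table values are illustrative assumptions, not the actual upload code):

```python
import csv
import io

# Hypothetical resolver for the "col==NAME" convention: values prefixed with
# "col==" are looked up per row in the source table; everything else is constant.
def resolve(value, row):
    if isinstance(value, str) and value.startswith("col=="):
        return row[value[len("col=="):]]
    return value

# Example table in the spirit of Akinyinka2000_Tab1.csv (made-up values).
TSV = (
    "subject\tgroup\tage\tweight\tsex\n"
    "S1\tall\t24\t70\tM\n"
    "S2\tall\t31\t58\tF\n"
)

template = {"name": "col==subject", "group": "col==group",
            "age": "col==age", "age_unit": "yr"}

individuals = [
    {key: resolve(value, row) for key, value in template.items()}
    for row in csv.DictReader(io.StringIO(TSV), delimiter="\t")
]
print(individuals[0])  # {'name': 'S1', 'group': 'all', 'age': '24', 'age_unit': 'yr'}
```

Constant fields (like `age_unit`) pass through unchanged, while `col==` fields vary per row; this is why one template definition suffices for an arbitrary number of individuals.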

Individuals are curated like groups with the exception that individuals must belong
to a given group, i.e., the `group` attribute must be set. Individuals are most often defined via excel spreadsheets.

## 4. Curation of interventions/interventionset
```json
@@ -202,7 +261,7 @@
"route": "iv",
"form": "solution",
"application": "single dose",
"category": "dosing",
"measurement_type": "dosing",
"value": "0.5",
"unit": "g/kg",
"time": "0",
@@ -217,7 +276,7 @@
{

"name": "string",
"category": "categorial",
"measurement_type": "categorial",

"substance": "categorial (substance)",
"route": "categorial {oral, iv}",
@@ -240,15 +299,41 @@ All available fields for intervention and interventionset are:
```
* TODO: document after next iteration

## 5. Curation of output/outputset
Data from Figures should be digitized using plot digitizer. See guidelines in
## 5. Curation of outputs and time courses
The actual data in a publication is available from tables, from figures, or stated in the text.
All information should be curated by means of excel spreadsheets, i.e., data must be digitized and transferred from the
PDF into a spreadsheet.

- Use Excel (LibreOffice/OpenOffice) spreadsheets to store data, with all digitized data is stored in excel spreadsheets
- Use Excel (LibreOffice/OpenOffice) spreadsheets to store digitized data
- Change language settings to use US numbers and formats (i.e., '.' as decimal separator). Always use the point ('.') as decimal separator, never the comma (','), i.e., 1.234 instead of 1,234.
- PlotDigitizer to digitize figures (https://sourceforge.net/projects/plotdigitizer/)
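If a spreadsheet was accidentally exported with comma decimal separators, the values have to be normalized before further processing. A minimal sketch (assuming plain decimal commas and no thousands separators):

```python
# Normalize a decimal-comma number string to a float (assumes ',' is the
# decimal separator and no thousands separators are present).
def normalize_number(s):
    return float(s.replace(",", "."))

print(normalize_number("1,234"))  # 1.234
print(normalize_number("63,5"))   # 63.5
```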

For all figures and tables from which data is extracted, individual images (`png`) must be stored in the study folder, i.e.,
- tables, named `STUDYNAME_Tab[1-9]*.png`, e.g., `Albert1974_Tab2.png`
- figures, named `STUDYNAME_Fig[1-9]*.png`, e.g., `Albert1974_Fig1.png`
Use the screenshot functionality of the PDF viewer and save the images with an image program like `gimp` after removing unnecessary borders.

![Overview study files](./docs/curation/curation_files.png "Overview study files")

### Figures
- Use PlotDigitizer to digitize figures (https://sourceforge.net/projects/plotdigitizer/)
- Open the image to digitize (`STUDYNAME_Fig[1-9]*.png`)
- Use the Zoom function to enlarge the image if necessary (easier to click on data points)
- First, the axes have to be calibrated (make sure to set logarithmic axes where necessary); calibration should be done very carefully because it has a systematic effect (bias) on all digitized data points.
- Digitize one curve at a time
- Digitize mean and error separately (the actual error can then be calculated in excel as `abs(mean-error)`)
- In case of time courses adapt the time points to the actual times reported in the publication
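The error-bar step above can be sketched as follows: digitize the mean curve and the upper error-bar curve separately, then recover the error as `abs(mean - error)` per time point (the curve values below are made up for illustration):

```python
# Recover the error from separately digitized mean and upper error-bar curves.
mean  = [0.0, 4.2, 7.9, 6.1, 3.3]   # digitized mean curve
upper = [0.0, 5.0, 9.1, 7.0, 3.9]   # digitized upper error-bar curve

# Round to a reasonable number of digits, as recommended for digitization.
sd = [round(abs(m - u), 3) for m, u in zip(mean, upper)]
print(sd)  # [0.0, 0.8, 1.2, 0.9, 0.6]
```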

![Figure digitization](./docs/curation/figure_digitization.png "Figure digitization (axes have been calibrated and first mean curve digitized)")

Some tips for digitization of figures:
- Check that the axes do not have breaks or mixed scales (sometimes an axis is continued or not scaled equally); such axes create problems in the digitization.
- Use the same axis calibration for all curves in a figure, i.e., always digitize all data from a single figure panel completely in one go (otherwise the curves acquire different biases).
- The lower left corner is not always `(0,0)`; use the minimal x-tick and y-tick with available numerical values.
- Set the number of digits to a reasonable value (2-3 digits).

### Tables

```json
{
"source": "Akinyinka2000_Tab3.csv",
"format": "TSV",
@@ -260,26 +345,47 @@ Data from Figures should be digitized using plot digitizer. See guidelines in
],
"substance": "paraxanthine",
"tissue": "col==tissue",
"pktype": "cmax || tmax || auc_inf || thalf",
"measurement_type": "cmax || tmax || auc_inf || thalf",
"mean": "col==cmax || col==tmax || col==aucinf || col==thalf",
"sd": "col==cmax_sd || col==tmax_sd || col==aucinf_sd || col==thalf_sd",
"unit": "\u00b5g/ml || hr || \u00b5g*hr/ml || hr"
}
```

```json
{
"timecourses": [
{
"group": "all",
"groupby": "intervention",
"interventions": "col==intervention",
"source": "Albert1974_Fig1.tsv",
"format": "TSV",
"figure": "Albert1974_Fig1.png",
"substance": "paracetamol",
"tissue": "plasma",
"measurement_type": "concentration",
"time": "col==time_min",
"time_unit": "min",
"mean": "col==apap",
"cv": "col==apap_cv",
"unit": "µg/ml"
}
]
}
```
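When only the coefficient of variation is reported (the `cv` column above), the standard deviation can be recovered as `sd = cv * mean` per time point, assuming `cv` is stored as a fraction rather than a percentage. A small sketch with made-up values:

```python
# Recover sd from mean and coefficient of variation (cv), per time point.
# Assumes cv is a dimensionless fraction (not a percentage).
mean = [10.0, 8.0, 4.0]    # e.g. concentration in µg/ml
cv   = [0.1, 0.25, 0.5]    # coefficient of variation

sd = [m * c for m, c in zip(mean, cv)]
print(sd)  # [1.0, 2.0, 2.0]
```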


## Units for curation
Pint units:
https://github.com/hgrecco/pint/blob/master/pint/default_en_0.6.txt
Units are crucial to make sense of the data set, so many `characteristica`, `interventions` and `outputs/timecourses`
require unit information.
We are using a pint unit model. If units are missing or incorrect, the `watch_study` script reports the
respective issues.
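The kind of check performed on upload can be pictured with the following minimal sketch (the allowed-units table is an illustrative assumption; the real validation parses units with the pint registry rather than a lookup table):

```python
# Illustrative unit check (an assumption, not PK-DB's actual code, which
# validates units with the pint library).
ALLOWED_UNITS = {
    "concentration": {"µg/ml", "ng/ml", "mmol/l"},
    "auc_inf": {"µg*hr/ml"},
    "thalf": {"hr", "min"},
}

def check_unit(measurement_type, unit):
    """Return None if the unit is acceptable, otherwise an error message."""
    allowed = ALLOWED_UNITS.get(measurement_type, set())
    if unit not in allowed:
        return (f"invalid unit '{unit}' for '{measurement_type}', "
                f"allowed: {sorted(allowed)}")
    return None

print(check_unit("thalf", "hr"))   # valid -> None
print(check_unit("thalf", "kg"))   # invalid -> error message
```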


## Open issues
- individual set
- time course data
- column data

## Frequently asked questions (FAQ)
# How to encode multiple substances which are given (e.g., caffeine and acetaminophen are given)?
# Frequently asked questions (FAQ)
## How to encode multiple substances which are given (e.g., caffeine and acetaminophen are given)?
- encode as individual interventions of caffeine and acetaminophen, i.e., single interventions
with the respective doses. In the outputs a list of interventions is provided, i.e., in this example the interventions for caffeine and acetaminophen.

23 changes: 18 additions & 5 deletions README.md
@@ -90,16 +90,30 @@ or to run migrations
```bash
docker-compose run --rm backend python manage.py makemigrations
```
## Authentication data dump
```bash

docker-compose -f $PKDB_DOCKER_COMPOSE_YAML run --rm backend ./manage.py dumpdata auth --indent 2 > ./backend/pkdb_app/fixtures/auth.json
docker-compose -f $PKDB_DOCKER_COMPOSE_YAML run --rm backend ./manage.py dumpdata users --indent 2 > ./backend/pkdb_app/fixtures/users.json
docker-compose -f $PKDB_DOCKER_COMPOSE_YAML run --rm backend ./manage.py dumpdata rest_email_auth --indent 2 > ./backend/pkdb_app/fixtures/rest_email_auth.json
```

## Authentication data load
```bash

docker-compose -f $PKDB_DOCKER_COMPOSE_YAML run --rm backend ./manage.py loaddata auth pkdb_app/fixtures/auth.json
docker-compose -f $PKDB_DOCKER_COMPOSE_YAML run --rm backend ./manage.py loaddata users pkdb_app/fixtures/users.json
docker-compose -f $PKDB_DOCKER_COMPOSE_YAML run --rm backend ./manage.py loaddata rest_email_auth pkdb_app/fixtures/rest_email_auth.json
```
## REST services
PKDB provides a REST API which allows simple interaction with the database.
An overview of the REST endpoints is available from
```
http://localhost:8000/api/v1/
http://localhost:8123/api/v1/
```

Elastic Search engine is running on `localhost:9200` but is also reachable via django views.
General examples can be found here: https://django-elasticsearch-dsl-drf.readthedocs.io/en/0.16.2/basic_usage_examples.html
The REST API supports Elasticsearch queries, with examples
available from https://django-elasticsearch-dsl-drf.readthedocs.io/en/0.16.2/basic_usage_examples.html

Query examples:
```
@@ -120,6 +134,5 @@ http://localhost:8000/api/v1/substances_elastic/suggest/?search:name=cod

## Fill database
The database is filled using the `pkdb_data` repository at https://github.com/matthiaskoenig/pkdb_data

## Read

© 2017-2019 Jan Grzegorzewski & Matthias König.
2 changes: 1 addition & 1 deletion backend/pkdb_app/_version.py
@@ -1,4 +1,4 @@
"""
Definition of version string.
"""
__version__ = "0.5.2"
__version__ = "0.6.0"
