Skip to content

Commit

Permalink
README now points to pylidc.github.io documentation. Other updates an…
Browse files Browse the repository at this point in the history
…d modifications to account for recent property changes.
  • Loading branch information
notmatthancock committed Mar 29, 2018
1 parent 9adbcbf commit d76cfab
Show file tree
Hide file tree
Showing 7 changed files with 608 additions and 651 deletions.
264 changes: 3 additions & 261 deletions README.md
Original file line number Diff line number Diff line change
@@ -1,263 +1,5 @@
`pylidc` is a python library intended to improve workflow associated with the [LIDC dataset](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI).

# pylidc
Install via pip: `pip install pylidc`.

`pylidc` is a python library intended to enhance workflow associated with the [LIDC dataset](https://wiki.cancerimagingarchive.net/display/Public/LIDC-IDRI), including utilities for both querying by attributes (e.g., collecting all annotations where malignancy is labeled as greater than 3 and spiculation is labeled a value equal to 1), and for common functional routines that act on the associated data (e.g., estimating the diameter or volume of a nodule from one of its annotations).

Routines for visualizing the annotations, both atop the CT data and as a surface in 3D, are implemented. These functionalities are implemented via an object relational mapping (ORM), using `sqlalchemy` to an sqlite database containing the annotation information from the XML files provided by the LIDC dataset.

![](https://raw.githubusercontent.com/pylidc/pylidc/master/img/viz-in-scan-example.png)
![](https://raw.githubusercontent.com/pylidc/pylidc/master/img/viz-in-3d-example.png)

Table of Contents
=================

* [Installation and setup](#installation-and-setup)
* [Dicom file directory configuration](#dicom-file-directory-configuration)
* [Citing](#citing)
* [Example usage](#example-usage)
* [Basic examples](#basic-examples)
* [The Scan model](#the-scan-model)
* [The Annotation model](#the-annotation-model)
* [Advanced queries](#advanced-queries)
* [Get a random result](#get-a-random-result)
* [Query multiple model parameters with a join](#query-multiple-model-parameters-with-a-join)
* [Resampling the volumes](#resampling-the-volumes)
* [Clustering annotations](#clustering-annotations-to-identify-those-which-refer-to-the-same-physical-nodule)


## Installation and setup

`pylidc` has been used on Linux, Mac, and Windows, and on Python 2 and 3.

The package can be installed via `pip`:

pip install pylidc

### Dicom file directory configuration

While `pylidc` has many functions for analyzing and querying only annotation data (which do not require DICOM image data access), `pylidc` also has many functions that do require access to the DICOM files associated with the LIDC dataset. `pylidc` looks for a special configuration file that tells it where DICOM data is located on your system. You can use `pylidc` without creating this configuration file, but of course, any functions that depend on CT image data will not be usable.

`pylidc` looks in your home folder for a configuration file called, `.pylidcrc` on Mac and Linux, or `pylidc.conf` on Windows. You must create this file. On Linux and Mac, the file should be located at `/home/[user]/.pylidcrc`. On Windows, the file should be located at `C:\Users\[User]\pylidc.conf`.

The configuration file should be formatted as follows:

[dicom]
path = /path/to/big_external_drive/datasets/LIDC-IDRI
warn = True

If you want to use `pylidc` without utilizing the DICOM data (for say, querying annotation attributes, etc.), you can remove `path` and set `warn` to `False`, i.e.,

[dicom]
warn = False

and the module won't bother you about it each time you import the module.

The expected folder hierarchy in the specified `path` is: `PatientID` > `StudyInstanceUID` > `SeriesInstanceUID` > `*.dcm`. If you downloaded the data from the TCIA site, the folder hierarchy will (probably!) already be formatted in this way.

## Citing

If you find `pylidc` helpful to your research, you could cite by noting that software was developed for, and first mentioned in, the following publication:

> Matthew C. Hancock, Jerry F. Magnan. **Lung nodule malignancy classification using only radiologist quantified image features as inputs to statistical learning algorithms: probing the Lung Image Database Consortium dataset with two statistical learning methods**. *SPIE Journal of Medical Imaging*. Dec. 2016. [http://dx.doi.org/10.1117/1.JMI.3.4.044504](http://dx.doi.org/10.1117/1.JMI.3.4.044504)
## Example usage

### Basic examples

There are three data models: `Scan`, `Annotation`, and `Contour`. The relationships are "one to many" for each model going left to right, i.e., `Scan`'s have many `Annotation`'s, and `Annotation`'s have many `Contour`'s. The main models to query are the `Scan` and `Annotation` models.

The main workhorse for querying is the `pylidc.query` function. This function just wraps the `sqlalchemy.query` function.

#### The `Scan` model

Here's some example usage for querying scan objects.

import pylidc as pl

qu = pl.query(pl.Scan).filter(pl.Scan.slice_thickness <= 1)
print(qu.count())
# => 97

scan = qu.first()
print(scan.patient_id, scan.pixel_spacing, scan.slice_thickness)
# => LIDC-IDRI-0066, 0.63671875, 0.6

print(len(scan.annotations))
# => 11

print(scan.get_path_to_dicom_files())
# '/path/to/big_external_drive/datasets/LIDC-IDRI/LIDC-IDRI-0066/1.3.6.1.4.1.14519.5.2.1.6279.6001.143774983852765282237869625332/1.3.6.1.4.1.14519.5.2.1.6279.6001.430109407146633213496148200410'

You can engage an interactive slice view by calling:

scan.visualize()

By default, note that calling `visualize` on a scan object doesn't include its annotation information. However, if `scan.visualize` is supplied with a list of `Annotation` objects (say, grouped by the `scan.cluster_annotations()` function), then this annotation information *will* be displayed. For example

nodules = scan.cluster_annotations()
print(len(nodules))
# => 3

# Visualize the slices with the provided annotations indicated by arrows.
scan.visualize(annotations)

`scan.cluster_annotations` returns a list. Each element of the list is a list of `Annotation` objects where each `Annotation` refers to the same nodule in the scan (probably). See [Clustering annotations](#clustering-annotations-to-identify-those-which-refer-to-the-same-physical-nodule) for more details on this function.


#### The `Annotation` model

Let's grab the first annotation from the `Scan` object above:

ann = scan.annotations[0]
print(ann.scan.patient_id)
# => LIDC-IDRI-0066

print(ann.spiculation, ann.Spiculation)
# => 3, Medium Spiculation

print("%.2f, %.2f, %.2f" % (ann.diameter, ann.surface_area, ann.volume))
# => 15.49, 1041.37, 888.05

ann.print_formatted_feature_table()
# => Feature Meaning #
# => - - -
# => Subtlety | Obvious | 5
# => Internalstructure | Soft Tissue | 1
# => Calcification | Absent | 6
# => Sphericity | Ovoid | 3
# => Margin | Poorly Defined | 1
# => Lobulation | Near Marked Lobulation | 4
# => Spiculation | Medium Spiculation | 3
# => Texture | Solid | 5
# => Malignancy | Moderately Suspicious | 4

fvals, fstrings = ann.feature_vals(return_str=True)

for fname,fval,fstr in zip(pl.annotation_feature_names, fvals, fstrings):
print(fname.title(), fval, fstr)
# => 'Subtlety', 5, 'Obvious'
# => 'Internalstructure', 1, 'Soft Tissue'
# => 'Calcification', 6, 'Absent'
# => 'Sphericity', 3, 'Ovoid'
# => 'Margin', 1, 'Poorly Defined'
# => 'Lobulation', 4, 'Near Marked Lobulation'
# => 'Spiculation', 3, 'Medium Spiculation'
# => 'Texture', 5, 'Solid'
# => 'Malignancy', 4, 'Moderately Suspicious'

Let's try a different query on the annotations directly:

qu = pl.query(pl.Annotation).filter(pl.Annotation.lobulation > 3,
pl.Annotation.malignancy == 5)
print(qu.count())
# => 183

ann = qu.first()
print(ann.lobulation, ann.Lobulation, ann.malignancy, ann.Malignancy)
# => 4, Near Marked Lobulation, 5, Highly Suspicious

print(len(ann.contours))
# => 8

print(ann.contours_to_matrix().shape)
# => (671, 3)

print(ann.contours_to_matrix().mean(axis=0) - ann.centroid)
# => [ 0. 0. 0.]

You can engage an interactive slice viewer that displays annotation values and the radiologist-drawn contours:

ann.visualize_in_scan()

You can also view the nodule contours in 3d by calling:

ann.visualize_in_3d()

### Advanced queries

#### Get a random result

One common objective for data exploration is to grab a random instance from some query. You can accomplish this by import `func` from `sqlalchemy`, and using `random`.

from sqlalchemy import func
scan = pl.query(pl.Scan).filter(pl.Scan.contrast_used == True).order_by(func.random()).first()
ann = pl.query(pl.Annotation).filter(pl.Annotation.malignancy == 5).order_by(func.random()).first()

The first query grabs a random `Scan` instance where contrast is used. The second query grabs a random `Annotation` instance where malignancy is equal to 5.

#### Query multiple model parameters with a join

Another common objective is to query for an `Annotation` object which is constrained by its corresponding `Scan` in some way. For example:

anns = pl.query(pl.Annotation).join(pl.Scan).filter(pl.Scan.slice_thickness < 1, pl.Annotation.malignancy != 3)

### Resampling the volumes

The `Annotation` member function, `uniform_cubic_resample`, takes a cubic region of interest with the centroid at the center of the volume. The corresponding CT value volume is resampled to have voxel spacing of 1 millimeter and a side length as given by the functions `side_length` parameter. Along with the uniformly resampled, cubic CT image volume, a corresponding boolean-valued volume is also returned that is 1 where the nodule exists in the resampled CT volume and 0 otherwise.

import pylidc as pl
import matplotlib.pyplot as plt
from skimage.measure import find_contours

ann = pl.query(pl.Annotation).first()
vol, seg = ann.uniform_cubic_resample(side_length = 100)
print(vol.shape, seg.shape)
# => (101, 101, 101) (101, 101, 101)

# View middle slice of interpolated volume (pixel spacing now = 1mm)
plt.imshow(vol[:,:,50], cmap=plt.cm.gray)

# View middle slice of interpolated segmentation volume as contours
# atop the interpolated image.
contours = find_contours(seg[:,:,50], 0.5)
for contour in contours:
plt.plot(contour[:,1], contour[:,0], '-r')

plt.show()

![](https://raw.githubusercontent.com/pylidc/pylidc/master/img/resample-example.png)

### Clustering annotations to identify those which refer to the same physical nodule.

The LIDC dataset doesn't assign unique global identifiers to the physical nodules. For a given physical nodule, there may exist up to 4 annotations that refer to it. The annotations are anonymous, so even if it is known that 4 annotations refer to the same nodule, it is impossible to tell which annotator provided each annotation across multiple nodules consistently.

However, we can estimate when annotations refer to the same physical nodule in a scan by examining the properties of the annotations and clustering them based on the properties. `pylidc` provides [a number of distance metrics between annotations](https://github.com/pylidc/pylidc/pylidc/annotation_distance_metrics.py) based on the annotation contour coordinates. The `Scan` model provides a [`cluster_annotations`](https://github.com/pylidc/pylidc/pylidc/Scan.py) function which then clusters annotations by determining the connected components of the adjacency graph associated with a chosen distance-between-annotations metric and a chosen distance tolerance.

Here's an example:

```
import pylidc as pl
scan = pl.query(pl.Scan).first()
nods = scan.cluster_annotations()
print("Scan is estimated to have", len(nods), "nodules.")
for i,nod in enumerate(nods):
print("Nodule", i+1, "has", len(nod), "annotations.")
for j,ann in enumerate(nod):
print("-- Annotation", j+1, "centroid:", ann.centroid)
```

Output:
```
Scan is estimated to have 4 nodules.
Nodule 1 has 4 annotations.
-- Annotation 1 centroid: [ 331.90680101 312.30982368 1480.44962217]
-- Annotation 2 centroid: [ 328.60546875 309.91796875 1479.73046875]
-- Annotation 3 centroid: [ 327.91666667 309.88293651 1479.01785714]
-- Annotation 4 centroid: [ 332.55660377 313.88050314 1479.94339623]
Nodule 2 has 4 annotations.
-- Annotation 1 centroid: [ 360.81122449 169.19642857 1542.10459184]
-- Annotation 2 centroid: [ 360.82233503 169.21319797 1542.14720812]
-- Annotation 3 centroid: [ 361.05243446 168.86142322 1542.34269663]
-- Annotation 4 centroid: [ 361.25501433 171. 1542.80659026]
Nodule 3 has 1 annotations.
-- Annotation 1 centroid: [ 336.41666667 348.83333333 1545.75 ]
Nodule 4 has 4 annotations.
-- Annotation 1 centroid: [ 340.54020979 245.07692308 1606.14160839]
-- Annotation 2 centroid: [ 341.29061103 244.65275708 1605.90834575]
-- Annotation 3 centroid: [ 341.75417299 244.03490137 1606.95827011]
-- Annotation 4 centroid: [ 341.53110048 245.58532695 1606.5 ]
```
You can supply annotation clusters (variable, `nods`, above) to the `scan.visualize` function, and arrows will annotate where the nodules are present in the scan.
[See the full documentation and tutorials here.](https://pylidc.github.io)
Loading

0 comments on commit d76cfab

Please sign in to comment.