1. Preprocessing the Data
Here, we discuss how to preprocess the data for the following image formats: .npy, .svs, .tiff, .tif, .vms, .vmu, .ndpi, .scn, .mrxs, .svslide, .bif, .jpeg, .png.
Any WSIs that you have must be placed in the same directory. PathFlowAI will search this directory for these image types and run its preprocessing pipeline individually on each image.
Suppose we set up a directory inputs/ and place the image A01.npy into it. For a classification task, it must be accompanied by an .xml file with the same basename, A01, in the same directory. The XML file must be an annotation file exported from an annotation suite such as ASAP, QuPath, etc. For a segmentation task, replace the XML file with a numpy file (named [basename]_mask.npy) of the same size as the image (make sure the sizes agree!) containing a segmentation mask; for now, the background must be labeled 0, and the remaining components must be numbered consecutively from there (1, 2, ... etc.). In the near future, an accompanying annotation file will not be required to run the pipeline. A quick sanity check for the mask is sketched after the file listing below.
Depending on the task, the inputs/ directory would contain:
A01.npy
A01_mask.npy if segmentation
A01.xml if classification
Repeat this for the other images.
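Before preprocessing, it can help to confirm that a segmentation mask actually matches its image. The following is a minimal sketch (not part of PathFlowAI) using numpy; it assumes A01.npy stores the image as an (H, W, 3) array and A01_mask.npy stores integer labels of shape (H, W):

import numpy as np

img = np.load("inputs/A01.npy")        # image array, assumed (H, W, 3)
mask = np.load("inputs/A01_mask.npy")  # segmentation mask, assumed (H, W)

# Height and width must agree between image and mask
assert img.shape[:2] == mask.shape[:2], "image and mask sizes disagree"

# Labels should be consecutive integers starting at 0 (background = 0)
labels = np.unique(mask)
assert labels.min() == 0, "background must be labeled 0"
assert np.array_equal(labels, np.arange(labels.max() + 1)), "labels must be 0, 1, 2, ..."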
Once this is done, the preprocessing pipeline can be run.
pathflowai-preprocess preprocess_pipeline -odb patch_information.db --preprocess --patches --basename A01 --input_dir inputs/ --patch_size 256 --intensity_threshold 45. -tc 7 -t 0.05
This searches the inputs/ directory for files beginning with A01, autodetects the image file, and then decides whether this is a classification or segmentation task based on whether an .xml file or a _mask.npy file was found.
First, let's suppose this is a segmentation task, where we want the patches to be 256x256, with seven classes to predict on the mask (labels 0-6). The --preprocess flag is essential when running the preprocessing for the first time; it converts the image into the ZARR format (see paper). Once the image is stored in ZARR format, you can delete the original. The --patches option then stores the patches of the image in a SQL database specified by -odb (newly created if the file does not exist, and appended to with every new patch size and image specified). The RGB pixels of each patch are converted to grayscale, and a patch is analyzed and stored in the SQL database only if at least a proportion -t of its pixels have a grayscale intensity greater than --intensity_threshold (this criterion will change soon!). The area of each annotation in the patch, as a fraction of the patch, is stored in addition to the patch's positional information. All of this information is dumped to the SQL database, in a table named for the patch size, as such:
sqlite> .headers on
sqlite> select * from "256" limit 5;
index|ID|x|y|patch_size|annotation|0|1|2|3|4|5|6
0|A01|0|0|256|0|0.777984619140625|0.0207977294921875|0.001129150390625|0.00091552734375|0.0|0.0506591796875|0.148513793945312
1|A01|0|256|256|0|0.647369384765625|0.0170745849609375|0.0|0.001190185546875|0.0|0.191436767578125|0.142929077148437
2|A01|0|512|256|0|0.912002563476563|0.0428466796875|0.0017852783203125|0.006805419921875|0.0008697509765625|0.0187530517578125|0.016937255859375
3|A01|0|768|256|0|0.762771606445312|0.0353546142578125|0.0013427734375|0.007171630859375|0.0|0.166168212890625|0.027191162109375
4|A01|0|1024|256|0|0.79571533203125|0.0261383056640625|0.0037689208984375|0.00384521484375|0.0|0.020599365234375|0.149932861328125
where annotation is the annotation that has the largest area in the patch, and the columns 0-6 store the fractional area of each class within the patch.
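For intuition, the patch inclusion criterion described above amounts to something like the following sketch (an approximation for illustration, not PathFlowAI's exact code); it assumes the patch is an (N, N, 3) uint8 RGB array:

import numpy as np

def keep_patch(patch_rgb, intensity_threshold=45.0, proportion=0.05):
    # Rough sketch of the -t / --intensity_threshold filter described above
    gray = patch_rgb.astype(float).mean(axis=-1)  # simple grayscale for illustration
    # Keep the patch if at least `proportion` of its pixels exceed the threshold
    return (gray > intensity_threshold).mean() >= proportion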
Let's suppose we'd like to store another set of patches in the SQL database, but with a different patch size. One only needs to specify:
pathflowai-preprocess preprocess_pipeline -odb patch_information.db --patches --basename A01 --input_dir inputs/ --patch_size 512 --intensity_threshold 45. -tc 7 -t 0.05
Remove the --preprocess flag, since the ZARR file has already been saved, and adjust the patch size accordingly. This adds the patch data to a new SQL table, "512", in the same SQL database, at no additional storage cost, and is processed in about a minute.
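You can confirm that each patch size lives in its own table by listing the tables in the database, for example with Python's built-in sqlite3 module (a small sketch, assuming the database file above):

import sqlite3

conn = sqlite3.connect("patch_information.db")
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'"
).fetchall()
print(tables)  # expect one table per patch size, e.g. [('256',), ('512',)]
conn.close()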
If instead we wanted to perform a classification task with the XML file, we'd specify the annotations (-a) to query (they must exist in the XML) rather than supply a number of target classes (-tc).
pathflowai-preprocess preprocess_pipeline -odb patch_information.db --preprocess --patches --basename A01 --input_dir inputs/ --patch_size 224 --intensity_threshold 45. -t 0.05 -a background -a portal -a parenchyma
I'd like to note here that classification can actually also be done using the _mask.npy information, as long as the model is trained from the SQL schema (so supplying a _mask.npy file enables both classification and segmentation tasks), but the XML files can only be used for classification for now. The SQL database would look the same, except that annotations can now overlap, so the areas can sum to more or less than 1, and the annotation column names reflect the annotation names supplied. In the end, the SQL looks and feels the same whether it comes from a classification or segmentation setup; the SQL by itself is enough to train a model for classification tasks, but a _mask.npy is additionally needed for the segmentation mask.
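Whichever task you choose, the resulting database can be inspected the same way. For example, here is a sketch using pandas to pull the patch table back into Python (the table name matches the patch size used above, and the column names will be either the numbered classes or the annotation names you supplied):

import sqlite3
import pandas as pd

conn = sqlite3.connect("patch_information.db")
# Table names are the patch sizes; quote them since they are numeric
df = pd.read_sql('SELECT * FROM "224"', conn)
conn.close()

print(df.columns.tolist())  # ID, x, y, patch_size, annotation, plus one area column per class
print(df.head())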
All that is needed when running the deep learning analytics is the SQL database and the ZARR, and optionally the segmentation mask NPY. A byproduct of the process is a set of PKL files stored in the same directory, named [basename]_mask.pkl, which normally store shape information; these can be thrown out.
If you are looking to preprocess all of the images at once, even for different patch sizes, consider writing a for loop or deploying a set of jobs to a high performance computing scheduler, one for each basename (WSI ID). I have a PBS/Torque job submission system in my repositories, but consider something like this:
for base in A01 A02 A03; do nohup pathflowai-preprocess preprocess_pipeline -odb patch_information.db --preprocess --patches --basename $base --input_dir inputs/ --patch_size 224 --intensity_threshold 45. -t 0.05 -a background -a portal -a parenchyma & done
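If you prefer Python to a bash loop, a small driver script can discover basenames in inputs/ and launch one preprocessing run per WSI. This is only a sketch: swap in the flags for your task, and note that it runs the jobs sequentially unless you wrap each command in your scheduler's submission command (e.g. qsub or sbatch) to run them in parallel.

import glob
import os
import subprocess

# Collect basenames from the .npy images in inputs/ (skip the mask files)
basenames = sorted(
    os.path.splitext(os.path.basename(p))[0]
    for p in glob.glob("inputs/*.npy")
    if not p.endswith("_mask.npy")
)

for base in basenames:
    cmd = [
        "pathflowai-preprocess", "preprocess_pipeline",
        "-odb", "patch_information.db",
        "--preprocess", "--patches",
        "--basename", base,
        "--input_dir", "inputs/",
        "--patch_size", "224",
        "--intensity_threshold", "45.",
        "-t", "0.05",
        "-a", "background", "-a", "portal", "-a", "parenchyma",
    ]
    subprocess.run(cmd, check=True)  # or wrap `cmd` in your scheduler's submit command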
Workflow systems like Luigi (https://github.com/spotify/luigi), NextFlow (https://github.com/nextflow-io/nextflow), Airflow (https://airflow.apache.org/), Toil (https://toil.readthedocs.io/en/latest/) or CWL (https://github.com/common-workflow-language/common-workflow-language) are worth exploring for pipeline deployment, and may be integrated with PathFlowAI in the near future.
Although I have not personally tried job submission using tools such as https://github.com/lh3/asub, they're worth a try. If you'd like me to add a quick job submission script to run preprocessing in bulk, I am happy to supply my commands or write a custom script for this workflow to enable deployment of Torque/Slurm/SGE jobs via a bash for loop or job array. The main goal is to run all of this WSI preprocessing in parallel using some form of job submission, and I am happy to help out with this task.
Now that our data is preprocessed, we can run a classification or segmentation task.