context/motivation

In the post "Assign cluster name based on cluster size" (April 7, 2022), user mcmc observes that DataWarrior labels clusters in the sequence of their creation. mcmc suggests it would be beneficial if the cluster labels reflected cluster size instead, i.e., the more molecules a cluster contains, the lower its assigned label.

use case

The script runs from the command line and requires an installation of Python 3. All functionality is provided by the standard library; there are no additional dependencies:

datawarrior_clustersort.py [-h] [-r] file

The result of a clustering in DataWarrior (Chemistry -> cluster compounds/reactions), exported as a text file (File -> Save Special), is read as input (file). The script identifies the column with DataWarrior’s cluster labels by searching for the column header Cluster No, which DataWarrior assigns by default. A new file is written in which the entries are sorted so that the most populous cluster and its entries come first, followed by the less populous clusters. The cluster labels are newly assigned to reflect this sequence, which is now based on the number of entries per cluster.
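The relabeling described above can be sketched in a few lines of Python. This is a minimal illustration, not the code of datawarrior_clustersort.py: the function name relabel_by_size, the sample data, and the assumption that the export is tab-separated are all hypothetical. Clusters of equal size keep the order in which they first appear in the file.

```python
import csv
import io
from collections import Counter


def relabel_by_size(rows, column="Cluster No", reverse=False):
    """Sort rows by cluster population and assign new labels 1, 2, 3, ...

    Hypothetical helper for illustration only. By default the most
    populous cluster gets label 1; reverse=True starts with the smallest.
    """
    counts = Counter(row[column] for row in rows)
    # sorted() is stable, so clusters of equal size keep first-seen order.
    ordered = sorted(counts, key=counts.get, reverse=not reverse)
    new_label = {old: new for new, old in enumerate(ordered, start=1)}
    relabeled = [{**row, column: str(new_label[row[column]])} for row in rows]
    relabeled.sort(key=lambda row: int(row[column]))
    return relabeled


# Tiny made-up table; a tab-separated layout is an assumption here.
sample = "Name\tCluster No\nA\t1\nB\t2\nC\t2\nD\t3\nE\t3\nF\t3\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))

for row in relabel_by_size(rows):
    print(row["Name"], row["Cluster No"])
```

In this sample the former cluster 3 (three entries) becomes cluster 1, and the former cluster 1 (one entry) becomes cluster 3.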

The sort can be reversed with the optional flag --reverse or its short form -r. The script then reports the cluster with the fewest entries first.
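A command-line interface of this shape could be declared with argparse from the standard library. The following is only a sketch under assumptions: the function names build_parser and output_path are hypothetical, the help texts are invented, and the _sort suffix merely mirrors the file name used in the test case.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare one positional input file and an optional reverse flag."""
    parser = argparse.ArgumentParser(
        prog="datawarrior_clustersort.py",
        description="Relabel DataWarrior clusters by their population.",
    )
    parser.add_argument("file", help="text file exported by DataWarrior")
    parser.add_argument(
        "-r",
        "--reverse",
        action="store_true",
        help="report the least populous cluster first",
    )
    return parser


def output_path(input_file):
    """Derive the output name, e.g. data.txt -> data_sort.txt (assumed scheme)."""
    path = Path(input_file)
    return path.with_name(f"{path.stem}_sort{path.suffix}")


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-r", "100Random_Molecules.txt"])
print(args.reverse, output_path(args.file))
```

Using action="store_true" keeps the default sort (most populous cluster first) when the flag is absent.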

test case

A library of 100 random drug-like molecules was generated by DataWarrior and clustered at a low similarity threshold (Structure FragFp 0.4, file 100Random_Molecules.dwar). The export (file 100Random_Molecules.txt) was processed by

python3 datawarrior_clustersort.py 100Random_Molecules.txt

to yield 100Random_Molecules_sort.txt as the newly assigned set. As a preview, the script prints a brief summary of the distribution before and after the sort to the CLI:

DataWarrior's assignment of clusters:
cluster:        1 molecules:       11
cluster:        2 molecules:       28
cluster:        3 molecules:       29
cluster:        4 molecules:       14
cluster:        5 molecules:        8
cluster:        6 molecules:        2
cluster:        7 molecules:        3
cluster:        8 molecules:        2
cluster:        9 molecules:        1
cluster:       10 molecules:        1
cluster:       11 molecules:        1

clusters newly sorted and labeled:
cluster:        1 molecules:       29
cluster:        2 molecules:       28
cluster:        3 molecules:       14
cluster:        4 molecules:       11
cluster:        5 molecules:        8
cluster:        6 molecules:        3
cluster:        7 molecules:        2
cluster:        8 molecules:        2
cluster:        9 molecules:        1
cluster:       10 molecules:        1
cluster:       11 molecules:        1
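The two listings can be cross-checked with a short calculation. The sketch below takes the counts from DataWarrior's assignment above, sorts them in descending order, and numbers them anew. The variable names and the print format are illustrative assumptions and not taken from the script.

```python
# Counts per cluster as assigned by DataWarrior (from the listing above).
before = {1: 11, 2: 28, 3: 29, 4: 14, 5: 8, 6: 2, 7: 3, 8: 2, 9: 1, 10: 1, 11: 1}

# Sort by population, largest first; equal sizes keep their original order.
by_size = sorted(before.items(), key=lambda item: item[1], reverse=True)

# Assign new labels 1, 2, 3, ... in that order.
after = {new: count for new, (_, count) in enumerate(by_size, start=1)}

for label, count in after.items():
    print(f"cluster: {label:>8} molecules: {count:>8}")

# The relabeling must not change the total number of molecules.
total = sum(after.values())
```

The sorted counts agree with the second listing, and the total remains at the 100 molecules of the library.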

A running instance of DataWarrior was able to read the newly written file 100Random_Molecules_sort.txt both via File -> Open and via the shortcut Ctrl + O.

content of the project

tree -a -L2 -I .git

.
├── datawarrior_clustersort.py
├── .gitignore
├── LICENSE
├── .pre-commit-config.yaml
├── README.html
├── README.md
├── README.org
├── requirements-dev.txt
└── test_data
    ├── 100Random_Molecules.dwar
    ├── 100Random_Molecules_sort.txt
    └── 100Random_Molecules.txt

2 directories, 11 files
