In the post "Assign cluster name based on cluster size" (April 7, 2022), user mcmc observes that DataWarrior labels clusters in the sequence of their creation. mcmc suggests it would be beneficial if the cluster labels reflected cluster population, e.g., the greater the number of molecules in a cluster, the lower the assigned label.
The script runs from the command line and requires an installation of Python 3. All functionality is provided by the standard library; there are no additional dependencies:
datawarrior_clustersort.py [-h] [-r] file
DataWarrior’s result of clustering (Chemistry -> cluster compounds/reactions), exported as a text file (File -> Save Special), is read as input (file). The script identifies the column with DataWarrior’s cluster labels by searching for the column header Cluster No, which is assigned by default. A new record file is written in which the entries are sorted to report the most populous cluster and its entries first, followed by the less populous clusters. To reflect this sequence, now based on the number of entries per cluster, the cluster labels are newly assigned.
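The steps described above (locate the Cluster No column, count entries per cluster, sort, relabel) can be sketched as follows. This is a minimal illustration and not the script's actual code: the function name `relabel_by_size` is hypothetical, the export is assumed to be tab-separated with the header on the first line, and the handling of ties (clusters of equal size keep the order of their old labels) is an assumption as well.

```python
from collections import Counter


def relabel_by_size(lines, header="Cluster No", reverse=False):
    """Sort records by cluster population and renumber the clusters.

    `lines` holds the header line followed by one record per line,
    without trailing newlines. Tab-separated fields are an assumption
    about the export format.
    """
    head = lines[0].split("\t")
    col = head.index(header)  # raises ValueError if the header is absent
    records = [line.split("\t") for line in lines[1:] if line.strip()]

    counts = Counter(row[col] for row in records)
    # Start from the old numeric order so that ties resolve predictably,
    # then sort by population; Python's sort is stable in both directions.
    ranked = sorted(counts, key=int)
    ranked.sort(key=lambda old: counts[old], reverse=not reverse)
    new_label = {old: str(i) for i, old in enumerate(ranked, start=1)}

    # Relabel first, then sort; entries keep their order within a cluster.
    for row in records:
        row[col] = new_label[row[col]]
    records.sort(key=lambda row: int(row[col]))

    return ["\t".join(row) for row in [head] + records]
```

Called with `reverse=True`, the same function assigns the lowest label to the least populous cluster.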
The sort can be reversed with the optional flag --reverse, or -r. The script then reports the cluster with the fewest entries first.
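A command line interface matching the usage line and the optional flag described above could be declared with `argparse` from the standard library. The sketch below is an assumption about how such an interface may look, not a copy of the script; the helper name `build_parser` and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser for: datawarrior_clustersort.py [-h] [-r] file"""
    parser = argparse.ArgumentParser(
        prog="datawarrior_clustersort.py",
        description="Sort DataWarrior clusters by their number of entries.",
    )
    # Positional argument: the text file exported by DataWarrior.
    parser.add_argument("file", help="DataWarrior export to process")
    # Optional switch: report the least populous cluster first.
    parser.add_argument(
        "-r",
        "--reverse",
        action="store_true",
        help="assign the lowest label to the least populous cluster",
    )
    return parser
```

For example, `build_parser().parse_args(["-r", "100Random_Molecules.txt"])` yields a namespace with `reverse` set to `True` and `file` set to the given name.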
A library of 100 random drug-like molecules was generated by DataWarrior and clustered at a low similarity threshold (Structure FragFp 0.4, file 100Random_Molecules.dwar). The export (file 100Random_Molecules.txt) was processed by
python3 datawarrior_clustersort.py 100Random_Molecules.txt
to yield 100Random_Molecules_sort.txt as the newly assigned set. As a preview, the script briefly reports the distribution before and after the sort to the CLI:
DataWarrior's assignment of clusters:
cluster: 1 molecules: 11
cluster: 2 molecules: 28
cluster: 3 molecules: 29
cluster: 4 molecules: 14
cluster: 5 molecules: 8
cluster: 6 molecules: 2
cluster: 7 molecules: 3
cluster: 8 molecules: 2
cluster: 9 molecules: 1
cluster: 10 molecules: 1
cluster: 11 molecules: 1

clusters newly sorted and labeled:
cluster: 1 molecules: 29
cluster: 2 molecules: 28
cluster: 3 molecules: 14
cluster: 4 molecules: 11
cluster: 5 molecules: 8
cluster: 6 molecules: 3
cluster: 7 molecules: 2
cluster: 8 molecules: 2
cluster: 9 molecules: 1
cluster: 10 molecules: 1
cluster: 11 molecules: 1
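A summary of this kind can be obtained by counting the cluster labels of all records. The following sketch assumes the labels are available as a list of strings; the function name `cluster_summary` is hypothetical, and the exact spacing of the script's own report may differ.

```python
from collections import Counter


def cluster_summary(labels):
    """Return one line per cluster, ordered by numeric cluster label."""
    counts = Counter(labels)
    return [
        f"cluster: {label} molecules: {counts[label]}"
        for label in sorted(counts, key=int)
    ]


# Example with three records in two clusters.
for line in cluster_summary(["2", "1", "2"]):
    print(line)
```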
A running instance of DataWarrior was able to read the newly written file 100Random_Molecules_sort.txt both by File -> Open and by the shortcut Ctrl + O.
tree -a -L2 -I .git
.
├── datawarrior_clustersort.py
├── .gitignore
├── LICENSE
├── .pre-commit-config.yaml
├── README.html
├── README.md
├── README.org
├── requirements-dev.txt
└── test_data
    ├── 100Random_Molecules.dwar
    ├── 100Random_Molecules_sort.txt
    └── 100Random_Molecules.txt
2 directories, 11 files