In the post "Assign cluster name based on cluster size" (April 7, 2022), user mcmc observes that DataWarrior labels clusters in the sequence of their creation. mcmc suggests it would be beneficial if the cluster labels reflected cluster population instead: the more molecules a cluster contains, the lower its assigned label.
The script runs from the command line and requires an installation of Python 3. All functionality is provided by the standard library; there are no additional dependencies:
datawarrior_clustersort.py [-h] [-r] file
DataWarrior's clustering result (Chemistry -> cluster compounds/reactions), exported as a text file (File -> Save Special), is read as input (argument file). The script identifies the column with DataWarrior's cluster labels by searching for the column header Cluster No, which DataWarrior assigns by default. A new record file is written in which the entries are sorted so that the most populous cluster and its entries come first, followed by the less populous clusters. To reflect this sequence, which is now based on the number of entries per cluster, the cluster labels are newly assigned.
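The steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the script's actual code: the function name relabel_by_population is hypothetical, and the sketch assumes a tab-separated export, integer cluster labels, and a tie-break that puts the lower original label first when two clusters are equally populous.

```python
from collections import Counter

HEADER = "Cluster No"  # column header DataWarrior assigns by default


def relabel_by_population(lines, reverse=False):
    """Sort records by cluster population and assign new cluster labels.

    `lines` is an iterable of tab-separated text lines, header first
    (assumed layout of the exported text file).
    """
    table = [line.rstrip("\n").split("\t") for line in lines if line.strip()]
    header, records = table[0], table[1:]
    col = header.index(HEADER)  # raises ValueError if the column is missing

    # Count the entries per original cluster label.
    counts = Counter(rec[col] for rec in records)

    # Most populous first (or least populous first if reverse is set);
    # ties are broken by the original label (an assumption of this sketch).
    sign = 1 if reverse else -1
    ordered = sorted(counts, key=lambda old: (sign * counts[old], int(old)))
    new_label = {old: str(new) for new, old in enumerate(ordered, start=1)}

    # Replace the labels, then sort the records by their new label.
    relabeled = [rec[:col] + [new_label[rec[col]]] + rec[col + 1:] for rec in records]
    relabeled.sort(key=lambda rec: int(rec[col]))
    return ["\t".join(rec) for rec in [header] + relabeled]


demo = ["Name\tCluster No", "a\t1", "b\t2", "c\t2", "d\t3", "e\t3", "f\t3"]
for line in relabel_by_population(demo):
    print(line)
```

In the demo data, the original cluster 3 holds three entries and therefore becomes cluster 1, while the single-entry cluster 1 becomes cluster 3.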
The sort can be reversed with the optional flag --reverse, or its short form -r. The script then reports the cluster with the fewest entries first.
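A command line interface of this shape can be set up with argparse from the standard library. The sketch below is an assumption about how such an interface might look, not a copy of the script: the helper names build_parser and output_name are hypothetical, and the help texts are invented for illustration. The naming helper merely mirrors the _sort suffix visible in the example file names of this README.

```python
import argparse
from pathlib import Path


def build_parser():
    """Build a parser with one positional argument and an optional flag."""
    parser = argparse.ArgumentParser(
        prog="datawarrior_clustersort.py",
        description="Sort DataWarrior clusters by population and relabel them.",
    )
    parser.add_argument("file", help="text file exported by DataWarrior")
    parser.add_argument(
        "-r",
        "--reverse",
        action="store_true",  # False unless the flag is given
        help="report the least populous cluster first",
    )
    return parser


def output_name(path):
    """Derive e.g. 'data_sort.txt' from 'data.txt' (assumed naming scheme)."""
    path = Path(path)
    return path.with_name(f"{path.stem}_sort{path.suffix}")


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["--reverse", "100Random_Molecules.txt"])
print(args.file, args.reverse)
print(output_name(args.file))
```

Because -h is added by argparse automatically, a parser defined this way should report a usage line of the same shape as the synopsis shown earlier.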
A library of 100 random drug-like molecules was generated by DataWarrior and clustered at a low similarity threshold (Structure FragFp 0.4, file 100Random_Molecules.dwar). The export (file 100Random_Molecules.txt) was processed by
python3 datawarrior_clustersort.py 100Random_Molecules.txt
to yield 100Random_Molecules_sort.txt as the newly assigned set. As a preview, the script briefly reports the distribution before and after the sort to the CLI:
DataWarrior's assignment of clusters:
cluster: 1 molecules: 11
cluster: 2 molecules: 28
cluster: 3 molecules: 29
cluster: 4 molecules: 14
cluster: 5 molecules: 8
cluster: 6 molecules: 2
cluster: 7 molecules: 3
cluster: 8 molecules: 2
cluster: 9 molecules: 1
cluster: 10 molecules: 1
cluster: 11 molecules: 1
clusters newly sorted and labeled:
cluster: 1 molecules: 29
cluster: 2 molecules: 28
cluster: 3 molecules: 14
cluster: 4 molecules: 11
cluster: 5 molecules: 8
cluster: 6 molecules: 3
cluster: 7 molecules: 2
cluster: 8 molecules: 2
cluster: 9 molecules: 1
cluster: 10 molecules: 1
cluster: 11 molecules: 1
A running instance of DataWarrior was able to read the newly written file 100Random_Molecules_sort.txt both via File -> Open and via the shortcut Ctrl + O.
tree -a -L2 -I .git
.
├── datawarrior_clustersort.py
├── .gitignore
├── LICENSE
├── .pre-commit-config.yaml
├── README.html
├── README.md
├── README.org
├── requirements-dev.txt
└── test_data
    ├── 100Random_Molecules.dwar
    ├── 100Random_Molecules_sort.txt
    └── 100Random_Molecules.txt
2 directories, 11 files