context/motivation

In the post "Assign cluster name based on cluster size" (April 7, 2022), user mcmc observes that DataWarrior labels clusters in the order of their creation. mcmc suggests it would be beneficial if the cluster labels reflected cluster size instead, e.g., the greater the number of molecules in a cluster, the lower the assigned label.

use case

The script runs from the command line and requires an installation of Python 3. All functionality is provided by the standard library; there are no additional dependencies:

datawarrior_clustersort.py [-h] [-r] file

DataWarrior's result of clustering (Chemistry -> cluster compounds/reactions), exported as a text file (File -> Save Special), is read as input (file). The script identifies the column with DataWarrior's cluster labels by searching for the column header Cluster No, which is assigned by default. A new record file is written in which the entries are sorted to report the most populous cluster and its entries first, followed by the less populous clusters. To reflect this sequence, now based on the number of entries per cluster, the cluster labels are newly assigned.
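
The sort and relabel step described above can be sketched as follows. This is a minimal illustration and not the script's actual code: the function name relabel_by_size, the tab delimiter, the tie-breaking by original label, and the sample data are all assumptions made for this example. Only the column header Cluster No is taken from the description above.

```python
import csv
import io
from collections import Counter


def relabel_by_size(rows, header="Cluster No", reverse=False):
    """Return rows sorted by cluster size, with labels reassigned 1, 2, 3, ..."""
    counts = Counter(row[header] for row in rows)
    # Largest cluster first (smallest first if reverse is set). Ties are
    # broken by the original numeric label; this is an assumption of the sketch.
    ranked = sorted(
        counts,
        key=lambda label: (counts[label] if reverse else -counts[label], int(label)),
    )
    new_label = {old: str(i) for i, old in enumerate(ranked, start=1)}
    relabeled = [{**row, header: new_label[row[header]]} for row in rows]
    # sorted() is stable, so entries keep their relative order within a cluster.
    return sorted(relabeled, key=lambda row: int(row[header]))


# Made-up, tab-separated stand-in for an exported file (delimiter assumed).
sample = "Structure\tCluster No\nA\t1\nB\t2\nC\t2\nD\t3\nE\t3\nF\t3\n"
rows = list(csv.DictReader(io.StringIO(sample), delimiter="\t"))
result = relabel_by_size(rows)
for row in result:
    print(row["Structure"], row["Cluster No"])
```

In this made-up sample, the former cluster 3 holds three entries and therefore becomes cluster 1, while the former cluster 1 with a single entry becomes cluster 3.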

It is possible to reverse the sort with the optional flag --reverse, or -r. The script then reports the cluster with the fewest entries first.
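
A command line interface of this shape can be put together with argparse from the standard library. The sketch below is an assumption about how such a parser might look, not a copy of the script's code; the function name build_parser and the help texts are hypothetical.

```python
import argparse


def build_parser():
    """Build a parser with one positional file and an optional reverse flag."""
    parser = argparse.ArgumentParser(
        prog="datawarrior_clustersort.py",
        description="Sort DataWarrior clusters by size and relabel them.",
    )
    parser.add_argument("file", help="text file exported by DataWarrior")
    parser.add_argument(
        "-r",
        "--reverse",
        action="store_true",
        help="report the least populous cluster first",
    )
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["-r", "100Random_Molecules.txt"])
print(args.file, args.reverse)
```

With store_true, the flag defaults to False, so the most populous cluster comes first unless -r or --reverse is given.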

test case

A library of 100 random drug-like molecules was generated by DataWarrior and clustered at a low similarity threshold (Structure FragFp 0.4, file 100Random_Molecules.dwar). The export (file 100Random_Molecules.txt) was processed by

python3 datawarrior_clustersort.py 100Random_Molecules.txt

to yield 100Random_Molecules_sort.txt as the newly labeled set. As a preview, the script prints the distribution of cluster sizes before and after the sort to the CLI:

DataWarrior's assignment of clusters:
cluster:        1 molecules:       11
cluster:        2 molecules:       28
cluster:        3 molecules:       29
cluster:        4 molecules:       14
cluster:        5 molecules:        8
cluster:        6 molecules:        2
cluster:        7 molecules:        3
cluster:        8 molecules:        2
cluster:        9 molecules:        1
cluster:       10 molecules:        1
cluster:       11 molecules:        1

clusters newly sorted and labeled:
cluster:        1 molecules:       29
cluster:        2 molecules:       28
cluster:        3 molecules:       14
cluster:        4 molecules:       11
cluster:        5 molecules:        8
cluster:        6 molecules:        3
cluster:        7 molecules:        2
cluster:        8 molecules:        2
cluster:        9 molecules:        1
cluster:       10 molecules:        1
cluster:       11 molecules:        1

A running instance of DataWarrior was able to read the newly written file 100Random_Molecules_sort.txt both by File -> Open and by the shortcut Ctrl + O.
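
The two listings above can be cross-checked with a few lines of Python. The cluster sizes are copied from the first listing; the tie-breaking rule (equally sized clusters keep their original relative order) is an assumption, since the printed sizes alone do not reveal how the script resolves ties.

```python
# Cluster sizes as reported for DataWarrior's original assignment.
old_sizes = {1: 11, 2: 28, 3: 29, 4: 14, 5: 8, 6: 2, 7: 3, 8: 2, 9: 1, 10: 1, 11: 1}

# Rank old labels by descending size; ties fall back to the old label (assumed).
ranked = sorted(old_sizes, key=lambda label: (-old_sizes[label], label))
old_to_new = {old: new for new, old in enumerate(ranked, start=1)}
new_sizes = [old_sizes[old] for old in ranked]

print(old_to_new)
print(new_sizes)
```

Under this assumption, the old cluster 3 (29 molecules) becomes the new cluster 1, and the resulting sequence of sizes agrees with the second listing.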

content of the project

tree -a -L2 -I .git

.
├── datawarrior_clustersort.py
├── .gitignore
├── LICENSE
├── .pre-commit-config.yaml
├── README.html
├── README.md
├── README.org
├── requirements-dev.txt
└── test_data
    ├── 100Random_Molecules.dwar
    ├── 100Random_Molecules_sort.txt
    └── 100Random_Molecules.txt

2 directories, 11 files

nbehrnd/datawarrior_clustersort

Folders and files

Latest commit

History

Repository files navigation

context/motivation

use case

test case

content of the project

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages