Merge pull request #107 from RConsortium/106-post-volcalc-blog
volcalc blog post
Showing 2 changed files with 220 additions and 0 deletions.
220 changes: 220 additions & 0 deletions
...emical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/index.qmd
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,220 @@ | ||
--- | ||
title: "Unlocking Chemical Volatility: How the volcalc R Package is Streamlining Scientific Research" | ||
description: "Kristina Riemer, director of the CCT Data Science Team at the University | ||
of Arizona, and Eric Scott, Scientific Programmer and Educator in the CCT Data Science Team, | ||
the developers behind the volcalc package, discuss the motivation and development of this | ||
innovative tool designed to automate the calculation of chemical compound volatilities." | ||
author: "R Consortium" | ||
image: "volcalc.webp" | ||
date: "09/23/2024" | ||
--- | ||
|
||
![](volcalc.webp) | ||
|
||
The R Consortium recently interviewed [Kristina Riemer](https://datascience.cct.arizona.edu/person/kristina-riemer), director of [the CCT Data Science Team](https://datascience.cct.arizona.edu/) at the University of Arizona, and [Eric Scott](https://datascience.cct.arizona.edu/person/eric-scott), Scientific Programmer and Educator in the CCT Data Science Team, the developers behind the [volcalc package](https://github.com/Meredith-Lab/volcalc), to discuss the motivation and development of this innovative tool designed to automate the calculation of chemical compound volatilities. volcalc streamlines the process by allowing users to input a compound and quickly receive its volatility information, eliminating the need for time-consuming manual calculations. Initially created to assist [Dr. Laura Meredith](https://has.arizona.edu/person/laura-meredith) in managing a large database of volatile compounds, volcalc has since grown into a more versatile tool under Eric’s leadership, now supporting a wider range of researchers. | ||
|
||
Kristina and Eric share insights into the challenges they faced, including managing dependencies, | ||
integrating with CRAN and Bioconductor, and refining complex molecular identification methods. | ||
They also discuss future enhancements, such as incorporating temperature-specific volatility | ||
calculations and expanding the package’s functionality to estimate other compound characteristics. | ||
This project was funded by the R Consortium. | ||
|
||
**Could you share what motivated the development of the volcalc package and how it | ||
aligns with the broader goals of the R ecosystem, particularly in scientific computing?** | ||
|
||
**Kristina**: I was heavily involved in the initial development of volcalc, and later | ||
on, Eric took over the project. We developed volcalc because we began collaborating | ||
with Dr. Laura Meredith, who was compiling a database of volatile chemical compounds. | ||
At the time, she had around 300 compounds, and her students manually gathered details | ||
for each one by examining their representations and calculating various associated values. | ||
This process was tedious and prone to errors, so we thought there must be a more | ||
efficient and automated way to handle it. | ||
|
||
That’s when we came up with the idea of creating a pipeline where someone could input a | ||
compound and quickly receive its volatility information, eliminating the need for all | ||
the manual labor. The purpose of volcalc was to transform the process from taking months | ||
to gather details for 300 compounds to obtaining information for thousands in a much | ||
shorter time. | ||
|
||
**Eric**: volcalc was initially developed specifically for a project where the | ||
researchers were mainly interested in chemical compounds from the | ||
[KEGG database (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/). | ||
When I joined the team and learned about the project, I was thrilled because, as a | ||
chemical ecologist, I saw its potential. However, I also recognized a limitation: | ||
the tool only worked with the KEGG database. This was a drawback because many | ||
researchers, including food scientists and others who work with similar compounds, | ||
might not find their compounds in that specific database. | ||
|
||
This realization inspired me to apply for the R Consortium grant. We saw a significant | ||
opportunity to expand volcalc, making it more flexible and applicable to a wider range | ||
of researchers. We also wanted to improve its integration within the R ecosystem by | ||
adding features like returning the file path of a molecule representation after | ||
downloading it, so it could be easily piped into subsequent steps. These enhancements | ||
aimed to make the tool more versatile and user-friendly for a broader audience. | ||
|
||
**What were the most significant challenges you faced during the development of the | ||
initial version of volcalc, and how did you overcome them?** | ||
|
||
**Kristina**: One of the most challenging aspects of developing volcalc, which | ||
continues to be an issue, is managing dependencies. Specifically, we rely heavily | ||
on a command-line program to handle much of the processing. Early on, we struggled | ||
with how to enable users to run volcalc without needing to install this program on | ||
their own computers, as many of our users aren’t familiar with that kind of setup. | ||
I spent a lot of time trying to create a reproducible environment using | ||
[Binder](https://mybinder.org/), but I was never able to get it fully working. | ||
Even today, there are still issues related to managing these dependencies, which | ||
Eric can elaborate on further. | ||
|
||
It was incredibly important to have Eric on this project because I don’t have a | ||
strong background in chemistry. His ability to come in and figure out some of | ||
the intricate details that would have taken me much longer to grasp was a huge | ||
advantage. The more we can collaborate with domain experts, the better our results will be. | ||
|
||
**Eric**: One thing that has helped with the dependency challenges is that we’ve | ||
started building volcalc on [R-Universe](https://r-universe.dev/search), which means | ||
binaries are available there. While it’s not on CRAN yet, having these binaries on | ||
R-Universe makes installation a bit easier. However, we’ve faced some challenges | ||
with dependencies, particularly because two of them are from Bioconductor. We | ||
didn’t originally aim to develop this package for Bioconductor, which uses S4 | ||
objects and has different standards than CRAN. Our goal was to get it on CRAN, | ||
but our first submission was rejected because the license field for the Bioconductor | ||
package wasn’t formatted to CRAN’s liking. These differences between Bioconductor and | ||
CRAN have created barriers, even though the authors of the Bioconductor package have | ||
been very responsive. Their package works fine on Bioconductor, but it doesn’t meet | ||
CRAN’s criteria, which has been a frustrating challenge. | ||
|
||
Another major challenge in developing volcalc relates to the method we use for | ||
estimating volatility. This method involves counting the numbers of different | ||
functional groups on molecules—such as hydroxyl groups or sulfur atoms—and assigning | ||
coefficients to them. To do this programmatically, we use something called SMARTS, | ||
which is essentially like regular expressions but for molecular structures. Regular | ||
expressions for text are already challenging, but SMARTS is even more complex | ||
because it deals with three-dimensional molecules. | ||
|
||
Before I joined the group, the first version of volcalc had most of these functional | ||
groups figured out, but not all. I spent a significant amount of time trying to | ||
develop SMARTS strings to match additional molecules. Moving forward, I hope that | ||
if we implement new versions, we can get help from the community to refine these | ||
SMARTS strings, as there are likely people out there who are more skilled at it than I am. | ||
|
||
**The original project proposal mentions expanding volcalc to work with any chemical | ||
compound with a known structure. What are the key technical challenges you anticipate | ||
in achieving this goal?** | ||
|
||
**Eric**: This task turned out to be less difficult than I initially expected, but | ||
let me explain. In the original version of volcalc, before we received the R Consortium | ||
funding, the main function started with a KEGG ID—an identifier specific to the KEGG | ||
database. The function would download a MOL file, which is a text representation of a | ||
molecule corresponding to that ID. It would then identify and count the functional groups | ||
in the molecule, and finally, calculate the volatility based on those counts. | ||
|
||
The major change we needed to implement to make volcalc more versatile was to decouple | ||
these steps. In the current version of volcalc, the functionality to download a MOL file | ||
from KEGG is still available, but it’s now separate from the main function that calculates | ||
volatility. This means that the inputs for calculating volatility can now be any MOL file, | ||
not just ones from KEGG. The file can come from any database, be exported from other software, | ||
or even be downloaded manually. Additionally, the tool now supports SMILES, which is | ||
another, simpler text-based representation of molecules. | ||
|
||
There are various ways to represent chemicals in text, including another format | ||
called [InChI](https://www.inchi-trust.org/). The Bioconductor packages we use, ChemmineR | ||
and ChemmineOB, have the ability to translate from InChI and other types of chemical | ||
representations. However, that feature isn’t available on Windows. So, I decided to | ||
keep volcalc focused on SMILES and MOL files. I believe that chemists and other | ||
researchers should be able to obtain data in one of these two formats, or use another | ||
tool to translate their data into these formats. I didn’t want to overload volcalc | ||
with the responsibility of being a chemical representation translator, as that didn’t | ||
seem like its primary purpose. | ||
|
||
**Can you walk us through the process of implementing the SIMPOL algorithm within the | ||
volcalc package?** | ||
|
||
**Kristina**: The algorithm itself is fairly simple; it’s just basic math. You need | ||
to input some constants, the mass of the compound, and the counts of the functional | ||
groups we discussed earlier. Writing the code for this was straightforward and not | ||
particularly challenging. | ||
|
||
**Eric**: Each functional group has a coefficient associated with it, which is | ||
multiplied by the number of times that group appears in the molecule. These values | ||
are then summed up, and the mass of the molecule is factored in as well. The challenging | ||
part wasn’t the algorithm itself, which is straightforward—just multiplying by | ||
coefficients and adding them up. The real difficulty was interpreting what the authors | ||
of the algorithm meant by each of the functional groups. Some were oddly specific, | ||
like how the hydroxyl group that is part of a nitrophenol group isn’t supposed to count | ||
toward the total number of hydroxyl groups. I spent a lot of time poring over the paper, | ||
particularly one table, to fully understand how they defined each group. That | ||
interpretation was the hardest part. | ||
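
As a rough illustration of the arithmetic Eric describes, the sketch below multiplies each functional group's coefficient by its count and adds the results to a constant term, giving a log10 vapor pressure. It is written in Python purely for illustration and is not volcalc's code: the group names, the `INTERCEPT` value, and every coefficient are placeholders chosen for easy arithmetic, not the published SIMPOL.1 values, and the molecular-mass step Kristina mentions is left out.

```python
# Placeholder values (hypothetical, NOT the published SIMPOL.1 table).
INTERCEPT = 2.0
COEFFICIENTS = {
    "carbon": -0.5,
    "hydroxyl": -2.0,
    "ketone": -1.0,
}


def log10_vapor_pressure(group_counts, coefficients=COEFFICIENTS, intercept=INTERCEPT):
    """Group-contribution sum: intercept + sum(count_k * coefficient_k)."""
    unknown = set(group_counts) - set(coefficients)
    if unknown:
        # Fail loudly rather than silently ignoring a group we cannot score.
        raise KeyError(f"no coefficient for: {sorted(unknown)}")
    total = intercept
    for group, count in group_counts.items():
        total += coefficients[group] * count
    return total


# A made-up molecule with 3 carbons and 1 hydroxyl group.
example = log10_vapor_pressure({"carbon": 3, "hydroxyl": 1})
```

The hard part Eric points to, deciding which atoms count toward which group, happens before this function is called; here the counts are simply taken as given.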
|
||
**What future functionalities or expansions do you see as crucial for volcalc, especially | ||
in the context of evolving research needs in chemoinformatics?** | ||
|
||
**Eric**: Right now, we’re working on allowing users to specify different temperatures. | ||
The paper that describes the SIMPOL.1 method includes equations for how the coefficients | ||
of each functional group change with temperature. These changes aren’t always linear, | ||
and the contributions of functional groups can shift in importance as the temperature | ||
varies. This is an important feature to include because the version of volcalc we | ||
currently have uses coefficients calculated at 20°C, based on a table from the original | ||
paper. To accommodate other temperatures, we need to integrate another table that | ||
provides equations for calculating these coefficients based on temperature, and that’s | ||
what we’re working on. | ||
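
To make the temperature dependence concrete, here is a minimal Python sketch of a coefficient that varies with temperature. It assumes the four-term form b_k(T) = B1/T + B2 + B3·T + B4·ln(T), which is our reading of the SIMPOL.1 paper rather than something stated in this interview; the function name and the numeric terms are hypothetical placeholders, not values from the paper or from volcalc.

```python
import math


def coefficient_at(temperature_k, b1, b2, b3, b4):
    """Evaluate b(T) = b1/T + b2 + b3*T + b4*ln(T) for T in kelvin.

    The four-term form is an assumption about the published method.
    """
    if temperature_k <= 0:
        raise ValueError("temperature must be a positive value in kelvin")
    return (
        b1 / temperature_k
        + b2
        + b3 * temperature_k
        + b4 * math.log(temperature_k)
    )


# Hypothetical terms for one functional group (placeholders only).
EXAMPLE_TERMS = (-700.0, 0.5, 0.0, 0.0)

b_20c = coefficient_at(293.15, *EXAMPLE_TERMS)  # 20 °C
b_40c = coefficient_at(313.15, *EXAMPLE_TERMS)  # 40 °C
```

With these placeholder terms the coefficient becomes less negative as the temperature rises, which shows how a group's contribution can shift with temperature.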
|
||
Another key feature we want to leave room for in the future is the ability to add other | ||
methods for estimating volatility. SIMPOL.1 is just one type of group contribution | ||
method, but there are other approaches described in various papers that use different | ||
functional groups, equations, and coefficients. The basic idea remains the same: count | ||
the functional groups in a molecule, apply an equation, and estimate volatility. We’re | ||
trying to structure the code in a way that makes it easy to incorporate additional | ||
methods later, even if we don’t add them right away. I think these are the most important | ||
features we’re focusing on right now. | ||
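
One common way to leave room for additional estimation methods is a small registry that maps a method name to a function taking functional-group counts. The Python sketch below shows that pattern only; it is not how volcalc is organized, and `register_method`, `estimate_volatility`, and the `toy_linear` method with its numbers are all hypothetical.

```python
from typing import Callable, Dict, Mapping

Estimator = Callable[[Mapping[str, int]], float]

# Registry of available methods, keyed by name.
_METHODS: Dict[str, Estimator] = {}


def register_method(name: str):
    """Decorator that adds an estimator function to the registry."""
    def decorator(func: Estimator) -> Estimator:
        _METHODS[name] = func
        return func
    return decorator


@register_method("toy_linear")
def toy_linear(counts: Mapping[str, int]) -> float:
    # Hypothetical coefficients, used only to demonstrate the dispatch.
    return 2.0 - 0.5 * counts.get("carbon", 0) - 2.0 * counts.get("hydroxyl", 0)


def estimate_volatility(counts: Mapping[str, int], method: str = "toy_linear") -> float:
    """Look up the requested method by name and apply it to the counts."""
    try:
        estimator = _METHODS[method]
    except KeyError:
        raise ValueError(
            f"unknown method {method!r}; available: {sorted(_METHODS)}"
        ) from None
    return estimator(counts)
```

Adding a second method then means writing one more decorated function; the code that counts functional groups and the callers of `estimate_volatility` stay unchanged.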
|
||
**Kristina**: We’re focused on the features I mentioned in the near future, but looking | ||
further ahead, I could see volcalc expanding to estimate other characteristics of compounds | ||
beyond just volatility. While I’m not a chemistry expert or a chemical ecologist, I | ||
imagine that those interested in volatility might also be interested in other compound | ||
characteristics that currently lack automated tools for estimation. So, it’s possible | ||
the package could evolve to include those features. | ||
|
||
That said, one of the things I appreciate about the R package ecosystem is that it | ||
allows for specialized tools. Since anyone can build what they need, we don’t end up | ||
with massive, overly complex packages that try to do everything and become difficult | ||
to maintain. It might be better to keep volcalc focused and leave room for separate | ||
packages to handle additional functionality. This way, the tools remain manageable | ||
and easier to maintain in the long run. | ||
|
||
**How has it been working with the R Consortium? Would you recommend applying for | ||
an ISC grant to other R developers?** | ||
|
||
**Kristina**: The application process was straightforward, and I found the grant | ||
format to be very practical. It was focused on milestones and product development, | ||
which is refreshing compared to many academic research grants that tend to avoid | ||
specific deliverables. I highly recommend considering this grant. I believe people | ||
often overlook smaller funding sources, but even small amounts can make a big impact | ||
on the work you’re doing. | ||
|
||
**Eric**: The first time I applied for an R Consortium grant was as a grad student, | ||
and I strongly encourage trainees to apply as well. It was a great experience for me | ||
because I could do it independently—my advisor wasn’t involved as one of the authors, | ||
and it wasn’t a complex process like applying for an NSF grant. It was straightforward | ||
and really rewarding. The only tricky part was figuring out the payment process, but | ||
that’s something people can work out. | ||
|
||
I’ve noticed there seem to be fewer projects in recent years, and I don’t think it’s | ||
due to a lack of funding. It seems like fewer people are applying, which is why I | ||
especially encourage others to give it a shot. From what I’ve seen, there’s a very | ||
good chance of getting funded if you apply right now. | ||
|
||
People should be creative and think broadly about how their project can benefit the | ||
broader R community. This doesn’t mean you need to develop the next big thing like | ||
R-Universe or CRAN. It can be something smaller, like a package that other R users | ||
will find helpful. For example, with our project, volcalc, our main goal was to | ||
encourage chemists—who usually use point-and-click software—to start using R. | ||
That was enough of a contribution to the R community to get funded. So, I really | ||
encourage people to think creatively about what “benefiting the R community” can mean. | ||
|
||
## About ISC Funded Projects | ||
|
||
A major goal of the R Consortium is to strengthen and improve the infrastructure | ||
supporting the R Ecosystem. We seek to accomplish this by funding projects that will | ||
improve both technical infrastructure and social infrastructure. | ||
|
||
[Learn More!](/all-projects/) |
Binary file added (+100 KB, not shown):
...cal-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/volcalc.webp