Commit

volcalc blog post
jcasman committed Sep 23, 2024
1 parent c202c14 commit 333be9b
Showing 2 changed files with 220 additions and 0 deletions.
@@ -0,0 +1,220 @@
---
title: "Unlocking Chemical Volatility: How the volcalc R Package is Streamlining Scientific Research"
description: "Kristina Riemer, director of the CCT Data Science Team at the University
of Arizona, and Eric Scott, Scientific Programmer and Educator in the CCT Data Science Team,
the developers behind the volcalc package, discuss the motivation and development of this
innovative tool designed to automate the calculation of chemical compound volatilities."
author: "R Consortium"
image: "volcalc.webp"
date: "09/23/2024"
---

![](volcalc.webp)

The R Consortium recently interviewed [Kristina Riemer](https://datascience.cct.arizona.edu/person/kristina-riemer), director of [the CCT Data Science Team](https://datascience.cct.arizona.edu/) at the University of Arizona, and [Eric Scott](https://datascience.cct.arizona.edu/person/eric-scott), Scientific Programmer and Educator in the CCT Data Science Team, the developers behind the [volcalc package](https://github.com/Meredith-Lab/volcalc), to discuss the motivation and development of this innovative tool designed to automate the calculation of chemical compound volatilities. volcalc streamlines the process by allowing users to input a compound and quickly receive its volatility information, eliminating the need for time-consuming manual calculations. Initially created to assist [Dr. Laura Meredith](https://has.arizona.edu/person/laura-meredith) in managing a large database of volatile compounds, volcalc has since grown into a more versatile tool under Eric’s leadership, now supporting a wider range of researchers.

Kristina and Eric share insights into the challenges they faced, including managing dependencies,
integrating with CRAN and Bioconductor, and refining complex molecular identification methods.
They also discuss future enhancements, such as incorporating temperature-specific volatility
calculations and expanding the package’s functionality to estimate other compound characteristics.
This project was funded by the R Consortium.

**Could you share what motivated the development of the volcalc package and how it
aligns with the broader goals of the R ecosystem, particularly in scientific computing?**

**Kristina**: I was heavily involved in the initial development of volcalc, and later
on, Eric took over the project. We developed volcalc because we began collaborating
with Dr. Laura Meredith, who was compiling a database of volatile chemical compounds.
At the time, she had around 300 compounds, and her students manually gathered details
for each one by examining their representations and calculating various associated values.
This process was tedious and prone to errors, so we thought there must be a more
efficient and automated way to handle it.

That’s when we came up with the idea of creating a pipeline where someone could input a
compound and quickly receive its volatility information, eliminating the need for all
the manual labor. The purpose of volcalc was to transform the process from taking months
to gather details for 300 compounds to obtaining information for thousands in a much
shorter time.

**Eric**: volcalc was initially developed specifically for a project where the
researchers were mainly interested in chemical compounds from the
[KEGG database (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/).
When I joined the team and learned about the project, I was thrilled because, as a
chemical ecologist, I saw its potential. However, I also recognized a limitation:
the tool only worked with the KEGG database. This was a drawback because many
researchers, including food scientists and others who work with similar compounds,
might not find their compounds in that specific database.

This realization inspired me to apply for the R Consortium grant. We saw a significant
opportunity to expand volcalc, making it more flexible and applicable to a wider range
of researchers. We also wanted to improve its integration within the R ecosystem by
adding features like returning the file path of a molecule representation after
downloading it, so it could be easily piped into subsequent steps. These enhancements
aimed to make the tool more versatile and user-friendly for a broader audience.

**What were the most significant challenges you faced during the development of the
initial version of volcalc, and how did you overcome them?**

**Kristina**: One of the most challenging aspects of developing volcalc, which
continues to be an issue, is managing dependencies. Specifically, we rely heavily
on a command-line program to handle much of the processing. Early on, we struggled
with how to enable users to run volcalc without needing to install this program on
their own computers, as many of our users aren’t familiar with that kind of setup.
I spent a lot of time trying to create a reproducible environment using
[Binder](https://mybinder.org/), but I was never able to get it fully working.
Even today, there are still issues related to managing these dependencies, which
Eric can elaborate on further.

It was incredibly important to have Eric on this project because I don’t have a
strong background in chemistry. His ability to come in and figure out some of
the intricate details that would have taken me much longer to grasp was a huge
advantage. The more we can collaborate with domain experts, the better our results will be.

**Eric**: One thing that has helped with the dependency challenges is that we’ve
started building volcalc on [R-Universe](https://r-universe.dev/search), which means
binaries are available there. While it’s not on CRAN yet, having these binaries on
R-Universe makes installation a bit easier. However, we’ve faced some challenges
with dependencies, particularly because two of them are from Bioconductor. We
didn’t originally aim to develop this package for Bioconductor, which uses S4
objects and has different standards than CRAN. Our goal was to get it on CRAN,
but our first submission was rejected because the license field for the Bioconductor
package wasn’t formatted to CRAN’s liking. These differences between Bioconductor and
CRAN have created barriers, even though the authors of the Bioconductor package have
been very responsive. Their package works fine on Bioconductor, but it doesn’t meet
CRAN’s criteria, which has been a frustrating challenge.

Another major challenge in developing volcalc relates to the method we use for
estimating volatility. This method involves counting the numbers of different
functional groups on molecules—such as hydroxyl groups or sulfur atoms—and assigning
coefficients to them. To do this programmatically, we use something called SMARTS,
which is essentially like regular expressions but for molecular structures. Regular
expressions for text are already challenging, but SMARTS is even more complex
because it deals with three-dimensional molecules.

Before I joined the group, the first version of volcalc had most of these functional
groups figured out, but not all. I spent a significant amount of time trying to
develop SMARTS strings to match additional molecules. Moving forward, I hope that
if we implement new versions, we can get help from the community to refine these
SMARTS strings, as there are likely people out there who are more skilled at it than I am.

**The original project proposal mentions expanding volcalc to work with any chemical
compound with a known structure. What are the key technical challenges you anticipate
in achieving this goal?**

**Eric**: This task turned out to be less difficult than I initially expected, but
let me explain. In the original version of volcalc, before we received the R Consortium
funding, the main function started with a KEGG ID—an identifier specific to the KEGG
database. The function would download a MOL file, which is a text representation of a
molecule corresponding to that ID. It would then identify and count the functional groups
in the molecule, and finally, calculate the volatility based on those counts.

The major change we needed to implement to make volcalc more versatile was to decouple
these steps. In the current version of volcalc, the functionality to download a MOL file
from KEGG is still available, but it’s now separate from the main function that calculates
volatility. This means that the inputs for calculating volatility can now be any MOL file,
not just ones from KEGG. The file can come from any database, be exported from other software,
or even be downloaded manually. Additionally, the tool now supports SMILES, which is
another, simpler text-based representation of molecules.

There are various ways to represent chemicals in text, including another format
called [InChI](https://www.inchi-trust.org/). The Bioconductor packages we use, ChemmineR
and ChemmineOB, have the ability to translate from InChI and other types of chemical
representations. However, that feature isn’t available on Windows. So, I decided to
keep volcalc focused on SMILES and MOL files. I believe that chemists and other
researchers should be able to obtain data in one of these two formats, or use another
tool to translate their data into these formats. I didn’t want to overload volcalc
with the responsibility of being a chemical representation translator, as that didn’t
seem like its primary purpose.

**Can you walk us through the process of implementing the SIMPOL algorithm within the
volcalc package?**

**Kristina**: The algorithm itself is fairly simple; it’s just basic math. You need
to input some constants, the mass of the compound, and the counts of the functional
groups we discussed earlier. Writing the code for this was straightforward and not
particularly challenging.

**Eric**: Each functional group has a coefficient associated with it, which is
multiplied by the number of times that group appears in the molecule. These values
are then summed up, and the mass of the molecule is factored in as well. The challenging
part wasn’t the algorithm itself, which is straightforward—just multiplying by
coefficients and adding them up. The real difficulty was interpreting what the authors
of the algorithm meant by each of the functional groups. Some were oddly specific,
like how the hydroxyl group that is part of a nitrophenol group isn’t supposed to count
toward the total number of hydroxyl groups. I spent a lot of time poring over the paper,
particularly one table, to fully understand how they defined each group. That
interpretation was the hardest part.
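
As a rough illustration of the arithmetic Eric describes, here is a minimal Python sketch (not volcalc's R code). The group names, coefficient values, and function names are made up for the example; the real coefficients come from the SIMPOL.1 paper. How molecular mass enters is not spelled out above, so the second function assumes one common convention: an ideal-gas conversion from vapor pressure in atmospheres to a mass concentration in µg/m³.

```python
import math

# Hypothetical coefficients for illustration only; these are NOT the
# published SIMPOL.1 values.
COEFFICIENTS = {
    "constant": 1.8,    # zeroth-order term, counted once per molecule
    "carbons": -0.44,
    "hydroxyl": -2.2,
    "ketone": -0.9,
}

GAS_CONSTANT = 8.2057e-5  # m^3 atm mol^-1 K^-1


def log10_vapor_pressure(group_counts, coefficients=COEFFICIENTS):
    """Sum coefficient * count over every functional group present."""
    unknown = set(group_counts) - set(coefficients)
    if unknown:
        raise KeyError(f"no coefficient for groups: {sorted(unknown)}")
    return sum(coefficients[g] * n for g, n in group_counts.items())


def log10_saturation_concentration(log10_p_atm, molar_mass, temperature_k=293.15):
    """Assumed mass step: log10 of p * M * 1e6 / (R * T), in ug/m^3.

    molar_mass is in g/mol and the pressure is in atmospheres.
    """
    return log10_p_atm + math.log10(molar_mass * 1e6 / (GAS_CONSTANT * temperature_k))


# A made-up molecule: 4 carbons and 1 hydroxyl group, molar mass 74 g/mol.
counts = {"constant": 1, "carbons": 4, "hydroxyl": 1}
log10_p = log10_vapor_pressure(counts)
log10_c = log10_saturation_concentration(log10_p, molar_mass=74.0)
print(round(log10_p, 2), round(log10_c, 2))
```

The sketch raises an error for a group with no coefficient rather than silently skipping it, since a miscounted or unrecognized group is exactly the kind of interpretation problem described above.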

**What future functionalities or expansions do you see as crucial for volcalc, especially
in the context of evolving research needs in chemoinformatics?**

**Eric**: Right now, we’re working on allowing users to specify different temperatures.
The paper that describes the SIMPOL.1 method includes equations for how the coefficients
of each functional group change with temperature. These changes aren’t always linear,
and the contributions of functional groups can shift in importance as the temperature
varies. This is an important feature to include because the version of volcalc we
currently have uses coefficients calculated at 20°C, based on a table from the original
paper. To accommodate other temperatures, we need to integrate another table that
provides equations for calculating these coefficients based on temperature, and that’s
what we’re working on.
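
To make the temperature dependence concrete: the SIMPOL.1 paper expresses each group coefficient as a four-parameter function of temperature, b_k(T) = B1/T + B2 + B3·T + B4·ln(T), with T in kelvin. The Python sketch below assumes that form; the B values and the names `group_coefficient` and `B_HYDROXYL` are hypothetical and are not taken from the paper or from volcalc.

```python
import math


def group_coefficient(b1, b2, b3, b4, temperature_k):
    """Evaluate b_k(T) = B1/T + B2 + B3*T + B4*ln(T), with T in kelvin."""
    if temperature_k <= 0:
        raise ValueError("temperature must be a positive value in kelvin")
    t = temperature_k
    return b1 / t + b2 + b3 * t + b4 * math.log(t)


# Hypothetical B values for one functional group (not from the paper).
B_HYDROXYL = (-700.0, 0.15, 0.002, -0.05)

# The same group contributes differently at different temperatures.
for celsius in (0, 20, 40):
    kelvin = celsius + 273.15
    print(celsius, round(group_coefficient(*B_HYDROXYL, kelvin), 4))
```

Because of the 1/T and ln(T) terms, the coefficient does not scale linearly with temperature, which is why a single table of 20°C values cannot simply be rescaled.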

Another key feature we want to leave room for in the future is the ability to add other
methods for estimating volatility. SIMPOL.1 is just one type of group contribution
method, but there are other approaches described in various papers that use different
functional groups, equations, and coefficients. The basic idea remains the same: count
the functional groups in a molecule, apply an equation, and estimate volatility. We’re
trying to structure the code in a way that makes it easy to incorporate additional
methods later, even if we don’t add them right away. I think these are the most important
features we’re focusing on right now.
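
One way to leave room for additional methods is a small registry that maps a method name to a function with a shared signature. The Python sketch below shows the general pattern only; it is not how volcalc is implemented, and `register`, `estimate`, `toy_linear`, and the coefficients are all hypothetical.

```python
from typing import Callable, Dict, Mapping

# Every method takes functional-group counts and returns a log10 volatility.
Method = Callable[[Mapping[str, int]], float]
METHODS: Dict[str, Method] = {}


def register(name: str):
    """Decorator that adds an estimation method to the registry."""
    def wrap(func: Method) -> Method:
        METHODS[name] = func
        return func
    return wrap


@register("toy_linear")
def toy_linear(counts: Mapping[str, int]) -> float:
    # Hypothetical coefficients, for illustration only.
    coeffs = {"carbons": -0.5, "hydroxyl": -2.0}
    return 2.0 + sum(coeffs[g] * n for g, n in counts.items())


def estimate(counts: Mapping[str, int], method: str = "toy_linear") -> float:
    """Dispatch to the requested method, failing clearly if it is unknown."""
    if method not in METHODS:
        raise ValueError(f"unknown method {method!r}; choose from {sorted(METHODS)}")
    return METHODS[method](counts)


print(estimate({"carbons": 2, "hydroxyl": 1}))
```

Adding a second group contribution method then means writing one new decorated function, with no changes to the code that counts functional groups or calls `estimate`.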

**Kristina**: In the near term, we’re focused on the features Eric mentioned, but looking
further ahead, I could see volcalc expanding to estimate other characteristics of compounds
beyond just volatility. While I’m not a chemistry expert or a chemical ecologist, I
imagine that those interested in volatility might also be interested in other compound
characteristics that currently lack automated tools for estimation. So, it’s possible
the package could evolve to include those features.

That said, one of the things I appreciate about the R package ecosystem is that it
allows for specialized tools. Since anyone can build what they need, we don’t end up
with massive, overly complex packages that try to do everything and become difficult
to maintain. It might be better to keep volcalc focused and leave room for separate
packages to handle additional functionality. This way, the tools remain manageable
and easier to maintain in the long run.

**How has it been working with the R Consortium? Would you recommend applying for
an ISC grant to other R developers?**

**Kristina**: The application process was straightforward, and I found the grant
format to be very practical. It was focused on milestones and product development,
which is refreshing compared to many academic research grants that tend to avoid
specific deliverables. I highly recommend considering this grant. I believe people
often overlook smaller funding sources, but even small amounts can make a big impact
on the work you’re doing.

**Eric**: The first time I applied for an R Consortium grant was as a grad student,
and I strongly encourage trainees to apply as well. It was a great experience for me
because I could do it independently—my advisor wasn’t involved as one of the authors,
and it wasn’t a complex process like applying for an NSF grant. It was straightforward
and really rewarding. The only tricky part was figuring out the payment process, but
that’s something people can work out.

I’ve noticed there seem to be fewer projects in recent years, and I don’t think it’s
due to a lack of funding. It seems like fewer people are applying, which is why I
especially encourage others to give it a shot. From what I’ve seen, there’s a very
good chance of getting funded if you apply right now.

People should be creative and think broadly about how their project can benefit the
broader R community. This doesn’t mean you need to develop the next big thing like
R-Universe or CRAN. It can be something smaller, like a package that other R users
will find helpful. For example, with our project, volcalc, our main goal was to
encourage chemists—who usually use point-and-click software—to start using R.
That was enough of a contribution to the R community to get funded. So, I really
encourage people to think creatively about what “benefiting the R community” can mean.

## About ISC Funded Projects

A major goal of the R Consortium is to strengthen and improve the infrastructure
supporting the R Ecosystem. We seek to accomplish this by funding projects that will
improve both technical infrastructure and social infrastructure.

[Learn More!](/all-projects/)
Binary file not shown.
