diff --git a/posts/unlocking-chemical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/index.qmd b/posts/unlocking-chemical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/index.qmd new file mode 100644 index 0000000..0e2da5e --- /dev/null +++ b/posts/unlocking-chemical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/index.qmd @@ -0,0 +1,220 @@ +--- +title: "Unlocking Chemical Volatility: How the volcalc R Package is Streamlining Scientific Research" +description: "Kristina Riemer, director of the CCT Data Science Team at the University +of Arizona, and Eric Scott, Scientific Programmer and Educator in the CCT Data Science Team, +the developers behind the volcalc package, discuss the motivation and development of this +innovative tool designed to automate the calculation of chemical compound volatilities. " +author: "R Consortium" +image: "volcalc.webp" +date: "09/23/2024" +--- + +![](volcalc.webp) + +The R Consortium recently interviewed [Kristina Riemer](https://datascience.cct.arizona.edu/person/kristina-riemer), director of [the CCT Data Science Team](https://datascience.cct.arizona.edu/) at the University of Arizona, and [Eric Scott](https://datascience.cct.arizona.edu/person/eric-scott), Scientific Programmer and Educator in the CCT Data Science Team, the developers behind the [volcalc package](https://github.com/Meredith-Lab/volcalc), to discuss the motivation and development of this innovative tool designed to automate the calculation of chemical compound volatilities. volcalc streamlines the process by allowing users to input a compound and quickly receive its volatility information, eliminating the need for time-consuming manual calculations. Initially created to assist [Dr. Laura Meredith](https://has.arizona.edu/person/laura-meredith) in managing a large database of volatile compounds, volcalc has since grown into a more versatile tool under Eric’s leadership, now supporting a wider range of researchers. + +Kristina and Eric share insights into the challenges they faced, including managing dependencies, +integrating with CRAN and Bioconductor, and refining complex molecular identification methods. +They also discuss future enhancements, such as incorporating temperature-specific volatility +calculations and expanding the package’s functionality to estimate other compound characteristics. + This project was funded by the R Consortium. + +**Could you share what motivated the development of the volcalc package and how it +aligns with the broader goals of the R ecosystem, particularly in scientific computing?** + +**Kristina**: I was heavily involved in the initial development of volcalc, and later +on, Eric took over the project. We developed volcalc because we began collaborating +with Dr. Laura Meredith, who was compiling a database of volatile chemical compounds. +At the time, she had around 300 compounds, and her students manually gathered details +for each one by examining their representations and calculating various associated values. +This process was tedious and prone to errors, so we thought there must be a more +efficient and automated way to handle it. + +That’s when we came up with the idea of creating a pipeline where someone could input a +compound and quickly receive its volatility information, eliminating the need for all +the manual labor. The purpose of volcalc was to transform the process from taking months +to gather details for 300 compounds to obtaining information for thousands in a much +shorter time. + +**Eric**: volcalc was initially developed specifically for a project where the +researchers were mainly interested in chemical compounds from the +[KEGG database (Kyoto Encyclopedia of Genes and Genomes)](https://www.genome.jp/kegg/). +When I joined the team and learned about the project, I was thrilled because, as a +chemical ecologist, I saw its potential. However, I also recognized a limitation: +the tool only worked with the KEGG database. This was a drawback because many +researchers, including food scientists and others who work with similar compounds, +might not find their compounds in that specific database. + +This realization inspired me to apply for the R Consortium grant. We saw a significant +opportunity to expand volcalc, making it more flexible and applicable to a wider range +of researchers. We also wanted to improve its integration within the R ecosystem by +adding features like returning the file path of a molecule representation after +downloading it, so it could be easily piped into subsequent steps. These enhancements +aimed to make the tool more versatile and user-friendly for a broader audience. + +**What were the most significant challenges you faced during the development of the +initial version of volcalc, and how did you overcome them?** + +**Kristina**: One of the most challenging aspects of developing volcalc, which +continues to be an issue, is managing dependencies. Specifically, we rely heavily +on a command-line program to handle much of the processing. Early on, we struggled +with how to enable users to run volcalc without needing to install this program on +their own computers, as many of our users aren’t familiar with that kind of setup. +I spent a lot of time trying to create a reproducible environment using +[Binder](https://mybinder.org/), but I was never able to get it fully working. +Even today, there are still issues related to managing these dependencies, which +Eric can elaborate on further. + +It was incredibly important to have Eric on this project because I don’t have a +strong background in chemistry. His ability to come in and figure out some of +the intricate details that would have taken me much longer to grasp was a huge +advantage. The more we can collaborate with domain experts, the better our results will be. + +**Eric**: One thing that has helped with the dependency challenges is that we’ve +started building volcalc on [R-Universe](https://r-universe.dev/search), which means +binaries are available there. While it’s not on CRAN yet, having these binaries on +R-Universe makes installation a bit easier. However, we’ve faced some challenges +with dependencies, particularly because two of them are from Bioconductor. We +didn’t originally aim to develop this package for Bioconductor, which uses S4 +objects and has different standards than CRAN. Our goal was to get it on CRAN, +but our first submission was rejected because the license field for the Bioconductor +package wasn’t formatted to CRAN’s liking. These differences between Bioconductor and +CRAN have created barriers, even though the authors of the Bioconductor package have +been very responsive. Their package works fine on Bioconductor, but it doesn’t meet +CRAN’s criteria, which has been a frustrating challenge. + +Another major challenge in developing volcalc relates to the method we use for +estimating volatility. This method involves counting the numbers of different +functional groups on molecules—such as hydroxyl groups or sulfur atoms—and assigning +coefficients to them. To do this programmatically, we use something called SMARTS, +which is essentially like regular expressions but for molecular structures. Regular +expressions for text are already challenging, but SMARTS is even more complex +because it deals with three-dimensional molecules. + +Before I joined the group, the first version of volcalc had most of these functional +groups figured out, but not all. I spent a significant amount of time trying to +develop SMARTS strings to match additional molecules. Moving forward, I hope that +if we implement new versions, we can get help from the community to refine these +SMARTS strings, as there are likely people out there who are more skilled at it than I am. + +**The original project proposal mentions expanding volcalc to work with any chemical +compound with a known structure. What are the key technical challenges you anticipate +in achieving this goal?** + +**Eric**: This task turned out to be less difficult than I initially expected, but +let me explain. In the original version of volcalc, before we received the R Consortium +funding, the main function started with a KEGG ID—an identifier specific to the KEGG +database. The function would download a MOL file, which is a text representation of a +molecule corresponding to that ID. It would then identify and count the functional groups +in the molecule, and finally, calculate the volatility based on those counts. + +The major change we needed to implement to make volcalc more versatile was to decouple +these steps. In the current version of volcalc, the functionality to download a MOL file +from KEGG is still available, but it’s now separate from the main function that calculates +volatility. This means that the inputs for calculating volatility can now be any MOL file, +not just ones from KEGG. The file can come from any database, be exported from other software, +or even be downloaded manually. Additionally, the tool now supports SMILES, which is +another, simpler text-based representation of molecules. + +There are various ways to represent chemicals in text, including another format +called [InChI.](https://www.inchi-trust.org/) The Bioconductor packages we use, ChemmineR +and ChemmineOB, have the ability to translate from InChI and other types of chemical +representations. However, that feature isn’t available on Windows. So, I decided to +keep volcalc focused on SMILES and MOL files. I believe that chemists and other +researchers should be able to obtain data in one of these two formats, or use another +tool to translate their data into these formats. I didn’t want to overload volcalc +with the responsibility of being a chemical representation translator, as that didn’t +seem like its primary purpose. + +**Can you walk us through the process of implementing the SIMPOL algorithm within the +volcalcc package?** + +**Kristina**: The algorithm itself is fairly simple; it’s just basic math. You need +to input some constants, the mass of the compound, and the counts of the functional +groups we discussed earlier. Writing the code for this was straightforward and not +particularly challenging. + +**Eric**: Each functional group has a coefficient associated with it, which is +multiplied by the number of times that group appears in the molecule. These values +are then summed up, and the mass of the molecule is factored in as well. The challenging +part wasn’t the algorithm itself, which is straightforward—just multiplying by +coefficients and adding them up. The real difficulty was interpreting what the authors +of the algorithm meant by each of the functional groups. Some were oddly specific, +like how the hydroxyl group that is part of a nitrophenol group isn’t supposed to count +toward the total number of hydroxyl groups. I spent a lot of time poring over the paper, +particularly one table, to fully understand how they defined each group. That +interpretation was the hardest part. + +**What future functionalities or expansions do you see as crucial for volcalc, especially +in the context of evolving research needs in chemoinformatics?** + +**Eric**: Right now, we’re working on allowing users to specify different temperatures. +The paper that describes the SIMPOL.1 method includes equations for how the coefficients +of each functional group change with temperature. These changes aren’t always linear, +and the contributions of functional groups can shift in importance as the temperature +varies. This is an important feature to include because the version of volcalc we +currently have uses coefficients calculated at 20°C, based on a table from the original +paper. To accommodate other temperatures, we need to integrate another table that +provides equations for calculating these coefficients based on temperature, and that’s + what we’re working on. + +Another key feature we want to leave room for in the future is the ability to add other +methods for estimating volatility. SIMPOL.1 is just one type of group contribution +method, but there are other approaches described in various papers that use different +functional groups, equations, and coefficients. The basic idea remains the same: count +the functional groups in a molecule, apply an equation, and estimate volatility. We’re +trying to structure the code in a way that makes it easy to incorporate additional +methods later, even if we don’t add them right away. I think these are the most important +features we’re focusing on right now. + +**Kristina**: We’re focused on the features I mentioned in the near future, but looking +further ahead, I could see volcalc expanding to estimate other characteristics of compounds +beyond just volatility. While I’m not a chemistry expert or a chemical ecologist, I +imagine that those interested in volatility might also be interested in other compound +characteristics that currently lack automated tools for estimation. So, it’s possible +the package could evolve to include those features. + +That said, one of the things I appreciate about the R package ecosystem is that it +allows for specialized tools. Since anyone can build what they need, we don’t end up +with massive, overly complex packages that try to do everything and become difficult +to maintain. It might be better to keep volcalc focused and leave room for separate +packages to handle additional functionality. This way, the tools remain manageable +and easier to maintain in the long run. + +**How has it been working with the R Consortium? Would you recommend applying for +an ISC grant to other R developers?** + +**Kristina**: The application process was straightforward, and I found the grant +format to be very practical. It was focused on milestones and product development, +which is refreshing compared to many academic research grants that tend to avoid +specific deliverables. I highly recommend considering this grant. I believe people +often overlook smaller funding sources, but even small amounts can make a big impact +on the work you’re doing. + +**Eric**: The first time I applied for an R Consortium grant was as a grad student, +and I strongly encourage trainees to apply as well. It was a great experience for me +because I could do it independently—my advisor wasn’t involved as one of the authors, +and it wasn’t a complex process like applying for an NSF grant. It was straightforward +and really rewarding. The only tricky part was figuring out the payment process, but +that’s something people can work out. + +I’ve noticed there seem to be fewer projects in recent years, and I don’t think it’s +due to a lack of funding. It seems like fewer people are applying, which is why I +especially encourage others to give it a shot. From what I’ve seen, there’s a very +good chance of getting funded if you apply right now. + +People should be creative and think broadly about how their project can benefit the +broader R community. This doesn’t mean you need to develop the next big thing like +R-Universe or CRAN. It can be something smaller, like a package that other R users +will find helpful. For example, with our project, volcalc, our main goal was to +encourage chemists—who usually use point-and-click software—to start using R. +That was enough of a contribution to the R community to get funded. So, I really +encourage people to think creatively about what “benefiting the R community” can mean. + +## About ISC Funded Projects + +A major goal of the R Consortium is to strengthen and improve the infrastructure +supporting the R Ecosystem. We seek to accomplish this by funding projects that will +improve both technical infrastructure and social infrastructure. + +[Learn More!](/all-projects/) diff --git a/posts/unlocking-chemical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/volcalc.webp b/posts/unlocking-chemical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/volcalc.webp new file mode 100644 index 0000000..15f4b19 Binary files /dev/null and b/posts/unlocking-chemical-volatility-how-the-volcalc-r-package-is-streamlining-scientific-research/volcalc.webp differ