Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of parsing simple dates #8386

Merged
merged 5 commits into from
Jan 28, 2025

Conversation

tobias-hotz
Copy link
Contributor

This PR aims to improve the performance of IsoDate when handling "simple" Dates. This method is not very slow, but it is called very often (e.g. by UUIDMapper during harvesting), which makes it show up in some performance measurements. See #8007 where parsing simple Dates would take as much as 10-20% of the entire harvesting CPU time.

To validate the performance improvement, JMH benchmarks were added.
This is the baseline performance of the patch:

Benchmark                               (arg)  Mode  Cnt    Score     Error  Units
ISODateBenchmark.measureIsoSimple  1976-06-03  avgt    5  604.618 ± 69.233  ns/op
ISODateBenchmark.measureIsoSimple  1976/06/03  avgt    5  608.454 ± 44.684  ns/op
ISODateBenchmark.measureIsoSimple    24-06-06  avgt    5  874.100 ± 55.160  ns/op

Notice that the overall execution time is very small, but due to the extremly high number of calls, this is still a relevant metric
The first idea was to precompile the pattern used for the regex (see the first number), which resulted in the following numbers:

Benchmark                               (arg)  Mode  Cnt    Score     Error  Units
ISODateBenchmark.measureIsoSimple  1976-06-03  avgt    5  466.098 ±  32.134  ns/op
ISODateBenchmark.measureIsoSimple  1976/06/03  avgt    5  456.157 ±  52.184  ns/op
ISODateBenchmark.measureIsoSimple    24-06-06  avgt    5  716.912 ± 112.579  ns/op

This means a minor improvement, but there is still room to improve.
The second commit changes the algorithm to use as little string operations as possible. Whith this. I get the following numbers:

Benchmark                               (arg)  Mode  Cnt    Score    Error  Units
ISODateBenchmark.measureIsoSimple  1976-06-03  avgt    5  105.912 ± 11.120  ns/op
ISODateBenchmark.measureIsoSimple  1976/06/03  avgt    5  115.320 ±  9.542  ns/op
ISODateBenchmark.measureIsoSimple    24-06-06  avgt    5  193.689 ± 14.505  ns/op

The speedup is about 5x compared to the baseline performance.

Checklist

  • I have read the contribution guidelines
  • Pull request provided for main branch, backports managed with label
  • Good housekeeping of code, cleaning up comments, tests, and documentation
  • Clean commit history broken into understandable chucks, avoiding big commits with hundreds of files, cautious of reformatting and whitespace changes
  • Clean commit messages, longer verbose messages are encouraged
  • API Changes are identified in commit messages
  • Testing provided for features or enhancements using automatic tests
  • User documentation provided for new features or enhancements in manual
  • Build documentation provided for development instructions in README.md files
  • Library management using pom.xml dependency management. Update build documentation with intended library use and library tutorials or documentation

Funded by LGL BW

JMH allows per-method benchmarks. See jmh/README for more details
Simples dates are common in many cases. Performance analysis showed that this can be a performance hotspot during harvesting.
This is due to the design of UUIDMapper, which loads all metadata for a harvester for every new batch. This can not be easily changed, but we can improve the performance here, which also helps other code paths using this method.
This change improves the performance of the method by a factor of about 1.3x
This change improves the performance of parseDate by removing as many string operations as possible. Doing this and some other minor optimisations, we can improve the performance so it is about 5x faster than the original.
@tobias-hotz tobias-hotz changed the title Simple datetime perf Improve performance of parsing simple Dates Sep 24, 2024
@tobias-hotz tobias-hotz changed the title Improve performance of parsing simple Dates Improve performance of parsing simple dates Sep 24, 2024
@fxprunayre fxprunayre requested a review from juanluisrp October 9, 2024 13:11
@fxprunayre fxprunayre added this to the 4.4.7 milestone Oct 9, 2024
@CLAassistant
Copy link

CLAassistant commented Dec 8, 2024

CLA assistant check
All committers have signed the CLA.

@tobias-hotz
Copy link
Contributor Author

Hi @juanluisrp,

I hope you’re doing well! I noticed this PR has been waiting for a while, and I wanted to follow up. The core changes are relatively straightforward, but if they feel too complex, I can roll back to the first attempt, as that version is much simpler and already yields measurable results. However, the current version is more polished and brings additional improvements.

If the JMH tests are a concern, I’m happy to separate them into another PR to speed up the review process. Please let me know if there’s anything I can do to help get this PR merged.

Thanks a lot for your time!

@juanluisrp
Copy link
Contributor

Hi @tobias-hotz. The enhancement looks good to me since it still passing the ISODate tests. However I'm not sure if we should add this new performance test module to the source code. Do you plan to keep using it for other improvements?

@tobias-hotz
Copy link
Contributor Author

Hi @juanluisrp,
I’ve removed the JMH part, as I don’t currently plan to use it for anything else. It was a useful method for measuring performance in this PR, but we can always implement ad-hoc tests for any future JMH benchmarking if needed.

@juanluisrp juanluisrp merged commit 3e3dacb into geonetwork:main Jan 28, 2025
7 checks passed
@geonetworkbuild
Copy link
Collaborator

The backport to 4.2.x failed:

The process '/usr/bin/git' failed with exit code 128
stderr
error: commit a754b4957356dde13735698869510837400ef49e is a merge but no -m option was given.
fatal: cherry-pick failed

stdout
Auto-merging pom.xml
[backport-8386-to-4.2.x 1bbc7fb189] Add initial JMH infrastructure and benchmarks
 Author: Tobias Hotz <[email protected]>
 Date: Fri Sep 6 13:34:05 2024 +0200
 5 files changed, 145 insertions(+)
 create mode 100644 jmh/.gitignore
 create mode 100644 jmh/README.md
 create mode 100644 jmh/pom.xml
 create mode 100644 jmh/src/main/java/org/fao/geonet/domain/ISODateBenchmark.java
[backport-8386-to-4.2.x 5107b50ce4] Improve the performance of simple dates, take 1
 Author: Tobias Hotz <[email protected]>
 Date: Tue Sep 24 12:34:57 2024 +0200
 1 file changed, 3 insertions(+), 1 deletion(-)
[backport-8386-to-4.2.x afa8ece847] Improve the performance of simple dates, take 2
 Author: Tobias Hotz <[email protected]>
 Date: Tue Sep 24 12:41:51 2024 +0200
 1 file changed, 65 insertions(+), 42 deletions(-)
Auto-merging pom.xml
[backport-8386-to-4.2.x c0e730286e] Revert "Add initial JMH infrastructure and benchmarks"
 Author: Tobias Hotz <[email protected]>
 Date: Tue Jan 28 09:24:05 2025 +0100
 5 files changed, 145 deletions(-)
 delete mode 100644 jmh/.gitignore
 delete mode 100644 jmh/README.md
 delete mode 100644 jmh/pom.xml
 delete mode 100644 jmh/src/main/java/org/fao/geonet/domain/ISODateBenchmark.java

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-4.2.x 4.2.x
# Navigate to the new working tree
cd .worktrees/backport-4.2.x
# Create a new branch
git switch --create backport-8386-to-4.2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick a00732a80c76dbc3046390107169122f8b70a4f7,8d5e5e64305d055147d972223aef0500e5fce70f,3c90f55e8c82e5b97ebcfa537d65a319f25f89e5,b0ebf075236076678b66a265ffe661e1d27bb5cd,a754b4957356dde13735698869510837400ef49e
# Push it to GitHub
git push --set-upstream origin backport-8386-to-4.2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-4.2.x

Then, create a pull request where the base branch is 4.2.x and the compare/head branch is backport-8386-to-4.2.x.

juanluisrp pushed a commit that referenced this pull request Jan 28, 2025
Simples dates are common in many cases. Performance analysis showed that this can be a performance
hotspot during harvesting.

This is due to the design of UUIDMapper, which loads all metadata for a harvester for every new batch. This
can not be easily changed, but we can improve the performance here, which also helps other code paths
using this method.

This change improves the performance of parseDate by removing as many string operations as possible.
Doing this and some other minor optimisations, we can improve the performance so it is about 5x faster
than the original.
@tobias-hotz tobias-hotz deleted the simple_datetime_perf branch January 28, 2025 09:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants