
New Velocity Potential and Stream Function Calculations #1072

Open
wants to merge 38 commits into base: develop
Conversation

KarinaAsmar-NOAA (Contributor)

This PR adds the CPC-requested streamfunction and velocity potential at 200 mb to SFS. It is meant to resolve Issue #902 and to address the runtime issues from the previous PR #951. Job scripts used for testing are on WCOSS2: /lfs/h2/emc/vpppg/noscrub/karina.asmar/vpot_strm/UPP (submit_run_gfsv16_wcoss2.sh and submit_run_sfs_wcoss2.sh).

@KarinaAsmar-NOAA (Contributor, Author)

@WenMeng-NOAA We have updated the computation of streamfunction and velocity potential for compatibility with UPP parallelization. The subroutine now uses a Poisson solver with a convergence tester to reduce runtimes. A comparison of spectral and numerical results is here.
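A minimal sketch of the kind of relaxation-with-convergence-tester being described, assuming placeholder names (psi, rhs, tol, maxit) rather than the PR's actual variables, and assuming rhs already carries the 0.25*dx**2 scaling of the vorticity/divergence source term:

      ! Jacobi-style sweep for del**2(psi) = source, exiting early once the
      ! largest point-wise update falls below the tolerance.
      subroutine relax_poisson(psi, rhs, im, jm, maxit, tol)
        implicit none
        integer, intent(in)    :: im, jm, maxit
        real,    intent(in)    :: tol, rhs(im,jm)
        real,    intent(inout) :: psi(im,jm)
        real                   :: ptmp(im,jm), maxdif
        integer                :: i, j, it
        ptmp = psi
        do it = 1, maxit
          maxdif = 0.
          do j = 2, jm-1
            do i = 2, im-1
              psi(i,j) = 0.25*(ptmp(i-1,j)+ptmp(i+1,j)+ptmp(i,j-1)+ptmp(i,j+1)) - rhs(i,j)
              maxdif   = max(maxdif, abs(psi(i,j)-ptmp(i,j)))
            end do
          end do
          if (maxdif < tol) exit   ! convergence tester: stop once updates are small enough
          ptmp = psi               ! carry this iterate into the next sweep
        end do
      end subroutine relax_poisson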

@GeorgeVandenberghe-NOAA (Contributor)

The Poisson solver works but is substantially slower than the gather --> stptranf --> scatter approach of solving the equation spectrally. The gather is much cheaper than it sounds because it is TWO fields (U and V) at one level, not the several hundred such fields that comprise the state plus derived variables. A gather of two fields takes about 0.06 seconds on Hera at GFS (high) resolution and is hard to even measure at CFS resolution. The relaxations take about 10 seconds on Hera for chi and psi together, versus less than a second for the spectral solver. The spectral solver takes 27 seconds for the GFS (high) resolution case while the Poisson solver takes 161 seconds; the spectral solver takes less than a second for the CFS case while the Poisson solver takes about three seconds.
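For comparison, the gather --> spectral solve --> scatter path being described would look roughly like the fragment below. This is illustrative only: u200, v200 and me are placeholder names, spectral_psichi is a hypothetical wrapper around the NCEPLIBS-sp transforms, and the final scatter is left as a comment; only collect_loc is taken from the PR itself.

      ! Gather the two 200 mb wind fields, solve spectrally on one rank,
      ! then scatter the results back to the distributed subdomains.
      real, allocatable :: ufull(:,:), vfull(:,:), psifull(:,:), chifull(:,:)
      allocate(ufull(im,jm), vfull(im,jm), psifull(im,jm), chifull(im,jm))
      call collect_loc(u200, ufull)      ! cheap: one field at one level
      call collect_loc(v200, vfull)
      if (me == 0) then
        call spectral_psichi(ufull, vfull, psifull, chifull)   ! hypothetical sp-based solver
      end if
      ! ...scatter psifull/chifull back to each rank's local subdomain...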

@@ -81,6 +84,9 @@ SUBROUTINE COLLECT_LOC ( A, B )
deallocate(buff)
deallocate(rbufs)

tb=mpi_wtime()
Collaborator

@KarinaAsmar-NOAA Clean up the debugging code.

Contributor Author

@GeorgeVandenberghe-NOAA Would you please clean up the debugging part of COLLECT_LOC.f? Let me know when done and I'll push it to this branch.

@@ -104,6 +110,8 @@ SUBROUTINE COLLECT_ALL ( A, B )
real, dimension(im,jm), intent(out) :: b
integer ierr,n
real, allocatable :: rbufs(:)
real*8 tb,ta
Collaborator

@KarinaAsmar-NOAA Clean up the debugging code in this routine.

@WenMeng-NOAA (Collaborator)

@KarinaAsmar-NOAA Please update the Fortran code in the subroutine to the following format (a short example follows this list):

  • all lowercase
  • indentation for loops, conditionals and blocks
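For instance, using the variable names from the PR's own loop (loop bounds here are placeholders), the requested style would look like:

      do j = jsta, jend
        do i = ista, iend
          if (j > 1 .and. j < jm) then
            chi(i,j) = 0.25*(ptmp(i-1,j)+ptmp(i+1,j)+ptmp(i,j-1)+ptmp(i,j+1)) - dtmp(i,j)
          end if
        end do
      end do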

@WenMeng-NOAA (Collaborator)

@DusanJovic-NOAA @junwang-noaa For your reference, I ran the UFS-WM RT 'cpld_control_sfs_intel' on Hera:

  1. develop branch
    run directory: /home/Wen.Meng/stmp2/FV3_RT/rt_449231/cpld_control_sfs_intel
    inline post runtime: 5 seconds

  2. develop branch with UPP replaced by PR #1072, including velocity potential and stream function generation
    run directory: /home/Wen.Meng/stmp2/FV3_RT/rt_700773/cpld_control_sfs_intel
    inline post runtime: 8.7 seconds

@GeorgeVandenberghe-NOAA (Contributor)

GeorgeVandenberghe-NOAA commented Dec 17, 2024

The allreduce is numerically critical. What it is doing is checking for the maximum difference between the previous value of psi/chi and the result at the end of the next iteration. If the allreduce is left out, this difference is only evaluated on the subdomain associated with MPI rank 0. The allreduce does a max operation on this error on ALL of the ranks. Removing the allreduce will result in faster exit since convergence is only checked on rank 0 and a "bad" rank's larger errors will not be removed by further iterations. The ROOT problem with this whole poisson solver is the extremely slow convergence for all methods tried to date. Note, if we do get rid of the allreduce, rank 0 is one of the best ranks to test on because it's on a corner domain and likely where convergence IS slowest.
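In code terms, the global check being described amounts to something like the fragment below; the names are placeholders and mpi_comm_world stands in for whatever communicator the routine actually uses:

      ! Inside the iteration loop, after each rank finishes its local sweep:
      call mpi_allreduce(edif_local, edif_global, 1, mpi_real, mpi_max, &
                         mpi_comm_world, ierr)
      if (edif_global < tol) exit   ! every rank sees the same worst-case error and exits together
      ! Without the allreduce, the exit test would use only this rank's local error,
      ! so a rank that converges early could stop while other subdomains still carry
      ! large errors that further iterations would have removed.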

@KarinaAsmar-NOAA (Contributor, Author)


@GeorgeVandenberghe-NOAA It seems that the mpi_allreduce call causes the inline post runs to crash (@WenMeng-NOAA, please confirm whether the inline post keeps crashing after removal of the allreduce). Removing it does not seem to affect the SFS results (see pptx).


@KarinaAsmar-NOAA (Contributor, Author)

@WenMeng-NOAA Would you please run your tests with the latest changes? Let me know if anything fails and/or if runtimes are not acceptable.

> @KarinaAsmar-NOAA Your latest commit removed mpi_allreduce, which significantly improved the runtime of the offline post. May I know the reason for this change?

@WenMeng-NOAA I removed it thinking it could be the issue causing the inline post to crash, since it is a collective operation that requires the reduced values to be computed and distributed across all processes. We were using it to help the Poisson loops converge. After removing it, the addition of the OpenMP directives and the reduced number of loops kept the accuracy of the results and reduced runtimes. You can see the SFS result comparison in STRM_VPOT_Poisson_Results_v5.pptx. I haven't been able to test GFS on Hera with the spectral method, but once WCOSS2 is back up I can generate some results. If the inline post works for this test, then it was probably the mpi_allreduce that was causing the runtime/memory issue.

Latest comparisons of SFS and GFS with 100,000 iterations and no 'mpi_allreduce' here:
STRM_VPOT_Poisson_Results_v5.pptx
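In sketch form, the OpenMP directives mentioned above would look something like the sweep below (loop bounds and halo/boundary handling are placeholders; edif should be reset to zero before each sweep). Because each iteration reads only the previous iterate ptmp, the grid points are independent and the loop can be threaded with a max reduction on the local error:

!$omp parallel do private(i,j) reduction(max:edif)
      do j = jsta, jend
        do i = ista, iend
          chi(i,j) = 0.25*(ptmp(i-1,j)+ptmp(i+1,j)+ptmp(i,j-1)+ptmp(i,j+1)) - dtmp(i,j)
          edif     = max(edif, abs(chi(i,j) - ptmp(i,j)))
        end do
      end do
!$omp end parallel do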

do i=ista,iend
if (j>1 .and. j<jm) then
chi(i,j) = 0.25*(ptmp(i-1,j)+ptmp(i+1,j)+ptmp(i,j-1)+ptmp(i,j+1))-dtmp(i,j)
edif=psi(i,j)-pval
Collaborator

@KarinaAsmar-NOAA @JesseMeng-NOAA What is the 'pval' used for? I don't see it initialized or calculated?

@JesseMeng-NOAA (Contributor), Dec 27, 2024

should be
edif=chi(i,j)-ptmp(i,j)

Contributor Author

@WenMeng-NOAA The 'pval' is for evaluating errors across iterations. I missed adding it when restoring the allreduce; it is added now.
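Presumably the intended usage is along these lines (a sketch of the reply's description, not the committed code): pval saves the point's value from the previous iteration just before it is overwritten, so edif measures the size of the update fed to the convergence test.

      pval     = chi(i,j)         ! previous-iteration value, saved before the update
      chi(i,j) = 0.25*(ptmp(i-1,j)+ptmp(i+1,j)+ptmp(i,j-1)+ptmp(i,j+1)) - dtmp(i,j)
      edif     = chi(i,j) - pval  ! change produced by this sweep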

@WenMeng-NOAA (Collaborator)

Here are the runtime tests for the offline post:

GFS (C768), 120 tasks || develop: 142 seconds  || PR #1072: 188 seconds
GEFS (C384), 48 tasks || develop: 52 seconds   || PR #1072: 77 seconds
SFS (C96), 48 tasks   || develop: 3.86 seconds || PR #1072: 5.4 seconds

@WenMeng-NOAA (Collaborator)

Here are the runtime tests for the inline post with the UFS RT 'cpld_control_sfs' on Hera:

Without  velocity potential and stream function generation: 5 seconds
   run directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_449231-develop/cpld_control_sfs_intel
With velocity potential and stream function generation: 9 seconds
   run directory: /scratch1/NCEPDEV/stmp2/Wen.Meng/FV3_RT/rt_2703606-sfs/cpld_control_sfs_intel

@DusanJovic-NOAA and @junwang-noaa Please let me know your comments on this PR.

parm/post_avblflds.xml — outdated review comments, resolved
exit
endif
enddo ! end of jjk loop for chi
tc=mpi_wtime()
Collaborator

@KarinaAsmar-NOAA Comment out the debug code from line 4967 to 4969.

enddo ! end of jjk loop for psi
!
chi=0.
tb=mpi_wtime()
Collaborator

@KarinaAsmar-NOAA Comment out this line.

!
! poisson solver for psi and chi
psi=0.
ta=mpi_wtime()
Collaborator

@KarinaAsmar-NOAA Comment out this line.

Labels: enhancement, Ready for commit queue, Ready for Review, SFSV1
Development

Successfully merging this pull request may close these issues.

Adapt the calculations of velocity potential and streamfunction in UPP
5 participants