[BUG] SCDF not available due to skipper crash #1067

Open
5 of 12 tasks
omiazek-ads opened this issue Aug 8, 2023 · 1 comment
Labels
bug Something isn't working CCB Issue for CCB ops Ticket from ADS operation team priority:major Set the priority to major because the production is heavily impacted

Comments

omiazek-ads commented Aug 8, 2023

Environment:

  • Delivery tag: 2.0.0-rc2
  • Platform: OPS Orange Cloud
  • Configuration: RS 1.5

Current Behavior:
SCDF has been frequently unavailable since 04/08/2023: about half the time, no stream can be deployed and the SCDF GUI cannot be reached.

Expected Behavior:
SCDF is available.

Steps To Reproduce:
Connect to the SCDF GUI.

Test execution artefacts (i.e. logs, screenshots…)
Here is the memory consumption graph:
Error_SCDF_heap_space
Here is the error log:
Error_SCDF_heap_space.txt

Whenever possible, first analysis of the root cause
The spring-cloud-dataflow-skipper pod is frequently in CrashLoopBackOff (liveness and readiness probes failing).
The logs show that the application runs out of Java heap space.
No Java heap size is set explicitly, and the default in effect is unknown.
We explicitly set the Java heap size to 1024 MB and restarted the pod.
Since then, the problem has not recurred.
Used memory increased from ~850 MB to ~1350 MB.
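
The default heap size, reported above as unknown, can usually be read from the JVM itself. HotSpot typically sizes the maximum heap at a quarter of the memory visible to the JVM when no `-Xmx` is given, but the value actually in effect depends on the JVM version and the container memory limit. The sketch below is one way to check it; `max_heap_mb` is a hypothetical helper name, and it assumes the `java` binary is on the PATH inside the skipper container, which this issue does not confirm.

```shell
# Print the JVM's effective maximum heap in MB, parsed from the output of
# `java -XX:+PrintFlagsFinal -version` read on standard input.
max_heap_mb() {
  awk '$2 == "MaxHeapSize" { printf "%d\n", $4 / 1048576 }'
}

# Example (needs cluster access; assumes `java` is on the PATH in the container):
#   kubectl exec -n processing deploy/spring-cloud-dataflow-skipper -- \
#     java -XX:+PrintFlagsFinal -version | max_heap_mb
```

A `java` process started this way does not read `JAVA_OPTS` on its own, so it should report the JVM default for that container rather than any value passed through that variable. The command only works while the pod is running, which may not be the case during a CrashLoopBackOff.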

Workaround
Increase the java heap space in the SCDF skipper deployment:

kubectl edit deploy -n processing spring-cloud-dataflow-skipper

Add the following two lines to the "env" section of the container:

        - name: JAVA_OPTS
          value: -Xms1024m -Xmx1024m
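
The same change can probably be applied without an interactive edit by using `kubectl set env`. This is a sketch, not the fix agreed for this issue: the namespace, deployment name, and `JAVA_OPTS` value are taken from the workaround above, while the function name `set_skipper_heap` and the `KUBECTL` override are hypothetical additions for illustration.

```shell
# Namespace and deployment name as given in the workaround above.
NAMESPACE="processing"
DEPLOYMENT="spring-cloud-dataflow-skipper"

# Set identical initial and maximum heap sizes (in MB) on the skipper deployment.
# KUBECTL is a hypothetical override, e.g. KUBECTL=echo prints the command
# arguments instead of contacting the cluster.
set_skipper_heap() {
  heap_mb="$1"
  ${KUBECTL:-kubectl} set env "deployment/$DEPLOYMENT" -n "$NAMESPACE" \
    "JAVA_OPTS=-Xms${heap_mb}m -Xmx${heap_mb}m"
}
```

Calling `set_skipper_heap 1024` changes the pod template, so Kubernetes rolls out a new pod; `kubectl rollout status deployment/spring-cloud-dataflow-skipper -n processing` can then be used to wait for it. Note that `kubectl set env` applies to every container in the deployment unless `-c` restricts it.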

Bug Generic Definition of Ready (DoR)

  • The affected version in which the bug has been found is mentioned
  • The context and environment of the bug is detailed
  • The description of the bug is clear and unambiguous
  • The procedure (steps) to reproduce the bug is clearly detailed
  • The tested User Story / feature is linked to the bug if available
  • Logs are attached if available
  • A data set is attached if available

Bug Generic Definition of Done (DoD)

  • The modification implemented (the solution to fix the bug) is described in the bug.
  • Unit tests & Continuous integration performed - Test results available - Structural Test coverage reported by SONAR
  • Code committed in GIT with right tag or Analysis/Trade Off documentation up-to-date in reference-system-documentation repository
  • Code is compliant with coding rules (SONAR Report as evidence)
  • Acceptance criteria of the related User story are checked and Passed
@omiazek-ads omiazek-ads added bug Something isn't working CCB Issue for CCB priority:major Set the priority to major because the production is heavily impacted ops Ticket from ADS operation team labels Aug 8, 2023

LAQU156 commented Aug 9, 2023

System_CCB_2023_w32: Moved to "Accepted OPS". Action on the OPS team to create a pull request on the Infrastructure repository.
