Enable Logging for Cloud.gov App Logs #3062

Closed
4 tasks done
adborden opened this issue Mar 31, 2021 · 25 comments
Labels: compliance (relating to security compliance or documentation), logging, Notifications, O&M (operations and maintenance tasks for the Data.gov platform)

Comments

@adborden
Contributor

adborden commented Mar 31, 2021

User Story

In order to get alerts on logs, data.gov sysadmins want application logs on cloud.gov to be directed to New Relic.

Acceptance Criteria


  • GIVEN all application logs are sent to New Relic
    WHEN an exception/error is logged
    THEN New Relic receives the logs

Background

New Relic Logs provides an ingest service for application logs and enables alerting on those logs, which would satisfy monitoring controls using a product we already use.

Security Considerations (required)

New Relic Logs is already approved; no security concerns.

Sketch

@adborden
Contributor Author

adborden commented Mar 31, 2021

Some back of the envelope math...

Catalog gunicorn logs are rotated at 50MB for a maximum of 10 logs. Those 10 logs cover an average of about 6 days. Extrapolated over a month, we estimate 10.55 GB of data. We're already using ~25GB of data from NR instrumentation ingestion. Assuming catalog is our largest application in terms of log volume, then between catalog, inventory, and dashboard we should still be under the 100GB free-tier monthly limit. Beyond that, it would be $0.25/GB.

host                                    start                end                  size (MB)  delta (days)  MB/day       GB/mo
catalogbweb1p.prod-ocsit.bsp.gsa.gov    2021-03-30 7:18:13   2021-03-24 8:32:58   500        5.948090278   84.06059368  2.462712705
catalogbweb2p.prod-ocsit.bsp.gsa.gov    2021-03-30 0:45:42   2021-03-23 23:27:42  500        6.054166667   82.58774948  2.419562973
catalogweb1p.prod-ocsit.bsp.gsa.gov     2021-03-30 10:45:06  2021-03-24 21:08:38  500        5.566990741   89.81513052  2.631302652
catalogweb2p.prod-ocsit.bsp.gsa.gov     2021-03-30 1:43:20   2021-03-25 3:44:30   500        4.915856481   101.7116757  2.979834248
catalogpubweb1p.prod-ocsit.bsp.gsa.gov  2021-03-31 0:21:50   2021-03-09 18:43:12  43.42      21.23516204   2.044721859  0.05990396072
Total                                                                                                                   10.55331654
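
For anyone who wants to re-run the estimate, here is a rough sketch of the arithmetic behind the table. The 30-day month and the 1024 MB/GB conversion are assumptions inferred from the numbers in the table, not stated in the comment, and the function name is made up for illustration.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (host, later timestamp, earlier timestamp, size of retained logs in MB),
# copied from the table in this comment.
ROWS = [
    ("catalogbweb1p", "2021-03-30 7:18:13", "2021-03-24 8:32:58", 500),
    ("catalogbweb2p", "2021-03-30 0:45:42", "2021-03-23 23:27:42", 500),
    ("catalogweb1p", "2021-03-30 10:45:06", "2021-03-24 21:08:38", 500),
    ("catalogweb2p", "2021-03-30 1:43:20", "2021-03-25 3:44:30", 500),
    ("catalogpubweb1p", "2021-03-31 0:21:50", "2021-03-09 18:43:12", 43.42),
]


def gb_per_month(later, earlier, size_mb, days_per_month=30, mb_per_gb=1024):
    """Extrapolate retained log volume to GB per month (assumed 30 days, 1024 MB/GB)."""
    span = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    delta_days = span.total_seconds() / 86400
    return size_mb / delta_days * days_per_month / mb_per_gb


total_gb = sum(gb_per_month(later, earlier, size) for _, later, earlier, size in ROWS)
print(f"estimated catalog log volume: {total_gb:.2f} GB/month")
```

Under those assumptions the per-host values and the 10.55 GB total match the table.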

@adborden changed the title from "[research 1d]: Enable New Relic Logs for one application" to "Enable New Relic Logs for catalog" Mar 31, 2021
@adborden
Contributor Author

adborden commented Apr 5, 2021

Could this be used for SSB managed boundary for AU-6 (3)? Can we ship logs from AWS to NR Logs?

@adborden
Contributor Author

Could this be used for SSB managed boundary for AU-6 (3)? Can we ship logs from AWS to NR Logs?

Yes, we're already running fluent-bit in SSB, we just need to add the NR plugin and configuration

@hkdctol changed the title from "Enable New Relic Logs for catalog" to "Enable New Relic Logs for Cloud.gov Apps" Apr 22, 2021
@hkdctol added the compliance label Apr 22, 2021
@jbrown-xentity
Contributor

May need to improve server run commands to include running with gunicorn and new-relic.

@jbrown-xentity changed the title from "Enable New Relic Logs for Cloud.gov Apps" to "Enable Alerting for Cloud.gov App Logs" Sep 23, 2021
@jbrown-xentity
Contributor

To note, we definitely need to have Aaron's custom log solution to ingest cloud.gov logs before sending them to New Relic.

@mogul
Contributor

mogul commented Dec 13, 2021

The log service exists, but it's not bound consistently to the apps between staging and prod:

bmogilefsky@rocinante-w10:~/Documents/Code/catalog.data.gov$ cf t -s prod
API endpoint:   https://api.fr.cloud.gov
API version:    3.109.0
user:           [email protected]
org:            gsa-datagov
space:          prod
bmogilefsky@rocinante-w10:~/Documents/Code/catalog.data.gov$ cf services | grep logs
logstack-space-drain-prod      user-provided                                          inventory, dashboard, catalog                                                                      
bmogilefsky@rocinante-w10:~/Documents/Code/catalog.data.gov$ cf t -s staging
API endpoint:   https://api.fr.cloud.gov
API version:    3.109.0
user:           [email protected]
org:            gsa-datagov
space:          staging
bmogilefsky@rocinante-w10:~/Documents/Code/catalog.data.gov$ cf services | grep logs
logstack-space-drain-staging    user-provided                                          inventory  
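
A quick way to spot the mismatch is to diff the bound apps in the two rows above. This is only a sketch: it assumes rows shaped like the `cf services` output shown here, where the bound apps are the last non-empty column, and the helper name is hypothetical.

```python
import re


def bound_apps(cf_services_row):
    """Return the app names from the last column of one `cf services` row."""
    # Assumption: columns are separated by two or more spaces, as in the output above.
    columns = re.split(r"\s{2,}", cf_services_row.strip())
    return {app.strip() for app in columns[-1].split(",") if app.strip()}


prod = bound_apps(
    "logstack-space-drain-prod      user-provided"
    "                                          inventory, dashboard, catalog"
)
staging = bound_apps(
    "logstack-space-drain-staging    user-provided"
    "                                          inventory"
)

# Apps drained in prod but not in staging.
missing_in_staging = sorted(prod - staging)
print(missing_in_staging)  # ['catalog', 'dashboard']
```

A row for a service with no bound apps would need extra handling, since its last column would be the service type rather than an app list.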

@mogul
Contributor

mogul commented Dec 13, 2021

More information on how that all works over here.

@FuhuXia self-assigned this Jan 7, 2022
@mogul self-assigned this Jan 11, 2022
@mogul
Contributor

mogul commented Jan 11, 2022

I just ripped out the FluentBit option because it doesn't do the filter-processing that the LogStash option already does for CF logs. However, it's easy to add the New Relic output plugin to LogStash. I will do that shortly.

@jbrown-xentity
Contributor

I think the GSA/datagov-logstack#28 PR added most of the necessary fixes. However, the actual license key wasn't implemented. I think this will take a small redesign to implement this properly. We can either use a secret like inventory (see manifest and .profile), or we could add to the environment in a different way (removing from manifest, and adding environment variable manually), or something else I'm not thinking of.

@nickumia-reisys
Contributor

To be clear, when this is working, are we expecting to see the logs here?

image

@nickumia-reisys self-assigned this Apr 28, 2022
@nickumia-reisys
Contributor

nickumia-reisys commented Apr 28, 2022

Current Status:

  • New code with NR plugin deployed to management space
  • S3 log forward has issues
  • NR log forward does not seem to be working

TODO:

  • Fix the datagov-logstack GitHub Action deploy (using wrong cf action)
  • Debug the S3/NR errors (or lack thereof)
  • Manually test NR log forwarding: create a dummy docker-compose app that runs logstash with the NR License Key, forward some logs, and see if they show up in NR.
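
Independent of logstash, a single hand-built request can help tell "the key or endpoint is wrong" apart from "the pipeline is broken". The sketch below is an assumption-heavy illustration, not the team's test: the `Api-Key` header and the `[{"logs": [...]}]` payload follow my understanding of the New Relic Log API and should be checked against its docs, and the `NR_LOG_ENDPOINT` and `NR_LICENSE_KEY` environment variable names are hypothetical. The endpoint is deliberately not hard-coded.

```python
import json
import os
import urllib.request


def build_log_request(endpoint, license_key, message):
    """Build (but do not send) a POST carrying one test log line."""
    # Assumed payload shape: a list of blocks, each with a "logs" array.
    body = json.dumps([{"logs": [{"message": message}]}]).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json", "Api-Key": license_key},
        method="POST",
    )


def send_test_log(message="logstack manual test"):
    """Send one test log using settings from the environment (hypothetical names)."""
    request = build_log_request(
        os.environ["NR_LOG_ENDPOINT"],  # full URL of the log ingest endpoint
        os.environ["NR_LICENSE_KEY"],   # never commit this value
        message,
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Calling `send_test_log()` with both variables set and then searching NR Logs for the message would confirm whether the key and endpoint pair is accepted.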

@nickumia-reisys
Contributor

Current state of the app:

image

Possible Next Steps:

  • Do unit testing locally with logstash
  • Try to get production logs to work

@jbrown-xentity
Contributor

That log triggered something; I think we have a different endpoint for NewRelic. Not sure how/if we can set that in the logstash new relic plugin... See https://github.com/GSA/inventory-app/blob/main/manifest.yml#L29

@nickumia-reisys
Contributor

That gave a different error,

image

I do think the last part of this is just figuring out the right URL.

@nickumia-reisys
Contributor

And no.. specifying the rest of the generic /log/v1 doesn't work 😕

image

@nickumia-reisys
Contributor

I maybe found it! 😀

@nickumia-reisys
Contributor

No way!!

image

P.S. No idea why NR thinks I'm PDT all the time 🙄

@nickumia-reisys
Contributor

nickumia-reisys commented Apr 30, 2022

Well.. something's broken... it sends logs for a bit and then stops.. The only way to spur it again is to re-push/restart the app (with proper RAM allocation)

image

@nickumia-reisys
Contributor

Turns out the app is crashing,

  • One factor was RAM allocation. 700M was originally allocated, it has grown to 920M.
  • Another factor is unknown. The logs are not helpful in debugging. I wonder if it's because it has issues with the S3 output. Will investigate.

@nickumia-reisys
Contributor

Removing the S3 output made it more stable,

image

It also dropped RAM usage to 650M.

@nickumia-reisys
Contributor

nickumia-reisys commented May 1, 2022

It's been stable for the last 24 hours,

image

But we need to discuss the following,

@mogul
Contributor

mogul commented May 1, 2022

S3 output should be fixed... That export is how our logs were being shipped to GSA's SOC.

@nickumia-reisys
Contributor

NR and S3 Outputs working harmoniously,

image

@nickumia-reisys
Contributor

Summary of work completed:

Current state of Logstack application:

  • Everything about the CD works except the last step, which tries to drain all of the logs from all of the apps in the prod and staging spaces. This step is skipped if it's already set up. There is an automated script that works when run manually locally. However, in GitHub Actions, the script fails because of weird discrepancies in the file structure and/or missing dependencies.
  • Logs from all of our apps are being populated in NR.

@nickumia-reisys changed the title from "Enable Alerting for Cloud.gov App Logs" to "Enable Logging for Cloud.gov App Logs" May 12, 2022
@nickumia-reisys added this to the Sprint 20220512 milestone May 12, 2022
@nickumia-reisys added the O&M and Notifications labels Oct 7, 2023
@nickumia-reisys moved this to 🗄 Closed in data.gov team board Oct 7, 2023
6 participants