-
Notifications
You must be signed in to change notification settings - Fork 19
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reimplement Coraza-SPOA #103
Conversation
with: | ||
go-version: '1.21' | ||
|
||
- name: setup environment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd really avoid going in this direction, this assumes a haproxy is running in local which is going to be a big friction to those willing to contribute. Personally I would never install haproxy locally to debug a problem so I suggest we spin up a haproxy instance using docker with testcontainers
api (see https://pkg.go.dev/github.com/testcontainers/testcontainers-go) or just docker-compose so the test is selfcontained and does not assume I have something installed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I explicitly did this without using containers (atleast inside the CI) as its faster to run. Locally and in production I would always recommend running without docker as the performance difference is huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Also I dont have docker on my machine since its just very slow on macos)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really prefer to have something that is portable. It is OK if we want CI to run haproxy locally but we also need to provide a way to run it with docker. Again here we prioritize user experience if I have to install locally haproxy to contribute I'd rather not contribute hence we use containers. For advanced users they can definitively use local haproxy.
internal/e2e_test.go
Outdated
"github.com/dropmorepackets/haproxy-go/pkg/testutil" | ||
) | ||
|
||
const directives = ` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use go:embed
for this, also in general we would like to keep a consistent http e2e suite by suing https://github.com/corazawaf/coraza/tree/main/http/e2e which is being used in coraza http handler, caddy and coraza-proxy-wasm. Running it is as easy as spin up an instance and run a go command, see https://github.com/corazawaf/coraza-caddy/blob/main/magefile.go#L74
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should I use embed for a single string? Inside the coraza tests its done the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But yeah I can migrate the e2e tests to run the command but I am not a fan of running commands and basically ignoring the go test framework. If the Upstream would properly expose the tests as a list we could run these in parallel and reduce testing times while also reducing the amount of external dependencies
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I opened an issue on coraza itself to expose this and make it easier for other repos to use the e2e test suite.
corazawaf/coraza#1006
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Waiting for a coraza release to fix this up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe it is better you use master for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good, I added a few comments and I also think we are missing some unit tests, specifically for functions around parsing request elements. Great work!
ee4e7f7
to
f0db161
Compare
Sorry that I didn't update on this again but I am currently working on getting a talk done. Will try to get the other points done by the end of next week |
I think this can be reviewed and tested now. There is still a fd leak when reloading and changing the logger, also when there is a reload with logger change and then an a reload that reverts this, the old file can be raced. There should probably a small wrapper around the logger to provide prevent multiple writers to the same file, but I guess this can wait. |
@fionera Can you solve the conflicts here? |
Sure will do that |
Updated :) |
This allows us to not rely on the http header, but instead use a variable inside the transaction.
This does leak fds as the logging target never gets closed, this has to be fixed before release but this will work for now.
It is not correctly configured anyway and we have go specific linters enabled
I don't know why I didn't just close it after leaving the function. There is no other use of the fd...
And now without missing to remove an old file 🗡️ |
Since we're having stability issues with the current main branch (See: #97) I've been following this rewrite closely hoping it would get merged and fix our issues. But since merging is stalled since a while I got curious and went ahead with deploying the rewrite branch anyway and putting some of our production traffic against it for a test. But unfortunately, this didn't go well - I couldn't make any conclusions on the stability vs. main because the rewrite branch has a massive memory leak issue and so I couldn't keep it running it for long. So after less than two hours the coraza-spoa daemon's memory use grew above 4 GB and additionally it also used a lot more CPU (about twice as much or more) than the main branch version. And so I ended the experiment again. My setup uses the CRS v4.7.0 plus a few custom exclusion rules to deal with false positives. Response checking is disabled. Config options and rules were the same when doing this comparison, so the only thing that differs is that one time I ran the coraza-spoa daemon binary built from the rewrite branch and the other time that from the main branch. First, here's a graph of how the overall system's memory consumption developed after enabling the rewrite branch coraza-spoa daemon. There's an obvious memory leak which forced me to stop the experiment and downgrade to the main branch version again after ~2 hrs. And, after switching back to main, the memory use instantly became stable again: And also for CPU usage there's a clear difference, so see the CPU consumption graph below that depicts the CPU usage of the coraza-spoa daemon process. While running the rewrite branch version the CPU consumption is consistently much higher and after downgrading to the main branch version it drops again to a much lower level: What I further noticed is that I got very frequently messages like the following below in our /var/log/messages when running the rewrite branch version: coraza-spoa[3659835]: 2024/09/28 16:13:01 write tcp 127.0.0.1:9000->127.0.0.1:50932: write: broken pipe With the main branch coraza-spoa daemon we're also sometimes seeing such "broken pipe" log messages but they are very rare, so they happen just every couple of days. But while running the rewrite branch version this happened all the time, like there were some such messages logged almost every minute. So I'm wondering if there are also functional issues then (beyond just the resource consumption problems). Or if maybe these "broken pipe" errors also cause the memory leak (perhaps under these error conditions memory isn't properly released?). But so unfortunately, I have no clue what the root cause of these issues might be. Perhaps, it's not even the code in the rewrite branch itself, so the rewrite branch also brings in a lot of newer dependencies - hence the memory leak could also be caused by a dependency (though less likely). So I can only share these observations from our first real-world test now and unfortunately it seems the rewrite branch isn't quite ready for general use yet. But hopefully these remaining issues can still be solved now... |
Perhaps this could help you out to find the root cause of the issue: "A "broken pipe" error typically indicates that a process was attempting to write data to another process or socket, but the receiving process was closed or became unavailable before the data transfer was complete. This can occur for various reasons, including: Network Issues: If the communication was happening over a network, a drop in network connectivity or a closed connection can result in a broken pipe. |
Did you check with before the fixes I pushed a couple days ago? I found the resource leak which was because of a wrong usage of defer statements. The brokenn pipe issues are happening when haproxy decides to close the connection because of either too long evaluation times or too many requests. I am currently investivating why exactly haproxy does that and will keep you updated. |
Yes, I used the latest version available (commit 6ea3e43) |
I cannot reproduce this behaviour on my side. Can you run the newest commit with the memory profile flag? |
If we write it too early and eviction triggers, we will return it to the pool and a race occurs.
Can you try with the newest master? I fixed the disconnect handling and added a memprofile flag. If the memory leak is still happening I would love to get a pprof to debug it further. Thank you :) |
I've tried the latest rewrite branch code and I have partially good news: The memory leak is gone, so I couldn't reproduce this anymore. Also the CPU usage was much better and on around the same level than the version from current main branch. So it looks like the memory leak previously was also eating CPU cycles and that's why CPU usage was higher. So the resource usage issues seem to be solved now, not worse than main branch anymore. However, I'm still seeing "broken pipe" etc. output in /var/log/messages frequently which doesn't happen when running the code from the main branch:
So this is just a short snippet from the log but it's those same three errors ("frame is too big", "use of closed network connection" and "broken pipe") that repeat again and again. And "frame is too big" seems new, so I think I didn't see this before when I tested the earlier version that still had the memory leak. NOTE: I tested this all still with go 1.21 because I don't have a newer version on AlmaLinux 9 and was too lazy to build the newer one from source. So I've reverted your latest commit d0ba9f8 and also changed the go version in go.mod back to 1.21 to make it work. And I don't know if that's a problem, so if there's anything why go >= 1.23.1 is really required... |
@@ -1,27 +1,30 @@ | |||
module github.com/corazawaf/coraza-spoa | |||
|
|||
go 1.19 | |||
go 1.23.1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With go 1.23.1 the application fails to start for me.
With 1.22.3 (as used by the waf-core https://github.com/corazawaf/coraza/blob/main/go.mod#L3) it works without any config changes
Error:
Oct 06 16:43:57 lb01 coraza[79533]: {"level":"fatal","error":"initializing application '\\x00': invalid WAF config from string: failed to readfile: read /etc/coraza-spoa/coraza.conf: invalid argument","time":"2024-10-06T16:43:57+02:00","message":"Failed creating applications"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same error with go 1.23.2 :
9:45PM FTL Failed creating applications error="initializing application '\x00': invalid WAF config from string: failed to readfile: read /opt/coraza-spoa/etc/coraza.conf: invalid argument"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Will try to reproduce
I get an error when result processing is enabled ( response_check: true ) : 9:49PM PNC Error handling request error="response id is empty"
|
The ID argument is using the wrong var. It would have to be txn.coraza.id |
Thank you, it works correctly with tnx.coraza.id parameter |
Merged. Work will continue from here; v0.1 was created to keep the old version. This will be tagged as 1.0.0-experimental |
Another go at updating coraza-spoa to a the up to date coraza version and a faster spop library. It also uses upstream e2e tests
Q: Why rewrite instead of incremental changes?
A: I didn't had time nor motivation to figure out why some things are how they are. It was just easier to do it like this.
TODOs:
Other PRs that are similar:
Closes #84 #85 #90 #88 #75