-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Deduplication Bug in Semgrep JSON Report Causing Mitigation of the Original Finding #11227
Comments
I tested the same scenario, but this time I re-uploaded the same report using the UI (Ad Hoc Import), and the problem did not occur. It seems the issue only arises when importing the report through subsequent API imports, and possibaly related to engagement type, which are different (CI/CD vs Interactive). Here are the steps I followed:
I plan to create an engagement with the type "Interactive" and then upload the test report into that engagement. This approach will be used instead of "auto creating context" in the import-scan request to determine if it resolves the issue. Update: |
@farsheedify
HTH |
Thanks for your reply @mtesauro
I can provide more accurate test data/scenario if it can help. |
@farsheedify This is super interesting. Is it possible you can add a sanitized scan file that shows this behavior so we can reproduce the issue? |
Sure. I ran the scan on some sample files. To reproduce the issue, first upload semgrep-report_Line31.json using the following curl command: curl -X "POST" "https://demo.defectdojo.org/api/v2/import-scan/" -H "accept: application/json" -H "Authorization: Token ${{ secrets.DEFECTDOJO_TOKEN }}" -H "Content-Type: multipart/form-data" -F [email protected] -F "product_type_name=clubpay" -F "active=true" -F "verified=true" -F "close_old_findings=true" -F "engagement_name=${GITHUB_RUN_ID}(SAST)" -F "build_id=${GITHUB_RUN_ID}" -F "minimum_severity=Info" -F "close_old_findings_product_scope=true" -F "scan_date=$TODAY" -F "engagement_end_date=$TODAY" -F "commit_hash=${GITHUB_SHA}" -F "product_name=${product_name}" -F "auto_create_context=true" -F "scan_type=Semgrep JSON Report" -F "tags=${GITHUB_REF_NAME}" -F "source_code_management_uri=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}" Next, upload semgrep-report_second_run_Line24.json using the same curl command. After the second upload, view all findings for the product. You will see two findings with the title: python.lang.security.audit.dangerous-subprocess-use-tainted-env-args.dangerous-subprocess-use-tainted-env-args The first finding (Original Finding) will be marked as "inactive, verified, mitigated" and is detected on sample/brute.py (Line 31). The second finding will be marked as "inactive, duplicate" and is detected on sample/brute.py (Line 24). |
@farsheedify Thanks this helps a ton. |
I tested the scenario outlined above and have determined that close old findings and deduplication are working as expected, but are stomping on each other. Because close old findings on the import level waits for all findings to be created, and then compares the hash codes, deduplication is running when the finding is saved. So on a finding that is expected to be a duplicate, deduplication will get to it first in the background via a celery worker, mark the new finding as a duplicate, and close it. Close old findings will get the saved finding back, but before deduplication changes its state, and is technically a stale record in this specific case. Close old findings will then close the original finding, and leave the new one open These two smart features will then conflict with each other. Deduplication is saying "mark any new findings as mitigated because they are a duplicate of an existing finding" whereas close old findings says "mark older findings as mitigated because I want to focus on the newest finding found". This use case is essentially cancelling itself out, but leaving the state of the system in a weird spot with everything being closed. A potential mitigation for this is to force the close old finding function (and therefore the entire import process) to wait on the async saving processes to finish, and then fetch the latest version of the same findings. Unfortunately, this will remove all performance gains of moving those processes to the background, for the sake of a very specific configuration that is not necessarily logical. My advice here would be to assess your needs, and pick either close old findings or deduplication. |
@farsheedify Maffooch makes great observations in ☝️ I couldn't have written it up better. |
Thanks for your time and investigation. Additionally, I tested this scenario using the "reimport" endpoint, and the issue did not occur. How is the reimport endpoint different from the import endpoint in terms of deduplication and closing old findings? Is it possible to have that behavior in the import endpoint as well? Although migrating to re-import endpoint from the import endpoint is super challenging for us due to this issue #3958. I do not think this is a rare scenario; although it might go unnoticed for a long time. We had this issue for more than two years, but it came to our attention when we decided to export DefectDojo's data into a Power BI dashboard for data analysis. Our data analyst noticed that roughly more than 80 percent of our mitigated findings across 50 teams, with about 250 services, are related to SAST findings. After a deeper investigation, we found that most of these mitigations are due to changes in the position of the code (line number) and not actually the bug being fixed. This issue might be hidden for many clients. We decided to test switching to the "Semgrep JSON Report" instead of the "GitLab SAST Report" to migrate from "legacy" deduplication (which includes the line number in the hash) to the "unique ID from the tool" (which is the fingerprint of the Semgrep report for each finding). However, this issue is preventing us from switching our importing strategy. I think the "calculation" of the hash code for deduplication is not applicable in this scenario as DefectDojo already has the unique ID from the tool (the fingerprint field in the JSON report of Semgrep). From your clarifications, which I could be wrong about, it seems that the deduplication algorithm and closing old findings cannot be used together (which is not the case, as we are already using them together for all our test types from the start with no issues). |
The code that marks the existing finding as mitigated (incorrectly) is in the import process. It looks like maybe something is going wrong there since the presence of the unique id should prevent stuff like this from happening. The hash codes for the existing finding and the "new" finding are different, so something in the hashcode calculation is going wrong. When deduplication is disabled, the finding is still marked as mitigated. So I don't think deduplication is the cause of this. |
Looks like the I think the What do you think @Maffooch @mtesauro
@farsheedify The problem in 3958 should not happen for Semgrep because of the unique id. Are you seeing this problem with Semgrep, or did you mention it because you saw it with other parsers? |
Thanks for your response @valentijnscholten. The main issue with migrating from the import endpoint to the re-import endpoint is that the re-import endpoint operates on an engagement scope (except for the initial import). This means that if bugs in the same test type are not addressed during the first call to the re-import endpoint (which functions as an import), there is no way to close those bugs in the future, even if they are fixed. This situation causes significant trouble as we lose the "close old findings" functionality in 'product scope'. Specifically for issue 3958, we have not tested the Semgrep JSON Report, as we have been uploading Semgrep reports using the GitLab SAST Report format, and we faced that issue with this test type, and other parsers. Currently we are using Anchore Grype, Trivy, KICS and Gitleaks other than Semgrep. |
Bug description
In our organization, we are experiencing an issue with SAST scans where changes in code (the position of the vulnerable line) cause DefectDojo to mark the finding as mitigated and create a new finding for the same bug at the new line position. Currently, we are uploading the report with the scan type set to GitLab SAST Report in our CI/CD. As I understand, the default behavior for deduplication of static scans includes the line number in the algorithm.
I noticed that for Semgrep JSON Reports, DefectDojo uses the fingerprint ID provided by Semgrep for deduplication. I decided to test this change in the demo instance of DefectDojo to see if it could resolve our issue with SAST findings being closed and reopened due to line changes.
I scanned a sample repository on GitHub with Semgrep and generated a JSON report. I uploaded the report using the following curl command on the demo server:
curl -X "POST" "https://demo.defectdojo.org/api/v2/import-scan/" -H "accept: application/json" -H "Authorization: Token ***" -H "Content-Type: multipart/form-data" -F [email protected] -F "product_type_name=ptype_name" -F "active=true" -F "verified=true" -F "close_old_findings=true" -F "engagement_name=${GITHUB_RUN_ID}(SAST)" -F "build_id=${GITHUB_RUN_ID}" -F "minimum_severity=Info" -F "close_old_findings_product_scope=true" -F "scan_date=$TODAY" -F "engagement_end_date=$TODAY" -F "commit_hash=${GITHUB_SHA}" -F "product_name=${product_name}" -F "auto_create_context=true" -F "scan_type=Semgrep JSON Report" -F "tags=${GITHUB_REF_NAME}" -F "source_code_management_uri=${GITHUB_SERVER_URL}/${GITHUB_REPOSITORY}"
Originally, a bug was found on the 26th line of a file. After changing that line of code to the 21st line and rerunning the scan, the original bug was marked as mitigated, and the new bug was marked as a duplicate. Now, the bug has disappeared from the active bugs. The image below summarizes the situation. The bug and the file are the same. It seems that DefectDojo correctly marked the new finding as a duplicate of the previous finding (since only the line changed, but the fingerprints are the same). However, it incorrectly marked the original bug as mitigated, causing it to be removed from the active findings.
Steps to reproduce
Steps to reproduce the behavior:
Expected behavior
When rerunning the scan on the same file, if there is a change in the line of code, DefectDojo should:
1.Check the Fingerprint First: Use the fingerprint in the Semgrep report to identify findings.
2. Mark New Findings Correctly: Mark newly imported findings with similar instances (with different line numbers) as active and verified, since they can be assumed the latest instances.
3. Handle Original Findings Appropriately: Mark the original findings as duplicates (since they have outdated line of code).
Since the new finding contains the latest, updated line of code, it should be kept as the active finding. Ideally, DefectDojo should update the original finding's line of code without creating a new finding. This approach ensures that:
The text was updated successfully, but these errors were encountered: