
Merge branch 'main' into dependabot/pip/dill-0.3.9
swashko authored Nov 8, 2024
2 parents 8a9f6e9 + 024100c commit 5dc1bd6
Showing 6 changed files with 49 additions and 25 deletions.
47 changes: 29 additions & 18 deletions README.md
@@ -15,9 +15,9 @@ This needs to change, and proper tooling is the first step.

![ModelScan Preview](/imgs/modelscan-unsafe-model.gif)

ModelScan is an open source project from [Protect AI](https://protectai.com/) that scans models to determine if they contain
unsafe code. It is the first model scanning tool to support multiple model formats.
ModelScan currently supports: H5, Pickle, and SavedModel formats. This protects you
ModelScan is an open source project from [Protect AI](https://protectai.com/?utm_campaign=Homepage&utm_source=ModelScan%20GitHub%20Page&utm_medium=cta&utm_content=Open%20Source) that scans models to determine if they contain
unsafe code. It is the first model scanning tool to support multiple model formats.
ModelScan currently supports: H5, Pickle, and SavedModel formats. This protects you
when using PyTorch, TensorFlow, Keras, Sklearn, XGBoost, with more on the way.

## TL;DR
@@ -38,9 +38,9 @@ modelscan -p /path/to/model_file.pkl

Models are often created from automated pipelines; others may come from a data scientist’s laptop. In either case, the model needs to move from one machine to another before it is used. The process of saving a model to disk is called serialization.

A **Model Serialization Attack** is where malicious code is added to the contents of a model during serialization (saving) before distribution — a modern version of the Trojan Horse.

The attack functions by exploiting the saving and loading process of models. When you load a model with `model = torch.load(PATH)`, PyTorch opens the contents of the file and begins running the code within it. The second you load the model, the exploit has executed.

A **Model Serialization Attack** can be used to execute:

@@ -51,14 +51,27 @@ A **Model Serialization Attack** can be used to execute:

These attacks are incredibly simple to execute and you can view working examples in our 📓[notebooks](https://github.com/protectai/modelscan/tree/main/notebooks) folder.
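As a toy sketch of the mechanism (using only Python's built-in `pickle` and `os`; the payload class and file name are hypothetical, and the notebooks above contain the real, working examples), this is all it takes for a serialized "model" to run a command at load time:

```python
import os
import pickle


class MaliciousPayload:
    def __reduce__(self):
        # __reduce__ tells pickle how to rebuild the object on load;
        # an attacker can make "rebuilding" mean running an arbitrary command.
        return (os.system, ("echo 'code executed on load!'",))


# Attacker: serialize the payload and ship it as a "model" file.
with open("malicious_model.pkl", "wb") as f:
    pickle.dump(MaliciousPayload(), f)

# Victim: loading the "model" runs the command before any ML code is touched.
with open("malicious_model.pkl", "rb") as f:
    pickle.load(f)
```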

## Enforcing And Automating Model Security

ModelScan offers robust open-source scanning. If you need comprehensive AI security, consider [Guardian](https://protectai.com/guardian?utm_campaign=Guardian&utm_source=ModelScan%20GitHub%20Page&utm_medium=cta&utm_content=Open%20Source). It is our enterprise-grade model scanning product.

![Guardian Overview](/imgs/guardian_overview.png)

### Guardian's Features:

1. **Cutting-Edge Scanning**: Access our latest scanners, broader model support, and automatic model format detection.
2. **Proactive Security**: Define and enforce security requirements for Hugging Face models before they enter your environment—no code changes required.
3. **Enterprise-Wide Coverage**: Implement a cohesive security posture across your organization, seamlessly integrating with your CI/CD pipelines.
4. **Comprehensive Audit Trail**: Gain full visibility into all scans and results, empowering you to identify and mitigate threats effectively.

## Getting Started

### How ModelScan Works

If loading a model with your machine learning framework automatically executes the attack,
how does ModelScan check the content without loading the malicious code?

Simple: it reads the content of the file one byte at a time just like a string, looking for
code signatures that are unsafe. This makes it incredibly fast, scanning models in the time it
takes for your computer to process the total filesize from disk (seconds in most cases). It is also secure.
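As a rough, hypothetical illustration of that idea (this is not ModelScan's implementation, and the blocklist below is invented for the example), a pickle file can be inspected opcode by opcode without ever executing it:

```python
import pickletools

# Invented, minimal blocklist -- ModelScan's real rules are far more complete.
UNSAFE_MODULES = {"os", "posix", "nt", "subprocess", "builtins"}


def find_unsafe_globals(path: str) -> list:
    """List suspicious imported globals found in a pickle, without loading it."""
    findings = []
    with open(path, "rb") as f:
        data = f.read()
    recent_strings = []  # string args that may feed a STACK_GLOBAL opcode
    for opcode, arg, _pos in pickletools.genops(data):
        if opcode.name == "GLOBAL":
            module, name = arg.split(" ", 1)
            if module in UNSAFE_MODULES:
                findings.append((module, name))
        elif opcode.name in ("SHORT_BINUNICODE", "BINUNICODE", "UNICODE"):
            recent_strings = (recent_strings + [arg])[-2:]
        elif opcode.name == "STACK_GLOBAL" and len(recent_strings) == 2:
            module, name = recent_strings
            if module in UNSAFE_MODULES:
                findings.append((module, name))
    return findings


print(find_unsafe_globals("malicious_model.pkl"))  # e.g. [('posix', 'system')]
```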

@@ -78,7 +91,7 @@ it opens you up for attack. Use your discretion to determine if that is appropriate

### What Models and Frameworks Are Supported?

This will be expanding continually, so look out for changes in our release notes.

At present, ModelScan supports any Pickle derived format and many others:

@@ -90,7 +103,7 @@ At present, ModelScan supports any Pickle derived format and many others:
| | [keras.models.save(save_format= 'keras')](https://www.tensorflow.org/guide/keras/serialization_and_saving) | Keras V3 (Hierarchical Data Format) | Yes |
| Classic ML Libraries (Sklearn, XGBoost etc.) | pickle.dump(), dill.dump(), joblib.dump(), cloudpickle.dump() | Pickle, Cloudpickle, Dill, Joblib | Yes |

### Installation
ModelScan is installed on your system as a Python package (Python 3.9 to 3.12 supported). As shown above, you can install
it by running this in your terminal:

@@ -114,7 +127,7 @@ pip install 'modelscan[ tensorflow, h5py ]'

ModelScan supports the following arguments via the CLI:

| Usage | Argument | Explanation |
|----------------------------------------------------------------------------------|------------------|---------------------------------------------------------|
| ```modelscan -h ``` | -h or --help | View usage help |
| ```modelscan -v ``` | -v or --version | View version information |
@@ -143,9 +156,9 @@ Once a scan has been completed you'll see output like this if an issue is found:
![ModelScan Scan Output](https://github.com/protectai/modelscan/raw/main/imgs/cli_output.png)

Here we have a model that contains unsafe operators for both `ReadFile` and `WriteFile`.
Clearly we do not want our models reading and writing files arbitrarily. We would now reach out
to the creator of this model to determine what they expected this to do. In this particular case
it allows an attacker to read our AWS credentials and write them to another location.

That is a firm NO for usage.

@@ -182,7 +195,7 @@ to learn more!

## Licensing

Copyright 2023 Protect AI
Copyright 2024 Protect AI

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
@@ -201,9 +214,7 @@ limitations under the License.
We were heavily inspired by [Matthieu Maitre](http://mmaitre314.github.io), who built [PickleScan](https://github.com/mmaitre314/picklescan).
We appreciate the work and have extended it significantly with ModelScan. ModelScan is OSS’ed in a similar spirit to PickleScan.

## Contributing

We would love to have you contribute to our open source ModelScan project.
If you would like to contribute, please follow the details on the [Contribution page](https://github.com/protectai/modelscan/blob/main/CONTRIBUTING.md).
Binary file added imgs/guardian_overview.png
8 changes: 2 additions & 6 deletions modelscan/modelscan.py
@@ -91,11 +91,7 @@ def _iterate_models(self, model_path: Path) -> Generator[Model, None, None]:
with Model(file) as model:
yield model

if (
not _is_zipfile(file, model.get_stream())
and Path(file).suffix
not in self._settings["supported_zip_extensions"]
):
if not _is_zipfile(file, model.get_stream()):
continue

try:
@@ -114,7 +110,7 @@ def _iterate_models(self, model_path: Path) -> Generator[Model, None, None]:
continue

yield Model(file_name, file_io)
except zipfile.BadZipFile as e:
except (zipfile.BadZipFile, RuntimeError) as e:
logger.debug(
"Skipping zip file %s, due to error",
str(model.get_source()),
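The broadened `except` above matters because Python's `zipfile` raises `RuntimeError`, not `BadZipFile`, when a member is encrypted and no password is supplied. A minimal sketch of that pattern (not the repository's code; names are placeholders):

```python
import io
import zipfile


def iter_readable_members(data: bytes):
    """Yield (name, bytes) for each member we can read, skipping the rest."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                try:
                    yield name, zf.read(name)
                except RuntimeError as err:
                    # Raised for encrypted members read without a password, e.g.
                    # "File 'test.txt' is encrypted, password required for extraction"
                    print(f"skipping {name}: {err}")
    except zipfile.BadZipFile as err:
        print(f"not a valid zip archive: {err}")
```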
1 change: 1 addition & 0 deletions modelscan/settings.py
@@ -128,6 +128,7 @@ class SupportedModelFormats:
"bdb": "*",
"pdb": "*",
"shutil": "*",
"asyncio": "*",
},
"HIGH": {
"webbrowser": "*", # Includes webbrowser.open()
Binary file added tests/data/password_protected.zip
18 changes: 17 additions & 1 deletion tests/test_modelscan.py
@@ -10,6 +10,7 @@
import dill
import pytest
import requests
import shutil
import socket
import subprocess
import sys
@@ -331,6 +332,10 @@ def file_path(tmp_path_factory: Any) -> Any:

initialize_data_file(f"{tmp}/data/malicious14.pkl", malicious14_gen())

shutil.copy(
f"{os.path.dirname(__file__)}/data/password_protected.zip", f"{tmp}/data/"
)

return tmp


@@ -1361,7 +1366,18 @@ def test_scan_directory_path(file_path: str) -> None:
"benign0_v3.dill",
"benign0_v4.dill",
}
assert results["summary"]["skipped"]["skipped_files"] == []
assert results["summary"]["skipped"]["skipped_files"] == [
{
"category": "SCAN_NOT_SUPPORTED",
"description": "Model Scan did not scan file",
"source": "password_protected.zip",
},
{
"category": "BAD_ZIP",
"description": "Skipping zip file due to error: File 'test.txt' is encrypted, password required for extraction",
"source": "password_protected.zip",
},
]
assert results["errors"] == []


