
Benchmark Proposals and Discussion #2

Open
ttj opened this issue Apr 3, 2024 · 58 comments

Comments

@ttj
Contributor

ttj commented Apr 3, 2024

Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.

The purpose of this thread is to present your benchmarks and provide preliminary files for feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.

To propose a new benchmark, please create a public git repository with all the necessary code.

The repository must be structured as follows:

  • It must contain a generate_properties.py file that accepts the seed as its only command-line parameter (a minimal sketch of such a script is shown after this list).
  • There must be a folder with all .vnnlib files, which may be the same folder that contains generate_properties.py.
  • There must be a folder with all .onnx files, which may likewise be the same folder that contains generate_properties.py.
  • The generate_properties.py file will be run using Python 3.8 on a t2.large AWS instance (see https://vnncomp.christopher-brix.de/).
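For illustration only, here is a minimal sketch of what such a generate_properties.py could look like. It is not taken from any submitted benchmark; the network name, input/output sizes, epsilon, instance count, and timeout below are placeholder assumptions. It only demonstrates the expected interface: take the seed as the sole command-line argument, emit the .vnnlib files, and list the (onnx, vnnlib, timeout) triples in instances.csv.

```python
#!/usr/bin/env python3
# Hypothetical sketch of a benchmark's generate_properties.py; all names
# and numbers are placeholders, not any benchmark's actual values.
import csv
import os
import random
import sys


def write_vnnlib(path, center, epsilon, num_outputs, target):
    """Write a simple L-infinity robustness query around `center`."""
    with open(path, "w") as f:
        for i in range(len(center)):
            f.write(f"(declare-const X_{i} Real)\n")
        for i in range(num_outputs):
            f.write(f"(declare-const Y_{i} Real)\n")
        for i, x in enumerate(center):
            f.write(f"(assert (>= X_{i} {x - epsilon}))\n")
            f.write(f"(assert (<= X_{i} {x + epsilon}))\n")
        # Negated property: some non-target output reaches the target's score.
        f.write("(assert (or\n")
        for i in range(num_outputs):
            if i != target:
                f.write(f"  (and (>= Y_{i} Y_{target}))\n")
        f.write("))\n")


def main():
    seed = int(sys.argv[1])  # the seed is the only command-line parameter
    rng = random.Random(seed)
    os.makedirs("vnnlib", exist_ok=True)
    rows = []
    for idx in range(10):  # number of instances: placeholder
        center = [rng.uniform(0.0, 1.0) for _ in range(5)]
        target = rng.randrange(3)
        vnnlib_path = f"vnnlib/prop_{idx}.vnnlib"
        write_vnnlib(vnnlib_path, center, epsilon=0.01,
                     num_outputs=3, target=target)
        rows.append(["onnx/network.onnx", vnnlib_path, 300])  # timeout in s
    with open("instances.csv", "w", newline="") as f:
        csv.writer(f).writerows(rows)


if __name__ == "__main__":
    main()
```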
@ttj
Contributor Author

ttj commented Apr 3, 2024

As one mechanism to address benchmarks not being available until far too close to the evaluation timeframe, we have now solicited benchmark registrations via the Google form registration process. These are the benchmarks submitted as of today (4/3) and discussed in the rules/kickoff meeting:

https://github.com/AI4OPT/ml4acopf_benchmark

https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/ -> aggregated into a formal benchmark here: https://github.com/dynaroars/vnncomp-benchmark-generation

https://github.com/loonwerks/vnncomp2023

https://github.com/ANTONIONLP/safeNLP

For any other benchmark proposals, please comment on this issue with a link to the benchmark details

@naizhengtan

We want to propose NN4SysBench (paper).
Here is the repo from last year: https://github.com/Khoury-srg/VNNComp23_NN4Sys

(We will likely add a new application this year. Will update this post when done.)

@kollerlukas

We would like to propose our benchmark: Verify1s (https://github.com/kollerlukas/cora-vnncomp2024-benchmark).

@bwbellmath

bwbellmath commented May 7, 2024

We updated last year's ml4acopf_benchmark to offer two alternative ReLU-only options for the same models. These are provided as 6 new onnx files designed to make this benchmark more accessible to teams who have not implemented the sigmoid, sin, or cos activation functions. The new onnx files are attached and will be included in the previous GitHub repo pending a pull request.

The main change is that each of the sigmoid, sin, and cos nodes has been replaced with a ReLU-based model with error bounded by 10E-3. There is a residual configuration, which takes advantage of residual connections to make periodic approximations of sine and cosine, and a non-residual configuration, which uses minimally complex structures to emulate sine and cosine only within the region from -pi/2 to pi/2.
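As a side note for readers unfamiliar with the idea, replacing a smooth activation by ReLUs on a bounded interval can be done with a single hidden ReLU layer that implements the piecewise-linear interpolant. The sketch below is purely illustrative (it is not the benchmark's actual construction; the knot count is my own choice):

```python
# Illustration only: piecewise-linear interpolation of sin on [-pi/2, pi/2],
# written as one hidden ReLU layer. More knots -> smaller error.
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


knots = np.linspace(-np.pi / 2, np.pi / 2, 65)
vals = np.sin(knots)
slopes = np.diff(vals) / np.diff(knots)

# Hidden layer: unit input weights and biases -knots[:-1];
# output layer: weights are the slope changes, bias is sin(knots[0]).
out_w = np.concatenate(([slopes[0]], np.diff(slopes)))
out_b = vals[0]


def sin_relu(x):
    """Evaluate the ReLU network at points x inside [knots[0], knots[-1]]."""
    return out_b + relu(x[:, None] - knots[:-1]) @ out_w


x = np.linspace(-np.pi / 2, np.pi / 2, 10001)
print("max |approx - sin| =", np.max(np.abs(sin_relu(x) - np.sin(x))))
```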

Edit: We did some cleanup and have further eliminated ConstantOfShape, Where, Equal, Expand, Pow, and Neg node types from the linearized models provided within https://github.com/AI4OPT/ml4acopf_benchmark.

Our updated benchmarks offer a contrast between a higher-complexity nonlinear problem, which pushes the boundaries of nonlinear verification, and two linearized models of the same problems, which may produce new constraint violations.

@aliabigdeli

aliabigdeli commented May 7, 2024

Hello! We are considering introducing a new benchmark.

Assume we approximate a neural network controller with a piecewise-linear model, i.e., a set of local linear models with added noise to account for the local linearization error. The objective of this benchmark is to check whether the neural network output falls within the range we obtain from our linear model output plus some uncertainty. Given input $x \in X$ and output $Y = f_{NN}(x)$, the query for each linear model, within its abstraction region, is of the form: $A_{mat}\times X + b + U_{lb} \leq Y \leq A_{mat}\times X + b + U_{ub}$.

The neural network controller we used in this benchmark is an image-based controller for an Autonomous Aircraft Taxiing System whose goal is to control an aircraft’s taxiing at a steady speed on a taxiway. This network was introduced in the paper Verification of Image-based Neural Network Controllers Using Generative Models. The neural network integrates a concatenation of the cGAN (conditional GAN) and controller, resulting in a unified neural network controller with low-dimensional state inputs. In this problem, the inputs to the neural network consist of two state variables and two latent variables. The aircraft’s state is determined by its crosstrack position (p) and heading angle error (θ) with respect to the taxiway center line. Two latent variables with a range of -0.8 to 0.8 are introduced to account for environmental changes.

Because in this case the output spec depends on both the input and the output, and considering the VNN-LIB limitation, we added a skip-connection layer to the neural network so that the input values are present in the output space. We also added one linear layer after that to create a linear equation for each local model (a sketch of this construction is shown below).
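To make the construction concrete, here is a rough PyTorch sketch of one way such a wrapper could be built. This is my own illustration under assumed dimensions, not the benchmark's actual code: it folds the skip connection and the extra linear layer into a wrapped output $Z = Y - (A_{mat} x + b)$, so that the specification reduces to plain output bounds $U_{lb} \leq Z \leq U_{ub}$.

```python
# Hypothetical sketch: wrap a controller f_NN so the exported ONNX model
# outputs Y - (A x + b); dimensions and weights below are placeholders.
import torch
import torch.nn as nn


class LinearizedSpecWrapper(nn.Module):
    def __init__(self, controller, A, b):
        super().__init__()
        self.controller = controller
        # Store the local linear model as constants of the graph.
        self.register_buffer("A", A)  # shape: (n_out, n_in)
        self.register_buffer("b", b)  # shape: (n_out,)

    def forward(self, x):
        y = self.controller(x)
        # Skip connection: reuse the input to subtract the local linear model.
        return y - (x @ self.A.T + self.b)


if __name__ == "__main__":
    n_in, n_out = 4, 2  # placeholder dimensions
    controller = nn.Sequential(nn.Linear(n_in, 16), nn.ReLU(),
                               nn.Linear(16, n_out))
    A, b = torch.randn(n_out, n_in), torch.randn(n_out)
    wrapped = LinearizedSpecWrapper(controller, A, b)
    torch.onnx.export(wrapped, torch.zeros(1, n_in), "wrapped.onnx")
```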

The benchmark repo: https://github.com/aliabigdeli/LinearizeNN_benchmark2024

The idea of this benchmark came from one of our projects in which we approximated the NN controller with a piecewise-linear model, and we wanted to check whether the neural network output falls within the range we obtained from our linear model output plus some uncertainty. However, in most areas of the state space, verification by state-of-the-art tools fails: they could neither verify nor falsify and returned a timeout, even though I increased the timeout to 12 hours.

We would greatly appreciate any feedback, opinions, and suggestions from both competition organizers and participants. Thank you!

@hocdot

hocdot commented May 13, 2024

> As one mechanism to address benchmarks not being available until far too close to the evaluation timeframe, we have now solicited benchmark registrations via the Google form registration process. These are the benchmarks submitted as of today (4/3) and discussed in the rules/kickoff meeting: [...]
>
> https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/
>
> [...]

@ttj We submitted benchmark generation tools (https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/) but the proposed benchmark is available here (https://github.com/dynaroars/vnncomp-benchmark-generation). Could you please update it?

@KaidiXu

KaidiXu commented May 15, 2024

Thanks @kollerlukas for proposing the Verify1s benchmark. I have a concern regarding the 1-second timeout currently set for the competition. The same discussion happened in previous years; please refer to here. In general, based on last year's report (see Table 7), it is evident that most verifiers require more than this 1-second threshold simply for initially loading the verifier. This overhead is unavoidable under the current rule that mandates each instance be run in an independent script, starting anew with loading the verifier each time. While batch-mode operation could potentially alleviate this overhead, it appears the competition rules remain unchanged this year, disallowing this approach. To ensure fairness and the practical applicability of the benchmark across different verification methods, I recommend increasing the timeout to at least 30 seconds, as suggested in previous years, which is longer than the overhead for all verifiers. In addition, the official report subtracts this overhead for all verifiers; as shown in the cactus plots, many instances can then be solved in less than 1 second, reflecting each verifier's capabilities.

@kollerlukas

@KaidiXu Thank you for your feedback. We were unaware that there already had been a discussion about 1s timeouts.
If our understanding is correct, currently, the timeout does not consider the overhead.
We would like the timeout to be only the verification time, without preparation time and overhead. Is there a possibility of doing this? We could increase the timeout in the vnnlib files, e.g., to 30s, and when scoring count only instances with a run time of at most (1s + overhead).

@KaidiXu

KaidiXu commented May 16, 2024

@kollerlukas Yes, you are right. In the end, the official report will subtract the overhead for all verifiers; as you can see in the cactus plots, there are indeed a lot of instances that can be solved in less than 1 second by all verifiers. However, if the vnnlib files only specify a 1-second timeout, then most verifiers will fail. I'm not sure our scoring system can support this post-processing of subtracting the overhead and only counting instances that can be verified in <=1s.

@ttj
Contributor Author

ttj commented May 17, 2024

Last call for any new benchmarks: please post by tomorrow, 5/17. We plan to make the form early next week, solicit feedback on it, and then send it to tool participants for voting with a voting response deadline of May 31. The form will also include the benchmarks considered last year. A couple of last year's benchmarks are re-proposed above in repeated/modified form, and a few that were not scored last year are re-proposed as well, so if proposers want only one version considered, please comment. For continuity and year-to-year progress tracking, we should probably include at least all the scored benchmarks from last year; for reference, last year's benchmarks are here:

https://github.com/ChristopherBrix/vnncomp2023_benchmarks

and their overview/descriptions:

stanleybak/vnncomp2023#2 (comment)

@kollerlukas

@KaidiXu We increased the timeout in the instances.csv to 30s.
@ttj Is a post-processing step to only count the instances that were solved in (1s + overhead) possible?

@wu-haoze

We want to re-propose the dist-shift benchmark from this paper.
Here is the repo: https://github.com/wu-haoze/dist-shift-vnn-comp

@ttj
Contributor Author

ttj commented May 17, 2024

> @KaidiXu We increased the timeout in the instances.csv to 30s. @ttj Is a post-processing step to only count the instances that were solved in (1s + overhead) possible?

@ChristopherBrix do you think that is possible ?

@ttj
Contributor Author

ttj commented May 17, 2024

We have created an initial draft Google form for collecting votes, linked here, to solicit feedback on any changes or anything missing. Please do not submit it yet, as more benchmarks may come in today, and we will finish adding the previous VNN-COMP'23 benchmarks to the form as well. If you have any feedback on the form/process for voting, please let us know by posting here or in the rules issue; we will post again once the form is nearly finalized, for any final feedback, before soliciting the official votes next week from the tool participants:

https://forms.gle/PQBW7tJeYUMJ9Dm3A

@shizhouxing

shizhouxing commented May 18, 2024

We would like to propose a benchmark for Lyapunov-stable neural control.

There are two models for state feedback and output feedback, respectively. Each model consists of a controller which is a shallow ReLU network, a Lyapunov function which is a quadratic function, and nonlinear operators modelling the dynamics of a 2D quadrotor. The model for output feedback further contains a shallow LeakyReLU network as the observer. The goal is to certify the Lyapunov stability, which has been encoded in the ONNX graphs and VNNLIB specifications.
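For readers unfamiliar with the setting, the kind of condition being certified is roughly the standard discrete-time Lyapunov decrease condition (this is a generic paraphrase, not necessarily the benchmark's exact encoding):

$$V(x_{t+1}) < V(x_t), \qquad x_{t+1} = f_{\mathrm{dyn}}(x_t, \pi(x_t)), \qquad V(x) > 0 \ \forall x \neq x^*, \quad V(x^*) = 0,$$

where $\pi$ is the neural controller, $f_{\mathrm{dyn}}$ the quadrotor dynamics, and $V$ the quadratic Lyapunov function, checked over a region of interest; in the output-feedback model, $\pi$ acts on the observer's estimate rather than the true state.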

Repo: https://github.com/shizhouxing/LSNC_VNNCOMP2024

@xiangruzh

Hi all,

I would like to repropose the yolo benchmark from last year. The benchmark was included in VNN-COMP 2023 but unscored.

For simplicity of verification, the network of the original YOLOv2 model is replaced with a much smaller network containing 5 residual blocks with conv and ReLU layers. It takes in an image (3 x 52 x 52) and outputs a tensor of shape 125 x 13 x 13 containing the confidence scores, the classification result, and the positions of 5 bounding boxes for each of the 13 x 13 grid cells. Only 5 of the outputs are actually used in verification.

For each instance, the property is verified if the confidence score of at least one bounding box is greater than a preset threshold given any bounded perturbations on the input image (similar to other image robustness benchmarks). Note that here we only use 5 outputs related to confidence scores, even though the output tensor seems huge. An example of the specification to check is shown here:
[Image: example output specification for the YOLO benchmark]

Here’s the repo: https://github.com/xiangruzh/Yolo-Benchmark

@huanzhang12

Glad to know that we will reconsider all 2023 benchmarks! I looked into benchmarks in 2022 and 2021, and there were also many great ones. Here I re-propose the cifar100_tinyimagenet benchmark in VNN-COMP 2022.

Since we have the division of tracks this year, to make the benchmark more accessible and friendly for new tools in this field, I reduced the complexity of this benchmark. I kept only two of the four CIFAR100 models in medium sizes (this will reduce the workload to support this benchmark). The TinyImageNet models/data may be too large for new tools in this field, so I split it into a separate benchmark. I hope at least the cifar100 benchmark (with only 2 onnx files) can be voted into the main track. Cifar100 has the same image size (32x32) as the commonly used Cifar10, and the architecture contains FC, conv, and ReLU layers only. I also removed the largest “super” model, which may pose a challenge for some tools due to too many layers. Hopefully, many tools can support the cifar100 benchmark without too much additional effort.

In 2022, the benchmark had a reproducibility problem during generation: the JPEG-decoded images could be slightly different depending on installed library versions, causing unexpected issues (e.g., the same image index could be sat on one machine but unsat on another). So I’ve saved all images into a checkpoint and loaded them directly to avoid this issue.

Here are the re-proposed benchmarks (I split the Cifar100 and TinyImageNet models into two benchmarks as mentioned above):

https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark
https://github.com/huanzhang12/vnncomp2024_tinyimagenet_benchmark

@ttj
Contributor Author

ttj commented May 21, 2024

Thanks very much everyone, we have updated the voting Google form here with all the proposed benchmarks, along with those from last year. Some of these are duplicated by new/revised proposals above; we have put everything in the form, so if benchmark proposers would like only one version considered (e.g., a couple were perhaps proposed last year but not scored), please post below which version to keep in the form:

https://forms.gle/PQBW7tJeYUMJ9Dm3A

Please take a look and let us know of any suggested edits before we open voting. Please do not submit the form until we post again saying voting is open, in case of any final revisions.

Votes will be due from tool participants by May 31, so please let us know of any feedback on the form/voting process ASAP, and at the latest by this Friday 5/24, so that tool participants will have at least a week to submit votes.

We will email the mailing list with this update as well, in case anyone is not actively tracking the GitHub issue.

@bwbellmath

bwbellmath commented May 21, 2024 via email

@ttj
Contributor Author

ttj commented May 24, 2024

thanks @bwbellmath, I have updated the form; final reminder for any feedback on the form or if we missed any benchmarks, etc.

@KaidiXu

KaidiXu commented May 25, 2024

Hi @ttj, I found that https://github.com/ANTONIONLP/safeNLP contains 10812 instances with a 120s timeout each, which is 360 hours of timeout in total.

@huanzhang12

@ttj It looks like some benchmarks have duplicated entries in the current voting form:

yolo by @xiangruzh: https://github.com/xiangruzh/Yolo-Benchmark
dist-shift by @wu-haoze: https://github.com/wu-haoze/dist-shift-vnn-comp
nn4sys by @naizhengtan: https://github.com/Khoury-srg/VNNComp23_NN4Sys

These are re-proposed benchmarks with no changes compared to last year, and their GitHub repos have not changed since last year. Shall we list them only once in the voting form, as part of the VNN-COMP 2023 benchmarks? It might cause confusion if one benchmark gets voted on in two duplicated entries (e.g., one entry is voted into the core track and the other entry is not).

@Hongji1001

Hi @ttj, I found that the generate_properties.py file is missing from https://github.com/ANTONIONLP/safeNLP.

@ttj
Contributor Author

ttj commented May 28, 2024

thanks @KaidiXu and @Hongji1001; @KatyaKom, @ANTONIONLP, and @Tgl70: can you please see the discussion above re your benchmark? It will need a little modification. Note that the max benchmark time is described in the rules (6 hours), and you need the generate_properties file. I presume the way yours is set up, you can modify it to select a subset of the instances and have generate_properties pick random test samples/instances; please ensure you follow the guidelines re: setting a seed so we can reproduce at evaluation time.

here is a representative generate_properties (thanks @huanzhang12), but you can also look at other benchmarks for examples; I presume you already have something like this somewhere, given what is in your repository with the instances:

https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark/blob/main/generate_properties.py

there may have been some confusion, as what is stored in the benchmarks repository from the last iteration is the output of some of the benchmarks (e.g., just the instances produced by generate_properties for what was actually evaluated); we will try to do a better job tracking these in the future to keep things together

@huanzhang12 thanks: yes, I will merge some of these (I had asked above, but none of the benchmark proposers responded, so it was not clear whether they were exactly the same or not)

I will post later today re opening voting at the Google form with the modifications. We will extend voting to Monday 6/3 given the few updates, but will not extend past that, so please let us know of any final updates ASAP.

@ttj
Contributor Author

ttj commented May 28, 2024

I have merged nn4sys and dist-shift duplicates in the form

There were 3 versions of YOLO last year; I cannot figure out which one is the same, can someone please clarify, @xiangruzh?

Which one is https://github.com/xiangruzh/Yolo-Benchmark equivalent to?

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cctsdb_yolo

or

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo

The other was by Collins, so I could figure that one out; from the files I think it is this one, but it would be good to have confirmation, and I will update it to be this one in the form:

https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo

@xiangruzh

@ttj
Copy link
Contributor Author

ttj commented May 28, 2024

great, thanks @xiangruzh , we have updated the voting form

Given no other comments at this time, let's go ahead and open voting at the Google form. Please coordinate with your team and submit your votes by Monday, June 3 AOE at the latest; we will post the results, and the subdivision into the 2 tracks, once they are in.

https://forms.gle/PQBW7tJeYUMJ9Dm3A

One team had already submitted votes before these final changes to the form; I saved a copy, which I can send in case it is useful, but given the changes merging a few benchmarks, please do resubmit. Our apologies for the inconvenience.

@Tgl70

Tgl70 commented May 28, 2024

Hi @ttj, sorry I've just seen the discussion.

> I presume the way yours is set up, you can modify it to select a subset of the instances and have generate_properties pick random test samples/instances; please ensure you follow the guidelines re: setting a seed so we can reproduce at evaluation time.

Yes, you are correct; I will modify it to fit within the 6-hour timeout (and follow the guidelines).
I will also add the generate_properties.py file.

@shizhouxing

Hi @hocdot , for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found it requires GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances, which may not be compatible with the AWS instance for the benchmark generation in this competition (see #2 (comment)).

Additionally, we have tried generating the benchmark locally on our GPU server. However, for the CIFAR-10 and CIFAR-100 parts of the benchmark, it was taking an extremely long time, and after several days it had still not found any valid instance during the filtering process. Could you please kindly check the benchmark, to see if you can successfully generate the instances in a fresh environment, or if there is anything that should be fixed?

Thanks.

@mldiego

mldiego commented May 30, 2024

Adding to @shizhouxing's comment: is the generate_properties.py file allowed to depend on any of the tools used in the competition? It seems the properties are filtered by some "simple heuristic" determined by verification tools (NeuralSAT, CROWN). This could lead to some unfair property generation...

@Tgl70

Tgl70 commented May 31, 2024

Hi @ttj , I have uploaded the generate_properties.py file.
Also, it now generates the instances.csv file with a 20s timeout per query and 1080 random queries (instead of the full 10812 queries), using the seed required for reproducibility.
Let me know if I need to modify it further or if it's ok now.

@ttj
Contributor Author

ttj commented Jun 3, 2024

thanks everyone, a reminder to finish voting by today, June 3; some teams have not submitted votes and they are due today AOE (we will aim to post results tomorrow)

re the discussion above about a couple of benchmarks and edits: if there are any final edits/revisions to benchmarks, as people have looked at them in detail for voting, please finalize ASAP

@ttj
Contributor Author

ttj commented Jun 4, 2024

We have compiled the benchmark votes in this spreadsheet:

https://docs.google.com/spreadsheets/d/1SYSnFhi9UzjBAAsxQp6k3Ox0YJn4vG8YjL6-0M7BvUI/edit?usp=sharing

The results are as follows; please let us know if there are any mistakes.

Regular Track Benchmarks
https://github.com/ANTONIONLP/safeNLP
https://github.com/Khoury-srg/VNNComp23_NN4Sys and https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/nn4sys
https://github.com/kollerlukas/cora-vnncomp2024-benchmark
https://github.com/aliabigdeli/LinearizeNN_benchmark2024
https://github.com/wu-haoze/dist-shift-vnn-comp and https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/dist_shift
https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark
https://github.com/huanzhang12/vnncomp2024_tinyimagenet_benchmark
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/acasxu
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cgan
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/collins_rul_cnn
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/metaroom
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/tllverifybench

Extended Track Benchmarks
https://github.com/AI4OPT/ml4acopf_benchmark
https://github.com/dynaroars/vnncomp-benchmark-generation
https://github.com/loonwerks/vnncomp2023
https://github.com/shizhouxing/LSNC_VNNCOMP2024
https://github.com/xiangruzh/Yolo-Benchmark and https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cctsdb_yolo
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/collins_yolo_robustness
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/ml4acopf
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/traffic_signs_recognition
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vggnet16
https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/vit

Unscored benchmarks
All benchmarks received at least one vote, so there are no unscored benchmarks; note, however, that several benchmarks received only one vote in total.

@Jubengo

Jubengo commented Jun 20, 2024

Hi all,
We were wondering what the exact timeline for the next steps is:

  • availability of the repository compiling all 2024 benchmarks
  • availability of the submission tool
  • final submission

@shizhouxing

> Hi @hocdot, for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found it requires a GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances, which may not be compatible with the AWS instance for the benchmark generation in this competition (see #2 (comment)). [...]
>
> Could you please kindly check the benchmark, to see if you could successfully generate the instances using a new environment, or if there is anything that should be fixed?

Hi @hocdot @ttj @ChristopherBrix, is there any update regarding the benchmark generation issue? As @Jubengo said, the repository compiling all the benchmarks and the submission tool are still not available for this year, so the validity of the benchmarks has not been fully confirmed. Can we have these updated soon? Thanks!

@ttj
Contributor Author

ttj commented Jun 24, 2024

> Hi all, We were wondering what the exact timeline for the next steps is:
>
>   • availability of the repository compiling all 2024 benchmarks
>   • availability of the submission tool
>   • final submission

We are finishing setting up the submission system and will update soon (some AWS things needed to be finished); an issue with preliminary details has been created here: #3

Re the deadlines: we will push the deadline back one week from what is posted on the website (https://sites.google.com/view/vnn2024), i.e., to July 5, and will finish the submission system readiness ASAP. It is possible we will have a little more time to extend, but we need time to run things in advance of SAIV (and given the number of benchmarks, quite some compute time may be necessary).

@shizhouxing

> Hi @hocdot, for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found it requires a GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances [...] Could you please kindly check the benchmark [...]?
>
> Hi @hocdot @ttj @ChristopherBrix, is there any update regarding the benchmark generation issue? [...] Can we have these updated soon? Thanks!

Hi, @hocdot @ttj @ChristopherBrix, it has been more than a month and there is still no update regarding the validity of the benchmark. Is it still going to be included or excluded from the competition?

Hi @kollerlukas @sdemarch, since you voted for this benchmark, were you able to generate the benchmark or did you make any modification on your own?

Thanks!

@ChristopherBrix

Sorry, I've finally got the website up and running again. I'm currently working my way through all benchmarks. I'll post an update today.

You'll have to create new accounts at https://vnncomp.christopher-brix.de/. I'll enable everyone this evening as soon as tool submission can start.

@ChristopherBrix

I've uploaded almost all benchmarks at https://github.com/ChristopherBrix/vnncomp2024_benchmarks

Some benchmarks could not be generated correctly:
cora-vnncomp2024-benchmark
The MNIST dataset cannot be downloaded from yann.lecun.com. This seems to be a common issue (https://stackoverflow.com/questions/66577151/http-error-when-trying-to-download-mnist-data) with several alternatives. Could you please update your benchmark accordingly?

cgan
Apparently, large files can no longer be downloaded directly from Google Drive. Instead, one has to manually confirm that one wants to download it even though it couldn't be checked for viruses. Can you upload the files somewhere else?
In the meantime, I'll just copy the exact benchmark from last year, but this way we cannot regenerate the instances with different seeds.

cctsdb_yolo
The benchmark repository is no longer online. I'll reuse last year's benchmark set, but we won't be able to generate new instances with new seeds.

vnncomp-benchmark-generation
As mentioned by @shizhouxing, setting up that benchmark requires quite some additional work. I haven't been able to get that done yet.

@aliabigdeli

aliabigdeli commented Jul 4, 2024

@ChristopherBrix
For cgan, could we use the gdown library? It can solve this problem.
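In case it helps, a minimal usage sketch would be something like the following (the file ID and output name are placeholders, not the actual cgan files):

```python
# Hypothetical example: fetch a large Google Drive file with gdown during
# benchmark setup; the ID and filename below are placeholders.
import gdown

gdown.download("https://drive.google.com/uc?id=FILE_ID_PLACEHOLDER",
               "cgan_model.onnx", quiet=False)
```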

@ttj
Contributor Author

ttj commented Jul 5, 2024

thanks everyone!

for https://github.com/dynaroars/vnncomp-benchmark-generation: if the issues cannot be resolved very soon, we will likely need to exclude it from scoring. Adding other GitHub IDs that I think are related, as there has been no response on it yet: @edwardxu0 @dynaroars @hocdot

we have extended the deadline for tool submission to July 12 (see #3 (comment) )

@ChristopherBrix

ChristopherBrix commented Jul 5, 2024

@aliabigdeli Yes, please add a requirements.txt file to your root. It will automatically be used to install all specified pip packages.

Tool submission is open!
Please try to submit your tools. If you encounter any issues, feel free to ping me here, or write me an email ([email protected])
Currently, AWS has only granted us limited resources, so if you pick the p3 or g5 instance, you might have to wait a bit in the queue before your submission is evaluated. If you just want to make sure that your installation works as intended and don't care about runtime performance yet, please use m5 instances for now; we have more capacity there. I'm working with AWS to resolve this, and will post an update once it's done.
I've enabled all accounts.

@shizhouxing

Thanks for uploading the benchmarks, @ChristopherBrix.

large_models.zip is missing in setup.sh (2023 version for reference), so some models are missing for vggnet16_2023, nn4sys_2023, cgan_2023.

Also, in safenlp, there are subdirectories in onnx/ and vnnlib/, but they are not reflected in instances.csv.

@Tgl70

Tgl70 commented Jul 6, 2024

Hi @shizhouxing, that shouldn't happen (re safeNLP); when I run generate_properties.py (https://github.com/ANTONIONLP/safeNLP/blob/main/generate_properties.py) I get the correct instances.csv, with entries of the form: onnx/ruarobot/perturbations_0.onnx,vnnlib/ruarobot/hyperrectangle_2903.vnnlib,20

Is it possible that the older version was used instead?

@shizhouxing

Hi @Tgl70, I saw the safenlp benchmark uploaded to ChristopherBrix/vnncomp2024_benchmarks has incorrect entries in instances.csv: https://github.com/ChristopherBrix/vnncomp2024_benchmarks/blob/main/benchmarks/safenlp/instances.csv

@Tgl70

Tgl70 commented Jul 6, 2024

@shizhouxing you're right!
It should be just a matter of re-running generate_properties.py and substituting instances.csv.
@ChristopherBrix let me know if any problem arises!

@ChristopherBrix

I've added the large models from last year. I couldn't really test it right now (will do so tomorrow), but it should work.

@Tgl70 The incorrect instances.csv file was caused by some postprocessing on my end. Unfortunately, something in my CI pipeline is currently broken, and I cannot push an update to the server right now. Once that's fixed, I'll regenerate the benchmark.

@ChristopherBrix

The cora benchmark has been added.
The large models should now be downloaded correctly.
The safenlp benchmark is fixed.
The duplicated collins benchmark has been removed.

@mldiego

mldiego commented Jul 9, 2024

@aliabigdeli

I checked the linearizenn onnx models back in the day and there was only one, which we could easily support. However, I tested on the submission site and there appear to be several models that were not available before (and that apparently MATLAB does not support). How are these models different from the original one: https://github.com/aliabigdeli/LinearizeNN_benchmark2024/tree/main/onnx ?

Is there a way to generate the new models similarly to this one for easier support?

I am aware that we voted for this benchmark to be included in the regular track, but generating unseen new models in the last week before submission (which are different from the ones previously available) does not seem fair. So if the models cannot be converted to onnx like the original one and participants are having issues supporting these new models, I suggest we move it to the extended track, or only consider it in future competitions.

@aliabigdeli

@mldiego

There were no changes to the benchmark in recent weeks. You need to run the code to get all the onnx files.
The specifications in this benchmark compare the output value of the neural network with a linear combination of the inputs. Because the current tools accept specifications only over the output variables, for this problem one needs skip-connection layers to make the input variables available in the output, and then linear layers to compare the linear combination of the inputs with the output value. For future support, I also added vnnlib files (ending with "_io.vnnlib") that contain the specification over the input and output in a joint format. These files work only with the main onnx files (the original neural network in this problem).

@mldiego

mldiego commented Jul 11, 2024

@aliabigdeli

I am just going with the rules here:

[Image: screenshot of the benchmark proposal rules]

The vnnlib files were not provided, and the onnx file provided (1) is not used in the competition. Based on these rules, this benchmark proposal was incomplete and misleading, as I doubt every participant will install and generate every file from scratch for all proposed benchmarks; hence the rules for the benchmark proposal.

@aliabigdeli

@mldiego

As you brought up the rules: after running the generate_properties.py file with a random seed, you will get two folders (onnx and vnnlib). If what you're saying is correct (which I don't think it is), many benchmarks are not correctly proposed, like the following:

https://github.com/feiyang-cai/cgan_benchmark2023
https://github.com/stanleybak/vggnet16_benchmark2022
...

@shizhouxing

> for https://github.com/dynaroars/vnncomp-benchmark-generation: if the issues cannot be resolved very soon, we will likely need to exclude it from scoring [...]
>
> we have extended the deadline for tool submission to July 12 (see #3 (comment))

So is this benchmark going to be excluded, since tomorrow is the deadline and there is no response?

@ttj
Contributor Author

ttj commented Jul 12, 2024

> > for https://github.com/dynaroars/vnncomp-benchmark-generation: if the issues cannot be resolved very soon, we will likely need to exclude it from scoring [...]
>
> So is this benchmark going to be excluded, since tomorrow is the deadline and there is no response?

It seems like no one has been able to get it set up successfully so that it could be considered, so excluding it would be my recommendation, unless anyone disagrees or has been able to use it successfully.

@shizhouxing

Another thing about safenlp:

Safenlp has subfolders (medical and ruarobot) in onnx and vnnlib, and files in medical and ruarobot may have the same filenames (e.g., vnnlib/ruarobot/hyperrectangle_1402.vnnlib and vnnlib/medical/hyperrectangle_1402.vnnlib).
The current scripts just take hyperrectangle_1402.vnnlib for counterexample saving and ignore ruarobot or medical. So the scripts may not work correctly, as counterexamples may get saved to the same file perturbations_0_hyperrectangle_1402.counterexample for both instances:
https://github.com/ChristopherBrix/vnncomp2024_benchmarks/blob/main/benchmarks/safenlp/instances.csv#L804
https://github.com/ChristopherBrix/vnncomp2024_benchmarks/blob/main/benchmarks/safenlp/instances.csv#L924,
and we can't tell which instance the counterexample comes from.

@ChristopherBrix The scripts may need to be modified.
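One possible fix, sketched here purely as an illustration (this is not the competition scripts' actual code), is to keep the benchmark-relative subdirectories when deriving the counterexample file name:

```python
# Hypothetical sketch: build a collision-free counterexample name by keeping
# the subdirectory components of the onnx and vnnlib paths.
from pathlib import Path


def counterexample_name(onnx_path: str, vnnlib_path: str) -> str:
    onnx_part = Path(onnx_path).with_suffix("")      # e.g. onnx/ruarobot/perturbations_0
    vnnlib_part = Path(vnnlib_path).with_suffix("")  # e.g. vnnlib/ruarobot/hyperrectangle_1402
    flat = "_".join(list(onnx_part.parts[1:]) + list(vnnlib_part.parts[1:]))
    return flat + ".counterexample"


print(counterexample_name("onnx/ruarobot/perturbations_0.onnx",
                          "vnnlib/ruarobot/hyperrectangle_1402.vnnlib"))
# -> ruarobot_perturbations_0_ruarobot_hyperrectangle_1402.counterexample
```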

@ChristopherBrix

@shizhouxing Thank you for pointing that out. I'll adapt the scripts!

@ttj
Contributor Author

ttj commented Oct 11, 2024

We have shared a draft Overleaf report and are also fixing some execution issues identified by some participants. Please let us know if you didn't get it and we will share it; so far we have emailed the listservs.
