Benchmark Proposals and Discussion #2
As one mechanism to address benchmarks not being available until far too close to the evaluation timeframe, we have solicited benchmarks to register in advance via the Google form registration process. These are the benchmarks submitted as of today (4/3) and discussed in the rules/kickoff meeting:
https://github.com/AI4OPT/ml4acopf_benchmark
https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/ -> aggregated into a formal benchmark here: https://github.com/dynaroars/vnncomp-benchmark-generation
https://github.com/loonwerks/vnncomp2023
https://github.com/ANTONIONLP/safeNLP
For any other benchmark proposals, please comment on this issue with a link to the benchmark details.
We want to propose NN4SysBench (paper). (We will likely add a new application this year. Will update this post when done.)
We would like to propose our benchmark: Verify1s (https://github.com/kollerlukas/cora-vnncomp2024-benchmark).
We updated last year's ml4acopf_benchmark to offer two alternative ReLU-only options for the same models. These are provided as 6 new onnx files designed to make this benchmark more accessible to teams who have not implemented the sigmoid, sin, or cos activation functions. The new onnx files are attached and will be included in the previous github repo pending a pull request. The main change is that each of the sigmoid, sin, and cos nodes has been replaced with a ReLU-based model with error bounded by 10E-3. There is a residual configuration, which takes advantage of residual connections to make periodic approximations of sine and cosine, and a non-residual configuration, which uses minimally complex structures to emulate sine and cosine only within the region from -pi/2 to pi/2. Edit: We did some cleanup and have further eliminated the ConstantOfShape, Where, Equal, Expand, Pow, and Neg node types from the linearized models provided within https://github.com/AI4OPT/ml4acopf_benchmark. Our updated benchmarks offer a contrast between a higher-complexity nonlinear problem, to push the boundaries of nonlinear verification, and two linearized models of the same problems, which may produce new constraint violations.
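For intuition, here is a small numpy sketch (not the benchmark code itself) showing how a function such as sine can be matched on a fixed interval by a piecewise-linear, ReLU-representable approximation with a small bounded error; the breakpoint count is a hypothetical choice:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def pwl_sin(x, breakpoints):
    """Piecewise-linear interpolation of sin(x) between the given breakpoints,
    written as a sum of ReLU terms so it is exactly representable by a small
    ReLU network."""
    xs = breakpoints
    ys = np.sin(xs)
    slopes = np.diff(ys) / np.diff(xs)
    # y(x) = y0 + s0*(x - x0) + sum_k (s_k - s_{k-1}) * relu(x - x_k)
    out = ys[0] + slopes[0] * (x - xs[0])
    for k in range(1, len(slopes)):
        out += (slopes[k] - slopes[k - 1]) * relu(x - xs[k])
    return out

breaks = np.linspace(-np.pi / 2, np.pi / 2, 41)   # hypothetical resolution
grid = np.linspace(-np.pi / 2, np.pi / 2, 10001)
err = np.max(np.abs(pwl_sin(grid, breaks) - np.sin(grid)))
print(f"max approximation error: {err:.2e}")      # below 1e-3 at this resolution
```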
Hello! We are considering introducing a new benchmark. Suppose we have a piecewise-linear approximation of a neural network controller, in the form of a set of linear models with added noise to account for local linearization error. The objective of this benchmark is to check whether the neural network output falls within the range obtained from our linear model output plus some uncertainty. The neural network controller we use in this benchmark is an image-based controller for an Autonomous Aircraft Taxiing System, whose goal is to keep an aircraft taxiing at a steady speed on a taxiway. This network was introduced in the paper "Verification of Image-based Neural Network Controllers Using Generative Models". The network is a concatenation of a cGAN (conditional GAN) and the controller, resulting in a unified neural network controller with low-dimensional state inputs. In this problem, the inputs to the neural network consist of two state variables and two latent variables. The aircraft's state is determined by its crosstrack position (p) and heading angle error (θ) with respect to the taxiway center line. Two latent variables with a range of -0.8 to 0.8 are introduced to account for environmental changes. Because in this case the output spec depends on both the input and the output, and considering the VNN-LIB limitations, we added a skip-connection layer to the neural network so that the input values are present in the output space. We also added one linear layer after that to create a linear equation for each local model (see the sketch below). The benchmark repo: https://github.com/aliabigdeli/LinearizeNN_benchmark2024 The idea for this benchmark came from one of our projects in which we approximated the NN controller with a piecewise linear model, and we wanted to check whether the neural network output falls within the range obtained from our linear model output plus some uncertainty. However, in most areas of the state space, verification by state-of-the-art tools fails (they can neither verify nor falsify and return a timeout, even when I increased the timeout to 12 hours). We would greatly appreciate any feedback, opinions, and suggestions from both competition organizers and participants. Thank you!
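A rough PyTorch sketch of one way to realize this construction (placeholder layer sizes, not the exact code in the repository): the wrapped network exposes a linear function of the controller output and the raw inputs, which can encode e.g. the residual between the controller output and one local linear model.

```python
import torch
import torch.nn as nn

class ControllerWithSkip(nn.Module):
    """Wraps a trained controller so the raw inputs also reach the output,
    then applies one linear layer that can encode e.g. the residual between
    the controller output and one local linear model A*x + b."""
    def __init__(self, controller: nn.Module, n_in: int = 4, n_out: int = 2):
        super().__init__()
        self.controller = controller
        self.local_model = nn.Linear(n_out + n_in, n_out)  # acts on [u; x]

    def forward(self, x):
        u = self.controller(x)            # controller output
        z = torch.cat([u, x], dim=-1)     # skip connection: inputs pass through
        return self.local_model(z)        # linear equation over outputs and inputs

# Placeholder controller; export to ONNX so the VNN-LIB property can be stated
# purely over the network outputs.
controller = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))
model = ControllerWithSkip(controller)
torch.onnx.export(model, torch.zeros(1, 4), "linearizenn_local_model.onnx")
```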
@ttj We submitted benchmark generation tools (https://github.com/edwardxu0/adaGDVB/ and https://github.com/edwardxu0/GDVB/) but the proposed benchmark is available here (https://github.com/dynaroars/vnncomp-benchmark-generation). Could you please update it? |
Thanks @kollerlukas for proposing the Verify1s benchmark. I have a concern regarding the 1-second timeout currently set. The same discussion happened in previous years; please refer to here. In general, based on last year's report (see Table 7), it is evident that most verifiers require more than this 1-second threshold simply to load the verifier initially. This overhead is unavoidable under the current rule that mandates each instance be run in an independent script, starting anew with loading the verifier each time. While batch-mode operation could potentially alleviate this overhead, it appears the competition rules remain unchanged this year, disallowing this approach. To ensure fairness and the practical applicability of the benchmark across different verification methods, I recommend increasing the timeout to at least 30 seconds, as suggested in previous years, which is longer than the overhead for all verifiers. In addition, the official report subtracts this overhead for all verifiers; as shown in the cactus plots, many instances can then be solved in less than 1 second, reflecting each verifier's capabilities.
@KaidiXu Thank you for your feedback. We were unaware that there had already been a discussion about 1s timeouts.
@kollerlukas Yes, you are right. In the end, the official report subtracts the overhead for all verifiers, and as you can see in the cactus plots, there are indeed a lot of instances that can be solved in less than 1 second by all verifiers. However, if the vnnlib files only allow a 1-second timeout, then most verifiers will fail. I'm not sure our scoring system can support this post-processing of subtracting the overhead and only counting instances that can be verified in <=1 s.
Last call for any new benchmarks; please post by tomorrow, 5/17. We plan to make the form early next week, solicit feedback on it, then send it out to tool participants for voting, with a voting response deadline of May 31. We will also include in the form the benchmarks considered last year. There are a couple of repeats/modified proposals of benchmarks proposed last year re-proposed above, and a few that were not scored last year re-proposed as well, so if proposers want only one version considered, please comment. For continuity and year-to-year progress tracking, we should probably include at least all the scored ones from last year. For reference, last year's benchmarks are here: https://github.com/ChristopherBrix/vnncomp2023_benchmarks and their overview/descriptions:
We want to re-propose the dist-shift benchmarks coming from this paper.
@ChristopherBrix do you think that is possible?
We have created an initial draft Google form for collecting votes here, to solicit feedback on any changes or anything missing. Please do not submit it yet, as more benchmarks may come in today, and we will also finish adding the previous VNN-COMP'23 benchmarks to the form. If you have any feedback on the form/process for voting, please let us know by posting here or in the rules issue; we will post again once the form is about finalized, for any final feedback before soliciting the official votes next week from the tool participants:
We would like to propose a benchmark for Lyapunov-stable neural control. There are two models, for state feedback and output feedback, respectively. Each model consists of a controller, which is a shallow ReLU network; a Lyapunov function, which is a quadratic function; and nonlinear operators modelling the dynamics of a 2D quadrotor. The model for output feedback further contains a shallow LeakyReLU network as the observer. The goal is to certify Lyapunov stability, which has been encoded in the ONNX graphs and VNNLIB specifications.
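Roughly speaking, the certified property is the Lyapunov decrease condition V(x_next) - V(x) < 0 for the closed-loop system. A simplified numerical sketch with placeholder dynamics, dimensions, and weights (not the actual benchmark models):

```python
import numpy as np

np.random.seed(0)
n = 6                                   # placeholder state dimension
P = np.eye(n)                           # quadratic Lyapunov function V(x) = x^T P x
V = lambda x: x @ P @ x

# Placeholder ReLU controller and stand-in dynamics (not the quadrotor model).
W1, W2 = 0.1 * np.random.randn(16, n), 0.1 * np.random.randn(2, 16)
controller = lambda x: W2 @ np.maximum(W1 @ x, 0.0)
dynamics = lambda x, u: -x + np.pad(u, (0, n - 2))

def closed_loop_step(x, dt=0.01):
    """One discretized step of the closed-loop system."""
    return x + dt * dynamics(x, controller(x))

# The verification query is, roughly: for all x in the region of interest,
# V(x_next) - V(x) < 0. Here we only spot-check random samples.
for _ in range(5):
    x = np.random.uniform(-0.5, 0.5, size=n)
    print(V(closed_loop_step(x)) - V(x) < 0.0)
```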
Hi all, I would like to re-propose the yolo benchmark from last year. The benchmark was included in VNN-COMP 2023 but unscored. The network of the original YOLOv2 model is replaced with a much smaller network containing 5 residual blocks with conv and ReLU layers, for simplicity of verification. It takes in an image (3 x 52 x 52) and outputs a tensor of shape 125 x 13 x 13 containing the confidence scores, the classification result, and the positions of 5 bounding boxes for each of the 13 x 13 grid cells. Only 5 of the outputs are actually used in verification. For each instance, the property is verified if the confidence score of at least one bounding box is greater than a preset threshold under any bounded perturbation of the input image (similar to other image robustness benchmarks). Note that we only use the 5 outputs related to confidence scores, even though the output tensor seems huge. An example of the kind of specification to check is sketched below. Here's the repo: https://github.com/xiangruzh/Yolo-Benchmark
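A small illustrative Python sketch (not the exact specification files from the repository) that writes one possible VNN-LIB encoding of this property; all indices and values in the example call are made up:

```python
# Hypothetical helper, not the benchmark's own script: write a VNN-LIB file
# stating the *negation* of "at least one confidence score exceeds `threshold`"
# over an L_inf ball around `center`, so SAT corresponds to a violation.
def write_yolo_property(path, center, eps, conf_indices, threshold):
    with open(path, "w") as f:
        for i in range(len(center)):
            f.write(f"(declare-const X_{i} Real)\n")
        for j in conf_indices:
            f.write(f"(declare-const Y_{j} Real)\n")
        # input box: x in [center - eps, center + eps]
        for i, c in enumerate(center):
            f.write(f"(assert (>= X_{i} {c - eps}))\n")
            f.write(f"(assert (<= X_{i} {c + eps}))\n")
        # negated property: every confidence score stays at or below the threshold
        f.write("(assert (and\n")
        for j in conf_indices:
            f.write(f"  (<= Y_{j} {threshold})\n")
        f.write("))\n")

# Example call (indices and values are made up for illustration):
# write_yolo_property("prop_0.vnnlib", center=[0.5] * (3 * 52 * 52),
#                     eps=1 / 255, conf_indices=[4, 29, 54, 79, 104], threshold=0.25)
```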
Glad to know that we will reconsider all 2023 benchmarks! I looked into the benchmarks from 2022 and 2021, and there were also many great ones. Here I re-propose the cifar100_tinyimagenet benchmark from VNN-COMP 2022. Since we have the division of tracks this year, to make the benchmark more accessible and friendly for new tools in this field, I reduced its complexity. I kept only two of the four CIFAR100 models, in medium sizes (this will reduce the workload to support this benchmark). The TinyImageNet models/data may be too large for new tools in this field, so I split them into a separate benchmark. I hope at least the cifar100 benchmark (with only 2 onnx files) can be voted into the main track. Cifar100 has the same image size (32x32) as the commonly used Cifar10, and the architecture contains FC, conv, and ReLU layers only. I also removed the largest "super" model, which may pose a challenge for some tools due to its many layers. Hopefully, many tools can support the cifar100 benchmark without too much additional effort. In 2022, the benchmark had a reproducibility problem during generation, where the JPEG-decoded images could be slightly different depending on installed library versions, causing unexpected issues (e.g., the same image index could be sat on one machine but unsat on another). So I've saved all images into a checkpoint and load them directly to avoid this issue, as sketched below. Here are the re-proposed benchmarks (I split the Cifar100 and TinyImageNet models into two benchmarks as mentioned above): https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark
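A minimal sketch of that approach (hypothetical paths, not the actual benchmark code): decode each image once, store the tensors, and have generate_properties.py load them directly.

```python
import glob
import numpy as np
import torch
from PIL import Image

# One-time preprocessing (done by the benchmark author): decode every image once
# and store the resulting tensors, so generate_properties.py never depends on the
# locally installed JPEG decoder. Paths are hypothetical; images share one size.
images = [np.asarray(Image.open(p).convert("RGB"), dtype=np.uint8)
          for p in sorted(glob.glob("raw_images/*.jpg"))]
torch.save(torch.from_numpy(np.stack(images)), "images_checkpoint.pt")

# Inside generate_properties.py: load the pre-decoded images directly.
images = torch.load("images_checkpoint.pt").float() / 255.0
```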
Thanks very much everyone. We have updated the voting Google form here with all the proposed benchmarks, along with those from last year (some of these are duplicated with new/revised proposals above; we have put everything in the form, so if benchmark proposers would like only one version considered [as, e.g., a couple were perhaps proposed last year but not scored], please post below which version to have in the form): https://forms.gle/PQBW7tJeYUMJ9Dm3A Please take a look and let us know of any suggested edits before we open voting, and please do not submit the form until we post again saying voting is open, in case of any final revisions. Votes will be due from tool participants by May 31, so please let us know of any feedback on the form/voting process ASAP, and at the latest by this Friday, 5/24, so that tool participants will have at least a week to submit votes. We will also email the mailing list with this update, in case anyone is not tracking the GitHub issue actively.
Ml4acopf only needs the github link, the zip file is redundant now, thanks!
Thanks @bwbellmath, I have updated the form; final reminder for any feedback on the form or if we missed any benchmarks, etc.
Hi @ttj, I found that https://github.com/ANTONIONLP/safeNLP contains 10812 instances with a 120-second timeout for each of them, which is about 360 hours of timeout in total (10812 x 120 s ≈ 360 h).
@ttj It looks like some benchmarks have duplicated entries in the current voting form, e.g., yolo by @xiangruzh: https://github.com/xiangruzh/Yolo-Benchmark These are re-proposed benchmarks with no changes compared to last year, and their github repos have not changed since last year. Shall we list them only once in the voting form, as part of the VNN-COMP 2023 benchmarks? It might cause confusion if one benchmark gets voted on in two duplicated entries (e.g., one entry is voted into the core track and the other entry is not).
Hi @ttj, I found that the generate_properties.py file is missing from https://github.com/ANTONIONLP/safeNLP.
Thanks @KaidiXu and @Hongji1001. @KatyaKom @ANTONIONLP and @Tgl70: can you please see the discussion above re your benchmark? It will need a little modification. Note that the max benchmark time is described in the rules (6 hours) and you need the generate_properties file. I presume that, the way yours is set up, you can modify it to select just a subset of the instances and have generate_properties pick random test samples/instances; please ensure you follow the guidelines re: setting a seed so we can reproduce at evaluation time. Here is a representative generate_properties (thanks @huanzhang12), but you can also see other benchmarks for examples, and I presume you have something like this somewhere already given what is in your repository with the instances: https://github.com/huanzhang12/vnncomp2024_cifar100_benchmark/blob/main/generate_properties.py There may have been some confusion, as what is stored in the benchmarks repository from the last iteration is the output of some of the benchmarks (e.g., just the instances after generate_properties was run, i.e., what was actually evaluated); we will try to do a better job tracking these in the future to keep things together. @huanzhang12 thanks: yes, I will merge some of these (I had asked above, but none of the benchmark proposers responded, so it was not clear whether they were exactly the same or not). I will post later today re opening voting at the Google form with the modifications. We will extend voting to Monday 6/3 given the few updates, but will not extend past that, so please let us know of any final updates ASAP.
I have merged the nn4sys and dist-shift duplicates in the form. There were 3 versions of YOLO last year, and I cannot figure out which is the same one, so can someone please clarify, @xiangruzh? Which one is https://github.com/xiangruzh/Yolo-Benchmark equivalent to: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/cctsdb_yolo or https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo ? The other one was by Collins, so I could figure that out. From the files, I think it is this one, but it would be good to have confirmation; I will update the form to use it: https://github.com/ChristopherBrix/vnncomp2023_benchmarks/tree/main/benchmarks/yolo
Great, thanks @xiangruzh, we have updated the voting form. Given no other comments at this time, let's go ahead and open voting at the Google form. Please coordinate with your team and submit your votes by Monday, June 3 AOE at the latest; we will post the results, and the subdivision into the 2 tracks, once they are in: https://forms.gle/PQBW7tJeYUMJ9Dm3A One team had already submitted votes without these final changes to the form; I saved a copy and can send it if useful, but given the changes merging a few benchmarks, please do resubmit. Our apologies for the inconvenience.
Hi @ttj, sorry I've just seen the discussion.
Yes, you are correct; I will modify it to fit within the 6-hour timeout (and follow the guidelines).
Hi @hocdot, for the benchmark https://github.com/dynaroars/vnncomp-benchmark-generation, we found that it requires a GPU to run alpha-beta-CROWN and NeuralSAT for filtering instances, which may not be compatible with the AWS instance used for benchmark generation in this competition (see #2 (comment)). Additionally, we have tried generating the benchmark locally on our GPU server; however, for the CIFAR-10 and CIFAR-100 parts of the benchmark, it was taking an extremely long time and was still not able to find any valid instance during the filtering process after several days. Could you please kindly check the benchmark, to see whether you can successfully generate the instances in a fresh environment, or whether there is anything that should be fixed? Thanks.
Adding to @shizhouxing's comment: is the generate_properties.py file allowed to depend on any of the tools used in the competition? It seems the properties are filtered out by some "simple heuristic" determined by verification tools (neuralsat, crown). This could lead to somewhat unfair property generation...
Hi @ttj, I have uploaded the generate_properties.py file.
Thanks everyone; a reminder to finish voting by today, June 3. Some teams have not submitted votes, and they are due today AOE (we will aim to post results tomorrow). Re the discussion above on a couple of benchmarks and edits: if there are any final edits/revisions to benchmarks, as people have looked at them in detail for voting, please finalize them ASAP.
Hi all,
Hi @hocdot @ttj @ChristopherBrix, is there any update regarding the benchmark generation issue? As @Jubengo said, the repository compiling all the benchmarks and the submission tool are still not available for this year, so the validity of the benchmarks has not been fully confirmed. Can we have these updated soon? Thanks! |
We are finishing up the submission system setup and will update soon (some AWS things needed to be finished); an issue has been created here (#3) with preliminary details. Re the deadlines: we will push the deadline back one week from what is posted on the website https://sites.google.com/view/vnn2024 , so to July 5, and will finish the submission system readiness ASAP. It is possible we will have a little more time to extend, but we need time to run things in advance of SAIV (and given the number of benchmarks, it is possible quite some compute time will be necessary).
Hi, @hocdot @ttj @ChristopherBrix, it has been more than a month and there is still no update regarding the validity of the benchmark. Is it still going to be included in the competition, or excluded? Hi @kollerlukas @sdemarch, since you voted for this benchmark, were you able to generate it, or did you make any modifications of your own? Thanks!
Sorry, I've finally got the website up and running again. I'm currently working my way through all benchmarks. I'll post an update today. You'll have to create new accounts at https://vnncomp.christopher-brix.de/ I'll enable everyone this evening as soon as tool submission can start. |
I've uploaded almost all benchmarks at https://github.com/ChristopherBrix/vnncomp2024_benchmarks
Some benchmarks could not be generated correctly:
- cgan
- cctsdb_yolo
- vnncomp-benchmark-generation
@ChristopherBrix |
Thanks everyone! For https://github.com/dynaroars/vnncomp-benchmark-generation, if the issues cannot be resolved very soon, we will likely need to exclude it from scoring; adding other github ids that I think are related, as there has been no response on it yet: @edwardxu0 @dynaroars @hocdot. We have extended the deadline for tool submission to July 12 (see #3 (comment)).
@aliabigdeli Yes, please add a
Tool submission is open!
Thanks for uploading the benchmarks, @ChristopherBrix.
Also, in safenlp, there are subdirectories in onnx/ and vnnlib/, but the paths in instances.csv do not include them.
Hi @shizhouxing that shouldn't happen (re safeNLP), when I run Is it possible that the older version was used instead? |
Hi @Tgl70, I saw the safenlp benchmark uploaded to |
@shizhouxing you're right! |
I've added the large models from last year. I couldn't really test it right now (will do so tomorrow), but it should work. @Tgl70 The incorrect |
The cora benchmark has been added. |
I checked the linearizenn onnx models back in the day and there was only one, which we could easily support. However, I tested on the submission site and there appear to be several models which were not available before (and which apparently MATLAB does not support). How are these models different from the original one (https://github.com/aliabigdeli/LinearizeNN_benchmark2024/tree/main/onnx)? Is there a way to generate the new models similarly to that one for easier support? I am aware that we voted for this benchmark to be included in the regular track, but generating unseen new models the week before submission (which are different from the ones previously available) does not seem fair. So if the models cannot be converted to onnx like the original one, and participants are having issues supporting these new models, I suggest we move it to the extended track, or only consider it in future competitions.
There were no changes to the benchmark in recent weeks. You need to run the code to get all the onnx files.
I am just going with the rules here: the vnnlib files were not provided, and the single onnx file provided is not the one used for the competition. Based on these rules, this benchmark proposal was incomplete and misleading, as I doubt every participant will install and generate every file from scratch for all proposed benchmarks, hence the rules for benchmark proposals.
As you brought up the rules, after running the https://github.com/feiyang-cai/cgan_benchmark2023 |
So is this benchmark going to be excluded, since tomorrow is the deadline and there is no response? |
It seems no one has been able to get it added successfully in order to consider it, so excluding it would be my recommendation, unless anyone disagrees or has been able to use it successfully.
Another thing about safenlp: safenlp has subfolders (see above), so @ChristopherBrix the scripts may need to be modified.
@shizhouxing Thank you for pointing that out. I'll adapt the scripts! |
We have shared a draft overleaf report and are also fixing some execution issues identified by some participants. Please let us know if you didn't receive it and we will share it; we have emailed the listservs so far.
Both tool participants and outsiders such as industry partners can propose benchmarks. All benchmarks must be in .onnx format and use .vnnlib specifications, as was done last year. Each benchmark must also include a script to randomly generate benchmark instances based on a random seed. For image classification, this is used to select the images considered. For other benchmarks, it could, for example, perturb the size of the input set or specification.
The purpose of this thread is to present your benchmarks and provide preliminary files to get feedback. Participants can then provide comments, for example suggesting that you simplify the structure of the network or remove unsupported layers.
To propose a new benchmark, please create a public git repository with all the necessary code.
The repository must be structured as follows:
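Benchmarks discussed in this thread use a layout with onnx/ and vnnlib/ directories, an instances.csv file, and a generate_properties.py script. As an illustration (hypothetical names and counts, not a normative template), a minimal generate_properties.py that selects a random subset of instances from the given seed might look like:

```python
#!/usr/bin/env python3
"""Illustrative generate_properties.py skeleton: select a random subset of
instances based on the seed passed on the command line (names are hypothetical)."""
import csv
import random
import sys

def main(seed: int, num_instances: int = 30, timeout: int = 300):
    random.seed(seed)                     # required so results are reproducible
    pool = list(range(1000))              # e.g. candidate image/sample indices
    selected = random.sample(pool, num_instances)
    with open("instances.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for idx in selected:
            vnnlib = f"vnnlib/prop_idx_{idx}.vnnlib"
            # write_vnnlib(idx, vnnlib)   # hypothetical: emit the spec for this instance
            writer.writerow(["onnx/network.onnx", vnnlib, timeout])

if __name__ == "__main__":
    main(int(sys.argv[1]))
```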