Category Proposal: Piecewise Linear Activations (ReLU) #2
Comments
Should the specifications be a single set? Some specs, like "Clear of conflict is minimal", require encoding as a disjunction (unsafe if left > clear or right > clear, etc.), which I'm not sure all tools will support. Perhaps it makes sense to provide both if not all tools can handle disjunctions. |
Benchmarks: ACAS is low dimensional. We also need some high dimensional benchmarks. Activations: I think just ReLUs is sufficient. Specifications: safety is good but I think we ought to have local robustness here as well. |
Thanks for the feedback; yes, we will need to handle this, e.g., by providing a list of the sets in the disjunction. Tools that don't support disjuncts can then check each of these sets in separate calls, whereas tools that internally support disjuncts can handle them however they do. |
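As a rough, tool-agnostic illustration of that scheme (the `verify_single` interface and the constraint encoding below are hypothetical, not any participant's actual API):

```python
# A disjunctive unsafe region split into separate conjunctive checks.
# Each disjunct is encoded as a pair (i, j) meaning "output[i] > output[j]".
# With outputs ordered [clear, left, right], "unsafe if left > clear or
# right > clear" becomes two disjuncts:
disjuncts = [(1, 0), (2, 0)]

def check_property(verify_single, network, input_set, disjuncts):
    """verify_single is a hypothetical single-query solver that checks one
    conjunctive unsafe set; tools with native disjunction support could
    instead receive the whole list at once."""
    for disjunct in disjuncts:
        if verify_single(network, input_set, disjunct) == "unsafe":
            return "unsafe"   # a counterexample in any disjunct violates the spec
    return "safe"             # safe only if every disjunct is unreachable
```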
Thanks for the feedback. While I agree the ACAS networks are small, if one considers all the properties, the full range of input sets, and all the networks, they are nontrivial to analyze (e.g., I think still at least hours of analysis for state-of-the-art tools). We certainly welcome other benchmarks if there are suggestions for larger ones. What I was thinking (but am open to revision) is that the CNN category would cover the larger networks and robustness specifications (and explicitly cover convolutional layers), but if there are networks with only ReLUs for MNIST/CIFAR/etc., we could probably add those as benchmarks for this category, if that is what you're thinking. |
Yes, I think there are some MNIST networks with ReLU / fully connected layers. Local robustness could make sense here. |
To my mind the architectures and the activations are orthogonal to the size of the models. Indeed, I would have thought we certainly want a set of models in each category ranging from small/manageable to large/undoable. In addition, I do think we need to cover models with high-dimensional inputs extensively, including in a ReLU category. This is what the AI/ML community uses, and I think we want to reach out to them and show that our benchmarks cover the sort of models they themselves use (at least in terms of architecture, if not quite of size just yet), not just the case studies normally explored in verification. |
Yes, these are good points, and I agree. I had not really thought of having classification without convolutional layers, which is why I had primarily put ACAS-Xu as the representative benchmark, but assuming that's possible and architectures can be provided, then we could certainly have parameterized examples scaling the size/architecture of the networks within this category as well, e.g., for MNIST and potentially other data sets. With respect to high-dimensional inputs, I presume you are thinking of images, which would be covered with e.g. MNIST/other data sets in this category, but please let us know if you have other examples in mind that might make appropriate benchmarks. |
So maybe we ought to have only two main categories: ReLU and sigmoids. The present CNN category could sit as a subcategory of the ReLU one. In the ReLU category we could have high-dimensionality (vision, classifiers) and low-dimensionality (ACAS, ...) inputs, some fully connected, some convolutional, etc. Some existing tools may not support all the benchmarks, but I think that's fine. |
This is to follow up on Taylor's request. |
Thanks for the feedback! Let's see what other feedback comes in as people hopefully start to reply with which categories they're interested in, to see whether this makes sense based on the level of participation, or whether keeping the categories split is ideal. |
I'd like the nnenum tool to participate in this category. Hopefully we can get a few benchmarks in addition to ACASXu that are higher input dimension, but not just vision ones. |
I agree with Alessio's suggestion. If we are going to bring big models (e.g., MNIST) to this category, it makes sense to reduce to two categories regarding network structure: piecewise linear (including ReLU, convolution, max pooling, etc) and smooth nonlinear (sigmoid and tanh). Then orthogonally for each category, we can have big models (e.g., MNIST) and small models (e.g., ACAS). We may see how other tool authors will respond. |
I'm generally on board with Alessio and Changliu's suggestions, and I like what @changliuliu pointed out about orthogonality. To expand a bit more on it, here are four orthogonal properties of categories that I've identified.
Note that I've distinguished the type of non-linearity used from the type of linear layers used in the network structure, since convolutional layers can co-exist with non-piecewise-linear activations (e.g., LeNet is tanh + convolution). We don't have to have a category covering every possible combination of properties. To avoid an explosion in the number of categories, rather than having additional categories for |
Separately: I'd like to enter the MIPVerify tool for this category as it is currently described. More details on the tool:
|
I would like to enter our tool ERAN in this category: Cheers, |
Hi, Regards, |
The oval group would like to sign up for this category. Our work will be based on the methods presented in the journal paper "Branch and Bound for Piecewise Linear Neural Network Verification" and developments thereof. Cheers, |
Piecewise/ReLU Category Participants: Oval. If anyone else plans to join this category, please add a comment soon, as the participants in this category need to decide on the benchmarks soon (by about May 15). |
Hi, Regards, |
Hi, We would like to enter VeriNet https://vas.doc.ic.ac.uk/software/neural/. The toolkit supports:
Regards, |
Finalized Piecewise/ReLU Category Participants: Oval. For benchmarks, as indicated earlier, we will decide the benchmarks as a group, and anyone may propose some. For this category, to move things along, I suggest we use the ACAS-Xu networks as one benchmark set by default, and decide which of the networks and properties to include, if not all of them. There has been earlier discussion regarding some classification networks in this category as well. If anyone has models that would fall into this category and not the CNN category (e.g., without convolutional layers) and can provide details (e.g., links to models), please speak up. Any other benchmarks are of course welcome, but details need to be provided soon (~May 15) so we can finalize things and give everyone at least a month or so to look at them. If we consider classification, as in the CNN category, there is a lot of parameterization (the network architecture, the data set, the specification, etc.), so those are the details we will want to discuss and finalize next, as well as how all of these aspects will be specified to ease interoperability between the tools. |
Hi, we would also like to enter Venus (https://vas.doc.ic.ac.uk/software/neural/) in this category. Best, Panagiotis |
I suggest we also consider adversarial robustness for MNIST fully connected NNs with ReLU activations. Some trained MNIST models can be found here |
For ACAS Xu, I suggest a subcategory consisting of the following set of 10 network / property benchmarks that I found to be difficult. ACASXU-HARD:
edit: fixed typo in prop 7 to net 1-9... not 1-7 |
I agree with @haithamkhedr that we should consider adversarial robustness benchmarks for MNIST networks; however, we should also consider networks of different depths, as some toolkits seem to handle deeper networks better than others. I can provide 3 MNIST fully connected networks with 2, 4 and 6 fully-connected layers and 256 nodes in each layer. |
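For concreteness, a PyTorch sketch of fully connected ReLU classifiers matching the 2/4/6-layer, 256-unit description above (this is only an illustrative constructor, not the actual released models):

```python
import torch.nn as nn

def fc_relu_mnist(num_hidden_layers, width=256, in_dim=784, out_dim=10):
    # Fully connected ReLU classifier with `num_hidden_layers` hidden layers
    # of `width` units each, as in the proposed 2/4/6-layer MNIST networks.
    layers, d = [], in_dim
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

# e.g. the deepest of the three proposed networks:
net6 = fc_relu_mnist(6)
```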
Thanks! Can you please provide links so participants can check them out? |
Hi @stanleybak, it looks like this came from the part of our code that was designed to handle non-standard input dimensions; I didn't expect it to show up in the ONNX model. I've modified the code to get rid of most of these operations; the only thing left is one flatten operation, since I was told to use inputs of dim (1, 784, 1), whereas the PyTorch model requires inputs of the form (1, 784). Let me know if this looks okay to you. |
Yes, the network is great now @pat676. One more odd thing that doesn't really need to be fixed: the input images look like they are meant to be lists of integers, but instead they are floating-point values close to integers. For example, in I'm not sure if this was on purpose. My guess was you started with a single-precision floating-point value in the range [0, 1], multiplied it by 255, and then forgot to round. |
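If that guess is right, the fix is presumably a one-line rounding step like the following (function and variable names here are just illustrative):

```python
import numpy as np

def to_integer_pixels(img01):
    # img01: float image with values in [0, 1].
    # Rounding (rather than truncating) avoids the near-integer float
    # values mentioned above, e.g. 24.999998 -> 25.
    return np.round(img01 * 255.0).astype(np.int64)
```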
Thanks, you are right; I updated the images with rounded pixel values now. |
Hello @pat676, the image files look good now. It looks like the expected-classification file got deleted, although it's probably easy enough to recreate since all images classify correctly. |
I recreated the labels file and uploaded it to the images folder. |
For the images @pat676, we want a plot of the number of images verified over time, right? So something like this: What should the max timeout be? 60 seconds per image, or more? Should we use a log scale for the time values? Also, for some of the parameters (say 0.1 epsilon) there may only be a few images that are unsafe... so maybe mixing safe and unsafe in the same graph is better in this case? We probably then want one plot per network, and then one plot that combines all the networks? Maybe one for all the unsafe images together? What do you think? |
Yes, the plot looks good @stanleybak. I agree that plotting safe, unsafe, and all epsilons separately is unnecessary; however, it may be better to merge the different epsilon values into one plot than to merge safe/unsafe. The reason is that, in my experience, toolkits vary a lot between safe and unsafe instances, but usually have a similar relative performance over different epsilon values. So one plot for safe and one for unsafe for each network, resulting in a total of 6 plots for the three networks. One that combines all networks is also a good idea. For the timeout, 60 seconds seems too short. Normally papers report at least 30 minutes; if possible we should do the same for the results to be meaningful. |
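A minimal sketch of the kind of plot being discussed (one curve per tool, per-instance runtimes sorted so the curve shows how many images are verified within a given time budget; the runtimes below are made up, and matplotlib is just one way to draw it):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-instance verification times in seconds (timeouts omitted).
runtimes = {
    "tool_a": [1.2, 3.5, 7.9, 15.0, 44.2, 120.3],
    "tool_b": [0.8, 2.1, 9.4, 60.7, 300.5],
}

for name, times in runtimes.items():
    t = np.sort(times)
    # y value at time t = number of images verified within t seconds
    plt.step(t, np.arange(1, len(t) + 1), where="post", label=name)

plt.xscale("log")          # log scale for time, as suggested above
plt.xlabel("time (s)")
plt.ylabel("images verified")
plt.legend()
plt.savefig("verified_over_time.png")
```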
30 minutes is okay for some benchmarks, but my concern would be that the worst-case runtime would then be 30 minutes * 50 images * 3 epsilon values * 3 networks = 13500 minutes, about 9 days! Is there some subset of the images / networks where 30 minutes is more interesting, so that a shorter timeout could be used for the others? Understandably they won't all take the full 30 minutes, but especially for the large networks with large epsilons they might mostly time out. |
Hi @stanleybak, I see the problem, how about these solutions:
This should reduce the worst-case performance to about 1.5 days and I expect all toolkits to finish in less than 1 day. Having different timeouts for different cases is, in my opinion, unfortunate as it may make the results more difficult to interpret. |
Yes, that's much better. I think it would be good to have some that have longer timeouts too, but perhaps we can do that in future years. |
Hi @stanleybak, do we have any update on the precision? |
I haven't seen anything, but certain layers are not supported with 64-bit floats by the ONNX runtime, at least the version that installed for me by default. These include ReLU and Gemm layers, so if I want to execute the networks I need to use 32-bit floats. It probably doesn't matter for this iteration of the competition, as I don't think it changes any of the verification results. |
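For reference, a sketch of running one of the networks with 32-bit inputs via onnxruntime (the model filename and input shape below are placeholders for one of the MNIST networks discussed above):

```python
import numpy as np
import onnxruntime as ort

# Hypothetical model file; any of the category's ONNX networks would do.
sess = ort.InferenceSession("mnist_relu_256x4.onnx")
input_name = sess.get_inputs()[0].name

# Inputs cast to 32-bit floats, per the precision discussion above.
x = np.random.rand(1, 784, 1).astype(np.float32)
outputs = sess.run(None, {input_name: x})
print(outputs[0])
```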
In terms of reporting the results, I assume we'll have some pdf like ARCH-COMP where we each write a section for our tool. With some small coordination we could also produce a combined table and graph that I think would be useful, even if we're all using slightly different computers this time around. I have scripts that can help with this, if desired. The input looks something like this, with one file per tool:
It's just a tab-separated file with the network as the first entry, the property as the second, the result as the third (safe/unsafe/error/timeout/anything-else), and the runtime in seconds as the fourth. The scripts produce a .tex table that we can include in the document, as well as a .pdf from gnuplot that looks like this, which is similar to what we discussed earlier: Does this output look okay? Is this something we want to do @ttj? What were the plans for reporting? |
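To make the format concrete, here is a small sketch of reading one such per-tool file (the filename and the aggregation at the end are just assumptions based on the description above):

```python
import csv

def load_results(path):
    # Each row: network <TAB> property <TAB> result <TAB> runtime_in_seconds
    rows = []
    with open(path, newline="") as f:
        for network, prop, result, runtime in csv.reader(f, delimiter="\t"):
            rows.append((network, prop, result, float(runtime)))
    return rows

# e.g. count the instances a hypothetical tool solved within the timeout:
# solved = sum(1 for _, _, res, _ in load_results("tool_a.tsv")
#              if res in ("safe", "unsafe"))
```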
Hi @pat676, what will be the total tolerance bound (i.e., UB - LB) in this case, epsilon or 2*epsilon? |
Hi @Neelanjana314, Each pixel should be constrained to LB=pixel_value - eps and UB=pixel_value + eps and clipped to [0, 1]. So the python command (assuming that the pixels are in floating-point format) is: lb = np.clip(img - eps, 0, 1) |
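Spelled out as a self-contained snippet, with the upper bound added since it follows directly from the description above:

```python
import numpy as np

def pixel_bounds(img, eps):
    # img: image with pixel values already scaled to [0, 1]
    # Returns per-pixel lower/upper bounds for the robustness query.
    lb = np.clip(img - eps, 0.0, 1.0)
    ub = np.clip(img + eps, 0.0, 1.0)
    return lb, ub
```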
In terms of ACAS Xu, are the benchmarks suggested by @stanleybak final? Cheers, |
@GgnDpSngh My understanding was there would be one category, ACASXU-ALL, that uses all 10 specifications / networks with a 5-minute timeout, and then a second category, ACASXU-HARD, that uses the 10 instances I suggested with a longer 6-hour timeout. Is this correct @ttj? Is there an updated timeline for when we're supposed to upload dockerfiles and/or results? |
@pat676 for your MNIST networks, I had to scale the input images by 1/255 (so that they are in the range [0, 1]) to get a correct classification on every sample for those networks. Do your networks include a scaling layer? (Also, presumably the epsilons are scaled to the range [0, 1]?) |
Hi @vtjeng, That is correct, the images should be scaled to [0, 1] and the epsilons are already scaled correctly. The network only includes fully-connected layers, not scaling layers. |
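So the preprocessing on the verifier side would look roughly like this (array names are illustrative; the stored images hold pixel values in [0, 255]):

```python
import numpy as np

def preprocess(raw_image):
    # raw_image: pixel values in [0, 255], e.g. shape (784,)
    # The networks expect inputs in [0, 1]; per the reply above, the
    # provided epsilons are already on this [0, 1] scale.
    return raw_image.astype(np.float32) / 255.0
```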
@stanleybak, @ttj, I wanted to make sure that I was getting the conventions for the ACASXU networks correct. Here's what I've been assuming: Input / Output: the order of variables is as follows:
Bounds on Input in Properties: in addition to the input constraints specified for the property, we also impose the input constraints for the network. For example, for Property 1:
|
The inputs / outputs look correct. The lower / upper bounds for property 1 I used are:
Which looks like it matches. Also make sure you scale the inputs before passing them to the network. I mentioned this in an earlier comment on June 2. Here's the code I used for that
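The original snippet isn't reproduced here; the usual ACAS Xu preprocessing normalizes each input by its mean and range, roughly as below. The `means` and `ranges` arrays are placeholders only; the real per-input values should be taken from the benchmark/network files (or cross-checked against the Marabou property files linked below).

```python
import numpy as np

# Placeholder normalization constants -- NOT the actual ACAS Xu values;
# take the real per-input means and ranges from the benchmark files.
means = np.zeros(5)
ranges = np.ones(5)

def scale_acasxu_input(x):
    # x: the five raw ACAS Xu inputs, ordered as in the benchmark definition
    return (np.asarray(x, dtype=np.float64) - means) / ranges
```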
The scaled ranges can be found in the marabou property files to double-check with what you get: https://github.com/NeuralNetworkVerification/Marabou/tree/master/resources/properties |
Hello, is there an Overleaf document to add our results to? Also, when is the due date to submit the final results? Thanks |
@ttj, just so everyone's on the same page: for ACASXU-ALL, this is intended to be a 5-minute timeout, using the 45 networks for properties 1-4 (180 instances), as well as properties 5-10, which are defined for one network each (6 more instances), for a total of 186 instances, correct? The original Reluplex paper omits some of the networks for properties 1-4, but in the Marabou paper it looks like they analyzed all 45. |
So overall, the benchmarks here are all ACASXU properties with a timeout of 5 minutes, and the three FCNs from @pat676 with eps = 0.02 and 0.05 with a timeout of 15 minutes? Cheers, |
Yes. I suggested a longer timeout for the ACASXU-HARD subset if needed, no longer than 6 hours. |
Networks with piecewise linear activations such as ReLUs
Representative benchmark: ACAS-Xu
Benchmark details: see appendix section 6: https://arxiv.org/abs/1702.01135
Specifications: safety (the output lies in a specific range for a given set of inputs); see the rough encoding sketch below
Questions: keep to only ReLUs, or allow other linear and piecewise linear activations?
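As a rough illustration of this specification style (the bounds, threshold, and the `network` callable below are purely hypothetical placeholders, not an actual benchmark property):

```python
import numpy as np

# A safety specification couples an input set with an output requirement:
# for every input in the box [input_lb, input_ub], the network's output
# must satisfy the constraint below. All numbers here are illustrative.
input_lb = np.full(5, -1.0)
input_ub = np.full(5, 1.0)

def output_in_safe_range(y):
    # e.g. require the first output score to stay below some bound
    return y[0] <= 10.0

def spot_check(network, num_samples=1000):
    # Sampling is only a sanity check, not verification; a verifier must
    # prove the property for *all* inputs in the box.
    xs = np.random.uniform(input_lb, input_ub, size=(num_samples, 5))
    return all(output_in_safe_range(network(x)) for x in xs)
```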