Builtin test runner #7
This protocol can also be useful for other purposes. Here are some that I can think of:
Agree. Other purposes: this can be useful for security, where security = quality. But quality is not ceteris paribus, so having a way for anyone to assess and attest quality helps support a model I have used in numerous industries: Protection(t) > Detection(t) + Reaction(t). Time-based security ...
In the context of PVM implementations, here are my two cents: we will most definitely need a universal harness for testing PVMs, and I was planning on proposing/creating one eventually. Otherwise I don't see a practical way of evaluating the performance milestones (just running a node on a testnet and eyeballing it is not going to be enough), and as you've noted @xlc, it's going to be tricky to evaluate correctness too.

But there's one more benefit to having such a harness: we can fuzz all of the implementations against each other! What this essentially means: have a fuzzer generate random programs, run each program on two PVM implementations in parallel, and compare the outputs. If the two PVMs disagree then we're dealing with one of the following:

a) the first PVM implementation is wrong and needs to be corrected,
b) the second PVM implementation is wrong and needs to be corrected,
c) the behavior is under-specced or nonsensical and should be clarified in the GP.

The beauty of this approach is that you don't need to explicitly create test vectors; fuzzers are pretty good at finding weird corner cases that would be hard to test for otherwise, and they often already include an automatic "test case reducer" that will spit out the shortest input that triggers a given failure.
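Purely to illustrate the idea, the core of such a differential check could look roughly like the Rust sketch below. All names here (`Outcome`, `differential_check`, the closures) are invented for the sketch and are not part of PolkaVM or any existing harness; a fuzzing engine such as cargo-fuzz/libFuzzer would supply the arbitrary program bytes.

```rust
/// What we compare between two implementations (illustrative fields only; a
/// real harness would also compare memory contents, gas usage, etc.).
#[derive(Debug, PartialEq)]
pub struct Outcome {
    pub trapped: bool,
    pub registers: Vec<u64>,
}

/// Feed the same fuzzer-generated program blob to two implementations and
/// flag any disagreement. The fuzzing engine calls this with arbitrary byte
/// strings and automatically reduces any failing input.
pub fn differential_check(
    run_a: impl Fn(&[u8]) -> Outcome,
    run_b: impl Fn(&[u8]) -> Outcome,
    program: &[u8],
) {
    let a = run_a(program);
    let b = run_b(program);
    assert_eq!(
        a, b,
        "divergence on program {program:02x?}: either one implementation is \
         wrong, or the behaviour is under-specified in the GP"
    );
}
```

Whether the two PVMs are linked in natively or driven over a pipe only changes how `run_a`/`run_b` are implemented, not the comparison itself.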
I think this should be up to the given implementation how they want to do it. Assuming the test runner is written in Rust (which is a reasonable choice assuming we want to fuzz), we could define a trait that each implementation would have to implement, and have one generic stdin/stdout based implementation for those implementations that don't want to write any extra code. (For example, for my PolkaVM I'd like to link to the test harness natively to make fuzzing faster.) Also, for PVM performance tests specifically we will probably want a slightly richer interface than just a single ... In PolkaVM I already have an interface for benchmarking in my benchmarking harness, and the PVM debugger also defines an interface for running programs. So ideally we'd merge the two together and define a single interface for both performance tests and correctness/fuzz tests. A few points that a good PVM interface needs:
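To make the trait idea concrete, one hypothetical shape covering both correctness and performance runs could be the following. None of these names exist in PolkaVM or in any harness yet; the types and fields are invented for the sketch.

```rust
use std::time::Duration;

/// Hypothetical harness-side interface. Each PVM would provide an
/// implementation of this trait, either linked natively into the harness or
/// through a generic adapter (stdin/stdout pipe, C FFI, HTTP, ...).
pub trait PvmInstance {
    /// Human-readable name, e.g. "polkavm" or "jam4s-pvm".
    fn name(&self) -> String;

    /// Load and prepare a program blob for execution.
    fn load(&mut self, program: &[u8]) -> Result<(), String>;

    /// Run until halt, trap or gas exhaustion; the returned outcome is what
    /// correctness and differential-fuzzing checks compare.
    fn run(&mut self, initial_regs: &[u64], gas: i64) -> RunOutcome;

    /// Optional hook for performance tests: wall-clock time of the last
    /// `run`, excluding load/compile time. `None` if not measured.
    fn last_run_duration(&self) -> Option<Duration>;
}

/// Observable result of a run (illustrative fields only).
#[derive(Debug, PartialEq)]
pub struct RunOutcome {
    pub status: ExitStatus,
    pub registers: Vec<u64>,
    pub gas_remaining: i64,
}

#[derive(Debug, PartialEq)]
pub enum ExitStatus {
    Halt,
    Trap,
    OutOfGas,
}
```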
There are many ways to define such an interface. stdin/stdout is one of the easiest ways that works universally, but we could definitely define a Rust trait and then implement a pipe adapter, a C FFI adapter, an HTTP adapter, etc. that conform to that trait.
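As a sketch of the pipe-adapter variant, assuming purely for illustration a one-shot, hex-over-stdin/stdout exchange with a team-provided executable (the function and protocol are invented here, not an agreed interface):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Hypothetical pipe adapter: spawn a team-provided executable, write the
/// hex-encoded test input to its stdin, and read the hex-encoded result from
/// its stdout. A natively-linked implementation would skip the subprocess
/// and implement the harness trait directly.
pub fn run_via_stdio(binary: &str, input: &[u8]) -> std::io::Result<String> {
    let mut child = Command::new(binary)
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Hex-encode the raw input so the protocol stays printable and easy to debug.
    let hex_input: String = input.iter().map(|b| format!("{b:02x}")).collect();
    child
        .stdin
        .take()
        .expect("stdin was requested as piped")
        .write_all(hex_input.as_bytes())?;
    // The stdin handle is dropped after the statement above, closing the pipe
    // so the child knows the input is complete.

    let output = child.wait_with_output()?;
    Ok(String::from_utf8_lossy(&output.stdout).trim().to_string())
}
```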
So to get the ball rolling, we should maybe try to prototype a test harness unofficially (with test running, performance benchmarks and differential fuzzing) and see how it goes. For PVM specifically I can whip up such a harness relatively easily, but there's not much point if PolkaVM is going to be the only one hooking into it.

Is there anyone who would like to volunteer their PVM implementation? It can be written in any language (for fuzzing it should be enough that only one VM is natively fuzzable; the others can just piggyback on it), but it should be reasonably complete (all instructions implemented and working) and it (obviously) must be open source. Would anyone be interested?

As far as I can see (unless I missed one), all of the PVM implementations except PolkaVM are currently closed source, because all of the nodes are closed source. If so you'd have to open up at least your PVM implementation, but considering PolkaVM is already open source, keeping your PVM implementation closed source doesn't really give you any competitive advantage, and you'll have to open it up anyway when delivering the first prize milestone.
We are happy to collaborate. I can spend some time next week to refactor our code to decouple our PVM implementation and move it to an open repo.
@xlc Sounds good to me; please ping me with a link when you're done. If anybody else is interested you're also welcome - ideally, the more implementations we add, the more we can cross-fuzz against each other and find discrepancies.
Our TypeScript implementation of PVM is already published on NPM (obfuscated, though). I think the reason people keep it private is not to have a competitive advantage, but rather to mitigate the risk of having someone plainly copy their work.
We (jam4s) are also in and happy to participate :)
If you want to do it privately then it isn't. (: But then you're on your own with integrating it with the test harness (at least until there is a generic interface), and it can also create problems when existing interfaces need to be refactored/changed/etc., since there would be no way for whoever's doing the refactoring to run your PVM and make sure it still works with the harness.

Also, having your PVM open source benefits everyone, because the more PVMs we have, the more cross-fuzzing combinations we have. If I only have PolkaVM in the test harness and everyone else's PVM is private, then the only thing you can check is whether your PVM matches PolkaVM's behavior, which is pointless (remember that the target here is to implement the GP, not to mimic whatever PolkaVM is doing, which sometimes even actively goes against the GP because it's used as a research vehicle to improve the GP). On the other hand, if everyone contributes their PVM and we have 31 PVMs in the test harness, then you can fuzz your PVM against 30 other PVMs, and if all 30 PVMs behave like X while yours behaves like Y, then statistically it's almost certain that your PVM is the one that's wrong (unless all 30 independent implementations got this particular part wrong, which in theory could happen).
I think we need a universal test runner to run the test vectors.
In order for a team to receive the JAM prize, the Core Fellowship will need to evaluate whether it passes the test vectors. However, without a universal test runner, each team will need to supply their own custom test runner to run the tests. And then who tests the test runner? It becomes non-trivial work to validate each test runner to ensure it correctly parses, executes, and compares the outputs.
The test vectors also include some implementation-specific values, such as the Safrole error codes, which should be ignored by test runners since those values are implementation-specific and not defined in the GP. This increases the complexity of every test runner, i.e. more work for everyone.
I suggest we define a simple stdin/stdout based test running protocol and implement a universal test runner that can run the test vectors against all implementations that implement such a protocol.
Test input could be piped via stdin, with test output expected on stdout; the runner would then compare the outputs and do any extra handling such as ignoring error codes. The data could be plain binary or hex encoded (controlled by a flag). It should be easy for any team to build a binary/script that supports such a protocol and supply it to the test runner to run all the tests.
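For the comparison step, a rough sketch of the "ignore implementation-specific values" handling is below. The `TestOutput` type and its fields are invented for this sketch and are not the actual test-vector schema; the point is only that implementation-defined values get masked before the equality check.

```rust
/// Invented stand-in for a decoded test-vector output; the real schema would
/// be whatever the published test vectors define.
#[derive(Debug, PartialEq, Clone)]
pub struct TestOutput {
    pub post_state_root: [u8; 32],
    /// Implementation-specific and not defined in the GP, so only its
    /// presence (an error occurred) is compared, never its value.
    pub error_code: Option<u32>,
}

/// Compare expected vs. actual output, masking out values the GP leaves up
/// to each implementation.
pub fn outputs_match(expected: &TestOutput, actual: &TestOutput) -> bool {
    let mask = |o: &TestOutput| TestOutput {
        error_code: o.error_code.map(|_| 0),
        ..o.clone()
    };
    mask(expected) == mask(actual)
}
```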