Built-in test runner #7

Open
xlc opened this issue Jul 15, 2024 · 10 comments

Comments

@xlc (Contributor) commented Jul 15, 2024

I think we need a universal test runner to run the test vectors.

In order for a team to receive the JAM prize, the Core Fellowship will need to evaluate whether its implementation passes the test vectors. However, without a universal test runner, each team will need to supply its own custom test runner to run the tests. And then who tests the test runner? It becomes non-trivial work to validate each test runner to ensure it correctly parses, executes, and compares the outputs.

The test vectors also include some implementation-specific values, such as the Safrole error codes, which should be ignored by test runners since they are implementation specific and not defined in the GP. This increases the complexity of the test runner, i.e. more work for everyone.

I suggest we define a simple stdin/stdout based test-running protocol and implement a universal test runner that can run the test vectors against any implementation that speaks this protocol.

The test input could be piped in via stdin and the test output read from stdout; the runner would then compare the outputs and do any extra handling such as ignoring error codes. The data could be raw binary or hex encoded (controlled by a flag). It should be easy for any team to build a binary or script that supports such a protocol and supply it to the test runner to run all the tests.
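
To make that concrete, here is a minimal sketch of the runner side of such a protocol, assuming one hex-encoded vector per invocation and a hypothetical `--hex` flag; none of the names or framing here are a proposal for the actual format:

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Hypothetical runner side of the protocol: pipe a hex-encoded input to the
/// team-supplied binary, read its hex-encoded output from stdout, and compare
/// it against the expected output from the test vector.
fn run_vector(binary: &str, input_hex: &str, expected_hex: &str) -> std::io::Result<bool> {
    let mut child = Command::new(binary)
        .arg("--hex") // assumed flag selecting hex over raw binary encoding
        .stdin(Stdio::piped())
        .stdout(Stdio::piped())
        .spawn()?;

    // Write the input and drop the handle so the child sees EOF on stdin.
    child.stdin.take().unwrap().write_all(input_hex.as_bytes())?;

    let output = child.wait_with_output()?;
    let actual = String::from_utf8_lossy(&output.stdout);

    // A real runner would also mask implementation-specific fields
    // (e.g. Safrole error codes) before comparing.
    Ok(actual.trim() == expected_hex)
}
```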

@xlc (Contributor, Author) commented Jul 17, 2024

This protocol could also be useful for other purposes. Here are some that I can think of:

  • A test case reducer that tries to generate a minimal test input producing different results between two implementations.
  • A standard bencher.
  • A network emulator tool (something like Chopsticks) that can be backed by one of the implementations.

@Blockcowboy2 commented

Agree, and there are other purposes. This can be useful for security, where security = quality; but quality is not ceteris paribus, so having a way for anyone to assess and attest quality helps support a model I have used in numerous industries: Protection(t) > Detection(t) + Reaction(t). Time-based security ...

@koute commented Sep 26, 2024

In the context of PVM implementations, here are my two cents:

We will most definitely need a universal harness for testing PVMs, and I was planning on proposing/creating one eventually. Otherwise I don't see a practical way of evaluating the performance milestones (just running a node on a testnet and eyeballing it is not going to be enough), and as @xlc noted it's going to be tricky to evaluate correctness too.

But there's one more benefit to having such a harness - we can fuzz all of the implementations against each other!

What this essentially means: have a fuzzer generate random programs, run each program on two PVM implementations in parallel, and compare the outputs. If the two PVMs disagree then either a) the first PVM implementation is wrong and needs to be corrected, b) the second PVM implementation is wrong and needs to be corrected, or c) the behavior is under-specified or nonsensical and should be clarified in the GP.

So the beauty of this approach is that you don't need to explicitly create test vectors; fuzzers are pretty good at finding weird corner cases that would be hard to test for otherwise, and they often already include an automatic "test case reducer" that will spit out the shortest input that triggers a given failure.
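
As a purely illustrative sketch of the differential check such a fuzzer would drive (the `Outcome` type and the closure-based wiring are assumptions; an engine like cargo-fuzz would supply the random `program` bytes and the reducer):

```rust
/// Outcome of executing a program; the exact shape is an assumption here.
#[derive(Debug, PartialEq)]
struct Outcome {
    status: u8,
    gas_used: u64,
    registers: [u64; 13],
}

/// Differential check at the heart of the fuzzer. The two closures stand in
/// for two independent PVM implementations.
fn differential_check(
    program: &[u8],
    pvm_a: impl Fn(&[u8]) -> Outcome,
    pvm_b: impl Fn(&[u8]) -> Outcome,
) {
    let a = pvm_a(program);
    let b = pvm_b(program);
    // Any divergence means one implementation is wrong, or the GP is
    // under-specified for this input; the fuzzer's built-in reducer can then
    // shrink `program` to a minimal reproducer.
    assert_eq!(a, b, "implementations diverged on program {:02x?}", program);
}
```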

> I suggest we define a simple stdin/stdout based test-running protocol and implement a universal test runner that can run the test vectors against any implementation that speaks this protocol.

I think it should be up to each implementation how they want to do it. Assuming the test runner is written in Rust (which is a reasonable choice assuming we want to fuzz), we could define a trait that each implementation would have to implement, and have one generic stdin/stdout based implementation for those implementations that don't want to write any extra code. (For example, for my PolkaVM I'd like to link it into the test harness natively to make fuzzing faster.)
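
As a rough illustration of that shape (the trait name, method signature, and wire framing below are all made up for this sketch, not a proposed interface):

```rust
use std::io::Write;
use std::process::{Command, Stdio};

/// Hypothetical trait each PVM under test would implement, one way or another.
trait PvmUnderTest {
    fn run(&mut self, program: &[u8], input: &[u8]) -> Vec<u8>;
}

/// Generic adapter for teams that don't want to write any Rust glue:
/// it shells out to a binary speaking the stdin/stdout protocol.
struct StdioPvm {
    binary: std::path::PathBuf,
}

impl PvmUnderTest for StdioPvm {
    fn run(&mut self, program: &[u8], input: &[u8]) -> Vec<u8> {
        let mut child = Command::new(&self.binary)
            .stdin(Stdio::piped())
            .stdout(Stdio::piped())
            .spawn()
            .expect("failed to spawn PVM binary");

        // The framing (length-prefixed program, then length-prefixed input)
        // is an assumption made for this sketch.
        let mut stdin = child.stdin.take().unwrap();
        stdin.write_all(&(program.len() as u32).to_le_bytes()).unwrap();
        stdin.write_all(program).unwrap();
        stdin.write_all(&(input.len() as u32).to_le_bytes()).unwrap();
        stdin.write_all(input).unwrap();
        drop(stdin); // close the pipe so the child sees EOF

        child.wait_with_output().expect("PVM binary exited abnormally").stdout
    }
}
```

A natively linked implementation (e.g. PolkaVM) would implement the same trait in-process instead of going through a pipe.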

Also, for PVM performance tests specifically we will probably want a slightly richer interface than just a single fn(input: Input) -> Output. We don't want to measure one-time initialization overhead or IPC overhead, and it would be nice to measure compilation performance separately, as I don't expect any implementation that is not a recompiler to be able to pass the performance tests anyway.

In PolkaVM I already have an interface for benchmarking in my benchmarking harness, and the PVM debugger also defined an interface for running programs. So ideally we'd merge the two together and define a single interface for both performance tests and correctness/fuzz tests.

So, a few points that a good PVM interface needs (sketched as a trait after the list):

  • An initialize function to let the implementation do costly one-time initialization (we don't want to include this cost in the measurements),
  • A nop function to measure the IPC overhead of whatever "protocol" the harness uses to communicate with the implementation (to subtract it from the measurements; otherwise we'd unfairly penalize implementations which use a generic/heavier IPC mechanism like stdin/stdout),
  • A compile function to measure module compilation overhead (mainly relevant for recompilers; not strictly necessary, but it's very useful to be able to measure/compare compilation performance separately; implementations that are not recompilers would just opt out of this / have a dummy implementation that does nothing),
  • A run function to measure execution performance.
  • A bunch of auxiliary functions to set/get registers, set/get gas, modify memory. (For the registers and the gas we could technically just have them passed in/out of the run function, but for memory that's somewhat impractical and would add unnecessary noise to the measurements.)
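
Put together, those points could map onto a trait roughly like the following; every name and signature is a placeholder for illustration, not a concrete proposal:

```rust
/// Illustrative benchmarking/testing interface following the points above.
trait PvmBench {
    /// One-time costly setup, excluded from the measurements.
    fn initialize(&mut self);

    /// Does nothing; used to measure the IPC/protocol round-trip overhead
    /// so it can be subtracted from the real measurements.
    fn nop(&mut self);

    /// Compile/prepare a program blob. Non-recompilers can make this a no-op.
    fn compile(&mut self, program: &[u8]);

    /// Execute the previously compiled program.
    fn run(&mut self);

    // Auxiliary state accessors, kept out of `run` so that e.g. memory
    // transfers don't add noise to the execution measurements.
    fn set_gas(&mut self, gas: i64);
    fn gas(&self) -> i64;
    fn set_register(&mut self, index: usize, value: u64);
    fn register(&self, index: usize) -> u64;
    fn write_memory(&mut self, address: u32, data: &[u8]);
    fn read_memory(&self, address: u32, length: u32) -> Vec<u8>;
}
```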

@xlc (Contributor, Author) commented Sep 26, 2024

There are many ways to define such an interface. stdin/stdout is one of the easiest ways that works universally, but we could definitely define a Rust trait and implement a pipe adapter, a C FFI adapter, an HTTP adapter, etc. that conform to such a Rust trait.
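
For example (purely illustrative, reusing the hypothetical PvmUnderTest trait from the sketch above; the exported symbol and its signature are invented and would not link against any real library), a C FFI adapter could look roughly like this:

```rust
// Invented C ABI entry point for this sketch; a real adapter would bind to
// whatever the implementation actually exports.
extern "C" {
    fn pvm_run(
        program: *const u8, program_len: usize,
        input: *const u8, input_len: usize,
        out: *mut u8, out_cap: usize,
    ) -> usize;
}

struct FfiPvm;

impl PvmUnderTest for FfiPvm {
    fn run(&mut self, program: &[u8], input: &[u8]) -> Vec<u8> {
        let mut out = vec![0u8; 64 * 1024];
        // SAFETY: relies on the (hypothetical) library honouring `out_cap`.
        let written = unsafe {
            pvm_run(
                program.as_ptr(), program.len(),
                input.as_ptr(), input.len(),
                out.as_mut_ptr(), out.len(),
            )
        };
        out.truncate(written);
        out
    }
}
```

An HTTP or pipe adapter would wrap its transport behind the same trait in the same way.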

@koute commented Oct 5, 2024

So, to get the ball rolling, maybe we should try to prototype a test harness unofficially (with test running, performance benchmarks and differential fuzzing) and see how it goes.

For PVM specifically I can whip up such a harness relatively easily, but there's not much point if PolkaVM is going to be the only one hooking into it. Is there anyone who would like to volunteer their PVM implementation? It can be written in any language (for fuzzing it should be enough that only one VM is natively fuzzable; the others can just piggyback on it), but it should be reasonably complete (all instructions implemented and working) and it (obviously) must be open source. Would anyone be interested?

As far as I can see (unless I missed one), all of the PVM implementations except PolkaVM are currently closed source, because all of the nodes are closed source? If so, you'd have to open up at least your PVM implementation; but considering PolkaVM is already open source, keeping your PVM implementation closed source doesn't really give you any competitive advantage, and you'll have to open it up anyway when delivering the first prize milestone.

@xlc (Contributor, Author) commented Oct 6, 2024

We are happy to collaborate. I can spend some time next week refactoring our code to decouple our PVM implementation and move it to an open repo.

@koute commented Oct 7, 2024

@xlc Sounds good to me; please ping me with a link when you're done.

If anybody else is interested you're also welcome; the more implementations we add, the more we can cross-fuzz against each other and find discrepancies.

@tomusdrw commented Oct 7, 2024

Our TypeScript implementation of the PVM is already published on NPM (obfuscated though). I think the reason people keep it private is not to gain a competitive advantage, but rather to mitigate the risk of someone outright copying their work.
We are happy to participate and to fully open source the implementation if needed. Just curious, why is being open source a strict requirement for fuzzing though?

@subotic commented Oct 7, 2024

We (jam4s) are also in and happy to participate :)
It shouldn't be a problem to open-source our PVM implementation.

@koute commented Oct 7, 2024

> Just curious, why is being open source a strict requirement for fuzzing though?

If you want to do it privately then it isn't. (: But then you're on your own with integrating it into the test harness (at least until there is a generic interface), and it can also create problems when existing interfaces need to be refactored/changed/etc., since there would be no way for whoever is doing the refactoring to run your PVM and make sure it still works with the harness.

Also, having your PVM open source benefits everyone, because the more PVMs we have, the more cross-fuzzing combinations we have. If I only have PolkaVM in the test harness and everyone else's PVM is private, then the only thing you can check is whether your PVM matches PolkaVM's behavior, which is pointless (remember that the target here is to implement the GP, not to mimic whatever PolkaVM is doing, which sometimes even actively goes against the GP because it's used as a research vehicle to improve the GP). On the other hand, if everyone contributes their PVM and we have 31 PVMs in the test harness, now you can fuzz your PVM against 30 other PVMs, and if all 30 PVMs behave like X and you behave like Y then statistically it's almost certain that it's your PVM that's wrong (unless all 30 independent implementations got this particular part wrong, which in theory could happen).

@davxy mentioned this issue Nov 19, 2024