Potential future optimizations

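Add an integer type. Integer arithmetic generally runs faster than double-precision arithmetic on modern CPUs: for the x86-64 instructions involved, integer throughput is typically about 4 times higher (that is, the reciprocal throughput is about 4 times lower). This translates to an improvement of around 10-20% on some loop benchmarks.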
Make some of the commonly used unary/binary functions their own instructions.
Use raw pointers instead of vector + index.
Remove some of the debug checks. Instead, we can validate the bytecode once before running it, so that it cannot go wrong as long as our evaluator is correct. This removes the per-operation overhead of those checks.
Profile-guided optimization. This can provide a performance improvement of about 10% in some cases.
Support numerical vectors and matrices in addition to generic heterogeneous lists.
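
As a rough illustration of the "validate once, then skip the checks" idea from the list above, here is a minimal sketch in Rust. The instruction set (`Op`), the function names and the straight-line (no jumps) restriction are all assumptions made for this example and are not taken from the actual evaluator. The validator only simulates the stack depth; the evaluator then uses unchecked indexing and unchecked pops, relying on the validator's guarantee.

```rust
// Hypothetical straight-line instruction set, for illustration only.
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Op {
    Push(f64), // push a constant
    Add,       // pop two values, push their sum
    Mul,       // pop two values, push their product
    Halt,      // stop and return the top of the stack
}

/// One-time validity check: tracks only the stack depth and rejects any
/// program that would underflow the stack or run off the end without Halt.
pub fn validate(code: &[Op]) -> Result<(), String> {
    let mut depth: usize = 0;
    for (pc, op) in code.iter().enumerate() {
        match op {
            Op::Push(_) => depth += 1,
            Op::Add | Op::Mul => {
                if depth < 2 {
                    return Err(format!("stack underflow at instruction {}", pc));
                }
                depth -= 1; // two pops, one push
            }
            Op::Halt => {
                if depth < 1 {
                    return Err(format!("Halt with empty stack at instruction {}", pc));
                }
                return Ok(()); // nothing after the first Halt is ever executed
            }
        }
    }
    Err("program does not reach a Halt".to_string())
}

/// Evaluator with no per-instruction bounds or underflow checks.
///
/// # Safety
/// `code` must have been accepted by `validate`.
unsafe fn run_unchecked(code: &[Op]) -> f64 {
    let mut stack: Vec<f64> = Vec::new();
    let mut pc = 0;
    loop {
        // In bounds: a validated program reaches Halt before the end.
        let op = unsafe { *code.get_unchecked(pc) };
        pc += 1;
        match op {
            Op::Push(v) => stack.push(v),
            Op::Add => {
                // Non-empty: the validator proved depth >= 2 here.
                let b = unsafe { stack.pop().unwrap_unchecked() };
                let a = unsafe { stack.pop().unwrap_unchecked() };
                stack.push(a + b);
            }
            Op::Mul => {
                let b = unsafe { stack.pop().unwrap_unchecked() };
                let a = unsafe { stack.pop().unwrap_unchecked() };
                stack.push(a * b);
            }
            Op::Halt => return unsafe { stack.pop().unwrap_unchecked() },
        }
    }
}

/// Safe entry point: validate once, then run without checks.
pub fn eval(code: &[Op]) -> Result<f64, String> {
    validate(code)?;
    // SAFETY: `validate` accepted this exact program.
    Ok(unsafe { run_unchecked(code) })
}
```

With this split, the cost of checking is paid once per program rather than once per executed instruction, which matters most for hot loops. A real bytecode with jumps would need a more involved validator (for example, checking that every jump target is in range and that the stack depth is consistent at each target).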

Things that do not work

Placing a goto at the end of each match arm. This may work on older CPUs, but on modern CPUs the branch predictor uses branch history to make its predictions, so there is no need to duplicate the dispatch code and add gotos.

Profile on multiple CPUs, at least on Intel 12th-generation and later hybrid CPUs. The big and little cores react very differently to various optimizations.