Replies: 10 comments 14 replies
-
We (or I) need to do something like this for the expression-elements-properties branch.
-
https://github.com/Mathics3/mathics-core/blob/benchmarking/bench-results.rst has recent results comparing V 4.0.0 performance on the combinatorica 0.9 test versus the current master. I will be drilling down here to see what is up in more detail. I will also do some comparisons for #83 (comment). If you want to verify, the two branches to use are https://github.com/Mathics3/mathics-core/tree/benchmarking for current and https://github.com/Mathics3/mathics-core/tree/4.0.0-benchmark Run:
in each branch.
-
For
vs 4.0.0-benchmarking
While for 4.1.0:
4.0.0:
-
I don't have time right now to investigate this further, but I do have a plan for approaching things that I think is neat. First of all, we could use (Elsewhere I have mentioned that the "1000" could be turned into "50" or even "10"; things would be just as apparent but take less time and be easier to debug.) But here's something else that I may work on because it is neat. What one can do is remove any data where the number of calls didn't change between, say, 10 and 50, or where the number of calls doesn't change between version 4.0.0 and 4.1.0. And clearly remove any line that has a 0 in it. With this, I think we would have a very precise indication of what's up.
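The filtering idea above can be sketched in a few lines. This is a hypothetical helper (the function name and the dict-of-counts input format are illustrative, not part of Mathics); it assumes you have already extracted per-function call counts from two profiling runs:

```python
# Hypothetical sketch: filter profiling rows down to those whose call
# counts actually scale with the workload. Input dicts map a function
# label to its call count in each run.

def interesting_rows(counts_small, counts_large):
    """Keep functions whose call count changes between a small run
    (e.g. a loop of 10) and a larger run (e.g. a loop of 50),
    dropping zero-count entries."""
    result = {}
    for func, n_small in counts_small.items():
        n_large = counts_large.get(func, 0)
        if n_small == 0 and n_large == 0:
            continue  # never called: noise
        if n_small == n_large:
            continue  # fixed overhead; does not scale with the workload
        result[func] = (n_small, n_large)
    return result

# Example: only the evaluation call scales with the loop count.
small = {"Expression.evaluate": 10, "load_module": 1, "unused": 0}
large = {"Expression.evaluate": 50, "load_module": 1, "unused": 0}
print(interesting_rows(small, large))  # → {'Expression.evaluate': (10, 50)}
```

The same filter applies unchanged when the two runs are two versions (4.0.0 vs 4.1.0) instead of two loop sizes.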
-
Here is a proposal for benchmarking code, to compare the performance in the evaluation of several (simple) expressions, against

```
Results:
machine: x86_64
system: Linux 5.13.0-41-generic
test1<< Do[{a,a,a,a,a,a,a,a,a,a,a,a,a,a,a},{10}] >> 2.2.0: 0.581977ms +/-6.1%
test2<< Do[Table[a, {15}],{10}] >> 2.2.0: 2.380450ms +/-4.0%
test3<< Do[Table[i, {i, 15}],{10}] >> 2.2.0: 37.125100ms +/-0.8%
```
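A harness for a proposal like this could look as follows. This is only a sketch: the `evaluate` argument is a stand-in for whatever entry point each benchmarked Mathics version exposes, and the output format just mimics the `testN<< expr >> time +/-err%` lines above:

```python
import timeit
from statistics import mean, stdev

def bench(name, expr, evaluate, repeats=5, number=10):
    """Time `evaluate(expr)` and report mean per-call time in ms
    with the relative spread across repeats."""
    times = timeit.repeat(lambda: evaluate(expr), repeat=repeats, number=number)
    per_call_ms = [t / number * 1000 for t in times]
    m = mean(per_call_ms)
    rel = 100 * stdev(per_call_ms) / m if m else 0.0
    return f"{name}<< {expr} >> {m:.6f}ms +/-{rel:.1f}%"

# Demo with a dummy evaluator; in practice you would pass the real
# evaluation function of the version under test.
print(bench("test1", "Do[{a,a,a},{10}]", lambda e: len(e)))
```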
-
@rocky, here I added it.

```
machine: x86_64
system: Linux 5.13.0-41-generic
test1<< Do[{a,a,a,a,a,a,a,a,a,a,a,a,a,a,a},{10}] >> 2.2.0: 0.588356ms +/-5.8%
test2<< Do[Table[a, {15}],{10}] >> 2.2.0: 2.362820ms +/-4.6%
test3<< Do[Table[i, {i, 15}],{10}] >> 2.2.0: 37.315500ms +/-0.8%
```
-
Thanks for benchmarking. I have a number of thoughts and comments on this, but that will have to wait for when I have time. How is 4.10.dev0 (which I assume is 4.1.0.dev0) different from origin/master?
-
I have been looking at the code and following traces in the execution of This is an area where we are slow because we are calling The fact that The
So here I am thinking of adding a function
There are times when we create an So computing properties of elements in creating an Expression slows things down when an evaluation will never occur. Lastly, there is a pattern that comes up: Here I am thinking about adding a method In sum:
Now a little bit about the parameters for the functions.
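The idea of not paying for element properties when an `Expression` is never evaluated can be sketched with lazy computation. Everything here is illustrative (the class name, the particular properties, and the dict representation are assumptions, not the mathics-core API):

```python
# Hedged sketch: defer computing element properties until an
# evaluation actually needs them, so that Expressions built but never
# evaluated pay no cost at construction time.

class Expr:
    def __init__(self, head, *elements):
        self.head = head
        self.elements = elements
        self._properties = None  # not computed at construction time

    @property
    def properties(self):
        # Computed on first access, then cached.
        if self._properties is None:
            self._properties = {
                "element_count": len(self.elements),
                "all_atoms": all(not isinstance(e, Expr) for e in self.elements),
            }
        return self._properties

e = Expr("F", "x", "y")           # cheap: no property scan happens here
print(e.properties["all_atoms"])  # → True (computed lazily on this access)
```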
-
With changes along the lines above, I am seeing I've been a bit sloppy or pessimistic in what I've done so far - it was hard enough to get this far. I do think the code is a little bit clearer and cleaner. Also interesting would be an extended sequence like the combinatorica runs. But that too will have to wait for some other time.
-
In working on this, there are a couple of conclusions or observations.
-
Continuing with the problem of performance in Mathics, I want to discuss here one of the critical factors that determine it: the evaluation routine. To decouple it from other aspects, like pattern matching, rule application, or parameter replacement, let's review how long it takes to evaluate a trivial expression, i.e. an expression that does not involve previously defined symbols. According to MathicsBenchmark, evaluating something like

`F[x]`

when `F` and `x` are undefined symbols takes around 60 µs. We can also check this inside the interpreter using the function `Timing`: we can see that each call takes, in Mathics, around 60 +/- 10 µs, while in WMA it takes 0.20 +/- 0.02 µs.
In Mathics, since the time does not seem to depend on the number of atomic parameters, we can expect that most of the time is consumed in the `mathics.core.expression.evaluate` method. Inside this method, most of the work is done by the `mathics.core.expression.evaluate_next` method, which is applied iteratively until a fixed point is reached.

Looking inside `mathics.core.expression.evaluate_next` by splitting it into small chunks and timing those chunks, I found (processing `` Global`F[Global`x] ``; the modified code is below):

The first two lines provide a baseline to discount the time required by the timer, which after the first call amounts to just two function calls (~120 ns x 2). Discarding those lines and sorting by the time consumed, we then obtain (processing `` Global`F[Global`x] ``):
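The chunk-splitting approach described above can be sketched generically: bracket each step of the loop body with `time.perf_counter()` samples and accumulate per-chunk totals. The step labels and dummy bodies below are stand-ins, not the real contents of `evaluate_next`:

```python
import time
from collections import defaultdict

chunk_time = defaultdict(float)

def timed(label, fn, *args):
    """Run fn(*args), adding its wall time to the running total for label."""
    t0 = time.perf_counter()
    result = fn(*args)
    chunk_time[label] += time.perf_counter() - t0
    return result

# Dummy stand-ins for steps inside an evaluation loop:
for _ in range(1000):
    attrs = timed("get attributes", lambda: ())
    timed("check attributes", lambda a: len(a) == 0, attrs)
    timed("build new expression", lambda: ("F", "x"))

# Report chunks sorted by total time consumed, most expensive first.
for label, total in sorted(chunk_time.items(), key=lambda kv: -kv[1]):
    print(f"{label:24s} {total * 1e6:8.1f} us")
```

Note the timer calls themselves add overhead to every chunk, which is why the baseline rows mentioned above need to be discounted.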
So, the most costly step is checking attributes. In our case the attributes are empty, so this is a kind of lower bound. We could probably improve this time by using a bitmask instead of a list of strings to store attributes. Building the new expression can also be improved by optimizing `from_python` (as in PR #51) or by avoiding calling it at all (PR #49).

Next on the list is the step of processing rules. In this case there are no rules, so all the time is devoted to finding a definition for the head and the leaves and, again, checking attributes. In the following section I am going to discuss some ideas about how to speed up access to the `Definition` object.

Very close behind, in the next place, we find the "checking range" chunk. Again, part of this subroutine needs to check attributes, but here another routine used many times throughout the code also appears: `Expression.has_form`. `has_form` involves a) a function call, and b) string comparisons instead of symbol (`is`) comparisons. Some ideas to speed up these calls are implemented in PR #65. Also, this chunk involves calls to `evaluate` over the leaves, which in this case we can expect to take half of the time (judging by the time required to evaluate the head of the expression, position 6 in the ranking).

The remaining items in the list each take less than 3 µs, involving a) private function definitions, b) local module loading, and c) copying leaves. While they seem less critical, we should notice that each of them takes 10 times longer than the whole evaluation in WMA.
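The bitmask idea can be sketched with `enum.IntFlag`. The flag names below are illustrative, not the actual mathics-core attribute constants:

```python
from enum import IntFlag

# Attributes as bit flags: membership becomes one integer AND
# instead of a string search through a list.

class Attr(IntFlag):
    NOTHING = 0
    HOLD_ALL = 1
    HOLD_FIRST = 2
    LISTABLE = 4
    ORDERLESS = 8

# Current scheme: list of strings, O(n) string comparisons per check.
attrs_as_list = ["System`Listable", "System`Orderless"]
slow = "System`Listable" in attrs_as_list

# Bitmask scheme: a single AND and a truth test.
attrs_as_bits = Attr.LISTABLE | Attr.ORDERLESS
fast = bool(attrs_as_bits & Attr.LISTABLE)

print(slow, fast)  # → True True
```

Besides the cheaper check, the empty-attributes case (the lower bound above) collapses to comparing one integer against zero.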
Finally, notice that by using Cython, times are roughly halved, except in the case of loading attributes, where the time is increased. In any case, what we need here is to reduce all these times by at least a factor of 10.
Regarding access to definitions

As I mentioned before, at each step of the evaluation process the method `Expression.evaluate` has to access the `Definition` object associated with a certain `lookup_name` in order to recover the different sets of rules and attributes that define the way in which the rules are applied. Any improvement in the algorithm that retrieves the `Definition` of a given `Symbol` should have a measurable positive effect on the evaluation time.

In the current implementation, the `Definition` objects are stored in the collection class `Definitions`. This class stores all the definitions in the session. This allows, for example, having several independent Mathics sessions within the same Python session. I think this is used in Mathics-Django to serve different instances [check]. In other implementations (like WMA or Jupyter's wolfram-kernel) this behaviour is implemented using different "kernels" that run as different processes. With the current implementation, a `Symbol` can have many `Definition`s, depending on the session, and then, during the evaluation, we need to "look up" the corresponding definitions.

`Definitions.get_definition` returns the definition of a given `Symbol`, in terms of its (not necessarily fully qualified) name. `Definitions` stores up to 4 instances of `Definition` objects for each symbol in different dictionaries: `Definitions.definitions_cache`, `Definitions.builtin`, `Definitions.user`, and `Definitions.pymathics`. `Definitions.get_definition` first tries to find the (not necessarily fully qualified) name in the `Definitions.definitions_cache` dictionary. Most of the time the `Definition` is in the `definitions_cache` and is simply returned. If the definition is not in the `definitions_cache`, then `get_definition` looks into the other three dictionaries by the fully qualified name. If a definition appears in just one of the three dictionaries, that definition is stored in the cache and then returned.

If there is no definition, a new empty definition is created and stored as a `user` definition and put into the cache. On the other hand, if there is more than one definition, a merge of the available definitions is built and stored in the cache.

`builtin` definitions are built and stored when the `Definitions` object is created, while `pymathics` definitions are built and stored when a `pymathics` module is loaded. This makes it possible to `unload` modules without erasing the vanilla definition provided by `mathics-core`, and means that `Clear[S]` restores the vanilla definition provided by `builtin` and the `pymathics` definitions already loaded.
definitions already loaded.When the input from the front-end is parsed, usually the expressions consists of non fqns. During this parsing step,
Definitions.get_definition
helps to determine if certain name exists in the current context, or in the context path, to produce expressions made ofSymbols
, that always have fqn's.Afterward, during the evaluation, we can assume that all the calls to
get_definition
are made with fqn's. In this process,Expression.evaluate
andSymbol.evaluate
ask for the lookup name of the symbols in the expression callingget_lookup_name()
. ForSymbol
sget_lookup_name()
returns the (fqn) of theSymbol
. ForExpression
s,get_lookup_name()
(in master) returnsself._head.get_lookup_name()
, in a recursive way. In the branch faster_get_lookup_name (PR #16) an iterative algorithm is proposed, that seems to reduce the time in around a 20% (132ns vs 167ns in my laptop).Another idea to improve the access time would be to avoid function calls to the
get_definition
method by checking first if there is already a definition in thedefinitions_cache
PR #59 explores this direction.Still faster could be also to avoid dealing with
Definitions
collections and store definitions directly insideSymbol
objects. These are directions that we can explore in the future.Appendix:
Modified code for timing
evaluate_next
Trivial expression evaluations from Jupyter (using `timeit`):

| Expression | Time per loop (mean ± std. dev. of 7 runs) | Loops per run |
|---|---|---|
| `1` | 126 ns ± 3.75 ns | 10,000,000 |
| ``Global`a`` | 329 ns ± 33 ns | 1,000,000 |
| ``Global`F[]`` | 2 µs ± 130 ns | 1,000,000 |
| ``Global`F[Global`a]`` | 2.1 µs ± 9.26 ns | 100,000 |
| ``Global`F[System`Pi]`` | 2.2 µs ± 159 ns | 100,000 |
| ``Global`F[Global`a, Global`b]`` | 2.29 µs ± 15.6 ns | 100,000 |
| ``Global`F[Global`a, Global`b, Global`c, Global`d, Global`e, Global`f, Global`g, Global`h, Global`i, Global`j, Global`k]`` | 4.09 µs ± 168 ns | 100,000 |