
tokenizer gets no-meaning infix ops from JSON #87

Merged · 7 commits · Nov 25, 2024
4 changes: 3 additions & 1 deletion .github/workflows/mathics.yml
@@ -31,7 +31,9 @@ jobs:
- name: Test Mathics3
run: |
# Until next Mathics3/mathics-core release is out...
git clone https://github.com/Mathics3/mathics-core.git
# git clone https://github.com/Mathics3/mathics-core.git
          # Until the operator-info-from-JSON branch is merged
git clone -b operator-info-from-JSON https://github.com/Mathics3/mathics-core.git
cd mathics-core/
make PIP_INSTALL_OPTS='[full]'
# pip install Mathics3[full]
60 changes: 24 additions & 36 deletions mathics_scanner/tokeniser.py
@@ -1,6 +1,6 @@
# -*- coding: utf-8 -*-


import os.path as osp
import re
import string
from typing import Optional
@@ -9,6 +9,22 @@
from mathics_scanner.errors import ScanError
from mathics_scanner.prescanner import Prescanner

ROOT_DIR = osp.dirname(__file__)
try:
import ujson
except ImportError:
import json as ujson # type: ignore[no-redef]

# Load Mathics3 character information from JSON. The JSON is built from
# named-characters.yml

operators_table_path = osp.join(ROOT_DIR, "data", "operators.json")
assert osp.exists(
Contributor

This produces an error when the operators.json table is built for the first time:
mathics_scanner.generate.build_operator_tables imports mathics_scanner.__version__,
which causes mathics_scanner.__init__ to be loaded. That in turn tries to import this module, which finds that the file has not been created yet.

Contributor

A way to avoid this error would be to put all the initialization code inside an initialization function, and instead of raising an exception if the file does not exist, just show a warning.
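The suggestion above can be sketched as a lazy loader (function and variable names here are hypothetical; the PR itself keeps the load at module import time):

```python
import json
import os.path as osp
import warnings

ROOT_DIR = osp.dirname(__file__)
_OPERATOR_DATA = None  # filled in on first use, not at import time


def get_operator_data():
    """Load data/operators.json lazily; warn rather than raise when the
    table has not been generated yet (e.g. during the first build)."""
    global _OPERATOR_DATA
    if _OPERATOR_DATA is None:
        path = osp.join(ROOT_DIR, "data", "operators.json")
        if osp.exists(path):
            with open(path, "r", encoding="utf8") as f:
                _OPERATOR_DATA = json.load(f)
        else:
            warnings.warn(f"Operator table not found: {path}; using empty table")
            _OPERATOR_DATA = {}
    return _OPERATOR_DATA
```

Because nothing runs at import time, build_operator_tables could then import mathics_scanner.__version__ without tripping over the still-missing file.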

@rocky (Member Author), Nov 25, 2024

This was the situation before 370b8fe, and it does not happen now on Ubuntu and macOS. But I don't know why Windows is still failing here.

I am exhausted from tracking down all the little inconsistencies for today. If you can move this forward, please do.

Contributor

Sure!

Member Author

> A way to avoid this error would be to put all the initialization code inside an initialization function, and instead of raising an exception if the file does not exist, just show a warning.

Moving the code to a function was done, and delaying initialization was attempted, but the code is thorny enough that something in there does work before the first tokenizer is created.

I now have a workaround, so let's not add yet another.

@rocky (Member Author), Nov 25, 2024

Sure!

If you have work cycles to spare, here are some things that in my opinion are more important than yet another workaround:

  • Put in the correct precedence for no-meaning operators
  • Split out the following list of operator names by creating sections for left-assoc infix, right-assoc infix, flat infix, (not yet done) prefix/postfix, "misc", and the newly added {Und,D}irectedEdge operators.

Contributor

If you have work cycles to spare, here are some things that in my opinion are more important than yet another workaround:

* Put in the correct precedence for no-meaning operators

Regarding this: "correct" would be in relation to the WMA Precedence[...], wouldn't it?

* Split out the following list of operator names by creating sections for left assoc infix, right assoc infix, flat infix, (not yet done prefix/postfix), "misc" and newly added {Und,D}irectedEdge operators.

Do you mean as submodules of no_meaning?

operators_table_path
), f"Internal error: Mathics3 operator information is missing; expected to be in {operators_table_path}"
with open(operators_table_path, "r", encoding="utf8") as operator_f:
OPERATOR_DATA = ujson.load(operator_f)

# special patterns
NUMBER_PATTERN = r"""
( (?# Two possible forms depending on whether base is specified)
@@ -33,7 +49,6 @@
)
full_names_pattern = r"(`?{0}(`{0})*)".format(base_names_pattern)

# FIXME: Revise to get Character Symbols from data/characters.json
tokens = [
("Definition", r"\? "),
("Information", r"\?\? "),
@@ -102,9 +117,7 @@
("Equal", r" (\=\=) | \uf431 | \uf7d9 "),
("Unequal", r" (\!\= ) | \u2260 "),
("LessEqual", r" (\<\=) | \u2264 "),
("LessSlantEqual", r" \u2a7d "),
("GreaterEqual", r" (\>\=) | \u2265 "),
("GreaterSlantEqual", r" \u2a7e "),
("Greater", r" \> "),
("Less", r" \< "),
# https://reference.wolfram.com/language/ref/character/DirectedEdge.html
@@ -148,7 +161,6 @@
# ('PartialD', r' \u2202 '),
# uf4a0 is Wolfram custom, u2a2f is standard unicode
("Cross", r" \uf4a0 | \u2a2f"),
("Colon", r" \u2236 "),
# uf3c7 is Wolfram custom, 1d40 is standard unicode
("Transpose", r" \uf3c7 | \u1d40"),
("Conjugate", r" \uf3c8 "),
@@ -159,56 +171,32 @@
("Del", r" \u2207 "),
# uf520 is Wolfram custom, 25ab is standard unicode
("Square", r" \uf520 | \u25ab"),
("SmallCircle", r" \u2218 "),
("CircleDot", r" \u2299 "),
# ('Sum', r' \u2211 '),
# ('Product', r' \u220f '),
("PlusMinus", r" \u00b1 "),
("MinusPlus", r" \u2213 "),
("Nor", r" \u22BD "),
("Nand", r" \u22BC "),
("Xor", r" \u22BB "),
("Xnor", r" \uF4A2 "),
("Diamond", r" \u22c4 "),
("Wedge", r" \u22c0 "),
("Vee", r" \u22c1 "),
("CircleTimes", r" \u2297 "),
("CenterDot", r" \u00b7 "),
("Star", r" \u22c6"),
("VerticalTilde", r" \u2240 "),
("Coproduct", r" \u2210 "),
("Cap", r" \u2322 "),
("Cup", r" \u2323 "),
("CirclePlus", r" \u2295 "),
("CircleMinus", r" \u2296 "),
("Congruent", r" \u2261 "),
("Intersection", r" \u22c2 "),
("Union", r" \u22c3 "),
("VerticalBar", r" \u2223 "),
("NotVerticalBar", r" \u2224 "),
("DoubleVerticalBar", r" \u2225 "),
("NotDoubleVerticalBar", r" \u2226 "),
("Element", r" \u2208 "),
("NotElement", r" \u2209 "),
("Subset", r" \u2282 "),
Contributor

Why are these entries gone?

Member Author

It now gets pulled in from JSON.
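The replacement loop at the end of the diff can be illustrated with a toy stand-in for the JSON table (the entries and their escape encoding here are hypothetical, not the real contents of data/operators.json; only the append loop mirrors the diff):

```python
# Toy stand-in for OPERATOR_DATA as loaded from data/operators.json.
# Values are regex-source strings containing Unicode escapes.
OPERATOR_DATA = {
    "no-meaning-infix-operators": {
        "Subset": r"\u2282",     # subset sign
        "Superset": r"\u2283",   # superset sign
        "CenterDot": r"\u00b7",  # middle dot
    }
}

# Hand-written token rules keep their place in the list...
tokens = [
    ("Equal", r" (\=\=) | \uf431 | \uf7d9 "),
]

# ...and each JSON entry is appended as a (name, pattern) rule,
# padded with spaces to match the re.VERBOSE style of the table.
for table in ("no-meaning-infix-operators",):
    for operator_name, unicode_pattern in OPERATOR_DATA[table].items():
        tokens.append((operator_name, f" {unicode_pattern} "))
```

This is why the hand-written entries such as Subset and CenterDot could be deleted from the literal list: the same (name, pattern) pairs are now produced from the JSON table.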

("Superset", r" \u2283 "),
("ForAll", r" \u2200 "),
("Exists", r" \u2203 "),
("NotExists", r" \u2204 "),
("Not", r" \u00AC "),
("Equivalent", r" \u29E6 "),
("Implies", r" \uF523 "),
("RightTee", r" \u22A2 "),
("DoubleRightTee", r" \u22A8 "),
("LeftTee", r" \u22A3 "),
("DoubleLeftTee", r" \u2AE4 "),
("SuchThat", r" \u220D "),
("VerticalSeparator", r" \uF432 "),
("Therefore", r" \u2234 "),
("Because", r" \u2235 "),
("Backslash", r" \u2216 "),
]

for table in ("no-meaning-infix-operators",):
table_info = OPERATOR_DATA[table]
for operator_name, unicode in table_info.items():
# if any([tup[0] == operator_name for tup in tokens]):
# print(f"Please remove {operator_name}")
tokens.append((operator_name, f" {unicode} "))


literal_tokens = {
"!": ["Unequal", "Factorial2", "Factorial"],
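For context on how rules like these are consumed, token tables of (name, regex-source) pairs are typically compiled with re.VERBOSE, which is why the patterns above are padded with ignorable spaces. A minimal sketch, not the actual Mathics3 scanner code:

```python
import re

# A few rules in the same style as the table above; order matters,
# so LessEqual must be tried before Less.
tokens = [
    ("LessEqual", r" (\<\=) | \u2264 "),
    ("Less", r" \< "),
    ("CenterDot", r" \u00b7 "),
]

# re.VERBOSE makes the padding whitespace insignificant in the pattern.
compiled = [(name, re.compile(pattern, re.VERBOSE)) for name, pattern in tokens]


def first_token(text, pos=0):
    """Return (name, lexeme) for the first rule matching at pos, else None."""
    for name, regex in compiled:
        match = regex.match(text, pos)
        if match:
            return name, match.group(0)
    return None
```

With these three rules, first_token("<=2") yields ("LessEqual", "<="), and both the ASCII "<" and the Unicode "\u2264" forms are recognized, mirroring how the real table pairs ASCII spellings with their Unicode equivalents.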