-
-
Notifications
You must be signed in to change notification settings - Fork 46
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PP_NUMBER, remove CPP_INTEGER, CPP_FLOAT #80
base: master
Are you sure you want to change the base?
Conversation
A simple proof of concept change that fixes ned14#79. With it, pcpp can do codegen using the IREPEAT library. I believe it's conceptually correct, but my Python may not be; please test this against your suite and review the method (hack) carefully. There's not much code! Mostly deletions. The change removes CPP_INTEGER, effectively replacing it with PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for preprocessing purposes. pp-number is sufficient for preprocessing to stage 4 The pp-number regex in the issue is incorrect, lifted from unpublished WG21 https://isocpp.org/files/papers/D2180R0.html "pp-number makes cpp dumber" (best proposal title ever). Instead, I crafted a regex based on the lastest C++ draft https://eel.is/c++draft/lex.ppnumber#ntref:pp-number which accepts character ' as digit separator: regex string r'\.?\d(\.|[\w_]|\'[\w_]|[eEpP][-+])*' (also admits binary literals, with digit separator, of course, so they can now be added to the Value parsing code) Only the conditional evaluator is required to interpret the numbers as integer constant expressions. This is achieved by hacky means: def p_expression_number(p): 'expression : PP_NUMBER' try: p[0] = Value(p[1]) except: p[0] = p[1] The idea is that if the parsed string p[1] can be interpreted as an integer constant-expression Value(p[1]) then do so, otherwise simply pass through the string for possible further pasting and processing. A robust method might check p[1] against the CPP_INTEGER regex (removed in this commit) for a full match, consuming all input. On the other hand, relying on Value to validate the input while parsing and to raise an exception on failure may be Pythonic. It seems that pp-number itself is a hack in the standard; I see no way to incorporate pp-number alongside INTEGER and FLOAT tokens meaningful in C; but then there's no need to. Happy Thanksgiving!
Added
All Are there other tests to be run? Should I commit the new generated |
If you wish. It'll be next Spring at least before I can look at this properly. |
Not sure if its right to commit generated tables in the PR so here they are in a separate commit to pick or drop.
ok, here are the generated tables, to pick or drop if and when merged. Maybe @assarbad can review as it's closely related to his #71 (comment) I may be missing something obvious... If you authorise the PR action then it should confirm that the tests pass |
Thanks for coding, publishing and maintaining this project. For my use case, I'm actually more interested in a fix for issue #53 ... |
I am also looking at issue #53 myself. And I'll review your patch, but not tonight. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think PP_NUMBER
should be named CPP_NUMBER
throughout, to stick with the established CPP_
prefix.
Otherwise this PR looks fine.
There is one minor issue that we may want to postpone for a separate ticket or not, mostly depends on @ned14. That regular expression doesn't quite capture what the pp-number is defined as. I'll comment on this separately, though. It's probably going to be bigger than what should go into such a final comment here 😉
So we have the following bits (correct me if I got it wrong):
... where:
The rest is relatively self-explanatory. The description you referenced already gives us a lot of leeway. And it may be appropriate to keep it that greedy and that broad. However, with the above remarks in mind I'd like to propose changing it to:
This:
We should probably also define what we expect to be matched here. Admittedly my proposed regex from #71 may be too strict. My change does not address:
|
Thanks @assarbad for your thorough detailed review. |
For me, if GCC and Wave's preprocessor use the same technique to solve a conformance problem, then it seems appropriate that pcpp can do the same. Whether the technique is implemented correctly is another matter, but I'm sure we'll get there. |
The previous regex escaped the digit separator ' as \' - necessary only because the string itself was '' single-quoted. This commit switches to use double-quote "" for the string and removes the now unneeded escape.
The hack is highlighted in the top PR message and it's right up top of the listed It effectively relies on Python exceptions for validating the
def p_expression_number(p):
'expression : CPP_INTEGER'
p[0] = Value(p[1])
def p_expression_number(p):
'expression : PP_NUMBER'
try:
p[0] = Value(p[1])
except:
p[0] = p[1] Is it right and Pythonic to rely on exceptions to do the validation like this? By the fact that the tests all pass, it appears that I don't see a clean alternative way to do this that meshes properly with the grammar and parser., and still allows for pasting to form numeric literals from smaller pp-token parts. |
The This raises further questions:
Out of my depth, I asked a couple of questions on WG21 sg16 unicode study group mailing list.
The Python standard library
|
The only un-Pythonic thing I see here is that the exception isn't caught by listing concrete exceptions or at least a base exception type. This is considered sort of an anti-pattern. As an example, this particular code should probably never raise a See https://docs.python.org/3/library/exceptions.html#base-classes Otherwise I don't see much of an issue trying to treat something one way and, failing that, falling back to the alternative. AFAIK nothing un-Pythonic about that.
As long as you stick to character classes such as
I don't know. As an option this would probably be neat, but outright depending on it would not be too nice, given currently the set of external dependencies is quite limited. It's true that this may eventually become a concern, but as we have learned compilers already had to backpedal on allowed Unicode characters to some extent (changing direction and such), because otherwise editors would outright hide code. And as you noted yourself the debate is far from finished, so I think that bridge ought to be crossed when we get there, not necessarily right now. Also keep in mind that other than for measuring header heft, pcpp probably is used for cases such as amalgamating libraries into single-header. This means latest if the compiler see the results of what pcpp spat out, you'll get the feedback whether something was wrong. |
I believe pcpp on Python 3 makes a reasonable attempt at Unicode. C++ has been evolving there though, as has C, and there are differences which are being reconciled. TBH I think a reasonable attempt is reasonable :) |
I spent too long yesterday following up on Tom H's unicode links - my conclusion was that the security and stability concerns for languages, around unicode identifiers and exploits, are very unlikely to be concerns for pcpp as a pure preprocessor - it seems ok for pcpp to be lax in what it accepts as identifiers. (p.s. It looks like that CI failure is on Python 2 due to the added StringIO context manager - I can revert that) |
This reverts commit 8a5b9d0.
In particular, I like the idea that pcpp accept the mathematical profile identifiers that gcc currently accepts (rejects in -pedantic), clang 14 accepted and now rejects in 15 (but is adding an extension flag to support it) and that look likely to be incorporated in C++ once there's consensus. |
I go on annual vacation next week, I'm hoping some time will present itself to review all the open PRs everywhere across my open source. No promises, it's a hectic three weeks coming for me. |
Thanks Niall; that'd be a nice surprise - don't stress it though. I'll aim to understand more of the This PR I'm only really wanting in order to replace MSVC preprocessor with nicer output (fewer bogus empty lines). It's the |
A simple proof of concept change that fixes #79.
With it, pcpp can do codegen using the IREPEAT library.
I believe it's conceptually correct, but my Python may not be;
please test this against your suite and review the method
(hack) carefully. There's not much code! Mostly deletions.
The change removes CPP_INTEGER, effectively replacing it with
PP_NUMBER, and entirely removes CPP_FLOAT as superfluous for
preprocessing purposes.
pp-number is sufficient for preprocessing to stage 4
The pp-number regex in the issue is incorrect, lifted from
unpublished WG21 https://isocpp.org/files/papers/D2180R0.html
"pp-number makes cpp dumber" (best proposal title ever).
Instead, I crafted a regex based on the lastest C++ draft
https://eel.is/c++draft/lex.ppnumber#ntref:pp-number
which accepts character ' as digit separator:
regex string r'.?\d(.|[\w_]|'[\w_]|[eEpP][-+])*'
(also admits binary literals, with digit separator, of course,
so they can now be added to the Value parsing code)
Only the conditional evaluator is required to interpret the
numbers as integer constant expressions.
This is achieved by hacky means:
The idea is that if the parsed string p[1] can be interpreted as
an integer constant-expression Value(p[1]) then do so, otherwise
simply pass through the string for possible further pasting and
processing.
A robust method might check p[1] against the CPP_INTEGER regex
(removed in this commit) for a full match, consuming all input.
On the other hand, relying on Value to validate the input while
parsing and to raise an exception on failure may be Pythonic.
It seems that pp-number itself is a hack in the standard; I see
no way to incorporate pp-number alongside INTEGER and FLOAT tokens
meaningful in C; but then there's no need to. Happy Thanksgiving!