Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Binutils 2.29 followup #288

Closed
wants to merge 9 commits into from

Conversation

jnpkrn
Copy link
Contributor

@jnpkrn jnpkrn commented Jan 11, 2018

see also #266 (comment)

This is needed to test more possible combinations, as apparently, failed
assertion already on the client side will yield no visibility into how
"interlib" would react, for instance.

Signed-off-by: Jan Pokorný <[email protected]>
This is to cut down some time when only particular item of the
test matrix is of interest, such as in

  ./log_test_mock.sh -v -ncl -ncsc -tqb+_il+_c+ <srpm>

command that can be used to highlight the issue with binutils 2.29+
compatibility fix making for a new issue when all of following are
satisfied:

- there are three link participants (reusing terminology establish
  in the patches constituting said fix)
  client program > intermediate library > libqb
  (client is not necessarily directly dependent on libqb)

- client program is build-time linked using ld/binutils either
  < or >= 2.29 but with libqb 1.0.3 (containing the said fix,
  i.e., new linker script is in effect for building the client
  program)

- client program does not use any logging at all

- intermediate library triggers some checks when gets loaded,
  through usage of QB_LOG_INIT_DATA macro from qb/qblog.h,
  which are meant to check itself, but alas, the symbols to
  check on behalf of this library get run-time overridden
  with symbols of the client program, which only do exist
  because the libqb.so linker script (intended for other
  effect) made it happen despite the section is empty

So far, log_test_mock.sh was only thoroughly run without any
log-suppressing options (e.g. -ncl), which together with some
extra linker/build flags (distro specific?) in case of compiling
pacemaker (exercising the same scheme with some CLI utilities)
masked this problem.  [this all is still to be verified]

This asks for a follow-up solution, and this commit is a good
start as running whole matrix would keep increasing turnaround
time needlessly (the new skipping harness is not ideal, but
serves the purpose).

Signed-off-by: Jan Pokorný <[email protected]>
@jnpkrn jnpkrn force-pushed the binutils-2.29-followup branch from 6c58f2b to bc6a8c3 Compare January 17, 2018 16:49
Also make the in-line progress show the actual output of the current
test matrix item along with pre-existing syslog one for that very step.

Signed-off-by: Jan Pokorný <[email protected]>
@jnpkrn jnpkrn force-pushed the binutils-2.29-followup branch from bc6a8c3 to 03497c7 Compare January 17, 2018 21:28
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 17, 2018

Current patchset (ending with 03497c7) looks good to me so far, but
needs some more testing and proper commit message for the last commit.
Will work some more on that but would welcome some feedback.
@gao-yan, any?

Regarding testing, I realized that protected visibility of the section boundary
symbols (either enforced or native with binutils 2.29.1, btw. making 2.29
technically the most problematic binutils release from the libqb's perspective)
can come handy for SW not using _GNU_SOURCE a.k.a. GNU extensions
on top of POSIX defined interfaces so I arranged the originally for-all check just
for "POSIX only" cases together with said visibility enforcement, unless overridden
with QB_LD_2_29 because that's the only show stopper for that and we cannot
really figure out which linker is being used declaratively.

Mainly to avoid duplicate runs because of those details getting lost
as used to happen.

Signed-off-by: Jan Pokorný <[email protected]>
... as they can be neglected without any loss of check preciseness,
and ignoring them makes for easier (isomorphic) changes in the future.

Signed-off-by: Jan Pokorný <[email protected]>
Consequently, turn original catch-all reliance on standard headers
through "os_base.h" facade include into regular demand-driven ones,
as the former had a side-effect of spoiling the included system headers
with _GNU_SOURCE macro (through internal "config.h" unless HAVE_CONFIG_H
is defined), which hence could not be undone by the means of
command-line requested clearance of such macro, colliding hence
with the objective of this change.

Note that mere undefining of HAVE_CONFIG_H alone would not work,
as the selected standard headers have to be included the alternative
way, i.e. directly, so the compatibility of log_test_mock.sh with
now-outdated SRPM would not be preserved for "POSIX only" test run,
anyway (and in turn, "-UHAVE_CONFIG_H" is omitted here), hence the
test to prevent this confusion is also added.

Signed-off-by: Jan Pokorný <[email protected]>
@gao-yan
Copy link
Member

gao-yan commented Jan 18, 2018

Thanks for all the effort on this!

Tried libqb-1.0.3+20171226.6d62b64 plus 03497c7 with binutils-2.26.1.

pacemaker-1.1.18+20180112.bd7115b8c compiles now :)

# readelf -s /usr/lib64/libqb.so.0.19.0 | grep verbose
   162: 0000000000228058     0 NOTYPE  GLOBAL DEFAULT   26 __stop___verbose
   239: 0000000000226870     0 NOTYPE  GLOBAL DEFAULT   26 __start___verbose
# readelf -s /usr/lib/pacemaker/stonithd | grep verbose
   163: 0000000000a1ddd0     0 NOTYPE  GLOBAL DEFAULT bad section index[ 36] __start___verbose
   166: 0000000000a1f270     0 NOTYPE  GLOBAL DEFAULT bad section index[ 36] __stop___verbose

--- No idea what "bad section" indicates. I see the similar with the things linked with binutils-2.29.1 too.

Fencing regression tests (which reply on logging) all passed.

Pacemaker seems to work well, but every time on shutting down cluster services, crmd encounters:

Jan 18 15:17:46 f171 crmd[30523]:     crit: GLib: Source ID 22 was not found when attempting to remove it

With the same sources but linked with binutils-2.29.1, crmd encounters the similar on shutting down and has an additional error from crm_glib_handler:

Jan 18 15:41:13 f58.suse.de crmd[10048]:    error: crm_glib_handler: Forked child 10080 to record non-fatal assert at logging.c:73 : Source ID 23 was not found when attempting to remove it
Jan 18 15:41:13 f58.suse.de crmd[10048]:     crit: GLib: Source ID 23 was not found when attempting to remove it

The relevant code in pacemaker/lib/common/logging.c: crm_glib_handler()

  62     if (glib_cs == NULL) {
  63         glib_cs = qb_log_callsite_get(__FUNCTION__, __FILE__, "glib-handler", LOG_DEBUG, __LINE__, crm_trace_nonlog);
  64     }
  65 
  66 
  67     switch (msg_level) {
  68         case G_LOG_LEVEL_CRITICAL:
  69             log_level = LOG_CRIT;
  70 
  71             if (crm_is_callsite_active(glib_cs, LOG_DEBUG, 0) == FALSE) {
  72                 /* log and record how we got here */
  73                 crm_abort(__FILE__, __FUNCTION__, __LINE__, message, TRUE, TRUE);
  74             }
  75             break;                           

Not sure if it's necessarily related this change.

@kgaillot
Copy link
Contributor

Pacemaker seems to work well, but every time on shutting down cluster services, crmd encounters:

Jan 18 15:17:46 f171 crmd[30523]: crit: GLib: Source ID 22 was not found when attempting to remove it

With the same sources but linked with binutils-2.29.1, crmd encounters the similar on shutting down and has an additional error from crm_glib_handler:

Jan 18 15:41:13 f58.suse.de crmd[10048]: error: crm_glib_handler: Forked child 10080 to record non-fatal assert at logging.c:73 : Source ID 23 was not found when attempting to remove it
Jan 18 15:41:13 f58.suse.de crmd[10048]: crit: GLib: Source ID 23 was not found when attempting to remove it

No idea what's going on there. I'm particularly concerned that with 2.26.1 crm_glib_handler() appears not to be called for a glib error. I wonder if a similar linker issue is affecting glib itself.

Most likely the particular glib error involves calling g_source_remove() on a timer. Enabling trace logs would probably narrow down which one and where. Since it's at shutdown, maybe it's the throttle timer.

It might be useful to try with different glib versions and/or glib+linker combinations to see if some behavior changed there rather than in the linker, libqb, or pacemaker. But that's more work than I want to put into this :)

@gao-yan
Copy link
Member

gao-yan commented Jan 18, 2018

Well, without 03497c7, crmd encounters the same.

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 18, 2018 via email

@jnpkrn jnpkrn force-pushed the binutils-2.29-followup branch from 03497c7 to 3143753 Compare January 18, 2018 20:51
@jnpkrn
Copy link
Contributor Author

jnpkrn commented Jan 18, 2018

In the meantime, current patchset up to 3143753 is just slightly refined version
of the previous, most of the value added comes from being strict on success
of dl* functions as well, which could possibly uncover some deep corner cases.

@gao-yan
Copy link
Member

gao-yan commented Jan 19, 2018

FYI, the issue itself is due to double-removal of glib source:
ClusterLabs/pacemaker#1410

It turned out that while log_test_mock.sh already provided ways for
extra variations in the composite body of the program to inspect for
a working logging, "omit logging at the target client side completely,
leave it along with the respective self-check just at the intermediate
library this program uses" one hadn't apparently been exercised with
binutils < 2.29.1 (which is the only one OK from the get-go since it
arranges boundary denoting symbols for orphan sections as protected).

This made the pacemaker building fail half the way with libqb 1.0.3
and binutils < 2.29.1 because the built executables are instantly run
as to extract their help screen[1], and they represent the said pain
pattern:

- program not using libqb log subsystem directly, but
- linked to dynamic library that itself does + it also utilizes
  QB_LOG_INIT_DATA macro, which
- upon loading the executable and the the linked dependencies
  prior to proper run invokes the checks, where
- one in particular tries to assess whether the direct-access
  boundary symbols are not aliasing (equal), but
- because with standard visibility, symbols from the program
  take a precedence, actually its symbols are prioritized (instead
  of the ones pertaining to the library layer as expected, remember
  that that check was intended as self-contained, targeting just
  it's own participating part in the link scheme), and
- since previous fix to accommodate binutils 2.29+ fancy/new handling
  of the boundary symbols they are not subject to any sort of
  garbage collection (no symbols were emitted for a section that was
  not known because there were no instruction it should contain
  anything, until we stuck those symbols in place by force by the
  means of linker script to that effect), these boundary symbols
  indeed occur in this end program having no callsite data (see first
  point), meaning that start and stop address indications for
  the section equals, hence
- the said assertion triggers (despite it should not = false positive),
  implying breach of self-containment of the check (which naturally
  cannot be responsible for any other linked part)

This clearly means that "QB_ATTR_SECTION_START != QB_ATTR_SECTION_STOP"
cannot be used unconditionally.  However, it would left the software
using logging along with QB_LOG_INIT_DATA macro (which was already
highlighted as its vital part in order to avoid silent logging
malfunction) that doesn't utilize _GNU_SOURCE macro (unlocking some
GNU extensions on top of basic POSIX interface) short of the main
checks (because they are conditionalized for requiring said extensions),
and libqb never required the client side to define that macro nor it
dared to define it behind user's back as it can have unexpected
consequences.

Luckily, the first paragraph carries the key: protected
symbols behave exactly as required here[2]:

> Protected visibility is like default visibility except that it
> indicates that references within the defining module bind to the
> definition in that module. That is, the declared entity cannot be
> overridden by another module.

So we just move said comparison to "!defined(_GNU_SOURCE)" conditional
branch within qblog.h, and to cater binutils prior to 2.29.1 (which has
it like that by default as mentioned), we also mark the declarations
of the boundary symbols, likewise conditionally, with protected
visilibity (there appears to be no way for that in the linker script).
But that would be too easy as once again, 2.29 linker begs to differ and
these two "!defined(_GNU_SOURCE)" measures will actually do more harm
than good when in the mix.  Hence, new QB_LD_2_29 macro is proclaimed
the kill-switch for that case, and the user becomes responsible to
either define it when building with this 2.29 troublemaker (as a recap,
2.29.1 is fine from this perspective), or to start using _GNU_SOURCE.

One more question mark remains, though: are not the QB_LOG_INIT_DATA's
checks in _GNU_SOURCE case weaker now than used to be?  The anwer is:
no, as long as looking at the symbols through dlopen'd particular shared
object will contain no overrides from either already loaded dependencies
(if caching is so aggressive to detect it's already in loaded in
caller's context) or from recursive load.  Currently, it doesn't seem
to be the case, but it may depend on the implementation of dynamic
linking library or the toolchain.  If this observation is proved
to be wrong, the solution may be as simple as dropping !defined(_GNU_C)
condition from the guard of making the boundary symbols with protected
visibility -- it is not done now also with respect to non-gcc compilers
which may not recognize that.

Last but not least, dl* calls in the QB_LOG_INIT_DATA's checks are
themselves subject to scrutiny now: should they fail unexpectedly,
the run is terminated just as well.  This led to discovery of the
issue masked so far, boiling down to unability to dlopen executable
(as opposed to shared object)[3].  It did not matter before, because
of rather best-effort, optimistic approach (perform the final check
only if precondition steps succeeded), but now, we have to add an
extra stipulation that this case won't lead to premature termination
-- it just happens to sometimes be the case, and there's not much
we can do to detect run down on the level of the executable proactively,
at least not based on the brief experiments.

[1] ClusterLabs#266 (comment)
[2] https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-visibility-function-attribute
[3] https://sourceware.org/bugzilla/show_bug.cgi?id=11754

Signed-off-by: Jan Pokorný <[email protected]>
It's not expected that basic prerequisite for using callsite section,
notably section attribute, would work without the basic "GNU toolchain"
support declared during the build.

Signed-off-by: Jan Pokorný <[email protected]>
@jnpkrn jnpkrn force-pushed the binutils-2.29-followup branch from 3143753 to 310882c Compare January 22, 2018 08:07
@krig
Copy link

krig commented Feb 12, 2018

Why does libqb go through all of these acrobatics at all?

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Feb 12, 2018 via email

@krig
Copy link

krig commented Feb 12, 2018

I mean that the thing I'm trying to understand is why libqb tries to use callsite sections in order to track output filtering in the first place. To be honest, binutils doesn't look like the party at fault here, rather libqb is doing something it probably shouldn't have been doing in the first place. Was it just to avoid requiring matching push/pop calls for filter manipulation?

@krig
Copy link

krig commented Feb 12, 2018

If that's the case, then using the GCC cleanup attribute to pop the filter option is probably a better solution.

@jnpkrn
Copy link
Contributor Author

jnpkrn commented Feb 12, 2018

I am not the original author but I can see that in part the intention
was to make logging as cheap and run-time seamless as possible,
which is certainly a worthy goal, given that, e.g., corosync usually
runs with real-time priority and mlock'd memory.

With this arrangement, you effectively offload page-fault prone part
of otherwise ad-hoc processing to linker at build time, and the
other, time-sensitive lookup part to early phase in run-time (upon
qb_log_init) -- note that everything gets resolved at once, no
matter how many log messages are there to be processed.
In my measurements with artificial, non-favouring test case,
this approach is indeed still tiny bit faster.

I can only suggest reading more on the topic, the linked changes and
respective discussions, to discover that binutils indeed digressed
in handling the scaffolding this approach relies upon :-)

And no, I cannot see any way how cleanup attribute would help here.

Note also that callsites being tracked systemically at query-able form
within the binary means new advanced features can be introduced, such
as new super-terse "blackbox" format: #274

@krig
Copy link

krig commented Feb 12, 2018

And no, I cannot see any way how cleanup attribute would help here.

Yeah, now that I've looked at it more I see that the purpose is not entry/exit management (which cleanup would do), but to speed up log filtering.

@jnpkrn jnpkrn mentioned this pull request Apr 11, 2018
@jnpkrn jnpkrn mentioned this pull request Aug 27, 2018
@chrissie-c
Copy link
Contributor

Closed as not needed any more

@chrissie-c chrissie-c closed this Oct 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants