Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

forecast and scores objects print output every time a column is changed #935

Open
nikosbosse opened this issue Sep 29, 2024 · 27 comments
Open
Labels
high-priority Should be addressed soon

Comments

@nikosbosse
Copy link
Contributor

nikosbosse commented Sep 29, 2024

Consider this example:

a <- data.table(x = 1:30, y = 2:31)
a[, model := "a"]

ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]

Upon changing the model column, the first data.table prints nothing. For ex, however, the whole data.table is printed again.


EDIT: This is no longer true
Same happens with scores objects

ex <- score(example_quantile)
ex[, test := 3]

@Bisaloo do you have any thoughts on this?

@nikosbosse nikosbosse added the high-priority Should be addressed soon label Sep 29, 2024
@nikosbosse nikosbosse added this to the scoringutils-2.0 milestone Sep 29, 2024
@nikosbosse nikosbosse changed the title forecast object prints output every time a column is changed forecast and scores objects print output every time a column is changed Sep 29, 2024
@nikosbosse
Copy link
Contributor Author

nikosbosse commented Oct 6, 2024

@Bisaloo do you think you might have a minute to look at this, or are you swamped at the moment? In that case I can try and take over. Thank you!

For context: we have to update scoringutils on CRAN until October 31, 2024. The reason for that is that the package author fields are currently malformed in the CRAN version (they are fixed on gh) and we got a few messages asking us to fix this by October 31 latest. Ideally, it would be nice to have this fixed before we release anything to CRAN.

see #920

@nikosbosse
Copy link
Contributor Author

@seabbs tagging you for context on the deadline

@seabbs
Copy link
Contributor

seabbs commented Oct 7, 2024

For context: we have to update scoringutils on CRAN until October 31, 2024. The reason for that is that the package author fields are currently malformed in the CRAN version (they are fixed on gh) and we got a few messages asking us to fix this by October 31 latest. Ideally, it would be nice to have this fixed before we release anything to CRAN.

Ah I see. So we could just update the CRAN version from the release that we made it with (i.e update that branch and release from there). That takes the pressure off the very breaking dev version but also deadlines are nice

@nikosbosse
Copy link
Contributor Author

nikosbosse commented Oct 7, 2024

Exactly. But I think we could actually make the October 31 deadline. Only 3 more issues to got!
https://github.com/epiforecasts/scoringutils/issues?q=is:issue+is:open+-milestone:scoringutils-2.x

Edit: SHOULD!

@Bisaloo
Copy link
Member

Bisaloo commented Oct 8, 2024

I can have a look after Wednesday. I don't have any intuition of what may be causing this yet 🤔

@nikosbosse
Copy link
Contributor Author

Merci!

@Bisaloo
Copy link
Member

Bisaloo commented Oct 9, 2024

From what I understand, data.table has a complex internal state to control whether or not an object should be printed. The problem is that this state is stored in the .global internal variable (Rdatatable/data.table@342d6ba) which we cannot touch or even inspect from scoringutils 🤔

scoringutils could potentially add its own internal variable to determine if printing should proceed (i.e., forwarded to print.data.table()) or cancelled but the logic in data.table at least is quite convoluted.

@nikosbosse
Copy link
Contributor Author

I see. Hm...
@MichaelChirico might you have an idea for this?

@MichaelChirico
Copy link
Contributor

Sorry, I'm just glancing & not able to glean the full context. Do you have something that's printing, that shouldn't be, or something that's not printing, that should be?

@nikosbosse
Copy link
Contributor Author

Things are printing that shouldn't be printing.
This works fine:

a <- data.table(x = 1:30, y = 2:31)
a[, model := "a"]

This prints although it shouldn't:

ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]

Same happens with scores objects

ex <- score(example_quantile)
ex[, test := 3]

This is probably related to the fact that we have specific [.forecast and [.scores methods that revalidate objects when the user manipulates them.

@nikosbosse
Copy link
Contributor Author

In the data.table code you linked @Bisaloo I see this comment in the code
image

This lets me think that the issue may be related to the fact that we're hitting [.forecast twice. If I run this code:

library(scoringutils)
debug(scoringutils:::`[.forecast`)
ex <- data.table::copy(example_quantile)
ex[, model := paste(model, "a")]

I'm landing at [.forecast twice. This does not seem to be due to the content of assert_forecast(). Replacing

validation <- try(
      assert_forecast(forecast = out, verbose = FALSE),,
      silent = TRUE
    )

with

validation <- try(
      TRUE,
      silent = TRUE
    )

in [.forecast doesn't change the fact that we're hitting [.forecast twice...

@nikosbosse
Copy link
Contributor Author

nikosbosse commented Oct 20, 2024

Hm. Planning ahead: If we're not able to solve this before October 31, I see two and a half options:

  • just ignoring this.
  • temporarily removing our [.forecast and [.scores functions until this is fixed.
  • (delaying the real CRAN update and only pushing the name fix we're required to do).
  • EDIT: removing print.forecast might also be a temporary solution

I'm leaning towards 2) and would like to avoid 3). Ideal solution would of course be fixing the issue until then. I'm a bit lost what's exactly causing this...

@nikosbosse
Copy link
Contributor Author

nikosbosse commented Oct 20, 2024

ok it seems like an alternative temporary solution would be to get rid of the custom print method.

I created a reprex to simplify what's going on.

library(data.table)

`[.foo` <- function(x, ...) {
  out <- NextMethod()
  validate_x_again <- "ok"
  return(out)
}

print.foo <- function(x, ...) {
  print("foo")
  NextMethod()
  return(invisible(x))
}

ex <- data.table(x = 1:10)
setattr(ex, "class", c("foo", "data.table", "data.frame"))
ex[, y := 3]

@nikosbosse
Copy link
Contributor Author

ok digging deeper it seems like actually the problem is with the print method. (I don't know why we had the same issue with scores objects as well, but that doesn't seem to be the case anymore).

There is this issue from 2018 mentioning the exact same problem: Rdatatable/data.table#3029.

I'm slightly confused we didn't notice this earlier (as far as I can tell we had the print method a lot longer than Hugo's [.forecast function.

@nikosbosse
Copy link
Contributor Author

In complete ignorance of what's actually happening I made some progress by including the following part from print.data.table in the print function:

shouldPrint <- getAnywhere("shouldPrint")$objs[[1]]
  # the following lines were taken from data.tables print.data.table
  # https://github.com/Rdatatable/data.table/blob/058dd4d51ef4a70aef3b913e556b4ab1a150d1c9/R/print.data.table.R#L24C3-L41C4
  # ----------------------------------------------------------------------------
  #  := in [.data.table sets .global$print=address(x) to suppress the next print i.e., like <- does. See FAQ 2.22 and README item in v1.9.5
  # The issue is distinguishing "> DT" (after a previous := in a function) from "> DT[,foo:=1]". To print.data.table(), there
  # is no difference. Now from R 3.2.0 a side effect of the very welcome and requested change to avoid silent deep copy is that
  # there is now no longer a difference between > DT and > print(DT). So decided that DT[] is now needed to guarantee print; simpler.
  # This applies just at the prompt. Inside functions, print(DT) will of course print.
  # Other options investigated (could revisit): Cstack_info(), .Last.value gets set first before autoprint, history(), sys.status(),
  #   topenv(), inspecting next statement in caller, using clock() at C level to timeout suppression after some number of cycles
  SYS <- sys.calls()
  if (!shouldPrint(x)) {
    SYS = sys.calls()
    if (length(SYS) <= 2L ||  # "> DT" auto-print or "> print(DT)" explicit print (cannot distinguish from R 3.2.0 but that's ok)
        ( length(SYS) >= 3L && is.symbol(thisSYS <- SYS[[length(SYS)-2L]][[1L]]) &&
          as.character(thisSYS) == 'source') || # suppress printing from source(echo = TRUE) calls, #2369
        ( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
          as.character(thisSYS) %chin% mimicsAutoPrint ) )  {
      return(invisible(x))
      # is.symbol() temp fix for #1758.
    }
  }

At leat in the simple example above this seems to work, i.e. the following behaves correctly:

library(data.table)
print.foo <- function(x, ...) {
  SYS <- sys.calls()
  shouldPrint <- getAnywhere("shouldPrint")$objs[[1]]
  if (!shouldPrint(x)) {
    SYS = sys.calls()
    if (length(SYS) <= 2L ||
        ( length(SYS) >= 3L && is.symbol(thisSYS <- SYS[[length(SYS)-2L]][[1L]]) &&
          as.character(thisSYS) == 'source') ||
        ( length(SYS) > 3L && is.symbol(thisSYS <- SYS[[length(SYS)-3L]][[1L]]) &&
          as.character(thisSYS) %chin% mimicsAutoPrint ) )  {
      return(invisible(x))
    }
  }
  
  print("Additional info: foo")
  NextMethod()
  return(invisible(x))
}

ex <- data.table(x = 1:10)
setattr(ex, "class", c("foo", "data.table", "data.frame"))
ex[, y := 3]
ex

In the package itself it does not work again which again may have to do with the fact that [.forecast is called twice and shouldPrint() doesn't recognise that or something something...

@seabbs
Copy link
Contributor

seabbs commented Oct 21, 2024

Wow lots here.

Firstly I agree with "alternative temporary solution would be to get rid of the custom print method." as the option to go with to avoid us getting blocked if we cant resolve.

complete ignorance

I also have no idea what is happening here so guess its good it helps a bit!

I don't currently have a lot of useful insights for this unfortunately.

@nikosbosse
Copy link
Contributor Author

Digging deeper again it seems like the problem is not only with the print method. Even if we remove the print method, the problem still persists.

For example, this line in get_duplicate_forecasts() also triggers a print:

data[, scoringutils_InternalDuplicateCheck := .N, by = c(forecast_unit, type)]

But this is performed on a data.table object, not a forecast object.

I will try and write up what I did exactly a bit later. But the tldr is: I wasn't able to find an easy fix.

I think our short-term options:

  • temporarily remove the print method AND all/most? of the internal validation methods
  • accept that we're printing random things at random times

Both seem not ideal. I'm leaning slightly towards option 2.

@seabbs
Copy link
Contributor

seabbs commented Oct 25, 2024

It has to be 2. of those but yeah very much not ideal.

@nikosbosse
Copy link
Contributor Author

Urgh!!11!

I discovered a new issue:

debugonce(assert_forecast)
example_nominal

If assert_forecast is triggered by - whatever is triggering it, I don't even know which I think is part of the problem - then only 10 rows are passed on to assert_forecast(). For most forecast types that's fine, but for nominal forecasts this means that the validation fails immediately....

@nikosbosse
Copy link
Contributor Author

ok I think [.forecast is triggered from within print.data.table() where the dt is subset and only the first and last five rows are returned.

@nikosbosse
Copy link
Contributor Author

nikosbosse commented Oct 30, 2024

I think we need to (temporarily, I hope) remove some of the [.forecast functionality until just printing example_nominal doesn't fail anymore

@sbfnk
Copy link
Contributor

sbfnk commented Oct 30, 2024

I think that we can't fix the printing issue. Digging into the data.table code (and as @nikosbosse states above) this relies on a global variable in the package environment being set to the object address to let data.table know not to print the next data table that comes along.

This, we could perhaps address by cleverly manipulating the data.table package environment. However, there is some additional logic to avoid suppressing printing in certain instances. In particular, it expects the call stack to be of length <=2 (or containing specific strings) in

https://github.com/Rdatatable/data.table/blob/afa87e1eeb61922cdef512fefd86e3ba5847eef9/R/print.data.table.R#L33

In our case when called from print.forecast() we get a call stack of length 4

[[1]]
(function (x, ...) 
UseMethod("print"))(x)

[[2]]
print.forecast(x)

[[3]]
NextMethod()

[[4]]
print.data.table(x)

So the only option I can see would be to re-create the logic in data.table with our own global variable, checking the call stack etc. Perhaps better to live with printing upon calling :=?

@nikosbosse
Copy link
Contributor Author

Yeah it seems quite difficult... Maybe we should actually just use print.data.frame or maybe even a tibble version for printing...
I think we'll have to keep thinking about this after the CRAN release. For now I temporarily deleted the check that got triggered by subsetting a forecast

@sbfnk
Copy link
Contributor

sbfnk commented Oct 30, 2024

Maybe we should actually just use print.data.frame or maybe even a tibble version for printing...

We could but of course we’d still get prints with :=.

@nikosbosse
Copy link
Contributor Author

Argh. True... (empirically, I just tested it). But I'm not sure it always has to be the case. Hm....

@sbfnk
Copy link
Contributor

sbfnk commented Oct 30, 2024

My understanding is that because using [ with := returns a data.table it'll always call print unless assigned to something. The special thing in print.data.table is that it has a mechanism for suppressing printing under certain conditions - which we can't meet.

@MichaelChirico
Copy link
Contributor

Sorry for the headaches here all & thanks for digging into this. I haven't had time to read carefully but it looks like you're close to hitting on the root issues. I can point to where in print.data.table() [ is called on the input object:

https://github.com/Rdatatable/data.table/blob/98cf24e635ad47ba02617929d2237fca263e2b0c/R/print.data.table.R#L93

I wonder if calling `[.data.frame`(x, idx, ) there would help...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
high-priority Should be addressed soon
Projects
Status: No status
Development

No branches or pull requests

5 participants