Support multiple output formats #92

Open
fsmunoz opened this issue Feb 4, 2024 · 5 comments

fsmunoz commented Feb 4, 2024

Is your feature request related to a problem? Please describe.

Jupyter notebooks support multiple output formats for cell execution; depending on what's desired, one or more different entries in the data section, identified by MIME type, will contain the output in the corresponding format.

This is useful in general, but becomes critical when using any document production system that uses Jupyter kernels as the engine to create different types of documents (i.e. most of the solutions that fall within the "literate programming" approach): they will pick HTML if the desired output is HTML, and pick e.g. a PNG representation of a plot if the output is a PDF.
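The selection step described above can be sketched as follows. This is a minimal illustration and not how Quarto or any other tool actually implements it: the `PREFERENCES` table and the `pick_representation` function are hypothetical names, and the preference orders are assumptions made up for the example.

```python
# Hypothetical preference orders; real tools define their own.
PREFERENCES = {
    "html": ["text/html", "image/svg+xml", "image/png", "text/plain"],
    "pdf": ["application/pdf", "text/latex", "image/png", "text/plain"],
}


def pick_representation(data, target):
    """Return (mime_type, payload) for the first MIME type the target can use."""
    for mime in PREFERENCES[target]:
        if mime in data:
            return mime, data[mime]
    return None, None


# Placeholder payloads standing in for a cell's "data" section.
html_only = {"text/html": "<img src='...'>", "text/plain": "SAS output"}
with_png = dict(html_only, **{"image/png": "iVBORw0KGgo..."})

print(pick_representation(html_only, "html")[0])
print(pick_representation(html_only, "pdf")[0])
print(pick_representation(with_png, "pdf")[0])
```

With only `text/html` and `text/plain` available, a PDF target in this sketch has nothing better than plain text to fall back on; once an `image/png` entry exists, it gets picked instead.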

I will use Quarto as an example, but the idea is generic and applicable to other solutions. Consider the following source document:

---
title: "Quarto and SAS"
format:
  html:
    code-fold: true
engine: jupyter
---

Define a dataset:

```{sas}
*| output: false;
*| echo: true;
data grade;
    input subject gender $
        exam1 exam2 hwgrade $;
    datalines;
    10 m 80 84 a
     7 . 85 89 a
     4 f 90 .  b
    20 m 82 85 b
    26 f 94 94 a
    11 f 88 84 c
    ;
run;
```

Print it:

```{sas}
proc print data=grade;
    var subject gender; * print student id and gender;
run;

```

Plot it:

```{sas}
ods graphics on / width=3in;
proc sgplot data=grade;
    hbar gender / response=exam1 stat=mean datalabel categoryorder=respdesc;
run;

```

This works without any change for HTML output, because sas_kernel uses HTML(output), which automatically creates a text/html entry in the outputs array, and the HTML target makes use of HTML.

$ quarto preview sas.qmd --to html

image

This doesn't work if we specify PDF as the output, because the toolchain (in the case of Quarto, using pandoc and LaTeX) will have no way to render the HTML, and there is no alternative representation:

$ quarto preview sas.qmd --to pdf

image

The table works because Quarto has some automation that parses HTML tables and converts them to LaTeX, but it isn't able to convert the HTML plot into something that can be included in the PDF.

Quarto here is, and I must stress this, just an example: in general, the ability to have more outputs in the MIME bundle of the Jupyter cell outputs will be usable by any other tool.

Describe the solution you'd like

The solution I would like is different from the one I have prototyped. I think the better option would be to make the change at the SASpy level, using the extremely rich capabilities of ODS to allow other output formats to be specified there; sas_kernel would then use them.

That said, I've quickly put together something to show how this could work with changes made solely in sas_kernel, after studying the code and its use of MetaKernel: MetaKernel has some plumbing in place in the _formatter method that goes through the methods of an object and creates the necessary outputs. I've created a SASOutput class that implements _repr_png_ and _repr_latex_.

import re

from bs4 import BeautifulSoup as BS


class SASOutput(object):
    def __init__(self, data):
        self.data = data

    def __repr__(self):
        # __repr__ must return a str; fall back to the raw data if parsing fails
        try:
            soup = BS(self.data, 'html.parser')
            return soup.get_text()
        except Exception:
            return str(self.data)

    def _repr_html_(self):
        return self.data

    def _repr_png_(self):
        # Return the base64 payload of the first embedded <img>, if there is one
        try:
            soup = BS(self.data, 'html.parser')
            img_tag = soup.find('img')
            return img_tag['src'].split(',')[1]
        except Exception:
            return None

    def _repr_latex_(self):
        # Extract the LaTeX document that ODS embeds in the HTML, if present
        start_marker = r'\\documentclass\[10pt\]{article}'
        end_marker = r'\\end{document}'

        match = re.search(f'{start_marker}(.*?){end_marker}', self.data, re.DOTALL)
        if match:
            return match.group()
        return None

(consider this code an MVP and not something that I am proposing as a PR; it is meant to illustrate the possibilities more than anything)
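To make the role of the `_repr_*_` methods more concrete, here is a small sketch of the general idea: go through an object's `_repr_*_` methods and keep every result that is not `None`. It is not MetaKernel's actual `_formatter` code; `REPR_METHODS`, `build_mime_bundle`, and `FakeOutput` are hypothetical names used only for this illustration.

```python
# Assumed mapping from IPython-style method names to MIME types.
REPR_METHODS = {
    "_repr_html_": "text/html",
    "_repr_latex_": "text/latex",
    "_repr_png_": "image/png",
}


def build_mime_bundle(obj):
    """Collect every non-None _repr_*_ result into a MIME-keyed dict."""
    bundle = {"text/plain": repr(obj)}
    for method_name, mime in REPR_METHODS.items():
        method = getattr(obj, method_name, None)
        if callable(method):
            value = method()
            if value is not None:
                bundle[mime] = value
    return bundle


class FakeOutput:
    """Stand-in object for the demo; not the SASOutput class above."""

    def __repr__(self):
        return "fake output"

    def _repr_html_(self):
        return "<p>fake output</p>"

    def _repr_png_(self):
        return None  # no image available, so image/png is omitted


bundle = build_mime_bundle(FakeOutput())
print(sorted(bundle))
```

Returning `None` from a `_repr_*_` method is what lets a class offer a format only when it can actually produce it, which is why the methods above return `None` when no image or LaTeX is found.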

This assumes that the input is HTML, which seems to always be the case in SASpy. With this, the previous PDF example works, because the sas.ipynb that is created by Quarto contains an image/png entry with the plot (it would also be output for "regular" Jupyter notebook usage, but Jupyter would prefer the HTML version).

image

The .ipynb will contain the different formats, so the existing behaviour (text/html) would be unchanged:

 {
   "cell_type": "code",
   "execution_count": 3,
   "id": "a6bf524a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAA(...rest of base64 data)=",
      "text/html": [
       "<!DOCTYPE html>\n",
       "<html lang=\"en\" xml:lang=\"en\" xmlns=\"http://www.w3.org/1999/xhtml\">\n",
        (...)
       ]
     },
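A quick way to check which representations ended up in a notebook is to walk its JSON and list the keys of each output's `data` section. The sketch below uses a tiny made-up notebook held in memory and does not read the real sas.ipynb; `list_output_mime_types` is a hypothetical helper name.

```python
import json


def list_output_mime_types(notebook):
    """Return, for each code cell, the MIME types present in its outputs."""
    result = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        mimes = set()
        for output in cell.get("outputs", []):
            # Stream and error outputs have no "data" key, hence the default.
            mimes.update(output.get("data", {}))
        result.append(sorted(mimes))
    return result


# Made-up notebook content, shaped like the excerpt above.
notebook_json = json.dumps({
    "cells": [
        {"cell_type": "markdown", "source": ["Plot it:"]},
        {"cell_type": "code", "outputs": [
            {"output_type": "execute_result",
             "data": {"image/png": "iVBORw0KGgo=", "text/html": ["<html>"]}}
        ]},
    ]
})

print(list_output_mime_types(json.loads(notebook_json)))
```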

Describe alternatives you've considered

As mentioned, this approach builds on how SASpy currently works, which seems to hardcode ods html5 for non-text output (I could be completely wrong here; I'm basing this assertion on this documentation). The LaTeX parser above, for example, extracts the returned LaTeX from the middle of a lot of HTML code. Being able to specify the desired output format through SASpy would likely be better. Currently, using things like ods latex in the cells gives the expected output in the middle of HTML code, but I haven't tested this extensively.

Ideally, we would also be able to pass additional formatting options down the line: things like the plot title or the image width are generally implemented in a consistent way in this sort of tool, so that the same syntax can be used. For Quarto, execution options support things like fig-width that are applied to R, Python, and Julia. This must be supported in Quarto itself, but it requires a way to pass that information.

Additional context

This is something that I've been using/following for a while and has several previous references, and there seems to be a growing interest.

@tomweber-sas
Contributor

Hey @fsmunoz, I'm sorry, I've been busy with other things here and though I have read through this and the previous conversations (linked above) from a few years ago, I'm still not really sure what kind of solution you're looking for for this. SASPy and SAS_Kernel are very different. SAS_Kernel is just a Jupyter extension, so it could have some Jupyter specific things, like you have in your example above, and I suppose it would have to. SASPy isn't a Jupyter anything. It doesn't know if it's running in Jupyter or what specific things Jupyter can do as opposed to any other environment (other than render html), including other notebooks or UIs that it also supports. It can run anywhere Python runs, so anything that is specific to Jupyter isn't really ideal.

As for returning something other than HTML5, which is what's returned by way of using ODS, it's possible to return other ODS output, but how that would get rendered in any UI isn't something I have a clear understanding of. And, I don't know enough about ODS to be sure changing it to try to return other things would work for everything either. That would have to be investigated. I tried having it return a pdf and tried to get it to render, but I wasn't successful with that.

The idea of having helper functions to convert HTML to other formats, using whatever packages that can work seems like it could be less pervasive to SASPy in general. It would just be optional and dependent on having the right packages and the UI working with them. In the thread from a couple few years ago I think you mentioned even just documenting examples of doing this; the saspy_examples repo would be a good place for that! Especially if it's Jupyter specific things that don't necessarily work everywhere, then having example functions that do work in that environment would be good for anyone to use. Of course, if they work in other UIs then that's great too.

Sorry, that's as far as I've been able to get on this.

@fsmunoz
Author

fsmunoz commented Feb 11, 2024

Hello @tomweber-sas ,

I'll try to be as to the point as possible, especially because I think that my excessive contextualisation might have muddied the main points somewhat.

I'm still not really sure what kind of solution you're looking for for this.

Now: Using SAS Kernel only returns text/html (and text/plain) when a cell is executed.
Goal: Add more output formats to SAS Kernel, following the Jupyter spec, specifically image/png (or other image/* formats).

SASPy and SAS_Kernel are very different.

I am aware, but since SAS Kernel makes a call to SASpy to get the output, when discussing adding other output formats it can be argued that SASpy could implement that at that level:

  • Option 1, SAS Kernel calls SASpy, and always gets HTML. Any improvement in terms of adding additional output types must assume that we always get HTML from the SASpy call. This is what I did in my fork.
  • Option 2, SASpy is extended in terms of the SAS_output_options currently supported: if something different from html and html5 is added at this level, then SAS Kernel can use it instead of the above.

This is why I mentioned SASpy. I am aware it is different, but it's a dependency of SAS Kernel. Implementing the second option would also solve problems where Jupyter isn't used (like the original discussion, using knitr), whereas the first would solve it for approaches that use Jupyter.

like you have in your example above

My example above only works because I have implemented the code in my linked fork. It's not something that can be done completely at the user level. Without changes to the SAS Kernel, SAS output can't be used when the target output is not HTML.

it's possible to return other ODS output, but how that would get rendered in any UI isn't something I have a clear understanding of

The rendering isn't a problem that either SASpy or SAS Kernel should care much about. Just as returning HTML assumes that it can be displayed (a reasonable assumption for Jupyter default use-case), returning other MIME types should make the same assumption.

In my example, Quarto (and the same happens for knitr) picks up image/png when building a PDF, and passes that information to pandoc/LaTeX. Without that alternative representation, it uses the only available one (text/html), which isn't usable for PDFs (hence the issue on the inability of PDF exports) - but the key point to retain is that the choice is made by Quarto or knitr.

The current support for Zeppelin is also related to this: Zeppelin can't display the IPython HTML object, which is why it has an option. My issue is a more general version of this.

That would have to be investigated.

Agreed, but in general the rendering isn't a problem: if I ask for PDF output, then I am taking ownership of the ability to render that with whatever approach I'm using. Having that possibility opens a lot of doors. Specifically, returning image/png as in my fork automatically makes PDF export work: there was no change anywhere else.

The idea of having helper functions to convert HTML to other formats, using whatever packages that can work seems like it could be less pervasive to SASPy in general

See above for my take on the relative implications of each. I agree that it's less impactful, and would work for my use case (using Quarto, that runs SAS through the Jupyter kernel), although not for others (e.g. using knitr with SASmarkdown).

In the thread from a couple few years ago I think you mentioned even just documenting examples of doing this; the saspy_examples repo would be a good place for that!

It's already in there: SASpy with R Markdown . Note that:

  1. This completely user-level approach is not possible with Jupyter, at least without such an amount of coding that completely defeats the purpose of using this kind of document.
  2. It requires different code for different solutions to do something that could be solved in a single place.

This user example would be possible without any specific programming if SASpy supported returning text/image or text/md. Without that, a simple two-line plot:

  • Works when done in Python and R, since both approaches return non-HTML output as an option.
  • Does not work with SAS, since SASpy always returns HTML, unless more code is added to work around it.

@tomweber-sas
Contributor

Thanks for trying to clarify, I think I can try to do that as well. First, yes, I get that if saspy did the work, the sas_kernel would just get it since it's really just a jupyter extension wrapping the submit() method. It may need to be tweaked to know what to render, or rather how to render it.

On my side, some of my confusion is what output you want. I see you're saying things like text/html, and text/plain, image/png, text/image, text/md and other image/* formats. As well as mention of pdf and markdown.

SASPy doesn't create any output. The output is created by SAS, ODS specifically. ODS supports a number of different output types and it works with SAS to create output for whatever code is run. SASPy doesn't know what code you're submitting or what output may or may not be generated, for the submit methods. It knows the code it's generating for its methods, but still the output is created by SAS/ODS and not saspy. So from my side, I could return what ODS supports, but it doesn't support the things you're saying, other than maybe pdf. Here's the doc for ODS saying what it can produce.

HTML5 was the obvious choice for saspy since the different notebooks can actually render it, and the images are embedded in the html, so this works when connected to remote sessions. This is even good when not in a notebook (line mode python or batch script), as the HTML can be written to files which can then be rendered by anything that supports html; just a browser or any UI. Being able to know what output was created and how to access it is also part of this; for the analytic methods, for example, the output isn't simply one html document. It produces many different plot/graphs,... that can be individually accessed. So, that's where my confusion comes in. saspy isn't in the business of creating or manipulating output to try to make it become various jupyter supported types.

@fsmunoz
Author

fsmunoz commented Feb 13, 2024

On my side, some of my confusion is what output you want. I see you're saying things like text/html, and text/plain, image/png, text/image, text/md and other image/* formats. As well as mention of pdf and markdown.

Indeed this is an important question, and I've mentioned different things in different places. The answer is not a closed one, but there are two approaches I think:

  1. The minimal approach is to add one or more image/* , because with that most tools will be able to include an image. This works in a much more generic way than HTML because an image can be easily included almost anywhere, whereas HTML produces the display errors I've linked in the issue above.
  2. The more ambitious approach would be to add support to the output that can be obtained from ODS. This includes e.g., LaTeX, but doesn't include (to my knowledge) Markdown. That's fine, Markdown was just an example (and if there was a super important use-case, it's one of those cases where it could be produced by converting at the SAS Kernel level instead of coming from ODS via SASpy)

The IPython documentation indicates the following as "typical":

The following MIME types are usually implemented:

text/plain
text/html
text/markdown
text/latex
application/json
application/javascript
application/pdf
image/png
image/jpeg
image/svg+xml

Some of these do not make sense (JSON, for example), others are different image types, plain and html we already have.

SASPy doesn't create any output. The output is created by SAS, ODS specifically.

That's clear, but as it's currently done, it passes ods html5 ... somewhere, so we get the ODS HTML output (although, as in my LaTeX example before, even this could be partially handled at the SAS Kernel level, since when specifying ods latex ... I've noticed that I get HTML, but with LaTeX somewhere inside as CDATA).

So from my side, I could return what ODS supports, but it doesn't support the things you're saying, other than maybe pdf. Here's the doc for ODS saying what it can produce

This would be the ideal, ambitious scenario. I would even drop several of them immediately if it helps: Excel, Doc, PPT, RTF, EPUB, these do not make much sense for this - but PostScript makes a lot of sense since it can be used instead of images to produce PDFs.

We end up being back at:

  • image/* (could simply be image/png)
  • LaTeX
  • PDF
  • PostScript
  • HTML / HTML5 / XML
  • Plain

This would be an amazing improvement - even half of them! Especially because I'm looking at a specific scenario, but having this expanded ability would almost surely allow other usages in the future.
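As a rough illustration of how the shortlist above could line up with the MIME types expected in a cell's output, here is a hypothetical mapping. The format labels, the `FORMAT_TO_MIME` table, and `to_bundle_entry` are all assumptions for this sketch and do not exist in SASpy or SAS Kernel; the choice of which payloads are base64-encoded follows the usual notebook convention for binary data such as PNG and PDF.

```python
import base64

# Hypothetical labels -> (MIME type, whether the payload is binary).
FORMAT_TO_MIME = {
    "png": ("image/png", True),
    "pdf": ("application/pdf", True),
    "latex": ("text/latex", False),
    "postscript": ("application/postscript", False),
    "html": ("text/html", False),
    "plain": ("text/plain", False),
}


def to_bundle_entry(fmt, payload):
    """Return (mime_type, payload), base64-encoding binary payloads."""
    mime, is_binary = FORMAT_TO_MIME[fmt]
    if is_binary:
        payload = base64.b64encode(payload).decode("ascii")
    return mime, payload


# The eight-byte PNG signature stands in for a real image.
print(to_bundle_entry("png", b"\x89PNG\r\n\x1a\n"))
print(to_bundle_entry("latex", r"\begin{tabular}{ll} a & b \end{tabular}"))
```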

HTML5 was the obvious choice for saspy since the different notebooks can actually render it, and the images are embedded in the html, so this works when connected to remote sessions.

This makes complete sense and it's the way every other kernel is done. I wouldn't change it as a default; the way I see it, nothing should change in terms of how it works today for the Jupyter usage scenario.

This is even good when not in a notebook (line mode python or batch script), as the HTML can be written to files which can then be rendered by anything that supports html; just a browser or any UI

Yes, and I have investigated ways of working with saved files, but it does require changes to the tools I'm using as examples because some output formats are not easily produced with a 100% HTML source format.

It produces many different plot/graphs,... that can be individually accessed. So, that's where my confusion comes in. saspy isn't in the business of creating or manipulating output to try to make it become various jupyter supported types.

I hadn't thought about this, tbh; it is a good point. I know that when using StatRep I explicitly select the output object that was produced in the previous step. I haven't hit a problem with my initial approach of getting an image out of the HTML, but that's because I have assumed I'm getting a single image inside the HTML.
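If the single-image assumption does not hold, one option is to collect every embedded image and not just the first. The sketch below uses only the Python standard library and assumes the images are embedded as base64 data: URIs in img tags; `DataUriImageCollector` and `extract_images` are hypothetical names, and the HTML is made up for the example.

```python
from html.parser import HTMLParser


class DataUriImageCollector(HTMLParser):
    """Collect (mime_type, base64_payload) for every <img> with a data: URI."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src") or ""
        if src.startswith("data:") and ";base64," in src:
            header, payload = src.split(",", 1)
            mime = header[len("data:"):].split(";", 1)[0]
            self.images.append((mime, payload))


def extract_images(html):
    """Return all embedded images found in the HTML, in document order."""
    collector = DataUriImageCollector()
    collector.feed(html)
    return collector.images


# Made-up HTML: two embedded images and one external reference (skipped).
html = (
    '<div><img src="data:image/png;base64,AAAA"/>'
    '<img alt="b" src="data:image/png;base64,BBBB"></div>'
    '<img src="plot.png">'
)
print(extract_images(html))
```

How several images would then map onto a single cell output is a separate question, since a MIME bundle holds one payload per MIME type.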

@tomweber-sas
Contributor

tomweber-sas commented Feb 13, 2024

You can easily try any of the ODS outputs and see if they get what you want. When results='HTML', I submit ODS statements around the code being submitted. If results='TEXT' then I don't. So you can code any ODS you want to see what you get by just doing the following. I tried it with PDF and couldn't get the PDF rendered, but again, I don't know anything about these formats w/ jupyter or packages that are needed to be able to process and render any of that.

import saspy
sas = saspy.SASsession(results='text')

x = sas.submit(
'''
ods listing close;               /* close the default so you can set what you want */
ods pdf file=stdout;             /* try what you want here. use stdout so you get the results back to you */
ods graphics on / outputfmt=png; /* you can play with this too, I don't know all the options and variations for everything */

proc print data=sashelp.cars(obs=1);run;

ods pdf close;                   /* close it so it finishes and you get the result */
ods listing;                     /* set the default back, just to be complete */
''')

print(x['LST'].encode('ascii','ignore')) # this displays the pdf as a string of bytes, then render it however that works

b'%PDF-1.5\r%\r\n16 0 obj<</Length 1503/Filter/FlateDecode>>\rstream\r\nxn6\x10_vu\x1cwRq$N\x00%=@\x17mW\x04\x08\x02M\x17}&\x1fDD\x18\x01D}3\x1cO\x11|?||G\x18UK8y?q\x07TRK\nO\x7fw?0\x1f)cl\x1dv&\\WYm.\x13|}6osz\\\x1f\x0ep
[...]
XD\x16J\x02m\x05z R\tL\x00"5z9H\x0e\x10+\x0e"\x16H\x7f-Afe`\x00\x00Y\x11\x03\r\nendstream\nendobj\ntrailer\n<<\n/Size 31/Root 1 0 R\n>>\nstartxref\r\n86270\r\n%%EOF\r\n'

Searching for how to display this, I found and tried:

from IPython.display import Image
Image(x['LST'].encode('ascii','ignore'))

but that didn't work, and I don't know why and didn't have enough time to try to get further on it.
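Two things may be worth checking here; they are offered as assumptions and not as a confirmed diagnosis. First, IPython's Image class is meant for image formats such as PNG or JPEG, not for PDF. Second, encoding text with 'ascii' and 'ignore' silently drops every non-ASCII character, which would damage a binary payload. The sketch below uses made-up bytes in place of a real PDF and a hypothetical `pdf_bundle` helper to show both the loss and how PDF bytes can be wrapped as an application/pdf entry.

```python
import base64


def pdf_bundle(pdf_bytes):
    """Wrap PDF bytes as a Jupyter-style MIME bundle (base64 text payload)."""
    if not pdf_bytes.startswith(b"%PDF-"):
        raise ValueError("payload does not start with a PDF header")
    return {"application/pdf": base64.b64encode(pdf_bytes).decode("ascii")}


# Made-up bytes standing in for a real PDF; 0xE2 and 0xE3 are non-ASCII.
original = b"%PDF-1.5\n\xe2\xe3 stream data\n%%EOF\n"

# Round-tripping through text and dropping non-ASCII characters loses bytes.
lossy = original.decode("latin-1").encode("ascii", "ignore")
print(len(original), len(lossy))

bundle = pdf_bundle(original)
print(base64.b64decode(bundle["application/pdf"]) == original)
```

A bundle like this can be passed to `IPython.display.display(bundle, raw=True)`; whether it is then rendered depends on the frontend supporting application/pdf.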

If any of these ODS outputs work as expected for the various cases, it wouldn't be hard to add support. But if they don't just work as needed and have to be manipulated to produce something that works, that's a different story.

Give ODS a try and see if you can get the formats you need,
Tom
