diff --git a/_pkgdown.yml b/_pkgdown.yml index 210132cd..a4a73e0c 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -89,3 +89,5 @@ articles: navbar: ~ contents: - articles/validation + - intro-xml + - intro-episode diff --git a/vignettes/intro-episode.Rmd b/vignettes/intro-episode.Rmd new file mode 100644 index 00000000..c7e969a6 --- /dev/null +++ b/vignettes/intro-episode.Rmd @@ -0,0 +1,461 @@ +--- +title: "Introduction to the Episode Object" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Introduction to the Episode Object} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +## Introduction + +The {pegboard} package facilitates the analysis and manipulation of Markdown and +R Markdown files by translating them to XML and back again. This extends the +{tinkr} package by providing additional methods that are specific for +Carpentries-style lessons. There are two `R6` classes defined in {pegboard}: + + - `Episode` objects that contain the XML data, YAML metadata and extra fields + that define the child and parent files for a particular episode + - `Lesson` objects that contain lists of `Episode` objects categorised as + "episodes", "extra", or "children" + +This vignette will be discussing the structure of Episode objects, how to +query the contents with the {xml2} package, and how to use the methods and +active bindings to get information about, extract, and manipulate anything +inside of a Markdown or R Markdown document. + +## Reading Markdown Content + +Each `Episode` object starts from a Markdown file. In particular for {pegboard}, +we assume that this Markdown file is written using +[Pandoc](https://pandoc.org/MANUAL.html) syntax (a superset of +[CommonMark](https://commonmark.org/)). 
It can be any markdown file, but for us +to explore what the `Episode` object has to offer, let's take an example R +Markdown file that is present in a fragment of a Carpentries Workbench lesson +that we have in this package. We will be using the {xml2} package to explore +the object and the {fs} package to help with constructing file paths. + +```{r setup} +library("pegboard") +library("xml2") +library("fs") +``` + +This is what our lesson fragment looks like. It is a fragment because its main +purpose is to be used for examples and tests, but it contains the basic structure +of a lesson that we want. + +```{r intro-read-noshow, echo = FALSE} +dir_tree(lesson_fragment("sandpaper-fragment"), recurse = 1, regex = "site/[^R].*", invert = TRUE) +``` + +We can retrieve it with the `lesson_fragment()` function, which loads example +data from pegboard. Here we will take that lesson fragment and read in the first +episode with the initialization method, `Episode$new()`, followed by +`$confirm_sandpaper()`, a confirmation that the episode was created to work +with [{sandpaper}], the user interface and build engine of The Carpentries +Workbench (for information on non-workbench content, see the section on [Jekyll +Lesson Markdown Content](#jekyll-lesson-markdown-content)), and `$protect_math()`, +which will prevent special characters in LaTeX math from being escaped. + +[{sandpaper}]: https://carpentries.github.io/sandpaper/ + +```{r intro-read} +lsn <- lesson_fragment("sandpaper-fragment") +# Read in the intro.Rmd document as an `Episode` object +intro_path <- path(lsn, "episodes", "intro.Rmd") +intro <- Episode$new(intro_path)$confirm_sandpaper()$protect_math() +``` + +If we print out the Episode object, we get a long list of methods, +fields and active bindings (functions that act like fields): + +```{r intro-print} +intro +``` + +The actual XML content is in the `$body` field. This contains all the data from +the markdown document, but in XML form. 
+ +```{r intro-body} +intro$body +``` + +If you want to see what the contents look like, you can use the `$show()`, +`$head()`, or `$tail()` methods (note: the `$show()` method will print out the +entire markdown document). + +```{r intro-show} +intro$head(10) +intro$tail(10) +intro$show() +``` + +## File information + +For information about the file and its relationship to other files, you can use +the following active bindings, which are useful when working with Episodes in a +lesson context. + +```{r file-active-bindings} +intro$path +intro$name +intro$lesson +# NOTE: relationships to other episodes are automatically handled in the +# Lesson context +intro$has_parents +intro$has_children +intro$children # separate documents processed as if they were part of this document +intro$parents # the immediate documents that would require this document to build +intro$build_parents # the final documents that would require this document to build +``` + +## Accessing Markdown Elements + +The `Episode` object is centered around the `$body` item, which contains the XML +representation of the document. 
It is possible to find markdown elements from XPath +statements: + +```{r xpath-active-bindings} +xml2::xml_find_all(intro$body, ".//md:link", ns = intro$ns) +xml2::xml_find_first(intro$body, ".//md:list[@type='ordered']", ns = intro$ns) +``` + +However, there are some useful elements that we want to know about, so I have +implemented them in active bindings and methods: + + +```{r active-bindings} +# headings where level 2 headings are equivalent to sections +intro$headings +# all callouts/fenced divs +intro$get_divs() +intro$challenges +intro$solutions +# questions, objectives, and keypoints are standard and return char vectors +intro$objectives +intro$questions +intro$keypoints +# code blocks and output types +intro$code +intro$output +intro$warning +intro$error +# images and links +intro$images +intro$get_images() # parses images embedded in `<img>` tags +intro$links +``` + +Many of these are summarized in the `$summary()` method: + +```{r summary} +intro$summary() +``` + +## Code blocks and code chunks + +In markdown, a **code block** is written with fences of at least three backtick +characters (`` ` ``) followed by the language for syntax highlighting: + +````markdown + +List all files in reverse temporal order, printing their sizes in +a human-readable format: + +```bash +ls -larth /path/to/folder +``` +```` + +> List all files in reverse temporal order, printing their sizes in +> a human-readable format: +> +> ````bash +> ls -larth /path/to/folder +> ```` + +When these are processed by {pegboard}, the resulting XML has this structure +where the backticks determine the kind of node (`code_block`) and the language +type is stored as the "info" attribute. 
Everything inside the code block is the +node text and has whitespace preserved + +````{r show-code-block, echo = FALSE, results = 'asis'} +cb <- "```bash + +ls -larth /path/to/folder +```" +cbx <- xml2::read_xml(commonmark::markdown_xml(cb)) +txt <- as.character(xml2::xml_find_first(cbx, ".//d1:code_block")) +writeLines(c("```xml", txt, "```")) +```` + +In R Markdown, there are special code blocks that are called code chunks that +can be dynamically evaluated. These are distinguished by the curly braces +around the language specifier and [optional +attributes](https://yihui.org/knitr/options/) that control the output of the +chunk. + +````{verbatim} + +There is a code chunk here that will produce a plot, but not show the code: + +```{r chunky, echo=FALSE, fig.alt="a plot of y = mx + b for m = 1 and b = 0"} +plot(1:10, type = "l") +``` + +```` + + +> There is a code chunk here that will produce a plot, but not show the code: +> +> ````{r chunk-name, echo = FALSE, fig.alt="a plot of y = mx + b for m = 1 and b = 0"} +> plot(1:10, type = "l") +> ```` + +When this is processed with {pegboard}, the "info" part of the code block is +further split into "language", "name" and further attributes based on the chunk +options: + +````{r show-code-chunk, echo = FALSE, results = 'asis'} + +chunk <- 'There is a code chunk here that will produce a plot, but not show the code: + +```{r chunky, echo=FALSE, fig.alt="a plot of y = mx + b for m = 1 and b = 0"} + +plot(1:10, type = "l") +```' +tmp <- tempfile() +writeLines(chunk, tmp) +chunky <- pegboard::Episode$new(tmp)$code[[1]] +xml2::xml_set_attr(chunky, "sourcepos", NULL) +txt <- as.character(chunky) +writeLines(c("```xml", txt, "```")) +unlink(tmp) +```` + +Both code blocks will be encountered, but the difference between them is that +the R Markdown code chunks will have the "language" attribute. 
This is an +important concept to know about when you are searching and manipulating R +Markdown documents with XPath +(see `vignette("intro-xml", package = "pegboard")`). The next section will walk +through some aspects of manipulation that we can do with these documents. + +## Manipulation + +Because everything centers around the `$body` element and is extracted with +{xml2}, it's possible to manipulate the elements of the document. One thing that +is possible is that we can add new content to the document using the `$add_md()` +method, which will add a markdown element after any paragraph in the document. + +For example, we can add information about pegboard with a new code block after +the first heading: + +````{r add-code-block} +intro$head(26) # first 26 lines +intro$body # first heading is item 11 +cb <- c("You can clone the **{pegboard} package**: + +```sh +git clone https://github.com/carpentries/pegboard.git +``` +") +intro$add_md(cb, where = 11) +intro$head(26) # code block has been added +intro$code +```` + +You can also manipulate existing elements. For example, let's say we wanted to +make sure all R code chunks were named. 
We can do so by querying and +manipulating the code blocks: + +```{r update-code-block} +code <- intro$code +code +# executable code chunks will have the "language" attribute +is_chunk <- xml2::xml_has_attr(code, "language") +chunks <- code[is_chunk] +chunk_names <- xml2::xml_attr(chunks, "name") +nonames <- chunk_names == "" +chunk_names[nonames] <- paste0("chunk-", seq(sum(nonames))) +xml2::xml_set_attr(chunks, "name", chunk_names) +code +``` + +We can see that the chunks now have names, but the proof is in the rendering: + +```{r show-updated} +intro$show() +``` + +One of the things about manipulating these documents in code is that it is +possible to go back and reset if things are not correct, which is why we have +the `$reset()` method: + +```{r} +intro$reset()$confirm_sandpaper()$protect_math()$head(25) +``` + +## Jekyll Lesson Markdown Content + +This section describes the features that you would expect to find in a lesson +that was built with the former infrastructure, +which was built using the Jekyll +static site generator. Lessons in this style are no longer supported by The +Carpentries. {pegboard} does support these lessons so that they can be +transitioned to use The Workbench syntax via [The Carpentries Lesson Transition +Tool](https://github.com/carpentries/lesson-transition#readme). This +was the _first_ syntax that was supported by {pegboard} because the package was +written initially as a way to explore the structure of our lessons. + +### The Syntax of Jekyll Lessons + +The former Jekyll syntax used [kramdown-flavoured +markdown](https://kramdown.gettalong.org/syntax.html), which evolved separately +from [commonmark](https://spec.commonmark.org/), the syntax that {pegboard} +knows and that Pandoc-flavoured markdown extends. 
One of the key differences +with the kramdown syntax is that it used something known as [Inline Attribute +Lists (IAL)](https://kramdown.gettalong.org/syntax.html#inline-attribute-lists) to +help define classes for markdown elements. These elements were formatted as +`{: }`, with class definitions and key/value pairs placed after the colon. +They always appear _after_ the relevant block, which led to +code blocks that looked like this: + +````markdown +~~~ +ls -larth /path/to/dir +~~~ +{: .language-bash} +```` + +Moreover, to achieve the special callout blocks, we used blockquotes that were +given special classes (which is an accessibility no-no because those blocks were +not semantic HTML) and the nesting of these block quotes looked like this: + + +````markdown +> ## Challenge +> +> How do you list all files in a directory in reverse order by the time it was +> last updated? +> +> > ## Solution +> > +> > ~~~ +> > ls -larth /path/to/dir +> > ~~~ +> > {: .language-bash} +> {: .solution} +{: .challenge} +```` + +One of the biggest challenges with this for authors was that, unless you used an +editor like vim or emacs, this was difficult to write with all the prefixed +blockquote characters and keeping track of which IALs belonged to which block. + +### Special methods and active bindings + +```{r setup-again} +library("pegboard") +library("xml2") +library("fs") +``` + +Episodes written in the Jekyll syntax have special functions and active bindings +that allow them to be analyzed and transformed to Workbench episodes. Here is an +example from a lesson fragment: + + +```{r jekyll-fragment-read} +lf <- lesson_fragment() +ep <- Episode$new(path(lf, "_episodes", "14-looping-data-sets.md")) +# show relevant sections of head and tail +ep$head(29) +ep$tail(53) +``` + +Notice that the questions, objectives, and keypoints are in the yaml frontmatter. 
+This is why we have an accessor that returns the list instead of the node, for +compatibility with the Jekyll lessons: + +```{r qok} +ep$questions +ep$objectives +ep$keypoints +``` + +Even though the challenges are formatted differently, the accessors will still +return them correctly: + +```{r challenges} +ep$challenges +ep$solutions +``` + +You can also get _all_ of the block quotes using the `$get_blocks()` method. +NOTE: this will extract _all_ block quotes (including those that do not have +the `ktag` attribute). + +```{r get_blocks} +ep$get_blocks() # default is all top-level blocks (challenges/callouts) +ep$get_blocks(level = 2) # nested blocks are usually solutions +ep$get_blocks(level = 0) # level zero is all levels +ep$get_blocks(type = ".solution", level = 0) # filter by type +``` + +One of the things that was advantageous about blockquotes is that we could +analyze the pathway through the blockquotes and figure out how they were commonly +written in a lesson. The `$get_challenge_graph()` method creates a data frame that +describes these relationships: + +```{r get-challenge-graph} +ep$get_challenge_graph() +``` + +You might notice that there is an attribute called `ktag`. When a +Jekyll-formatted episode is read in, all of the IAL tags are processed and +placed in an attribute called `ktag` (**k**ramdown **tag**), which is +accessible via the `$tags` active binding. This is needed because commonmark +does not know how to process postfix tags and it is important for the +translation to commonmark syntax: + +```{r ktags} +ep$tags +xml2::xml_parent(ep$tags) +``` + + +### Transformation + +It was always known that we would want to use a different syntax to write the +lessons as much of the community struggled with the kramdown syntax and it +was difficult to parse and validate. The automated transformation workflow is +what powers the Lesson Transition Tool and we have composed it into a few +basic steps: + +1. transforming block quotes to fenced divs +2. 
removing the Jekyll syntax and liquid templating, and fixing relative links +3. moving the yaml frontmatter + +The process looks like this composable chain of methods: + +```{r} +ep$reset() +ep$ + unblock()$ + use_sandpaper()$ + move_questions()$ + move_objectives()$ + move_keypoints() +ep$head(31) +ep$tail(65) +``` + + diff --git a/vignettes/intro-xml.Rmd b/vignettes/intro-xml.Rmd new file mode 100644 index 00000000..a43a44b4 --- /dev/null +++ b/vignettes/intro-xml.Rmd @@ -0,0 +1,536 @@ +--- +title: "Working with XML data" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Working with XML data} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + +## Introduction + +You will want to read this vignette if you are interested in contributing to +{pegboard}, or if you would like to understand how to fine-tune the transition of +a lesson from the styles infrastructure to The Workbench, +or if you want to +know how to better inspect the output of some of {pegboard}'s accessors. In +this vignette, I assume that you are familiar with writing R functions and that +R will default to passing an object's _value_ to a function and not a +_reference_ (though if you do not understand that last part, do not worry, I +will try to explain it). + +The {pegboard} package is an enhancement of the {tinkr} package, which +transforms Markdown to XML and back again. [XML is a markup language that +is closely related to HTML](https://www.geeksforgeeks.org/html-vs-xml/) and designed to +handle structured data. A more modern format for storing and transporting data +on the web is JSON, but the advantage of using XML is that we are able to use the +[XPath] language to parse it (more on that later). 
Moreover, because XML has +the same structure as HTML, it can be parsed using the same tools, which is +advantageous for a suite of packages that transforms Markdown to HTML. This +transformation is facilitated by the [{commonmark}] package for transforming Markdown +to XML and the [{xslt}] package for transforming XML to Markdown. + +[{commonmark}]: https://docs.ropensci.org/commonmark/ +[{xslt}]: https://docs.ropensci.org/xslt/ +[{xml2}]: https://xml2.r-lib.org/ +[XPath]: https://en.wikipedia.org/wiki/XPath + +## Motivating Example + +During the lesson transition, I was often faced with situations that required +me to perform intricate replacements in documents while preserving the structure. +One such example is transitioning the "workshop" or "overview" lessons that did +not have any episodes and relied on separate child documents to separate out +redundant elements. Let's say we had a file called `setup.md` and two other +files called `setup-python.md` and `setup-r.md` that look like this: + +`setup.md`: + +````markdown +## Setup Instructions + +### Python + +{% include setup-python.md%} + +### R + +{% include setup-r.md %} +```` + +`setup-python.md`: + +````markdown +Install _python_ from **anaconda** +```` + +`setup-r.md`: + +````markdown +Install _R_ from **CRAN** +```` + +The output of `setup.md` when it is rendered would include the text from both +`setup-python.md` and `setup-r.md`, but the `{% include %}` tags +are a syntax that is specific to Jekyll. 
Instead, for The Workbench, we wanted +to use the [R Markdown child document +declaration](https://bookdown.org/yihui/rmarkdown-cookbook/child-document.html), +so that `setup.md` would look like this: + +`setup.md`: + +````{verbatim} +## Setup Instructions + +### Python + +```{r child="files/setup-python.md"} +``` + +### R + +```{r child="files/setup-r.md"} +``` +```` + + +```{r setup-setup} +setup_file <- tempfile(fileext=".md") +stp <- "## Setup Instructions + +### Python + +{% include setup-python.md%} + +### R + +{% include setup-r.md %} +" +writeLines(stp, setup_file) +``` + +By using the following function (originally in +[lesson-transition/datacarpentry/ecology-workshop.R](https://github.com/carpentries/lesson-transition/blob/f8edb10b2e13a926e3df9ba522983f930d0ee19b/datacarpentry/ecology-workshop.R#L23-L44)), it was possible: + +```{r child-from-include} +child_from_include <- function(from, to = NULL) { + to <- if (is.null(to)) fs::path_ext_set(from, "Rmd") else to + rlang::inform(c(i = from)) + ep <- pegboard::Episode$new(from) + # find all the {% include file.ext %} statements + includes <- xml2::xml_find_all(ep$body, + ".//md:text[starts-with(text(), '{% include')]", ns = ep$ns) + # trim off everything but our precious file path + f <- gsub("[%{ }]|include", "", xml2::xml_text(includes)) + # give it a name + fname <- paste0("include-", fs::path_ext_remove(f)) + # make sure the file path is correct + f <- sQuote(fs::path("files", f), q = FALSE) + p <- xml2::xml_parent(includes) + # remove text node + xml2::xml_remove(includes) + # change paragraph node to a code block and add chunk attributes + xml2::xml_set_name(p, "code_block") + xml2::xml_set_attr(p, "language", "r") + xml2::xml_set_attr(p, "child", f) + xml2::xml_set_attr(p, "name", fname) + fs::file_move(from, to) + ep$write(fs::path_dir(to), format = "Rmd") +} +writeLines(readLines(setup_file)) # show the file +child_from_include(setup_file) +writeLines(readLines(fs::path_ext_set(setup_file, 
"Rmd"))) # show the file +``` + +This is only a small peek of what is possible with XML data, and even if you are +familiar with R, some of this may seem like strange syntax. If you would like +to understand a bit more, read on. + +## Working with XML data + +Each `Episode` object contains a field (you can think of each field as a list +element) called `$body`, which contains an {xml2} document. This is the core of +the `Episode` object and every function works in some way with this field. + +### The memory of XML objects + +For the casual R user (and even for the more experienced), the way you use this +package may seem a little strange. This is because in R, functions usually do not +have side effects, but the vast majority of methods in the `Episode` object +will modify the object itself and this all has to do with the way XML data is +handled in R by the {xml2} package. + +Normally in R, when you pass data to a function, it will make a copy of the +data and then apply the function to the copy of the data: + +```{r} +x <- 1:10 +f <- function(x) { + # insert 99 after the fourth position in a vector + return(append(x, 99, after = 4)) +} +print(f(x)) +# note that x is not modified +print(x) +``` + +When working with XML in R, the {xml2} package is unparalleled, but it leads to +surprising outcomes because when you modify content within an XML object, you +are modifying the object in place: + +```{r xml-example} +x <- xml2::read_xml("<a><b></b></a>") +print(x) +f <- function(x, new = "c") { + xml2::xml_add_child(x, new, .where = xml2::xml_length(x)) + return(x) +} +y <- f(x) +# note that x and y are identical +print(x) +print(y) +``` + +It gets a bit stranger when you consider that in the above code, `y` and `x` are +_exactly the same object_, as shown by the fact that if I manipulate `y`, then +`x` will also be modified: + +```{r xml-example-dup} +f(y, "d") +print(y) +print(x) +``` + +I can even extract child elements from the XML document and manipulate _those_ +and have them be reflected 
in the parent. For example, if I extract the second +child of the document, and then apply the `cool="verified"` attribute to the +child, it will be reflected in the parent document. + +```{r xml-example-child} +child <- xml2::xml_child(x, 2) +xml2::xml_set_attr(child, "cool", "verified") +print(child) +print(x) +print(y) +``` + +This persistence lends itself very well to using the {R6} package for creating +objects that work in a more object-oriented way (where methods belong to classes +instead of the other way around). If you are familiar with how Python methods +work, then you will be mostly familiar with how the {R6} objects behave. It is +worthwhile to read the [{R6} introduction +vignette](https://r6.r-lib.org/articles/Introduction.html) if you want to +understand how to program and modify this package. + +In the example above, you notice that I use `xml2::xml_child()` to extract child +nodes, but the real power of XML comes from searching for items using XPath +syntax to traverse the XML nodes. For example, I could do either of the +following to get the child called "c": + +```{r xml-example-xpath} +xml2::xml_find_first(x, ".//c") +xml2::xml_find_first(x, "/a/c") +``` + +The next section will cover a bit of XPath and provide some resources on how to +practice and learn because this comes in very handy to quickly traverse the XML +nodes without relying on loops. + +## Using XPath to parse XML + +### The structure of XPath + +In this section, we will talk about [XPath syntax][XPath-1.0], but it will be +non-exhaustive. 
Unfortunately, good tutorials on the web are few and far between, +but here are some that can help: + + - The [MDN documentation](https://developer.mozilla.org/en-US/docs/Web/XPath) + is _usually_ pretty good, but here it works better as a reference + - [MDN XPath Axes](https://developer.mozilla.org/en-US/docs/Web/XPath/Axes) + good for knowing how to navigate among nodes + - [MDN XPath + functions](https://developer.mozilla.org/en-US/docs/Web/XPath/Functions) + good for knowing how to filter node matches + - The [w3schools tutorial on + XPath](https://www.w3schools.com/xml/xpath_intro.asp) is actually one of the + best out there, but this is an exception to the rule. Other than this + tutorial, I would not trust any content from w3schools (they are not affiliated + at all with the web consortium). + - An [XPath tester](https://extendsclass.com/xpath-tester.html), like a regex + tester, that allows you to try out complex queries in a visual manner. + +[XPath-1.0]: https://en.wikipedia.org/wiki/XPath#Syntax_and_semantics_(XPath_1.0) + +It's important to remember that an XML document is a tree-like structure that +is similar to directories or folders on your computer. For example, if you look +at the source directory structure of this package, you would see a folder +called `R/` and a nested folder called `tests/testthat/`. If you started from +the root directory of this package, you would list the R files in the `R/` +folder with `ls R/*.R`; similarly, if you wanted to list the R files in the +`tests/testthat/` folder, you would use `ls tests/testthat/*.R`. In this +respect, XPath has a very similar syntax: to enter the next level of nesting, +you add a slash (`/`). 
For example, let's take a look at what the file structure +would look like in XML form: + +```{r XML-files, echo = FALSE, results = "asis"} +x <- ' + + one + two + + + + + test-data + + test-one + test-two + + +' +writeLines(c("```xml", x, "```")) +xml <- xml2::read_xml(gsub("\\n", "", x)) +``` + +The XPath syntax to find all files in the R and testthat folders would be +the same if you started from the root: `R/file` and +`tests/testthat/file`. + +```{r} +xml2::xml_find_all(xml, "R/file") +xml2::xml_find_all(xml, "tests/testthat/file") +``` + +However, XPath has one advantage that normal command line syntax doesn't have: +you can short-cut paths, so if we wanted to find all files in any given folder, +you can use the double slash (`//`) to recursively search through nesting. By +habit, I normally precede these slashes with a dot (`.`) so that +I can be sure to start with the node that I have in my variable: + +```{r} +xml2::xml_find_all(xml, ".//file") +``` + +Of course, this method finds _all_ files, so if you wanted to filter them, you +can use the bracket notation to filter the selection based on the +`ext` attribute (attributes are prefixed by `@`). With the bracket notation, you add +brackets to a node selection with a condition. 
In this case, we want to test +that the extension is 'R', so we would use `[@ext='R']`: + +```{r} +xml2::xml_find_all(xml, ".//file[@ext='R']") +``` + +In this scheme, I've put the file names as the text of the nodes, so we can +use the bracket notation again with [XPath functions](https://developer.mozilla.org/en-US/docs/Web/XPath/Functions) to filter for only files that contain "one": + +```{r} +xml2::xml_find_all(xml, ".//file[@ext='R'][contains(text(), 'one')]") +``` + +If I only wanted to extract source files that contain "one", I could also use +the `parent::` [XPath axis](https://developer.mozilla.org/en-US/docs/Web/XPath/Axes): + +```{r} +xml2::xml_find_all(xml, ".//file[@ext='R'][contains(text(), 'one')][parent::R]") +``` + +Note that if I used a slash (`/`) instead of square brackets for the parent, I +would get the parent back: + +```{r} +xml2::xml_find_all(xml, ".//file[@ext='R'][contains(text(), 'one')]/parent::R") +``` + +As you can see, an XPath query can often get kind of hairy, which is why +I like to compose it from different parts during programming with {glue}: + +```{r} +predicate <- "[@ext='R'][contains(text(), 'one')]" +XPath <- glue::glue(".//file{predicate}/parent::R") +xml2::xml_find_all(xml, XPath) +``` + +In the next section, I will discuss how to extract and manipulate XML that comes +from Markdown with namespaces. + +## XML data from Markdown using namespaces + +The XML from markdown transformation is fully handled by the {commonmark} +package, which has the convenient `commonmark::markdown_xml()` function. 
For +example, this is how the following markdown is processed: + +```markdown +This is a bunch of [example markdown](https://example.com 'for example') text + +- this +- is +- a **list** +``` + +> This is a bunch of [example markdown](https://example.com 'for example') text +> +> - this +> - is +> - a **list** + + +```{r commonmark-ex} +md <- c("This is a bunch of [example markdown](https://example.com 'for example') text", + "", + "- this", + "- is", + "- a **list**" +) +xml_txt <- commonmark::markdown_xml(paste(md, collapse = "\n")) +class(xml_txt) +writeLines(xml_txt) +``` + +You can see that it has successfully parsed the markdown into a paragraph and +a list and then the various elements within. + +### The default namespace + +Now here's the catch: The commonmark XML always starts with this basic +skeleton which has the root node of `<document>`. The `xmlns` attribute defines the +[default XML namespace][namespace]: + +[namespace]: https://developer.mozilla.org/en-US/docs/Web/SVG/Namespaces_Crash_Course + +```{r commonmark-skel, echo = FALSE} +lines <- strsplit(commonmark::markdown_xml("hi"), "\n")[[1]][-(4:6)] +writeLines(append(lines, "\nMARKDOWN CONTENT HERE\n", after = 3)) +``` + +In many XML applications, namespaces will come with prefixes, which are defined +in the `xmlns` attribute (e.g. `xmlns:svg="http://www.w3.org/2000/svg"`). If a +node has a namespace, it needs to be selected with the namespace prefix like +so: `.//svg:circle`. For default namespaces, the same rule applies, but the +question becomes: how do you know what the namespace prefix is? In {xml2}, the +default namespace always begins with `d1` and increments up as new namespaces +are added. 
You can inspect the namespace with `xml2::xml_ns()`: + +```{r commonmark-namespace-show} +xml <- xml2::read_xml(xml_txt) +xml2::xml_ns(xml) +``` + +Thus, the XPath query you would use to select a paragraph would be +`.//d1:paragraph`: + +```{r commonmark-namespace} +# with namespace prefix +xml2::xml_find_all(xml, ".//d1:paragraph") +``` + +Of course, having a default namespace in {xml2} has some drawbacks in that +[adding new nodes will duplicate the namespace with a different +identifier](https://community.rstudio.com/t/adding-nodes-in-xml2-how-to-avoid-duplicate-default-namespaces/84870), so one way we have avoided this in {tinkr} (the +package that does the basic conversion) is to define a namespace with a prefix +in a function so that we can use it when querying: + +```{r commonmark-namespace-md} +tinkr::md_ns() +xml2::xml_find_all(xml, ".//md:paragraph", ns = tinkr::md_ns()) +``` + +It's also important to remember that _all nodes_ will require this namespace +prefix, so if we wanted to only select paragraphs that were inside of a list, +we would need to use `.//md:list//md:paragraph`: + +```{r commonmark-list-paragraph-select} +xml2::xml_find_all(xml, ".//md:list//md:paragraph", ns = tinkr::md_ns()) +``` + +### Pegboard namespace + +One of the reasons why we created pegboard was to handle markdown content that +also included [fenced divs](https://pandoc.org/MANUAL.html#divs-and-spans), but +we needed a way to programmatically label and extract them without affecting the +stylesheet that is used to translate the XML back to Markdown (not covered in +this tutorial). To achieve this we place nodes under a different namespace +around the fences and define our own namespace. 
+ +Here's an example: + +```markdown +This is markdown with fenced divs + +::: discussion + +This is a discussion + +::: + +::: spoiler + +This is a spoiler that is hidden by default + +::: +``` + +When it's parsed by commonmark, the fenced divs are treated as paragraphs: + +```{r show-fenced-divs-paragraph} +md <- 'This is markdown with fenced divs + +::: discussion + +This is a discussion + +::: + +::: spoiler + +This is a spoiler that is hidden by default + +::: +' +fences <- xml2::read_xml(commonmark::markdown_xml(md)) +fences +``` + +In {pegboard}, we have an internal function called `label_div_tags()` that will +allow us to label and parse these tags without affecting the markdown document: + +```{r label-divs} +pb <- asNamespace("pegboard") +pb$label_div_tags(fences) +fences +``` + +Note that we have defined `<dtag>` XML nodes under the pegboard +namespace. These sandwich the nodes that we want to query and allow us to use +`tinkr::find_between()` to search for specific tags: + +```{r find-between} +ns <- pb$get_ns() +ns # both md and pegboard namespaces +tinkr::find_between(fences, ns = ns, pattern = "pb:dtag[@label='div-1-discussion']") +``` + +This is automated in the `get_divs()` internal function: + +```{r get-divs} +pb$get_divs(fences) +``` + +## Conclusion + +This is but a short introduction to using XML with {pegboard}. You now have the +basics of what the structure of XML is, how to use XPath (with further resources), +how to use XPath with namespaces, and how we use namespaces in {pegboard} to +allow us to parse specific items. It is a good idea to practice working with +XPath because it is useful not only for working with XML representations of +markdown documents, but it is also heavily used for post-processing of HTML in +both {pkgdown} and the {sandpaper} packages. +