Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix navbar behaviour when <body> isn't on its own line; add tests for this #10

Closed
wants to merge 2 commits into from

Conversation

penelopeysm
Copy link
Member

The current navbar script searches for lines with <body> and inserts the navbar after those lines. This is great when <body> is on its own line, but fails when there's stuff that follows it on the same line (TuringLang/TuringBenchmarking.jl#39).

This PR also adds some tests to make sure that it behaves correctly when <body> is preceded / followed by other HTML on the same line.

@penelopeysm penelopeysm force-pushed the py/fix-body-tag branch 3 times, most recently from fb02664 to 54205c0 Compare February 1, 2025 18:10
@penelopeysm
Copy link
Member Author

Honestly I couldn't see myself doing this a year ago, but gpt actually wrote almost this whole PR for me. 🙃
I guess I still had to know what to ask it to do, but gosh is it a lifesaver not having to look up the syntax for github workflows again.

Comment on lines +86 to +87
/<body[^>]*>/ {
sub(/(<body[^>]*>)/, "&\n" nav);
Copy link
Member Author

@penelopeysm penelopeysm Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new regex is more flexible in that it would picks up cases where the body tag has some attribute <body class="myclass">. I suppose it's still technically sort of fragile, in the sense that if the attribute contains > it will break. Happy to go either way on this although I think the new regex is a bit better in that the failure case is much less likely than the failure case for the old one.

Comment on lines +33 to +34
for file in ins_html/*; do
if diff <(tr -d '[:space:]' < "$file") <(tr -d '[:space:]' < $OUT); then
Copy link
Member Author

@penelopeysm penelopeysm Feb 1, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also not technically correct, because HTML is not totally whitespace-insignificant - for example tag names and attributes must be separated - but I suppose it's not really possible to test this without using a proper HTML parser, and for the limited test examples I reckon this should be fine.

@shravanngoswamii
Copy link
Member

shravanngoswamii commented Feb 1, 2025

I don't think it will still work for us, I don't know why but our current script works fine locally when I run it on my WSL Ubuntu, but it breaks in GHA for many cases sometimes like the current scenario in DynamicPPL and TuringBenchmarking.jl. Just asking for suggestion, can we try something other than shell script? Because it's giving different results in different environments...

Documenter generated HTML usually gets wrapped up in very few lines and our script breaks when <body> tag have some particular thing after it, I don't know what's that!

This PR also adds some tests to make sure that it behaves correctly when <body> is preceded / followed by other HTML on the same line.

Our script works for this as of now as far as I know from my old tests, but it breaks for particular unknown edge cases!

See, it works for TuringGLM which had content on same line as tag, actually whole page was in single line I guess, Documenter do this for efficiency maybe: https://github.com/TuringLang/TuringGLM.jl/actions/runs/13092370899 (I just ran it):

And it still works for old DynamicPPL versions see:

Honestly I couldn't see myself doing this a year ago, but gpt actually wrote almost this whole PR for me. 🙃

Which model did you used btw, just for curiosity😅?

@penelopeysm
Copy link
Member Author

Oh, that's so weird. :( Maybe it's an upgrade on ubuntu versions? actions/runner-images#10636 which might have come with different versions of bash/awk.

I'm happy to use something else more proper. For me, this PR was fairly easy to put together so I thought it could be a decent stopgap. But it's definitely not the 'correct' solution!

Gumbo.jl seems like a sensible option. I'm not convinced that Julia's the prettiest language to do this sort of ugly, html-wrangling stuff, but I guess it means that other people will be able to edit it. Did you want to work on it? Let me know either way:)

Which model

4o haha. The main reason why I used it was because I spent one day 4 years ago learning about awk but have since forgotten all of it 😄 I did have to prompt it quite carefully by telling it exactly what the problem was (with stuff happening after the tag). I don't think the GHA workflow it generated was perfect, I did actually edit quite a few things like the working-directory and the call to chmod, but it took care of all the boilerplate quite well which meant that I could focus on the more fun bits.

@shravanngoswamii
Copy link
Member

4o haha.

4o usually don't give required code in one go😅, o1 and Deepseek-R1 are cool btw!

Did you want to work on it? Let me know either way:)

Sure, I will try some different approaches on this tomorrow, will come up with a solution for our best case, maybe GUMBO or else something Node.js based or maybe stay with bash! It's 3:30 AM, right now🦉!

@penelopeysm penelopeysm marked this pull request as draft February 3, 2025 15:20
@shravanngoswamii
Copy link
Member

shravanngoswamii commented Feb 5, 2025

@penelopeysm I've tried several approaches for this, including Gumbo.jl, [pup](https://github.com/ericchiang/pup), Cheerio (Node.js), BeautifulSoup, and others. However, I felt it wasn't ideal to add unnecessary dependencies for something that can be done without them. Thats why I did it with bash at first, but it keeps breaking!

Finally, I've decided to implement this in Julia, with only HTTP for fetching URLs and regex for manipulation, without needing Gumbo. I’ve tested this locally across different scenarios, and it worked well for me.

https://github.com/shravanngoswamii/DynamicPPL.jl/actions/runs/13159797886/job/36725344673
https://shravangoswami.com/DynamicPPL.jl/dev/

If you have any alternative suggestions or approaches, please let me know. Here’s the script for your review—once you’ve had a look, we can commit it after cleaning a bit and add some comments if everything looks good.

using HTTP

function read_file(filename::String)
    open(filename, "r") do io
        read(io, String)
    end
end

function write_file(filename::String, contents::String)
    open(filename, "w") do io
        write(io, contents)
    end
end

function should_exclude(filename::String, patterns::Vector{String})
    for pat in patterns
        if occursin(pat, filename)
            return true
        end
    end
    return false
end

function remove_existing_navbar(html::String)
    start_marker = "<!-- NAVBAR START -->"
    end_marker   = "<!-- NAVBAR END -->"
    while occursin(start_marker, html) && occursin(end_marker, html)
        start_idx_range = findfirst(start_marker, html)
        end_idx_range   = findfirst(end_marker, html)
        start_idx = first(start_idx_range)
        end_idx = first(end_idx_range)
        prefix = start_idx > 1 ? html[1:start_idx-1] : ""
        suffix = lstrip(html[end_idx + length(end_marker) : end])
        html = string(prefix, suffix)
    end
    return html
end

function wrap_navbar(navbar_html::String)
    if !occursin("NAVBAR START", navbar_html) || !occursin("NAVBAR END", navbar_html)
        return "<!-- NAVBAR START -->\n" * navbar_html * "\n<!-- NAVBAR END -->"
    else
        return navbar_html
    end
end

function insert_navbar(html::String, navbar_html::String)
    html = remove_existing_navbar(html)
    m = match(r"(?i)(<body[^>]*>)", html)
    if m === nothing
        println("Warning: Could not find <body> tag in the file; skipping insertion.")
        return html
    end
    prefix = m.match
    inserted = string(prefix, "\n", navbar_html, "\n")
    html = replace(html, prefix => inserted; count = 1)
    return html
end

function process_file(filename::String, navbar_html::String)
    println("Processing: $filename")
    html = read_file(filename)
    html_new = insert_navbar(html, navbar_html)
    if html_new == html
        println("Skipped: No <body> tag found in $filename")
    else
        write_file(filename, html_new)
        println("Updated: $filename")
    end
end

function main()
    if length(ARGS) < 2
        println("Usage: julia insert_navbar.jl <html-file-or-directory> <navbar-file-or-url> [--exclude \"pat1,pat2,...\"]")
        return
    end
    target = ARGS[1]
    navbar_source = ARGS[2]
    
    exclude_patterns = String[]
    if length(ARGS)  4 && ARGS[3] == "--exclude"
        exclude_patterns = map(x -> string(strip(x)), split(ARGS[4], ','))
    end

    navbar_html = ""
    if startswith(lowercase(navbar_source), "http")
        resp = HTTP.get(navbar_source)
        if resp.status != 200
            error("Failed to download navbar from $navbar_source")
        end
        navbar_html = String(resp.body)
    else
        navbar_html = read_file(navbar_source)
    end
    navbar_html = string(navbar_html)
    navbar_html = wrap_navbar(navbar_html)

    if isfile(target)
        if !should_exclude(target, exclude_patterns)
            process_file(target, navbar_html)
        else
            println("Skipping excluded file: $target")
        end
    elseif isdir(target)
        for (root, _, files) in walkdir(target)
            for file in files
                if endswith(file, ".html")
                    fullpath = joinpath(root, file)
                    if !should_exclude(fullpath, exclude_patterns)
                        process_file(fullpath, navbar_html)
                    else
                        println("Skipping excluded file: $fullpath")
                    end
                end
            end
        end
    else
        error("Target $target is neither a file nor a directory.")
    end
end

main()

@penelopeysm
Copy link
Member Author

@shravanngoswamii, yes, I'm happy with the overall approach! I do have some code comments. If you open a PR I can review :)

@penelopeysm
Copy link
Member Author

Superseded by #11.

@penelopeysm penelopeysm closed this Feb 5, 2025
@penelopeysm penelopeysm deleted the py/fix-body-tag branch February 5, 2025 21:24
@penelopeysm penelopeysm restored the py/fix-body-tag branch February 5, 2025 21:24
@shravanngoswamii shravanngoswamii deleted the py/fix-body-tag branch February 6, 2025 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants