MarkDown markup visible on documentation search page #3575

Closed
jmsmkn opened this issue Oct 7, 2024 · 6 comments · Fixed by #3755

Comments

@jmsmkn
Member

jmsmkn commented Oct 7, 2024

This should only show the content, not the markup:

Screenshot 2024-10-04 at 20 22 33
@ammar257ammar
Contributor

ammar257ammar commented Dec 13, 2024

I was looking into this issue.
The headline shown with each entry in the search results is obtained using the function SearchHeadline():

headline = SearchHeadline("content", query)

This function runs in the database, where it calls the PostgreSQL ts_headline function.

The first argument is either the column name or a database expression applied to the column (using F() expressions).

Therefore, SearchHeadline() extracts the relevant fragment directly from the stored text, which is in Markdown format. Because the fragment is cut without regard to the markup, the result is truncated, malformed Markdown.

In this case, the markdown() function cannot render the fragment correctly.
Example:

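    from bs4 import BeautifulSoup
    # Assumption: Python-Markdown stands in for the project's render_markdown helper
    from markdown import markdown as render_markdown

    # A fragment like the ones SearchHeadline() returns: cut mid-markup, so "**" and "[...](" are unbalanced
    md = ("bold**: "
          "more text more text more text. here is the link (defined by [woooow](https://www.google.com")

    html = render_markdown(md or "")
    soup = BeautifulSoup(html, features="html.parser")
    print(soup.get_text())

    # Output: bold**: more text more text more text. here is the link (defined by [woooow](https://www.google.com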

Ideally, we want the content to be in plain text (e.g. by rendering the Markdown to HTML and then extracting the text using bs4) before getting the headline from it, but as I see now that is not possible with the function SearchHeadline().

I can think of these solutions currently:

1- Remove the Markdown characters from the text (with regex). That will require handling quite a few cases to make sure the result is clean (images, links, quotes, code blocks, typography, etc.).
2- Render the content (in Python), get the plain text and use the first n characters as the headline. This will not highlight text relevant to the query; it is just a snippet from the documentation page.
3- Render and extract the text from the content (in Python), then do the search with regex and extract the match (first occurrence) with surrounding text (e.g. +/- 30 characters) and use that as the headline. This solution does not use SearchHeadline().
4- Create a model field plain_text that holds a clean copy of the text for the headline search (and maybe other applications). That requires changes to the model and data migrations, and introduces redundancy in the data. I believe this one is overkill.
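
Option 3 could look roughly like the sketch below. It is only an illustration under assumptions: it takes already-rendered HTML as input, uses the standard-library HTMLParser as a stand-in for bs4, and the names html_to_text, snippet and the sample HTML are hypothetical, not code from the project.

```python
import re
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent blocks do not run together
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6",
              "pre", "blockquote", "tr", "br"}


class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html):
    """Return the plain text of `html` with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


def snippet(text, query, context=30):
    """Return the first case-insensitive match of `query` plus up to
    `context` characters on each side, or None if there is no match."""
    match = re.search(re.escape(query), text, flags=re.IGNORECASE)
    if match is None:
        return None
    start = max(match.start() - context, 0)
    end = min(match.end() + context, len(text))
    return text[start:end]


# Hypothetical rendered documentation page
html = ("<h1>Uploading data</h1>"
        "<p>Use the <strong>archive</strong> page to upload new images.</p>")
text = html_to_text(html)
print(snippet(text, "archive", context=10))
```

Because the search runs in Python, this would have to be applied per result after the database query, and highlighting the match would be a separate step.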

@amickan
Contributor

amickan commented Dec 13, 2024

but as I see now that is not possible with the function SearchHeadline().

Why is this not possible? I would have thought that you can annotate the queryset with a new plain_text field and then pass that to SearchHeadline?

@ammar257ammar
Contributor

ammar257ammar commented Dec 13, 2024

Why is this not possible? I would have thought that you can annotate the queryset with a new plain_text field and then pass that to SearchHeadline?

According to the documentation, annotations are for database expressions, not Python functions. So I can't apply a Python function to a column in an annotation. I tried that as well.

Regarding SearchHeadline(), the first argument is either the column name or a database expression applied to the column.

It does not take a query string as input, so it can't act on an annotated column in a query string:
https://github.com/django/django/blob/stable/5.1.x/django/contrib/postgres/search.py#L276

That's why I mentioned in one suggestion that the plain-text extraction should probably happen in Python after getting the relevant results, with the highlighting done afterwards.

@amickan
Contributor

amickan commented Dec 16, 2024

I'm not sure, but I think you can write your own database expression using Func() (and then maybe the regexp_replace function of PostgreSQL). I'm not sure how performant it would be, or how hard it would be to get the regex right, but I think this should at least technically be possible. You would then annotate the queryset with a clean_content field and pass that to SearchHeadline.
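
To gauge how hard the regex would be, the patterns could first be prototyped in plain Python before porting them to regexp_replace. The sketch below is an assumption-laden illustration: the rule list is hypothetical and far from complete, and PostgreSQL's regex flavour differs from Python's re module, so each pattern would need rechecking in the database.

```python
import re

# (pattern, replacement) pairs applied in order.
# Hypothetical rule set covering only a few Markdown constructs.
MARKDOWN_RULES = [
    (r"!\[([^\]]*)\]\([^)]*\)", r"\1"),  # images: keep the alt text
    (r"\[([^\]]*)\]\([^)]*\)", r"\1"),   # links: keep the link text
    (r"`([^`]*)`", r"\1"),               # inline code: drop the backticks
    (r"\*\*(.+?)\*\*", r"\1"),           # bold written with asterisks
    (r"\*(.+?)\*", r"\1"),               # italic written with asterisks
    (r"^#{1,6}\s+", ""),                 # ATX heading markers
    (r"^>\s?", ""),                      # block quote markers
]


def strip_markdown(text):
    """Apply each rule in turn; '^' matches at every line start."""
    for pattern, replacement in MARKDOWN_RULES:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text


sample = ("# Title\n"
          "Some **bold** text with a [link](https://example.com) and `code`.")
print(strip_markdown(sample))
```

Underscore emphasis is left out on purpose, since a naive pattern for it would also mangle identifiers such as plain_text. Code blocks, tables and nested markup are not handled either, which hints at how many cases a database-side version would need to cover.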

@ammar257ammar
Contributor

ammar257ammar commented Dec 16, 2024

After discussing the proposed options above with Anne synchronously, and in light of the discussion on #3740, the latest suggestion is to leave the headline out of the search results and only show the title of the doc page (no server-side headline search and highlighting).
We would then implement JS highlighting on the doc page, with the search term passed in the URL of the doc page so it can be used for highlighting. This would solve both this issue and #3740.

Can you @jmsmkn give feedback on this please?

@jmsmkn
Member Author

jmsmkn commented Dec 16, 2024

Yes, just remove the headline search. Do not implement keyword highlighting or try to fix #3740.
