Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

embed course metadata as contentfile #2050

Open
wants to merge 11 commits into
base: main
Choose a base branch
from

Conversation

shanbady
Copy link
Contributor

@shanbady shanbady commented Feb 14, 2025

What are the relevant tickets?

Fixes (part 1 of) https://github.com/mitodl/hq/issues/6725

Description (What does it do?)

This PR adds embeddings for general course info (what shows up in the resource panel) in the contentfiles collection so that the chat agent can get that info from the contentfile vector endpoint directly.

How it works
Anytime we embed a new resource, we also generate an "about this course" document with all the course info and put that in the contentfiles collection. We can follow this same pattern for whatever else we might need to enrich the chat agent's response

How can this be tested?

  1. checkout this branch
  2. make sure you have some contentfiles locally
  3. find a learning resource id for a resource that has contentfiles
  4. generate embeddings for that resource via python manage.py generate_embeddings --resource-ids <id>
  5. hit the contentfile vector search endpoint - set the resource_readable_id parameter to the readable_id (make the readable_id is url encoded - this has get me a few times).
    http://open.odl.local:8063/api/v0/vector_content_files_search/?limit=10&resource_readable_id=course-v1%3AMITxT%2B14.73x&q=who%20offers%20this%20course?
  6. try seting the "q=" parameter to questions about the course (info that shows up in the resource panel) for example "?q=who is teaching this course?" or "q=how much does it cost? will I earn a certificate?". The results should surface a chunk that starts with "Information about this course" and has a "file_extension" of ".md"

Additional Context

  • The "course info" document won't always be surfaced as the top result - its heavily dependent on other contentfiles for the resource - in lots of cases the same info is found in other chunks. This does however fill the gap where that info is absent in contentfiles.
  • There will be a follow on ticket about pulling in full marketing site content into the chunk collection - related https://github.com/mitodl/hq/issues/6699

@shanbady shanbady added Work in Progress Needs Review An open Pull Request that is ready for review and removed Work in Progress labels Feb 24, 2025
@shanbady shanbady marked this pull request as ready for review February 25, 2025 16:03
@shanbady shanbady added Work in Progress and removed Needs Review An open Pull Request that is ready for review labels Feb 25, 2025
@shanbady shanbady marked this pull request as draft February 25, 2025 18:22
@shanbady shanbady added Needs Review An open Pull Request that is ready for review and removed Work in Progress labels Feb 26, 2025
@shanbady shanbady marked this pull request as ready for review February 26, 2025 19:03
@shanbady
Copy link
Contributor Author

/gemini review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

The pull request introduces a new feature to embed course metadata as content files, which enhances the chat agent's ability to retrieve course information. The implementation involves generating an "about this course" document and storing it in the contentfiles collection. The changes seem well-structured and include necessary modifications to the vector search utilities and tests. However, due to the limit on the number of review comments, I am unable to provide specific feedback on the code. I recommend that the author review the changes carefully and ensure that they meet the requirements of the project.

Summary of Findings

Merge Readiness

The pull request appears to be well-structured and introduces a valuable feature. However, due to the limitations on the number of review comments, I am unable to provide specific feedback on the code. I recommend that the author review the changes carefully and ensure that they meet the requirements of the project. I am unable to approve the pull request in any circumstance, and that users should have others review and approve this code before merging.

@abeglova abeglova self-assigned this Feb 27, 2025
Copy link
Contributor

@abeglova abeglova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Back to you with some comments!


def generate_metadata_document(serialized_resource):
"""
Generate a plaint-text info document to embed in the contentfile collection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be replaced with a serializer that takes a learning resource? Something like ContentFileLearningResourceMetadataSerializer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 That might be a cleaner way to do this. will have a look

resource_vector_point_id = str(vector_point_id(readable_id))
ids.append(resource_vector_point_id)
course_info_document = generate_metadata_document(doc)
metadata.append(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like other than resource_vector_point_id, most of this can be added to generate_metadata_document (or the serializer that i think you should replace it with)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Needs Review An open Pull Request that is ready for review Waiting on author
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants