Conduct research on given URLs without forgetting and add more research #734
Hi!
I am making this pull request to fix the issue with source_urls being reset inside the conduct_research() function, which causes GPTR to forget the user-input URLs.
Additionally, I am introducing a new boolean parameter to the GPTResearcher class called add_additional_sources. This parameter allows GPTR to gather more context from a default web search in addition to the user-input URLs, thereby increasing the overall scope of research for the query or sub-query.
HOW: If set, I scrape from both the user-input URLs and the default web search function. This way, GPTR researches both the user-input sources and the sources it finds on its own. If unset, we scrape the user-input URLs alone and build the answer from the gathered context. If the query is unrelated to the URLs' contents, we log a message so the user knows the answer is generated from the model's training data and not through 'research'.
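The behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: select_urls and fake_search are hypothetical names, and the real GPTR retriever and scraper are replaced by a stub that returns fixed URLs.

```python
from typing import Callable, Iterable, List


def select_urls(
    source_urls: Iterable[str],
    add_additional_sources: bool,
    web_search: Callable[[str], List[str]],
    query: str,
) -> List[str]:
    """Hypothetical helper: decide which URLs to scrape for a query."""
    # Copy (and de-duplicate, keeping order) so the caller's list is
    # never reset or mutated between calls.
    urls = list(dict.fromkeys(source_urls))
    if add_additional_sources:
        # Extend the user-provided URLs with whatever the search returns.
        for url in web_search(query):
            if url not in urls:
                urls.append(url)
    return urls


def fake_search(query: str) -> List[str]:
    # Stand-in for a real web search (assumption for this sketch only).
    return ["https://example.com/a", "https://example.org/found"]


user_urls = ["https://example.com/a", "https://example.com/b"]
only_user = select_urls(user_urls, False, fake_search, "some query")
combined = select_urls(user_urls, True, fake_search, "some query")
```

With the flag unset, only the user-provided URLs are returned; with it set, the search results are appended after them, and the original list is left untouched in both cases.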
WHY: The intent of providing source_urls is to scrape the user-provided webpages. Since conduct_research was forgetting the URLs, we were unable to scrape them. With this fix, the webpages can be scraped. However, the user may not anticipate every case: a query could be unrelated to the hardcoded source_urls, causing GPTR to generate answers from its own knowledge rather than from new research. To address this, I introduce the new parameter add_additional_sources, which allows GPTR to scrape the user-provided sources and also conduct web searches, thereby increasing the context to answer from. This way, if the sources do not match the query, GPTR can still perform authentic research through the default web search, instead of generating the answer from the model's pre-trained weights as before. This feature is also useful when the user wants research done not only from the hardcoded URLs they provide, but also from other related sources on the internet, which are infeasible for the user to add manually every time but which GPTR can find easily.
Other functions remain the same. I have also cleaned up parts of the code and comments relevant to the new modifications.
Thanks,
Makesh Srinivasan