Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fetch files recursively #10

Closed
wants to merge 3 commits into from
Closed

Conversation

stevethomas
Copy link

@stevethomas stevethomas commented Feb 14, 2024

I'm not sure if this was a deliberate decision, but I think it makes sense for files to be retrieved recursively, because in many scenarios the files in the source code are split across lots of directories and specifying all of them is hard work and requires a lot more upkeep.

If you didn't want this to be the default, another option could be to have a recursive config option which defaults to false.

Unrelated but the other thing I noticed immediately is that the OpenAI API seems to regularly flake out with "Service Unavailable" exceptions. Not sure if I just got unlucky with my timing or if this is normal; if it is normal perhaps a retry() could help to avoid having to restart a large indexing operation from scratch (assuming one of retries is successful)

Nevermind, I the OpenAI API was having issues yesterday

@zbora23
Copy link
Contributor

zbora23 commented Feb 16, 2024

@stevethomas just a consideration, could we perhaps change it so that it does recursive only when wildcard is passed

Something along the lines of

  • app/Http will NOT do any recursive lookup and only index files within that directory
  • app/Http/* will look up recursively

What do you think? It does give more granular control.

@joshembling
Copy link
Owner

joshembling commented Feb 17, 2024

Hi @stevethomas

This is an intentional feature. I wanted to give full control to the user on what specific directories and files to index, mainly to eliminate "clumsy" indexing/genuine mistakes from a developer that could inadvertently expose any or all of the following:

  • Sensitive data
  • Protected business logic
  • API keys, other private tokens, data etc.

Every index is shared with two third parties. Obviously, it is the user's responsibility to ensure nothing unintentional is shared to OpenAI/Pinecone and not mine. However, if such details are still shared, even with explicit files and directories passed as is the current logic, then that is most certainly the fault of the user.

If you use Laragenie on many large projects, you would always want to mitigate as many unintentional indexes as possible. On some level I agree, adding lots of directories repetitively to the array in the config can be slightly tedious, but it gives you and other developers a lot more clarity on what is shared.

I am open to some sort of recursion, but for now I will close this PR as it conflicts with the intended implementation.

You are welcome to start a new PR if you have figured out a reasonable alternative.

Thanks,
Josh

@stevethomas
Copy link
Author

@stevethomas just a consideration, could we perhaps change it so that it does recursive only when wildcard is passed

Something along the lines of

  • app/Http will NOT do any recursive lookup and only index files within that directory
  • app/Http/* will look up recursively

What do you think? It does give more granular control.

Yes I agree the wildcard communicates the recursiveness more clearly.

@stevethomas
Copy link
Author

Hi @stevethomas

This is an intentional feature. I wanted to give full control to the user on what specific directories and files to index, mainly to eliminate "clumsy" indexing/genuine mistakes from a developer that could inadvertently expose any or all of the following:

  • Sensitive data
  • Protected business logic
  • API keys, other private tokens, data etc.

Every index is shared with two third parties. Obviously, it is the user's responsibility to ensure nothing unintentional is shared to OpenAI/Pinecone and not mine. However, if such details are still shared, even with explicit files and directories passed as is the current logic, then that is most certainly the fault of the user.

If you use Laragenie on many large projects, you would always want to mitigate as many unintentional indexes as possible. On some level I agree, adding lots of directories repetitively to the array in the config can be slightly tedious, but it gives you and other developers a lot more clarity on what is shared.

I am open to some sort of recursion, but for now I will close this PR as it conflicts with the intended implementation.

You are welcome to start a new PR if you have figured out a reasonable alternative.

Thanks, Josh

I agree in principle that developers should definitely be thinking about any sensitive code they may be sharing with another 2 external parties, on the other hand it would be poorly written code that was hardcoding secrets anywhere other than .env in a Laravel context, and obviously we should not index that.

Would you accept the app/Http/* format suggestion from @zbora23 for recursion?

@joshembling
Copy link
Owner

Hi @stevethomas
This is an intentional feature. I wanted to give full control to the user on what specific directories and files to index, mainly to eliminate "clumsy" indexing/genuine mistakes from a developer that could inadvertently expose any or all of the following:

  • Sensitive data
  • Protected business logic
  • API keys, other private tokens, data etc.

Every index is shared with two third parties. Obviously, it is the user's responsibility to ensure nothing unintentional is shared to OpenAI/Pinecone and not mine. However, if such details are still shared, even with explicit files and directories passed as is the current logic, then that is most certainly the fault of the user.
If you use Laragenie on many large projects, you would always want to mitigate as many unintentional indexes as possible. On some level I agree, adding lots of directories repetitively to the array in the config can be slightly tedious, but it gives you and other developers a lot more clarity on what is shared.
I am open to some sort of recursion, but for now I will close this PR as it conflicts with the intended implementation.
You are welcome to start a new PR if you have figured out a reasonable alternative.
Thanks, Josh

I agree in principle that developers should definitely be thinking about any sensitive code they may be sharing with another 2 external parties, on the other hand it would be poorly written code that was hardcoding secrets anywhere other than .env in a Laravel context, and obviously we should not index that.

Would you accept the app/Http/* format suggestion from @zbora23 for recursion?

Yes I agree it would be poorly written, but also wanted to mitigate as much as possible in the first run of this package.

Open to @zbora23's suggestion and happy for either of you to work out a solution that has no negative effect on the current functionality.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants