✨ New Connector Idea: "GitHub Files" Source #22
Comments
Hi @aaronsteers, this is what @siddhant3030 and I were thinking. Please let me know what you think.
It would be great if you could shed light on the questions below.
|
@aaronsteers
|
@Ishankoradia - response inline:
👍
👍
Yes, as you suggest would include content, (relative) file name - and hopefully also Since this is related to LLM use cases, it might make sense to try to match the schema of "unstructured" sources like source-google-drive. If easy to include, some other fields like
👍
TL;DR: This part is your call. I don't feel strongly about it either way. The thought process is that it was probably easier to not have to worry about impacts related to modifying the existing connector. And if created as a separate connector, we could always incorporate back into the GitHub source later on. If adding to the existing GitHub connector, we can't change the auth or other options. Another consideration here is that we may want to go all-in on the unstructured documents paradigm or files-type extractor paradigm. There is some existing prior art here that you could reuse - notably
Yes! Goal here is to make git content available to LLMs.
I think one stream per repo (or else one stream per glob?) is probably the right call.
Sorry for the confusion on this point. Here are some example glob patterns, keeping in mind the LLM use cases.
|
@aaronsteers thanks for getting back super quick. Appreciate all the answers. Summarizing for our understanding
Have two more questions
Could you assign this to me? It's very interesting. |
Yes, if that works for you. 👍 We have a CDK "extra" for
There's a common CDK backend for sources such as Google Drive and S3, which I'll screenshot below for comparison.
Absolutely. You've got first dibs as first person to reply. If above sounds good to you, simply confirm and @marcosmarxm will assign to you. |
Perfect, sounds good to me, @aaronsteers. Thanks for getting back quickly. |
@Ishankoradia - We are very excited also - thanks for taking this on. I've assigned it to you! |
@Ishankoradia if you want to drop this issue, can I work on this one? |
Hey @avirajsingh7, I am still working on this. I'm deep into it and planning to submit a first draft of the PR in the coming week. Thanks |
Hi, @Ishankoradia - Any update or questions on this before end of event? I'll check in tomorrow (Saturday) in case I can help in any way. Thanks! |
Hi @aaronsteers, thanks for checking in. This is the draft PR I am working on. I have a few questions about places where I am stuck or confused.
|
Hi @aaronsteers, I would still like to continue working on this, if that's alright with you all. It would be great if you could answer my questions above. Thanks. |
@Ishankoradia - I'm afraid I do not have clear answers to your question regarding the API key. To my understanding, it should not require an API key, although a key to Unstructured.io was intended as an optional input for high-res image processing. (Although I don't think that feature is live as of today.) In regard to leveraging the CDK versus writing from scratch, the goal with leveraging it is/was to retain the parsing capabilities of the CDK, in order to be able to parse files like jpg, doc, word, pdf, CSV, Excel, etc. Since most files in a git repo are already text-formatted, I don't see that as a hard requirement, although there are (probably?) some benefits to leveraging those paradigms. While the deadline for hackathon contributions has now passed (last day was July 3), I'd be happy to continue working on this if you'd like to see it through to completion. Either way, we thank you for your contribution and for your participation. 🙏 |
Hi @aaronsteers, thanks for getting back. I don't mind the hackathon being over. I would like to continue working on this to completion, and it would be great to have your support while I do. I am going to try to make a push and see if I can pull something from a public repo in the connector, keeping in mind that leveraging the CDK is not a strict requirement. If you have any ideas or suggestions, I am happy to hear them. |
Desired Connector Spec
This new connector would allow users to read file data from a GitHub (or generic git) repo and apply our CDK parser logic (especially the unstructured parser), along with glob patterns, to bring that repo's data and content into an Airbyte data pipeline.
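
To make the spec above more concrete, here is a minimal sketch of the core idea: walk a locally cloned repo, keep the files whose relative path matches any configured glob, and emit one record per file with its relative name and text content. This is not the Airbyte CDK API. The names `FileRecord` and `iter_file_records` are hypothetical, the record shape is an assumption based on the fields discussed in the thread, and matching uses Python's `fnmatch`, where `*` also matches `/` (looser than most glob implementations).

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Iterator, Sequence


@dataclass
class FileRecord:
    # Hypothetical record shape: relative file name plus text content.
    file_name: str
    content: str


def iter_file_records(repo_root: Path, globs: Sequence[str]) -> Iterator[FileRecord]:
    """Yield one record per UTF-8 text file under repo_root matching any glob."""
    for path in sorted(repo_root.rglob("*")):
        if not path.is_file():
            continue
        relative = path.relative_to(repo_root).as_posix()
        # Skip git internals so only tracked-looking content is considered.
        if relative == ".git" or relative.startswith(".git/"):
            continue
        # fnmatch semantics (assumption): "*" matches across "/" as well.
        if not any(fnmatchcase(relative, pattern) for pattern in globs):
            continue
        try:
            content = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            # Not UTF-8 text; a fuller connector would pass this to a file parser.
            continue
        yield FileRecord(file_name=relative, content=content)
```

For example, `list(iter_file_records(Path("path/to/clone"), ["*.md", "docs/*"]))` would collect Markdown files and everything under `docs/`. Whether records land in one stream per repo or one stream per glob is left open here, as it is in the discussion.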