From 3d955b69ce4e8d83a829b541ebb3f9a33b2a4498 Mon Sep 17 00:00:00 2001
From: Costa Shulyupin
Date: Fri, 24 May 2024 22:22:30 +0300
Subject: [PATCH] Import confluence documents to SDG

Signed-off-by: Costa Shulyupin
---
 docs/sdg/confluence-doc-source.md | 42 +++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)
 create mode 100644 docs/sdg/confluence-doc-source.md

diff --git a/docs/sdg/confluence-doc-source.md b/docs/sdg/confluence-doc-source.md
new file mode 100644
index 00000000..3665fb84
--- /dev/null
+++ b/docs/sdg/confluence-doc-source.md
@@ -0,0 +1,42 @@
+# Confluence document source
+
+Importing information from Confluence is crucial for fine-tuning models on internal documentation.
+Many companies use Confluence to store their internal documents.
+Fine-tuned models can be employed within these companies and shared externally without compromising the internal documentation itself.
+Therefore, importing information from Confluence benefits both companies and the broader community.
+
+## Interfaces
+
+qna.yaml file, `document` section:
+
+- Confluence Host: The base URL of the Confluence instance.
+- Space: The Confluence space key where the documents reside.
+- Page titles: The titles of the Confluence pages to fetch.
+- Version: The version of the Confluence page.
+
+The qna.yaml file can define a single host and multiple spaces and pages,
+each with an optional version.
+
+Confluence credentials in config.yaml:
+- Username
+- [Token](https://support.atlassian.com/atlassian-account/docs/manage-api-tokens-for-your-atlassian-account/)
+
+## Changes across modules
+
+- [Configuration module](https://github.com/instructlab/instructlab/blob/main/src/instructlab/config.py)
+  defines the structure and validation rules for
+  the config.yaml file.
+- [Schema module](https://github.com/instructlab/schema) defines the structure and validation rules for
+  the qna.yaml file.
+- [sdg utilities module](https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py)
+  fetches documents
+- [unit test](https://github.com/instructlab/instructlab/tree/main/tests)
+
+## Additional External Packages
+
+The implementation relies on the following external packages:
+
+- [atlassian-python-api](https://atlassian-python-api.readthedocs.io/) –
+  A Python library to interact with Atlassian products, including Confluence.
+- [markdownify](https://pypi.org/project/markdownify/) –
+  A library to convert HTML content to Markdown for processing Confluence page content.
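
The fetch-and-convert step described in the document above could look roughly like the sketch below. It is an illustration under stated assumptions, not the code this patch proposes: the names `PageClient`, `make_client`, `html_to_markdown`, and `fetch_pages` are hypothetical, the optional page version is ignored, and the only library calls it leans on are `Confluence.get_page_by_title` from atlassian-python-api and `markdownify.markdownify`. Both third-party imports are deferred so the core loop can be exercised without them.

```python
from typing import Callable, Dict, Iterable, Optional, Protocol


class PageClient(Protocol):
    """Hypothetical protocol: the one client method this sketch needs."""

    def get_page_by_title(self, space: str, title: str, expand: Optional[str] = None):
        ...


def make_client(host: str, username: str, token: str) -> PageClient:
    """Build a Confluence client from the host and the config.yaml credentials."""
    # Imported lazily; provided by the atlassian-python-api package.
    from atlassian import Confluence

    # The API token is passed in place of a password.
    return Confluence(url=host, username=username, password=token)


def html_to_markdown(html: str) -> str:
    """Convert Confluence storage-format HTML to Markdown."""
    # Imported lazily; provided by the markdownify package.
    from markdownify import markdownify

    return markdownify(html)


def fetch_pages(
    client: PageClient,
    space: str,
    titles: Iterable[str],
    convert: Callable[[str], str] = html_to_markdown,
) -> Dict[str, str]:
    """Return a mapping of page title to converted page body for one space."""
    documents: Dict[str, str] = {}
    for title in titles:
        # expand="body.storage" asks Confluence to include the page body.
        page = client.get_page_by_title(space, title, expand="body.storage")
        if not page:
            raise LookupError(f"page {title!r} not found in space {space!r}")
        documents[title] = convert(page["body"]["storage"]["value"])
    return documents
```

With real credentials, a call might look like `fetch_pages(make_client(host, username, token), "DOCS", ["Home"])`, where the host, space key, and page title stand in for the values read from qna.yaml. Passing the client and the converter as parameters is a design choice made here so that unit tests can substitute fakes for the network call.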