Commit

import wiki documents to SDG
Signed-off-by: Costa Shulyupin <[email protected]>
makelinux committed Jul 18, 2024
1 parent 8fde6f7 commit d3190f7
Showing 1 changed file with 38 additions and 0 deletions.
38 changes: 38 additions & 0 deletions docs/wiki-doc-source.md
@@ -0,0 +1,38 @@
# Wiki document source

Fetching information from wikis is an essential
feature for fine-tuning LLMs on public knowledge.

## Interfaces

qna.yaml file, `document` section:

- Wiki Host: The base URL of a wiki host.
- Page titles: The titles of the Wiki pages to fetch.
- oldid: Revision IDs that pin each page to a specific version.

The qna.yaml file can define a single host and multiple spaces and pages,
each with an optional version.
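
As a rough illustration of the shape described above (one host, several pages, each with an optional version), the data could be modeled as follows. The class and field names (`WikiDocumentSource`, `WikiPage`, `host`, `pages`, `title`, `oldid`) are assumptions made for this sketch and are not taken from the schema module.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WikiPage:
    """One wiki page to fetch (hypothetical model, not the real schema)."""
    title: str
    oldid: Optional[int] = None  # optional revision ID; None means "not pinned"


@dataclass
class WikiDocumentSource:
    """A single wiki host plus the pages to fetch from it (hypothetical)."""
    host: str
    pages: List[WikiPage] = field(default_factory=list)


# Example: one host, one pinned page and one unpinned page.
source = WikiDocumentSource(
    host="https://en.wikipedia.org",
    pages=[
        WikiPage(title="IBM_Granite", oldid=1235007056),
        WikiPage(title="IBM"),
    ],
)
```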

Example of fetch URL:

- https://en.wikipedia.org/w/index.php?title=IBM_Granite&oldid=1235007056&action=raw

Note that oldid is sufficient to retrieve a page:

- https://en.wikipedia.org/w/index.php?oldid=1235007056&action=raw

The page title is used for validation.
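
The two example URLs can be reproduced with a small helper built on `urllib.parse`. This is a minimal sketch: the function name `build_raw_url` is hypothetical, and the `/w/index.php` path is assumed from the Wikipedia examples above, so other wiki hosts may use a different script path.

```python
from typing import Optional
from urllib.parse import urlencode


def build_raw_url(host: str, oldid: int, title: Optional[str] = None) -> str:
    """Build a raw-content URL for one page revision (hypothetical helper)."""
    params = {}
    if title is not None:
        # The title is optional because oldid alone identifies the revision.
        params["title"] = title
    params["oldid"] = oldid
    params["action"] = "raw"
    # Assumption: the wiki serves index.php under /w/, as en.wikipedia.org does.
    return f"{host.rstrip('/')}/w/index.php?{urlencode(params)}"


url_with_title = build_raw_url("https://en.wikipedia.org", 1235007056, "IBM_Granite")
url_oldid_only = build_raw_url("https://en.wikipedia.org", 1235007056)
```

Keeping the title optional mirrors the note above: the revision ID selects the content, while the title remains available for a separate validation step.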

## Changes across modules

- [Schema module](https://github.com/instructlab/schema) defines the structure and validation rules for
the qna.yaml file.
- [SDG taxonomy module](https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py)
fetches documents.
- [SDG unit tests](https://github.com/instructlab/sdg/tree/main/tests)

## Additional External Packages

- urllib
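
Since `urllib` is part of the Python standard library, a fetch could look roughly like the sketch below. The function name `fetch_raw`, the `opener` parameter (added here only so the network call can be replaced in tests), the timeout, and the User-Agent string are all assumptions for illustration, not the SDG implementation.

```python
import urllib.request
from typing import Callable


def fetch_raw(
    url: str,
    opener: Callable = urllib.request.urlopen,
    timeout: float = 30.0,
) -> str:
    """Download a raw wiki page and return it as text (hypothetical helper)."""
    # Some wiki hosts reject requests without a descriptive User-Agent;
    # the value below is a placeholder.
    request = urllib.request.Request(
        url, headers={"User-Agent": "wiki-doc-source-sketch/0.1"}
    )
    with opener(request, timeout=timeout) as response:
        # Assumption: the raw page content is UTF-8 encoded.
        return response.read().decode("utf-8")
```

A call such as `fetch_raw("https://en.wikipedia.org/w/index.php?oldid=1235007056&action=raw")` would return the wikitext of that revision, provided the host is reachable.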
