Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Import wiki documents to SDG #120

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
38 changes: 38 additions & 0 deletions docs/sdg/wiki-doc-source.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
# Wiki document source

Fetching information from wikis is an essential
feature for fine-tuning LLMs on public knowledge.

## Interfaces

qna.yaml file, `document` section:

- Wiki Host: The base URL of a wiki host.
- Page titles: The titles of the Wiki pages to fetch.
- oldid: IDs of old releases.

The qna.yaml file can define single host and multiple spaces and pages,
each with an optional version.

Example of fetch URL:

- https://en.wikipedia.org/w/index.php?title=IBM_Granite&oldid=1235007056&action=raw

Note that oldid is sufficient to reterieve a page:

- https://en.wikipedia.org/w/index.php?oldid=1235007056&action=raw

Page title is used for vaidation.

## Changes across modules

- [Schema module](https://github.com/instructlab/schema) defines the structure and validation rules for
the qna.yaml file.
- [SDG taxonomy module](https://github.com/instructlab/sdg/blob/main/src/instructlab/sdg/utils/taxonomy.py)
fetches documents
- [SDG unit tests](https://github.com/instructlab/sdg/tree/main/tests)

## Additional External Packages

- urllib