AIDRD LLM

Quick Start

Prerequisites

  • Docker

Installation

cp .env.example .env
docker-compose up -d

Crawl documents for knowledge base

  • To crawl documents for the knowledge base, prepare a Firecrawl endpoint, either self-hosted or the public one.
    • For self-hosting, please refer to the following links:
    • We confirmed crawling with Firecrawl revision fc08ff450da50eb436d9dfd4a09ac741fd8fbb84; other revisions may not work correctly.
    • After deployment, change the following lines in docker-compose.yml to run the worker service in production mode.
      • This setting is important because the worker service in development mode does not restart automatically when the app crashes.
  worker:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
      - api
#    command: [ "pnpm", "run", "workers" ] # Original setting for development
    command: [ "pnpm", "run", "worker:production" ] # Modified setting for production
    restart: unless-stopped                # Add this line
  • Run the following command to crawl the documents.
    • If your Firecrawl endpoint is not localhost:3002, specify it with the --firecrawl-host option.
    • Note that you should run this command outside the Docker container so that it can reach the local Firecrawl endpoint.
python crawl_knowledges.py <URL_TO_START_CRAWLING> <OUTPUT_FILE_NAME> --max-page-count <MAX_PAGE_COUNT> --max-depth <MAX_DEPTH>
  • For example:
python crawl_knowledges.py "https://www.hokeniryo.metro.tokyo.lg.jp/kenkou/nanbyo/portal/" tokyo.json --max-page-count 1000 --max-depth 5
  • The crawled documents are saved in tokyo.json, and the PDFs are saved in the downloaded_pdfs directory.

If you want to crawl all prefectures at once, you can use the following command:

python crawl_all_prefectures.py
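
Before starting a long crawl, it can help to confirm that the Firecrawl endpoint accepts connections. The following is a minimal sketch, not part of this repository: the helper name `is_endpoint_reachable` is hypothetical, and it only checks that a TCP connection can be opened to the default `localhost:3002`. It does not verify that Firecrawl itself is healthy.

```python
import socket


def is_endpoint_reachable(host: str = "localhost", port: int = 3002,
                          timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened.

    Hypothetical helper: it only confirms that something is listening on
    the port, not that the Firecrawl API behind it is working.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution errors.
        return False
```

If this returns `False` for your endpoint, check that the Firecrawl containers are running and that you are running the check outside the Docker container, as noted above.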

Upload knowledge base to Dify

  • To upload the knowledge base to Dify, update the .env file with the required values:

KNOWLEDGE_API_KEY=<DIFY_KNOWLEDGE_API_KEY>
API_BASE_URL=<DIFY_BASE_URL> # e.g. http://aidrd.japaneast.cloudapp.azure.com/v1

  • Run the following command to upload the knowledge base to Dify.
    • Note that you should run this command from the root directory of this project because the JSON file contains relative paths to the PDFs.

python upload_knowledge.py <CRAWLED_KNOWLEDGE_FILE> <KNOWLEDGE_BASE_NAME>

  • For example:

python upload_knowledge.py tokyo.json tokyo-knowledges

  • If you want to upload all prefectures at once, you can use the following command:
python upload_all_prefectures.py
  • After the upload, execute the following SQL query to remove the suffix .added_on_upload.html from the document names.
  • This step is necessary because Dify does not accept URLs without a file extension, such as 'http://example.com/?id=123', so the suffix ".added_on_upload.html" is added to those document names during upload.
UPDATE documents
SET name = LEFT(name, LENGTH(name) - LENGTH('.added_on_upload.html'))
WHERE name LIKE '%.added_on_upload.html';
  • In Dify 0.6.15, a newly created knowledge base has only_me visibility by default, so it is visible only to the owner of the Dify workspace.
  • If you cannot see the uploaded knowledge base, execute the following SQL query to change its visibility.
UPDATE datasets SET permission = 'all_team_members' WHERE name = '<KNOWLEDGE_BASE_NAME>';
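
To make the effect of the suffix-removal query above concrete, here is a small Python sketch of the same string transformation. It is illustrative only: the function name `strip_upload_suffix` is hypothetical, and the real cleanup happens in the SQL query, not in this code.

```python
UPLOAD_SUFFIX = ".added_on_upload.html"


def strip_upload_suffix(name: str) -> str:
    """Mirror the SQL UPDATE: drop the trailing upload suffix if present."""
    if name.endswith(UPLOAD_SUFFIX):
        # Same as LEFT(name, LENGTH(name) - LENGTH(suffix)) in the query.
        return name[: len(name) - len(UPLOAD_SUFFIX)]
    # Names without the suffix are left untouched, like rows the WHERE skips.
    return name
```

For example, `strip_upload_suffix("http://example.com/?id=123.added_on_upload.html")` gives back `http://example.com/?id=123`, while a name such as `guide.pdf` is unchanged. One small difference: in SQL `LIKE`, `_` matches any single character, so the query's pattern is marginally broader than the exact match used in this sketch.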

Evaluation

  • To evaluate the accuracy of your Dify chatbot, update the .env file with the required values:
API_BASE_URL=<YOUR_DIFY_BASE_URL>
CHATBOT_API_KEY=<YOUR_CHATBOT_API_KEY>
DIFY_USER=<DIFY_USER_NAME> # Just for logging purposes. Can be anything
AZURE_DEPLOYMENT_ID=<YOUR_AZURE_DEPLOYMENT_ID> # The deployment id of your LLM model. This model evaluates the chatbot responses
  • Restart the container
docker-compose restart
  • Run the evaluation script
    • The evaluation data should be located in evaluation_data.json
docker-compose exec app python evaluation.py
  • The evaluation results will be located in evaluation_results_<timestamp>.json
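
Because each run writes a new evaluation_results_<timestamp>.json file, a short helper can pick out the most recent one. This is a sketch under assumptions: the function name `latest_results_file` is hypothetical, the timestamp format is not documented here so the files are ordered by modification time and not by name, and the results directory is assumed to be readable from wherever you run this.

```python
from pathlib import Path
from typing import Optional


def latest_results_file(directory: str = ".") -> Optional[Path]:
    """Return the most recently modified evaluation results file, if any."""
    candidates = list(Path(directory).glob("evaluation_results_*.json"))
    if not candidates:
        return None
    # Order by modification time because the timestamp format in the
    # file name is an unknown here.
    return max(candidates, key=lambda path: path.stat().st_mtime)
```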

Known problems and solutions

  • Firecrawl sometimes crashes, which stops the crawl. In this case, restart the Firecrawl containers and run the script again.
  • Weaviate in Dify sometimes refuses the connection. In this case, restart the Weaviate container and run the script again.
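
The "restart and run again" workaround can be wrapped in a small retry loop. The sketch below is not part of this repository and rests on assumptions: `run_with_retries` and its parameters are hypothetical, it assumes the script exits with a non-zero status when the crawl or upload fails, and it does not know whether a rerun resumes or starts over. The optional `before_retry` hook is where a container restart could be triggered.

```python
import subprocess
import time
from typing import Callable, Optional, Sequence


def run_with_retries(cmd: Sequence[str], attempts: int = 3,
                     delay_seconds: float = 30.0,
                     before_retry: Optional[Callable[[], None]] = None) -> bool:
    """Run cmd until it exits with status 0 or attempts are exhausted."""
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return True
        if attempt < attempts:
            if before_retry is not None:
                # Assumed hook, e.g. restart the failing containers here.
                before_retry()
            time.sleep(delay_seconds)
    return False


# Example call, using the placeholders from the crawl command in this README:
# run_with_retries(["python", "crawl_knowledges.py",
#                   "<URL_TO_START_CRAWLING>", "<OUTPUT_FILE_NAME>"])
```

The delay gives restarted containers time to come back up before the next attempt. Adjust it to how long your services take to become ready.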