- Docker:

  ```bash
  cp .env.example .env
  docker-compose up -d
  ```
- To crawl documents for the knowledge base, prepare a Firecrawl endpoint by self-hosting or by using the public endpoint.
- For self-hosting, please refer to the following links:
- We confirmed crawling with Firecrawl revision `fc08ff450da50eb436d9dfd4a09ac741fd8fbb84`; other revisions may not work correctly.
- After deployment, please change the following lines in `docker-compose.yml` to run the worker service in production mode.
- This setting is important because the worker service in development mode does not automatically restart when the app crashes due to an error.
  ```yaml
  worker:
    <<: *common-service
    depends_on:
      - redis
      - playwright-service
      - api
    # command: [ "pnpm", "run", "workers" ] # Original setting for development
    command: [ "pnpm", "run", "worker:production" ] # Modified setting for production
    restart: unless-stopped # Add this line
  ```
- Run the following command to crawl the documents.
  - If you use a Firecrawl endpoint other than `localhost:3002`, specify it with the `--firecrawl-host` option.
  - Note that you should execute this command outside the Docker container so that it can reach the local Firecrawl endpoint.
  ```bash
  python crawl_knowledges.py <URL_TO_START_CRAWLING> <OUTPUT_FILE_NAME> --max-page-count <MAX_PAGE_COUNT> --max-depth <MAX_DEPTH>
  ```
- For example:

  ```bash
  python crawl_knowledges.py "https://www.hokeniryo.metro.tokyo.lg.jp/kenkou/nanbyo/portal/" tokyo.json --max-page-count 1000 --max-depth 5
  ```
- The crawled documents will be saved in `tokyo.json`, and the PDFs will be saved in the `downloaded_pdfs` directory.
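- For reference, the crawler's interaction with Firecrawl comes down to HTTP calls like the minimal sketch below. It assumes the self-hosted v0 scrape endpoint at `localhost:3002`; routes and response fields can differ between Firecrawl revisions, which is why the revision above is pinned.

  ```python
  import requests

  FIRECRAWL_HOST = "localhost:3002"  # override with --firecrawl-host in the real script

  def scrape_page(url: str) -> dict:
      """Fetch one page through the self-hosted Firecrawl endpoint.

      Assumes the v0 scrape API; routes may differ in other revisions.
      """
      resp = requests.post(
          f"http://{FIRECRAWL_HOST}/v0/scrape",
          json={"url": url},
          timeout=120,
      )
      resp.raise_for_status()
      # The v0 response wraps the page under "data"; "markdown" holds the content.
      return resp.json()["data"]

  if __name__ == "__main__":
      page = scrape_page("https://www.hokeniryo.metro.tokyo.lg.jp/kenkou/nanbyo/portal/")
      print(page.get("markdown", "")[:500])
  ```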
- If you want to crawl all prefectures at once, you can use the following command:

  ```bash
  python crawl_all_prefectures.py
  ```
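- A script like `crawl_all_prefectures.py` can be as simple as looping over per-prefecture start URLs and invoking the single-prefecture crawler. A rough sketch; the URL map here is hypothetical, so substitute the actual prefecture portal pages:

  ```python
  import subprocess

  # Hypothetical start URLs; replace with the actual prefecture portals.
  PREFECTURE_URLS = {
      "tokyo": "https://www.hokeniryo.metro.tokyo.lg.jp/kenkou/nanbyo/portal/",
  }

  for name, url in PREFECTURE_URLS.items():
      # Delegates to the single-prefecture crawler described above.
      subprocess.run(
          ["python", "crawl_knowledges.py", url, f"{name}.json",
           "--max-page-count", "1000", "--max-depth", "5"],
          check=True,
      )
  ```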
- To upload the knowledge base to Dify, update the `.env` file with the required values:

  ```
  KNOWLEDGE_API_KEY=<DIFY_KNOWLEDGE_API_KEY>
  API_BASE_URL=<DIFY_BASE_URL> # e.g. http://aidrd.japaneast.cloudapp.azure.com/v1
  ```
- Run the following command to upload the knowledge base to Dify.
  - Note that you should execute this command in the root directory of this project because the JSON file contains relative paths to the PDFs.

  ```bash
  python upload_knowledge.py <CRAWLED_KNOWLEDGE_FILE> <KNOWLEDGE_BASE_NAME>
  ```
- For example:

  ```bash
  python upload_knowledge.py tokyo.json tokyo-knowledges
  ```
- If you want to upload all prefectures at once, you can use the following command:

  ```bash
  python upload_all_prefectures.py
  ```
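- Under the hood, the upload goes through Dify's Knowledge API: create a dataset, then add each document to it. A minimal sketch, assuming the Dify 0.6.x routes (`/datasets` and `/datasets/{id}/document/create_by_file`); routes and payloads may differ in other Dify versions:

  ```python
  import json
  import requests

  API_BASE_URL = "http://aidrd.japaneast.cloudapp.azure.com/v1"  # from .env
  HEADERS = {"Authorization": "Bearer <DIFY_KNOWLEDGE_API_KEY>"}

  def create_dataset(name: str) -> str:
      """Create an empty knowledge base and return its id."""
      resp = requests.post(f"{API_BASE_URL}/datasets", headers=HEADERS, json={"name": name})
      resp.raise_for_status()
      return resp.json()["id"]

  def upload_file(dataset_id: str, path: str) -> None:
      """Add one file to the dataset, letting Dify chunk it automatically."""
      meta = {"data": json.dumps({
          "indexing_technique": "high_quality",
          "process_rule": {"mode": "automatic"},
      })}
      with open(path, "rb") as f:
          resp = requests.post(
              f"{API_BASE_URL}/datasets/{dataset_id}/document/create_by_file",
              headers=HEADERS, data=meta, files={"file": f},
          )
      resp.raise_for_status()
  ```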
- After uploading, please execute the following SQL query to remove the suffix `.added_on_upload.html` from the document names.
- This step is necessary because Dify does not accept URLs without an extension, such as `http://example.com/?id=123`, so the suffix `.added_on_upload.html` is appended to those document names during upload.

  ```sql
  UPDATE documents
  SET name = LEFT(name, LENGTH(name) - LENGTH('.added_on_upload.html'))
  WHERE name LIKE '%.added_on_upload.html';
  ```
- As of Dify 0.6.15, the created knowledge base has `only_me` visibility by default and is visible only to the owner of the Dify workspace.
- If you cannot see the uploaded knowledge base, please execute the following SQL query to change its visibility.

  ```sql
  UPDATE datasets SET permission = 'all_team_members' WHERE name = '<KNOWLEDGE_BASE_NAME>';
  ```
- To evaluate the accuracy of your Dify chatbot, update the `.env` file with the required values:

  ```
  API_BASE_URL=<YOUR_DIFY_BASE_URL>
  CHATBOT_API_KEY=<YOUR_CHATBOT_API_KEY>
  DIFY_USER=<DIFY_USER_NAME> # Just for logging purposes. Can be anything
  AZURE_DEPLOYMENT_ID=<YOUR_AZURE_DEPLOYMENT_ID> # The deployment id of your LLM model. This model evaluates the chatbot responses
  ```
- Restart the container:

  ```bash
  docker-compose restart
  ```
- Run the evaluation script.
  - The evaluation data should be located in `evaluation_data.json`.

  ```bash
  docker-compose exec app python evaluation.py
  ```
- The evaluation results will be saved in `evaluation_results_<timestamp>.json`.
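- For orientation, the core of a script like `evaluation.py` is a loop that sends each question to the chatbot through Dify's `chat-messages` endpoint and then has the Azure-deployed judge model grade the answer. A minimal sketch; the grading prompt and evaluation-item fields here are illustrative, not necessarily the script's actual ones:

  ```python
  import os
  import requests
  from openai import AzureOpenAI  # client for the judge model

  API_BASE_URL = os.environ["API_BASE_URL"]
  CHATBOT_API_KEY = os.environ["CHATBOT_API_KEY"]

  def ask_chatbot(question: str) -> str:
      """Query the Dify chatbot in blocking mode and return its answer."""
      resp = requests.post(
          f"{API_BASE_URL}/chat-messages",
          headers={"Authorization": f"Bearer {CHATBOT_API_KEY}"},
          json={
              "inputs": {},
              "query": question,
              "response_mode": "blocking",
              "user": os.environ.get("DIFY_USER", "evaluator"),
          },
          timeout=300,
      )
      resp.raise_for_status()
      return resp.json()["answer"]

  def judge(client: AzureOpenAI, question: str, expected: str, actual: str) -> str:
      """Ask the judge LLM whether the chatbot answer matches the expectation."""
      completion = client.chat.completions.create(
          model=os.environ["AZURE_DEPLOYMENT_ID"],  # Azure deployment id, not a model name
          messages=[{
              "role": "user",
              "content": (f"Question: {question}\nExpected: {expected}\n"
                          f"Actual: {actual}\nDoes the actual answer match? Reply yes or no."),
          }],
      )
      return completion.choices[0].message.content

  # Example wiring (endpoint, key, and api_version come from your Azure resource):
  # client = AzureOpenAI(azure_endpoint="https://<resource>.openai.azure.com",
  #                      api_key="<key>", api_version="2024-02-01")
  ```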
- Firecrawl sometimes crashes and crawling stops. In this case, you can resume by restarting the Firecrawl containers and executing the script again.
- Weaviate in Dify sometimes refuses connections. In this case, you can restart the Weaviate container and execute the script again.
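- One way to automate the restart-and-retry above is to wrap the crawl in a retry loop that restarts the Firecrawl containers between failed attempts. A rough sketch; the compose project directory (`firecrawl/`) is an assumption:

  ```python
  import subprocess
  import time

  def crawl_with_retries(args: list[str], max_retries: int = 3) -> None:
      """Re-run the crawl script, restarting Firecrawl between failures."""
      for attempt in range(1, max_retries + 1):
          result = subprocess.run(["python", "crawl_knowledges.py", *args])
          if result.returncode == 0:
              return
          print(f"Crawl failed (attempt {attempt}); restarting Firecrawl...")
          # Assumes the Firecrawl docker-compose project lives in ./firecrawl.
          subprocess.run(["docker-compose", "restart"], cwd="firecrawl", check=True)
          time.sleep(30)  # give the services time to come back up
      raise RuntimeError("Crawling kept failing after retries")
  ```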