Note: This repo and its README is a work in progress. The instructions to install and run here might not be comprehensive (yet), and code might break from time to time.
Note: Please check our Powerpoint presentation here.
Note: Start with the Demo Notebook here.
Multimodal Document Analysis with RAG and Code Execution: using Text, Images and Data Tables with GPT4-V, TaskWeaver, and Assistants API:
- The work focuses on processing multi-modal analytical documents by extracting text, images, and data tables to maximize data representation and information extraction, utilizing formats like Python code, Markdown, and Mermaid script for compatibility with GPT-4 models.
- Text is programmatically extracted from documents, processed to improve structure and tag extraction for better searchability, and numerical data is captured through generated Python code for later use.
- Images and data tables are processed to generate multiple text-based representations (including detailed text descriptions, Mermaid, and Python code for images, and various formats for tables) to ensure information is searchable and usable for calculations, forecasts, and applying machine learning models using Code Interpreter capabilities.
- As of today with conventional techniques, to be able to search through a knowledge base with RAG, text from documents need to be extracted, chunked and stored in a vector database
- This process now is purely concerned with text:
- If the documents have any images, graphs or tables, these elements are usually either ignored or extracted as messy unstructured text
- Retrieving unstructured table data through RAG will lead to very low accuracy answers
- LLMs are usually very bad with numbers. If the query requires any sort of calculations, LLMs usually hallucinate or make basic math mistakes
- Ingest and interact with multi-modal analytics documents with lots of graphs, numbers and tables
- Extract structured information from some elements in documents which wasn’t possible before:
- Images
- Graphs
- Tables
- Use the Code Interpreter to formulate answers where calculations are needed based on search results
- Analyze Investment opportunity documents for Private Equity deals
- Analyze tables from tax documents for audit purposes
- Analyze financial statements and perform initial computations
- Analyze and interact with multi-modal Manufacturing documents
- Process academic and research papers
- Ingest and interact with textbooks, manuals and guides
- Analyze traffic and city planning documents
- GPT-4-Turbo is a great help with its large 128k token window
- GPT-4-Turbo with Vision is great at extracting tables from unstructured document formats
- GPT-4 models can understand a wide variety of formats (Python, Markdown, Mermaid, GraphViz DOT, etc..) which was essential in maximizing information extraction
- A new approach to vector index searching based on tags was needed because the Generation Prompts were very lengthy compared to the usual user queries
- Taskweaver’s and Assistants API’s Code Interpreters were introduced to conduct open-ended analytics questions
This solution implements a three-stage process:
- Ingestion stage for extracting data
- Search stage for enabling search capabilities with code execution, and (3)
- Generation stage for creating custom-tailored outputs such as business or industry overviews.
The below will inform the user on how to set up and install this solution.
The below resources need to be created before using this repo:
- GPT-4-Turbo and GPT-4-Vision models in Azure OpenAI resource(s)
- AI Search resource (Enable Semantic Search)
- Web App resource
- Azure Vision resource
The user has two options:
- The user can use the Chainlit web app locally on his computer
- The user can deploy the solution to an Azure Web App. In the
deployment
folder, the user can thedeployment.sh
bash script to deploy the web app.
This repository contains the Bicep code to deploy an Azure App Services baseline architecture with zonal redundancy and the Azure Open Ai resource deployed in the four regions that are currently (as of Jan 2024) supported to run the GPT -4 Vision model.
- Virtual Network: The fundamental networking backbone within Azure.
- Application Gateway Subnet with Azure Web Application Firewall: Protects the app against web vulnerabilities.
- App Service Integration Subnet: Dedicated to integrating Azure services.
- Private Endpoint Subnet: Hosts network interfaces for private connections to Azure services.
- Managed Identity: Automates the provision of Azure Active Directory identities for secure inter-service authentication.
- Azure Active Directory: Manages user identities and permissions.
- Azure Key Vault: Secures application secrets, keys, and certificates.
- App Service: Host all the user interface to provide the application functionality
- App Service Instance Zone 1, 2, 3: Ensures high availability across different zones.
- Azure Container Registries: Stores Docker container images for the application
- Azure Storage: Provides scalable cloud storage services to host all iof the ingested documents and the prompt templates.
- Azure Monitor/Application Insights: Offers advanced analytics and machine learning-driven insights.
- Private DNS Zones: Allows for internal name resolution within the Azure network.
- Monitoring: Tracks health, performance, and usage of the application and infrastructure.
- Security: Enhanced with private connections, Bastion Host, jumpbox VM, DDoS protection, WAF, and secure secret management using keyvault.
- Scalability: App Services can dynamically adapt to varying loads.
- Availability: Distributed across availability zones to ensure continuous operation.
- Performance: Optimized routing and low-latency operations due to proximity of services.
- Manageability: Simplified management with integrated monitoring and identity services.
- Compliance: Supports regulatory compliance by safeguarding data.
- Cost-Effectiveness: Reduces overhead and operational costs through PaaS solutions.
The architecture is designed for organizations that require robust security, high scalability, and uninterrupted availability for their mission-critical applications.
- Ensure you have an Azure Account
- The deployment must be started by a user who has sufficient permissions to assign roles, such as a User Access Administrator or Owner.
- Ensure you have the Azure CLI installed
- The bicep code has been tested with bicep version: v0.26.54, make sure you have at least this version. Ensure you have the az Bicep tools installed
- The script by default uses locally docker to build the container image therefore you need to install Docker otherwise the depoloyment of the web app will fail https://docs.docker.com/desktop/install/windows-install/
IMPORTANT!
This deployment requires that the user accepts the use of Cognitive Services. Unfortunately,as of today this needs to be done manually.
If you have never created a Cognitive Service before, please follow these steps:
- Go to the Azure Portal
- Create any Cognitive Service
- Accept the usage conditions during the creation process
The infrastucture deploys all the PaaS components behind a private endpoint. Therefore the following Vnet and Subnets are required:
Name | Default Value |
---|---|
Vnet Address | '10.0.0.0/16' |
Application Gateway Subnet | '10.0.1.0/24' |
App Services Subnet | '10.0.0.0/24' |
Private Endpoints Subnet | '10.0.2.0/27' |
DevOps Agents or jump box VM Subnet | '10.0.2.32/27' |
Azure Bastion Subnet | '10.0.3.0/27' |
In the companion deploy.sh file you can see the following variables declared:
- vnetAddressPrefix: '10.0.0.0/16'
- appGatewaySubnetPrefix: '10.0.1.0/24'
- appServicesSubnetPrefix: '10.0.0.0/24'
- privateEndpointsSubnetPrefix: '10.0.2.0/27'
- agentsSubnetPrefix: '10.0.2.32/27'
Do adjust these ones with the required values. Do not decrease the CDR as these settings are following Microsoft best practices in terms of IPs allocations.
This solution requires 4 Open AI deployments to accelerate the ingestion process. As of Jan 2024, the supported locations for GTP4-Vision model are:
Supported Regions (Jan 2024) |
---|
Australia East |
Japan East |
Sweeden central |
Switzerland |
If the deployment script fails while deploying to these regions might be due to one of the following reasons (among others):
-
Case 1: You run out of quota for the GPT4 model.
Solution: You will need to work on requesting more quota for the model. -
Case 2: One of the regions does not support the GPT4-V model anymore.
Solution: You will need to research what are the supported regions at the moment of running the script.
This solution uses Semantic Ranker to improve the search results, and as of Jan 2024, the supported locations are:
Supported Regions Jan 2024 |
---|
Canada Central |
Canada East |
Central US |
East US |
East US 2 |
North Central US |
South Central US |
West Central US |
West US |
West US 2 |
West US 3 |
https://azure.microsoft.com/en-gb/explore/global-infrastructure/products-by-region/?products=search
The default region is East US, to select a different version just pass the paramater aiSearchRegion to the deplate:
aiSearchRegion='eastus'
PREFIX=dev
MMSYS_NO_PATHCONV=1 az deployment group create --template-file ./infra-as-code/bicep/main.bicep \
--resource-group $RESOURCE_GROUP \
--parameters appGatewayListenerCertificate=$appGatewayListenerCertificate \
namePrefix=$PREFIX \
aiSearchRegion=$aiSearchRegion \
vnetAddressPrefix=$vnetAddressPrefix \
appGatewaySubnetPrefix=$appGatewaySubnetPrefix \
appServicesSubnetPrefix=$appServicesSubnetPrefix \
privateEndpointsSubnetPrefix=$privateEndpointsSubnetPrefix\
agentsSubnetPrefix=$agentsSubnetPrefix \
Keep in mind that Open AI and Computer vision are services that have quota. Therefore, if you delete the resource group and then you re-deploy the whole solution again, Open AI and Computer Vision are soft-deleted by default. You will need to go the Azure Portal, go to Azure Open AI services or Cumputer Vision and select the option Manage deleted resources and purge the resources that you are not using anymore.
The following steps are required to deploy the infrastructure from the command line.
-
In your command-line tool where you have the Azure CLI and Bicep installed, navigate to the root directory of this repository (AppServicesRI)
-
Login and set subscription if it is needed
az login
az account set --subscription xxxxx
-
Obtain App gateway certificate Azure Application Gateway support for secure TLS using Azure Key Vault and managed identities for Azure resources. This configuration enables end-to-end encryption of the network traffic using standard TLS protocols. For production systems you use a publicly signed certificate backed by a public root certificate authority (CA). Here, we are going to use a self signed certificate for demonstrational purposes.
The default isntallation creates a dummy certificate without password with contoso.com as the CN.
This is not secure and your organization should provide the proper certificate and then perform the Application Gateway configuration manually to suit your organization needs
Create a bash script:
-
Set a variable for the domain that will be used in the rest of this deployment.
export DOMAIN_NAME_APPSERV_BASELINE="contoso.com"
-
Generate a client-facing, self-signed TLS certificate.
⚠️ Do not use the certificate created by this script for actual deployments. The use of self-signed certificates are provided for ease of illustration purposes only. For your App Service solution, use your organization's requirements for procurement and lifetime management of TLS certificates, even for development purposes.Create the certificate that will be presented to web clients by Azure Application Gateway for your domain.
NON-WINDOWS USERS
openssl req -x509 -nodes -days 365 -newkey rsa:2048 -out appgw.crt -keyout appgw.key -subj "/CN=${DOMAIN_NAME_APPSERV_BASELINE}/O=Contoso" -addext "subjectAltName = DNS:${DOMAIN_NAME_APPSERV_BASELINE}" -addext "keyUsage = digitalSignature" -addext "extendedKeyUsage = serverAuth" openssl pkcs12 -export -out appgw.pfx -in appgw.crt -inkey appgw.key -passout pass:
WINDOWS USERS
- Replace the parameter: -subj like this --> -subj "//O=Org\CN=Name"
- MMSYS_NO_PATHCONV=1 if you get errors on the path is because your bash is loading the existing path to the execution path. you have to call the openssl command using: MMSYS_NO_PATHCONV=1 like this --> MMSYS_NO_PATHCONV=1 openssl req -x509 -rest of the comand...
REPLACE windows_path with the location of your openssl installation.
windows_path="C:\Program Files\FireDaemon OpenSSL 3\bin" export OPENSSL_CONF=$windows_path MMSYS_NO_PATHCONV=1 openssl req -x509 -nodes -days 365 -newkey rsa:2048 -out appgw.crt -keyout appgw.key -subj "\CN=${DOMAIN_NAME_APPSERV_BASELINE}//O=Contoso" -addext "subjectAltName = DNS:${DOMAIN_NAME_APPSERV_BASELINE}" -addext "keyUsage = digitalSignature" -addext "extendedKeyUsage = serverAuth" MMSYS_NO_PATHCONV=1 openssl pkcs12 -export -out appgw.pfx -in appgw.crt -inkey appgw.key -passout pass:
No matter if you used a certificate from your organization or you generated one from above, you'll need the certificate (as
.pfx
) to be Base64 encoded for proper storage in Key Vault later.export APP_GATEWAY_LISTENER_CERTIFICATE_APPSERV_BASELINE=$(cat appgw.pfx | base64 | tr -d '\n') echo APP_GATEWAY_LISTENER_CERTIFICATE_APPSERV_BASELINE: $APP_GATEWAY_LISTENER_CERTIFICATE_APPSERV_BASELINE
-
TaskWeaver requires Python >= 3.10. It can be installed by running the following command from the project root folder. Please follow the below commands very carefully and start by creating a new conda environment:
# create the conda environment
conda create -n mmdoc python=3.10
# activate the conda environment
conda activate mmdoc
# install the project requirements
pip install -r requirements.txt
# clone the repository
git clone https://github.com/microsoft/TaskWeaver.git
# cd into Taskweaver
cd TaskWeaver
# install the Taskweaver requirements
pip install -r requirements.txt
# copy the Taskweaver project directory into the root folder and name it 'test_project'
cp -r project ../test_project/
Note: Inside the
test_project
directory, there's a file calledtaskweaver_config.json
which needs to be populated. Please refer to thetaskweaver_config.sample.json
file in the root folder of this repo, fill in the Azure OpenAI model values for GPT-4-Turbo, rename it totaskweaver_config.json
, and then copy it insidetest_project
(or overwrite existing).
Note: Similiarly, there are a number of test notebooks in this solution that use Autogen. If the user wants to experiment with Autogen, then in this case, the file
OAI_CONFIG_LIST
in thecode
folder needs to be configured. Please refer toOAI_CONFIG_LIST.sample
, populate it with the right values, and then rename it toOAI_CONFIG_LIST
.
Code Interpreters Available in this Solution:
- Taskweaver: is fully supported
- Assistants API: OpenAI AssistantsAPI is supported for now. The Azure version will soon follow when it's released.
There are two web apps that are implemented as part of this solution. The Streamlit web app and the Chainlit web app.
- The Streamlit web app includes the following:
- The web app can ingest documents, which will create an ingestion job either using Azure Machine Learning (recommended) or using a Python sub-process on the web app itself (for local testing only).
- The second part of the Streamlit app is Generation. The "Prompt Management" view will enable the user to build complex prompts with sub-sections, save them to Cosmos, and use the solution to generate output based on these prompts
- The Chainlit web app is used to chat with the ingested documents, and has advanced functionality, such as an audit trail for the search, and references section for the answer with multimodal support (images and tables can be viewed).
The Chainlit web app is the main web app to chat with your data. To run the web app locally, please execute in your conda environment the following:
# cd into the app folder
cd ui
# run the chainlit app
chainlit run chat.py
The Streamlit web app is the main web app to ingest your documents and to build prompts for Generation. To run the web app locally, please execute in your conda environment the following:
# cd into the app folder
cd ui
# run the chainlit app
streamlit run main.py
- Configure properly your
.env
file. Refer to the.env.sample
file included in this solution. - In the Chainlit web app, use
cmd index
to set the index name.
The below outlines the primary commands and options available in the testing tool for the Research CoPilot solution.
Command | Usage |
---|---|
cmd index | Type cmd index to change the name of the AI Search index. |
cmd password | Type cmd password to change the PDF password (if PDFs are password-protected). |
cmd tag_limit | Type cmd tag_limit to change the upper limits of the generated tags per query for the search. |
cmd topN | Type cmd topN to change how many top N results to fetch while executing the search. |
cmd pdf_mode | Type cmd pdf_mode to change the PDF extraction mode. Allowed values are 'gpt-4-vision' or 'document-intelligence'. |
cmd docx_mode | Type cmd docx_mode to change the docx extraction mode. Allowed values are 'document-intelligence' or 'py-docx'. |
cmd threads | Type cmd threads to change the number of threads. Allows for multi-threading during ingestion. Make sure that AZURE_OPENAI_RESOURCE_x and AZURE_OPENAI_KEY_x are properly configured in your .env file. |
cmd delete_dir | Type cmd delete_dir to enable or disable deleting existing output directory if ingestion is restarted. |
cmd ci | Type cmd delete_dir to change the used Code Interpreter. Allowed values are "NoComputationTextOnly", "Taskweaver", "AssistantsAPI", or "LocalPythonExec". |
cmd upload | Type cmd upload to upload document files for ingestion. |
cmd ingest | Type cmd ingest to start the ingestion process of the uploaded files. |
cmd prompts | Type cmd prompts to display all available generation prompts. |
cmd gen | Type cmd gen to generate from pre-existing prompts. |
Query | Type your query in plain English and wait for the response. |