Azure Cognitive Search (ACS) is at the core of this Knowledge Mining accelerator. It provides the necessary search and exploration features for the solution.
This accelerator is configured to ingest data from an Azure Storage Data Lake. By default, your data is pushed to the documents container.
Container | Description | Indexed? |
---|---|---|
documents | Where you push your data | Yes |
images | Where converted pages/slides & attachments are extracted | Yes |
metadata | Where we store each document's full metadata output and an HTML representation of the document | No |
translation | Where we store all translated documents | Yes |
All search configuration is JSON-based and organized in folders.
Adding a new search index or data source is as simple as dropping a new JSON file in the corresponding folder and re-running Initialize-Search (described below).
A quick table to understand the relationships between all Search components.
Storage Container | Datasource(s) | Indexer(s) | Skillset(s) | Description | Since Version |
---|---|---|---|---|---|
documents | documents | documents | documents | Index all types of documents except image files (extension exclusion configuration) | v1.0 |
documents | documents | docimg | images | Index all image files from the documents container (extension inclusion configuration) | v1.0 |
images | images | images | images | Index all image files located in the images container | v1.0 |
translation | translation | translation | documents | Index all translated documents located in the translation container | v1.1 |
images | images | attachments | documents | Index all attached documents located in the images container; this enables indexing of email attachments, for instance | v1.1 |
In the Azure portal, all search components are prefixed with the configuration name parameter, the same prefix used for all deployed services.
Example of indexer prefixing:
- {{config.name}}-documents
- {{config.name}}-docimg
- {{config.name}}-images
- {{config.name}}-translation
- {{config.name}}-attachments
To connect to and operate your ACS instance, first initialize your environment:
cd deployment
./init_env.ps1 -Name <envid>
To configure your search instance initially, or after modifications, run the cmdlet below:
Initialize-Search
Initialize-Search re-applies all existing configurations. For targeted updates (for example, re-applying only an indexer definition you just edited), use one of the following commands:
- Update-SearchAliases
- Update-SearchDataSource
- Update-SearchIndex
- Update-SearchIndexer
- Update-SearchSkillSet
- Update-SearchSynonyms
Apache Tika 2.x provides great content analysis capabilities such as:
- Support for a wide range of document formats
- Metadata extraction
- HTML conversion
- Text extraction
- Table of contents and annotations extraction (PDF)
- Image extraction
We use a custom build of Tika that natively supports Azure Storage to ease data transit and processing.
The technology we use to extract images from documents is provided by this custom Apache Tika build.
While Tika can extract embedded images from any Office or PDF document, in practice this yields many low-value images, e.g. logos, backgrounds, etc.
In addition, annotations overlapping images, or any other graphical elements, are lost; this is most common in PDFs.
For this solution, by default, we chose to extract as follows:
- each PowerPoint slide into an image
- each PDF page into an image
Each page/slide image is then indexed as an individual item, allowing end users to target and retrieve a specific slide or page. You don't have to scroll through a large PDF to find where your search query matched.
This choice also enables user experience capabilities such as document covers, thumbnails, or table extraction, to name a few.
Another aspect of doing your own image extraction is cost: in Azure Cognitive Search, image extraction is a paid feature.
Email processing falls under document processing, except that the extracted "images", here known as attached documents, need a specific indexer.
Starting with v1.1, we support attachments indexing.
Internet Standard Message Formats
The entire solution is configured to normalize content to a configurable language option named searchDefaultToLanguageCode, defined in search/config.json. By default, it is set to English.
All textual content, whether it comes from the text extraction of a full document or from the OCR output of an image, is translated to English (the default, but configurable).
For documents not natively in English, the Transcript tab shows a side-by-side translation.
We added support for Document Translation to our solution. A translation container is created automatically to receive translated documents.
The below diagram highlights the document processing flow.
Processing a document produces images (pages/slides) but also extracts embedded objects such as email attachments.
The below diagram highlights the image processing flow. Note that once the text of an image is extracted, the same document processing applies to it.
You may extend the solution to include audio and video processing.
Below is the list of custom functions supporting skills:
Skill | Description | Language |
---|---|---|
Entities | Utility skill to deduplicate, clean, concatenate entities | C# |
Geo | Skill to assign a country, city, capital & other locations from a list of locations | C# |
Image/Extraction | Skill to trigger the image extraction on documents | C# |
Language | Single skills to interact with Azure Cognitive Services Language: TextAnalytics, Translator | Python |
Metadata/Assign | Skill to combine all metadata assignments: metadata, security, or anything else you'd want | C# |
Metadata/Extraction | ACS provides a set of common properties for any file; this skill provides the entire set of metadata a file has. It is based on Apache Tika | C# |
Text | Skill to manipulate textual data | C# |
Vision | Single skills to interact with Azure Cognitive Services Vision: Analyze, Computer Vision, Form Recognizer | Python |
To support some content analysis skills like image and metadata extraction, we deploy a Docker-based Tika server. The container service we use is Azure App Service; for production use, we would recommend Azure Kubernetes Service.
A function can host one or more skills.
There are 3 skills within the Entities function.
- Concatenate (currently not used)
- Deduplication
- KeyPhrasesCleansing
During document processing, because of limits on the number of characters certain Cognitive Services accept, textual content is split into pages.
Each page is then processed separately, which can cause Named Entity Recognition or Key Phrase extraction to output the same entities across multiple pages.
The Concatenate skill is currently not used.
The Deduplication skill is responsible for deduplicating all extracted NER entities across pages.
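The exact normalization rules live in the Entities function code; the snippet below is only a minimal sketch of the idea, assuming a whitespace-trimmed, case-insensitive comparison that keeps the first spelling encountered.
using System.Collections.Generic;
using System.Linq;

public static class EntityDeduplication
{
    // Collapse the per-page entity lists (persons, locations, organizations, key phrases)
    // into a single de-duplicated list. Hypothetical normalization: trim whitespace and
    // compare case-insensitively, keeping the first spelling encountered.
    public static List<string> Deduplicate(IEnumerable<string> pageEntities)
    {
        return pageEntities
            .Where(e => !string.IsNullOrWhiteSpace(e))
            .Select(e => e.Trim())
            .GroupBy(e => e.ToLowerInvariant())
            .Select(g => g.First())
            .ToList();
    }
}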
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "EntitiesDeduplication",
"description": "A custom skill to normalize and deduplicate values of entities.",
"context": "/document",
"uri": "{{param.entities.deduplication}}",
"httpMethod": "POST",
"timeout": "PT30S",
"batchSize": 1,
"degreeOfParallelism": null,
"inputs": [
{
"name": "keyPhrases",
"source": "/document/pages/*/raw_keyPhrases/*",
"sourceContext": null,
"inputs": []
},
{
"name": "organizations",
"source": "/document/pages/*/raw_organizations/*",
"sourceContext": null,
"inputs": []
},
{
"name": "locations",
"source": "/document/pages/*/raw_locations/*",
"sourceContext": null,
"inputs": []
},
{
"name": "persons",
"source": "/document/pages/*/raw_persons/*",
"sourceContext": null,
"inputs": []
}
],
"outputs": [
{
"name": "keyPhrases",
"targetName": "dedup_keyPhrases"
},
{
"name": "organizations",
"targetName": "organizations"
},
{
"name": "locations",
"targetName": "temp_locations"
},
{
"name": "persons",
"targetName": "persons"
}
],
"httpHeaders": {}
},
Key Phrase extraction produces a lot of noise (an unsupervised effect) which may collide with NER: some key phrase entries may already appear in locations, for example.
The KeyPhrasesCleansing skill removes these collisions between key phrases and NER entities.
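The snippet below is a minimal sketch of that cleansing step, assuming a simple case-insensitive comparison (the deployed function may apply additional normalization).
using System;
using System.Collections.Generic;
using System.Linq;

public static class KeyPhrasesCleansing
{
    // Drop any key phrase that already appears among the NER entities
    // (organizations, locations, persons) so it is not reported twice.
    public static List<string> Cleanse(
        IEnumerable<string> keyPhrases,
        IEnumerable<string> organizations,
        IEnumerable<string> locations,
        IEnumerable<string> persons)
    {
        var knownEntities = new HashSet<string>(
            organizations.Concat(locations).Concat(persons).Select(e => e.Trim()),
            StringComparer.OrdinalIgnoreCase);

        return keyPhrases
            .Select(k => k.Trim())
            .Where(k => !knownEntities.Contains(k))
            .ToList();
    }
}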
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "#14",
"description": "A custom skill to remove stopwords from keyphrases vs others entities",
"context": "/document",
"uri": "{{param.entities.keyphrases-cleansing}}",
"httpMethod": "POST",
"timeout": "PT30S",
"batchSize": 1,
"degreeOfParallelism": null,
"inputs": [
{
"name": "keyPhrases",
"source": "/document/dedup_keyPhrases",
"sourceContext": null,
"inputs": []
},
{
"name": "organizations",
"source": "/document/organizations",
"sourceContext": null,
"inputs": []
},
{
"name": "locations",
"source": "/document/locations",
"sourceContext": null,
"inputs": []
},
{
"name": "persons",
"source": "/document/persons",
"sourceContext": null,
"inputs": []
}
],
"outputs": [
{
"name": "keyPhrases",
"targetName": "keyPhrases"
},
{
"name": "acronyms",
"targetName": "acronyms"
}
],
"httpHeaders": {}
},
Note: acronyms is a placeholder output if you wish to create or look up acronyms.
Examples
- Creation scenario: if you have a key phrase "Department of Justice", you could output the acronym DoJ.
- Lookup scenario: find an acronym in key phrases, look it up in your acronyms reference, and add its definition to the key phrases.
A very simple geolocation function/skill to assign countries, capitals, cities, and other locations. This skill supports the Maps vertical experience.
Why?
Named Entity Recognition produces a set of locations with varying granularity:
[Supported Named Entity Recognition (NER) entity categories](https://docs.microsoft.com/en-us/azure/cognitive-services/language-service/named-entity-recognition/concepts/named-entity-categories#category-location)
The skill takes the produced NER locations, identifies known capitals and countries, and assigns them to different search fields.
This is a very simple approach to enabling geolocation in your solution.
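As a rough illustration of that classification, here is a minimal sketch with hypothetical lookup sets; the real skill ships with its own reference data and also produces a dedicated cities output.
using System;
using System.Collections.Generic;
using System.Linq;

public static class GeoLocations
{
    // Hypothetical reference data; the deployed skill uses its own lists.
    private static readonly HashSet<string> Countries =
        new(new[] { "France", "Germany", "Japan" }, StringComparer.OrdinalIgnoreCase);
    private static readonly HashSet<string> Capitals =
        new(new[] { "Paris", "Berlin", "Tokyo" }, StringComparer.OrdinalIgnoreCase);

    // Split the NER locations into countries, capitals and remaining locations.
    public static (List<string> Countries, List<string> Capitals, List<string> Others)
        Classify(IEnumerable<string> locations)
    {
        var countries = new List<string>();
        var capitals = new List<string>();
        var others = new List<string>();

        foreach (var location in locations.Select(l => l.Trim())
                                          .Distinct(StringComparer.OrdinalIgnoreCase))
        {
            if (Countries.Contains(location)) countries.Add(location);
            else if (Capitals.Contains(location)) capitals.Add(location);
            else others.Add(location);
        }
        return (countries, capitals, others);
    }
}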
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "geolocations",
"description": "Locations geo locations for the map support",
"context": "/document",
"uri": "{{param.geolocations.locations}}",
"httpMethod": "POST",
"timeout": "PT1M",
"batchSize": 5,
"degreeOfParallelism": null,
"inputs": [
{
"name": "locations",
"source": "/document/temp_locations"
}
],
"outputs": [
{
"name": "locations",
"targetName": "locations"
},
{
"name": "countries",
"targetName": "countries"
},
{
"name": "capitals",
"targetName": "capitals"
},
{
"name": "cities",
"targetName": "cities"
}
],
"httpHeaders": {}
},
Image Extraction is the only durable function among our skills, so it doesn't hold up the processing of documents.
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "ImageExtraction",
"description": "Send the document references for Images extraction (TIKA)",
"context": "/document",
"uri": "{{param.imgext.DurableImageExtractionSkill_HttpStart}}",
"httpMethod": "POST",
"timeout": "PT30S",
"batchSize": 1,
"degreeOfParallelism": 5,
"inputs": [
{
"name": "document_index_key",
"source": "/document/index_key"
},
{
"name": "document_id",
"source": "/document/document_id"
},
{
"name": "document_filename",
"source": "/document/document_filename"
},
{
"name": "document_url",
"source": "/document/document_url"
},
{
"name": "document_metadata",
"source": "/document/skill_metadata"
}
],
"outputs": [
{
"name": "message",
"targetName": "image_extraction_message"
}
],
"httpHeaders": {}
},
This function holds multiple skills related to the Azure Cognitive Service for Language.
Please refer to the function README for more details.
A single-skill function app.
Another important skill in our solution, aimed at providing a place to map file metadata, assign default values, attach security groups to content, and more.
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "MetadataAssignment",
"context": "/document",
"uri": "{{param.mtda.Assign}}",
"httpMethod": "POST",
"timeout": "PT1M",
"batchSize": 5,
"degreeOfParallelism": null,
"inputs": [
{
"name": "document_filename",
"source": "/document/document_filename"
},
{
"name": "document_url",
"source": "/document/document_url"
},
{
"name": "file_metadata",
"source": "/document/file_metadata"
},
{
"name": "author",
"source": "/document/metadata_author"
},
{
"name": "metadata_title",
"source": "/document/metadata_title"
},
{
"name": "metadata_last_modified",
"source": "/document/metadata_last_modified"
},
{
"name": "metadata_creation_date",
"source": "/document/metadata_creation_date"
}
],
"outputs": [
{
"name": "skill_metadata",
"targetName": "skill_metadata"
}
],
"httpHeaders": {}
},
This metadata mapping feature allows you to map any document metadata source to any search enrichment target field.
Important: to simplify the mapping entries, the source entries are normalized; blank spaces and colons are replaced by a dash.
key.Replace(" ","-").Replace(":","-")
Examples
dcterms:modified => dcterms-modified
The mapping is evaluated sequentially, so you can handle overwriting a target field. See the last_modified example below.
[
{
"source": "dcterms-created",
"target": [ "creation_date", "last_modified", "last_save_date" ]
},
{
"source": "dcterms-modified",
"target": [ "last_modified" ]
},
{
"source": "meta-save-date",
"target": [ "last_save_date" ]
},
{
"source": "dc-subject",
"target": [ "description" ]
},
{
"source": "Category",
"target": [ "user_categories" ],
"transform": "SplitSemiColumn"
},
{
"source": "meta-author",
"target": [ "authors" ],
"transform": "none"
},
{
"source": "meta-last-author",
"target": [ "authors" ],
"transform": "none"
},
{
"source": "meta-keyword",
"target": [ "user_keywords" ],
"transform": "SplitSemiColumn"
},
{
"source": "custom-Tags",
"target": [ "user_tags" ],
"transform": "SplitSPTaxonomy"
},
{
"source": "custom-Document-Type",
"target": [ "user_tags" ],
"transform": "SplitSPTaxonomy"
}
]
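As a hint of how the sequential evaluation described above could work, here is a minimal sketch; the type and method names are illustrative, not the actual implementation, and transforms such as SplitSemiColumn are omitted.
using System.Collections.Generic;

public record MetadataMapping(string Source, string[] Target, string Transform = "none");

public static class MetadataMappingEvaluator
{
    // Apply the mappings in order: a later entry writing to the same target
    // simply overwrites the earlier value (see last_modified above).
    public static Dictionary<string, string> Apply(
        IEnumerable<MetadataMapping> mappings,
        IReadOnlyDictionary<string, string> normalizedMetadata)
    {
        var enriched = new Dictionary<string, string>();
        foreach (var mapping in mappings)
        {
            if (!normalizedMetadata.TryGetValue(mapping.Source, out var value))
                continue;
            foreach (var target in mapping.Target)
                enriched[target] = value; // transforms intentionally omitted in this sketch
        }
        return enriched;
    }
}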
Note: you may extend the metadata mapping to capture EXIF tags, e.g. GPS latitude, longitude or altitude, or XMP tags.
EXIF/XMP metadata are extracted by Apache Tika and are therefore available for mapping!
To help you refine your mapping, the document details show the full metadata.
This is an important configuration that allows you to secure content based on its storage location. The target entries reference Azure AD groups (GUIDs).
[
{
"source": [ "folder1", "restricted" ],
"target": [ "group1" ]
},
{
"source": [ "folder2" , "restricted"],
"target": [ "group2" ]
},
{
"source": [ "folder3", "restricted" ],
"target": [ "group3" ]
}
]
In the above example, any content with folder1 and restricted in its storage path will be secured with group1, and so on.
To assign a group, all source partial paths have to match (AND).
Groups are stored in the search index under the permissions field (a list of Azure AD group GUIDs).
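A minimal sketch of that AND matching, with illustrative type names (the actual logic lives in the Metadata/Assign function):
using System;
using System.Collections.Generic;
using System.Linq;

public record SecurityMapping(string[] Source, string[] Target);

public static class ContentSecurity
{
    // A mapping's groups are assigned only if ALL of its source segments
    // appear in the document's storage path (logical AND).
    public static List<string> ResolvePermissions(
        string documentUrl, IEnumerable<SecurityMapping> mappings)
    {
        var pathSegments = new HashSet<string>(
            new Uri(documentUrl).AbsolutePath.Split('/', StringSplitOptions.RemoveEmptyEntries),
            StringComparer.OrdinalIgnoreCase);

        return mappings
            .Where(m => m.Source.All(pathSegments.Contains))
            .SelectMany(m => m.Target)
            .Distinct()
            .ToList();
    }
}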
[
{
"source": [ "folder1"],
"target": "Group1 Data"
},
{
"source": [ "folder2"],
"target": "Group2 Data"
},
{
"source": [ "folder3"],
"target": "Group3 Data"
}
]
Content group follows the same framework as content security; it gives you the ability to logically group scattered data in a facetable content_group index field.
The Text function app contains multiple functions, but only three are currently in use:
- HTMLConversion
- TextMesh
- TranslationMerge
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "HTMLConversion",
"description": "Send the document references for HTML Conversion extraction (TIKA)",
"context": "/document",
"uri": "{{param.text.HtmlConversion}}",
"httpMethod": "POST",
"timeout": "PT3M",
"batchSize": 1,
"degreeOfParallelism": 1,
"inputs": [
{
"name": "document_index_key",
"source": "/document/index_key"
},
{
"name": "document_id",
"source": "/document/document_id"
},
{
"name": "document_filename",
"source": "/document/document_filename"
},
{
"name": "document_url",
"source": "/document/document_url"
}
],
"outputs": [
{
"name": "file_html",
"targetName": "file_html"
}
],
"httpHeaders": {}
},
This skill cleans up any input text, for example by removing empty lines, condensing the text to its minimum.
This helps reduce the amount of non-textual characters and improves the side-by-side translation experience.
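A minimal sketch of that kind of cleanup (the deployed function also reports line counts and matches, which are not shown here):
using System;
using System.Linq;

public static class TextMesh
{
    // Trim every line and drop the empty ones, condensing the content to its minimum.
    public static string Condense(string content)
    {
        var lines = content
            .Split('\n')
            .Select(line => line.Trim())
            .Where(line => line.Length > 0);
        return string.Join(Environment.NewLine, lines);
    }
}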
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "ContentMesh",
"description": "Send the document content for cleaning/meshing",
"context": "/document",
"uri": "{{param.text.TextMesh}}",
"httpMethod": "POST",
"timeout": "PT3M",
"batchSize": 1,
"degreeOfParallelism": 2,
"inputs": [
{
"name": "content",
"source": "/document/content"
}
],
"outputs": [
{
"name": "trimmed_content",
"targetName": "trimmed_content"
},
{
"name": "trimmed_content_lines_count",
"targetName": "trimmed_content_lines_count"
},
{
"name": "trimmed_content_lines_matches",
"targetName": "trimmed_content_lines_matches"
}
],
"httpHeaders": {}
},
This skill merges the translated text of the different pages.
Skill definition
{
"@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
"name": "MergeTranslatedPages",
"context": "/document",
"uri": "{{param.text.TranslationMerge}}",
"httpMethod": "POST",
"timeout": "PT1M",
"batchSize": 5,
"degreeOfParallelism": null,
"inputs": [
{
"name": "translated_pages",
"source": "/document/pages/*/translated_text"
},
{
"name": "fromLanguageCode",
"source": "/document/language"
},
{
"name": "toLanguageCode",
"source": "/document/pages/0/translatedToLanguageCode"
}
],
"outputs": [
{
"name": "merged_translation",
"targetName": "merged_translation"
}
],
"httpHeaders": {}
},
Probably the most important function, as our solution has a strong focus on visual information.
Please refer to the function README for more details.
Power Skills are a collection of useful functions to be deployed as custom skills for Azure Cognitive Search. The skills can be used as templates or starting points for your own custom skills, or they can be deployed and used as they are if they happen to meet your requirements.
Custom skills are web APIs that implement a specific interface. A custom skill can be implemented on any publicly addressable resource. The most common implementations for custom skills are:
- Azure Functions for custom logic skills
- Azure Webapps for simple containerized AI skills
- Azure Kubernetes service for more complex or larger skills.
When you have critical applications and business processes relying on Azure resources, you want to monitor those resources for their availability, performance, and operation. This article describes the monitoring data generated by Azure Cognitive Search and how to analyze and alert on this data with Azure Monitor.
Protecting your data from unwanted exposure is crucial. As ACS doesn't include a data security model for indexing and querying, our solution accelerator provides one for your convenience.
The proposed model provides a concept of public documents and private documents.
The skill responsible for assigning security permissions to your content is Metadata Assign.
Two index fields support our basic model:
- restricted: a boolean indicating whether a document is restricted to a certain audience defined by permissions.
- permissions: a list of Azure AD security group IDs representing the people authorized to search a document.
index.json (snippet)
{
"name": "restricted",
"type": "Edm.Boolean",
"facetable": false,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": false,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "permissions",
"type": "Collection(Edm.String)",
"searchable": false,
"filterable": true,
"retrievable": false,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null,
"synonymMaps": []
}
Three general parameters help you control your content security implementation:
webappui.json
{
"name": "SearchServiceConfig:IsSecurityTrimmingEnabled",
"value": true,
"slotSetting": false
},
{
"name": "SearchServiceConfig:PermissionsPublicFilter",
"value": "restricted eq false",
"slotSetting": false
},
{
"name": "SearchServiceConfig:PermissionsProtectedFilter",
"value": "restricted eq true",
"slotSetting": false
}
Content Security Setting | Description |
---|---|
IsSecurityTrimmingEnabled | Flag to add security filters to each query |
PermissionsPublicFilter | ACS filter (OData syntax) describing how to retrieve non-secured documents |
PermissionsProtectedFilter | ACS filter (OData syntax) describing how to retrieve secured documents |
Capturing the user's group memberships is possible when your UI is authenticated with an Azure AD Enterprise Application (EA). Upon authenticating an end user, the security token contains the list of group memberships for the user.
Refer to Authentication.md, located in your config folder, for more details on how to set up Azure AD EA authentication.
Our UI goes through those claims and generates the security filter accordingly.
AbstractApiController.cs
protected SearchPermission[] GetUserPermissions()
{
    List<SearchPermission> permissions = new();
    // Always include the user's own id as an implicit permission.
    permissions.Add(new SearchPermission() { group = GetUserId() });
    // Add every Azure AD group membership claim present in the token.
    if (User.Claims.Any(c => c.Type == "groups"))
    {
        foreach (var item in User.Claims.Where(c => c.Type == "groups"))
        {
            permissions.Add(new SearchPermission() { group = item.Value });
        }
    }
    return permissions.ToArray();
}
Example of a query filter added for security:
(restricted eq false) or ((restricted eq true) and (permissions/any(t: search.in(t, 'group1,group2,group3', ','))))
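For illustration, such a filter could be composed from the user's permissions and the two configured filters roughly as follows; this is only a sketch with illustrative names, the actual code lives in the web app.
using System.Linq;

// Matches the shape of the SearchPermission type used in AbstractApiController.cs.
public class SearchPermission
{
    public string group { get; set; }
}

public static class SecurityFilterBuilder
{
    // Combine the public filter with the protected filter restricted to the
    // user's groups, producing a clause shaped like the example above.
    public static string Build(
        string publicFilter,      // e.g. the PermissionsPublicFilter setting
        string protectedFilter,   // e.g. the PermissionsProtectedFilter setting
        SearchPermission[] permissions)
    {
        var groups = string.Join(",", permissions.Select(p => p.group));
        return $"({publicFilter}) or (({protectedFilter}) and " +
               $"(permissions/any(t: search.in(t, '{groups}', ','))))";
    }
}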