This project uses code, data, and resources that were not created by me or any collaborators. All non-original works (including code, works of art, writings, data, and resources) are the copyright of their original owners and are used here for non-commercial, open-source purposes.

Document Collection Analysis

Link to the Microsoft DOCS site

The detailed documentation for this real-world scenario includes a step-by-step walkthrough:

https://docs.microsoft.com/azure/machine-learning/preview/scenario-document-collection-analysis

Link to the Gallery GitHub repository

The public GitHub repository for this real-world scenario contains all materials, including code samples, needed for this example:

https://github.com/Azure/MachineLearningSamples-DocumentCollectionAnalysis

This scenario demonstrates how to summarize and analyze a large collection of documents using Azure Machine Learning Workbench, with techniques such as phrase learning, topic modeling, and topic model analysis. Azure Machine Learning Workbench makes it easy to scale up to very large document collections and provides mechanisms to train and tune models in a variety of compute contexts, ranging from local compute to Data Science Virtual Machines to Spark clusters. Jupyter notebooks within Azure Machine Learning Workbench make development easy.

Overview

With a large amount of data (especially unstructured text data) collected every day, a significant challenge is to organize, search, and understand vast quantities of text. This document collection analysis scenario demonstrates an efficient, automated, end-to-end workflow for analyzing large document collections and enabling downstream NLP tasks.

The key elements delivered by this scenario are:

Learning salient multi-word phrases from documents.
Discovering underlying topics present in the document collection.
Representing documents by their topical distribution.
Presenting methods for organizing, searching, and summarizing documents based on the topical content.
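
As a rough illustration of the first element, phrase learning can be approximated by scoring adjacent word pairs with pointwise mutual information (PMI). This is a minimal sketch and not the scenario's actual implementation: the function names, the `min_count` threshold, and the sample sentences are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def learn_phrases(documents, min_count=2, top_k=5):
    """Rank adjacent word pairs by PMI (hypothetical helper, not the scenario's API)."""
    unigrams = Counter()
    bigrams = Counter()
    for doc in documents:
        tokens = tokenize(doc)
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scored = []
    for (first, second), count in bigrams.items():
        if count < min_count:
            continue  # rare pairs give unreliable PMI estimates
        p_pair = count / total_bigrams
        p_first = unigrams[first] / total_unigrams
        p_second = unigrams[second] / total_unigrams
        pmi = math.log(p_pair / (p_first * p_second))
        scored.append((pmi, f"{first} {second}"))

    scored.sort(reverse=True)
    return [phrase for _, phrase in scored[:top_k]]


# Made-up sample sentences, for illustration only.
docs = [
    "The machine learning model was trained on news stories.",
    "Machine learning helps summarize product reviews.",
    "The committee reviewed the new bill on machine learning research.",
]
phrases = learn_phrases(docs)
print(phrases)
```

In this toy corpus only one word pair occurs at least twice, so it is the only candidate phrase. A real corpus would need stop-word handling and a larger `min_count`.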

The methods presented in this scenario could enable a variety of critical industrial workloads, such as detection of topical trends and anomalies, document collection summarization, and similar-document search. They can be applied to many different types of documents, such as government legislation, news stories, product reviews, customer feedback, and scientific research articles.
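
To make the similar-document search workload concrete, one common approach is to compare documents by the cosine similarity of their topic distributions. The sketch below assumes each document has already been reduced to a vector of topic weights. The document IDs and weights are invented for the example, and the scenario itself may use a different similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_id, topic_distributions, top_k=2):
    """Return the IDs of the top_k documents closest to query_id (hypothetical helper)."""
    query = topic_distributions[query_id]
    scores = [
        (cosine_similarity(query, vector), doc_id)
        for doc_id, vector in topic_distributions.items()
        if doc_id != query_id
    ]
    scores.sort(reverse=True)
    return [doc_id for _, doc_id in scores[:top_k]]


# Invented three-topic distributions; each row sums to 1.
topic_distributions = {
    "bill_a": [0.80, 0.15, 0.05],
    "bill_b": [0.75, 0.20, 0.05],
    "review_c": [0.05, 0.10, 0.85],
    "news_d": [0.10, 0.80, 0.10],
}
neighbors = most_similar("bill_a", topic_distributions)
print(neighbors)
```

Because the comparison happens in topic space rather than word space, two documents can match even when they share few exact terms.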

The machine learning techniques/algorithms used in this scenario include:

Text processing and cleaning
Phrase learning
Topic modeling
Corpus summarization
Topical trends and anomaly detection
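
As an illustration of the last technique, a simple way to flag an anomaly in a topic's trend is to compute a z-score for each time step against the series mean. This is a sketch under stated assumptions, not the method used in the scenario: the weekly topic shares are invented and the threshold of 2.0 is an arbitrary choice.

```python
import statistics


def find_anomalies(series, threshold=2.0):
    """Return indices whose z-score magnitude exceeds the threshold (hypothetical helper)."""
    mean = statistics.mean(series)
    stdev = statistics.pstdev(series)  # population standard deviation
    if stdev == 0:
        return []  # a flat series has no outliers
    return [
        index
        for index, value in enumerate(series)
        if abs(value - mean) / stdev > threshold
    ]


# Invented share of documents assigned to one topic, per week.
weekly_topic_share = [0.10, 0.11, 0.09, 0.10, 0.12, 0.10, 0.45, 0.11]
anomalies = find_anomalies(weekly_topic_share)
print(anomalies)
```

A global z-score is sensitive to the outlier itself and ignores seasonality, so a rolling window or a robust statistic such as the median absolute deviation may suit longer series better.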

Prerequisites

The prerequisites to run this example are as follows:

Make sure that you have properly installed Azure Machine Learning Workbench by following the quick start installation guide.
This example can be run in any compute context. However, we recommend running it on a multi-core machine with at least 16 GB of memory and 5 GB of disk space.

Data/Telemetry

This advanced scenario for Document Collection Analysis collects usage data and sends it to Microsoft to help improve our products and services. Read our privacy statement to learn more.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (for example, label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information, see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

alexfromsocal/NewsFactAggregator

Folders and files

Latest commit

History

Repository files navigation

This project uses code, data, and resources that were not made by myself or any collaborators. All non-roginal works (including code, works of art, writings, data, and resources) are copyright of their original owners and used here for non-commercial, open-source purposes.

Document Collection Analysis

Link to the Microsoft DOCS site

Link to the Gallery GitHub repository

Overview

Prerequisites

Data/Telemetry

Contributing

About

Resources

License

Licenses found

Stars

Watchers

Forks

Releases

Packages 0

Contributors 3

Languages

Packages