This project includes the IBM Debater® Thematic Clustering of Sentences dataset from the Data Asset Exchange, along with supporting notebooks. The notebooks show how to read, clean, and visualize the data, how to save the cleaned dataset into a Watson Studio project, and how to develop a clustering model using the dataset. This sample project contains two notebooks and one related data file. Please run the notebooks in order of their part numbers using a Python 3.7 runtime.

Resources:

Dataset homepage: https://developer.ibm.com/exchanges/data/all/thematic-clustering-of-sentences/
Dataset download link: https://dax-cdn.cdn.appdomain.cloud/dax-thematic-clustering-of-sentences/1.0.2/thematic-clustering-of-sentences.tar.gz

Data assets

dataset.csv: The dataset contains 692 articles from Wikipedia. The number of sections (clusters) in each article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.
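As a quick check of the per-article ranges quoted above, the sketch below counts sentences and distinct sections for each article. It assumes one sentence per row, and the column names `article`, `section`, and `sentence` are hypothetical placeholders, not confirmed names from dataset.csv; check the real header row and pass the matching names. The in-memory sample stands in for the file.

```python
import csv
import io
from collections import defaultdict


def summarize_articles(csv_file, article_col="article", section_col="section"):
    """Return {article: (sentence_count, section_count)}.

    Assumes one sentence per row. The default column names are
    hypothetical; replace them with the names in the real header.
    """
    sentence_counts = defaultdict(int)
    sections = defaultdict(set)
    for row in csv.DictReader(csv_file):
        article = row[article_col]
        sentence_counts[article] += 1
        sections[article].add(row[section_col])
    return {a: (sentence_counts[a], len(sections[a])) for a in sentence_counts}


# Tiny in-memory stand-in for dataset.csv (made-up rows, for illustration only).
sample = io.StringIO(
    "article,section,sentence\n"
    "A,History,First sentence.\n"
    "A,History,Second sentence.\n"
    "A,Geography,Third sentence.\n"
    "B,Career,Only sentence.\n"
)
summary = summarize_articles(sample)
for article, (n_sentences, n_sections) in summary.items():
    print(article, n_sentences, n_sections)
```

To run it against the real file, open it with `open("dataset.csv", newline="", encoding="utf-8")` and pass the file object in, together with the actual column names.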

Notebooks

Open the Assets tab to access and run the following notebooks in order:

Part 1 - Data Exploration & Visualization: This notebook loads, cleans, explores, and visualizes the data files in the project.
Part 2 - Model Development: This notebook takes sentences and clusters them into groups based on the topic similarity between sentences.
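The idea behind Part 2, grouping sentences by topic similarity, can be sketched as follows. This is not necessarily the method the notebook uses; it is a minimal illustration that assumes scikit-learn is installed and swaps in TF-IDF vectors with k-means. The function name `cluster_sentences` and the sample sentences are made up for the example.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer


def cluster_sentences(sentences, n_clusters, random_state=0):
    """Assign each sentence a cluster label (hypothetical helper).

    Sentences are turned into TF-IDF vectors, then grouped with k-means.
    This is an assumed stand-in, not the notebook's confirmed model.
    """
    vectors = TfidfVectorizer(stop_words="english").fit_transform(sentences)
    model = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state)
    return model.fit_predict(vectors).tolist()


# Made-up sentences on two unrelated topics.
sentences = [
    "The cat chased the mouse.",
    "A cat and a mouse played.",
    "The mouse ran from the cat.",
    "Stocks fell as markets closed.",
    "Markets rallied and stocks rose.",
    "Investors sold stocks when markets dropped.",
]
labels = cluster_sentences(sentences, n_clusters=2)
for label, sentence in zip(labels, sentences):
    print(label, sentence)
```

For an article from the dataset, `n_clusters` would be set to that article's number of sections, which lets the predicted grouping be compared against the original section assignment.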

You can review the completed notebooks here

Licenses

Dataset: CC-BY-SA 3.0
Notebooks: MIT
