This project includes the IBM Debater® Thematic Clustering of Sentences dataset from the Data Asset Exchange, together with supporting notebooks. The notebooks show how to read, clean, and visualize the data, how to save the cleaned dataset into a Watson Studio project, and how to develop a clustering model using the dataset. The sample project contains two notebooks and one related data file. Run the notebooks in order of their part numbers, using a Python 3.7 runtime.

Resources:

Data assets

  • dataset.csv: The dataset contains 692 articles from Wikipedia. The number of sections (clusters) per article ranges from 5 to 12, and the number of sentences per article ranges from 17 to 1614.
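
The figures above can be checked with a quick pass over the file. The following is a minimal sketch using only the Python standard library. The column names `article`, `section`, and `sentence` are assumptions made for illustration and may not match the real header of dataset.csv; the sample rows are invented stand-ins, not data from the dataset.

```python
import csv
import io
from collections import defaultdict


def summarize(rows):
    """Return article count plus (min, max) sections and sentences per article.

    Assumes one sentence per row and hypothetical column names
    'article' and 'section'; adjust them to the real CSV header.
    """
    sections = defaultdict(set)
    sentences = defaultdict(int)
    for row in rows:
        sections[row["article"]].add(row["section"])
        sentences[row["article"]] += 1
    section_counts = [len(s) for s in sections.values()]
    return {
        "articles": len(sentences),
        "sections_per_article": (min(section_counts), max(section_counts)),
        "sentences_per_article": (min(sentences.values()), max(sentences.values())),
    }


# Tiny in-memory stand-in for the real file (invented rows).
SAMPLE = """article,section,sentence
A,History,First sentence.
A,History,Second sentence.
A,Geography,Third sentence.
B,Career,Fourth sentence.
"""

summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(summary)
```

To run the same check on the real file, pass `csv.DictReader(open("dataset.csv", newline="", encoding="utf-8"))` to `summarize` after adjusting the column names.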

Notebooks

Open the Assets tab to access and run the following notebooks in order:

  • Part 1 - Data Exploration & Visualization: This notebook loads, cleans, explores, and visualizes the data file in the project.
  • Part 2 - Model Development: This notebook takes sentences and clusters them into groups based on the topic similarity between sentences.
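
To give a feel for what clustering sentences by topic similarity can look like, here is a small standard-library sketch. It is not the method used in the Part 2 notebook, which may rely on a different representation and algorithm. This version assumes bag-of-words vectors, cosine similarity, and a greedy single-pass assignment; the function names, the `threshold` value, and the example sentences are all hypothetical.

```python
import math
import re
from collections import Counter


def vectorize(sentence):
    """Lowercase bag-of-words counts for one sentence."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster_sentences(sentences, threshold=0.3):
    """Greedy single-pass clustering (illustrative only).

    Each sentence joins the existing cluster whose accumulated word
    counts are most similar to it, provided the similarity reaches
    `threshold`; otherwise it starts a new cluster.
    """
    centroids = []  # one accumulated Counter per cluster
    labels = []
    for sentence in sentences:
        vec = vectorize(sentence)
        scores = [cosine(vec, c) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(vec)
            labels.append(best)
        else:
            centroids.append(Counter(vec))
            labels.append(len(centroids) - 1)
    return labels


# Invented example sentences on two topics.
SENTENCES = [
    "river flows through a wide valley",
    "band released its second album",
    "spring floods fill river valley",
    "second album topped music charts",
]

labels = cluster_sentences(SENTENCES)
print(labels)  # → [0, 1, 0, 1]
```

A greedy pass like this is order-dependent and sensitive to the threshold, so it serves only as an illustration of the idea; a real model would typically use a stronger sentence representation and a standard clustering algorithm.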

You can review the completed notebooks here

Licenses