Feedback on kdic tutorials #78

Merged
merged 6 commits into from
Jan 28, 2025
2 changes: 1 addition & 1 deletion docs/learn/hardware_adaptation.md
@@ -4,7 +4,7 @@
This section presents the strategy implemented in Khiops for adapting the algorithms to the available hardware resources.
Please note that this feature requires the use of [Khiops dictionaries][dico].

-[dico]: ../tutorials/dictionaries.md
+[dico]: ../tutorials/kdic_intro.md

## General principles

2 changes: 1 addition & 1 deletion docs/learn/understand.md
@@ -70,7 +70,7 @@ One of the strengths of Khiops lies in its automation of this complex step. By l

!!! example "See what Khiops-built aggregates look like using our tutorials [here][tuto_aggregates]."

-[tuto_aggregates]: ../advanced/Notebooks/Use_in_any_ML_pipeline.ipynb "See the Jupyter Notebook"
+[tuto_aggregates]: ../tutorials/Notebooks/Use_in_any_ML_pipeline.ipynb "See the Jupyter Notebook"

## Interpretability

32 changes: 16 additions & 16 deletions docs/tutorials/introduction.md
@@ -1,6 +1,6 @@
# Getting Started with Khiops

-Welcome to the practical guide to Khiops. Whether you are exploring its capabilities for the first time or preparing for industrial-scale deployments, this section will help you understand **how Khiops streamlines and enhances your data science workflows**.
+Welcome to the practical guide to using Khiops. Whether you are exploring its capabilities for the first time or preparing for industrial-scale deployments, this section will help you understand **how Khiops streamlines and enhances your data science workflows**.

Unlike traditional machine learning libraries, Khiops is built on **a unique formalism and advanced automation capabilities** that fundamentally reshape the data science process. By automating tedious, repetitive and technically complex steps, Khiops allows users to **focus on the core objectives of data science**: understanding their data and solving meaningful business problems.

@@ -9,17 +9,17 @@ At the same time, this singular approach may feel unfamiliar to those accustomed
Here's what you'll find in this page:

- How Khiops [**accelerates your workflow**][crisp-dm]: Learn how Khiops transforms the traditional data science pipeline by simplifying complex processes and letting you focus on high-value tasks.
-- Choosing [**the right API**][two-apis]: The **sklearn-like** API is ideal for rapid experimentation, while the **core** API excels in production-scale applications. Determine which best suits your needs.
+- Choosing [**the right API**][two-apis]: The **Scikit-learn-like** API is ideal for rapid experimentation and prototyping, while the **core** API excels in production-scale applications. Determine which best suits your needs.

[crisp-dm]: #how-khiops-fits-into-the-data-science-workflow
[two-apis]: #two-apis-for-different-needs

This introduction will guide you through the foundational points and help you navigate the tutorials that follow. Here's an overview of the sections:

-- **Sklearn-like API tutorials**: Learn the basics of Khiops with examples on quickstarts, single- and multi-table examples, and hands-on notebooks that showcase Khiops' technical advantages, such as automated data preparation and multi-table processing.
+- **Scikit-learn-like API tutorials**: Learn the basics of Khiops with examples on quickstarts, single- and multi-table examples, and hands-on notebooks that showcase Khiops' technical advantages, such as automated data preparation and multi-table processing.
- **Core API & dictionaries**: Dive deeper into advanced capabilities, including dictionary usage for scalable, production-ready workflows.

-In the coming weeks, we will introduce a new **Deployment & Integration** section to cover advanced features that are already developed but still in the process of being documented. This will include native deployment on Kubernetes, as well as drivers for reading directly from HDFS, GCS, or S3 buckets.
+In the coming weeks, we will introduce a new "**Deployment & Integration**" section to cover advanced features that are already developed but still in the process of being documented. This will include native deployment on Kubernetes, as well as drivers for reading directly from HDFS, GCS, or S3 buckets.

!!! info
Questions about deploying Khiops in specific environments (e.g. Hadoop, Openshift, K8s) can be addressed in our [Q&A section][discussions] or through our [contact form][contact-form].
@@ -29,7 +29,7 @@ In the coming weeks, we will introduce a new “**Deployment & Integration**”

## How Khiops Fits into the Data Science Workflow

-Khiops introduces a streamlined and effective approach to data science, **simplifying every stage of the process** while providing advanced automation and robust formalism. Unlike traditional tools, Khiops enables you **to focus on what truly matters**: understanding your data, interpreting insights (the story your data tells), solving business problems, and deploying reliable models. Here's how you can leverage Khiops' unique features step by step:
+Khiops introduces a streamlined and effective approach to data science, **simplifying every stage of the process** while providing advanced automation and a robust formalism. Unlike traditional tools, Khiops enables you **to focus on what truly matters**: understanding your data, interpreting insights (the story your data tells), solving business problems, and deploying reliable models. Here's how you can leverage Khiops' unique features step by step:

- **Skip Data Cleaning**: Forget about spending hours on cleaning and formatting your data. Khiops reads raw data directly and handles common issues like missing values, inconsistent formats, or noisy inputs. For example, if your dataset contains missing values, Khiops automatically treats them as meaningful signals when training models.

@@ -39,7 +39,7 @@ Khiops introduces a streamlined and effective approach to data science, **simpli

- **Skip Variable Encoding**: Before using variables in a machine learning model, they often need to be transformed into a format the algorithm can process (e.g. categorical variables must be converted into numerical representations). Khiops eliminates this complexity with its MODL formalism, which automatically encodes categorical and numerical variables into statistically optimal groups or intervals.

-For example, instead of manually binning a variable like `age`, Khiops will determine ranges like [0,18], ]18, 35], ]35, 50], etc. These intervals are not arbitrary but are optimally chosen according to the target variable, effectively building a univariate classifier.
+For example, instead of manually binning a variable like `age`, Khiops will determine ranges like [0, 18], ]18, 35], ]35, 50], etc. These intervals are not arbitrary but are optimally chosen according to the target variable, indeed building a univariate classifier.

!!! example "Explore the [**Optimal Encoding**][optimal_encoding] tutorial and learn more about the concept on the [**Optimal Encoding**][encoding_foundations] foundations page."
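
To make the interval notation in the `age` example concrete, here is a minimal sketch of how a value maps onto ranges such as [0, 18], ]18, 35], ]35, 50]. It only illustrates the lookup step: the bounds are copied from the example text, the `age_interval` helper and the final open-ended label are hypothetical, and this is not how Khiops computes its intervals (the MODL approach chooses the bounds from the data and the target variable).

```python
from bisect import bisect_left

# Hypothetical upper bounds taken from the example in the text. Khiops would
# derive its own bounds from the data and the target variable.
UPPER_BOUNDS = [18, 35, 50]
# One label per interval; the last, open-ended one is an assumption added so
# that every non-negative age receives a label.
LABELS = ["[0, 18]", "]18, 35]", "]35, 50]", "]50, +inf["]


def age_interval(age):
    """Return the label of the interval containing `age`.

    Intervals are open on the left and closed on the right, so a value equal
    to an upper bound belongs to the interval that ends at that bound.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_left returns the index of the first bound >= age.
    return LABELS[bisect_left(UPPER_BOUNDS, age)]
```

For instance, `age_interval(18)` falls in the first interval while `age_interval(19)` falls in the second, which is the behaviour the `]18, 35]` notation describes.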

@@ -50,7 +50,7 @@ Khiops introduces a streamlined and effective approach to data science, **simpli

Khiops performs feature engineering in a supervised manner, ensuring that new features are relevant to the target variable, with quasilinear complexity that enables scaling efficiently to large datasets. By balancing model complexity with statistical significance, Khiops avoids overfitting while generating informative aggregates.

-For example, Khiops can automatically calculate metrics like total purchases per customer or average transaction amount per week when working with a sales dataset.
+For example, Khiops can automatically calculate metrics like "total purchases per customer" or "average transaction amount per week" when working with a sales dataset.

!!! example "Explore the [**Auto Feature Engineering**][autofeature_tuto] tutorial and learn about the methodology in the [dedicated][autofeature] foundations section."
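
As a rough illustration of the two aggregates named in the sales example, the sketch below computes them by hand in plain Python. The `TRANSACTIONS` table, its column layout, and both helper functions are hypothetical, and "total purchases" is read here as the total purchase amount. Khiops generates and selects such aggregates automatically, so this only shows what the resulting features mean, not how Khiops builds them.

```python
from collections import defaultdict
from datetime import date

# Hypothetical secondary table: (customer_id, transaction_date, amount).
TRANSACTIONS = [
    ("c1", date(2025, 1, 6), 20.0),
    ("c1", date(2025, 1, 8), 40.0),
    ("c1", date(2025, 1, 14), 30.0),
    ("c2", date(2025, 1, 7), 10.0),
]


def total_purchases_per_customer(transactions):
    """Sum the transaction amounts for each customer."""
    totals = defaultdict(float)
    for customer_id, _, amount in transactions:
        totals[customer_id] += amount
    return dict(totals)


def average_amount_per_week(transactions):
    """Average the transaction amounts within each ISO (year, week)."""
    amounts_by_week = defaultdict(list)
    for _, day, amount in transactions:
        iso = day.isocalendar()
        amounts_by_week[(iso[0], iso[1])].append(amount)
    return {
        week: sum(amounts) / len(amounts)
        for week, amounts in amounts_by_week.items()
    }
```

Writing each aggregate by hand like this is the manual step the paragraph above says Khiops removes.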

@@ -75,30 +75,30 @@ Khiops introduces a streamlined and effective approach to data science, **simpli
[demo_viz]: ../setup/demovisualization.md
[setup_viz]: ../setup/visualization.md

-With Khiops simplifying every stage of the data science workflow, the next step is choosing the right API for your needs. Whether you're exploring datasets or preparing for industrial-scale deployments, Khiops offers two powerful options: the sklearn-like API and the core API. Let's dive into their differences and find the best fit for your projects.
+With Khiops simplifying every stage of the data science workflow, the next step is choosing the right API for your needs. Whether you're exploring datasets or preparing for industrial-scale deployments, Khiops offers two powerful options: the Scikit-learn-like API and the core API. Let's dive into their differences and find the best fit for your projects.

## Two APIs for different needs

-Khiops offers two APIs tailored to different use cases: the sklearn-like API and the core API. While both leverage Khiops' unique strengths, they are optimized for distinct stages of the data science workflow and scaling requirements.
+Khiops offers two APIs tailored to different use cases: the Scikit-learn-like API and the core API. While both leverage Khiops' unique strengths, they are optimized for distinct stages of the data science workflow and scaling requirements.

-### Sklearn-like API: Quick prototyping and integration
+### Scikit-learn-like API: For quick prototyping and integration

-The sklearn-like API is ideal for data scientists familiar with Python and the sklearn ecosystem. It provides an accessible entry point for experimenting with Khiops' key features, including multi-table support and automated feature engineering.
+The scikit-learn-like API is ideal for data scientists familiar with Python and the Scikit-learn ecosystem. It provides an accessible entry point for experimenting with Khiops' key features, including multi-table support and automated feature engineering.

| :white_check_mark: **Advantages** | :red_square: **Limitations** |
|-----|-----------------|
-| **Familiar syntax**: Designed for immediate use with standard sklearn workflows, making onboarding effortless. | **High I/O requirements**: Data loading and processing rely on Python and Pandas, which can be memory-intensive. |
-| **Ecosystem integration**: Acts as a standard sklearn estimator, enabling easy integration with other tools (e.g., pyCaret for benchmarking). | **Scalability constraints**: Not optimized for large-scale datasets as it does not support Khiops out-of-core processing. |
+| **Familiar syntax**: Designed for immediate use with standard Scikit-learn workflows, making onboarding effortless. | **High I/O requirements**: Data loading and processing rely on Python and Pandas, which can be memory-intensive. |
+| **Ecosystem integration**: Acts as standard Scikit-learn estimators, enabling easy integration with other tools (e.g. pyCaret for benchmarking). | **Scalability constraints**: Not optimized for large-scale datasets as it does not support Khiops out-of-core processing. |
| **Feature testing**: Lets you explore Khiops' multi-table capabilities and auto feature engineering, supporting star or snowflake schemas (with some limitations). | **Limited support for key Khiops features**: limited expressiveness of multi-table schemas and data management capabilities. |

-### Core API: Production-ready and scalability
+### Core API: Production-ready and scalable

The core API unleashes the full power of Khiops, offering unmatched scalability and flexibility for industrial-scale projects. Its rich dictionary-based formalism supports complex multi-table databases and facilitates streamlined data management for production use.

| :white_check_mark: **Advantages** | :red_square: **Limitations** |
|-------------------------------------------------------------------------------|-------------------------------------------------------------------------------------|
| **Rich data description**: The dictionary formalism provides a structured, detailed way to describe data and especially multi-table relationships, enabling efficient data processing on the fly. | **Learning Curve**: The dictionary formalism introduces new concepts, requiring users to invest time in learning its syntax and structure. |
-| **Advanced data management**: Automates business-level transformations such as aggregate creation, variable selection, and example filtering, all within the API. It can also act as a highly efficient ETL tool. | |
+| **Advanced data management**: Automates business-level transformations such as aggregate creation, variable selection, and example filtering, all within the API. It can also act as a highly efficient ETL tool. | **Learning Curve**: The core API is tailored to Khiops and thus provides a different experience as compared to Scikit-learn APIs. |
| **Facilitated versioning**: Dictionaries serve as centralized, versionable configurations for data transformations and model definitions, ensuring traceability. | |
| **Seamless production deployment**: Models trained with the core API are ready for deployment (via the output dictionary file), ensuring robust integration into production workflows. | |
| **Out-of-core processing**: Optimized for hardware resource usage, handling datasets that exceed memory limits efficiently. | |
@@ -107,6 +107,6 @@ The core API unleashes the full power of Khiops, offering unmatched scalability

Choosing the Right API:

-- Start with the sklearn-like API if you're exploring Khiops' capabilities on small datasets or need a quick, familiar way to test models within the Python ecosystem.
+- Start with the Scikit-learn-like API if you're exploring Khiops' capabilities on small datasets or need a quick, familiar way to test models within the Python ecosystem.
- Move to the core API for production-grade scalability, complex data management needs, or when working with large, multi-table datasets.
