Prepare V11 version : internal release notes #223

Open
marcboulle opened this issue Apr 5, 2024 · 3 comments
Labels: Priority/0 To do NOW

Comments

@marcboulle
Collaborator

Related to the global steering issue "Prepare V11 version"
#6

The goal is to publish the internal release notes for Khiops V11, which may affect the whole ecosystem:

  • steering: sharing information about the new features
  • pykhiops
    • take into account as early as possible (starting with versions 10.2.x) what is deprecated and will be removed in V11, in order to warn users
    • evolve pykhiops core to account for the parameter changes (added or removed) and the new scenarios
  • visualization tools: evolve according to the new features (well advanced, to be finalized)
  • automl
  • documentation
  • ...

The internal release notes are complete and in sync with version 10.5.0-a1, and will hardly change any further.
Proposal to be discussed:

  • publish these internal release notes on a wiki page of the khiops repo
    • if they evolve, add sub-sections describing the changes, in sync with khiops tags
  • each "client" repo creates an issue if needed to take these changes into account
@alexisbondu
Collaborator

This document will be written directly as a comment on this issue. Also open an issue in pyKhiops to warn about the deprecated features.

@marcboulle
Collaborator Author

marcboulle commented Apr 11, 2024

Preparation of Khiops V11

New features of Khiops V11: see the next comment, Khiops 11.0 internal release notes

Remaining work for V11

  • Khiops
    • collection of the most frequent tokens for the construction of text features
    • coclustering instances x variables: optimization improvements, if available
    • taking into account the feedback from a functionally complete beta-test release
  • Khiops visualization
    • histograms: visualize the series of simplified histograms
    • regression trees: finalization
  • Khiops covisualization
    • coclustering instances x variables: improvements of the visualized information, if available
  • pykhiops:
    • support for the new features
  • documentation
    • coverage of the new features

Included in V10.2.x

The following evolutions developed for V11 are backported to the V10.2.0 branch

  • New option in khiops executables: -s to obtain system information
  • Khiops covisualization: fixes for existing bugs

Update of the Khiops 11.0 internal release notes

Reference: the next comment, titled Khiops 11.0 internal release notes
Updates:

  • as needed, as the remaining work is completed
  • by describing the new features in the history below

Update history

  • initialization: populated by reviewing the commit notes
    • sources
      • former version.txt file
        • LearningDoc\ProjectManagement\KhiopsHistoricalProject2023\Learning\Doc\version.txt
        • from V10.2.0i up to, but not including, V10.4.2i
      • git log of the KhiopsML/khiops GitHub repository
        • since V10.4.2i
  • dd/mm/2024: details of the new features
11/04/2024: initialization, from 10.2.0i to 10.5.0-a1

up to the commit "Merge pull request #227 from KhiopsML/196-assertion-violated-in-kwprobabilitytabletest"

24/05/2024: 10.5.0-a.1 to 10.5.0-b.1
  • the return code is now always 0 if OK, 1 otherwise (return code 2 is no longer used)
  • the notes now specify what will already be available in a 10.2.x version
24/07/2024: 10.5.0-b.1 to 10.5.2-b.0
  • availability of the sparse SNB
06/01/2024: 10.5.2-b.0 to 10.6.0-b.0
  • main features
    • availability of regression trees
    • availability of Khiops interpretation, in the menu "Tools/Interpret model"
      • construction of an interpretation model
      • construction of a reinforcement model
    • simplification of data paths for multi-table schemas
      In a multi-table schema, each data path refers to a table or entity variable and identifies a data table file.
      The main table has an empty data path.
      In a star schema, the data paths are the names of table or entity variables for each secondary table.
      In a snowflake schema, data paths consist of a list of variable names with a '/' separator.
      External tables begin with a data root prefixed with '/', which refers to the name of the referenced root dictionary.
    • extension of the scenario-based management of Khiops, with control structures and a parameter file in json format
  • other features
    • end of support for the obsolete .khc format for coclustering
    • added a menu in the sub-window dedicated to dictionary management
      • same menu as in the main window (except the items "Manage dictionaries" and "Quit")
      • removed the "Inspect current dictionary" button, keeping only the contextual menu
    • renamed "Visualize results" to "Visualize report"
    • management of Khiops via scenarios in batch mode, with a fast exit as soon as an application error is detected
      • exit with exit code 0, with the first detected application error written to the log
      • this avoids exiting with a fatal error, which is rather reserved for a truly invalid scenario
    • evolution of Khiops_env to ease integration (pykhiops, docker...)
    • construction of a minimal report, even in case of a preparation error
    • choice of a more interpretable default granularity for histograms
    • dictionary and uniqueness:
      • any multi-table dictionary with an Entity or Table sub-table corresponds to records that must be unique, as for Root dictionaries
      • this is now checked, with warnings in case of duplicates
    • dictionary and graph
      • a sub-table of a multi-table schema can now be analyzed, even if it references the Root table of its schema (this used to fail)
    • dictionary, labels and comments
      • comments can now be associated, in addition to labels, with each of the following entities of the dictionary language
        • dictionary
          • label: the first line comment present before the dictionary declaration, acting as a title before the comments
          • comments: all line comments present before the start '{' of the dictionary declaration block
          • internalComments: the comments following the last variable, before the end '}' of the block
        • variable
          • comments: all line comments present before the variable declaration
          • label: the end-of-line comment, at the end of the variable declaration
        • variable block
          • comments: all line comments present before the start '{' of the variable block
          • internalComments: all line comments present before the end '}' of the variable block
          • label: the end-of-line comment, at the end of the variable block declaration
      • impacts in .kdicj files
        • optional fields associated with the corresponding entities
          • label: string
          • comments: list of strings
          • internalComments: list of strings

Beta-test release

@marcboulle
Collaborator Author

marcboulle commented Apr 11, 2024

Khiops 11.0 internal release notes

The purpose of the internal release notes is:

  • to give all detailed evolutions and corrections potentially useful for the Khiops ecosystem
  • to allow pykhiops and AutoML to adapt in advance to the functional parts of these evolutions
  • to be the base for the file whatsnewV10.0.txt, the "official" release note (quick summary)

These release notes follow the last version of Khiops, described in the Khiops 10.2 release notes.

Khiops 11.0 is a major version, with several major functional improvements.

Major improvements

Text data

  • new Text type for variables in tabular or multi-table schema
  • Automatic feature construction from Text variables

SNB classifier for sparse data

  • extension to sparse data

Random forests for regression

Khiops interpretation

  • Instance-based interpretation of scores
  • Exact computation of Shapley values
  • Build an interpretation dictionary, to deploy interpretation values
  • Build a reinforcement dictionary, to deploy reinforcement scores based on lever variables

Histograms

  • Optimal histograms for univariate data exploration

Coclustering instances x variables

  • extension of existing variable x variable coclustering, for joint density estimation
  • to instances x variables coclustering, for exploratory analysis

New visualization tools

  • visualization
    • new panel to visualize histograms
  • covisualization
    • accounting for the case of instances x variables coclustering

Simplified ergonomy

  • simplification of panels and fields, everywhere, as much as possible
  • fast path: to train a model without a dictionary
  • results visualization and edition of dictionaries from the graphical interface

Extended scenario-based management of Khiops, with control structures and a parameter file in json format

Detailed evolutions

Functional improvements

Text data

  • new type Text available in Khiops dictionaries
    • Text variables can contain up to 1000000 bytes
    • Categorical variables are now limited to 1000 bytes
  • type detected in automatic "build dictionary" feature
  • automatic feature construction
    • parameter "number of text features", with default value 10000
    • text features:
      • words: default automatic tokenization
      • ngrams: black-box using ngrams of bytes, for blob-like variables
      • tokens: open to user defined tokenization
  • new derivation rules for Text variables
    • TextLoadFile: load a Text variable from a text file, up to 1000000 chars, replacing end of lines by whitespaces
    • FromText, ToText: conversion with categorical variables
    • rules similar to those related to categorical variables:
      • TextLength, TextLeft, TextRight, TextMiddle
      • TextTokenLength, TextTokenLeft, TextTokenRight, TextTokenMiddle
      • TextTranslate, TextSearch, TextReplace, TextReplaceAll
      • TextRegexMatch, TextRegexSearch, TextRegexReplace, TextRegexReplaceAll
      • TextToUpper, TextToLower
      • TextConcat, TextHash, TextEncrypt
      • GetText(Entity, Text)
  • new type TextList: list of Text variables, to avoid scalability problems when concatenating Text variables from a corpus
    • dedicated derivation rules
      • creation
        • TextList(text1, text2, …)
        • TextListConcat(textList1, textList2, …)
      • Inspection
        • TextListSize, TextListAt
      • extract from sub-tables
        • GetTextList
        • TableAllTexts
        • TableAllTextLists
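
As a rough illustration of the size limits and of the "ngrams of bytes" idea above, here is a minimal Python sketch. It is not the Khiops implementation: the helper names `fits_type` and `byte_ngrams` are hypothetical, and measuring lengths on the UTF-8 encoding is an assumption made for the example.

```python
from collections import Counter

# Limits quoted in the notes above; the helper names below are hypothetical.
MAX_TEXT_BYTES = 1_000_000
MAX_CATEGORICAL_BYTES = 1_000


def fits_type(value: str, type_name: str) -> bool:
    """Return True if the UTF-8 encoding of value fits the byte limit of the type."""
    limit = MAX_TEXT_BYTES if type_name == "Text" else MAX_CATEGORICAL_BYTES
    return len(value.encode("utf-8")) <= limit


def byte_ngrams(value: str, n: int) -> Counter:
    """Count the n-grams of bytes found in the UTF-8 encoding of value."""
    data = value.encode("utf-8")
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


long_value = "x" * 1500
print(fits_type(long_value, "Categorical"))
print(fits_type(long_value, "Text"))
print(byte_ngrams("abab", 2).most_common(1))
```

A value that exceeds the Categorical limit would presumably need to be declared as Text in the dictionary to be kept whole.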

Optimal histograms

  • by default in unsupervised learning (no target variable), the new MODL preprocessing methods are activated
    • numerical variables: optimal histograms are built for accurate density estimation and useful exploratory analysis
    • categorical variables: an optimal number of frequent values is kept, with the rare values in a default group
  • former unsupervised preprocessing methods can still be used if specified
    • discretization method: MODL (optimal), EqualWidth, EqualFrequency, None
      • EqualWidth: bounds are now computed as exact equal-width bounds, without discarding empty intervals
    • grouping method: MODL (optimal), Basic grouping, None
    • grouping method: MODL (optimal), Basic grouping, None
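
To make the EqualWidth remark concrete, the following Python sketch computes exact equal-width bounds and keeps the empty intervals. It only illustrates the general technique; the function name `equal_width_histogram` is hypothetical and the code is not taken from Khiops.

```python
def equal_width_histogram(values, interval_number):
    """Build equal-width bounds over [min, max] and count values per interval.

    Empty intervals are kept, so the result always has interval_number intervals.
    """
    lower, upper = min(values), max(values)
    width = (upper - lower) / interval_number
    bounds = [lower + i * width for i in range(interval_number)] + [upper]
    frequencies = [0] * interval_number
    for value in values:
        if width > 0:
            # The maximum value falls in the last interval.
            index = min(int((value - lower) / width), interval_number - 1)
        else:
            index = 0
        frequencies[index] += 1
    return bounds, frequencies


bounds, frequencies = equal_width_histogram([0, 1, 2, 10], 5)
print(bounds)
print(frequencies)
```

With this toy sample, the intervals between 4 and 8 contain no value but still appear in the output with a frequency of 0.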

Preprocessing

  • in supervised learning, MODL is now the only available method
    • all other alternative methods are removed
  • max part number is now the only constraint that can be specified
    • it is a "universal" constraint that applies to all preprocessing methods
      • discretization/grouping
      • supervised/unsupervised
      • univariate/bivariate

Extend max year from 4000 to 9999 in timestamps

  • allow better automatic type recognition when year 9999 is used in databases
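
A small Python sketch of why the upper bound matters: a sentinel date in year 9999 is rejected with the former limit of 4000 and accepted with the new one. The function `parse_timestamp` and the format `YYYY-MM-DD HH:MM:SS` are assumptions chosen for the example, not a description of the Khiops type detection.

```python
from datetime import datetime

FORMER_MAX_YEAR = 4000  # limit before this change, per the notes above
NEW_MAX_YEAR = 9999


def parse_timestamp(text, max_year):
    """Parse 'YYYY-MM-DD HH:MM:SS'; return None if invalid or beyond max_year."""
    try:
        value = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return value if value.year <= max_year else None


sentinel = "9999-12-31 23:59:59"
print(parse_timestamp(sentinel, FORMER_MAX_YEAR))
print(parse_timestamp(sentinel, NEW_MAX_YEAR))
```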

Khiops visualization

See Khiops visualization release notes

Khiops covisualization

See Khiops covisualization release notes

Khiops reports files .khj

Extensions of json format

  • section "variable statistics"
    • new field "parts" in the case of unsupervised learning
    • field "missingNumber" is now also available for categorical variables
    • new field "sparseMissingNumber" to count the number of present values in sparse data blocks (technical field, not visualized)
  • "variablesDetailedStatistics"
    • new sub-section "modlHistograms" in the case of unsupervised learning with MODL optimal histograms for numerical variables
      • "histogramNumber": number of available histograms, sorted by increasing granularities
      • "intrepretableHistogramNumber": number of interpretable histograms (potentially one histogram less)
      • "truncationEpsilon": truncation epsilon used by the TMH (Truncation Management Heuristic) (0 if no truncation detected in data)
      • "removedSingularIntervalNumber": number of singular intervals removed from the finest histogram to obtain the first interpretable histogram
      • "granularities": vector of histogram granularities
      • "intervalNumbers": vector of histogram interval numbers
      • "peakIntervalNumbers": vector of histogram peak interval numbers
      • "spikeIntervalNumbers": vector of histogram spike interval numbers
      • "emptyIntervalNumbers": vector of histogram empty interval numbers
      • "levels": vector of histogram levels
      • "informationRates": vector of histogram information rates (between 0 and 100 for interpretable histograms)
      • "histograms": array of histograms
        • each histogram is a sub-object described by the following vectors
          • "bounds": interval bounds
          • "frequencies": interval frequencies
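
The sketch below shows how a client such as pykhiops might read the "histograms" array described above. The field names come from these notes, but the fragment itself is toy data with invented values, and the helper names `check_histogram` and `densities` are hypothetical.

```python
import json

# Toy fragment: field names come from the notes above, values are invented.
fragment = json.loads("""
{
  "histogramNumber": 2,
  "intervalNumbers": [1, 3],
  "histograms": [
    {"bounds": [0, 10], "frequencies": [100]},
    {"bounds": [0, 2, 5, 10], "frequencies": [20, 60, 20]}
  ]
}
""")


def check_histogram(histogram):
    """Check that a histogram has one more bound than it has intervals."""
    return len(histogram["bounds"]) == len(histogram["frequencies"]) + 1


def densities(histogram):
    """Convert interval frequencies to densities: frequency / (total * width)."""
    bounds = histogram["bounds"]
    frequencies = histogram["frequencies"]
    total = sum(frequencies)
    return [
        frequency / (total * (upper - lower))
        for frequency, lower, upper in zip(frequencies, bounds[:-1], bounds[1:])
    ]


for histogram in fragment["histograms"]:
    if check_histogram(histogram):
        print(len(histogram["frequencies"]), densities(histogram))
```

Densities rather than raw frequencies are used here because the intervals of an optimal histogram generally have unequal widths.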

Khiops coclustering reports files .khcj

Extended to support instances x variables coclustering

The support of deprecated format .khc is removed

Simplified ergonomy

Simplification of data paths for multi-table schema in Khiops desktop

  • In a multi-table schema, each data path refers to a table or entity variable and identifies a data table file.
    The main table has an empty data path.
    In a star schema, the data paths are the names of table or entity variables for each secondary table.
    In a snowflake schema, data paths consist of a list of variable names with a '/' separator.
    External tables begin with a data root prefixed with '/', which refers to the name of the referenced root dictionary.
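
The data path rules above can be sketched in Python as follows. The function `data_path` and the variable and dictionary names (`Services`, `Usages`, `City`, `Addresses`) are hypothetical, and the way a sub-table is appended after an external data root is an assumption.

```python
def data_path(variable_names, external_root=None):
    """Build a data path from the table or entity variable names leading to a table.

    variable_names: list of variable names, empty for the main table.
    external_root: name of the referenced root dictionary, for external tables.
    """
    path = "/".join(variable_names)
    if external_root is None:
        return path
    # External tables: data root prefixed with '/'.
    # Assumption: sub-tables of the external table are appended with '/'.
    return "/" + external_root + ("/" + path if path else "")


print(repr(data_path([])))                       # main table
print(data_path(["Services"]))                   # star schema
print(data_path(["Services", "Usages"]))         # snowflake schema
print(data_path([], external_root="City"))       # external table
```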

Khiops

  • simplified management of dictionaries
    • removed pane "Data dictionary"
    • extended menu "Data dictionary"
      • new menu item "Reload"
      • new menu item "Dictionary management": open a dialog box similar to former "Data dictionary" pane
    • new dialog box "Dictionary management"
      • similar to a simplified version of former "Data dictionary" pane
      • new button "Edit dictionary file", to open the dictionary file using a text editor
  • simplified pane "Train database"
    • new fields "Analysis dictionary" and "Dictionary file", replacing the related fields in former "Data dictionary pane"
    • simplified layout: sub-panes for "Sampling" and "Selection" specifications
  • fast path for first analysis of database without specifying a dictionary
    • just fill in the "Data base file" field, then click on "Train model" to
      detect the file format, automatically build the dictionary and train a model
  • extended menu "Help"
    • new sub-menu "Quick start"
  • simplified pane "Parameters"
    • sub-pane "Predictors/Feature engineering"
      • new field "Keep selected variables only": to keep in reports only the constructed variables selected by the SNB predictor
      • new field "Max number of text features": maximum number of features constructed from Text variables (default: 10000)
      • field "Max number of constructed variables": default number of variables constructed from multi-table schema is now 1000
    • sub-pane "Predictors/Advanced predictor parameters"
      • new field "Do data preparation only"
        • removed fields:
          • "Selective Naive Bayes": trained, except if "Do data preparation only" is triggered
          • "Baseline predictor": never used in classification, always provided in regression
          • "Number of univariate predictors": suppressed
      • new button "Text feature parameters"
        • open a dialog box "Text feature parameters"
          • field "Text features", with three choices: words, ngrams, tokens
      • removed former button "Selective Naive Bayes parameters"
        • former "Selective Naive Bayes" dialog box now directly in the layout
    • sub-pane "Preprocessing"
      • removed sub-pane "Discretization" (4 fields)
      • removed sub-pane "Value grouping" (4 fields)
      • new field "Max part number": universal constraint on all preprocessings, univariate/bivariate discretization/value grouping
      • new button "Advanced parameters"
        • open a dialog box "Unsupervised parameters", with 2 fields (only remaining parameters)
          • "Discretization method": among "MODL", "EqualWidth", "EqualFrequency", "None"
          • "Grouping method": among "MODL", "BasicGrouping", "None"
    • sub-pane "System parameters"
      • removed field "Max number of items in reports"
  • simplified pane "Results"
    • now only two fields
      • "Analysis report": replace former fields "Results files directory" and "Result files prefix"
      • "Short description"
    • and two buttons
      • "Export as xls": replace all former .xls reports fields
      • "Visualize report": new button to open the visualization tool directly
  • menu "Tool"
    • new sub-menu "Interpret models"
      • open a dialog box "Interpret model"
        • allows building an interpretation dictionary, to compute the Shapley values
  • simplified tool dialog boxes
    • "Deploy model"
      • simplified layout with "Sampling" and "Selection" sub-panes, as in the "Train database" pane
    • "Evaluate model"
      • simplified layout with "Sampling" and "Selection" sub-panes, as in the "Train database" pane
      • the evaluation report is now with format .khj, with a button "Export as xls"

Khiops coclustering

  • simplifications, similar to those of the Khiops tools
    • simplified management of dictionaries
    • fast path for first analysis of database without specifying a dictionary
    • simplified pane "Database"
    • extended menu "Help"
    • simplified pane "Results"
  • new options to build instances x variables coclusterings in pane "Parameters"
    • field "Coclustering type", to choose between "Variable coclustering" and "Instances x Variables coclustering"
    • new sub-pane "Parameters/Instances x variables parameters"
  • simplified tool dialog boxes
    • "Simplify coclustering"
      • removed pane "Results"
        • new field "Simplified coclustering report" added at the top of the dialog box
    • "Extract clusters"
      • removed panes "Cluster parameters" and "Results"
        • new fields "Coclustering variable" and "Cluster table file" added at the top of the dialog box
    • "Prepare deployment"
      • removed pane "Results"
        • new field "Coclustering variable" added at the top of the dialog box

Khiops dictionaries

Comments are now allowed in addition to labels with each of the following entities in the dictionary language, using '//' as a prefix either for a full line or an end-of-line:

  • dictionary
    • label: the first line comment present before the dictionary declaration, acting as a title before the comments
    • comments: all comment lines present before the start ‘{’ of the dictionary declaration block
    • internalComments: the line comments following the last variable, before the end ‘}’ of the block
  • variable
    • comments: all line comments before the variable declaration
    • label: the end-of-line comment, at the end of the variable declaration
  • variable block
    • comments: all line comments before the start ‘{’ of the variable block
    • internalComments: all line comments before the end ‘}’ of the variable block
    • label: the end-of-line comment at the end of the variable block declaration

Impacts in .kdicj dictionary files

  • optional fields associated with the corresponding entities
    • label: string
    • comments: list of strings
    • internalComments: list of strings
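
Since the three fields are optional, a reader of .kdicj files should tolerate their absence. A minimal Python sketch, assuming a hypothetical entity object (the "name" key and all values are invented for the example, and `read_annotations` is a hypothetical helper):

```python
import json

# Hypothetical entity object: only "label", "comments" and "internalComments"
# come from the notes above; "name" and the values are invented.
entity = json.loads(
    '{"name": "Customer", "label": "Customer table", "comments": ["Loaded monthly"]}'
)


def read_annotations(entity):
    """Return label, comments and internalComments, with defaults when absent."""
    return {
        "label": entity.get("label", ""),
        "comments": entity.get("comments", []),
        "internalComments": entity.get("internalComments", []),
    }


print(read_annotations(entity))
```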

Integration improvements

Extended scenario-based management of Khiops, with control structures and a parameter file in json format

Piloting of Khiops via scenarios in batch mode, with rapid exit as soon as an application error is detected

A new environment variable KHIOPS_API_MODE is available for better integration with pykhiops API

  • by default the variable is not set, as in the Khiops desktop tool:
    • result files are stored in the directory of the input database if their path is relative
    • suffixes are imposed where necessary
  • if KHIOPS_API_MODE is set to true (e.g. in pykhiops), result file names are used as is
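
The behavior described above can be mimicked in Python as follows. This is only a sketch of the stated rule, not Khiops code: the function `resolve_result_path` is hypothetical, the file names are invented, the suffix rule is not modeled, and POSIX-style paths are assumed.

```python
import os
from pathlib import PurePosixPath


def resolve_result_path(result_name, database_path, environ=None):
    """Mimic the rule above for locating a result file (suffix rule not modeled)."""
    environ = os.environ if environ is None else environ
    api_mode = environ.get("KHIOPS_API_MODE", "").lower() == "true"
    if api_mode or PurePosixPath(result_name).is_absolute():
        # API mode, or absolute path: the name is used as is.
        return result_name
    # Default mode: relative names are placed next to the input database.
    return str(PurePosixPath(database_path).parent / result_name)


database = "/data/adult/Adult.txt"
print(resolve_result_path("report.khj", database, environ={}))
print(resolve_result_path("report.khj", database, environ={"KHIOPS_API_MODE": "true"}))
```

Passing the environment as a parameter keeps the example testable without modifying the variables of the real process.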

Parallelization of new algorithms

Performance improvement

I/O performance improvement

Reliability improvement

The modeling results have been stabilised and are now independent of the platform.

New internal derivation rules

Impact in KhiopsGuide, section "8. Appendix: variable blocks and sparse data management"

New internal derivation rules

  • DataGridBlock
  • DataGridStatsBlock

Bug fixes

Many minor fixes
