Merge branch 'main' into datafusion
Changed README.
Artem Sakhno committed Oct 8, 2024
2 parents 7086c4b + 6962a22 commit 2c59ae3
Showing 70 changed files with 780 additions and 237 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -14,7 +14,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pipenv
pip install "pipenv<2024.1.0"
pipenv sync --dev
- name: Test with pytest
run: |
7 changes: 7 additions & 0 deletions Dockerfile
@@ -0,0 +1,7 @@
FROM python:3.8.11-slim-bullseye
COPY --from=openjdk:11-jre-slim /usr/local/openjdk-11 /usr/local/openjdk-11
ENV JAVA_HOME=/usr/local/openjdk-11
COPY requirements.txt .
RUN pip install --upgrade pip==23.3
RUN pip install -r requirements.txt
CMD pytest -v ptls_tests/
25 changes: 25 additions & 0 deletions DockerfilePaper
@@ -0,0 +1,25 @@
FROM nvidia/cuda:11.1.1-runtime-ubuntu18.04

RUN apt-get update -y && \
apt-get install -y libblas3 liblapack3 liblapack-dev libblas-dev gfortran libatlas-base-dev cmake

RUN apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y python3.7 python3.7-dev python3-pip && \
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 2

RUN python3 -m pip install -U pip

RUN python3 -m pip install 'setuptools==60.5.0' 'Cython==0.29.26' 'typing_extensions==4.0.1'
RUN python3 -m pip install 'numpy==1.21.5'
RUN python3 -m pip install 'pythran' 'pybind11'
RUN python3 -m pip install 'scipy==1.7.3'
RUN python3 -m pip install 'luigi>=3.0.0' 'scikit-learn==1.0.2' 'pyarrow==6.0.1' 'pyspark==3.4.2' 'tqdm==4.62.3' \
'pandas==1.3.5' 'duckdb' 'pytest' 'pylint' 'coverage' 'pyhocon'
RUN python3 -m pip install 'torch==1.12.1' 'pytorch-lightning==1.6.5' 'torchmetrics==0.9.2' \
'hydra-core>=1.1.2' 'hydra-optuna-sweeper>=1.2.0' 'tensorboard==2.3.0' \
'omegaconf' 'transformers' 'lightgbm' 'wandb'

RUN python3 -m pip cache purge
RUN apt-get clean && rm -rf /var/lib/apt/lists/*
58 changes: 46 additions & 12 deletions README.md
@@ -50,18 +50,52 @@ pytest

## Demo notebooks

- Supervised model training [notebook](demo/supervised-sequence-to-target.ipynb)
- Self-supervised training and embeddings for downstream task [notebook](demo/coles-emb.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/coles-emb.ipynb)
- Self-supervised training and embeddings for clients' transactions [notebook](demo/transaction-emb.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/transaction-emb.ipynb)
- Self-supervised embeddings in CatBoost [notebook](demo/coles-catboost.ipynb)
- Self-supervised training and fine-tuning [notebook](demo/coles-finetune.ipynb)
- Self-supervised TrxEncoder only training with Masked Language Model task and fine-tuning [notebook](demo/mlm-emb.ipynb)
- Pandas data preprocessing options [notebook](demo/preprocessing-demo.ipynb)
- PySpark and Parquet for data preprocessing [notebook](demo/pyspark-parquet.ipynb)
- Fast inference on large dataset [notebook](demo/extended_inference.ipynb)
- Supervised multilabel classification [notebook](demo/multilabel-classification.ipynb)
- Text features demo:
- Using pretrained encoder to text features [notebook](demo/coles-pretrained-embeddings.ipynb)
Learn deep learning analysis of event sequences with Pytorch-Lifestream.

We have collected a set of topics related to the processing of event sequences. Most topics are supported by demo code that uses the ptls library. We recommend studying the topics sequentially; however, if you are already familiar with some areas, you can skip them and take only the relevant topics.

| ix | Topic | Description | Demo |
| ---- | --------------------------------------- | --------------------------------------- | ----- |
| 1. | Prerequisites | | |
| 1.1. | PyTorch | Deep Learning framework | https://pytorch.org/ |
| 1.2. | PyTorch-Lightning | NN training framework | https://lightning.ai/ |
| 1.3. | (optional) Hydra | Configuration framework | https://hydra.cc/ and [demo/Hydra CoLES Training.ipynb](./demo/Hydra%20CoLES%20Training.ipynb) |
| 1.4. | pandas | Data preprocessing | https://pandas.pydata.org/ |
| 1.5. | (optional) PySpark | Big Data preprocessing | [https://spark.apache.org/](https://spark.apache.org/docs/latest/api/python/index.html) |
| 2. | Event sequences | Problem statement and classical methods | |
| 2.1. | Event sequence for global problems | e.g. event sequence classification | TBD |
| 2.2. | Event sequence for local problems | e.g. next event prediction | TBD |
| 3. | Supervised neural networks | Supervised learning for event sequence classification | [demo/supervised-sequence-to-target.ipynb](./demo/supervised-sequence-to-target.ipynb) |
| 3.1. | Network Types | Different networks for sequences | |
| 3.1.1. | Recurrent neural networks | | TBD based on `supervised-sequence-to-target.ipynb` |
| 3.1.2. | (optional) Convolutional neural networks | | TBD based on `supervised-sequence-to-target.ipynb` |
| 3.1.3. | Transformers | | [demo/supervised-sequence-to-target-transformer.ipynb](demo/supervised-sequence-to-target-transformer.ipynb) |
| 3.2. | Problem types | Different problem types for sequences | |
| 3.2.1. | Global problems | Binary, multilabel, regression, ... | TBD based on [demo/multilabel-classification.ipynb](demo/multilabel-classification.ipynb) |
| 3.2.2. | Local problems | Next event prediction | [demo/event-sequence-local-embeddings.ipynb](demo/event-sequence-local-embeddings.ipynb) |
| 4. | Unsupervised learning | Pretrain a self-supervised model with some proxy task | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/coles-emb.ipynb) |
| 4.1. | (optional) Word2vec | Context-based methods | |
| 4.2. | MLM, RTD, GPT | Event-based methods | Self-supervised training and embeddings for clients' transactions [notebook](demo/event-sequence-local-embeddings.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/event-sequence-local-embeddings.ipynb) |
| 4.3. | NSP, SOP | Sequence based methods | [demo/nsp-sop-emb.ipynb](demo/nsp-sop-emb.ipynb) |
| 5. | Contrastive and non-contrastive learning | Latent representation-based losses | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.1. | CoLES | | [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.2. | VICReg | | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.3. | CPC | | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.4. | MLM, TabFormer and others | Self-supervised TrxEncoder only training with Masked Language Model | [demo/mlm-emb.ipynb](./demo/mlm-emb.ipynb) [demo/tabformer-emb.ipynb](demo/tabformer-emb.ipynb) |
| 6. | Pretrained model usage | | |
| 6.1. | Downstream model on frozen embeddings | | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 6.2. | CatBoost embeddings features | | [demo/coles-catboost.ipynb](demo/coles-catboost.ipynb) |
| 6.3. | Model finetuning | | [demo/coles-finetune.ipynb](./demo/coles-finetune.ipynb) |
| 7. | Preprocessing options | Data preparation demos | [demo/preprocessing-demo.ipynb](demo/preprocessing-demo.ipynb) |
| 7.1. | ptls-format parquet data loading | PySpark and Parquet for data preprocessing | [demo/pyspark-parquet.ipynb](demo/pyspark-parquet.ipynb) |
| 7.2. | Fast inference for a big dataset | | [demo/extended_inference.ipynb](demo/extended_inference.ipynb) |
| 8. | Special feature types | | |
| 8.1. | Using a pretrained encoder for text features | | [demo/coles-pretrained-embeddings.ipynb](demo/coles-pretrained-embeddings.ipynb) |
| 8.2. | Multi-source models | | [demo/CoLES-demo-multimodal-unsupervised.ipynb](demo/CoLES-demo-multimodal-unsupervised.ipynb) |
| 9. | Trx Encoding options | | |
| 9.1. | Basic options | | TBD |
| 9.2. | Transaction Quantization | | TBD |
| 9.3. | Transaction BPE | | TBD |

## Docs

9 changes: 3 additions & 6 deletions demo/coles-emb.ipynb
@@ -17,11 +17,7 @@
"source": [
"import sys\n",
"if 'google.colab' in str(get_ipython()):\n",
" ! {sys.executable} -m pip install pytorch-lifestream\n",
" ! {sys.executable} -m pip install -U 'torch<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'pytorch-lightning<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchvision<0.15.1' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchaudio<2' # downgrade for ptls==0.5.x"
" ! {sys.executable} -m pip install pytorch-lifestream"
]
},
{
@@ -432,7 +428,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=15,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=False,\n",
")"
]
14 changes: 6 additions & 8 deletions demo/event-sequence-local-embeddings.ipynb
@@ -33,10 +33,6 @@
"import sys\n",
"if 'google.colab' in str(get_ipython()):\n",
" ! {sys.executable} -m pip install pytorch-lifestream\n",
" ! {sys.executable} -m pip install -U 'torch<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'pytorch-lightning<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchvision<0.15.1' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchaudio<2' # downgrade for ptls==0.5.x\n",
"\n",
"clear_output()"
],
@@ -672,8 +668,6 @@
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=1)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=1)` instead.\n",
" rank_zero_deprecation(\n",
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True\n",
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
"INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs\n",
@@ -689,7 +683,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=50,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=False,\n",
")"
],
@@ -922,7 +917,10 @@
{
"cell_type": "code",
"source": [
"predict = pl.Trainer(gpus=1).predict(inference_module, inference_dl)"
"predict = pl.Trainer(\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\"\n",
").predict(inference_module, inference_dl)"
],
"metadata": {
"id": "yH5PKWyxN7Xk",
12 changes: 5 additions & 7 deletions demo/nsp-sop-emb.ipynb
@@ -17,11 +17,7 @@
"source": [
"import sys\n",
"if 'google.colab' in str(get_ipython()):\n",
" ! {sys.executable} -m pip install pytorch-lifestream\n",
" ! {sys.executable} -m pip install -U 'torch<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'pytorch-lightning<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchvision<0.15.1' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchaudio<2' # downgrade for ptls==0.5.x"
" ! {sys.executable} -m pip install pytorch-lifestream"
]
},
{
@@ -429,7 +425,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=15,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=True,\n",
")"
]
@@ -811,7 +808,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=15,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=True,\n",
")"
]
16 changes: 8 additions & 8 deletions docs/data_load/datasets.md
@@ -115,7 +115,7 @@ They take `i_filters` as a list of `iterable_processing` objects.
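
For example, a minimal sketch of passing `i_filters` (the filter class and the field names below are assumptions; check the exact signatures in your ptls version):

```python
import torch
from ptls.data_load.datasets import MemoryMapDataset
from ptls.data_load.iterable_processing import SeqLenFilter  # assumed filter class

# Two toy feature dicts; sequence length is the length of the arrays.
data = [
    {'event_time': torch.arange(30), 'amount': torch.rand(30)},
    {'event_time': torch.arange(10), 'amount': torch.rand(10)},  # dropped by the filter
]
ds = MemoryMapDataset(data, i_filters=[SeqLenFilter(min_seq_len=25)])
```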

### Augmentations

Sometimes we have to change an items from train data. This is `augmentations`.
Sometimes we have to change items from the train data. This is what `augmentations` do.
They are in `ptls.data_load.augmentations`.

Example:
@@ -138,11 +138,11 @@ Here the `RandomSlice` augmentation takes a random slice from the source record.
| Place it before the persist stage to run it once and save CPU resources | Don't place it before the persist stage because it kills the randomness |
| Can delete items | Cannot delete items |
| Can yield new items | Cannot create new items |
| Works a generator and requires iterable processing | Works as a function can be both map or iterable |
| Works as a generator and requires iterable processing | Works as a function; can be either map or iterable |
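
To make the contrast concrete, a minimal augmentation sketch (a sketch only; `RandomSlice` parameter names may differ between ptls versions):

```python
import torch
from ptls.data_load.augmentations import RandomSlice

aug = RandomSlice(min_len=16, max_len=64)  # assumed parameter names
rec = {'event_time': torch.arange(100), 'amount': torch.rand(100)}
rec_aug = aug(rec)  # returns a different random sub-slice of the record on each call
```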

## In memory data

In memory data is common case. Data can a list or generator with feature dicts.
In memory data is a common case. Data can be a list or a generator of feature dicts.

```python
import torch
@@ -184,7 +184,7 @@ def data_gen(n):
Both datasets support any kind of input: list or generator.
As all datasets support the same format (list or generator) for input and output, they can be chained.
This make sense for some cases.
This makes sense for some cases.
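
A chaining sketch under the assumptions above (hypothetical file name):

```python
from ptls.data_load.datasets import MemoryMapDataset, ParquetDataset

# Wrap an iterable parquet reader into a map-style dataset.
ds = MemoryMapDataset(ParquetDataset(['data/train/part-00000.parquet']))
```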

Data pipelines:

@@ -237,7 +237,7 @@ for batch in dl:

## Parquet file read

For large amount of data `pyspark` is possible engine to prepare data and convert it in feature dict format.
For a large amount of data, `pyspark` is a possible engine to prepare data and convert it into the feature dict format.
See `demo/pyspark-parquet.ipynb` for an example of data preprocessing with `pyspark` and parquet file preparation.

`ptls.data_load.datasets.ParquetDataset` is a dataset which reads parquet files with feature dicts.
@@ -249,8 +249,8 @@
- looks like a generator
- supports `i_filters`

You can feed `ParquetDataset` directly fo dataloader for `iterable` way of usage.
Cou can combine `ParquetDataset` with `MemoryMapDataset` to `map` way of usage.
You can feed `ParquetDataset` directly to a dataloader for the `iterable` way of usage.
You can combine `ParquetDataset` with `MemoryMapDataset` for the `map` way of usage.

`ParquetDataset` requires parquet file names. Usually `spark` saves many parquet files for one dataset,
depending on the number of partitions.
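
A minimal usage sketch, assuming `ParquetDataset` takes a list of file paths plus optional `i_filters` as described above (hypothetical paths and filter):

```python
from ptls.data_load.datasets import ParquetDataset
from ptls.data_load.iterable_processing import SeqLenFilter  # assumed filter class

ds = ParquetDataset(
    ['data/train/part-00000.parquet', 'data/train/part-00001.parquet'],
    i_filters=[SeqLenFilter(min_seq_len=25)],
)
for rec in ds:  # behaves like a generator of feature dicts
    pass
```
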
@@ -264,7 +264,7 @@ Many files for one dataset allow you to:

`ptls.data_load.datasets.PersistDataset` stores items from the source dataset in memory.

If you source data is iterator (like python generator or `ParquetDataset`)
If your source data is an iterator (like a Python generator or `ParquetDataset`),
all `i_filters` will be called each time you access the data.
Persist the data into memory and `i_filters` will be called only once.
Keep in mind that storing all dataset items may require a lot of memory.
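
A sketch of this trade-off, assuming `PersistDataset` wraps any iterable dataset as described (hypothetical path):

```python
from ptls.data_load.datasets import ParquetDataset, PersistDataset

source = ParquetDataset(['data/train/part-00000.parquet'])  # i_filters would run on every pass
ds = PersistDataset(source)  # items are materialized once and then served from memory
```
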
2 changes: 1 addition & 1 deletion docs/data_load/date_pipeline.md
@@ -1,6 +1,6 @@
# Data pipeline

All process support `map` and `iterable` data.
All processes support `map` and `iterable` data.

These are the steps in the pipeline:
