Merge branch 'main' into datafusion
Changed README.
Artem Sakhno committed Oct 8, 2024
2 parents 7086c4b + 6962a22 commit 2c59ae3
Showing 70 changed files with 780 additions and 237 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/main.yml
@@ -14,7 +14,7 @@ jobs:
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install pipenv
pip install "pipenv<2024.1.0"
pipenv sync --dev
- name: Test with pytest
run: |
7 changes: 7 additions & 0 deletions Dockerfile
@@ -0,0 +1,7 @@
FROM python:3.8.11-slim-bullseye
COPY --from=openjdk:11-jre-slim /usr/local/openjdk-11 /usr/local/openjdk-11
ENV JAVA_HOME=/usr/local/openjdk-11
COPY requirements.txt .
RUN pip install --upgrade pip==23.3
RUN pip install -r requirements.txt
CMD pytest -v ptls_tests/
25 changes: 25 additions & 0 deletions DockerfilePaper
@@ -0,0 +1,25 @@
FROM nvidia/cuda:11.1.1-runtime-ubuntu18.04

RUN apt-get update -y && \
apt-get install -y libblas3 liblapack3 liblapack-dev libblas-dev gfortran libatlas-base-dev cmake

RUN apt-get install -y software-properties-common && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt-get update && \
apt-get install -y python3.7 python3.7-dev python3-pip && \
update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.7 2

RUN python3 -m pip install -U pip

RUN python3 -m pip install 'setuptools==60.5.0' 'Cython==0.29.26' 'typing_extensions==4.0.1'
RUN python3 -m pip install 'numpy==1.21.5'
RUN python3 -m pip install 'pythran' 'pybind11'
RUN python3 -m pip install 'scipy==1.7.3'
RUN python3 -m pip install 'luigi>=3.0.0' 'scikit-learn==1.0.2' 'pyarrow==6.0.1' 'pyspark==3.4.2' 'tqdm==4.62.3' \
'pandas==1.3.5' 'duckdb' 'pytest' 'pylint' 'coverage' 'pyhocon'
RUN python3 -m pip install 'torch==1.12.1' 'pytorch-lightning==1.6.5' 'torchmetrics==0.9.2' \
'hydra-core>=1.1.2' 'hydra-optuna-sweeper>=1.2.0' 'tensorboard==2.3.0' \
'omegaconf' 'transformers' 'lightgbm' 'wandb'

RUN python3 -m pip cache purge
RUN apt-get clean && rm -rf /var/lib/apt/lists/*
58 changes: 46 additions & 12 deletions README.md
@@ -50,18 +50,52 @@ pytest

## Demo notebooks

- Supervised model training [notebook](demo/supervised-sequence-to-target.ipynb)
- Self-supervised training and embeddings for downstream task [notebook](demo/coles-emb.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/coles-emb.ipynb)
- Self-supervised training and embeddings for clients' transactions [notebook](demo/transaction-emb.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/transaction-emb.ipynb)
- Self-supervised embeddings in CatBoost [notebook](demo/coles-catboost.ipynb)
- Self-supervised training and fine-tuning [notebook](demo/coles-finetune.ipynb)
- Self-supervised TrxEncoder only training with Masked Language Model task and fine-tuning [notebook](demo/mlm-emb.ipynb)
- Pandas data preprocessing options [notebook](demo/preprocessing-demo.ipynb)
- PySpark and Parquet for data preprocessing [notebook](demo/pyspark-parquet.ipynb)
- Fast inference on large dataset [notebook](demo/extended_inference.ipynb)
- Supervised multilabel classification [notebook](demo/multilabel-classification.ipynb)
- Text features demo:
- Using pretrained encoder to text features [notebook](demo/coles-pretrained-embeddings.ipynb)
Learn deep learning analysis of event sequences with Pytorch-Lifestream.

We have collected a set of topics related to the processing of event sequences. Most topics are supported by demo code that uses the ptls library. We recommend studying the topics sequentially; however, if you are already familiar with some areas, you can skip them and take only the relevant topics.

| ix | Topic | Description | Demo |
| ---- | --------------------------------------- | --------------------------------------- | ----- |
| 1. | Prerequisites | | |
| 1.1. | PyTorch | Deep Learning framework | https://pytorch.org/ |
| 1.2. | PyTorch-Lightning | NN training framework | https://lightning.ai/ |
| 1.3. | (optional) Hydra | Configuration framework | https://hydra.cc/ and [demo/Hydra CoLES Training.ipynb](./demo/Hydra%20CoLES%20Training.ipynb) |
| 1.4. | pandas | Data preprocessing | https://pandas.pydata.org/ |
| 1.5. | (optional) PySpark | Big Data preprocessing | [https://spark.apache.org/](https://spark.apache.org/docs/latest/api/python/index.html) |
| 2. | Event sequences | Problem statement and classical methods | |
| 2.1. | Event sequence for global problems | e.g. event sequence classification | TBD |
| 2.2. | Event sequence for local problems | e.g. next event prediction | TBD |
| 3. | Supervised neural networks | Supervised learning for event sequence classification | [demo/supervised-sequence-to-target.ipynb](./demo/supervised-sequence-to-target.ipynb) |
| 3.1. | Network Types | Different networks for sequences | |
| 3.1.1. | Recurrent neural networks | | TBD based on `supervised-sequence-to-target.ipynb` |
| 3.1.2. | (optional) Convolutional neural networks | | TBD based on `supervised-sequence-to-target.ipynb` |
| 3.1.3. | Transformers | | [demo/supervised-sequence-to-target-transformer.ipynb](demo/supervised-sequence-to-target-transformer.ipynb) |
| 3.2. | Problem types | Different problem types for sequences | |
| 3.2.1. | Global problems | Binary, multilabel, regression, ... | TBD based on [demo/multilabel-classification.ipynb](demo/multilabel-classification.ipynb) |
| 3.2.2. | Local problems | Next event prediction | [demo/event-sequence-local-embeddings.ipynb](demo/event-sequence-local-embeddings.ipynb) |
| 4. | Unsupervised learning | Pretrain a self-supervised model with some proxy task | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/coles-emb.ipynb) |
| 4.1. | (optional) Word2vec | Context-based methods | |
| 4.2. | MLM, RTD, GPT | Event-based methods | Self-supervised training and embeddings for clients' transactions [notebook](demo/event-sequence-local-embeddings.ipynb) [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dllllb/pytorch-lifestream/blob/master/demo/event-sequence-local-embeddings.ipynb) |
| 4.3. | NSP, SOP | Sequence based methods | [demo/nsp-sop-emb.ipynb](demo/nsp-sop-emb.ipynb) |
| 5. | Contrastive and non-contrastive learning | Latent representation-based losses | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.1. | CoLES | | [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.2. | VICReg | | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.3. | CPC | | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 5.4. | MLM, TabFormer and others | Self-supervised TrxEncoder only training with Masked Language Model | [demo/mlm-emb.ipynb](./demo/mlm-emb.ipynb) [demo/tabformer-emb.ipynb](demo/tabformer-emb.ipynb) |
| 6. | Pretrained model usage | | |
| 6.1. | Downstream model on frozen embeddings | | TBD based on [demo/coles-emb.ipynb](./demo/coles-emb.ipynb) |
| 6.2. | CatBoost embeddings features | | [demo/coles-catboost.ipynb](demo/coles-catboost.ipynb) |
| 6.3. | Model finetuning | | [demo/coles-finetune.ipynb](./demo/coles-finetune.ipynb) |
| 7. | Preprocessing options | Data preparation demos | [demo/preprocessing-demo.ipynb](demo/preprocessing-demo.ipynb) |
| 7.1. | ptls-format parquet data loading | PySpark and Parquet for data preprocessing | [demo/pyspark-parquet.ipynb](demo/pyspark-parquet.ipynb) |
| 7.2. | Fast inference for a big dataset | | [demo/extended_inference.ipynb](demo/extended_inference.ipynb) |
| 8. | Special feature types | | |
| 8.1. | Using a pretrained encoder for text features | | [demo/coles-pretrained-embeddings.ipynb](demo/coles-pretrained-embeddings.ipynb) |
| 8.2. | Multi-source models | | [demo/CoLES-demo-multimodal-unsupervised.ipynb](demo/CoLES-demo-multimodal-unsupervised.ipynb) |
| 9. | Trx Encoding options | | |
| 9.1. | Basic options | | TBD |
| 9.2. | Transaction Quantization | | TBD |
| 9.3. | Transaction BPE | | TBD |

## Docs

9 changes: 3 additions & 6 deletions demo/coles-emb.ipynb
@@ -17,11 +17,7 @@
"source": [
"import sys\n",
"if 'google.colab' in str(get_ipython()):\n",
" ! {sys.executable} -m pip install pytorch-lifestream\n",
" ! {sys.executable} -m pip install -U 'torch<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'pytorch-lightning<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchvision<0.15.1' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchaudio<2' # downgrade for ptls==0.5.x"
" ! {sys.executable} -m pip install pytorch-lifestream"
]
},
{
@@ -432,7 +428,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=15,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=False,\n",
")"
]
14 changes: 6 additions & 8 deletions demo/event-sequence-local-embeddings.ipynb
@@ -33,10 +33,6 @@
"import sys\n",
"if 'google.colab' in str(get_ipython()):\n",
" ! {sys.executable} -m pip install pytorch-lifestream\n",
" ! {sys.executable} -m pip install -U 'torch<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'pytorch-lightning<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchvision<0.15.1' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchaudio<2' # downgrade for ptls==0.5.x\n",
"\n",
"clear_output()"
],
@@ -672,8 +668,6 @@
"output_type": "stream",
"name": "stderr",
"text": [
"/usr/local/lib/python3.10/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py:478: LightningDeprecationWarning: Setting `Trainer(gpus=1)` is deprecated in v1.7 and will be removed in v2.0. Please use `Trainer(accelerator='gpu', devices=1)` instead.\n",
" rank_zero_deprecation(\n",
"INFO:pytorch_lightning.utilities.rank_zero:GPU available: True (cuda), used: True\n",
"INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores\n",
"INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs\n",
@@ -689,7 +683,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=50,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=False,\n",
")"
],
@@ -922,7 +917,10 @@
{
"cell_type": "code",
"source": [
"predict = pl.Trainer(gpus=1).predict(inference_module, inference_dl)"
"predict = pl.Trainer(\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\"\n",
").predict(inference_module, inference_dl)"
],
"metadata": {
"id": "yH5PKWyxN7Xk",
12 changes: 5 additions & 7 deletions demo/nsp-sop-emb.ipynb
@@ -17,11 +17,7 @@
"source": [
"import sys\n",
"if 'google.colab' in str(get_ipython()):\n",
" ! {sys.executable} -m pip install pytorch-lifestream\n",
" ! {sys.executable} -m pip install -U 'torch<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'pytorch-lightning<2' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchvision<0.15.1' # downgrade for ptls==0.5.x\n",
" ! {sys.executable} -m pip install -U 'torchaudio<2' # downgrade for ptls==0.5.x"
" ! {sys.executable} -m pip install pytorch-lifestream"
]
},
{
@@ -429,7 +425,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=15,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=True,\n",
")"
]
@@ -811,7 +808,8 @@
"\n",
"trainer = pl.Trainer(\n",
" max_epochs=15,\n",
" gpus=1 if torch.cuda.is_available() else 0,\n",
" accelerator=\"cuda\" if torch.cuda.is_available() else \"cpu\",\n",
" devices=1 if torch.cuda.is_available() else \"auto\",\n",
" enable_progress_bar=True,\n",
")"
]
16 changes: 8 additions & 8 deletions docs/data_load/datasets.md
@@ -115,7 +115,7 @@ They take `i_filters` as a list of `iterable_processing` objects.
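
For example, a minimal sketch of passing `i_filters` (the filter class and the field names below are assumptions; check the exact signatures in your ptls version):

```python
import torch
from ptls.data_load.datasets import MemoryMapDataset
from ptls.data_load.iterable_processing import SeqLenFilter  # assumed filter class

# Two toy feature dicts; sequence length is the length of the arrays.
data = [
    {'event_time': torch.arange(30), 'amount': torch.rand(30)},
    {'event_time': torch.arange(10), 'amount': torch.rand(10)},  # dropped by the filter
]
ds = MemoryMapDataset(data, i_filters=[SeqLenFilter(min_seq_len=25)])
```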

### Augmentations

Sometimes we have to change an items from train data. This is `augmentations`.
Sometimes we have to change items from the train data. This is what `augmentations` do.
They are in `ptls.data_load.augmentations`.

Example:
@@ -138,11 +138,11 @@ Here the `RandomSlice` augmentation takes a random slice from the source record.
| Place it before the persist stage to run it once and save CPU resources | Don't place it before the persist stage because it kills the randomness |
| Can delete items | Cannot delete items |
| Can yield new items | Cannot create new items |
| Works a generator and requires iterable processing | Works as a function can be both map or iterable |
| Works as a generator and requires iterable processing | Works as a function; can be either map or iterable |
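
To make the contrast concrete, a minimal augmentation sketch (a sketch only; `RandomSlice` parameter names may differ between ptls versions):

```python
import torch
from ptls.data_load.augmentations import RandomSlice

aug = RandomSlice(min_len=16, max_len=64)  # assumed parameter names
rec = {'event_time': torch.arange(100), 'amount': torch.rand(100)}
rec_aug = aug(rec)  # returns a different random sub-slice of the record on each call
```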

## In memory data

In memory data is common case. Data can a list or generator with feature dicts.
In memory data is a common case. Data can be a list or a generator of feature dicts.

```python
import torch
@@ -184,7 +184,7 @@ def data_gen(n):
Both datasets support any kind of input: list or generator.
As all datasets support the same format (list or generator) for input and output, they can be chained.
This make sense for some cases.
This makes sense for some cases.
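
A chaining sketch under the assumptions above (hypothetical file name):

```python
from ptls.data_load.datasets import MemoryMapDataset, ParquetDataset

# Wrap an iterable parquet reader into a map-style dataset.
ds = MemoryMapDataset(ParquetDataset(['data/train/part-00000.parquet']))
```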

Data pipelines:

@@ -237,7 +237,7 @@ for batch in dl:

## Parquet file read

For large amount of data `pyspark` is possible engine to prepare data and convert it in feature dict format.
For a large amount of data, `pyspark` is a possible engine to prepare data and convert it into the feature dict format.
See `demo/pyspark-parquet.ipynb` for an example of data preprocessing with `pyspark` and parquet file preparation.

`ptls.data_load.datasets.ParquetDataset` is a dataset which reads parquet files with feature dicts.
@@ -249,8 +249,8 @@
- looks like a generator
- supports `i_filters`

You can feed `ParquetDataset` directly fo dataloader for `iterable` way of usage.
Cou can combine `ParquetDataset` with `MemoryMapDataset` to `map` way of usage.
You can feed `ParquetDataset` directly to a dataloader for the `iterable` way of usage.
You can combine `ParquetDataset` with `MemoryMapDataset` for the `map` way of usage.

`ParquetDataset` requires parquet file names. Usually `spark` saves many parquet files for one dataset,
depending on the number of partitions.
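
A minimal usage sketch, assuming `ParquetDataset` takes a list of file paths plus optional `i_filters` as described above (hypothetical paths and filter):

```python
from ptls.data_load.datasets import ParquetDataset
from ptls.data_load.iterable_processing import SeqLenFilter  # assumed filter class

ds = ParquetDataset(
    ['data/train/part-00000.parquet', 'data/train/part-00001.parquet'],
    i_filters=[SeqLenFilter(min_seq_len=25)],
)
for rec in ds:  # behaves like a generator of feature dicts
    pass
```
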
@@ -264,7 +264,7 @@ Many files for one dataset allow you to:

`ptls.data_load.datasets.PersistDataset` stores items from the source dataset in memory.

If you source data is iterator (like python generator or `ParquetDataset`)
If your source data is an iterator (like a Python generator or `ParquetDataset`),
all `i_filters` will be called each time you access the data.
Persist the data into memory and `i_filters` will be called only once.
Keep in mind that storing all dataset items may require a lot of memory.
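
A sketch of this trade-off, assuming `PersistDataset` wraps any iterable dataset as described (hypothetical path):

```python
from ptls.data_load.datasets import ParquetDataset, PersistDataset

source = ParquetDataset(['data/train/part-00000.parquet'])  # i_filters would run on every pass
ds = PersistDataset(source)  # items are materialized once and then served from memory
```
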
2 changes: 1 addition & 1 deletion docs/data_load/date_pipeline.md
@@ -1,6 +1,6 @@
# Data pipeline

All process support `map` and `iterable` data.
All processes support `map` and `iterable` data.

These are the steps in the pipeline:
