diff --git a/structured_data/2022_06_02_causality/README.md b/structured_data/2022_06_02_causality/README.md new file mode 100644 index 0000000..d22e638 --- /dev/null +++ b/structured_data/2022_06_02_causality/README.md @@ -0,0 +1,98 @@ +# Causality + +## What is causality? +Causality in time series is a tricky thing. You want to find out which variables cause others. +The notion itself is very subtle and at the same time very strong. + +It's in essence a stronger way of looking at dynamics between variables than the well-known correlation metric. +Why? Well... Let's prove it intuitively. +In the figure below you can expect that our three (time-series) variables are strongly correlated with each other. +It makes sense to say that having your skin burned has a strong correlation with the number of ice creams you might eat on a given day. +However, this does not necessarily mean that eating ice cream causes skin burns! +Using causality analysis techniques, we can better model and find the _causal_ links between variables. +If our analysis was successful, we'd conclude that sunshine causes us to eat lots of ice cream and also causes our skin to burn. +The analysis would also show that there is no causal relationship between eating ice cream and getting your skin burnt, or vice versa. +Note that causality is directional, whereas correlation is not! + +![causality vs correlation example](figs/causality_vs_correlation.png) + +## Causality analysis as a guidance. +Keep in mind that causality analyses are in essence unsupervised techniques. +Without a known ground truth, there is no direct way to check whether the results are correct. + +In an experimental setting, people create toy datasets of which they determined the dynamics (linear and non-linear mathematical equations) upfront. 
Examples: + +- [this towardsdatascience article](https://towardsdatascience.com/causality-931372313a1c) +- [the original paper of PCMCI](https://advances.sciencemag.org/content/5/11/eaau4996) +- [this paper which compares causality techniques](https://arxiv.org/pdf/2104.08043v1.pdf) + +Typically, the internal dynamics are unknown or too hard to try to model (think about complex physical processes). In such cases, causality is merely a tool to represent the dynamics of the system. +That's why results of causality analysis should be checked with the knowledge of experts. Try to make sure you are close to this expertise in your causality projects! + +## Context + +The notion of causality is not new and can be traced back to the [Vedic period](https://en.wikipedia.org/wiki/Vedic_period#:~:text=The%20Vedic%20period%2C%20or%20the,%2C%20including%20the%20Vedas%20%28ca.), which also brought about the well-known concept of Karma. Besides being a concept, causality nowadays also has a practical and theoretical framework. Establishing this framework is possible due to advanced mathematical and statistical sciences, coupled with an increase in computational power and our ability to capture digital data, and thereby, processes around us. + +## Approaches +The earliest notion of this theoretical causality in time series is **Granger causality**, put forward by [Granger in 1969](https://en.wikipedia.org/wiki/Granger_causality). +We say that a variable *X* that evolves over time *Granger-causes* another evolving variable *Y* if predictions of the value of *Y* based on its own past values *and* on the past values of *X* are better than predictions of *Y* based only on *Y*'s own past values. + +> Just like with all other approaches mentioned in this section, the time series must be made stationary first! 
We provide [code](../2021_02_08_timeseries_getting_started/time_series_getting_started.ipynb) that automatically checks for stationarity and, if needed, tries to make the time series stationary by differencing. + +Granger causality works fine when you want to check for two variables whether one causes the other in a _linear_ way. There exist (non-parametric) methods that also detect _non-linear_ causal interactions. We advocate using these at all times, except if you're certain that the underlying dynamics are linear by nature. + +Enter **Transfer Entropy**! [This article](https://towardsdatascience.com/causality-931372313a1c) gives a good overview of the differences and the similarities between Transfer Entropy (developed in 2000) and Granger causality. In short, Transfer Entropy is the non-linear variant of Granger causality and has proven to be the most intuitive and performant (quality and consistency of results) method in our current toolbox. We advocate [this python implementation](https://pypi.org/project/PyCausality/). The code itself is not maintained anymore but still works well. As the pip package is unstable, it's easier to just copy the source code and go off from there, which we already did for you :-). Check out the [code](./src/transfer_entropy/transfer_entropy_wrapper.py)! + +A more recent (2015) approach to causality is [PCMCI](https://github.com/jakobrunge/tigramite/). It was originally developed to find causal interactions in high-dimensional time series. This package is also able to _condition out_ variables. + +> **What does conditioning out mean in this context?** When a variable A is causing a variable B, and variable B causes variable C, then A is causing C to some degree as well. If B is not adding extra significant information to C, compared to A, then PCMCI will not end up showing the causal link between B and C, but will only output that A causes C. 
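The Granger-causality definition earlier in this section can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the code used in this repo: the toy series, the single lag of 1, the no-intercept linear models and the helper names (`rss_own_past`, `rss_with_other_past`, `granger_f_stat`) are all made up for the example. It compares how well *Y* is predicted from its own past with how well it is predicted when the past of *X* is added.

```python
import random


def rss_own_past(y):
    """Residual sum of squares of the model y[t] ~ a * y[t-1] (no intercept)."""
    target, past = y[1:], y[:-1]
    a = sum(t * p for t, p in zip(target, past)) / sum(p * p for p in past)
    return sum((t - a * p) ** 2 for t, p in zip(target, past))


def rss_with_other_past(y, x):
    """Residual sum of squares of the model y[t] ~ a * y[t-1] + b * x[t-1]."""
    target, y_past, x_past = y[1:], y[:-1], x[:-1]
    # Least-squares normal equations for two regressors, solved with Cramer's rule.
    syy = sum(p * p for p in y_past)
    sxx = sum(q * q for q in x_past)
    sxy = sum(p * q for p, q in zip(y_past, x_past))
    sty = sum(t * p for t, p in zip(target, y_past))
    stx = sum(t * q for t, q in zip(target, x_past))
    det = syy * sxx - sxy * sxy
    a = (sty * sxx - stx * sxy) / det
    b = (stx * syy - sty * sxy) / det
    return sum((t - a * p - b * q) ** 2
               for t, p, q in zip(target, y_past, x_past))


def granger_f_stat(cause, effect):
    """F statistic for 'cause Granger-causes effect' at lag 1 (hypothetical helper)."""
    rss_restricted = rss_own_past(effect)
    rss_unrestricted = rss_with_other_past(effect, cause)
    dof = len(effect) - 1 - 2  # usable observations minus fitted coefficients
    return (rss_restricted - rss_unrestricted) / (rss_unrestricted / dof)


# Toy data (an assumption for this sketch): X drives Y with a lag of one step,
# Y does not drive X. Both series are zero-mean and stationary by construction.
random.seed(0)
n = 500
x = [random.gauss(0, 1) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + random.gauss(0, 1))

f_x_to_y = granger_f_stat(x, y)
f_y_to_x = granger_f_stat(y, x)
print(f"F(X -> Y) = {f_x_to_y:.1f}")
print(f"F(Y -> X) = {f_y_to_x:.1f}")
```

A large F statistic for X -> Y together with a small one for Y -> X matches how the toy data was generated, and it shows the directional nature of the test. For real data you would normally rely on a tested implementation, check several lags, and make the series stationary first.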
+ +![conditional causality](figs/conditional_causality.png) + +**PCMCI** is a two-phased approach: + +1. PC: condition selection algorithm (finds relevant parents to target variables) + This first removes the variables that are not even unconditionally dependent, using independence testing. + Then, the conditional dependence for the strongest dependent variables is checked + → This is done iteratively, until convergence + + → Ending up with a list of relevant conditions (=relevant variables at a certain lag) + +2. MCI: “Momentary conditional independence” + + → false positive control for highly-interdependent time series + + → Also identifies causal strength of a causal relationship + +This two-phased approach is the skeleton of how PCMCI works. You have to instantiate PCMCI with a conditional independence test as well. Here, you have 5 options (see `tigramite.independence_tests` in the [documentation](https://jakobrunge.github.io/tigramite/)). Let's highlight some: + +1. [`tigramite.independence_tests.CondIndTest`](https://jakobrunge.github.io/tigramite/#tigramite.independence_tests.CondIndTest) : This is the base class on which all actual independence tests are built. + +2. [`tigramite.independence_tests.ParCorr`](https://jakobrunge.github.io/tigramite/#tigramite.independence_tests.ParCorr) : This test is based on partial correlation and works well for linear causal effects + +3. [`tigramite.independence_tests.CMIknn`](https://jakobrunge.github.io/tigramite/#tigramite.independence_tests.CMIknn) : This test is a non-parametric test for continuous data that's based on KNN as the name suggests. It works well for + + - Non-linear dependencies + + - Additive *and* multiplicative noise + + It's computationally the most expensive option though. If you want more information, have a look at the [paper](https://core.ac.uk/download/pdf/211564416.pdf). + +One of the most important parameters is alpha, denoting the “degree of conservativity”. 
The higher alpha, the quicker PCMCI will identify causal links, at a higher risk of false positives. This value typically lies between 0.05 and 0.5. From our experiments, we've seen that this statement holds most of the time (so not always). Running PCMCI with different values for alpha (e.g. [0.05, 0.1, 0.2, 0.4]) is a good idea! + +> Note that this alpha hyperparameter can be tuned automatically when alpha is set to `None`. That's at least what the documentation says, and that is correct for ParCorr, but not for other independence tests. See [this issue](https://github.com/jakobrunge/tigramite/issues/49) for more information. + + + +One last thing! PCMCI has really nice visuals as output. You can see an example below. + + ![PCMCI visual](figs/PCMCI_visual.png) + + The cross-MCI denotes how strong the causal link is between different variables. The auto-MCI scale denotes how strong the causal link is between current and past values (lags) for one specific variable (denoted as a node in the graph). The numbers denote the lags for which a causal link was found. + +#### Try it out yourself! +Have a look in this repo to find out how TE and PCMCI can help you in your use-case! You can find [an example notebook](./src/Example%20notebook.ipynb) in the `src/` folder! + +> Extra remark 1: In 2020, the creator of PCMCI came up with an extension of PCMCI, named PCMCI+. From our projects and experiments, we have seen that PCMCI+ shows inconsistent results over different runs. Although PCMCI can (not in every case though) suffer from the same problem, it feels much more reliable. We therefore advise against using PCMCI+, except when your use-case needs to check contemporaneous links, i.e. check whether A causes B, for both historic as well as current time steps instead of only historic time steps (see [PCMCI+ paper](http://proceedings.mlr.press/v124/runge20a.html)). 
+ +> Extra remark 2: In the PCMCI+ paper, the authors state that highly autocorrelated time series are challenging for most time series causality approaches and hint that this is also the case for (regular) PCMCI. You might want to keep this in mind when using PCMCI. \ No newline at end of file diff --git a/structured_data/2022_06_02_causality/figs/PCMCI_visual.png b/structured_data/2022_06_02_causality/figs/PCMCI_visual.png new file mode 100644 index 0000000..8795fe6 Binary files /dev/null and b/structured_data/2022_06_02_causality/figs/PCMCI_visual.png differ diff --git a/structured_data/2022_06_02_causality/figs/causality_vs_correlation.png b/structured_data/2022_06_02_causality/figs/causality_vs_correlation.png new file mode 100644 index 0000000..dbd4ed3 Binary files /dev/null and b/structured_data/2022_06_02_causality/figs/causality_vs_correlation.png differ diff --git a/structured_data/2022_06_02_causality/figs/conditional_causality.png b/structured_data/2022_06_02_causality/figs/conditional_causality.png new file mode 100644 index 0000000..f7c3549 Binary files /dev/null and b/structured_data/2022_06_02_causality/figs/conditional_causality.png differ diff --git a/structured_data/2022_06_02_causality/requirements.txt b/structured_data/2022_06_02_causality/requirements.txt new file mode 100644 index 0000000..61b8788 --- /dev/null +++ b/structured_data/2022_06_02_causality/requirements.txt @@ -0,0 +1,88 @@ +argon2-cffi==21.3.0 +argon2-cffi-bindings==21.2.0 +asttokens==2.0.5 +async-generator==1.10 +attrs==21.4.0 +backcall==0.2.0 +beautifulsoup4==4.11.1 +bleach==5.0.0 +certifi==2022.5.18.1 +cffi==1.15.0 +cycler==0.11.0 +debugpy==1.6.0 +decorator==5.1.1 +defusedxml==0.7.1 +dill==0.3.5.1 +entrypoints==0.4 +executing==0.8.3 +fastjsonschema==2.15.3 +fonttools==4.33.3 +ipykernel==6.13.0 +ipython==8.4.0 +ipython-genutils==0.2.0 +ipywidgets==7.7.0 +jedi==0.18.1 +Jinja2==3.1.2 +joblib==1.1.0 +jsonschema==4.6.0 +jupyter==1.0.0 +jupyter-client==7.3.1 
+jupyter-console==6.4.3 +jupyter-core==4.10.0 +jupyterlab-pygments==0.2.2 +jupyterlab-widgets==1.1.0 +kiwisolver==1.4.2 +llvmlite==0.38.1 +MarkupSafe==2.1.1 +matplotlib==3.5.2 +matplotlib-inline==0.1.3 +mistune==0.8.4 +nbclient==0.6.4 +nbconvert==6.5.0 +nbformat==5.4.0 +nest-asyncio==1.5.5 +networkx==2.8.2 +notebook==6.4.11 +numba==0.55.2 +numpy==1.22.4 +packaging==21.3 +pandas==1.4.2 +pandocfilters==1.5.0 +parso==0.8.3 +patsy==0.5.2 +pexpect==4.8.0 +pickleshare==0.7.5 +Pillow==9.1.1 +prometheus-client==0.14.1 +prompt-toolkit==3.0.29 +psutil==5.9.1 +ptyprocess==0.7.0 +pure-eval==0.2.2 +pycparser==2.21 +Pygments==2.12.0 +pyparsing==3.0.9 +pyrsistent==0.18.1 +python-dateutil==2.8.2 +pytz==2022.1 +pyzmq==23.1.0 +qtconsole==5.3.0 +QtPy==2.1.0 +scikit-learn==1.1.1 +scipy==1.8.1 +seaborn==0.11.2 +Send2Trash==1.8.0 +six==1.16.0 +sklearn==0.0 +soupsieve==2.3.2.post1 +stack-data==0.2.0 +statsmodels==0.13.2 +terminado==0.15.0 +testpath==0.6.0 +threadpoolctl==3.1.0 +tigramite==5.0.0.3 +tinycss2==1.1.1 +tornado==6.1 +traitlets==5.2.2.post1 +wcwidth==0.2.5 +webencodings==0.5.1 +widgetsnbextension==3.6.0 diff --git a/structured_data/2022_06_02_causality/src/Example notebook.ipynb b/structured_data/2022_06_02_causality/src/Example notebook.ipynb new file mode 100644 index 0000000..b118184 --- /dev/null +++ b/structured_data/2022_06_02_causality/src/Example notebook.ipynb @@ -0,0 +1,1714 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "1df7c4f1", + "metadata": {}, + "source": [ + "## Get the data ready" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "58b4a264", + "metadata": {}, + "outputs": [], + "source": [ + "%load_ext autoreload\n", + "%autoreload 2\n", + "\n", + "# imports\n", + "from datetime import datetime, timedelta\n", + "\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "import dill as pickle\n", + "\n", + "from helpers.stationarity import remove_trend_and_diff" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "dried-genetics", + "metadata": {}, + "outputs": [], + "source": [ + "## Load in the data ##\n", + "# This data corresponds with one of the experiments carried out in this paper : \n", + "# https://arxiv.org/pdf/2104.08043v1.pdf\n", + "# More specifically, this data is the first of the 200 data samples used in \n", + "# the Causal Sufficiency experiment, with one latent variable.\n", + "# See https://github.com/causalens/cdml-neurips2020 for more information.\n", + "\n", + "filename = 'data.pickle'\n", + "\n", + "with open(filename, 'rb') as f:\n", + " # the pickle file contains a pandas dataframe as well as the causal\n", + " # graph that was generated by the authors of the paper, which we don't need\n", + " df, _ = pickle.load(f)" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "8bcd194a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "tackling new col X1\n", + " --> (KPSS & ADF) Time-series IS stationary for X1 (after 0 differencing operations)!\n", + "\n", + "tackling new col X10\n", + " --> (KPSS & ADF) Time-series IS stationary for X10 (after 0 differencing operations)!\n", + "\n", + "tackling new col X2\n", + " --> (KPSS & ADF) Time-series IS stationary for X2 (after 0 differencing operations)!\n", + "\n", + "tackling new col X3\n", + " --> (KPSS & ADF) Time-series IS stationary for X3 (after 0 differencing operations)!\n", + "\n", + "tackling new col X4\n", + " --> (KPSS & ADF) Time-series IS stationary for X4 (after 0 differencing operations)!\n", + "\n", + "tackling new col X5\n", + " --> (KPSS & ADF) Time-series IS stationary for X5 (after 0 differencing operations)!\n", + "\n", + "tackling new col X6\n", + " --> (KPSS & ADF) Time-series IS stationary for X6 (after 0 differencing operations)!\n", + "\n", + "tackling new col X7\n", + " --> (KPSS & ADF) Time-series IS stationary for X7 (after 0 differencing 
operations)!\n", + "\n", + "tackling new col X8\n", + " --> (KPSS & ADF) Time-series IS stationary for X8 (after 0 differencing operations)!\n", + "\n", + "tackling new col X9\n", + " --> (KPSS & ADF) Time-series IS stationary for X9 (after 0 differencing operations)!\n", + "\n", + "(Maximum number of differencing operations performed was 1)\n" + ] + }, + { + "data": { + "text/html": [ + "
(HTML table rendering of df.head() lost in extraction; the same five rows for columns X1 to X10 appear in the text/plain output below)\n", + "
" + ], + "text/plain": [ + " X1 X10 X2 X3 X4 \\\n", + "2019-09-06 23:45:32.296715 -1.667105 -0.647129 0.580980 -0.943676 0.475631 \n", + "2019-09-07 23:45:32.296715 0.224083 0.508340 0.521288 0.405035 0.918387 \n", + "2019-09-08 23:45:32.296715 -1.086641 0.201047 -2.989957 0.466461 -0.292212 \n", + "2019-09-09 23:45:32.296715 -0.421986 0.993365 0.662050 -0.896384 -0.430599 \n", + "2019-09-10 23:45:32.296715 -0.109283 -0.413054 -1.222458 1.193096 -0.220662 \n", + "\n", + " X5 X6 X7 X8 X9 \n", + "2019-09-06 23:45:32.296715 -0.298541 -0.562768 0.511342 1.176538 -0.137143 \n", + "2019-09-07 23:45:32.296715 -0.611866 2.306102 -0.658871 0.179594 -0.455267 \n", + "2019-09-08 23:45:32.296715 -0.960543 0.697460 -0.682488 0.692319 1.161958 \n", + "2019-09-09 23:45:32.296715 1.764546 0.370239 2.418368 0.616185 -0.104601 \n", + "2019-09-10 23:45:32.296715 -0.831529 0.723627 -0.544237 0.502862 -1.117350 " + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Process data\n", + "df.index = pd.date_range(datetime.today() - timedelta(days = len(df)), \n", + " periods=len(df))\n", + "df_stat = remove_trend_and_diff(df.resample('W-MON').mean(), \n", + " debug=False)\n", + "df.head()" + ] + }, + { + "cell_type": "markdown", + "id": "67b0df31", + "metadata": {}, + "source": [ + "### Cool! Apparently, our dataset was completely stationary already. Let's start the analysis" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "00ee028a", + "metadata": {}, + "outputs": [], + "source": [ + "# determine min and max lags\n", + "tau_min=0\n", + "tau_max=4" + ] + }, + { + "cell_type": "markdown", + "id": "fcab33b7", + "metadata": {}, + "source": [ + "# 1. 
Transfer entropy" + ] + }, + { + "cell_type": "markdown", + "id": "d8f98149", + "metadata": {}, + "source": [ + "## Run TE" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "fb6973c9", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "from transfer_entropy.transfer_entropy_wrapper import average_transfer_entropy\n", + "from helpers.transfer_entropy import export_as_df" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "ab448ec2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "lag(0)\n", + "X1 -> X10\n", + "X2 -> X10\n", + "X3 -> X10\n", + "X4 -> X10\n", + "X5 -> X10\n", + "X6 -> X10\n", + "X7 -> X10\n", + "X8 -> X10\n", + "X9 -> X10\n", + "took 15.5995512008667 seconds\n", + "\n", + "lag(1)\n", + "X1 -> X10\n", + "X2 -> X10\n", + "X3 -> X10\n", + "X4 -> X10\n", + "X5 -> X10\n", + "X6 -> X10\n", + "X7 -> X10\n", + "X8 -> X10\n", + "X9 -> X10\n", + "took 14.241465091705322 seconds\n", + "\n", + "lag(2)\n", + "X1 -> X10\n", + "X2 -> X10\n", + "X3 -> X10\n", + "X4 -> X10\n", + "X5 -> X10\n", + "X6 -> X10\n", + "X7 -> X10\n", + "X8 -> X10\n", + "X9 -> X10\n", + "took 13.998454570770264 seconds\n", + "\n", + "lag(3)\n", + "X1 -> X10\n", + "X2 -> X10\n", + "X3 -> X10\n", + "X4 -> X10\n", + "X5 -> X10\n", + "X6 -> X10\n", + "X7 -> X10\n", + "X8 -> X10\n", + "X9 -> X10\n", + "took 13.934021949768066 seconds\n", + "\n", + "lag(4)\n", + "X1 -> X10\n", + "X2 -> X10\n", + "X3 -> X10\n", + "X4 -> X10\n", + "X5 -> X10\n", + "X6 -> X10\n", + "X7 -> X10\n", + "X8 -> X10\n", + "X9 -> X10\n", + "took 13.63166069984436 seconds\n" + ] + } + ], + "source": [ + "# Number of shuffles to perform to determine the results' significance.\n", + "n_shuffles = 50 \n", + "# Whether or not to calculate the effective Transfer Entropy.\n", + "effective = False\n", + "# Whether or not to show intermediate results.\n", + "debug = False\n", + "\n", + "# Function execution\n", + "# !Make 
sure the first column in your dataframe is the target column you want to find causality for!\n", + "# In our case, we want to check causality with respect to X10, which we then define as the target var\n", + "def bring_col_to_front(df, column):\n", + " return df[[column] + [col for col in df.columns if col != column]]\n", + "df_stat = bring_col_to_front(df_stat, \"X10\")\n", + "\n", + "avg_nonlin_te = average_transfer_entropy(df_stat, \n", + " linear=False, \n", + " effective=effective, \n", + " tau_min=tau_min, \n", + " tau_max=tau_max,\n", + " n_shuffles=n_shuffles, \n", + " debug=debug)\n", + "avg_nonlin_te_arr = np.array(avg_nonlin_te)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "285d2a66", + "metadata": {}, + "outputs": [], + "source": [ + "# Parse output as pandas dataframe\n", + "avg_nonlin_te_df = export_as_df(avg_nonlin_te_arr)" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "c578e5c1", + "metadata": {}, + "outputs": [], + "source": [ + "# Remember that the parsed output of TE contains p-values\n", + "# To draw conclusions from these statistical notions, we have to \n", + "# arbitrarily pick an alpha (cut-off) value. Based on that value, we \n", + "# determine for which links there seems to be enough statistical evidence to conclude causality\n", + "threshold = 0.01\n", + "booldf = avg_nonlin_te_df.iloc[:,:]" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "viz_df_raw(avg_nonlin_te_df, booldf, threshold)" + ] + }, + { + "cell_type": "markdown", + "id": "6e19f278", + "metadata": {}, + "source": [ + "#### Graph explanations\n", + "The lineplots show the actual p-values for a given link between the target variable (X10) and the explanatory variable at a given lag. Based on the cut-off value that we chose, we then also plot blue bars. These blue bars appear for the p-values that are below the threshold. 
In essence, these denote the combinations of lags and variables that show a causal link with regard to the target variable (X10 in this case)." + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "5dbd0993", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# clear matplotlib buffer to be able to make figures for pcmci\n", + "plt.clf()" + ] + }, + { + "cell_type": "markdown", + "id": "3679674f", + "metadata": {}, + "source": [ + "# 2. PCMCI" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "5f488872", + "metadata": {}, + "outputs": [], + "source": [ + "# imports\n", + "from tigramite import data_processing as pp\n", + "from tigramite import plotting as tp\n", + "from tigramite.pcmci import PCMCI\n", + "from tigramite.independence_tests import CMIknn\n", + "\n", + "from helpers.pcmci import get_selected_links, process_and_visualize_results" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "d69a441d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "##\n", + "## Step 1: PC1 algorithm with lagged conditions\n", + "##\n", + "\n", + "Parameters:\n", + "selected_links = {0: [(1, -1), (1, -2), (1, -3), (1, -4), (2, -1), (2, -2), (2, -3), (2, -4), (3, -1), (3, -2), (3, -3), (3, -4), (4, -1), (4, -2), (4, -3), (4, -4), (5, -1), (5, -2), (5, -3), (5, -4), (6, -1), (6, -2), (6, -3), (6, -4), (7, -1), (7, -2), (7, -3), (7, -4), (8, -1), (8, -2), (8, -3), (8, -4), (9, -1), (9, -2), (9, -3), (9, -4)], 1: [(1, -1), (1, -2), (1, -3), (1, -4), (2, -1), (2, -2), (2, -3), (2, -4), (3, -1), (3, -2), (3, -3), (3, -4), (4, -1), (4, -2), (4, -3), (4, -4), (5, -1), (5, -2), (5, -3), (5, -4), (6, -1), (6, -2), (6, -3), (6, -4), (7, -1), (7, -2), (7, -3), (7, -4), (8, -1), (8, -2), (8, -3), (8, -4), (9, -1), (9, -2), (9, -3), (9, -4)], 2: [], 3: [], 4: [], 5: [], 6: [], 7: [], 8: [], 9: []}\n", + "independence test = cmi_knn\n", + "tau_min = 1\n", + "tau_max = 4\n", + "pc_alpha = [0.25]\n", + "max_conds_dim = None\n", + "max_combinations = 1\n", + "\n", + "\n", + "\n", + "## Variable X10\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# 
pc_alpha = 0.25 (1/1):\n", + "\n", + "Testing condition sets of dimension 0:\n", + "\n", + " Link (X1 -1) --> X10 (1/36):\n", + " Subset 0: () gives pval = 0.22200 / val = 0.028\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X1 -2) --> X10 (2/36):\n", + " Subset 0: () gives pval = 0.20000 / val = 0.032\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X1 -3) --> X10 (3/36):\n", + " Subset 0: () gives pval = 0.49700 / val = 0.018\n", + " Non-significance detected.\n", + "\n", + " Link (X1 -4) --> X10 (4/36):\n", + " Subset 0: () gives pval = 0.94000 / val = -0.002\n", + " Non-significance detected.\n", + "\n", + " Link (X2 -1) --> X10 (5/36):\n", + " Subset 0: () gives pval = 0.00600 / val = 0.060\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X2 -2) --> X10 (6/36):\n", + " Subset 0: () gives pval = 0.33300 / val = 0.020\n", + " Non-significance detected.\n", + "\n", + " Link (X2 -3) --> X10 (7/36):\n", + " Subset 0: () gives pval = 0.34600 / val = 0.020\n", + " Non-significance detected.\n", + "\n", + " Link (X2 -4) --> X10 (8/36):\n", + " Subset 0: () gives pval = 0.14900 / val = 0.032\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X3 -1) --> X10 (9/36):\n", + " Subset 0: () gives pval = 1.00000 / val = -0.019\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -2) --> X10 (10/36):\n", + " Subset 0: () gives pval = 0.20500 / val = 0.028\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X3 -3) --> X10 (11/36):\n", + " Subset 0: () gives pval = 0.50100 / val = 0.014\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -4) --> X10 (12/36):\n", + " Subset 0: () gives pval = 0.18900 / val = 0.028\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X4 -1) --> X10 (13/36):\n", + " Subset 0: () gives pval = 0.21200 / val = 0.028\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X4 -2) --> X10 (14/36):\n", + " Subset 0: () gives pval = 0.66300 / 
val = 0.010\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -3) --> X10 (15/36):\n", + " Subset 0: () gives pval = 0.48000 / val = 0.016\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -4) --> X10 (16/36):\n", + " Subset 0: () gives pval = 0.07700 / val = 0.038\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X5 -1) --> X10 (17/36):\n", + " Subset 0: () gives pval = 0.21900 / val = 0.023\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X5 -2) --> X10 (18/36):\n", + " Subset 0: () gives pval = 0.93400 / val = -0.000\n", + " Non-significance detected.\n", + "\n", + " Link (X5 -3) --> X10 (19/36):\n", + " Subset 0: () gives pval = 0.03600 / val = 0.049\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X5 -4) --> X10 (20/36):\n", + " Subset 0: () gives pval = 0.21900 / val = 0.026\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X6 -1) --> X10 (21/36):\n", + " Subset 0: () gives pval = 0.37800 / val = 0.018\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -2) --> X10 (22/36):\n", + " Subset 0: () gives pval = 0.81100 / val = 0.004\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -3) --> X10 (23/36):\n", + " Subset 0: () gives pval = 0.31300 / val = 0.021\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -4) --> X10 (24/36):\n", + " Subset 0: () gives pval = 0.37900 / val = 0.019\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -1) --> X10 (25/36):\n", + " Subset 0: () gives pval = 0.75500 / val = 0.008\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -2) --> X10 (26/36):\n", + " Subset 0: () gives pval = 0.82700 / val = 0.004\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -3) --> X10 (27/36):\n", + " Subset 0: () gives pval = 0.81400 / val = 0.004\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -4) --> X10 (28/36):\n", + " Subset 0: () gives pval = 0.85500 / val = 0.003\n", + " Non-significance 
detected.\n", + "\n", + " Link (X8 -1) --> X10 (29/36):\n", + " Subset 0: () gives pval = 0.33000 / val = 0.020\n", + " Non-significance detected.\n", + "\n", + " Link (X8 -2) --> X10 (30/36):\n", + " Subset 0: () gives pval = 0.62600 / val = 0.010\n", + " Non-significance detected.\n", + "\n", + " Link (X8 -3) --> X10 (31/36):\n", + " Subset 0: () gives pval = 0.83800 / val = 0.004\n", + " Non-significance detected.\n", + "\n", + " Link (X8 -4) --> X10 (32/36):\n", + " Subset 0: () gives pval = 0.52500 / val = 0.014\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -1) --> X10 (33/36):\n", + " Subset 0: () gives pval = 0.05100 / val = 0.046\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X9 -2) --> X10 (34/36):\n", + " Subset 0: () gives pval = 0.66600 / val = 0.009\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -3) --> X10 (35/36):\n", + " Subset 0: () gives pval = 0.40800 / val = 0.020\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -4) --> X10 (36/36):\n", + " Subset 0: () gives pval = 0.29700 / val = 0.023\n", + " Non-significance detected.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X10 has 12 link(s):\n", + " (X2 -1): max_pval = 0.00600, min_val = 0.060\n", + " (X5 -3): max_pval = 0.03600, min_val = 0.049\n", + " (X9 -1): max_pval = 0.05100, min_val = 0.046\n", + " (X4 -4): max_pval = 0.07700, min_val = 0.038\n", + " (X2 -4): max_pval = 0.14900, min_val = 0.032\n", + " (X1 -2): max_pval = 0.20000, min_val = 0.032\n", + " (X3 -4): max_pval = 0.18900, min_val = 0.028\n", + " (X3 -2): max_pval = 0.20500, min_val = 0.028\n", + " (X1 -1): max_pval = 0.22200, min_val = 0.028\n", + " (X4 -1): max_pval = 0.21200, min_val = 0.028\n", + " (X5 -4): max_pval = 0.21900, min_val = 0.026\n", + " (X5 -1): max_pval = 0.21900, min_val = 0.023\n", + "\n", + "Testing condition sets of dimension 
1:\n", + "\n", + " Link (X2 -1) --> X10 (1/12):\n", + " Subset 0: (X5 -3) gives pval = 0.22300 / val = 0.029\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X5 -3) --> X10 (2/12):\n", + " Subset 0: (X2 -1) gives pval = 0.39700 / val = 0.022\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -1) --> X10 (3/12):\n", + " Subset 0: (X2 -1) gives pval = 0.03700 / val = 0.041\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X4 -4) --> X10 (4/12):\n", + " Subset 0: (X2 -1) gives pval = 0.08100 / val = 0.040\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X2 -4) --> X10 (5/12):\n", + " Subset 0: (X2 -1) gives pval = 0.20900 / val = 0.027\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X1 -2) --> X10 (6/12):\n", + " Subset 0: (X2 -1) gives pval = 0.62200 / val = 0.011\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -4) --> X10 (7/12):\n", + " Subset 0: (X2 -1) gives pval = 0.42100 / val = 0.023\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -2) --> X10 (8/12):\n", + " Subset 0: (X2 -1) gives pval = 0.17700 / val = 0.030\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X1 -1) --> X10 (9/12):\n", + " Subset 0: (X2 -1) gives pval = 0.19200 / val = 0.029\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X4 -1) --> X10 (10/12):\n", + " Subset 0: (X2 -1) gives pval = 0.13000 / val = 0.031\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X5 -4) --> X10 (11/12):\n", + " Subset 0: (X2 -1) gives pval = 0.10400 / val = 0.032\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X5 -1) --> X10 (12/12):\n", + " Subset 0: (X2 -1) gives pval = 0.21400 / val = 0.026\n", + " No conditions of dimension 1 left.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X10 has 9 link(s):\n", + " (X9 -1): max_pval = 
0.05100, min_val = 0.041\n", + " (X4 -4): max_pval = 0.08100, min_val = 0.038\n", + " (X2 -1): max_pval = 0.22300, min_val = 0.029\n", + " (X3 -2): max_pval = 0.20500, min_val = 0.028\n", + " (X1 -1): max_pval = 0.22200, min_val = 0.028\n", + " (X4 -1): max_pval = 0.21200, min_val = 0.028\n", + " (X2 -4): max_pval = 0.20900, min_val = 0.027\n", + " (X5 -4): max_pval = 0.21900, min_val = 0.026\n", + " (X5 -1): max_pval = 0.21900, min_val = 0.023\n", + "\n", + "Testing condition sets of dimension 2:\n", + "\n", + " Link (X9 -1) --> X10 (1/9):\n", + " Subset 0: (X4 -4) (X2 -1) gives pval = 0.03900 / val = 0.037\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X4 -4) --> X10 (2/9):\n", + " Subset 0: (X9 -1) (X2 -1) gives pval = 0.01800 / val = 0.047\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X2 -1) --> X10 (3/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.00800 / val = 0.042\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X3 -2) --> X10 (4/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.30300 / val = 0.028\n", + " Non-significance detected.\n", + "\n", + " Link (X1 -1) --> X10 (5/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.25700 / val = 0.032\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -1) --> X10 (6/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.15400 / val = 0.030\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X2 -4) --> X10 (7/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.84300 / val = 0.014\n", + " Non-significance detected.\n", + "\n", + " Link (X5 -4) --> X10 (8/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.05200 / val = 0.036\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X5 -1) --> X10 (9/9):\n", + " Subset 0: (X9 -1) (X4 -4) gives pval = 0.33800 / val = 0.027\n", + " Non-significance detected.\n", + 
"\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X10 has 5 link(s):\n", + " (X4 -4): max_pval = 0.08100, min_val = 0.038\n", + " (X9 -1): max_pval = 0.05100, min_val = 0.037\n", + " (X2 -1): max_pval = 0.22300, min_val = 0.029\n", + " (X4 -1): max_pval = 0.21200, min_val = 0.028\n", + " (X5 -4): max_pval = 0.21900, min_val = 0.026\n", + "\n", + "Testing condition sets of dimension 3:\n", + "\n", + " Link (X4 -4) --> X10 (1/5):\n", + " Subset 0: (X9 -1) (X2 -1) (X4 -1) gives pval = 0.00700 / val = 0.044\n", + " Still subsets of dimension 3 left, but q_max = 1 reached.\n", + "\n", + " Link (X9 -1) --> X10 (2/5):\n", + " Subset 0: (X4 -4) (X2 -1) (X4 -1) gives pval = 0.11400 / val = 0.034\n", + " Still subsets of dimension 3 left, but q_max = 1 reached.\n", + "\n", + " Link (X2 -1) --> X10 (3/5):\n", + " Subset 0: (X4 -4) (X9 -1) (X4 -1) gives pval = 0.02800 / val = 0.040\n", + " Still subsets of dimension 3 left, but q_max = 1 reached.\n", + "\n", + " Link (X4 -1) --> X10 (4/5):\n", + " Subset 0: (X4 -4) (X9 -1) (X2 -1) gives pval = 0.03600 / val = 0.035\n", + " Still subsets of dimension 3 left, but q_max = 1 reached.\n", + "\n", + " Link (X5 -4) --> X10 (5/5):\n", + " Subset 0: (X4 -4) (X9 -1) (X2 -1) gives pval = 0.10300 / val = 0.035\n", + " Still subsets of dimension 3 left, but q_max = 1 reached.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X10 has 5 link(s):\n", + " (X4 -4): max_pval = 0.08100, min_val = 0.038\n", + " (X9 -1): max_pval = 0.11400, min_val = 0.034\n", + " (X2 -1): max_pval = 0.22300, min_val = 0.029\n", + " (X4 -1): max_pval = 0.21200, min_val = 0.028\n", + " (X5 -4): max_pval = 0.21900, min_val = 0.026\n", + "\n", + "Testing condition sets of dimension 4:\n", + "\n", + " Link (X4 -4) --> X10 
(1/5):\n", + " Subset 0: (X9 -1) (X2 -1) (X4 -1) (X5 -4) gives pval = 0.00800 / val = 0.041\n", + " Still subsets of dimension 4 left, but q_max = 1 reached.\n", + "\n", + " Link (X9 -1) --> X10 (2/5):\n", + " Subset 0: (X4 -4) (X2 -1) (X4 -1) (X5 -4) gives pval = 0.07300 / val = 0.036\n", + " Still subsets of dimension 4 left, but q_max = 1 reached.\n", + "\n", + " Link (X2 -1) --> X10 (3/5):\n", + " Subset 0: (X4 -4) (X9 -1) (X4 -1) (X5 -4) gives pval = 0.00400 / val = 0.043\n", + " Still subsets of dimension 4 left, but q_max = 1 reached.\n", + "\n", + " Link (X4 -1) --> X10 (4/5):\n", + " Subset 0: (X4 -4) (X9 -1) (X2 -1) (X5 -4) gives pval = 0.35500 / val = 0.025\n", + " Non-significance detected.\n", + "\n", + " Link (X5 -4) --> X10 (5/5):\n", + " Subset 0: (X4 -4) (X9 -1) (X2 -1) (X4 -1) gives pval = 0.19700 / val = 0.032\n", + " Still subsets of dimension 4 left, but q_max = 1 reached.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X10 has 4 link(s):\n", + " (X4 -4): max_pval = 0.08100, min_val = 0.038\n", + " (X9 -1): max_pval = 0.11400, min_val = 0.034\n", + " (X2 -1): max_pval = 0.22300, min_val = 0.029\n", + " (X5 -4): max_pval = 0.21900, min_val = 0.026\n", + "\n", + "Algorithm converged for variable X10\n", + "\n", + "## Variable X1\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Testing condition sets of dimension 0:\n", + "\n", + " Link (X1 -1) --> X1 (1/36):\n", + " Subset 0: () gives pval = 0.91800 / val = -0.000\n", + " Non-significance detected.\n", + "\n", + " Link (X1 -2) --> X1 (2/36):\n", + " Subset 0: () gives pval = 0.03800 / val = 0.047\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X1 -3) --> X1 (3/36):\n", + " Subset 0: () gives pval = 0.87700 / val = 0.002\n", + " Non-significance detected.\n", + "\n", + " Link (X1 -4) --> X1 
(4/36):\n", + " Subset 0: () gives pval = 0.16500 / val = 0.033\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X2 -1) --> X1 (5/36):\n", + " Subset 0: () gives pval = 0.29500 / val = 0.023\n", + " Non-significance detected.\n", + "\n", + " Link (X2 -2) --> X1 (6/36):\n", + " Subset 0: () gives pval = 0.55500 / val = 0.012\n", + " Non-significance detected.\n", + "\n", + " Link (X2 -3) --> X1 (7/36):\n", + " Subset 0: () gives pval = 0.45000 / val = 0.016\n", + " Non-significance detected.\n", + "\n", + " Link (X2 -4) --> X1 (8/36):\n", + " Subset 0: () gives pval = 0.43700 / val = 0.018\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -1) --> X1 (9/36):\n", + " Subset 0: () gives pval = 0.47700 / val = 0.016\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -2) --> X1 (10/36):\n", + " Subset 0: () gives pval = 0.94600 / val = -0.001\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -3) --> X1 (11/36):\n", + " Subset 0: () gives pval = 0.38500 / val = 0.020\n", + " Non-significance detected.\n", + "\n", + " Link (X3 -4) --> X1 (12/36):\n", + " Subset 0: () gives pval = 0.59100 / val = 0.013\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -1) --> X1 (13/36):\n", + " Subset 0: () gives pval = 0.29600 / val = 0.023\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -2) --> X1 (14/36):\n", + " Subset 0: () gives pval = 0.38900 / val = 0.019\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -3) --> X1 (15/36):\n", + " Subset 0: () gives pval = 0.22400 / val = 0.024\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X4 -4) --> X1 (16/36):\n", + " Subset 0: () gives pval = 0.45500 / val = 0.017\n", + " Non-significance detected.\n", + "\n", + " Link (X5 -1) --> X1 (17/36):\n", + " Subset 0: () gives pval = 0.00500 / val = 0.069\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X5 -2) --> X1 (18/36):\n", + " Subset 0: () gives pval = 0.08000 / val = 0.039\n", + " No 
conditions of dimension 0 left.\n", + "\n", + " Link (X5 -3) --> X1 (19/36):\n", + " Subset 0: () gives pval = 0.42100 / val = 0.014\n", + " Non-significance detected.\n", + "\n", + " Link (X5 -4) --> X1 (20/36):\n", + " Subset 0: () gives pval = 0.82300 / val = 0.007\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -1) --> X1 (21/36):\n", + " Subset 0: () gives pval = 0.39000 / val = 0.019\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -2) --> X1 (22/36):\n", + " Subset 0: () gives pval = 0.70200 / val = 0.008\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -3) --> X1 (23/36):\n", + " Subset 0: () gives pval = 0.25800 / val = 0.027\n", + " Non-significance detected.\n", + "\n", + " Link (X6 -4) --> X1 (24/36):\n", + " Subset 0: () gives pval = 0.69000 / val = 0.009\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -1) --> X1 (25/36):\n", + " Subset 0: () gives pval = 0.52200 / val = 0.012\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -2) --> X1 (26/36):\n", + " Subset 0: () gives pval = 0.28900 / val = 0.022\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -3) --> X1 (27/36):\n", + " Subset 0: () gives pval = 0.24000 / val = 0.025\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X7 -4) --> X1 (28/36):\n", + " Subset 0: () gives pval = 0.29900 / val = 0.022\n", + " Non-significance detected.\n", + "\n", + " Link (X8 -1) --> X1 (29/36):\n", + " Subset 0: () gives pval = 0.06100 / val = 0.041\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X8 -2) --> X1 (30/36):\n", + " Subset 0: () gives pval = 0.19800 / val = 0.028\n", + " No conditions of dimension 0 left.\n", + "\n", + " Link (X8 -3) --> X1 (31/36):\n", + " Subset 0: () gives pval = 0.75200 / val = 0.006\n", + " Non-significance detected.\n", + "\n", + " Link (X8 -4) --> X1 (32/36):\n", + " Subset 0: () gives pval = 0.34700 / val = 0.020\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -1) --> X1 
(33/36):\n", + " Subset 0: () gives pval = 0.85200 / val = 0.003\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -2) --> X1 (34/36):\n", + " Subset 0: () gives pval = 0.94600 / val = -0.002\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -3) --> X1 (35/36):\n", + " Subset 0: () gives pval = 0.75700 / val = 0.007\n", + " Non-significance detected.\n", + "\n", + " Link (X9 -4) --> X1 (36/36):\n", + " Subset 0: () gives pval = 0.85700 / val = 0.001\n", + " Non-significance detected.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X1 has 8 link(s):\n", + " (X5 -1): max_pval = 0.00500, min_val = 0.069\n", + " (X1 -2): max_pval = 0.03800, min_val = 0.047\n", + " (X8 -1): max_pval = 0.06100, min_val = 0.041\n", + " (X5 -2): max_pval = 0.08000, min_val = 0.039\n", + " (X1 -4): max_pval = 0.16500, min_val = 0.033\n", + " (X8 -2): max_pval = 0.19800, min_val = 0.028\n", + " (X7 -3): max_pval = 0.24000, min_val = 0.025\n", + " (X4 -3): max_pval = 0.22400, min_val = 0.024\n", + "\n", + "Testing condition sets of dimension 1:\n", + "\n", + " Link (X5 -1) --> X1 (1/8):\n", + " Subset 0: (X1 -2) gives pval = 0.56400 / val = 0.015\n", + " Non-significance detected.\n", + "\n", + " Link (X1 -2) --> X1 (2/8):\n", + " Subset 0: (X5 -1) gives pval = 0.03900 / val = 0.036\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X8 -1) --> X1 (3/8):\n", + " Subset 0: (X5 -1) gives pval = 0.41700 / val = 0.022\n", + " Non-significance detected.\n", + "\n", + " Link (X5 -2) --> X1 (4/8):\n", + " Subset 0: (X5 -1) gives pval = 0.03400 / val = 0.038\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X1 -4) --> X1 (5/8):\n", + " Subset 0: (X5 -1) gives pval = 0.46400 / val = 0.015\n", + " Non-significance detected.\n", + "\n", + " Link (X8 -2) --> X1 (6/8):\n", + " Subset 0: (X5 -1) gives pval = 0.74200 / val = 
0.011\n", + " Non-significance detected.\n", + "\n", + " Link (X7 -3) --> X1 (7/8):\n", + " Subset 0: (X5 -1) gives pval = 0.09700 / val = 0.025\n", + " No conditions of dimension 1 left.\n", + "\n", + " Link (X4 -3) --> X1 (8/8):\n", + " Subset 0: (X5 -1) gives pval = 0.20400 / val = 0.024\n", + " No conditions of dimension 1 left.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X1 has 4 link(s):\n", + " (X5 -2): max_pval = 0.08000, min_val = 0.038\n", + " (X1 -2): max_pval = 0.03900, min_val = 0.036\n", + " (X7 -3): max_pval = 0.24000, min_val = 0.025\n", + " (X4 -3): max_pval = 0.22400, min_val = 0.024\n", + "\n", + "Testing condition sets of dimension 2:\n", + "\n", + " Link (X5 -2) --> X1 (1/4):\n", + " Subset 0: (X1 -2) (X7 -3) gives pval = 0.10800 / val = 0.032\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X1 -2) --> X1 (2/4):\n", + " Subset 0: (X5 -2) (X7 -3) gives pval = 0.00000 / val = 0.049\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Link (X7 -3) --> X1 (3/4):\n", + " Subset 0: (X5 -2) (X1 -2) gives pval = 0.27800 / val = 0.024\n", + " Non-significance detected.\n", + "\n", + " Link (X4 -3) --> X1 (4/4):\n", + " Subset 0: (X5 -2) (X1 -2) gives pval = 0.16000 / val = 0.030\n", + " Still subsets of dimension 2 left, but q_max = 1 reached.\n", + "\n", + " Sorting parents in decreasing order with \n", + " weight(i-tau->j) = min_{iterations} |val_{ij}(tau)| \n", + "\n", + "Updating parents:\n", + "\n", + " Variable X1 has 3 link(s):\n", + " (X1 -2): max_pval = 0.03900, min_val = 0.036\n", + " (X5 -2): max_pval = 0.10800, min_val = 0.032\n", + " (X4 -3): max_pval = 0.22400, min_val = 0.024\n", + "\n", + "Algorithm converged for variable X1\n", + "\n", + "## Variable X2\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 
(1/1):\n", + "\n", + "Algorithm converged for variable X2\n", + "\n", + "## Variable X3\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X3\n", + "\n", + "## Variable X4\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X4\n", + "\n", + "## Variable X5\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X5\n", + "\n", + "## Variable X6\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X6\n", + "\n", + "## Variable X7\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X7\n", + "\n", + "## Variable X8\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X8\n", + "\n", + "## Variable X9\n", + "\n", + "Iterating through pc_alpha = [0.25]:\n", + "\n", + "# pc_alpha = 0.25 (1/1):\n", + "\n", + "Algorithm converged for variable X9\n", + "\n", + "## Resulting lagged parent (super)sets:\n", + "\n", + " Variable X10 has 4 link(s):\n", + " (X4 -4): max_pval = 0.08100, min_val = 0.040\n", + " (X9 -1): max_pval = 0.11400, min_val = 0.034\n", + " (X2 -1): max_pval = 0.22300, min_val = 0.029\n", + " (X5 -4): max_pval = 0.21900, min_val = 0.026\n", + "\n", + " Variable X1 has 3 link(s):\n", + " (X1 -2): max_pval = 0.03900, min_val = 0.036\n", + " (X5 -2): max_pval = 0.10800, min_val = 0.032\n", + " (X4 -3): max_pval = 0.22400, min_val = 0.024\n", + "\n", + " Variable X2 has 0 link(s):\n", + "\n", + " Variable X3 has 0 link(s):\n", + "\n", + " Variable X4 has 0 link(s):\n", + "\n", + " Variable X5 has 0 link(s):\n", + "\n", + " Variable X6 has 0 
link(s):\n", + "\n", + " Variable X7 has 0 link(s):\n", + "\n", + " Variable X8 has 0 link(s):\n", + "\n", + " Variable X9 has 0 link(s):\n", + "\n", + "##\n", + "## Step 2: MCI algorithm\n", + "##\n", + "\n", + "Parameters:\n", + "\n", + "independence test = cmi_knn\n", + "tau_min = 0\n", + "tau_max = 4\n", + "max_conds_py = None\n", + "max_conds_px = None\n", + "\n", + " link (X1 -1) --> X10 (1/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ (X1 -3) (X5 -3) (X4 -4) ]\n", + "\n", + " link (X1 -2) --> X10 (2/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ (X1 -4) (X5 -4) (X4 -5) ]\n", + "\n", + " link (X1 -3) --> X10 (3/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ (X1 -5) (X5 -5) (X4 -6) ]\n", + "\n", + " link (X1 -4) --> X10 (4/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ (X1 -6) (X5 -6) (X4 -7) ]\n", + "\n", + " link (X2 -1) --> X10 (5/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X2 -2) --> X10 (6/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X2 -3) --> X10 (7/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X2 -4) --> X10 (8/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -1) --> X10 (9/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -2) --> X10 (10/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -3) --> X10 (11/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -4) --> X10 (12/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", 
+ " with conds_x = [ ]\n", + "\n", + " link (X4 -1) --> X10 (13/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -2) --> X10 (14/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -3) --> X10 (15/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -4) --> X10 (16/36):\n", + " with conds_y = [ (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -1) --> X10 (17/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -2) --> X10 (18/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -3) --> X10 (19/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -4) --> X10 (20/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -1) --> X10 (21/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -2) --> X10 (22/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -3) --> X10 (23/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -4) --> X10 (24/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -1) --> X10 (25/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -2) --> X10 (26/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -3) --> X10 (27/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", 
+ " link (X7 -4) --> X10 (28/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -1) --> X10 (29/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -2) --> X10 (30/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -3) --> X10 (31/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -4) --> X10 (32/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -1) --> X10 (33/36):\n", + " with conds_y = [ (X4 -4) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -2) --> X10 (34/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -3) --> X10 (35/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -4) --> X10 (36/36):\n", + " with conds_y = [ (X4 -4) (X9 -1) (X2 -1) (X5 -4) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X1 -1) --> X1 (1/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ (X1 -3) (X5 -3) (X4 -4) ]\n", + "\n", + " link (X1 -2) --> X1 (2/36):\n", + " with conds_y = [ (X5 -2) (X4 -3) ]\n", + " with conds_x = [ (X1 -4) (X5 -4) (X4 -5) ]\n", + "\n", + " link (X1 -3) --> X1 (3/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ (X1 -5) (X5 -5) (X4 -6) ]\n", + "\n", + " link (X1 -4) --> X1 (4/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ (X1 -6) (X5 -6) (X4 -7) ]\n", + "\n", + " link (X2 -1) --> X1 (5/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X2 -2) --> X1 (6/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", 
+ " link (X2 -3) --> X1 (7/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X2 -4) --> X1 (8/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -1) --> X1 (9/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -2) --> X1 (10/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -3) --> X1 (11/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X3 -4) --> X1 (12/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -1) --> X1 (13/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -2) --> X1 (14/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -3) --> X1 (15/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X4 -4) --> X1 (16/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -1) --> X1 (17/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -2) --> X1 (18/36):\n", + " with conds_y = [ (X1 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -3) --> X1 (19/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X5 -4) --> X1 (20/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -1) --> X1 (21/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -2) --> X1 (22/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -3) --> X1 (23/36):\n", + " with 
conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X6 -4) --> X1 (24/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -1) --> X1 (25/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -2) --> X1 (26/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -3) --> X1 (27/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X7 -4) --> X1 (28/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -1) --> X1 (29/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -2) --> X1 (30/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -3) --> X1 (31/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X8 -4) --> X1 (32/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -1) --> X1 (33/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -2) --> X1 (34/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -3) --> X1 (35/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + " link (X9 -4) --> X1 (36/36):\n", + " with conds_y = [ (X1 -2) (X5 -2) (X4 -3) ]\n", + " with conds_x = [ ]\n", + "\n", + "## Significant links at alpha = 0.05:\n", + "\n", + " Variable X10 has 3 link(s):\n", + " (X4 -4): pval = 0.01600 | val = 0.040\n", + " (X2 -1): pval = 0.00900 | val = 0.039\n", + " (X9 -1): pval = 0.02600 | val = 0.039\n", + "\n", + " Variable X1 has 4 link(s):\n", + " (X3 -3): pval = 0.00100 | val = 0.050\n", + " (X1 
-2): pval = 0.00900 | val = 0.039\n", + " (X1 -3): pval = 0.03900 | val = 0.035\n", + " (X5 -2): pval = 0.02200 | val = 0.032\n", + "\n", + " Variable X2 has 0 link(s):\n", + "\n", + " Variable X3 has 0 link(s):\n", + "\n", + " Variable X4 has 0 link(s):\n", + "\n", + " Variable X5 has 0 link(s):\n", + "\n", + " Variable X6 has 0 link(s):\n", + "\n", + " Variable X7 has 0 link(s):\n", + "\n", + " Variable X8 has 0 link(s):\n", + "\n", + " Variable X9 has 0 link(s):\n" + ] + } + ], + "source": [ + "# convert to pp dataframe\n", + "dataframe = pp.DataFrame(df_stat.to_numpy(),\n", + " datatime = np.arange(len(df_stat)), \n", + " var_names=df_stat.columns)\n", + "\n", + "# CMIknn as independence test\n", + "cmi_knn = CMIknn(significance='shuffle_test', shuffle_neighbors=5, transform='ranks')\n", + "\n", + "# Configure the links that you want to causal-check with PCMCI\n", + "target_column_indices = [0,1]\n", + "selected_links = get_selected_links(df_stat, \n", + " tau_min=tau_min, \n", + " tau_max=tau_max, \n", + " selected_columns_indices=target_column_indices)\n", + "\n", + "# Instantiate PCMCI\n", + "pcmci_cmi_knn = PCMCI(\n", + " dataframe=dataframe, \n", + " cond_ind_test=cmi_knn,\n", + " verbosity=2)\n", + "\n", + "# Run PCMCI algorithm with a given alpha\n", + "# in this example, the value is taken very high (not really conservative) to be sure we have results.\n", + "# Remember that we are working with a toy dataset! 
IRL, a lower alpha value within [0.05-0.4] is usually more appropriate.\n", + "# Note that running this function takes a lot of compute and therefore time!\n", + "alpha = 0.25\n", + "results = pcmci_cmi_knn.run_pcmci(selected_links=selected_links, \n", + " tau_min=tau_min, \n", + " tau_max=tau_max, \n", + " pc_alpha=alpha)" + ] + }, + { + "cell_type": "markdown", + "id": "a66ca3d3", + "metadata": {}, + "source": [ + "## Visualize results PCMCI" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "02066fad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "## Significant links at alpha = 0.01:\n", + "\n", + " Variable X10 has 1 link(s):\n", + " (X2 -1): pval = 0.00900 | val = 0.039\n", + "\n", + " Variable X1 has 2 link(s):\n", + " (X3 -3): pval = 0.00100 | val = 0.050\n", + " (X1 -2): pval = 0.00900 | val = 0.039\n", + "\n", + " Variable X2 has 0 link(s):\n", + "\n", + " Variable X3 has 0 link(s):\n", + "\n", + " Variable X4 has 0 link(s):\n", + "\n", + " Variable X5 has 0 link(s):\n", + "\n", + " Variable X6 has 0 link(s):\n", + "\n", + " Variable X7 has 0 link(s):\n", + "\n", + " Variable X8 has 0 link(s):\n", + "\n", + " Variable X9 has 0 link(s):\n", + "X10\n", + "X1\n", + "X2\n", + "X3\n", + "X4\n", + "X5\n", + "X9\n" + ] + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "process_and_visualize_results(results, pcmci_cmi_knn, df_stat.columns, target_column_indices)" + ] + }, + { + "cell_type": "markdown", + "id": "2f0722ec", + "metadata": {}, + "source": [ + "Have a look for yourself! Do the results for TE and PCMCI match? \n", + "- No? Well, then you probably have not found very strong evidence for the presence of the causal links!\n", + "- Yes? This indicates that the results are consistent across both methods, which makes the causal links more credible!\n", + "\n", + "Please note that this is a toy dataset whose true causal links are unknown and, judging by the small test statistics, very weak.\n", + "\n", + "Make sure to look deeper into the hyperparameters of PCMCI and TE so that they suit your data better. Good luck and have fun :)" + ] + } + ], + "metadata": { + "interpreter": { + "hash": "916dbcbb3f70747c44a77c7bcd40155683ae19c65e1c03b4aa3499c5328201f1" + }, + "kernelspec": { + "display_name": "Python 3.10.4 64-bit", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.4" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/structured_data/2022_06_02_causality/src/data.pickle b/structured_data/2022_06_02_causality/src/data.pickle new file mode 100644 index 0000000..efa01a3 Binary files /dev/null and b/structured_data/2022_06_02_causality/src/data.pickle differ diff --git a/structured_data/2022_06_02_causality/src/helpers/pcmci.py b/structured_data/2022_06_02_causality/src/helpers/pcmci.py new file mode 100644 index 0000000..b042567 --- /dev/null +++ b/structured_data/2022_06_02_causality/src/helpers/pcmci.py @@ -0,0 +1,97 @@ +import numpy as np +from tigramite 
import plotting as tp + +def get_selected_links(df, tau_min=0, 
tau_max=3, selected_columns_indices = None): + """ + Initialize dictionary with every possible link (i.e., combination of + columns) to be tested by PCMCI. Note that only the causality of marketing + channels on sales is considered and NOT causality between channels. + + Arguments: + - df (pd.DataFrame): input data + - tau_min (int): timelag to start from + - tau_max (int): timelag to end with + - selected_columns_indices (list[int]): indices of the target columns; all other columns get no links + + Returns: + - dict[int, list[Tuple[int, int]]]: per column index, the (column, -lag) links to test + """ + selected_links = {} + n_cols = list(range(len(df.columns))) + + for col in n_cols: + selected_links[col] = [(link_col, -lag) for link_col in n_cols + for lag in range(tau_min, tau_max + 1) + if link_col>0 and lag>0] + + if col not in selected_columns_indices: + # Do not consider causality between channels + selected_links[col] = [] # only need first col as ref + + return selected_links + +def process_and_visualize_results(results, pcmci, cols, target_indices, controlFDR = False): + """ + Process and visualize the results of PCMCI. + + Arguments: + - results (dict): Output of PCMCI run. + - pcmci (tigramite.pcmci.PCMCI): PCMCI object + - cols (list): column names + - target_indices (list): indices of target columns + - controlFDR (bool): whether to use the q_matrix, which involves a transformation + of the p_values to account for the number of statistical tests done. + Recommended if you checked many links using PCMCI. 
+ See the following link for more information: + https://github.com/jakobrunge/tigramite/blob/master/tutorials/tigramite_tutorial_basics.ipynb + """ + + + if not controlFDR: + pcmci.print_significant_links( + p_matrix = results['p_matrix'], + val_matrix = results['val_matrix'], + alpha_level = 0.01) + + else: + q_matrix = pcmci.get_corrected_pvalues(p_matrix=results['p_matrix'], + fdr_method='fdr_bh', + exclude_contemporaneous = False) + + pcmci.print_significant_links( + p_matrix = results['p_matrix'], + q_matrix = q_matrix, + val_matrix = results['val_matrix'], + alpha_level = 0.01) + + column_indices = set(target_indices) + + for ind, i in enumerate(results['graph']): + if not set(i.flatten()) == set(['']): + column_indices.add(ind) + + + tmp_results_val_matrix = np.array([i[list(column_indices)] for ind, i in enumerate(results['val_matrix']) if ind in list(column_indices)]) + + graph_small = np.array([i[list(column_indices)] for ind, i in enumerate(results['graph']) if ind in list(column_indices)]) + + var_names_small = [] + for i in column_indices: + print(cols[i]) + for i in column_indices: + var_names_small.append(cols[i]) + + tp.plot_graph( + val_matrix=tmp_results_val_matrix, + graph=graph_small, + var_names=var_names_small, + ) + + # Plot time series graph + tp.plot_time_series_graph( + figsize=(6, 4), + val_matrix=tmp_results_val_matrix, + graph=graph_small, + var_names=var_names_small, + link_colorbar_label='MCI', + ) \ No newline at end of file diff --git a/structured_data/2022_06_02_causality/src/helpers/stationarity.py b/structured_data/2022_06_02_causality/src/helpers/stationarity.py new file mode 100644 index 0000000..aa05469 --- /dev/null +++ b/structured_data/2022_06_02_causality/src/helpers/stationarity.py @@ -0,0 +1,144 @@ +""" +Functions for determining if a time series is stationary and for making it stationary +in case it is not. 
+""" +# Import standard library modules +import math +from typing import Tuple, List +import warnings + +# Import third party modules +from matplotlib import pyplot as plt +import pandas as pd + +import statsmodels.api as sm +from statsmodels.tools.sm_exceptions import InterpolationWarning +from statsmodels.tsa.stattools import adfuller, kpss + +def perform_kpss_test(df: pd.DataFrame, col: str, debug: bool=False) -> Tuple[bool, float]: + """Perform the Kwiatkowski-Phillips-Schmidt-Shin (KPSS) test for the null + hypothesis that x is level or trend stationary. + + Arguments: + - df (pd.DataFrame): Dataframe for which to check for stationarity. + - col (str): Name of column within dataframe to check stationarity for. + - debug (bool): Whether or not to print intermediate results. + + Returns: + - bool: Whether or not the column of the dataframe is stationary. + - float: Significance with which conclusion is made. + """ + # Select `col` column from argument `df` dataframe + df_col = df[[col]] + + # Perform KPSS test (hyp: stationary) while catching InterpolationWarning messages + with warnings.catch_warnings(record=True) as w: + # Cause all warnings to always be triggered. + warnings.simplefilter("always") + kpss_test = kpss(df_col, nlags='legacy') # regression='c'|'ct' + + if len(w) == 1 and issubclass(w[-1].category, InterpolationWarning): + p_value_oob = True + else: + p_value_oob = False + + kpss_output = pd.Series(kpss_test[0:3], + index=['test_statistic', 'p_value', 'lags']) + for key, value in kpss_test[3].items(): + kpss_output['Critical Value (%s)'%key] = value + + p_value = kpss_output['p_value'] + stationary = p_value >= 0.05 # Stationary if null-hyp. cannot be rejected. 
+ + if debug or not stationary: + print(f'\t(KPSS) Time-series IS {"" if stationary else "NOT "}trend-stationary (p{">" if p_value_oob else "="}{p_value})!') + return stationary, p_value + + +def perform_adf_test(df: pd.DataFrame, col: str, debug: bool=False) -> Tuple[bool, float]: + """Perform the Augmented Dickey-Fuller (ADF) test for a unit root in a + univariate process in the presence of serial correlation. + + Arguments: + - df (pd.DataFrame): Dataframe for which to check for stationarity. + - col (str): Name of column within dataframe to check stationarity for. + - debug (bool): Whether or not to print intermediate results. + + Returns: + - bool: Whether or not the column of the dataframe is stationary. + - float: p-value on which the conclusion is based. + """ + # Select `col` column from argument `df` dataframe + df_col = df[[col]] + + # Difference column values + df_col = df_col[col].diff() + df_col = df_col.fillna(0) # Replace the NaN introduced by differencing with 0 + + # Perform ADF unit root test + adf_test = adfuller(df_col, autolag='AIC') + adf_output = pd.Series(adf_test[0:4], index=['test_statistic','p_value','lags','observations']) + for key,value in adf_test[4].items(): + adf_output['Critical Value (%s)'%key] = value + + p_value = adf_output['p_value'] + stationary = p_value < 0.05 # Stationary if null-hyp. is rejected! + + if debug or not stationary: + print(f'\t(ADF) Time-series IS {"" if stationary else "NOT "}difference stationary (p={p_value})!') + + return stationary, p_value + + +def remove_trend_and_diff(df: pd.DataFrame, debug: bool=False) -> pd.DataFrame: + """Remove trend by differencing every column of the dataframe as often as + necessary for both the KPSS and ADF tests to consider the resulting + time-series stationary. + + Arguments: + - df (pd.DataFrame): Dataframe of which stationarity must be checked and + guaranteed. + - debug (bool): Whether or not to print intermediate results (for + debugging purposes). 
+ + Returns: + - pd.DataFrame: Stationary dataframe. + """ + # Keep track of number of differencing operations to omit the zero-filled rows at start of dataframe + max_diff = 1 + + # Initialize differenced dataframe + df_diff = df.copy() + + + # Make every column of dataframe stationary by differencing until both tests pass + for col in df_diff.columns: + print("tackling new col", col) + periods = 0 + kpss_stat, kpss_p = perform_kpss_test(df_diff[periods:], col, debug=debug) + adf_stat, adf_p = perform_adf_test(df_diff[periods:], col, debug=debug) + + while not (kpss_stat and adf_stat): + print(f" iteration {periods}") + + # Log number of differencing operations + periods += 1 + print(f'\tDifferencing results over {periods} period{"s" if periods - 1 else ""}...') + + # Difference signal + df_diff[col] = df_diff[col].diff() + df_diff = df_diff.fillna(0) + + # Check for stationarity + kpss_stat, kpss_p = perform_kpss_test(df_diff[periods:], col, debug=debug) + adf_stat, adf_p = perform_adf_test(df_diff[periods:], col, debug=debug) + + # Print if stationarity is obtained + print(f' --> (KPSS & ADF) Time-series IS stationary for {col} (after {periods} differencing operations)!') + max_diff = max(max_diff, periods) # Remember the largest number of differencing operations over all columns + # Break up print statements between columns + print('') + + print(f'(Maximum number of differencing operations performed was {max_diff})') + # Return detrended (and possibly differenced) dataframe + return df_diff[max_diff:] \ No newline at end of file diff --git a/structured_data/2022_06_02_causality/src/helpers/transfer_entropy.py b/structured_data/2022_06_02_causality/src/helpers/transfer_entropy.py new file mode 100644 index 0000000..224cb52 --- /dev/null +++ b/structured_data/2022_06_02_causality/src/helpers/transfer_entropy.py @@ -0,0 +1,43 @@ +import matplotlib.pyplot as plt +import pandas as pd + +def export_as_df(te_output): + """Transform the output of transfer entropy to a pandas dataframe. + + Args: + te_output (list): output of a transfer entropy analysis. + + Returns: + pd.DataFrame: dataframe including the p values. 
+ """ + df = pd.DataFrame() + for index, info in enumerate(te_output[0]): + col_name, *_ = info + data = [] + for lag in te_output: + data.append(lag[index][1]["p_value_XY"].iloc[0]) + df[col_name] = pd.Series(data) + return df + + +def viz_df_raw(df, booldf, threshold): + """Visualize results of a Transfer Entropy analysis. + + Args: + df (pd.DataFrame): input data with raw p values. + booldf (pd.DataFrame): input data after thresholding containing booleans. + threshold (float): threshold used. + """ + fs, fs_ax = plt.subplots(len(df.columns), 1, figsize=(10,len(df.columns)*2)) + + for ind, col in enumerate(df.columns): + print(col) + df[col].astype(float).plot(kind='line', ax=fs_ax[ind], legend = col) + booldf[col].astype(float).plot(kind='bar', ax=fs_ax[ind], stacked=False, alpha=0.3) + fs_ax[ind].set_ylim([0,1]) + if ind==0: + fs_ax[ind].set_title(f"Causal relationships found - Transfer Entropy with significance level = {threshold}") + if ind == len(df.columns)-1: + fs_ax[ind].set_xlabel("lags") + fs.tight_layout() + fs.subplots_adjust(hspace=0.4, wspace=0) diff --git a/structured_data/2022_06_02_causality/src/transfer_entropy/README.md b/structured_data/2022_06_02_causality/src/transfer_entropy/README.md new file mode 100644 index 0000000..43fc7ef --- /dev/null +++ b/structured_data/2022_06_02_causality/src/transfer_entropy/README.md @@ -0,0 +1,5 @@ +## Purpose of this folder +We recommend using the [PyCausality](https://pypi.org/project/PyCausality/) package when working with Transfer Entropy. +The package is no longer maintained, but the code still works well. +Because the pip package is, or may become, unstable, we copied the source code and created a wrapper around it. +See [the example notebook](./../Example%20notebook.ipynb) for inspiration on how to integrate this code into your own project! 
\ No newline at end of file diff --git a/structured_data/2022_06_02_causality/src/transfer_entropy/__init__.py b/structured_data/2022_06_02_causality/src/transfer_entropy/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/structured_data/2022_06_02_causality/src/transfer_entropy/pycausality/__init__.py b/structured_data/2022_06_02_causality/src/transfer_entropy/pycausality/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/structured_data/2022_06_02_causality/src/transfer_entropy/pycausality/src.py b/structured_data/2022_06_02_causality/src/transfer_entropy/pycausality/src.py new file mode 100644 index 0000000..6cf2f5a --- /dev/null +++ b/structured_data/2022_06_02_causality/src/transfer_entropy/pycausality/src.py @@ -0,0 +1,1226 @@ +import pandas as pd +import statsmodels.api as sm +import numpy as np + +from numpy import ma, atleast_2d, pi, sqrt, sum +from scipy import stats, linalg +from scipy.special import gammaln +from six import string_types +from scipy.stats.mstats import mquantiles + +from copy import deepcopy + +import matplotlib.pyplot as plt +import matplotlib.cm as cm + +from dateutil.relativedelta import relativedelta + +import warnings +import sys + + +class LaggedTimeSeries(): + """ + Custom wrapper class for pandas DataFrames for performing predictive analysis. + Generates lagged time series and performs custom windowing over datetime indexes + """ + + def __init__(self, df, endog, lag=None, max_lag_only=True, window_size=None, window_stride=None): + """ + Args: + df - Pandas DataFrame object of N columns. Must be indexed as an increasing + time series (i.e. past-to-future), with equal timesteps between each row + lags - The number of steps to be included. Each increase in Lags will result + in N additional columns, where N is the number of columns in the original + dataframe. It will also remove the first N rows. 
+ max_lag_only - Defines whether the returned dataframe contains all lagged timeseries up to + and including the defined lag, or only the time series equal to this lag value + window_size - Dict containing key-value pairs only from within: {'YS':0,'MS':0,'D':0,'H':0,'min':0,'S':0,'ms':0} + Describes the desired size of each window, provided the data is indexed with datetime type. Leave as + None for no windowing. Units follow http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases + window_stride - Dict containing key-value pairs only from within: {'YS':0,'MS':0,'D':0,'H':0,'min':0,'S':0,'ms':0} + Describes the size of the step between consecutive windows, provided the data is indexed with datetime type. Leave as + None for no windowing. Units follow http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases + + Returns: - n/a + """ + self.df = sanitise(df) + self.endog = endog + self.axes = list(self.df.columns.values) # Variable names + + self.max_lag_only = max_lag_only + if lag is not None: + self.t = lag + self.df = self.__apply_lags__() + + if window_size is not None and window_stride is not None: + self.has_windows = True + self.__apply_windows__(window_size, window_stride) + else: + self.has_windows = False + + def __apply_lags__(self): + """ + Args: + n/a + Returns: + new_df.iloc[self.t:] - This is a new dataframe containing the original columns and + all lagged columns. Note that the first few rows (equal to self.t) will + be removed from the top, since lagged values are of course not available + for these indexes. 
+ """ + # Create a new dataframe to maintain the new data, dropping rows with NaN + new_df = self.df.copy(deep=True).dropna() + + # Create new column with lagged timeseries for each variable + col_names = self.df.columns.values.tolist() + + # If the user wants to only consider the time series lagged by the + # maximum number specified or by every series up to an including the maximum lag: + if self.max_lag_only == True: + for col_name in col_names: + new_df[col_name + '_lag' + + str(self.t)] = self.df[col_name].shift(self.t) + + elif self.max_lag_only == False: + for col_name in col_names: + for t in range(1, self.t+1): + new_df[col_name + '_lag' + + str(t)] = self.df[col_name].shift(t) + else: + raise ValueError('Error') + + # Drop the first t rows, which now contain NaN + return new_df.iloc[self.t:] + + def __apply_windows__(self, window_size, window_stride): + """ + Args: + window_size - Dict passed from self.__init__ + window_stride - Dict passed from self.__init__ + Returns: + n/a - Sets the daterange for the self.windows property to iterate along + """ + self.window_size = {'YS': 0, 'MS': 0, 'D': 0, + 'H': 0, 'min': 0, 'S': 0, 'ms': 0} + self.window_stride = {'YS': 0, 'MS': 0, + 'D': 0, 'H': 0, 'min': 0, 'S': 0, 'ms': 0} + + self.window_stride.update(window_stride) + self.window_size.update(window_size) + freq = '' + daterangefreq = freq.join( + [str(v)+str(k) for (k, v) in self.window_stride.items() if v != 0]) + self.daterange = pd.date_range( + self.df.index.min(), self.df.index.max(), freq=daterangefreq) + + def date_diff(self, window_size): + """ + Args: + window_size - Dict passed from self.windows function + Returns: + start_date - The start date of the proposed window + end_date - The end date of the proposed window + + This function is TBC - proposed due to possible duplication of the relativedelta usage in self.windows and self.headstart + """ + pass + + @property + def windows(self): + """ + Args: + n/a + Returns: + windows - Generator defining a 
pandas DataFrame for each window of the data. + Usage like: [window for window in LaggedTimeSeries.windows] + """ + if self.has_windows == False: + return self.df + # Loop Over TimeSeries Range + for i, dt in enumerate(self.daterange): + + # Ensure Each Division Contains Required Number of Months + if dt-relativedelta(years=self.window_size['YS'], + months=self.window_size['MS'], + days=self.window_size['D'], + hours=self.window_size['H'], + minutes=self.window_size['min'], + seconds=self.window_size['S'], + microseconds=self.window_size['ms'] + ) >= self.df.index.min(): + + # Create Window + yield self.df.loc[(dt-relativedelta(years=self.window_size['YS'], + months=self.window_size['MS'], + days=self.window_size['D'], + hours=self.window_size['H'], + minutes=self.window_size['min'], + seconds=self.window_size['S'], + microseconds=self.window_size['ms'] + )): dt] + + @property + def headstart(self): + """ + Args: + n/a + Returns: + len(windows) - The number of windows which would have start dates before the desired date range. + Used in TransferEntropy class to slice off incomplete windows. + + """ + windows = [i for i, dt in enumerate(self.daterange) + if dt-relativedelta(years=self.window_size['YS'], + months=self.window_size['MS'], + days=self.window_size['D'], + hours=self.window_size['H'], + minutes=self.window_size['min'], + seconds=self.window_size['S'], + microseconds=self.window_size['ms'] + ) < self.df.index.min()] + # i.e. count from the first window which falls entirely after the earliest date + return len(windows) + + +class TransferEntropy(): + """ + Functional class to calculate Transfer Entropy between time series, to detect causal signals. + Currently accepts two series: X(t) and Y(t). Future extensions planned to accept additional endogenous + series: X1(t), X2(t), X3(t) etc. 
+ """ + + def __init__(self, DF, endog, exog, lag=None, window_size=None, window_stride=None): + """ + Args: + DF - (DataFrame) Time series data for X and Y (NOT including lagged variables) + endog - (string) Fieldname for endogenous (dependent) variable Y + exog - (string) Fieldname for exogenous (independent) variable X + lag - (integer) Number of periods (rows) by which to lag timeseries data + window_size - (Dict) Must contain key-value pairs only from within: {'YS':0,'MS':0,'D':0,'H':0,'min':0,'S':0,'ms':0} + Describes the desired size of each window, provided the data is indexed with datetime type. Leave as + None for no windowing. Units follow http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases + window_stride - (Dict) Must contain key-value pairs only from within: {'YS':0,'MS':0,'D':0,'H':0,'min':0,'S':0,'ms':0} + Describes the size of the step between consecutive windows, provided the data is indexed with datetime type. Leave as + None for no windowing. Units follow http://pandas.pydata.org/pandas-docs/stable/timeseries.html#timeseries-offset-aliases + Returns: + n/a + """ + self.lts = LaggedTimeSeries(df=sanitise(DF), + endog=endog, + lag=lag, + window_size=window_size, + window_stride=window_stride) + + if self.lts.has_windows is True: + self.df = self.lts.windows + self.date_index = self.lts.daterange[self.lts.headstart:] + self.results = pd.DataFrame(index=self.date_index) + self.results.index.name = "windows_ending_on" + else: + self.df = [self.lts.df] + self.results = pd.DataFrame(index=[0]) + self.max_lag_only = True + self.endog = endog # Dependent Variable Y + self.exog = exog # Independent Variable X + self.lag = lag + + """ If using KDE, this ensures the covariance matrices are calculated once over all data, rather + than for each window. 
This saves computational time and provides a fair point for comparison.""" + self.covars = [[], []] + + for i, (X, Y) in enumerate({self.exog: self.endog, self.endog: self.exog}.items()): + X_lagged = X+'_lag'+str(self.lag) + Y_lagged = Y+'_lag'+str(self.lag) + + self.covars[i] = [np.cov(self.lts.df[[Y, Y_lagged, X_lagged]].values.T), + np.cov( + self.lts.df[[X_lagged, Y_lagged]].values.T), + np.cov(self.lts.df[[Y, Y_lagged]].values.T), + np.ones(shape=(1, 1)) * self.lts.df[Y_lagged].std()**2] + + # Account for equal signals in case of lag 0 by adding identity matrix to covariance matrices + if lag == 0: + for j, c_j in enumerate(self.covars[i]): + if j % 2 == 0: + self.covars[i][j] += 1e-10 * np.eye(*c_j.shape) + + def linear_TE(self, df=None, n_shuffles=0): + """ + Linear Transfer Entropy for directional causal inference + + Defined: G-causality * 0.5, where G-causality described by the reduction in variance of the residuals + when considering side information. + Calculated using: log(var(e_joint)) - log(var(e_independent)) where e_joint and e_independent + represent the residuals from OLS fitting in the joint (X(t),Y(t)) and reduced (Y(t)) cases + + Arguments: + n_shuffles - (integer) Number of times to shuffle the dataframe, destroying the time series temporality, in order to + perform significance testing. 
+ Returns: + transfer_entropies - (list) Directional Linear Transfer Entropies from X(t)->Y(t) and Y(t)->X(t) respectively + """ + # Prepare lists for storing results + TEs = [] + shuffled_TEs = [] + p_values = [] + z_scores = [] + + # Loop over all windows + for i, df in enumerate(self.df): + df = deepcopy(df) + + # Shows user that something is happening + # if self.lts.has_windows is True: + # print("Window ending: ", self.date_index[i]) + + # Initialise list to return TEs + transfer_entropies = [0, 0] + + # Require us to compare information transfer bidirectionally + for i, (X, Y) in enumerate({self.exog: self.endog, self.endog: self.exog}.items()): + + # Note X-t, Y-t + X_lagged = X+'_lag'+str(self.lag) + Y_lagged = Y+'_lag'+str(self.lag) + + # Calculate Residuals after OLS Fitting, for both Independent and Joint Cases + joint_residuals = sm.OLS(df[Y], sm.add_constant( + df[[Y_lagged, X_lagged]])).fit().resid + independent_residuals = sm.OLS( + df[Y], sm.add_constant(df[Y_lagged])).fit().resid + + # Use Geweke's formula for Granger Causality + if np.var(joint_residuals) == 0: + granger_causality = 0 + else: + granger_causality = np.log(np.var(independent_residuals) / + np.var(joint_residuals)) + + # Calculate Linear Transfer Entropy from Granger Causality + transfer_entropies[i] = granger_causality/2 + + TEs.append(transfer_entropies) + + # Calculate Significance of TE during this window + if n_shuffles > 0: + p, z, TE_mean = significance(df=df, + TE=transfer_entropies, + endog=self.endog, + exog=self.exog, + lag=self.lag, + n_shuffles=n_shuffles, + method='linear') + + shuffled_TEs.append(TE_mean) + p_values.append(p) + z_scores.append(z) + + # Store Linear Transfer Entropy from X(t)->Y(t) and from Y(t)->X(t) + self.add_results({'TE_linear_XY': np.array(TEs)[:, 0], + 'TE_linear_YX': np.array(TEs)[:, 1], + 'p_value_linear_XY': None, + 'p_value_linear_YX': None, + 'z_score_linear_XY': 0, + 'z_score_linear_YX': 0 + }) + + if n_shuffles > 0: + # Store Significance 
Transfer Entropy from X(t)->Y(t) and from Y(t)->X(t) + + self.add_results({'p_value_linear_XY': np.array(p_values)[:, 0], + 'p_value_linear_YX': np.array(p_values)[:, 1], + 'z_score_linear_XY': np.array(z_scores)[:, 0], + 'z_score_linear_YX': np.array(z_scores)[:, 1], + 'Ave_TE_linear_XY': np.array(shuffled_TEs)[:, 0], + 'Ave_TE_linear_YX': np.array(shuffled_TEs)[:, 1] + }) + + return transfer_entropies + + def nonlinear_TE(self, df=None, pdf_estimator='histogram', bins=None, bandwidth=None, gridpoints=20, n_shuffles=0): + """ + NonLinear Transfer Entropy for directional causal inference + + Defined: TE = TE_XY - TE_YX where TE_XY = H(Y|Y-t) - H(Y|Y-t,X-t) + Calculated using: H(Y|Y-t,X-t) = H(Y,Y-t,X-t) - H(Y-t,X-t) and finding joint entropy through density estimation + + Arguments: + pdf_estimator - (string) 'histogram' or 'kernel'. Defines which method is preferred for density estimation + of the distribution - either histogram or KDE + bins - (dict of lists) Optional parameter to provide hard-coded bin-edges. Dict keys + must contain names of variables - including lagged columns! Dict values must be lists + containing bin-edge numerical values. + bandwidth - (float) Optional parameter for custom bandwidth in KDE. This is a scalar multiplier to the covariance + matrix used (see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.covariance_factor.html) + gridpoints - (integer) Number of gridpoints (in each dimension) to discretise the probability space when performing + integration of the kernel density estimate. Increasing this gives more precision, but significantly + increases execution time + n_shuffles - (integer) Number of times to shuffle the dataframe, destroying the time series temporality, in order to + perform significance testing. 
+ + Returns: + transfer_entropies - (list) Directional Transfer Entropies from X(t)->Y(t) and Y(t)->X(t) respectively + + (Also stores TE, Z-score and p-values in self.results - for each window if windows defined.) + """ + # Retrieve user-defined bins + self.bins = bins + if self.bins is None: + self.bins = {self.endog: None} + + # Prepare lists for storing results + TEs = [] + shuffled_TEs = [] + p_values = [] + z_scores = [] + + # Loop over all windows + for j, df in enumerate(self.df): + df = deepcopy(df) + + # Shows user that something is happening + # if self.lts.has_windows is True and debug: + # print("Window ending: ", self.date_index[j]) + + # Initialise list to return TEs + transfer_entropies = [0, 0] + + # Require us to compare information transfer bidirectionally + for i, (X, Y) in enumerate({self.exog: self.endog, self.endog: self.exog}.items()): + # Entropy calculated using Probability Density Estimation: + # Following: https://stat.ethz.ch/education/semesters/SS_2006/CompStat/sk-ch2.pdf + # Also: https://www.cs.cmu.edu/~aarti/Class/10704_Spring15/lecs/lec5.pdf + + # Note Lagged Terms + X_lagged = X+'_lag'+str(self.lag) + Y_lagged = Y+'_lag'+str(self.lag) + + # Estimate PDF using Gaussian Kernels and use H(x) = p(x) log p(x) + # 1. H(Y,Y-t,X-t) + H1 = get_entropy(df=df[[Y, Y_lagged, X_lagged]], + gridpoints=gridpoints, + bandwidth=bandwidth, + estimator=pdf_estimator, + bins={k: v for (k, v) in self.bins.items() + if k in [Y, Y_lagged, X_lagged]}, + covar=self.covars[i][0]) + + # 2. H(Y-t,X-t) + H2 = get_entropy(df=df[[X_lagged, Y_lagged]], + gridpoints=gridpoints, + bandwidth=bandwidth, + estimator=pdf_estimator, + bins={k: v for (k, v) in self.bins.items() + if k in [X_lagged, Y_lagged]}, + covar=self.covars[i][1]) + #print('\t', H2) + # 3. 
H(Y,Y-t) + H3 = get_entropy(df=df[[Y, Y_lagged]], + gridpoints=gridpoints, + bandwidth=bandwidth, + estimator=pdf_estimator, + bins={k: v for (k, v) in self.bins.items() + if k in [Y, Y_lagged]}, + covar=self.covars[i][2]) + #print('\t', H3) + # 4. H(Y-t) + H4 = get_entropy(df=df[[Y_lagged]], + gridpoints=gridpoints, + bandwidth=bandwidth, + estimator=pdf_estimator, + bins={k: v for (k, v) in self.bins.items() + if k in [Y_lagged]}, + covar=self.covars[i][3]) + + # Calculate Conditional Entropy using: H(Y|X-t,Y-t) = H(Y,X-t,Y-t) - H(X-t,Y-t) + conditional_entropy_joint = H1 - H2 + + # And Conditional Entropy independent of X(t) H(Y|Y-t) = H(Y,Y-t) - H(Y-t) + conditional_entropy_independent = H3 - H4 + + # Directional Transfer Entropy is the difference between the conditional entropies + transfer_entropies[i] = conditional_entropy_independent - \ + conditional_entropy_joint + + TEs.append(transfer_entropies) + + # Calculate Significance of TE during this window + if n_shuffles > 0: + p, z, TE_mean = significance(df=df, + TE=transfer_entropies, + endog=self.endog, + exog=self.exog, + lag=self.lag, + n_shuffles=n_shuffles, + pdf_estimator=pdf_estimator, + bins=self.bins, + bandwidth=bandwidth, + method='nonlinear') + + shuffled_TEs.append(TE_mean) + p_values.append(p) + z_scores.append(z) + + # Store Transfer Entropy from X(t)->Y(t) and from Y(t)->X(t) + self.add_results({'TE_XY': np.array(TEs)[:, 0], + 'TE_YX': np.array(TEs)[:, 1], + 'p_value_XY': None, + 'p_value_YX': None, + 'z_score_XY': 0, + 'z_score_YX': 0 + }) + if n_shuffles > 0: + # Store Significance of Transfer Entropy from X(t)->Y(t) and from Y(t)->X(t) + + self.add_results({'p_value_XY': np.array(p_values)[:, 0], + 'p_value_YX': np.array(p_values)[:, 1], + 'z_score_XY': np.array(z_scores)[:, 0], + 'z_score_YX': np.array(z_scores)[:, 1], + 'Ave_TE_XY': np.array(shuffled_TEs)[:, 0], + 'Ave_TE_YX': np.array(shuffled_TEs)[:, 1] + }) + return transfer_entropies + + def add_results(self, dict): + """ + Args: + dict 
- JSON-style data to store in existing self.results DataFrame + Returns: + n/a + """ + for (k, v) in dict.items(): + self.results[str(k)] = v + + +def significance(df, TE, endog, exog, lag, n_shuffles, method, pdf_estimator=None, bins=None, bandwidth=None, both=True): + """ + Perform significance analysis on the hypothesis test of statistical causality, for both X(t)->Y(t) + and Y(t)->X(t) directions + + Calculated using: Assuming stationarity, we shuffle the time series to provide the null hypothesis. + The proportion of tests where TE > TE_shuffled gives the p-value significance level. + The amount by which the calculated TE is greater than the average shuffled TE, divided + by the standard deviation of the results, is the z-score significance level. + + Arguments: + TE - (list) Contains the transfer entropy in each direction, i.e. [TE_XY, TE_YX] + endog - (string) The endogenous variable in the TE analysis being significance tested (i.e. X or Y) + exog - (string) The exogenous variable in the TE analysis being significance tested (i.e. 
X or Y) + pdf_estimator - (string) The pdf_estimator used in the original TE analysis + bins - (Dict of lists) The bins used in the original TE analysis + + n_shuffles - (integer) Number of times to shuffle the dataframe, destroying temporality + both - (Bool) Whether to shuffle both endog and exog variables (z-score) or just exog variables (giving z*-score) + Returns: + p_value - Probability of observing the result given the null hypothesis + z_score - Number of Standard Deviations result is from mean (normalised) + """ + + # Prepare array for Transfer Entropy of each Shuffle + shuffled_TEs = np.zeros(shape=(2, n_shuffles)) + + ## + if both is True: + pass # TBC + + for i in range(n_shuffles): + # Perform Shuffle + df = shuffle_series(df) + + # Calculate New TE + shuffled_causality = TransferEntropy(DF=df, + endog=endog, + exog=exog, + lag=lag + ) + if method == 'linear': + TE_shuffled = shuffled_causality.linear_TE(df, n_shuffles=0) + else: + TE_shuffled = shuffled_causality.nonlinear_TE( + df, pdf_estimator, bins, bandwidth, n_shuffles=0) + shuffled_TEs[:, i] = TE_shuffled + + # Calculate p-values for each direction + p_values = (np.count_nonzero(TE[0] < shuffled_TEs[0, :]) / n_shuffles, + np.count_nonzero(TE[1] < shuffled_TEs[1, :]) / n_shuffles) + + # Calculate z-scores for each direction + z_scores = ((TE[0] - np.mean(shuffled_TEs[0, :])) / np.std(shuffled_TEs[0, :]), + (TE[1] - np.mean(shuffled_TEs[1, :])) / np.std(shuffled_TEs[1, :])) + + TE_mean = (np.mean(shuffled_TEs[0, :]), + np.mean(shuffled_TEs[1, :])) + + # Return p-values, z-scores and mean shuffled TE for both directions + return p_values, z_scores, TE_mean + +############################################################################################################## +# U T I L I T Y C L A S S E S +############################################################################################################## + + +class NDHistogram(): + """ + Custom histogram class wrapping the default numpy implementations 
(np.histogram, np.histogramdd). + This allows for dimension-agnostic histogram calculations, custom auto-binning and + associated data and methods to be stored for each object (e.g. Probability Density etc.) + """ + + def __init__(self, df, bins=None, max_bins=15): + """ + Arguments: + df - DataFrame passed through from the TransferEntropy class + bins - Bin edges passed through from the TransferEntropy class + max_bins - Number of bins per dimension passed through from the TransferEntropy class + Returns: + self.pdf - This is an N-dimensional Probability Density Function, stored as a + Numpy histogram, representing the proportion of samples in each bin. + """ + df = sanitise(df) + self.df = df.reindex(columns=sorted(df.columns)) # Sort axes by name + self.max_bins = max_bins + self.axes = list(self.df.columns.values) + self.bins = bins + self.n_dims = len(self.axes) + + # Bins must match number and order of dimensions + if self.bins is None: + AB = AutoBins(self.df) + self.bins = AB.sigma_bins(max_bins=max_bins) + elif set(self.bins.keys()) != set(self.axes): + warnings.warn( + 'Incompatible bins provided - defaulting to sigma bins') + AB = AutoBins(self.df) + self.bins = AB.sigma_bins(max_bins=max_bins) + + ordered_bins = [sorted(self.bins[key]) + for key in sorted(self.bins.keys())] + + # Create ND histogram (np.histogramdd doesn't scale down to 1D) + if self.n_dims == 1: + self.Hist, self.Dedges = np.histogram( + self.df.values, bins=ordered_bins[0], density=False) + elif self.n_dims > 1: + self.Hist, self.Dedges = np.histogramdd( + self.df.values, bins=ordered_bins, density=False) + + # Empirical Probability Density Function + if self.Hist.sum() == 0: + print(self.Hist.shape) + + with pd.option_context('display.max_rows', None, 'display.max_columns', 3): + print(self.df.tail(40)) + + sys.exit( + "User-defined histogram is empty. 
Check bins or increase data points") + else: + self.pdf = self.Hist/self.Hist.sum() + self._set_entropy_(self.pdf) + + def _set_entropy_(self, pdf): + """ + Arguments: + pdf - Probability Density Function; this is calculated using the N-dimensional histogram above. + Returns: + n/a + Sets entropy for marginal distributions: H(X), H(Y) etc. as well as joint entropy H(X,Y) + """ + # Prepare empty dict for marginal entropies along each dimension + self.H = {} + + if self.n_dims > 1: + + # Joint entropy H(X,Y) = -sum(pdf(x,y) * log(pdf(x,y))) + # Use masking to replace log(0) with 0 + self.H_joint = -np.sum(pdf * ma.log2(pdf).filled(0)) + + # Single entropy for each dimension H(X) = -sum(pdf(x) * log(pdf(x))) + for a, axis_name in enumerate(self.axes): + # Marginal of dimension a: sum the joint pdf over every other axis + # (pdf.sum(axis=a) would instead marginalise dimension a away) + other_axes = tuple(i for i in range(self.n_dims) if i != a) + marginal = pdf.sum(axis=other_axes) + # Use masking to replace log(0) with 0 + self.H[axis_name] = -np.sum(marginal * ma.log2(marginal).filled(0)) + else: + # Joint entropy and single entropy are the same + self.H_joint = -np.sum(pdf * ma.log2(pdf).filled(0)) + self.H[self.df.columns[0]] = self.H_joint + + +class AutoBins(): + """ + Prototyping class for generating data-driven binning. + Handles lagged time series, so only DF[X(t), Y(t)] required. 
+ """ + + def __init__(self, df, lag=None): + """ + Args: + df - (DataFrame) Time series data to classify into bins + lag - (int) Lag; if given, bins are also provided for the lagged columns + Returns: + n/a + """ + # Ensure data is in DataFrame form + self.df = sanitise(df) + self.axes = self.df.columns.values + self.ndims = len(self.axes) + self.N = len(self.df) + self.lag = lag + + def __extend_bins__(self, bins): + """ + Function to generate bins for lagged time series not present in self.df + Args: + bins - (Dict of List) Bins edges calculated by some AutoBins.method() + Returns: + bins - (Dict of lists) Bin edges keyed by column name + """ + self.max_lag_only = True # still temporary until we kill this + + # Handle lagging for bins, and calculate default bins where edges are not provided + if self.max_lag_only: + bins.update({fieldname + '_lag' + str(self.lag): edges + for (fieldname, edges) in bins.items()}) + else: + bins.update({fieldname + '_lag' + str(t): edges + for (fieldname, edges) in bins.items() for t in range(self.lag)}) + + return bins + + def MIC_bins(self, max_bins=15): + """ + Method to find optimal bin widths in each dimension, using a naive search to + maximise the mutual information normalised by the number of bins. Only accepts data + with two dimensions [X(t),Y(t)]. + We increase the n_bins parameter in each dimension, and take the bins which + result in the greatest Maximum Information Coefficient (MIC) + + (Note that this is restricted to equal-width bins only.) + Defined: MIC = I(X,Y) / log2(min(n_bins)) + edges = {Y:[a,b,c,d], Y-t:[a,b,c,d], X-t:[e,f,g]}, + n_bins = [bx,by] + Calculated using: argmax { I(X,Y) / log2(min(n_bins)) } + Args: + max_bins - (int) The maximum allowed bins in each dimension + Returns: + opt_edges - (dict) The optimal bin-edges for pdf estimation + using the histogram method, keyed by df column names + All bins equal-width. 
+ """ + if len(self.df.columns.values) > 2: + raise ValueError( + 'Too many columns provided in DataFrame. MIC_bins only accepts 2 columns (no lagged columns)') + + min_bins = 3 + + # Initialise array to store MIC values + MICs = np.zeros(shape=[1+max_bins-min_bins, 1+max_bins-min_bins]) + + # Loop over each dimension + for b_x in range(min_bins, max_bins+1): + + for b_y in range(min_bins, max_bins+1): + + # Update parameters + n_bins = [b_x, b_y] + + # Update dict of bin edges + edges = {dim: list(np.linspace(self.df[dim].min(), + self.df[dim].max(), + n_bins[i]+1)) + for i, dim in enumerate(self.df.columns.values)} + + # Calculate Maximum Information Coefficient + HDE = NDHistogram(self.df, edges) + + I_xy = sum([H for H in HDE.H.values()]) - HDE.H_joint + + MIC = I_xy / np.log2(np.min(n_bins)) + + MICs[b_x-min_bins][b_y-min_bins] = MIC + + # Get Optimal b_x, b_y values as plain ints (first maximum if there are ties) + i_x, i_y = np.unravel_index(np.argmax(MICs), MICs.shape) + n_bins = [int(i_x) + min_bins, int(i_y) + min_bins] + + bins = {dim: list(np.linspace(self.df[dim].min(), + self.df[dim].max(), + n_bins[i]+1)) + for i, dim in enumerate(self.df.columns.values)} + + if self.lag is not None: + bins = self.__extend_bins__(bins) + # Return the optimal bin-edges + return bins + + def knuth_bins(self, max_bins=15): + """ + Method to find optimal bin widths in each dimension, using a naive search to + maximise the log-likelihood given data. Only accepts data + with two dimensions [X(t),Y(t)]. + Derived from Matlab code provided in Knuth (2013): https://arxiv.org/pdf/physics/0605197.pdf + + (Note that this is restricted to equal-width bins only.) + Args: + max_bins - (int) The maximum allowed bins in each dimension + Returns: + bins - (dict) The optimal bin-edges for pdf estimation + using the histogram method, keyed by df column names + All bins equal-width. + """ + if len(self.df.columns.values) > 2: + raise ValueError( + 'Too many columns provided in DataFrame. 
knuth_bins only accepts 2 columns (no lagged columns)') + + min_bins = 3 + + # Initialise array to store the log posterior for each bin combination + log_probabilities = np.zeros( + shape=[1+max_bins-min_bins, 1+max_bins-min_bins]) + + # Loop over each dimension + for b_x in range(min_bins, max_bins+1): + + for b_y in range(min_bins, max_bins+1): + + # Update parameters + Ms = [b_x, b_y] + + # Update dict of bin edges + bins = {dim: list(np.linspace(self.df[dim].min(), + self.df[dim].max(), + Ms[i]+1)) + for i, dim in enumerate(self.df.columns.values)} + + # Calculate Maximum log Posterior + + # Create N-d histogram to count number per bin + HDE = NDHistogram(self.df, bins) + nk = HDE.Hist + + # M = number of bins in total = Mx * My * Mz ... etc. + M = np.prod(Ms) + + log_prob = (self.N * np.log(M) + + gammaln(0.5 * M) + - M * gammaln(0.5) + - gammaln(self.N + 0.5 * M) + + np.sum(gammaln(nk.ravel() + 0.5))) + + log_probabilities[b_x-min_bins][b_y-min_bins] = log_prob + + # Get Optimal b_x, b_y values as plain ints (first maximum if there are ties) + i_x, i_y = np.unravel_index(np.argmax(log_probabilities), + log_probabilities.shape) + Ms = [int(i_x) + min_bins, int(i_y) + min_bins] + + bins = {dim: list(np.linspace(self.df[dim].min(), + self.df[dim].max(), + Ms[i]+1)) + for i, dim in enumerate(self.df.columns.values)} + + if self.lag is not None: + bins = self.__extend_bins__(bins) + # Return the optimal bin-edges + return bins + + def sigma_bins(self, max_bins=15): + """ + Returns bins for N-dimensional data, using standard deviation binning: each + bin is one S.D in width, with bins centered on the mean. Where outliers exist + beyond the maximum number of SDs dictated by the max_bins parameter, the + bins are extended to minimum/maximum values to ensure all data points are + captured. This may mean larger bins in the tails, and up to two bins + greater than the max_bins parameter suggests in total (in the unlikely case of huge + outliers on both sides). 
+ Args: + max_bins - (int) The maximum allowed bins in each dimension + Returns: + bins - (dict) The optimal bin-edges for pdf estimation + using the histogram method, keyed by df column names + """ + + # DataFrame.items() replaces iteritems(), which pandas 2.0 removed + bins = {k: [np.mean(v)-int(max_bins/2)*np.std(v) + i * np.std(v) for i in range(max_bins+1)] + for (k, v) in self.df.items()} + + # Since some outliers can be missed, extend bins if any points are not yet captured + for k in self.df.keys(): + if self.df[k].min() < min(bins[k]): + bins[k].append(self.df[k].min()) + if self.df[k].max() > max(bins[k]): + bins[k].append(self.df[k].max()) + + if self.lag is not None: + bins = self.__extend_bins__(bins) + return bins + + def equiprobable_bins(self, max_bins=15): + """ + Returns bins for N-dimensional data, such that each bin should contain equal numbers of + samples. + *** Note that due to SciPy's mquantiles() functional design, the equipartition is not strictly true - + it operates independently on the marginals, and so with large bin numbers there are usually + significant discrepancies from desired behaviour. Fortunately, for TE we find equipartitioning is + extremely beneficial, so we find good accuracy with small bin counts *** + Args: + max_bins - (int) The number of bins in each dimension + Returns: + bins - (dict) The calculated bin-edges for pdf estimation + using the histogram method, keyed by df column names + """ + quantiles = np.array([i/max_bins for i in range(0, max_bins+1)]) + bins = dict(zip(self.axes, mquantiles( + a=self.df, prob=quantiles, axis=0).T.tolist())) + + # Remove_duplicates + bins = {k: sorted(set(bins[k])) for (k, v) in bins.items()} + + if self.lag is not None: + bins = self.__extend_bins__(bins) + return bins + + +class _kde_(stats.gaussian_kde): + """ + Subclass of scipy.stats.gaussian_kde. This is to enable the passage of a pre-defined covariance matrix, via the + `covar` parameter. 
This is handled internally within the TransferEntropy class. + The matrix is calculated on the overall dataset, before windowing, which allows for consistency between windows, + and avoids duplicate computation compared with calculating the covariance for each window. + Functions are left as close as possible to scipy.stats.gaussian_kde; docs available: + https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.gaussian_kde.html + """ + + def __init__(self, dataset, bw_method=None, df=None, covar=None): + self.dataset = atleast_2d(dataset) + if not self.dataset.size > 1: + raise ValueError("`dataset` input should have multiple elements.") + + self.d, self.n = self.dataset.shape + self.set_bandwidth(bw_method=bw_method, covar=covar) + + def set_bandwidth(self, bw_method=None, covar=None): + + if bw_method is None: + pass + elif bw_method == 'scott': + self.covariance_factor = self.scotts_factor + elif bw_method == 'silverman': + self.covariance_factor = self.silverman_factor + elif np.isscalar(bw_method) and not isinstance(bw_method, string_types): + self._bw_method = 'use constant' + self.covariance_factor = lambda: bw_method + elif callable(bw_method): + self._bw_method = bw_method + self.covariance_factor = lambda: self._bw_method(self) + else: + msg = "`bw_method` should be 'scott', 'silverman', a scalar " \ + "or a callable. 
+ raise ValueError(msg) + + self._compute_covariance(covar) + + def _compute_covariance(self, covar): + + if covar is not None: + try: + self._data_covariance = covar + self._data_inv_cov = linalg.inv(self._data_covariance) + except Exception as e: + print('\tSingular matrix encountered...') + covar += 10e-6 * np.eye(*covar.shape) + self._data_covariance = covar + self._data_inv_cov = linalg.inv(self._data_covariance) + + self.factor = self.covariance_factor() + # Cache covariance and inverse covariance of the data + if not hasattr(self, '_data_inv_cov'): + self._data_covariance = atleast_2d(np.cov(self.dataset, rowvar=1, + bias=False)) + self._data_inv_cov = linalg.inv(self._data_covariance) + + self.covariance = self._data_covariance * self.factor**2 + self.inv_cov = self._data_inv_cov / self.factor**2 + self._norm_factor = sqrt(linalg.det(2*pi*self.covariance)) * self.n + + +############################################################################################################## +# U T I L I T Y F U N C T I O N S +############################################################################################################## + + +def get_pdf(df, gridpoints=None, bandwidth=None, estimator=None, bins=None, covar=None): + """ + Function for non-parametric density estimation + Args: + df - (DataFrame) Samples over which to estimate density + gridpoints - (int) Number of gridpoints when integrating KDE over + the domain. Used if estimator='kernel' + bandwidth - (float) Bandwidth for KDE (scalar multiple to covariance + matrix). Used if estimator='kernel' + estimator - (string) 'histogram' or 'kernel' + bins - (Dict of lists) Bin edges for NDHistogram. Used if estimator = 'histogram' + covar - (Numpy ndarray) Covariance matrix between dimensions of df. 
+ Used if estimator = 'kernel' + Returns: + pdf - (Numpy ndarray) Probability of a sample being in a specific + bin (technically a probability mass) + """ + DF = sanitise(df) + + if estimator == 'histogram': + pdf = pdf_histogram(DF, bins) + else: + pdf = pdf_kde(DF, gridpoints, bandwidth, covar) + return pdf + + +def pdf_kde(df, gridpoints=None, bandwidth=1, covar=None): + """ + Function for non-parametric density estimation using Kernel Density Estimation + Args: + df - (DataFrame) Samples over which to estimate density + gridpoints - (int) Number of gridpoints when integrating KDE over + the domain. Defaults to 20 if None + bandwidth - (float) Bandwidth for KDE (scalar multiple to covariance + matrix). + covar - (Numpy ndarray) Covariance matrix between dimensions of df. + If None, these are calculated from df during the + KDE analysis + Returns: + Z/Z.sum() - (Numpy ndarray) Probability of a sample being between + specific gridpoints (technically a probability mass) + """ + # Create Meshgrid to capture data + if gridpoints is None: + gridpoints = 20 + + N = complex(gridpoints) + + # DataFrame.items() replaces iteritems(), which pandas 2.0 removed + slices = [slice(dim.min(), dim.max(), N) + for dimname, dim in df.items()] + grids = np.mgrid[slices] + + # Pass Meshgrid to Scipy Gaussian KDE to Estimate PDF + positions = np.vstack([X.ravel() for X in grids]) + values = df.values.T + kernel = _kde_(values, bw_method=bandwidth, covar=covar) + Z = np.reshape(kernel(positions).T, grids[0].shape) + + # Normalise + return Z/Z.sum() + + +def pdf_histogram(df, bins): + """ + Function for non-parametric density estimation using N-Dimensional Histograms + Args: + df - (DataFrame) Samples over which to estimate density + bins - (Dict of lists) Bin edges for NDHistogram. 
+ Returns: + histogram.pdf - (Numpy ndarray) Probability of a sample being in a specific + bin (technically a probability mass) + """ + histogram = NDHistogram(df=df, bins=bins) + return histogram.pdf + + +def get_entropy(df, gridpoints=15, bandwidth=None, estimator='kernel', bins=None, covar=None): + """ + Function for calculating entropy from a probability mass + + Args: + df - (DataFrame) Samples over which to estimate density + gridpoints - (int) Number of gridpoints when integrating KDE over + the domain. Used if estimator='kernel' + bandwidth - (float) Bandwidth for KDE (scalar multiple to covariance + matrix). Used if estimator='kernel' + estimator - (string) 'histogram' or 'kernel' + bins - (Dict of lists) Bin edges for NDHistogram. Used if estimator + = 'histogram' + covar - (Numpy ndarray) Covariance matrix between dimensions of df. + Used if estimator = 'kernel' + Returns: + entropy - (float) Shannon entropy in bits + """ + pdf = get_pdf(df, gridpoints, bandwidth, estimator, bins, covar) + # log base 2 returns H(X) in bits + return -np.sum(pdf * ma.log2(pdf).filled(0)) + + +def shuffle_series(DF, only=None): + """ + Function to return time series shuffled rowwise along each desired column. + Each column is shuffled independently, removing the temporal relationship. + This is to calculate Z-score and Z*-score. See P. Boba et al (2015) + Calculated using: df.apply(np.random.permutation) + Arguments: + df - (DataFrame) Time series data + only - (list) Fieldnames to shuffle. 
If None, all columns shuffled + Returns: + df_shuffled - (DataFrame) Time series shuffled along desired columns + """ + if only is not None: + shuffled_DF = DF.copy() + for col in only: + series = DF.loc[:, col].to_frame() + shuffled_DF[col] = series.apply(np.random.permutation) + else: + shuffled_DF = DF.apply(np.random.permutation) + + return shuffled_DF + + +def plot_pdf(df, estimator='kernel', gridpoints=None, bandwidth=None, covar=None, bins=None, show=False, + cmap='inferno', label_fontsize=7): + """ + Wrapper function to plot the pdf of a pandas dataframe + + Args: + df - (DataFrame) Samples over which to estimate density + estimator - (string) 'kernel' or 'histogram' + gridpoints - (int) Number of gridpoints when integrating KDE over + the domain. Used if estimator='kernel' + bandwidth - (float) Bandwidth for KDE (scalar multiple to covariance + matrix). Used if estimator='kernel' + covar - (Numpy ndarray) Covariance matrix between dimensions of df. + bins - (Dict of lists) Bin edges for NDHistogram. Used if estimator = 'histogram' + show - (Boolean) Whether to plot directly, or simply return axes for later use + cmap - (string) Colour map (see: https://matplotlib.org/examples/color/colormaps_reference.html) + label_fontsize - (float) Defines the fontsize for the axes labels + Returns: + ax - AxesSubplot object. Can be added to figures to allow multiple plots. + """ + + DF = sanitise(df) + if len(DF.columns) != 2: + print("DataFrame has " + str(len(DF.columns)) + + " dimensions. 
Only 2D or less can be plotted") + axes = None + else: + # Plot data in Histogram or Kernel form + if estimator == 'histogram': + + if bins is None: + bins = {axis: np.linspace(DF[axis].min(), + DF[axis].max(), + 9) for axis in DF.columns.values} + fig, axes = plot_pdf_histogram(df, bins, cmap) + else: + fig, axes = plot_pdf_kernel(df, gridpoints, bandwidth, covar, cmap) + + # Format plot + axes.set_xlabel(DF.columns.values[0], labelpad=20) + axes.set_ylabel(DF.columns.values[1], labelpad=20) + for label in axes.xaxis.get_majorticklabels(): + label.set_fontsize(label_fontsize) + for label in axes.yaxis.get_majorticklabels(): + label.set_fontsize(label_fontsize) + for label in axes.zaxis.get_majorticklabels(): + label.set_fontsize(label_fontsize) + axes.view_init(10, 45) + if show == True: + plt.show() + plt.close(fig) + + axes.remove() + + return axes + + +def plot_pdf_histogram(df, bins, cmap='inferno'): + """ + Function to plot the pdf of a dataset, estimated via histogram. + + Args: + df - (DataFrame) Samples over which to estimate density + bins - (Dict of lists) Bin edges for NDHistogram. 
Used if estimator = 'histogram' + Returns: + fig, ax - Figure and 3D axes, passed back to the plot_pdf() function + """ + DF = sanitise(df) # in case function called directly + + # Calculate PDF + PDF = get_pdf(df=DF, estimator='histogram', bins=bins) + + # Get x-coords, y-coords for each bar + # indexing='ij' keeps the grid in the same (x, y) order as the histogram array + (x_edges, y_edges) = bins.values() + X, Y = np.meshgrid(x_edges[:-1], y_edges[:-1], indexing='ij') + # Get dx, dy for each bar + dxs, dys = np.meshgrid(np.diff(x_edges), np.diff(y_edges), indexing='ij') + + # Colourmap (plt.get_cmap is used because newer Matplotlib releases drop cm.get_cmap) + cmap = plt.get_cmap(cmap) + rgba = [cmap((p-PDF.flatten().min())/PDF.flatten().max()) + for p in PDF.flatten()] + + # Create subplots + fig = plt.figure() + ax = fig.add_subplot(111, projection='3d') + + ax.bar3d(x=X.flatten(), # x coordinates of each bar + y=Y.flatten(), # y coordinates of each bar + z=0, # z coordinates of each bar + dx=dxs.flatten(), # width of each bar + dy=dys.flatten(), # depth of each bar + dz=PDF.flatten(), # height of each bar + alpha=1, # transparency + color=rgba + ) + ax.set_title("Histogram Probability Distribution", fontsize=10) + + return fig, ax + + +def plot_pdf_kernel(df, gridpoints=None, bandwidth=None, covar=None, cmap='inferno'): + """ + Function to plot the pdf, calculated by KDE, of a dataset + + Args: + df - (DataFrame) Samples over which to estimate density + gridpoints - (int) Number of gridpoints when integrating KDE over + the domain. Used if estimator='kernel' + bandwidth - (float) Bandwidth for KDE (scalar multiple to covariance + matrix). Used if estimator='kernel' + covar - (Numpy ndarray) Covariance matrix between dimensions of df. 
+ + Returns: + fig, ax - Figure and 3D axes, passed back to the plot_pdf() function + """ + DF = sanitise(df) + # Estimate the PDF from the data + if gridpoints is None: + gridpoints = 20 + + # Forward covar so a user-supplied covariance matrix is not silently ignored + pdf = get_pdf(DF, gridpoints=gridpoints, bandwidth=bandwidth, covar=covar) + N = complex(gridpoints) + # DataFrame.items() replaces iteritems(), which pandas 2.0 removed + slices = [slice(dim.min(), dim.max(), N) + for dimname, dim in DF.items()] + X, Y = np.mgrid[slices] + + fig = plt.figure() + ax = fig.add_subplot(111, projection='3d') + ax.plot_surface(X, Y, pdf, cmap=cmap) + + ax.set_title("KDE Probability Distribution", fontsize=10) + + return fig, ax + + +def sanitise(df): + """ + Function to convert DataFrame-like objects into pandas DataFrames + + Args: + df - Data in pd.Series or pd.DataFrame format + Returns: + df - Data as pandas DataFrame + """ + # Ensure data is in DataFrame form + if isinstance(df, pd.DataFrame): + pass + elif isinstance(df, pd.Series): + df = df.to_frame() + else: + raise ValueError( + 'Data passed as %s. Please ensure your data is stored as a Pandas DataFrame' % (str(type(df)))) + return df diff --git a/structured_data/2022_06_02_causality/src/transfer_entropy/transfer_entropy_wrapper.py b/structured_data/2022_06_02_causality/src/transfer_entropy/transfer_entropy_wrapper.py new file mode 100644 index 0000000..17421d4 --- /dev/null +++ b/structured_data/2022_06_02_causality/src/transfer_entropy/transfer_entropy_wrapper.py @@ -0,0 +1,169 @@ +""" +Functions for identifying causality. 
+""" +# Import standard library modules +from typing import List, Tuple, Dict +import warnings + +# Import third party modules +from matplotlib import pyplot as plt +import numpy as np +import pandas as pd +from statsmodels.tsa.stattools import grangercausalitytests + +from transfer_entropy.pycausality.src import TransferEntropy + +warnings.filterwarnings("ignore") + +# Function definitions +def grangers_causation_matrix(df: pd.DataFrame, test: str='ssr_chi2test', max_lag: int=7, verbose=False) -> Tuple[pd.DataFrame, pd.DataFrame]: + """Check whether each of the remaining columns Granger-causes the first column. + The row is the response variable (first column), the columns are the predictors. The values + in the table are the p-values. A p-value below the significance level (0.05) means that + the null hypothesis, i.e. that the coefficients of the corresponding past values are + zero (X does not Granger-cause Y), can be rejected. + + Arguments: + - df (pd.DataFrame): Time series; the first column is the response variable, the + remaining columns are the candidate predictors. + - test (str): Name of the test statistic to read from the results, e.g. 'ssr_chi2test'. + - max_lag (int): Maximum lag to test. + - verbose: Whether or not to display intermediate results. + + Results: + - Tuple[pd.DataFrame, pd.DataFrame]: Dataframes containing the minimum p_value (i.e., largest + significance) and corresponding lag for each of the columns of the argument dataframe. 
+ """ + df_gc = pd.DataFrame(np.zeros((1, len(df.columns[1:]))), columns=df.columns[1:], index=[df.columns[0]]) + df_gc_lags = pd.DataFrame(np.zeros((1, len(df.columns[1:]))), columns=df.columns[1:], index=[df.columns[0]]) + + col_res = df.columns[0] + for col_orig in df.columns[1:]: + test_result = grangercausalitytests(df[[col_res, col_orig]], maxlag=max_lag, verbose=False) + p_values = [round(test_result[i+1][0][test][1],4) for i in range(max_lag)] + if verbose: print(f'Y = {col_res}, X = {col_orig}, P Values = {p_values}') + min_p_value = np.min(p_values) + # p_values[i] belongs to lag i+1, so shift the 0-based index to get the lag + min_lags = np.argmin(p_values) + 1 + df_gc.loc[col_res, col_orig] = min_p_value + df_gc_lags.loc[col_res, col_orig] = min_lags + + df_gc.columns = [col + '_x' for col in df.columns[1:]] + df_gc.index = [df.columns[0]] + + df_gc_lags.columns = [col + '_x' for col in df.columns[1:]] + df_gc_lags.index = [df.columns[0]] + return df_gc, df_gc_lags + + +def calculate_transfer_entropy(df: pd.DataFrame, lag: int, linear: bool=False, effective: bool=False, window_size: Dict={'MS': 6}, window_stride: Dict={'MS': 1}, n_shuffles=100, debug=False) -> List: + """Calculate the transfer entropy from each of the remaining columns to the + first column of the dataframe, for every window. The time series are + expected to be stationary already. + + Arguments: + - df (pd.DataFrame): Dataframe for which transfer entropies must be + calculated. The first column is the response variable. + - lag (int): Lag to calculate the transfer entropy for. + - linear (bool): Whether the required transfer entropies should be linear + (True) or non-linear (False). + - effective (bool): Whether or not to calculate the effective transfer + entropy. Can only be done for `n_shuffles>0`, but has proven to not + give the most reliable results given the size of the dataset. + - window_size (Dict): Dictionary indicating the size of a window, either in 'MS' + (Month Start) or 'D' (Days; to express weeks), e.g., {'MS': 6}. + - window_stride (Dict): Dictionary indicating the stride of a window, either in 'MS' + (Month Start) or 'D' (Days; to express weeks), e.g., {'MS': 1}. 
+ - n_shuffles (int): Number of shuffling operations to do when calculating + the average transfer entropy. Only relevant if the results should be + either the effective entropy or if p-values should be included for + significance. + - debug (bool): Whether or not to print intermediate results (for + debugging purposes). + + Result: + - List[List[str, pd.DataFrame]]: List containing nested lists (pairs) of + the column names and the resulting Pandas dataframe containing the + transfer entropy for each window in the respective column. + """ + + te_results = [] + + col_res = df.columns[0] + col_origs = df.columns[1:] + for col_orig in col_origs: + print(f'{col_orig} -> {col_res}') + + # Initialise Object to Calculate Transfer Entropy + TE = TransferEntropy(DF=df, + endog=col_res, + exog=col_orig, + lag=lag, + window_size=window_size, + window_stride=window_stride + ) + + # Calculate TE (linear, or non-linear using KDE) + if linear: + TE.linear_TE(n_shuffles=n_shuffles) + else: + TE.nonlinear_TE(pdf_estimator='kernel', n_shuffles=n_shuffles) + + # Standardize column naming + if linear: + TE.results = TE.results.rename(mapper=(lambda col: col.replace('linear_', '')), axis=1) + + # Display TE_XY and its significance value (p_value_XY, matching the direction of TE_XY) + if debug: + if n_shuffles and effective: + #print('\t', TE.results[[f'TE_XY', f'Ave_TE_XY', f'p_value_XY']]) + print('\t', f"TE_XY_Eff=({TE.results['TE_XY'].values[0] - TE.results['Ave_TE_XY'].values[0]}), p=({TE.results['p_value_XY'].values[0]})", '\n') + elif n_shuffles: + print('\t', f"TE_XY=({TE.results['TE_XY'].values[0]}), p=({TE.results['p_value_XY'].values[0]})", '\n') + else: + print('\t', f"TE_XY=({TE.results[['TE_XY']]})", '\n') + + # Track results of current link + te_results.append([col_orig, TE.results]) + return te_results + +def average_transfer_entropy(df: pd.DataFrame, linear: bool, effective: bool, tau_min: int=0, tau_max: int=4, n_shuffles=None, debug: bool=False) -> List: + """Wrapper function around `calculate_transfer_entropy` for calculating 
the + average (non-)linear transfer entropy. + + Arguments: + - df (pd.DataFrame): Dataframe for which transfer entropies must be + calculated. + - linear (bool): Whether the required transfer entropies should be linear + (True) or non-linear (False). + - effective (bool): Whether or not to calculate the effective transfer + entropy. Can only be done for `n_shuffles>0`, but has proven to not + give the most reliable results given the size of the dataset. + - tau_min (int): Minimal lag to calculate transfer entropy for. + - tau_max (int): Maximal lag to calculate transfer entropy for. + - n_shuffles (int): Number of shuffling operations to do when calculating + the average transfer entropy. Only relevant if the results should be + either the effective entropy or if p-values should be included for + significance. + - debug (bool): Whether or not to print intermediate results (for + debugging purposes). + + Result: + - List[List[List[str, pd.DataFrame]]]: List with one entry per lag; each + entry is the list of pairs (column name, results dataframe) returned + by `calculate_transfer_entropy` for that lag. + """ + import time + + te_results_arr = [] + for lag in range(tau_min, tau_max+1): + print(f'\nlag({lag})') + t = time.time() + + # Call over-arching Transfer Entropy function (no windowing) + te_results = calculate_transfer_entropy(df, lag=lag, linear=linear, effective=effective, window_size=None, window_stride=None, n_shuffles=n_shuffles, debug=debug) + + # Construct dataframe from results (one row per predictor; currently not returned) + te_results_df = pd.concat([result for _, result in te_results]) + te_results_df.index = [name for name, _ in te_results] + + # Keep track of results + te_results_arr.append(te_results) + print("took", time.time() - t, "seconds") + + return te_results_arr \ No newline at end of file