Error of crossfit folds splits with DynamicDML #900

Open
juandavidgutier opened this issue Jul 16, 2024 · 2 comments


@juandavidgutier

Hi,

I am estimating the effect of high levels of particulate matter (PM2.5) on excess deaths, using daily panel data for 25 municipalities. My treatment is binary: T=1 when the PM2.5 level is high and T=0 when it is low. The outcome is also binary: Y=1 for excess deaths and Y=0 for no excess deaths.

I am using the DynamicDML class to fit my model, but I get this error message: "AttributeError: Provided crossfit folds contain training splits that don't contain all treatments". However, 50% of the observations have T=1, which I would expect to be enough to produce balanced crossfit folds.

Here is my code, run with econml version 0.15 and dowhy version 0.10.1 (data attached: dataset_pm_deaths.csv):

```python
import dowhy
import econml
from dowhy import CausalModel
import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LassoCV
import scipy.stats as stats
from itertools import product
from econml.utilities import WeightedModelWrapper
from sklearn.model_selection import train_test_split
from econml.panel.dml import DynamicDML

data_all = pd.read_csv("D:/dataset_pm_deaths.csv")
# .copy() so the column assignments below modify an independent frame
data = data_all[data_all['Year'] >= 2009].copy()

# Binarize the treatment at the median PM2.5 level
median_pm25 = data['PM25'].median()
data['PM25'] = (data['PM25'] >= median_pm25).astype(int)

# Standardize the pollutant covariates
for col in ['BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4']:
    data[col] = stats.zscore(data[col], nan_policy='omit')

data0 = data[['excess', 'PM25', 'cod_munici',
              'BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature', 'lead1_PM25']]
data0 = data0.dropna()
Y = data0.excess.to_numpy()
T = data0.PM25.to_numpy()
percentage_high_PM25 = np.mean(T == 1) * 100
W = data0[['BC', 'DMS', 'PM', 'OC', 'SO2', 'SO4', 'Temperature']].to_numpy().reshape(-1, 7)
X = data0[['Temperature', 'lead1_PM25']].to_numpy().reshape(-1, 2)
groups = data0.cod_munici.to_numpy()  # one panel unit per municipality

estimate0 = DynamicDML(discrete_treatment=True,
                       featurizer=PolynomialFeatures(degree=3),
                       linear_first_stages=False, cv=3, random_state=123)
estimate0.fit(Y=Y, T=T, X=X, W=W, inference='auto', groups=groups)  # the AttributeError is raised here
```
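
To see which training split would trigger this message, one option is to check a set of candidate folds against the treatment vector before fitting. The sketch below is illustrative only: `splits_missing_treatments` and the demo data are hypothetical, not part of econml, and the folds econml builds internally may differ from the ones you test here.

```python
def splits_missing_treatments(folds, treatments):
    """Map fold index -> treatment levels absent from that fold's training split.

    `folds` is a sequence of (train_idx, test_idx) pairs and `treatments`
    is a list or 1-D array indexed by row position (both hypothetical inputs).
    """
    all_levels = set(treatments)
    missing = {}
    for i, (train_idx, _test_idx) in enumerate(folds):
        seen = {treatments[j] for j in train_idx}
        if seen != all_levels:
            missing[i] = sorted(all_levels - seen)
    return missing


# Toy data: rows 2 and 3 are the only treated rows.
T_demo = [0, 0, 1, 1, 0, 0]
folds_demo = [
    ([0, 1, 4, 5], [2, 3]),  # training rows are all T=0
    ([2, 3, 4, 5], [0, 1]),
    ([0, 1, 2, 3], [4, 5]),
]
report = splits_missing_treatments(folds_demo, T_demo)
print(report)
```

An empty dictionary would mean every training split sees every treatment level; a non-empty one names the folds to look at more closely.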

@TimCosemans

Have you tried passing a StratifiedKFold object or writing your own cv splitter? That could help you out in the meantime.
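
A hand-rolled splitter along these lines could look like the following minimal sketch. It assumes that `cv` accepts an iterable of (train, test) index pairs and that each group has to stay within a single split; `group_folds` is a hypothetical helper, not an econml function, and scikit-learn's `GroupKFold` provides comparable group-wise splitting.

```python
def group_folds(groups, n_splits):
    """Yield (train_idx, test_idx) pairs where each group is held out exactly once.

    `groups` is a list or 1-D array of group labels, one per row.
    Groups are assigned to folds round-robin in sorted order.
    """
    unique_groups = sorted(set(groups))
    if n_splits > len(unique_groups):
        raise ValueError("n_splits cannot exceed the number of distinct groups")
    for k in range(n_splits):
        test_groups = set(unique_groups[k::n_splits])
        test_idx = [i for i, g in enumerate(groups) if g in test_groups]
        train_idx = [i for i, g in enumerate(groups) if g not in test_groups]
        yield train_idx, test_idx


# Toy panel: three units with two rows each.
groups_demo = ["a", "a", "b", "b", "c", "c"]
folds = list(group_folds(groups_demo, n_splits=3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Under those assumptions, the resulting `folds` list could be tried as the `cv` argument, after checking that each training split still contains both T=0 and T=1 rows.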

@juandavidgutier
Author

Hi @TimCosemans

Thanks for your suggestions!
