[Repo Assist] fix: correct categorical groupby KeyError and pandas FutureWarning in dummy_outcome_refuter#1460
Conversation
…g in dummy_outcome_refuter
Two issues fixed in _get_treatment_groups():
1. Bug (categorical branch): line 767 overwrote the correct groupby result
with data.groupby('bins') even though 'bins' was never created for
categorical treatments. This caused a KeyError whenever the refuter
was run with a categorical treatment variable. Fixed by removing
the dead line so the function correctly returns groupby on the
treatment column.
2. FutureWarning (pandas): groupby calls on columns with CategoricalDtype
(pd.cut output 'bins', and categorical treatment) did not set
observed=True, triggering a pandas FutureWarning about the changing
default. Added observed=True to both groupby calls; this retains
existing behaviour because only non-empty bins are meaningful for
the dummy-outcome estimator.
Test added: test_refutation_dummy_outcome_refuter_default_categorical_treatment
— regression test that previously raised KeyError.
Refs #1316
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Fixes DummyOutcomeRefuter._get_treatment_groups() for categorical treatments by removing an erroneous groupby("bins") call and explicitly setting observed=... on groupby to address a pandas FutureWarning.
Changes:
- Remove incorrect categorical
groupby("bins")that could raiseKeyError. - Add
observed=Truetogroupby(...)for both continuous ("bins") and categorical (treatment column) branches. - Add regression test covering categorical treatment refutation path (issue #1316).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
dowhy/causal_refuters/dummy_outcome_refuter.py |
Fixes categorical grouping KeyError and sets observed=True on groupby to avoid the pandas warning. |
tests/causal_refuters/test_dummy_outcome_refuter.py |
Adds a regression test to ensure categorical treatments don’t trigger the previous KeyError. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| refuter_tester.binary_treatment_testsuite(tests_to_run="atleast-one-common-cause", num_samples=num_samples) | ||
|
|
||
| @mark.parametrize( | ||
| ["error_tolerence", "estimator_method", "num_samples"], |
There was a problem hiding this comment.
The parameter name error_tolerence is misspelled (tolerence vs tolerance). Since this is newly added test code, consider renaming the local test parameter to error_tolerance for readability (keeping the passed value the same).
| [(0.1, "backdoor.propensity_score_matching", 1000)], | ||
| ) | ||
| def test_refutation_dummy_outcome_refuter_default_categorical_treatment( | ||
| self, error_tolerence, estimator_method, num_samples |
There was a problem hiding this comment.
The parameter name error_tolerence is misspelled (tolerence vs tolerance). Since this is newly added test code, consider renaming the local test parameter to error_tolerance for readability (keeping the passed value the same).
| ): | ||
| # Regression test for #1316: the categorical branch had a copy-paste error that tried to | ||
| # groupby("bins") which doesn't exist for categorical treatments, raising KeyError. | ||
| refuter_tester = SimpleRefuter(error_tolerence, estimator_method, "dummy_outcome_refuter") |
There was a problem hiding this comment.
The parameter name error_tolerence is misspelled (tolerence vs tolerance). Since this is newly added test code, consider renaming the local test parameter to error_tolerance for readability (keeping the passed value the same).
There was a problem hiding this comment.
Fixed in b7d0e71 — renamed the parameter from error_tolerence to error_tolerance in the newly added categorical treatment test (parametrize list, function signature, and body).
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Emre Kıcıman <emrek@microsoft.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Emre Kıcıman <emrek@microsoft.com>
…al test Signed-off-by: Copilot <copilot@github.com> Agent-Logs-Url: https://github.com/py-why/dowhy/sessions/227e1739-c29d-4256-9664-7e83d209eb69 Co-authored-by: emrekiciman <5982160+emrekiciman@users.noreply.github.com>
🤖 This is an automated pull request from Repo Assist, an AI assistant. Please review all changes carefully before merging.
Fixes a KeyError bug and a pandas FutureWarning in
DummyOutcomeRefuter._get_treatment_groups().Closes #1316
Root Cause
In
dummy_outcome_refuter.py, the_get_treatment_groups()method has two branches for treatment type: continuous/float (usespd.cut→"bins"column) and categorical (uses the treatment column directly). The categorical branch contained a copy-paste error:This caused an immediate
KeyErrorwheneverDummyOutcomeRefuterwas used with a categorical treatment variable.Additionally, both groupby calls used the default
observed=False(deprecated in pandas ≥ 1.5), triggering aFutureWarningabout the changing default when groupingCategoricalDtypecolumns (created bypd.cut).Fix
data.groupby("bins")in the categorical branch.observed=Trueto both groupby calls ("bins"for continuous and treatment column for categorical) to suppress theFutureWarningand adopt the modern pandas default. This retains correct behaviour since only non-empty bins are relevant.Test Status
A regression test
test_refutation_dummy_outcome_refuter_default_categorical_treatmentwas added totests/causal_refuters/test_dummy_outcome_refuter.py. This test uses the existingcategorical_treatment_testsuiteinfrastructure and would previously raiseKeyError: 'bins'. The test now passes correctly.Format/lint checks:
black --check✅flake8 --select=E9,F63,F7,F82✅Full test suite could not be run due to environment constraints (missing system-level graphviz for pygraphviz), but the fix is surgical with no logic changes beyond removing a dead line and adding
observed=Trueto two groupby calls.