Commit 361215e

Work around NLTK v3.9.3 regression (#2042)
In NLTK v3.9.3, a new security check was added for the Zip Slip issue:
nltk/nltk#3468

This change unfortunately contained a bug that causes false positive security errors when packages are downloaded to a symlinked path, since the new check doesn't use `abspath` vs `realpath` consistently:
nltk/nltk#3509

This causes errors like:

```
-----> Downloading NLTK packages: punkt punkt_tab
<frozen runpy>:128: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
[nltk_data] Downloading package punkt to
[nltk_data] /app/.heroku/python/nltk_data...
[nltk_data] Unzipping tokenizers/punkt.zip.
[nltk_data] Zip Slip blocked: punkt/
Error installing package. Retry? [n/y/e]
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/app/.heroku/python/lib/python3.12/site-packages/nltk/downloader.py", line 2631, in <module>
    rv = downloader.download(
         ^^^^^^^^^^^^^^^^^^^^
  File "/app/.heroku/python/lib/python3.12/site-packages/nltk/downloader.py", line 773, in download
    choice = input().strip()
             ^^^^^^^
EOFError: EOF when reading a line
```

See also:
https://github.com/heroku/heroku-buildpack-python/actions/runs/22545824206/job/65308000022#step:5:539

Until upstream fixes this regression, we can work around it by always passing the raw path for Python home instead of the symlinked path.

(We generally prefer using the symlinked path where possible, to ensure the paths used/displayed match between build time and run time, given the build directory is different from the run directory.)

Longer term I will be deprecating and then removing support for `nltk.txt`, since it's a Heroku-proprietary invention that's unnecessary given most users either commit the NLTK corpora or use the post_compile hook feature instead.

Fixes #2037.
GUS-W-21410908.
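To make the `abspath` vs `realpath` mismatch described above concrete, here is a minimal Python sketch of how a containment check can reject a legitimate archive member when the destination is reached through a symlink. This is not NLTK's actual code: the helper names `is_inside_mixed` and `is_inside_consistent` and the directory names are hypothetical, and the sketch only assumes that one side of the comparison has symlinks resolved while the other does not.

```python
import os
import tempfile


def is_inside_mixed(root: str, member: str) -> bool:
    """Hypothetical buggy check: root via abspath, target via realpath."""
    root_path = os.path.abspath(root)  # keeps the symlink in the path
    target = os.path.realpath(os.path.join(root, member))  # resolves it
    return os.path.commonpath([root_path, target]) == root_path


def is_inside_consistent(root: str, member: str) -> bool:
    """Hypothetical fixed check: both sides resolved with realpath."""
    root_path = os.path.realpath(root)
    target = os.path.realpath(os.path.join(root, member))
    return os.path.commonpath([root_path, target]) == root_path


with tempfile.TemporaryDirectory() as tmp:
    tmp = os.path.realpath(tmp)  # the temp dir itself may sit behind a symlink
    real_dir = os.path.join(tmp, "build")  # stands in for the raw build directory
    link_dir = os.path.join(tmp, "app")  # stands in for the symlinked path
    os.mkdir(real_dir)
    os.symlink(real_dir, link_dir)

    # A harmless member is rejected when the root is the symlinked path...
    mixed_via_link = is_inside_mixed(link_dir, "punkt/")
    # ...but accepted when the root is the raw path (the workaround).
    mixed_via_real = is_inside_mixed(real_dir, "punkt/")
    # Resolving both sides accepts the harmless member either way,
    consistent_via_link = is_inside_consistent(link_dir, "punkt/")
    # while a genuine traversal attempt is still rejected.
    escape_blocked = not is_inside_consistent(link_dir, "../evil")

print(mixed_via_link, mixed_via_real, consistent_via_link, escape_blocked)
```

Under these assumptions, passing the raw directory sidesteps the mismatch without weakening the check, which is the approach the workaround takes.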
1 parent 96e5bea commit 361215e

3 files changed: +10 -6 lines changed
CHANGELOG.md

Lines changed: 1 addition & 0 deletions

@@ -2,6 +2,7 @@
 
 ## [Unreleased]
 
+- Added a workaround for `nltk.txt` package downloader errors caused by an upstream regression in NLTK v3.9.3. ([#2041](https://github.com/heroku/heroku-buildpack-python/pull/2041))
 - Changed the S3 URL used to download Python to use AWS' dual-stack (IPv6 compatible) endpoint. ([#2035](https://github.com/heroku/heroku-buildpack-python/pull/2035))
 
 ## [v335] - 2026-02-10

bin/steps/nltk

Lines changed: 7 additions & 4 deletions

@@ -25,10 +25,11 @@ if is_module_available 'nltk'; then
 readarray -t nltk_packages <"${nltk_packages_definition}"
 output::step "Downloading NLTK packages: ${nltk_packages[*]}"
 
-nltk_data_dir="/app/.heroku/python/nltk_data"
-
+# Note: We have to use the raw build directory path here and not the symlinked `/app` path,
+# otherwise it will cause a false positive in NLTK v3.9.3's new Zip Slip security check,
+# which doesn't handle symlinked paths correctly: https://github.com/nltk/nltk/issues/3509
 # TODO: Does this even need user-provided env vars, or can we remove the sub_env usage here?
-if ! sub_env python -m nltk.downloader -d "${nltk_data_dir}" "${nltk_packages[@]}" |& output::indent; then
+if ! sub_env python -m nltk.downloader -d "${BUILD_DIR}/.heroku/python/nltk_data" "${nltk_packages[@]}" |& output::indent; then
 output::error <<-EOF
 Error: Unable to download NLTK data.
 
@@ -41,7 +42,9 @@ if is_module_available 'nltk'; then
 exit 1
 fi
 
-set_env NLTK_DATA "${nltk_data_dir}"
+# Since this will be used at runtime, we must use the symlinked `/app` path and not
+# the raw build directory path.
+set_env NLTK_DATA "/app/.heroku/python/nltk_data"
 else
 build_data::set_string "nltk_downloader" "skipped-no-nltk-file"
 echo " 'nltk.txt' not found, not downloading any corpora"
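The change above downloads through the raw build directory but still records the `/app` path in `NLTK_DATA`. During the build this is safe as long as both paths name the same directory. A minimal Python sketch of that relationship follows, using a temporary directory; `build_example` and `app` are hypothetical stand-ins for the real build directory and the `/app` symlink, and the placeholder file only imitates a downloaded corpus.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical stand-ins for the raw build directory and the `/app` symlink.
    build_dir = os.path.join(tmp, "build_example")
    app_link = os.path.join(tmp, "app")
    os.makedirs(os.path.join(build_dir, ".heroku", "python"))
    os.symlink(build_dir, app_link)

    # "Download" through the raw path, as the build step now does.
    download_dir = os.path.join(build_dir, ".heroku", "python", "nltk_data")
    os.makedirs(os.path.join(download_dir, "corpora"))
    with open(os.path.join(download_dir, "corpora", "stopwords.zip"), "wb") as f:
        f.write(b"placeholder")  # not a real corpus archive

    # Look the data up through the symlinked path, as NLTK_DATA is set to.
    runtime_dir = os.path.join(app_link, ".heroku", "python", "nltk_data")
    visible_via_link = os.path.isfile(
        os.path.join(runtime_dir, "corpora", "stopwords.zip")
    )
    same_directory = os.path.samefile(download_dir, runtime_dir)

print(visible_via_link, same_directory)
```

Only the string handed to the downloader changes; the files land in the same place, so the value exported for run time does not need to change.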

spec/hatchet/nltk_spec.rb

Lines changed: 2 additions & 2 deletions

@@ -13,10 +13,10 @@
 remote: -----> Downloading NLTK packages: city_database stopwords
 remote: .*: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
 remote: \\[nltk_data\\] Downloading package city_database to
-remote: \\[nltk_data\\] /app/.heroku/python/nltk_data...
+remote: \\[nltk_data\\] /tmp/build_.+/.heroku/python/nltk_data...
 remote: \\[nltk_data\\] Unzipping corpora/city_database.zip.
 remote: \\[nltk_data\\] Downloading package stopwords to
-remote: \\[nltk_data\\] /app/.heroku/python/nltk_data...
+remote: \\[nltk_data\\] /tmp/build_.+/.heroku/python/nltk_data...
 remote: \\[nltk_data\\] Unzipping corpora/stopwords.zip.
 REGEX
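The updated expectation replaces the fixed `/app` prefix with a wildcard for the per-build directory. The spec itself is Ruby; the Python `re` sketch below only illustrates what the new pattern accepts and rejects. The build directory name `build_1a2b3c4d` is a made-up example, and `\s+` is an assumption standing in for whatever whitespace the real output has between the `[nltk_data]` prefix and the path.

```python
import re

# Pattern modelled on the updated spec line. The unescaped dots are kept as in
# the spec, so each matches any single character, not just a literal dot.
pattern = re.compile(r"\[nltk_data\]\s+/tmp/build_.+/.heroku/python/nltk_data...")

# Hypothetical log lines: one with a raw build directory, one with the old path.
new_line = "[nltk_data]     /tmp/build_1a2b3c4d/.heroku/python/nltk_data..."
old_line = "[nltk_data]     /app/.heroku/python/nltk_data..."

matches_new = pattern.search(new_line) is not None
matches_old = pattern.search(old_line) is not None

print(matches_new, matches_old)
```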
