Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dependency cleanup #1372

Open
wants to merge 2 commits into
base: dev
Choose a base branch
from
Open

Dependency cleanup #1372

wants to merge 2 commits into from

Conversation

melange396
Copy link
Collaborator

Removed because they werent directly referenced anywhere in the code of this repo:

  • jinja & matplotlib & pytest-check & sas7bdat & scipy & sqlalchemy-stubs & xlrd

Removed from one file because it was duplicated in both:

  • delphi_utils & tenacity

Moved to the "dev" file cuz theyre seemingly not actually used by the api:

  • more_itertools & requests

Moved to the "api" file so its with pandas which pulls it in as a dependency anyway (and it is explicitly used in server code (though its just one place)):

  • numpy

Id hoped that cleaning this stuff up would reduce the build time of docker images and thus speed up the dev cycle, as well as shrink the disk/file footprint size of the images. That turned out to not be the case because many of the removed packages still get referenced somewhere else in the dependency tree, and they get included anyway. At least we can pass off the work of maintaining the version requirements to other packages. Plus, #1308 will probably have a bigger impact than this could have dreamed of.

I am also strongly considering loosening some of the versions by using the "compatible release clause" (aka the ~= operator) wherever we currently use a hard-equal 3-level version (like ==1.2.3) so we can get security updates and bug fixes for ~free without (🤞) breaking things. Discussion and commentary on that is very much welcomed.

@korlaxxalrok: Do you remember off the top of your head how package installation gets sorted out by jenkins / g-d-r / automation? Is it done manually? We run the acquisition code on bare hardware, instead of using the bundled containers that manage the dependencies, and i dont see an obvious mechanism that would do it automatically.

@dshemetov
Copy link
Contributor

Side-note: wrt your comment about speeding up the dev cycle reminds me that I wanted to add lines to our Dockerfiles that will cache Python packages on the host system. I just had some trouble getting it to work in GH CI and then got distracted.

Thanks for doing this cleanup though, I will take a deeper look tomorrow.

Copy link
Contributor

@dshemetov dshemetov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doing a double check:

  • jinja could be used by Flask (though I don't think we use its templating functionality), but I imagine it will install it, so probably good to go
  • matplotlib no references, good to go
  • pytest-check I added this during JIT dev and unittest->pytest migration, but it's fine to delete; can bring back when we actually use it
  • sas7bdat seems like a library for loading SAS files, might be used by pandas.read_sas and not be bundled in by default, but that function isn't used anywhere in the repo so I think this is fine to go
  • scipy a bulky one that would be pretty obvious if we used it somewhere, so good to go
  • sqlalchemy-stubs seems like an extension for the mypy type checker that's good to, if someone wants them back they can easily put them in the dev file
  • xlrd pretty sure this is used by pandas.read_excel depending on the Excel file version and not bundled in Pandas by default; there is one use of that function in quidel.py, so I would keep this one

I'd be open to experimenting with "compatible release" on a few packages to start. It's asking us to trust the maintainers of our dependencies to not make breaking changes and we can work to build that trust slowly. For safety, we could log our dependency versions (for compatible release packages specifically) somewhere internally in our code's metadata and dump it on error, so we can easily see it in a Sentry error.

@korlaxxalrok
Copy link
Contributor

@melange396 Yep, that's right. Essentially manually (sometimes handled by an Ansible playbook that governs the overall system setup) and run by a real human to update.

@melange396
Copy link
Collaborator Author

Doing a double check:

jinja, matplotlib, scipy, and xlrd are all still installed (unfurl the "Build docker images" section in the gh actions build log for this branch, and then you can search for them), so we must be getting them transitively from some of our other dependencies... Seeing scipy there made me a little sad cuz it is a big one.

Good call regarding xlrd. One way to get it would be to put "pandas[excel]" in our requirements, but that will get us other things we dont need. I will put xlrd back, and also make a note for it near the .read_excel() usage.

For safety, we could log our dependency versions

I think that could be a great debugging tool, i mentioned it in #987 (comment)

Copy link

Quality Gate Passed Quality Gate passed

Kudos, no new issues were introduced!

0 New issues
0 Security Hotspots
No data about Coverage
0.0% Duplication on New Code

See analysis details on SonarCloud

@dshemetov
Copy link
Contributor

Ah yea, the main culprit for the bulky dependencies is delphi_utils. This Dockerfile took me 80s to install!

# Dockerfile
FROM python:3.8-buster
RUN pip install delphi_utils==0.3.15

# Build and log
❯ docker build -f Dockerfile . -t py-test --no-cache > out.log  2>&1 &

# Last bits of the log
#5 54.62 Installing collected packages: wrapt, snowballstemmer, pytz, mccabe, zipp, xmltodict, xlrd, urllib3, tqdm, tomli, toml, tenacity, structlog, smmap, six, pyyaml, pyparsing, pydocstyle, pycparser, pybind11, pluggy, pillow, packaging, numpy, multidict, mock, MarkupSafe, lazy-object-proxy, kiwisolver, jmespath, isort, iniconfig, imageio-ffmpeg, idna, frozenlist, fonttools, exceptiongroup, epiweeks, cycler, coverage, click, charset-normalizer, certifi, attrs, async-timeout, yarl, werkzeug, shapely, scipy, requests, python-dateutil, pytest, pyproj, Jinja2, importlib-resources, importlib-metadata, imageio, gitdb, contourpy, cligj, click-plugins, cffi, astroid, aiosignal, scs, responses, qdldl, pytest-cov, pylint, pandas, matplotlib, gitpython, freezegun, fiona, ecos, cryptography, clarabel, botocore, aiohttp, slackclient, s3transfer, osqp, geopandas, descartes, delphi-epidata, cvxpy, covidcast, boto3, moto, delphi_utils
#5 73.65 Successfully installed Jinja2-3.1.3 MarkupSafe-2.1.4 aiohttp-3.9.1 aiosignal-1.3.1 astroid-2.5.6 async-timeout-4.0.3 attrs-23.2.0 boto3-1.34.29 botocore-1.34.29 certifi-2023.11.17 cffi-1.16.0 charset-normalizer-3.3.2 clarabel-0.6.0 click-8.1.7 click-plugins-1.1.1 cligj-0.7.2 contourpy-1.1.1 coverage-7.4.1 covidcast-0.2.2 cryptography-42.0.1 cvxpy-1.4.2 cycler-0.12.1 delphi-epidata-4.1.16 delphi_utils-0.3.15 descartes-1.1.0 ecos-2.0.12 epiweeks-2.3.0 exceptiongroup-1.2.0 fiona-1.9.5 fonttools-4.47.2 freezegun-1.4.0 frozenlist-1.4.1 geopandas-0.13.2 gitdb-4.0.11 gitpython-3.1.41 idna-3.6 imageio-2.33.1 imageio-ffmpeg-0.4.9 importlib-metadata-7.0.1 importlib-resources-6.1.1 iniconfig-2.0.0 isort-5.13.2 jmespath-1.0.1 kiwisolver-1.4.5 lazy-object-proxy-1.10.0 matplotlib-3.7.4 mccabe-0.6.1 mock-5.1.0 moto-4.2.13 multidict-6.0.4 numpy-1.24.4 osqp-0.6.3 packaging-23.2 pandas-1.5.3 pillow-10.2.0 pluggy-1.4.0 pybind11-2.11.1 pycparser-2.21 pydocstyle-6.3.0 pylint-2.8.3 pyparsing-3.1.1 pyproj-3.5.0 pytest-7.4.4 pytest-cov-4.1.0 python-dateutil-2.8.2 pytz-2023.3.post1 pyyaml-6.0.1 qdldl-0.1.7.post0 requests-2.31.0 responses-0.24.1 s3transfer-0.10.0 scipy-1.10.1 scs-3.2.4.post1 shapely-2.0.2 six-1.16.0 slackclient-2.9.4 smmap-5.0.1 snowballstemmer-2.2.0 structlog-24.1.0 tenacity-8.2.3 toml-0.10.2 tomli-2.0.1 tqdm-4.66.1 urllib3-1.26.18 werkzeug-3.0.1 wrapt-1.12.1 xlrd-2.0.1 xmltodict-0.13.0 yarl-1.9.4 zipp-3.17.0
#5 73.65 WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
#5 73.93 
#5 73.93 [notice] A new release of pip is available: 23.0.1 -> 23.3.2
#5 73.93 [notice] To update, run: pip install --upgrade pip
#5 DONE 76.2s

#6 exporting to image
#6 exporting layers
#6 exporting layers 4.3s done
#6 writing image sha256:96ad0579199bbbfc2c0f4a76c7e39006c7261b5b3e5b2b6f91709f2426ce0ef1 done
#6 naming to docker.io/library/py-test done
#6 DONE 4.4s

I tried the same with Flask and it auto-pulled in jinja, so that explains the last bit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants