Conversation
humitos
left a comment
There was a problem hiding this comment.
I think this is going in a good direction 👍🏼 . We should change the logic to keep the N builds for each version as we discussed in the meeting, tho.
If we are very conservative
Starting conservative is fine. There is no reason to rush on this and delete all the objects at once.
|
@copilot write tests for the |
…n `_delete_builds` (#12820) - [x] Write tests for `delete_old_build_objects` function - [x] Fix `KeyError` in `_delete_builds` when no builds are deleted - [x] Use `assert foo == bar` syntax instead of `self.assertEqual` in `TestDeleteOldBuildObjects` <!-- START COPILOT CODING AGENT TIPS --> --- 🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. [Learn more about Advanced Security.](https://gh.io/cca-advanced-security) --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: stsewd <4975310+stsewd@users.noreply.github.com>
humitos
left a comment
There was a problem hiding this comment.
This is looking good. However, I think there are a few things we should do before moving forward and merging. There are some performance issues here and also a tweak needed in the deletion of objects formula.
We should be cautious here since we will be deleting a lot of objects. Another person should review this as well.
| builds_to_delete = version.builds.filter( | ||
| state__in=BUILD_FINAL_STATES, | ||
| date__lt=cutoff_date, | ||
| ).order_by("-date") | ||
| limit -= _delete_builds(builds_to_delete, start=keep_recent, end=keep_recent + limit) | ||
| if limit <= 0: |
There was a problem hiding this comment.
Why are not taking the slice here? I looks weird that we are adding all the builds to builds_to_delete but we are filtering them again and not deleting them all.
| builds_to_delete = version.builds.filter( | |
| state__in=BUILD_FINAL_STATES, | |
| date__lt=cutoff_date, | |
| ).order_by("-date") | |
| limit -= _delete_builds(builds_to_delete, start=keep_recent, end=keep_recent + limit) | |
| if limit <= 0: | |
| builds_to_delete = version.builds.filter( | |
| state__in=BUILD_FINAL_STATES, | |
| date__lt=cutoff_date, | |
| ).order_by("-date")[keep_recent:keep_recent + limit] | |
| limit -= _delete_builds(builds_to_delete) |
There was a problem hiding this comment.
I prefer if we perform the slice here since it will be clearer and avoid confusions.
There was a problem hiding this comment.
Deleting in batches doesn't work over a sliced queryset. It's explained in the PR description.
There was a problem hiding this comment.
Yeah, but this is pretty confusing to me. We are calling _delete_builds(builds_to_delete) but the result is that those builds are not deleted? They are filtered instead and just a few of them are deleted? 🤔
There was a problem hiding this comment.
Well, we are calling _delete_builds(queryset, start, end), that seems explicit?
| # Delete builds that are not associated with any version, | ||
| # keeping the most recent `keep_recent` builds per project. | ||
| builds_to_delete = project.builds.filter( | ||
| version=None, state__in=BUILD_FINAL_STATES, date__lt=cutoff_date | ||
| ).order_by("-date") | ||
| limit -= _delete_builds(builds_to_delete, start=keep_recent, end=keep_recent + limit) | ||
| if limit <= 0: | ||
| return |
There was a problem hiding this comment.
If the builds are not associated to any version, this is because the version was deleted. We shouldn't keep only keep_recent for all of those builds here, but keep_recent builds per build.version_slug instead to be consistent with the formula.
There was a problem hiding this comment.
Is it worth it complicating this more? If the version is deleted, the build is never referenced by any version. We could keep 2x or 3x the builds here if we want to keep more builds, but honestly I think we are fine here.
There was a problem hiding this comment.
What are we saving this builds for if they are not reference by any version?
There was a problem hiding this comment.
So their builds page wouldn't immediately 404, and users had a history of previous builds. I'll be fine deleting builds when a version is deleted.
There was a problem hiding this comment.
Yeah, I think we either want:
- delete these build objects, OR
- apply the same formula we are applying to the other builds
but I wouldn't create another logic for this particular use case.
There was a problem hiding this comment.
but I wouldn't create another logic for this particular use case.
It's simpler and faster to apply a global limit than grouping the query for versions and then delete builds for each one of them...
| """ | ||
| for batch in batched(paths, 1000): | ||
| objects = [{"Key": path} for path in batch] | ||
| self.bucket.delete_objects(Delete={"Objects": objects, "Quiet": True}) |
There was a problem hiding this comment.
Do you know what Quiet means. I read:
Quiet (boolean) –
Element to enable quiet mode for the request. When you add this element, you must set its value to true.
but I don't understand it. Would this call raise an exception if it fails?
There was a problem hiding this comment.
It reports the result of each path that was deleted, if it failed or not. We aren't doing anything with the result, so we don't need it.
| def delete_paths(self, paths): | ||
| """ | ||
| Delete multiple paths from storage. | ||
|
|
||
| This is a convenience method to delete multiple paths at once, which can be more efficient | ||
| for some storage backends that support batch deletion (eg. S3). | ||
| """ | ||
| for path in paths: | ||
| self.delete(path) |
There was a problem hiding this comment.
I don't understand why we need this method here, 🤔 ?
I saw that you created RTDS3Boto3Storage.delete_paths below that looks 👍🏼 -- why do we need to override it here?
There was a problem hiding this comment.
This is needed for testing, we use a file system backend there. I actually want to refactor how we inherit these classes, currently not all storage backends inherit these classes.
Pinging @agjohnson here so he can perform another review. The pattern to delete objects has been increasing its complexity in the last weeks and I don't understand it anymore. We have a lot going on here in multiple functions:
Why do we need 4 functions to perform the deletion and what does each of them? I can't explain them by myself. @agjohnson If you don't want to review all the PR, these are my main concerns:
|
|
Community
Around 500 projects make up 30% of the builds in the database.
If we are very conservative, we can delete:
If we are a little bit more aggressive, we can delete:
Business
Around 90 projects make up 56% of the builds in the database.
If we are very conservative, we can delete:
If we are a little bit more aggressive, we can delete:
Full analysis at https://readthedocs.slack.com/archives/G81T0N8S3/p1771531958918869?thread_ts=1771531854.785459&cid=G81T0N8S3
Closes #12712