-
Notifications
You must be signed in to change notification settings - Fork 703
fix: Prevent premature EventManager shutdown when multiple crawlers share it
#1810
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Open
Mantisus
wants to merge
9
commits into
apify:master
Choose a base branch
from
Mantisus:fix-global-event-manager
base: master
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
+149
−36
Open
Changes from 5 commits
Commits
Show all changes
9 commits
Select commit
Hold shift + click to select a range
0ca0895
fix global `event_manager` for correctly support multiple crawlers
Mantisus c2a2290
launch global event_manager with local in crawler
Mantisus 6e15799
fix typo
Mantisus 4bf288f
fix
Mantisus ff4c976
code quality up
Mantisus 53d31e5
Merge branch 'master' into fix-global-event-manager
Mantisus 88c35d5
fix typing
Mantisus 838853e
Update tests/unit/crawlers/_basic/test_basic_crawler.py
Mantisus c67510d
fix `LocalEventManager`
Mantisus File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not so sure about this. Why should the crawler initialize a component that it doesn't use? And then tear it down? I think we should try to see a bigger picture first.
The linked issues are caused by
SnapshotterandRecoverableStatesubscribing to an event manager that doesn't emit anything - how does that happen?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer to remove
event_managerfrom theBasicCrawlerconstructor entirely and keep only the global one, configurable viaservice_locator.set_event_manager. However, this requires discussion and should probably be part of v2.The global
event_managerbeing started inside the crawler, while only being used by the crawler's internal components, is an early architectural decision. I assume that it was done for a better user experience.I believe the root cause is bugs introduced during our work on supporting multiple parallel crawlers:
test_multiple_crawlers_with_global_event_manager. The first crawler to activate theevent_manageris fully responsible for its lifecycle. When the crawler finishes, it tears down theevent_manager, which clears all subscriptions, leaving any still-running crawlers without a functioningevent_manager._service_locatorinBasicCrawler, which caused the user-providedevent_managerto be used only forEvent.CRAWLER_STATUSin_crawler_state_task, while internal components (Snapshotter,RecoverableState) continued subscribing to the globalservice_locator, which may never be started.All that made the behavior of
service_locator.get_event_manager()unstable and unpredictable.This PR is an attempt to fix these issues within the current architecture without introducing breaking changes.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the fixes that solve 2 Crawlers using the same
EventManagerare good. Regarding the global event manager. I see this as a quick fix.I think this would lead to counterintuitive behavior due to
ConfigurationhavingEventManagerspecific fields and they would be ignored in cases like this:Maybe it would be possible for each subscriber to the event manager to have at least an optional init argument that would allow passing an explicit
event_manager(if it is creating storages, then maybeservice_locatorargument would be even better) instead of relying on the global one.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When porting the service locator to crawlee-js (apify/crawlee#3325), I actually made the global
serviceLocatorobject into a proxy that resolves to the crawler-scoped ServiceLocator when called from a crawler instance and to the global one otherwise.If we adopted that in Python as well, it would solve the inconsistency caused by
Snapshotterand friends still using the global locator.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, I'll study that solution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me know if you get stuck on something, it's kinda magic. But that's the cost of making the library usable without factories, builders and explicit dependency containers.