Cannot get response headers #512
Replies: 4 comments
@ejkitchen Could you please share the URL you're trying? It works on my side, as seen in this image, and shouldn't require any flag by default; the problem may be site-specific. Sharing the URL will help me check it further.
Thanks for the response! I decided to go back to basics, as I could not get this to work. We had about 400k URLs, and I sampled about 15 at random (it's a public website): nothing, and no errors in the logs. Everything else was coming through, but not the headers. Using the basic Python libraries, the headers come through without issue. I no longer have the code, since we moved away from the library, but if you share your snippet here, I can try again; if I hit an issue, I'll let you know ASAP, along with the links.
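For reference, here is roughly what that "basic Python libs" sanity check looks like. This is my own minimal sketch, not code from the thread: it spins up a throwaway local HTTP server so it needs no network access, and the `X-Example` header is made up for illustration; against a real site you would pass its URL instead.

```python
# Sanity check: response headers come through with the standard library
# alone, independent of any crawler framework.
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("X-Example", "present")  # illustrative header
        self.end_headers()
        self.wfile.write(b"ok")

    def log_message(self, *args):
        # keep the demo quiet
        pass

# Port 0 lets the OS pick a free port.
server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()

url = f"http://127.0.0.1:{server.server_port}/"
with urllib.request.urlopen(url) as resp:
    headers = dict(resp.headers)

server.shutdown()
print(headers.get("X-Example"))
```

If the headers show up here but not through the crawler, the loss is happening inside the crawler's result path rather than on the wire.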
Sure, for example:

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler(headless=True) as crawler:
        result = await crawler.arun(
            url="https://en.wikipedia.org/wiki/apple",
            bypass_cache=True,
        )
        print(result.response_headers)

if __name__ == "__main__":
    asyncio.run(main())
```

This generates the log below for me:
@ejkitchen I've figured out the issue. You were right - it occurs when

When I do this:

```python
result = await crawler.arun(url=url, bypass_cache=False)
```

`result.response_headers` is always `{}`.
It appears here (line numbers from the library source):

```python
62   async def arun(
....
131:     crawl_result.response_headers = async_response.response_headers if async_response else {}
```
It appears that async_response is always None. Is there a way to get the headers? Am I missing a flag? I have never been able to get the headers and this is baffling to me.
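For what it's worth, a cache hit would explain exactly this symptom. Below is a hypothetical toy sketch (my own code, not crawl4ai's implementation, and `ToyCrawler` is an invented name): if the cache stores only the page body, a cached result never touches the network, so there is no live response object to copy headers from, and `response_headers` falls back to `{}`.

```python
# Toy model of a crawler whose cache stores only page content, not headers.
class ToyCrawler:
    def __init__(self):
        self._cache = {}  # url -> page body only; headers are never cached

    def _fetch(self, url):
        # Stand-in for a real network fetch that yields body + headers.
        return "<html>...</html>", {"Content-Type": "text/html"}

    def run(self, url, bypass_cache=False):
        if not bypass_cache and url in self._cache:
            # Cache hit: no live response, so there are no headers to report.
            return {"html": self._cache[url], "response_headers": {}}
        html, headers = self._fetch(url)
        self._cache[url] = html
        return {"html": html, "response_headers": headers}

c = ToyCrawler()
first = c.run("https://example.com")                      # real fetch: headers present
second = c.run("https://example.com")                     # cache hit: headers empty
third = c.run("https://example.com", bypass_cache=True)   # forced fetch: headers back
print(second["response_headers"])
```

Under this model, only the very first (or explicitly bypassed) fetch carries headers, which would match headers appearing with `bypass_cache=True` but not with `bypass_cache=False`.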