📅  Last modified: 2023-12-03 15:04:53.747000             🧑  Author: Mango
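When crawling, it is sometimes useful to know which page led to the current one. In Scrapy, this referrer information is carried in the Referer header of the request that produced the response.
Note that Scrapy's Response object has no response.referrer attribute. The referrer is the URL of the page whose link led to the current page, and it belongs to the request, not to the response. If the current page was not reached from another page, the request has no Referer header.
Inside a spider callback, the referrer can be read with response.request.headers.get('Referer'), which returns None when the header is absent.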
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"

    def start_requests(self):
        urls = [
            "http://www.example.com/page1",
            "http://www.example.com/page2",
            "http://www.example.com/page3",
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # Header values are stored as bytes; None means no Referer was sent.
        referrer = response.request.headers.get('Referer')
        if referrer:
            self.logger.info(f"This page was referred by {referrer.decode()}")
        else:
            self.logger.info("This page was not referred by any other page")
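If the request was generated from another page's response, referrer holds the URL of that page as a bytes value (for example b'http://www.example.com/page1'), because Scrapy stores header values as bytes. If the request carries no Referer header, referrer is None.
In the example above, the requests yielded from start_requests are not generated from any response, so by default they carry no Referer header and the spider logs the second message. Requests yielded from a callback normally have the Referer header set by Scrapy's built-in RefererMiddleware, depending on the referrer policy in effect.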
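As a minimal sketch, the lookup and the bytes-to-text conversion can be wrapped in a small helper. The name get_referrer is hypothetical and not part of Scrapy, and the SimpleNamespace objects below are only stand-ins for Scrapy responses so the snippet runs without a crawl; inside a real callback you would pass the response itself.

```python
from types import SimpleNamespace
from typing import Optional


def get_referrer(response) -> Optional[str]:
    """Return the Referer of the request that produced `response`, or None.

    Hypothetical helper (not a Scrapy API); it only relies on
    response.request.headers.get('Referer').
    """
    request = getattr(response, "request", None)
    if request is None:
        # The response is not tied to any request.
        return None
    value = request.headers.get("Referer")
    if value is None:
        return None
    # Scrapy stores header values as bytes; plain strings pass through unchanged.
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value


# Stand-ins for Scrapy objects (assumption: used only to make the sketch runnable).
followed = SimpleNamespace(
    request=SimpleNamespace(headers={"Referer": b"http://www.example.com/page1"})
)
initial = SimpleNamespace(request=SimpleNamespace(headers={}))

print(get_referrer(followed))  # http://www.example.com/page1
print(get_referrer(initial))   # None
```

Returning a str keeps log messages and comparisons free of the b'...' prefix, while None still signals that no Referer header was sent.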
That is how to retrieve a page's referrer information in Scrapy, using response.request.headers.get('Referer'). Hopefully it is helpful.