Scrapy crawls websites using Request and Response objects. A Request object passes through the system: the spider generates and issues the request, the downloader executes it, and a Response object is returned to the spider that issued the request.
A Request object is an HTTP request that generates a response. It has the following class:
class scrapy.http.Request(url[, callback, method = 'GET', headers, body, cookies, meta,
encoding = 'utf-8', priority = 0, dont_filter = False, errback])
The following table shows the parameters of Request objects:
| Sr.No | Parameter | Description |
| --- | --- | --- |
| 1 | url | A string specifying the URL of the request. |
| 2 | callback | A callable that receives the response of this request as its first parameter. |
| 3 | method | A string specifying the HTTP method of the request. |
| 4 | headers | A dictionary of request headers. |
| 5 | body | A string or unicode object containing the request body. |
| 6 | cookies | A dictionary (or a list of dictionaries) containing the request cookies. |
| 7 | meta | A dictionary containing arbitrary metadata for the request. |
| 8 | encoding | A string with the encoding (default utf-8) used to encode the URL. |
| 9 | priority | An integer used by the scheduler to define the order in which requests are processed. |
| 10 | dont_filter | A boolean indicating that the scheduler should not filter out this request as a duplicate. |
| 11 | errback | A callable to be invoked when an exception is raised while processing the request. |
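For example, the following sketch (inside a spider; the URL, header, and cookie values are placeholders, and the parse_page and errback_page callbacks are assumed to be defined elsewhere in the spider) combines several of these parameters in a single request:

def start_requests(self):
   # Placeholder URL and values; illustrates combining several Request parameters.
   yield scrapy.Request(
      url = "http://www.something.com/some_page.html",
      method = 'GET',
      headers = {'User-Agent': 'demo-bot'},
      cookies = {'currency': 'USD'},
      meta = {'page_type': 'listing'},
      priority = 10,
      dont_filter = True,
      callback = self.parse_page,
      errback = self.errback_page)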
The callback function of a request is invoked when the response for that request has been downloaded; it receives the downloaded response as its first parameter.
For example:
def parse_page1(self, response):
   return scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)

def parse_page2(self, response):
   self.logger.info("%s page visited", response.url)
If you want to pass arguments to a callable function and receive those arguments in the second callback, you can use the Request.meta attribute, as shown in the following example:
def parse_page1(self, response):
   item = DemoItem()
   item['foremost_link'] = response.url
   request = scrapy.Request("http://www.something.com/some_page.html",
      callback = self.parse_page2)
   request.meta['item'] = item
   return request

def parse_page2(self, response):
   item = response.meta['item']
   item['other_link'] = response.url
   return item
errback is a callable function to be invoked when an exception is raised while processing a request.
The following example demonstrates this:
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError
from twisted.internet.error import DNSLookupError
from twisted.internet.error import TimeoutError, TCPTimedOutError
class DemoSpider(scrapy.Spider):
   name = "demo"
   start_urls = [
      "http://www.httpbin.org/",             # HTTP 200 expected
      "http://www.httpbin.org/status/404",   # Webpage not found
      "http://www.httpbin.org/status/500",   # Internal server error
      "http://www.httpbin.org:12345/",       # timeout expected
      "http://www.httphttpbinbin.org/",      # DNS error expected
   ]

   def start_requests(self):
      for u in self.start_urls:
         yield scrapy.Request(u, callback = self.parse_httpbin,
            errback = self.errback_httpbin,
            dont_filter = True)

   def parse_httpbin(self, response):
      self.logger.info('Received response from {}'.format(response.url))
      # ...

   def errback_httpbin(self, failure):
      # logs all failures
      self.logger.error(repr(failure))

      if failure.check(HttpError):
         response = failure.value.response
         self.logger.error("HttpError occurred on %s", response.url)

      elif failure.check(DNSLookupError):
         request = failure.request
         self.logger.error("DNSLookupError occurred on %s", request.url)

      elif failure.check(TimeoutError, TCPTimedOutError):
         request = failure.request
         self.logger.error("TimeoutError occurred on %s", request.url)
The Request.meta attribute can contain special keys whose meaning is recognized by Scrapy.
The following table shows some of the keys of Request.meta:
| Sr.No | Key | Description |
| --- | --- | --- |
| 1 | dont_redirect | When set to true, the request is not redirected based on the status of the response. |
| 2 | dont_retry | When set to true, failed requests are not retried and are ignored by the retry middleware. |
| 3 | handle_httpstatus_list | Defines, on a per-request basis, which response codes are allowed. |
| 4 | handle_httpstatus_all | When set to true, any response code is allowed for the request. |
| 5 | dont_merge_cookies | When set to true, the request cookies are not merged with the existing cookies. |
| 6 | cookiejar | Used to keep multiple cookie sessions per spider. |
| 7 | dont_cache | Used to avoid caching the HTTP request and response with the active cache policy. |
| 8 | redirect_urls | Contains the URLs through which the request has passed (via redirects). |
| 9 | bindaddress | The IP of the outgoing address used to perform the request. |
| 10 | dont_obey_robotstxt | When set to true, the request is not filtered out by the robots.txt exclusion standard, even if ROBOTSTXT_OBEY is enabled. |
| 11 | download_timeout | Sets the time (in seconds) the downloader will wait for this request before timing out. |
| 12 | download_maxsize | Sets the maximum response size (in bytes) the downloader will download for this request. |
| 13 | proxy | Sets an HTTP proxy to be used for this request. |
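The following sketch (inside a spider; the URL, proxy address, and the parse_page callback are placeholders) shows how several of these special keys can be passed through the meta dictionary of a single request:

def start_requests(self):
   # Placeholder URL and proxy; the meta keys below are interpreted by Scrapy's built-in middlewares.
   yield scrapy.Request(
      "http://www.something.com/some_page.html",
      meta = {
         'dont_redirect': True,            # do not follow redirects for this request
         'dont_retry': True,               # do not retry this request on failure
         'download_timeout': 30,           # downloader waits at most 30 seconds
         'proxy': 'http://10.10.1.1:3128'  # send this request through an HTTP proxy
      },
      callback = self.parse_page)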
You can implement your own custom functionality by subclassing the Request class. The built-in Request subclasses are as follows:
The FormRequest class deals with HTML forms by extending the base Request class. It has the following class:
class scrapy.http.FormRequest(url[,formdata, callback, method = 'GET', headers, body,
cookies, meta, encoding = 'utf-8', priority = 0, dont_filter = False, errback])
Following is the parameter:
formdata: a dictionary containing HTML form data that is assigned to the body of the request.
Note: The remaining parameters are the same as in the Request class and are explained in the Request Objects section above.
In addition to the request methods, FormRequest objects support the following class method:
classmethod from_response(response[, formname = None, formnumber = 0, formdata = None,
formxpath = None, formcss = None, clickdata = None, dont_click = False, ...])
The following table shows the parameters of the above class method:
| Sr.No | Parameter | Description |
| --- | --- | --- |
| 1 | response | An object used to pre-populate the form fields from the HTML form of the response. |
| 2 | formname | A string; if specified, the form whose name attribute matches it is used. |
| 3 | formnumber | An integer indicating which form to use when the response contains multiple forms. |
| 4 | formdata | A dictionary of fields that override the form data found in the response. |
| 5 | formxpath | A string; if specified, the form matching the XPath is used. |
| 6 | formcss | A string; if specified, the form matching the CSS selector is used. |
| 7 | clickdata | A dictionary of attributes used to look up the clicked control. |
| 8 | dont_click | When set to true, the form data is submitted without clicking any element. |
Following are some of the request usage examples:
Using FormRequest to send data via HTTP POST
The following code demonstrates how to return a FormRequest object when you want to replicate an HTML form POST in your spider:
return [FormRequest(url = "http://www.something.com/post/action",
   formdata = {'firstname': 'John', 'lastname': 'dave'},
   callback = self.after_post)]
Using FormRequest.from_response() to simulate a user login
Websites normally use HTML <form> elements to provide pre-populated form fields.
When you want these fields to be filled in automatically while scraping, you can use the FormRequest.from_response() method.
The following example demonstrates this:
import scrapy

class DemoSpider(scrapy.Spider):
   name = 'demo'
   start_urls = ['http://www.something.com/users/login.php']

   def parse(self, response):
      return scrapy.FormRequest.from_response(
         response,
         formdata = {'username': 'admin', 'password': 'confidential'},
         callback = self.after_login
      )

   def after_login(self, response):
      # response.body is bytes, so check the decoded text instead
      if "authentication failed" in response.text:
         self.logger.error("Login failed")
         return
      # You can continue scraping here
A Response is an object indicating an HTTP response, which is fed to the spiders for processing. It has the following class:
class scrapy.http.Response(url[, status = 200, headers, body, flags])
The following table shows the parameters of Response objects:
| Sr.No | Parameter | Description |
| --- | --- | --- |
| 1 | url | A string specifying the URL of the response. |
| 2 | status | An integer containing the HTTP status of the response. |
| 3 | headers | A dictionary containing the response headers. |
| 4 | body | A string (bytes) containing the response body. |
| 5 | flags | A list containing the flags of the response. |
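As a brief sketch, the attributes above are typically read inside a spider callback as follows (the log messages are only illustrative):

def parse(self, response):
   # Inspect the basic attributes of the Response object passed to the callback.
   self.logger.info("URL: %s", response.url)
   self.logger.info("Status: %d", response.status)
   self.logger.info("Content-Type: %s", response.headers.get('Content-Type'))
   self.logger.info("Body length: %d bytes", len(response.body))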
You can implement your own custom functionality by subclassing the Response class. The built-in Response subclasses are as follows:
TextResponse Objects
TextResponse objects are used for textual data such as HTML and XML and add encoding capabilities to the base Response class. It has the following class:
class scrapy.http.TextResponse(url[, encoding[,status = 200, headers, body, flags]])
Following is the parameter:
encoding: a string with the encoding that is used to encode the response.
Note: The remaining parameters are the same as in the Response class and are explained in the Response Objects section above.
The following table shows the attributes supported by TextResponse objects in addition to the Response methods:
| Sr.No | Attribute | Description |
| --- | --- | --- |
| 1 | text | The response body as unicode; response.text can be accessed multiple times. |
| 2 | encoding | A string containing the encoding of the response. |
| 3 | selector | An attribute instantiated on first access that uses the response as its target. |
The following table shows the methods supported by TextResponse objects in addition to the Response methods:
| Sr.No | Method | Description |
| --- | --- | --- |
| 1 | xpath(query) | A shortcut for TextResponse.selector.xpath(query). |
| 2 | css(query) | A shortcut for TextResponse.selector.css(query). |
| 3 | body_as_unicode() | Returns the response body as unicode; it is equivalent to accessing response.text. |
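For illustration (the selectors below are placeholders), these shortcuts are normally used inside a spider callback as follows:

def parse(self, response):
   # response is a TextResponse subclass here, so the selector shortcuts are available.
   page_title = response.xpath('//title/text()').get()   # same as response.selector.xpath(...)
   first_link = response.css('a::attr(href)').get()      # same as response.selector.css(...)
   self.logger.info("Encoding: %s, title: %s, first link: %s",
      response.encoding, page_title, first_link)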
HtmlResponse Objects
It is an object that supports encoding and auto-discovery by looking at the meta http-equiv attribute of HTML. Its parameters are the same as those of the Response class and are explained in the Response Objects section. It has the following class:
class scrapy.http.HtmlResponse(url[,status = 200, headers, body, flags])
XmlResponse Objects
It is an object that supports encoding and auto-discovery by looking at the XML declaration line. Its parameters are the same as those of the Response class and are explained in the Response Objects section. It has the following class:
class scrapy.http.XmlResponse(url[, status = 200, headers, body, flags])
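As a rough sketch (the bodies below are made up), manually constructing these responses shows how the encoding is auto-discovered, from the meta http-equiv attribute for HtmlResponse and from the XML declaration line for XmlResponse:

from scrapy.http import HtmlResponse, XmlResponse

# Made-up HTML body that declares its encoding in a meta http-equiv tag.
html_body = b'<html><head><meta http-equiv="Content-Type" ' \
            b'content="text/html; charset=utf-8"></head><body>Caf\xc3\xa9</body></html>'

# Made-up XML body that declares its encoding in the XML declaration line.
xml_body = b'<?xml version="1.0" encoding="utf-8"?><items><item>Caf\xc3\xa9</item></items>'

html_response = HtmlResponse(url = "http://www.something.com/", body = html_body)
xml_response = XmlResponse(url = "http://www.something.com/feed.xml", body = xml_body)

print(html_response.encoding)   # typically 'utf-8', discovered from the meta tag
print(xml_response.encoding)    # typically 'utf-8', discovered from the XML declaration
print(html_response.text)       # body decoded to unicode using the discovered encoding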