The Scrapy shell can be used to scrape data with error-free code, without using a spider. The main purpose of the Scrapy shell is to test the extraction code, i.e. XPath or CSS expressions. It also helps to specify the web pages from which you are scraping data.
The shell can be configured by installing the IPython console (used for interactive computing), which is a powerful interactive shell providing auto-completion, colorized output, and more.
If you are working on a Unix platform, then it is better to install IPython. You can also use bpython, if IPython is inaccessible.
You can configure the shell by setting the environment variable called SCRAPY_PYTHON_SHELL, or by defining the scrapy.cfg file as follows:
[settings]
shell = bpython
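Alternatively, the same choice can be made through the environment variable, for example by running export SCRAPY_PYTHON_SHELL=ipython in a Unix shell before starting Scrapy.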
The Scrapy shell can be launched using the following command:
scrapy shell <url>
The url specifies the URL for which the data needs to be scraped.
The shell provides some additional shortcuts and Scrapy objects, as shown in the following tables.
The shell provides the following shortcuts, available within a project:
Sr.No | Shortcut & Description
---|---
1 | shelp(): Provides the available objects and shortcuts with the help option.
2 | fetch(request_or_url): Collects the response from the request or URL, and the associated objects will be updated properly.
3 | view(response): Lets you view the response for the given request in the local browser for observation; to display external links correctly, it appends a base tag to the response body.
The shell provides the following Scrapy objects, available within a project:
Sr.No | Object & Description
---|---
1 | crawler: Specifies the current crawler object.
2 | spider: Specifies the spider that can handle the current URL, or a new default Spider object if no spider is found for the current URL.
3 | request: Specifies the request object of the last fetched page.
4 | response: Specifies the response object of the last fetched page.
5 | settings: Provides the current Scrapy settings.
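As a quick illustrative sketch (assuming the shell has already fetched a page, so that request and response are populated), these objects can be queried directly; the exact output depends on your project:
>> request.url                  # URL of the last request made
>> response.status              # HTTP status code of the last response
>> settings.get('BOT_NAME')     # look up any active Scrapy setting
>> crawler.spider is spider     # the crawler ties these objects together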
Let us try scraping the scrapy.org website, and then begin scraping data from reddit.com, as described below.
Before moving ahead, first we will launch the shell, as shown in the following command:
scrapy shell 'http://scrapy.org' --nolog
While using the above URL, Scrapy will display the available objects:
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated
                        objects will get updated
[s]   view(response)    View the response for the given request
Next, begin working with the objects, as shown below:
>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
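The same title can also be extracted with a CSS expression; response.css is the CSS counterpart of response.xpath and supports the same extraction methods:
>> response.css('title::text').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'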
>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://reddit.com>
[s]   response   <200 https://www.reddit.com/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>> request = request.replace(method="POST")
>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
...
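Note that replace() does not modify the request in place; it returns a new Request object with the given attributes changed. A small sketch (the name get_request is only for illustration):
>> get_request = request.replace(method="GET")
>> get_request.method
'GET'
>> request.method
'POST'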
You can inspect the responses being processed by the spider, but only at the points where you are expecting to get that response.
For example:
import scrapy

class SpiderDemo(scrapy.Spider):
    name = "spiderdemo"
    start_urls = [
        "http://mysite.com",
        "http://mysite1.org",
        "http://mysite2.net",
    ]

    def parse(self, response):
        # You can inspect one specific response
        if ".net" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)
As shown in the above code, you can invoke the shell from within the spider to inspect the responses, using the following function:
scrapy.shell.inspect_response
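When you have finished inspecting a response in the embedded shell, you can press Ctrl-D (or Ctrl-Z in Windows) to exit it and let the crawl resume.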
Now run the spider and you will get the following output:
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite.com> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite1.org> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite2.net> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
...
>> response.url
'http://mysite2.net'
You can check whether the extraction code is working, using the following code:
>> response.xpath('//div[@class = "val"]')
It displays the output as:
[]
The above line has displayed only a blank output. Now you can invoke the shell to inspect the response, as follows:
>> view(response)
It displays the response as:
True
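A blank result like the one above often means the selector is stricter than the actual markup (for example, an exact class match on an element that carries several classes), or that the content is injected by JavaScript and therefore absent from the raw response. As an illustrative sketch, a more forgiving XPath matches any div whose class attribute merely contains "val":
>> response.xpath('//div[contains(@class, "val")]')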