📜  Scrapy Shell

📅  Last modified: 2020-10-31 14:34:15             🧑  Author: Mango


Description

The Scrapy shell can be used to scrape data with error-free code, without using a spider. Its main purpose is to test extraction code, i.e. XPath or CSS expressions. It also lets you specify the web page from which you want to scrape data.

Configuring the Shell

The shell can be configured by installing the IPython console (used for interactive computing), a powerful interactive shell that provides auto-completion, colorized output, and more.

If you are working on the Unix platform, it is better to install IPython. You can also use bpython if IPython is not available.

You can configure the shell by setting an environment variable called SCRAPY_PYTHON_SHELL, or by defining it in the scrapy.cfg file as follows:

[settings]
shell = bpython
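
If you prefer not to edit scrapy.cfg, the same choice can be made per session through the environment variable; a minimal sketch, assuming a Unix-like shell and that IPython is installed:

```shell
# Tell Scrapy which interactive shell to use for this session
export SCRAPY_PYTHON_SHELL=ipython
```

With this set, the next `scrapy shell` invocation starts under IPython.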

Launching the Shell

The Scrapy shell can be launched with the following command:

scrapy shell <url>

Here, url specifies the URL for which the data needs to be scraped.

Using the Shell

The shell provides some additional shortcuts and Scrapy objects, as described in the following tables.

Available Shortcuts

The shell provides the following shortcuts in a project:

1. shelp() - Prints the available objects and shortcuts, along with their help text.

2. fetch(request_or_url) - Fetches the response for the given request or URL, and updates the associated shell objects accordingly.

3. view(response) - Opens the given response in the local browser for inspection; to display external links correctly, it appends a base tag to the response body.

Available Scrapy Objects

The shell provides the following Scrapy objects in a project:

1. crawler - The current Crawler object.

2. spider - The spider that can handle the current URL; if no spider is found for the present URL, a new default Spider object is created to handle it.

3. request - The Request object of the last fetched page.

4. response - The Response object of the last fetched page.

5. settings - The current Scrapy settings.

Example of a Shell Session

Let us try scraping the scrapy.org site, and then begin scraping data from reddit.com, as described below.

Before moving ahead, first we will launch the shell as shown in the following command:

scrapy shell 'http://scrapy.org' --nolog

Scrapy will display the available objects while using the above URL:

[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated objects will get updated
[s]   view(response)    View the response for the given request

Next, start working with the objects, as follows:

>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET https://www.reddit.com/>
[s]   response   <200 https://www.reddit.com/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>> request = request.replace(method="POST")
>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
...

Invoking the Shell from Spiders to Inspect Responses

You can inspect the responses being processed by a spider, which is useful when you want to verify that a response is the one you were expecting.

For example:

import scrapy 

class SpiderDemo(scrapy.Spider): 
   name = "spiderdemo" 
   start_urls = [ 
      "http://mysite.com", 
      "http://mysite1.org", 
      "http://mysite2.net", 
   ]  
   
   def parse(self, response): 
      # You can inspect one specific response 
      if ".net" in response.url: 
         from scrapy.shell import inspect_response 
         inspect_response(response, self)

As shown in the above code, you can invoke the shell from the spider to inspect a response, using the following function:

scrapy.shell.inspect_response

Now run the spider and you will get the following output:

2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite.com> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite1.org> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite2.net> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
...
>> response.url
'http://mysite2.net'

You can check whether the extraction code is working using the following code:

>> response.xpath('//div[@class = "val"]')

It displays the output as:

[]

The above line displays only a blank output. You can now open the response in the browser to inspect it, as follows:

>> view(response)

It displays the response as:

True