The Scrapy shell can be used to scrape data with error-free code, without using a spider. The main purpose of the Scrapy shell is to test the extraction code, i.e. XPath or CSS expressions. It also helps to specify the web pages from which you are scraping data.
The shell can be configured by installing the IPython console (used for interactive computing), which is a powerful interactive shell providing auto-completion, colorized output, and more.
If you are working on a Unix platform, then it is better to install IPython. You can also use bpython, if IPython is inaccessible.
You can configure the shell by setting the environment variable called SCRAPY_PYTHON_SHELL, or by defining the scrapy.cfg file as follows:
[settings]
shell = bpython
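Alternatively, the same choice can be made through the environment variable, for example by running export SCRAPY_PYTHON_SHELL=ipython in a Unix shell before starting Scrapy.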
The Scrapy shell can be launched using the following command:
scrapy shell <url>
The url specifies the URL for which the data needs to be scraped.
The shell provides some additional shortcuts and Scrapy objects, as shown in the following tables.
The shell provides the following shortcuts, available within a project:
Sr.No | Shortcut & Description
---|---
1 | shelp(): Provides the available objects and shortcuts with the help option.
2 | fetch(request_or_url): Collects the response from the request or URL, and the associated objects will be updated properly.
3 | view(response): Lets you view the response for the given request in the local browser for observation; to display external links correctly, it appends a base tag to the response body.
The shell provides the following Scrapy objects, available within a project:
Sr.No | Object & Description
---|---
1 | crawler: Specifies the current crawler object.
2 | spider: Specifies the spider that can handle the current URL, or a new default Spider object if no spider is found for the current URL.
3 | request: Specifies the request object of the last fetched page.
4 | response: Specifies the response object of the last fetched page.
5 | settings: Provides the current Scrapy settings.
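As a quick illustrative sketch (assuming the shell has already fetched a page, so that request and response are populated), these objects can be queried directly; the exact output depends on your project:
>> request.url                  # URL of the last request made
>> response.status              # HTTP status code of the last response
>> settings.get('BOT_NAME')     # look up any active Scrapy setting
>> crawler.spider is spider     # the crawler ties these objects together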
Let us try scraping the scrapy.org website, and then begin scraping data from reddit.com, as described below.
Before moving ahead, first we will launch the shell, as shown in the following command:
scrapy shell 'http://scrapy.org' --nolog
While using the above URL, Scrapy will display the available objects:
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://scrapy.org>
[s]   response   <200 http://scrapy.org>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   shelp()           Provides available objects and shortcuts with help option
[s]   fetch(req_or_url) Collects the response from the request or URL and associated
                        objects will get updated
[s]   view(response)    View the response for the given request
Next, begin working with the objects, as shown below:
>> response.xpath('//title/text()').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'
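The same title can also be extracted with a CSS expression; response.css is the CSS counterpart of response.xpath and supports the same extraction methods:
>> response.css('title::text').extract_first()
u'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'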
>> fetch("http://reddit.com")
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://reddit.com>
[s]   response   <200 https://www.reddit.com/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <DefaultSpider 'default' at 0x...>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>> response.xpath('//title/text()').extract()
[u'reddit: the front page of the internet']
>> request = request.replace(method="POST")
>> fetch(request)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
...
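Note that replace() does not modify the request in place; it returns a new Request object with the given attributes changed. A small sketch (the name get_request is only for illustration):
>> get_request = request.replace(method="GET")
>> get_request.method
'GET'
>> request.method
'POST'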
You can inspect the responses being processed by the spider, but only at the points where you are expecting to get that response.
For example:
import scrapy

class SpiderDemo(scrapy.Spider):
    name = "spiderdemo"
    start_urls = [
        "http://mysite.com",
        "http://mysite1.org",
        "http://mysite2.net",
    ]

    def parse(self, response):
        # You can inspect one specific response
        if ".net" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)
As shown in the above code, you can invoke the shell from within the spider to inspect the responses, using the following function:
scrapy.shell.inspect_response
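When you have finished inspecting a response in the embedded shell, you can press Ctrl-D (or Ctrl-Z in Windows) to exit it and let the crawl resume.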
Now run the spider and you will get the following output:
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite.com> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite1.org> (referer: None)
2016-02-08 18:15:20-0400 [scrapy] DEBUG: Crawled (200) <GET http://mysite2.net> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
...
>> response.url
'http://mysite2.net'
You can check whether the extraction code is working, using the following code:
>> response.xpath('//div[@class = "val"]')
It displays the output as:
[]
The above line has displayed only a blank output. Now you can invoke the shell to inspect the response, as follows:
>> view(response)
It displays the response as:
True
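A blank result like the one above often means the selector is stricter than the actual markup (for example, an exact class match on an element that carries several classes), or that the content is injected by JavaScript and therefore absent from the raw response. As an illustrative sketch, a more forgiving XPath matches any div whose class attribute merely contains "val":
>> response.xpath('//div[contains(@class, "val")]')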