📅  Last modified: 2020-10-31 14:38:53             🧑  Author: Mango
To extract data from web pages, Scrapy uses a technique called selectors, based on XPath and CSS expressions. Following are some examples of XPath expressions −
/html/head/title − This will select the <title> element inside the <head> element of an HTML document.

/html/head/title/text() − This will select the text within the same <title> element.

//td − This will select all the <td> elements.

//div[@class="slice"] − This will select all div elements that contain the attribute class="slice".
Selectors have four basic methods, as shown in the following table −
| Sr.No | Method & Description |
|---|---|
| 1 | **extract()** It returns a unicode string along with the selected data. |
| 2 | **re()** It returns a list of unicode strings, extracted when the regular expression is given as an argument. |
| 3 | **xpath()** It returns a list of selectors, which represents the nodes selected by the XPath expression given as an argument. |
| 4 | **css()** It returns a list of selectors, which represents the nodes selected by the CSS expression given as an argument. |
To demonstrate the selectors with the built-in Scrapy shell, you need to have IPython installed on your system. The important thing here is that, while running Scrapy, the URL should be enclosed in quotes; otherwise, URLs containing the '&' character will not work. You can start the shell by using the following command in the project's top-level directory −
scrapy shell "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
A shell will look like the following −
[ ... Scrapy log here ... ]
2014-01-23 17:11:42-0400 [scrapy] DEBUG: Crawled (200)
<GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x...>
[s]   item       {}
[s]   request    <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   response   <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
[s]   settings   <scrapy.settings.Settings object at 0x...>
[s]   spider     <Spider 'default' at 0x...>
[s] Useful shortcuts:
[s] shelp() Shell help (print this help)
[s] fetch(req_or_url) Fetch request (or URL) and update local objects
[s] view(response) View response in a browser
In [1]:
When the shell loads, you can access the body or the headers by using response.body and response.headers respectively. Similarly, you can run queries on the response using response.selector.xpath() or response.selector.css().
For example −
In [1]: response.xpath('//title')
Out[1]: [<Selector xpath='//title' data=u'<title>My Book - Scrapy'>]
In [2]: response.xpath('//title').extract()
Out[2]: [u'<title>My Book - Scrapy: Index: Chapters</title>']
In [3]: response.xpath('//title/text()')
Out[3]: [<Selector xpath='//title/text()' data=u'My Book - Scrapy: Index:'>]
In [4]: response.xpath('//title/text()').extract()
Out[4]: [u'My Book - Scrapy: Index: Chapters']
In [5]: response.xpath('//title/text()').re('(\w+):')
Out[5]: [u'Scrapy', u'Index', u'Chapters']
To extract data from a normal HTML site, we have to inspect the source code of the site to obtain the XPaths. After inspecting, you can see that the data is inside a ul tag, so we select the elements within the li tags.
The following lines of code show the extraction of different types of data −
For selecting data within the li tag −
response.xpath('//ul/li')
For selecting descriptions −
response.xpath('//ul/li/text()').extract()
For selecting site titles −
response.xpath('//ul/li/a/text()').extract()
For selecting site links −
response.xpath('//ul/li/a/@href').extract()
The following code demonstrates the use of the above extractors −
import scrapy

class MyprojectSpider(scrapy.Spider):
    name = "project"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            title = sel.xpath('a/text()').extract()
            link = sel.xpath('a/@href').extract()
            desc = sel.xpath('text()').extract()
            print(title, link, desc)