How to use Scrapy Items?

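In this article, we will scrape Quotes data from the webpage https://quotes.toscrape.com/tag/reading/ using Scrapy Items. The main objective of scraping is to prepare structured data from unstructured sources. Scrapy Items are wrappers around dictionary data structures. Code can be written so that the extracted data is returned as Item objects, in the form of key-value pairs. Using Scrapy Items is beneficial in the following cases:

As the volume of scraped data grows, plain dictionaries become hard to handle consistently.
As the data gets more complex, dictionary keys are prone to typos, which can lead to faulty data being returned.
Formatting the scraped data is easier, because Item objects can be passed on to Item Pipelines.
Cleaning the data is easier if we scrape the data as Items.
Validating data and handling missing data are easier with Scrapy Items.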

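Through the itemadapter library, Scrapy supports various item types, so you can choose whichever type suits your project. The supported item types are:

Dictionaries: items can be written as plain dictionary objects. They are convenient to use.
Item objects: they provide a dictionary-like API, in which we declare the fields the Item needs. An Item consists of key-value pairs, whose keys are the Field objects declared in the Item class. In this tutorial, we use Item objects.
Dataclass objects: they let you define item classes with named fields, together with the data type and default value of each field. The scraped values can then be exported to JSON or CSV files.
attr.s: attr.s allows defining item classes with field names, so that the scraped data can be exported to different file formats. They work much like Dataclass objects, except that the attrs package needs to be installed.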
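
To make the Dataclass option more concrete, here is a small sketch using only the standard library. The class name QuoteDataclassItem, the defaults and the sample values are assumptions for illustration; the field names mirror the ones used later in this tutorial.

```python
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class QuoteDataclassItem:
    # Each field declares its type and a default value
    quotetitle: Optional[str] = None
    author: List[str] = field(default_factory=list)
    tags: List[str] = field(default_factory=list)


# Fields that were not scraped simply keep their defaults
item = QuoteDataclassItem(quotetitle="A sample quote", author=["Jane Doe"])
print(asdict(item))
# {'quotetitle': 'A sample quote', 'author': ['Jane Doe'], 'tags': []}
```

A spider could yield such an object in place of a dictionary, since dataclass objects are one of the supported item types.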

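Installing the Scrapy library

At the time of writing, the Scrapy library requires Python 3.6 or later (newer Scrapy releases have raised this minimum). Install the Scrapy library by executing the following command at the terminal: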

pip install Scrapy

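This command installs the Scrapy library in the project environment. Now, we can create a Scrapy project to write the Spider code.

Create a Scrapy project

Scrapy has an efficient command-line tool, also called the 'Scrapy tool'. Its commands accept different sets of arguments and options, depending on their purpose. To write the Spider code, we begin by creating a Scrapy project, by executing the following command at the terminal: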

scrapy startproject gfg_spiderreadingitems

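Output:

Scrapy 'startproject' command to create the Spider project

This should create a folder in your current directory. It contains 'scrapy.cfg', which is the configuration file of the project. The folder structure is shown below:

The folder structure of 'gfg_spiderreadingitems'

The folder that contains scrapy.cfg is the root directory of the project. The structure of the folder created inside it is as follows:

The 'items.py' file inside the 'gfg_spiderreadingitems' folder

The folder contains items.py, middlewares.py and other settings files, along with the 'spiders' folder. The crawling code will be written in a spider Python file. Later on, we will edit the 'items.py' file to declare the data items to be extracted. For now, leave the contents of 'items.py' as they are.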

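Spider code for extracting data

The code for web scraping is written in the spider code file. To create the spider file, we will make use of the 'genspider' command. Note that this command is executed at the same level where the scrapy.cfg file is present.

We are scraping the reading quotes present on the https://quotes.toscrape.com/tag/reading/ webpage. Hence, we will run the command as: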

scrapy genspider spider_name url_to_be_scraped

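Use the 'genspider' command to create the Spider file

The above command creates a spider file 'gfg_spiitemsread.py' in the 'spiders' folder. The spider name is also 'gfg_spiitemsread'. The default code generated for it is as follows: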

Python3

# Import the required libraries
import scrapy
 
# Spider Class Created
 
 
class GfgSpiitemsreadSpider(scrapy.Spider):
    # Name of the spider
    name = 'gfg_spiitemsread'
    # The domain to be scraped
    allowed_domains = ['quotes.toscrape.com/tag/reading/']
    # The URLs from domain to scrape
    start_urls = ['http://quotes.toscrape.com/tag/reading//']
 
    # Spider default callback function
    def parse(self, response):
        pass

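We are going to scrape the quote title, author and tags from the webpage https://quotes.toscrape.com/tag/reading/. Scrapy provides us with Selectors to 'select' parts of the webpage. Selectors are CSS or XPath expressions, written to extract data from HTML documents. In this tutorial, we will use XPath expressions to select the details we need. Let us go through the steps for writing the selector syntax in the spider code.

The default callback method present in the spider class is parse(), which is responsible for processing the response received. We will write the selectors, with XPath expressions, in this method to extract the data.
To choose the elements to extract, right-click on the webpage and select the Inspect option. This lets us view the HTML attributes of the elements.
When we right-click on the first Quote and choose Inspect, we can see that it has the 'class' attribute 'quote'. Similarly, all the quotes on the webpage have the 'class' attribute 'quote'. This is shown below:

Right-click the first quote and check its 'class' attribute

Based on this, the XPath expression can be written as:

quotes = response.xpath('//*[@class="quote"]'). This expression fetches all elements whose 'class' attribute is 'quote'.
We will fetch the quote title, author and tags of every quote. Hence, we write the XPath expressions for extracting them inside a loop. For the quote title, the 'class' attribute is 'text'. Hence, the XPath expression is quote.xpath('.//*[@class="text"]/text()').extract_first(). The text() node test extracts the text of the quote title. The extract_first() method returns the first value that matches the 'class' attribute 'text', or None if nothing matches. The leading dot '.' makes the expression relative to the current quote element, so the data is extracted from a single quote rather than from the whole page.
Similarly, the 'class' and 'itemprop' attributes of the author element are both 'author'. We can use either of them in the XPath expression. The syntax is quote.xpath('.//*[@itemprop="author"]/text()').extract(). This extracts the author name, where the 'itemprop' attribute is 'author'.
The 'class' and 'itemprop' attributes of the tags element are both 'keywords'. We can use either of them in the XPath expression. Since a quote can have many tags, looping through them would be tedious. Hence, we extract the 'content' attribute from every quote, which holds all of its tags as one comma-separated string. The XPath expression is quote.xpath('.//*[@itemprop="keywords"]/@content').extract(). This extracts all the tag values of the quote from the 'content' attribute.
We use the 'yield' statement to return the data. Scrapy collects the yielded data, which can then be exported to CSV, JSON and other file formats.
With the code written so far, the spider will crawl the webpage and extract the data from it.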

The code is as follows:

Python3

# Import the required library
import scrapy
 
# The Spider class
class GfgSpiitemsreadSpider(scrapy.Spider):
    # Name of the spider
    name = 'gfg_spiitemsread'
     
    # The domain allowed to scrape
    allowed_domains = ['quotes.toscrape.com']
     
    # The URL to be scraped
    start_urls = ['http://quotes.toscrape.com/tag/reading/']
     
    # Default callback function
    def parse(self, response):
         
        # Fetch all quotes tags
        quotes = response.xpath('//*[@class="quote"]')
         
        # Loop through the Quote selector elements
        # to get details of each
        for quote in quotes:
             
            # XPath expression to fetch text of the Quote title
            title = quote.xpath('.//*[@class="text"]/text()').extract_first()
             
            # XPath expression to fetch author of the Quote
            authors = quote.xpath('.//*[@itemprop="author"]/text()').extract()
             
            # XPath expression to fetch Tags of the Quote
            tags = quote.xpath('.//*[@itemprop="keywords"]/@content').extract()
             
            # Yield all elements
            yield {"Quote Text": title, "Authors": authors, "Tags": tags}

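The crawl command is used to run the spider. It takes the spider name (the value of the 'name' attribute), not the file name. Run the above code with the crawl command as follows:

scrapy crawl gfg_spiitemsread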

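Output:

Quotes scraped, as specified in the 'yield' statement

Here, the yield statement returns the data as Python dictionary objects.

Understanding Python dictionaries and Scrapy Items

The data yielded above are Python dictionary objects. The advantages of using them are:

When the data volume is small, their key-value structure is convenient and easy to handle.
Use them when no further processing or formatting of the scraped data is required.
Use dictionaries when the data you want to scrape is complete and simple.

To use Item objects, we will make changes in the following files:

The items.py file
The spider class file generated earlier, gfg_spiitemsread.py

Use Scrapy Items to collect data

Now, we will learn how to write a Scrapy Item for Quotes. To do so, we will follow the steps mentioned below:

Open the items.py file. It is present at the same level as the 'spiders' folder. Declare the fields we need to extract in the file, as shown below: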

Python3

# Define here the models for your scraped
# items
# Import the required library
import scrapy
 
# Define the fields for Scrapy item here
# in class
class GfgSpiderreadingitemsItem(scrapy.Item):
     
    # Item key for Title of Quote
    quotetitle = scrapy.Field()
     
    # Item key for Author of Quote
    author = scrapy.Field()
     
    # Item key for Tags of Quote
    tags = scrapy.Field()

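As seen in the file above, we have defined a Scrapy Item named 'GfgSpiderreadingitemsItem'. This class is the blueprint for all the elements we will scrape. It holds three fields: the quote title, the author name and the tags. From now on, we can set only the fields declared in the class; assigning to any other key raises a KeyError.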

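The Field class is essentially the built-in dict class. It provides a way to define all the metadata of a field in one place, and it adds no extra functionality or attributes.

Now modify the spider file so that the values are stored in an object of the Item class from the items file, instead of being yielded directly. Note that you need to import the Item class, as shown in the code below.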

Python3

# Import the required library
import scrapy
 
# Import the Item class with fields
# mentioned in the items.py file
from ..items import GfgSpiderreadingitemsItem
 
 
class GfgSpiitemsreadSpider(scrapy.Spider):
    name = 'gfg_spiitemsread'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/tag/reading/']
 
    def parse(self, response):
       
        # Write XPath expression to loop through
        # all quotes
        quotes = response.xpath('//*[@class="quote"]')
         
        # Loop through all quotes
        for quote in quotes:
             
            # Create an object of Item class
            item = GfgSpiderreadingitemsItem()
             
            # XPath expression to fetch text of the
            # Quote title Store the title in the class
            # attribute in key-value pair
            item['quotetitle'] = quote.xpath(
                './/*[@class="text"]/text()').extract_first()
             
            # XPath expression to fetch author of the Quote
            # Store the author in the class attribute in
            # key-value pair
            item['author'] = quote.xpath(
                './/*[@itemprop="author"]/text()').extract()
             
            # XPath expression to fetch tags of the Quote title
            # Store the tags in the class attribute in key-value
            # pair
            item['tags'] = quote.xpath(
                './/*[@itemprop="keywords"]/@content').extract()
             
            # Yield the item object
            yield item

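As seen above, the data scraped by the XPath expressions is now collected under the keys declared in the Item class. Make sure the exact key names are used in both places. For example, use item['author'] when 'author' is the key defined in the items.py file.

The items yielded at the terminal are shown below:

Data extracted from the webpage using Scrapy Items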