A Spider is the class that defines how to follow links through a website and extract information from its pages.
Scrapy's default spiders are described below.
scrapy.Spider is the spider from which every other spider must inherit. It has the following class −
class scrapy.spiders.Spider
The following table shows the fields of the scrapy.Spider class −
Sr.No | Field & Description |
---|---|
1 | `name`: The name of your spider. |
2 | `allowed_domains`: A list of domains the spider is allowed to crawl. |
3 | `start_urls`: A list of URLs from which the spider begins to crawl; they are the roots for later crawls. |
4 | `custom_settings`: Settings that override the project-wide configuration when the spider is run. |
5 | `crawler`: An attribute linking to the Crawler object to which the spider instance is bound. |
6 | `settings`: The settings used for running the spider. |
7 | `logger`: A Python logger used to send log messages. |
8 | `from_crawler(crawler, *args, **kwargs)`: A class method that creates your spider; crawler is the Crawler the spider will be bound to, and args and kwargs are passed on to the __init__() method. |
9 | `start_requests()`: Called by Scrapy when the spider is opened for scraping and no particular URLs are specified. |
10 | `make_requests_from_url(url)`: A method used to convert URLs to requests. |
11 | `parse(response)`: The default callback; it processes the response and returns scraped data along with more URLs to follow. |
12 | `log(message[, level, component])`: Sends a log message through the spider's logger. |
13 | `closed(reason)`: Called when the spider closes. |
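To see how these fields fit together, the following is a minimal sketch of a plain Spider; the domain, start URL, and XPath are assumed purely for illustration −
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # parse() is the default callback: it can yield scraped data and/or further requests
        self.logger.info("Visited %s", response.url)
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)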
Spider arguments are used to specify start URLs and are passed with the crawl command using the -a option, as shown below −
scrapy crawl first_scrapy -a group=accessories
The following code demonstrates how a spider receives arguments −
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"

    def __init__(self, group=None, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        self.start_urls = ["http://www.example.com/group/%s" % group]
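With the command shown above, the value accessories is passed to __init__() as group, so this spider starts crawling from http://www.example.com/group/accessories.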
You can use the generic spiders as base classes for your own spiders. Their aim is to follow all links on a website according to certain rules and extract data from all of the pages.
For the examples used in the following spiders, let us assume we have a project with the following item −
import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
    product_title = Field()
    product_link = Field()
    product_description = Field()
CrawlSpider defines a set of rules to follow links and scrape more than one page. It has the following class −
class scrapy.spiders.CrawlSpider
Following are the attributes of the CrawlSpider class −
rules: It is a list of Rule objects that defines how the crawler follows links.
The following table shows the rules of the CrawlSpider class −
Sr.No | Rule & Description |
---|---|
1 | `LinkExtractor`: Specifies how the spider follows links and extracts data. |
2 | `callback`: Called after each page is scraped. |
3 | `follow`: Specifies whether to continue following links or not. |
parse_start_url(response): It allows the initial responses to be parsed and returns either an item or a request object.
Note − While writing rules, make sure you rename your parse function to something other than parse, because CrawlSpider uses the parse method to implement its own logic.
Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing the data with the parse_item method −
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
    name = "demo"
    allowed_domains = ["www.demoexample.com"]
    start_urls = ["http://www.demoexample.com"]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='next']",)),
            callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = DemoItem()
        item["product_title"] = response.xpath("a/text()").extract()
        item["product_link"] = response.xpath("a/@href").extract()
        item["product_description"] = response.xpath("div[@class='desc']/text()").extract()
        return item
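In this rule, restrict_xpaths limits link extraction to the region matched by the XPath (here, the div with class 'next'), callback="parse_item" hands every matched page to the parse_item() method, and follow=True tells the spider to keep applying the rule to the pages it discovers.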
XMLFeedSpider is the base class for spiders that scrape XML feeds and iterate over their nodes. It has the following class −
class scrapy.spiders.XMLFeedSpider
The following table shows the class attributes used to set the iterator and the tag name −
Sr.No | Attribute & Description |
---|---|
1 | `iterator`: Defines the iterator to be used. It can be iternodes, html, or xml. Default is iternodes. |
2 | `itertag`: A string with the name of the node to iterate over. |
3 | `namespaces`: A list of (prefix, uri) tuples that automatically register namespaces using the register_namespace() method. |
4 | `adapt_response(response)`: Receives the response and modifies the response body as soon as it arrives from the spider middleware, before the spider starts parsing it. |
5 | `parse_node(response, selector)`: Receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method. |
6 | `process_results(response, results)`: Called with the results returned by the spider and the corresponding response, to perform any last-time processing on the list of results. |
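The project above does not include an XMLFeedSpider example, so the following is only a minimal sketch of how these attributes fit together; the feed URL and the <product> node layout are assumed for illustration, and DemoItem is the item class imported in the CSVFeedSpider example below −
from scrapy.spiders import XMLFeedSpider
from demoproject.items import DemoItem

class DemoXMLSpider(XMLFeedSpider):
    name = "demo_xml"
    allowed_domains = ["www.demoexample.com"]
    start_urls = ["http://www.demoexample.com/feed.xml"]   # assumed feed location
    iterator = "iternodes"   # the default, fast node iterator
    itertag = "product"      # parse_node() is called for every <product> node

    def parse_node(self, response, node):
        # node is a selector scoped to a single <product> element
        item = DemoItem()
        item["product_title"] = node.xpath("title/text()").extract_first()
        item["product_link"] = node.xpath("link/text()").extract_first()
        item["product_description"] = node.xpath("description/text()").extract_first()
        return item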
CSVFeedSpider iterates through the rows of a CSV file received as the response and calls the parse_row() method for each row. It has the following class −
class scrapy.spiders.CSVFeedSpider
The following table shows the options that can be set regarding the CSV file −
Sr.No | Option & Description |
---|---|
1 | `delimiter`: A string with the separator character for each field; defaults to a comma (','). |
2 | `quotechar`: A string with the quote character for each field; defaults to a double quote ('"'). |
3 | `headers`: A list of the column names in the feed, from which the fields can be extracted. |
4 | `parse_row(response, row)`: Receives the response and each row as a dict keyed by the headers. |
The CSVFeedSpider example below demonstrates these options −
from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
    name = "demo"
    allowed_domains = ["www.demoexample.com"]
    start_urls = ["http://www.demoexample.com/feed.csv"]
    delimiter = ";"
    quotechar = "'"
    headers = ["product_title", "product_link", "product_description"]

    def parse_row(self, response, row):
        self.logger.info("This is row: %r", row)
        item = DemoItem()
        item["product_title"] = row["product_title"]
        item["product_link"] = row["product_link"]
        item["product_description"] = row["product_description"]
        return item
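Each row reaches parse_row() as a dict keyed by the configured headers, so row["product_title"] holds the first field of the corresponding line in feed.csv.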
With the help of sitemaps, SitemapSpider crawls a website by locating URLs from robots.txt. It has the following class −
class scrapy.spiders.SitemapSpider
The following table shows the fields of SitemapSpider −
Sr.No | Field & Description |
---|---|
1 | `sitemap_urls`: A list of URLs pointing to the sitemaps you want to crawl. |
2 | `sitemap_rules`: A list of (regex, callback) tuples, where regex is a regular expression and callback is used to process the URLs matching that regular expression. |
3 | `sitemap_follow`: A list of regexes of the sitemaps to follow. |
4 | `sitemap_alternate_links`: Specifies whether alternate links for a single URL should be followed. |
The following SitemapSpider processes all the URLs −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

    def parse(self, response):
        # you can scrape items here
        pass
The following SitemapSpider processes some URLs with a callback −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]
    sitemap_rules = [
        ("/item/", "parse_item"),
        ("/group/", "parse_group"),
    ]

    def parse_item(self, response):
        # you can scrape item here
        pass

    def parse_group(self, response):
        # you can scrape group here
        pass
The following code follows only the sitemaps in robots.txt whose URL contains /sitemap_company −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/robots.txt"]
    sitemap_rules = [
        ("/company/", "parse_company"),
    ]
    sitemap_follow = ["/sitemap_company"]

    def parse_company(self, response):
        # you can scrape company here
        pass
You can even combine SitemapSpider with other URLs, as shown in the following code.
import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/robots.txt"]
    sitemap_rules = [
        ("/company/", "parse_company"),
    ]
    other_urls = ["http://www.demoexample.com/contact-us"]

    def start_requests(self):
        requests = list(super(DemoSpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_company(self, response):
        # you can scrape company here...
        pass

    def parse_other(self, response):
        # you can scrape other here...
        pass
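Here start_requests() first collects the requests that SitemapSpider generates from the sitemaps listed in robots.txt and then appends plain requests for the extra other_urls, each of which is handled by parse_other().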