A Spider is the class that defines how to follow links through a website and extract information from its pages.
Scrapy's default spiders are described below.
scrapy.Spider is the spider from which every other spider must inherit. It has the following class −
class scrapy.spiders.Spider
The following table shows the fields of the scrapy.Spider class −
Sr.No | Field & Description |
---|---|
1 | `name`: The name of your spider. |
2 | `allowed_domains`: A list of domains the spider is allowed to crawl. |
3 | `start_urls`: A list of URLs from which the spider begins to crawl; they are the roots for later crawls. |
4 | `custom_settings`: Settings that override the project-wide configuration when the spider is run. |
5 | `crawler`: An attribute linking to the Crawler object to which the spider instance is bound. |
6 | `settings`: The settings used for running the spider. |
7 | `logger`: A Python logger used to send log messages. |
8 | `from_crawler(crawler, *args, **kwargs)`: A class method that creates your spider; crawler is the Crawler the spider will be bound to, and args and kwargs are passed on to the __init__() method. |
9 | `start_requests()`: Called by Scrapy when the spider is opened for scraping and no particular URLs are specified. |
10 | `make_requests_from_url(url)`: A method used to convert URLs to requests. |
11 | `parse(response)`: The default callback; it processes the response and returns scraped data along with more URLs to follow. |
12 | `log(message[, level, component])`: Sends a log message through the spider's logger. |
13 | `closed(reason)`: Called when the spider closes. |
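To see how these fields fit together, the following is a minimal sketch of a plain Spider; the domain, start URL, and XPath are assumed purely for illustration −
import scrapy

class BasicSpider(scrapy.Spider):
    name = "basic"
    allowed_domains = ["www.example.com"]
    start_urls = ["http://www.example.com/"]

    def parse(self, response):
        # parse() is the default callback: it can yield scraped data and/or further requests
        self.logger.info("Visited %s", response.url)
        for href in response.xpath("//a/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse)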
Spider arguments are used to specify start URLs and are passed with the crawl command using the -a option, as shown below −
scrapy crawl first_scrapy -a group=accessories
The following code demonstrates how a spider receives arguments −
import scrapy

class FirstSpider(scrapy.Spider):
    name = "first"

    def __init__(self, group=None, *args, **kwargs):
        super(FirstSpider, self).__init__(*args, **kwargs)
        self.start_urls = ["http://www.example.com/group/%s" % group]
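With the command shown above, the value accessories is passed to __init__() as group, so this spider starts crawling from http://www.example.com/group/accessories.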
You can use the generic spiders as base classes for your own spiders. Their aim is to follow all links on a website according to certain rules and extract data from all of the pages.
For the examples used in the following spiders, let us assume we have a project with the following item −
import scrapy
from scrapy.item import Item, Field

class First_scrapyItem(scrapy.Item):
    product_title = Field()
    product_link = Field()
    product_description = Field()
CrawlSpider defines a set of rules to follow links and scrape more than one page. It has the following class −
class scrapy.spiders.CrawlSpider
Following are the attributes of the CrawlSpider class −
rules: It is a list of Rule objects that defines how the crawler follows links.
The following table shows the rules of the CrawlSpider class −
Sr.No | Rule & Description |
---|---|
1 | `LinkExtractor`: Specifies how the spider follows links and extracts data. |
2 | `callback`: Called after each page is scraped. |
3 | `follow`: Specifies whether to continue following links or not. |
parse_start_url(response): It allows the initial responses to be parsed and returns either an item or a request object.
Note − While writing rules, make sure you rename your parse function to something other than parse, because CrawlSpider uses the parse method to implement its own logic.
Let's take a look at the following example, where the spider starts crawling demoexample.com's home page, collecting all pages and links, and parsing the data with the parse_item method −
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from demoproject.items import DemoItem

class DemoSpider(CrawlSpider):
    name = "demo"
    allowed_domains = ["www.demoexample.com"]
    start_urls = ["http://www.demoexample.com"]

    rules = (
        Rule(LinkExtractor(allow=(), restrict_xpaths=("//div[@class='next']",)),
            callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        item = DemoItem()
        item["product_title"] = response.xpath("a/text()").extract()
        item["product_link"] = response.xpath("a/@href").extract()
        item["product_description"] = response.xpath("div[@class='desc']/text()").extract()
        return item
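In this rule, restrict_xpaths limits link extraction to the region matched by the XPath (here, the div with class 'next'), callback="parse_item" hands every matched page to the parse_item() method, and follow=True tells the spider to keep applying the rule to the pages it discovers.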
XMLFeedSpider is the base class for spiders that scrape XML feeds and iterate over their nodes. It has the following class −
class scrapy.spiders.XMLFeedSpider
The following table shows the class attributes used to set the iterator and the tag name −
Sr.No | Attribute & Description |
---|---|
1 | `iterator`: Defines the iterator to be used. It can be iternodes, html, or xml. Default is iternodes. |
2 | `itertag`: A string with the name of the node to iterate over. |
3 | `namespaces`: A list of (prefix, uri) tuples that automatically register namespaces using the register_namespace() method. |
4 | `adapt_response(response)`: Receives the response and modifies the response body as soon as it arrives from the spider middleware, before the spider starts parsing it. |
5 | `parse_node(response, selector)`: Receives the response and a selector when called for each node matching the provided tag name. Note − Your spider won't work if you don't override this method. |
6 | `process_results(response, results)`: Called with the results returned by the spider and the corresponding response, to perform any last-time processing on the list of results. |
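The project above does not include an XMLFeedSpider example, so the following is only a minimal sketch of how these attributes fit together; the feed URL and the <product> node layout are assumed for illustration, and DemoItem is the item class imported in the CSVFeedSpider example below −
from scrapy.spiders import XMLFeedSpider
from demoproject.items import DemoItem

class DemoXMLSpider(XMLFeedSpider):
    name = "demo_xml"
    allowed_domains = ["www.demoexample.com"]
    start_urls = ["http://www.demoexample.com/feed.xml"]   # assumed feed location
    iterator = "iternodes"   # the default, fast node iterator
    itertag = "product"      # parse_node() is called for every <product> node

    def parse_node(self, response, node):
        # node is a selector scoped to a single <product> element
        item = DemoItem()
        item["product_title"] = node.xpath("title/text()").extract_first()
        item["product_link"] = node.xpath("link/text()").extract_first()
        item["product_description"] = node.xpath("description/text()").extract_first()
        return item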
CSVFeedSpider iterates through the rows of a CSV file received as the response and calls the parse_row() method for each row. It has the following class −
class scrapy.spiders.CSVFeedSpider
The following table shows the options that can be set regarding the CSV file −
Sr.No | Option & Description |
---|---|
1 | `delimiter`: A string with the separator character for each field; defaults to a comma (','). |
2 | `quotechar`: A string with the quote character for each field; defaults to a double quote ('"'). |
3 | `headers`: A list of the column names in the feed, from which the fields can be extracted. |
4 | `parse_row(response, row)`: Receives the response and each row as a dict keyed by the headers. |
The CSVFeedSpider example below demonstrates these options −
from scrapy.spiders import CSVFeedSpider
from demoproject.items import DemoItem

class DemoSpider(CSVFeedSpider):
    name = "demo"
    allowed_domains = ["www.demoexample.com"]
    start_urls = ["http://www.demoexample.com/feed.csv"]
    delimiter = ";"
    quotechar = "'"
    headers = ["product_title", "product_link", "product_description"]

    def parse_row(self, response, row):
        self.logger.info("This is row: %r", row)
        item = DemoItem()
        item["product_title"] = row["product_title"]
        item["product_link"] = row["product_link"]
        item["product_description"] = row["product_description"]
        return item
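Each row reaches parse_row() as a dict keyed by the configured headers, so row["product_title"] holds the first field of the corresponding line in feed.csv.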
With the help of sitemaps, SitemapSpider crawls a website by locating URLs from robots.txt. It has the following class −
class scrapy.spiders.SitemapSpider
The following table shows the fields of SitemapSpider −
Sr.No | Field & Description |
---|---|
1 | `sitemap_urls`: A list of URLs pointing to the sitemaps you want to crawl. |
2 | `sitemap_rules`: A list of (regex, callback) tuples, where regex is a regular expression and callback is used to process the URLs matching that regular expression. |
3 | `sitemap_follow`: A list of regexes of the sitemaps to follow. |
4 | `sitemap_alternate_links`: Specifies whether alternate links for a single URL should be followed. |
The following SitemapSpider processes all the URLs −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]

    def parse(self, response):
        # you can scrape items here
        pass
The following SitemapSpider processes some URLs with a callback −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/sitemap.xml"]
    sitemap_rules = [
        ("/item/", "parse_item"),
        ("/group/", "parse_group"),
    ]

    def parse_item(self, response):
        # you can scrape item here
        pass

    def parse_group(self, response):
        # you can scrape group here
        pass
The following code follows only the sitemaps in robots.txt whose URL contains /sitemap_company −
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/robots.txt"]
    sitemap_rules = [
        ("/company/", "parse_company"),
    ]
    sitemap_follow = ["/sitemap_company"]

    def parse_company(self, response):
        # you can scrape company here
        pass
You can even combine SitemapSpider with other URLs, as shown in the following code.
import scrapy
from scrapy.spiders import SitemapSpider

class DemoSpider(SitemapSpider):
    sitemap_urls = ["http://www.demoexample.com/robots.txt"]
    sitemap_rules = [
        ("/company/", "parse_company"),
    ]
    other_urls = ["http://www.demoexample.com/contact-us"]

    def start_requests(self):
        requests = list(super(DemoSpider, self).start_requests())
        requests += [scrapy.Request(x, self.parse_other) for x in self.other_urls]
        return requests

    def parse_company(self, response):
        # you can scrape company here...
        pass

    def parse_other(self, response):
        # you can scrape other here...
        pass
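Here start_requests() first collects the requests that SitemapSpider generates from the sitemaps listed in robots.txt and then appends plain requests for the extra other_urls, each of which is handled by parse_other().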