📅  Last modified: 2023-12-03 15:34:52.242000             🧑  Author: Mango
Scrapy is a fast, powerful open-source web crawling framework that lets developers scrape data from a variety of sources quickly and easily. One of its most useful features is data extraction with XPath selectors. In this article, we will show how to use Scrapy and XPath to navigate to the next page of a paginated list via a rel="next" link.
You will need the following tools to follow this tutorial:

- Python 3 with the Scrapy package installed (pip install Scrapy)
- A terminal and a code editor

The first step is to create a Scrapy project. You can create a new project using the scrapy startproject command:
scrapy startproject scrapy_xpath_rel_next
Next, create a spider using the scrapy genspider command. In this example, we will create a spider named quotes to scrape quotes from the Quotes to Scrape website:
scrapy genspider quotes quotes.toscrape.com
Open the quotes.py file in your favorite code editor and define the spider settings. You can set the start_urls and allowed_domains attributes in the __init__ method (remember to call the parent class's __init__ first):
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = ['http://quotes.toscrape.com/page/1/']
        self.allowed_domains = ['quotes.toscrape.com']
To extract data from the first page, we define a parse method in the spider. We use XPath selectors to extract the quotes and authors from the HTML response:
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    # Define start_urls and allowed_domains as shown above

    def parse(self, response):
        # Extract each quote and its author using XPath selectors
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'quote': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//span/small/text()').get(),
            }
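As a quick sanity check outside Scrapy, the same XPath expressions happen to fall within the small XPath subset supported by Python's standard-library xml.etree.ElementTree, so they can be tried against a minimal, hypothetical HTML fragment (the real page's markup is richer; Scrapy's lxml-based selectors are also more forgiving of non-well-formed HTML):

```python
# Sketch: testing the tutorial's XPath expressions on a tiny,
# hypothetical fragment using only the standard library.
import xml.etree.ElementTree as ET

html = """
<html><body>
  <div class="quote">
    <span class="text">A witty quote.</span>
    <span>by <small class="author">Jane Doe</small></span>
  </div>
</body></html>
"""

root = ET.fromstring(html)
quotes = []
for quote in root.findall('.//div[@class="quote"]'):
    quotes.append({
        # Same paths as in the spider, via ElementTree's XPath subset
        'quote': quote.find('.//span[@class="text"]').text,
        'author': quote.find('.//span/small').text,
    })
print(quotes)
```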
To navigate to the next page of quotes, we need to extract the URL of the rel="next" link. We can do this using the response.css method:
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    # Define start_urls and allowed_domains as shown above

    def parse(self, response):
        # Extract each quote and its author using XPath selectors
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'quote': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//span/small/text()').get(),
            }

        # Extract the (relative) URL of the next page from the "next" link
        next_page_url = response.css('li.next a::attr(href)').get()
Once we have extracted the URL of the next page, we can yield a scrapy.Request to follow the link to the next page:
class QuotesSpider(scrapy.Spider):
    name = 'quotes'

    # Define start_urls and allowed_domains as shown above

    def parse(self, response):
        # Extract each quote and its author using XPath selectors
        quotes = response.xpath('//div[@class="quote"]')
        for quote in quotes:
            yield {
                'quote': quote.xpath('.//span[@class="text"]/text()').get(),
                'author': quote.xpath('.//span/small/text()').get(),
            }

        # Extract the (relative) URL of the next page from the "next" link
        next_page_url = response.css('li.next a::attr(href)').get()

        # Follow the next page; response.urljoin resolves the
        # relative href against the current page URL
        if next_page_url:
            yield scrapy.Request(url=response.urljoin(next_page_url),
                                 callback=self.parse)
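Note that the href extracted from the page is usually relative (e.g. /page/2/), so it must be resolved against the current page URL before it is requested; Scrapy's response.urljoin does this, and it behaves like urllib.parse.urljoin from the standard library. A quick sketch (the example values are illustrative):

```python
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com/page/1/'
next_href = '/page/2/'  # typical value returned by the href selector

# Resolve the relative href against the page it was found on
absolute = urljoin(page_url, next_href)
print(absolute)  # → http://quotes.toscrape.com/page/2/
```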
To run the spider, use the scrapy crawl command followed by the spider name:

scrapy crawl quotes

To save the scraped items to a file, you can also pass the -o option, for example: scrapy crawl quotes -o quotes.json
In this tutorial, we showed how to use Scrapy and XPath selectors to navigate to the next page of a list via its rel="next" link. By following these steps, you should be able to extract data from every page of a paginated list with Scrapy.