How to parse PDF pages online using Scrapy?

Prerequisites: Scrapy, PyPDF2, urllib

In this article, we will use Scrapy to parse any online PDF without saving it to the system. To do this, we use a PDF parsing and editing library for Python called PyPDF2.

PyPDF2 is a PDF parsing library for Python. It provides reader and writer classes, among others, for modifying, editing, and parsing PDFs, whether online or offline.

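The constructors of the PyPDF2 classes take a PDF file stream. Since we only have the URL of the PDF file, we need to open that URL and turn its contents into a file stream. For this we use Python's urllib module: calling urlopen() on the URL found by the spider returns a response object whose bytes can be read and wrapped in a stream.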
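The URL-to-stream step described above can be sketched with the standard library alone. This is a minimal illustration, not part of the original article: the helper name `fetch_as_stream` is hypothetical, and it assumes the whole file fits in memory. The bytes are copied into `io.BytesIO` because an HTTP response object is generally not seekable, while a PDF reader needs to seek within the file.

```python
import io
import urllib.request


def fetch_as_stream(url: str) -> io.BytesIO:
    """Read the resource at `url` into a seekable in-memory stream.

    Hypothetical helper for illustration; the name is not from the article.
    """
    # urlopen() returns a response object that can be read like a file
    with urllib.request.urlopen(url) as response:
        data = response.read()
    # BytesIO wraps the raw bytes in a file-like object that supports seek()
    return io.BytesIO(data)
```

The returned object can then be passed to the PyPDF2 reader constructor in place of an open file.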

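Example 1: We perform some basic operations: getting the number of pages and checking whether the file is encrypted. To do this, we resolve the PDF's URL, fetch the response, and then read the page count and encryption status with numPages and isEncrypted (len(reader.pages) and is_encrypted in PyPDF2 3.x).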

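The Scrapy spider crawls the web page and finds the PDF file to scrape. It stores that file's address in a separate variable, URL, opens the URL with urllib, and creates a PyPDF2 reader object by passing the resulting stream to the reader's constructor.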
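Links on an index page are often relative, so the spider must turn them into absolute URLs before urllib can open them. As a rough sketch of what that resolution step does, the standard library's `urljoin` behaves as follows. The file name `1-web.pdf` is a hypothetical placeholder, not taken from the article.

```python
from urllib.parse import urljoin

# Page that lists the PDFs (the start URL used in this article)
page_url = "https://codex.cs.yale.edu/avi/os-book/OS9/practice-exer-dir/index.html"

# Hypothetical relative href, as it might appear in one of the page's <a> tags
href = "1-web.pdf"

# The relative link is resolved against the directory of the page URL
pdf_url = urljoin(page_url, href)
print(pdf_url)

# An href that is already absolute is returned unchanged
absolute = urljoin(page_url, "https://example.com/a.pdf")
print(absolute)
```

Inside a spider, Scrapy's `response.urljoin(href)` does the same job using the URL of the response.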

Python3

import io
import urllib.request

import PyPDF2
import scrapy


class ParserspiderSpider(scrapy.Spider):

    name = 'parserspider'

    # Index page linking to the pdf files: practice-exercise solutions
    # for the operating systems book by Abraham Silberschatz
    start_urls = ['https://codex.cs.yale.edu/avi/'
                  'os-book/OS9/practice-exer-dir/index.html']

    # default parse method
    def parse(self, response):

        # getting the list of URLs of the pdfs
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # extracting the first URL and making it absolute
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL with urllib and reading its bytes into an
        # in-memory stream for the reader
        File = urllib.request.urlopen(URL)
        reader = PyPDF2.PdfReader(io.BytesIO(File.read()))

        # accessing some descriptions of the pdf file
        # (PyPDF2 1.x names: PdfFileReader, numPages, isEncrypted)
        print("This is the number of pages: " + str(len(reader.pages)))
        print("Is file Encrypted? " + str(reader.is_encrypted))

Output:

The number of pages in the PDF is printed first, followed by whether the file is encrypted.

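Example 2: In this example, we extract (parse) the text of the PDF file. The same PyPDF2 reader object can also be used with the other methods mentioned above to make any required changes to the file. Here we print the extracted text to the terminal.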

Python3

import io
import urllib.request

import PyPDF2
import scrapy


class ParserspiderSpider(scrapy.Spider):

    name = 'parserspider'

    # Index page linking to the pdf files.
    start_urls = ['https://codex.cs.yale.edu/avi/'
                  'os-book/OS9/practice-exer-dir/index.html']

    # default parse method
    def parse(self, response):

        # getting the list of URLs of the pdfs
        pdfs = response.xpath('//tr[3]/td[2]/a/@href')

        # extracting the first URL and making it absolute
        URL = response.urljoin(pdfs[0].extract())

        # opening the URL with urllib and reading its bytes into an
        # in-memory stream for the reader
        File = urllib.request.urlopen(URL)
        reader = PyPDF2.PdfReader(io.BytesIO(File.read()))

        # collecting the text of every page
        # (PyPDF2 1.x names: PdfFileReader, extractText)
        data = ""
        for page in reader.pages:
            data += page.extract_text()

        print(data)

Output:

The parsed text of the PDF.