Downloading PDFs with Python using Requests and BeautifulSoup

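The BeautifulSoup object is provided by Beautiful Soup, a Python library for web scraping. Web scraping means extracting data from websites with automated tools instead of collecting it by hand. The BeautifulSoup object represents the parsed document as a whole. For most purposes, you can treat it as a Tag object.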
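As a small illustration of treating the BeautifulSoup object like a Tag, the sketch below parses an inline HTML string rather than a live page. The HTML snippet and variable names are made up for this example.

```python
from bs4 import BeautifulSoup

# A tiny, hypothetical HTML document used only for illustration.
html = "<html><body><h1>Reports</h1><a href='a.pdf'>First</a></body></html>"

# The BeautifulSoup object represents the whole parsed document.
soup = BeautifulSoup(html, "html.parser")

# It supports the same navigation and search calls as a Tag.
heading_text = soup.h1.get_text()
first_link = soup.find("a")

print(heading_text)
print(first_link["href"], first_link.get_text())
```

The same `find` and `find_all` calls work on any Tag inside the document, which is why the two can usually be used interchangeably.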

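The Requests library is a widely used third-party Python package for making HTTP requests to a given URL. Whether you are working with REST APIs or web scraping, you need to understand requests before going further with either. When you send a request to a URI, the server returns a response. Requests provides built-in features for managing both the request and the response.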
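The request and response handling described above might look like the following sketch. The helper name `fetch`, the 10-second timeout, and the example.com URL are assumptions made for illustration. The second half builds a request without sending it, so the snippet can be run without network access.

```python
import requests


def fetch(url, timeout=10):
    """Send a GET request and return the response, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response


# Build a request without sending it, to inspect what would be sent.
request = requests.Request("GET", "https://example.com/search", params={"q": "pdf"})
prepared = request.prepare()
print(prepared.method, prepared.url)
```

Calling `fetch(url)` returns a response object whose `text` attribute holds the decoded body and whose `content` attribute holds the raw bytes, which is what a PDF download needs.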

This article shows how to download PDFs in Python using BeautifulSoup and the Requests library. Together, the two can extract the information you need from a web page.

Approach:

To find the PDFs on a page and download them, follow these steps:

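Import the BeautifulSoup and requests libraries.
Request the URL and get the response object.
Find all hyperlinks present on the web page.
Check which of those links point to PDF files.
Use a response object to fetch each PDF file.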
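The link-finding steps above can be sketched on an inline HTML string, so nothing has to be downloaded. The function name `find_pdf_links`, the sample HTML, and the example.com base URL are hypothetical. Unlike a plain substring check, this version resolves relative links and looks only at the path of the URL, which is one possible design rather than the only one.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def find_pdf_links(html, base_url):
    """Return absolute URLs of all links in `html` whose path ends in .pdf."""
    soup = BeautifulSoup(html, "html.parser")
    pdf_urls = []
    # href=True skips <a> tags that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        # Resolve relative links such as "/docs/report.pdf" against the page URL.
        absolute = urljoin(base_url, anchor["href"])
        # Check only the path, so query strings do not hide the extension.
        if urlparse(absolute).path.lower().endswith(".pdf"):
            pdf_urls.append(absolute)
    return pdf_urls


sample_html = """
<a href="/docs/report.pdf">Report</a>
<a href="https://example.com/about">About</a>
<a href="files/Guide.PDF?download=1">Guide</a>
<a>No href</a>
"""

pdf_links = find_pdf_links(sample_html, "https://example.com/blog/post/")
print(pdf_links)
```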

Implementation:

Python3

# Import libraries
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

# URL of the page to scan for PDF links
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"

# Request the URL and get the response object
response = requests.get(url)

# Parse the HTML text of the response
soup = BeautifulSoup(response.text, 'html.parser')

# Find all hyperlinks present on the webpage
links = soup.find_all('a')

i = 0

# Check every link for a PDF and
# download the file if one is found
for link in links:
    href = link.get('href', '')
    if '.pdf' in href:
        i += 1
        print("Downloading file: ", i)

        # Resolve relative links against the page URL, then fetch the file
        pdf_response = requests.get(urljoin(url, href))

        # Write the binary content to pdf1.pdf, pdf2.pdf, ...
        with open("pdf" + str(i) + ".pdf", 'wb') as pdf:
            pdf.write(pdf_response.content)
        print("File ", i, " downloaded")

print("All PDF files downloaded")

Output:

Downloading file:  1
File  1  downloaded
All PDF files downloaded

The program above downloads the PDF files linked from the given URL and saves them as pdf1.pdf, pdf2.pdf, pdf3.pdf, and so on.
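Numbered names are simple but say nothing about the contents of each file. As an optional variation, not part of the program above, the file name could be taken from the link itself. The helper `filename_from_url` and its `fallback` parameter are hypothetical names chosen for this sketch.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(pdf_url, fallback="download.pdf"):
    """Derive a local file name from the last path segment of a PDF URL."""
    # Drop the query string and decode escapes such as %20.
    path = unquote(urlparse(pdf_url).path)
    # Keep only the final path component, never any directory part.
    name = PurePosixPath(path).name
    return name if name.lower().endswith(".pdf") else fallback


name_a = filename_from_url("https://example.com/docs/annual%20report.pdf?v=2")
name_b = filename_from_url("https://example.com/download?id=7")
print(name_a)
print(name_b)
```

Two links can share the same file name, so a real script would still need to handle collisions, for instance by appending the counter used in the original program.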