How to Scrape All PDF Files from a Website?
Prerequisite: Implementing Web Scraping in Python with BeautifulSoup
Web scraping is a technique for extracting data from a website so that the data can be used for other purposes. Several libraries and modules are available for web scraping in Python. In this article, we will learn how to scrape PDF files from a website with the help of beautifulsoup, one of the best web-scraping modules in Python, together with the requests module for GET requests. In addition, we use the PyPDF2 module to get more information about the PDF files.
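If the required libraries are not already installed, they can usually be obtained with pip (for example, pip install requests beautifulsoup4 PyPDF2); the exact versions are not prescribed by the original article.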
Step-by-step code –
Step 1: Import all the important modules and packages.
Python3
# for sending GET requests for the web page and the PDF files
import requests
# for tree-traversal scraping of the web page
from bs4 import BeautifulSoup
# for in-memory input and output operations
import io
# for getting information about the PDFs
from PyPDF2 import PdfFileReader
Step 2: Pass the URL and build an HTML parser with the help of BeautifulSoup.
Python3
# website to scrape
url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
# fetch the page with a GET request
read = requests.get(url)
# full html content of the page
html_content = read.content
# parse the html content
soup = BeautifulSoup(html_content, "html.parser")
In the above code:
- Scraping is done on the link https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/
- The requests module is used to make the GET request (a slightly more defensive variant is sketched after this list)
- read.content holds the full HTML code; printing it would output the source code of the web page.
- soup holds the parsed HTML content and is used to query the HTML
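As a small refinement that is not part of the original article, the GET request can be made a little more defensive by adding a timeout and checking the HTTP status before parsing; the snippet below is only a sketch of that idea and reuses the imports and the url variable from the previous steps.
Python3
# a more defensive fetch (optional sketch, not from the original article)
read = requests.get(url, timeout=10)
# raise an exception if the server returned an error status
read.raise_for_status()
# parse only after the request succeeded
soup = BeautifulSoup(read.content, "html.parser")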
Step 3: We need to traverse the PDFs present on the website.
Python3
# created an empty set for collecting the pdf links
list_of_pdf = set()
# accessed the first p tag in the html
l = soup.find('p')
# accessed all the anchor tags inside the given p tag
p = l.find_all('a')
# iterate through p to get all the href links
for link in p:
    # original html links
    print("links: ", link.get('href'))
    print("\n")
    # converting the extension from .html to .pdf
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    # converted to .pdf
    print("converted pdf links: ", pdf_link)
    print("\n")
    # added all the pdf links to the set
    list_of_pdf.add(pdf_link)
Output:
In the above code:
- list_of_pdf is an empty set used to collect all the PDF links from the web page. A set is used because it never keeps two elements with the same name and removes duplicates automatically.
- The loop iterates through all the links, converting .html to .pdf. This works because the PDF name and the HTML name differ only in the file extension; everything else is the same.
- We use a set because we need to get rid of duplicate names. A list could also be used; instead of add we would then append all the PDF links. (An alternative way of collecting the links is sketched after this list.)
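If the page being scraped does not follow this rename-the-extension pattern, a different (hypothetical) way to collect the links is to keep only the hrefs that already end in .pdf and resolve relative URLs with urllib.parse.urljoin; this sketch assumes the same soup and url objects from the steps above.
Python3
# alternative link collection (sketch): keep hrefs that already point at a pdf
from urllib.parse import urljoin

list_of_pdf = set()
for link in soup.find_all('a', href=True):
    href = link['href']
    if href.lower().endswith('.pdf'):
        # resolve relative links against the page url
        list_of_pdf.add(urljoin(url, href))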
Step 4: Create an info function with the PyPDF2 module that gets all the required information about a PDF.
Python3
def info(pdf_path):
    # download the pdf file with a GET request
    response = requests.get(pdf_path)
    # response.content is raw binary data, so wrap it in an
    # in-memory binary stream for PdfFileReader
    with io.BytesIO(response.content) as f:
        # initialize the pdf reader
        pdf = PdfFileReader(f)
        # all info about the pdf
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        txt = f"""
        Information about {pdf_path}:
        Author: {information.author}
        Creator: {information.creator}
        Producer: {information.producer}
        Subject: {information.subject}
        Title: {information.title}
        Number of pages: {number_of_pages}
        """
        print(txt)
        return information
In the above code:
- The info function is responsible for producing all the required scraped output for a PDF.
- io.BytesIO(response.content) – this is used because response.content is the raw binary content of the PDF, while PdfFileReader expects a file-like object to read from, so the bytes are wrapped in an in-memory binary stream with io.BytesIO.
- Several PyPDF2 functions are used to access different pieces of data from the PDF (a note on newer PyPDF2 versions follows this list).
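One caveat worth adding that is not part of the original article: PdfFileReader, getDocumentInfo() and getNumPages() belong to older PyPDF2 releases (before 3.0); newer versions replace them with PdfReader, the metadata attribute and len(reader.pages). A minimal sketch of the same idea against the newer API, assuming a PyPDF2 3.x installation:
Python3
# rough equivalent of info() for PyPDF2 3.x (sketch, assumes a newer PyPDF2)
from PyPDF2 import PdfReader

def info_v3(pdf_path):
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        reader = PdfReader(f)
        information = reader.metadata        # replaces getDocumentInfo()
        number_of_pages = len(reader.pages)  # replaces getNumPages()
        print(f"Information about {pdf_path}:")
        print(f"Title: {information.title}")
        print(f"Number of pages: {number_of_pages}")
        return information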
Note: Refer to Working with PDF Files in Python for more details.
Python3
# print the information of every pdf to the console
for i in list_of_pdf:
    info(i)
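The loop above only prints metadata. If a local copy of each PDF is also wanted, which the original steps do not cover, the same response bytes can simply be written to disk; the helper below is a hedged sketch that derives the file name from the last segment of the URL.
Python3
import os

# optional sketch: save each pdf locally, file name taken from the url
def download(pdf_path):
    response = requests.get(pdf_path)
    file_name = os.path.basename(pdf_path) or "downloaded.pdf"
    with open(file_name, "wb") as f:
        f.write(response.content)

for i in list_of_pdf:
    download(i)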
Complete code:
Python3
import requests
from bs4 import BeautifulSoup
import io
from PyPDF2 import PdfFileReader

url = "https://www.geeksforgeeks.org/how-to-extract-pdf-tables-in-python/"
read = requests.get(url)
html_content = read.content
soup = BeautifulSoup(html_content, "html.parser")

list_of_pdf = set()
l = soup.find('p')
p = l.find_all('a')

for link in p:
    pdf_link = (link.get('href')[:-5]) + ".pdf"
    print(pdf_link)
    list_of_pdf.add(pdf_link)


def info(pdf_path):
    response = requests.get(pdf_path)
    with io.BytesIO(response.content) as f:
        pdf = PdfFileReader(f)
        information = pdf.getDocumentInfo()
        number_of_pages = pdf.getNumPages()
        txt = f"""
        Information about {pdf_path}:
        Author: {information.author}
        Creator: {information.creator}
        Producer: {information.producer}
        Subject: {information.subject}
        Title: {information.title}
        Number of pages: {number_of_pages}
        """
        print(txt)
        return information


for i in list_of_pdf:
    info(i)
Output: