Extracting hyperlinks from a PDF in Python

Prerequisites: PyPDF2, regular expressions

In this article, we will extract hyperlinks from a PDF in Python. This can be done in two ways:

Using PyPDF2
Using pdfx

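Method 1: Using PyPDF2.

PyPDF2 is a Python library built as a PDF toolkit. It can extract document information and page text, among other things.

Approach:

Read the PDF file and convert it to text
Use a regular expression to get the URLs from the text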
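
The second step of the approach can be tried on a plain string before any PDF is involved. The following is a minimal sketch, assuming that URLs start with http:// or https:// and that punctuation at the end of a match belongs to the surrounding sentence rather than to the link. The function name find_urls and the sample text are hypothetical and not part of PyPDF2.

```python
import re

# Matches http:// or https:// followed by any run of non-whitespace characters.
URL_PATTERN = re.compile(r"https?://\S+")


def find_urls(text):
    """Return the URLs found in text, in order of first appearance."""
    urls = []
    for match in URL_PATTERN.findall(text):
        # Assumption: trailing punctuation is part of the sentence, not the URL.
        cleaned = match.rstrip(".,;:!?)]}>\"'")
        if cleaned and cleaned not in urls:
            urls.append(cleaned)
    return urls


sample = "See https://docs.python.org/ and (https://www.geeksforgeeks.org/)."
print(find_urls(sample))
```

Stripping trailing punctuation and dropping repeats is a design choice for readable output; a stricter pattern may suit documents where URLs legitimately end in such characters.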

Let's implement this step by step:

Step 1: Open and read the PDF file.

Python3

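# Import module
import PyPDF2

# Enter the path of the PDF file
file = "Enter PDF File Name"

# Open the file in binary mode; the with block closes it automatically
with open(file, 'rb') as pdfFileObject:

    # PdfReader, pages and extract_text() replace PdfFileReader, numPages,
    # getPage() and extractText(), which no longer work in PyPDF2 3.0
    pdfReader = PyPDF2.PdfReader(pdfFileObject)

    # Print the text of every page
    for pageObject in pdfReader.pages:
        pdf_text = pageObject.extract_text()
        print(pdf_text)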

Step 2: Use a regular expression to find the URLs in the extracted text.

Python3

# Import modules
import PyPDF2
import re

# Enter the path of the PDF file
file = "Enter PDF File Name"

# Regular expression (get URLs from a string)
def Find(string):

    # findall() returns every substring that starts with
    # http:// or https:// and runs up to the next whitespace
    regex = r"(https?://\S+)"
    return re.findall(regex, string)

# Open the file in binary mode; the with block closes it automatically
with open(file, 'rb') as pdfFileObject:

    # PdfReader, pages and extract_text() replace PdfFileReader, numPages,
    # getPage() and extractText(), which no longer work in PyPDF2 3.0
    pdfReader = PyPDF2.PdfReader(pdfFileObject)

    # Iterate through all pages
    for pageObject in pdfReader.pages:

        # Extract text from the page
        pdf_text = pageObject.extract_text()

        # Print all URLs found on the page
        print(Find(pdf_text))

Output:

['https://docs.python.org/', 'https://pythonhosted.org/PyPDF2/', 'https://www.geeksforgeeks.org/']
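
The regular expression only finds URLs that appear as visible text. A hyperlink whose visible text is something else (for example, the word "here") is stored in the PDF as a link annotation with a /URI action, and text extraction does not show it. The following is a hedged sketch of reading those annotations, assuming PyPDF2 2.0 or later; the helper names _resolve, uris_from_page and uris_from_pdf are hypothetical, and the code has not been checked against every kind of PDF.

```python
def _resolve(obj):
    # PyPDF2 stores many values as indirect references; plain objects pass through.
    return obj.get_object() if hasattr(obj, "get_object") else obj


def uris_from_page(page):
    """Collect the /URI targets of the link annotations on one page."""
    uris = []
    if "/Annots" not in page:
        return uris
    for annotation in _resolve(page["/Annots"]):
        annotation = _resolve(annotation)
        if "/A" not in annotation:
            continue
        action = _resolve(annotation["/A"])
        if "/URI" in action:
            uri = action["/URI"]
            # The target may be stored as bytes or as text.
            uris.append(uri.decode("latin-1") if isinstance(uri, bytes) else str(uri))
    return uris


def uris_from_pdf(path):
    # Imported here so the helpers above can be tried without PyPDF2 installed.
    from PyPDF2 import PdfReader  # assumption: PyPDF2 2.0 or later

    with open(path, "rb") as pdf_file:
        reader = PdfReader(pdf_file)
        return [uri for page in reader.pages for uri in uris_from_page(page)]
```

Calling uris_from_pdf("Enter PDF File Name") with a real file path would return the annotation targets as a list, which can be combined with the URLs found by the regular expression.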

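Method 2: Using pdfx.

In this method, we will use the pdfx module, which extracts URLs, metadata, and plain text from a given PDF file or PDF URL. Its main feature is extracting references and metadata from a given PDF. It can be installed with pip: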

pip install pdfx

Below is the implementation:

Python3

# Import Module
import pdfx 
  
# Read PDF File
pdf = pdfx.PDFx("File Name") 
  
# Get list of URL
print(pdf.get_references_as_dict())

Output:

{'url': ['https://www.geeksforgeeks.org/',
  'https://docs.python.org/',
  'https://pythonhosted.org/PyPDF2/',
  'GeeksforGeeks.org']}
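
The dictionary can contain entries without a scheme, such as 'GeeksforGeeks.org' in the output above. The following is a minimal sketch of keeping only complete web addresses, assuming that "complete" means an http or https scheme plus a host name. The function name keep_web_urls is hypothetical and not part of pdfx; the sample dictionary is copied from the output above.

```python
from urllib.parse import urlparse


def keep_web_urls(references):
    """Return the entries under "url" that have an http(s) scheme and a host."""
    web_urls = []
    for candidate in references.get("url", []):
        parsed = urlparse(candidate)
        if parsed.scheme in ("http", "https") and parsed.netloc:
            web_urls.append(candidate)
    return web_urls


# Sample dictionary copied from the output shown above.
references = {
    "url": [
        "https://www.geeksforgeeks.org/",
        "https://docs.python.org/",
        "https://pythonhosted.org/PyPDF2/",
        "GeeksforGeeks.org",
    ]
}
print(keep_web_urls(references))
```

Entries that fail the check are not necessarily wrong; a bare domain could instead be given an assumed https:// prefix if that suits the use case better.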