BeautifulSoup – Scraping Links from HTML

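Prerequisite: Implementing Web Scraping in Python with BeautifulSoup

In this article, we will see how to extract all the links from a URL or an HTML document using Python.

Libraries required:

bs4 (BeautifulSoup): a Python library for pulling information out of web pages; it helps extract data from HTML and XML files. It is not part of the Python standard library, so it must be installed separately. To install it, run the following command in the terminal.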

pip install bs4

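requests: a library that makes it easy to send HTTP requests and fetch the content of a web page. It is also not part of the Python standard library, so it must be installed separately. To install it, run the following command in the terminal.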

pip install requests

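Steps to follow:

Import the required libraries (bs4 and requests).
Create a function that takes a URL and fetches the HTML document from it using the requests.get() method.
Create a parse tree object, i.e. a soup object, using BeautifulSoup(), passing it the HTML document fetched above and Python's built-in HTML parser.
Extract the links from the BeautifulSoup object by finding its a tags.
Call the get() method on each anchor tag object, passing it the href argument, to obtain the actual URL.
Additionally, you can get a link's title by calling get() with the title argument.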
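
The last three steps can be tried without any network access. The following is a minimal sketch, assuming a small hand-written HTML snippet (sample_html and its example.com URL are made up for illustration, not taken from a real page). It shows that get() returns None when an anchor tag lacks the requested attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet standing in for a downloaded page
sample_html = """
<html><body>
  <a href="https://example.com/a" title="First link">A</a>
  <a href="/relative/path">B</a>
  <a name="no-href">C</a>
</body></html>
"""

# Build the parse tree with Python's built-in HTML parser
soup = BeautifulSoup(sample_html, "html.parser")

links = []
for anchor in soup.find_all("a"):
    href = anchor.get("href")    # None when the attribute is missing
    title = anchor.get("title")  # None when the attribute is missing
    if href is not None:
        links.append((href, title))

for href, title in links:
    print(href, title)
```

The third anchor is skipped because it has no href attribute, and the second is kept with a title of None.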

Implementation:

Python3

# import necessary libraries
from bs4 import BeautifulSoup
import requests
import re


# function to extract html document from given url
def getHTMLdocument(url):

    # request for HTML document of given url
    response = requests.get(url)

    # response.text holds the response body decoded as text (the page's HTML)
    return response.text


# assign URL
url_to_scrape = "https://practice.geeksforgeeks.org/courses/"

# create document
html_document = getHTMLdocument(url_to_scrape)

# create soup object
soup = BeautifulSoup(html_document, 'html.parser')


# find all the anchor tags with "href"
# attribute starting with "https://"
for link in soup.find_all('a',
                          attrs={'href': re.compile("^https://")}):
    # display the actual urls
    print(link.get('href'))

Output:

The script prints the href value of every matching anchor tag, one URL per line.
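
Because the pattern in the script only matches href values that begin with https://, relative links such as /courses/dsa are skipped. One possible way to keep them is to resolve every href against the page URL with urljoin from the standard library. This is a sketch under stated assumptions, not part of the original script: the helper name extract_absolute_links, the sample HTML, and the example.org base URL are all hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_absolute_links(html, base_url):
    """Return every href found in html, resolved against base_url."""
    soup = BeautifulSoup(html, "html.parser")
    # href=True keeps only anchor tags that actually carry an href attribute
    return [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]


# Hypothetical input: one relative link and one absolute link
sample_html = '<a href="/courses/dsa">DSA</a> <a href="https://example.com/x">X</a>'
absolute_links = extract_absolute_links(sample_html, "https://example.org/courses/")

for link in absolute_links:
    print(link)
```

Absolute URLs pass through urljoin unchanged, so the same helper handles both kinds of link.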