How to get all links from a Google search using Python

Last modified: 2023-12-03 15:08:23.556000. Author: Mango

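If you need to collect the links returned by a Google search, Python is a very useful tool. This article walks through one way to do it: fetch the results page, then pull the result links out of its HTML.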

Using requests and BeautifulSoup

First, import requests and BeautifulSoup. requests is a library for sending HTTP requests, and BeautifulSoup is a library for parsing HTML and XML documents.

import requests
from bs4 import BeautifulSoup

Next, define a function google_search that queries Google and returns the HTML of the results page.

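def google_search(query):
    url = "https://www.google.com/search"
    # Set a browser User-Agent header
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:89.0) Gecko/20100101 Firefox/89.0'}
    # Passing the query through params lets requests percent-encode spaces and symbols
    response = requests.get(url, params={'q': query}, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text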

With google_search defined, we can pass it a query string and get back the HTML of the results page.
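
Because the query string becomes part of a URL, spaces and characters such as & have to be percent-encoded before the request is sent. The sketch below shows that encoding step on its own, using only the standard library. The helper name build_search_url is hypothetical and is not part of the original article.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL with the query percent-encoded.

    build_search_url is a hypothetical helper used only for illustration.
    """
    # urlencode turns spaces into '+' and escapes reserved characters like '&'
    return "https://www.google.com/search?" + urlencode({"q": query})


print(build_search_url("Python tutorials & guides"))
# prints: https://www.google.com/search?q=Python+tutorials+%26+guides
```

Without the encoding, the & in the query would be read as the start of a second URL parameter and the search term would be cut short.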

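Next, we parse the HTML and extract the links from the search results, using BeautifulSoup. We define a function search_links that takes the results page HTML as its argument and returns a list of the links it finds.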

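from urllib.parse import unquote

def search_links(html):
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        href = link.get('href')
        # Skip anchors without an href and anything that is not a result redirect
        if not href or not href.startswith('/url?q='):
            continue
        # Strip the '/url?q=' prefix, drop the trailing parameters, decode percent-escapes
        target = unquote(href[len('/url?q='):].split('&')[0])
        if target.startswith('http'):
            links.append(target)
    return links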

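search_links first parses the HTML with BeautifulSoup. For each anchor tag it reads the href attribute and keeps only values that start with /url?q=, the redirect form used for result links. It strips that prefix, discards anything that does not start with http or https, and cuts the value at the first & to drop the extra parameters appended after the target URL. The remaining links are added to a list, which the function returns.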
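
The prefix-stripping and splitting described above can also be done with the standard library's URL parser, which handles percent-decoding at the same time. This is a minimal sketch of that alternative, not the article's own method. The helper name extract_target and the sample href are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def extract_target(href):
    """Return the destination URL inside a '/url?q=...' href, or None.

    extract_target is a hypothetical helper used only for illustration.
    """
    if not href:
        return None
    parts = urlsplit(href)
    if parts.path != "/url":
        return None
    # parse_qs splits the query on '&' and decodes percent-escapes
    values = parse_qs(parts.query).get("q", [])
    if values and values[0].startswith(("http://", "https://")):
        return values[0]
    return None


# Made-up href in the redirect form described above
sample = "/url?q=https://example.com/a%3Fb%3D1&sa=U&ved=abc"
print(extract_target(sample))
# prints: https://example.com/a?b=1
```

Inside search_links, each href could be passed through a helper like this, keeping only the values that are not None.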

For a simple search query, we can run the following code:

html = google_search('Python tutorials')
links = search_links(html)
for link in links:
    print(link)

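This prints the result links found on the results page for the query "Python tutorials". A single request returns only one page of results, so the list covers that page rather than every result Google has. You can save the links to a file for later processing.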
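
Saving the links can be as simple as writing one URL per line. The sketch below also removes duplicates, since the same URL can appear more than once on a page. The function name save_links and the file name links.txt are assumptions made for this example.

```python
def save_links(links, path):
    """Write unique links to path, one per line, keeping first-seen order.

    save_links is a hypothetical helper used only for illustration.
    """
    # dict keys are unique and preserve insertion order
    unique = list(dict.fromkeys(links))
    with open(path, "w", encoding="utf-8") as f:
        for link in unique:
            f.write(link + "\n")
    return unique
```

Calling save_links(links, "links.txt") after the loop above would write the collected links to links.txt in the current directory.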

Conclusion

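In this article we used Python, requests, and BeautifulSoup to collect links from a Google search. We first send an HTTP request, then parse the returned HTML with BeautifulSoup and extract the result links into a list. From there, the links can be saved to a file or processed in some other way.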