Scraping Reddit with Python and BeautifulSoup

In this article, we will see how to scrape Reddit using Python and BeautifulSoup. We will use the Beautiful Soup and requests modules to fetch and parse the data.

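Required modules

bs4: Beautiful Soup (bs4) is a Python library for extracting data from HTML and XML files. It is not part of the Python standard library. To install it, run the following command in a terminal.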

pip install bs4

requests: Requests lets you send HTTP/1.1 requests very easily. It is also not part of the Python standard library. To install it, run the following command in a terminal.

pip install requests

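Approach:

Import all the required modules.
Pass the URL to the getdata() function (a user-defined function), which requests the URL and returns the response. We use the GET method to retrieve information from the server at the given URL.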

Syntax: requests.get(url, args)
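
To make the syntax above concrete, here is a minimal sketch of how the arguments of a GET request shape what is sent. It prepares the request with requests.Request(...).prepare() instead of sending it, so nothing touches the network. The User-Agent string and the limit query parameter are illustrative assumptions, not values taken from this article.

```python
import requests

# Hypothetical header value, used only for illustration.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}

# Build (but do not send) the GET request that
# requests.get(url, params=..., headers=...) would issue.
request = requests.Request(
    "GET",
    "https://www.reddit.com/r/learnpython/",
    params={"limit": 5},  # illustrative query parameter
    headers=HEADERS,
)
prepared = request.prepare()

print(prepared.method)                 # HTTP method
print(prepared.url)                    # URL with the query string appended
print(prepared.headers["User-Agent"])  # the header defined above
```

Inspecting the prepared request is a convenient way to check the final URL and headers before making a real call with requests.get().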


Now parse the HTML content with bs4.

Syntax: soup = BeautifulSoup(r.content, 'html.parser')

Parameters:

r.content : It is the raw HTML content.
html.parser : Specifying the HTML parser we want to use.
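
As a small illustration of the parsing step, the sketch below feeds a hand-written HTML string to BeautifulSoup instead of a real response. The HTML snippet and the variable names are assumptions made for the example; only BeautifulSoup and the 'html.parser' argument come from the text above.

```python
from bs4 import BeautifulSoup

# A tiny hand-written page standing in for r.content (illustrative only).
html = (
    "<html><head><title>Demo page</title></head>"
    "<body><p class='intro'>Hello, Reddit!</p></body></html>"
)

# Parse the markup with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

title_text = soup.title.get_text()  # text inside <title>
intro_text = soup.p.get_text()      # text inside the first <p>
intro_classes = soup.p["class"]     # class attribute, returned as a list

print(title_text)
print(intro_text)
print(intro_classes)
```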


Now use the soup.find_all() function to filter out the required data.
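
The following sketch shows how find_all() filters by tag name and class, and how it differs from find(). The HTML and the class names comment and footer are made up for the example and are not Reddit's real markup.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class names are illustrative assumptions.
html = """
<div class="post">
  <p class="comment">First comment</p>
  <p class="comment">Second comment</p>
  <p class="footer">Not a comment</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag as a list-like ResultSet.
all_comments = soup.find_all("p", class_="comment")

# find() returns only the first match (or None if nothing matches).
first_comment = soup.find("p", class_="comment")

# When nothing matches, find_all() returns an empty result.
missing = soup.find_all("span")

print(len(all_comments))
print(first_comment.get_text())
print(len(missing))
```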

Let's walk through the script step by step.

Step 1: Import all dependencies

Python3

# import module
import requests
from bs4 import BeautifulSoup

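Step 2: Create a function to fetch the URL

Python3

# request headers; HEADERS was not defined in the original,
# so this User-Agent value is an assumed example
HEADERS = {"User-Agent": "Mozilla/5.0"}

# user-defined function that fetches the page
# and returns its HTML as text
def getdata(url):
    r = requests.get(url, headers=HEADERS)
    return r.text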
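
A fetch function like the one above returns whatever the server sends, including error pages. As a hedged sketch, the variant below adds a timeout and raises on HTTP error status codes. The name fetch_html, the default timeout of 10 seconds, and the User-Agent value are assumptions for illustration, not part of the original script.

```python
import requests

# Assumed example header; not taken from the original article.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch_html(url, timeout=10):
    """Fetch a page and return its HTML text (hypothetical helper).

    Raises requests.HTTPError for 4xx/5xx responses and
    requests.Timeout if the server does not answer in time.
    """
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # stop early on error responses
    return response.text
```

It could be called as htmldata = fetch_html(url) and the result passed to BeautifulSoup in the same way as the output of getdata().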

Step 3: Now take the URL, pass it to the getdata() function, and parse the returned HTML.

Python3

url = "https://www.reddit.com/r/learnpython/comments/78qnze/web_scraping_in_20_lines_of_code_with/"
  
# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')
    
# display html code
print(soup)

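Output:

Note: the output is just the raw HTML of the page.

Getting the author name

Now find the author in the div tag whose class is "NAURX0ARMmhJ5eqxQrlQW". To locate it, open the page in a browser, right-click the element, and inspect it, as shown in the figure.

Example:

Python3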

# find every matching HTML tag with find_all()
# and join their text into one string
data_str = ""
for item in soup.find_all("div", class_="NAURX0ARMmhJ5eqxQrlQW"):
    data_str = data_str + item.get_text()
        
print(data_str)

Output:

kashaziz

Getting the post content

Now find the body of the post, following the same approach as in the example above.

Example:

Python3

# find every matching HTML tag with find_all()
# and join their text into one string
data_str = ""
for item in soup.find_all("div", class_="_3xX726aBn29LDbsDtzr_6E _1Ap4F5maDtT1E1YuCiaO0r D3IL3FD0RFy_mkKLPwL4"):
    data_str = data_str + item.get_text()
print(data_str)
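
One detail worth knowing about the lookup above: when class_ is given a string containing several class names, Beautiful Soup compares it with the exact value of the class attribute, so the order of the names matters. The sketch below demonstrates this on made-up markup (the class names a, b and c are placeholders) and shows a CSS selector via select() as an order-independent alternative.

```python
from bs4 import BeautifulSoup

# Two divs with the same classes in a different order (illustrative markup).
html = """
<div class="a b c">same order</div>
<div class="c b a">different order</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string matches only the exact attribute value.
exact = [d.get_text() for d in soup.find_all("div", class_="a b c")]

# A CSS selector matches any element carrying all three classes.
by_selector = [d.get_text() for d in soup.select("div.a.b.c")]

print(exact)
print(by_selector)
```

Because class names like the ones used in this article look auto-generated, they may well change over time, so it is worth re-checking them in the browser if a lookup returns nothing.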


Getting the comments

Now scrape the comments, following the same approach as in the examples above.

Python3

# find every matching HTML tag with find_all()
# and join their text into one string
data_str = ""

for item in soup.find_all("p", class_="_1qeIAgB0cPwnLhDF9XSiJM"):
    data_str = data_str + item.get_text()
print(data_str)
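
The loop above glues all comments into one string with no separator between them. If the individual comments are needed, one option is to collect them in a list instead. The helper name extract_texts, the sample HTML, and the class name c below are assumptions made for this sketch.

```python
from bs4 import BeautifulSoup


def extract_texts(soup, tag, class_name):
    """Return the stripped text of every matching tag as a list (hypothetical helper)."""
    return [
        item.get_text(strip=True)
        for item in soup.find_all(tag, class_=class_name)
    ]


# Made-up markup standing in for the comment paragraphs.
html = '<p class="c">Great post</p><p class="c">  Thanks for sharing  </p>'
soup = BeautifulSoup(html, "html.parser")

comments = extract_texts(soup, "p", "c")

# Print one comment per line instead of a single run-on string.
print("\n".join(comments))
```

With the real page, the same helper could be called with the tag and class name used in the loop above.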
