Web Scraping - Amazon Customer Reviews
In this article, we will see how to scrape Amazon customer reviews using Python's Beautiful Soup.
Modules required
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python. To install it, type the following command in the terminal.
pip install bs4
- requests: Requests allows you to send HTTP/1.1 requests very easily. This module also does not come built in with Python. To install it, type the following command in the terminal.
pip install requests
To begin web scraping, we first have to do some setup. Import all the required modules, and get the cookie data needed to make requests to Amazon. Create a header that contains your request cookies; without cookies you cannot scrape Amazon data, and the site will always return an error. A user-agent echo website will show you the specific user agent to put in the header, as sketched below.
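As a quick sanity check before adding headers, you can see which user agent your requests actually send by calling an echo service such as httpbin.org. This endpoint is an assumption for illustration, not part of the original program:

Python3

import requests

# httpbin echoes back the User-Agent header it received.
# Without a custom header, requests identifies itself as
# "python-requests/x.y.z", which Amazon typically rejects.
r = requests.get('https://httpbin.org/user-agent')
print(r.json())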
Pass the URL to the getdata() function (a user-defined function), which will request that URL and return the response. We use the get method to retrieve information from the given server using the given URL.
Syntax:
requests.get(url, args)
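For example, a minimal request with custom headers might look like the sketch below; the timeout value and the short HEADERS dictionary here are illustrative choices, not part of the original program:

Python3

import requests

# a minimal headers dictionary; the full program below uses a complete one
HEADERS = {'User-Agent': 'Mozilla/5.0',
           'Accept-Language': 'en-US, en;q=0.5'}

r = requests.get('https://www.amazon.in', headers=HEADERS, timeout=10)
print(r.status_code)  # 200 means the request succeeded
print(r.text[:200])   # first 200 characters of the raw HTML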
Convert that data into HTML code and then parse the HTML content using bs4.
Syntax: soup = BeautifulSoup(r.content, 'html.parser')
Parameters:
- r.content : It is the raw HTML content.
- html.parser : Specifying the HTML parser we want to use.
Now filter the required data using the soup.find_all() function.
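To see what find_all() does in isolation, here is a minimal sketch on a hand-written HTML snippet (the snippet itself is made up for illustration; the class name matches the one used later in this article):

Python3

from bs4 import BeautifulSoup

html = '''
<div>
  <span class="a-profile-name">Alice</span>
  <span class="a-profile-name">Bob</span>
  <span class="other">ignored</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every tag matching the given name and class filter
for span in soup.find_all('span', class_='a-profile-name'):
    print(span.get_text())  # prints: Alice, then Bob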
Program:
Python3
# import module
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/90.0.4430.212 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5'
}

# user-defined function
# to scrape the data
def getdata(url):
    r = requests.get(url, headers=HEADERS)
    return r.text

def html_code(url):
    # pass the url
    # into getdata function
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
    # return the parsed html code
    return soup

url = ("https://www.amazon.in/Columbia-Mens-wind-resistant-Glove/dp/"
       "B0772WVHPS/?_encoding=UTF8&pd_rd_w=d9RS9"
       "&pf_rd_p=3d2ae0df-d986-4d1d-8c95-aa25d2ade606"
       "&pf_rd_r=7MP3ZDYBBV88PYJ7KEMJ"
       "&pd_rd_r=550bec4d-5268-41d5-87cb-8af40554a01e"
       "&pd_rd_wg=oy8v8&ref_=pd_gw_cr_cartx&th=1")

soup = html_code(url)
print(soup)
Output:
Note: This is only the HTML code, i.e. the raw data.
Now that the core setup is complete, let's look at how to scrape the data for specific requirements.
Scraping the customer names
Now find the customer list with the span tag, where class_ = a-profile-name. You can open the webpage in the browser, right-click the relevant element and inspect it, as shown in the figure.
You have to pass the tag name and the attribute with its corresponding value to the find_all() function.
Code:
Python
def cus_data(soup):
    # find all span tags whose class is
    # a-profile-name with find_all()
    # and collect each name as a string
    data_str = ""
    cus_list = []

    for item in soup.find_all("span", class_="a-profile-name"):
        data_str = data_str + item.get_text()
        cus_list.append(data_str)
        data_str = ""
    return cus_list
cus_res = cus_data(soup)
print(cus_res)
Output:
['Amaze', 'Robert', 'D. Kong', 'Alexey', 'Charl', 'RBostillo']
Scraping the user reviews:
Now find the customer reviews in the same way as above. Locate the unique class name with its specific tag; here we use the div tag.
Code:
Python3
def cus_rev(soup):
    # find all div tags that hold the
    # review text with find_all()
    data_str = ""

    for item in soup.find_all("div", class_="a-expander-content "
                              "reviewText review-text-content "
                              "a-expander-partial-collapse-content"):
        data_str = data_str + item.get_text()

    result = data_str.split("\n")
    return result
rev_data = cus_rev(soup)
rev_result = []
for i in rev_data:
    # skip the empty strings left over by the split
    if i == "":
        pass
    else:
        rev_result.append(i)
rev_result
Output:
Scraping the product information
Here we will scrape product information such as the product name, ASIN number, weight and dimensions. For this we will use the ul tag with its specific, unique class name.
Code:
Python3
def product_info(soup):
    # find the ul tag that holds the
    # product detail bullet list
    data_str = ""
    pro_info = []

    for item in soup.find_all("ul", class_="a-unordered-list a-nostyle "
                              "a-vertical a-spacing-none "
                              "detail-bullet-list"):
        data_str = data_str + item.get_text()
        pro_info.append(data_str.split("\n"))
        data_str = ""
    return pro_info
pro_result = product_info(soup)

# Filter the required data
for item in pro_result:
    for j in item:
        # skip the empty strings left over by the split
        if j == "":
            pass
        else:
            print(j)
Output:
Scraping the review images:
Here we will extract the image links from the product reviews using the same method as above. The tag name and the tag's attribute are passed to findAll() as before.
Code:
Python3
def rev_img(soup):
    # find all img tags for the review
    # image thumbnails with findAll()
    images = []

    for img in soup.findAll('img', class_="cr-lightbox-image-thumbnail"):
        images.append(img.get('src'))
    return images
img_result = rev_img(soup)
img_result
Output:
Saving the details into a CSV file:
Here we save the details into a CSV file. We convert the data into a DataFrame and then export it to CSV; let's see how to export a Pandas DataFrame to a CSV file using the to_csv() function.
Syntax: to_csv(parameters)
Parameters :
- path_or_buf : File path or object, if None is provided the result is returned as a string.
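For instance, the optional index parameter controls whether the row index is written to the file; passing index=False omits it. A toy sketch (the DataFrame and file name here are made up for illustration; the program below keeps the default behaviour):

Python3

import pandas as pd

# toy DataFrame, just to demonstrate the index parameter
df = pd.DataFrame({'Name': ['A'], 'review': ['Good']})
df.to_csv('no_index.csv', index=False)  # omits the row index column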
Code:
Python3
import pandas as pd

# initialise data of lists
data = {'Name': cus_res,
        'review': rev_result}

# Create DataFrame
df = pd.DataFrame(data)

# Save the output
df.to_csv('amazon_review.csv')
Output:
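Note that pd.DataFrame requires both lists to be the same length, so make sure the scraped name and review lists line up before saving. To confirm the export worked, you can read the file back with pandas; this read-back check is an addition for verification, not part of the original program:

Python3

import pandas as pd

# read the exported file back; the first column is the saved row index
df = pd.read_csv('amazon_review.csv', index_col=0)
print(df.head())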