Web Scraping - Amazon Customer Reviews
In this article, we will see how to scrape Amazon customer reviews using Python's Beautiful Soup.
Modules required
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built in with Python. To install it, type the following command in the terminal.
pip install bs4
- requests: Requests allows you to send HTTP/1.1 requests very easily. This module also does not come built in with Python. To install it, type the following command in the terminal.
pip install requests
To begin web scraping, we first have to do some setup. Import all the required modules, and get the cookie data needed to make requests to Amazon. Create a header that contains your request cookies; without cookies you cannot scrape Amazon data, and the site will always return an error. A user-agent echo website will show you the specific user agent to put in the header, as sketched below.
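As a quick sanity check before adding headers, you can see which user agent your requests actually send by calling an echo service such as httpbin.org. This endpoint is an assumption for illustration, not part of the original program:

Python3

import requests

# httpbin echoes back the User-Agent header it received.
# Without a custom header, requests identifies itself as
# "python-requests/x.y.z", which Amazon typically rejects.
r = requests.get('https://httpbin.org/user-agent')
print(r.json())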
Pass the URL to the getdata() function (a user-defined function), which will request that URL and return the response. We use the get method to retrieve information from the given server using the given URL.
Syntax:
requests.get(url, args)
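For example, a minimal request with custom headers might look like the sketch below; the timeout value and the short HEADERS dictionary here are illustrative choices, not part of the original program:

Python3

import requests

# a minimal headers dictionary; the full program below uses a complete one
HEADERS = {'User-Agent': 'Mozilla/5.0',
           'Accept-Language': 'en-US, en;q=0.5'}

r = requests.get('https://www.amazon.in', headers=HEADERS, timeout=10)
print(r.status_code)  # 200 means the request succeeded
print(r.text[:200])   # first 200 characters of the raw HTML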
Convert that data into HTML code and then parse the HTML content using bs4.
Syntax: soup = BeautifulSoup(r.content, 'html.parser')
Parameters:
- r.content : It is the raw HTML content.
- html.parser : Specifying the HTML parser we want to use.
Now filter the required data using the soup.find_all() function.
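To see what find_all() does in isolation, here is a minimal sketch on a hand-written HTML snippet (the snippet itself is made up for illustration; the class name matches the one used later in this article):

Python3

from bs4 import BeautifulSoup

html = '''
<div>
  <span class="a-profile-name">Alice</span>
  <span class="a-profile-name">Bob</span>
  <span class="other">ignored</span>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every tag matching the given name and class filter
for span in soup.find_all('span', class_='a-profile-name'):
    print(span.get_text())  # prints: Alice, then Bob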
Program:
Python3
# import module
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                   'AppleWebKit/537.36 (KHTML, like Gecko) '
                   'Chrome/90.0.4430.212 Safari/537.36'),
    'Accept-Language': 'en-US, en;q=0.5'
}

# user-defined function
# to scrape the data
def getdata(url):
    r = requests.get(url, headers=HEADERS)
    return r.text

def html_code(url):
    # pass the url
    # into getdata function
    htmldata = getdata(url)
    soup = BeautifulSoup(htmldata, 'html.parser')
    # return the parsed html code
    return soup

url = ("https://www.amazon.in/Columbia-Mens-wind-resistant-Glove/dp/"
       "B0772WVHPS/?_encoding=UTF8&pd_rd_w=d9RS9"
       "&pf_rd_p=3d2ae0df-d986-4d1d-8c95-aa25d2ade606"
       "&pf_rd_r=7MP3ZDYBBV88PYJ7KEMJ"
       "&pd_rd_r=550bec4d-5268-41d5-87cb-8af40554a01e"
       "&pd_rd_wg=oy8v8&ref_=pd_gw_cr_cartx&th=1")

soup = html_code(url)
print(soup)
Output:
Note: This is only the HTML code, i.e. the raw data.
Now that the core setup is complete, let's look at how to scrape the data for specific requirements.
Scraping the customer names
Now find the customer list with the span tag, where class_ = a-profile-name. You can open the webpage in the browser, right-click the relevant element and inspect it, as shown in the figure.
You have to pass the tag name and the attribute with its corresponding value to the find_all() function.
Code:
Python
def cus_data(soup):
    # find all span tags whose class is
    # a-profile-name with find_all()
    # and collect each name as a string
    data_str = ""
    cus_list = []

    for item in soup.find_all("span", class_="a-profile-name"):
        data_str = data_str + item.get_text()
        cus_list.append(data_str)
        data_str = ""
    return cus_list
cus_res = cus_data(soup)
print(cus_res)
Output:
['Amaze', 'Robert', 'D. Kong', 'Alexey', 'Charl', 'RBostillo']
Scraping the user reviews:
Now find the customer reviews in the same way as above. Locate the unique class name with its specific tag; here we use the div tag.
Code:
Python3
def cus_rev(soup):
    # find all div tags that hold the
    # review text with find_all()
    data_str = ""

    for item in soup.find_all("div", class_="a-expander-content "
                              "reviewText review-text-content "
                              "a-expander-partial-collapse-content"):
        data_str = data_str + item.get_text()

    result = data_str.split("\n")
    return result
rev_data = cus_rev(soup)
rev_result = []
for i in rev_data:
    # skip the empty strings left over by the split
    if i == "":
        pass
    else:
        rev_result.append(i)
rev_result
Output:
Scraping the product information
Here we will scrape product information such as the product name, ASIN number, weight and dimensions. For this we will use the ul tag with its specific, unique class name.
Code:
Python3
def product_info(soup):
    # find the ul tag that holds the
    # product detail bullet list
    data_str = ""
    pro_info = []

    for item in soup.find_all("ul", class_="a-unordered-list a-nostyle "
                              "a-vertical a-spacing-none "
                              "detail-bullet-list"):
        data_str = data_str + item.get_text()
        pro_info.append(data_str.split("\n"))
        data_str = ""
    return pro_info
pro_result = product_info(soup)

# Filter the required data
for item in pro_result:
    for j in item:
        # skip the empty strings left over by the split
        if j == "":
            pass
        else:
            print(j)
Output:
Scraping the review images:
Here we will extract the image links from the product reviews using the same method as above. The tag name and the tag's attribute are passed to findAll() as before.
Code:
Python3
def rev_img(soup):
    # find all img tags for the review
    # image thumbnails with findAll()
    images = []

    for img in soup.findAll('img', class_="cr-lightbox-image-thumbnail"):
        images.append(img.get('src'))
    return images
img_result = rev_img(soup)
img_result
Output:
Saving the details into a CSV file:
Here we save the details into a CSV file. We convert the data into a DataFrame and then export it to CSV; let's see how to export a Pandas DataFrame to a CSV file using the to_csv() function.
Syntax: to_csv(parameters)
Parameters :
- path_or_buf : File path or object, if None is provided the result is returned as a string.
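For instance, the optional index parameter controls whether the row index is written to the file; passing index=False omits it. A toy sketch (the DataFrame and file name here are made up for illustration; the program below keeps the default behaviour):

Python3

import pandas as pd

# toy DataFrame, just to demonstrate the index parameter
df = pd.DataFrame({'Name': ['A'], 'review': ['Good']})
df.to_csv('no_index.csv', index=False)  # omits the row index column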
Code:
Python3
import pandas as pd

# initialise data of lists
data = {'Name': cus_res,
        'review': rev_result}

# Create DataFrame
df = pd.DataFrame(data)

# Save the output
df.to_csv('amazon_review.csv')
Output:
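Note that pd.DataFrame requires both lists to be the same length, so make sure the scraped name and review lists line up before saving. To confirm the export worked, you can read the file back with pandas; this read-back check is an addition for verification, not part of the original program:

Python3

import pandas as pd

# read the exported file back; the first column is the saved row index
df = pd.read_csv('amazon_review.csv', index_col=0)
print(df.head())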