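Scrape Google Reviews and Ratings using Python
In this article, we will see how to scrape Google reviews and ratings using Python.
Modules needed:
- Beautiful Soup: the scraping mechanism here is parsing the DOM, i.e. extracting data from HTML and XML files.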
# Installing with pip
pip install beautifulsoup4
# Installing with conda
conda install -c anaconda beautifulsoup4
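- Scrapy: an open-source package designed for scraping larger datasets; being open source, it is also widely used.
- Selenium: usually used for automated testing. We can also use it for scraping, because browser automation lets us interact with JavaScript-driven pages: clicking, scrolling, moving data between multiple frames, and so on.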
# Installing with pip
pip install selenium
# Installing with conda
conda install -c conda-forge selenium
Chrome driver manager:
# webdriver-manager downloads a driver that matches
# the installed browser version, which changes often
pip install webdriver-manager
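The DOM-parsing idea mentioned for Beautiful Soup can be sketched without any third-party install. The example below swaps in the standard library's `html.parser` as a dependency-free stand-in; Beautiful Soup offers a higher-level API for the same job. The HTML snippet, the class names `place`, `name`, and `rating`, and the shop names are all made up for illustration and do not reflect Google's real markup.

```python
from html.parser import HTMLParser

# Hypothetical markup for illustration only; real Google pages use
# different, frequently changing structure and class names.
SAMPLE_HTML = """
<div class="place"><span class="name">Book Nook</span><span class="rating">4.5</span></div>
<div class="place"><span class="name">Page Turner</span><span class="rating">3.9</span></div>
"""


class PlaceParser(HTMLParser):
    """Collect (name, rating) pairs from span.name / span.rating elements."""

    def __init__(self):
        super().__init__()
        self.places = []    # finished (name, rating) tuples
        self._field = None  # which span we are currently inside, if any
        self._name = None   # name waiting for its rating

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "span" and css_class in ("name", "rating"):
            self._field = css_class

    def handle_data(self, data):
        if self._field == "name":
            self._name = data.strip()
        elif self._field == "rating":
            self.places.append((self._name, float(data)))

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


parser = PlaceParser()
parser.feed(SAMPLE_HTML)
print(parser.places)
```

The parser walks the tags in document order and remembers which field it is inside, which is the same extraction a DOM library performs with selectors.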
Web driver initialization:
Python3
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

# The installed Chrome version varies between machines, so let
# webdriver-manager download a matching driver. Selenium 4 takes
# the driver path through a Service object.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
Output:
[WDM] – ====== WebDriver manager ======
[WDM] – Current google-chrome version is 99.0.4844
[WDM] – Get LATEST driver version for 99.0.4844
[WDM] – Driver [C:\Users\ksaty\.wdm\drivers\chromedriver\win32\99.0.4844.51\chromedriver.exe] found in cache
Let us try to locate "Rashtrapathi Bavan" and then proceed further. On the first run the page may ask for access permission; if such a prompt appears, accept it and continue.
Python3
url = 'https://www.google.com/maps/place/Rashtrapathi Bavan'
driver.get(url)
# Show the address the browser ended up on after the redirect
print(driver.current_url)
Output:
https://www.google.com/maps/place/Rashtrapati+Bhavan/@28.6143478,77.1972413,17z/data=!3m1!4b1!4m5!3m4!1s0x390ce2a99b6f9fa7:0x83a25e55f0af1c82!8m2!3d28.6143478!4d77.19943
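The redirected URL above appears to embed the map viewport as `@latitude,longitude,zoom` followed by `z`. As a minimal sketch, assuming that layout holds (it is an observation from this one URL, not a documented format), the values can be pulled out with a regular expression. The helper name `parse_viewport` is hypothetical, and the URL in the example is a shortened copy of the one shown above.

```python
import re

# Pattern for the "@<lat>,<lng>,<zoom>z" segment seen in the URL above.
# This layout is an observation, not a documented, stable format.
VIEWPORT_RE = re.compile(
    r"@(-?\d+(?:\.\d+)?),(-?\d+(?:\.\d+)?),(\d+(?:\.\d+)?)z"
)


def parse_viewport(url):
    """Return (latitude, longitude, zoom) from a Maps URL, or None."""
    match = VIEWPORT_RE.search(url)
    if match is None:
        return None
    lat, lng, zoom = (float(group) for group in match.groups())
    return lat, lng, zoom


# Shortened copy of the output URL; the data segment is trimmed
url = ("https://www.google.com/maps/place/Rashtrapati+Bhavan/"
       "@28.6143478,77.1972413,17z/data=!3m1!4b1")
print(parse_viewport(url))
```

Returning `None` for a URL without the `@` segment lets the caller decide how to handle pages that have not finished redirecting.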
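Scrape Google reviews and ratings
Here we will try to fetch three kinds of entities from Google Maps: book shops, food, and temples. To do so, we build a specific search phrase for each one and combine it with the location.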
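The query-building step can be looked at in isolation. The script in this section hands the raw phrase to the browser, which encodes it; the sketch below instead encodes the phrase explicitly with `urllib.parse.quote_plus`, which may be useful when the URL is logged or reused outside a browser. `CATEGORIES` and `build_search_url` are hypothetical names introduced for this example.

```python
from urllib.parse import quote_plus

# Menu choices used in this article, mapped to their search phrases
CATEGORIES = {"1": "book shops", "2": "food", "3": "temples"}


def build_search_url(choice, location):
    """Return a Google search URL for a menu choice and a location.

    build_search_url is a hypothetical helper, not part of any library.
    Raises ValueError for a choice that is not in CATEGORIES.
    """
    if choice not in CATEGORIES:
        raise ValueError(f"unknown choice: {choice!r}")
    query = f"{CATEGORIES[choice]} near {location}"
    # quote_plus escapes spaces and other reserved characters
    return "https://www.google.com/search?q=" + quote_plus(query)


print(build_search_url("1", "600028"))
# prints https://www.google.com/search?q=book+shops+near+600028
```

Rejecting unknown choices up front avoids sending a request with an undefined or stale query.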
Python3
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager

# Selenium 4 takes the driver path through a Service object
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.maximize_window()
driver.implicitly_wait(30)

# The location can be hard-coded or read from input;
# either way it must be a valid one
location = "600028"

print("Search By ")
print("1.Book shops")
print("2.Food")
print("3.Temples")
print("4.Exit")

# Map each menu choice to its search phrase
queries = {'1': "book shops near ", '2': "food near ", '3': "temples near "}

ch = "Y"
while ch.upper() == 'Y':
    choice = input("Enter choice(1/2/3/4):")
    if choice == '4':
        break
    if choice not in queries:
        print("Invalid choice")
        continue
    query = queries[choice] + location
    driver.get("https://www.google.com/search?q=" + query)
    wait = WebDriverWait(driver, 10)
    # Wait for the first link whose href contains '/search?tbs',
    # hover over it, then click it
    link = wait.until(EC.element_to_be_clickable(
        (By.XPATH, "//a[contains(@href, '/search?tbs')]")))
    ActionChains(driver).move_to_element(link).perform()
    link.click()
    # Collect the text of every div marked aria-level='3'
    names = []
    for name in driver.find_elements(By.XPATH, "//div[@aria-level='3']"):
        names.append(name.text)
    print(names)
    ch = input("Do you want to continue (Y/N): ")

# Close the browser once the loop ends
driver.quit()
Output: