使用Python抓取 IMDB 电影评分和详细信息
我们可以借助Python的 BeautifulSoup 库抓取 IMDb 电影评级及其详细信息。
需要的模块:
以下是从 IMDB 中抓取所需的模块列表。
- 请求:请求库是Python 的一个组成部分,用于向指定的 URL 发出 HTTP 请求。无论是 REST API 还是 Web Scrapping,都必须学习请求才能进一步使用这些技术。当一个人向一个 URI 发出请求时,它会返回一个响应。
- html5lib:用于解析 HTML 的纯 Python 库。它旨在符合 WHATWG HTML 规范,正如所有主要 Web 浏览器所实现的那样。
- bs4:BeautifulSoup 对象由 Beautiful Soup 提供,它是Python的网络抓取框架。网络抓取是使用自动化工具从网站中提取数据的过程,以加快过程。
方法:
在Python实现网页抓取以提取IMDb电影评级及其评级的步骤:
- 导入所需的模块。
Python3
from bs4 import BeautifulSoup
import requests
import re
Python3
# Downloading imdb top 250 movie's data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
Python3
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
Python3
# create a empty list for storing
# movie information
list = []
# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):
# Separating movie into: 'place',
# 'title', 'year'
movie_string = movies[index].get_text()
movie = (' '.join(movie_string.split()).replace('.', ''))
movie_title = movie[len(str(index))+1:-7]
year = re.search('\((.*?)\)', movie_string).group(1)
place = movie[:len(str(index))-(len(movie))]
data = {"movie_title": movie_title,
"year": year,
"place": place,
"star_cast": crew[index],
"rating": ratings[index],
"vote": votes[index],
"link": links[index]}
list.append(data)
Python3
for movie in list:
print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
') -', 'Starring:', moive['star_cast'], movie['rating'])
Python3
from bs4 import BeautifulSoup
import requests
import re
# Downloading imdb top 250 movie's data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value')
for b in soup.select('td.ratingColumn strong')]
list = []
# create a empty list for storing
# movie information
list = []
# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):
# Separating movie into: 'place',
# 'title', 'year'
movie_string = movies[index].get_text()
movie = (' '.join(movie_string.split()).replace('.', ''))
movie_title = movie[len(str(index))+1:-7]
year = re.search('\((.*?)\)', movie_string).group(1)
place = movie[:len(str(index))-(len(movie))]
data = {"movie_title": movie_title,
"year": year,
"place": place,
"star_cast": crew[index],
"rating": ratings[index],
"vote": votes[index],
"link": links[index]}
list.append(data)
# printing movie details with its rating.
for movie in list:
print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
') -', 'Starring:', movie['star_cast'], movie['rating'])
- 通过分配 URL 并创建soap对象来访问网页中的HTML 内容。
蟒蛇3
# Downloading imdb top 250 movie's data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
- 提取电影评级及其详细信息。在这里,我们使用 Html 标签(如 href、title 等)从 BeautifulSoup 对象中提取数据。
蟒蛇3
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value') for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value') for b in soup.select('td.ratingColumn strong')]
- 提取电影详细信息后,创建一个空列表并将详细信息存储在字典中,然后将它们添加到列表中。
蟒蛇3
# create a empty list for storing
# movie information
list = []
# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):
# Separating movie into: 'place',
# 'title', 'year'
movie_string = movies[index].get_text()
movie = (' '.join(movie_string.split()).replace('.', ''))
movie_title = movie[len(str(index))+1:-7]
year = re.search('\((.*?)\)', movie_string).group(1)
place = movie[:len(str(index))-(len(movie))]
data = {"movie_title": movie_title,
"year": year,
"place": place,
"star_cast": crew[index],
"rating": ratings[index],
"vote": votes[index],
"link": links[index]}
list.append(data)
- 现在或列表中充满了顶级 IMBD 电影及其详细信息。最后显示电影详情列表
蟒蛇3
for movie in list:
print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
') -', 'Starring:', moive['star_cast'], movie['rating'])
完整代码:
蟒蛇3
from bs4 import BeautifulSoup
import requests
import re
# Downloading imdb top 250 movie's data
url = 'http://www.imdb.com/chart/top'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
movies = soup.select('td.titleColumn')
links = [a.attrs.get('href') for a in soup.select('td.titleColumn a')]
crew = [a.attrs.get('title') for a in soup.select('td.titleColumn a')]
ratings = [b.attrs.get('data-value')
for b in soup.select('td.posterColumn span[name=ir]')]
votes = [b.attrs.get('data-value')
for b in soup.select('td.ratingColumn strong')]
list = []
# create a empty list for storing
# movie information
list = []
# Iterating over movies to extract
# each movie's details
for index in range(0, len(movies)):
# Separating movie into: 'place',
# 'title', 'year'
movie_string = movies[index].get_text()
movie = (' '.join(movie_string.split()).replace('.', ''))
movie_title = movie[len(str(index))+1:-7]
year = re.search('\((.*?)\)', movie_string).group(1)
place = movie[:len(str(index))-(len(movie))]
data = {"movie_title": movie_title,
"year": year,
"place": place,
"star_cast": crew[index],
"rating": ratings[index],
"vote": votes[index],
"link": links[index]}
list.append(data)
# printing movie details with its rating.
for movie in list:
print(movie['place'], '-', movie['movie_title'], '('+movie['year'] +
') -', 'Starring:', movie['star_cast'], movie['rating'])
输出: