How to Extract YouTube Data in Python?
Prerequisite: BeautifulSoup
The statistics of a YouTube channel can be used for analysis, and they can be extracted with Python code. A lot of data can be retrieved, such as viewCount, subscriberCount and videoCount. This article discusses two ways to do this.
Method 1: Using the YouTube API
First, we need to generate an API key. You need a Google account to access the Google API Console, request an API key and register your application. You can do this from the Google API page.
To extract the data, we need the channel ID of the YouTube channel whose statistics we want to look at. To get the channel ID, visit that particular YouTube channel and copy the last part of the URL (in the example given below, the channel ID of the GeeksForGeeks channel is used).
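For example, a minimal sketch (plain string handling, nothing API-specific) of pulling the channel ID out of a channel URL:

# A minimal sketch: the channel ID is the last path segment of a URL of
# the form https://www.youtube.com/channel/<channel_id>
url = 'https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ'
channel_id = url.rstrip('/').split('/')[-1]
print(channel_id)  # UC0RhatS1pyxInC00YKjjBqQ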
Approach
- First create youtube_statistics.py
- In this file, use the YTstats class to extract the data and generate a JSON file containing all of the extracted data.
- Now create main.py
- In main.py, import youtube_statistics.py
- Add the API key and the channel ID
- Now, using the first file, the data corresponding to the given key will be retrieved and saved to a JSON file (the rough shape of the API response is sketched right after this list).
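For orientation, a channels request with part=statistics against the YouTube Data API v3 returns JSON roughly shaped like the sketch below; the field names follow the API reference, but the values here are made-up placeholders:

# Rough shape of the response from
# https://www.googleapis.com/youtube/v3/channels?part=statistics&id=<channel_id>&key=<API_KEY>
# (placeholder values, not real figures)
example_response = {
    "kind": "youtube#channelListResponse",
    "items": [
        {
            "kind": "youtube#channel",
            "id": "<channel_id>",
            "statistics": {
                "viewCount": "123456789",
                "subscriberCount": "500000",
                "hiddenSubscriberCount": False,
                "videoCount": "4000"
            }
        }
    ]
}

# get_channel_statistics() below keeps only example_response["items"][0]["statistics"]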
Example:
Code for the main.py file:
Python3
from youtube_statistics import YTstats
# paste the API key generated by you here
API_KEY = "AIzaSyA-0KfpLK04NpQN1XghxhSlzG-WkC3DHLs"
# paste the channel id here
channel_id = "UC0RhatS1pyxInC00YKjjBqQ"
yt = YTstats(API_KEY, channel_id)
yt.get_channel_statistics()
yt.dump()
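After main.py has run, the dumped JSON file can be read back to inspect the individual counts. A minimal sketch, assuming the geeksforgeeks.json filename that the dump() method shown further below derives from the channel title:

import json

# Read the file written by YTstats.dump() (named after the channel title,
# here geeksforgeeks.json) and print each statistics field.
with open('geeksforgeeks.json') as f:
    stats = json.load(f)

for key, value in stats.items():
    print(f'{key}: {value}')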
Code for the youtube_statistics.py file:
Python3
import requests
import json


class YTstats:
    def __init__(self, api_key, channel_id):
        self.api_key = api_key
        self.channel_id = channel_id
        self.channel_statistics = None

    def get_channel_statistics(self):
        # request the statistics part for the given channel from the
        # YouTube Data API v3
        url = f'https://www.googleapis.com/youtube/v3/channels?part=statistics&id={self.channel_id}&key={self.api_key}'
        json_url = requests.get(url)
        data = json.loads(json_url.text)
        try:
            data = data["items"][0]["statistics"]
        except (KeyError, IndexError):
            # the response did not contain the expected fields
            data = None
        self.channel_statistics = data
        return data

    def dump(self):
        if self.channel_statistics is None:
            return
        channel_title = "GeeksForGeeks"
        channel_title = channel_title.replace(" ", "_").lower()
        # generate a json file with all the statistics data of the youtube channel
        file_name = channel_title + '.json'
        with open(file_name, 'w') as f:
            json.dump(self.channel_statistics, f, indent=4)
        print('file dumped')
Output:
Method 2: Using BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. In this method, we use BeautifulSoup together with Selenium to scrape data from a YouTube channel. The program reports the view count, posting time, title and URL of each video and prints them using Python's string formatting.
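Before the full example, here is a tiny self-contained sketch of the BeautifulSoup calls it relies on, run against a hand-written HTML fragment that only mimics the markup the scraper looks for (it is not real YouTube markup):

from bs4 import BeautifulSoup

# A self-contained sketch: a hand-written HTML fragment that mimics the
# elements the scraper below searches for.
html = """
<div>
  <a id="video-title" href="/watch?v=abc123">Sample video title</a>
  <span class="style-scope ytd-grid-video-renderer">1.2K views</span>
  <span class="style-scope ytd-grid-video-renderer">2 days ago</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# titles/links are matched by id, views and posting time by class
title = soup.find('a', id='video-title')
spans = soup.find_all('span', class_='style-scope ytd-grid-video-renderer')

print(title.text.strip(), 'https://www.youtube.com' + title.get('href'))
print([s.text for s in spans])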
Approach
- Import the required modules
- Provide the URL of the channel whose data you want to fetch
- Extract the data
- Display the fetched data.
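Note: the example below launches a visible Chrome window. If you would rather run it without a window, Selenium's Chrome options can be passed to the driver; a minimal sketch, assuming a standard Chrome and chromedriver setup:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# A minimal sketch: start Chrome without opening a visible window.
# '--headless' is a standard Chrome command-line flag; newer Chrome
# releases may prefer '--headless=new'.
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ/videos')
print(driver.title)
driver.quit()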
Example:
Python3
# import required packages
from selenium import webdriver
from bs4 import BeautifulSoup

# provide the url of the channel whose data you want to fetch
urls = [
    'https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ'
]


def main():
    driver = webdriver.Chrome()
    for url in urls:
        # open the channel's "Videos" tab, sorted by popularity
        driver.get('{}/videos?view=0&sort=p&flow=grid'.format(url))
        content = driver.page_source.encode('utf-8').strip()
        soup = BeautifulSoup(content, 'lxml')

        # titles and links share the id 'video-title';
        # view count and upload time are rendered in consecutive spans
        titles = soup.find_all('a', id='video-title')
        views = soup.find_all(
            'span', class_='style-scope ytd-grid-video-renderer')
        video_urls = soup.find_all('a', id='video-title')

        print('Channel: {}'.format(url))
        i = 0  # index into the views/time spans
        j = 0  # index into the video urls
        for title in titles[:10]:
            print('\n{}\t{}\t{}\thttps://www.youtube.com{}'.format(
                title.text, views[i].text, views[i + 1].text,
                video_urls[j].get('href')))
            i += 2
            j += 1
    driver.quit()


main()
Output