How to Extract YouTube Data in Python?
Prerequisite: BeautifulSoup
The statistics of a YouTube channel can be used for analysis, and they can be extracted with Python code. A lot of data can be retrieved, such as viewCount, subscriberCount and videoCount. This article discusses two ways to do this.
Method 1: Using the YouTube API
First, we need to generate an API key. You need a Google account to access the Google API Console, request an API key and register your application. You can do this from the Google API page.
To extract the data, we need the channel ID of the YouTube channel whose statistics we want to look at. To get the channel ID, visit that particular YouTube channel and copy the last part of the URL (in the example given below, the channel ID of the GeeksForGeeks channel is used).
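For example, a minimal sketch (plain string handling, nothing API-specific) of pulling the channel ID out of a channel URL:

# A minimal sketch: the channel ID is the last path segment of a URL of
# the form https://www.youtube.com/channel/<channel_id>
url = 'https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ'
channel_id = url.rstrip('/').split('/')[-1]
print(channel_id)  # UC0RhatS1pyxInC00YKjjBqQ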
Approach
- First create youtube_statistics.py
- In this file, use the YTstats class to extract the data and generate a JSON file containing all of the extracted data.
- Now create main.py
- In main.py, import youtube_statistics.py
- Add the API key and the channel ID
- Now, using the first file, the data corresponding to the given key will be retrieved and saved to a JSON file (the rough shape of the API response is sketched right after this list).
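For orientation, a channels request with part=statistics against the YouTube Data API v3 returns JSON roughly shaped like the sketch below; the field names follow the API reference, but the values here are made-up placeholders:

# Rough shape of the response from
# https://www.googleapis.com/youtube/v3/channels?part=statistics&id=<channel_id>&key=<API_KEY>
# (placeholder values, not real figures)
example_response = {
    "kind": "youtube#channelListResponse",
    "items": [
        {
            "kind": "youtube#channel",
            "id": "<channel_id>",
            "statistics": {
                "viewCount": "123456789",
                "subscriberCount": "500000",
                "hiddenSubscriberCount": False,
                "videoCount": "4000"
            }
        }
    ]
}

# get_channel_statistics() below keeps only example_response["items"][0]["statistics"]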
Example:
Code for the main.py file:
Python3
from youtube_statistics import YTstats
# paste the API key generated by you here
API_KEY = "AIzaSyA-0KfpLK04NpQN1XghxhSlzG-WkC3DHLs"
# paste the channel id here
channel_id = "UC0RhatS1pyxInC00YKjjBqQ"
yt = YTstats(API_KEY, channel_id)
yt.get_channel_statistics()
yt.dump()
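After main.py has run, the dumped JSON file can be read back to inspect the individual counts. A minimal sketch, assuming the geeksforgeeks.json filename that the dump() method shown further below derives from the channel title:

import json

# Read the file written by YTstats.dump() (named after the channel title,
# here geeksforgeeks.json) and print each statistics field.
with open('geeksforgeeks.json') as f:
    stats = json.load(f)

for key, value in stats.items():
    print(f'{key}: {value}')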
Code for the youtube_statistics.py file:
Python3
import requests
import json


class YTstats:
    def __init__(self, api_key, channel_id):
        self.api_key = api_key
        self.channel_id = channel_id
        self.channel_statistics = None

    def get_channel_statistics(self):
        # request the statistics part for the given channel from the
        # YouTube Data API v3
        url = f'https://www.googleapis.com/youtube/v3/channels?part=statistics&id={self.channel_id}&key={self.api_key}'
        json_url = requests.get(url)
        data = json.loads(json_url.text)
        try:
            data = data["items"][0]["statistics"]
        except (KeyError, IndexError):
            # the response did not contain the expected fields
            data = None
        self.channel_statistics = data
        return data

    def dump(self):
        if self.channel_statistics is None:
            return
        channel_title = "GeeksForGeeks"
        channel_title = channel_title.replace(" ", "_").lower()
        # generate a json file with all the statistics data of the youtube channel
        file_name = channel_title + '.json'
        with open(file_name, 'w') as f:
            json.dump(self.channel_statistics, f, indent=4)
        print('file dumped')
Output:
Method 2: Using BeautifulSoup
Beautiful Soup is a Python library for pulling data out of HTML and XML files. In this method, we use BeautifulSoup together with Selenium to scrape data from a YouTube channel. The program reports the view count, posting time, title and URL of each video and prints them using Python's string formatting.
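Before the full example, here is a tiny self-contained sketch of the BeautifulSoup calls it relies on, run against a hand-written HTML fragment that only mimics the markup the scraper looks for (it is not real YouTube markup):

from bs4 import BeautifulSoup

# A self-contained sketch: a hand-written HTML fragment that mimics the
# elements the scraper below searches for.
html = """
<div>
  <a id="video-title" href="/watch?v=abc123">Sample video title</a>
  <span class="style-scope ytd-grid-video-renderer">1.2K views</span>
  <span class="style-scope ytd-grid-video-renderer">2 days ago</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

# titles/links are matched by id, views and posting time by class
title = soup.find('a', id='video-title')
spans = soup.find_all('span', class_='style-scope ytd-grid-video-renderer')

print(title.text.strip(), 'https://www.youtube.com' + title.get('href'))
print([s.text for s in spans])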
Approach
- Import the required modules
- Provide the URL of the channel whose data you want to fetch
- Extract the data
- Display the fetched data.
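Note: the example below launches a visible Chrome window. If you would rather run it without a window, Selenium's Chrome options can be passed to the driver; a minimal sketch, assuming a standard Chrome and chromedriver setup:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# A minimal sketch: start Chrome without opening a visible window.
# '--headless' is a standard Chrome command-line flag; newer Chrome
# releases may prefer '--headless=new'.
options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ/videos')
print(driver.title)
driver.quit()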
Example:
Python3
# import required packages
from selenium import webdriver
from bs4 import BeautifulSoup

# provide the url of the channel whose data you want to fetch
urls = [
    'https://www.youtube.com/channel/UC0RhatS1pyxInC00YKjjBqQ'
]


def main():
    driver = webdriver.Chrome()
    for url in urls:
        # open the channel's "Videos" tab, sorted by popularity
        driver.get('{}/videos?view=0&sort=p&flow=grid'.format(url))
        content = driver.page_source.encode('utf-8').strip()
        soup = BeautifulSoup(content, 'lxml')

        # titles and links share the id 'video-title';
        # view count and upload time are rendered in consecutive spans
        titles = soup.find_all('a', id='video-title')
        views = soup.find_all(
            'span', class_='style-scope ytd-grid-video-renderer')
        video_urls = soup.find_all('a', id='video-title')

        print('Channel: {}'.format(url))
        i = 0  # index into the views/time spans
        j = 0  # index into the video urls
        for title in titles[:10]:
            print('\n{}\t{}\t{}\thttps://www.youtube.com{}'.format(
                title.text, views[i].text, views[i + 1].text,
                video_urls[j].get('href')))
            i += 2
            j += 1
    driver.quit()


main()
Output