在 Flask 中使用 Web Scraping 创建 Cricket Score API

板球是世界著名的户外运动之一。提供实时记分牌的 API 很少，而且没有一个可以免费使用。使用任何可用的记分板，我们可以为自己创建 API。这种方法不仅适用于板球记分牌，也适用于任何在线可用信息。以下是本博客将指导创建 API 和部署它的流程。

设置应用程序目录
来自 NDTV Sports 的网络抓取数据。
- 将使用Python中的 Beautiful Soup。
创建 API。
- 将使用烧瓶。
Heroku 将用于部署，

设置应用程序目录

第 1 步：创建一个文件夹（例如 CricGFG）。

第 2 步：设置虚拟环境。这里我们创建一个环境.env

python -m venv .env

第 3 步：激活环境。

.env\Scripts\activate

获取数据

第 1 步：在Python中，我们有 Beautiful Soup，它是一个从 HTML 文件中提取数据的库。要安装 Beautiful Soup，运行一个简单的命令；

pip install beautifulsoup4

同样，安装Python的 Requests 模块。

pip install requests

我们将使用 NDTV Sports Cricket Scorecard 来获取数据。

第 3 步：以下是从网页抓取数据的步骤。从网页中获取 HTML 文本；

html_text = requests.get(‘https://sports.ndtv.com/cricket/live-scores’).text

编程需要懂一点英语

为了将解析的对象表示为一个整体，我们使用 BeautifulSoup 对象，

soup = BeautifulSoup(html_text, "html.parser")

注意：建议在每一步之后运行并检查代码，以了解差异并彻底理解概念。

例子：

Python

from bs4 import BeautifulSoup
import requests
  
html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
soup = BeautifulSoup(html_text, "html.parser")
print(soup)

Python

from bs4 import BeautifulSoup
import requests
  
html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
soup = BeautifulSoup(html_text, "html.parser")
sect = soup.find_all('div', class_='sp-scr_wrp')
section = sect[0]
description = section.find('span', class_='description').text
location = section.find('span', class_='location').text
current = section.find('div', class_='scr_dt-red').text
link = "https://sports.ndtv.com/" + \
    section.find('a', class_='scr_ful-sbr-txt').get('href')

Python3

from bs4 import BeautifulSoup
import requests
  
html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
soup = BeautifulSoup(html_text, "html.parser")
sect = soup.find_all('div', class_='sp-scr_wrp ind-hig_crd vevent')
  
section = sect[0]
description = section.find('span', class_='description').text
location = section.find('span', class_='location').text
current = section.find('div', class_='scr_dt-red').text
link = "https://sports.ndtv.com/" + section.find(
    'a', class_='scr_ful-sbr-txt').get('href')
  
try:
    status = section.find_all('div', class_="scr_dt-red")[1].text
    block = section.find_all('div', class_='scr_tm-wrp')
    team1_block = block[0]
    team1_name = team1_block.find('div', class_='scr_tm-nm').text
    team1_score = team1_block.find('span', class_='scr_tm-run').text
    team2_block = block[1]
    team2_name = team2_block.find('div', class_='scr_tm-nm').text
    team2_score = team2_block.find('span', class_='scr_tm-run').text
    print(description)
    print(location)
    print(status)
    print(current)
    print(team1_name.strip())
    print(team1_score.strip())
    print(team2_name.strip())
    print(team2_score.strip())
    print(link)
except:
    print("Data not available")

Python3

# We import the Flask Class, an instance of 
# this class will be our WSGI application.
from flask import Flask
  
# We create an instance of this class. The first
# argument is the name of the application’s module 
# or package. __name__ is a convenient shortcut for
# this that is appropriate for most cases.This is
# needed so that Flask knows where to look for resources
# such as templates and static files.
app = Flask(__name__)
  
# We use the route() decorator to tell Flask what URL 
# should trigger our function.
@app.route('/')
def cricgfg():
    return "Welcome to CricGFG!"
  
# main driver function
if __name__ == "__main__":
    
    # run() method of Flask class runs the 
    # application on the local development server.
    app.run(debug=True)

Python3

from flask import Flask, jsonify
  
app = Flask(__name__)
  
@app.route('/')
def cricgfg():
    
    # Creating a dictionary with data to test jsonfiy.
    result = {
        "Description": "Live score England vs India 3rd Test,Pataudi \
        Trophy, 2021",
        "Location": "Headingley, Leeds",
        "Status": "England lead by 223 runs",
        "Current": "Day 2 | Post Tea Session",
        "Team A": "England",
        "Team A Score": "301/3 (96.0)",
        "Team B": "India",
        "Team B Score": "78",
        "Full Scoreboard": "https://sports.ndtv.com//cricket/live-scorecard\
        /england-vs-india-3rd-test-leeds-enin08252021199051",
        "Credits": "NDTV Sports"
    }
    return jsonify(result)
  
if __name__ == "__main__":
    app.run(debug=True)

Python3

import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify
  
app = Flask(__name__)
  
  
@app.route('/')
def cricgfg():
    html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
    soup = BeautifulSoup(html_text, "html.parser")
    sect = soup.find_all('div', class_='sp-scr_wrp ind-hig_crd vevent')
  
    section = sect[0]
    description = section.find('span', class_='description').text
    location = section.find('span', class_='location').text
    current = section.find('div', class_='scr_dt-red').text
    link = "https://sports.ndtv.com/" + section.find(
    'a', class_='scr_ful-sbr-txt').get('href')
  
    try:
        status = section.find_all('div', class_="scr_dt-red")[1].text
        block = section.find_all('div', class_='scr_tm-wrp')
        team1_block = block[0]
        team1_name = team1_block.find('div', class_='scr_tm-nm').text
        team1_score = team1_block.find('span', class_='scr_tm-run').text
        team2_block = block[1]
        team2_name = team2_block.find('div', class_='scr_tm-nm').text
        team2_score = team2_block.find('span', class_='scr_tm-run').text
        result = {
            "Description": description,
            "Location": location,
            "Status": status,
            "Current": current,
            "Team A": team1_name,
            "Team A Score": team1_score,
            "Team B": team2_name,
            "Team B Score": team2_score,
            "Full Scoreboard": link,
            "Credits": "NDTV Sports"
        }
    except:
        pass
    return jsonify(result)
  
if __name__ == "__main__":
    app.run(debug=True)

我们将进一步找到所有必需的 div 和其他标签及其各自的类。

Python

from bs4 import BeautifulSoup
import requests
  
html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
soup = BeautifulSoup(html_text, "html.parser")
sect = soup.find_all('div', class_='sp-scr_wrp')
section = sect[0]
description = section.find('span', class_='description').text
location = section.find('span', class_='location').text
current = section.find('div', class_='scr_dt-red').text
link = "https://sports.ndtv.com/" + \
    section.find('a', class_='scr_ful-sbr-txt').get('href')

代码的下一部分包含我们的数据，即我们的结果。如果由于任何原因代码不存在于 HTML 文件中，则会导致错误，因此将该部分包含在 try 和 except 块中。

完整代码：

蟒蛇3

from bs4 import BeautifulSoup
import requests
  
html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
soup = BeautifulSoup(html_text, "html.parser")
sect = soup.find_all('div', class_='sp-scr_wrp ind-hig_crd vevent')
  
section = sect[0]
description = section.find('span', class_='description').text
location = section.find('span', class_='location').text
current = section.find('div', class_='scr_dt-red').text
link = "https://sports.ndtv.com/" + section.find(
    'a', class_='scr_ful-sbr-txt').get('href')
  
try:
    status = section.find_all('div', class_="scr_dt-red")[1].text
    block = section.find_all('div', class_='scr_tm-wrp')
    team1_block = block[0]
    team1_name = team1_block.find('div', class_='scr_tm-nm').text
    team1_score = team1_block.find('span', class_='scr_tm-run').text
    team2_block = block[1]
    team2_name = team2_block.find('div', class_='scr_tm-nm').text
    team2_score = team2_block.find('span', class_='scr_tm-run').text
    print(description)
    print(location)
    print(status)
    print(current)
    print(team1_name.strip())
    print(team1_score.strip())
    print(team2_name.strip())
    print(team2_score.strip())
    print(link)
except:
    print("Data not available")

输出：

Live score England vs India 3rd Test,Pataudi Trophy, 2021

Headingley, Leeds

England lead by 223 runs

Day 2 | Post Tea Session

England

301/3 (96.0)

India

https://sports.ndtv.com//cricket/live-scorecard/england-vs-india-3rd-test-leeds-enin08252021199051

编程需要懂一点英语

创建 API

我们将使用 Flask，它是一个用Python编写的微型 Web 框架。

pip install Flask

以下是我们的flask 应用程序的启动代码。

蟒蛇3

# We import the Flask Class, an instance of 
# this class will be our WSGI application.
from flask import Flask
  
# We create an instance of this class. The first
# argument is the name of the application’s module 
# or package. __name__ is a convenient shortcut for
# this that is appropriate for most cases.This is
# needed so that Flask knows where to look for resources
# such as templates and static files.
app = Flask(__name__)
  
# We use the route() decorator to tell Flask what URL 
# should trigger our function.
@app.route('/')
def cricgfg():
    return "Welcome to CricGFG!"
  
# main driver function
if __name__ == "__main__":
    
    # run() method of Flask class runs the 
    # application on the local development server.
    app.run(debug=True)

输出：

在浏览器上打开本地主机：

我们现在将我们的 Web Scraping 代码添加到这个和 Flask 提供的一些帮助方法中，以正确返回 JSON 数据。

理解 Jsonify

jsonify 是 Flask 中的一个函数。它将数据序列化为 JavaScript Object Notation (JSON) 格式。考虑以下代码：

蟒蛇3

from flask import Flask, jsonify
  
app = Flask(__name__)
  
@app.route('/')
def cricgfg():
    
    # Creating a dictionary with data to test jsonfiy.
    result = {
        "Description": "Live score England vs India 3rd Test,Pataudi \
        Trophy, 2021",
        "Location": "Headingley, Leeds",
        "Status": "England lead by 223 runs",
        "Current": "Day 2 | Post Tea Session",
        "Team A": "England",
        "Team A Score": "301/3 (96.0)",
        "Team B": "India",
        "Team B Score": "78",
        "Full Scoreboard": "https://sports.ndtv.com//cricket/live-scorecard\
        /england-vs-india-3rd-test-leeds-enin08252021199051",
        "Credits": "NDTV Sports"
    }
    return jsonify(result)
  
if __name__ == "__main__":
    app.run(debug=True)

输出：

现在是时候合并我们所有的代码了。开始吧！

蟒蛇3

import requests
from bs4 import BeautifulSoup
from flask import Flask, jsonify
  
app = Flask(__name__)
  
  
@app.route('/')
def cricgfg():
    html_text = requests.get('https://sports.ndtv.com/cricket/live-scores').text
    soup = BeautifulSoup(html_text, "html.parser")
    sect = soup.find_all('div', class_='sp-scr_wrp ind-hig_crd vevent')
  
    section = sect[0]
    description = section.find('span', class_='description').text
    location = section.find('span', class_='location').text
    current = section.find('div', class_='scr_dt-red').text
    link = "https://sports.ndtv.com/" + section.find(
    'a', class_='scr_ful-sbr-txt').get('href')
  
    try:
        status = section.find_all('div', class_="scr_dt-red")[1].text
        block = section.find_all('div', class_='scr_tm-wrp')
        team1_block = block[0]
        team1_name = team1_block.find('div', class_='scr_tm-nm').text
        team1_score = team1_block.find('span', class_='scr_tm-run').text
        team2_block = block[1]
        team2_name = team2_block.find('div', class_='scr_tm-nm').text
        team2_score = team2_block.find('span', class_='scr_tm-run').text
        result = {
            "Description": description,
            "Location": location,
            "Status": status,
            "Current": current,
            "Team A": team1_name,
            "Team A Score": team1_score,
            "Team B": team2_name,
            "Team B Score": team2_score,
            "Full Scoreboard": link,
            "Credits": "NDTV Sports"
        }
    except:
        pass
    return jsonify(result)
  
if __name__ == "__main__":
    app.run(debug=True)

浏览器输出：

在这里，我们创建了自己的 Cricket API。

在 Heroku 上部署 API

第 1 步：您需要在 Heroku 上创建一个帐户。

第 2 步：在您的机器上安装 Git。

第 3 步：在您的机器上安装 Heroku。

第 4 步：登录您的 Heroku 帐户

heroku login

第 5 步：安装 gunicorn，这是一个用于 WSGI 应用程序的纯 Python HTTP 服务器。它允许您通过运行多个Python进程来同时运行任何Python应用程序。

pip install gunicorn

第 6 步：我们需要在应用程序的根目录中创建一个配置文件，它是一个文本文件，以明确声明应该执行什么命令来启动我们的应用程序。

web: gunicorn CricGFG:app

第 7 步：我们进一步创建一个 requirements.txt 文件，其中包含 Heroku 运行我们的 Flask 应用程序所需的所有必要模块。

pip freeze >> requirements.txt

第 8 步：在 Heroku 上创建一个应用程序，单击此处。

第 9 步：我们现在初始化一个 git 存储库并将我们的文件添加到其中。

git init
git add .
git commit -m "Cricket API Completed"

第 10 步：我们现在将 Heroku 指向我们的 git 存储库。

heroku git:remote -a cricgfg

第 11 步：我们现在将我们的文件推送到 Heroku。

git push heroku master

最后，我们的 API 现在可以在 https://cricgfg.herokuapp.com/ 上使用