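Web Scraping Using Beautifulsoup and the Scrapingdog API
In this post, we will scrape dynamic websites, that is, sites built with JavaScript libraries such as React.js, Vue.js, and Angular.js, which take extra effort to scrape. Installing libraries like Selenium and Puppeteer along with headless browsers like Phantom.js is a simple but lengthy process. However, there is a tool that handles all of this load by itself: a web scraping tool that offers APIs and tools for web scraping.
Why this tool? It helps us scrape dynamic websites through millions of rotating proxies so that we don't get blocked. It also provides a CAPTCHA-clearing facility, and it uses headless Chrome to scrape dynamic websites.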
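What will we need?
Web scraping is divided into two simple parts:
- Fetching data by making an HTTP request
- Extracting the important data by parsing the HTML DOM
We will be using Python with the Scrapingdog API, plus two libraries:
- Beautiful Soup is a Python library for pulling data out of HTML and XML files.
- Requests allows you to send HTTP requests very easily.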
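Setup
Our setup is pretty simple. Just create a folder and install Beautiful Soup and Requests. To create the folder and install the libraries, type the commands given below. I am assuming that you have already installed Python 3.x.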
mkdir scraper
pip install beautifulsoup4
pip install requests
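Now, create a file inside that folder with any name you like. I am using scraping.py.
First, you have to sign up for this web scraping tool. It will provide you with 1000 free credits. Then just import Beautiful Soup and Requests in your file, like this.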
from bs4 import BeautifulSoup
import requests
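Scraping the dynamic content
Now we are familiar with Scrapingdog and how it works. For reference, though, you should read the documentation of this API, which will give you a clear idea of how it works. Next, we will scrape Amazon for the titles of Python books.
We have 16 books on this page. We will fetch the HTML through the Scrapingdog API and then use Beautifulsoup to generate a JSON response. With a single line, we will be able to scrape Amazon. To call the API, I will use Requests.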
r = requests.get("https://api.scrapingdog.com/scrape?api_key=&url=https://www.amazon.com/s?k=python+books&ref=nb_sb_noss_2&dynamic=true").text
This will give you the HTML code of that target URL. Put your own API key after api_key= in the request above.
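The request above embeds the target URL, which itself contains ? and & characters, directly in the query string. One way to keep the parameters unambiguous is to let the standard library percent-encode them. The following is a minimal sketch, not Scrapingdog's documented client: the helper build_api_url and the YOUR_API_KEY placeholder are hypothetical, and only the endpoint and the api_key, url, and dynamic parameter names are taken from the request shown above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint as it appears in the article's request.
API_ENDPOINT = "https://api.scrapingdog.com/scrape"


def build_api_url(api_key, target_url, dynamic=True):
    """Return the API URL with every parameter percent-encoded (hypothetical helper)."""
    query = urlencode({
        "api_key": api_key,
        "url": target_url,  # "&" and "?" inside the target URL get escaped
        "dynamic": "true" if dynamic else "false",
    })
    return f"{API_ENDPOINT}?{query}"


target = "https://www.amazon.com/s?k=python+books&ref=nb_sb_noss_2"
api_url = build_api_url("YOUR_API_KEY", target)  # placeholder key, replace with your own
print(api_url)

# Decoding the query string gives back the original target URL unchanged.
decoded = parse_qs(urlsplit(api_url).query)
print(decoded["url"][0] == target)
```

The resulting string can be passed to requests.get in place of the hand-written URL. Encoding the nested URL this way should keep ref=nb_sb_noss_2 attached to the Amazon address instead of being read as a separate API parameter.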
Now, you have to use BeautifulSoup to parse the HTML.
soup = BeautifulSoup(r, 'html.parser')
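Every title sits in an "h2" tag whose "class" attribute is "a-size-mini a-spacing-none a-color-base s-line-clamp-2". You can see this in the image below.
First, we will find all of those tags using the variable soup.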
allbooks = soup.find_all("h2", {"class": "a-size-mini a-spacing-none a-color-base s-line-clamp-2"})
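Passing the whole class string to find_all matches only tags whose class attribute is exactly that string, in that order. As a hedged alternative, a CSS selector on a single class is less sensitive to the order of the classes. The sketch below runs on a made-up HTML snippet, not real Amazon markup, and the choice of s-line-clamp-2 as the distinguishing class is an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical markup imitating a results page; the second h2 lists the
# same classes in a different order.
SAMPLE_HTML = """
<div>
  <h2 class="a-size-mini a-spacing-none a-color-base s-line-clamp-2"><span>Book One</span></h2>
  <h2 class="a-size-mini s-line-clamp-2 a-color-base a-spacing-none"><span>Book Two</span></h2>
  <h2 class="a-size-medium"><span>Not a title</span></h2>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Exact class string: matches only the first h2.
exact = soup.find_all(
    "h2", {"class": "a-size-mini a-spacing-none a-color-base s-line-clamp-2"}
)

# CSS selector on one class: matches both title headings.
by_selector = soup.select("h2.s-line-clamp-2")

print(len(exact), len(by_selector))
```

Which form is more suitable depends on how stable the page's class names are; both approaches stop working if the site changes its markup.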
Then we will start a loop that uses the length of the variable "allbooks" to reach the title of every book on that page.
l = {}
u = list()
for i in range(0, len(allbooks)):
    # Store the heading text without line breaks
    l["title"] = allbooks[i].text.replace("\n", "")
    u.append(l)
    l = {}
print({"Titles": u})
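The same loop can be written more compactly. The following is a sketch of one possible variant, not the article's original code: extract_titles is a hypothetical helper, the sample HTML is invented for illustration, and get_text(strip=True) is used in place of replace("\n", "") so that surrounding spaces are trimmed as well.

```python
from bs4 import BeautifulSoup


def extract_titles(html):
    """Return one {"title": ...} dict per h2 heading (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all("h2")
    # get_text(strip=True) drops the newlines and padding around the text
    return [{"title": h.get_text(strip=True)} for h in headings]


# Made-up markup standing in for the HTML returned by the API.
SAMPLE_HTML = (
    "<h2>\n<span>Python Tricks</span>\n</h2>"
    "<h2><span> Learning Python </span></h2>"
)

result = {"Titles": extract_titles(SAMPLE_HTML)}
print(result)
```

Iterating over the headings directly avoids the index arithmetic and the temporary dictionary that has to be reset on every pass.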
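The list "u" holds all the titles, and we just need to print it. After printing the list "u" outside the for loop, we get a JSON-like response. It looks like this:
Output: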
{
  "Titles": [
    {
      "title": "Python for Beginners: 2 Books in 1: Python Programming for Beginners, Python Workbook"
    },
    {
      "title": "Python Tricks: A Buffet of Awesome Python Features"
    },
    {
      "title": "Python Crash Course, 2nd Edition: A Hands-On, Project-Based Introduction to Programming"
    },
    {
      "title": "Learning Python: Powerful Object-Oriented Programming"
    },
    {
      "title": "Python: 4 Books in 1: Ultimate Beginner’s Guide, 7 Days Crash Course, Advanced Guide, and Data Science, Learn Computer Programming and Machine Learning with Step-by-Step Exercises"
    },
    {
      "title": "Intro to Python for Computer Science and Data Science: Learning to Program with AI, Big Data and The Cloud"
    },
    {
      "title": "Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython"
    },
    {
      "title": "Automate the Boring Stuff with Python: Practical Programming for Total Beginners"
    },
    {
      "title": "Python: 2 Books in 1: The Crash Course for Beginners to Learn Python Programming, Data Science and Machine Learning + Practical Exercises Included. (Artificial Intelligence, Numpy, Pandas)"
    },
    {
      "title": "Python for Beginners: 2 Books in 1: The Perfect Beginner’s Guide to Learning How to Program with Python with a Crash Course + Workbook"
    },
    {
      "title": "Python: 2 Books in 1: The Crash Course for Beginners to Learn Python Programming, Data Science and Machine Learning + Practical Exercises Included. (Artificial Intelligence, Numpy, Pandas)"
    },
    {
      "title": "The Warrior-Poet’s Guide to Python and Blender 2.80"
    },
    {
      "title": "Python: 3 Manuscripts in 1 book: — Python Programming For Beginners — Python Programming For Intermediates — Python Programming for Advanced"
    },
    {
      "title": "Python: 2 Books in 1: Basic Programming & Machine Learning — The Comprehensive Guide to Learn and Apply Python Programming Language Using Best Practices and Advanced Features."
    },
    {
      "title": "Learn Python 3 the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code (Zed Shaw’s Hard Way Series)"
    },
    {
      "title": "Python Tricks: A Buffet of Awesome Python Features"
    },
    {
      "title": "Python Pocket Reference: Python In Your Pocket (Pocket Reference (O’Reilly))"
    },
    {
      "title": "Python Cookbook: Recipes for Mastering Python 3"
    },
    {
      "title": "Python (2nd Edition): Learn Python in One Day and Learn It Well. Python for Beginners with Hands-on Project. (Learn Coding Fast with Hands-On Project Book 1)"
    },
    {
      "title": "Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems"
    },
    {
      "title": "Hands-On Deep Learning Architectures with Python: Create deep neural networks to solve computational problems using TensorFlow and Keras"
    },
    {
      "title": "Machine Learning: 4 Books in 1: Basic Concepts + Artificial Intelligence + Python Programming + Python Machine Learning. A Comprehensive Guide to Build Intelligent Systems Using Python Libraries"
    }
  ]
}
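Note that print on a Python dictionary writes Python's own representation (single-quoted strings on one line), which is not strictly JSON. If a strict, indented JSON document is wanted, one option is the standard json module. This is a small sketch using two titles from the output above; the variable names result and json_text are illustrative only.

```python
import json

# A shortened stand-in for the {"Titles": u} dictionary built earlier.
result = {
    "Titles": [
        {"title": "Python Tricks: A Buffet of Awesome Python Features"},
        {"title": "Learning Python: Powerful Object-Oriented Programming"},
    ]
}

# indent=2 pretty-prints; ensure_ascii=False keeps characters such as
# typographic apostrophes readable instead of escaping them.
json_text = json.dumps(result, indent=2, ensure_ascii=False)
print(json_text)
```

The string produced this way can be written to a file or returned from a web service and parsed back with json.loads.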