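How to extract script and CSS files from a web page in Python?
Prerequisites:
- Requests
- Beautiful Soup
- File handling in Python
In this article, we will discuss how to extract script and CSS files from a web page using Python.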
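To do this, we will download the CSS and JavaScript files that were attached to the website's source code when it was written. First, identify the URL of the website to be scraped and send a request to it. After the website's content is retrieved, two folders are created, one per file type, and the files are placed in them; we can then perform various operations on them as needed. The examples below cover the first part of this workflow: collecting the links to these files and saving them.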
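The folder step described above can be sketched as follows. This is a minimal illustration, not the article's own code: the helper name save_asset, the folder names js, css and other, and the example.com URLs are all assumptions made for the example, and the file content is passed in directly so the sketch runs without network access.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse


def save_asset(url: str, content: bytes, root: Path) -> Path:
    """Write content under root/js or root/css, chosen by the URL's extension.

    Hypothetical helper: the name and folder layout are illustrative only.
    """
    # Take the last path segment and ignore any query string such as ?v=3
    name = Path(urlparse(url).path).name or "index"
    suffix = Path(name).suffix.lower()
    folder = {".js": "js", ".css": "css"}.get(suffix, "other")

    target_dir = root / folder
    target_dir.mkdir(parents=True, exist_ok=True)

    target = target_dir / name
    target.write_bytes(content)
    return target


if __name__ == "__main__":
    # Demo with made-up content in a temporary directory
    demo_root = Path(tempfile.mkdtemp())
    saved = save_asset("https://example.com/static/app.js?v=3", b"console.log(1);", demo_root)
    print(saved.relative_to(demo_root))
```

In a real run, content would come from a request to each collected link, for example the .content attribute of a requests response.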
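Modules needed
- bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module is not built into Python; it can be installed with pip install beautifulsoup4.
- requests: Requests allows you to send HTTP/1.1 requests very easily. This module is also not built into Python; it can be installed with pip install requests.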
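Before fetching a live page, it may help to see what Beautiful Soup returns on a small document. The sketch below parses an inline HTML string, so no request is needed; SAMPLE_HTML and its paths are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up page: one stylesheet link, one external script, one inline script
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="/js/app.js"></script>
<script>console.log("inline");</script>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# src=True / href=True keep only the tags that carry that attribute,
# so the inline <script> without a src is skipped
script_srcs = [tag["src"] for tag in soup.find_all("script", src=True)]
link_hrefs = [tag["href"] for tag in soup.find_all("link", href=True)]

print(script_srcs)
print(link_hrefs)
```

The values come back exactly as written in the markup, which is why relative paths still have to be resolved against the page URL afterwards.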
Example 1:
Here, we count the number of fetched links of each type.
Python3
# Import required libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Web URL
web_url = "https://www.geeksforgeeks.org/"

# Get the HTML content
html = requests.get(web_url).content

# Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

js_files = []
cs_files = []

for script in soup.find_all("script"):
    # Keep only <script> tags that have a 'src' attribute
    src = script.attrs.get("src")
    if src:
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        js_files.append(urljoin(web_url, src))

for css in soup.find_all("link"):
    # Keep every <link> tag that has an 'href' attribute
    # (this also counts non-stylesheet links such as icons)
    href = css.attrs.get("href")
    if href:
        cs_files.append(urljoin(web_url, href))

print(f"Total {len(js_files)} javascript files found")
print(f"Total {len(cs_files)} CSS files found")
Output (the counts depend on the page's markup at the time of the request):
Total 7 javascript files found
Total 14 CSS files found
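Two details affect how much these counts can be trusted: a src or href value may already be an absolute URL, and not every link tag points to a stylesheet. The sketch below shows one way to handle both, using urljoin from the standard library and a check on the rel attribute. The function name collect_assets, the sample markup, and the example.com URLs are assumptions for illustration; applied to a live page, this stricter filter may report fewer CSS files than the output above.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_assets(html, base_url):
    """Return (js_urls, css_urls) as absolute URLs (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")

    # urljoin resolves relative paths and leaves absolute URLs unchanged
    js_urls = [urljoin(base_url, tag["src"]) for tag in soup.find_all("script", src=True)]

    css_urls = []
    for link in soup.find_all("link", href=True):
        rel = link.get("rel") or []
        if isinstance(rel, str):
            # Some parsers return rel as a plain string instead of a list
            rel = rel.split()
        if "stylesheet" in [value.lower() for value in rel]:
            css_urls.append(urljoin(base_url, link["href"]))

    return js_urls, css_urls


# Made-up markup mixing relative, root-relative and absolute references
SAMPLE_HTML = """
<link rel="stylesheet" href="css/site.css">
<link rel="icon" href="/favicon.ico">
<script src="https://cdn.example.com/lib.js"></script>
<script src="/js/app.js"></script>
"""

js_urls, css_urls = collect_assets(SAMPLE_HTML, "https://example.com/blog/")
print(js_urls)
print(css_urls)
```

The icon link is skipped, the CDN script keeps its own host, and the two relative paths are resolved against the page URL.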
We can also use file handling to write the fetched links to text files.
Example 2:
Python3
# Import required libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Web URL
web_url = "https://www.geeksforgeeks.org/"

# Get the HTML content
html = requests.get(web_url).content

# Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

js_files = []
cs_files = []

for script in soup.find_all("script"):
    # Keep only <script> tags that have a 'src' attribute
    src = script.attrs.get("src")
    if src:
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        js_files.append(urljoin(web_url, src))

for css in soup.find_all("link"):
    # Keep every <link> tag that has an 'href' attribute
    # (this also includes non-stylesheet links such as icons)
    href = css.attrs.get("href")
    if href:
        cs_files.append(urljoin(web_url, href))

# Write the links to text files, one link per line
with open("javascript_files.txt", "w") as f:
    for js_file in js_files:
        print(js_file, file=f)

with open("css_files.txt", "w") as f:
    for css_file in cs_files:
        print(css_file, file=f)
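Output:
The script prints nothing to the console; the links are written to the two text files, one URL per line.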
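To check what was saved, the links can be read back from disk. The sketch below is a small round trip under stated assumptions: write_links and read_links are hypothetical helper names, the URLs are made up, and duplicate links are dropped before writing, which the example above does not do.

```python
import tempfile
from pathlib import Path


def write_links(links, path):
    """Write links one per line, dropping duplicates but keeping order."""
    unique = list(dict.fromkeys(links))
    Path(path).write_text("".join(f"{url}\n" for url in unique), encoding="utf-8")
    return unique


def read_links(path):
    """Read the links back as a list of strings."""
    return Path(path).read_text(encoding="utf-8").splitlines()


if __name__ == "__main__":
    # Demo in a temporary directory with made-up links
    out_file = Path(tempfile.mkdtemp()) / "links.txt"
    write_links(
        ["https://example.com/a.js", "https://example.com/b.js", "https://example.com/a.js"],
        out_file,
    )
    print(read_links(out_file))
```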