How to extract scripts and CSS files from a web page in Python?

Prerequisites:

Requests
Beautiful Soup
File handling in Python

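In this article, we will discuss how to extract script and CSS files from a web page using Python.

To do this, we collect the CSS and JavaScript files that the website's source code links to. First, identify the URL of the website to scrape and send a request to it. Once the site's content has been retrieved, two folders are created, one per file type, and the files are placed in them. We can then process the files as needed.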
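The folder step described above is not shown in the examples that follow, which only collect links. Below is a minimal sketch of how it could look, assuming the asset URLs have already been collected. The names `local_name`, `download_assets`, and the `fetch` parameter are hypothetical and not part of the original article; `fetch` is any callable that returns the bytes for a URL, which keeps the sketch independent of a particular HTTP library.

```python
from pathlib import Path, PurePosixPath
from typing import Callable, Iterable
from urllib.parse import urlparse


def local_name(url: str, default: str = "index") -> str:
    """Derive a file name from the last path segment of a URL."""
    # The query string (?v=3 and similar) is not part of .path, so it is dropped.
    name = PurePosixPath(urlparse(url).path).name
    return name or default


def download_assets(urls: Iterable[str], folder: str,
                    fetch: Callable[[str], bytes]) -> list[Path]:
    """Fetch each URL with `fetch` and write the bytes into `folder`."""
    target_dir = Path(folder)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    saved = []
    for url in urls:
        path = target_dir / local_name(url)
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved
```

With Requests, `fetch` could be `lambda url: requests.get(url, timeout=10).content`, called once with a `js` folder and once with a `css` folder. Two URLs that end in the same file name would overwrite each other in this sketch, so a real script would need a naming scheme that avoids collisions.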

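Modules needed

bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module is not built into Python (it can be installed with pip install beautifulsoup4).
requests: Requests lets you send HTTP/1.1 requests very easily. This module is not built into Python either (it can be installed with pip install requests).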
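To make the role of Beautiful Soup concrete, here is a small sketch that parses an inline HTML string instead of a downloaded page, so it runs without network access. `SAMPLE_HTML` and its paths are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only for illustration.
SAMPLE_HTML = """
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <script src="/js/app.js"></script>
  <script>console.log("inline");</script>
</head><body></body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# src=True keeps only <script> tags that carry a src attribute,
# so the inline script is skipped.
script_sources = [tag["src"] for tag in soup.find_all("script", src=True)]

# href=True does the same for <link> tags.
link_targets = [tag["href"] for tag in soup.find_all("link", href=True)]

print(script_sources)
print(link_targets)
```

In a real script the string passed to `BeautifulSoup` would come from Requests, for example `requests.get(url).content`.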

Example 1:

Here we count the number of links fetched for each file type.

Python3

# Import required libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Web URL
web_url = "https://www.geeksforgeeks.org/"

# Get the HTML content
html = requests.get(web_url).content

# Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

js_files = []
cs_files = []

for script in soup.find_all("script"):
    # Only external scripts carry a 'src' attribute
    if script.attrs.get("src"):
        url = script.attrs.get("src")
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        js_files.append(urljoin(web_url, url))

for css in soup.find_all("link"):
    # Note: <link> tags with 'href' can also point to icons and
    # other non-CSS resources; every one of them is counted here
    if css.attrs.get("href"):
        _url = css.attrs.get("href")
        cs_files.append(urljoin(web_url, _url))

print(f"Total {len(js_files)} javascript files found")
print(f"Total {len(cs_files)} CSS files found")

Output:

Total 7 javascript files found
Total 14 CSS files found
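The `src` and `href` values on a page may be relative paths, protocol-relative, or already absolute, and `<link>` tags are also used for icons and other resources besides stylesheets. The sketch below shows one way to handle both points: it resolves each value against the page URL with `urllib.parse.urljoin` and keeps only `<link>` tags whose `rel` contains `stylesheet`. The function name `extract_assets` and the sample markup in the usage note are hypothetical.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_assets(html: str, base_url: str) -> tuple[list[str], list[str]]:
    """Return (script URLs, stylesheet URLs) as absolute URLs."""
    soup = BeautifulSoup(html, "html.parser")

    # External scripts only: inline <script> tags have no src attribute.
    js_files = [urljoin(base_url, tag["src"])
                for tag in soup.find_all("script", src=True)]

    css_files = []
    for tag in soup.find_all("link", href=True):
        # Beautiful Soup returns rel as a list of tokens, e.g. ["stylesheet"].
        rel = [token.lower() for token in tag.get("rel", [])]
        if "stylesheet" in rel:
            css_files.append(urljoin(base_url, tag["href"]))

    return js_files, css_files
```

For a page at `https://example.com/blog/`, a `src` of `js/app.js` resolves to `https://example.com/blog/js/app.js`, while `/css/site.css` resolves to `https://example.com/css/site.css`. Because non-stylesheet links are filtered out, the CSS count from this function can be lower than the count printed by the example above.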

We can also use file handling to write the fetched links to text files.

Example 2:

Python3

# Import required libraries
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

# Web URL
web_url = "https://www.geeksforgeeks.org/"

# Get the HTML content
html = requests.get(web_url).content

# Parse the HTML content
soup = BeautifulSoup(html, "html.parser")

js_files = []
cs_files = []

for script in soup.find_all("script"):
    # Only external scripts carry a 'src' attribute
    if script.attrs.get("src"):
        url = script.attrs.get("src")
        # urljoin resolves relative paths and leaves absolute URLs unchanged
        js_files.append(urljoin(web_url, url))

for css in soup.find_all("link"):
    # Note: <link> tags with 'href' can also point to icons and
    # other non-CSS resources; every one of them is collected here
    if css.attrs.get("href"):
        _url = css.attrs.get("href")
        cs_files.append(urljoin(web_url, _url))

# Write the links to text files, one link per line
with open("javascript_files.txt", "w") as f:
    for js_file in js_files:
        print(js_file, file=f)

with open("css_files.txt", "w") as f:
    for css_file in cs_files:
        print(css_file, file=f)

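Output: the script prints nothing to the console. It writes the collected links, one per line, to the two text files it opens.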
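A page may reference the same file more than once, and the saved lists are usually read back later for further processing. The following is a minimal sketch of both steps, assuming one link per line as in Example 2. The helper names `write_links` and `read_links` are hypothetical.

```python
from pathlib import Path


def write_links(links, path):
    """Write links one per line, dropping repeats but keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so the first occurrence wins.
    unique = list(dict.fromkeys(links))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


def read_links(path):
    """Read links back from a text file, skipping blank lines."""
    text = Path(path).read_text(encoding="utf-8")
    return [line.strip() for line in text.splitlines() if line.strip()]
```

Using `Path.write_text` and `Path.read_text` closes the file automatically, in the same way the `with open(...)` blocks in Example 2 do.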