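Convert HTML source code to a JSON object using Python
In this article, we will see how to convert HTML source code into a JSON object. JSON objects are easy to transfer, and most modern programming languages support them. JavaScript can read JSON and parse it into JavaScript objects, which can then be used to build the HTML for a web page.
We will use the xmltojson module in this article. Its parse function takes HTML as input and returns the parsed result as a JSON string.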
Syntax: xmltojson.parse(xml_input, xml_attribs=True, item_depth=0, item_callback)
Parameters:
- xml_input can be either a file or a string.
- xml_attribs includes element attributes in the output when set to True and omits them when set to False.
- item_depth is the depth of children for which item_callback function is called when found.
- item_callback is a callback function
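To make the shape of the result concrete, here is a rough standard-library sketch that mimics the convention visible in the output at the end of this article: attributes become keys prefixed with "@", and repeated child tags collapse into a list. It is not xmltojson's implementation. The names element_to_value and markup_to_json are hypothetical, and the "#text" key for mixed content is an assumption of this sketch.

```python
import json
import xml.etree.ElementTree as ET


def element_to_value(element):
    """Convert one element into a dict, or into plain text if it is a leaf."""
    # Attributes are stored under "@"-prefixed keys.
    node = {"@" + name: value for name, value in element.attrib.items()}
    for child in element:
        value = element_to_value(child)
        if child.tag in node:
            # A repeated tag turns the existing entry into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (element.text or "").strip()
    if not node:
        # Leaf element without attributes: return its text (or None if empty).
        return text or None
    if text:
        node["#text"] = text  # assumption: key name chosen for this sketch
    return node


def markup_to_json(markup):
    """Parse well-formed markup and return a JSON string."""
    root = ET.fromstring(markup)
    return json.dumps({root.tag: element_to_value(root)})


sample = '<div id="a"><p>Hello</p><input type="text" /><input type="button" /></div>'
print(markup_to_json(sample))
```

The two input elements end up in a single list under the "input" key, which is the same pattern the real output shows for the sample page.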
Environment setup:
Install the required modules:
pip install xmltojson
pip install requests
Steps:
- Import the libraries.
Python3
import xmltojson
import json
import requests
- Fetch the HTML and save it to a file.
Python3
# Sample URL to fetch the HTML page
url = "https://geeksforgeeks-example.surge.sh"
# Headers to mimic a browser
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    )
}
# Get the page through the get() method
html_response = requests.get(url=url, headers=headers)
# Save the page content as sample.html
with open("sample.html", "w") as html_file:
    html_file.write(html_response.text)
- Convert this HTML to JSON with the parse function: open the HTML file and pass its contents to the parse function of the xmltojson module.
Python3
with open("sample.html", "r") as html_file:
    html = html_file.read()
json_ = xmltojson.parse(html)
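A page fetched from the web is not always well-formed XML (an input tag left unclosed is a common case), and a parser built for XML may reject it. Assuming xmltojson's parse expects well-formed XML, as the module name suggests, a quick pre-check with the standard library can catch such pages before conversion. The helper name is_well_formed is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the markup parses as XML, False otherwise."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Self-closed <input /> is valid XML; an unclosed <input> is not.
good = '<html lang="en"><body><input type="text" /></body></html>'
bad = '<html lang="en"><body><input type="text"></body></html>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

If the check fails, the page would need cleaning up before it can be converted this way.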
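- The json_ variable now holds a JSON string, which we can print or dump to a file.
Python3
with open("data.json", "w") as file:
    # json_ is already a JSON string, so decode it before dumping;
    # dumping the string itself would write it with escaped quotes.
    json.dump(json.loads(json_), file)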
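Since parse returns a JSON string rather than a Python object, it is worth seeing what the json module does when it is handed a string. A small standard-library sketch, with a short hand-written string standing in for the real parse result:

```python
import json

# Stand-in for the string returned by parse (an assumption for illustration).
json_ = '{"html": {"@lang": "en"}}'

# Encoding the string itself wraps it in quotes and escapes the inner ones.
double_encoded = json.dumps(json_)

# Decoding first and then encoding yields a plain JSON object again.
clean = json.dumps(json.loads(json_))

print(double_encoded)
print(clean)
```

Decoding with json.loads before dumping, or writing the string to the file directly, keeps data.json as a plain JSON object.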
- Print the output.
Python3
print(json_)
Complete code:
Python3
import xmltojson
import json
import requests

# Sample URL to fetch the HTML page
url = "https://geeksforgeeks-example.surge.sh"
# Headers to mimic a browser
headers = {
    'User-Agent': (
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 '
        '(KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'
    )
}
# Get the page through the get() method
html_response = requests.get(url=url, headers=headers)
# Save the page content as sample.html
with open("sample.html", "w") as html_file:
    html_file.write(html_response.text)
# Read the saved page back and convert it to a JSON string
with open("sample.html", "r") as html_file:
    html = html_file.read()
json_ = xmltojson.parse(html)
# json_ is already a JSON string, so decode it before dumping
with open("data.json", "w") as file:
    json.dump(json.loads(json_), file)
print(json_)
Output:
{"html": {"@lang": "en", "head": {"title": "Document"}, "body": {"div": {"h1": "Geeks For Geeks", "p":
"Welcome to the world of programming geeks!", "input": [{"@type": "text", "@placeholder": "Enter your name"},
{"@type": "button", "@value": "submit"}]}}}}
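Once the JSON string is available, it can be decoded and navigated like any nested dictionary. The sketch below uses the output shown above, pasted in as a literal so it runs without fetching the page. The variable names are only illustrative.

```python
import json

# The output shown above, reproduced as a string literal.
json_ = (
    '{"html": {"@lang": "en", "head": {"title": "Document"}, '
    '"body": {"div": {"h1": "Geeks For Geeks", '
    '"p": "Welcome to the world of programming geeks!", '
    '"input": [{"@type": "text", "@placeholder": "Enter your name"}, '
    '{"@type": "button", "@value": "submit"}]}}}}'
)

page = json.loads(json_)

# Child elements are plain keys; attributes are keys starting with "@".
title = page["html"]["head"]["title"]
div = page["html"]["body"]["div"]
heading = div["h1"]

# The two <input> elements were collected into a list.
input_types = [field["@type"] for field in div["input"]]

print(title)
print(heading)
print(input_types)
```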