📅  Last modified: 2023-12-03 15:13:43.301000             🧑  Author: Mango
BeautifulSoup4 (often abbreviated as bs4) is a Python library for web scraping: it pulls data out of HTML and XML files and makes it easy to extract information from web pages programmatically. In this article, we will explore the main features of the bs4 library and how it can help you parse HTML and XML pages.
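To install bs4, you can use pip, Python's package manager. The library is published on PyPI as beautifulsoup4, while bs4 is the name you import in code. Open your terminal or command prompt and enter the following command:
pip install beautifulsoup4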
To use bs4, we first need to import the library. Here is an example code snippet:
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>My Website</title>
</head>
<body>
<h1>Welcome to my website!</h1>
<p>Here you will find all the information you need.</p>
<ul>
<li>Home</li>
<li>About Us</li>
<li>Contact</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.prettify())
In this example, we first create an HTML document as a string. We then create a BeautifulSoup object, passing in the HTML document along with the name of the parser to use (in this case, html.parser, which ships with Python). Finally, we call the prettify method and print the formatted HTML document to the console.
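Once the BeautifulSoup object exists, its contents can also be read directly instead of printing the whole tree. The following is a minimal sketch using a shortened version of the document above; the variable names page_title, heading, and body_text are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened version of the example document (an assumption for brevity)
html_doc = (
    "<html><head><title>My Website</title></head>"
    "<body><h1>Welcome to my website!</h1>"
    "<p>Here you will find all the information you need.</p></body></html>"
)

soup = BeautifulSoup(html_doc, 'html.parser')

# Attribute-style access returns the first tag with that name
page_title = soup.title.string
heading = soup.h1.get_text()

# get_text() gathers all text under a tag; separator and strip tidy the result
body_text = soup.body.get_text(separator=' ', strip=True)

print(page_title)
print(heading)
print(body_text)
```

Attribute-style access such as soup.title is shorthand for finding the first matching tag, so it returns None when no such tag exists.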
The bs4 library provides a wide range of features for parsing HTML and XML files. Here are some of the common methods you can use:
With bs4, you can search for tags by their name, attributes, or even the text they contain. Here is an example code snippet:
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>My Website</title>
</head>
<body>
<h1>Welcome to my website!</h1>
<p>Here you will find all the information you need.</p>
<ul>
<li>Home</li>
<li>About Us</li>
<li>Contact</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find all <li> tags
li_tags = soup.find_all('li')
for li in li_tags:
    print(li.text)
# Find the <h1> tag
h1_tag = soup.find('h1')
print(h1_tag.text)
# Find the <title> tag
title_tag = soup.find('title')
print(title_tag.text)
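The snippet above searches by tag name only. Searching by attribute or by text might look like the sketch below. The trimmed-down list, the id="contact" attribute, and the variable names are assumptions added for illustration and are not part of the original page.

```python
from bs4 import BeautifulSoup

# Trimmed-down markup with an extra id attribute (illustrative assumption)
html_doc = """
<ul>
<li class="active">Home</li>
<li>About Us</li>
<li id="contact">Contact</li>
</ul>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Search by CSS class; class_ avoids clashing with the Python keyword "class"
active_items = soup.find_all('li', class_='active')
active_texts = [li.text for li in active_items]

# Search by any attribute using the attrs dictionary
contact_item = soup.find('li', attrs={'id': 'contact'})

# Search by exact text; the string argument was called text before bs4 4.4.0
about_item = soup.find('li', string='About Us')

# find() returns None when nothing matches, so check before using the result
missing = soup.find('li', string='Blog')

print(active_texts)
print(contact_item.text)
print(about_item.text)
print(missing)
```

Note that find_all always returns a list (possibly empty), while find returns a single tag or None.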
With bs4, you can also get the attributes of a tag using the get method. Here is an example code snippet:
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>My Website</title>
</head>
<body>
<h1>Welcome to my website!</h1>
<p>Here you will find all the information you need.</p>
<ul>
<li class="active">Home</li>
<li>About Us</li>
<li>Contact</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Find the first <li> tag
li_tag = soup.find('li')
# Get the class attribute of the <li> tag
li_class = li_tag.get('class')
# class is a multi-valued attribute, so this prints a list: ['active']
print(li_class)
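A few details of attribute access are easy to trip over, so here is a small sketch of them. The data-id attribute and the second class value are made up for the example.

```python
from bs4 import BeautifulSoup

# data-id and the extra "first" class are illustrative assumptions
html_doc = '<ul><li class="active first" data-id="1">Home</li><li>About Us</li></ul>'

soup = BeautifulSoup(html_doc, 'html.parser')
first_li, second_li = soup.find_all('li')

# Multi-valued attributes such as class come back as a list of strings
classes = first_li.get('class')

# Single-valued attributes come back as a plain string
data_id = first_li.get('data-id')

# get() returns None for a missing attribute, or a default if one is given
missing_class = second_li.get('class')
fallback_class = second_li.get('class', [])

# attrs exposes every attribute of the tag as a dictionary
all_attrs = first_li.attrs

print(classes, data_id, missing_class, fallback_class, all_attrs)
```

Using get rather than square-bracket access (first_li['class']) avoids a KeyError when the attribute is absent.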
With bs4, you can modify the HTML by adding, deleting, or modifying tags and their attributes. Here is an example code snippet:
from bs4 import BeautifulSoup
html_doc = """
<!DOCTYPE html>
<html>
<head>
<title>My Website</title>
</head>
<body>
<h1>Welcome to my website!</h1>
<p>Here you will find all the information you need.</p>
<ul>
<li class="active">Home</li>
<li>About Us</li>
<li>Contact</li>
</ul>
</body>
</html>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Add a new <li> tag
new_li_tag = soup.new_tag('li')
new_li_tag.string = 'Blog'
soup.ul.append(new_li_tag)
# Remove the class attribute of the first <li> tag
li_tag = soup.find('li')
del li_tag['class']
# Change the text of the <h1> tag
h1_tag = soup.find('h1')
h1_tag.string = 'Welcome to my new website!'
print(soup.prettify())
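The example above removes an attribute but not a whole tag. Deleting a tag, setting an attribute, and inserting at a specific position could be sketched as follows; the shortened markup and the class name "last" are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Shortened markup (illustrative assumption)
html_doc = '<ul><li class="active">Home</li><li>About Us</li><li>Contact</li></ul>'

soup = BeautifulSoup(html_doc, 'html.parser')
items = soup.find_all('li')

# decompose() removes a tag and its contents from the tree
items[1].decompose()

# Assigning to a key sets (or overwrites) an attribute
items[2]['class'] = 'last'

# insert() places a new tag at a given position among the parent's children
new_li = soup.new_tag('li')
new_li.string = 'Blog'
soup.ul.insert(0, new_li)

remaining = [li.text for li in soup.find_all('li')]
result = str(soup)

print(remaining)
print(result)
```

Whereas append adds a child at the end of the parent, insert takes an index, which helps when the order of elements matters.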
bs4 is a powerful library that makes web scraping much easier. Using the features covered here, you can extract the data you need from HTML and XML files. To learn more about bs4, refer to the official documentation.