bs4 (1) - Mango Docs


Last modified: 2023-12-03 15:13:43.301000 | Author: Mango

BeautifulSoup4

Introduction

BeautifulSoup4 (often abbreviated as bs4, after the name of its import package) is a Python library for pulling data out of HTML and XML files, most commonly for web scraping. It parses a document into a tree of Python objects that you can search, navigate, and modify programmatically. This article covers the core features of bs4 and shows how to use them to parse HTML and XML pages.

Installation

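To install bs4, use pip, Python's package installer. The package is published on PyPI as beautifulsoup4; the name bs4 on PyPI is only a placeholder package that pulls in beautifulsoup4, so installing the real name directly is preferred. Open your terminal or command prompt and enter the following command:

pip install beautifulsoup4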
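
A quick way to confirm that the installation worked is to import the package and print its version. This is only a minimal check, assuming the package was installed into the same Python environment you are running:

```python
import bs4
from bs4 import BeautifulSoup  # the class used throughout this article

# If the import succeeds, the library is installed; the version is a plain string.
installed_version = bs4.__version__
print(installed_version)
```

If the import raises ModuleNotFoundError, pip most likely installed the package for a different Python interpreter.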

Usage

To use bs4, import the BeautifulSoup class and pass it the document you want to parse. Here is an example code snippet:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
    <head>
        <title>My Website</title>
    </head>
    <body>
        <h1>Welcome to my website!</h1>
        <p>Here you will find all the information you need.</p>
        <ul>
            <li>Home</li>
            <li>About Us</li>
            <li>Contact</li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

print(soup.prettify())

In this example, we first define an HTML document as a string. We then create a BeautifulSoup object, passing in the document and the name of the parser to use (here html.parser, the parser from Python's standard library, which needs no extra installation). Finally, the prettify method returns the parsed document as an indented string, which we print to the console.
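
Once the document is parsed, individual elements can be reached as attributes of the soup object. The sketch below uses a shortened version of the document above (an assumption made for brevity) and reads the text of the first title and h1 tags:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the sample document.
html_doc = (
    "<html><head><title>My Website</title></head>"
    "<body><h1>Welcome to my website!</h1></body></html>"
)

soup = BeautifulSoup(html_doc, "html.parser")

# soup.<tagname> returns the first tag with that name (or None if absent).
title_text = soup.title.string   # text of a tag that has a single text child
h1_text = soup.h1.get_text()     # all text inside the tag, joined together

print(title_text)
print(h1_text)
```

Attribute-style access only ever returns the first match; to collect every matching tag, use find_all.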

Features

The bs4 library provides a wide range of features for parsing HTML and XML files. Here are some of the most common operations:

Searching for tags

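With bs4, you can search for tags by name, by attribute, or by the text they contain. The find_all method returns every match, while find returns only the first one (or None if nothing matches). The following snippet searches by tag name: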

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
    <head>
        <title>My Website</title>
    </head>
    <body>
        <h1>Welcome to my website!</h1>
        <p>Here you will find all the information you need.</p>
        <ul>
            <li>Home</li>
            <li>About Us</li>
            <li>Contact</li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find all <li> tags
li_tags = soup.find_all('li')
for li in li_tags:
    print(li.text)

# Find the <h1> tag
h1_tag = soup.find('h1')
print(h1_tag.text)

# Find the <title> tag
title_tag = soup.find('title')
print(title_tag.text)
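
Searching by attribute or by text uses the same find and find_all methods with extra arguments. The following is a minimal sketch; the class and id values in the sample markup are made up for illustration and are not part of the document above:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with attributes added so there is something to match on.
html_doc = """
<ul>
    <li class="active">Home</li>
    <li id="about">About Us</li>
    <li>Contact</li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Match on the CSS class (class_ has a trailing underscore because
# "class" is a reserved word in Python).
active_items = soup.find_all("li", class_="active")

# Match on any attribute through the attrs dictionary.
about_item = soup.find("li", attrs={"id": "about"})

# Match on the exact text the tag contains.
contact_item = soup.find("li", string="Contact")

print([li.get_text() for li in active_items])
print(about_item.get_text())
print(contact_item.get_text())
```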

Getting tag attributes

With bs4, you can also get the attributes of a tag using the get method. Here is an example code snippet:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
    <head>
        <title>My Website</title>
    </head>
    <body>
        <h1>Welcome to my website!</h1>
        <p>Here you will find all the information you need.</p>
        <ul>
            <li class="active">Home</li>
            <li>About Us</li>
            <li>Contact</li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Find the first <li> tag
li_tag = soup.find('li')

# Get the class attribute of the <li> tag.
# class can hold several values, so bs4 returns it as a list.
li_class = li_tag.get('class')
print(li_class)  # ['active']
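
Besides get, a tag's attributes can be read with dictionary-style indexing or inspected all at once. The sketch below uses a small made-up link element to show the differences, including what happens when an attribute is missing:

```python
from bs4 import BeautifulSoup

# Hypothetical one-tag document used only for this illustration.
soup = BeautifulSoup('<a class="nav active" href="/home">Home</a>', "html.parser")
link = soup.find("a")

classes = link.get("class")               # multi-valued attribute: a list
href = link["href"]                       # indexing raises KeyError if missing
missing = link.get("title")               # get returns None if missing
fallback = link.get("title", "no title")  # or a default you supply
has_href = link.has_attr("href")          # True if the attribute is present
all_attrs = link.attrs                    # every attribute as a dictionary

print(classes, href, missing, fallback, has_href)
print(all_attrs)
```

Prefer get when an attribute may be absent, and indexing when its absence should be treated as an error.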

Modifying the HTML

With bs4, you can modify the HTML by adding, deleting, or modifying tags and their attributes. Here is an example code snippet:

from bs4 import BeautifulSoup

html_doc = """
<!DOCTYPE html>
<html>
    <head>
        <title>My Website</title>
    </head>
    <body>
        <h1>Welcome to my website!</h1>
        <p>Here you will find all the information you need.</p>
        <ul>
            <li class="active">Home</li>
            <li>About Us</li>
            <li>Contact</li>
        </ul>
    </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Add a new <li> tag
new_li_tag = soup.new_tag('li')
new_li_tag.string = 'Blog'
soup.ul.append(new_li_tag)

# Remove the class attribute of the first <li> tag
li_tag = soup.find('li')
del li_tag['class']

# Change the text of the <h1> tag
h1_tag = soup.find('h1')
h1_tag.string = 'Welcome to my new website!'

print(soup.prettify())
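
The example above deletes an attribute; deleting a whole tag works differently. A minimal sketch, assuming a cut-down list as input, removes one item with decompose and sets an attribute on another:

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down list used only for this illustration.
soup = BeautifulSoup(
    "<ul><li>Home</li><li>About Us</li><li>Contact</li></ul>", "html.parser"
)

# Remove the "Contact" item and its contents from the tree.
contact = soup.find("li", string="Contact")
contact.decompose()

# Add (or overwrite) an attribute by assigning to it like a dictionary key.
first = soup.find("li")
first["class"] = "active"

remaining = [li.get_text() for li in soup.find_all("li")]
result = str(soup)

print(remaining)
print(result)
```

decompose destroys the tag; if you need to keep the removed tag for reuse elsewhere, extract returns it instead.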

Conclusion

bs4 takes much of the tedium out of web scraping. With a handful of methods (find and find_all for searching, get for reading attributes, and new_tag, append, and del for editing), you can extract and reshape the data in HTML and XML files. To learn more about bs4, refer to the official documentation.