Python Program to Crawl a Web Page and Get the Most Frequent Words

Last modified: 2023-12-03 15:27:05.381000. Author: Mango

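This program was developed with Python 3.7. It uses the requests and BeautifulSoup libraries together with the standard-library re and collections modules. It crawls the text content of a given web page and reports the most frequent words after stop words are removed.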

Usage

Install the dependencies

pip install requests
pip install beautifulsoup4
pip install lxml

Run the code

python scrapewords.py

Implementation

import requests
from bs4 import BeautifulSoup
import re
import collections


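# Fetch the page and convert it to plain text
def get_text(url):
    # A timeout keeps the request from hanging; HTTP errors raise an exception
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.content, 'lxml')
    # Drop script and style elements so their contents are not counted as text
    for tag in soup(["script", "style"]):
        tag.extract()
    text = soup.get_text()
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text)
    return text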


# Count word frequencies and return the 10 most common words with their counts
def count_words(text):
    # Load stop words (one per line); the with block closes the file afterwards
    with open('stopwords.txt', 'r', encoding='utf-8') as f:
        stop_words = set(f.read().splitlines())
    words = re.findall(r'\b[A-Za-z]{2,}\b', text)
    words_count = collections.Counter(word.lower() for word in words if word.lower() not in stop_words)
    return words_count.most_common(10)


if __name__ == '__main__':
    url = 'https://www.python.org'
    text = get_text(url)
    words_count = count_words(text)
    print('The 10 most common words are:')
    for word, count in words_count:
        print(f'{word}: {count}')

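You need to prepare a stop word list and save it as stopwords.txt in the working directory, one word per line. A list can be downloaded from https://github.com/goto456/stopwords/blob/master/stopwords.txt, or you can build your own. Because the program only counts English words and lowercases them before the lookup, the list should contain lowercase English stop words.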
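
As a rough illustration of the expected file format, the sketch below writes a tiny stop word file and loads it into a set. The helper name load_stop_words, the temporary directory, and the sample words are assumptions made for this example; they are not part of the original program.

```python
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line, ignoring blank lines and letter case."""
    with open(path, 'r', encoding='utf-8') as f:
        return {line.strip().lower() for line in f if line.strip()}


# Build a small sample file so the example runs without any download.
sample_dir = tempfile.mkdtemp()
sample_path = os.path.join(sample_dir, 'stopwords.txt')
with open(sample_path, 'w', encoding='utf-8') as f:
    f.write('The\nand\n\nof\n')

stop_words = load_stop_words(sample_path)
print(sorted(stop_words))
```

Lowercasing the entries while loading means a list that contains capitalized words still matches the lowercased words produced by the counting step.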

How the program works

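First, we fetch the page's HTML with the requests library and parse it into a BeautifulSoup object. Next, we remove the script and style tags and convert the remaining content to text, collapsing runs of whitespace into single spaces. Finally, we use re.findall() to extract every English word of two or more letters.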
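
The two regular-expression steps can be tried on their own, without any network access. The snippet below is a minimal sketch that uses a made-up string in place of the text returned by soup.get_text(); the sample sentence is an assumption for illustration only.

```python
import re

# Sample text standing in for the output of soup.get_text().
raw = "Python   3.12\n\tis out.  A new\nrelease for py3k users!"

# Collapse every run of whitespace (spaces, tabs, newlines) into one space.
clean = re.sub(r'\s+', ' ', raw)

# Keep runs of two or more ASCII letters that sit between word boundaries.
words = re.findall(r'\b[A-Za-z]{2,}\b', clean)

print(clean)
print(words)
```

Note that the pattern skips the single-letter word "A", the number "3.12", and the mixed token "py3k", since none of them is a standalone run of at least two letters.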

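We need to convert every word to lowercase and remove the stop words. For this I defined a function, count_words(), built on the Counter class from the standard-library collections module. Counter counts how many times each element occurs in an iterable and returns a dictionary-like object (a dict subclass) in which each key is an element and each value is the number of times that element occurs.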
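
To make the behavior of Counter concrete, here is a small sketch with a hand-written word list. The list contents are invented for the example.

```python
from collections import Counter

words = ['python', 'docs', 'python', 'news', 'docs', 'python']
counts = Counter(words)

print(counts['python'])          # count of a word that is present
print(counts['missing'])         # absent keys report 0 instead of raising KeyError
print(isinstance(counts, dict))  # Counter is a dict subclass
print(counts.most_common(2))     # (word, count) pairs, highest count first
```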

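After converting the words to lowercase, we iterate over the list and check whether each word is a stop word. If it is not, it is added to words_count and counted. Finally, the function returns the 10 words with the highest counts.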
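
The filtering and counting steps can be sketched as a single function that takes the stop words as an argument, which makes it easy to try without a stopwords.txt file. The name top_words, its parameters, and the sample sentence are hypothetical and only serve this illustration; the original count_words() reads the stop words from a file instead.

```python
import re
from collections import Counter


def top_words(text, stop_words, n=10):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    # Lowercase each word once, then drop the ones found in stop_words.
    words = (w.lower() for w in re.findall(r'\b[A-Za-z]{2,}\b', text))
    return Counter(w for w in words if w not in stop_words).most_common(n)


sample = "The Python docs and the Python news: Python events and docs."
result = top_words(sample, stop_words={'the', 'and'}, n=2)
print(result)
```

Passing the stop words in as a set keeps each membership check fast and lets the same function be reused with different lists.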

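Program output

The program prints the 10 most common words and their counts, as shown below. The exact words and counts depend on the page content at the time the program is run.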

The 10 most common words are:
python: 27
docs: 14
community: 9
news: 9
events: 8
release: 7
downloads: 7
program: 7
learn: 7
support: 6