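Extracting feed details from RSS in Python
In this article, we will see how to extract feed and post details from a Hashnode blog using its RSS feed. Although we use it here for a blog on Hashnode, the same approach works for other feeds as well.
RSS stands for Rich Site Summary (also expanded as Really Simple Syndication). It uses a standard web format to publish frequently changing information such as blog posts, news, audio, and video. An RSS document is usually called a feed, and it consists of text plus metadata such as the publication time and the author's name.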
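To make "text plus metadata" concrete, here is a minimal sketch that reads a tiny RSS 2.0 document with the standard library only. The sample feed, the example.com URLs, and the names SAMPLE_RSS and summarize_rss are made up for illustration. Real feeds carry many more fields.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 document (not taken from any real blog).
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <link>https://example.com/</link>
    <description>A sample feed</description>
    <item>
      <title>First post</title>
      <link>https://example.com/first-post</link>
      <pubDate>Mon, 06 Sep 2021 10:00:00 +0000</pubDate>
    </item>
  </channel>
</rss>
"""


def summarize_rss(xml_text):
    """Return the channel title, link, and a short summary of each item."""
    # In RSS 2.0 the <channel> element holds the blog-level metadata
    # and one <item> element per post.
    channel = ET.fromstring(xml_text).find("channel")
    return {
        "title": channel.findtext("title"),
        "link": channel.findtext("link"),
        "items": [
            {
                "title": item.findtext("title"),
                "published": item.findtext("pubDate"),
            }
            for item in channel.findall("item")
        ],
    }


if __name__ == "__main__":
    print(summarize_rss(SAMPLE_RSS))
```

Hand-written parsing like this only handles the one layout shown. A dedicated feed parser is the more practical choice for real feeds, which vary in format and quality.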
Installing Feedparser:
We will use the Feedparser Python library to parse the blog's RSS feed. It is a very popular library for parsing blog feeds.
pip install feedparser
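Let's go through this step by step:
Step 1: Getting the RSS feed
Use the feedparser.parse() function to create a feed object containing the parsed blog. It takes the URL of the blog feed.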
Python3
import feedparser

# URL of the blog feed
feed_url = "https://vaibhavkumar.hashnode.dev/rss.xml"
blog_feed = feedparser.parse(feed_url)
Step 2: Getting details from the blog.
Python3
# title of the blog site
print(blog_feed.feed.title)
# link of the blog
print(blog_feed.feed.link)
# number of entries (posts) in the feed
print(len(blog_feed.entries))
# details of an individual post can be
# accessed by attribute name
print(blog_feed.entries[0].title)
print(blog_feed.entries[0].link)
print(blog_feed.entries[0].author)
print(blog_feed.entries[0].published)
# lists of tags and authors of the first post
tags = [tag.term for tag in blog_feed.entries[0].tags]
authors = [author.name for author in blog_feed.entries[0].authors]
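The published value above is a plain string. A minimal sketch of turning it into a datetime with the standard library follows, assuming the feed uses the RFC 822 style dates that RSS 2.0 normally carries (Atom feeds use a different format). The function name published_to_iso and the sample date are made up for illustration. Feedparser also exposes a published_parsed field holding a time.struct_time, which may be more convenient.

```python
from email.utils import parsedate_to_datetime


def published_to_iso(published):
    """Convert an RFC 822 style date string to an ISO 8601 string."""
    # parsedate_to_datetime returns a timezone-aware datetime
    # when the string carries an offset such as +0000.
    return parsedate_to_datetime(published).isoformat()


if __name__ == "__main__":
    # Hypothetical value in the shape of an RSS pubDate
    print(published_to_iso("Mon, 06 Sep 2021 10:00:00 +0000"))
    # prints 2021-09-06T10:00:00+00:00
```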
Below is the complete implementation. Using the code above, we write a function that takes the link of an RSS feed and returns the details.
Python3
def get_posts_details(rss=None):
    """
    Take the link of an RSS feed as argument and
    return the blog and post details as a dictionary.
    """
    if rss is not None:
        # import the library only when a feed URL is passed
        import feedparser
        # parse the blog feed
        blog_feed = feedparser.parse(rss)
        # list of blog entries, available via .entries
        posts = blog_feed.entries
        # dictionary for holding the post details
        posts_details = {"Blog title": blog_feed.feed.title,
                         "Blog link": blog_feed.feed.link}
        post_list = []
        # iterate over individual posts
        for post in posts:
            temp = dict()
            # if a post lacks a field, stop filling in its details
            # and keep whatever was collected so far
            try:
                temp["title"] = post.title
                temp["link"] = post.link
                temp["author"] = post.author
                temp["time_published"] = post.published
                temp["tags"] = [tag.term for tag in post.tags]
                temp["authors"] = [author.name for author in post.authors]
                temp["summary"] = post.summary
            except AttributeError:
                pass
            post_list.append(temp)
        # store the list of posts in the dictionary
        posts_details["posts"] = post_list
        return posts_details  # the details, as a dictionary
    else:
        return None

if __name__ == "__main__":
    import json
    feed_url = "https://vaibhavkumar.hashnode.dev/rss.xml"
    data = get_posts_details(rss=feed_url)  # blog data as a dictionary
    if data:
        # print as a JSON string with indentation level 2
        print(json.dumps(data, indent=2))
    else:
        print("None")
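One consequence of the single try block in the function above is that the first missing field ends the collection for that post, so a post without an author also loses its tags and summary. A possible alternative, sketched here, reads each field with a default instead. Feedparser entries are dict-like, so .get() is available on them. Plain dictionaries stand in for entries below so the example runs without network access. The helper name entry_to_dict and the sample data are made up for illustration.

```python
def entry_to_dict(entry):
    """Collect post details, using defaults for any missing field."""
    return {
        "title": entry.get("title", ""),
        "link": entry.get("link", ""),
        "author": entry.get("author", ""),
        "time_published": entry.get("published", ""),
        # a missing "tags" key yields an empty list instead of an error
        "tags": [tag.get("term", "") for tag in entry.get("tags", [])],
        "summary": entry.get("summary", ""),
    }


if __name__ == "__main__":
    # Hypothetical entries: the second one has no author and no tags.
    complete = {
        "title": "First post",
        "link": "https://example.com/first-post",
        "author": "A. Writer",
        "published": "Mon, 06 Sep 2021 10:00:00 +0000",
        "tags": [{"term": "python"}, {"term": "rss"}],
        "summary": "An introduction.",
    }
    partial = {
        "title": "Second post",
        "link": "https://example.com/second-post",
        "summary": "No author or tags here.",
    }
    for entry in (complete, partial):
        print(entry_to_dict(entry))
```

With this shape every post dictionary has the same keys, which tends to make the JSON output easier to consume downstream.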
Output: