
📅  Last Modified: 2022-05-13 01:55:18.389000             🧑  Author: Mango

Extracting author information from a GeeksforGeeks article using Python

In this article, we will write a Python script to extract author information from a GeeksforGeeks article.

Modules needed

  • bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. This module does not come built-in with Python. To install it, type the command below in the terminal (a quick import check follows this list).
pip install bs4
  • requests: Requests allows you to send HTTP/1.1 requests very easily. This module also does not come built-in with Python. To install it, type the command below in the terminal.
pip install requests
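
Once both packages are installed, a quick import check (a minimal sketch; the __version__ attributes are standard in both libraries) confirms that they are available:

Python3

# quick sanity check that both dependencies can be imported
import requests
import bs4

print("requests version:", requests.__version__)
print("bs4 version:", bs4.__version__)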

Approach:

  • Import the modules
  • Create a requests instance and pass in the URL
  • Initialize the article title
  • Pass the URL into getdata()
  • Scrape the data with the help of requests and Beautiful Soup
  • Find the required details and filter them (see the minimal flow sketch after this list).
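
Before going step by step, here is a minimal sketch of the request-then-parse flow the script follows; the article URL and the <title> tag are used only for illustration:

Python3

# minimal sketch of the overall flow: fetch a page, then parse it
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.geeksforgeeks.org/optparse-module-in-python")
soup = BeautifulSoup(resp.text, "html.parser")

# the <title> tag is just an easy element to confirm the parse worked
print(soup.title.get_text() if soup.title else "no <title> found")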

Stepwise implementation of the script:

Step 1: Import all the dependencies

Python
# import module
import requests
from bs4 import BeautifulSoup


Step 2: Create a function to fetch the URL

Python3

# link for extracting html data
# making a GET request
def getdata(url):
    r = requests.get(url)
    return r.text
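
If the plain GET request comes back empty or with an error status, a slightly more defensive variant can help. This is a sketch, not part of the original article; the browser-like User-Agent value is an assumption:

Python3

# defensive variant of getdata(): send a browser-like User-Agent,
# time out after 10 seconds, and raise on HTTP error codes
def getdata(url):
    headers = {"User-Agent": "Mozilla/5.0"}
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status()
    return r.text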

Step 3: Now append the article name to the URL, pass the URL to the getdata() function, and convert that data into HTML code.

Python3

# input article slug
article = "optparse-module-in-python"

# url
url = "https://www.geeksforgeeks.org/" + article

# pass the url
# into the getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# display the html code
print(soup)

Output:

Step 4: Traverse the author name from the HTML document.

Python

# traverse author name
for i in soup.find('div', class_="author_handle"):
    Author = i.get_text()
print(Author)

Output:

kumar_satyam
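
Note that soup.find() returns None when the author block is missing (for example, if the page layout changes), which would make the loop above raise a TypeError. A more defensive sketch of the same lookup:

Python

# defensive version of the author lookup: make sure the
# 'author_handle' div exists before reading its text
author_div = soup.find('div', class_="author_handle")
if author_div is not None:
    Author = author_div.get_text().strip()
    print(Author)
else:
    print("author handle not found on the page")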

Step 5: Now build a URL with the author name and fetch its HTML code.

Python3

# now get author information
# with author name
profile = 'https://auth.geeksforgeeks.org/user/' + Author + '/profile'

# pass the url
# into the getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')
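
If the author handle ever contains characters that are not URL-safe, percent-encoding it before building the profile URL is a reasonable precaution (a sketch using the standard library; the original article simply concatenates the strings):

Python3

# build the profile URL with the handle percent-encoded
from urllib.parse import quote

profile = 'https://auth.geeksforgeeks.org/user/' + quote(Author) + '/profile'
print(profile)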

Step 6: Traverse the author information.

Python3

# traverse information of author
name = soup.find(
    'div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()
 
 
author_info = []
for item in soup.find_all('div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())
 
print("Author name :")
print(name)
print("Author information  :")
print(author_info)

Output:

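Beyond printing, the scraped fields can be packaged into a structured object, for example as JSON. This is a minimal sketch; the keys "name" and "details" are illustrative names, not fields defined by the article:

Python3

# package the scraped details into a dictionary and dump it as JSON
import json

author_record = {
    "name": name.strip(),
    "details": [item.strip() for item in author_info],
}
print(json.dumps(author_record, indent=2))
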
Complete code:

Python3

# import module
import requests
from bs4 import BeautifulSoup
 
# link for extract html data
# Making a GET request
 
 
def getdata(url):
    r = requests.get(url)
    return r.text
 
 
# input article slug
article = "optparse-module-in-python"
 
# url
url = "https://www.geeksforgeeks.org/" + article
 
 
# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')
 
# traverse author name
for i in soup.find('div', class_="author_handle"):
    Author = i.get_text()
 
# now get author information
# with author name
profile = 'https://auth.geeksforgeeks.org/user/' + Author + '/profile'
 
# pass the url
# into getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')
 
# traverse information of author
name = soup.find(
    'div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()
 
 
author_info = []
for item in soup.find_all('div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())
 
print("Author name :", name)
print("Author information  :")
print(author_info)

Output: