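Extracting Author Information from a GeeksforGeeks Article Using Python

In this article, we will write a Python script that extracts author information from a GeeksforGeeks article.

Modules required

bs4: Beautiful Soup (bs4) is a Python library for pulling data out of HTML and XML files. It is not part of the Python standard library. To install it, run the following command in a terminal.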

pip install bs4

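requests: Requests lets you send HTTP/1.1 requests with very little code. It is also not part of the Python standard library. To install it, run the following command in a terminal.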

pip install requests

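Approach:

Import the modules
Create a function that makes the request for a given URL
Set the article name
Pass the URL to getdata()
Scrape the data with the help of Requests and Beautiful Soup
Find the required details and filter them.

Step-by-step implementation:

Step 1: Import all dependencies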

Python

# import module
import requests
from bs4 import BeautifulSoup

Step 2: Create a function that fetches a URL

Python3

# make a GET request and return the
# response body (the page's HTML) as text
def getdata(url):
    r = requests.get(url)
    return r.text
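
A slightly more defensive variant of this function could fail fast on network problems instead of returning an error page as if it were the article. The sketch below is not part of the original script: the name getdata_checked and the 10-second timeout are assumptions chosen for illustration, while the timeout argument and raise_for_status() are standard Requests features.

```python
import requests


def getdata_checked(url, timeout=10):
    """Return the body of `url` as text, raising on HTTP errors.

    Hypothetical variant of getdata(); the default timeout of
    10 seconds is an illustrative value, not from the article.
    """
    # give up if the server does not answer within `timeout` seconds
    response = requests.get(url, timeout=timeout)
    # raise requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

It can be used as a drop-in replacement for getdata() in the later steps, since it takes the same URL argument and returns the same kind of string.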

Step 3: Append the article name to the base URL, pass the URL to the getdata() function, and parse the returned HTML

Python3

# name of the article to look up
article = "optparse-module-in-python"

# build the article URL
url = "https://www.geeksforgeeks.org/" + article

# pass the url
# into getdata function
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# display html code
print(soup)

Output: the full HTML of the article page is printed.
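
Building the URL by plain string concatenation works for a clean article name, but it breaks on stray slashes or spaces. As a minimal sketch, assuming the article name is always a single path segment, the helper below (article_url is a hypothetical name, not from the original script) normalises the input with the standard library before joining it to the base URL.

```python
from urllib.parse import quote

BASE_URL = "https://www.geeksforgeeks.org/"


def article_url(slug):
    """Build the article URL from a name such as 'optparse-module-in-python'."""
    # drop surrounding whitespace and slashes so exactly one "/" separates
    # the base URL from the article name
    cleaned = slug.strip().strip("/")
    # percent-encode characters that are not valid in a URL path
    return BASE_URL + quote(cleaned)


print(article_url("optparse-module-in-python"))
```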

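Step 4: Extract the author's name from the HTML document.

Python3

# read the author's name from the article page
# (assumes the page contains a <div class="author_handle">)
Author = soup.find('div', class_="author_handle").get_text(strip=True)
print(Author)

Output: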

kumar_satyam
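
find() returns None when no element matches, so a script that runs against many articles may want to check for that case. The sketch below runs offline against a hand-written HTML fragment; both SAMPLE_HTML and the helper name extract_author are assumptions made for illustration, and the real page markup may differ.

```python
from bs4 import BeautifulSoup

# hypothetical fragment standing in for the article page
SAMPLE_HTML = '<div class="author_handle"><a href="#">kumar_satyam</a></div>'


def extract_author(html):
    """Return the author's handle, or None if the page has no author <div>."""
    soup = BeautifulSoup(html, "html.parser")
    handle_div = soup.find("div", class_="author_handle")
    # find() gives None when nothing matches
    if handle_div is None:
        return None
    # strip=True removes surrounding whitespace from the text
    return handle_div.get_text(strip=True)


print(extract_author(SAMPLE_HTML))
print(extract_author("<p>no author here</p>"))
```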

Step 5: Build the profile URL from the author's name and fetch its HTML.

Python3

# build the profile URL
# from the author name
profile = 'https://auth.geeksforgeeks.org/user/' + Author + '/profile'

# pass the url
# into getdata function
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')

Step 6: Extract the author's information.

Python3

# read the author's display name
name = soup.find(
    'div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()

# collect the remaining profile fields
author_info = []
for item in soup.find_all('div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :", name)
print("Author information :")
print(author_info)

Output:

Author name : Satyam Kumar
Author information :
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
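
The two lookups in this step rely on how Beautiful Soup treats a class_ string that contains spaces: it is compared against the whole class attribute, so the shorter string does not match the name element that also carries medText. The sketch below illustrates this offline. SAMPLE_PROFILE is a hand-written fragment modelled on the output above, not the real profile page, and the CSS selector at the end is shown only for contrast.

```python
from bs4 import BeautifulSoup

INFO_CLASS = "mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold"
NAME_CLASS = INFO_CLASS + " medText"

# hypothetical fragment standing in for the profile page
SAMPLE_PROFILE = f"""
<div class="{NAME_CLASS}">Satyam Kumar</div>
<div class="{INFO_CLASS}">LNMI patna</div>
<div class="{INFO_CLASS}">
https://www.linkedin.com/in/satyam-kumar-174273101/</div>
"""

soup = BeautifulSoup(SAMPLE_PROFILE, "html.parser")

# a multi-class string is matched against the full attribute value
name = soup.find("div", class_=NAME_CLASS).get_text(strip=True)
info = [d.get_text(strip=True) for d in soup.find_all("div", class_=INFO_CLASS)]

# a CSS selector on one class matches every element that carries it
all_bold = soup.select("div.textBold")

print(name)
print(info)
print(len(all_bold))
```

Here get_text(strip=True) is used so that the leading newline seen in the output above is removed.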


Complete code:

Python3

# import the required modules
import requests
from bs4 import BeautifulSoup


# make a GET request and return the
# response body (the page's HTML) as text
def getdata(url):
    r = requests.get(url)
    return r.text


# name of the article to look up
article = "optparse-module-in-python"

# build the article URL
url = "https://www.geeksforgeeks.org/" + article

# fetch the article page and parse it
htmldata = getdata(url)
soup = BeautifulSoup(htmldata, 'html.parser')

# read the author's name from the article page
# (assumes the page contains a <div class="author_handle">)
Author = soup.find('div', class_="author_handle").get_text(strip=True)

# build the profile URL from the author name
profile = 'https://auth.geeksforgeeks.org/user/' + Author + '/profile'

# fetch the profile page and parse it
htmldata = getdata(profile)
soup = BeautifulSoup(htmldata, 'html.parser')

# read the author's display name
name = soup.find(
    'div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold medText').get_text()

# collect the remaining profile fields
author_info = []
for item in soup.find_all('div', class_='mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold'):
    author_info.append(item.get_text())

print("Author name :", name)
print("Author information :")
print(author_info)

Output:

Author name : Satyam Kumar
Author information :
['LNMI patna', '\nhttps://www.linkedin.com/in/satyam-kumar-174273101/']
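
For reuse, the same steps could be wrapped in a single function that receives the page-fetching function as an argument, which also lets the logic be exercised without network access. This is only a sketch: scrape_author, the returned dictionary layout, and the FAKE_PAGES fragments are assumptions made for illustration, while the URLs and class names are the ones used in the script above.

```python
from bs4 import BeautifulSoup

ARTICLE_BASE = "https://www.geeksforgeeks.org/"
PROFILE_TEMPLATE = "https://auth.geeksforgeeks.org/user/{handle}/profile"
INFO_CLASS = "mdl-cell mdl-cell--9-col mdl-cell--12-col-phone textBold"
NAME_CLASS = INFO_CLASS + " medText"


def scrape_author(article, fetch):
    """Return the author's details for `article`, or None if no author is found.

    `fetch` is any callable that maps a URL to HTML text, for
    example the getdata() function defined in the script above.
    """
    article_soup = BeautifulSoup(fetch(ARTICLE_BASE + article), "html.parser")
    handle_div = article_soup.find("div", class_="author_handle")
    if handle_div is None:
        return None
    handle = handle_div.get_text(strip=True)

    profile_url = PROFILE_TEMPLATE.format(handle=handle)
    profile_soup = BeautifulSoup(fetch(profile_url), "html.parser")
    name_div = profile_soup.find("div", class_=NAME_CLASS)
    info = [d.get_text(strip=True)
            for d in profile_soup.find_all("div", class_=INFO_CLASS)]
    return {
        "handle": handle,
        "name": name_div.get_text(strip=True) if name_div else None,
        "info": info,
    }


# hypothetical pages used in place of real network responses
FAKE_PAGES = {
    ARTICLE_BASE + "optparse-module-in-python":
        '<div class="author_handle"><a href="#">kumar_satyam</a></div>',
    PROFILE_TEMPLATE.format(handle="kumar_satyam"):
        f'<div class="{NAME_CLASS}">Satyam Kumar</div>'
        f'<div class="{INFO_CLASS}">LNMI patna</div>',
}

print(scrape_author("optparse-module-in-python", FAKE_PAGES.__getitem__))
```

Passing getdata as the second argument instead of FAKE_PAGES.__getitem__ would run the same logic against the live site.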
