Implementing Web Scraping in Python with lxml
Web scraping basically refers to fetching important information from one or more websites. Every website has a recognizable structure/pattern of HTML elements.
Steps to perform web scraping:
1. Send a request and get the response from the target URL.
2. Convert the response object to a byte string.
3. Pass the byte string to the 'fromstring' method of the html class in the lxml module.
4. Navigate to a particular element via its XPath.
5. Use the content according to your need.
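The five steps above can be sketched offline. In this minimal sketch the HTTP request (steps 1-2) is simulated by an inline HTML byte string standing in for `response.content`, so only the lxml part actually runs:

```python
from lxml import html

# Steps 1-2 simulated: this byte string stands in for response.content
byte_data = b'<html><body><div><p>Hello from lxml</p></div></body></html>'

# Step 3: parse the byte string with html.fromstring
root = html.fromstring(byte_data)

# Step 4: reach a particular element via XPath
paragraphs = root.xpath('//div/p')

# Step 5: use the content
print(paragraphs[0].text_content())  # Hello from lxml
```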
To accomplish this task, some third-party packages need to be installed. Use pip to install them:
pip install requests
pip install lxml
You also need the XPath of the element from which the data will be scraped. A simple way to get it:
1. Right-click the element to be scraped on the page and choose "Inspect".
2. Right-click the highlighted element in the source code panel on the right.
3. Copy its XPath.
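A copied XPath can be checked quickly against a saved snippet of the page before running a full scrape. Note that `xpath()` returns a list, which is simply empty when the path matches nothing (a small sketch; the snippet and ids here are made up for illustration):

```python
from lxml import html

# a saved fragment standing in for part of the real page
snippet = b'<div id="post-183376"><div><p>Article teaser text</p></div></div>'
root = html.fromstring(snippet)

# the path as copied from the browser's inspector
matches = root.xpath('//*[@id="post-183376"]/div/p')
print(len(matches))                 # 1
print(matches[0].text_content())    # Article teaser text

# a wrong id yields an empty list instead of raising an error
assert root.xpath('//*[@id="no-such-post"]/div/p') == []
```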
Here is a simple implementation on the GeeksforGeeks homepage:
Python3
# Python3 code implementing web scraping using lxml
import requests
# import only html class
from lxml import html
# url to scrap data from
url = 'https://www.geeksforgeeks.org'
# path to particular element
path = '//*[@id="post-183376"]/div/p'
# get response object
response = requests.get(url)
# get byte string
byte_data = response.content
# get filtered source code
source_code = html.fromstring(byte_data)
# jump to preferred html element
tree = source_code.xpath(path)
# print texts in first element in list
print(tree[0].text_content())
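The code above indexes the first match and calls `text_content()`, which gathers all text inside an element, including text in nested tags; the plain `.text` attribute stops at the first child tag. A small sketch of the difference, using an inline fragment:

```python
from lxml import html

# inline fragment with a nested <b> tag inside the first paragraph
byte_data = b'<div><p>First <b>bold</b> part</p><p>Second</p></div>'
root = html.fromstring(byte_data)
paras = root.xpath('//p')

# text_content() concatenates all descendant text
for p in paras:
    print(p.text_content())  # First bold part / Second

# .text only covers text before the first child element
print(paras[0].text)  # 'First '
```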
The code above scrapes the paragraph of the first article on the GeeksforGeeks homepage. Here is a sample output. The output may differ for everyone, since the featured articles change.
Output:
"Consider the following C/C++ programs and try to guess the output?
Output of all of the above programs is unpredictable (or undefined).
The compilers (implementing… Read More »"
Here is another example, with data scraped from the Wikipedia article on web scraping.
Python3
import requests
from lxml import html
# url to scrap data from
link = 'https://en.wikipedia.org/wiki/Web_scraping'
# path to particular element
path = '//*[@id="mw-content-text"]/div/p[1]'
response = requests.get(link)
byte_string = response.content
# get filtered source code
source_code = html.fromstring(byte_string)
# jump to preferred html element
tree = source_code.xpath(path)
# print texts in first element in list
print(tree[0].text_content())
Output:
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites.[1] Web scraping software may access the World Wide Web directly using the Hypertext Transfer Protocol, or through a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying, in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.