Python Program to Recursively Scrape All the URLs of a Website

In this tutorial, we will see how to recursively scrape all the URLs from a website.

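Recursion in computer science is a method of solving a problem in which the solution depends on solutions to smaller instances of the same problem. Such problems can generally also be solved by iteration, but this requires identifying and indexing the smaller instances at programming time.

Note: For more information, refer to Recursion.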
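To make the contrast between recursion and iteration concrete, here is a minimal sketch that walks a small link graph both ways. The `SITE_MAP` dictionary and the two function names are hypothetical and exist only for illustration; no network access is involved.

```python
# Hypothetical link graph: each page maps to the pages it links to.
SITE_MAP = {
    "/": ["/about", "/blog"],
    "/about": [],
    "/blog": ["/blog/post-1", "/blog/post-2"],
    "/blog/post-1": [],
    "/blog/post-2": ["/about"],
}


def collect_recursive(page, seen=None):
    """Visit `page`, then solve the same problem for each page it links to."""
    if seen is None:
        seen = []
    if page in seen:
        return seen
    seen.append(page)
    for child in SITE_MAP[page]:
        collect_recursive(child, seen)
    return seen


def collect_iterative(start):
    """Same traversal, but the pending pages are tracked by hand in a stack."""
    seen = []
    stack = [start]
    while stack:
        page = stack.pop()
        if page in seen:
            continue
        seen.append(page)
        # Push children in reverse so the first child is popped first.
        stack.extend(reversed(SITE_MAP[page]))
    return seen


print(collect_recursive("/"))
print(collect_iterative("/"))
```

In the recursive version the call stack remembers which pages are still pending; in the iterative version that bookkeeping has to be written out explicitly.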

Required Modules and Installation

Requests:
Requests allows you to send HTTP/1.1 requests very easily. There is no need to manually add query strings to your URLs.
```
pip install requests
```
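As a small illustration of the query-string point, the sketch below builds a request without sending it and inspects the resulting URL. The `/search` path and the parameter names are assumptions made up for this example, not part of the target site.

```python
import requests

# Build (but do not send) a GET request; the path and params are hypothetical.
request = requests.Request(
    "GET",
    "http://example.webscraping.com/search",
    params={"q": "python", "page": 2},
)
prepared = request.prepare()

# Requests has encoded the params dict into the query string.
print(prepared.url)
```

Passing the same `params` dictionary to `requests.get` would send the request with that URL.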
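Beautiful Soup:
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits on top of an HTML or XML parser and provides Pythonic idioms for iterating over, searching, and modifying the parse tree.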
```
pip install beautifulsoup4
```
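A minimal sketch of searching the parse tree for links, using an inline HTML string instead of a downloaded page. The HTML snippet is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: note that the last <a> tag has no href.
html = """
<ul>
  <li><a href="/places/1">First place</a></li>
  <li><a href="https://example.org/">External site</a></li>
  <li><a name="top">Anchor without a link</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Tag.get returns None when the attribute is missing instead of raising KeyError.
hrefs = [a.get("href") for a in soup.find_all("a")]
print(hrefs)
```

Because not every `<a>` tag carries an `href`, reading the attribute with `get` is safer than indexing `attrs['href']` directly.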

Code:

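from urllib.parse import urljoin

from bs4 import BeautifulSoup
import requests

# URLs that have already been discovered
urls = []


def scrape(site):
    """Recursively print every site-relative link reachable from `site`."""

    # download the page
    r = requests.get(site)

    # parse the HTML of the response
    s = BeautifulSoup(r.text, "html.parser")

    for i in s.find_all("a"):

        # some <a> tags have no href attribute, so avoid a KeyError
        href = i.attrs.get("href", "")

        # follow only site-relative links ("//host/..." points to another host)
        if href.startswith("/") and not href.startswith("//"):
            # resolve against the current page without overwriting `site`,
            # so later links on this page are not joined to the wrong base
            url = urljoin(site, href)

            if url not in urls:
                urls.append(url)
                print(url)
                # the function calls itself on the newly found page
                scrape(url)


# main function
if __name__ == "__main__":

    # website to be scraped
    site = "http://example.webscraping.com//"

    # start the recursive crawl
    scrape(site)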
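On a large site, a recursive crawler can run into Python's recursion limit, so the same idea is sometimes written iteratively with an explicit queue. The following is a sketch under stated assumptions, not the method used above: it swaps in the standard library's `HTMLParser` for Beautiful Soup, takes a `fetch` callable instead of calling `requests.get` directly, and is demonstrated on a hypothetical in-memory site (`FAKE_SITE`) so that it runs without network access. The names `LinkParser`, `crawl`, `fake_fetch`, and `max_pages` are all invented for this example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects the href value of every <a> tag that has one."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(start, fetch, max_pages=100):
    """Breadth-first crawl of URLs on the same host as `start`.

    `fetch` is any callable that maps a URL to its HTML text.
    Returns the discovered URLs, including `start`, in discovery order.
    """
    host = urlparse(start).netloc
    seen = {start}
    found = [start]
    queue = deque([start])
    while queue:
        page = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(page))
        for href in parser.hrefs:
            url = urljoin(page, href)
            # Skip other hosts (and schemes such as mailto:) and repeats.
            if urlparse(url).netloc != host or url in seen:
                continue
            if len(found) >= max_pages:
                return found
            seen.add(url)
            found.append(url)
            queue.append(url)
    return found


# Hypothetical in-memory site used in place of real HTTP requests.
FAKE_SITE = {
    "http://site.test/": '<a href="/a">A</a> <a href="/b">B</a> <a>no href</a>',
    "http://site.test/a": '<a href="/b">B</a> <a href="http://other.test/">ext</a>',
    "http://site.test/b": '<a href="/">home</a> <a href="/a/c">C</a>',
    "http://site.test/a/c": "",
}


def fake_fetch(url):
    return FAKE_SITE.get(url, "")


found = crawl("http://site.test/", fake_fetch)
for url in found:
    print(url)
```

To crawl a real site, one could pass something like `lambda url: requests.get(url).text` as `fetch`. The `max_pages` cap is a safeguard against unbounded crawls.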

Output:

[Image: python-web-scraping]