用于遍历所有内部 URL 并报告“traverse.errors”文件中的任何错误的 Shell 脚本(1)

📌 相关文章

📜 用于遍历所有内部 URL 并报告“traverse.errors”文件中的任何错误的 Shell 脚本(1)

📅 最后修改于: 2023-12-03 15:40:55.172000 🧑 作者: Mango

用于遍历所有内部 URL 并报告“traverse.errors”文件中的任何错误的 Shell 脚本

这个 Shell 脚本可以帮助程序员遍历所有内部 URL，并将任何错误报告到“traverse.errors”文件中。下面是脚本的详细介绍和用法说明。

脚本介绍

该脚本包含以下功能：

使用递归的方式遍历网站上的所有内部链接（URL）。
检查每个 URL 是否有效，如果无效，则将其添加到“traverse.errors”文件。
跳过任何外部链接，并只处理网站上的链接。
显示进度条以跟踪处理的 URL 数量和错误数量。

用法说明

该脚本需要使用以下命令运行：

$ bash traverse-urls.sh <url>

其中，<url> 是网站的 URL，该脚本将从这个 URL 开始遍历所有内部链接。

脚本运行时会显示进度条，告诉你已经处理的 URL 数量和错误数量。如果有任何错误，它们将被保存在“traverse.errors”文件中。

脚本代码

#!/bin/bash

# Colors for output messages
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[0;33m'
NC='\033[0m' # No Color

# Create traverse.errors file or clear its contents
echo -n > traverse.errors

function traverse_url() {
    local url="$1"
    local depth="$2"
    
    # Print progress message
    echo -ne "\rProcessed ${GREEN}$processed_urls_cnt${NC} URLs, found ${YELLOW}$errors_cnt${NC} errors." >&2
    
    # Check if URL is a relative or absolute URL
    # and change it to an absolute one
    if [[ $url == *"${full_url}"* ]]; then
        url="${url}"
    elif [[ $url == "/"* ]]; then
        url="${full_url}${url}"
    else
        return
    fi
    
    # Remove any query parameters and fragments from URL
    url="${url%%\?*}"
    url="${url%%\#*}"
    
    # Check if URL has already been processed
    if grep -qFx "${url}" traverse.urls; then
        return
    fi
    
    # Add URL to traverse.urls list
    echo "${url}" >> traverse.urls
    
    # Check if URL exists
    if ! curl -s --head "${url}" | head -n 1 | grep "HTTP/1.[01] [23].."; then
        # Save error message to traverse.errors file
        echo -e "${RED}[ERROR] ${url}${NC}" >> traverse.errors
        ((errors_cnt++))
        return
    fi
    
    # Recursively traverse child links
    while read -r child_url; do
        ((processed_urls_cnt++))
        traverse_url "${child_url}" "$((depth + 1))"
    done < <(curl -s "${url}" | \
             grep -oE '<a [^>]*href="[^"]*"' | \
             cut -d'"' -f2 | \
             sed -e 's/\&amp;/\&/g' -e 's/\&[gl]t;/</g' -e 's/\&[gr]t;/>/g' -e 's/\&quot;/"/g' -e 's/\&apos;/'\''/g' | \
             sort -u)
}

# Program start
processed_urls_cnt=0
errors_cnt=0
full_url="$1"

# Check if URL starts with "http" or "https"
if [[ $full_url != "http"* ]] && [[ $full_url != "https"* ]]; then
    echo "[ERROR] Please provide a valid URL starting with http or https." >&2
    exit 1
fi

echo "Starting traversal of ${full_url} ..." >&2

# Traverse initial URL
traverse_url "${full_url}" 0

# Print final progress message
echo -e "\rFinished processing ${GREEN}$processed_urls_cnt${NC} URLs, found ${YELLOW}$errors_cnt${NC} errors." >&2

注意：返回的代码片段中的注释已经足够详细了，本文档无法对其中的每个段落进行进一步说明。