Parsing and Processing URLs Using Python – Regex

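Prerequisite: Regular Expressions in Python

A URL, or Uniform Resource Locator, is made up of several parts, such as the protocol, domain name, port number, and path. Most common URLs can be parsed and processed with regular expressions. To use regular expressions in Python, we import the re library.

Example: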

URL: https://www.geeksforgeeks.org/courses
Parsing the above URL gives:

Hostname: geeksforgeeks.org
Protocol: https

We use the re.findall() function of the re library to search the URL for the required pattern.

Syntax: re.findall(regex, string)

Return: all non-overlapping matches of pattern in string, as a list of strings.
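
The shape of the list returned by re.findall depends on how many capture groups the pattern contains. The following is a minimal sketch of that behavior; the sample string and variable names are made up for illustration.

```python
import re

# Made-up sample text containing two URLs
text = 'http://a.example and https://b.example'

# No capture group: each item is the whole match.
whole = re.findall(r'\w+://', text)

# One capture group: each item is the text of that group.
single = re.findall(r'(\w+)://', text)

# Several capture groups: each item is a tuple, one entry per group.
pairs = re.findall(r'(\w+)://([\w.-]+)', text)

print(whole)   # ['http://', 'https://']
print(single)  # ['http', 'https']
print(pairs)   # [('http', 'a.example'), ('https', 'b.example')]
```

This is why the examples in this article print a list of strings when the pattern has one group and a list of tuples when it has several.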

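Now, let's look at the examples:

Example 1: In this example, we extract the protocol and the hostname from a given URL.

Regex for extracting the protocol group: '(\w+)://'.
Regex for extracting the hostname group: '://www\.([\w\-\.]+)'.

Metacharacters used:

\w: matches any alphanumeric character or underscore; equivalent to the class [a-zA-Z0-9_].
+: one or more occurrences of the preceding element.
\.: matches a literal dot (an unescaped dot would match any character).
[\w\-\.]: a character class matching a word character, a hyphen, or a dot.

Code:

Python3

# import library
import re

# URL to parse
s = 'https://www.geeksforgeeks.org/'

# find the protocol: the word characters before '://'
obj1 = re.findall(r'(\w+)://', s)
print(obj1)

# find the hostname after 'www.'; it may
# contain dashes or dots
obj2 = re.findall(r'://www\.([\w\-\.]+)', s)
print(obj2)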

Output:

['https']
['geeksforgeeks.org']
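
The hostname pattern above only matches URLs whose host starts with "www."; for any other host it returns an empty list. One possible way to relax this, shown here as a sketch rather than the article's own method, is to make the prefix optional with a non-capturing group. The second sample URL and the names urls, pattern, and hosts are assumptions added for illustration.

```python
import re

# Sample URLs, with and without the "www." prefix (the second is illustrative)
urls = ['https://www.geeksforgeeks.org/', 'https://docs.python.org/3/']

# (?:www\.)? optionally skips a literal "www." without capturing it
pattern = r'://(?:www\.)?([\w.-]+)'

hosts = [re.findall(pattern, u) for u in urls]
print(hosts)  # [['geeksforgeeks.org'], ['docs.python.org']]
```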

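Example 2: If the URL is of a different type, such as 'file://localhost:4040/abc_file', and includes a port number, we can extract the port as well. The port number '4040' appears after the ':' symbol, and because it is numeric we match it with (:(\d+)). Since not every URL contains a port number, we make this group optional with the '?' symbol, giving '(:(\d+))?'. Because the full pattern then contains three capture groups, re.findall returns a tuple holding the hostname, the colon together with the port, and the port alone.

Metacharacters used:

\d: matches any decimal digit; equivalent to the class [0-9].
+: one or more occurrences of the preceding element.
?: zero or one occurrence of the preceding element.

Code: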

Python3

# import library
import re

# URL to parse
s = 'file://localhost:4040/abc_file'

# find the protocol (here 'file')
obj1 = re.findall(r'(\w+)://', s)
print(obj1)

# find the hostname, which may
# contain dashes or dots
obj2 = re.findall(r'://([\w\-\.]+)', s)
print(obj2)

# find the hostname together with
# the optional port number
obj3 = re.findall(r'://([\w\-\.]+)(:(\d+))?', s)
print(obj3)

Output:

['file']
['localhost']
[('localhost', ':4040', '4040')]
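
The last result contains ':4040' as well as '4040' because the colon sits inside its own capture group. If only the hostname and the digits are wanted, a non-capturing group (?:...) can be used instead. This is a small sketch of that variation, not part of the original example; the names obj and no_port and the second URL are assumptions.

```python
import re

s = 'file://localhost:4040/abc_file'

# (?: ... ) groups ":<digits>" without capturing it, so only the
# hostname and the digits appear in the result
obj = re.findall(r'://([\w.-]+)(?::(\d+))?', s)
print(obj)  # [('localhost', '4040')]

# When the port is missing, its group comes back as an empty string
no_port = re.findall(r'://([\w.-]+)(?::(\d+))?', 'file://localhost/abc_file')
print(no_port)  # [('localhost', '')]
```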

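Example 3: For a general URL, the following pattern can be used; it also captures the elements of the path (here the file name and its extension).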

Python3

# import library
import re

# URL to parse
s = 'http://www.example.com/index.html'

# capture protocol, hostname, file name and extension;
# '\.' matches the literal dot before the extension
obj = re.findall(r'(\w+)://([\w\-\.]+)/(\w+)\.(\w+)', s)

print(obj)

Output:

[('http', 'www.example.com', 'index', 'html')]
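
The pattern in Example 3 expects exactly one path segment of the form name.extension and no port. As a hedged sketch of how the three examples might be combined, the code below uses named groups with an optional port and an optional path. URL_RE and parse_url are hypothetical names introduced here, and the pattern is deliberately simplified: it ignores query strings, fragments, and user info.

```python
import re

# Named groups: scheme, host, optional port, optional path.
# Simplified on purpose; not a full URL grammar.
URL_RE = re.compile(
    r'(?P<scheme>\w+)://'
    r'(?P<host>[\w.-]+)'
    r'(?::(?P<port>\d+))?'
    r'(?P<path>/[^?#\s]*)?'
)

def parse_url(url):
    """Return a dict of URL parts, or None if the pattern does not match."""
    m = URL_RE.match(url)
    return m.groupdict() if m else None

print(parse_url('http://www.example.com/index.html'))
print(parse_url('file://localhost:4040/abc_file'))
```

Groups that do not take part in the match, such as a missing port, are reported as None by groupdict(). For production code, the standard library module urllib.parse is usually a more robust choice than a hand-written pattern.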