📜  Scrapy - Link Extractors

📅  Last modified: 2020-10-31 14:36:21             🧑  Author: Mango


Description

As the name suggests, link extractors are objects used to extract links from web pages using scrapy.http.Response objects. Scrapy provides built-in extractors such as scrapy.linkextractors.LinkExtractor. You can customize your own link extractor according to your needs by implementing a simple interface.

Every link extractor has a public method called extract_links, which takes a Response object and returns a list of scrapy.link.Link objects. You instantiate a link extractor only once and call the extract_links method multiple times to extract links from different responses. The CrawlSpider class uses link extractors together with a set of rules whose main purpose is to extract links.
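The interface described above is small enough to sketch without Scrapy itself. The SimpleLinkExtractor below is a hypothetical, dependency-free illustration (not Scrapy's implementation): it exposes an extract_links-style method built on Python's standard html.parser, and it takes a raw HTML string in place of a real Response object to stay self-contained.

```python
from html.parser import HTMLParser

class SimpleLinkExtractor(HTMLParser):
    """Hypothetical stand-in for a Scrapy link extractor: exposes an
    extract_links-style method returning the URLs found in the page."""

    def __init__(self):
        super().__init__()
        self._links = []

    def handle_starttag(self, tag, attrs):
        # Collect href attributes from anchor-like tags, mirroring the
        # default tags=('a', 'area'), attrs=('href',) behavior.
        if tag in ("a", "area"):
            for name, value in attrs:
                if name == "href" and value:
                    self._links.append(value)

    def extract_links(self, html):
        # A real extractor takes a scrapy.http.Response; here we accept
        # a raw HTML string to keep the sketch self-contained.
        self._links = []
        self.feed(html)
        return list(self._links)

extractor = SimpleLinkExtractor()
links = extractor.extract_links(
    '<a href="/page1">One</a> <area href="/map"> <a>no href</a>'
)
print(links)  # ['/page1', '/map']
```

As with Scrapy's extractors, the object is instantiated once and extract_links is called per page.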

Built-in Link Extractors Reference

Normally, link extractors are bundled with Scrapy and provided in the scrapy.linkextractors module. By default, the link extractor is LinkExtractor, which is functionally identical to LxmlLinkExtractor -

from scrapy.linkextractors import LinkExtractor

LxmlLinkExtractor

class scrapy.linkextractors.lxmlhtml.LxmlLinkExtractor(allow = (), deny = (), 
   allow_domains = (), deny_domains = (), deny_extensions = None, restrict_xpaths = (), 
   restrict_css = (), tags = ('a', 'area'), attrs = ('href', ), 
   canonicalize = True, unique = True, process_value = None)

LxmlLinkExtractor is the strongly recommended link extractor, because it has handy filtering options and is built on lxml's robust HTMLParser.

The parameters accepted by LxmlLinkExtractor are described below -

1. allow (a regular expression, or a list of them)
   A single expression or group of expressions that a URL must match in order to be extracted. If not given, all links are matched.

2. deny (a regular expression, or a list of them)
   A single expression or group of expressions that a URL must match in order to be excluded. If not given or left empty, no links are eliminated.

3. allow_domains (str or list)
   A single string or list of strings matching the domains from which links are to be extracted.

4. deny_domains (str or list)
   A single string or list of strings matching the domains from which links are not to be extracted.

5. deny_extensions (list)
   A list of extension strings to ignore when extracting links. If not set, it defaults to IGNORED_EXTENSIONS, a predefined list in the scrapy.linkextractors package.

6. restrict_xpaths (str or list)
   An XPath, or list of XPaths, selecting the regions of the response from which links are to be extracted. If given, links are extracted only from the text selected by the XPath.

7. restrict_css (str or list)
   Behaves like the restrict_xpaths parameter, extracting links only from the CSS-selected regions of the response.

8. tags (str or list)
   A single tag or list of tags to consider when extracting links. By default, it is ('a', 'area').

9. attrs (list)
   A single attribute or list of attributes to consider when extracting links. By default, it is ('href',).

10. canonicalize (boolean)
    Whether each extracted URL is brought to standard form using scrapy.utils.url.canonicalize_url. By default, it is True.

11. unique (boolean)
    Whether duplicate extracted links are filtered out.

12. process_value (callable)
    A function that receives each value from the scanned tags and attributes. It may alter the value and return it, or return nothing to reject the link. If not given, it defaults to lambda x: x.

The following code is used to extract links -
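As a dependency-free sketch of the allow/deny semantics described in the parameter list above, the extract_links helper below is hypothetical (not Scrapy's API): it pulls href values out of raw HTML with a regular expression and then applies allow and deny patterns the way the table describes.

```python
import re

# Hypothetical helper mirroring the allow/deny filtering described above;
# an illustration only, not Scrapy's actual implementation.
def extract_links(html, allow=(), deny=()):
    links = re.findall(r'href="([^"]+)"', html)
    if allow:
        # Keep only links matching at least one allow pattern.
        links = [l for l in links if any(re.search(p, l) for p in allow)]
    if deny:
        # Drop links matching any deny pattern.
        links = [l for l in links if not any(re.search(p, l) for p in deny)]
    return links

html = ('<a href="http://example.com/docs/intro.html">docs</a>'
        '<a href="http://example.com/login">login</a>'
        '<a href="http://other.com/docs/x.html">other</a>')

# Keep only links under /docs/, then drop anything on other.com.
print(extract_links(html, allow=(r'/docs/',), deny=(r'other\.com',)))
# ['http://example.com/docs/intro.html']
```

With Scrapy installed, the equivalent filtering is done by passing allow and deny to LxmlLinkExtractor and calling extract_links on a Response.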

The following function can be used as process_value -

import re

def process_value(val): 
   m = re.search(r"javascript:goToPage\('(.*?)'", val) 
   if m: 
      return m.group(1)
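To see what this function does, the sketch below restates it and applies it to a couple of sample attribute values (the sample values are made up for illustration). In Scrapy, it would be wired in as LinkExtractor(process_value=process_value), so that each scanned href value passes through it before becoming a link.

```python
import re

def process_value(val):
    m = re.search(r"javascript:goToPage\('(.*?)'", val)
    if m:
        return m.group(1)

# A javascript: pseudo-URL yields the embedded page path; a plain URL
# that does not match returns None, which rejects the link.
print(process_value("javascript:goToPage('../other/page.html'); return false"))
# ../other/page.html
print(process_value("http://example.com/plain"))
# None
```

Returning None for non-matching values is what lets process_value act as a filter as well as a transformer.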