📜  Scrapy - Settings

📅  Last modified: 2020-10-31 14:37:17             🧑  Author: Mango


Description

The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.

Designating the Settings

When scraping a website, you must notify Scrapy which settings you are using. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax.
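A minimal sketch of doing this from a script; the module path myproject.settings is a hypothetical placeholder for your own project −

import os 

# Point Scrapy at the settings module of the project that should be active 
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'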

Populating the Settings

The following mechanisms can be used to populate the settings, listed in decreasing order of precedence (a short sketch of how this precedence resolves follows the list) −

1. Command line options

Here, the arguments that are passed take the highest precedence, overriding other options. The -s flag is used to override one or more settings.

scrapy crawl myspider -s LOG_FILE=scrapy.log

2. Settings per-spider

Spiders can have their own settings that override the project ones, by using the attribute custom_settings.

class DemoSpider(scrapy.Spider): 
   name = 'demo'  
   custom_settings = { 
      'SOME_SETTING': 'some value', 
   }

3. Project settings module

Here, you can populate your custom settings, such as adding or modifying the settings in the settings.py file of your project.

4. Default settings per-command

Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings.

5. Default global settings

These settings are found in the scrapy.settings.default_settings module.
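The precedence between these mechanisms is resolved through the priorities of Scrapy's Settings class. A small sketch, with made-up setting values for illustration −

from scrapy.settings import Settings 

# A value set at a higher priority ('cmdline', mechanism 1) wins over 
# a value set at a lower priority ('project', mechanism 3) 
settings = Settings() 
settings.set('LOG_FILE', 'project.log', priority = 'project') 
settings.set('LOG_FILE', 'cmdline.log', priority = 'cmdline') 
print(settings.get('LOG_FILE'))   # prints cmdline.log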

Accessing the Settings

The settings are available through self.settings, which is set in the base Spider after it is initialized.

The following example demonstrates this.

class DemoSpider(scrapy.Spider): 
   name = 'demo' 
   start_urls = ['http://example.com']  
   def parse(self, response): 
      print("Existing settings: %s" % self.settings.attributes.keys()) 

To use settings before initializing the spider, you must override the from_crawler class method in your spider. You can access the settings through the scrapy.crawler.Crawler.settings attribute of the crawler object that is passed to the from_crawler method.

The following example demonstrates this.

class MyExtension(object): 
   def __init__(self, log_is_enabled = False): 
      if log_is_enabled: 
         print("Enabled log") 

   @classmethod 
   def from_crawler(cls, crawler): 
      # The crawler's settings are available before the extension is built 
      settings = crawler.settings 
      return cls(settings.getbool('LOG_ENABLED')) 
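Beyond getbool, the Settings object offers other typed accessors. A quick sketch, using built-in settings purely for illustration −

def show_settings(settings): 
   # Typed getters convert the stored value to the requested type 
   print(settings.getbool('LOG_ENABLED'))       # e.g. True 
   print(settings.getfloat('DOWNLOAD_DELAY'))   # e.g. 0.0 
   print(settings.getlist('SPIDER_MODULES'))    # e.g. []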

Rationale for Setting Names

Setting names are added as a prefix to the component they configure. For example, for the robots.txt extension, the setting names can be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, and so on.
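Following the same convention, a custom component would prefix its own settings with its name. A hypothetical sketch, where MYEXT_ENABLED and MYEXT_ITEMCOUNT are made-up names −

class MyExtension(object): 
   @classmethod 
   def from_crawler(cls, crawler): 
      # Hypothetical settings, prefixed with the component name MYEXT 
      enabled = crawler.settings.getbool('MYEXT_ENABLED') 
      item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000) 
      return cls(enabled, item_count) 

   def __init__(self, enabled, item_count): 
      self.enabled = enabled 
      self.item_count = item_count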

Built-in Settings Reference

The following are the built-in settings of Scrapy −

1. AWS_ACCESS_KEY_ID

It is used to access Amazon Web Services.

Default value: None

2. AWS_SECRET_ACCESS_KEY

It is used to access Amazon Web Services.

Default value: None

3. BOT_NAME

It is the name of the bot that is used for constructing the User-Agent.

Default value: 'scrapybot'

4. CONCURRENT_ITEMS

Maximum number of items in the Item Processor that are processed in parallel.

Default value: 100

5. CONCURRENT_REQUESTS

Maximum number of concurrent requests that the Scrapy downloader performs.

Default value: 16

6. CONCURRENT_REQUESTS_PER_DOMAIN

Maximum number of concurrent requests performed simultaneously for any single domain.

Default value: 8

7. CONCURRENT_REQUESTS_PER_IP

Maximum number of concurrent requests performed simultaneously to any single IP.

Default value: 0

8. DEFAULT_ITEM_CLASS

It is the class used to represent items.

Default value: 'scrapy.item.Item'

9. DEFAULT_REQUEST_HEADERS

These are the default headers used for Scrapy HTTP requests.

Default value −

{ 
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
   'Accept-Language': 'en', 
}

10. DEPTH_LIMIT

The maximum depth to which a spider can crawl any site.

Default value: 0

11. DEPTH_PRIORITY

It is an integer used to alter the priority of a request according to its depth.

Default value: 0

12. DEPTH_STATS

It states whether to collect depth stats or not.

Default value: True

13. DEPTH_STATS_VERBOSE

When this setting is enabled, the number of requests is collected in stats for each depth.

Default value: False

14. DNSCACHE_ENABLED

It is used to enable the in-memory DNS cache.

Default value: True

15. DNSCACHE_SIZE

It defines the size of the in-memory DNS cache.

Default value: 10000

16. DNS_TIMEOUT

It sets the timeout, in seconds, for DNS query processing.

Default value: 60

17. DOWNLOADER

It is the downloader used for the crawling process.

Default value: 'scrapy.core.downloader.Downloader'

18. DOWNLOADER_MIDDLEWARES

It is a dictionary holding downloader middlewares and their orders.

Default value: {}

19. DOWNLOADER_MIDDLEWARES_BASE

It is a dictionary holding the downloader middlewares that are enabled by default.

Default value −

{ 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, }

20. DOWNLOADER_STATS

This setting is used to enable downloader stats collection.

Default value: True

21. DOWNLOAD_DELAY

It defines the time, in seconds, the downloader waits before downloading consecutive pages from the same site.

Default value: 0

22. DOWNLOAD_HANDLERS

It is a dictionary with download handlers.

Default value: {}

23. DOWNLOAD_HANDLERS_BASE

It is a dictionary with the download handlers that are enabled by default.

Default value −

{ 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', }

24. DOWNLOAD_TIMEOUT

It is the total time, in seconds, the downloader waits before it times out.

Default value: 180

25. DOWNLOAD_MAXSIZE

It is the maximum response size the downloader will download.

Default value: 1073741824 (1024 MB)

26. DOWNLOAD_WARNSIZE

It defines the response size at which the downloader starts warning.

Default value: 33554432 (32 MB)

27. DUPEFILTER_CLASS

It is the class used for detecting and filtering duplicate requests.

Default value: 'scrapy.dupefilters.RFPDupeFilter'

28. DUPEFILTER_DEBUG

When set to true, this setting logs all duplicate requests.

Default value: False

29. EDITOR

It is the editor used to edit spiders with the edit command.

Default value: depends on the environment

30. EXTENSIONS

It is a dictionary of the extensions that are enabled in the project.

Default value: {}

31. EXTENSIONS_BASE

It is a dictionary of the built-in extensions.

Default value: { 'scrapy.extensions.corestats.CoreStats': 0, }

32. FEED_TEMPDIR

It sets a custom folder where crawler temporary files can be stored.

33. ITEM_PIPELINES

It is a dictionary of the item pipelines and their orders.

Default value: {}

34. LOG_ENABLED

It defines whether logging is to be enabled.

Default value: True

35. LOG_ENCODING

It defines the encoding to be used for logging.

Default value: 'utf-8'

36. LOG_FILE

It is the name of the file to be used for logging output.

Default value: None

37. LOG_FORMAT

It is the string used to format log messages.

Default value: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

38. LOG_DATEFORMAT

It is the string used to format the date/time in log messages.

Default value: '%Y-%m-%d %H:%M:%S'

39. LOG_LEVEL

It defines the minimum log level.

Default value: 'DEBUG'

40. LOG_STDOUT

If set to true, all the standard output of your process will appear in the log.

Default value: False

41. MEMDEBUG_ENABLED

It defines whether memory debugging is to be enabled.

Default value: False

42. MEMDEBUG_NOTIFY

When memory debugging is enabled, the memory report is sent to the addresses specified in this list.

Default value: []

43. MEMUSAGE_ENABLED

It defines whether memory usage monitoring is to be enabled, which shuts down the Scrapy process when it exceeds a memory limit.

Default value: False

44. MEMUSAGE_LIMIT_MB

It defines the maximum amount of memory, in megabytes, to be allowed.

Default value: 0

45. MEMUSAGE_CHECK_INTERVAL_SECONDS

It sets the interval, in seconds, at which the current memory usage is checked.

Default value: 60.0

46. MEMUSAGE_NOTIFY_MAIL

It is a list of emails to notify when the memory reaches the limit.

Default value: False

47. MEMUSAGE_REPORT

It defines whether a memory usage report is to be sent on closing each spider.

Default value: False

48. MEMUSAGE_WARNING_MB

It defines the total amount of memory, in megabytes, to be allowed before a warning is sent.

Default value: 0

49. NEWSPIDER_MODULE

It is the module where a new spider is created using the genspider command.

Default value: ''

50. RANDOMIZE_DOWNLOAD_DELAY

It makes Scrapy wait a random amount of time (between 0.5 and 1.5 times DOWNLOAD_DELAY) while downloading requests from the site.

Default value: True

51. REACTOR_THREADPOOL_MAXSIZE

It defines the maximum size of the reactor thread pool.

Default value: 10

52. REDIRECT_MAX_TIMES

It defines how many times a request can be redirected.

Default value: 20

53. REDIRECT_PRIORITY_ADJUST

When set, this setting adjusts the priority of a redirected request.

Default value: +2

54. RETRY_PRIORITY_ADJUST

When set, this setting adjusts the priority of a retried request.

Default value: -1

55. ROBOTSTXT_OBEY

Scrapy obeys robots.txt policies when this is set to true.

Default value: False

56. SCHEDULER

It defines the scheduler to be used for the crawl.

Default value: 'scrapy.core.scheduler.Scheduler'

57. SPIDER_CONTRACTS

It is a dictionary of the spider contracts in the project, used for testing the spiders.

Default value: {}

58. SPIDER_CONTRACTS_BASE

It is a dictionary holding the contracts that are enabled in Scrapy by default.

Default value −

{ 
   'scrapy.contracts.default.UrlContract' : 1, 
   'scrapy.contracts.default.ReturnsContract': 2, 
} 

59. SPIDER_LOADER_CLASS

It defines a class which implements the SpiderLoader API to load spiders.

Default value: 'scrapy.spiderloader.SpiderLoader'

60. SPIDER_MIDDLEWARES

It is a dictionary holding spider middlewares.

Default value: {}

61. SPIDER_MIDDLEWARES_BASE

It is a dictionary holding the spider middlewares that are enabled in Scrapy by default.

Default value −

{ 
   'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, 
}

62. SPIDER_MODULES

It is a list of modules containing the spiders that Scrapy will look for.

Default value: []

63. STATS_CLASS

It is a class which implements the Stats Collector API to collect stats.

Default value: 'scrapy.statscollectors.MemoryStatsCollector'

64. STATS_DUMP

When set to true, this setting dumps the stats to the log.

Default value: True

65. STATSMAILER_RCPTS

Once the spiders finish scraping, Scrapy sends the stats to the email addresses in this list.

Default value: []

66. TELNETCONSOLE_ENABLED

It defines whether to enable the telnet console.

Default value: True

67. TELNETCONSOLE_PORT

It defines the port range used for the telnet console.

Default value: [6023, 6073]

68. TEMPLATES_DIR

It is the directory containing the templates that are used while creating new projects.

Default value: templates directory inside the scrapy module

69. URLLENGTH_LIMIT

It defines the maximum URL length allowed for crawled URLs.

Default value: 2083

70. USER_AGENT

It defines the user agent to be used while crawling a site.

Default value: "Scrapy/VERSION (+http://scrapy.org)"
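To close the reference, a minimal sketch of a project settings.py that combines several of the built-in settings above; the values are illustrative choices, not recommendations −

# settings.py - illustrative values for a handful of built-in settings 
BOT_NAME = 'demo' 
USER_AGENT = 'demo (+http://www.example.com)' 
ROBOTSTXT_OBEY = True 
CONCURRENT_REQUESTS = 16 
DOWNLOAD_DELAY = 0.5 
DEPTH_LIMIT = 3 
LOG_LEVEL = 'INFO' 
DEFAULT_REQUEST_HEADERS = { 
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
   'Accept-Language': 'en', 
}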

For other Scrapy settings, go to this link.