📜  Scrapy - Settings

📅  Last modified: 2020-10-31 14:37:17             🧑  Author: Mango


Description

The behavior of Scrapy components can be modified using Scrapy settings. The settings can also select the Scrapy project that is currently active, in case you have multiple Scrapy projects.

Designating the Settings

When scraping a website, you must notify Scrapy which settings you are using. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax.
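A minimal sketch of doing this from a script; the module path myproject.settings is a hypothetical placeholder for your own project −

import os 

# Point Scrapy at the settings module of the project that should be active 
os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'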

Populating the Settings

The following mechanisms can be used to populate the settings, listed in decreasing order of precedence (a short sketch of how this precedence resolves follows the list) −

1. Command line options

Here, the arguments that are passed take the highest precedence, overriding other options. The -s flag is used to override one or more settings.

scrapy crawl myspider -s LOG_FILE=scrapy.log

2. Settings per-spider

Spiders can have their own settings that override the project ones, by using the attribute custom_settings.

class DemoSpider(scrapy.Spider): 
   name = 'demo'  
   custom_settings = { 
      'SOME_SETTING': 'some value', 
   }

3. Project settings module

Here, you can populate your custom settings, such as adding or modifying the settings in the settings.py file of your project.

4. Default settings per-command

Each Scrapy tool command defines its own settings in the default_settings attribute, to override the global default settings.

5. Default global settings

These settings are found in the scrapy.settings.default_settings module.
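The precedence between these mechanisms is resolved through the priorities of Scrapy's Settings class. A small sketch, with made-up setting values for illustration −

from scrapy.settings import Settings 

# A value set at a higher priority ('cmdline', mechanism 1) wins over 
# a value set at a lower priority ('project', mechanism 3) 
settings = Settings() 
settings.set('LOG_FILE', 'project.log', priority = 'project') 
settings.set('LOG_FILE', 'cmdline.log', priority = 'cmdline') 
print(settings.get('LOG_FILE'))   # prints cmdline.log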

Accessing the Settings

The settings are available through self.settings, which is set in the base Spider after it is initialized.

The following example demonstrates this.

class DemoSpider(scrapy.Spider): 
   name = 'demo' 
   start_urls = ['http://example.com']  
   def parse(self, response): 
      print("Existing settings: %s" % self.settings.attributes.keys()) 

To use settings before initializing the spider, you must override the from_crawler class method in your spider. You can access the settings through the scrapy.crawler.Crawler.settings attribute of the crawler object that is passed to the from_crawler method.

The following example demonstrates this.

class MyExtension(object): 
   def __init__(self, log_is_enabled = False): 
      if log_is_enabled: 
         print("Enabled log") 

   @classmethod 
   def from_crawler(cls, crawler): 
      # The crawler's settings are available before the extension is built 
      settings = crawler.settings 
      return cls(settings.getbool('LOG_ENABLED')) 
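Beyond getbool, the Settings object offers other typed accessors. A quick sketch, using built-in settings purely for illustration −

def show_settings(settings): 
   # Typed getters convert the stored value to the requested type 
   print(settings.getbool('LOG_ENABLED'))       # e.g. True 
   print(settings.getfloat('DOWNLOAD_DELAY'))   # e.g. 0.0 
   print(settings.getlist('SPIDER_MODULES'))    # e.g. []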

Rationale for Setting Names

Setting names are added as a prefix to the component they configure. For example, for the robots.txt extension, the setting names can be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, and so on.
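Following the same convention, a custom component would prefix its own settings with its name. A hypothetical sketch, where MYEXT_ENABLED and MYEXT_ITEMCOUNT are made-up names −

class MyExtension(object): 
   @classmethod 
   def from_crawler(cls, crawler): 
      # Hypothetical settings, prefixed with the component name MYEXT 
      enabled = crawler.settings.getbool('MYEXT_ENABLED') 
      item_count = crawler.settings.getint('MYEXT_ITEMCOUNT', 1000) 
      return cls(enabled, item_count) 

   def __init__(self, enabled, item_count): 
      self.enabled = enabled 
      self.item_count = item_count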

Built-in Settings Reference

The following are the built-in settings of Scrapy −

1. AWS_ACCESS_KEY_ID

It is used to access Amazon Web Services.

Default value: None

2. AWS_SECRET_ACCESS_KEY

It is used to access Amazon Web Services.

Default value: None

3. BOT_NAME

It is the name of the bot that is used for constructing the User-Agent.

Default value: 'scrapybot'

4. CONCURRENT_ITEMS

Maximum number of items in the Item Processor that are processed in parallel.

Default value: 100

5. CONCURRENT_REQUESTS

Maximum number of concurrent requests that the Scrapy downloader performs.

Default value: 16

6. CONCURRENT_REQUESTS_PER_DOMAIN

Maximum number of concurrent requests performed simultaneously for any single domain.

Default value: 8

7. CONCURRENT_REQUESTS_PER_IP

Maximum number of concurrent requests performed simultaneously to any single IP.

Default value: 0

8. DEFAULT_ITEM_CLASS

It is the class used to represent items.

Default value: 'scrapy.item.Item'

9. DEFAULT_REQUEST_HEADERS

These are the default headers used for Scrapy HTTP requests.

Default value −

{ 
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
   'Accept-Language': 'en', 
}

10. DEPTH_LIMIT

The maximum depth to which a spider can crawl any site.

Default value: 0

11. DEPTH_PRIORITY

It is an integer used to alter the priority of a request according to its depth.

Default value: 0

12. DEPTH_STATS

It states whether to collect depth stats or not.

Default value: True

13. DEPTH_STATS_VERBOSE

When this setting is enabled, the number of requests is collected in stats for each depth.

Default value: False

14. DNSCACHE_ENABLED

It is used to enable the in-memory DNS cache.

Default value: True

15. DNSCACHE_SIZE

It defines the size of the in-memory DNS cache.

Default value: 10000

16. DNS_TIMEOUT

It sets the timeout, in seconds, for DNS query processing.

Default value: 60

17. DOWNLOADER

It is the downloader used for the crawling process.

Default value: 'scrapy.core.downloader.Downloader'

18. DOWNLOADER_MIDDLEWARES

It is a dictionary holding downloader middlewares and their orders.

Default value: {}

19. DOWNLOADER_MIDDLEWARES_BASE

It is a dictionary holding the downloader middlewares that are enabled by default.

Default value −

{ 'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, }

20. DOWNLOADER_STATS

This setting is used to enable downloader stats collection.

Default value: True

21. DOWNLOAD_DELAY

It defines the time, in seconds, the downloader waits before downloading consecutive pages from the same site.

Default value: 0

22. DOWNLOAD_HANDLERS

It is a dictionary with download handlers.

Default value: {}

23. DOWNLOAD_HANDLERS_BASE

It is a dictionary with the download handlers that are enabled by default.

Default value −

{ 'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', }

24. DOWNLOAD_TIMEOUT

It is the total time, in seconds, the downloader waits before it times out.

Default value: 180

25. DOWNLOAD_MAXSIZE

It is the maximum response size the downloader will download.

Default value: 1073741824 (1024 MB)

26. DOWNLOAD_WARNSIZE

It defines the response size at which the downloader starts warning.

Default value: 33554432 (32 MB)

27. DUPEFILTER_CLASS

It is the class used for detecting and filtering duplicate requests.

Default value: 'scrapy.dupefilters.RFPDupeFilter'

28. DUPEFILTER_DEBUG

When set to true, this setting logs all duplicate requests.

Default value: False

29. EDITOR

It is the editor used to edit spiders with the edit command.

Default value: depends on the environment

30. EXTENSIONS

It is a dictionary of the extensions that are enabled in the project.

Default value: {}

31. EXTENSIONS_BASE

It is a dictionary of the built-in extensions.

Default value: { 'scrapy.extensions.corestats.CoreStats': 0, }

32. FEED_TEMPDIR

It sets a custom folder where crawler temporary files can be stored.

33. ITEM_PIPELINES

It is a dictionary of the item pipelines and their orders.

Default value: {}

34. LOG_ENABLED

It defines whether logging is to be enabled.

Default value: True

35. LOG_ENCODING

It defines the encoding to be used for logging.

Default value: 'utf-8'

36. LOG_FILE

It is the name of the file to be used for logging output.

Default value: None

37. LOG_FORMAT

It is the string used to format log messages.

Default value: '%(asctime)s [%(name)s] %(levelname)s: %(message)s'

38. LOG_DATEFORMAT

It is the string used to format the date/time in log messages.

Default value: '%Y-%m-%d %H:%M:%S'

39. LOG_LEVEL

It defines the minimum log level.

Default value: 'DEBUG'

40. LOG_STDOUT

If set to true, all the standard output of your process will appear in the log.

Default value: False

41. MEMDEBUG_ENABLED

It defines whether memory debugging is to be enabled.

Default value: False

42. MEMDEBUG_NOTIFY

When memory debugging is enabled, the memory report is sent to the addresses specified in this list.

Default value: []

43. MEMUSAGE_ENABLED

It defines whether memory usage monitoring is to be enabled, which shuts down the Scrapy process when it exceeds a memory limit.

Default value: False

44. MEMUSAGE_LIMIT_MB

It defines the maximum amount of memory, in megabytes, to be allowed.

Default value: 0

45. MEMUSAGE_CHECK_INTERVAL_SECONDS

It sets the interval, in seconds, at which the current memory usage is checked.

Default value: 60.0

46. MEMUSAGE_NOTIFY_MAIL

It is a list of emails to notify when the memory reaches the limit.

Default value: False

47. MEMUSAGE_REPORT

It defines whether a memory usage report is to be sent on closing each spider.

Default value: False

48. MEMUSAGE_WARNING_MB

It defines the total amount of memory, in megabytes, to be allowed before a warning is sent.

Default value: 0

49. NEWSPIDER_MODULE

It is the module where a new spider is created using the genspider command.

Default value: ''

50. RANDOMIZE_DOWNLOAD_DELAY

It makes Scrapy wait a random amount of time (between 0.5 and 1.5 times DOWNLOAD_DELAY) while downloading requests from the site.

Default value: True

51. REACTOR_THREADPOOL_MAXSIZE

It defines the maximum size of the reactor thread pool.

Default value: 10

52. REDIRECT_MAX_TIMES

It defines how many times a request can be redirected.

Default value: 20

53. REDIRECT_PRIORITY_ADJUST

When set, this setting adjusts the priority of a redirected request.

Default value: +2

54. RETRY_PRIORITY_ADJUST

When set, this setting adjusts the priority of a retried request.

Default value: -1

55. ROBOTSTXT_OBEY

Scrapy obeys robots.txt policies when this is set to true.

Default value: False

56. SCHEDULER

It defines the scheduler to be used for the crawl.

Default value: 'scrapy.core.scheduler.Scheduler'

57. SPIDER_CONTRACTS

It is a dictionary of the spider contracts in the project, used for testing the spiders.

Default value: {}

58. SPIDER_CONTRACTS_BASE

It is a dictionary holding the contracts that are enabled in Scrapy by default.

Default value −

{ 
   'scrapy.contracts.default.UrlContract' : 1, 
   'scrapy.contracts.default.ReturnsContract': 2, 
} 

59. SPIDER_LOADER_CLASS

It defines a class which implements the SpiderLoader API to load spiders.

Default value: 'scrapy.spiderloader.SpiderLoader'

60. SPIDER_MIDDLEWARES

It is a dictionary holding spider middlewares.

Default value: {}

61. SPIDER_MIDDLEWARES_BASE

It is a dictionary holding the spider middlewares that are enabled in Scrapy by default.

Default value −

{ 
   'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, 
}

62. SPIDER_MODULES

It is a list of modules containing the spiders that Scrapy will look for.

Default value: []

63. STATS_CLASS

It is a class which implements the Stats Collector API to collect stats.

Default value: 'scrapy.statscollectors.MemoryStatsCollector'

64. STATS_DUMP

When set to true, this setting dumps the stats to the log.

Default value: True

65. STATSMAILER_RCPTS

Once the spiders finish scraping, Scrapy sends the stats to the email addresses in this list.

Default value: []

66. TELNETCONSOLE_ENABLED

It defines whether to enable the telnet console.

Default value: True

67. TELNETCONSOLE_PORT

It defines the port range used for the telnet console.

Default value: [6023, 6073]

68. TEMPLATES_DIR

It is the directory containing the templates that are used while creating new projects.

Default value: templates directory inside the scrapy module

69. URLLENGTH_LIMIT

It defines the maximum URL length allowed for crawled URLs.

Default value: 2083

70. USER_AGENT

It defines the user agent to be used while crawling a site.

Default value: "Scrapy/VERSION (+http://scrapy.org)"
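To close the reference, a minimal sketch of a project settings.py that combines several of the built-in settings above; the values are illustrative choices, not recommendations −

# settings.py - illustrative values for a handful of built-in settings 
BOT_NAME = 'demo' 
USER_AGENT = 'demo (+http://www.example.com)' 
ROBOTSTXT_OBEY = True 
CONCURRENT_REQUESTS = 16 
DOWNLOAD_DELAY = 0.5 
DEPTH_LIMIT = 3 
LOG_LEVEL = 'INFO' 
DEFAULT_REQUEST_HEADERS = { 
   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 
   'Accept-Language': 'en', 
}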

For other Scrapy settings, go to this link.