The behavior of Scrapy components can be modified through Scrapy settings. The settings can also select which Scrapy project is currently active, in case you have multiple Scrapy projects.

When scraping a website, you must tell Scrapy which settings you are using. For this, the environment variable SCRAPY_SETTINGS_MODULE should be used, and its value should be in Python path syntax.
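As a minimal sketch of how SCRAPY_SETTINGS_MODULE is used, the snippet below sets the variable from a script and then loads the settings with get_project_settings(); the module path myproject.settings is a hypothetical example, so substitute your own project's settings module.

```python
# Minimal sketch: point Scrapy at a settings module via the environment
# variable, then load the resulting settings object.
# 'myproject.settings' is a hypothetical Python path used for illustration.
import os

os.environ['SCRAPY_SETTINGS_MODULE'] = 'myproject.settings'

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
print(settings.get('BOT_NAME'))
```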
The following table shows some of the mechanisms by which you can populate the settings, in decreasing order of precedence:
| Sr.No | Mechanism & Description |
|---|---|
| 1 | **Command line options**: Arguments passed here take the highest precedence, overriding other options. The `-s` flag is used to override one or more settings, e.g. `scrapy crawl myspider -s LOG_FILE=scrapy.log` |
| 2 | **Settings per-spider**: Spiders can have their own settings that override the project ones by using the `custom_settings` attribute, e.g. `class DemoSpider(scrapy.Spider): name = 'demo'; custom_settings = {'SOME_SETTING': 'some value'}` |
| 3 | **Project settings module**: Here, you can populate your custom settings, such as adding or modifying settings in the settings.py file (see the sketch after this table). |
| 4 | **Default settings per-command**: Each Scrapy tool command defines its own settings in the `default_settings` attribute, to override the global default settings. |
| 5 | **Default global settings**: These settings are found in the `scrapy.settings.default_settings` module. |
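As a hedged illustration of the project settings module mentioned in row 3, a settings.py file might look like the sketch below; the project name (demo) and the values shown are placeholders, not Scrapy defaults.

```python
# Illustrative settings.py for a hypothetical project named 'demo'.
# Each assignment overrides the corresponding default global setting.
BOT_NAME = 'demo'

SPIDER_MODULES = ['demo.spiders']
NEWSPIDER_MODULE = 'demo.spiders'

ROBOTSTXT_OBEY = True    # obey robots.txt policies
DOWNLOAD_DELAY = 2       # wait 2 seconds between requests to the same site
```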
The settings are available through self.settings, which is set in the base Spider class after the spider has been initialized.

The following example demonstrates this.
```python
import scrapy


class DemoSpider(scrapy.Spider):
    name = 'demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        print("Existing settings: %s" % self.settings.attributes.keys())
```
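Besides attributes.keys(), the Settings object exposes typed accessors such as getbool(), getint() and getlist(). The sketch below (spider name and URL are placeholders) shows how they might be used inside a callback.

```python
import scrapy


class SettingsDemoSpider(scrapy.Spider):
    # Hypothetical spider used only to illustrate typed settings accessors.
    name = 'settings_demo'
    start_urls = ['http://example.com']

    def parse(self, response):
        # Typed accessors from Scrapy's Settings API.
        log_enabled = self.settings.getbool('LOG_ENABLED')
        timeout = self.settings.getint('DOWNLOAD_TIMEOUT')
        modules = self.settings.getlist('SPIDER_MODULES')
        self.logger.info("LOG_ENABLED=%s DOWNLOAD_TIMEOUT=%s SPIDER_MODULES=%s",
                         log_enabled, timeout, modules)
```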
To use settings before the spider is initialized, you must override the from_crawler class method in your component (for example, an extension or middleware). The settings can then be accessed through the scrapy.crawler.Crawler.settings attribute of the crawler passed to the from_crawler method.

The following example demonstrates this.
```python
class MyExtension(object):
    def __init__(self, log_is_enabled=False):
        if log_is_enabled:
            print("Enabled log")

    @classmethod
    def from_crawler(cls, crawler):
        # Read the settings from the crawler before the component is built.
        settings = crawler.settings
        return cls(settings.getbool('LOG_ENABLED'))
```
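For an extension like MyExtension above to receive the crawler at all, it would normally be enabled through the EXTENSIONS setting. A minimal sketch, assuming the class lives at the hypothetical import path myproject.extensions.MyExtension:

```python
# In settings.py: enable the extension so Scrapy instantiates it via
# from_crawler(). The import path is hypothetical; the integer is the
# extension's order value.
EXTENSIONS = {
    'myproject.extensions.MyExtension': 500,
}
```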
Setting names are usually prefixed with the name of the component they configure. For example, for the robots.txt extension, the setting names can be ROBOTSTXT_ENABLED, ROBOTSTXT_OBEY, ROBOTSTXT_CACHEDIR, and so on.

The following table shows the built-in settings of Scrapy:
| Sr.No | Setting & Description |
|---|---|
| 1 | **AWS_ACCESS_KEY_ID**: Used to access Amazon Web Services. Default value: None |
| 2 | **AWS_SECRET_ACCESS_KEY**: Used to access Amazon Web Services. Default value: None |
| 3 | **BOT_NAME**: The name of the bot, which can be used for constructing the User-Agent. Default value: 'scrapybot' |
| 4 | **CONCURRENT_ITEMS**: Maximum number of items processed in parallel in the item processor. Default value: 100 |
| 5 | **CONCURRENT_REQUESTS**: Maximum number of requests that the Scrapy downloader performs concurrently. Default value: 16 |
| 6 | **CONCURRENT_REQUESTS_PER_DOMAIN**: Maximum number of requests performed concurrently to any single domain. Default value: 8 |
| 7 | **CONCURRENT_REQUESTS_PER_IP**: Maximum number of requests performed concurrently to any single IP. Default value: 0 |
| 8 | **DEFAULT_ITEM_CLASS**: The class used to represent items. Default value: 'scrapy.item.Item' |
| 9 | **DEFAULT_REQUEST_HEADERS**: The default headers used for Scrapy HTTP requests. Default value: `{'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en'}` |
| 10 | **DEPTH_LIMIT**: The maximum depth a spider is allowed to crawl for any site. Default value: 0 |
| 11 | **DEPTH_PRIORITY**: An integer used to adjust the priority of a request according to its depth. Default value: 0 |
| 12 | **DEPTH_STATS**: Whether to collect depth stats. Default value: True |
| 13 | **DEPTH_STATS_VERBOSE**: When enabled, the number of requests is collected in stats for each depth. Default value: False |
| 14 | **DNSCACHE_ENABLED**: Enables the in-memory DNS cache. Default value: True |
| 15 | **DNSCACHE_SIZE**: The size of the in-memory DNS cache. Default value: 10000 |
| 16 | **DNS_TIMEOUT**: The timeout, in seconds, for DNS to process queries. Default value: 60 |
| 17 | **DOWNLOADER**: The downloader used for the crawling process. Default value: 'scrapy.core.downloader.Downloader' |
| 18 | **DOWNLOADER_MIDDLEWARES**: A dictionary holding downloader middlewares and their orders. Default value: {} |
| 19 | **DOWNLOADER_MIDDLEWARES_BASE**: A dictionary holding the downloader middlewares that are enabled by default, e.g. `{'scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware': 100, ...}` |
| 20 | **DOWNLOADER_STATS**: Enables downloader stats collection. Default value: True |
| 21 | **DOWNLOAD_DELAY**: The time (in seconds) the downloader waits before downloading consecutive pages from the same site. Default value: 0 |
| 22 | **DOWNLOAD_HANDLERS**: A dictionary of download handlers. Default value: {} |
| 23 | **DOWNLOAD_HANDLERS_BASE**: A dictionary of download handlers that are enabled by default, e.g. `{'file': 'scrapy.core.downloader.handlers.file.FileDownloadHandler', ...}` |
| 24 | **DOWNLOAD_TIMEOUT**: The total time (in seconds) the downloader waits before it times out. Default value: 180 |
| 25 | **DOWNLOAD_MAXSIZE**: The maximum response size the downloader will download. Default value: 1073741824 (1024 MB) |
| 26 | **DOWNLOAD_WARNSIZE**: The response size at which the downloader starts warning. Default value: 33554432 (32 MB) |
| 27 | **DUPEFILTER_CLASS**: The class used for detecting and filtering duplicate requests. Default value: 'scrapy.dupefilters.RFPDupeFilter' |
| 28 | **DUPEFILTER_DEBUG**: Logs all duplicate filters when set to True. Default value: False |
| 29 | **EDITOR**: The editor used to edit spiders with the edit command. Default value: depends on the environment |
| 30 | **EXTENSIONS**: A dictionary of the extensions enabled in the project. Default value: {} |
| 31 | **EXTENSIONS_BASE**: A dictionary of the built-in extensions, e.g. `{'scrapy.extensions.corestats.CoreStats': 0, ...}` |
| 32 | **FEED_TEMPDIR**: A directory used to set the custom folder where crawler temporary files can be stored. |
| 33 | **ITEM_PIPELINES**: A dictionary of item pipelines. Default value: {} |
| 34 | **LOG_ENABLED**: Defines whether logging is enabled. Default value: True |
| 35 | **LOG_ENCODING**: The encoding used for logging. Default value: 'utf-8' |
| 36 | **LOG_FILE**: The name of the file used for logging output. Default value: None |
| 37 | **LOG_FORMAT**: The string used to format log messages. Default value: '%(asctime)s [%(name)s] %(levelname)s: %(message)s' |
| 38 | **LOG_DATEFORMAT**: The string used to format the date/time in log messages. Default value: '%Y-%m-%d %H:%M:%S' |
| 39 | **LOG_LEVEL**: The minimum log level. Default value: 'DEBUG' |
| 40 | **LOG_STDOUT**: If set to True, all process output appears in the log. Default value: False |
| 41 | **MEMDEBUG_ENABLED**: Defines whether memory debugging is enabled. Default value: False |
| 42 | **MEMDEBUG_NOTIFY**: The addresses to which the memory report is sent when memory debugging is enabled. Default value: [] |
| 43 | **MEMUSAGE_ENABLED**: Defines whether to monitor the memory used by the Scrapy process. Default value: False |
| 44 | **MEMUSAGE_LIMIT_MB**: The maximum amount of memory (in megabytes) allowed. Default value: 0 |
| 45 | **MEMUSAGE_CHECK_INTERVAL_SECONDS**: The interval length, in seconds, at which current memory usage is checked. Default value: 60.0 |
| 46 | **MEMUSAGE_NOTIFY_MAIL**: A list of emails to notify when memory reaches the limit. Default value: False |
| 47 | **MEMUSAGE_REPORT**: Defines whether a memory usage report is sent when each spider closes. Default value: False |
| 48 | **MEMUSAGE_WARNING_MB**: The amount of memory allowed before a warning is sent. Default value: 0 |
| 49 | **NEWSPIDER_MODULE**: The module where new spiders are created using the genspider command. Default value: '' |
| 50 | **RANDOMIZE_DOWNLOAD_DELAY**: Makes Scrapy wait a random amount of time while downloading requests from the site. Default value: True |
| 51 | **REACTOR_THREADPOOL_MAXSIZE**: The maximum size of the reactor thread pool. Default value: 10 |
| 52 | **REDIRECT_MAX_TIMES**: How many times a request can be redirected. Default value: 20 |
| 53 | **REDIRECT_PRIORITY_ADJUST**: Adjusts the redirect priority of a request. Default value: +2 |
| 54 | **RETRY_PRIORITY_ADJUST**: Adjusts the retry priority of a request. Default value: -1 |
| 55 | **ROBOTSTXT_OBEY**: Scrapy obeys robots.txt policies when set to True. Default value: False |
| 56 | **SCHEDULER**: The scheduler used for the crawl. Default value: 'scrapy.core.scheduler.Scheduler' |
| 57 | **SPIDER_CONTRACTS**: A dictionary of spider contracts in the project, used to test spiders. Default value: {} |
| 58 | **SPIDER_CONTRACTS_BASE**: A dictionary of contracts enabled in Scrapy by default, e.g. `{'scrapy.contracts.default.UrlContract': 1, 'scrapy.contracts.default.ReturnsContract': 2, ...}` |
| 59 | **SPIDER_LOADER_CLASS**: A class implementing the SpiderLoader API, used to load spiders. Default value: 'scrapy.spiderloader.SpiderLoader' |
| 60 | **SPIDER_MIDDLEWARES**: A dictionary of spider middlewares. Default value: {} |
| 61 | **SPIDER_MIDDLEWARES_BASE**: A dictionary of spider middlewares enabled in Scrapy by default, e.g. `{'scrapy.spidermiddlewares.httperror.HttpErrorMiddleware': 50, ...}` |
| 62 | **SPIDER_MODULES**: A list of modules containing spiders that Scrapy will look for. Default value: [] |
| 63 | **STATS_CLASS**: A class implementing the Stats Collector API, used to collect stats. Default value: 'scrapy.statscollectors.MemoryStatsCollector' |
| 64 | **STATS_DUMP**: When set to True, dumps the stats to the log. Default value: True |
| 65 | **STATSMAILER_RCPTS**: Once the spiders finish scraping, Scrapy uses this list of addresses to send the stats. Default value: [] |
| 66 | **TELNETCONSOLE_ENABLED**: Defines whether to enable the telnet console. Default value: True |
| 67 | **TELNETCONSOLE_PORT**: The port range used for the telnet console. Default value: [6023, 6073] |
| 68 | **TEMPLATES_DIR**: The directory containing the templates used when creating new projects. Default value: templates directory inside the scrapy module |
| 69 | **URLLENGTH_LIMIT**: The maximum allowed URL length for crawled URLs. Default value: 2083 |
| 70 | **USER_AGENT**: The user agent used while crawling a site. Default value: "Scrapy/VERSION (+http://scrapy.org)" |
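As a hedged illustration of how a few of these built-in settings are often combined, the sketch below overrides them per-spider through custom_settings; the spider name, URL and values are placeholders, not recommended values.

```python
import scrapy


class PoliteSpider(scrapy.Spider):
    # Hypothetical spider; custom_settings overrides the project settings
    # for this spider only, using built-in setting names from the table above.
    name = 'polite'
    start_urls = ['http://example.com']

    custom_settings = {
        'USER_AGENT': 'demo-bot/0.1 (+http://example.com/bot)',
        'DOWNLOAD_DELAY': 1.5,
        'CONCURRENT_REQUESTS_PER_DOMAIN': 4,
        'ROBOTSTXT_OBEY': True,
    }

    def parse(self, response):
        self.logger.info("Fetched %s", response.url)
```

Running `scrapy crawl polite -s DOWNLOAD_DELAY=3` would override even this per-spider value, following the command-line precedence described earlier.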
For other Scrapy settings, refer to the official Scrapy documentation.