A web crawler is a bot that downloads content from the Internet and indexes it. The main purpose of such a bot is to learn what the different web pages on the Internet contain. Crawlers are operated mostly by search engines: by applying search algorithms to the data collected by web crawlers, a search engine can provide relevant links in response to a user's query. In this article, we will discuss how a web crawler is implemented.
A web crawler is an important application of the breadth-first search (BFS) algorithm. The idea is that the entire Internet can be represented as a directed graph with:
- Vertices -> domain names/URLs/websites.
- Edges -> connections (links) between them.
Example:
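As a minimal sketch of this graph view, the snippet below models a handful of sites as an adjacency list, where each site maps to the sites it links to. The site names and edges here are purely hypothetical placeholders, not data from a real crawl.
Java
import java.util.List;
import java.util.Map;

public class WebGraphExample {
    public static void main(String[] args) {
        // Hypothetical directed graph: each site maps to the sites it links to
        Map<String, List<String>> webGraph = Map.of(
            "https://www.example.com",
            List.of("https://www.wikipedia.org", "https://www.github.com"),
            "https://www.wikipedia.org",
            List.of("https://www.example.com"),
            "https://www.github.com",
            List.of());

        // Print every edge (link) in the graph
        webGraph.forEach((site, links) ->
            links.forEach(link -> System.out.println(site + " -> " + link)));
    }
}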
Approach: The algorithm works by parsing the raw HTML of a website and looking for other URLs in the retrieved data. Every URL found is added to a queue, and the URLs are then visited in breadth-first-search order.
Note: This code will not work in an online IDE due to proxy issues; run it on your local machine instead.
Java
// Java program to illustrate a simple WebCrawler
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Queue;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Class containing the functions
// required for the WebCrawler
class WebCrawler {

    // To store the URLs in the
    // FIFO order required for BFS
    private Queue<String> queue;

    // To store the visited URLs
    private HashSet<String> discoveredWebsites;

    // Constructor initializing the
    // required variables
    public WebCrawler()
    {
        this.queue = new LinkedList<>();
        this.discoveredWebsites = new HashSet<>();
    }

    // Function to start the BFS and
    // discover all URLs
    public void discover(String root)
    {
        // Store the root URL to initiate BFS
        // and mark it as visited
        this.queue.add(root);
        this.discoveredWebsites.add(root);

        // Loop until the queue is empty
        while (!queue.isEmpty()) {

            // URL at the front of the queue
            String v = queue.remove();

            // Raw HTML of the website
            String raw = readUrl(v);

            // Regular expression for a URL
            String regex = "https://(\\w+\\.)*(\\w+)";

            // Pattern of the URL formed by the regex
            Pattern pattern = Pattern.compile(regex);

            // Extract all the URLs that
            // match the pattern in raw
            Matcher matcher = pattern.matcher(raw);

            // Loop until all the URLs found on the
            // current website are stored in the queue
            while (matcher.find()) {

                // Next URL found in raw
                String actual = matcher.group();

                // Check whether this URL has
                // already been visited
                if (!discoveredWebsites.contains(actual)) {

                    // If not visited, mark it as visited,
                    // print it and add it to the queue
                    discoveredWebsites.add(actual);
                    System.out.println("Website found: " + actual);
                    queue.add(actual);
                }
            }
        }
    }

    // Function to return the raw HTML
    // of the given website
    public String readUrl(String v)
    {
        // Initialize an empty string
        String raw = "";

        // Use a try-catch block to handle
        // any exceptions thrown by this code
        try {
            // Convert the string into a URL
            URL url = new URL(v);

            // Reader for the HTML of the website
            BufferedReader br = new BufferedReader(
                new InputStreamReader(url.openStream()));

            // Holds one line of input from the website
            String input;

            // Read the HTML line by line
            // and append it to raw
            while ((input = br.readLine()) != null) {
                raw += input;
            }

            // Close the BufferedReader
            br.close();
        }
        catch (Exception ex) {
            ex.printStackTrace();
        }
        return raw;
    }
}

// Driver code
public class Main {
    public static void main(String[] args)
    {
        // Create an object of WebCrawler
        WebCrawler webCrawler = new WebCrawler();

        // Given root URL
        String root = "https://www.google.com";

        // Method call
        webCrawler.discover(root);
    }
}
Output:
Website found: https://www.google.com
Website found: https://www.facebook.com
Website found: https://www.amazon.com
Website found: https://www.microsoft.com
Website found: https://www.apple.com
Applications: This kind of web crawler is used to obtain important parameters of the web, such as:
- Which websites are visited frequently?
- Which websites are important across the web as a whole? (see the sketch after this list)
- Useful information on social networks: Facebook, Twitter, etc.
- Who is the most popular person in a group of people?
- Who is the most important software engineer in a company?
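As a rough sketch of the second point above, one crude measure of a website's importance is its in-degree: how many other crawled pages link to it. The snippet below assumes the crawl has already produced a list of directed links; the site names and edges are placeholders for illustration, not real crawl data.
Java
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class InDegreeRanking {
    public static void main(String[] args) {
        // Hypothetical edges discovered by the crawler: {from, to}
        List<String[]> links = List.of(
            new String[] { "https://www.example.com", "https://www.wikipedia.org" },
            new String[] { "https://www.github.com", "https://www.wikipedia.org" },
            new String[] { "https://www.wikipedia.org", "https://www.example.com" });

        // Count how many links point to each site (its in-degree)
        Map<String, Integer> inDegree = new HashMap<>();
        for (String[] link : links) {
            inDegree.merge(link[1], 1, Integer::sum);
        }

        // Print the sites from most to least linked-to
        inDegree.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
            .forEach(e -> System.out.println(e.getKey() + " <- " + e.getValue() + " links"));
    }
}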