从 HTML 文档中提取内容的Java程序

HTML 是网络的核心，你在互联网上看到的所有页面都是 HTML，无论它们是由 JavaScript、JSP、 PHP、ASP 或任何其他网络技术动态生成的。您的浏览器实际上会解析 HTML 并为您呈现，但是如果我们需要解析 HTML 文档并查找一些元素、标签、属性或检查特定元素是否存在。在Java中，我们可以提取 HTML 内容并解析 HTML 文档。

方法：

使用文件阅读器
使用Url.openStream()

方法 1：名为 FileReader 的库提供了读取任何文件的方法，而不管任何扩展名。将 HTML 行附加到 String Builder 的方法如下：

使用 FileReader 从源文件夹中读取文件并进一步
将每一行附加到字符串构建器。
当 HTML 文档中没有任何内容时，请使用函数br.close()关闭打开的文件。
打印出字符串。

执行：

Java

// Java Program to Extract Content from a HTML document
 
// Importing input/output java libraries
import java.io.*;
 
public class GFG {
 
    // Main driver method
    public static void main(String[] args)
        throws FileNotFoundException
    {
 
        /* Constructing String Builder to
        append the string into the html */
        StringBuilder html = new StringBuilder();
 
        // Reading html file on local directory
        FileReader fr = new FileReader(
            "C:\\Users\\rohit\\OneDrive\\Desktop\\article.html");
 
        // Try block to check exceptions
        try {
 
            // Initialization of the buffered Reader to get
            // the String append to the String Builder
            BufferedReader br = new BufferedReader(fr);
 
            String val;
 
            // Reading the String till we get the null
            // string and appending to the string
            while ((val = br.readLine()) != null) {
                html.append(val);
            }
 
            // AtLast converting into the string
            String result = html.toString();
            System.out.println(result);
 
            // Closing the file after all the completion of
            // Extracting
            br.close();
        }
 
        // Catch block to handle exceptions
        catch (Exception ex) {
 
            /* Exception of not finding the location and
            string reading termination the function
            br.close(); */
            System.out.println(ex.getMessage());
        }
    }
}

Java

// Java Program to Extract Content from a HTML document
 
// Importing hava generic class
import java.io.*;
import java.util.*;
// Importing java URL class
import java.net.URL;
 
public class GFG {
 
    // Man driver method
    public static void main(String[] args)
        throws FileNotFoundException
    {
 
        // Try block to check exceptions
        try {
            String val;
 
            // Constructing the URL connection
            // by defining the URL constructors
            URL URL = new URL(
                "file:///C:/Users/rohit/OneDrive/Desktop/article.html");
 
            // Reading the HTML content from the .HTML File
            BufferedReader br = new BufferedReader(
                new InputStreamReader(URL.openStream()));
 
            /* Catching the string and  if found any null
             character break the String */
            while ((val = br.readLine()) != null) {
                System.out.println(val);
            }
 
            // Closing the file
            br.close();
        }
 
        // Catch block to handle exceptions
        catch (Exception ex) {
 
            // No file found
            System.out.println(ex.getMessage());
        }
    }
}

输出：

方法 2：使用 Url.openStream()

调用url.openStream() 启动新 TCP 的函数连接到 URL 提供它的服务器。
现在，在服务器将包含信息的 HTTP 响应发送回连接后，HTTP 获取请求被发送到连接。
该信息采用字节的形式，然后使用InputStreamReader()和openStream()方法读取该信息将数据返回给程序。

BufferedReader br = new BufferedReader(new InputStreamReader(URL.openStream()));

首先，我们使用 开放流（） 来获取信息。如果连接一切正常（表示显示 200），则该信息以字节的形式包含在 URL 中，然后 HTTP 请求到网址获取内容。
然后使用字节以字节的形式收集信息 输入流读取器()
现在运行循环以打印信息，因为需求是在控制台中打印信息。

while ((val = br.readLine()) != null)   // condition
 {    
   System.out.println(val);             // execution if condition is true
  }

执行：

Java

// Java Program to Extract Content from a HTML document
 
// Importing hava generic class
import java.io.*;
import java.util.*;
// Importing java URL class
import java.net.URL;
 
public class GFG {
 
    // Man driver method
    public static void main(String[] args)
        throws FileNotFoundException
    {
 
        // Try block to check exceptions
        try {
            String val;
 
            // Constructing the URL connection
            // by defining the URL constructors
            URL URL = new URL(
                "file:///C:/Users/rohit/OneDrive/Desktop/article.html");
 
            // Reading the HTML content from the .HTML File
            BufferedReader br = new BufferedReader(
                new InputStreamReader(URL.openStream()));
 
            /* Catching the string and  if found any null
             character break the String */
            while ((val = br.readLine()) != null) {
                System.out.println(val);
            }
 
            // Closing the file
            br.close();
        }
 
        // Catch block to handle exceptions
        catch (Exception ex) {
 
            // No file found
            System.out.println(ex.getMessage());
        }
    }
}

输出：