📜  BodyContentHandler Class in Java

📅  Last modified: 2022-05-13 01:55:10.431000             🧑  Author: Mango


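Apache Tika is a library that lets you extract data from different document types (.PDF, .DOCX, etc.). In this tutorial, we will extract data using BodyContentHandler. The following dependency is required: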


<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>

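BodyContentHandler is a content handler decorator that lets us get everything inside the XHTML <body> tag. The <body> and </body> tags themselves are not included in the resulting value.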
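
To make the "body only" behaviour concrete, here is a minimal sketch that uses nothing but the JDK's built-in SAX parser. It is not Tika's implementation; the class names BodyOnlySketch and BodyTextHandler and the sample XHTML string are made up for illustration. The handler simply ignores character data until the parser enters the body element.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Apache Tika
class BodyOnlySketch {

    // Collects character data only while the parser is inside <body>
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes atts) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName,
                               String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Text in <head> (for example the title) is skipped
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses an XHTML string and returns only the text found in <body>
    static String extractBody(String xhtml) throws Exception {
        SAXParser parser
            = SAXParserFactory.newInstance().newSAXParser();
        BodyTextHandler handler = new BodyTextHandler();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>Ignored</title></head>"
            + "<body><p>Hello</p><p>Tika</p></body></html>";
        System.out.println(extractBody(xhtml)); // prints "HelloTika"
    }
}
```

Tika parsers emit their output as XHTML SAX events, which is why a handler of this shape can separate the document body from the surrounding markup.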

Let us first look at the constructors of this class:

BodyContentHandler(): Writes all content into an internal string buffer; call toString() to get the content. By default, the maximum content length is 100,000 characters. If this limit is reached, a SAXException is thrown.
BodyContentHandler(int writeLimit): Writes all content into an internal string buffer; call toString() to get the content. writeLimit is the maximum number of characters that can be written; set it to -1 to disable the limit. If the limit is reached, a SAXException is thrown.
BodyContentHandler(OutputStream outputStream): Writes all content to the given output stream, with no content limit.
BodyContentHandler(Writer writer): Writes all content to the given writer, with no content limit.
BodyContentHandler(ContentHandler handler): Passes all content to the given handler.
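
The write-limit rule described for the first two constructors can be sketched with a small JDK-only handler. This is a simplified illustration under stated assumptions, not Tika's code: WriteLimitSketch, LimitedTextHandler and collect are hypothetical names, and the handler is fed characters directly instead of through a real parser. It shows the two cases from the list: a positive limit that triggers a SAXException, and -1, which disables the limit.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration class; not part of Apache Tika
class WriteLimitSketch {

    // Buffers text until an optional character limit is reached
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private final int writeLimit; // -1 disables the limit

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            if (writeLimit == -1
                || text.length() + length <= writeLimit) {
                text.append(ch, start, length);
            } else {
                // Keep the part that still fits, then signal the limit
                text.append(ch, start, writeLimit - text.length());
                throw new SAXException("Write limit of " + writeLimit
                    + " characters reached");
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Feeds the content to a handler and returns whatever was buffered
    static String collect(int writeLimit, String content) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chars = content.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            // Text gathered before the limit is still available
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(collect(5, "Hello, Tika"));  // prints "Hello"
        System.out.println(collect(-1, "Hello, Tika")); // prints "Hello, Tika"
    }
}
```

With the real class, the equivalent choice would be made by passing a limit to the constructor, for example new BodyContentHandler(-1) when a document is expected to exceed the default limit.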

The methods of this class are as follows:

Method                    Action Performed
MatchingContentHandler    Allows you to get data by XPath

Implementation:

Example 1: Reading everything into an internal string buffer

Java
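// Java Program to Read Everything into Inner String Buffer

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {

    // Method 1
    // Parses a classpath resource and returns its body text
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {

        // Opening the resource as an InputStream;
        // try-with-resources closes it after parsing
        try (InputStream stream
                 = this.getClass()
                       .getClassLoader()
                       .getResourceAsStream(fileName)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsing the document
            parser.parse(stream, handler, metadata, context);

            // The handler's internal buffer holds the body text
            return handler.toString();
        }
    }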
 
    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
 
        // Creating object of main class in main method
        GFG example = new GFG();
 
        // Display message for better readability
        System.out.println("Result");
 
        // Calling the method 1 to parse string by
        // providing file as an argument
        System.out.println(example.parseToStringExample(
            "test-reading.pdf"));
    }
}


Output:



Example 2: Writing content into a file

Java

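// Java Program to Write Parsed Content into a File.
// The OutputStream constructor applies no content length limit.

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {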
 
    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
 
        // Creating an object of the class
        GFG example = new GFG();
 
        // Calling the Method 2 in main() method and
        // passing the file and directory path as arguments
        // to it
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }
 
    // Method 2
    // Writing parsed data to a file
    public void
    writeParsedDataToFile(String readFromFileName,
                          String writeToFileName)
        throws IOException, TikaException, SAXException
    {
 
        // Creating an object of File class
        File yourFile = new File(writeToFileName);

        // Creates the file only if it does not already exist
        yourFile.createNewFile();

        // Opening the input resource and the output file;
        // try-with-resources closes both after parsing.
        // Passing false overwrites any existing file content
        try (InputStream stream
                 = this.getClass()
                       .getClassLoader()
                       .getResourceAsStream(readFromFileName);
             FileOutputStream fileOutputStream
                 = new FileOutputStream(yourFile, false)) {

            Parser parser = new AutoDetectParser();
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            parser.parse(stream, handler, metadata, context);
        }
    }
}

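Output:

Nothing is printed to the console, because in this case the program writes all extracted content to the file at the given path.

The program generates a ".txt" file containing the text content of the ".pdf" file.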