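BodyContentHandler Class in Java
Apache Tika is a library that lets you extract data from different kinds of documents (.PDF, .DOCX, and so on). In this tutorial, we will extract data using BodyContentHandler. The following Maven dependency is used: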
<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-parsers</artifactId>
    <version>1.26</version>
</dependency>
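BodyContentHandler is a decorator class that lets us get everything inside the XHTML <body> tag. The <body> tags themselves are not included in the resulting value.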
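To make the idea of "body content only" concrete, here is a minimal sketch that uses only the JDK's SAX classes. It is not Tika's implementation: BodyOnlySketch, BodyTextHandler and extractBody are hypothetical names introduced for illustration, and the real BodyContentHandler selects the body through an XPath match rather than a boolean flag.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class BodyOnlySketch {

    // Hypothetical stand-in handler: buffers text only while inside <body>
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder text = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName,
                                 String qName, Attributes attributes) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                text.append(ch, start, length);
            }
        }

        @Override
        public String toString() {
            return text.toString();
        }
    }

    // Parses an XHTML string and returns the text found inside <body>
    static String extractBody(String xhtml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            BodyTextHandler handler = new BodyTextHandler();
            parser.parse(new InputSource(new StringReader(xhtml)), handler);
            return handler.toString();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<html><head><title>Ignored title</title></head>"
                     + "<body><p>Hello</p><p>Tika</p></body></html>";
        // The title in <head> is skipped; prints HelloTika
        System.out.println(extractBody(xhtml));
    }
}
```

The text inside <head> never reaches the buffer, which mirrors the behaviour described above: only what sits inside the body ends up in the result.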
Let us first discuss the constructors of this class:

BodyContentHandler(): Writes all content into an internal string buffer; to get the content, just call toString(). By default, the maximum content length is 100 000 characters. If this limit is reached, a SAXException is thrown.
BodyContentHandler(int writeLimit): Writes all content into an internal string buffer; to get the content, just call toString(). The write limit is the maximum number of characters that can be read; set it to -1 to disable the limit. If this limit is reached, a SAXException is thrown.
BodyContentHandler(OutputStream outputStream): Writes all content into the given output stream, without any content limit.
BodyContentHandler(Writer writer): Writes all content into the given writer, without any content limit.
BodyContentHandler(ContentHandler handler): Passes all content to the given handler.
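The write-limit behaviour described above can be sketched with plain JDK SAX classes. This is a simplified stand-in, not Tika's code: WriteLimitSketch, LimitedTextHandler and feed are hypothetical names, and keeping the part of the text that still fits before throwing is an assumption of this sketch.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WriteLimitSketch {

    // Hypothetical handler that buffers characters up to a write limit
    static class LimitedTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private final int writeLimit; // a negative value (such as -1) disables the limit

        LimitedTextHandler(int writeLimit) {
            this.writeLimit = writeLimit;
        }

        @Override
        public void characters(char[] ch, int start, int length)
            throws SAXException {
            if (writeLimit < 0 || buffer.length() + length <= writeLimit) {
                buffer.append(ch, start, length);
                return;
            }
            // Assumption of this sketch: keep what still fits, then signal the limit
            buffer.append(ch, start, writeLimit - buffer.length());
            throw new SAXException(
                "Write limit of " + writeLimit + " characters reached");
        }

        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Feeds text to the handler and returns whatever was buffered
    static String feed(String text, int writeLimit) {
        LimitedTextHandler handler = new LimitedTextHandler(writeLimit);
        char[] chars = text.toCharArray();
        try {
            handler.characters(chars, 0, chars.length);
        } catch (SAXException e) {
            // Limit reached; the content buffered so far is still available
        }
        return handler.toString();
    }

    public static void main(String[] args) {
        System.out.println(feed("abcdefgh", 5));  // prints abcde
        System.out.println(feed("abcdefgh", -1)); // prints abcdefgh
    }
}
```

With the real class, the same pattern applies: callers that pass a write limit should be prepared to catch the SAXException and decide whether the truncated content is still useful.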
The methods of this class are as follows:

Method: MatchingContentHandler. Action performed: allows you to get data by XPath.

Note: The BodyContentHandler class does not itself implement any method of the ContentHandler interface. It only defines the XPath expression that MatchingContentHandler uses to select the XHTML body content.
Implementation:
Example 1: Reading everything into an internal string buffer
Java
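// Java Program to Read Everything into an Internal String Buffer

import java.io.IOException;
import java.io.InputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
public class GFG {

    // Method 1
    // Parses a classpath resource and returns its body text
    public String parseToStringExample(String fileName)
        throws IOException, TikaException, SAXException
    {
        // Opening the resource as an InputStream;
        // try-with-resources closes it after parsing
        try (InputStream stream
             = this.getClass()
                   .getClassLoader()
                   .getResourceAsStream(fileName)) {

            // getResourceAsStream returns null for a missing resource
            if (stream == null) {
                throw new IOException("Resource not found: " + fileName);
            }

            Parser parser = new AutoDetectParser();
            // Default constructor: internal string buffer
            ContentHandler handler = new BodyContentHandler();
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsing the document
            parser.parse(stream, handler, metadata, context);

            // The buffered body text
            return handler.toString();
        }
    }

    // Method 2
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating object of main class in main method
        GFG example = new GFG();

        // Display message for better readability
        System.out.println("Result");

        // Calling method 1 to parse the file
        // passed as an argument
        System.out.println(example.parseToStringExample(
            "test-reading.pdf"));
    }
}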
Output:
Example 1 prints "Result", followed by the text extracted from test-reading.pdf.
Example 2: Writing content into a file (the OutputStream constructor applies no content limit)
Java
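// Java Program to Write Parsed Content into a File
// (BodyContentHandler(OutputStream) applies no content limit)

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

// Main class
// BodyContentHandlerWriteToFileExample
public class GFG {

    // Method 1
    // Main driver method
    public static void main(String[] args)
        throws TikaException, IOException, SAXException
    {
        // Creating an object of the class
        GFG example = new GFG();

        // Calling Method 2 and passing the resource name
        // and the output file path as arguments
        example.writeParsedDataToFile(
            "test-reading.pdf",
            "/Users/ali_zhagparov/Desktop/pdf-content.txt");
    }

    // Method 2
    // Writing parsed data to a file
    public void
    writeParsedDataToFile(String readFromFileName,
                          String writeToFileName)
        throws IOException, TikaException, SAXException
    {
        // Creating an object of File class
        File yourFile = new File(writeToFileName);

        // Creates the file only if it does not exist yet;
        // an existing file is left in place
        yourFile.createNewFile();

        // Both streams are closed automatically;
        // append = false overwrites existing content
        try (InputStream stream
             = this.getClass()
                   .getClassLoader()
                   .getResourceAsStream(readFromFileName);
             OutputStream fileOutputStream
             = new FileOutputStream(yourFile, false)) {

            // getResourceAsStream returns null for a missing resource
            if (stream == null) {
                throw new IOException(
                    "Resource not found: " + readFromFileName);
            }

            Parser parser = new AutoDetectParser();
            ContentHandler handler
                = new BodyContentHandler(fileOutputStream);
            Metadata metadata = new Metadata();
            ParseContext context = new ParseContext();

            // Parsed body content goes straight to the file
            parser.parse(stream, handler, metadata, context);
        }
    }
}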
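Output:
Nothing is visible in the console window, because the program writes all the extracted content to the file at the given path.
The program generates a ".txt" file containing the text content of the ".pdf" file.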