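Java Program to Extract Content from an ODF File
ODF stands for Open Document Format. It is a family of international standards and the successor to commonly used, deprecated vendor-specific document formats such as .doc, .wpd, and .xls. ODF documents are often smaller than their equivalents in those formats. The OpenDocumentParser class from the Apache Tika library is used to extract content from ODF files.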
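To make the format less abstract: an ODF file is a ZIP package whose entries include a `mimetype` file and a `content.xml` file. The sketch below uses only the JDK, not Tika, and builds a tiny ODF-like package in memory before reading it back. The class name `OdfPackageSketch`, its helper methods, and the sample XML are hypothetical, and the package it builds is not a complete, valid .odt file.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class OdfPackageSketch {

    // Builds a tiny ODF-like ZIP package in memory.
    // Hypothetical sample data: NOT a complete or valid .odt file.
    static byte[] buildSamplePackage() throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            addEntry(zip, "mimetype", "application/vnd.oasis.opendocument.text");
            addEntry(zip, "content.xml", "<doc><p>Hello ODF</p></doc>");
        }
        return bytes.toByteArray();
    }

    // Writes one text entry into the ZIP stream.
    static void addEntry(ZipOutputStream zip, String name, String text) throws IOException {
        zip.putNextEntry(new ZipEntry(name));
        zip.write(text.getBytes(StandardCharsets.UTF_8));
        zip.closeEntry();
    }

    // Reads every entry of the package into a name -> text map, in archive order.
    static Map<String, String> readEntries(byte[] packageBytes) throws IOException {
        Map<String, String> entries = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(packageBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[4096];
                int n;
                while ((n = zip.read(buffer)) != -1) {
                    out.write(buffer, 0, n);
                }
                entries.put(entry.getName(),
                        new String(out.toByteArray(), StandardCharsets.UTF_8));
            }
        }
        return entries;
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> entries = readEntries(buildSamplePackage());
        System.out.println("Entries: " + entries.keySet());
        System.out.println("MIME type: " + entries.get("mimetype"));
    }
}
```

A parser such as OpenDocumentParser does considerably more than this (styles, metadata, XHTML output), so treat the sketch only as a picture of the container layout.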
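Methods used:
- BodyContentHandler(): creates a content handler that writes XHTML body character events to an internal string buffer. The no-argument constructor bounds that buffer at 100,000 characters; BodyContentHandler(-1) removes the limit.
- Metadata(): constructs a new, empty metadata object.
- ParseContext(): creates a parse context object, used to pass context information to Tika parsers.
- parse(): called on an instantiated parser object; it takes the input stream, the content handler, the metadata object, and the parse context.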
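The idea of "writing body character events to an internal string buffer" can be sketched with the JDK's own SAX API. The handler below is a hypothetical stand-in for BodyContentHandler, not Tika's implementation: it appends character events to a StringBuilder only while the parser is inside a `body` element. The names `BufferingHandlerSketch`, `BodyTextHandler`, and `extractBodyText` are assumptions made for this illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class BufferingHandlerSketch {

    // Hypothetical stand-in for BodyContentHandler: collects character
    // events that occur inside <body> into an internal StringBuilder.
    static class BodyTextHandler extends DefaultHandler {
        private final StringBuilder buffer = new StringBuilder();
        private boolean inBody = false;

        @Override
        public void startElement(String uri, String localName, String qName,
                                 Attributes attributes) {
            if ("body".equals(qName)) {
                inBody = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if ("body".equals(qName)) {
                inBody = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (inBody) {
                buffer.append(ch, start, length);
            }
        }

        // Like BodyContentHandler, expose the collected text via toString().
        @Override
        public String toString() {
            return buffer.toString();
        }
    }

    // Parses an XHTML string and returns only the text found inside <body>.
    static String extractBodyText(String xhtml) throws Exception {
        BodyTextHandler handler = new BodyTextHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><head><title>ignored</title></head>"
                + "<body><p>Hello from the body</p></body></html>";
        System.out.println(extractBodyText(xhtml));
    }
}
```

In the Tika program, the parser plays the role of the SAX event source and the handler's toString() returns the extracted text in the same way.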
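The following dependencies are required to run the Java code below. The Metadata, ParseContext, and BodyContentHandler classes live in tika-core, so a matching tika-core JAR must also be on the classpath if it is not already present: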
tika-parsers-1.24.1.jar
commons-io-2.8.0.jar
slf4j-api-2.0.0-alpha0.jar
Implementation:
Java
// Java program to extract content from an ODF file
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.odf.OpenDocumentParser;
import org.apache.tika.sax.BodyContentHandler;

import java.io.File;
import java.io.FileInputStream;

public class OdfContentExtractor {
    public static void main(String[] args)
    {
        // Collects the text of the document body.
        BodyContentHandler handler = new BodyContentHandler();
        // Filled in by the parser with the document's metadata.
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        OpenDocumentParser openDocumentParser = new OpenDocumentParser();

        // .odt is the OpenDocument text format.
        // try-with-resources closes the stream even if parsing fails.
        try (FileInputStream inputStream
                 = new FileInputStream(new File("F:\\geeks.odt"))) {

            // Pass the InputStream, ContentHandler, Metadata,
            // and ParseContext to the parse method.
            openDocumentParser.parse(inputStream, handler,
                                     metadata, parseContext);

            System.out.println("Content in the document :"
                               + handler.toString());

            // Display the metadata of the ODF file.
            System.out.println("Metadata of the document:");
            for (String name : metadata.names()) {
                System.out.println(name + " : = "
                                   + metadata.get(name));
            }
        }
        catch (Exception e) {
            System.out.println(
                "failed to extract content due to " + e);
        }
    }
}
Output:
Content in the document :Geekforgeeks has a great content on DSA.
Metadata of the document:
date : = 2020-11-21T05:38:00Z
meta:paragraph-count : = 1
meta:word-count : = 6
meta:initial-author : = Mohan Sai
initial-creator : = Mohan Sai
dc:creator : = Mohan Sai
generator : = MicrosoftOffice/15.0 MicrosoftWord
Word-Count : = 6
dcterms:created : = 2020-11-21T05:36:00Z
dcterms:modified : = 2020-11-21T05:38:00Z
Last-Modified : = 2020-11-21T05:38:00Z
nbPara : = 1
Last-Save-Date : = 2020-11-21T05:38:00Z
meta:character-count : = 40
Paragraph-Count : = 1
meta:save-date : = 2020-11-21T05:38:00Z
modified : = 2020-11-21T05:38:00Z
Edit-Time : = PT0S
nbCharacter : = 40
nbPage : = 1
nbWord : = 6
Content-Type : = application/vnd.oasis.opendocument.text
creator : = Mohan Sai
meta:author : = Mohan Sai
meta:creation-date : = 2020-11-21T05:36:00Z
Creation-Date : = 2020-11-21T05:36:00Z
xmpTPg:NPages : = 1
Character Count : = 40
editing-cycles : = 3
Page-Count : = 1
Author : = Mohan Sai
meta:page-count : = 1
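
Every metadata value comes back as a string, so values such as the ISO-8601 timestamps and the `PT0S` duration in the output above need parsing before they can be compared or computed with. The following is a minimal sketch using java.time. A plain map stands in for Tika's Metadata object, and the class name `MetadataValuesSketch` and the helper `timeBetween` are hypothetical; the values are copied from the sample output.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.TreeMap;

public class MetadataValuesSketch {

    // Time between two ISO-8601 UTC timestamps such as "2020-11-21T05:36:00Z".
    static Duration timeBetween(String created, String modified) {
        return Duration.between(Instant.parse(created), Instant.parse(modified));
    }

    public static void main(String[] args) {
        // Stand-in for Tika's Metadata object (assumption for this sketch);
        // the values are copied from the sample output.
        Map<String, String> metadata = new TreeMap<>();
        metadata.put("dcterms:created", "2020-11-21T05:36:00Z");
        metadata.put("dcterms:modified", "2020-11-21T05:38:00Z");
        metadata.put("Edit-Time", "PT0S");
        metadata.put("meta:word-count", "6");

        Duration gap = timeBetween(metadata.get("dcterms:created"),
                                   metadata.get("dcterms:modified"));
        // "PT0S" is an ISO-8601 duration of zero seconds.
        Duration editTime = Duration.parse(metadata.get("Edit-Time"));
        int words = Integer.parseInt(metadata.get("meta:word-count"));

        System.out.println("Created-to-modified gap: " + gap.toMinutes() + " min");
        System.out.println("Edit time is zero: " + editTime.isZero());
        System.out.println("Word count: " + words);
    }
}
```

With the sample values, the gap between creation and modification is two minutes and the recorded edit time is zero.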