I am using PDFBox to extract text from a PDF document. Once the text is extracted, I insert it into a MySQL table.
Code:
PDDocument document = PDDocument.load(new File(path1));
if (!document.isEncrypted()) {
    PDFTextStripper tStripper = new PDFTextStripper();
    String pdfFileInText = tStripper.getText(document);
    String lines[] = pdfFileInText.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");
        String sql="insert IGNORE into test.indextable values (?,?);";
        preparedStatement = con1.prepareStatement(sql);
        int i=0;
        for (String word : words) {
            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word=word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);
            /* preparedStatement.executeUpdate();
            System.out.print("Add ");*/
            preparedStatement.addBatch();
            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();
                System.out.print("Add Thousand");
            }
        }
        if (i > 0) {
            preparedStatement.executeBatch();
            System.out.print("Add Remaining");
        }
    }
}
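The code works fine, but as you can see, if the document is large and contains around 10 million words, lines[] will not cope and an out-of-memory exception will be thrown.
I cannot think of a solution. Is there a way to extract the words and insert them into the database directly, or is that not possible?
Edit:
This is what I did:
The processText method: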
public void processText(String text) throws SQLException {
    String lines[] = text.split("\\r?\\n");
    for (String line : lines) {
        String[] words = line.split(" ");
        String sql="insert IGNORE into test.indextable values (?,?);";
        preparedStatement = con1.prepareStatement(sql);
        int i=0;
        for (String word : words) {
            // check if one or more special characters at end of string then remove OR
            // check special characters in beginning of the string then remove
            // insert every word directly to table db
            word=word.replaceAll("([\\W]+$)|(^[\\W]+)", "");
            preparedStatement.setString(1, path1);
            preparedStatement.setString(2, word);
            preparedStatement.addBatch();
            i++;
            if (i % 1000 == 0) {
                preparedStatement.executeBatch();
                System.out.print("Add Thousand");
            }
        }
        if (i > 0) {
            preparedStatement.executeBatch();
            System.out.print("Add Remaining");
        }
    }
    preparedStatement.close();
    System.out.println("Successfully commited changes to the database!");
}
The index method (which calls the method above):
public void index() throws Exception {
    // Connection con1 = con.connect();
    try {
        // Connection con1=con.connect();
        // Connection con1 = con.connect();
        Statement statement = con1.createStatement();
        ResultSet rs = statement.executeQuery("select * from filequeue where Status='Active' LIMIT 5");
        while (rs.next()) {
            // get the filepath of the PDF document
            path1 = rs.getString(2);
            int getNum = rs.getInt(1);
            // while running the process, update status : Processing
            //updateProcess_DB(getNum);
            Statement test = con1.createStatement();
            test.executeUpdate("update filequeue SET STATUS ='Processing' where UniqueID="+getNum);
            try {
                // call the index function
                /*Indexing process = new Indexing();
                process.index(path1);*/
                PDDocument document = PDDocument.load(new File(path1));
                if (!document.isEncrypted()) {
                    PDFTextStripper tStripper = new PDFTextStripper();
                    for(int p=1; p<=document.getNumberOfPages();++p) {
                        tStripper.setStartPage(p);
                        tStripper.setEndPage(p);
                        String pdfFileInText = tStripper.getText(document);
                        processText(pdfFileInText);
                    }
                }
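Your current code takes the string pdfFileInText returned by tStripper.getText(document); and so gets the whole document at once. First, refactor everything you do with this string (starting with pdfFileInText.split) into a separate method, e.g. processText. Then change your code to this: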
PDFTextStripper tStripper = new PDFTextStripper();
for (int p = 1; p <= document.getNumberOfPages(); ++p)
{
    tStripper.setStartPage(p); // 1-based
    tStripper.setEndPage(p); // 1-based
    String pdfFileInText = tStripper.getText(document);
    processText(pdfFileInText);
}
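The new code processes each page separately. This way you can do the database inserts in smaller steps, and you do not have to hold all the words of the document in memory, only the words of one page.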
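To make the "smaller steps" idea concrete, here is a minimal sketch of the word-cleaning and batching part on its own, with no PDFBox or JDBC so it can be run as is. The class name WordBatcher, its methods, and the Consumer sink are hypothetical names invented for this illustration; they are not part of PDFBox or JDBC. Unlike the code in the question, the sketch keeps one counter across all pages and skips tokens that are empty after cleaning, which is an assumption about what you want stored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical helper: queues cleaned words and hands them to a sink in fixed-size batches. */
public class WordBatcher {
    private final int batchSize;
    private final Consumer<List<String>> sink; // stand-in for the database insert step
    private final List<String> pending = new ArrayList<>();

    public WordBatcher(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Splits one page of text on whitespace, strips leading/trailing non-word characters, queues the words. */
    public void addPage(String pageText) {
        for (String token : pageText.split("\\s+")) {
            String word = token.replaceAll("^\\W+|\\W+$", "");
            if (word.isEmpty()) {
                continue; // token was empty or punctuation only
            }
            pending.add(word);
            if (pending.size() == batchSize) {
                flush();
            }
        }
    }

    /** Sends whatever is still queued; call once after the last page. */
    public void flush() {
        if (pending.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(pending)); // copy, so clearing does not affect the sink's list
        pending.clear();
    }

    public static void main(String[] args) {
        List<List<String>> batches = new ArrayList<>();
        WordBatcher batcher = new WordBatcher(3, batches::add);
        batcher.addPage("Hello, world! (PDF) text"); // in real use: tStripper.getText(document) for page p
        batcher.addPage("-- second page.");
        batcher.flush();
        System.out.println(batches); // prints [[Hello, world, PDF], [text, second, page]]
    }
}
```

In your setting the sink would presumably loop over the batch, call setString and addBatch on a single PreparedStatement that was prepared once before the page loop, and then call executeBatch. Only one batch of words plus the text of the current page is held in memory at any time.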