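I'm working on a way to split a single large PDF file that contains the monthly credit card statements. It was built for printing, but we want to split it into individual files for later use. Each statement has a variable length (2, 3, 4 pages...), so we need to "read" every page, find "Page 1 of X", and cut a chunk that runs until the next "Page 1 of X" appears. In addition, each resulting file must be named with a unique ID, which is also printed on the "Page 1 of X" page.
While researching, I found a tool called "PDF Content Split SA" that does exactly what we need. But I'm sure there is a way to do this on Linux (we're moving toward open source / libre software).
Thanks for reading. Any help would be much appreciated.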
Edit
So far I've found a Nautilus script that should do exactly what we need, but I can't get it to work.
#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.
# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset $IFS
# process files
for file in "${filelist[@]}"; do
    pagecount=`pdfinfo $file | grep "Pages" | awk '{ print $2 }'`
    # MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8:
    storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
    pattern=''
    pagetitle=''
    datestamp=''
    for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do
        header=`pdftotext -f $pageindex -l $pageindex $file - | head -n 1`
        pageid=`pdftotext -f $pageindex -l $pageindex $file - | egrep '8?[0-9]{9}'`
        let "datestamp =`date +%s%N`" # to avoid overwriting with same new name
        # match ID found on the page to the stored ID
        if [[ $pageid == $storedid ]]; then
            pattern+="$pageindex " # adds number as text to variable separated by spaces
            pagetitle+="$header+"
            if [[ $pageindex == $pagecount ]]; then #process last output of the file
                pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            #process previous set of pages to output
            pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done
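I've edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I tried debugging it by running it from the console with logging and adding markers to the code; apparently there is a conflict with the value returned by pdfinfo, but I don't know how to fix it.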
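One way to rule out the parsing side, sketched under the assumption that the conflict comes from how the pdfinfo output is read (the log itself isn't shown here): match only the line that starts with `Pages:` and quote the file path, so that a path containing spaces or any other line mentioning "Pages" can't end up in the value. `page_count` is a hypothetical helper name, not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: reads pdfinfo-style output on stdin and prints only
# the page count. Anchoring on "^Pages:" skips any other line that merely
# contains the word "Pages" (a document title, for example).
page_count() {
    awk '/^Pages:/ { print $2; exit }'
}

# Intended use (needs poppler-utils; note the quoted path):
#   pagecount=$(pdfinfo "$file" | page_count)
```

If the value is still wrong with this in place, the problem is more likely in the file list that Nautilus passes to the script than in pdfinfo.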
I got it working, or at least it runs. Now I'd like to optimize the process: a large PDF with 1,000 items takes up to 40 minutes.
#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search pattern, naming each output file after the matched text.
# read the selected file paths (one per line) into an array; IFS is changed for this read only
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")
# process files
for file in "${filelist[@]}"; do
    # match only the "Pages:" line and quote the path so spaces in it are safe
    pagecount=$(pdfinfo "$file" | awk '/^Pages:/ { print $2 }')
    # MY SEARCH CRITERIA was a 10 digit long ID number that begins with number 8:
    #storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
    # now it is the statement header followed by an 8 digit number
    storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
    pattern=''
    pagetitle=''
    datestamp=''
    #for (( pageindex=1; pageindex <= $pagecount; pageindex+=1 )); do
    # one extra iteration past the last page: pdftotext prints no text there, so pageid is empty
    for (( pageindex=1; pageindex <= pagecount+1; pageindex+=1 )); do
        header=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 1)
        pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
        echo "$pageid"
        datestamp=$(date +%s%N) # to avoid overwriting with same new name
        # match ID found on the page to the stored ID
        if [[ $pageid == "$storedid" ]]; then
            pattern+="$pageindex " # adds number as text to variable separated by spaces
            pagetitle+="$header+"
            if [[ $pageindex == "$pagecount" ]]; then #process last output of the file
                # pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
                # $pattern is left unquoted on purpose: each page number must be its own argument
                pdftk "$file" cat $pattern output "$storedid.pdf"
                storedid=0
                pattern=''
                pagetitle=''
            fi
        else
            #process previous set of pages to output
            # pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
            # skip the call when no pages were collected (e.g. in the extra iteration after the last flush)
            if [[ -n $pattern ]]; then
                pdftk "$file" cat $pattern output "$storedid.pdf"
            fi
            storedid=$pageid
            pattern="$pageindex "
            pagetitle="$header+"
        fi
    done
done
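For the speed problem, one possible direction (a sketch only, not timed against the real file): the loop above starts pdftotext twice for every page, so a file with a few thousand pages spawns thousands of processes. pdftotext ends each page of its output with a form feed character, so the text of the whole document can be extracted in a single pass and the page ranges worked out from that. The names `group_pages` and `split_pdf`, the choice to name the output after the 8 digits alone, and the rule that a page without a match belongs to the current statement are all assumptions made for illustration.

```shell
#!/bin/bash
# Hypothetical helper: reads the complete pdftotext output on stdin (pages are
# separated by form feeds) and prints one line per statement:
#   <id> <first page> <last page>
# A new statement starts whenever the ID found on a page differs from the
# current one; pages with no ID are counted as part of the current statement.
group_pages() {
    awk 'BEGIN { RS = "\f" }
    {
        id = ""
        if (match($0, /RESUMEN DE CUENTA Nº ?[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]/)) {
            id = substr($0, RSTART, RLENGTH)
            sub(/^[^0-9]+/, "", id)          # keep only the digits
        }
        if (id != "" && id != cur) {
            if (cur != "") print cur, first, NR - 1
            cur = id
            first = NR                       # NR is the page number
        }
    }
    END { if (cur != "") print cur, first, NR }'
}

# Hypothetical driver: one pdftotext run for the whole file, then one pdftk
# call per statement using a page range instead of a list of page numbers.
split_pdf() {
    local file=$1 id first last
    pdftotext "$file" - | group_pages |
    while read -r id first last; do
        # stdin is redirected so pdftk cannot consume the range list
        pdftk "$file" cat "$first-$last" output "$id.pdf" < /dev/null
    done
}
```

In the Nautilus script, `split_pdf "$file"` would take the place of the inner page loop. pdftk still reopens the large file once per statement, so if that turns out to be the slow part, this change alone may not be enough.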