Split a single large PDF file into n PDF files based on content, and rename each split file (in Bash)

Posted by debugcn in Dev

巴拉瓦勒

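I'm working on a procedure to split a single large PDF file that contains the monthly statements of a credit card. It was built for printing, but we'd like to split it into individual files for later use. Each statement has a variable length (2 pages, 3 pages, 4 pages...), so we need to "read" each page, find "Page 1 of X", and cut the chunk that runs until the next "Page 1 of X" appears. In addition, each resulting file must be named with a unique ID, which is also printed on the "Page 1 of X" page.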
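The splitting rule described above can be sketched independently of any PDF tool: given the total page count and the page numbers on which "Page 1 of X" was found, each statement runs from its own first page to the page before the next one starts. This is only a minimal sketch; `page_ranges` is a hypothetical helper name, and it assumes the start pages are passed in ascending order.

```bash
#!/bin/bash
# page_ranges TOTAL START [START...]
# TOTAL is the number of pages in the PDF; each START is a page on which
# "Page 1 of X" was found (assumed ascending). Prints one "first-last"
# page range per statement.
page_ranges() {
  local total=$1
  shift
  local starts=("$@")
  local i first last
  for (( i = 0; i < ${#starts[@]}; i++ )); do
    first=${starts[i]}
    if (( i + 1 < ${#starts[@]} )); then
      # The statement ends on the page before the next "Page 1 of X"
      last=$(( starts[i + 1] - 1 ))
    else
      # The last statement runs to the end of the document
      last=$total
    fi
    printf '%s-%s\n' "$first" "$last"
  done
}
```

Each printed range could then be handed to a tool such as pdftk, for example `pdftk in.pdf cat 3-5 output statement.pdf`.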

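While researching, I found a tool named "PDF Content Split SA" that does exactly the task we need. But I'm sure there is a way to do this on Linux (we are moving towards OpenSource + Libre).

Thanks for reading. Any help will be much appreciated.

EDIT

So far I have found a Nautilus script that should do exactly what we need, but I can't make it work.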

#!/bin/bash
# NAUTILUS SCRIPT
# automatically splits pdf file to multiple pages based on search criteria while renaming the output files using the search criteria and some of the pdf text.

# read files
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS"); unset $IFS

# process files
for file in "${filelist[@]}"; do
 pagecount=`pdfinfo $file | grep "Pages" | awk '{ print $2 }'`
 # MY SEARCH CRITERIA is a 10 digit long ID number that begins with number 8: 
 storedid=`pdftotext -f 1 -l 1 $file - | egrep '8?[0-9]{9}'`
 pattern=''
 pagetitle=''
 datestamp=''

 for (( pageindex=1; pageindex<=$pagecount; pageindex+=1 )); do

  header=`pdftotext -f $pageindex -l $pageindex $file - | head -n 1`
  pageid=`pdftotext -f $pageindex -l $pageindex $file - | egrep '8?[0-9]{9}'`
  let "datestamp =`date +%s%N`" # to avoid overwriting with same new name

  # match ID found on the page to the stored ID
  if [[ $pageid == $storedid ]]; then
   pattern+="$pageindex " # adds number as text to variable separated by spaces
   pagetitle+="$header+"

   if [[ $pageindex == $pagecount ]]; then #process last output of the file 
    pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
    storedid=0
    pattern=''
    pagetitle=''
   fi
  else 
   #process previous set of pages to output
   pdftk $file cat $pattern output "$storedid $pagetitle $datestamp.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi
 done
done

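I have already edited the search criteria, and the script is correctly placed in the Nautilus scripts folder, but it doesn't work. I tried debugging with the activity log in the console and by adding markers to the code. Apparently there is a conflict with the value returned by pdfinfo, but I have no idea how to solve it.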
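The post does not say what the pdfinfo conflict actually was. Two details in the script could plausibly upset the page count, and both are assumptions here: `$file` is unquoted, so a path containing spaces is split into several arguments, and `grep "Pages"` matches any line that contains that word (a document title, for instance), not only the `Pages:` field. A minimal sketch of a stricter parse follows; `page_count` is a hypothetical helper name.

```bash
#!/bin/bash
# page_count: reads pdfinfo output on stdin and prints only the number
# from the "Pages:" field, ignoring other lines that mention "Pages".
page_count() {
  awk '$1 == "Pages:" { print $2; exit }'
}
```

It would be used as `pagecount=$(pdfinfo "$file" | page_count)`, with `"$file"` quoted so that paths with spaces survive.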

巴拉瓦勒

I made it. At least, it works. But now I'd like to optimize the process: a single massive PDF with 1000 items takes up to 40 minutes to process.
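One likely cost, though nothing here has been measured, is that the script below starts pdftotext twice for every page. pdftotext separates pages in its text output with a form feed character, so the whole file can be converted once and the pages scanned in a single awk pass. The following is a minimal sketch under that assumption; `page_ids` is a hypothetical helper name, and it keeps only the digits of the account number.

```bash
#!/bin/bash
# page_ids: reads the output of `pdftotext file.pdf -` on stdin, where pages
# are separated by form feeds, and prints "PAGE<TAB>ID" for every page that
# contains a "RESUMEN DE CUENTA Nº <digits>" line.
page_ids() {
  awk -v RS='\f' '
    match($0, /RESUMEN DE CUENTA Nº ?[0-9]+/) {
      id = substr($0, RSTART, RLENGTH)
      gsub(/[^0-9]/, "", id)   # keep only the digits of the account number
      print NR "\t" id         # NR is the page number
    }'
}
```

Usage would be `pdftotext "$file" - | page_ids`, which yields the page-to-ID map in one pass; the split itself still needs one pdftk call per statement.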

#!/bin/bash
# NAUTILUS SCRIPT
# Automatically splits a PDF file into multiple files based on a search criterion,
# naming each output file after the text matched by that criterion.

# Read the selected files (Nautilus passes them as a newline-separated list)
IFS=$'\n' read -d '' -r -a filelist < <(printf '%s\n' "$NAUTILUS_SCRIPT_SELECTED_FILE_PATHS")

# Process files
for file in "${filelist[@]}"; do
 pagecount=$(pdfinfo "$file" | grep "Pages" | awk '{ print $2 }')
 # MY SEARCH CRITERION is the line "RESUMEN DE CUENTA Nº" followed by an 8-digit number
 storedid=$(pdftotext -f 1 -l 1 "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')
 pattern=''
 pagetitle=''
 datestamp=''

 # Loop one page past the end so that the last set of pages is also written out
 for (( pageindex=1; pageindex <= pagecount+1; pageindex+=1 )); do

  header=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | head -n 1)
  pageid=$(pdftotext -f "$pageindex" -l "$pageindex" "$file" - | grep -E 'RESUMEN DE CUENTA Nº ?[0-9]{8}')

  echo "$pageid"
  datestamp=$(date +%s%N) # kept from the original naming scheme; not used in the output name below

  # Match the ID found on the page to the stored ID
  if (( pageindex <= pagecount )) && [[ $pageid == "$storedid" ]]; then
   pattern+="$pageindex " # adds the page number as text, separated by spaces
   pagetitle+="$header+"
  else
   # The ID changed, or we are past the last page: write out the previous set of pages.
   # $pattern is deliberately unquoted so that each page number becomes its own argument.
   pdftk "$file" cat $pattern output "$storedid.pdf"
   storedid=$pageid
   pattern="$pageindex "
   pagetitle="$header+"
  fi
 done
done
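The script above names each output file after the matched ID alone, so two statements that yield the same ID would overwrite each other; the earlier version avoided this by appending a timestamp. An alternative, shown here only as a sketch, is to add a numeric suffix when the name is already taken. `unique_name` is a hypothetical helper name, not part of the original script.

```bash
#!/bin/bash
# unique_name DIR BASE
# Prints DIR/BASE.pdf if that path is free, otherwise DIR/BASE-1.pdf,
# DIR/BASE-2.pdf, ... using the first suffix that does not exist yet.
unique_name() {
  local dir=$1 base=$2
  local candidate="$dir/$base.pdf"
  local n=1
  while [[ -e $candidate ]]; do
    candidate="$dir/$base-$n.pdf"
    n=$(( n + 1 ))
  done
  printf '%s\n' "$candidate"
}
```

In the script this could replace the fixed name, for example `pdftk "$file" cat $pattern output "$(unique_name . "$storedid")"`.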

Edited on 2021-06-15