A solution (of sorts) for extracting tabular data from PDF files

Big_Al_Tx

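I needed to extract tabular data from a large number of pages in many PDF documents. The built-in text export feature in Adobe's Acrobat Reader was useless: text extracted that way loses the spatial relationships established by the tables. Many people have asked questions about this problem, and I tried many of the solutions offered for it, with varying results. So I set out to develop my own solution. It's developed enough (I think) to share here.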

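I first tried looking at the distribution of text (in terms of x and y position on the page) to determine where the row and column breaks were. Using the Python module "pdfminer", I extracted the text and its BoundingBox parameters, sifted through each piece of text, and mapped how many pieces of text were on the page at a given x or y value. The idea was to look at the distribution of text (horizontally for row breaks, vertically for column breaks); wherever the density was zero (meaning there was a clear gap across or up and down the table), that would identify a row or column break.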
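
The density idea can be sketched roughly as follows. This is a minimal illustration rather than the code I actually used: the function name `find_gaps`, the `page_size` parameter, and the sample boxes are hypothetical, and each box is assumed to be an `(x0, y0, x1, y1)` tuple like pdfminer's `bbox` attribute.

```python
def find_gaps(boxes, axis=0, page_size=100):
    """Return (start, end) ranges along one axis that no text box covers.

    boxes: (x0, y0, x1, y1) tuples. axis=0 projects onto x (column breaks),
    axis=1 projects onto y (row breaks). page_size is an assumed page extent.
    """
    density = [0] * page_size
    for box in boxes:
        lo, hi = int(box[axis]), int(box[axis + 2])
        for pos in range(max(lo, 0), min(hi, page_size)):
            density[pos] += 1            # one more text box covers this position

    gaps, start = [], None
    for pos, count in enumerate(density):
        if count == 0 and start is None:
            start = pos                  # a gap opens here
        elif count != 0 and start is not None:
            gaps.append((start, pos))    # the gap closes here
            start = None
    if start is not None:
        gaps.append((start, page_size))  # gap runs to the edge of the page
    return gaps


# Two columns and two rows of text (hypothetical coordinates)
sample_boxes = [(10, 70, 40, 80), (60, 70, 90, 80),
                (10, 50, 40, 60), (60, 50, 90, 60)]
column_gaps = find_gaps(sample_boxes, axis=0)
row_gaps = find_gaps(sample_boxes, axis=1)
```

Each zero-density range that is not a page margin is a candidate break. Note that a title spanning the whole table width would close every gap along x, which is one reason this approach turned out to be fragile.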

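The idea does work, but only sometimes. It assumes that the table has the same number and alignment of cells vertically and horizontally (a simple grid) and that there is a distinct gap between the text of adjacent cells. Also, if text spans multiple columns (a title above the table, a footer below it, merged cells, etc.), identifying column breaks becomes harder. You may be able to identify which text elements above or below the table should be ignored, but I couldn't find a good way to handle merged cells.

Looking across the page to identify row breaks brought several other challenges. First, pdfminer automatically tries to group pieces of text that are located near each other, even when they span more than one cell in a table. In those cases, the BoundingBox of that text object covers multiple lines, obscuring any row breaks that may have been crossed. Even if every line of text were extracted separately, the challenge would be to distinguish the normal space separating consecutive lines of text from a row break.

After exploring various workarounds and doing a lot of testing, I decided to step back and try another approach.

The tables containing the data I needed to extract all have borders around them, so I reasoned that I should be able to find the elements in the PDF file that draw those lines. However, when I looked at the elements I could extract from the source file, I got some surprising results.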

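You'd think that lines would be represented as "line objects", but you'd be wrong (at least for the files I was looking at). If they aren't "lines", then maybe they simply draw a rectangle for each cell, adjusting the linewidth attribute to get the line thickness they want, right? No. It turned out that the lines were actually drawn as "rectangle objects" with one very small dimension (a narrow width to create vertical lines, or a short height to create horizontal lines). And where the lines appear to meet at the corners, the rectangles don't: a very small rectangle fills the gap.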
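
To make the rectangle test concrete, here is a small sketch. The helper name `classify_rect` and the sample coordinates are made up for illustration; the tolerance of 3 matches the `brk_tol` value used in the script further down, and the comparison mirrors the one applied there to each `LTRect` bounding box.

```python
def classify_rect(bbox, tol=3):
    """Say which kind of table line a rectangle looks like, judged by its thin side."""
    x0, y0, x1, y1 = bbox
    kinds = []
    if (x1 - x0) < tol:                  # narrow width: acts as a vertical line
        kinds.append("vertical")
    if (y1 - y0) < tol:                  # short height: acts as a horizontal line
        kinds.append("horizontal")
    return kinds


rects = [
    (50.0, 100.0, 50.5, 300.0),   # tall and narrow: a vertical rule
    (50.0, 300.0, 400.0, 300.5),  # wide and short: a horizontal rule
    (50.0, 300.0, 50.5, 300.5),   # tiny square: a corner filler
    (60.0, 110.0, 200.0, 180.0),  # an ordinary rectangle, not a line
]
labels = [classify_rect(r) for r in rects]
```

A corner filler passes both tests, so it contributes both an x break and a y break, which is harmless because it sits on lines that already exist.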

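Once I was able to recognize what to look for, I then had to contend with multiple rectangles placed side by side to create thick lines. Ultimately, I wrote a routine to group similar values and calculate an average value to use for the row and column breaks later on.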
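
One way to do this grouping is sketched below. It is a simplified variant, not the `condense_list` routine from the script: it sorts the values first and starts a new group whenever the distance to the previous value reaches the tolerance. The name `group_and_average` and the sample positions are hypothetical.

```python
def group_and_average(values, tolerance=3):
    """Cluster values whose sorted neighbors are closer than tolerance; return each cluster's mean."""
    groups = []
    for value in sorted(values):
        if groups and value - groups[-1][-1] < tolerance:
            groups[-1].append(value)     # close to the previous value: same line
        else:
            groups.append([value])       # far from the previous value: new line
    return [sum(group) / float(len(group)) for group in groups]


# Midpoints of thin rectangles; the first three together form one thick line
x_positions = [50.0, 50.5, 51.0, 200.25, 200.75, 399.5]
col_breaks = group_and_average(x_positions)
```

Sorting first means each value only has to be compared with its predecessor, at the cost of chaining: a long run of closely spaced values would be merged into a single group.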

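Then it was a matter of processing the text from the table. I chose to use a SQLite database to store, analyze, and regroup the text from the PDF file. I know there are other "pythonic" options, and some may find those approaches more familiar and easier to use, but I felt that the amount of data I'd be dealing with would best be handled with an actual database file.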
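
The core of the database approach is a bounding-box query: fetch every text fragment whose box lies inside a cell's borders. Here is a minimal, self-contained sketch using an in-memory database. The table name `pdf_text`, the helper `cell_text`, and the sample rows are illustrative assumptions and do not match the schema in the script below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pdf_text (txt TEXT, x0 INTEGER, y0 INTEGER, x1 INTEGER, y1 INTEGER)")
conn.executemany(
    "INSERT INTO pdf_text VALUES (?, ?, ?, ?, ?)",
    [
        ("Name", 12, 82, 48, 94),
        ("Qty", 62, 82, 80, 94),
        ("Widget", 12, 62, 50, 74),
        ("7", 62, 62, 68, 74),
    ],
)


def cell_text(conn, left, bottom, right, top):
    """Return the text fragments whose boxes lie entirely inside the cell, top to bottom."""
    rows = conn.execute(
        "SELECT txt FROM pdf_text "
        "WHERE x0 >= ? AND x1 <= ? AND y0 >= ? AND y1 <= ? "
        "ORDER BY y0 DESC, x0 ASC",
        (left, right, bottom, top),
    ).fetchall()
    return [row[0] for row in rows]


header_left = cell_text(conn, 10, 80, 55, 100)
body_right = cell_text(conn, 55, 60, 100, 80)
conn.close()
```

The sketch selects the rows and leaves joining them to Python instead of using `group_concat`, because SQLite documents the order of elements concatenated by `group_concat` as arbitrary; fetching ordered rows keeps the reading order explicit.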

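As I mentioned earlier, pdfminer groups text that is located close together, and a group may cross cell boundaries. An initial attempt to split the pieces of text shown on separate lines within one of these groups was only partially successful; it's one of the areas I intend to develop further (namely, how to bypass the pdfminer LTTextbox routine so I can get the pieces individually).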
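
The splitting idea can be illustrated like this. It is a sketch under a strong assumption, namely that the lines are evenly spaced and together fill the box from top to bottom, and it differs in detail from the line-height arithmetic in the script. The name `split_text_box` is hypothetical.

```python
def split_text_box(text, bbox):
    """Split a multi-line text box into (line, bbox) pairs, assuming equal line heights."""
    x0, y0, x1, y1 = bbox
    lines = text.split("\n")
    line_height = (y1 - y0) / float(len(lines))
    pieces = []
    for n, line in enumerate(lines):
        top = y1 - n * line_height       # PDF y grows upward, so the first line is at the top
        if line.strip():                 # skip empty lines but keep their vertical slot
            pieces.append((line.strip(), (x0, top - line_height, x1, top)))
    return pieces


pieces = split_text_box("Total\nUnits", (10, 40, 60, 64))
```

Each piece gets its own vertical slice of the original box, so a row border that ran through the middle of the group can separate the two lines again.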

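The pdfminer module has another shortcoming when it comes to vertical text. I have been unable to find any attribute that indicates when text is vertical, or at what angle it is displayed (e.g., +90 or -90 degrees). The text grouping routine doesn't seem to know either: for text rotated +90 degrees (i.e., rotated counterclockwise so that the letters read from the bottom up), it concatenates the letters in reverse order, separated by newline characters.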
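
If you already know by some other means that a fragment is rotated +90 degrees (the missing attribute means it can't be detected automatically here), undoing the reversal is simple. The helper below is a hypothetical sketch and is not part of the script.

```python
def restore_rotated_text(raw):
    """Rebuild a string from one-letter-per-line text that was emitted in reverse order."""
    letters = [part.strip() for part in raw.split("\n") if part.strip()]
    return "".join(reversed(letters))


# "TOTAL" read from the bottom up comes out as L, A, T, O, T on separate lines
restored = restore_rotated_text("L\nA\nT\nO\nT")
```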

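Under the circumstances, the routine below works fairly well. I know it's still rough, there are several enhancements to be made, and it isn't packaged in a way that's ready for wide distribution, but it seems to have "cracked the code" on how to extract tabular data from PDF files (for the most part). Hopefully others can use it for their own purposes and maybe even improve on it.

I welcome any ideas, suggestions, or recommendations.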

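EDIT: I posted a revised version that includes additional parameters (cell_htol_lf, cell_vtol_up, etc.) to help "tune" the algorithm that determines which pieces of text belong to a particular cell in the table.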
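
To show what the tuning parameters do, here is a small sketch of how the tolerances widen or narrow the search window around a cell. The function names `cell_bounds` and `fits` are hypothetical; the default values and the way each tolerance is added to a border match `cell_htol_lf`, `cell_htol_rt`, `cell_vtol_dn`, and `cell_vtol_up` in the script.

```python
def cell_bounds(col_left, col_right, row_bottom, row_top,
                htol_lf=-2, htol_rt=2, vtol_dn=0, vtol_up=8):
    """Return the (x_min, x_max, y_min, y_max) search window for one cell."""
    return (col_left + htol_lf, col_right + htol_rt,
            row_bottom + vtol_dn, row_top + vtol_up)


def fits(bbox, window):
    """True if a text bounding box lies entirely inside the search window."""
    x0, y0, x1, y1 = bbox
    x_min, x_max, y_min, y_max = window
    return x0 >= x_min and x1 <= x_max and y0 >= y_min and y1 <= y_max


window = cell_bounds(100, 200, 500, 520)
inside = fits((99, 501, 201, 526), window)    # pokes slightly past the borders
outside = fits((99, 501, 201, 530), window)   # reaches too far above the cell
```

With the defaults, text may overrun the left and right borders by 2 units and the top border by 8, but must not cross the bottom border at all.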

# This was written for use w/Python 2.  Use w/Python 3 hasn't been tested & proper execution is not guaranteed.

import os                                                   # Library of Operating System routines
import sys                                                  # Library of System routines
import sqlite3                                              # Library of SQLite dB routines
import re                                                   # Library for Regular Expressions
import csv                                                  # Library to output as Comma Separated Values
import codecs                                               # Library of text Codec types
import cStringIO                                            # Library of String manipulation routines

from pdfminer.pdfparser import PDFParser                    # Library of PDF text extraction routines
from pdfminer.pdfdocument import PDFDocument, PDFNoOutlines
from pdfminer.pdfpage import PDFPage, PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice
from pdfminer.layout import LAParams, LTTextBox, LTTextLine, LTFigure, LTImage, LTLine, LTRect, LTTextBoxVertical
from pdfminer.converter import PDFPageAggregator

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

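def add_new_value (new_value, list_values=None):
    # Used to exclude duplicate values in a list (values closer than 1 unit count as duplicates)
    if list_values is None:                                 # Avoid sharing one mutable default list between calls
        list_values = []

    not_in_list = True
    for list_value in list_values:
        # if list_value == new_value:
        if abs(list_value - new_value) < 1:
            not_in_list = False

    if not_in_list:
        list_values.append(new_value)

    return list_values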

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

def condense_list (list_values, grp_tolerance = 1):
    # Group values & eliminate duplicate/close values
    tmp_list = []
    for list_value in list_values:
        # Skip values already represented by a group within the tolerance
        if not any(abs(val - list_value) < grp_tolerance for val in tmp_list):
            # Average all values within the tolerance of this one (the original summed list_value itself)
            grp = [val for val in list_values if abs(val - list_value) < grp_tolerance]
            tmp_list.append(int(round(sum(grp) / float(len(grp)))))

    return tmp_list

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f", 
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, quotechar = '"', quoting=csv.QUOTE_ALL, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

#@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

# In case a connection to the database can't be created, set 'conn' to 'None'
conn = None

#^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
# Define variables for use later on
#_______________________________________________________________________________________________________________________

sqlite_file = "pdf_table_text.sqlite"                       # Name of the sqlite database file
brk_tol = 3                                                 # Tolerance for grouping LTRect values as line break points
                                                            # ***   This may require tuning to get optimal results   ***

cell_htol_lf = -2                                           # Horizontal & Vertical tolerances (up/down/left/right)
cell_htol_rt = 2                                            # for over-scanning table cell bounding boxes
cell_vtol_up = 8                                            # i.e., how far outside cell bounds to look for text to include
cell_vtol_dn = 0                                            # ***   This may require tuning to get optimal results   ***

replace_newlines = True                                     # Switch for replacing newline codes (\n) with spaces
replace_multspaces = True                                   # Switch for replacing multiple spaces with a single space

# txt_concat_str = "' '"                                    # Concatenate cell data with a single space
txt_concat_str = "char(10)"                                 # Concatenate cell data with a line feed

#=======================================================================================================================
# Default values for sample input & output files (path, filename, pagelist, etc.)

filepath = ""                                               # Path of the source PDF file (default = current folder)
srcfile = ""                                                # Name of the source PDF file (quit if left blank)
pagelist = [1, ]                                            # Pages to extract table data (Make an interactive input?)
                                                            # --> THIS MUST BE IN THE FORM OF A LIST OR TUPLE!

#=======================================================================================================================
# Impose required conditions & abort execution if they're not met

# Should check if files are locked:  sqlite database, input & output files, etc.

if srcfile == "" or not pagelist:
    print "Source file not specified and/or page list is blank!  Execution aborted!"
    sys.exit()

dmp_pdf_data = "pdf_data.csv"
dmp_tbl_data = "tbl_data.csv"
destfile = srcfile[:-3]+"csv"

#=======================================================================================================================
# First test to see if this file already exists & delete it if it does

if os.path.isfile(sqlite_file):
    os.remove(sqlite_file)

#=======================================================================================================================
try:

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Open or Create the SQLite database file
    #___________________________________________________________________________________________________________________

    print "-" * 120
    print "Creating SQLite Database & working tables ..."

    # Connecting to the database file
    conn = sqlite3.connect(sqlite_file)
    curs = conn.cursor()

    qry_create_table = "CREATE TABLE {tn} ({nf} {ft} PRIMARY KEY)"
    qry_alter_add_column = "ALTER TABLE {0} ADD COLUMN {1}"

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Create 1st Table
    #___________________________________________________________________________________________________________________

    tbl_pdf_elements = "tbl_pdf_elements"                   # Name of the 1st table to be created
    new_field = "idx"                                       # Name of the index column
    field_type = "INTEGER"                                  # Column data type

    # Delete the table if it exists so old data is cleared out
    curs.execute("DROP TABLE IF EXISTS " + tbl_pdf_elements)

    # Create output table for PDF text w/1 column (index) & set it as PRIMARY KEY
    curs.execute(qry_create_table.format(tn=tbl_pdf_elements, nf=new_field, ft=field_type))

    # Table fields: index, text_string, pg, x0, y0, x1, y1, orient
    cols = ("'pdf_text' TEXT", 
            "'pg' INTEGER", 
            "'x0' INTEGER", 
            "'y0' INTEGER", 
            "'x1' INTEGER", 
            "'y1' INTEGER", 
            "'orient' INTEGER")

    # Add other columns
    for col in cols:
        curs.execute(qry_alter_add_column.format(tbl_pdf_elements, col))

    # Committing changes to the database file
    conn.commit()

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Create 2nd Table
    #___________________________________________________________________________________________________________________

    tbl_table_data = "tbl_table_data"                       # Name of the 2nd table to be created
    new_field = "idx"                                       # Name of the index column
    field_type = "INTEGER"                                  # Column data type

    # Delete the table if it exists so old data is cleared out
    curs.execute("DROP TABLE IF EXISTS " + tbl_table_data)

    # Create output table for Table Data w/1 column (index) & set it as PRIMARY KEY
    curs.execute(qry_create_table.format(tn=tbl_table_data, nf=new_field, ft=field_type))

    # Table fields: index, text_string, pg, row, column
    cols = ("'tbl_text' TEXT",
            "'pg' INTEGER",
            "'row' INTEGER",
            "'col' INTEGER")

    # Add other columns
    for col in cols:
        curs.execute(qry_alter_add_column.format(tbl_table_data, col))

    # Committing changes to the database file
    conn.commit()

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Start PDF text extraction code here
    #___________________________________________________________________________________________________________________

    print "Opening PDF file & preparing for text extraction:"
    print " -- " + filepath + srcfile

    # Open a PDF file.
    fp = open(filepath + srcfile, "rb")

    # Create a PDF parser object associated with the file object.
    parser = PDFParser(fp)

    # Create a PDF document object that stores the document structure.

    # Supply the password for initialization (if needed)
    # document = PDFDocument(parser, password)
    document = PDFDocument(parser)

    # Check if the document allows text extraction. If not, abort.
    if not document.is_extractable:
        raise PDFTextExtractionNotAllowed

    # Create a PDF resource manager object that stores shared resources.
    rsrcmgr = PDFResourceManager()

    # Set parameters for analysis.
    laparams = LAParams()

    # Create a PDF page aggregator object.
    device = PDFPageAggregator(rsrcmgr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Extract text & location data from PDF file (examine & process only pages in the page list)
    #___________________________________________________________________________________________________________________

    # Initialize variables
    idx1 = 0
    idx2 = 0
    lastpg = max(pagelist)

    print "Starting text extraction ..."

    qry_insert_pdf_txt = "INSERT INTO " + tbl_pdf_elements + " VALUES(?, ?, ?, ?, ?, ?, ?, ?)"
    qry_get_pdf_txt = "SELECT group_concat(pdf_text, " + txt_concat_str + \
        ") FROM {0} WHERE pg=={1} AND x0>={2} AND x1<={3} AND y0>={4} AND y1<={5} ORDER BY y0 DESC, x0 ASC;"
    qry_insert_tbl_data = "INSERT INTO " + tbl_table_data + " VALUES(?, ?, ?, ?, ?)"

    # Process each page contained in the document.
    for i, page in enumerate(PDFPage.create_pages(document)):

        interpreter.process_page(page)

        # Get the LTPage object for the page.
        lt_objs = device.get_result()
        pg = device.pageno - 1                              # Must subtract 1 to correct 'pageno'

        # Exit the loop if past last page to parse
        if pg > lastpg:
            break

        #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
        # If it finds a page in the pagelist, process the contents

        if pg in pagelist:
            print "- Processing page {0} ...".format(pg)

            xbreaks = []
            ybreaks = []

            #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # Iterate thru list of pdf layout elements (LT* objects) then capture the text & attributes of each

            for lt_obj in lt_objs:

                # Examine LT objects & get parameters for text strings
                if isinstance(lt_obj, LTTextBox) or isinstance(lt_obj, LTTextLine):
                    # Increment index
                    idx1 += 1

                    # Assign PDF LTText object parameters to variables
                    pdftext = lt_obj.get_text()             # Need to convert escape codes & unicode characters!
                    pdftext = pdftext.strip()               # Remove leading & trailing whitespaces

                    # Save integer bounding box coordinates: round down @ start, round up @ end
                    # (x0, y0, x1, y1) = lt_obj.bbox
                    x0 = int(lt_obj.bbox[0])
                    y0 = int(lt_obj.bbox[1])
                    x1 = int(lt_obj.bbox[2] + 1)
                    y1 = int(lt_obj.bbox[3] + 1)

                    orient = 0                              # What attribute gets this value?

                    #---- These approaches don't work for identifying vertical text ... --------------------------------

                    # orient = lt_obj.rotate
                    # orient = lt_obj.char_disp

                    # if lt_obj.get_writing_mode == "tb-rl":
                        # orient = 90

                    # if isinstance(lt_obj, LTTextBoxVertical): # vs LTTextBoxHorizontal
                        # orient = 90

                    # if LAParams(lt_obj).detect_vertical:
                        # orient = 90

                    #---------------------------------------------------------------------------------------------------
                    # Split text strings at line feeds

                    if "\n" in pdftext:
                        substrs = pdftext.split("\n")
                        lineheight = (y1-y0) / (len(substrs) + 1)
                        # y1 = y0 + lineheight
                        y0 = y1 - lineheight
                        for substr in substrs:
                            substr = substr.strip()         # Remove leading & trailing whitespaces
                            if substr != "":
                                # Insert values into tuple for uploading into dB
                                pdf_txt_export = [(idx1, substr, pg, x0, y0, x1, y1, orient)]

                                # Insert values into dB
                                curs.executemany(qry_insert_pdf_txt, pdf_txt_export)
                                conn.commit()

                            idx1 += 1
                            # y0 = y1
                            # y1 = y0 + lineheight
                            y1 = y0
                            y0 = y1 - lineheight

                    else:
                        # Insert values into tuple for uploading into dB
                        pdf_txt_export = [(idx1, pdftext, pg, x0, y0, x1, y1, orient)]

                        # Insert values into dB
                        curs.executemany(qry_insert_pdf_txt, pdf_txt_export)
                        conn.commit()

                elif isinstance(lt_obj, LTLine):
                    # LTLine - Lines drawn to define tables
                    pass

                elif isinstance(lt_obj, LTRect):
                    # LTRect - Borders drawn to define tables

                    # Grab the lt_obj.bbox values
                    x0 = round(lt_obj.bbox[0], 2)
                    y0 = round(lt_obj.bbox[1], 2)
                    x1 = round(lt_obj.bbox[2], 2)
                    y1 = round(lt_obj.bbox[3], 2)
                    xmid = round((x0 + x1) / 2, 2)
                    ymid = round((y0 + y1) / 2, 2)

                    # rectline = lt_obj.linewidth

                    # If width less than tolerance, assume it's used as a vertical line
                    if (x1 - x0) < brk_tol:                 # Vertical Line or Corner
                        xbreaks = add_new_value(xmid, xbreaks)

                    # If height less than tolerance, assume it's used as a horizontal line
                    if (y1 - y0) < brk_tol:                 # Horizontal Line or Corner
                        ybreaks = add_new_value(ymid, ybreaks)

                elif isinstance(lt_obj, LTImage):
                    # An image, so do nothing
                    pass

                elif isinstance(lt_obj, LTFigure):
                    # LTFigure objects are containers for other LT* objects which shouldn't matter, so do nothing
                    pass

            col_breaks = condense_list(xbreaks, brk_tol)    # Group similar values & eliminate duplicates
            row_breaks = condense_list(ybreaks, brk_tol)

            col_breaks.sort()
            row_breaks.sort()

            #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
            # Regroup the text into table 'cells'
            #___________________________________________________________________________________________________________

            print "  -- Text extraction complete. Grouping data for table ..."

            row_break_prev = 0
            col_break_prev = 0

            table_data = []
            table_rows = len(row_breaks)
            for i, row_break in enumerate(row_breaks):
                if row_break_prev == 0:                             # Skip the rest the first time thru
                    row_break_prev = row_break
                else:
                    for j, col_break in enumerate(col_breaks):
                        if col_break_prev == 0:                     # Skip query the first time thru
                            col_break_prev = col_break
                        else:
                            # Run query to get all text within cell lines (+/- htol & vtol values)
                            curs.execute(qry_get_pdf_txt.format(tbl_pdf_elements, pg, col_break_prev + cell_htol_lf, \
                                col_break + cell_htol_rt, row_break_prev + cell_vtol_dn, row_break + cell_vtol_up))

                            rows = curs.fetchall()                  # Retrieve all rows

                            for row in rows:
                                if row[0] != None:                  # Skip null results
                                    idx2 += 1
                                    table_text = row[0]
                                    if replace_newlines:            # Option - Replace newline codes (\n) with spaces
                                        table_text = table_text.replace("\n", " ")

                                    if replace_multspaces:          # Option - Replace multiple spaces w/single space
                                        table_text = re.sub(" +", " ", table_text)

                                    table_data.append([idx2, table_text, pg, table_rows - i, j])

                        col_break_prev = col_break

                row_break_prev = row_break

            curs.executemany(qry_insert_tbl_data, table_data)
            conn.commit()

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Export the regrouped table data:

    # Determine the number of columns needed for the output file
    # -- Should the data be extracted all at once or one page at a time?

    print "Saving exported table data ..."

    qry_col_count = "SELECT MIN([col]) AS colmin, MAX([col]) AS colmax, MIN([row]) AS rowmin, MAX([row]) AS rowmax, " + \
        "COUNT([row]) AS rowttl FROM [{0}] WHERE [pg] = {1} AND [tbl_text]!=' ';"

    qry_sql_export = "SELECT * FROM [{0}] WHERE [pg] = {1} AND [row] = {2} AND [tbl_text]!=' ' ORDER BY [col];"

    f = open(filepath + destfile, "wb")
    writer = UnicodeWriter(f)

    for pg in pagelist:
        curs.execute(qry_col_count.format(tbl_table_data, pg))
        rows = curs.fetchall()

        if len(rows) > 1:
            print "Error retrieving row & column counts!  More than one record returned!"
            print " -- ", qry_col_count.format(tbl_table_data, pg)
            print rows
            sys.exit()

        for row in rows:
            (col_min, col_max, row_min, row_max, row_ttl) = row

        # Insert a page separator
        writer.writerow(["Data for Page {0}:".format(pg), ])

        if row_ttl == 0:
            writer.writerow(["Unable to export text from PDF file.  No table structure found.", ])

        else:
            k = 0
            for j in range(row_min, row_max + 1):
                curs.execute(qry_sql_export.format(tbl_table_data, pg, j))

                rows = curs.fetchall()

                if not rows:                                # No records match the given criteria
                    pass

                else:
                    i = 1
                    k += 1
                    column_data = [k, ]                     # 1st column as an Index

                    for row in rows:
                        (idx, tbl_text, pg_num, row_num, col_num) = row

                        if pg_num != pg:                    # Exit the loop if Page # doesn't match
                            break

                        while i < col_num:
                            column_data.append("")
                            i += 1
                            if i >= col_num or i == col_max: break

                        column_data.append(unicode(tbl_text))
                        i += 1

                    writer.writerow(column_data)

    f.close()

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Dump the SQLite regrouped data (for error checking):

    print "Dumping SQLite table of regrouped (table) text ..."

    qry_sql_export = "SELECT * FROM [{0}] WHERE [tbl_text]!=' ' ORDER BY [pg], [row], [col];"
    curs.execute(qry_sql_export.format(tbl_table_data))
    rows = curs.fetchall()

    # Output data with Unicode intact as CSV
    with open(dmp_tbl_data, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(["idx", "tbl_text", "pg", "row", "col"])
        writer.writerows(rows)

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Dump the SQLite temporary PDF text data (for error checking):

    print "Dumping SQLite table of extracted PDF text ..."

    qry_sql_export = "SELECT * FROM [{0}] WHERE [pdf_text]!='  ' ORDER BY pg, y0 DESC, x0 ASC;"
    curs.execute(qry_sql_export.format(tbl_pdf_elements))
    rows = curs.fetchall()

    # Output data with Unicode intact as CSV
    with open(dmp_pdf_data, "wb") as f:
        writer = UnicodeWriter(f)
        writer.writerow(["idx", "pdf_text", "pg", "x0", "y0", "x1", "y1", "orient"])
        writer.writerows(rows)

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    print "Conversion complete."
    print "-" * 120

except sqlite3.Error as e:

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Rollback the last database transaction if the connection fails
    #___________________________________________________________________________________________________________________

    if conn:
        conn.rollback()

    print "Error '{0}':".format(e.args[0])
    sys.exit(1)

finally:

    #^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    # Close the connection to the database file
    #___________________________________________________________________________________________________________________

    if conn:
        conn.close()
