How can I join a text column based on differences between other columns?


Chris_2020

I have the following DataFrame:

       top   bottom                      fontname    size       x0       x1                           text
0   62.890   73.890  HNGMRP+HelveticaNeueLTStd-Bd  11.000  321.730  520.115  RISK MANAGEMENT AND INTERNAL 
1   76.893   87.893  HNGMRP+HelveticaNeueLTStd-Bd  11.000  321.730  376.334                        CONTROL
2  146.897  157.897  HNGMRP+HelveticaNeueLTStd-Bd  11.000   76.535  203.662              COMPANY SECRETARY
3  272.913  283.913  HNGMRP+HelveticaNeueLTStd-Bd  11.000   76.535  222.593          INDEPENDENT AUDITORS 
4  286.916  297.916  HNGMRP+HelveticaNeueLTStd-Bd  11.000   76.535  167.164                   REMUNERATION

I want to:

join row[i].text and row[i+1].text if abs(row[i].bottom - row[i+1].top) < row[i].size
replace row[i].text with the joined text
replace row[i].bottom with row[i+1].bottom
drop row[i+1]
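
The rule above can be sketched as a plain Python loop, which may be useful as a reference to check any faster version against. This is only an illustration: `merge_rows` is a hypothetical helper name, the rows are assumed to be already sorted top to bottom, and stripping whitespace before joining is my own assumption about the desired text.

```python
def merge_rows(rows):
    """Merge a row into the previous kept row when the vertical gap is below the font size."""
    merged = []
    for row in rows:
        prev = merged[-1] if merged else None
        # Compare against the previous kept row, whose bottom may already
        # have been extended by an earlier merge.
        if prev is not None and abs(prev["bottom"] - row["top"]) < prev["size"]:
            prev["text"] = f'{prev["text"].strip()} {row["text"].strip()}'
            prev["bottom"] = row["bottom"]
        else:
            merged.append(dict(row))  # copy so the input rows stay untouched
    return merged


rows = [
    {"top": 62.890, "bottom": 73.890, "size": 11.0, "text": "RISK MANAGEMENT AND INTERNAL "},
    {"top": 76.893, "bottom": 87.893, "size": 11.0, "text": "CONTROL"},
    {"top": 146.897, "bottom": 157.897, "size": 11.0, "text": "COMPANY SECRETARY"},
    {"top": 272.913, "bottom": 283.913, "size": 11.0, "text": "INDEPENDENT AUDITORS "},
    {"top": 286.916, "bottom": 297.916, "size": 11.0, "text": "REMUNERATION"},
]

result = merge_rows(rows)
for r in result:
    print(r["top"], r["bottom"], r["text"])
```

Because each row is compared with the previous kept row, a heading that wraps over three or more lines would also collapse into a single row.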

For example:

row 0 has text: RISK MANAGEMENT AND INTERNAL
row 1 has text: CONTROL
row 0 and row 1 both have the same size: 11
abs(row[0].bottom - row[1].top) is 3.003

Because 3.003 < 11:

the expected row[0].text is RISK MANAGEMENT AND INTERNAL CONTROL
the expected row[0].bottom is 87.893
row[1] is dropped from the DataFrame
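
The same check can be computed for every adjacent pair at once. A minimal sketch, assuming only the `top`, `bottom` and `size` columns from the sample data; the names `gap` and `joins_next` are just illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "top": [62.890, 76.893, 146.897, 272.913, 286.916],
    "bottom": [73.890, 87.893, 157.897, 283.913, 297.916],
    "size": [11.0] * 5,
})

# Gap between each row's bottom and the next row's top.
# The last row has no successor, so its gap is NaN.
gap = (df["bottom"] - df["top"].shift(-1)).abs()

# True where a row should absorb the row that follows it.
# A NaN gap compares as False, so the last row never joins anything.
joins_next = gap < df["size"]

print(gap.round(3))
print(joins_next.tolist())  # [True, False, False, True, False]
```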

For clarity, the expected result is as follows:

       top   bottom                      fontname    size       x0       x1                           text
0   62.890   87.893  HNGMRP+HelveticaNeueLTStd-Bd  11.000  321.730  520.115  RISK MANAGEMENT AND INTERNAL CONTROL
1  146.897  157.897  HNGMRP+HelveticaNeueLTStd-Bd  11.000   76.535  203.662  COMPANY SECRETARY
2  272.913  297.916  HNGMRP+HelveticaNeueLTStd-Bd  11.000   76.535  222.593  INDEPENDENT AUDITORS REMUNERATION

This is what I tried:

def df_section_text(self) -> pd.DataFrame:
    df_title_text = self.df_title_text
    df_next_title_text = self.df_title_text.shift(-1).dropna()
    df_section_text = pd.DataFrame()
    
    for next_title, title in zip(df_next_title_text.itertuples(index=False),  df_title_text.itertuples(index=False)):
        diff_btw_titles = abs(title.bottom - next_title.top)
        
        if diff_btw_titles < title.size:
            title = pd.DataFrame([title]).to_dict()
            title['bottom'][0] = next_title.bottom
            title['text'][0] += next_title.text
            title = pd.DataFrame.from_dict(title)
        
        df_section_text = df_section_text.append([title])
    
    df_section_text = df_section_text.drop_duplicates(subset=['bottom']).reset_index()
    return df_section_text

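where self.df_title_text is the DataFrame shown above.

It gets slow as the number of rows grows. Is there a faster, more elegant way to produce the expected result? Thanks.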

Beny

Let us try shift with cumsum to build a key for the subgroups; then we only need to groupby on that key and agg:

s = (df['bottom'].shift()-df['top']).abs().gt(df['size']).cumsum()

out = df.groupby(s).agg({'top':'first',
                         'bottom':'last',
                         'fontname':'first',
                         'size':'first',
                         'x0':'first',
                         'x1':'first',
                         'text':' '.join})


out
Out[20]: 
       top   bottom  ...       x1                                  text
0   62.890   87.893  ...  520.115  RISK MANAGEMENT AND INTERNAL CONTROL
1  146.897  157.897  ...  203.662                     COMPANY SECRETARY
2  272.913  297.916  ...  222.593     INDEPENDENT AUDITORS REMUNERATION
[3 rows x 7 columns]
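
For anyone who wants to run this end to end, here is a self-contained sketch of the same shift/cumsum/groupby idea on the sample data from the question. The intermediate names `starts_new_group` and `key` are mine, and stripping whitespace from `text` before joining is an assumption (the first and fourth rows appear to carry a trailing space).

```python
import pandas as pd

df = pd.DataFrame({
    "top": [62.890, 76.893, 146.897, 272.913, 286.916],
    "bottom": [73.890, 87.893, 157.897, 283.913, 297.916],
    "fontname": ["HNGMRP+HelveticaNeueLTStd-Bd"] * 5,
    "size": [11.0] * 5,
    "x0": [321.730, 321.730, 76.535, 76.535, 76.535],
    "x1": [520.115, 376.334, 203.662, 222.593, 167.164],
    "text": [
        "RISK MANAGEMENT AND INTERNAL ",
        "CONTROL",
        "COMPANY SECRETARY",
        "INDEPENDENT AUDITORS ",
        "REMUNERATION",
    ],
})

# A row starts a new group when the gap to the previous row's bottom
# exceeds the font size. The first row has no predecessor (NaN gap),
# which compares as False, so it lands in group 0.
starts_new_group = (df["bottom"].shift() - df["top"]).abs().gt(df["size"])

# The running total of group starts gives every row its group number.
key = starts_new_group.cumsum()
print(key.tolist())  # [0, 0, 1, 2, 2]

out = (
    df.assign(text=df["text"].str.strip())  # assumption: drop stray whitespace
      .groupby(key)
      .agg({
          "top": "first",
          "bottom": "last",
          "fontname": "first",
          "size": "first",
          "x0": "first",
          "x1": "first",
          "text": " ".join,
      })
      .reset_index(drop=True)
)
print(out)
```

One small difference from the rule in the question: `gt` starts a new group only when the gap is strictly greater than `size`, so a gap exactly equal to `size` would still be joined. Swapping `gt` for `ge` should match the strict `<` condition if that edge case matters.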
