I have the following DataFrame:
top bottom fontname size x0 x1 text
0 62.890 73.890 HNGMRP+HelveticaNeueLTStd-Bd 11.000 321.730 520.115 RISK MANAGEMENT AND INTERNAL
1 76.893 87.893 HNGMRP+HelveticaNeueLTStd-Bd 11.000 321.730 376.334 CONTROL
2 146.897 157.897 HNGMRP+HelveticaNeueLTStd-Bd 11.000 76.535 203.662 COMPANY SECRETARY
3 272.913 283.913 HNGMRP+HelveticaNeueLTStd-Bd 11.000 76.535 222.593 INDEPENDENT AUDITORS
4 286.916 297.916 HNGMRP+HelveticaNeueLTStd-Bd 11.000 76.535 167.164 REMUNERATION
I want to join row[i].text and row[i+1].text if abs(row[i].bottom - row[i+1].top) < row[i].size. The joined text becomes row[i].text, row[i].bottom becomes row[i+1].bottom, and row[i+1] is removed from the DataFrame.
For example:
row 0 has text: RISK MANAGEMENT AND INTERNAL
row 1 has text: CONTROL
row 0 and row 1 both have the same size: 11
abs(row[0].bottom - row[1].top) is 3.003
Because 3.003 < 11, row[0].text becomes RISK MANAGEMENT AND INTERNAL CONTROL, row[0].bottom becomes 87.893, and row[1] is removed from the DataFrame.
For clarity, the expected result is as follows:
top bottom fontname size x0 x1 text
0 62.890 87.893 HNGMRP+HelveticaNeueLTStd-Bd 11.000 321.730 520.115 RISK MANAGEMENT AND INTERNAL CONTROL
1 146.897 157.897 HNGMRP+HelveticaNeueLTStd-Bd 11.000 76.535 203.662 COMPANY SECRETARY
2 272.913 297.916 HNGMRP+HelveticaNeueLTStd-Bd 11.000 76.535 222.593 INDEPENDENT AUDITORS REMUNERATION
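To make the rule concrete, here is a minimal sketch of the gap check on the sample rows. The names gap and merge_with_next are illustrative only, and the fontname, x0 and x1 columns are left out because the rule does not use them.

```python
import pandas as pd

# Sample rows copied from the table above (only the columns the rule needs).
df = pd.DataFrame({
    "top":    [62.890, 76.893, 146.897, 272.913, 286.916],
    "bottom": [73.890, 87.893, 157.897, 283.913, 297.916],
    "size":   [11.0] * 5,
    "text":   ["RISK MANAGEMENT AND INTERNAL", "CONTROL", "COMPANY SECRETARY",
               "INDEPENDENT AUDITORS", "REMUNERATION"],
})

# Distance between row i's bottom and row i+1's top.
# The last row has no successor, so its gap is NaN.
gap = (df["bottom"] - df["top"].shift(-1)).abs()

# True where row i should absorb row i+1 (NaN compares as False).
merge_with_next = gap < df["size"]

print(gap.round(3).tolist())
print(merge_with_next.tolist())
```

Rows 0 and 3 are flagged, which matches the two merges in the expected result.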
Here is what I tried:
def df_section_text(self) -> pd.DataFrame:
    df_title_text = self.df_title_text
    df_next_title_text = self.df_title_text.shift(-1).dropna()
    df_section_text = pd.DataFrame()
    for next_title, title in zip(df_next_title_text.itertuples(index=False), df_title_text.itertuples(index=False)):
        diff_btw_titles = abs(title.bottom - next_title.top)
        if diff_btw_titles < title.size:
            title = pd.DataFrame([title]).to_dict()
            title['bottom'][0] = next_title.bottom
            title['text'][0] += next_title.text
            title = pd.DataFrame.from_dict(title)
        df_section_text = df_section_text.append([title])
    df_section_text = df_section_text.drop_duplicates(subset=['bottom']).reset_index()
    return df_section_text
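where self.df_title_text is the DataFrame shown at the top of the question.
It is slow when the number of rows increases. Is there a faster, more elegant way to produce the expected result? Thanks.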
Let's try shift with cumsum to build a sub-group key; then we only need a groupby on that key with agg.
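# A new group starts wherever the gap between the previous row's bottom and
# this row's top exceeds the font size; cumsum turns those breaks into keys.
s = (df['bottom'].shift()-df['top']).abs().gt(df['size']).cumsum()
# Collapse each group into a single row.
out = df.groupby(s).agg({'top':'first',
                         'bottom':'last',
                         'fontname':'first',
                         'size':'first',
                         'x0':'first',
                         'x1':'first',
                         'text':' '.join})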
out
Out[20]:
top bottom ... x1 text
0   62.890   87.893  ...  520.115  RISK MANAGEMENT AND INTERNAL CONTROL
1  146.897  157.897  ...  203.662  COMPANY SECRETARY
2  272.913  297.916  ...  222.593  INDEPENDENT AUDITORS REMUNERATION
[3 rows x 7 columns]
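For anyone who wants to run this end to end, here is a self-contained sketch of the same shift / cumsum / groupby idea on the sample data from the question. It is a variant, not the exact snippet above: it breaks groups with ge and compares against the previous row's size, so that it follows the question's rule (merge only if the gap is strictly smaller than row[i].size). The names gap_to_prev, starts_group, group_key and merged are my own.

```python
import pandas as pd

df = pd.DataFrame({
    "top":      [62.890, 76.893, 146.897, 272.913, 286.916],
    "bottom":   [73.890, 87.893, 157.897, 283.913, 297.916],
    "fontname": ["HNGMRP+HelveticaNeueLTStd-Bd"] * 5,
    "size":     [11.0] * 5,
    "x0":       [321.730, 321.730, 76.535, 76.535, 76.535],
    "x1":       [520.115, 376.334, 203.662, 222.593, 167.164],
    "text":     ["RISK MANAGEMENT AND INTERNAL", "CONTROL", "COMPANY SECRETARY",
                 "INDEPENDENT AUDITORS", "REMUNERATION"],
})

# Gap between the previous row's bottom and this row's top (NaN for row 0).
gap_to_prev = (df["top"] - df["bottom"].shift()).abs()

# A row starts a new group when the gap is NOT smaller than the previous
# row's size. The NaN comparison on row 0 is False, so row 0 gets key 0.
starts_group = gap_to_prev.ge(df["size"].shift())
group_key = starts_group.cumsum()

# Collapse each group: keep the first row's fields, the last row's bottom,
# and join the text pieces with a space.
merged = (
    df.groupby(group_key)
      .agg({"top": "first", "bottom": "last", "fontname": "first",
            "size": "first", "x0": "first", "x1": "first", "text": " ".join})
      .reset_index(drop=True)
)

print(group_key.tolist())  # [0, 0, 1, 2, 2]
print(merged[["top", "bottom", "text"]])
```

On this sample both variants give the same three rows; they would differ only when a gap is exactly equal to the size, or when consecutive rows have different sizes.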