我有一个混乱的csv文件(只是扩展名是csv)。但是当我在MS Excel中用;
分隔符打开此文件时,它看起来如下图(虚拟示例)-
我调查了此文件,发现以下内容:
问题:
我如何在熊猫中读取此表,而所有现有列(标题)都保留,空白列中填充连续的数字,所以行的长度可变。
实际上,我想一次又一次地获取8个单元格的值,直到用完所有行。从无标题列进行分析。
注:我已经试过usecols
,names
,skiprows
,sep
等中read_csv
,但没有成功
添加了示例输入和预期输出(格式较差,但pandas.read_clipboard(
应该可以)
输入
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] )
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
预期输出(数据帧)
car_id car_type entry_gate entry_time(ms) exit_gate exit_time(ms) traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] 1 2 3 4 5 6 7 8 9 10 11 12
24 Bus 25 4300 26 48520 118.47 2.678999 509552.78 5039855.59 10.074 0.429 0.2012 0 509552.97 5039855.57 10.0821 0.3853 0.2183 20
25 Car 25 20 26 45900 113.91 2.482746 509583.7 5039848.78 4.5344 -0.1649 0.2398 0 509583.77 5039848.71
26 Car - - - - 109.68 8.859805 509572.75 5039862.75 4.0734 -0.7164 -0.1066 0 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17 5039855.55 10.0886 0.2636 0.2356 40
27 Car - - - - 119.84 3.075936 509582.73 5039862.78 1.191 0.5247 0.0005 0 509582.71 5039862.78 1.2015 0.5322
28 Car - - - - 129.64 4.347466 509591.07 5039862.9 1.6473 0.1987 -0.0033 0 509591.04 5039862.89 1.6513 0.2015 -0.0036 20
预处理
get_names()
打开文件功能,检查拆分行的最大长度。然后,我阅读第一行并添加最大长度中的缺失值。
第一行的最后一个值是)
,因此我将其删除firstline[:-1]
,然后再将丢失的列添加到+1
rng = range(1, m - lenfirstline + 2)
。+2
是因为范围从值开始1
。
然后,您可以使用function read_csv
,跳过第一行,并使用name的输出作为名称get_names()
。
import pandas as pd
import csv
#preprocessing
def get_names():
with open('test/file.txt', 'r') as csvfile:
reader = csv.reader(csvfile)
num = []
for i, row in enumerate(reader):
if i ==0:
firstline = ''.join(row).split()
lenfirstline = len(firstline)
#print firstline, lenfirstline
num.append(len(''.join(row).split()))
m = max(num)
rng = range(1, m - lenfirstline + 2)
#remove )
rng = firstline[:-1] + rng
return rng
#names is list return from function
df = pd.read_csv('test/file.txt', sep="\s+", names=get_names(), index_col=[0], skiprows=1)
#temporaly display 10 rows and 30 columns
with pd.option_context('display.max_rows', 10, 'display.max_columns', 30):
print df
car_type entry_gate entry_time(ms) exit_gate exit_time(ms) \
car_id
24 Bus 25 4300 26 48520
25 Car 25 20 26 45900
26 Car - - - -
27 Car - - - -
28 Car - - - -
traveled_dist(m) avg_speed(m/s) trajectory(x[m] y[m] \
car_id
24 118.47 2.678999 509552.78 5039855.59
25 113.91 2.482746 509583.70 5039848.78
26 109.68 8.859805 509572.75 5039862.75
27 119.84 3.075936 509582.73 5039862.78
28 129.64 4.347466 509591.07 5039862.90
speed[m/s] a_tangential[ms-2] a_lateral[ms-2] timestamp[ms] \
car_id
24 10.0740 0.4290 0.2012 0
25 4.5344 -0.1649 0.2398 0
26 4.0734 -0.7164 -0.1066 0
27 1.1910 0.5247 0.0005 0
28 1.6473 0.1987 -0.0033 0
1 2 3 4 5 6 7 \
car_id
24 509552.97 5039855.57 10.0821 0.3853 0.2183 20 NaN
25 509583.77 5039848.71 NaN NaN NaN NaN NaN
26 509572.67 5039862.76 4.0593 -0.7021 -0.1141 20 509553.17
27 509582.71 5039862.78 1.2015 0.5322 NaN NaN NaN
28 509591.04 5039862.89 1.6513 0.2015 -0.0036 20 NaN
8 9 10 11 12
car_id
24 NaN NaN NaN NaN NaN
25 NaN NaN NaN NaN NaN
26 5039855.55 10.0886 0.2636 0.2356 40
27 NaN NaN NaN NaN NaN
28 NaN NaN NaN NaN NaN
后期处理
首先,您必须估计最大列数N
。我知道他们的真实数字是26
,所以我估计N = 30
。read_csv
具有参数name = range(N)
返回NaN
列的函数,列的估计长度与实际长度之间有什么区别。
下降,您可以选择的列名,其中没有第一行后NaN
(我去掉最后一列)
的[:-1]
) - df1.loc[0].dropna()[:-1]
。然后,您可以NaN
在第一行中追加范围从1到值长度的新系列。最后第一行被的子集删除df
。
#set more as estimated number of columns
N = 30
df1 = pd.read_csv('test/file.txt', sep="\s+", names=range(N))
df1 = df1.dropna(axis=1, how='all') #drop columns with all NaN
df1.columns = df1.loc[0].dropna()[:-1].append(pd.Series(range(1, len(df1.columns) - len(df1.loc[0].dropna()[:-1]) + 1 )))
#remove first line with uncomplete column names
df1 = df1.ix[1:]
print df1.head()
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句