I have data stored in a CSV file in the following format:
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
The data type of each column:
1. int
2. int
3. String
4. String
5. float
6. int
7. int
8. float
9. float
10. String
11. String
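The first column (892, 893, ..., 897) should be stored in the array as int. The third column, e.g. "Wilkes, Mrs. James (Ellen Needs)", should be stored as a string. However, the strings in the third column do not have a fixed length, i.e. I do not know the maximum number of characters stored in this column.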
This is what I have done so far:
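import csv
import numpy as np

# newline='' lets the csv module handle line endings inside quoted fields
with open('trainData.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # consume the first row
    data = [row for row in csv_file_object]
data = np.array(data)  # every field is still a string at this point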
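However, the code above reads every column as a string, even though many of the columns do not hold string data. On the other hand, if I use genfromtxt, the third column is a problem because it contains commas inside double quotes.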
I want each column to be stored with its own data type, e.g. the first column should be stored as int.
The array I expect:
892 3 "Kelly, Mr. James" male 34.5 0 0 330911 7.8292 NaN Q
893 3 "Wilkes, Mrs. James (Ellen Needs)" female 47 1 0 363272 7 NaN S
894 2 "Myles, Mr. Thomas Francis" male 62 0 0 240276 9.6875 NaN Q
895 3 "Wirz, Mr. Albert" male 27 0 0 315154 8.6625 NaN S
896 3 "Hirvonen, Mrs. Alexander (Helga E Lindqvist)" female 22 1 1 3101298 12.2875 NaN S
897 3 "Svensson, Mr. Johan Cervin" male 14 0 0 7538 9.225 NaN S
As you can see, NaN should be placed wherever a value is not available.
How should I read the CSV file?
You can do this more easily with the pandas library, as follows:
import pandas as pd

df = pd.read_csv("trainData.csv",
                 dtype={'col1': int, 'col2': int, 'col3': str, 'col4': str,
                        'col5': float, 'col6': int, 'col7': int, 'col8': float,
                        'col9': float, 'col10': str, 'col11': str})
rows = df.values.tolist()  # one Python list per record
print(rows)
Output:
[[892, 3, 'Kelly, Mr. James', 'male', 34.5, 0, 0, 330911.0, 7.8292, nan, 'Q'],
[893, 3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', 47.0, 1, 0, 363272.0, 7.0, nan, 'S'],
[894, 2, 'Myles, Mr. Thomas Francis', 'male', 62.0, 0, 0, 240276.0, 9.6875, nan, 'Q'],
[895, 3, 'Wirz, Mr. Albert', 'male', 27.0, 0, 0, 315154.0, 8.6625, nan, 'S'],
[896, 3, 'Hirvonen, Mrs. Alexander (Helga E Lindqvist)', 'female', 22.0, 1, 1, 3101298.0, 12.2875, nan, 'S'],
[897, 3, 'Svensson, Mr. Johan Cervin', 'male', 14.0, 0, 0, 7538.0, 9.225, nan, 'S']]
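A list of lists loses the per-column types once it is passed back to np.array. If a NumPy array with one dtype per column is what is wanted, a record array is one option. The following is only a sketch: it parses two of the sample rows from an in-memory string instead of trainData.csv, and it assumes the col1..col11 header row used in this answer.

```python
import io
import pandas as pd

# Two sample rows from the question, preceded by the header row assumed in this answer.
SAMPLE = '''col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
'''

df = pd.read_csv(io.StringIO(SAMPLE),
                 dtype={'col1': int, 'col5': float, 'col8': float})

# to_records() builds a NumPy record array; each field keeps its column's dtype.
records = df.to_records(index=False)

print(records.dtype)        # one (name, dtype) pair per column
print(records['col1'])      # integer field
print(records['col3'][1])   # quoted name with its comma intact
```

Fields are then addressed by name (records['col1']) rather than by position, and the variable-length names do not need a fixed maximum width.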
The CSV file should look like this, because the first row is treated as the column names:
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10,col11
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
894,2,"Myles, Mr. Thomas Francis",male,62,0,0,240276,9.6875,,Q
895,3,"Wirz, Mr. Albert",male,27,0,0,315154,8.6625,,S
896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22,1,1,3101298,12.2875,,S
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
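If adding a header line to the file is not convenient, read_csv can be told that the file has none and be given the names directly. A minimal sketch, again using an in-memory string in place of trainData.csv; the col1..col11 names are just the placeholders used in this answer, not names taken from the real data set.

```python
import io
import pandas as pd

# Sample rows from the question, with no header line.
RAW = '''892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
897,3,"Svensson, Mr. Johan Cervin",male,14,0,0,7538,9.225,,S
'''

# Placeholder column names following the col1..col11 convention above.
names = ['col%d' % i for i in range(1, 12)]

# header=None stops pandas from treating the first data row as column names.
df = pd.read_csv(io.StringIO(RAW), header=None, names=names,
                 dtype={'col1': int, 'col3': str, 'col8': float})

print(df.dtypes)
print(df['col3'][0])
```

With a real file, the io.StringIO(RAW) argument would be replaced by the file path.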
You can learn more about pandas at http://pandas.pydata.org/pandas-docs/stable/tutorials.html.
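For completeness, the csv.reader attempt from the question can also be made to produce typed values without pandas, by applying one converter per column. This is only a sketch under the column order listed in the question; CONVERTERS and parse_row are hypothetical names introduced here, and the result is a list of Python lists rather than a NumPy array.

```python
import csv
import io

# Sample rows from the question, with no header line.
RAW = '''892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47,1,0,363272,7,,S
'''

# One converter per column, in the order given in the question.
CONVERTERS = [int, int, str, str, float, int, int, float, float, str, str]


def parse_row(row):
    """Convert each field with its column's converter; empty fields become NaN."""
    return [float('nan') if field == '' else convert(field)
            for convert, field in zip(CONVERTERS, row)]


# csv.reader already handles the commas inside double quotes.
rows = [parse_row(row) for row in csv.reader(io.StringIO(RAW))]
print(rows[0])
```

A field that does not match its converter (for example non-numeric text in a float column) raises ValueError here, so real data may need a more forgiving converter.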