Breaking a large dataset into an organized index

Posted by user2676336 in Dev

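I am trying to build a dictionary keyed by shape_id from a dataset I have (see below). I realize I could do this with loops (and have tried), but I have a hunch that pandas offers a way to do it that is not computationally expensive.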

The dictionary should be structured like this:

{shape_id: [shape_pt_sequence, [shape_pt_lat,shape_pt_lon]]}

Here is my code so far:

import pandas as pd

# readability assignments for shapes.csv
shapes = pd.read_csv('csv/shapes.csv')
shapes_shape_id = shapes['shape_id']
shapes_shape_id_index = list(set(shapes_shape_id))
shapes_shape_pt_sequence = shapes['shape_pt_sequence']
shapes_shape_pt_lat = shapes['shape_pt_lat']
shapes_shape_pt_lon = shapes['shape_pt_lon']

shapes_tuple = []

# add shape index to final dict
for i in range(len(shapes_shape_id_index)):
    shapes_tuple.append([shapes_shape_id_index[i]])

print(shapes_tuple)

Here is a LINK to a gist of shapes.csv.

This is the shape_id index with nothing filled in yet:

[[20992], [20993], [20994], [20995], [20996], [20997], [20998], [20999], [21000], [21001], [21002], [21003], [21004], [21005], [21006], [21007], [21008], [21009], [21010], [21011], [21012], [21013], [21014], [21015], [21016], [21017], [21018], [21019], [21020], [21021], [21022], [21023], [21026], [21027], [21028], [21029], [21030], [21031], [21032], [21033], [21034], [21035], [21036], [21037], [21038], [21039], [21040], [21041], [21042], [21043], [21044], [21045], [21046], [21047], [21048], [21049], [21050], [21051], [21052], [21053], [21054], [21055], [21056], [21057], [21058], [21059], [21060], [21061], [21062], [21063], [21064], [21065], [21066], [21067], [21068], [21069], [21070], [21071], [21072], [21073], [21074], [21075], [21076], [21077], [21078], [21079], [21080], [21081], [21082], [21083], [21084], [21085], [21086], [21087], [21088], [21089], [20958], [20959], [20960], [20961], [20962], [20963], [20964], [20965], [20966], [20967], [20968], [20969], [20970], [20971], [20972], [20973], [20974], [20975], [20976], [20977], [20978], [20979], [20980], [20981], [20982], [20983], [20984], [20985], [20986], [20987], [20988], [20989], [20990], [20991]]

shapes.csv looks like this:

shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20958,44.0574258,-123.086641,4,0
20958,44.0571421,-123.0866518,5,0
20958,44.0568706,-123.086653,6,0
20958,44.0566161,-123.0867028,7,0
20958,44.0565641,-123.0869733,8,0
20958,44.0565503,-123.0872603,9,0
20958,44.0565536,-123.087631,10,0
20958,44.0565439,-123.0879283,11,0
20958,44.0564661,-123.087894,12,0
20958,44.0565124,-123.0881793,13,0
20958,44.0565181,-123.0884921,14,0
20958,44.0565331,-123.0888668,15,0
20958,44.0565406,-123.0892323,16,0
20958,44.0565406,-123.0896295,17,0
20958,44.0563515,-123.0897096,18,0
20958,44.056073,-123.0897108,19,0
20958,44.0558501,-123.0897,20,0
20958,44.0558358,-123.0897016,21,0
20958,44.0556489,-123.0896861,22,0
20958,44.0554398,-123.0896781,23,0
20958,44.0552033,-123.0896776,24,0
20958,44.0549253,-123.089692,25,0
20958,44.0546778,-123.0897281,26,0
20958,44.0546578,-123.0897326,27,0
20958,44.0546338,-123.0896965,28,0
20958,44.0543988,-123.0896838,29,0
20958,44.0543536,-123.0899543,30,0
20958,44.0543628,-123.0903496,31,0
20958,44.0543668,-123.0906733,32,0
20958,44.0543718,-123.0910178,33,0

For example, in shapes.csv, shape_id 20958 has a maximum shape_pt_sequence of 72, shape_id 20960 has a maximum shape_pt_sequence of 400, and so on.
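The per-shape maximum mentioned above can be checked with a groupby. This is a minimal sketch on a small inline stand-in for shapes.csv: the first three rows come from the sample above, while the 20960 rows and the names `csv_text` and `max_seq` are made up for illustration.

```python
import io
import pandas as pd

# Tiny inline stand-in for shapes.csv. The 20960 rows are invented for this sketch.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577683,-123.0873313,1,0
20958,44.0577163,-123.087073,2,0
20958,44.0576286,-123.0867103,3,0
20960,44.0500000,-123.0900000,1,0
20960,44.0501000,-123.0901000,2,0
"""
shapes = pd.read_csv(io.StringIO(csv_text))

# Largest shape_pt_sequence for each shape_id, as a Series indexed by shape_id
max_seq = shapes.groupby("shape_id")["shape_pt_sequence"].max()
print(max_seq)
```

On the real file, the same call should report the largest sequence number for every shape without an explicit loop.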

I don't know why you need a structure like {shape_id: [shape_pt_sequence, [shape_pt_lat, shape_pt_lon]]}; it is not very convenient for selecting data. You can use a MultiIndex instead:

shapes = pd.read_csv('shapes.csv')
shapes.set_index(["shape_id", "shape_pt_sequence"], inplace=True)

Then select all the data for 20958:

print(shapes.loc[20958])

Select a single point:

print(shapes.loc[(20958, 45)])

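Select a range of shape_pt_sequence values for 20958 (both ends of the slice are included; the index must be sorted for slicing, for example with shapes.sort_index()):

print(shapes.loc[(20958, slice(45, 48)), :])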

Select the rows whose shape_pt_sequence is one of [45, 48]:

print(shapes.loc[(20958, [45, 48]), :])
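The four selections above can be tried end to end without the real file. The following is a sketch only: the inline rows are invented for illustration (they are not taken from shapes.csv), and the variable names `one_shape`, `one_point`, `seq_range` and `seq_pick` are hypothetical.

```python
import io
import pandas as pd

# Invented rows standing in for shapes.csv; coordinates are not real data.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0544,-123.0910,44,0
20958,44.0545,-123.0911,45,0
20958,44.0546,-123.0912,46,0
20958,44.0547,-123.0913,47,0
20958,44.0548,-123.0914,48,0
20958,44.0549,-123.0915,49,0
20960,44.0600,-123.1000,45,0
"""
shapes = pd.read_csv(io.StringIO(csv_text))

# Build the two-level index and sort it, since label slicing needs a sorted MultiIndex.
shapes = shapes.set_index(["shape_id", "shape_pt_sequence"]).sort_index()

one_shape = shapes.loc[20958]                      # all points of one shape
one_point = shapes.loc[(20958, 45)]                # a single point, returned as a Series
seq_range = shapes.loc[(20958, slice(45, 48)), :]  # sequences 45..48, both ends included
seq_pick = shapes.loc[(20958, [45, 48]), :]        # only sequences 45 and 48

print(seq_range)
```

Using `set_index(...).sort_index()` instead of `inplace=True` is a stylistic choice here; either form produces the same index.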

If you really do need that form, use the following code:

shapes = pd.read_csv('shapes.csv')

def f(df):
    return [df.shape_pt_sequence.tolist(), [df.shape_pt_lat.tolist(), df.shape_pt_lon.tolist()]]

res = shapes.groupby("shape_id").apply(f).to_dict()
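For comparison, roughly the same dictionary can be built by iterating over the groups directly. This is a sketch rather than the answer's method: the inline rows are a stand-in for shapes.csv (the 20960 rows are invented), and the up-front `sort_values` call is an added assumption that the points should come out ordered by shape_pt_sequence.

```python
import io
import pandas as pd

# Stand-in for shapes.csv, deliberately out of order. The 20960 rows are invented.
csv_text = """shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,is_stop
20958,44.0577163,-123.087073,2,0
20958,44.0577683,-123.0873313,1,0
20958,44.0576286,-123.0867103,3,0
20960,44.0501000,-123.0901000,2,0
20960,44.0500000,-123.0900000,1,0
"""
shapes = pd.read_csv(io.StringIO(csv_text))

# Assumption: order points within each shape by their sequence number.
shapes = shapes.sort_values(["shape_id", "shape_pt_sequence"])

# {shape_id: [sequence list, [latitude list, longitude list]]}
res = {
    shape_id: [
        group["shape_pt_sequence"].tolist(),
        [group["shape_pt_lat"].tolist(), group["shape_pt_lon"].tolist()],
    ]
    for shape_id, group in shapes.groupby("shape_id")
}

print(res[20958][0])
```

Each value holds parallel lists, so the i-th latitude and longitude belong to the i-th sequence number.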
