I have 5 lists, as shown below.
list1 = [[111, ["food", "fruits", "vegetables"]], [112, ["mango", "apples", "grapes", "pears", "passion fruit"]]]
list2 = [[110, ["transport", "car", "van", "bus", "jeep"]], [109, ["trams", "trains", "passenger", "driver"]], [108, ["traffic", "lights"]]]
list3 = [[111, ["book", "letters", "library", "reading"]], [112, ["education", "jobs", "companies", "salary"]]]
list4 = [[111, ["food", "curry", "spices", "rice", "fruits", "vegetables"]], [112, ["fruits", "vegetables", "farms", "farmers"]]]
list5 = [[111, ["food", "industry", "delivery"]], [112, ["fresh", "curry", "food", "pears", "passion fruit"]]]
I also have a list of concepts.
myconcepts = ["fruits", "curry"]
For each concept in myconcepts, I want to find the first list in which it appears, i.e.
"fruits" -> list1
"curry" -> list4
I am currently using the following code to do this:
mylists = [list1, list2, list3, list4, list5]
for concept in myconcepts:
    initial_list = ""
    counting = 1  # 1-based position of the list currently being searched
    for mylist in mylists:
        for item in mylist:
            # item is [id, [words...]]; check the word list for the concept
            if concept in item[1]:
                initial_list = str(counting)
                break
        if len(initial_list) > 0:
            break
        else:
            counting = counting + 1
    print(counting)
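This works fine for small datasets. However, I have a huge dataset with nearly 25 lists, each containing almost 5 million records, and my concept list has about 15,000 entries. As a result, my code takes a very long time to run. Is there a more efficient way to do this in Python?
I am happy to provide more details if needed.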
Here is an approach that uses set, which speeds up the lookup of values with in compared with a lookup in a list.
list1 = [[111, ["food", "fruits", "vegetables"]], [112, ["mango", "apples", "grapes", "pears", "passion fruit"]]]
list2 = [[110, ["transport", "car", "van", "bus", "jeep"]], [109, ["trams", "trains", "passenger", "driver"]], [108, ["traffic", "lights"]]]
list3 = [[111, ["book", "letters", "library", "reading"]], [112, ["education", "jobs", "companies", "salary"]]]
list4 = [[111, ["food", "curry", "spices", "rice", "fruits", "vegetables"]], [112, ["fruits", "vegetables", "farms", "farmers"]]]
list5 = [[111, ["food", "industry", "delivery"]], [112, ["fresh", "curry", "food", "pears", "passion fruit"]]]
myconcepts = ["fruits", "curry"]
# convert each record's word list to a frozenset for fast membership tests
flatsets = [[frozenset(l[1]) for l in lists] for lists in [list1, list2, list3, list4, list5]]
# a function to retrieve the index of the first list containing the concept
def get_idx(setlist, concept):
    for ix_f, fset in enumerate(setlist):
        for ix_s, s in enumerate(fset):
            if concept in s:
                return ix_f
    return None
# generate a list holding the index of each concept
ix_concepts = [None for _ in myconcepts]
for ix_c, c in enumerate(myconcepts):
    ix_concepts[ix_c] = get_idx(flatsets, c)
# show result
listnames = ['list1', 'list2', 'list3', 'list4', 'list5']
for i, c in enumerate(myconcepts):
    print(f"concept '{c}' found first in {listnames[ix_concepts[i]]}")
# concept 'fruits' found first in list1
# concept 'curry' found first in list4
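However, given the size of your data (15k * 25 * 5M), I don't think this is a 1:1 solution to the actual problem. As already mentioned here, some sophisticated data preparation will be needed. I also now think that an O(N²) search algorithm (ignoring the time needed to flatten the lists etc.) is definitely going to waste a lot of time.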
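One possible form of that data preparation, as a rough sketch rather than a tested solution for the full dataset: turn the concepts into a set once and walk the records a single time, so each record is checked against all concepts together instead of once per concept. The function name first_list_index and the choice to return 0-based indices are my own assumptions, and the sample data is trimmed from the question.

```python
def first_list_index(mylists, concepts):
    """Map each concept to the 0-based index of the first list containing it.

    Hypothetical helper: a single pass over all records, stopping early
    once every concept has been located. Concepts that never occur are
    simply absent from the returned dict.
    """
    remaining = set(concepts)   # concepts not located yet
    first_seen = {}
    for list_ix, records in enumerate(mylists):
        for _record_id, words in records:
            # Set intersection tests every word of the record at once.
            hits = remaining.intersection(words)
            for concept in hits:
                first_seen[concept] = list_ix
            remaining -= hits
            if not remaining:   # everything found, no need to read further
                return first_seen
    return first_seen


# Sample data trimmed from the question.
list1 = [[111, ["food", "fruits", "vegetables"]], [112, ["mango", "apples", "grapes"]]]
list2 = [[110, ["transport", "car", "van"]], [108, ["traffic", "lights"]]]
list3 = [[111, ["book", "letters", "library"]], [112, ["education", "jobs"]]]
list4 = [[111, ["food", "curry", "spices", "fruits"]], [112, ["farms", "farmers"]]]
list5 = [[111, ["food", "industry", "delivery"]], [112, ["fresh", "curry", "pears"]]]
mylists = [list1, list2, list3, list4, list5]

result = first_list_index(mylists, ["fruits", "curry"])
for concept, ix in result.items():
    print(f"concept '{concept}' found first in list{ix + 1}")
```

With this layout the work grows with the total number of words across all records rather than with that number multiplied by the number of concepts, and the early exit helps when most concepts occur in the first few lists. Whether the data can be streamed this way at 25 lists of 5 million records each depends on how it is stored, which the question does not say.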