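I have been doing the following to build a DataFrame from a small pipeline that processes individual JSON lines. Is there a more efficient way to do this than appending DataFrames to a list and then concatenating them? Also, I don't even need the column labels, shown as 'keyN' below, but I'm not sure how to leave them out without getting a DataFrame constructor error: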
    import pandas as pd
    import ujson

    def readfiles(pattern, textfile):
        # Yield every parsed JSON line that contains the key `pattern`.
        with open(textfile) as f:
            for line in f:
                try:
                    parsed = ujson.loads(line.rstrip('\n').rstrip(','))
                    if pattern in parsed:
                        yield parsed
                except ValueError:
                    pass  # skip lines that are not valid JSON

    def convertodf(lines):
        dfs = []
        for line in lines:
            # One small DataFrame per JSON line, concatenated at the end.
            dfs.append(pd.DataFrame({'key1': line['value'],
                                     'key2': line['value']['value'],
                                     'key3': line['value'],
                                     'key4': line['value']['value'],
                                     'key5': line['value']['value']}))
        pd.concat(dfs, ignore_index=True).to_csv('testdf2.csv', index=False, header=False)

    def main(pattern, filenames):
        lines = readfiles(pattern, filenames)
        convertodf(lines)
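The really neat part of the implementation above is that one of the line['value'] elements is actually a list of integers, e.g. [1,2,3], and the DataFrame constructor automatically repeats the other values alongside it, e.g.:

    'key1'  'key2'
    1       california
    2       california
    3       california
    ...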
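The replication described above comes from how the DataFrame constructor treats a dict that mixes a list with scalar values. A minimal sketch of that behaviour, assuming a hypothetical record with the field names `ids` and `state` (neither appears in the original data):

```python
import pandas as pd

# Hypothetical record: one list-valued field plus one scalar field.
record = {"ids": [1, 2, 3], "state": "california"}

# The scalar "state" is repeated once for each element of the "ids" list.
df = pd.DataFrame({"key1": record["ids"], "key2": record["state"]})
print(df)
```

If every value in the dict is a scalar, the constructor raises a ValueError unless an index is passed, so this shortcut relies on at least one field being list-like.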
Here is the final working version I ended up using, with help from unutbu.
    import glob
    import zipfile

    import pandas as pd
    import ujson

    def readfiles(pattern, filedir):
        for f in glob.glob(filedir + '*.zip'):
            try:
                with zipfile.ZipFile(f, 'r') as myzip:
                    for logfile in myzip.namelist():
                        for line in myzip.open(logfile):
                            try:
                                # ZipFile.open yields bytes; decode before stripping (UTF-8 assumed).
                                line = ujson.loads(line.decode('utf-8').rstrip('\n').rstrip(','))
                                if pattern in line:
                                    for i in line['key1']:
                                        yield (i, line['key1']['key2'],
                                               line['key3'], line['key4']['key5'],
                                               line['key6']['key7'])
                            except ValueError:
                                pass  # skip lines that are not valid JSON
            except zipfile.BadZipFile:
                pass  # skip unreadable archives

    def convertdfcsv(lines):
        df = pd.DataFrame.from_records(lines)
        df.to_csv('testdf2.csv', index=False, header=False)

    def main(pattern, filedir):
        lines = readfiles(pattern, filedir)
        convertdfcsv(lines)
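Reading the archive members as text is one way to avoid the bytes-versus-str mismatch when stripping lines under Python 3. A minimal sketch using only the standard library (`json` stands in for `ujson` here, the function name `iter_zip_json_lines` is hypothetical, and UTF-8 encoding is an assumption about the log files):

```python
import io
import json
import zipfile


def iter_zip_json_lines(zip_path):
    """Yield one parsed JSON object per valid line of every member of zip_path."""
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            # archive.open() returns a binary stream; wrap it to iterate text lines.
            with archive.open(name) as raw:
                for line in io.TextIOWrapper(raw, encoding="utf-8"):
                    try:
                        yield json.loads(line.strip().rstrip(","))
                    except ValueError:
                        # Skip lines that are not valid JSON.
                        continue
```

Because this is a generator, only one line is held in memory at a time, and the result can be passed on to a row-building step without first collecting it in a list.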
You can use DataFrame.from_records to build a DataFrame from an iterator of rows. A simple example showing how from_records works:

    import pandas as pd

    iterator = (item for item in [[1, 2, 3], [2, 3, 4, 5]])
    df = pd.DataFrame.from_records(iterator,
                                   columns=list('abcd'))
    print(df)
    #    a  b  c   d
    # 0  1  2  3 NaN
    # 1  2  3  4   5
In your case, the code might look something like this:
    import pandas as pd
    import ujson

    def readfiles(pattern, filenames):
        for textfile in filenames:
            # Text mode, so each line is a str and rstrip('\n') works as intended.
            with open(textfile) as f:
                for line in f:
                    try:
                        line = ujson.loads(line.rstrip('\n').rstrip(','))
                        if pattern in line:
                            yield (line['value'], line['value']['value'], line['value'],
                                   line['value']['value'], line['value']['value'])
                    except ValueError:
                        pass  # skip lines that are not valid JSON

    def convertodf(lines):
        # from_records consumes the generator directly; no per-line DataFrames are needed.
        df = pd.DataFrame.from_records(lines)
        df.to_csv('testdf2.csv', index=False, header=False)

    def main(pattern, filenames):
        lines = readfiles(pattern, filenames)
        convertodf(lines)
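Unlike the dict-based constructor, from_records does not repeat scalar values next to a list-valued field, so the generator has to emit one tuple per list element if that layout is wanted. A minimal sketch, assuming hypothetical field names `ids` and `state`, in-memory sample lines instead of files, and the standard-library `json` module in place of `ujson`:

```python
import json

import pandas as pd


def iter_rows(lines, pattern):
    """Yield one tuple per element of the list-valued field of each matching record."""
    for line in lines:
        try:
            record = json.loads(line.strip().rstrip(","))
        except ValueError:
            continue  # skip lines that are not valid JSON
        if pattern in record:
            # "ids" and "state" are hypothetical field names.
            for item in record["ids"]:
                yield item, record["state"]


sample = [
    '{"ids": [1, 2, 3], "state": "california"},',
    "not json",
    '{"ids": [4], "state": "nevada"}',
]

# No column labels are supplied; pandas assigns default integer labels,
# and header=False keeps them out of the CSV text.
df = pd.DataFrame.from_records(iter_rows(sample, "ids"))
csv_text = df.to_csv(index=False, header=False)
```

Since tuples carry no keys, this also sidesteps the need for the 'keyN' labels that the dict-based constructor requires.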