Reading JSON data from a specific key with ijson

Alex

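I have several large json files that I'm trying to load into a pandas dataframe. I've found that the typical way to work with large json in Python is the ijson module. The json I have represents geolocated tweet IDs. I'm only interested in tweet IDs from the US. The json data looks like this: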

{
  "tweet_id": "1223655173056356353",
  "created_at": "Sat Feb 01 17:11:42 +0000 2020",
  "user_id": "3352471150",
  "geo_source": "user_location",
  "user_location": {
    "country_code": "br"
  },
  "geo": {},
  "place": {
    
  },
  "tweet_locations": [
    {
      "country_code": "it",
      "state": "Trentino-Alto",
      "county": "Pustertal - Val Pusteria"
    },
    {
      "country_code": "us"
    },
    {
      "country_code": "ru",
      "state": "Voronezh Oblast",
      "county": "Petropavlovsky District"
    },
    {
      "country_code": "at",
      "state": "Upper Austria",
      "county": "Braunau am Inn"
    },
    {
      "country_code": "it",
      "state": "Trentino-Alto",
      "county": "Pustertal - Val Pusteria"
    },
    {
      "country_code": "cn"
    },
    {
      "country_code": "in",
      "state": "Himachal Pradesh",
      "county": "Jubbal"
    }
  ]
}

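How would I use ijson to select only the tweet IDs from the US, and then put those US IDs into a dataframe? The ijson module is new to me, and I don't understand how to approach this task. More specifically, I want all tweet IDs where the country code in user_location is US or a country code in tweet_locations is US. All help is appreciated!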

Trenton McKinney

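Use `pandas.json_normalize`

Normalize semi-structured JSON data into a flat table.
data is your JSON dict
Pandas: Indexing and selecting data
- Boolean indexing
Data: Tweets with geo info (English) (Selection 1)
- Each file contains rows of dicts.
- They are not inside a list or tuple, so each row has to be read in separately.
- The value of tweet_locations is a list of dicts
- The value of user_location is a dict
When tweet_locations is an empty list ([] rather than [{}]), the row is not included in the dataframe, because of how json_normalize expects to see the metadata fields.
- The tweet_id from {"tweet_id":"1256223765513584641","created_at":"Fri May 01 14:07:39 +0000 2020","user_id":"772487185031311360","geo_source":"user_location","user_location":{"country_code":"us"},"geo":{},"place":{},"tweet_locations":[]} will not be included in the data.
  - This can be fixed by setting "tweet_locations" to [{}] whenever "tweet_locations" is [].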

import pandas as pd
import json
from pathlib import Path

# path to file, which contains the sample data at the bottom of this answer
file = Path('data/test.json')  # some path to your file

# load file
data = list()
with file.open('r') as f:
    for line in f:  # the file is rows of dicts that must be read 1 at a time
        data.append(json.loads(line))

# create dataframe
df = pd.json_normalize(data, 'tweet_locations', ['tweet_id', ['user_location', 'country_code']], errors='ignore')

# display(df.head())
  country_code              state         county    city             tweet_id user_location.country_code
0           us           Illinois  McLean County  Normal  1256223753220034566                        NaN
1           ke      Kiambu County            NaN     NaN  1256223748904161280                         ca
2           us           Illinois  McLean County  Normal  1256223744122593287                         us
3           th  Saraburi Province            NaN     NaN  1256223753463365632                        NaN
4           in              Assam          Lanka     NaN  1256223753463365632                        NaN

# filter for US in the two columns
us = df[(df.country_code == 'us') | (df['user_location.country_code'] == 'us')]

# display(us)
   country_code          state          county    city             tweet_id user_location.country_code
0            us       Illinois   McLean County  Normal  1256223753220034566                        NaN
2            us       Illinois   McLean County  Normal  1256223744122593287                         us
15           us       Michigan  Sanilac County     NaN  1256338355106672640                         in
16           us  West Virginia     Clay County     NaN  1256338355106672640                         in
18           us        Florida   Taylor County     NaN  1256338355106672640                         in

# get unique tweet_id
df_tweet_ids = us.tweet_id.unique().tolist()  # take the ids from the filtered frame `us`, not the full `df`

print(df_tweet_ids)
['1256223753220034566', '1256223744122593287', '1256338355106672640']
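
The empty-list workaround mentioned in the notes above can be sketched as a small preprocessing step that runs before `json_normalize`. This is only an illustration: `pad_empty_locations` is a hypothetical helper name, and the two records are trimmed-down versions of rows in this answer.

```python
def pad_empty_locations(records):
    """Return copies of the records in which an empty or missing
    'tweet_locations' list is replaced by [{}]."""
    padded = []
    for rec in records:
        rec = dict(rec)  # shallow copy, so the caller's dicts are left untouched
        if not rec.get('tweet_locations'):
            rec['tweet_locations'] = [{}]
        padded.append(rec)
    return padded


records = [
    {"tweet_id": "1256223765513584641", "user_location": {"country_code": "us"}, "tweet_locations": []},
    {"tweet_id": "1256223748904161280", "user_location": {"country_code": "ca"}, "tweet_locations": [{"country_code": "ke"}]},
]
padded = pad_empty_locations(records)

# `padded` can then be passed to the same call used above:
# df = pd.json_normalize(padded, 'tweet_locations', ['tweet_id', ['user_location', 'country_code']], errors='ignore')
```

With the padding in place, the first record should produce one row whose location columns are empty, so its tweet_id is still available to the `user_location.country_code` filter.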

Load and parse all the JSON files

Load at most one file at a time
Use pandas.concat to combine the list of dataframes, us_data

# path to files
p = Path('c:/path_to_files')

# get all json files
files = list(p.rglob('*.json'))

# parse files
us_data = list()
for file in files:
    data = list()
    with file.open('r', encoding='utf-8') as f:
        for line in f:
            data.append(json.loads(line))

    # create dataframe
    df = pd.json_normalize(data, 'tweet_locations', ['tweet_id', ['user_location', 'country_code']], errors='ignore')

    # filter for US in the two columns
    df = df[(df.country_code == 'us') | (df['user_location.country_code'] == 'us')]

    us_data.append(df)


# combine all data into one dataframe
us = pd.concat(us_data)

# delete objects that are no longer needed
del(data)
del(df)
del(us_data)

Parse only `tweet_id` without pandas

Since the file is rows of dicts, ijson isn't required.
As written, this will include the tweet_id if the user_location country_code is 'us', even if tweet_locations is an empty list.
- The tweet_id from {"tweet_id":"1256223765513584641","created_at":"Fri May 01 14:07:39 +0000 2020","user_id":"772487185031311360","geo_source":"user_location","user_location":{"country_code":"us"},"geo":{},"place":{},"tweet_locations":[]} will be included in the data.

file = Path('data/en_geo_2020-05-01/en_geo_2020-05-01.json')
tweet_ids = list()
with file.open('r') as f:
    for line in f:
        line = json.loads(line)
        if line.get('user_location').get('country_code') == 'us':
            tweet_ids.append(line.get('tweet_id'))
        else:
            if line['tweet_locations']:  # skip records whose tweet_locations list is empty
                tweet_locations_country_code = [i.get('country_code') for i in line['tweet_locations']]  # get the country_code of each tweet location
                if 'us' in tweet_locations_country_code:  # if 'us' is in the list
                    tweet_ids.append(line.get('tweet_id'))  # append

print(tweet_ids)
['1256223753220034566', '1256223744122593287', '1256338355106672640']
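
The same check can also be written as a generator, which keeps only one record in memory at a time and tolerates records where `user_location` or `tweet_locations` is missing. This is a sketch using only the standard library; `is_us_tweet` and `iter_us_tweet_ids` are hypothetical names, and the sample lines are shortened versions of the rows in the sample data.

```python
import json


def is_us_tweet(record):
    """True if the user location or any tweet location has country_code 'us'."""
    user_cc = (record.get('user_location') or {}).get('country_code')
    if user_cc == 'us':
        return True
    locations = record.get('tweet_locations') or []
    return any(loc.get('country_code') == 'us' for loc in locations)


def iter_us_tweet_ids(lines):
    """Yield the tweet_id of every US record from an iterable of JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        record = json.loads(line)
        if is_us_tweet(record):
            yield record['tweet_id']


sample = [
    '{"tweet_id":"1256223753220034566","user_location":{},"tweet_locations":[{"country_code":"us","state":"Illinois"}]}',
    '{"tweet_id":"1256223748904161280","user_location":{"country_code":"ca"},"tweet_locations":[{"country_code":"ke"}]}',
    '{"tweet_id":"1256223765513584641","user_location":{"country_code":"us"},"tweet_locations":[]}',
    '{"tweet_id":"1256223764980944904","tweet_locations":[]}',
]
us_ids = list(iter_us_tweet_ids(sample))
```

An open file can be passed in place of `sample`, since iterating over a file yields its lines. To get the dataframe the question asks for, the resulting list can be wrapped with `pd.DataFrame({'tweet_id': us_ids})`.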

Sample data

The data is rows of dicts in a file

{"tweet_id":"1256223753220034566","created_at":"Fri May 01 14:07:36 +0000 2020","user_id":"916540973190078465","geo_source":"tweet_text","user_location":{},"geo":{},"place":{},"tweet_locations":[{"country_code":"us","state":"Illinois","county":"McLean County","city":"Normal"}]}
{"tweet_id":"1256223748904161280","created_at":"Fri May 01 14:07:35 +0000 2020","user_id":"697426379583983616","geo_source":"user_location","user_location":{"country_code":"ca"},"geo":{},"place":{},"tweet_locations":[{"country_code":"ke","state":"Kiambu County"}]}
{"tweet_id":"1256223744122593287","created_at":"Fri May 01 14:07:34 +0000 2020","user_id":"1277481013","geo_source":"user_location","user_location":{"country_code":"us","state":"Florida"},"geo":{},"place":{},"tweet_locations":[{"country_code":"us","state":"Illinois","county":"McLean County","city":"Normal"}]}
{"tweet_id":"1256223753463365632","created_at":"Fri May 01 14:07:36 +0000 2020","user_id":"596005899","geo_source":"tweet_text","user_location":{},"geo":{},"place":{},"tweet_locations":[{"country_code":"th","state":"Saraburi Province"},{"country_code":"in","state":"Assam","county":"Lanka"},{"country_code":"cz","state":"Northeast","county":"okres \u00dast\u00ed nad Orlic\u00ed"},{"country_code":"lk"}]}
{"tweet_id":"1256223753115238406","created_at":"Fri May 01 14:07:36 +0000 2020","user_id":"139159502","geo_source":"user_location","user_location":{"country_code":"ca"},"geo":{},"place":{},"tweet_locations":[{"country_code":"ve"},{"country_code":"ca","state":"Nova Scotia","county":"Pictou County","city":"Diamond"},{"country_code":"my","state":"Selangor","city":"Kajang"}]}
{"tweet_id":"1256223748161757190","created_at":"Fri May 01 14:07:35 +0000 2020","user_id":"1655021437","geo_source":"user_location","user_location":{"country_code":"af","state":"Nangarhar","county":"Kot"},"geo":{},"place":{},"tweet_locations":[{"country_code":"cz","state":"Northeast","county":"okres \u00dast\u00ed nad Orlic\u00ed"},{"country_code":"cz","state":"Northeast","county":"okres \u00dast\u00ed nad Orlic\u00ed"},{"country_code":"gb","state":"England","county":"Gloucestershire"}]}
{"tweet_id":"1256223749214437380","created_at":"Fri May 01 14:07:35 +0000 2020","user_id":"3244990814","geo_source":"user_location","user_location":{"country_code":"se"},"geo":{},"place":{},"tweet_locations":[{"country_code":"cg","state":"Kouilou","county":"Pointe-Noire"},{"country_code":"cn"}]}
{"tweet_id":"1256338355106672640","created_at":"Fri May 01 21:43:00 +0000 2020","user_id":"1205700416123486208","geo_source":"user_location","user_location":{"country_code":"in","state":"Delhi"},"geo":{},"place":{},"tweet_locations":[{"country_code":"us","state":"Michigan","county":"Sanilac County"},{"country_code":"us","state":"West Virginia","county":"Clay County"},{"country_code":"de","state":"Baden-W\u00fcrttemberg","county":"Verwaltungsgemeinschaft Friedrichshafen"},{"country_code":"us","state":"Florida","county":"Taylor County"}]}
{"tweet_id":"1256223764980944904","created_at":"Fri May 01 14:07:39 +0000 2020","user_id":"1124447266205503488","geo_source":"none","user_location":{},"geo":{},"place":{},"tweet_locations":[]}
{"tweet_id":"1256223760765595650","created_at":"Fri May 01 14:07:38 +0000 2020","user_id":"909477905737990144","geo_source":"tweet_text","user_location":{},"geo":{},"place":{},"tweet_locations":[{"country_code":"lr","state":"Grand Bassa County","county":"District # 2"}]}
