How can I filter some of the data when using read_parquet() in pandas?

UN2758

I want to filter on a few gid values so that loading the file uses less memory.

reg_df = pd.read_parquet('/data/2010r.pq',
                             columns=['timestamp', 'gid', 'uid', 'flag'])

However, the documentation does not describe the kwargs. For example:

gid=[100,101,102,103,104,105]
gid_i_want_load = [100,103,105]

So how can I load only the gid values I want to compute on?

Bill Armstrong

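The introduction of **kwargs to the pandas library is documented here. It looks like the original intent was to actually pass columns into the request to limit IO volume. The contributors then took the next step and added a general pass-through for **kwargs.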

In pandas/io/parquet.py, read_parquet is defined as:

def read_parquet(path, engine='auto', columns=None, **kwargs):
    """
    Load a parquet object from the file path, returning a DataFrame.
    .. versionadded 0.21.0
    Parameters
    ----------
    path : string
        File path
    columns: list, default=None
        If not None, only these columns will be read from the file.
        .. versionadded 0.21.1
    engine : {'auto', 'pyarrow', 'fastparquet'}, default 'auto'
        Parquet library to use. If 'auto', then the option
        ``io.parquet.engine`` is used. The default ``io.parquet.engine``
        behavior is to try 'pyarrow', falling back to 'fastparquet' if
        'pyarrow' is unavailable.
    kwargs are passed to the engine
    Returns
    -------
    DataFrame
    """

    impl = get_engine(engine)
    return impl.read(path, columns=columns, **kwargs)
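The forwarding pattern in read_parquet above can be illustrated with a small stand-in. This is only a sketch: ToyEngine and toy_read_parquet are hypothetical names invented for the illustration, not part of pandas, and the "engine" simply records the arguments it receives instead of reading a file.

```python
class ToyEngine:
    """Hypothetical stand-in for a Parquet engine; it only records its arguments."""

    def read(self, path, columns=None, **kwargs):
        # A real engine would open the file here. We return what we were given.
        return {"path": path, "columns": columns, "engine_kwargs": kwargs}


def toy_read_parquet(path, engine=None, columns=None, **kwargs):
    """Mimics the shape of read_parquet: extra keywords go to the engine untouched."""
    impl = engine if engine is not None else ToyEngine()
    return impl.read(path, columns=columns, **kwargs)


# nthreads is not a named parameter of toy_read_parquet, so it lands in **kwargs
# and is handed to the engine unchanged.
call = toy_read_parquet("/data/2010r.pq", columns=["gid"], nthreads=4)
print(call["engine_kwargs"])
```

The point is that the top-level function never inspects the extra keywords, so which ones are valid is decided entirely by the engine further down the call chain.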

Also in pandas/io/parquet.py, the read method of the pyarrow engine is:

def read(self, path, columns=None, **kwargs):
    path, _, _, should_close = get_filepath_or_buffer(path)
    if self._pyarrow_lt_070:
        result = self.api.parquet.read_pandas(path, columns=columns,
                                              **kwargs).to_pandas()
    else:
        kwargs['use_pandas_metadata'] = True    #<-- only param for kwargs...
        result = self.api.parquet.read_table(path, columns=columns,
                                             **kwargs).to_pandas()
    if should_close:
        try:
            path.close()
        except:  # noqa: flake8
            pass

    return result

In pyarrow/parquet.py, read_pandas is:

def read_pandas(self, **kwargs):
    """
    Read dataset including pandas metadata, if any. Other arguments passed
    through to ParquetDataset.read, see docstring for further details

    Returns
    -------
    pyarrow.Table
        Content of the file as a table (of columns)
    """
    return self.read(use_pandas_metadata=True, **kwargs)  #<-- params being passed

In pyarrow/parquet.py, read is:

def read(self, columns=None, nthreads=1, use_pandas_metadata=False):  #<-- kwargs param at pyarrow
        """
        Read a Table from Parquet format

        Parameters
        ----------
        columns: list
            If not None, only these columns will be read from the file. A
            column name may be a prefix of a nested field, e.g. 'a' will select
            'a.b', 'a.c', and 'a.d.e'
        nthreads : int, default 1
            Number of columns to read in parallel. If > 1, requires that the
            underlying file source is threadsafe
        use_pandas_metadata : boolean, default False
            If True and file has custom pandas schema metadata, ensure that
            index columns are also loaded

        Returns
        -------
        pyarrow.table.Table
            Content of the file as a table (of columns)
        """
        column_indices = self._get_column_indices(
            columns, use_pandas_metadata=use_pandas_metadata)
        return self.reader.read_all(column_indices=column_indices,
                                    nthreads=nthreads)
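The read signature quoted above names exactly three keywords. As a rough sketch of what that implies for forwarded kwargs, the stand-in below copies only the signature (toy_read is a hypothetical name, and its body is invented) and shows that any other keyword, such as a gid filter, is rejected by Python itself.

```python
import inspect


def toy_read(columns=None, nthreads=1, use_pandas_metadata=False):
    """Hypothetical stand-in with the same keyword signature as the quoted read()."""
    return {
        "columns": columns,
        "nthreads": nthreads,
        "use_pandas_metadata": use_pandas_metadata,
    }


# The only keywords such a function can accept are the ones it names.
accepted = list(inspect.signature(toy_read).parameters)
print(accepted)

# An unknown keyword raises TypeError, because there is no **kwargs to absorb it.
try:
    toy_read(columns=["gid"], gid=[100, 103, 105])
    rejected = False
except TypeError as exc:
    rejected = True
    print(exc)
```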

So, if I understand correctly, it may be possible to pass nthreads and use_pandas_metadata through, but neither of them is explicitly assigned (??). I have not tested this, but it is perhaps a start.

Edited 2021-06-1
