Building Google Cloud Platform Data Catalog on unstructured data

nsk

I have unstructured data in the form of document images. We are converting these documents to JSON files. I now want to have technical metadata captured for this. Can someone please give me some tips/best practices for building a data catalog on unstructured data in Google Cloud Platform?

mesmacosta

This answer assumes that you are not using a tool such as BigQuery, Hive, or Presto to create schemas around your unstructured data and query it, and that you simply want to catalog your files.

I had a similar use case. Google Data Catalog has an option to create custom entries.

Some tips on building a Data Catalog on unstructured file data:

  1. Use meaningful file names for your JSON files. That makes searching for them easier.
  2. Since you are already using GCP, use its managed Data Catalog and leverage the custom entries API to ingest the files' metadata into it.
  3. If you also want to look for sensitive data in your JSON files, you could run DLP on them.
  4. Use Data Catalog Tags to enrich the files' metadata. The linked tutorial shows how to do it on BigQuery tables, but you can do the same on custom entries.

I would add some information about the ETL jobs that convert these documents into JSON files as Tags, such as execution time, data quality score, user, and business owner.
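As a rough sketch of what such an ETL tag could look like with the `google-cloud-datacatalog` Python client: the field IDs (`execution_seconds`, `data_quality_score`, `etl_user`, `business_owner`) and the shape of the `run` record are made up for illustration, and the code assumes a tag template with matching fields already exists.

```python
def etl_tag_values(run: dict) -> dict:
    """Map one ETL run record to (TagField attribute, value) pairs.

    The field IDs are hypothetical; they must match the fields of a tag
    template that you created beforehand.
    """
    return {
        "execution_seconds": ("double_value", float(run["duration_s"])),
        "data_quality_score": ("double_value", float(run["quality_score"])),
        "etl_user": ("string_value", run["user"]),
        "business_owner": ("string_value", run["business_owner"]),
    }


def attach_etl_tag(entry_name: str, template_name: str, run: dict):
    """Attach the ETL values to an existing entry as a Data Catalog tag.

    entry_name and template_name are full resource names of an entry and a
    tag template that are assumed to exist already.
    """
    # Imported here so the pure helper above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    tag = datacatalog_v1.Tag()
    tag.template = template_name
    for field_id, (kind, value) in etl_tag_values(run).items():
        tag.fields[field_id] = datacatalog_v1.TagField(**{kind: value})
    return client.create_tag(parent=entry_name, tag=tag)
```

Keeping the mapping in a separate function makes it easy to check the values before anything is sent to the API.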

In case you are wondering how to do step 2, I put together a script that does it automatically (see the GitHub link). Another option is to work with Data Catalog Filesets.
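This is not the script mentioned above, just a minimal sketch of step 2 with the `google-cloud-datacatalog` Python client. The system and type names (`document_pipeline`, `json_document`), the `linked_resource` format, and the helper names are my own assumptions, and the entry group is assumed to exist already.

```python
import re


def entry_fields_for_file(bucket: str, object_name: str) -> dict:
    """Derive custom-entry fields from a Cloud Storage object name."""
    # Entry IDs may only contain letters, digits and underscores, must start
    # with a letter or underscore, and are limited to 64 characters.
    entry_id = re.sub(r"[^A-Za-z0-9_]", "_", object_name)
    if not re.match(r"[A-Za-z_]", entry_id):
        entry_id = "_" + entry_id
    return {
        # Truncation can cause collisions for very long object names.
        "entry_id": entry_id[:64],
        "display_name": object_name.rsplit("/", 1)[-1],
        # Assumed format for pointing back at the file.
        "linked_resource": f"//storage.googleapis.com/{bucket}/{object_name}",
        # Hypothetical names for the source system and the entry type.
        "user_specified_system": "document_pipeline",
        "user_specified_type": "json_document",
    }


def create_custom_entry(project: str, location: str, entry_group_id: str, fields: dict):
    """Create one custom entry in an existing entry group."""
    # Imported here so the pure helper above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    entry = datacatalog_v1.Entry()
    entry.display_name = fields["display_name"]
    entry.linked_resource = fields["linked_resource"]
    entry.user_specified_system = fields["user_specified_system"]
    entry.user_specified_type = fields["user_specified_type"]
    parent = client.entry_group_path(project, location, entry_group_id)
    return client.create_entry(parent=parent, entry_id=fields["entry_id"], entry=entry)
```

Because the display name keeps the original file name, meaningful file names (tip 1) carry over directly into catalog search.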

So, to choose between custom entries and filesets, I'd ask you this: do you need information about individual file names?

If not, filesets might be easier. At the time of this writing a fileset does not show any information about individual file names, but it is a good way to manage file patterns in GCS buckets: a fileset is defined by one or more file patterns that specify a set of one or more Cloud Storage files.
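For comparison, a fileset entry could be created roughly like this with the same Python client. The bucket name, prefixes, and helper names are placeholders of mine, and the entry group is again assumed to exist.

```python
def fileset_patterns(bucket: str, prefixes, extension: str = "json") -> list:
    """Build gs:// file patterns, one per prefix, for a fileset entry."""
    patterns = []
    for prefix in prefixes:
        prefix = prefix.strip("/")
        path = f"{prefix}/" if prefix else ""
        patterns.append(f"gs://{bucket}/{path}*.{extension}")
    return patterns


def create_fileset_entry(project, location, entry_group_id, entry_id, display_name, patterns):
    """Create a FILESET entry that covers the given file patterns."""
    # Imported here so the pure helper above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    entry = datacatalog_v1.Entry()
    entry.display_name = display_name
    entry.type_ = datacatalog_v1.EntryType.FILESET
    entry.gcs_fileset_spec.file_patterns.extend(patterns)
    parent = client.entry_group_path(project, location, entry_group_id)
    return client.create_entry(parent=parent, entry_id=entry_id, entry=entry)
```

One entry then stands for every file matching the patterns, which is why individual file names do not show up in the catalog.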

The datacatalog-util tool also has an option to enrich your filesets, in case you just want statistics about them, such as average file size and file types.
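To give an idea of the kind of statistics meant here, this is a small standalone sketch. It is not how datacatalog-util computes them; it only assumes you have a list of `(object name, size in bytes)` pairs for the files matched by the fileset.

```python
from collections import Counter
from pathlib import PurePosixPath


def fileset_stats(objects) -> dict:
    """Summarize (name, size_in_bytes) pairs: count, sizes and file types."""
    objects = list(objects)
    sizes = [size for _, size in objects]
    # Group by lower-cased extension; files without one are counted as "(none)".
    types = Counter(
        PurePosixPath(name).suffix.lower() or "(none)" for name, _ in objects
    )
    return {
        "count": len(objects),
        "total_bytes": sum(sizes),
        "average_bytes": sum(sizes) / len(sizes) if sizes else 0.0,
        "types": dict(types),
    }
```

The pairs could come from listing the bucket, for example from `blob.name` and `blob.size` in the `google-cloud-storage` client, and the resulting numbers could be written back to the fileset entry as tag values.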
