How to access Google Cloud Storage Bucket from AI Platform job

bhoeksem Published at Dev

bhoeksem

My Google AI Platform / ML Engine training job doesn't seem to have access to the training file I put into a Google Cloud Storage bucket.

Google's AI Platform / ML Engine requires you to store training data files in a Cloud Storage bucket. Accessing the data locally from the CLI works fine. However, when I submit a training job (after making sure the data is in the right location in my bucket), I get an error that appears to be caused by the job not having access to the bucket's Link URL.

The error comes from trying to read what looks like a web page that Google served in place of the file, essentially saying "Hey, you don't have access to this." In the logs I see gaia.loginAutoRedirect.start(5000, followed by a URL that ends with the flag noautologin=true.

I know permissions between AI Platform and Cloud Storage are a thing, but both are under the same project. The walkthroughs I'm using at the very least imply that no further action is required when both are under the same project.

I am assuming I need to use the Link URL shown in the bucket's Overview tab. I also tried the link for gsutil, but the Python code (from Google's CloudML Samples repo) rejected the gs:// scheme.

I think Google's examples are proving insufficient since their example data is from a public URL rather than a private Cloud Storage bucket.

Ultimately, the error I get is a Python error. As mentioned, it is preceded by a long run of INFO logs containing HTML/CSS/JS from Google, which say I don't have permission to fetch the file. Those logs only appear because I added a print statement to util.py, right before the read_csv() call on the train file. So the Python parse error is caused by trying to parse HTML as CSV.

... 
INFO    g("gaia.loginAutoRedirect.stop",function(){var b=n;b.b=!0;b.a&&(clearInterval(b.a),b.a=null)});
INFO    gaia.loginAutoRedirect.start(5000,
INFO    'https:\x2F\x2Faccounts.google.com\x2FServiceLogin?continue=https%3A%2F%2Fstorage.cloud.google.com%2F<BUCKET_NAME>%2Fdata%2F%2Ftrain.csv\x26followup=https%3A%2F%2Fstorage.cloud.google.com%2F<BUCKET_NAME>%2Fdata%2F%2Ftrain.csv\x26service=cds\x26passive=1209600\x26noautologin=true',
ERROR   Command '['python', '-m', u'trainer.task', u'--train-files', u'gs://<BUCKET_NAME>/data/train.csv', u'--eval-files', u'gs://<BUCKET_NAME>/data/test.csv', u'--batch-pct', u'0.2', u'--num-epochs', u'1000', u'--verbosity', u'DEBUG', '--job-dir', u'gs://<BUCKET_NAME>/predictor']' returned non-zero exit status 1.

Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 137, in <module>
    train_and_evaluate(args)
  File "/root/.local/lib/python2.7/site-packages/trainer/task.py", line 80, in train_and_evaluate
    train_x, train_y, eval_x, eval_y = util.load_data()
  File "/root/.local/lib/python2.7/site-packages/trainer/util.py", line 168, in load_data
    train_df = pd.read_csv(training_file_path, header=0, names=_CSV_COLUMNS, na_values='?')
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1848, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 876, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 891, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 945, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
ParserError: Error tokenizing data. C error: Expected 5 fields in line 205, saw 961
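
Since the ParserError above comes from pandas being handed HTML rather than CSV, a small guard before read_csv() can make that failure explicit. The following is only a sketch: looks_like_html is a hypothetical helper (not part of the sample repo), and the check is a heuristic that assumes the login page begins with a doctype or an <html> tag.

```python
def looks_like_html(text, probe_chars=512):
    """Heuristic: return True if the start of `text` looks like an HTML page.

    Only the first `probe_chars` characters are inspected, so this is cheap
    even for large downloads. It is an assumption that the sign-in page
    starts with a doctype or <html> tag.
    """
    head = text[:probe_chars].lstrip().lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


def ensure_csv_text(text, source):
    """Raise a descriptive error instead of letting the CSV parser choke on HTML."""
    if looks_like_html(text):
        raise ValueError(
            "%s returned an HTML page (probably a sign-in redirect), not CSV" % source
        )
    return text
```

Calling ensure_csv_text() on the fetched content before parsing would turn the confusing "Expected 5 fields ... saw 961" message into one that points at the access problem.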

To get the data, I'm more or less trying to mimic this: https://github.com/GoogleCloudPlatform/cloudml-samples/blob/master/census/tf-keras/trainer/util.py

Various ways I have tried to address my bucket in my copy of util.py:

https://console.cloud.google.com/storage/browser/<BUCKET_NAME>/data (I think this was the "Link URL" back in May)
https://storage.cloud.google.com/<BUCKET_NAME>/data (this is the "Link URL" now, in July)
gs://<BUCKET_NAME>/data (this is the URI, which gives a different error about not accepting gs as a URL type)

rpasricha

Transferring the answer from a comment above:

It looks like the URL approach requires cookie-based authentication when the object is not public, which a training job cannot provide. Instead of using a URL, I would suggest using tf.gfile with a gs:// path, as the Keras sample does. If you need to download the file from GCS in a separate step, you can use the GCS client library.
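
As a rough illustration of both suggestions, here is a minimal sketch. The helper names (split_gs_uri, read_text, download_gcs_file) are hypothetical and not taken from the sample repo. It uses tf.io.gfile.GFile, the newer spelling of tf.gfile.GFile, and the google-cloud-storage package; both are imported lazily and are assumed to be installed in the training environment. The opener parameter is an addition for local testing only.

```python
import io


def split_gs_uri(uri):
    """Split 'gs://bucket/path/to/object' into ('bucket', 'path/to/object')."""
    if not uri.startswith("gs://"):
        raise ValueError("expected a gs:// URI, got %r" % (uri,))
    bucket, _, blob = uri[len("gs://"):].partition("/")
    return bucket, blob


def read_text(path, opener=None):
    """Read `path` fully and return its contents as an in-memory text buffer.

    By default the file is opened with TensorFlow's file API, which accepts
    gs:// paths. Pass opener=open to read a local file instead.
    """
    if opener is None:
        import tensorflow as tf  # lazy import; TF 1.x also offers tf.gfile.GFile
        opener = tf.io.gfile.GFile
    with opener(path, "r") as f:
        return io.StringIO(f.read())


def download_gcs_file(uri, destination):
    """Copy a GCS object to a local file using the GCS client library."""
    from google.cloud import storage  # lazy import; needs google-cloud-storage

    bucket_name, blob_name = split_gs_uri(uri)
    client = storage.Client()  # uses the environment's default credentials
    client.bucket(bucket_name).blob(blob_name).download_to_filename(destination)
    return destination
```

With either helper, the read_csv() call in util.py receives a file object or a local path rather than an https:// URL, for example pd.read_csv(read_text("gs://<BUCKET_NAME>/data/train.csv"), header=0, names=_CSV_COLUMNS, na_values='?'). Both routes should authenticate with the job's service account rather than browser cookies.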


Edited at 2020-12-6
