Batch transform job results in "InternalServerError" with data file >100MB

Melodie.R

I'm using SageMaker to perform binary classification on time series; each sample is a numpy array of shape [24, 11] (24 hours, 11 features). I used a TensorFlow model in script mode, with a script very similar to the one I used as a reference: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/mnist.py

The training reported success and I was able to deploy a model for batch transformation. The transform job works fine when I input just a few samples (say, [10, 24, 11]), but it returns an InternalServerError when I input more samples for prediction (for example, [30000, 24, 11], which is over 100 MB in size).

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-0c46f7563389> in <module>()
     32 
     33 # Then wait until transform job is completed
---> 34 tf_transformer.wait()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
    133     def wait(self):
    134         self._ensure_last_transform_job()
--> 135         self.latest_transform_job.wait()
    136 
    137     def _ensure_last_transform_job(self):

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
    207 
    208     def wait(self):
--> 209         self.sagemaker_session.wait_for_transform_job(self.job_name)
    210 
    211     @staticmethod

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_transform_job(self, job, poll)
    893         """
    894         desc = _wait_until(lambda: _transform_job_status(self.sagemaker_client, job), poll)
--> 895         self._check_job_status(job, desc, 'TransformJobStatus')
    896         return desc
    897 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
    915             reason = desc.get('FailureReason', '(No reason provided)')
    916             job_type = status_key_name.replace('JobStatus', ' job')
--> 917             raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
    918 
    919     def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error for Transform job Tensorflow-batch-transform-2019-05-29-02-56-00-477: Failed Reason: InternalServerError: We encountered an internal error.  Please try again.

I tried both the SingleRecord and MultiRecord strategies when deploying the model, but the result was the same, so I decided to keep MultiRecord. My transformer looks like this:

transformer = tf_estimator.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100,        # maximum payload per request, in MB (100 is the service maximum)
    assemble_with='Line',   # assemble output records separated by newlines
    strategy='MultiRecord'  # pack multiple records into each inference request
)

At first I was using a JSON file as input for the transform job, and it threw the error:

Too much data for max payload size

So next I tried the jsonlines format (the .npy format is not supported, as far as I understand), thinking that a jsonlines file could be split by Line and thus avoid the size error, but that's where I got the InternalServerError. Here is the related code:

import jsonlines

# Convert test_x to jsonlines and save
test_x_list = test_x.tolist()
file_path = 'data_cnn_test/test_x.jsonl'
file_name = 'test_x.jsonl'

with jsonlines.open(file_path, 'w') as writer:
    writer.write(test_x_list)

input_key = 'batch_transform_tf/input/{}'.format(file_name)
output_key = 'batch_transform_tf/output'
test_input_location = 's3://{}/{}'.format(bucket, input_key)
test_output_location = 's3://{}/{}'.format(bucket, output_key)

s3.upload_file(file_path, bucket, input_key)

# Initialize the transformer object
tf_transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Tensorflow-batch-transform',
    model_name='sagemaker-tensorflow-scriptmode-2019-05-29-02-46-36-162',
    instance_count=1,
    instance_type='ml.c4.2xlarge',
    output_path=test_output_location,
    assemble_with='Line'
)

# Start the transform job
tf_transformer.transform(test_input_location, content_type='application/jsonlines', split_type='Line')

The list named test_x_list has shape [30000, 24, 11], which corresponds to 30000 samples, so I would like to get back 30000 predictions.

I suspect my jsonlines file isn't being split by Line and is of course too big to be processed in one batch, which throws the error, but I don't understand why it doesn't get split correctly. I am using the default output_fn and input_fn (I did not rewrite those functions in my script).
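For reference, here is a quick check (a minimal sketch; file_path is the variable from the code above) to see whether the file really contains one record per line, which is what split_type='Line' operates on:

# Count the lines in the jsonlines file: with one sample per line there
# should be 30000 lines; a count of 1 would mean the whole dataset was
# written as a single record on one line.
with open(file_path) as f:
    num_lines = sum(1 for _ in f)
print(num_lines)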

Any insight on what I could be doing wrong would be greatly appreciated.

ishaaq

I assume this is a duplicate of this AWS Forum post: https://forums.aws.amazon.com/thread.jspa?threadID=303810&tstart=0

Anyway, for completeness I'll answer here as well.

The issue is that you are serializing your dataset incorrectly when converting it into jsonlines:

test_x_list = test_x.tolist()
...
with jsonlines.open(file_path, 'w') as writer:
    writer.write(test_x_list)   

What the above is doing is creating one very large line containing your full dataset, which is too big for a single inference call to consume.

I suggest you change your code to make each line a single sample so that inference can take place on individual samples instead of the whole dataset:

test_x_list = test_x.tolist()
...
with jsonlines.open(file_path, 'w') as writer:
    for sample in test_x_list:
        writer.write(sample)
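To sanity-check the rewritten file (a minimal sketch, reusing file_path from your code above), you can read it back and confirm there is one [24, 11] sample per line:

import jsonlines

# Each iteration over the reader yields one line, i.e. one sample
with jsonlines.open(file_path) as reader:
    samples = list(reader)

print(len(samples))                         # expect 30000 samples
print(len(samples[0]), len(samples[0][0]))  # expect 24 and 11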

If one sample at a time is too slow, you can also play around with the max_concurrent_transforms, strategy, and max_payload parameters to batch the data and run concurrent transforms, provided your algorithm can run in parallel. And of course you can split the data into multiple files and run the transform with more than one node. See https://sagemaker.readthedocs.io/en/latest/transformer.html and https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html for additional detail on what these parameters do.
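For example (a sketch only; the instance count, payload size, and concurrency values below are illustrative, not tuned recommendations):

tf_transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Tensorflow-batch-transform',
    model_name='sagemaker-tensorflow-scriptmode-2019-05-29-02-46-36-162',
    instance_count=2,              # extra nodes help if the input is split into multiple files
    instance_type='ml.c4.2xlarge',
    output_path=test_output_location,
    assemble_with='Line',
    strategy='MultiRecord',        # pack several lines into each inference request
    max_payload=6,                 # per-request payload cap, in MB
    max_concurrent_transforms=4    # parallel requests per instance
)

tf_transformer.transform(test_input_location,
                         content_type='application/jsonlines',
                         split_type='Line')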
