Batch transform job results in "InternalServerError" with data file >100MB

Melodie.R

I'm using SageMaker to perform binary classification on time series; each sample is a numpy array of shape [24, 11] (24 hours, 11 features). I used a TensorFlow model in script mode, with a script very similar to the one I used as a reference: https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker-python-sdk/tensorflow_script_mode_training_and_serving/mnist.py

The training reported success and I was able to deploy a model for batch transformation. The transform job works fine when I input just a few samples (say, [10, 24, 11]), but it returns an InternalServerError when I input more samples for prediction (for example, [30000, 24, 11], which is over 100 MB in size).

Here is the error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-0c46f7563389> in <module>()
     32 
     33 # Then wait until transform job is completed
---> 34 tf_transformer.wait()

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
    133     def wait(self):
    134         self._ensure_last_transform_job()
--> 135         self.latest_transform_job.wait()
    136 
    137     def _ensure_last_transform_job(self):

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/transformer.py in wait(self)
    207 
    208     def wait(self):
--> 209         self.sagemaker_session.wait_for_transform_job(self.job_name)
    210 
    211     @staticmethod

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in wait_for_transform_job(self, job, poll)
    893         """
    894         desc = _wait_until(lambda: _transform_job_status(self.sagemaker_client, job), poll)
--> 895         self._check_job_status(job, desc, 'TransformJobStatus')
    896         return desc
    897 

~/anaconda3/envs/tensorflow_p36/lib/python3.6/site-packages/sagemaker/session.py in _check_job_status(self, job, desc, status_key_name)
    915             reason = desc.get('FailureReason', '(No reason provided)')
    916             job_type = status_key_name.replace('JobStatus', ' job')
--> 917             raise ValueError('Error for {} {}: {} Reason: {}'.format(job_type, job, status, reason))
    918 
    919     def wait_for_endpoint(self, endpoint, poll=5):

ValueError: Error for Transform job Tensorflow-batch-transform-2019-05-29-02-56-00-477: Failed Reason: InternalServerError: We encountered an internal error.  Please try again.

I tried both the SingleRecord and MultiRecord strategies when deploying the model, but the result was the same, so I decided to keep MultiRecord. My transformer looks like this:

transformer = tf_estimator.transformer(
    instance_count=1,
    instance_type='ml.m4.xlarge',
    max_payload=100,        # maximum payload per request, in MB (100 is the service maximum)
    assemble_with='Line',   # assemble output records separated by newlines
    strategy='MultiRecord'  # pack multiple records into each inference request
)

At first I was using a JSON file as input for the transform job, and it threw the error:

Too much data for max payload size

So next I tried the jsonlines format (the .npy format is not supported, as far as I understand), thinking that a jsonlines file could be split by Line and thus avoid the size error, but that's where I got the InternalServerError. Here is the related code:

import jsonlines

# Convert test_x to jsonlines and save
test_x_list = test_x.tolist()
file_path = 'data_cnn_test/test_x.jsonl'
file_name = 'test_x.jsonl'

with jsonlines.open(file_path, 'w') as writer:
    writer.write(test_x_list)

input_key = 'batch_transform_tf/input/{}'.format(file_name)
output_key = 'batch_transform_tf/output'
test_input_location = 's3://{}/{}'.format(bucket, input_key)
test_output_location = 's3://{}/{}'.format(bucket, output_key)

s3.upload_file(file_path, bucket, input_key)

# Initialize the transformer object
tf_transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Tensorflow-batch-transform',
    model_name='sagemaker-tensorflow-scriptmode-2019-05-29-02-46-36-162',
    instance_count=1,
    instance_type='ml.c4.2xlarge',
    output_path=test_output_location,
    assemble_with='Line'
)

# Start the transform job
tf_transformer.transform(test_input_location, content_type='application/jsonlines', split_type='Line')

The list named test_x_list has shape [30000, 24, 11], which corresponds to 30000 samples, so I would like to get back 30000 predictions.

I suspect my jsonlines file isn't being split by Line and is of course too big to be processed in one batch, which throws the error, but I don't understand why it doesn't get split correctly. I am using the default output_fn and input_fn (I did not rewrite those functions in my script).
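For reference, here is a quick check (a minimal sketch; file_path is the variable from the code above) to see whether the file really contains one record per line, which is what split_type='Line' operates on:

# Count the lines in the jsonlines file: with one sample per line there
# should be 30000 lines; a count of 1 would mean the whole dataset was
# written as a single record on one line.
with open(file_path) as f:
    num_lines = sum(1 for _ in f)
print(num_lines)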

Any insight on what I could be doing wrong would be greatly appreciated.

ishaaq

I assume this is a duplicate of this AWS Forum post: https://forums.aws.amazon.com/thread.jspa?threadID=303810&tstart=0

Anyway, for completeness I'll answer here as well.

The issue is that you are serializing your dataset incorrectly when converting it into jsonlines:

test_x_list = test_x.tolist()
...
with jsonlines.open(file_path, 'w') as writer:
    writer.write(test_x_list)   

What the above is doing is creating one very large line containing your full dataset, which is too big for a single inference call to consume.

I suggest you change your code to make each line a single sample so that inference can take place on individual samples instead of the whole dataset:

test_x_list = test_x.tolist()
...
with jsonlines.open(file_path, 'w') as writer:
    for sample in test_x_list:
        writer.write(sample)
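To sanity-check the rewritten file (a minimal sketch, reusing file_path from your code above), you can read it back and confirm there is one [24, 11] sample per line:

import jsonlines

# Each iteration over the reader yields one line, i.e. one sample
with jsonlines.open(file_path) as reader:
    samples = list(reader)

print(len(samples))                         # expect 30000 samples
print(len(samples[0]), len(samples[0][0]))  # expect 24 and 11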

If one sample at a time is too slow, you can also play around with the max_concurrent_transforms, strategy, and max_payload parameters to batch the data and run concurrent transforms, provided your algorithm can run in parallel. And of course you can split the data into multiple files and run the transform with more than one node. See https://sagemaker.readthedocs.io/en/latest/transformer.html and https://docs.aws.amazon.com/sagemaker/latest/dg/API_CreateTransformJob.html for additional detail on what these parameters do.
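For example (a sketch only; the instance count, payload size, and concurrency values below are illustrative, not tuned recommendations):

tf_transformer = sagemaker.transformer.Transformer(
    base_transform_job_name='Tensorflow-batch-transform',
    model_name='sagemaker-tensorflow-scriptmode-2019-05-29-02-46-36-162',
    instance_count=2,              # extra nodes help if the input is split into multiple files
    instance_type='ml.c4.2xlarge',
    output_path=test_output_location,
    assemble_with='Line',
    strategy='MultiRecord',        # pack several lines into each inference request
    max_payload=6,                 # per-request payload cap, in MB
    max_concurrent_transforms=4    # parallel requests per instance
)

tf_transformer.transform(test_input_location,
                         content_type='application/jsonlines',
                         split_type='Line')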
