AWS RDS with Postgres: is the OOM killer configured?


Loc Ann

We are running a load test against an application that hits a Postgres database.

During the test, we suddenly get an increase in error rate. After analysing the platform and application behaviour, we notice that:

CPU utilization of the Postgres RDS instance is at 100%
Freeable memory drops on the same instance

And in the postgres logs, we see:

2018-08-21 08:19:48 UTC::@:[XXXXX]:LOG: server process (PID XXXX) was terminated by signal 9: Killed

After investigating and reading the documentation, it appears one possibility is that the Linux OOM killer ran and killed the process.
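As a minimal sketch of how such entries could be picked out of a downloaded log file, the helper below matches the message format shown above and extracts the signal number. The class and method names are hypothetical. Note that signal 9 (SIGKILL) only shows that the backend was killed; on its own it does not prove that the OOM killer was the sender.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any PostgreSQL or AWS tooling.
final class KilledBackendScan {
    // Matches e.g. "server process (PID 1234) was terminated by signal 9: Killed".
    private static final Pattern TERMINATED = Pattern.compile(
            "server process \\(PID ([^)]+)\\) was terminated by signal (\\d+)");

    private KilledBackendScan() {}

    /** Returns the signal number if the line reports a backend terminated by a signal. */
    static Optional<Integer> terminationSignal(String logLine) {
        Matcher matcher = TERMINATED.matcher(logLine);
        if (matcher.find()) {
            return Optional.of(Integer.parseInt(matcher.group(2)));
        }
        return Optional.empty();
    }
}
```

Calling `KilledBackendScan.terminationSignal(line)` on each line of the log and keeping those that return 9 gives the timestamps to correlate with the memory metrics.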

But since we're on RDS, we cannot access system logs such as /var/log/messages to confirm this.

So can somebody:

confirm that the OOM killer really runs on AWS RDS for Postgres?
give us a way to check this?
give us a way to compute the maximum memory used by Postgres based on the number of connections?

I didn't find the answer here:

Fabio Manzano

AWS maintains a page with best practices for their RDS service: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/CHAP_BestPractices.html

In terms of memory allocation, this is their recommendation:

An Amazon RDS performance best practice is to allocate enough RAM so that your working set resides almost completely in memory. To tell if your working set is almost all in memory, check the ReadIOPS metric (using Amazon CloudWatch) while the DB instance is under load. The value of ReadIOPS should be small and stable. If scaling up the DB instance class—to a class with more RAM—results in a dramatic drop in ReadIOPS, your working set was not almost completely in memory. Continue to scale up until ReadIOPS no longer drops dramatically after a scaling operation, or ReadIOPS is reduced to a very small amount. For information on monitoring a DB instance's metrics, see Viewing DB Instance Metrics.

This is also their recommendation for troubleshooting possible OS issues:

Amazon RDS provides metrics in real time for the operating system (OS) that your DB instance runs on. You can view the metrics for your DB instance using the console, or consume the Enhanced Monitoring JSON output from Amazon CloudWatch Logs in a monitoring system of your choice. For more information about Enhanced Monitoring, see Enhanced Monitoring.

There are a lot of good recommendations there, including query tuning.

Note that, as a last resort, you could switch to Aurora, which is compatible with PostgreSQL:

Aurora features a distributed, fault-tolerant, self-healing storage system that auto-scales up to 64TB per database instance. Aurora delivers high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to Amazon S3, and replication across three Availability Zones.

EDIT: regarding your specific issue with PostgreSQL, check this Stack Exchange thread, where they had a long-lived connection with auto-commit set to false.

We had a long connection with auto commit set to false:

connection.setAutoCommit(false);

During that time we were doing a lot of small queries and a few queries with a cursor:

statement.setFetchSize(SOME_FETCH_SIZE);

In JDBC you create a connection object, and from that connection you create statements. When you execute the statements you get a result set.

Now, every one of these objects needs to be closed, but if you close a statement, its result set is closed, and if you close the connection, all of its statements and their result sets are closed.

We were used to short-lived queries with connections of their own, so we never closed statements, assuming the connection would take care of them once it was closed.

The problem now was this long transaction (~24 hours), which never closed the connection, so the statements were never closed either. Apparently, the statement object holds resources both on the server that runs the code and on the PostgreSQL database.

My best guess as to which resources are left in the DB is those related to the cursor. The statements that used the cursor were never closed, so the result sets they returned were never closed either. This meant the database didn't free the relevant cursor resources, and since the cursor was over a huge table, it took a lot of RAM.
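One way to avoid this kind of leak on a long-lived connection is to close each statement and result set as soon as it is no longer needed, for example with try-with-resources. The following is only a sketch under assumptions: `CursorScan` and `countRows` are hypothetical names, not code from the quoted thread, and the caller is assumed to own the connection and to have already called `connection.setAutoCommit(false)`.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper for illustration; not taken from the quoted thread.
final class CursorScan {
    private CursorScan() {}

    /**
     * Counts the rows returned by sql, fetching them in batches of fetchSize.
     * The connection is left open: it belongs to the caller, who is assumed
     * to have disabled auto-commit beforehand.
     */
    public static long countRows(Connection connection, String sql, int fetchSize)
            throws SQLException {
        // try-with-resources closes the result set first, then the statement,
        // when the block exits, even if an exception is thrown.
        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            statement.setFetchSize(fetchSize);
            try (ResultSet resultSet = statement.executeQuery()) {
                long rows = 0;
                while (resultSet.next()) {
                    rows++;
                }
                return rows;
            }
        }
    }
}
```

With this pattern the statement and its result set are released after every query, so their lifetime no longer depends on when the long-running connection is finally closed.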

Hope it helps!
