I set the parameter spark.cassandra.output.batch.size.rows in my SparkConf as follows:
val conf = new SparkConf(true)
.set("spark.cassandra.connection.host", "host")
.set("spark.cassandra.auth.username", "cassandra")
.set("spark.cassandra.auth.password", "cassandra")
.set("spark.cassandra.output.batch.size.rows", "5120")
.set("spark.cassandra.output.concurrent.writes", "10")
But when I run
saveToCassandra("data", "ten_days")
I keep seeing warnings like these in system.log:
INFO [FlushWriter:7] 2014-11-20 11:11:16,498 Memtable.java (line 395) Completed flushing /var/lib/cassandra/data/system/hints/system-hints-jb-76-Data.db (5747287 bytes) for commitlog position ReplayPosition(segmentId=1416480663951, position=44882909)
INFO [FlushWriter:7] 2014-11-20 11:11:16,499 Memtable.java (line 355) Writing Memtable-ten_days@1656582530(32979978/329799780 serialized/live bytes, 551793 ops)
WARN [Native-Transport-Requests:761] 2014-11-20 11:11:16,499 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36825, exceeding specified threshold of 5120 by 31705.
WARN [Native-Transport-Requests:777] 2014-11-20 11:11:16,500 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36813, exceeding specified threshold of 5120 by 31693.
WARN [Native-Transport-Requests:822] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36823, exceeding specified threshold of 5120 by 31703.
WARN [Native-Transport-Requests:835] 2014-11-20 11:11:16,500 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36817, exceeding specified threshold of 5120 by 31697.
WARN [Native-Transport-Requests:781] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36817, exceeding specified threshold of 5120 by 31697.
WARN [Native-Transport-Requests:755] 2014-11-20 11:11:16,501 BatchStatement.java (line 226) Batch of prepared statements for [data.ten_days] is of size 36822, exceeding specified threshold of 5120 by 31702.
I know these are only warnings, but I'd like to understand why my setting isn't working as expected. I can also see a lot of hints in the cluster. Does the batch size affect the number of hints in the cluster?

Thanks
You set the batch size in rows, not the batch size in bytes. That means the connector limits the number of rows per batch, not the batch's in-memory size.
spark.cassandra.output.batch.size.rows: number of rows per batch; the default is "auto", which means the connector adjusts the number of rows based on the amount of data in each row.
spark.cassandra.output.batch.size.bytes: maximum total size of a batch in bytes; defaults to 64 kB.
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/5_saving.md
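If the intent is to cap batches by memory size rather than row count, a minimal sketch of the corrected configuration might look like the following (the host and credentials are placeholders from the question, and the 5120-byte cap only mirrors the threshold seen in the warnings):

```scala
import org.apache.spark.SparkConf

// Sketch: cap each batch by bytes instead of limiting the row count.
// Host and credentials are placeholders; adjust for your cluster.
val conf = new SparkConf(true)
  .set("spark.cassandra.connection.host", "host")
  .set("spark.cassandra.auth.username", "cassandra")
  .set("spark.cassandra.auth.password", "cassandra")
  // Limit batch size to ~5 kB to stay under the warning threshold:
  .set("spark.cassandra.output.batch.size.bytes", "5120")
  .set("spark.cassandra.output.concurrent.writes", "10")
```

Note that batch.size.rows and batch.size.bytes serve different purposes: with batch.size.rows left at "auto", the bytes limit alone controls how large each batch grows.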
More importantly, you'll likely get better results by keeping the larger default batch size (64 kB) and raising the warning threshold in the cassandra.yaml file instead.
We've recently found that larger batches can make some C* configurations unstable, so lower the value if your system becomes unstable.
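The 5120-byte threshold in the warnings corresponds to Cassandra's default batch-size warning limit of 5 kB. Assuming a Cassandra version that exposes this knob (2.0.8 and later), raising it in cassandra.yaml might look like:

```yaml
# cassandra.yaml (config fragment; value shown is an illustrative choice)
# Warn only for batches larger than 64 kB, instead of the 5 kB default.
batch_size_warn_threshold_in_kb: 64
```

A restart of the node is needed for the change to take effect.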