I am looking for the fastest way to perform a row scan of very large Bigtable tables using the latest Java API. I only need to scan based on partial row-key values (no column/column-family information needed). The row keys are well distributed, and Bigtable's lexicographic sorting works well for this use case.
There are a lot of answers on this topic from over the years, but some are outdated and some seem to be HBase-specific or shell-specific. I need this specifically for Cloud Bigtable and the latest version of the Java API.
For now, based on my own testing, I see this as the best approach:
Scan s = new Scan();
s.setStartRow(startRowKey); // this can also be passed to constructor
s.setStopRow(stopRowKey); // this can also be passed to constructor
s.setRowPrefixFilter(key.getBytes());
s.setFilter(new PageFilter(MaxResult));
s.setFilter(new KeyOnlyFilter());
But my questions are:
1: Is there something I'm not aware of that I should be doing to improve the speed?
2: Is there a better way to limit the results than PageFilter? I.e., how can I say "return max 25 rows"?
3: What is the difference between scan.setFilter(new PrefixFilter(rowKey)) and scan.setRowPrefixFilter(rowKey)?
4: The advantage of setting the startRow parameter for the scan is very clear, but is there any advantage (or disadvantage) to setting the stopRow parameter as well, particularly if you are already providing a PageFilter or another limiting measure?
Thanks for any feedback!
It seems like your filters are clobbering each other: the second setFilter() call replaces the first, so the KeyOnlyFilter will overwrite the PageFilter. You should wrap them in a MUST_PASS_ALL FilterList instead.
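A minimal sketch of that setup, assuming the HBase-compatible Cloud Bigtable client (the variable names mirror the question's snippet, and the page size of 25 is just an example):

```java
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.FilterList;
import org.apache.hadoop.hbase.filter.KeyOnlyFilter;
import org.apache.hadoop.hbase.filter.PageFilter;

Scan s = new Scan();
s.setStartRow(startRowKey);
s.setStopRow(stopRowKey);
// One FilterList holding both filters; a row must pass ALL of them.
FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL,
        new PageFilter(25),       // page limit, applied per region server
        new KeyOnlyFilter());     // strip cell values, return keys only
s.setFilter(filters);             // single setFilter() call -- nothing is clobbered
```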
PrefixFilter exists so that the prefix condition can be chained together with other filters in a FilterList; setRowPrefixFilter() instead translates the prefix into a (startRow, stopRow) range on the Scan itself, so the server can seek straight to the matching rows.
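To make that difference concrete: setRowPrefixFilter() does not install a filter at all; it derives an exclusive stop row from the prefix (conceptually, by incrementing the last byte that can grow). A self-contained sketch of that computation, as my own illustration rather than the library's actual code:

```java
import java.util.Arrays;

public class PrefixRange {
    // Given a row-key prefix, compute the exclusive stop row so that the
    // range [prefix, stopRow) covers exactly the rows starting with the prefix.
    static byte[] stopRowForPrefix(byte[] prefix) {
        byte[] stop = Arrays.copyOf(prefix, prefix.length);
        for (int i = stop.length - 1; i >= 0; i--) {
            if (stop[i] != (byte) 0xFF) {
                stop[i]++;                       // bump the last byte that can grow
                return Arrays.copyOf(stop, i + 1);
            }
        }
        return new byte[0];                      // all 0xFF: scan to end of table
    }

    public static void main(String[] args) {
        // '#' is 0x23, so incrementing it yields '$' (0x24).
        System.out.println(new String(stopRowForPrefix("user123#".getBytes())));
        // prints "user123$"
    }
}
```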
As for stopRow: I don't see a disadvantage to setting it, but at the same time, I don't think there is much gain either, since setRowPrefixFilter() already bounds the scan.
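On the "max 25 rows" question specifically: PageFilter is applied independently on each region server, so the client can still receive more rows than the page size. The reliable way is to cap on the client side while iterating (a sketch, assuming an HBase Table handle named table and the Scan s from the question; newer HBase versions also offer Scan#setLimit(int), though I'm not sure the Bigtable adapter honors it):

```java
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;

try (ResultScanner scanner = table.getScanner(s)) {
    int count = 0;
    for (Result r : scanner) {
        // ... process r ...
        if (++count >= 25) {
            break;  // hard client-side cap; try-with-resources closes the scanner
        }
    }
}
```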