I want to efficiently calculate two means of values of a HashMap
each time a new key/value pair is inserted.
Suppose we currently have this HashMap<Double, Double>:
3 4
5 6
8 8
1 3
6 8 <- Latest insertion
The latest insertion was the key 6 with value 8.
The first mean consists of all values whose keys are smaller than the inserted key 6.
These are the values 4, 6, 3 of the keys 3, 5, 1, so the mean is (4+6+3)/3 = 4.3...
The second mean is the "opposite": the mean of all values for keys greater than 6.
The only such key is 8, with value 8, so this mean is 8/1 = 8.
Now, a new key/pair gets inserted:
3 4
5 6
6 8
8 8
1 3
4 9 <- Latest insertion
So again, we need to calculate the mean for all values with keys smaller than 4.
These are the values 4, 3 for the keys 3, 1, so the "smaller mean" is now (4+3)/2 = 3.5.
The "greater mean" is now (6+8+8)/3 = 7.3... for the key/value pairs 5/6, 6/8, 8/8.
A naive implementation might be something like this:
import java.util.HashMap;
import java.util.Map;

public class CalculateMapMean {
    private double smallerMean = 0.0;
    private double greaterMean = 0.0;
    private HashMap<Double, Double> someMap = new HashMap<Double, Double>();

    // O(n): every insertion triggers a full scan of the map.
    public void calculateMeans(double latestInsertedKey) {
        double sumGreater = 0;
        double sumSmaller = 0;
        int greaterCount = 0;
        int smallerCount = 0;
        for (Map.Entry<Double, Double> entry : someMap.entrySet()) {
            double key = entry.getKey();
            double value = entry.getValue();
            if (key > latestInsertedKey) {
                sumGreater += value;
                ++greaterCount;
            } else if (key < latestInsertedKey) {
                sumSmaller += value;
                ++smallerCount;
            }
        }
        greaterMean = greaterCount != 0 ? sumGreater / greaterCount : 0.0;
        smallerMean = smallerCount != 0 ? sumSmaller / smallerCount : 0.0;
    }
}
The question is whether the calculation of the means can be dramatically improved with a TreeMap or another data structure, so that one does not have to iterate over all keys on every insertion. Is there an elegant way of reusing former calculations?
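For reference, here is what TreeMap alone buys you: a hypothetical helper (not part of the original code) that uses headMap/tailMap to select the two key ranges. It is more compact, but summing the values still iterates over the range, so it is no faster asymptotically than the HashMap loop above.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical helper: still O(n) per query, because mean() walks the submap.
class TreeMapMeans {
    private final TreeMap<Double, Double> map = new TreeMap<>();

    public void put(double key, double value) {
        map.put(key, value);
    }

    // Mean of all values whose key is strictly smaller than 'key'; 0.0 if none.
    public double smallerMean(double key) {
        return mean(map.headMap(key, false));
    }

    // Mean of all values whose key is strictly greater than 'key'; 0.0 if none.
    public double greaterMean(double key) {
        return mean(map.tailMap(key, false));
    }

    private static double mean(NavigableMap<Double, Double> part) {
        if (part.isEmpty()) return 0.0;
        double sum = 0.0;
        for (double v : part.values()) sum += v;
        return sum / part.size();
    }
}
```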
The only way I can think of to get below O(n) time for every change to the map is by keeping a balanced binary search tree (BBST) with the keys. In every node you keep a few extra fields: the number of key/value pairs and the sum of the values in the subtree rooted at that node.
Rebalancing a BBST after an insert/delete takes O(log n) time, and during that same rebalancing you can update the cached counts and sums, also in O(log n) time (since you perform O(log n) operations that each take O(1) time).
To get the correct means you need to traverse the tree and add the right counts. Let's give a simple example. Suppose I have the following 7 key-value pairs. I hope you can imagine how the corresponding BBST would look.
(3, 5) (4, 3) (7, 1) (8, 4) (11, 3) (12, 1) (13, 3)
In the root, (8, 4), the total count and sum are stored: [7, 20]. In the root of the left subtree, (4, 3), the total count and sum of that subtree are stored: [3, 9]. I draw these extra values now as a function of depth in the tree:
           [ 7, 20 ]
     [ 3, 9 ]    [ 3, 7 ]
[1, 5] [1, 1]  [1, 3] [1, 3]
Suppose I now add a new tuple with key 10. I start traversing the tree at the root. Because 8 < 10, I don't need to traverse the left subtree: all keys in that subtree are smaller than 10, so we can use its cached values [3, 9]. For the right subtree we need to recurse, because some of its keys might be smaller than 10 and some might be larger. At its root, (12, 1), because 12 > 10, we don't have to traverse its right subtree either and can use that subtree's cached [1, 3] directly.
In every layer of the tree, we can ignore one branch and recurse on the other. Therefore, finding the total value and count for keys smaller than the last inserted key and for keys larger than the last inserted key takes O(log n)
time as well.
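A minimal sketch of this idea in Java (assumption: a plain unbalanced BST for brevity; a real implementation would add AVL or red-black rebalancing and update the cached fields during rotations, which is what keeps all operations O(log n)):

```java
// Sketch: BST where each node caches the count and sum of its subtree.
// Assumption: no rebalancing, so worst-case depth is O(n); with an AVL or
// red-black variant the same queries run in O(log n).
class AugmentedBst {
    private static class Node {
        final double key;
        double value;
        Node left, right;
        int count = 1;   // number of key/value pairs in this subtree
        double sum;      // sum of values in this subtree
        Node(double key, double value) { this.key = key; this.value = value; this.sum = value; }
    }

    private Node root;

    public void insert(double key, double value) {
        root = insert(root, key, value);
    }

    private Node insert(Node node, double key, double value) {
        if (node == null) return new Node(key, value);
        if (key < node.key) node.left = insert(node.left, key, value);
        else if (key > node.key) node.right = insert(node.right, key, value);
        else node.value = value; // replace, like HashMap.put
        node.count = 1 + count(node.left) + count(node.right);
        node.sum = node.value + sum(node.left) + sum(node.right);
        return node;
    }

    private static int count(Node n) { return n == null ? 0 : n.count; }
    private static double sum(Node n) { return n == null ? 0.0 : n.sum; }

    // Mean of all values whose key is strictly smaller than 'key'; 0.0 if none.
    // One root-to-leaf walk: at each node, either the node and its whole left
    // subtree qualify (use the cache, go right) or only the left side can.
    public double smallerMean(double key) {
        int cnt = 0;
        double s = 0.0;
        for (Node n = root; n != null; ) {
            if (key > n.key) {
                cnt += 1 + count(n.left);
                s += n.value + sum(n.left);
                n = n.right;
            } else {
                n = n.left;
            }
        }
        return cnt == 0 ? 0.0 : s / cnt;
    }

    // Mean of all values whose key is strictly greater than 'key'; 0.0 if none.
    public double greaterMean(double key) {
        int cnt = 0;
        double s = 0.0;
        for (Node n = root; n != null; ) {
            if (key < n.key) {
                cnt += 1 + count(n.right);
                s += n.value + sum(n.right);
                n = n.left;
            } else {
                n = n.right;
            }
        }
        return cnt == 0 ? 0.0 : s / cnt;
    }
}
```

On the question's example, inserting (3, 4), (5, 6), (8, 8), (1, 3), (6, 8) and querying around key 6 reproduces the two means from the question: (4+6+3)/3 and 8/1.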