matkenis

I'm struggling to find a way to calculate the average value of neighbours from a data set that looks like this:

```
+------+------+---------+
|     X|     Y|  value  |
+------+------+---------+
|     1|     5|    1    |
|     1|     8|    1    |
|     1|     6|    6    |
|     2|     8|    5    |
|     2|     6|    3    |
+------+------+---------+
```

**For example**:

The neighbours of (1, 5) would be (1, 6) and (2, 6), so I need the mean of all their values; the answer here would be (1 + 6 + 3) / 3 = 3.33.

The neighbour of (1, 8) would be (2, 8), and the mean of their values would be (1 + 5) / 2 = 3.

I'm hoping my solution will look something like this (I just concatenate the coordinates as strings here for the key):

```
+-------------------+------+
| neighbour_values  | mean |
+-------------------+------+
| (1,5)_(1,6)_(2,6) | 3.33 |
| (1,8)_(2,8)       | 3    |
+-------------------+------+
```

I've tried it with column concatenation but didn't get far. One solution I'm thinking of is to iterate through the table twice, once for each element and again for the other values, and check whether each one is a neighbour or not. Unfortunately, I'm fairly new to Spark and I can't seem to find any information on how to do it.

ANY help is VERY much appreciated! Thank you!:))

ELinda

The answer depends on whether you only want to group directly adjacent neighbors. That scenario can be ambiguous if, say, a contiguous block is more than two items wide or tall. The approach below therefore assumes that all items in a contiguous set of coordinates are bunched into a single group, and that each original record belongs to exactly one group.

Partitioning the coordinates into disjoint groups like this lends itself to the union-find algorithm.

Since union-find is recursive, this approach collects the original elements into memory and creates a UDF based on those values. Note that this can be slow and/or require a lot of memory for large datasets.
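As a rough illustration of the grouping step on its own, here is a minimal plain-Scala sketch with no Spark involved. It is not the code from this answer: it assumes that all eight surrounding cells count as neighbours (which matches the question's example, where (2, 6) is a neighbour of (1, 5)) and it merges groups explicitly, so a cell that touches two existing groups joins them into one. The names `NeighbourGroups`, `components`, `find` and `union` are made up for illustration.

```scala
object NeighbourGroups {
  type Coord = (Int, Int)

  // Maps every coordinate to the root (smallest coordinate) of its connected group.
  // Assumption: cells are neighbours if they differ by at most 1 in both x and y.
  def components(coords: Seq[Coord]): Map[Coord, Coord] = {
    val parent = scala.collection.mutable.Map[Coord, Coord]()
    coords.foreach(c => parent(c) = c) // every cell starts as its own root

    // Find the root of c, then point every cell on the path directly at it.
    def find(c: Coord): Coord = {
      var root = c
      while (parent(root) != root) root = parent(root)
      var cur = c
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // Merge two groups, keeping the smaller coordinate as the root
    // so the result does not depend on input order.
    def union(a: Coord, b: Coord): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) {
        if (Ordering[Coord].lt(ra, rb)) parent(rb) = ra else parent(ra) = rb
      }
    }

    // Union each cell with every neighbouring cell that is present in the data.
    for {
      (x, y) <- coords
      dx <- -1 to 1
      dy <- -1 to 1
      if dx != 0 || dy != 0
      n = (x + dx, y + dy)
      if parent.contains(n)
    } union((x, y), n)

    coords.map(c => c -> find(c)).toMap
  }
}
```

Calling `NeighbourGroups.components(Seq((1, 5), (1, 8), (1, 6), (2, 8), (2, 6)))` yields one root per group, which could then play the role of the grouping key.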

```
import org.apache.spark.sql.functions._
import spark.implicits._ // assumes a SparkSession named `spark`, as in spark-shell

// create example DF
val df = Seq((1, 5, 1), (1, 8, 1), (1, 6, 6), (2, 8, 5), (2, 6, 3)).toDF("x", "y", "value")

// collect all coordinates into in-memory collections
val coordinates = df.select("x", "y").collect().map(r => (r.getInt(0), r.getInt(1)))
val coordSet = coordinates.toSet

type K = (Int, Int)

// candidate parents are the cells at (x-1, y-1), (x, y-1) and (x-1, y);
// the smallest one present in the data becomes the direct parent
val directParent: Map[K, Option[K]] = coordinates.map { case (x: Int, y: Int) =>
  val possibleParents = coordSet.intersect(Set((x - 1, y - 1), (x, y - 1), (x - 1, y)))
  val parent = if (possibleParents.isEmpty) None else Some(possibleParents.min)
  ((x, y), parent)
}.toMap

// follow parent links until a coordinate without a parent is reached
// skip unionFind if only concerned with direct neighbors
def unionFind(key: K, map: Map[K, Option[K]]): K =
  map.get(key).flatten match {
    case Some(parent) => unionFind(parent, map)
    case None         => key
  }

val canonicalUDF = udf((x: Int, y: Int) => unionFind((x, y), directParent))

// group using the canonical element
// create column "neighbors" based on x, y values in each group
val avgDF = df
  .groupBy(canonicalUDF($"x", $"y").alias("canonical"))
  .agg(
    concat_ws("_", collect_list(concat(lit("("), $"x", lit(","), $"y", lit(")")))).alias("neighbors"),
    avg($"value"))
  .drop("canonical")
```

Result:

```
avgDF.show(10, false)
+-----------------+------------------+
|neighbors |avg(value) |
+-----------------+------------------+
|(1,8)_(2,8) |3.0 |
|(1,5)_(1,6)_(2,6)|3.3333333333333335|
+-----------------+------------------+
```
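If only direct neighbors matter, one possible reading is a per-cell mean: for each cell, average its own value together with the values of the cells within one step of it. The plain-Scala sketch below illustrates that reading on the question's data without Spark. It assumes diagonals count as neighbours, and the names `DirectNeighbourMean` and `neighbourMeans` are hypothetical.

```scala
object DirectNeighbourMean {
  // (x, y, value) rows, mirroring the example table in the question
  val rows: Seq[(Int, Int, Int)] =
    Seq((1, 5, 1), (1, 8, 1), (1, 6, 6), (2, 8, 5), (2, 6, 3))

  // For each cell, the mean over the cell itself and every cell that differs
  // by at most 1 in both x and y (assumption: diagonals are neighbours).
  def neighbourMeans(rows: Seq[(Int, Int, Int)]): Map[(Int, Int), Double] =
    rows.map { case (x, y, _) =>
      val near = rows.collect {
        case (nx, ny, v) if math.abs(nx - x) <= 1 && math.abs(ny - y) <= 1 => v
      }
      // `near` always contains the cell's own value, so it is never empty
      (x, y) -> near.sum.toDouble / near.size
    }.toMap
}
```

This compares every row with every other row, which is the "iterate through the table twice" idea from the question, so it only suits small in-memory data. Unlike the grouped result above, each cell gets its own mean, and cells in a larger block can end up with different values.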
