How to find grid neighbours (x, y as integers), group them, and calculate the mean of their values in Spark

matkenis

I'm struggling to find a way to calculate the average value of neighbours from a data set that looks like this:

+------+------+---------+
|     X|     Y|  value  |
+------+------+---------+
|     1|     5|   1     |
|     1|     8|   1     |
|     1|     6|   6     |
|     2|     8|   5     |
|     2|     6|   3     |
+------+------+---------+

For example:

The neighbours of (1, 5) would be (1, 6) and (2, 6), so I need the mean of all their values, which here is (1 + 6 + 3) / 3 = 3.33

The neighbour of (1, 8) would be (2, 8), and the mean of their values would be (1 + 5) / 2 = 3

I'm hoping for a result that looks something like this (I just concatenate the coordinates as strings for the key):

+-------------------+------+
| neighbour_values  | mean |
+-------------------+------+
| (1,5)_(1,6)_(2,6) | 3.33 |
| (1,8)_(2,8)       | 3    |
+-------------------+------+

I've tried column concatenation but didn't get far. One solution I'm considering is to iterate through the table twice, once for each element and again for the other values, checking whether each one is a neighbour or not. Unfortunately, I'm fairly new to Spark and I can't seem to find any information on how to do it.

ANY help is VERY much appreciated! Thank you!:))

ELinda

The answer depends on whether you only want to group each item with its directly adjacent neighbors. That scenario can be ambiguous if, say, a contiguous block is more than two items wide or tall. The approach below therefore assumes that all items in a contiguous set of coordinates are bunched into a single group, and that each original record belongs to exactly one group.

Partitioning the coordinates into disjoint groups in this way lends itself to the union-find algorithm.

Since union-find is recursive, this approach collects the original coordinates into memory on the driver and creates a UDF based on those values. Note that this can be slow and/or require a lot of memory for large datasets.
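To make the ambiguity of grouping by direct neighbors only more concrete, here is a minimal sketch in plain Scala (no Spark needed). The names `DirectNeighbors`, `adjacent` and `directGroups` are made up for illustration, and it assumes that diagonal cells count as neighbors, as the example in the question suggests.

```scala
// Hypothetical helper names; plain Scala collections, no Spark required.
object DirectNeighbors {
  type K = (Int, Int)

  // Assumption: two distinct cells are neighbors when they touch
  // horizontally, vertically or diagonally.
  def adjacent(a: K, b: K): Boolean =
    a != b && math.abs(a._1 - b._1) <= 1 && math.abs(a._2 - b._2) <= 1

  // Each cell together with its direct neighbors only.
  def directGroups(cells: Seq[K]): Map[K, Set[K]] =
    cells.map(c => c -> (cells.filter(adjacent(c, _)).toSet + c)).toMap
}
```

For the chain (1,1), (1,2), (1,3), `directGroups` yields three different, overlapping groups: the middle cell sees both ends, while each end sees only the middle. That overlap is why the approach below puts the whole contiguous block into one group instead.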

// needed outside spark-shell; assumes a SparkSession named `spark`
import org.apache.spark.sql.functions._
import spark.implicits._

// create example DF
val df = Seq((1, 5, 1), (1, 8, 1), (1, 6, 6), (2, 8, 5), (2, 6, 3)).toDF("x", "y", "value")

// collect all coordinates into in-memory collections on the driver
val coordinates = df.select("x", "y").collect().map(r => (r.getInt(0), r.getInt(1)))
val coordSet = coordinates.toSet

type K = (Int, Int)
// link each cell to the smallest adjacent cell that sorts before it, if there is one
// (x - 1, y + 1) is included so that both diagonals count as adjacent
// only one parent is kept per cell, so groups that first meet at a later cell are not merged
val directParent: Map[K, Option[K]] = coordinates.map { case (x, y) =>
  val possibleParents = coordSet.intersect(Set((x - 1, y - 1), (x - 1, y), (x - 1, y + 1), (x, y - 1)))
  val parent = if (possibleParents.isEmpty) None else Some(possibleParents.min)
  ((x, y), parent)
}.toMap

// skip unionFind if only concerned with direct neighbors
// follows the parent links up to the root, which is the group's canonical element
def unionFind(key: K, map: Map[K, Option[K]]): K =
  map.get(key).flatten match {
    case Some(parent) => unionFind(parent, map)
    case None => key
  }

val canonicalUDF = udf((x: Int, y: Int) => unionFind((x, y), directParent))

// group using the canonical element
// create column "neighbors" based on x, y values in each group
val avgDF = df.groupBy(canonicalUDF($"x", $"y").alias("canonical")).agg(
  concat_ws("_", collect_list(concat(lit("("), $"x", lit(","), $"y", lit(")")))).alias("neighbors"),
  avg($"value")).drop("canonical")

Result:

avgDF.show(10, false)
+-----------------+------------------+
|neighbors        |avg(value)        |
+-----------------+------------------+
|(1,8)_(2,8)      |3.0               |
|(1,5)_(1,6)_(2,6)|3.3333333333333335|
+-----------------+------------------+
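
One caveat: the code above links each cell to a single parent, so two branches that only meet at a later cell can end up in separate groups. With the cells (1,1), (1,3), (2,2), (2,3), for instance, (2,2) is linked to (1,1) and (2,3) to (1,3), even though (2,2) and (2,3) are adjacent. A possible way around this is a union-find that merges the roots of every adjacent pair. The following is only a sketch in plain Scala under that assumption; `GridComponents`, `find` and `components` are hypothetical names, not part of Spark, and diagonal cells are again assumed to be neighbors.

```scala
// Hypothetical sketch: connected components on a grid via union-find.
object GridComponents {
  type K = (Int, Int)

  // Follow parent links until reaching a cell that is its own parent (the root).
  @scala.annotation.tailrec
  def find(parent: Map[K, K], k: K): K = {
    val p = parent(k)
    if (p == k) k else find(parent, p)
  }

  // Returns a map from every cell to the canonical (root) cell of its component.
  def components(cells: Set[K]): Map[K, K] = {
    // Initially every cell is its own root.
    val init: Map[K, K] = cells.map(c => c -> c).toMap
    // Checking four of the eight directions visits every adjacent pair once.
    val offsets = Seq((1, 0), (0, 1), (1, 1), (1, -1))
    val edges = for {
      (x, y) <- cells.toSeq
      (dx, dy) <- offsets
      n = (x + dx, y + dy)
      if cells.contains(n)
    } yield ((x, y), n)
    // Union step: merge the two roots of every adjacent pair.
    val merged = edges.foldLeft(init) { case (parent, (a, b)) =>
      val (ra, rb) = (find(parent, a), find(parent, b))
      if (ra == rb) parent
      else {
        // Keep the smaller root so the canonical cell is deterministic.
        val (lo, hi) = if (Ordering[K].lt(ra, rb)) (ra, rb) else (rb, ra)
        parent.updated(hi, lo)
      }
    }
    cells.map(c => c -> find(merged, c)).toMap
  }
}
```

Calling `GridComponents.components(coordSet)` once on the driver gives a lookup map that could replace `unionFind((x, y), directParent)` inside the UDF. The canonical cell of each group is then its smallest coordinate in tuple ordering.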
