How to find grid neighbours (x, y as integers), group them, and calculate the mean of their values in Spark

matkenis

I'm struggling to find a way to calculate the average value of neighbours from a data set that looks like this:

+------+------+---------+
|     X|     Y|  value  |
+------+------+---------+
|     1|     5|   1     |
|     1|     8|   1     |
|     1|     6|   6     |
|     2|     8|   5     |
|     2|     6|   3     |
+------+------+---------+

For example:

The neighbours of (1, 5) are (1, 6) and (2, 6), so I need the mean of all three values, and the answer here would be (1 + 6 + 3) / 3 = 3.33

The only neighbour of (1, 8) is (2, 8), and the mean of their values would be (1 + 5) / 2 = 3

I'm hoping for a result that looks something like this (I just concatenate the coordinates as strings for the key):

+-------------------+------+
| neighbour_values  | mean |
+-------------------+------+
| (1,5)_(1,6)_(2,6) | 3.33 |
| (1,8)_(2,8)       | 3    |
+-------------------+------+

I've tried column concatenation but didn't get far. One solution I'm thinking of is to iterate through the table twice, once for the element and again for the other values, and check whether each one is a neighbour or not. Unfortunately, I'm fairly new to Spark and can't seem to find any information on how to do it.

ANY help is VERY much appreciated! Thank you!:))

ELinda

The answer depends on whether you only want to group directly adjacent neighbours. That scenario can be ambiguous if, say, there is a contiguous block more than two items wide or high. The approach below therefore assumes that all items in a contiguous set of coordinates are bunched into a single group, and that each original record belongs to exactly one group.
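To make the ambiguity concrete, here is a minimal plain-Scala sketch (no Spark involved). The names `NeighbourAmbiguity`, `isNeighbour` and `directGroups` are made up for illustration, and it assumes that diagonal cells count as neighbours, as in the question's (1, 5) / (2, 6) example.

```scala
object NeighbourAmbiguity {
  type Cell = (Int, Int)

  // Assumption: two distinct cells are neighbours when they differ by at most 1
  // in both x and y, so diagonals count.
  def isNeighbour(a: Cell, b: Cell): Boolean =
    a != b && math.abs(a._1 - b._1) <= 1 && math.abs(a._2 - b._2) <= 1

  // For every cell: the cell itself plus its direct neighbours.
  def directGroups(cells: Seq[Cell]): Map[Cell, Set[Cell]] =
    cells.map(c => c -> (cells.filter(isNeighbour(c, _)).toSet + c)).toMap

  def main(args: Array[String]): Unit = {
    // Three cells in a row: a block that is more than two items long.
    val row = Seq((1, 1), (1, 2), (1, 3))
    directGroups(row).toSeq.sortBy(_._1).foreach { case (cell, group) =>
      println(s"$cell -> ${group.toSeq.sorted.mkString(" ")}")
    }
  }
}
```

With direct neighbours only, (1, 2) sits in three overlapping groups, so there is no single group it belongs to. Treating the whole contiguous block as one group avoids that.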

Partitioning the coordinates into disjoint groups in this way lends itself to the union-find algorithm.

Since finding the canonical element of a group means following parent links recursively, this approach collects the original coordinates into driver memory and creates a UDF based on those values. Note that this can be slow and/or require a lot of memory for large datasets.

// assumes a SparkSession named `spark` is in scope (as in spark-shell)
import org.apache.spark.sql.functions._
import spark.implicits._

// create example DF
val df = Seq((1, 5, 1), (1, 8, 1), (1, 6, 6), (2, 8, 5), (2, 6, 3)).toDF("x", "y", "value")

// collect all coordinates into in-memory collections on the driver
val coordinates = df.select("x", "y").collect().map(r => (r.getInt(0), r.getInt(1)))
val coordSet = coordinates.toSet

type K = (Int, Int)
// For each cell, pick one parent among the existing neighbours that sort before it:
// the three cells with x - 1 (both diagonals included) and the cell (x, y - 1).
// Note: each cell keeps a single parent, so branches that only meet at a
// later cell may still end up in separate groups.
val directParent: Map[K, Option[K]] = coordinates.map { case (x, y) =>
  val possibleParents = coordSet.intersect(
    Set((x - 1, y - 1), (x - 1, y), (x - 1, y + 1), (x, y - 1)))
  val parent = if (possibleParents.isEmpty) None else Some(possibleParents.min)
  ((x, y), parent)
}.toMap

// skip unionFind if only concerned with direct neighbors
// follows parent links until it reaches a cell without a parent (the canonical element)
def unionFind(key: K, map: Map[K, Option[K]]): K = map.get(key).flatten match {
  case Some(parent) => unionFind(parent, map)
  case None => key
}

val canonicalUDF = udf((x: Int, y: Int) => unionFind((x, y), directParent))

// group using the canonical element
// create column "neighbors" based on x, y values in each group
val avgDF = df.groupBy(canonicalUDF($"x", $"y").alias("canonical")).agg(
  concat_ws("_", collect_list(concat(lit("("), $"x", lit(","), $"y", lit(")")))).alias("neighbors"),
  avg($"value")).drop("canonical")

Result:

avgDF.show(10, false)
+-----------------+------------------+
|neighbors        |avg(value)        |
+-----------------+------------------+
|(1,8)_(2,8)      |3.0               |
|(1,5)_(1,6)_(2,6)|3.3333333333333335|
+-----------------+------------------+
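One caveat: the parent map above keeps only one parent per cell, so two branches that meet only at a later cell, for example (1,1), (1,3) and (2,2), can end up with different canonical elements. Below is a minimal plain-Scala sketch of a full union-find over the collected coordinates that merges such branches. The names `GridComponents` and `canonicalCells` are hypothetical, and it assumes 8-connectivity (diagonals count as neighbours).

```scala
import scala.collection.mutable

object GridComponents {
  type Cell = (Int, Int)

  // Maps every cell to the smallest cell of its 8-connected component.
  def canonicalCells(cells: Set[Cell]): Map[Cell, Cell] = {
    // parent(c) == c means c is currently the root of its component
    val parent = mutable.Map.empty[Cell, Cell] ++= cells.map(c => c -> c)

    // find with path compression, written iteratively to avoid deep recursion
    def find(c: Cell): Cell = {
      var root = c
      while (parent(root) != root) root = parent(root)
      var cur = c
      while (parent(cur) != root) {
        val next = parent(cur)
        parent(cur) = root
        cur = next
      }
      root
    }

    // union: hang the larger root under the smaller one, so each root
    // stays the lexicographically smallest cell of its component
    def union(a: Cell, b: Cell): Unit = {
      val ra = find(a)
      val rb = find(b)
      if (ra != rb) {
        if (Ordering[Cell].lt(ra, rb)) parent(rb) = ra else parent(ra) = rb
      }
    }

    // merge every cell with each of its existing neighbours
    for ((x, y) <- cells; dx <- -1 to 1; dy <- -1 to 1) {
      val n = (x + dx, y + dy)
      if (n != ((x, y)) && cells.contains(n)) union((x, y), n)
    }

    cells.map(c => c -> find(c)).toMap
  }

  def main(args: Array[String]): Unit = {
    // (1,1) and (1,3) are connected only through (2,2)
    println(canonicalCells(Set((1, 1), (1, 3), (2, 2))))
  }
}
```

If this fits your data, one way to use it would be to build `val canonical = GridComponents.canonicalCells(coordSet)` on the driver and define the UDF as `udf((x: Int, y: Int) => canonical((x, y)))`. The same memory caveat applies, since all coordinates are still collected.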

