Is there an RDD transform function that looks at neighboring elements?

Mr.UNOwen

Does anyone know of a way to look at neighboring elements of a sorted RDD during a transformation? I know I can collect the data and then perform an operation like the one in the example below, but that defeats the purpose of a distributed system, and I'd like to take advantage of the fact that it's distributed.

Example:

RDD of (string name, int val) map to RDD of (string name, int val, int diff)

such that:

name | val     becomes ->      name | val | diff (current - prior)
a    | 3                       a    | 3   | 3
b    | 6                       b    | 6   | 3
c    | 4                       c    | 4   | -2
d    | 20                      d    | 20  | 16
zero323

Probably the simplest approach is to convert the RDD to a DataFrame and use lag:

case class NameValue(name: String, value: Int)
val rdd = sc.parallelize(
    NameValue("a", 3) ::  NameValue("b", 6) :: 
    NameValue("c", 4) ::  NameValue("d", 20) :: Nil)

val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("df")
sqlContext.sql("""SELECT name, value,
                  value - lag(value) OVER (ORDER BY name, value) lag
                  FROM df""").show

Unfortunately, at the moment window functions without a PARTITION BY clause move all data to a single partition, so this is not particularly useful if you have a large dataset.

Using low-level operations, you could use zipWithIndex followed by flatMap and groupByKey:

case class NameValueWithLag(name: String, value: Int, lag: Int)
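// Index of the last element (not the element count)
val cnt = rdd.count() - 1

rdd.
    zipWithIndex.
    // Emit element i under keys i and i - 1, so key k collects elements k and k + 1
    flatMap{case (x, i) => (0 to 1).map(lag => (i - lag, (i, x)))}.
    groupByKey.
    // The key equal to the last index holds only the last element; drop it
    filter{ case (k, v) => k != cnt}.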
    values.
    map(vals => {
        val sorted = vals.toArray.sortBy(_._1).map(_._2)
        if (sorted.length == 1) {
            NameValueWithLag(sorted(0).name, sorted(0).value, sorted(0).value)
        } else {
            NameValueWithLag(
               sorted(1).name, sorted(1).value,
               sorted(1).value - sorted(0).value
            )
        }
    })
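
To see why this keying pairs each element with its predecessor, the same steps can be traced on a local Scala collection. This is only an illustrative sketch: it uses plain Vector operations (zipWithIndex, flatMap, groupBy) in place of the RDD ones, and the Rec, RecWithLag and LagTrace names are made up for the example.

```scala
object LagTrace {
  // Hypothetical stand-ins for NameValue / NameValueWithLag.
  case class Rec(name: String, value: Int)
  case class RecWithLag(name: String, value: Int, lag: Int)

  def withLag(data: Vector[Rec]): Vector[RecWithLag] = {
    val last = data.size - 1 // index of the last element, like rdd.count() - 1

    data.zipWithIndex
      // Element i is emitted under keys i and i - 1,
      // so key k collects elements k and k + 1.
      .flatMap { case (x, i) => (0 to 1).map(lag => (i - lag, (i, x))) }
      .groupBy(_._1)
      // Key `last` holds only the last element, already covered by key last - 1.
      .filter { case (k, _) => k != last }
      .toVector
      .sortBy(_._1)
      .map { case (_, pairs) =>
        val sorted = pairs.map(_._2).sortBy(_._1).map(_._2)
        if (sorted.length == 1)
          // Key -1 holds only the first element, which has no predecessor.
          RecWithLag(sorted(0).name, sorted(0).value, sorted(0).value)
        else
          RecWithLag(sorted(1).name, sorted(1).value,
                     sorted(1).value - sorted(0).value)
      }
  }

  val sample = Vector(Rec("a", 3), Rec("b", 6), Rec("c", 4), Rec("d", 20))
}
```

For the sample data, LagTrace.withLag(LagTrace.sample) produces the diffs 3, 3, -2 and 16, matching the table in the question.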

Edit:

If you don't mind using a developer API, you can try RDDFunctions.sliding, but it requires some manual processing:

import org.apache.spark.mllib.rdd.RDDFunctions._

val first = rdd.first match {
  case NameValue(name, value) => NameValueWithLag(name, value, value)
}

sc.parallelize(Seq(first)).union(rdd
  .sliding(2)
  .map(a => NameValueWithLag(a(1).name, a(1).value, a(1).value - a(0).value)))
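
The shape of the windows, and the reason the first record has to be added back by hand, can be seen with the standard-library sliding on a local collection. This is a rough sketch rather than the MLlib code path: Rec, RecWithLag and SlidingTrace are hypothetical names. One difference to keep in mind is that Scala collections return a single short window when the input is smaller than the window size, whereas the MLlib version returns an empty RDD in that case.

```scala
object SlidingTrace {
  // Hypothetical stand-ins for NameValue / NameValueWithLag.
  case class Rec(name: String, value: Int)
  case class RecWithLag(name: String, value: Int, lag: Int)

  def withLag(data: Vector[Rec]): Vector[RecWithLag] = {
    // No window ends at the first element, so it is handled separately.
    val first = data.headOption
      .map(r => RecWithLag(r.name, r.value, r.value))
      .toVector

    // Each full window is (previous, current). The pattern match skips the
    // short window that Scala collections yield for a single-element input.
    val rest = data.sliding(2).collect {
      case Seq(prev, cur) =>
        RecWithLag(cur.name, cur.value, cur.value - prev.value)
    }.toVector

    first ++ rest
  }

  val sample = Vector(Rec("a", 3), Rec("b", 6), Rec("c", 4), Rec("d", 20))
}
```

With four input records there are only three windows of size two, which is why the result is built as the first record followed by one record per window.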
