Specifying an external configuration file for Apache Spark

Alexander

I'd like to specify all of Spark's properties in a configuration file, and then load that configuration file at runtime.

~~~~~~~~~~Edit~~~~~~~~~~~

It turns out I was pretty confused about how to go about doing this, so ignore the rest of this question. For a simple way to load a .properties file into a Spark cluster (in Java Spark), see my answer below.

The original question is left below for reference purposes only.

~~~~~~~~~~~~~~~~~~~~~~~~

I want:

  • different configuration files depending on the environment (local, AWS)
  • a way to specify application-specific parameters

As a simple example, imagine I'd like to filter lines in a log file depending on a string. Below is a simple Java Spark program that reads data from a file and filters it based on a string the user defines. The program takes one argument: the input source file.

Java Spark Code

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

public class SimpleSpark {
    public static void main(String[] args) {
        String inputFile = args[0]; // Should be some file on your system

        SparkConf conf = new SparkConf();// .setAppName("Simple Application");
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> logData = sc.textFile(inputFile).cache();

        final String filterString = conf.get("simplespark.filterstr"); // key as defined in the config file

        long numberLines = logData.filter(new Function<String, Boolean>() {
            public Boolean call(String s) {
                return s.contains(filterString);
            }
        }).count();

        System.out.println("Line count: " + numberLines);
    }
}

Config File

The configuration file is based on https://spark.apache.org/docs/1.3.0/configuration.html and looks like this:

spark.app.name          test_app
spark.executor.memory   2g
spark.master            local
simplespark.filterstr   a

The Problem

I execute the application using the following arguments:

/path/to/inputtext.txt --conf /path/to/configfile.config

However, this doesn't work: the exception

Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration

is thrown, which tells me the configuration file is not being loaded.
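For context, the way a properties file like this is normally handed to Spark is through spark-submit's --properties-file flag rather than --conf (which takes a single key=value pair). A hedged sketch of what that invocation might look like, with placeholder jar and class names from my setup:

```shell
spark-submit \
  --class SimpleSpark \
  --properties-file /path/to/configfile.config \
  /path/to/simplespark.jar \
  /path/to/inputtext.txt
```

Note that spark-submit only forwards properties whose names start with spark., so a custom key such as simplespark.filterstr would be ignored with a warning even if the file itself were loaded.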

My questions are:

  1. What is wrong with my setup?
  2. Is specifying application-specific parameters in the Spark configuration file good practice?

Alexander

So after a bit of time, I realized I was pretty confused. The easiest way to get a configuration file into memory is to use a standard properties file, put it into HDFS, and load it from there. For the record, here is the code to do it (in Java Spark):

import java.io.InputStream;
import java.util.Properties;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

SparkConf sparkConf = new SparkConf();
JavaSparkContext ctx = new JavaSparkContext(sparkConf);

// Open the properties file from HDFS using the context's Hadoop configuration
Path pt = new Path("hdfs:///user/hadoop/myproperties.properties");
FileSystem fs = FileSystem.get(ctx.hadoopConfiguration());
InputStream inputStream = fs.open(pt);

Properties properties = new Properties();
properties.load(inputStream);
inputStream.close();
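Once loaded, individual values are read with Properties.getProperty(). A minimal, self-contained sketch of the same parsing, using a StringReader in place of the HDFS stream (the class name here is just for illustration; in the real job you would pass the fs.open(pt) stream to properties.load()):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

public class PropertiesExample {

    // Load a single value from properties-formatted text.
    // Runs of whitespace between key and value are handled by
    // java.util.Properties, matching the config file format above.
    static String readProperty(String content, String key) throws IOException {
        Properties properties = new Properties();
        properties.load(new StringReader(content));
        return properties.getProperty(key);
    }

    public static void main(String[] args) throws IOException {
        String config = "spark.master            local\n"
                      + "simplespark.filterstr   a\n";
        // Prints: a
        System.out.println(readProperty(config, "simplespark.filterstr"));
    }
}
```

This also sidesteps the spark.*-only restriction of spark-submit: any key name, including application-specific ones like simplespark.filterstr, is available.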
