I'm just getting started using Apache Spark (in Scala, but the language is irrelevant). I'm using standalone mode and I'll want to process a text file from a local file system (so nothing distributed like HDFS).
According to the documentation of the textFile method from SparkContext, it will:
Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
What is unclear to me is whether the whole text file can just be copied to all the nodes, or whether the input data should already be partitioned, e.g. if using 4 nodes and a CSV file with 1000 lines, have 250 lines on each node.
I suspect each node should have the whole file but I'm not sure.
Each node should contain the whole file. In that case, the local file system is logically indistinguishable from HDFS with respect to this file: Spark addresses the file by path, assumes every worker can open that path, and handles partitioning of the contents itself.
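As a minimal sketch of what that looks like in practice: the `file://` scheme tells Spark to read from the local file system rather than HDFS. The path `/data/input.csv` and the app name are hypothetical; with `local[*]` as master everything runs on one machine, while on a multi-node standalone cluster the same file would have to exist at that path on every worker.

```scala
// Sketch: reading a local text file in standalone mode.
// The path "/data/input.csv" is a hypothetical example.
import org.apache.spark.{SparkConf, SparkContext}

object LocalFileExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("LocalFileExample")
      .setMaster("local[*]") // single-machine mode; no HDFS required

    val sc = new SparkContext(conf)

    // "file://" forces a local-filesystem read. On a multi-node
    // standalone cluster, this path must be readable on every worker.
    val lines = sc.textFile("file:///data/input.csv")

    // Spark splits the file into partitions itself; you never
    // pre-partition the input by hand.
    println(s"Partitions: ${lines.getNumPartitions}")
    println(s"Lines: ${lines.count()}")

    sc.stop()
  }
}
```

Note that `textFile` also takes an optional `minPartitions` argument if you want more splits than the default.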