I got a RDD[String]
from a file:
val file = sc.textFile("/path/to/myData.txt")
myData's format:
>str1_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
...
FJDLALLLGL //the last line of str1
>str2_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
...
FJDLALLLGL //the last line of str2
>str3_name
...
How should I do to transform the data from file to a structure RDD[(String, String)]
? For instance,
trancRDD(
(str1_name, ATCGGKFKKVKKFKRLFFVLFLRLFDJKALGFJVKRIKFKVKFGKLRL),
(str2_name, ATCGGKFKKVKKFKRLFFVLFLRLFDJKALGFJVKRIKFKVKFGKLRL),
...
)
If there's a defined record separator, like ">" indicated above, this could be done using a custom Hadoop configuration:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
val conf = new Configuration
conf.set("textinputformat.record.delimiter", ">")
// genome.txt contains the records provided in the question without the "..."
val dataset = sc.newAPIHadoopFile("./data/genome.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text], conf)
val data = dataset.map(x=>x._2.toString)
Let's have a look at the data
data.collect
res11: Array[String] =
Array("", "str1_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
FJDLALLLGL
", "str2_name
ATCGGKFKKVKKFKRLFFVLFLRL
FDJKALGFJVKRIKFKVKFGKLRL
FJDLALLLGL
")
We can easily make records out of this String
val records = data.map{ multiLine => val lines = multiLine.split("\n"); (lines.head, lines.tail)}
records.collect
res14: Array[(String, Array[String])] = Array(("",Array()),
(str1_name,Array(ATCGGKFKKVKKFKRLFFVLFLRL, FDJKALGFJVKRIKFKVKFGKLRL, FJDLALLLGL)),
(str2_name,Array(ATCGGKFKKVKKFKRLFFVLFLRL, FDJKALGFJVKRIKFKVKFGKLRL, FJDLALLLGL)))
(use filter to take that first empty record out... exercise for the reader)
Collected from the Internet
Please contact [email protected] to delete if infringement.
Comments