What is a sequence file in Hadoop?

Soghra Gargari

I am new to MapReduce and I want to understand what SequenceFile input is. I read about it in the Hadoop book, but it was hard for me to understand.

JiaMing Lin

First we should understand what problems SequenceFile tries to solve, and then how it helps to solve them.

In HDFS

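  • SequenceFile is one of the solutions to the small file problem in Hadoop.
  • A small file is one that is significantly smaller than the HDFS block size (128MB).
  • Each file, directory, and block in HDFS is represented as an object in the NameNode's memory, and each object occupies about 150 bytes.
  • 10 million files, each using one block, would therefore use about 3 gigabytes of NameNode memory (20 million objects × 150 bytes).
  • Scaling up much beyond this level, for example to a billion files, is not feasible.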
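The estimate above can be reproduced with a few lines of arithmetic. This is only a rough sketch: it assumes about 150 bytes per NameNode object and that every small file occupies exactly one block (one file object plus one block object). The function name and constant are illustrative, and directories are ignored.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb figure quoted above, not an exact measurement


def namenode_bytes(num_files: int, blocks_per_file: int = 1) -> int:
    """Rough NameNode heap estimate: one object per file plus one per block."""
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    for n in (10_000_000, 1_000_000_000):
        print(f"{n:>13,} files -> about {namenode_bytes(n) / 1e9:.0f} GB")
```

With these assumptions, 10 million files come to about 3 GB and a billion files to about 300 GB, which is why the second case is considered impractical.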

In MapReduce

  • Map tasks usually process a block of input at a time (using the default FileInputFormat).

  • The more files there are, the more map tasks are needed, and the job can be much slower.
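To see why many small files mean many map tasks, the split count can be approximated as follows. This is a simplified sketch, not the actual FileInputFormat logic: it assumes one split per block, at least one split per non-empty file, and a 128MB block size. The function name is made up for illustration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # HDFS block size assumed in this answer (128MB)


def estimated_map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Approximate split count: every non-empty file yields at least one split,
    and larger files yield roughly one split per block."""
    return sum(math.ceil(size / block_size) for size in file_sizes if size > 0)


small_files = [100 * 1024] * 10_000   # 10,000 files of 100KB each
one_big_file = [sum(small_files)]     # the same bytes stored as a single file

print(estimated_map_tasks(small_files))   # 10000
print(estimated_map_tasks(one_big_file))  # 8
```

The same data needs roughly 10,000 map tasks as separate small files but only about 8 when stored as one large file.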

Small file scenarios

  • The files are pieces of a larger logical file.
  • The files are inherently small, for example, images.

These two cases require different solutions.

  • For the first case, write a program to concatenate the small files together (see Nathan Marz's post about a tool called the Consolidator, which does exactly this).
  • For the second case, some kind of container is needed to group the files in some way.

Solutions in Hadoop

HAR files

  • HAR (Hadoop Archives) files were introduced to alleviate the problem of lots of files putting pressure on the NameNode's memory.
  • HARs are probably best used purely for archival purposes.

SequenceFile

  • The idea of SequenceFile is to pack many small files into a single larger file.
  • For example, suppose there are 10,000 files of 100KB each. We can write a program to put them into a single SequenceFile, laid out as in the figure below, using the file name as the key and the file content as the value.

    [Figure: SequenceFile file layout (source: csdn.net)]

  • Some benefits:

    1. Less memory is needed on the NameNode. Continuing with the example of 10,000 files of 100KB each:
      • Before using SequenceFile, the 10,000 objects occupy about 4.5MB of RAM in the NameNode.
      • After using SequenceFile, a single 1GB SequenceFile spans 8 HDFS blocks, and these objects occupy about 3.6KB of RAM in the NameNode.
    2. SequenceFile is splittable, so it is suitable for MapReduce.
    3. SequenceFile supports compression.
  • Supported compression types (the file structure depends on the compression type):

    1. Uncompressed
    2. Record-compressed: compresses each record as it is added to the file.
      [Figure: record-compressed SequenceFile layout (source: csdn.net)]

    3. Block-compressed
      [Figure: block-compressed SequenceFile layout (source: csdn.net)]

      • Waits until the data reaches the block size before compressing.
      • Block compression provides a better compression ratio than record compression.
      • Block compression is generally the preferred option when using SequenceFile.
      • "Block" here is unrelated to the HDFS or filesystem block.
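To illustrate the difference between record and block compression, here is a toy sketch in Python. It is not the real SequenceFile on-disk format and does not use the Hadoop API: the length-prefixed encoding, the 64KB buffer size, the use of zlib, and the sample "files" are all assumptions made for the example. It packs file names as keys and file contents as values, then compares the total size under each scheme.

```python
import struct
import zlib


def encode(key: bytes, value: bytes) -> bytes:
    """Length-prefix the key and value so records can be read back in order."""
    return struct.pack(">I", len(key)) + key + struct.pack(">I", len(value)) + value


def record_compressed_size(records) -> int:
    """Record compression: each value is compressed on its own; keys stay raw."""
    return sum(len(encode(k, zlib.compress(v))) for k, v in records)


def block_compressed_size(records, block_bytes: int = 64 * 1024) -> int:
    """Block compression: buffer records up to block_bytes, then compress the batch."""
    total, buffer = 0, b""
    for k, v in records:
        buffer += encode(k, v)
        if len(buffer) >= block_bytes:
            total += len(zlib.compress(buffer))
            buffer = b""
    if buffer:  # flush the final, partially filled block
        total += len(zlib.compress(buffer))
    return total


# Hypothetical small files: file name as key, file content as value.
records = [
    (f"log-{i:04d}.txt".encode(),
     f"status=OK host=node{i % 10} code=200\n".encode() * 20)
    for i in range(1000)
]

raw = sum(len(encode(k, v)) for k, v in records)
print("uncompressed     :", raw)
print("record-compressed:", record_compressed_size(records))
print("block-compressed :", block_compressed_size(records))
```

On repetitive data like this, the block-compressed total should come out smaller, because the compressor can exploit redundancy across many records (including the keys) rather than within a single value.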
