Strategy for splitting a large JSON file

surreydude

I'm trying to split very large JSON files into smaller files for a given array. For example:

{
    "headerName1": "headerVal1",
    "headerName2": "headerVal2",
    "headerName3": [{
        "element1Name1": "element1Value1"
    },
    {
        "element2Name1": "element2Value1"
    },
    {
        "element3Name1": "element3Value1"
    },
    {
        "element4Name1": "element4Value1"
    },
    {
        "element5Name1": "element5Value1"
    },
    {
        "element6Name1": "element6Value1"
    }]
}

...down to { "elementNName1": "elementNValue1" } where N is a large number

The user provides the name of the array to be split (in this example "headerName3") and the number of array objects per file, e.g. 1,000,000.

This would result in multiple files (N / 1,000,000, rounded up), each containing the top-level name:value pairs (headerName1, headerName2) and up to 1,000,000 of the headerName3 objects.

I'm using the excellent Newtonsoft Json.NET and understand that I need to do this using a stream.

So far I have looked at reading in JToken objects to establish where PropertyName == "headerName3" occurs while reading the tokens, but what I would like to do is then read in the entire JSON object for each object in the array, without having to continue parsing the JSON into JTokens.

Here's a snippet of the code I am building so far:

        using (StreamReader oSR = File.OpenText(strInput))
        {
            using (var reader = new JsonTextReader(oSR))
            {
                while (reader.Read())
                {
                    if (reader.TokenType == JsonToken.StartObject)
                    {
                        intObjectCount++;
                    }
                    else if (reader.TokenType == JsonToken.EndObject)
                    {
                        intObjectCount--;

                        if (intObjectCount == 1)
                        {
                            intArrayRecordCount++;
                            // Here I want to read the entire object for this record into an untyped JSON object

                            if( intArrayRecordCount % 1000000 == 0)
                            {
                                //write these to the split file
                            }
                        }
                    }
                }
            }
        }

I don't know, and in fact am not concerned with, the structure of the JSON itself, and the objects within the array can have varying structures. I am therefore not deserializing to classes.

Is this the right approach? Is there a set of methods in the Json.NET library I can easily use to perform such an operation?

Any help appreciated.

dbc

You can use JsonWriter.WriteToken(JsonReader reader, true) to stream individual array entries and their descendants from a JsonReader to a JsonWriter. You can also use JProperty.Load(JsonReader reader) and JProperty.WriteTo(JsonWriter writer) to read and write entire properties and their descendants.
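As a minimal sketch of the first technique, the following copies the elements of a root-level JSON array from a reader to a writer one element at a time, without building an intermediate object model. The class and method names (WriteTokenDemo, CopyArrayItems) are illustrative only and not part of Json.NET; the root-level array is an assumption made to keep the example short.

```csharp
using System;
using System.IO;
using Newtonsoft.Json;

public static class WriteTokenDemo
{
    // Hypothetical helper: streams every element of a root-level JSON array
    // from a JsonTextReader to a JsonTextWriter and returns the result.
    public static string CopyArrayItems(string json)
    {
        var output = new StringWriter();
        using (var reader = new JsonTextReader(new StringReader(json)))
        using (var writer = new JsonTextWriter(output))
        {
            writer.WriteStartArray();
            while (reader.Read())
            {
                // Elements of the root array are reported at Depth == 1;
                // the root StartArray and EndArray tokens are at Depth == 0.
                if (reader.Depth == 1 && reader.TokenType != JsonToken.EndArray)
                {
                    // Copies the current token and all of its descendants,
                    // leaving the reader on the last token of the element.
                    writer.WriteToken(reader, true);
                }
            }
            writer.WriteEndArray();
        }
        return output.ToString();
    }
}
```

Because WriteToken consumes the whole element, the loop only ever sees the first token of each element at depth 1, which is what makes the depth check sufficient here.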

Using these methods, you can create a state machine that parses the JSON file, iterates through the root object, loads "prefix" and "postfix" properties, splits the array property, and writes the prefix, array slice, and postfix properties out to new file(s).

Here's a prototype implementation that takes a TextReader and a callback function to create sequential output TextWriter objects for the split files:

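    // Namespaces required by the code below (these go at the top of the file).
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;
    using Newtonsoft.Json;
    using Newtonsoft.Json.Linq;

    enum SplitState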
    {
        InPrefix,
        InSplitProperty,
        InSplitArray,
        InPostfix,
    }

    public static void SplitJson(TextReader textReader, string tokenName, long maxItems, Func<int, TextWriter> createStream, Formatting formatting)
    {
        List<JProperty> prefixProperties = new List<JProperty>();
        List<JProperty> postFixProperties = new List<JProperty>();
        List<JsonWriter> writers = new List<JsonWriter>();

        SplitState state = SplitState.InPrefix;
        long count = 0;

        try
        {
            using (var reader = new JsonTextReader(textReader))
            {
                bool doRead = true;
                while (doRead ? reader.Read() : true)
                {
                    doRead = true;
                    if (reader.TokenType == JsonToken.Comment || reader.TokenType == JsonToken.None)
                        continue;
                    if (reader.Depth == 0)
                    {
                        if (reader.TokenType != JsonToken.StartObject && reader.TokenType != JsonToken.EndObject)
                            throw new JsonException("JSON root container is not an Object");
                    }
                    else if (reader.Depth == 1 && reader.TokenType == JsonToken.PropertyName)
                    {
                        if ((string)reader.Value == tokenName)
                        {
                            state = SplitState.InSplitProperty;
                        }
                        else
                        {
                            if (state == SplitState.InSplitProperty)
                                state = SplitState.InPostfix;
                            var property = JProperty.Load(reader);
                            doRead = false; // JProperty.Load() will have already advanced the reader.
                            if (state == SplitState.InPrefix)
                            {
                                prefixProperties.Add(property);
                            }
                            else
                            {
                                postFixProperties.Add(property);
                            }
                        }
                    }
                    else if (reader.Depth == 1 && reader.TokenType == JsonToken.StartArray && state == SplitState.InSplitProperty)
                    {
                        state = SplitState.InSplitArray;
                    }
                    else if (reader.Depth == 1 && reader.TokenType == JsonToken.EndArray && state == SplitState.InSplitArray)
                    {
                        state = SplitState.InSplitProperty;
                    }
                    else if (state == SplitState.InSplitArray && reader.Depth == 2)
                    {
                        if (count % maxItems == 0)
                        {
                            var writer = new JsonTextWriter(createStream(writers.Count)) { Formatting = formatting };
                            writers.Add(writer);
                            writer.WriteStartObject();
                            foreach (var property in prefixProperties)
                                property.WriteTo(writer);
                            writer.WritePropertyName(tokenName);
                            writer.WriteStartArray();
                        }
                        count++;
                        writers.Last().WriteToken(reader, true);
                    }
                    else
                    {
                        throw new JsonException("Internal error");
                    }
                }
            }
            foreach (var writer in writers)
                using (writer)
                {
                    writer.WriteEndArray();
                    foreach (var property in postFixProperties)
                        property.WriteTo(writer);
                    writer.WriteEndObject();
                }
        }
        finally
        {
            // Make sure files are closed in the event of an exception.
            foreach (var writer in writers)
                using (writer)
                {
                }

        }
    }

This method leaves all the files open until the end in case "postfix" properties, appearing after the array property, need to be appended. Be aware that there is a limit on how many files can be open at one time (16384, though the exact figure may vary by platform), so if you need to create more split files than that, this won't work. If postfix properties are never encountered in practice, you can just close each file before opening the next and throw an exception in case any postfix properties are found. Otherwise you may need to parse the large file in two passes, or close and reopen the split files to append to them.
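The "close each file before opening the next" option could be sketched roughly as follows. This is an untested variant under stated assumptions, not a replacement for the prototype above: the names SingleWriterSplitter, SplitJsonNoPostfix and CloseSlice are made up for illustration, and the method assumes that no property ever follows the split array, throwing a JsonException if one does.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;

public static class SingleWriterSplitter
{
    // Hypothetical variant: only one output writer is open at any time.
    // Assumes nothing follows the split array; a postfix property raises an exception.
    // Returns the number of slices written.
    public static int SplitJsonNoPostfix(TextReader textReader, string tokenName, long maxItems,
        Func<int, TextWriter> createStream, Formatting formatting)
    {
        var prefixProperties = new List<JProperty>();
        JsonTextWriter writer = null;
        int fileCount = 0;
        long count = 0;
        bool arrayDone = false;
        try
        {
            using (var reader = new JsonTextReader(textReader))
            {
                if (!reader.Read() || reader.TokenType != JsonToken.StartObject)
                    throw new JsonException("JSON root container is not an Object");
                bool advance = true;
                while (true)
                {
                    if (advance && !reader.Read())
                        throw new JsonException("Unexpected end of JSON");
                    advance = true;
                    if (reader.TokenType == JsonToken.Comment)
                        continue;
                    if (reader.TokenType == JsonToken.EndObject && reader.Depth == 0)
                        break; // end of the root object
                    if (reader.TokenType != JsonToken.PropertyName)
                        throw new JsonException("Expected a property name");
                    if (arrayDone)
                        throw new JsonException("Property found after the split array: " + reader.Value);
                    if ((string)reader.Value != tokenName)
                    {
                        prefixProperties.Add(JProperty.Load(reader));
                        advance = false; // JProperty.Load() has already advanced the reader.
                        continue;
                    }
                    if (!reader.Read() || reader.TokenType != JsonToken.StartArray)
                        throw new JsonException("Split property is not an array");
                    while (reader.Read() && reader.TokenType != JsonToken.EndArray)
                    {
                        if (reader.TokenType == JsonToken.Comment)
                            continue;
                        if (count % maxItems == 0)
                        {
                            CloseSlice(writer); // finish the previous file before opening the next
                            writer = new JsonTextWriter(createStream(fileCount++)) { Formatting = formatting };
                            writer.WriteStartObject();
                            foreach (var property in prefixProperties)
                                property.WriteTo(writer);
                            writer.WritePropertyName(tokenName);
                            writer.WriteStartArray();
                        }
                        count++;
                        writer.WriteToken(reader, true);
                    }
                    arrayDone = true;
                }
            }
            CloseSlice(writer);
            writer = null;
            return fileCount;
        }
        finally
        {
            // Make sure the current file is closed in the event of an exception.
            if (writer != null)
                writer.Close();
        }
    }

    // Writes the closing tokens for one slice and closes its writer.
    private static void CloseSlice(JsonTextWriter writer)
    {
        if (writer == null)
            return;
        writer.WriteEndArray();
        writer.WriteEndObject();
        writer.Close();
    }
}
```

The trade-off is that slices already written stay on disk if a postfix property turns up late in the input, so the caller would need to clean those up or fall back to one of the other strategies.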

Here is an example of how to use the method with an in-memory JSON string:

    private static void TestSplitJson(string json, string tokenName)
    {
        var builders = new List<StringBuilder>();
        using (var reader = new StringReader(json))
        {
            SplitJson(reader, tokenName, 2, i => { builders.Add(new StringBuilder()); return new StringWriter(builders.Last()); }, Formatting.Indented);
        }
        foreach (var s in builders.Select(b => b.ToString()))
        {
            Console.WriteLine(s);
        }
    }
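For real files rather than in-memory strings, the createStream callback could open one numbered output file per slice. The sketch below shows one possible way to build such a callback; the class name, method names and the ".partNNNN" naming scheme are all assumptions made for illustration.

```csharp
using System;
using System.IO;

public static class SplitFileNaming
{
    // Hypothetical naming scheme: "big.json" -> "big.part0000.json", "big.part0001.json", ...
    public static string GetPartPath(string inputPath, int index)
    {
        string directory = Path.GetDirectoryName(inputPath) ?? "";
        string name = Path.GetFileNameWithoutExtension(inputPath);
        string extension = Path.GetExtension(inputPath);
        return Path.Combine(directory, name + ".part" + index.ToString("D4") + extension);
    }

    // Returns a callback with the shape SplitJson expects for its createStream parameter.
    // Each call creates (or overwrites) the output file for the given slice index.
    public static Func<int, TextWriter> CreatePartWriterFactory(string inputPath)
    {
        return index => File.CreateText(GetPartPath(inputPath, index));
    }
}
```

The input file would then be opened with File.OpenText and passed to SplitJson together with SplitFileNaming.CreatePartWriterFactory(inputPath) as the createStream argument.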

Prototype fiddle.
