Parsing text to object with regex

Johan Published at Dev

Johan

I'm using an API which returns text in the following format:

#start
#p 12345 foo
#p 12346 bar
#end
#start
#p 12345 foo2
#p 12346 bar2
#end

My parsing function:

function parseApiResponse(data) {

    var results = [], match, obj;
    // Keep the regex in a variable: with the g flag, exec() resumes from re.lastIndex.
    var re = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;

    while ((match = re.exec(data)) !== null) {

        if (match[1]) {           // #start
            obj = {};

        } else if (match[2]) {    // #end
            results.push(obj);
            obj = null;           // prevent accidental reuse 
                                  // if input is malformed

        } else {                  // #p something something
            obj[match[3]] = match[4];
        }
    }

    return results;
}

This will give me a list of objects which looks something like this:

[{ '12345': 'foo', '12346': 'bar'}, /* etc... */]

However, if a #p line has no value, like this:

#start
#p 12345
#p 12346 bar
#end

The first #p line is actually #p 12345\n. Because \s+ also matches the line break, the pattern runs on into the next row, and my match[4] ends up containing that row's #p.
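To make the failure concrete, here is a small sketch that runs the pattern from the question against the malformed input. The names brokenPattern, malformedInput, and seen are made up for illustration; only the pattern comes from the parsing function above.

```javascript
// Pattern from the question: both tokens after #p are required.
const brokenPattern = /(#start)|(#end)|#p\s+(\S+)\s+(\S+)/ig;

// Hypothetical sample input: the first #p line has no value.
const malformedInput = '#start\n#p 12345\n#p 12346 bar\n#end';

const seen = [];
let m;
while ((m = brokenPattern.exec(malformedInput)) !== null) {
    // m[0] is the full matched text; JSON.stringify makes an embedded \n visible.
    seen.push(m[0]);
    console.log(JSON.stringify(m[0]), '->', m[3], m[4]);
}
```

The second match spans two lines ("#p 12345\n#p"), so match[4] is "#p", and the 12346 row is never matched as an entry of its own.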

How do I adjust the pattern to adapt to this?

tgies

Assuming you have one #start, #end, or #p element per line, you can make your regex aware of this and add an additional non-capturing group to indicate that the last \s+(\S+) in a line is optional:

/(#start)|(#end)|#p\s+(\S+)(?:\s+(\S+))?$/igm

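(?: ) is saying "treat this as a group, but don't capture the pattern it matches" (so it won't create an element in match). The ? that follows that group means "this group is optional and may or may not match anything in the pattern". The $ right after that, in conjunction with the m flag, matches the end of the line; it belongs to the #p alternative only. Note that \s still matches a line break, so a value-less #p line immediately followed by #end would capture #end as its value.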
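As a rough illustration of how this pattern behaves, the sketch below collects the key/value pairs it finds. collectPairs is a hypothetical helper, not part of the original code; only the regex is taken from the answer.

```javascript
// Hypothetical helper: returns [key, value] for every #p line the pattern matches.
function collectPairs(text) {
    // Pattern from the answer: the second token is optional, and $ (with the
    // m flag) anchors the #p alternative to the end of a line.
    const optionalValue = /(#start)|(#end)|#p\s+(\S+)(?:\s+(\S+))?$/igm;
    const pairs = [];
    let m;
    while ((m = optionalValue.exec(text)) !== null) {
        if (m[3] !== undefined) {        // only #p matches fill group 3
            pairs.push([m[3], m[4]]);
        }
    }
    return pairs;
}

console.log(collectPairs('#start\n#p 12345\n#p 12346 bar\n#end'));
// [ [ '12345', undefined ], [ '12346', 'bar' ] ]

// Edge case: \s+ can still cross the line break when the next line is a single token.
console.log(collectPairs('#start\n#p 12345\n#end'));
// [ [ '12345', '#end' ] ]
```

For the input from the question, the missing value simply shows up as undefined in match[4].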

You can also avoid the (?: ) trickery by using * instead of + quantifiers, meaning "match zero or more times": change \s+(\S+) to \s*(\S*). This has the side effect that the space between the number and the data that follows it is now optional.
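A quick sketch of the * variant follows. The answer only gives the substitution, so the full pattern here (keeping the $ anchor and the m flag) is an assumption, as are the names starVariant, starInput, and starMatches.

```javascript
// Assumed full pattern: \s+(\S+) replaced by \s*(\S*), $ and the m flag kept.
const starVariant = /(#start)|(#end)|#p\s+(\S+)\s*(\S*)$/igm;
const starInput = '#p 12345\n#p 12346 bar';

const starMatches = [];
let sm;
while ((sm = starVariant.exec(starInput)) !== null) {
    starMatches.push([sm[3], sm[4]]);
}

console.log(starMatches);
// [ [ '12345', '' ], [ '12346', 'bar' ] ]
```

One practical difference from the optional-group version: a missing value comes back as an empty string instead of undefined, because the capture group itself always participates in the match.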

I would rewrite the regex and refactor the code a bit as follows:

var re = /^#(start|end|p)(?:\s+(\d+)(?:[^\S\r\n]+([^\r\n]+))?)?$/igm;

while ((match = re.exec(data)) !== null) {
  switch (match[1].toLowerCase()) {   // the i flag also admits #START, #End, ...
    case 'start':
      obj = {};
      break;
    case 'end':
      results.push(obj);
      obj = null;
      break;
    case 'p':
      obj[match[2]] = match[3];
      break;
  }
}

I like capturing start, end, or p in the one capture group so I can use it in a switch statement. The version of the regex I use here is a little more discriminating (expects the token that follows #p to be numeric) and a little more forgiving (allows the last token on a #p line to contain any non-linebreak whitespace, e.g. #p 1138 this is only a test).
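Putting the pieces together, a complete function might look like the sketch below. parsePosts is a hypothetical name, and the guards for malformed input (a #p outside a block, or an #end without a #start) are assumptions about the desired behaviour rather than something the API documents.

```javascript
// Sketch of a full parser built around the refactored pattern.
function parsePosts(data) {
    const linePattern = /^#(start|end|p)(?:\s+(\d+)(?:[^\S\r\n]+([^\r\n]+))?)?$/igm;
    const results = [];
    let obj = null;
    let match;

    while ((match = linePattern.exec(data)) !== null) {
        switch (match[1].toLowerCase()) {
            case 'start':
                obj = {};
                break;
            case 'end':
                if (obj) results.push(obj);   // ignore an #end with no open block
                obj = null;
                break;
            case 'p':
                // Assumption: skip #p lines outside a block or without a numeric key.
                if (obj && match[2] !== undefined) obj[match[2]] = match[3];
                break;
        }
    }
    return results;
}

const sample = '#start\n#p 12345\n#p 1138 this is only a test\n#end';
console.log(parsePosts(sample));
```

With this input the result is a single object: key 12345 is present with the value undefined, and key 1138 holds the whole phrase "this is only a test".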

Edited at 2021-02-07