Data Extraction from Poorly Structured XML File

Tom Fogarty

I am working with physiologic data and have extremely large, poorly formatted XML files that contain all of the raw data needed for analysis. We have previously used a regex style method in autohotkey to extract data, but it performs poorly. I have been looking for ways to parse or extract data efficiently from these files.

I am running into problems with either XML file structure being invalid or not being able to build a good sed command to extract and piece together the data the way we need it. That said there is minimal variability to the data structure overall so I think regex can be used to pull data from the file effectively. I have been trying to use grep/sed and various XML parsing tools so far.

Input: XML (poorly structured, 500-1500 mb)

Desired Output: CSV

VitalSigns:

1 VitalSigns, Date, Time, Par, Value
2 VitalSigns, Date, Time, Par, Value
...
end

Waveforms:

1 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480
2 Waveforms, Date, Time, Waveform Channel, value 1, value 2, ... , value 480
...
end

Here is a sample of the XML file: Dropbox link for sample file

      <VitalSigns Time="01/27/2017 10:33:04 PM">
            <VS>
                  <Par>PVC</Par>
                  <Value UOM="Bpm">0</Value>
                  <AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh>
            </VS>
            <VS>
                  <Par>RESP</Par>
                  <Value UOM="BrMin">68</Value>
                  <AlarmLimitLow Label="RESP LO">12</AlarmLimitLow>
                  <AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh>
            </VS>
            <VS>
                  <Par>SPO2-%</Par>
                  <Value UOM="%" Q="2">100</Value>
                  <AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow>
                  <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
            </VS>
      </VitalSigns>
      <Waveforms Time="01/27/2017 10:33:04 PM">
            <WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData>
      </Waveforms>
  <VitalSigns Time="01/27/2017 10:33:06 PM">
        <VS>
              <Par>PVC</Par>
              <Value UOM="Bpm">0</Value>
              <AlarmLimitHigh Label="PVC HI">6</AlarmLimitHigh>
        </VS>
        <VS>
              <Par>RESP</Par>
              <Value UOM="BrMin">55</Value>
              <AlarmLimitLow Label="RESP LO">12</AlarmLimitLow>
              <AlarmLimitHigh Label="RESP HI">60</AlarmLimitHigh>
        </VS>
        <VS>
              <Par>SPO2-%</Par>
              <Value UOM="%" Q="2">99</Value>
              <AlarmLimitLow Label="SPO2-% LO">93</AlarmLimitLow>
              <AlarmLimitHigh Label="SPO2-% HI">101</AlarmLimitHigh>
        </VS>
  </VitalSigns>
<Waveforms Time="01/27/2017 10:33:06 PM">
            <WaveformData Channel="I" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="II" UOM="Uncalib">347,342,334,332,332,332,331,333,335,335,335,335,335,335,338,343,347,347,346,348,348,347,345,342,341,341,343,342,340,335,332,328,323,312,298,286,281,279,279,281,286,289,293,299,307,317,331,344,359,373,382,385,379,365,345,328,317,313,309,304,300,297,294,292,290,288,289,289,290,284,270,265,307,424,586,667,538,262,49,30,136,237,286,294,288,282,280,278,276,274,276,276,275,272,269,267,265,262,260,259,263,271,281,290,301,312,316,315,310,304,297,290,282,276,269,259,251,243,237,234,233,234,237,239,242,243,243,244,244,247,254,264,271,273,274,275,271,259,239,221,214,214,213,209,205,203,202,201,200,197,194,191,188,176,162,176,259,409,532,488,254,-4,-93,-13,110,191,222,223,219,218,217,216,215,217,219,220,222,223,224,227,229,232,235,238,240,240,240,245,249,251,251,248,246,244,244,243,240,235,229,222,215,209,207,208,214,221,226,230,233,234,235,239,243,247,253,261,270,275,280,283,281,269,248,228,217,214,214,213,214,217,223,229,233,238,242,249,255,254,248,265,343,489,621,595,373,107,-7,55,169,244,269,268,262,260,262,265,268,269,269,270,271,273,276,278,281,282,287,290,294,296,299,304,308,310,309,308,307,307,307,306,302,294,286,278,271,262,256,253,254,263,278,293,307,320,334,345,350,350,349,349,353,360,366,371,374,370,360,346,333,325,323,323,322,319,317,315,313,314,314,315,316,317,313,308,326,405,554,701,694,478,194,56,109,231,315,343,341,333,331,333,336,338,339,338,339,338,336,332,327,325,329,337,345,351,360,374,388,397,399,398,394,391,387,382,378,373,367,362,357,352,346,342,339,339,340,343,347,353,356,358,360,360,359,360,360,361,362,364,367,375,383,390,392,388,377,362,347,342,342,342,342,340,338,337,336,335,334,334,332,330,321,318,352,453,611,718,649,408,155,63,146,273,354,379,374,365,360,359,359,360,361,364,365,364,362,359,357,355,354,355,355,357,358,358,358,358,357,357,356,355,353,351,347,343,338,330,324,319,315,312,310,310,314,319,325,330,335,339,342,342,339,336,334,333,334,336,340,349,361,372</WaveformData>
            <WaveformData Channel="SPO2" UOM="Uncalib">370,741,741,1112,1117,1122,1122,1128,1128,1128,1128,1128,1138,1149,1149,1160,1160,1160,1160,1160,1176,1192,1192,1208,1208,1208,1208,1208,1216,1224,1224,1232,1232,1232,1232,1232,1242,1253,1253,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1264,1277,1290,1290,1304,1304,1304,1304,1304,1314,1325,1325,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1336,1338,1341,1341,1344,1344,1344,1344,1344,1338,1333,1333,1328,1328,1328,1328,1328,1325,1322,1322,1320,1320,1320,1320,1320,1330,1341,1341,1352,1352,1352,1352,1352,1373,1394,1394,1416,1416,1416,1416,1416,1434,1453,1453,1472,1472,1472,1472,1472,1485,1498,1498,1512,1512,1512,1512,1512,1520,1528,1528,1536,1536,1536,1536,1536,1533,1530,1530,1528,1528,1528,1528,1528,1514,1501,1501,1488,1488,1488,1488,1488,1464,1440,1440,1416,1416,1416,1416,1416,1397,1378,1378,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1306,1309,1309,1312,1312,1312,1312,1312,1336,1360,1360,1384,1384,1384,1384,1384,1413,1442,1442,1472,1472,1472,1472,1472,1477,1482,1482,1488,1488,1488,1488,1488,1480,1472,1472,1464,1464,1464,1464,1464,1448,1432,1432,1416,1416,1416,1416,1416,1389,1362,1362,1336,1336,1336,1336,1336,1322,1309,1309,1296,1296,1296,1296,1296,1290,1285,1285,1280,1280,1280,1280,1280,1285,1290,1290,1296,1296,1296,1296,1296,1314,1333,1333,1352,1352,1352,1352,1352,1370,1389,1389,1408,1408,1408,1408,1408,1429,1450,1450,1472,1474,1477,1477,1480,1477,1474,1474,1472,1472,1472,1472,1472,1458,1445,1445,1432,1432,1432,1432,1432,1408,1384,1384,1360,1360,1360,1360,1360,1341,1322,1322,1304,1304,1304,1304,1304,1282,1261,1261,1240,1240,1240,1240,1240,1218,1197,1197,1176,1176,1176,1176,1176,1168,1160,1160,1152,1152,1152,1152,1152,1154,1157,1157,1160,1160,1160,1160,1160,1184,1208,1208,1232,1232,1232,1232,1232,1253,1274,1274,1296,1296,1296,1296,1296,1298,1301,1301,1304,1304,1304,1304,1304,1293,1282,1282,1272,1272,1272,1272,1272,1256,1240,1240,1224,1224,1224,1224,1224,1194,1165,1165,1136,1136,1136,1136,1136,1117,1098,1098,1080,1080,1080,1080,1080,1056,1032,1032,1008,1008,1008,1008,1008,994,981,981,968,968,968,968,968,960,952,952,944,944,944,944,944,960,976,976,992,992,992,992,992,1024,1056,1056,1088,1088,1088,1088,1088,1106,1125,1125,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1144,1128,1112,1112,1096,1096,1096,1096,1096,1053,1010,1010,968,968,968,968,968,928,888,888,848</WaveformData>
      </Waveforms>

Here is the ideal output:

VitalSigns, 01/27/2017, 10:33:04 PM, PVC, 0    
VitalSigns, 01/27/2017, 10:33:04 PM, RESP, 68    
VitalSigns, 01/27/2017, 10:33:04 PM, SPO2, 100    
Waveforms, 01/27/2017, 10:33:04 PM, I, 347, 342, ..., end    
Waveforms, 01/27/2017, 10:33:04 PM, II, 347, 342, ..., end    
Waveforms, 01/27/2017, 10:33:04 PM, SPO2, 370, 741, ..., end    
VitalSigns, 01/27/2017, 10:33:06 PM, PVC, 0    
VitalSigns, 01/27/2017, 10:33:06 PM, RESP, 55
VitalSigns, 01/27/2017, 10:33:06 PM, SPO2, 99
...
end of file
Lars Fischer

As the other commenters pointed out, an XML tool is better suited for this task. But it can be done in sed as well, here is GNU sed script doing what you want:

script.sed

# concatenate VitalSigns or Waveforms, the date, the time into the holdspace
# on finding a Par, Value or WaveformData, further append the info to the holdspace
/<VitalSigns|Waveforms /  { s/^[^<]*<([^ ]+) Time="([^ ]+) ([^"]+)".*$/\1, \2, \3, /; h; } 
/<Par>.*<\/Par>/          { s/^[^>]+>([^<]*).*$/\1, /; H; }
/<Value.*<\/Value>/       { s/^[^>]+>([^<]*).*$/\1/; 
                            H; 
                            # now print Holdspace after removing newlines
                            x;
                            s/, \n/, /g;
                            p;
                            # now put only the first three columns back into Hold space
                            s/^(([^,]+, ){3}).*$/\1/;
                            x;
                          }

/<WaveformData /          { s/^.*WaveformData Channel="([^"]+)[^>]*>(.*)<\/WaveformData>.*$/\1, \2/; 
                            H;
                            # now print Holdspace after removing newlines
                            x;
                            s/, \n/, /g;
                            p;
                            # now put only the first three columns back into Hold space
                            s/^(([^,]+, ){3}).*$/\1/;
                            x;
                          }

Use it like this: sed -rn -f script.sed yourfile.

Notes

  • I used the example posted in the question. It it probably not very stable against changes in the xml file.
  • You see that it is a mess of arcane syntax and switching back and forth between hold and pattern space, thus it is not very extensible.
  • Better use an XML-tool as others have already suggested in the comments.

Collected from the Internet

Please contact [email protected] to delete if infringement.

edited at
0

Comments

0 comments
Login to comment

Related