Merging/averaging multiple data files

Posted by debugcn in Dev

没有人2100

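I have a set of data files (e.g. "data####.dat", where #### = 0001, ..., 9999) that all share the same structure: a first column with the same x-values, followed by a number of columns with different y-values.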

data0001.dat:

#A < comment line with unique identifier 'A'
#B1 < this is a comment line that can/should be dropped
1 11 21
2 12 22
3 13 23

data0002.dat:

#A < comment line with unique identifier 'A'
#B2 < this is a comment line that can/should be dropped
1 13 23
2 12 22
3 11 21

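They basically stem from different runs of my program with different seeds. I now want to combine these partial results into one common histogram such that the comment line starting with "#A" (which is identical in all files) is retained and the other comment lines are dropped. The first column stays the same, and all other columns should be averaged over all data files: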

dataComb.dat:

#A < comment line with unique identifier 'A'
1 12 22 
2 12 22 
3 12 22

where 12 = (11+13)/2 = (12+12)/2 = (13+11)/2 and 22 = (21+23)/2 = (22+22)/2 = (23+21)/2.
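
The arithmetic above can be checked with a few lines of Python. This is only a sketch of the averaging step on the two sample tables, with the comment lines already dropped; `average_tables` is a hypothetical helper name and it assumes all tables have the same shape.

```python
# Rows of the two sample files, comment lines already dropped.
data0001 = [[1, 11, 21], [2, 12, 22], [3, 13, 23]]
data0002 = [[1, 13, 23], [2, 12, 22], [3, 11, 21]]


def average_tables(tables):
    """Average the y-columns of equally shaped tables; keep column 0 as is."""
    n = len(tables)
    combined = []
    for rows in zip(*tables):                     # the same row from every file
        x = rows[0][0]                            # x-value taken from the first file
        y_cols = zip(*(row[1:] for row in rows))  # group the y-values column-wise
        combined.append([x] + [sum(col) / n for col in y_cols])
    return combined


print(average_tables([data0001, data0002]))
# prints [[1, 12.0, 22.0], [2, 12.0, 22.0], [3, 12.0, 22.0]]
```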

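I already have a bash script (probably horrible code, but I'm not that experienced...) that does the job when run as ./merge.sh data* > dataComb.dat on the command line. It also checks that all data files have the same number of columns and the same values in the first column.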

merge.sh:

#!/bin/bash

if [ $# -lt 2 ]; then
    echo "at least two files please"
    exit 1;
fi

i=1
for file in "$@"; do
    cols[$i]=$(awk '
BEGIN {cols=0}
$1 !~ /^#/ {
  if (cols==0) {cols=NF}
  else {
    if (cols!=NF) {cols=-1}
  }
}
END {print cols}
' "${file}")
    i=$((${i}+1))
done

ncol=${cols[1]}
for i in ${cols[@]}; do
    if [ $i -ne $ncol ]; then
        echo "mismatch in the number of columns"
        exit 1
    fi
done

echo "#combined $# files"
grep "^#A" "$1"

paste "$@" | awk "
\$1 !~ /^#/ && NF>0 {
  flag=0
  x=\$1
  for (c=1; c<${ncol}; c++) { y[c]=0. }
  i=1
  while (i <= NF) {
    if (\$i==x) {
      for (c=1; c<${ncol}; c++) { y[c] += \$(i+c) }
      i+= ${ncol}
    } else { flag=1; i=NF+1; }
  }
  if (flag==0) {
    printf(\"%e \", x)
    for (c=1; c<${ncol}; c++) { printf(\"%e \", y[c]/$#) }
    printf(\"\n\")
  } else { printf(\"# x -coordinate mismatch\n\") }
}"

exit 0

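My problem is that for a large number of data files it quickly becomes slow and at some point throws a "too many open files" error. I can see that simply pasting all data files in one go (paste "$@") is the issue here, but doing it in batches and somehow introducing temporary files doesn't seem like an ideal solution either. I'd appreciate any help making this more scalable while retaining the way the script is called, i.e. with all data files passed as command-line arguments.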

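I decided to post this in the python section as well, since I'm often told that it is very handy for this kind of problem. I have almost no experience with python, but maybe this is the occasion to finally start learning it ;)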

水苏属

The code appended below works in Python 3.3 and produces the desired output, with a few minor caveats:

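It grabs the initial comment line from the first file it processes, but doesn't bother to check that all the files after that still match (i.e. if you have several files that start with #A and one that starts with #C, it won't reject the #C, even though it probably should). I mainly wanted to illustrate how the merge functionality works in Python, and figured that adding this type of validity check is best left as a "homework" problem.
It also doesn't bother to check that the number of rows and columns match, and will likely crash if they don't. Consider it another minor homework problem.
It prints all columns to the right of the first one as float values, since in some cases that may be what they are. The initial column is treated as a label or line number, and is therefore printed as an integer value.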

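You can call the code in almost the same way as before. For example, if you name the script file merge.py, you can run python merge.py data0001.dat data0002.dat and it will print the merged average result to stdout, just as the bash script does. The code also has some added flexibility compared to one of the earlier answers: the way it is written, it should in principle (I haven't actually tested this to make sure) be able to merge files with any number of columns, not just files with exactly three columns. Another nice benefit: it doesn't leave files open after it is done with them. The with open(name, 'r') as infile: line is a Python idiom that automatically closes the file once the script has finished reading from it, even though close() is never explicitly called.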
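
The closing behaviour of the with idiom can be observed directly. The following is a small self-contained sketch; the temporary file and the variable names are made up for the demonstration.

```python
import os
import tempfile

# Write a throwaway data file so the example is self-contained.
with tempfile.NamedTemporaryFile('w', suffix='.dat', delete=False) as tmp:
    tmp.write('#A header\n1 11 21\n')
    path = tmp.name

with open(path, 'r') as infile:
    first_line = infile.readline()
    open_inside = not infile.closed  # the file is still open inside the block

closed_after = infile.closed         # it has been closed once the block is left
os.remove(path)
print(open_inside, closed_after)
```

Because each file is closed before the next one is opened, only one data file is open at any time, no matter how many names are passed on the command line.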

#!/usr/bin/env python3

import argparse
import re

# Give help description
parser = argparse.ArgumentParser(description='Merge some data files')
# Add to help description
parser.add_argument('fname', metavar='f', nargs='+',
                    help='Names of files to be merged')
# Parse the input arguments!
args = parser.parse_args()
argdct = vars(args)

topcomment=None
output = {}
# Loop over file names
for name in argdct['fname']:
    with open(name, "r") as infile:
        # Loop over lines in each file
        for line in infile:
            # Skip blank lines, which would otherwise fail at items[0]
            if not line.strip():
                continue
            # Skip comment lines, except to take note of first one that
            # matches "#A"
            if re.search('^#', line):
                if re.search('^#A', line) is not None and topcomment is None:
                    topcomment = line
                continue
            items = line.split()
            # If a line matching this one has been encountered in a previous
            # file, add the column values
            currkey = float(items[0])
            if currkey in output:
                for ii in range(len(output[currkey])):
                    output[currkey][ii] += float(items[ii+1])
            # Otherwise, add a new key to the output and create the columns
            else:
                output[currkey] = list(map(float, items[1:]))

# Print the comment line, if one was found
if topcomment is not None:
    print(topcomment, end='')
# Get total number of files for calculating average
nfile = len(argdct['fname'])
# Sort the output keys
skey = sorted(output.keys())
# Loop through sorted keys and print each averaged column to stdout
for key in skey:
    outline = str(int(key))
    for item in output[key]:
        outline += ' ' + str(item/nfile)
    outline += '\n'
    print(outline, end='')
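
If the first column can hold non-integer x-values, str(int(key)) in the script above would truncate them. One possible adjustment is sketched below; `format_x` is a hypothetical helper and not part of the script.

```python
def format_x(key):
    """Format an x-value: whole numbers without a decimal point, others as is."""
    key = float(key)
    return str(int(key)) if key.is_integer() else str(key)


print(format_x(3.0), format_x(2.5))
# prints: 3 2.5
```

Replacing outline = str(int(key)) with outline = format_x(key) would keep the integer labels of the sample files unchanged while leaving fractional x-values intact.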

Edited on 2021-06-5
