Matching a "pattern" from one file against header names in another file (R, Unix)

Posted by debugcn in Dev

伊维

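I have two large files, and I am trying to match the information in the first column of file_1 against the header of file_2. There is one small complication: each column name in file_2 starts with some extra text that differs from column to column, but it ends with the pattern to be matched. Basically, I need to find the columns of file_2 whose names end with a 'pattern' from file_1, and use that information to build the output data.frame.
Here is what the files look like: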

**file_1**  dim (757*3) the first column of the file_1 contains patterns
10001-101A3  a   t
10008-101B6  b   g
10235-104A1  c   h
-            -   -
-            -   -
etc...

**file_2** dim (4120*1079)
blabla.10001.101A3   blbl.2348.101B6 trsdr.1111.111D2 gfder.10008.101B6  ....
12                         1223           544               -              -
132                         23           3564               -              -
14                         223           33               -              -
162                         13           344               -              -


**Desired output file-3:** I assume that the output size will be 4120*757
blabla.10001.101A3   gfder.10008.101B6  ....
12                    -              -
132                   -              -
14                    -              -
162                    -              -

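I am trying to get the output with R (my script is below), but I would also like to learn how to do this in Unix (I guess awk and grep could help with this).

Here is my R script: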

table1=read.table("file2.tsv.gz", quote=NULL, sep='\t', header=T, fill=T)
table2=read.table("file1.txt", quote=NULL, sep='\t', header=T, fill=T)
    # dim(table1 4120 * 1079)   -> need to reduce amount of columns to 757
    # dim(table2 757 * 3)

###### the header in table1 has the form 10001.101A3, so we need to replace '-' with '.' in the pattern
### What to do:
### 1) Use the gsub() function to replace '-' with '.'
### 2) Use the gsub() function to remove the trailing space ' ' from the string
### 3) Find the modified pattern at the end of each column name
### 4) Apply to the entire table

pattern=table2[,1]            # '10001-101A3 '  '10008-101B6 ' 
for (x in pattern)  {
    ptn=gsub('-','.',x)
    ptn1=gsub(' ','',ptn)            # pattern to be matched
                                     # '10001.101A3'  '10008.101B6' 

    find_match=table1[,(grepl('^.+ptn1$', header))]   
    final_tb=table1[,find_match]
}

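I think the problem is how ptn1 is represented inside the grepl() function, because when I put 10001.101A3 in place of ptn1 I get the answer for a single run, but obviously I need to loop over all the patterns.

I also tried get(ptn1), but it still does not work.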
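One likely reading of the symptom described above: inside the quotes, `'^.+ptn1$'` is just the literal text "ptn1", so R never substitutes the variable's value. A minimal sketch of building the regular expression with `paste0()` instead is shown here. The toy data frame, its values, and the names `regex`, `hits` and `matched` are assumptions made for illustration, not part of the original script.

```r
# Toy data; column names and values are made up for illustration.
df <- data.frame(aa24.12a = 1:3, dda43.23s = 4:6, sdf24.112f = 7:9)
header <- colnames(df)

x <- "24-12a "                                   # raw pattern with a trailing space
ptn <- gsub("-", ".", x, fixed = TRUE)           # '-' becomes '.'
ptn1 <- gsub(" ", "", ptn, fixed = TRUE)         # drop the space

# paste0() splices the value of ptn1 into the regular expression.
# Each '.' is wrapped as '[.]' so that it matches only a literal dot.
regex <- paste0("^.+", gsub(".", "[.]", ptn1, fixed = TRUE), "$")

hits <- grepl(regex, header)                     # one TRUE/FALSE per column name
matched <- df[, hits, drop = FALSE]              # drop = FALSE keeps a data.frame
```

Inside a loop, the columns found for each pattern would still need to be accumulated (for example by collecting the `hits` vectors), since assigning to the same variable on every iteration keeps only the last result.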

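I would appreciate your comments, as well as any ideas on how to do this in Unix (I am only a basic Unix user, so I cannot do this task myself at the moment).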

######################## Trial on small data

df=data.frame(aa24.12a,dda43.23s,fds24.12a,sdf24.112f)

z = c('24-12a', '43-23s')  # patterns

aa24.12a fds24.12a aa24.12a.1 fds24.12a.1
1        2        34          2          34
2        3         2          3           2
3        4         1          4           1
4       56         3         56           3
5        3         5          3           5


header=colnames(df)
for (x in z){
     ptn=gsub('-','.',x)
     ptn1=gsub(' ','',ptn)# correct pattern 

     find_match=grep('^.+24.12a$', header)# find match of pattern in header
     tbl=df[,find_match]
}
> tbl
  aa24.12a fds24.12a
1        2        34
2        3         2
3        4         1
4       56         3
5        3         5

Thanks

伊维

Thank you, N8TRO, for your solution and the prompt reply.

My solution to this problem is:

# Modify pattern z=('24-12a','43-23s')
ptn=gsub('-','.',z)
ptn1=gsub(' ','',ptn)
# so now it looks like '24.12a' '43.23s'

i=1        
# create empty vector
df2=c()        
# Iterate:
# first loop through column names of data frame 
# second loop goes through vector's value
# grepl -> searches for matches
# condition, ==TRUE
# if so: append to the empty vector, values in the vector will be column numbers 

for (x in colnames(df)){
    for (y in ptn1){
        e=grepl(y,x)
        if (e==TRUE){
            df2=append(df2,i)
        }
    }
    i=i+1
}

wanted_output = df[,df2]
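For comparison, the same selection can be sketched without explicit loops or a counter. This is an alternative technique, not the solution above: it uses base R's `endsWith()`, which compares literally and only at the end of each name, whereas `grepl(y, x)` matches anywhere in the name and treats '.' as "any character". The toy data frame and the names `suffixes` and `keep` are assumptions for illustration.

```r
# Toy data; column names and values are made up for illustration.
df <- data.frame(aa24.12a = 1:3, dda43.23s = 4:6,
                 fds24.12a = 7:9, sdf24.112f = 10:12)
z <- c("24-12a ", "43-23s")                      # raw patterns, one with a trailing space

# Replace '-' with '.' and strip surrounding whitespace, for all patterns at once.
suffixes <- trimws(gsub("-", ".", z, fixed = TRUE))

# For each column name, check whether it ends with any of the suffixes.
keep <- vapply(colnames(df),
               function(nm) any(endsWith(nm, suffixes)),
               logical(1))

# drop = FALSE keeps a data.frame even if only one column matches.
wanted_output <- df[, keep, drop = FALSE]
```

If several columns end with the same pattern, all of them are kept, so the result can have more columns than there are patterns.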
