I have a space-separated file that I want to turn into a tab-separated file. The file looks like this:
pos peptide logscore affinity(nM) Bind Level Protein Name Allele
0 GPSGGQPX 0.075 22266 1 HLA-A11:01
0 PSGGQPXA 0.071 23285 2 HLA-A11:01
0 SGGQPXAL 0.076 21945 3 HLA-A11:01
0 GGQPXALD 0.076 21858 4 HLA-A11:01
0 GQPXALDS 0.075 22237 5 HLA-A11:01
0 QPXALDSG 0.073 22748 6 HLA-A11:01
0 PXALDSGY 0.072 22962 7 HLA-A11:01
0 XALDSGYD 0.080 21133 8 HLA-A11:01
0 DTSMKDMH 0.093 18194 9 HLA-A11:01
0 TSMKDMHK 0.732 18 SB 10 HLA-A11:01
0 SMKDMHKV 0.099 17148 11 HLA-A11:01
0 MKDMHKVL 0.071 23175 12 HLA-A11:01
0 KDMHKVLR 0.135 11550 13 HLA-A11:01
0 DMHKVLRT 0.074 22537 14 HLA-A11:01
0 MHKVLRTL 0.072 23056 15 HLA-A11:01
0 HKVLRTLQ 0.069 23819 16 HLA-A11:01
0 DTSMKDMH 0.093 18194 17 HLA-A11:01
0 TSMKDMHK 0.732 18 SB 18 HLA-A11:01
0 SMKDMHKV 0.099 17148 19 HLA-A11:01
0 MKDMHKVL 0.071 23175 20 HLA-A11:01
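I have to replace each run of whitespace with a single tab, taking into account that two of the header names ("Bind Level" and "Protein Name") contain a space themselves, and that the Bind Level column is empty in most records.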
Hence, just the following isn't enough:
awk '{$1=$1}1' OFS="\t" file
Is there a simple way to accomplish this with a one-liner, preferably awk?
EDIT:
This is how the output should look. Notice "Bind.Level" and "Protein.Name" in the header, and "-" (which could also be NA or "") in the records where Bind.Level is empty:
pos peptide logscore affinity(nM) Bind.Level Protein.Name Allele
0 GPSGGQPX 0.075 22266 - 1 HLA-A11:01
0 PSGGQPXA 0.071 23285 - 2 HLA-A11:01
0 SGGQPXAL 0.076 21945 - 3 HLA-A11:01
0 GGQPXALD 0.076 21858 - 4 HLA-A11:01
0 GQPXALDS 0.075 22237 - 5 HLA-A11:01
0 QPXALDSG 0.073 22748 - 6 HLA-A11:01
0 PXALDSGY 0.072 22962 - 7 HLA-A11:01
0 XALDSGYD 0.080 21133 - 8 HLA-A11:01
0 DTSMKDMH 0.093 18194 - 9 HLA-A11:01
0 TSMKDMHK 0.732 18 SB 10 HLA-A11:01
0 SMKDMHKV 0.099 17148 - 11 HLA-A11:01
0 MKDMHKVL 0.071 23175 - 12 HLA-A11:01
0 KDMHKVLR 0.135 11550 - 13 HLA-A11:01
0 DMHKVLRT 0.074 22537 - 14 HLA-A11:01
0 MHKVLRTL 0.072 23056 - 15 HLA-A11:01
0 HKVLRTLQ 0.069 23819 - 16 HLA-A11:01
0 DTSMKDMH 0.093 18194 - 17 HLA-A11:01
0 TSMKDMHK 0.732 18 SB 18 HLA-A11:01
0 SMKDMHKV 0.099 17148 - 19 HLA-A11:01
0 MKDMHKVL 0.071 23175 - 20 HLA-A11:01
Note that non-empty Bind.Level records can take values other than "SB", but they are always alphabetic. Protein.Name, on the other hand, might not always be numeric.
It would be something like splitting each line into fields on \s+; then, if there are 7 fields, print them as they are (separated by tabs), and if there are 6 (Bind.Level empty), print $1, $2, $3, $4, "-", $5, $6. Protein.Name values could potentially contain spaces, but I'm going to make sure they don't (they are my own input). That should be super simple, but I don't know how to do it. Anyone?
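The field-count idea described above can be sketched in a single awk pass. This is only a sketch under stated assumptions: the header always consists of the nine whitespace-separated words shown, data rows have either 6 or 7 fields, and no value contains a space. The function name to_tsv is a hypothetical wrapper used for illustration.

```shell
# Hypothetical helper: convert the whitespace-separated table to TSV.
# Reads the files given as arguments, or stdin if there are none.
to_tsv() {
  awk 'BEGIN { OFS = "\t" }
  NR == 1 {
    # Header: "Bind Level" and "Protein Name" arrive as two words each,
    # so glue each pair back together with a dot.
    print $1, $2, $3, $4, $5 "." $6, $7 "." $8, $9
    next
  }
  NF == 6 {
    # Bind.Level is empty: insert "-" as a placeholder in fifth position.
    print $1, $2, $3, $4, "-", $5, $6
    next
  }
  {
    # Any other row (normally 7 fields): rebuild the record with tabs.
    $1 = $1
    print
  }' "$@"
}
```

It would be called as `to_tsv file > out.tsv`. Because the decision is based on the number of fields rather than on character positions, it does not matter whether Protein.Name is numeric, as long as it contains no spaces.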
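Got it in two steps: a first step to put "-" in the empty Bind.Level records and "." in the two header names, and a second step to convert whitespace to tabs. The first step relies on the file being fixed-width: with an empty FS (as gawk handles it) every character becomes its own field, so $50 is the 50th character of the line.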
awk 'BEGIN{FS="";OFS=FS};($50==" "){$50="-"};(NR==1){$47="."; $64="."}{print}' file > out1
awk '{$1=$1}1' OFS="\t" out1 > out2
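A quick sanity check on the result, as a sketch: every line of a correct output should have exactly 7 tab-separated fields. The function name check_tsv is hypothetical, and the count of 7 is taken from the expected header shown earlier.

```shell
# Hypothetical helper: report lines that do not have exactly 7
# tab-separated fields; exits non-zero if any such line is found.
check_tsv() {
  awk -F'\t' -v want=7 '
    NF != want { printf "line %d: %d fields\n", NR, NF; bad = 1 }
    END { exit bad + 0 }
  ' "$@"
}
```

For example, `check_tsv out2` prints nothing and returns 0 when every row is well formed.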