我目前正在尝试用Python编写一个程序,该程序读取syslog并运行正则表达式负载,以便将数据提取到数据框中的列中,然后可以对其进行分析。
这是syslog文件的示例:
Feb 1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.84 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=12973 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=24.17.237.70 DST=11.11.11.95 LEN=40 TOS=0x00 PREC=0x00 TTL=113 ID=27095 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:07 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.85 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=13801 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.87 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=15432 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=24.17.237.70 DST=11.11.11.100 LEN=40 TOS=0x00 PREC=0x00 TTL=113 ID=31168 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:27 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=192.150.249.87 DST=11.11.11.89 LEN=40 TOS=0x00 PREC=0x00 TTL=110 ID=17292 PROTO=TCP SPT=220 DPT=6129 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.70 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7204 DF PROTO=TCP SPT=1208 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.72 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7206 DF PROTO=TCP SPT=1210 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0
Feb 1 00:00:31 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOUT=eth1 SRC=211.168.230.94 DST=11.11.11.64 LEN=48 TOS=0x00 PREC=0x00 TTL=110 ID=7198 DF PROTO=TCP SPT=1202 DPT=135 WINDOW=16384 RES=0x00 SYN URGP=0
到目前为止,我已经在PySpark中使用此方法加载数据文件了:
raw_data_files = glob.glob('*.log')
base_df = spark.read.text(raw_data_files)
然后将数据提取到这样的列中:
ts_pattern =r'((Jan|Feb|Mar|Apr|May|Jun|Jul|Apr|Sep|Oct|Nov|Dec)\s+(\d+)\s+((0[0-9]|1[0-9]|2[0-3])(\:)(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])(\:)(0[0-9]|1[0-9]|2[0-9]|3[0-9]|4[0-9]|5[0-9])))'
mth_pattern = r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Apr|Sep|Oct|Nov|Dec\s+\d+\s+)'
dy_pattern = r'\s+(\d+)\s+'
tme_pattern = r'(([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))'
src_pattern = r'((\sSRC.*?=)([0-9a-f\.]*))'
dst_pattern = r'((\sDST.*?=)([0-9a-f\.]*))'
dvc_pattern = r'((([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))\s)(\S+)'
svc_pattern = r'((([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))\s)(\S+)\s((?:(?!:).)*)'
ufw_pattern = r'(\bUFW\s\b)([^]\s]+)'
sts_pattern = r'((?<=stats:\s).*)'
in_pattern = r'((\bIN=\b)([^\s]+))'
physin_pattern = r'((\bPHYSIN=\b)([^\s]+))'
out_pattern = r'((\bOUT=\b)([^\s]+))'
physout_pattern = r'((\bPHYSOUT=\b)([^\s]+))'
mac_pattern = r'((\bMAC=\b)([^\s]+))'
len_pattern = r'((\bLEN=\b)([^\s]+))'
tos_pattern = r'((\bTOS=\b)([^\s]+))'
prec_pattern = r'((\bPREC=\b)([^\s]+))'
ttl_pattern = r'((\bTTL=\b)([^\s]+))'
id_pattern = r'((\bID=\b)([^\s]+))'
flag_pattern = r'(\b\sCE|DF|MF\s\b)'
prtc_pattern = r'((\bPROTO=\b)([^\s]+))'
spt_pattern = r'((\bSPT=\b)([^\s]+))'
dpt_pattern = r'((\bDPT=\b)([^\s]+))'
recseq_pattern = r'((\bSEQ=\b)([^\s]+))'
ackseq_pattern = r'((\bSEQ=\b)([^\s]+))'
win_pattern = r'((\bWINDOW=\b)([^\s]+))'
res_pattern = r'((\bRES=\b)([^\s]+))'
synurgp_pattern = r'((\bSYN\sURGP=\b)([^\s]+))'
status_pattern = r'((\bkernel:\s\b)(((eth|br|dev|Ker).*)))'
# Combined regex
logs_df = base_df.select(regexp_extract('value', mth_pattern, 1).alias('month'),
regexp_extract('value', dy_pattern, 1).alias('day'),
regexp_extract('value', tme_pattern, 1).alias('time'),
regexp_extract('value', src_pattern, 3).alias('source'),
regexp_extract('value', dst_pattern, 3).alias('destination'),
regexp_extract('value', dvc_pattern, 6).alias('device'),
regexp_extract('value', svc_pattern, 7).alias('service'),
regexp_extract('value', ufw_pattern, 2).alias('ufw rule'),
regexp_extract('value', sts_pattern, 1).alias('stats event'),
regexp_extract('value', status_pattern, 3).alias('status event'),
regexp_extract('value', in_pattern, 3).alias('input'),
regexp_extract('value', physin_pattern, 3).alias('physinput'),
regexp_extract('value', out_pattern, 3).alias('output'),
regexp_extract('value', physout_pattern, 3).alias('physout'),
regexp_extract('value', mac_pattern, 3).alias('mac address'),
regexp_extract('value', len_pattern, 3).cast('integer').alias('length'),
regexp_extract('value', tos_pattern, 3).alias('typeofservice'),
regexp_extract('value', prec_pattern, 3).alias('precedence'),
regexp_extract('value', ttl_pattern, 3).cast('integer').alias('timetolive'),
regexp_extract('value', id_pattern, 3).cast('integer').alias('id'),
regexp_extract('value', flag_pattern, 1).alias('flag'),
regexp_extract('value', prtc_pattern, 3).alias('protocol'),
regexp_extract('value', spt_pattern, 3).cast('integer').alias('sourceport'),
regexp_extract('value', dpt_pattern, 3).cast('integer').alias('destinationport'),
regexp_extract('value', recseq_pattern, 3).cast('integer').alias('recseqnumber'),
regexp_extract('value', ackseq_pattern, 3).cast('integer').alias('ackseqnumber'),
regexp_extract('value', win_pattern, 3).cast('integer').alias('window'),
regexp_extract('value', res_pattern, 3).alias('reserved'),
regexp_extract('value', synurgp_pattern, 3).cast('integer').alias('synurgp'))
印刷数据框:
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
|month|day| time| source| destination|device|service|ufw rule|stats event|status event|input|physinput|output|physout|mac address|length|typeofservice|precedence|timetolive| id|flag|protocol|sourceport|destinationport|recseqnumber|ackseqnumber|window|reserved|synurgp|
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
| Feb| 1|00:00:02|192.150.249.87| 11.11.11.84|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 110|12973| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:02| 24.17.237.70| 11.11.11.95|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 113|27095| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:07|192.150.249.87| 11.11.11.85|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 110|13801| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:17|192.150.249.87| 11.11.11.87|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 110|15432| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:24| 24.17.237.70|11.11.11.100|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 113|31168| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:27|192.150.249.87| 11.11.11.89|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 110|17292| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:31|211.168.230.94| 11.11.11.70|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7204| DF| TCP| 1208| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:31|211.168.230.94| 11.11.11.72|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7206| DF| TCP| 1210| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:31|211.168.230.94| 11.11.11.64|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7198| DF| TCP| 1202| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.69|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7203| DF| TCP| 1207| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.73|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7207| DF| TCP| 1211| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.75|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7209| DF| TCP| 1213| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.80|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7214| DF| TCP| 1218| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.67|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7201| DF| TCP| 1205| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.71|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7205| DF| TCP| 1209| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|192.150.249.87| 11.11.11.90|bridge| kernel| | | | br0| eth0| br0| eth1| | 40| 0x00| 0x00| 110|18107| | TCP| 220| 6129| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.70|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7235| DF| TCP| 1208| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.72|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7237| DF| TCP| 1210| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.69|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7234| DF| TCP| 1207| 135| null| null| 16384| 0x00| 0|
| Feb| 1|00:00:32|211.168.230.94| 11.11.11.80|bridge| kernel| | | | br0| eth0| br0| eth1| | 48| 0x00| 0x00| 110| 7233| DF| TCP| 1218| 135| null| null| 16384| 0x00| 0|
+-----+---+--------+--------------+------------+------+-------+--------+-----------+------------+-----+---------+------+-------+-----------+------+-------------+----------+----------+-----+----+--------+----------+---------------+------------+------------+------+--------+-------+
此方法效果很好,我对结果感到满意,但是,如果可能的话,我想在不使用Apache Spark的情况下进行此工作。这是一个很棒的工具,但我认为对于这个用例来说可能太多了。
我尝试通过将文件读取为fwf来使用Pandas获得类似的结果,这并不理想,因为由于日志格式的原因,它会将其拆分为奇数列。如果我可以使用像PySpark这样的特定正则表达式组从数据框中提取数据,这将不是问题。无论如何,我在熊猫都找不到做到这一点的方法。或者无论如何将所有正则表达式编译成一个命令并使用指定的组以这种方式创建列。
我尝试将Pandas数据框中的所有列合并为1,然后使用str.extract将数据提取到新列中。但是,仅当我的正则表达式具有1个捕获组时,此方法才有效,这是不可能实现的,但仍会获得相关结果。
我还尝试将所有的正则表达式模式组合为一个,并以这种方式提取数据,如下所示:
import re
import pandas as pd
import glob
import numpy as np
# define regex patterns
mth = r'(?P<month>Jan|Feb|Mar|Apr|May|Jun|Jul|Apr|Sep|Oct|Nov|Dec\s+\d+\s+)'
dy = r'\s+(?P<day>\d+)\s+'
tme = r'(?P<time>([01]?\d|2[0-3]|24(?=:00?:00?$)):([0-5]\d):([0-5]\d))'
src = r'\sSRC.*?=?(?P<source>[0-9a-f\.]*)'
dst = r'\sDST.*?=(?P<destination>[0-9a-f\.]*)'
# import all files
data = pd.read_fwf('/home/fred/Documents/PythonProjects/LogFileReader/SotM30-anton.log', header=None)
all_re = re.compile('({0}{1}{2}{3}{4})'.format(mth, dy, tme, src, dst))
# parse to dataframe
df = pd.DataFrame(data)
# format to single column
df = df.rename(columns={0 : 'c1', 1 : 'c2', 2 : 'c3', 3 : 'c4', 4 : 'c5'})
cols = ['c1', 'c2', 'c3', 'c4', 'c5']
df['combined'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
print(df)
print(df)的结果:
c1 c2 c3 c4 c5 5 combined
0 Feb 1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 1 00:00:02 bridge kernel:
1 Feb 1 00:00:02 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 1 00:00:02 bridge kernel:
2 Feb 1 00:00:07 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 1 00:00:07 bridge kernel:
3 Feb 1 00:00:17 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 1 00:00:17 bridge kernel:
4 Feb 1 00:00:24 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 1 00:00:24 bridge kernel:
... ... .. ... ... ... ... ...
307519 Feb 7 14:35:06 bridge kernel: OUTG CONN TCP: IN=br0 PHYSIN=eth1 OUT=br0 PHYS... Feb 7 14:35:06 bridge kernel:
307520 Feb 7 14:36:12 bridge kernel: INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 7 14:36:12 bridge kernel:
307521 Feb 7 14:39:04 bridge kernel: INBOUND UDP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 7 14:39:04 bridge kernel:
307522 Feb 7 14:39:11 bridge kernel: INBOUND TCP: IN=br0 PHYSIN=eth0 OUT=br0 PHYSOU... Feb 7 14:39:11 bridge kernel:
307523 Feb 7 14:40:06 bridge kernel: OUTG CONN TCP: IN=br0 PHYSIN=eth1 OUT=br0 PHYS... Feb 7 14:40:06 bridge kernel:
new_df = df[['combined']].copy()
x = new_df['combined'].str.extract(all_re, expand=True)
print(x)
但这只会导致数据帧充满NaN(print(x)的结果):
0 month day time 4 5 6 source destination
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
307519 NaN NaN NaN NaN NaN NaN NaN NaN NaN
307520 NaN NaN NaN NaN NaN NaN NaN NaN NaN
307521 NaN NaN NaN NaN NaN NaN NaN NaN NaN
307522 NaN NaN NaN NaN NaN NaN NaN NaN NaN
307523 NaN NaN NaN NaN NaN NaN NaN NaN NaN
它似乎确实正在获取捕获组名称并将其用作列标题,但由于没有获取任何值,因此出现了问题。
综上所述,我想使用正则表达式从syslog文件中提取数据,并将结果解析为数据帧中的列,以便对数据进行分析。我可以使用Apache Spark进行此操作,但不希望这样做。我想使用Pandas,但不知道如何指定regex捕获组并像在Apache Spark中一样将它们一起编译。任何帮助将不胜感激。
本文收集自互联网,转载请注明来源。
如有侵权,请联系[email protected] 删除。
我来说两句