正常运行30分钟后，同步/ fsync的通话速度变慢

alex.p 发表于 Dev

亚历克斯

使用带有混合SSD的Ubuntu 14.04运行30分钟后，我看到许多进程使用阻塞了IO iotop。这是在磁盘写入期间，例如，如果我在gedit中打开和关闭一个空文件，由于dconf写入设置，可能需要2秒钟才能关闭，这会以类似方式影响其他应用程序；严重降低了整个系统的速度。

使用strace，我设法将其追溯到fsync调用，并从那里设法使用sync命令重现了它。

综上所述，sync仅从终端反复运行可能需要1-2秒的时间，但只有在30分钟的正常运行时间后才能运行。

为了证明这一点，我制作了一个脚本，以秒为单位输出正常运行时间，而不是执行同步所花费的时间，并每秒钟运行一次：

while true;
do
cat /proc/uptime | awk '{printf "%f ",$1}'; /usr/bin/time -f '%e' sync;
sleep 1;
done;

我运行了上面的脚本，等待了大约一个小时（系统处于空闲状态），然后将结果绘制为gnuplot（y =执行同步的时间（以秒为单位），x =运行时间（以秒为单位））：

同步时间与正常运行时间的关系

图形峰值的时间点大约是1780（1780/60 =大约30分钟）。

除脚本外，此时没有任何内容可写到磁盘上，因此在第一次同步后，页面缓存中应该几乎没有任何内容，每次后续同步都将精确地写入正在写到脚本中的内容，这大约是100个字节或所以。

重新启动后，此问题仍然存在。例如-如果我等待30分钟以等待减速，然后重新启动，则减速仍然存在。如果我关闭电源，然后重新启动，问题将消失，直到30分钟后。

另一个好奇之处是，当我检查以上图表并放大出现减速的区域时，我得到了以下信息：

同步时间与正常运行时间的关系图，放大

高峰和低谷重复出现-从低谷到低谷几乎每隔10秒就会发生一次，并且峰顶在下降时也会扭结。

在减速之前，我还进行了hdparm测试（hdparm -t /dev/sda和hdparm -T /dev/sda）：

/dev/sda:
Timing cached reads:   23778 MB in  2.00 seconds = 11900.64 MB/sec
/dev/sda:
Timing buffered disk reads: 318 MB in  3.01 seconds = 105.63 MB/sec

在减速期间：

/dev/sda:
 Timing cached reads:     2 MB in  2.24 seconds = 915.50 kB/sec
/dev/sda:
Timing buffered disk reads: 300 MB in  3.01 seconds =  99.54 MB/sec

显示实际的磁盘读取不会受到影响，但缓存的读取会受到影响，这是否意味着这与系统总线有关，而与HD无关？

这是我尝试过的解决方案：

更改HD的Spindown设置，也许HD进入了省电模式：

hdparm /dev/sda -S252 #(set it to 5 hours before spindown)

Change the filesystem's journalling type to writeback rather than ordered so that we get performance improvements - this isn't solving the problem though as it doesn't explain the 30 minutes slowdown-free uptime.
Disabled CRON as it seems to be occuring after a round 30 minutes.
CPU usage is fine and is completely idle so no processes can be blamed however I've tried shutting down every service including the session manager (lightdm) this does nothing as I believe the issue is lower level.
Analysing any new processes coming in at 30 minutes indicates no changes - I've diffed the output of PS before and after and there's no difference.

This only started occuring about 2 weeks ago, nothing was installed and no updates were done around that time. I'm thinking this issue is much lower level so would really appreciate some help here as I'm clueless, even pointing me in the right direction would be helpful - for example is there a way to examine what's being flushed out the page cache?

Write caching is enabled on the disk in question, I've also tried disabling write barriers. SMART data on the HD indicates no problems with the HD itself however I have my suspicions it's the HD doing something mysterious as it persists after reboots.

EDIT:

I've done :

watch -n 1 cat /proc/meminfo

... to see how the memory changes particularly looking at the dirty row and the writeback row which I believe is the HDs disk buffer. They all stay at zero for the most part highest being probably 300kb. Calling sync flushes these as expected back to 0 but during the slowdown calling sync when there is zero dirty pages and zero kb in the disk buffer still locks IO. What else could sync be doing if there's nothing to flush out the page cache and write cache?

Kjetil Jorgensen

The symptoms are very consistent with a mostly saturated IO system, however having for the most part ruled out IO load from the OS/userspace side, another possibility is the drive running self-tests on itself, which may include reading from all the sectors. This should be queryable/tunable from smartctl (At least one place being smartctl -c for querying).

至于为什么它来来往往现在突然开始：

该驱动器的使用寿命已过某个阶段（写入的扇区数，时间缩短等），并且该驱动器上的固件已触发以下扫描之一
我相信这也可以通过smartctl触发，因此可能有一些自动化过程触发了它
在驱动器已花费一定时间后，将其中一种扫描触发并标记为进行中或开始时，驱动器会从头开始重新触发，或从中断的位置恢复

本文收集自互联网，转载请注明来源。

如有侵权，请联系[email protected] 删除。

编辑于2021-03-5

我来说两句

0条评论

登录后参与评论

上一篇：标准I / O流-fgets（）缓冲类型

来自分类Dev

Related 相关文章

文章