RAS(一)介紹
寫(xiě)在開(kāi)篇之前
近期收到了公司大禮包,想著在找工作期間把Linux RAS整理一下,寫(xiě)成系列文章。畢竟作為OS RAS負(fù)責(zé)人兼開(kāi)發(fā),為阿里云X86和倚天710 RAS落地了很多RAS增強(qiáng)和解決方案,對(duì)阿里云服務(wù)器穩(wěn)定性做出些許貢獻(xiàn)。期間也有不少其他團(tuán)隊(duì)過(guò)來(lái)請(qǐng)教過(guò)RAS事項(xiàng),所以想著記錄下來(lái),對(duì)以后計(jì)劃了解和學(xué)習(xí)RAS的Linux愛(ài)好者有所幫助。另外個(gè)人視角主要從Linux內(nèi)核出發(fā),梳理Linux RAS涉及的組件、功能、特性都有哪些,也會(huì)介紹內(nèi)核RAS涉及的硬件。
RAS背景
隨著云時(shí)代的到來(lái),各個(gè)公司都在將產(chǎn)品、服務(wù)等遷移上云,云為數(shù)字化建設(shè)提供了極大的便利。國(guó)內(nèi)外云服務(wù)器廠商如雨后春筍般出現(xiàn),比如谷歌云、阿里云、騰訊云等等。近些年,全球范圍內(nèi)云服務(wù)出現(xiàn)多次宕機(jī)事件,云的穩(wěn)定性越來(lái)越受到大家關(guān)注。據(jù)《中國(guó)數(shù)據(jù)災(zāi)備產(chǎn)業(yè)白皮書(shū)暨數(shù)據(jù)災(zāi)備建設(shè)調(diào)研報(bào)告》中描述,業(yè)務(wù)宕機(jī)1分鐘,平均會(huì)使運(yùn)輸業(yè)損失15萬(wàn)美元,銀行業(yè)損失27萬(wàn)美元,通信業(yè)損失35萬(wàn)美元,制造業(yè)損失42萬(wàn)美元,證券業(yè)損失45萬(wàn)美元,同時(shí)公司聲譽(yù)等無(wú)形資產(chǎn)損失更是無(wú)法估量。
在嵌入式領(lǐng)域,越來(lái)越多的國(guó)產(chǎn)自研硬件商用發(fā)布,包括內(nèi)存、硬盤、GPU、CPU等,應(yīng)用于PC主機(jī)、汽車電子、工業(yè)控制等市場(chǎng)。隨著這些硬件使用數(shù)量的飛速上升,穩(wěn)定性問(wèn)題也逐步開(kāi)始暴露出來(lái)。比如筆者在華為內(nèi)核團(tuán)隊(duì)負(fù)責(zé)對(duì)接某ARM嵌入式產(chǎn)品業(yè)務(wù),在使用國(guó)產(chǎn)中發(fā)現(xiàn)使用國(guó)產(chǎn)內(nèi)存條出現(xiàn)硬件問(wèn)題的概率大大超過(guò)國(guó)外某大廠品牌。
總體來(lái)說(shuō),隨著軟件技術(shù)的成熟和完善,因?yàn)檐浖?dǎo)致的問(wèn)題占比逐年減少。此消彼長(zhǎng)下,硬件問(wèn)題占比逐年突顯,比如硬件故障導(dǎo)致服務(wù)器異?;蝈礄C(jī)問(wèn)題已逐步成為云服務(wù)器Top1問(wèn)題。
RAS定義
服務(wù)器硬件穩(wěn)定性,主要體現(xiàn)在RAS上。RAS指機(jī)器的可靠性(Reliability)、可用性(Availability)和可服務(wù)性(Serviceability)。Linux Kernel對(duì)Reliability,Availability,Serviceability定義如下
Reliability
is the probability that a system will produce correct outputs.
?Generally measured as Mean Time Between Failures (MTBF)
?Enhanced by features that help to avoid, detect and repair hardware faults
Availability
is the probability that a system is operational at a given time
?Generally measured as a percentage of downtime per a period of time
?Often uses mechanisms to detect and correct hardware faults in runtime;
Serviceability (or maintainability)
is the simplicity and speed with which a system can be repaired or maintained
?Generally measured on Mean Time Between Repair (MTBR)
RAS目標(biāo)是使系統(tǒng)盡可能長(zhǎng)期可靠的運(yùn)行而不停機(jī),減少系統(tǒng)downtime;提供硬件檢測(cè)上報(bào)機(jī)制,以便在硬件錯(cuò)誤引起數(shù)據(jù)丟失或宕機(jī)之前能夠通知管理員及時(shí)更換硬件;提供硬件錯(cuò)誤恢復(fù)機(jī)制,并盡可能糾正錯(cuò)誤,使系統(tǒng)可持續(xù)可靠的運(yùn)行。
RAS涉及的硬件包括且不限于:CPU、Memory、IO、PCIe、硬盤和其他外設(shè)
?CPU – detect errors at instruction execution and at L1/L2/L3 caches;
?Memory – add error correction logic (ECC) to detect and correct errors;
?I/O – add CRC checksums for tranfered data;
?Storage – RAID, journal file systems, checksums, Self-Monitoring, Analysis and Reporting Technology (SMART).
通常來(lái)說(shuō),硬件錯(cuò)誤分為CE、UE、Fatal Error、Non-fatal Error,定義如下
?Correctable Error (CE)?- the error detection mechanism detected and corrected the error. Such errors are usually not fatal, although some Kernel mechanisms allow the system administrator to consider them as fatal.
?Uncorrected Error (UE)?- the amount of errors happened above the error correction threshold, and the system was unable to auto-correct.
?Fatal Error?- when an UE error happens on a critical component of the system (for example, a piece of the Kernel got corrupted by an UE), the only reliable way to avoid data corruption is to hang or reboot the machine.
?Non-fatal Error?- when an UE error happens on an unused component, like a CPU in power down state or an unused memory bank, the system may still run, eventually replacing the affected hardware by a hot spare, if available.
但是實(shí)際這個(gè)定義比較寬泛且簡(jiǎn)陋,比如還有Defferred Error(DE)
Deferred error
The error was detected, was not corrected, and was deferred. The error has not been silently propagated. The error might be latent in the system. It is IMPLEMENTATION DEFINED whether the error continues to infect the state of the node or whether it has been deferred to the consumer. The node continues to operate. If the error might have been silently propagated, it must be reported as an Uncorrected error.
又比如Intel將軟件可恢復(fù)的UC Error定義為UCR(Uncorrected Recoverable) Error,下面又分為SRAR、SRAO、UCNA等。
RAS基本框圖
RAS基本流程框圖如上,硬件發(fā)生故障后,通過(guò)硬件RAS能力觸發(fā)中斷或異常,通知到Firmware/OS,軟件收到通知后采取相應(yīng)的策略,比如Panic、執(zhí)行Recover actions或者通知到用戶。
隨著RAS功能不斷更新迭代以及架構(gòu)不同,RAS體系開(kāi)始呈現(xiàn)多樣性,因不同使用場(chǎng)景所有不同,體現(xiàn)在:
1.通知方式多樣
通知方式細(xì)分下來(lái)包括IRQ、Exception、Poll、SEA、SDEI、GPIO等方式。
2.Mode多樣
硬件故障先通知到Firmware,然后Firmware帶外處理或再通知到OS的方式,稱為Firmware First Mode;
硬件故障通知到OS,OS處理硬件故障的方式,稱為Kernel First Mode;
這兩種方式還可以支持混合使用,各有優(yōu)劣,要學(xué)會(huì)因地制宜。比如對(duì)于CE來(lái)說(shuō),服務(wù)器經(jīng)常發(fā)生大量CE事件,就會(huì)產(chǎn)生CE Irq風(fēng)暴,CPU長(zhǎng)時(shí)間在處理這些Irq,就會(huì)導(dǎo)致其他任務(wù)得不到調(diào)度,影響整體性能。
3.芯片架構(gòu)、硬件多樣性
隨著近些年芯片行業(yè)發(fā)展,芯片架構(gòu)越來(lái)越多樣性,包括Intel、AMD、ARM、RISC等,不同芯片架構(gòu)下硬件組成也有些許差異。
4.軟件多樣性
對(duì)于Linux驅(qū)動(dòng)來(lái)說(shuō),包括mce驅(qū)動(dòng)、apei驅(qū)動(dòng)、edac驅(qū)動(dòng)等;
對(duì)于用戶態(tài)RAS服務(wù)來(lái)說(shuō),包括mcelog、rasdaemon、perf event通知等;
總體來(lái)說(shuō),RAS是一個(gè)復(fù)雜的體系,不同芯片架構(gòu)、不同硬件RAS功能各不相同,作為RAS開(kāi)發(fā)要根據(jù)不同業(yè)務(wù)場(chǎng)景采取對(duì)應(yīng)的RAS方案。
RAS故障處理流程
以Intel服務(wù)器為例,
1.Intel服務(wù)器內(nèi)存發(fā)生CE故障后,硬件觸發(fā)CMCI中斷,執(zhí)行OS注冊(cè)的中斷處理函數(shù);
2.該函數(shù)調(diào)用EDAC驅(qū)動(dòng)代碼,讀取MCA狀態(tài)寄存器來(lái)獲取硬件故障信息,比如故障級(jí)別、故障硬件位置、故障地址等等。EDAC驅(qū)動(dòng)會(huì)將信息保存在/dev/mcelog;
3.Mcelog是一個(gè)用戶態(tài)的服務(wù)程序,通過(guò)解析/dev/mcelog信息,將其保存在/var/log/mcelog。用戶可以通過(guò)查看該文件了解此服務(wù)器是否發(fā)生過(guò)硬件故障以及故障發(fā)生的時(shí)間、硬件信息、是否恢復(fù)等關(guān)鍵信息;
RAS硬件故障舉例
如下是x86服務(wù)器注入內(nèi)存CE故障的日志,EDAC驅(qū)動(dòng)會(huì)打印故障發(fā)生所在的硬件(Memory)、Addr、Processor、類型(CE)、memory channel/dimm等信息。
?
C++ [22715.830801] EDAC sbridge MC3: HANDLING MCE MEMORY ERROR [22715.834759] EDAC sbridge MC3: CPU 0: Machine Check Event: 0 Bank 7: 8c00004000010090 [22715.834759] EDAC sbridge MC3: TSC 0 [22715.834759] EDAC sbridge MC3: ADDR 12345000 EDAC sbridge MC3: MISC 144780c86 [22715.834759] EDAC sbridge MC3: PROCESSOR 0:306e7 TIME 1422553404 SOCKET 0 APIC 0 [22716.616173] EDAC MC3: 1 CE memory read error on CPU_SrcID#0_Channel#0_DIMM#0 (channel:0 slot:0 page:0x12345 offset:0x0 grain:32 syndrome:0x0 - ?area:DRAM err_code0090 socket:0 channel_mask:1 rank:0) |
?
圖片來(lái)源:v2-8e4986144a6a70301ee1a30c60c5ffad_720w.webp (720×378) (zhimg.com)
編輯:黃飛
?
評(píng)論
查看更多