Contents:
1. Power Management Framework
      1.1 Power State Management
      1.2 Power Saving Management
      1.3 Power Management Quality of Service
2. Sleep and Hibernation
      2.1 Freezing Processes
      2.2 The Sleep Flow
      2.3 The Hibernation Flow
      2.4 Autosleep
3. Shutdown and Reboot
      3.1 User-Space Handling
      3.2 Kernel Handling
4. CPU Dynamic Frequency Scaling
      4.1 CPUFreq Core
      4.2 Governors
      4.3 Drivers
5. CPU Idle
      5.1 CPUIdle Core
      5.2 Governors
      5.3 Drivers
6. Power Management Quality of Service
      6.1 System-Wide Constraints
      6.2 Per-Device Constraints
7. Summary
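1. Power Management Framework
A computer runs in the physical world, and every activity in the physical world consumes energy. Energy takes many forms, such as thermal, nuclear, and chemical; a computer consumes electrical energy, supplied by a battery or an external power source. Inside the machine, a power management IC (PMIC) takes the incoming power and converts it into the different supply voltages that the hardware components need. The voltage each component requires is fixed by the relevant electrical standards, and hardware vendors simply build to those standards. Power-up is carried out by the hardware on its own, with no software involvement, since software cannot run until the hardware has power. Once the hardware is running, however, software can manage its power state.
Power management covers two areas: power state management and power saving management. Power state management controls the supply state of the whole system and includes sleep, hibernation, shutdown, and reboot. Power saving management exists because electricity is not free and should be conserved where possible. On handheld devices in particular, energy is not expensive but it is scarce, because battery capacity is very limited. Power saving cannot be pursued blindly, though: performance matters too, and the goal is a balance between performance and power consumption.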
1.1 Power State Management
A computer can only be used after it has been powered on, but we do not use it all the time. When we stop using it for a short while, we can put it into sleep or hibernation, which saves power and still lets us return quickly to a usable state. When we will not use it for a long time, we can shut it down, which saves even more power at the cost of a full boot the next time. And when the system feels sluggish or gets into a bad state, we can reboot it to return to a clean, stable state.
Sleep is also called Suspend to RAM (STR); hibernation is also called Suspend to Disk (STD). The term suspend is sometimes used for sleep alone and sometimes for sleep and hibernation together. When the system sleeps, its state is kept in memory, so memory must stay powered while other devices can be powered off. When the system hibernates, its state is saved to disk, so the entire system can be powered off, just as in a shutdown. Either way the system can be woken up again. From sleep, many peripherals can wake the system, the keyboard for example. From hibernation, usually only the power button can. Hibernation therefore resembles sleep in that the system state is preserved, and resembles shutdown in that the whole system loses power.
Reboot and shutdown are closely related: a reboot amounts to a shutdown followed by a power-on. Both are implemented by the reboot system call, whose cmd argument selects between them. Shutdown and reboot are handled by the init process: whether we use a command, the desktop's shutdown button, or the power key, the event is eventually delivered to init. On receiving the request, init saves what needs saving, stops all service processes, kills all ordinary processes, and finally calls reboot to shut down or restart.
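1.2 Power Saving Management
When we are not using the computer we can save power by sleeping, hibernating, or shutting down, but there are also many ways to save power while the computer is in use. These fall into two categories: saving power while busy and saving power while idle. Idle-time saving applies when the computer as a whole is still in use but some individual device is momentarily not being used.
The busy-time method is dynamic frequency scaling, which includes CPU frequency scaling (CPUFreq) and device frequency scaling (DevFreq). If a component is in active use and you still want to save power, the only option is to lower its frequency. A lower frequency means lower performance, so the frequency has to be adjusted dynamically according to the current load.
Idle-time saving has more options: CPU idle (CPUIdle), CPU hotplug, CPU isolation, and runtime PM. With CPUIdle, when a CPU has no runnable process its power can be partly switched off for the time being, and restored when there is work to run again. With CPU hotplug, a CPU is hot-removed so that the system no longer assigns tasks to it and it can be powered off completely; when it is hot-added again, power is restored and it resumes normal work. With CPU isolation, a CPU is excluded from ordinary process scheduling, so it can stay in the idle state for long periods and save power. Isolation is not a dedicated power saving mechanism, however: after isolating a CPU we can still move processes onto it explicitly by setting their CPU affinity (for example with sched_setaffinity), and the CPU will then keep running. Its effect lies somewhere between CPUIdle and CPU hotplug. Runtime PM is dynamic power management for devices. A system contains many devices, but not all of them are in use all the time. A camera, for example, is unused most of the time, so it can be kept powered off and powered on only when it is needed.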
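1.3 Power Management Quality of Service
Power saving achieves its goal but also reduces system performance in terms of response latency, bandwidth, throughput, and so on. The kernel therefore provides the PM QoS framework, where QoS stands for Quality of Service. On one side, PM QoS offers interfaces through which its clients state their performance requirements; on the other, it passes those requirements down to the power saving mechanisms, which must satisfy them while saving power. The clients of PM QoS are the kernel and user processes: kernel code calls the PM QoS interface functions directly, while processes read and write device files that PM QoS exposes to user space. A requirement on one performance metric is called a constraint. Constraints are either system-wide, applying to the performance of the whole system, or per-device, applying to the performance of a single device.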
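To make the user-space side a little more concrete, here is a minimal sketch of placing a CPU latency constraint by writing a raw 32-bit microsecond value to a PM QoS device file and keeping it open. On mainline kernels that file is /dev/cpu_dma_latency, and the request lasts only as long as the file stays open. The helper names and the path parameter are assumptions made for this example, not part of any kernel or library API.

```c
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: ask PM QoS to keep CPU wakeup latency at or below
 * max_latency_us. On a real system `path` would be "/dev/cpu_dma_latency";
 * it is a parameter here so the sketch can be tried on an ordinary file.
 * The request stays active only while the returned stream is open.
 */
FILE *pm_qos_request_latency(const char *path, int32_t max_latency_us)
{
    FILE *f = fopen(path, "wb");

    if (!f)
        return NULL;
    /* The value is written as a raw 32-bit integer. */
    if (fwrite(&max_latency_us, sizeof(max_latency_us), 1, f) != 1 ||
        fflush(f) != 0) {
        fclose(f);
        return NULL;
    }
    return f;
}

/* Closing the stream drops the constraint again. */
void pm_qos_release(FILE *f)
{
    if (f)
        fclose(f);
}
```

A latency-sensitive program would call pm_qos_request_latency("/dev/cpu_dma_latency", 20) at start-up, hold the returned stream for as long as it needs the guarantee, and call pm_qos_release when done.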
[Figure: overview of Linux power management]
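2. Sleep and Hibernation
Sleep and hibernation follow a similar overall sequence: pause the running system, save the system state, and cut power to all or most of the hardware. Wake-up runs in the opposite direction: restore power, resume the system, and restore the saved state, after which the machine is usable again. Pausing the system involves syncing file data to disk, freezing almost all processes, suspending devfreq and cpufreq, suspending all devices (by calling each device's suspend callback), disabling interrupts for most peripherals, and taking every CPU except the current one offline. For sleep, memory stays powered, so nothing needs to be saved elsewhere. For hibernation the whole system loses power, so the key system state has to be written to swap. The system can then power down into sleep or hibernation. Many peripherals can wake a sleeping system, whereas a hibernated system is usually woken only by the power button. Waking starts the restore path, which differs between the two: resuming from sleep is essentially the suspend steps in reverse, while resuming from hibernation is a normal boot that restores the saved state from swap near the end of boot.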
2.1 Freezing Processes
Both sleep and hibernation freeze processes, so we look at that step first. User processes are frozen first, then kernel threads; a few special tasks are exempt, and so is the current process. To freeze, the kernel sets the global variable pm_freezing to true and then sends every process a fake signal, which simply wakes it up. Once awake, the process runs, and just before it returns to user space it goes through signal handling. Signal handling begins with a freeze check: if pm_freezing is true and the current process is not exempt from freezing, the process is frozen. Freezing itself is simple: the process's state is set to non-runnable and the scheduler picks another process.
The freeze flow is shown below. The code has been cut down drastically, leaving only the key parts.
linux-src/kernel/power/process.c
    int freeze_processes(void)
    {
        pm_freezing = true;
        return try_to_freeze_tasks(true);
    }

    static int try_to_freeze_tasks(bool user_only)
    {
        struct task_struct *g, *p;

        for_each_process_thread(g, p) {
            freeze_task(p);
        }
        return 0;
    }
linux-src/kernel/freezer.c
    bool freeze_task(struct task_struct *p)
    {
        fake_signal_wake_up(p);
        return true;
    }

    static void fake_signal_wake_up(struct task_struct *p)
    {
        unsigned long flags;

        if (lock_task_sighand(p, &flags)) {
            signal_wake_up(p, 0);
            unlock_task_sighand(p, &flags);
        }
    }
linux-src/arch/x86/kernel/signal.c
    void arch_do_signal_or_restart(struct pt_regs *regs, bool has_signal)
    {
        struct ksignal ksig;

        if (has_signal && get_signal(&ksig)) {
            /* Whee! Actually deliver the signal. */
            handle_signal(&ksig, regs);
            return;
        }
    }
linux-src/kernel/signal.c
    bool get_signal(struct ksignal *ksig)
    {
        try_to_freeze();
        /* the rest of signal handling is omitted here */
    }
linux-src/include/linux/freezer.h
    static inline bool try_to_freeze(void)
    {
        return try_to_freeze_unsafe();
    }

    static inline bool try_to_freeze_unsafe(void)
    {
        if (likely(!freezing(current)))
            return false;
        return __refrigerator(false);
    }

    static inline bool freezing(struct task_struct *p)
    {
        if (likely(!atomic_read(&system_freezing_cnt)))
            return false;
        return freezing_slow_path(p);
    }
linux-src/kernel/freezer.c
    bool freezing_slow_path(struct task_struct *p)
    {
        if (p->flags & (PF_NOFREEZE | PF_SUSPEND_TASK))
            return false;

        if (test_tsk_thread_flag(p, TIF_MEMDIE))
            return false;

        if (pm_nosig_freezing || cgroup_freezing(p))
            return true;

        if (pm_freezing && !(p->flags & PF_KTHREAD))
            return true;

        return false;
    }

    bool __refrigerator(bool check_kthr_stop)
    {
        bool was_frozen = false;
        unsigned int save = get_current_state();

        for (;;) {
            set_current_state(TASK_UNINTERRUPTIBLE);
            /* the check that leaves the loop once the task is thawed is omitted here */
            was_frozen = true;
            schedule();
        }

        set_current_state(save);
        return was_frozen;
    }
The freeze flow is not a single straight-line path: the freezing side sends the fake signal to wake every process, and each process then freezes itself the next time it runs.
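The per-task decision can be modelled in a few lines of user-space C. The sketch below mirrors the checks in freezing_slow_path but leaves out the OOM-victim and cgroup cases; the SK_ flag values and the struct are made up for illustration and are not the kernel's definitions.

```c
#include <stdbool.h>

/* Illustrative flag bits; the real PF_* values live in the kernel headers. */
#define SK_PF_NOFREEZE     0x1u  /* task asked never to be frozen */
#define SK_PF_SUSPEND_TASK 0x2u  /* the task that is driving the suspend */
#define SK_PF_KTHREAD      0x4u  /* kernel thread */

struct sk_freezer_state {
    bool pm_freezing;        /* user-space tasks are being frozen */
    bool pm_nosig_freezing;  /* kernel threads are being frozen as well */
};

/* Should a task with these flags freeze itself, given the global state? */
bool sk_should_freeze(const struct sk_freezer_state *st, unsigned int task_flags)
{
    /* Exempt tasks never freeze, whatever the global state says. */
    if (task_flags & (SK_PF_NOFREEZE | SK_PF_SUSPEND_TASK))
        return false;
    /* Second stage: everything that is not exempt freezes. */
    if (st->pm_nosig_freezing)
        return true;
    /* First stage: only user-space tasks freeze. */
    if (st->pm_freezing && !(task_flags & SK_PF_KTHREAD))
        return true;
    return false;
}
```

With only pm_freezing set, an ordinary task freezes and a kernel thread does not; once pm_nosig_freezing is also set, kernel threads freeze too, which matches the two-stage order described above.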
2.2 The Sleep Flow
The code for the sleep flow is as follows:
linux-src/kernel/power/suspend.c
    int pm_suspend(suspend_state_t state)
    {
        int error;

        if (state <= PM_SUSPEND_ON || state >= PM_SUSPEND_MAX)
            return -EINVAL;

        pr_info("suspend entry (%s)\n", mem_sleep_labels[state]);
        error = enter_state(state);
        if (error) {
            suspend_stats.fail++;
            dpm_save_failed_errno(error);
        } else {
            suspend_stats.success++;
        }
        pr_info("suspend exit\n");
        return error;
    }

    static int enter_state(suspend_state_t state)
    {
        int error;

        if (sync_on_suspend_enabled) {
            ksys_sync_helper();
        }

        error = suspend_prepare(state);
        error = suspend_devices_and_enter(state);
        return error;
    }

    static int suspend_prepare(suspend_state_t state)
    {
        int error;

        trace_suspend_resume(TPS("freeze_processes"), 0, true);
        error = suspend_freeze_processes();
        trace_suspend_resume(TPS("freeze_processes"), 0, false);
        return error;
    }

    int suspend_devices_and_enter(suspend_state_t state)
    {
        int error;
        bool wakeup = false;

        error = platform_suspend_begin(state);
        suspend_console();
        suspend_test_start();
        error = dpm_suspend_start(PMSG_SUSPEND);

        do {
            error = suspend_enter(state, &wakeup);
        } while (!error && !wakeup && platform_suspend_again(state));

     Resume_devices:
        dpm_resume_end(PMSG_RESUME);
        suspend_test_finish("resume devices");
        resume_console();

     Close:
        platform_resume_end(state);
        pm_suspend_target_state = PM_SUSPEND_ON;
        return error;
    }
linux-src/drivers/base/power/main.c
    int dpm_suspend_start(pm_message_t state)
    {
        ktime_t starttime = ktime_get();
        int error;

        error = dpm_prepare(state);
        if (error) {
            suspend_stats.failed_prepare++;
            dpm_save_failed_step(SUSPEND_PREPARE);
        } else
            error = dpm_suspend(state);
        dpm_show_time(starttime, state, error, "start");
        return error;
    }

    int dpm_suspend(pm_message_t state)
    {
        int error = 0;

        devfreq_suspend();
        cpufreq_suspend();

        while (!list_empty(&dpm_prepared_list)) {
            struct device *dev = to_device(dpm_prepared_list.prev);

            get_device(dev);
            error = device_suspend(dev);
        }
        return error;
    }
linux-src/kernel/power/suspend.c
    static int suspend_enter(suspend_state_t state, bool *wakeup)
    {
        int error;

        error = platform_suspend_prepare(state);
        error = dpm_suspend_late(PMSG_SUSPEND);
        error = platform_suspend_prepare_late(state);
        error = dpm_suspend_noirq(PMSG_SUSPEND);
        error = platform_suspend_prepare_noirq(state);
        error = suspend_disable_secondary_cpus();

        arch_suspend_disable_irqs();
        BUG_ON(!irqs_disabled());

        system_state = SYSTEM_SUSPEND;

        error = syscore_suspend();
        if (!error) {
            *wakeup = pm_wakeup_pending();
            if (!(suspend_test(TEST_CORE) || *wakeup)) {
                error = suspend_ops->enter(state);
            } else if (*wakeup) {
                error = -EBUSY;
            }
            syscore_resume();
        }

        system_state = SYSTEM_RUNNING;

        arch_suspend_enable_irqs();
        BUG_ON(irqs_disabled());

     Enable_cpus:
        suspend_enable_secondary_cpus();

     Platform_wake:
        platform_resume_noirq(state);
        dpm_resume_noirq(PMSG_RESUME);

     Platform_early_resume:
        platform_resume_early(state);

     Devices_early_resume:
        dpm_resume_early(PMSG_RESUME);

     Platform_finish:
        platform_resume_finish(state);
        return error;
    }
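The labels in suspend_enter hint at the overall shape of the function: phases are entered in order on the way down, and on wake-up (or on failure part-way) the phases that completed are undone in reverse. The user-space sketch below models only that ordering. The struct, the callback type, and the two sample callbacks are hypothetical and stand in for the real suspend and resume steps.

```c
#include <errno.h>
#include <stddef.h>

#define SK_MAX_PHASES 8

/* One suspend phase; `suspend` returns 0 on success, negative on error. */
struct sk_phase {
    const char *name;
    int (*suspend)(void);
};

/* Records which phases were resumed, in the order it happened. */
struct sk_trace {
    const char *resumed[SK_MAX_PHASES];
    size_t n_resumed;
};

/* Sample callbacks for trying the sketch out. */
int sk_phase_ok(void)   { return 0; }
int sk_phase_busy(void) { return -EBUSY; }

int sk_run_suspend(const struct sk_phase *phases, size_t n, struct sk_trace *trace)
{
    size_t done = 0;
    int error = 0;

    trace->n_resumed = 0;

    /* Way down: stop at the first phase that reports an error. */
    while (done < n && done < SK_MAX_PHASES) {
        error = phases[done].suspend();
        if (error)
            break;
        done++;
    }

    /* Way up: resume only the phases that completed, last one first. */
    while (done > 0) {
        done--;
        trace->resumed[trace->n_resumed++] = phases[done].name;
    }
    return error;
}
```

If the third of four phases fails, only the first two are resumed, second before first, which is the effect the goto labels achieve in the kernel function.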
2.3 The Hibernation Flow
The code for the hibernation flow is as follows:
linux-src/kernel/power/hibernate.c
    int hibernate(void)
    {
        int error;
        unsigned int flags = 0;

        lock_system_sleep();
        pm_prepare_console();
        ksys_sync_helper();
        error = freeze_processes();
        lock_device_hotplug();
        error = create_basic_memory_bitmaps();
        error = hibernation_snapshot(hibernation_mode == HIBERNATION_PLATFORM);

        if (in_suspend) {
            pm_pr_dbg("Writing hibernation image.\n");
            error = swsusp_write(flags);
            swsusp_free();
            if (!error) {
                power_down();
            }
        }
        return error;
    }
linux-src/kernel/power/snapshot.c
    int create_basic_memory_bitmaps(void)
    {
        struct memory_bitmap *bm1, *bm2;
        int error = 0;

        bm1 = kzalloc(sizeof(struct memory_bitmap), GFP_KERNEL);
        error = memory_bm_create(bm1, GFP_KERNEL, PG_ANY);
        bm2 = kzalloc(sizeof(struct memory_bitmap), GFP_KERNEL);
        error = memory_bm_create(bm2, GFP_KERNEL, PG_ANY);

        forbidden_pages_map = bm1;
        free_pages_map = bm2;
        mark_nosave_pages(forbidden_pages_map);
        return 0;
    }
linux-src/kernel/power/hibernate.c
    int hibernation_snapshot(int platform_mode)
    {
        pm_message_t msg;
        int error;

        error = platform_begin(platform_mode);
        error = hibernate_preallocate_memory();
        error = freeze_kernel_threads();
        error = dpm_prepare(PMSG_FREEZE);
        suspend_console();
        pm_restrict_gfp_mask();
        error = dpm_suspend(PMSG_FREEZE);
        error = create_image(platform_mode);

        msg = in_suspend ? (error ? PMSG_RECOVER : PMSG_THAW) : PMSG_RESTORE;
        dpm_resume(msg);
        resume_console();
        dpm_complete(msg);

     Close:
        platform_end(platform_mode);
        return error;
    }

    static void power_down(void)
    {
        switch (hibernation_mode) {
        case HIBERNATION_REBOOT:
            kernel_restart(NULL);
            break;
        case HIBERNATION_PLATFORM:
            hibernation_platform_enter();
            fallthrough;
        case HIBERNATION_SHUTDOWN:
            if (pm_power_off)
                kernel_power_off();
            break;
        }
        kernel_halt();
        /*
         * Valid image is on the disk, if we continue we risk serious data
         * corruption after resume.
         */
        pr_crit("Power down manually\n");
        while (1)
            cpu_relax();
    }
That is the hibernation path. Resuming from hibernation starts with a normal boot and then loads the previously saved data from the swap partition:
linux-src/kernel/power/hibernate.c
    late_initcall_sync(software_resume);

    static int software_resume(void)
    {
        int error;

        if (swsusp_resume_device)
            goto Check_image;

        if (resume_delay) {
            pr_info("Waiting %dsec before reading resume device ...\n",
                    resume_delay);
            ssleep(resume_delay);
        }

        /* Check if the device is there */
        swsusp_resume_device = name_to_dev_t(resume_file);
        if (!swsusp_resume_device) {
            wait_for_device_probe();

            if (resume_wait) {
                while ((swsusp_resume_device = name_to_dev_t(resume_file)) == 0)
                    msleep(10);
                async_synchronize_full();
            }

            swsusp_resume_device = name_to_dev_t(resume_file);
            if (!swsusp_resume_device) {
                error = -ENODEV;
                goto Unlock;
            }
        }

     Check_image:
        pm_pr_dbg("Hibernation image partition %d:%d present\n",
                  MAJOR(swsusp_resume_device), MINOR(swsusp_resume_device));

        pm_pr_dbg("Looking for hibernation image.\n");
        error = swsusp_check();
        if (error)
            goto Unlock;

        /* The snapshot device should not be opened while we're running */
        if (!hibernate_acquire()) {
            error = -EBUSY;
            swsusp_close(FMODE_READ | FMODE_EXCL);
            goto Unlock;
        }

        error = freeze_processes();
        error = freeze_kernel_threads();
        error = load_image_and_restore();
        thaw_processes();
     Finish:
        pm_notifier_call_chain(PM_POST_RESTORE);
     Restore:
        pm_restore_console();
        pr_info("resume failed (%d)\n", error);
        hibernate_release();
        /* For success case, the suspend path will release the lock */
     Unlock:
        mutex_unlock(&system_transition_mutex);
        pm_pr_dbg("Hibernation image not present or could not be loaded.\n");
        return error;
     Close_Finish:
        swsusp_close(FMODE_READ | FMODE_EXCL);
        goto Finish;
    }

    static int load_image_and_restore(void)
    {
        int error;
        unsigned int flags;

        lock_device_hotplug();
        error = create_basic_memory_bitmaps();
        error = swsusp_read(&flags);
        swsusp_close(FMODE_READ | FMODE_EXCL);
        error = hibernation_restore(flags & SF_PLATFORM_MODE);
        swsusp_free();
        free_basic_memory_bitmaps();
     Unlock:
        unlock_device_hotplug();
        return error;
    }

    int hibernation_restore(int platform_mode)
    {
        int error;

        pm_prepare_console();
        suspend_console();
        pm_restrict_gfp_mask();
        error = dpm_suspend_start(PMSG_QUIESCE);
        if (!error) {
            error = resume_target_kernel(platform_mode);
            /*
             * The above should either succeed and jump to the new kernel,
             * or return with an error. Otherwise things are just
             * undefined, so let's be paranoid.
             */
            BUG_ON(!error);
        }
        dpm_resume_end(PMSG_RECOVER);
        pm_restore_gfp_mask();
        resume_console();
        pm_restore_console();
        return error;
    }
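2.4 Autosleep
As smartphones became widespread, battery life became an ever bigger problem. Earlier phones lasted three to five days, or even more than a week, on one charge, while a smartphone lasts a day or half a day. With no major breakthrough in battery technology, the problem had to be tackled in software. Android's answer is opportunistic suspend: for a phone, being asleep is the normal state and running is the exception, which matches how phones are used, since they sit unused for most of the day. Android added a wakelock module to the kernel: by default the kernel always tries to sleep unless a wakelock prevents it. Any user-space component can take a wakelock to signal that it needs to run and that the system must not sleep. Once user space has released all its wakelocks, the kernel goes to sleep. Wakelocks drew strong criticism from many core kernel maintainers, and the wakelock source was never merged into the mainline kernel. The mainline kernel later reimplemented the wakelock logic under the name autosleep.
The code is as follows: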
linux-src/kernel/power/autosleep.c
    int __init pm_autosleep_init(void)
    {
        autosleep_ws = wakeup_source_register(NULL, "autosleep");
        if (!autosleep_ws)
            return -ENOMEM;

        autosleep_wq = alloc_ordered_workqueue("autosleep", 0);
        if (autosleep_wq)
            return 0;

        wakeup_source_unregister(autosleep_ws);
        return -ENOMEM;
    }

    static void try_to_suspend(struct work_struct *work)
    {
        unsigned int initial_count, final_count;

        if (!pm_get_wakeup_count(&initial_count, true))
            goto out;

        mutex_lock(&autosleep_lock);

        if (!pm_save_wakeup_count(initial_count) ||
            system_state != SYSTEM_RUNNING) {
            mutex_unlock(&autosleep_lock);
            goto out;
        }

        if (autosleep_state == PM_SUSPEND_ON) {
            mutex_unlock(&autosleep_lock);
            return;
        }
        if (autosleep_state >= PM_SUSPEND_MAX)
            hibernate();
        else
            pm_suspend(autosleep_state);

        mutex_unlock(&autosleep_lock);

        if (!pm_get_wakeup_count(&final_count, false))
            goto out;

        /*
         * If the wakeup occurred for an unknown reason, wait to prevent the
         * system from trying to suspend and waking up in a tight loop.
         */
        if (final_count == initial_count)
            schedule_timeout_uninterruptible(HZ / 2);

     out:
        queue_up_suspend_work();
    }
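The part of try_to_suspend that guards against racing with wakeup events can be modelled in user space. The sketch below is a simplified stand-in for pm_get_wakeup_count and pm_save_wakeup_count: the struct and the sk_ function names are invented for the example, and the real functions can additionally block and take locks.

```c
#include <stdbool.h>

struct sk_wakeup {
    unsigned int count;        /* wakeup events fully handled so far */
    unsigned int in_progress;  /* wakeup events still being handled */
    unsigned int saved;        /* count recorded just before suspending */
};

/* Reads the event count; reports failure while events are in flight. */
bool sk_get_wakeup_count(const struct sk_wakeup *w, unsigned int *count)
{
    *count = w->count;
    return w->in_progress == 0;
}

/* Succeeds only if nothing has happened since `count` was read. */
bool sk_save_wakeup_count(struct sk_wakeup *w, unsigned int count)
{
    if (w->count != count || w->in_progress != 0)
        return false;
    w->saved = count;
    return true;
}

/* True if it is safe to start suspending right now. */
bool sk_may_suspend(struct sk_wakeup *w)
{
    unsigned int initial;

    if (!sk_get_wakeup_count(w, &initial))
        return false;
    return sk_save_wakeup_count(w, initial);
}
```

Reading the count first and saving it afterwards closes the window in which an event could arrive unnoticed: if the count moved in between, the save fails and the suspend attempt is abandoned and queued again.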
3. Shutdown and Reboot
Shutdown and reboot are the operations we use most often on a computer. A reboot is also a shutdown, only followed by a power-on, so the two are covered together; their code is in fact implemented together as well. In what follows, shutdown refers to both. Shutdown has two parts: user-space handling and kernel handling. A normal shutdown cannot simply pull the plug, nor can it have the kernel power off immediately, because user space is running many processes that must be wound down properly. Since init is the ancestor of all user-space processes, it is the natural place to handle a shutdown request. Whether you shut down from the command line, from a button in the graphical interface, or by holding the power key, the request ends up with init. Init first stops the service processes, then kills the remaining user-space processes, and finally uses the reboot system call to ask the kernel to complete the shutdown.
3.1 User-Space Handling
When we run the reboot command or shut down from the graphical interface, the request is ultimately sent to init. Init first stops the service processes (daemons), then sends SIGTERM to all other processes to give them a chance to exit gracefully, and sleeps for a while (typically 3 s) to wait for them. It then sends SIGKILL to any processes that still have not exited, forcing them to terminate. Finally init calls sync to flush file data from memory to disk and asks the kernel to shut down through the reboot system call.
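The SIGTERM, wait, SIGKILL sequence is easy to try on a single child process. The sketch below is not taken from any particular init implementation; the function names, the 10 ms polling interval, and the helper that spawns a test child are all assumptions made for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Stop one child: SIGTERM first, SIGKILL once the grace period is over.
 * Returns SIGTERM if the child went away in time, SIGKILL if it had to
 * be killed, or -1 on error.
 */
int sk_stop_child(pid_t pid, int grace_ms)
{
    const struct timespec tick = { 0, 10 * 1000 * 1000 };  /* 10 ms */
    int waited_ms = 0;
    int status;

    if (kill(pid, SIGTERM) != 0)
        return -1;

    while (waited_ms < grace_ms) {
        pid_t r = waitpid(pid, &status, WNOHANG);

        if (r == pid)
            return SIGTERM;
        if (r < 0)
            return -1;
        nanosleep(&tick, NULL);
        waited_ms += 10;
    }

    if (kill(pid, SIGKILL) != 0)
        return -1;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return SIGKILL;
}

/* Test helper: fork a child that just waits, optionally ignoring SIGTERM. */
pid_t sk_spawn_sleeper(int ignore_term)
{
    void (*old)(int) = SIG_DFL;
    pid_t pid;

    /* Ignore before forking so the child inherits it without a race. */
    if (ignore_term)
        old = signal(SIGTERM, SIG_IGN);

    pid = fork();
    if (pid == 0) {
        for (;;)
            pause();
    }

    if (ignore_term)
        signal(SIGTERM, old);
    return pid;
}
```

A well-behaved child is reaped during the grace period, while a child that ignores SIGTERM is only removed by the SIGKILL that follows, which is why init sends both.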
3.2 Kernel Handling
Here is the kernel's implementation of the reboot system call:
linux-src/kernel/reboot.c
    SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
                    void __user *, arg)
    {
        struct pid_namespace *pid_ns = task_active_pid_ns(current);
        char buffer[256];
        int ret = 0;

        /* We only trust the superuser with rebooting the system. */
        if (!ns_capable(pid_ns->user_ns, CAP_SYS_BOOT))
            return -EPERM;

        /* For safety, we require "magic" arguments. */
        if (magic1 != LINUX_REBOOT_MAGIC1 ||
                (magic2 != LINUX_REBOOT_MAGIC2 &&
                magic2 != LINUX_REBOOT_MAGIC2A &&
                magic2 != LINUX_REBOOT_MAGIC2B &&
                magic2 != LINUX_REBOOT_MAGIC2C))
            return -EINVAL;

        /*
         * If pid namespaces are enabled and the current task is in a child
         * pid_namespace, the command is handled by reboot_pid_ns() which will
         * call do_exit().
         */
        ret = reboot_pid_ns(pid_ns, cmd);
        if (ret)
            return ret;

        /* Instead of trying to make the power_off code look like
         * halt when pm_power_off is not set do it the easy way.
         */
        if ((cmd == LINUX_REBOOT_CMD_POWER_OFF) && !pm_power_off)
            cmd = LINUX_REBOOT_CMD_HALT;

        mutex_lock(&system_transition_mutex);
        switch (cmd) {
        case LINUX_REBOOT_CMD_RESTART:
            kernel_restart(NULL);
            break;

        case LINUX_REBOOT_CMD_CAD_ON:
            C_A_D = 1;
            break;

        case LINUX_REBOOT_CMD_CAD_OFF:
            C_A_D = 0;
            break;

        case LINUX_REBOOT_CMD_HALT:
            kernel_halt();
            do_exit(0);
            panic("cannot halt");

        case LINUX_REBOOT_CMD_POWER_OFF:
            kernel_power_off();
            do_exit(0);
            break;

        case LINUX_REBOOT_CMD_RESTART2:
            ret = strncpy_from_user(&buffer[0], arg, sizeof(buffer) - 1);
            if (ret < 0) {
                ret = -EFAULT;
                break;
            }
            buffer[sizeof(buffer) - 1] = '\0';

            kernel_restart(buffer);
            break;

    #ifdef CONFIG_KEXEC_CORE
        case LINUX_REBOOT_CMD_KEXEC:
            ret = kernel_kexec();
            break;
    #endif

    #ifdef CONFIG_HIBERNATION
        case LINUX_REBOOT_CMD_SW_SUSPEND:
            ret = hibernate();
            break;
    #endif

        default:
            ret = -EINVAL;
            break;
        }
        mutex_unlock(&system_transition_mutex);
        return ret;
    }

    void kernel_power_off(void)
    {
        kernel_shutdown_prepare(SYSTEM_POWER_OFF);
        if (pm_power_off_prepare)
            pm_power_off_prepare();
        migrate_to_reboot_cpu();
        syscore_shutdown();
        pr_emerg("Power down\n");
        kmsg_dump(KMSG_DUMP_SHUTDOWN);
        machine_power_off();
    }
The shutdown request is ultimately carried out by platform-specific code.
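Two of the checks at the top of the system call can be reproduced in user space using the constants exported in the linux/reboot.h header. The sketch below only evaluates the checks and never calls reboot; the sk_ function names are invented for the example, and it assumes the Linux UAPI headers are installed.

```c
#include <stdbool.h>
#include <linux/reboot.h>

/* The same test the system call applies to its two magic arguments. */
bool sk_reboot_magic_ok(unsigned int magic1, unsigned int magic2)
{
    return magic1 == LINUX_REBOOT_MAGIC1 &&
           (magic2 == LINUX_REBOOT_MAGIC2 ||
            magic2 == LINUX_REBOOT_MAGIC2A ||
            magic2 == LINUX_REBOOT_MAGIC2B ||
            magic2 == LINUX_REBOOT_MAGIC2C);
}

/* Power-off degrades to halt when the platform has no power-off hook. */
unsigned int sk_effective_cmd(unsigned int cmd, bool have_power_off)
{
    if (cmd == LINUX_REBOOT_CMD_POWER_OFF && !have_power_off)
        return LINUX_REBOOT_CMD_HALT;
    return cmd;
}
```

The magic numbers exist purely as a guard against accidental calls: a caller that passes anything else gets -EINVAL before any command is looked at.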
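4. CPU Dynamic Frequency Scaling
Early CPUs ran at a fixed frequency, although some enthusiasts overclocked them. CPU vendors later added official support for dynamic frequency scaling, but left the questions of when to scale, who scales, and what frequency to choose to the kernel. The Linux kernel answers them with the CPUFreq framework, which separates the roles cleanly and gives each its own responsibility. The framework has three parts: the CPUFreq governor, the CPUFreq core, and the CPUFreq driver. The governor is the decision maker: it decides when to change frequency and to what value. The driver is the executor: it talks to the actual hardware. The core is the intermediary that coordinates the two. A system can have several candidate governors but only one current governor; every candidate registers itself with the core, and user space chooses which one is current. A system has exactly one driver, written by the CPU vendor; building for a given platform builds that platform's driver, which registers itself with the core.
[Figure: overall structure of the CPUFreq framework]
The figure includes a CPUFreq policy. Frequency cannot always be set for each CPU individually: some CPUs have to be scaled together as a group. The CPUFreq policy abstraction represents such a group so that it can be managed as a unit.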
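4.1 CPUFreq Core
The core defines a global list, cpufreq_governor_list, and governors register themselves with cpufreq_register_governor. Many governors can be registered at once, but for each policy only one governor is in effect. The core also defines a global variable, cpufreq_driver, which is set through cpufreq_register_driver. A system can have only one registered driver; a second registration returns an error. Finally, the core defines the global list cpufreq_policy_list, the list of policies, where a policy represents a set of CPUs whose frequency must change together.
We start with the definition of a governor and its registration function: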
linux-src/include/linux/cpufreq.h
    struct cpufreq_governor {
        char    name[CPUFREQ_NAME_LEN];
        int     (*init)(struct cpufreq_policy *policy);
        void    (*exit)(struct cpufreq_policy *policy);
        int     (*start)(struct cpufreq_policy *policy);
        void    (*stop)(struct cpufreq_policy *policy);
        void    (*limits)(struct cpufreq_policy *policy);
        ssize_t (*show_setspeed)(struct cpufreq_policy *policy, char *buf);
        int     (*store_setspeed)(struct cpufreq_policy *policy, unsigned int freq);
        struct list_head governor_list;
        struct module    *owner;
        u8               flags;
    };
linux-src/drivers/cpufreq/cpufreq.c
    int cpufreq_register_governor(struct cpufreq_governor *governor)
    {
        int err;

        if (!governor)
            return -EINVAL;

        if (cpufreq_disabled())
            return -ENODEV;

        mutex_lock(&cpufreq_governor_mutex);

        err = -EBUSY;
        if (!find_governor(governor->name)) {
            err = 0;
            list_add(&governor->governor_list, &cpufreq_governor_list);
        }

        mutex_unlock(&cpufreq_governor_mutex);
        return err;
    }
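Registration is simple: the governor is just added to the list. As for the governor's function pointers, init is called when the governor is attached to a policy, and exit when the old governor is replaced. start is called when the governor begins to take effect, and stop when it ceases to. limits is called when the policy's frequency limits change, so that the governor can adjust the frequency accordingly.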
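The registration rule (reject a missing governor, reject a duplicate name, otherwise add it) can be shown with a small user-space model. The fixed-size array below stands in for the kernel's linked list, and all sk_ names and the size limit are made up for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SK_NAME_LEN 16
#define SK_MAX_GOVERNORS 8

struct sk_governor {
    char name[SK_NAME_LEN];
};

struct sk_registry {
    const struct sk_governor *slots[SK_MAX_GOVERNORS];
    size_t count;
};

/* Look a governor up by name; NULL if it is not registered. */
const struct sk_governor *sk_find_governor(const struct sk_registry *r,
                                           const char *name)
{
    for (size_t i = 0; i < r->count; i++)
        if (strncmp(r->slots[i]->name, name, SK_NAME_LEN) == 0)
            return r->slots[i];
    return NULL;
}

/* Same outcomes as the kernel function: -EINVAL, -EBUSY, or 0. */
int sk_register_governor(struct sk_registry *r, const struct sk_governor *g)
{
    if (!g)
        return -EINVAL;
    if (sk_find_governor(r, g->name))
        return -EBUSY;      /* a governor with this name already exists */
    if (r->count == SK_MAX_GOVERNORS)
        return -ENOMEM;     /* limit of this sketch only */
    r->slots[r->count++] = g;
    return 0;
}
```

Because names must be unique, user space can later select a governor by name alone, which is how the choice of current governor is expressed.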
Next, the definition of a driver and its registration function:
linux-src/include/linux/cpufreq.h
    struct cpufreq_driver {
        char    name[CPUFREQ_NAME_LEN];
        u16     flags;
        void    *driver_data;

        /* needed by all drivers */
        int     (*init)(struct cpufreq_policy *policy);
        int     (*verify)(struct cpufreq_policy_data *policy);

        /* define one out of two */
        int     (*setpolicy)(struct cpufreq_policy *policy);

        int     (*target)(struct cpufreq_policy *policy,
                          unsigned int target_freq,
                          unsigned int relation);   /* Deprecated */
        int     (*target_index)(struct cpufreq_policy *policy,
                                unsigned int index);
        unsigned int (*fast_switch)(struct cpufreq_policy *policy,
                                    unsigned int target_freq);
        /*
         * ->fast_switch() replacement for drivers that use an internal
         * representation of performance levels and can pass hints other than
         * the target performance level to the hardware.
         */
        void    (*adjust_perf)(unsigned int cpu,
                               unsigned long min_perf,
                               unsigned long target_perf,
                               unsigned long capacity);

        /*
         * Only for drivers with target_index() and CPUFREQ_ASYNC_NOTIFICATION
         * unset.
         *
         * get_intermediate should return a stable intermediate frequency
         * platform wants to switch to and target_intermediate() should set CPU
         * to that frequency, before jumping to the frequency corresponding
         * to 'index'. Core will take care of sending notifications and driver
         * doesn't have to handle them in target_intermediate() or
         * target_index().
         *
         * Drivers can return '0' from get_intermediate() in case they don't
         * wish to switch to intermediate frequency for some target frequency.
         * In that case core will directly call ->target_index().
         */
        unsigned int (*get_intermediate)(struct cpufreq_policy *policy,
                                         unsigned int index);
        int     (*target_intermediate)(struct cpufreq_policy *policy,
                                       unsigned int index);

        /* should be defined, if possible */
        unsigned int (*get)(unsigned int cpu);

        /* Called to update policy limits on firmware notifications. */
        void    (*update_limits)(unsigned int cpu);

        /* optional */
        int     (*bios_limit)(int cpu, unsigned int *limit);

        int     (*online)(struct cpufreq_policy *policy);
        int     (*offline)(struct cpufreq_policy *policy);
        int     (*exit)(struct cpufreq_policy *policy);
        int     (*suspend)(struct cpufreq_policy *policy);
        int     (*resume)(struct cpufreq_policy *policy);

        struct freq_attr **attr;

        /* platform specific boost support code */
        bool    boost_enabled;
        int     (*set_boost)(struct cpufreq_policy *policy, int state);

        /*
         * Set by drivers that want to register with the energy model after the
         * policy is properly initialized, but before the governor is started.
         */
        void    (*register_em)(struct cpufreq_policy *policy);
    };
linux-src/drivers/cpufreq/cpufreq.c
    int cpufreq_register_driver(struct cpufreq_driver *driver_data)
    {
        unsigned long flags;
        int ret;

        if (cpufreq_disabled())
            return -ENODEV;

        /*
         * The cpufreq core depends heavily on the availability of device
         * structure, make sure they are available before proceeding further.
         */
        if (!get_cpu_device(0))
            return -EPROBE_DEFER;

        if (!driver_data || !driver_data->verify || !driver_data->init ||
            !(driver_data->setpolicy || driver_data->target_index ||
              driver_data->target) ||
            (driver_data->setpolicy && (driver_data->target_index ||
              driver_data->target)) ||
            (!driver_data->get_intermediate != !driver_data->target_intermediate) ||
            (!driver_data->online != !driver_data->offline))
            return -EINVAL;

        pr_debug("trying to register driver %s\n", driver_data->name);

        /* Protect against concurrent CPU online/offline. */
        cpus_read_lock();

        write_lock_irqsave(&cpufreq_driver_lock, flags);
        if (cpufreq_driver) {
            write_unlock_irqrestore(&cpufreq_driver_lock, flags);
            ret = -EEXIST;
            goto out;
        }
        cpufreq_driver = driver_data;
        write_unlock_irqrestore(&cpufreq_driver_lock, flags);

        /*
         * Mark support for the scheduler's frequency invariance engine for
         * drivers that implement target(), target_index() or fast_switch().
         */
        if (!cpufreq_driver->setpolicy) {
            static_branch_enable_cpuslocked(&cpufreq_freq_invariance);
            pr_debug("supports frequency invariance");
        }

        if (driver_data->setpolicy)
            driver_data->flags |= CPUFREQ_CONST_LOOPS;

        if (cpufreq_boost_supported()) {
            ret = create_boost_sysfs_file();
            if (ret)
                goto err_null_driver;
        }

        ret = subsys_interface_register(&cpufreq_interface);
        if (ret)
            goto err_boost_unreg;

        if (unlikely(list_empty(&cpufreq_policy_list))) {
            /* if all ->init() calls failed, unregister */
            ret = -ENODEV;
            pr_debug("%s: No CPU initialized for driver %s\n", __func__,
                     driver_data->name);
            goto err_if_unreg;
        }

        ret = cpuhp_setup_state_nocalls_cpuslocked(CPUHP_AP_ONLINE_DYN,
                                                   "cpufreq:online",
                                                   cpuhp_cpufreq_online,
                                                   cpuhp_cpufreq_offline);
        if (ret < 0)
            goto err_if_unreg;
        hp_online = ret;
        ret = 0;

        pr_debug("driver %s up and running\n", driver_data->name);
        goto out;

    err_if_unreg:
        subsys_interface_unregister(&cpufreq_interface);
    err_boost_unreg:
        remove_boost_sysfs_file();
    err_null_driver:
        write_lock_irqsave(&cpufreq_driver_lock, flags);
        cpufreq_driver = NULL;
        write_unlock_irqrestore(&cpufreq_driver_lock, flags);
    out:
        cpus_read_unlock();
        return ret;
    }
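This registration function is also simple at its heart: it assigns the global variable cpufreq_driver. Among the driver's function pointers the most important are target and target_index, which actually set the frequency of the target policy. target is the older interface, kept for compatibility; target_index is the recommended one.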
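With target_index the driver receives an index into the policy's frequency table rather than a raw frequency, so a requested frequency first has to be resolved to a table entry. The sketch below shows one such resolution rule, the "lowest frequency at or above the target" rule that the kernel names CPUFREQ_RELATION_L. The function name, the unsorted-table handling, and the fallback to the highest entry are choices made for this example, not a copy of the kernel's lookup code.

```c
#include <stddef.h>

/*
 * Return the index of the lowest table frequency that is >= target_khz.
 * If every entry is below the target, fall back to the highest entry.
 * Returns -1 for an empty table. The table need not be sorted.
 */
int sk_pick_index_at_or_above(const unsigned int *table_khz, size_t n,
                              unsigned int target_khz)
{
    int best = -1;     /* lowest entry >= target seen so far */
    int highest = -1;  /* highest entry seen so far, used as fallback */

    for (size_t i = 0; i < n; i++) {
        unsigned int f = table_khz[i];

        if (highest < 0 || f > table_khz[highest])
            highest = (int)i;
        if (f >= target_khz && (best < 0 || f < table_khz[best]))
            best = (int)i;
    }
    return best >= 0 ? best : highest;
}
```

For a table of 600000, 1200000, 1800000 and 2400000 kHz, a request for 1000000 kHz resolves to the 1200000 kHz entry, so the CPU never runs slower than the governor asked for.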
Next, the definition of a policy and the function that brings it online:
linux-src/include/linux/cpufreq.h
    struct cpufreq_policy {
        /* CPUs sharing clock, require sw coordination */
        cpumask_var_t   cpus;           /* Online CPUs only */
        cpumask_var_t   related_cpus;   /* Online + Offline CPUs */
        cpumask_var_t   real_cpus;      /* Related and present */

        unsigned int    shared_type;    /* ACPI: ANY or ALL affected CPUs
                                           should set cpufreq */
        unsigned int    cpu;            /* cpu managing this policy, must be online */

        struct clk      *clk;
        struct cpufreq_cpuinfo cpuinfo; /* see above */

        unsigned int    min;            /* in kHz */
        unsigned int    max;            /* in kHz */
        unsigned int    cur;            /* in kHz, only needed if cpufreq
                                         * governors are used */
        unsigned int    suspend_freq;   /* freq to set during suspend */

        unsigned int    policy;         /* see above */
        unsigned int    last_policy;    /* policy before unplug */
        struct cpufreq_governor *governor;  /* see below */
        void            *governor_data;
        char            last_governor[CPUFREQ_NAME_LEN];  /* last governor used */

        struct work_struct update;      /* if update_policy() needs to be
                                         * called, but you're in IRQ context */

        struct freq_constraints constraints;
        struct freq_qos_request *min_freq_req;
        struct freq_qos_request *max_freq_req;

        struct cpufreq_frequency_table *freq_table;
        enum cpufreq_table_sorting freq_table_sorted;

        struct list_head policy_list;
        struct kobject   kobj;
        struct completion kobj_unregister;

        /*
         * The rules for this semaphore:
         * - Any routine that wants to read from the policy structure will
         *   do a down_read on this semaphore.
         * - Any routine that will write to the policy structure and/or may take away
         *   the policy altogether (eg. CPU hotplug), will hold this lock in write
         *   mode before doing so.
         */
        struct rw_semaphore rwsem;

        /*
         * Fast switch flags:
         * - fast_switch_possible should be set by the driver if it can
         *   guarantee that frequency can be changed on any CPU sharing the
         *   policy and that the change will affect all of the policy CPUs then.
         * - fast_switch_enabled is to be set by governors that support fast
         *   frequency switching with the help of cpufreq_enable_fast_switch().
         */
        bool            fast_switch_possible;
        bool            fast_switch_enabled;

        /*
         * Set if the CPUFREQ_GOV_STRICT_TARGET flag is set for the current
         * governor.
         */
        bool            strict_target;

        /*
         * Preferred average time interval between consecutive invocations of
         * the driver to set the frequency for this policy. To be set by the
         * scaling driver (0, which is the default, means no preference).
         */
        unsigned int    transition_delay_us;

        /*
         * Remote DVFS flag (Not added to the driver structure as we don't want
         * to access another structure from scheduler hotpath).
         *
         * Should be set if CPUs can do DVFS on behalf of other CPUs from
         * different cpufreq policies.
         */
        bool            dvfs_possible_from_any_cpu;

        /* Cached frequency lookup from cpufreq_driver_resolve_freq. */
        unsigned int    cached_target_freq;
        unsigned int    cached_resolved_idx;

        /* Synchronization for frequency transitions */
        bool            transition_ongoing;  /* Tracks transition status */
        spinlock_t      transition_lock;
        wait_queue_head_t transition_wait;
        struct task_struct *transition_task; /* Task which is doing the transition */

        /* cpufreq-stats */
        struct cpufreq_stats *stats;

        /* For cpufreq driver's internal use */
        void            *driver_data;

        /* Pointer to the cooling device if used for thermal mitigation */
        struct thermal_cooling_device *cdev;

        struct notifier_block nb_min;
        struct notifier_block nb_max;
    };
linux-src/drivers/cpufreq/cpufreq.c
    static int cpufreq_online(unsigned int cpu)
    {
        struct cpufreq_policy *policy;
        bool new_policy;
        unsigned long flags;
        unsigned int j;
        int ret;

        pr_debug("%s: bringing CPU%u online\n", __func__, cpu);

        /* Check if this CPU already has a policy to manage it */
        policy = per_cpu(cpufreq_cpu_data, cpu);
        if (policy) {
            WARN_ON(!cpumask_test_cpu(cpu, policy->related_cpus));
            if (!policy_is_inactive(policy))
                return cpufreq_add_policy_cpu(policy, cpu);

            /* This is the only online CPU for the policy. Start over. */
            new_policy = false;
            down_write(&policy->rwsem);
            policy->cpu = cpu;
            policy->governor = NULL;
            up_write(&policy->rwsem);
        } else {
            new_policy = true;
            policy = cpufreq_policy_alloc(cpu);
            if (!policy)
                return -ENOMEM;
        }

        if (!new_policy && cpufreq_driver->online) {
            ret = cpufreq_driver->online(policy);
            if (ret) {
                pr_debug("%s: %d: initialization failed\n", __func__,
                         __LINE__);
                goto out_exit_policy;
            }

            /* Recover policy->cpus using related_cpus */
            cpumask_copy(policy->cpus, policy->related_cpus);
        } else {
            cpumask_copy(policy->cpus, cpumask_of(cpu));

            /*
             * Call driver. From then on the cpufreq must be able
             * to accept all calls to ->verify and ->setpolicy for this CPU.
             */
            ret = cpufreq_driver->init(policy);
            if (ret) {
                pr_debug("%s: %d: initialization failed\n", __func__,
                         __LINE__);
                goto out_free_policy;
            }

            /*
             * The initialization has succeeded and the policy is online.
             * If there is a problem with its frequency table, take it
             * offline and drop it.
             */
            ret = cpufreq_table_validate_and_sort(policy);
            if (ret)
                goto out_offline_policy;

            /* related_cpus should at least include policy->cpus. */
            cpumask_copy(policy->related_cpus, policy->cpus);
        }

        down_write(&policy->rwsem);
        /*
         * affected cpus must always be the one, which are online. We aren't
         * managing offline cpus here.
         */
        cpumask_and(policy->cpus, policy->cpus, cpu_online_mask);

        if (new_policy) {
            for_each_cpu(j, policy->related_cpus) {
                per_cpu(cpufreq_cpu_data, j) = policy;
                add_cpu_dev_symlink(policy, j, get_cpu_device(j));
            }

            policy->min_freq_req = kzalloc(2 * sizeof(*policy->min_freq_req),
                                           GFP_KERNEL);
            if (!policy->min_freq_req) {
                ret = -ENOMEM;
                goto out_destroy_policy;
            }

            ret = freq_qos_add_request(&policy->constraints,
                                       policy->min_freq_req, FREQ_QOS_MIN,
                                       FREQ_QOS_MIN_DEFAULT_VALUE);
            if (ret < 0) {
                /*
                 * So we don't call freq_qos_remove_request() for an
                 * uninitialized request.
                 */
                kfree(policy->min_freq_req);
                policy->min_freq_req = NULL;
                goto out_destroy_policy;
            }

            /*
             * This must be initialized right here to avoid calling
             * freq_qos_remove_request() on uninitialized request in case
             * of errors.
             */
            policy->max_freq_req = policy->min_freq_req + 1;

            ret = freq_qos_add_request(&policy->constraints,
                                       policy->max_freq_req, FREQ_QOS_MAX,
                                       FREQ_QOS_MAX_DEFAULT_VALUE);
            if (ret < 0) {
                policy->max_freq_req = NULL;
                goto out_destroy_policy;
            }

            blocking_notifier_call_chain(&cpufreq_policy_notifier_list,
                                         CPUFREQ_CREATE_POLICY, policy);
        }

        if (cpufreq_driver->get && has_target()) {
            policy->cur = cpufreq_driver->get(policy->cpu);
            if (!policy->cur) {
                ret = -EIO;
                pr_err("%s: ->get() failed\n", __func__);
                goto out_destroy_policy;
            }
        }

        /*
         * Sometimes boot loaders set CPU frequency to a value outside of
         * frequency table present with cpufreq core. In such cases CPU might be
         * unstable if it has to run on that frequency for long duration of time
         * and so its better to set it to a frequency which is specified in
         * freq-table. This also makes cpufreq stats inconsistent as
         * cpufreq-stats would fail to register because current frequency of CPU
         * isn't found in freq-table.
         *
         * Because we don't want this change to effect boot process badly, we go
         * for the next freq which is >= policy->cur ('cur' must be set by now,
         * otherwise we will end up setting freq to lowest of the table as 'cur'
         * is initialized to zero).
         *
         * We are passing target-freq as "policy->cur - 1" otherwise
         * __cpufreq_driver_target() would simply fail, as policy->cur will be
         * equal to target-freq.
         */
        if ((cpufreq_driver->flags & CPUFREQ_NEED_INITIAL_FREQ_CHECK)
            && has_target()) {
            unsigned int old_freq = policy->cur;

            /* Are we running at unknown frequency ? */
            ret = cpufreq_frequency_table_get_index(policy, old_freq);
            if (ret == -EINVAL) {
                ret = __cpufreq_driver_target(policy, old_freq - 1,
                                              CPUFREQ_RELATION_L);

                /*
                 * Reaching here after boot in a few seconds may not
                 * mean that system will remain stable at "unknown"
                 * frequency for longer duration. Hence, a BUG_ON().
                 */
                BUG_ON(ret);
                pr_info("%s: CPU%d: Running at unlisted initial frequency: %u KHz, changing to: %u KHz\n",
                        __func__, policy->cpu, old_freq, policy->cur);
            }
        }

        if (new_policy) {
            ret = cpufreq_add_dev_interface(policy);
            if (ret)
                goto out_destroy_policy;

            cpufreq_stats_create_table(policy);

            write_lock_irqsave(&cpufreq_driver_lock, flags);
            list_add(&policy->policy_list, &cpufreq_policy_list);
            write_unlock_irqrestore(&cpufreq_driver_lock, flags);

            /*
             * Register with the energy model before
             * sched_cpufreq_governor_change() is called, which will result
             * in rebuilding of the sched domains, which should only be done
             * once the energy model is properly initialized for the policy
             * first.
             *
             * Also, this should be called before the policy is registered
             * with cooling framework.
*/ if (cpufreq_driver->register_em) cpufreq_driver->register_em(policy); } ret = cpufreq_init_policy(policy); if (ret) { pr_err("%s: Failed to initialize policy for cpu: %d (%d) ", __func__, cpu, ret); goto out_destroy_policy; } up_write(&policy->rwsem); kobject_uevent(&policy->kobj, KOBJ_ADD); if (cpufreq_thermal_control_enabled(cpufreq_driver)) policy->cdev = of_cpufreq_cooling_register(policy); pr_debug("initialization complete "); return 0; out_destroy_policy: for_each_cpu(j, policy->real_cpus) remove_cpu_dev_symlink(policy, get_cpu_device(j)); up_write(&policy->rwsem); out_offline_policy: if (cpufreq_driver->offline) cpufreq_driver->offline(policy); out_exit_policy: if (cpufreq_driver->exit) cpufreq_driver->exit(policy); out_free_policy: cpufreq_policy_free(policy); return ret;}
?
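A policy represents a set of CPUs whose frequency must be scaled together. On physical CPUs, not every core can scale its frequency independently. How many policies a system has depends on the specific CPU hardware.
4.2 Governors
The kernel provides six governors in total. We go through them one by one below.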
1. performance
The performance governor's strategy is very simple: it always sets the CPU frequency to the policy's maximum (policy->max). The code is as follows:
linux-src/drivers/cpufreq/cpufreq_performance.c

static struct cpufreq_governor cpufreq_gov_performance = {
    .name   = "performance",
    .owner  = THIS_MODULE,
    .flags  = CPUFREQ_GOV_STRICT_TARGET,
    .limits = cpufreq_gov_performance_limits,
};

static void cpufreq_gov_performance_limits(struct cpufreq_policy *policy)
{
    pr_debug("setting to %u kHz\n", policy->max);
    __cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
}

2. powersave
The powersave governor is just as simple: it always sets the CPU frequency to the policy's minimum (policy->min). The code is as follows:
linux-src/drivers/cpufreq/cpufreq_powersave.c

static struct cpufreq_governor cpufreq_gov_powersave = {
    .name   = "powersave",
    .limits = cpufreq_gov_powersave_limits,
    .owner  = THIS_MODULE,
    .flags  = CPUFREQ_GOV_STRICT_TARGET,
};

static void cpufreq_gov_powersave_limits(struct cpufreq_policy *policy)
{
    pr_debug("setting to %u kHz\n", policy->min);
    __cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
}

3. conservative
The conservative governor changes the frequency gradually, in small steps according to load, and always keeps it between the policy's minimum and maximum. The code is as follows:
linux-src/drivers/cpufreq/cpufreq_conservative.c

static struct dbs_governor cs_governor = {
    .gov = CPUFREQ_DBS_GOVERNOR_INITIALIZER("conservative"),
    .kobj_type = { .default_attrs = cs_attributes },
    .gov_dbs_update = cs_dbs_update,
    .alloc = cs_alloc,
    .free = cs_free,
    .init = cs_init,
    .exit = cs_exit,
    .start = cs_start,
};

linux-src/drivers/cpufreq/cpufreq_governor.h

#define CPUFREQ_DBS_GOVERNOR_INITIALIZER(_name_)        \
    {                                                   \
        .name = _name_,                                 \
        .flags = CPUFREQ_GOV_DYNAMIC_SWITCHING,         \
        .owner = THIS_MODULE,                           \
        .init = cpufreq_dbs_governor_init,              \
        .exit = cpufreq_dbs_governor_exit,              \
        .start = cpufreq_dbs_governor_start,            \
        .stop = cpufreq_dbs_governor_stop,              \
        .limits = cpufreq_dbs_governor_limits,          \
    }

linux-src/drivers/cpufreq/cpufreq_governor.c

void cpufreq_dbs_governor_limits(struct cpufreq_policy *policy)
{
    struct policy_dbs_info *policy_dbs;

    /* Protect gov->gdbs_data against cpufreq_dbs_governor_exit() */
    mutex_lock(&gov_dbs_data_mutex);
    policy_dbs = policy->governor_data;
    if (!policy_dbs)
        goto out;

    mutex_lock(&policy_dbs->update_mutex);
    cpufreq_policy_apply_limits(policy);
    gov_update_sample_delay(policy_dbs, 0);
    mutex_unlock(&policy_dbs->update_mutex);

out:
    mutex_unlock(&gov_dbs_data_mutex);
}

linux-src/include/linux/cpufreq.h

static inline void cpufreq_policy_apply_limits(struct cpufreq_policy *policy)
{
    if (policy->max < policy->cur)
        __cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
    else if (policy->min > policy->cur)
        __cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
}

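The limit handling in cpufreq_policy_apply_limits is easier to see without the kernel plumbing. The following is a minimal user-space sketch, not kernel code: apply_limits_khz is a hypothetical helper that returns the frequency the policy would end up at, instead of calling __cpufreq_driver_target.

```c
/*
 * Simplified user-space model of cpufreq_policy_apply_limits().
 * apply_limits_khz is a hypothetical name; all values are in kHz.
 * Returns the frequency the policy should run at once the
 * [min, max] window is enforced on the current frequency.
 */
unsigned int apply_limits_khz(unsigned int cur, unsigned int min,
                              unsigned int max)
{
    if (max < cur)
        return max;     /* current frequency is above the new ceiling */
    if (min > cur)
        return min;     /* current frequency is below the new floor */
    return cur;         /* already inside the window: nothing to change */
}
```

Note that the frequency is only touched when it falls outside the window, which is why the real function issues no driver call in the common case.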
4. userspace
The userspace governor runs the CPU at the frequency set from user space, clamped to the policy's limits. The code is as follows:
linux-src/drivers/cpufreq/cpufreq_userspace.c

static struct cpufreq_governor cpufreq_gov_userspace = {
    .name           = "userspace",
    .init           = cpufreq_userspace_policy_init,
    .exit           = cpufreq_userspace_policy_exit,
    .start          = cpufreq_userspace_policy_start,
    .stop           = cpufreq_userspace_policy_stop,
    .limits         = cpufreq_userspace_policy_limits,
    .store_setspeed = cpufreq_set,
    .show_setspeed  = show_speed,
    .owner          = THIS_MODULE,
};

static void cpufreq_userspace_policy_limits(struct cpufreq_policy *policy)
{
    unsigned int *setspeed = policy->governor_data;

    mutex_lock(&userspace_mutex);

    pr_debug("limit event for cpu %u: %u - %u kHz, currently %u kHz, last set to %u kHz\n",
             policy->cpu, policy->min, policy->max, policy->cur, *setspeed);

    if (policy->max < *setspeed)
        __cpufreq_driver_target(policy, policy->max, CPUFREQ_RELATION_H);
    else if (policy->min > *setspeed)
        __cpufreq_driver_target(policy, policy->min, CPUFREQ_RELATION_L);
    else
        __cpufreq_driver_target(policy, *setspeed, CPUFREQ_RELATION_L);

    mutex_unlock(&userspace_mutex);
}

5. ondemand
The ondemand governor scales on demand: it runs at a lower frequency by default and switches to a high frequency when the system load rises. The code is as follows:
linux-src/drivers/cpufreq/cpufreq_ondemand.c

static struct dbs_governor od_dbs_gov = {
    .gov = CPUFREQ_DBS_GOVERNOR_INITIALIZER("ondemand"),
    .kobj_type = { .default_attrs = od_attributes },
    .gov_dbs_update = od_dbs_update,
    .alloc = od_alloc,
    .free = od_free,
    .init = od_init,
    .exit = od_exit,
    .start = od_start,
};

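The listing above shows only the registration structure, so the decision rule itself is not visible. As a rough sketch of the idea (a simplification, not the kernel's od_dbs_update): above a load threshold the governor jumps straight to the maximum, otherwise it picks a frequency proportional to the load. The function name ondemand_next_khz and its parameters are hypothetical; up_threshold stands in for the governor's tunable of the same name.

```c
/*
 * Simplified model of the ondemand decision (hypothetical helper).
 * load_pct:     CPU load over the last sampling period, 0..100
 * up_threshold: load percentage above which we jump to the maximum
 * min_khz/max_khz: frequency range of the CPU, in kHz
 */
unsigned int ondemand_next_khz(unsigned int load_pct,
                               unsigned int up_threshold,
                               unsigned int min_khz,
                               unsigned int max_khz)
{
    unsigned long long span = max_khz - min_khz;

    if (load_pct > up_threshold)
        return max_khz;                 /* busy: go to full speed at once */

    /* otherwise scale linearly between min and max with the load */
    return min_khz + (unsigned int)(load_pct * span / 100);
}
```

Jumping directly to the maximum under high load is what distinguishes this rule from the conservative governor, which moves in small steps in both directions.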
6. schedutil
The schedutil governor adjusts the frequency dynamically based on the CPU utilization reported by the scheduler. The code is as follows:
linux-src/kernel/sched/cpufreq_schedutil.c

struct cpufreq_governor schedutil_gov = {
    .name   = "schedutil",
    .owner  = THIS_MODULE,
    .flags  = CPUFREQ_GOV_DYNAMIC_SWITCHING,
    .init   = sugov_init,
    .exit   = sugov_exit,
    .start  = sugov_start,
    .stop   = sugov_stop,
    .limits = sugov_limits,
};

static void sugov_limits(struct cpufreq_policy *policy)
{
    struct sugov_policy *sg_policy = policy->governor_data;

    if (!policy->fast_switch_enabled) {
        mutex_lock(&sg_policy->work_lock);
        cpufreq_policy_apply_limits(policy);
        mutex_unlock(&sg_policy->work_lock);
    }

    sg_policy->limits_changed = true;
}

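To make "based on utilization" concrete, here is a minimal sketch of the mapping schedutil is commonly described as using: the target is roughly 1.25 * max_freq * util / max_capacity. This is a user-space model under stated assumptions, not the kernel function. schedutil_next_khz and its parameters are hypothetical names, and the final clamp to the policy limits is a simplification of what the cpufreq core does when it resolves the requested frequency.

```c
/*
 * Simplified model of schedutil's utilization-to-frequency mapping
 * (hypothetical helper; all frequencies in kHz).
 * util:    current utilization of the CPU
 * max_cap: utilization value that corresponds to a fully busy CPU
 */
unsigned long schedutil_next_khz(unsigned long util, unsigned long max_cap,
                                 unsigned long min_khz, unsigned long max_khz)
{
    /* 1.25 * max_khz: headroom so the CPU speeds up before it is 100% busy */
    unsigned long long boosted = (unsigned long long)max_khz + (max_khz >> 2);
    unsigned long long next = boosted * util / max_cap;

    /* Assumption: clamp the result to the policy's [min, max] window */
    if (next > max_khz)
        return max_khz;
    if (next < min_khz)
        return min_khz;
    return (unsigned long)next;
}
```

With the 25% headroom, the maximum frequency is already requested once utilization reaches 80% of capacity.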
4.3 Drivers
On x86, the driver we look at here is intel_pstate, which serves Intel processors (other x86 drivers, such as acpi-cpufreq, also exist). Its implementation is as follows:
linux-src/drivers/cpufreq/intel_pstate.c

static struct cpufreq_driver intel_pstate = {
    .flags          = CPUFREQ_CONST_LOOPS,
    .verify         = intel_pstate_verify_policy,
    .setpolicy      = intel_pstate_set_policy,
    .suspend        = intel_pstate_suspend,
    .resume         = intel_pstate_resume,
    .init           = intel_pstate_cpu_init,
    .exit           = intel_pstate_cpu_exit,
    .offline        = intel_pstate_cpu_offline,
    .online         = intel_pstate_cpu_online,
    .update_limits  = intel_pstate_update_limits,
    .name           = "intel_pstate",
};

static int intel_pstate_set_policy(struct cpufreq_policy *policy)
{
    struct cpudata *cpu;

    if (!policy->cpuinfo.max_freq)
        return -ENODEV;

    pr_debug("set_policy cpuinfo.max %u policy->max %u\n",
             policy->cpuinfo.max_freq, policy->max);

    cpu = all_cpu_data[policy->cpu];
    cpu->policy = policy->policy;

    mutex_lock(&intel_pstate_limits_lock);

    intel_pstate_update_perf_limits(cpu, policy->min, policy->max);

    if (cpu->policy == CPUFREQ_POLICY_PERFORMANCE) {
        /*
         * NOHZ_FULL CPUs need this as the governor callback may not
         * be invoked on them.
         */
        intel_pstate_clear_update_util_hook(policy->cpu);
        intel_pstate_max_within_limits(cpu);
    } else {
        intel_pstate_set_update_util_hook(policy->cpu);
    }

    if (hwp_active) {
        /*
         * When hwp_boost was active before and dynamically it
         * was turned off, in that case we need to clear the
         * update util hook.
         */
        if (!hwp_boost)
            intel_pstate_clear_update_util_hook(policy->cpu);
        intel_pstate_hwp_set(policy->cpu);
    }

    mutex_unlock(&intel_pstate_limits_lock);

    return 0;
}

static int __init intel_pstate_init(void)
{
    const struct x86_cpu_id *id;
    int rc;

    if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
        return -ENODEV;

    id = x86_match_cpu(hwp_support_ids);
    if (id) {
        bool hwp_forced = intel_pstate_hwp_is_enabled();

        if (hwp_forced)
            pr_info("HWP enabled by BIOS\n");
        else if (no_load)
            return -ENODEV;

        copy_cpu_funcs(&core_funcs);
        /*
         * Avoid enabling HWP for processors without EPP support,
         * because that means incomplete HWP implementation which is a
         * corner case and supporting it is generally problematic.
         *
         * If HWP is enabled already, though, there is no choice but to
         * deal with it.
         */
        if ((!no_hwp && boot_cpu_has(X86_FEATURE_HWP_EPP)) || hwp_forced) {
            hwp_active++;
            hwp_mode_bdw = id->driver_data;
            intel_pstate.attr = hwp_cpufreq_attrs;
            intel_cpufreq.attr = hwp_cpufreq_attrs;
            intel_cpufreq.flags |= CPUFREQ_NEED_UPDATE_LIMITS;
            intel_cpufreq.adjust_perf = intel_cpufreq_adjust_perf;
            if (!default_driver)
                default_driver = &intel_pstate;

            if (boot_cpu_has(X86_FEATURE_HYBRID_CPU))
                intel_pstate_cppc_set_cpu_scaling();

            goto hwp_cpu_matched;
        }
        pr_info("HWP not enabled\n");
    } else {
        if (no_load)
            return -ENODEV;

        id = x86_match_cpu(intel_pstate_cpu_ids);
        if (!id) {
            pr_info("CPU model not supported\n");
            return -ENODEV;
        }

        copy_cpu_funcs((struct pstate_funcs *)id->driver_data);
    }

    if (intel_pstate_msrs_not_valid()) {
        pr_info("Invalid MSRs\n");
        return -ENODEV;
    }
    /* Without HWP start in the passive mode. */
    if (!default_driver)
        default_driver = &intel_cpufreq;

hwp_cpu_matched:
    /*
     * The Intel pstate driver will be ignored if the platform
     * firmware has its own power management modes.
     */
    if (intel_pstate_platform_pwr_mgmt_exists()) {
        pr_info("P-states controlled by the platform\n");
        return -ENODEV;
    }

    if (!hwp_active && hwp_only)
        return -ENOTSUPP;

    pr_info("Intel P-state driver initializing\n");

    all_cpu_data = vzalloc(array_size(sizeof(void *), num_possible_cpus()));
    if (!all_cpu_data)
        return -ENOMEM;

    intel_pstate_request_control_from_smm();

    intel_pstate_sysfs_expose_params();

    mutex_lock(&intel_pstate_driver_lock);
    rc = intel_pstate_register_driver(default_driver);
    mutex_unlock(&intel_pstate_driver_lock);
    if (rc) {
        intel_pstate_sysfs_remove();
        return rc;
    }

    if (hwp_active) {
        const struct x86_cpu_id *id;

        id = x86_match_cpu(intel_pstate_cpu_ee_disable_ids);
        if (id) {
            set_power_ctl_ee_state(false);
            pr_info("Disabling energy efficiency optimization\n");
        }

        pr_info("HWP enabled\n");
    } else if (boot_cpu_has(X86_FEATURE_HYBRID_CPU)) {
        pr_warn("Problematic setup: Hybrid processor with disabled HWP\n");
    }

    return 0;
}
device_initcall(intel_pstate_init);

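5. CPU Idle
When a CPU has no runnable process it enters the idle state, and an idle CPU can choose to enter a low-power mode to save energy. The different low-power modes are collectively called C-states and are commonly listed as C0, C1, C2, C3, C4, C5 and C6 (the ACPI specification itself defines C0 through C3 and allows deeper, processor-specific states). A CPU vendor can choose to implement C0–Cn, n >= 3. The modes are defined as follows:
C0: the normal operating mode; the CPU is fully running.
C1: the CPU's main internal clock is stopped by software; the bus interface unit and the APIC keep running at full speed.
C2: the CPU's main internal clock is stopped by hardware; the bus interface unit and the APIC keep running at full speed.
C3: all CPU internal clocks are stopped.
C4: the CPU voltage is reduced.
C5: the CPU voltage is reduced substantially and the cache is turned off.
C6: the CPU's internal voltage is reduced to any value, including 0 V.
So who decides which idle state a CPU should enter, and who carries out that decision? For this purpose Linux provides the CPUIdle framework, which separates the roles: the governor decides which idle state to enter, the driver carries out the decision, and the core mediates between the two. The figure below shows how they fit together.
As the figure shows, the system can register several governors at the same time but only one driver. Unlike CPUFreq, each CPU can adjust its own low-power state independently, so the driver acts on the CPU directly.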
5.1 CPUIdle Core
The core defines a global variable cpuidle_governors, the list of all governors. Governors are registered through cpuidle_register_governor, and the global variable cpuidle_curr_governor points to the current governor. The core also defines cpuidle_curr_driver, the current driver, which is registered through cpuidle_register_driver. A system can register only one driver, and a later registration returns an error (unless the kernel is built with CONFIG_CPU_IDLE_MULTIPLE_DRIVERS, in which case drivers are tracked per CPU).
First, the definition of a governor and its registration function:
linux-src/include/linux/cpuidle.h

struct cpuidle_governor {
    char                    name[CPUIDLE_NAME_LEN];
    struct list_head        governor_list;
    unsigned int            rating;

    int  (*enable)  (struct cpuidle_driver *drv,
                     struct cpuidle_device *dev);
    void (*disable) (struct cpuidle_driver *drv,
                     struct cpuidle_device *dev);

    int  (*select)  (struct cpuidle_driver *drv,
                     struct cpuidle_device *dev,
                     bool *stop_tick);
    void (*reflect) (struct cpuidle_device *dev, int index);
};

linux-src/drivers/cpuidle/governor.c

int cpuidle_register_governor(struct cpuidle_governor *gov)
{
    int ret = -EEXIST;

    if (!gov || !gov->select)
        return -EINVAL;

    if (cpuidle_disabled())
        return -ENODEV;

    mutex_lock(&cpuidle_lock);
    if (cpuidle_find_governor(gov->name) == NULL) {
        ret = 0;
        list_add_tail(&gov->governor_list, &cpuidle_governors);
        if (!cpuidle_curr_governor ||
            !strncasecmp(param_governor, gov->name, CPUIDLE_NAME_LEN) ||
            (cpuidle_curr_governor->rating < gov->rating &&
             strncasecmp(param_governor, cpuidle_curr_governor->name,
                         CPUIDLE_NAME_LEN)))
            cpuidle_switch_governor(gov);
    }
    mutex_unlock(&cpuidle_lock);

    return ret;
}

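The switching condition inside cpuidle_register_governor is dense, so here is a small user-space sketch of just that decision. It assumes nothing beyond the condition shown above; struct gov and pick_governor are hypothetical names, and param plays the role of param_governor (the governor name given on the kernel command line, empty if none).

```c
#include <stddef.h>
#include <strings.h>

/* Hypothetical stand-in for the two fields of cpuidle_governor used here. */
struct gov {
    const char *name;
    unsigned int rating;
};

/*
 * Returns the governor that is current after `cand` has been registered.
 * The newcomer takes over if there is no current governor, if the user
 * asked for it by name, or if it has a higher rating and the current
 * governor is not the one the user asked for.
 */
const struct gov *pick_governor(const struct gov *curr,
                                const struct gov *cand,
                                const char *param)
{
    if (curr == NULL)
        return cand;
    if (strcasecmp(param, cand->name) == 0)
        return cand;
    if (curr->rating < cand->rating && strcasecmp(param, curr->name) != 0)
        return cand;
    return curr;
}
```

In other words, the rating only matters when the user has not pinned a governor by name: with no parameter, a governor rated 20 replaces one rated 10, but a governor selected explicitly is never displaced by a higher-rated one.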
Next, the definition of a driver and its registration function:
linux-src/include/linux/cpuidle.h

struct cpuidle_driver {
    const char              *name;
    struct module           *owner;

    /* used by the cpuidle framework to setup the broadcast timer */
    unsigned int            bctimer:1;
    /* states array must be ordered in decreasing power consumption */
    struct cpuidle_state    states[CPUIDLE_STATE_MAX];
    int                     state_count;
    int                     safe_state_index;

    /* the driver handles the cpus in cpumask */
    struct cpumask          *cpumask;

    /* preferred governor to switch at register time */
    const char              *governor;
};

linux-src/drivers/cpuidle/driver.c

int cpuidle_register_driver(struct cpuidle_driver *drv)
{
    struct cpuidle_governor *gov;
    int ret;

    spin_lock(&cpuidle_driver_lock);
    ret = __cpuidle_register_driver(drv);
    spin_unlock(&cpuidle_driver_lock);

    if (!ret && !strlen(param_governor) && drv->governor &&
        (cpuidle_get_driver() == drv)) {
        mutex_lock(&cpuidle_lock);
        gov = cpuidle_find_governor(drv->governor);
        if (gov) {
            cpuidle_prev_governor = cpuidle_curr_governor;
            if (cpuidle_switch_governor(gov) < 0)
                cpuidle_prev_governor = NULL;
        }
        mutex_unlock(&cpuidle_lock);
    }

    return ret;
}

5.2 Governors
By default CPUIdle has two governors, ladder and menu. With ladder, the CPU climbs one rung at a time: the longer it stays idle, the deeper the idle state it moves to, which suits a fixed (periodic) tick. With menu, the governor predicts how long the CPU will be idle and then puts it directly into the matching idle state in one step, which suits a dynamic tick. We look at each implementation in turn.
1. ladder
linux-src/drivers/cpuidle/governors/ladder.c

static struct cpuidle_governor ladder_governor = {
    .name    = "ladder",
    .rating  = 10,
    .enable  = ladder_enable_device,
    .select  = ladder_select_state,
    .reflect = ladder_reflect,
};

static int ladder_select_state(struct cpuidle_driver *drv,
                               struct cpuidle_device *dev, bool *dummy)
{
    struct ladder_device *ldev = this_cpu_ptr(&ladder_devices);
    struct ladder_device_state *last_state;
    int last_idx = dev->last_state_idx;
    int first_idx = drv->states[0].flags & CPUIDLE_FLAG_POLLING ? 1 : 0;
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
    s64 last_residency;

    /* Special case when user has set very strict latency requirement */
    if (unlikely(latency_req == 0)) {
        ladder_do_selection(dev, ldev, last_idx, 0);
        return 0;
    }

    last_state = &ldev->states[last_idx];

    last_residency = dev->last_residency_ns - drv->states[last_idx].exit_latency_ns;

    /* consider promotion */
    if (last_idx < drv->state_count - 1 &&
        !dev->states_usage[last_idx + 1].disable &&
        last_residency > last_state->threshold.promotion_time_ns &&
        drv->states[last_idx + 1].exit_latency_ns <= latency_req) {
        last_state->stats.promotion_count++;
        last_state->stats.demotion_count = 0;
        if (last_state->stats.promotion_count >= last_state->threshold.promotion_count) {
            ladder_do_selection(dev, ldev, last_idx, last_idx + 1);
            return last_idx + 1;
        }
    }

    /* consider demotion */
    if (last_idx > first_idx &&
        (dev->states_usage[last_idx].disable ||
         drv->states[last_idx].exit_latency_ns > latency_req)) {
        int i;

        for (i = last_idx - 1; i > first_idx; i--) {
            if (drv->states[i].exit_latency_ns <= latency_req)
                break;
        }
        ladder_do_selection(dev, ldev, last_idx, i);
        return i;
    }

    if (last_idx > first_idx &&
        last_residency < last_state->threshold.demotion_time_ns) {
        last_state->stats.demotion_count++;
        last_state->stats.promotion_count = 0;
        if (last_state->stats.demotion_count >= last_state->threshold.demotion_count) {
            ladder_do_selection(dev, ldev, last_idx, last_idx - 1);
            return last_idx - 1;
        }
    }

    /* otherwise remain at the current state */
    return last_idx;
}

2. menu
linux-src/drivers/cpuidle/governors/menu.c

static struct cpuidle_governor menu_governor = {
    .name    = "menu",
    .rating  = 20,
    .enable  = menu_enable_device,
    .select  = menu_select,
    .reflect = menu_reflect,
};

static int menu_select(struct cpuidle_driver *drv, struct cpuidle_device *dev,
                       bool *stop_tick)
{
    struct menu_device *data = this_cpu_ptr(&menu_devices);
    s64 latency_req = cpuidle_governor_latency_req(dev->cpu);
    unsigned int predicted_us;
    u64 predicted_ns;
    u64 interactivity_req;
    unsigned int nr_iowaiters;
    ktime_t delta, delta_tick;
    int i, idx;

    if (data->needs_update) {
        menu_update(drv, dev);
        data->needs_update = 0;
    }

    /* determine the expected residency time, round up */
    delta = tick_nohz_get_sleep_length(&delta_tick);
    if (unlikely(delta < 0)) {
        delta = 0;
        delta_tick = 0;
    }
    data->next_timer_ns = delta;

    nr_iowaiters = nr_iowait_cpu(dev->cpu);
    data->bucket = which_bucket(data->next_timer_ns, nr_iowaiters);

    if (unlikely(drv->state_count <= 1 || latency_req == 0) ||
        ((data->next_timer_ns < drv->states[1].target_residency_ns ||
          latency_req < drv->states[1].exit_latency_ns) &&
         !dev->states_usage[0].disable)) {
        /*
         * In this case state[0] will be used no matter what, so return
         * it right away and keep the tick running if state[0] is a
         * polling one.
         */
        *stop_tick = !(drv->states[0].flags & CPUIDLE_FLAG_POLLING);
        return 0;
    }

    /* Round up the result for half microseconds. */
    predicted_us = div_u64(data->next_timer_ns *
                           data->correction_factor[data->bucket] +
                           (RESOLUTION * DECAY * NSEC_PER_USEC) / 2,
                           RESOLUTION * DECAY * NSEC_PER_USEC);
    /* Use the lowest expected idle interval to pick the idle state. */
    predicted_ns = (u64)min(predicted_us,
                            get_typical_interval(data, predicted_us)) *
                   NSEC_PER_USEC;

    if (tick_nohz_tick_stopped()) {
        /*
         * If the tick is already stopped, the cost of possible short
         * idle duration misprediction is much higher, because the CPU
         * may be stuck in a shallow idle state for a long time as a
         * result of it. In that case say we might mispredict and use
         * the known time till the closest timer event for the idle
         * state selection.
         */
        if (predicted_ns < TICK_NSEC)
            predicted_ns = data->next_timer_ns;
    } else {
        /*
         * Use the performance multiplier and the user-configurable
         * latency_req to determine the maximum exit latency.
         */
        interactivity_req = div64_u64(predicted_ns,
                                      performance_multiplier(nr_iowaiters));
        if (latency_req > interactivity_req)
            latency_req = interactivity_req;
    }

    /*
     * Find the idle state with the lowest power while satisfying
     * our constraints.
     */
    idx = -1;
    for (i = 0; i < drv->state_count; i++) {
        struct cpuidle_state *s = &drv->states[i];

        if (dev->states_usage[i].disable)
            continue;

        if (idx == -1)
            idx = i; /* first enabled state */

        if (s->target_residency_ns > predicted_ns) {
            /*
             * Use a physical idle state, not busy polling, unless
             * a timer is going to trigger soon enough.
             */
            if ((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) &&
                s->exit_latency_ns <= latency_req &&
                s->target_residency_ns <= data->next_timer_ns) {
                predicted_ns = s->target_residency_ns;
                idx = i;
                break;
            }
            if (predicted_ns < TICK_NSEC)
                break;

            if (!tick_nohz_tick_stopped()) {
                /*
                 * If the state selected so far is shallow,
                 * waking up early won't hurt, so retain the
                 * tick in that case and let the governor run
                 * again in the next iteration of the loop.
                 */
                predicted_ns = drv->states[idx].target_residency_ns;
                break;
            }

            /*
             * If the state selected so far is shallow and this
             * state's target residency matches the time till the
             * closest timer event, select this one to avoid getting
             * stuck in the shallow one for too long.
             */
            if (drv->states[idx].target_residency_ns < TICK_NSEC &&
                s->target_residency_ns <= delta_tick)
                idx = i;

            return idx;
        }
        if (s->exit_latency_ns > latency_req)
            break;

        idx = i;
    }

    if (idx == -1)
        idx = 0; /* No states enabled. Must use 0. */

    /*
     * Don't stop the tick if the selected state is a polling one or if the
     * expected idle duration is shorter than the tick period length.
     */
    if (((drv->states[idx].flags & CPUIDLE_FLAG_POLLING) ||
         predicted_ns < TICK_NSEC) && !tick_nohz_tick_stopped()) {
        *stop_tick = false;

        if (idx > 0 && drv->states[idx].target_residency_ns > delta_tick) {
            /*
             * The tick is not going to be stopped and the target
             * residency of the state to be returned is not within
             * the time until the next timer event including the
             * tick, so try to correct that.
             */
            for (i = idx - 1; i >= 0; i--) {
                if (dev->states_usage[i].disable)
                    continue;

                idx = i;
                if (drv->states[i].target_residency_ns <= delta_tick)
                    break;
            }
        }
    }

    return idx;
}

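Stripped of tick handling, polling states and disabled states, the heart of menu_select is one loop: walk the states from shallow to deep and keep the deepest one whose target residency fits the predicted idle time and whose exit latency fits the latency requirement. The following user-space sketch shows only that core idea. struct idle_state and pick_idle_state are hypothetical names, and the state table used with it would be made up for illustration.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two cpuidle_state fields used here. */
struct idle_state {
    int64_t target_residency_ns;   /* minimum stay for the state to pay off */
    int64_t exit_latency_ns;       /* time needed to wake up from the state */
};

/*
 * States must be ordered from shallow to deep, as in cpuidle_driver.
 * Returns the index of the deepest state that satisfies both the
 * predicted idle duration and the latency requirement; falls back
 * to state 0 if no deeper state qualifies.
 */
int pick_idle_state(const struct idle_state *states, int count,
                    int64_t predicted_ns, int64_t latency_req_ns)
{
    int idx = 0;

    for (int i = 0; i < count; i++) {
        if (states[i].target_residency_ns > predicted_ns)
            break;      /* we will not sleep long enough for this state */
        if (states[i].exit_latency_ns > latency_req_ns)
            break;      /* waking up from this state would take too long */
        idx = i;
    }
    return idx;
}
```

The latency_req_ns argument corresponds to the value the governor obtains from cpuidle_governor_latency_req, which is where the PM QoS latency constraint enters the decision.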
5.3 Drivers
As an example of a driver we look at pseries_idle, the CPUIdle driver for the IBM POWER pSeries platform (on x86 the usual drivers are intel_idle and acpi_idle).
linux-src/drivers/cpuidle/cpuidle-pseries.c

static struct cpuidle_driver pseries_idle_driver = {
    .name  = "pseries_idle",
    .owner = THIS_MODULE,
};

static int __init pseries_processor_idle_init(void)
{
    int retval;

    retval = pseries_idle_probe();
    if (retval)
        return retval;

    pseries_cpuidle_driver_init();
    retval = cpuidle_register(&pseries_idle_driver, NULL);
    if (retval) {
        printk(KERN_DEBUG "Registration of pseries driver failed.\n");
        return retval;
    }

    retval = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
                                       "cpuidle/pseries:online",
                                       pseries_cpuidle_cpu_online, NULL);
    WARN_ON(retval < 0);
    retval = cpuhp_setup_state_nocalls(CPUHP_CPUIDLE_DEAD,
                                       "cpuidle/pseries:DEAD", NULL,
                                       pseries_cpuidle_cpu_dead);
    WARN_ON(retval < 0);
    printk(KERN_DEBUG "pseries_idle_driver registered\n");
    return 0;
}

device_initcall(pseries_processor_idle_init);

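6. Power Management Quality of Service
We have covered many power-saving mechanisms, but saving power cannot be the only goal. After all, we use a computer to get work done, not to save power, and power saving must not cost so much performance that it hurts the user experience. For this reason the kernel provides the PM QoS module, which is dedicated to the quality of service of power management. PM QoS is a framework. Toward its clients (the kernel and processes) it offers a request interface, through which a client can ask that some aspect of system performance not fall below a given level. Toward the power-saving mechanisms it offers a query interface, which those mechanisms must consult so that they still meet that level while saving power.
PM QoS abstracts the minimum requirement on one aspect of performance as a constraint. Any client can add a request against a constraint and later update or remove it, and PM QoS hands the power-saving mechanism the value that satisfies all requests. Constraints fall into two classes: system-wide constraints, which concern the performance of the system as a whole, and per-device constraints, which concern a single device. Kernel clients add requests by calling the interface functions directly, while user-space clients add requests through device node files.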
6.1 System-wide constraints
There are two system-wide constraints: CPU frequency and CPU latency. CPU frequency represents the performance of a running CPU: the higher the frequency, the higher the performance and the higher the power consumption. CPU latency is the time a CPU needs to return from a low-power state to running after it has gone idle. An idle CPU can sit in different low-power states; the deeper the state, the more power is saved, but the longer the wake-up latency.
We start with the definitions and the request function of the CPU frequency constraint:
linux-src/include/linux/pm_qos.h

struct freq_constraints {
    struct pm_qos_constraints min_freq;
    struct blocking_notifier_head min_freq_notifiers;
    struct pm_qos_constraints max_freq;
    struct blocking_notifier_head max_freq_notifiers;
};

struct pm_qos_constraints {
    struct plist_head list;
    s32 target_value; /* Do not change to 64 bit */
    s32 default_value;
    s32 no_constraint_value;
    enum pm_qos_type type;
    struct blocking_notifier_head *notifiers;
};

struct freq_qos_request {
    enum freq_qos_req_type type;
    struct plist_node pnode;
    struct freq_constraints *qos;
};

enum freq_qos_req_type {
    FREQ_QOS_MIN = 1,
    FREQ_QOS_MAX,
};

These are the definitions related to the frequency constraint.
linux-src/kernel/power/qos.c

int freq_qos_add_request(struct freq_constraints *qos,
                         struct freq_qos_request *req,
                         enum freq_qos_req_type type, s32 value)
{
    int ret;

    if (IS_ERR_OR_NULL(qos) || !req)
        return -EINVAL;

    if (WARN(freq_qos_request_active(req),
             "%s() called for active request\n", __func__))
        return -EINVAL;

    req->qos = qos;
    req->type = type;
    ret = freq_qos_apply(req, PM_QOS_ADD_REQ, value);
    if (ret < 0) {
        req->qos = NULL;
        req->type = 0;
    }

    return ret;
}

int freq_qos_apply(struct freq_qos_request *req,
                   enum pm_qos_req_action action, s32 value)
{
    int ret;

    switch(req->type) {
    case FREQ_QOS_MIN:
        ret = pm_qos_update_target(&req->qos->min_freq, &req->pnode,
                                   action, value);
        break;
    case FREQ_QOS_MAX:
        ret = pm_qos_update_target(&req->qos->max_freq, &req->pnode,
                                   action, value);
        break;
    default:
        ret = -EINVAL;
    }

    return ret;
}

This is the request function of the CPU frequency constraint.
linux-src/kernel/power/qos.c

s32 freq_qos_read_value(struct freq_constraints *qos,
                        enum freq_qos_req_type type)
{
    s32 ret;

    switch (type) {
    case FREQ_QOS_MIN:
        ret = IS_ERR_OR_NULL(qos) ?
            FREQ_QOS_MIN_DEFAULT_VALUE :
            pm_qos_read_value(&qos->min_freq);
        break;
    case FREQ_QOS_MAX:
        ret = IS_ERR_OR_NULL(qos) ?
            FREQ_QOS_MAX_DEFAULT_VALUE :
            pm_qos_read_value(&qos->max_freq);
        break;
    default:
        WARN_ON(1);
        ret = 0;
    }

    return ret;
}

The CPUFreq module reads the CPU frequency constraint through freq_qos_read_value, so that dynamic frequency scaling still meets the minimum performance requirement.
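What "the value that satisfies all requests" means for frequency can be shown with a small user-space sketch. It assumes the usual aggregation rule: the effective minimum is the largest of all FREQ_QOS_MIN requests and the effective maximum is the smallest of all FREQ_QOS_MAX requests, with 0 and INT32_MAX as the "no request" defaults. The function names are hypothetical, and the kernel keeps requests in a priority list rather than scanning an array.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Effective minimum frequency: the strictest (largest) of all minimum
 * requests. With no requests the result is 0, i.e. no lower bound.
 */
int32_t effective_min_khz(const int32_t *reqs, size_t n)
{
    int32_t v = 0;

    for (size_t i = 0; i < n; i++)
        if (reqs[i] > v)
            v = reqs[i];
    return v;
}

/*
 * Effective maximum frequency: the strictest (smallest) of all maximum
 * requests. With no requests the result is INT32_MAX, i.e. no upper bound.
 */
int32_t effective_max_khz(const int32_t *reqs, size_t n)
{
    int32_t v = INT32_MAX;

    for (size_t i = 0; i < n; i++)
        if (reqs[i] < v)
            v = reqs[i];
    return v;
}
```

For example, if one client asks for at least 800000 kHz and another for at least 1200000 kHz, the floor seen by CPUFreq is 1200000 kHz, because only that value satisfies both.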
Next, the definitions and the request function of the CPU latency constraint:
linux-src/include/linux/pm_qos.h

struct pm_qos_constraints {
    struct plist_head list;
    s32 target_value; /* Do not change to 64 bit */
    s32 default_value;
    s32 no_constraint_value;
    enum pm_qos_type type;
    struct blocking_notifier_head *notifiers;
};

struct pm_qos_request {
    struct plist_node node;
    struct pm_qos_constraints *qos;
};

These are the definitions of the CPU latency constraint.
linux-src/kernel/power/qos.c

void cpu_latency_qos_add_request(struct pm_qos_request *req, s32 value)
{
    if (!req)
        return;

    if (cpu_latency_qos_request_active(req)) {
        WARN(1, KERN_ERR "%s called for already added request\n", __func__);
        return;
    }

    trace_pm_qos_add_request(value);

    req->qos = &cpu_latency_constraints;
    cpu_latency_qos_apply(req, PM_QOS_ADD_REQ, value);
}

static void cpu_latency_qos_apply(struct pm_qos_request *req,
                                  enum pm_qos_req_action action, s32 value)
{
    int ret = pm_qos_update_target(req->qos, &req->node, action, value);
    if (ret > 0)
        wake_up_all_idle_cpus();
}

This is the request function of the CPU latency constraint.
linux-src/kernel/power/qos.c

s32 cpu_latency_qos_limit(void)
{
    return pm_qos_read_value(&cpu_latency_constraints);
}

s32 pm_qos_read_value(struct pm_qos_constraints *c)
{
    return READ_ONCE(c->target_value);
}

The CPUIdle module reads the CPU latency constraint through this interface, that is, the wake-up latency that the selected idle state must not exceed.
6.2 Per-device constraints
Omitted for now.
linux-src/drivers/base/power/qos.c
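7. Summary
This article has given a basic picture of power management on a computer. Let us recap with the figure:
Power management has two main parts: power state management and power saving. Power state management controls the power state of the whole machine, including sleep, hibernation, shutdown and reboot. Power saving consists of the kernel mechanisms that reduce energy use. Saving power alone is not enough, because performance matters too, so power management also includes PM QoS, which guarantees quality of service so that the running system still meets its performance requirements.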