About the author:
Rong Tao, CSDN blogger.
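Component overview
The Linux perf subsystem is very useful for performance analysis. The following shows the components of the perf subsystem, taken from this post.
'perf' is the user-space program used to perform performance analysis.
Only one system call is exposed to user space: perf_event_open, which returns a perf event fd. This system call has no glibc wrapper. More information can be found in its man page. It is one of the most complicated system calls.
'perf_event' is the core structure in the kernel. There are several types of perf event, such as tracepoint, software, and hardware.
We can also attach an eBPF program to a trace event through the perf event fd.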
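Because perf_event_open has no glibc wrapper, user programs usually call it through syscall(2). The following is a minimal sketch of that pattern, not code from the original post. The helper init_ctx_switch_counter is a hypothetical name introduced here for illustration; it only fills in a perf_event_attr for counting context switches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Thin wrapper: glibc provides none, so go through syscall(2). */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

/* Hypothetical helper (not a kernel or libc API): describe a software
 * event that counts context switches. */
static void init_ctx_switch_counter(struct perf_event_attr *attr)
{
    assert(attr != NULL);
    memset(attr, 0, sizeof(*attr));
    attr->type = PERF_TYPE_SOFTWARE;
    attr->size = sizeof(*attr);          /* lets the kernel check the ABI size */
    attr->config = PERF_COUNT_SW_CONTEXT_SWITCHES;
    attr->disabled = 1;                  /* created stopped; enable it later */
}
```

A caller would typically pass pid = 0 and cpu = -1 to measure the calling thread on any CPU, enable the event with ioctl(fd, PERF_EVENT_IOC_ENABLE, 0), and read a 64-bit count from the returned fd. Whether the call succeeds depends on the system's perf permissions.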
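Abstraction layer
The following shows perf's abstraction layers.
Each type of perf event has a corresponding PMU (performance monitoring unit). For example, the tracepoint PMU is defined as follows.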
static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
.add = perf_trace_add,
.del = perf_trace_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
Hardware-related PMUs have an arch-specific abstraction structure, such as 'struct x86_pmu'. This hardware-related structure reads and writes the performance monitoring MSRs.
Each PMU is registered by calling 'perf_pmu_register'.
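Perf event context
perf can monitor both CPU-related events and task-related events. Either one can have several monitored events, so we need a context to link the events together. This is 'perf_event_context'.
There are two kinds of context, software and hardware, defined as follows: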
enum perf_event_task_context {
perf_invalid_context = -1,
perf_hw_context = 0,
perf_sw_context,
perf_nr_task_contexts,
};
At the CPU level, the context is defined as 'perf_cpu_context', and it is declared in 'struct pmu' as a per-CPU variable.
struct pmu {
...
struct perf_cpu_context __percpu *pmu_cpu_context;
};
PMUs of the same context type (the same 'task_ctx_nr') share one 'struct perf_cpu_context'.
int perf_pmu_register(struct pmu *pmu, const char *name, int type)
{
int cpu, ret, max = PERF_TYPE_MAX;
mutex_lock(&pmus_lock);
...
pmu->pmu_cpu_context = find_pmu_context(pmu->task_ctx_nr);
if (pmu->pmu_cpu_context)
goto got_cpu_context;
ret = -ENOMEM;
pmu->pmu_cpu_context = alloc_percpu(struct perf_cpu_context);
if (!pmu->pmu_cpu_context)
goto free_dev;
for_each_possible_cpu(cpu) {
struct perf_cpu_context *cpuctx;
cpuctx = per_cpu_ptr(pmu->pmu_cpu_context, cpu);
__perf_event_init_context(&cpuctx->ctx);
lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);
lockdep_set_class(&cpuctx->ctx.lock, &cpuctx_lock);
cpuctx->ctx.pmu = pmu;
cpuctx->online = cpumask_test_cpu(cpu, perf_online_mask);
__perf_mux_hrtimer_init(cpuctx, cpu);
cpuctx->heap_size = ARRAY_SIZE(cpuctx->heap_default);
cpuctx->heap = cpuctx->heap_default;
}
...
}
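The sharing rule in perf_pmu_register can be modeled in plain user-space C. This is only a sketch under assumptions: every model_* name is hypothetical, the per-CPU allocation is reduced to a single calloc, and the rule that a negative context type is never shared reflects my reading of find_pmu_context rather than anything stated in the text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_MAX_PMUS 8

/* Stand-in for struct perf_cpu_context; here it only counts its users. */
struct model_cpu_context {
    int nr_pmus;
};

/* Stand-in for struct pmu, reduced to the fields discussed above. */
struct model_pmu {
    const char *name;
    int task_ctx_nr;                       /* -1 models perf_invalid_context */
    struct model_cpu_context *cpu_context;
};

static struct model_pmu *model_pmus[MODEL_MAX_PMUS];
static int model_nr_pmus;

/* Reuse the context of an already registered PMU with the same type. */
static struct model_cpu_context *model_find_context(int ctxn)
{
    if (ctxn < 0)
        return NULL;                       /* assumption: never shared */
    for (int i = 0; i < model_nr_pmus; i++) {
        if (model_pmus[i]->task_ctx_nr == ctxn)
            return model_pmus[i]->cpu_context;
    }
    return NULL;
}

/* Returns 0 on success, -1 if the table is full or allocation fails. */
static int model_pmu_register(struct model_pmu *pmu)
{
    assert(pmu != NULL);
    if (model_nr_pmus == MODEL_MAX_PMUS)
        return -1;
    pmu->cpu_context = model_find_context(pmu->task_ctx_nr);
    if (!pmu->cpu_context) {
        pmu->cpu_context = calloc(1, sizeof(*pmu->cpu_context));
        if (!pmu->cpu_context)
            return -1;
    }
    pmu->cpu_context->nr_pmus++;
    model_pmus[model_nr_pmus++] = pmu;
    return 0;
}
```

Registering two PMUs that both use the software context type leaves them pointing at the same context object, while a PMU with the hardware context type gets its own.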
The figure below shows the related structures, taken from this post.
At the task level, 'task_struct' has an array of pointers defined as follows:
struct task_struct {
struct perf_event_context *perf_event_ctxp[perf_nr_task_contexts];
};
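The figure below shows the related structures, also taken from that post.
CPU-level perf events are triggered whenever the CPU is online. Task-level perf events, however, can only be triggered while the task is running. The 'task_ctx' field of 'perf_cpu_context' holds the perf context of the task that is currently running.
Perf event context schedule
One of perf's jobs is to schedule a task's perf_event_context in and out.
The figure below shows the perf-related functions for scheduling a task in and out.
In the end, the PMU's add and del callbacks are called. Take the tracepoint as an example: the add callback is 'perf_trace_add' and the del callback is 'perf_trace_del'.
int perf_trace_add(struct perf_event *p_event, int flags)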
{
struct trace_event_call *tp_event = p_event->tp_event;
if (!(flags & PERF_EF_START))
p_event->hw.state = PERF_HES_STOPPED;
/*
* If TRACE_REG_PERF_ADD returns false; no custom action was performed
* and we need to take the default action of enqueueing our event on
* the right per-cpu hlist.
*/
if (!tp_event->class->reg(tp_event, TRACE_REG_PERF_ADD, p_event)) {
struct hlist_head __percpu *pcpu_list;
struct hlist_head *list;
pcpu_list = tp_event->perf_events;
if (WARN_ON_ONCE(!pcpu_list))
return -EINVAL;
list = this_cpu_ptr(pcpu_list);
hlist_add_head_rcu(&p_event->hlist_entry, list);
}
return 0;
}
void perf_trace_del(struct perf_event *p_event, int flags)
{
struct trace_event_call *tp_event = p_event->tp_event;
/*
* If TRACE_REG_PERF_DEL returns false; no custom action was performed
* and we need to take the default action of dequeueing our event from
* the right per-cpu hlist.
*/
if (!tp_event->class->reg(tp_event, TRACE_REG_PERF_DEL, p_event))
hlist_del_rcu(&p_event->hlist_entry);
}
The 'perf_event' is added to or removed from the 'tp_event->perf_events' list.
perf_event_open flow
perf_event_open
    ->perf_copy_attr
    ->get_unused_fd_flags(fd)
    ->perf_event_alloc
        ->perf_init_event
            ->perf_try_init_event
                ->pmu->event_init()
    ->find_get_context
    ->perf_install_in_context
        ->__perf_install_in_context
            ->add_event_to_ctx
                ->list_add_event
                ->perf_group_attach
        ->add_event_to_ctx
    ->fd_install
perf_event_open calls 'pmu->event_init' to initialize the event, and then adds the perf_event to a perf_event_context.
Perf trace events
Recall the definition of the tracepoint PMU.
static struct pmu perf_tracepoint = {
.task_ctx_nr = perf_sw_context,
.event_init = perf_tp_event_init,
.add = perf_trace_add,
.del = perf_trace_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
Let us look at how the perf subsystem monitors tracepoint events.
Perf event init
'perf_tp_event_init' is called.
perf_tp_event_init
    ->perf_trace_init
        ->perf_trace_event_init
            ->perf_trace_event_reg
                ->tp_event->class->reg(TRACE_REG_PERF_REGISTER)
'perf_trace_init' finds the specified tracepoint.
'perf_trace_event_reg' allocates and initializes the 'tp_event->perf_events' list, then calls 'tp_event->class->reg' with TRACE_REG_PERF_REGISTER.
static int perf_trace_event_reg(struct trace_event_call *tp_event,
struct perf_event *p_event)
{
struct hlist_head __percpu *list;
int ret = -ENOMEM;
int cpu;
p_event->tp_event = tp_event;
if (tp_event->perf_refcount++ > 0)
return 0;
list = alloc_percpu(struct hlist_head);
if (!list)
goto fail;
for_each_possible_cpu(cpu)
INIT_HLIST_HEAD(per_cpu_ptr(list, cpu));
tp_event->perf_events = list;
...
ret = tp_event->class->reg(tp_event, TRACE_REG_PERF_REGISTER, NULL);
if (ret)
goto fail;
total_ref_count++;
return 0;
...
}
The 'tp_event->class->reg' callback is 'trace_event_reg'.
int trace_event_reg(struct trace_event_call *call,
enum trace_reg type, void *data)
{
struct trace_event_file *file = data;
WARN_ON(!(call->flags & TRACE_EVENT_FL_TRACEPOINT));
switch (type) {
...
#ifdef CONFIG_PERF_EVENTS
case TRACE_REG_PERF_REGISTER:
return tracepoint_probe_register(call->tp,
call->class->perf_probe,
call);
case TRACE_REG_PERF_UNREGISTER:
tracepoint_probe_unregister(call->tp,
call->class->perf_probe,
call);
return 0;
case TRACE_REG_PERF_OPEN:
case TRACE_REG_PERF_CLOSE:
case TRACE_REG_PERF_ADD:
case TRACE_REG_PERF_DEL:
return 0;
#endif
}
return 0;
}
We can see that 'call->class->perf_probe' is registered on the tracepoint. From my post, we know that this 'perf_probe' is 'perf_trace_##call'.
static notrace void
perf_trace_##call(void *__data, proto)
{
struct trace_event_call *event_call = __data;
struct trace_event_data_offsets_##call __maybe_unused __data_offsets;
struct trace_event_raw_##call *entry;
struct pt_regs *__regs;
u64 __count = 1;
struct task_struct *__task = NULL;
struct hlist_head *head;
int __entry_size;
int __data_size;
int rctx;
__data_size = trace_event_get_offsets_##call(&__data_offsets, args);
head = this_cpu_ptr(event_call->perf_events);
if (!bpf_prog_array_valid(event_call) &&
__builtin_constant_p(!__task) && !__task &&
hlist_empty(head))
return;
__entry_size = ALIGN(__data_size + sizeof(*entry) + sizeof(u32),
sizeof(u64));
__entry_size -= sizeof(u32);
entry = perf_trace_buf_alloc(__entry_size, &__regs, &rctx);
if (!entry)
return;
perf_fetch_caller_regs(__regs);
tstruct
{ assign; }
perf_trace_run_bpf_submit(entry, __entry_size, rctx,
event_call, __count, __regs,
head, __task);
}
If 'event_call->perf_events' is empty, no perf_event is currently added to this tracepoint. This is the default state right after 'perf_event_open' has initialized a perf_event.
Perf event add
When the task is scheduled onto a CPU, 'pmu->add' is called, and the 'perf_event' is linked into the 'event_call->perf_events' list.
Perf event del
When the task is scheduled off the CPU, 'pmu->del' is called, and the 'perf_event' is removed from the 'event_call->perf_events' list.
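The add and del behavior described above, together with the early return when the list is empty, can be sketched as a small user-space model. All model_* names are hypothetical, a plain singly linked list replaces the kernel's RCU-protected hlist, and the CPU is passed explicitly instead of using this_cpu_ptr.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 2

struct model_event {
    struct model_event *next;
    unsigned long hits;                    /* times the tracepoint reached us */
};

/* One list head per CPU, like tp_event->perf_events. */
struct model_tracepoint {
    struct model_event *head[MODEL_NR_CPUS];
};

/* Models pmu->add: called when the task is scheduled onto `cpu`. */
static void model_trace_add(struct model_tracepoint *tp, int cpu,
                            struct model_event *ev)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    ev->next = tp->head[cpu];
    tp->head[cpu] = ev;
}

/* Models pmu->del: called when the task is scheduled off `cpu`. */
static void model_trace_del(struct model_tracepoint *tp, int cpu,
                            struct model_event *ev)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    for (struct model_event **pp = &tp->head[cpu]; *pp; pp = &(*pp)->next) {
        if (*pp == ev) {
            *pp = ev->next;
            ev->next = NULL;
            return;
        }
    }
}

/* Models the probe: return early if nothing is attached on this CPU,
 * otherwise deliver to every event. Returns the number of deliveries. */
static int model_trace_fire(struct model_tracepoint *tp, int cpu)
{
    int delivered = 0;

    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (!tp->head[cpu])
        return 0;
    for (struct model_event *ev = tp->head[cpu]; ev; ev = ev->next) {
        ev->hits++;
        delivered++;
    }
    return delivered;
}
```

In this model an event only sees tracepoint hits on the CPU where it was added, and only between the add and the del, which is the effect the scheduling hooks are meant to achieve for task-level events.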
Perf event trigger
If 'event_call->perf_events' is not empty, 'perf_trace_run_bpf_submit' is called. If no eBPF program is attached, 'perf_tp_event' is called.
void perf_tp_event(u16 event_type, u64 count,
void *record, int entry_size,
struct pt_regs *regs, struct hlist_head *head, int rctx,
struct task_struct *task)
{
struct perf_sample_data data;
struct perf_event *event;
struct perf_raw_record raw = {
.frag = {
.size = entry_size,
.data = record,
},
};
perf_sample_data_init(&data, 0, 0);
data.raw = &raw;
perf_trace_buf_update(record, event_type);
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_tp_event_match(event, &data, regs))
perf_swevent_event(event, count, &data, regs);
}
...
perf_swevent_put_recursion_context(rctx);
}
For each 'perf_event' in the 'event_call->perf_events' list, it calls 'perf_swevent_event' to trigger the perf event.
static void perf_swevent_event(struct perf_event *event,
u64 nr,struct perf_sample_data *data,
struct pt_regs *regs)
{
struct hw_perf_event *hwc = &event->hw;
local64_add(nr, &event->count);
if (!regs)
return;
if (!is_sampling_event(event))
return;
if ((event->attr.sample_type & PERF_SAMPLE_PERIOD)
&& !event->attr.freq) {
data->period = nr;
return perf_swevent_overflow(event, 1, data, regs);
} else
data->period = event->hw.last_period;
if (nr == 1 && hwc->sample_period == 1 &&
!event->attr.freq)
return perf_swevent_overflow(event, 1, data, regs);
if (local64_add_negative(nr, &hwc->period_left))
return;
perf_swevent_overflow(event, 0, data, regs);
}
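'perf_swevent_event' adds to 'event->count'. If the event is not a sampling event, it simply returns; this is perf's counting mode. If the perf_event is in sampling mode, the tracepoint data needs to be copied out. The call chain is as follows.
perf_swevent_overflow
    ->__perf_event_overflow
        ->event->overflow_handler(perf_event_output)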
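The difference between counting mode and sampling mode can be sketched as a user-space model of the period bookkeeping. This is a simplification under stated assumptions: all model_* names are hypothetical, a samples counter stands in for the overflow handler, period_left is assumed to start at minus the sample period, and the frequency mode and PERF_SAMPLE_PERIOD shortcuts in the kernel function are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_sw_event {
    uint64_t count;          /* total occurrences, like event->count */
    uint64_t sample_period;  /* 0 means counting mode (no sampling) */
    int64_t period_left;     /* negative while the period has not elapsed */
    uint64_t samples;        /* stand-in for calls to the overflow handler */
};

static void model_sw_event_init(struct model_sw_event *ev, uint64_t period)
{
    assert(ev != NULL);
    ev->count = 0;
    ev->sample_period = period;
    ev->period_left = -(int64_t)period;
    ev->samples = 0;
}

/* Record `nr` occurrences of the event. */
static void model_sw_event_fire(struct model_sw_event *ev, uint64_t nr)
{
    assert(ev != NULL);
    ev->count += nr;                     /* always counted */
    if (ev->sample_period == 0)
        return;                          /* counting mode: done */

    ev->period_left += (int64_t)nr;
    if (ev->period_left < 0)
        return;                          /* period not yet elapsed */

    /* One or more whole periods elapsed: emit that many samples and
     * carry the remainder into the next period. */
    uint64_t left = (uint64_t)ev->period_left;
    ev->samples += left / ev->sample_period + 1;
    ev->period_left = (int64_t)(left % ev->sample_period)
                      - (int64_t)ev->sample_period;
}
```

With a sample period of 3, firing the event seven times one at a time yields a count of 7 and two samples in this model; with a period of 0 only the count changes.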
Software perf events
The software PMU is defined as follows:
static struct pmu perf_swevent = {
.task_ctx_nr = perf_sw_context,
.capabilities = PERF_PMU_CAP_NO_NMI,
.event_init = perf_swevent_init,
.add = perf_swevent_add,
.del = perf_swevent_del,
.start = perf_swevent_start,
.stop = perf_swevent_stop,
.read = perf_swevent_read,
};
Perf event init
'perf_swevent_init' is called. It calls 'swevent_hlist_get'.
static int perf_swevent_init(struct perf_event *event)
{
u64 event_id = event->attr.config;
if (event->attr.type != PERF_TYPE_SOFTWARE)
return -ENOENT;
/*
* no branch sampling for software events
*/
if (has_branch_stack(event))
return -EOPNOTSUPP;
switch (event_id) {
case PERF_COUNT_SW_CPU_CLOCK:
case PERF_COUNT_SW_TASK_CLOCK:
return -ENOENT;
default:
break;
}
if (event_id >= PERF_COUNT_SW_MAX)
return -ENOENT;
if (!event->parent) {
int err;
err = swevent_hlist_get();
if (err)
return err;
static_key_slow_inc(&perf_swevent_enabled[event_id]);
event->destroy = sw_perf_event_destroy;
}
return 0;
}
This creates the per-CPU 'swhash->swevent_hlist' list. It also enables the static key perf_swevent_enabled[event_id].
Perf event add
'perf_swevent_add' adds the perf_event to the per-CPU hash list.
static int perf_swevent_add(struct perf_event *event, int flags)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
struct hw_perf_event *hwc = &event->hw;
struct hlist_head *head;
if (is_sampling_event(event)) {
hwc->last_period = hwc->sample_period;
perf_swevent_set_period(event);
}
hwc->state = !(flags & PERF_EF_START);
head = find_swevent_head(swhash, event);
if (WARN_ON_ONCE(!head))
return -EINVAL;
hlist_add_head_rcu(&event->hlist_entry, head);
perf_event_update_userpage(event);
return 0;
}
Perf event del
'perf_swevent_del' removes the event from the hash list.
static void perf_swevent_del(struct perf_event *event, int flags)
{
hlist_del_rcu(&event->hlist_entry);
}
Perf event trigger
Take a task switch as an example.
'perf_sw_event_sched' is called.
static inline void perf_event_task_sched_out(struct task_struct *prev,
struct task_struct *next)
{
perf_sw_event_sched(PERF_COUNT_SW_CONTEXT_SWITCHES, 1, 0);
if (static_branch_unlikely(&perf_sched_events))
__perf_event_task_sched_out(prev, next);
}
The call chain that follows is perf_event_task_sched_out -> ___perf_sw_event -> do_perf_sw_event.
static void do_perf_sw_event(enum perf_type_id type, u32 event_id,
u64 nr,
struct perf_sample_data *data,
struct pt_regs *regs)
{
struct swevent_htable *swhash = this_cpu_ptr(&swevent_htable);
struct perf_event *event;
struct hlist_head *head;
rcu_read_lock();
head = find_swevent_head_rcu(swhash, type, event_id);
if (!head)
goto end;
hlist_for_each_entry_rcu(event, head, hlist_entry) {
if (perf_swevent_match(event, type, event_id, data, regs))
perf_swevent_event(event, nr, data, regs);
}
end:
rcu_read_unlock();
}
As we can see, it finally calls 'perf_swevent_event' to trigger the event.
Hardware perf events
One of the hardware PMUs is defined as follows:
static struct pmu pmu = {
.pmu_enable = x86_pmu_enable,
.pmu_disable = x86_pmu_disable,
.attr_groups = x86_pmu_attr_groups,
.event_init = x86_pmu_event_init,
.event_mapped = x86_pmu_event_mapped,
.event_unmapped = x86_pmu_event_unmapped,
.add = x86_pmu_add,
.del = x86_pmu_del,
.start = x86_pmu_start,
.stop = x86_pmu_stop,
.read = x86_pmu_read,
.start_txn = x86_pmu_start_txn,
.cancel_txn = x86_pmu_cancel_txn,
.commit_txn = x86_pmu_commit_txn,
.event_idx = x86_pmu_event_idx,
.sched_task = x86_pmu_sched_task,
.task_ctx_size = sizeof(struct x86_perf_task_context),
.swap_task_ctx = x86_pmu_swap_task_ctx,
.check_period = x86_pmu_check_period,
.aux_output_match = x86_pmu_aux_output_match,
};
Hardware perf events are quite complicated because they interact with the hardware. The hardware details are not covered in depth here.
Perf event init
x86_pmu_event_init
    ->__x86_pmu_event_init
        ->x86_reserve_hardware
        ->x86_pmu.hw_config()
    ->validate_event
Here 'x86_pmu' is the arch-specific PMU structure.
Perf event add
x86_pmu_add -> collect_events -> x86_pmu.schedule_events() -> x86_pmu.add
'collect_events' sets:
cpuc->event_list[n] = leader;
Perf event del
'x86_pmu_del' removes the event from 'cpuc->event_list'.
Perf event trigger
When a hardware event fires, it raises an NMI. The handler is 'perf_event_nmi_handler'.
static int
perf_event_nmi_handler(unsigned int cmd, struct pt_regs *regs)
{
u64 start_clock;
u64 finish_clock;
int ret;
/*
* All PMUs/events that share this PMI handler should make sure to
* increment active_events for their events.
*/
if (!atomic_read(&active_events))
return NMI_DONE;
start_clock = sched_clock();
ret = x86_pmu.handle_irq(regs);
finish_clock = sched_clock();
perf_sample_event_took(finish_clock - start_clock);
return ret;
}
Take 'x86_pmu.handle_irq' = x86_pmu_handle_irq as an example.
for (idx = 0; idx < x86_pmu.num_counters; idx++) {
if (!test_bit(idx, cpuc->active_mask))
continue;
event = cpuc->events[idx];
val = x86_perf_event_update(event);
if (val & (1ULL << (x86_pmu.cntval_bits - 1)))
continue;
/*
* event overflow
*/
handled++;
perf_sample_data_init(&data, 0, event->hw.last_period);
if (!x86_perf_event_set_period(event))
continue;
if (perf_event_overflow(event, &data, regs))
x86_pmu_stop(event, 0);
}
Here we can see that it iterates over the counters in 'cpuc' to find the events that triggered this interrupt.
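The top-bit test in the loop above can be illustrated with a small model of a counter that is narrower than 64 bits. This sketch assumes the counter is programmed with the negated period, so that its top bit stays set until it wraps. The model_* names are hypothetical, and the period is assumed to be at most half the counter range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Mask covering the low `bits` bits of the counter. */
static uint64_t model_counter_mask(unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return bits == 64 ? ~0ULL : (1ULL << bits) - 1;
}

/* Program the counter with -period so that it wraps to zero after
 * exactly `period` events (assumes period <= 2^(bits - 1)). */
static uint64_t model_counter_program(uint64_t period, unsigned bits)
{
    return ((uint64_t)0 - period) & model_counter_mask(bits);
}

/* Count `n` more events, wrapping at the counter width. */
static uint64_t model_counter_advance(uint64_t val, uint64_t n, unsigned bits)
{
    return (val + n) & model_counter_mask(bits);
}

/* Inverse of the check in the loop: a clear top bit means the counter
 * has wrapped, that is, this event overflowed. */
static bool model_counter_overflowed(uint64_t val, unsigned bits)
{
    assert(bits >= 1 && bits <= 64);
    return !(val & (1ULL << (bits - 1)));
}
```

For a 48-bit counter programmed with a period of 1000, the top bit remains set for the first 999 events and clears on the 1000th, which is the point where the loop would treat the event as overflowed.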
Reviewing editor: Tang Zihong