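The two bottlenecks in poll's efficiency have been identified; the question now is how to fix them. First, if you monitor 1000 fds, every poll call has to copy all 1000 fds into the kernel, which is wasteful. Why doesn't the kernel simply remember the fds it has already been given? That is exactly what epoll does, and its API makes this plain: fds are not passed in at epoll_wait time. Instead they are all handed to the kernel through epoll_ctl beforehand and then waited on together, which eliminates the redundant copying. Second, epoll_wait does not add current to the device wait queue of each fd in turn. Instead, when a device wait queue is woken, a callback function is invoked (this of course requires a "wake-up callback" mechanism) that puts the fd that produced the event on a list, and epoll_wait then returns the fds on that list.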
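As a user-space illustration of the "register once, wait many times" design, here is a minimal sketch, assuming a Linux system. The helper name register_once_wait_many and the use of a pipe as the monitored fd are choices made for this example only.

```c
#include <assert.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Registers a pipe's read end once, then runs `rounds` write/wait/read
 * cycles. Returns how many epoll_wait calls reported the fd as readable,
 * or -1 if setup failed. */
int register_once_wait_many(int rounds)
{
    assert(rounds >= 0);

    int p[2];
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    /* The only point at which the fd is handed to the kernel. */
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
    int hits = -1;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
        hits = 0;
        for (int i = 0; i < rounds; i++) {
            char c = 'x';
            struct epoll_event out;
            if (write(p[1], &c, 1) != 1)
                break;
            /* No fd set is passed in here; the kernel already holds it. */
            if (epoll_wait(epfd, &out, 1, 1000) == 1 && out.data.fd == p[0])
                hits++;
            if (read(p[0], &c, 1) != 1)   /* drain so the next round starts idle */
                break;
        }
    }
    close(epfd);
    close(p[0]);
    close(p[1]);
    return hits;
}
```

Calling register_once_wait_many(5) performs a single epoll_ctl followed by five epoll_wait calls; with poll, the same loop would pass the whole pollfd array on every iteration.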
    In addition, the epoll mechanism implements its own dedicated filesystem, the eventpoll filesystem.
1. Kernel data structures
(1) struct eventpoll {
    spinlock_t lock;
    struct mutex mtx;
    wait_queue_head_t wq;        /* Wait queue used by sys_epoll_wait(); a task calling epoll_wait() sleeps on this queue */
    wait_queue_head_t poll_wait; /* Wait queue used by file->poll(); used when the epoll fd itself is being polled */
    struct list_head rdllist;    /* List of ready file descriptors; every ready epitem is on this list */
    struct rb_root rbr;          /* RB tree root used to store monitored fd structs; every monitored epitem is here */
    struct epitem *ovflist;      /* Holds epitems whose events arrived while events were being transferred to user space */
    struct user_struct *user;    /* Per-user data, such as the maximum number of fds that may be monitored */
};
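Note that a socket added to an epoll descriptor through epoll_ctl still belongs to the socket filesystem, not to the eventpoll filesystem. Each fd added for monitoring ("monitoring" here has nothing to do with the listen call) corresponds to one epitem structure. The epitems are organized as a red-black tree whose root is stored in eventpoll (the rbr member). The epitems of sockets on which a monitored event has arrived are additionally linked into a doubly linked list whose head is also stored in eventpoll (the rdllist member).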
/*
 * Each file descriptor added to the eventpoll interface will have an entry of this type linked to the "rbr" RB tree.
 */
(2) struct epitem {
    struct rb_node rbn;          /* RB tree node used to link this structure to the eventpoll RB tree */
    struct list_head rdllink;    /* List node; every ready epitem is linked into eventpoll's rdllist */
    struct epitem *next;
    struct epoll_filefd ffd;     /* The file descriptor information this item refers to */
    int nwait;                   /* Number of active wait queue attached to poll operations */
    struct list_head pwqlist;    /* List containing poll wait queues */
    struct eventpoll *ep;        /* The "container" of this item */
    struct list_head fllink;     /* List header used to link this item to the "struct file" items list */
    struct epoll_event event;    /* The events this epitem is interested in; copied in from user space by epoll_ctl */
};
(3) struct epoll_filefd {
    struct file *file;
    int fd;
};
(4) struct eppoll_entry { /* Wait structure used by the poll hooks */
    struct list_head llink;      /* List header used to link this structure to the "struct epitem" */
    struct epitem *base;         /* The "base" pointer is set to the container "struct epitem" */
    wait_queue_t wait;           /* Wait queue item that will be linked to the target file wait queue head. */
    wait_queue_head_t *whead;    /* The wait queue head that linked the "wait" wait queue item */
}; /* Note: the last two members together play the role of the wait queue linkage */
(5) struct ep_pqueue { /* Wrapper struct used by poll queueing */
    poll_table pt;               /* poll_table is a wrapper around a function pointer */
    struct epitem *epi;
};
(6) struct ep_send_events_data {
    /* Used by the ep_send_events() function as callback private data */
    int maxevents;
    struct epoll_event __user *events;
};
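Several of these structures embed a struct list_head (rdllink, fllink, llink) rather than pointing at list nodes. The kernel walks such a list and recovers the enclosing structure from the embedded node. The following is a user-space sketch of that idea only; the types epitem_sketch and eventpoll_sketch and the helper collect_ready_fds are hypothetical, cut-down stand-ins, not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

static void list_init(struct list_head *h) { h->next = h->prev = h; }

static void list_add_tail(struct list_head *n, struct list_head *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Recover the enclosing structure from a pointer to one of its members. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical miniature of struct epitem. */
struct epitem_sketch {
    int fd;
    struct list_head rdllink;   /* node on the owner's ready list */
};

/* Hypothetical miniature of struct eventpoll. */
struct eventpoll_sketch {
    struct list_head rdllist;   /* head of the ready list */
};

/* Copies the fds of the ready items into out[] in list order and
 * returns how many were copied (at most max). */
int collect_ready_fds(struct eventpoll_sketch *ep, int *out, int max)
{
    assert(ep != NULL && out != NULL && max >= 0);
    int n = 0;
    for (struct list_head *p = ep->rdllist.next;
         p != &ep->rdllist && n < max; p = p->next)
        out[n++] = container_of(p, struct epitem_sketch, rdllink)->fd;
    return n;
}
```

Because the node lives inside the item, one epitem can sit on the red-black tree, the ready list, and the per-file list at the same time without any extra allocation.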
[Figure: relationships among the data structures above]
2. Function call analysis
[Figure: overall call graph of the epoll functions]
3. Function implementation analysis
3.1 eventpoll_init
epoll is a module, so we start with the module entry point, eventpoll_init.
[fs/eventpoll.c-->eventpoll_init()] (simplified)
static int __init eventpoll_init(void)
{
    int error;

    epi_cache = kmem_cache_create("eventpoll_epi", sizeof(struct epitem),
            0, SLAB_HWCACHE_ALIGN|EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);
    pwq_cache = kmem_cache_create("eventpoll_pwq",
            sizeof(struct eppoll_entry), 0, EPI_SLAB_DEBUG|SLAB_PANIC, NULL, NULL);
    /* Register a new filesystem named "eventpollfs" */
    error = register_filesystem(&eventpoll_fs_type);
    eventpoll_mnt = kern_mount(&eventpoll_fs_type);
    /* error handling omitted */
    return 0;
}
Interestingly, the module registers a new filesystem called "eventpollfs" during initialization (the name is in the eventpoll_fs_type structure) and then mounts it. It also creates two kernel caches, one for struct epitem and one for struct eppoll_entry (in kernel programming, when small blocks of memory are allocated frequently, a kmem_cache should be created to act as a memory pool).
Now think about why epoll_create returns a new fd: it creates a new file in this "eventpollfs" filesystem. See the following:
3.2 sys_epoll_create
[fs/eventpoll.c-->sys_epoll_create()]
asmlinkage long sys_epoll_create(int size)
{
    int error, fd;
    struct inode *inode;
    struct file *file;
    error = ep_getfd(&fd, &inode, &file);
    /* Setup the file internal data structure ( "struct eventpoll" ) */
    error = ep_file_init(file);
    /* error handling omitted */
    return fd;
}
The function is simple. ep_getfd looks like a "get", but when epoll_create is called it actually creates a new inode, a new file, and a new fd. ep_file_init then creates a struct eventpoll and stores it in file->private_data. Keep this private_data in mind; it is used again later.
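Because the epoll descriptor is backed by a real file, it can be handed to fstat and can itself be polled, which is what the poll_wait member of struct eventpoll exists for. Below is a minimal sketch, assuming Linux; poll_the_epoll_fd is a hypothetical helper written for this example. (On current kernels the file sits on an anonymous inode rather than a dedicated eventpollfs, but the behaviour shown here is the same.)

```c
#include <assert.h>
#include <poll.h>
#include <sys/epoll.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates an epoll instance watching a pipe and asks poll() about the
 * epoll fd itself. If make_ready is 1, a byte is written to the pipe
 * first. Returns the revents reported for the epoll fd (0 if none),
 * or -1 on failure. */
int poll_the_epoll_fd(int make_ready)
{
    assert(make_ready == 0 || make_ready == 1);

    int p[2], result = -1;
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct stat st;
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] };
        if (fstat(epfd, &st) == 0 &&                       /* it is a file */
            epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            (!make_ready || write(p[1], "x", 1) == 1)) {
            struct pollfd pfd = { .fd = epfd, .events = POLLIN };
            int n = poll(&pfd, 1, 0);                      /* poll the epoll fd */
            result = (n < 0) ? -1 : (n == 0 ? 0 : pfd.revents);
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return result;
}
```

The epoll fd reports POLLIN exactly when its own ready list is non-empty, which is also what makes it possible to nest one epoll instance inside another.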
3.3 epoll_ctl
With epoll_create done, next comes epoll_ctl. The validation code is omitted:
[fs/eventpoll.c-->sys_epoll_ctl()]
asmlinkage long
sys_epoll_ctl(int epfd, int op, int fd, struct epoll_event __user *event)
{
    int error;
    struct file *file, *tfile;
    struct eventpoll *ep;
    struct epitem *epi;
    struct epoll_event epds;
    ....
    epi = ep_find(ep, tfile, fd); /* look up the epitem for the monitored fd in the rb-tree; tfile is that fd's struct file */
    switch (op) { /* NULL checks on epi omitted */
    case EPOLL_CTL_ADD:
        epds.events |= POLLERR | POLLHUP;
        error = ep_insert(ep, &epds, tfile, fd);
        break;
    case EPOLL_CTL_DEL:
        error = ep_remove(ep, epi);
        break;
    case EPOLL_CTL_MOD:
        epds.events |= POLLERR | POLLHUP;
        error = ep_modify(ep, epi, &epds);
        break;
    }
    ....
}
    So the function first runs ep_find on a "big structure" to look for a struct epitem and then, depending on whether the user asked for ADD, DEL, or MOD, calls the corresponding function, which adds, removes, or modifies the matching node in the red-black tree of epitems (one node per monitored fd). Straightforward enough. But what is this "big structure"? From the way ep_find is called, the ep argument must point to it, and the assignment ep = file->private_data shows that it is the struct eventpoll created by epoll_create. Looking at the implementation of ep_find, the search runs over the rbr member of struct eventpoll (a struct rb_root), which is the root of a red-black tree whose nodes are all struct epitem.
    The picture is now clear: a newly created epoll file carries a struct eventpoll, that structure holds a red-black tree, and the tree is where each epoll_ctl call stores its fd.
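The ep_find lookup is visible from user space through the error codes of epoll_ctl: adding an fd that is already in the tree fails with EEXIST, and modifying or deleting one that is not there fails with ENOENT. Here is a small sketch, assuming Linux; ctl_errno and epoll_ctl_outcomes are hypothetical helpers for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Performs `op` on (epfd, fd); returns 0 on success or the errno value. */
static int ctl_errno(int epfd, int op, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, op, fd, &ev) == 0 ? 0 : errno;
}

/* Fills res[0..4] with the outcome of, in order: MOD before ADD, ADD,
 * a second ADD, DEL, a second DEL. Returns 0, or -1 if setup failed. */
int epoll_ctl_outcomes(int res[5])
{
    assert(res != NULL);

    int p[2];
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    res[0] = ctl_errno(epfd, EPOLL_CTL_MOD, p[0]);  /* not in the tree yet */
    res[1] = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);  /* inserted            */
    res[2] = ctl_errno(epfd, EPOLL_CTL_ADD, p[0]);  /* already in the tree */
    res[3] = ctl_errno(epfd, EPOLL_CTL_DEL, p[0]);  /* removed             */
    res[4] = ctl_errno(epfd, EPOLL_CTL_DEL, p[0]);  /* no longer there     */
    close(epfd);
    close(p[0]);
    close(p[1]);
    return 0;
}
```

The five slots come out as ENOENT, 0, EEXIST, 0, ENOENT, which matches a lookup keyed on the fd followed by an insert, remove, or modify.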
3.4 sys_epoll_wait
With the data structures clear, we turn to the core function:
[fs/eventpoll.c-->sys_epoll_wait()]
asmlinkage long sys_epoll_wait(int epfd, struct epoll_event __user *events, int maxevents,
        int timeout)
{
    int error;
    struct file *file;
    struct eventpoll *ep;
    /* Get the "struct file *" for the eventpoll file */
    file = fget(epfd);
    /*
     * We have to check that the file structure underneath the fd
     * the user passed to us _is_ an eventpoll file (so passing the fd of an ordinary file here is an error).
     */
    if (!IS_FILE_EPOLL(file))
        goto eexit_2;
    ep = file->private_data;
    error = ep_poll(ep, events, maxevents, timeout);
    ....
}
The same trick again: fetch the struct eventpoll from file->private_data, then call ep_poll.
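The "is this really an eventpoll file" check can be observed from user space: epoll_wait on a descriptor that is not an epoll fd fails with EINVAL. A minimal sketch, assuming Linux; epoll_wait_errno is a hypothetical helper for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Calls epoll_wait on an arbitrary descriptor with a zero timeout.
 * Returns 0 if the call succeeded, otherwise the errno value. */
int epoll_wait_errno(int fd)
{
    assert(fd >= 0);
    struct epoll_event out;
    return epoll_wait(fd, &out, 1, 0) < 0 ? errno : 0;
}
```

Passing the read end of a pipe yields EINVAL, while passing a descriptor obtained from epoll_create1 yields 0.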
3.5 ep_poll()
[fs/eventpoll.c-->sys_epoll_wait()->ep_poll()]
static int ep_poll(struct eventpoll *ep, struct epoll_event __user *events, int maxevents,
        long timeout)
{
    int res = 0;
    long jtimeout;      /* timeout converted to jiffies; the conversion is omitted here */
    wait_queue_t wait;  /* wait queue entry */
    ....
    if (list_empty(&ep->rdllist)) {
        /* ep->rdllist holds the ready fds; if it is empty nothing is ready yet, so the current process has to be put on the wait queue */
        init_waitqueue_entry(&wait, current);  /* create a wait queue entry initialized with the current process */
        add_wait_queue(&ep->wq, &wait);        /* add the entry to ep's wait queue, i.e. queue the current process */
        for (;;) {
            /* Set the state to TASK_INTERRUPTIBLE before testing the condition, so that a wake-up sent by ep_poll_callback() in between is not lost */
            set_current_state(TASK_INTERRUPTIBLE);
            if (!list_empty(&ep->rdllist) || !jtimeout)  /* leave the loop if something is ready or the timeout has expired */
                break;
            if (signal_pending(current)) {
                res = -EINTR;
                break;
            }
            jtimeout = schedule_timeout(jtimeout);  /* the actual sleep (locking omitted) */
        }
        remove_wait_queue(&ep->wq, &wait);  /* take the current process off the wait queue */
        set_current_state(TASK_RUNNING);
    }
    ....
Another big loop, but a better one than poll's: on closer inspection it does nothing except sleep and check whether ep->rdllist is empty. Doing so little is naturally efficient. But who makes ep->rdllist non-empty? The answer is the callback function installed by ep_insert.
3.6 ep_insert()
[fs/eventpoll.c-->sys_epoll_ctl()-->ep_insert()]
static int ep_insert(struct eventpoll *ep, struct epoll_event *event, struct file *tfile, int fd)
{
    int revents;
    struct epitem *epi;
    struct ep_pqueue epq;      /* create an ep_pqueue object */
    epi = EPI_MEM_ALLOC();     /* allocate an epitem */
    /* Initialize this epitem ... */
    epi->ep = ep;              /* point the new epitem at the struct eventpoll passed in */
    /* the next few lines fill in the corresponding epitem fields */
    EP_SET_FFD(&epi->ffd, tfile, fd);  /* record the monitored fd in the new epitem */
    epi->event = *event;
    epi->nwait = 0;
    /* Initialize the poll table using the queue callback */
    epq.epi = epi;             /* tie the epq to the newly inserted epitem (epi) */
    /* the next line is equivalent to epq.pt.qproc = ep_ptable_queue_proc; */
    init_poll_funcptr(&epq.pt, ep_ptable_queue_proc);
    revents = tfile->f_op->poll(tfile, &epq.pt);  /* tfile is the target file, i.e. the monitored file; poll() returns the mask of ready events, stored in revents */
    list_add_tail(&epi->fllink, &tfile->f_ep_links);  /* each file links together all the epitems that monitor it */
    ep_rbtree_insert(ep, epi); /* with everything set up, insert the epitem into its eventpoll */
    ....
}
The tfile->f_op->poll(tfile, &epq.pt) call that follows invokes the poll method of the monitored file (which epoll calls the "target file"). That poll method in turn calls poll_wait (remember poll_wait? every device driver that supports poll has to call it), which finally calls ep_ptable_queue_proc. (Note: f_op->poll() is usually just a wrapper that calls the real poll implementation. For a UDP socket, for example, the call chain is f_op->poll(), sock_poll(), udp_poll(), datagram_poll(), sock_poll_wait().) This call relationship is fairly hard to follow because it is not a direct call at the language level; it goes through function pointers. ep_insert also puts the struct epitem on the f_ep_links list in struct file so that it is easy to find later; that is the job of the fllink member of struct epitem.
3.7 ep_ptable_queue_proc
[fs/eventpoll.c-->ep_ptable_queue_proc()]
static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead, poll_table *pt)
{
    struct epitem *epi = EP_ITEM_FROM_EPQUEUE(pt);
    struct eppoll_entry *pwq;
    if (epi->nwait >= 0 && (pwq = PWQ_MEM_ALLOC())) {
        init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
        pwq->whead = whead;
        pwq->base = epi;
        add_wait_queue(whead, &pwq->wait);
        list_add_tail(&pwq->llink, &epi->pwqlist);
        epi->nwait++;
    } else {
        /* We have to signal that an error occurred */
        epi->nwait = -1;
    }
}
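    The code above is the most important thing ep_insert does: it creates a struct eppoll_entry, sets its wake-up callback to ep_poll_callback, and adds it to the device wait queue (the whead here is the wait queue that, as described in the previous chapter, every device driver carries). Only then will ep_poll_callback be called when the device becomes ready and wakes the waiters on its wait queue. On every poll system call, the operating system has to hang current (the current process) on the wait queues of all the devices behind the fds; with thousands of fds this is clearly expensive. epoll_wait is far less fussy: epoll hooks onto each device wait queue only once, at epoll_ctl time (this first pass cannot be avoided), and gives each fd the instruction "call the callback when you are ready". When a device has an event, the callback puts the fd on rdllist, and each epoll_wait call only has to collect the fds on rdllist. By making clever use of callbacks, epoll implements a more efficient event-driven model.
    By now we can guess what ep_poll_callback does: it must take the epitem (one per fd) on the red-black tree (ep->rbr) that received the event and insert it into ep->rdllist, so that when epoll_wait returns, rdllist holds exactly the ready fds.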
3.8 ep_poll_callback
[fs/eventpoll.c-->ep_poll_callback()]
static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
    int pwake = 0;
    struct epitem *epi = EP_ITEM_FROM_WAIT(wait);
    struct eventpoll *ep = epi->ep;
    /* If this file is already in the ready list we exit soon */
    if (EP_IS_LINKED(&epi->rdllink))
        goto is_linked;
    list_add_tail(&epi->rdllink, &ep->rdllist);
is_linked:
    /*
     * Wake up ( if active ) both the eventpoll wait list and the ->poll()
     * wait list.
     */
    if (waitqueue_active(&ep->wq))
        wake_up(&ep->wq);
    if (waitqueue_active(&ep->poll_wait))
        pwake++;
    /* locking and the ep->poll_wait wake-up are omitted */
    return 1;
}
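To make the callback path in 3.7 and 3.8 concrete, here is a user-space model of it. Everything in it (wait_entry, wait_queue, mini_eventpoll, mini_epitem, mini_watch, wake_up_all) is a hypothetical miniature invented for this sketch; it imitates the shape of the kernel mechanism and is not kernel code.

```c
#include <assert.h>
#include <stddef.h>

struct wait_entry;
typedef int (*wake_fn)(struct wait_entry *self);

/* A wait queue entry that carries its own wake-up function. */
struct wait_entry {
    wake_fn func;
    struct wait_entry *next;
};

/* The per-device wait queue (the role played by whead). */
struct wait_queue { struct wait_entry *head; };

struct mini_eventpoll { int ready[16]; int nready; };  /* stands in for rdllist */

struct mini_epitem {
    struct wait_entry wait;      /* hooked onto the device queue */
    struct mini_eventpoll *ep;
    int fd;
    int linked;                  /* already on the ready list? */
};

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Counterpart of ep_poll_callback: queue the item unless already queued. */
static int mini_poll_callback(struct wait_entry *w)
{
    struct mini_epitem *epi = container_of(w, struct mini_epitem, wait);
    if (!epi->linked && epi->ep->nready < 16) {
        epi->ep->ready[epi->ep->nready++] = epi->fd;
        epi->linked = 1;
    }
    return 1;
}

/* Counterpart of ep_insert plus ep_ptable_queue_proc: done once per fd. */
void mini_watch(struct mini_eventpoll *ep, struct mini_epitem *epi,
                int fd, struct wait_queue *devq)
{
    assert(ep != NULL && epi != NULL && devq != NULL);
    epi->ep = ep;
    epi->fd = fd;
    epi->linked = 0;
    epi->wait.func = mini_poll_callback;
    epi->wait.next = devq->head;
    devq->head = &epi->wait;
}

/* What a device does when it has an event: run every waiter's function. */
void wake_up_all(struct wait_queue *q)
{
    for (struct wait_entry *w = q->head; w != NULL; w = w->next)
        w->func(w);
}
```

Waking the same device twice adds its fd to the ready array only once, which mirrors the EP_IS_LINKED check, and a waiter in this model never has to walk the devices to find out which of them fired.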
4. EPOLLET, a flag unique to epoll
EPOLLET is a flag unique to the epoll system calls. ET stands for Edge Trigger; its exact meaning and uses are easy to look up. With EPOLLET, repeated events no longer keep showing up and disturbing the program's logic, so it is widely used. So how does EPOLLET work?
    In the previous part we saw that epoll hangs a callback on every fd. When the device behind an fd has something to report, the callback puts the fd on the rdllist list, so epoll_wait only has to check rdllist to know which fds have events. Let's look at the last few lines of ep_poll:
4.1 ep_poll() (continued from 3.5)
[fs/eventpoll.c->ep_poll()]
    /* Try to transfer events to user space. */
    ep_events_transfer(ep, events, maxevents);
    ......
Copying the fds on rdllist to user space is the job of ep_events_transfer.
4.2 ep_events_transfer
[fs/eventpoll.c->ep_events_transfer()]
static int ep_events_transfer(struct eventpoll *ep, struct epoll_event __user *events,
        int maxevents)
{
    int eventcnt = 0;
    struct list_head txlist;
    INIT_LIST_HEAD(&txlist);
    down_read(&ep->sem);
    /* Collect/extract ready items */
    if (ep_collect_ready_items(ep, &txlist, maxevents) > 0) {
        /* Build result set in userspace */
        eventcnt = ep_send_events(ep, &txlist, events);
        /* Reinject ready items into the ready list */
        ep_reinject_items(ep, &txlist);
    }
    up_read(&ep->sem);
    return eventcnt;
}
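    There is not much code. ep_collect_ready_items moves the fds on rdllist over to txlist (leaving rdllist empty), ep_send_events then copies the fds on txlist to user space, and ep_reinject_items finally "returns" some of the fds from txlist to rdllist so that they can be found on rdllist again next time.
Here is the implementation of ep_send_events:
4.3 ep_send_events()
[fs/eventpoll.c->ep_send_events()]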
static int ep_send_events(struct eventpoll *ep, struct list_head *txlist,
        struct epoll_event __user *events)
{
    int eventcnt = 0;
    unsigned int revents;
    struct list_head *lnk;
    struct epitem *epi;
    list_for_each(lnk, txlist) {
        epi = list_entry(lnk, struct epitem, txlink);
        /* call the poll method of each monitored file to get the mask of ready events, stored in revents */
        revents = epi->ffd.file->f_op->poll(epi->ffd.file, NULL);
        epi->revents = revents & epi->event.events;
        if (epi->revents) {
            /* send the event from kernel space to user space */
            if (__put_user(epi->revents, &events[eventcnt].events) ||
                __put_user(epi->event.data, &events[eventcnt].data))
                return -EFAULT;
            if (epi->event.events & EPOLLONESHOT)
                epi->event.events &= EP_PRIVATE_BITS;
            eventcnt++;
        }
    }
    return eventcnt;
}
    There is not much to see in the copying itself, but note the f_op->poll call: it is a sly one, because it passes NULL as the second argument. Let's first look at how a device driver usually implements poll:
static unsigned int scull_p_poll(struct file *filp, poll_table *wait)
{
    struct scull_pipe *dev = filp->private_data;
    unsigned int mask = 0;
    poll_wait(filp, &dev->inq, wait);
    poll_wait(filp, &dev->outq, wait);
    if (dev->rp != dev->wp)
        mask |= POLLIN | POLLRDNORM; /* readable */
    if (spacefree(dev))
        mask |= POLLOUT | POLLWRNORM; /* writable */
    return mask;
}
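    The code above is taken from Linux Device Drivers, Third Edition, an absolute classic. The device first hangs current (the current process) on the two queues inq and outq (the hanging itself is done by the callback function pointer carried in wait) and then waits for the device to wake it up; after the wake-up, the event mask is obtained through mask (note the mask variable: it is what carries the event mask). So what does poll_wait do if wait is NULL?
4.4 poll_wait
[include/linux/poll.h->poll_wait]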
static inline void poll_wait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p)
{
    if (p && wait_address)
        p->qproc(filp, wait_address, p);
}
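There it is: if the poll_table is NULL, nothing is done. Going back to ep_send_events, the poll call with the NULL argument really means "I don't want to sleep, I only want the event mask". The mask obtained is then copied to user space. Once ep_send_events finishes, it is the turn of ep_reinject_items.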
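The double role of a driver's poll method (register on the wait queues when given a poll table, merely report the mask when given NULL) can be modelled in a few lines. The names below (mini_poll_table, mini_poll_wait, mini_dev, mini_dev_poll, count_registration) and the flag values are hypothetical stand-ins for this sketch, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct mini_poll_table;
typedef void (*queue_proc)(int *wait_queue, struct mini_poll_table *pt);

/* A poll table is little more than a wrapped function pointer. */
struct mini_poll_table {
    queue_proc qproc;
    int registrations;   /* how many wait queues we were hooked onto */
};

#define MINI_POLLIN  0x1u
#define MINI_POLLOUT 0x4u

/* Same shape as poll_wait: a NULL table means "do not register". */
static void mini_poll_wait(int *wait_queue, struct mini_poll_table *pt)
{
    if (pt && wait_queue)
        pt->qproc(wait_queue, pt);
}

/* Stands in for ep_ptable_queue_proc; here it only counts the calls. */
static void count_registration(int *wait_queue, struct mini_poll_table *pt)
{
    (void)wait_queue;
    pt->registrations++;
}

/* A toy device with an input queue, an output queue, and two state bits. */
struct mini_dev { int inq, outq; int readable, writable; };

/* Modelled on scull_p_poll: register (maybe), then report the mask. */
unsigned mini_dev_poll(struct mini_dev *dev, struct mini_poll_table *pt)
{
    assert(dev != NULL);
    unsigned mask = 0;
    mini_poll_wait(&dev->inq, pt);
    mini_poll_wait(&dev->outq, pt);
    if (dev->readable)
        mask |= MINI_POLLIN;
    if (dev->writable)
        mask |= MINI_POLLOUT;
    return mask;
}
```

Calling mini_dev_poll with a table registers on both queues and returns the mask; calling it with NULL returns the same mask and leaves the registration count untouched, which is the ep_send_events case.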
4.5 ep_reinject_items
[fs/eventpoll.c->ep_reinject_items]
static void ep_reinject_items(struct eventpoll *ep, struct list_head *txlist)
{
    int ricnt = 0, pwake = 0;
    unsigned long flags;
    struct epitem *epi;
    while (!list_empty(txlist)) {  /* walk txlist, which at this point holds the ready epitems */
        epi = list_entry(txlist->next, struct epitem, txlink);
        EP_LIST_DEL(&epi->txlink); /* remove the current epitem from txlist */
        if (EP_RB_LINKED(&epi->rbn) && !(epi->event.events & EPOLLET) &&
            (epi->revents & epi->event.events) && !EP_IS_LINKED(&epi->rdllink)) {
            list_add_tail(&epi->rdllink, &ep->rdllist);  /* put the current epitem back on ep->rdllist */
            ricnt++;  /* number of epitems put back on ep->rdllist */
        }
    }
    if (ricnt) {  /* if any items were put back, wake the tasks on the wait queues again */
        if (waitqueue_active(&ep->wq))
            wake_up(&ep->wq);
        if (waitqueue_active(&ep->poll_wait))
            pwake++;
    }
    ....
}
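ep_reinject_items puts some of the fds on txlist back onto rdllist. Which ones? Look at the condition above: the fds put back are those that are not marked EPOLLET, i.e. the default LT (the !(epi->event.events & EPOLLET) test), and whose reported events are still of interest (the epi->revents & epi->event.events test). The next epoll_wait will then naturally pick these fds off rdllist and hand them to the user again. An example: take a socket that has been connected but has not yet sent or received any data. Its poll event mask always contains POLLOUT (compare the driver example above), so every epoll_wait call returns a POLLOUT event (rather annoying), because the fd keeps being put back on rdllist. If someone now writes a large amount of data into this socket so that it clogs up and is no longer writable, the (epi->revents & epi->event.events) test fails (there is no POLLOUT any more), the fd is not put back on rdllist, and epoll_wait stops returning POLLOUT to the user. Now add EPOLLET to the socket and connect again without sending or receiving data. This time the !(epi->event.events & EPOLLET) test fails, so epoll_wait reports POLLOUT to the user only once (because the fd does not return to rdllist), and subsequent epoll_wait calls report nothing further until the wake-up callback fires again.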
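The difference between the two modes is easy to reproduce from user space. The sketch below, which assumes Linux, uses a pipe and EPOLLIN instead of a socket and POLLOUT, but exercises the same reinjection decision; count_reports is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Writes one byte into a pipe, registers the read end with EPOLLIN plus
 * extra_flags (0 for level-triggered, EPOLLET for edge-triggered), never
 * reads the byte, and returns how many of `calls` consecutive epoll_wait
 * calls report the fd. Returns -1 on failure. */
int count_reports(uint32_t extra_flags, int calls)
{
    assert(calls >= 0);

    int p[2], hits = -1;
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN | extra_flags,
                                  .data.fd = p[0] };
        if (write(p[1], "x", 1) == 1 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0) {
            hits = 0;
            for (int i = 0; i < calls; i++) {
                struct epoll_event out;
                if (epoll_wait(epfd, &out, 1, 0) == 1)
                    hits++;   /* the fd was on the ready list */
            }
        }
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return hits;
}
```

With extra_flags set to 0 every call reports the still-unread byte; with EPOLLET only the first call does, because the item is not put back on the ready list afterwards.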
Summary:
[Figure: overall call graph of the epoll functions]
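Note: the call graph above raises a question. When ep_reinject_items() puts the LT epitems that were ready last time back on the ready list, the next ep_poll() returns immediately. Doesn't that create an endless loop? When do these LT epitems stop being added to the ready list? The answer lies in 4.3, ep_send_events(). Note the f_op->poll call with a NULL poll table in that function: as analysed above, when NULL is passed, poll merely fetches the event mask. So if the user's handling of the previous event changed the state of the file, revents picks that up here. For example, if the user is monitoring for readability and reads all the data, the file becomes unreadable. The revents obtained in ep_send_events() then no longer contains a readable event, the (epi->revents & epi->event.events) test in ep_reinject_items() fails, and the epitem is not added to the ready list (ep->rdllist) again. If only part of the data is read, however, the state of the file does not change (it is still readable), so the epitem goes back on the ready list and user space is notified again. This is why, under LT, the user keeps being notified of the read event until some operation makes the file descriptor no longer ready (for example, sending, receiving, or accepting, or transferring less than the requested amount of data, ends in an EWOULDBLOCK error).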
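The partial-read case can be checked directly. In the sketch below, which assumes Linux, four bytes are written to a pipe registered in the default level-triggered mode, and the fd is queried again after reading some or all of them; still_ready_after_read is a hypothetical helper for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <sys/types.h>
#include <unistd.h>

/* Level-triggered: writes 4 bytes, waits for the first report, reads
 * `nread` of the bytes, then returns the result of a second epoll_wait
 * (1 = still reported, 0 = no longer reported). Returns -1 on failure. */
int still_ready_after_read(size_t nread)
{
    assert(nread >= 1 && nread <= 4);

    int p[2], result = -1;
    if (pipe(p) != 0)
        return -1;
    int epfd = epoll_create1(0);
    if (epfd >= 0) {
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = p[0] }, out;
        char buf[4];
        if (write(p[1], "abcd", 4) == 4 &&
            epoll_ctl(epfd, EPOLL_CTL_ADD, p[0], &ev) == 0 &&
            epoll_wait(epfd, &out, 1, 0) == 1 &&        /* first report */
            read(p[0], buf, nread) == (ssize_t)nread)
            result = epoll_wait(epfd, &out, 1, 0);      /* reported again? */
        close(epfd);
    }
    close(p[0]);
    close(p[1]);
    return result;
}
```

Reading two of the four bytes leaves the fd readable, so it is reported again; reading all four empties the pipe, the refreshed event mask no longer contains EPOLLIN, and the second call returns 0.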
[Figure: the call graph with the call described above added, drawn as blue lines]
[Figure: overall relationships among the data structures in the epoll implementation]