Preface
Shared memory is mainly used for inter-process communication. Linux provides two shared memory (Shared Memory) mechanisms:
(1) **System V shared memory (shmget/shmat/shmdt)**
The original shared memory mechanism, still widely used. Allows sharing between unrelated processes.
(2) **POSIX shared memory (shm_open/shm_unlink)**
Allows sharing between unrelated processes without the overhead of filesystem I/O. Intended to be simpler and better than the older APIs.
In addition, memory mappings in Linux deserve a mention (they can also be used for inter-process communication):
**Shared mappings – mmap(2)**
- Shared anonymous mappings: sharing between related processes only (related via fork()); see the sketch below.
- Shared file mappings: sharing between unrelated processes, backed by a file in the filesystem.
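As a quick illustration of the first kind of mapping, here is a minimal sketch (illustrative, not part of the original tests): a parent and child share a counter through an anonymous shared mapping created before fork().

/* gcc -o anonshare anonshare.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    /* MAP_ANONYMOUS | MAP_SHARED: no backing file, and the mapping
       stays shared across fork() */
    int *counter = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (counter == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
    *counter = 0;

    if (fork() == 0) {          /* child writes ... */
        *counter = 42;
        exit(0);
    }
    wait(NULL);                 /* ... and the parent sees the update */
    printf("counter = %d\n", *counter);
    munmap(counter, sizeof(int));
    return 0;
}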
System V shared memory has a long history and is widely used; most Unix-like systems support it, and it is usually the first mechanism we reach for when writing programs. How to use these APIs is not discussed here; for a detailed introduction to POSIX shared memory, see here (1) and here (2).
**Having said all that, a question arises: what does shared memory have to do with tmpfs?**
The POSIX shared memory object implementation on Linux 2.4 makes use of a dedicated filesystem, which is normally mounted under /dev/shm.
As this shows, POSIX shared memory is implemented on top of tmpfs. In fact, going one step further, not only POSIX shared memory (PSM) but also System V shared memory (SSM) is implemented on tmpfs inside the kernel.
Introduction to tmpfs
Here is what the kernel documentation says about tmpfs:
tmpfs has the following uses:
1) There is always a kernel internal mount which you will not see at all. This is used for shared anonymous mappings and SYSV shared memory.
This mount does not depend on CONFIG_TMPFS. If CONFIG_TMPFS is not set, the user visible part of tmpfs is not built. But the internal mechanisms are always present.
2) glibc 2.2 and above expects tmpfs to be mounted at /dev/shm for POSIX shared memory (shm_open, shm_unlink). Adding the following line to /etc/fstab should take care of this:
tmpfs /dev/shm tmpfs defaults 0 0
Remember to create the directory that you intend to mount tmpfs on if necessary.
This mount is not needed for SYSV shared memory. The internal mount is used for that. (In the 2.3 kernel versions it was necessary to mount the predecessor of tmpfs (shm fs) to use SYSV shared memory.)
From this we can see that tmpfs serves two purposes:
(1) Backing SYSV shared memory and shared anonymous mappings; this instance is managed by the kernel and invisible to the user.
(2) Backing POSIX shared memory; the user is responsible for mounting it, normally at /dev/shm, and it depends on CONFIG_TMPFS.
At this point we can see the difference between SSM and PSM, and understand what /dev/shm is for.
Now let's run some tests:
Tests
First, resize the tmpfs at /dev/shm to 64M:
# mount -o remount,size=64M /dev/shm
# df -lh
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            64M     0   64M   0% /dev/shm
The maximum size of a single SYSV shared memory segment is 32M:
# cat /proc/sys/kernel/shmmax
33554432
(1) Creating a 65M System V shared memory segment fails:
# ipcmk -M 68157440
ipcmk: create share memory failed: Invalid argument
This is expected.
(2) Raise shmmax to 65M
# echo 68157440 > /proc/sys/kernel/shmmax
# cat /proc/sys/kernel/shmmax
68157440
# ipcmk -M 68157440
Shared memory id: 0
# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0xef46b249 0          root       644        68157440   0
As you can see, the size of System V shared memory is not limited by /dev/shm; the segment lives on the kernel's internal tmpfs mount, so nothing new appears under /dev/shm.
(3) Creating POSIX shared memory
/* gcc -o shmopen shmopen.c -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define MAP_SIZE 68157440

int main(int argc, char *argv[])
{
    int fd;

    fd = shm_open("/shm1", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        printf("shm_open failed\n");
        exit(1);
    }
    /* size the object to 65M; tmpfs allocates pages lazily, so this
       succeeds even though /dev/shm is only 64M */
    if (ftruncate(fd, MAP_SIZE) < 0) {
        printf("ftruncate failed\n");
        exit(1);
    }
    return 0;
}
# ./shmopen
# ls -lh /dev/shm/shm1
-rw-r--r-- 1 root root 65M Mar  3 06:19 /dev/shm/shm1
Even though /dev/shm is only 64M, creating a 65M POSIX shared memory object succeeds: ftruncate only records the size, and the backing pages are not allocated until they are actually touched.
(4) Writing data to the POSIX shared memory
/* gcc -o shmwrite shmwrite.c -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define MAP_SIZE 68157440

int main(int argc, char *argv[])
{
    int fd;
    void *result;

    fd = shm_open("/shm1", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        printf("shm_open failed\n");
        exit(1);
    }
    if (ftruncate(fd, MAP_SIZE) < 0) {
        printf("ftruncate failed\n");
        exit(1);
    }
    result = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (result == MAP_FAILED) {
        printf("mapped failed\n");
        exit(1);
    }
    /* ... operate on the result pointer ... */
    printf("memset\n");
    memset(result, 0, MAP_SIZE);    /* touching every page forces allocation */
    //shm_unlink("/shm1");
    return 0;
}
# ./shmwrite
memset
Bus error
As you can see, writing 65M of data triggers a Bus error: the object is larger than the tmpfs mount, so page allocation fails once /dev/shm fills up (see the SIGBUS section below).
However, new files can still be created under /dev/shm:
# ls -lh /dev/shm/
total 64M
-rw-r--r-- 1 root root 65M Mar  3 15:23 shm1
-rw-r--r-- 1 root root 65M Mar  3 15:24 shm2
This is normal: what ls shows is inode->size, i.e. the size recorded in the inode, not the blocks actually allocated (note Blocks: 0 in the stat output below).
# stat /dev/shm/shm2
  File: "/dev/shm/shm2"
  Size: 68157440        Blocks: 0          IO Block: 4096   regular file
Device: 10h/16d Inode: 217177      Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2015-03-03 15:24:28.025985167 +0800
Modify: 2015-03-03 15:24:28.025985167 +0800
Change: 2015-03-03 15:24:28.025985167 +0800
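If taking a SIGBUS in the middle of a memset is undesirable, the pages can be preallocated up front so the failure shows up as an error code instead. A minimal sketch (illustrative, not one of the original test programs; the object name is made up, and in-kernel fallocate support for tmpfs only exists in newer kernels — on older ones glibc emulates posix_fallocate by writing the blocks, which should likewise report ENOSPC):

/* gcc -o shmalloc shmalloc.c -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

#define MAP_SIZE 68157440

int main(void)
{
    int fd, err;

    fd = shm_open("/shm_prealloc", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("shm_open");
        exit(1);
    }
    /* posix_fallocate() tries to allocate the backing store immediately
       and returns ENOSPC if the tmpfs mount is too small, instead of
       letting a later page fault kill the process with SIGBUS */
    err = posix_fallocate(fd, 0, MAP_SIZE);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        shm_unlink("/shm_prealloc");
        exit(1);
    }
    printf("preallocated %d bytes\n", MAP_SIZE);
    shm_unlink("/shm_prealloc");
    return 0;
}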
(5) Writing data to SYS V shared memory
Set the System V shared memory maximum to 65M (/dev/shm remains 64M).
# cat /proc/sys/kernel/shmmax
68157440
/* gcc -o shmv shmv.c */
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define MAP_SIZE 68157440

int main(int argc, char **argv)
{
    int shm_id;
    key_t key;
    char *p_map;
    char *name = "/dev/shm/shm3";   /* ftok() needs an existing file */

    key = ftok(name, 1);            /* proj_id must be nonzero */
    if (key == -1) {
        perror("ftok error");
        return -1;
    }
    shm_id = shmget(key, MAP_SIZE, IPC_CREAT | 0644);
    if (shm_id == -1) {
        perror("shmget error");
        return -1;
    }
    p_map = (char *)shmat(shm_id, NULL, 0);
    memset(p_map, 0, MAP_SIZE);
    if (shmdt(p_map) == -1)
        perror("detach error");
    return 0;
}
# ./shmv
It runs normally: writing 65M through the SysV segment works even though /dev/shm is only 64M.
(6) Conclusion
Although System V and POSIX shared memory are both implemented on top of tmpfs, they are subject to different limits. That is, /proc/sys/kernel/shmmax only constrains SYS V shared memory, and the size of /dev/shm only constrains POSIX shared memory. In fact, System V and POSIX shared memory use two different tmpfs instances.
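Since the size of the /dev/shm mount is the real ceiling for POSIX shared memory, a program can check the available capacity before sizing its object. A minimal sketch using statvfs(3) (illustrative; it assumes tmpfs is mounted at the conventional /dev/shm):

/* gcc -o shmcap shmcap.c */
#include <stdio.h>
#include <stdlib.h>
#include <sys/statvfs.h>

int main(void)
{
    struct statvfs vfs;

    if (statvfs("/dev/shm", &vfs) != 0) {
        perror("statvfs");
        exit(1);
    }
    /* f_blocks: total blocks in the tmpfs; f_bavail: blocks still free.
       Both are in units of f_frsize bytes. */
    printf("tmpfs size : %llu bytes\n",
           (unsigned long long)vfs.f_blocks * vfs.f_frsize);
    printf("tmpfs free : %llu bytes\n",
           (unsigned long long)vfs.f_bavail * vfs.f_frsize);
    return 0;
}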
Kernel analysis
During initialization, the kernel automatically mounts a tmpfs filesystem, recorded as shm_mnt:
//mm/shmem.c
static struct file_system_type shmem_fs_type = {
    .owner   = THIS_MODULE,
    .name    = "tmpfs",
    .get_sb  = shmem_get_sb,
    .kill_sb = kill_litter_super,
};

int __init shmem_init(void)
{
    ...
    error = register_filesystem(&shmem_fs_type);
    if (error) {
        printk(KERN_ERR "Could not register tmpfs\n");
        goto out2;
    }

    /* mount the internal tmpfs (used for SYS V) */
    shm_mnt = vfs_kern_mount(&shmem_fs_type, MS_NOUSER,
                             shmem_fs_type.name, NULL);
Mounting /dev/shm follows the same flow as an ordinary filesystem mount, so it is not discussed here. It is worth noting, however, that /dev/shm defaults to half of physical memory:
shmem_get_sb -> shmem_fill_super
//mm/shmem.c
int shmem_fill_super(struct super_block *sb, void *data, int silent)
{
    ...
#ifdef CONFIG_TMPFS
    /*
     * Per default we only allow half of the physical ram per
     * tmpfs instance, limiting inodes to one per page of lowmem;
     * but the internal instance is left unlimited.
     */
    if (!(sb->s_flags & MS_NOUSER)) { /* the kernel's internal mount sets MS_NOUSER */
        sbinfo->max_blocks = shmem_default_max_blocks();
        sbinfo->max_inodes = shmem_default_max_inodes();
        if (shmem_parse_options(data, sbinfo, false)) {
            err = -EINVAL;
            goto failed;
        }
    }
    sb->s_export_op = &shmem_export_ops;
#else
    ...

#ifdef CONFIG_TMPFS
static unsigned long shmem_default_max_blocks(void)
{
    return totalram_pages / 2;
}
As you can see, because the kernel mounts its internal tmpfs with MS_NOUSER, that instance has no size limit; the memory available to SYS V shared memory is therefore constrained only by /proc/sys/kernel/shmmax. The user-mounted /dev/shm, by contrast, defaults to half of physical memory.
Note that this user-visible part is guarded by CONFIG_TMPFS.
In addition, creating a file in /dev/shm goes through the VFS interface, whereas SYS V segments and shared anonymous mappings are set up via shmem_file_setup:
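For example, on the SYS V side, shmget() reaches newseg(), which creates the segment's backing file directly on the internal mount (abridged from ipc/shm.c of kernels of this era; details vary between versions):

//ipc/shm.c (abridged): shmget() -> ipcget() -> newseg()
static int newseg(struct ipc_namespace *ns, struct ipc_params *params)
{
    ...
    char name[13];
    ...
    sprintf(name, "SYSV%08x", key);
    /* the backing file is created on the kernel-internal tmpfs
       mount (shm_mnt), not on /dev/shm */
    file = shmem_file_setup(name, size, acctflag);
    ...
}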
SIGBUS
When an application touches an address in the shared memory region whose physical page has not been allocated yet, the fault method is invoked; if the allocation fails, it returns an OOM or SIGBUS error:
static const struct vm_operations_struct shmem_vm_ops = {
    .fault = shmem_fault,
#ifdef CONFIG_NUMA
    .set_policy = shmem_set_policy,
    .get_policy = shmem_get_policy,
#endif
};

static int shmem_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
    struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
    int error;
    int ret = VM_FAULT_LOCKED;

    error = shmem_getpage(inode, vmf->pgoff, &vmf->page, SGP_CACHE, &ret);
    if (error)
        return ((error == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS);
    return ret;
}
shmem_getpage -> shmem_getpage_gfp:
/*
 * shmem_getpage_gfp - find page in cache, or get from swap, or allocate
 *
 * If we allocate a new one we do not mark it dirty. That's up to the
 * vm. If we swap it in we mark it dirty since we also free the swap
 * entry since a page cannot live in both the swap and page cache
 */
static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
    struct page **pagep, enum sgp_type sgp, gfp_t gfp, int *fault_type)
{
    ...
    if (sbinfo->max_blocks) { /* set for /dev/shm, unset for the internal mount */
        if (percpu_counter_compare(&sbinfo->used_blocks, sbinfo->max_blocks) >= 0) {
            error = -ENOSPC;
            goto unacct;
        }
        percpu_counter_inc(&sbinfo->used_blocks);
    }

    /* allocate a physical page */
    page = shmem_alloc_page(gfp, info, index);
    if (!page) {
        error = -ENOMEM;
        goto decused;
    }

    SetPageSwapBacked(page);
    __set_page_locked(page);
    error = mem_cgroup_cache_charge(page, current->mm,
                                    gfp & GFP_RECLAIM_MASK); /* mem_cgroup check */
    if (!error)
        error = shmem_add_to_page_cache(page, mapping, index, gfp, NULL);
Shared memory and cgroups
Currently, shared memory is charged to the memory cgroup of the first process that touches it. See:
- http://lwn.net/Articles/516541/
- https://www.kernel.org/doc/Documentation/cgroups/memory.txt
POSIX shared memory and Docker
At the moment, Docker hardcodes /dev/shm to 64M and provides no parameter to change it, which is rather unfortunate: applications that use large amounts of POSIX shared memory are bound to run into trouble. See:
- https://github.com/docker/docker/issues/2606
- https://github.com/docker/docker/pull/4981
Summary
(1) POSIX and SYS V shared memory are both implemented on tmpfs inside the kernel, but they use two separate, independent tmpfs instances.
(2) /proc/sys/kernel/shmmax limits the size of a single SYS V shared memory segment, while the size of /dev/shm limits the total size of all POSIX shared memory objects.