Main interfaces

  • epoll_create
  • epoll_ctl
  • epoll_wait

epoll_create

Header file

#include <sys/epoll.h>

Function prototypes

int epoll_create(int size);
int epoll_create1(int flags);

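Both return an integer file descriptor on success and -1 on failure.

Purpose

Creates an epoll handle. The size argument was originally a hint telling the kernel how many file descriptors the caller expected to monitor. It is not like the first argument of select(), which is the highest monitored fd plus 1. Note that the epoll handle itself occupies a file descriptor once it is created; on Linux it is visible under /proc/<pid>/fd/. When you are done with epoll you must therefore call close() on it, otherwise the process may eventually run out of file descriptors.

Note: size was only ever a hint for the approximate number of events the epoll object would handle, not an upper limit on them. Since Linux 2.6.8 the argument is ignored entirely, although it must still be greater than zero.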

DESCRIPTION
epoll_create() creates a new epoll(7) instance. Since Linux
2.6.8, the size argument is ignored, but must be greater than
zero; see NOTES.

   epoll_create() returns a file descriptor referring to the new
   epoll instance.  This file descriptor is used for all the
   subsequent calls to the epoll interface.  When no longer
   required, the file descriptor returned by epoll_create() should
   be closed by using close(2).  When all file descriptors referring
   to an epoll instance have been closed, the kernel destroys the
   instance and releases the associated resources for reuse.

epoll_create1()
If flags is 0, then, other than the fact that the obsolete size
argument is dropped, epoll_create1() is the same as
epoll_create(). The following value can be included in flags to
obtain different behavior:

   EPOLL_CLOEXEC
          Set the close-on-exec (FD_CLOEXEC) flag on the new file
          descriptor.  See the description of the O_CLOEXEC flag in
          open(2) for reasons why this may be useful.

RETURN VALUE
On success, these system calls return a file descriptor (a
nonnegative integer). On error, -1 is returned, and errno is set
to indicate the error.
ERRORS
EINVAL size is not positive.

   EINVAL (epoll_create1()) Invalid value specified in flags.

   EMFILE The per-user limit on the number of epoll instances
          imposed by /proc/sys/fs/epoll/max_user_instances was
          encountered.  See epoll(7) for further details.

   EMFILE The per-process limit on the number of open file
          descriptors has been reached.

   ENFILE The system-wide limit on the total number of open files
          has been reached.

   ENOMEM There was insufficient memory to create the kernel object.

VERSIONS
epoll_create() was added to the kernel in version 2.6. Library
support is provided in glibc starting with version 2.3.2.

   epoll_create1() was added to the kernel in version 2.6.27.
   Library support is provided in glibc starting with version 2.9.

CONFORMING TO
epoll_create() and epoll_create1() are Linux-specific.

NOTES
In the initial epoll_create() implementation, the size argument
informed the kernel of the number of file descriptors that the
caller expected to add to the epoll instance. The kernel used
this information as a hint for the amount of space to initially
allocate in internal data structures describing events. (If
necessary, the kernel would allocate more space if the caller’s
usage exceeded the hint given in size.) Nowadays, this hint is
no longer required (the kernel dynamically sizes the required
data structures without needing the hint), but size must still be
greater than zero, in order to ensure backward compatibility when
new epoll applications are run on older kernels.

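The size passed to epoll_create must be greater than 0, for backward compatibility with older kernels. It no longer needs to reflect the real number of descriptors, because the kernel resizes its internal data structures dynamically. From here on, epoll_create(1) is enough.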
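The rule about size can be checked with a short sketch. The helper name try_epoll_create is hypothetical and exists only for this illustration; it assumes a Linux system with glibc.

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: calls epoll_create(size), closes the handle again,
 * and returns 0 on success or the errno value on failure. */
int try_epoll_create(int size)
{
    int epfd = epoll_create(size);
    if (epfd == -1)
        return errno;      /* e.g. EINVAL when size <= 0 */

    close(epfd);           /* the handle occupies an fd, so release it */
    return 0;
}
```

Under these assumptions, try_epoll_create(1) should return 0, while try_epoll_create(0) should return EINVAL, matching the ERRORS section quoted above.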

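epoll_create() was added to the kernel in version 2.6. glibc has supported it since version 2.3.2.
epoll_create1() was added to the kernel in version 2.6.27. glibc has supported it since version 2.9.

The Z5100 device currently runs kernel version 4.14: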

root@firewall:~# uname -r
4.14.76-g3544954e7

For now, the epoll optimization in libcollect on the Z5100 keeps following the 11.X approach and uses the epoll_create interface.

epoll source code analysis: epoll_create

epoll_create1(EPOLL_CLOEXEC)

epoll_create and epoll_create1

The usual practice is to pass EPOLL_CLOEXEC.

Note:

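FD_CLOEXEC is a file descriptor flag that controls the descriptor's close-on-exec state. When the flag is clear, the fd stays open across a call to exec; when it is set, the fd is closed on exec, which prevents it from leaking into the program that exec starts. For details on exec, see its manual page (man exec).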
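The effect of EPOLL_CLOEXEC can be observed by reading the descriptor flags back with fcntl(F_GETFD). The following is a minimal sketch; has_cloexec and epoll_cloexec_state are hypothetical helper names chosen for this illustration.

```c
#include <fcntl.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if fd has FD_CLOEXEC set, 0 if not,
 * and -1 if fcntl() fails. */
int has_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);
    if (flags == -1)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* Creates an epoll instance with the given epoll_create1() flags and
 * reports whether the new descriptor is close-on-exec (1 or 0),
 * or -1 on error. The instance is closed again before returning. */
int epoll_cloexec_state(int create_flags)
{
    int epfd = epoll_create1(create_flags);
    if (epfd == -1)
        return -1;

    int state = has_cloexec(epfd);
    close(epfd);
    return state;
}
```

With these helpers, epoll_cloexec_state(EPOLL_CLOEXEC) should report 1 and epoll_cloexec_state(0) should report 0.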

A brief look at epoll: the epoll functions in depth

Source code from glibc 2.34

int
epoll_create (int size)
{
  if (size <= 0)
    {
      __set_errno (EINVAL);
      return -1;
    }

  return INLINE_SYSCALL (epoll_create1, 1, 0);
}

epoll_ctl

NAME
epoll_ctl - control interface for an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>

   int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

DESCRIPTION
This system call is used to add, modify, or remove entries in the
interest list of the epoll(7) instance referred to by the file
descriptor epfd. It requests that the operation op be performed
for the target file descriptor, fd.

   Valid values for the op argument are:

   EPOLL_CTL_ADD
          Add an entry to the interest list of the epoll file
          descriptor, epfd.  The entry includes the file descriptor,
          fd, a reference to the corresponding open file description
          (see epoll(7) and open(2)), and the settings specified in
          event.

   EPOLL_CTL_MOD
          Change the settings associated with fd in the interest
          list to the new settings specified in event.

   EPOLL_CTL_DEL
          Remove (deregister) the target file descriptor fd from the
          interest list.  The event argument is ignored and can be
          NULL (but see BUGS below).

   The event argument describes the object linked to the file
   descriptor fd.  The struct epoll_event is defined as:

       typedef union epoll_data {
           void        *ptr;
           int          fd;
           uint32_t     u32;
           uint64_t     u64;
       } epoll_data_t;

       struct epoll_event {
           uint32_t     events;      /* Epoll events */
           epoll_data_t data;        /* User data variable */
       };

   The data member of the epoll_event structure specifies data that
   the kernel should save and then return (via epoll_wait(2)) when
   this file descriptor becomes ready.

   The events member of the epoll_event structure is a bit mask
   composed by ORing together zero or more of the following
   available event types:

   EPOLLIN
          The associated file is available for read(2) operations.

   EPOLLOUT
          The associated file is available for write(2) operations.

   EPOLLRDHUP (since Linux 2.6.17)
          Stream socket peer closed connection, or shut down writing
          half of connection.  (This flag is especially useful for
          writing simple code to detect peer shutdown when using
          edge-triggered monitoring.)

   EPOLLPRI
          There is an exceptional condition on the file descriptor.
          See the discussion of POLLPRI in poll(2).

   EPOLLERR
          Error condition happened on the associated file
          descriptor.  This event is also reported for the write end
          of a pipe when the read end has been closed.

          epoll_wait(2) will always report for this event; it is not
          necessary to set it in events when calling epoll_ctl().

   EPOLLHUP
          Hang up happened on the associated file descriptor.

          epoll_wait(2) will always wait for this event; it is not
          necessary to set it in events when calling epoll_ctl().

          Note that when reading from a channel such as a pipe or a
          stream socket, this event merely indicates that the peer
          closed its end of the channel.  Subsequent reads from the
          channel will return 0 (end of file) only after all
          outstanding data in the channel has been consumed.

   EPOLLET
          Requests edge-triggered notification for the associated
          file descriptor.  The default behavior for epoll is level-
          triggered.  See epoll(7) for more detailed information
          about edge-triggered and level-triggered notification.

          This flag is an input flag for the event.events field when
          calling epoll_ctl(); it is never returned by
          epoll_wait(2).

   EPOLLONESHOT (since Linux 2.6.2)
          Requests one-shot notification for the associated file
          descriptor.  This means that after an event notified for
          the file descriptor by epoll_wait(2), the file descriptor
          is disabled in the interest list and no other events will
          be reported by the epoll interface.  The user must call
          epoll_ctl() with EPOLL_CTL_MOD to rearm the file
          descriptor with a new event mask.

          This flag is an input flag for the event.events field when
          calling epoll_ctl(); it is never returned by
          epoll_wait(2).

   EPOLLWAKEUP (since Linux 3.5)
          If EPOLLONESHOT and EPOLLET are clear and the process has
          the CAP_BLOCK_SUSPEND capability, ensure that the system
          does not enter "suspend" or "hibernate" while this event
          is pending or being processed.  The event is considered as
          being "processed" from the time when it is returned by a
          call to epoll_wait(2) until the next call to epoll_wait(2)
          on the same epoll(7) file descriptor, the closure of that
          file descriptor, the removal of the event file descriptor
          with EPOLL_CTL_DEL, or the clearing of EPOLLWAKEUP for the
          event file descriptor with EPOLL_CTL_MOD.  See also BUGS.

          This flag is an input flag for the event.events field when
          calling epoll_ctl(); it is never returned by
          epoll_wait(2).

   EPOLLEXCLUSIVE (since Linux 4.5)
          Sets an exclusive wakeup mode for the epoll file
          descriptor that is being attached to the target file
          descriptor, fd.  When a wakeup event occurs and multiple
          epoll file descriptors are attached to the same target
          file using EPOLLEXCLUSIVE, one or more of the epoll file
          descriptors will receive an event with epoll_wait(2).  The
          default in this scenario (when EPOLLEXCLUSIVE is not set)
          is for all epoll file descriptors to receive an event.
          EPOLLEXCLUSIVE is thus useful for avoiding thundering herd
          problems in certain scenarios.

          If the same file descriptor is in multiple epoll
          instances, some with the EPOLLEXCLUSIVE flag, and others
          without, then events will be provided to all epoll
          instances that did not specify EPOLLEXCLUSIVE, and at
          least one of the epoll instances that did specify
          EPOLLEXCLUSIVE.

          The following values may be specified in conjunction with
          EPOLLEXCLUSIVE: EPOLLIN, EPOLLOUT, EPOLLWAKEUP, and
          EPOLLET.  EPOLLHUP and EPOLLERR can also be specified, but
          this is not required: as usual, these events are always
          reported if they occur, regardless of whether they are
          specified in events.  Attempts to specify other values in
          events yield the error EINVAL.

          EPOLLEXCLUSIVE may be used only in an EPOLL_CTL_ADD
          operation; attempts to employ it with EPOLL_CTL_MOD yield
          an error.  If EPOLLEXCLUSIVE has been set using
          epoll_ctl(), then a subsequent EPOLL_CTL_MOD on the same
          epfd, fd pair yields an error.  A call to epoll_ctl() that
          specifies EPOLLEXCLUSIVE in events and specifies the
          target file descriptor fd as an epoll instance will
          likewise fail.  The error in all of these cases is EINVAL.

          The EPOLLEXCLUSIVE flag is an input flag for the
          event.events field when calling epoll_ctl(); it is never
          returned by epoll_wait(2).

RETURN VALUE
When successful, epoll_ctl() returns zero. When an error occurs,
epoll_ctl() returns -1 and errno is set to indicate the error.

ERRORS
   EBADF  epfd or fd is not a valid file descriptor.

   EEXIST op was EPOLL_CTL_ADD, and the supplied file descriptor fd
          is already registered with this epoll instance.

   EINVAL epfd is not an epoll file descriptor, or fd is the same as
          epfd, or the requested operation op is not supported by
          this interface.

   EINVAL An invalid event type was specified along with
          EPOLLEXCLUSIVE in events.

   EINVAL op was EPOLL_CTL_MOD and events included EPOLLEXCLUSIVE.

   EINVAL op was EPOLL_CTL_MOD and the EPOLLEXCLUSIVE flag has
          previously been applied to this epfd, fd pair.

   EINVAL EPOLLEXCLUSIVE was specified in event and fd refers to an
          epoll instance.

   ELOOP  fd refers to an epoll instance and this EPOLL_CTL_ADD
          operation would result in a circular loop of epoll
          instances monitoring one another or a nesting depth of
          epoll instances greater than 5.

   ENOENT op was EPOLL_CTL_MOD or EPOLL_CTL_DEL, and fd is not
          registered with this epoll instance.

   ENOMEM There was insufficient memory to handle the requested op
          control operation.

   ENOSPC The limit imposed by /proc/sys/fs/epoll/max_user_watches
          was encountered while trying to register (EPOLL_CTL_ADD) a
          new file descriptor on an epoll instance.  See epoll(7)
          for further details.

   EPERM  The target file fd does not support epoll.  This error can
          occur if fd refers to, for example, a regular file or a
          directory.

VERSIONS
epoll_ctl() was added to the kernel in version 2.6. Library
support is provided in glibc starting with version 2.3.2.

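epoll_ctl is epoll's event registration function. It adds, modifies, or removes events of interest in an epoll object. It returns 0 on success and -1 on failure, in which case errno identifies the kind of error.

int epoll_ctl(int epfd, int op, int fd, struct epoll_event *event);

  • epfd is the value returned by epoll_create.

  • op is the operation to perform:

    EPOLL_CTL_ADD: register a new fd with epfd;
    EPOLL_CTL_MOD: change the events monitored for an fd that is already registered;
    EPOLL_CTL_DEL: remove an fd from epfd.

  • fd is the file descriptor the kernel should monitor.

  • event describes the events the kernel should monitor for fd.

events can be a combination of the following macros:

    EPOLLIN: the file descriptor is readable (this includes the peer socket closing normally);
    EPOLLOUT: the file descriptor is writable;
    EPOLLPRI: the file descriptor has an exceptional condition (including out-of-band data);
    EPOLLERR: an error occurred on the file descriptor;
    EPOLLHUP: the file descriptor was hung up;
    EPOLLET: puts epoll into edge-triggered mode for this fd, as opposed to level-triggered mode;
    EPOLLONESHOT: report only one event. After it has been delivered the fd is disabled in the interest list; to keep monitoring the socket, rearm it by calling epoll_ctl() with EPOLL_CTL_MOD.

Returns 0 on success, -1 on failure.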
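The three operations and two of the documented errors (EEXIST and ENOENT) can be walked through on a pipe. This is only a sketch: the function name epoll_ctl_walkthrough and its step numbering are hypothetical, and a pipe stands in for whatever fd a real program would monitor.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical walkthrough: registers the read end of a pipe, then
 * modifies and removes it. Returns 0 if every step behaved as the
 * man page describes, or the number of the first step that did not. */
int epoll_ctl_walkthrough(void)
{
    int result = 0;
    int pipefd[2];
    struct epoll_event ev = { 0 };

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return 1;
    if (pipe(pipefd) == -1) {
        close(epfd);
        return 1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];            /* handed back later by epoll_wait() */

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != 0)
        result = 2;
    /* Adding the same fd a second time must fail with EEXIST. */
    else if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) != -1 ||
             errno != EEXIST)
        result = 3;
    else {
        ev.events = EPOLLIN | EPOLLET; /* change the monitored events */
        if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) != 0)
            result = 4;
        else if (epoll_ctl(epfd, EPOLL_CTL_DEL, pipefd[0], NULL) != 0)
            result = 5;
        /* Once removed, the fd can no longer be modified: ENOENT. */
        else if (epoll_ctl(epfd, EPOLL_CTL_MOD, pipefd[0], &ev) != -1 ||
                 errno != ENOENT)
            result = 6;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return result;
}
```

A return value of 0 means all steps matched the documented behavior; any other value points at the step that differed.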

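ET and LT working modes

epoll has two working modes: LT (level-triggered) and ET (edge-triggered). By default, epoll works in LT mode.

Differences between ET mode and LT mode:

epoll, Baidu Baike

This is where ET and LT differ. In LT mode events are not dropped: as long as the read buffer still holds data for the user to read, epoll keeps notifying you. In ET mode you are notified only at the moment the event occurs. The names follow the electrical analogy: LT fires as long as an event remains unhandled, while ET fires only on a transition between low and high level (that is, when the state changes from 1 to 0 or from 0 to 1).

Choose the mode according to the application. For log collection, I personally think it is better not to set ET mode (EPOLLET): if the reader does not drain the fd completely after a notification, the remaining data is not reported again until new data arrives, so there is a risk of losing data.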
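The difference between the two modes can be made visible with a small experiment: write one byte into a pipe, leave it unread, and call epoll_wait() twice. The sketch below assumes a pipe as the monitored fd; count_notifications is a hypothetical name used only here.

```c
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical experiment: writes one byte into a pipe, never reads it,
 * and calls epoll_wait() twice. Returns how many of the two calls
 * reported the read end as ready, or -1 on a setup error.
 * Pass 0 for level-triggered or EPOLLET for edge-triggered. */
int count_notifications(uint32_t trigger_flag)
{
    int pipefd[2];
    int count = -1;
    struct epoll_event ev = { 0 }, out;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -1;
    if (pipe(pipefd) == -1) {
        close(epfd);
        return -1;
    }

    ev.events = EPOLLIN | trigger_flag;
    ev.data.fd = pipefd[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        write(pipefd[1], "x", 1) == 1) {
        count = 0;
        for (int i = 0; i < 2; i++) {
            /* The byte is deliberately left unread between the calls. */
            if (epoll_wait(epfd, &out, 1, 100) == 1)
                count++;
        }
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return count;
}
```

In LT mode both calls should report the fd, because the data is still there. In ET mode only the first call should report it, which is why ET readers normally use non-blocking I/O and read until EAGAIN.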

epoll_wait

NAME
epoll_wait, epoll_pwait, epoll_pwait2 - wait for an I/O event on
an epoll file descriptor
SYNOPSIS
#include <sys/epoll.h>

   int epoll_wait(int epfd, struct epoll_event *events,
                  int maxevents, int timeout);
   int epoll_pwait(int epfd, struct epoll_event *events,
                  int maxevents, int timeout,
                  const sigset_t *sigmask);
   int epoll_pwait2(int epfd, struct epoll_event *events,
                  int maxevents, const struct timespec *timeout,
                  const sigset_t *sigmask);

DESCRIPTION
The epoll_wait() system call waits for events on the epoll(7)
instance referred to by the file descriptor epfd. The buffer
pointed to by events is used to return information from the ready
list about file descriptors in the interest list that have some
events available. Up to maxevents are returned by epoll_wait().
The maxevents argument must be greater than zero.

   The timeout argument specifies the number of milliseconds that
   epoll_wait() will block.  Time is measured against the
   CLOCK_MONOTONIC clock.

   A call to epoll_wait() will block until either:

   • a file descriptor delivers an event;

   • the call is interrupted by a signal handler; or

   • the timeout expires.

   Note that the timeout interval will be rounded up to the system
   clock granularity, and kernel scheduling delays mean that the
   blocking interval may overrun by a small amount.  Specifying a
   timeout of -1 causes epoll_wait() to block indefinitely, while
   specifying a timeout equal to zero cause epoll_wait() to return
   immediately, even if no events are available.

   The struct epoll_event is defined as:

       typedef union epoll_data {
           void    *ptr;
           int      fd;
           uint32_t u32;
           uint64_t u64;
       } epoll_data_t;

       struct epoll_event {
           uint32_t     events;    /* Epoll events */
           epoll_data_t data;      /* User data variable */
       };

   The data field of each returned epoll_event structure contains
   the same data as was specified in the most recent call to
   epoll_ctl(2) (EPOLL_CTL_ADD, EPOLL_CTL_MOD) for the corresponding
   open file descriptor.

   The events field is a bit mask that indicates the events that
   have occurred for the corresponding open file description.  See
   epoll_ctl(2) for a list of the bits that may appear in this mask.

epoll_pwait()
The relationship between epoll_wait() and epoll_pwait() is
analogous to the relationship between select(2) and pselect(2):
like pselect(2), epoll_pwait() allows an application to safely
wait until either a file descriptor becomes ready or until a
signal is caught.

   The following epoll_pwait() call:

       ready = epoll_pwait(epfd, &events, maxevents, timeout, &sigmask);

   is equivalent to atomically executing the following calls:

       sigset_t origmask;

       pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
       ready = epoll_wait(epfd, &events, maxevents, timeout);
       pthread_sigmask(SIG_SETMASK, &origmask, NULL);

   The sigmask argument may be specified as NULL, in which case
   epoll_pwait() is equivalent to epoll_wait().

epoll_pwait2()
The epoll_pwait2() system call is equivalent to epoll_pwait()
except for the timeout argument. It takes an argument of type
timespec to be able to specify nanosecond resolution timeout.
This argument functions the same as in pselect(2) and ppoll(2).
If timeout is NULL, then epoll_pwait2() can block indefinitely.

RETURN VALUE
On success, epoll_wait() returns the number of file descriptors
ready for the requested I/O, or zero if no file descriptor became
ready during the requested timeout milliseconds. On failure,
epoll_wait() returns -1 and errno is set to indicate the error.

ERRORS
   EBADF  epfd is not a valid file descriptor.

   EFAULT The memory area pointed to by events is not accessible
          with write permissions.

   EINTR  The call was interrupted by a signal handler before either
          (1) any of the requested events occurred or (2) the
          timeout expired; see signal(7).

   EINVAL epfd is not an epoll file descriptor, or maxevents is less
          than or equal to zero.

VERSIONS
epoll_wait() was added to the kernel in version 2.6. Library
support is provided in glibc starting with version 2.3.2.

   epoll_pwait() was added to Linux in kernel 2.6.19.  Library
   support is provided in glibc starting with version 2.6.

   epoll_pwait2() was added to Linux in kernel 5.11.

CONFORMING TO
epoll_wait(), epoll_pwait(), and epoll_pwait2() are Linux-specific.

NOTES
While one thread is blocked in a call to epoll_wait(), it is
possible for another thread to add a file descriptor to the
waited-upon epoll instance. If the new file descriptor becomes
ready, it will cause the epoll_wait() call to unblock.

   If more than maxevents file descriptors are ready when
   epoll_wait() is called, then successive epoll_wait() calls will
   round robin through the set of ready file descriptors.  This
   behavior helps avoid starvation scenarios, where a process fails
   to notice that additional file descriptors are ready because it
   focuses on a set of file descriptors that are already known to be
   ready.

   Note that it is possible to call epoll_wait() on an epoll
   instance whose interest list is currently empty (or whose
   interest list becomes empty because file descriptors are closed
   or removed from the interest in another thread).  The call will
   block until some file descriptor is later added to the interest
   list (in another thread) and that file descriptor becomes ready.

C library/kernel differences
The raw epoll_pwait() and epoll_pwait2() system calls have a
sixth argument, size_t sigsetsize, which specifies the size in
bytes of the sigmask argument. The glibc epoll_pwait() wrapper
function specifies this argument as a fixed value (equal to
sizeof(sigset_t)).

BUGS
In kernels before 2.6.37, a timeout value larger than
approximately LONG_MAX / HZ milliseconds is treated as -1 (i.e.,
infinity). Thus, for example, on a system where sizeof(long) is
4 and the kernel HZ value is 1000, this means that timeouts
greater than 35.79 minutes are treated as infinity.

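int epoll_wait(int epfd, struct epoll_event *events, int maxevents, int timeout);
Waits for events to occur, similar to a select() call. The events argument receives the set of events from the kernel, and maxevents tells the kernel how many entries events can hold. maxevents must be greater than zero. It is sometimes said that maxevents may not exceed the size passed to epoll_create(), but that is not a real limit, since recent kernels ignore size. timeout is the timeout in milliseconds: 0 returns immediately and -1 blocks indefinitely. The function returns the number of events that need handling; 0 means the timeout expired. A return value of -1 indicates an error, and errno identifies its type.

  • epfd is the epoll descriptor.

  • events is a preallocated array of epoll_event structures into which epoll copies the events that occurred. (events must not be a null pointer: the kernel only copies data into this array and does not allocate user-space memory for us, which keeps this approach efficient.)

  • maxevents is the maximum number of events that this call may return; it is normally equal to the size of the preallocated events array.

  • timeout is the longest time to wait, in milliseconds, when no event has been detected. If timeout is 0, epoll_wait returns immediately without waiting, even when the ready list (rdllist) is empty.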
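The parameters described above can be tried out with a sketch that waits on the read end of a pipe. The helper wait_on_pipe and its -2 setup-error code are hypothetical; a real event loop would also retry when epoll_wait() fails with EINTR.

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: waits up to timeout_ms for the read end of a
 * fresh pipe. If send_byte is nonzero, one byte is written first so the
 * fd is already ready. Returns the epoll_wait() result, or -2 on a
 * setup error. */
int wait_on_pipe(int send_byte, int timeout_ms)
{
    enum { MAX_EVENTS = 8 };
    struct epoll_event events[MAX_EVENTS];  /* caller-allocated buffer */
    struct epoll_event ev = { 0 };
    int pipefd[2];
    int ready = -2;

    int epfd = epoll_create1(0);
    if (epfd == -1)
        return -2;
    if (pipe(pipefd) == -1) {
        close(epfd);
        return -2;
    }

    ev.events = EPOLLIN;
    ev.data.fd = pipefd[0];

    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pipefd[0], &ev) == 0 &&
        (!send_byte || write(pipefd[1], "x", 1) == 1)) {
        /* maxevents matches the size of the events array. */
        ready = epoll_wait(epfd, events, MAX_EVENTS, timeout_ms);

        /* Each returned entry carries the data stored at EPOLL_CTL_ADD. */
        if (ready > 0 && events[0].data.fd != pipefd[0])
            ready = -2;
    }

    close(pipefd[0]);
    close(pipefd[1]);
    close(epfd);
    return ready;
}
```

With nothing written, wait_on_pipe(0, 0) returns 0 at once and wait_on_pipe(0, 50) returns 0 after the timeout; with a byte pending, wait_on_pipe(1, -1) returns 1 without blocking.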

Example code

A simple socket-related example
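As a self-contained illustration of the three calls working together, the following sketch echoes one message over a UNIX socket pair. It is not the example the heading above refers to: echo_once is a hypothetical function, and socketpair() is assumed here only so that no network setup is needed.

```c
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: sends msg over a UNIX socket pair, lets an
 * epoll-driven "server" end echo it back, and stores the echo in reply.
 * Returns the number of bytes echoed, or -1 on error. */
ssize_t echo_once(const char *msg, char *reply, size_t reply_cap)
{
    int sv[2];                       /* sv[0]: client end, sv[1]: server end */
    struct epoll_event ev = { 0 }, out;
    char buf[256];
    ssize_t echoed = -1;
    size_t len = strlen(msg);

    if (len == 0 || len > sizeof buf || len > reply_cap)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) == -1)
        return -1;

    int epfd = epoll_create1(EPOLL_CLOEXEC);
    if (epfd != -1) {
        ev.events = EPOLLIN;         /* level-triggered, the default */
        ev.data.fd = sv[1];

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[1], &ev) == 0 &&
            write(sv[0], msg, len) == (ssize_t)len &&
            epoll_wait(epfd, &out, 1, 1000) == 1 &&
            (out.events & EPOLLIN)) {
            /* The event carries the fd that was stored in ev.data.fd. */
            ssize_t n = read(out.data.fd, buf, sizeof buf);
            if (n > 0 && write(out.data.fd, buf, (size_t)n) == n)
                echoed = read(sv[0], reply, reply_cap);
        }
        close(epfd);                 /* release the epoll handle */
    }

    close(sv[0]);
    close(sv[1]);
    return echoed;
}
```

A caller could pass a short string and a small buffer, for example echo_once("ping", reply, sizeof reply), and compare the bytes in reply with the message sent. A real server would loop over epoll_wait() and handle many descriptors instead of a single one.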

