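Recently I ran into a network problem: a client thread calling connect() would send a few SYNs and then stop sending them, and from then on every connect() call returned EINVAL.
I traced it with strace. The first argument to connect(), the socket fd, had not changed; the address and port were correct; and the third argument, len, came from sizeof, so it could not be wrong.
Fortunately the problem was easy to reproduce.
Adding prints step by step showed that the EINVAL is returned from __inet_stream_connect: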
https://elixir.bootlin.com/linux/v5.15.178/source/net/ipv4/af_inet.c#L649
    switch (sock->state) {
    default:
        err = -EINVAL; /* later connect() syscalls keep returning -22 and never send a SYN */
        goto out;
    case SS_CONNECTED:
        err = -EISCONN;
        goto out;
    case SS_CONNECTING:
        if (inet_sk(sk)->defer_connect)
            err = is_sendmsg ? -EINPROGRESS : -EISCONN;
        else
            err = -EALREADY;
        /* Fall out of switch with err, set for this state */
        break;
    case SS_UNCONNECTED:
        err = -EISCONN;
        if (sk->sk_state != TCP_CLOSE)
            goto out;

        if (BPF_CGROUP_PRE_CONNECT_ENABLED(sk)) {
            err = sk->sk_prot->pre_connect(sk, uaddr, addr_len);
            if (err)
                goto out;
        }
        ... ...
        err = sk->sk_prot->connect(sk, uaddr, addr_len);
        if (err < 0)
            goto out;

        sock->state = SS_CONNECTING;

    /* Connection was closed by RST, timeout, ICMP error
     * or another process disconnected us.
     */
    if (sk->sk_state == TCP_CLOSE)
        goto sock_error;

    /* sk->sk_err may be not zero now, if RECVERR was ordered by user
     * and error was received after socket entered established state.
     * Hence, it is handled normally after connect() return successfully.
     */
    sock->state = SS_CONNECTED;
    err = 0;
out:
    return err;

sock_error:
    err = sock_error(sk) ? : -ECONNABORTED;
    sock->state = SS_UNCONNECTED;
    if (sk->sk_prot->disconnect(sk, flags))
        sock->state = SS_DISCONNECTING; /* this is the key point: after the last SYN times out, disconnect fails and the socket state is set to SS_DISCONNECTING */
    goto out;
}
More prints answered the next question: why does sk->sk_prot->disconnect fail? Its return value is -EBUSY.
It comes from here:
https://elixir.bootlin.com/linux/v5.15.178/source/net/ipv4/tcp.c#L2989
int tcp_disconnect(struct sock *sk, int flags)
{
    ... ...
    /* Deny disconnect if other threads are blocked in sk_wait_event()
     * or inet_wait_for_connect().
     */
    if (sk->sk_wait_pending)
        return -EBUSY; /* the error is returned here */
So sk_wait_pending must be non-zero. The next step is to look at where sk_wait_pending is modified:
https://elixir.bootlin.com/linux/v5.15.178/source/include/net/sock.h#L1128
#define sk_wait_event(__sk, __timeo, __condition, __wait)       \
    ({  int __rc;                                               \
        __sk->sk_wait_pending++;                                \
        release_sock(__sk);                                     \
        __rc = __condition;                                     \
        if (!__rc) {                                            \
            *(__timeo) = wait_woken(__wait,                     \
                                    TASK_INTERRUPTIBLE,         \
                                    *(__timeo));                \
        }                                                       \
        sched_annotate_sleep();                                 \
        lock_sock(__sk);                                        \
        __sk->sk_wait_pending--;                                \
        __rc = __condition;                                     \
        __rc;                                                   \
    })
And sk_wait_event is called from:
https://elixir.bootlin.com/linux/v5.15.178/source/net/core/stream.c#L75
/**
 * sk_stream_wait_connect - Wait for a socket to get into the connected state
 * @sk: sock to wait on
 * @timeo_p: for how long to wait
 *
 * Must be called with the socket locked.
 */
int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
{
    DEFINE_WAIT_FUNC(wait, woken_wake_function);
    struct task_struct *tsk = current;
    int done;

    do {
        int err = sock_error(sk);
        if (err)
            return err;
        if ((1 << sk->sk_state) & ~(TCPF_SYN_SENT | TCPF_SYN_RECV))
            return -EPIPE;
        if (!*timeo_p)
            return -EAGAIN;
        if (signal_pending(tsk))
            return sock_intr_errno(*timeo_p);

        add_wait_queue(sk_sleep(sk), &wait);
        sk->sk_write_pending++;
        done = sk_wait_event(sk, timeo_p,
                             !READ_ONCE(sk->sk_err) &&
                             !((1 << READ_ONCE(sk->sk_state)) &
                               ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)), &wait);
        remove_wait_queue(sk_sleep(sk), &wait);
        sk->sk_write_pending--;
    } while (!done);
    return 0;
}
EXPORT_SYMBOL(sk_stream_wait_connect);
sk_stream_wait_connect is called on the TCP send path.
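The two bit-mask tests in sk_stream_wait_connect are easy to misread, so here is a small user-space sketch of them. It is not kernel code: the ST_* constants, the STF() macro and both function names are my own, and the numeric values are copied from the kernel's TCP state enum on the assumption that they match the tree linked above.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumed to mirror the kernel's TCP state numbering (illustrative names). */
enum {
    ST_ESTABLISHED = 1,
    ST_SYN_SENT    = 2,
    ST_SYN_RECV    = 3,
    ST_CLOSE       = 7,
    ST_CLOSE_WAIT  = 8
};

/* Plays the role of the kernel's TCPF_* flags: one bit per state. */
#define STF(state) (1u << (state))

/* True when the state passes the -EPIPE check, i.e. a handshake is in flight. */
static bool handshake_in_progress(int state)
{
    assert(state >= 0 && state < 32); /* keep the shift well defined */
    return (STF(state) & ~(STF(ST_SYN_SENT) | STF(ST_SYN_RECV))) == 0;
}

/* True when the condition passed to sk_wait_event() holds and the loop ends. */
static bool connect_wait_done(int state, int sk_err)
{
    assert(state >= 0 && state < 32);
    return !sk_err &&
           !(STF(state) & ~(STF(ST_ESTABLISHED) | STF(ST_CLOSE_WAIT)));
}
```

In this model a sender only goes to sleep when handshake_in_progress() is true and the timeout is non-zero, which is exactly the window in which the connect thread is still retransmitting SYNs.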
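The prints showed that the connect thread and the send thread were operating on the same socket fd at the same time. The root cause is this: the connect thread sends several SYNs, the connection attempt times out, and the kernel runs disconnect. At that moment the send thread happens to be inside the wait-for-connect path, so disconnect fails with EBUSY and the socket state is set to SS_DISCONNECTING. Every later connect() syscall then returns EINVAL immediately and never sends a SYN.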
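To convince myself of the sequence, it helps to replay it in a toy user-space model. This is only a sketch under simplifying assumptions: struct model_sock, the M_* states and both functions are invented for illustration, every handshake is assumed to time out, and the sender's presence is reduced to a counter that the caller sets by hand.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy stand-ins for the kernel's socket states (names are made up). */
enum model_state { M_UNCONNECTED, M_CONNECTING, M_CONNECTED, M_DISCONNECTING };

struct model_sock {
    enum model_state state; /* plays the role of sock->state */
    int wait_pending;       /* plays the role of sk->sk_wait_pending */
    int syn_rounds;         /* connect() calls that got as far as sending SYNs */
};

/* Mirrors the check in tcp_disconnect(): refuse while a waiter is blocked. */
static int model_disconnect(const struct model_sock *s)
{
    assert(s != NULL);
    return s->wait_pending ? -EBUSY : 0;
}

/* One blocking connect() whose handshake is assumed to time out. */
static int model_connect_timeout(struct model_sock *s)
{
    assert(s != NULL);
    switch (s->state) {
    case M_UNCONNECTED:
        break;
    case M_CONNECTED:
        return -EISCONN;
    case M_CONNECTING:
        return -EALREADY;
    default:
        return -EINVAL; /* M_DISCONNECTING ends up here */
    }

    s->syn_rounds++;

    /* The sock_error path: reset the state, then try to disconnect. */
    s->state = M_UNCONNECTED;
    if (model_disconnect(s))
        s->state = M_DISCONNECTING;
    return -ETIMEDOUT;
}
```

With wait_pending at 0, repeated calls keep returning -ETIMEDOUT and syn_rounds keeps growing. Set wait_pending to 1 for a single call and the state sticks at M_DISCONNECTING, after which every call returns -EINVAL without touching syn_rounds, which matches what strace showed.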
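The fix is to pass MSG_DONTWAIT in the flags argument of send(), so the send thread never enters the wait-for-connect path and instead gets an error right away if the socket is not connected. With that change, every connect() call from the connect thread triggers SYN packets again.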
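A minimal sketch of what the send side might look like after the change. The helper name send_nowait is hypothetical, and adding MSG_NOSIGNAL is my own assumption rather than part of the fix: it only keeps a failed send from raising SIGPIPE.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Hypothetical helper: send without sleeping in the kernel.
 * Returns the number of bytes queued, or -1 with errno set.
 */
static ssize_t send_nowait(int fd, const void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    /* MSG_DONTWAIT makes the timeout zero, so the wait-for-connect path
     * returns an error instead of blocking.
     * MSG_NOSIGNAL (an addition of mine) suppresses SIGPIPE. */
    return send(fd, buf, len, MSG_DONTWAIT | MSG_NOSIGNAL);
}
```

Judging from the sk_stream_wait_connect code quoted earlier, the caller should expect EAGAIN while a handshake is still in progress and EPIPE when the socket is not connecting at all, and should treat both as "not connected yet, try again later" rather than as fatal errors.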