Why are threads implemented in kernel space slow?

When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude - maybe more - faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.

Source: Modern Operating Systems (Andrew S. Tanenbaum | Herbert Bos)

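The argument above is made in favor of user-level threads. The user-level thread implementation is described as the kernel managing all the processes, where each process can have its own run-time (made available by a library package) that manages all the threads in that process.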
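To make the idea of a per-process run-time concrete, here is a minimal sketch in Python. It is not how a real threads package is written (those save and restore CPU registers, typically in C or assembly). Generators stand in for the saved thread state, blocking is left out, and the names `Runtime`, `spawn`, `run` and `worker` are made up for illustration. The only point is that the thread table, the ready queue and every switch live inside the process, and a switch is an ordinary function call.

```python
from collections import deque


class Runtime:
    """Toy user-level thread run-time (hypothetical; illustration only)."""

    def __init__(self):
        self.thread_table = {}   # tid -> generator holding the thread's saved state
        self.ready = deque()     # tids of threads that are ready to run
        self.next_tid = 0

    def spawn(self, func, *args):
        # Creating a thread just adds an entry to the table; no system call.
        tid = self.next_tid
        self.next_tid += 1
        self.thread_table[tid] = func(tid, *args)
        self.ready.append(tid)
        return tid

    def run(self):
        # Round-robin scheduler: resume a ready thread until it yields or ends.
        while self.ready:
            tid = self.ready.popleft()
            try:
                next(self.thread_table[tid])   # restore saved state and resume
                self.ready.append(tid)         # thread yielded; it is still ready
            except StopIteration:
                del self.thread_table[tid]     # thread finished; drop its entry


def worker(tid, steps, log):
    for step in range(steps):
        log.append((tid, step))
        yield   # voluntarily hand the CPU back to the run-time


log = []
rt = Runtime()
rt.spawn(worker, 2, log)
rt.spawn(worker, 2, log)
rt.run()
print(log)   # [(0, 0), (1, 0), (0, 1), (1, 1)]
```

The two toy threads interleave without the kernel ever being asked to schedule anything, which is the property the quoted passage relies on.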

Of course, merely calling a function in the run-time executes fewer instructions than trapping to the kernel, but why is the difference so large?

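For example, if threads are implemented in kernel space, the program has to make a system call every time a thread must be created. Fine. But that call only involves adding an entry with certain attributes to the thread table (which is also the case for user-space threads). When a thread switch has to happen, the kernel can simply do what the run-time (in user space) would do. The only real difference I can see here is that the kernel is involved in all of this. How can the performance difference be so significant?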
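One rough way to look at the gap is to time a plain function call against a call that enters the kernel. The sketch below is in Python and assumes that `os.getppid()` performs a real system call on each invocation (an assumption; it depends on the platform and C library). `noop` and `measure` are names invented for the example. Interpreter overhead is included in both numbers, so the ratio it prints will understate the difference at the machine-instruction level.

```python
import os
import timeit

NUMBER = 200_000   # arbitrary repetition count for the example


def noop():
    # Plain user-space function call: no kernel involvement.
    return None


def measure(number=NUMBER):
    # Returns (seconds for `number` plain calls, seconds for `number` getppid calls).
    t_call = timeit.timeit(noop, number=number)
    t_sys = timeit.timeit(os.getppid, number=number)   # assumed to trap to the kernel
    return t_call, t_sys


t_call, t_sys = measure()
print(f"function call: {t_call / NUMBER * 1e9:.0f} ns each")
print(f"os.getppid():  {t_sys / NUMBER * 1e9:.0f} ns each")
```

The numbers vary by machine and OS, so treat them as a starting point for your own measurement and not as evidence for a specific ratio.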

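I think answering this question draws on a lot of OS and parallel/distributed computing knowledge (I am not sure about the answer, but I will try my best).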

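So think about it this way. The library package will perform better than code you write in the kernel itself. In the package, the interrupt raised by this code is handled right away and all of the execution runs to completion. When you write it in the kernel, other, unrelated interrupts can arrive as well. On top of that, accessing threads again and again is hard on the kernel, because there is an interrupt every time. I hope this gives a better view.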

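It is not correct to say that user-space threads are better than kernel-space threads, since each has its own advantages and disadvantages. With user-space threads, the application is responsible for managing the threads, so they are easier to implement and depend less on the OS. However, you cannot take advantage of multiprocessing. In contrast, kernel-space threads are handled by the OS, so you need to implement them according to the OS you use, which is a more complicated task. However, you get more control over your threads. For a more comprehensive tutorial, take a look here.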

Threads implemented as a library package in user space perform significantly better. Why?

They're not.

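The fact is that most task switches are caused by threads blocking (having to wait for IO from disk or network, or from the user, or for time to pass, or for some kind of semaphore/mutex shared with a different process, or for some kind of pipe/message/packet from a different process) or by threads unblocking (because whatever they were waiting for happened); and most reasons to block and unblock involve the kernel in some way (e.g. device drivers, networking stack, ...); so doing task switches in the kernel when you're already in the kernel is faster (because it avoids the overhead of switching to user-space and back for no sane reason).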

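Where user-space task switching "works" is when the kernel isn't involved at all. This mostly only happens when someone has failed to do threads properly (e.g. they've got thousands of threads and coarse-grained locking and are constantly switching between threads due to lock contention, instead of something sensible like a "worker thread pool"). It also only works when all threads have the same priority - you don't want a situation where very important threads belonging to one process don't get CPU time because very unimportant threads belonging to a different process are hogging the CPU (but that's exactly what happens with user-space threading, because one process has no idea about threads belonging to a different process).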
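As an illustration of the "worker thread pool" alternative mentioned above, here is a minimal Python sketch using the standard library's `concurrent.futures.ThreadPoolExecutor`. The function `handle_request`, the pool size of 4 and the job count of 1000 are arbitrary choices for the example, not recommendations. A small, fixed number of threads pulls jobs from a queue, so there is no need for thousands of threads fighting over one coarse lock.

```python
from concurrent.futures import ThreadPoolExecutor


def handle_request(n):
    # Stand-in for real work (hypothetical); each job is independent,
    # so no shared lock is needed.
    return n * n


# 1000 jobs are served by only 4 threads instead of 1000 threads.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(handle_request, range(1000)))

print(len(results), results[:5])   # 1000 [0, 1, 4, 9, 16]
```

`pool.map` returns results in submission order, so the caller does not need its own synchronization to collect them.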

Mostly; user-space threading is a silly broken mess. It's not faster or "significantly better"; it's worse.

When a thread does something that may cause it to become blocked locally, for example, waiting for another thread in its process to complete some work, it calls a run-time system procedure. This procedure checks to see if the thread must be put into blocked state. If so, it stores the thread's registers in the thread table, looks in the table for a ready thread to run, and reloads the machine registers with the new thread's saved values. As soon as the stack pointer and program counter have been switched, the new thread comes to life again automatically. If the machine happens to have an instruction to store all the registers and another one to load them all, the entire thread switch can be done in just a handful of instructions. Doing thread switching like this is at least an order of magnitude - maybe more - faster than trapping to the kernel and is a strong argument in favor of user-level threads packages.

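This is talking about a situation where the CPU itself does the actual task switch (and either the kernel or a user-space library tells the CPU when to do a task switch and what to switch to). This has some relatively interesting history behind it...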

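In the 1980s Intel designed a CPU ("iAPX" - see https://en.wikipedia.org/wiki/Intel_iAPX_432) for "secure object oriented programming"; each object would have its own isolated memory segments and its own privilege level, and could transfer control directly to other objects. The general idea was that you'd have a single-tasking system consisting of global objects using cooperating flow control. This failed for multiple reasons, partly because all the protection checks ruined performance, and partly because the majority of software at the time was designed for "multi-process preemptive time sharing, with procedural programming".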

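When Intel designed protected mode (80286, 80386) they still had hopes for a "single-tasking system consisting of global objects using cooperating flow control". They included hardware task/object switching, a local descriptor table (so each task/object can have its own isolated segments), call gates (so tasks/objects can transfer control to each other directly), and modified a few control flow instructions (call far and jmp far) to support the new control flow. Of course this failed for the same reasons that iAPX failed; and (as far as I know) nobody has ever used these things for the "global objects using cooperative flow control" they were originally designed for. Some people (e.g. very early Linux) did try to use the hardware task switching for more traditional "multi-process preemptive time sharing, with procedural programming" systems; but found that it was slow because the hardware task switch did too many protection checks that software task switching could avoid, saved/reloaded too much state that software task switching could avoid, and didn't do any of the other things a task switch needs (e.g. keeping statistics of CPU time used, saving/restoring debug registers, etc.).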

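Now.. Andrew S. Tanenbaum is a micro-kernel advocate. His ideal system consists of isolated pieces in user-space (processes, services, drivers, ...) communicating via synchronous messaging. In practice (ignoring superficial differences in terminology) this "isolated pieces in user-space communicating via synchronous messaging" is almost entirely identical to Intel's twice-failed "global objects using cooperative flow control".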

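Mostly; in theory (if you ignore all the practical problems, like the CPU not saving all of the state, and wanting to do extra work on task switches such as tracking statistics), for the specific type of OS that Andrew S. Tanenbaum prefers (a micro-kernel with synchronous message passing, without any thread priorities), it's plausible that the paragraph quoted above is more than just wishful thinking.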

performance

multithreading

operating-system

kernel