Topic: [Repost] How to solve the thread-safety problem of static variables
小小C
[Expert points: 4570] Posted on 2009-02-20 09:19:00
This is an article written by a colleague; I am reposting it here for discussion.
A very common Singleton implementation in C++ uses a static function in a class template (or a plain function template) that defines a static local variable, for example:
template <class T>
class Singleton
{
public:
    static T& Instance()
    {
        static T obj;
        return obj;
    }
};
This harbors a well-known pitfall: the static variable is initialized at the time the function is first called. The generated assembly looks like this (compiled with VS2008):
004115BD mov eax,dword ptr [`Singleton<SomeClass>::Instance'::`2'::`local static guard' (41A154h)]
004115C2 and eax,1
004115C5 jne Singleton<SomeClass>::Instance+79h (4115F9h)
004115C7 mov eax,dword ptr [`Singleton<SomeClass>::Instance'::`2'::`local static guard' (41A154h)]
004115CC or eax,1
004115CF mov dword ptr [`Singleton<SomeClass>::Instance'::`2'::`local static guard' (41A154h)],eax
004115D4 mov dword ptr [ebp-4],0
004115DB mov ecx,offset obj (41A148h)
004115E0 call SomeClass::SomeClass (41100Ah)
004115E5 push offset `Singleton<SomeClass>::Instance'::`2'::`dynamic atexit destructor for 'obj'' (411226h)
004115EA call @ILT+125(_atexit) (411082h)
004115EF add esp,4
004115F2 mov dword ptr [ebp-4],0FFFFFFFFh
There is a dword-sized local static guard with an initial value of 0. Every time the function is called, the lowest bit of this value is checked; if it is not 1, the constructor is called (004115E0 call SomeClass::SomeClass (41100Ah)).
Note: the C++0x standard requires the compiler to make the initialization of such static variables thread-safe, so on a compiler that supports the 0x standard the original Singleton definition can be used as-is.
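For reference, a minimal sketch of the same guarantee spelled out explicitly with C++0x's std::call_once facility (the class name here is illustrative; on such a compiler the plain definition above already suffices):

#include <mutex>   // std::once_flag, std::call_once (C++0x/C++11)

template <class T>
class CallOnceSingleton
{
public:
    static T& Instance()
    {
        static T* obj = 0;             // zero-initialized, needs no guard
        static std::once_flag flag;    // constant-initialized (constexpr constructor)
        // call_once runs the lambda exactly once, even under concurrent first calls.
        std::call_once(flag, [] { static T instance; obj = &instance; });
        return *obj;
    }
};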
Because of this guard mechanism, if several threads call the function for the first time simultaneously, they may all read a local static guard value of 0 and each run the constructor once. If the object allocates memory or other important resources in its constructor, that memory or those resources will leak. Worse, if the constructor is non-trivial, for example it inserts data into a container, the result can be serious errors or even a crash, and even adding a lock does not help!
An example:
The constructor proceeds as follows:
1. Call the member constructors (this usually cannot be changed)
2. Initialize a lock
3. Acquire the lock
4. Insert data into one of the member containers
5. Release the lock.
Then the following interleaving leads to a chain of severe errors that is very likely to end in a crash:
1. Thread A calls the member constructors
2. Thread A initializes the lock
3. Thread A acquires the lock
4. Thread A modifies a member of the container; before the change is completely written, a thread switch happens, or thread B starts running on another CPU core
5. Thread B calls the member constructors
6. Thread B initializes the lock, overwriting the lock initialized by thread A, so thread A's lock is leaked.
7. Thread B acquires the lock and is not blocked by thread A (they are different locks)
8. Thread B tries to modify the member, but the member data is already inconsistent, which produces wrong data or even a crash
9. Thread B releases the lock
10. Thread A tries to release its lock, but the lock it initialized has been overwritten, so the unlock fails with an error.
As this sequence shows, even adding a critical-section lock cannot prevent the object from being constructed twice, because the lock itself also has to be constructed. The only way around that is a lock that does not depend on static storage, such as a named lock, which is clumsy and inconvenient to use.
However, scalar static variables are initialized differently (their values are written directly into the static data area at load time), and we can exploit this to build a thread-safe Singleton class quite easily:
#include <windows.h>   // InterlockedAdd, InterlockedDecrement, Sleep
#include <new>         // placement new

template <typename T>
class ThreadSafeSingleton
{
public:
    static T& Instance()
    {
        static volatile long state = 0;   // 0: not constructed, 1: constructed, >1: constructing
        static char obj[sizeof(T)];       // raw storage for the object
        if (state == 0)
        {
            long val = InterlockedAdd(&state, 2);
            if (val == 2)
            {
                new (obj) T();            // only the first caller constructs the object
                InterlockedDecrement(&state);
            }
            else
            {
                InterlockedAdd(&state, -2);
            }
        }
        while (state > 1)   // could also be while ((state & 1) == 0)
        {
            Sleep(0);
        }
        return reinterpret_cast<T&>(obj[0]);
    }
};
The idea is to use the state variable in place of the local static guard: state 0 means the object has not been constructed, 1 means it has been constructed, and a value greater than 1 means it is currently being constructed. A separate obj buffer is defined to hold the object.
When several threads enter at the same time, state is initially 0, so all of them enter the if block and call InterlockedAdd. Because the operating system guarantees that this operation is atomic, exactly one thread (the first caller) obtains val == 2. That thread runs the object's constructor and then decrements state by 1. The other threads subtract 2 from state again and leave the block as if they had never been there. Any thread entering the function later sees state != 0 and skips the block entirely.
Meanwhile, the other threads sit in the loop that follows, waiting for the constructing thread to finish. Sleep(0) is there mainly to yield the CPU so that the constructing thread is not starved. When construction is done, the constructing thread decrements state by 1 and leaves the block; all threads then see state == 1 and return directly.
The code above still has one BUG: the destructor is never called when the program exits. Another headache is that InterlockedAdd is usually not available on 32-bit platforms. So here is a small patch that takes a slightly different approach:
#include <windows.h>   // InterlockedIncrement, Sleep
#include <new>         // placement new

template <typename T>
class ThreadSafeSingleton
{
    // holder provides raw storage for the object and destroys it at program exit.
    class holder
    {
        char           m_space[sizeof(T)];
        volatile long &m_state;
    public:
        holder(volatile long &state)
            : m_state(state)
        {
        }
        ~holder()
        {
            // Static objects are normally destroyed after all threads have finished,
            // so we do not consider a thread still being stuck in the constructor here.
            if (m_state < 0)
                reinterpret_cast<T*>(m_space)->~T();
        }
    };
public:
    static T& Instance()
    {
        static volatile long state = 0;   // 0: not constructed, >0: constructing, <0: constructed
        static holder obj(state);
        if (state == 0)
        {
            long val = InterlockedIncrement(&state);
            if (val == 1)
            {
                new (&obj) T();            // only the first caller constructs the object
                state = (long)0x80000000;  // set the sign bit: construction is finished
            }
        }
        while (state > 0)   // wait until construction has finished
        {
            Sleep(0);
        }
        return reinterpret_cast<T&>(obj);
    }
};
The key point is that we trust there will never be 0x80000000 concurrent threads, so once construction has finished, state stays negative and cannot suddenly be pushed back to zero by concurrent increments.
The above is the final version; everyone is welcome to discuss.
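As a side note, the same first-thread election can also be written with InterlockedCompareExchange, which is available on 32-bit Windows as well; a hedged sketch of the Instance body only, with the rest of the class unchanged:

    static T& Instance()
    {
        static volatile long state = 0;       // 0: not constructed, 1: constructing, <0: constructed
        static holder obj(state);
        if (InterlockedCompareExchange(&state, 1, 0) == 0)
        {
            new (&obj) T();                   // this thread won the election
            state = (long)0x80000000;         // publish: construction is finished
        }
        while (state > 0)                     // the other threads wait here
            Sleep(0);
        return reinterpret_cast<T&>(obj);
    }

Usage is unchanged: ThreadSafeSingleton<SomeClass>::Instance() returns the one shared object from any thread.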
Replies (17 in total)
#1
sarrow [Expert points: 35660] Posted on 2009-02-20 10:58:00
What the OP describes is indeed true.
I usually just sidestep the problem: the resource (or rather, the function that returns the static object) is created in the main thread, by calling that function once, before any child threads are created.
Heh, I know my approach is not that great.
#2
强强 [Expert points: 4740] Posted on 2009-02-20 11:11:00
I don't have any deep understanding of threads; for now I just know to watch out for two things:
1. While a thread is running, avoid operating on global variables as much as possible; when the thread is created, copy the variables it needs into the thread's local variables, and when it finishes, write the final results back to the globals.
2. Use critical-section objects.
I figured these out on my own; corrections from the experts are welcome.
#3
小小C [Expert points: 4570] Posted on 2009-02-20 11:39:00
[quote]I don't have any deep understanding of threads; for now I just know to watch out for two things:
1. While a thread is running, avoid operating on global variables as much as possible; when the thread is created, copy the variables it needs into the thread's local variables, and when it finishes, write the final results back to the globals.
2. Use critical-section objects.
I figured these out on my own; corrections from the experts are welcome.[/quote]
Mate, for the problem above, adding a critical section does not help either, because the critical section itself is also a global object.
#4
iAkiak [Expert points: 8460] Posted on 2009-02-20 12:14:00
Take a look at Loki's Singleton; it can be [configured] with several threading/sharing policies. That is basically enough.
#5
Chipset [Expert points: 16190] Posted on 2009-02-20 14:17:00
// Let me post one too
Developing Lightweight, Statically Initializable C++ Mutexes
Threadsafe initialization of Singletons
Vladimir Kliatchko
The novel synchronization mechanism Vladimir presents here comes in handy for a wide range of applications.
While working on a solution for a fairly narrow and specific problem, I discovered a new synchronization mechanism. This mechanism not only provided a novel and effective solution to my problem, but turned out to have a number of properties making it useful for a wide range of important applications.
My original problem had to do with the implementation of Singletons. The issues with Singletons involve the order and time of their initialization and destruction, as well as the thread safety of these operations. Despite the multitude of techniques developed to deal with these issues, there does not seem to be one universal solution—let alone a simple one.
The particular Singleton I was working on had to be located in a low-level library, which was used by many applications and other libraries. This constraint ruled out the use of a C++ file-scope variable as a mechanism for initializing the Singleton because there is no way to ensure that initialization of such an object is done properly before its first use. To avoid the active collaboration that would result from explicit initialization, I concluded that the Singleton had to be initialized on first access. The Meyers Singleton pattern (Listing One) fits this requirement. When this pattern is employed, the Singleton is initialized the first time it is used, and automatically destroyed when the process terminates. A word of caution: It is sometimes necessary not only to ensure that a Singleton is destroyed at process termination, but also to establish a particular relative order; for my Singleton, such was not the case.
class Singleton {
    Singleton();
    ~Singleton();
public:
    static Singleton& instance();
};
Singleton& Singleton::instance() {
    static Singleton singleton;
    return singleton;
}
Listing One
This implementation of a Singleton leaves the issue of thread safety unresolved. If multiple threads enter the instance routine simultaneously and if this routine has never been executed before, various problems may occur, depending on the compiler and platform. For example, the constructor of my Singleton may be executed multiple times or not at all. Alternatively, the constructor may be executed exactly once, which is correct, but one of the instance routines may return before the construction is completed—again, leading to unpredictable and hard-to-debug failures. As you'd expect, you can eliminate this race condition by using one of a number of various synchronization mechanisms. Listing Two, for example, employs a mutex. In Listing Two, I did not show how the mutex variable is initialized. The mutex itself is a Singleton! If this mutex has a nontrivial constructor or destructor, we have a chicken-and-egg problem.
class Singleton
{
    Singleton();
    ~Singleton();
    static Singleton *s_singleton;
    static void init();
public:
    static Singleton& instance();
};
Singleton *Singleton::s_singleton = 0;
void Singleton::init()
{
    static Singleton singleton;
    s_singleton = &singleton;
}
Singleton& Singleton::instance()
{
    lock(&mutex);
    if (!s_singleton) init();
    unlock(&mutex);
    return *s_singleton;
}
Listing Two
UNIX: pthread Solutions
Solving the problem of initializing the mutex on UNIX platforms isn't difficult. The pthread library available on these platforms provides a mutex that is a Plain Old Data (POD) type requiring only static (nonruntime) initialization, as in Listing Three. Additionally, the pthread library provides a mechanism called "once routines" that can also be used to solve this initialization problem (Listing Four).
Singleton& Singleton::instance()
{
    static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
    pthread_mutex_lock(&mutex);
    if (!s_singleton) init();
    pthread_mutex_unlock(&mutex);
    return *s_singleton;
}
Listing Three
Singleton& Singleton::instance()
{
    static pthread_once_t once_control = PTHREAD_ONCE_INIT;
    pthread_once(&once_control, &Singleton::init);
    return *s_singleton;
}
#6
Chipset [Expert points: 16190] Posted on 2009-02-20 14:19:00
Listing Four
Windows: Spin-Lock Solution
Unfortunately, the Singleton I was implementing had to support Windows. That's where the main complication arose. Even critical sections, which are the simplest synchronization mechanism available on Windows applicable for my Singleton, do not support static initialization; a call to InitializeCriticalSection is required. As for once routines, there is no support for them in the Win32 API.
Of course, I could have implemented a simple spin-lock using the Interlocked family of routines and the Sleep function, as in Listing Five.
Singleton& Singleton::instance()
{
    static volatile LONG lock = 0;
    while (InterlockedExchange(&lock, 1) != 0) Sleep(1);
    if (!s_singleton) init();
    InterlockedExchange(&lock, 0);
    return *s_singleton;
}
Listing Five
Listing Five can be further optimized in two important ways. First, you can replace the lock variable with a three-state control variable (Listing Six).
Singleton& Singleton::instance()
{
    static volatile LONG control = 0;
    while (1)
    {
        LONG prev = InterlockedCompareExchange(&control, 1, 0);
        if (2 == prev)
            // The singleton has been initialized.
            return *s_singleton;
        else if (0 == prev)
            // got the lock
            break;
        else
        {
            // Another thread is initializing the singleton:
            // must wait.
            assert(1 == prev);
            Sleep(1); // sleep 1 millisecond
        }
    }
    if (!s_singleton) init();
    InterlockedExchange(&control, 2);
    return *s_singleton;
}
#7
Chipset [Expert points: 16190] Posted on 2009-02-20 14:21:00
Listing Six
The control variable takes one of three values:
• No threads have yet attempted to initialize the Singleton. (This is the initial state.)
• The Singleton is in the process of being initialized. (Any thread, except the first thread that sets the control variable to this value, must wait for the initialization to be completed.)
• The initialization has been completed.
Introducing these three states ensures that, once the initialization is completed, no spinning is necessary, even if multiple threads enter the Singleton::instance routine simultaneously.
The second optimization lets the sleep interval adjust dynamically using exponential backoff (Listing Seven; available electronically; see "Resource Center," page 5). If the initialization of the Singleton (the constructor of our Singleton class) takes a relatively long time or if there are many spinning threads, choosing small sleep intervals may create considerable overhead. Moreover, this problem is exacerbated whenever spinning threads delay the progress of the thread performing the initialization. On systems with less sophisticated thread scheduling (Windows 98, for instance), this deficiency may even bring the entire system to a halt.
On the other hand, if the initialization of the Singleton is relatively quick, choosing a large sleep interval causes an unnecessary delay. Although adjusting the sleep interval dynamically does not solve these problems completely, it does reduce their likelihood.
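Since Listing Seven is available electronically rather than reprinted here, the following is a hedged sketch of what an exponential-backoff variant of Listing Six might look like; the initial interval and the 128-millisecond cap are illustrative choices, not values from the article:

Singleton& Singleton::instance()
{
    static volatile LONG control = 0;
    DWORD sleepMs = 1;                            // initial sleep interval
    for (;;)
    {
        LONG prev = InterlockedCompareExchange(&control, 1, 0);
        if (2 == prev)
            return *s_singleton;                  // already initialized
        else if (0 == prev)
            break;                                // we got the lock
        else
        {
            Sleep(sleepMs);                       // another thread is initializing
            if (sleepMs < 128)
                sleepMs *= 2;                     // exponential backoff, capped
        }
    }
    if (!s_singleton) init();
    InterlockedExchange(&control, 2);
    return *s_singleton;
}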
With these substantial optimizations, the spin-lock approach, though inelegant, is acceptable for many applications. Nonetheless, because I was looking for a generic solution that I planned to implement in a low-level library used by a wide range of applications and for a variety of Singleton objects, the spin-lock approach did not appeal to me.
Windows: Named-Mutex Solution
In search of a better solution, I decided to look at how two open-source libraries—Boost.Threads (boost.org/doc/html/threads.html) and Pthreads-w32 (sourceware.org/pthreads-win32/)—deal with this problem. Although neither library is directly concerned with Singletons, each has code that implements once-routine support on Windows. Boost.Threads provides a portable set of handy MT-related primitives to C++ programmers. Naturally, the platforms supported by this library include Windows. Pthreads-w32, on the other hand, implements a pthread-compatible layer, including pthread_once, on top of the Windows API, thus simplifying the porting of MT programs from UNIX to Windows. I reasoned that whatever technique these libraries use to implement once routines on Windows, it should be possible to use that same technique to implement my Singleton.
The technique I discovered in Boost.Threads (libs/thread/src/once.cpp under the Boost source tree) relied on a special Windows feature that lets a name be associated with a mutex. If a thread (not necessarily within the same process) creates a mutex with a name that already exists, instead of creating a new mutex, the system returns a handle to the existing one. The system also makes sure that the mutex is not destroyed as long as there are any open handles to it. Finally, the system guarantees that creating a mutex and closing a mutex handle are atomic operations. Listing Eight (available electronically) shows how a named mutex can be used to implement our Singleton.
The name of the mutex is designed to be unique. If you don't think the string "Singleton::instance" provides sufficient guarantee from collisions, you can use a random string generated by your favorite method (the string actually used in the Boost code is "2AC1A572DB6944B0A65C38C4140AF2F4"). Also, note that this code uses the Double-Checked-Lock optimization (www.cs.wustl.edu/~schmidt/PDF/DC-Locking.pdf). To avoid the various problems with naïve implementations of this optimization (www.aristeia.com/Papers/DDJ_Jul_Aug_2004_revised.pdf), we make sure that all access to the flag is done through Windows interlocked routines, which incorporate all the necessary memory barriers. The InterlockedExchangeAdd(&flag, 0) call—a no-op incrementing the flag by 0—is just a way of using interlocked routines for reading the flag. (The same effect can be achieved more efficiently with some inlined assembler.)
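Listing Eight is likewise available electronically; the following is a hedged sketch of the named-mutex approach as described above (the fixed name string is the one quoted from Boost; the actual listing derives a unique name, and error handling is omitted):

Singleton& Singleton::instance()
{
    static volatile LONG flag = 0;
    // Read the flag through an interlocked no-op add to get the memory barrier.
    if (InterlockedExchangeAdd(&flag, 0) == 0)
    {
        // Create or open the named mutex shared by all callers.
        HANDLE mutex = CreateMutexA(NULL, FALSE,
                                    "2AC1A572DB6944B0A65C38C4140AF2F4");
        WaitForSingleObject(mutex, INFINITE);
        if (InterlockedExchangeAdd(&flag, 0) == 0)   // double-checked
        {
            init();                                  // construct the singleton
            InterlockedExchange(&flag, 1);           // publish it
        }
        ReleaseMutex(mutex);
        CloseHandle(mutex);
    }
    return *s_singleton;
}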
#8
Chipset [Expert points: 16190] Posted on 2009-02-20 14:22:00
Although a clever technique from the Boost.Threads library lets you improve upon the previous implementation by removing the inherent unpredictability of spinning, the solution suffers from several inefficiencies. To begin with, named mutexes are intended to be used for synchronization across process boundaries rather than among threads within the same process. Creating and manipulating such "interprocess" mutexes is expensive—substantially more so than those used to synchronize threads. Moreover, with this heavy approach, a named mutex has to be created at least once, even when there is absolutely no contention—even if the application using this Singleton is single-threaded! Finally, generating a unique mutex name from a pointer is artificial and rather inelegant. Given these drawbacks, this solution, although acceptable in many applications, did not satisfy the needs of my high-performance reusable infrastructure library.
Developing Class QLock
Next, I looked at the Pthreads-w32 library. The code I discovered in that library's implementation of pthread_once (pthread_once.c, CVS version 1.16) used an approach radically different from either of the two techniques just discussed: If multiple threads enter pthread_once simultaneously, the first thread that has to wait (the second thread overall) creates a synchronization object (a Windows event object, in this case). The last thread to leave the routine destroys the object. An atomic reference count keeps track of the number of threads involved. All of the problems of the previous two approaches seemed to be avoided. Unfortunately, a closer examination of the code uncovered a race condition. After a few attempts at fixing the problem (see the pthreads-win32 mailing-list archive if you are interested in the details; sources.redhat.com/ml/pthreads-win32/), it became clear that this general approach is fundamentally flawed and cannot be fixed.
Although reference counting used by the pthread_once implementation did not work, trying to fix it gave me some ideas. What if each thread that had to be suspended (waiting for another thread to complete the initialization of a Singleton), created and then destroyed its own synchronization object? For instance, I knew how the MCS implementation of spin-locks, to reduce contention, makes each thread wait spinning on its own flag. Could I use a similar technique, but instead of spinning, make each thread wait on its own semaphore or event object? It turned out that it was not only possible to make such a technique work, but the resulting code was fairly simple, too. Listing Nine (available electronically) uses this specialized waiting technique to implement my Singleton.
The central idea of this new approach is encapsulated in class QLock. Each instance of this class represents a lock held on the mutex that was passed as a constructor argument. A mutex itself, which is represented by the QLock::Mutex type, is just a pointer, which is set to zero when the mutex is not locked. When the mutex is locked, it points to the tail of the queue of threads waiting on this mutex. Each thread in the queue is represented by the instance of class QLock that the thread created to obtain the lock. In other words, the mutex is a pointer to the QLock instance created by the last thread waiting in the queue. This QLock instance is linked to another QLock instance, which was created by the previous thread in the queue, and so on. The QLock instance at the head of the queue corresponds to the thread that currently holds the lock.
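Listing Nine is also available electronically; assuming a QLock class with the interface just described (QLock::Mutex is a plain pointer that only needs to be zero-initialized, the constructor acquires the lock, and the destructor releases it), the Singleton itself might look roughly like this hedged sketch:

Singleton& Singleton::instance()
{
    static QLock::Mutex mutex = 0;   // static (non-runtime) initialization
    QLock lock(&mutex);              // constructor acquires, destructor releases
    if (!s_singleton) init();
    return *s_singleton;
}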
#9
Chipset [Expert points: 16190] Posted on 2009-02-20 14:22:00
A Closer Look at QLock
Table 1 lists the QLock class data members, while Table 2 lists the values that d_readyFlag and d_nextFlag may have. The constructor QLock::QLock atomically exchanges the mutex pointer for this. If the mutex pointer was 0 prior to the exchange, the lock is obtained and the constructor completes. If the mutex pointer was not zero, this instance of QLock has to join the queue and wait for the ready flag.
Data Member    Description
d_mutex        Pointer to the mutex locked by this QLock instance.
d_next         Pointer to the next instance of QLock in queue waiting to obtain the lock.
d_readyFlag    Flag used to indicate that the lock has been released by the predecessor.
d_nextFlag     Flag used to indicate that the next pointer (d_next) has been set by a successor.
Table 1: QLock class data members.
Value              Description
0                  Initial value indicating that the flag is not set.
-1                 Indicates that the flag was set before an event object had been installed into the flag.
Neither 0 nor -1   Indicates that an event object has been installed into the flag, which now holds the handle of the event object.
Table 2: Values that d_readyFlag and d_nextFlag may have.
The destructor QLock::~QLock, using one atomic compare-and-exchange operation, checks if the mutex pointer contains the value of this pointer. If it does, it resets the mutex pointer to zero, thus releasing the mutex. If the mutex no longer contains the value of this pointer, the queue has been formed, and the destructor must pass the lock to the next instance in the queue. The destructor first waits on the d_nextFlag to make sure that the next pointer has been set, then sets the d_readyFlag.
The algorithms used in QLock's constructor/destructor are basically the same as those used to lock/unlock MCS spin-locks. The setFlag and waitOnFlag methods are where we make our important deviation from MCS locks. Instead of spinning (as is appropriate for a spin-lock), the waitOnFlag routine (sketched together with setFlag after this list):
1. Creates an event object.
2. Atomically checks that the flag has not yet been set and installs the event object's handle into the flag.
3. Proceeds to wait on the event object.
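The following hedged sketch pulls the pieces described above into one place; the member names follow Tables 1 and 2, but the code is an illustration of the technique rather than the article's Listing Nine:

#include <windows.h>
#include <cassert>

class QLock
{
public:
    typedef QLock* Mutex;   // a mutex is just a pointer to the queue tail;
                            // 0 means unlocked, so static initialization is trivial

    explicit QLock(Mutex *mutex)
    : d_mutex(mutex), d_next(0), d_readyFlag(0), d_nextFlag(0)
    {
        // Atomically make this instance the tail of the queue.
        QLock *prev = (QLock*)InterlockedExchangePointer(
                          (PVOID volatile*)mutex, this);
        if (prev)                        // the mutex was already held
        {
            prev->d_next = this;         // link behind the old tail
            setFlag(&prev->d_nextFlag);  // tell it that d_next is now valid
            waitOnFlag(&d_readyFlag);    // sleep until it hands the lock over
        }
        // prev == 0: the mutex was free and is now held by this instance.
    }

    ~QLock()
    {
        // If this instance is still the tail, releasing is one compare-and-exchange.
        if (InterlockedCompareExchangePointer(
                (PVOID volatile*)d_mutex, 0, this) != (PVOID)this)
        {
            // A successor is queued: wait until it has published d_next,
            // then pass the lock on by setting its ready flag.
            waitOnFlag(&d_nextFlag);
            setFlag(&d_next->d_readyFlag);
        }
    }

private:
    // Flag protocol (Table 2): 0 = not set, -1 = set before an event object
    // was installed, anything else = handle of the waiter's event object.
    static void waitOnFlag(PVOID volatile *flag)
    {
        HANDLE ev = CreateEvent(0, FALSE, FALSE, 0);        // 1. create an event object
        PVOID prev = InterlockedCompareExchangePointer(     // 2. install its handle,
                         flag, (PVOID)ev, 0);               //    but only if the flag is still 0
        if (prev == 0)
            WaitForSingleObject(ev, INFINITE);              // 3. wait on the event object
        else
            assert(prev == (PVOID)(LONG_PTR)-1);            // flag already set: no wait needed
        CloseHandle(ev);
    }

    static void setFlag(PVOID volatile *flag)
    {
        PVOID prev = InterlockedExchangePointer(flag, (PVOID)(LONG_PTR)-1);
        if (prev != 0)
            SetEvent((HANDLE)prev);      // a waiter installed an event object: wake it
    }

    Mutex         *d_mutex;      // pointer to the mutex locked by this QLock instance
    QLock         *d_next;       // next QLock instance waiting in the queue
    PVOID volatile d_readyFlag;  // set when the predecessor releases the lock
    PVOID volatile d_nextFlag;   // set when a successor has filled in d_next
};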
Now compare our QLock-based solution for the Singleton problem with the two previous solutions (spin-lock-based and named-mutex-based). Similar to the named-mutex-based solution, you avoid the unpredictable behavior of the spin-lock: Instead of spinning, waiting threads are suspended by the kernel, and resumed as soon as the mutex is available. Additionally, you avoid the problems with the named mutex: Event objects used for synchronization are process local, do not require artificial names, and are created only when threads need to be suspended due to contention. When there is no contention (when only one thread enters the Singleton::instance method at the time of initialization), no synchronization objects are created: The overhead is simply that of a few atomic operations.
Observations
After using the synchronization mechanisms implemented by the QLock class to solve the Singleton problem on Windows (and more generally, the problem with the static initialization of mutexes on Windows), I discovered that QLock has some other important properties that made it attractive for other applications (including on UNIX). QLock::Mutex has a small memory footprint (one pointer), and a small initialization cost (that of initializing a pointer with 0). Additionally, when there is no contention, QLock is fast. If no wait is necessary, locking a QLock requires just one atomic instruction and no system calls. If there are no waiting threads, unlocking is also just one atomic instruction.
Furthermore, when there is contention, the cost of constructing event objects (or semaphores on UNIX) can be virtually eliminated by using a lock-free pool of allocated event objects. This pool works particularly well because the number of event objects required never exceeds the number of waiting threads—an inherently small number.
One common scenario in which QLock appears to be of great use is when you have a large collection of some kind of items that are accessed from multiple threads. To increase concurrency, it may be logically possible and desirable to maintain a mutex per item of the collection (instead of maintaining a single mutex for the entire collection). It is often the case, though, that the memory footprint and the initialization cost of a standard mutex or critical section would be prohibitive due to the (sometimes very large) size of the collection. QLock, because of its small size and negligible initialization cost, may be able to provide a viable solution to fine-grain locking.
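For instance, a hedged sketch of such per-item locking, assuming the QLock interface described above (Item and modify are hypothetical stand-ins):

struct Item { void modify() { /* ... */ } };   // hypothetical payload type

struct Entry
{
    Entry() : mutex(0) {}    // the mutex must start out as a null pointer
    QLock::Mutex mutex;      // one pointer-sized, zero-initializable mutex per entry
    Item         item;
};

void update(Entry& entry)
{
    QLock lock(&entry.mutex);   // lock only this entry, not the whole collection
    entry.item.modify();
}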
#10
小小C [Expert points: 4570] Posted on 2009-02-20 14:27:00
Could you give a brief summary in Chinese? This much English makes my head spin!