
std::cout in Multithreaded Code: A Pitfall Postmortem

Problem and Suspicions

  • A strange problem surfaced: a pathfinding component that is single-threaded and lock-free degraded in performance by more than 500x when executed across multiple threads
    • A single pathfinding request normally takes around 10ms, but with 32 threads on a 16-core machine it degraded to around 2s
      • Pathfinding requests go through a front queue and are handled in order, so in theory, no matter how heavy the QPS pressure, the pathfinding core should process them at a fixed rate and the per-request time should stay unchanged
    • The business logic is completely lock-free
    • During the degraded runs the CPU was fully saturated
    • The flame graph of the degraded runs showed a strange amount of time spent in a spinlock
  • The pathfinding uses several algorithms (Dijkstra / JPS / A*); only Dijkstra degrades, the others do not, which basically pins the problem on Dijkstra
    • Investigation turned up a roughly 500x500 2D matrix that was dumped straight to std::cout for debugging; the server build was never switched over to the log component
      • Each row of the matrix is written to std::cout directly
    • The server redirects std::cout to /dev/null, so the output is normally invisible
    • Suspicion: std::cout spins on a lock while writing

Verification

  • Construct a 1000x1000 2D matrix and print it, first from a single thread, then from multiple threads
// Print the matrix (member functions of the MatrixType used in the test below)
void Print(std::ostream &oss, const std::string &rawGap,
           const std::string &colGap) {
  for (int curr_raw = 0; curr_raw < m_raw; ++curr_raw) {
    for (int curr_col = 0; curr_col < m_col; ++curr_col) {
      oss << m_data[curr_raw][curr_col] << colGap;  // one << per cell
    }
    oss << rawGap;  // row separator
  }
}
void Print() {
  std::cout << "print double dim array" << std::endl;
  Print(std::cout, "\n", "\t");
  std::cout << "\n"
            << "end print double dim array" << std::endl;
}
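
Each cell above reaches the stream as its own <<, so one 1000x1000 print is on the order of a million separate writes, each of which can take the underlying lock. A minimal buffered sketch (PrintBuffered is a hypothetical name; it assumes the same m_raw / m_col / m_data members as above and needs <sstream>):

// Sketch: format the whole matrix into a local buffer first, so the target
// stream (and the stdio lock behind it) is touched once instead of per cell.
void PrintBuffered(std::ostream &oss, const std::string &rawGap,
                   const std::string &colGap) {
  std::ostringstream buf;
  for (int curr_raw = 0; curr_raw < m_raw; ++curr_raw) {
    for (int curr_col = 0; curr_col < m_col; ++curr_col) {
      buf << m_data[curr_raw][curr_col] << colGap;
    }
    buf << rawGap;
  }
  oss << buf.str();  // single write to the target stream
}

This trades memory for one pass through the locked write path per print instead of roughly a million.
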
void PrintMatrix(std::ostream& timeOs, MatrixType& m, int count)
{
    for (int i = 0; i < count; ++i)
    {
        TimeGap gap{};  // measures the wall-clock time of one full print
        m.Print();
        timeOs << "time use " << gap.gap() << "us " << gap.gapMs() << "ms" << std::endl;
    }
}
TEST_CASE("cout_test")
{
    auto m = std::make_shared<MatrixType>(1000, 1000);
    for (int x = 0; x < 1000; ++x)
    {
        for (int y = 0; y < 1000; ++y)
        {
            m->Set(x, y, x * 100000 + y);
        }
    }
    std::cerr << "single " << std::endl;
    // single-threaded: 20 sequential prints
    PrintMatrix(std::cerr, *m, 20);
    std::cerr << "multi " << std::endl;

    // multithreaded: 1000 print jobs on a pool of 2x the CPU count
    tars::TC_ThreadPool m_threadPool{};
    m_threadPool.init(get_nprocs() * 2);
    m_threadPool.start();
    for (int i = 0; i < 1000; ++i)
    {
        m_threadPool.exec([m]() { PrintMatrix(std::cerr, *m, 1); });
    }
    m_threadPool.waitForAllDone();
}
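
The contention does not depend on the Tars thread pool; a minimal standalone sketch using only std::thread (MultiPrint is a hypothetical helper reusing the PrintMatrix above) reproduces the same pattern:

#include <thread>
#include <vector>

void MultiPrint(MatrixType& m, int nThreads)
{
    std::vector<std::thread> workers;
    workers.reserve(nThreads);
    for (int t = 0; t < nThreads; ++t)
    {
        // every thread prints the same matrix once, all to the same stream
        workers.emplace_back([&m]() { PrintMatrix(std::cerr, m, 1); });
    }
    for (auto& w : workers)
        w.join();  // wait for all printers to finish
}
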

Test Results

./PathSearchServer --unittest --tc=cout_test 1>/dev/null
single
time use 76835us 77ms
time use 76088us 76ms
time use 74894us 75ms
time use 73033us 73ms
time use 74418us 74ms
time use 74521us 75ms
time use 73480us 73ms
time use 73446us 73ms
time use 73105us 73ms
time use 74032us 74ms
time use 72463us 72ms
time use 72522us 73ms
time use 72438us 72ms
time use 72267us 72ms
time use 72322us 72ms
time use 72225us 72ms
time use 72128us 72ms
time use 72110us 72ms
time use 72360us 72ms
time use 72302us 72ms
multi
time use 24006429us 24006ms
time use 24015438us 24015ms
time use 24016567us 24017ms
time use 24019720us 24020ms
time use 24056568us 24057ms
time use 24113107us 24113ms
time use 24133636us 24134ms
time use 24140334us 24140ms
time use 24141262us 24141ms
time use 24145302us 24145ms
time use 24193464us 24193ms
time use 24222469us 24222ms
time use 24223077us 24223ms
time use 24235170us 24235ms
time use 24245044us 24245ms
time use 24252006us 24252ms
time use 24258551us 24259ms
time use 24258875us 24259ms
time use 24277297us 24277ms
...
  • Observations:
    • Under a load of 2x the CPU count, a single matrix print slowed down by roughly 320x (about 24,000ms vs. about 73ms), which is almost exactly the thread count (32) times 10
    • The CPU ran at full load throughout

Conclusion

  • On Linux, C stdio output takes a lock: every write to a FILE is bracketed by _IO_acquire_lock / _IO_release_lock
  • If output is issued frequently and each individual write takes long enough, all threads end up spin-waiting on that lock
  • With the CPU at full load, throughput drops sharply, and the more threads there are, the bigger the drop
  • This can be confirmed in the glibc source:
size_t
_IO_fwrite (const void *buf, size_t size, size_t count, FILE *fp)
{
  size_t request = size * count;
  size_t written = 0;
  CHECK_FILE (fp, 0);
  if (request == 0)
    return 0;
  _IO_acquire_lock (fp);
  if (_IO_vtable_offset (fp) != 0 || _IO_fwide (fp, -1) == -1)
    written = _IO_sputn (fp, (const char *) buf, request);
  _IO_release_lock (fp);
  /* We have written all of the input in case the return value indicates
     this or EOF is returned.  The latter is a special case where we
     simply did not manage to flush the buffer.  But the data is in the
     buffer and therefore written as far as fwrite is concerned.  */
  if (written == request || written == EOF)
    return count;
  else
    return written / size;
}
libc_hidden_def (_IO_fwrite)
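
Since _IO_acquire_lock serializes every fwrite on the same FILE, the real fix here is to route debug dumps through the log component as originally intended; as a general pattern, batching so that each thread hits the stream once per message also shrinks contention. A sketch using C++20 std::osyncstream (assumes a C++20 toolchain; PrintRow is a hypothetical helper):

#include <iostream>
#include <syncstream>  // C++20
#include <vector>

// Print one row with a single synchronized transfer instead of one lock
// acquisition per cell.
void PrintRow(const std::vector<int>& row)
{
    std::osyncstream out(std::cout);  // accumulates in a private buffer
    for (int v : row)
        out << v << '\t';
    out << '\n';
}   // destructor emits the whole buffered row to std::cout atomically

Before C++20, formatting into a std::ostringstream and writing the resulting string once (as in the PrintBuffered sketch above) achieves a similar reduction in lock traffic.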