【输入 / 输出优化】浅谈 C++ IO优化

IO 优化是很实用&简单的常数优化，应该也是大部分 OIer 学习或接触卡常的第一种卡常数方式。当读入数据超过 4MB 时，读入优化可以节省好几倍时间。

本文主要以读入优化为主。

# 一、读入部分：

该部分的评测结果均以 [P10815 【模板】快速读入](https://www.luogu.com.cn/problem/P10815) 为例。

## 1. `cin` / `cout` 的优化（关闭同步 和 解除绑定）

### `ios::sync_with_stdio(0)`

这个函数是一个「是否兼容 stdio」的开关，C++ 为了兼容 C，保证程序在使用了 `printf` 和 `cout` 的时候不发生混乱，将输出流绑到了一起。同步的输出流是线程安全的。

这其实是 C++ 为了兼容而采取的保守措施，也是使 `cin`/`cout` 速度较慢的主要原因。但我们可以在使用 `cin`/`cout` 之前将 stdio 解除绑定，能够明显地提升 `cin`/`cout` 的 IO 速度，但是在这样做之后要注意不能同时使用 `cin` 和 `scanf`，也不能同时使用 `cout` 和 `printf`，但是可以同时使用 `cin` 和 `printf`，也可以同时使用 `scanf` 和 `cout`（即不同的输入或输出方式不能混合使用）。

### `cin.tie(0),cout.tie(0)`

`tie` 是将两个 `stream`（`istream`，`ostream`）绑定的参数，如果括号里填空参数（`nullptr`），则返回当前的输出流指针。

在默认的情况下 `cin.tie()` 绑定的是 `&std::cout`，每次进行格式化输入的时候都要调用 `cout.flush()` 清空输出缓冲区，这样会减慢 IO 速度。可以通过 `cin.tie(nullptr)` 来解除绑定，进一步加快执行效率。

需要注意的是，如果你需要在每次输入之前输出结果（如交互题），在进行 `cin` 和 `cout` 解绑之后，每次都需要调用 `cout.flush()` 或 `cout << flush`。如果输出时刚好需要换行，也可以调用 `cout << endl`。

下面的程序是一个例子：

```cpp
cout << "Please input an integer:";
cout << "\n", cout.flush(); // 刷新缓冲区，也可以换成 cout << "\n" << flush; 或 cout << endl;
cin >> x;
```

### 使用示例：

```cpp
#include <iostream>
using namespace std;
int main() {
  ios::sync_with_stdio(0), cin.tie(0), cout.tie(0);
// Your code begins here.
  int n, x, ans = 0;
  cin >> n;
  while (n --) cin >> x, ans += x;
  cout << ans;
// End.
  return 0;
}
```

[测试结果](https://www.luogu.com.cn/record/225967355)：读入 $10^7$ 个整数需要 $643$ ms 左右，无法在 $2500$ ms 内读入 $10^8$ 个整数。

## 2. `C` 风格输入的优化

### `getchar()`

`getchar` 是用来读入 1 byte 的数据并将其转换为 `char` 类型的函数（但实际上的返回值是 `int`），且速度很快，故可以用「读入字符——转换为整型」来代替缓慢的读入。

我们需要考虑输入的组成部分：整数由两部分组成——符号（一般是负号 `-`）和数字，整数之间有换行和空格。

10 进制整数中是不含空格或除 `0~9` 和（正）负号外的其他字符的，因此在读入不应存在于整数中的字符（通常为空格 / 换行）时，就可以判定已经读入结束。

C 和 C++ 语言分别在 `ctype.h` （`C/C++`）和 `cctype`（`C++`） 头文件中，提供了函数 `isdigit`, 这个函数会检查传入的参数是否为十进制数字字符，是则返回一个非零整数，否则返回零。所以我们判定一个字符是否是十进制数字字符时可以使用 `isdigit(c)` 代替 `c >= '0' && c <= '9'`，也可以使用 `!isdigit(c)` 代替 `c < '0' || c > '9'`（而且 `isdigit` 的速度比 `c >= '0' && c <= '9'` 更快）。

下面是使用示例：

```cpp
#include <cstdio>
#include <cctype>
int readint() {
  int k = 0, f = 1, c = getchar(); // 实际上变量 c 没有必要转成 char
  for (; !isdigit(c); c = getchar()) if (c == '-') f = -1; // 由于 isdigit 的接收值也是 int，所以确实没有必要转成 char
  for (; isdigit(c); c = getchar()) k = k * 10 + (c ^ 48); // 也可以写 (c & 15)
  return k * f;
}
int main() {
  int n = readint(), ans = 0;
  while (n --) ans += readint();
  printf("%d", ans);
}
```

[评测结果](https://www.luogu.com.cn/record/226751534)：读入 $10^7$ 个整数需要 $290$ ms 左右，但仍然无法在 $2500$ ms 内读入 $10^8$ 个整数。

### `getchar_unlocked` 函数：

普通的 `getchar` 带线程锁，然而 OI 用不上多线程（~~你用多线程就算作弊~~），所以我们在日常使用中可以把它关闭，以达到提高输入效率的目的，故可用更快的 `getchar_unlocked` 代替。

### 线程锁：

在多线程环境中，多个线程可能同时访问共享资源（举例如文件、内存、全局变量），线程锁在这个时候用于强制互斥访问：某个线程访问资源前先上锁，其他线程就阻塞在一边排队等待；操作完成后释放锁，允许其他线程访问。这可以防止 Data Race 导致程序崩溃或数据错误。

`getchar` 本身是线程安全的，它每调用一次就要先上锁，再读一个字符，再开锁，来保证安全。但是如果只有一个线程访问标准输入（事实上 OI 中整个程序只有一个线程），这个频繁开关锁的过程就没有必要了。

所以使用不带线程锁直接读入的 `getchar_unlocked` 可以加快读入速度。

### 使用实例：

`getchar_unlocked()` 和 `getchar()` 在功能上基本一致，所以只需要将上述代码中的 `getchar` 替换成 `getchar_unlocked` 即可。

```cpp
#include <cstdio>
#include <cctype>
int readint() {
  int k = 0, f = 1, c = getchar_unlocked();
  for (; !isdigit(c); c = getchar_unlocked()) if (c == '-') f = -1;
  for (; isdigit(c); c = getchar_unlocked()) k = k * 10 + (c ^ 48);
  return k * f;
}
int main() {
  int n = readint(), ans = 0;
  while (n --) ans += readint();
  printf("%d", ans);
}
```

[评测结果](https://www.luogu.com.cn/record/226754574)：读入 $10^7$ 个整数需要 $182$ ms 左右，并能在 $1950$ ms 内读入 $10^8$ 个整数。

### 更方便本地运行的宏定义：

由于 `getchar_unlocked()` 在 Windows 系统上没有定义，所以在本地无法运行。因此我们可以在代码中添加以下宏定义方便自己进行调试。

```cpp
#ifdef __unix__
#define gc getchar_unlocked
#else
#define gc _getchar_nolock
#endif
```

该代码在常见的 Linux 和 Windows 环境中都不会 CE。

## 3. `fread` 快读

### 3.1 介绍

通过 `fread` 可以实现更快的读入。

`fread` 能将需要的文件部分读入内存缓冲区。

由于 `fread` 是整段整段读取，所以比 `getchar()` 要快的多。

`fread` 类似于参数为 `"%s"` 的 `scanf`，不过它更为快速，而且可以一次性读入若干个字符（包括空格、换行，甚至可以是不可见字符），如果缓存区足够大，甚至可以一次性读入整个文件。

以下是 `fread` 的函数原型（C）：

```c
size_t fread(void *ptr, size_t size, size_t nmemb, FILE *stream);
```

函数 `fread` 会从 `stream` 流中读取 `nmemb` 个长度为 `size` 字节大小的数据，并将它们存储在 `ptr` 给定的位置，返回读取的字节数。

如 `fread(buf, 1, 114514, stdin);` 就是从标准输入流中读取 $114514$ 个 `1` 字节大小的字符，存储在 `buf` 里。

### 3.2 使用

我们可以重定义一个 `getchar()` 函数（以下是 `gc` 函数）。可以一次性读入 $2^{20}$ 个字符到 `buf`，然后从 `buf` 中读入一个 `char`，也就是头指针向后移动一位。

```cpp
char buf[1<<20], *p1, *p2;
char gc() { // getchar
    if (p1 == p2) { // 读入的 2^20 个字符已经被读完
        p1 = buf;
        p2 = buf + fread(buf, 1, 1<<20, stdin); // 再读 2^20 个字符
        if (p1 == p2) return EOF; // p1 == p2，说明没有读入成功，即读到文件结束符
    }
    return *p1++; // 读入的 2^20 个字符没有用完
}
int readint() { // 普通的快读，仅更改 getchar() 为 gc()
  int k = 0, f = 1, c = gc();
  for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
  for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48);
  return k * f;
}
```

当然这是为了方便理解，实际上多次调用函数很影响效率，导致上述代码甚至没有普通快读快，所以一般我们把它写成 `#define` 形式。

```cpp
#include <cctype>
#include <cstdio>
unsigned char buf[1<<20], *p1, *p2;
#define gc() (p1==p2 && (p2=(p1=buf)+fread(buf,1,1<<20,stdin)),*p1++)
int readint() {
  int k = 0, f = 1, c = gc();
  for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
  for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48);
  return k * f;
}
int main() {
  int n = readint(), ans = 0;
  while (n --) ans += readint();
  printf("%d", ans);
}
```

[评测结果](https://www.luogu.com.cn/record/226760996)：读入 $10^7$ 个整数需要 $137$ ms 左右，并能在 $1450$ ms 内读入 $10^8$ 个整数。

注：

1. 调试的时候， 输入完成后请手动输入 EOF（Linux 上为 `Ctrl + D`，Windows 上为 `Ctrl + Z`）。
2. 使用此快读请不要使用其他输入方式（包括 `getchar`、`scanf`、`cin`）， 否则会因为缓存问题产生错误。（除非你非常清楚自己在做什么）

## 4. `mmap` 快读

### 4.1 介绍

`mmap` 是 Linux 系统函数，它的功能是将文件一次性地映射到内存中（注意不是拷贝）。在一些场景下有更优的速度。

注意 `mmap` 不能在 Windows 系统中通过编译，同时也不建议在正式的 OI 赛制的比赛中使用（但在很多卡常大赛中使用的频率很高）。且在使用前需要引入头文件 `fcntl.h`，`unistd.h`，`sys/stat.h` 与 `sys/mman.h`。

读入示例：首先要获取文件描述符 `fd`，然后通过 `fstat` 获取文件信息以得到文件大小，此后通过 `char *pc = (char *) mmap(NULL, state.st_size, PROT_READ, MAP_PRIVATE, fd, 0);` 将指针 `*pc` 指向我们的文件。可以直接用 `*pc ++` 替代 `getchar()`。

当我们要提交不使用文件操作的题目时，可以将 `fd` 设为 `0`，表示从 `stdin` 读入。**但是，对 `stdin` 使用 `mmap` 是极其危险的行为，同时不能在终端输入，所以不建议使用 `mmap` 从标准输入中读入。**

`mmap` 函数的使用实例（该代码实现了读入一个整数并输出）：

```cpp
#include <bits/stdc++.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#define gc() (*pc++)
char *pc;
int readint() {
  int k = 0, f = 1, c = gc(); // getchar 返回值都是 int，当然没有必要转成 char
  for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
  for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48); // 也可以写 (c & 15)
  return k * f;
}

int main() {
  int fd = open("*.in", O_RDONLY);
  struct stat state;
  fstat(fd, &state);
  pc = (char *)mmap(NULL, state.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);
  printf("%d", readint());
}
```

### 4.2 `mmap` 在快读上的使用

#### 4.2.1（基础用法）

直接照搬 `fread`，然后把 `fread` 改为 `mmap`，但此时我们不需要 `buf` 数组，而是使用指针（`char*` 或 `unsigned char*`）代替。

```cpp
#include <cctype>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
unsigned char* str;
#define gc() (*str ++)
int readint() {
  int k = 0, f = 1; unsigned char c = gc();
  for (; !isdigit(c); c = gc()) if (c == '-') f = -1;
  for (; isdigit(c); c = gc()) k = k * 10 + (c ^ 48);
  return k * f;
}
int main() {
  struct stat s;
  fstat(0,&s);
  str=(unsigned char*)mmap(NULL,s.st_size,1,2,0,0);
// ↑以上是 mmap 进行读入操作
  int n = readint(), ans = 0;
  while (n --) ans += readint();
  printf("%d", ans);
}
```

[评测结果](https://www.luogu.com.cn/record/226767918)：读入 $10^7$ 个整数需要 $124$ ms 左右，并能在 $1280$ ms 内读入 $10^8$ 个整数。

#### 4.2.2 打表优化（非常规优化方式）

可以发现我们现在读入字符仍然是一个一个字符读入的（因为是 `*str ++`，每次只移动一个字节）。如果我们每次读入的字符大于一个，可以更加节省时间。

考虑到一个 `char / unsigned char` 的范围为 $-128\sim 127$ 或 $0\sim 255$，一共是 $256$ 种取值，所以两个 `char` 组合一共有 $65536$ 种情况。

可以发现 $65536$ 是一个比较小的数字，开一个长度为 $65536$ 的数组进行打表也是完全可行的。所以我们可以打出两个字符拼成的数字是什么，然后使用下标访问得到答案。

- 打表代码：

```cpp
unsigned int a[65536];
memset(a, -1, sizeof a);
for (int i = 48; i <= 57; ++ i)
  for (int j = 48; j <= 57; ++ j)
    a[i << 8 | j] = (j ^ 48) * 10 + (i ^ 48);
```

- 读入整数：

```cpp
int v, f; // v: 输入数字的绝对值；f：输入数字的符号（0 正；1 负）
for (; n --; ) {
  c+=f=*c==45; // 等价于 { f = 0; if (*c == '-') f = 1, ++ c; } // 读到负号，则记录，并跳过负号
  v=0;
  if (~a[*(unsigned short*)(c)]) v=a[*(unsigned short*)c],c+=2; // 一次读入两个数字
  if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
  if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
  if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
  if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
  if (*c>47) v=v*10+((*c++)^48); // 一次读入一个数字（如果长度是奇数就会执行到这里）
  ans+=(f?-v:v);
  while (isspace(*c)) ++c; // 跳过空格/换行
}
```

- 完整代码：

```cpp
#include<stdio.h>
#include<ctype.h>
#include<stdint.h>
#include<string.h>
#include<sys/stat.h>
#include<sys/mman.h>

int main(){
  int a[0x10000]; // 打表数组
  memset(a,-1,0x40000);
  for (int i=48;i<=57;++i)
    for (int j=48;j<=57;++j)
      a[i<<8|j]=(j^48)*10+(i^48);
  struct stat s;
  fstat(0,&s);
  char* c=(char*)mmap(0, s.st_size, PROT_READ, MAP_SHARED, 0, 0);
  int n = 0, ans=0;
  while (isdigit(*c)) n=n*10+((*c++)^48);
  while (isspace(*c)) ++c; // 跳过空格和换行
  int v, f; // v: 输入数字的绝对值；f：输入数字的符号（0 正；1 负）
  for (; n --; ) {
    c+=f=*c==45; // 等价于 { f = 0; if (*c == '-') f = 1, ++ c; } // 读到负号，则记录，并跳过负号
    v=0;
    if (~a[*(unsigned short*)(c)]) v=a[*(unsigned short*)c],c+=2; // 一次读入两个数字
    if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
    if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
    if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
    if (~a[*(unsigned short*)(c)]) v=v*100+a[*(unsigned short*)c],c+=2;
    if (*c>47) v=v*10+((*c++)^48); // 一次读入一个数字（如果长度是奇数就会执行到这里）
    ans+=(f?-v:v);
    while (isspace(*c)) ++c; // 跳过空格/换行
  } printf("%d", ans);
}
```

[评测结果](https://www.luogu.com.cn/record/226789511)：读入 $10^7$ 个整数需要 $63$ ms 左右，并能在 $626$ ms 内读入 $10^8$ 个整数。

#### 4.2.3 其它优化

我们的优化并没有到达极限。

1. 将 `a` 数组定义成 `const unsigned` 类型，并使用 `#define` 减小代码长度。(常量优化)
2. 在 `5` 个 `if` 语句中我们可以发现越前面的语句越容易满足，所以我们可以使用 `__builtin_expected(CONDITION,0/1)` 告诉编译器条件 `CONDITION` 更偏向 `False(0)/True(1)`（根据题目的输入进行优化）
3. 发现输入的数字 $\in[-n,n]$，其中 $n_{max}=10^8$，说明输入数字长度最多为 9，所以可以去除一个 `if`。
4. 发现第二行输入的数中间都是一个空格，所以可以用一个 `++c` 代替 `while`。

根据优化 1，我们可以像这样初始化 $a$ 数组。

$700$ 多个字符就完成初始化了。

根据优化 2，我们可以把第一个 `if` 语句写成 `if (__builtin_expect(~a[*(unsigned short*)(c)], 1)) v=a[*(unsigned short*)c],c+=2;`

所以我们得到了以下代码：

```cpp
#define N1 -1,-1,-1,-1,-1,-1,-1,-1,-1,-1
#define N2 N1,N1,N1,N1,N1,N1,N1,N1,N1,N1
#define N3 N2,N2,N2,N2,N2,N2,N2,N2,N2,N2
#define N4 N3,N3,N3,N3,N3,N3,N3,N3,N3,N3
#define D(a) a,1##a,2##a,3##a,4##a,5##a,6##a,7##a,8##a,9##a
#define S246 N2,N2,N1,N1,N1,N1,-1,-1,-1,-1,-1,-1
const int _mp[0x10000]={
N4,N3,N3,N2,N2,N2,N1,N1,N1,-1,-1,-1,-1,-1,-1, // 12336个-1
D(0),S246,D(1),S246,D(2),S246,D(3),S246,D(4),S246, // 数字组0-4
D(5),S246,D(6),S246,D(7),S246,D(8),S246,D(9), // 数字组5-9
N4,N4,N4,N4,N4,N2,N2,N2,N2,N2,N2,N2,N2,N1,N1,N1,N1,N1,N1,N1,N1,
-1,-1,-1,-1,-1,-1
};
#undef N1
#undef N2
#undef N3
#undef N4
#undef D
#undef S246
main(){
    int v,f,n=0,ans=0;
    struct stat s; fstat(0,&s);
	unsigned char* c=(unsigned char*)mmap(NULL,s.st_size,1,2,0,0);
	while (isdigit(*c)) n=n*10+(15&*c++);
	while (isspace(*c)) ++c;
	for (;n--;++c){
		c+=f=(*c==45);
		v=0;
		__builtin_expect(~_mp[*(uint16_t*)(c)], 1) && (v=_mp[*(uint16_t*)(c)],c+=2),
		(~_mp[*(uint16_t*)(c)]) && (v=v*100+_mp[*(uint16_t*)(c)],c+=2),
		(~_mp[*(uint16_t*)(c)]) && (v=v*100+_mp[*(uint16_t*)(c)],c+=2),
        (~_mp[*(uint16_t*)(c)]) && (v=v*100+_mp[*(uint16_t*)(c)],c+=2),
		isdigit(*c) && (v=v*10+(15&*c++)),
        ans+=(f?-v:v);
	}
    printf("%d", ans);
	return 0;
}
```

[评测结果（C++）](https://www.luogu.com.cn/record/225933271)：读入 $10^7$ 个整数需要 $53$ ms 左右，并能在 $464$ ms 内读入 $10^8$ 个整数。

[评测结果（C）](https://www.luogu.com.cn/record/226615350)：读入 $10^7$ 个整数需要 $52$ ms 左右，并能在 $451$ ms 内读入 $10^8$ 个整数。

~~然后就拿下了这道题的最优解（除旧数据和套数据点外）~~

#### 4.2.4 可能的继续优化方向

以下是可能能够提升速度的方向：

1. 找到 $3$ 字节整数类型（`int24`），然后将打表一次两个字符改为三个字符。
2. 减少类型转换次数。
3. 有更快的函数能替代 `isdigit/isspace`（已知比较运算符不可能更快）。
4. ~~更多玄学做法/人类智慧~~。

# 二、输出部分
**注意：**

1. 由于在 `int` 范围内，`-2147483648 == -(-2147483648)`，所以在输出这个数的时候会出现问题，需要进行特判，或者转换成 `unsigned int` 后再输出（为方便理解，以下采用前者）。
2. 部分非递归快写还需要特判 `0`。

该部分的评测结果均以 [U588066 Fast Write](https://www.luogu.com.cn/problem/U588066) 为例。
## 1. `C++` 风格输出的优化
见「一、1、`cin` / `cout` 的优化（关闭同步 和 解除绑定）」

[评测结果](https://www.luogu.com.cn/record/227085184)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $561$ ms。
## 2. `C` 风格的输出优化

先看 `printf` 函数的 [测试结果](https://www.luogu.com.cn/record/227085359)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $659$ ms。

可见 `printf` 的输出性能不是特别理想，以下是代替 `printf` 的方法。

### `putchar(x)/putchar_unlocked(x)`
与 `getchar` 类似，`putchar` 是用来输出 1 byte 的数据并将其转换为 `char` 类型进行输出（但实际上的接收值的类型是 `int`），且速度很快，故可以用逐位输出来代替 `printf`。

我们可以每次对一个正整数除以 10，然后记录余数，存放在数组里，最后逆序输出（也可以逆序存放，顺序输出），同时不要忘记要**特判 `0` 和 `-2147483648`**。

### `puts(_String)`
`puts` 是一个用于快速输出 C 风格字符串（字符数组）的函数，接收值为 `const char*`，当输出到 `\0` 时停止。

考虑到我们要进行多次 `putchar` 操作，则我们可以把这些操作合并成一个 `puts` 操作。

```c
#include <stdio.h>
#include <string.h>
typedef int i32;

unsigned char fr[17];
inline char* to_str(int x) {
	if (x == 0) return "0";
	if (x == -2147483648) return "-2147483648";
	unsigned char *tot = fr + 17;
	if (x < 0) putchar(45), x = -x;
	while(x) *(-- tot) = x % 10 | 48, x /= 10;
	return (char*)tot;
}
main() {
	int n, x; scanf("%d%d", &n, &x);
	fputs(to_str(n), stdout); putchar(32); puts(to_str(x));
	for (int i = 1; i <= n; ++ i) {
	    x ^= x << 13;
	    x ^= x >> 17;
	    x ^= x << 5;
	    puts(to_str(x));
	}
}
```
[测试结果](https://www.luogu.com.cn/record/227085872)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $365$ ms。

### 循环展开优化
考虑到一个 `int` 类型的变量最大为 10 位数，所以我们可以使用 10 个 `if` 语句代替 `while` 循环。
```c
inline char* to_str(int x) { // Like this.
	if (x == 0) return "0";
	unsigned char *tot = fr + 17;
	(x < 0) && (putchar(45), x = -x);
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10);
	return (char*)tot;
}
```
但实际上可能是因为分支预测，这段代码和上一段代码的运行时间基本一致，但下文的优化需要使用到循环展开。

## 3. `fwrite` 在快写上的应用
与 `fread` 相似，用于实现更快的快写。

函数原型：
```c
size_t fwrite(const void * buffer, size_t size, size_t count, FILE * stream);
```
`fwrite` 函数将 count 个对象写入给定的输出流，每个对象的大小为 `size` 个字节。

与 `puts` 不同的其中一点，`puts` 在输出到 `\0` 的时候就会停止输出，而 `fwrite` 可以指定输出范围，即使遇到 `\0` 也可以继续输出，并且不会像 `puts` 一样输出完成后还会加一个多余的 `\n`。

在上个版本的快写上进行修改，并结合 `fread` 快读的思想，我们同样需要创建一个 `obuf` 数组，用于存储输出内容。当存储到达一定量时，一次性输出并清空缓冲区。
```c
#include <stdio.h>
#include <string.h>
#define BUF (1 << 22)
#define putchar(x) (now ++[0] = (x)) // 重定义 putchar
typedef int i32;
unsigned char obuf[BUF + 17], // 需要多预留一点空间，防止缓冲区溢出引起错误
  *now = obuf;

unsigned char fr[17];
inline void write(int x) {
	if (!x) return(void)(putchar(48));
	if (x == -2147483648) return(void)(memmove(now, "-2147483648", 11), now += 11);
	unsigned char* tot = fr + 17;
	(x < 0) && (now ++[0] = 45, x = -x);
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10),
	(x) && ((-- tot)[0] = x % 10 | 48, x /= 10);
	memmove(now, tot, fr + 17 - tot); // 内存复制
	now += fr + 17 - tot;
}
main() {
	int n, x; scanf("%d%d", &n, &x);
	write(n), putchar(32), write(x), putchar(10);
	
	for (int i = 1; i <= n; ++ i) {
	    x ^= x << 13;
	    x ^= x >> 17;
	    x ^= x << 5;
		write(x), putchar(10);
		if (now - obuf > BUF) { // 存储到达一定量，清空缓冲区
			fwrite(obuf, 1, now - obuf, stdout);
			now = obuf;
		}
	}
	fwrite(obuf, 1, now - obuf, stdout); // 在程序结束时也要输出
}
```
[测试结果](https://www.luogu.com.cn/record/227087909)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $225$ ms。

## 4. 更多优化（包括 打表优化 与 宏定义封装）
1. 与上文的快读中一样可以采用打表优化，不同的是，这次我们可以一次打四个数字。
2. 使用 `__builtin_expect(...,0/1)` 对第一个（伪）`if` 语句进行优化。
3. 考虑到 `inline` 的内联不够彻底，我们可以直接使用宏定义。

根据 优化 1，我们可以写出以下打表代码：
```cpp
int fnd[10000];
for (int n = 0000; n <= 9999; ++ n)
  fnd[n] = (n/1000%10+'0')|((n/100%10+'0')<<8)|((n/10%10+'0')<<16)|((n%10+'0')<<24)
```
然后使用宏定义初始化列表：
```cpp
#define D(n) ((n)/1000%10+'0')|(((n)/100%10+'0')<<8)|(((n)/10%10+'0')<<16)|(((n)%10+'0')<<24)
#define D10(s) D((s)+0),D((s)+1),D((s)+2),D((s)+3),D((s)+4),D((s)+5),D((s)+6),D((s)+7),D((s)+8),D((s)+9)
#define D100(s) D10((s)+0),D10((s)+10),D10((s)+20),D10((s)+30),D10((s)+40),D10((s)+50),D10((s)+60),D10((s)+70),D10((s)+80),D10((s)+90)
#define D1000(s) D100((s)+0),D100((s)+100),D100((s)+200),D100((s)+300),D100((s)+400),D100((s)+500),D100((s)+600),D100((s)+700),D100((s)+800),D100((s)+900)
const unsigned fnd[] = {
	D1000(0),D1000(1000),D1000(2000),D1000(3000),D1000(4000),
	D1000(5000),D1000(6000),D1000(7000),D1000(8000),D1000(9000)
};
```
$650$ 左右个字符就完成初始化了。

最终代码如下：
```cpp
#include <stdio.h>
#include <string.h>
typedef int i32;

#define K 11
// Fnd list
#define D(n) ((n)/1000%10+'0')|(((n)/100%10+'0')<<8)|(((n)/10%10+'0')<<16)|(((n)%10+'0')<<24)
#define D10(s) D((s)+0),D((s)+1),D((s)+2),D((s)+3),D((s)+4),D((s)+5),D((s)+6),D((s)+7),D((s)+8),D((s)+9)
#define D100(s) D10((s)+0),D10((s)+10),D10((s)+20),D10((s)+30),D10((s)+40),D10((s)+50),D10((s)+60),D10((s)+70),D10((s)+80),D10((s)+90)
#define D1000(s) D100((s)+0),D100((s)+100),D100((s)+200),D100((s)+300),D100((s)+400),D100((s)+500),D100((s)+600),D100((s)+700),D100((s)+800),D100((s)+900)

// End
#define _Write(y) {\
	int $ = (y);\
	unsigned char* tot = fr + K;\
	($ < 0) && (now ++[0] = 45, $ = -$);\
	($ > 999) && (*(int*)(tot-=4) = fnd[$%10000], $ /= 10000),\
	($ > 999) && (*(int*)(tot-=4) = fnd[$%10000], $ /= 10000),\
	($ > 9) && (*(unsigned short*)(tot-=2) = fnd[$%100] >> 16, $ /= 100),\
	($) && ((-- tot)[0] = $ % 10 | 48, $ /= 10);\
	memmove(now, tot, fr + K - tot);\
	now += fr + K - tot;\
}
#define write(x) \
	if (__builtin_expect(x == 0, 0)) now++[0] = 48;\
	else if (__builtin_expect(x == -2147483648, 0)) memmove(now, "-2147483648", 11), now += 11;\
	else _Write(x)
#define putchar(x) (now++[0] = (x));
int main() {
	unsigned char fr[K];
	unsigned char obuf[(1 << 22) + 17];
	const int fnd[] = {
		D1000(0),D1000(1000),D1000(2000),D1000(3000),D1000(4000),
		D1000(5000),D1000(6000),D1000(7000),D1000(8000),D1000(9000)
	};
	unsigned char* now = obuf;

int n, x;
	scanf("%d%d", &n, &x);
	printf("%d %d\n", n, x);
	for (int i = 1; i <= n; ++ i) {
		x ^= x << 13;
		x ^= x >> 17;
		x ^= x << 5;
		write(x) putchar(10)
		(now - obuf > (1 << 22)) && (fwrite(obuf, 1, now - obuf, stdout), now = obuf);
	}
	fwrite(obuf, 1, now - obuf, stdout);
}
```
[测试结果](https://www.luogu.com.cn/record/227067631)：输出 $10^7$ 个长度为 $9\sim 11$ 的数字需要 $159$ ms。

# 三、 IO 优化的弊端

IO 优化虽然能够显著提升程序的运行速度，但并非没有缺点。

1. IO 优化可能会导致程序可读性下降，尤其是使用 `fread`、`mmap` 或 宏定义形式的函数会导致调试困难。
2. 容易引入隐蔽 Bug，比如缓冲区溢出，没有特判 `0` 或 `-2147483648`（以 32 位整型变量为例），或者将 EOF（文件结束符）“吃掉”。
3. 优化可能适得其反，比如递归形式的普通快写在很多时候会比去除同步流的 `cout` 更慢。
4. 其它低级错误导致代码挂掉。

# 四、 THE END

完结撒花。

如果有问题/更好的做法欢迎指出。

upd: 2025/7/26（ver. 1.01）

1. 增加了快速输出。
2. 更改了部分宏定义和代码缩进。
3. 更改了文章结构。

正在渲染内容...