Fixing garbled Chinese file names when parsing files embedded in Word

Apache POI is a free, open-source, cross-platform API written in Java. It lets Java programs read and write files in Microsoft Office formats (Excel, Word, PowerPoint, Visio, and so on). POI is an acronym for "Poor Obfuscation Implementation".

Official site: http://poi.apache.org/index.html

API docs: http://poi.apache.org/apidocs/index.html

Discovering the problem

As we know, Word supports inserting other files into a document.

After insertion, the embedded file appears as shown in the figure below.

While using Apache POI, we found that the name of the embedded file comes out garbled when it is read.

Locating the problem

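Tracing through the code showed that the file name is taken from the value of a field called label.

label is assigned in the class org.apache.poi.poifs.filesystem.Ole10Native, at line 165. The code is as follows: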

case parsed: {
            flags1 = LittleEndian.getShort(data, ofs);
            
            // structured format
            ofs += LittleEndianConsts.SHORT_SIZE;
        
            int len = getStringLength(data, ofs);
            label = StringUtil.getFromCompressedUnicode(data, ofs, len - 1);
            ofs += len;
            
            len = getStringLength(data, ofs);
            fileName = StringUtil.getFromCompressedUnicode(data, ofs, len - 1);
            ofs += len;
    
            flags2 = LittleEndian.getShort(data, ofs);
            ofs += LittleEndianConsts.SHORT_SIZE;
            
            unknown1 = LittleEndian.getShort(data, ofs);
            ofs += LittleEndianConsts.SHORT_SIZE;

Following the source further into the StringUtil.getFromCompressedUnicode method, the code explicitly decodes the bytes as ISO_8859_1:

/**
 * Read 8 bit data (in ISO-8859-1 codepage) into a (unicode) Java
 * String and return.
 * (In Excel terms, read compressed 8 bit unicode as a string)
 *
 * @param string byte array to read
 * @param offset offset to read byte array
 * @param len    length to read byte array
 * @return String generated String instance by reading byte array
 */
public static String getFromCompressedUnicode(
        final byte[] string,
        final int offset,
        final int len) {
    int len_to_use = Math.min(len, string.length - offset);
    return new String(string, offset, len_to_use, ISO_8859_1);
}
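To make the effect concrete, here is a minimal, self-contained sketch that does not use POI at all. It assumes the embedded name was stored as GBK bytes (which is what the debugging below suggests) and decodes the same bytes once as ISO-8859-1 and once as GBK. The class name MojibakeDemo and the sample name "测试.txt" are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    /** Decodes the bytes the same way the original POI helper does: as ISO-8859-1. */
    static String decodeLatin1(byte[] raw) {
        return new String(raw, 0, raw.length, StandardCharsets.ISO_8859_1);
    }

    /** Decodes the bytes with an explicitly chosen charset. */
    static String decode(byte[] raw, Charset charset) {
        return new String(raw, 0, raw.length, charset);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        String original = "\u6D4B\u8BD5.txt";      // "测试.txt", a made-up sample name
        byte[] raw = original.getBytes(gbk);       // assumption: the name is stored as GBK bytes

        String garbled = decodeLatin1(raw);
        // Every byte becomes one char, so the two Chinese characters turn into four Latin-1 chars.
        System.out.println(garbled.equals(original));          // false
        System.out.println(decode(raw, gbk).equals(original)); // true
    }
}
```

ISO-8859-1 maps every byte to exactly one character, so each two-byte GBK character is split into two unrelated Latin-1 characters, which is the garbling seen in the file name.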

Solving the problem

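Evaluating expressions in the IDEA debugger showed that switching the charset to GBK decodes the Chinese correctly, so this appears to be the root cause of the garbling. We then tested with newly created files: documents created with WPS, Word 2019, WPS for Linux, and Yozo Office all show the Chinese file name correctly once the charset at this point is changed to GBK. With the problem located, the next step is to modify the code.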
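The reason the debugger experiment works can be illustrated without POI. Because ISO-8859-1 maps each byte to one character, the wrongly decoded string still carries the original bytes, and re-decoding them as GBK restores the name. The sketch below shows only that round trip; it is not the approach this article takes (which patches POI), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class GbkRecoverySketch {
    private static final Charset GBK = Charset.forName("GBK");

    /** Turns a string that was wrongly decoded as ISO-8859-1 back into bytes and decodes them as GBK. */
    static String recover(String garbled) {
        // Lossless as long as every char is <= U+00FF, which holds for ISO-8859-1 decoded text.
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, GBK);
    }

    public static void main(String[] args) {
        String original = "\u6D4B\u8BD5.txt"; // "测试.txt", a made-up sample name
        // Simulate the garbling: GBK bytes decoded as ISO-8859-1.
        String garbled = new String(original.getBytes(GBK), StandardCharsets.ISO_8859_1);
        System.out.println(recover(garbled).equals(original)); // true
    }
}
```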

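Pull the source to your machine and add the following method to StringUtil. It takes a caller-supplied charset; we only need to call this overload at the call site, which also ensures that no other code is affected.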

public static String getFromCompressedUnicode(
            final byte[] string,
            final int offset,
            final int len, final Charset charset) {
        int len_to_use = Math.min(len, string.length - offset);
        return new String(string, offset, len_to_use, charset);
    }

Changes at the call site (Ole10Native)


label = StringUtil.getFromCompressedUnicode(data, ofs, len - 1, Charset.forName("GBK"));
// The fileName assignment a few lines further down can be changed in the same way
fileName = StringUtil.getFromCompressedUnicode(data, ofs, len - 1, Charset.forName("GBK"));
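As a rough end-to-end illustration of the patched call site, the sketch below reads two consecutive zero-terminated strings with a caller-supplied charset. It is a stand-alone approximation, not POI code: the class NullTerminatedStringsSketch, the method readLabelAndFileName, and the sample bytes are hypothetical, and getStringLength is assumed to return the string length including its terminating zero byte, which is what the `len - 1` in the excerpt suggests.

```java
import java.nio.charset.Charset;

public class NullTerminatedStringsSketch {
    /** Stand-in for the charset-aware overload added to StringUtil. */
    static String getFromCompressedUnicode(byte[] data, int offset, int len, Charset charset) {
        int lenToUse = Math.min(len, data.length - offset);
        return new String(data, offset, lenToUse, charset);
    }

    /** Assumed behaviour: length of the string starting at ofs, including its 0 terminator. */
    static int getStringLength(byte[] data, int ofs) {
        int len = 0;
        while (ofs + len < data.length && data[ofs + len] != 0) {
            len++;
        }
        return len + 1;
    }

    /** Reads label and fileName back to back, mirroring the order used in the excerpt. */
    static String[] readLabelAndFileName(byte[] data, int ofs, Charset charset) {
        int len = getStringLength(data, ofs);
        String label = getFromCompressedUnicode(data, ofs, len - 1, charset);
        ofs += len; // skip the string and its terminator

        len = getStringLength(data, ofs);
        String fileName = getFromCompressedUnicode(data, ofs, len - 1, charset);
        return new String[] { label, fileName };
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        // Made-up sample: two zero-terminated GBK strings, label then file name.
        byte[] data = "\u6D4B\u8BD5.txt\0C:\\tmp\\\u6D4B\u8BD5.txt\0".getBytes(gbk);
        String[] parsed = readLabelAndFileName(data, 0, gbk);
        System.out.println(parsed[0]);
        System.out.println(parsed[1]);
    }
}
```

Passing the charset as a parameter rather than hard-coding it inside the helper keeps the existing ISO-8859-1 overload untouched for all other callers.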

Publishing to a private repository

Our company has its own private repository, so now that the change is done, let's publish the code there.

Open the build.gradle one directory up and add the following inside subprojects:

apply plugin: 'maven-publish'
apply plugin: 'net.linguica.maven-settings' // optional; our repository requires it

group = "org.apache.poi"
version = '4.1.2-xxx.1'  // arbitrary: .1 marks the first modification, 4.1.2 is the original version number

publishing {
        publications {
            basic(MavenPublication) {
                from components.java
            }
        }

        repositories {
            maven {
                url = uri("xxx/maven/v1") // repository upload URL
                // authentication settings below; copy them from another project
                name = "xxx"
                authentication {
                    basic(BasicAuthentication)
                }
            }
        }
    }

}

Using the modified package

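In the project that actually consumes the library, use the Gradle panel on the right side of IDEA to search for which packages additionally pull in org.apache.poi; those need to be excluded manually.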

Then add the package we just modified as a dependency, and that's it.