Character encodings in Confluence

Where character encoding is used

There are three places where character encoding matters to Confluence:

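Database encoding - usually the most important, because this is where almost all user data is stored.
Filesystem encoding - important for attachment storage (pre-2.2), reading Velocity templates and writing exported files.
HTTP request and response encoding - important for form parsing, correct rendering by the browser, and the browser's interpretation of encoded URLs.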

In general, problems occur when Confluence believes one of the above encodings is different from what it really is. For example, Confluence might believe the database is using ISO-8859-1 when it is actually using UTF-8.
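A minimal sketch of this kind of mismatch, using only the standard Java library. The class and method names are hypothetical and are not part of Confluence; the example only shows what happens when UTF-8 bytes are decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Confluence.
class EncodingMismatchDemo {
    // Encode as UTF-8, then decode with the WRONG charset (ISO-8859-1).
    static String misread(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    // Encode as UTF-8 and decode with the matching charset.
    static String readCorrectly(String original) {
        byte[] utf8Bytes = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "café"
        // The two-byte UTF-8 sequence for "é" is shown as two separate characters.
        System.out.println(misread(original));
        System.out.println(readCorrectly(original));
    }
}
```

The misread string is one character longer than the original, because each byte of the two-byte UTF-8 sequence is treated as its own ISO-8859-1 character.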

Character encoding in Java

Java always uses a 16-bit character encoding for all char and String data: originally UCS-2, and UTF-16 (a superset that adds surrogate pairs) since Java 5. This means that each of the encodings above defines how, at that particular point, characters are converted between Java's native 16-bit format and some other format that the browser, filesystem or database might understand.
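As a small illustration of Java's 16-bit char representation: in Java 5 and later a String is a sequence of UTF-16 code units, so a character outside the Basic Multilingual Plane occupies two char values. The class and method names below are hypothetical.

```java
// Hypothetical demo class, not part of Confluence.
class CodeUnitDemo {
    // Number of 16-bit char values the String occupies.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode characters (code points) in the String.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String bmp = "\u00e9";                 // U+00E9, inside the Basic Multilingual Plane
        String supplementary = "\uD83D\uDE00"; // U+1F600, stored as a surrogate pair
        System.out.println(codeUnits(bmp) + " code unit(s), " + codePoints(bmp) + " code point(s)");
        System.out.println(codeUnits(supplementary) + " code unit(s), " + codePoints(supplementary) + " code point(s)");
    }
}
```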

So when a request comes in to Confluence, we convert it from the request encoding to UCS-2. Then we store that data into the database, converting from UCS-2 to the database's encoding. Retrieving information from the database and sending it back to the browser is the same process in the opposite direction.
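The flow described above can be sketched with plain byte arrays standing in for the HTTP request and the database column. This is only an illustration under stated assumptions: the class and method names are hypothetical, and in a real installation the database conversion is done by the JDBC driver rather than by application code.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the conversions; not Confluence code.
class EncodingPipelineSketch {
    // Request bytes -> Java String, using the request encoding.
    static String fromRequest(byte[] body, Charset requestEncoding) {
        return new String(body, requestEncoding);
    }

    // Java String -> stored bytes, using the database encoding.
    static byte[] toDatabase(String value, Charset databaseEncoding) {
        return value.getBytes(databaseEncoding);
    }

    // Stored bytes -> Java String (the opposite direction).
    static String fromDatabase(byte[] stored, Charset databaseEncoding) {
        return new String(stored, databaseEncoding);
    }

    // Java String -> response bytes, using the response encoding.
    static byte[] toResponse(String value, Charset responseEncoding) {
        return value.getBytes(responseEncoding);
    }

    public static void main(String[] args) {
        Charset http = StandardCharsets.UTF_8;     // assumed request/response encoding
        Charset db = StandardCharsets.ISO_8859_1;  // assumed database encoding
        byte[] requestBody = "Gr\u00fc\u00dfe".getBytes(http);

        String inMemory = fromRequest(requestBody, http);
        byte[] stored = toDatabase(inMemory, db);
        String reloaded = fromDatabase(stored, db);
        byte[] responseBody = toResponse(reloaded, http);

        System.out.println(requestBody.length + " request bytes, " + stored.length + " stored bytes");
        System.out.println("Round trip intact: " + Arrays.equals(requestBody, responseBody));
    }
}
```

The round trip only stays intact because every step uses the encoding the data is really in, and because the assumed database encoding can represent every character in the text.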

Problems with character encodings

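If Confluence is wrong about any of the above encodings, the problem shows up in different ways:

Incorrect database encoding - user data is corrupted on its way into or out of the database. This often shows up late: data is cached as it is written to the database, so the corrupted copy is only retrieved from the database some time afterwards.
Incorrect or non-Unicode filesystem encoding - attachment download, upload and removal (pre-2.2) break for international filenames. Exports break when they contain international content or attachments.
Incorrect HTTP encoding - characters render incorrectly because the browser has chosen the wrong encoding; changing the encoding in the browser makes the page render correctly. URLs break when linking to pages or attachments whose names contain non-ASCII characters.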

Configuration of character encodings

The Confluence character encoding is a configuration setting found in Administration > General Configuration, and at runtime available in Settings.defaultEncoding. It is subsequently used in the following parts of the system:

ConfluenceWebWorkConfiguration sets webwork.i18n.encoding to this encoding, which WebWork uses in the response Content-Type header.
AbstractEncodingFilter sets the HTTP request encoding to this encoding. This should not be necessary, because the Content-Type header from the client should already include the encoding used. It affects form submissions and file uploads.
VelocityUtils uses this encoding when it reads Velocity templates from disk.
AbstractXmlExporter uses this encoding to create its output.
GeneralUtil uses this encoding when it performs URLEncode and URLDecode. It is unclear how much benefit this brings, because different browsers support different character sets in URLs.
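To show why the encoding matters for URL encoding and decoding, here is a small sketch using the standard java.net.URLEncoder and java.net.URLDecoder classes. It does not reproduce GeneralUtil; the wrapper class and its methods are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical wrapper around the standard JDK classes; not Confluence's GeneralUtil.
class UrlEncodingDemo {
    static String encode(String value, String encoding) {
        try {
            return URLEncoder.encode(value, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    static String decode(String value, String encoding) {
        try {
            return URLDecoder.decode(value, encoding);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encoding, e);
        }
    }

    public static void main(String[] args) {
        String title = "\u00e9"; // "é"
        // The same character produces different escapes under different encodings.
        System.out.println(encode(title, "UTF-8"));
        System.out.println(encode(title, "ISO-8859-1"));
        // Decoding with an encoding other than the one used to encode garbles the text.
        System.out.println(decode(encode(title, "UTF-8"), "ISO-8859-1"));
    }
}
```

A link built with one encoding and decoded with another no longer names the same page or attachment, which is the broken-URL symptom described earlier.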

In summary, changing the Confluence character encoding changes your HTTP request and response encoding, and the filesystem encoding used by exports and Velocity templates.

The database encoding is the responsibility of your JDBC drivers. The drivers are responsible for reading and writing from the database in its native encoding and translating this data to and from Java Strings (which are UCS-2). For some drivers, such as MySQL, you must set Unicode encoding explicitly in the JDBC URL. For others, the driver is smart enough to determine the database encoding automatically.
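For MySQL, a JDBC URL that requests Unicode explicitly might be built as in the sketch below. The useUnicode and characterEncoding parameters are MySQL Connector/J connection properties; the host, port, database name and helper class are placeholders chosen for illustration, and you should check the documentation for your driver version.

```java
// Hypothetical helper; host, port and database name are placeholders.
class MySqlJdbcUrlSketch {
    static String unicodeJdbcUrl(String host, int port, String database) {
        // Ask the driver to exchange character data with the server as UTF-8.
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(unicodeJdbcUrl("localhost", 3306, "confluence"));
    }
}
```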

Ideally, your database itself should be in a Unicode encoding (and we recommend doing this for the simplest configuration), but that is not necessary as long as:

your database encoding supports all the characters you want to store in Confluence, and
your JDBC drivers can properly convert from the database encoding to UCS-2 and vice-versa.
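The first condition can be checked from Java with the standard CharsetEncoder.canEncode method. This is a rough sketch only: the class and method names are hypothetical, and it assumes the Java charset you pass in corresponds to the encoding your database really uses.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical check; not part of Confluence.
class DatabaseCharsetCheck {
    // True if every character in the text can be represented in the given encoding.
    static boolean canStore(String text, Charset databaseEncoding) {
        return databaseEncoding.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String western = "caf\u00e9";             // "café"
        String japanese = "\u65e5\u672c\u8a9e";   // "日本語"
        System.out.println(canStore(western, StandardCharsets.ISO_8859_1));
        System.out.println(canStore(japanese, StandardCharsets.ISO_8859_1));
        System.out.println(canStore(japanese, StandardCharsets.UTF_8));
    }
}
```

A non-Unicode database encoding is therefore only safe while all stored content stays within the characters that encoding can represent.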

The filesystem encoding is mostly ignored by Confluence, except where the configuration setting above plays a part (exports and Velocity templates). When attachments are uploaded, they are written as a stream of bytes directly to the filesystem. It is the same when they are downloaded: the bytes from the file InputStream are written directly to the HTTP response.
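A byte-for-byte copy of this kind involves no character encoding at all, as the following sketch shows. It uses in-memory streams in place of the filesystem and the HTTP response; the class and method names are hypothetical and this is not Confluence's attachment code.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;

// Hypothetical sketch; in-memory streams stand in for the file and the HTTP response.
class AttachmentStreamCopy {
    // Copy raw bytes; no Reader or Writer is involved, so no charset is applied.
    static void copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            out.write(buffer, 0, read);
        }
    }

    static byte[] roundTrip(byte[] attachment) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            copy(new ByteArrayInputStream(attachment), out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] attachment = {(byte) 0xE9, (byte) 0xFF, 0x00, 0x41};
        System.out.println(Arrays.equals(attachment, roundTrip(attachment)));
    }
}
```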

In some places in Confluence, we use the default filesystem encoding as determined by the JVM and stored in the file.encoding system property (it can be overridden by setting this property at startup). This encoding is used by the Java InputStreamReader and OutputStreamWriter classes by default. This encoding should probably never be used; for consistent results across all filesystem access we should be using the encoding set in the General Configuration.
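The difference between relying on the JVM default and passing the encoding explicitly can be sketched as follows. The class and method names are hypothetical; the charset argument stands in for the encoding configured in General Configuration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not Confluence code.
class ExplicitCharsetRead {
    // Always pass the charset; new InputStreamReader(in) would use the JVM default instead.
    static String readAll(InputStream in, Charset encoding) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, encoding)) {
            char[] buffer = new char[1024];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        }
        return text.toString();
    }

    static String decode(byte[] bytes, Charset encoding) {
        try {
            return readAll(new ByteArrayInputStream(bytes), encoding);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] fileContents = "\u65e5\u672c\u8a9e".getBytes(StandardCharsets.UTF_8);
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println(decode(fileContents, StandardCharsets.UTF_8));
    }
}
```

With an explicit charset the result does not depend on how the JVM was started or which platform it runs on.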

In some cases, the encoding used to read data from or write data to the filesystem is explicitly hard-coded. Two important examples are:

Importing Mbox mailboxes, which are known to be in ISO-8859-1.
Confluence Bandana config files, which are always stored as UTF-8.

Some application servers, Tomcat for example, have an encoding setting that modifies Confluence URLs before they reach the application. This can prevent access to international pages and attachments (really anything with international characters in the URL). See configuring your Application Server URL encoding.

Advice

In general, always set all character encodings to UTF-8. That includes database, JDBC drivers, application server, filesystem and Confluence.

In certain isolated cases (e.g. Microsoft Windows), it might not be possible to use a fully Unicode filesystem (that is, a default Windows install doesn't support Unicode filenames properly). If so, stick with UTF-8 for the other two encodings (database and HTTP) and be aware that your operating system might have limitations around international attachments (pre-2.2), backup and restore of international data, etc.
