Database Restore/Migration fails - SAXParseException: Invalid byte 2 of 4-byte UTF-8 sequence
After using the database migration wizard or the Bitbucket backup client, restoring the generated XML backup fails with the following error:
2020-07-13 07:39:41,634 INFO Initializing
2020-07-13 07:39:43,253 INFO Unpacking bitbucket-20200709-172649-440.tar to /media/atl/bitbucket
2020-07-13 10:25:18,721 INFO Validating database before restore
2020-07-13 10:25:20,450 INFO Restoring database schema definition
2020-07-13 10:25:33,482 INFO Restoring database data
2020-07-13 10:25:40,133 ERROR bitbucket-20200709-172649-440.tar could not be restored
com.atlassian.stash.internal.backup.liquibase.LiquibaseDataAccessException: SAX parsing error while parsing backup file; nested exception is org.xml.sax.SAXParseException; lineNumber: 10888480; columnNumber: 36; Invalid byte 2 of 4-byte UTF-8 sequence.
    at com.atlassian.stash.internal.backup.liquibase.DefaultLiquibaseMigrationDao.parse(DefaultLiquibaseMigrationDao.java:229)
    at com.atlassian.stash.internal.backup.liquibase.DefaultLiquibaseMigrationDao.scan(DefaultLiquibaseMigrationDao.java:215)
    ... 10 more frames available in the log file
A database migration uses the same classes and logic as the Bitbucket backup client: it takes an XML backup of the current database schema and data, then restores that backup into the target database.
The XML backup is generated successfully, but reading it back fails. The backup is read with the third-party Apache Xerces XML parser, which contains an unresolved bug that can produce this error when reading particularly large XML backups containing 4-byte UTF-8 sequences. Once the read buffer is exhausted, the next 4-byte UTF-8 character parsed hits an off-by-one error, resulting in the error above.
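To illustrate why buffer boundaries are delicate for multi-byte characters, here is a minimal Python sketch. It is not the Xerces code and does not reproduce the exact off-by-one bug; it only shows the general situation, where a 4-byte UTF-8 sequence straddles the end of a fixed-size read buffer. The function names and the tiny buffer size are hypothetical, chosen for illustration.

```python
import codecs


def decode_chunks_naively(data: bytes, buffer_size: int) -> str:
    """Decode each fixed-size chunk on its own (fragile at chunk boundaries)."""
    return "".join(
        data[i:i + buffer_size].decode("utf-8")
        for i in range(0, len(data), buffer_size)
    )


def decode_chunks_incrementally(data: bytes, buffer_size: int) -> str:
    """Carry a partial multi-byte sequence over into the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [
        decoder.decode(data[i:i + buffer_size])
        for i in range(0, len(data), buffer_size)
    ]
    # Flush the decoder; this raises if the input ends mid-sequence.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# Three ASCII bytes followed by one 4-byte sequence (U+1F600): 7 bytes total.
# With a 4-byte buffer, the first chunk ends after byte 1 of the 4-byte sequence.
sample = "abc\U0001F600".encode("utf-8")
```

With `buffer_size=4`, the naive version raises `UnicodeDecodeError` because the first chunk ends partway through the 4-byte sequence, while the incremental version returns the original string. A parser has to do the same kind of bookkeeping when it refills its buffer, which is where an off-by-one mistake can surface.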
This issue is therefore most likely to be seen in large XML backups that also contain many characters encoded as 4-byte UTF-8 sequences. These characters generally include less common CJK characters, various historic scripts, mathematical symbols, and emojis.
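Characters that need four bytes in UTF-8 are exactly those with code points above U+FFFF, so they can be located with a simple scan. The following is a minimal sketch, assuming the backup has been extracted to a UTF-8 text file; the function name `find_four_byte_chars` is hypothetical, and the reported line and column numbers may not match the parser's `lineNumber` and `columnNumber` exactly.

```python
import io
import re
from typing import IO, Iterator, Tuple

# Code points U+10000..U+10FFFF are the ones encoded as 4 bytes in UTF-8.
FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")


def find_four_byte_chars(stream: IO[str]) -> Iterator[Tuple[int, int, str]]:
    """Yield (line number, column, character) for each 4-byte UTF-8 character.

    Line and column numbers are 1-based and counted in characters.
    """
    for line_no, line in enumerate(stream, start=1):
        for match in FOUR_BYTE.finditer(line):
            yield line_no, match.start() + 1, match.group()


# In-memory demonstration; a real run would pass an opened file instead.
demo = io.StringIO("plain text\nsmile \U0001F600 here\n")
demo_hits = list(find_four_byte_chars(demo))
```

For a real backup, the stream could come from something like `open("stash-data.xml", encoding="utf-8")`, which reads the file line by line without loading it all into memory.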
Until the bug in the XML parser is resolved or a different XML parser is used, there are two options for getting past this issue:
- Reducing the overall amount of content within the XML backup to prevent the read buffer from becoming exhausted
  - This is not recommended: the threshold varies with the number of 4-byte UTF-8 characters in the XML backup, so it may not be clear how much data you would need to remove (and from where in the XML backup) to get past this error.
- Removing/substituting these 4-byte UTF-8 characters in the XML backup prior to restoring it into the target database.
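The second option can be sketched in a few lines of Python. This is not the logic of any Atlassian tool; it is a minimal illustration, assuming a UTF-8 encoded XML file, that substitutes every 4-byte UTF-8 character with U+FFFD (the Unicode replacement character). The function name `replace_four_byte_chars` and its parameters are hypothetical.

```python
import io
import re
from typing import IO

# Code points U+10000..U+10FFFF are the ones encoded as 4 bytes in UTF-8.
FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")


def replace_four_byte_chars(
    src: IO[str],
    dst: IO[str],
    replacement: str = "\uFFFD",
    chunk_chars: int = 1 << 20,
) -> int:
    """Copy src to dst, substituting each 4-byte UTF-8 character.

    Reads in chunks of characters so large files are never held in memory
    at once. Returns the number of characters that were substituted.
    """
    replaced = 0
    while True:
        chunk = src.read(chunk_chars)
        if not chunk:
            break
        cleaned, count = FOUR_BYTE.subn(replacement, chunk)
        dst.write(cleaned)
        replaced += count
    return replaced


# In-memory demonstration; a real run would pass opened files instead.
demo_src = io.StringIO("a\U0001F600b\U00020000c")
demo_dst = io.StringIO()
demo_count = replace_four_byte_chars(demo_src, demo_dst)
```

For real files, both streams could be opened with `encoding="utf-8"` and `newline=""` so that line endings are copied through unchanged. Because the streams are opened in text mode, a character is never split across two chunks.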
We recommend the second option, as it minimizes the number of changes needed to the XML backup for the restore to succeed. The following steps remove these characters:
- Download the JAR file: atlassian-xml-cleaner-0.1.jar
- Open a command prompt and locate the XML or ZIP backup file on your computer, extracting it first if it is inside a ZIP file. This example uses stash-data.xml.
- Run the cleaner as shown:
$ java -jar atlassian-xml-cleaner-0.1.jar stash-data.xml > stash-data-clean.xml
- This will create stash-data-clean.xml, a copy of the input file with the invalid characters removed.
- Copy the stash-data-clean.xml file into another directory, rename it back to stash-data.xml, and create a new ZIP with the updated file.
After performing the above steps to produce an updated backup ZIP file, follow the standard process for restoring the backup to the desired database using the Bitbucket backup client.
The ultimate resolution to this issue is being tracked in the following bug ticket: