Troubleshooting Hipchat Data Center
Before you allow users into your Hipchat Data Center deployment, make sure everything is configured and working correctly. This page contains a checklist to help you verify the deployment and catch any errors before you open it to users.
System monitoring and alerts
In the Hipchat Data Center admin UI, you can configure a recipient email address for system alerts.
An alert email is sent when any of the following conditions are met:
- Memory utilization over 98% for three cycles
- Swap file utilization over 10% for three cycles
- CPU (user) utilization over 95% for three cycles
- CPU (system) utilization over 95% for three cycles
- CPU (wait) utilization over 99% for three cycles
- 'gearman' over 30% CPU utilization
- 'nginx' over 20% CPU utilization for five cycles
- 'ntpd' becomes unavailable
- 'php5' restarts three times within five cycles
- 'punjab' unavailable for three cycles, or over 45% CPU for three cycles
- 'rsyslog' over 75% CPU for three cycles
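Most of the conditions above only fire when a value stays over its threshold for several consecutive cycles. The following is a minimal sketch of that "over X% for N cycles" logic, not the check Hipchat Data Center actually runs. The function name is hypothetical, and it assumes that one cycle corresponds to one polling sample, which this page does not define.

```shell
# exceeds_for_cycles THRESHOLD CYCLES SAMPLE...
# Hypothetical helper: succeeds (exit 0) only if the last CYCLES samples
# are all strictly above THRESHOLD. Samples are whole-number percentages,
# oldest first. Assumes one sample per cycle.
exceeds_for_cycles() {
    threshold="$1"
    cycles="$2"
    shift 2
    # With fewer samples than cycles, the condition cannot be met yet.
    [ "$#" -ge "$cycles" ] || return 1
    # Discard older samples so only the most recent CYCLES remain.
    while [ "$#" -gt "$cycles" ]; do
        shift
    done
    for sample in "$@"; do
        [ "$sample" -gt "$threshold" ] || return 1
    done
    return 0
}
```

For example, `exceeds_for_cycles 98 3 50 99 99 100` succeeds because the last three memory samples are all over 98, while a single dip below the threshold within the window resets the condition.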
Set up SNMP monitoring
Hipchat Data Center implements SNMP v2c using standard Ubuntu MIBs. SNMP is enabled and configured at the command line.
- To turn SNMP on or off:
hipchat service -n "on" OR "off"
- To set up the community string:
hipchat service -c <communitystring>
Example: hipchat service -c public
- To add a TRAP recipient server:
hipchat service -t trap.server.com
\prior to a special character as in
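Once SNMP is turned on and a community string is set, you may want to confirm from a monitoring host that a node answers queries. The sketch below assumes the net-snmp `snmpwalk` client is installed on that host. The function name, the `DRY_RUN` variable, and the host name in the usage note are illustrative and are not part of the Hipchat CLI.

```shell
# snmp_probe HOST COMMUNITY
# Hypothetical helper: walks the standard "system" subtree over SNMP v2c.
# Set DRY_RUN=1 to print the command instead of running it.
snmp_probe() {
    host="$1"
    community="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "snmpwalk -v2c -c $community $host system"
        return 0
    fi
    # Quoting keeps special characters in the community string intact.
    snmpwalk -v2c -c "$community" "$host" system
}
```

For example, `snmp_probe node1.example.com public` should return the system description and uptime if the node is reachable and the community string matches.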
Troubleshooting and logs
Log files are available in the /var/log/ directory of each node. The Hipchat service logs can be found inside
Once per day, the log files from each node are copied to the /file_store/shared/logs subdirectory of your network-attached storage volume. They follow a /YYYYMMDD/machineid/log-files naming convention.
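To illustrate the naming convention, here is a small sketch that builds the archive directory for one node on one day. The function name is hypothetical, and the format of the machine ID is an assumption, since this page does not specify it.

```shell
# archived_log_dir YYYYMMDD MACHINE_ID
# Hypothetical helper: prints the directory on the shared volume that
# should hold one node's copied logs for the given day.
archived_log_dir() {
    day="$1"         # date in YYYYMMDD form
    machine_id="$2"  # node identifier (exact format is an assumption)
    printf '/file_store/shared/logs/%s/%s\n' "$day" "$machine_id"
}
```

For example, `ls "$(archived_log_dir "$(date +%Y%m%d)" node1)"` would list today's copied logs for a node whose machine ID is `node1`.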
Configuration management is handled by chef-solo, which runs at boot, on upgrade, and during service restarts. You can find the chef-solo log file at
To retrieve all your logs, run hipchat log -r on each node. This copies the logs to the /file_store/shared/logs folder, which you can then compress and include with your support request.
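One way to compress the copied logs is with `tar`. This is a minimal sketch rather than an official procedure. The function name and the output path in the usage note are hypothetical.

```shell
# bundle_logs SRC_DIR OUT_FILE
# Hypothetical helper: packs SRC_DIR into a gzip-compressed tar archive.
bundle_logs() {
    src_dir="$1"
    out_file="$2"
    if [ ! -d "$src_dir" ]; then
        echo "no such directory: $src_dir" >&2
        return 1
    fi
    # -C switches to the parent directory so the archive stores
    # relative paths instead of absolute ones.
    tar -czf "$out_file" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
}
```

For example, `bundle_logs /file_store/shared/logs /tmp/hipchat-logs.tar.gz` produces a single archive that you can attach to a request. Write the archive outside the directory being packed so it is not included in itself.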
If you need to open a support request, download and attach your logs if possible. This helps us speed up the troubleshooting process.
|Force a log rotation||This will force all logs to conform to the log rotation configuration specified in |
|Truncates the contents of all logs in ||Be sure to back up any logs required for troubleshooting before executing this command.|
Log file reference
|chef runs for installing/updating/configuring||Logging starts from first boot. Most system configuration changes will trigger a chef run.|
|nginx logs AND coral logs|
Includes nginx-access entries alongside coral entries. nginx.err.log only logs ERROR and above.
Any entries in nginx.err.log are indicative of a problem.
|Ubuntu kernel logging|
|Logs any schema upgrade changes that occur during upgrades||Useful for seeing upgrade history.|
External directory (Crowd/AD/LDAP) integration and authentication
|Related to user authentication and external directory synchronization.|
Many services rely on coral for authentication, so this log is often referenced while tracing a problem.
coral.err.log only logs ERROR and above. Any entries in coral.err.log are indicative of a problem.
|Entries related to cron job schedules on the server|
|WebUI logging (i.e. the PHP-based administration)|
Good starting point for any error messages or stack traces occurring in the web interface.
web.err.log only logs ERROR and above. Any entries in web.err.log are indicative of a problem.
|Detailed output of upgrades (and errors)||Critical for troubleshooting upgrade issues, along with chef.log.|
Core chat service log
Errors here are often critical.
tetra.err.log only logs ERROR and above. Any entries in tetra.err.log are indicative of a problem.
|Logs when services are restarted|
Helpful for troubleshooting a broken service/upgrade.
The "services starting" state prevents access to the system before it is fully initialized. hup.log records the orderly start, and its last statement should be "maintenance_mode now OFF".
|Hipchat-specific subprocesses:||Entries include associated service name for easy parsing, such as:|
|Redis master log (there is a separate Redis log for stats)||If this file is very large, you most likely need to run sudo /bin/dont-blame-hipchat; chown redis /mnt.|
|Logs for the various daemons, including monit and ntpd||Useful for observing emergency service restarts via monit. Entries include daemon names for parsing, similar to hcapp.log|
|Lists server processes, disk space, server status (including CPU, memory, active user counts, etc.)||This is a great place to start for root cause analysis.|
|Output related to connection with external PostgreSQL database.|
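Two checks from the reference above lend themselves to a quick script: any entry in an .err.log file indicates a problem, and the last statement in hup.log should be "maintenance_mode now OFF". The sketch below shows one way to automate both. The function names are hypothetical, and the log directory is passed as an argument because it depends on where the Hipchat service logs live on your node.

```shell
# nonempty_err_logs LOG_DIR
# Hypothetical helper: prints every *.err.log file in LOG_DIR that
# contains at least one entry.
nonempty_err_logs() {
    log_dir="$1"
    for f in "$log_dir"/*.err.log; do
        # -s is true only for files that exist and are not empty.
        if [ -s "$f" ]; then
            echo "$f"
        fi
    done
    return 0
}

# startup_completed HUP_LOG
# Hypothetical helper: succeeds if the last line of the given file
# reports that maintenance mode was switched off.
startup_completed() {
    tail -n 1 "$1" | grep -q "maintenance_mode now OFF"
}
```

An empty result from `nonempty_err_logs` suggests no service has logged an error, and a failing `startup_completed` after a restart suggests the orderly start did not finish.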