Troubleshooting Hipchat Data Center
Before you allow users into your Hipchat Data Center deployment, make sure everything is configured and working correctly. This page contains a checklist to help you verify the deployment and catch any errors before you open it to users.
System monitoring and alerts
In the Hipchat Data Center admin UI, you can configure a recipient email address for system alerts.
An alert email is sent when any of the following conditions are met:
- Memory utilization over 98% for three cycles
- Swap file utilization over 10% for three cycles
- CPU (user) utilization over 95% for three cycles
- CPU (system) utilization over 95% for three cycles
- CPU (wait) utilization over 99% for three cycles
- 'gearman' over 30% CPU utilization
- 'nginx' over 20% CPU utilization for five cycles
- 'ntpd' becomes unavailable
- 'php5' restarts three times within five cycles
- 'punjab' unavailable for three cycles, or over 45% CPU for three cycles
- 'rsyslog' over 75% CPU for three cycles
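Most of the conditions above fire only when a threshold is exceeded for several consecutive cycles. The following is a minimal sketch of that rule, not the product's implementation. It assumes one integer percentage sample per monitoring cycle, and `over_for_cycles` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) only if the most recent N samples all
# exceed the threshold. One sample stands for one monitoring cycle (assumption).
# Usage: over_for_cycles THRESHOLD CYCLES SAMPLE...
over_for_cycles() {
    threshold=$1
    cycles=$2
    shift 2
    # With fewer than N samples the condition cannot have held for N cycles.
    [ "$#" -ge "$cycles" ] || return 1
    # Drop older samples so that only the most recent N remain.
    while [ "$#" -gt "$cycles" ]; do
        shift
    done
    for sample in "$@"; do
        # Any sample at or below the threshold means no alert.
        [ "$sample" -gt "$threshold" ] || return 1
    done
    return 0
}

# Example: memory utilization (percent) over the last four cycles.
if over_for_cycles 98 3 97 99 99 100; then
    echo "ALERT: memory over 98% for three cycles"
fi
```

A single spike, such as the samples 99 97 99, does not satisfy the rule, which is why short bursts of load should not produce an alert email.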
Set up SNMP monitoring
Hipchat Data Center implements SNMP v2c using standard Ubuntu MIBs that can be enabled at the command line.
- To turn SNMP on or off:
hipchat service -n "on" OR "off"
- To set up the community string:
hipchat service -c <communitystring>
Example: hipchat service -c public
- To add a server to the TRAP recipient list:
hipchat service -t trap.server.com
Escape special characters by placing a backslash (\) prior to each one.
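As an illustration of the backslash rule, the sketch below compares a backslash-escaped string with its single-quoted equivalent. It assumes the backslash is needed only to stop the shell from interpreting the character; whether the hipchat command applies further rules of its own is not covered here. The community string `my$ecret!` is a made-up value.

```shell
#!/bin/sh
# Hypothetical community string containing "$" and "!".
# Backslash form: each special character is preceded by "\".
escaped=my\$ecret\!
# Single-quoted form: the shell leaves everything between the quotes untouched.
quoted='my$ecret!'

if [ "$escaped" = "$quoted" ]; then
    echo "both forms pass the same string: $escaped"
fi

# The backslash form could then be given to the documented command, for example:
#   hipchat service -c my\$ecret\!
```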
Troubleshooting and logs
Log files are available in the /var/log/ directory of each node. The Hipchat service logs can be found inside
Once per day, the log files from each node are copied to the /file_store/shared/logs subdirectory of your network-attached storage volume. They follow a /YYYYMMDD/machineid/log-files naming convention.
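To locate one node's archived logs for a given day, the naming convention can be turned into a path. This sketch assumes the dated directories sit directly under /file_store/shared/logs. The `daily_log_dir` helper, the `LOG_ROOT` variable, and the machine ID `node-1234` are hypothetical.

```shell
#!/bin/sh
# Root of the archived logs; the default is the folder named in the text.
LOG_ROOT=${LOG_ROOT:-/file_store/shared/logs}

# Build the archive directory for one day and one node,
# following the /YYYYMMDD/machineid/ layout.
daily_log_dir() {
    day=$1        # date as YYYYMMDD
    machine=$2    # machine ID of the node
    printf '%s/%s/%s\n' "$LOG_ROOT" "$day" "$machine"
}

# Example: look for today's archived logs of a hypothetical node.
dir=$(daily_log_dir "$(date +%Y%m%d)" "node-1234")
if [ -d "$dir" ]; then
    ls -l "$dir"
else
    echo "no archived logs at $dir"
fi
```

Because the copy runs once per day, the directory for the current day may not exist until that day's copy has completed.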
Configuration management is handled by chef-solo, which runs at boot, during upgrades, and during service restarts. You can find the chef-solo log file at
To retrieve all your logs, run hipchat log -r on each node. This copies the logs to the /file_store/shared/logs folder, which you can then compress and include with your support request.
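One way to do the compression step is with tar. This is a sketch rather than an official procedure. The `bundle_logs` helper, the archive file name, and the default output directory /tmp are assumptions.

```shell
#!/bin/sh
# Compress a log folder into a dated .tar.gz for a support request.
# Usage: bundle_logs [SOURCE_DIR] [OUTPUT_DIR]
bundle_logs() {
    src=${1:-/file_store/shared/logs}   # folder named in the text
    out_dir=${2:-/tmp}                  # assumption: any writable location works
    archive="$out_dir/hipchat-logs-$(date +%Y%m%d).tar.gz"   # hypothetical name
    # -C switches to the parent folder so the archive stores relative paths.
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
    # Print the archive path so the caller knows what to attach.
    echo "$archive"
}

# Example: bundle the shared log folder if it exists on this machine.
if [ -d /file_store/shared/logs ]; then
    bundle_logs
fi
```

Writing the archive outside the source folder keeps it from being picked up by a later run.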
If you need to open a support request, download and attach your logs if possible. This helps us speed up the troubleshooting process.
|Force a log rotation||This will force all logs to conform to the log rotation configuration specified in |
|Truncates the contents of all logs in ||Be sure to back up any logs required for troubleshooting before executing this command.|
Log file reference
|chef runs for installing/updating/configuring||Logging starts from first boot. Most system configuration changes will trigger a chef run.|
|nginx logs AND coral logs|
Includes nginx-access entries alongside coral entries. nginx.err.log only logs ERROR and above.
Any entries in nginx.err.log are indicative of a problem.
|Ubuntu kernel logging|
|Logs any schema upgrade changes that occur during upgrades||Useful for seeing upgrade history.|
External directory (Crowd/AD/LDAP) integration and authentication
|Related to user authentication and external directory synchronization.|
Many services rely on coral for authentication, so this log is often referenced while tracing a problem.
coral.err.log only logs ERROR and above. Any entries in coral.err.log are indicative of a problem.
|Entries related to cron job schedules on the server|
|Web UI logging (that is, the PHP-based administration interface)|
Good starting point for any error messages or stack traces occurring in the web interface.
web.err.log only logs ERROR and above. Any entries in web.err.log are indicative of a problem.
|Detailed output of upgrades (and errors)||Critical for troubleshooting upgrade issues, along with chef.log.|
Core chat service log
Errors here are often critical.
tetra.err.log only logs ERROR and above. Any entries in tetra.err.log are indicative of a problem.
|Logs when services are restarted|
Helpful for troubleshooting a broken service/upgrade.
The "services starting" state prevents access to the system before it is fully initialized. hup.log records the orderly start, and its last statement should be "maintenance_mode now OFF".
|Hipchat-specific subprocesses:||Entries include associated service name for easy parsing, such as:|
|Redis master log (a separate Redis log holds stats)||If this file is very large, you most likely need to run sudo /bin/dont-blame-hipchat; chown redis /mnt.|
|Logs for the various daemons, including monit and ntpd||Useful for observing emergency service restarts via monit. Entries include daemon names for parsing, similar to hcapp.log|
|Lists server processes, disk space, server status (including CPU, memory, active user counts, etc.)||This is a great place to start for root cause analysis.|
|Output related to connection with external PostgreSQL database.|
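Two checks from the reference above lend themselves to a quick script: any entry in a *.err.log file indicates a problem, and hup.log should end with "maintenance_mode now OFF". The sketch below is a starting point only. The helper names are hypothetical, and the log locations are passed in as arguments because they depend on your node.

```shell
#!/bin/sh
# List every *.err.log under a directory that contains at least one byte.
# Usage: nonempty_err_logs LOG_DIR
nonempty_err_logs() {
    find "$1" -type f -name '*.err.log' -size +0c 2>/dev/null
}

# Succeed if the last line of the given hup.log reports maintenance mode off.
# Usage: startup_completed PATH_TO_HUP_LOG
startup_completed() {
    tail -n 1 "$1" 2>/dev/null | grep -q 'maintenance_mode now OFF'
}

# Example run; LOG_DIR is a hypothetical variable you set to your log folder.
if [ -n "${LOG_DIR:-}" ]; then
    nonempty_err_logs "$LOG_DIR"
    startup_completed "$LOG_DIR/hup.log" ||
        echo "hup.log does not end with: maintenance_mode now OFF"
fi
```

An empty result from `nonempty_err_logs` means no error-level entries were found in that folder; it does not rule out problems recorded in the other logs.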