Users of our external GroupWise 6.5.6 webaccess gateway have been experiencing problems with their sessions hanging when trying to open items. Users accessing our internal webaccess gateways, which are the same version and configuration as the external gateway, have not been experiencing this problem. All GroupWise servers were running on NetWare 6.5.5 and were patched up to FTF GW 6.5 post SP6 English only Agents Rev 6.
It did not matter which Internet browser the external users were using; the same problem was apparent for users of Firefox 2.x, 3.x, IE6, and IE7. They’d try to open an item, and their browser would appear to hang for anywhere from a few seconds to a few minutes. Rebooting the webaccess server did not have any effect on the problem.
The following message was seen in the webaccess log file:
Error: Request aborted while waiting on locked conversation
The only Novell TID that was relevant to this problem was TID #10023251. It describes this exact error message, but it specifically states that the error only occurs when trying to view an attachment. That wasn’t the case for our users, since the error was logged when they were just navigating through the webaccess client.
The TID was not helpful in any case, since it states that the only fix is to have the user perform the action again (i.e. resend the message) and that there are no configurable parameters to increase this timeout.
Here’s what we did to troubleshoot this problem:
The first thing I did was verify the amount of free disk space on the sys volume of each server. The two well-behaved servers had at least 750MB free, while the failing server had only 500MB free. Java sometimes behaves poorly when it lacks an abundance of free disk space, so I cleared out 500MB of old log files and restarted the server. Unfortunately, no change in performance or error rate was noted.
I loaded config.nlm on the good internal webaccess servers and the problematic external webaccess server. I then used Winmerge to compare the resulting log files to check for differences in versions of drivers, nlms, and configuration files. One of the team members noticed a difference in how webaccess was being loaded in protected mode. We tried duplicating the change on the external server, but that didn’t have any impact on the situation.
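If Winmerge isn't handy, the same kind of comparison can be roughed out with a few lines of Python. This is only a sketch: it makes no assumptions about the layout of the config.nlm report, it just treats each report as a list of text lines and keeps the lines that differ. The function name and the sample file names in the note below are made up for illustration.

```python
import difflib

def diff_reports(good_lines, bad_lines):
    """Return only the lines that differ between two config reports.

    Lines prefixed with '- ' appear only in the good server's report,
    lines prefixed with '+ ' appear only in the problem server's report.
    """
    return [
        line
        for line in difflib.ndiff(good_lines, bad_lines)
        # ndiff also emits '  ' (unchanged) and '? ' (hint) lines; skip those.
        if line.startswith(("- ", "+ "))
    ]
```

To use it, read each report with something like open("internal.txt").read().splitlines() (hypothetical file names) and pass both lists in. Expect plenty of noise from server names and IP addresses; the interesting lines are the driver, nlm, and load-option differences.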
Next I checked the versions of Java and Tomcat on all machines:
To get the Java version number: java -version
To view the running instances of Java: java -show
To see Java instance memory utilization: java -showmemoryID, where ID is the ID of the instance listed by java -show. There is no space between showmemory and the ID number.
Note: You have to switch to the NetWare console logger screen to see the output of these commands.
I didn’t see anything abnormal in the problem server’s Java configuration, so next I looked at the Sys:\Apache2\logs\mod_jk.log file. Inside it I saw the following messages, repeated frequently:
jk_ajp_common.c (1318)]: Error connecting to tomcat. Tomcat is probably not started or is listening on the wrong port. worker=ajp13admin failed errno = 54
jk_uri_worker_map.c (620)]: In jk_uri_worker_map_t::map_uri_to_worker, wrong parameters
jk_ajp_common.c (1483): Timeout with waiting reply from tomcat. Tomcat is down, stopped or network problems
jk_ajp_common.c (1503): Tomcat is down or refused connection. No response has been sent to the client (yet)
These messages made me think communication was definitely failing somewhere. The server administrator in charge of the NetWare servers replaced the server’s patch cable and moved it to another port on the switch, thinking that may help communication. It didn’t.
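To put a number on "repeated frequently", something like the following could tally the mod_jk messages by source file and line. It is a rough sketch that assumes each log entry contains a fragment shaped like the samples above (a jk_*.c file name followed by a line number in parentheses); the regular expression and function name are my own, not anything shipped with mod_jk.

```python
import re
from collections import Counter

# Matches fragments such as "jk_ajp_common.c (1483): Timeout with ..."
# The optional "]" covers entries logged as "(1318)]:".
JK_ENTRY = re.compile(r"(jk_\w+\.c)\s*\((\d+)\)\]?:\s*(.*)")

def tally_jk_errors(lines):
    """Count mod_jk log entries, keyed by (source file, source line)."""
    counts = Counter()
    for line in lines:
        match = JK_ENTRY.search(line)
        if match:
            source, lineno, _message = match.groups()
            counts[(source, int(lineno))] += 1
    return counts
```

Feeding it the lines of mod_jk.log from both a good server and the problem server, then comparing counts.most_common() for each, would show whether the connection failures are unique to the external box or just background noise present everywhere.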
My co-worker was poking through the NetWare console Monitor screen (LAN/WAN drivers, highlight the NIC, press Tab for stats) and noticed increasing Rx CRC errors, as well as other errors. He replaced the network card, and all of the webaccess errors went away!