r3 - 19 Apr 2007 - 10:41:45 - GuyRixon
Notes on the probable cause and possible fixes for Bugzilla Bug 2182

The database query that causes the problem is :

    SELECT
        t.htmID,
        t.ra,
        t."dec",
        t.j_m,
        t.j_cmsig,
        t.h_m,
        t.h_cmsig,
        t.k_m,
        t.k_cmsig,
        t.glon,
        t.glat,
        t.b_m_opt,
        t.vr_m_opt
    FROM
        twomass_psc AS t
    WHERE
        t.glon > 90
    AND
        t.glon < 112
    AND
        t.glat > -6
    AND
        t.glat < 6

I reproduced the problem by transferring data from the DSA at ROE to a FileStore on my system here. The FileStore was connected directly to the net on port 8080, so no proxy service was involved.

Watching the network traffic as the DSA transferred its results, I saw it pause for 20 or 30 seconds midway through the query. This is perfectly natural, and is probably caused by a pause in the response from the database server while it processes the next block of results.

However, what the receiving FileStore sees is lots of data, and then everything goes quiet. If the pause is longer than 20 seconds, this triggers the default socket timeout limit, and the FileStore service gives up and closes the connection.
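The failure mode can be reproduced in miniature. The following is only a sketch of the mechanism, not the FileStore or tomcat code: it uses a local socket pair, and the function names (`receive_with_timeout`, `send_with_pause`, `run_transfer`) are made up for illustration. The times are scaled down from tens of seconds to fractions of a second so that it runs quickly.

```python
import socket
import threading
import time


def receive_with_timeout(sock, timeout_s):
    """Read until EOF, or until the sender is silent for longer than timeout_s.

    Returns (bytes_received, completed); completed is False if the read timed out.
    """
    sock.settimeout(timeout_s)
    received = 0
    try:
        while True:
            chunk = sock.recv(4096)
            if not chunk:           # sender closed cleanly: transfer complete
                return received, True
            received += len(chunk)
    except socket.timeout:          # silence exceeded the limit: give up
        return received, False
    finally:
        sock.close()


def send_with_pause(sock, pause_s):
    """Send one block, stall (as if waiting on the database), then send another."""
    try:
        sock.sendall(b"x" * 1000)
        time.sleep(pause_s)
        sock.sendall(b"y" * 1000)
    except OSError:
        pass                        # the receiver has already hung up
    finally:
        sock.close()


def run_transfer(pause_s, timeout_s):
    """Run one sender/receiver pair and return the receiver's result."""
    sender, receiver = socket.socketpair()
    worker = threading.Thread(target=send_with_pause, args=(sender, pause_s))
    worker.start()
    result = receive_with_timeout(receiver, timeout_s)
    worker.join()
    return result


if __name__ == "__main__":
    # Pause longer than the timeout: the receiver gives up part way through.
    print(run_transfer(pause_s=0.5, timeout_s=0.1))
    # Pause shorter than the timeout: the whole transfer arrives.
    print(run_transfer(pause_s=0.05, timeout_s=0.5))
```

When the pause is longer than the timeout, the receiver reports an incomplete transfer even though the sender was about to carry on, which matches what the FileStore appears to be doing here.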

The timeout limit can be changed on the receiving FileStore by modifying the tomcat configuration :

$CATALINA_HOME/conf/server.xml

    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
    <Connector
        port="8080"
        ....
        connectionTimeout="20000"
        />
    ....

The timeout value is in milliseconds, so setting it to 60000 (60 seconds) should be enough to handle this particular example.

$CATALINA_HOME/conf/server.xml

    <!-- Define a non-SSL HTTP/1.1 Connector on port 8080 -->
    <Connector
        port="8080"
        ....
        connectionTimeout="60000"
        />
    ....
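
When several machines need the same change, it may help to check what each server.xml actually contains. The following is a minimal sketch, assuming the file is well-formed XML; `connector_timeouts` is a hypothetical helper and `SAMPLE` is a trimmed-down stand-in for a real server.xml, not a copy of one.

```python
import xml.etree.ElementTree as ET


def connector_timeouts(server_xml_text):
    """Map each Connector port to its connectionTimeout in seconds (None if unset)."""
    root = ET.fromstring(server_xml_text)
    timeouts = {}
    for connector in root.iter("Connector"):
        port = connector.get("port")
        raw = connector.get("connectionTimeout")
        # The attribute is in milliseconds; convert to seconds for readability.
        timeouts[port] = int(raw) / 1000.0 if raw is not None else None
    return timeouts


# Hypothetical, trimmed-down server.xml used only for illustration.
SAMPLE = """
<Server>
  <Service name="Catalina">
    <Connector port="8080" connectionTimeout="60000" />
  </Service>
</Server>
"""

if __name__ == "__main__":
    print(connector_timeouts(SAMPLE))
```

To check a real installation, read the contents of $CATALINA_HOME/conf/server.xml and pass the text to `connector_timeouts`.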

I have re-tested the query and the full set of results (420 Mbytes) transferred into my FileStore. I checked that the output was not truncated (the closing VOTable tag is there).

                    ....
                </TABLEDATA>
            </DATA>
        </TABLE>
    </RESOURCE>
</VOTABLE>
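
For a file of this size it is easier to check the end of the file than to open the whole thing. The following is a minimal sketch of that check, assuming the results have been saved to a local file; `ends_with_closing_tag` is a hypothetical helper, and it only detects truncation, it does not validate the VOTable.

```python
import os


def ends_with_closing_tag(path, tag=b"</VOTABLE>", tail_bytes=4096):
    """Return True if the last non-whitespace bytes of the file are `tag`."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        # Only read the tail, so a 420 Mbyte file is not loaded into memory.
        f.seek(max(0, size - tail_bytes))
        tail = f.read()
    return tail.rstrip().endswith(tag)
```

A transfer that was cut off by the timeout would stop somewhere inside the TABLEDATA rows, so this check would return False for it.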

This appears to solve the initial problem. However, there may be further issues we need to check.

This was tested using a FileStore connected direct to the net, without a http proxy in front of it. Increasing the connection timeout on the tomcat server seems to work, but we may have similar issues with connection timeouts on the http proxies used at ROE and Leicester.

This needs Mark or Gary to tweak the tomcat settings on their machines and then re-test using a myspace account at ROE or Leicester.

Second, to get this query to work I increased the timeout to 60 seconds. However, there may be cases where a really complex query causes the DSA to pause for more than 60 seconds, triggering the same problem. I don't want to raise the timeout beyond 60 seconds, as this could cause other side effects.

The connection timeout is there so that the tomcat server can recover resources when a client disconnects half way through a transfer.

I have seen many people at workshops request to view their results in a web browser, check that the first 20 or so rows look ok, and then stop the request. This leaves an open connection on the server, which is recovered when it reaches the connection timeout.

Increasing the timeout means that it will take longer for the server to recover the resources from a dead connection. If we set the timeout to an extremely high value, then the FileStore services will be in danger of running out of resources if lots of users request to view their results and then kill the request after the first few rows.
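
A rough way to see the trade-off is Little's law: the average number of dead connections held open is the rate at which requests are abandoned multiplied by how long each one is held. The following is a back-of-envelope sketch only. It assumes, as described above, that every abandoned connection is held for the full timeout, and the rate of 30 abandoned requests per minute is a made-up figure, not a measurement.

```python
def stale_connections(abandoned_per_minute, timeout_ms):
    """Average number of dead connections held open (Little's law: rate * hold time)."""
    rate_per_second = abandoned_per_minute / 60.0
    hold_seconds = timeout_ms / 1000.0
    return rate_per_second * hold_seconds


if __name__ == "__main__":
    # Hypothetical workshop load: 30 abandoned requests per minute.
    for timeout_ms in (20000, 60000, 600000):
        print(timeout_ms, stale_connections(30, timeout_ms))
```

Under these assumptions the number of dead connections grows in direct proportion to the timeout, so going from 20 to 60 seconds triples it and a ten minute timeout would multiply it by thirty.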

There is a facility in the HTTP standard for sending 'keep alive' packets as part of the transfer protocol. While debugging this problem, I didn't see any evidence that our services were using this. Getting this to work would be a better long term solution, rather than explicitly tweaking the connection timeout on selected servers.

-- DaveMorris - 16 Apr 2007

Looks like I got the wrong idea about HTTP keepalive. According to HTTP Persistent Connections, the http keep alive is for re-using a socket connection after a request has completed, not for keeping a connection alive during a pause in the middle of a transfer.

-- DaveMorris - 17 Apr 2007

"This was tested using a FileStore connected direct to the net, without a http proxy in front of it."

I suggest that we should move to remove the proxies, which are a PITA. This will be particularly important for REST services secured with HTTPS. With better implementations, we should not need multiple tomcats per host. If we do, then VMs would be a better approach. In any case, a VOSpace really needs a physical host to itself.

-- GuyRixon - 19 Apr 2007
