Monday, February 26, 2007

Accepting the Q factor

http://www.w3.org/2000/09/xmldsig# is the namespace of the XML Signature schema, one of the many, many schemas you end up fetching if you do XML Schema based completion for WS-SecurityPolicy (2005), which is part of our WSDL policy editor in the Eclipse plugins for Sonic ESB Workbench. Why is this one special? For the following reason -

If you access http://www.w3.org/2000/09/xmldsig# from Mozilla Firefox, you get back the schema at http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/xmldsig-core-schema.xsd (through an HTTP 303 redirect). But if you use Java's java.net.URL.openConnection() (which for an http URL basically means HttpURLConnection), you get an HTML page instead of the schema (XML), which our schema loader does not particularly appreciate.

It took me a while to understand why the same URL behaves differently. I captured the headers sent by my code using Eclipse's TCP/IP Monitor, and the ones sent by Firefox using LiveHTTPHeaders.

This is what Firefox sends -

GET /2000/09/xmldsig HTTP/1.1
Host: www.w3.org
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.2) Gecko/20070219 Firefox/2.0.0.2
Accept: text/xml,application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

and this is what it receives -

HTTP/1.x 303 See Other
Date: Mon, 26 Feb 2007 11:50:20 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.5
WWW-Authenticate: Basic realm="W3CACL"
Location: http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/xmldsig-core-schema.xsd
Keep-Alive: timeout=2, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1



But this is what HttpURLConnection sends for the same request -

GET /2000/09/xmldsig HTTP/1.1
User-Agent: Java/1.4.2_12
Host: www.w3.org
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive


and this is what it receives -

HTTP/1.1 303 See Other
Date: Mon, 26 Feb 2007 11:56:47 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.5
WWW-Authenticate: Basic realm="W3CACL"
Location: http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/Overview.html
Keep-Alive: timeout=2, max=99
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=iso-8859-1


Notice the difference in the Location header:
Firefox : Location: http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/xmldsig-core-schema.xsd
Java URLConnection : Location: http://www.w3.org/TR/2002/REC-xmldsig-core-20020212/Overview.html

The problem turns out to be the Accept header that Java's URLConnection sets by default (or, I guess, that the Sun HttpURLConnection implementation sets).

Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
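
One way to read this header is to work out the quality ('q') value it assigns to a given media type. The Java sketch below is only an illustration: the class AcceptQuality and the method qualityFor are hypothetical names, the matching is simplified (other parameters are ignored), and treating the bare * as a wildcard is an assumption. It does not claim to reproduce what the server at www.w3.org actually runs.

```java
import java.util.Locale;

class AcceptQuality {

    // Returns the q-value that acceptHeader assigns to mediaType, or 0 if
    // nothing matches. The most specific matching range wins:
    // exact type, then type/*, then */* (a bare * is treated like */*,
    // which is an assumption made for this sketch).
    static double qualityFor(String acceptHeader, String mediaType) {
        String wanted = mediaType.trim().toLowerCase(Locale.ROOT);
        String wantedGroup = wanted.substring(0, wanted.indexOf('/') + 1) + "*";
        int bestRank = 0;
        double bestQ = 0.0;
        for (String item : acceptHeader.split(",")) {
            String[] parts = item.split(";");
            String range = parts[0].trim().toLowerCase(Locale.ROOT);
            double q = 1.0; // q defaults to 1 when it is omitted
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].trim().toLowerCase(Locale.ROOT);
                if (param.startsWith("q=")) {
                    q = Double.parseDouble(param.substring(2).trim());
                }
            }
            int rank;
            if (range.equals(wanted)) {
                rank = 3;
            } else if (range.equals(wantedGroup)) {
                rank = 2;
            } else if (range.equals("*/*") || range.equals("*")) {
                rank = 1;
            } else {
                rank = 0;
            }
            if (rank > bestRank) {
                bestRank = rank;
                bestQ = q;
            }
        }
        return bestQ;
    }

    public static void main(String[] args) {
        String javaAccept = "text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2";
        String firefoxAccept = "text/xml,application/xml,application/xhtml+xml,"
                + "text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5";
        System.out.println("Java    text/html: " + qualityFor(javaAccept, "text/html"));
        System.out.println("Java    text/xml : " + qualityFor(javaAccept, "text/xml"));
        System.out.println("Firefox text/html: " + qualityFor(firefoxAccept, "text/html"));
        System.out.println("Firefox text/xml : " + qualityFor(firefoxAccept, "text/xml"));
    }
}
```

With the two headers captured above, the Java header rates text/html at 1.0 and text/xml at only 0.2, while the Firefox header rates text/xml at 1.0 and text/html at 0.9.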


Note that there is no text/xml here, unlike in the Firefox header. Although there is a */*, its 'q' value (.2) is lower than that of text/html (which defaults to 1 when no 'q' is given), and the nice server at www.w3.org uses this to pick the output best suited to the user-agent. I guess since they are the standards organization they should do this :-). Fixing the Accept header fixes this behaviour. Is there something I am missing in my understanding of how URLConnection works?
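
For reference, overriding the Accept header on a URLConnection can look roughly like the sketch below. The class name, the method names and the exact header value are assumptions made for illustration; any value that ranks XML above text/html should have the same effect. The post itself does not show the actual fix.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;

class XmlAcceptExample {

    // Assumed header value: prefer XML, accept anything else only as a fallback.
    static final String XML_ACCEPT = "text/xml, application/xml;q=0.9, */*;q=0.1";

    // Creates the connection object and replaces the default Accept header.
    // openConnection() does not touch the network yet; the request is sent
    // on connect() or on the first call that needs the response.
    static URLConnection openPreferringXml(URL url) throws IOException {
        URLConnection conn = url.openConnection();
        conn.setRequestProperty("Accept", XML_ACCEPT);
        return conn;
    }

    // Opens the stream a schema loader would read from. This performs the
    // HTTP request; HttpURLConnection follows the 303 redirect by default.
    static InputStream openSchemaStream(URL url) throws IOException {
        return openPreferringXml(url).getInputStream();
    }

    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.w3.org/2000/09/xmldsig#");
        URLConnection conn = openPreferringXml(url);
        // Shows the header that will be sent, without going to the network.
        System.out.println("Accept: " + conn.getRequestProperty("Accept"));
    }
}
```

The header has to be set before the connection is made; calling setRequestProperty on an already connected URLConnection throws an IllegalStateException.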

1 comment:

Rajiv said...

Very interesting issue, Sachin. It appears that Firefox keeps changing the Accept header based on the extension in the URL. I wonder if the JDK's URLConnection has similar logic.

Incidentally, the folks here tell me that, as per the spec, if the response changes based on the Accept header, the server SHOULD set the Vary header. Otherwise, if one user accepts XML and another does not, they might both end up getting the cached XML response from a proxy.

Considering they are the standards org, they should have set the Vary header?! ;)