Creating a web crawler in Java is easy, as long as you don’t need timeouts shorter than the defaults. Here is a JavaWorld article and a Sun Developer Connection article on doing just that. But the moment you do require control over timeouts… ouch, does it get confusing. Note that this post covers Java versions earlier than 1.4.x.
Here are some of the major issues I’ve found:
1. The HTTP-enabled classes do not expose a method for setting a timeout.
2. The lower-level socket classes do expose that method, but require a crap load more programming to use. This great JavaWorld article covers that approach. Run it on this site though and watch what happens.
3. Trying to hack your way to a timeout with the HTTP classes and threads exposes a registered bug: the HTTP operation may not properly close.
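For what it’s worth, here is roughly what the socket-level approach from point 2 looks like. This is only a minimal sketch, not the code from the JavaWorld article: the class name TimedFetch and its fetch method are made up for illustration, and it speaks a bare HTTP/1.0 GET with no redirect or error handling. Socket.setSoTimeout has been in the JDK since 1.1, so it is available before 1.4, but it only bounds each individual blocking read. It does not bound the connect, and it does not cap the total transfer time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;

// Hypothetical helper for illustration; not part of any library.
class TimedFetch {

    /**
     * Sends a bare HTTP/1.0 GET and returns the raw response (headers and body).
     * If a single read blocks longer than timeoutMillis, the read throws
     * java.io.InterruptedIOException (SocketTimeoutException on 1.4 and later,
     * which is a subclass of it).
     */
    static String fetch(String host, int port, String path, int timeoutMillis)
            throws IOException {
        // Note: this constructor connects immediately, and the connect
        // itself is NOT covered by the read timeout set below.
        Socket socket = new Socket(host, port);
        try {
            socket.setSoTimeout(timeoutMillis); // applies to each blocking read

            OutputStream out = socket.getOutputStream();
            String request = "GET " + path + " HTTP/1.0\r\n"
                    + "Host: " + host + "\r\n\r\n";
            out.write(request.getBytes("ISO-8859-1"));
            out.flush();

            BufferedReader in = new BufferedReader(
                    new InputStreamReader(socket.getInputStream(), "ISO-8859-1"));
            StringBuffer response = new StringBuffer();
            char[] buf = new char[1024];
            int n;
            // HTTP/1.0 servers close the connection when the response is done,
            // so reading until end-of-stream collects the whole response.
            while ((n = in.read(buf)) != -1) {
                response.append(buf, 0, n);
            }
            return response.toString();
        } finally {
            socket.close(); // always release the socket, even after a timeout
        }
    }
}
```

A crawler would wrap a call such as TimedFetch.fetch(host, 80, "/", 5000) in a try/catch for InterruptedIOException and move on to the next URL when it fires. Everything the HTTP classes normally do for you (status codes, redirects, chunked bodies) is still missing here, which is the “crap load more programming” part.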
Anyway, after much searching, jGuru pointed me towards Jakarta’s HTTP Client. I recall reading about it over at rebelutionary. But guess what? After who knows how long digging, I discovered the timeout property was not exposed in the last release build. OK, I figured, I’ll go grab the latest nightly build and cross my fingers. That turned up some undocumented dependencies: you will need to download and install the Jakarta Logging Component and Sun’s Java Secure Socket Extension to get it to work.
I could have written the low-level Socket code… but there is a strong part of me… the lazy programmer in me… that believes in never reinventing the wheel. I knew there had to be a set of packages that would allow me to do this with as little coding as possible. By finding those packages, especially from a reliable source such as the Apache Jakarta project, I can have a higher degree of confidence in what I’m putting together. And oh yes… I only had to write about 50 lines of code 🙂
Anyway… has anyone else run into this, or did I just take a walk I didn’t need to?