Practical Example — URLDump
The program urldump.sma implements a basic HTTP client that sends a GET request to a web server, receives the result, and writes the entire returned page to a target file. Although the same thing can be done with a browser, what makes this program interesting is that it also outputs the headers sent by the server, which a browser normally hides. These are often the most useful part, especially if you are trying to track down why something is going wrong. The code in this example could also be reused as the basis for a lightweight browser component, a web-crawling robot, or any number of other useful programs based on the HTTP protocol.
In the Beginning …
To start with, we need to create a few useful symbolic constants. The use of symbolic constants is what makes a program readable and easy to maintain. Just as using styles in a document makes it easy to change the look and feel of a document very quickly, the same is true of a good computer program.
    //////////////////////////////////////
    // Symbolic Constants               //
    //////////////////////////////////////
    constant sURLID                 "://"
    constant sSTDPORT               ":80"
    constant iTIMEOUT               1000000
    constant sGET                   "GET"
    constant sSP                    " "
    constant sCRLF                  "{d}{a}"
    constant sHTTPVER               "HTTP/1.0"
    constant sURIBASE               "http://"
    constant sCONTENTLENGTH         "content-length:"
    constant sHTTP                  "HTTP"
    constant sIGNORECHARS           " {d}{a}"

    constant sERRTXT_CONNECT        "Failed to connect to host"
    constant sERRTXT_SEND           "Error sending request to host, \
                                    error number: "
    constant sERRTXT_RECEIVINGDATA  "Error receiving results from \
                                    host, number: "
    constant sERRTXT_FILEOPENFAILED "Error opening output file"
    constant sERRTXT_PAGE           "Page '"
    constant sERRTXT_NOTFOUND       "' not found"
    constant sERRTXT_SUCCESS        "' successfully retrieved"
There are two types of constants in this section: one set holds values used in various parts of the program, and the other holds the error and success messages returned to the user. Some of these values may not make much sense yet, but the constant names hint at their meaning, and seeing how they are used in the program code will clear up the rest.
The Main Event
Now that we have established the constants we will be using (obviously, these actually got created during the writing and restructuring of the program, not before the work began), let's have a look at the program code.
main() function of the urlget program

    function main(string sUrl, string sOutfile)
      tcpsocket http
      string sDomain, sResult
      integer iErrnum, iPos, iContentLength, iPos2
      fsfileoutputstream fpo
      blob bContent, bReceive, bTmp, bHeader, bStatus

      sResult = ""
      bTmp = ""
      if .instr(sUrl, sURLID) == 0
        sUrl = sURIBASE + sUrl
      end if
      sDomain = getdomainroot(sUrl)

      // Now do a quick check and make sure that if they provided
      // a URL like www.foobar.com without the ending slash, that
      // we add it.
      if sDomain == .rstr(sUrl, .len(sDomain))
        sUrl = sUrl + '/'
      end if

      if .instr(sDomain, ":") == 0
        sDomain = sDomain + sSTDPORT
      end if
The program starts in the main() function, which takes two parameters: the URL of the page to retrieve and the name of the output file in which the retrieved page should be stored. After declaring and initializing its variables, the program evaluates the URL (prefixing http:// if no scheme was given) and extracts the root domain from it, since that is what is needed to create the connection. If only a base domain was passed, it also ensures that the URL ends with a closing slash, adding one if necessary, since without it the attempt to retrieve the default page from the web server will fail. Finally, the domain root is checked for optional port information. If a port is present (such as :8080), the program leaves it alone; if not (the normal case), it appends :80 to the domain root. This is necessary because the first parameter to the tcpsocket.new() method is the destination in the form IP address:port or domain name:port.
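The normalization steps described above can be illustrated outside SIMPOL. The following is a minimal Python sketch, not part of urldump.sma; the function name `normalize_target` is hypothetical, and the host extraction is only an assumed stand-in for `getdomainroot()`, whose source is not shown here.

```python
def normalize_target(url: str) -> tuple[str, str]:
    """Return (request_url, host_with_port) for a user-supplied URL."""
    # Prepend the scheme when the caller passed something like "www.example.com".
    if "://" not in url:
        url = "http://" + url
    # Assumed stand-in for getdomainroot(): everything between the
    # scheme separator and the first following slash.
    rest = url.split("://", 1)[1]
    host = rest.split("/", 1)[0]
    # A bare domain needs a trailing slash so the server returns its default page.
    if url.endswith(host):
        url += "/"
    # Add the standard HTTP port unless the URL already names one.
    if ":" not in host:
        host += ":80"
    return url, host
```

Calling `normalize_target("www.example.com")` yields the URL with scheme and trailing slash plus a `host:80` destination, while a URL that already names a port and a path passes through unchanged.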
Once the basic initialization has been completed, the program attempts to open a TCP/IP connection to the web server named in the sUrl parameter. The variable was named http to make clear to anyone reading the source code what the object is used for.
      iErrnum = 0
      http =@ tcpsocket.new(sDomain, error=iErrnum)
If the connection fails, an error message is assigned to the return variable and the program exits. If it succeeds, the GET request is formulated and sent to the web server via the tcpsocket object referenced by the http variable.
      if http =@= .nul
        sResult = sERRTXT_CONNECT + sCRLF
      else
        // Full-Request and Full-Response use the generic message
        // format of RFC 822 for transferring entities. Both
        // messages may include optional header fields (also
        // known as "headers") and an entity body. The entity
        // body is separated from the headers by a null line
        // (i.e., a line with nothing preceding the CRLF).
        //
        //   Full-Request = Request-Line         ; Section 5.1
        //                  *( General-Header    ; Section 4.3
        //                   | Request-Header    ; Section 5.2
        //                   | Entity-Header )   ; Section 7.1
        //                  CRLF
        //                  [ Entity-Body ]      ; Section 7.2
        //
        // This is known as a full request in the format of HTTP
        // 1.0 but without any additional headers or an entity
        // body, therefore the closing second CRLF to complete
        // the message:
        //   Request-Line = Method SP Request-URI SP HTTP-Version
        //                  CRLF
        bContent = sGET + sSP + sUrl + sSP + sHTTPVER + sCRLF + \
                   sCRLF

        // Although it may not normally be necessary, it is far
        // more elegant to use a socket that will not wait
        // forever. By setting a timeout on the various socket
        // operations (default is .inf -- never) we remain in
        // control of the program, so that if a long time passes
        // with no or insufficient activity, the program can
        // exit properly. In a GUI-style program the user can be
        // asked whether to continue waiting or if they wish to
        // cancel the operation.
        http.sendblob(bContent, timeout=1, error=iErrnum)
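To make the wire format concrete, here is a small Python illustration (not SIMPOL, and not taken from urldump.sma) of the same header-less HTTP/1.0 request. The helper name `build_get_request` is hypothetical.

```python
def build_get_request(url: str, version: str = "HTTP/1.0") -> bytes:
    """Build a full request with no headers and no entity body."""
    # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
    request_line = f"GET {url} {version}\r\n"
    # With no header fields, the second CRLF immediately ends the message.
    return (request_line + "\r\n").encode("ascii")
```

In Python the equivalent of the timeout argument would typically be set on the socket itself, for example through the `timeout` parameter of `socket.create_connection()`, rather than per call.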
Assuming that there is no error when sending the request, the program sets up a loop to receive the response from the server. As described earlier, to ensure that the program doesn't hang while waiting for a response (which could happen if the server or the connection went down after the request was sent), each call to the receiveblob() method inside the loop is set to time out when the standard timeout value expires. The loop exits only if an error occurs, nothing is received on the connection within the timeout period, or the content received contains two consecutive carriage-return plus linefeed pairs.
Note: Technically this implementation is not as forgiving as it should be. According to the standard published in RFC 1945, applications should be reasonably tolerant of the formatting they accept. In particular, a lone linefeed should be recognized as a line terminator and any carriage return preceding it ignored (this accommodates UNIX-based programmers, for whom the carriage return is not normally part of the end-of-line sequence).
        if iErrnum != 0
          sResult = sERRTXT_SEND + .tostr(iErrnum, 10) + sCRLF
        else
          bReceive = ""
          // Now we retrieve the header (it may be more than
          // just the header that comes in, but we are technically
          // interested in the header at the moment).
          bHeader = ""
          while
            bTmp = ""
            bTmp = http.receiveblob(timeout=iTIMEOUT, error=iErrnum)
            bHeader = bHeader + .if(bTmp > "", bTmp, "")
            iPos = .inblob(bHeader, .toblob(sCRLF + sCRLF))
          end while iErrnum != 0 or bTmp <= "" or iPos > 0
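The shape of this receive loop is easier to experiment with when the socket is abstracted away. The Python sketch below is an illustration only: `receive_header` is a hypothetical name, and `recv` is assumed to be any callable that returns the next chunk of bytes, returns `b""` when nothing more arrives, and raises `TimeoutError` when the wait expires.

```python
def receive_header(recv, terminator: bytes = b"\r\n\r\n") -> tuple[bytes, bool]:
    """Accumulate chunks until the header terminator shows up.

    Returns (data, complete). The data may already contain part of the
    body, because a chunk can extend past the blank line.
    """
    buf = b""
    while True:
        try:
            chunk = recv()
        except TimeoutError:
            # Nothing arrived in time: give up with what we have.
            return buf, False
        if not chunk:
            # Empty read: the peer has nothing more to send.
            return buf, False
        buf += chunk
        if terminator in buf:
            return buf, True
```

Passing the reader in as a callable keeps the loop testable without a network connection; a real client would wrap its socket read in a small function or lambda.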
The previous receive loop may or may not have received the entire page, but it should either have received the entire header or have exited for some other reason. The next piece of code tests whether it did in fact receive the header and the associated separator. If so, the portion following the header (minus the separator) is assigned to the variable bReceive and the header alone is reassigned to the variable bHeader.
          // Now that we have received the entire header, we
          // examine the header. The first thing to evaluate is
          // the response code, since it needs to be in the 2XX
          // class for success. If it is a 4XX then we won't be
          // getting any content back.
          if iPos > 0
            bReceive = .subblob(bHeader, iPos + 4, .inf)
            bHeader = .subblob(bHeader, 1, iPos - 1)
          end if
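For comparison, a more tolerant split that accepts bare linefeeds as well as CRLF pairs could look like the following Python sketch. It is an illustration rather than a translation of the SIMPOL code, and the name `split_header_body` is hypothetical.

```python
import re

# A blank line ends the header block. Accept CRLF CRLF as well as the
# bare LF LF form that tolerant parsers are encouraged to recognize.
_BLANK_LINE = re.compile(rb"\r?\n\r?\n")

def split_header_body(data: bytes) -> tuple[bytes, bytes]:
    """Split raw response bytes into (header, body) at the first blank line."""
    match = _BLANK_LINE.search(data)
    if match is None:
        # No separator seen: treat everything as header, with an empty body.
        return data, b""
    return data[:match.start()], data[match.end():]
```

Using the match boundaries instead of a fixed offset of four bytes is what lets the same code handle both line-ending styles.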
The next step is to check the header and see what type of response was received from the web server. Unless the web server is using HTTP 0.9, there should be a response code. If there is none, then all we will get back is the body of the response, which will be either the requested page or some error text. If there is a full response, then we can evaluate the status line and see whether the request succeeded. If it did not, then there is no additional content to retrieve. In that case the bReceive variable is set either to its current value, if it has any content, or else to the empty string. This ensures that the later concatenation of the variables does not result in a value of .nul.
          // After receiving and interpreting a request message,
          // a server responds in the form of an HTTP response
          // message.
          //
          //
          //   Response        = Simple-Response | Full-Response
          //
          //   Simple-Response = [ Entity-Body ]
          //
          //   Full-Response   = Status-Line          ; Section 6.1
          //                     *( General-Header    ; Section 4.3
          //                      | Response-Header   ; Section 6.2
          //                      | Entity-Header )   ; Section 7.1
          //                     CRLF
          //                     [ Entity-Body ]      ; Section 7.2
          //
          //   Status-Line = HTTP-Version SP Status-Code SP
          //                 Reason-Phrase CRLF
          //   "HTTP/" 1*DIGIT "." 1*DIGIT SP 3DIGIT SP PHRASE CRLF
          //
          // Either we will get a simple response or a full
          // response.
          if .subblob(bHeader, 1, 4) == .toblob(sHTTP)
            iPos = .inblob(bHeader, .toblob(sSP))
            if iPos > 0
              bStatus = .subblob(bHeader, iPos + 1, 3)
            end if
          end if

          if bStatus > "" and bStatus[1] != '2'
            // The page was not found for some reason
            // If the bReceive section is empty, we need to
            // set it to the empty blob (and not .nul) for
            // output later.
            bReceive = .if(bReceive >= "", bReceive, "")
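The status-line check can be sketched in Python as follows. This is an illustration of the grammar quoted in the comments, not the SIMPOL implementation; the names `parse_status_code` and `is_success` are hypothetical.

```python
def parse_status_code(header: bytes):
    """Return the 3-digit status code, or None for a simple response."""
    # A full response starts with "HTTP/<major>.<minor>"; anything else
    # is treated as an HTTP/0.9 style simple response with no status line.
    if not header.startswith(b"HTTP"):
        return None
    first_line = header.splitlines()[0]
    # Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase
    parts = first_line.split(b" ", 2)
    if len(parts) < 2 or len(parts[1]) != 3 or not parts[1].isdigit():
        return None
    return int(parts[1])

def is_success(code) -> bool:
    """Mirror the check above: only a known non-2xx code counts as failure."""
    return code is None or 200 <= code <= 299
```

Treating a missing status code as success matches the behavior described above, where a simple response carries only a body.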
Assuming that the request succeeded, the next thing to look for is the content length field in the header. Once we either have a content length value or have established that there is none, the final step is to read the remainder of the output from the web server. The content length helps us decide when to stop reading. According to the standard it is neither required nor guaranteed to be correct, but for the purposes of this program we will assume that it is.
          else
            // and look for the "content-length" header field.
            iContentLength = -1
            iPos = .inblob(.toblob(.lcase(bHeader.getstring(1, .inf,\
                           1))), .toblob(sCONTENTLENGTH))
            if iPos > 0
              iPos2 = .inblob(.subblob(bHeader, iPos + \
                              .len(sCONTENTLENGTH), .inf),\
                              .toblob(sCRLF))
              if iPos2 > 0
                bTmp = .subblob(bHeader, iPos + .len(sCONTENTLENGTH),\
                                iPos2)
                iContentLength = .toval(bTmp.getstring(1, .inf, 1),\
                                        sIGNORECHARS, 10)
              end if
            end if

            // If we found a "content-length" header, then we know
            // how much data is still to come. If we don't, then we
            // can only rely on the timeout and continually loop
            // until we receive nothing on the connection.
            if iContentLength >= 0
              while bReceive.size < iContentLength
                bTmp = ""
                bTmp = http.receiveblob(timeout=iTIMEOUT, \
                                        error=iErrnum)
                bReceive = bReceive + \
                           .if(bTmp > "", bTmp, .toblob(""))
              end while iErrnum != 0 or bTmp <= ""
            else
              while
                bTmp = ""
                bTmp = http.receiveblob(timeout=iTIMEOUT, \
                                        error=iErrnum)
                bReceive = bReceive + \
                           .if(bTmp > "", bTmp, .toblob(""))
              end while iErrnum != 0 or bTmp <= ""
            end if
          end if
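The two reading strategies (stop at the declared length, or read until the connection goes quiet) can be illustrated with a short Python sketch. The function names are hypothetical, and `recv` is again assumed to be a callable that returns the next chunk of bytes, returns `b""` when nothing more arrives, and raises `TimeoutError` on timeout.

```python
def find_content_length(header: bytes) -> int:
    """Return the Content-Length value, or -1 when absent or malformed."""
    for line in header.splitlines():
        name, sep, value = line.partition(b":")
        # Header field names are case-insensitive.
        if sep and name.strip().lower() == b"content-length":
            value = value.strip()
            return int(value) if value.isdigit() else -1
    return -1

def read_body(recv, already: bytes, content_length: int) -> bytes:
    """Read the rest of the body, starting from bytes already received."""
    body = already
    # With a known length, stop once enough bytes have arrived; otherwise
    # keep reading until the peer stops sending or the wait times out.
    while content_length < 0 or len(body) < content_length:
        try:
            chunk = recv()
        except TimeoutError:
            break
        if not chunk:
            break
        body += chunk
    return body
```

As in the program above, this sketch trusts the declared length and simply stops reading once that many bytes have been collected.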
Now that we have all of the output from the web server (regardless of how much that actually
is) it is time to formulate the response to the user, either one of success or failure. Also
the output from the web server needs to be written to the output file.
          // Finally, we deal with the result, which is either
          // success or failure. If failure, we need to tell the
          // user what went wrong.
          if iErrnum != 0 and iErrnum != 705
            sResult = sERRTXT_RECEIVINGDATA + \
                      .tostr(iErrnum, 10) + sCRLF
          else
            fpo =@ fsfileoutputstream.new(sOutfile, error=iErrnum)
            if fpo =@= .nul or iErrnum != 0
              sResult = sERRTXT_FILEOPENFAILED + sCRLF
            else
              if bStatus > "" and bStatus[1] != '2'
                sResult = sERRTXT_PAGE + sUrl + \
                          sERRTXT_NOTFOUND + sCRLF
              else
                sResult = sERRTXT_PAGE + sUrl + \
                          sERRTXT_SUCCESS + sCRLF
              end if
              fpo.putblob(bHeader + .toblob(sCRLF + sCRLF) + \
                          bReceive)
            end if
          end if
        end if
      end if
    end function sResult
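The final step, writing the header, the separator, and the body back out as one file, is simple enough to show in a few lines of Python. This is a sketch for illustration; `write_dump` is a hypothetical name and the error handling of the SIMPOL version is reduced to letting exceptions propagate.

```python
def write_dump(path: str, header: bytes, body: bytes) -> int:
    """Write header, blank-line separator, and body; return bytes written."""
    # The separator stripped during parsing is restored so the file
    # looks like the raw response as it came off the wire.
    payload = header + b"\r\n\r\n" + body
    # Binary mode keeps the CRLF pairs intact on every platform.
    with open(path, "wb") as out:
        return out.write(payload)
```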