Practical Example — URLDump
	Chapter 16. Client Applications Using TCP/IP

The program urldump.sma implements a basic HTTP client that sends a GET request to a web server, receives the response, and writes the entire returned page to a target file. Although a browser can do the same thing, this program also outputs the headers sent by the server, which a browser normally strips off. The headers are often the most interesting part, especially when you are trying to track down why something is going wrong. The code in this example could also be reused as the basis for a lightweight browser component, a web crawling robot, or any number of other useful programs built on HTTP.

In the Beginning …

To start with, we need to create a few useful symbolic constants. The use of symbolic constants is what makes a program readable and easy to maintain. Just as using styles in a document makes it easy to change the look and feel of a document very quickly, the same is true of a good computer program.

Example 16.1. Constants portion of the urlget program

//////////////////////////////////////
//        Symbolic Constants        //
//////////////////////////////////////

constant sURLID                 "://"
constant sSTDPORT               ":80"
constant iTIMEOUT               1000000
constant sGET                   "GET"
constant sSP                    " "
constant sCRLF                  "{d}{a}"
constant sHTTPVER               "HTTP/1.0"
constant sURIBASE               "http://"
constant sCONTENTLENGTH         "content-length:"
constant sHTTP                  "HTTP"
constant sIGNORECHARS           " {d}{a}"

constant sERRTXT_CONNECT        "Failed to connect to host"
constant sERRTXT_SEND           "Error sending request to host, \
                                 error number: "
constant sERRTXT_RECEIVINGDATA  "Error receiving results from \
                                 host, number: "
constant sERRTXT_FILEOPENFAILED "Error opening output file"
constant sERRTXT_PAGE           "Page '"
constant sERRTXT_NOTFOUND       "' not found"
constant sERRTXT_SUCCESS        "' successfully retrieved"

There are two types of constants listed in this section: one set consists of the values used in various parts of the program, and the other consists of the error and success messages that are returned to the user. Many of these values may not make much sense yet, although the names of the constants help clarify their meaning, and seeing how they are used in the program code will clear up the rest.

The Main Event

Now that we have established the constants we will be using (obviously, these actually got created during the writing and restructuring of the program, not before the work began), let's have a look at the program code.

Example 16.2. Beginning of the main() function of the urlget program

function main(string sUrl, string sOutfile)
  tcpsocket http
  string sDomain, sResult
  integer iErrnum, iPos, iContentLength, iPos2
  fsfileoutputstream fpo
  blob bContent, bReceive, bTmp, bHeader, bStatus

  sResult = ""
  bTmp = ""

  if .instr(sUrl, sURLID) == 0
    sUrl = sURIBASE + sUrl
  end if
  sDomain = getdomainroot(sUrl)

  // Now do a quick check and make sure that if they provided
  // a URL like www.foobar.com without the ending slash, that
  // we add it.
  if sDomain == .rstr(sUrl, .len(sDomain))
    sUrl = sUrl + '/'
  end if

  if .instr(sDomain, ":") == 0
    sDomain = sDomain + sSTDPORT
  end if

The program starts with the main() function. It takes two parameters: the URL of the page to retrieve and the name of the output file in which the retrieved page should be stored. After declaring and initializing the variables, the program prepends http:// to the URL if no scheme was supplied and then extracts the root domain from it, since that is what is needed to create the connection. It also checks whether only a base domain was passed without the closing slash; if so, it appends one, since without the closing slash the attempt to retrieve the default page from the web server will fail. Finally, the root domain is checked to see if it includes the optional port information. If it does (such as :8080), the program does nothing; if there is no port information (the normal case), it adds :80 to the end of the domain root. This is necessary since the first parameter to the tcpsocket.new() method is the destination in the format of either IP address:port or domain name:port.
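The normalization steps described above can be sketched in Python (used here only for illustration; it is not the language of the example program). The function name normalize_url is hypothetical, and the host extraction is an assumed stand-in for getdomainroot(), whose implementation is not shown in this section.

```python
def normalize_url(url):
    """Return (request_url, host_with_port) for a user-supplied URL."""
    # Prepend the scheme when the caller left it out.
    if "://" not in url:
        url = "http://" + url

    # Assumed stand-in for getdomainroot(): the host part without
    # scheme or path, for example "www.foobar.com" or "host:8080".
    rest = url.split("://", 1)[1]
    host = rest.split("/", 1)[0]

    # A bare domain needs a closing slash so that the server's
    # default page can be requested.
    if url.endswith(host):
        url += "/"

    # Default to port 80 when no port was given.
    if ":" not in host:
        host += ":80"

    return url, host
```

Calling normalize_url("www.foobar.com") yields the request URL http://www.foobar.com/ and the connection target www.foobar.com:80, which matches the destination format described above.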

Once the basic initialization has been completed, the program then attempts to open a TCP/IP connection to the web server named in the sUrl parameter. The variable was named http to make clear to anyone reading the source code what the object is used for.

Example 16.3. Creating the socket connection in the urlget program

  iErrnum = 0
  http =@ tcpsocket.new(sDomain, error=iErrnum)

If the connection fails, an error message is assigned to the return variable and the program exits. If it is successful, however, then the GET request is formulated and sent to the web server via the tcpsocket object referenced via the http variable.

Example 16.4. Beginning the TCP/IP conversation in the urlget program

  if http =@= .nul
    sResult = sERRTXT_CONNECT + sCRLF
  else
    // Full-Request and Full-Response use the generic message
    // format of RFC 822 for transferring entities. Both
    // messages may include optional header fields (also
    // known as "headers") and an entity body. The entity
    // body is separated from the headers by a null line
    // (i.e., a line with nothing preceding the CRLF). 
    //
    // Full-Request   = Request-Line           ; Section 5.1
    //                  *( General-Header      ; Section 4.3
    //                   | Request-Header      ; Section 5.2
    //                   | Entity-Header )     ; Section 7.1
    //                  CRLF
    //                  [ Entity-Body ]        ; Section 7.2

    // 
    // This is known as a full request in the format of HTTP
    // 1.0 but without any additional headers or an entity
    // body, therefore the closing second CRLF to complete
    // the message:
    // Request-Line = Method SP Request-URI SP HTTP-Version
    //                CRLF

    bContent = sGET + sSP + sUrl + sSP + sHTTPVER + sCRLF + \
               sCRLF

    // Although it may not normally be necessary, it is far
    // more elegant to use a socket that will not wait
    // forever. By setting a timeout on the various socket
    // operations (default is .inf -- never) we remain in
    // control of the program, so that if a long time passes
    // with no or insufficient activity, the program can
    // exit properly. In a GUI-style program the user can be
    // asked whether to continue waiting or if they wish to
    // cancel the operation.

    http.sendblob(bContent, timeout=1, error=iErrnum)
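As an illustration of the request assembled above, the following Python sketch builds the same byte sequence: a request line followed by an empty line. The function name build_get_request is hypothetical; the literal values mirror the constants from Example 16.1.

```python
CRLF = "\r\n"  # carriage return plus linefeed, as in sCRLF


def build_get_request(url, version="HTTP/1.0"):
    """Build a GET request with no extra headers and no entity body."""
    # Request-Line = Method SP Request-URI SP HTTP-Version CRLF
    request_line = "GET" + " " + url + " " + version + CRLF
    # The second CRLF is the empty line that ends the header section.
    return (request_line + CRLF).encode("ascii")
```

For the URL http://www.foobar.com/ this produces the bytes of "GET http://www.foobar.com/ HTTP/1.0" followed by two CRLF pairs.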

Assuming that there is no error when sending the request, the program now sets up a loop to receive the response from the server. As described earlier, to ensure that the program doesn't hang while waiting for a response (which could happen if the server or the connection went down after the request was sent), each call to the receiveblob() method inside the loop is set to time out when the standard timeout value expires. The loop exits only if an error occurs, nothing is received on the connection within the timeout period, or the content received contains two consecutive carriage-return plus linefeed pairs.
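The shape of such a receive loop can be sketched in Python. This is a minimal illustration under stated assumptions: recv is a hypothetical callable that returns the next chunk of bytes, or an empty value when the timeout expires or the connection closes. It is not the API of the tcpsocket object.

```python
from typing import Callable, Optional, Tuple

HEADER_END = b"\r\n\r\n"  # two consecutive CRLF pairs


def receive_header(recv: Callable[[], Optional[bytes]]) -> Tuple[bytes, int]:
    """Read chunks until the header terminator appears.

    Returns (buffer, pos), where pos is the 0-based index of the
    terminator, or -1 if reading stopped before one was seen.
    """
    buffer = b""
    while True:
        chunk = recv()
        if not chunk:  # timeout, error, or closed connection
            return buffer, -1
        buffer += chunk
        pos = buffer.find(HEADER_END)
        if pos >= 0:  # header complete; body bytes may follow
            return buffer, pos
```

With a real Python socket, recv would have to wrap sock.recv() and translate a timeout exception into an empty result, since a Python socket with a timeout raises an exception instead of returning nothing.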

Note

Technically, this implementation is not as forgiving as it should be. According to RFC 1945, applications should be reasonably tolerant of the formatting they accept: when parsing headers, a lone linefeed should be recognized as the line terminator and any carriage return preceding it should be ignored. This accommodates UNIX-based programmers, for whom the carriage return is not normally part of the end-of-line sequence.
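A more tolerant check for the end of the header could look like the following Python sketch. It is an illustration of the tolerance described in the note, not part of the example program; the name find_header_end is hypothetical.

```python
import re

# An empty line ends the header. Each line ending may be CRLF or a
# bare LF, so the carriage returns are optional in the pattern.
_BLANK_LINE = re.compile(rb"\r?\n\r?\n")


def find_header_end(data):
    """Return (start, end) of the header/body separator, or None."""
    match = _BLANK_LINE.search(data)
    if match is None:
        return None
    return match.start(), match.end()
```

The header is then data[:start] and the body begins at data[end], regardless of which line-ending style the server used.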

Example 16.5. Retrieving the header from the web server in the urlget program

    if iErrnum != 0
      sResult = sERRTXT_SEND + .tostr(iErrnum, 10) + sCRLF
    else
      bReceive = ""

      // Now we retrieve the header (it may be more than
      // just the header that comes in, but we are technically
      // interested in the header at the moment).
      bHeader = ""
      while 
        bTmp = ""
        bTmp = http.receiveblob(timeout=iTIMEOUT, error=iErrnum)
        bHeader = bHeader + .if(bTmp > "", bTmp, "")
        iPos = .inblob(bHeader, .toblob(sCRLF + sCRLF))
      end while iErrnum != 0 or bTmp <= "" or iPos > 0

The previous receive loop may or may not have received the entire page but it should have received either the entire header or it exited for some other reason. The next piece of code tests to see if, in fact, it did receive the header and the associated separator. If so, the portion following the header (minus the separator) is assigned to the variable bReceive and the header alone is reassigned to the variable bHeader.

Example 16.6. Checking the response code in the web page header in the urlget program

      // Now that we have received the entire header, we
      // examine the header. The first thing to evaluate is
      // the response code, since it needs to be in the 2XX
      // class for success. If it is a 4XX then we won't be
      // getting any content back.
      
      if iPos > 0
        bReceive = .subblob(bHeader, iPos + 4, .inf)
        bHeader = .subblob(bHeader, 1, iPos - 1)
      end if
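The split performed above can be sketched in Python as follows. Note that Python indexes from 0, whereas the positions in the example are 1-based; the function name split_response is hypothetical.

```python
def split_response(buffer, pos):
    """Split received bytes into (header, body).

    pos is the 0-based index of the CRLF CRLF separator in buffer,
    or -1 if no separator was found.
    """
    if pos < 0:
        # No separator: leave everything in the header part.
        return buffer, b""
    # Skip the four separator bytes between header and body.
    return buffer[:pos], buffer[pos + 4:]
```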

The next step is to check the header and see what type of response was received from the web server. Unless the web server is using HTTP 0.9 there should be a response code. If there is none, then all we will get back is the body of the response, which will either be the requested page or some error text. If there is a full response, then we can evaluate the status line and see if the request succeeded. If it did not, then there is no additional content to retrieve. The bReceive variable is set to be equal either to its current value if it has any content or else to the empty string. This is to ensure that the concatenation of the variables later does not result in a value of .nul.
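The status check described above can be sketched in Python. The names parse_status and is_success are hypothetical; as in the approach described, a response without a status line (a simple response) is treated as success.

```python
def parse_status(header):
    """Return the three-digit status code as bytes, or None."""
    # A full response starts with "HTTP/<major>.<minor> ".
    if not header.startswith(b"HTTP"):
        return None  # simple response: body only, no status line
    space = header.find(b" ")
    if space < 0:
        return None
    # Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase
    return header[space + 1:space + 4]


def is_success(status):
    """Only the 2XX class counts as success; no status also passes."""
    return not status or status[:1] == b"2"
```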

Example 16.7. Parsing the web page header in the urlget program

      // After receiving and interpreting a request message,
      // a server responds in the form of an HTTP response
      // message. 
      //
      //
      //   Response        = Simple-Response | Full-Response
      //
      //   Simple-Response = [ Entity-Body ]
      //
      //   Full-Response   = Status-Line        ; Section 6.1
      //                     *( General-Header  ; Section 4.3
      //                      | Response-Header ; Section 6.2
      //                      | Entity-Header ) ; Section 7.1
      //                     CRLF
      //                     [ Entity-Body ]    ; Section 7.2
      //
      // Status-Line = HTTP-Version SP Status-Code SP 
      //               Reason-Phrase CRLF
      // "HTTP/" 1*DIGIT "." 1*DIGIT SP 3DIGIT SP PHRASE CRLF
      //
      // Either we will get a simple response or a full
      // response.


      if .subblob(bHeader, 1, 4) == .toblob(sHTTP)
        iPos = .inblob(bHeader, .toblob(sSP))
        if iPos > 0
          bStatus = .subblob(bHeader, iPos + 1, 3)
        end if
      end if

      if bStatus > "" and bStatus[1] != '2'
        // The page was not found for some reason
        // If the bReceive section is empty, we need to
        // set it to the empty blob (and not .nul) for
        // output later.
        bReceive = .if(bReceive >= "", bReceive, "")

Assuming that the request succeeded, the next thing to look for is the content length field in the header. Once we either have a content length value or have established that there is none, the final step is to read the remainder of the output from the web server. The content length can help us decide when to stop, but according to the standard it is neither required nor always correct. For the purpose of this program we will assume that it is correct when present.
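One way to look up the content length is sketched below in Python. It walks the header line by line, which is a different technique from a substring search over the whole header, and the name content_length is hypothetical. A return value of -1 means that no usable value was found.

```python
def content_length(header):
    """Return the Content-Length value from a header block, or -1."""
    for line in header.split(b"\r\n"):
        name, sep, value = line.partition(b":")
        # Header field names are case-insensitive.
        if sep and name.strip().lower() == b"content-length":
            value = value.strip()
            return int(value) if value.isdigit() else -1
    return -1
```

Because the header is split into lines first, the field is found even when it is the last line of the header and no CRLF follows it.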

Example 16.8. Retrieving the web page content in the urlget program

      else
        // The request succeeded, so look for the
        // "content-length" header field.
        iContentLength = -1
        iPos = .inblob(.toblob(.lcase(bHeader.getstring(1, .inf,\
                       1))), .toblob(sCONTENTLENGTH))
        if iPos > 0
          iPos2 = .inblob(.subblob(bHeader, iPos + \
                                   .len(sCONTENTLENGTH), .inf),\
                          .toblob(sCRLF))
          if iPos2 > 0
            bTmp = .subblob(bHeader, iPos + .len(sCONTENTLENGTH),\
                            iPos2)
            iContentLength = .toval(bTmp.getstring(1, .inf, 1),\
                                    sIGNORECHARS, 10)
          end if
        end if

        // If we found a "content-length" header, then we know
        // how much data is still to come. If we don't, then we
        // can only rely on the timeout and continually loop
        // until we receive nothing on the connection.

        if iContentLength >= 0
          while bReceive.size < iContentLength
            bTmp = ""
            bTmp = http.receiveblob(timeout=iTIMEOUT, \
                                    error=iErrnum)
            bReceive = bReceive + \
                       .if(bTmp > "", bTmp, .toblob(""))
          end while iErrnum != 0 or bTmp <= ""
        else
          while 
            bTmp = ""
            bTmp = http.receiveblob(timeout=iTIMEOUT, \
                                    error=iErrnum)
            bReceive = bReceive + \
                       .if(bTmp > "", bTmp, .toblob(""))
          end while iErrnum != 0 or bTmp <= ""
        end if
      end if
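The two receive loops above differ only in their stopping rule, so they can be folded into one function, as in this Python sketch. As before, recv is a hypothetical callable that returns the next chunk of bytes or an empty value on timeout, error, or close; receive_body is likewise a hypothetical name.

```python
def receive_body(recv, body=b"", length=-1):
    """Read the rest of the response body.

    body holds whatever arrived together with the header. With
    length >= 0, reading stops once that many bytes are present;
    otherwise it continues until recv() delivers nothing.
    """
    while length < 0 or len(body) < length:
        chunk = recv()
        if not chunk:  # timeout, error, or closed connection
            break
        body += chunk
    return body
```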

Now that we have all of the output from the web server (regardless of how much that actually is), it is time to formulate the response to the user, either one of success or failure. The output from the web server also needs to be written to the output file.

Example 16.9. Returning the results to the user in the urlget program

      // Finally, we deal with the result, which is either
      // success or failure. If failure, we need to tell the
      // user what went wrong.

      if iErrnum != 0 and iErrnum != 705
        sResult = sERRTXT_RECEIVINGDATA + \
                  .tostr(iErrnum, 10) + sCRLF
      else
        fpo =@ fsfileoutputstream.new(sOutfile, error=iErrnum)
        if fpo =@= .nul or iErrnum != 0
          sResult = sERRTXT_FILEOPENFAILED + sCRLF
        else
          if bStatus > "" and bStatus[1] != '2'
            sResult = sERRTXT_PAGE + sUrl + \
                      sERRTXT_NOTFOUND + sCRLF
          else
            sResult = sERRTXT_PAGE + sUrl + \
                      sERRTXT_SUCCESS + sCRLF
          end if
          fpo.putblob(bHeader + .toblob(sCRLF + sCRLF) + \
                      bReceive)
        end if
      end if
    end if
  end if
end function sResult