Component "cgi"
Netcgi

CGI is a standard way of connecting a web application to a web server. When the server gets a request that is bound to a CGI-based web application, it starts a new process running the application and transfers the request to that process, which is expected to generate a reply. Finally, the process terminates. This means that CGI deals with the protocol between the server and the application, and with a certain process model. It deals neither with the underlying Internet protocol (HTTP) nor with the language used for the transferred contents.

For these reasons, we call CGI a connector, and we should keep in mind that there are alternative connectors, e.g. the Ajp protocol. Some features in Netcgi are connector-specific, but most are not. It was a design principle of Netcgi to separate both domains from each other. The benefit for the programmer is that it is quite easy to change the connector in use, or even to create programs that support several connectors.

The following explanations are generic and valid for all connectors, except when a certain connector is explicitly mentioned.

There are two ways of calling a web application:
When the user clicks on the link or presses the button, the browser sends an HTTP request to the web server, and the server forwards the request to the web application. The application generates an HTML page as its reply, and the browser displays this page next. As explained, only POST requests should have side effects.

The Netcgi library contains functions to analyze arriving web requests and to produce a formally correct reply. Normally, there are three O'Caml modules involved:
The simplest possible CGI application

This application looks as follows:

    let actobj = new Netcgi.std_activation() in
    actobj # set_header();
    actobj # output # output_string "<HTML><BODY>Hello world!</BODY></HTML>\n";
    actobj # output # commit_work()
    ;;

The first line creates a new activation object. This also creates a new environment object and initializes both objects from the real environment. Effectively, the web request has been read in, and the program could now continue by analyzing the request. Since this is the simplest possible web application, we do not look at the request.

Instead, the reply is prepared in the second line by setting the header (of the reply). Note that this does not mean that the header is sent immediately; Netcgi may defer this until the "right moment" has come. When the header is set, one can choose to transfer additional information in the header. Of course, the simplest possible web application uses a standard header without extra features.

The third line outputs the contents of the reply. actobj # output is a so-called netchannel, an object-oriented encapsulation of I/O channels. See the chapter about netchannels for details. Netchannels have a method output_string that appends the passed string to the channel, which means here that the string is appended to the reply.

The fourth statement commits the data appended to the netchannel. This means that the data are valid and must be sent to the server. In this example, the only effect of the statement is that the internal buffers are flushed. However, the statement can mean much more, as there is a special mode where you can say afterwards that the data are invalid and must not be sent; the next chapter explains this special transactional mode.

Note that the output channel is normally not closed. This is usually the task of the connector, although it would be harmless here.

These four lines work only with the CGI connector.
The first statement implicitly selects the CGI connector, which is the default if nothing else is specified. The other three statements are connector-independent.

Out of scope: After you have compiled these four lines, the question remains how to configure your web server such that this application is bound to a certain URL. This depends on the web server and cannot be answered here.

Command-line: You can start this application on the command-line. It detects automatically that it runs outside a CGI environment and enters the test mode. You can input test parameters, and a single dot (".") lets the program continue. This prints the reply on the screen. There are even command-line options; use -help to get a summary.

Bad style: Don't output the reply directly on stdout! This is incompatible with a number of Netcgi features, including generating a good header, the web transactions, and of course non-standard connectors.

Direct replies vs. transactions

By default, the activation object is configured such that it sends the reply immediately to the web server (at least when you flush the internal buffer of the output channel). This is called the direct mode. It has a disadvantage: When an error happens in the middle of page generation, a part of the page has already been sent to the server, and the only thing you can do is to append the error message to the end of the text already sent. The consequences: Users often do not see that there is an error at all because the error message is at the bottom of the page or even invisible. It even becomes possible that incompletely initialized forms can be submitted, leading to follow-up errors.

For reliable applications it is better to select the transactional mode. Instead of being sent immediately to the server, the reply is completely buffered until it is committed (by calling commit_work). Alternatively, it is also possible to discard the whole reply (by calling rollback_work) and to generate a completely different one.
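The buffering idea behind the transactional mode can be modelled with the standard library alone. The following is a minimal sketch, not Netcgi's implementation: the class buffered_trans_channel, the function render_page, and the use of a Buffer.t as the "server" side are all assumptions made for illustration.

```ocaml
(* A stdlib-only model of a transactional output channel.
   [buffered_trans_channel] is a hypothetical name, not part of Netcgi;
   [sink] stands in for the connection to the web server. *)
class buffered_trans_channel (sink : Buffer.t) =
  object
    val pending = Buffer.create 256

    (* Append to the transaction buffer; nothing reaches the sink yet. *)
    method output_string (s : string) = Buffer.add_string pending s

    (* Declare the buffered data valid and hand it over to the sink. *)
    method commit_work () =
      Buffer.add_buffer sink pending;
      Buffer.clear pending

    (* Discard everything written since the last commit. *)
    method rollback_work () = Buffer.clear pending
  end

(* Write a page; if [body] raises, drop the partial page and
   send an error page instead. *)
let render_page (ch : buffered_trans_channel) (body : unit -> string) =
  try
    ch#output_string "<HTML><BODY>";
    ch#output_string (body ());
    ch#output_string "</BODY></HTML>";
    ch#commit_work ()
  with _ ->
    ch#rollback_work ();
    ch#output_string "<HTML><BODY>Internal error</BODY></HTML>";
    ch#commit_work ()
```

In this model a failure in the middle of page generation leaves no fragment of the half-written page in the sink, which is the property the direct mode cannot offer.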
The following scheme should be used to enable transactions:

    let process (actobj:cgi_activation) =
      try
        actobj # set_header ... ();
        ... Code that writes to actobj # output ...
        actobj # output # commit_work();
      with
        error ->
          actobj # output # rollback_work();
          actobj # set_header ~status:`Internal_server_error ();
          ... Code that writes the error message to actobj # output ...
          actobj # output # commit_work()
    in
    let operating_type = Netcgi.buffered_transactional_optype in
    process (new Netcgi.std_activation ~operating_type ())

The transactional mode is selected by passing the argument ~operating_type. There are two implementations, one keeping the buffer in main memory (buffered_transactional_optype), and one using an external file as buffer (tempfile_transactional_optype). The latter should be used for very large replies (one megabyte and up).

Normally, the same statements are executed as in the direct case: set_header is called, then the reply is written to the actobj # output channel, and finally commit_work is called to flag the data as valid and to flush the whole transaction buffer.

When an exception is raised, the try...with... block catches it. First, the transaction buffer is discarded (rollback_work), and a different header is set. Here, we set the status code to `Internal_server_error. This is important because pages with an error status are treated differently from pages with a successful status; mainly, the browser is not allowed to put them into its cache. You can send other status codes as long as they indicate either a client or a server error (e.g. you can respond with `Forbidden when the user is not allowed to access the page). Next, the error message is written into the transaction buffer, and finally, the buffer is committed.

Some details to better understand what is happening:
Although transactions increase the reliability of a web application, they do not fully ensure it. There is still no way to guarantee that the network connection does not break, or that the browser processes the data as intended by the application. Nevertheless, transactions are useful as they allow errors to be reported in a reasonable way.

Transactions are independent of the selected connector.

Processing Parameters

After the activation object has been created, it allows one to access the parameters coming with the web request. These four methods read the currently available parameters:

    method arguments : (string * cgi_argument) list
    method argument : string -> cgi_argument
    method argument_value : ?default:string -> string -> string
    method multiple_argument : string -> cgi_argument list

In particular, arguments returns all parameters as an associative list, argument returns the first parameter with the passed name, and multiple_argument returns all parameters with a certain name (it is allowed to have several parameters with the same name).

These three methods return cgi_argument objects containing all properties of the parameters. These properties include the name and the value of the parameter, but there may be more (see below). In particular, the method value returns the current value, so the call

    (actobj # argument "foo") # value

returns the string value of the argument, or raises Not_found if there is no parameter called "foo". As an abbreviation, one can alternatively write

    actobj # argument_value "foo"

which is exactly the same. Furthermore, argument_value accepts a default value that is returned if the parameter is missing. The following two expressions are again equivalent:

    try (actobj # argument "foo") # value with Not_found -> "0"

    actobj # argument_value ~default:"0" "foo"

For simple arguments having only a name and a value, this is already the whole story. File uploads require that arguments have some extra capabilities.
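Before turning to uploads, the lookup rules for simple arguments can be modelled on a plain association list. This is only a sketch under stated assumptions: the parameter list args stands in for the activation object, values are plain strings rather than cgi_argument objects, and the behaviour without ~default (raising Not_found) follows the description above rather than the library source.

```ocaml
(* Stdlib-only model of the lookup rules. [args] replaces the activation
   object; this is an assumption of the sketch, not the Netcgi API. *)

(* First parameter with the given name; raises Not_found if absent. *)
let argument (args : (string * string) list) (name : string) : string =
  List.assoc name args

(* Like [argument], but falls back to [default] when one is given. *)
let argument_value ?default args name =
  try argument args name
  with Not_found ->
    (match default with
     | Some d -> d
     | None -> raise Not_found)

(* All parameters with the given name, in their original order. *)
let multiple_argument args name =
  List.map snd (List.filter (fun (n, _) -> n = name) args)
```

For example, with the list ["x", "1"; "y", "a"; "x", "2"], argument_value finds "1" for "x", while multiple_argument returns both values for "x".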
An HTML form containing file upload elements must look like:

    <FORM ACTION="...anyURL..." METHOD="POST" ENCTYPE="multipart/form-data">
    ...
    <INPUT TYPE="FILE" NAME="...anyName...">
    ...
    </FORM>

The attributes METHOD and ENCTYPE must have exactly these values, otherwise the upload will not work.

Arguments resulting from file uploads have some extra properties:
The first three of these properties can be retrieved simply by calling the right methods of the argument object, as in

    let arg = actobj # argument "...anyName..."
    let content_type = arg # content_type
    let charset = arg # charset
      (* or: let charset = List.assoc "charset" (arg # content_type_params) *)
    let filename = arg # filename

Note that filename is None if the file upload box was not used.

The last special feature of upload arguments is the possibility to store the uploaded contents of files in files on the server side. This must be explicitly enabled; by default the contents are kept in main memory. To do so, you must pass the option ~processing when you create the activation object, as in

    let processing arg_name arg_header =
      (* return either `Memory, `File, or `Automatic *)
      ...
    in
    let actobj = new std_activation ~processing ()

The function processing is called just after the header of the argument has been received, and the function must decide how to handle the argument. The function is called for all MIME arguments, not only file upload arguments![1] Possible result values are: `Memory meaning that the argument is loaded into memory, `File meaning that the argument is stored in a file, or `Automatic meaning that arguments with filenames are stored in files, and all others are kept in memory. What you probably want is

    let processing arg_name arg_header = `Automatic

What changes when you store an argument in a file? At first glance: nothing. When you call value or argument_value, the contents of the argument are automatically loaded from the file. However, you can now call some extra methods:

- You can query where the argument is stored:

      let store = arg # store in
      let server_filename =
        match store with
          `File name -> name
        | `Memory -> failwith "Argument not stored in file"

  The store method returns the location where the contents of the argument are stored, either `Memory or `File _; the returned filename is the absolute path of the file in the filesystem of the server.

- Furthermore, you can open the file for reading:

      let ch = arg # open_value_rd()

  Here, ch is an input netchannel reading the contents of the file[2].

- Last but not least, it is a good idea to delete such files after usage. To do so, just call:

      arg # finalize()

  This deletes the file if it still exists. You can also call actobj # finalize() which deletes all existing argument files of the current activation at once.

Arguments work for all connectors in the same way.

Setting the header

There are two methods that set the header of the reply:

    method set_header :
      ?status:status ->
      ?content_type:string ->
      ?cache:cache_control ->
      ?filename:string ->
      ?language:string ->
      ?script_type:string ->
      ?style_type:string ->
      ?set_cookie:cgi_cookie list ->
      ?fields:(string * string list) list ->
      unit -> unit
    method set_redirection_header : string -> unit

set_header must be used for a normal reply, while set_redirection_header initiates a redirection to another URL.

set_header has a number of optional arguments. You can omit all arguments, in which case a standard header is used. We discuss here only status, content_type, filename, and set_cookie:
The other header-setting method is used for redirections. There are two types of redirections, server-driven and client-driven. If the redirection is performed by the web server, the browser will not notice that a redirection has happened; the browser just gets the page to which the redirection points. Of course, this works only if the new location is served by the same web server. In the case of client-driven redirections, a special reply is sent to the browser asking it to go to the new location. The user normally sees this because the web address changes.

Use set_redirection_header in both cases. A server-driven redirection will be chosen if you only pass the part of the URL that contains the path, and the client-driven redirection is taken for absolute URLs:

    actobj # set_redirection_header "/cgi/myapplication";                 (* server-driven *)
    actobj # set_redirection_header "http://myserver/cgi/myapplication";  (* client-driven *)

Note that the url method can be useful to create these strings when they are derived from the current URL. See below for details.

After you have set the header, it is still necessary to commit the output channel, although you need not output anything. So don't forget to call actobj # output # commit_work()!

The environment

Up to now, we have mostly used the activation object. This object represents the knowledge about the current request/response cycle. There is a second object, the environment, that focuses more on the conditions on which the current cycle is based, and that also represents the connection method in use. In particular, you can get the following information by looking at the environment:
Most of these data are under control of the activation object, and it is strongly recommended not to modify any header, channel, or state as this would confuse the activation object. Of course, it is allowed to read the current header, and the meta data:

    let env = actobj # environment
    let date = env # input_header_field "date"
    let ctype = env # input_header_field "content-type"
    let script_name = env # cgi_property "SCRIPT_NAME"
      (* or: env # cgi_script_name *)

Note that the names of header fields are normalized according to the conventions of HTTP even if the connection protocol uses a different style. For example, CGI transports the content type as "CONTENT_TYPE", and the user agent string as "HTTP_USER_AGENT", but the environment translates these names back to "content-type" and "user-agent" (case-insensitive), as they are used in the HTTP protocol.

This does not apply to the meta data of the current request that do not occur in the HTTP header, like the URL, or the internet address of the client. So the "SCRIPT_NAME" property must be called "SCRIPT_NAME"; no name translation takes place here. (Sorry for this distinction, but we have to find good connector-independent names for these fields.)

Bad style: It is not a good idea to get these values directly from the process environment (Sys.getenv). This would work only for CGI, but not for other connectors.

By the way, you can also find the current set of cookies in the environment, because cookies are transferred in the HTTP header. For convenience, there is an access method:

    let cookies = env # cookies in
    let my_cookie = List.assoc "my_cookie" cookies

cookies returns the cookies as an associative list.

Now that we have an idea what data are represented in the environment: When is the environment object created? By default, together with the activation object, but you can do that yourself. The line

    let actobj = new std_activation()

creates a default environment because no env argument is present.
The default environment is

    let env =
      try
        new std_environment()
      with
        Std_environment_not_found -> new test_environment()

where the class std_environment creates an environment that works for CGI, and test_environment is responsible for the command-line test loop.

You can specify the environment to use. For example,

    let actobj = new std_activation ~env:(new std_environment()) ()

has a CGI-only environment; the test loop is disabled.

A very good reason to create your own environment is the possibility to pass the configuration record:

    type cgi_config =
      { (* System: *)
        tmp_directory : string;
        tmp_prefix : string;
        (* Limits: *)
        permitted_http_methods : string list;
        permitted_input_content_types : string list;
        input_content_length_limit : int;
        workarounds : workaround list;
      }

    val default_config : cgi_config

The configuration record is kept inside the environment, and can be specified when the environment object is created. The component tmp_directory is the absolute path of the directory for temporary files, and tmp_prefix is the file name prefix to use for these files. input_content_length_limit is the maximum allowed size for a web request. (The other components are documented in the mli file.)

Example: Create a customized configuration record, an environment, and an activation object in turn:

    let config =
      { default_config with
          tmp_directory = "/var/spool/webapps";
          tmp_prefix = "myapplication";
          input_content_length_limit = 10485760;  (* 10 MB *)
      } in
    let env = new std_environment ~config () in
    let actobj = new std_activation ~env ()

This way of creating environments works only for the CGI connector, as other connectors use different process models that are incompatible with this particular way. The contents of the environment object, however, are connector-independent.

Generating links and forms

OcamlNet contains some convenience functions to generate HTML code. Most important is the function to HTML-encode a string.
In HTML, the characters <, >, &, and sometimes[4] " and ' have special meanings and must be denoted in a special way:
HTML defines more such entities, but you need not use them; just send the character as such. For example, it is not necessary to write a-umlaut as &auml;, just use ä. Of course, you must ensure that the right character set is mentioned in the content-type header (e.g. do actobj # set_header ~content_type:"text/html;charset=utf-8" () to select the UTF-8 encoding). Note that missing charset parameters are often added by the web server, and this may go wrong.

The function that HTML-encodes any string is:

    let enc_str = Netencoding.Html.encode ~in_enc ~out_enc () unenc_str

where unenc_str is the unencoded string, in_enc is the assumed character set of unenc_str (e.g. `Enc_iso88591 or `Enc_usascii or `Enc_utf8), and out_enc is the assumed character set of enc_str. Normally, it is best to choose in_enc = out_enc because that avoids unnecessary transformations, but it is also allowed to pass `Enc_usascii for out_enc. This forces entities to be used for non-ASCII characters (e.g. &auml; instead of ä).

Note that the encode function normally does not transform ', but you can enforce this by passing the additional parameter unsafe_chars:

    let unsafe_chars = Netencoding.Html.unsafe_chars_html4 ^ "'" in
    let enc_str = Netencoding.Html.encode ~in_enc ~out_enc ~unsafe_chars () unenc_str

You may wonder why there is the () argument. Because encode rebuilds the transformation tables for every call, it is possible to split the invocation into two parts:

    let encode_utf8 = Netencoding.Html.encode ~in_enc:`Enc_utf8 ~out_enc:`Enc_utf8 () in
    let enc_str = encode_utf8 unenc_str

The transformation tables are already built when the partial application of the first let is executed, and this extra work is avoided when encode_utf8 is called. This technique is highly recommended when you call encode_utf8 frequently.

By the way, there is also Netencoding.Html.decode for the reverse transformation.

Now that we know how to apply the HTML encoding efficiently, let us go a step further.
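The effect of the () argument can be illustrated without OcamlNet. The sketch below is a stand-in under stated assumptions: make_encoder is a hypothetical function, it knows only a handful of ASCII characters, and it ignores character sets entirely. What it shares with the technique described above is the split between a setup phase that runs when () is applied and a closure that only performs table lookups.

```ocaml
(* Stdlib-only sketch; [make_encoder] is a hypothetical stand-in for
   Netencoding.Html.encode and handles only a few ASCII characters. *)
let make_encoder ?(unsafe_chars = "<>&\"") () : string -> string =
  (* Setup phase: runs once, when the unit argument is applied. *)
  let table = Hashtbl.create 8 in
  String.iter
    (fun c ->
       let entity =
         match c with
         | '<' -> "&lt;"
         | '>' -> "&gt;"
         | '&' -> "&amp;"
         | '"' -> "&quot;"
         | c -> Printf.sprintf "&#%d;" (Char.code c)
       in
       Hashtbl.replace table c entity)
    unsafe_chars;
  (* Per-call phase: the returned closure only looks characters up. *)
  fun s ->
    let b = Buffer.create (String.length s) in
    String.iter
      (fun c ->
         match Hashtbl.find_opt table c with
         | Some e -> Buffer.add_string b e
         | None -> Buffer.add_char b c)
      s;
    Buffer.contents b

(* The tables are built here, once per encoder. *)
let encode_default = make_encoder ()
let encode_with_apostrophe = make_encoder ~unsafe_chars:"<>&\"'" ()
```

Binding the partially applied function to a name, as in the last two lines, is what avoids rebuilding the table on every call.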
An often needed feature is to generate a FORM element that points to its generator (a "recursive" form). All you have to do is to put the URL into the ACTION attribute of this element. But how to do it right? This code shows a good solution:

    let own_url = actobj # url() in
    let enc_own_url = Netencoding.Html.encode ~in_enc ~out_enc () own_url in
    actobj # output # output_string
      (sprintf "<FORM ACTION=\"%s\">" enc_own_url)

The url method returns the current URL of the activation object (see below how to get variations of the current URL), but without the query string, if any (the part after the question mark). The URL is HTML-encoded and included as the ACTION attribute. It is very unlikely that the URL contains characters that actually must be encoded, but we get it from an untrusted source, and it is better to program defensively here. Note that URLs may contain & in general, but this cannot happen here because the query string is omitted.

The following piece of code creates a complete FORM element that not only points recursively to its generator, but also includes all web parameters. Note that we use POST because this method does not limit the length of the transferred parameters:

    let enc = Netencoding.Html.encode ~in_enc ~out_enc () in
    let own_url = actobj # url() in
    actobj # output # output_string
      (sprintf "<FORM ACTION=\"%s\" METHOD=POST>\n" (enc own_url));
    List.iter
      (fun (arg_name,arg) ->
         let arg_value = arg # value in
         (* sprintf is needed here to fill in the two %s placeholders *)
         actobj # output # output_string
           (sprintf "<INPUT TYPE=HIDDEN NAME=\"%s\" VALUE=\"%s\">\n"
              (enc arg_name)
              (enc arg_value))
      )
      (actobj # arguments);
    actobj # output # output_string "</FORM>\n";

It is essential that the NAME and VALUE attributes are HTML-encoded because they can contain arbitrary characters.

Note that this technique is often used, but usually not all arguments are included in the FORM.
Furthermore, you may encounter problems with multi-line arguments (containing line separators), and it is a good idea to include CR and LF in the list of unsafe_chars to avoid that.

The url method has the following interface:

    method url : ?protocol:Netcgi_env.protocol ->           (* default: from environment *)
                 ?with_authority:other_url_spec ->          (* default: `Env *)
                 ?with_script_name:other_url_spec ->        (* default: `Env *)
                 ?with_path_info:other_url_spec ->          (* default: `Env *)
                 ?with_query_string:query_string_spec ->    (* default: `None *)
                 unit ->
                   string

There are a lot of optional arguments referring to URL fragments. In general, an (absolute) URL has the format

    authority/script_name/path_info?query_string

where /path_info and ?query_string are optional fragments. The authority usually consists of the protocol name and host name of the denoted server (but may include more information), as in "protocol://hostname". The script name is the "directory path" of the server that is bound to the web application. Path info is the remaining directory path beyond the script name, and the query string contains the parameters that are passed by URL.

The url method allows the programmer to specify which parts of the URL should be taken from the current URL, which parts are to be left out, and which parts are to be overridden manually with certain values. For with_authority, with_script_name, and with_path_info one can pass these values:

- `Env means to use the corresponding fragment of the current URL.
- `None means to omit the fragment.
- `This "string" means to include this string value as fragment.

The protocol argument is only used if with_authority=`Env, see below why one can pass it separately in this case.

with_query_string has different cases:

- `Initial means to use the arguments as they were passed to the current invocation of the web application (i.e. from the request).
- `Current means to use the current arguments (they can be modified).
- `None means to omit the query string.
- `Args args forces url to include exactly the arguments args in the URL.

Example: Create a hyperlink that refers recursively to the web application, but sets the parameter x to the value 1 (as the only parameter):

    let x_arg = new simple_argument "x" "1" in
    let u = actobj # url ~with_query_string:(`Args [x_arg]) () in
    let enc_u = Netencoding.Html.encode ~in_enc ~out_enc () u in
    actobj # output # output_string
      (sprintf "<A HREF=\"%s\">The same again with x=1</A>" enc_u)

Example: Create a hyperlink that refers recursively to the web application, but redirects the accesses to the secure server:

    let u = actobj # url ~protocol:(`Http((1,1), [`Secure_https])) () in
    let enc_u = Netencoding.Html.encode ~in_enc ~out_enc () u in
    actobj # output # output_string
      (sprintf "<A HREF=\"%s\">The same again but secure</A>" enc_u)

FAQ