Programming Languages
What is CGI?
CGI stands for Common Gateway Interface, a specification for transferring information between a World Wide Web server and a CGI program. A CGI program is any program designed to accept and return data that conforms to the CGI specification. The program could be written in any programming language, including C, Perl, Java or Visual Basic.
CGI programs are the most common way for Web servers to interact dynamically with users. Many HTML pages that contain forms, for example, use a CGI program to process the form's data once it's submitted. Another increasingly common way to provide dynamic feedback for Web users is to include scripts or programs that run on the user's machine rather than on the Web server. These programs can be Java applets, JavaScript scripts, or ActiveX controls. The use of CGI is a server-side solution because the processing occurs on the Web server.
One problem with CGI is that each time a CGI script is executed, a new process is started. For busy Web sites, this can slow down the server noticeably.
Tips
How to direct a browser to display a different HTML page
This is actually very simple to do in a CGI script. Instead of the usual header
Content-type: text/html
make your script print this
Location: URL to display
Don't forget the blank line afterwards.
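As a minimal sketch of this tip, written in Python rather than Perl, the helper below builds the redirect header including the blank line. The function name redirect_header and the example URL are made up for illustration.

```python
import sys

def redirect_header(url):
    """Return the CGI header block that sends the browser to `url`."""
    # Refuse line breaks so form data cannot smuggle in extra header lines.
    if "\r" in url or "\n" in url:
        raise ValueError("URL must not contain line breaks")
    # The empty line after the header is what ends the header block.
    return "Location: " + url + "\n\n"

if __name__ == "__main__":
    # Placeholder URL; substitute the page you want displayed.
    sys.stdout.write(redirect_header("http://www.example.com/thanks.html"))
```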
How to write a No-Parse-Header script
When a server is returning to a browser the output from a script, it normally adds the elements of the header not supplied by the script. So by the time it reaches the browser, the Content-type header line output by your script is accompanied by header lines specifying information such as the date, the status code, and the server type.
If you want to prevent this happening, so your script can control the entire header, you need to create a no-parse-header script. It is amazingly simple to do this. All you need to do for most servers is make sure the name of your script starts with.
How to tell the browser to do nothing
You need to create a no-parse-header script. This script should return the header
HTTP/1.0 204 No Response
Browsers that handle this header correctly will do nothing. The MS Internet Explorer is one browser that does not handle this correctly; it displays a message to the user to say that the link "did not have a target".
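A rough Python sketch of such a script follows. It assumes the server really does treat the script as no-parse-header, so the script has to print the status line itself. The function name no_response is hypothetical.

```python
import sys

def no_response():
    """Return the complete response of an NPH script that asks the browser to do nothing."""
    # The status line comes first because the server adds nothing of its own.
    # The empty line marks the end of the header; no body follows a 204.
    return "HTTP/1.0 204 No Response\r\n\r\n"

if __name__ == "__main__":
    sys.stdout.write(no_response())
```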
Putting the Script and Form in One File
Having your web form and the corresponding CGI script in separate files in separate directories often makes them more difficult to manage if you have a lot of them. It would help to be able to put the script and the associated web pages into one script file (at the expense of a larger file). But if they are in a single script file, how will the script know whether to return the initial web page with the form, or to use the corresponding script to process the form data?
The answer lies in the ReadParse function, which knows, when a browser accesses the script, whether any form data is being supplied. If not, you can presume that the browser is accessing the script for the first time, and wants the web form. If the browser is sending form data, then it has already received the web form, has filled it out and wants the data to be processed by the script.
ReadParse knows whether it has received any form data and it tells you so. If there was form data, it returns the number of characters in the CGI bundle.
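The same idea can be sketched in Python. This is not the ReadParse function itself, only an illustration of the decision it makes possible: no form data means "send the form", some form data means "process it". The names read_form, respond and FORM_PAGE, and the single field called name, are assumptions made for the example.

```python
import html
import os
import sys
from urllib.parse import parse_qs

# Hypothetical form page; the empty action posts back to this same script.
FORM_PAGE = """<html><body>
<form method="POST" action="">
Name: <input type="text" name="name">
<input type="submit">
</form>
</body></html>"""

def read_form(environ, stdin):
    """Return the submitted fields as a dict; the dict is empty if no form data arrived."""
    if environ.get("REQUEST_METHOD") == "POST":
        length = int(environ.get("CONTENT_LENGTH") or 0)
        raw = stdin.read(length)
    else:
        raw = environ.get("QUERY_STRING", "")
    # parse_qs returns a list per field; keep the first value of each.
    return {name: values[0] for name, values in parse_qs(raw).items()}

def respond(environ, stdin):
    """Return the form on the first visit, or an acknowledgement once data is supplied."""
    fields = read_form(environ, stdin)
    header = "Content-type: text/html\n\n"
    if not fields:
        return header + FORM_PAGE
    name = html.escape(fields.get("name", ""))
    return header + "<html><body>Thanks, " + name + ".</body></html>"

if __name__ == "__main__":
    sys.stdout.write(respond(os.environ, sys.stdin))
```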
Testing File Names
When your script writes to a new file, you probably want it to create a new and unique name for that file, one that doesn't conflict with any existing file, which would otherwise be overwritten. One way to create a unique file name is to incorporate both the process id and the time into the name. Perl's special variable $$ holds the current pid and $^T holds the time at which the script started (in seconds since 1970). So you could use something like $filename = $$ . $^T . ".html"; Neither value alone guarantees uniqueness, since there are only a finite number of process ids, which are recycled, and your script could be accessed twice within the same second.
This results in ugly filenames, something like "83498127310497.html". If you have prettier names that you insist on, you can test for the existence of a file with the proposed new name, using Perl's -e operator. -e $file_name is true if a file already exists with that name. In the example below, the variable $text holds some key text taken from the contents of the file that we want to use in the name. We're also assuming that the script is constructing a web page, so we add ".html" to the end of the name.
$file_name = $your_chosen_dir . $text . ".html";
if ( -e $file_name ) {
    ## do something to make it different, like
    ## substitute pidtime.html for html at the end
    my $stamp = $$ . $^T;
    $file_name =~ s/html$/$stamp.html/;
}
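For comparison, here is a minimal Python sketch of the same test-then-rename idea. The function name unique_name and its pid and now parameters are hypothetical; they exist only so the fallback name can be checked with fixed values.

```python
import os
import time

def unique_name(directory, text, pid=None, now=None):
    """Return a path under `directory` that no existing file uses."""
    pid = os.getpid() if pid is None else pid
    now = int(time.time()) if now is None else now
    candidate = os.path.join(directory, text + ".html")
    if os.path.exists(candidate):
        # The pretty name is taken: put the pid and the time before ".html".
        candidate = os.path.join(directory, "%s.%d%d.html" % (text, pid, now))
    return candidate
```

Note that another process could still create the file between the test and your own write. If that matters, opening the file with os.open and the flags os.O_CREAT | os.O_EXCL makes the creation fail rather than overwrite.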
Redirecting to Another Page
Most CGI scripts will process form data and then construct and return an acknowledgement page to the browser. To return a web page, a CGI script must first send an HTTP header to the server signaling that a web page is about to follow:
print "Content-type: text/html\n\n"; # send HTTP header
There are a number of other HTTP headers available, however, including the Location: header, which redirects the browser to another page. If you have a web page prepared in another file, your CGI script needn't read it in and then print it to the server. After processing the form data however you wish, return the Location: some_url header for that file. The URL may be absolute or relative (to your script).
You could use it in your script something like this:
&ReadParse; # read the form data into %in
if ( $in{'some_variable_name'} eq 'first_choice' ) {
    print "Location: first_url.html\n\n";
} elsif ( $in{'some_variable_name'} eq 'second_choice' ) {
    print "Location: second_url.html\n\n";
} elsif ( $in{'some_variable_name'} eq 'third_choice' ) {
    print "Location: third_url.html\n\n";
}
exit;
This hard codes the URLs into the script. You can write a more general purpose script by placing them into the web form as data, perhaps as hidden variables if you don't wish to bother the user with them. In that case you might use a locution like:
if ( $in{'some_variable_name'} eq 'first_choice' ) {
    print "Location: $in{'first_url'}\n\n";
}
where there were tags in the web form that looked like
This technique was used in the web2mail anonymous web form remailer.
In general, you can't customize on the fly the web page to which you redirect, unless you are redirecting to another CGI script, in which case you might as well customize the page from the original script. So use this technique when you want to return a static web page, though you can return different pages depending on the form data that was supplied.
CGI Security
There are important security issues when mailing or calling any other program from a CGI script. But CGI security is deep magic, far beyond the scope of this tutorial. A collection of documents discussing security is available on the Web. I can only give a brief example here.
In general, you should never write scripts that allow a user's form data to be executed on your system. The most obvious example might be something like
exec "$in{message}";
This would allow a browser to execute commands on your system, whatever was submitted through the variable named message on your web form. (Perhaps rm -rf?) Perl has some built-in safeguards against this (TaintPerl), as do most web servers, though they are not perfect and can sometimes be circumvented by crafty web surfers.
As a more devious and realistic example, suppose your mail program is mail and you put the recipient on the command line:
$recipient = $in{email_address};
open(MAIL, "|mail $recipient");
If the browser supplied her address as "nobody ; rm -rf", the second command might be executed after the mail program completed. (Recent versions of sendmail have safeguards against this sort of spoofing.)
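One way to see why the shell is the weak point: if the program is started with a list of arguments and no shell, the whole string arrives as a single literal argument. The Python sketch below illustrates that, using a harmless child process as a stand-in for the mail program. The function name run_without_shell is made up for the example.

```python
import subprocess
import sys

def run_without_shell(recipient):
    """Pass `recipient` to a child process as one literal argument and return what it saw."""
    # Stand-in for the mail program: a child that just prints its first argument.
    child = [sys.executable, "-c", "import sys; print(sys.argv[1])", recipient]
    # An argument list with the default shell=False means no shell ever parses the text,
    # so ";" has no special meaning.
    result = subprocess.run(child, capture_output=True, text=True, check=True)
    return result.stdout.rstrip("\n")

if __name__ == "__main__":
    print(run_without_shell("nobody ; rm -rf"))
```

This only removes the shell from the picture. The program you call may still interpret its arguments in surprising ways (a value beginning with "-" can look like an option, for instance), so the form data still needs checking.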
So what can you do?
- Realize you have no control over what form data is passed to your script, and anyone can bypass your form and access your script directly. All they need to do is point their own form at it.
- Study the security documents linked above. These are technical issues, but they make for morbidly interesting reading.
- You should be reasonably safe if you don't execute any other scripts (including mail or other CGI scripts) in your code. (This is the kind of sweeping statement that often proves wrong, so I can't guarantee it.)
- Sanitize any form data that you pass to other scripts that you must execute. s/\W//g; will remove all nonalphanumeric characters from a variable, including punctuation (*.;'/). Even better would be to accept only a pre-determined list of possible answers.
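The last point can be sketched in Python as follows. Both helper names and the list of allowed answers are assumptions for illustration; the allowed values simply reuse the choices from the redirect example.

```python
import re

# Hypothetical list of the only answers the script is prepared to accept.
ALLOWED_CHOICES = {"first_choice", "second_choice", "third_choice"}

def strip_nonword(value):
    # Same effect as Perl's s/\W//g: keep only letters, digits and underscore.
    return re.sub(r"\W", "", value, flags=re.ASCII)

def pick_choice(value, default="first_choice"):
    """Accept `value` only if it is on the pre-determined list; otherwise use `default`."""
    return value if value in ALLOWED_CHOICES else default
```

Stripping characters changes what the user typed, so the allow-list is usually the safer of the two when the set of valid answers is known in advance.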
I believe the code in this document (including sendmail -t, which keeps the email address off the command line) is reasonably secure. No guarantees though, and if you know otherwise, please let me know.
If you'd like to look at a script which goes to great lengths to be security conscious (because it's able to write to any file on your web site) see SiteMgr.
Debugging
Debugging is a challenge with CGI scripts, because the web server runs your script, not you, so you can't easily get access to standard error.
The first step is to ensure your script works when you run it by hand, even before you put it on the web server and try to fill out your forms. If your script doesn't parse, for example, your browser would only report "Document contains no data" or "Server error" or something even less informative. To run your script by hand, type perl script_name at the command prompt, or perl -d script_name if you want to use Perl's very useful debugger.
However, the point of a CGI script is to process a browser's form data, and these commands don't supply any to the script. If you use METHOD = POST (recommended) in your form, the data will be passed to the script on STDIN, which will need some environment variables to properly interpret it. You need to run your script under a special environment.
Most Unix systems support the env command which does precisely this. Here is a simple example of its use on the command line (but place the first two lines on one line).
env REQUEST_METHOD=POST CONTENT_LENGTH=47
perl -d script_name << HERE
name=John+Doe&email=a@a.com&msg=This+is+a+test.
HERE
This command first creates two new environment variables, REQUEST_METHOD and CONTENT_LENGTH, and sets their values as shown. It then executes perl -d script_name in that environment, with the HERE document as its standard input.
The HERE document contains the user's form data. It consists of a &-separated list of name=value pairs. name is the variable name you used for a form element in your form page, and value is the corresponding data that the browser supplied for that form element. In addition, spaces are converted to +'s. CONTENT_LENGTH must be equal to the number of characters in the HERE document.
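Counting characters by hand is error-prone, so here is a small Python sketch that builds the form data and works out CONTENT_LENGTH for you. The function name post_environment is hypothetical, and the field names are the ones from the example above.

```python
from urllib.parse import urlencode

def post_environment(fields):
    """Return (environment variables, request body) for a simulated POST."""
    # urlencode joins name=value pairs with "&" and turns spaces into "+".
    # safe="@" leaves the "@" of an email address unescaped.
    body = urlencode(fields, safe="@")
    environ = {"REQUEST_METHOD": "POST", "CONTENT_LENGTH": str(len(body))}
    return environ, body

if __name__ == "__main__":
    environ, body = post_environment(
        {"name": "John Doe", "email": "a@a.com", "msg": "This is a test."}
    )
    print(environ["CONTENT_LENGTH"])
    print(body)
```

The two returned values could then be handed to subprocess.run as the env and input arguments to run the script under test, merged with the existing environment so that PATH and friends are still set.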
This is fairly awkward (especially counting characters) so I have altered the ReadParse function to accept form data on the command line. Simply run your script something like
script_name name=John+Doe\&email=a@a.com\&msg=This+is+a+test.
supplying it with the list of name/value pairs as an argument on the command line. Run this way, ReadParse detects neither the GET nor the POST method, so it looks to the command line arguments for the form data. Note, however, that the shell requires that you escape &'s by preceding them with a backslash.
This is an overview of the CGI protocol (under method POST) and how web servers pass data to scripts. It should be sufficient for most debugging. There are a few more details, which you can learn by examining the ReadParse function, a tutorial called Reading CGI Data, or The Common Gateway Interface documentation.
After debugging your script by hand, you can run it from the web server by placing it in your site's cgi-bin directory. If it has been thoroughly debugged, the major additional problems might be file and directory access permissions. Make sure the script itself is world readable and executable, and that any directories and files the script must write to are world readable and writeable.
Permissions
You will also have to make the program executable. You can do this with your FTP program; most FTP programs, such as WS_FTPLE, support it. Once the file is uploaded, click on the program once to select it, then right-click on it. You will see some options; look for either a "chmod" command or "change permissions".
Click on the "Execute" boxes and hit "OK".
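If you have shell or script access to the server rather than an FTP client, the same change can be made in a couple of lines. The Python sketch below adds the three "Execute" bits to whatever permissions the file already has; the function name make_executable is made up for the example.

```python
import os
import stat

def make_executable(path):
    """Add execute permission for owner, group and others, like `chmod a+x`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode | stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
    # Return the resulting permission bits so the caller can check them.
    return stat.S_IMODE(os.stat(path).st_mode)
```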
I can't figure out the correct path to my files.
Paths to files uploaded to your account begin with your account's root path. This is as follows
/usr/local/etc/httpd/sites/yourdomain.com/
You can determine the rest of the path from the location of your file relative to your root directory. For example, a file called "data.dat" located in your htdocs directory would have a full path as follows:
/usr/local/etc/httpd/sites/yourdomain.com/htdocs/data.dat
If you will be referencing many files, it is probably easiest to create a variable called $root and assign it the full path to the root directory:
$root = "/usr/local/etc/httpd/sites/yourdomain.com";
You can then simply prepend the root path variable to the relative locations of your files throughout your script. For example:
$datafile = "$root/htdocs/data.dat";
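The same habit carries over to scripts written in Python. In this sketch the constant ROOT holds the example root path from above and the helper name site_path is hypothetical.

```python
import os

# The example account root used above; yours will differ.
ROOT = "/usr/local/etc/httpd/sites/yourdomain.com"

def site_path(*parts):
    """Join the account root with a location given relative to it."""
    return os.path.join(ROOT, *parts)

if __name__ == "__main__":
    print(site_path("htdocs", "data.dat"))
```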
Unless you need to reference files outside your cgi-bin, we recommend using the shorter and simpler relative paths. Relative paths are always determined based on the location of the required file relative to the location of your script. For example, if you have a cgi-lib.pl library file in a directory called "Library" inside your cgi-bin, the relative path would be:
"./Library/cgi-lib.pl";
HTTP variables
If scripts are executed outside the server, the shell trigger will start a fresh Python interpreter process and the code will be executed - but it's worth remembering that in this case, the usual HTTP variables ($HTTP_REFERER, $QUERY_STRING, etc) won't be set, so if your script relies on values being available for these variables, you'll need to test for this and set sensible default values.
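A minimal sketch of that test in Python: look each variable up in the environment and fall back to a default when it is missing. The function name http_variables and the particular default values are assumptions; choose whatever is sensible for your script.

```python
import os

# Hypothetical defaults for a run outside the web server.
DEFAULTS = {"HTTP_REFERER": "", "QUERY_STRING": "", "REQUEST_METHOD": "GET"}

def http_variables(environ=None):
    """Return the HTTP variables the script relies on, with defaults for any that are unset."""
    environ = os.environ if environ is None else environ
    return {name: environ.get(name, default) for name, default in DEFAULTS.items()}
```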
Cross-site scripting
The cross-site scripting issue shouldn't be ignored. One recommended method is to set the default charset in your httpd.conf:
AddDefaultCharset iso-8859-1
(But this does ignore the difficulty of the amount of content in character sets other than iso-8859-1).
Incoming data
It almost goes without saying, but never pass any received QUERY_STRING or PATH_INFO data to external programs (e.g. sendmail) without first escaping potentially problematic characters -- see Paul Phillip's Safe CGI paper (go on, you know you should - and take a bookmark while you're there).
Here's something I prepared earlier. Note - it ignores CR and LF which it probably shouldn't.
import re

def esc_chars(tgt):
    # will change, for example, a!!a to a\!\!a
    matchstr = re.compile(r"""([;<>\*\|`&\$!#\(\)\[\]\{\}:'"])""")
    return matchstr.sub(r'\\\1', tgt)

e.g.:
>>> attack = """a!!a"""
>>> print(esc_chars(attack))
a\!\!a
>>> attack = """#!/bin/bash"""
>>> print(esc_chars(attack))
\#\!/bin/bash
Active Error Documents
I would imagine that most server administrators will be familiar with the idea of improving on the old "500 Server Error" message. I've found it useful to create a Python script which tells the user that an error has occurred, informs them that the janitor (me) will be automatically emailed with details of the error, and offers them an opportunity to send me their email address so I can get back to them when the problem has been fixed.
Here's how the error page appears to the user, here's the source for the Error Document, and here's the source for the "fixit.py" script which processes the form field in which they enter their email address.
Infinite Loops
If your CGI somehow gets into an infinite loop, the web server may well wait forever for the CGI to return results. This, in turn, means that the user will probably be left staring at a blank or partially filled browser for quite some time. Or worse, they'll just hit the Back button and then try again, putting another infinitely long CGI in motion on your server, and thus using up CPU time that produces nothing.
CGI programs don't know if and when the user hits the Stop button on the browser. The program often finds out only when it tries to output HTML and receives a SIGPIPE signal because the socket is no longer valid, but this may depend on the configuration of the operating system and web server.
How to find and kill infinitely looping CGIs
To kill an infinitely looping CGI, you must first find its process ID (PID). The classic way to do this is with the Unix ps command. Under Solaris, for example, you can list all of your processes like this:
ps -ef
Look for unusually large values in the TIME column and note the PID for that process. Note that you can't trust the name given by ps, because it can be set on some systems by setting argv[0] in the executing program. Once you have the PID of the looping CGI, you can kill it with the kill command, like this:
kill 2353
However, this is not guaranteed to stop processes that choose to ignore the TERM signal. If the process is still present after a few seconds, try the -9 option, as in kill -9 2353. This should not be your first option because processes killed with the -9 option do not get a chance to clean up temp files or finish writing buffered output to a file. The kill command may leave a zombie process on the system, which cannot be killed but occupies only minimal system resources. Zombie processes are marked with Z or defunct in ps output. If a process is not a zombie but cannot be killed, then it is probably waiting on an NFS call or a stuck device.
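The "TERM first, -9 only if that fails" routine can be sketched in Python like this. The function name stop_process and the three-second grace period are assumptions, and the sketch presumes you have permission to signal the process.

```python
import os
import signal
import time

def stop_process(pid, grace=3.0):
    """Send TERM to `pid`; if it is still there after `grace` seconds, send KILL.

    Returns the name of the signal that ended the attempt.
    """
    os.kill(pid, signal.SIGTERM)
    deadline = time.monotonic() + grace
    while time.monotonic() < deadline:
        try:
            os.kill(pid, 0)  # signal 0 sends nothing; it only checks that the pid exists
        except ProcessLookupError:
            return "TERM"
        time.sleep(0.1)
    try:
        os.kill(pid, signal.SIGKILL)  # the equivalent of kill -9
    except ProcessLookupError:
        return "TERM"  # it exited just after the last check
    return "KILL"
```

A zombie still counts as present for the existence check, so a process that died but has not been reaped by its parent will be reported as needing KILL.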
There are a number of more user-friendly tools for hunting down rogue processes, such as top, skill, and killall.
UserTime.cgi
UserTime is a CGI (common gateway interface). It was written by Jochen "Joe" Savelberg to allow the customers of euregio.net to have an up-to-date indication of the time that they've spent online.
How does it work?
First, the user has to enter some information, such as his/her username, his/her password and to select some more options. Please note that the username and password are case-sensitive! Just enter the same information which is in your PPP connection script and you'll be fine.
The web server passes the information to HyperCard - the program the UserTime.cgi was written in. HyperCard then calls another script which sends some TCP/IP commands to euregio.net's UNIX server. This server returns the login times for the requested account. Then HyperCard takes over again and calculates the costs and creates the information page which is sent back to the web server and to the user's WWW client (such as Netscape Navigator, Mosaic, etc.).
When several users are requesting this service at the same time it works according to the principle of FIFO (first in, first out). That means that the first request will be handled first and the remaining requests will be queued.
The drawback is that there is a certain time-out (in our case 360 seconds). Whenever there is a request that is in the queue for a longer period than this time-out, the request will be cancelled. The user gets a message that the gateway timed out. The user could try to send the form again a little bit later.
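The queueing behaviour described here can be illustrated with a few lines of Python. This is only a sketch of the first-in, first-out principle with a time-out, not the actual HyperCard implementation; the class name RequestQueue and its methods are invented for the example, and times are passed in explicitly as seconds.

```python
from collections import deque

TIMEOUT = 360  # seconds, the time-out quoted above

class RequestQueue:
    """First in, first out; requests that waited longer than the time-out are cancelled."""

    def __init__(self, timeout=TIMEOUT):
        self.timeout = timeout
        self.pending = deque()  # (arrival time, request) pairs, oldest first

    def add(self, request, now):
        self.pending.append((now, request))

    def next_request(self, now):
        """Return (request to handle or None, list of requests that timed out)."""
        timed_out = []
        while self.pending:
            arrived, request = self.pending.popleft()
            if now - arrived > self.timeout:
                timed_out.append(request)  # this user sees "gateway timed out"
                continue
            return request, timed_out
        return None, timed_out
```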
While the UserTime.cgi is processing the request, the user can still continue his/her quest for the holy grail, i.e. he/she can continue surfing the Internet. All he/she has to do is to choose 'New window' from his/her browser's menu. The other window will still be waiting for the results of UserTime.cgi.