Search Engine Friendly URLs

redemption - 2003-10-02 03:37:24 in PHP
Category: PHP
Reviewed by: redemption   
Reviewed on: Oct 11 2003

Today we're going to talk about Search Engine Friendly URLs, also known as SEF URLs. This topic has lost much of its initial popularity since it became apparent that Google can parse pages that use regular PHP URLs, so long as you don't have session variables in the URL. However, I feel that SEF URLs are STILL very useful for other search engines, and they double as a user friendly technique for certain areas of your sites.

Search Engine Friendly URLs: What are they, and why do we want them?

So what does an SEF URL look like? Below are some examples:

http://www.domain.com/gallery.php/cat/15/id/112/
http://www.domain.com/articles/112/
http://www.domain.com/news/20030405.html
http://www.domain.com/forums/forumid/15/threadid/2563/messageid/531
http://www.domain.com/forums/forumid/15/page/12/

The first thing about SEF URLs is that they don't contain any of the question marks (?) or ampersands (&) characteristic of a CGI or other dynamic script. Back in the 90s this was important because those particular characters signalled to a search engine that the page was a live dynamic script and might not be worth spidering because it might change at any moment. Take for instance the last URL on the list above: we can deduce that we're browsing page 12 of forum id 15. The problem is that the contents of page 12 will change over time, so even if a person were to find that page on a search engine, the content they were searching for may no longer be there.

The flipside of the problem is that some pages look dynamic but aren't. Take the second to last URL in the list: it presumably shows message 531 in a forum, and chances are that message 531 will be the same even months after it was spidered. So STATIC content (i.e. a specific message, article, or news post) was being ignored by search engine spiders because they assumed that any CGI generated page would contain ever changing content.

Note the distinction between dynamic CONTENT and dynamic PAGES: the former implies that a certain URL's text and content will change over time, while the latter indicates that the PAGE itself must be generated on the fly, whether or not the core content has changed significantly.

So the original purpose of SEF URLs was to encourage search engines to include dynamic pages in their databases. However, with the majority of users using Google as their search engine either directly or indirectly, the problem is not as pronounced as it used to be. Google WILL successfully store pages whose URLs were once taboo. The reason why we still WANT search engine friendly URLs is that some of the other engines still prefer them, and also that they are more human readable.
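The key/value style of the forum URLs listed above can also be generated from code when you print links. The following is only a minimal sketch, assuming PHP 7 or later; build_sef_path() is a hypothetical helper name and is not taken from any library.

```php
<?php
// Hypothetical helper (not part of PHP): build a key/value style SEF path
// such as /forums/forumid/15/page/12/ from a script name and parameters.
function build_sef_path(string $script, array $params): string
{
    $path = '/' . trim($script, '/');
    foreach ($params as $key => $value) {
        // rawurlencode() keeps "/", "?" and "&" out of each path segment.
        $path .= '/' . rawurlencode((string) $key)
               . '/' . rawurlencode((string) $value);
    }
    return $path . '/';
}

echo build_sef_path('forums', ['forumid' => 15, 'page' => 12]), "\n";
// prints /forums/forumid/15/page/12/
```

Encoding each segment separately means a value containing a slash or an ampersand cannot change the shape of the URL.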

A simple method for implementing Search Engine Friendly URLs

The easiest method for Search Engine Friendly URLs is to embed the variables within the request URL like so:

http://www.domain.com/files/members.php/us/ca/5/

All we've done in the above URL is add / between each variable and after the filename of the script. Apache and other popular web servers will correctly understand that you are trying to access the members.php script within the files directory and will kindly ignore anything after the .php. Why does that happen? Because Apache has a "look back" feature that will keep looking up the path until it finds an active file.


Note: For Apache 2.X users - in Apache 2 and up, the URL "look back" feature is off by default. In your .htaccess or httpd.conf file you must have a section pertaining to the directory which contains your PHP scripts. There you must add AcceptPathInfo On as such: [code] AcceptPathInfo On [/code]

Ok now in your script you include the following:

<?php
/*
Break down the URL into its token parts. The URLs look like:

http://www.domain.com/files/members.php/us/ca/5/
*/
// REQUEST_URI is "/files/members.php/us/ca/5/", so token 0 is empty,
// token 1 is "files" and token 2 is "members.php".
// explode() replaces split(), which is no longer available in current PHP.
$tokens = explode("/", $_SERVER["REQUEST_URI"]);

$country = $tokens[3]; // what country to search
$state = $tokens[4];   // what state to search
$page = $tokens[5];    // what page we're on

/*
Here begins the code to grab the members...
*/
?>


The simple thing to remember is that we're using the REQUEST_URI server variable. This variable is valid through PHP3 and PHP4 up to the current 4.3.3 version. PHP4 users (4.1.0 and later) should read it as $_SERVER["REQUEST_URI"] rather than the bare $REQUEST_URI global, which is only populated when register_globals is on; this is the recommended secure variable model.
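A slightly more defensive version of the same idea is sketched below. It assumes PHP 7 or later, and sef_segments() is a hypothetical helper name. It drops any query string and any empty pieces first, so after filtering the country sits at index 2 of the cleaned list, and it falls back to defaults when a segment is missing.

```php
<?php
// Sketch only: split a request path into its non-empty segments.
// sef_segments() is a hypothetical helper, not a PHP built-in.
function sef_segments(string $requestUri): array
{
    // Drop any "?query" part so it cannot leak into the last token.
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    // Remove empty pieces produced by leading or trailing slashes.
    $parts = array_filter(explode('/', $path), 'strlen');
    return array_values(array_map('rawurldecode', $parts));
}

// In a real request you would pass $_SERVER['REQUEST_URI'] here.
$segments = sef_segments('/files/members.php/us/ca/5/');
// $segments is ['files', 'members.php', 'us', 'ca', '5']

$country = $segments[2] ?? '';                 // 'us'
$state   = $segments[3] ?? '';                 // 'ca'
$page    = max(1, (int) ($segments[4] ?? 1));  // 5
```

Casting the page to an integer and giving the other values defaults keeps a short or malformed URL from causing undefined index notices further down the script.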

More advanced usages

If you want to look more professional it's likely that you want to use a URL that looks more like this:

http://www.domain.com/files/members/us/ca/5/

This is also possible, and there are a few ways of doing it.

a) SetHandler directive in Apache

This method, in my opinion, is the simplest and most convenient. It only works, though, if your server allows you to override functionality using .htaccess, or if you can access Apache's httpd.conf file directly. As far as I am aware, this method really only works for Apache.

Personally I recommend the .htaccess approach, so in the files directory put this in the .htaccess file (if no .htaccess file exists, just create one): [code]
SetHandler application/x-httpd-php
[/code] In Apache 2.X, SetHandler does not seem to work. Instead use ForceType like so: [code]
ForceType application/x-httpd-php
[/code]

Then, instead of saving your members script as "members.php", save it only as "members" (with no file extension at all) in that directory. What the above is doing is telling Apache to treat the /files/members/ directory as a call to the "members" file.
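A bare SetHandler or ForceType line in .htaccess applies to every file in that directory, images included. One common way to limit it to the extensionless script is to wrap the directive in a Files section. The fragment below is a sketch of that setup, offered as an assumption about the intended configuration rather than the exact one used here; "members" is the script name from the example above.

```apache
# .htaccess in the files directory (sketch, not the author's verified config)
# Only the file named "members" is handed to PHP.
<Files members>
    ForceType application/x-httpd-php
</Files>
```

If your server responds to SetHandler instead, the same wrapper should work with that directive in place of ForceType.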

b) Mod Rewrite

An alternative to SetHandler is the use of ModRewrite. ModRewrite is a very powerful tool that allows you to rewrite URL requests such that Apache treats a call on one path as a request for a totally different path. Using ModRewrite you can perform redirects, and also mask the true path of a script being executed. You can also use it for SEF URLs :).

ModRewrite requires that you have the module installed for it to function. You can see whether ModRewrite is enabled on your server through the phpinfo() function: load up a page with phpinfo() and then search for "mod_rewrite" under the Apache section.

Next, you need to include the following in an .htaccess file or else in your Apache httpd.conf file: [code]

RewriteEngine On
#turn on the Rewrite engine, if it's not already active

#set the base directory to /files
RewriteBase /files/

# now the rewriting rules
RewriteRule ^members/ members.php [L]
RewriteRule ^sections/ directory.php [L]
RewriteRule ^cities/ directory.php [L]
RewriteRule ^maps/ maps.php [L]
[/code]

The above activates mod_rewrite and tells it to treat calls to each of the RewriteRule directories as calls to actual files instead, essentially rewriting the URI request so that it's requesting those files. The [L] portions following each line indicate to mod_rewrite that once that rule is satisfied it can ignore the other rules.

Mod_rewrite is a very complicated, robust, and useful module that is better explained by the official mod_rewrite documentation.
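After a rule like the ones above has fired, the target script can still read the originally requested path from REQUEST_URI, so it can recover its parameters the same way as before. The sketch below assumes PHP 7 or later and a hypothetical rule that maps ^forums/ to a script; sef_pairs() and the "/forums/" prefix are illustrative names only.

```php
<?php
// Sketch only: turn the part of the path after a known prefix into
// name => value pairs. sef_pairs() is a hypothetical helper.
function sef_pairs(string $requestUri, string $prefix): array
{
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    if (strncmp($path, $prefix, strlen($prefix)) !== 0) {
        return []; // the URL does not belong to this script
    }
    $rest = trim((string) substr($path, strlen($prefix)), '/');
    $tokens = $rest === '' ? [] : explode('/', $rest);

    $pairs = [];
    // Walk the tokens two at a time: name, value, name, value, ...
    // A trailing name without a value is ignored.
    for ($i = 0; $i + 1 < count($tokens); $i += 2) {
        $pairs[rawurldecode($tokens[$i])] = rawurldecode($tokens[$i + 1]);
    }
    return $pairs;
}

// In a real request you would pass $_SERVER['REQUEST_URI'] here.
$vars = sef_pairs('/forums/forumid/15/page/12/', '/forums/');
// $vars is ['forumid' => '15', 'page' => '12']
```

The values come back as strings, so cast them (for example with (int)) before using them in a query.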

c) SymLinks

One other option is to use Linux symlinks to mask the filename and simulate a directory. For this to work you must have FollowSymLinks enabled in Apache or in your .htaccess. This method is the least favoured in my opinion because you might run into problems with permissions with symlinks that execute files.

In this example you can symlink your members.php file like so:

ln -s members.php ./members

Again, this technique is merely another alternative. The SetHandler and Mod_Rewrite methods are by far the ones I use the most.

Conclusion & What's Next

Well that's it! Search Engine Friendly URLs are easy and quick to implement. They are also much easier on the eyes than ugly looking GET strings. If done properly you can even make reasonably human readable URLs, say by using a unique Username key as a part of your URL rather than memberids:

http://www.domain.com/files/members/Jack/

or even

http://www.domain.com/files/members/Jack.html

The above model can be extended further for things like articles, products, forum threads, and more. DEVPEN for instance uses unique filenames to identify articles. You can just as easily read this SEF article using a URL like http://www.devpen.com/articles.php?a=3. If I see the need for further explanation of how to do these last few examples I will extend this article... for now I think you have the tools and ideas to make all sorts of useful SEF URLs.
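As a rough illustration of the username style URLs above, the sketch below pulls the name out of either form. It assumes PHP 7.1 or later; member_name_from_uri(), the default prefix, and the allowed-character rule are all assumptions made for the example.

```php
<?php
// Sketch only: extract a username from ".../members/Jack/" or
// ".../members/Jack.html". The helper name and rules are hypothetical.
function member_name_from_uri(string $requestUri, string $prefix = '/files/members/'): ?string
{
    $path = (string) parse_url($requestUri, PHP_URL_PATH);
    if (strncmp($path, $prefix, strlen($prefix)) !== 0) {
        return null; // not a members URL
    }
    $last = trim((string) substr($path, strlen($prefix)), '/');
    // Strip an optional ".html" suffix from the decoded segment.
    $name = (string) preg_replace('/\.html$/', '', rawurldecode($last));
    // Accept only letters, digits, "_" and "-" before touching a database.
    return preg_match('/^[A-Za-z0-9_-]+$/', $name) === 1 ? $name : null;
}

$a = member_name_from_uri('/files/members/Jack.html'); // "Jack"
$b = member_name_from_uri('/files/members/Jack/');      // "Jack"
$c = member_name_from_uri('/files/members/');           // null
```

Rejecting anything outside a small character set is a deliberate choice here, since the name arrives straight from the URL and is likely to end up in a database lookup.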

And for naysayers who believe that SEF URLs are not worth it because of the added load of parsing out the REQUEST_URI string, and the load of mod_rewrite (if you are using it), I think you just have to balance your desire for beautiful, search engine friendly and more human intuitive URLs against the very small added load that such an approach creates. I've used SEF URLs on sites that have HUGE numbers of requests per hour, and yes, you do end up using more horsepower, but you also have much nicer looking URLs on search engines such as Google, which show part of the URL in their search results.

Add to all that the fact that some search engines still do not fully spider CGI-like pages (especially beyond the first or second depth of links), and SEF URLs become a worthwhile technique to use on your next site.