Validating HTML and SEO
#117 Henry, Wednesday, 29 June 2011 10:48 PM (Category: Web Development)
(Tags: seo validation)

I have a number of websites for service groups, especially one for Anne. I was doing some Google fiddling and got alarmed by a number of things I saw. I found my way to HTML validators.

These are websites where you give them a URL and they check the page and report on your failures. The main one I used was the W3C Validator. That opened up a whole mess of problems. I thought my HTML was pretty good until I fed it into that. I had to get the DOCTYPE right, and then add an xmlns attribute to my html tag. Then I had to make everything validate as XHTML, so singleton tags needed the closing slash at the end, and some constructs simply were not allowed. Some things puzzled me, like not being able to put a ul inside a p. But I went through all my code and cleaned up everything that was suggested, until everything got the green tick of approval. I felt virtuous.
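
For reference, the boilerplate that keeps the validator happy looks roughly like this (shown here with the XHTML 1.0 Transitional doctype; Strict differs only in the DTD identifiers):

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
    <title>Example page</title>
  </head>
  <body>
    <p>Singleton tags need the trailing slash:
      <br />
      <img src="logo.png" alt="Logo" width="100" height="40" />
    </p>
  </body>
</html>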

Then I came to this other validator page - The WDG HTML Validator. This one will spider through your whole site instead of checking just one page, and it found a bunch more problems. On my photos page, there was a markup error if the matrix was filled exactly, with no empty slots - an off-by-one error. In the calendar, it found that I was emitting an empty table row when the 1st of the month started on a Sunday, and an empty row is not allowed in XHTML. I have been trying to debug this one, but it's really tricky. I will have to spend more time to fix it.
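
One likely way out is to build the list of day cells first and only chunk them into rows of seven afterwards, so an empty leading row can never appear. A rough sketch of the idea (the variable names are made up, not the real calendar code):

<?php
// Sketch: collect all the cells for the month, then slice them into
// weeks of seven, so there is never a <tr> with nothing in it.
$days_in_month = 30;
$first_weekday = 0;    // 0 = Sunday ... 6 = Saturday
$cells = array();
for ($i = 0; $i < $first_weekday; $i++) {
    $cells[] = '&#160;';    // leading blank cells
}
for ($day = 1; $day <= $days_in_month; $day++) {
    $cells[] = $day;
}
while (count($cells) % 7 != 0) {    // pad the last week out to 7 cells
    $cells[] = '&#160;';
}
echo "<table>\n";
foreach (array_chunk($cells, 7) as $week) {
    echo "<tr><td>" . implode('</td><td>', $week) . "</td></tr>\n";
}
echo "</table>\n";
?>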

So at first I was working just to validate the HTML. Then I noticed suggestions for improvements that would help search engines understand the content better, and I found myself validating the pages for SEO as well. I came across SEO validator sites and started following their suggestions.

I added the appropriate meta tags, including the keywords. I know that not much, if any, credence is given to keywords these days by the search engines, but it helps a little if the keywords match the content. I set up methods of putting specific keywords per page. I'm not writing raw HTML here, I am writing in PHP and have functions to handle common elements of each page. I ended up with meta tags like this:
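
Something along these lines, with each page supplying its own title, description and keywords through the PHP helpers (a representative sketch rather than the exact tags):

<title>About the Group</title>
<meta name="description" content="Who we are, when we meet, and how to get in touch." />
<meta name="keywords" content="service group, meetings, volunteers, contact" />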

All these were suggested to me by several SEO validator websites. The ones I used the most were Pear Analytics, SEO Workers, and a list of validation websites.

I also had to ensure that each page had an h1 tag that matched the keywords and the title. Lots of sensible little things that made the website tight.

Then I learned about duplicate content. This was a problem. I have both thisdomain.com and thisdomain.org, and I point both those and www.thisdomain.com and www.thisdomain.org all at the same site. So every page can be referenced by four different URLs. That's duplicate content, and it had to be fixed. I wasn't sure how to do it initially. There seemed to be two ways to do it. One was with 301 redirects (what?) and the other was with a canonical meta tag. I was puzzled about how to approach it, so I decided to make my ignorance benefit me.
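
(For the record, the canonical option is just a link element in the head of each page, something along the lines of <link rel="canonical" href="http://thisdomain.org/about.php" />, telling the search engines which URL is the real one.)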

Stackoverflow.com has branched off into a whole series of satellite sites. One is for web development - Pro Webmasters. So I asked my question about duplicate content (and gained some points), got a good answer, and accepted it (got some more points). Then I expanded on the answer and fleshed it out to cover all my duplicate content problems.

First, I had to decide which one was going to be the definitive URL - thisdomain.com, www.thisdomain.com, thisdomain.org or www.thisdomain.org. I decided on thisdomain.org.

Then I changed my Apache virtual hosts configuration so that thisdomain.com and thisdomain.org each had a different directory. Previously, they were all going to the same directory. The vhosts setup ended up like this:


<VirtualHost *:80>
    ServerName www.thisdomain.org
    ServerAlias thisdomain.org
    DocumentRoot "/htdocs/thisdomain/org"
    ScriptAlias /cgi-bin/ "/htdocs/thisdomain/org/cgi-bin/"
    <Directory "/htdocs/thisdomain/org">
      AllowOverride all
      Order allow,deny
      Allow from all
    </Directory>
</VirtualHost>

<VirtualHost *:80>
    ServerName www.thisdomain.com
    ServerAlias thisdomain.com
    DocumentRoot "/htdocs/thisdomain/com"
    ScriptAlias /cgi-bin/ "/htdocs/thisdomain/com/cgi-bin/"
    <Directory "/htdocs/thisdomain/com">
      AllowOverride all
      Order allow,deny
      Allow from all
    </Directory>
</VirtualHost>

I left the directory for .org alone and created a new directory for the .com. In the directory for .com, I created a .htaccess file. I set it up per the answer on the webmasters site, like this:

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{http_host} ^thisdomain.com [NC]
RewriteRule ^(.*)$ http://thisdomain.org/$1 [R=301,NC,L]

I restarted Apache and tested it. Great. If I entered the URL thisdomain.com/about.php, it would just bring up thisdomain.org/about.php. It worked and I was really pleased with it. Then I used Google to bring up some search results and make sure they pointed to the right thing. Ugh. Total failure. All the links were like www.thisdomain.com/about.php and they all failed - the rewrite condition only matched a host starting with thisdomain.com, so www.thisdomain.com sailed straight through to the new, mostly empty .com directory. I experimented a little and did a whole heap of reading about Apache's rewrite rules, and came up with this new version of .htaccess for the .com directory.

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{http_host} ^thisdomain.com [NC]
RewriteRule ^(.*)$ http://thisdomain.org/$1 [R=301,NC,L]
RewriteCond %{http_host} ^www.thisdomain.com [NC]
RewriteRule ^(.*)$ http://thisdomain.org/$1 [R=301,NC,L]

I tested it and it worked fine. Now both www.thisdomain.com and thisdomain.com got redirected to thisdomain.org. Nice. I could improve this a little by combining the two RewriteCond lines with the [OR] flag and having only one RewriteRule, but I'll get to that later.
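
The combined version would presumably look something like this (a sketch, not yet on the live site):

Options +FollowSymlinks
RewriteEngine on
RewriteCond %{http_host} ^thisdomain.com [NC,OR]
RewriteCond %{http_host} ^www.thisdomain.com [NC]
RewriteRule ^(.*)$ http://thisdomain.org/$1 [R=301,NC,L]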

Then I thought about it some more. I still had duplicate content. Three quarters of the problem had now gone away, with thisdomain.com and www.thisdomain.com both redirected to thisdomain.org, but thisdomain.org and www.thisdomain.org still served the same content. So I went to the directory the org website runs from and created another .htaccess file there.

RewriteEngine On
RewriteCond %{http_host} ^www.thisdomain.org [NC]
RewriteRule ^(.*)$ http://thisdomain.org/$1 [R=301,NC,L]

More tests, and yes, I finally had it. No more duplicate content. All variations pointed to thisdomain.org.

So I plodded away and did pretty much whatever the validators suggested.

I added good structure with h1 and h2 tags. I added meta lines. I discovered that a tiled image used for the background was 137k. I didn't realise this when I set it up. So I converted that image from png to jpg and dropped the quality, and it improved the look and the size went down to 16k. I added height and width attributes to the few static images. I added appropriate alt tags to images.

Some things I haven't got around to doing yet. I probably don't need to create a sitemap.xml, because the site is very small and everything is clearly linked through the navigation. I do need to look at shrinking my CSS files, whether by simplifying and consolidating them, stripping whitespace, or gzipping them. And I need to calculate image dimensions on the fly so I can apply the height and width attributes to every img tag, not just the static ones.
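
Calculating the dimensions should just be a matter of PHP's getimagesize(); a rough sketch (the helper name and paths are placeholders, not the real site code):

<?php
// Rough sketch: emit an img tag with width and height worked out at run time.
function img_tag($src, $alt)
{
    $file = $_SERVER['DOCUMENT_ROOT'] . '/' . ltrim($src, '/');
    $size = @getimagesize($file);          // false if the file can't be read
    $dims = $size ? ' ' . $size[3] : '';   // $size[3] is 'height="..." width="..."'
    return '<img src="' . htmlspecialchars($src) . '"'
         . ' alt="' . htmlspecialchars($alt) . '"'
         . $dims . ' />';
}

echo img_tag('images/banner.jpg', 'Group banner');
?>

For the gzip side, Apache's mod_deflate can compress the CSS on the way out with AddOutputFilterByType DEFLATE text/css, assuming the module is loaded.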

Anyway, I overcame all the important issues, but still have some areas for improvement.

Now I have several other sites to clean up. And any new sites I build will have all this new stuff built into them right from the start.
