(A semi-scientific study)
Bob DuCharme, October 23, 2002
While thinking about linking, I was curious about real-world use of attributes of the A element in web pages. After considering various ways to gather sample data to churn through, I found the Google programming contest, which links to a tar.gz file with 162 megs of sample web pages. After John Cowan told me about Yahoo's random link (http://random.yahoo.com/bin/ryl) and curl, I also wrote a script that had curl pull down about 7000 random pages from Yahoo, for a total of about 62 megs. I then wrote some scripts to analyze the use of A attributes in the Google and Yahoo sample data, and the results are below. (I also saved much smaller data files, strictly concerned with the use of A attributes, in case anyone else would like to analyze them.)
After writing a script to remove the Ctrl-Z characters scattered through the data files, I wrote another that looks through them for all the A element attributes allowed by the xhtml1-transitional DTD. The script wrote out a file showing the attribute use (107K zipped Google data file here, and 29K zipped Yahoo data file here). The script maps all names to lower case for consistency and puts all the attributes for a given A element on a single line. For example,
 href 
 href 
 name href 
represents three A elements, two of which have only an href attribute and one of which has both a name and an href attribute. Each line begins and ends with a single space so that you can, for example, count all the lines with " href " (a count I've already done, as you'll see below). Hopefully the format of these files will make it easy for others to play with them.
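As a rough illustration of the kind of script described above (the original script is not shown here, so this is only a sketch under assumptions), the following Python uses the standard library's `html.parser` to write one line per A element in the same space-delimited format. The `ALLOWED` set is an abbreviated, illustrative subset of the attributes the DTD permits, and the names `AnchorAttrs` and `attribute_lines` are made up for this example.

```python
from html.parser import HTMLParser

# Illustrative subset of the attributes allowed on A by the
# xhtml1-transitional DTD. This is an assumption for the sketch,
# not the full list the original script checked.
ALLOWED = {"href", "name", "target", "title", "class", "id",
           "rel", "rev", "style", "onclick", "onmouseover", "onmouseout"}


class AnchorAttrs(HTMLParser):
    """Collects one formatted line of attribute names per A element."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser already lower-cases tag and attribute names.
        if tag != "a":
            return
        names = [name for name, _value in attrs if name in ALLOWED]
        # Leading and trailing space, so " href " can be matched as a whole word.
        self.lines.append(" " + " ".join(names) + " ")


def attribute_lines(html):
    """Return the formatted attribute lines for every A element in html."""
    parser = AnchorAttrs()
    parser.feed(html)
    parser.close()
    return parser.lines


sample = '<A HREF="a.html">x</A><a href="b.html">y</a><a NAME="top" href="#c">z</a>'
for line in attribute_lines(sample):
    print(repr(line))
```

Because every attribute name is surrounded by spaces, a plain substring test such as `" href " in line` counts uses of href without also matching hreflang.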
Of course, a properly scientific study would use larger input samples and maybe take into account the fact that pages reached through Yahoo are more likely to be professionally designed portal pages than a truly random selection of web pages would be. I still thought it was interesting to see how the numbers came out, especially after all the assertions lately on various mailing lists about which attributes of which elements People Use or People Don't Use in their web pages.
total number of A elements:
This table and the Yahoo table that follows show the number of elements using each attribute, whether alone or in combination with other attributes. For example, 78% of Google sample A elements have only an href attribute, while 14% have href and at least one other attribute, for a total of 92% that have an href attribute. Some numbers on exclusive use follow the Yahoo table.
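The split between exclusive use and combined use can be recomputed from the data-file format described earlier. The following is a minimal Python sketch, assuming each line has already been read with its leading and trailing space intact and its newline removed; the function name `usage_breakdown` and the toy input are invented for illustration and are not taken from the actual data files.

```python
def usage_breakdown(lines, attr):
    """Count lines where attr appears alone, with others, and at all."""
    token = f" {attr} "
    total = len(lines)
    alone = sum(1 for line in lines if line == token)
    any_use = sum(1 for line in lines if token in line)
    with_others = any_use - alone

    def pct(n):
        # Percentage of all A elements, rounded to one decimal place.
        return round(100 * n / total, 1) if total else 0.0

    return {
        "total": total,
        "alone": (alone, pct(alone)),
        "with_others": (with_others, pct(with_others)),
        "any": (any_use, pct(any_use)),
    }


# Toy data only: ten made-up A elements, not figures from the study.
toy = [" href "] * 7 + [" name href ", " href target ", " name "]
print(usage_breakdown(toy, "href"))
```

The "alone" and "with_others" shares always add up to the "any" share, which is the same relationship as the 78% + 14% = 92% example above.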
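[Google table: number of A elements using each attribute, with a "(no others used)" column; table data not preserved]
[Yahoo table: number of A elements using each attribute, with a "(no others used)" column; table data not preserved]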
rel and rev never appeared in any A elements.
Out of 79,626 img elements in the Yahoo data, 8 (.01%) had the longdesc attribute, and 4 of those were URIs. Of the 124,484 img elements in the Google data, 6 (.005%) had longdesc attributes. 5 of those were URIs, but 4 of them, which I assume came from the same document, were relative URIs.
Google: 4,172 (1.4%)
Yahoo: 354 (0.3%)
Google: 235,947 (78%)
Yahoo: 94,458 (69%)
Google: 22,722 (10%)
Yahoo: 4,490 (14%)
Google: 258,696 (86%)
Yahoo: 99,064 (73%)