General

The Open Directory Project distributes its data in this directory as Resource Description Framework (RDF) files.

RDF is a graph description language, expressed here in eXtensible Markup Language (XML).

Errata

Unfortunately, the format of the Open Directory Project data is not optimal. In particular, it is much longer than it needs to be, and it is not even legal XML.

Because of this, the format will need to change from time to time. We will note the changes in this file until we have a better method for notifying users of the changes.

Changes

2000-11-20
A few additions have been made to the RDF files. There is a new tag for altlang categories, which behaves exactly like the symbolic tags. Tags have also been added for mediadate and ages (for Kids_and_Teens); they appear on a URL only when that URL has those properties.

1999-12-09
The RDF files are moving to UTF-8. You may check out an advance copy of the new format at http://rdf.dmoz.org/rdf/World.rdf.u8.gz. I hope that this will clear up many of the problems that some users have been having with the format. If you notice any problems, please send mail to truel@dmoz.org.

We will continue to generate the current RDF files until at least January 8, 2000. We will be generating UTF-8 files periodically until that date. After January 9, all RDF files will be in the UTF-8 character set.

N.B. Some languages may have some incorrect characters. More precisely, some of our categories do not yet have a character set associated with them, so I am converting them to UTF-8 as though they were encoded in ISO-8859-1. Please do not email me about which character set a given language should be in; write only if you know which character set the given ODP category is actually in.
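
The fallback described in the note above can be sketched in Python as follows. This is only an illustration: the helper name to_utf8 and the optional charset argument are assumptions made for the example, not part of any ODP tool.

```python
from typing import Optional


def to_utf8(raw: bytes, charset: Optional[str] = None) -> bytes:
    """Re-encode category text as UTF-8 (hypothetical helper)."""
    # When no character set is recorded for the category, treat the bytes
    # as ISO-8859-1, which is the fallback the note above describes.
    source = charset or "iso-8859-1"
    return raw.decode(source).encode("utf-8")
```

A category whose real character set is not ISO-8859-1 comes out with wrong characters under this fallback, which is why only the category's actual charset is useful to report.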

1999-08-25
I have created an eGroups.com mailing list, odp-rdf-announce, to announce changes to the RDF format. To sign up, subscribe to the announcement group with your email address. An archive of the list is also available.

1999-08-24
We now provide redirect.rdf.gz, which lists categories that have been moved and where they have been moved to. This should remove your need for the catmv.log.gz file.

Redirections here are pre-chained. That is, if a category has been moved several times, the redirection listed is the first destination in the chain that actually hits an existing category. If someone moves a category around and someone else creates a category at one of the intermediate locations, the newcomer is the redirection listed.
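
As a rough sketch of what pre-chaining means, the Python below follows a chain of moves until it reaches a category that currently exists. The function name prechain, the moves mapping, and the live set are all hypothetical names chosen for the example; they do not describe how the dump is actually generated.

```python
from typing import Dict, Optional, Set


def prechain(old: str, moves: Dict[str, str], live: Set[str]) -> Optional[str]:
    """Return the first existing category reached by following moves from old."""
    seen = {old}
    current = old
    while current in moves:
        current = moves[current]
        if current in live:
            # First hop that hits a real category, even if it is a newcomer
            # created at an intermediate location.
            return current
        if current in seen:
            # Guard against a cycle of moves that never reaches a category.
            return None
        seen.add(current)
    return None
```

With moves of Top/A to Top/B and Top/B to Top/C, the result for Top/A is Top/C when only Top/C exists, and Top/B when someone has since created a category at Top/B.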

1999-07-29
Character escaping is now done inside all fields, not just in Titles and Descriptions. The following four characters are quoted, so you will have to unquote them when converting to HTML:

 &  &amp;
 <  &lt;
 >  &gt;
 "  &quot;

High-byte characters and non-printing control characters are also being quoted now. I have decided against using actual character quoting (e.g. &#x21ae;), since supporting full Unicode is beyond the capabilities of some of our customers. Instead, the hex value of these characters will be presented, and if you wish to convert to Unicode, you will have to keep track of the charset for the given category.

As an example, the byte value 200 will be presented as &xC8; whether that character came from the 8859-1 character set (È or &#xC8;), from 8859-2 (Č or &#x010C;), or from any other character set.
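
The quoting rules in this entry could be undone along the following lines. This Python sketch assumes that every high byte is written as &x followed by exactly two hex digits and a semicolon, and that the rest of the field is plain ASCII. The function name decode_field is hypothetical, and the caller has to supply the charset of the category.

```python
import re

# Assumption: every quoted byte looks like "&x" + two hex digits + ";".
_HEX_ESCAPE = re.compile(rb"&x([0-9A-Fa-f]{2});")

# "&amp;" comes last so that "&amp;lt;" becomes the literal text "&lt;".
_ENTITIES = (("&lt;", "<"), ("&gt;", ">"), ("&quot;", '"'), ("&amp;", "&"))


def decode_field(field: str, charset: str) -> str:
    """Turn one quoted RDF field back into Unicode text."""
    # Everything outside the escapes is assumed to be plain ASCII.
    raw = field.encode("ascii")
    # Put the original bytes back, then decode the whole byte string at once
    # so that multi-byte character sets are handled too.
    raw = _HEX_ESCAPE.sub(lambda m: bytes([int(m.group(1), 16)]), raw)
    text = raw.decode(charset)
    # Finally unquote the four special characters.
    for entity, char in _ENTITIES:
        text = text.replace(entity, char)
    return text
```

Calling decode_field("&xC8;", "iso-8859-1") gives È, while the same input with "iso-8859-2" gives Č, which is the ambiguity the example above describes.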

1999-05-18
Symbolic links that have been separated from the rest of the subcategories now have the link type "<symbolic1 ...>". This is exactly analogous to "<narrow1 ...>" (for separated subcategories).

2002-07-23
Data in the Netscape/ tree is no longer included in the main RDF dump. Instead, it is provided in these files:
netscape-content.rdf  
netscape-structure.rdf
netscape-terms.rdf