Thursday, September 13, 2012

Toward a more categorized web ...

Identifying which category a URL belongs to will be a crucial matter when it comes to web semantics. A catalog of the web is also a must for other ongoing research areas, such as unsupervised machine learning and networking systems.

The problem is that an affordable catalog of this kind is hard to find. When it comes to categorizing the web, human judgment is also preferred over computer algorithms, which makes an affordable, human-made catalog even harder to come by. This is where www.dmoz.org comes in.

Dmoz is also known as the "Open Directory Project"; the name Dmoz comes from its original domain name, "directory.mozilla.org". It is owned by Netscape and is still maintained with a lot of human participation.

Dmoz has a catalog of over 2 million URLs, categorized into 605,228 categories and growing. RDF dumps of Dmoz have been available under the Open Directory License, and since 2011 it has used a Creative Commons license.

RDF

RDF, the "Resource Description Framework", is a W3C specification used as a metadata model, that is, as a way to describe web resources. RDF data is commonly written in an XML-based syntax (RDF/XML).
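As a rough illustration of the data model: RDF describes a resource through statements of the form subject, predicate, object. The snippet below is only a sketch in plain Python, not an RDF library. The `Triple` type, the `describe` helper, the example URLs, and the `http://example.org/terms/topic` predicate are all made up for illustration.

```python
from collections import namedtuple

# One RDF statement: "subject has predicate with value object".
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical statements about a single web page (illustrative values only).
triples = [
    Triple("http://example.com/",
           "http://purl.org/dc/elements/1.1/title",
           "Example Site"),
    Triple("http://example.com/",
           "http://example.org/terms/topic",  # made-up predicate
           "Top/Computers/Internet"),
]


def describe(resource, statements):
    """Collect every predicate/object pair stated about one resource."""
    return {t.predicate: t.object for t in statements if t.subject == resource}


description = describe("http://example.com/", triples)
```

Here `description` maps each predicate to its value for the page, which is essentially what a directory entry (a URL plus its title and category) boils down to.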

Dmoz RDF dumps

The RDF dumps of Dmoz are available at this URL.

For me it was a bit hard to process the RDF files, since I could not find a well-documented Python library for the job, so I used this script with some modifications to convert the RDF dump to a SQL dump. (The modifications were needed because the script targets old PHP versions. If you use a newer PHP interpreter, check the script for deprecated functions and so on.)
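For anyone who would rather stay in Python, the same kind of conversion can be sketched with the standard library alone. This is not the PHP script mentioned above, just a minimal sketch under some assumptions: that the content dump is XML made of `ExternalPage` elements carrying the URL in an `about` attribute and the category in a `topic` child, as in the small inline `SAMPLE`. The sample data, the `pages` table, and the function names are all hypothetical.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Tiny stand-in for the real dump; the element layout is an assumption.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example</d:Title>
    <topic>Top/Computers/Internet</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/">
    <d:Title>Another</d:Title>
    <topic>Top/Science</topic>
  </ExternalPage>
</RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_url_categories(stream):
    """Yield (url, category) pairs while streaming through the file."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        url = elem.get("about")
        topic = None
        for child in elem:
            if local_name(child.tag) == "topic":
                topic = (child.text or "").strip()
                break
        if url and topic:
            yield url, topic
        elem.clear()  # drop the element's children to keep memory use low


def load_into_sqlite(conn, pairs):
    """Store the pairs in a hypothetical 'pages' table."""
    conn.execute("CREATE TABLE pages (url TEXT, category TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?)", pairs)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_into_sqlite(conn, iter_url_categories(io.BytesIO(SAMPLE)))

rows = conn.execute("SELECT url, category FROM pages ORDER BY url").fetchall()
sql_dump = list(conn.iterdump())  # SQL statements that recreate the table
```

For a real dump you would pass an open file (in binary mode) instead of the `io.BytesIO` sample. `iterparse` reads the file incrementally, which matters because the dump is far too large to load comfortably in one go.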




