This Perl program creates the XML needed to feed the Google Search Appliance (GSA), Google’s indexer. It produces a web feed whose feed type is “metadata-and-url”. I have mostly seen this used in environments where the GSA is restricted from indexing certain content, where the content sits behind a database such as Oracle or Sybase, or, as in my case, where I want to control what the GSA indexes. This technique lets data owners define metadata that is piped into the GSA just as though the content had been indexed the normal way. We use this in our shop to “tell” Google what to index instead of having it index everything, thereby reducing false-positive hits.
I also wrote this program in Java, but for convenience I have provided only the Perl version here. See below for the program and its output.
GoogleFeeder.pl
open(OUTPUT, "> GoogleFeeder.xml") or die "Error: $!\n";
print OUTPUT "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
print OUTPUT "<!DOCTYPE gsafeed PUBLIC \"-//google//DTD GSA Feeds//EN\" \"\">\n";
print OUTPUT "<gsafeed>\n";
print OUTPUT "\t<header>\n";
print OUTPUT "\t\t<datasource>CSATECHCONSULTINGLLC</datasource>\n";
print OUTPUT "\t\t<feedtype>metadata-and-url</feedtype>\n";
print OUTPUT "\t</header>\n";
print OUTPUT "\t<group>\n";
print OUTPUT "\t\t<record url=\"http://csatechconsulting.com/attachments/055_Randy_Hinton Resume.pdf\" mimetype=\"APPLICATION/PDF\">\n";
print OUTPUT "\t\t\t<metadata>\n";
print OUTPUT "\t\t\t\t<meta name=\"title\" content=\"Randy Hinton Resume\"/>\n";
print OUTPUT "\t\t\t\t<meta name=\"description\" content=\"Contains the resume for Randy Hinton, President CSATech Consulting, LLC\"/>\n";
print OUTPUT "\t\t\t\t<meta name=\"keywords\" content=\"RESUME, RANDY HINTON, CSATECH, ORACLE, TEAMWORKS, JAVA, CONTENT MANAGEMENT\"/>\n";
print OUTPUT "\t\t\t</metadata>\n";
print OUTPUT "\t\t</record>\n";
print OUTPUT "\t\t<record url=\"http://csatechconsulting.com/blog/wp-content/uploads/2009/01/scan0001.pdf\" mimetype=\"APPLICATION/PDF\">\n";
print OUTPUT "\t\t\t<metadata>\n";
print OUTPUT "\t\t\t\t<meta name=\"title\" content=\"Lotus Notes Database Extraction Program\"/>\n";
print OUTPUT "\t\t\t\t<meta name=\"description\" content=\"Lotus Notes Database Extraction Program\"/>\n";
print OUTPUT "\t\t\t\t<meta name=\"keywords\" content=\"LOTUS NOTES, ETL, EXTRACTION, JAVA\"/>\n";
print OUTPUT "\t\t\t</metadata>\n";
print OUTPUT "\t\t</record>\n";
print OUTPUT "\t</group>\n";
print OUTPUT "</gsafeed>";
close(OUTPUT);

GoogleFeeder.xml (output from GoogleFeeder.pl):
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE gsafeed PUBLIC "-//google//DTD GSA Feeds//EN" "">
<gsafeed>
	<header>
		<datasource>CSATECHCONSULTINGLLC</datasource>
		<feedtype>metadata-and-url</feedtype>
	</header>
	<group>
		<record url="http://csatechconsulting.com/attachments/055_Randy_Hinton Resume.pdf" mimetype="APPLICATION/PDF">
			<metadata>
				<meta name="title" content="Randy Hinton Resume"/>
				<meta name="description" content="Contains the resume for Randy Hinton, President CSATech Consulting, LLC"/>
				<meta name="keywords" content="RESUME, RANDY HINTON, CSATECH, ORACLE, TEAMWORKS, JAVA, CONTENT MANAGEMENT"/>
			</metadata>
		</record>
		<record url="http://csatechconsulting.com/blog/wp-content/uploads/2009/01/scan0001.pdf" mimetype="APPLICATION/PDF">
			<metadata>
				<meta name="title" content="Lotus Notes Database Extraction Program"/>
				<meta name="description" content="Lotus Notes Database Extraction Program"/>
				<meta name="keywords" content="LOTUS NOTES, ETL, EXTRACTION, JAVA"/>
			</metadata>
		</record>
	</group>
</gsafeed>
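
The program above hard-codes every record, which works for a couple of documents but gets tedious as the list grows. As a rough sketch only, the same feed could be built from a list of records, with XML special characters escaped so that a title or URL containing & or a quote does not break the feed. The names xml_escape and build_feed, and the hash layout for a record, are my own assumptions for illustration; they are not part of the original program or of any Google library. The sketch also does not percent-encode URLs, so a space in a URL passes through unchanged.

```perl
use strict;
use warnings;

# Hypothetical helper: escape the five XML special characters so a value
# is safe inside an attribute or element body.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so later entities are not re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    $text =~ s/'/&apos;/g;
    return $text;
}

# Hypothetical helper: build a metadata-and-url feed as a string from a
# datasource name and a list of record hashes (url, mimetype, meta).
sub build_feed {
    my ($datasource, @records) = @_;
    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
            . qq{<!DOCTYPE gsafeed PUBLIC "-//google//DTD GSA Feeds//EN" "">\n}
            . "<gsafeed>\n"
            . "\t<header>\n"
            . "\t\t<datasource>" . xml_escape($datasource) . "</datasource>\n"
            . "\t\t<feedtype>metadata-and-url</feedtype>\n"
            . "\t</header>\n"
            . "\t<group>\n";
    for my $rec (@records) {
        $xml .= sprintf(qq{\t\t<record url="%s" mimetype="%s">\n},
            xml_escape($rec->{url}), xml_escape($rec->{mimetype}));
        $xml .= "\t\t\t<metadata>\n";
        # meta is a list of [name, content] pairs so the output order is stable
        for my $pair (@{ $rec->{meta} }) {
            $xml .= sprintf(qq{\t\t\t\t<meta name="%s" content="%s"/>\n},
                xml_escape($pair->[0]), xml_escape($pair->[1]));
        }
        $xml .= "\t\t\t</metadata>\n";
        $xml .= "\t\t</record>\n";
    }
    $xml .= "\t</group>\n</gsafeed>";
    return $xml;
}

# Example data taken from the second record of the feed above.
my @records = (
    {
        url      => 'http://csatechconsulting.com/blog/wp-content/uploads/2009/01/scan0001.pdf',
        mimetype => 'APPLICATION/PDF',
        meta     => [
            [ title       => 'Lotus Notes Database Extraction Program' ],
            [ description => 'Lotus Notes Database Extraction Program' ],
            [ keywords    => 'LOTUS NOTES, ETL, EXTRACTION, JAVA' ],
        ],
    },
);

print build_feed('CSATECHCONSULTINGLLC', @records), "\n";
```

This version prints the feed to standard output; to get a GoogleFeeder.xml file as in the original, redirect the output or print to a file handle instead. Keeping the metadata as ordered pairs rather than a plain hash is a deliberate choice here, since Perl hashes do not preserve insertion order and the generated file would otherwise change from run to run.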