Processing the Web

Some additional notes from my talk at the October meeting of CTO Forum L.A.

First, anyone interested in processing the web with intelligence should read Programming Collective Intelligence. Of particular interest are the chapters on “Discovering Groups”, “Building Price Models”, and “Finding Independent Features”. There is also an appendix that lists third-party libraries. Visit O’Reilly directly to see the Table of Contents.

The Firefox plugins I showed were:

  • ClearForest Gnosis: This is the best example of entity extraction. It parses the web page being viewed and finds
  • hostip.info: Mostly to demonstrate that metadata is everywhere, this plugin is a “Community Geotarget IP Project” and gives you location info about links in a tooltip.
  • About this Site: Like hostip.info, this gives you access to tons of metadata about the current web page.
  • WASP: “Web Analytics Solution Profiler” gives you information about what analytics packages are being used to track you as you browse the web.
  • KGen: “Extracts” keywords from a page. This is pretty primitive; the technology in Adapt SEM is much superior, but then you can’t download it for free and incorporate it in your browser… 🙂
  • Interclue: Burrows through a link and gives you several important stats and a computer-generated page summary and thumbnail. Muy bueno for research as well as for the techniques it demonstrates.
  • Operator: “Operator is an extension for Firefox that adds the ability to interact with semantic data on web pages, including microformats, RDFa and eRDF” – this tool will find explicitly marked-up microformats.
  • Yahoo! Pipes: The first mashup editor – grab anything from the web, do stuff to it, manipulate the results, and display them. For those not allergic, there is also the Google Mashup Editor.
  • MIT SIMILE Project: “SIMILE is focused on developing robust, open source tools based on Semantic Web technologies that improve access, management and reuse among digital assets”. I’ve still never gotten PiggyBank to play nicely with both my browser and Java at the same time, but the set of tools this project is turning out looks awesome.
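
To give a feel for what "primitive" keyword extraction looks like, here is a minimal sketch that assumes nothing smarter than term frequency with a stopword filter. It is not KGen's actual algorithm (that is not described here); the function name `extract_keywords` and the tiny stopword list are made up for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; real tools use far larger ones).
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "with", "that", "this",
}


def extract_keywords(text, top_n=5):
    """Return the top_n most frequent non-stopword terms in text.

    Hypothetical helper: plain term frequency, no stemming, no phrase detection.
    """
    # Lowercase the text and keep alphabetic runs only.
    words = re.findall(r"[a-z]+", text.lower())
    # Drop stopwords and very short tokens, then count what is left.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = "The web is a web of data and the data is metadata"
    print(extract_keywords(sample, top_n=2))
```

Anything better than this (phrases, stemming, weighting terms against a background corpus) is where the more serious tools presumably earn their keep.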

You might recall that I said microformats are not the answer, but they are really useful. Here is another excellent book for those interested in learning about them: Microformats: Empowering Your Markup for Web 2.0.
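
Part of why microformats are so useful is that they are just agreed-upon class names in ordinary HTML, so finding them takes very little code. The sketch below uses only the Python standard library to pull formatted names (class "fn") out of hCard blocks (class "vcard"). The class `HCardNameParser` and the helper `find_hcard_names` are hypothetical names for this example, and the parser makes a simplifying assumption: the name is plain text directly inside the "fn" element.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class HCardNameParser(HTMLParser):
    """Collect the text of class="fn" elements found inside class="vcard" elements."""

    def __init__(self):
        super().__init__()
        self.vcard_depth = 0  # 0 means we are outside any vcard
        self.in_fn = False    # True while reading text directly inside an fn element
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.vcard_depth or "vcard" in classes:
            self.vcard_depth += 1
            if "fn" in classes:
                self.in_fn = True
                self.names.append("")

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        # Simplification: any closing tag ends the current fn text.
        self.in_fn = False
        if self.vcard_depth:
            self.vcard_depth -= 1

    def handle_data(self, data):
        if self.in_fn:
            self.names[-1] += data


def find_hcard_names(html):
    """Return the formatted names of all hCards in an HTML string."""
    parser = HCardNameParser()
    parser.feed(html)
    parser.close()
    return [name.strip() for name in parser.names if name.strip()]


if __name__ == "__main__":
    page = ('<div class="vcard"><span class="fn">Jane Doe</span> '
            '<a class="url" href="http://example.com">site</a></div>')
    print(find_hcard_names(page))
```

This only works when the author has explicitly marked up the page, which is exactly the limitation that keeps microformats from being the whole answer.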

I mentioned RDF and SPARQL, but you’re on your own with those. Or I’ll have to do another talk – it’s just too big an anthill to kick over in this little post.

Several people also asked about language processing (NLP) libraries. With no further editorial explanation, I’d recommend taking a look at:

Your bonus content is this cool little application called the Smart Editor.
