Processing the Web

Some additional notes from my talk at the October meeting of CTO Forum L.A.

First, anyone interested in processing the web with intelligence should read Programming Collective Intelligence. Of particular interest are the chapters on “Discovering Groups”, “Building Price Models”, and “Finding Independent Features”. There is also an appendix that lists third-party libraries. Visit O’Reilly directly to see the Table of Contents.

The Firefox plugins I showed were:

  • ClearForest Gnosis: This is the best example of entity extraction. It parses the web page being viewed and finds the entities it mentions (people, companies, places, and so on). A toy sketch of the underlying idea appears just after this list.
  • hostip.info: Mostly to demonstrate that metadata is everywhere, this plugin is a “Community Geotarget IP Project” and gives you location info about links in a tooltip.
  • About this Site: Like hostip.info, this gives you access to tons of metadata about the current web page.
  • WASP: “Web Analytics Solution Profiler” gives you information about what analytics packages are being used to track you as you browse the web.
  • KGen: “Extracts” keywords from a page. This is pretty primitive – the technology in Adapt SEM is far superior, but then you can’t download that for free and incorporate it into your browser… 🙂 (A bare-bones sketch of the basic idea also follows this list.)
  • Interclue: Burrows through a link and gives you several useful stats plus a computer-generated page summary and thumbnail. Very handy for research as well as for the techniques it demonstrates.
  • Operator: “Operator is an extension for Firefox that adds the ability to interact with semantic data on web pages, including microformats, RDFa and eRDF” – this tool will find explicitly marked-up microformats.
  • Yahoo! Pipes: The first mashup editor – grab anything from the web, do stuff to it, manipulate the results, and display them. For those not allergic, there is also the Google Mashup Editor.
  • MIT SIMILE Project: “SIMILE is focused on developing robust, open source tools based on Semantic Web technologies that improve access, management and reuse among digital assets”. I’ve still never gotten PiggyBank to play nicely with both my browser and Java at the same time, but the set of tools this project is turning out looks awesome.
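
To give a flavor of what Gnosis-style entity extraction is doing under the hood, here is a toy sketch using NLTK. This is my own illustration rather than Gnosis’s actual pipeline, and the NLTK resource names can vary a bit between versions:

    import nltk

    # One-time resource downloads; exact package names differ across NLTK versions.
    for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
        nltk.download(pkg, quiet=True)

    text = "Tim Berners-Lee founded the World Wide Web Consortium at MIT in 1994."

    tokens = nltk.word_tokenize(text)   # split into words
    tagged = nltk.pos_tag(tokens)       # part-of-speech tags
    tree = nltk.ne_chunk(tagged)        # group tagged words into named-entity chunks

    for subtree in tree.subtrees():
        if subtree.label() in ("PERSON", "ORGANIZATION", "GPE", "LOCATION"):
            entity = " ".join(word for word, tag in subtree.leaves())
            print(subtree.label(), "->", entity)

Against a real page you would strip the HTML first; the point is just that named-entity chunking is a library call away.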
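
And here is roughly what keyword extraction at the KGen end of the spectrum boils down to: count the words that aren’t noise. Real tools weight terms by markup, position, and phrasing; this sketch sticks to the standard library:

    import re
    from collections import Counter

    # A tiny stop-word list; a real extractor would use a much larger one.
    STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is",
                  "it", "on", "for", "with", "that", "this", "as", "are"}

    def extract_keywords(text, top_n=10):
        """Rank the non-stop-words in text by raw frequency."""
        words = re.findall(r"[a-z]+", text.lower())
        counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
        return counts.most_common(top_n)

    sample = ("Operator adds the ability to interact with semantic data on "
              "web pages, including microformats, RDFa and eRDF.")
    print(extract_keywords(sample, top_n=5))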

You might recall that I said microformats are not the answer, but they are really useful. Here is another excellent book for those interested in learning about them: Microformats: Empowering Your Markup for Web 2.0.
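
Since microformats are just agreed-upon class names and attributes, you don’t need a full semantic-web stack to read them. Here’s a minimal sketch that pulls hCard contacts out of a page; requests and BeautifulSoup are my choice of tools, and the URL is only a placeholder:

    # pip install requests beautifulsoup4
    import requests
    from bs4 import BeautifulSoup

    def extract_hcards(url):
        """Return the hCard contacts (class="vcard") marked up on a page."""
        html = requests.get(url, timeout=10).text
        soup = BeautifulSoup(html, "html.parser")
        cards = []
        for card in soup.select(".vcard"):
            name = card.select_one(".fn")
            org = card.select_one(".org")
            link = card.select_one("a.url")
            cards.append({
                "name": name.get_text(strip=True) if name else None,
                "org": org.get_text(strip=True) if org else None,
                "url": link.get("href") if link else None,
            })
        return cards

    # Point this at any page that publishes hCards.
    for card in extract_hcards("http://example.com/contacts"):
        print(card)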

I mentioned RDF and SPARQL, but you’re on your own with those. Or I’ll have to do another talk – it’s just too big an anthill to kick over in this little post.

Several people also asked about natural language processing (NLP) libraries. With no further editorial explanation, I’d recommend taking a look at:

Your bonus content is this cool little application called the Smart Editor.


3 thoughts on “Processing the Web”

  1. Good to see Interclue in such fine company! Have you tried the new beta? 1.5 is going to be great! If we can only get over our rampant perfectionism and actually release the sucker!

    http://interclue.com/beta.html

    For one thing, 1.5 stops trying to autoscrape Amazon product pages and does the sensible thing: it uses one of the product preview widgets like the one you use on this page. But (oh noes!) it turns out the default Adblock Plus blocklist actually *blocks* those widgets, so users running it get a blank preview window. So one of the last to-do items is rewriting that particular clueview to use one of the other Amazon API methods – which will let us get more product info anyway.
