Processing the Web

Some additional notes from my talk at the October meeting of CTO Forum L.A.

First, anyone interested in processing the web with intelligence should read Programming Collective Intelligence (Amazon price info, etc., below). Of particular interest are the chapters on “Discovering Groups”, “Building Price Models”, and “Finding Independent Features”. There is also an appendix that lists third-party libraries. Visit O’Reilly directly to see the Table of Contents.

The Firefox plugins I showed were:

  • ClearForest Gnosis: This is the best example of entity extraction. It parses the web page being viewed and finds
  • hostip.info: Mostly to demonstrate that metadata is everywhere, this plugin is a “Community Geotarget IP Project” and gives you location info about links in a tooltip.
  • About this Site: Like hostip.info, this gives you access to tons of metadata about the current web page.
  • WASP: “Web Analytics Solution Profiler” gives you information about what analytics packages are being used to track you as you browse the web.
  • KGen: “Extracts” keywords from a page. This is pretty primitive – the technology in Adapt SEM is much superior, but then you can’t download it for free and incorporate it in your browser… :-)
  • Interclue: Burrows through a link and gives you several important stats and a computer-generated page summary and thumbnail. Muy bueno for research as well as for the techniques it demonstrates.
  • Operator: “Operator is an extension for Firefox that adds the ability to interact with semantic data on web pages, including microformats, RDFa and eRDF” – this tool will find explicitly marked-up microformats.
  • Yahoo! Pipes: The first mashup editor – grab anything from the web, do stuff to it, manipulate the results, and display them. For those not allergic, there is also the Google Mashup Editor.
  • MIT SIMILE Project: “SIMILE is focused on developing robust, open source tools based on Semantic Web technologies that improve access, management and reuse among digital assets”. I’ve still never gotten PiggyBank to play nicely with both my browser and Java at the same time, but the set of tools this project is turning out looks awesome.
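
To give a feel for what “primitive” keyword extraction means, here is a minimal sketch that just counts word frequencies after dropping common words. This is not how KGen or Adapt SEM work internally (I have not seen either codebase); the stopword list, the function name extract_keywords, and the sample sentence are all made up for illustration.

```python
import re
from collections import Counter

# Tiny, hypothetical stopword list; a real tool would use a much larger one.
STOPWORDS = {
    "the", "a", "an", "and", "or", "of", "to", "in", "is",
    "it", "for", "on", "with", "that", "this",
}


def extract_keywords(text, top_n=5):
    """Return up to top_n of the most frequent non-stopword terms in text."""
    # Lowercase the text and keep runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text.lower())
    # Skip stopwords and very short tokens, then count what is left.
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return [word for word, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    sample = (
        "Microformats mark up data on the web. "
        "Web data with microformats is easy to process."
    )
    print(extract_keywords(sample, top_n=3))
```

Raw frequency counting like this knows nothing about phrases, synonyms, or what the page is actually about, which is why the results from simple extractors feel crude.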

You might recall that I said microformats are not the answer, but they are really useful. Here is another excellent book for those interested in learning about them: Microformats: Empowering Your Markup for Web 2.0 (Amazon price info, etc., below).
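
For those who have never seen one, a microformat is just ordinary HTML with agreed-upon class names. The sketch below pulls the formatted name (the “fn” class from hCard) out of a snippet using only Python’s standard library. The class name FormattedNameParser, the helper find_hcard_names, and the sample markup are my own assumptions for illustration; this is not how Operator is implemented, and it deliberately ignores nested markup inside the name element.

```python
from html.parser import HTMLParser


class FormattedNameParser(HTMLParser):
    """Collects the text of elements whose class list contains 'fn'."""

    def __init__(self):
        super().__init__()
        self._capturing = False  # True while inside an 'fn' element
        self._buffer = []
        self.names = []

    def handle_starttag(self, tag, attrs):
        # The class attribute may hold several space-separated names.
        classes = (dict(attrs).get("class") or "").split()
        if "fn" in classes:
            self._capturing = True
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing tag ends the capture, so
        # nested elements inside the 'fn' element are not handled.
        if self._capturing:
            self.names.append("".join(self._buffer).strip())
            self._capturing = False


def find_hcard_names(html):
    """Return the text of every 'fn' element found in the given HTML."""
    parser = FormattedNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


if __name__ == "__main__":
    sample = (
        '<div class="vcard">'
        '<span class="fn">Jane Doe</span>, '
        '<span class="org">Example Co</span>'
        "</div>"
    )
    print(find_hcard_names(sample))
```

The point is that the markup is explicit: nothing has to be guessed, which is what makes microformats useful and also why they only help on pages whose authors bothered to add them.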

I mentioned RDF and SPARQL, but you’re on your own with those. Or I have to do another talk – it’s just too big an anthill to kick over in this little post.

 

Several people also asked about language processing (NLP) libraries. With no further editorial explanation, I’d recommend taking a look at:

Your bonus content is this cool little application called the Smart Editor.

 


Tools for Startups – 80% of what you need at 20% of the expense and maintenance

There’s a great article over at Read/WriteWeb on Software for Virtual Teams.

Read/WriteWeb’s recommendations are:

I think that Basecamp is overrated. It has serious deficiencies (for me – YMMV) in tracking task dependencies, creating types/tags for tasks, time tracking, and automatic pinging of assignees for uncompleted tasks. It’s almost good for the stated purpose: collaborating on (simple) creative projects that span multiple company boundaries and teams.

I’m still waiting for a hosted project management-lite solution that does 80% or more of what I want.

I agree 100% with the endorsement of Skype, GoToMeeting, and CVSDude. I use all three currently and I chose them because of the good experiences I had with each of them on other projects. CVSDude also offers hosted trac and bugzilla instances, which save you the trouble of getting that stuff elsewhere.
On the accounting front, I’m not sure there’s any choice but QuickBooks – I got PeachTree for free ($149 w/ a $149 rebate, so I guess I paid the tax) and just had to go get QuickBooks because my accountant doesn’t speak anything else. I also did a quick survey of other accountants, and none of them uses anything but QuickBooks for clients my size.

I’ve already said I’m not a fan of Basecamp – even though I am a big fan of 37Signals – but I only have a wishlist of features they don’t support. In practice, I still use Basecamp for a couple of projects and used it at my company until we upgraded to Atlassian’s Jira and Confluence. Josh left a comment on the above-mentioned post recommending goplan and Clever Tools. Clever Tools claims to do a lot more than just collaboration/project management. I’ll be checking it out.

Calendaring is also tough. I’ve tried everything listed, and I don’t like any of them. We are using Zimbra’s Collaboration Suite (ZCS). Calendaring should go together with email, so one of the hosted MS Exchange providers is also an option. I’ve recommended mi8 in the past.

I know nothing about online file storage or backup. I plan to learn soon, though.

While I was looking around, I also found this site that I thought was good: VerusNova – Technology for Small Business Success. It’s a pretty standard link/review blog, but they cover a lot of ground on the kinds of hosted and open source tools that interest me.

PHP Frameworks

Stay Tuned

Whiteboard drawing of NewTool - an open source platform for Free Media and Microcontent + universal personal proxy/server + authoring and replication

A sampling of references/influences/thought-starters for those interested in talking about this: