Processing the Web

November 28th, 2007

Some additional notes from my talk at the October meeting of CTO Forum L.A.

First, anyone interested in processing the web with intelligence should read Programming Collective Intelligence (Amazon price info, etc., below). Of particular interest are the chapters on “Discovering Groups”, “Building Price Models”, and “Finding Independent Features”. There is also an appendix that lists third-party libraries. Visit O’Reilly directly to see the Table of Contents.

The Firefox plugins I showed were:

  • ClearForest Gnosis: This is the best example of entity extraction. It parses the web page being viewed and finds
  • hostip.info: Mostly to demonstrate that metadata is everywhere, this plugin is a “Community Geotarget IP Project” and gives you location info about links in a tooltip.
  • About this Site: Like hostip.info, this gives you access to tons of metadata about the current web page.
  • WASP: “Web Analytics Solution Profiler” gives you information about what analytics packages are being used to track you as you browse the web.
  • KGen: “Extracts” keywords from a page. This is pretty primitive - the technology in Adapt SEM is far superior, but then you can’t download it for free and incorporate it in your browser… :-)
  • Interclue: Burrows through a link and gives you several important stats and a computer-generated page summary and thumbnail. Muy bueno for research as well as for the techniques it demonstrates.
  • Operator: “Operator is an extension for Firefox that adds the ability to interact with semantic data on web pages, including microformats, RDFa and eRDF” - this tool will find explicitly marked-up microformats.
  • Yahoo! Pipes: The first mashup editor - grab anything from the web, do stuff to it, manipulate the results, and display them. For those not allergic, there is also the Google Mashup Editor.
  • MIT SIMILE Project: “SIMILE is focused on developing robust, open source tools based on Semantic Web technologies that improve access, management and reuse among digital assets”. I’ve still never gotten PiggyBank to play nicely with both my browser and Java at the same time, but the set of tools this project is turning out looks awesome.
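
For a feel of what “primitive” keyword extraction looks like, here is a minimal sketch in Python. It is not how KGen (or any other plugin above) actually works; it just counts word frequencies in the visible text of a page. The function name `extract_keywords`, the tiny stopword list, and the sample HTML are all made up for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# A deliberately tiny, hypothetical stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "are", "with", "that", "this", "from", "you"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_keywords(html, top_n=5):
    """Return the top_n (word, count) pairs by raw frequency in the page text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Keep words of three or more characters that start with a letter.
    words = re.findall(r"[a-z][a-z'-]{2,}", text)
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(top_n)


if __name__ == "__main__":
    sample = (
        "<html><body><p>Microformats and microformats markup.</p>"
        "<script>var microformats = 1;</script>"
        "<p>The markup is semantic.</p></body></html>"
    )
    print(extract_keywords(sample, top_n=3))
```

Raw frequency knows nothing about phrases, synonyms, or which words matter for the topic, which is the gap between this kind of tool and real entity or keyword extraction.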

You might recall that I said microformats are not the answer, but they are really useful. Here is another excellent book for those interested in learning about them: Microformats: Empowering Your Markup for Web 2.0 (Amazon price info, etc., below).

I mentioned RDF and SPARQL, but you’re on your own with those. Or I have to do another talk - it’s just too big an anthill to kick over in this little post.

 

Several people also asked about language processing (NLP) libraries. With no further editorial explanation, I’d recommend taking a look at:

Your bonus content is this cool little application called the Smart Editor.

 


Tools for Startups - 80% of what you need at 20% of the expense and maintenance

March 11th, 2007

There’s a great article over at Read/WriteWeb on Software for Virtual Teams.

Read/WriteWeb’s recommendations are:

I think that Basecamp is overrated. It has serious deficiencies (for me - ymmv) in tracking task dependencies, creating types/tags for tasks, time tracking, and automatic pinging of assignees for uncompleted tasks. It’s almost good for the stated purpose: collaborating on (simple) creative projects that span multiple company boundaries and teams.

I’m still waiting for a hosted project management-lite solution that does 80% or more of what I want.

I agree 100% with the endorsement of Skype, GoToMeeting, and CVSDude. I use all three currently and I chose them because of the good experiences I had with each of them on other projects. CVSDude also offers hosted Trac and Bugzilla instances, which save you the trouble of getting that stuff elsewhere.
On the accounting front, I’m not sure there’s any choice but QuickBooks - I got PeachTree for free ($149 w/ a $149 rebate, so I guess I paid the tax) and just had to go get QuickBooks because my accountant doesn’t speak anything else. I also did a quick survey of other accountants, and none of them uses anything but QuickBooks for clients my size.

I’ve already said I’m not a fan of Basecamp - even though I am a big fan of 37Signals - but I only have a wishlist of features they don’t support. In practice, I still use Basecamp for a couple of projects and used it at my company until we upgraded to Atlassian’s Jira and Confluence. Josh left a comment on the above-mentioned post recommending goplan and Clever Tools. Clever Tools claims to do a lot more than just collaboration/project management. I’ll be checking it out.

Calendaring is also tough. I’ve tried everything listed, and I don’t like any of them. We are using Zimbra’s Collaboration Suite (ZCS). Calendaring should go together with email, so one of the hosted MS Exchange providers is also an option. I’ve recommended mi8 in the past.

I know nothing about online file storage or backup. I plan to learn soon, though.

While I was looking around, I also found this site that I thought was good: VerusNova - Technology for Small Business Success. It’s a pretty standard link/review blog, but they cover a lot of ground on the kinds of hosted and open source tools that interest me.

Ajax Homepages

October 6th, 2006

There are three so-called Ajax Homepages that are worthwhile:

References: Using DocBook for Single-source Publishing

October 6th, 2006

First, read A Gentle Guide to DocBook. Next, visit DocBook.org to get the lay of the land. Since the focus of this entry is on single-source publishing, go next to Writing Documentation Using DocBook, Selfdocbook, and Single-Source Publishing with DocBook XML.

Here are some additional documentation resources:

Software that will be helpful:

Lossless Encoding for My Music Archive

August 5th, 2006

At one point, I was interested in archiving my music in a lossless format. This was after, of course, I had already ripped it all to 192 Kbps MP3 and/or high quality AAC. The latter is absolute proof that the title of this site is apt: because I had iTunes, and because it worked well enough (better than alternatives I had on hand at the time), I lost my mind and fed my CDs in one after another to produce a format that can only be played on an iPod or my iTunes-equipped laptop. Blah.

In any case, I found a few options for lossless encoding. One that was promising was EAC - Exact Audio Copy. I remember it as having a good interface and not being bitchy about my scratched CDs. There was also FLAC. Not finding the URL right now - TODO: come back later and edit this.
During the process, I also discovered that I hated waiting for my CDs to rip and decided that I was definitely going to use a CD-ripping service next time.
Note to self: try to remember where the good Win32 binaries for LAME were - not all the versions available work equally well.

PHP Frameworks

August 4th, 2006

Papers that inform the design of the NewTool platform

February 10th, 2003

Here are several papers (some via Lilia via McGee) that inform my thinking about the NewTool platform:

NewTool will support the knowledge process

February 9th, 2003

Knowledge Work as a Process

Jim McGee says that the goals of most knowledge management efforts today harken back to Taylorism (scientific management), with onerous command-and-control ideas and language, as opposed to a recognition that knowledge is embodied in humans and that any “management” of knowledge must take this fact into account.

NewTool acknowledges the human synthesis of information into knowledge and of knowledge into knowledge and tries only to augment human capabilities to allow focus on the process of synthesis and creation:

This is a process that is fundamentally iterative. The loops in this process are feedback loops, not opportunities for streamlining. You don’t improve this process by rearranging the steps or breaking them down into specialized tasks to be distributed. Nor are there opportunities to eliminate non-valued added steps. Improving the value of knowledge work calls for different strategies. Two that are worth exploring are to improve the infrastructure at the periphery and to eliminate friction. [Is knowledge work improvable?, McGee's Musings]

Stay Tuned

February 6th, 2003
Whiteboard drawing of NewTool - an open source platform for Free Media and Microcontent + universal personal proxy/server + authoring and replication

A sampling of references/influences/thought-starters for those interested in talking about this: