Archive for November, 2010

Update on oer.txt Proposal

Monday, November 29th, 2010

In the last blog post I made a proposal to help make OER easier to discover automatically by web crawlers. The immediate reaction can be summed up as "interesting", with many raising specific points of concern about where the proposal falls short. I’d like to thank everyone who pitched in, by commenting on the blog post or by email, particularly Nathan Yergler, the CTO of Creative Commons, and Scott Wilson, who built Ensemble.

I’ve been listening to everyone, and this blog post is a write-up of my notes from those discussions.

Recap: What’s the problem and what is oer.txt?

There was some confusion about the exact definition of the problem oer.txt is trying to solve and about what oer.txt is. The problem is that automatically identifying educational resources is not easy: there is no widely accepted way to help web crawlers find OER. We have several technologies that support OER dissemination (e.g. RSS, OPML, OAI-PMH), but there is no way to say that a given RSS feed, say, carries OER as opposed to a blog’s posts.

The best analogy is that oer.txt is like a road sign: it points a compatible crawler to the URLs where OER can be found on a website. I specified it to say what format each URL is in so that a crawler can choose which ones to pursue. Anything beyond that, like metadata about what the OER actually is, which format it’s in, which education level it is aimed at, etc., is intentionally out of scope.

A simpler analogy is that oer.txt is merely an advertisement for what you already have.

Interestingly, no one is saying there isn’t a problem to be solved here. I also want to be 100% clear: I honestly do not mind what the final solution looks like, and if oer.txt is the wrong one, great, let’s agree a better one. I’ll be the first to kick oer.txt out the door!

Summary of discussions so far

So what are people saying? In short, the design of the oer.txt solution is wrong on two counts:

  • It’s OER-specific, meaning that it doesn’t help with or work for other problems. This also has the knock-on effect that…
  • …it needs something new to be agreed, namely the oer.txt file itself, so why not use what is already in use, like OPML, RSS, etc.?

Autodiscovery alternatives: link tags

The first theme that emerged is that we already have a way to do autodiscovery: directly embedding <link> tags with rel="alternate" attributes in the HTML. This is already in widespread use (it’s how Firefox knows this blog has an RSS feed, for example), so why not use that? It’s a great idea, and the two questions I have about it are:

  • Which HTML page would have this tag? The home page or all pages?
  • What do these alternate links point to? The simplest solution would be what is already being released. For example, a course’s home page could have two alternate links, one pointing to the course’s own machine-readable feed and one to the website’s machine-readable feed.

We would still need a way to mark these URLs as OER as opposed to any other type of feed (like a blog’s RSS feed). We could specify a new rel attribute value as a way to tag OER. For example, to tag an RSS feed we currently use:

<link rel="alternate" type="application/rss+xml" title="OCW Search All Courses" href="http://www.ocwsearch.com/courses.xml" />

Instead we could use:

<link rel="oer-alternate" type="application/rss+xml" title="OCW Search All Courses" href="http://www.ocwsearch.com/courses.xml" />

(Fictitious URLs for the sake of example.)
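To make the crawler side concrete, here is a minimal sketch, in Python, of how a compatible crawler could collect such links. Note that the rel="oer-alternate" value is just the proposal above, not an established standard:

from html.parser import HTMLParser

class OERLinkFinder(HTMLParser):
    """Collect (type, href) pairs from <link rel="oer-alternate"> tags."""
    def __init__(self):
        super().__init__()
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = dict(attrs)
        # rel can hold several space-separated values, so split before checking
        if "oer-alternate" in a.get("rel", "").split() and "href" in a:
            self.feeds.append((a.get("type", ""), a["href"]))

finder = OERLinkFinder()
finder.feed('<link rel="oer-alternate" type="application/rss+xml" '
            'title="OCW Search All Courses" '
            'href="http://www.ocwsearch.com/courses.xml" />')
print(finder.feeds)
# [('application/rss+xml', 'http://www.ocwsearch.com/courses.xml')]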

Autodiscovery alternatives: robots.txt itself

Another alternative is to hook into robots.txt itself, exactly like the Sitemaps protocol does. In this case, instead of pointing to sitemap URLs, we point to OER URLs. This side-steps the need for a separate oer.txt file (good) but might require us to agree a protocol analogous to Sitemaps (bad). I say "might" because I see no problem in pointing directly to whichever format is already produced, be it RSS, Atom, OAI-PMH, etc. That communicates both where the OER is and what format it is in, which is exactly what oer.txt aims to do. An OER-enhanced robots.txt could look something like this:

Sitemap: http://www.ocwsearch.com/sitemap.xml
Sitemap: http://www.ocwsearch.com/blog/sitemap.xml
OER: http://nc.ocwsearch.com/assets/extra/ocws-opensearch.xml
OER: http://www.ocwsearch.com/course-list-rss.xml
OER: http://www.ocwsearch.com/course-list-rdf.xml

(Apart from the OpenSearch URL, these URLs are fictitious.)

We could also adopt an already-established format like RDF as our analogue of the Sitemaps protocol, as in the last line of the example above.
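
On the consumer side, extracting these pointers would be trivial. A minimal sketch in Python, assuming the hypothetical OER: directive above:

import urllib.request

def oer_urls(domain):
    """Return the OER endpoint URLs advertised in a domain's robots.txt."""
    with urllib.request.urlopen("http://%s/robots.txt" % domain) as resp:
        text = resp.read().decode("utf-8", errors="replace")
    urls = []
    for line in text.splitlines():
        # Directive names in robots.txt are case-insensitive
        if line.lower().startswith("oer:"):
            urls.append(line.split(":", 1)[1].strip())
    return urls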

Format alternatives: OPML

OPML might be the solution we seek instead of oer.txt, if everyone actually uses it. This is a good suggestion, and it would work perfectly with the HTML link-tag autodiscovery described above.

The question I have is: what new attributes would the <outline> tags need to make OPML more useful for OER?
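
To make that question concrete, here is a sketch of what an OER-flavoured OPML subscription list could look like; the oerType and educationLevel attributes are made up purely to illustrate the kind of thing we would need to agree:

<opml version="2.0">
  <head>
    <title>OCW Search OER</title>
  </head>
  <body>
    <!-- oerType and educationLevel are hypothetical attributes -->
    <outline text="All Courses" type="rss"
             xmlUrl="http://www.ocwsearch.com/course-list-rss.xml"
             oerType="index" educationLevel="higher-education" />
  </body>
</opml>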

Format alternatives: RDF or POWDER

This idea basically says that we can add richness when we advertise our OER in a machine-readable format, and that we can do so with already-established formats like RDF or POWDER. This means we not only tag resources as educational but also gain a way to add extra metadata.

I’m of two minds about this idea. On the one hand, I’m really keen to keep things simple: oer.txt is just a list of what you already have, which is as simple as it gets. On the other hand, having a bit of richness beyond a bare-bones format like oer.txt would be very useful. At the end of the day, the decision rests with what the community, particularly the content producers, is happy with.

The other thing to consider is that RDF or POWDER are excellent candidates for a protocol analogous to Sitemaps in robots.txt as explained above.

Wrapping up

So where to now? Let’s keep talking! I’m pretty sure we haven’t come up with all the good ideas we can come up with. Please comment below, email me, or post to the mailing list.

Making OER Automatically Discoverable: oer.txt

Monday, November 22nd, 2010

Last week I was in frosty Nottingham, UK, at the JISC CETIS Conference 2010. I gave a short talk during the Locate, Collate, and Aggregate session. I’ve embedded the slides below.

During the talk I proposed that OER producers use a robots.txt-like file to make discovery of their resources easier. I’d like to elaborate a bit on this.

The robots.txt Analogy

There is a convention that websites can use to tell web crawlers, aka robots and spiders, which pages they are and are not allowed to crawl. This convention is called robots.txt, and the file must live at the domain’s root. For example, the robots.txt file for OCW Search is at http://www.ocwsearch.com/robots.txt. A web crawler can choose to ignore the file, but the well-behaved ones obey it.
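
To give a flavour of the format, a generic robots.txt (not the actual OCW Search file) can be as short as:

User-agent: *
Disallow: /private/

This tells every crawler to stay out of /private/ and leaves the rest of the site crawlable.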

As you can see, it’s just a plain text file with a simple format. I don’t want to go too deeply into it here; if you’re not familiar with robots.txt, please read the unofficially official help website.

Because this file is checked by the crawlers of the major search engines (Google, Bing, Yahoo!), it was co-opted to do something interesting: you can specify in robots.txt the location of your sitemaps. Without going into too much irrelevant detail, sitemaps are a standard way for webmasters to communicate the URLs of their pages to search engines. For our purposes, all you need to know is that to point a crawler to your sitemaps, you add a line like this to robots.txt:

Sitemap: http://www.example.com/sitemaps/sitemap1

And this is the salient point: with sitemaps in robots.txt, we have a machine-readable way to tell crawlers where resources live. Interesting, no? Can we use it for OER?

The oer.txt Proposal

By now the proposal should be obvious: why not make an OER-specific file just like robots.txt sitemaps to aid discoverability? Let’s call this file oer.txt and by convention it should live at the website’s root. I’ve gone ahead and created one for OCW Search: http://www.ocwsearch.com/oer.txt.

The format is again simple and thus flexible: it is a list of endpoints, one per line, with a service description and a URL separated by a colon. As you can see in the OCW Search oer.txt file, I’ve put in the OpenSearch service for OCW Search:

opensearchdescription+xml: http://nc.ocwsearch.com/assets/extra/ocws-opensearch.xml

Another example: RSS feeds. Here we’ll use MIT’s OCW feeds. The file’s contents could be:

index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newcourses
index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newavcourses

This raises the first question we need to answer as a community: how should we communicate the media we are releasing? In the MIT example above, there are two RSS feeds, one for text and one for courses with audio and/or video. In the interest of simplicity, I see no problem in not communicating that distinction at all.

Another example: Stanford University’s iTunes content. In this case, Stanford releases each course as a separate RSS feed, so I would like to introduce another term into the oer.txt vocabulary: content. Drawing on the list of all courses released, an excerpt of an example oer.txt would be:

content+rss: http://deimos3.apple.com/WebObjects/Core.woa/Feed/itunes.stanford.edu-dz.4331557148.04331557150
content+rss: http://deimos3.apple.com/WebObjects/Core.woa/Feed/itunes.stanford.edu.1299566665.01299566669
content+rss: http://deimos3.apple.com/WebObjects/Core.woa/Feed/itunes.stanford.edu.1291062366.01291619293

We can go further. Let’s take a standard protocol for OER dissemination, OAI-PMH, and let’s use Connexions’ OAI-PMH endpoint as our example:

oai-pmh: http://cnx.org/content/OAI?verb=Identify

Here I set it to use the OAI Identify verb, which is a reasonable thing to point crawlers to.
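
For a crawler, following such a pointer is a single HTTP GET. A minimal sketch in Python:

import urllib.request

# Fetch the repository's self-description from the Identify endpoint above
with urllib.request.urlopen("http://cnx.org/content/OAI?verb=Identify") as resp:
    print(resp.read().decode("utf-8"))

# From here a harvester can move on to the other OAI-PMH verbs,
# e.g. ?verb=ListRecords&metadataPrefix=oai_dc, to pull in the content.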

You get the idea. It’s a simple, extensible format, and it is merely a way for content producers to communicate the URLs of the resources and services they already manage.
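
And because the format is just one service-description/URL pair per line, consuming it takes almost no code. A minimal sketch of a parser in Python (the #-comment handling is my own assumption, borrowed from robots.txt):

def parse_oer_txt(text):
    """Parse oer.txt contents into (service, url) pairs."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment syntax
            continue
        service, url = line.split(":", 1)
        entries.append((service.strip(), url.strip()))
    return entries

sample = """opensearchdescription+xml: http://nc.ocwsearch.com/assets/extra/ocws-opensearch.xml
index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newcourses"""
print(parse_oer_txt(sample))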

So what now?

I think this is a very simple way to improve the discoverability of OER. Services like OCW Search will be able to quickly consume the information content producers release.

As an immediate next step, I think the following need to happen:

  • A discussion in the OER community: is this a good thing? If not, why not? If yes, can we make it better?
  • What terms should we initially recommend people use? Notice that above I stuck to the naming pattern service+format (e.g. index+rss) to communicate two interesting bits of info. Is this the correct pattern? Should we even be communicating the format?
  • Finally, we need to get some content producers to actually use this. What we have is the classic chicken-and-egg situation: OCW Search already has its oer.txt file, and it is essential that it gets friends.

So… let’s talk. I’ve set up a mailing list, imaginatively called the oer.txt Working Group. Please join it and let’s talk.

My Slides

The slides from my talk: