Making OER Automatically Discoverable: oer.txt

Last week I was in frosty Nottingham, UK, at the JISC CETIS Conference 2010, where I gave a short talk during the Locate, Collate, and Aggregate session.

During the talk I proposed that OER producers use a robots.txt-like file to make discovery of their resources easier. I’d like to elaborate a bit on this.

The robots.txt Analogy

There is a convention that websites can use to tell web crawlers, aka robots and spiders, which pages they may and may not crawl. It is called robots.txt, and the file must live at the domain’s root. For example, the robots.txt file for OCW Search is at http://www.ocwsearch.com/robots.txt. A web crawler can choose to ignore the file, but the well-behaved ones obey it.

As you can see it’s a very simple file: it’s a plain text file and it has a simple format. I don’t want to go too deeply into this now; if you’re not familiar with robots.txt, please read the unofficially official help website.

Because this file is checked by the crawlers of the major search engines (Google, Bing, Yahoo!), it was co-opted to do something interesting: you can specify the location of your sitemaps in robots.txt. Without going into too much detail, sitemaps are a standard way for webmasters to communicate the URLs of their pages to search engines. All you need to know here is that to point a crawler to a sitemap, you add a line like this to robots.txt:

Sitemap: http://www.example.com/sitemaps/sitemap1
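
To make the mechanism concrete, here is a minimal sketch of how a crawler might pull those lines out of a robots.txt body. The function name and the sample file contents are my own illustration, not taken from any particular crawler:

```python
def sitemap_urls(robots_txt: str) -> list[str]:
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, because the URL itself contains colons.
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


# Hypothetical robots.txt contents, for illustration only.
example = """User-agent: *
Disallow: /private/
Sitemap: http://www.example.com/sitemaps/sitemap1
"""

print(sitemap_urls(example))
```

Every other directive is simply ignored, which is what lets a crawler that only cares about sitemaps read the file without understanding the rest of it.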

And this is the salient point: the way sitemaps in robots.txt work, we have a machine-readable way to tell crawlers the location of resources. Interesting, no? Can we use it for OER?

The oer.txt Proposal

By now the proposal should be obvious: why not make an OER-specific file just like robots.txt sitemaps to aid discoverability? Let’s call this file oer.txt and by convention it should live at the website’s root. I’ve gone ahead and created one for OCW Search: http://www.ocwsearch.com/oer.txt.

The format is again simple and thus flexible: it is a list of endpoints, one per line, with a service description and a URL separated by a colon. As you can see in the OCW Search oer.txt file, I’ve put in the OpenSearch service for OCW Search:

opensearchdescription+xml: http://nc.ocwsearch.com/assets/extra/ocws-opensearch.xml

Another example: RSS feeds. For this example, we’ll use MIT’s OCW feeds. The file’s contents could be:

index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newcourses
index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newavcourses

This raises the first question we need to answer as a community: how should we communicate the media we are releasing? In the MIT example above, there are two RSS feeds, one for text courses and one for courses with audio and/or video. In the interest of simplicity, I see no problem with not communicating that distinction.

Another example: Stanford University’s iTunes content. In this case, Stanford is releasing each course as a separate RSS feed, and so I would like to introduce another term in the oer.txt vocabulary: content. Taking a few feeds from the list of all courses released, an excerpt of an example oer.txt would be:

content+rss: http://deimos3.apple.com/WebObjects/Core.woa/Feed/itunes.stanford.edu-dz.4331557148.04331557150
content+rss: http://deimos3.apple.com/WebObjects/Core.woa/Feed/itunes.stanford.edu.1299566665.01299566669
content+rss: http://deimos3.apple.com/WebObjects/Core.woa/Feed/itunes.stanford.edu.1291062366.01291619293

We can go further. Let’s take a standard protocol for OER dissemination, OAI-PMH, and let’s use Connexions’ OAI-PMH endpoint as our example:

oai-pmh: http://cnx.org/content/OAI?verb=Identify

Here I set it to use the OAI Identify verb, which is a reasonable thing to point crawlers to.
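
Once a crawler knows the endpoint, it can swap the verb for whichever OAI-PMH request it needs. The sketch below assumes the advertised URL is the base URL plus a query string, as in the Connexions line above; the helper name is hypothetical. ListRecords and the oai_dc metadata prefix come from the OAI-PMH specification, not from anything oer.txt defines:

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def oai_request(advertised_url: str, verb: str, **arguments: str) -> str:
    """Build an OAI-PMH request URL from an endpoint advertised in oer.txt.

    The advertised query string (e.g. verb=Identify) is discarded and
    replaced with the requested verb and its arguments.
    """
    parts = urlsplit(advertised_url)
    query = urlencode({"verb": verb, **arguments})
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(oai_request("http://cnx.org/content/OAI?verb=Identify",
                  "ListRecords", metadataPrefix="oai_dc"))
# http://cnx.org/content/OAI?verb=ListRecords&metadataPrefix=oai_dc
```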

You get the idea. The format is simple and extensible, and it is merely a way for content producers to communicate the URLs of the resources and services that they currently manage.
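
A parser for the format described above fits in a few lines. This is only a sketch of one possible reading of the proposal: the Endpoint type and function name are made up for illustration, and skipping blank lines and #-style comments is my assumption, since the proposal does not define either.

```python
from typing import NamedTuple


class Endpoint(NamedTuple):
    service: str  # e.g. "index", "content", "oai-pmh"
    format: str   # e.g. "rss"; empty when the term has no "+format" part
    url: str


def parse_oer_txt(text: str) -> list[Endpoint]:
    """Parse oer.txt lines of the form 'service+format: URL'."""
    endpoints = []
    for line in text.splitlines():
        line = line.strip()
        # Assumption: blank lines and '#' comments are ignored.
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only; the URL contains colons of its own.
        term, sep, url = line.partition(":")
        if not sep or not url.strip():
            continue  # not a 'term: URL' line
        service, _, fmt = term.strip().partition("+")
        endpoints.append(Endpoint(service, fmt, url.strip()))
    return endpoints


sample = """opensearchdescription+xml: http://nc.ocwsearch.com/assets/extra/ocws-opensearch.xml
index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newcourses
index+rss: http://feeds.pheedo.com/OcwWeb/rss/new/mit-newavcourses
oai-pmh: http://cnx.org/content/OAI?verb=Identify
"""

for endpoint in parse_oer_txt(sample):
    print(endpoint.service, endpoint.format or "-", endpoint.url)
```

Unknown terms pass through untouched, so a consumer can pick out the services it understands and ignore the rest.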

So what now?

I think this is a very simple way to improve the discoverability of OER. Services like OCW Search will be able to quickly consume the information content producers release.
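
On the consuming side, the first step is working out where a site's oer.txt lives and fetching it. The sketch below uses only Python's standard library; the function names are hypothetical, and treating any HTTP or network error as "no oer.txt here" is an assumption about how a crawler would want to behave.

```python
from typing import Optional
from urllib.error import URLError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import urlopen


def oer_txt_url(page_url: str) -> str:
    """Return the root-level oer.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/oer.txt", "", ""))


def fetch_oer_txt(page_url: str, timeout: float = 10.0) -> Optional[str]:
    """Fetch the site's oer.txt, or return None if it cannot be retrieved."""
    try:
        with urlopen(oer_txt_url(page_url), timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except URLError:  # also covers HTTPError, e.g. a 404 when no oer.txt exists
        return None


print(oer_txt_url("http://www.ocwsearch.com/about/index.html"))
# http://www.ocwsearch.com/oer.txt
```

A missing file costs the crawler one request per site, the same price it already pays for checking robots.txt.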

As an immediate next step, I think the following need to happen:

  • A discussion in the OER community: is this a good thing? If not, why not? If yes, can we make it better?
  • What terms should we initially recommend people use? Notice above I stuck to the naming pattern of service+format (e.g. index+rss) to communicate two interesting bits of info. Is this the correct pattern? Should we even be communicating the format?
  • Finally, we need to get some content producers to actually use this. What we have is the classic chicken-and-egg situation: OCW Search already has its oer.txt file, and it is essential that it gets friends.

So… let’s talk. I’ve set up a mailing list, imaginatively called the oer.txt Working Group. Please join it and let’s talk.

12 Responses to “Making OER Automatically Discoverable: oer.txt”

  1. Scott Wilson says:

    Creating a special new aggregation format just for OER? No thanks.

    Just use OPML. It does everything you’ve mentioned, and is already used by some OER publishers.

    I made this to aggregate and cross-search 1000+ feeds (~10,000ish OERs) using already existing OPML sources:

    http://galadriel.cetis.ac.uk/ensemble

  2. Pierre Far says:

    Hi Scott,

    Thanks for commenting. I think I may not have explained this properly. This is NOT another aggregation format. I don’t want that either! I just want to make what’s already there easier to find.

    This is a way to tell other services which formats you already expose, and tell them in one central place. If you already have OPML, then you can easily state that in the oer.txt, like this:

    opml: http://galadriel.cetis.ac.uk/ensemble/feeds?format=opml

    For Ensemble for example, your aggregator daemon can look through oer.txt and identify all the RSS feeds and automatically index them. For your use case, I imagine setting an automated crawler on (say) all ac.uk domains and indexing all their OERs without manual intervention.

    Does this make more sense?

    Thanks,
    Pierre

  3. Marian Wan says:

    I think it’s a very good idea! We here are trying to make some full-text search spiders to crawl OCW or OER sites. Instead of wandering around, an oer.txt might help by specifying the sites. There are some applications for creating metadata for OER content here in Taiwan. However, it’s too hard to ask content providers to fill in all the metadata (in our case, there are more than 70 fields!! GEE!). Thanks for bringing up this solution, and please keep me informed.

  4. Scott Wilson says:

    Hi Pierre,

    But the OPML file from each provider ALREADY has all the details that your oer.txt file would contain, so there is no need for it. Just use OPML.

    E.g.

    http://openlearn.open.ac.uk/rss/file.php/stdfeed/1/full_opml.xml

    http://rss.oucs.ox.ac.uk/metafeeds/podcastingnewsfeeds.opml

    Simple! And already part of the web!

    If you want a harvesting file for all those OPML files for all providers, you can make that in OPML too.

    No need to invent a new format.

    S

  5. Pierre Far says:

    Hi Scott

    To use your examples, there is no mechanism to discover the exact locations of these feeds. And there are providers that do not use OPML – a lot don’t; for example, I don’t think the Stanford example I cite in the post has an OPML file.

    To rephrase, the oer.txt file is a simple mechanism for services to advertise, to web crawlers, what is already available for harvesting from them. For the OpenLearn example, an oer.txt file at openlearn.open.ac.uk/oer.txt can have a line that says:

    opml: http://openlearn.open.ac.uk/rss/file.php/stdfeed/1/full_opml.xml

    When a crawler interested in OER reaches openlearn.open.ac.uk, it finds the oer.txt file, extracts the OPML’s URL and off it goes. This way no human intervention is needed to figure out where the feeds are and what they do. It’s not meant to replace any formats/standards but to advertise their URLs, which vary by service.

    And in the future, if the location of the feed changes or if the content producer adds or removes endpoints, the oer.txt can be updated and all content consumers will update automatically.

  6. Interesting idea, but I have my doubts. I see that OPML can go “far enough” as to describe what feeds are available, but then there’s the matter of discovering where the OPML is.

    Pierre says the oer.txt file on a server’s root will fix that, but then I’d ask: how’d you know which servers to look for in the first place? =)

    We also *do* already have autodiscovery of other things like Opensearch, RSS feeds, etc. embedded in HTML, so why not use that instead? (E.g. the crawler Pierre talks about wouldn’t look for an oer.txt file but for link rel tags on the site’s homepage.)

  7. Pierre Far says:

    Hi Alejandro

    I think we’re getting stuck on feeds, but oer.txt can (and should) support more than just feeds. I use OAI-PMH as an example and also OpenSearch. Yes, things like OpenSearch can be embedded in HTML and autodiscovered, and likewise feeds, but other services that content producers expose may not be embeddable. This is probably the best alternative suggestion so far!

    The one concern I have if we go the embed-in-HTML route is that content producers would have to produce at least one page that embeds ALL their autodiscoverable content to help crawlers – that’s basically the premise of oer.txt (all content in one place). Which page that is (home page or another “index” page?) is then next up for debate. The alternative is to spread the embeds throughout the site (i.e. not on just the home page), negating the simplicity of this convention and potentially hindering the discoverability if the website has crawlability issues (like javascript navigation and the like).

    As for which websites you look for oer.txt on: all of them, just like you look for robots.txt on all websites. As such, oer.txt is basically a map giving a crawler directions to where to get the content from. If oer.txt is not present, that’s no change from where we are today. If something like oer.txt were in widespread use, personally I would set off a crawler on all .edu/.ac.uk/etc domains and subdomains and consume the content automatically. That would certainly make my life operating OCW Search so much easier :)

  8. Clay Whipkey says:

    I think I understand the purpose, but I will admit that I’m not quite convinced yet myself. I am still leaning towards expanding the currently available semantic HTML to support auto-discovery. I realize this represents more work than something like the oer.txt concept, but one thing we have on our side is that we are a niche community. The number of sites offering OER (right now) is relatively few and this is the best time to implement responsible, sustainable procedures that will make for the best future.

    Right now, specialized services like ocwsearch.com, Merlot, Globe, the OCWC search, etc. do primarily rely on content providers making themselves known. But I think we have to admit that if we really wish the best for OER as a movement, for “openness” as a movement, we need to make content discoverable and friendly to the big players like Google and Yahoo, too. I know specifically that Google is once again working on an education-focused search service and we will be meeting with them next week. I will be happy to run this idea past that group to get some more feedback.

    Not closed to it, just not convinced yet.

    cheers,
    Clay

  9. Pierre Far says:

    Hi Clay,

    I think the need for something like oer.txt is recognized, and yes it may not be the best solution (e.g. see the discussion above with Alejandro re HTML-based autodiscovery). If at least we all start working towards something and actually do it, then I’m happy.

    At the end of the day, it doesn’t matter what we actually agree on as long as it works, is produced by the content creators, and is used by content consumers like Merlot, Globe, the Consortium, etc. That’s the golden prize.

  10. OER.txt | says:

    [...] Posted on November 24, 2010 by openedblogger| Leave a comment OCW Search has a new post suggesting an OER.txt file, similar to a robots.txt file. From the post: There is a [...]

  11. Scott Wilson says:

    “To use your examples, there is no mechanism to discover the exact locations of these feeds.”

    HTML autodiscovery works fine for this.

    “And there are providers that do not use OPML – a lot don’t; for example, I don’t think the Stanford example I cite in the post has an OPML file.”

    So we should encourage them to use OPML, which existing non-specialist tools support (from email clients to newsreaders, mobile apps, etc), rather than create a completely new format that will only be used by a specialist community. Using OPML & RSS doesn’t just benefit OER, it benefits the rest of the web too.

    It’s time to stop inventing new technologies for OER and start using what already exists.

    “There is only one web” :-)

  12. [...] Search » Advanced search help Follow us on: Facebook | Twitter | Blog « Making OER Automatically Discoverable: oer.txt [...]