Archive for the ‘OCW Search News’ Category

A new chapter for OCW Search

Sunday, January 9th, 2011

I have accepted a position at Google and I start working with them tomorrow. As I will not be able to continue working on OCW Search, I have donated OCW Search to the OpenCourseWare Consortium, and they have agreed to take it over and nurture it.

What does this mean for the service? Two things:

  • The service will continue operating as is, but it will be owned and operated by the Consortium. It will take us a few days to figure out and complete the transfer, and hopefully that won’t cause any disruption.
  • I will not be involved with OCW Search any more. Will that mean I’m quitting the online education field? I certainly hope not! I just don’t know exactly how yet.

I’d like to take this chance to thank Mary Lou Forward and Clay Whipkey of the OCW Consortium for agreeing to take over. My little project is now in very capable hands and I could not be happier about this outcome. Good luck!

Finally, I’d like to thank all the users and supporters of OCW Search. I started this as a service that scratched an itch I had, and it turned out to be the most rewarding project I’ve ever been involved in. I met many smart and amazing people and I hope our paths cross again in the future. Please keep in touch!

Update on oer.txt Proposal

Monday, November 29th, 2010

In the last blog post I made a proposal to help make OER easier to discover automatically by web crawlers. The immediate reaction can be summed up as “interesting”, with many raising specific points of concern about where the proposal falls short. I’d like to thank everyone who pitched in, whether by commenting on the blog post or by email, particularly Nathan Yergler, the CTO of Creative Commons, and Scott Wilson, who built Ensemble.

I’ve been listening to everyone, and this blog post writes up some of my notes from these discussions.

Recap: What’s the problem and what is oer.txt?

There was confusion about the exact definition of the problem oer.txt is trying to solve, and about what oer.txt is. The problem is that automatically identifying educational resources is not easy: there is no widely accepted way to help web crawlers find OER. We have several technologies that support OER dissemination (e.g. RSS, OPML, OAI-PMH), but there is no way to say that a given RSS feed, say, carries OER as opposed to ordinary blog posts.

The best analogy for oer.txt is a road sign: it points a compatible crawler to the URLs where OER can be found on a website. I specified it to say what format the URLs are in, so that a crawler can choose which ones to pursue. Anything beyond that, like metadata about what the OER actually is, which format it’s in, which education level it is aimed at, and so on, is intentionally out of scope.

A simpler analogy is that oer.txt is merely an advertisement for what you already have.

Interestingly, no one is saying there isn’t a problem to be solved here. I also want to be 100% clear: I honestly do not mind what the final solution looks like and if oer.txt is the wrong one, great, let’s agree a better one. I’ll be the first to kick oer.txt out the door!

Summary of discussions so far

So what are people saying? In short, the design of the oer.txt solution is wrong on two counts:

  • It’s OER-specific, meaning that it doesn’t help/work for other problems. This also has the knock-on effect of…
  • It needs something new to be agreed, namely the oer.txt file, so why not use what is already in use, like OPML, RSS, etc.?

Autodiscovery alternatives: link tags

The first theme that emerged is that we already have a way to do autodiscovery: directly embedding <link> tags with rel="alternate" attributes in the HTML. This is already in widespread use (it’s how Firefox knows this blog has an RSS feed, for example), so why not use that? It’s a great idea, and the two questions I have about it are:

  • Which HTML page would have this tag? The home page or all pages?
  • What do these alternate links point to? The simplest solution would be what is already being released. For example, a course’s home page could have two alternate links, one pointing to the course’s own machine-readable feed and one to the website’s machine-readable feed.

We would still need a way to mark these URLs as OER as opposed to any other type of feed (like a blog’s RSS feed). We could specify a new rel attribute value as a way to tag OER. For example, to tag an RSS feed we currently use:

<link rel="alternate" type="application/rss+xml" title="OCW Search All Courses" href="" />

Instead we could use:

<link rel="oer-alternate" type="application/rss+xml" title="OCW Search All Courses" href="" />

(Fictitious URLs for the sake of example.)
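On the crawler side, picking up such a tag needs nothing beyond an HTML parser. Here is a minimal sketch using only the Python standard library; the rel="oer-alternate" value follows the proposal above, and the sample page is made up for illustration:

```python
from html.parser import HTMLParser

class OERLinkFinder(HTMLParser):
    """Collects (type, href) pairs from <link rel="oer-alternate"> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser also routes self-closing <link ... /> tags here.
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "oer-alternate":
            self.links.append((a.get("type"), a.get("href")))

# A hypothetical page with one ordinary feed and one OER feed.
page = """<html><head>
<link rel="alternate" type="application/rss+xml" href="/blog.rss" />
<link rel="oer-alternate" type="application/rss+xml" href="/courses.rss" />
</head></html>"""

finder = OERLinkFinder()
finder.feed(page)
print(finder.links)  # [('application/rss+xml', '/courses.rss')]
```

The ordinary rel="alternate" feed is ignored, which is exactly the disambiguation the new rel value buys us.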

Autodiscovery alternatives: robots.txt itself

Another alternative is to hook into robots.txt exactly like the Sitemaps protocol does. In this case, instead of pointing to sitemap URLs, we point to OER URLs. This side-steps the need for a separate oer.txt file (good) but might require us to agree a protocol analogous to Sitemaps (bad). I say "might" because I see no problem in pointing directly to whichever format is already produced, be it RSS, Atom, OAI-PMH, etc. That covers both of oer.txt’s goals: saying where the OER is and what format it is in. An example of this OER-enhanced robots.txt could be something like:
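The original example isn’t reproduced here; a hedged sketch of what such a file might look like, with invented directive names and placeholder URLs throughout, is:

```text
User-agent: *
Disallow:

Sitemap: http://example.org/sitemap.xml
# Hypothetical OER pointers -- directive names invented for illustration
OER-opensearch: http://example.org/opensearch.xml
OER-atom: http://example.org/courses.atom
OER-rdf: http://example.org/oer.rdf
```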


(Apart from the OpenSearch URL, these URLs are fictitious.)

We can also adopt an already-established format like RDF as our analogue of the Sitemaps protocol, as per the last line in the example above.

Format alternatives: OPML

OPML might be the solution we seek instead of oer.txt, if everyone actually uses it. This is a good suggestion, and would work perfectly with the HTML link tags autodiscovery.

The question I have is: what new attributes do the <outline> tags need to make OPML more useful for OER?
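As a strawman, the kind of extension in question might look like this OPML excerpt; the oerType and oerLevel attributes (and the URL) are invented purely for illustration:

```text
<opml version="2.0">
  <body>
    <!-- hypothetical attributes: oerType, oerLevel -->
    <outline text="All Courses" type="rss"
             xmlUrl="http://example.org/courses.rss"
             oerType="course" oerLevel="undergraduate" />
  </body>
</opml>
```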

Format alternatives: RDF or POWDER

This idea basically says we can add richness when we advertise our OER in a machine readable format, and we can do so with already established formats like RDF or POWDER. This means we not only tag resources as educational but we also have a way to add extra meta data.

I’m of two minds about this idea. On the one hand, I’m really keen to keep things simple: oer.txt is just a list of what you already have, which is as simple as it gets. On the other hand, having a bit of richness beyond a bare-bones format like oer.txt would be very useful. At the end of the day, what makes this decision is what the community, particularly the content producers, is happy with.

The other thing to consider is that RDF or POWDER are excellent candidates for a protocol analogous to Sitemaps in robots.txt as explained above.

Wrapping up

So where to now? Let’s keep talking! I’m pretty sure we haven’t come up with all the good ideas we can come up with. Please comment below, email me, or post to the mailing list.

Making OER Automatically Discoverable: oer.txt

Monday, November 22nd, 2010

Last week I was in frosty Nottingham, UK, at the JISC CETIS Conference 2010. I gave a short talk during the Locate, Collate, and Aggregate session. I’ve embedded the slides below.

During the talk I proposed that OER producers use a robots.txt-like file to make discovery of their resources easier. I’d like to elaborate a bit on this.

The robots.txt Analogy

There is a convention that websites can use to tell web crawlers (a.k.a. robots or spiders) which pages they are and are not allowed to crawl. The convention is called robots.txt, and the file must sit at the domain’s root; OCW Search has its own robots.txt file there. A web crawler can choose to ignore the file, but the well-behaved ones actually obey it.

It’s a very simple file: plain text with a simple format. I don’t want to go too deeply into it now; if you’re not familiar with robots.txt, please read the unofficially official help website.

As this file is checked by the crawlers of the major search engines (Google, Bing, Yahoo!), it was co-opted to do something interesting: you can specify in robots.txt the location of your sitemaps. Without going into too much irrelevant detail, sitemaps are a standard way for webmasters to communicate the URLs of their pages to search engines. For our purposes, you only need to know that to point a crawler to your sitemaps, you add a line to robots.txt like this:
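The original line isn’t reproduced here; generically, with a placeholder domain, the sitemaps pointer looks like:

```text
Sitemap: http://example.org/sitemap.xml
```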


And this is the salient point: the way sitemaps work in robots.txt gives us a machine-readable way to tell crawlers the location of resources. Interesting, no? Can we use it for OER?

The oer.txt Proposal

By now the proposal should be obvious: why not make an OER-specific file, just like the sitemaps pointer in robots.txt, to aid discoverability? Let’s call this file oer.txt; by convention it should live at the website’s root. I’ve gone ahead and created one for OCW Search:

The format is again simple and thus flexible: it is a list of endpoints, one per line, with a service description and a URL separated by a colon. As you can see in the OCW Search oer.txt file, I’ve put in the OpenSearch service for OCW Search:
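The file itself isn’t reproduced here; with a placeholder URL, the OpenSearch line would look something like:

```text
opensearch: http://example.org/opensearch.xml
```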


Another example: RSS feeds, using MIT’s OCW feeds. The file’s contents could be:
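The actual MIT feed URLs aren’t reproduced here; with placeholder URLs, the two entries might look like:

```text
index+rss: http://example.org/ocw/all-courses.rss
index+rss: http://example.org/ocw/audio-video-courses.rss
```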


This raises the first question we need to answer as a community: how should we communicate the media we are releasing? In the MIT example above, there are two RSS feeds, one for text and one for courses with audio and/or video. I see no problem in not communicating that distinction, in the interest of simplicity.

Another example: Stanford University’s iTunes content. In this case, Stanford releases each course as a separate RSS feed, so I would like to introduce another term into the oer.txt vocabulary: content. Given Stanford’s list of all released courses, an excerpt of an example oer.txt would be:
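With placeholder URLs standing in for Stanford’s per-course feeds, such an excerpt might look like:

```text
content+rss: http://example.org/courses/machine-learning.rss
content+rss: http://example.org/courses/programming-methodology.rss
```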


We can go further. Let’s take a standard protocol for OER dissemination, OAI-PMH, and let’s use Connexions’ OAI-PMH endpoint as our example:
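The real Connexions base URL isn’t reproduced here; with a placeholder base URL, the entry would be as below (the ?verb=Identify query string is standard OAI-PMH):

```text
oai-pmh: http://example.org/oai?verb=Identify
```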


Here I set it to use the OAI Identify verb, which is a reasonable thing to point crawlers to.

You get the idea. It’s a simple, extensible format, and it is merely a way for content producers to communicate the URLs of the resources and services they already manage.
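As a sanity check on how simple consuming the format is, here is a minimal sketch of a parser in Python; the service names and URLs in the sample are made up for illustration:

```python
def parse_oer_txt(text):
    """Return a list of (service, url) pairs, skipping blanks and comments."""
    endpoints = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first colon only, so URLs keep their own colons.
        service, _, url = line.partition(":")
        endpoints.append((service.strip(), url.strip()))
    return endpoints

sample = """\
# hypothetical oer.txt
opensearch: http://example.org/opensearch.xml
index+rss: http://example.org/all-courses.rss
content+rss: http://example.org/course-101.rss
"""
print(parse_oer_txt(sample))
# -> [('opensearch', 'http://example.org/opensearch.xml'), ...]
```

A crawler would fetch the file from the site root, parse it like this, and then dispatch each URL to whichever handler understands the named service.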

So what now?

I think this is a very simple way to improve the discoverability of OER. Services like OCW Search will be able to quickly consume the information content producers release.

As an immediate next step, I think the following need to happen:

  • A discussion in the OER community: is this a good thing? If not, why not? If yes, can we make it better?
  • What terms should we initially recommend people use? Notice above I stuck to the naming pattern of service+format (e.g. index+rss) to communicate two interesting bits of info. Is this the correct pattern? Should we even be communicating the format?
  • Finally, we need to get some content producers to actually use this. What we have is the classic chicken-and-egg situation: OCW Search already has its oer.txt file, and it is essential that it gets friends.

So… let’s talk. I’ve set up a mailing list, imaginatively called the oer.txt Working Group. Please join it and let’s talk.

My Slides

The slides from my talk:

Introducing the OpenCourseWare Meta Data API

Tuesday, October 12th, 2010

A key part of OpenCourseWare is the meta data associated with each course. Meta data are course attributes like the instructors, the teaching date, which institution released the course, etc.

To make OCW Search work like it does, I went to a lot of trouble to make sure I extract correct meta data about each course. The meta data has always been partially available, in the form of advanced search queries, and it is also displayed in the search results.

As of now, you can programmatically get this data using the new OCW Search Meta Data API. It does what its name suggests: it gives you all the core (useful) meta data I track for each course in OCW Search. It shares many design features with the search API, so your experience with the search API translates directly into useful knowledge of the meta data API.

Important note: the meta data API is currently considered beta, although it is quite stable and works really well. It is likely, though, that some implementation details will change before it is labelled stable, so please keep an eye out for updates here and on the API mailing list.

MERLOT: The Meta Data API Launch Partner

The meta data API is the result of a request from, and a collaboration with, MERLOT, a program of the California State University. The idea came up during a conversation between me and the team at MERLOT, and in retrospect it made me slap my head: why didn’t I think of it earlier? The MERLOT team and I worked together on building, testing, and finally integrating the OCW Search meta data into the MERLOT collection using this API. For their prompting and help in getting the API to this stage, I want to deeply thank the team at MERLOT.

This is actually an important point: I am always open to ideas that make OCW Search better and more useful, particularly if they support the spirit of spreading the discoverability and utility of OpenCourseWare. If you have an idea, please get in touch!

Again, to get started, see the OCW Search Meta Data API home page.

Top Searches on OCW Search

Wednesday, September 8th, 2010

Apologies for the long silence. I’m actually working on something big that I will share with you soon.

For now, I did a little analysis: what are the top searches done by real people on OCW Search? The list of top searches on OCW Search is quite surprising. It is also an excellent source of information for me: I will be manually checking these searches to see how good they are. Some are going to be excellent, some are going to be rubbish. The bad ones will get special treatment to see if I can improve them and make the service better for you.

Of course, if you come across a search that is not giving you good results, please let me know.

Again, the top searches list is here.

Updated: Moving Servers

Saturday, July 10th, 2010

Update: Congratulations, you’re now seeing the new OCW Search server!

Both the old and new servers will continue to operate simultaneously for a little while longer to ensure full propagation of the DNS entry.

Initial text of this post:

As some of you have noticed, the service has grown quite a bit and the server has had sporadic problems. Over the past few weeks I’ve been working on a new set up to fix these issues and be the foundation of quite a few things I want to roll out over the next few months.

The new set up is pretty much ready and will be rolled out over the next few days. The switch will (should!) be seamless, and the only things you’ll notice will be the good things this new set up brings. Beyond improved stability, look out for speed improvements: everything got reviewed and optimized based on real usage information. In particular, if you use OCW Search extensively, pages will load a lot faster going forward. The backend code also got a few optimizations, and there are more changes coming to make it even better.

As ever, the Status Blog will keep you updated on any issues during the move, and if you want to get in touch, the details are on the contact page.

Few Fixes plus Status Blog

Thursday, June 24th, 2010

A couple of things based on user feedback and requests.

Status Blog

With the new OCW Search API, other services now depend on the main website. But what happens if the main site is down or inaccessible for any reason? How would we communicate what’s going on?

The answer is a new OCW Search Status Blog. This is hosted on a third-party service that is unlikely to go down at the same time as this site. This means it will be an open channel in case anything makes OCW Search inaccessible.

You can always find a link to the Status Blog in the top navigation.

Advanced Search Operators Fixed

Some of the advanced search operators were not working correctly in some queries. This is now fixed and more test cases have been added for checking prior to releasing new code. This update was also applied to the OCW Search API.

As always, if you spot something is broken, please contact us with details of how to reproduce the error and it will get fixed!

MIT Re-indexed, Textbooks Included

Friday, June 18th, 2010

The MIT OCW website was recently restructured, which meant that all the indexed courses on OCW Search pointed to broken URLs. This has now been rectified with a complete re-indexing of all MIT OCW courses.

As part of this re-indexing, I took the opportunity to include the textbooks that MIT also releases. A couple of examples:

Announcing the OCW Search API

Monday, June 14th, 2010

Along with today’s milestone announcement (10 universities providing 11 OCW collections in OCW Search), the other major news is the new OCW Search API.

The API is a way for other programs to access the OCW Search index. The objective of this API is to help spread the use of OpenCourseWare by enabling other developers to integrate OpenCourseWare searching capability into their apps. My hope is to spur the development of mobile and web apps in particular as I see them as a big opportunity in open education.

The API is very easy to use and gives full access to all of OCW Search’s advanced search capabilities.

Full technical details and the developers’ mailing list are at the OCW Search API home page.

Major Milestone: 10 University OCW Collections

Monday, June 14th, 2010

As of right now, the live OCW Search index contains 11 OCW collections from 10 different universities. The 10 universities are:

  1. School of Public Health at Johns Hopkins (institution:jhsph)
  2. MIT (institution:mit)
  3. Notre Dame (institution:nd)
  4. The Open University UK (institution:openuniversity)
  5. Universidad Politécnica de Madrid (institution:politecnicamadrid), Spanish courses
  6. Stanford Engineering Everywhere (institution:stanford)
  7. Delft University of Technology (institution:tudelft), English and Dutch courses
  8. UMass Boston (institution:umass)
  9. The University of Tokyo (institution:utokyo), both English and Japanese OCW collections
  10. Yale University (institution:yale)

In total, the number of courses in the index is over 2600 now.

All of these courses are available through the OCW Search API, which is announced in this blog post.