Team blog: Developers

Major update to freeyourstuff.cc browser extension

freeyourstuff.cc is a separate project from lib.reviews, but with a related purpose. It’s a Chrome/Chromium browser extension that lets you download reviews you’ve contributed to major websites, including Amazon, Goodreads, IMDb, TripAdvisor, and Yelp. You can publish them on the site as a backup, or keep your own copy.

The most popular feature turns out to be the Quora downloader: tens of thousands of Quora answers have been downloaded and re-published under free licenses with it (see the upload directory).

Today I pushed out a week’s worth of updates to the extension. From a user perspective, the main changes are that IMDb and Amazon extraction work again (design changes had broken the plugins), Quora extraction should be better behaved, and all plugins show a clear “busy” indicator while they’re working.

Under the hood, the extension now uses async/await functions instead of callbacks. This makes the download flow a lot easier to follow, especially for complex plugins like the Quora one, which have to monitor changes to a page that is built dynamically by JavaScript outside the plugin’s control.

If you’ve contributed reviews to other sites, I encourage you to use this extension to keep your own copy (and please report issues you experience). In future, we’ll make it easy to migrate individual reviews to lib.reviews or other sites, as well.


Major improvements to media uploading

It’s now possible to upload and insert media files directly from the rich-text editor when writing reviews, blog posts, or anything else. This video is a quick demonstration:


This was a major effort for a few reasons:

  • We needed to add an upload API that handles various failure cases (e.g., incorrect MIME type), batches up errors and passes them along to the application. The API supports multi-file uploads as well, but in the editor we only upload a single file at a time.

  • We needed to design a dialog that’s quick and easy, while handling entry of all required data without taking up too much space for mobile users. The flip to a second page you see in the video seems like a pretty good solution—you only ever see that page if you need to.

  • We needed to add an upload feed so we can keep track of what files are being uploaded.
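The error batching mentioned in the first point can be sketched roughly as follows. This is only an illustration, not the site's actual upload API: the function name `validateUploads`, the shape of the file objects, and the allow-list of MIME types are all assumptions made for the example.

```javascript
// Hypothetical allow-list; the types the real site accepts are not stated in this post.
const ALLOWED_MIME_TYPES = new Set(['image/png', 'image/jpeg', 'video/webm', 'audio/ogg']);

// Check every file and collect all problems, instead of stopping at the first one.
function validateUploads(files) {
  const accepted = [];
  const errors = [];
  for (const file of files) {
    if (ALLOWED_MIME_TYPES.has(file.mimeType)) {
      accepted.push(file);
    } else {
      errors.push({ name: file.name, reason: `unsupported MIME type: ${file.mimeType}` });
    }
  }
  return { accepted, errors };
}

const uploadResult = validateUploads([
  { name: 'cat.png', mimeType: 'image/png' },
  { name: 'notes.exe', mimeType: 'application/x-msdownload' }
]);
console.log(uploadResult.errors);
```

Returning both lists lets the caller store the good files and still report every rejected one in a single response.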

As you can see in the video, the feature gives credit to the person who created the work you’re uploading, something that tends to go missing on most websites.

Along the way, we’ve also improved the old multi-file upload and the presentation of metadata on review subject pages. For now, you need to switch to “rich text” mode to see the upload button—in future, the plain text markdown editor will get its own toolbar.


HTML5 video and audio support added

We now have support for HTML5 video and audio in reviews and posts. You can insert media by URL using the “Insert media” menu in the rich text editor. Alternatively, you can use the markdown syntax for images — ![alt text](url) — and it will now work for video and audio files as well.

Here’s an example of an embedded audio file:

Keep in mind that this only works for links to files in formats supported by modern browsers (typically mp4/webm/ogv for video, mp3/ogg for audio). YouTube links and such won’t magically work. We may or may not add support for that—YouTube is ubiquitous, but it’s not ideal to have videos that may disappear at any moment embedded in reviews that are meant to be freely reusable forever.

Under the hood, the CommonMark markdown specification does not yet include support for video/audio. As a result, CommonMark-compliant parsers like Markdown-It (which we use) don’t support it, either. There was a preexisting plugin, but it had a few issues:

  • clocked in at >100KB due to an unnecessarily complex dependency

  • did not show any fallback text for older browsers

  • included hardcoded English strings

  • did not tokenize audio and video differently from images, making it difficult to integrate with rich-text editors

While some of this may be fixed (I sent a pull request), I ended up writing and publishing a new module, markdown-it-html5-media, that is optimized for our use case. Using image syntax tracks the emerging consensus in the ongoing CommonMark discussion about this topic.
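To illustrate the idea of treating audio and video differently from images, here is a minimal sketch that picks an element type from the file extension and includes fallback text. It is not the code of markdown-it-html5-media; the function names are hypothetical, and the extension lists are simply the formats mentioned above.

```javascript
// Extensions taken from the formats named in this post; a real module may know more.
const VIDEO_EXTENSIONS = ['mp4', 'webm', 'ogv'];
const AUDIO_EXTENSIONS = ['mp3', 'ogg'];

// Guess the media type from the URL's file extension; default to "image".
function guessMediaType(url) {
  // Drop query string and fragment, then take the text after the last dot.
  const path = url.split(/[?#]/)[0];
  const extension = path.slice(path.lastIndexOf('.') + 1).toLowerCase();
  if (VIDEO_EXTENSIONS.includes(extension)) return 'video';
  if (AUDIO_EXTENSIONS.includes(extension)) return 'audio';
  return 'image';
}

// Minimal escaping so the alt text and URL cannot break out of the markup.
const escapeHTML = text =>
  text.replace(/[&<>"]/g, c => ({ '&': '&amp;', '<': '&lt;', '>': '&gt;', '"': '&quot;' }[c]));

// Render ![altText](url) as <img>, <video> or <audio>.
// For video/audio, the alt text doubles as fallback content for older browsers.
function renderMedia(url, altText) {
  const type = guessMediaType(url);
  const src = escapeHTML(url);
  const alt = escapeHTML(altText);
  if (type === 'image') return `<img src="${src}" alt="${alt}">`;
  return `<${type} src="${src}" controls>${alt}</${type}>`;
}

console.log(renderMedia('https://example.com/clip.webm', 'A short clip'));
```

Because the type is decided in one place, a rich-text editor could call the same `guessMediaType` helper to choose which kind of node to create.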


SPARQL that sparkles

We’ve had support for looking up items via Wikidata for a while now. In order to give you the most relevant results, we exclude some stuff from the search that is very unlikely to be of interest: disambiguation pages, categories, templates, and other “meta-content” from Wikipedia.

To do this, we previously had to fire off two requests for every query we sent to Wikidata: one, to the MediaWiki API for Wikidata, using the wbsearchentities module that also powers Wikidata’s own search; the second to the powerful Wikidata Query Service, in order to identify which of the search results should be excluded.

It turns out that the Wikidata Query Service supports directly interfacing with the MediaWiki API, but until recently, the order of the results was lost when performing such a query.

Thanks to Wikimedia Foundation engineer Stas Malyshev, this was fixed a few days ago, so we were able to rewrite our queries to make use of it. Our Wikidata search is now entirely powered by SPARQL, the query language designed for the semantic web.

The result: significantly more responsive Wikidata lookup and simpler code. See the code for details; most of the work is done by the _requestHandler function. In related news, Wikidata has also recently improved the quality of its search results significantly by switching to Elasticsearch.
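For the curious, a single-request search of this kind might look roughly like the sketch below. It is not the query lib.reviews actually sends: the helper names are made up, and the three excluded classes (disambiguation page, category, template) are my assumption of what such a filter could list. The `wikibase:mwapi` service and its `apiOrdinal` output, which preserves the result order, are part of the Wikidata Query Service.

```javascript
// Hypothetical helper: build one SPARQL query that searches through the
// MediaWiki API service and filters out meta-content in the same request.
function buildSearchQuery(term, language = 'en') {
  // Escape backslashes and quotes and flatten newlines, so the term is a
  // valid SPARQL string literal. (The language code is assumed to be trusted.)
  const safeTerm = term
    .replace(/\\/g, '\\\\')
    .replace(/"/g, '\\"')
    .replace(/[\r\n]+/g, ' ');
  return `
SELECT ?item ?itemLabel ?itemDescription WHERE {
  SERVICE wikibase:mwapi {
    bd:serviceParam wikibase:api "EntitySearch" .
    bd:serviceParam wikibase:endpoint "www.wikidata.org" .
    bd:serviceParam mwapi:search "${safeTerm}" .
    bd:serviceParam mwapi:language "${language}" .
    ?item wikibase:apiOutputItem mwapi:item .
    ?rank wikibase:apiOrdinal true .
  }
  # Assumed exclusions: disambiguation pages, categories, templates.
  FILTER NOT EXISTS {
    ?item wdt:P31 ?class .
    VALUES ?class { wd:Q4167410 wd:Q4167836 wd:Q11266439 }
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "${language}" . }
}
ORDER BY ?rank
LIMIT 10`;
}

// Build the request URL for the public query service endpoint.
function buildSearchURL(term, language = 'en') {
  const query = encodeURIComponent(buildSearchQuery(term, language));
  return `https://query.wikidata.org/sparql?format=json&query=${query}`;
}

console.log(buildSearchQuery('Les Misérables'));
```

Sorting by `?rank` is what keeps the results in the order the search API returned them.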


Fun with Node 8.9.0 and jsdoc

lib.reviews is powered by Node.js. This post is very much about the internals, so only read on if you care. :-)

We’ve just upgraded the site to a major new release of Node: 8.9.0. Many excellent blog posts have been written about the new features in this series of Node. Personally, I’m most excited about async/await.

In modern web development, you’re often dealing with operations that are synchronous (executed immediately and blocking operation of other code) vs. asynchronous (effectively running “in the background”).

For example, when you run an expensive database query, you don’t want it to keep other visitors of the site waiting—it should run in the background. But the application needs to know when the query is finished. Promises are one way to organize such asynchronous execution sequences.

Unfortunately, as you deal with more complex sequences of events, using only promises can also make code increasingly difficult to read. Here’s an example from the sync-all script, which is run every 24 hours to fetch information from sites like Wikidata and Open Library:

Thing
  .filterNotStaleOrDeleted()
  .then(things => {
    things.forEach(thing => thing.setURLs(thing.urls));
    let updates = things.map(thing => 
      limit(() => thing.updateActiveSyncs())
    );
    Promise
      .all(updates)
      .then(_updatedThings => {
        console.log('All updates complete.');
      });
  });

What’s going on here? We’re getting a list of all “Things” (review subjects), excluding old and deleted revisions. Then, for each thing, we reset the settings for updating information from external websites. We build an array of asynchronously run promises which contact external websites like Wikidata and Open Library. The limit() call throttles these requests to two at a time.

The main readability problem is the increasing nesting. If you add .catch() blocks to Promises, it can be even more difficult to follow what’s going on, and to make sure all your brackets are in the right place.

Here’s what this sequence looks like with async/await:

  const things = await Thing.filterNotStaleOrDeleted();
  things.forEach(thing => thing.setURLs(thing.urls));
  await Promise.all(
    things.map(thing => 
      limit(() => thing.updateActiveSyncs())
    )
  );

It’s a lot easier to see what’s going on. And this isn’t even accounting for the greater simplicity of success/error handling. Under the hood, async/await works with Promises, and there are many situations where using Promises directly is fine (note even the second version uses Promise.all). But for more complex operations, it really makes a difference.
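To make the point about error handling concrete, here is a self-contained sketch you can run in Node. The `createLimit` function is a simplified stand-in I wrote for illustration, not the limit() helper the sync-all script really uses, and the "things" are fakes that only wait for a few milliseconds.

```javascript
// Minimal concurrency limiter (an assumption, standing in for limit() above).
function createLimit(maxConcurrent) {
  let active = 0;
  const queue = [];
  const next = () => {
    if (active >= maxConcurrent || queue.length === 0) return;
    active++;
    const { task, resolve, reject } = queue.shift();
    // Promise.resolve().then(task) also catches tasks that throw synchronously.
    Promise.resolve().then(task).then(resolve, reject).then(() => {
      active--;
      next();
    });
  };
  return task => new Promise((resolve, reject) => {
    queue.push({ task, resolve, reject });
    next();
  });
}

// One try/catch covers every step; no nested .then()/.catch() chains needed.
async function syncAll(things, limit) {
  try {
    await Promise.all(things.map(thing => limit(() => thing.updateActiveSyncs())));
    return 'All updates complete.';
  } catch (error) {
    return `Sync failed: ${error.message}`;
  }
}

// Fake "things" that record how many updates run at the same time.
let running = 0;
let maxRunning = 0;
const fakeThings = [1, 2, 3, 4, 5].map(id => ({
  id,
  async updateActiveSyncs() {
    running++;
    maxRunning = Math.max(maxRunning, running);
    await new Promise(resolve => setTimeout(resolve, 10));
    running--;
  }
}));

const done = syncAll(fakeThings, createLimit(2)).then(message => {
  console.log(message, 'Peak concurrency:', maxRunning);
  return message;
});
```

If any single update rejects, `Promise.all` rejects and the `catch` block turns that into one readable message.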

While I’m at it, I’m adding standardized documentation in jsdoc format to modules as I go. Essentially, these are code comments in a special syntax that can be used to generate HTML output. You can find the generated result here; it will be updated every 24 hours from the currently deployed codebase.


New screencast is up

A lot has happened since the last screencast, from October 2016: we got full-text search, integration with Open Library and Wikidata, a rich-text editor, and other goodies. Here’s an updated screencast (YouTube version) that gives a brief overview:


Open Library autocomplete search is live

On the heels of basic Open Library support announced last week, we now have an autocomplete search box for book titles as well. This has necessitated a bit of a redesign of the relevant part of the review form. Here’s what it looks like to perform an Open Library search:

Open Library search box

What’s new here is the dropdown that lets you choose between Wikidata and Open Library. In future, other sources like OpenStreetMap may make an appearance here as well.

The actual search is, I think, quite a bit nicer than the title search on OpenLibrary.org itself. If you search on OpenLibrary.org directly, you won’t get an autocomplete match for titles like “the wealt” that would match the title “The Wealth of Nations”. In our case this works just fine. Our search is also not sensitive to the word order, often producing more results at the cost of some irrelevant ones.

Unlike OpenLibrary.org’s search, our search attempts a match against both the stemmed version of the words you enter (e.g., “dog” will match both “dog” and “dogs”) and against the wildcard version (“dog” will also match “dogcatcher”). To do this we have to fire off two requests per query. I’ve put some notes together on OL’s GitHub repository in case there’s interest in building on these improvements for the native search.
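The two-requests-per-query approach boils down to building two variants of the search term and merging the results. Here is a rough sketch; the function names, the trailing `*` wildcard syntax, and the `key` field on results are assumptions for illustration, not our actual adapter code.

```javascript
// Build both query variants: the plain term (stemming is left to the search
// backend) and a wildcard version. The "*" suffix is an assumed syntax.
function buildTitleQueries(input) {
  const term = input.trim();
  return { stemmed: term, wildcard: `${term}*` };
}

// Merge both result lists, keeping the first occurrence of each key so that
// a book matched by both requests is shown only once.
function mergeResults(stemmedResults, wildcardResults) {
  const seen = new Set();
  const merged = [];
  for (const result of [...stemmedResults, ...wildcardResults]) {
    if (!seen.has(result.key)) {
      seen.add(result.key);
      merged.push(result);
    }
  }
  return merged;
}

// Made-up results with placeholder keys.
const mergedBooks = mergeResults(
  [{ key: 'example-1', title: 'Dogs' }, { key: 'example-2', title: 'Dog' }],
  [{ key: 'example-2', title: 'Dog' }, { key: 'example-3', title: 'Dogcatcher' }]
);
console.log(buildTitleQueries(' dog '), mergedBooks.map(book => book.title));
```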

Finally, this search has a little extra feature: you can narrow search results by author by adding an author’s name (partial or full) separated with “;” after the title. This is a bit more obscure than something like “author:”, but I figured it’s nice to have a shortcut for something you may want to do very frequently—and it’s documented in the help that’s shown next to the input.
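Splitting such an input into its title and author parts is simple enough. A minimal sketch, with a hypothetical function name and return shape:

```javascript
// Split "title; author" at the first semicolon. The author part is optional.
function parseBookQuery(input) {
  const separatorIndex = input.indexOf(';');
  if (separatorIndex === -1) return { title: input.trim(), author: null };
  const title = input.slice(0, separatorIndex).trim();
  const author = input.slice(separatorIndex + 1).trim();
  // Treat an empty author part ("dog;") the same as no author at all.
  return { title, author: author || null };
}

console.log(parseBookQuery('the wealth of nations; smith'));
```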


Introducing basic support for Open Library metadata

In addition to descriptions and labels from Wikidata, we now also extract authors, titles and subtitles from Open Library URLs. If you haven’t heard of it, Open Library is a fabulous project by the Internet Archive that’s both a structured wiki with data about books, and an actual library.

After making a simple user account, you can “check out” up to 5 books at a time, as long as the Archive has a physical copy of them. You can either read them online or download (DRM-protected) PDF files. As of now, the number of books available is at a staggering 522,358.

We use Open Library as a free catalog that doesn’t have the onerous licensing terms of WorldCat. In future, we’ll add more metadata fields like publication year, number of pages, and so on. For now, when you add an Open Library URL to an existing review subject page, the result is something like this:

Open Library imported data

Editions vs. Works

Open Library distinguishes between “editions” and “works”. A work encompasses all translations and releases of a book, while an edition is a specific one. Information like the number of pages and the publication year obviously varies across editions, which is why we won’t include it until we have a concept of “editions” on our side.

We’ll likely want to generalize that concept, since it is applicable in other domains as well: the different versions of a movie, the generations of a product, and so on. This is tricky stuff—at what point is a product so different that it merits its own top level record?

Right now, we’re mushing information together if you provide multiple Open Library URLs for the same item. Modelling out the relations between things without adding too much complexity will be one of the biggest challenges in the future.

Search

We don’t have an Open Library powered search box on the “New review” page yet, as we do for Wikidata. Adding the search box is relatively straightforward, though the search results can be a bit frustrating due to word stemming rules that don’t play well with autocomplete search. Nonetheless, some search is better than no search, so we’ll add that in the near future.


Support for more links related to a review subject

Until recently, we only showed you one link for every review subject (like a movie or book)—the one that was added alongside the first review. We now have an interface for managing links. If you’re logged in with a trusted user account (i.e. you have written at least one sane review), you’ll see a “Manage links” button on review subject pages (say, “New Internationalist”). If you click it, you’ll see this interface:

Manage links interface

As always, this should work just fine without JavaScript if that’s your thing; you’ll just have to make do without the “Add more” button and submit the form several times if you want to add more than a couple of links.

The “primary” link here is typically the official website of a movie, book, or product if one exists. Additional links can go to databases, review sites, or anything else that seems appropriate. The software will attempt to automatically classify links to commonly used sites like Wikidata, IMDb, OpenStreetMap, etc. On the page, it will look like this (different example):

Metadata example

As before, if one of the links goes to Wikidata, we automatically extract and index summaries from there. In future we’ll add additional interfaces to more sites to pull over metadata where appropriate.
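The automatic link classification mentioned above could work roughly like this: match the hostname of each URL against a list of known sites. The rule list, tag names, and function name here are hypothetical, not the rules the site actually ships with.

```javascript
// Hypothetical rules: a hostname pattern and the tag to assign when it matches.
const LINK_RULES = [
  { tag: 'wikidata', pattern: /(^|\.)wikidata\.org$/ },
  { tag: 'imdb', pattern: /(^|\.)imdb\.com$/ },
  { tag: 'openstreetmap', pattern: /(^|\.)openstreetmap\.org$/ },
  { tag: 'openlibrary', pattern: /(^|\.)openlibrary\.org$/ }
];

// Return the tag of the first matching rule, "other" for unknown sites,
// or "invalid" if the input cannot be parsed as a URL at all.
function classifyLink(url) {
  let hostname;
  try {
    hostname = new URL(url).hostname;
  } catch (error) {
    return 'invalid';
  }
  const rule = LINK_RULES.find(({ pattern }) => pattern.test(hostname));
  return rule ? rule.tag : 'other';
}

console.log(classifyLink('https://www.wikidata.org/wiki/Q42'));
```

Anchoring each pattern at the end of the hostname means a lookalike such as wikidata.org.example.com is not misclassified.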

There’s one catch: any link can be used for exactly one review subject. That should usually not be a problem, but there may be cases like “product guides” that are applicable to multiple review subjects. We may create targeted features for some of those cases down the road.


You can now edit descriptions, or keep them in sync with Wikidata

When you have a review subject like Les Misérables, it’s good to have additional metadata. Are you talking about the book by Victor Hugo? One of the many film and TV adaptations? The musical? In future we’re planning to add metadata that’s useful for specific domains (e.g., opening times for restaurants, or publisher info for books), but for now, we’ve added support for a short text description.

Wikidata, which we support as a source for selecting review subjects, already has these descriptions for many items. For example, it describes the book “Nutshell” (reviews) as the “17th novel by English author and screenwriter Ian McEwan”. When you review an item via Wikidata, this description (in all available languages) is automatically imported. We’ve now also added these descriptions to all reviews that had a Wikidata URL associated with them, and they will be updated automatically every 24 hours.

An automatically updated description looks like this:

Screenshot of automatically imported description

For reviews that are not associated with a Wikidata item, you can add or edit the description after writing a review.

