Blinding e-Leaves with Clover: deriksmith

Mar 09, 2010 18:51

I got a Sony PRS-300 e-reader for Christmas. Totally unexpected, hadn’t asked for it, etc. E-readers are one of those “well I guess it would be nice, but do I actually want one?” technologies.
The short answer is yes. The long answer is oh God yes.
Free to Read

Before getting an e-reader, I could count the digitized novels I’d read in the last decade on one hand. I’ve quadrupled that number in just over 2 months. It’s a much more natural way to read, and less wearing on the eyes than an active display. For aging readers, the ability to increase font size is a killer app all its own.

But for me, the real killer application for the technology wasn’t digital editions of books I can already buy (though it’s nice to be in the middle of 5 novels at once- the e-reader always saves my spot!). It’s the ability to read stuff that previously would have been classed as “online only.”

The Sony PRS series can read PDFs natively: just drop them onto the device and it converts and displays them on the fly, complete with inline images. (I gather most other e-readers require you to run them through a converter first.) I ditched the bundled software in favor of Calibre, a free, open-source e-book management package, which converts between the various proprietary e-reader formats more-or-less seamlessly. (Unlike the Kindle and the Nook, the Sony e-readers don’t have a proprietary format… they use ePub, the open standard.)
No roads lead to .ePub

Getting content that isn’t already in PDF or .lit into ePub is a bit frustrating, however. A single .txt or .html file can be dropped into Calibre and converted easily enough… though often not well, if it had previously been fixed-column text. (Like LiveJournal, or FanFiction.net… or almost anywhere else you’d want to rip long-format fiction from.) And even then, there’s no mechanism for bundling multiple chapters together.

Well, that’s not entirely true- zipped HTML files will import bundled, but they’re subject to the same formatting problems mentioned previously, and they sometimes come in out of order with no way to fix them.
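The fixed-column problem is mostly a matter of hard line breaks in the middle of paragraphs. As a rough illustration (not anything Calibre does internally, and not the Clover code), here is a minimal Python sketch that assumes paragraphs are separated by blank lines and every other line break is just wrapping. The function name `unwrap_paragraphs` is made up for this example.

```python
import re


def unwrap_paragraphs(text):
    """Join hard-wrapped lines into paragraphs.

    Assumption: one or more blank lines separate paragraphs, and any
    other line break is an artifact of fixed-column wrapping.
    """
    # Normalize Windows and old Mac line endings first.
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    # Split wherever there is at least one blank (or whitespace-only) line.
    blocks = re.split(r"\n\s*\n", text.strip())
    paragraphs = []
    for block in blocks:
        # Collapse the remaining line breaks and runs of spaces.
        joined = " ".join(block.split())
        if joined:
            paragraphs.append(joined)
    return paragraphs


sample = (
    "It was a dark and\n"
    "stormy night; the rain\n"
    "fell in torrents.\n"
    "\n"
    "Second paragraph\n"
    "here.\n"
)
for paragraph in unwrap_paragraphs(sample):
    print(paragraph)
```

Text that marks paragraphs by indentation instead of blank lines would need a different rule, which is probably why one-size-fits-all converters do it badly.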

As an experiment, I peeled apart a couple of ePubs to see how the metafiles were put together, then hand-converted a 65,000-word, 18-chapter DC Comics fan fiction into ePub format and managed to produce a working file. Huzzah! But that took more than an hour, and a lot of headaches.
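For anyone curious what those metafiles look like, here is a hedged Python sketch of a bare-bones ePub 2 container: a `mimetype` entry stored first and uncompressed, `META-INF/container.xml` pointing at the package file, an OPF manifest and spine, and an NCX table of contents. The function name `build_epub`, the `OEBPS/` folder, the `chapNN.xhtml` naming, the placeholder identifier and the hard-coded `en` language are all assumptions for illustration; this is not the hand-built file described above.

```python
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr

CONTAINER_XML = """<?xml version="1.0" encoding="UTF-8"?>
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
  <rootfiles>
    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
  </rootfiles>
</container>
"""


def build_epub(target, title, author, chapters,
               book_id="urn:uuid:00000000-0000-0000-0000-000000000000"):
    """Write a minimal ePub 2 file to `target` (a path or a binary file object).

    `chapters` is a list of (chapter_title, complete_xhtml_document) pairs.
    The default `book_id` is a placeholder, not a real identifier.
    """
    names = [f"chap{i:02d}.xhtml" for i in range(1, len(chapters) + 1)]
    manifest = "\n".join(
        f'    <item id="c{i}" href="{n}" media-type="application/xhtml+xml"/>'
        for i, n in enumerate(names, 1))
    spine = "\n".join(f'    <itemref idref="c{i}"/>' for i in range(1, len(names) + 1))
    opf = f"""<?xml version="1.0" encoding="UTF-8"?>
<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="bookid">
  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
    <dc:title>{escape(title)}</dc:title>
    <dc:creator>{escape(author)}</dc:creator>
    <dc:language>en</dc:language>
    <dc:identifier id="bookid">{escape(book_id)}</dc:identifier>
  </metadata>
  <manifest>
    <item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
{manifest}
  </manifest>
  <spine toc="ncx">
{spine}
  </spine>
</package>
"""
    nav = "\n".join(
        f'    <navPoint id="n{i}" playOrder="{i}"><navLabel><text>{escape(t)}</text>'
        f'</navLabel><content src="{n}"/></navPoint>'
        for i, ((t, _), n) in enumerate(zip(chapters, names), 1))
    ncx = f"""<?xml version="1.0" encoding="UTF-8"?>
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <head><meta name="dtb:uid" content={quoteattr(book_id)}/></head>
  <docTitle><text>{escape(title)}</text></docTitle>
  <navMap>
{nav}
  </navMap>
</ncx>
"""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as z:
        # The mimetype entry has to come first and must not be compressed.
        z.writestr("mimetype", "application/epub+zip", compress_type=zipfile.ZIP_STORED)
        z.writestr("META-INF/container.xml", CONTAINER_XML)
        z.writestr("OEBPS/content.opf", opf)
        z.writestr("OEBPS/toc.ncx", ncx)
        for name, (_, xhtml) in zip(names, chapters):
            z.writestr(f"OEBPS/{name}", xhtml)


page = ('<?xml version="1.0" encoding="UTF-8"?>\n'
        '<html xmlns="http://www.w3.org/1999/xhtml"><head><title>One</title></head>'
        '<body><p>Hello.</p></body></html>')
buffer = io.BytesIO()
build_epub(buffer, "Example Book", "A. N. Author", [("One", page)])
print(zipfile.ZipFile(buffer).namelist())
```

Four small metafiles plus the chapters: easy for a script, tedious by hand across 18 chapters.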

Damnit, I was going on vacation! All I wanted was some light fiction to take with me that I didn’t have to pay for!

So I wrote Clover.
Clover

Clover is an iterative conversion engine running on XAMPP, which I wrote from scratch in PHP (PHP5 to be exact). It takes content in a fixed format (.txt files and an index.html linking them together), which I had ripped using HTTrack, and bundles it for 1-click conversion… converting all the files concerned into compliant XHTML, adding basic styles and cleaning up the old files, as well as doing some title/author detection for easy import.

Basically it turns folders full of files into zips full of more files in different formats, with filenames specifically formatted so data imports correctly. This is the application workflow:

[Flowchart: Clover application workflow]

If I’d wanted to mess with all the metafiles I could have cut Calibre out entirely and converted straight to ePub. This is for my personal use, so I didn’t bother. (Anyway, Calibre has a ‘bulk convert’ function.)
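Clover itself is PHP and isn’t reproduced here, but the general shape of that folder-to-zip step can be sketched in Python. Everything specific in this sketch is an assumption made for illustration: the `bundle` and `to_xhtml` names, the idea that the index page’s `<title>` reads “Title by Author”, the numbered `NNN.xhtml` chapter names, the regenerated `index.html` inside the zip, and the `Title - Author.zip` output name.

```python
import re
import tempfile
import zipfile
from html import escape
from html.parser import HTMLParser
from pathlib import Path

CSS = "body { margin: 1em; } p { text-indent: 1.5em; margin: 0; }"


class IndexParser(HTMLParser):
    """Collect the page <title> and, in document order, each link to a .txt file."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []  # [href, link text] pairs
        self._in_title = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href") or ""
            if href.lower().endswith(".txt"):
                self.links.append([href, ""])
                self._in_link = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_link:
            self.links[-1][1] += data


def to_xhtml(title, text):
    """Wrap plain text in a small XHTML page; blank lines separate paragraphs."""
    paras = [" ".join(b.split()) for b in re.split(r"\n\s*\n", text.strip())]
    body = "\n".join(f"<p>{escape(p)}</p>" for p in paras if p)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            '<html xmlns="http://www.w3.org/1999/xhtml"><head>'
            f'<title>{escape(title)}</title><style type="text/css">{CSS}</style></head>\n'
            f"<body><h2>{escape(title)}</h2>\n{body}\n</body></html>\n")


def bundle(folder, out_dir):
    """Turn a folder of index.html + .txt chapters into one zip of XHTML files."""
    folder = Path(folder)
    parser = IndexParser()
    parser.feed((folder / "index.html").read_text(encoding="utf-8"))
    # Assumed convention: the index <title> looks like "Title by Author".
    title, _, author = parser.title.strip().rpartition(" by ")
    if not title:
        title, author = parser.title.strip() or folder.name, "Unknown"

    def safe(name):
        # Replace characters that are not allowed in common filesystems.
        return re.sub(r'[\\/:*?"<>|]', "_", name)

    out = Path(out_dir) / f"{safe(title)} - {safe(author)}.zip"
    toc = []
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as z:
        for n, (href, label) in enumerate(parser.links, 1):
            name = f"{n:03d}.xhtml"  # zero-padded so the chapters sort in order
            label = label.strip() or f"Chapter {n}"
            text = (folder / href).read_text(encoding="utf-8")
            z.writestr(name, to_xhtml(label, text))
            toc.append(f'<li><a href="{name}">{escape(label)}</a></li>')
        items = "".join(toc)
        z.writestr("index.html",
                   f"<html><head><title>{escape(title)}</title></head>"
                   f"<body><h1>{escape(title)}</h1><ol>{items}</ol></body></html>")
    return out


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "story"
    src.mkdir()
    (src / "index.html").write_text(
        "<html><head><title>Example Story by A. N. Author</title></head>"
        '<body><a href="ch1.txt">Chapter One</a> <a href="ch2.txt">Chapter Two</a>'
        "</body></html>", encoding="utf-8")
    (src / "ch1.txt").write_text("It began\non a Tuesday.\n\nNobody noticed.", encoding="utf-8")
    (src / "ch2.txt").write_text("It ended on a Friday.", encoding="utf-8")
    print(bundle(src, tmp).name)
```

Numbering the chapter files in index order is one way to dodge the out-of-order import problem, since the order is baked into the filenames instead of being left to the importer.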

All of this took less than 3 hours.
Creating the flowchart took longer than writing the program it describes. (OpenOffice’s flowchart tools aren’t very good, frankly.)

This is a source of enormous frustration to me… why does this functionality not already exist? E-books are already 4% of all books sold in the US. By 2013, 13 million people will be using e-readers; there are already 1.5-2.5 million e-readers (of various stripes) in service. The market is growing in double digits year-on-year and it’s not small.

So why is it that the only content I can get for that format… is the content that’s already available as non-e books? That’s asinine!
Counter-Mindset Implementation

The answer is probably twofold:

Content producers have spent 15 years championing the online model for accessing their content, with embedded trackable ads. Switching to ‘dead’ editions where even if ads are embedded they can’t return any tracking or usage data… is the equivalent of switching back to the newspaper model of advertising. Those ad-impressions are still good- maybe even better due to fewer distractions- but they run counter to the direction the industries have been striving.
The content management systems used by hosts to manage their vast archives of monetizable content are huge, often ponderous. The prospect of converting them to produce bundled versions of print-editions which update when corrections are made to each constituent part… is rightly terrifying. A nightmare of moving parts added to their monolithic infrastructure.

For the first… that’s lack of vision, and fear. An industry that clings to the old comfortable way of doing things when the wind changes, instead of adapting to the new environment, is choosing not to exist in 10 years.
For the second… it’s a blindspot. Yes, generating and updating ePub compilations on the back-end would be a mind-numbing exercise in self-abuse. …so why not generate them on the front-end?

Picture a “get ePub” link, which pops up a window with a Flash application that grabs all the necessary files, reformats them, adds styles and ads, bundles them and then converts them to an ePub file for download… on the client side. “Spooling chapters, binding e-book… Ready! Click to download!”

That’s exactly what the Clover app I wrote above does! It is demonstrably not hard! You wouldn’t even need to have an API in place for accessing the content! It’s an in-site mash-up!

Bah. I am frustrated that this functionality doesn’t exist already, because I wanted to use it for my vacation, and I frankly resent having to invent it.

I predict that if walled-garden content sites don’t implement e-reader conversion in-house while they can embed ads and control it, a site like vixy.net will appear for ripping that content without ads… and all control will be lost. The window of opportunity is narrower than it seems: they have until June 2010 to head off the 3rd-party content-rippers.

And because I resent having to program this sort of thing myself… I’m going to be a dick and strongly assert my copyright over the workflow described above, as well as the concept of an on-the-fly front-end ePub spooler, dating from February 27th, 2010.

Document your provenance carefully.

rant, epub, ereader, conversion, asshattery, prs-300, kindle, nook, conversation, clover, e-reader, walled gardens