A few days ago, my wife asked if I knew how to edit mobi files. A friend of hers was having some issues getting her latest book to come out right, and while she had someone on it, perhaps I could do something. I gave a resounding “maybe”. Hey, Calibre has an edit button, right? Surely this stuff is documented online? I didn’t really expect to do anything that useful right out of the gate, but even though I’ve converted stuff to the format at times, all I’d learned so far was approximately “It’s that format Amazon uses, you can make it with Calibre”. So I asked if I could look. I could and did. So I’m going to write a little about what I’ve learned so far.
Firstly, they’re often called azw files. Mobi is the older Mobipocket format, which Amazon bought for use with the Kindle, and by eBook standards it’s ancient. It’s in Palm database format, which places it way early in the eBook game, but it’s still used since the Kindle happened to be what flipped the script and started convincing non-techies that reading eBooks is pretty cool.
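For the curious, the Palm database heritage is visible in the first few bytes of a file. Here is a minimal sketch, assuming the usual Palm database header layout (a 78-byte header with a four-byte type and a four-byte creator code at offsets 60 and 64, which read BOOK and MOBI for these books, and a record count at offset 76). The function names and the fake header are made up for illustration, so it runs without a real book:

```python
import struct

PALM_HEADER_SIZE = 78  # assumed size of the Palm database header


def read_palm_header(data: bytes) -> dict:
    """Pull the name, type/creator codes and record count out of a Palm DB header."""
    if len(data) < PALM_HEADER_SIZE:
        raise ValueError("too short to be a Palm database")
    # The database name is a zero-padded field in the first 32 bytes.
    name = data[:32].split(b"\x00", 1)[0].decode("latin-1")
    db_type, creator = struct.unpack_from(">4s4s", data, 60)
    (num_records,) = struct.unpack_from(">H", data, 76)
    return {
        "name": name,
        "type": db_type.decode("ascii", "replace"),
        "creator": creator.decode("ascii", "replace"),
        "records": num_records,
        "looks_like_mobi": db_type + creator == b"BOOKMOBI",
    }


def make_fake_header(name: bytes, num_records: int) -> bytes:
    """Build a synthetic header (hypothetical data, not a real book)."""
    header = bytearray(PALM_HEADER_SIZE)
    header[: len(name)] = name
    header[60:68] = b"BOOKMOBI"
    struct.pack_into(">H", header, 76, num_records)
    return bytes(header)


if __name__ == "__main__":
    print(read_palm_header(make_fake_header(b"An_Exceptional_Twist", 42)))
```

To try it on an actual file, read the first 78 bytes with open(path, "rb") and pass them to read_palm_header. This only peeks at the outer container; it says nothing about whether the file holds mobi 7, KF8 or both.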
Secondly, it’s actually a collection of various possible formats. This makes a lot of sense if you think about it: Kindles have been around for six or seven years now, and the current ones make laptops from that era look slow. The newer stuff couldn’t (at least not entirely) have been planned for, and it wasn’t. The format has grown. It’s still backwards compatible though, so you really can dig out your Kindle 1, order some fresh bestsellers (or my wife’s friend’s book when it’s done) and read them just fine. That’s pretty cool. Some of the other formats are just there for specific stuff that requires very different formatting, like manga, photo books, maps and so forth. Unless you’re publishing or making stuff like that, you don’t have to worry about them, because one format is pretty much *the* format. Even if you’ve been on board from the start, it’s fairly likely it’s all you’ve seen. That’s also good: the less to worry about, the better.
But one thing will still catch you. The “it” format is actually two formats. The old mobi is usually called mobi 7, though at the consumer level there’s nothing about the file itself that makes it look different from the newer ones. When the Kindle Fire launched, a new format called KF8 (or sometimes mobi 8) came on the scene. It’s expanded in quite a few ways: embedded fonts, complex tables, accurate control over the layout, fixed layout and a ton of other stuff. It, too, lives in an azw or mobi file, and that file contains a copy of both. The old format comes first, and a second copy of the book in the newer format is attached after it. Older devices and software don’t mind; they just read the file as though it held nothing but the original format. That’s why an ancient Kindle can still read your books.
Just to top it off, the file also contains a copy of whatever those two copies were generated from, which is almost always an ePub or something very similar to one. It’s not always in there; some people delete it since it takes up something like a third of the file and no consumer device actually uses it for anything. I imagine it’s quite handy for people working on the books though: since the source files are included, it’s possible to unpack the book, change/fix/update it and pack it back up good as new. Packing it back up means generating both copies again, including the older (and formatting-wise rather bare-bones) one, so that you end up with a properly formed file that can be read on both kinds of device or unpacked again.
The making is done with one of a few different tools. Mobipocket puts out Mobipocket Creator, though I think it may end up using kindlegen under the hood (I didn’t go all the way through with building a book in it). There’s also a plugin from Amazon for Adobe InDesign, which looks kinda neat, but I don’t have InDesign so I haven’t seen it. Amazon puts out the official tool, kindlegen, which converts an ePub, or a directory containing what would be in an ePub, into the two formats and combines the whole shebang into a working mobi eBook. (ePubs are actually zip files; you can unpack them with a normal unzipper like 7zip or WinRAR to see their insides.) They also put out Kindle Previewer, which does the same thing (using the same stuff kindlegen uses, I presume) but can also show the book on screen as it would appear on different devices. Those and Mobipocket Creator are free, available from these places:
The preview thing is important! As I mentioned, there are two formats in the file you make (or get from whomever), so even if you put it on your Kindle and go over it, you’re only looking at half the file. The other format in it could still be a complete disaster. The previewer isn’t perfect, but if you open something in it and select “Kindle Paperwhite” from the device menu, you’ll be looking at the old mobi 7 (in full low-resolution 16 grayscale glory). Choose Kindle DX to see the KF8 and/or Kindle Fire to see it in color. That way, you can actually catch misses in the other format. If you only check it on one or the other, you can’t tell what it’ll look like on the other side. It’s also wise to give it a go on Kindle for PC/Android/Mac/iOS along with any hardware lying around. Their current versions use KF8, but they’re still not functionally identical, and the resolution differences are even more extreme. My smallest screen is my somewhat aging Samsung Admire at 320 x 480, my biggest is the PC I’m typing on now at 1080p. That’s quite a gap in terms of making content work on both from one file.
There’s no official tool for unpacking them, unfortunately, but there’s a reverse engineered one by adamselene (possibly et al.). It’s called KindleUnpack, originally mobiunpack, which was the same thing without a GUI. Both are written in Python, so if you want to use them you’ll need to install that too if you don’t already have it. You can get the files here:
You need to install Python first, then unzip KindleUnpack and run KindleUnpack.pyw (open it in Python or double click it). You need to be in the same directory you unpacked it in, or it won’t find the files it brought with it in the “lib” directory (it doesn’t come with an installer as such). Doing that and picking the input and output, I unpacked the book as well as a few others for comparison. It’s a pretty straightforward interface once it’s running; the only options are whether it should in turn unpack the two extracted .mobi and .azw3 files as well so you can see their parts. It won’t work on all books, since the exact format is proprietary and known fully only to Amazon, so some situations are unhandled, but most books I tested worked fine or could at least cough up the source files (meaning kindlegen or Kindle Previewer can reassemble those into a fresh book with all the formats).
At the time I didn’t know any of this and misunderstood quite a bit of what was going on, but I was presented with these files:
- kindlegenbuild.log, 4 kB
- kindlegensrc.zip, 1.8 MB
- mobi7-formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.mobi, 1.3 MB
- mobi8-formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.azw3, 1.7 MB
.. and two directories, mobi7 and mobi8, containing:
- ./mobi7/formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.html
- ./mobi7/formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.nc
- ./mobi7/formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.opf
- ./mobi7/formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.rawml
- ./mobi7/Images
- ./mobi7/Images/cover00305.jpeg
- ./mobi7/Images/image00226.jpeg
- <pile more of image files>
and
- ./mobi8/formated MOBI Angel An Exceptional Twist Kimi Flores 2-26-14.rawml
- ./mobi8/META-INF
- ./mobi8/OEBPS
- ./mobi8/OEBPS/Fonts
- ./mobi8/OEBPS/Images
- ./mobi8/OEBPS/Styles
- ./mobi8/OEBPS/Styles/style0001.css
- ./mobi8/OEBPS/Text
- ./mobi8/OEBPS/Text/part0000.xhtml
- ./mobi8/OEBPS/Text/part0001.xhtml
- ./mobi8/OEBPS/Text/part0002.xhtml
- ./mobi8/OEBPS/toc.ncx
As you may notice, the image files for mobi8 failed to extract correctly from the KF8 portion. I’m not sure why, though kindlegensrc.zip does contain the source files (and will open in an ePub reader if renamed; it appears to be a compliant ePub). So, let’s go over what’s here.
The .log file is the output from kindlegen (in this particular case) when building the book, basically a brief list of what it put in and a “successfully encoded” message. The .mobi and .azw3 are (obviously?) the old and new standalone versions for their respective sides of devices/software. The jpegs are the images. There were a few gifs thrown in too; I believe those two are the only officially supported final image types, at least in mobi 7. Kindlegen will happily convert tons of other formats, but if you want the final say over the file that actually winds up with the customer, you’ll need to submit those two formats specifically and also make sure the images are within the sizes that don’t get reformatted for being too large. What constitutes too large varies. Amazon recommends less than 450×550, except for the cover, which is recommended at 800×600 for the older format and 1562×2500 for the newer. File size in bytes can trigger it too, and some files larger than 450×550 do not seem to be converted. Amazon seems to mostly give guidelines rather than stating point-blank what they will not tolerate, mostly with the justification that you should follow those guidelines anyway, and if you accidentally don’t and it happens to be something that can work under the specific circumstances, their end won’t intentionally prevent it.
As you can see, many of the files are of the same format in both (although they don’t contain exactly the same thing; kindlegen made different changes to the original files in each). They are also pretty similar to what is in kindlegensrc.zip, so let’s go ahead and look in that instead, since it’s where they came from and what will be directly edited if changes are made. It contains:
   Date      Time    Attr         Size   Compressed  Name
------------------- ----- ------------ ------------  ------------------------
2013-01-01 00:00:00 .....           20           20  mimetype
2013-01-01 00:00:00 .....          233          152  META-INF\container.xml
2013-01-01 00:00:00 .....         9069         1843  content.opf
2013-01-01 00:00:00 .....          776          331  ncx.ncx
2013-01-01 00:00:00 .....          317          211  c001.xhtml
2013-01-01 00:00:00 .....          288          185  toc.xhtml
2013-01-01 00:00:00 .....       876813       284210  c002.xhtml
2013-01-01 00:00:00 .....         2195          508  base.css
2013-01-01 00:00:00 .....        25560        24725  images\image003.png
< lots of more image files removed >
------------------- ----- ------------ ------------  ------------------------
                               2502514      1864119  89 files, 0 folders
“mimetype” is specific to ePub. It has to be the first file in the zip (ePub), stored without any compression, and it contains the mime type of the whole file. It does (“application/epub+zip”). Another requirement in ePubs is a directory named META-INF containing an xml file named container.xml. XML is a common cross platform/application data format; html is a close relative (and the xhtml used in ePubs actually is XML), so it looks similar. container.xml needs to contain a path (from the base of the zip) to the package file, here called content.opf, and specify that its type in turn is “application/oebps-package+xml”. That is indeed what is in this one. OEBPS stands for Open eBook Publication Structure, the older standard that ePub grew out of. The package file is an OPF file, Open Packaging Format, which further defines what is in the publication and in which other files (in this case prior to rearranging into the other formats).
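The two requirements above are easy to check by hand with Python's zipfile module. This is only a sketch of the idea, not a validator: inspect_epub and build_sample_epub are names I made up, and the sample is a tiny in-memory zip holding just the two files discussed, so the code runs without a real book.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {"c": "urn:oasis:names:tc:opendocument:xmlns:container"}

CONTAINER_XML = (
    '<container version="1.0" '
    'xmlns="urn:oasis:names:tc:opendocument:xmlns:container">'
    "<rootfiles>"
    '<rootfile full-path="content.opf" '
    'media-type="application/oebps-package+xml"/>'
    "</rootfiles>"
    "</container>"
)


def inspect_epub(source) -> dict:
    """Check the mimetype entry and look up the package file named in container.xml."""
    with zipfile.ZipFile(source) as zf:
        first = zf.infolist()[0]
        mimetype_ok = (
            first.filename == "mimetype"                    # must be the first entry
            and first.compress_type == zipfile.ZIP_STORED   # and stored uncompressed
            and zf.read(first) == b"application/epub+zip"
        )
        root = ET.fromstring(zf.read("META-INF/container.xml"))
        rootfile = root.find("c:rootfiles/c:rootfile", CONTAINER_NS)
        return {
            "mimetype_ok": mimetype_ok,
            "opf_path": rootfile.get("full-path"),
            "opf_media_type": rootfile.get("media-type"),
        }


def build_sample_epub() -> io.BytesIO:
    """Build a minimal stand-in for an ePub in memory (hypothetical sample data)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("mimetype", "application/epub+zip",
                    compress_type=zipfile.ZIP_STORED)
        zf.writestr("META-INF/container.xml", CONTAINER_XML,
                    compress_type=zipfile.ZIP_DEFLATED)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(inspect_epub(build_sample_epub()))
```

inspect_epub also accepts a file name, so pointing it at an extracted kindlegensrc.zip should work the same way, assuming the zip really is laid out like an ePub.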
Next in the list is the just mentioned OPF. We’re now crossing over into the meatier part of the deal and specifying what is actually in this particular book, most of the others are fairly similar in most books. Inside, we find:
<package xmlns:opf="http://www.idpf.org/2007/opf" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
  <metadata>
    <dc:title>An Exceptional Twist</dc:title>
    <dc:publisher>Kimi Flores</dc:publisher>
    <dc:creator opf:role="aut">Kimi Flores</dc:creator>
    <dc:description><p> What&rsquo;s a girl to do when the one person she&rsquo;s been forewarned about is the only one that her heart desires?</p>
    <p> Leah Valdez is a sassy, intelligent, hard-working woman whose beauty shines from both inside and out. Friends and family have always come first, but it&#39;s time for her to start thinking about her own future.</p>
    <p> Stefen Hunter is a rich, charming, sexy playboy. With seemingly no effort on his part, countless women flock to him. That is, until he meets Leah. He can&rsquo;t understand why it&rsquo;s so difficult to win her over or what it is about her that intrigues him so much.</p>
    <p> .. continues
As you can (likely?) see, it states its version and which specs it follows (a common “Hi, I’m a file of type…” opener among xml style files) and then launches into various specifics about the book. These do not all show up on all devices, but must be there so that the book itself can easily be identified with some base info (who wrote it, what it’s about, who published it, etc). It proceeds with date, language, ePub version (there are slews of further optional tags available) among a few others, and finishes off with:
<meta content="cover" name="cover"/>
<meta content="cover" name="cover"/>
<dc:identifier id="BookId" opf:scheme="uuid">4bf86602-cee6-436b-aeb2-86444522cd6a</dc:identifier>
.. before moving on past the metadata. The first of these defines the cover image. Later, in the manifest, every file to be included in the final book is listed and given an ID, and that ID is what’s referred to here instead of the file name. I’m not sure why the tag is there twice, but having more than one of a tag is allowed for most tags if there is more than one of whatever the tag specifies (author, editor, cover artist..), and it doesn’t seem to bother KindleGen any. The uuid is a little unusual: each book must have a unique one. They aren’t kept track of anywhere though (as far as I know, but I’m almost positive). One way to get a unique one is to go to http://www.famkruithof.net/uuid/uuidgen. It will give you one on the spot. Depending on the kind, they are generated either from the current time, in such a way that the one generated at that moment couldn’t have been generated previously and won’t be later, or from random numbers so large that a repeat is practically impossible (thus unique). I get the feeling there is, was or was meant to be more to it, but at this point everyone just gets them by calculating them as that place does. Then comes, proceeding further into the file, the manifest:
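If you’d rather not depend on a web page for this, Python ships a uuid module that does the same job locally. A small sketch (new_book_id and describe are just names I picked): uuid4() makes a random UUID, and parsing an existing one shows which kind it is.

```python
import uuid


def new_book_id() -> str:
    """Return a fresh random (version 4) UUID as a string."""
    return str(uuid.uuid4())


def describe(book_id: str) -> str:
    """Report which UUID version an existing identifier uses."""
    parsed = uuid.UUID(book_id)
    return f"{parsed} is a version {parsed.version} UUID"


if __name__ == "__main__":
    print(new_book_id())
    # The identifier from the OPF above:
    print(describe("4bf86602-cee6-436b-aeb2-86444522cd6a"))
```

The version is the first digit of the third group, so the identifier in this book (…-436b-…) is a version 4, meaning the random kind rather than a time-based one. uuid.uuid1() would give a time-based one if that’s ever wanted.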
<manifest>
  <item href="c001.xhtml" id="c001" media-type="application/xhtml+xml"/>
  <item href="toc.xhtml" id="toc" media-type="application/xhtml+xml"/>
  <item href="c002.xhtml" id="c002" media-type="application/xhtml+xml"/>
  <item href="base.css" id="base" media-type="text/css"/>
  <item href="images/image003.png" id="image003" media-type="image/png"/>
  <item href="images/image045.png" id="image045" media-type="image/png"/>
  <item href="images/image068.jpg" id="image068" media-type="image/jpeg"/>
  <item href="images/image027.png" id="image027" media-type="image/png"/>
  … continued slew of image files…
  <item href="images/image062.jpg" id="image062" media-type="image/jpeg"/>
  <item href="ncx.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
  <item href="images/cover.jpg" id="cover" media-type="image/jpeg"/>
  <item href="endmatter.css" id="endmatter" media-type="text/css"/>
</manifest>
<spine toc="ncx">
  <itemref idref="c001"/>
  <itemref idref="toc"/>
  <itemref idref="c002"/>
</spine>
<guide>
  <reference href="c001.xhtml" type="title-page" title="Title Page"/>
  <reference href="toc.xhtml" type="toc" title="Table of Contents"/>
  <reference href="c002.xhtml" type="text" title="An Exceptional Twist"/>
</guide>
</package>
And the OPF file is done. Here, it lists all files to be stowed in the main book file (possibly after conversion), starting with three xhtml files. xhtml is a lot like plain html, but has slightly stricter requirements, such as closing every tag and writing every attribute out in full with its value in quotes. This makes it a bit easier to make sure it’s rigorously formatted and nothing got forgotten. These particular ones can use a large part of html 5 (meaning, yes, KF8 can do a lot of what html 5 can, with the important exception of javascript and, except in some rare cases, audio and video content).
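How the IDs tie the pieces of the OPF together is easier to see in code. Below is a rough sketch using Python's built-in XML parser on a cut-down OPF. SAMPLE_OPF is a shortened stand-in I typed up from the listing above (images and the second stylesheet left out), and summarize_opf is a made-up helper, not part of any tool mentioned here.

```python
import xml.etree.ElementTree as ET

OPF_NS = {"opf": "http://www.idpf.org/2007/opf"}

# Shortened stand-in for content.opf (hypothetical sample, most items omitted)
SAMPLE_OPF = """<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">
  <metadata>
    <meta content="cover" name="cover"/>
  </metadata>
  <manifest>
    <item href="c001.xhtml" id="c001" media-type="application/xhtml+xml"/>
    <item href="toc.xhtml" id="toc" media-type="application/xhtml+xml"/>
    <item href="c002.xhtml" id="c002" media-type="application/xhtml+xml"/>
    <item href="base.css" id="base" media-type="text/css"/>
    <item href="ncx.ncx" id="ncx" media-type="application/x-dtbncx+xml"/>
    <item href="images/cover.jpg" id="cover" media-type="image/jpeg"/>
  </manifest>
  <spine toc="ncx">
    <itemref idref="c001"/>
    <itemref idref="toc"/>
    <itemref idref="c002"/>
  </spine>
</package>"""


def summarize_opf(opf_text: str) -> dict:
    """Map manifest IDs to files, then resolve the spine and the cover through them."""
    root = ET.fromstring(opf_text)
    manifest = {
        item.get("id"): item.get("href")
        for item in root.findall("opf:manifest/opf:item", OPF_NS)
    }
    # The spine lists IDs in reading order; look each one up in the manifest.
    reading_order = [
        manifest[ref.get("idref")]
        for ref in root.findall("opf:spine/opf:itemref", OPF_NS)
    ]
    # <meta name="cover" content="ID"/> also points at a manifest ID.
    cover_id = None
    for meta in root.findall("opf:metadata/opf:meta", OPF_NS):
        if meta.get("name") == "cover":
            cover_id = meta.get("content")
    return {
        "manifest": manifest,
        "reading_order": reading_order,
        "cover": manifest.get(cover_id),
    }


if __name__ == "__main__":
    summary = summarize_opf(SAMPLE_OPF)
    print(summary["reading_order"])
    print(summary["cover"])
```

The point is just that the spine and the cover meta tag never name files directly. Everything goes through the manifest IDs, so a typo in an ID breaks the link even if the file itself is fine.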
The three files here are all required: a title page (c001), a table of contents (toc) and one or more content files (c002, in this case the rest of the book beyond the cover and toc). They don’t have to be named this in particular, and it’s ok to split the content up further (with further entries pointing to the pieces so they’re found). Here they opted not to, which is fine; neither style is discouraged. “base.css” is a cascading style sheet. Style sheets are referenced from html files and define a lot of the formatting for them. It’s common to have only one or a few css files defining styles that are then used across many different html files, so that the styles don’t have to be repeated in each and can be modified without digging into every html file by itself. We finish up with that cover we promised to give an ID higher up and then another stylesheet called endmatter.css. This css isn’t actually referenced by any other file, and I don’t think it’s a name anything else grabs, so I’m not sure what it’s doing there really. Leftover from an earlier structure, perhaps?
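Spotting a stylesheet that nothing links to can be automated, at least roughly. The sketch below is my own quick check, not something any of the tools here do: it assumes the css and xhtml files sit in the same directory (so the href is just the file name), it only looks at link tags, and it would miss a stylesheet pulled in via @import. The sample xhtml snippets are invented.

```python
import re

# Matches the href of a <link ...> tag, with either quote style
LINK_HREF = re.compile(
    r'<link\b[^>]*\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)


def unreferenced_stylesheets(stylesheets, xhtml_files):
    """Return the stylesheet names that no xhtml file links to."""
    linked = set()
    for text in xhtml_files.values():
        linked.update(LINK_HREF.findall(text))
    return sorted(name for name in stylesheets if name not in linked)


if __name__ == "__main__":
    # Invented stand-ins for the real files
    pages = {
        "c001.xhtml": "<html><head><title>Cover</title></head><body/></html>",
        "c002.xhtml": '<html><head><link rel="stylesheet" type="text/css" '
                      'href="base.css"/></head><body/></html>',
    }
    print(unreferenced_stylesheets(["base.css", "endmatter.css"], pages))
```

With those invented pages it reports endmatter.css as the odd one out, which matches what I found by reading through the real files by hand.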
The ncx.ncx is a longer story; it too is required. It contains another table of contents, in about the same format as the last, but this one isn’t actually shown (usually; it’s up to the device). In this case it only lists the minimum required: the title page, a required text-only TOC and the book itself. toc.xhtml in turn also only lists the minimum required: the title page and the rest of the book. This is technically ok, and most of the KF8 devices don’t even read the ncx, but it was (and I’d assume is) a bit more useful on the old devices using mobi 7. Most of those have neither touchscreen nor keyboard, just four or five shortcut buttons for menu, main, back and so on, arrow keys, and two buttons on each side for next/previous page and next/previous section or bookmark. That last pair is where the ncx comes (or came) in: even in a new book without bookmarks you can skip through the sections (chapters or subheadings) with the buttons, since the ncx defines a set of important places in the text to jump between. Not sure why exactly the chapters were left out of it here, but it is allowed, and there’s another toc with the actual chapters right at the start of the book, so it may have been a conscious choice.
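To see which jump points an ncx actually defines, it can be read like any other XML file. A minimal sketch follows. SAMPLE_NCX is something I wrote to mirror the three entries described above; it is not the book’s real ncx.ncx, and the ids and labels in it are invented.

```python
import xml.etree.ElementTree as ET

NCX_URI = "http://www.daisy.org/z3986/2005/ncx/"
NCX_NS = {"ncx": NCX_URI}

# Invented sample with the three entries described in the text
SAMPLE_NCX = """<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
  <navMap>
    <navPoint id="np1" playOrder="1">
      <navLabel><text>Title Page</text></navLabel>
      <content src="c001.xhtml"/>
    </navPoint>
    <navPoint id="np2" playOrder="2">
      <navLabel><text>Table of Contents</text></navLabel>
      <content src="toc.xhtml"/>
    </navPoint>
    <navPoint id="np3" playOrder="3">
      <navLabel><text>An Exceptional Twist</text></navLabel>
      <content src="c002.xhtml"/>
    </navPoint>
  </navMap>
</ncx>"""


def list_nav_points(ncx_text: str) -> list:
    """Return (playOrder, label, target) for every navPoint, nested ones included."""
    root = ET.fromstring(ncx_text)
    entries = []
    for point in root.iter(f"{{{NCX_URI}}}navPoint"):
        label = point.find("ncx:navLabel/ncx:text", NCX_NS)
        content = point.find("ncx:content", NCX_NS)
        entries.append((int(point.get("playOrder")), label.text, content.get("src")))
    return sorted(entries)


if __name__ == "__main__":
    for order, label, target in list_nav_points(SAMPLE_NCX):
        print(order, label, target)
```

Each navPoint is one place the next/previous section buttons can land on, so a book with per-chapter navigation would simply have one navPoint per chapter, each with a content src pointing at that chapter’s spot in the text.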
And so that’s that: the ePub is fully defined. The html/css files link to and include each other to whatever degree they want and point out images or other resources to display, but the rest is the content, formatted in xhtml/css. c001.xhtml just contains a header stating its version, a tag to embed the cover and what size it should be. c002.xhtml is, as stated before, the entire book, with a link to base.css, which defines margins, text sizes, indents, alignment and so on for various specific groups of text. Those can be things like headers, subheaders, asides, addresses, names, quotations and so on. I say “can” because it’s pretty skimpy in this one; a lot of the formatting is done on a case by case basis, with the book text specifically stating “this text here should be font X, size Y and indented 15 pixels”, rather than the css saying “text marked as ‘body-text-listing-people’ should be <some format>” and tags in the actual book saying “this paragraph is ‘whichever-type-it-happens-to-be’” and thus getting their format as defined in the css. That’s ok, but it’d be better to define *what* things are first (that’s the markup) and then specify in the css how each of those things should be formatted. That way A) it’s easy to change your mind, since you can decide a certain type of thing in the book should look different than it does and change all of them in one fell swoop by editing the style definition in the css instead of each place it happens in the book, and B) whoever wrote the text got to state what it is rather than how it looks, so the formatter doesn’t have to guess or assume.
The way it is here is pretty common, unfortunately. It’s ok and can be made to work, but it’s much easier if the styles are defined much earlier in the process. That doesn’t need to happen in html. Word supports styles too, as do most word processors, and they can at least usually be salvaged when remaking the document into html. That can be pretty important, since a lot (if not most) of the rest of the formatting is usually trashed when moving it to xhtml and needs to be redone or manually excavated from the (usually abysmally formatted) exported file. It’s a lot easier and less error prone to just look at the originally written document (or ask the author/layouter/whoever) about each class of text and how it should look. As it stands now, you can only ask about or look up “The part that says <something> in chapter 3 a bit down”, not “All text that was marked as type ‘normal-body-text-emphases’”. Since a lot of the formatting requested by doing it in a word processing file isn’t possible (and even less is possible in the old version it’ll pump out) and has to be replaced, starting with a well-structured document, with the structure machine readable, helps considerably.
In this particular book the chapter (and other) titles are done in a decorative font which is what all those images that keep getting mentioned are. Those could have been done by including the font in the KF8 format (i.e. new ereaders/software could render it themselves which would save some space and avoid a few other issues, such as if the reader has a different background or font selected to override the default or the one specified in the book) but the old format needs them as images or to be allowed to fall back on some other font. Because of this the rest is mostly body text. The fancy TOC does look pretty neat though, except some of the titles had been reformatted a little much and ended up pretty low resolution (hence showing up as very small on high-res devices) and a little washed out. This was the main thing she wanted to change when this came up (afaik, I wasn’t there). I think it’s done, some other changes came up so I haven’t seen the sources. It’ll be interesting to see how things ended up once it’s put together again and I can revisit it.
Both unscaled screenshots of kindle previewer, set to Kindle DX
There isn’t a single reason something like that happens, but a lot of manual reformatting ups the odds. Any tinkering does, but the overall amount grows when changing lots of individual places and things instead of the overall structure, and judging by the tags in the main html and the definitions in the css (many seemingly abandoned), this has gone through a lot of piecemeal changes to individual spots. Some of the other images, like flourishes for jumps in the text that aren’t also chapters, are included multiple times too, sometimes in different formats. This would be easier to dodge if they had been specified and referenced a little more cohesively to begin with.
IDs that are human readable (“chapter-4-toc-thumbnail” instead of “image0034.img”) would also have been handy, since it’s easier to spot errors early or in passing just by looking that way, but I’m not sure how to do that in Word or any specific word processor really. I don’t work with them much.
As an aside here, the two-format deal and the limitations of each is something that would have been good to keep in mind when writing the original document. The final version will, when it’s sold on Amazon, be formatted into one fairly expansive format with high resolution and a slightly lighter version without the bells and whistles for older devices. Quite likely also an ePub (roughly comparable to KF8, equal but different), although here the ePub is the basis for the other two. Both kind of default to mostly living with the functions of the older format, except for including higher resolution images for the modern screens (nicer pictures and more size appropriate on screen), but it’s possible to segment off sections of the css during formatting, essentially saying “text marked ‘something’ should look like <complex format> if this is a KF8 reader, otherwise it should instead look like <simpler or other format>”. The whole process is a little like giving a presentation on TV in the late 70s or early 80s: if you needed to decide what colors to use in the presentation, you’d need to know that not everyone has a color TV. A majority do, so you still have to (or at least should) give some thought to what looks good, but you can’t hinge the whole thing on it by color coding vital information. It has to be a working (if not as nice) product even in black and white. The techies can and will try to put something together of course, but it can be kind of hamfisted sometimes, and it’s a fine line between artistic choice in terms of how much “extra” to put in the more expansive format version and how far to go to squeeze as much as possible out of an old format that’s intentionally pretty limited due to the smaller, older devices.
Whew, that’s a lot of text. I think I’ll leave it at that. I may write some more details some other time and perhaps do a more hands-on writeup of doing these things from start to end. Hope you enjoyed your read (for the few who actually made it this far). There is a bit more to say about the formatting inside the actual xhtml, but I think I’ll do that another time as well; I could stand to know a little more detail about it myself.