Difference between revisions of "New PND format"

Revision as of 18:54, 23 February 2011

The current PND format has some shortcomings as listed below. This page should serve as a discussion page/white board for how the format could be improved.

Proposal for discussion; these are just some opening shots.

Current situation

The current (ISO-based) PND format has the following shortcomings:

It currently uses ISO, SquashFS, and other filesystems... This means that there's no standard to follow, and libpnd needs to carry around support for a multitude of file systems.
It (often, but not always) uses the ISO file system which is inefficient at storing data, because:
- It contains duplicate headers for each file entry[1] (one version with big-endian integers, one version with little-endian integers[2]) which leads to many kilobytes of wasted space.
- Its file tables are fixed-size, and there are therefore limitations on how many folders you can have, what names you can give your files, how big your files can be etc. For instance, file names can be at most 31 characters long, must be all upper-case, and are limited to the ASCII character encoding, and an ISO only supports a folder depth of 8[3]. To support more, the Joliet file system extension is needed (or isofs won't recognize all file paths), and the Joliet header can be placed practically anywhere in the file, which means additional seek times.
- It needs various file system extensions to behave correctly, like for example Rock Ridge Interchange Protocol and Joliet. Without them, it becomes pretty useless as demonstrated above, and with them, it becomes difficult to read by a tool (You can only use isofs! There are no other tools out there for programmers to use!).
- It's very difficult to use since you need special ISO making tools to create the image. ISO is a relatively rare format if you're not technically inclined.
- If you want to add or remove files from it, it's near impossible if you use a compact ISO file, since there's no way that you can "expand" an existing ISO file.
The PND header data is at the end (or in the middle of the file if a screenshot is included) which makes it impossible for e.g. libmagic to recognize the PND file. It will instead recognize it as an ISO file. Having the header data at the end also makes it take a very long time to find the data, making tools and the libpnd library very inefficient.
The PND file uses a custom XML format for its metadata. There's no reason to do this, especially since the established ".desktop" file format fills exactly the same function.

(note: ".desktop" does not contain all metadata we have wanted over time, but it's of course possible to add extensions à la "X-Pandora-Whatever=xyz")

There's no "index" for the PND file. The whole file has to be scanned (albeit backwards) to find a PXML file, and there's a big chance for false positives etc.
Data is just appended linearly to the file so there's no order. If the format is to be extended (to e.g. include an icon file after the screenshot file), should the data just be appended as well?
UTF-8 is strictly the only encoding that is supported. If you make your PXML on a Windows machine, it won't work.

Proposed revisions

Step 1: File system

The file system for PNDs should be replaced by the uncompressed ZIP archive format. ZIP has the advantage that it's very compact (it adds little overhead per file), and uncompressed ZIP makes it possible to read data from the file without having to do any decompression.

ZIP files are mountable using various implementations of zipfs, and most of these implementations won't store files in memory when they are being read if the ZIP file is uncompressed.

Step 2: Metadata

PND files should no longer require special tools for them to be created. Therefore, since ZIP files support random access on files, it shouldn't be necessary to append/prepend metadata to the file. The user simply includes the "PXML.xml" and "preview.png" files inside of the ZIP, and any tools that need information about the PND can simply go to the central directory of the ZIP file (or use a simple ZIP library to do it for them) and get the location of the file inside of the ZIP. This should also dramatically decrease loading times for PNDs.
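A minimal sketch of the lookup described above, using Python's standard zipfile module as the "simple ZIP library". The helper name read_member and the in-memory stand-in package are made up for illustration; only the member names "PXML.xml" and "preview.png" come from the proposal.

```python
import io
import zipfile

def read_member(pnd, name):
    """Return the bytes of one member, located through the ZIP central directory."""
    with zipfile.ZipFile(pnd) as archive:
        info = archive.getinfo(name)  # raises KeyError if the member is missing
        return archive.read(info)

# Build a tiny stand-in package in memory (placeholder contents, not a real PND).
package = io.BytesIO()
with zipfile.ZipFile(package, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("PXML.xml", "<PXML/>")
    archive.writestr("preview.png", b"\x89PNG placeholder")

print(read_member(package, "PXML.xml"))
```

zipfile reads the central directory once when the archive is opened, so no scan over the file contents is needed to find a member.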

Note: performance testing is needed. At the time we started, zip was enormously slower than plain ISO; with driver changes this may or may not still be so, hence the multiple-filesystem-type system we adopted. zipfs is one possible option.

Step 3: Metadata format

Current PND files include so-called "PXML.xml" files. These files have a custom XML format that has a strange structure.

These files should be replaced by ".desktop" files. A PND can contain one or more ".desktop" files in its root directory that specify how an application should be launched. PND tools simply use all ".desktop" files they can find in the PND when creating launchers for the contents of the PND.
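How a PND tool might collect the launchers could look roughly like the sketch below. It assumes the ".desktop" files are UTF-8 and uses Python's configparser as a stand-in parser, which is not a full implementation of the desktop entry format. The function name load_launchers and the sample entry are hypothetical.

```python
import configparser
import io
import zipfile

def load_launchers(pnd):
    """Map each root-level ".desktop" member to its [Desktop Entry] keys."""
    launchers = {}
    with zipfile.ZipFile(pnd) as archive:
        for name in archive.namelist():
            # Only files in the root directory of the package count.
            if "/" in name or not name.endswith(".desktop"):
                continue
            parser = configparser.ConfigParser(
                delimiters=("=",), comment_prefixes=("#",),
                interpolation=None, strict=False)
            parser.optionxform = str  # keep keys such as "Exec" case-sensitive
            parser.read_string(archive.read(name).decode("utf-8"))
            if parser.has_section("Desktop Entry"):
                launchers[name] = dict(parser["Desktop Entry"])
    return launchers

# Stand-in package with one launcher in the root and one in a subfolder.
entry = "[Desktop Entry]\nType=Application\nName=Bla\nExec=./bla\nX-Pandora-Whatever=xyz\n"
package = io.BytesIO()
with zipfile.ZipFile(package, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("bla.desktop", entry)
    archive.writestr("docs/ignored.desktop", entry)

launchers = load_launchers(package)
print(launchers)
```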

Advantages

There don't have to be any special tools for reading PND files. The package can be run on any platform using any programming language that can read ZIP files.
We can use existing facilities to manage the launching of applications. The ".desktop" files can basically be copied without modification into standard locations of the system, and all launchers will become aware of them.
Reading PND files becomes easier (because of better tool support... sorry, but you can't link to libpnd in all programming languages) and quicker (because of the central directory for random file access).

Benchmarks

A benchmark was carried out to measure the performance of various package implementations. The compared systems were:

ZIP-based uncompressed packages using fuse-zip (a zipfs implementation) to read the files
ZIP-based compressed packages also using fuse-zip.
ISO 9660-based packages using isofs
CramFS-based packages.

A total of 9 files were generated for the test. They have sizes 1, 2, 4, 8, 16, 32, 64, 128 and 256 MB (511 MB in total, matching the archive sizes below), and were generated via "dd if=/dev/urandom of=$file bs=1M count=$size".

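Because the files contain random data, they did not compress very well. The resulting files were:

"zipcompressed.zip" 511.1 MB
"zipuncompressed.zip" 511.0 MB
"iso.iso" 511.3 MB
"cramfs.image" 95.4 MB (WOW! But this is suspicious for incompressible data: cramfs limits each file to less than 16 MB, so the larger test files were probably truncated.)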

Some notes:

No write tests were done for obvious reasons
No random access tests were done, since the Pandora will use SD cards and thus won't be penalized for random access, and that's not what we're interested in anyway.
Before each test, the commands "sync; echo 3 > /proc/sys/vm/drop_caches" were run.

First test: linear read time

Command: "time cat mountpoint/* > /dev/null"

ISO: 8.119 seconds
CramFS: 1.488 seconds (WOW!)
Zip (compressed): 8.535 seconds
Zip (uncompressed): 8.290 seconds

Second test: RAM usage

-- TODO, have to find a good method of measuring this --

Remaining issues

How should libpnd tell the difference between old and new PND files? Should it depend on the "file" tool, should it run its own recognition algorithm, or what? Should the extension be changed to e.g. ".box" (Which was proposed in a thread)?
How to make sure that the ZIP files are uncompressed? Should we provide a script that "uncompresses" an ordinary ZIP file?
Is cramfs better than ZIP?
- cramfs cannot be uncompressed like ZIP can
- If you compare compressed ZIP and cramfs, cramfs is more efficient.
- cramfs is harder to use than ZIP (for the developer).
- cramfs needs to have all of its opened files in memory.
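On the question of making sure a ZIP is uncompressed, one possible shape for such a script is sketched below with Python's zipfile module. The function names is_uncompressed and store_copy are made up, and the sketch reads each member fully into memory, which would need rethinking for large packages.

```python
import io
import zipfile

def is_uncompressed(pnd):
    """True if every member uses the "stored" (no compression) method."""
    with zipfile.ZipFile(pnd) as archive:
        return all(info.compress_type == zipfile.ZIP_STORED
                   for info in archive.infolist())

def store_copy(src, dst):
    """Write an uncompressed copy of the archive `src` to `dst`."""
    with zipfile.ZipFile(src) as old, \
         zipfile.ZipFile(dst, "w", compression=zipfile.ZIP_STORED) as new:
        for info in old.infolist():
            copy = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            copy.external_attr = info.external_attr  # keeps permission bits
            copy.compress_type = zipfile.ZIP_STORED
            new.writestr(copy, old.read(info))

# Demo: an "ordinary" deflated ZIP is rewritten as a stored one.
payload = "<PXML>" + "x" * 1000 + "</PXML>"
compressed = io.BytesIO()
with zipfile.ZipFile(compressed, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("PXML.xml", payload)

stored = io.BytesIO()
store_copy(compressed, stored)
print(is_uncompressed(compressed), is_uncompressed(stored))
```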

Upgrade path from the old PND format

A simple script can be written that extracts the old PND file, including the screenshot and the PXML.xml file. The PXML.xml file is then converted to one or more .desktop files, and the .desktop files, the preview.png file, and the package contents are packed into a ZIP that is then renamed to "*.pnd".

Usage scenario

Repackaging of an application from another package format

The user grabs the package for the application
He dumps the executable and all required libraries into a folder
He dumps a "preview.png" file into the folder that he's made
He copies the ".desktop" file(s) for the application from the old package, opens them, replaces "Exec=/usr/bin/bla" with "Exec=./bla" and saves them in the directory
He uses a ZIP archiver to make a ZIP out of the folder, making sure that he sets the "uncompressed" option.

Creating PNDs as part of a build process

The build tool creates a directory with all of the necessary information like above and invokes the "zip" utility to archive the folder, with compression turned off.
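For build systems that would rather not shell out, the same step can be sketched in Python. This is only an illustration: pack_directory and the file names "bla", "bla.desktop" and "bla.pnd" are invented for the demo. With the Info-ZIP command line tool, "zip -0 -r" should be the store-only equivalent.

```python
import os
import tempfile
import zipfile

def pack_directory(folder, pnd_path):
    """Archive every file under `folder` into `pnd_path` without compression."""
    with zipfile.ZipFile(pnd_path, "w", compression=zipfile.ZIP_STORED) as archive:
        for root, _dirs, files in os.walk(folder):
            for filename in files:
                full = os.path.join(root, filename)
                # Member paths are stored relative to the package root.
                archive.write(full, arcname=os.path.relpath(full, folder))

# Demo with throwaway files in a temporary directory.
with tempfile.TemporaryDirectory() as work:
    folder = os.path.join(work, "package")
    os.mkdir(folder)
    samples = (("bla", "#!/bin/sh\n"),
               ("bla.desktop", "[Desktop Entry]\nExec=./bla\n"))
    for filename, text in samples:
        with open(os.path.join(folder, filename), "w") as handle:
            handle.write(text)
    pnd_path = os.path.join(work, "bla.pnd")  # kept outside `folder` on purpose
    pack_directory(folder, pnd_path)
    with zipfile.ZipFile(pnd_path) as archive:
        names = sorted(archive.namelist())
        all_stored = all(info.compress_type == zipfile.ZIP_STORED
                         for info in archive.infolist())

print(names, all_stored)
```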

Accessing a PND's contents for the...

...lazy programmer

The programmer extracts the PND into a folder via libzip and accesses its contents.

...pragmatic programmer

The programmer uses zipfs to mount the ZIP and accesses the file's contents.

...smart performance-aware programmer

The programmer scans the ZIP file from the start until he finds the local file header signature 0x04034b50 (stored little-endian, i.e. the bytes 50 4b 03 04).
He jumps 18 bytes forward and reads the int at that location, storing it in "length".
He jumps 4 bytes and checks if this int matches "length". If it doesn't, it means that the file is compressed, and an error is reported.
He then jumps forward 4 bytes and stores the short at that location in a variable "nameLength".
Another 2-byte jump gets the short "extensionsLength".
He then jumps 2 bytes and reads "nameLength" amount of bytes from the file.
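He then compares this string with the name of the sought-after file (the name in the ZIP is not NUL-terminated, so memcmp() over "nameLength" bytes is safer than strcmp()). If it doesn't match, he continues scanning.
The programmer now skips "extensionsLength" amount of bytes.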
The programmer reads "length" amount of bytes from the file, and uses this as the sought-after file data.
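The steps above can be sketched as follows. This is an illustration under a few assumptions: the whole archive is already in memory as bytes, names are UTF-8, and the sizes in the local header are filled in (archives written with a trailing data descriptor store zeros there instead). Unlike the steps above, the sketch only reports the compression error for the sought-after file. find_stored_file is a made-up name.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"  # 0x04034b50 as it appears on disk (little-endian)

def find_stored_file(data, wanted):
    """Scan for local file headers and return the data of the stored member `wanted`."""
    target = wanted.encode("utf-8")
    pos = data.find(LOCAL_SIG)
    while pos != -1 and pos + 30 <= len(data):
        # Offset 18: compressed size, uncompressed size (4 bytes each),
        # then file name length and extra field length (2 bytes each).
        comp_size, length, name_len, ext_len = struct.unpack_from("<IIHH", data, pos + 18)
        name = data[pos + 30:pos + 30 + name_len]
        if name == target:
            if comp_size != length:
                raise ValueError("member is compressed: " + wanted)
            start = pos + 30 + name_len + ext_len
            return data[start:start + length]
        # The signature can also occur by chance inside file data, so this
        # linear scan may inspect false positives along the way.
        pos = data.find(LOCAL_SIG, pos + 4)
    raise KeyError(wanted)

# Demo on a small uncompressed archive built in memory.
package = io.BytesIO()
with zipfile.ZipFile(package, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("bla.desktop", "[Desktop Entry]\nExec=./bla\n")
    archive.writestr("PXML.xml", "<PXML/>")

print(find_stored_file(package.getvalue(), "PXML.xml"))
```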

...performance fascist/programmer who wants low seek times

The programmer scans the file backwards from the end until he finds the central directory file header signature 0x02014b50.
He then uses the following table to get the information he needs:

ZIP central directory file header
Offset	Bytes	Description
0	4	Central directory file header signature = 0x02014b50
4	2	Version made by
6	2	Version needed to extract (minimum)
8	2	General purpose bit flag
10	2	Compression method
12	2	File last modification time
14	2	File last modification date
16	4	CRC-32
20	4	Compressed size
24	4	Uncompressed size
28	2	File name length (n)
30	2	Extra field length (m)
32	2	File comment length (k)
34	2	Disk number where file starts
36	2	Internal file attributes
38	4	External file attributes
42	4	Relative offset of local file header
46	n	File name
46+n	m	Extra field
46+n+m	k	File comment

The relative file offset can then be used to jump to the file in question. The file is then scanned as described in the previous section.
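A rough Python sketch of this lookup, following the table above. It assumes the archive is held in memory as bytes and that names are UTF-8. It walks backwards over central directory headers as described here, rather than first reading the end-of-central-directory record as a full ZIP library would. read_via_central_directory is a hypothetical name.

```python
import io
import struct
import zipfile

CENTRAL_SIG = b"PK\x01\x02"                    # 0x02014b50, little-endian
CENTRAL_HEADER = struct.Struct("<4s6H3I5H2I")  # the 46 fixed bytes from the table
LOCAL_LENGTHS = struct.Struct("<HH")           # name/extra lengths at local offset 26

def read_via_central_directory(data, wanted):
    """Find `wanted` through the central directory and return its stored data."""
    target = wanted.encode("utf-8")
    pos = data.rfind(CENTRAL_SIG)
    while pos != -1:
        if pos + CENTRAL_HEADER.size <= len(data):
            (_sig, _made_by, _needed, _flags, method, _time, _date, _crc,
             _comp_size, size, name_len, _extra_len, _comment_len, _disk,
             _internal, _external, local_offset) = CENTRAL_HEADER.unpack_from(data, pos)
            if data[pos + 46:pos + 46 + name_len] == target:
                if method != 0:  # 0 means "stored"
                    raise ValueError("member is compressed: " + wanted)
                # The local header carries its own name and extra field lengths.
                local_name, local_extra = LOCAL_LENGTHS.unpack_from(data, local_offset + 26)
                start = local_offset + 30 + local_name + local_extra
                return data[start:start + size]
        # Continue with the previous header, searching strictly before `pos`.
        pos = data.rfind(CENTRAL_SIG, 0, pos)
    raise KeyError(wanted)

# Demo on a small uncompressed archive built in memory.
package = io.BytesIO()
with zipfile.ZipFile(package, "w", compression=zipfile.ZIP_STORED) as archive:
    archive.writestr("PXML.xml", "<PXML/>")
    archive.writestr("preview.png", b"\x89PNG placeholder")

print(read_via_central_directory(package.getvalue(), "PXML.xml"))
```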

see other Proposals
