<html>
<head>
<meta content="text/html; charset=windows-1252"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">Hello Benjamin, hello OpenSlide
community,<br>
<br>
A lot of changes to the VMIC driver have been made.<br>
<br>
The code is in the repository
<a class="moz-txt-link-freetext" href="https://github.com/Markus-PP/openslide-vmic">https://github.com/Markus-PP/openslide-vmic</a> <br>
<br>
<u>Lots of technical details follow.</u><br>
<br>
<blockquote type="cite">
<pre wrap="">If there are Deep Zoom features which are not used by
VMIC, you can just check for unsupported parameters and fail if present.
(Would that allow you to switch to the simple grid?)
</pre>
</blockquote>
<blockquote type="cite">
<pre wrap="">The grid itself doesn't care about truncated tiles on the right/bottom
edges. The read function would need to check for those edges and adjust the
size of the tile it expects to read.</pre>
</blockquote>
<br>
I'm convinced and have moved to simple grid. Now, the read_tile
function has it's own way to know the tile size. The compelling
reason for this is memory foot print on large slides - a million
heap allocations of 64 bit chunks which the tile grid does on
large slides would create a system load nobody wants.<br>
</div>
<blockquote cite="mid:20161002002809.GF3744@hyliota.hl.backtick.net"
type="cite">
<pre wrap="">You said earlier that there are no overlapping VMICs; is that still true?
</pre>
</blockquote>
Yes it's still true the VMIC deepzoom images don't overlap. One can
never be 100% sure, but the precipoint team stated they are not
planning to alter any file internals without a *very* important
reason. So, yes, the simple grid will do.<br>
<br>
The following issues are not entirely solved:<br>
<blockquote cite="mid:20161002002809.GF3744@hyliota.hl.backtick.net"
type="cite"><br>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">The existing drivers re-open their data files on every
openslide_read_region(). That avoids this kind of thread-safety issue
(provided that libzip doesn't have any global state) and also prevents
an idle openslide_t from consuming file handles. If creating a ZIP
handle is expensive, you can use a handle cache + your own zip_source,
similar to the approach used by _openslide_tiffcache.
</pre>
</blockquote>
<pre wrap="">
The nefarious global GMutex has been removed, instead, the zip_t* handle
got wrapped by the _openslide_ziphandle structure, so we can have one
mutex for each instance of a zip archive.
</pre>
</blockquote>
<pre wrap="">
That's still inconsistent with OpenSlide's typical approach. We generally
try to avoid shared mutable state, and if we're going to add more, there
should be a good reason.
</pre>
<blockquote type="cite">
<pre wrap="">Also, there are now wrapper functions for zip_open, zip_open_from_source,
zip_close, zip_fopen and zip_fclose. This is simpler than a handle cache
and apparently enough to allow multithreading.
</pre>
</blockquote>
<pre wrap="">
Yes, but not enough to allow parallel I/O. If multiple threads were using
the same openslide_t, only one would be able to do I/O at a time.
</pre>
</blockquote>
Parallel I/O appears kind of hard to implement under the given
circumstances. At this iteration, a handle queue similar to the one
in the tiffcache is implemented. BUT......<br>
<blockquote cite="mid:20161002002809.GF3744@hyliota.hl.backtick.net"
type="cite">
<blockquote type="cite">
<pre wrap="">Reopening the zip for every image access would of course be out of
question.
</pre>
</blockquote>
<pre wrap="">
Why?</pre>
</blockquote>
<br>
With the given versions of libzip, as soon as zip_open is called,
the *whole* zip directory is always read into memory. For any slides
larger than tiny, the overhead would be immense. For slides with the
size of 12GB and 600k zip entries, it already can take a 2 seconds
on a 10 year old computer, possibly longer if opened from a NAS. And
reading the zip directory also allocates tens of megabytes on the
heap. The problem arises only really for large slides, smaller
slides up to 2GB size open very fast.<br>
<br>
On the other hand, as stated, we want to avoid to put any of the
threads in a wait state as much as possible.<br>
<br>
The compromise solution is in the functions
vmic_handlecache_create(), vmic_handle_get() and vmic_handle_put()<br>
<br>
* Calculate max number M from <font size="-1">[constant] /
[expected resident memory allocation per archive instance]</font><br>
* Open up to M instances of the zip in parallel, as required by
multithreading request.<br>
* When more than M of these are in use simultaneously, and a new
thread requests a tile, wait for one of the instances to be
released. Each thread is locked at maximum for the duration of the
zip_fopen/zip_fread/zip_fclose sequence, so it's basically only the
file read and zip decoding which is serialized. The JPEG decoding
and cairo rendering still works in parallel threads.<br>
<br>
The performance of zip_open for large slides > 10GB is not
entirely what we would wish for. Alternative solutions may include
improvements of the libzip code itself, or move to a different
library, if such exists in the standard distribution packages.<br>
<blockquote cite="mid:20161002002809.GF3744@hyliota.hl.backtick.net"
type="cite">
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">What are VMIC's exact rules on the name of the inner archive? All three
sample files use a filename of exactly "Image.vmici". If the name check
could be restricted further (in particular, if the .vmici extension is
always present), it might be reasonable to drop the check for the inner ZIP
magic number.
</pre>
</blockquote>
<pre wrap="">Basically, the inner archive is supposed to be named either
"Image.vmici", or, for older slides up to 2015, "Image".
Since we do want to support older slides, and the name "Image" is very
generic, I believe, dropping the check for the magic number is no good
idea, and I see no problem because the check is inexpensive.
</pre>
</blockquote>
<pre wrap="">
ZIP doesn't have a magic number in a fixed location. You're checking for
the first local file header, but there can be arbitrary amounts of data
prepended to a ZIP, so the header may not be at offset 0. If we wanted to
be rigorous, we'd have to search the last 64 KB of the member for an End of
Central Directory record. It's probably best to keep the check as is, but
it does impose an additional constraint on the format of the slide file.
</pre>
</blockquote>
The checking code has been made a bit shorter, the name now must be
exactly "Image" or "Image.vmici" - anything else would be rejected.<br>
<br>
As for the zip magic number, I have not found any archive with none
as the first bytes of an archive yet.<br>
Scanning ancient backup drives for random zip archives with <font
size="-1"><i>find -name "*.zip" -exec hexdump</i></font> keeps
giving zip magic numbers "PK34", or "PK00PK" for the rare multiple
volume archives.<br>
<br>
For more constraints, a check for the .vmic file name suffix may be
possible but is not being done in the current implementation.<br>
This would probably speed things up when the library is used to scan
a tree of files which are zips but not vmics. - Can such a situation
occur ? And does it even happen that slide files suffixes are
altered in any other than test environments ?<br>
<blockquote type="cite">
<pre wrap="">In C, a cast is a warning sign that something unusual is going on. In
general, don't explicitly cast pointers unless the compiler complains.
</pre>
</blockquote>
Those casts between pointers have been introduced to prevent
compiler warnings, such as signed/unsigned mismatch in an equality
comparision, or ensuring correct word lengths in printf-like
statements.<br>
<blockquote type="cite">
<blockquote type="cite">
<pre wrap="">[...] make a format such as VMIC optional, e.g. for reasons to avoid the libzip dependency [...]
</pre>
</blockquote>
<pre wrap="">I've tried to avoid making formats optional, since it complicates testing
and support. However, in this case it's unavoidable. Debian stable, Ubuntu
Xenial (the current LTS release), and RHEL/CentOS 7 all ship versions of
libzip older than 1.1, and people often want to compile new releases of
OpenSlide on older Linux distributions.
In existing cases when an optional format is disabled, detection still
succeeds, but opening the file returns an error. That's not possible here,
or at least it's not easy. We'll probably want to implement the optionality
by setting an AC_DEFINE and an AM_CONDITIONAL in configure.ac based on the
pkg-config detection results, conditionally compiling decode-zip and
format-vmic (using the AM_CONDITIONAL), and wrapping the formats array
member in openslide.c in an #ifdef (using the AC_DEFINE). If you're not
comfortable doing this yourself, you can implement the driver as a mandatory
dependency, and I'll fix up the build system myself.
(GitHub recently added a feature allowing me to push commits into OpenSlide
pull requests opened by other people, so I may sometimes do that instead of
asking you to fix a complex problem yourself.)
</pre>
<br>
</blockquote>
I would happily like to take this offer, I'm entirely not acquainted
with automake syntax.<br>
<br>
We do need a libzip version >= 1.1 because<br>
<br>
* libzip 0.11.2 does not even compile, because aliases such as zip_t
are undefined and required features are missing.<br>
* libzip 1.0.1 compiles ok, but zip_open_from_source returns a zip
error 28 "Operation not supported". It must be avoided.<br>
* libzip 1.1, 1.1.2 and 1.1.3 and 1.2 are fully functional. <br>
<br>
<blockquote type="cite">
<pre wrap="">[...] code formatting [...]</pre>
</blockquote>
<br>
I've been going through the code style checklist several times - the
restriction to 80 chars per line can be a bitch, to not say more. <br>
<br>
Despite the above mentioned performance issues, on which I will try
to improve whereever possible, this driver is already running in at
least two different productive environments. Hopefully the given
fixes are ok to consider. Once asked to do so, I'll join the git
commits.<br>
<br>
Markus Pöpping<br>
<br>
</body>
</html>