Question about NDPI

Mon May 13 23:30:06 EDT 2024

I have another question regarding Hamamatsu NDPI's Tiff structure:

OpenSlide way (as found in openslide-decode-tifflike.c):
After carefully reading your code and the documentation, I understand that Hamamatsu chose the standard Tiff (non-BigTiff) layout but deviates from it in one place: after the first 2 bytes of the file (bytes #0 and #1, endianness) and the version field in bytes #2 and #3 (0x2A for standard Tiff, 0x2B for BigTiff), they write an 8-byte offset to their first IFD/Tiff directory, whereas standard Tiff allows only a 4-byte offset there.

In OpenSlide, you guess whether a file is NDPI by trying to read the first IFD at that 8-byte offset and checking whether it contains the tag 65420, which is "pathognomonic" for NDPI. Otherwise, you fall back to treating the file as standard Tiff.
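
The check described above can be sketched roughly as follows. This is a minimal Python illustration, not OpenSlide's actual C code: the function name `looks_like_ndpi` is hypothetical, the marker tag is passed in as a parameter rather than hardcoded, and the whole file (or at least the first IFD) is assumed to be in memory.

```python
import struct

def looks_like_ndpi(data: bytes, marker_tag: int) -> bool:
    """Return True if the first IFD, located via an 8-byte offset, holds marker_tag."""
    if len(data) < 12:
        return False
    # Bytes 0-1: byte order ("II" little-endian, "MM" big-endian).
    if data[:2] == b"II":
        endian = "<"
    elif data[:2] == b"MM":
        endian = ">"
    else:
        return False
    # Bytes 2-3: version, 42 (0x2A) for standard TIFF.
    (version,) = struct.unpack_from(endian + "H", data, 2)
    if version != 42:
        return False
    # NDPI-style: read the first IFD offset as 8 bytes instead of 4.
    (ifd_offset,) = struct.unpack_from(endian + "Q", data, 4)
    if ifd_offset + 2 > len(data):
        return False
    (count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    # Each IFD entry is 12 bytes; the tag ID is its first 2 bytes.
    for i in range(count):
        entry = ifd_offset + 2 + 12 * i
        if entry + 12 > len(data):
            return False
        (tag,) = struct.unpack_from(endian + "H", data, entry)
        if tag == marker_tag:
            return True
    return False
```

If this returns False, a reader would fall back to interpreting bytes 4-7 as an ordinary 4-byte TIFF offset.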

My findings:
At Smart In Media, we have a huge slide collection; our earliest NDPI slide dates back to Oct 2007 and the latest to this year. The scanner models were S210, S360 and S20, plus one model I am not sure about (2007).
In all of them, I found the word "Hamamatsu" in a hex editor starting at byte #12 (in the oldest slide from 2007, it starts at byte #18). This would make it a lot easier to identify Hamamatsu NDPI files from the content of the file, irrespective of the extension, by just searching the first 30 bytes or so for the string "Hamamatsu".
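
As a rough Python sketch of this idea (the function names and the 30-byte window are illustrative assumptions taken from the observation above, not a verified detection rule):

```python
def header_mentions_hamamatsu(head: bytes, window: int = 30) -> bool:
    """Check whether the ASCII string "Hamamatsu" occurs in the first `window` bytes."""
    return b"Hamamatsu" in head[:window]

def file_mentions_hamamatsu(path: str, window: int = 30) -> bool:
    """Read only the first `window` bytes of the file and run the same check."""
    with open(path, "rb") as f:
        return header_mentions_hamamatsu(f.read(window), window)
```

The window has to cover the latest observed start position plus the 9 bytes of the string, so 30 bytes is enough for a start at byte #18.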

Can you confirm this or do you have other findings?

Thank you 

Martin

-----Original Message-----
From: Benjamin Gilbert <bgilbert+openslide at cs.cmu.edu> 
Sent: Saturday, May 11, 2024 22:07
To: openslide-users at lists.andrew.cmu.edu
Cc: Martin Weihrauch <m.weihrauch at smartinmedia.com>
Subject: Re: Question about NDPI

On Sat, May 11, 2024 at 2:15 PM Martin Weihrauch <m.weihrauch at smartinmedia.com> wrote:
> If I see it correctly, it (ab)uses the Tiff format and stores a single JPEG in each zoom level, (ab)using the JPEG rules, e. g. by exceeding the 65,000 pixel limit, etc.

Right.

> My question: how is it possible to quickly extract a tile from the large JPEG? Does it internally have multiple frames (the stripes) and if yes, how is it possible to locate the internal MCU/8x8 blocks without reading the entire stripe or does each stripe have to be "parsed" at least once completely? Is there something like a directory involved?

An NDPI tile is actually a sequence of MCUs between two JPEG restart markers.  Restart markers are a JPEG feature that isn't normally used much; they allow the decoder to recover from data corruption.  Restart markers can be searched for without decoding the image data, and they reset the state of the encoder/decoder when encountered, so it's possible to start decoding at any restart marker without knowledge of previous MCUs.  OpenSlide reads a tile by concatenating the JPEG header with the tile's MCUs, fixing up the trailer marker and the header width/height fields as necessary, and passing the result to the JPEG decoder.
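
The assembly step described above could look roughly like this in Python. It is a simplified sketch and not OpenSlide's implementation: the function names are hypothetical, the header is assumed to run from the SOI marker through the end of the SOS header, the frame is assumed to be baseline JPEG (SOF0), and `mcu_data` is assumed to be the raw bytes between two restart markers.

```python
import struct

def patch_sof_dimensions(header: bytes, width: int, height: int) -> bytes:
    """Return a copy of a JPEG header with the SOF0 height/width fields replaced."""
    if header[:2] != b"\xff\xd8":
        raise ValueError("missing SOI marker")
    out = bytearray(header)
    pos = 2
    # Walk the marker segments: FF xx, then a 2-byte big-endian segment length.
    while pos + 4 <= len(out):
        if out[pos] != 0xFF:
            raise ValueError("expected a marker")
        marker = out[pos + 1]
        (seg_len,) = struct.unpack_from(">H", out, pos + 2)
        if marker == 0xC0:
            if pos + 9 > len(out):
                raise ValueError("truncated SOF0 segment")
            # SOF0 layout: marker(2) length(2) precision(1) height(2) width(2) ...
            struct.pack_into(">HH", out, pos + 5, height, width)
            return bytes(out)
        pos += 2 + seg_len
    raise ValueError("no SOF0 segment found")

def build_tile_jpeg(header: bytes, mcu_data: bytes, width: int, height: int) -> bytes:
    """Concatenate the patched header, one run of MCUs, and an EOI marker."""
    return patch_sof_dimensions(header, width, height) + mcu_data + b"\xff\xd9"
```

The result is a small standalone JPEG whose declared size matches the single run of MCUs, which an ordinary JPEG decoder can then handle.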

Restart markers are placed after every N MCUs, and NDPI TIFF tag 65426 lists the byte offset (within the JPEG) of each MCU that immediately follows a restart marker.  OpenSlide can also scan for restart markers if that tag is missing.
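
The fallback scan can be illustrated with a short Python sketch. This is an assumption-laden illustration rather than OpenSlide's code: the function name is hypothetical, and `scan` is assumed to contain the entropy-coded data, with `start` pointing just past the SOS header. It relies on the JPEG rule that a literal 0xFF byte in entropy-coded data is followed by a stuffed 0x00, so the pairs 0xFF 0xD0 through 0xFF 0xD7 can only be restart markers.

```python
def restart_offsets(scan: bytes, start: int = 0) -> list:
    """Return the offset of the first MCU plus each offset right after a restart marker."""
    offsets = [start]
    i = start
    while True:
        i = scan.find(b"\xff", i)
        if i < 0 or i + 1 >= len(scan):
            break
        nxt = scan[i + 1]
        if 0xD0 <= nxt <= 0xD7:
            # RST0..RST7: the next MCU begins right after the 2-byte marker.
            offsets.append(i + 2)
            i += 2
        elif nxt == 0xD9:
            # EOI: end of the compressed image data.
            break
        else:
            # Stuffed 0xFF 0x00, or a fill byte; keep scanning from the next byte.
            i += 1
    return offsets
```

Each returned offset plays the same role as one entry of the offset list in tag 65426: a position at which decoding can begin without knowledge of earlier MCUs.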

Best,
--Benjamin Gilbert