Add support for DICOM (aka Supp 145) #157

Mon Oct 3 08:18:14 EDT 2016

Hi Benjamin!

On Mon, Oct 3, 2016 at 9:24 AM, Benjamin Gilbert via openslide-users
<openslide-users at lists.andrew.cmu.edu> wrote:
> Hi Mathieu,
>
> Thanks for the detailed response!
>
> On Wed, Sep 14, 2016 at 02:32:51PM +0200, Mathieu Malaterre via openslide-users wrote:
>> 1. OpenSlide needs to handle a very particular subset of the DICOM
>> Transfer Syntax(s). Because of some low level (boring) details, some
>> complex parsing issues are totally avoided in that subset. What this
>> means is that the (limited) parser can be much smaller in code size
>> compared to a full implementation.
>
> Sure, but there are other cases where OpenSlide uses small parts of large
> libraries.  The wasted address space doesn't bother me if it reduces the
> amount of code we have to maintain.

Right!

>> 2. OpenSlide needs a particular DICOM parsing behavior (typically SAX
>> or StaX in the XML world), with an optimization toward reading images
>> out of DICOM file.
>
> I'm not thrilled with the streaming model: it may be more efficient, but at
> the cost of some indirection and lack of clarity.  We clearly want to defer
> reading the image data, but is the metadata large enough that reading it
> into memory would really be costly?  We'll probably need most of it anyway
> when generating OpenSlide properties.

Hmm, indeed, this is actually a very good point. I was trying to be
smart with my fseek calls (NFS scenario), but loading the whole file
may just work.
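
To make the trade-off concrete, here is a minimal sketch of "read the
metadata into memory, defer the image data". It is only an illustration, not
the proof-of-concept code: it assumes Explicit VR Little Endian, a stream
already positioned past the 128-byte preamble and "DICM" magic, and no
undefined-length elements before Pixel Data. The names read_metadata() and
make_element() are hypothetical.

```python
import io
import struct

# VRs that use 2 reserved bytes plus a 4-byte length in Explicit VR Little Endian.
LONG_VRS = {b"OB", b"OD", b"OF", b"OL", b"OV", b"OW", b"SQ",
            b"SV", b"UC", b"UN", b"UR", b"UT", b"UV"}
PIXEL_DATA = (0x7FE0, 0x0010)
UNDEFINED_LENGTH = 0xFFFFFFFF


def read_metadata(stream):
    """Read elements into a dict until Pixel Data is reached.

    Returns (elements, pixel_data_offset). The pixel data itself is never
    read; only the offset of its element header is recorded.
    """
    elements = {}
    while True:
        start = stream.tell()
        header = stream.read(8)
        if len(header) < 8:
            return elements, None  # end of stream, no Pixel Data found
        group, elem, vr, short_len = struct.unpack("<HH2sH", header)
        if vr in LONG_VRS:
            # The 2 bytes after the VR are reserved; the real length follows.
            (length,) = struct.unpack("<I", stream.read(4))
        else:
            length = short_len
        if (group, elem) == PIXEL_DATA:
            return elements, start
        if length == UNDEFINED_LENGTH:
            raise ValueError("undefined-length elements are outside this sketch")
        elements[(group, elem)] = (vr, stream.read(length))


def make_element(group, elem, vr, value):
    """Encode one element (hypothetical helper used to build test input)."""
    if vr in LONG_VRS:
        return struct.pack("<HH2sHI", group, elem, vr, 0, len(value)) + value
    return struct.pack("<HH2sH", group, elem, vr, len(value)) + value


# Synthetic input: Rows, Columns, then a dummy Pixel Data element.
sample = (make_element(0x0028, 0x0010, b"US", struct.pack("<H", 512))
          + make_element(0x0028, 0x0011, b"US", struct.pack("<H", 512))
          + make_element(0x7FE0, 0x0010, b"OB", b"\x00" * 16))
metadata, pixel_offset = read_metadata(io.BytesIO(sample))
```

With this approach the only seek needed later is to pixel_offset, when a
tile is actually requested.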

I did some quick tests.

Case 1: A single file is the concatenation of multiple JPEG streams.

This is the case for
ftp://medical.nema.org/MEDICAL/Dicom/DataSets/WG26/Hamamatsu/Human_15x15_20x.dcm.
In this case the DICOM header is ~696K (file is 418 MB).

Case 2: A single file contains a single JPEG stream.

I did not have any dataset, so I used GDCM to split the above dataset
into individual files. In this case the header is 104K (x 4824 files).
This is nasty mostly because the ICC profile is repeated in every
single file (that may explain why no vendor chose to implement this
option).

So even in Case 2, this represents ~512 MB in memory. Is that in line
with other slide formats?
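
A rough sanity check of the figures above, as a sketch only. It assumes
"K" means KiB and "MB" means MiB, which the measurements do not state, and
the variable names are made up for illustration.

```python
KIB = 1024
MIB = 1024 * KIB

# Case 1: one file, one header of ~696K in a 418 MB file.
case1_header = 696 * KIB
case1_file = 418 * MIB
case1_fraction = case1_header / case1_file  # share of the file that is header

# Case 2: 4824 files, each with a 104K header.
case2_header = 104 * KIB
case2_files = 4824
case2_total_mib = case2_header * case2_files / MIB

print(f"Case 1: header is {case1_fraction:.2%} of the file")
print(f"Case 2: all headers together are about {case2_total_mib:.0f} MiB")
```

Under these assumptions Case 2 comes out at roughly 490 MiB, the same order
of magnitude as the ~512 estimate, while the Case 1 header is well under 1%
of the file.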

>> I know of two relatively good generic C++ toolkit: GDCM & DCMTK.  As
>> upstream author of GDCM, I am in a position to say that GDCM also does not
>> make a good fit here.  That leaves us with DCMTK.  What I do know is that
>> the code is very complex in part because of the code legacy and because
>> DCMTK is a generic DICOM toolkit.  So IMHO DCMTK is also not a good fit
>> here, esp because of point (2), which is something very special in the
>> DICOM world.
>
> What are the problems specifically?  Performance, reliability, features,
> ability to work around bugs in DICOM files?

As explained above: Point (2). All DICOM toolkits read the entire
file, using their own memory buffer... Supp 145 is very different
from other classes.

>> I do know of vtk-dicom, but this library does pull in an insane amount of
>> dependencies, which I believe is not a good thing for OpenSlide
>
> Yeah, we should try to avoid that.

OK.

>> I could also build some kind of abstract level on top of this library
>> and only use that abstract level within the core openslide
>> implementation (eg. parse_header_dicom(), read_tile_dicom...). This
>> would make transition to another DICOM library trivial (tm) in the
>> future.
>
> Not worth it, I think.

OK.

>> even if OpenSlide and FFmpeg do not share a common DICOM library, they
>> would share a common code base.
>
> If we do ship our own parser, I'd prefer that it completely conforms with
> OpenSlide's coding conventions, rather than trying to stay synchronized with
> ffmpeg.  Copy-pasted code tends to diverge anyway, and we'd still need to be
> able to maintain it.

OK, makes sense.

>> I did not describe the issue with DICOMDIR here, since I failed to
>> understand what you meant.
>
> I was thinking of the requirement for the user to generate a DICOMDIR if one
> doesn't exist, but I now understand that issue better and it's not relevant
> here.

Keep in mind my early implementation was done rather quickly (proof of
concept). I assumed a DICOMDIR would be available, but if you tell me
how to handle the other case in the OpenSlide framework, I can adapt
the code.

So the only question that remains is the loading of the complete
dataset in memory (as discussed above).

Cheers,
-- 
Mathieu