
Metadata in PDF: 2. Use Cases

This is likely to be a fairly brief post, as I'm not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven't done much in this direction yet, although we are now beginning to look into it.

I'll discuss a couple of cases found in the wild, but invite comment as to others' practices. Let me start, though, with the CNRI handle plugin demo for Acrobat, which I blogged here.

Handle Plugin

First off, the handle plugin PDF samples do include an embedded (test) DOI in both the document information dictionary

	5 0 obj
	<<
	/CreationDate (D:20070614140125-04'00')
	/Author (Simon)
	/Creator (PScript5.dll Version 5.2.2)
	/Producer (Acrobat Distiller 8.1.0 \(Windows\))
	/ModDate (D:20070614140240-04'00')
	/HDL (10.5555/pdftest-crossref)
	/Title (Microsoft Word - crossref-rev.doc)
	>>
	endobj
and in the (document) metadata stream
	<rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
	    <pdfx:HDL>10.5555/pdftest-crossref</pdfx:HDL>
	</rdf:Description>
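
As a rough illustration of how a third-party app might read the identifier from both places, here is a minimal Python sketch. It works on the text of the two samples above rather than on a real PDF file, and the helper names (`info_entries`, `xmp_property`) are made up for this example. The regular expression only copes with simple literal strings and is no substitute for a proper PDF parser.

```python
import re
import xml.etree.ElementTree as ET

# Sample data copied (abridged) from the handle plugin demo above.
INFO_DICT = r"""
<<
/Author (Simon)
/Producer (Acrobat Distiller 8.1.0 \(Windows\))
/HDL (10.5555/pdftest-crossref)
/Title (Microsoft Word - crossref-rev.doc)
>>
"""

# The rdf:RDF wrapper is added here so that the fragment parses on its own.
XMP_FRAGMENT = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/">
    <pdfx:HDL>10.5555/pdftest-crossref</pdfx:HDL>
  </rdf:Description>
</rdf:RDF>
"""

# Matches "/Key (value)" where the value is a PDF literal string. Backslash
# escapes such as \( and \) are accepted; unescaped nested parentheses are not.
ENTRY = re.compile(r"/(\w+)\s*\(((?:\\.|[^\\()])*)\)")


def info_entries(text):
    """Return the literal-string entries of an information dictionary.

    Simplification: a backslash escape is reduced to the escaped character,
    which is right for \\( and \\) but not for escapes such as \\n.
    """
    return {key: re.sub(r"\\(.)", r"\1", value)
            for key, value in ENTRY.findall(text)}


def xmp_property(xml_text, namespace, name):
    """Return the text of the first {namespace}name element, or None."""
    root = ET.fromstring(xml_text)
    element = root.find(f".//{{{namespace}}}{name}")
    return element.text if element is not None else None


info = info_entries(INFO_DICT)
hdl_from_xmp = xmp_property(XMP_FRAGMENT, "http://ns.adobe.com/pdfx/1.3/", "HDL")
print(info["HDL"] == hdl_from_xmp)
```

The point of the comparison at the end is simply that the two copies of the identifier should agree.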

Barring any fuller disclosure of metadata terms at large (and one of the demo cases makes use of the DOI to retrieve metadata from CrossRef), this is excellent. I would, however, quibble with the use of "HDL" as a foreign key for the information dictionary. I realize this is just a test, but the term "HDL" (or "DOI", for that's what it really is) is somewhat specific, and a more general term such as "Identifier" would probably have more mileage, e.g.

	5 0 obj
	<<
	...
	/Identifier (doi:10.5555/pdftest-crossref)
	...
	>>
	endobj
In the second example, from the metadata stream, I don't think the term "HDL" from the PDF extension schema "pdfx" is very helpful. (Is that namespace actually defined anywhere?) From a descriptive metadata viewpoint, a more usual schema such as DC would have wider coverage. So again the second example would be better rendered as
	<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
	    <dc:identifier>doi:10.5555/pdftest-crossref</dc:identifier>
	</rdf:Description>

or, alternately,

	<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/">
	    <dc:identifier>info:hdl/10.5555/pdftest-crossref</dc:identifier>
	</rdf:Description>
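
The two renderings above differ only in the prefix put in front of the bare handle, so producing either one is mostly string handling. The sketch below shows one way it might be done. The function names are invented for illustration, and it assumes the identifier arrives either bare or already carrying one of the two prefixes used in this post.

```python
from xml.sax.saxutils import escape

DC_NS = "http://purl.org/dc/elements/1.1/"


def bare_handle(identifier):
    """Strip a leading 'doi:' or 'info:hdl/' prefix, if there is one."""
    for prefix in ("doi:", "info:hdl/"):
        if identifier.lower().startswith(prefix):
            return identifier[len(prefix):]
    return identifier


def as_doi(identifier):
    """Render the identifier in the 'doi:' form."""
    return "doi:" + bare_handle(identifier)


def as_info_hdl(identifier):
    """Render the identifier in the 'info:hdl/' form."""
    return "info:hdl/" + bare_handle(identifier)


def dc_identifier_description(identifier):
    """Build an rdf:Description carrying a single dc:identifier."""
    return (
        f'<rdf:Description rdf:about="" xmlns:dc="{DC_NS}">\n'
        f"    <dc:identifier>{escape(identifier)}</dc:identifier>\n"
        "</rdf:Description>"
    )


print(dc_identifier_description(as_doi("10.5555/pdftest-crossref")))
print(dc_identifier_description(as_info_hdl("doi:10.5555/pdftest-crossref")))
```

The `escape` call is there because a DOI suffix may contain characters such as `<` or `&` that are not safe in XML text.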

Elsevier

Well, we have Alexander Griekspoor's comment earlier that Elsevier are including the DOI in their PDFs. I don't know how consistently this is being done, but I've checked a couple of sample articles and it would seem that they have embedded the DOI (here from Cancer Cell, doi:10.1016/j.ccr.2007.06.004) in the title element, which shows up in the information dictionary as

	361 0 obj
	<<
	/Producer (Adobe LiveCycle PDFG 7.2)
	/Creator (Elsevier)
	/Author ()
	/Keywords ()
	/Title (doi:10.1016/j.ccr.2007.06.004)
	/ModDate (D:20070630031637+05'30')
	/Subject ()
	/CreationDate (D:00000101000000Z)
	>>
	endobj

and in the (document) metadata dictionary as

	365 0 obj
	<<
	/Type /Metadata
	/Subtype /XML
	/Length 1526 
	>>
	stream
	<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d' bytes='1526'?>
         
	<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
	    xmlns:iX='http://ns.adobe.com/iX/1.0/'>
          
	 <rdf:Description about=''
  	     xmlns='http://ns.adobe.com/pdf/1.3/'
 	     xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
 	   <pdf:Producer>Adobe LiveCycle PDFG 7.2</pdf:Producer>
 	   <pdf:ModDate>2007-06-30T03:16:37+05:30</pdf:ModDate>
	   <pdf:Title>doi:10.1016/j.ccr.2007.06.004</pdf:Title>
	   <pdf:Creator>Elsevier</pdf:Creator>
 	   <pdf:Author></pdf:Author>
 	   <pdf:Keywords></pdf:Keywords>
 	   <pdf:Subject></pdf:Subject>
 	   <pdf:CreationDate>0-01-01T00:00:00Z</pdf:CreationDate>
	</rdf:Description>
         
	<rdf:Description about=''
 	    xmlns='http://ns.adobe.com/xap/1.0/'
 	    xmlns:xap='http://ns.adobe.com/xap/1.0/'>
 	  <xap:CreatorTool>Elsevier</xap:CreatorTool>
 	  <xap:ModifyDate>2007-06-30T03:16:37+05:30</xap:ModifyDate>
 	  <xap:Title>
  	    <rdf:Alt>
 	      <rdf:li xml:lang='x-default'>doi:10.1016/j.ccr.2007.06.004</rdf:li>
 	    </rdf:Alt>
 	  </xap:Title>
 	  <xap:Author></xap:Author>
 	  <xap:Description>
 	    <rdf:Alt>
 	      <rdf:li xml:lang='x-default'/>
 	    </rdf:Alt>
 	  </xap:Description>
 	  <xap:CreateDate>0-01-01T00:00:00Z</xap:CreateDate>
 	  <xap:MetadataDate>2007-06-30T03:16:37+05:30</xap:MetadataDate>
 	</rdf:Description>
         
	<rdf:Description about=''
 	    xmlns='http://purl.org/dc/elements/1.1/'
 	    xmlns:dc='http://purl.org/dc/elements/1.1/'>
 	  <dc:title>doi:10.1016/j.ccr.2007.06.004</dc:title>
 	  <dc:creator/>
 	  <dc:description/>
	</rdf:Description>
         
	</rdf:RDF>
	<?xpacket end='r'?>
	endstream
	endobj

Kudos anyway to Elsevier for embedding this piece of information in their PDFs (if indeed it is a general practice). This has the merit of being picked up by Adobe apps and displayed in e.g. Reader. Third-party apps can also pull it and use it to retrieve the metadata record from CrossRef.
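
A third-party app might do something along the following lines: pull a DOI out of the "Title" value and turn it into a lookup address. This is only a sketch. The DOI pattern is deliberately loose, the function names are hypothetical, and the use of the CrossRef REST endpoint `https://api.crossref.org/works/` is an assumption of this example, not something the PDFs themselves prescribe. The code builds the URL but does not fetch it.

```python
import re
from urllib.parse import quote

# Loose pattern for a DOI: "10.", a numeric registrant code, "/", then a
# suffix that runs up to the next whitespace.
DOI_PATTERN = re.compile(r"\b(10\.\d{4,9}/\S+)")

# Assumed lookup endpoint (CrossRef REST API); swap in whatever service you use.
CROSSREF_WORKS = "https://api.crossref.org/works/"


def doi_from_title(title):
    """Return the first DOI-like string found in a Title value, or None."""
    match = DOI_PATTERN.search(title)
    return match.group(1) if match else None


def crossref_lookup_url(doi):
    """Build a lookup URL for the DOI, percent-encoding unsafe characters."""
    return CROSSREF_WORKS + quote(doi, safe="/")


doi = doi_from_title("doi:10.1016/j.ccr.2007.06.004")
print(doi)
print(crossref_lookup_url(doi))
```

A title that holds a real title, such as the handle plugin sample's "Microsoft Word - crossref-rev.doc", simply yields `None`.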

The only downside is that technically this seems to be a kludge to satisfy Adobe apps and is not the correct field for filing this information. I would have thought that some other information dictionary field (e.g. "Subject") would be a better kludge, and then reserve the "Title" and "Author" fields for their proper purposes. The RDF/XML title fields would appear to be inherited from the "Title" field in the information dictionary. It's a bit of a shame really because the DOI is embedded - it's just in the wrong place(s). (OK, so that's still way better, maybe, than not providing this information at all.)
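
The observation that the RDF/XML title fields mirror the information dictionary "Title" is easy to check mechanically. The sketch below does so against an abridged copy of the Elsevier packet shown earlier. The function name `title_values` is invented for this example, and the information dictionary title is typed in by hand, not read from a file.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "pdf": "http://ns.adobe.com/pdf/1.3/",
    "xap": "http://ns.adobe.com/xap/1.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Abridged from the Elsevier metadata stream: only the title properties are kept.
XMP = """<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'>
 <rdf:Description about='' xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
   <pdf:Title>doi:10.1016/j.ccr.2007.06.004</pdf:Title>
 </rdf:Description>
 <rdf:Description about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'>
   <xap:Title>
     <rdf:Alt>
       <rdf:li xml:lang='x-default'>doi:10.1016/j.ccr.2007.06.004</rdf:li>
     </rdf:Alt>
   </xap:Title>
 </rdf:Description>
 <rdf:Description about='' xmlns:dc='http://purl.org/dc/elements/1.1/'>
   <dc:title>doi:10.1016/j.ccr.2007.06.004</dc:title>
 </rdf:Description>
</rdf:RDF>"""


def title_values(xmp_text):
    """Collect the three title properties found in the packet."""
    root = ET.fromstring(xmp_text)
    paths = {
        "pdf:Title": ".//pdf:Title",
        "xap:Title": ".//xap:Title/rdf:Alt/rdf:li",
        "dc:title": ".//dc:title",
    }
    values = {}
    for label, path in paths.items():
        element = root.find(path, NS)
        values[label] = (element.text or "").strip() if element is not None else None
    return values


# "Title" value from the information dictionary sample (object 361).
info_title = "doi:10.1016/j.ccr.2007.06.004"

titles = title_values(XMP)
mirrors = all(value == info_title for value in titles.values())
print(titles)
print(mirrors)
```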

Hopefully, with more examples to mull over and experiences to learn from we can arrive at a much better and more systematic way of including the DOI, and other key metadata fields, within a PDF so that this information can be gleaned easily and unambiguously by third party apps.

Comments

Actually, Elsevier is probably moving away from embedding the DOI in the title field of the PDF.

We're working on a spec for including fuller metadata as XMP. I'll draw this entry to the attention of those working on the spec and see if they can share fuller details.

I do believe that Elsevier does this in all their content.

As I commented on your previous post, I would say neither the Title nor the Subject field really fits the DOI. In my view Elsevier did the right thing when they decided to provide ONLY the DOI and no other metadata. In that case I too would put the DOI in the title. The question is what to do the moment you decide to add the DOI AND other metadata.

Chris, could I make a plea to keep the DOI in the normal metadata next to richer XMP metadata? The tools to get at the XMP metadata are rather immature at the moment, which makes at least the DOI much less accessible.

Alex, I would say that from Nature's perspective we would very much favour delivering metadata both through the document information dictionary ("normal metadata") as well as within an XMP packet as I alluded to in my first post in this series, although perhaps did not make explicit.

Also, given that we aim to deliver more than one piece of metadata within the document information dictionary (e.g. title and authors), I would be very much inclined to create an "/Identifier" foreign key for the DOI and to consider including the DOI in one of the other recognized keys, i.e. "/Subject" or "/Keywords". In fact, I would be inclined to include a full reference to the document (traditional bib citation plus DOI), probably within the "/Subject" field. I'd like to hear comments on this strategy. Like you, I believe that the DOI should be available at the lowest entry point, that being the document information dictionary.

Am hoping that Elsevier will be able to contribute to this discussion. They have certainly taken the initiative in this area. But would also like to hear from other publishers re their current practice and future plans.

Hi, I recently created a test for a class presentation using Word Perfect and made it into a PDF file. The test is a multiple choice exam. Is there any way for the person viewing the PDF to see the correct answers that are embedded in my test, such as by printing it?
