Well, this is likely to be a fairly brief post as I’m not aware of many use cases of metadata in PDFs from scholarly publishers. Certainly, I can say for Nature that we haven’t done much in this direction yet, although we are now beginning to look into this.
I’ll discuss a couple cases found in the wild but invite comment as to others’ practices. Let me start though with the CNRI handle plugin demo for Acrobat which I blogged here.
First off, the handle plugin PDF samples do include an embedded (test) DOI in both the document information dictionary
5 0 obj << /CreationDate (D:20070614140125-04'00') /Author (Simon) /Creator (PScript5.dll Version 5.2.2) /Producer (Acrobat Distiller 8.1.0 \(Windows\)) /ModDate (D:20070614140240-04'00') /HDL (10.5555/pdftest-crossref) /Title (Microsoft Word - crossref-rev.doc) >> endobj
and in the (document) metadata stream
<rdf:Description rdf:about="" xmlns:pdfx="http://ns.adobe.com/pdfx/1.3/"> <pdfx:HDL>10.5555/pdftest-crossref</pdfx:HDL> </rdf:Description>
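As a rough illustration of how a third party app might read that value back out of the metadata stream, here is a minimal Python sketch using only the standard library. It assumes the XMP packet has already been extracted from the PDF as text; the wrapping rdf:RDF element and the function name `read_handle` are my own additions so that the quoted fragment parses on its own.

```python
import xml.etree.ElementTree as ET
from typing import Optional

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PDFX_NS = "http://ns.adobe.com/pdfx/1.3/"

# The fragment quoted above, wrapped in an rdf:RDF element (an assumption)
# so that the rdf: prefix is declared and the snippet is well-formed.
XMP_FRAGMENT = f"""
<rdf:RDF xmlns:rdf="{RDF_NS}">
  <rdf:Description rdf:about="" xmlns:pdfx="{PDFX_NS}">
    <pdfx:HDL>10.5555/pdftest-crossref</pdfx:HDL>
  </rdf:Description>
</rdf:RDF>
"""


def read_handle(xmp_xml: str) -> Optional[str]:
    """Return the text of the first pdfx:HDL element, or None if absent."""
    root = ET.fromstring(xmp_xml.strip())
    # ElementTree addresses namespaced elements as {namespace-uri}localname.
    node = root.find(f".//{{{PDFX_NS}}}HDL")
    if node is None or node.text is None:
        return None
    return node.text.strip()


print(read_handle(XMP_FRAGMENT))
```

Note that the lookup is keyed on the namespace URI rather than the “pdfx” prefix, so it would still work if a producer chose a different prefix for the same namespace.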
Bar any fuller disclosure of metadata terms at large (and one of the demo cases makes use of the DOI to retrieve metadata from CrossRef) this is excellent. I would, however, quibble with the use of “HDL” as a foreign key for the information dictionary. I realize this is just a test but the term “HDL” (or “DOI”, for that’s what it really is) is somewhat specific and a more general term such as “Identifier” would probably have more mileage, e.g.
5 0 obj << ... /Identifier (doi:10.5555/pdftest-crossref) ... >> endobj
In the second example, from the metadata stream, I don’t think the term “HDL” from the PDF extension schema “pdfx” is very helpful. (Is that namespace actually defined anywhere?) From a descriptive metadata viewpoint a more usual schema such as DC would have wider coverage. So again the second example would be better rendered in one of the following two ways, carrying the identifier either in “doi:” form or as an “info:hdl/” URI:
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/"> <dc:identifier>doi:10.5555/pdftest-crossref</dc:identifier> </rdf:Description>
<rdf:Description rdf:about="" xmlns:dc="http://purl.org/dc/elements/1.1/"> <dc:identifier>info:hdl/10.5555/pdftest-crossref</dc:identifier> </rdf:Description>
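Whichever form is chosen, a consuming app has to get back to the bare DOI. The following is a minimal Python sketch of that normalization, not a definitive implementation. The function name `bare_doi` is hypothetical, the “info:doi/” prefix is an extra form I have added alongside the two shown above, and the final check is only a crude sanity test rather than full DOI validation.

```python
from typing import Optional

# Prefixes to strip: the two forms shown above, plus "info:doi/" (my addition).
KNOWN_PREFIXES = ("doi:", "info:doi/", "info:hdl/")


def bare_doi(identifier: str) -> Optional[str]:
    """Strip a known prefix and return the bare DOI, or None if not DOI-like."""
    value = identifier.strip()
    for prefix in KNOWN_PREFIXES:
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    # Crude sanity check: DOI names begin with "10." and contain a slash
    # separating the registrant prefix from the suffix.
    if value.startswith("10.") and "/" in value:
        return value
    return None


for candidate in ("doi:10.5555/pdftest-crossref",
                  "info:hdl/10.5555/pdftest-crossref"):
    print(bare_doi(candidate))
```

Both forms reduce to the same string, which is really the point: the choice of prefix matters less than the choice of a well-known field to put it in.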
Well, we have Alexander Griekspoor’s comment earlier that Elsevier are including the DOI in their PDFs. I don’t know how consistently this is being done but I’ve checked a couple of sample articles and it would seem that they have embedded the DOI (here from Cancer Cell, doi:10.1016/j.ccr.2007.06.004) in the title element which shows up in the information dictionary as
361 0 obj << /Producer (Adobe LiveCycle PDFG 7.2) /Creator (Elsevier) /Author () /Keywords () /Title (doi:10.1016/j.ccr.2007.06.004) /ModDate (D:20070630031637+05'30') /Subject () /CreationDate (D:00000101000000Z) >> endobj
and in the (document) metadata stream as
365 0 obj << /Type /Metadata /Subtype /XML /Length 1526 >> stream
<?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d' bytes='1526'?>
<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#' xmlns:iX='http://ns.adobe.com/iX/1.0/'>
  <rdf:Description about='' xmlns='http://ns.adobe.com/pdf/1.3/' xmlns:pdf='http://ns.adobe.com/pdf/1.3/'>
    <pdf:Producer>Adobe LiveCycle PDFG 7.2</pdf:Producer>
    <pdf:ModDate>2007-06-30T03:16:37+05:30</pdf:ModDate>
    <pdf:Title>doi:10.1016/j.ccr.2007.06.004</pdf:Title>
    <pdf:Creator>Elsevier</pdf:Creator>
    <pdf:Author></pdf:Author>
    <pdf:Keywords></pdf:Keywords>
    <pdf:Subject></pdf:Subject>
    <pdf:CreationDate>0-01-01T00:00:00Z</pdf:CreationDate>
  </rdf:Description>
  <rdf:Description about='' xmlns='http://ns.adobe.com/xap/1.0/' xmlns:xap='http://ns.adobe.com/xap/1.0/'>
    <xap:CreatorTool>Elsevier</xap:CreatorTool>
    <xap:ModifyDate>2007-06-30T03:16:37+05:30</xap:ModifyDate>
    <xap:Title> <rdf:Alt> <rdf:li xml:lang='x-default'>doi:10.1016/j.ccr.2007.06.004</rdf:li> </rdf:Alt> </xap:Title>
    <xap:Author></xap:Author>
    <xap:Description> <rdf:Alt> <rdf:li xml:lang='x-default'/> </rdf:Alt> </xap:Description>
    <xap:CreateDate>0-01-01T00:00:00Z</xap:CreateDate>
    <xap:MetadataDate>2007-06-30T03:16:37+05:30</xap:MetadataDate>
  </rdf:Description>
  <rdf:Description about='' xmlns='http://purl.org/dc/elements/1.1/' xmlns:dc='http://purl.org/dc/elements/1.1/'>
    <dc:title>doi:10.1016/j.ccr.2007.06.004</dc:title>
    <dc:creator/>
    <dc:description/>
  </rdf:Description>
</rdf:RDF>
<?xpacket end='r'?>
endstream endobj
Kudos anyway to Elsevier for embedding this piece of information in their PDFs (if indeed it is a general practice). This has the merit of being picked up by Adobe apps and displayed in e.g. Reader. Third party apps can also pull the DOI out and use it to retrieve the metadata record from CrossRef.
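To make that last point concrete, here is a hedged Python sketch of what such a third party app might do with the information dictionary quoted above. It assumes the dictionary is stored uncompressed and unencrypted and that the “Title” value is a simple literal string with no nested or escaped parentheses; the function names are hypothetical. The lookup URL uses the present-day CrossRef REST endpoint (`https://api.crossref.org/works/`), which is my assumption about how one would do this now rather than anything these PDFs prescribe. No network request is made.

```python
import re
from typing import Optional
from urllib.parse import quote

# The information dictionary quoted above, as raw bytes.
INFO_DICT = (
    b"361 0 obj << /Producer (Adobe LiveCycle PDFG 7.2) /Creator (Elsevier) "
    b"/Author () /Keywords () /Title (doi:10.1016/j.ccr.2007.06.004) "
    b"/ModDate (D:20070630031637+05'30') /Subject () "
    b"/CreationDate (D:00000101000000Z) >> endobj"
)

# Matches /Title (...) where the value holds no parentheses or backslashes.
TITLE_RE = re.compile(rb"/Title\s*\(([^()\\]*)\)")


def doi_from_title(pdf_bytes: bytes) -> Optional[str]:
    """Return the DOI if the Title field holds a "doi:" value, else None."""
    match = TITLE_RE.search(pdf_bytes)
    if match is None:
        return None
    title = match.group(1).decode("latin-1").strip()
    if not title.lower().startswith("doi:"):
        return None
    return title[len("doi:"):]


def crossref_lookup_url(doi: str) -> str:
    """Build (but do not fetch) a CrossRef REST URL for the given DOI."""
    return "https://api.crossref.org/works/" + quote(doi, safe="/")


doi = doi_from_title(INFO_DICT)
if doi is not None:
    print(crossref_lookup_url(doi))
```

A real tool would want a proper PDF parser rather than a regular expression, since information dictionaries are often held in compressed object streams, but the sketch shows how little is needed once the DOI sits in a predictable field.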
The only downside is that technically this seems to be a kludge to satisfy Adobe apps and is not the correct field for filing this information. I would have thought that some other information dictionary field (e.g. “Subject”) would be a better kludge, and then reserve the “Title” and “Author” fields for their proper purposes. The RDF/XML title fields would appear to be inherited from the “Title” field in the information dictionary. It’s a bit of a shame really because the DOI is embedded – it’s just in the wrong place(s). (OK, so that’s still way better, maybe, than not providing this information at all.)
Hopefully, with more examples to mull over and experiences to learn from we can arrive at a much better and more systematic way of including the DOI, and other key metadata fields, within a PDF so that this information can be gleaned easily and unambiguously by third party apps.