PDF-Extract

Geoffrey Bilder – 2012 April 17

PDF-EXTRACT

Crossref Labs is happy to announce the first public release of “pdf-extract”, an open source set of tools and libraries for extracting citation references (and, eventually, other semantic metadata) from PDFs. We first demonstrated this tool to Crossref members at our annual meeting last year. See the pdf-extract labs page for a detailed introduction to this new set of tools.

If you are unable to download and install the tool, you can play with an experimental web interface called “Extracto.” Be warned, Extracto is running on a very feeble server using an erratic and slow internet connection. The only guarantee that we can make about using it is that it will repeatedly fall over and annoy you. The weasel has spoken.

Add linked images to PDFs

Geoffrey Bilder – 2010 August 16

In R&D, PDF

While working on an internal project, we developed “pdfstamp”, a command-line tool that allows one to easily apply linked images to PDFs. We thought some in our community might find it useful and have released it on GitHub. Some more PDF-related tools will follow soon.

XMP in RSC PDFs

admin – 2010 August 03

In Identifiers, PDF, XMP, InChI

Just a quick heads-up to say that we’ve had a go at incorporating InChIs and ontology terms into our PDFs with XMP. There isn’t a lot of room in an XMP packet so we’ve had to be a bit particular about what we include.

InChIs: the bigger the molecule the longer the InChI, so we’ve standardized on the fixed-length InChIKey. This doesn’t mean anything on its own, so we’ve gone the Semantic Web route of including an InChI resolver HTTP URI. Alternatively you can extract the InChIKeys with a regular expression.
Ontology terms: we’re using HTTP URIs again and pointing to either Open Biomedical Ontology URIs (biology, biomedicine; slashy) or RSC ontology terms (chemistry; hashy). Often the OBO URIs resolve to a specific web page, but for the moment the RSC URIs just point to a large OWL file. Slashy URIs are quite a bit more involved so we’ll have to see what the demand is like.

There’s only about 4K to play with, so it’s only ever going to be a best-of. More detailed article metadata has to go in either a sidecar file, as Tony has pointed out before, or ideally on the article landing page. The example files are here and I’ve posted something with a different slant on the RSC technical blog.
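The regular-expression route mentioned above for InChIKeys can be sketched roughly as follows. This is a minimal illustration, not the RSC's own tooling: it assumes the standard InChIKey layout (14 uppercase letters, a hyphen, 10 uppercase letters, a hyphen, one uppercase letter), and the resolver host in the sample is a placeholder, as is the function name `extract_inchikeys`.

```python
import re

# Standard InChIKey layout: 14 letters, hyphen, 10 letters, hyphen, 1 letter.
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")


def extract_inchikeys(xmp_text):
    """Return the distinct InChIKeys found in the text, in order of first appearance."""
    seen = []
    for key in INCHIKEY_RE.findall(xmp_text):
        if key not in seen:
            seen.append(key)
    return seen


# Hypothetical XMP fragment; resolver.example.org is a placeholder host,
# not the resolver used in the RSC PDFs.
sample = (
    '<rdf:li rdf:resource="http://resolver.example.org/inchikey/'
    'BSYNRYMUTXBXSQ-UHFFFAOYSA-N"/>'
    '<rdf:li rdf:resource="http://resolver.example.org/inchikey/'
    'RYYVLZVUVIJVGH-UHFFFAOYSA-N"/>'
)

if __name__ == "__main__":
    for key in extract_inchikeys(sample):
        print(key)
```

Because the key is fixed-length, the pattern stays the same however large the molecule is, which is part of the appeal of standardizing on InChIKeys rather than full InChIs.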

Add Crossref metadata to PDFs using XMP

Geoffrey Bilder – 2009 December 09

In Metadata, PDF, XMP

In order to encourage publishers and other content producers to embed metadata into their PDFs, we have released an experimental tool called “pdfmark”. This open source tool allows you to add XMP metadata to a PDF. What’s really cool is that if you give the tool a Crossref DOI, it will look up the metadata in Crossref and then apply said metadata to the PDF. More detail can be found on the pdfmark page on the Crossref Labs site. The usual weasel words and excuses about “experiments” apply.
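To give a feel for what "XMP metadata" looks like once it is assembled, here is a rough sketch of building a small XMP packet from a DOI, a title, and a list of authors. It is not pdfmark's code and says nothing about pdfmark's options. The function name `build_xmp`, the choice of Dublin Core and PRISM properties, and the sample values are all assumptions made for illustration, and the metadata is hard-coded rather than looked up in Crossref.

```python
import xml.etree.ElementTree as ET

# Namespaces commonly used in XMP packets.
NS = {
    "x": "adobe:ns:meta/",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dc": "http://purl.org/dc/elements/1.1/",
    "prism": "http://prismstandard.org/namespaces/basic/2.0/",
}
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

for prefix, uri in NS.items():
    ET.register_namespace(prefix, uri)


def q(prefix, local):
    """Build an ElementTree qualified name such as {uri}local."""
    return "{%s}%s" % (NS[prefix], local)


def build_xmp(doi, title, creators):
    """Return an XMP packet (as a string) carrying a DOI, title, and creators."""
    xmpmeta = ET.Element(q("x", "xmpmeta"))
    rdf = ET.SubElement(xmpmeta, q("rdf", "RDF"))
    desc = ET.SubElement(rdf, q("rdf", "Description"), {q("rdf", "about"): ""})

    # The DOI is recorded twice: as a Dublin Core identifier and as prism:doi.
    ET.SubElement(desc, q("dc", "identifier")).text = "doi:" + doi
    ET.SubElement(desc, q("prism", "doi")).text = doi

    # dc:title is a language alternative (rdf:Alt) in XMP.
    alt = ET.SubElement(ET.SubElement(desc, q("dc", "title")), q("rdf", "Alt"))
    ET.SubElement(alt, q("rdf", "li"), {XML_LANG: "x-default"}).text = title

    # dc:creator is an ordered list (rdf:Seq) of names.
    seq = ET.SubElement(ET.SubElement(desc, q("dc", "creator")), q("rdf", "Seq"))
    for name in creators:
        ET.SubElement(seq, q("rdf", "li")).text = name

    body = ET.tostring(xmpmeta, encoding="unicode")
    return (
        '<?xpacket begin="\ufeff" id="W5M0MpCehiHzreSzNTczkc9d"?>\n'
        + body
        + '\n<?xpacket end="w"?>'
    )


if __name__ == "__main__":
    # Placeholder values, not a real record.
    print(build_xmp("10.5555/12345678", "An example title", ["A. Author", "B. Author"]))
```

Embedding the resulting packet in a PDF is a separate step that needs a PDF library, which this sketch deliberately leaves out.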

ISO Standard for PDF

Tony Hammond – 2008 July 03

In PDF

I blogged here back in Jan. 2007 about Adobe submitting PDF 1.7 for standardization by ISO. From yesterday’s ISO press release this:

“The new standard, ISO 32000-1, Document management – Portable document format – Part 1: PDF 1.7, is based on the PDF version 1.7 developed by Adobe. This International Standard supplies the essential information needed by developers of software that create PDF files (conforming writers), software that reads existing PDF files and interprets their contents for display and interaction (conforming readers), and PDF products that read and/or write PDF files for a variety of other purposes (conforming products).”

Mars Bar

Tony Hammond – 2007 October 08

In PDF

Just noticed that there is now (as of last month) a blog for Mars (“Mars: Comments on PDF, Acrobat, XML, and the Mars file format”). See this from the initial post:

“The Mars Project at Adobe is aimed at creating an XML representation for PDF documents. We use a component-based model for representing different aspects of the document and we use the Universal Container Format (a Zip-based packaging format) to hold the pieces. Mars uses XML to represent the individual components where that makes sense, but otherwise uses industry standard formats to represent other components. Examples of these include Fonts (we use OpenType), Images (PNG, GIF, JPEG, JPEG2000), Color (ICC Color Profiles), etc.. We use SVG to represent page content, which fits as both an XML format and an industry standard.”

pdfa.org

Tony Hammond – 2007 August 23

In PDF

Following on from yesterday’s post I just came across this very useful source of information on PDF/A: the PDF/A Conformance Center. This provides links to resources such as the whitepaper “PDF/A - A new Standard for Long-Term Archiving”, and a number of technical notes, especially “Metadata and PDF/A-1” (also available as a PDF). (This latter corrects some errors in the ISO standard which are to be redressed in a forthcoming Technical Corrigendum later this year.)
