CrossCheck Crawl

In order for your content to be crawled, you will need to do one of the following:

(1) Include as-crawled URLs in your CrossRef deposits, so that your DOIs can reliably resolve to full-text content.

(2) Ensure that your existing DOI response pages can reliably redirect IP-authenticated crawlers to your full-text content.


Crawler IP address range

208.57.158.242-254
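
If you use IP authentication under option (2), your server needs a way to recognize requests coming from the range above. The following is a minimal Python sketch of such a check, not an official CrossCheck tool. The function name is_crosscheck_crawler is hypothetical, and the sketch assumes the range is inclusive at both ends.

```python
import ipaddress

# Crawler range quoted above: 208.57.158.242 through 208.57.158.254.
# Treating both ends as inclusive is an assumption of this sketch.
CRAWLER_FIRST = ipaddress.IPv4Address("208.57.158.242")
CRAWLER_LAST = ipaddress.IPv4Address("208.57.158.254")


def is_crosscheck_crawler(remote_addr: str) -> bool:
    """Return True if remote_addr falls inside the published crawler range."""
    try:
        addr = ipaddress.IPv4Address(remote_addr)
    except ipaddress.AddressValueError:
        # Not a valid IPv4 address, so it cannot be in the range.
        return False
    return CRAWLER_FIRST <= addr <= CRAWLER_LAST
```

A request handler could call is_crosscheck_crawler on the client address and, when it returns True, redirect the request to the full-text URL instead of the landing page.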

What is the cost to have my content crawled?

CrossCheck members do not pay a fee to have their content crawled. If you require regular uploads to an FTP site, there may be a setup fee.

Once we sign a service agreement how long will it be until our content is crawled?

It depends on how your content is set up. Typically we start to crawl your content one to three weeks after the service agreement is signed.

What types of files can you crawl?

We can crawl most file types as long as the file is not password-protected or an image. If you would like to make sure we can crawl a specific file type, please email us.

What do you crawl?

We crawl all published works with registered DOIs unless you specify something you do not want crawled.

How do you know what to crawl?

We use the CrossRef site map to determine what to crawl.

How do you check that you have crawled the correct content?

For all publishers:

  • We verify that the content-type is the type expected for this publisher.
  • When the file is a PDF, we check that it is not password-protected.
  • We check the HTTP status code of the response from the publisher's server. If it is not a 200 status code, we flag this URL.
  • We check for errors in the communication between our server and the publisher's (connection timeout, unable to resolve host name, etc.).
  • We check the mime type (in case there is a content-type/mime mismatch).
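
To illustrate the kind of checks listed above, here is a minimal Python sketch. It is not the crawler's actual code. The function name flag_response, the flag messages, and the sniffing rules are assumptions made for this example. Looking for an /Encrypt entry is only a rough heuristic for password protection, and network errors such as timeouts would be handled by whatever code performs the request.

```python
from typing import List, Optional


def sniff_mime(body: bytes) -> Optional[str]:
    """Guess a MIME type from the first bytes (two simple cases only)."""
    if body.startswith(b"%PDF-"):
        return "application/pdf"
    head = body.lstrip()[:15].lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "text/html"
    return None  # unknown; this sketch does not guess further


def flag_response(status: int, content_type: str, body: bytes,
                  expected_type: str) -> List[str]:
    """Return a list of problems found; an empty list means no flags."""
    flags = []
    if status != 200:
        flags.append(f"unexpected HTTP status {status}")
    # Drop parameters such as "; charset=utf-8" before comparing.
    declared = content_type.split(";", 1)[0].strip().lower()
    if declared != expected_type:
        flags.append(f"content-type {declared!r}, expected {expected_type!r}")
    sniffed = sniff_mime(body)
    if sniffed is not None and sniffed != declared:
        flags.append(f"content-type {declared!r} but body looks like {sniffed!r}")
    # Heuristic: encrypted PDFs carry an /Encrypt entry in their trailer.
    if sniffed == "application/pdf" and b"/Encrypt" in body:
        flags.append("PDF appears to be password-protected")
    return flags
```

For example, a server that returns an HTML login page labelled as application/pdf would produce a single content-type/MIME mismatch flag.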

For the publishers that take us to a landing page:

  • There is specific logic in place for each publisher to locate the full-text link on that page and then retrieve the linked full text. If for any reason we cannot find the full-text link on this page according to that logic, we flag this URL.
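
The per-publisher logic is not described here, so the following Python sketch only shows what one such rule could look like. It uses a purely hypothetical rule: take the first link on the landing page whose path ends in ".pdf". The names PdfLinkFinder and find_full_text_url are made up for the example.

```python
from html.parser import HTMLParser
from typing import Optional
from urllib.parse import urljoin


class PdfLinkFinder(HTMLParser):
    """Remember the first <a href> whose path ends in .pdf (hypothetical rule)."""

    def __init__(self):
        super().__init__()
        self.first_pdf_href = None

    def handle_starttag(self, tag, attrs):
        if tag != "a" or self.first_pdf_href is not None:
            return
        href = dict(attrs).get("href")
        # Ignore any query string when testing the file extension.
        if href and href.lower().split("?", 1)[0].endswith(".pdf"):
            self.first_pdf_href = href


def find_full_text_url(landing_html: str, landing_url: str) -> Optional[str]:
    """Return the absolute full-text URL, or None if no link matches."""
    finder = PdfLinkFinder()
    finder.feed(landing_html)
    if finder.first_pdf_href is None:
        return None  # the caller would flag this URL
    # Resolve relative links against the landing page address.
    return urljoin(landing_url, finder.first_pdf_href)
```

A return value of None corresponds to the case above where the full-text link cannot be found and the URL is flagged.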

Once we retrieve the article and have extracted the text, we do the following:

  • We check whether the title and author specified in the CrossRef metadata are actually present in the page we are downloading.
  • We check whether the article is shorter than a reasonable minimum length.

If these checks fail, we still index the document, but a warning is written to the log.
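
The post-extraction checks can be sketched in Python as follows. This is an illustration under stated assumptions, not the crawler's implementation. The function name post_extraction_warnings, the simple substring matching, and the 200-word threshold are all hypothetical, since the actual minimum length is not given here.

```python
import logging
import re
from typing import List

logger = logging.getLogger("crawl_checks")

MIN_WORDS = 200  # hypothetical threshold; the real value is not stated


def _normalize(text: str) -> str:
    """Collapse whitespace and ignore case so line breaks do not block a match."""
    return re.sub(r"\s+", " ", text).strip().casefold()


def post_extraction_warnings(text: str, title: str, authors: List[str],
                             min_words: int = MIN_WORDS) -> List[str]:
    """Log and return warnings; the document is still indexed either way."""
    warnings = []
    haystack = _normalize(text)
    if _normalize(title) not in haystack:
        warnings.append("title not found in extracted text")
    for author in authors:
        if _normalize(author) not in haystack:
            warnings.append(f"author {author!r} not found in extracted text")
    if len(haystack.split()) < min_words:
        warnings.append(f"extracted text shorter than {min_words} words")
    for message in warnings:
        logger.warning(message)
    return warnings
```

The function only reports problems. Consistent with the behavior described above, it never prevents the document from being indexed.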

How often do you crawl our content?

Generally we crawl your content weekly.

Is it possible to restrict crawling to a certain time of day?

Not at this time.

For more information, please contact crosscheck_info@crossref.org.



copyright 2002, pila, inc. all rights reserved