Harvesting the metadata of 1.5 million arXiv papers

arXiv is the world’s leading online repository of scientific research in physics, mathematics, computer science and related fields. It lets scientists share manuscripts of their work openly, easily and quickly; some of these manuscripts are never published elsewhere.

The metadata of the ~1.5 million manuscripts that it hosts forms an ideal dataset for many NLP and data analysis projects. This post goes over how to retrieve that metadata and store it in a usable format.

What’s available?

The metadata is made available in XML format, and it includes fields such as:

title
abstract
authors
categories
journal reference
creation date
update date

This metadata does not contain the full text of papers, but the full text can be downloaded separately in PDF or TeX format from Amazon S3 (beware of the request fees). The full-text content is added approximately once a month, whereas the metadata is updated daily.

Let’s get coding!

Step 1 – Import libraries

The only special library that we will need is sickle, a lightweight OAI-PMH client library designed for easy harvesting of OAI-compliant interfaces, such as arXiv. The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) enables the dissemination of content in a standard and structured format.

import xmltodict # to convert the raw metadata from xml format to dict
import pandas as pd # final format of stored data
from sickle import Sickle # to retrieve data from the OAI arxiv interface

Step 2 – Connect to arXiv’s OAI interface

To initialise the connection, we will pass the arXiv’s OAI interface URL to a Sickle object.

connection = Sickle('http://export.arxiv.org/oai2')

Step 3 – Request data

The ListRecords method requests data from the OAI interface. The following parameters can be set for conditional harvesting:

metadataPrefix: defines the metadata format. In this tutorial, we will use the arXiv format, which includes separated-out author names as well as category and license information.
from: restricts the papers to those created or updated after the specified date. Note that the OAI-PMH interface does not support selective harvesting based on submission date.
until: restricts the papers to those created or updated before the specified date. If not specified, it is set to the latest available date (yesterday/today).
set: restricts the papers to a category of articles as defined by arXiv. For example, to retrieve physics papers only, pass 'set': 'physics' to ListRecords. If not specified, papers from all categories will be returned.
ignore_deleted: if set to True, deleted records will be skipped.
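
Under the hood, an OAI-PMH request is an ordinary HTTP request whose query string carries these parameters. The sketch below builds such a URL with the standard library only, to show what a ListRecords request looks like. The helper build_list_records_url and its argument names are made up for this illustration, and ignore_deleted is left out because it is an option of the Sickle client rather than a parameter of the protocol.

```python
from urllib.parse import urlencode

BASE_URL = 'http://export.arxiv.org/oai2'  # same endpoint as in Step 2


def build_list_records_url(metadata_prefix, from_date=None, until=None, set_spec=None):
    """Return a ListRecords URL for the given filters (hypothetical helper)."""
    params = {'verb': 'ListRecords', 'metadataPrefix': metadata_prefix}
    # Optional filters are only added when they are set.
    if from_date:
        params['from'] = from_date
    if until:
        params['until'] = until
    if set_spec:
        params['set'] = set_spec
    return BASE_URL + '?' + urlencode(params)


url = build_list_records_url('arXiv', from_date='2019-05-01',
                             until='2019-06-01', set_spec='physics')
print(url)
```

Pasting a URL like this into a browser is a quick way to inspect the raw XML before writing any harvesting code.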

If you need to set the from date, you may face the problem that from is a reserved word in Python, so it cannot be written as a keyword argument. We will circumvent this issue by using a dictionary together with the ** operator. Before requesting metadata for multiple years, it would be a good idea to quickly test the code with a shorter date range.

print('Getting papers...')
data = connection.ListRecords(**{'metadataPrefix': 'arXiv', 'from': '2007-01-01', 'until': '2019-06-01', 'ignore_deleted': True})
print('Papers retrieved.')
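
To see why the dictionary is needed, here is a small self-contained sketch. The function list_records is only a stand-in for connection.ListRecords (it simply returns the arguments it receives), and the one-month date range is an arbitrary choice for a quick test run.

```python
def list_records(**kwargs):
    """Hypothetical stand-in for connection.ListRecords: returns what it was given."""
    return kwargs


# Writing list_records(from='2019-05-01') is a SyntaxError, because `from` is a keyword.
try:
    compile("list_records(from='2019-05-01')", '<example>', 'eval')
    keyword_call_is_valid = True
except SyntaxError:
    keyword_call_is_valid = False

# Unpacking a dictionary sidesteps the problem; a short range keeps the test quick.
test_params = {'metadataPrefix': 'arXiv', 'from': '2019-05-01', 'until': '2019-06-01'}
received = list_records(**test_params)

print(keyword_call_is_valid)  # False
print(received['from'])       # 2019-05-01
```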

Step 4 – Store raw metadata in .txt file

When dealing with a large dataset, it is common practice to keep a copy of it in its raw form, to avoid having to repeat the time-consuming data retrieval process. As such, we will store the raw metadata before transforming it into a more amenable format.

Since most OAI verbs yield more than one element, their respective Sickle methods return iterator objects which can be used to iterate over the retrieved records. We will therefore iterate over data and raise an error if too many sequential records can’t be retrieved.

iters = 0
errors = 0  # number of failures in a row; reset after every successful write

with open('arXiv_metadata_raw.txt','a+') as f:
    while True:
        try:
            f.write(data.next().raw)
            errors = 0
            iters +=1
            
            if iters % 10000 == 0:
                print('On iter', iters)
        
        except AttributeError:
            if errors >5:
                raise AttributeError('\nQUITTING: Too many sequential errors\n')
            else:
                print('\nERROR!\n')
                errors +=1
                
        except StopIteration:
            print('On iter', iters)
            print('\nDONE!')
            break
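
The error handling in this loop can be easier to follow in isolation. Below is a sketch of the same idea, assuming nothing about Sickle: drain and FlakyRecords are made-up names, the iterator is a toy that fails once on purpose, and chunks.append plays the role of f.write.

```python
def drain(records, write, max_sequential_errors=5):
    """Write every item of `records`, tolerating a few failures in a row."""
    written = 0
    errors = 0
    while True:
        try:
            item = next(records)
        except StopIteration:
            return written          # iterator exhausted: we are done
        except AttributeError:
            errors += 1
            if errors > max_sequential_errors:
                raise               # too many failures in a row: give up
            continue
        write(item)
        errors = 0                  # a success resets the counter
        written += 1


class FlakyRecords:
    """Toy iterator that fails on its second call, then recovers."""

    def __init__(self, items):
        self._items = list(items)
        self._calls = 0

    def __iter__(self):
        return self

    def __next__(self):
        self._calls += 1
        if self._calls == 2:
            raise AttributeError('simulated bad record')
        if not self._items:
            raise StopIteration
        return self._items.pop(0)


chunks = []
count = drain(FlakyRecords(['<record>a</record>', '<record>b</record>']), chunks.append)
print(count, chunks)
```

Because the counter is reset after each success, only an unbroken run of failures stops the harvest; occasional bad records are skipped.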

Step 5 – Format raw metadata

Our goal is to convert the metadata from XML to a pandas DataFrame. Converting XML to dictionaries, and then dictionaries to a pandas DataFrame, is made incredibly easy by xmltodict and pandas, so we will follow this sequence.

Firstly, we will load the raw metadata into a string, raw_data.

raw_data = ''

with open('arXiv_metadata_raw.txt','r') as f:
    while True:
        chunk = f.read(100_000_000)  # read 100 million characters at a time
        if not chunk:
            break
        else:
            raw_data += chunk
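
If the raw file is too large to hold in memory as one string, the records could instead be read one at a time. The following is only a sketch of that alternative, not what the rest of this post uses: iter_records is a made-up helper, and it assumes that every record in the file ends with </record>, as is the case for the file written in Step 4.

```python
import io


def iter_records(handle, chunk_size=1_000_000, end_tag='</record>'):
    """Yield one complete record at a time from a file-like object."""
    buffer = ''
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last end tag is complete;
        # the remainder may be cut mid-record, so it stays in the buffer.
        *complete, buffer = buffer.split(end_tag)
        for record in complete:
            yield record + end_tag


# A tiny in-memory "file" and a tiny chunk size, to show that records
# split across chunk boundaries are reassembled correctly.
sample = io.StringIO('<record>one</record><record>two</record>')
records = list(iter_records(sample, chunk_size=7))
print(records)  # ['<record>one</record>', '<record>two</record>']
```

With the real data, the handle would be the result of open('arXiv_metadata_raw.txt', 'r') instead of the in-memory sample.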

Then, we will separate the records and convert them into dictionaries using the convert_dict function.

The authors field needs additional processing because, in the XML format, each author’s forenames and surname are stored separately. The final format is a list of strings, each string being an author’s forenames and surname separated by a space.

def convert_dict(record_xml):
    record_dict = xmltodict.parse(record_xml, process_namespaces=False)['record']['metadata']['arXiv']
    
    record_dict['id'] = str(record_dict['id'])
    
    if not isinstance(record_dict['authors']['author'], list):
        authors = [record_dict['authors']['author']]
    else:
        authors = record_dict['authors']['author']
    
    authors = [(author['forenames'] + ' ' if 'forenames' in author.keys() else '') + author['keyname'] for author in authors]
        
    record_dict['authors'] = authors
    return record_dict

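# Split on the closing tag, drop empty leftovers (such as the piece after the
# last record), then restore the tag so that each piece is valid XML again.
list_of_xml = [piece + '</record>' for piece in raw_data.split('</record>') if piece.strip()]
list_of_dicts = [convert_dict(record_xml) for record_xml in list_of_xml]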
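
The isinstance check in convert_dict is there because xmltodict returns a single dictionary when an element occurs once and a list of dictionaries when it is repeated. The sketch below isolates that step on plain dictionaries, so it runs without xmltodict. The helper name normalise_authors and the author names are made up for the example.

```python
def normalise_authors(author_field):
    """Return 'forenames keyname' strings from the parsed author field."""
    # A paper with one author gives a dict; several authors give a list of dicts.
    authors = author_field if isinstance(author_field, list) else [author_field]
    return [
        (author['forenames'] + ' ' if 'forenames' in author else '') + author['keyname']
        for author in authors
    ]


single = {'keyname': 'Lovelace', 'forenames': 'Ada'}
several = [{'keyname': 'Turing', 'forenames': 'Alan'}, {'keyname': 'Bourbaki'}]

print(normalise_authors(single))   # ['Ada Lovelace']
print(normalise_authors(several))  # ['Alan Turing', 'Bourbaki']
```

Authors without forenames, such as collaborations, keep just their keyname.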

Step 6 – Export data as a pandas DataFrame

Finally, we will load the formatted data into a pandas DataFrame and export it to a .csv file.

df = pd.DataFrame(list_of_dicts)  # one row per paper, one column per metadata field
df.to_csv('arXiv_metadata_formatted.csv')
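
One thing to keep in mind when reading the file back: the authors column holds Python lists, and a .csv file can only store them as text. The sketch below assumes that each list is written as its plain string representation, which is what we would expect from to_csv, and shows how ast.literal_eval turns such a cell back into a list. The author names are made up.

```python
import ast

authors = ['Ada Lovelace', 'Alan Turing']  # made-up content of one `authors` cell
cell_in_csv = str(authors)                 # the text that ends up in the .csv file
restored = ast.literal_eval(cell_in_csv)   # parse the text back into a real list

print(type(cell_in_csv).__name__)  # str
print(restored == authors)         # True
```

When loading the file with pd.read_csv, the same function can be passed through the converters argument, for example converters={'authors': ast.literal_eval}. It is probably also worth passing dtype={'id': str}, so that identifiers are not read back as numbers.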

That’s it!

You can find raw and formatted metadata of papers from 2007 to June 2019 on my GitHub repository.
