Collaborations

Allofplos

Goal

The Public Library of Science (PLOS) is an Open Access publisher that offers an alternative to subscription journals. Its database holds about 240,000 papers published in seven journals. This corpus of scientific knowledge was available for download as a zip file containing all the papers in XML format, with no way to run queries against it. The goal of this project is to provide a data mining framework for querying the PLOS corpus.

Solution

allofplos is a Python library and command-line tool for downloading and processing the PLOS corpus. The library has methods to build a custom corpus or to work with the complete, unmodified dataset. Papers can be queried through an ORM (Peewee), with the option of building a SQLite database that can be searched with standard SQL. It ships with a small data subset, the “starter corpus”, for practicing query operations without downloading the whole PLOS corpus.
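To give a feel for what querying a SQLite database of papers with standard SQL might look like, here is a minimal sketch using only Python's built-in sqlite3 module. It does not use the allofplos API. The table name `article`, its columns, the helper `find_articles`, and the sample rows are all hypothetical placeholders, so the real database layout may differ.

```python
import sqlite3

# Hypothetical schema for illustration only; the database that allofplos
# builds may use different table and column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE article (
        doi TEXT PRIMARY KEY,
        title TEXT NOT NULL,
        journal TEXT NOT NULL,
        pub_year INTEGER NOT NULL
    )
    """
)

# Made-up placeholder rows, not real PLOS records.
sample_rows = [
    ("10.0000/example.0001", "Example paper on gene expression", "Journal A", 2015),
    ("10.0000/example.0002", "Example paper on protein folding", "Journal B", 2016),
    ("10.0000/example.0003", "Another gene expression example", "Journal A", 2017),
]
conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", sample_rows)


def find_articles(connection, keyword, journal):
    """Return (doi, title) pairs whose title contains keyword, for one journal."""
    query = (
        "SELECT doi, title FROM article "
        "WHERE title LIKE ? AND journal = ? "
        "ORDER BY pub_year"
    )
    # Parameters are bound with placeholders rather than string formatting.
    return connection.execute(query, (f"%{keyword}%", journal)).fetchall()


results = find_articles(conn, "gene expression", "Journal A")
for doi, title in results:
    print(doi, title)
```

The same filter could presumably be expressed through Peewee model classes instead of raw SQL. The raw form is shown here because it needs nothing beyond the standard library.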

Technologies involved: Python – SQLite – Peewee – Jupyter Notebook

Availability

Project on GitHub: https://github.com/PLOS/allofplos
Paper: Text and data mining scientific articles with allofPLOS. http://bit.ly/aoppaper
