The Public Library of Science (PLOS) is an Open Access publisher that offers alternatives to subscription journals. Its database holds about 240,000 papers published in seven journals. This corpus of scientific knowledge was available for download as a zip file containing all the papers in XML format, but with no way to query it. The goal of this project is to provide a data mining framework for querying the PLOS corpus.
The result is allofplos, a Python library and command-line tool for downloading and processing the PLOS corpus. The library has methods to build a custom corpus or to work with the complete, unmodified dataset. Papers can be queried through an ORM (Peewee), with the option of building a SQLite database that accepts standard SQL queries. It ships with a small data subset, the “starter corpus”, for practicing query operations without downloading the whole PLOS corpus.
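To give a feel for the kind of SQL queries a SQLite database of paper metadata makes possible, here is a minimal sketch using only Python's standard sqlite3 module. It does not use the allofplos API: the table name `article`, its columns, and the helper functions are hypothetical, and the rows are placeholders rather than real PLOS records. The actual database built by the library may be organized differently.

```python
import sqlite3

# Hypothetical schema for illustration; the real database may use other names.
SCHEMA = """
CREATE TABLE article (
    doi      TEXT PRIMARY KEY,
    title    TEXT NOT NULL,
    journal  TEXT NOT NULL,
    pub_year INTEGER NOT NULL
)
"""

# Placeholder rows only; these are not real PLOS papers.
SAMPLE_ROWS = [
    ("10.0000/example.0001", "Placeholder title A", "Journal One", 2015),
    ("10.0000/example.0002", "Placeholder title B", "Journal One", 2017),
    ("10.0000/example.0003", "Placeholder title C", "Journal Two", 2017),
]


def build_demo_db():
    """Create an in-memory SQLite database filled with the placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", SAMPLE_ROWS)
    conn.commit()
    return conn


def dois_by_journal_and_year(conn, journal, year):
    """Return the DOIs of papers from one journal and year, sorted by DOI."""
    cur = conn.execute(
        "SELECT doi FROM article WHERE journal = ? AND pub_year = ? ORDER BY doi",
        (journal, year),
    )
    return [row[0] for row in cur.fetchall()]


def count_per_journal(conn):
    """Return a dict mapping each journal name to its number of papers."""
    cur = conn.execute(
        "SELECT journal, COUNT(*) FROM article GROUP BY journal ORDER BY journal"
    )
    return dict(cur.fetchall())


if __name__ == "__main__":
    conn = build_demo_db()
    print(dois_by_journal_and_year(conn, "Journal One", 2017))
    print(count_per_journal(conn))
```

Parameterized queries (the `?` placeholders) keep search terms separate from the SQL text, which matters once the filters come from user input.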
Technologies involved: Python – SQLite – Peewee – Jupyter Notebook
2120 University Ave. Berkeley CA.