The Public Library of Science (PLOS) is an open-access publisher that offers an alternative to subscription journals. Its database holds about 240,000 papers published in seven journals. This corpus of scientific knowledge was available for download as a zip file containing all the papers in XML format, with no way to run queries against it. The goal of this project is to provide a data mining framework for querying the PLOS corpus.


allofplos is a Python library and command-line tool for downloading and processing the PLOS corpus. The library has methods to build a custom corpus or to work with the complete, unmodified dataset. Any paper can be queried through an ORM (Peewee), and there is an option to build a SQLite database that can be searched with standard SQL. It comes with a small data subset, the “starter corpus”, for practicing query operations without downloading the whole PLOS corpus.
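As a rough illustration of what querying such a SQLite database with standard SQL could look like, here is a minimal sketch using only Python's built-in sqlite3 module. The table name, column names, helper functions and rows are all hypothetical placeholders invented for this example; they are not the actual allofplos schema or API, and the DOIs and titles are not real papers.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical 'article' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE article ("
        "doi TEXT PRIMARY KEY, title TEXT, journal TEXT, year INTEGER)"
    )
    # Placeholder rows for illustration only (not real papers).
    rows = [
        ("10.0000/example.0001", "Placeholder title A", "PLOS ONE", 2014),
        ("10.0000/example.0002", "Placeholder title B", "PLOS ONE", 2016),
        ("10.0000/example.0003", "Placeholder title C", "PLOS Biology", 2017),
        ("10.0000/example.0004", "Placeholder title D", "PLOS ONE", 2018),
    ]
    conn.executemany("INSERT INTO article VALUES (?, ?, ?, ?)", rows)
    conn.commit()
    return conn


def titles_by_journal(conn, journal, since_year):
    """Return titles from one journal published in or after since_year."""
    cursor = conn.execute(
        "SELECT title FROM article "
        "WHERE journal = ? AND year >= ? "
        "ORDER BY year, title",
        (journal, since_year),  # parameters are bound, not string-formatted
    )
    return [row[0] for row in cursor.fetchall()]


conn = build_demo_db()
print(titles_by_journal(conn, "PLOS ONE", 2015))
```

With a real database file, the same kind of SELECT would run against whatever tables that file actually defines; the schema should be inspected first rather than assumed.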

Technologies involved: Python – SQLite – Peewee – Jupyter Notebook


Project on GitHub:
Paper: Text and data mining scientific articles with allofPLOS. 
