The INSPIRE High Energy Physics (HEP) information system is the leading platform used by HEP researchers to access scientific literature. One of its main strengths is its extensive coverage of HEP and related fields. Maintaining this coverage requires a great deal of work by experts in the field, who must select the relevant content from the large volume of data we receive. Three students from Durham (Andrew Blance, Parisa Gregg, and Aidan Sedgefield) spent two months at CERN, one of the partners in the INSPIRE collaboration, investigating whether a Machine Learning classifier could assist with this selection task and reduce the manual workload. Their background in physics, combined with their data science skills, allowed them to tackle this task effectively.
They built a comprehensive dataset of annotated articles for training, investigated different approaches, and provided important insights into the predictive power of features that exploit the citation graph. These contributions were very valuable and significantly reduced the further development time required to put the classifier into production.