
As PhD students on the DoSSIER project, we explore the theory of information retrieval and develop tools that put that knowledge into practice. I work on project 9, “Identifying Work, Tasks, and Information Flows”, at Spinque, a high-tech SME offering knowledge graph search technology. Knowledge graphs are large networks of entities (such as people, locations and organisations), their properties, and the relationships between them. For example, “Michelle Obama” is linked to “Barack Obama” by the relation “spouse”. The term “Knowledge Graph” was popularised by Google in 2012 and is now used for any graph-based knowledge base.
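To make the idea concrete, a knowledge graph can be pictured as a list of (subject, relation, object) triples. The snippet below is only an illustrative sketch: the `Triple` type and the `relations_of` helper are hypothetical names invented for this post and are not part of Spinque's technology, and the second triple is an assumed inverse of the example above.

```python
from typing import List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One edge of a knowledge graph: subject --relation--> object."""
    subject: str
    relation: str
    obj: str


# The first triple is the example from the text; the second is an
# assumed inverse edge added purely for illustration.
triples = [
    Triple("Michelle Obama", "spouse", "Barack Obama"),
    Triple("Barack Obama", "spouse", "Michelle Obama"),
]


def relations_of(entity: str, graph: List[Triple]) -> List[Tuple[str, str]]:
    """Return (relation, object) pairs for every edge leaving `entity`."""
    return [(t.relation, t.obj) for t in graph if t.subject == entity]


print(relations_of("Michelle Obama", triples))
```

Searching a knowledge graph then amounts to following such edges from one entity to another.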

One of my main tasks at Spinque was to develop a program that recognises named entities, specifically funding information, in the academic papers of a Dutch university. Why funding information? It is strategically important in academia: institutions need to know which research output is linked to which research programme and which subsidies fund that research. How does this relate to knowledge graphs? The named entities extracted by AckNer (the name of the program) will be put into a knowledge graph, which should substantially improve the quality of search over funding information.

Funding information is usually found in the 'Funding' and 'Acknowledgements' sections of scientific papers, so we first extract the pages that contain those sections. We then select the sentences that contain forms of the words 'fund', 'finance' and 'support'. Because the language of 'Acknowledgements' sections is highly standardised, these sentences usually begin with phrases such as "This project is funded by..." or "This research is supported by...". This regularity lets us use the structure of the sentence to locate the named entities, which usually appear in the middle and at the end of the sentence; for this we use parse trees. We also extract contract and grant numbers using regular expressions. On a sample of 321 articles the program reaches a precision of 0.77 and a recall of 0.84, giving an F1 measure of 0.80. Recall is the most important measure for us: a high recall means the program misses few of the named entities that are actually present, so more of them end up in the knowledge graph.
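The keyword filter, the grant-number extraction and the F1 figure described above can be sketched in a few lines of Python. This is a simplified illustration and not AckNer's actual code: the two regular expressions, the naive sentence splitter and the function names are all assumptions made for this post, and the parse-tree step is left out entirely. The sample text reuses the funding statement of the DoSSIER project itself.

```python
import re
from typing import List

# Assumed keyword pattern: inflected forms of "fund", "finance", "support".
KEYWORDS = re.compile(
    r"\b(?:fund(?:s|ed|ing)?|financ(?:e|es|ed|ing|ial|ially)"
    r"|support(?:s|ed|ing)?)\b",
    re.IGNORECASE,
)

# Assumed grant-number pattern: a cue word ("grant" or "contract"),
# an optional "agreement" and "No"/"number"/"#", then a token that
# contains at least one digit.
GRANT_ID = re.compile(
    r"\b(?:grant|contract)(?:\s+agreement)?\s+(?:no\.?|number|#)?\s*"
    r"((?=[A-Z0-9/-]*\d)[A-Z0-9]+(?:[/-][A-Z0-9]+)*)",
    re.IGNORECASE,
)


def funding_sentences(text: str) -> List[str]:
    """Split text naively on sentence-final punctuation and keep the
    sentences that mention one of the funding keywords."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if KEYWORDS.search(s)]


def grant_numbers(sentence: str) -> List[str]:
    """Return the identifiers that follow a grant or contract cue word."""
    return GRANT_ID.findall(sentence)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


sample = (
    "We thank our colleagues for helpful discussions. "
    "This project has received funding from the European Union's "
    "Horizon 2020 research and innovation programme under the "
    "Marie Skłodowska-Curie grant agreement No 860721. "
    "The data are available online."
)

for sentence in funding_sentences(sample):
    print(grant_numbers(sentence))

print(round(f1_score(0.77, 0.84), 2))
```

A keyword filter like this is cheap and favours recall, which matches the priority stated above; the funder names themselves still have to come from the sentence structure.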

The paper about AckNer, "This research is funded by... - Named Entity Recognition of financial information in research papers", was accepted for the BIR 2021 Workshop on Bibliometric-enhanced Information Retrieval, held on 1 April 2021. This workshop explores issues related to academic search, at the intersection of Information Retrieval and Bibliometrics. I gave a short presentation of the paper and was asked about the future use of the program for building knowledge graphs. I also gained insights into ways of applying machine learning and deep learning to the program, such as building a large set of training data on which to train it. After the workshop the paper was published in the BIR 2021 proceedings. If you want to know more about AckNer, here is the link to the paper and here is the link to a GitHub repository.

The next step is to run AckNer on a larger dataset. We have already run it on 7,500 research papers and found some minor problems, which have been partially solved. Next, we will run the program on a dataset of 40 to 50 thousand papers. Running AckNer at this scale will build a knowledge repository from which entities will be extracted via Spinque Desk and put into a knowledge graph. The entities will be disambiguated and assigned unique identifiers. We will continue this research and create further knowledge repositories for use in Spinque Desk, producing more knowledge graphs that help our clients improve their search engines.

 


This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 860721