“Text mining of academic papers is close to impossible right now.”
Max Häussler – Bioinformatics researcher, UCSC
Faced with the explosion in the number of published scientific articles and the exponential growth of computing power, the way we read the scientific literature will probably soon have little in common with today’s tedious, slow and repetitive reading work, and will rely more and more on intelligent text-mining techniques. By increasing our analytical capacities tenfold, these techniques make it possible – and will make it even easier in the future – to unleash creativity and bring about scientific innovation faster and more cheaply. For the time being, however, this bright outlook faces a major obstacle: the scientific publishing cartel, one of the world’s most lucrative industries, which is determined not to jeopardize its enormous profits.
Text-mining and its necessity :
Text-mining is a technology that aims to obtain key and previously unknown information very quickly from a very large quantity of text – in this case the biomedical literature. This technology is multi-disciplinary in nature, using machine learning, linguistic and statistical techniques.
The purpose of this article is not to constitute a technical study of text-mining, but it is nevertheless necessary, for the full understanding of the potential of this technology, to describe its main steps :
- Selection and collection of texts to be analyzed : This first step consists of using search algorithms to automatically download abstracts of interest from scientific article databases (such as PubMed, which alone references more than 30 million scientific articles). A search of the grey literature can also be conducted to be as exhaustive as possible.
- Preparation of the texts to be analyzed : The objective of this step is to put the texts into a predictable, analyzable form suited to the task to be accomplished. A whole set of techniques exists for this step, making it possible to remove the “noise” from the text and to “tokenize” the words in each sentence.
- Analysis of data from the texts : The analysis of the data will largely depend on the preparation of the text. Different statistical and data science techniques can be used: support vector machines, hidden Markov models or, for example, neural networks.
- Data visualization : The issue of data visualization is probably more important than one might think. Depending on the option chosen (tables or 3D models, for example), the information and meta-information available to the user of the model will be more or less relevant and explanatory.
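The four steps above can be illustrated with a minimal Python sketch that uses only the standard library. It is a toy and not a real pipeline: the three abstracts are invented for illustration, the stop-word list is a tiny hypothetical one, and simple co-occurrence counting stands in for the statistical models mentioned above. The function names are likewise assumptions made for this example.

```python
import re
from collections import Counter
from itertools import combinations

# Step 1 (collection): hypothetical mini-corpus standing in for downloaded abstracts.
ABSTRACTS = [
    "BRCA1 mutations are associated with breast cancer.",
    "Breast cancer risk increases with BRCA1 and BRCA2 mutations.",
    "TP53 is mutated in many cancer types.",
]

# Tiny illustrative stop-word list; real pipelines use much larger ones.
STOP_WORDS = {"are", "with", "and", "is", "in", "the", "many", "of"}


def tokenize(text):
    """Step 2 (preparation): lowercase, split into word tokens, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def cooccurrences(documents):
    """Step 3 (analysis): count how often two distinct tokens share a document."""
    counts = Counter()
    for doc in documents:
        unique_tokens = sorted(set(tokenize(doc)))
        counts.update(combinations(unique_tokens, 2))
    return counts


def as_table(counts, top=3):
    """Step 4 (visualization): render the most frequent pairs as a plain-text table."""
    rows = [f"{a:<10} {b:<10} {n}" for (a, b), n in counts.most_common(top)]
    return "\n".join(rows)


pair_counts = cooccurrences(ABSTRACTS)
print(as_table(pair_counts))
```

Even on this invented corpus, the pair ("brca1", "cancer") appears in two of the three abstracts, which is the kind of association signal that a real system would then test statistically on a much larger body of text.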
Text-mining has already proven its usefulness in biomedical scientific research: among other things, it has been used to discover associations between proteins and pathologies, to understand interactions between proteins and to elucidate the docking of certain drugs to their therapeutic target. However, most of the time, this technology is only applied to the abstracts of articles, which considerably reduces both the reliability of the data obtained and the range of possible applications.
So why not use the millions of scientific articles available online? New research hypotheses could be formulated, new therapeutic strategies could be created. This is technologically within reach, but scientific publishers seem to have decided otherwise for the moment. Here are some explanations.
The problems posed by scientific publishers :
When they emerged, at the end of the Second World War, scientific publishers were genuinely useful for the dissemination of science: learned societies had only limited means to circulate the work and conclusions of their members. At that time, published articles were disseminated through paper journals, which were too expensive for most learned societies to produce. Since the birth of this industry, and despite the considerable changes that the Internet has brought to the transmission of scientific knowledge, its business model has not evolved. It has become anachronistic, with gross margins that make online advertising giants like Google or Facebook look like unprofitable businesses. Scientific publishing is indeed the only industry in the world that obtains its raw material (scientific articles) for free from its customers (scientists from all over the world) and whose transformation step (peer review) is also carried out by its customers on a voluntary basis.
The triple-payment system set up by scientific publishers.
Scientific publishers have set up an “odd triple-payment system”, allowing private entities to capture public money intended for research and teaching. States finance the research that leads to the writing of scientific articles, pay the salaries of the scientists who take part in peer review on a voluntary basis, and finally pay once again, through the subscriptions of universities and research laboratories, to access the scientific knowledge that they have already financed twice! Another model, parallel to this one, has also been developing for a few years: the author-pays model, in which researchers pay publication fees in order to make their work more easily accessible to readers. Are we heading towards a quadruple-payment system?
The deleterious consequences of the system put in place by scientific publishers are not only financial. They also affect the quality of the scientific publications produced, and therefore the validity of any artificial intelligence models based on the data in these articles. The business model based on journal subscriptions leads publishers to favor spectacular and deeply innovative discoveries over confirmatory work, which pushes some researchers, driven by the race for the “impact factor”, to commit fraud or to publish statistically fragile results very early on. This is one of the reasons for the reproducibility crisis that science is currently experiencing, and also one of the possible causes of the insufficient publication of negative, yet highly informative, results: one out of every two clinical trials does not result in any publication.
Finally, and this is the point that interests us most in this article, scientific publishers are an obstacle to the development of text-mining on the huge article databases they own, which ultimately has a colossal impact on our knowledge and understanding of the world as well as on the development of new drugs. It is currently extremely difficult to perform text-mining on full-text scientific articles at scale because publishers do not allow it, even for subscribers who are legally entitled to read the articles. Several countries have legislated so that research teams implementing text-mining are no longer required to seek permission from scientific publishers. In response to these legal developments, scientific publishers, taking advantage of their oligopolistic position, have set up completely artificial technological barriers. For example, it has become impossible to download articles rapidly and in an automated way: the maximum rate imposed is generally 1 article every 5 seconds, which means that it would take about 5 years to download all the articles related to biomedical research. The interest of this system for scientific publishers is that it lets them hold to ransom – the term is strong, but it is the right one – the big pharmaceutical companies that wish to have these artificial technical barriers removed for their research projects.
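The order of magnitude of this rate limit can be checked with a few lines of Python. This is a back-of-the-envelope sketch that assumes strictly sequential downloads, no interruptions and 365-day years; the corpus size of 30 million articles in the second call is an illustrative assumption, not a figure taken from the publishers.

```python
SECONDS_PER_ARTICLE = 5              # rate limit quoted in the text
SECONDS_PER_YEAR = 365 * 24 * 3600   # ignoring leap years


def years_to_download(n_articles, seconds_per_article=SECONDS_PER_ARTICLE):
    """Years needed to fetch n_articles one after another at the given rate."""
    return n_articles * seconds_per_article / SECONDS_PER_YEAR


def articles_in(years, seconds_per_article=SECONDS_PER_ARTICLE):
    """Number of articles that can be fetched in the given number of years."""
    return int(years * SECONDS_PER_YEAR / seconds_per_article)


print(articles_in(5))                           # 31536000
print(round(years_to_download(30_000_000), 2))  # 4.76
```

At one article every 5 seconds, five years of uninterrupted downloading therefore covers roughly 31.5 million articles.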
The current system of scientific publishing, as we have shown, benefits only a few companies at the expense of many actors: researchers from all over the world, and even more so those working in disadvantaged countries; governments and taxpayers; health industries; and finally, at the end of the chain, patients, who do not benefit from the full potential of biomedical research. Under these conditions, many alternatives to this model are emerging, some of which are largely made possible by technology.
Towards the disruption of scientific publishing ?
“You only really destroy what you replace.”
Napoléon III – 1848
Doesn’t every innovation initially come from a form of rebellion? This has certainly been true so far of the various initiatives undertaken to unleash the potential of free and open science, which have often taken the form of piracy operations. Between manifestos and petitions (notably the call for a boycott launched by the mathematician Timothy Gowers, based on the text “The cost of knowledge”), the protest movements led by scientists and the creation of open-access platforms like https://arxiv.org/, such initiatives have been numerous. However, few actions have had as much impact as those of Aaron Swartz, one of the main theorists of open source and open science, who tragically died by suicide at the age of 26, one month before a trial in which he faced up to 35 years of imprisonment for having pirated 4.8 million scientific articles, or of course those of Alexandra Elbakyan, the famous founder of the Sci-Hub website, which allows free – and illegal – access to most of the scientific literature.
Aaron Swartz and Alexandra Elbakyan
More recently, the proponents of the open-source movement have adapted to the radical turn of text-mining, notably through Carl Malamud’s project, which aims to take advantage of a legal grey area to let academic research teams mine the gigantic database of 73 million articles he has built. The solution is interesting but not yet complete: for legal reasons, this database is for the moment not accessible from the Internet, and it is necessary to travel to India, where it is hosted, to access it.
These initiatives rely on more or less legal ways of capturing articles after their publication by scientific publishers. For a more sustainable alternative, the ideal would be to move up the value chain and work upstream with researchers. The advent of blockchain technology – a technology for storing and exchanging information that is decentralized, transparent and therefore highly secure, and to which future Resolving Pharma articles will return in detail – is thus, for many researchers and thinkers on the subject, a great opportunity to replace scientific publishers for good with a fairer system that frees scientific information.
The transformation of the system will probably be slow – the prestige accorded by researchers to the names of large scientific journals belonging to the oligopoly will persist over time – perhaps it will not even happen, but the Blockchain has, if successfully implemented, the capacity to address the issues posed earlier in this article in a number of ways :
A fairer financial distribution
As we have seen, the business model of scientific publishers is not very virtuous, to put it mildly. At the other end of the spectrum, Open Access, despite its undeniable and promising qualities, can also pose certain problems, as it sometimes lacks peer review. The use of a dedicated cryptocurrency for the scientific publishing world could eliminate the triple-payment system, as each actor could be paid the fair value of their contribution. A researcher’s institution would receive a certain amount of cryptocurrency when he or she publishes, as well as when he or she takes part in the peer review of another paper. Institutions, for their part, would pay an amount of cryptocurrency to access publications. Beyond the financial aspects, the copyright, which researchers currently waive, would be automatically registered in the blockchain for each publication. Research institutions would thus retain the right to decide at what price the fruits of their labor are made available. A system of this kind would allow, for example, anyone wishing to use a text-mining tool to pay a certain amount of this cryptocurrency, which would go to the authors and reviewers of the articles used. Large-scale text-mining would then become a commodity.
Tracking reader usage and defining a real “impact factor”
Currently, even though we try to count the number of citations of articles, the actual use of scientific articles is difficult to quantify, although it could be an interesting metric for the different actors of the research ecosystem. The Blockchain would make it possible to trace each transaction precisely. This tracing of readers would also bring a certain form of financial justice: one can imagine that, through a Smart Contract, simply reading an article would not cost the same amount of cryptocurrency as citing it. It would thus be possible to quantify the real impact of a publication and to replace the “impact factor” system with the real-time distribution of “reputation tokens” to scientists, which could also be designed so as not to discourage the publication of negative results (to alleviate this problem, researchers have also set up a platform dedicated to the publication of negative results: https://www.negative-results.org/).
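To make the idea of usage-dependent pricing more concrete, here is a minimal Python sketch of the bookkeeping such a Smart Contract might perform. Everything in it is hypothetical: the prices, the 80/20 split between authors and reviewers, and the `UsageLedger` class are assumptions made for illustration, and the code runs on a single machine rather than on any real blockchain.

```python
from collections import defaultdict

# Hypothetical prices in an imaginary token; the article gives no figures.
PRICES = {"read": 10, "cite": 50, "text_mine": 20}
# Hypothetical share of each payment that goes to the authors (rest: reviewers).
AUTHOR_SHARE_PERCENT = 80


class UsageLedger:
    """Append-only log of article usage with per-article token balances."""

    def __init__(self):
        self.events = []                  # (article_id, user, action) tuples
        self.balances = defaultdict(int)  # tokens earned per (role, article_id)

    def record(self, article_id, user, action):
        """Log one usage event and split its price between authors and reviewers."""
        price = PRICES[action]            # raises KeyError for unknown actions
        author_part = price * AUTHOR_SHARE_PERCENT // 100
        self.events.append((article_id, user, action))
        self.balances[("authors", article_id)] += author_part
        self.balances[("reviewers", article_id)] += price - author_part
        return price

    def usage(self, article_id):
        """Count events per action type for one article."""
        counts = defaultdict(int)
        for logged_id, _user, action in self.events:
            if logged_id == article_id:
                counts[action] += 1
        return dict(counts)


ledger = UsageLedger()
ledger.record("paper-1", "alice", "read")
ledger.record("paper-1", "bob", "cite")
print(ledger.usage("paper-1"))                  # {'read': 1, 'cite': 1}
print(ledger.balances[("authors", "paper-1")])  # 48
```

The per-action counts returned by `usage` are the kind of raw material from which a usage-based alternative to the “impact factor” could be computed.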
With the recent development of Non-Fungible Tokens (NFTs), we can even imagine the future emergence of a secondary market for scientific articles, which would be exchanged from user to user, as is already possible for other digital objects (video game elements, music tracks, etc.).
A way to limit fraud
Currently, the peer-review system, in addition to being particularly slow (it takes on average 12 months between the submission and the publication of a scientific article, compared to two weeks on a Blockchain-based platform such as ScienceMatters), is completely opaque to the final reader of the article, who has access neither to the names of the researchers who took part in the process nor to the successive versions of the article. Thanks to its unforgeable and chronological structure, the Blockchain could record these different modifications. This is a topic that would deserve an article of its own, but the Blockchain could also record the data and metadata that led to the conclusions of the article, for example from preclinical or clinical trials, and thus help prevent fraud while increasing reproducibility.
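The “unforgeable and chronological structure” mentioned above can be illustrated with a simple hash chain, in which each recorded revision commits to the hash of the previous one. The Python sketch below is a single-machine toy, not a distributed blockchain: it has no consensus mechanism, and the function names and block fields are assumptions made for this example.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first revision


def block_hash(content):
    """SHA-256 over the block's content, serialized deterministically."""
    payload = json.dumps(content, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_revision(chain, text, reviewer):
    """Add a revision that commits to the hash of the previous one."""
    block = {
        "index": len(chain),
        "text_sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "reviewer": reviewer,
        "previous": chain[-1]["hash"] if chain else GENESIS,
    }
    block["hash"] = block_hash(block)  # computed before the "hash" key exists
    chain.append(block)
    return block


def is_valid(chain):
    """Check every stored hash and every link to the preceding block."""
    previous = GENESIS
    for block in chain:
        content = {k: v for k, v in block.items() if k != "hash"}
        if block["previous"] != previous or block["hash"] != block_hash(content):
            return False
        previous = block["hash"]
    return True


history = []
append_revision(history, "Draft v1", reviewer=None)
append_revision(history, "Draft v2 after review", reviewer="reviewer-A")
print(is_valid(history))                 # True
history[0]["reviewer"] = "someone-else"  # tamper with an earlier revision
print(is_valid(history))                 # False
```

Because each block includes the hash of its predecessor, silently altering an earlier revision or reviewer name invalidates the chain, which is the property that would make the review history auditable.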
Manuel Martin, one of the co-founders of Orvium, a Blockchain-based scientific publishing platform, believes: “by establishing a decentralized and competitive marketplace, blockchain can help align the goals and incentives of researchers, funding agencies, academic institutions, publishers, companies and governments.”
Harnessing the potential of artificial intelligence to exploit scientific articles is an opportunity to create a real collective intelligence, to make research faster and more efficient, and probably to cure many diseases around the world. The lock that remains to be broken is not technological but organizational. Removing scientific publishers from the equation will be a fight as bitter as it is necessary, one that should bring together researchers, governments and big pharmaceutical companies, whose interests are aligned. While we may be relatively pessimistic about the ability of these different actors to cooperate, we cannot doubt the fantastic power of transparency of the Blockchain which, combined with the determination of entrepreneurs such as the founders of the Pluto, Scienceroot, ScienceMatters and Orvium platforms, will be a decisive tool in this fight to revolutionize access to scientific knowledge.
The words and opinions expressed in this article are those of the author. The other authors involved in Resolving Pharma are not associated with it.