openalexR: An R-Tool for Collecting Bibliometric Data from OpenAlex

Bibliographic databases are indispensable sources of information on published literature. OpenAlex is an open-source collection of academic metadata that enable comprehensive bibliographic analyses (Priem, Piwowar, and Orr 2022). In this paper, we provide details on the implementation of openalexR, an R package to interface with the OpenAlex API. We present a general overview of its main functions and several detailed examples of its use. Following best API package practices, openalexR offers an intuitive interface for collecting information on different entities, including works, authors, institutions, sources, and concepts. openalexR exposes to the user different API parameters including filtering, searching, sorting, and grouping. This new open-source package is well-documented and available on CRAN.


Introduction
Bibliographic data sources are organized digital collections of reference metadata and citation links related to published scientific literature.One primary purpose is to assess the scholarly performance, although it is important to exercise caution when employing bibliometric indicators for performance evaluation (Van Noorden, 2013;Hicks et al., 2015;Priem et al., 2011).Another objective of bibliometric analyzes is science mapping, a process that involves synthesizing extensive data, prioritizing impactful research, and extracting key knowledge structures (Chen, 2017).Given the rapid proliferation of scientific publications, science mapping evidence is particularly valuable for reconstructing the theoretical framework of empirical studies or literature reviews.
Recently, new sources of multidisciplinary bibliographic data have emerged, including Microsoft Academic, launched in 2016 (Sinha et al., 2015;Wang et al., 2019Wang et al., , 2020)), Semantic Scholar launched in 2015 (Ammar et al., 2018), and CrossRef, an open bibliographic data source launched in 2017 (Hendricks et al., 2020;Van Eck et al., 2018).Dimensions is a scientometric data source that provides also information on grants, datasets, clinical trials, patents, and policy documents (Herzog et al., 2020;Hook et al., 2018).These databases complement the many existing open and commercial sources.The two most widely used commercial multidisciplinary databases are Web of Science and Scopus while the specialized ones include PubMed, EconBiz, and arXiv, which are the main open bibliographic sources for medicine, economics, and physical sciences and engineering, respectively.
The value of these open and commercial bibliographic data sources hinges on several characteristics: (1) document coverage, (2) completeness and accuracy of citation links, (3) update speed, (4) automation of data access through web interfaces, APIs, and data dumps, and (5) the terms of use for a data source (Wanyama et al., 2022;Kulkanjanapiban and Silwattananusarn, 2022;Singh et al., 2021;Martín-Martín et al., 2021;Visser et al., 2021;Waltman and Larivière, 2020;Winter, 2017).OpenAlex is recognized for providing the most extensive coverage of scientific literature, encompassing a notably larger number of documents compared to other data sources (see https://openalex.org/about).Its document coverage outpaces all major databases, including Microsoft Academic (Visser et al., 2021;Wang et al., 2020) and CrossRef, which are its primary sources.While Google Scholar reportedly boasts an estimated 389 million database, it is crucial to note that Google Scholar does not adhere to the conventional database model due to its lack of comprehensive metadata, and it does not permit users to download query search results (Dallas et al., 2018).
Another remarkable strength of OpenAlex is its substantial collection of open-access works, totaling 48 million.This extensive repository grants users open and free access to a wealth of scholarly resources, aligning with OpenAlex's dedication to open science principles.This commitment promotes both accessibility and transparency in the sharing of knowledge.Furthermore, OpenAlex stands out not only in terms of quantity but also in data quality.With a substantial number of citations amounting to 1.9 billion, OpenAlex emphasizes its relevance for researchers engaged in citation-based studies.This robust citation data enhances its appeal as a valuable resource for scholars conducting research reliant on citation analysis.
The purpose of this article is to introduce the OpenAlexR R package, which facilitates the retrieval of metadata from OpenAlex, performs specific functions, and formats the data for utilization in bibliometrix, particularly for science mapping and research assessment purposes.(Priem et al., 2022).Behind this tour de force is OurResearch, a nonprofit organisation dedicated to open principles of academic works with other impactful projects such as CiteAs (Du et al., 2021) and Unpaywall (Chawla, 2017).OpenAlex was launched in January 2022, timely replacing the retired Microsoft Academic system.OpenAlex is already extensively used in many scholarly articles (Belfiore et al., 2022).The data is currently accessible through a complete database snapshot and a REST API that is updated daily.
The OpenAlex database consists of eight academic entity types: works, authors, institutions, sources, concepts, publishers, funders, geo (Fig. 1).It is important to know the OpenAlex entities because it is possible to make query for each entity.In fact, each entity is assigned an OpenAlex ID (OAID) which represents the primary key to access the data.However, OpenAlex also recognizes different external canonical IDs for different entities.We briefly summarise the eight entities below.For more detail, visit the documentation page and the more recent Postgres schema diagram by OpenAlex.• Works: academic documents such as journal articles, proceedings, books, and datasets.The Works entity is central in that it ties together the other four entities (Fig. 1).
OpenAlex indexes over 200 million works.One can identify a work by its OAID or DOI, a unique alphanumeric string assigned to a digital document, often a research article.
• Authors: individuals who create works.Authors create Works, study Concepts, and are affiliated with Institutions.OpenAlex indexes more than 200 million authors.One can identify an author by their OAID or ORCID, a persistent and unique identifier assigned to researchers.
• Institutions: universities and organisations affiliated with authors.Institutions are linked to Works via Authors.OpenAlex indexes more than 100,000 institutions.One can identify an institution by its OAID or ROR ID, a persistent identifier for research organizations.
• sources: repositories that house works such as journals, conferences, preprint repositories, or institutional repositories.OpenAlex uses a fingerprinting algorithm to match multiple locations a work may be hosted in and flag the version of the record's host as primary.OpenAlex indexes more than 100,000 sources.One can identify a source by its OAID or ISSN-L, i.e., a single ISSN that groups the publication's all possible ISSNs (standardized numeric identifiers assigned to serial publications).
• Concepts: topics of works, deduced from their titles and abstracts.To identify a concept, you can use the OAID or its Wikidata ID, a unique identifier assigned to that entity in Wikidata because all OpenAlex concepts are also Wikidata concepts.The concepts follow a hierarchy; there are 19 concepts at the root level (0-level) and 5 layers of descendants.OpenAlex indexes ∼ 65,000 concepts.
• Publishers: companies and organizations that distribute academic documents.Each publisher publishes multiple journals, so publisher data is aggregate data.OpenAlex indexes about 10,000 publishers.
The R Journal Vol.15/4, December 2023 ISSN 2073-4859 • Funders: public and private companies that fund research.OpenAlex indexes about 30,000 funders.Funder data comes from Crossref, and is enhanced with data from Wikidata and ROR.
• Geo: works produced in the country based on the nationality of the institution with which the author is affiliated.In particular, there are some ways to filter and group academic documents by continents and the Global South.OpenAlex uses United Nations data to divide the globe into continents and regions, making it easier to filter the data.

Implementation of openalexR
Interacting with the OpenAlex API, openalexR provides easy querying and downloading of scholarly metadata as well as converting the output into a classic bibliographic dataframe (Fig. 2), which allows the user to streamline their data collection and downstream analyses (Aria, 2022).

Figure 2: openalexR workflow
With minimal dependencies, openalexR lowers the barrier to using the REST API by simplifying input entry, handling rate limits, and automatically parsing API responses.The package also offers other useful functionality such as snowball search and N-grams of works.openalexR is currently listed on the OpenAlex website as the supported R library for API access.The main function of openalexR, oa_fetch, is a convenience wrapper for three smaller functions to 1. generate a valid query from a set of arguments provided by the user (oa_query), 2. download a collection of entities matching the query (oa_request), 3. and convert the list output to a classical bibliographic data frame (oa2df).Specifically, in constructing valid queries following the OpenAlex API syntax, oa_query utilizes the modify_url function of the httr package (Wickham, 2022) to pass the filter, sort, search, and group_by parameters specified by the user.Next, oa_request sends a request to OpenAlex, downloads the JSON output matching the created query, and returns the result in a nested list.Finally, oa2df converts this output list into a classic bibliographic data frame (similar to an Excel sheet) that can be used as input in a bibliometric analysis or scientific mapping (Wais, 2016), e.g., using the bibliometrix package (Aria and Cuccurullo, 2017).
In addition to oa_fetch, the package openalexR includes three other functions specific to certain features: oa_random, oa_snowball, and oa_ngrams.The function oa_random, similar to oa_fetch, returns a record randomly.This function is particularly useful for casual sampling purposes.For instance, when conducting an analysis of gender bias within the academy, one could utilize this function to randomly query the database.
oa_snowball enables the user to perform snowballing.Snowballing, or snowball search, is a literature search technique where the researcher starts with a set of articles and finds other articles that cite (forward citations) or were cited by (backward citations) the original set (Wohlin, 2014).Metaanalysis researchers often employ this technique to collect relevant primary studies, where sufficient iterations of snowballing can converge on and exhaust the target literature space (Siddaway et al., 2019).The traditional method of conducting a snowball search involves manual effort.However, computerassisted snowball search can effectively reduce the time and resources required while maintaining a representative coverage of the target literature (McWeeny et al., 2021(McWeeny et al., , 2022)).oa_snowball takes a vector of OpenAlex IDs as input and returns a list of 2 elements: nodes and edges.oa_snowball locates and retrieves information on articles that cite or are cited by the initial set of articles (nodes) and also the relationships between these articles (edges).Following tidygraph's convention (Pedersen, 2022b), the edges dataframe contains 2 columns of OpenAlex IDs, from and to.The row from A to B means A cites B.
The oa_ngrams function is used to obtain N-grams from a specific set of works.N-grams are defined as sequences of n words that occur within a work and are commonly used in language The R Journal Vol.15/4, December 2023 ISSN 2073-4859 modeling and text analysis to identify relationships between words.The use of N-grams allows for the analysis of the frequency and distribution of words within a text or set of texts and is widely employed in fields such as natural language processing, machine translation, and sentiment analysis.Some work entities in OpenAlex include N-grams of their full text, which are obtained from the Internet Archive using the spaCy parser to index academic works.The openalexR package offers the capability to extract N-grams of works through the oa_ngrams function.This function takes a vector of OpenAlex IDs as input and returns a list of N-grams and their corresponding frequencies.

Polite use
The OpenAlex API doesn't require authentication but requires following polite usage.To get into the polite pool, it is necessary to provide a user e-mail address through the mailto parameter in R options options(openalexR.mailto= "example@email.com") in all API requests.The polite pool has much faster and more consistent response times.

Examples of use
We show many different examples of typical use cases on the package's README and vignettes.
First, we demonstrate an example that uses a few different filters.We want to describe the use of "bibliometrics" approaches in the scientific literature.We first describe how to make the query that allows us to answer this search question.After a brief description of the concept "bibliometrics", we identify the scientific literature that has used the concept "bibliometrics" in OpenAlex.Finally, focusing on the metadata offered by OpenAlex, we analyse the most relevant sources, authors, institutions, and works on bibliometrics.

The bibliometrics concept
We define the search on the entity "concepts" by filtering the "bibliometrics" topic associated with the id C178315738.The function oa_fetch generates the query from the set of arguments provided to it, downloads the set of concepts that match the query, and converts the output into a classical bibliographic data frame.Concepts can be queried using concept IDs or by concept name searching.
Searching "bibliometrics" concept by name: concept <-oa_fetch( entity = "concepts", display_name.search= "bibliometrics" # search by concept name "bibliometrics" ) concept$id # [1] "https://openalex.org/C178315738" The R Journal Vol.15/4, December 2023 ISSN 2073-4859 cat(concept$description, "\nis a level", concept$level, "concept") # [1] statistical analysis of written publications, such as books or articles is a level 2 concept Alternatively, once we know the OAID for a concept, we can search by its ID and get a similar result: concept <-oa_fetch( entity = "concepts", identifier = "C178315738" # OAID for "bibliometrics" ) cat(concept$description, "\nis a level", concept$level, "concept") # [1] statistical analysis of written publications, such as books or articles is a level 2 concept To describe which concepts are related to the term bibliometrics, let's analyze the OpenAlex hierarchy.In OpenAlex each work is tagged with multiple concepts, based on the title, abstract and host source title.A score is available for each concept in a work, demonstrating how well that concept represents the work to which it was assigned.However, when a lower-level descendant concept is assigned, all of its antecedent concepts are also assigned.Since bibliometrics is a level 2 concept in OpenAlex, we have detailed information on the concepts related to ancestors (level 0 or 1), peers (level 2), and descendants (level 3).Visualising all bibliometrics-related concepts together, we observe an increasing trend in the subfields of bibliometrics, peer review, scientific literature and scientometrics.Conversely, there has been a reduction in the number of papers in the subfields of webometrics, knowledge organisation, information science, collection development, and altmetrics.PageRank and Impact factor, concepts have remained stable in popularity over the last 10 years.Compared to other topics, citation has the highest number of papers over time (Fig. 3).

Bibliometrics dataset
We download all works, included in OpenAlex, that have the words bibliometrics or science mapping in the title to map bibliometric approaches in the scientific literature.We set the query using the Our query returns 26,953 works concerning bibliometrics.By default, the oa_fetch converts these records in a tibble object (dataframe) with each row containing information about a work.This dataframe has 28 columns containing important information about a work, such as the publication date, DOI, reference works, and so on.If users wish to convert the original nested list into another object, they can change the parameters in the following way oa_fetch(..., output = "list").From this dataset, we could describe the most relevant sources, authors, and institutions.

Most relevant sources
We first identify the core journals for the discipline by tallying all bibliometrics-related works for each source and selecting the top 5 sources with the most works.
The R Journal Vol.15/4, December 2023 ISSN 2073-4859 Visualising these counts over the years (Fig. 4), we observe the field has expanded overall, especially starting around 2015.Scientometrics is the oldest journal publishing on bibliometrics and remains the top source for these articles.Other journals started to publish these works in 2015 and have maintained some volume in this field.Sustainability only started publishing in this field in 2017 but has rapidly increased its number of publications since.In 2021, it was the second source to have published the most bibliometrics articles, after Scientometrics.

Most relevant authors and institutions
Secondly, we identify the authors and institutions most relevant to the discipline by extracting the list of author and institution-related metadata from the collection, tallying all bibliometrics-related works for each author, and selecting the top 10 authors who write the most articles and 10 institutions with the most publications in the field.Among the 84,335 authors in our collection, Yuh-Shan Ho appears to have published the most bibliometrics articles.He has over 150 bibliometric papers in his career.Another relevant author in our collection is Lutz Bornmann with more than 100 bibliometrics papers.The other authors in the top 10 all have between 50 and 100 papers (Fig. 5).
With regard to institutions, we observe that the concept bibliometrics is widely developed by different centres of expertise.At the top of the ranking, we see Beijing University of Chinese Medicine (China) and University of Granada (Spain) with over 250 articles.Other relevant institutions in our dataset include Sichuan University (China), Indiana University (US), Leiden University (Netherlands), An-Najah National University (Palestine) with more than 200 bibliometric papers.Other institutions in the top 10 have between 150 and 200 papers (Fig. 5).

Most relevant works
We identify the most relevant papers for the discipline by tallying all citations of articles related to bibliometrics and selecting the top 10 most cited articles (Tab.2).We find 'Software survey: VOSviewer, a computer programme for bibliometric mapping' at the top with 4805 citations.Other relevant works in our collection are 'Comparison of PubMed, Scopus, Web of Science, and Google Scholar: strengths and weaknesses' with 1996 citations and 'bibliometrix: An R-tool for comprehensive science mapping analysis' with 1800 citations.The other papers in the top 10 all have between 950 and 1500 citations.

Snowball search
We perform snowballing with oa_snowball to identify the set of articles that cite and are cited by the two seminal works associated with the concept of bibliometrics: Software survey: VOSviewer, a computer programme for bibliometric mapping (W2150220236), bibliometrix : An R-tool for comprehensive science mapping analysis (W2755950973).We insert these OAIDs as identifiers in oa_snowball, use the filter on the citations obtaining only those related to 2022, then use tidygraph (Pedersen, 2022b) and ggraph (Pedersen, 2022a) to display this citation network (Fig. 6).oa_snowball returns a list of 2 elements: nodes and edges.The first have information about the work, while edges have the start and end points of the links.This list output from oa_snowball can be used directly as input to standard graph functions such as tidygraph::as_tbl_graph for further network analyses and visualisations in co-citation analysis, historiograph analysis, etc.
sb_docs <-oa_snowball( identifier = c("W2150220236", "W2755950973"), citing_filter = list(from_publication_date = "2022-01-01") ) # Reduced output print(sb_docs) $nodes # A tibble: 5,769 × 37 id display_name <chr> <chr> 1 W2150220236 Software survey: VOSviewer, a computer program for bibliometric ... 2 W2755950973 bibliometrix : An R-tool for comprehensive science mapping analysis 3 W4306178549 Literature reviews as independent studies: guidelines for academic ...  oa_snowball finds and returns the metadata on the two seminal works in our dataset, and also information on the articles that cite and are cited by them, in 2022.We have a total of 2,907 citing articles: 2,078 articles cite van Eck et al., 1,161 articles cite Aria and Cuccurullo.In addition, as we can see from the graph, there are 332 articles citing both van Eck et al. and Aria and Cuccurullo, simultaneously.

N-grams
Finally, we obtain the N-grams of all the bibliometric works that make up our collection.N-grams are groups of words that occur in the full text of a work.To extract n-grams we use the oa_ngrams function from the openalexR package.From this list we then extract only the bigrams because we believe they can be more informative.as circular economy, climate change, management strategy, and energy management (Fig. 7).In addition, there are themes related to political risk, organizational learning, and cooperation within organizations and social networks.In general, these topics suggest the need for greater attention to environmental and social issues in business management and risk prevention.Finally, bibliometric model (scientific performance evaluation) and credit union (financial regulation) are interrelated to bibliometrics, which requires a global vision and cooperation among different stakeholders to be effectively applied.

Figure 1 :
Figure 1: Eight OpenAlex entities and their first few attributes.

Figure 3 :
Figure 3: Trends in bibliometrics-related topics in the past 10 years.

Figure 4 :
Figure 4: Number of bibliometrics articles by journal over the years.

Figure 5 :
Figure 5: Most relevant authors and institutions.

Figure 6 :
Figure 6: Two seminal works and their citations and references, from oa_snowball output.

Figure 7 :
Figure 7: TreeMap of the top 10 bi-grams of bibliometrics articles.
We find 4 ancestor, 11 equal-level and 9 descendant concepts of bibliometrics (Tab.1).The resulting hierarchy of bibliometrics can enable us to analyse, for example, all equal-level concepts.

Table 1 :
Concepts related to bibliometrics: ancestors, equal-levels, and descendants.following parameters: entity is "works"; title.search is "bibliometrics|science mapping" and, for the first part, count_only is TRUE so we can see how many records will be returned.

6 Summary openalexR is
a new package that facilitates querying, collecting, and downloading bibliographic metadata of OpenAlex entities through the provided REST APIs.It is available on CRAN at https: //cran.r-project.org/package=openalexR. openalexR helps streamline the researcher's workflow in accessing, collecting, and wrangling OpenAlex data.Extensive documentation, comprehensive tests of the package's internal functions, and common use cases are provided, sufficiently covering the current OpenAlex API.The source code and development versions are available at https://github.com/ropensci/openalexR.The current version of the package is a stable version, and there are no plans for any breaking changes soon.Of course, openalexR will continue to be actively maintained to keep up with CRAN policies and distribute any bug fixes.Bug reports, help requests, or improvement suggestions are welcome in the package software repository.For more information on openalexR, vignettes are available at https://ropensci.github.io/openalexR/articles/.Examples we show in this manuscript can be found at https://github.com/trangdata/oarj/blob/main/paper-examples.md.