Methodology

A full and detailed methodology can be found in both of our reports. We provide a short summary below, but encourage interested readers to consult the full methodology there.

What is our data source?

Research publication data covering the years 2005 to 2025 was downloaded from the Web of Science (WoS) Core Collection database. The WoS Core Collection was selected because it’s heavily used by researchers who study scientific trends and it has well-understood performance characteristics. The dataset included conference and journal publications and excluded bibliographic records that were deemed not to reflect research advances, such as book reviews, retracted publications and letters submitted to academic journals. In addition, we used data from the Research Organization Registry (ROR) to assign laboratories and centres to their parent research institution, and data from the Open Researcher and Contributor ID (ORCID) database to build career profiles for the researchers plotted in the talent flow visualisations.

What do we mean by ‘quality research’ or 'high-impact research'?

We use these two terms interchangeably. While there will of course be exceptions, it will usually be the case that quality research is high-impact research.

Distinguishing innovative and high-impact research from low-impact research is critical when estimating the current and future technical capability of nations. Not all of the millions of research papers published each year will meaningfully advance technological capabilities.

As such, we define 'quality research' as research from papers that are in the top 10% most cited, compared with other papers published in the same technology area in the same year.

There are certainly limitations to defining quality in this way, but it remains a widely used measure of research quality in scientometrics due to its tractability and generality.
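As an illustration only (the field names, citation counts and column labels below are invented, not drawn from our dataset), the following sketch shows how a top-10% cut-off within each technology field and publication year could be computed:

```python
import pandas as pd

# Invented example data: one row per paper, with a technology field,
# publication year and citation count (all values are illustrative).
papers = pd.DataFrame({
    "paper_id":  ["p1", "p2", "p3", "p4", "p5", "p6"],
    "field":     ["AI", "AI", "AI", "AI", "quantum", "quantum"],
    "year":      [2020, 2020, 2020, 2020, 2020, 2020],
    "citations": [120, 15, 3, 48, 60, 2],
})

# Rank each paper only against papers in the same field and year,
# then flag those in the top 10% most cited as 'quality research'.
papers["pct_rank"] = (
    papers.groupby(["field", "year"])["citations"]
          .rank(pct=True, ascending=True)
)
papers["is_top_10_percent"] = papers["pct_rank"] >= 0.9

print(papers[["paper_id", "field", "year", "citations", "is_top_10_percent"]])
```

In practice, ties and very small field-year cohorts need careful handling, which this sketch glosses over.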

What’s a citation?

When a scientific paper references another paper, that’s known as a citation. The number of times a paper is cited reflects the impact of the paper. As time goes by, there are more opportunities for a paper to be cited, so only papers of a similar age should be compared using citation counts (as was done in this report).

How do we allocate credit to countries or institutions?

Credit for each publication is divided among all authors listed on a particular publication. This per-author credit is then further split between their affiliated institutions and the countries they are based in. This is why institution and country credits can be fractional (not whole numbers).

Other methods that were considered included assigning all credit to the first listed author, or assigning equal credit to all institutions or countries associated with the production of a particular publication. The fractional allocation method used in our analysis does a better job of ensuring that all authors, and all papers, contribute equally to the final results.
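As a rough sketch of the fractional approach (the authors and affiliations below are invented): one unit of credit per paper is split equally among its authors, and each author's share is then split equally among their listed affiliations.

```python
from collections import defaultdict

# Invented author and affiliation data for a single paper.
# Each author may list one or more (institution, country) affiliations.
paper_authors = [
    {"name": "Author A", "affiliations": [("Univ X", "Australia")]},
    {"name": "Author B", "affiliations": [("Univ Y", "USA"), ("Lab Z", "USA")]},
    {"name": "Author C", "affiliations": [("Univ W", "China")]},
]

institution_credit = defaultdict(float)
country_credit = defaultdict(float)

# One unit of credit per paper, divided equally among authors,
# then divided equally among each author's listed affiliations.
author_share = 1.0 / len(paper_authors)
for author in paper_authors:
    affiliation_share = author_share / len(author["affiliations"])
    for institution, country in author["affiliations"]:
        institution_credit[institution] += affiliation_share
        country_credit[country] += affiliation_share

print(dict(institution_credit))  # fractional credits, e.g. Univ Y receives 1/6
print(dict(country_credit))      # each country here ends up with 1/3
```

Summing these fractional shares across all papers in a technology field then yields country- and institution-level totals, with each paper contributing exactly one unit of credit regardless of how many authors it has.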

What’s the H-index?

The H-index is calculated by ranking a set of publications from most to least cited and finding the largest rank at which the citation count is still at least equal to that rank. This metric, most commonly used to measure the research performance of an individual researcher, balances both the quality and quantity of research. A single highly cited paper or a large number of rarely cited papers will both result in a low H-index, whereas a high H-index reflects an individual (or country or institution) that consistently produces high-impact research.
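The standard calculation can be written in a few lines; the sketch below is illustrative, with invented citation counts.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, citations in enumerate(counts, start=1):
        if citations >= rank:
            h = rank
        else:
            break
    return h

print(h_index([500]))                  # 1: a single highly cited paper
print(h_index([1, 1, 1, 1, 1]))        # 1: many rarely cited papers
print(h_index([25, 18, 12, 9, 6, 3]))  # 5: consistently well-cited output
```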

How do we clean our data?

Allocating country and institution credit requires countries and institutions to be clearly identified so that variations of the same name can be counted together (for example, ‘USA’ and ‘United States’ should be considered the same country). The WoS address data is structured, in the sense that there’s a general pattern in how the address is expressed. But within that pattern, there is considerable variation in how authors reference their countries and especially their institutions.

In the case of country names, this process was relatively simple: the number of variations is constrained because there are only a handful of cases in which genuine name variations exist (for example, ‘the Czech Republic’ versus ‘Czechia’).

The standardisation of institution names was more intensive than standardising country names for two main reasons:

  1. the larger number of potential institutions and the much greater variation in how those institutions may be referred to
  2. the need to consider aggregating institutions whose operations are very closely linked or jointly managed, or that have merged entirely over the 21 years covered

We dealt with this by creating a custom institution dictionary that captures common spellings, aliases, name changes and organisational relationships for a long list of institutions. Since this program of work began, the dictionary has grown from its initial size of around 400 corrections to more than 2,500. We use data from the Research Organization Registry (ROR) to help create this dictionary, supplemented with manual research using a variety of resources (including ASPI’s Chinese Defence Universities Tracker) to capture additional institutions not in the ROR database.
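To illustrate the general approach (the aliases below are invented stand-ins rather than entries from our actual dictionary), standardisation amounts to normalising a raw name and mapping known variants to a canonical form:

```python
# Invented alias entries; the real dictionary holds thousands of spellings,
# aliases, name changes and organisational relationships.
INSTITUTION_ALIASES = {
    "massachusetts inst technol": "Massachusetts Institute of Technology",
    "mit": "Massachusetts Institute of Technology",
    "univ melbourne": "University of Melbourne",
    "the university of melbourne": "University of Melbourne",
}

COUNTRY_ALIASES = {
    "usa": "United States",
    "united states of america": "United States",
    "czech republic": "Czechia",
}

def standardise(raw_name, aliases):
    """Normalise case and whitespace, then map known variants to a canonical name."""
    key = " ".join(raw_name.lower().split())
    return aliases.get(key, raw_name.strip())

print(standardise("Massachusetts Inst Technol", INSTITUTION_ALIASES))
print(standardise("USA", COUNTRY_ALIASES))
```

A flat alias map like this only approximates the full dictionary, which also records relationships such as laboratories rolling up to a parent institution and institutions that have merged over time.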

How do we identify researcher movement between countries?

The talent flow Sankey diagram tracks the career trajectories of researchers working in each of the 74 critical technologies. In line with our emphasis on high-impact research, we focus specifically on the most impactful researchers in each technology, defined as the cohort of researchers who published a paper within the top 5% and 10% of most cited papers in each technology field in each year.

The talent flow has three distinct nodes: undergraduate, postgraduate and employment. The employment node is sourced primarily from each researcher’s most recent publication in the WoS dataset. The undergraduate and postgraduate nodes are constructed using the ORCID database, which assigns a unique and persistent digital identifier (an ORCID iD) to each researcher that can be used to link them to their professional activities (published papers, positions held and degrees/qualifications obtained).

These ORCID iDs are often included by authors in their submissions to research journals and are captured in the WoS database. By using ORCID iDs to link the WoS and ORCID databases, we are able to identify where each researcher did their undergraduate and postgraduate training. It is important to note that many authors do not have an ORCID iD, and of those that do, many do not have profiles complete enough to determine the country in which they completed their undergraduate and postgraduate training. As such, the data captured in the talent flow Sankey diagram represents only a sample of the wider researcher cohort, rather than the complete cohort itself.
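As an illustration of the kind of linkage involved (the ORCID iDs, field names and records below are entirely invented), WoS author records and ORCID education records can be joined on the ORCID iD:

```python
import pandas as pd

# Invented WoS author records keyed by ORCID iD, with the country of
# employment taken from each researcher's most recent publication.
wos_authors = pd.DataFrame({
    "orcid": ["0000-0001-0000-0001", "0000-0002-0000-0002"],
    "employment_country": ["United States", "Australia"],
})

# Invented ORCID education records giving degree level and country.
orcid_education = pd.DataFrame({
    "orcid": ["0000-0001-0000-0001", "0000-0001-0000-0001", "0000-0002-0000-0002"],
    "degree": ["undergraduate", "postgraduate", "undergraduate"],
    "country": ["China", "United States", "Australia"],
})

# Reshape so each researcher has one row with undergraduate and
# postgraduate training countries, then join on the ORCID iD.
education_wide = (
    orcid_education.pivot_table(index="orcid", columns="degree",
                                values="country", aggfunc="first")
                   .reset_index()
)
talent_flow = wos_authors.merge(education_wide, on="orcid", how="left")

# Missing values in the result reflect the incomplete-profile problem
# noted above: the second researcher has no postgraduate record.
print(talent_flow)
```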

What is the ‘technology monopoly risk’ metric and how is it calculated?

The technology monopoly risk traffic light highlights concentrations of technical expertise in a single country. It incorporates two factors: how far ahead the leading country is relative to the next closest competitor, and how many of the world’s top 10 research institutions are located in the leading country. Naturally, these are related, as leading institutions are required to produce high-impact research. This metric, based on research output, is intended as a leading indicator for potential future dominance in technology capability (such as military and intelligence capability and manufacturing market share).

The default position is low. To move up a level, both criteria must be met:

  • High risk = 8+/10 top institutions in the first ranked country and at least 3x research lead
  • Medium risk = 5+/10 top institutions in the first ranked country and at least 2x research lead
  • Low risk = medium criteria not met

Example: If a country has a 3.5 times research lead but ‘only’ four of the top 10 institutions, it will rate low, because it does not meet both criteria at even the medium level (the institution criterion is not satisfied).
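These thresholds translate directly into a simple decision rule; the sketch below restates them in code (the function and argument names are ours, for illustration only):

```python
def monopoly_risk(top10_institutions_in_leader, research_lead_ratio):
    """Traffic-light rating from the two criteria described above.

    top10_institutions_in_leader: how many of the world's top 10 research
        institutions are located in the first-ranked country (0-10).
    research_lead_ratio: the leading country's high-impact research output
        divided by that of the next closest competitor.
    Both criteria at a given level must be met to reach that level.
    """
    if top10_institutions_in_leader >= 8 and research_lead_ratio >= 3:
        return "high"
    if top10_institutions_in_leader >= 5 and research_lead_ratio >= 2:
        return "medium"
    return "low"

print(monopoly_risk(8, 3.2))   # high
print(monopoly_risk(6, 2.5))   # medium
print(monopoly_risk(4, 3.5))   # low: the institution criterion fails at every level
```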