How to prioritize the improvement of open-source software security

In this photo illustration, Apache Log4j logo of a Java-based logging utility is seen on a smartphone screen. (Photo by Pavlo Gonchar / SOPA Images/Sipa USA)No Use Germany.

Earlier this year, major technology companies, non-profits, and government agencies convened for an urgent meeting at the White House to discuss how best to address the security concerns posed by free and open-source software (FOSS)—software that is developed by a distributed community rather than a centralized company. For years, tech companies and security experts have made the case for greater investments in the security of the FOSS ecosystem, as it has become an increasingly important part of critical digital infrastructure. The importance of doing so was highlighted by the recent Log4Shell vulnerability in the log4j FOSS package. Deployed across a vast range of digital applications, log4j exposed a huge amount of software to a devastating security vulnerability and illustrated the urgent need to improve security in open-source software.

FOSS is decentralized and free to use, so when security vulnerabilities are found it is difficult to determine the exact extent of the threat. Perhaps the most vexing part of the problem is that it is difficult to know which FOSS packages are most widely used (and therefore most concerning if a vulnerability is found in a given package). This lack of knowledge about which FOSS packages are deployed—and where—leaves defenders in the dark and makes hard decisions about where to deploy resources even more difficult.

To address this problem, our team at the Laboratory for Innovation Science at Harvard (LISH) has partnered with the Linux Foundation and the Open Source Security Foundation (OpenSSF) to determine which FOSS packages are most widely deployed. Our findings, documented in a report released today, provide a detailed look at which FOSS packages are deployed in production applications and offer a number of lessons for policymakers and developers about how to improve the security of a critical building block of the digital economy.

log4j and Log4Shell

First released in 1999, log4j is a FOSS component that carries out logging tasks for other pieces of software built on top of it. For example, if a developer of a piece of software needs to log all activity in an application for auditing or debugging purposes, she can utilize the log4j component so she does not have to build such logging functionality from scratch. log4j is extremely popular and is used in production software at companies including Apple, Google, Amazon, Twitter, and Tesla.

As early as 2013, a bug was introduced in the log4j code that treated logged text as code and executed it on the underlying system. Thus, an attacker would simply need to perform an action that would be logged (e.g., changing their username, writing a message in a chat, etc.) using a specific line of code, which would then be executed by the system, including reaching out to a server on the internet and downloading and running a piece of malicious code hosted there. Discovered in November 2021 by a member of Alibaba’s security team, the vulnerability was named Log4Shell.

The widespread use of log4j (potentially tens of millions of devices), combined with the ease of exploitation (a simple line of code), created a worst-case scenario. To that end, Jen Easterly, the director of the United States Department of Homeland Security (DHS) Cybersecurity and Infrastructure Security Agency (CISA) called Log4Shell “the most serious vulnerability I’ve seen in my decades-long career.” Within days of the release of the patch (long before most organizations could install it), there were over 800,000 attacks in a 72-hour period. Chinese and Iranian government-sponsored actors were observed taking advantage of the vulnerability.

The Log4Shell vulnerability is an important example of a much larger issue. FOSS has become a critical building block of the modern economy. However, its distributed and decentralized nature leaves it susceptible to significant bugs that can go unnoticed by developers for years. Further, and even more concerning, is that when such a vulnerability is found, because FOSS is built into nearly every software system, but is not well tracked, it may be difficult to identify all vulnerable instances of the software that are in production.

Prioritizing efforts to address the issue

To determine which FOSS packages are the most widely used (and therefore, the most concerning if a vulnerability is found in them) our team at LISH teamed up with the Linux Foundation and the OpenSSF. We worked with software composition analysis (SCA) companies to aggregate data on the most widely used FOSS packages. SCAs are hired by their customers to scan their codebases to help ensure they are not violating any software licenses. Therefore, by working with just a handful of SCAs, we were able to get insights into FOSS built into products sold by thousands of companies. While this method allowed us to get deep insights into the FOSS companies build into their software, this is only one layer of the technology stack, albeit an important one. In future studies we will consider other layers in the stack.

By identifying the most widely used FOSS packages, we hope to improve efforts to enhance the security of FOSS packages by looking for vulnerabilities in the most popular FOSS packages first. (Our final report can be found here.)

To ensure the privacy of the data shared by the SCAs, and to account for different size customer bases across the SCAs, we utilized statistical z-scores to aggregate the data and organize it such that we could rank-order the FOSS packages observed. Since the FOSS packages that developers build into their software frequently rely on other FOSS packages themselves, we considered both the “direct” observations of FOSS packages developers built upon, as well as the “indirect” FOSS packages those packages iteratively rely upon. Additionally, due to the differences in norms in computer programming languages related to the number of functions in a given package (and therefore how many packages a piece of software relies upon), we considered the npm repository (which hosts JavaScript packages) separately from all other repositories and languages. Not doing this would have caused JavaScript packages to incorrectly dominate the list. Finally, we considered FOSS packages in both a versioned and version-agnostic manner such that different levels of granularity could be observed.

In aggregate, we analyzed nearly 600,000 data points from the SCAs, and compiled lists documenting the 500 most used FOSS packages, one for each combination of direct/indirect, npm/non-npm, and versioned/version-agnostic packages. Although this more granular approach makes it harder to precisely say which FOSS packages are the most widely used, it provides more insight into the intricacies of the ecosystem. For example, log4j showed up as number 38 on our list of direct, non-npm, version-agnostic packages, but as number 126 on our list of indirect, non-npm, version-agnostic packages. Moreover, FOSS packages whose primary purpose are to pass data to a logger, potentially including log4j, (e.g., slf4j-api and log4j-api) showed up even higher on our lists (slf4j-api was number 1 on our list of direct, non-npm, version-agnostic packages). However, without deeper insights into how such packages were being used, it was not possible to know if they were relying on a vulnerable version of log4j.

The complexities of log4j became even more intricate when considering version numbers. By a nearly 3 to 1 margin, version 1.x of log4j was much more widely used than version 2.x. However, the Log4Shell vulnerability did not impact version 1.x, and therefore the bulk of log4j users in our dataset were not actually susceptible to the Log4Shell issue (although there are numerous vulnerabilities in the 1.x versions that remain unfixed since it has not been updated since 2015). In aggregate, despite the complexities of our results, they allow for an intricate understanding of the Log4Shell problem, and our hope is that they will also shine light on similar intricacies to help prevent such widespread vulnerabilities in the future.

Our report also identifies a number of high-level issues that need to be addressed if the FOSS ecosystem is to be properly secured:

  1. The need for a standardized naming schema for FOSS components. There is no centralized body to coordinate FOSS component names, and thus there can be multiple components that have the same name but are not the same component. This not only made our research more difficult but can also make it harder for a company to know if they are vulnerable to a particular issue based on the package name alone.
  2. The complexities associated with FOSS package versioning. Due to the open nature of FOSS, companies can copy a FOSS package from the public repository and then keep an internal version of the package (and updates to the package). As a result, we found version numbers for packages that did not exist in the public repository, which would make it hard for companies to know if the particular version of a FOSS package upon which they rely is vulnerable to newly discovered issues.
  3. The most widely used FOSS packages are developed by only a handful of contributors. In reviewing the contributors to the top 50 most used non-npm packages, we found that for 94% of the packages, fewer than ten developers accounted for more than 90% of the code added in 2021. This finding goes counter to the widespread assumption that hundreds or thousands of developers support widely used FOSS packages.
  4. The importance of individual FOSS developer account security. Many of the packages on our top 500 lists are hosted under a single developer’s personal account, rather than an organizational account, which has more security features than personal accounts. Therefore, such packages are more susceptible to malicious activity and either need increased security on the developer’s account (e.g., multi-factor authentication) or should be shifted to an organizational account.
  5. The persistence of legacy software in the open-source space. Legacy and outdated software is widely relied upon in production software. This has been a long-running problem with both open- and closed-source software, as upgrading is not always a trivial matter. However, our work quantifies the extent of such problems in the open-source ecosystem and is a reminder for developers to pay more attention to which versions of a package they are relying upon.

U.S. government response

The scale and scope of the vulnerabilities affecting FOSS packages have been known within the tech community for years. However, it is only recently that federal policy has reflected the importance of this issue to the economy and national security. A May 2021 executive order, for example, directed the U.S. National Institute for Standards and Technology (NIST) to provide guidance for companies on providing a software bill of materials (SBOM) to their customers. An accurate SBOM would give companies deeper insights into the software that is baked into their software, so they would know if they are vulnerable to issues like Log4Shell immediately. Other measures have been considered but failed to be made into law. Funding a FOSS security center within the Department of Homeland Security, for example, was included in the House version of the 2022 National Defense Authorization Act but didn’t make it into the final bill.

In response to the Log4Shell vulnerability, the White House National Security Council, held a meeting in January with firms like Google and Microsoft, open-source organizations including the Linux Foundation, the Apache Software Foundation, and OpenSSF, and numerous federal agencies and departments. The meeting focused on preventing, finding, and shortening response time to FOSS vulnerabilities and discussed various potential public-private partnerships. Although there were no concrete pledges from the meeting, the intent was to start a discussion, identify possible paths forward, and commit to future meetings that would yield specific commitments by the various stakeholders.

The Log4Shell issue has also garnered the attention of the U.S. Federal Trade Commission (FTC), which has threatened to fine companies that fail to patch the issue and lose customer data as a result. While the FTC’s move may encourage many companies to address the security issue, the fact that the FTC is playing a leading role in the response illustrates that the government lacks broad tools to address major cybersecurity vulnerabilities like Log4Shell.

Log4Shell was by no means the first major vulnerability in FOSS, but hopefully it represents a turning point that will inspire the federal government to take action to address this complex problem. Numerous private entities have already joined the effort by sponsoring FOSS projects and security improvement endeavors including Google’s Secure Open Source Rewards, the Plaintext Group/Schmidt Futures’ FOSS Virtual Incubator and the efforts of the OpenSSF like their recently announced Alpha-Omega Project (sponsored by Microsoft and Google). Such efforts are important, but public support for research and legislation leading to more secure FOSS is critical and cannot come soon enough.

Frank Nagle is an assistant professor of business administration at Harvard Business School. His research is supported in part by the Linux Foundation.

Amazon, Google, and Microsoft provide financial support to the Brookings Institution, a nonprofit organization devoted to rigorous, independent, in-depth public policy research.