A Complete History of Solana Outages: Causes, Fixes, and Lessons Learnt

Advanced

2/26/2025, 3:48:18 AM

This article will analyze each Solana outage in detail, examining the root causes, triggering events, and the measures taken to resolve them.

Beep, beep, beep. Beep, beep, beep.

Steven’s sleep is shattered by the harsh chimes of his phone, dragging him abruptly from his dreams. In the dark, the screen glows brightly, vibrating furiously on his bedside table. Beep, beep, beep. He groans, groggily rubbing his eyes, and reaches for the device. Squinting at the message, his heart sinks—the node is down. Without hesitation, he springs out of bed, half-dressed, fumbling to unlock his phone as more messages pour in. Then it hits him—the entire cluster is down.

At this exact moment, across the globe, in scattered cities and time zones, hundreds of node operators are staring at their phones with the same realization: The moment they dread has arrived—an outage.

Introduction

Like all distributed systems, Solana operates under the reality that a single implementation flaw or obscure edge case can lead to a network-wide failure. Outages, while disruptive, are an inevitable part of maintaining complex distributed infrastructure—whether in decentralized blockchains, centralized exchanges, or even major cloud service providers like Amazon or Microsoft.

The question is not if failures will occur, but when—and how the network evolves to adapt and harden itself against future incidents. Despite rigorous simulated testing, an incentivized testnet, and an active bug bounty program, no system—no matter how well-designed—can anticipate every possible failure mode. The most valuable lessons come from real-world operations.

Over the past five years, Solana has experienced seven separate outage incidents, five of which were caused by client bugs and two by the network’s inability to handle floods of transaction spam. Early versions of Solana lacked key congestion-management mechanisms, such as priority fees and local fee markets, which later proved essential in mitigating network stress. The lack of such mechanisms led to prolonged periods of degraded performance and congestion throughout 2022 as the network essentially incentivized spam.

Historical instances of Solana outages and degraded performance

This article will analyze each Solana outage in detail, examining the root causes, triggering events, and the measures taken to resolve them. Additionally, we will discuss the key aspects of network restarts, bug reporting, and the fundamental concepts of liveness and safety failures. Though these sections are best read in order, each is designed to stand alone, enabling readers to jump to the topics or outage incidents that interest them the most.

Liveness and Safety

According to the CAP theorem, also known as Brewer’s Theorem, a distributed system can only achieve two out of three properties:

Consistency - Every read sees all previous writes.
Availability - Every request receives a response.
Partition tolerance - The system continues operating despite network partitions.

For blockchains, partition tolerance is essential—network disruptions are inevitable. This forces a choice between AP (Availability + Partition Tolerance) and CP (Consistency + Partition Tolerance). Like most fast-finality PoS chains, Solana prioritizes consistency over availability, making it a CP system. It halts during critical failures rather than serving stale data or allowing unsafe writes. While this means node software may enter an unrecoverable state requiring manual intervention, it ensures user funds remain safe.

Solana’s position within the CAP theorem trade-offs

Liveness Failure: occurs when the blockchain stops progressing, preventing transactions from being confirmed and blocks from being produced due to validator downtime, network partitions, or consensus stalls. In the context of the CAP theorem, this corresponds to a loss of availability.

Safety Failure: occurs when the blockchain’s finalized state is altered or forked improperly. This can lead to conflicting histories or double-spending, often caused by consensus bugs or malicious attacks. In the context of the CAP theorem, this corresponds to a loss of consistency.

Solana prioritizes safety over liveness. Thus, the network will halt in cases of extreme network stress or consensus failure rather than risk state corruption. While outages are disruptive and can impact applications, users, and validators, they are preferable to the catastrophic consequences of an inconsistent or corrupted ledger.

Network Restarts

Restarting the Solana network involves identifying the last optimistically confirmed block slot and rebooting nodes from a trusted local state snapshot of that slot. Since the restart slot is not determined on-chain, validator operators must reach off-chain consensus to agree on a safe rollback point. This coordination occurs publicly in the #mb-validators channel on the Solana Tech Discord, where professional validator operators communicate in real time. Most operators have automated alerting systems that notify them the moment block production halts, ensuring a swift response.

Once a consensus is reached on the correct restart slot, operators use the ledger tool to generate a new local snapshot, reboot their validators, and wait for at least 80% of the total stake to return online. Only then does the network resume block production and validation. Verifying that there’s at most 20% offline stake when the cluster restarts ensures enough safety margin to stay online in case nodes fork or go back offline immediately after the restart.
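The restart threshold described above comes down to simple stake arithmetic. The sketch below is illustrative only: the function names and stake figures are hypothetical, and the two-thirds supermajority constant is taken from the consensus requirement mentioned later in this article.

```python
from fractions import Fraction

RESTART_THRESHOLD = Fraction(80, 100)  # from the text: at least 80% of stake online
SUPERMAJORITY = Fraction(2, 3)         # two-thirds of stake needed for consensus


def online_fraction(online_stake: dict[str, int], total_stake: int) -> Fraction:
    """Fraction of total stake held by validators that are back online."""
    return Fraction(sum(online_stake.values()), total_stake)


def can_resume(online_stake: dict[str, int], total_stake: int) -> bool:
    """True once online stake reaches the 80% restart threshold."""
    return online_fraction(online_stake, total_stake) >= RESTART_THRESHOLD


def margin_over_supermajority(online_stake: dict[str, int], total_stake: int) -> Fraction:
    """Stake that could drop offline again before reaching the two-thirds line."""
    return online_fraction(online_stake, total_stake) - SUPERMAJORITY


# Hypothetical stake figures, in arbitrary units out of a total of 100.
online = {"validator_a": 45, "validator_b": 35}
print(can_resume(online, 100))
print(float(margin_over_supermajority(online, 100)))
```

With exactly 80% of stake online, roughly 13% of total stake can still fork away or go offline before the cluster falls back to the two-thirds line, which is the safety margin the restart procedure aims for.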

Bug Reporting

Bug bounty programs reward security researchers for identifying and reporting software vulnerabilities. This is a critical line of defense, as it proactively incentivizes catching bugs before they can be exploited. Security researchers and developers who identify potential vulnerabilities in the Agave client are encouraged to report them through proper security channels. Detailed disclosure guidelines can be found in the Agave GitHub repository.

Rewards are offered for valid reports of critical vulnerabilities, with payouts based on severity:

Loss of Funds: Up to 25,000 SOL
Consensus or Safety Violations: Up to 12,500 SOL
Liveness or Loss of Availability: Up to 5,000 SOL

Additionally, the Firedancer client has a separate bug bounty program hosted through Immunefi, offering a maximum reward of 500,000 USDC for critical findings.

Outage Instances

The following sections provide a detailed, chronological analysis of Solana’s outages and periods of degraded performance, starting from the launch of Mainnet Beta on March 16, 2020. This examination will highlight key incidents, their root causes, and the network’s subsequent improvements, offering insight into how Solana has evolved to enhance stability and resilience over time.

Turbine Bug: December 2020

Downtime: Approximately six hours

Root issue: Block propagation bug

Fixes:

Track blocks by hash instead of slot number
Fix places in Turbine where fault detection can be done earlier
Propagate the first detected fault to all validators through gossip

This outage was caused by a previously known block repair and code processing issue triggered by an unidentified bug in Turbine, Solana’s block propagation mechanism. The failure occurred when a validator transmitted two different blocks for the same slot and propagated them to two separate partitions (A and B) while a third partition independently detected the inconsistency.

Since each partition held only a minority stake, none could achieve a supermajority consensus to progress the chain. The underlying issue stemmed from how Solana’s internal data structures track blocks and their computed state. The system used the Proof of History (PoH) slot number (a u64 identifier) to reference the state and the block at that slot. Once the network split into partitions, nodes misinterpreted blocks A and B as identical, preventing proper repair and block synchronization.

Each partition assumed that the other had the same block, leading to a fundamental conflict:

Nodes holding block A rejected forks derived from block B
Nodes holding block B rejected forks derived from block A

Since state transitions differed between partitions, validators could not repair or reconcile the forks, preventing finality.

The remediation for this issue was to allow services to track blocks by hash instead of slot number. If any number of blocks for the same slot create partitions, they are treated no differently than partitions with blocks that occupy different slots. Nodes will be able to repair all possible forks, and consensus will be able to resolve the partitions.
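A toy illustration of why keying by slot number alone hides the conflict, while keying by slot and hash keeps both blocks visible. The Block type and index functions are hypothetical and are not the validator's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    slot: int
    block_hash: str
    parent_hash: str


def index_by_slot(blocks):
    """Old behaviour (simplified): one entry per slot, so a second block
    for the same slot silently replaces the first."""
    index = {}
    for block in blocks:
        index[block.slot] = block
    return index


def index_by_hash(blocks):
    """Remediated behaviour (simplified): the key includes the block hash,
    so two different blocks for one slot are tracked as separate forks."""
    return {(block.slot, block.block_hash): block for block in blocks}


# Two different blocks produced for the same slot, as in the December 2020 incident.
block_a = Block(slot=10, block_hash="A", parent_hash="P")
block_b = Block(slot=10, block_hash="B", parent_hash="P")

print(len(index_by_slot([block_a, block_b])))
print(len(index_by_hash([block_a, block_b])))
```

In the slot-keyed index a node cannot even represent the fact that a peer holds a different block for slot 10, so it has nothing to repair. In the hash-keyed index both candidates exist side by side and can be resolved by consensus like any other fork.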

Although the bug was the initial cause of the outage, most of the downtime resulted from waiting for enough stake weight to come back online, as Solana requires at least 80% stake participation to resume block production.

The Grape Protocol IDO: September 2021

Downtime: Seventeen hours

Root issue: Memory overflow caused by bot transactions

Fixes:

Ignore write locks on programs
Rate limits on transaction forwarding
Configurable RPC retry behavior
TPU vote transaction prioritization

On September 14, 2021, Solana experienced a major network stall following Grape Protocol’s launch of its on-chain initial DEX offering (IDO) on the crowdfunding platform Raydium AcceleRaytor. Within 12 minutes of the IDO, the network became overwhelmed by an unprecedented flood of bot-driven transactions and stopped producing rooted slots. These bots effectively executed a distributed denial-of-service (DDoS) attack, pushing transaction loads beyond the network’s capacity.

At peak congestion:

Some validators were receiving over 300,000 transactions per second.
Raw transaction data exceeded 1 Gbps, with 120,000 packets per second.
The traffic sometimes exceeded the physical limits of network interfaces, causing packet loss at the switch port before even reaching validators.

Solana slots per second during the Grape IDO outage of September 14th, 2021 (Data source: Jump Crypto)

One of the bots structured its transactions to write-lock 18 key accounts, including the global SPL token program and the now-defunct Serum DEX program. This blocked all transactions interacting with these accounts, severely reducing Solana’s parallel processing capability. Instead of executing transactions independently, the network became bottlenecked, processing transactions sequentially—exacerbating congestion.

A fix that ignores write locks on programs was already developed and scheduled for release. Later, the network reboot enabled this upgrade, permanently removing this attack vector.
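The effect of write-locking a shared program can be sketched with a simplified lock-conflict scheduler. This is a toy model under stated assumptions: the Tx type, the greedy batching, and the account names are all hypothetical and do not reflect the real banking stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tx:
    name: str
    writes: frozenset
    reads: frozenset = frozenset()


def conflicts(a: Tx, b: Tx) -> bool:
    """Two transactions conflict if either writes an account the other touches."""
    return bool(a.writes & (b.writes | b.reads) or b.writes & a.reads)


def schedule(txs):
    """Greedily pack transactions into batches whose members can run in parallel."""
    batches = []
    for tx in txs:
        for batch in batches:
            if not any(conflicts(tx, other) for other in batch):
                batch.append(tx)
                break
        else:
            batches.append([tx])
    return batches


def demote_program_write_locks(tx: Tx, program_ids) -> Tx:
    """Sketch of the fix: treat write locks on program accounts as read locks."""
    demoted = tx.writes & program_ids
    return Tx(tx.name, tx.writes - demoted, tx.reads | demoted)


# Three transactions that each write-lock the shared token program.
spam = [Tx(f"tx{i}", frozenset({f"account{i}", "token_program"})) for i in range(3)]
fixed = [demote_program_write_locks(tx, frozenset({"token_program"})) for tx in spam]

print(len(schedule(spam)))   # number of sequential batches before the fix
print(len(schedule(fixed)))  # number of sequential batches after the fix
```

Before the demotion every transaction conflicts with every other one and each lands in its own batch. After it, the only shared account is read-locked, so all three fit in a single parallel batch.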

During the IDO event, validators received a flood of bot-driven transactions and, in turn, forwarded excess transactions to the next leader, amplifying congestion. The network reboot introduced rate limits on transaction forwarding to prevent future transaction storms from overwhelming leaders.

Solana’s RPC nodes automatically retry failed transactions, a feature designed to improve reliability. However, this retry mechanism exacerbated transaction flooding under extreme congestion, keeping old transactions in circulation instead of allowing the network to recover. Solana 1.8 introduced configurable RPC retry behavior, enabling applications to optimize retries with shorter expiry times and exponential backoff strategies.
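To see why shorter expiry and exponential backoff reduce pressure on a congested network, compare the number of resubmissions each policy generates. The function names and timing values below are illustrative assumptions, not actual RPC defaults.

```python
def fixed_retry_times(interval_ms: int, expiry_ms: int) -> list[int]:
    """Resend at a constant interval until the transaction expires."""
    return list(range(interval_ms, expiry_ms + 1, interval_ms))


def backoff_retry_times(base_ms: int, cap_ms: int, expiry_ms: int) -> list[int]:
    """Resend with a delay that doubles after each attempt, up to a cap."""
    times, elapsed, delay = [], 0, base_ms
    while elapsed + delay <= expiry_ms:
        elapsed += delay
        times.append(elapsed)
        delay = min(cap_ms, delay * 2)
    return times


# Hypothetical policy values: 500 ms base delay, 4 s cap, 10 s expiry.
print(len(fixed_retry_times(500, 10_000)))
print(backoff_retry_times(500, 4_000, 10_000))
```

Under these assumed values the fixed policy resends twenty times before expiry, while the backoff policy resends four times. Multiplied across every client retrying at once, that difference determines whether stale transactions keep circulating or drain away.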

Under heavy congestion, Solana leaders failed to include vote transactions, which are critical for maintaining consensus. As a result, the lack of confirmed votes led to a consensus stall, halting the production of new root blocks. Later versions of the Solana client introduced a mechanism to prioritize vote transactions, preventing them from being drowned out by regular transactions in future events.

A Second Bug: Integer Overflow

During the network restart, a second issue emerged. Validators reported wildly fluctuating active stake amounts. The cause was a bug in which the stake percentage was incorrectly multiplied by 100, producing a value larger than a 64-bit unsigned integer can hold: the inflation mechanism had created so many new SOL tokens that the multiplication overflowed. This bug was quickly identified and patched before a second restart.
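The arithmetic behind this kind of overflow is easy to reproduce. Python integers do not overflow, so the sketch below emulates 64-bit unsigned behaviour with a mask. The stake figure is hypothetical and the helper names are illustrative, not taken from the validator code.

```python
U64_MAX = 2**64 - 1
LAMPORTS_PER_SOL = 10**9


def wrapping_mul_u64(a: int, b: int) -> int:
    """Multiply as a 64-bit unsigned integer would, discarding high bits."""
    return (a * b) & U64_MAX


def checked_mul_u64(a: int, b: int):
    """Return the product, or None if it does not fit in 64 bits."""
    product = a * b
    return product if product <= U64_MAX else None


def percent_of_total(part: int, total: int) -> int:
    """Overflow-safe percentage using arbitrary-precision arithmetic."""
    return part * 100 // total


# Hypothetical active stake of 500 million SOL, expressed in lamports.
stake_lamports = 500_000_000 * LAMPORTS_PER_SOL

print(checked_mul_u64(stake_lamports, 100))                      # None: does not fit
print(wrapping_mul_u64(stake_lamports, 100) < stake_lamports * 100)
print(percent_of_total(stake_lamports, stake_lamports))
```

Any lamport amount above roughly 1.8 × 10^17 (about 184 million SOL) overflows when multiplied by 100 in 64 bits, which is why a value that was safe early in the network's life stopped being safe as supply grew.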

High Congestion: January 2022

Downtime: None

Root cause: Excessive duplicate transactions

Partial fix:

Solana 1.8.12 and 1.8.14 releases
Optimization of SigVerify deduplication
Improvements in executor cache performance

Between January 6th and January 12th, 2022, Solana mainnet experienced severe network congestion, leading to degraded performance and partial outages. The disruption was driven by bots spamming excessive duplicate transactions, significantly reducing network capacity. Blocks took longer than expected to process, causing the next leader to fork and further reduce throughput. At its peak, transaction success rates dropped by as much as 70%. The client struggled to handle the network’s increasingly complex, high-compute transactions, exposing limitations in its ability to meet demand.

Additional instability occurred from January 21st to 23rd, with congestion persisting. On January 22nd, the public RPC endpoint (https://api.mainnet-beta.solana.com) went offline due to abuse, as spammed batched RPC calls overwhelmed the system.

To address these issues, the Solana 1.8.12 release specifically targeted program cache exhaustion, while version 1.8.14 introduced improvements to the Sysvar cache, SigVerify discard, and SigVerify deduplication.

Candy Machine Spam: April / May 2022

Downtime: Eight hours

Root issue: Transaction spam from bot accounts

Fixes:

Bot tax on the candy machine program
Memory improvements in Solana v1.10

On April 30, 2022, Solana experienced an unprecedented surge in transaction requests. Some nodes reported reaching six million requests per second, generating over 100 Gbps of traffic per node. This surge was driven by bots trying to secure newly minted NFTs through the Metaplex Candy Machine program. This minting mechanism operated on a first-come, first-served basis, creating a strong economic incentive to flood the network with transactions and win the mint.

April 30th / May 1st, 2022 Candy Machine outage, packets per second ingress (Data source: Jump Crypto)

As transaction volume skyrocketed, validators ran out of memory and crashed, ultimately stalling consensus. Insufficient voting throughput prevented the finalization of earlier blocks, preventing abandoned forks from being cleaned up. As a result, validators became overwhelmed by the sheer number of forks they had to evaluate, exceeding their capacity even after restarts and requiring manual intervention to restore the network.

While this outage shared similarities with the September 2021 incident, Solana demonstrated improved resilience. Despite experiencing 10,000% more transaction requests than in the previous outage, the network remained operational for much longer, reflecting the improvements made by the validator community in response to prior scaling challenges.

April 30th / May 1st, 2022 Candy Machine outage, active validators (Data source: Jump Crypto)

The network restart took less than 1.5 hours after the canonical snapshot had been agreed upon. Solana v1.10 included memory use improvements to prolong the time nodes can endure slow or stalled consensus.

However, fundamental issues remained unresolved. The leader still processed transactions contending for the same account data on a first-come, first-served basis without effective spam prevention, leaving users unable to prioritize the urgency of their transactions. To address this, three long-term mechanisms were proposed as practical solutions.

The Adoption of QUIC: Previously, Solana relied upon the UDP (User Datagram Protocol) networking protocol to send transactions through Gulf Stream from RPC nodes to the current leader. While fast and efficient, UDP is connectionless, lacking flow control and receipt acknowledgments. Accordingly, there is no meaningful way to discourage or mitigate abusive behavior. To effect control over network traffic, the validator’s transaction ingestion protocol (i.e., the TPU’s Fetch Stage) was reimplemented with QUIC.

QUIC attempts to offer the best of both TCP and UDP. It facilitates rapid, asynchronous communication similar to UDP but with the secure sessions and advanced flow control strategies of TCP. This allows limits to be placed on individual traffic sources so the network can focus on processing genuine transactions. QUIC also has a concept of separate streams, so if one transaction is dropped, it doesn’t block the remaining ones. QUIC was eventually integrated into the Solana Labs client with the 1.13.4 release.

Stake-Weighted Quality of Service (SWQoS): A new system that prioritizes network traffic based on the stake held by validators was introduced, ensuring those with higher stake can send transactions more efficiently. Under this mechanism, a validator with 3% of the total stake can send up to 3% of the total packets to the leader. SWQoS acts as a Sybil resistance measure, making it more difficult for malicious actors to flood the network with low-quality transactions. This approach replaces the previous first-come, first-served model, which accepted transactions indiscriminately without considering their source.
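The proportional rule can be sketched as a simple quota calculation. The function name, validator names, and the leader capacity figure are hypothetical; the real implementation applies stake weighting to QUIC connections and streams rather than to a fixed packet budget.

```python
def packet_quotas(stakes: dict[str, int], leader_capacity: int) -> dict[str, int]:
    """Split a leader's packet budget in proportion to each sender's stake."""
    total_stake = sum(stakes.values())
    return {
        validator: leader_capacity * stake // total_stake
        for validator, stake in stakes.items()
    }


# Hypothetical cluster: one validator holds 3% of stake, the rest hold 97%.
stakes = {"small_validator": 3, "everyone_else": 97}
quotas = packet_quotas(stakes, leader_capacity=100_000)
print(quotas["small_validator"])
```

Because the quota is tied to stake, an attacker cannot gain more bandwidth by spinning up additional identities. The total allocation still depends on how much stake those identities control.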

Introduction of Priority Fees: Once transactions are ingested, they still compete to access shared account data. Previously, this contention was resolved on a simple first-come, first-served basis, providing users no way to signal the urgency of their transactions. Since anyone can submit transactions, stake-weighting is unsuitable for prioritization at this stage. To address this, a new instruction was added to the Compute Budget program, allowing users to specify an additional fee collected upon execution and block inclusion. The fee-to-compute-unit ratio determines a transaction’s execution priority, ensuring a more dynamic and market-driven approach to transaction ordering.
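Ordering by fee-to-compute-unit ratio can be illustrated as follows. The PendingTx type and the numbers are hypothetical, and a real scheduler also has to account for account locks and block limits.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class PendingTx:
    signature: str
    priority_fee: int   # additional fee offered, in the smallest fee unit
    compute_units: int  # compute units requested


def priority(tx: PendingTx) -> Fraction:
    """Fee offered per compute unit; a higher ratio executes first."""
    return Fraction(tx.priority_fee, tx.compute_units)


def execution_order(txs):
    """Sort pending transactions from highest to lowest priority."""
    return sorted(txs, key=priority, reverse=True)


pending = [
    PendingTx("large_fee_heavy_compute", priority_fee=10_000, compute_units=1_000_000),
    PendingTx("small_fee_light_compute", priority_fee=5_000, compute_units=100_000),
    PendingTx("no_priority_fee", priority_fee=0, compute_units=200_000),
]

for tx in execution_order(pending):
    print(tx.signature, float(priority(tx)))
```

Note that the transaction offering the largest absolute fee does not win here: it requests ten times the compute, so its fee per compute unit is lower than that of the lighter transaction.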

Candy Machine Bot Tax

Metaplex quickly introduced a hard-coded bot tax of 0.01 SOL on mint transactions interacting with the Candy Machine program to combat bot-driven spam. This anti-spam mechanism imposed a minimal fee to deter malicious activity without penalizing legitimate users who made accidental mistakes. The tax was applied in specific scenarios, including:

Attempting to mint when the Candy Machine was not live
Trying to mint when no items remained
Transactions where mint or set collection was not the final instruction
Use of incorrect collection ID
Mismatched Set Collection instructions
A signer-payer mismatch between the collection set and mint instructions
Suspicious transactions involving disallowed programs
Attempting to mint from an AllowList-protected Candy Machine without holding the required allowlist token

This economic deterrent proved highly effective. Mint snipers were quickly drained, and spam activity ceased. Within the first few days, botters collectively lost over 426 SOL.

Durable Nonce Bug: June 2022

Downtime: Four and a half hours

Root issue: Durable nonce bug leading to consensus failure

Fixes:

Temporary disablement of durable nonce transactions
Solana 1.10.23 update

A runtime bug allowed certain durable nonce transactions to be processed twice—once as a regular transaction and again as a nonce transaction—if they used a recent blockhash instead of a durable nonce in the recent_blockhash field. This led to non-deterministic behavior among validators, as some nodes rejected the second execution while others accepted it. Critically, since more than one-third of validators accepted the block, it prevented the required two-thirds majority from reaching consensus.

Unlike standard transactions, durable nonce transactions do not expire and require a unique mechanism to prevent double execution. They are processed serially using an on-chain nonce value tied to each account, which is rotated every time a durable nonce transaction is processed. Once rotated, the same nonce transaction should not be valid again.

To mitigate the issue, durable nonce transactions were temporarily disabled. A fix was later implemented in Solana 1.10.23, which prevented duplicate execution by separating the nonce and blockhash domains. The update ensured that when advancing nonce accounts, the blockhash is hashed with a fixed string, making a blockhash invalid as a nonce value. As a result, a transaction executed once as a regular transaction cannot be re-executed as a durable transaction, and vice versa. Additionally, a new DurableNonce type replaced previous blockhash values in the nonce account state, adding type safety and preventing similar issues in the future.
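The domain-separation idea can be sketched with a standard hash function. This is an illustration of the technique rather than the client's code: the prefix string, the use of SHA-256 via hashlib, and the function names are assumptions made for the example.

```python
import hashlib

# Illustrative fixed string; the exact value used by the client is not given here.
NONCE_DOMAIN_PREFIX = b"DURABLE_NONCE"


def durable_nonce_from_blockhash(blockhash: bytes) -> bytes:
    """Derive a nonce value by hashing a fixed prefix together with a blockhash."""
    return hashlib.sha256(NONCE_DOMAIN_PREFIX + blockhash).digest()


def valid_as_regular_tx(recent_blockhash: bytes, recent_blockhashes: set) -> bool:
    """A regular transaction must reference a recent blockhash."""
    return recent_blockhash in recent_blockhashes


def valid_as_nonce_tx(recent_blockhash: bytes, stored_nonce: bytes) -> bool:
    """A durable nonce transaction must reference the stored nonce value."""
    return recent_blockhash == stored_nonce


# Hypothetical 32-byte blockhash standing in for a real one.
blockhash = hashlib.sha256(b"example block").digest()
stored_nonce = durable_nonce_from_blockhash(blockhash)

print(valid_as_regular_tx(blockhash, {blockhash}))  # accepted as a regular transaction
print(valid_as_nonce_tx(blockhash, stored_nonce))   # rejected as a nonce transaction
```

Since the stored nonce is the hash of a prefixed blockhash, a value that is valid in one domain is not valid in the other, so the same transaction cannot pass both checks.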

Read our previous Helius blog article to understand more about durable nonces and their uses.

Duplicate Block Bug: September 2022

Downtime: Eight and a half hours

Root issue: A bug in fork choice rules led to consensus failure

Fix:

Client patch

This outage was triggered by a validator erroneously producing duplicate blocks at the same block height. This occurred because both the validator’s primary node and its fallback spare node became active simultaneously, using the same node identity but proposing different blocks. This condition persisted for at least 24 hours before the outage, during which the network correctly handled the validator’s duplicate leader slots.

The cluster eventually halted when the network encountered an unrecoverable fork due to a bug in the fork selection logic. This bug prevented block producers from building on the previous block, leading to a failure in consensus.

Forks are a routine occurrence on Solana, and validators typically resolve them by aligning on the fork with the majority of votes (the heaviest fork). When a validator selects the wrong fork, it must switch to the heaviest fork to stay in sync with the network. However, in this case, validators could not revert to the heaviest bank if its slot matched their last voted slot. This flaw caused validators to remain stuck, preventing consensus from progressing and ultimately leading to the network halt.

Duplicate block bug outage fork choice, September 2022 (Source: Laine, Michael Hubbard)

In the example above, faulty validator C produces duplicate blocks for its leader slots 5 through 8. When validator G takes over as the next leader, it observes only one of the duplicates and extends its fork accordingly. However, the following leader, validator D, detects both duplicate blocks from validator C and decides to discard them, instead building its fork on top of slot 4.

As the network progresses, the fork built by validator G gains votes from the majority of stake, establishing itself as the canonical chain. Recognizing its fork is losing, validator D attempts to switch to validator G’s fork. However, the transition fails due to a bug in the fork selection logic. This issue arises because the common ancestor of the two forks—a duplicate block at slot 5—was not handled correctly, preventing validator D from recognizing the majority fork. As a result, validator D remains stuck on its own fork, unable to rejoin the main chain.
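Heaviest-fork selection, in its simplest form, is a stake-weighted tally of votes. The sketch below is a deliberately flattened model with hypothetical validator names and stake values. The real fork choice walks a tree of banks and, as this incident showed, has to handle duplicate blocks among the ancestors.

```python
def fork_weights(votes: dict[str, str], stakes: dict[str, int]) -> dict[str, int]:
    """Sum the stake of every validator voting for each fork."""
    weights: dict[str, int] = {}
    for validator, fork in votes.items():
        weights[fork] = weights.get(fork, 0) + stakes[validator]
    return weights


def heaviest_fork(votes: dict[str, str], stakes: dict[str, int]) -> str:
    """Return the fork backed by the most stake."""
    weights = fork_weights(votes, stakes)
    return max(weights, key=weights.get)


# Hypothetical stake distribution and votes.
stakes = {"validator_d": 10, "validator_g": 30, "validator_x": 35, "validator_y": 25}
votes = {
    "validator_d": "fork_built_by_d",
    "validator_g": "fork_built_by_g",
    "validator_x": "fork_built_by_g",
    "validator_y": "fork_built_by_g",
}

print(heaviest_fork(votes, stakes))
```

In this model validator D would simply switch to the heavier fork. The bug lay in the step this sketch omits: recognising the heavier fork as a valid switch target when the common ancestor was itself a duplicate block.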

The issue was resolved after a review by the core team. A patch was merged into the master branch and backported to all release branches.

Large Block Overwhelms Turbine: February 2023

Downtime: Almost 19 hours

Root issue: Failure of deduplication logic in shred-forwarding services

Fixes:

Multiple improvements to Turbine’s deduplication logic and filtering
Add client patch forcing block producers to abort if they generate large blocks

A validator’s custom shred-forwarding service malfunctioned, transmitting an exceptionally large block (almost 150,000 shreds), several orders of magnitude larger than a standard block, during its leader slot. This overwhelmed validator deduplication filters, causing the data to be continuously reforwarded. The issue compounded as new blocks were produced, eventually saturating the protocol.

Large block outage, shreds per block, February 2023 (Source: Laine, Michael Hubbard)

The surge in abnormal network traffic overwhelmed Turbine, forcing block data to be transmitted via the significantly slower fallback Block Repair protocol. Although Turbine is designed to withstand large blocks by filtering them out, the shred-forwarding services function upstream of this filtering logic, diminishing its effectiveness. During the degraded period, block leaders automatically shifted into vote-only mode, a safety mechanism in which leaders exclude economic non-vote transactions.

The root cause of the issue was a failure in the deduplication logic within the shred-forwarding services, which is meant to prevent redundant retransmission of shreds. Additionally, the deduplication filter in the retransmission pipeline was not originally designed to prevent looping within the Turbine tree, exacerbating the problem.
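One way to picture filter saturation is with a bounded deduplication cache that forgets everything once it is full. This is a toy model under stated assumptions: the class, its reset policy, and the capacity and shred counts are hypothetical and do not describe the actual filter implementation.

```python
class BoundedDeduper:
    """Remembers up to `capacity` shred IDs, then resets when saturated."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.seen = set()

    def is_new(self, shred_id: int) -> bool:
        if shred_id in self.seen:
            return False
        if len(self.seen) >= self.capacity:
            self.seen.clear()  # saturated: previously seen shreds are forgotten
        self.seen.add(shred_id)
        return True


def forwarded_count(shred_ids, rounds: int, capacity: int) -> int:
    """Count how many shreds get forwarded when the same block arrives `rounds` times."""
    deduper = BoundedDeduper(capacity)
    return sum(deduper.is_new(s) for _ in range(rounds) for s in shred_ids)


# A normal-sized block fits in the filter: repeats are dropped.
print(forwarded_count(range(100), rounds=3, capacity=1_000))
# An oversized block saturates the filter: every repeat is forwarded again.
print(forwarded_count(range(5_000), rounds=3, capacity=1_000))
```

When the block is larger than the filter, each shred has been evicted by the time it comes around again, so the filter provides no protection and traffic grows with every retransmission round.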

The network was manually restarted with a downgrade to the last known stable validator software version. To mitigate these issues, Solana v1.13.7 and v1.14.17 introduced enhancements to the deduplication logic, improving its ability to prevent filter saturation and ensuring more robust network performance.

Infinite Recompile Loop: February 2024

Downtime: Almost five hours

Root issue: Bug causing an infinite recompile loop in the JIT cache

Fixes:

Disable the legacy loader (v1.17.20)

The Agave validator just-in-time (JIT) compiles all programs before executing transactions that reference them. To optimize performance, the JIT output of frequently used programs is cached, reducing unnecessary recompilations. As part of Agave v1.16, the existing caching mechanism, ExecutorsCache, was replaced with a new implementation called LoadedPrograms, which introduced several efficiencies.

LoadedPrograms provided a global, fork-aware view of cached programs, reducing accounting data duplication and allowing transaction execution threads to load new programs cooperatively, preventing compilation conflicts. A key feature of this system was tracking the slot where a program becomes active (known as the effective slot height) to detect cache invalidations when on-chain program data is updated.

Most programs’ effective slot height was derived from their deployment slot, which was stored in their on-chain account. However, programs deployed using legacy loaders did not retain this deployment slot in their accounts. LoadedPrograms assigned these programs an effective slot height of zero as a workaround.

An exception occurred when a deploy instruction was detected, signaling that a program’s bytecode had been replaced. In this case, LoadedPrograms temporarily inserted an entry with the correct effective slot height. However, because a transaction never referenced this entry, it was highly susceptible to eviction. When evicted, the JIT output was discarded, and the program was marked as unloaded, but the effective slot height was retained.

If a transaction later referenced this unloaded program, LoadedPrograms recompiled it and reinserted an entry at its effective slot height. Typically, this would make the program available for execution on the next iteration. However, for legacy loader programs, the new JIT output was assigned the sentinel slot height of zero, placing it behind the previous unloaded entry. As a result, LoadedPrograms never recognized the program as loaded, triggering a continuous recompilation loop on every iteration.
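The loop can be reproduced in a small model of the cache lookup. This is a simplified sketch: the dictionary representation, the function name, and the iteration cap (added so the example terminates) are assumptions, not the actual LoadedPrograms code.

```python
def lookup_iterations(entries: dict[int, bool], current_slot: int,
                      recompiled_slot: int, max_iterations: int = 5):
    """Model one program's cache entries as {effective_slot: is_loaded}.

    Each iteration picks the newest entry visible at `current_slot`. If that
    entry is unloaded, the program is recompiled and the result is inserted at
    `recompiled_slot`. Returns the iteration on which a loaded entry is found,
    or None if the cap is reached first.
    """
    entries = dict(entries)  # work on a copy
    for iteration in range(1, max_iterations + 1):
        newest = max(slot for slot in entries if slot <= current_slot)
        if entries[newest]:
            return iteration
        entries[recompiled_slot] = True  # recompile and reinsert
    return None


# The program was redeployed at slot 100, and that entry was later evicted (unloaded).
evicted = {100: False}

# Modern loader: the recompiled entry replaces the unloaded one at slot 100.
print(lookup_iterations(evicted, current_slot=200, recompiled_slot=100))
# Legacy loader: the recompiled entry lands at the sentinel slot 0, behind slot 100.
print(lookup_iterations(evicted, current_slot=200, recompiled_slot=0))
```

In the legacy case the lookup keeps finding the unloaded entry at slot 100 because it is newer than the freshly compiled entry at slot zero. Without the cap in this sketch, the loop would never exit.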

In Agave v1.16, LoadedPrograms did not support cooperative loading, allowing the triggering transaction to be packed into a block. This block was then propagated across the network, causing every validator to replay it and enter the same infinite recompilation loop. Since over 95% of the cluster stake was running Agave v1.17 during the outage, most validators became stalled on this block, halting the network.

This bug was identified the previous week during an investigation into a Devnet cluster outage, and a patch was scheduled for deployment. The chosen mitigation was to backport changes to Agave v1.17 and immediately remove a feature gate upon network restart. This disabled the legacy loader responsible for triggering the bug, preventing further occurrences.

Coordinated Vulnerability Patch: August 2024

Downtime: None

Root issue: Incorrect ELF address alignment assumption

Fixes:

Patch update

On August 5th, Anza’s core engineers were alerted to a vulnerability in the Agave client, reported by an external researcher. An attacker could have exploited this flaw to crash leader validators, leading to a network-wide halt. In response, Anza’s engineers swiftly developed a patch, which multiple third-party security firms then audited.

Solana programs are compiled using LLVM into the Executable and Linkable Format (ELF). The vulnerability stemmed from an incorrect address alignment assumption within these generated ELF files. While ELF sanitization typically enforces various integrity checks, it did not validate the alignment of the .text section. This oversight could have allowed a maliciously crafted ELF file to define a misaligned .text section, leading the virtual machine to jump to an invalid address. This would result in a host segmentation fault, crashing the validator.

An attacker could have exploited this vulnerability by:

Creating a malicious Solana program that uses the CALL_REG opcode.
Manipulating the ELF file to misalign the .text section.
Deploying and invoking the program on the network, triggering validator crashes.

Patch Update Process

Any publicly released patch update would immediately make the vulnerability clear to all. This could allow an attacker enough time to reverse engineer the vulnerability and halt the network before a sufficient amount of stake had upgraded. A critical mass of validators would need to adopt any patch release as quickly as possible to avoid such a scenario.

By August 7th, multiple members of the Solana Foundation had reached out to validators through private messages on various communication platforms, informing them of an upcoming critical patch and sharing a hashed message that confirmed the date and unique identifier of the incident. Multiple prominent members of Anza, Jito, and the Solana Foundation shared this hash on X, GitHub, and LinkedIn to verify the message’s accuracy. Example hash shared:

Over the next day, core members continued to reach out to validators, underscoring the importance of urgency and confidentiality. At the pre-determined time, August 8th, 2 PM UTC, validator operators received a further message containing instructions for downloading, verifying, and applying the patch. The patch was hosted on the GitHub repository of a known Anza engineer, not the main Agave repository. Instructions included verification of the downloaded patch files against supplied shasums.
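Verifying a download against a supplied checksum can be sketched as follows. The choice of SHA-256 and the helper names are assumptions for illustration, and the patch contents here are placeholder bytes rather than a real file.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_checksum(data: bytes, expected_hex: str) -> bool:
    """Compare the computed digest with the published one in constant time."""
    return hmac.compare_digest(sha256_hex(data), expected_hex.strip().lower())


# Placeholder bytes standing in for a downloaded patch file.
patch_bytes = b"example patch contents"
published_checksum = sha256_hex(patch_bytes)  # what the maintainers would publish

print(matches_checksum(patch_bytes, published_checksum))         # intact download
print(matches_checksum(patch_bytes + b"!", published_checksum))  # tampered download
```

The same primitive underlies the hashed message shared beforehand: publishing only the digest commits to the message contents without revealing them, and anyone who later receives the message can confirm it matches.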

By 8 PM UTC on August 8th, a supermajority of stake had been patched, ensuring network security. Following this, the vulnerability and its corresponding patch were publicly disclosed, accompanied by a call for all remaining validators to upgrade.

The quiet distribution of the patch and the behind-the-scenes coordination of validators raised concerns about Solana’s decentralization. Shortly after the incident, Solana Foundation’s executive director, Dan Albert, addressed these criticisms in a media interview.

“I think it’s important not to confuse centralization with the ability to coordinate. There are 1,500 block-producing nodes all over the world that are operated by almost as many individuals…. The ability to communicate with them, or some of them, voluntarily, is not to be confused with centralization.”

Korea Blockchain Week (KBW) 2024

Conclusion

As of this writing, Solana has gone over a year without an outage, meeting a key milestone for removing the “beta” tag from mainnet-beta. The frequency of outages appears to be decreasing as the network matures, and the introduction of Firedancer is expected to enhance client diversity, reducing the risk of undiscovered bugs or edge cases causing a full cluster-wide shutdown. However, some community leaders, including Helius founder Mert Mumtaz, have predicted that outages will continue. Time will tell.

Many thanks to Zantetsu (Shinobi Systems) and 0xIchigo for reviewing earlier versions of this work.

Disclaimer:

This article is reprinted from [Helius]. All copyrights belong to the original author [Lostin]. If there are objections to this reprint, please contact the Gate Learn team, and they will handle it promptly.
Liability Disclaimer: The views and opinions expressed in this article are solely those of the author and do not constitute any investment advice.
The Gate Learn team does translations of the article into other languages. Copying, distributing, or plagiarizing the translated articles is prohibited unless mentioned.

Introduction

Historical instances of Solana outages and degraded performance

Liveness and Safety

According to the CAP theorem, also known as Brewer’s Theorem, a distributed system can only achieve two out of three properties:

Consistency - Every read sees all previous writes.
Availability - Every request receives a response.
Partition tolerance - The system continues operating despite network partitions.

Solana’s position within the CAP theorem trade-offs

Network Restarts

Bug Reporting

Rewards are offered for valid reports of critical vulnerabilities, with payouts based on severity:

Loss of Funds: Up to 25,000 SOL
Consensus or Safety Violations: Up to 12,500 SOL
Liveness or Loss of Availability: Up to 5,000 SOL

Additionally, the Firedancer client has a separate bug bounty program hosted through Immunefi, offering a maximum reward of $500,000 in USDC for critical findings.

Outage Instances

Turbine Bug: December 2020

Downtime: Approximately six hours

Root issue: Block propagation bug

Fixes:

Track blocks by hash instead of slot number
Fix places in Turbine where fault detection can be done earlier
Propagate the first detected fault to all validators through gossip
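
The first fix can be illustrated with a toy block store. This is a simplified sketch, not Solana's actual blockstore: the class names are hypothetical and SHA-256 of the payload stands in for a real block hash. It shows why keying by slot alone hides a second, conflicting block, while keying by slot and hash exposes it.

```python
import hashlib


def block_hash(payload: bytes) -> str:
    """Stand-in for a real block hash: SHA-256 of the block payload."""
    return hashlib.sha256(payload).hexdigest()


class SlotKeyedStore:
    """Tracks one block per slot, so a second block for the same slot goes unnoticed."""

    def __init__(self):
        self.blocks = {}

    def insert(self, slot, payload):
        # The first block seen for a slot wins; conflicting versions are silently ignored.
        self.blocks.setdefault(slot, payload)


class HashKeyedStore:
    """Tracks blocks by (slot, hash), so conflicting versions of a slot are visible."""

    def __init__(self):
        self.blocks = {}

    def insert(self, slot, payload):
        self.blocks[(slot, block_hash(payload))] = payload

    def duplicate_slots(self):
        """Return the sorted slots for which more than one distinct block was seen."""
        seen = {}
        for slot, digest in self.blocks:
            seen.setdefault(slot, set()).add(digest)
        return sorted(slot for slot, digests in seen.items() if len(digests) > 1)


if __name__ == "__main__":
    by_slot, by_hash = SlotKeyedStore(), HashKeyedStore()
    for store in (by_slot, by_hash):
        store.insert(10, b"block A")
        store.insert(10, b"block B")  # a different block for the same slot
    print(len(by_slot.blocks))        # 1
    print(by_hash.duplicate_slots())  # [10]
```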

Each partition assumed that the other had the same block, leading to a fundamental conflict:

Nodes holding block A rejected forks derived from block B
Nodes holding block B rejected forks derived from block A

Since state transitions differed between partitions, validators could not repair or reconcile the forks, preventing finality.

The Grape Protocol IDO: September 2021

Downtime: Seventeen hours

Root issue: Memory overflow caused by bot transactions

Fixes:

Ignore write locks on programs
Rate limits on transaction forwarding
Configurable RPC retry behavior
TPU vote transaction prioritization
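
The source does not say how the forwarding rate limit was implemented. As a general illustration of the idea, the sketch below uses a token bucket, a common rate-limiting technique chosen here as an assumption; the class name and parameters are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
class TokenBucket:
    """Allows up to `rate` forwards per second, with bursts of at most `capacity`."""

    def __init__(self, rate, capacity, now=0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.last = now

    def allow(self, now):
        """Return True if one transaction may be forwarded at time `now` (seconds)."""
        if now > self.last:
            # Refill in proportion to elapsed time, never beyond the burst capacity.
            refill = (now - self.last) * self.rate
            self.tokens = min(self.capacity, self.tokens + refill)
            self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=2, capacity=2)
    # Five transactions arrive at t=0: only the burst capacity is forwarded.
    print([bucket.allow(0.0) for _ in range(5)])  # [True, True, False, False, False]
    # One second later, two more tokens have accumulated.
    print([bucket.allow(1.0) for _ in range(3)])  # [True, True, False]
```

Transactions that are refused are simply not forwarded, which caps the load a single node can push onto the next leader regardless of how much it receives.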

At peak congestion:

Some validators were receiving over 300,000 transactions per second.
Raw transaction data exceeded 1 Gbps, with 120,000 packets per second.
The traffic sometimes exceeded the physical limits of network interfaces, causing packet loss at the switch port before even reaching validators.

Solana slots per second during the Grape IDO outage of September 14th, 2021 (Data source: Jump Crypto)

A fix that ignores write locks on programs had already been developed and was scheduled for release. The network restart later enabled this upgrade, permanently removing this attack vector.