Sergei Tikhomirov

2024-09-19T09:52:24+00:00

This post is kept here for archival purposes. It is superseded by this later post.

The Lightning Network (LN) is a prominent Bitcoin scaling solution that uses payment channels. Channel probing is inference of supposedly non-public balances of honest users in the LN. Channel probing is a big threat to user privacy in the LN. An attacker might want to infer channel balances for several reasons: to spy on payments, to learn business operation details, to optimize other attacks (e.g., network split via jamming), etc.

Even though probing has received some attention over the last couple of years, the understanding of this attack is still quite limited. Prior work makes assumptions that don’t hold well in practice, thus it is hard to understand from the existing literature whether probing is really feasible (what’s the cost of the attack) and what are the best ways to combat it.

We believe that ignoring parallel channels is one of the biggest shortcomings of the prior art. In our work, we expand the existing model of the LN with parallel channels, and advance the understanding of attack optimizations. We suggest a wide range of countermeasures (both aware and unaware of parallel channels) and evaluate the feasibility of the attack. Using our novel LN simulator, we demonstrate that our countermeasures bound the attacker’s information gain at 30% while making the attack 2-4 times longer.

The paper is available here: https://eprint.iacr.org/2021/384

Basics of channel probing
Probing applications and why parallel channels matter
Challenges in spying on parallel channels
Attack model
Information metrics
Simulator
Countermeasures and evaluation
Attack optimizations
Future work
Conclusions

Basics of channel probing

We assume that our readers have a basic understanding of how the Lightning Network operates. Channel probing involves sending fake payments (with random hashes) via a target channel. We refer to such payments as probes. Probes might result in the following outcomes: 1) “Invalid hash” error signed by the probe target, when a probe successfully reached the target; 2) “Insufficient amount” error signed by the second-last node in the route; 3) A failure somewhere along the route (e.g., “insufficient amount” on the non-last hop).

For a balance of a given target channel, the attacker maintains a lower and an upper bound. The attacker picks a probing amount between the current bounds and sends a probe. If (1) is returned, then the lower bound of the target channel is updated with the chosen amount. If (2) is returned, then the upper bound is updated. In case of (3), an attacker has to find another route. Note that even in case (3) the attacker may update the bounds for channels before the erring one, and use this information to optimize the attack. The attacker repeats probes with new balance estimates until the target precision is reached.
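The bound-narrowing loop described above can be sketched as a binary search. This is a minimal illustration for a single-channel hop, assuming no fees, no channel reserve, and no route failures (outcome 3 is ignored); the function name and the simulated `true_balance` are hypothetical and stand in for real probes.

```python
def probe_balance(true_balance, capacity, precision=1):
    """Narrow [lower, upper] around a balance using simulated probes.

    `true_balance` stands in for the victim's hidden balance; a real attacker
    would learn each outcome only from the error returned for the probe.
    """
    lower, upper = 0, capacity  # the balance is known to lie in [lower, upper]
    probes = 0
    while upper - lower >= precision:
        amount = (lower + upper + 1) // 2  # pick an amount between the bounds
        probes += 1
        if amount <= true_balance:
            lower = amount      # outcome (1): the probe reached the target
        else:
            upper = amount - 1  # outcome (2): the balance is below the amount
    return lower, upper, probes


print(probe_balance(true_balance=700, capacity=1023))
```

With a capacity of 1023 there are 1024 candidate values, so this sketch needs 10 probes to pin the balance down exactly.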

In practice, “sending payments via a target channel” is not possible. The sender can only specify an ordered list of nodes that form a payment path. Routing nodes are free to choose any available channel to reach the next node in the route (this is called non-strict forwarding). Thus, if the target channel is not the only one between nodes A and B (in other words, if parallel channels exist), the attacker can’t guarantee that the payment went through the target channel. The attacker can’t even tell through which channel the payment actually went. We will talk about these challenges in more detail in Section Challenges.

Completely mitigating channel probing is probably impossible due to the nature of the LN. However, it’s possible to make it very inefficient. The relevant metrics are:

the attacker’s information gain bounds;
the attacker’s information gain speed;
the attacker’s required capital allocation.

Probing applications and why parallel channels matter

Even though inference of balance of a particular channel might itself be interesting, an attacker might also want to infer payments of a particular node (or between two nodes), or infer the full balance of a victim, or do something else. For these scenarios, probing is just a building block.

In our work, we focus on channel probing in general, without going into much detail about particular attack aims or scenarios. We achieve generality by a) having a general attack success metric (see Attack model), and b) evaluating it against 100 random target nodes in the simulated network (see Evaluation).

While being generic, we understand that an attacker might be interested in two observations: 1) the maximum amount that can be forwarded between two nodes; 2) the balances of individual channels (of a victim, or between two nodes).

For (1), parallel channels are irrelevant, and an attacker can be unaware of them. However, (1) is often insufficient for an attacker, and (2) is required, for example, to detect payments flowing between two nodes. But (2) can be properly inferred only if the attacker is aware of parallel channels.

Parallel channels are allowed by the specification (although one of the implementations, c-lightning, running on approximately 11% of nodes, is not planning to support them). As of December 2020, the LN contains 21% parallel channels that hold 45% of the network’s capacity.

Challenges in spying on parallel channels

Prior channel probing research ignored parallel channels. So, what’s the challenge with inferring balances of parallel channels?

Consider the following example where an attacker Eve is trying to infer balances of channels C1, C2 between Alice and Bob.

Due to non-strict forwarding it’s up to the routing nodes to choose which channel will be used for routing. By default, routing nodes will make best effort to route a payment to earn fees.

In this case, it’s impossible for an attacker to infer the balances of channel C1, which has both a smaller balance and a smaller capacity than C2. An attacker would only be able to infer that channel C2 has 3 coins on both sides, and that’s it. Even though the attacker knows this estimate can’t possibly reflect C1 (because its capacity is public and equals 2 coins), the attacker can’t do much without advanced techniques like those we suggest in Attack optimizations.
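To make the example concrete, here is a small sketch of probing a hop under non-strict forwarding. It assumes the routing node forwards a probe whenever any of its parallel channels has enough balance. The C1 balance of 1 coin is an assumed value (the text only states that it is smaller than C2's), and both function names are hypothetical.

```python
def hop_forwards(balances, amount):
    """Non-strict forwarding: succeed if ANY parallel channel can carry the amount."""
    return any(amount <= balance for balance in balances)


def probe_hop(balances, capacities):
    """Binary-search the largest amount the hop forwards in one direction."""
    lower, upper = 0, max(capacities)
    while lower < upper:
        amount = (lower + upper + 1) // 2
        if hop_forwards(balances, amount):
            lower = amount
        else:
            upper = amount - 1
    return lower


# C1: capacity 2, assumed balance 1. C2: capacity 6, balance 3.
print(probe_hop(balances=[1, 3], capacities=[2, 6]))  # 3
```

The result is the same whatever C1's balance is (0, 1, or 2), which is exactly why probing alone tells the attacker nothing about C1 here.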

Prior works would either get meaningless results (by claiming that channel A has more balance than its capacity), or only achieve very limited information gain (by ignoring many parallel channels).

We now demonstrate how the inference strategy should be adjusted in the light of parallel channels.

Attack model

In our work, we introduce a notion of a “hop”, which is a set of all channels between two nodes, with estimates of their bounds.

We propose an attack strategy that keeps track of both hop bounds and channel bounds. At every probe, the attacker applies the new estimate to the hop as a whole. Additionally, if the attacker can deduce that the probe went through a particular channel, channel-level bounds are also updated. (Currently, we ignore cases where an estimate applies for a subset of channels.) Then, every new probe makes the best effort to infer as much knowledge about target channels as possible.
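One way to picture this bookkeeping is the sketch below. It is not the paper's algorithm: the `Hop` class and its single deduction rule (a successful probe is attributed to a channel only if no other channel could have carried it) are illustrative assumptions.

```python
class Hop:
    """Attacker's view of all parallel channels between two nodes (one direction)."""

    def __init__(self, capacities):
        self.lower = [0] * len(capacities)  # channel-level lower bounds
        self.upper = list(capacities)       # channel-level upper bounds
        self.hop_lower = 0                  # hop-level bounds on the largest
        self.hop_upper = max(capacities)    # amount the hop can forward

    def update(self, amount, success):
        if success:
            self.hop_lower = max(self.hop_lower, amount)
            # Channels whose upper bound still allows them to carry the amount.
            candidates = [i for i, u in enumerate(self.upper) if u >= amount]
            if len(candidates) == 1:
                i = candidates[0]
                self.lower[i] = max(self.lower[i], amount)
        else:
            # No channel could forward: every balance is below the amount.
            self.hop_upper = min(self.hop_upper, amount - 1)
            self.upper = [min(u, amount - 1) for u in self.upper]


hop = Hop(capacities=[2, 6])     # C1 and C2 from the example above
hop.update(3, success=True)      # only C2 can carry 3 coins
hop.update(4, success=False)     # nothing can carry 4 coins
print(hop.lower, hop.upper)      # [0, 3] [2, 3]
```

In this run C2 is pinned to exactly 3 coins, while C1 stays anywhere between 0 and its capacity of 2.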

While this strategy is aware of parallel channels, it sometimes can’t determine which channel-level bound should be updated. While hop-level bounds (that match channel-level bounds for single-channel hops) may be sufficient in some scenarios, the attacker can also be interested in channel-level bounds in the general case. Advanced attack techniques (jamming and policies), which we describe in Section Attack optimizations, may help. Our current attack strategy allows an attacker to be more efficient (e.g., make fewer probes) by taking any extra knowledge into account.

Information metrics

To measure the gained knowledge, we introduce a new metric that is aware of parallel channels. Intuitively, it reflects the attacker’s information gain. As soon as the desired information gain is reached for the current target hop, the attacker stops probing. We define the information gain as the change in uncertainty before and after the probing. But what exactly is uncertainty?

Consider a channel between Alice and Bob with capacity 1023. Ignoring fees, channel reserve, and in-flight payments for simplicity, Alice’s balance can take any value between 0 and 1023 - a total of 1024 possible values, which would require 10 bits to encode (the initial uncertainty). After one probe, the attacker learns that Alice’s balance can only take values from 0 to 511, gaining 1 bit of information and decreasing the uncertainty to 9 bits.

Step	Bounds	Uncertainty (granularity = 1)	Uncertainty (granularity = 256)
0	0-1023	10	2
1	0-511	9	1
2	256-511	8	0

The attacker may only wish to learn the balance up to a certain granularity. In the above example, the granularity is 1 satoshi. For the granularity of 256 satoshis, the initial bounds (0 - 1023) only correspond to 2 bits of uncertainty (the interval 0-1023 contains 2^2=4 “buckets” of 256 values each). We refer the reader to the paper for the exact (rather simple) formulas for uncertainty.
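The numbers in the table can be reproduced with the following sketch. It is only an illustration consistent with the examples in this post (counting "buckets" of possible values and taking a base-2 logarithm), not necessarily the exact formula from the paper; the function names are hypothetical.

```python
import math


def uncertainty_bits(lower, upper, granularity=1):
    """Bits needed to encode which bucket of `granularity` values the balance is in."""
    values = upper - lower + 1
    buckets = math.ceil(values / granularity)
    return math.log2(buckets)


def information_gain(before, after, granularity=1):
    """Drop in uncertainty between two (lower, upper) bound pairs."""
    return uncertainty_bits(*before, granularity) - uncertainty_bits(*after, granularity)


for bounds in [(0, 1023), (0, 511), (256, 511)]:
    print(bounds, uncertainty_bits(*bounds), uncertainty_bits(*bounds, granularity=256))
```

For instance, moving from bounds 0-1023 to 0-511 gives an information gain of 1 bit at either granularity.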

We define the information gain for the attack as a whole as the sum of information gains (assuming independence for simplicity) for all channels in all target hops. We may also express the information gain as a percentage of the total initial uncertainty. This allows us to say things like “the attacker is able to retrieve 80% of balance information from target channels” (see evaluation below).

Having defined the attack and the metric to evaluate its success, we implement it in a simulator which models the probing process using the algorithm described above.

Simulator

Multiple simulators have been developed for the LN. We believe that our simulator improves upon the state of the art in the following ways:

it accurately models parallel channels;
it reflects the fact that some channels only allow forwarding in one direction;
it simulates the time a real probing would take by generating randomized networking delays based on prior real-world measurements.

Another advantage of our approach is that we model the network itself and the user (in our case, the prober) separately. The simulator takes as input a network snapshot obtained from an LN node. Based on the snapshot, the simulator creates different graphs to represent the real network and the attacker’s view of the network (including the information obtained from probing). Such modularity allows for implementing other, more complex attack scenarios, for example, those with multiple probing entities, or with honest users’ payments.

The simulator is extensible: one can change the probing parameters and plug in other path-finding algorithms, amount selection algorithms (besides binary search), etc. We plan to release the simulator under an open-source license.

Countermeasures and evaluation

Now, the goal of honest users, routing nodes, and stakeholders of the LN is to reduce the information gain of an attacker.

Since probing is based on sending fake payments and observing the resulting errors, we came up with three error-related ideas:

random failures: a node fails a payment even if the balance at the next hop is sufficient;
error spoofing: a node pretends that an error happened in its next channel and not later;
error delays: a node delays sending the error message to the previous hop.
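To see why such errors hurt the attacker, consider the first idea only. The sketch below is a toy model, not the simulator from the paper: a hypothetical routing node fails a probe with probability `fail_prob` regardless of its balance, and a naive attacker keeps trusting every failure as "insufficient balance".

```python
import random


def forward_with_random_failure(balance, amount, fail_prob, rng):
    """Forward a probe unless a random failure or a lack of balance stops it."""
    if rng.random() < fail_prob:
        return False  # random failure: looks the same as insufficient balance
    return amount <= balance


def naive_probe(balance, capacity, fail_prob, rng):
    """Binary search that wrongly treats every failure as a balance signal."""
    lower, upper = 0, capacity
    while lower < upper:
        amount = (lower + upper + 1) // 2
        if forward_with_random_failure(balance, amount, fail_prob, rng):
            lower = amount
        else:
            upper = amount - 1
    return lower


rng = random.Random(1)
print([naive_probe(700, 1023, fail_prob=0.3, rng=rng) for _ in range(5)])
```

In this model the estimate never exceeds the true balance but is often too low, since a single spurious failure discards the correct half of the search interval. An attacker could compensate by repeating failed probes, at the cost of a longer attack.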

Each countermeasure may be applied with some probability or under some conditions. We assess how these countermeasures (separately and in combinations) make attacks longer and limit the attacker’s information gain. Our experiments are summarized in the following graph, where we plot the attacker’s information gain (as a percentage of the initial uncertainty) as it accumulates throughout the simulated time of the attack.

Each line represents one attack (the results are averaged across 10 simulations). All lines start from 0, indicating that at the start the attacker only knows public channel capacities. For the attacker, steeper lines are better (information is collected faster). As expected, the line without countermeasures is the steepest.

If no countermeasures are deployed, an attacker can achieve 80% information gain in 20 seconds per channel (full information extraction is not always possible due to parallel channels, non-forwarding hops, leaves, etc.). The most effective countermeasure is a combination of error delays and random failure, which bounds information gain at 25% while requiring 80 seconds per channel to reach it. Random failure alone also provides a good result of 30% bound on information gain with 40 seconds per channel.

The other two countermeasures alone are less promising: error delays and error spoofing can only make the attack about 2-3x longer than the baseline, and other combinations don’t yield better results than those mentioned above. For more detailed analysis, please refer to the paper.

Countermeasures come at a cost though. To understand the cost, we first note that honest LN users score routing nodes based on their historic reliability. Reliable nodes are prioritized for route construction. The drawbacks of the countermeasures described above are:

they damage the user experience for honest users: genuine error messages take longer to propagate to the sender and are unreliable (making channel scoring less useful);
they might not be incentive-compatible: routing nodes are supposed to sacrifice their channel score to improve other users’ privacy.

In the paper, we briefly discuss the tension between usability, incentives, and privacy. Some trade-offs might be acceptable.

We also suggest other potential countermeasures without an in-depth analysis of their performance:

intra-hop payment split: allow routing nodes to split a payment among their channels without coordination with the payer;
on-demand channel rebalancing (e.g., JIT routing);
adversarial strategies for multi-channel hops: a combination of special channel structure and batching to minimize information leak;
using “gates” of channels / links to prevent an external observer from inferring our internal structure beyond the gate;
rate-limiting via linking payments to Stake Certificates (blinded proofs of UTXO ownership).

Attack optimizations

Even if a hypothetical attacker is aware of parallel channels and tunes the inference engine accordingly, it turns out that in many cases it’s still impossible to infer all channel balances within a hop (e.g., the example in Challenges in spying on parallel channels), because an attacker can’t easily choose through which channels the payment goes.

In our work, we also explored how an attacker might use channel jamming and channel policy discrepancies to overcome the issues with probing parallel channels. These are currently important unsolved issues of the Lightning Network, so it’s fair to assume they will be relevant for a while.

It turns out that, in many cases, using either one of those techniques indeed allows an attacker to choose through which channel the payment will go, and thus, reliably probe specific channels. We verified this against our own real LN nodes.

With jamming, an attacker has to jam certain high balance channels (probably after probing them), to reach the previously inaccessible channels. With attack enhancements based on fees and other policies, an attacker can craft probes so that a victim forwards them only over a channel from a specific subset.

These optimizations also come at a cost:

channel jamming assumes an attacker locks liquidity (for “capacity-based jamming”) or opens many channels and pays on-chain fees (for “slot-based jamming”);
exploiting policy discrepancies requires victim channels to be heterogeneous.

To reduce the costs, an attacker might combine these attack vectors: for example, tune the fees of jamming payments so that they jam a specific channel.

Future work

The following topics look to us like natural next steps towards better understanding and combating channel probing:

the impact of routing fees on the attacks;
the noise that honest payment flows introduce;
capital costs of probing;
advancing attacks with behavior patterns of the victims and bulk probing.

We would be excited to see explorations of specific applications of probing, e.g. payment flow inference, built on top of this work.

Conclusions

It’s important to better understand channel probing because it’s a real privacy threat in the Lightning Network. In this work, we advance this understanding by considering parallel channels. We describe how an attacker may extract information from them using various probing techniques. We also suggest ideas for combating probing attacks and demonstrate their efficiency, while also noting the shortcomings they introduce. We hope to see community discussion around these trade-offs.

Unjamming Lightning

2022-11-18T00:00:00+00:00

For the past few months, I’ve been working with Clara Shikhelman at Chaincode Labs on the issue of jamming attacks in the Lightning Network.

Our efforts have resulted in a paper where we propose a solution that combines small unconditional fees and local reputation.

I’ve written a post summarizing our findings for the Chaincode Research blog:

https://research.chaincode.com/2022/11/15/unjamming-lightning/

LightPIR. Privacy-Preserving Route Discovery for Lightning (paper summary and analysis)

2022-02-16T00:00:00+00:00

Lightning is currently source-routed. This means that each sender does a local route search on the full network graph. This may become unsustainable as Lightning grows. Naively outsourcing route discovery to dedicated servers harms privacy: the servers know who is paying whom. The LightPIR paper proposes a solution. The authors combine private information retrieval with all-pairs-shortest-path pre-computation with hub labeling, optimized for real LN topology. In this post, I summarize the LightPIR protocol and outline the potential first steps to turn it from a research prototype to a real-world implementation.

Lightning routing

Multi-hop routing is a key feature of Lightning. Payments between users with no shared channels allow Lightning to truly perform as a network: every (public) channel provides liquidity not only to its counterparties but also to everyone else. Currently, Alice (the sender) defines the entire route to Bob (the receiver). To do so, Alice stores a snapshot of the public network based on P2P gossip.

As LN grows, it may become infeasible for each node to store a full network snapshot. Although the current LN size (18410 nodes and 82128 channels at the time of this writing, as per ACINQ) doesn’t sound like a lot by modern hardware standards, if we expect Lightning to grow by orders of magnitude, source routing may become challenging. Indeed, we want Lightning to work even on an IoT device with a tiny battery, little storage, and sporadic Internet connection. This justifies the need for alternative efficient route discovery algorithms.

A naive trust-based solution

Let’s start with a naive trust-based solution and then improve it in both efficiency and privacy. Imagine we have specialized nodes (servers) to do route discovery. Generally, we can evaluate a client-server protocol using many metrics:

server-side computation;
server-side storage;
client-side computation;
client-side storage;
communication.

There are two approaches to server-based route discovery:

a server calculates a route from Alice to Bob when Alice asks for it (just-in-time);
a server pre-calculates shortest routes for each pair of nodes and looks up the appropriate route when requested. This approach is called “all pairs shortest paths”, or APSP.

The just-in-time approach is less demanding but may introduce higher latency for the client. APSP requires more computation and storage from the server.
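The two approaches can be contrasted with a short sketch. It uses a breadth-first search on a toy unweighted graph (so "shortest" means fewest hops); the graph, the node names, and both function names are illustrative assumptions.

```python
from collections import deque


def shortest_route(graph, src, dst):
    """Just-in-time: breadth-first search run when the client asks."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            route = []
            while node is not None:  # walk back from dst to src
                route.append(node)
                node = parents[node]
            return route[::-1]
        for neighbor in graph[node]:
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # no route exists


def precompute_all_pairs(graph):
    """APSP: compute every route up front; answering a query is then a lookup."""
    return {(s, d): shortest_route(graph, s, d)
            for s in graph for d in graph if s != d}


graph = {"alice": ["carol"], "carol": ["alice", "bob"], "bob": ["carol"]}
table = precompute_all_pairs(graph)
print(shortest_route(graph, "alice", "bob"))  # ['alice', 'carol', 'bob']
print(table[("alice", "bob")])                # same route, served from the table
```

The table holds one entry per ordered pair of nodes, which is where the extra server-side computation and storage of APSP come from.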

The protocols we’ve considered so far are not private. The server knows what route a client requests (that is, who is paying whom). In contrast, the current source-based route discovery is relatively private[¹]. Our goal is to come up with a protocol that would combine the efficiency of APSP approaches with the privacy of client-side route discovery.

Private information retrieval (addressing privacy)

Let’s put LN specifics aside for a moment. Consider a server that stores a database DB, which is an array of boolean values. The client wants to obtain the i-th element. A straightforward protocol would be: the client sends i to the server, and the server replies with DB[i]. Clearly, the server knows which element the client has requested.

Can a client obtain DB[i] without the server knowing i? Theoretically, the only solution is to send the whole database to the client. This is hardly practical. Instead of the problem we know is practically unsolvable, let’s bend the rules a bit and consider a related problem under additional security assumptions.

Here is where private information retrieval (PIR) comes in. There are two branches of PIR: computational (CPIR) and information-theoretical (IT-PIR). CPIR allows for PIR with a single server but is slower and involves advanced cryptography (keyword: fully homomorphic encryption). IT-PIR, on the other hand, is fast, uses cheap cryptography, but only works with multiple servers. LightPIR is based on IT-PIR, so we’ll use the term PIR to refer to IT-PIR unless stated otherwise.

Consider two servers holding identical database copies. A simple PIR protocol works under an additional security assumption: the servers don’t collude.

the client generates a random string r and another string r' that only differs from r in the i-th bit;
the client sends r to the first server;
the first server selects elements at indices j where r[j]=1 and responds with a single bit d that is a XOR of those elements;
the client sends r' to the second server;
the second server responds analogously with a single bit d';
the client locally computes d XOR d' = DB[i].

In other words, the client obtains two bits that are XORs of nearly identical subsets of the database elements. The only difference between those subsets is that one includes DB[i] and the other doesn’t. XOR’ing them together, the client discovers DB[i] locally.
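The steps above fit in a few lines of code. This is a toy sketch of the two-server protocol with both "servers" simulated as local function calls; the function names are hypothetical, and the database is a list of 0/1 values.

```python
import secrets


def server_answer(db, query):
    """Server side: XOR together the elements selected by the query bits."""
    bit = 0
    for element, selected in zip(db, query):
        if selected:
            bit ^= element
    return bit


def make_queries(db_size, i):
    """Client side: a random bit string r, and r' that differs only in bit i."""
    r = [secrets.randbelow(2) for _ in range(db_size)]
    r_prime = list(r)
    r_prime[i] ^= 1
    return r, r_prime


def retrieve(db, i):
    """Run the protocol against two servers holding identical copies of db."""
    r, r_prime = make_queries(len(db), i)
    d = server_answer(db, r)              # answer from the first server
    d_prime = server_answer(db, r_prime)  # answer from the second server
    return d ^ d_prime


db = [1, 0, 1, 1, 0, 0, 1, 0]
print([retrieve(db, i) for i in range(len(db))] == db)  # True
```

Each server sees only a uniformly random bit string, so neither learns i on its own; the guarantee breaks if the two servers compare their queries.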

This simple PIR algorithm is not very efficient. Server-side computation cost and the communication cost are linear in the database size. Even though each response is a single bit long, the request r is as long as the database itself.

Advanced PIR protocols are more efficient. One idea is to consider the database as a matrix instead of a linear array. This allows improving communication complexity by increasing the number of servers. For instance, with four non-colluding servers, the client only sends requests whose length is on the order of the square root of the number of elements. For more details, see this introductory lecture.

Hub labeling (addressing efficiency)

Any digital map service finds a route across a continent in seconds. Isn’t it amazing, considering that a road graph may contain millions of points?

Turns out, there is a lot of space for optimizations in route discovery algorithms that exploit the properties of particular graphs. In particular, roads are not equally likely to occur on a random route. Most routes (except very short ones) follow this pattern:

drive to the nearest highway entrance;
drive along the highway;
exit the highway and drive to your destination.

Route discovery algorithms have been heavily optimized based on the fact that highways are involved in a disproportionately large share of long routes. A key optimization is called hub labeling. Informally, a hub is a node that is part of many sufficiently long shortest paths. (Each highlighted word in the previous sentence is of course formally defined.) Remember an earlier APSP method, where the server pre-computes routes for each pair of nodes? Let’s now modify APSP with hubs in mind.

Let each server store, for each node, a list of its hubs and the shortest route to each of them. Finding a route from Alice to Bob involves these steps:

find a node that both Alice and Bob consider their hub;
look up the shortest route from Alice to the hub;
look up the shortest route from the hub to Bob[²];
concatenate the two partial routes to get the full route.
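The lookup-and-concatenate procedure can be sketched as follows. The label format (for each node, a mapping from hub to the stored route towards that hub) and the node names are assumptions made for illustration, and the graph is treated as undirected so that the route from the hub to Bob is simply Bob's stored route reversed.

```python
def find_route(labels, src, dst):
    """Join src -> hub and hub -> dst through the best hub both nodes share."""
    common = set(labels[src]) & set(labels[dst])
    if not common:
        return None  # should not happen if the hub sets are chosen properly
    hub = min(common, key=lambda h: len(labels[src][h]) + len(labels[dst][h]))
    to_hub = labels[src][hub]          # stored route src -> hub
    from_hub = labels[dst][hub][::-1]  # stored route dst -> hub, reversed
    return to_hub + from_hub[1:]       # drop the duplicated hub node


labels = {
    "alice": {"h1": ["alice", "h1"], "h2": ["alice", "carol", "h2"]},
    "bob": {"h1": ["bob", "h1"], "h2": ["bob", "h2"]},
}
print(find_route(labels, "alice", "bob"))  # ['alice', 'h1', 'bob']
```

No graph search happens at query time: the work is a set intersection plus two lookups, which is what makes the approach attractive to combine with PIR.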

An attentive reader might spot an issue here. What happens if Alice and Bob have no hubs in common? To avoid this scenario, hub sets are chosen to intersect all shortest paths of non-trivial length. This task is equivalent to the set cover problem, which is NP-complete. Heuristic algorithms do the job well enough for practical applications.

A (heuristic) hub set labeling algorithm computes shortest paths between any pair of nodes and assigns hub labels. The hub sets generated by a heuristic are valid (i.e., they allow for lookup-and-concatenate route discovery) but may be suboptimal (for example, include more hubs than necessary). Still, the result is good enough for practical applications.

LightPIR: putting this all together

At long last, let’s discuss the contributions of the LightPIR paper! The authors combine PIR and hub labeling (HL) for route discovery in payment channel networks. Assuming three types of nodes (a content provider, servers, and clients), the protocol goes as follows:

the content provider (CP) creates a network snapshot;
CP populates the graph with hub sets and shortest routes to hubs[³];
CP sends copies of it to the servers;
clients query multiple servers using a PIR protocol.

The novelty of the paper is an improved HL algorithm tailored to the LN topology. The LN-optimized HL algorithm goes as follows:

select K nodes with the highest degree and put them in the hub set for all nodes;
exclude those nodes from the graph;
find the remaining hubs based on the pruned graph.
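The first two steps can be sketched as below; the third step (running a regular hub labeling heuristic on the pruned graph) is omitted. The adjacency format, the tie-breaking by node name, and the toy graph are assumptions for illustration and are not taken from the paper.

```python
def preselect_hubs(adjacency, k):
    """Pick the k highest-degree nodes as global hubs and prune them from the graph."""
    by_degree = sorted(adjacency, key=lambda n: (-len(adjacency[n]), n))
    global_hubs = set(by_degree[:k])
    pruned = {
        node: neighbors - global_hubs
        for node, neighbors in adjacency.items()
        if node not in global_hubs
    }
    return global_hubs, pruned


adjacency = {
    "acinq": {"a", "b", "c"},
    "a": {"acinq", "b"},
    "b": {"acinq", "a"},
    "c": {"acinq"},
}
hubs, pruned = preselect_hubs(adjacency, k=1)
print(hubs)    # {'acinq'}
print(pruned)  # {'a': {'b'}, 'b': {'a'}, 'c': set()}
```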

Intuitively, LightPIR exploits the fact that the LN is even more centralized than highway networks. Instead of repeatedly “discovering”, for instance, that the ACINQ node (the most connected public node at the time of this writing) belongs to the hub set for Alice, and for Bob, and for Carol, etc., the proposed algorithm simply adds ACINQ to all hub sets. The same applies to K most connected nodes, which are then temporarily excluded from the graph. Additional “node-specific” hubs are then discovered on the pruned graph. How many most-connected nodes shall we pre-select into all hub sets? The authors run experiments on historic LN snapshots and conclude that K should be around 100.

A note on dataset validity

Initially, I had doubts about the dataset the paper is based on. The data source is Christian Decker’s Lightning Network Gossip repository, which is trustworthy. However, the following phrase raised suspicion:

although the network size almost doubles from March 2019 to January 2021, the number of nodes in the largest SCC remains pretty consistent across all snapshots (≈ 2500 nodes)

This seemingly contradicted other sources (like Bitcoin Visuals), which report a consistent growth in the number of Lightning nodes during that period. Moreover, in the snapshots I took for my own research in 2021, 99% of nodes belonged to the main connected component.

In fact, there is no contradiction. The LightPIR paper uses a directed graph model and only considers the main strongly connected component (SCC). The main SCC is the largest set of nodes where each pair of nodes has a directed route. The (non-strongly) connected component, in contrast, doesn’t take directionality into account, and hence may be much larger than the main SCC[⁴].
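The difference is easy to see on a toy directed graph. The sketch below uses plain reachability checks, which is fine for a handful of nodes but far too slow for a real snapshot; the graph and the function names are made up for illustration.

```python
def reachable(graph, start):
    """All nodes reachable from `start` by following directed edges."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def largest_scc(graph):
    """Largest set of nodes in which every pair has a directed route both ways."""
    nodes = set(graph) | {v for neighbors in graph.values() for v in neighbors}
    reach = {n: reachable(graph, n) for n in nodes}
    best = set()
    for n in nodes:
        scc = {m for m in reach[n] if n in reach[m]}
        if len(scc) > len(best):
            best = scc
    return best


def largest_connected_component(graph):
    """Same idea, but with edge directions ignored."""
    undirected = {}
    for u, neighbors in graph.items():
        undirected.setdefault(u, set())
        for v in neighbors:
            undirected[u].add(v)
            undirected.setdefault(v, set()).add(u)
    return max((reachable(undirected, n) for n in undirected), key=len)


# a -> b -> c -> a form a cycle; d and e are attached in one direction only.
graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "e": ["a"]}
print(sorted(largest_scc(graph)))                  # ['a', 'b', 'c']
print(sorted(largest_connected_component(graph)))  # ['a', 'b', 'c', 'd', 'e']
```

Here the main SCC covers three of five nodes, while the connected component covers all of them.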

Even if there is no contradiction, it’s unclear why the authors chose to only consider the main SCC, which contains only between 39% and 72% of public nodes[⁵]. More thought is required to more accurately reflect the real-world network structure. Would the results change if we consider the main non-strongly connected component?

Implementation prospects

Assuming LightPIR is a good idea, how should it be converted into a practically implementable proposal for Lightning? I see a few issues that should be addressed.

Non-collusion assumption

PIR protocols are based on the assumption that servers don’t collude. The paper that introduced PIR (Chor et al., 1995) justifies this assumption from the reputational standpoint:

We assume that the servers do not collude in trying to violate the user’s privacy. <…> A detected violation of the privacy guarantees will result in severe damage to the server. It is as if a bank were caught in fraud. <…> In the rare case where the user values its potential loss as more substantial than the server’s risk, the user should not use a PIR scheme in which privacy depends on a noncollusion assumption.

This seemingly contradicts the permissionless nature of Lightning. However, Lightning nodes already have a somewhat persistent identity and reputation. This is especially relevant for Lightning Service Providers, or LSPs (a vague term for large nodes providing LN services professionally). If LSPs put their reputation on the line and are punished for colluding, the scheme might work in practice. The key question then is: how to provably detect collusion?

IT-PIR vs CPIR

As mentioned earlier, there are two kinds of PIR: information-theoretical (IT-PIR) and computational (CPIR). The IT-PIR approach, which LightPIR is based on, achieves exactly zero information leakage even if the attacker has infinite computing power. CPIR, on the other hand, allows for a single-server PIR, assuming the attacker’s resources are bounded. The paper doesn’t justify the choice of IT-PIR as opposed to CPIR.

The aforementioned 1995 paper writes on the topic of the non-collusion assumption:

The single-server computational PIR scheme of Kushilevitz and Ostrovsky (1997) addresses this concern.

Wouldn’t the CPIR approach be more suitable for Lightning? A bound on the attacker’s resources is the assumption under which most real-world systems, including Bitcoin and Lightning, operate anyway. What are the downsides of CPIR in this context?

A single source of network graph data

The LightPIR model assumes a single source of truth about network topology, compiled by a dedicated entity (the content provider). Lightning, in contrast, has no canonical network view: nodes compose their own graphs based on gossip. (Yet, everyone is very likely to be aware of the most active and connected nodes.)

There might be two ways to reconcile theory and practice. First, we could amend the theory to allow database copies to be slightly different. The servers then could independently compile their graphs based on gossip. Second, LN nodes could opt in to a centralized data source. For instance, a well-known LSP (think 1ML, LNBIG, or ACINQ) would periodically publish fresh network snapshots. Servers (maintained, for instance, by wallet providers) would download the snapshot and announce to their clients that they provide PIR-based route discovery based on the graph from a certain date. Clients would then query a random subset of wallet providers to privately retrieve routes.

A common route quality metric

LightPIR assumes that all clients use the same metric to evaluate route quality. In the simplest case (ignoring fees), the model provides shortest routes. If we use fees as edge costs, we’d be talking about cheapest routes. On the one hand, the cheapest routes model may capture the desires of many clients. On the other hand, this assumption may open clients to attacks where the adversary attracts payments by advertising low fees. On top of that, total fees may be just one of the components of route quality function. Some clients may want to also optimize for:

route length (longer routes increase the chance of payment failure);
success probability (related to but not fully defined by route length);
avoiding specific nodes (known adversarial nodes, nodes located in a certain region, etc).

We may bridge this gap between theory and practice in one of two ways (or both):

amend the theory to allow for user-defined route quality metrics;
use the protocol as is for users who aim for a simple metric (such as minimal fees).

The model doesn’t account for amounts (and fees?)

The graph model as presented in the paper is quite simplified. First, it doesn’t account for payment amounts and channel capacities. Second, it doesn’t account for fees (though I’m not exactly sure[⁶]).

Also, the key challenge in scaling LN routing is accommodating dynamic parts of the graph. The nodes and channels are somewhat static, whereas fee and policy updates may be frequent. How should LightPIR be modified to not only account for fees but also to reflect ongoing fee updates?

Conclusion

LightPIR is a route discovery protocol for payment channel networks that combines private information retrieval and optimized hub labeling. It builds on prior work in route discovery for road networks and achieves higher efficiency by exploiting LN’s centralized topology. Simulations based on historical LN snapshots indicate how to best parameterize the algorithm.

LightPIR is an example of research that applies prior scientific findings in new contexts. In this case, results on private database lookup and routing for road networks have been applied to payment channel networks. This approach is valuable and underappreciated. Most likely, there are lots of valuable ideas in scientific literature from long before Bitcoin came along, waiting to be applied in modern development. However, at least in the case of LightPIR, more effort is required to turn this protocol into an implementation-ready proposal.

Probing parallel channels in the Lightning network

2022-01-03T00:00:00+00:00

In this post, we summarize our paper on channel balance probing in the Lightning network. It supersedes our earlier work on this topic. A video presentation based on this post (a longer version is also available):

First, we briefly introduce the Lightning network (LN) and the channel balance probing attack. Then, we propose an enhanced probing technique that allows an attacker to extract more private information faster. We run simulations based on real-world data and conclude that the proposed probing method is indeed better than prior art. Finally, we discuss potential countermeasures and their trade-offs.

Lightning Network 101
Channel balance probing
Probing model
Probing multi-channel hops
Jamming-enhanced probing
Evaluation
Conclusion

Lightning Network 101

The Lightning Network (LN) is a layer-two protocol for fast bitcoin payments. It is considered the major scaling solution for Bitcoin. As of January 2022, its publicly announced part consists of 17 thousand nodes and 79 thousand payment channels.

A payment channel is a cryptographic protocol for off-chain bitcoin payments between two parties. A useful mental model to visualize a channel is “beads on a string”. The beads cannot leave the string, they can only move back and forth.

The total number of coins in a channel is called its capacity, and the numbers of coins currently owned by Alice and Bob are their respective balances. The two balances sum up to the capacity, so we can infer one balance from the other. We define the channel balance to be the balance of the node with the alphabetically smaller name (in this example, that would be “Alice”). We refer to a pair of adjacent nodes together with all channels that they share as a hop.

Alice doesn’t have to establish a direct channel to Charlie to send him payments. Instead, she can use a multi-hop path (in this example, via Bob). Multi-hop payments work as follows. Alice offers Bob one coin under the condition that he forwards one coin to Charlie. Bob forwards one coin to Charlie, who uses the payment secret known only to him to redeem the coin. Bob can then use the same secret to redeem the coin from Alice. Hence, one coin has effectively moved from Alice to Charlie.
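
In Lightning, the condition attached to each offer is a hash lock: the coin can be redeemed by presenting a secret whose SHA-256 hash matches the value fixed in the offer. A minimal sketch of that condition follows; it ignores timeouts, fees, and onion routing, and the can_redeem helper is a name invented for this example.

```python
import hashlib
import os

# Charlie generates the payment secret and shares only its hash.
secret = os.urandom(32)
payment_hash = hashlib.sha256(secret).digest()

def can_redeem(candidate_secret, locked_hash):
    """A conditional offer pays out only to whoever presents the right secret."""
    return hashlib.sha256(candidate_secret).digest() == locked_hash

# Both offers (Alice -> Bob and Bob -> Charlie) are locked to the same hash.
assert can_redeem(secret, payment_hash)              # Charlie redeems from Bob
assert can_redeem(secret, payment_hash)              # Bob reuses the secret with Alice
assert not can_redeem(os.urandom(32), payment_hash)  # a random guess fails
```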

The key issue with this process is that Alice doesn’t know in advance whether Bob has sufficient balance in the channel towards Charlie. If Alice tries to send another coin along the same path, the payment would fail. Therefore, Lightning follows the trial-and-error approach. The sender may have to make several payment attempts until one of them succeeds.

As we will see, one can exploit the error reporting mechanism of the Lightning network in an attack called balance probing.

Channel balance probing

The attacker (also referred to as the Prober) wants to learn remote channel balances (which is private information). To achieve this goal, it sends fake payments, or probes, and observes where they fail. If a probe reaches the final destination, all channels along the path have sufficient balances. Otherwise, if the probe fails somewhere along the path, the Prober learns that the erring node lacks balance.

The attacker’s knowledge may be visualized as follows. The outer interval denotes the target channel. The star shows the true balance. The colored area is the set of all points where, according to the attacker’s current knowledge, the balance may be. b^l and b^u are the current balance bounds.

Initially, the colored area covers the whole interval. By making a series of probes, the attacker updates the balance estimates and shrinks the colored interval. Assuming that balances are equally likely to take any value between zero and the channel capacity, the optimal strategy is to divide the colored interval in half with every probe.
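
The halving strategy for a single channel is ordinary binary search. The sketch below assumes integer balances, a direct connection to the target hop, and a probe that gets through exactly when its amount does not exceed the balance; the function name and the numbers are illustrative.

```python
def probe_single_channel(true_balance, capacity):
    """Binary search for the balance of one channel, in integer units.

    A probe of a given amount is assumed to get through
    if and only if amount <= true_balance.
    Returns the final bounds and the number of probes sent.
    """
    lower, upper = 0, capacity  # invariant: lower <= true_balance <= upper
    probes = 0
    while lower < upper:
        amount = (lower + upper + 1) // 2  # midpoint of the interval, rounded up
        probes += 1
        if amount <= true_balance:  # the probe reached the destination
            lower = amount
        else:                       # the probe failed at the target hop
            upper = amount - 1
    return lower, upper, probes

print(probe_single_channel(true_balance=300_000, capacity=1_000_000))
```

Each probe at least halves the number of candidate balances, so a channel with a capacity of one million units needs at most 20 probes.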

The probing algorithm doesn’t always work perfectly. Consider a hop with two channels. Such channels are called parallel.

A routing node (Alice) is free to choose any of the parallel channels to forward the probe. After receiving the error message, the attacker doesn’t know which channel it applies to. As a result, the classic probing algorithm becomes inapplicable.

Note that while the prober cannot update individual balance bounds, it does get some information about the hop as a whole. We need a new probing model to describe what exactly the attacker learns in this case.

Probing model

We propose a new geometrical model that describes probing in the general case, for any number of parallel channels. To introduce our model, let’s use a two-dimensional example. Consider a two-channel hop with the capacities of both channels equal to C. It can be represented as a square with corners (0,0), (0,C), (C,C), (C,0).

Each point within the square corresponds to a possible vector of channel balances. The star denotes the true balance point: the first channel has balance b_1, and the second channel has balance b_2.

The attacker sends the first probe of amount a_1.

The probe doesn’t reach the destination, so the prober concludes that all channel balances are less than the probe amount: b_1 < a_1 and b_2 < a_1. Geometrically, this means that the true balance is inside the a_1-sided square that the probe “cuts” from the lower-left corner of the larger square. Now, the attacker sends another probe with amount a_2 that does reach the destination. This means that at least one of the channels has sufficient balance: either b_1 > a_2 or b_2 > a_2. Geometrically, it means that the true balance is outside of the a_2-sided square. As a result of these two probes, the attacker has obtained the upper and lower bounds that correspond geometrically to the colored L-shaped figure (the difference of two squares).

What do these bounds bound, by the way? As mentioned before, the prober doesn’t necessarily learn anything about individual balances. Instead, it learns how much a hop can forward in the probe direction – simply speaking, the maximum of the balances. We refer to this value as h = max(b_i). The analogous value in the opposite direction is denoted as g.

Probes in the opposite direction have a similar representation in the geometrical model: instead of “cutting” squares from the lower-left corner of the larger square, they cut squares from the upper-right corner. Consider the state of probing after four probes have been done:

The attacker’s knowledge consists of four values – the lower and upper bounds on h and g. The bounds on h – h^l and h^u – define an L-shape “looking north-east”, whereas g^l and g^u define an analogous L-shape “looking south-west”. The intersection of these shapes defines the attacker’s knowledge: the smaller the area of the resulting figure, the more precisely the prober knows where the true balances are.

Here is where our first contribution comes in. We suggest choosing each next probe amount such that the probe cuts the colored figure in half by area. Prior algorithms chose the probe amount as the mid-point between the current balance estimates, which may be sub-optimal in the multi-channel case. Instead, our generalized approach is optimized for hops with any number of channels.
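
To see why the mid-point is sub-optimal, consider a simplified special case: n equal-capacity channels, continuous balances, and only the bounds h^l and h^u known (no probes from the opposite direction yet). The colored figure then has volume (h^u)^n - (h^l)^n, and the amount that halves it has a closed form. The sketch below covers only this special case, which is an assumption of this example; the algorithm in the paper handles the general figure.

```python
def halving_amount(h_lower, h_upper, n_channels):
    """Probe amount that halves the volume of {h_lower < max(b) <= h_upper}.

    Assumes n equal-capacity channels, continuous balances, and no
    information yet from probes in the opposite direction.
    """
    n = n_channels
    return ((h_lower ** n + h_upper ** n) / 2) ** (1 / n)

C = 1.0
print(halving_amount(0, C, 1))  # 0.5: for one channel this is the usual mid-point
print(halving_amount(0, C, 2))  # ~0.707: a square with this side has half the area
```

For two channels, the plain mid-point C/2 would cut off a square of area C^2/4, that is, only a quarter of the figure instead of a half.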

The following figures illustrate the process of probing a 2-channel hop step by step. (We only go through the first four steps explicitly.)

Note that at some point the colored area splits into two disjoint diagonally symmetric rectangles. This is a representation of the fact that balances can only be probed up to permutation, because the model assigns channels to axes randomly.

The same model naturally applies to any number of channels (and therefore, dimensions).

Now consider a question. Given enough probes, can the attacker probe any hop, with any number of channels?

Probing multi-channel hops

Consider a 3-channel hop with equal-capacity channels.

Analogously to the two-dimensional case, each probe now cuts a cube (instead of a square) from the lower-left vertex of the larger cube that represents the hop. The two bounds on h correspond to two surfaces, each composed of three perpendicular faces of the respective cube. The true balance must be above the smaller (purple) surface representing the lower bound, and below the larger (orange) surface representing the upper bound.

What happens when the attacker learns h precisely? The two surfaces collapse into one:

The balance must be somewhere on the colored surface.

Probes from the opposite direction produce a symmetrical surface, also composed of three perpendicular squares. The true balance must be somewhere on the intersection of these two surfaces. However, in the general case, two such surfaces intersect along a line composed of six intervals. The attacker cannot learn exactly where on this line the balance is! Compare it to the two-dimensional case, where instead of surfaces we had linear L-shapes, which neatly intersected at exactly two points, reflecting the true balance vector (modulo permutation).

An intuitive interpretation of this difference could be as follows. There are only two directions that a channel can be probed in. Probing in each direction decreases the dimensionality by one. That is, if the hop in question has only one or two channels, the final result would only contain one or two points. In the 3-dimensional case, the best the attacker can achieve is a line, that is, a one-dimensional figure. In the 4-dimensional case, the end result would be some surface, in the 5-dimensional case – some 3-dimensional volume, and so on.
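
The ambiguity can be checked by brute force on a toy hop. The sketch below assumes equal capacities, small integer balances, and an attacker who has learned h and g exactly; it treats balance vectors as equal up to permutation by sorting them. The helper names are made up for this example.

```python
from itertools import product

def observable(balances, capacity):
    """What probing from both directions can reveal about a hop:
    h = max(b_i) one way and g = max(capacity - b_i) the other way."""
    h = max(balances)
    g = max(capacity - b for b in balances)
    return h, g

def consistent_vectors(h, g, n_channels, capacity):
    """All balance vectors (as sorted tuples) matching the observed h and g."""
    found = set()
    for b in product(range(capacity + 1), repeat=n_channels):
        if observable(b, capacity) == (h, g):
            found.add(tuple(sorted(b)))
    return found

C = 10
two = consistent_vectors(*observable((7, 3), C), n_channels=2, capacity=C)
three = consistent_vectors(*observable((7, 3, 5), C), n_channels=3, capacity=C)
print(two)         # {(3, 7)}
print(len(three))  # 5
```

With two channels, h and g pin down the balances up to permutation. With three channels, the largest and the smallest balance are known, but the middle one can be anything in between.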

(There is another scenario when a multi-channel hop may not always be probed fully. Can you guess what the reason is? If you want the answer, see Appendix B in the paper.)

The key issue that prevents full information extraction is that the prober cannot influence which of the parallel channels the probes go through. Instead, probes only reveal information about the aggregate of the balances. If only there were a way to force probes to go through a specific channel…

Jamming-enhanced probing

We suggest combining jamming with probing to extract more information from multi-channel hops.

Jamming is a type of denial-of-service attack on Lightning channels. The attacker sends a payment to itself (either via a circular route or simply to another node that it also controls) and deliberately delays finalizing the payment. As a result, the funds along the route are left “in-flight” and are unavailable for other payments.

There are two types of jamming: by capacity and by slots. Discussing them is outside the scope of this post (please refer to the Background section of the paper and references therein). For our purposes, it’s sufficient to understand that an attacker can temporarily disable a victim channel.

We suggest combining jamming and probing to overcome the dimensionality issue described above. In particular, the attacker can jam all channels in a multi-channel hop except one, and then probe the remaining channel. In other words, while the attacker cannot influence how a routing node chooses a channel to forward a probe, it is possible to decrease the set of suitable channels the node picks from.

Geometrically, jamming-enhanced probing boils down to revealing each channel individually. In the 3-dimensional case, the prober first reveals b_1, then b_2, and then b_3. Each balance is represented by a plane parallel to the corresponding axis. The intersection of three perpendicular planes is a single point representing the true balance vector.
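
The procedure can be sketched as a loop over channels. This toy version assumes that jamming is perfect and free, that the attacker can choose which channels to jam, that balances are integers, and that only one direction is probed; the function names are invented for this example and do not come from the simulator.

```python
def probe_hop(balances, jammed, amount):
    """A probe gets through if any non-jammed channel can forward it."""
    return any(amount <= b for i, b in enumerate(balances) if i not in jammed)

def jamming_enhanced_probing(balances, capacity):
    """Reveal each balance by jamming all other channels, then binary searching."""
    n = len(balances)
    revealed = []
    for target in range(n):
        jammed = set(range(n)) - {target}  # leave only the target channel usable
        lower, upper = 0, capacity
        while lower < upper:
            amount = (lower + upper + 1) // 2
            if probe_hop(balances, jammed, amount):
                lower = amount
            else:
                upper = amount - 1
        revealed.append(lower)
    return revealed

print(jamming_enhanced_probing([7, 3, 5], capacity=10))  # [7, 3, 5]
```

Without jamming (an empty jammed set), the same probes would only ever reveal the largest balance, 7.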

To recap: our contributions are as follows. We introduce a new probing model that accurately describes the attacker’s knowledge when probing multi-channel hops. We propose jamming-enhanced probing to overcome the limitation on information extraction in multi-channel hops. Finally, we suggest using an optimized algorithm (generalized binary search) to select probe amounts for multi-channel hops.

The question now is: how do we measure the benefits that our proposed improvements provide?

Evaluation

We evaluate our approach using our own simulator written in Python. We capture a snapshot of the network using our own Lightning node and assign balances to channels uniformly at random. We then pick 20 target hops with a given number of channels (from 1 to 5) and probe them in the simulator.

We use two metrics to assess the success of the attack: information gain and probing speed. Information gain reflects the share (from 0 to 1) of uncertainty about channel balances in target hops that the attacker was able to resolve. (By uncertainty we mean the binary logarithm of the number of points contained in the final figure describing the attacker’s knowledge.) Probing speed shows how much information the prober gets per message sent (a message is either a probe or a jam).
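
One way to read these definitions is sketched below. The exact formulas used in the paper may differ in details; the function names and the example numbers are assumptions of this sketch.

```python
from math import log2

def uncertainty(n_points):
    """Binary logarithm of the number of balance vectors still possible."""
    return log2(n_points)

def information_gain(points_before, points_after):
    """Share of the initial uncertainty that probing resolved (0 to 1)."""
    before = uncertainty(points_before)
    return (before - uncertainty(points_after)) / before

def probing_speed(points_before, points_after, messages):
    """Bits of uncertainty resolved per message (probe or jam) sent."""
    return (uncertainty(points_before) - uncertainty(points_after)) / messages

# Hypothetical single-channel hop with 2**20 possible balances,
# narrowed down to a single point with 20 probes.
print(information_gain(2**20, 1))   # 1.0
print(probing_speed(2**20, 1, 20))  # 1.0
```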

We alter the probing algorithm in three ways:

Jamming-enhanced probing vs non-enhanced probing
Optimized amount selection vs simple binary search
Direct vs remote probing

In direct probing, the attacker establishes a channel to the target hop directly. In the real network, this requires an on-chain fee but, on the other hand, all probes reach the target hop. In remote probing, the attacker sends probes along multi-channel paths. This allows for amortizing the cost of channel openings across many target hops, but some probes are wasted due to insufficient balances in intermediary hops.

For each alteration of the probing algorithm, we run the simulation 100 times and average the results.

For information gain, we observe that:

for non-enhanced probing (the left graph), the information gain decreases as the number of channels increases (due to the dimensionality issue);
jamming-enhanced probing (the right graph) overcomes this limitation, achieving nearly full information extraction for multi-channel hops;
all else equal, remote probing performs slightly worse than direct probing.

For probing speed, we observe that:

Direct probing with optimized amount selection (the left graph, blue line) achieves nearly perfect probing speed of 1 bit / message;
Remote probing is always slightly slower than direct probing;
Optimized amount selection is always faster than non-optimized amount selection.

In summary: we confirm that jamming-enhanced probing yields more balance information, and that optimized amount selection allows for faster probing.

Potential countermeasures may be divided into node-level policies (something a single node can apply) and network-level protocol changes. On the node level, popular routing nodes may batch payments (so that payments in the opposite directions cancel each other out), split payments among their parallel channels, establish unannounced channels in parallel to public ones, or drop or forge error messages (which would, however, decrease reliability). Measures on the network level largely intersect with potential anti-jamming proposals, for instance, upfront fees for both successful and failed payment attempts.

Conclusion

In summary, we have introduced an enhanced probing technique for Lightning channels and confirmed using simulations that it reveals channel balances better and faster.

More generally, the issue we’ve been discussing illustrates the dilemma for Lightning. As long as Lightning is permissionless and privacy-focused (in particular, it uses onion routing), bad actors will be able to abuse it by mounting attacks on reliability (such as jamming) or privacy (such as probing). The key challenge for LN development is to limit the negative effects of unwanted network activity while preserving the permissionless nature of the network. We hope this work helps advance the understanding of the relevant trade-offs and serves as a basis for future protocol improvements.

For more details, see the full paper (to be presented at Financial Cryptography 2022). Slides, a video presentation (roughly based on this post), and the source code of the simulator are also available.

Clustering transactions in Bitcoin and other cryptocurrencies

2019-11-25T00:00:00+00:00

For a significant portion of 2018, as part of my PhD studies in the CryptoLUX group at the University of Luxembourg, I worked on network-level privacy attacks on Bitcoin and other cryptocurrencies with professor Alex Biryukov. This blog post summarizes our findings, which were published in 2019 (“Deanonymization and linkability of cryptocurrency transactions based on network analysis”). You can watch my presentation at EuroS&P 2019 in Stockholm (press CC for subtitles; slides are also available):

You thought it was enough to use mixers or privacy-focused altcoins to preserve the privacy of your cryptocurrency transactions? Think again…

Privacy in cryptocurrencies

Bitcoin is the first successful implementation of decentralized digital money. Its key innovation is using proof-of-work to make modifying the ledger difficult. But in order to be really decentralized, Bitcoin and other cryptocurrencies must provide at least some level of privacy. Otherwise, those in power would still have some influence over cryptocurrency users.

Unlike opening a bank account, generating a Bitcoin address doesn’t require having a passport. You can, and are in fact advised to, generate a new address for every transaction. This doesn’t guarantee full privacy though.

The fundamental trade-off here is between privacy and verifiability. Bitcoin tries to be decentralized: users must be able to independently validate incoming transactions. But to be able to verify incoming bitcoins, I must know where they are coming from.

Bitcoin transactions, once included in a block, are stored on thousands of nodes worldwide. Anyone can download the whole Bitcoin transaction history and analyze it. Turns out, there are techniques to extract quite a bit of information from the blockchain.

A Bitcoin transaction consumes (spends) a number of unspent transaction outputs (UTXOs) and creates new UTXOs. One heuristic is that all UTXOs spent in a transaction belong to the same entity. This is not generally true due to multisigs etc., but good enough for a heuristic. Another heuristic is that one of the outputs is the change address. Just as with cash, you usually don’t pay the exact amount, but pay a bit extra and receive change back. Multiple papers have been published along those lines, and there are commercial companies which offer services of deanonymizing cryptocurrency users.

Alternative cryptocurrencies aim at stronger privacy. More established ones include Monero, Zcash, and (to a lesser degree) Dash. Newer ones, Beam and Grin, are based on the Mimblewimble protocol. They do indeed prevent or at least hinder blockchain analysis. While flaws have been described in all of them, one cannot simply open a blockchain explorer and look where certain coins come from.

The information stored on the blockchain is stored there forever. But what about the ephemeral information in the peer-to-peer network? Can an attacker extract any useful deanonymizing data from observing the network traffic? What makes the idea even more interesting is that privacy-focused cryptocurrencies use P2P networks similar to Bitcoin’s…

Randomization of transaction propagation

Cryptocurrencies rely on peer-to-peer networks to propagate data. In particular, a Bitcoin node connects to 8 random nodes (and may accept incoming connections) and relays transactions to them. They, in turn, relay it to their neighbors, and so on. After a few seconds, nearly all nodes become aware of the new transaction, and miners can include it in a block.

If Alice knows a new transaction, she first announces its hash in an inventory (INV) message to Bob. Only if Bob is interested in this transaction does he reply with a GETDATA message, after which Alice sends the transaction itself in a TX message. If this gossiping mechanism were implemented naively, it would be dangerous for privacy: an adversary could listen to the network and try to estimate the “rumor source”. That’s why there are certain broadcast randomization techniques used in Bitcoin and other cryptocurrencies: trickling and diffusion.

Trickling was used in Bitcoin before 2015 and is still used in Zcash. In trickling, a node chooses a random subset of its neighbors and announces the transaction only to them. Then, after a certain delay, another random subset is chosen, and the transaction is announced to its members. In diffusion, a node announces the transaction to each neighbor after a separate random delay. As we will show, these methods hinder but do not completely eliminate network analysis.
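
The two techniques can be contrasted with a toy scheduler that decides when each neighbor hears about a transaction. Everything here is a simplification made for illustration: the delay distribution, the subset size, the round interval, and the rule that trickling only picks neighbors that have not been served yet are assumptions of this sketch, not parameters of any real client.

```python
import random

def diffusion_schedule(neighbors, mean_delay, rng):
    """Diffusion: each neighbor gets its own random announcement delay."""
    return {n: rng.expovariate(1 / mean_delay) for n in neighbors}

def trickling_schedule(neighbors, subset_size, round_interval, rng):
    """Trickling: every round, announce to a fresh random subset of the
    neighbors that have not heard about the transaction yet."""
    pending = list(neighbors)
    schedule, now = {}, 0.0
    while pending:
        chosen = rng.sample(pending, min(subset_size, len(pending)))
        for n in chosen:
            schedule[n] = now
            pending.remove(n)
        now += round_interval
    return schedule

rng = random.Random(1)
peers = [f"peer{i}" for i in range(8)]
print(diffusion_schedule(peers, mean_delay=2.0, rng=rng))
print(trickling_schedule(peers, subset_size=2, round_interval=0.1, rng=rng))
```

In both cases, the order in which neighbors learn the transaction is random, which is exactly what a listening adversary has to work around.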

Transaction clustering based on network-level information

The general idea behind our attack is to connect to many nodes and log the timestamps of transaction announcements. The intuition is that transactions which originate from the same node propagate in a similar fashion. (We will define the notion of similarity later.) First, we have to overcome some technical difficulties to collect the data.

Parallel connections

As described earlier, Bitcoin uses a mechanism known as diffusion to make network-level attacks on privacy difficult. If we collect data by only connecting to each node once, as the reference software allows us to do, we won’t gain much. Ideally, we want to be the first to receive a new transaction announcement from the peer which generated it.

A typical full node maintains 8 outgoing and allows up to 117 incoming connections. Say, it has 12 incoming connections, for a total of 20 connections. This means that if we connect to it, we have only a 1 in 20 chance of being the first to hear about a new transaction.
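
The same arithmetic shows why parallel connections help. The sketch assumes that the victim serves its connections in a uniformly random order, which is a simplification; the second set of numbers is purely hypothetical.

```python
def first_to_hear_probability(our_connections, total_connections):
    """Chance that one of our connections is served first, assuming the
    victim picks the announcement order uniformly at random."""
    return our_connections / total_connections

# Numbers from the text: one of 20 connections is ours.
print(first_to_hear_probability(1, 20))  # 0.05
# Hypothetical: we hold 50 connections next to 19 other peers.
print(first_to_hear_probability(50, 69))
```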

To overcome broadcast randomization, we use an alternative implementation of the Bitcoin networking stack called bcclient. It was developed here at CryptoLUX as part of the research resulting in “Bitcoin over Tor isn’t a good Idea” and “Deanonymisation of Clients in Bitcoin P2P Network”. Bcclient connects to Bitcoin nodes with multiple parallel connections. As Zcash inherited most of networking properties of Bitcoin, it was relatively straightforward to adapt bcclient for Zcash.

Weighting IP addresses

Next, we want to find a metric to compare transactions. Each transaction can be characterized by a vector of IP addresses which announced it to us. For each transaction, we consider a vector of IP addresses which were the first to announce it to us, and assign weights to them.

Intuitively, the first IP to announce the transaction to us is the most likely to be the sender. For each subsequent IP, the probability of it being “close” to the sender decreases. In the real network, some transactions get to us very quickly, while others arrive a little more slowly. For every transaction, we want the weight of IP addresses to drop neither too quickly nor too slowly. Here is an example of weights assigned to three (made up) timestamp vectors by our parameterized weight function (see the paper for details):

Calculating correlations

Next, we calculate the correlation coefficient for each pair of weight vectors and depict them as a matrix. We expect it to exhibit a special structure: with the right permutation of rows and columns, clusters would be visible along the main diagonal. That would mean that some subsets of transactions are closer related to each other than to other transactions.
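
A minimal sketch of the correlation step, using the Pearson coefficient and made-up weight vectors (one weight per IP address, one vector per transaction). The values and transaction names are invented for illustration, and the clustering step itself is not shown.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Made-up weight vectors over the same four IP addresses.
weights = {
    "tx1": [1.0, 0.6, 0.1, 0.0],
    "tx2": [0.9, 0.7, 0.0, 0.1],
    "tx3": [0.0, 0.1, 1.0, 0.5],
}
names = list(weights)
matrix = [[pearson(weights[a], weights[b]) for b in names] for a in names]
for name, row in zip(names, matrix):
    print(name, [round(v, 2) for v in row])
```

Here tx1 and tx2 were announced early by the same IP addresses and correlate strongly, while tx3 correlates negatively with both, so tx1 and tx2 would end up in one cluster.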

To measure the effect of our attack, we use the anonymity degree proposed by Diaz et al in 2002. This metric, originally introduced for messaging systems, reflects the amount of information that the attacker can obtain regarding who was the author of each message. The anonymity degree varies from 0 to 1, where one means perfect anonymity, and zero means no anonymity at all.
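
The metric is the entropy of the attacker’s probability distribution over the candidates, divided by the maximum possible entropy. A sketch with made-up distributions follows; how the distribution is derived from the clusters is described in the paper and not reproduced here.

```python
from math import log2

def anonymity_degree(probabilities):
    """Degree of anonymity (Diaz et al.): entropy of the attacker's
    probability distribution over candidates, normalized by its maximum."""
    n = len(probabilities)
    if n < 2:
        return 0.0
    entropy = sum(p * log2(1 / p) for p in probabilities if p > 0)
    return entropy / log2(n)

print(anonymity_degree([0.25, 0.25, 0.25, 0.25]))  # 1.0: the attacker learned nothing
print(anonymity_degree([1.0, 0.0, 0.0, 0.0]))      # 0.0: the author is fully identified
print(anonymity_degree([0.7, 0.1, 0.1, 0.1]))      # somewhere in between
```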

Recap: what we do step by step

We connect to many nodes from servers on three continents (Europe, Asia, and North America) to get a better view of the network. We log transaction announcements and assign weights to vectors of timestamps for each IP address. Then we calculate the pairwise correlations between these weight vectors and apply the spectral co-clustering algorithm, which tries to find the permutation of rows and columns in the matrix such that the internal clustering structure becomes visible. It is implemented in scikit-learn (sklearn), a popular Python library. Then we calculate the anonymity degree using our own transactions as ground truth.

Results

On the Bitcoin testnet, we did the full-scale experiment connecting to all nodes with as many connections as possible. We obtained the following picture (see the paper for full results):

We clearly see that our own transactions (marked with black lines) form a cluster, and we see some other clusters forming along the main diagonal. We also did the experiment on the Bitcoin mainnet, but in order not to disrupt the real network, we limited ourselves to only connecting to 1,000 nodes, which is about 1/10 of the nodes available, and didn’t try to occupy all connection slots. The results on mainnet are significantly worse: our transactions are scattered around the clusters.

We also did an experiment on Zcash, and there the picture is not that clear, but still our transactions form a cluster. It’s also important to note that in Zcash transactions can be shielded or transparent. Transparent transactions have the same structure as in Bitcoin and give no privacy enhancements. Shielded transactions take advantage of sophisticated cryptography (zk-SNARKs). In the following picture, the longer black lines indicate the shielded transactions, and shorter ones indicate transparent transactions. This shows that our method doesn’t care whether you use zk-SNARKs or not because we only take transaction hashes into account:

The traffic for Dash and Monero also exhibits some cluster properties (see the paper for the full results).

We also estimated the original sender IP address for a given cluster. It is only possible in certain circumstances. When a node connects to a network, it advertises its IP address in an ADDR message. If our listener nodes are online at that moment, they can compare the IP address announcement in the address message with the IP addresses which are highly ranked in certain clusters. We show that, at least for the Bitcoin testnet, an adversary can narrow down the search of the source IP address to about five IP addresses.

What about countermeasures? From a cryptocurrency user’s point of view, you shouldn’t issue many transactions during the same session. If you do, you may want to run your nodes with an increased number of connections and also periodically drop and re-establish connections.

Two recent proposals in Bitcoin, Dandelion++ and Erlay, would defeat our attack if implemented. In both cases, the key change compared to the existing P2P protocol is that the first stage of propagation is performed with outbound connections only. This ensures that an adversary has a zero probability of receiving a new transaction announcement early, if the victim did not choose to connect to the adversarial IP. This probability can not be increased by establishing incoming connections, as we have done in this work. Note that the Erlay paper explicitly states: “The decision to relay through outbound connections, but not the inbound ones, was made to defend against timing attacks”.

Conclusion

Many blockchain developers think of the network layer as a black box: it broadcasts transactions, what else do we need? As we have demonstrated, timing of transaction announcements reveals information on related transactions. This data is invisible on the application level. Randomization techniques, as they exist today, are not 100% effective. The issue is especially crucial for privacy-focused cryptocurrencies. Novel P2P protocols may help alleviate the attack by preferring outbound connections for the initial announcement phase.

See the full papers for more details:

Deanonymization and linkability of cryptocurrency transactions based on network analysis (IEEE EuroS&P 2019)
Transaction Clustering Using Network Traffic Analysis for Bitcoin and Derived Blockchains (CryBlock workshop at IEEE INFOCOM 2019) – a shorter version of “Deanonymization and linkability”
Security and Privacy of Mobile Wallet Users in Bitcoin, Dash, Monero, and Zcash (in a special issue of “Pervasive and Mobile Computing” on blockchain technologies) – studying the networking aspects of mobile wallets and applying the clustering technique to transactions issued from smartphones.

Thoughts on Web3. Part 1

2019-09-04T00:00:00+00:00

This August, Berlin was the global center of all things decentralized. Thousands of blockchain enthusiasts gathered for the Berlin Blockchain Week – a series of conferences, meetups, and a hackathon. In this post, I’ll share my thoughts on Web3, which was the primary topic of the first major event of the week – Web3 Summit. [¹]

"The next Mark Zuckerberg won't start a social network company." – Peter Thiel

What is Web3?

The “3” in Web3 here refers to the “versions” of the web. Web 1.0 consisted of static, read-only websites. Web 2.0 brought the dynamic, JavaScript-powered interfaces and, more importantly, user-generated content. Facebook is the primary example of a Web2 service.[²] Billions of users use it for free to stay in touch, share photos, and much more. More and more people are feeling upset though. The major reason for criticism is Facebook’s business model.

What is wrong with Web 2.0?

Contrary to radio and television, the Internet is a dual channel. Users consume data but also report what they’ve consumed back to the server. The cost of storing and processing data has been declining rapidly (Moore’s law plus economies of scale) – at least compared to the value of the data if processed properly. This led to Facebook’s current business model: it collects comprehensive user profiles and provides precise targeting to advertisers.

Is it that bad if someone extracts enough value from my clicks to let me connect to friends for free? Advertising has been around for centuries, why are people angry at Facebook?

Say, every morning you drop by the same cafe to grab some coffee. The barista greets you with a smile: “A double espresso as usual, Mister Smith? We have delicious chocolate cupcakes today, would you like one?” You smile back and feel glad that you live in such a lovely neighborhood.

Now imagine you bought a coffee machine on Amazon. A minute later, you see an ad: “Bought a coffee machine? Order our extra-special coffee beans! Best for your favorite morning double espresso!” Feels a bit creepy, right? What if the ad greets you by name?

People fear the unknown. A monster jumping from behind the corner in a horror movie makes you scared for a second, but the character walking around the haunted house anticipating a monster jumping from behind any corner generates lasting suspense. Nuclear power, which is statistically very safe, is perceived as dangerous because it involves deadly yet invisible radiation. People also don’t like giving up control. Many people “feel” that driving is safer than flying, because they feel in control behind the wheel.

Uncertainty and lack of control affect personal data as well. Sure, Facebook collects data on me, but what exactly? Does it come only from my Facebook usage or from other websites too? Who has access to the data? How long is it stored and where? What can they understand about me if they analyze it? What if they analyze it in ten years with a hundred times more powerful computer? Who do they share it with, and what do those parties do? Can I control what data they collect and what they use it for?

Facebook isn’t keen to answer these questions. The rare PR-department-style announcements start with the obligatory “We value your privacy” only to continue with a huge “but”, carefully wrapped in layers of unconvincing legalspeak. Facebook is one of the least trusted brands in the US, at the bottom of the list alongside big banks and The Worst Brand Ever in The Land Of The Free: the Government.

Speaking of banks and the government…

Money is another area of life with similar dynamics. It’s very scary to have your card blocked, especially if you have no cash. The rationale behind such decisions (is your transaction pattern strange so we block you just in case?) is hidden. The financial system is complex, controlled by someone else, without an opt-out option….

…until 2009.

Bitcoin showed that it is possible to disrupt the seemingly all-mighty fiat money system. Understandably, people angry at Facebook started wondering: if Bitcoin found a way to give people a more free and private money, can we use similar technologies to achieve the same goals for data?

Without going into too much technical detail, let’s unpack what makes Bitcoin work.

Why Bitcoin works

Bitcoin, famously, is a rabbit hole. As you start thinking about what makes Bitcoin work, every insight opens up new questions. After some time, you stare into the abyss, asking yourself: what is value? what is energy? what is time? In this chapter, I’ll just focus on three points which I find important for the web3 discussion: what is digital scarcity, value as a special content type, and how this enables a closed reward system in Bitcoin.

Digital scarcity

Bitcoin is the first implementation of digital scarcity without a trusted party.

Digital information works very well for many purposes, but one characteristic delayed the (proper) digitalization of money for nearly three decades (from David Chaum’s digital cash in 1982 to Bitcoin in 2008): bits are not scarce. Whatever sequence of bits you give me, I can copy it, diluting the value you aimed at transferring. A simple but dirty way of solving this problem was to appoint a central party, which we all trust to not let people copy value-representing bits. (This is called fraud or counterfeiting and is punishable off-protocol, that is, by armed people putting you in a cage.)

Bitcoin is digital scarcity without people with guns. Where does its value come from?

Well, where does value of anything come from? It comes from people.

Imagine a universe without humans. How much is a star worth? How much is an atom of hydrogen worth? These questions are absurd because value is subjective. “Something is valuable” literally means “some people find it useful”.

People need money – a system of value transfer through time (store of value) and space (medium of exchange). (Unit of account is the third function of money, but it arguably emerges if the other two work well.) Bitcoin satisfies the demand for a money system for enough people to be worth around $10k apiece at the time of this writing. Turns out, to achieve this, it has to be very “inefficient” and make the harder choice at every turn. But as the type of information Bitcoin handles is so valuable (it is pure value itself), the expenses are worth it.

Is money just another content type?

I’m a long time fan of Andreas Antonopoulos. His passion and dedication helped me comprehend the incredible beauty of Bitcoin design. One of my favorite metaphors by Andreas is “money as a content type”. However, while catchy and indeed applicable in programming contexts, it is not entirely accurate.

Content types like MP3 or JPG are just sets of rules that let a computer interpret a sequence of bits. You may try to interpret a JPG file as text – this will most likely result in pages of unreadable symbols. But fundamentally JPG bits are the same as TXT bits and even the same as EXE bits.

But you can’t interpret bitcoin as text.

The Bitcoin whitepaper defines a coin as a chain of digital signatures. Signatures, of course, can be represented as text and printed in hexadecimal. But not every coin is a bitcoin. To be valid, a coin must stem from the coinbase of some valid block in the heaviest chain, which started from a particular genesis block.

A digital representation of value does not work in a vacuum. It must relate to some agreed-upon money system. In the same way it makes no sense to say “pay me 100” without specifying the currency, it makes no sense to consider a chain of signatures without linking it to the Bitcoin system. In fiat, we establish this relation (PKI, banking licenses, etc) and then trust that the system won’t fail, while it technically can. I trust that the numbers in my bank account are, all fiat sins notwithstanding, not part of an outright scam plot. In Bitcoin, we establish this relation (full nodes, SPV nodes, trusted servers – whatever fits our security model) but no additional trust is required. The validity of blocks can be verified independently, and the probabilistic uniqueness of history is established with proof-of-work.

A Soviet porcelain factory

When the Soviet Union collapsed, its economic system was completely dysfunctional. [³] It was not just inefficient as you might expect from a centrally planned economy – it was impossible to perform basic financial tasks. Instead of money, a porcelain factory would pay its employees in dishes and cups, which they sold for next to nothing on the streets, just to bump into empty shelves in a grocery store. [⁴]

Bitcoin is a porcelain factory. It pays salaries to its “employees” with what it produces. [⁵] However, this is that single case where this makes sense – the factory literally produces money. The money-like objects Bitcoin spits out every ten minutes on average turn out to be exactly what people generally expect to be paid in. Therefore it is possible to embed the reward for maintaining the system into the system itself.

Regular employees trust their employer to pay them and rely on an external legal system to enforce their job contract if they don’t. Bitcoin miners, on the other hand, are not expecting a salary from anyone. Their pay is indivisible from their performance. There is no “boss” who could fail to pay, no company to go bankrupt. This closed cryptoeconomic loop makes the whole thing work, independent of any external party.

To summarize: value is a special type of information. Money is a system for transferring value. Before Bitcoin, digital money required trust. Satoshi leveraged a very special property of value to embed a reward mechanism into Bitcoin, creating a closed (hence independent) system. I’d argue that this is the absolutely critical piece of the Bitcoin puzzle. Let’s call these self-sustaining systems with built-in economic motivations – cryptoeconomic systems.

Why is my Facebook data valuable?

Let’s get back to the Web2 / Web3 discussion.

Facebook is one of the largest companies in the world, yet I don’t pay anything to use it. How come? This is quite a cliché already, but if you’re not paying for the product, you are the product. More precisely, you are a data point which lets Facebook develop better products for its true clients – advertisers. To attract data points (like you), Facebook maintains all those data centers and hires the best developers to lure people into scrolling their feeds for hours on end.

Users’ data has no immediate value however. It is more of a debt instrument: it promises returns, if you derive insights from it and sell them to someone. The value of Facebook’s ad targeting system stems from a number of ingredients:

the huge amount of data (1.5 billion people use Facebook daily, tracked every split second);
the monopoly on data (a Facebook competitor can hire smart engineers, but can’t train the algorithms on Facebook’s datasets);
the proprietary nature of the algorithms.

The loop is self-enforcing. Facebook’s algorithms are developed by the best-in-generation engineers and trained on the world’s largest dataset. Facebook can afford to hire the best engineers and run powerful servers because of high revenue, enabled by the best data and algorithms, and so on.

The Web3 question then is: can we break that circle? Can we build a self-sustained, decentralized cryptoeconomic system on top of data to atone for the Internet’s original sin?

I’ll try to answer this question in the next post.

For Russian speakers: you can watch and listen to our coverage of Berlin Blockchain Week in Basic Block podcast. I also participated in the ETHBerlin Zwei hackathon, helping with a project around Maker DAO. ↩
I will use Facebook as a metaphor for a Web2 service throughout this post, but the same applies to Google and others. ↩
Let me suggest Collapse of an Empire by Yegor Gaidar if you’re interested in why that happened. ↩
People also used all kinds of money substitutes. I remember mid-1990s Russian TV ads with prices in “у.е.”, a euphemism for “USD converted to rubles at the date of purchase” invented due to a legal ban on price listings in ~~stablecoins~~ foreign currencies. ↩
An interesting question to ponder is who are the employees of Bitcoin? The obvious answer is miners, but is it only miners? What about early adopters who invested at $10 per BTC and have the resources to go Bitcoin full time or develop another blockchain? What about Coindesk journalists? Or countless blockchain podcasters, myself included? In some indirect way, all people working in the space have been paid, essentially, by a DAO named Bitcoin. Mind = blown. A nice topic for another post. ↩

Eltoo

2019-04-25T00:00:00+00:00

Continuing the journey through layer-two technologies, here is a summary of the paper “eltoo: A Simple Layer2 Protocol for Bitcoin” by Christian Decker et al (see also: a summary in the Blockstream blog). Eltoo proposes a new construction for payment channels.[¹] It is not a fully-fledged protocol, rather, it only describes one crucial building block – the state revocation mechanism. As you might remember from my summary of “SoK: Off the chain transactions”, the crucial challenge in L2 protocol design is old state invalidation. Lightning uses replace by revocation (in the SoK paper terms) which works in practice but has its drawbacks. The construction is rather complex, and the intermediate states held by the two parties are different. This inherent asymmetry prevents easily extending the protocol to support multi-party channels. Eltoo suggests another, symmetric state revocation mechanism, which is arguably better modulo one crucial limitation: it depends on a non-existent SIGHASH_NOINPUT signature flag. The good news is, this change can be implemented relatively easily via a soft fork and doesn’t seem to be very contentious. If that happens, it would be possible to replace the state revocation mechanism in the live Lightning network with Eltoo while preserving all other aspects (channel synchronization via HTLCs, routing algorithms, etc).

History: nSequence

It’s worth noting that Satoshi himself tinkered with the problem of re-negotiating a transaction without broadcasting it. Bitcoin transactions have an nSequence field, which was initially meant to act as a counter. Miners were assumed to give priority to transactions with higher sequence numbers. Honestly, it’s hard for me to grasp how Satoshi, who invented the brilliant economic game that has been securing Bitcoin for over a decade now, seriously thought sequence numbers could work. Miners as rational agents simply choose the transaction with the highest fee in case of a double-spend attempt. [²]

In any case, miners even have some degree of plausible deniability: they may claim that they simply haven’t heard of the transaction with a higher sequence number.[³]

Bitcoin scripts

Bitcoin is essentially a replicated state machine. The state here is the set of unspent transaction outputs (UTXO set). Transactions modify this set by removing some outputs from the set (“consuming”) and creating new ones. Validity rules ensure that all transactions adhere to the same rules:

the sum of output values is not greater than the sum of input values;
the script executes to true.
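To make the state-machine view concrete, here is a minimal sketch in Python. It is not how Bitcoin Core represents things: the `Output`, `Tx`, and `apply_tx` names are made up for illustration, and the “script” is reduced to a toy check that the unlocking string equals the owner label.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Output:
    value: int   # amount in satoshis
    owner: str   # toy stand-in for an output script (assumption)

@dataclass(frozen=True)
class Tx:
    txid: str
    inputs: tuple      # (txid, index) outpoints being consumed
    witnesses: tuple   # one unlocking string per input, stand-in for an input script
    outputs: tuple     # newly created Output objects

def apply_tx(utxo: dict, tx: Tx) -> dict:
    """Return the UTXO set after applying tx; raise ValueError if tx is invalid."""
    if len(tx.inputs) != len(tx.witnesses) or len(set(tx.inputs)) != len(tx.inputs):
        raise ValueError("malformed inputs")
    spent_value = 0
    for outpoint, witness in zip(tx.inputs, tx.witnesses):
        if outpoint not in utxo:
            raise ValueError("input is missing or already spent")
        prev = utxo[outpoint]
        # Toy "script executes to true" rule.
        if witness != prev.owner:
            raise ValueError("script evaluated to false")
        spent_value += prev.value
    # The sum of output values must not exceed the sum of input values.
    if sum(o.value for o in tx.outputs) > spent_value:
        raise ValueError("outputs exceed inputs")
    new_utxo = {k: v for k, v in utxo.items() if k not in tx.inputs}
    for index, out in enumerate(tx.outputs):
        new_utxo[(tx.txid, index)] = out
    return new_utxo

utxo = {("coinbase", 0): Output(50, "alice")}
tx = Tx("tx1", (("coinbase", 0),), ("alice",), (Output(30, "bob"), Output(19, "alice")))
utxo = apply_tx(utxo, tx)
```

The difference between inputs and outputs (1 satoshi here) is the fee. Applying the same transaction a second time fails, because the consumed output is no longer in the set.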

On the protocol level, Bitcoin has no accounts and no balances, only scripts and transactions. Bitcoin’s scripting system is not particularly intuitive.[⁴] A transaction spends outputs by providing input scripts which, when concatenated with the scripts of the unspent outputs being spent, evaluate to true. Conditions under which bitcoins in the newly created outputs can be spent are described with output scripts. A popular script template is “pay to public key hash” (P2PKH), which means “a valid signature corresponding to this public key”.[⁵]

For example, for P2PKH the output script is “OP_DUP OP_HASH160 <Bob’s Public Key Hash> OP_EQUALVERIFY OP_CHECKSIG”. The input script is “<Bob’s Signature> <Bob’s Public Key>”.[⁶] If you concatenate these scripts, imagine you’re a stack machine, and execute the commands, you’ll arrive at true, which means this input-output pair is legit (Chapter 6 of Mastering Bitcoin explains the process in detail).
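If you would rather let a computer be the stack machine, here is a toy interpreter for exactly these four opcodes. It is a sketch under loud assumptions: `hash160_standin` is a truncated SHA-256 rather than the real RIPEMD160(SHA256(x)), and signature checking is delegated to a caller-supplied `check_sig` stub, since real ECDSA is out of scope here.

```python
import hashlib

def hash160_standin(data: bytes) -> bytes:
    # Assumption: truncated SHA-256 stands in for Bitcoin's RIPEMD160(SHA256(data)).
    return hashlib.sha256(data).digest()[:20]

def run_script(script, check_sig) -> bool:
    """Evaluate a list of opcodes (str) and data pushes (bytes) on a stack."""
    stack = []
    for item in script:
        if isinstance(item, bytes):
            stack.append(item)                      # data push
        elif item == "OP_DUP":
            stack.append(stack[-1])                 # duplicate the top item
        elif item == "OP_HASH160":
            stack.append(hash160_standin(stack.pop()))
        elif item == "OP_EQUALVERIFY":
            if stack.pop() != stack.pop():          # abort on mismatch
                return False
        elif item == "OP_CHECKSIG":
            pubkey, sig = stack.pop(), stack.pop()  # pubkey is on top, signature below
            stack.append(b"\x01" if check_sig(sig, pubkey) else b"")
        else:
            raise ValueError(f"unsupported opcode: {item}")
    return bool(stack) and stack[-1] != b""

# Hypothetical key material and a stub verifier that accepts one known pair.
bob_pubkey = b"bob-public-key"
bob_sig = b"bob-signature"

def toy_check_sig(sig: bytes, pubkey: bytes) -> bool:
    return (sig, pubkey) == (bob_sig, bob_pubkey)

output_script = ["OP_DUP", "OP_HASH160", hash160_standin(bob_pubkey),
                 "OP_EQUALVERIFY", "OP_CHECKSIG"]
input_script = [bob_sig, bob_pubkey]

p2pkh_ok = run_script(input_script + output_script, toy_check_sig)
p2pkh_wrong_key = run_script([bob_sig, b"mallory-public-key"] + output_script, toy_check_sig)
```

With Bob’s signature and key the script ends with a truthy value on top of the stack. With a different public key it already fails at OP_EQUALVERIFY, because the hash does not match the one committed to in the output.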

More complex conditions can be encoded with pay-to-script-hash (P2SH) outputs. Here, the output only specifies the hash of the script necessary to unlock the coins. The spender provides the script, and the interpreter first checks the hash, and then executes the script.[⁷]
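The two-step P2SH check can be sketched as follows. Everything here is a simplification I made up for illustration: scripts are Python lists serialized with JSON, the hash is plain SHA-256, and the mini-interpreter knows only integer pushes, OP_ADD, and OP_EQUAL.

```python
import hashlib
import json

def script_hash(script: list) -> str:
    # Assumption: JSON + SHA-256 stand in for Bitcoin's serialization and HASH160.
    return hashlib.sha256(json.dumps(script).encode()).hexdigest()

def run(script: list, stack: list) -> list:
    """Tiny interpreter: integer pushes, OP_ADD, OP_EQUAL."""
    for item in script:
        if isinstance(item, int):
            stack.append(item)
        elif item == "OP_ADD":
            stack.append(stack.pop() + stack.pop())
        elif item == "OP_EQUAL":
            stack.append(int(stack.pop() == stack.pop()))
        else:
            raise ValueError(f"unsupported opcode: {item}")
    return stack

def spend_p2sh(committed_hash: str, args: list, redeem_script: list) -> bool:
    # Step 1: the revealed script must match the hash stored in the output.
    if script_hash(redeem_script) != committed_hash:
        return False
    # Step 2: the revealed script is itself executed on the spender's arguments.
    stack = run(redeem_script, list(args))
    return bool(stack) and stack[-1] != 0

# The output commits only to the hash of "two numbers that add up to 5".
redeem_script = ["OP_ADD", 5, "OP_EQUAL"]
committed = script_hash(redeem_script)

p2sh_ok = spend_p2sh(committed, [2, 3], redeem_script)
p2sh_wrong_script = spend_p2sh(committed, [2, 3], ["OP_ADD", 6, "OP_EQUAL"])
p2sh_wrong_args = spend_p2sh(committed, [2, 2], redeem_script)
```

The point of the exercise: the output itself reveals nothing about the spending condition until someone spends it.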

Eltoo: on-chain version

The previous section was there just to give the necessary background. Let’s return to the task at hand: a secure L2 state replacement mechanism. The authors first introduce an on-chain version of Eltoo, where all intermediary states must be confirmed on-chain. Then the authors “lift the protocol off the blockchain” and show how just one little tweak to Bitcoin’s transaction engine enables fully-fledged Eltoo.

Similar to Lightning network, Eltoo is a three-stage protocol consisting of a setup phase, a negotiating phase, and a settlement phase. The setup phase consists of transferring some coins to a 2-of-2 multisig address. [⁸]

Both Alice and Bob generate and exchange two public-private key pairs: a settlement keypair and an update keypair. Assume Alice funds the channel. She creates a transaction spending her coins to a 2-of-2 multisig with either both update keys, or both settlement keys. Before sending it to Bob, she requires him to sign an initial settlement transaction which spends the coins from the multisig back to Alice.[⁹]

The funding output, as well as all successive update transaction outputs, can be spent in two ways (expressed as OP_IF branches in the script). The first (true) branch is the settlement branch: it requires two signatures with settlement keys and imposes a relative timelock (i.e., is valid only after a certain number of blocks are created after the transaction which created the output was confirmed). The second (false) branch is the update branch: it spends the output of the previous update transaction without any timelock.
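The two spending paths can be modeled in a few lines of Python. This is only a sketch of the logic described above, not the actual script: signatures are reduced to a set of key labels, and `ContractOutput`, `settlement_delay`, and `can_spend` are names I invented for the illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ContractOutput:
    confirmed_height: int   # height at which the creating transaction confirmed
    settlement_delay: int   # relative timelock of the settlement branch, in blocks

def can_spend(output: ContractOutput, branch: str, signed_by: set, current_height: int) -> bool:
    """Check whether a spend through the given OP_IF branch would be valid."""
    if branch == "settlement":
        has_sigs = {"alice_settlement", "bob_settlement"} <= signed_by
        matured = current_height - output.confirmed_height >= output.settlement_delay
        return has_sigs and matured
    if branch == "update":
        # No timelock: a co-signed update can spend the output immediately.
        return {"alice_update", "bob_update"} <= signed_by
    raise ValueError(f"unknown branch: {branch}")

out = ContractOutput(confirmed_height=100, settlement_delay=144)
update_keys = {"alice_update", "bob_update"}
settlement_keys = {"alice_settlement", "bob_settlement"}

update_now = can_spend(out, "update", update_keys, current_height=101)
settle_too_early = can_spend(out, "settlement", settlement_keys, current_height=101)
settle_after_delay = can_spend(out, "settlement", settlement_keys, current_height=244)
settle_wrong_keys = can_spend(out, "settlement", update_keys, current_height=244)
```

The window between confirmation and the end of the delay is exactly the time an update transaction has to get in ahead of the settlement.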

Note the differences with Lightning:

information held by parties is symmetric;
parties use different sets of keys for update and for settlement;
a new update transaction spends the output of the previous update transaction (whereas in Lightning all update transactions re-spend the same 2-of-2 output of the funding transaction).

The parties have the time until the timelock in the settlement branch expires to spend this output with a new update transaction (signed with their update keys).

Consider Figure 1:

I should confess that I often have trouble wrapping my head around such diagrams, as the meaning of an edge is often defined implicitly. Here a rectangle is a transaction, a circle is a script, a (transaction – script) edge means “contains”, and a (script – transaction) edge means “spends”. Note the key insight: many transactions can spend the same output. Of course, at most one of them is ever confirmed on-chain, but multiple valid transactions can spend the same (P2SH / P2WPKH) output (probably executing different branches of the script). So the parties can co-sign one transaction which spends a given output, exchange it without broadcasting it, and then co-sign another transaction spending the same output. In fact this is exactly what they do: until the timeout in the settlement branch expires, they spend the latest output with a new update transaction. If they want to close the channel, they either simply wait till the timeout expires and broadcast the latest settlement transaction, or co-sign a new update transaction and broadcast it immediately (cooperative close).

Here is how the update phase happens:

The update transaction effectively doublespends the settlement transaction before it becomes valid. As with the funding transaction, before signing and broadcasting the new update transaction, the two endpoints negotiate a new settlement transaction that spends the newly created contract output.

Note that we are still dependent on on-chain confirmations of all intermediary update transactions! So this is not an L2 protocol yet. If only we had a way to preserve the security guarantees without broadcasting all intermediary update transactions…

3, 2, 1, lift-off!

That is exactly what the authors do in what they call “lifting the protocol off the chain”. Citing the blog post,

The key insight in eltoo is that we can skip intermediate updates, basically connecting the final update transaction to the contract creation.

But how is that even possible? How can a transaction be valid if it doesn’t specify exactly which outputs it spends?

SIGHASH_NOINPUT

A Bitcoin transaction includes the transaction hash and the output index of each output it spends. Translated into English, a typical transaction looks like this:

I’m spending output number 0 from transaction 0xdead, here is the input script;
I’m spending output number 1 from transaction 0xbeef, here is the input script;
I’m creating an output, here is the output script.

But what if a transaction didn’t have to commit to the exact outputs it’s spending? In fact, it is possible:

signatures in Bitcoin transactions can be parameterized with the sighash-flag that specifies which parts of the transaction are committed to in the signature. By introducing a new sighash flag, SIGHASH_NOINPUT, it is possible to selectively mark a transaction as a floating transaction.

(A transaction is called floating if it “can be bound to any previous transaction with matching scripts”.)

Let’s visualize an output as a secure vault. A transaction opens a vault and re-distributes the money to other vaults. A vault says: “whoever has the key can open me and do whatever they want with the content”. A usual transaction says: “I have the key from this vault, so I’m opening it (and distributing the content into other vaults)”. A SIGHASH_NOINPUT transaction says: “I have a key, but I don’t yet know which vault I will open with it – might be any vault where the key works”. Hence, the same update transaction, once signed, can be modified to attach it to the required settlement transaction.
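In code, the difference boils down to what goes into the digest that gets signed. The sketch below uses a toy serialization of my own (not Bitcoin’s real signature hash algorithm, and the real proposal specifies precisely which fields are covered); the only thing it demonstrates is that leaving the outpoint out of the digest makes one signature fit many possible previous outputs.

```python
import hashlib

def sig_digest(prevout: tuple, outputs: list, noinput: bool) -> str:
    """Toy digest a signature would commit to (hypothetical serialization)."""
    h = hashlib.sha256()
    if not noinput:
        # A usual signature commits to the exact output being spent.
        txid, index = prevout
        h.update(f"{txid}:{index}".encode())
    for script, value in outputs:
        h.update(f"|{script}:{value}".encode())
    return h.hexdigest()

outputs = [("update_or_settle_script", 100_000)]

# Usual signing: rebinding to another previous output changes the digest,
# so the old signature no longer verifies.
normal_a = sig_digest(("dead", 0), outputs, noinput=False)
normal_b = sig_digest(("beef", 1), outputs, noinput=False)

# Floating transaction: the digest ignores the outpoint,
# so the same signature stays valid after rebinding.
floating_a = sig_digest(("dead", 0), outputs, noinput=True)
floating_b = sig_digest(("beef", 1), outputs, noinput=True)
```

Note that the output script still has to accept the key: the vault metaphor holds, the key simply is not engraved with a vault number.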

Is that all? Not quite…

Ordering

As the authors point out,

Using the SIGHASH_NOINPUT flag for update transaction adds a lot of flexibility, however they are now too flexible.

In particular, the current scheme allows replacing a new update transaction with an older one. This unfortunate circumstance is explained by the fact that we completely discarded any notion of order. Without committing to outputs, any update transaction can be attached to any other update transaction. But this is not what we wanted:

by using SIGHASH_NOINPUT we have removed any commitment to the state we are replacing. We therefore have to selectively re-introduce some of the previous transaction’s details into the validation.

Timelocks come to the rescue (again).

To check for timelocks, Bitcoin has two opcodes with totally intuitive names: OP_CHECKSEQUENCEVERIFY (aka OP_CSV) and OP_CHECKLOCKTIMEVERIFY (aka OP_CLTV). The former one is relative, and the latter one is absolute. Eltoo uses absolute timelocks to order update transactions. Currently, OP_CLTV ensures that the current time is later than the one encoded in the output. There are two ways to define absolute timelocks: as blockchain height, or as a UNIX timestamp. In yet another example of technical elegance, Bitcoin uses the same field for these two cases. The semantics is determined by the value of the field. The current timestamp, at the time of writing, is around 1.5 billion (1 billion seconds is approximately 32 years). The current block height at the time of this writing is a bit over 573 thousand. So if the value in the nLockTime field is 0.5 billion or above it is interpreted as a timestamp, otherwise as a block height (note: billion with a b, that is, Bitcoin can exist for another 10 thousand years until timelocks are broken).
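The dual meaning of the field fits in a few lines of Python. The threshold of 500,000,000 is the real consensus constant; the function and variable names are mine, and the “10 thousand years” figure is a back-of-the-envelope estimate assuming one block every ten minutes.

```python
from datetime import datetime, timezone

LOCKTIME_THRESHOLD = 500_000_000  # values below are block heights, the rest are UNIX times

def interpret_locktime(n_lock_time: int):
    """Return ("block_height", height) or ("unix_time", UTC datetime)."""
    if n_lock_time < LOCKTIME_THRESHOLD:
        return ("block_height", n_lock_time)
    return ("unix_time", datetime.fromtimestamp(n_lock_time, tz=timezone.utc))

as_height = interpret_locktime(573_000)        # roughly the block height at the time of writing
as_time = interpret_locktime(1_556_000_000)    # roughly the UNIX time at the time of writing

# How long until block heights reach the threshold, at one block per ten minutes?
minutes_per_year = 60 * 24 * 365.25
years_until_collision = LOCKTIME_THRESHOLD * 10 / minutes_per_year
```

`years_until_collision` comes out at roughly nine and a half thousand years, so “another 10 thousand years” is about right.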

As if that wasn’t enough of a dirty hack, the authors suggest the following. There are around 1 billion (and counting!) timestamps between 0.5 billion and the current moment. All these timestamps are in the past, so absolute timelocks are irrelevant for them. “I define this transaction to only be valid after 15:43 UTC on 8 October 1997” – well, it is already valid. Therefore, we can repurpose the poor nLockTime field once more and assign the values in the 0.5 billion – 1.5 billion range the semantics of Eltoo state counters. [¹⁰]
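Here is how I picture the trick, as a sketch rather than the paper’s exact construction. The mapping “state number plus 0.5 billion” and the rule “only a strictly larger counter may replace an older state” are my reading of the scheme; the names `STATE_BASE`, `state_to_locktime`, and `can_replace` are invented for the example.

```python
STATE_BASE = 500_000_000        # smallest value interpreted as a timestamp
NOW_UNIX = 1_556_000_000        # approximate UNIX time at the time of writing

def state_to_locktime(state_number: int) -> int:
    """Encode a channel state number as a timestamp that lies in the past."""
    if state_number < 0:
        raise ValueError("state numbers start at 0")
    locktime = STATE_BASE + state_number
    if locktime > NOW_UNIX:
        raise ValueError("counter would land in the future and act as a real timelock")
    return locktime

def can_replace(old_state: int, new_state: int) -> bool:
    # Assumed ordering rule: an update may only spend the output of an older update.
    return state_to_locktime(new_state) > state_to_locktime(old_state)

available_states = NOW_UNIX - STATE_BASE   # about a billion, growing by one every second

newer_replaces_older = can_replace(old_state=3, new_state=7)
older_replaces_newer = can_replace(old_state=7, new_state=3)
```

Since every encoded value is already in the past, the timelock never delays anything; it only gives the script something to compare.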

Anyway, just assigning some semantics to some field doesn’t make it enforceable (just as was the case with nSequence). The authors suggest a workaround:

In order to achieve the limited binding for settlement transaction a new set of public keys A_{s,i} and B_{s,i} is derived that is specific to each state number. The key-pair derivation can easily be done with hierarchical deterministic key derivation as used by many existing Bitcoin wallets. This ensures that a settlement transaction can only be bound to the matching update transaction.

The paper gives more fine-grained details on the compatibility with P2SH and P2WSH transactions and on fees, but the general construction can be summarized as follows: parties update the state using floating update transactions, which double-spend the output of the previous update transaction before the corresponding settlement transaction is valid.

Eltoo vs Lightning

Structure of intermediary transactions

The Eltoo authors put an emphasis on the symmetric nature of their protocol. Thinking about it a bit, I came to realize the fundamental distinction between Eltoo and Lightning revocation protocols.

Lightning transactions re-spend the same output (of the funding transaction) again and again. To prevent an old state from being confirmed, toxic information is used: if Alice broadcasts an old state, Bob has the revocation key allowing him to take all the money from the channel. In Eltoo, on the contrary, intermediary transactions are organized linearly. The first update transaction spends the output from the funding transaction, and the settlement transaction spends the output of the latest update transaction. We don’t need toxic information anymore, because transactions are (temporarily) linked linearly. Only when the channel closes, the latest update transaction is bound to the initial funding output. This allows for the symmetry of the information held by the parties.

Note that in Eltoo, as well as in Lightning, the victim must be online to present a “double-spending” update transaction if an old settlement transaction is broadcast. And, similarly to Lightning, only payments larger than the L1 fee are economically secure.

The cost of timelocks

In Eltoo, the settlement part of every output script is time-locked. The authors acknowledge the related trade-off:

Notice that choosing the correct timeout for the settlement branch is a trade-off. It must be chosen high enough to guarantee that any subsequent update is confirmed before the settlement transaction becomes valid. On the other hand this timeout is also the time participants have to wait before funds are returned to their sole control should the other participant stop cooperating.

In my opinion, this is the biggest drawback of Eltoo: a channel expires after some time if both parties do nothing. Of course, if both parties are cooperative and want to maintain the channel, they just won’t broadcast the now-valid settlement transaction, but this can not be guaranteed. Lightning, on the other hand, maintains the state of a channel indefinitely: both parties can go offline, return in a year, and continue updating the channel. Does this lead to inapplicability of Eltoo in any of the Lightning use cases, or vice versa? If state revocation technique is just a replaceable module in a larger L2 protocol, can we envision a future where we have multiple implementations of all L2 building blocks (networking, routing, state invalidation…) and can combine them to better suit our particular set of constraints?..

Conclusion

Eltoo is a cleverly designed protocol. I’m looking forward to the SIGHASH_NOINPUT BIP being discussed and hopefully implemented, enabling Eltoo implementation.

Eltoo and SIGHASH_NOINPUT are an interesting example of blockchain protocol design evolution. As far as I can tell, the initial code published by Satoshi was not of enterprise quality. It took years of engineering effort to separate it logically (wallet stuff from networking stuff from consensus stuff) and fix bugs. But also on a higher level, Bitcoin initially was probably too “strict”. For example, transactions must commit to outputs they are spending. How could it be otherwise? Seems logical, until someone explores the protocol deeply enough to realize that we can remove this restriction and enable new functionality without sacrificing existing security guarantees. Segwit is another improvement of this kind. What else can we strip from the protocol to enable new use cases without weakening the security model?

Another lesson from reading about Bitcoin scripts is that Bitcoin is full of dirty hacks. I understand the desire of some talented teams in the space to take the brilliant idea by Nakamoto, derive a new system based on it, and implement it cleanly from scratch (e.g., Cardano with its heavy emphasis on formal verification). For what it’s worth, I have a feeling that Bitcoin’s network effects are already so strong, that it would be really hard to compete with it in its niche, despite all its inefficiencies. The question is, what are other niches for blockchains, apart from digital money?

The authors seem to prefer stylized inscription “eltoo”, in all small letters, but I prefer capitalizing proper nouns for easier reading, sorry. ↩
At least in the basic model without non-monetary incentives like a wish to harm Bitcoin for political reasons, or fees outside the protocol (also known as bribes). ↩
Another aspect of the original Bitcoin design which makes me cringe is “send to IP” – see this LTB episode where Andreas Antonopoulos explains why this is a very, very bad idea. ↩
Which may explain why Ethereum gained traction so quickly: it appealed to a large community of web developers who were not willing to dive into peculiarities of stack-based languages inspired by Forth. ↩
More precisely, to the public key hash – the key is hashed to prevent certain privacy attacks and to add a layer of protection against the quantum threat: quantum computers may be able to derive a private key from the public key, but can’t reverse hashes. ↩
As the authors note, “due to this case being so common, the spending condition is commonly referred to as scriptPubKey, and the input script is referred to as scriptSig”. Oh, I love how intuitive Bitcoin terminology is! Was it supposed to act as a filter against people without at least a Master’s degree? ↩
By the way, P2SH is recognized by the script format only. That is, if the interpreter sees a script in the form “check that the hash of the argument is equal to this”, it also interprets the argument itself as a script! Oh, how much engineering elegance! Wait till we get to timelocks, this will get curiouser and curiouser. ↩
While the final protocol is extensible to more parties, the authors describe the 2-party version first. ↩
Segwit enabled the ability to build protocols including spending not-yet-confirmed outputs, as their identifiers are not malleable anymore. ↩
Reminds me of this post by Emin Gün Sirer on “clever” hacks in multi-layered protocols and the following software bloat… ↩

SoK: Off the chain transactions

2019-04-17T00:00:00+00:00

Here is a paper I’ve been waiting for for quite a while. Thinking about it, I’d be happy to have (co-)written it, had I directed my research into this topic a bit earlier. Today’s summary is based on a systematization of knowledge (SoK) paper by Lewis Gudgeon et al entitled “SoK: Off the chain transactions”. The readers of this blog must be familiar with some of the challenges in layer-two protocols which I outlined in previous paper summaries. But as in any rapidly developing field, information is dispersed across various media and is getting outdated quickly. The authors summarize and systematize the challenges faced by the developers of layer-two protocols and compare the existing solutions. I definitely gained at least some of the mental clarity essential for digging deeper and contributing to the field. Read the whole thing to get a general view of what problems are out there and how various proposals are tackling them. And the 136 references will fill your “read later” folder with papers for months to come!

Layers and myths

Blockchains scale poorly. There are multiple approaches to improve the efficiency of the base layer (consensus algorithm). But updating an existing blockchain is hard (if it’s decentralized). Layer-two protocols are a class of scaling solutions orthogonal to consensus improvements. Their main feature is that they don’t require any modifications on layer-one, taking its security guarantees as assumptions.

The authors outline the four “myths” regarding layer two:

blockchains can’t scale without advances in layer-one;
layer-two solutions are secure only if fully collateralized;
off-chain transactions are private by default;
blockchain transaction fees depend on transaction size or computational complexity, not its monetary value.

The very word “myth” calls for the verb “debunk”, but this is not what will happen: rather, the authors provide insights regarding the “myths” throughout the paper.

What are these “layers” anyway? The authors propose the following classification:

hardware layer (layer “minus-one”) accounts for the fact that hardware may provide a “trusted execution environment” like Intel SGX (which shifts the security assumptions towards the hardware manufacturer and alleviates many problems of protocols designed to be run on general purpose hardware);
layer-zero is the peer-to-peer network (though the paper focuses on permissionless networks, the authors point out that the insights regarding layer-two are also applicable to the permissioned ones);
layer-one is the consensus algorithm;
layer-two are protocols built on top of layer-one, of which the authors define three types.

What do we mean by “built on top”?

Layer-two protocols typically assume two properties from the blockchain layer: integrity (i.e. only valid transactions are added to the ledger) and eventual synchronicity with an upper time bound (i.e. a valid transaction is eventually added to the ledger, before a critical timeout).

The authors separate layer-two protocols into two groups: channels (payment and state) and commit-chains.

Channels

A channel is a type of a second-layer protocol where parties

consent to state updates unanimously by exchanging authenticated state transitions off-chain.

There are two kinds of channels: payment channels and state channels (the division seems a bit artificial from a purely theoretical standpoint, though it does reflect the current reality).

The workflow of a payment channel consists of three phases:

channel establishment: the parties lock up collateral on-chain;
channel transitions: the parties co-sign state updates and exchange them off-chain;
channel closure: the parties finalize the latest state on-chain (collaboratively or via a dispute mechanism).

The key design challenge for both payment and state channels is to prevent parties from submitting an outdated state to the blockchain. There are four state replacement techniques:

replace-by-incentive (RbI). In a one-way payment channel where Alice pays Bob, Bob will only submit the latest state to the blockchain because it will yield him the highest value. Clearly, it doesn’t work with a bidirectional channel: if Bob pays back to Alice, he still possesses an outdated but valid state and can submit it to the blockchain, effectively stealing from Alice.
replace-by-timelock (RbT). In this scheme, each new update is timelocked (is valid only after a certain timestamp). Every new update must have its timelock closer to the present by a safe margin. This guarantees that the interested party will be able to submit the latest state to the blockchain before any outdated states are even valid. The drawback is that the lifetime of an RbT channel is limited by the first timelock.
replace-by-revocation (RbR). This is what Lightning is built upon. When agreeing on a new balance distribution in a channel, Alice and Bob effectively invalidate the previous state. If one of them tries to submit an outdated state to the blockchain, the other party will be able to take the whole collateral. The drawback here is that parties are supposed to be constantly monitoring the blockchain.
replace-by-version (RbV). This approach works best if the layer-one is stateful (like Ethereum). Each new state has an incrementing counter; the version with the highest counter value is valid (disputes can be settled by a smart contract).

The first three techniques are used in payment channels; the fourth one is the basic idea behind state channels.
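To make replace-by-version concrete, here is a minimal sketch of the dispute logic in Python. The names `ChannelState` and `DisputeContract` are hypothetical and only stand in for an on-chain contract; signature checks and the dispute timeout are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChannelState:
    version: int      # counter incremented on every off-chain update
    balances: tuple   # (alice, bob)


class DisputeContract:
    """Toy stand-in for a layer-one contract (hypothetical, not a real API)."""

    def __init__(self):
        self.best = None

    def submit(self, state):
        # Accept a state only if it is newer than everything seen so far.
        # A real contract would also verify both parties' signatures.
        if self.best is None or state.version > self.best.version:
            self.best = state
            return True
        return False

    def settle(self):
        # Called after the dispute period: the highest version wins.
        if self.best is None:
            raise ValueError("no state was submitted")
        return self.best


contract = DisputeContract()
contract.submit(ChannelState(version=3, balances=(4, 6)))
contract.submit(ChannelState(version=1, balances=(9, 1)))  # outdated, rejected
final_state = contract.settle()  # the version-3 state
```

Submitting an outdated state is harmless here as long as the counterparty can respond with a higher version before the dispute period ends.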

Payment vs state channels

The authors outline multiple proposed payment channel constructions, most of which are, in my opinion, only of historical value. The very first ideas for unidirectional channels date back to Satoshi himself, then there was a construction known as Spilman channels, then Decker and Wattenhofer proposed Duplex micropayment channels… In the end, the only implemented payment channel scheme is Lightning (see part 1 of my series on it). The authors point out the key drawbacks in the RbR-based Lightning construction:

RbR is the first channel design to require both parties to remain online and fully synchronized with the blockchain to observe malicious closure attempts. <…> RbR introduces unfavorable implications for third-party watching services <…> entails O(N) storage.

Can we improve this by using RbV in payment channels somehow? Decker et al propose

Eltoo to support RbV in UTXO-based blockchains through the use of floating transactions

State channels generalize the ideas pioneered by payment channels and apply similar constructions to arbitrary computations. The authors distinguish state channel constructions with closure disputes and command disputes. To be honest, I didn’t understand this paragraph, as the notion of “installing / uninstalling” an on-chain application needs clarification. The two most prominent constructions (at least those with formal security proofs) are Perun (closure disputes) and Sprites (command disputes).

Channel synchronization

Up to this point, we were only talking about single channels, but what about channel networks? The authors suggest the term channel synchronization to denote techniques to logically connect updates to multiple channels. A well-known example is hash-time-locked contracts (HTLCs) used in Lightning: channels along the path are atomically updated if the receiver reveals a hash preimage, or rolled back if a timeout expires. Another approach, suited for stateful blockchains, is a global preimage manager – a smart contract which keeps track of the revealed preimages (see the Sprites paper). The key advantage of the preimage manager is that it lowers the requirements for capital lockup from quadratic to linear in the number of hops (in the worst case). In Lightning, the timelock at each next hop must differ from the previous one by a security margin to allow an intermediary node to confirm the correct balance on-chain in case of a dispute. In a stateful blockchain with a preimage manager, the deadline can be the same for all channels in the path. Other approaches to channel synchronization include scriptless multi-hop locks (see “General state channel networks”) and virtual channels.
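The “quadratic to linear” claim can be checked with a back-of-the-envelope calculation. The sketch below is a simplified model of my own, not taken from the paper: every channel on the path locks the same amount, staggered timelocks grow by a fixed margin `delta` per hop, and the common deadline in the preimage-manager case is treated as a constant.

```python
def staggered_timelocks(hops, delta):
    # Lightning-style: the channel next to the sender waits the longest,
    # and each following channel waits one security margin less.
    return [(hops - i) * delta for i in range(hops)]


def uniform_timelocks(hops, deadline):
    # Preimage-manager style: one shared deadline for all channels.
    return [deadline] * hops


def total_lockup(amount, timelocks):
    # Worst-case "coin-time" locked along the whole path.
    return amount * sum(timelocks)


# Illustrative numbers only: 1 coin, 4 hops, margin of 10 blocks.
staggered = total_lockup(1, staggered_timelocks(4, 10))  # 40 + 30 + 20 + 10 = 100
uniform = total_lockup(1, uniform_timelocks(4, 10))      # 4 * 10 = 40
```

In this model the staggered total is `delta * hops * (hops + 1) / 2`, which grows quadratically with the path length, while the uniform total grows linearly.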

Routing

An important question in channel networks is how to find a path to the receiver capable of delivering the payment. This task becomes tricky if we consider the resource limitations of end-user devices (smartphones) and privacy requirements (it would be bad to reveal all intermediate states in every channel, hence failures due to insufficient capacity are inevitable).

Shall we use onion routing for better privacy, one might suggest? Here is a surprising insight:

Some algorithms involve onion routing, which requires the random selection of nodes in a path to achieve its anonymity guarantees. As routing algorithms do not select the nodes randomly, it remains unclear if onion routing provides privacy in the context of payment channels.

Turns out, onion routing in Lightning may end up being just security theater?.. Mind = blown. Looks like yet another issue stemming from the fundamental “information vs value” distinction. Worth further investigation!

The authors define two approaches to routing: with global and local view. Lightning and Raiden use source routing with global view: the sender is expected to have a full (potentially slightly outdated) snapshot of the network. Local view routing algorithms fall into four categories: distributed hash tables (Celer’s cRoute), flow algorithms, landmark routing (SilentWhispers, Flare), and network embeddings (SpeedyMurmurs). Another dimension along which to compare routing algorithms is circuit switching (atomic payments) vs packet switching (splitting a payment into smaller chunks and transferring them through different paths). Routing for channel networks seems to be a very promising area of research, as

no routing algorithm fulfills all desired criteria.

Commit-chains

Commit chains are the second category of layer-two protocols, alongside channels. In contrast to channels,

commit-chains are maintained by one single party that acts as an intermediary for transactions.

The two proposals the authors review are NOCUST (an account-based commit-chain) and Plasma (a UTXO-based commit-chain). More specifically, they focus on Plasma Cash as the most mature Plasma flavor (I love this video of Karl Floersch explaining it!).

The general workflow for commit-chains is as follows. An operator collects commit-chain transactions and periodically commits to the latest state. There is no three-step lifecycle (open – transact – close) as in channels: the application is always on once launched. Users can withdraw their funds to the layer-one chain at any time.

There are two ways for users to verify that their transactions are reflected correctly in the latest state commitment: Merkle proofs and zero-knowledge proofs. The distinction is that Merkle root commitments “do not self-enforce”, whereas

ZKPs enforce consistent state transitions on-chain.

If I understood this correctly, we can theoretically encode the rules into the layer-one contract saying that the state transition is not valid unless accompanied by a zero-knowledge proof of its correctness. On the other hand, I don’t quite see how this can prevent censorship (an operator refusing to process chosen transactions as if they were not requested).
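For the Merkle-proof option, the user-side check looks roughly as follows. This is a generic sketch, not the tree layout of NOCUST or Plasma Cash: SHA-256 and the rule of duplicating the last node on odd levels are assumptions made for illustration.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def _next_level(level):
    # Pair up nodes; duplicate the last one if the level has odd length.
    if len(level) % 2:
        level = level + [level[-1]]
    return [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def merkle_root(leaves):
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [h(leaf) for leaf in leaves]
    while len(level) > 1:
        level = _next_level(level)
    return level[0]


def merkle_proof(leaves, index):
    # Collect (sibling_hash, sibling_is_left) pairs from leaf to root.
    level = [h(leaf) for leaf in leaves]
    proof = []
    while len(level) > 1:
        padded = level + [level[-1]] if len(level) % 2 else level
        sibling = index ^ 1
        proof.append((padded[sibling], sibling < index))
        level = _next_level(level)
        index //= 2
    return proof


def verify(leaf, proof, root):
    acc = h(leaf)
    for sibling, sibling_is_left in proof:
        acc = h(sibling + acc) if sibling_is_left else h(acc + sibling)
    return acc == root


txs = [b"tx0", b"tx1", b"tx2"]
root = merkle_root(txs)            # what the operator would commit on-chain
proof = merkle_proof(txs, 2)       # what the operator would hand to the user
included = verify(b"tx2", proof, root)
```

The proof only shows that a transaction is included under the committed root. It says nothing about whether the overall state transition was valid, which is the “do not self-enforce” part.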

Security and privacy

Layer-two protocols introduce new security / privacy challenges. Despite a common belief, L2 transactions are not absolutely private by default (hardly anything is). Sure, they are not permanently recorded in a globally replicated database for all chainalyses of the future to analyze. But the privacy problem is far from solved. Some of the relevant proposals mentioned in the paper are TumbleBit, Bolt (the anonymous channels from the Zcash team, not the Lightning specification), and Rayo / Fulgor (this blog has a summary of those).

The listed security threats include:

the requirement to keep keys in a hot wallet;
the requirement to constantly be online;
the problem of mass exits (bank runs for the new age – everybody is trying to exit the malicious layer-two system, but the layer-one can’t handle the load, timeouts expire, preventing honest dispute resolution);
high cost of on-chain proof verification (650k gas on Ethereum for a single ZKP!);
wormhole attack (two malicious nodes along the same path short-cut the preimage off the protocol, taking the fees from nodes between them);
capital lock-up attack (start a payment but never reveal the preimage; all intermediary nodes pay the opportunity cost of locked channel capacities).

An interesting observation is the finality vs collateral trade-off between channels and commit-chains:

Unlike previously discussed layer-two protocols, the intermediary commit-chain operator does not require on-chain collateral to securely route a payment <…> commit-chain transactions do not offer instant transaction finality (as in channels) but eventual finality after commit-chain transactions are recorded securely in an on-chain checkpoint.

The way I think about this is that in channels intermediary nodes provide capacity, so they must lock it up. The meaning of the phrase “Alice sent 1 coin to Bob via Charlie” is actually: Alice sent 1 coin to Charlie, and Charlie sent 1 coin to Bob, with atomicity enabled by the protocol. But Charlie must have at least 1 coin to transfer to Bob while he waits to be able to pull 1 coin from Alice! On the contrary, in commit-chains, users have to trust the operator for all actions after the latest commitment. Because of these additional trust assumptions, the system can operate without collateral: a Plasma operator implementing a payment service doesn’t need to have any coins to be able to update its internal database and commit to the next state. On the other hand, operators may optionally put up collateral which they would lose in case of fraud to boost users’ confidence. Moreover, by introducing some trust, we mitigate the mass exit problem:

commit-chains do not require a deadline for users to withdraw their coins, mitigating the transaction fee bidding war.

This is possible because the operator doesn’t hold any collateral which is locked in case of a dispute and must be unlocked eventually (hence timelocks)!

Another very interesting observation is that layer-two makes fees ~~great~~ dependent on the transaction value again. In traditional finance, it’s common to charge fees as a percentage of the transaction value. Bitcoin has a completely different model: transactions compete for space in blocks which is limited in terms of bytes, not coins. Therefore, a common unit for Bitcoin transaction fees is satoshis per byte. It costs about the same to transfer one dollar or a million dollars, if the transaction script has the same structure. On the contrary, layer-two re-introduces the value semantics into protocols: relaying a payment requires liquidity, and intermediary nodes pay more in opportunity costs of locked capital when transferring a million dollars as opposed to one dollar. This suggests that the layer-two fee economics will look very different from layer-one (and we haven’t fully explored the latter yet…).
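A small numeric sketch of the difference. The on-chain fee depends only on size and fee rate; the per-hop routing fee uses the base-plus-proportional shape that Lightning nodes advertise. All numbers are made up for illustration.

```python
def onchain_fee_sat(tx_size_bytes, feerate_sat_per_byte):
    # The transferred value does not appear anywhere in this formula.
    return tx_size_bytes * feerate_sat_per_byte


def hop_fee_msat(amount_msat, base_msat, proportional_ppm):
    # Flat part plus a part proportional to the forwarded amount
    # (parts per million), using integer arithmetic.
    return base_msat + amount_msat * proportional_ppm // 1_000_000


# Hypothetical 250-byte transaction at 10 sat/byte: same fee for any value.
small_onchain = onchain_fee_sat(250, 10)
large_onchain = onchain_fee_sat(250, 10)

# Hypothetical hop policy: 1000 msat base, 100 ppm proportional.
small_hop = hop_fee_msat(1_000_000, 1000, 100)       # 1100 msat
large_hop = hop_fee_msat(1_000_000_000, 1000, 100)   # 101000 msat
```

The routing fee scales with the amount because the intermediary’s locked liquidity scales with it too.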

Conclusion

I highly recommend reading the whole paper (maybe multiple times, as it’s quite dense). There is just so much work to do in layer-two! We have barely scratched the surface. I hope to take part in this exploration soon.

Routing cryptocurrency with the Spider network

2019-04-13T00:00:00+00:00

Let’s continue our journey through recent papers which suggest ways to optimize routing in payment channel networks. In previous posts, we looked at SilentWhispers and SpeedyMurmurs. Both approaches emphasized privacy as an important goal, but employed different constructions: landmark-based routing (SW) and embedding-based routing (SM). Today let’s look into a paper entitled “Routing cryptocurrency with the Spider network” (2018) by Sivaraman et al.

Introduction

I won’t spend much time summarizing the introduction sections, as the points the authors underline are more or less the same in every paper in this area: blockchains are cool but not scalable, second-layer solutions are coming to the rescue, but how do we find paths?

The authors outline the existing approaches to routing in PCNs, listing all the usual suspects: the generic max-flow algorithm, Flare, SilentWhispers, and SpeedyMurmurs. In all previous approaches, however, payment atomicity was considered “a red line”: a payment must either go through or fail, tertium non datur. The key idea of the authors is to weaken this requirement, and optimize for two separate variables: the success ratio (the share of transactions successfully processed) and the success volume (the total monetary value of the transactions successfully processed). More concretely, the Spider routing algorithm

actively accounts for the cost of channel imbalance by preferring routes that rebalance channels.

The authors identify the key challenge to continuous operation of PCNs: if a channel is consistently utilized in one direction more than in the other, it eventually gets depleted, and requires an on-chain transaction (with an on-chain fee) to continue operating. To address the challenge, they introduce another approach to modeling the whole problem, which I find clever and insightful.

The authors also emphasize one crucial security assumption in PCNs, which is often overlooked:

the underlying cryptography backing payment channels assumes that transactions on the payment channels are larger than the blockchain transaction fee to ensure that broadcasting the true balance is profitable.

Model

Previous approaches, such as SilentWhispers (SW) and SpeedyMurmurs (SM), operate under the paradigm of periodic rebalancing. Nodes establish channels and start using them. The network as a whole only knows the initial total channel balances. As channels are being used, their capacity distributions deviate more and more from the initial state. As a consequence, the share of payments failed due to insufficient capacity also rises. To make the network useful again, rebalancing is required. Rebalancing either happens once every epoch (SW), or “on-demand” (SM). In both cases, this is a distinct process, separated from usual transaction routing.

The authors of Spider have a more ambitious vision: a payment network which doesn’t need rebalancing at all! Rebalancing in Spider, instead of being a separate process, happens naturally as payments are routed.

An example

The usual way to model a PCN is a graph where nodes represent peers and edges represent channels (usually directed and weighted according to capacity). Let’s call it the network graph, as it represents the existing topology of the network. The authors separate the two questions:

What is the current structure of the network? The answer is the network graph.
How do the participants want to use the network? The answer is the payment graph.

The payment graph shows the intentions of the peers: how much value they collectively want to transfer in which directions. Money flows are modeled not as individual atomic transactions, but as constant flows (which can be approximated with a series of unit payments). The key question is, given a network graph and a payment graph, how many payments can we satisfy? Alternatively, given a payment graph, what is the optimal network graph? Or, given a network graph, can we influence the payment graph using fees to make the network balanced?

Consider a fully connected network of 3 nodes: Alice, Bob, and Charlie. Say, the initial state is three channels with capacities of 1 bitcoin on one side: Alice to Bob, Bob to Charlie, and Charlie to Alice. Alice wants to transfer 1 bitcoin per day to Bob, Bob – 1 bitcoin per day to Charlie, Charlie – 1 bitcoin per day to Alice. If every node chooses the shortest path, all 3 channels will be depleted after 1 day. But if, instead, one of the “payment streams” is routed sub-optimally (Charlie – Bob – Alice instead of Charlie – Alice), the flow of funds in one direction would offset that in the other, and the network would be able to run indefinitely (assuming zero fees).
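The three-node example can be checked by adding up the net daily flow on each channel for both routing choices. The helper `net_channel_flows` is a hypothetical name; a positive value means the net flow goes from the alphabetically first endpoint to the second.

```python
from collections import defaultdict


def net_channel_flows(routes):
    """routes: list of (path, rate) pairs, where path is a list of node names."""
    net = defaultdict(int)
    for path, rate in routes:
        for u, v in zip(path, path[1:]):
            a, b = sorted((u, v))
            # Count flow a -> b as positive and b -> a as negative.
            net[(a, b)] += rate if (u, v) == (a, b) else -rate
    return dict(net)


# Everyone takes the direct (shortest) path.
shortest = [
    (["Alice", "Bob"], 1),
    (["Bob", "Charlie"], 1),
    (["Charlie", "Alice"], 1),
]

# Charlie takes the detour via Bob instead.
detour = [
    (["Alice", "Bob"], 1),
    (["Bob", "Charlie"], 1),
    (["Charlie", "Bob", "Alice"], 1),
]

unbalanced = net_channel_flows(shortest)
balanced = net_channel_flows(detour)
```

With direct paths every channel has a net flow of 1 bitcoin per day in one direction, so each gets drained. With the detour the Alice–Bob and Bob–Charlie channels both net out to zero, and the Charlie–Alice channel is not used at all.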

The authors give a somewhat more elaborate “motivating example” in the beginning of Section 5 to illustrate this point:

The goal is to maintain the network in a “balanced” condition, that is, such that every node has equal incoming and outgoing value flows. Let’s define the “balanced” transaction rate as the sum of flows through all nodes which can be maintained indefinitely. It turns out that if every source node chooses the shortest path to its destination, the overall throughput isn’t optimal! In a sample network of 5 nodes, the authors show that the “balanced” transaction rate under shortest-path routing is lower than the maximum achievable when routes are chosen to keep channels balanced (similar to my previous example with 3 nodes).

The authors then show that any payment graph can be decomposed into two components: the circulation and the DAG. These two graphs have the same nodes as the payment graph, but the weight of each edge is split between the two components (it may be zero in one of them).

All flows in the circulation are balanced at every node (incoming and outgoing flows are equal). Everything that’s left goes into the DAG. The authors show that for every payment graph there is a maximum circulation graph (with the highest transaction rate), and for this circulation there is a network graph which achieves the maximum transaction rate. The total throughput would depend only on the locked-up capacity compared to the “settlement delay” – the time it takes for the receiver’s secret to propagate to the routing node. Until that moment, the funds are “in flight” and can’t be used for other transfers. Consequently, no network graph achieves a transaction rate higher than that of the circulation.
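One simple way to see the decomposition is greedy cycle cancelling: keep pulling directed cycles out of the payment graph until none remain. This is my own sketch, not the authors’ procedure, and it is not guaranteed to find the maximum circulation in general; it only illustrates how each edge weight gets split between the two components.

```python
def find_cycle(edges):
    """Return the edges of some directed cycle with positive weight, or None."""
    adj = {}
    for (u, v), w in edges.items():
        if w > 0:
            adj.setdefault(u, []).append(v)
    state = {}   # 1 = on the current DFS path, 2 = fully explored
    stack = []

    def dfs(u):
        state[u] = 1
        stack.append(u)
        for v in adj.get(u, []):
            if state.get(v) == 1:
                nodes = stack[stack.index(v):] + [v]
                return list(zip(nodes, nodes[1:]))
            if v not in state:
                found = dfs(v)
                if found:
                    return found
        stack.pop()
        state[u] = 2
        return None

    for node in list(adj):
        if node not in state:
            found = dfs(node)
            if found:
                return found
    return None


def decompose(payment_graph):
    dag = dict(payment_graph)                     # what remains after cancelling
    circulation = {e: 0 for e in payment_graph}
    while True:
        cycle = find_cycle(dag)
        if cycle is None:
            return circulation, dag
        bottleneck = min(dag[e] for e in cycle)
        for e in cycle:
            dag[e] -= bottleneck
            circulation[e] += bottleneck


# Alice wants to send 2 per day to Bob, Bob 1 to Charlie, Charlie 1 to Alice.
payments = {("A", "B"): 2, ("B", "C"): 1, ("C", "A"): 1}
circulation, dag = decompose(payments)
```

Here the cycle A → B → C → A carries 1 unit and forms the circulation, while the leftover 1 unit from A to B is the DAG part, which cannot be sustained without eventually depleting a channel.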

Adding on-chain rebalancing

What happens if a payment channel does get depleted?

If I lack capacity on the local side (I want to send more), I can top up my side of the channel with an on-chain transaction (not sure this is currently implemented in Lightning, but this assumption is useful for modeling).
If I lack capacity on the remote side (I want to receive more), I can either spend some coins, or ask my counterparty to top up their side of the channel. Assuming there is no fiat world, and my counterparty is not particularly generous, I’m expected to reimburse them with, again, an on-chain transaction (see the first option).

To reflect these options, the authors add the cost of on-chain rebalancing to the picture. Putting all parts together (and accounting for the off-chain as well as on-chain fees), they come up with a system of constraints which forms a linear programming (LP) problem: our goal is to maximize the throughput given a set of constraints. Solving it with the primal-dual algorithm, the authors derive the formula for the optimal fee structure that achieves the maximal transaction rate. They also note that this algorithm may converge too slowly if the network conditions are changing rapidly, and suggest a less precise but faster-converging alternative (“waterfilling”).

Evaluation

The authors compare their approach (with both the LP and waterfilling algorithms) with alternatives: SW, SM, and max-flow. A set of transactions from Ripple acts as a real-world dataset (as in the previous papers on the subject).

I’ll come back to it later.

The first experiment compared the performance of the six algorithms (Spider LP, Spider Waterfilling, Max-flow, Shortest path, SW, and SM) for atomic payments only with two network topologies: ISP-like (not clear what it means exactly though) and a subset of Ripple. See the results in the figure below; I’m surprised that SM performed so poorly and can’t come up with a reason why (comments welcome!).

Note also that Spider only operates in benign conditions (i.e., in a permissioned network):

[O]ur design does not address incentives <…>. It also doesn’t account for adversarial routers relaying wrong values to the sender.

Isn’t it the most important point? How does Lightning handle it by the way? Wouldn’t be surprised if it also doesn’t (looks exciting: so much space for research!).

Conclusion and the road traffic analogy

The Spider network incorporates one of the key ideas from information networks (packet switching), which allowed the Internet to scale massively, to value networks. This definitely looks promising. But though the authors compare their approach to SW and SM, it still seems to me that they are answering different questions.

Citing the paper (page 12, Section 5.3, emphasis mine):

To select paths, end-hosts monitor the total price on different paths and choose the cheapest option.

And in Section 6.1, describing the experimental setup:

We restrict both algorithms to use 4 disjoint shortest paths for every source-destination pair.

Well, but how do I get the list of these “different paths” / “disjoint shortest paths”? Aren’t we back at square one, in search of an efficient routing algorithm?

I think the following analogy might be useful. Imagine a city suffering from road congestion. People are commuting by car from home to work and back every day, and there is not enough road capacity.

Imagine an ideal situation:

the city consists of two areas A and B;
half of the population lives in A and works in B, and the other half lives in B and works in A;
the working hours are distributed equally throughout the day.

Under these rather unrealistic conditions, roads are equally occupied throughout the day. But in reality, people mostly work in the city center and live outside it, work mostly during the day and sleep at night, so the flow is not balanced.

There are two related, but distinct questions:

For an individual: given all preconditions (the home address, the work address, the traffic conditions), how do I choose the best route?
For the city administration (or whoever owns the resources): how much do we charge for roads to balance supply and demand?

Max-flow, SW, and SM answer the first question. What makes the task difficult is that the map of the city is too big for any individual to fully possess. SilentWhispers finds the shortest way to a highway (landmark), and from the highway (navigation on the highway is considered trivial). SpeedyMurmurs, on the other hand, at every crossing turns in the direction which brings you closer to your destination.

Spider network seems to answer the second question: given a payment graph (where people want to go) and the network graph (road capacities), how much should entering each portion of a road cost so that equal amounts of traffic would go in the opposite directions? If the most direct highway to my destination costs a lot, I’d be happy to take a detour via less congested roads. If just the right share of commuters does so, the traffic will be balanced. But is this even the right goal to strive for? Shouldn’t we aim for getting more people to their destinations using as little fuel as possible? (Public transport, anyone?)

Maybe we’re hitting the limits of usefulness of this analogy, but maybe we aren’t.

SpeedyMurmurs: applying friend-to-friend routing to payments

2019-03-27T00:00:00+00:00

In a previous post, I discussed SilentWhispers – a routing algorithm for credit networks. Today I’ll dive into a follow-up paper by a partially intersecting group of authors, entitled “Settling payments fast and private: decentralized routing for path-based transactions”, which presents a routing algorithm called SpeedyMurmurs.

Routing, revisited

The SilentWhispers paper (2016) was mostly dealing with credit networks (and was itself a continuation of the work of some of the co-authors on this subject). The introduction of the SpeedyMurmurs paper (2017) sets the stage in the blockchain world. The authors introduce the notion of a path-based transaction (PBT) network, which unifies credit networks such as Ripple and Stellar with L2 solutions on top of open blockchains (Lightning, Raiden). They then define the three mechanisms that comprise a PBT network:

routing;
payment;
accountability.

In all PBT networks, payments are executed in an atomic series of elementary operations. Credit networks use a (somewhat) centralized entity for ensuring atomicity (even “real-world” courts maybe). Lightning and Raiden rely on the base layer. The problem of routing, however, is orthogonal to the implementation of enforcement, so it makes sense to unify the credit-like networks under an umbrella term PBT.

Speaking of the routing mechanism specifically, the authors list three dimensions to measure how well it performs:

effectiveness (share of successfully completed transactions);
efficiency (delays and overhead);
scalability (how the system responds to the growth of the number of nodes, links, and transactions).

On top of that, the authors argue that routing must not compromise users’ privacy.

Ripple and Stellar base routing decisions on the information stored in a public blockchain. A number of obviously centralized approaches are mentioned as well, but these are not interesting (with a trusted third party we can do anything). A more noteworthy proposal, which I hadn’t heard of before, is Flare – a 2016 BitFury-designed routing proposal for Lightning which didn’t eventually go into production (though it does go into my reading list).

SilentWhispers, as the authors modestly note, is

the most promising approach in regard to privacy

Let me just briefly remind you of the key ideas:

a number of well-known, highly-connected nodes called landmarks periodically run a breadth-first search and create a spanning tree over all nodes;
payments are routed via a landmark;
landmarks calculate path capacities using a multi-party computation, which conceals individual nodes and their link capacities;
if the path capacity via the first landmark is low, the sender tries another one.

However, it suffers from a number of drawbacks:

the spanning tree has to be re-computed once every epoch, including the parts which have not changed;
all paths go through a landmark, even if both the sender and the receiver are in the same sub-tree;
as all nodes along the path must send shares of their local link capacity to all landmarks, the number of messages grows quadratically;
the protocol doesn’t handle concurrency.

SpeedyMurmurs aims to solve these problems by abandoning landmark routing and using another approach called embedding-based routing.

Embedding-based routing

What are embeddings, exactly? I was looking for a definition (a sentence starting with “An embedding is…”) but didn’t find one. Instead, the notion is introduced as follows:

Embeddings rely on assigning coordinates to nodes <…> and having nodes forward packets based on the distances between coordinates known to that node and a destination coordinate.

For the proposed routing algorithm, greedy embeddings are used. Greedy embeddings assign coordinates based on the position of the node in the spanning tree. Then a distance function is defined on pairs of coordinates with the following essential quality: for every (sender, destination) pair, the sender has a neighbor which is closer to the destination than the sender itself. This means that a greedy algorithm – forwarding a message to the neighbor which is the closest to the destination – will always find a path (and not get stuck in a local minimum).

Consider prefix embeddings – greedy embeddings where a coordinate of a node contains the coordinate of its parent as a prefix. Imagine a binary tree with three levels. The root gets an empty string as its coordinate. The two nodes on the first level get “0” and “1”. Their children get, rather unsurprisingly, “00”, “01”, “10”, and “11”. If we want to send a message from “00” to “11”, the shortest path follows the tree up to the root and back to the leaf: 00 – 0 – “” – 1 – 11. If the sender and the receiver are in the same sub-tree, we don’t have to go all the way up to the root: 00 – 0 – 01. There might also be “shortcuts” – links between nodes which do not belong to the tree. If a suitable shortcut exists, a routing algorithm should choose it.

On each step, the next node to forward the message to is chosen greedily among all neighbors (including shortcuts). The distance function is defined as d(u,v) = |u| + |v| - 2·CPL(u,v), where |u| is the length of the path from u to the root (equivalently, the length of its coordinate vector), and CPL is the common prefix length of the two coordinates. The formula basically says:

go from u to the root;
go from the root to v;
oh, you didn’t actually have to traverse the common part of these paths (twice).
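The distance function and the greedy forwarding rule fit in a few lines. In this sketch coordinates are plain strings as in the binary tree example, and the neighbor table (including one shortcut link between “00” and “10”) is made up for illustration.

```python
def common_prefix_length(u, v):
    n = 0
    for a, b in zip(u, v):
        if a != b:
            break
        n += 1
    return n


def distance(u, v):
    # d(u, v) = |u| + |v| - 2 * CPL(u, v)
    return len(u) + len(v) - 2 * common_prefix_length(u, v)


def greedy_route(src, dst, neighbors):
    path = [src]
    current = src
    while current != dst:
        # Forward to the neighbor closest to the destination.
        nxt = min(neighbors[current], key=lambda n: distance(n, dst))
        if distance(nxt, dst) >= distance(current, dst):
            raise RuntimeError("no neighbor is closer to the destination")
        path.append(nxt)
        current = nxt
    return path


# Three-level binary tree; the root has the empty coordinate.
tree = {
    "": ["0", "1"],
    "0": ["", "00", "01"],
    "1": ["", "10", "11"],
    "00": ["0"], "01": ["0"], "10": ["1"], "11": ["1"],
}

via_root = greedy_route("00", "11", tree)      # ['00', '0', '', '1', '11']
same_subtree = greedy_route("00", "01", tree)  # ['00', '0', '01']

# Add a hypothetical shortcut link 00 <-> 10 that is not part of the tree.
with_shortcut = dict(tree)
with_shortcut["00"] = ["0", "10"]
with_shortcut["10"] = ["1", "00"]
shorter = greedy_route("00", "11", with_shortcut)  # ['00', '10', '1', '11']
```

In a pure tree the greedy step always makes progress, so the error branch never triggers; it is only there as a guard for malformed neighbor tables.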

It’s not the only way to construct embedding-based routing, but it follows the general recipe:

construct a spanning tree;
assign a pre-defined coordinate to the root;
let each node derive its coordinate from its parent’s coordinate;
define a suitable distance function between coordinates.
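The first three steps of the recipe can be sketched with a breadth-first search. Appending a sequential child index to the parent’s coordinate is an assumption made for readability; the actual protocols may pick the appended element differently.

```python
from collections import deque


def assign_coordinates(graph, root):
    """graph: dict mapping a node to its list of neighbors (undirected links).

    Returns a dict mapping each reachable node to a tuple coordinate.
    """
    coords = {root: ()}          # the root gets the empty coordinate
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        child_index = 0
        for node in graph[parent]:
            if node not in coords:
                # A child's coordinate is its parent's plus one more element.
                coords[node] = coords[parent] + (child_index,)
                child_index += 1
                queue.append(node)
    return coords


# Toy network: landmark L, with the link b - c left out of the tree
# (it becomes a shortcut).
graph = {
    "L": ["a", "b"],
    "a": ["L", "c"],
    "b": ["L", "c"],
    "c": ["a", "b"],
}
coords = assign_coordinates(graph, "L")
```

Node `c` ends up as a child of `a` because the search reaches it from `a` first; its link to `b` stays usable for forwarding even though it is not a tree edge.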

This is all well and good, but what about privacy? In a routing scheme described above, every intermediary node knows where the message is going. A privacy-preserving protocol called VOUTE comes to the rescue.

VOUTE uses anonymous return addresses, which allow intermediary nodes to choose the neighbor closest to the destination without revealing its coordinate. Sounds somewhat like homomorphic encryption which supports distance calculation (or comparison, at least) on encrypted coordinates. See the VOUTE paper (and a shorter version) for more detail.

By the way, did you notice how we now talk about messages and not transactions?

Friend-to-friend networks

Embedding-based routing was initially conceived for anonymous messaging in friend-to-friend (F2F) networks. In F2F, links are established and maintained consciously, relying on off-protocol trust relations. This differs from P2P networks like BitTorrent, where a user connects to whichever peers provide the highest bandwidth (I don’t care where I get my file from as long as the checksum is correct). In anonymous messaging, maximizing bandwidth is not as important as not letting your data fall into untrusted hands.

F2F lies in between data-based (BitTorrent) and value-based (Lightning) P2P networks. F2F links are more semantically charged compared to those in filesharing networks, but, contrary to PBT networks, are undirected and unweighted. Undirected means that messages can be directly transmitted from Alice to Bob if and only if they can be directly transmitted from Bob to Alice. Unweighted means that

transmitting a message does not affect the ability of the link to transmit future messages

(I think this is a rather deep observation highlighting the crucial difference between digital representations of information and value.)

In order to adapt routing algorithms from F2F to PBT, we have to account for two crucial features of credit links:

asymmetry: links from Alice to Bob and from Bob to Alice generally differ in capacity;
weights: all links are weighted, and weight changes must be accounted for during graph re-balancing.

SpeedyMurmurs

SpeedyMurmurs adapt VOUTE – a greedy embedding-based routing with anonymous return addresses for F2F messaging – to the PBT model. The protocol operates in a weighted graph model and distinguishes between unidirectional and bidirectional links. Alice and Bob are said to share a bidirectional link if they share two links with positive weights in opposite directions. I’m a bit skeptical about whether this model accurately reflects the reality of PCNs like Lightning, where the following four types of relationship between Alice and Bob have distinct qualities [¹]:

sharing no channel;
sharing an open channel with all capacity on one side;
sharing a pair of channels with their full capacities on opposite sides;
sharing a channel with non-zero capacity on both sides.

(Formalizing the properties of these states may also be interesting!)

We assume, as in SilentWhispers, that there is a set of well-connected and well-known nodes called landmarks. Each landmark defines its own spanning tree. In each tree, each node is assigned a coordinate based on its parent’s coordinate. A payment is split into random chunks, and each chunk is sent along a path within a different tree.

The protocol consists of three algorithms:

setRoutes creates spanning trees and assigns coordinates to nodes;
setCred reacts to a change in a link capacity;
routePay discovers a suitable path for the requested transaction.

Setting the routes

The authors modify VOUTE’s tree creation algorithm by splitting it into two phases. During the first phase, the original algorithm runs, considering only bidirectional links. Then, if any nodes are left outside the tree, they “attach” to it with their unidirectional links. Note that the algorithm described in the paper (Algorithm 1, page 7) assumes a central coordinator which maintains a queue of nodes not yet in the tree. In a distributed scenario, the authors acknowledge,

starting the second phase is tricky

The nodes that are not yet in the tree are not sure whether they should wait for an invitation from a node with a bidirectional link, or just be satisfied with a unidirectional one. But the problem can be circumvented by choosing a proper timeout, after which a node assumes the second phase has started.
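To make the two phases concrete, here is a rough centralized sketch in Python. It is not the paper's Algorithm 1: the data layout (a `credit` dictionary keyed by directed node pairs) and the function name `build_tree` are my own assumptions, and the timeout-based distributed variant is not modeled.

```python
from collections import deque

def build_tree(nodes, credit, root):
    """Return {node: parent} for a tree grown in two phases (centralized sketch)."""
    def has_link(u, v):
        return credit.get((u, v), 0) > 0

    def bidirectional(u, v):
        return has_link(u, v) and has_link(v, u)

    def any_direction(u, v):
        return has_link(u, v) or has_link(v, u)

    parent = {root: None}

    def grow(linked):
        # BFS outward from every node already in the tree,
        # following only links accepted by `linked`.
        queue = deque(parent)
        while queue:
            u = queue.popleft()
            for v in nodes:
                if v not in parent and linked(u, v):
                    parent[v] = u
                    queue.append(v)

    grow(bidirectional)   # phase 1: bidirectional links only
    grow(any_direction)   # phase 2: leftover nodes attach via unidirectional links
    return parent

nodes = ["L", "A", "B", "C"]
credit = {
    ("L", "A"): 5, ("A", "L"): 5,   # bidirectional
    ("A", "B"): 3, ("B", "A"): 2,   # bidirectional
    ("L", "B"): 1,                  # unidirectional
    ("L", "C"): 4,                  # unidirectional
}
print(build_tree(nodes, credit, "L"))
```

In this example B is adjacent to the landmark L through a unidirectional link, but it still ends up under A, because the first phase prefers bidirectional links. C has no bidirectional link at all and joins only in the second phase.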

Setting the credit

The key problem is how to make the network react to changes in link capacities. The authors suggest the following algorithm. The network reacts to one of the following events:

a new unidirectional link: as one of the nodes is not yet part of the tree, it chooses the other as its parent;
a now non-zero bidirectional link: if one of the nodes has only a unidirectional link to its current parent, it should choose the newly connected node with a bidirectional link as a new parent instead; this leads to the tree replacing unidirectional links with bidirectional ones whenever possible, leading to higher potential throughput;
a removed link: as one of the two nodes in question is a child of the other, the child selects a new parent.

Every time a node changes its parent, all its neighbors are notified; they then also choose a new parent and a corresponding coordinate.

Note that setCred doesn’t react to changes in capacities of existing links!

Routing the payment

RoutePay discovers a path between a sender and a receiver capable of transferring the required amount. To improve anonymity, the sender splits the payment into random chunks and sends them along paths in different trees. This avoids the costly multi-party computation of SilentWhispers and also gives a passive attacker less information on the lower bound of the payment.

The routing accounts for weighted links but doesn’t actually do much to optimize for this new model. Routing fails if there is no neighbor with a coordinate closer to the receiver and sufficient available credit. There may be a path with sufficient credit which temporarily goes “the wrong way”, but greedy routing wouldn’t consider it. Can we make routing a bit less greedy to account for some combination of the qualities “being close to the destination” and “having sufficient capacity”? Another open question worth investigating!
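The failure mode described above is easy to reproduce in a toy example. The sketch below is my own simplification, not the paper's routePay: coordinates are tuples as in a prefix embedding, `credit` maps directed links to available credit, and `route_chunk` is a hypothetical name.

```python
def tree_distance(a, b):
    """Tree distance between two prefix-embedding coordinates."""
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return len(a) + len(b) - 2 * common

def route_chunk(coords, credit, source, target, amount):
    """Greedy routing that only considers closer neighbors with enough credit."""
    path = [source]
    current = source
    while current != target:
        here = tree_distance(coords[current], coords[target])
        candidates = [
            v for (u, v), available in credit.items()
            if u == current
            and available >= amount
            and tree_distance(coords[v], coords[target]) < here
        ]
        if not candidates:
            return None  # no closer neighbor with sufficient credit
        current = min(
            candidates,
            key=lambda v: (tree_distance(coords[v], coords[target]), v),
        )
        path.append(current)
    return path

# R is the root with children A and B; T is a child of A.
coords = {"R": (), "A": (0,), "B": (1,), "T": (0, 0)}
credit = {
    ("A", "T"): 2,                    # direct link, little credit
    ("A", "R"): 10, ("R", "B"): 10,   # detour with plenty of credit...
    ("B", "T"): 10,                   # ...via a non-tree link
}
print(route_chunk(coords, credit, "A", "T", 2))
print(route_chunk(coords, credit, "A", "T", 5))
```

For an amount of 5, the path A, R, B, T has sufficient credit on every hop, but its first hop moves away from T in coordinate space, so the greedy rule never tries it and routing fails.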

Privacy analysis

The authors dedicate a separate sub-section to privacy guarantees of SpeedyMurmurs. Most of the privacy properties follow from proofs in the VOUTE paper, but one assumption regarding value privacy looks suspicious:

we say that a PBT network achieves value privacy, if the adversary cannot determine the value c <…>, if the adversary is not sitting in any of the involved routing paths.

A rather weak assumption, isn’t it? In credit networks with an external identity system, where establishing Sybil nodes requires social engineering attacks at scale, this might be reasonable. But for a PCN over an open blockchain, where a resourceful attacker can easily launch a well-connected, well-funded node and route everyone’s payments while spying on them… This just doesn’t seem right. Note that splitting the value into random chunks is supposed to be a countermeasure against this attack. The authors note:

when the adversary corrupts some of the paths <…> we cannot prevent the adversary from estimating c.

Indeed, the adversary at least learns the lower bound for the total value, and as the value is shared uniformly, and the number of landmarks is known, the total value may be estimated as L * c_i, where L is the number of landmarks.
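A small sketch of this estimate, under assumptions of my own: the sender splits the value by drawing L - 1 uniform random cut points (one plausible way to share it uniformly, not necessarily the paper's exact procedure), and the adversary observes a single chunk. The names `split_randomly` and `estimate_total` are hypothetical.

```python
import random

def split_randomly(total, num_landmarks, rng):
    """Split `total` into `num_landmarks` non-negative chunks using random cut points."""
    cuts = sorted(rng.uniform(0, total) for _ in range(num_landmarks - 1))
    bounds = [0.0] + cuts + [float(total)]
    return [hi - lo for lo, hi in zip(bounds, bounds[1:])]

def estimate_total(observed_chunk, num_landmarks):
    """Adversary's guess L * c_i from one observed chunk c_i."""
    return num_landmarks * observed_chunk

rng = random.Random(2024)          # fixed seed so that runs are repeatable
chunks = split_randomly(100, 4, rng)
for c_i in chunks:
    print(round(c_i, 2), "->", round(estimate_total(c_i, 4), 2))
```

Each observed chunk is a lower bound on the total. A single guess can be far off, but averaged over the chunks the guesses equal the true value, so every additional corrupted path tightens the estimate.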

Evaluation

The authors identified multiple “axes” along which routing algorithms for P2P networks may differ:

routing: landmark-based, greedy embedding-based, or “tree-only”;
stabilization method: periodic or on-demand;
assignment of credit on paths: multi-party computation or random;
landmark selection: highest degree or random;

and five metrics: fraction of successful transactions, delay, messages sent per transaction, path length, and messages related to stabilization per epoch.

SilentWhispers are landmark-based, with periodic stabilization, and multi-party computation. SpeedyMurmurs are greedy embedding-based, with on-demand stabilization, and random credit assignment. Both use highest-degree nodes as landmarks.

Using the GTNA graph analysis framework and the dataset of Ripple transactions, the authors compared multiple combinations of the parameters listed above. Unsurprisingly, Ford-Fulkerson (a generic max-flow algorithm) “exhibited prohibitive delays”. More interestingly, SpeedyMurmurs performed better than SilentWhispers across all metrics in the “static scenario” (without stabilization). In a dynamic scenario, during “normal operation”, SpeedyMurmurs were also superior, but during “intervals of frequent change” the opposite was the case. All algorithms except for Ford-Fulkerson showed a success ratio substantially lower than 100%, none of them higher than 91%.

I fully agree with the authors that

users might not be willing to accept a failure rate of 10%

Lots of work lies ahead before we make payment networks appealing to the general public (or, more realistically, at least to developers who would create payment-network-based applications targeting the general public).

Summary and questions

SpeedyMurmurs are an interesting proposal, but I can’t get over the feeling that we can’t just make relatively minor tweaks to algorithms from data transfer networks and apply them to value transfer. We definitely can and should borrow ideas from existing research, but routing algorithms for P2P messaging may need more radical modifications to be useful in a PBT setting, especially in PCNs like Lightning and Raiden.

Another facet of the same issue: most algorithms in the paper are introduced in a “centralized” manner (assuming there is a central coordinator who runs the protocol). Then a paragraph follows, starting with something along the lines of: “the distributed version of the algorithm is basically the same, but nodes additionally do this and that”. I’m not convinced that patching centralized algorithms in such a manner does not introduce vulnerabilities or substantial inefficiencies. I’d prefer algorithms to be introduced in a decentralized setting directly.

The same applies to the details of the graph model. How well does modeling a bidirectional link as a pair of independent unidirectional ones reflect the reality (of networks being implemented in practice, as they are the most useful things to model)? Do differences between stateless (Lightning, Bitcoin-style) vs stateful (Raiden, Ethereum-style) models play a role here?

Moreover, as I’m undoubtedly spoiled by the importance of Bitcoin’s incentive layer, every time I see a protocol description containing a phrase like “every node does X and notifies all its neighbors”, certain questions immediately start popping up in my head:

what if it doesn’t?
why would it want to?
what if it does it but incorrectly?
what if it sends equivocating messages?
what if an attacker launches 100x more nodes than the network currently contains?

Und so weiter, und so weiter…

Yet, in the end, isn’t it always the same question – and always the same answer?

Cases 3 and 4, though they both enable payments in both directions, differ, at least, in that they require, respectively, two or one on-chain transactions to redeem the balances on layer-1. ↩
