Resilience under Failures
We now investigate how failures affect each MNO in terms of FDP and FSP, and how national roaming could help in these cases. We consider two failure types. First, in an isolated failure, a cell tower fails independently of the other towers due to software errors (e.g., misconfiguration or malicious attacks) or hardware errors (e.g., power loss). This type also covers regular maintenance, during which some BSs are out of service. Second, correlated regional failures are failures in a spatial locality due to certain events, e.g., a thunderstorm in a smaller region or an earthquake or flood affecting a larger region. Failures in the backhaul transport network also fall into this category, as they affect multiple BSs in a region simultaneously [59], [61]. In this case, BSs located in the same region are affected similarly. For isolated failures, we test a scenario in which a fraction $p_{iso}$ of the BSs fails. For a correlated regional failure, all BSs within $r_{fail}$ meters of the center fail, where we take the center to be the centroid of the region.
We conduct simulations for these scenarios at the municipality level, each with 100 independent runs for statistical significance. We choose the municipality of Enschede because it is a medium-sized municipality with both urban and rural areas. The general results for Enschede are similar to those for the other municipalities, which can be found in [6].
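This section evaluates how each mobile network operator (MNO) performs in terms of FDP and FSP when base stations (BSs) fail, and how far national roaming mitigates such failures. We consider two kinds of failure. The first is an isolated failure, where a BS fails on its own, independently of the others, because of a software problem (e.g., misconfiguration or a malicious attack) or a hardware fault (e.g., power loss); this also represents BSs taken offline temporarily during routine maintenance. The second is a correlated regional failure, where BSs in an area fail because of a local event (e.g., a thunderstorm) or a large-scale natural disaster (e.g., an earthquake or flood). Backhaul link failures also belong to this category, since they typically affect several BSs across a whole area [59], [61].
We take the municipality of Enschede in the Netherlands as the example and run simulations at the municipality level, repeating each scenario 100 times independently for statistical significance. Enschede is a medium-sized municipality containing both urban and rural areas, so its results are reasonably representative; results for other municipalities are given in [6].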
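The two failure models can be sketched in a few lines of Python. This is a minimal illustration, not the simulator used in the study: the BS layout, the square region, and the function names `isolated_failures` and `correlated_failure` are assumptions made for the example, and the isolated model draws each BS failure independently with probability `p_iso`.

```python
import numpy as np

def isolated_failures(n_bs, p_iso, rng):
    """Boolean mask, True where a BS failed (independent draw per BS)."""
    return rng.random(n_bs) < p_iso

def correlated_failure(bs_xy, center_xy, r_fail):
    """Boolean mask, True for every BS within r_fail meters of the center."""
    dist = np.linalg.norm(bs_xy - center_xy, axis=1)
    return dist <= r_fail

rng = np.random.default_rng(seed=0)

# Hypothetical layout: 200 BSs placed uniformly in a 10 km x 10 km square.
bs_xy = rng.uniform(0.0, 10_000.0, size=(200, 2))
center = np.array([5_000.0, 5_000.0])  # centroid of the assumed square region

# 100 independent runs of the isolated-failure model, as in the text.
failed_share = [isolated_failures(len(bs_xy), 0.25, rng).mean() for _ in range(100)]
print("mean failed share, isolated:", np.mean(failed_share))

# One correlated regional failure around the centroid.
mask = correlated_failure(bs_xy, center, r_fail=2_500.0)
print("failed share, correlated:", mask.mean())
```

In a full evaluation, each mask would be passed to the coverage and capacity model to recompute FDP and FSP on the surviving BSs; that step is outside the scope of this sketch.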
Isolated failures: Fig. 8 shows each MNO's performance in Enschede in terms of FDP and FSP under isolated failures, where every BS fails independently with probability $p_{iso}$. Note that $p_{iso} = 0$ corresponds to a scenario without failures. Comparing against this baseline, and in line with earlier studies such as [61], we infer that individual failures do not have a significant impact on end users because of the inherent coverage redundancy in the network. Counterintuitively, however, Fig. 8a shows that a higher $p_{iso}$ can result in a lower FDP. For instance, for $p_{iso} = 0.25$, the FDP of MNO 1 and MNO 2 is lower than for $p_{iso} = 0$. A closer investigation shows that this is because interference decreases as the number of BSs decreases. For MNO 1, the difference between the received SINR and the SNR is 12 dB without failures, but only 9 dB for $p_{iso} = 0.25$. In other words, failing BSs reduce interference, which raises the average SINR and lowers the FDP. That said, interference management plays a key role in maintaining high signal quality and consequently high capacity. Our observations rest on the assumption that MNOs implement interference management schemes and that the three closest co-channel BSs do not interfere with each other. Under other assumptions or a more advanced frequency reuse scheme, these results could differ. As for FSP, Fig. 8b suggests that users experience more drastic service quality degradation if MNOs do not implement infrastructure sharing. Failures of up to 10% of the BSs do not affect the networks significantly, but for higher values of $p_{iso}$ the surviving BSs become overloaded (i.e., they have to serve more users), which degrades user satisfaction, as reflected in a lower FSP.
Note that the decrease in FSP can be mitigated by dynamic frequency allocation schemes, which re-allocate the frequency resources of the failed BSs to the active ones. Comparing the benefit of infrastructure sharing under normal operation ($p_{iso} = 0$) with its benefit under failures ($p_{iso} > 0$), and accounting for both FDP and FSP, we conclude that sharing leads to better performance under failures. The networks of MNO 1 and MNO 2 in particular are sufficiently redundant to still ensure coverage under failures. However, to also ensure satisfaction, network sharing is paramount.
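Fig. 8 shows the FDP and FSP of each MNO in Enschede when every BS fails independently with probability $p_{iso}$, where $p_{iso} = 0$ is the failure-free baseline. As earlier work [61] also found, isolated failures have little effect on end users because the network has redundant signal coverage.
Fig. 8a nevertheless shows a counterintuitive effect: in some cases a higher $p_{iso}$ gives a lower FDP. For example, at $p_{iso} = 0.25$ the FDP of MNO 1 and MNO 2 is lower than in the failure-free scenario. The main cause is reduced interference: without failures, the gap between the received SINR and the SNR for MNO 1 is 12 dB, whereas at $p_{iso} = 0.25$ it drops to 9 dB. Less interference raises the average SINR and thereby lowers the FDP.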
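To get a feel for what the 12 dB and 9 dB figures imply, the sketch below converts an SNR-to-SINR gap into an interference-to-noise ratio. It assumes the standard definitions SNR = S/N and SINR = S/(I+N) with the same signal and noise power in both, so that the gap in dB equals 10·log10(1 + I/N). The helper name `interference_to_noise_ratio` is hypothetical, and because the reported values are averages over users, the numbers are indicative only.

```python
import math

def interference_to_noise_ratio(gap_db):
    """Linear I/N implied by an SNR-minus-SINR gap given in dB.

    With SNR = S/N and SINR = S/(I+N), the ratio SNR/SINR equals 1 + I/N.
    """
    return 10 ** (gap_db / 10) - 1

no_failures = interference_to_noise_ratio(12.0)    # gap reported for p_iso = 0
with_failures = interference_to_noise_ratio(9.0)   # gap reported for p_iso = 0.25

print(round(no_failures, 1), round(with_failures, 1))  # 14.8 6.9
# Drop in interference power in dB, assuming the noise power is unchanged.
print(round(10 * math.log10(no_failures / with_failures), 1))  # 3.3
```

Under these assumptions, losing a quarter of the BSs roughly halves the aggregate interference power, which is consistent with the lower FDP observed for MNO 1.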
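Interference management remains essential for high-capacity communication. Our analysis assumes that MNOs apply interference management such that the three closest co-channel BSs do not interfere with each other. Under other interference control strategies or more sophisticated frequency reuse schemes, the results may differ.
For FSP, Fig. 8b shows that without infrastructure sharing, user satisfaction drops markedly as $p_{iso}$ grows. The effect is small for $p_{iso} \le 0.1$, but as the share of failed BSs rises further, the load on the surviving BSs increases and service quality degrades, which lowers the FSP. This trend can be mitigated by strategies such as dynamic spectrum allocation, which reassign the spectrum of failed BSs to active ones.
Taking FDP and FSP together and comparing infrastructure sharing under normal operation ($p_{iso} = 0$) with sharing under failures ($p_{iso} > 0$), sharing is more beneficial under failures. The networks of MNO 1 and MNO 2 in particular have enough coverage redundancy to stay connected even when some BSs fail, but maintaining user satisfaction still depends on network sharing.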
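As a rough illustration of why losing up to 10% of the BSs barely affects satisfaction while larger losses do, assume (hypothetically; this is not the model used in the study) that the $U$ users of an MNO spread evenly over its surviving BSs and that each of its $B$ BSs splits a fixed bandwidth $W$ equally among its users. The per-user bandwidth is then

$$
w(p_{iso}) = \frac{W}{U / \big((1 - p_{iso})\,B\big)} = (1 - p_{iso})\,\frac{W B}{U},
$$

so each user keeps 90% of the failure-free bandwidth at $p_{iso} = 0.1$ but only 50% at $p_{iso} = 0.5$. Real deployments and user distributions are non-uniform, so this indicates the trend only.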
Correlated failures: Fig. 9 plots the FDP and FSP for an increasing radius of correlated failures. In every case, we simulate a failure in the center of the municipality and let all BSs within $r_{fail}$ meters of the center fail. If $r_{fail} = 500$ m, on average 1% of the BSs in the region fail. For $r_{fail} = 1000$, 2500, and 5000 m, this is 9%, 38%, and 65% of the BSs, respectively.
Similar to the isolated failures in Fig. 8, Fig. 9 suggests that FDP and FSP are not affected when the failure region is small. For larger radii, however, national roaming no longer yields the highest FSP and lowest FDP (e.g., when $r_{fail} = 2500$ m): MNO 1 performs better in terms of FDP and MNO 2 in terms of FSP. We attribute this lower FSP to the higher number of users served by the surviving BSs. Under national roaming, three times as many users are in the center of the region (the city center) compared to no national roaming. Moreover, due to the non-uniform deployment of BSs, the number of BSs that fail is relatively large compared to a single-MNO scenario. Hence, a disaster in this region has more impact when all users share the network than when every MNO uses its own network, as BSs on the border of the disaster region become more congested.
Comparing Fig. 9b and Fig. 8b, correlated failures cause significantly lower satisfaction than isolated failures. For example, for $r_{fail} = 2500$ m, around 38% of the BSs fail, yet the FSP is lower than the FSP for $p_{iso} = 0.5$.
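Fig. 9 shows how regional failures of different sizes affect FDP and FSP when all BSs within radius $r_{fail}$ of the city center fail. For example, about 1% of the BSs fail on average at $r_{fail} = 500$ m; at $r_{fail} = 1000$, 2500, and 5000 m, the failed share is about 9%, 38%, and 65%, respectively.
As with the isolated failures in Fig. 8, Fig. 9 shows that correlated failures have little effect on FDP and FSP when the failure region is small. At larger radii, however, the advantage of national roaming reverses: at $r_{fail} = 2500$ m, for example, the FDP of MNO 1 is lower than the FDP under national roaming, and the FSP of MNO 2 is also better than under national roaming. The main reasons are:
- National roaming markedly raises user density in the center of the disaster area, where users of all three operators share the BSs
- BSs are deployed non-uniformly, so a relatively large share of the shared network's BSs lies inside the disaster area and fails
- Surviving BSs at the edge of the disaster area become severely congested as their user count surges
Comparing Fig. 9b with Fig. 8b shows that correlated regional failures reduce user satisfaction (FSP) more severely than isolated failures do. For example, at $r_{fail} = 2500$ m about 38% of the BSs fail, yet the FSP is clearly lower than at $p_{iso} = 0.5$.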
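The border-congestion argument can be explored with a toy experiment that compares the peak BS load after a correlated failure with the peak load after an isolated failure of the same size. This is only a sketch under strong assumptions that are not taken from the study: BSs and users are placed uniformly at random, each user attaches to the nearest surviving BS (ignoring SINR and capacity limits), and the name `loads_after_failure` is hypothetical.

```python
import numpy as np

def loads_after_failure(bs_xy, user_xy, failed):
    """Attach each user to the nearest surviving BS and count users per BS."""
    alive = np.flatnonzero(~failed)
    alive_xy = bs_xy[alive]
    # Pairwise user-to-surviving-BS distances, shape (n_users, n_alive).
    d = np.linalg.norm(user_xy[:, None, :] - alive_xy[None, :, :], axis=2)
    nearest = alive[d.argmin(axis=1)]
    return np.bincount(nearest, minlength=len(bs_xy))

rng = np.random.default_rng(1)
bs_xy = rng.uniform(0, 10_000, size=(100, 2))      # hypothetical BS sites (m)
user_xy = rng.uniform(0, 10_000, size=(5_000, 2))  # hypothetical users (m)

# Correlated failure: every BS within 2500 m of the region centroid fails.
center = np.array([5_000.0, 5_000.0])
corr = np.linalg.norm(bs_xy - center, axis=1) <= 2_500
# Isolated failure with the same number of failed BSs, chosen at random.
iso = np.zeros(len(bs_xy), dtype=bool)
iso[rng.choice(len(bs_xy), size=int(corr.sum()), replace=False)] = True

for name, failed in [("isolated", iso), ("correlated", corr)]:
    load = loads_after_failure(bs_xy, user_xy, failed)
    print(name, "max users on one BS:", load.max())
```

With a correlated failure, all users inside the failed disc fall back on the ring of BSs around it, whereas isolated failures spread the displaced users across the whole area. Concentrating users in the center, as in a real city, would be expected to strengthen this contrast.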
Takeaway: Our analysis shows that isolated failures do not significantly affect FDP, as the MNOs have enough redundancy in terms of BSs covering an area: even if 50% of the BSs in a region fail, most users still maintain the minimum signal level required for connectivity. In contrast, the FSP decreases drastically, as the surviving BSs serve more users, become more congested, and offer each user less effective bandwidth. Correlated failures have a more significant impact than isolated failures, i.e., a larger increase in FDP and decrease in FSP. In most cases, however, the numerical analysis shows that national roaming has the potential to improve resilience.
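Our analysis shows that isolated failures have a limited effect on FDP, because each operator has deployed enough redundant BSs in the region: even with 50% of the BSs failed, most users still receive the minimum signal needed to connect. FSP, however, drops noticeably, because overloaded surviving BSs leave each user with less bandwidth, which degrades service quality.
Correlated regional failures, by comparison, have a larger effect: FDP rises and FSP falls even more sharply. Even so, the simulation results indicate that in most cases national roaming can still improve network resilience, making it an effective means of strengthening the system against disasters.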