Copyright 1998 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE.

This article was published in the January/February 1998 issue of
IEEE Network.

ABSTRACT

 

A PCS network constantly tracks the locations of the mobile stations so that incoming calls can be delivered to the target mobile stations. In general, a two-level database system is used to store location information of the mobile stations. When the location databases fail, incoming calls may be lost. This article describes the standard GSM database failure restoration procedure which reduces the number of lost calls. Then we propose an efficient VLR identification algorithm for the HLR failure recovery procedure, which utilizes mobile station movement information to speed up the recovery procedure.

 

 

Improving the Fault Tolerance of GSM Networks

 

Ming-Feng Chang and Yi-Bing Lin, National Chiao Tung University
Shu-Chin Su, Computer and Communication Research Laboratories,
Industrial Technology Research Institute

 

A personal communications services (PCS) network tracks the locations of mobile stations (MSs) or mobile phones so that incoming calls can be delivered to subscribers. To exercise location tracking, a PCS service area is partitioned into several location areas (LAs). Each LA consists of a group of base stations that communicate with the MSs through radio contact. The major task of mobility management is to update the location of an MS when it moves from one LA to another. The location update procedure is referred to as registration, which is initiated by the MS as follows.
The base stations continuously broadcast the corresponding LA addresses to the MSs. When an MS receives a different LA address, it sends a registration message to the network. The location information is stored in the PCS mobility databases called the home location register (HLR) and the visitor location register (VLR). For every LA, there is a corresponding VLR. When an MS visits the LA, a temporary record of the MS is created in the VLR to indicate its location (i.e., the LA address). For every MS, there is a permanent record stored in the HLR. The record stores the address of the VLR visited by the MS. We use Global System for Mobile Communications (GSM) [1] as an example to illustrate the network architecture for PCS mobility management (Fig. 1). In this architecture, the base stations of an LA are connected to a mobile switching center (MSC), a telephone switch tailored for PCS services). Thus, an MSC covers several LAs. One or more MSCs are connected to a VLR, and exchange location information with the VLR through the Signaling System No. 7 (SS7) network [2]. Similarly, the VLR communicates with the HLR to exchange location information using SS7 messages. The GSM mobility management protocol, Mobile Application Part (MAP), is implemented using the SS7 platform.
If the location databases fail, the location information loss or corruption will seriously degrade the service offered to subscribers. Thus, fault tolerance of location databases becomes an important issue for PCS network management. This article describes the failure restoration procedures in GSM, and proposes an algorithm to speed up the HLR failure recovery procedure.

Mobility Databases

This section describes the information maintained in the mobility databases. The HLR is a database used for mobile user information management. All permanent subscriber data are stored in this database. An HLR record consists of three types of information.
Mobile station information includes the International Mobile Subscriber Identity (IMSI) used by the mobile station to access the network, and the Mobile Station ISDN Number (MSISDN), the ISDN number (i.e., "phone number") used by the caller to access the mobile station.
Location information includes the ISDN number (address) of a VLR, and the ISDN number (address) of an MSC. This information indicates the VLR and MSC where the MS resides. The HLR obtains this information from the VLR during the registration operation.
Service information includes service subscription, service restrictions, and supplementary services [3]. This information is provided by the subscriber.
The VLR is the database of the service area visited by an MS. The VLR contains all subscriber data of an MS required for call handling and other purposes. Similar to the HLR, the VLR information consists of three parts.
Mobile station information includes IMSI, MSISDN, and temporary mobile subscriber identity (TMSI), defined in [3]. In GSM, TMSI is used to avoid fraud usage. Usage of TMSI can be found in [5], and will be not be elaborated on in this article.
Location information includes the MSC number and LA identity (LAI). This information indicates the MSC and LA where the MS resides.
Service information includes a subset of the service information in the HLR. The VLR obtains this information from the HLR during the MS registration operation.
Note that in the mobility databases, the MS information and service information are seldom updated. On the other hand, the location information is modified every time the MS moves. Thus, after a database failure, the approaches to recovering these two types of information are different. We also note that both the HLR and VLR maintain the MSC information. This redundant information will be used in VLR failure restoration.

Failure Restoration

The details of the location update and the call delivery procedures can be found in [5]. These procedures utilize the location information in the HLR/VLR, and if the mobility databases fail, the system will not be able to track the MS. To guarantee fault tolerance of GSM mobility management, the failure restoration to the mobility databases is essential. This section describes the failure restoration procedures [6, 7] for both the VLR and HLR. Specifically, VLR failure recovery largely consists of asking the HLR for the relevant information on a need-to-know basis. HLR failures are more serious, and recovery is primarily based on restoration from a backup.

VLR Failure Restoration

After a failure, the service information of a VLR record is recovered by the first contact between the VLR and HLR (of the corresponding MS). The location information is recovered by the first radio contact between the VLR and MS. The MS information is recovered by the contact with either the HLR or the MS. The VLR record restoration is initiated by one of the following three events.
MS Registration -- Since the VLR record is erased after the failure, the VLR assumes that the registration is for inter-VLR movement (note that the MS may move between two LAs within an MSC, which is referred to as intra-MSC movement). Following the normal registration procedure [5], the VLR record is recovered.
MS Call Origination -- When the VLR receives the call origination request from the MSC [1], the VLR record for the MS is not found. The VLR considers the situation as a system error (the error cause is "Unidentified Subscriber"). The request is rejected, and the MS is asked to initiate the location registration procedure. After the registration, the VLR record is recovered.
MS Call Termination -- The call termination message flow is illustrated in Fig. 2.
Step 1 -- When theMSISDN is dialed by a user of the public switched telephone network (PSTN), the call is routed from the originating switch in the PSTN to a gateway MSC (or an ISDN exchange) by an SS7 ISDN User Part (ISUP) [2] initial address message (IAM). The IAM message reserves the voice trunk between the originating switch and the gateway MSC.
Step 2 -- To obtain the routing information (i.e., the actual location of the MS), the gateway MSC interrogates the HLR by sending a routing query message. The message consists of the MSISDN of the MS (i.e., the phone number of the MS) and other related information.
Step 3 -- The HLR sends a query message to the VLR to obtain the mobile station routing number (MSRN) which indicates the MSC and LAI of the MS. The message consists of the IMSI, MSC number, and other related information. Note that the MSC number was maintained in both the HLR and VLR.
The VLR searches the MS record by using the received IMSI. Since the record was erased after the VLR failure, the search fails. The VLR creates a VLR record for the MS (note that both the service and location information are not available in this newly created record). Then Steps 4 and 5 are executed in parallel.
Steps 4 and 7 -- Since the VLR does not have the routing information, it uses the MSC number provided by the HLR to create the MSRN. The number is sent back to the gateway MSC to set up the call in Step 8.
Steps 5 and 6 -- The VLR recovers the service information of the VLR record by exchanging a data restoration message pair with the HLR. At this point, the service information of the VLR record has been recovered. However, the location information (specifically, the LAI number) is still not available. This information will be recovered at Step 11 below.
Step 8 -- After the gateway MSC has received the MSRN in Step 7, an IAM message is sent to the target MSC to reserve the voice trunk between the gateway and target MSCs.
Steps 9 and 10 -- The target MSC does not have the LA information of the MS. In order to proceed with the call setup procedure, the MSC sends a query in Step 9 to the VLR to find out the LA of the MS. Unfortunately, the VLR does not have the LAI information due to the database failure. The VLR asks the MSC to search the LA of the MS in Step 10.
Steps 11, 12, and 13 -- The MSC initiates paging of the MS in all LAs (Step 11). If the paging operation is successful, the current LA address of the MS is sent back from the MSC to the VLR in Step 12. At this point the location information of the VLR record is recovered. The VLR acknowledges this restoration operation in Step 13.
Note that the LA searching operation (i.e., Step 11) is an expensive operation (every base station connected to the MSC must be paged). To avoid this "wide-area paging," the GSM system may ask the MSs to periodically reregister to the VLR. With periodic location updating, there is a better chance that the location information is recovered by the periodic location confirmation before the first call termination after the failure. Thus, expensive MS search operation is avoided. The selection of the frequency for location reregistration was studied in [9].

HLR Failure Restoration

For HLR, it is mandatory to save the updates into nonvolatile storage. Changes of service information are saved into the backup immediately after any update. The location information is periodically checkpointed into the backup. Note that service information update is infrequent (most subscribers never change the service profile after subscription), and the immediate backup update cost is acceptable.
After an HLR failure, the data in the backup are reloaded into the HLR. Data that have been changed in the period between the last backup checkpointing and restart of the HLR (this period is referred to as the uncovered period) cannot be recovered. Thus, the following HLR restoration procedure is executed.

HLR Restoration Procedure
Step 1 -- The HLR sends a database reset message to the VLRs where its MSs are located.
Step 2 -- The VLR derives all MSs of the HLR, and for each MS the VLR sends a registration message to the HLR. After the location update operation, the HLR record is recovered.

Note that the above HLR restoration procedure is not robust because during the uncovered period, an MS may move into a VLR unknown to the HLR at the last checkpointing time. If so, the HLR will not be able to locate the VLR of the MS at Step 1 of the HLR restoration procedure.
Let be the probability that no MS (of the HLR) is in a VLR at the last HLR checkpointing time before an HLR failure, and some MSs have moved in the VLR before the failure occurs. In other words, is the probability that the HLR fails to request update information from a VLR (which has updated location information during the uncovered period) in the restoration process. To compute it we make the following assumptions:

  • The MS arrival to the VLR is a Poisson process with the arrival rate .
  • The VLR residence time (the period an MS resides in a VLR) is a general distribution with mean 30 min.
  • The checkpointing period at the HLR is a constant TH.
  • The interval between two failures is much longer than TH.
Following the above assumptions, we conduct several simulation experiments. In the simulation model, we generate the uncovered periods and the user arrivals to the VLR during these uncovered periods. Then we check if these arrivals will eventually leave the VLR at the HLR failure. According to [9], the uncovered periods are uniformly distributed in (0, TH]. The user arrivals are a Poisson process (and the interarrival times are exponentially distributed). We consider various user residence time distributions such as exponential, uniform, and Gamma, with different variances. We found that the results are insensitive to user residence time distribution. Every simulation run consists of 2 million user arrivals to ensure that the simulation experiments are not affected by the initial and final effects. Based on the simulation experiments, Table 1 lists as a function of TH and where the mean VLR residence time is 30 minutes. The table indicates that when the call arrival rate is low, is high. The table also shows that increases as TH decreases because is determined by two factors:
  • The probability p1 that no users of the HLR are in the VLR at a checkpointing
  • The probability p2 that some users enter the VLR after the checkpointing, but do not leave at the failure
For a fixed user arrival rate () and average VLR residence time, both p1 and p2 decrease as TH increases. Thus, decreases as TH increases.
Another problem of the GSM HLR restoration procedure is that the location information of a VLR may not be changed during the uncovered period. However, the HLR still asks the VLR to resend the information, which results in unnecessary overhead. This situation may occur if all MSs entering a VLR during the uncovered period eventually leave the VLR before the HLR fails. Let ß be the probability that such a situation occurs. From the simulation experiments, Table 2 shows that ß is large for small . We also observe that ß increases as TH decreases because probability ß is determined by two conflicting factors:
  • The probability p3 that no user enters the VLR during the uncovered period. This probability increases as TH decreases.
  • The probability p4 that some users enter the VLR, and all these users leave the VLR at the HLR failure. If TH is short, the entering users are unlikely to leave at the HLR failure, or p4 decreases as TH decreases. We found that p3 has a more significant effect on p4 in the cases we study; thus, ß increases as TH decreases.
In summary, our study indicates that the standard GSM HLR restoration procedure will incur large network overhead if is small (which result in large and ß). Small to a VLR is typical if the VLR and HLR are in different countries. Thus, it is desirable to identify the exact VLRs that should be contacted by the HLR after an HLR failure. We will describe an algorithm to address this issue.

The VLR Identification Algorithm

We have proposed an algorithm [8] to identify a superset of the VLRs to be contacted by the HLR after a failure. In this article, we propose an improved algorithm called the VLR Identification Algorithm (VIA), which can identify the exact VLRs to be contacted by the HLR except for one case: due to the transient behavior of message sending, at the time of HLR failure, this set of VLRs may be a bit larger than the set of "true" VLRs to be contacted by the HLR. To simplify the description, we assume that every VLR covers exactly one MSC. Extension of our algorithm to accommodate multiple MSCs is trivial. The VIA consists of three procedures: Procedure 1 for checkpointing, Procedure 2 for registration, and Procedure 3 for restoration. Procedure 1 periodically saves the HLR records into the backup, and Procedure 3 restores the HLR records when an HLR failure occurs. Procedure 2 performs the standard GSM registration operations and keeps track of the VLRs that have been modified since the last checkpointing. (In Procedure 3, HLR will contact these VLRs to obtain correct user location information after an HLR failure.) Procedure 2 is based on the principle of counting where the HLR counts the number of new MSs that have entered a VLR since the last checkpointing. If such MSs exist, the VLR is marked in a list in the backup. Thus, at an HLR failure, the VLRs to be contacted by the HLR is in this list (which can be obtained in the backup). To implement the VIA, extra data structures are required in the HLR, as shown in Fig. 3. This figure only shows the fields of the HLR record required to exercise the VIA. In the backup, the extra data structure is a set VLR_List* of VLRs that have been modified during the uncovered period. After a HLR failure, the HLR only needs to send the reset messages to the VLRs in VLR_List*.
In the HLR, every record includes two extra fields, as illustrated in Fig. 3.
  • The ts field indicates the time of the latest location update. In some GSM implementations, this field already exists for other purposes.
  • The PVLR field contains the address of the VLR where the MS resided at the last checkpointing time. Thus, for any MS p we have

    HLR*[p].VLR = HLR[p], PVLR

Two extra data structures are introduced to the HLR:
  • TS is the last checkpointing (backup) time.
  • VLR_Counter is a set of (VLR,Count) pairs, where Count represents the "effective number" of MSs entering the VLR during the uncovered period. In Fig. 3, there are three effective MSs in VLR V1. Note that an MS is not effective to the VLR if it has entered the VLR area but eventually left the area during the uncovered period. Also note that the VLRs in VLR_Counter are the VLRs in VLR_List*.
The VIA works as follows. The HLR records are periodically saved into the backup by using the following checkpointing procedure.

VIA Procedure 1: Checkpointing

Step 1 -- for every location entry p in HLR* do
HLR[p]*.VLR HLR[p].VLR;
Step 2 -- TS current time;
Step 3 -- for every location entry p in HLR do
HLR[p].ts TS; HLR[p].PVLR HLR[p].VLR;
Step 4 -- VLR_Counter Ø, VLR_List* Ø;

In the checkpointing procedure, every location entry is saved into the backup (Step 1). The clock TS is set to the time of checkpointing (Step 2). The timestamp filed ts of every location entry in HLR is set to TS to indicate that the last location of the MS was updated no later than the latest checkpointing time TS (see Step 3). The PVLR is set to the current VLR address of the MS. Finally, both VLR_Counter and VLR_List* are set to empty to indicate that no VLR has a new roaming MS at TS(Step 4).
Suppose that MS p moves into a VLR area Vnew at time t. Following the standard GSM registration procedure, a registration message is sent from Vnew to the HLR. The following procedure at the HLR is triggered to perform the registration operation.

VIA Procedure 2: Registration

Step 1 -- (Update HLR)
Step 1.1 -- Vold HLR[p].VLR;
/* Vold is the last VLR visited by p*/
Step 1.2 -- HLR[p].VLR Vnew;
/* Vnew is the current VLR visited by p*/
Step 1.3 -- told HLR[p].ts;
/* told is the time when p entered Vold*/
Step 1.4 -- HLR[p].ts t;
/* t is the time when p entered Vnew*/
Step 1.5 -- Send a deregistration message to cancel the VLR entry of p at Vold;
Step 1.6 -- Send an acknowledgment to Vnew;
Step 2 -- (Update the VnewCount field in VLR_Counter)
if HLR[p].VLR HLR[p].PVLR then
/* Vnew is not the VLR visited by p at the last checkpointing; thus, p is a "new" visitor to the VLR after the last checkpointing */
Step 2.1 -- if VLR_Count[Vnew] exists then
VLR_Counter[Vnew].Count VLR_Counter[Vnew].Count + 1;
Step 2.2 -- else create VLR_Counter[Vnew] and
VLR_List*[Vnew];
VLR_Counter[Vnew] 1;
/* Steps 2 increments the number of MSs that entered Vnew after the last checkpointing
*/
Step 3 -- (Update the Vold counter entry)
if told > TS and Vold != HLR[p].PVLR then
/*p enters Vold in the uncovered period and Vold is not the VLR visited by p at the last checkpointing */
Step 3.1 -- VLR_Counter[Vold].Count VLR_Counter[Vold].Count – 1;
Step 3.2 -- if VLR_Counter[Vold].Count = 0 then
/* No effective MS is in Vold*/
Step 3.2.1 -- delete VLR_Counter[Vold] and VLR_List*[Vold];
/* Step 3 decrements the number of MSs entered Vold after the last checkpointing */

In the registration procedure, the location information of the MS is updated (Step 1.2), its location record at the old VLR Vold is canceled (Step 1.5), and Vnew is acknowledged (Step 1.6). The last update time told is saved to be used in Step 3.
At Steps 2 and 3 of Procedure 2, VLR_Counter[]is used to count the "effective" number of MSs that entered the VLRs during the period [TS, t]. Note that if the MS was in Vnew before TS (i.e., HLR[p].VLR = HLR*[p].VLR = HLR[p].PVLR), then the HLR may consider that the MS never moves out of the VLR, and there is no need to increment the VLR counter (and Step 2 is skipped).
If the MS entered Vold in the uncovered period (i.e., tsold > TS) and Vold is not the VLR visited by the MS at the last checkpointing (i.e., Vold HLR[p].PVLR), then it implies that the movement into Vold is not effective because the MS has moved out of Vold at t. Thus, the Vold counter should be decremented by 1 as described in Step 3. If Vold is the VLR visited by the MS at the last checkpointing, the MS is never considered an effective MS (see Step 2). In this case, there is no need to decrement the Vold counter when the MS moves out of Vold.
If VLR_Counter[V].Count > 1, any update to VLR_Counter[V] will not invoke modification to VLR_List*[]. In other words, access to the HLR backup is avoided. The purpose of Procedure 2 is to avoid updating the backup for every registration operation.
After an HLR failure, Procedure 3 is executed to restore the HLR. In this procedure, the HLR restores the location entries from the backup and requests current status of the MSs from all VLRs that have updated the MS information between the last checkpointing time and the HLR failure time. Note that after the execution of VIA Procedure 2, VLR_List* in the backup contains the VLRs the HLR should contact after an HLR failure. Due to the transient behavior of message sending, at the time of HLR failure, this set of VLRs may be a bit larger than the set of "true" VLRs to be contacted by the HLR.

VIA Procedure 3: Restoration

Step 1 -- TS current time;
Step 2 -- for every location entry p in do
/* The HLR recovers the MS records from the backup */
HLR[p].PVLR = HLR[p].VLR HLR[p]*.VLR;
HLR[p].ts TS;
Step 3 -- for every VLR entry V in VLR_List* do
/* The HLR initiates the standard GSM HLR failure restoration procedure */
Send a reset message to V;

Some observations of the VIA are given below:

  • The VIA does not modify the standard GSM MAP protocol. In other words, we do not change the standard message flow of the GSM MAP, and no new GSM MAP message type is created. The VLRs are not modified. We only modify the HLR internal macros (e.g., Steps 2 and 3 of Procedure 2) for the VIA operations.
  • In the VIA, the backup (specifically VLR_List*) is only modified at Steps 2.2 and 3.2.1 of Procedure 2. Let N be the average number of registrations and deregistrations before a backup operation (i.e., Step 2.2 or 3.2.1) is executed in the VIA. The larger the N value, the smaller the HLR backup cost. Table 3 indicates that the backup access cost of the VIA is low. For example, the backup operation is executed every 54 location update operations for TH = 5 hr, and = 6/hr. Thus, the execution of the VIA is expected to be efficient.
  • The VIA is especially useful if the GSM system accommodates international roaming. VLRs in foreign countries are seldom accessed, and ß and can be large, as illustrated in Tables 1 and 2. In Taiwan, a typical GSM operator has roaming agreements with over 20 foreign GSM networks, and may potentially access over 100 VLRs. If the HLR has to contact all these VLRs after a failure, the network cost (specifically message exchanges across countries) can be huge. Thus, implementing the VIA in the HLR is desirable.

Conclusion

This article presented failure restoration procedures in GSM. In general, VLR failure recovery largely consists of asking the HLR for the relevant information on a need-to-know basis. HLR failures are more serious, and recovery is primarily based on restoration from a backup. The standard GSM HLR failure restoration procedure is not robust because the HLR may not contact the right VLRs to access the updated location information. To identify the VLRs to be contacted by the HLR after its failure, we propose an efficient VLR identification algorithm. The algorithm maintains timestamped HLR records and keeps counting the effective number of MSs entering each VLR since the last HLR checkpointing. As a result, at an HLR failure recovery the VLRs that need to be contacted for MS location update can easily be identified to speed the failure restoration procedure.

Acknowledgments

Thanks to the three anonymous reviewers, the quality of this article has been significantly improved. Yi-Bing Lin's work was supported in part by National Science Council, Contract No. NSCNSC87-2213-E009-014. Shu-Chin Su's work was supported in part by the Wireless Communication Program, Ministry of Economics Affairs.

References
[1] ETSI/TC Rec. GSM 09.02, "Mobile Application Part (MAP) Specification, Version 4.8.0," 1994.
[2] Y.-B. Lin and S. K. DeVries, "PCS Network Signaling Using SS7," IEEE Pers. Commun., June 1995, pp. 44–55; see also http://liny.csie.nctu.edu.tw.
[3] ETSI/TC Rec. GSM 03.03, "Numbering, Addressing and identification," 1993.
[4] ETSI/TC, "Organisation of Subscriber Data," Rec. GSM 03.08, 1993.
[5] Y.-J. Cho, Y.-B. Lin, and C.-H. Rao, "Reducing the Network Cost of Call Delivery to GSM Roamers," IEEE Network, vol. 11, no. 5, Sept./Oct. 1997, pp. 19–25.
[6] ETSI/TC Rec. GSM 03.07, "Restoration Procedures, Version 4.2.0," 1993.
[7] ITU-T Rec. Q.1004, "Location Register Restoration Procedures," 1993.
[8] Y.-B. Lin, "Failure Restoration of Mobility Databases for Personal Communication Networks," ACM-Baltzer J. Wireless Networks, vol. 1, 1995, pp. 365–72.
[9] A. R. Noerpel, L. F. Chang, and Y.-B. Lin, "Performance Modeling of Polling De-registration for Unlicensed PCS," IEEE JSAC, vol. 14, no. 4, 1996.

Biographies
Ming-Feng Chang received B.S. and M.S. degrees in electrical engineering from National Taiwan University in 1982 and 1984, respectively, and a Ph.D. degree in computer science from the University of Illinois at Urbana-Champaign in 1991. He is currently an associate professor in the Department of Computer Science and Information Engineering, National Chiao Tung University, Taiwan, Republic of China. His research interests include VLSI system design, personal communications systems, and mobile computing.
Yi-Bing Lin [SM] received his B.S.E.E. degree from National Cheng Kung University in 1983, and his Ph.D. degree in computer science from the University of Washington in 1990. Between 1990 and 1995 he was with the Applied Research Area at Bell Communications Research (Bellcore), Morristown, New Jersey. In 1995, he was appointed full professor of the Department of Computer Science and Information Engineering (CSIE), National Chiao Tung University. He is now chair of CSIE. His current research interests include design and analysis of personal communications services networks, mobile computing, distributed simulation, and performance modeling.
Shu-Chin Su received an M.S. degree in computer science from Stevens Institute of Technology in 1982. She is currently an executive assistant to the general director, Computer Communication Research Laboratories, Industrial Technology Research Institute. She is also pursuing a Ph.D. degree from Sun Yat-Sen University. Her research interests include personal communications systems, distributed systems, and database management systems.