Deep Reinforcement Learning-Based Optimization for End-to-End Network Slicing With Control- and User-Plane Separation

Control- and user-plane separation (CUPS) and network slicing are two key technologies to support increasing network traffic and diverse wireless services. However, the benefit of CUPS in decoupling the network coverage and data service functions has not been fully utilized to facilitate network slicing. In this paper, we present a novel CUPS-based end-to-end (CUPS-E2E) network slicing scheme. First, the base stations (BS) are classified into control BSs (CBS) that provide control plane (CP) coverage and traffic BSs (TBS) that deliver user plane (UP) traffic. Next, upon CBSs and TBSs being virtualized, we define four typical end-to-end (E2E) network slices: one for CP coverage, one for high-throughput services, one for computation-intensive services, and the other for delay-sensitive services. The utilities of the four E2E network slices are defined based on their coverage, throughput, computing capability and delay requirements, respectively. Then, a deep deterministic policy gradient (DDPG)-based algorithm is employed to maximize the long-term sum-utility of the four E2E network slices by jointly optimizing the allocation of communication and computing resources to the four network slices and the activation of virtual TBSs, while meeting the service requirements of all users. Simulation results show that our proposed CUPS-E2E network slicing scheme in conjunction with a DDPG-based sum-utility maximization algorithm can support the CP wide-coverage and massive access requirements as well as the UP high-throughput, computation-intensive and delay-sensitive services simultaneously, and outperforms the existing E2E network slicing schemes in terms of the sum-utility, coverage percentage, throughput and delay.


I. INTRODUCTION
As a key technology for next-generation mobile networks to support diverse application scenarios and provide customized services, network slicing defines multiple logical end-to-end (E2E) networks (each including the core network (CN) and radio access network (RAN)) on a shared physical network infrastructure [1]. Each logical network contains a set of virtual network functions (VNF) and resources.
Meanwhile, the drawbacks of traditional mobile networks due to the tight coupling between network coverage and data service have become increasingly evident with the exponential growth of wireless traffic, which has called for control- and user-plane separation (CUPS) [2]. CUPS decouples the control plane (CP) and the user plane (UP), where the control base stations (CBS) in the CP transmit control signals to users using low frequency bands, while the traffic base stations (TBS) in the UP transmit data signals using high frequency bands. As a result, CUPS can alleviate the interference between control signals and data signals [3] and can reduce the energy consumption of the network by switching off TBSs with no or few active users in coverage [4].

A. Related Works
VNF deployment and resource allocation for network slicing have recently attracted considerable research interest. Sattar et al. [5] investigated the placement of CP functions and UP functions (UPF) for CN slicing, and proposed a mixed-integer linear programming-based algorithm to minimize the E2E delay while maintaining intra-slice isolation among VNFs of the same CN slice. Guan et al. [6] used complex network theory to optimize the VNF deployment for Ultra Reliable Low Latency Communication (URLLC), Enhanced Mobile Broadband (eMBB), and massive Machine Type Communication (mMTC) E2E network slices. 0018-9545 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
Regarding resource allocation for network slicing, Sun et al. [7] proposed a Stackelberg game-based radio resource allocation scheme to minimize the content download latency for the low-delay slice while guaranteeing the data rate for the high-rate slice with minimal transmit power consumption. Zhou et al. [8] maximized a time-averaged utility by jointly optimizing the allocation of virtual bandwidth and power for URLLC and eMBB RAN slices via Lyapunov optimization. An Upper-tier First with Latency-bounded Over-provisioning Prevention (UFLOP) algorithm was proposed in [9] to optimize the allocation of computing and communication resources for the eMBB, URLLC and mMTC E2E network slices in a multi-tenant 5G network, while satisfying the latency constraints and Service-level Agreements (SLA) of multiple tenants.
CUPS has been studied for mobility management, e.g., in a two-tier downlink ultra-dense network (UDN) [10], where the small base stations (SBS) in the UP send data packets to nearby user devices, and the macro base stations (MBS) in the CP manage the radio resource control. Sun et al. [11] proposed a probability suffix tree (PST)-based predictive mobility management algorithm for the urban UDN with CUPS. Yang et al. [12] have shown that the CUPS architecture outperforms the conventional network architecture in terms of the coverage probability and the handover cost in UDNs. Liang et al. [13] employed stochastic geometry and queuing theory to analyze the coverage probability of the CP assuming non-line-of-sight (NLoS) transmissions and the energy efficiency of the UP under the co-existence of both NLoS and line-of-sight (LoS) transmissions.
Although the feasibility of combining CUPS with RAN slicing was demonstrated through testbed experiments for three types of RAN slices (i.e., continuous CP coverage, UP video services and UP audio services) [14], the benefit of CUPS in decoupling the CP coverage and UP traffic has not been sufficiently studied or exploited to achieve customized E2E network slicing. Moreover, [10]-[14] considered the CP and the UP separately. Although the influence of the CP coverage on the spectrum efficiency and the energy efficiency of the UP for the CUPS-based cellular RAN was studied in [15], the relevance between the CP and the UP in utility modeling for E2E network slicing has not been fully researched.
Nowadays, the resource allocation in mobile networks involves communication, storage and computing resources [16], and the resource allocation problems become increasingly complex. Traditional optimization algorithms show limitations in solving high-complexity resource allocation problems. Fortunately, deep reinforcement learning (DRL), where an agent interacts with the environment constantly to learn the best action that maximizes a cumulative reward under a given state through trial and error [17], has been employed to solve complex resource allocation problems. Popular DRL algorithms include deep Q-learning (DQL) [18], policy gradient (PG), actor-critic (AC), and deep deterministic policy gradient (DDPG) [19]. Therein, the DQL algorithm was mainly used to solve resource allocation problems with discrete action spaces [20]-[22]. For resource allocation problems with continuous action spaces, the DDPG algorithm was used, e.g., to solve the joint optimization of continuous energy harvesting (EH) time and transmit power allocation for maximizing the long-term throughput of EH-assisted communications [23]. Jiang et al. [24] used the DDPG algorithm to maximize the number of devices that successfully access cellular Internet of Things networks by dynamically optimizing the access class barring factor. However, it is important to note that the relevance between the CP and the UP under CUPS will lead to new challenges in the utility modeling and DRL-based resource allocation for E2E network slicing.

B. Contributions
In this paper, we aim to solve the CUPS-based E2E (CUPS-E2E) network slicing optimization of four typical E2E network slices: one typical mMTC slice for CP wide-coverage, two typical eMBB slices for UP high-throughput services (such as panorama virtual reality [25]) and UP computation-intensive services (such as augmented reality [26]), respectively, and one typical URLLC slice for UP delay-sensitive services (such as immersive virtual reality [27] and industry automation [28]). We define the utilities of the four E2E network slices based on their coverage percentage, throughput, computing capability and delay requirements, respectively, while taking the relevance between the CP coverage and the UP services into consideration. Therein, the coverage percentage is defined as the instantaneous fraction of all users covered by the CP at the corresponding time. Then, we propose to maximize the long-term sum-utility of the four E2E network slices by jointly optimizing the virtual TBS (vTBS) activation and the allocation of communication and computing resources while meeting the service requirements of users, which is formulated into a non-convex optimization problem. It is important to note that the joint optimization of vTBS activation and resource allocation must be solved in a timely manner to track the dynamic user requirements and wireless channel states, which can hardly be achieved by traditional convex optimization methods. We solve the sum-utility maximization problem with a high-dimensional action space by devising a DRL-based algorithm, in which the agent learns the mapping function between the users' real-time requests as well as wireless channel states and the action. The main contributions of this work are summarized as follows:
• We study a CUPS-based system that decouples the CP coverage and the UP services in E2E network slicing. The CUPS-E2E network slicing overcomes the restrictions caused by the tight coupling between the CP and the UP of traditional networks and allows for the customization of CP coverage and UP services for each E2E network slice. More specifically, the CP coverage and the UP services are supported by dedicated virtual CBSs (vCBS) (operating in low frequency bands) and vTBSs (operating in high frequency bands), respectively. The wide-coverage slice builds on the virtualized CP only, whereas the high-throughput, computation-intensive and delay-sensitive slices involve both the virtualized UP and CP, where the activation of vTBSs in the UP is determined by the CP.
• In order to better characterize the heterogeneous performance metrics of the E2E network slices, we define the utilities of the four slices in terms of their gains in coverage percentage, throughput, computing capability and delay minus their corresponding costs, respectively. The costs for each slice include the backhaul link capacity and the downlink bandwidth. In addition, the costs of the wide-coverage slice include the vCBS transmit power consumption, while the costs of the high-throughput, computation-intensive and delay-sensitive slices include both the vTBS transmit power and vTBS computing power consumptions. These slice-specific utilities enable the joint optimization of communication and computing resource allocation, as well as a comprehensive performance evaluation, for each E2E network slice.
• In our proposed CUPS-E2E network slicing model, if a user is not covered by the CP, then it will not be served by the UP, i.e., it will not be allocated any UP communication or computing resource. This dependence of UP services on the CP coverage is embodied in our utility model. Accordingly, the utilities of the high-throughput, computation-intensive, and delay-sensitive slices contain only the throughput, computing capability, and delay of the users covered by the CP, respectively.
• The CUPS-E2E network slicing optimization is formulated into a problem that maximizes the long-term sum-utility of the four E2E network slices by jointly optimizing the vTBS activation and the allocation of communication and computing resources while meeting the service requirements of all users. The formulated problem is non-convex and hard to solve using traditional optimization methods. We design a DDPG-based sum-utility maximization algorithm to find the optimal vTBS activation and resource allocation policy in the continuous action space. More specifically, we obtain the optimal vTBS activation and the optimal allocation of communication and computing resources to the four E2E network slices, including the CP subcarrier and vCBS transmit power for the wide-coverage slice, as well as the backhaul link capacity, UP subcarrier, vTBS transmit power and vTBS computing capability for the high-throughput, computation-intensive and delay-sensitive slices.
The remainder of this paper is organized as follows. The CUPS-based system model is proposed in Section II. Section III formulates the sum-utility maximization problem. The DDPG-based sum-utility maximization algorithm is presented in Section IV. In Section V, simulation results are presented. Finally, the paper is concluded in Section VI.

II. SYSTEM MODEL
This section presents a CUPS-based system, where the E2E network model contains the CN and the RAN. The CN mainly consists of the Access and Mobility Management Function (AMF), Session Management Function (SMF) and UPF, etc. Therein, the AMF is responsible for the access and mobility of users, while the SMF and UPF are in charge of the UP services [29].
In the RAN, we assume that there are $N_C$ physical CBSs and $N_T$ physical TBSs distributed within a sufficiently large area $S_{pl}$, and the symbols $\mathcal{N}_C$ and $\mathcal{N}_T$ are used to denote the sets of the physical CBSs and TBSs, respectively. The CBSs send control signals at the low frequency bands to provide the CP coverage for users and control the switching on-off for the TBSs, whereas the TBSs process and send UP data at the high frequency bands to provide the UP services for users. Moreover, the densities of the physical CBSs and TBSs are $\lambda_C = N_C/S_{pl}$ and $\lambda_T = N_T/S_{pl}$, respectively.
There are $N_u$ users distributed in the given area with the density of $\lambda_u = N_u/S_{pl}$, and the symbol $\mathcal{N}_u$ is used to denote the set of users. If a user has a data service request, then the user is deemed as an active user. All users will receive control signals from their serving CBSs, but only the active users can receive data services from their serving TBSs. The physical network model is shown in Fig. 1.
The network operator provides one wide-coverage slice and three data service slices for users' high-throughput, computation-intensive and delay-sensitive requests, respectively. Firstly, the CN and the backhaul links that connect the CN, the CBSs and the TBSs will be abstracted and virtualized into several virtual CNs (vCN) and virtual backhaul links, respectively. Next, each physical TBS is mapped to several vTBSs by virtualization, and each vTBS will be related to one specific UP service. Each physical CBS is mapped to four vCBSs by virtualization, wherein one vCBS will be associated with the CP coverage, while the other three will control the activation of vTBSs. Accordingly, the CP is mapped to four virtualized CPs, which are deployed in the four E2E network slices, respectively; the UP is mapped to three virtualized UPs, which are deployed in the high-throughput, computation-intensive and delay-sensitive slices, respectively. Meanwhile, the communication and computing resources in the system are abstracted into virtual resources to facilitate resource sharing. The network slicing model is shown in Fig. 2. The t-th time slot has the time interval of [t, t + 1), where t ∈ {0, 1, 2, . . ., T −1} and the duration of each slot is one second.

A. Network Coverage
The network coverage is supported by the wide-coverage slice (i.e., slice C), which involves only the virtualized CP. The coverage control signals are transmitted from the AMF of the vCN to the vCBSs associated with the CP coverage, and then the vCBSs send the control signals to all users [30].
Each physical CBS is connected with the CN by a backhaul link with the capacity of $Rb^C_B$, and the maximum transmit power of each physical CBS is $P^C_B$. The total downlink bandwidth of the CP is $B^C$, which is evenly divided into $N^C_{sub}$ subcarriers forming the set of CP subcarriers $\mathcal{N}^C_{sub} = \{1, 2, \ldots, N^C_{sub}\}$, and each CP subcarrier has a bandwidth of $B^C/N^C_{sub}$. We define $y^C(t) = \{y^C_{k,l}(t)\}$ ($k \in \mathcal{N}_u$, $l \in \mathcal{N}^C_{sub}$) as the set of binary CP subcarrier allocation indicators, where $y^C_{k,l}(t) = 1$ if CP subcarrier $l$ is allocated to user $k$ at slot $t$ and $y^C_{k,l}(t) = 0$ otherwise. We assume that the transmission of the control signal for each user occupies a small backhaul link capacity of $Rb^C$ and only one CP subcarrier because its traffic is low. Then, we denote the vCBS transmit power allocated to user $k$ at slot $t$ by $P^C_k(t)$, and the set of vCBS transmit power levels at slot $t$ is given by $P^C(t) = \{P^C_k(t)\}$ ($k \in \mathcal{N}_u$). The distance between CBS $j$ ($j \in \mathcal{N}_C$) and user $k$ ($k \in \mathcal{N}_u$) is denoted by $r^C_{j,k}$. Since the number of vCBSs in slice C equals the number of physical CBSs and does not change over time, we still denote the set of vCBSs in slice C by $\mathcal{N}_C$. Each user is associated with its nearest vCBS, and a binary vCBS association indicator $g^C_{j,k}$ ($j \in \mathcal{N}_C$, $k \in \mathcal{N}_u$) is defined such that $g^C_{j,k} = 1$ if user $k$ is associated with vCBS $j$, namely $j = \arg\min_{1 \le j \le N_C} r^C_{j,k}$, and $g^C_{j,k} = 0$ otherwise. We assume that the downlink channels of CBSs exhibit independent Rayleigh fading and have a path loss exponent of $\alpha_C$. Therefore, the downlink SINR between user $k$ ($k \in \mathcal{N}_u$) and its associated vCBS $j$ ($j \in \mathcal{N}_C$) on CP subcarrier $l$ ($l \in \mathcal{N}^C_{sub}$) at slot $t$ is
$$\mathrm{SINR}^C_{j,k,l}(t) = \frac{g^C_{j,k}\, y^C_{k,l}(t)\, P^C_k(t)\, h^C_{j,k,l}(t)\, (r^C_{j,k})^{-\alpha_C}}{\sum_{i \in \mathcal{N}_C \setminus \{j\}} \sum_{w \in \mathcal{N}_u \setminus \{k\}} g^C_{i,w}\, y^C_{w,l}(t)\, P^C_w(t)\, h^C_{i,k,l}(t)\, (r^C_{i,k})^{-\alpha_C} + \sigma^2},$$
where $r^C_{j,k}$ and $r^C_{i,k}$ denote the transmission distances between user $k$ and its associated vCBS or interfering vCBSs, respectively, and $\sigma^2$ denotes the power of the additive white noise.
Here, $h^C_{j,k,l}(t)$ and $h^C_{i,k,l}(t)$ are the channel power gains of the desired transmission path and interference path at slot $t$, respectively, following two independent unit-mean exponential distributions. The sets of channel power gains of the desired transmission paths and interference paths at slot $t$ are $\{h^C_{j,k,l}(t)\}$ and $\{h^C_{i,k,l}(t)\}$, respectively. Besides, $y^C_{k,l}(t)$ and $y^C_{w,l}(t)$ represent that the desired signal and interference signals are transmitted on the common CP subcarrier $l$ at slot $t$, while $g^C_{j,k}$ and $g^C_{i,w}$ are the binary vCBS association indicators of user $k$ and the interfering users, respectively.
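The per-subcarrier SINR described above can be illustrated with a minimal sketch. It assumes that interference comes only from users served by other vCBSs on the same CP subcarrier; the function name `cp_sinr` and the nested-list data layout are hypothetical and chosen only for illustration.

```python
def cp_sinr(k, l, serving, y, power, gain, dist, alpha, noise):
    """Downlink SINR of user k on CP subcarrier l (illustrative sketch).

    serving[w]    : index of the vCBS serving user w (nearest-vCBS association)
    y[w][l]       : 1 if CP subcarrier l is allocated to user w, else 0
    power[w]      : vCBS transmit power allocated to user w
    gain[i][k][l] : channel power gain from vCBS i to user k on subcarrier l
    dist[i][k]    : distance between vCBS i and user k
    """
    j = serving[k]
    signal = y[k][l] * power[k] * gain[j][k][l] * dist[j][k] ** (-alpha)
    interference = 0.0
    for w in range(len(serving)):
        i = serving[w]
        # Assumption: only users of other vCBSs on the same subcarrier interfere.
        if w == k or i == j:
            continue
        interference += y[w][l] * power[w] * gain[i][k][l] * dist[i][k] ** (-alpha)
    return signal / (interference + noise)


# Two vCBSs, two users, one shared subcarrier, unit channel gains.
serving = [0, 1]
y = [[1], [1]]
power = [2.0, 1.0]
gain = [[[1.0], [1.0]], [[1.0], [1.0]]]
dist = [[1.0, 2.0], [2.0, 1.0]]
sinr_user0 = cp_sinr(0, 0, serving, y, power, gain, dist, alpha=2.0, noise=0.25)
```

In this toy setting the desired power is 2, the single interferer contributes 0.25, and the noise power is 0.25, so the SINR of user 0 is 4.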
Since only one CP subcarrier is allocated to a user, the downlink SINR between user $k$ and its associated vCBS $j$ at slot $t$ is given by
$$\mathrm{SINR}^C_k(t) = \sum_{l \in \mathcal{N}^C_{sub}} y^C_{k,l}(t)\, \mathrm{SINR}^C_{j,k,l}(t),$$
where the binary CP subcarrier allocation indicator $y^C_{k,l}(t)$ has to satisfy $\sum_{l \in \mathcal{N}^C_{sub}} y^C_{k,l}(t) = 1$. For any user, if its downlink SINR is above a pre-defined threshold $x_0$, the user is covered by the CP. Then, we define a binary coverage indicator $d^C_k(x_0, t)$ ($k \in \mathcal{N}_u$) to represent whether user $k$ is covered by the CP at slot $t$ or not, i.e.,
$$d^C_k(x_0, t) = \begin{cases} 1, & \mathrm{SINR}^C_k(t) > x_0, \\ 0, & \text{otherwise}. \end{cases}$$
Accordingly, the number of users covered by the CP at slot $t$ is given by $\sum_{k \in \mathcal{N}_u} d^C_k(x_0, t)$, and the coverage percentage of slice C at slot $t$ is
$$p^{cov}_C(x_0, t) = \frac{1}{N_u} \sum_{k \in \mathcal{N}_u} d^C_k(x_0, t).$$
Next, we will analyze the costs of slice C, including the backhaul link capacity, downlink bandwidth, and vCBS transmit power consumption.
Since each user occupies a backhaul link capacity of $Rb^C$ for its control signal, the backhaul link capacity of slice C is given by $N_u Rb^C$. For the downlink bandwidth of slice C, let us analyze it from the perspective of the CP subcarrier allocation. Firstly, any CP subcarrier $l$ will be idle if it is not allocated to any user at slot $t$, i.e., the CP subcarrier allocation indicator satisfies $y^C_{k,l}(t) = 0$ and $1 - y^C_{k,l}(t) = 1$ for any user $k$. In this case, the cumulative product of $1 - y^C_{k,l}(t)$ satisfies
$$\prod_{k \in \mathcal{N}_u} [1 - y^C_{k,l}(t)] = 1.$$
However, CP subcarrier $l$ will be occupied if it is allocated to at least one user at slot $t$, i.e., there exists a user $k$ ($k \in \mathcal{N}_u$) such that $y^C_{k,l}(t) = 1$ and $1 - y^C_{k,l}(t) = 0$. In this case, the cumulative product of $1 - y^C_{k,l}(t)$ satisfies
$$\prod_{k \in \mathcal{N}_u} [1 - y^C_{k,l}(t)] = 0.$$
From the above analysis, we find that the expression $\prod_{k \in \mathcal{N}_u} [1 - y^C_{k,l}(t)]$ can be used to indicate whether CP subcarrier $l$ is idle at slot $t$ or not, and the number of the idle CP subcarriers is $\sum_{l \in \mathcal{N}^C_{sub}} \prod_{k \in \mathcal{N}_u} [1 - y^C_{k,l}(t)]$. Therefore, the number of the occupied CP subcarriers can be calculated as the total number of the CP subcarriers $N^C_{sub}$ minus the number of the idle CP subcarriers, and the downlink bandwidth of slice C at slot $t$ is the total bandwidth of the occupied CP subcarriers, i.e.,
$$\frac{B^C}{N^C_{sub}} \Big( N^C_{sub} - \sum_{l \in \mathcal{N}^C_{sub}} \prod_{k \in \mathcal{N}_u} [1 - y^C_{k,l}(t)] \Big).$$
The total vCBS transmit power consumption of slice C at slot $t$ is expressed as $\sum_{k \in \mathcal{N}_u} P^C_k(t)$.
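The coverage indicator and the idle-subcarrier product used above can be checked numerically with a short sketch. The function names `coverage_percentage` and `occupied_bandwidth`, and the list-of-lists layout of the allocation indicators, are hypothetical choices made for illustration only.

```python
from math import prod


def coverage_percentage(sinr, x0):
    """Return per-user coverage indicators and the covered fraction of users."""
    covered = [1 if s > x0 else 0 for s in sinr]
    return covered, sum(covered) / len(sinr)


def occupied_bandwidth(y, total_bw):
    """Bandwidth of the occupied subcarriers.

    y[k][l] is 1 if subcarrier l is allocated to user k. A subcarrier is idle
    exactly when the product of (1 - y[k][l]) over all users equals 1.
    """
    n_sub = len(y[0])
    idle = sum(prod(1 - y[k][l] for k in range(len(y))) for l in range(n_sub))
    return total_bw / n_sub * (n_sub - idle)


covered, p_cov = coverage_percentage([3.0, 0.5, 1.2], x0=1.0)
# Three users, four subcarriers; users 0 and 2 reuse subcarrier 0 under different vCBSs.
y = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0]]
bw = occupied_bandwidth(y, total_bw=8.0)
```

Here two of the three users exceed the threshold, and two of the four subcarriers are occupied, so half of the total bandwidth is counted as consumed.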

B. Data Service
The high-throughput slice (i.e., slice 1), the computation-intensive slice (i.e., slice 2) and the delay-sensitive slice (i.e., slice 3) provide data services for active users with different service requirements if the users are covered by the CP [31].
Taking video services as an example, the whole procedure of data services in slice s (s ∈ {1, 2, 3}) includes the backhaul link transmission, video processing, and wireless access link transmission. If the active users have video requests, the requested videos will be first delivered from the cloud to the vTBSs through the virtual backhaul links. Then, the videos will be processed and rendered by the vTBSs, which consumes computing resources. After the video processing, the vTBSs will deliver the videos to the corresponding users who request them over the wireless access links [25].
In the CN, the UPF is in charge of forwarding the video data requested by the active users from the data center in the cloud to the vTBSs over the virtual backhaul links, and the SMF is responsible for managing and controlling the data forwarding of the UPF [29].
The RAN of slice s (s ∈ {1, 2, 3}) involves both the virtualized UP and CP, and the vCBSs in the virtualized CP control the activation of the vTBSs in the virtualized UP according to the users' requesting states. In this process, the vCBSs send control signals to the vTBSs to switch them on or off [2]. The communication between the vCBSs and the vTBSs is through the virtual backhaul links between them. For simplicity, we ignore this process when analyzing the utilities of the data service slices because its traffic is extremely low.
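We define $a^{(s)}(t) = \{a^{(s)}_j(t)\}$ ($j \in \mathcal{N}_T$) as the set of binary vTBS activation indicators of slice $s$, where $a^{(s)}_j(t) = 1$ if vTBS $j$ in slice $s$ is activated at slot $t$ and $a^{(s)}_j(t) = 0$ otherwise. The number of activated vTBSs in slice $s$ at slot $t$ is $N^{(s)}_T(t) = \sum_{j \in \mathcal{N}_T} a^{(s)}_j(t)$, and we denote the set of activated vTBSs in slice $s$ at slot $t$ by $\mathcal{N}^{(s)}_T(t)$. Note that the superscript $(s)$ in all symbols represents slice $s$ ($s \in \{1, 2, 3\}$). In addition, we deem the active users requesting the high-throughput, computation-intensive or delay-sensitive services at slot $t$ as the users in slice 1, slice 2 or slice 3 at the current slot.
We denote by $o^{(s)}_k(t)$ the binary request indicator of user $k$, where $o^{(s)}_k(t) = 1$ if user $k$ requests the service of slice $s$ at slot $t$ and $o^{(s)}_k(t) = 0$ otherwise. Hence, the number of active users in slice $s$ at slot $t$ is given by $N^{(s)}_u(t) = \sum_{k \in \mathcal{N}_u} o^{(s)}_k(t)$, and we denote the set of active users in slice $s$ at slot $t$ by $\mathcal{N}^{(s)}_u(t)$. Each physical TBS is connected with the CN by a backhaul link with the capacity of $Rb^T_B$. We assume that the total downlink bandwidth of the UP is $B^U$ and the maximum transmit power of each physical TBS is $P^T_B$. The communication resources, including the backhaul link capacity, downlink bandwidth and vTBS transmit power, will only be allocated to the active users covered by the CP.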
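We denote the backhaul link capacity allocated to user $k$ (covered by the CP) in slice $s$ at slot $t$ by $Rb^{(s)}_k(t)$. The downlink bandwidth in the UP is evenly divided into $N^U_{sub}$ subcarriers forming the set of UP subcarriers $\mathcal{N}^U_{sub} = \{1, 2, \ldots, N^U_{sub}\}$, and each UP subcarrier has a bandwidth of $B^U/N^U_{sub}$. We define $y^{(s)}(t) = \{y^{(s)}_{k,l}(t)\}$ as the set of binary UP subcarrier allocation indicators of slice $s$, where $y^{(s)}_{k,l}(t) = 1$ if UP subcarrier $l$ is allocated to user $k$ in slice $s$ at slot $t$ and $y^{(s)}_{k,l}(t) = 0$ otherwise. The vTBS transmit power allocated to user $k$ on UP subcarrier $l$ is denoted by $P^{(s)}_{k,l}(t)$, and $P^{(s)}(t) = \{P^{(s)}_{k,l}(t)\}$ is the set of vTBS transmit power levels in slice $s$ at slot $t$. Therefore, the overall vTBS transmit power allocated to user $k$ (covered by the CP) in slice $s$ at slot $t$ is the sum of vTBS transmit power on all the UP subcarriers allocated to the user, i.e.,
$$P^{(s)}_k(t) = \sum_{l \in \mathcal{N}^U_{sub}} y^{(s)}_{k,l}(t)\, P^{(s)}_{k,l}(t).$$
Besides, we denote the set of downlink channel power gains of slice $s$ at slot $t$ by $h^{(s)}(t) = \{h^T_{j,k,l}(t)\}$, where $h^T_{j,k,l}(t)$ is the downlink channel power gain between TBS $j$ and user $k$ on UP subcarrier $l$ at slot $t$. The distance between TBS $j$ ($j \in \mathcal{N}_T$) and user $k$ ($k \in \mathcal{N}_u$) is denoted by $r^T_{j,k}$. Each active user is associated with its nearest vTBS, and a binary vTBS association indicator $g^{(s)}_{j,k}(t)$ is defined such that $g^{(s)}_{j,k}(t) = 1$ if user $k$ (covered by the CP) in slice $s$ is associated with vTBS $j$ at slot $t$, namely $j = \arg\min_{1 \le j \le N^{(s)}_T(t)} r^T_{j,k}$, and $g^{(s)}_{j,k}(t) = 0$ otherwise. We assume that the downlink channels of vTBSs exhibit independent Rayleigh fading and have a path loss exponent $\alpha_T$. Therefore, the SINR between user $k$ ($k \in \mathcal{N}^{(s)}_u(t)$) and its associated vTBS $j$ ($j \in \mathcal{N}^{(s)}_T(t)$) on UP subcarrier $l$ at slot $t$ is
$$\mathrm{SINR}^{(s)}_{j,k,l}(t) = \frac{g^{(s)}_{j,k}(t)\, y^{(s)}_{k,l}(t)\, P^{(s)}_{k,l}(t)\, h^T_{j,k,l}(t)\, (r^T_{j,k})^{-\alpha_T}}{\sum_{i \in \mathcal{N}^{(s)}_T(t) \setminus \{j\}} \sum_{w \in \mathcal{N}^{(s)}_u(t) \setminus \{k\}} g^{(s)}_{i,w}(t)\, y^{(s)}_{w,l}(t)\, P^{(s)}_{w,l}(t)\, h^T_{i,k,l}(t)\, (r^T_{i,k})^{-\alpha_T} + \sigma^2},$$
where $r^T_{j,k}$ and $r^T_{i,k}$ denote the transmission distances between user $k$ and its associated vTBS or interfering vTBSs, respectively, and $\sigma^2$ denotes the power of the additive white noise. Here, $h^T_{j,k,l}(t)$ and $h^T_{i,k,l}(t)$ are the channel power gains of the desired transmission path and interference path on the common UP subcarrier $l$ at slot $t$, which follow two independent unit-mean exponential distributions.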
Specifically, the set of interfering channel power gains is $\{h^T_{i,k,l}(t)\}$. Besides, $y^{(s)}_{k,l}(t)$ and $y^{(s)}_{w,l}(t)$ represent that the desired signal and interference signals are transmitted on the common UP subcarrier $l$ in slice $s$ at slot $t$, while $g^{(s)}_{j,k}(t)$ and $g^{(s)}_{i,w}(t)$ are the vTBS association indicators of user $k$ and the interfering users in slice $s$ at slot $t$, respectively. Accordingly, we obtain the downlink rate of user $k$ on UP subcarrier $l$ in slice $s$ at slot $t$ as
$$R^{(s)}_{k,l}(t) = \frac{B^U}{N^U_{sub}} \log_2\big(1 + \mathrm{SINR}^{(s)}_{j,k,l}(t)\big),$$
and the overall downlink rate of user $k$ in slice $s$ at slot $t$ is the sum of the downlink rate on all the UP subcarriers allocated to the user, i.e.,
$$R^{(s)}_k(t) = \sum_{l \in \mathcal{N}^U_{sub}} y^{(s)}_{k,l}(t)\, R^{(s)}_{k,l}(t).$$
The throughput of user $k$ (covered by the CP) in slice $s$ at slot $t$ is $\min\{R^{(s)}_k(t), \ldots\}$ [32]. Therefore, the total throughput of E2E network slice $s$ at slot $t$ is the sum of the throughput of all users covered by the CP in slice $s$.
The computing resource of the network is represented by computing capability in CPU cycle/s [33], and the maximum computing capability of each physical TBS is $F^T_B$. The vTBS computing capability allocated to user $k$ (covered by the CP) in slice $s$ at slot $t$ is denoted by $f^{(s)}_k(t)$, and $f^{(s)}(t) = \{f^{(s)}_k(t)\}$ is the set of vTBS computing capabilities in slice $s$ at slot $t$. Therefore, the total vTBS computing capability of E2E network slice $s$ at slot $t$ is the sum of $f^{(s)}_k(t)$ over all users covered by the CP in slice $s$.
To analyze the delay performance of the E2E network slicing, three data queues are discussed for each user in slice $s$ at slot $t$: the backhaul link data queue, the computing queue and the wireless access link data queue. The random data arrivals on the backhaul links in slice $s$ at slot $t$ are assumed to be independent and identically distributed (i.i.d.) over time and to follow a Poisson arrival process with the average rate $\gamma^{(s)}_a$ (in bits/s). Therefore, in the backhaul link data queuing process for user $k$ (covered by the CP) in slice $s$, the backlog in each slot is reduced by the data forwarded over the backhaul link, which is limited by the allocated capacity $Rb^{(s)}_k(t)$, and increased by the new arrivals. Since the data will be processed at the vTBS after the backhaul link transmission, the arrival of the computing queue is the departure of the backhaul link data queue.
Then, in the computing queuing process for user $k$ (covered by the CP) in slice $s$, the backlog in each slot is reduced by the data processed at the computing rate, i.e., the number of bits computed per second for user $k$ in slice $s$ at slot $t$ (in bit/s), which equals the allocated vTBS computing capability $f^{(s)}_k(t)$ divided by the number of CPU cycles required per bit (in CPU cycle/bit) [33]. The data will be transmitted to users over the wireless access links after the vTBS processing; hence, the arrival of the wireless access link queue is the departure of the computing queue, and its backlog in each slot is reduced by the data delivered at the downlink rate $R^{(s)}_k(t)$. According to Little's Theorem [34], the user's average queuing delay is proportional to the average queue length; therefore, we can represent the delay of user $k$ in slice $s$ by the sum of the lengths of its three queues [35]. The access delay is not considered here because of its small proportion [36]. Hence, we optimize the delay of the E2E network slicing by minimizing the total queue length. The total queue length of slice $s$ at slot $t$ is the sum of the three queue lengths over all users covered by the CP in slice $s$, and the average delay of slice $s$, $D^{(s)}_{ave}(t)$, is obtained from this total queue length. Specifically, since slice 1, slice 2 and slice 3 can only serve users covered by the CP, no communication or computing resource will be allocated to the users not covered, i.e., if $d^C_k(x_0, t) = 0$, then $Rb^{(s)}_k(t) = 0$, $y^{(s)}_{k,l}(t) = 0$, $P^{(s)}_{k,l}(t) = 0$ and $f^{(s)}_k(t) = 0$.
The backhaul link capacity of slice $s$ at slot $t$ is the sum of $Rb^{(s)}_k(t)$ over its users. For the downlink bandwidth of slice $s$, let us analyze it from the perspective of the UP subcarrier allocation. Any UP subcarrier $l$ will be occupied in slice $s$ if it is allocated to at least one user (covered by the CP) in slice $s$ at slot $t$, i.e., there exists a user $k$ ($k \in \mathcal{N}^{(s)}_u(t)$ and $d^C_k(x_0, t) = 1$) such that $y^{(s)}_{k,l}(t) = 1$. Hence, $1 - \prod_{k \in \mathcal{N}^{(s)}_u(t)} [1 - y^{(s)}_{k,l}(t)]$ indicates whether UP subcarrier $l$ is occupied in slice $s$ at slot $t$, and the downlink bandwidth of slice $s$ at slot $t$ is the total bandwidth of its occupied UP subcarriers, i.e.,
$$\frac{B^U}{N^U_{sub}} \sum_{l \in \mathcal{N}^U_{sub}} \Big( 1 - \prod_{k \in \mathcal{N}^{(s)}_u(t)} [1 - y^{(s)}_{k,l}(t)] \Big).$$
The vTBS transmit power consumption of slice $s$ at slot $t$ is the sum of $P^{(s)}_k(t)$ over its users. The vTBS computing power consumption of user $k$ in slice $s$ at slot $t$ is modeled as in [33]; accordingly, the vTBS computing power consumption of slice $s$ at slot $t$ is the sum of the computing power consumption over its users.
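The three queues in tandem (backhaul link, computing, wireless access link) can be illustrated by a minimal per-user sketch. It is written under two assumptions that are not taken from the text: data arriving in a slot can be served only from the next slot on, and each departure is the minimum of the backlog and the service offered in the one-second slot. The function name `step_queues` and all argument names are hypothetical.

```python
def step_queues(q_bh, q_cp, q_ac, arrivals, rb, f, cycles_per_bit, rate):
    """Advance the three tandem queues of one user by one 1-second slot.

    q_bh, q_cp, q_ac : backlogs (bits) of the backhaul, computing and access queues
    arrivals         : new bits arriving at the backhaul queue in this slot
    rb               : allocated backhaul link capacity (bits/s)
    f                : allocated computing capability (CPU cycles/s)
    cycles_per_bit   : CPU cycles required per bit
    rate             : downlink rate on the wireless access link (bits/s)
    """
    d_bh = min(q_bh, rb)                   # bits leaving the backhaul queue
    d_cp = min(q_cp, f / cycles_per_bit)   # bits processed at the computing rate
    d_ac = min(q_ac, rate)                 # bits delivered to the user
    # The departure of each queue is the arrival of the next one.
    return (q_bh - d_bh + arrivals, q_cp - d_cp + d_bh, q_ac - d_ac + d_cp)


new_bh, new_cp, new_ac = step_queues(
    q_bh=10.0, q_cp=4.0, q_ac=2.0,
    arrivals=3.0, rb=6.0, f=8.0, cycles_per_bit=2.0, rate=5.0,
)
total_queue_length = new_bh + new_cp + new_ac
```

Summing the three backlogs gives the per-user queue length that the delay discussion above treats as a proxy for queuing delay.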

III. SUM-UTILITY MAXIMIZATION PROBLEM
In this section, we will define the utilities of the wide-coverage, high-throughput, computation-intensive and delay-sensitive slices, respectively, and then the sum-utility maximization problem will be formulated.

A. Utility Definition
From the perspective of the network operator, the utility of CUPS-E2E network slicing is defined as the income minus the costs of the physical network. For slice C, the utility is defined as the income quantified by the number of users covered by the CP minus the costs in terms of the backhaul link capacity, the downlink bandwidth and the vCBS transmit power consumption in slice C, i.e.,
$$U_C(t) = m_C \sum_{k \in \mathcal{N}_u} d^C_k(x_0, t) - \omega N_u Rb^C - \delta \frac{B^C}{N^C_{sub}} \Big( N^C_{sub} - \sum_{l \in \mathcal{N}^C_{sub}} \prod_{k \in \mathcal{N}_u} [1 - y^C_{k,l}(t)] \Big) - \beta \sum_{k \in \mathcal{N}_u} P^C_k(t),$$
where $m_C$ is the income from each user covered by the CP, while $\omega$, $\delta$ and $\beta$ are the costs per unit backhaul link capacity, per unit downlink bandwidth and per unit power consumption, respectively. To guarantee the fundamental coverage for the users, the CP coverage percentage has to satisfy the constraint $p^{cov}_C(x_0, t) \ge \rho$, where $\rho$ is the threshold of the CP coverage percentage. The coverage penalty $\xi \cdot \max\{\rho - p^{cov}_C(x_0, t), 0\}$ will be paid if the constraint on the CP coverage percentage is violated, where $\xi$ is a positive penalty coefficient.
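As a numerical illustration of the slice C utility and the coverage penalty described above, the following sketch evaluates both for one slot. The function names and the numbers are hypothetical, and the sketch deliberately keeps the penalty separate from the utility, since how the two are combined is not assumed here.

```python
def slice_c_utility(n_covered, backhaul, bandwidth, power, m_c, omega, delta, beta):
    """Income from covered users minus backhaul, bandwidth and power costs."""
    return m_c * n_covered - omega * backhaul - delta * bandwidth - beta * power


def coverage_penalty(n_covered, n_users, rho, xi):
    """Penalty paid when the coverage percentage falls below the threshold rho."""
    return xi * max(rho - n_covered / n_users, 0.0)


utility = slice_c_utility(
    n_covered=8, backhaul=2.0, bandwidth=4.0, power=1.0,
    m_c=1.0, omega=0.5, delta=0.25, beta=1.0,
)
penalty = coverage_penalty(n_covered=8, n_users=10, rho=0.9, xi=10.0)
```

With 8 of 10 users covered and a threshold of 0.9, the coverage shortfall is 0.1, so a penalty proportional to that shortfall applies; with a threshold of 0.8 or lower the penalty would vanish.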
For slice 1, providing high throughput has the priority. Therefore, the utility of slice 1 is defined as the income from the total throughput provided, with $m_1$ denoting the income per unit throughput, minus the costs of slice 1, including the backhaul link capacity, downlink bandwidth, the vTBS transmit power and vTBS computing power consumptions in slice 1. Providing high computing capability is the primary target for slice 2. Accordingly, the utility of slice 2 is defined as the income from the total computing capability provided, with $m_2$ denoting the income per unit computing capability, minus the costs in terms of the backhaul link capacity, downlink bandwidth, the vTBS transmit power and vTBS computing power consumptions in slice 2. The main objective of slice 3 is to provide users with as low a delay as possible. Consequently, the utility of slice 3 is defined as the income defined by the queuing delay gain [8] minus the costs in terms of the backhaul link capacity, downlink bandwidth, the vTBS transmit power and vTBS computing power consumptions in slice 3, where $m_3$ is the income per unit queue length, i.e., the unit price of delay gain, and $\psi$ is an initial maximum benefit coefficient of slice 3 to guarantee the non-negativity of the utility [37]. To ensure the quality of service (QoS) requirement for users in the delay-sensitive slice 3, the average delay of slice 3 has to meet the constraint $D^{(3)}_{ave}(t) \le \zeta$, where $\zeta$ is the delay threshold. If the constraint is violated, the delay penalty $\eta \cdot \max\{D^{(3)}_{ave}(t) - \zeta, 0\}$ will be paid, where $\eta$ is a positive penalty coefficient.
Hence, we have the sum-utility of the entire network as the sum of $U_C(t)$, $U^{(1)}(t)$, $U^{(2)}(t)$ and $U^{(3)}(t)$.

B. Sum-Utility Maximization Problem Formulation
We aim to maximize the expected cumulative discounted sum-utility of CUPS-E2E network slicing, while satisfying the vTBS activation and resource allocation constraints, including the vCBS transmit power, CP subcarrier, backhaul link capacity, UP subcarrier, vTBS transmit power and vTBS computing capability allocations. We formulate the sum-utility maximization problem as follows:
$$\max_{V(t)} \; \mathbb{E}\Big[ \sum_{t=0}^{T-1} \gamma^t\, U(t) \Big] \quad \text{subject to constraints C1 to C11},$$
where $U(t)$ denotes the sum-utility at slot $t$, $\gamma \in [0, 1]$ is the discount factor, and the variable set $V(t) = [y^C(t), P^C(t), a^{(s)}(t), \ldots]$ collects the vTBS activation and resource allocation variables. Besides, $P^C_{max}$ and $P^{(s)}_{max}$ are the maximum vCBS and vTBS transmit power that can be allocated to a single user in slice C and slice $s$, respectively. In the constraints, C3 and C8 are the vCBS and vTBS transmit power constraints, respectively. C1 and C6 guarantee that one CP subcarrier or UP subcarrier can only be allocated to at most one user associated with the same vCBS or vTBS at the same slot, respectively, while C2 ensures that only one CP subcarrier can be allocated to one user. C7 guarantees that one UP subcarrier can only be allocated to users in at most one slice, which guarantees the bandwidth isolation among network slices. From the analysis in Section II-B, the expression $1 - \prod_{k \in \mathcal{N}^{(s)}_u(t)} [1 - y^{(s)}_{k,l}(t)]$ can be used as an indicator to represent whether UP subcarrier $l$ is allocated to slice $s$ at slot $t$ or not; therefore, the sum of the indicators of the three data service slices is the number of slices to which the same UP subcarrier is allocated at the same slot, which should not be more than 1. C4, C5, C9 and C10 guarantee that the total vCBS transmit power, backhaul link capacity, vTBS transmit power and vTBS computing capability allocated to all the users associated with one CBS or TBS should not exceed the total available vCBS transmit power, backhaul link capacity, vTBS transmit power and vTBS computing capability of the CBS or TBS, respectively. C11 represents the value range of the binary variables, i.e., the vTBS activation and subcarrier allocation indicators.
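For one realized trajectory, the cumulative discounted sum-utility in the objective can be evaluated as sketched below. The function name `discounted_return` is hypothetical, and the expectation in the objective would correspond to averaging this quantity over many trajectories.

```python
def discounted_return(utilities, gamma):
    """Sum of gamma**t * utilities[t] over t = 0, ..., T-1, via backward recursion."""
    total = 0.0
    for u in reversed(utilities):
        total = u + gamma * total
    return total


# Three slots with sum-utilities 1, 2, 3 and discount factor 0.5:
# 1 + 0.5 * 2 + 0.25 * 3 = 2.75
value = discounted_return([1.0, 2.0, 3.0], gamma=0.5)
```

The backward recursion avoids computing powers of the discount factor explicitly and mirrors how returns are usually accumulated in reinforcement learning code.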

IV. DDPG-BASED SUM-UTILITY MAXIMIZATION
In this section, we first give a brief introduction to the DDPG algorithm. We then present the state space, action space and reward function of the sum-utility maximization problem, and design a DDPG-based algorithm to solve this problem for CUPS-E2E network slicing. Finally, we analyze the complexity of our algorithm.

A. DDPG Algorithm
As a deep reinforcement learning algorithm, DDPG can handle the high-dimensional continuous joint vTBS activation and resource allocation for CUPS-E2E network slicing. It involves two online neural networks and two target neural networks: the online critic network, online actor network, target critic network and target actor network, with the network parameters $\theta^{Q}$, $\theta^{\mu}$, $\theta^{Q'}$ and $\theta^{\mu'}$, respectively [38].
The objective function of the DDPG algorithm is defined as the expectation of the discounted accumulative reward, i.e., $J(\theta^{\mu}) = \mathbb{E}[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots + \gamma^n r_n]$, where $r_0, r_1, \cdots, r_n$ denote the rewards at each time step. The online actor network learns the action selection under a given state: it maps the current state $s_t$ to a certain action $a_t$ based on the present policy $\mu(s_t|\theta^{\mu})$. The online critic network evaluates the action by a state-action value function $Q(s_t, a_t|\theta^{Q})$, the input of which includes both the current state and the action obtained by the online actor network. The online critic network parameter $\theta^{Q}$ is updated by the stochastic gradient descent method to minimize the loss function of the online critic network. The loss function $L$ is represented by the mean square error, i.e.,
$$L = \mathbb{E}\left[\left(y_t - Q(s_t, a_t|\theta^{Q})\right)^2\right], \tag{33}$$
where $y_t$ is the target state-action value, which contains the current reward and the state-action value of the next time step. A deep neural network (DNN) is used to approximate the policy function $\mu(s_t|\theta^{\mu})$ and the state-action value function $Q(s_t, a_t|\theta^{Q})$.
To make the state-action value more stable during the training stage, the target actor network and target critic network are used to select the action and calculate the state-action value of the next time step, respectively. Therefore, $y_t$ is given by
$$y_t = r_t + \gamma Q'\left(s_{t+1}, \mu'(s_{t+1}|\theta^{\mu'})\,\big|\,\theta^{Q'}\right). \tag{34}$$
The online actor network parameter $\theta^{\mu}$ is updated by the gradient method along the policy gradient, i.e.,
$$\nabla_{\theta^{\mu}} J \approx \mathbb{E}\left[\nabla_{a} Q(s, a|\theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \nabla_{\theta^{\mu}} \mu(s|\theta^{\mu})\big|_{s=s_t}\right]. \tag{35}$$
To explore potentially better strategies, a stochastic noise $N_o$ is introduced to affect the action selection during the network model training stage, and the action selection with the stochastic noise is expressed as
$$a_t = \mu(s_t|\theta^{\mu}) + N_o. \tag{36}$$
The target critic network parameter $\theta^{Q'}$ and target actor network parameter $\theta^{\mu'}$ are updated by the soft update method with a small constant $\tau$, i.e.,
$$\theta^{Q'} \leftarrow \tau \theta^{Q} + (1 - \tau)\theta^{Q'}, \qquad \theta^{\mu'} \leftarrow \tau \theta^{\mu} + (1 - \tau)\theta^{\mu'}. \tag{37}$$
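The update rules described above (target value, mean-square critic loss, noisy action selection and soft target update) can be sketched in plain Python, with scalars and lists standing in for the neural networks. This is a minimal sketch rather than the implementation used in the paper: the Gaussian form of the noise, the clipping of actions to [0, 1] and all function names are assumptions made for illustration.

```python
import random

def td_target(reward, gamma, q_next_target):
    """Target value: current reward plus discounted target-critic value."""
    return reward + gamma * q_next_target

def critic_loss(targets, q_values):
    """Mean square error between target values and online critic outputs."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)

def noisy_action(mu_action, sigma, rng, low=0.0, high=1.0):
    """Add exploration noise to the deterministic action and clip to range.

    Gaussian noise and the [low, high] clipping are assumptions.
    """
    return [min(high, max(low, a + rng.gauss(0.0, sigma))) for a in mu_action]

def soft_update(target_params, online_params, tau):
    """Move each target parameter a fraction tau toward its online value."""
    return [tau * w + (1.0 - tau) * w_t
            for w_t, w in zip(target_params, online_params)]

rng = random.Random(0)
y = td_target(reward=1.0, gamma=0.9, q_next_target=2.0)
loss = critic_loss([y, 0.0], [2.0, 0.0])
action = noisy_action([0.2, 0.9], sigma=0.1, rng=rng)
new_target = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.01)
```

With a small tau, the target parameters track the online parameters slowly, which is what keeps the target value stable during training.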

B. DDPG-Based Sum-Utility Maximization Algorithm
We design a DDPG-based sum-utility maximization algorithm to find the optimal joint vTBS activation and communication and computing resource allocation policy for CUPS-E2E network slicing, i.e., the policy that maximizes the long-term sum-utility. The state space, action space and reward function are as follows:
1) State Space: The state space $S_t$ in (38) includes the wireless channel states, the users' service requesting states and the data arrivals. Therein, the channel states contain the desired channel power gains and interfering channel power gains in slice C and slice s (s = 1, 2, 3) at the current slot.
2) Action Space: The network operator has to decide the activation of vTBSs and the allocation of communication and computing resources for CUPS-E2E network slicing. For the network coverage slice (slice C), the network operator has to decide the allocation of CP subcarriers and vCBS transmit power. For the data service slices (slice 1, slice 2 and slice 3), the network operator has to decide the allocation of backhaul link capacity, UP subcarriers, vTBS transmit power and vTBS computing capability. Accordingly, we have the action space vector $A_t$ as
$$A_t = \left[y_C(t),\, P_C(t),\, a_T^{(s)}(t),\, Rb^{(s)}(t),\, y^{(s)}(t),\, p^{(s)}(t),\, f^{(s)}(t)\right]. \tag{39}$$
In particular, we relax the binary variables in $y_C(t)$, $a_T^{(s)}(t)$ and $y^{(s)}(t)$ to continuous variables in the range [0, 1] to satisfy the continuity condition of the action space in the DDPG algorithm. Importantly, the network operator checks at each time slot whether the selected actions satisfy the constraints of the sum-utility maximization problem, and the actions that violate the constraints are discarded or modified. Specifically, the selected actions $P_C(t)$, $Rb^{(s)}(t)$, $p^{(s)}(t)$ and $f^{(s)}(t)$ are reduced to lower values to satisfy the resource allocation constraints if they exceed the limits in constraints C3, C4, C5, C8, C9 and C10.
Besides, if some of the selected actions $y_C(t)$ and $y^{(s)}(t)$ violate the CP subcarrier allocation indicator constraints C1 and C2 or the UP subcarrier allocation indicator constraints C6 and C7, only one of the CP subcarrier allocation indicators for a user or one of the UP subcarrier allocation indicators is retained and the others are set to zero to satisfy the constraints C1, C2, C6 and C7.
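The constraint check described above can be illustrated with a small repair step applied to the raw actor output. The text does not state how a violating value is lowered or which indicator is retained, so the sketch makes its own choices: allocations are clipped to the per-user maximum and then scaled proportionally to the total budget, and the largest relaxed indicator is kept if it reaches 0.5. These rules, the threshold and the function names are assumptions.

```python
def cap_and_scale(values, per_user_max, total_budget):
    """Clip each allocation to [0, per_user_max], then shrink all of them
    proportionally if their sum still exceeds total_budget."""
    clipped = [min(max(v, 0.0), per_user_max) for v in values]
    total = sum(clipped)
    if total > total_budget and total > 0.0:
        scale = total_budget / total
        clipped = [v * scale for v in clipped]
    return clipped

def keep_one(indicators, threshold=0.5):
    """Binarize relaxed indicators in [0, 1]: keep the largest entry as 1
    and zero the others; return all zeros if none reaches the threshold."""
    if not indicators or max(indicators) < threshold:
        return [0] * len(indicators)
    best = indicators.index(max(indicators))
    return [1 if i == best else 0 for i in range(len(indicators))]

# Three users request 0.5, 2.0 and 1.5 units of transmit power.
powers = cap_and_scale([0.5, 2.0, 1.5], per_user_max=1.0, total_budget=2.0)
# Relaxed subcarrier indicators of one user over three subcarriers.
choice = keep_one([0.2, 0.7, 0.9])
```

In this example the powers become roughly 0.4, 0.8 and 0.8, which respects both the per-user limit and the total budget, and only the third subcarrier indicator is retained.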
3) Reward Function: The reward function $Re_t$ is the sum-utility of CUPS-E2E network slicing, which is determined by the current state $S_t$ and action $A_t$.
Our algorithm contains a training stage and a validating stage, as shown in Algorithms 1 and 2, respectively. In the training stage, the agent learns the mapping from the users' UP service requests and the wireless channel states to the action, i.e., the vTBS activation and resource allocation. The vTBS activation and resource allocation are determined based on the current deterministic policy $\mu$ according to the current state $S_t$ at each time step. With the selected vTBS activation and resource allocation, the immediate reward $Re_t$ and the state of the next step $S_{t+1}$ are generated. In the training stage, the data set $(S_t, A_t, Re_t, S_{t+1})$ of each time step is stored as training data in a space named the replay buffer. A mini-batch of the training data is sampled from the replay buffer at each time step to update the parameters of the online critic network $\theta^{Q}$ and online actor network $\theta^{\mu}$. In the validating stage, the joint vTBS activation and resource allocation are determined by the trained deterministic policy $\mu$, and the immediate reward is obtained at the same time.

Algorithm 1 (training stage)
Initialization: Initialize the parameters of the online critic network $\theta^{Q}$ and online actor network $\theta^{\mu}$; initialize the parameters of the target critic network $\theta^{Q'} \leftarrow \theta^{Q}$ and target actor network $\theta^{\mu'} \leftarrow \theta^{\mu}$; initialize the sizes of replay buffer D and mini-batch M;
for episode = 1, 2, ···, K do
    Receive the initial observation state $S_1$;
    for t = 1, 2, ···, T do
        Select action $A_t$ based on the current deterministic policy $\mu$ with the stochastic noise according to (36);
        Take action $A_t$, obtain the immediate reward $Re_t$ and generate the next state $S_{t+1}$;
        Store the data set $(S_t, A_t, Re_t, S_{t+1})$ in D;
        if replay buffer D is full then
            Select M data sets randomly from D and build a mini-batch $(S_i, A_i, Re_i, S_{i+1})$ (i = 1, 2, ···, M);
            Update $\theta^{Q}$ by the gradient descent method to minimize the loss L;
            Update $\theta^{\mu}$ by the gradient method according to (35);
            Soft update $\theta^{Q'}$ and $\theta^{\mu'}$ according to (37);
        end if
    end for
end for

Algorithm 2 (validating stage)
Initialization: ..., Ah (s) (1)];
for t = 1, 2, ···, T do
    Select action $A_t$ based on the trained deterministic policy $\mu$: $A_t = \mu(S_t|\theta^{\mu})$;
    Take action $A_t$, obtain the immediate reward $Re_t$ and generate the next state $S_{t+1}$;
    Update the state: $S_t \leftarrow S_{t+1}$;
end for
return Validating Results
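The replay buffer logic of the training stage (store every transition, start updating only once the buffer is full, then draw a random mini-batch) can be sketched as follows. The class and method names are hypothetical, and dropping the oldest transition when a new one arrives at a full buffer is an assumption, since the text does not say how a full buffer is refreshed.

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of (state, action, reward, next_state) tuples."""

    def __init__(self, capacity, seed=None):
        self.capacity = capacity
        # Assumed eviction rule: the oldest transition drops out first.
        self.data = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def store(self, state, action, reward, next_state):
        self.data.append((state, action, reward, next_state))

    def is_full(self):
        return len(self.data) == self.capacity

    def sample(self, batch_size):
        """Draw batch_size distinct transitions uniformly at random."""
        return self.rng.sample(list(self.data), batch_size)

buffer = ReplayBuffer(capacity=4, seed=0)
updates = 0
for t in range(6):
    # Dummy transition: the state is the step index, the reward is constant.
    buffer.store(t, 0.0, 1.0, t + 1)
    if buffer.is_full():      # updates start only once D is full
        batch = buffer.sample(2)
        updates += 1
```

With a capacity of 4 and six stored transitions, the first three steps only fill the buffer and the remaining three each trigger one mini-batch update.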

C. Complexity Analysis
In this subsection, we analyze the computational complexity of the training process of the DDPG-based sum-utility maximization algorithm. DNNs are used as function approximators for both the actor and critic networks of the DDPG-based algorithm. The DNNs have one input layer, one output layer and several fully connected hidden layers. Accordingly, the computational complexity analysis is mainly based on the DNN model [39].
The actor network takes the users' requests and wireless channel states as input and outputs the actions, including the resource allocation and vTBS activation policy, while the critic network takes the state and action information as input and outputs the corresponding state-action value to evaluate the action selected by the actor network. Let $N_L^a$ and $N_L^c$ denote the numbers of hidden layers of the actor and critic networks, respectively, and let $n_m^a$ and $n_m^c$ denote the numbers of neurons in the $m$th hidden layer of the actor and critic networks, respectively. In addition, we denote the dimensions of the state and action spaces by $|S_1|$ and $|A_1|$, respectively, which can be obtained based on (38) and (39), and assume that the parameters of the DNNs converge after $F_1^{\mathrm{conv}}$ episodes and $N_1^{\mathrm{conv}}$ time slots. Accordingly, the complexity of the DDPG-based algorithm is expressed in terms of these quantities for both the actor and critic networks.
Next, we analyze the complexity of the DQL-based algorithm. First, the DQL algorithm requires the action space of the sum-utility maximization problem to be discretized. We denote the dimension of the state space and the size of the discretized action space of the DQL-based algorithm by $|S_2|$ and $|A_2|$, namely the dimensions of the input and output of the DQL network, respectively. Therein, $|S_2| = |S_1|$, and $|A_2|$ depends on the resolution of the discretization of the action space. Let $N_L^Q$ and $n_m^Q$ denote the number of hidden layers of the DQL network and the number of neurons in the $m$th hidden layer, respectively, and assume that the parameters of the DQL network converge after $F_2^{\mathrm{conv}}$ episodes and $N_2^{\mathrm{conv}}$ time slots. The complexity of the DQL-based algorithm is then given by (42).
Based on the above analysis, the complexity of the DDPG-based algorithm is higher than that of the DQL-based algorithm, because the complexity analysis of the DDPG-based algorithm covers both the actor and critic networks.
However, the DQL-based algorithm divides the continuous action space into a discrete action space and restricts the resource allocation for CUPS-E2E network slicing to finite discrete quantities, which degrades the sum-utility obtained by the DQL-based algorithm as compared with the DDPG-based algorithm.
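To make the comparison concrete, the sketch below counts the multiply-accumulate operations of one forward pass through fully connected networks, using the common rule that a dense layer costs the product of its input and output widths. This per-sample count is only an assumed proxy for the complexity expressions of the text, which additionally scale with the number of episodes and time slots until convergence; the layer sizes and function names are illustrative.

```python
def dense_macs(layer_sizes):
    """Multiply-accumulate count of one forward pass through a fully
    connected network with the given layer widths (input layer first)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

def ddpg_cost(state_dim, action_dim, actor_hidden, critic_hidden):
    """Actor maps state to action; critic maps (state, action) to a scalar."""
    actor = dense_macs([state_dim, *actor_hidden, action_dim])
    critic = dense_macs([state_dim + action_dim, *critic_hidden, 1])
    return actor + critic

def dql_cost(state_dim, num_discrete_actions, hidden):
    """A single Q-network maps the state to one value per discrete action."""
    return dense_macs([state_dim, *hidden, num_discrete_actions])

# Illustrative sizes only: 10 state entries, 4 action entries, two hidden
# layers of 64 neurons, and a discretized action set of size 4 for DQL.
ddpg = ddpg_cost(state_dim=10, action_dim=4,
                 actor_hidden=[64, 64], critic_hidden=[64, 64])
dql = dql_cost(state_dim=10, num_discrete_actions=4, hidden=[64, 64])
```

For equal hidden-layer sizes, the DDPG count is roughly twice the DQL count because two networks are evaluated; the DQL count grows with the size of the discretized action set, which depends on the chosen resolution.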
To verify the performance of the proposed scheme, we provide two coupled network slicing schemes and two benchmark resource allocation algorithms for comparison. First, we compare the proposed CUPS-E2E network slicing scheme with two coupled schemes: 1) the conventional E2E network slicing scheme, where the CP and the UP are tightly coupled [17], i.e., the control signals and data signals are transmitted by the same physical base station (BS) in the common bandwidth; 2) the Software-Defined Hyper-Cellular Network (SD-HCN)-based E2E network slicing scheme, where traditional BSs are classified into CBSs and TBSs, and the control and data signals are transmitted by the dedicated CBSs and TBSs, respectively, in the common bandwidth [43]. Then, to show the advantage of the DDPG-based algorithm in solving the sum-utility maximization problem, we compare it with two typical algorithms, the asynchronous advantage actor-critic (A3C)-based algorithm and the fixed resource allocation algorithm, in terms of the throughput, computing capability and sum-utility. We choose A3C as a comparison algorithm because both A3C and DDPG are actor-critic algorithms that can solve problems with continuous action spaces. With the fixed resource allocation algorithm, the communication and computing resources are empirically allocated to each slice and remain constant regardless of how the data service requesting states change over time [44]. Besides, the convergence of the DQL-based algorithm is also compared with that of the DDPG-based and A3C-based algorithms.
The convergence of the DDPG-based algorithm is shown in Figs. 3 and 4. Therein, Fig. 3 shows the convergence of the DDPG-based sum-utility maximization algorithm under different values of the actor's learning rate $\alpha_{\mu}$ and the critic's learning rate $\alpha_{Q}$, as well as the convergence of the DQL-based and A3C-based algorithms. As can be seen from Fig. 3, under the same critic's learning rate $\alpha_{Q} = 10^{-3}$, the algorithm converges faster when $\alpha_{\mu} = 10^{-2}$ than when $\alpha_{\mu} = 10^{-3}$. Similarly, under the same actor's learning rate $\alpha_{\mu} = 10^{-3}$, the algorithm converges faster when the critic's learning rate is larger ($\alpha_{Q} = 10^{-3}$). Fig. 4 shows the impact of the sizes of the replay buffer and mini-batch on the convergence. In Fig. 4, we can see that under the same mini-batch size, the algorithm converges faster when the replay buffer is smaller. Similarly, under the same replay buffer size, the convergence is faster when the mini-batch size is 32 than when it is 50. Besides, Figs. 3 and 4 show that the DDPG-based algorithm converges slightly faster than the DQL-based algorithm and the A3C-based algorithm under the same $\alpha_{\mu}$ and $\alpha_{Q}$ or the same sizes of replay buffer and mini-batch.
Figs. 5 and 6 depict the CP coverage percentage of slice C versus the SINR threshold $x_0$ and the density of users, respectively, for the proposed CUPS-E2E network slicing scheme under various $\lambda_C$ and for the other network slicing schemes. The simulation results consider the distances between the users and the CBSs when calculating the SINR for each user. Compared with the conventional E2E network slicing scheme, the SD-HCN-based E2E network slicing scheme provides better coverage performance. This is because the coverage control signals are transmitted via dedicated CBSs in the SD-HCN-based E2E network slicing scheme, and the dedicated CBSs provide a higher transmit power. Moreover, both Figs. 5 and 6 show that the CP coverage percentage of the proposed CUPS-E2E network slicing scheme is higher than that of the SD-HCN-based E2E network slicing and conventional E2E network slicing schemes. In the proposed scheme, coverage control signals and data signals are transmitted in different frequency bands, so the interference between them is reduced significantly, which improves the CP coverage percentage of the network. Besides, Fig. 6 shows that the CP coverage percentage decreases slightly with the increasing density of users, because the growing number of users in the given area reduces the vCBS transmit power allocated to each user. Nevertheless, the proposed CUPS-E2E network slicing scheme still provides a high CP coverage percentage (over 70%) when the density of users is greater than 1.0 users/m$^2$, which guarantees massive user access even when the density of users is high.
Fig. 7 plots the throughput of two typical eMBB slices versus the density of TBSs for the proposed CUPS-E2E network slicing scheme and the other benchmark algorithms and schemes. We observe that when the density of TBSs is below 0.0225 TBSs/m$^2$, the interference from other TBSs is limited, and the throughput increases with the density of TBSs because the path loss decreases and the vTBS transmit power allocated to each user increases. As the density of TBSs increases further (from 0.0225 to 0.0300 TBSs/m$^2$), the interference becomes stronger and the throughput shows a decreasing trend. The throughput of the CUPS-E2E network slicing scheme exceeds that of the SD-HCN-based E2E network slicing and conventional E2E network slicing schemes, which shows the advantage of the proposed scheme. We can also note that the throughput obtained by the DDPG-based algorithm is 3.2% and 40.7% higher than that of the A3C-based algorithm and the fixed resource allocation algorithm, respectively. In addition, the throughput of slice 1 is much higher than that of slice 2, which satisfies the throughput performance requirement of slice 1.
Fig. 8 shows the computing capability of two typical eMBB slices versus the density of TBSs for all algorithms and schemes. As the density of TBSs increases, the number of users associated with one physical TBS decreases, and the TBS is able to provide a higher computing capability for each user. Therefore, the computing capability increases almost linearly with the density of TBSs. Compared with the A3C-based algorithm and the fixed resource allocation algorithm, the DDPG-based algorithm achieves better computing performance. The computing capability of the proposed CUPS-E2E network slicing scheme is the same as that of the SD-HCN-based E2E network slicing and conventional E2E network slicing schemes, because the separation of the CP and the UP does not affect the allocation of computing resources. Moreover, the computing capability of slice 1 is much lower than that of slice 2, which satisfies the computing capability requirement of slice 2.
Fig. 9 plots the average delay of slice 3 versus the density of TBSs for all algorithms and schemes. We note that when the density of TBSs is below 0.0225 TBSs/m$^2$, the average delay decreases rapidly with the increasing density of TBSs because the backhaul link capacity, computing capability and downlink rate of each user increase and the queue backlogs decrease. With the denser deployment of the TBSs (from 0.0225 to 0.0300 TBSs/m$^2$), the backhaul link capacity and computing capability allocated to each user continue to increase, while the downlink rate of each user decreases because of the stronger interference. Therefore, the backlogs of the backhaul link queues and computing queues continue to decrease, while the backlogs of the wireless access link queues increase. Accordingly, the average delay decreases slowly and tends to be stable when the density of TBSs exceeds 0.0225 TBSs/m$^2$. Fig. 9 also shows that the DDPG-based algorithm achieves better delay performance than the A3C-based algorithm and the fixed resource allocation algorithm. Besides, the average delay of the proposed CUPS-E2E network slicing scheme is lower than that of the SD-HCN-based E2E network slicing and conventional E2E network slicing schemes, because the improved downlink rate of each user in the proposed scheme lowers the queue backlogs.
In Fig. 10(a), when the density of CBSs is below $6 \times 10^{-5}$ CBSs/m$^2$, the interference from other CBSs is limited and the number of users covered by the CP increases with the density of CBSs. Therefore, the utility of slice C also increases with the density of CBSs. When the CBSs are deployed densely (from $6 \times 10^{-5}$ to $10^{-4}$ CBSs/m$^2$), the CP coverage percentage saturates and the utility of slice C remains steady. Moreover, the impact of the density of CBSs on the utilities of the data service slices is also reflected in Fig. 10(a). When the CBSs are deployed sparsely (below $5 \times 10^{-6}$ CBSs/m$^2$), only a few users are covered by the CP, and the communication and computing resources allocated to each user remain almost unchanged as the number of covered users grows. Hence, the total throughput and the total computing capability increase with the number of covered users, and the utilities of slice 1 and slice 2 increase near-linearly with the density of CBSs. With the denser deployment of CBSs, the number of covered users increases further, and the communication and computing resources allocated to each user decrease because of the inherent resource limitation. Consequently, the total throughput and the total computing capability remain almost stable as the number of covered users increases, and the utilities of slice 1 and slice 2 therefore remain nearly steady as the density of CBSs increases (over $5 \times 10^{-6}$ CBSs/m$^2$). However, the trend of the utility of slice 3 with the density of CBSs differs from that of the two eMBB slices. The utility of slice 3 first decreases with the increasing density of CBSs. This is because the total queue length grows with the number of covered users when the density of CBSs is below $6 \times 10^{-5}$ CBSs/m$^2$, and the reduction of the computing capability and downlink rate of each individual user further increases the total queue length when the density of CBSs is between $5 \times 10^{-6}$ CBSs/m$^2$ and $6 \times 10^{-5}$ CBSs/m$^2$. As the density of CBSs increases further (over $6 \times 10^{-5}$ CBSs/m$^2$), the utility of slice 3 tends to be stable due to the saturation of the number of covered users. Finally, the sum-utility of the entire network increases rapidly when the density of CBSs is below $5 \times 10^{-6}$ CBSs/m$^2$, and remains almost steady when the density of CBSs exceeds $5 \times 10^{-6}$ CBSs/m$^2$ because of the mutual cancellation of the utilities of slice C and slice 3.
Fig. 10(b) illustrates the utility versus the density of TBSs. As the density of TBSs increases, the utility of slice 1 first increases gradually and then decreases when the TBSs are deployed densely, which accords with the throughput performance of slice 1 depicted in Fig. 7. The utility of slice 2 increases linearly with the density of TBSs, which conforms to the computing capability performance of slice 2 illustrated in Fig. 8. The utility of slice 3 first increases markedly because of the reduction of the delay of slice 3, and then increases slowly because the delay tends to be stable with the increasing density of TBSs, which conforms to the delay performance of slice 3 illustrated in Fig. 9. Additionally, the utility of slice C remains steady with the increasing density of TBSs.
Figs. 10(a) and 10(b) together show that the sum-utilities of the SD-HCN-based E2E network slicing and conventional E2E network slicing schemes are lower than that of the proposed CUPS-E2E network slicing scheme. Therefore, the proposed scheme allows the network operator to obtain a better sum-utility. The conventional E2E network slicing scheme is not compared in Fig. 10(a) because the control signals are not transmitted by dedicated CBSs in this scheme. We can also note that the DDPG-based algorithm yields a clearly higher sum-utility than the fixed resource allocation algorithm, and a slightly better sum-utility than the A3C-based algorithm. The results demonstrate that the DDPG-based algorithm performs better than the other algorithms in solving the sum-utility maximization problem.

VI. CONCLUSION
In this paper, we have studied the sum-utility maximization for CUPS-E2E network slicing, where the CP coverage and UP services are decoupled for four E2E network slices: one for CP coverage, one for UP high-throughput services, one for UP computation-intensive services, and the other for UP delay-sensitive services. The utilities of the four E2E network slices are defined considering their heterogeneous performance metrics. To maximize the sum-utility of the entire network, we have employed a DDPG-based sum-utility maximization algorithm that jointly optimizes the vTBS activation and the allocation of CP subcarrier, vCBS transmit power, backhaul link capacity, UP subcarrier, vTBS transmit power and vTBS computing capability. Simulation results have shown that the proposed CUPS-E2E network slicing scheme in conjunction with a DDPG-based sum-utility maximization algorithm can support the wide-coverage and massive access requirements, as well as the high-throughput, computation-intensive and delay-sensitive services simultaneously, and outperforms the SD-HCN-based E2E network slicing and conventional E2E network slicing schemes in terms of the sum-utility, coverage percentage, throughput and delay.