Disclosure of Invention
In order to overcome the poor real-time performance, insufficient adaptability and limited multi-objective optimization capability of conventional drilling parameter regulation methods, the invention provides an automatic regulation method for key drilling parameters based on the Deep Deterministic Policy Gradient (DDPG) algorithm, aiming to dynamically optimize key parameters such as weight on bit (WOB) and rotational speed (RPM) through an intelligent algorithm, thereby improving the rate of penetration (ROP) while ensuring wellbore quality and operational safety. The method constructs a reinforcement learning DDPG algorithm model that dynamically adjusts the drilling parameters, comprehensively considers multiple objectives such as drilling efficiency, bit life and wellbore quality, and significantly improves the intelligence and economic benefit of the drilling process.
In order to solve the technical problems, the invention adopts the following technical scheme:
An automatic regulation method for key drilling parameters based on an improved DDPG algorithm comprises the following steps:
1) Acquiring and preprocessing drilling data, and constructing a training data set for the ROP time-series prediction model;
2) Constructing and training an ROP time-series prediction model that provides an accurate dynamic response and serves as the environment feedback basis for the reinforcement learning DDPG algorithm model;
3) Constructing a simulation environment based on the ROP prediction model of step 2), defining the state space, action space and reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG algorithm model;
4) Training the DDPG algorithm model in the simulation environment, with the intelligent agent learning the optimal parameter regulation strategy.
The drilling parameters in step 1) include weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL), and the target parameter is the rate of penetration (ROP). The data preprocessing comprises the following steps:
1.1 The collected data is subjected to preliminary cleaning: data items whose missing-value proportion exceeds 10% are removed, outliers deviating from the mean by more than three standard deviations are removed, and the missing values are filled by linear interpolation;
1.2 Min-max normalization is applied to the data according to the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
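As an illustration of substeps 1.1 and 1.2, the following Python sketch (using pandas; column names such as "WOB" and "ROP" are assumed placeholders for the actual log channels) drops sparse columns, removes three-sigma outliers, interpolates the gaps and applies min-max normalization. It is a minimal sketch under these assumptions, not the exact preprocessing pipeline of the invention.

```python
import pandas as pd

FEATURES = ["WOB", "RPM", "SPP", "TQ", "MFI", "HKL", "ROP"]  # assumed column names

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, interpolate and min-max normalize raw drilling logs."""
    df = df[FEATURES].copy()
    # 1.1 Drop columns whose missing-value ratio exceeds 10%
    df = df.loc[:, df.isna().mean() <= 0.10]
    # 1.1 Mask values lying more than three standard deviations from the mean
    mean, std = df.mean(), df.std()
    df = df.where((df - mean).abs() <= 3 * std)
    # 1.1 Fill the resulting gaps by linear interpolation over the time index
    df = df.interpolate(method="linear", limit_direction="both")
    # 1.2 Min-max normalization: X_norm = (X - min) / (max - min)
    return (df - df.min()) / (df.max() - df.min())
```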
In step 2), a deep time-series neural network model is constructed using the training set from step 1) to predict the ROP value, which specifically comprises the following substeps:
2.1 A deep neural network model is constructed with a self-attention architecture as its core, exploiting the strengths of self-attention in processing time-series data to capture how drilling parameters and formation characteristics evolve over time. The model comprises an encoder and a decoder based on the self-attention mechanism for extracting multi-level time-series features, with a fully connected layer at the output to generate the predicted ROP value.
2.2 The input features and output target of the model are set: the input features comprise weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL), and the output target is the rate of penetration (ROP) at the corresponding time. The input features are normalized to accelerate training and improve the generalization performance of the model.
2.3 Loss function and optimizer of the time-series prediction model: training uses the mean squared error (MSE) as the loss function,
MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)²
where y_i is the true ROP value, ŷ_i is the model-predicted ROP value, and N is the number of samples. The Adam optimizer is used for parameter updates during training.
2.4 After the evaluation result on the test set meets expectations, the optimal model state is saved.
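For concreteness, a minimal PyTorch sketch of the self-attention encoder-decoder predictor described in substeps 2.1-2.3 is given below; the model width, the number of attention heads and the use of the last time step as the decoder query are illustrative assumptions, not the claimed configuration.

```python
import torch
import torch.nn as nn

class ROPPredictor(nn.Module):
    """Self-attention encoder-decoder that maps a window of drilling
    features (WOB, RPM, SPP, TQ, MFI, HKL) to the ROP at the latest step."""
    def __init__(self, n_features: int = 6, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(n_features, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=4,
            num_encoder_layers=2, num_decoder_layers=2,
            dim_feedforward=128, batch_first=True)
        self.head = nn.Linear(d_model, 1)            # fully connected output layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, n_features), already min-max normalized
        src = self.embed(x)
        tgt = src[:, -1:, :]                          # query the most recent step
        out = self.transformer(src, tgt)              # (batch, 1, d_model)
        return self.head(out).squeeze(-1).squeeze(-1) # predicted ROP, shape (batch,)

model = ROPPredictor()
criterion = nn.MSELoss()                              # MSE loss of substep 2.3
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```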
In step 3), a simulation environment is constructed based on the ROP time-series prediction model trained in step 2), and the key elements of the reinforcement learning DDPG algorithm model are defined, which specifically comprises the following contents:
3.1 A simulation environment is built with the ROP prediction model as its core to simulate the influence of weight on bit (WOB) and rotational speed (RPM) on the rate of penetration (ROP). Inputs to the simulation environment include weight on bit (WOB) and rotational speed (RPM) together with the underlying characteristic parameters standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL). Since the underlying characteristic parameters cannot be directly manipulated by reinforcement learning, they serve as state inputs to the environment and affect the rate-of-penetration prediction and the subsequent optimization process. The simulation environment computes the rate of penetration (ROP) through the ROP prediction model and, under a reward-and-penalty mechanism, provides rate-of-penetration feedback for optimizing the DDPG algorithm model.
Constraints in the simulation environment include the physical operating limits of weight on bit and rotational speed (e.g., the maximum values allowed by the equipment), limits on the bit wear rate, and wellbore quality requirements, which ensure that the regulation schemes generated by the model are feasible and can effectively increase the rate of penetration in actual operation.
3.2 The key elements of the DDPG algorithm model are defined, including a preset action space and a reward function. The action space comprises adjustments of the weight on bit and the rotational speed, the two parameters that the DDPG algorithm model can regulate. The model optimizes the rate of penetration (ROP) by increasing, decreasing or holding these parameters constant.
The reward function is designed for multi-objective optimization, taking into account the ROP improvement, the adjustment cost and the satisfaction of the constraints. Its specific form is:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
where ROP_gain is the rate-of-penetration improvement of the current time step t over the previous time step t−1, multiplied by the weight coefficient α to encourage increases in the rate of penetration; Adjustment_cost is the magnitude of the weight-on-bit and rotational-speed adjustments in the current time step, multiplied by the weight coefficient β to penalize excessive parameter adjustment and reduce operating cost and equipment wear; and Constraint_penalty measures whether engineering constraints (such as the physical limits of weight on bit and rotational speed, the bit wear rate and the wellbore quality) are violated, multiplied by a larger weight coefficient γ to enforce strict compliance with safety and performance standards. Through this reward mechanism, the intelligent agent is not only encouraged to increase the rate of penetration but also constrained to operate within a reasonable adjustment range, and any behavior violating engineering limits is strictly penalized, so that efficiency, safety and economic benefit are jointly optimized during drilling.
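The reward of this substep could be computed as in the sketch below; the default weight values and the binary constraint check are assumptions chosen only for illustration.

```python
def compute_reward(rop_t: float, rop_prev: float,
                   d_wob: float, d_rpm: float,
                   violates_constraints: bool,
                   alpha: float = 1.0, beta: float = 0.1,
                   gamma_c: float = 10.0) -> float:
    """Reward = alpha*ROP_gain - beta*Adjustment_cost - gamma*Constraint_penalty.
    gamma_c is the constraint-penalty weight (distinct from the RL discount factor)."""
    rop_gain = rop_t - rop_prev                   # ROP improvement vs. the previous step
    adjustment_cost = abs(d_wob) + abs(d_rpm)     # magnitude of the WOB/RPM adjustment
    constraint_penalty = 1.0 if violates_constraints else 0.0
    return alpha * rop_gain - beta * adjustment_cost - gamma_c * constraint_penalty
```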
3.3 The reinforcement learning model uses the Deep Deterministic Policy Gradient (DDPG) algorithm and adopts a policy network (Actor) and value network (Critic) structure to optimize the regulation strategy for weight on bit (WOB) and rotational speed (RPM). Compared with the conventional DDPG algorithm model, the invention is optimized and improved in the following respects:
3.3.1 An additional fully connected layer is added to the conventional policy network, bringing the network depth to four layers. The extra layer enables the network to better understand and process the multidimensional data of the drilling process, improving the accuracy and stability of the policy;
3.3.2 In the value network design, the invention introduces residual connections (Residual Connections). Specifically, two linear layers (residual1 and residual2) are added so that the input features are added directly to the output of the hidden layer. The residual connections effectively alleviate the vanishing-gradient problem in deep networks, promote a more stable and efficient training process, and improve the accuracy of the value estimates;
3.3.3 Each layer of the policy network and the value network is initialized with weights drawn from a normal distribution (mean 0, standard deviation 0.1). This initialization helps avoid vanishing or exploding gradients, so that the network performs well in the early stage of training;
3.3.4 To prevent overfitting and improve generalization, Dropout layers (p = 0.1) are added after the first hidden layer of the policy network and after the key hidden layer of the value network. By randomly discarding the output of some neurons, the Dropout layers reduce the network's dependence on specific neurons and enhance the adaptability and robustness of the model in different drilling environments;
3.3.5 The output of the policy network is limited to the range [−1, 1] by the tanh activation function and then scaled by a linear transformation to the actual action ranges, i.e., weight on bit [wob_low, wob_high] and rotational speed [rpm_low, rpm_high]:
action_scaled = action_low + (action + 1)/2 · (action_high − action_low)
where action is the current output value of the policy network and [action_low, action_high] is the corresponding operating range. This design ensures that the output weight on bit and rotational speed lie within a reasonable, operable range and avoids invalid or extreme action values, improving the enforceability and safety of the regulation strategy.
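A possible PyTorch realization of the improved Actor and Critic described in substeps 3.3.1-3.3.5 is sketched below. The hidden-layer sizes are assumptions; the four-layer Actor, the residual linear layers residual1/residual2, the N(0, 0.1) weight initialization, the Dropout layers (p = 0.1) and the tanh output scaling follow the description above.

```python
import torch
import torch.nn as nn

def init_normal(m: nn.Module) -> None:
    # 3.3.3: initialize weights from a normal distribution (mean 0, std 0.1)
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.1)
        nn.init.zeros_(m.bias)

class Actor(nn.Module):
    """Policy network: four fully connected layers (3.3.1), Dropout after the
    first hidden layer (3.3.4), tanh output scaled to the WOB/RPM ranges (3.3.5)."""
    def __init__(self, state_dim: int, bounds: dict):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Dropout(p=0.1),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, 2), nn.Tanh())          # raw actions in [-1, 1]
        self.register_buffer("low", torch.tensor([bounds["wob_low"], bounds["rpm_low"]]))
        self.register_buffer("high", torch.tensor([bounds["wob_high"], bounds["rpm_high"]]))
        self.apply(init_normal)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        a = self.net(state)
        # linear rescaling from [-1, 1] to [low, high]
        return self.low + (a + 1.0) / 2.0 * (self.high - self.low)

class Critic(nn.Module):
    """Value network with residual connections (3.3.2) and Dropout (3.3.4)."""
    def __init__(self, state_dim: int, action_dim: int = 2):
        super().__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 128)
        self.residual1 = nn.Linear(state_dim + action_dim, 128)  # skip path 1
        self.fc2 = nn.Linear(128, 128)
        self.residual2 = nn.Linear(128, 128)                     # skip path 2
        self.drop = nn.Dropout(p=0.1)
        self.out = nn.Linear(128, 1)
        self.apply(init_normal)

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        x = torch.cat([state, action], dim=-1)
        h1 = self.drop(torch.relu(self.fc1(x) + self.residual1(x)))  # input added to hidden output
        h2 = torch.relu(self.fc2(h1) + self.residual2(h1))
        return self.out(h2)                                          # Q(s, a)
```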
In step 4), the simulation environment and the reinforcement learning DDPG algorithm model constructed in step 3) are used to train the model to learn the optimal parameter regulation strategy, which specifically comprises the following contents:
4.1 A training process is established and the model is trained with the deep deterministic policy gradient (DDPG) algorithm. During training, the intelligent agent interacts with the simulation environment: at each time step t it selects an action a_t according to the current state s_t and obtains the next state s_{t+1} and the immediate reward r_t. Through continuous interaction with the environment, experience data is accumulated.
4.2 Experience pool creation and updating: an experience replay buffer (Experience Replay Buffer) is introduced to store the intelligent agent's interaction data as tuples of state, action, reward and next state (s_t, a_t, r_t, s_{t+1}). At each policy update, a mini-batch of data is randomly sampled from the experience pool for training, which breaks data correlation and improves training stability and efficiency.
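A minimal replay-buffer sketch for this substep is shown below; the default capacity of 10,000 records matches the embodiment described later, and the plain-Python implementation is an assumption for illustration.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) tuples and samples mini-batches."""
    def __init__(self, capacity: int = 10_000):
        self.buffer = deque(maxlen=capacity)    # oldest records are overwritten

    def push(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        batch = random.sample(self.buffer, batch_size)   # random sampling breaks correlation
        states, actions, rewards, next_states = map(list, zip(*batch))
        return states, actions, rewards, next_states

    def __len__(self):
        return len(self.buffer)
```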
4.3 The model parameters are updated through the policy network and the value network. The goal of the value network is to estimate the expected cumulative reward for a given state and action, and its loss function is defined as
L_Q = (1/N) · Σ_{i=1}^{N} (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
γ is the discount factor, and Q′ and μ′ denote the target value network and the target policy network with parameters θ^{Q′} and θ^{μ′}, respectively. The value network parameters θ^Q are updated by gradient descent to minimize the loss L_Q.
The purpose of the policy network is to output the optimal action (i.e., the adjustments of weight on bit and rotational speed) based on the current state, with the objective of maximizing the cumulative reward of the intelligent agent in the environment. By learning the mapping from states to actions, the policy network directly decides the regulation strategy that the agent should take in different states.
The goal of updating the policy network is to increase the value obtainable from the actions selected in the various states. The policy network parameters are updated by the policy gradient method, with the gradient computed as
∇_{θ^μ} J ≈ (1/N) · Σ_{i=1}^{N} ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where θ^μ are the parameters of the policy network, μ(s_i | θ^μ) is the action output by the policy network in state s_i, and Q(s, a | θ^Q) is the value network, which evaluates the value of performing the action in state s. Updating θ^μ along this policy gradient allows the policy network to output actions that maximize the expected cumulative reward in a given state.
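The two updates of substep 4.3 can be written as the PyTorch sketch below. Here `actor`, `critic` and their target copies are instances of the networks sketched in substep 3.3; the discount factor value and the soft target-network update rate tau are assumptions, the soft update itself being standard DDPG practice rather than something specified above.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    states, actions, rewards, next_states = batch     # mini-batch tensors

    # Critic: minimize L_Q = mean((y_i - Q(s_i, a_i))^2)
    with torch.no_grad():
        target_q = critic_target(next_states, actor_target(next_states))
        y = rewards + gamma * target_q                # y_i = r_i + gamma*Q'(s_{i+1}, mu'(s_{i+1}))
    critic_loss = F.mse_loss(critic(states, actions), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the policy gradient, i.e. maximize Q(s, mu(s))
    actor_loss = -critic(states, actor(states)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update of the target networks (assumed standard DDPG ingredient)
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, tp in zip(net.parameters(), target.parameters()):
            tp.data.mul_(1.0 - tau).add_(tau * p.data)
```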
4.4 Model convergence and policy optimization: convergence is judged by monitoring the cumulative reward and the stability of the policy during training. The model is considered converged when the cumulative reward reaches the expected level on the validation set and the policy output stabilizes. The optimal model parameters are then saved for subsequent practical application.
The trained DDPG algorithm model can output an optimal weight-on-bit (WOB) and rotational-speed (RPM) adjustment strategy according to the current drilling state, thereby optimizing the rate of penetration (ROP), improving drilling efficiency and ensuring drilling safety.
Compared with the prior art, the invention has the main advantages that:
(1) Realizing the intelligent regulation and control of key parameters. According to the invention, automatic regulation and control of Weight On Bit (WOB) and rotating speed (RPM) are realized through the improved DDPG algorithm model, key parameters can be dynamically regulated according to the real-time drilling state, manual intervention is not needed, and the automation degree of drilling operation is greatly improved.
(2) Improving the accuracy and efficiency of rate of penetration (ROP) optimization. By combining the ROP time-series prediction model with the DDPG algorithm, the invention can predict the rate of penetration in real time under different working conditions and maximize its improvement through intelligent regulation, thereby remarkably increasing drilling efficiency.
(3) Multi-objective optimization capability. The invention designs a multi-objective rewarding function, comprehensively considers the drilling speed improvement, the parameter adjustment cost and the safety constraint, can ensure the service life of the drill bit and the quality of the well bore while improving the drilling speed, and meets various engineering requirements.
(4) And the adaptability of complex working conditions is enhanced. By constructing the simulation environment and improving the dynamic optimization of DDPG algorithm models, the invention can stably run under different stratum conditions, equipment configuration and operation constraints and is suitable for complex drilling working conditions.
(5) Reducing manual intervention and operational risk. In traditional drilling operations, adjustment of the weight on bit and rotational speed relies on manual experience and is prone to misoperation and inefficiency. Through the intelligent decision-making capability of the improved reinforcement learning DDPG algorithm model, the invention significantly reduces the need for manual intervention and lowers operational risk.
In conclusion, the invention provides an efficient, stable and safe automatic regulation and control method for drilling key parameters through an intelligent algorithm and a reinforcement learning DDPG algorithm model, which can remarkably improve drilling efficiency, reduce operation risk, adapt to complex working condition requirements and have wide application prospects.
Detailed Description
Example 1
Referring to fig. 1, the method for automatically regulating and controlling drilling key parameters based on the improved DDPG algorithm comprises the following steps:
1) Acquiring and preprocessing drilling data, and constructing a training data set for the ROP time-series prediction model;
2) Constructing and training an ROP time-series prediction model; the data from step 1) are input into the model to provide an accurate dynamic response, which serves as the environment feedback basis of the reinforcement learning DDPG algorithm model;
3) Constructing a simulation environment based on the ROP prediction model of step 2), defining the state space, action space and reward function of the reinforcement learning DDPG algorithm model, and designing an improved DDPG algorithm model;
4) Training the reinforcement learning DDPG algorithm model in the ROP simulation environment of step 3), with the intelligent agent learning the optimal parameter regulation strategy, and finally saving and applying the model.
In this embodiment, in step 1), key parameter data related to the drilling process is collected in real time by sensors and monitoring systems at the drilling site. The drilling operation parameters comprise weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), drilling time (BDT), mud flow in (MFI) and hook load (HKL), and the target parameter is the rate of penetration (ROP). The collected data includes historical data and real-time drilling data, which together form the complete training data set. The acquired data is then preprocessed as follows:
1.1 The collected data is subjected to preliminary cleaning: data items whose missing-value proportion exceeds 10% are removed and outliers deviating from the mean by more than three standard deviations are removed, ensuring the quality and consistency of the remaining data. Table 1 shows a sample of the processed data:
Table 1. Example of processed ROP logging sample data
1.2 Data interpolation is applied to the cleaned data: missing values are filled by linear interpolation, using the data at adjacent time points to interpolate values missing within a given time interval.
1.3 Min-max normalization (Min-Max Normalization) is applied to the data according to the formula:
X_normalized = (X − min(X)) / (max(X) − min(X))
where X_normalized is the normalized data, X is the original data, min(X) is the minimum of the original data, and max(X) is the maximum of the original data.
In this embodiment, in step 2), an ROP time-series prediction model based on self-attention is constructed from the training set data to capture how the drilling parameters and formation characteristics change over time. As shown in fig. 2, the main flow of the model construction is as follows:
2.1 The model takes a self-attention encoder and decoder as its core architecture, and a fully connected output layer maps the extracted features to the prediction target, namely the rate of penetration (ROP).
2.2 Loss function and model optimization: model training uses the mean squared error (MSE) as the loss function,
MSE = (1/N) · Σ_{i=1}^{N} (y_i − ŷ_i)²
where y_i is the true ROP value, ŷ_i is the model-predicted ROP value, and N is the number of samples. The Adam optimizer is used for parameter updates during training, with an initial learning rate of 0.001 decayed by a factor of 0.9 every 10 training epochs.
2.3 Training process design: the model is trained by mini-batch gradient descent (batch size = 32) for 100 epochs, and the ROP values after automatic parameter adjustment are checked on a validation set to confirm the corresponding improvement.
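Under the settings of substeps 2.2 and 2.3, the training schedule might be written as in the sketch below; `train_dataset` and the `ROPPredictor` class from the earlier sketch are assumed to be available, and the StepLR scheduler is one way to realize the 0.9 decay every 10 epochs.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)  # train_dataset assumed

model = ROPPredictor()                       # self-attention predictor sketched earlier
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.9)

for epoch in range(100):                     # 100 training epochs
    model.train()
    for features, rop_true in train_loader:  # mini-batches of size 32
        optimizer.zero_grad()
        loss = criterion(model(features), rop_true)
        loss.backward()
        optimizer.step()
    scheduler.step()                         # decay the learning rate by 0.9 every 10 epochs
```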
In this embodiment, in step 3), a simulation environment for training and optimizing the DDPG model is constructed based on the ROP time-series prediction model trained in step 2). Referring to fig. 3, the specifics are as follows:
3.1 A simulation environment is constructed. The environment takes the trained ROP prediction model as its core and simulates the dynamic influence of key parameters such as weight on bit (WOB) and rotational speed (RPM) on the rate of penetration (ROP). Its input features include the current weight on bit (WOB), rotational speed (RPM), standpipe pressure (SPP), torque (TQ), mud flow in (MFI) and hook load (HKL). The underlying characteristic parameters serve as state inputs to the environment, directly influence the rate-of-penetration prediction in the simulation, and provide the basis for strategy optimization of the DDPG algorithm model. Because these underlying characteristic parameters cannot be regulated directly by the DDPG model, the reinforcement learning DDPG algorithm mainly learns and optimizes the adjustment strategy for the weight on bit and the rotational speed.
3.2 The basic elements of the reinforcement learning process, including the state space, action space and reward function, are defined on top of the simulation environment. The state space (State, s) includes the two adjustable parameters, current weight on bit (WOB) and rotational speed (RPM), the other non-adjustable parameters, and historical rate-of-penetration (ROP) data. This definition of the state space lets the reinforcement learning DDPG model fully perceive the current drilling conditions and provides sufficient information for subsequent decisions.
The action space (Action, a) is defined as the adjustment values of the weight on bit and the rotational speed, specifically increasing, decreasing or holding them constant, subject to their physical operating limits. Through learning, the reinforcement learning DDPG model outputs the optimal action in different conditions, thereby optimizing the rate of penetration (ROP) while ensuring wellbore quality and bit life. The reward function (Reward, r) is designed as a multi-objective optimization function that jointly considers the rate-of-penetration gain (ROP gain), the parameter adjustment cost (Adjustment cost) and the constraint satisfaction (Constraint penalty). Its specific form is:
Reward = α·ROP_gain − β·Adjustment_cost − γ·Constraint_penalty
where ROP_gain is the rate-of-penetration gain, Adjustment_cost is the adjustment cost, Constraint_penalty is the penalty term for violating the constraints, and α, β, γ are the weight coefficients used to balance the objectives.
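A gym-style sketch of this simulation environment is shown below. The WOB/RPM operating ranges, the replay of the non-adjustable channels from historical logs, and the treatment of the scaled policy output as WOB/RPM setpoints are assumptions made for illustration; `compute_reward` refers to the reward sketch given earlier, and `rop_model` is assumed to be a callable wrapper around the trained predictor.

```python
import numpy as np

class DrillingEnv:
    """Wraps the trained ROP predictor as an RL environment.
    State: [WOB, RPM, SPP, TQ, MFI, HKL]; action: [WOB, RPM] setpoints."""
    def __init__(self, rop_model, log_data, wob_range=(5.0, 35.0), rpm_range=(40.0, 220.0)):
        self.rop_model = rop_model        # callable returning ROP for a feature vector
        self.log = log_data               # historical rows for the non-adjustable channels
        self.wob_range, self.rpm_range = wob_range, rpm_range

    def reset(self):
        self.t = 0
        self.state = np.asarray(self.log[self.t], dtype=np.float32).copy()
        self.prev_rop = float(self.rop_model(self.state))
        return self.state.copy()

    def step(self, action):
        new_wob, new_rpm = action
        d_wob = new_wob - self.state[0]   # adjustment magnitudes for the reward term
        d_rpm = new_rpm - self.state[1]
        # apply the setpoints within the physical operating limits
        self.state[0] = np.clip(new_wob, *self.wob_range)
        self.state[1] = np.clip(new_rpm, *self.rpm_range)
        # non-adjustable channels (SPP, TQ, MFI, HKL) follow the historical log
        self.t += 1
        self.state[2:] = np.asarray(self.log[self.t], dtype=np.float32)[2:]
        rop = float(self.rop_model(self.state))
        reward = compute_reward(rop, self.prev_rop, d_wob, d_rpm,
                                violates_constraints=False)  # constraint check omitted here
        self.prev_rop = rop
        done = self.t >= len(self.log) - 1
        return self.state.copy(), reward, done
```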
3.3 The reinforcement learning model uses the deep reinforcement learning algorithm DDPG (Deep Deterministic Policy Gradient), whose architecture includes a policy network (Actor) and a value network (Critic). The policy network outputs the optimal adjustment action (i.e., the adjustment values of weight on bit and rotational speed) based on the current state, while the value network evaluates the cumulative reward of each state-action pair, helping the policy network optimize the regulation strategy.
Both the policy network and the value network adopt fully connected layers with the ReLU (Rectified Linear Unit) activation function, computed as:
ReLU(x)=max(0,x)
In the model architecture, the policy network receives the current state information of the simulation environment (such as weight on bit, rotational speed, mud flow, borehole depth and historical ROP values) and outputs two continuous values corresponding to the adjustment of the weight on bit and the rotational speed. The value network receives the current state together with the action generated by the policy network and computes a value estimate (i.e., the cumulative reward) for that state-action pair. Through the continuous cooperation and optimization of the policy network and the value network, the model gradually learns the optimal parameter regulation strategy.
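A brief usage illustration of this Actor-Critic cooperation, reusing the network sketches from substep 3.3 of the disclosure (the state dimension and the WOB/RPM ranges are assumed values):

```python
import torch

state_dim = 8                                  # assumed dimension: WOB, RPM, mud flow, depth, historical ROP, etc.
bounds = {"wob_low": 5.0, "wob_high": 35.0, "rpm_low": 40.0, "rpm_high": 220.0}

actor = Actor(state_dim, bounds)               # policy network
critic = Critic(state_dim)                     # value network

state = torch.randn(1, state_dim)              # one (normalized) state vector
action = actor(state)                          # continuous WOB/RPM values scaled to the operable ranges
q_value = critic(state, action)                # cumulative-reward estimate for (state, action)
print(action.detach().numpy(), q_value.item())
```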
In this embodiment, in step 4), training is performed using the simulation environment and the reinforcement learning DDPG algorithm model constructed in step 3). The model is trained to learn the optimal weight-on-bit (WOB) and rotational-speed (RPM) regulation strategy, thereby realizing dynamic optimization of the rate of penetration (ROP). The specific implementation is as follows:
4.1 The training process uses the deep deterministic policy gradient algorithm (DDPG); through continuous interaction between the reinforcement learning agent (Agent) and the simulation environment, the model gradually learns the optimal parameter regulation strategy. At each time step t, the agent generates an action a_t through the policy network based on the current state s_t; the action corresponds to the adjustment values of the weight on bit and the rotational speed. The simulation environment updates the state to s_{t+1} based on the input action a_t and the current state s_t, and generates an immediate reward r_t. The reward value accounts for the rate-of-penetration improvement, the parameter adjustment cost and the satisfaction of the constraints, so as to guide the behavior optimization of the agent. Through this continuous interaction, the agent accumulates experience data that provides training samples for subsequent model optimization.
4.2 Experience pool construction and updating: an experience replay buffer (Experience Replay Buffer) is introduced to store the interaction data between the agent and the simulation environment. The data is recorded as tuples of state, action, immediate reward and next state (s_t, a_t, r_t, s_{t+1}). Each time the model parameters are updated, a mini-batch is randomly sampled from the experience pool to train the policy network and the value network. Random sampling breaks data correlation, so the experience pool improves the stability and efficiency of training. The capacity of the experience pool is typically set to a fixed size, e.g., 10,000 interaction records, with the newest data overwriting the oldest, ensuring that training is always based on recent interaction information.
4.3 Settings for updating the model parameters. During training, the parameters of the policy network and the value network are updated with different optimization objectives. The value network aims to estimate the cumulative reward under the current state and action, and its loss function is defined as
L_Q = (1/N) · Σ_{i=1}^{N} (y_i − Q(s_i, a_i | θ^Q))²
where
y_i = r_i + γ·Q′(s_{i+1}, μ′(s_{i+1} | θ^{μ′}) | θ^{Q′})
s_i is the state at time step i and a_i is the action taken in that state; y_i denotes the target cumulative reward, γ is the discount factor, and Q′ and μ′ denote the target value network and the target policy network with parameters θ^{Q′} and θ^{μ′}, respectively. The value network parameters θ^Q are updated by gradient descent to minimize the loss L_Q.
The policy network is responsible for generating the optimal action a based on the current state s, with the optimization objective of maximizing the agent's cumulative reward. Its parameters are updated by the policy gradient method, with the gradient computed as
∇_{θ^μ} J ≈ (1/N) · Σ_{i=1}^{N} ∇_a Q(s, a | θ^Q)|_{s=s_i, a=μ(s_i)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_i}
where θ^μ are the parameters of the policy network, μ(s_i | θ^μ) is the action output by the policy network in state s_i, and Q(s, a | θ^Q) is the value network, which evaluates the value of performing the action in state s. Updating θ^μ along this policy gradient allows the policy network to output actions that maximize the expected cumulative reward in a given state.
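Substeps 4.1-4.4 can be combined into a training loop such as the sketch below, reusing `DrillingEnv`, `ReplayBuffer`, `Actor`, `Critic` and `ddpg_update` from the earlier sketches. The episode count, exploration-noise scale, batch size, learning rates and the simple convergence monitoring are illustrative assumptions, and `rop_model` and `log_data` are assumed to be available from steps 2) and 1).

```python
import copy
import numpy as np
import torch

env = DrillingEnv(rop_model, log_data)                    # simulation environment from step 3)
buffer = ReplayBuffer(capacity=10_000)                    # experience pool of 10,000 records
state_dim = env.reset().shape[0]
bounds = {"wob_low": 5.0, "wob_high": 35.0, "rpm_low": 40.0, "rpm_high": 220.0}
actor, critic = Actor(state_dim, bounds), Critic(state_dim)
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

for episode in range(500):                                # assumed number of episodes
    state, episode_reward, done = env.reset(), 0.0, False
    while not done:
        with torch.no_grad():
            a = actor(torch.as_tensor(state).unsqueeze(0)).squeeze(0).numpy()
        a = a + np.random.normal(0.0, 0.1, size=2)        # exploration noise (assumed scale)
        next_state, reward, done = env.step(a)
        buffer.push(state, a, reward, next_state)         # (s_t, a_t, r_t, s_{t+1})
        state, episode_reward = next_state, episode_reward + reward
        if len(buffer) >= 64:
            s, act, r, s2 = buffer.sample(64)             # random mini-batch
            batch = (torch.as_tensor(np.array(s), dtype=torch.float32),
                     torch.as_tensor(np.array(act), dtype=torch.float32),
                     torch.as_tensor(np.array(r), dtype=torch.float32).unsqueeze(1),
                     torch.as_tensor(np.array(s2), dtype=torch.float32))
            ddpg_update(batch, actor, critic, actor_target, critic_target,
                        actor_opt, critic_opt)
    if episode % 10 == 0:                                 # 4.4: monitor the cumulative reward
        print(f"episode {episode}: cumulative reward {episode_reward:.2f}")

# save the converged policy and value networks for deployment
torch.save({"actor": actor.state_dict(), "critic": critic.state_dict()}, "ddpg_drilling.pt")
```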
4.4 During training, the convergence of the model is judged by monitoring the cumulative reward and the stability of the policy. When the agent's performance in the simulation environment stabilizes, the cumulative reward curve reaches the expected level, and the actions output by the policy network change little over multiple rounds of interaction, the model is considered to have converged. At that point, the optimal parameters of the trained policy network and value network are saved for subsequent practical application.
4.5 In the present example, the model generates optimal Weight On Bit (WOB) and rotational speed (RPM) adjustment strategies based on real-time drilling conditions (e.g., current weight on bit, rotational speed, mud flow, etc.) to optimize rate of penetration (ROP). Referring to fig. 4, the improved DDPG algorithm model of the present invention can significantly improve drilling efficiency by continuously adjusting and controlling under different conditions.