Skip to content

Latest commit

 

History

History
executable file
·
230 lines (186 loc) · 11.9 KB

Hot_Loading_CN.md

File metadata and controls

executable file
·
230 lines (186 loc) · 11.9 KB

Paddle Serving 中的模型热加载

背景

在实际的工业场景下,通常是远端定期不间断产出模型,线上服务端需要在服务不中断的情况下拉取新模型对旧模型进行更新迭代。

Server Monitor

Paddle Serving 提供了一个自动监控脚本,远端地址更新模型后会拉取新模型更新本地模型,同时更新本地模型文件夹中的时间戳文件 fluid_time_stamp 实现热加载。

目前支持下面几种类型的远端监控 Monitor:

Monitor类型 描述 特殊选项
general 远端无认证,可以通过 wget 直接访问下载文件(如无需认证的FTP,BOS等) general_host 通用远端host
hdfs/afs(HadoopMonitor) 远端为 HDFS 或 AFS,通过 Hadoop-Client 执行相关命令 hadoop_bin Hadoop 二进制的路径
fs_name Hadoop fs_name,默认为空
fs_ugi Hadoop fs_ugi,默认为空
ftp 远端为 FTP,通过 ftplib 进行相关访问(使用该 Monitor,您可能需要执行 pip install ftplib 下载 ftplib ftp_host FTP host
ftp_port FTP port
ftp_username FTP username,默认为空
ftp_password FTP password,默认为空
Monitor通用选项 描述 默认值
type 指定 Monitor 类型
remote_path 指定远端的基础路径
remote_model_name 指定远端需要拉取的模型名
remote_donefile_name 指定远端标志模型更新完毕的 donefile 文件名
local_path 指定本地工作路径
local_model_name 指定本地模型名
local_timestamp_file 指定本地用于热加载的时间戳文件,该文件被认为在 local_path/local_model_name 下。 fluid_time_file
local_tmp_path 指定本地存放临时文件的文件夹路径,若不存在则自动创建。 _serving_monitor_tmp
interval 指定轮询间隔时间,单位为秒。 10
unpacked_filename Monitor 支持 tarfile 打包的远程模型。如果远程模型是打包格式,则需要设置该选项来告知 Monitor 解压后的文件名。 None
debug 如果添加 --debug 选项,则输出更详细的中间信息。 默认不添加该选项

下面通过 HadoopMonitor 示例来展示 Paddle Serving 的模型热加载功能。

HadoopMonitor 示例

示例中在 product_path 中生产模型上传至 hdfs,在 server_path 中模拟服务端模型热加载:

.
├── product_path
└── server_path

一.生产模型

product_path 下运行下面的 Python 代码生产模型(运行前需要修改 hadoop 相关的参数),每隔 60 秒会产出 Boston 房价预测模型的打包文件 uci_housing.tar.gz 并上传至 hdfs 的/路径下,上传完毕后更新时间戳文件 donefile 并上传至 hdfs 的/路径下。

import os
import sys
import time
import tarfile
import paddle
import paddle.fluid as fluid
import paddle_serving_client.io as serving_io

train_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.uci_housing.train(), buf_size=500),
    batch_size=16)

test_reader = paddle.batch(
    paddle.reader.shuffle(
        paddle.dataset.uci_housing.test(), buf_size=500),
    batch_size=16)

x = fluid.data(name='x', shape=[None, 13], dtype='float32')
y = fluid.data(name='y', shape=[None, 1], dtype='float32')

y_predict = fluid.layers.fc(input=x, size=1, act=None)
cost = fluid.layers.square_error_cost(input=y_predict, label=y)
avg_loss = fluid.layers.mean(cost)
sgd_optimizer = fluid.optimizer.SGD(learning_rate=0.01)
sgd_optimizer.minimize(avg_loss)

place = fluid.CPUPlace()
feeder = fluid.DataFeeder(place=place, feed_list=[x, y])
exe = fluid.Executor(place)
exe.run(fluid.default_startup_program())

def push_to_hdfs(local_file_path, remote_path):
    afs = 'afs://***.***.***.***:***' # User needs to change
    uci = '***,***' # User needs to change
    hadoop_bin = '/path/to/haddop/bin' # User needs to change
    prefix = '{} fs -Dfs.default.name={} -Dhadoop.job.ugi={}'.format(hadoop_bin, afs, uci)
    os.system('{} -rmr {}/{}'.format(
      prefix, remote_path, local_file_path))
    os.system('{} -put {} {}'.format(
      prefix, local_file_path, remote_path))

name = "uci_housing"
for pass_id in range(30):
    for data_train in train_reader():
        avg_loss_value, = exe.run(fluid.default_main_program(),
                                  feed=feeder.feed(data_train),
                                  fetch_list=[avg_loss])
    # Simulate the production model every other period of time
    time.sleep(60)
    model_name = "{}_model".format(name)
    client_name = "{}_client".format(name)
    serving_io.save_model(model_name, client_name,
                          {"x": x}, {"price": y_predict},
                          fluid.default_main_program())
    # Packing model
    tar_name = "{}.tar.gz".format(name)
    tar = tarfile.open(tar_name, 'w:gz')
    tar.add(model_name)
    tar.close()

    # Push packaged model file to hdfs
    push_to_hdfs(tar_name, '/')

    # Generate donefile
    donefile_name = 'donefile'
    os.system('touch {}'.format(donefile_name))

    # Push donefile to hdfs
    push_to_hdfs(donefile_name, '/')

hdfs 上的文件如下列所示:

# hadoop fs -ls /
Found 2 items
-rw-r--r--   1 root supergroup          0 2020-04-02 02:54 /donefile
-rw-r--r--   1 root supergroup       2101 2020-04-02 02:54 /uci_housing.tar.gz

二.服务端加载模型

进入 server_path 文件夹。

  1. 用初始模型启动 Server 端

这里使用预训练的 Boston 房价预测模型作为初始模型:

wget --no-check-certificate https://paddle-serving.bj.bcebos.com/uci_housing.tar.gz
tar -xzf uci_housing.tar.gz

启动 Server 端:

python -m paddle_serving_server.serve --model uci_housing_model --thread 10 --port 9292
  1. 执行监控程序

用下面的命令来执行 HDFS 监控程序:

python -m paddle_serving_server.monitor \
	--type='hdfs' --hadoop_bin='/hadoop-3.1.2/bin/hadoop' \
	--remote_path='/' --remote_model_name='uci_housing.tar.gz' \
	--remote_donefile_name='donefile' --local_path='.' \
	--local_model_name='uci_housing_model' --local_timestamp_file='fluid_time_file' \
	--local_tmp_path='_tmp' --unpacked_filename='uci_housing_model' --debug

上面代码通过轮询方式监控远程 HDFS 地址/的时间戳文件/donefile,当时间戳变更则认为远程模型已经更新,将远程打包模型/uci_housing.tar.gz拉取到本地临时路径./_tmp/uci_housing.tar.gz下,解包出模型文件./_tmp/uci_housing_model后,更新本地模型./uci_housing_model以及Paddle Serving的时间戳文件./uci_housing_model/fluid_time_file

预计输出如下:

2020-04-02 10:12 INFO     [monitor.py:85] _hadoop_bin: /hadoop-3.1.2/bin/hadoop
2020-04-02 10:12 INFO     [monitor.py:85] _fs_name:
2020-04-02 10:12 INFO     [monitor.py:85] _fs_ugi:
2020-04-02 10:12 INFO     [monitor.py:209] AFS prefix cmd: /hadoop-3.1.2/bin/hadoop fs
2020-04-02 10:12 INFO     [monitor.py:85] _remote_path: /
2020-04-02 10:12 INFO     [monitor.py:85] _remote_model_name: uci_housing.tar.gz
2020-04-02 10:12 INFO     [monitor.py:85] _remote_donefile_name: donefile
2020-04-02 10:12 INFO     [monitor.py:85] _local_model_name: uci_housing_model
2020-04-02 10:12 INFO     [monitor.py:85] _local_path: .
2020-04-02 10:12 INFO     [monitor.py:85] _local_timestamp_file: fluid_time_file
2020-04-02 10:12 INFO     [monitor.py:85] _local_tmp_path: _tmp
2020-04-02 10:12 INFO     [monitor.py:85] _interval: 10
2020-04-02 10:12 DEBUG    [monitor.py:214] check cmd: /hadoop-3.1.2/bin/hadoop fs  -ls /donefile 2>/dev/null
2020-04-02 10:12 DEBUG    [monitor.py:216] resp: -rw-r--r--   1 root supergroup          0 2020-04-02 10:11 /donefile
2020-04-02 10:12 INFO     [monitor.py:138] doneilfe(donefile) changed.
2020-04-02 10:12 DEBUG    [monitor.py:233] pull cmd: /hadoop-3.1.2/bin/hadoop fs  -get /uci_housing.tar.gz _tmp/uci_housing.tar.gz 2>/dev/null
2020-04-02 10:12 INFO     [monitor.py:144] pull remote model(uci_housing.tar.gz).
2020-04-02 10:12 INFO     [monitor.py:98] unpack remote file(uci_housing.tar.gz).
2020-04-02 10:12 DEBUG    [monitor.py:108] remove packed file(uci_housing.tar.gz).
2020-04-02 10:12 INFO     [monitor.py:110] using unpacked filename: uci_housing_model.
2020-04-02 10:12 DEBUG    [monitor.py:175] update model cmd: cp -r _tmp/uci_housing_model/* ./uci_housing_model
2020-04-02 10:12 INFO     [monitor.py:152] update local model(uci_housing_model).
2020-04-02 10:12 DEBUG    [monitor.py:184] update timestamp cmd: touch ./uci_housing_model/fluid_time_file
2020-04-02 10:12 INFO     [monitor.py:157] update model timestamp(fluid_time_file).
2020-04-02 10:12 INFO     [monitor.py:161] sleep 10s.
2020-04-02 10:12 DEBUG    [monitor.py:214] check cmd: /hadoop-3.1.2/bin/hadoop fs  -ls /donefile 2>/dev/null
2020-04-02 10:12 DEBUG    [monitor.py:216] resp: -rw-r--r--   1 root supergroup          0 2020-04-02 10:11 /donefile
2020-04-02 10:12 INFO     [monitor.py:161] sleep 10s.
  1. 查看 Server 日志

通过下面命令查看 Server 的运行日志:

tail -f log/serving.INFO

日志中显示模型已经被热加载:

I0330 09:38:40.087316  7361 server.cpp:150] Begin reload framework...
W0330 09:38:40.087399  7361 infer.h:656] Succ reload version engine: 18446744073709551615
I0330 09:38:40.087414  7361 manager.h:131] Finish reload 1 workflow(s)
I0330 09:38:50.087535  7361 server.cpp:150] Begin reload framework...
W0330 09:38:50.087641  7361 infer.h:250] begin reload model[uci_housing_model].
I0330 09:38:50.087972  7361 infer.h:66] InferEngineCreationParams: model_path = uci_housing_model, enable_memory_optimization = 0, static_optimization = 0, force_update_static_cache = 0
I0330 09:38:50.088027  7361 analysis_predictor.cc:88] Profiler is deactivated, and no profiling report will be generated.
I0330 09:38:50.088393  7361 analysis_predictor.cc:841] MODEL VERSION: 1.7.1
I0330 09:38:50.088413  7361 analysis_predictor.cc:843] PREDICTOR VERSION: 1.6.3
I0330 09:38:50.089519  7361 graph_pattern_detector.cc:96] ---  detected 1 subgraphs
I0330 09:38:50.090925  7361 analysis_predictor.cc:470] ======= optimize end =======
W0330 09:38:50.090986  7361 infer.h:472] Succ load common model[0x7fc83c06abd0], path[uci_housing_model].
I0330 09:38:50.091022  7361 analysis_predictor.cc:88] Profiler is deactivated, and no profiling report will be generated.
W0330 09:38:50.091050  7361 infer.h:509] td_core[0x7fc83c0ad770] clone model from pd_core[0x7fc83c06abd0] succ, cur_idx[0].
...
W0330 09:38:50.091784  7361 infer.h:489] Succ load clone model, path[uci_housing_model]
W0330 09:38:50.091794  7361 infer.h:656] Succ reload version engine: 18446744073709551615
I0330 09:38:50.091820  7361 manager.h:131] Finish reload 1 workflow(s)
I0330 09:39:00.091987  7361 server.cpp:150] Begin reload framework...
W0330 09:39:00.092161  7361 infer.h:656] Succ reload version engine: 18446744073709551615
I0330 09:39:00.092177  7361 manager.h:131] Finish reload 1 workflow(s)