CUDA

$ sudo yum-config-manager --add-repo http://developer.download.nvidia.com/compute/cuda/repos/rhel7/x86_64/cuda-rhel7.repo
$ sudo yum clean all
$ sudo yum -y install nvidia-driver-latest-dkms cuda
$ sudo yum -y install cuda-drivers

cuDNN

Follow the steps in

https://qiita.com/tatsuya11bbs/items/70205b070c7afd7dd651

Downloading cuDNN requires registering as an NVIDIA user.

https://developer.nvidia.com/rdp/cudnn-download

$ cp cudnn-10.2-linux-x64-v7.6.5.32.tgz /usr/local/
$ cd /usr/local/
$ tar zxvf cudnn-10.2-linux-x64-v7.6.5.32.tgz

Pyenv

$ git clone https://github.com/yyuu/pyenv.git ~/.pyenv

Add the following three lines to ~/.bashrc

export PYENV_ROOT=$HOME/.pyenv
export PATH=$PYENV_ROOT/bin:$PATH
eval "$(pyenv init -)"

Anaconda

$ pyenv install --list

$ pyenv install anaconda3-2019.10

$ pyenv versions
* system (set by /home/oda/.pyenv/version)
  anaconda3-2019.10
$ pyenv global anaconda3-2019.10
$ pyenv versions
  system
* anaconda3-2019.10 (set by /home/oda/.pyenv/version)

Keras

$ pip install --upgrade pip
$ pip install keras

TensorFlow

$ pip install --upgrade pip
$ pip install tensorflow
$ pip install tensorflow-gpu
$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49) 
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
2020-01-30 12:32:29.462468: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-01-30 12:32:29.462551: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-01-30 12:32:29.462562: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
>>>

TensorRT

To provide libnvinfer.so.6, I tried installing TensorRT.

https://developer.nvidia.com/tensorrt

cp TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz /usr/local/
cd /usr/local/
tar zxvf TensorRT-7.0.0.11.Ubuntu-18.04.x86_64-gnu.cuda-10.2.cudnn7.6.tar.gz 
cd /usr/local/TensorRT-7.0.0.11/lib

libnvinfer.so.7 exists but libnvinfer.so.6 does not, so I tried creating symlinks:

ln -s libnvinfer.so libnvinfer.so.6
ln -s libnvinfer_plugin.so libnvinfer_plugin.so.6
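Before linking, it helps to see which versioned files the TensorRT lib directory actually contains. The following is a minimal Python sketch; the helper name available_versions is hypothetical, and the directory is the one used in these notes. Keep in mind that a symlink only changes the name under which a file is found; it does not make the version 7 library compatible with a program built against version 6.

```python
import glob
import os

def available_versions(lib_dir, stem):
    """Return the file names in lib_dir that start with '<stem>.so', sorted."""
    pattern = os.path.join(lib_dir, stem + ".so*")
    return sorted(os.path.basename(path) for path in glob.glob(pattern))

if __name__ == "__main__":
    # Path taken from the notes; returns an empty list if the directory is absent.
    print(available_versions("/usr/local/TensorRT-7.0.0.11/lib", "libnvinfer"))
```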

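I added

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64

to ~/.bashrc and then tried things such as upgrading glibc from 2.17 to 2.27 (presumably needed because the TensorRT tarball is the Ubuntu 18.04 build), but got stuck.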
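Appending to LD_LIBRARY_PATH while it is still unset leaves an empty first entry (the later logs show the value starting with a colon), and sourcing ~/.bashrc more than once repeats entries. A minimal Python sketch for inspecting the variable follows; the function name inspect_library_path is hypothetical and not part of any library.

```python
import os

def inspect_library_path(value):
    """Split an LD_LIBRARY_PATH-style string and report empty and repeated entries."""
    entries = value.split(":") if value else []
    seen = set()
    unique, duplicates = [], []
    empty_entries = 0
    for entry in entries:
        if entry == "":
            # The dynamic loader treats an empty entry as the current directory.
            empty_entries += 1
        elif entry in seen:
            duplicates.append(entry)
        else:
            seen.add(entry)
            unique.append(entry)
    return {"unique": unique, "duplicates": duplicates, "empty_entries": empty_entries}

if __name__ == "__main__":
    print(inspect_library_path(os.environ.get("LD_LIBRARY_PATH", "")))
```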

TensorFlow 2.0.1

The libnvinfer.so.6 problem in TensorFlow 2.1.0 had been reported 10 days earlier at https://github.com/tensorflow/tensorflow/issues/35968, so I gave up on working around it and downgraded.

$ pip install tensorflow==2.0.1
$ pip install tensorflow-gpu==2.0.1
$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow as tf
>>>

TensorFlow itself installed fine. However, checking the devices as described in https://thr3a.hatenablog.com/entry/20180113/1515820265 gives

>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2020-01-30 14:21:30.013092: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-30 14:21:30.022186: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-01-30 14:21:30.022941: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55b1c0f6b170 executing computations on platform Host. Devices:
2020-01-30 14:21:30.022982: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-01-30 14:21:30.024665: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-30 14:21:30.025440: E tensorflow/stream_executor/cuda/cuda_driver.cc:318] failed call to cuInit: CUDA_ERROR_SYSTEM_DRIVER_MISMATCH: system has unsupported display driver / cuda driver combination
2020-01-30 14:21:30.025468: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: localhost.localdomain
2020-01-30 14:21:30.025478: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: localhost.localdomain
2020-01-30 14:21:30.025514: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:200] libcuda reported version is: 440.33.1
2020-01-30 14:21:30.025543: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:204] kernel reported version is: 440.44.0
2020-01-30 14:21:30.025553: E tensorflow/stream_executor/cuda/cuda_diagnostics.cc:313] kernel version 440.44.0 does not match DSO version 440.33.1 -- cannot find working devices in this configuration
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15191092083718892008
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 9175125806567128648
physical_device_desc: "device: XLA_CPU device"
]

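so the GPU is not recognized: the log reports that the kernel driver version (440.44.0) does not match the libcuda version (440.33.1).

After a reboot, however, the GPU was recognized.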
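The mismatch can be picked out of a saved log mechanically. The following Python sketch assumes the two "reported version is" lines have the wording shown in the log above; the function names driver_versions and is_mismatch are hypothetical.

```python
import re

def driver_versions(log_text):
    """Extract (libcuda_version, kernel_version) from TensorFlow's CUDA diagnostics lines."""
    lib = re.search(r"libcuda reported version is: (\S+)", log_text)
    kernel = re.search(r"kernel reported version is: (\S+)", log_text)
    return (lib.group(1) if lib else None, kernel.group(1) if kernel else None)

def is_mismatch(log_text):
    """True only when both versions are present in the log and differ as strings."""
    lib, kernel = driver_versions(log_text)
    # Plain string comparison: "440.33.1" and "440.33.01" would count as different.
    return lib is not None and kernel is not None and lib != kernel
```

Applied to the log above, driver_versions returns the pair 440.33.1 and 440.44.0, and is_mismatch is true.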

$ python
Python 3.7.4 (default, Aug 13 2019, 20:35:49)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from tensorflow.python.client import device_lib
>>> device_lib.list_local_devices()
2020-01-30 15:02:51.584089: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-30 15:02:51.594858: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 3600000000 Hz
2020-01-30 15:02:51.595512: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564f260a01a0 executing computations on platform Host. Devices:
2020-01-30 15:02:51.595559: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): Host, Default Version
2020-01-30 15:02:51.597305: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-01-30 15:02:51.664145: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-30 15:02:51.664450: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564f260c30d0 executing computations on platform CUDA. Devices:
2020-01-30 15:02:51.664470: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): GeForce RTX 2060 SUPER, Compute Capability 7.5
2020-01-30 15:02:51.664603: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:1006] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-01-30 15:02:51.664822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties:
name: GeForce RTX 2060 SUPER major: 7 minor: 5 memoryClockRate(GHz): 1.83
pciBusID: 0000:01:00.0
2020-01-30 15:02:51.664906: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
2020-01-30 15:02:51.664958: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
2020-01-30 15:02:51.665010: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
2020-01-30 15:02:51.665059: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
2020-01-30 15:02:51.665108: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
2020-01-30 15:02:51.665158: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
2020-01-30 15:02:51.667833: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-01-30 15:02:51.667848: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1641] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-01-30 15:02:51.667868: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1159] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-01-30 15:02:51.667878: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1165]      0
2020-01-30 15:02:51.667885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1178] 0:   N
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 6508581553371486294
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 1987290959829944833
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 7094331008891992157
physical_device_desc: "device: XLA_GPU device"
]
>>>

Looking closely, though, there are warnings: some libraries cannot be found, so the GPU cannot be used.

2020-01-30 16:06:10.616708: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: :/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64:/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64
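The warnings name the exact shared objects that TensorFlow tries to open, so the same check can be repeated without starting TensorFlow. The following is a minimal Python sketch using ctypes; the list of names is copied from the log above, and the function name missing_libraries is hypothetical. LD_LIBRARY_PATH is read when the process starts, so it has to be set before launching Python.

```python
import ctypes

# Library names taken from the dso_loader messages in the log above.
REQUIRED = [
    "libcudart.so.10.0",
    "libcublas.so.10.0",
    "libcufft.so.10.0",
    "libcurand.so.10.0",
    "libcusolver.so.10.0",
    "libcusparse.so.10.0",
    "libcudnn.so.7",
]

def missing_libraries(names):
    """Return the names that the dynamic loader cannot open."""
    missing = []
    for name in names:
        try:
            ctypes.CDLL(name)
        except OSError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    for name in missing_libraries(REQUIRED):
        print("cannot load:", name)
```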

CUDA 10.2 is installed, but TensorFlow 2.0.1 appears to require CUDA 10.0, so I looked up the available versions with

yum --showduplicates search cuda

installed the missing libraries with

yum install cuda-cublas-10-0-10.0.130-1.x86_64
yum install cuda-cufft-10-0-10.0.130-1.x86_64
yum install cuda-curand-10-0-10.0.130-1.x86_64
yum install cuda-cusolver-10-0-10.0.130-1.x86_64
yum install cuda-cusparse-10-0-10.0.130-1.x86_64
yum install cuda-cudart-10-0-10.0.130-1.x86_64

and set

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/TensorRT-7.0.0.11/lib:/usr/local/cuda/lib64:/usr/local/cuda-10.0/lib64/

in ~/.bashrc. The warnings then disappeared, and running nvidia-smi while a TensorFlow job was running showed

$ nvidia-smi
Thu Jan 30 17:53:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce RTX 206...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   52C    P2    44W / 215W |   7692MiB /  7982MiB |      4%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0     18243      C   ...v/versions/anaconda3-2019.10/bin/python  7681MiB |
+-----------------------------------------------------------------------------+

so the GPU appears to be in use.
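The same check can be made from Python. Note that XLA_GPU is a separate device type: the earlier listing contained an XLA_GPU entry even though TensorFlow had skipped registering GPU devices. The following is a minimal sketch, assuming the device objects expose name and device_type as in the listings above; gpu_device_names and tensorflow_gpu_names are hypothetical helper names.

```python
from types import SimpleNamespace

def gpu_device_names(devices):
    """Return the names of devices whose device_type is exactly "GPU"."""
    return [d.name for d in devices if d.device_type == "GPU"]

def tensorflow_gpu_names():
    """Query TensorFlow the same way as the sessions above (needs TensorFlow installed)."""
    from tensorflow.python.client import device_lib
    return gpu_device_names(device_lib.list_local_devices())

# Device types copied from the earlier listing, where GPU registration was skipped.
earlier = [
    SimpleNamespace(name="/device:CPU:0", device_type="CPU"),
    SimpleNamespace(name="/device:XLA_CPU:0", device_type="XLA_CPU"),
    SimpleNamespace(name="/device:XLA_GPU:0", device_type="XLA_GPU"),
]
print(gpu_device_names(earlier))  # prints []
```

With a working setup, tensorflow_gpu_names() should return a non-empty list.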

TensorFlow test jobs

Test jobs based on https://www.amazon.co.jp/gp/product/4839969515/.

https://github.com/yusugomori/deeplearning-keras-tf2-torch

Problems

Using an RNN failed with a cuDNN-related error.
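The notes do not record the error message, so the cause is unknown. Two candidates, both assumptions: the installed cuDNN archive targets CUDA 10.2 while TensorFlow 2.0.1 loads the CUDA 10.0 libraries, or TensorFlow had reserved nearly all GPU memory (the nvidia-smi output above shows 7692MiB of 7982MiB in use) and cuDNN could not initialize. For the second case, one thing worth trying is on-demand memory allocation. The following is a sketch using the tf.config.experimental calls available in TensorFlow 2.0; the wrapper name enable_memory_growth is hypothetical.

```python
def enable_memory_growth():
    """Ask TensorFlow to allocate GPU memory on demand; return the number of GPUs configured."""
    import tensorflow as tf

    gpus = tf.config.experimental.list_physical_devices("GPU")
    for gpu in gpus:
        # Must be called before the GPU is first used, otherwise TensorFlow raises RuntimeError.
        tf.config.experimental.set_memory_growth(gpu, True)
    return len(gpus)
```

Call enable_memory_growth() at the top of the script, before building any model.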

Things I wanted to do

Autoencoder

Restricted Boltzmann machine

How to load and save a model and its parameters in Keras

https://qiita.com/supersaiakujin/items/b9c9da9497c2163d5a74
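As a rough illustration of what saving and loading looks like with the Keras bundled in TensorFlow 2: model.save writes the whole model, save_weights writes the parameters only, and load_model and load_weights read them back. This is a minimal sketch, not taken from the linked article; the tiny model, the file names, and the helper names build_model and save_and_reload are made up for the example.

```python
import os

def build_model(keras):
    """A tiny functional model: 3 inputs, one Dense layer with 2 outputs."""
    inputs = keras.Input(shape=(3,))
    outputs = keras.layers.Dense(2)(inputs)
    return keras.Model(inputs, outputs)

def save_and_reload(workdir):
    """Save a model and its weights, reload both, and compare predictions."""
    import numpy as np
    from tensorflow import keras

    model = build_model(keras)
    model_path = os.path.join(workdir, "model.h5")            # whole model (HDF5)
    weights_path = os.path.join(workdir, "model.weights.h5")  # parameters only
    model.save(model_path)
    model.save_weights(weights_path)

    restored = keras.models.load_model(model_path)  # rebuilds architecture and weights
    fresh = build_model(keras)                       # same architecture, new random weights
    fresh.load_weights(weights_path)                 # overwrite with the saved parameters

    x = np.ones((1, 3), dtype="float32")
    expected = model.predict(x)
    return (bool(np.allclose(expected, restored.predict(x))),
            bool(np.allclose(expected, fresh.predict(x))))
```

Both values in the returned pair should be true when the round trip works.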

Colaboratory

Using https://colab.research.google.com/notebooks/welcome.ipynb?hl=ja might be easier.


Attachments: mnist.py, 01_mlp_toy_problem_keras.py

Last-modified: 2020-04-09 (Thu) 16:41:01