{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "toc": "true"
   },
   "source": [
    "<h1>Table of Contents<span class=\"tocSkip\"></span></h1>\n",
    "<div class=\"toc\"><ul class=\"toc-item\"><li><span><a href=\"#Breast-Cancer-Wisconsin-(Diagnostic)-Data-Set\" data-toc-modified-id=\"Breast-Cancer-Wisconsin-(Diagnostic)-Data-Set-1\"><span class=\"toc-item-num\">1&nbsp;&nbsp;</span>Breast Cancer Wisconsin (Diagnostic) Data Set</a></span><ul class=\"toc-item\"><li><span><a href=\"#Attribute-Information:\" data-toc-modified-id=\"Attribute-Information:-1.1\"><span class=\"toc-item-num\">1.1&nbsp;&nbsp;</span>Attribute Information:</a></span></li><li><span><a href=\"#分類器\" data-toc-modified-id=\"分類器-1.2\"><span class=\"toc-item-num\">1.2&nbsp;&nbsp;</span>分類器</a></span></li><li><span><a href=\"#仮説クラス\" data-toc-modified-id=\"仮説クラス-1.3\"><span class=\"toc-item-num\">1.3&nbsp;&nbsp;</span>仮説クラス</a></span></li></ul></li><li><span><a href=\"#最急降下法\" data-toc-modified-id=\"最急降下法-2\"><span class=\"toc-item-num\">2&nbsp;&nbsp;</span>最急降下法</a></span><ul class=\"toc-item\"><li><span><a href=\"#print_w\" data-toc-modified-id=\"print_w-2.1\"><span class=\"toc-item-num\">2.1&nbsp;&nbsp;</span>print_w</a></span></li><li><span><a href=\"#データの読み込みと初期化\" data-toc-modified-id=\"データの読み込みと初期化-2.2\"><span class=\"toc-item-num\">2.2&nbsp;&nbsp;</span>データの読み込みと初期化</a></span></li><li><span><a href=\"#最急降下法によるw探索(steepest-descent)\" data-toc-modified-id=\"最急降下法によるw探索(steepest-descent)-2.3\"><span class=\"toc-item-num\">2.3&nbsp;&nbsp;</span>最急降下法によるw探索(steepest descent)</a></span></li></ul></li><li><span><a href=\"#結果\" data-toc-modified-id=\"結果-3\"><span class=\"toc-item-num\">3&nbsp;&nbsp;</span>結果</a></span></li><li><span><a href=\"#QR-decomposition\" data-toc-modified-id=\"QR-decomposition-4\"><span class=\"toc-item-num\">4&nbsp;&nbsp;</span>QR decomposition</a></span></li></ul></div>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Breast Cancer Wisconsin (Diagnostic) Data Set"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic)>\n",
    "\n",
    "## Attribute Information:\n",
    "\n",
    "1. ID number \n",
    "1. Diagnosis (M = malignant:1, B = benign:-1) M:悪性，B:良性\n",
    "\n",
    "    * M:悪性(-1)，B:良性(1) => M:悪性(1)，B:良性(-1)\n",
    "    * cancer detectorの評価なので，positiveは悪性であるべき．\n",
    "\n",
    "2. 3-32\n",
    "Ten real-valued features are computed for each cell nucleus: \n",
    "\n",
    "* 半径radius (mean of distances from center to points on the perimeter) \n",
    "* テクスチャtexture (standard deviation of gray-scale values) \n",
    "* 境界の長さperimeter \n",
    "* 面積area \n",
    "* なめらかさsmoothness (local variation in radius lengths) \n",
    "* コンパクトさcompactness (perimeter^2 / area - 1.0) \n",
    "* くぼみ度合いconcavity (severity of concave portions of the contour) \n",
    "* くぼみの数concave points (number of concave portions of the contour) \n",
    "* 対称性symmetry \n",
    "* フラクタル次元fractal dimension (\"coastline approximation\" - 1)\n",
    "\n",
    "のそれぞれのmean, stderr, worst数値を保持している．\n",
    "\n",
    "http://people.idsia.ch/~juergen/deeplearningwinsMICCAIgrandchallenge.html\n",
    "\n",
    "## 分類器\n",
    "用意された訓練(training)データには，\n",
    "$\\boldsymbol{A}$に上に記した特徴量が，\n",
    "$\\boldsymbol{b}$に悪性(1, positive)か良性(-1, negative)かを\n",
    "示す数値が入っている．\n",
    "\n",
    "与えられた特徴ベクトル$\\boldsymbol{y}$に対し，\n",
    "細胞組織が悪性か良性かを分類する関数$C(\\boldsymbol{y})$を選び出すプログラムを作成しよう．"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 仮説クラス\n",
    "分類器は可能な分類器の集合(**仮説クラス**)から選ばれる．この場合，仮説クラスとは特徴ベクトルの空間$\\mathbb{R}^D$から$\\mathbb{R}$への線形関数$h(\\cdot)$である．すると分類器は次のような関数として定義される．\n",
    "\n",
    "$$\n",
    "C(\\boldsymbol{y}) = \n",
    "\\left\\{ \\begin{array}{ccc}\n",
    "+1 &  {\\rm when} & h(\\boldsymbol{y})\\geq 0\\\\\n",
    "-1 &  {\\rm when} & h(\\boldsymbol{y})<0\n",
    "\\end{array} \\right.\n",
    "$$\n",
    "\n",
    "各線形関数$h:\\mathbb{R}^D \\rightarrow \\mathbb{R}$に対して，\n",
    "次のような$D$ベクトル$\\boldsymbol{w}$が存在する．\n",
    "$$\n",
    "h(\\boldsymbol{y}) = \\boldsymbol{w}\\cdot \\boldsymbol{y}\n",
    "$$\n",
    "したがって，そのような線形関数を選ぶことは，結局$D$ベクトル$\\boldsymbol{w}$を\n",
    "選ぶことに等しい．特に，$\\boldsymbol{w}$を選ぶことは，仮説クラス$h$を\n",
    "選ぶことと等価なので，$\\boldsymbol{w}$を**仮説ベクトル**と呼ぶ．\n",
    "\n",
    "問題を単純化すると，分類器を単なるベクトルとみなして，データとの掛け算で予測がつきます．本来は予測を-1,1とかに投影しないといけないんですが，単純化のためにそのままの値を用います．問題はどうやってこの仮説ベクトル$\\boldsymbol{w}$の各要素の値を決定するか？ですよね．\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 最急降下法\n",
    "\n",
    "損失関数に\n",
    "$$\n",
    "L(w)=\\sum_{i=1}^n (A_i \\cdot \\boldsymbol{w} - b_i)^2\n",
    "$$\n",
    "を選ぶと，ベクトル$\\boldsymbol{w}$の\n",
    "$j$偏微分は，\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\frac{\\partial L}{\\partial w_j} &= \n",
    "\\sum_{i=1}^n \\frac{\\partial}{\\partial w_j}(A_i \\cdot w -b_i)^2 \\\\\n",
    "&= \\sum_{i=1}^n 2(A_i \\cdot w -b_i) A_{ij}\n",
    "\\end{aligned}\n",
    "$$\n",
    "となる．\n",
    "ここで，$A_{ij}$は$A_i$の$j$番目の要素を意味する．\n",
    "この偏微分$\\frac{\\partial L}{\\partial w_j}$を\n",
    "$\\boldsymbol{w}_j$の勾配(slope)として，$L(w)$の極小値(local minimum)を求める．\n",
    "\n",
    "このような探索方法を最急降下法(steepest descent method)と呼ぶ．"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## print_w\n",
    "\n",
    "出てきた$\\boldsymbol{w}$の$j$要素をきれいに表示する関数を用意しておきます．"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "def print_w(w):\n",
    "    import numpy as np\n",
    "    params = [\n",
    "        \"radius\", \"texture\", \"perimeter\", \"area\",\n",
    "        \"smoothness\", \"compactness\", \"concavity\", \"concave_pts\",\n",
    "        \"symmetry\", \"fractal_dim\"\n",
    "    ]\n",
    "    w = np.array(w).flatten()\n",
    "    print(\"{:<13} {:>13} {:>13} {:>13}\".format(\"Parameter\", \"Mean\", \"StdErr\", \"Worst\"))\n",
    "    for i, param in enumerate(params):\n",
    "        mean, stderr, worst = w[i*3], w[i*3+1], w[i*3+2]\n",
    "        print(\"{:<13} {:13.9f} {:13.9f} {:13.9f}\".format(param, mean, stderr, worst))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## データの読み込みと初期化"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/train_A.data', np.float64, -1, \" \")\n",
    "A = tmp.reshape(300,30)\n",
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/train_b.data', np.float64, -1, \" \")\n",
    "b = tmp.reshape(300,1)\n",
    "w = np.zeros(30).reshape(30,1)\n",
    "for i in range(30):\n",
    "    w[i] = 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,\n",
       "       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,\n",
       "       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,\n",
       "       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,\n",
       "       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "A[0]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([[ 1.],\n",
       "       [ 1.],\n",
       "       [ 1.],\n",
       "       [ 1.],\n",
       "       [-1.],\n",
       "       [-1.],\n",
       "       [-1.],\n",
       "       [ 1.],\n",
       "       [ 1.],\n",
       "       [ 1.]])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "b[15:25]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 最急降下法によるw探索(steepest descent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parameter              Mean        StdErr         Worst\n",
      "radius         -0.000426997  -0.000741817  -0.002548876\n",
      "texture        -0.001687946  -0.000004707  -0.000000127\n",
      "perimeter       0.000003968   0.000002078  -0.000008954\n",
      "area           -0.000003595  -0.000002569  -0.000070324\n",
      "smoothness     -0.000001139   0.000881778  -0.000000430\n",
      "compactness    -0.000000441  -0.000000723  -0.000000267\n",
      "concavity      -0.000001200  -0.000000191  -0.000411499\n",
      "concave_pts    -0.000921972  -0.002395138   0.001932789\n",
      "symmetry       -0.000005930   0.000003750   0.000008147\n",
      "fractal_dim     0.000002341  -0.000011565  -0.000003523\n"
     ]
    }
   ],
   "source": [
    "loop, sigma = 300, 3.0*10**(-9)\n",
    "for i in range(loop):\n",
    "  dLw = A.dot(w)-b\n",
    "  w = w - (A.transpose().dot(dLw))*sigma\n",
    "\n",
    "print_w(w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 結果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_accuracy(mA, vb, vw):\n",
    "    # M:悪性(1)，B:良性(-1)\n",
    "\n",
    "    correct,safe_error,critical_error=0,0,0\n",
    "    predict = mA.dot(vw)\n",
    "    n = vb.size\n",
    "    for i in range(n):\n",
    "        if predict[i]*vb[i]>0:\n",
    "            correct += 1\n",
    "        elif (predict[i]<0 and vb[i]>0): #  悪性なのに良性と予測：見落としcritical(fn)\n",
    "            critical_error += 1\n",
    "        elif (predict[i]>0 and vb[i]<0): # 良性なのに悪性と予測：再検査safe(fp)\n",
    "            safe_error += 1\n",
    "    print(\"     correct(true predicttion): %4d/%4d\" % (correct,n))\n",
    "    print(\"    safe error(false positive): %4d\" % safe_error)\n",
    "    print(\"critical error(false negative): %4d\" % critical_error)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "def show_accuracy2(mA, vb, vw):\n",
    "    # M:悪性(1) positive，B:良性(-1) negative\n",
    "    # true(真) , false(偽)\n",
    "    tp, tn, fp, fn = 0, 0, 0, 0\n",
    "    predict = mA.dot(vw)\n",
    "    n = vb.size\n",
    "    for i in range(n):\n",
    "        if predict[i]*vb[i]>0:\n",
    "            if predict[i]>0:\n",
    "                tp += 1\n",
    "            else:\n",
    "                tn += 1\n",
    "        elif (predict[i]<0 and vb[i]>0): #  悪性なのに良性と予測：見落としcritical(fn)\n",
    "            fn += 1\n",
    "        elif (predict[i]>0 and vb[i]<0): # 良性なのに悪性と予測：再検査safe(fp)\n",
    "            fp += 1\n",
    "    print(\"%21s %12s %10s\" %  (\"predict\",\"positive\", \"negative\"))\n",
    "    print(\"%21s | %10d %10d\" % (\"actual positive\",tp, fn))\n",
    "    print(\"%21s | %10d %10d\" % (\"actual negative\",fp, tn))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     correct(true predicttion):  274/ 300\n",
      "    safe error(false positive):    5\n",
      "critical error(false negative):   21\n"
     ]
    }
   ],
   "source": [
    "show_accuracy(A, b, w)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     correct(true predicttion):  248/ 269\n",
      "    safe error(false positive):   10\n",
      "critical error(false negative):   11\n"
     ]
    }
   ],
   "source": [
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/validate_A.data', np.float64, -1, \" \")\n",
    "A = tmp.reshape(269,30)\n",
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/validate_b.data', np.float64, -1, \" \")\n",
    "b = tmp.reshape(269,1)\n",
    "\n",
    "show_accuracy(A, b, w)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "              predict     positive   negative\n",
      "      actual positive |         55         11\n",
      "      actual negative |         10        193\n"
     ]
    }
   ],
   "source": [
    "show_accuracy2(A, b, w)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# QR decomposition\n",
    "\n",
    "QR分解を使うとより簡単に最小値を求めることができる．\n",
    "行列$A$は正方行列でないので，逆行列をもとめることができない．\n",
    "しかし，その場合でも$||A.w -b ||^2$を最小にする$w$を求めることができる．\n",
    "\n",
    "QR分解によって，$n \\times m$行列は\n",
    "$$\n",
    "A = QR\n",
    "$$\n",
    "と分解される．ここで，Qは$n \\times m$行列，Rは$m \\times m$の正方行列で，逆行列を求めることができる．\n",
    "\n",
    "$|| Aw - b ||$が最小となるのはQRを使って，\n",
    "$$\n",
    "Q.R.w=b \\\\\n",
    "R.w = Q^t.b \\\\\n",
    "R^{-1}.R.w = R^{-1}.Q^t.b\n",
    "$$\n",
    "となりそう．"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "\n",
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/train_A.data', np.float64, -1, \" \")\n",
    "A = tmp.reshape(300,30)\n",
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/train_b.data', np.float64, -1, \" \")\n",
    "b = tmp.reshape(300,1)\n",
    "\n",
    "q, r = np.linalg.qr(A)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [],
   "source": [
    "ww = np.linalg.inv(r).dot(np.transpose(q).dot(b))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(300, 30)"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "q.shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[-2.57579883e+02 -3.32324268e+02 -1.68607899e+03 -1.29450676e+04\n",
      " -1.65446346e+00]\n"
     ]
    }
   ],
   "source": [
    "print(r[0,0:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     correct(true predicttion):  286/ 300\n",
      "    safe error(false positive):    1\n",
      "critical error(false negative):   13\n",
      "              predict     positive   negative\n",
      "      actual positive |        133         13\n",
      "      actual negative |          1        153\n"
     ]
    }
   ],
   "source": [
    "show_accuracy(A, b, ww)\n",
    "show_accuracy2(A, b, ww)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Parameter              Mean        StdErr         Worst\n",
      "radius         -0.869922200   0.024313945   0.062679604\n",
      "texture         0.003274620   8.790304633  -1.747147681\n",
      "perimeter       0.202847615   6.506449469  -5.061760150\n",
      "area          -49.167544936   0.956592244   0.082052701\n",
      "smoothness      0.007943079  -0.004976910  27.841937812\n",
      "compactness    -3.301518835  -4.985956723  16.318887300\n",
      "concavity     -10.316290100  21.332168405   0.408605799\n",
      "concave_pts     0.003345717   0.000677877  -0.002510735\n",
      "symmetry       -4.531369160  -0.590110921   0.719368538\n",
      "fractal_dim     2.158967280   3.803466937  12.298424597\n"
     ]
    }
   ],
   "source": [
    "print_w(ww)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "     correct(true predicttion):  261/ 269\n",
      "    safe error(false positive):    6\n",
      "critical error(false negative):    2\n",
      "              predict     positive   negative\n",
      "      actual positive |         64          2\n",
      "      actual negative |          6        197\n"
     ]
    }
   ],
   "source": [
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/validate_A.data', np.float64, -1, \" \")\n",
    "A = tmp.reshape(269,30)\n",
    "tmp = np.fromfile('./wisconsin_diag_breast_cancer_data/validate_b.data', np.float64, -1, \" \")\n",
    "b = tmp.reshape(269,1)\n",
    "\n",
    "show_accuracy(A, b, ww)\n",
    "show_accuracy2(A, b, ww)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.8.5"
  },
  "latex_envs": {
   "LaTeX_envs_menu_present": true,
   "autocomplete": true,
   "bibliofile": "biblio.bib",
   "cite_by": "apalike",
   "current_citInitial": 1,
   "eqLabelWithNumbers": true,
   "eqNumInitial": 1,
   "hotkeys": {
    "equation": "Ctrl-E",
    "itemize": "Ctrl-I"
   },
   "labels_anchors": false,
   "latex_user_defs": false,
   "report_style_numbering": false,
   "user_envs_cfg": false
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {
    "height": "12px",
    "width": "252px"
   },
   "number_sections": true,
   "sideBar": true,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": true,
   "toc_position": {},
   "toc_section_display": "block",
   "toc_window_display": true
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}