How do I get Google to find my blog? Adding sitemap.xml & robots.txt, and fixing "Discovered - currently not indexed"

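Introduction

This post records how my blog finally got indexed by Google Search after four months of tweaks. The core work was producing sitemap.xml and robots.txt; everything else was fixing the errors that Google Search Console reported.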

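How Google Search fits in

In principle, anyone who has the URL can reach any website in the world. But nobody knows your URL at the start, so you need a search engine such as Google to index the site so that people can discover it through keyword searches.

So "getting into Google Search" means having Google index my blog so that it can be found by searching, not only by pasting the URL. In my experience, the two core files for this are a sitemap and robots.txt.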

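Adding sitemap.xml

A sitemap is the map of your site that Google's crawler reads; think of it as a summary of the site. In Hexo, a plugin can generate the sitemap for you.

Install the plugin: npm install hexo-generator-sitemap --save
Add the sitemap settings to _config.yaml

The sitemap path is configurable through the path parameter; it usually sits at the top level ( my-blog/sitemap ).
The tags and categories parameters decide whether the sitemap includes tag pages and category pages. I suggest leaving them out, because those pages have little content of their own, which may make it harder for Google to index them.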
_config.yaml
# Sitemap
sitemap:
  path: sitemap.xml
  tags: false
  categories: false

After changing the sitemap settings, a new sitemap is generated automatically each time the site is deployed. To check it, open <site URL>/sitemap.xml.
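Beyond opening the file in a browser, you can list what the sitemap actually contains. The following is a minimal Python sketch, assuming the generated file uses the standard sitemaps.org 0.9 namespace; the function name list_sitemap_urls and the example.com URLs in the sample are hypothetical, used only for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol (assumed to match the generated file)
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(xml_text: str) -> list:
    """Return every <loc> URL found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sample; paste the content of your own sitemap.xml here instead
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/post/hello-world/</loc></url>
  <url><loc>https://example.com/about/</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in list_sitemap_urls(SAMPLE):
        print(url)
```

If tag or category pages still show up in the list, the tags and categories settings have probably not taken effect yet.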

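Adding robots.txt

Adding robots.txt can help SEO, because this file tells web crawlers (e.g. Googlebot, Bingbot, Baiduspider) which pages they may crawl.

My robots.txt basically lets every crawler crawl everything; I then restrict, page by page, which pages should stay out of search results. Note: robots.txt must be placed in the source folder.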

robots.txt

User-agent: *
Allow: /
Allow: /post/
Allow: /images/
Allow: /img/
Allow: /about/
Allow: /archives/
Allow: /categories/
Allow: /tags/
Allow: /css/
Allow: /js/

# Location of the sitemap file
Sitemap: https://david31009.github.io/sitemap.xml

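User-agent: selects which crawler the rules that follow apply to; "*" means all crawlers.
Allow / Disallow: which paths may or may not be crawled.
Sitemap: the location of the sitemap, given as an absolute URL.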
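To see how these three directives behave before deploying, you can feed a robots.txt to Python's standard-library parser. This is a rough sketch, not how Google itself evaluates the file: the /drafts/ rule and the helper name load_rules are hypothetical additions for illustration. Python's parser applies rules in file order (first match wins), so the Disallow line is placed before Allow: / here.

```python
from urllib.robotparser import RobotFileParser

# Sample rules; the /drafts/ path is a hypothetical example, not part of the blog
ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Allow: /

Sitemap: https://david31009.github.io/sitemap.xml
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    rules = load_rules(ROBOTS_TXT)
    base = "https://david31009.github.io"
    for path in ("/post/hello/", "/drafts/wip/"):
        # can_fetch answers: may this user agent crawl this URL?
        print(path, rules.can_fetch("Googlebot", base + path))
    # site_maps() returns the URLs declared in Sitemap: lines
    print(rules.site_maps())
```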

At first I had no robots.txt and indexing kept failing; things improved once I added it!

If any pages are disallowed, an error is reported.

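Verifying site ownership and the domain:

Go to Google Search Console, the tool for getting your pages into Google Search.
I chose the URL prefix property type to verify the domain.
Before the domain can be verified, you first have to verify "site ownership". I used HTML tag verification, which means putting a meta tag inside the site's <head></head> block.

Modify _config.butterfly.yaml ( either of the two options below is enough ):

The #Inject section can add meta tags to <head></head>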

_config.butterfly.yaml

inject:
  head:
    # Google Search Console site ownership verification, using an HTML tag
    - <meta name="google-site-verification" content="E89DAusjlVZxxxxx" />

Later I found that the #Verification section accepts the content value directly, which is more convenient

_config.butterfly.yaml

# Verification (webmaster verification)
site_verification:
  - name: google-site-verification
    content: E89DAusjlVZxxxxx
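Whichever option you pick, the result should be a google-site-verification meta tag in the rendered page. A small Python sketch for checking a generated HTML page is shown below; the class and function names are hypothetical, and the token is the same placeholder used in the config above.

```python
from html.parser import HTMLParser


class VerificationTagFinder(HTMLParser):
    """Collects the content of every google-site-verification meta tag."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> are routed here as well
        if tag != "meta":
            return
        attr_map = dict(attrs)
        if attr_map.get("name") == "google-site-verification":
            self.tokens.append(attr_map.get("content"))


def find_verification_tokens(html: str) -> list:
    finder = VerificationTagFinder()
    finder.feed(html)
    return finder.tokens


# Hypothetical page; in practice, read the index.html that Hexo generates
SAMPLE_HTML = """<html><head>
<meta charset="utf-8">
<meta name="google-site-verification" content="E89DAusjlVZxxxxx" />
</head><body></body></html>"""

if __name__ == "__main__":
    print(find_verification_tokens(SAMPLE_HTML))
```

An empty list would suggest the tag never made it into <head></head>, so Search Console would not be able to verify ownership with this method.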

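Success screen

Adding the sitemap in Search Console

In Google Search Console, just enter sitemap.xml, since path puts the file at the top level. When the sitemap changes later, Google's crawler picks up the update automatically.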

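Adding favicon.ico

After the steps above I waited a month with no real response; Google Search Console only showed "Discovered - currently not indexed". I later found that errors can be viewed under Settings > Crawl stats. Besides the robots.txt I had been missing earlier, the next error was that Google's crawler could not find favicon.ico.

Put favicon.ico in the source folder, at the same level as robots.txt
In _config.butterfly.yml, add <link rel="icon" href="/favicon.ico" type="image/x-icon"> to the #Inject section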

_config.butterfly.yml

# Inject
inject:
  head:
    # Add favicon.ico
    - <link rel="icon" href="/favicon.ico" type="image/x-icon">
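Since three root-level files matter in this walkthrough (sitemap.xml, robots.txt, favicon.ico), a quick check of the generated output before deploying can save a wait. The sketch below assumes Hexo's default output folder, public; the helper name missing_files is hypothetical.

```python
import tempfile
from pathlib import Path

# Files this post expects at the root of the generated site
REQUIRED_FILES = ("sitemap.xml", "robots.txt", "favicon.ico")


def missing_files(output_dir, required=REQUIRED_FILES):
    """Return the required files that are absent from output_dir."""
    root = Path(output_dir)
    return [name for name in required if not (root / name).is_file()]


if __name__ == "__main__":
    # Demo on a throwaway folder that only contains robots.txt
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "robots.txt").write_text("User-agent: *\nAllow: /\n")
        print(missing_files(tmp))  # ['sitemap.xml', 'favicon.ico']
```

For a real site, run hexo generate first and then call missing_files("public") from the blog's root folder.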

3-4 days later, Google notified me that my pages could now be indexed.

Checking is simple: search Google for site:<site domain>. If content from the site shows up, it worked.
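If you want a link you can bookmark for this check, the site: query can be turned into a search URL. This is a tiny sketch that assumes the usual https://www.google.com/search?q=... address format; the function name site_search_url is hypothetical.

```python
from urllib.parse import urlencode


def site_search_url(domain: str) -> str:
    """Build a Google search URL for the query site:<domain>."""
    return "https://www.google.com/search?" + urlencode({"q": f"site:{domain}"})


if __name__ == "__main__":
    print(site_search_url("david31009.github.io"))
```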