Internet Marketing Academy


How to Handle Garbled Output from PHP file_get_contents

28 October 2021


What Is file_get_contents?

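file_get_contents() is a built-in PHP function that reads an entire file or document into a single string. It is commonly used to read file contents, fetch the source of a web page, and so on.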
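As a minimal sketch of the basic call, the snippet below writes a small temporary file and reads it back into a string. The file name prefix and the contents are made up for illustration. Passing a URL instead of a path works the same way when `allow_url_fopen` is enabled.

```php
<?php
// Create a throwaway file so the example is self-contained (made-up content).
$path = tempnam(sys_get_temp_dir(), 'fgc_');
file_put_contents($path, "hello from file_get_contents\n");

// Read the whole file into one string; file_get_contents() returns false on failure.
$contents = file_get_contents($path);
if ($contents === false) {
    throw new RuntimeException("Could not read $path");
}

echo $contents;
unlink($path);
```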

Garbled Response Content

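When we call file_get_contents($url) directly to fetch a page's source, some sites come back as garbled text. Without some programming background, this can be baffling at first.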

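A common cause of garbled response content is that the target site returns a Gzip-compressed body. If you inspect the Response Headers in Chrome DevTools, or read them from your scraper, you will see content-encoding: gzip. Besides large sites such as Amazon, we have also noticed that WordPress sites running a popular caching plugin are quite likely to return Gzip content.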
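The check described above can be sketched in PHP as follows. Both helper names are hypothetical, and the header lines and body are sample data standing in for a real response. The first helper looks for a Content-Encoding header, and the second tests for the two magic bytes (0x1f 0x8b) that every gzip stream starts with.

```php
<?php
/**
 * Return the last Content-Encoding value found in a list of raw header
 * lines, or null if the header is absent. (Hypothetical helper.)
 */
function content_encoding(array $headerLines): ?string
{
    $name  = 'Content-Encoding:';
    $found = null;
    foreach ($headerLines as $line) {
        // Header names are case-insensitive, so compare with stripos().
        if (stripos($line, $name) === 0) {
            $found = strtolower(trim(substr($line, strlen($name))));
        }
    }
    return $found;
}

/** True when the body starts with the gzip magic bytes 0x1f 0x8b. */
function looks_gzipped(string $body): bool
{
    return strncmp($body, "\x1f\x8b", 2) === 0;
}

// Sample data standing in for a real response.
$headers = ['HTTP/1.1 200 OK', 'Content-Type: text/html', 'content-encoding: gzip'];
$body    = gzencode('<html>hello</html>');

var_dump(content_encoding($headers));
var_dump(looks_gzipped($body));
```

After a call to file_get_contents($url), PHP fills the local variable `$http_response_header` with the response header lines, so that array can be passed to the first helper.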

Fixing Garbled Output from Gzip Pages

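Here are three ways to fix garbled Gzip pages. Our personal recommendation is cURL with CURLOPT_ACCEPT_ENCODING set, so that the received content is decompressed automatically.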

Using the Server's zlib Library

If the server that runs our scraping script has the zlib library installed, the following code solves the garbled-content problem directly.

$html = file_get_contents("compress.zlib://" . $url);
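To see what the compress.zlib:// wrapper does without touching the network, here is a small sketch that uses a local gzip file in place of a remote page. The temporary file and its contents are made up for illustration. The same prefix is what the one-liner above puts in front of a URL.

```php
<?php
// The wrapper is only registered when PHP was built with zlib support.
if (!in_array('compress.zlib', stream_get_wrappers(), true)) {
    throw new RuntimeException('compress.zlib:// wrapper is not available');
}

// Write a gzip-compressed file to stand in for a gzip response (made-up content).
$path = tempnam(sys_get_temp_dir(), 'gz_');
file_put_contents($path, gzencode('<html>hello</html>'));

// Without the wrapper we get the raw compressed bytes, which display as garbage.
$raw = file_get_contents($path);

// With the wrapper, zlib inflates the stream while it is being read.
$html = file_get_contents('compress.zlib://' . $path);

echo $html, PHP_EOL;
unlink($path);
```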

Using cURL with CURLOPT_ENCODING

For scraping pages purely in PHP, cURL clearly beats both file_get_contents and file_get_html (Simple HTML DOM Parser) in performance and features.

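After creating the cURL handle, for example $curl = curl_init($url), be sure to add one of the following lines so that libcurl automatically decompresses Gzip content when it receives it.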

curl_setopt($curl, CURLOPT_ACCEPT_ENCODING, ""); // empty string: send every encoding this cURL build supports
// or
curl_setopt($curl, CURLOPT_ACCEPT_ENCODING, "gzip");
// or
curl_setopt($curl, CURLOPT_ACCEPT_ENCODING, "br, gzip, deflate");
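Putting the option into a complete request might look like the sketch below. The function name, the timeout value and the error handling are illustrative choices, not part of the original article, and https://example.com/ is a placeholder URL. The sketch uses CURLOPT_ENCODING, the name listed in the PHP manual for this option. libcurl renamed it to CURLOPT_ACCEPT_ENCODING, which is the name used in the lines above.

```php
<?php
/**
 * Fetch a URL with cURL and return the decoded body.
 * Hypothetical helper: name, timeout and error handling are illustrative.
 */
function fetch_page(string $url): string
{
    $curl = curl_init($url);
    curl_setopt_array($curl, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 15,   // seconds; arbitrary value for this sketch
        CURLOPT_ENCODING       => '',   // advertise all supported encodings and decode the reply
    ]);

    $body = curl_exec($curl);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($curl));
    }
    return $body;
}

// Usage (placeholder URL):
// echo fetch_page('https://example.com/');
```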


Using a gzip Decompression Function

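PHP's built-in gzdecode function, available on the server through the zlib extension, can decompress gzip content. One caveat: while researching this, we also tried another function, gzuncompress, but it cannot decompress gzip page content, because it expects zlib-format data rather than the gzip format.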

if (!function_exists('gzdecode')) {
    /**
     * Decode gz coded data
     *
     * http://php.net/manual/en/function.gzdecode.php
     *
     * Alternative: http://digitalpbk.com/php/file_get_contents-garbled-gzip-encoding-website-scraping
     *
     * @param string $data gzencoded data
     * @return string inflated data
     */
    function gzdecode($data)
    {
        // strip the 10-byte header and 8-byte footer, then inflate
        // (assumes the gzip header carries no optional fields)
        return gzinflate(substr($data, 10, -8));
    }
}

$html = gzdecode(file_get_contents($url));
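Calling gzdecode on a body that is not actually gzip-compressed fails, so a slightly more defensive variant is sketched below. The helper name is hypothetical, and the sample bodies stand in for fetched pages. It decodes only when the body starts with the gzip magic bytes, and the last line shows gzuncompress rejecting the same gzip data.

```php
<?php
/**
 * Decompress $body only when it starts with the gzip magic bytes;
 * otherwise return it unchanged. (Hypothetical helper name.)
 */
function decode_if_gzipped(string $body): string
{
    if (strncmp($body, "\x1f\x8b", 2) !== 0) {
        return $body; // not gzip, leave as-is
    }
    $decoded = gzdecode($body);
    if ($decoded === false) {
        throw new RuntimeException('Body has a gzip header but could not be decoded');
    }
    return $decoded;
}

// Sample data standing in for a fetched page.
$gzipBody = gzencode('<html>hello</html>');

echo decode_if_gzipped($gzipBody), PHP_EOL;            // gzip input is inflated
echo decode_if_gzipped('<html>plain</html>'), PHP_EOL; // plain input passes through

// gzuncompress() expects zlib-format data, so it rejects the gzip stream.
var_dump(@gzuncompress($gzipBody));
```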
