Featured image of post 用 PSR-7 message 格式讓 Python requests 結果無縫傳給 PHP

用 PSR-7 message 格式讓 Python requests 結果無縫傳給 PHP

Python 爬蟲、PHP 後續處理的跨語言架構:讓 Python 輸出 PSR-7 message 字串,PHP 用 parseResponse 解析,免手動拆 header 和 body。

有時候 Python 已經有現成好用的 package 可以抓取網頁資訊,但後續處理都在 PHP 端。問題是怎麼把 Python requests 的 response 完整帶回 PHP?

Python → PSR-7 HTTP Message → PHP 的資料流

1
composer require guzzlehttp/psr7 symfony/process

為什麼用 PSR7 message 格式

HTTP message 本身就是純文字協定,PSR7 定義了標準的 message interface。只要 Python 端把 response 輸出成 HTTP message 格式的字串,PHP 端用 GuzzleHttp\Psr7Message::parseResponse 就能直接解析回 ResponseInterface,不需要自己拆 header 和 body。

Python 端:將 requests response 轉成 PSR7 message

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
import argparse
from urllib.parse import urlparse

import requests
from requests import Response, PreparedRequest

parser = argparse.ArgumentParser()
parser.add_argument("url")
args = parser.parse_args()

def has_header(headers, key: str, value: str = None):
    for (k, v) in headers:
        if k.lower() == key.lower() and (value is None or v.lower() == value.lower()):
            return True
    return False

def psr7_message(message: PreparedRequest or Response) -> bytes:
    msg = ""
    body = bytes()
    if isinstance(message, PreparedRequest) is True:
        msg = "%s %s HTTP/1.1" % (message.method, message.url)
        body = message.body if message.body else bytes()
        if "host" in map(lambda header: str(header[0]).strip().lower(), message.headers):
            msg += "\r\nHost:" + urlparse(str(message.url)).hostname

    if isinstance(message, Response) is True:
        msg = "HTTP/1.1 %s %s" % (message.status_code, message.reason)
        body = message.content

    headers = message.headers.items()

    headers = filter(lambda header: has_header([header], "Transfer-Encoding", "chunked") is False, headers)
    headers = filter(lambda header: has_header([header], "Content-Encoding") is False, headers)
    headers = sorted(list(headers))

    msg += "".join(["\r\n%s: %s" % (name, value) for name, value in headers])

    return ("%s\r\n\r\n" % msg).encode("utf-8") + body

def main(url):
    resp = requests.get(url)

    print(psr7_message(resp).decode('utf-8').replace(u'\xa0', u' '))

if __name__ == '__main__':
    main(args.url)

重點是過濾掉 Transfer-Encoding: chunkedContent-Encoding 這些 header,避免 PHP 端解析時出問題。

PHP 端:實作 PSR-18 ClientInterface

PHP 這邊實作 ClientInterface,透過 Symfony Process 呼叫 Python script,再用 Message::parseResponse 把輸出解析成 ResponseInterface

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
namespace App\Services;

use GuzzleHttp\Psr7\Message;
use GuzzleHttp\Psr7\Response;
use Psr\Http\Client\ClientInterface;
use Psr\Http\Message\RequestInterface;
use Psr\Http\Message\ResponseInterface;
use Symfony\Component\Process\Exception\ProcessFailedException;
use Symfony\Component\Process\ExecutableFinder;
use Symfony\Component\Process\Process;

class ScraperClient implements ClientInterface
{
    private static ?string $python = null;

    public function sendRequest(RequestInterface $request): ResponseInterface
    {
        try {
            $process = new Process([
                $this->findPython3(),
                resource_path('path/to/main.py'),
                (string)$request->getUri(),
            ]);

            return Message::parseResponse($process->getOutput());
        } catch (ProcessFailedException $e) {
            return new Response(500, [], $e->getMessage());
        }
    }

    private function findPython3(): string
    {
        if (! self::$python) {
            $executableFinder = new ExecutableFinder();

            foreach (['python3', 'python'] as $binary) {
                $python = $executableFinder->find($binary);
                if ($python) {
                    return self::$python = $python;
                }
            }
        }

        return self::$python;
    }
}

因為實作了 ClientInterface,所以在 PHP 端可以像平常使用 HTTP client 一樣呼叫,完全不需要改變既有的使用習慣。需要換回 Guzzle 或其他 client 時,只要抽換實作就好。