
Apr 08, 2025

Today I'll show how to download every image on a web page using Python.

Web Scraping or Crawling

Sometimes we want data from a website, but the site doesn't provide an API. In that case we use a web crawler to collect the data we want. For example, when you share a link on Facebook, Facebook's bot immediately visits that page and collects data from it. Google's crawlers work in a similar way: they crawl websites and index the pages, and when you search, Google uses its NLP algorithms to understand your query and shows you the matching results. Site owners can also specify what Google's bot may fetch and what it may not access (typically through a robots.txt file).

So let's get started. This tutorial uses Python 3; the code will not work on Python 2. I prefer Python 3 myself and recommend using it.

First, let's install what we need.

$ pip install beautifulsoup4
$ pip install requests

BeautifulSoup extracts the tags we want from a page. Requests sends a request to the website link we pass in; from the response it returns, we use BeautifulSoup to pull out the image tags.

So let's take a look at the code.

from urllib.request import urlretrieve
from bs4 import BeautifulSoup
import requests
import argparse

ap = argparse.ArgumentParser()
ap.add_argument("-u","--url",required=True,help="input url to download")
args = vars(ap.parse_args())

res = requests.get(args['url']).text
soup = BeautifulSoup(res,"html.parser")
images = soup.find_all("img")
print("[INFO] Found {} images".format(len(images)))
for i,image in enumerate(images):
        print('[INFO] Downloading Image {}...'.format(i+1))
        urlretrieve(image['src'],'{}.png'.format(i+1))

After importing the libraries we need, we set things up so the URL can be passed in from the command line. As I said earlier, lines 10 and 11 request the website we want to download images from, pass the response to the BeautifulSoup class, and build an object from it. Images are all we want and nothing else, so we use the find_all() method to find the img tags.
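As a minimal sketch of the command-line part only, the parser below mirrors the one in the script. The helper name parse_cli and the example.com URL are assumptions made for illustration; passing a list to parse_args() stands in for real command-line input.

```python
import argparse


def parse_cli(argv=None):
    """Parse the --url option; argv=None means 'read the real command line'."""
    ap = argparse.ArgumentParser()
    ap.add_argument("-u", "--url", required=True, help="input url to download")
    # vars() turns the argparse Namespace into a plain dict, so the value
    # can be read as args['url'] just like in the script.
    return vars(ap.parse_args(argv))


# Hypothetical URL, passed as a list instead of typing it in a shell.
args = parse_cli(["--url", "https://example.com/gallery/"])
print(args["url"])
```

Because the option is declared with required=True, running the script without --url makes argparse print a usage message and exit.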

Lines 14 to 16 loop over the image tags we found, because we want to download all of them, and use urlretrieve to download each image. We pass urlretrieve two arguments: the first is the input URL and the second is the output file name.
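To make the two arguments concrete, here is a minimal sketch that calls urlretrieve on a local file:// URL instead of a real website, so it runs without network access. The temporary file, its contents, and the helper name demo_urlretrieve are made up for illustration.

```python
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def demo_urlretrieve():
    """Copy a local file through urlretrieve and return (saved name, bytes)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Stand-in for a remote image: a small local file with demo bytes.
        source = Path(tmp) / "source.png"
        source.write_bytes(b"demo bytes, not a real image")

        target = Path(tmp) / "1.png"
        # First argument: the input URL. Second argument: the output file name.
        saved_path, _headers = urlretrieve(source.as_uri(), str(target))
        return Path(saved_path).name, target.read_bytes()


name, data = demo_urlretrieve()
print(name, len(data))
```

With an http or https URL the call has the same shape; only the first argument changes.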

In HTML an image is written like <img src="image source" alt=""/>, so the src attribute tells us where on the server the image is stored. That is why we read it with image['src'] and then pass that src to the download call.
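The sketch below runs the same find_all("img") step on a small HTML string rather than a live page. The sample HTML and the helper name extract_image_sources are hypothetical. It reads the attribute with image.get("src") instead of image['src'], which is a swapped-in choice: get returns None for an img tag without a src, where the subscript form would raise KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for illustration.
SAMPLE_HTML = """
<html><body>
  <img src="https://example.com/static/logo.png" alt="logo"/>
  <img src="images/photo.jpg" alt="photo"/>
  <img alt="no source here"/>
</body></html>
"""


def extract_image_sources(html):
    """Return the src value of every <img> tag that has one."""
    soup = BeautifulSoup(html, "html.parser")
    sources = []
    for image in soup.find_all("img"):
        src = image.get("src")  # None when the tag has no src attribute
        if src:
            sources.append(src)
    return sources


print(extract_image_sources(SAMPLE_HTML))
```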

An error can occur here. Developers usually don't write the full URL in src. When the HTML is written by hand, src often holds a relative path, for example when the image sits in a folder outside the current one. A CMS generates these URLs automatically, so its pages are usually fine. When src is a relative path, you need to join it with the page URL before downloading.
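A minimal sketch of both halves of the problem, using made-up example.com URLs and hypothetical helper names (resolve, download_error). It first shows that urlretrieve rejects a relative path with a ValueError before any download starts, then shows how urljoin from urllib.parse resolves different kinds of src values against the page URL.

```python
from urllib.parse import urljoin
from urllib.request import urlretrieve

PAGE_URL = "https://example.com/blog/post/"  # hypothetical page URL


def resolve(src, page_url=PAGE_URL):
    """Turn a src value into an absolute URL, using the page URL as the base."""
    return urljoin(page_url, src)


def download_error(src):
    """Return the ValueError message urlretrieve raises for src, or None.

    Only call this with a relative path: a valid absolute URL would
    trigger a real download.
    """
    try:
        urlretrieve(src, "unused.png")
    except ValueError as exc:
        return str(exc)
    return None


# A relative path has no scheme, so urlretrieve refuses it.
print(download_error("images/photo.jpg"))

# urljoin handles relative, parent-relative, root-relative and absolute src values.
for src in ["images/photo.jpg", "../images/photo.jpg",
            "/static/logo.png", "https://cdn.example.com/a.png"]:
    print(src, "->", resolve(src))
```

An src that is already absolute passes through urljoin unchanged, so the same call is safe for every image on the page.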

So we need to update our code.

from urllib.request import urlretrieve
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests
import argparse
 
ap = argparse.ArgumentParser()
ap.add_argument("-u","--url",required=True,help="input url to download")
args = vars(ap.parse_args())
 
res = requests.get(args['url']).text
soup = BeautifulSoup(res,"html.parser")
images = soup.find_all("img")
print("[INFO] Found {} images".format(len(images)))
for i,image in enumerate(images):
        print('[INFO] Downloading Image {}...'.format(i+1))
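        # Resolve a relative src against the page URL before downloading
        urlretrieve(urljoin(args['url'],image['src']),'{}.png'.format(i+1))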

On the last line, the urlretrieve call, you can see that I join the two URLs: the page URL and the image's src. That solves the problem, and you can now run the script with the command below.

$ python download_image.py --url http://www.pyrobocity.org/setting-up-django-for-hand-written-digits-recognition-project/

You should then see a result like this.

[Image: Downloading All Images From Web Page]