Scraping Meme Images with Multiple Threads

The site serves its images from Sina's image CDN, so we don't need to worry about putting load on the meme site's own server, and it's safe to experiment here.
For other sites, please go easy: try not to hit someone else's server with a multithreaded crawler. The point of learning is not to cause damage.

On to the code.
First, initialize a lock:
```python
gLock = threading.Lock()
```
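The snippets below also assume the usual imports plus the constants carried over from the single-threaded demo, which aren't repeated in the original post. A minimal setup sketch (the base URL is a placeholder; substitute the listing URL from the earlier demo):

```python
import os
import threading
import time
import urllib.request

import requests
from bs4 import BeautifulSoup

# Placeholder: use the paginated listing URL from the single-threaded demo
BASE_PAGE_URL = 'https://example.com/photo/list/?page='
PAGE_URL_LIST = []

# The Consumer below saves into this directory, so create it up front
os.makedirs('images2', exist_ok=True)
```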
Create two lists to hold the image URLs and their names:
```python
FACE_URL_LIST = []
FACE_NAME_LIST = []
```
As in the previous single-threaded demo, collect the page URLs first:
```python
for x in range(1, 1500):
    url = BASE_PAGE_URL + str(x)
    PAGE_URL_LIST.append(url)
```
Define a producer that pulls the image URLs out of each page:
```python
class Producer(threading.Thread):
    def run(self):
        while len(PAGE_URL_LIST) > 0:
            # Lock the shared page list before taking one
            gLock.acquire()
            page_url = PAGE_URL_LIST.pop()
            gLock.release()

            response = requests.get(page_url)
            soup = BeautifulSoup(response.content, 'lxml')
            img_list = soup.find('div', class_='page-content').find_all(
                'img', class_='img-responsive lazy image_dta')

            # Lock again while appending to the shared result lists
            gLock.acquire()
            for img in img_list:
                title = img['alt']
                # The real URL lives in the lazy-load attribute;
                # skip tags that don't carry it
                if 'data-original' in img.attrs:
                    src = img['data-original']
                else:
                    continue
                print(title)
                print(src)
                FACE_URL_LIST.append(src)
                FACE_NAME_LIST.append(title)
            gLock.release()
            time.sleep(0.5)
```
Compared with a single-threaded crawler, the key point is that any shared resource must be locked while in use, and the lock must be released promptly once you're done:
```python
class Consumer(threading.Thread):
    def run(self):
        print('%s is running' % threading.current_thread())
        while True:
            # Lock before reading from the shared lists
            gLock.acquire()
            if len(FACE_URL_LIST) == 0 or len(FACE_NAME_LIST) == 0:
                gLock.release()
                continue
            else:
                face_url = FACE_URL_LIST.pop()
                filename = FACE_NAME_LIST.pop()
                gLock.release()
            # Keep the file extension, e.g. '.jpg' or '.gif'
            z = face_url[-4:]
            # Truncate long titles and strip characters that are
            # awkward in file names
            if len(filename) > 20:
                filename = filename[:20]
            path = 'images2' + '/' + filename.strip().replace('?', '') + z
            urllib.request.urlretrieve(face_url, path)
```
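As an aside (my addition, not part of the original code): a `threading.Lock` also works as a context manager, so each acquire/release pair above can be written with `with`, which guarantees the lock is released even if an exception is raised in between:

```python
# Equivalent to gLock.acquire() ... gLock.release(), but exception-safe
with gLock:
    if FACE_URL_LIST and FACE_NAME_LIST:
        face_url = FACE_URL_LIST.pop()
        filename = FACE_NAME_LIST.pop()
```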
The details of each step are explained in the code comments.
Finally, create the threads and start them:
```python
if __name__ == '__main__':
    for x in range(2):
        Producer().start()
    for x in range(5):
        Consumer().start()
```
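One caveat: `Consumer.run` is a `while True` loop, so the script never exits on its own; stop it manually once downloads dry up. If you want to avoid both the busy-waiting `continue` loop and the manual locking, the standard-library `queue.Queue` handles them for you. A rough sketch of that variant (my rewrite, not the original author's code):

```python
import queue
import threading
import urllib.request

# put()/get() are thread-safe, so no explicit Lock is needed,
# and get() blocks instead of spinning when the queue is empty
face_queue = queue.Queue()

class QueueConsumer(threading.Thread):
    def run(self):
        while True:
            face_url, filename = face_queue.get()  # blocks until work arrives
            try:
                urllib.request.urlretrieve(face_url, 'images2/' + filename)
            finally:
                face_queue.task_done()  # lets face_queue.join() wait for completion
```

The producer side would then call `face_queue.put((src, title))` instead of appending to two lists under a lock, which also keeps each URL paired with its name.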
That wraps up the crawler series for now.