Artur Chukhrai for SerpApi


Scrape Brave Videos with Python

Intro

Currently, we don't have an API that supports extracting data from Brave Search.

This blog post shows how you can do it yourself with the DIY solution provided below while we work on releasing a proper API.

The solution can be used for personal use, as it doesn't include the Legal US Shield that we offer for our paid production and above plans, and it has limitations such as the need to bypass blocks, for example, CAPTCHA.

You can check our public roadmap to track the progress for this API: [New API] Brave Search

What will be scraped

[Image: Brave video results that will be scraped]

📌 Note: Sometimes there may be no videos in the organic search results. This blog post gets videos from both the organic results and the videos tab.

What is Brave Search

The previous Brave blog post already described what Brave Search is. To avoid duplicating content, that information is not repeated in this blog post.

Full Code

If you don't need an explanation, have a look at the full code example in the online IDE.

from bs4 import BeautifulSoup
import requests, lxml, json, re

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'dune 2021',       # query
    'source': 'web',        # source
    'tf': 'at',             # publish time (at - any time, pd - past day, pw - past week, pm - past month)
    'length': 'all',        # duration (short, medium, long)
    'resolution': 'all'     # resolution (1080p, 720p, 480p, 360p)
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}


def scrape_organic_videos():
    html = requests.get('https://search.brave.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    data = []

    for result in soup.select('#video-carousel .card'):
        title = result.select_one('.title').get_text()
        link = result.get('href')
        source = result.select_one('.anchor').get_text().strip()
        date = result.select_one('.text-xs').get_text().strip()
        favicon = result.select_one('.favicon').get('src')
        # https://regex101.com/r/7OA1FS/1
        thumbnail = re.search(r"background-image:\surl\('(.*)'\)", result.select_one('.img-bg').get('style')).group(1)
        video_duration = (
            result.select_one('.duration').get_text()
            if result.select_one('.duration')
            else None
        )

        data.append({
            'title': title,
            'link': link,
            'source': source,
            'date': date,
            'favicon': favicon,
            'thumbnail': thumbnail,
            'video_duration': video_duration
        })

    return data


def scrape_tab_videos():
    html = requests.get('https://search.brave.com/videos', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    data = []

    for result in soup.select('.card'):
        title = result.select_one('.title').get_text()
        link = result.select_one('a').get('href')
        source = result.select_one('.center-horizontally .ellipsis').get_text().strip()
        date = result.select_one('#results span').get_text().strip()
        favicon = result.select_one('.favicon').get('src')
        # https://regex101.com/r/7OA1FS/1
        thumbnail = re.search(r"background-image:\surl\('(.*)'\)", result.select_one('.img-bg').get('style')).group(1)
        creator = (
            result.select_one('.creator').get_text().strip()
            if result.select_one('.creator')
            else None
        )
        views = (
            result.select_one('.stat').get_text().strip()
            if result.select_one('.stat')
            else None
        )
        video_duration = (
            result.select_one('.duration').get_text()
            if result.select_one('.duration')
            else None
        )

        data.append({
            'title': title,
            'link': link,
            'source': source,
            'creator': creator,
            'date': date,
            'views': views,
            'favicon': favicon,
            'thumbnail': thumbnail,
            'video_duration': video_duration,
        })

    return data


if __name__ == "__main__":
    # brave_organic_videos = scrape_organic_videos()
    # print(json.dumps(brave_organic_videos, indent=2, ensure_ascii=False))

    brave_tab_videos = scrape_tab_videos()
    print(json.dumps(brave_tab_videos, indent=2, ensure_ascii=False))

Preparation

Install libraries:

pip install requests lxml beautifulsoup4

Basic knowledge scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.

If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Reduce the chance of being blocked

Make sure you're passing a user-agent in the request headers to act as a "real" user visit. The default requests user-agent is python-requests, so websites understand that it's most likely a script that sends the request. Check what's your user-agent.

There's a how to reduce the chance of being blocked while web scraping blog post that can get you familiar with basic and more advanced approaches.
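To see why the header matters, the minimal sketch below prints the user-agent that requests would send by default next to a browser-like one. It makes no network call. `requests.utils.default_user_agent()` is a helper from the requests library, and the `headers` dictionary simply mirrors the one used in this post.

```python
import requests

# The User-Agent requests sends when none is provided, e.g. "python-requests/2.x.y".
default_ua = requests.utils.default_user_agent()

# Browser-like User-Agent, the same value used in this post.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}

print('default:', default_ua)
print('custom: ', headers['User-Agent'])
```

A custom user-agent only lowers the chance of being blocked; it does not guarantee that a request gets through.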

Code Explanation

Import libraries:

from bs4 import BeautifulSoup
import requests, lxml, json, re
| Library | Purpose |
|---------|---------|
| BeautifulSoup | to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. |
| requests | to make a request to the website. |
| lxml | to process XML/HTML documents fast. |
| json | to convert extracted data to a JSON string. |
| re | to extract parts of the data via regular expression. |

Create URL parameters and request headers:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    'q': 'dune 2021',       # query
    'source': 'web',        # source
    'tf': 'at',             # publish time (at - any time, pd - past day, pw - past week, pm - past month)
    'length': 'all',        # duration (short, medium, long)
    'resolution': 'all'     # resolution (1080p, 720p, 480p, 360p)
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
}
| Code | Explanation |
|------|-------------|
| params | a prettier way of passing URL parameters to a request. |
| user-agent | to act as a "real" user request from the browser by passing it to the request headers. The default requests user-agent is python-requests, so websites might understand that it's a bot or a script and block the request to the website. Check what's your user-agent. |
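To make the role of `params` concrete, here is a small sketch of how such a dictionary turns into a query string. It uses the standard library's `urlencode()` rather than requests itself; requests performs a comparable encoding step internally, so treat this as an illustration and not as the library's exact code path.

```python
from urllib.parse import urlencode

params = {
    'q': 'dune 2021',
    'source': 'web',
    'tf': 'at',
    'length': 'all',
    'resolution': 'all',
}

# urlencode() escapes each value and joins the pairs with "&"; spaces become "+".
query_string = urlencode(params)
url = f'https://search.brave.com/videos?{query_string}'

print(url)
```

Passing the dictionary instead of building the URL by hand keeps the escaping of spaces and special characters out of your own code.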

Scrape organic videos

This function scrapes all organic video data from the https://search.brave.com/search URL and returns a list with all results.

You need to make a request and pass the created request parameters and headers. The returned HTML is then passed to BeautifulSoup:

html = requests.get('https://search.brave.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
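| Code | Explanation |
|------|-------------|
| timeout=30 | an optional requests.get() argument, not passed in the snippet above, that stops waiting for a response after 30 seconds. |
| BeautifulSoup() | where the returned HTML data is processed by bs4. |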
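The request in this post is sent without a timeout or a status check. As a hedged sketch of how both could be added, the hypothetical helper `fetch_html()` below (the name is not part of the original code) passes `timeout` to `requests.get()` and calls `raise_for_status()` before returning the HTML.

```python
import requests


def fetch_html(url, params, headers, timeout=30):
    """Return the response body as text, or raise on a timeout or HTTP error."""
    response = requests.get(url, params=params, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text
```

With this helper, `fetch_html('https://search.brave.com/search', params, headers)` could stand in for the `requests.get(...)` call, and the returned text would be passed to `BeautifulSoup()` as before.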

Create the data list to store all videos:

data = []

To extract the necessary data, you need to find the selector where it is located. In our case, this is the #video-carousel .card selector, which contains all organic videos. You need to iterate over each video in a loop:

for result in soup.select('#video-carousel .card'):
    # data extraction will be here

To extract the data, you need to find the matching selectors. SelectorGadget was used to grab CSS selectors. I want to demonstrate how the selector selection process works:

[Image: selecting organic video elements with SelectorGadget]

After the selectors are found, we need to get the corresponding text or attribute value. Note that the thumbnail is extracted in a different way. The desired image link is hidden inside the style attribute. Extracting this link with plain string methods would take a lot of operations. Instead, you can use a regular expression to extract the required data:

title = result.select_one('.title').get_text()
link = result.get('href')
source = result.select_one('.anchor').get_text().strip()
date = result.select_one('.text-xs').get_text().strip()
favicon = result.select_one('.favicon').get('src')
# https://regex101.com/r/7OA1FS/1
thumbnail = re.search(r"background-image:\surl\('(.*)'\)", result.select_one('.img-bg').get('style')).group(1)
video_duration = (
    result.select_one('.duration').get_text()
    if result.select_one('.duration')
    else None
)
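The thumbnail regular expression can be tried on its own, without any request. The sketch below applies the pattern from this post to a made-up style value and wraps it in a hypothetical helper, `extract_thumbnail()`, which returns None instead of raising when the style attribute is missing or has no background image.

```python
import re

# Pattern from this post: captures whatever sits between url(' and ').
THUMBNAIL_PATTERN = re.compile(r"background-image:\surl\('(.*)'\)")


def extract_thumbnail(style):
    """Return the URL inside background-image: url('...'), or None if absent."""
    if not style:
        return None
    match = THUMBNAIL_PATTERN.search(style)
    return match.group(1) if match else None


# Made-up style value for illustration only.
sample_style = "background-image: url('https://example.com/thumb.jpg'); height: 100px"

print(extract_thumbnail(sample_style))     # https://example.com/thumb.jpg
print(extract_thumbnail('height: 100px'))  # None
print(extract_thumbnail(None))             # None
```

The None checks matter because calling `.group(1)` directly on a failed `re.search()` raises an AttributeError.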

📌 Note: When extracting video_duration, a ternary expression is used so that the value is extracted if the element is available and set to None otherwise.

| Code | Explanation |
|------|-------------|
| select_one()/select() | to run a CSS selector against a parsed document; select_one() returns the first matching element (or None) and select() returns all matching elements. |
| get_text() | to get textual data from the node. |
| get(<attribute>) | to get attribute data from the node. |
| strip() | to return a copy of the string with leading and trailing whitespace removed. |
| search() | to search for a pattern in a string and return the corresponding match object. |
| group() | to extract the found element from the match object. |
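The methods listed above can be tried on a small inline document. The sketch below uses made-up markup that only imitates the structure described in this post (real Brave markup may differ) and the built-in 'html.parser' so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; it only imitates the carousel structure.
html = """
<div id="video-carousel">
  <a class="card" href="https://example.com/a"><div class="title"> First </div><div class="duration">03:28</div></a>
  <a class="card" href="https://example.com/b"><div class="title">Second</div></a>
</div>
"""

# 'html.parser' ships with Python; the post itself uses 'lxml'.
soup = BeautifulSoup(html, 'html.parser')

cards = soup.select('#video-carousel .card')  # list with every match
first_title = soup.select_one('.title')       # first match only
missing = soup.select_one('.does-not-exist')  # None when nothing matches

print(len(cards))                              # 2
print(first_title.get_text().strip())          # First
print(cards[0].get('href'))                    # https://example.com/a
print(missing)                                 # None
```

Because `select_one()` returns None when nothing matches, chaining `.get_text()` on it without a check raises an AttributeError, which is why optional fields are guarded in this post.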

After the data from an item is retrieved, it is appended to the data list:

data.append({
    'title': title,
    'link': link,
    'source': source,
    'date': date,
    'favicon': favicon,
    'thumbnail': thumbnail,
    'video_duration': video_duration
})

The complete function to scrape organic videos would look like this:

def scrape_organic_videos():
    html = requests.get('https://search.brave.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    data = []

    for result in soup.select('#video-carousel .card'):
        title = result.select_one('.title').get_text()
        link = result.get('href')
        source = result.select_one('.anchor').get_text().strip()
        date = result.select_one('.text-xs').get_text().strip()
        favicon = result.select_one('.favicon').get('src')
        # https://regex101.com/r/7OA1FS/1
        thumbnail = re.search(r"background-image:\surl\('(.*)'\)", result.select_one('.img-bg').get('style')).group(1)
        video_duration = (
            result.select_one('.duration').get_text()
            if result.select_one('.duration')
            else None
        )

        data.append({
            'title': title,
            'link': link,
            'source': source,
            'date': date,
            'favicon': favicon,
            'thumbnail': thumbnail,
            'video_duration': video_duration
        })

    return data

Output:

[
  {
    "title": "Dune | Official Main Trailer - YouTube",
    "link": "https://www.youtube.com/watch?v=8g18jFHCLXk",
    "source": "youtube.com",
    "date": "July 22, 2021",
    "favicon": "https://imgs.search.brave.com/Ux4Hee4evZhvjuTKwtapBycOGjGDci2Gvn2pbSzvbC0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvOTkyZTZiMWU3/YzU3Nzc5YjExYzUy/N2VhZTIxOWNlYjM5/ZGVjN2MyZDY4Nzdh/ZDYzMTYxNmI5N2Rk/Y2Q3N2FkNy93d3cu/eW91dHViZS5jb20v",
    "thumbnail": "https://imgs.search.brave.com/E6_Wv3qlA5iqnRkLZILt8tq-lLKCHVwJESItayT5jro/rs:fit:200:200:1/g:ce/aHR0cHM6Ly9pLnl0/aW1nLmNvbS92aS84/ZzE4akZIQ0xYay9t/YXhyZXNkZWZhdWx0/LmpwZw",
    "video_duration": "03:28"
  },
  {
    "title": "Dune Official Trailer - YouTube",
    "link": "https://www.youtube.com/watch?v=n9xhJrPXop4",
    "source": "youtube.com",
    "date": "September 9, 2020",
    "favicon": "https://imgs.search.brave.com/Ux4Hee4evZhvjuTKwtapBycOGjGDci2Gvn2pbSzvbC0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvOTkyZTZiMWU3/YzU3Nzc5YjExYzUy/N2VhZTIxOWNlYjM5/ZGVjN2MyZDY4Nzdh/ZDYzMTYxNmI5N2Rk/Y2Q3N2FkNy93d3cu/eW91dHViZS5jb20v",
    "thumbnail": "https://imgs.search.brave.com/uNV6ho7lr6Z-67_BgPPOp56rj-aVny1loaiYzGyLwQk/rs:fit:200:200:1/g:ce/aHR0cHM6Ly9pLnl0/aW1nLmNvbS92aS9u/OXhoSnJQWG9wNC9t/YXhyZXNkZWZhdWx0/LmpwZw",
    "video_duration": "03:05"
  },
  {
    "title": "DUNE – FINAL TRAILER - YouTube",
    "link": "https://www.youtube.com/watch?v=w0HgHet0sxg",
    "source": "youtube.com",
    "date": "October 7, 2021",
    "favicon": "https://imgs.search.brave.com/Ux4Hee4evZhvjuTKwtapBycOGjGDci2Gvn2pbSzvbC0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvOTkyZTZiMWU3/YzU3Nzc5YjExYzUy/N2VhZTIxOWNlYjM5/ZGVjN2MyZDY4Nzdh/ZDYzMTYxNmI5N2Rk/Y2Q3N2FkNy93d3cu/eW91dHViZS5jb20v",
    "thumbnail": "https://imgs.search.brave.com/-irGOLELOj8B0YDVJU5dHgpWsd8nSx2l3yrVARUlv0E/rs:fit:200:200:1/g:ce/aHR0cHM6Ly9pLnl0/aW1nLmNvbS92aS93/MEhnSGV0MHN4Zy9t/YXhyZXNkZWZhdWx0/LmpwZw",
    "video_duration": "02:29"
  },
  ... other videos
]

Scrape tab videos

This function scrapes all video data from the videos tab at the https://search.brave.com/videos URL and returns a list with all results.

You need to make a request and pass the created request parameters and headers. The returned HTML is then passed to BeautifulSoup:

html = requests.get('https://search.brave.com/videos', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

Create the data list to store all videos:

data = []

To retrieve data from all videos on the page, you need to find the .card selector of the items. You need to iterate over each item in a loop:

for result in soup.select('.card'):
    # data extraction will be here

On this page, the matching selectors are different, so SelectorGadget was used again to grab CSS selectors. I want to demonstrate how the selector selection process works:

[Image: selecting videos tab elements with SelectorGadget]

The difference in this function is that here you can also get a creator and views:

title = result.select_one('.title').get_text()
link = result.select_one('a').get('href')
source = result.select_one('.center-horizontally .ellipsis').get_text().strip()
date = result.select_one('#results span').get_text().strip()
favicon = result.select_one('.favicon').get('src')
# https://regex101.com/r/7OA1FS/1
thumbnail = re.search(r"background-image:\surl\('(.*)'\)", result.select_one('.img-bg').get('style')).group(1)
creator = (
    result.select_one('.creator').get_text().strip()
    if result.select_one('.creator')
    else None
)
views = (
    result.select_one('.stat').get_text().strip()
    if result.select_one('.stat')
    else None
)
video_duration = (
    result.select_one('.duration').get_text()
    if result.select_one('.duration')
    else None
)

📌 Note: When extracting creator, views and video_duration, a ternary expression is used so that each value is extracted if the element is available and set to None otherwise.
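The same ternary is repeated for every optional field, and each one runs the selector twice. One possible way to factor that out is a small helper; `text_or_none()` below is a hypothetical name, not part of the original code, and the card markup is made up for illustration.

```python
from bs4 import BeautifulSoup


def text_or_none(node, selector):
    """Return the stripped text of the first match for selector, or None."""
    element = node.select_one(selector)
    return element.get_text().strip() if element is not None else None


# Made-up card markup for illustration; real Brave markup may differ.
card = BeautifulSoup(
    '<div class="card"><span class="creator"> IGN </span><span class="duration">04:31</span></div>',
    'html.parser',
)

print(text_or_none(card, '.creator'))   # IGN
print(text_or_none(card, '.stat'))      # None
print(text_or_none(card, '.duration'))  # 04:31
```

Inside the loop this would read, for example, `creator = text_or_none(result, '.creator')`. Note that the helper always strips whitespace, whereas the original code does not strip video_duration.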

After the data from an item is retrieved, it is appended to the data list:

data.append({
    'title': title,
    'link': link,
    'source': source,
    'creator': creator,
    'date': date,
    'views': views,
    'favicon': favicon,
    'thumbnail': thumbnail,
    'video_duration': video_duration,
})

The complete function to scrape tab videos would look like this:

def scrape_tab_videos():
    html = requests.get('https://search.brave.com/videos', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')

    data = []

    for result in soup.select('.card'):
        title = result.select_one('.title').get_text()
        link = result.select_one('a').get('href')
        source = result.select_one('.center-horizontally .ellipsis').get_text().strip()
        date = result.select_one('#results span').get_text().strip()
        favicon = result.select_one('.favicon').get('src')
        # https://regex101.com/r/7OA1FS/1
        thumbnail = re.search(r"background-image:\surl\('(.*)'\)", result.select_one('.img-bg').get('style')).group(1)
        creator = (
            result.select_one('.creator').get_text().strip()
            if result.select_one('.creator')
            else None
        )
        views = (
            result.select_one('.stat').get_text().strip()
            if result.select_one('.stat')
            else None
        )
        video_duration = (
            result.select_one('.duration').get_text()
            if result.select_one('.duration')
            else None
        )

        data.append({
            'title': title,
            'link': link,
            'source': source,
            'creator': creator,
            'date': date,
            'views': views,
            'favicon': favicon,
            'thumbnail': thumbnail,
            'video_duration': video_duration,
        })

    return data

Output:

[
  {
    "title": "Dune (2021)",
    "link": "https://www.imdb.com/title/tt1160419/",
    "source": "IMDB",
    "creator": null,
    "date": "28 Jan, 2010",
    "views": null,
    "favicon": "https://imgs.search.brave.com/_XzIkQDCEJ7aNlT3HlNUHBRcj5nQ9R4TiU4cHpSn7BY/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvZmU3MjU1MmUz/MDhkYjY0OGFlYzY3/ZDVlMmQ4NWZjZDhh/NzZhOGZlZjNjNGE5/M2M0OWI1Y2M2ZjQy/MWE5ZDc3OC93d3cu/aW1kYi5jb20v",
    "thumbnail": "https://imgs.search.brave.com/zHiJ3yZ-f7a99EkHYp8nB2BD0XvWk5fKq-dcukd5Jro/rs:fit:235:225:1/g:ce/aHR0cHM6Ly90c2U0/Lm1tLmJpbmcubmV0/L3RoP2lkPU9WUC43/WUM5TEhVTGxFaEFm/VGNVNzZqZGRBRmVJ/SSZwaWQ9QXBp",
    "video_duration": "03:05"
  },
  {
    "title": "Dune Review (2021)",
    "link": "https://www.youtube.com/watch?v=DqquKCvOxwA",
    "source": "YouTube",
    "creator": "IGN",
    "date": "03 Sep, 2021",
    "views": "568.14K",
    "favicon": "https://imgs.search.brave.com/Ux4Hee4evZhvjuTKwtapBycOGjGDci2Gvn2pbSzvbC0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvOTkyZTZiMWU3/YzU3Nzc5YjExYzUy/N2VhZTIxOWNlYjM5/ZGVjN2MyZDY4Nzdh/ZDYzMTYxNmI5N2Rk/Y2Q3N2FkNy93d3cu/eW91dHViZS5jb20v",
    "thumbnail": "https://imgs.search.brave.com/Eru5tXsCCm42JOqGxc3XsNe8RPs0_Fk1Bs0AVvmpDQE/rs:fit:640:225:1/g:ce/aHR0cHM6Ly90c2Ux/Lm1tLmJpbmcubmV0/L3RoP2lkPU9WUC5q/TU5XYmtTdGFHalJY/X0JSQm1mUV9RSGdG/byZwaWQ9QXBp",
    "video_duration": "04:31"
  },
  {
    "title": "DUNE Trailer 2 (2021)",
    "link": "https://www.youtube.com/watch?v=LG7QhzmavZg",
    "source": "YouTube",
    "creator": "KinoCheck.com",
    "date": "22 Jul, 2021",
    "views": "485.53K",
    "favicon": "https://imgs.search.brave.com/Ux4Hee4evZhvjuTKwtapBycOGjGDci2Gvn2pbSzvbC0/rs:fit:32:32:1/g:ce/aHR0cDovL2Zhdmlj/b25zLnNlYXJjaC5i/cmF2ZS5jb20vaWNv/bnMvOTkyZTZiMWU3/YzU3Nzc5YjExYzUy/N2VhZTIxOWNlYjM5/ZGVjN2MyZDY4Nzdh/ZDYzMTYxNmI5N2Rk/Y2Q3N2FkNy93d3cu/eW91dHViZS5jb20v",
    "thumbnail": "https://imgs.search.brave.com/XHCBNZHpAHNxvYqWYgfhYKKinSHXdYjI3e5HtsgVjcg/rs:fit:640:225:1/g:ce/aHR0cHM6Ly90c2Uy/Lm1tLmJpbmcubmV0/L3RoP2lkPU9WUC5q/VXZ2aE1Md1hfWUtM/OHZsU0l3SFJRSGdG/byZwaWQ9QXBp",
    "video_duration": "03:37"
  },
  ... other videos
]
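The full code prints the results with `json.dumps(..., ensure_ascii=False)`. The short sketch below shows what that argument changes, using one shortened record shaped like the output in this post: with `ensure_ascii=False` non-ASCII characters such as the dash in the title stay readable, while the default escapes them. In both cases Python's None becomes JSON null.

```python
import json

# One record shaped like the output in this post (fields shortened for illustration).
videos = [{'title': 'DUNE – FINAL TRAILER - YouTube', 'video_duration': '02:29', 'views': None}]

readable = json.dumps(videos, ensure_ascii=False)  # keeps the dash as-is
escaped = json.dumps(videos)                       # default: escapes it as \u2013

print(readable)
print(escaped)
```

Both strings decode back to the same data with `json.loads()`, so the choice only affects how the output reads.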

Links

Join us on Twitter | YouTube

Add a Feature Request 💫 or a Bug 🐞
