πŸ’‘ Top 70 Web Scraping Operations in Python

I. Making HTTP Requests (requests)

β€’ Import the library.
import requests

β€’ Make a GET request to a URL.
response = requests.get('http://example.com')

β€’ Check the response status code (200 is OK).
print(response.status_code)

β€’ Access the raw HTML content (as bytes).
html_bytes = response.content

β€’ Access the HTML content (as a string).
html_text = response.text

β€’ Access response headers.
print(response.headers)

β€’ Send a custom User-Agent header.
headers = {'User-Agent': 'My Cool Scraper 1.0'}
response = requests.get('http://example.com', headers=headers)

β€’ Pass URL parameters in a request.
params = {'q': 'python scraping'}
response = requests.get('https://www.google.com/search', params=params)

β€’ Make a POST request with form data.
payload = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('http://httpbin.org/post', data=payload)

β€’ Handle potential request errors.
try:
    response = requests.get('http://example.com', timeout=5)
    response.raise_for_status()  # Raise an exception for bad status codes (4xx/5xx)
except requests.exceptions.RequestException as e:
    print(f"An error occurred: {e}")


II. Parsing HTML with BeautifulSoup (Setup & Navigation)

β€’ Import the library.
from bs4 import BeautifulSoup

β€’ Create a BeautifulSoup object from HTML text.
soup = BeautifulSoup(html_text, 'html.parser')

β€’ Prettify the parsed HTML for readability.
print(soup.prettify())

β€’ Access a tag directly by name (gets the first one).
title_tag = soup.title

β€’ Navigate to a tag's parent.
title_parent = soup.title.parent

β€’ Get an iterable of a tag's children.
for child in soup.head.children:
    print(child.name)

β€’ Get the next sibling tag.
first_p = soup.find('p')
next_p = first_p.find_next_sibling('p')

β€’ Get the previous sibling tag.
second_p = soup.find_all('p')[1]
prev_p = second_p.find_previous_sibling('p')
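
The navigation calls above assume a parsed soup object; here is a tiny self-contained sketch (the sample_html string is made up purely for illustration).
from bs4 import BeautifulSoup

sample_html = '<html><head><title>Demo</title></head><body><p>First</p><p>Second</p></body></html>'
soup = BeautifulSoup(sample_html, 'html.parser')
print(soup.title.parent.name)                     # head
first_p = soup.find('p')
print(first_p.find_next_sibling('p').get_text())  # Second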


III. Finding Elements with BeautifulSoup

β€’ Find the first occurrence of a tag.
first_link = soup.find('a')

β€’ Find all occurrences of a tag.
all_links = soup.find_all('a')

β€’ Find tags by their CSS class.
articles = soup.find_all('div', class_='article-content')

β€’ Find a tag by its ID.
main_content = soup.find(id='main-container')

β€’ Find tags by other attributes.
images = soup.find_all('img', attrs={'data-src': True})

β€’ Find using a list of multiple tags.
headings = soup.find_all(['h1', 'h2', 'h3'])

β€’ Find using a regular expression.
import re
links_with_blog = soup.find_all('a', href=re.compile(r'blog'))

β€’ Find using a custom function.
# Finds tags with a 'class' but no 'id'
tags = soup.find_all(lambda tag: tag.has_attr('class') and not tag.has_attr('id'))

β€’ Limit the number of results.
first_five_links = soup.find_all('a', limit=5)

β€’ Use CSS Selectors to find one element.
footer = soup.select_one('#footer > p')

β€’ Use CSS Selectors to find all matching elements.
article_links = soup.select('div.article a')

β€’ Select direct children using CSS selector.
nav_items = soup.select('ul.nav > li')


IV. Extracting Data with BeautifulSoup

β€’ Get the text content from a tag.
title_text = soup.title.get_text()

β€’ Get stripped text content.
link_text = soup.find('a').get_text(strip=True)

β€’ Get all text from the entire document.
all_text = soup.get_text()

β€’ Get an attribute's value (like a URL).
link_url = soup.find('a')['href']

β€’ Get the tag's name.
tag_name = soup.find('h1').name

β€’ Get all attributes of a tag as a dictionary.
attrs_dict = soup.find('img').attrs
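
A short end-to-end sketch tying Sections I-IV together: fetch a page, parse it, and collect every link's text and URL (example.com stands in for a real site).
import requests
from bs4 import BeautifulSoup

response = requests.get('http://example.com', timeout=5)
soup = BeautifulSoup(response.text, 'html.parser')

links = []
for a in soup.find_all('a', href=True):
    links.append({'text': a.get_text(strip=True), 'url': a['href']})
print(links)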


V. Parsing with lxml and XPath

β€’ Import the library.
from lxml import html

β€’ Parse HTML content with lxml.
tree = html.fromstring(response.content)

β€’ Select elements using an XPath expression.
# Selects <a> tags that are direct children of <div class="nav"> elements
links = tree.xpath('//div[@class="nav"]/a')

β€’ Select text content directly with XPath.
# Gets the text of all <h1> tags
h1_texts = tree.xpath('//h1/text()')

β€’ Select an attribute value with XPath.
# Gets all href attributes from <a> tags
hrefs = tree.xpath('//a/@href')
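
XPath can also be run relative to an element you have already selected (note the leading dot). The div.product card structure below is a hypothetical example.
from lxml import html

tree = html.fromstring(response.content)
for card in tree.xpath('//div[@class="product"]'):
    names = card.xpath('.//a/text()')   # relative to this card, not the whole document
    hrefs = card.xpath('.//a/@href')
    print(names, hrefs)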


VI. Handling Dynamic Content (Selenium)

β€’ Import the webdriver.
from selenium import webdriver

β€’ Initialize a browser driver.
driver = webdriver.Chrome()  # Selenium 4.6+ downloads a matching driver automatically; older versions need chromedriver on PATH

β€’ Navigate to a webpage.
driver.get('http://example.com')

β€’ Find an element by its ID.
element = driver.find_element('id', 'my-element-id')

β€’ Find elements by CSS Selector.
elements = driver.find_elements('css selector', 'div.item')

β€’ Find an element by XPath.
button = driver.find_element('xpath', '//button[@type="submit"]')

β€’ Click a button.
button.click()

β€’ Enter text into an input field.
search_box = driver.find_element('name', 'q')
search_box.send_keys('Python Selenium')

β€’ Wait for an element to become visible.
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "myDynamicElement"))
)

β€’ Get the page source after JavaScript has executed.
dynamic_html = driver.page_source

β€’ Close the browser window.
driver.quit()
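
A sketch combining the Selenium steps above: start a headless browser, wait for the page, then hand the rendered HTML to BeautifulSoup. The --headless=new flag assumes a recent Chrome (older versions use --headless), and example.com is a stand-in URL.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
try:
    driver.get('http://example.com')
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, 'body'))
    )
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    print(soup.title.get_text())
finally:
    driver.quit()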


VII. Common Tasks & Best Practices

β€’ Handle pagination by finding the "Next" link.
next_page_url = soup.find('a', string='Next')['href']
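
A fuller pagination loop, sketched against a hypothetical listing site (the 'Next' link text and URL structure are assumptions; urljoin handles relative hrefs).
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup

url = 'http://example.com/listings'
while url:
    soup = BeautifulSoup(requests.get(url, timeout=5).text, 'html.parser')
    # ... extract the items you need from this page here ...
    next_link = soup.find('a', string='Next')
    url = urljoin(url, next_link['href']) if next_link else None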

β€’ Save data to a CSV file.
import csv
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])
    # writer.writerow([title, url]) in a loop
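
Filling in the commented loop above: a sketch that writes a collected list of (title, url) pairs (the rows list here is illustrative).
import csv

rows = [('Example Domain', 'http://example.com')]  # gathered while scraping
with open('data.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['Title', 'Link'])
    writer.writerows(rows)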

β€’ Save data to CSV using pandas.
import pandas as pd
df = pd.DataFrame(data, columns=['Title', 'Link'])  # 'data' is a list of [title, link] rows
df.to_csv('data.csv', index=False)

β€’ Use a proxy with requests.
proxies = {'http': 'http://10.10.1.10:3128', 'https': 'http://10.10.1.10:1080'}
requests.get('http://example.com', proxies=proxies)

β€’ Pause between requests to be polite.
import time
time.sleep(2) # Pause for 2 seconds

β€’ Handle JSON data from an API.
json_response = requests.get('https://api.example.com/data').json()

β€’ Download a file (like an image).
img_url = 'http://example.com/image.jpg'
img_data = requests.get(img_url).content
with open('image.jpg', 'wb') as handler:
    handler.write(img_data)

β€’ Parse a sitemap.xml to find all URLs.
# Get the sitemap.xml file and parse it like any other XML/HTML to extract <loc> tags.
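
A sketch of that idea (the sitemap URL is a stand-in, and the 'xml' parser requires lxml to be installed).
import requests
from bs4 import BeautifulSoup

sitemap_xml = requests.get('http://example.com/sitemap.xml', timeout=5).text
soup = BeautifulSoup(sitemap_xml, 'xml')
urls = [loc.get_text() for loc in soup.find_all('loc')]
print(len(urls), 'URLs found')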


VIII. Advanced Frameworks (Scrapy)

β€’ Create a Scrapy spider (conceptual command).
scrapy genspider example example.com
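
Roughly the skeleton that command generates (details vary slightly by Scrapy version).
import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ['https://example.com']

    def parse(self, response):
        pass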

β€’ Define a parse method to process the response.
# In your spider class:
def parse(self, response):
    # parsing logic here
    pass

β€’ Extract data using Scrapy's CSS selectors.
titles = response.css('h1::text').getall()

β€’ Extract data using Scrapy's XPath selectors.
links = response.xpath('//a/@href').getall()

β€’ Yield a dictionary of scraped data.
yield {'title': response.css('title::text').get()}

β€’ Follow a link to parse the next page.
next_page = response.css('li.next a::attr(href)').get()
if next_page is not None:
    yield response.follow(next_page, callback=self.parse)

β€’ Run a spider from the command line.
scrapy crawl example -o output.json

β€’ Pass arguments to a spider.
scrapy crawl example -a category=books

β€’ Create a Scrapy Item for structured data.
import scrapy
class ProductItem(scrapy.Item):
    name = scrapy.Field()
    price = scrapy.Field()

β€’ Use an Item Loader to populate Items.
from scrapy.loader import ItemLoader
loader = ItemLoader(item=ProductItem(), response=response)
loader.add_css('name', 'h1.product-name::text')
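
A complete minimal spider tying the Scrapy pieces above together. The selectors assume a quotes.toscrape.com-style layout (div.quote, span.text, small.author, li.next); swap them for your target's markup. Run it with: scrapy crawl quotes -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['https://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }
        next_page = response.css('li.next a::attr(href)').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)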


#Python #WebScraping #BeautifulSoup #Selenium #Requests

━━━━━━━━━━━━━━━
By: @DataScienceN ✨
πŸ”₯ Trending Repository: nocobase

πŸ“ Description: NocoBase is the most extensible AI-powered no-code/low-code platform for building business applications and enterprise solutions.

πŸ”— Repository URL: https://github.com/nocobase/nocobase

🌐 Website: https://www.nocobase.com

πŸ“– Readme: https://github.com/nocobase/nocobase#readme

πŸ“Š Statistics:
🌟 Stars: 17.7K stars
πŸ‘€ Watchers: 147
🍴 Forks: 2K forks

πŸ’» Programming Languages: TypeScript - JavaScript - Smarty - Shell - Dockerfile - Less

🏷️ Related Topics:
#internal_tools #crud #crm #admin_dashboard #self_hosted #web_application #project_management #salesforce #developer_tools #airtable #workflows #low_code #no_code #app_builder #internal_tool #nocode #low_code_development_platform #no_code_platform #low_code_platform #low_code_framework


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: alertmanager

πŸ“ Description: Prometheus Alertmanager

πŸ”— Repository URL: https://github.com/prometheus/alertmanager

🌐 Website: https://prometheus.io

πŸ“– Readme: https://github.com/prometheus/alertmanager#readme

πŸ“Š Statistics:
🌟 Stars: 7.3K stars
πŸ‘€ Watchers: 166
🍴 Forks: 2.3K forks

πŸ’» Programming Languages: Go - Elm - HTML - Makefile - TypeScript - JavaScript

🏷️ Related Topics:
#notifications #slack #monitoring #email #pagerduty #alertmanager #hacktoberfest #deduplication #opsgenie


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: gopeed

πŸ“ Description: A modern download manager that supports all platforms. Built with Golang and Flutter.

πŸ”— Repository URL: https://github.com/GopeedLab/gopeed

🌐 Website: https://gopeed.com

πŸ“– Readme: https://github.com/GopeedLab/gopeed#readme

πŸ“Š Statistics:
🌟 Stars: 21K stars
πŸ‘€ Watchers: 167
🍴 Forks: 1.5K forks

πŸ’» Programming Languages: Dart - Go - C++ - CMake - Swift - Ruby

🏷️ Related Topics:
#android #windows #macos #golang #http #ios #torrent #downloader #debian #bittorrent #cross_platform #ubuntu #https #flutter #magnet


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: vertex-ai-creative-studio

πŸ“ Description: GenMedia Creative Studio is a Vertex AI generative media user experience highlighting the use of Imagen, Veo, Gemini 🍌, Gemini TTS, Chirp 3, Lyria and other generative media APIs on Google Cloud.

πŸ”— Repository URL: https://github.com/GoogleCloudPlatform/vertex-ai-creative-studio

πŸ“– Readme: https://github.com/GoogleCloudPlatform/vertex-ai-creative-studio#readme

πŸ“Š Statistics:
🌟 Stars: 512 stars
πŸ‘€ Watchers: 19
🍴 Forks: 200 forks

πŸ’» Programming Languages: Jupyter Notebook - Python - TypeScript - Go - JavaScript - Shell

🏷️ Related Topics:
#google_cloud #gemini #chirp #imagen #veo #lyria #vertex_ai #nano_banana


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: Parabolic

πŸ“ Description: Download web video and audio

πŸ”— Repository URL: https://github.com/NickvisionApps/Parabolic

🌐 Website: https://flathub.org/apps/details/org.nickvision.tubeconverter

πŸ“– Readme: https://github.com/NickvisionApps/Parabolic#readme

πŸ“Š Statistics:
🌟 Stars: 4.1K stars
πŸ‘€ Watchers: 28
🍴 Forks: 188 forks

πŸ’» Programming Languages: C++ - CMake - Python - Inno Setup - C - CSS

🏷️ Related Topics:
#music #windows #downloader #youtube #qt #cpp #youtube_dl #gnome #videos #flathub #gtk4 #yt_dlp #libadwaita


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: localstack

πŸ“ Description: πŸ’» A fully functional local AWS cloud stack. Develop and test your cloud & Serverless apps offline

πŸ”— Repository URL: https://github.com/localstack/localstack

🌐 Website: https://localstack.cloud

πŸ“– Readme: https://github.com/localstack/localstack#readme

πŸ“Š Statistics:
🌟 Stars: 61.1K stars
πŸ‘€ Watchers: 514
🍴 Forks: 4.3K forks

πŸ’» Programming Languages: Python - Shell - Makefile - ANTLR - JavaScript - Java

🏷️ Related Topics:
#python #testing #aws #cloud #continuous_integration #developer_tools #localstack


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: go-sdk

πŸ“ Description: The official Go SDK for Model Context Protocol servers and clients. Maintained in collaboration with Google.

πŸ”— Repository URL: https://github.com/modelcontextprotocol/go-sdk

πŸ“– Readme: https://github.com/modelcontextprotocol/go-sdk#readme

πŸ“Š Statistics:
🌟 Stars: 2.7K stars
πŸ‘€ Watchers: 39
🍴 Forks: 249 forks

πŸ’» Programming Languages: Go

🏷️ Related Topics: Not available

==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: rachoon

πŸ“ Description: 🦝 Rachoon β€” A self-hostable way to handle invoices

πŸ”— Repository URL: https://github.com/ad-on-is/rachoon

πŸ“– Readme: https://github.com/ad-on-is/rachoon#readme

πŸ“Š Statistics:
🌟 Stars: 292 stars
πŸ‘€ Watchers: 4
🍴 Forks: 14 forks

πŸ’» Programming Languages: TypeScript - Vue - HTML - SCSS - Dockerfile - JavaScript - Shell

🏷️ Related Topics: Not available

==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: Kotatsu

πŸ“ Description: Manga reader for Android

πŸ”— Repository URL: https://github.com/KotatsuApp/Kotatsu

🌐 Website: https://kotatsu.app

πŸ“– Readme: https://github.com/KotatsuApp/Kotatsu#readme

πŸ“Š Statistics:
🌟 Stars: 7.2K stars
πŸ‘€ Watchers: 72
🍴 Forks: 366 forks

πŸ’» Programming Languages: Kotlin

🏷️ Related Topics:
#android #manga #comics #mangareader #manga_reader #webtoon


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: ggml

πŸ“ Description: Tensor library for machine learning

πŸ”— Repository URL: https://github.com/ggml-org/ggml

πŸ“– Readme: https://github.com/ggml-org/ggml#readme

πŸ“Š Statistics:
🌟 Stars: 13.4K stars
πŸ‘€ Watchers: 141
🍴 Forks: 1.4K forks

πŸ’» Programming Languages: C++ - C - Cuda - Metal - GLSL - CMake

🏷️ Related Topics:
#machine_learning #automatic_differentiation #tensor_algebra #large_language_models


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: asm-lessons

πŸ“ Description: FFMPEG Assembly Language Lessons

πŸ”— Repository URL: https://github.com/FFmpeg/asm-lessons

πŸ“– Readme: https://github.com/FFmpeg/asm-lessons#readme

πŸ“Š Statistics:
🌟 Stars: 9.7K stars
πŸ‘€ Watchers: 153
🍴 Forks: 288 forks

πŸ’» Programming Languages: Not available

🏷️ Related Topics: Not available

==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: lima

πŸ“ Description: Linux virtual machines, with a focus on running containers

πŸ”— Repository URL: https://github.com/lima-vm/lima

🌐 Website: https://lima-vm.io/

πŸ“– Readme: https://github.com/lima-vm/lima#readme

πŸ“Š Statistics:
🌟 Stars: 18.4K stars
πŸ‘€ Watchers: 83
🍴 Forks: 722 forks

πŸ’» Programming Languages: Go - Shell - Makefile - Perl - HTML - SCSS

🏷️ Related Topics:
#macos #vm #qemu #containerd


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: mcp

πŸ“ Description: AWS MCP Servers β€” helping you get the most out of AWS, wherever you use MCP.

πŸ”— Repository URL: https://github.com/awslabs/mcp

🌐 Website: https://awslabs.github.io/mcp/

πŸ“– Readme: https://github.com/awslabs/mcp#readme

πŸ“Š Statistics:
🌟 Stars: 7K stars
πŸ‘€ Watchers: 68
🍴 Forks: 1K forks

πŸ’» Programming Languages: Python - Shell - Dockerfile - HTML - TypeScript - Jinja

🏷️ Related Topics:
#aws #mcp #mcp_servers #mcp_server #modelcontextprotocol #mcp_client #mcp_tools #mcp_host #mcp_clients


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: strix

πŸ“ Description: ✨ Open-source AI hackers for your apps πŸ‘¨πŸ»β€πŸ’»

πŸ”— Repository URL: https://github.com/usestrix/strix

🌐 Website: https://usestrix.com/

πŸ“– Readme: https://github.com/usestrix/strix#readme

πŸ“Š Statistics:
🌟 Stars: 3K stars
πŸ‘€ Watchers: 38
🍴 Forks: 394 forks

πŸ’» Programming Languages: Python - Jinja - Dockerfile

🏷️ Related Topics:
#artificial_intelligence #cybersecurity #penetration_testing #agents #llm #generative_ai


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: frigate

πŸ“ Description: NVR with realtime local object detection for IP cameras

πŸ”— Repository URL: https://github.com/blakeblackshear/frigate

🌐 Website: https://frigate.video

πŸ“– Readme: https://github.com/blakeblackshear/frigate#readme

πŸ“Š Statistics:
🌟 Stars: 26.8K stars
πŸ‘€ Watchers: 218
🍴 Forks: 2.5K forks

πŸ’» Programming Languages: TypeScript - Python - CSS - Shell - Dockerfile - JavaScript

🏷️ Related Topics:
#home_automation #mqtt #ai #camera #rtsp #tensorflow #nvr #realtime #home_assistant #homeautomation #object_detection #google_coral


==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: gumroad

πŸ“ Description: Sell stuff and see what sticks

πŸ”— Repository URL: https://github.com/antiwork/gumroad

🌐 Website: https://gumroad.com

πŸ“– Readme: https://github.com/antiwork/gumroad#readme

πŸ“Š Statistics:
🌟 Stars: 7.4K stars
πŸ‘€ Watchers: 50
🍴 Forks: 1.4K forks

πŸ’» Programming Languages: Ruby - TypeScript - HTML - SCSS - Shell - JavaScript

🏷️ Related Topics: Not available

==================================
🧠 By: https://www.tgoop.com/DataScienceM
πŸ”₯ Trending Repository: code-server

πŸ“ Description: VS Code in the browser

πŸ”— Repository URL: https://github.com/coder/code-server

🌐 Website: https://coder.com

πŸ“– Readme: https://github.com/coder/code-server#readme

πŸ“Š Statistics:
🌟 Stars: 74.6K stars
πŸ‘€ Watchers: 734
🍴 Forks: 6.3K forks

πŸ’» Programming Languages: TypeScript - Shell - HTML - CSS - HCL - JavaScript

🏷️ Related Topics:
#ide #vscode #development_environment #remote_work #dev_tools #browser_ide #vscode_remote


==================================
🧠 By: https://www.tgoop.com/DataScienceM