NumPy (Numerical Python) is like a super-powered calculator for Python. Think of it as:
- Regular Python lists = Basic calculator
- NumPy arrays = Scientific calculator with advanced functions
Why NumPy?
- Faster: often 10-100x faster than equivalent pure-Python loops for numerical work
- Less memory: stores numbers far more compactly than Python lists (see the quick comparison after this list)
- More features: Mathematical functions, linear algebra, statistics
- Foundation: Used by pandas, matplotlib, scikit-learn, and more
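To get a feel for the speed and memory claims, here is a small, informal comparison you can run yourself. Exact numbers depend on your Python build and platform, and the list measurement over-counts slightly because small integers are shared objects.
import sys
import numpy as np
py_list = list(range(1000))
np_array = np.arange(1000)
# Memory: a list stores full Python int objects (plus pointers); an array stores raw machine integers
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)
print("List (approx.):", list_bytes, "bytes")   # on the order of tens of kilobytes
print("Array:", np_array.nbytes, "bytes")       # 8000 bytes with int64 (4000 with int32)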
Installation #
pip install numpy
1. Getting Started #
Importing NumPy #
import numpy as np # Standard convention
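You can confirm the installation and see which version you have:
print(np.__version__)  # prints the installed version string, e.g. 1.26.x (yours will vary)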
Your First NumPy Array #
# From Python list
my_list = [1, 2, 3, 4, 5]
my_array = np.array(my_list)
print(my_array) # [1 2 3 4 5]
print(type(my_array)) # <class 'numpy.ndarray'>
# Direct creation
arr = np.array([10, 20, 30, 40, 50])
print(arr) # [10 20 30 40 50]
Real-world analogy: Like converting a shopping list (Python list) into a spreadsheet (NumPy array) for better organization and calculations.
2. Creating Arrays #
Different Ways to Create Arrays #
# 1. From lists
arr1d = np.array([1, 2, 3]) # 1D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]]) # 2D array
# 2. Zeros and Ones
zeros = np.zeros(5) # [0. 0. 0. 0. 0.]
ones = np.ones((2, 3)) # 2x3 array of ones
"""
[[1. 1. 1.]
[1. 1. 1.]]
"""
# 3. Range of numbers
range_arr = np.arange(0, 10, 2) # [0 2 4 6 8] (start, stop, step)
linspace_arr = np.linspace(0, 10, 5) # [0. 2.5 5. 7.5 10.] (evenly spaced)
# 4. Random numbers
random_arr = np.random.random(5) # 5 random numbers between 0-1
random_int = np.random.randint(1, 10, 5) # 5 random integers between 1-9
# 5. Identity matrix
identity = np.eye(3) # 3x3 identity matrix
"""
[[1. 0. 0.]
[0. 1. 0.]
[0. 0. 1.]]
"""
# 6. Full arrays
full_arr = np.full((2, 3), 7) # 2x3 array filled with 7
"""
[[7 7 7]
[7 7 7]]
"""
Real-world Examples: #
# Student grades for 3 subjects, 4 students
grades = np.array([
[85, 90, 78], # Student 1
[92, 88, 91], # Student 2
[76, 82, 89], # Student 3
[88, 85, 87] # Student 4
])
# Monthly sales data (12 months)
sales = np.array([15000, 18000, 22000, 25000, 23000, 27000,
30000, 28000, 31000, 29000, 26000, 24000])
# Temperature readings (7 days, 4 times per day)
temperatures = np.random.uniform(20, 35, (7, 4)) # Random temps 20-35°C
3. Array Properties and Attributes #
arr = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
# Shape - dimensions of array
print(arr.shape) # (3, 4) - 3 rows, 4 columns
# Size - total number of elements
print(arr.size) # 12
# Dimensions
print(arr.ndim) # 2 (2D array)
# Data type
print(arr.dtype) # int64 (or int32 depending on system)
# Memory usage
print(arr.nbytes) # bytes used
# Example with real data
student_scores = np.array([[85, 90, 78, 92],
[88, 76, 85, 89],
[92, 88, 91, 87]])
print(f"Class size: {student_scores.shape[0]} students") # 3 students
print(f"Number of tests: {student_scores.shape[1]} tests") # 4 tests
print(f"Total scores recorded: {student_scores.size}") # 12 scores
4. Array Indexing and Slicing #
1D Array Indexing #
arr = np.array([10, 20, 30, 40, 50])
# Positive indexing
print(arr[0]) # 10 (first element)
print(arr[2]) # 30 (third element)
# Negative indexing
print(arr[-1]) # 50 (last element)
print(arr[-2]) # 40 (second to last)
# Slicing [start:stop:step]
print(arr[1:4]) # [20 30 40]
print(arr[:3]) # [10 20 30] (first 3)
print(arr[2:]) # [30 40 50] (from index 2 to end)
print(arr[::2]) # [10 30 50] (every 2nd element)
2D Array Indexing #
grades = np.array([[85, 90, 78], # Student 1
[92, 88, 91], # Student 2
[76, 82, 89]]) # Student 3
# Access specific element [row, column]
print(grades[0, 1]) # 90 (Student 1, Subject 2)
print(grades[2, 0]) # 76 (Student 3, Subject 1)
# Access entire row
print(grades[1]) # [92 88 91] (All grades for Student 2)
# Access entire column
print(grades[:, 0]) # [85 92 76] (Subject 1 for all students)
# Slicing ranges
print(grades[0:2, 1:3]) # First 2 students, subjects 2-3
"""
[[90 78]
[88 91]]
"""
# Real-world example: Quarterly sales by region
sales_data = np.array([
[15000, 18000, 22000], # Region 1 (Q1, Q2, Q3)
[20000, 23000, 25000], # Region 2
[12000, 15000, 18000], # Region 3
[25000, 28000, 30000] # Region 4
])
# Q2 sales for all regions
q2_sales = sales_data[:, 1]
print("Q2 Sales:", q2_sales) # [18000 23000 15000 28000]
# Region 1 and 2 sales for all quarters
top_regions = sales_data[0:2, :]
print("Top 2 regions:\n", top_regions)
Boolean Indexing #
scores = np.array([85, 92, 76, 88, 91, 78, 95, 82])
# Find scores above 85
high_scores = scores[scores > 85]
print("High scores:", high_scores) # [92 88 91 95]
# Multiple conditions
good_scores = scores[(scores >= 80) & (scores <= 90)]
print("Good scores (80-90):", good_scores) # [85 88 82]
# Real example: Filter temperatures
temperatures = np.array([22, 35, 28, 41, 19, 33, 25, 38])
hot_days = temperatures[temperatures > 30]
print("Hot days (>30°C):", hot_days) # [35 41 33 38]
5. Array Operations #
Arithmetic Operations #
# Element-wise operations
a = np.array([1, 2, 3, 4])
b = np.array([5, 6, 7, 8])
print(a + b) # [6 8 10 12]
print(a - b) # [-4 -4 -4 -4]
print(a * b) # [5 12 21 32]
print(a / b) # [0.2 0.33 0.43 0.5]
print(a ** 2) # [1 4 9 16] (square)
# Operations with scalars
arr = np.array([10, 20, 30])
print(arr + 5) # [15 25 35]
print(arr * 2) # [20 40 60]
print(arr / 10) # [1. 2. 3.]
# Real example: Price calculations
prices = np.array([100, 250, 75, 180])
tax_rate = 0.08
# Add tax
final_prices = prices * (1 + tax_rate)
print("Prices with tax:", final_prices) # [108. 270. 81. 194.4]
# Apply discount
discount = 0.1
discounted_prices = prices * (1 - discount)
print("Discounted prices:", discounted_prices) # [90. 225. 67.5 162.]
Mathematical Functions #
arr = np.array([1, 4, 9, 16, 25])
# Square root
print(np.sqrt(arr)) # [1. 2. 3. 4. 5.]
# Logarithms
print(np.log(arr)) # Natural log
print(np.log10(arr)) # Base 10 log
# Trigonometric functions
angles = np.array([0, 30, 45, 60, 90]) * np.pi / 180 # Convert to radians
print(np.sin(angles)) # Sine values
print(np.cos(angles)) # Cosine values
# Statistical functions
data = np.array([10, 15, 20, 25, 30, 35, 40])
print(f"Mean: {np.mean(data)}") # 25.0
print(f"Median: {np.median(data)}") # 25.0
print(f"Std Dev: {np.std(data)}") # 10.0
print(f"Min: {np.min(data)}") # 10
print(f"Max: {np.max(data)}") # 40
# Real example: Test scores analysis
test_scores = np.array([78, 85, 92, 88, 76, 91, 83, 87, 90, 79])
print(f"Class average: {np.mean(test_scores):.1f}") # 84.9
print(f"Highest score: {np.max(test_scores)}") # 92
print(f"Lowest score: {np.min(test_scores)}") # 76
print(f"Standard deviation: {np.std(test_scores):.1f}") # 5.7
6. Array Reshaping and Manipulation #
Reshaping Arrays #
# Create 1D array
arr = np.arange(12) # [0 1 2 3 4 5 6 7 8 9 10 11]
# Reshape to different dimensions
arr_2d = arr.reshape(3, 4) # 3 rows, 4 columns
print(arr_2d)
"""
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
"""
arr_3d = arr.reshape(2, 2, 3) # 2 layers, 2 rows, 3 columns
print(arr_3d.shape) # (2, 2, 3)
# Flatten back to 1D
flattened = arr_2d.flatten()
print(flattened) # [0 1 2 3 4 5 6 7 8 9 10 11]
# Real example: Image data (pixels)
# Imagine 24 pixel values for a 6x4 image
pixels = np.arange(24)
image = pixels.reshape(6, 4) # 6 rows, 4 columns
print("Image shape:", image.shape) # (6, 4)
Joining and Splitting Arrays #
# Concatenation
arr1 = np.array([1, 2, 3])
arr2 = np.array([4, 5, 6])
# Horizontal concatenation
horizontal = np.concatenate([arr1, arr2])
print(horizontal) # [1 2 3 4 5 6]
# For 2D arrays
matrix1 = np.array([[1, 2], [3, 4]])
matrix2 = np.array([[5, 6], [7, 8]])
# Vertical stacking (row-wise)
v_stack = np.vstack([matrix1, matrix2])
print("Vertical stack:\n", v_stack)
"""
[[1 2]
[3 4]
[5 6]
[7 8]]
"""
# Horizontal stacking (column-wise)
h_stack = np.hstack([matrix1, matrix2])
print("Horizontal stack:\n", h_stack)
"""
[[1 2 5 6]
[3 4 7 8]]
"""
# Splitting arrays
big_array = np.arange(10) # [0 1 2 3 4 5 6 7 8 9]
split_arrays = np.split(big_array, 5) # Split into 5 parts
print("Split arrays:", split_arrays)
# Real example: Combining quarterly sales
q1_sales = np.array([15000, 18000, 20000])
q2_sales = np.array([22000, 25000, 23000])
q3_sales = np.array([27000, 30000, 28000])
q4_sales = np.array([31000, 29000, 26000])
# Combine all quarters
yearly_sales = np.vstack([q1_sales, q2_sales, q3_sales, q4_sales])
print("Yearly sales by quarter:\n", yearly_sales)
7. Broadcasting #
Broadcasting lets NumPy apply element-wise operations to arrays of different shapes, provided the shapes are compatible: comparing dimensions from the right, each pair must either be equal or be 1.
# Example 1: Array + Scalar
arr = np.array([[1, 2, 3],
[4, 5, 6]])
result = arr + 10 # Adds 10 to every element
print(result)
"""
[[11 12 13]
[14 15 16]]
"""
# Example 2: Different shaped arrays
matrix = np.array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
row_vector = np.array([10, 20, 30])
result = matrix + row_vector # Adds row_vector to each row
print(result)
"""
[[11 22 33]
[14 25 36]
[17 28 39]]
"""
# Real example: Apply different discounts to products
prices = np.array([[100, 200, 150], # Electronics
[50, 75, 25], # Books
[30, 40, 35]]) # Food
# Different discount rates for each category
discounts = np.array([0.1, 0.05, 0.15]) # 10%, 5%, 15%
# Apply discounts (broadcasting)
discounted_prices = prices * (1 - discounts.reshape(-1, 1))
print("Discounted prices:\n", discounted_prices)
8. Advanced Array Operations #
Sorting #
# 1D sorting
scores = np.array([85, 92, 76, 88, 91, 78, 95, 82])
sorted_scores = np.sort(scores)
print("Sorted scores:", sorted_scores) # [76 78 82 85 88 91 92 95]
# Get indices that would sort the array
sort_indices = np.argsort(scores)
print("Sort indices:", sort_indices) # [2 5 7 0 3 4 1 6]
# 2D sorting
grades = np.array([[85, 90, 78],
[92, 88, 91],
[76, 82, 89]])
# Sort along axis (0=rows, 1=columns)
sorted_by_column = np.sort(grades, axis=0) # Sort each column
print("Sorted by column:\n", sorted_by_column)
# Real example: Student rankings
student_names = np.array(['Alice', 'Bob', 'Charlie', 'Diana'])
final_scores = np.array([88, 92, 76, 95])
# Get ranking (highest to lowest)
ranking_indices = np.argsort(final_scores)[::-1] # Reverse for descending
ranked_students = student_names[ranking_indices]
ranked_scores = final_scores[ranking_indices]
print("Student Rankings:")
for i, (student, score) in enumerate(zip(ranked_students, ranked_scores)):
print(f"{i+1}. {student}: {score}")
Unique Values and Counting #
# Find unique values
grades = np.array(['A', 'B', 'A', 'C', 'B', 'A', 'B', 'C', 'A'])
unique_grades = np.unique(grades)
print("Unique grades:", unique_grades) # ['A' 'B' 'C']
# Count occurrences
unique_grades, counts = np.unique(grades, return_counts=True)
print("Grade counts:")
for grade, count in zip(unique_grades, counts):
print(f"Grade {grade}: {count} students")
# Real example: Survey responses
responses = np.array([5, 4, 5, 3, 4, 5, 2, 4, 5, 3, 4, 5])
unique_responses, counts = np.unique(responses, return_counts=True)
print("Survey Results:")
for rating, count in zip(unique_responses, counts):
print(f"Rating {rating}: {count} responses")
Conditional Operations #
# Where function
scores = np.array([85, 92, 76, 88, 91, 78, 95, 82])
# Assign letter grades: A (>=90), B (80-89), C (<80)
letter_grades = np.where(scores >= 90, 'A',
np.where(scores >= 80, 'B', 'C'))
print("Letter grades:", letter_grades)
# Real example: Temperature classification
temperatures = np.array([22, 35, 28, 41, 19, 33, 25, 38])
weather = np.where(temperatures > 35, 'Hot',
np.where(temperatures > 25, 'Warm', 'Cool'))
print("Weather conditions:", weather)
# Count conditions
hot_days = np.sum(temperatures > 35)
warm_days = np.sum((temperatures > 25) & (temperatures <= 35))
cool_days = np.sum(temperatures <= 25)
print(f"Hot days: {hot_days}, Warm days: {warm_days}, Cool days: {cool_days}")
9. Linear Algebra with NumPy #
Matrix Operations #
# Matrix multiplication
A = np.array([[1, 2],
[3, 4]])
B = np.array([[5, 6],
[7, 8]])
# Dot product (matrix multiplication)
result = np.dot(A, B)
# or result = A @ B
print("Matrix multiplication:\n", result)
"""
[[19 22]
[43 50]]
"""
# Transpose
print("Transpose of A:\n", A.T)
"""
[[1 3]
[2 4]]
"""
# Determinant
det_A = np.linalg.det(A)
print("Determinant of A:", det_A) # -2.0
# Inverse
inv_A = np.linalg.inv(A)
print("Inverse of A:\n", inv_A)
# Eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Eigenvalues:", eigenvalues)
print("Eigenvectors:\n", eigenvectors)
Solving Linear Systems #
# Solve system: 2x + 3y = 7, x - y = 1
# In matrix form: AX = B
A = np.array([[2, 3],
[1, -1]])
B = np.array([7, 1])
# Solve for X
X = np.linalg.solve(A, B)
print("Solution:", X) # [2. 1.] means x=2, y=1
# Verify solution
print("Verification:", np.dot(A, X)) # Should equal B
10. Working with Real Data #
Statistical Analysis #
# Simulate sales data for 12 months, 3 products
np.random.seed(42) # For reproducible results
sales_data = np.random.normal(1000, 200, (12, 3)) # mean=1000, std=200
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']
products = ['Product A', 'Product B', 'Product C']
# Monthly analysis
monthly_totals = np.sum(sales_data, axis=1) # Sum across products
best_month_idx = np.argmax(monthly_totals)
worst_month_idx = np.argmin(monthly_totals)
print(f"Best month: {months[best_month_idx]} (${monthly_totals[best_month_idx]:.0f})")
print(f"Worst month: {months[worst_month_idx]} (${monthly_totals[worst_month_idx]:.0f})")
# Product analysis
product_totals = np.sum(sales_data, axis=0) # Sum across months
best_product_idx = np.argmax(product_totals)
print(f"Best product: {products[best_product_idx]} (${product_totals[best_product_idx]:.0f})")
# Growth analysis
monthly_growth = np.diff(monthly_totals) / monthly_totals[:-1] * 100
avg_growth = np.mean(monthly_growth)
print(f"Average monthly growth: {avg_growth:.1f}%")
Data Cleaning and Processing #
# Simulate sensor data with some missing/invalid values
sensor_data = np.array([22.5, 23.1, -999, 24.2, 23.8, 25.1, -999, 24.5, 23.9])
# Clean data: replace -999 (invalid readings) with NaN
cleaned_data = np.where(sensor_data == -999, np.nan, sensor_data)
# Remove NaN values for calculations
valid_data = cleaned_data[~np.isnan(cleaned_data)]
print(f"Original data points: {len(sensor_data)}")
print(f"Valid data points: {len(valid_data)}")
print(f"Average temperature: {np.mean(valid_data):.1f}°C")
print(f"Temperature range: {np.max(valid_data) - np.min(valid_data):.1f}°C")
# Fill missing values with interpolation (simple average of neighbors)
for i in range(len(cleaned_data)):
if np.isnan(cleaned_data[i]):
# Use average of previous and next valid values
prev_val = cleaned_data[i-1] if i > 0 and not np.isnan(cleaned_data[i-1]) else np.mean(valid_data)
next_val = cleaned_data[i+1] if i < len(cleaned_data)-1 and not np.isnan(cleaned_data[i+1]) else np.mean(valid_data)
cleaned_data[i] = (prev_val + next_val) / 2
print("Cleaned data:", cleaned_data)
11. Performance Tips and Best Practices #
Memory Efficiency #
# Use appropriate data types
small_integers = np.array([1, 2, 3, 4, 5], dtype=np.int8) # 1 byte per element
large_integers = np.array([1, 2, 3, 4, 5], dtype=np.int64) # 8 bytes per element
print(f"int8 array uses: {small_integers.nbytes} bytes")
print(f"int64 array uses: {large_integers.nbytes} bytes")
# Use views instead of copies when possible
original = np.arange(1000000)
view = original[::2] # Every 2nd element (creates view, not copy)
copy = original[::2].copy() # Creates actual copy
print("View shares memory with original:", np.shares_memory(original, view))
print("Copy shares memory with original:", np.shares_memory(original, copy))
Vectorization vs Loops #
import time
# Bad: Using Python loops
def slow_calculation(arr):
result = []
for x in arr:
result.append(x**2 + 2*x + 1)
return np.array(result)
# Good: Using NumPy vectorization
def fast_calculation(arr):
return arr**2 + 2*arr + 1
# Test performance
large_array = np.random.random(100000)
# Time the slow version
start = time.time()
slow_result = slow_calculation(large_array)
slow_time = time.time() - start
# Time the fast version
start = time.time()
fast_result = fast_calculation(large_array)
fast_time = time.time() - start
print(f"Slow version: {slow_time:.4f} seconds")
print(f"Fast version: {fast_time:.4f} seconds")
print(f"Speedup: {slow_time/fast_time:.1f}x faster")
12. Common Patterns and Recipes #
Moving Averages #
def moving_average(data, window_size):
"""Calculate moving average with given window size"""
return np.convolve(data, np.ones(window_size)/window_size, mode='valid')
# Stock price data (simulated)
stock_prices = np.array([100, 102, 98, 105, 103, 107, 109, 106, 108, 110, 112, 108])
# 3-day moving average
ma_3 = moving_average(stock_prices, 3)
print("3-day moving average:", ma_3.round(2))
# 5-day moving average
ma_5 = moving_average(stock_prices, 5)
print("5-day moving average:", ma_5.round(2))
Normalization and Standardization #
# Sample test scores
scores = np.array([78, 85, 92, 88, 76, 91, 83, 87, 90, 79])
# Min-Max normalization (0 to 1)
normalized = (scores - np.min(scores)) / (np.max(scores) - np.min(scores))
print("Normalized scores:", normalized.round(3))
# Z-score standardization (mean=0, std=1)
standardized = (scores - np.mean(scores)) / np.std(scores)
print("Standardized scores:", standardized.round(3))
# Check results
print(f"Standardized mean: {np.mean(standardized):.6f}") # Should be ~0
print(f"Standardized std: {np.std(standardized):.6f}") # Should be ~1
Finding Peaks and Valleys #
def find_peaks(data, threshold=0):
"""Find local maxima in data"""
peaks = []
for i in range(1, len(data)-1):
if data[i] > data[i-1] and data[i] > data[i+1] and data[i] > threshold:
peaks.append(i)
return np.array(peaks)
# Sample signal data
signal = np.array([1, 3, 2, 5, 4, 6, 3, 8, 2, 4, 1])
peak_indices = find_peaks(signal)
peak_values = signal[peak_indices]
print("Peak indices:", peak_indices)
print("Peak values:", peak_values)
13. Integration with Other Libraries #
With Pandas #
# Convert between NumPy and Pandas
grades_array = np.array([[85, 90, 78],
[92, 88, 91],
[76, 82, 89]])
# If you have pandas installed:
# import pandas as pd
# grades_df = pd.DataFrame(grades_array,
# columns=['Math', 'Science', 'English'],
# index=['Alice', 'Bob', 'Charlie'])
#
# # Convert back to NumPy
# back_to_numpy = grades_df.to_numpy()  # (grades_df.values also works in older pandas)
With Matplotlib (Visualization) #
# If you have matplotlib installed:
# import matplotlib.pyplot as plt
#
# # Generate sample data
# x = np.linspace(0, 10, 100)
# y = np.sin(x)
#
# # Create plot
# plt.figure(figsize=(10, 6))
# plt.plot(x, y)
# plt.title('Sine Wave')
# plt.xlabel('x')
# plt.ylabel('sin(x)')
# plt.grid(True)
# plt.show()
14. Common Errors and Solutions #
Shape Mismatch Errors #
# Common error: Shape mismatch
try:
a = np.array([[1, 2], [3, 4]]) # (2, 2)
b = np.array([1, 2, 3]) # (3,)
result = a + b # This will fail
except ValueError as e:
print("Error:", e)
print("Solution: Make sure shapes are compatible for broadcasting")
# Fix: Reshape or use compatible arrays
b_fixed = np.array([1, 2]) # (2,) - compatible with (2, 2)
result = a + b_fixed
print("Fixed result:\n", result)
Index Out of Bounds #
arr = np.array([1, 2, 3, 4, 5])
# Safe indexing
def safe_index(array, index):
if 0 <= index < len(array):
return array[index]
else:
print(f"Index {index} is out of bounds for array of length {len(array)}")
return None
# Test safe indexing
print(safe_index(arr, 2)) # 3 (valid)
print(safe_index(arr, 10)) # None (invalid)
Data Type Issues #
# Integer overflow
small_int = np.array([100], dtype=np.int8)  # Range: -128 to 127
result = small_int * 2  # 200 does not fit in int8, so the value wraps around
print("Overflow result:", result)  # [-56] - no exception is raised, the result is silently wrong
# Solution: Use appropriate data type
large_int = np.array([100], dtype=np.int32)
result = large_int * 2
print("Correct result:", result) # [200]
# Division by zero
arr = np.array([1, 2, 0, 4])
result = np.divide(10, arr, out=np.zeros_like(arr, dtype=float), where=(arr!=0))
print("Safe division:", result) # [10. 5. 0. 2.5]
15. Advanced Topics #
Memory Layout and Performance #
# Row-major vs Column-major order
arr_c = np.array([[1, 2, 3], [4, 5, 6]], order='C') # C-style (row-major)
arr_f = np.array([[1, 2, 3], [4, 5, 6]], order='F') # Fortran-style (column-major)
print("C-order flags:", arr_c.flags)
print("F-order flags:", arr_f.flags)
# Performance difference for different access patterns
import time
large_matrix = np.random.random((1000, 1000))
# Row-wise access (efficient for C-order)
start = time.time()
for i in range(1000):
row_sum = np.sum(large_matrix[i, :])
row_time = time.time() - start
# Column-wise access
start = time.time()
for j in range(1000):
col_sum = np.sum(large_matrix[:, j])
col_time = time.time() - start
print(f"Row-wise access: {row_time:.4f}s")
print(f"Column-wise access: {col_time:.4f}s")
Custom Data Types #
# Define structured array (like a database record)
student_dtype = np.dtype([
('name', 'U20'), # Unicode string, max 20 chars
('age', 'i4'), # 32-bit integer
('grades', 'f4', (3,)), # Array of 3 32-bit floats
('passed', '?') # Boolean
])
# Create structured array
students = np.array([
('Alice', 20, [85.5, 90.0, 78.5], True),
('Bob', 19, [76.0, 82.5, 88.0], True),
('Charlie', 21, [65.0, 70.5, 72.0], False)
], dtype=student_dtype)
print("Student names:", students['name'])
print("Average grades:", np.mean(students['grades'], axis=1))
print("Passed students:", students[students['passed']]['name'])
Advanced Indexing #
# Fancy indexing with arrays
arr = np.array([10, 20, 30, 40, 50])
indices = np.array([0, 2, 4])
selected = arr[indices]
print("Selected elements:", selected) # [10 30 50]
# 2D fancy indexing
matrix = np.array([[1, 2, 3, 4],
[5, 6, 7, 8],
[9, 10, 11, 12]])
# Select specific elements
rows = np.array([0, 1, 2])
cols = np.array([1, 2, 3])
diagonal_like = matrix[rows, cols]
print("Selected elements:", diagonal_like) # [2 7 12]
# Boolean indexing with multiple conditions
data = np.random.randint(1, 100, 20)
complex_condition = (data > 20) & (data < 80) & (data % 2 == 0)
filtered_data = data[complex_condition]
print("Filtered data:", filtered_data)
16. Practical Projects #
Project 1: Grade Analysis System #
class GradeAnalyzer:
def __init__(self, grades, student_names, subject_names):
self.grades = np.array(grades)
self.students = np.array(student_names)
self.subjects = np.array(subject_names)
def student_averages(self):
"""Calculate average grade for each student"""
return np.mean(self.grades, axis=1)
def subject_averages(self):
"""Calculate average grade for each subject"""
return np.mean(self.grades, axis=0)
def top_students(self, n=3):
"""Get top n students by average"""
averages = self.student_averages()
top_indices = np.argsort(averages)[-n:][::-1]
return self.students[top_indices], averages[top_indices]
def failing_students(self, threshold=60):
"""Find students with any grade below threshold"""
failing_mask = np.any(self.grades < threshold, axis=1)
return self.students[failing_mask]
def grade_distribution(self):
"""Analyze grade distribution"""
flat_grades = self.grades.flatten()
a_grades = np.sum(flat_grades >= 90)
b_grades = np.sum((flat_grades >= 80) & (flat_grades < 90))
c_grades = np.sum((flat_grades >= 70) & (flat_grades < 80))
d_grades = np.sum((flat_grades >= 60) & (flat_grades < 70))
f_grades = np.sum(flat_grades < 60)
return {
'A': a_grades, 'B': b_grades, 'C': c_grades,
'D': d_grades, 'F': f_grades
}
# Example usage
grades_data = [
[85, 90, 78, 92], # Alice
[76, 82, 88, 85], # Bob
[92, 95, 89, 94], # Charlie
[68, 72, 75, 70], # Diana
[95, 98, 92, 96] # Eve
]
analyzer = GradeAnalyzer(
grades_data,
['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
['Math', 'Science', 'English', 'History']
)
print("Student Averages:")
for student, avg in zip(analyzer.students, analyzer.student_averages()):
print(f"{student}: {avg:.1f}")
print("\nTop 3 Students:")
top_students, top_scores = analyzer.top_students(3)
for student, score in zip(top_students, top_scores):
print(f"{student}: {score:.1f}")
print("\nGrade Distribution:", analyzer.grade_distribution())
Project 2: Weather Data Analysis #
class WeatherAnalyzer:
def __init__(self, temperatures, dates):
self.temperatures = np.array(temperatures)
self.dates = np.array(dates)
def temperature_stats(self):
"""Basic temperature statistics"""
return {
'mean': np.mean(self.temperatures),
'median': np.median(self.temperatures),
'std': np.std(self.temperatures),
'min': np.min(self.temperatures),
'max': np.max(self.temperatures),
'range': np.max(self.temperatures) - np.min(self.temperatures)
}
def find_extremes(self):
"""Find hottest and coldest days"""
hot_day_idx = np.argmax(self.temperatures)
cold_day_idx = np.argmin(self.temperatures)
return {
'hottest_day': self.dates[hot_day_idx],
'hottest_temp': self.temperatures[hot_day_idx],
'coldest_day': self.dates[cold_day_idx],
'coldest_temp': self.temperatures[cold_day_idx]
}
def temperature_trends(self, window=7):
"""Calculate moving average trends"""
if len(self.temperatures) < window:
return None
moving_avg = np.convolve(
self.temperatures,
np.ones(window)/window,
mode='valid'
)
# Calculate trend (positive = warming, negative = cooling)
trend = np.polyfit(range(len(moving_avg)), moving_avg, 1)[0]
return {
'moving_average': moving_avg,
'trend': trend,
'trend_description': 'Warming' if trend > 0 else 'Cooling'
}
def heat_wave_analysis(self, threshold=30, min_duration=3):
"""Detect heat waves (consecutive days above threshold)"""
hot_days = self.temperatures > threshold
heat_waves = []
i = 0
while i < len(hot_days):
if hot_days[i]:
start = i
while i < len(hot_days) and hot_days[i]:
i += 1
duration = i - start
if duration >= min_duration:
heat_waves.append({
'start_date': self.dates[start],
'end_date': self.dates[i-1],
'duration': duration,
'max_temp': np.max(self.temperatures[start:i]),
'avg_temp': np.mean(self.temperatures[start:i])
})
else:
i += 1
return heat_waves
# Example usage
# Generate sample weather data for 30 days
np.random.seed(42)
base_temp = 25
seasonal_variation = 5 * np.sin(np.linspace(0, 2*np.pi, 30))
daily_variation = np.random.normal(0, 3, 30)
temperatures = base_temp + seasonal_variation + daily_variation
dates = [f"2024-06-{day:02d}" for day in range(1, 31)]
weather = WeatherAnalyzer(temperatures, dates)
print("Temperature Statistics:")
stats = weather.temperature_stats()
for key, value in stats.items():
print(f"{key.title()}: {value:.1f}°C")
print("\nExtreme Days:")
extremes = weather.find_extremes()
print(f"Hottest: {extremes['hottest_day']} ({extremes['hottest_temp']:.1f}°C)")
print(f"Coldest: {extremes['coldest_day']} ({extremes['coldest_temp']:.1f}°C)")
print("\nTrend Analysis:")
trends = weather.temperature_trends()
if trends:
print(f"Overall trend: {trends['trend_description']} ({trends['trend']:.2f}°C/day)")
print("\nHeat Wave Analysis:")
heat_waves = weather.heat_wave_analysis(28, 2)
for i, wave in enumerate(heat_waves):
print(f"Heat Wave {i+1}: {wave['start_date']} to {wave['end_date']}")
print(f" Duration: {wave['duration']} days, Max: {wave['max_temp']:.1f}°C")
Project 3: Financial Portfolio Analysis #
class PortfolioAnalyzer:
def __init__(self, prices, stock_names):
self.prices = np.array(prices) # Shape: (days, stocks)
self.stocks = np.array(stock_names)
def daily_returns(self):
"""Calculate daily returns for each stock"""
return np.diff(self.prices, axis=0) / self.prices[:-1] * 100
def volatility(self):
"""Calculate volatility (standard deviation of returns)"""
returns = self.daily_returns()
return np.std(returns, axis=0)
def cumulative_returns(self):
"""Calculate cumulative returns from start"""
return (self.prices / self.prices[0] - 1) * 100
def correlation_matrix(self):
"""Calculate correlation between stocks"""
returns = self.daily_returns()
return np.corrcoef(returns.T)
def portfolio_performance(self, weights):
"""Calculate portfolio performance with given weights"""
weights = np.array(weights)
if not np.isclose(np.sum(weights), 1.0):
raise ValueError("Weights must sum to 1.0")
returns = self.daily_returns()
portfolio_returns = np.dot(returns, weights)
return {
'daily_returns': portfolio_returns,
'total_return': np.sum(portfolio_returns),
'volatility': np.std(portfolio_returns),
'sharpe_ratio': np.mean(portfolio_returns) / np.std(portfolio_returns) if np.std(portfolio_returns) > 0 else 0
}
def risk_metrics(self):
"""Calculate risk metrics for each stock"""
returns = self.daily_returns()
# Value at Risk (95% confidence)
var_95 = np.percentile(returns, 5, axis=0)
# Maximum drawdown
cumulative = self.cumulative_returns()
peak = np.maximum.accumulate(cumulative, axis=0)
drawdown = (cumulative - peak)
max_drawdown = np.min(drawdown, axis=0)
return {
'var_95': var_95,
'max_drawdown': max_drawdown,
'volatility': self.volatility()
}
# Example usage
# Simulate stock prices for 100 days, 4 stocks
np.random.seed(42)
days = 100
stocks = 4
initial_prices = [100, 50, 200, 150]
# Generate realistic price movements
price_data = []
current_prices = np.array(initial_prices)
for day in range(days):
# Random daily percentage changes with a small positive drift
daily_changes = np.random.normal([0.05, 0.03, 0.02, 0.04], [2, 1.5, 3, 2.5])
current_prices = current_prices * (1 + daily_changes/100)
price_data.append(current_prices.copy())
portfolio = PortfolioAnalyzer(
price_data,
['TECH', 'FINANCE', 'HEALTH', 'ENERGY']
)
print("Stock Performance Summary:")
print("-" * 50)
for i, stock in enumerate(portfolio.stocks):
final_return = portfolio.cumulative_returns()[-1, i]
volatility = portfolio.volatility()[i]
print(f"{stock:8}: Return: {final_return:6.1f}%, Volatility: {volatility:.1f}%")
print("\nCorrelation Matrix:")
print("-" * 30)
corr_matrix = portfolio.correlation_matrix()
print(" ", end="")
for stock in portfolio.stocks:
print(f"{stock:8}", end="")
print()
for i, stock in enumerate(portfolio.stocks):
print(f"{stock:8}", end=" ")
for j in range(len(portfolio.stocks)):
print(f"{corr_matrix[i,j]:7.2f}", end=" ")
print()
# Test different portfolio allocations
equal_weights = [0.25, 0.25, 0.25, 0.25]
tech_heavy = [0.5, 0.2, 0.2, 0.1]
conservative = [0.1, 0.4, 0.4, 0.1]
print("\nPortfolio Comparison:")
print("-" * 40)
for name, weights in [("Equal Weight", equal_weights),
("Tech Heavy", tech_heavy),
("Conservative", conservative)]:
perf = portfolio.portfolio_performance(weights)
print(f"{name:12}: Return: {perf['total_return']:6.1f}%, "
f"Volatility: {perf['volatility']:.1f}%, "
f"Sharpe: {perf['sharpe_ratio']:.2f}")
17. Best Practices Summary #
Do’s ✅ #
# 1. Use vectorized operations instead of loops
good_way = np.sum(arr**2) # Fast
# bad_way = sum([x**2 for x in arr]) # Slow
# 2. Use appropriate data types
efficient_int = np.array([1, 2, 3], dtype=np.int32) # 4 bytes per element
# wasteful_int = np.array([1, 2, 3], dtype=np.float64) # 8 bytes per element
# 3. Use broadcasting for different shaped arrays
matrix = np.ones((100, 3))
row_vector = np.array([1, 2, 3])
result = matrix * row_vector # Broadcasting works
# 4. Preallocate arrays when size is known
result = np.zeros(1000) # Good
# result = [] # Bad for numerical work
# 5. Use views instead of copies when possible
view = arr[::2] # Creates view (fast, memory efficient)
# copy = arr[::2].copy() # Creates copy (slower, more memory)
Don’ts ❌ #
# 1. Don't modify arrays while iterating
arr = np.array([1, 2, 3, 4, 5])
# Don't do this:
# for i in range(len(arr)):
# if arr[i] > 3:
# arr = np.delete(arr, i) # Modifies array during iteration
# Do this instead:
arr = arr[arr <= 3] # Use boolean indexing
# 2. Don't use nested loops for array operations
# Bad:
# result = np.zeros_like(matrix)
# for i in range(matrix.shape[0]):
# for j in range(matrix.shape[1]):
# result[i, j] = matrix[i, j] ** 2
# Good:
result = matrix ** 2
# 3. Don't forget to handle edge cases
def safe_divide(a, b):
    # The output array must be float: true division cannot write into an integer array
    return np.divide(a, b, out=np.zeros_like(a, dtype=float), where=(b != 0))
# 4. Don't ignore memory layout for performance-critical code
# Be aware of C-order vs F-order for large arrays
18. Conclusion #
NumPy is the foundation of scientific computing in Python. It provides:
Key Benefits:
- Performance: often 10-100x faster than equivalent pure-Python loops
- Memory Efficiency: Compact data storage
- Functionality: Rich mathematical and statistical functions
- Ecosystem: Works seamlessly with pandas, matplotlib, scikit-learn
When to Use NumPy:
- Mathematical computations
- Data analysis and manipulation
- Scientific computing
- Machine learning preprocessing
- Image and signal processing
- Financial analysis
Learning Path:
- Beginner: Arrays, indexing, basic operations
- Intermediate: Broadcasting, reshaping, statistical functions
- Advanced: Linear algebra, custom dtypes, performance optimization
Next Steps:
- Pandas: For structured data analysis
- Matplotlib: For data visualization
- Scikit-learn: For machine learning
- SciPy: For advanced scientific computing
NumPy is like learning to use a powerful calculator – once you master it, you’ll wonder how you ever did numerical work without it!
Remember: Always think in arrays, not loops! This mindset shift will make you a much more effective Python programmer for numerical tasks.