If NumPy is for numerical arrays, pandas is for labeled tabular data — the kind of mixed-type, named-column data that 80% of real ML work involves. CSV in, CSV out, with cleaning, joining, grouping, and aggregating in between. Almost every ML pipeline starts in pandas long before it touches a model.
Series and DataFrame
A Series is a 1D labeled array, like a single column: a sequence of values plus an index of row labels and, optionally, a name. A DataFrame is a 2D table of rows × columns in which each column is a Series with its own dtype and all columns share one row index. Each column has a name, and each row has an index label. Pandas operations are column-aware: you filter, sort, and group by column name and let pandas handle the alignment and looping.
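A minimal sketch of the two structures may help here. The names (`ages`, `df`), the row labels, and the values are made up for illustration.

```python
import pandas as pd

# A Series: 1D values plus an index of labels and an optional name.
ages = pd.Series([34, 28, 45], index=["ana", "ben", "cy"], name="age")

# A DataFrame: named columns, each with its own dtype, sharing one row index.
# (Hypothetical toy data.)
df = pd.DataFrame(
    {
        "age": [34, 28, 45],
        "city": ["NYC", "SF", "NYC"],
        "income": [72000.0, 88000.0, 65000.0],
    },
    index=["ana", "ben", "cy"],
)

print(df.dtypes)               # one dtype per column
age_col = df["age"]            # selecting a single column returns a Series
print(type(age_col).__name__)  # Series
print(df.loc["ben", "city"])   # look up by row label and column name: SF
```

Selecting a column by name and a row by label, rather than by integer position, is what "labeled" means in practice.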
df.head() # first 5 rows
df[df.age > 30] # filter rows
df.groupby('city').mean(numeric_only=True) # aggregate numeric columns by group
df.merge(other, on='id') # SQL-style join
df['col'].apply(fn) # transform a column
df.pivot_table(...) # cross-tabulate
Core pandas verbs, covering ~90% of daily data work
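To make these verbs concrete, here is a small runnable sketch on toy data. The frames `df` and `other`, their columns, and the `pivot_table` arguments are assumptions chosen for illustration, not part of the original listing.

```python
import pandas as pd

# Hypothetical toy tables.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, 41, 33, 52],
    "city": ["NYC", "SF", "NYC", "SF"],
})
other = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "plan": ["free", "pro", "pro", "free"],
})

first_rows = df.head(2)                       # first n rows (default is 5)
over_30 = df[df["age"] > 30]                  # boolean mask keeps matching rows
mean_age = df.groupby("city")["age"].mean()   # one mean per city
joined = df.merge(other, on="id")             # inner join on the shared key
age_in_months = df["age"].apply(lambda years: years * 12)  # element-wise transform

# Cross-tabulate: count ids for each (city, plan) pair.
counts = joined.pivot_table(index="city", columns="plan", values="id", aggfunc="count")

print(mean_age)
print(counts)
```

Selecting the `age` column before calling `mean()` sidesteps the question of what to do with non-numeric columns such as `city`.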
The split-apply-combine pattern
The most powerful pandas pattern is `groupby().agg()`. Split the data by some key (e.g. by city), apply a function to each group (e.g. mean), then combine the results into a new table. Once you can think in split-apply-combine, half your data wrangling problems disappear.
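The three steps can be sketched by hand and then collapsed into a single `groupby().agg()` call. The `sales` table and the output column names (`mean_amount`, `n_sales`) are hypothetical.

```python
import pandas as pd

# Hypothetical toy data.
sales = pd.DataFrame({
    "city": ["NYC", "SF", "NYC", "SF", "LA"],
    "amount": [100.0, 250.0, 300.0, 50.0, 80.0],
})

# Spelled out by hand:
pieces = {}
for city, group in sales.groupby("city"):   # split: one sub-table per city
    pieces[city] = group["amount"].mean()   # apply: reduce each group to a number
manual = pd.Series(pieces, name="amount")   # combine: stitch the results together

# The same idea in one expression, using named aggregation.
summary = sales.groupby("city").agg(
    mean_amount=("amount", "mean"),
    n_sales=("amount", "count"),
)
print(summary)
```

Both versions yield one row per city. The second lets pandas do the looping and also computes several statistics in one pass.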
- Read/write: `pd.read_csv`, `pd.read_parquet`, `pd.read_sql`, `df.to_csv`, `df.to_parquet`
- Filter: `df[df.col > 10]`, `df.query('age > 30 and city == "NYC"')`
- Group: `df.groupby('city').mean(numeric_only=True)`, `df.groupby('city').agg({'col': ['min', 'max']})`
- Reshape: `df.pivot`, `df.melt`, `df.stack`, `df.unstack`
- Time series: `pd.to_datetime`, `df.resample('D').mean()` (requires a datetime index), `df['col'].rolling(7).mean()`
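A few of the entries above, sketched on toy data: `query` filtering, a `melt`/`pivot` round trip, and daily resampling with a rolling mean. All table names, column names, and values are invented for illustration. The rolling window is 2 rather than 7 only to keep the example short.

```python
import pandas as pd

# Filter with a query string (hypothetical data).
people = pd.DataFrame({"age": [25, 41, 33], "city": ["NYC", "NYC", "SF"]})
nyc_over_30 = people.query('age > 30 and city == "NYC"')

# Reshape: wide -> long with melt, then long -> wide with pivot.
wide = pd.DataFrame({"city": ["NYC", "SF"], "q1": [10, 20], "q2": [30, 40]})
long = wide.melt(id_vars="city", var_name="quarter", value_name="sales")
back = long.pivot(index="city", columns="quarter", values="sales")

# Time series: resample needs a datetime index, so parse and set it first.
readings = pd.DataFrame({
    "ts": pd.to_datetime([
        "2024-01-01 08:00", "2024-01-01 20:00",
        "2024-01-02 09:00", "2024-01-03 10:00",
    ]),
    "value": [1.0, 3.0, 4.0, 6.0],
}).set_index("ts")

daily = readings["value"].resample("D").mean()  # one mean per calendar day
smooth = daily.rolling(2).mean()                # 2-day moving average; first entry is NaN

print(nyc_over_30)
print(long)
print(smooth)
```

Note that `rolling(n)` leaves the first `n - 1` entries as NaN, because a full window is not yet available for them.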