Programming Foundations - Beginner - 12 min

Learn Pandas — DataFrames


Last updated: 2026-05-13.

If NumPy is for numerical arrays, pandas is for labeled tabular data: the kind of mixed-type, named-column data that most real ML work involves. CSV in, CSV out, with cleaning, joining, grouping, and aggregating in between. Almost every ML pipeline starts in pandas long before it touches a model.

Series and DataFrame

A Series is a 1D labeled array: a sequence of values plus an index of labels, like one column of a table. A DataFrame is a 2D table of rows × columns, where each column is a Series with its own dtype. Each column has a name, and each row has an index label. Pandas operations are column-aware: you filter, sort, and group by column name, and pandas aligns the rows by their index labels.
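To make the two structures concrete, here is a minimal sketch. The names, ages, cities, and scores are made-up sample data, not taken from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
ages = pd.Series([34, 28, 45], index=["ana", "ben", "cho"], name="age")

df = pd.DataFrame(
    {
        "age": [34, 28, 45],
        "city": ["NYC", "Austin", "NYC"],
        "score": [88.5, 92.0, 79.5],
    },
    index=["ana", "ben", "cho"],  # row labels
)

age_column = df["age"]        # selecting one column returns a Series
column_dtypes = df.dtypes     # one dtype per column
over_30 = df[df["age"] > 30]  # filter rows with a boolean mask built from a column
by_score = df.sort_values("score", ascending=False)  # sort by column name

print(column_dtypes)
print(over_30)
```

Selecting `df["age"]` gives back a Series that shares the DataFrame's row index, which is why column-wise operations stay aligned row by row.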

df.head()                                    # first 5 rows
df[df.age > 30]                              # filter rows with a boolean mask
df.groupby('city').mean(numeric_only=True)   # aggregate numeric columns by group
df.merge(other, on='id')                     # SQL-style join (inner by default)
df['col'].apply(fn)                          # transform a column element by element
df.pivot_table(...)                          # cross-tabulate

Core pandas verbs that cover most day-to-day data work
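The verbs above can be tried end to end on a tiny table. This is a sketch under assumptions: the `df` and `other` tables, their columns, and the `plan` values are hypothetical stand-ins chosen so each call has something to work on.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "age": [25, 34, 41, 29],
    "city": ["NYC", "NYC", "Austin", "Austin"],
})
other = pd.DataFrame({"id": [1, 2, 3], "plan": ["free", "pro", "pro"]})

first_rows = df.head(2)                      # first 2 rows instead of the default 5
adults_over_30 = df[df["age"] > 30]          # filter rows
mean_age = df.groupby("city")["age"].mean()  # aggregate one column by group
joined = df.merge(other, on="id")            # inner join: id 4 has no match and is dropped
age_in_months = df["age"].apply(lambda years: years * 12)  # transform a column

# Cross-tabulate: count ids per (city, plan) pair, filling empty cells with 0.
counts = joined.pivot_table(
    index="city", columns="plan", values="id", aggfunc="count", fill_value=0
)

print(mean_age)
print(counts)
```

For simple arithmetic like the months conversion, `df["age"] * 12` does the same thing without `apply` and is usually faster; `apply` is shown here only because it is one of the listed verbs.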

The split-apply-combine pattern

The most powerful pandas pattern is `groupby().agg()`. Split the data by some key (e.g. by city), apply a function to each group (e.g. mean), then combine the results into a new table. Once you can think in split-apply-combine, half your data wrangling problems disappear.
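As a rough illustration of the pattern, the sketch below computes per-city statistics twice: once with `groupby().agg()`, and once by looping over the groups by hand so the split, apply, and combine steps are visible. The `sales` table and its values are hypothetical.

```python
import pandas as pd

# Hypothetical order data, invented for illustration.
sales = pd.DataFrame({
    "city": ["NYC", "NYC", "Austin", "Austin", "Austin"],
    "amount": [100.0, 300.0, 50.0, 150.0, 100.0],
})

summary = (
    sales.groupby("city")                # split: one group per city
    .agg(                                # apply: named aggregations per group
        mean_amount=("amount", "mean"),
        max_amount=("amount", "max"),
        n_orders=("amount", "count"),
    )
    .reset_index()                       # combine: one row per city in a new table
)

# The same idea spelled out by hand, for the mean only.
manual = {city: group["amount"].mean() for city, group in sales.groupby("city")}

print(summary)
print(manual)
```

The named-aggregation form (`new_name=(column, function)`) keeps the output columns flat, which tends to be easier to work with than the nested column labels produced by passing a dict of lists.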

  • Read/write: `pd.read_csv`, `pd.read_parquet`, `pd.read_sql`, `df.to_csv`, `df.to_parquet`
  • Filter: `df[df.col > 10]`, `df.query('age > 30 and city == "NYC"')`
  • Group: `df.groupby('city').mean(numeric_only=True)`, `df.groupby('city').agg({'col': ['min', 'max']})`
  • Reshape: `df.pivot`, `df.melt`, `df.stack`, `df.unstack`
  • Time series: `pd.to_datetime`, `df.resample('D').mean()` (needs a datetime index), `df['col'].rolling(7).mean()`
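A few of the calls in this list, sketched on a small table. The `readings` data, its column names, and the 2-row rolling window are assumptions made for illustration (a 7-row window would need more rows than this toy table has).

```python
import pandas as pd

# Hypothetical sensor readings, invented for illustration.
readings = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 09:00", "2024-01-01 15:00",
        "2024-01-02 09:00", "2024-01-03 09:00",
    ]),
    "city": ["NYC", "NYC", "NYC", "NYC"],
    "temp": [2.0, 4.0, 5.0, 1.0],
    "humidity": [60.0, 50.0, 55.0, 70.0],
})

# Filter: query takes the condition as a string.
warm = readings.query('temp > 3 and city == "NYC"')

# Reshape: melt turns the two measurement columns into (metric, value) rows.
long_form = readings.melt(
    id_vars=["timestamp", "city"],
    value_vars=["temp", "humidity"],
    var_name="metric",
    value_name="value",
)

# Time series: resample needs a datetime index, so set one first.
daily = readings.set_index("timestamp")[["temp", "humidity"]].resample("D").mean()
smoothed = daily["temp"].rolling(2).mean()  # first value is NaN: the window is not full yet

print(daily)
print(smoothed)
```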

Practice questions

  1. What is the pandas equivalent of SQL `SELECT * FROM df WHERE age > 30`?
  2. What does the split-apply-combine pattern look like in pandas?
  3. Why is `df.iterrows()` discouraged for production code?
  4. What happens with `df = pd.read_csv('huge.csv')` if the file is 100 GB and you have 16 GB RAM?
