What is meant by ‘tidy’ data?

When processing and plotting data, how you choose your columns can have a massive impact on how easy your data is to manipulate. Data can be in ‘long’ (or ‘tidy’) form, or in ‘wide’ form. Some plotting libraries are designed to work with ‘long’ data, and others with ‘wide’ data.

Long data

A table stored in ‘long’ form has a single column for each variable in the system, and a single row for each data point.

In the case of UK election results, each data point represents the number of seats a particular party won in a particular year, so our variables are seats, party and year:

>>> print(long)
        year         party  seats
0       1966  Conservative    253
1       1970  Conservative    330
2   Feb 1974  Conservative    297
3   Oct 1974  Conservative    277
4       1979  Conservative    339
..       ...           ...    ...
55      2005        Others     30
56      2010        Others     29
57      2015        Others     80
58      2017        Others     59
59      2019        Others     72

[60 rows x 3 columns]

‘Long-form’ data is also sometimes called ‘tidy’, ‘stacked’ or ‘narrow’.

Libraries that work best with long data: Seaborn, Plotly Express, Altair.
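To make the long layout concrete, here is a minimal sketch that builds a small long-form DataFrame from a few of the rows printed above. The name long_sample and the choice of rows are my own, purely for illustration. Because every variable has its own column, filtering and grouping need no reshaping:

```python
import pandas as pd

# A small long-form table using three elections and two parties from the data above.
# (long_sample is a hypothetical name used only for this example.)
long_sample = pd.DataFrame({
    "year": ["1966", "1970", "Feb 1974"] * 2,
    "party": ["Conservative"] * 3 + ["Labour"] * 3,
    "seats": [253, 330, 297, 364, 287, 301],
})

# Filter on two variables at once: Labour's result in 1970.
labour_1970 = long_sample[
    (long_sample["party"] == "Labour") & (long_sample["year"] == "1970")
]

# Group by one variable and aggregate another: each party's best result.
best_result = long_sample.groupby("party")["seats"].max()

print(labour_1970)
print(best_result)
```

This one-column-per-variable shape is what lets a long-data plotting library map a column such as party directly onto colour or facets.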

Wide data

A table stored in ‘wide’ form spreads the values of one variable across several columns, one column per category. We have the number of seats won by four groups (three parties plus ‘others’), so it seems sensible to store them in four columns:

>>> print(wide)
        year  conservative  labour  liberal  others
0       1966           253     364       12       1
1       1970           330     287        6       7
2   Feb 1974           297     301       14      18
..       ...           ...     ...      ...     ...
12      2015           330     232        8      80
13      2017           317     262       12      59
14      2019           365     202       11      72

[15 rows x 5 columns]

‘Wide form’ data is also sometimes called ‘un-stacked’.

Libraries that work best with wide data: Matplotlib, Plotly, Bokeh, PyGal, Pandas.
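As a rough sketch of why wide data suits these libraries, the example below rebuilds the first three rows of the table above (wide_sample is a name I have made up for the example). Each party's whole series is already a single column, which is the shape you would hand to a plotting call that takes one sequence per line or bar group:

```python
import pandas as pd

# The first three rows of the wide table shown above.
# (wide_sample is a hypothetical name used only for this example.)
wide_sample = pd.DataFrame({
    "year": ["1966", "1970", "Feb 1974"],
    "conservative": [253, 330, 297],
    "labour": [364, 287, 301],
    "liberal": [12, 6, 14],
    "others": [1, 7, 18],
})

# One column per party: a party's full series is a single lookup.
labour_seats = wide_sample["labour"].tolist()

# Row-wise arithmetic across the party columns is also direct.
party_columns = ["conservative", "labour", "liberal", "others"]
total_seats = wide_sample[party_columns].sum(axis=1)

print(labour_seats)
print(total_seats.tolist())
```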

Pandas DataFrames

All the libraries mentioned in our Python plotting guide work well with Pandas DataFrames, so I’ve created DataFrames from my data.

Pandas DataFrames let you manipulate large amounts of tabular data efficiently, providing methods to iterate through columns, filter out particular values, replace missing values, and much more. Their columns (Pandas Series) also behave like Python sequences, so they can usually be used anywhere you’d use a Python list.
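The operations mentioned above look roughly like this in practice. This is only a sketch: the DataFrame is a cut-down version of the election data, and I have deliberately blanked one value (the real table has no gaps) so there is something to fill in.

```python
import pandas as pd

# A cut-down table. The None is NOT in the real data: it is a missing value
# introduced here purely to demonstrate fillna().
df = pd.DataFrame({
    "year": ["1966", "1970", "Feb 1974"],
    "conservative": [253, 330, 297],
    "liberal": [12, None, 14],
})

# Iterate through columns: items() yields (column name, Series) pairs.
for name, column in df.items():
    print(name, len(column))

# Filter out particular values with a boolean mask.
big_wins = df[df["conservative"] > 300]

# Replace missing values, here only in the 'liberal' column.
filled = df.fillna({"liberal": 0})

# Columns can usually stand in for lists with ordinary Python built-ins.
highest = max(df["conservative"])
years = list(df["year"])
```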

Converting between Long and Wide data in Pandas

Pandas has convenient methods to convert wide-form data into long form and vice versa.

Wide to long

To convert wide form into long form, use df.melt():

>>> # Convert wide form to long form
>>> melted = wide.melt('year', var_name='party', value_name='seats')
>>> print(melted)
        year         party  seats
0       1966  conservative    253
1       1970  conservative    330
2   Feb 1974  conservative    297
3   Oct 1974  conservative    277
4       1979  conservative    339
..       ...           ...    ...
55      2005        others     30
56      2010        others     29
57      2015        others     80
58      2017        others     59
59      2019        others     72

For more detail, see the Pandas documentation on melt.
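If you want to try melt() without the full dataset, here is a self-contained sketch using the first three rows of the wide table (wide_sample and two_party are names invented for this example). It spells out id_vars as a keyword, and also shows the optional value_vars argument, which melts only the columns you list:

```python
import pandas as pd

# The first three rows of the wide table (hypothetical sample for illustration).
wide_sample = pd.DataFrame({
    "year": ["1966", "1970", "Feb 1974"],
    "conservative": [253, 330, 297],
    "labour": [364, 287, 301],
    "liberal": [12, 6, 14],
    "others": [1, 7, 18],
})

# 'year' stays as an identifier column; every other column is stacked into
# a (party, seats) pair, giving 3 years x 4 parties = 12 rows.
melted_sample = wide_sample.melt(id_vars="year", var_name="party", value_name="seats")

# value_vars restricts the melt to the listed columns only.
two_party = wide_sample.melt(
    id_vars="year",
    value_vars=["conservative", "labour"],
    var_name="party",
    value_name="seats",
)

print(melted_sample)
print(two_party)
```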

Long to wide

To convert long form into wide form, use df.pivot().reset_index():

>>> # Convert long form to wide form
>>> widened = long.pivot(index='year', columns='party', values='seats').reset_index()
>>> print(widened)
party      year  Conservative  Labour  Liberal  Others
0          1966           253     364       12       1
1          1970           330     287        6       7
2          1979           339     269       11      16
..          ...           ...     ...      ...     ...
12         2019           365     202       11      72
13     Feb 1974           297     301       14      18
14     Oct 1974           277     313       13      32

For more detail, see the Pandas documentation on pivot.
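Two details of the output above are worth noting: the rows come back sorted by year (as strings, so ‘Feb 1974’ sorts after ‘2019’), and the column index keeps the label ‘party’. The sketch below, on a small sample with names of my own choosing, reproduces the conversion and clears that leftover label:

```python
import pandas as pd

# A small long-form table (hypothetical sample for illustration).
long_sample = pd.DataFrame({
    "year": ["1966", "1970", "Feb 1974"] * 2,
    "party": ["Conservative"] * 3 + ["Labour"] * 3,
    "seats": [253, 330, 297, 364, 287, 301],
})

# pivot() makes one row per year and one column per party;
# reset_index() turns 'year' back into an ordinary column.
widened_sample = long_sample.pivot(
    index="year", columns="party", values="seats"
).reset_index()

# The column index is still named 'party'; clear it for a plain wide table.
widened_sample.columns.name = None

print(widened_sample)
```

Note that pivot() expects each (year, party) pair to appear only once in the long data; if your data has duplicates, you would need to aggregate them first.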

More on plotting in Python

For a full comparison of Python plotting libraries, see Plotting in Python: A Rundown of Libraries.