
Data analysis and visualization practice — Weather data (Seoul)

June 7, 2024·5 min read


Through the AI coaching study at Modulabs, I've had the chance to dust off and revisit some very old memories. On top of that, I met great teammates who shared excellent resource files, so I also got the chance to organize the related material! (Supplementary material: AI 2024 coaching study team mission (Lolong Coaching, Team 6), by @Park Moon-ji)

Just as 'The Little Prince' leaves a different impression when you read it in elementary school, in high school, in college, and then in working life, revisiting Python, Matplotlib, and Pandas recently has felt fresh in a similar way. ~~Well, not quite to the point of being moved to tears, ha. I'm not that kind of weirdo.~~

Anyway, through this practice I plan to organize the recurring patterns and code snippets for collecting field data, cleaning, reviewing, preprocessing, drawing insights, and visualizing, so that I can reuse them later with Ctrl+C, Ctrl+V. Of course… GPT will probably be faster, lol.

Data collection

Seoul temperature data (csv, https://data.kma.go.kr/stcs/grnd/grndTaList.do?pgmNo=70 )

KMA Open MET Data Portal [Climate Statistical Analysis: Statistical Analysis: Temperature Analysis]


Adding packages

# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Data load and preprocessing

# Load data
df = pd.read_csv('data/weather_seoul_1980to2024.csv')
df.head()

Error:
* UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid start byte
--> The downloaded file is not UTF-8 encoded. Open the collected CSV file in Notepad, set the encoding to UTF-8, and save it under a different name to use it.

* ParserError: Error tokenizing data. C error: Expected 1 fields in line 8, saw 5
--> The first lines of the file are descriptive header text, not comma-delimited rows, so the parser fails when the real table starts. Delete the header text area in the CSV file and save it before using.

* NameError: name 'pd' is not defined
--> This error itself means the import cell has not been run yet, so run import pandas as pd first. Separately, \t characters are present in the file… Use Notepad's Find & Replace to remove the \t characters.
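As an alternative to editing the file by hand in Notepad, the same three fixes can probably be applied at load time. The sketch below is only an illustration under assumptions: it assumes the file is saved in CP949 (0xb1 is a common lead byte for Korean text in that encoding), that the descriptive text occupies the first 7 lines (the ParserError points at line 8), and that the stray \t sits in front of each date. The sample file it writes is a made-up stand-in, not real KMA data.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the downloaded file: 7 lines of descriptive
# text, then a Korean header row and tab-prefixed dates (made-up values).
preamble = ["기온분석"] + [f"설명 {i}" for i in range(2, 8)]
rows = [
    "날짜,지점,평균기온,최저기온,최고기온",
    "\t1980-01-01,108,-1.5,-5.0,2.0",
    "\t1980-01-02,108,-0.5,-4.0,3.0",
    "\t1980-01-03,108,0.5,-3.0,4.0",
]
path = os.path.join(tempfile.mkdtemp(), "sample_weather.csv")
with open(path, "w", encoding="cp949", newline="") as f:
    f.write("\n".join(preamble + rows) + "\n")

# encoding: avoids the UnicodeDecodeError (assumes a CP949 file)
# skiprows: skips the 7 descriptive lines that trigger the ParserError
df = pd.read_csv(path, encoding="cp949", skiprows=7)
df.columns = ["date", "station", "avg_temp", "min_temp", "max_temp"]

# Remove the leading tab from each date string.
df["date"] = df["date"].str.strip()
print(df.head())
```

If the real file has a different number of descriptive lines, the skiprows value would need to change accordingly.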

Korean font setup

matplotlib's default font family is 'sans-serif', and the font it resolves to out of the box (DejaVu Sans) has no Korean glyphs. As a result, titles, column names, labels, and legends written in Korean are rendered as broken characters (empty boxes). In that case, use matplotlib.rc to change the font and then verify the result.

# Check current font: ['sans-serif']
import matplotlib
matplotlib.rcParams['font.family']

# Fix broken Korean text on Mac
import matplotlib.pyplot as plt
from matplotlib import rc
rc('font', family='AppleGothic')
plt.rcParams['axes.unicode_minus'] = False

# Fix broken Korean text on Windows
import matplotlib
import matplotlib.font_manager as fm
font_name = fm.FontProperties(fname='C:/Windows/Fonts/malgun.ttf').get_name()
matplotlib.rc('font', family=font_name)
matplotlib.rcParams['axes.unicode_minus'] = False

# Check changed font: ['AppleGothic']
import matplotlib
matplotlib.rcParams['font.family']
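The Mac and Windows snippets above could be folded into one cell that picks the font by operating system. This is a rough sketch, not a definitive setup: the FONT_BY_OS mapping is a name introduced here for illustration, and the Linux entry (NanumGothic) is an assumption that only works if that font has been installed separately.

```python
import platform

import matplotlib

# Map each OS to a Korean-capable font family.
# The Linux entry is an assumption: NanumGothic must be installed first.
FONT_BY_OS = {
    "Darwin": "AppleGothic",     # macOS
    "Windows": "Malgun Gothic",
    "Linux": "NanumGothic",
}

# Fall back to the default family on any other platform.
font_name = FONT_BY_OS.get(platform.system(), "sans-serif")
matplotlib.rcParams["font.family"] = font_name

# Keep the minus sign readable when a non-default font is active.
matplotlib.rcParams["axes.unicode_minus"] = False

print(matplotlib.rcParams["font.family"])
```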

Data review

# Review recent data (*station=Seoul)
df.tail()

# Summary statistics of the data
df.describe()

# General data info: 16,288 rows, 5 columns; missing values show up as non-null counts in columns 3 and 4 that differ from columns 0, 1, and 2
df.info()

Preprocessing

Adjusting column names

df.columns

df.columns = ['date', 'station', 'avg_temp', 'min_temp', 'max_temp']
df.columns

df.head()

Checking for missing values

#isnull()
df.isnull().sum()

# Find dates where minimum temperature is missing - typhoon
# https://imnews.imbc.com/replay/2022/nw1400/article/6396123_35722.html
# However, I could not find separate information related to the equipment.
cond = df['min_temp'].isnull()
df[cond]

# Find dates where maximum temperature is missing - earthquake? Or rather... the Pohang earthquake on the 15th?
# Again, I could not find separate information related to equipment operation.
cond = df['max_temp'].isnull()
df[cond]

df.info()

# Handle missing values
df.dropna(inplace=True)
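The two missing-value checks and the dropna step can also be written in one pass. The sketch below uses a tiny made-up frame (not the Seoul data) and shows one possible variant: listing every row with any missing temperature, then dropping only on the temperature columns and returning a new frame instead of using inplace=True.

```python
import numpy as np
import pandas as pd

# Made-up sample rows; NaN marks a missing reading.
df = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-04"],
    "avg_temp": [1.0, 2.0, 3.0, 4.0],
    "min_temp": [-3.0, np.nan, -1.0, 0.0],
    "max_temp": [5.0, 6.0, np.nan, 8.0],
})

# Rows where any temperature column is missing.
temp_cols = ["avg_temp", "min_temp", "max_temp"]
missing_rows = df[df[temp_cols].isnull().any(axis=1)]
print(missing_rows)

# Drop only rows that lack a temperature value and keep the original
# frame untouched by assigning the result to a new name.
clean_df = df.dropna(subset=temp_cols).reset_index(drop=True)
print(len(df), "->", len(clean_df))
```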

Date type conversion

# Convert date type: object to datetime (to_datetime has no inplace=True option, so assign the result back)
# pandas - https://pandas.pydata.org/docs/reference/api/pandas.to_datetime.html
df['date'] = pd.to_datetime(df['date'])
df.info()

df.head()

# Derive year, month, day columns
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df.info()

df.head()
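If the date strings still carry stray whitespace or the occasional bad value, the conversion can be made more forgiving. The following is a small sketch on made-up strings, assuming the dates follow the YYYY-MM-DD pattern: str.strip() removes leading tabs, and errors="coerce" turns anything unparseable into NaT instead of raising.

```python
import pandas as pd

# Made-up values: one tab-prefixed date, one clean date, one bad entry.
raw = pd.Series(["\t2024-06-05", "2024-06-06", "not a date"])

# Strip whitespace, then parse; unparseable entries become NaT.
dates = pd.to_datetime(raw.str.strip(), format="%Y-%m-%d", errors="coerce")
print(dates)

# Count how many entries failed to parse before deriving columns.
print("unparseable:", dates.isna().sum())

# Derived columns, as in the main flow (NaT rows yield missing values).
parts = pd.DataFrame({
    "year": dates.dt.year,
    "month": dates.dt.month,
    "day": dates.dt.day,
})
print(parts)
```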

Analysis

From the data table, extract elements that match certain conditions and reconstitute the table. And then make it possible to compare specific elements to one another via graphs.
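In pandas terms, "extract elements that match certain conditions and reconstitute the table" usually comes down to a boolean mask plus a column selection. Here is a minimal sketch on made-up rows (the values are illustrative, not real observations):

```python
import pandas as pd

# Made-up rows for illustration.
df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2022-12-25", "2023-01-01", "2023-12-25", "2024-01-01"]
    ),
    "avg_temp": [-3.1, 0.4, -1.2, 2.0],
})
df["month"] = df["date"].dt.month
df["day"] = df["date"].dt.day

# Each comparison yields a True/False Series; & combines them row by row.
# The parentheses are required because & binds tighter than ==.
cond = (df["month"] == 12) & (df["day"] == 25)

# .loc takes the row mask and the columns to keep in a single step.
subset = df.loc[cond, ["date", "avg_temp"]].reset_index(drop=True)
print(subset)
```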

The hottest day

# Hottest day? (maximum temperature)
hottestDayList = df.sort_values(by='max_temp',ascending=False)
hottestDayList.head()

hottestDayList.iloc[0] # row 0 = the hottest day

hottestDayList.iloc[0,0] # row 0, column 0 = the hottest day, date

# Print date type as a string
df.iloc[0,0].date()
df.iloc[0,0].strftime('%Y-%m-%d')

hottestDayList.iloc[0,4] # the hottest day, temperature

# Print the hottest day in Seoul
hotday = hottestDayList.iloc[0,0].strftime('%Y-%m-%d')
temp = hottestDayList.iloc[0,4]
print(f'The hottest day in Seoul was {hotday}: {temp} degrees')
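Positional lookups such as iloc[0,4] depend on the column order staying the same. One alternative, sketched here on made-up values, is to find the row label of the maximum with idxmax() and then read the columns by name:

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "date": pd.to_datetime(["2001-07-01", "2001-07-02", "2001-07-03"]),
    "max_temp": [30.1, 35.5, 33.0],
})

# idxmax() returns the index label of the largest max_temp value.
hottest = df.loc[df["max_temp"].idxmax()]

# Access by column name, so reordering columns does not break the lookup.
hotday = hottest["date"].strftime("%Y-%m-%d")
temp = hottest["max_temp"]
print(f"The hottest day in the sample was {hotday}: {temp} degrees")
```

This also avoids sorting the whole table when only the single hottest row is needed.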

New Year's Day temperature graph

# Extract only January 1 data since 1980
cond = (df['year'] >=1980) & (df['month']==1) & (df['day']==1)
birth_df = df[cond]
birth_df.head()

# Extract only required columns (date, average temperature)
col_lst = ['date','avg_temp']
birth_df = birth_df[col_lst]
birth_df.head()

## Font setup
plt.rc('font', family='AppleGothic') # Mac
# plt.rc('font', family='NanumSquare') # set Nanum font
# plt.rc('font', family='Malgun Gothic') # set Malgun Gothic
# Minus sign setting
plt.rcParams['axes.unicode_minus'] = False

# New Year's Day temperature data
day = birth_df['date'].values
temp_avg = birth_df['avg_temp'].values
plt.plot(day,temp_avg)
plt.xlabel('date')
plt.ylabel('avg_temp')
# Double quotes, because the title itself contains an apostrophe
plt.title("New Year's Day temperature data (1/1)")
plt.axhline(y=0, color='orange', linestyle='--')
plt.show()

Christmas high/low temperature trends

# Extract all Christmas data in the dataset
cond = (df['month']==12) & (df['day']==25)
chris_df = df[cond]
chris_df.head()

x = chris_df['date'].values
y1 = chris_df['min_temp'].values
y2 = chris_df['max_temp'].values
plt.plot(x,y1,label='min_temp',color='b')
plt.plot(x,y2,label='max_temp',color='r')
plt.title('Christmas max/min temperature')
plt.xlabel('year')
plt.ylabel('temperature')
plt.legend() # legend
plt.show()

Scatter plot matrix

For multiple continuous variables, draw a scatter plot for each pair so the relationships among the variables can be checked all at once.

# Select key numeric data suitable for analysis (*station=Seoul)
# 'date' and the constant 'station' code are left out so that only the temperature columns are paired
selected_columns = ['avg_temp', 'min_temp', 'max_temp']
reviewGraph_selected = df[selected_columns]

# Draw scatter matrix
sns.pairplot(reviewGraph_selected)
plt.suptitle('Scatter Matrix of Seoul weather data review', y=1.02) # add title
plt.show()
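A numeric companion to the scatter matrix is the pairwise correlation table, which summarizes each pair of variables in one number. The sketch below uses made-up temperatures rather than the Seoul data, so the exact figures mean nothing; it only shows the call.

```python
import pandas as pd

# Made-up values for illustration only.
df = pd.DataFrame({
    "avg_temp": [-2.0, 3.5, 11.0, 18.2, 24.9, 26.1],
    "min_temp": [-6.1, -0.8, 6.2, 13.9, 21.0, 22.8],
    "max_temp": [2.5, 8.4, 16.3, 23.1, 29.4, 30.2],
})

# Pearson correlation for every pair of columns (a 3x3 table here).
corr = df.corr()
print(corr.round(2))
```

Values close to 1 correspond to the tight diagonal bands you would expect to see in the matching pairplot panels.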

Histogram — Seoul highest temperatures

# Use matplotlib hist() - based on maximum temperature
x = df['max_temp']
plt.hist(x)
plt.show()

# For numeric features, adjust 'bins' appropriately....
df['max_temp'].hist(bins=20)

df['max_temp'].hist(bins=100)
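To see what changing bins actually does, it can help to look at the bin widths and counts directly. This sketch uses synthetic random numbers (not the Seoul data) with np.histogram, which computes the same kind of equal-width bins that the plots above draw:

```python
import numpy as np

# Synthetic data for illustration: 1000 values around 17 degrees.
rng = np.random.default_rng(0)
max_temp = rng.normal(loc=17.0, scale=10.0, size=1000)

for bins in (10, 20, 100):
    # counts has one entry per bin; edges has bins + 1 boundaries.
    counts, edges = np.histogram(max_temp, bins=bins)
    width = edges[1] - edges[0]
    print(f"bins={bins:3d}  width={width:.2f}  largest bin={counts.max()}")
```

More bins give narrower bars and fewer samples per bar, which is why the bins=100 plot looks noisier than the bins=20 one.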

Histogram — Comparing the Seoul highest temperatures of February and August

# Extract only winter (February) data
feb_df = df[df['month']==2]
feb_df.head()

# Extract only summer (August) data
aug_df = df[df['month']==8]
aug_df.head()

# Histogram visualization
x1 = aug_df['max_temp']
x2 = feb_df['max_temp']
plt.hist(x1,bins=100,color='r',label='aug')
plt.hist(x2,bins=100,color='b',label='feb')
plt.legend()
plt.show()
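When two histograms share one axes, each call to hist() picks its own bin edges, so the bar widths of the two months are not directly comparable. One possible variant, sketched on synthetic data, passes the same explicit edges to both calls and adds transparency so overlapping bars stay visible. The Agg backend and the in-memory PNG are only there so the sketch runs without a display; in a notebook, plt.show() would do.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-ins for the February and August maximum temperatures.
rng = np.random.default_rng(0)
feb = rng.normal(4.0, 4.0, 500)
aug = rng.normal(30.0, 3.0, 500)

# One set of 40 equal-width bins covering both samples.
low = min(feb.min(), aug.min())
high = max(feb.max(), aug.max())
bins = np.linspace(low, high, 41)

fig, ax = plt.subplots()
ax.hist(aug, bins=bins, color="r", alpha=0.5, label="aug")
ax.hist(feb, bins=bins, color="b", alpha=0.5, label="feb")
ax.set_xlabel("max_temp")
ax.legend()

# Render to memory instead of opening a window.
buf = io.BytesIO()
fig.savefig(buf, format="png")
plt.close(fig)
```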

Box plot — Comparing the distribution of mean and outlier temperatures

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('data/weather_seoul_1980to2024.csv')
df.columns = ['date', 'station', 'avg_temp', 'min_temp', 'max_temp']

# remove missing values
df.dropna(inplace=True)

# type conversion
df['date'] = pd.to_datetime(df['date'])

# create derived variables
df['year']=df['date'].dt.year
df['month']=df['date'].dt.month
df['day']=df['date'].dt.day
df.head()

# Visualize max/min temperature boxplot
x = df['max_temp'].values
y = df['min_temp'].values
plt.boxplot([x,y])
plt.show()

# Use pandas boxplot()
df[['max_temp','min_temp']].boxplot()
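The points drawn beyond the whiskers follow a simple rule: by default, matplotlib's boxplot marks values more than 1.5 times the interquartile range (IQR) away from the box as outliers. The sketch below reproduces that rule by hand on a made-up series, so the numbers are illustrative only.

```python
import pandas as pd

# Made-up values; 30.0 is deliberately far from the rest.
max_temp = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 30.0])

# The box spans the first to the third quartile.
q1 = max_temp.quantile(0.25)
q3 = max_temp.quantile(0.75)
iqr = q3 - q1

# Values outside these fences are the ones plotted as separate points.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = max_temp[(max_temp < lower) | (max_temp > upper)]

print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print(outliers.tolist())
```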

Box plot — Distribution of temperatures by month

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Load data
df = pd.read_csv('data/weather_seoul_1980to2024.csv')

# Rename columns
df.columns = ['date', 'station', 'avg_temp', 'min_temp', 'max_temp']

# remove missing values
df.dropna(inplace=True)

# Convert date data Str => date type
# df['date'] = pd.to_datetime(df['date'])
df['date'] = df['date'].astype('datetime64[ns]')
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df.head()

cond = df['month'] == 1
cond

df[cond]['avg_temp']

df.loc[cond,'avg_temp']

avg_month = []
for i in range(1,13):
    avg_month.append(df.loc[df['month']==i,'avg_temp'])
    print(i)

plt.rc('font', family='AppleGothic') # Mac
plt.rcParams['axes.unicode_minus'] = False # sign
plt.figure(figsize=(8,4)) # check results by aspect ratio
plt.boxplot(avg_month)
plt.xlabel('month')
plt.ylabel('temperature')
plt.title('Monthly average temperature')
plt.show()
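The month-by-month loop can also be expressed with groupby, which splits the column into one group per month in a single call. This is a sketch on a synthetic one-year series (a smooth seasonal curve, not real observations), so only the structure carries over to the Seoul data.

```python
import numpy as np
import pandas as pd

# Synthetic daily series for one non-leap year.
dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
day_of_year = dates.dayofyear.to_numpy()
avg_temp = 12.0 - 14.0 * np.cos(2 * np.pi * (day_of_year - 15) / 365)

df = pd.DataFrame({"date": dates, "avg_temp": avg_temp})
df["month"] = df["date"].dt.month

# One group per month, in ascending month order.
grouped = df.groupby("month")["avg_temp"]

# A list of 12 arrays, the same shape of input plt.boxplot() expects.
avg_month = [values.to_numpy() for _, values in grouped]

# Per-month summary as a numeric counterpart to the box plot.
print(grouped.describe()[["min", "50%", "max"]].round(1))
```

The resulting avg_month list can be passed to plt.boxplot() in place of the list built by the loop.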

This English version was translated by Claude.

Written by

친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
