
LET's AI 2024 Coaching Study - Week 1, Team Mission

normalstory

 

This post records this week's team discussion, in which we picked specific keywords from the learning material and researched them together.
Beyond gathering the theory, we also organized the corresponding hands-on code so that we can copy-paste (or rather, reference) it later when needed.

- Update: Even with sources and names credited, I belatedly decided that posting my teammates' contributions on this blog was not appropriate, so all related portions have been removed.

 

 

Topic 1.

Research the phenomenon known as the 'curse of dimensionality' that appears as data dimensionality grows, and discuss what causes this phenomenon and the methods used to address it.  [ Topic: Data and Dimensionality ]

(omitted)
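Since the research notes themselves are omitted here, the following is only a minimal sketch of one commonly cited symptom of the curse of dimensionality: as the number of dimensions grows, the nearest and farthest neighbors of a point end up at almost the same distance. The uniform random data, the sample size of 1000, and the helper name `distance_contrast` are all assumptions made for illustration, not something taken from the original write-up.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

def distance_contrast(n_points, n_dims):
    """Relative gap between the farthest and nearest neighbor of one query point."""
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Hypothetical dimensions chosen only to show the trend
contrast = {d: distance_contrast(1000, d) for d in (2, 10, 100, 1000)}
for d, c in contrast.items():
    print(f"dims={d:4d}  (max - min) / min = {c:.3f}")
```

When the ratio shrinks toward zero, "near" and "far" stop being meaningful, which is one reason distance-based methods tend to struggle on raw high-dimensional data.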

1) Academic and theoretical understanding is great, but it would be even better to look more concretely at how this is applied in code and what the results look like.

(1) Iris dataset example 

  • PCA example code

 

  • t-SNE example code

 

 

(2) digits dataset example 

  • PCA example code

 

  • t-SNE example code

2) From the reference articles alone, the use cases of each technique seemed clearly distinct. But in my own simple side-by-side of the actual code outputs, t-SNE looked superior to PCA? (Honestly, the table-style research write-up didn't quite click for me.) It would help to spell out the use cases more concretely, along the lines of 'in this case use this technique, in that case use that one'.


3. Conclusion 

Let's organize the criteria for picking PCA vs t-SNE on a case-by-case basis. 

1) Data characteristics:
- High-dimensional data with linear structure: PCA 
- High-dimensional data with non-linear structure: t-SNE

2) Speed and scale:
- Need fast processing on large-scale data: PCA
- Smaller data where local structure matters: t-SNE

3) Purpose:
- Need feature extraction and data compression: PCA
- Need data visualization and pattern discovery: t-SNE
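To make the "feature extraction and data compression" role of PCA a bit more tangible, here is a rough sketch of PCA written with NumPy's SVD. The synthetic data (5 features driven by 2 hidden factors) and every variable name are assumptions made up for this example; in practice a library implementation such as scikit-learn's `PCA` would normally be used instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 5 features driven by 2 hidden factors plus small noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# PCA via SVD of the centered data
X_mean = X.mean(axis=0)
X_centered = X - X_mean
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# Share of total variance carried by each principal component
explained_ratio = S**2 / np.sum(S**2)

# Compress to k components, then reconstruct to see how much was lost
k = 2
Z = X_centered @ Vt[:k].T           # compressed representation, shape (200, 2)
X_restored = Z @ Vt[:k] + X_mean    # back in the original 5-feature space
rel_error = np.linalg.norm(X - X_restored) / np.linalg.norm(X_centered)

print("Explained variance ratio:", explained_ratio.round(4))
print("Relative reconstruction error with k=2:", round(float(rel_error), 4))
```

Because this toy data really is close to a 2D linear structure, two components keep almost everything. On data with curved or clustered structure, the same linear projection would lose more.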

4) Practical examples

- PCA use cases:

Analyzing the key characteristics of customer data to find important variables, then reducing dimensionality to improve model performance.
Extracting key gene patterns from genetic data while reducing noise so the data is easier to analyze.

- t-SNE use cases:

Reducing image or text embeddings to 2D so similar images or documents can be visualized as clusters.
In marketing campaigns, visualizing customer behavior patterns to identify customer groups exhibiting similar behavior.

These two techniques can also be complementary. For example, it's common in practice to first reduce dimensionality somewhat with PCA, then apply t-SNE on top for visualization.
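The "PCA first, then t-SNE" combination can be sketched roughly as follows with scikit-learn and its bundled digits dataset. This is not the code from our study session. The 500-sample subset, the 30 PCA components, and the perplexity of 30 are assumptions picked only to keep the run short.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

# Load the 8x8 digit images (64 features); use a subset so t-SNE stays quick
digits = load_digits()
X, y = digits.data[:500], digits.target[:500]

# Step 1: PCA trims the 64 features down to 30 (assumed value)
X_pca = PCA(n_components=30, random_state=0).fit_transform(X)

# Step 2: t-SNE maps the PCA output to 2D for visualization
tsne = TSNE(n_components=2, perplexity=30, init="pca", random_state=0)
X_2d = tsne.fit_transform(X_pca)

print(X.shape, "->", X_pca.shape, "->", X_2d.shape)
# → (500, 64) -> (500, 30) -> (500, 2)
```

The resulting `X_2d` can then be passed to a scatter plot colored by `y` to see whether the digit classes form visible clusters.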

 


Topic 2.

Describe the characteristics of 'NumPy', one of Python's libraries, and investigate which tasks have become more convenient thanks to NumPy. [ Topic: Python ]

1. Research

1. Numpy explained

(omitted..)

2. Discussion

1) The points are organized neatly enough to follow, but I feel the explanation is too conceptual. For a more practical understanding, it would help to look up code examples that illustrate the cases above.

(1) Efficient handling of multi-dimensional arrays

import numpy as np

# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Inspect the array shape
print("Array shape:", array_2d.shape)

# Access and modify array elements
array_2d[1, 2] = 10
print("Modified array:\n", array_2d)

-->  Result, 
Array shape: (3, 3)
Modified array:
 [[ 1  2  3]
 [ 4  5 10]
 [ 7  8  9]]

 

(2) High-performance numerical computation

# Create two large arrays
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Element-wise array multiplication
result = a * b

# Print the first 5 results
print(result[:5])

-->  Result, [0.01176252 0.59470493 0.83800347 0.36904421 0.02989509]

 

(3) Data preprocessing and analysis

# Generate random data
data = np.random.randn(1000)

# Compute mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)

print("Mean:", mean)
print("Standard Deviation:", std_dev)

# Normalize the data
normalized_data = (data - mean) / std_dev
print("First 5 normalized data points:", normalized_data[:5])

-->  Result, 
Mean: -0.055785869551972574
Standard Deviation: 0.9666896586679969
First 5 normalized data points: [-1.22099914  0.03545876 -2.18956188  0.5683409  -0.61175484]
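Real preprocessing usually deals with a table of several features rather than a single 1D array, so the same normalization can be sketched per column using `axis=0` and broadcasting. The three-feature random table below, with its deliberately different scales, is an assumption made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical table: 1000 rows, 3 features on very different scales
data = rng.normal(loc=[0.0, 50.0, -3.0], scale=[1.0, 10.0, 0.1], size=(1000, 3))

# Per-column statistics: axis=0 collapses the rows
col_mean = data.mean(axis=0)
col_std = data.std(axis=0)

# Broadcasting applies the (3,) statistics to every row of the (1000, 3) table
standardized = (data - col_mean) / col_std

print("Column means after scaling:", standardized.mean(axis=0).round(6))
print("Column stds after scaling: ", standardized.std(axis=0).round(6))
```

After scaling, every column has a mean close to 0 and a standard deviation close to 1, whatever its original scale was.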

 

(4) Scientific computing and simulation

# Linear system Ax = b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])

# Solve the linear system
x = np.linalg.solve(A, b)

print("Solution x:", x)

-->  Result, Solution x: [2. 3.]
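A quick way to sanity-check a solution like this is to plug it back into the system. The sketch below reuses the same `A` and `b`, and also compares against the explicit-inverse route, which gives the same answer here but is generally considered the less preferable way to solve a system.

```python
import numpy as np

A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])

# Solve Ax = b directly
x = np.linalg.solve(A, b)

# Check 1: substituting x back should reproduce b
residual = A @ x - b
print("A @ x equals b:", np.allclose(A @ x, b))

# Check 2: the explicit inverse gives the same solution for this small system
x_via_inverse = np.linalg.inv(A) @ b
print("Matches inverse-based solution:", np.allclose(x, x_via_inverse))
```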

 

(5) Data visualization and interoperability

import matplotlib.pyplot as plt

# Generate sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Visualize the data
plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()

-->  Result, 

 

(6) Machine learning and data science

from sklearn.linear_model import LinearRegression

# A simple linear regression example
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Train the model
model = LinearRegression().fit(X, y)

# Predict
predictions = model.predict(X)

print("Predictions:", predictions)

-->  Result, Predictions: [1.21  1.635 2.06  2.485 2.91 ]
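To see the NumPy side of this more directly, the same straight-line fit can be sketched with plain least squares via `np.linalg.lstsq`. This is only an illustration on the same five points and makes no claim about how scikit-learn computes its result internally.

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Design matrix with a column of ones so the fit includes an intercept
design = np.column_stack([X, np.ones_like(X)])

# Least-squares solution of design @ [slope, intercept] ≈ y
(slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
predictions = slope * X + intercept

print("Slope:", round(float(slope), 3), "Intercept:", round(float(intercept), 3))
print("Predictions:", predictions.round(3))
```

The predictions match the `LinearRegression` output above, which is expected since both are ordinary least-squares fits of a line to the same data.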

 

2) How about comparing 'before NumPy' and 'after NumPy'? It would help us understand NumPy's purpose and use cases more intuitively. Let's also organize the related code. 

In the past, before NumPy came along, large-scale data processing and numerical computation had to be done with Python's built-in data structures and loops. This was extremely inefficient, especially for large datasets or complex numerical calculations. The example below shows what code looked like in the days before NumPy.

Hands-on code 1. Large-scale matrix operations: The speed difference is dramatic, and the code reads much more intuitively too. 

- Before

# Before
## Large-scale matrix multiplication had to be implemented with loops.

# Matrix multiplication
import random

# Size of the two matrices
N = 1000

# Build the matrices
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[0]*N for _ in range(N)]

# Matrix multiplication
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]

--> Result, processing time = 2m 9.1s

 

- After applying numpy

# After NumPy
import numpy as np

# Build the matrices
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Matrix multiplication
C = np.dot(A, B)

-->  Result, processing time = 0m 0s
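The timings above came from notebook cell timers. To measure the gap in a repeatable way, something like the following sketch could be used. The matrix size `N = 60` is an assumption chosen deliberately small so the pure-Python loop finishes quickly, and the helper name `matmul_loops` is made up for this example.

```python
import time
import numpy as np

N = 60  # deliberately small so the pure-Python loop finishes quickly
rng = np.random.default_rng(0)
A = rng.random((N, N))
B = rng.random((N, N))

def matmul_loops(a, b):
    """Triple-loop matrix multiplication on nested Python lists."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            for k in range(m):
                c[i][j] += a[i][k] * b[k][j]
    return c

# Time the loop version on plain lists
start = time.perf_counter()
C_loops = matmul_loops(A.tolist(), B.tolist())
loop_seconds = time.perf_counter() - start

# Time the NumPy version on the same data
start = time.perf_counter()
C_numpy = A @ B
numpy_seconds = time.perf_counter() - start

print(f"loops: {loop_seconds:.4f}s, NumPy: {numpy_seconds:.6f}s")
print("Same result:", np.allclose(C_loops, C_numpy))
```

The exact numbers depend on the machine, so the useful takeaway is the ratio between the two and the fact that both versions produce the same matrix.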

 

Hands-on code 2. Image processing: The code is shorter and the speed is faster too. 

- Before

# Before
## Adjusting image brightness was complicated.

from PIL import Image

# Load the image (convert to RGB so each pixel unpacks as (r, g, b), even for RGBA PNGs)
image = Image.open('temp/image.png').convert('RGB')
pixels = list(image.getdata())
width, height = image.size

# Adjust brightness
new_pixels = [(int(r*0.5), int(g*0.5), int(b*0.5)) for (r, g, b) in pixels]

# Build the new image
new_image = Image.new('RGB', (width, height))
new_image.putdata(new_pixels)
new_image.save('temp/darkened_image.png')

-->  Result, processing time = 1m 1s

 

- After applying numpy

# Install package
! pip install scikit-image
# After NumPy
import numpy as np
from skimage import io

# Load the image
image = io.imread('temp/image.png')

# Adjust brightness
darkened_image = image * 0.5

# Save the image
io.imsave('temp/darkened_image.png', darkened_image.astype(np.uint8))

--> Result, processing time = 0m 7s

Left – before, right – after (image source: Sky Cumulus Atmosphere Cloud — free image on Pixabay )
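Darkening with a factor below 1 is safe, but brightening with a factor above 1 can push values past 255, and casting those straight to `uint8` does not give a sensible image. A hedged sketch of a more defensive version follows. The tiny 2x2 array stands in for a loaded image, and the helper name `adjust_brightness` is an assumption made for illustration.

```python
import numpy as np

# Hypothetical 2x2 RGB image standing in for a loaded file
image = np.array([[[200, 100, 50], [255, 255, 255]],
                  [[0, 0, 0], [120, 180, 240]]], dtype=np.uint8)

def adjust_brightness(img, factor):
    """Scale pixel values, clip to the valid 0-255 range, and return uint8."""
    scaled = img.astype(np.float32) * factor
    return np.clip(scaled, 0, 255).astype(np.uint8)

darker = adjust_brightness(image, 0.5)
brighter = adjust_brightness(image, 1.5)

print("Original pixel:", image[0, 0])
print("Darker pixel:  ", darker[0, 0])
print("Brighter pixel:", brighter[0, 0])
```

The same function works unchanged on a full-size array returned by an image loader, since the arithmetic is applied to every pixel at once.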



3. Conclusion 

(1) The earlier code examples made the difference clear. To wrap up, let's move beyond feature-oriented code and finish with some hands-on code more grounded in real-world business scenarios.

[ Case ] Product sales analysis: by analyzing the monthly sales volume and profit of multiple products, you can figure out which product is the most profitable and which month has the strongest sales. (source: A real-world example using NumPy)

- Sample data setup 

import numpy as np

# Each row represents a product, each column represents monthly sales volume.
sales_data = np.array([
[120, 135, 150, 145],
[100, 110, 140, 130],
[90, 100, 95, 105]
])

# Profit per unit for each product
profit_per_unit = np.array([20, 30, 25])

 

- Compute total sales volume 

# Compute total sales volume per product (sum across the month columns)
total_sales = np.sum(sales_data, axis=1)
print(f"Total sales: {total_sales}")

--> Result,  Total sales: [550 480 390]

 

- Compute total profit per product

# Compute total profit per product (units sold per product x profit per unit)
total_profit = total_sales * profit_per_unit
print(f"Total profit: {total_profit}")

--> Result,  Total profit: [11000 14400  9750]

 

- Find the month with the highest sales volume

# Find the month with the highest sales volume
highest_sales_month = np.argmax(np.sum(sales_data, axis=0)) + 1
print(f"Month with the highest sales volume: month {highest_sales_month}")

--> Result, Month with the highest sales volume: month 3

 

- Find the most profitable product

# Find the most profitable product
highest_profit_product = np.argmax(total_profit) + 1
print(f"Most profitable product: product {highest_profit_product}")

--> Result, Most profitable product: product 2
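In this kind of analysis it is easy to mix up which axis is "product" and which is "month". One way to keep that explicit, sketched here on the same sample data, is to build the full product-by-month profit table with broadcasting and then sum it along each axis. The variable names are assumptions introduced only for this sketch.

```python
import numpy as np

# Rows are products, columns are months (same sample data as above)
sales_data = np.array([
    [120, 135, 150, 145],
    [100, 110, 140, 130],
    [90, 100, 95, 105],
])
profit_per_unit = np.array([20, 30, 25])

# Broadcasting: turn profit_per_unit into a (3, 1) column so each product row
# is multiplied by its own per-unit profit
profit_matrix = sales_data * profit_per_unit[:, np.newaxis]

profit_by_product = profit_matrix.sum(axis=1)  # sum over months, one value per product
profit_by_month = profit_matrix.sum(axis=0)    # sum over products, one value per month

print("Profit by product:", profit_by_product)
print("Profit by month:  ", profit_by_month)
print("Most profitable product:", np.argmax(profit_by_product) + 1)
print("Most profitable month:  ", np.argmax(profit_by_month) + 1)
```

Checking the shape of each result (3 values for products, 4 for months) is a cheap guard against summing along the wrong axis.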

 

 

 


GitHub repository link 

 

GitHub - normalstory/2024AI_CoachingStudy


 

This English version was translated by Claude.

Written by
친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
