
LET's AI 2024 Coaching Study - Week 1, Team Mission

May 26, 2024·7 min read


This post is a record of the discussion we had this week, where we picked specific keywords from our learning content and researched them together.
Beyond just doing the theoretical research and pulling the material together, we also organized the corresponding hands-on code so that we can ~~copy-paste?~~ reference it later when needed.

- Update: Even with sources and names credited, I decided after the fact that it is not appropriate to post my teammates' contributions on this blog, so I have removed all of those portions.

Topic 1.

Research the phenomenon known as the 'curse of dimensionality' that appears as data dimensionality grows, and discuss what causes this phenomenon and the methods used to address it. [ Topic: Data and Dimensionality ]

(omitted)
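Since the research write-up itself is omitted here, a small experiment can stand in for one symptom of the curse of dimensionality: as the number of dimensions grows, the nearest and farthest neighbors of a point end up at almost the same distance. This is only a minimal sketch under my own assumptions (uniform random points in a unit hypercube, and a hypothetical helper named `distance_contrast`), not the material from the original research.

```python
import numpy as np


def distance_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (max - min) / min over the distances from one query point to all others.

    Hypothetical helper for illustration: a large value means "near" and "far"
    are easy to tell apart, a value close to 0 means every point looks equally far.
    """
    rng = np.random.default_rng(seed)
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)               # one extra point to measure from
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())


# Same number of points, increasing dimensionality
contrast_by_dim = {d: distance_contrast(500, d) for d in (2, 10, 100, 1000)}
for d, c in contrast_by_dim.items():
    print(f"{d:>4} dims: relative contrast = {c:.2f}")
```

The contrast shrinks as the dimension grows, which is one reason distance-based methods struggle on raw high-dimensional data and why dimensionality reduction is a common remedy.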

1) Academic and theoretical understanding is great, but it would be even better to look more concretely at how this is applied in code and what the results look like.

(1) Iris dataset example

PCA example code

t-SNE example code

(2) digits dataset example

PCA example code

t-SNE example code
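The original example code for these two datasets is not reproduced in this post. As a stand-in, here is a minimal sketch of how such a comparison could be set up with scikit-learn. The helper name `embed_2d`, the scaling step, the `perplexity=30` setting, and the 500-sample digits subset are all my own assumptions, not the team's original code.

```python
from sklearn.datasets import load_digits, load_iris
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler


def embed_2d(X, seed=42):
    """Return (pca_2d, tsne_2d): two 2D projections of the same feature matrix."""
    X_scaled = StandardScaler().fit_transform(X)  # put features on a comparable scale
    pca_2d = PCA(n_components=2).fit_transform(X_scaled)
    tsne_2d = TSNE(n_components=2, perplexity=30, random_state=seed).fit_transform(X_scaled)
    return pca_2d, tsne_2d


iris = load_iris()
digits = load_digits()

iris_pca, iris_tsne = embed_2d(iris.data)               # 150 samples, 4 features
digits_pca, digits_tsne = embed_2d(digits.data[:500])   # first 500 samples, 64 features

print("Iris:  ", iris_pca.shape, iris_tsne.shape)
print("Digits:", digits_pca.shape, digits_tsne.shape)
```

To compare the two visually, each result can be passed to `plt.scatter(emb[:, 0], emb[:, 1], c=labels)` with `iris.target` or `digits.target[:500]` as the labels.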

2) From the reference articles alone, the use cases of each technique seemed clearly distinct. But in my own simple side-by-side of the actual code outputs, t-SNE looks superior to PCA?? Honestly, the table-style research write-up didn't quite click for me. It would help to organize this more concretely, spelling out 'in this case use this technique, in that case use that technique'.

3. Conclusion

Let's organize the criteria for picking PCA vs t-SNE on a case-by-case basis.

1) Data characteristics:
- High-dimensional data with linear structure: PCA
- High-dimensional data with non-linear structure: t-SNE

2) Speed and scale:
- Need fast processing on large-scale data: PCA
- Smaller data where local structure matters: t-SNE

3) Purpose:
- Need feature extraction and data compression: PCA
- Need data visualization and pattern discovery: t-SNE

4) Practical examples

- PCA use cases:

Analyzing the key characteristics of customer data to find important variables, then reducing dimensionality to improve model performance.
Extracting key gene patterns from genetic data while reducing noise so the data is easier to analyze.

- t-SNE use cases:

Reducing image or text embeddings to 2D so similar images or documents can be visualized as clusters.
In marketing campaigns, visualizing customer behavior patterns to identify customer groups exhibiting similar behavior.

These two techniques can also be complementary. For example, it's common in practice to first reduce dimensionality somewhat with PCA, then apply t-SNE on top for visualization.
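The "PCA first, then t-SNE" combination mentioned above can be sketched roughly as follows. The 95% variance threshold, the `perplexity=30` setting, and the 500-sample subset of the digits dataset are assumptions I picked to keep the example quick, not recommended values.

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

digits = load_digits()
X = digits.data[:500]  # subset of the 64-dimensional digit images, to keep t-SNE quick

# Step 1: PCA as a fast linear compression step.
# A float n_components keeps just enough components to explain that share of the variance.
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

# Step 2: t-SNE on the compressed features, only to get a 2D layout for plotting.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X_reduced)

print("Original features:", X.shape[1])
print("After PCA:", X_reduced.shape[1])
print("After t-SNE:", X_2d.shape[1])
```

The PCA output is the part you would keep for compression or as model input. The t-SNE output is meant for looking at, not for feeding into a model.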

Topic 2.

Describe the characteristics of 'NumPy', one of Python's libraries, and investigate which tasks have become more convenient thanks to NumPy. [ Topic: Python ]

1. Research

1. Numpy explained

(omitted..)

2. Discussion

1) The points are organized neatly enough to follow, but I feel the explanation is too conceptual. For a more practical understanding, it would help to look up code examples that illustrate the cases above.

(1) Efficient handling of multi-dimensional arrays

import numpy as np

# Create a 2D array
array_2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Inspect the array shape
print("Array shape:", array_2d.shape)

# Access and modify array elements
array_2d[1, 2] = 10
print("Modified array:\n", array_2d)

--> Result,
Array shape: (3, 3)
Modified array:
 [[ 1  2  3]
 [ 4  5 10]
 [ 7  8  9]]

(2) High-performance numerical computation

# Create two large arrays
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Element-wise array multiplication
result = a * b

# Print the first 5 results
print(result[:5])

--> Result, [0.01176252 0.59470493 0.83800347 0.36904421 0.02989509]

(3) Data preprocessing and analysis

# Generate random data
data = np.random.randn(1000)

# Compute mean and standard deviation
mean = np.mean(data)
std_dev = np.std(data)
print("Mean:", mean)
print("Standard Deviation:", std_dev)

# Normalize the data
normalized_data = (data - mean) / std_dev
print("First 5 normalized data points:", normalized_data[:5])

--> Result,
Mean: -0.055785869551972574
Standard Deviation: 0.9666896586679969
First 5 normalized data points: [-1.22099914 0.03545876 -2.18956188 0.5683409 -0.61175484]

(4) Scientific computing and simulation

# Linear system Ax = b
A = np.array([[3, 1], [1, 2]])
b = np.array([9, 8])

# Solve the linear system
x = np.linalg.solve(A, b)
print("Solution x:", x)

--> Result, Solution x: [2. 3.]

(5) Data visualization and interoperability

import matplotlib.pyplot as plt

# Generate sample data
x = np.linspace(0, 10, 100)
y = np.sin(x)

# Visualize the data
plt.plot(x, y)
plt.title("Sine Wave")
plt.xlabel("x")
plt.ylabel("sin(x)")
plt.show()

--> Result, (figure: line plot of sin(x) for x from 0 to 10, titled "Sine Wave")

(6) Machine learning and data science

from sklearn.linear_model import LinearRegression

# A simple linear regression example
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Train the model
model = LinearRegression().fit(X, y)

# Predict
predictions = model.predict(X)
print("Predictions:", predictions)

--> Result, Predictions: [1.21 1.635 2.06 2.485 2.91 ]
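scikit-learn takes NumPy arrays as input, and for a one-variable case like this the same straight-line fit can be reproduced with NumPy alone. The sketch below swaps in `np.polyfit` (an ordinary least-squares polynomial fit) in place of `LinearRegression`. It is my own cross-check, not part of the original study material.

```python
import numpy as np

# Same sample data as the LinearRegression example, as 1D arrays
x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([1, 2, 1.3, 3.75, 2.25])

# Degree-1 least-squares fit; coefficients come back highest power first
slope, intercept = np.polyfit(x, y, deg=1)

# Evaluate the fitted line at the original x values
predictions = slope * x + intercept

print("Slope:", slope)
print("Intercept:", intercept)
print("Predictions:", predictions)
```

The predictions should match the scikit-learn output above up to floating-point rounding.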

2) How about comparing 'before NumPy' and 'after NumPy'? It would help us understand NumPy's purpose and use cases more intuitively. Let's also organize the related code.

In the past, before NumPy came along, large-scale data processing and numerical computation had to be done with Python's built-in data structures and loops. This was extremely inefficient, especially for large datasets or complex numerical calculations. The example below shows what code looked like in the days before NumPy.

Hands-on code 1. Large-scale matrix operations: The speed difference is dramatic, and the code reads much more intuitively too.

- Before

# Before
## Large-scale matrix multiplication had to be implemented with loops.

# Matrix multiplication
import random

# Size of the two matrices
N = 1000

# Build the matrices
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]
C = [[0]*N for _ in range(N)]

# Matrix multiplication
for i in range(N):
    for j in range(N):
        for k in range(N):
            C[i][j] += A[i][k] * B[k][j]

--> Result, processing time = 2m 9.1s

- After applying numpy

# After NumPy
import numpy as np

# Build the matrices
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Matrix multiplication
C = np.dot(A, B)

--> Result, processing time = 0m 0s
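The timings above came from notebook cell timers. To measure the gap directly and confirm that both versions compute the same matrix, something like the sketch below could be used. I shrank the size to N = 60 (my own choice) so the loop version finishes almost instantly. Actual timings depend on the machine, so none are quoted here.

```python
import random
import time

import numpy as np

N = 60  # far smaller than the post's N = 1000 so the loop version finishes quickly

# Build the inputs once as plain Python lists
A = [[random.random() for _ in range(N)] for _ in range(N)]
B = [[random.random() for _ in range(N)] for _ in range(N)]

# Pure-Python triple loop
start = time.perf_counter()
C_loops = [[0.0] * N for _ in range(N)]
for i in range(N):
    for j in range(N):
        for k in range(N):
            C_loops[i][j] += A[i][k] * B[k][j]
loop_seconds = time.perf_counter() - start

# NumPy on the same data
start = time.perf_counter()
C_numpy = np.dot(np.array(A), np.array(B))
numpy_seconds = time.perf_counter() - start

print(f"Loops: {loop_seconds:.4f}s, NumPy: {numpy_seconds:.4f}s")
print("Same result:", np.allclose(C_loops, C_numpy))
```

The loop version does N³ multiply-adds in the Python interpreter, so raising N back toward 1000 makes the difference grow quickly.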

Hands-on code 2. Image processing: The code is shorter and the speed is faster too.

- Before

# Before
## Adjusting image brightness was complicated.

from PIL import Image

# Load the image
image = Image.open('temp/image.png')
pixels = list(image.getdata())
width, height = image.size

# Adjust brightness
new_pixels = [(int(r*0.5), int(g*0.5), int(b*0.5)) for (r, g, b) in pixels]

# Build the new image
new_image = Image.new('RGB', (width, height))
new_image.putdata(new_pixels)
new_image.save('temp/darkened_image.png')

--> Result, processing time = 1m 1s

- After applying numpy

# Install package
! pip install scikit-image

# After NumPy
import numpy as np
from skimage import io

# Load the image
image = io.imread('temp/image.png')

# Adjust brightness
darkened_image = image * 0.5

# Save the image
io.imsave('temp/darkened_image.png', darkened_image.astype(np.uint8))

--> Result, processing time = 0m 7s

Left – before, right – after (image source: Sky Cumulus Atmosphere Cloud — free image on Pixabay )
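What `image * 0.5` actually does is easier to see on a tiny array than on a real photo. The sketch below uses a made-up 2x2 RGB array in place of `temp/image.png` (the pixel values are hypothetical) and also shows why brightening, unlike darkening, needs a clipping step.

```python
import numpy as np

# Hypothetical 2x2 RGB "image" standing in for temp/image.png
image = np.array(
    [[[255, 128, 0], [10, 20, 30]],
     [[0, 0, 0], [201, 101, 51]]],
    dtype=np.uint8,
)

# The scalar is broadcast to every pixel and channel; the result is float64
scaled = image * 0.5

# Casting back to uint8 drops the fractional part (127.5 becomes 127)
darkened = scaled.astype(np.uint8)

# Brightening can exceed 255, so clip to the valid range before casting
brightened = np.clip(image * 1.5, 0, 255).astype(np.uint8)

print("dtype after scaling:", scaled.dtype)
print("darkened top-left pixel:", darkened[0, 0])
print("brightened top-left pixel:", brightened[0, 0])
```

Without the clip, values above 255 would not fit in `uint8` when cast back, which is why I included it for the brightening case.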

3. Conclusion

(1) The earlier code examples made the difference clear. To wrap up, let's move beyond feature-oriented code and finish with some hands-on code more grounded in real-world business scenarios.

[ Case ] Product sales analysis: by analyzing the monthly sales volume and profit of multiple products, you can figure out which product is the most profitable and which month has the strongest sales. (source : A real-world example using NumPy )

- Sample data setup

import numpy as np

# Each row represents a product, each column represents monthly sales volume.
sales_data = np.array([
    [120, 135, 150, 145],
    [100, 110, 140, 130],
    [90, 100, 95, 105]
])

# Profit per unit for each product
profit_per_unit = np.array([20, 30, 25])

- Compute total sales volume

# Compute total sales volume
total_sales = np.sum(sales_data, axis=1)
print(f"Total sales: {total_sales}")

--> Result, Total sales: [550 480 390]

- Compute total profit per product

# Compute total profit per product: units sold per product (row sums) times profit per unit
total_profit = np.sum(sales_data, axis=1) * profit_per_unit
print(f"Total profit: {total_profit}")

--> Result, Total profit: [11000 14400  9750]

- Find the month with the highest sales volume

# Find the month with the highest sales volume
highest_sales_month = np.argmax(np.sum(sales_data, axis=0)) + 1
print(f"Month with the highest sales volume: month {highest_sales_month}")

--> Result, Month with the highest sales volume: month 3

- Find the most profitable product

# Find the most profitable product
highest_profit_product = np.argmax(total_profit) + 1
print(f"Most profitable product: product {highest_profit_product}")

--> Result, Most profitable product: product 2
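In this kind of analysis, which axis you sum over decides whether the answer is "per product" or "per month", and the two are easy to mix up. Here is a minimal sketch with the same sample data that builds the full profit table first and then sums it both ways. The names `profit_matrix`, `profit_per_product`, and `profit_per_month` are my own, for illustration only.

```python
import numpy as np

# Rows are products, columns are months (same sample data as above)
sales_data = np.array([
    [120, 135, 150, 145],
    [100, 110, 140, 130],
    [90, 100, 95, 105],
])
profit_per_unit = np.array([20, 30, 25])

# Reshape the per-product profit to a (3, 1) column so it broadcasts across the 4 months
profit_matrix = sales_data * profit_per_unit[:, np.newaxis]

profit_per_product = profit_matrix.sum(axis=1)  # collapse the months: one value per product
profit_per_month = profit_matrix.sum(axis=0)    # collapse the products: one value per month

print("Profit per product:", profit_per_product)
print("Profit per month:", profit_per_month)
print("Most profitable product: product", np.argmax(profit_per_product) + 1)
print("Most profitable month: month", np.argmax(profit_per_month) + 1)
```

Checking the length of each result is a quick sanity test: three values means per product, four values means per month.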

GitHub repository link: normalstory/2024AI_CoachingStudy (github.com)

This English version was translated by Claude.

Written by

친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
