This post is a record of the discussion we had this week, where we picked specific keywords from our learning content and researched them together.
Beyond researching the theory and pulling the material together, we also organized the corresponding hands-on code so that we can reference it (or copy-paste it) later when needed.
- Update: Even with sources and names credited, I later decided it was not appropriate to post content contributed by my teammates on this blog, so all related portions have been removed.
Topic 1.
Research the phenomenon known as the 'curse of dimensionality' that appears as data dimensionality grows, and discuss what causes this phenomenon and the methods used to address it. [ Topic: Data and Dimensionality ]
(omitted)
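To make the phenomenon a bit more tangible, here is a minimal sketch of one commonly cited symptom: as dimensionality grows, the nearest and farthest neighbors of a point end up at almost the same distance. The helper name `distance_contrast`, the sample sizes, and the uniform random data are all assumptions made for illustration, not part of the original research notes.

```python
import numpy as np

def distance_contrast(n_points, n_dims, rng):
    """Return (max - min) / min over distances from one query point to the others."""
    points = rng.random((n_points, n_dims))  # uniform samples in the unit hypercube
    query = rng.random(n_dims)
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

rng = np.random.default_rng(0)
contrasts = {d: distance_contrast(500, d, rng) for d in (2, 10, 100, 1000)}
for d, c in contrasts.items():
    print(f"dims={d:>4}  relative contrast={c:.3f}")
```

The relative contrast should shrink as the number of dimensions goes up, which is one reason distance-based methods tend to struggle on raw high-dimensional data.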
1) Academic and theoretical understanding is great, but it would be even better to look more concretely at how this is applied in code and what the results look like.
(1) Iris dataset example
- PCA example code
- t-SNE example code
(2) digits dataset example
- PCA example code
- t-SNE example code
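The example code itself is not reproduced in this post. As a stand-in, the sketch below shows the core of PCA using only NumPy's SVD, on hypothetical synthetic data rather than Iris or digits. The function name `pca` and the generated data are assumptions; a library implementation such as scikit-learn's would normally be used instead, and t-SNE is not covered here.

```python
import numpy as np

def pca(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing singular value
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained_variance_ratio = (S**2 / np.sum(S**2))[:n_components]
    return X_centered @ components.T, explained_variance_ratio

rng = np.random.default_rng(0)
# Hypothetical data: 4 features built from 2 latent factors plus a little noise
latent = rng.normal(size=(150, 2))
mixing = rng.normal(size=(2, 4))
X = latent @ mixing + 0.05 * rng.normal(size=(150, 4))

X_2d, ratio = pca(X, n_components=2)
print("Reduced shape:", X_2d.shape)
print("Explained variance ratio:", ratio)
```

Because the synthetic data really is close to two-dimensional and linear, two components capture almost all of the variance. That is exactly the situation where PCA is a good fit.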
2) Going by the reference articles alone, the use cases of each technique seemed clearly distinct. But in my own simple side-by-side of the actual code outputs, t-SNE looks better than PCA, and honestly the table-style research write-up didn't quite click for me. It would help to spell out the use cases more concretely, along the lines of 'in this case use this technique, in that case use that technique'.
3. Conclusion
Let's organize the criteria for picking PCA vs t-SNE on a case-by-case basis.
1) Data characteristics:
- High-dimensional data with linear structure: PCA
- High-dimensional data with non-linear structure: t-SNE
2) Speed and scale:
- Need fast processing on large-scale data: PCA
- Smaller data where local structure matters: t-SNE
3) Purpose:
- Need feature extraction and data compression: PCA
- Need data visualization and pattern discovery: t-SNE
4) Practical examples
- PCA use cases:
Analyzing the key characteristics of customer data to find important variables, then reducing dimensionality to improve model performance.
Extracting key gene patterns from genetic data while reducing noise so the data is easier to analyze.
- t-SNE use cases:
Reducing image or text embeddings to 2D so similar images or documents can be visualized as clusters.
In marketing campaigns, visualizing customer behavior patterns to identify customer groups exhibiting similar behavior.
These two techniques can also be complementary. For example, it's common in practice to first reduce dimensionality somewhat with PCA, then apply t-SNE on top for visualization.
Topic 2.
Describe the characteristics of 'NumPy', one of Python's libraries, and investigate which tasks have become more convenient thanks to NumPy. [ Topic: Python ]
1. Research
1. NumPy explained
(omitted..)
2. Discussion
1) The points are organized neatly enough to follow, but I feel the explanation is too conceptual. For a more practical understanding, it would help to look up code examples that illustrate the cases above.
(1) Efficient handling of multi-dimensional arrays
--> Result,
Array shape: (3, 3)
Modified array:
[[ 1  2  3]
 [ 4  5 10]
 [ 7  8  9]]
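A minimal sketch that would produce output like the above, assuming the original started from a 3x3 array holding 1 through 9 and overwrote a single element (the starting values are inferred from the printed result, not taken from the original code):

```python
import numpy as np

# Build a 3x3 array from nested lists
arr = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]])
print("Array shape:", arr.shape)

# Index with (row, column) and assign in place
arr[1, 2] = 10
print("Modified array:")
print(arr)
```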
(2) High-performance numerical computation
--> Result, [0.01176252 0.59470493 0.83800347 0.36904421 0.02989509]
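The code behind this result isn't shown, so the following is only a guess at its shape: an element-wise operation over large random arrays, written as a single vectorized expression. The array size and the choice of multiplication are assumptions, and because the generator is unseeded the printed numbers will differ from the ones above.

```python
import numpy as np

rng = np.random.default_rng()  # unseeded, so the printed values change on every run
a = rng.random(1_000_000)
b = rng.random(1_000_000)

# One vectorized expression replaces a million-iteration Python loop
product = a * b
print(product[:5])
```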
(3) Data preprocessing and analysis
--> Result,
Mean: -0.055785869551972574
Standard Deviation: 0.9666896586679969
First 5 normalized data points: [-1.22099914 0.03545876 -2.18956188 0.5683409 -0.61175484]
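Output like this is what a z-score normalization typically prints. Here is a sketch under the assumption that the input was a sample drawn from a standard normal distribution (the sample size of 100 is a guess, and the values will not match the ones above since no seed is known):

```python
import numpy as np

rng = np.random.default_rng()  # unseeded; the original seed, if any, is unknown
data = rng.normal(loc=0.0, scale=1.0, size=100)  # hypothetical raw data

mean = data.mean()
std = data.std()

# Z-score normalization: shift to zero mean, scale to unit standard deviation
normalized = (data - mean) / std

print("Mean:", mean)
print("Standard Deviation:", std)
print("First 5 normalized data points:", normalized[:5])
```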
(4) Scientific computing and simulation
--> Result, Solution x: [2. 3.]
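One system of linear equations that has this solution is 3x0 + x1 = 9 and x0 + 2x1 = 8. Whether the original code used exactly these coefficients is an assumption, but the sketch shows how `np.linalg.solve` handles such a system:

```python
import numpy as np

# Coefficient matrix and right-hand side for:
#   3*x0 + 1*x1 = 9
#   1*x0 + 2*x1 = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

x = np.linalg.solve(A, b)
print("Solution x:", x)
```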
(5) Data visualization and interoperability
--> Result, (plot image omitted)
(6) Machine learning and data science
--> Result, Predictions: [1.21 1.635 2.06 2.485 2.91 ]
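Evenly spaced predictions like these suggest a simple linear model. The sketch below fits a line by least squares with plain NumPy. The helper `fit_line` and the training data are hypothetical, so the numbers it prints will not match the ones above.

```python
import numpy as np

def fit_line(x, y):
    """Least-squares fit of y ~ slope * x + intercept."""
    design = np.column_stack([x, np.ones_like(x)])  # columns: x and a constant term
    (slope, intercept), *_ = np.linalg.lstsq(design, y, rcond=None)
    return slope, intercept

# Hypothetical training data, roughly following y = x
x_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_train = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

slope, intercept = fit_line(x_train, y_train)

x_new = np.array([6.0, 7.0, 8.0])
predictions = slope * x_new + intercept
print("Predictions:", predictions)
```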
2) How about comparing 'before NumPy' and 'after NumPy'? It would help us understand NumPy's purpose and use cases more intuitively. Let's also organize the related code.
In the past, before NumPy came along, large-scale data processing and numerical computation had to be done with Python's built-in data structures and loops. This was extremely inefficient, especially for large datasets or complex numerical calculations. The example below shows what code looked like in the days before NumPy.
Hands-on code 1. Large-scale matrix operations: The speed difference is dramatic, and the code reads much more intuitively too.
- Before
--> Result, processing time = 2m 9.1s
- After applying NumPy
--> Result, processing time = 0m 0s
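The before/after pair can be sketched as follows. The matrix size here is deliberately small (a hypothetical 100x100) so the loop version finishes quickly. The timings recorded above presumably came from much larger matrices, and the function name `matmul_loops` is made up for this example.

```python
import time
import numpy as np

def matmul_loops(a, b):
    """Multiply two matrices stored as lists of lists, using plain Python loops."""
    n, m, p = len(a), len(b), len(b[0])
    result = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            result[i][j] = total
    return result

n = 100  # hypothetical size, kept small on purpose
rng = np.random.default_rng(0)
A = rng.random((n, n))
B = rng.random((n, n))

start = time.perf_counter()
C_loops = matmul_loops(A.tolist(), B.tolist())  # "before": nested lists and loops
loop_seconds = time.perf_counter() - start

start = time.perf_counter()
C_numpy = A @ B  # "after": a single matrix-multiply expression
numpy_seconds = time.perf_counter() - start

print(f"Loops: {loop_seconds:.4f}s, NumPy: {numpy_seconds:.6f}s")
print("Results match:", np.allclose(C_loops, C_numpy))
```

Since the loop version does n^3 multiplications in interpreted Python, the gap widens quickly as n grows.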
Hands-on code 2. Image processing: The code is shorter and the speed is faster too.
- Before
--> Result, processing time = 1m 1s
- After applying NumPy
--> Result, processing time = 0m 7s
Left – before, right – after (image source: Sky Cumulus Atmosphere Cloud — free image on Pixabay )
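The exact image operation isn't recorded here, so as an illustration only, assume it was a grayscale conversion. The sketch compares a per-pixel loop with a vectorized version on a small synthetic image. The function names, the image size, and the use of the standard luma weights (0.299, 0.587, 0.114) are all assumptions.

```python
import numpy as np

LUMA_WEIGHTS = (0.299, 0.587, 0.114)  # standard RGB-to-luma weights

def grayscale_loops(image):
    """'Before': visit every pixel of a nested-list RGB image one at a time."""
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            r, g, b = image[y][x]
            out[y][x] = int(LUMA_WEIGHTS[0] * r + LUMA_WEIGHTS[1] * g + LUMA_WEIGHTS[2] * b)
    return out

def grayscale_numpy(image):
    """'After': weight the last (channel) axis of an (H, W, 3) array in one step."""
    return (image @ np.array(LUMA_WEIGHTS)).astype(np.uint8)

rng = np.random.default_rng(0)
# Hypothetical 64x48 RGB image filled with random pixel values
image = rng.integers(0, 256, size=(48, 64, 3), dtype=np.uint8)

gray_loops = grayscale_loops(image.tolist())
gray_numpy = grayscale_numpy(image)
print("Output shape:", gray_numpy.shape)
```

The two versions can differ by at most one gray level on individual pixels because of floating-point rounding before truncation, which is harmless for this kind of comparison.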
3. Conclusion
(1) The earlier code examples made the difference clear. To wrap up, let's move beyond feature-oriented code and finish with some hands-on code more grounded in real-world business scenarios.
[ Case ] Product sales analysis: by analyzing the monthly sales volume and profit of multiple products, you can figure out which product is the most profitable and which month has the strongest sales. (source: A real-world example using NumPy)
- Sample data setup
- Compute total sales volume
--> Result, Total sales: [550 480 390]
- Compute total profit per product
--> Result, Total profit: [7650 8500 9575 9425]
- Find the month with the highest sales volume
--> Result, Month with the highest sales volume: month 3
- Find the most profitable product
--> Result, Most profitable product: product 3
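The steps above can be sketched end to end as follows. The sales figures and per-unit profits are hypothetical, so the totals differ from the recorded results. One thing worth double-checking against the original code: the recorded outputs show three entries for total sales but four for total profit, which suggests the two sums may have run over different axes. The sketch keeps products on rows and months on columns throughout.

```python
import numpy as np

# Hypothetical data: rows = 3 products, columns = 4 months (units sold)
sales = np.array([
    [100, 120, 150, 130],  # product 1
    [ 90, 110, 140, 100],  # product 2
    [ 60,  80, 120,  70],  # product 3
])
unit_profit = np.array([10, 12, 20])  # hypothetical profit per unit, one per product

# Total sales volume per product: sum across months (axis=1)
total_sales_per_product = sales.sum(axis=1)
print("Total sales:", total_sales_per_product)

# Total profit per product: units sold times profit per unit
total_profit_per_product = total_sales_per_product * unit_profit
print("Total profit:", total_profit_per_product)

# Month with the highest sales volume: sum across products (axis=0), then argmax
monthly_sales = sales.sum(axis=0)
best_month = int(np.argmax(monthly_sales)) + 1  # +1 for 1-based month numbers
print(f"Month with the highest sales volume: month {best_month}")

# Most profitable product
best_product = int(np.argmax(total_profit_per_product)) + 1
print(f"Most profitable product: product {best_product}")
```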
GitHub - normalstory/2024AI_CoachingStudy
