LET's AI 2024 Coaching Study - Week 1, Team Glossary

May 21, 2024·22 min read

Since last week, I've been participating as a leader booster in the LET's AI 2024 coaching study run by Modu's Research.
- Lolong Coach_Team 6 : @Park Munji @Seo Jinhyung @Eom Subin @Enter6 (Byeon Charles) @miji (Park Miji) @Shinar (Shin Aera)

LET's AI 2024 Coaching Study
- Week 1, Team Glossary
- Week 1, Team Mission
- Week 2, Team Glossary
- Week 2, Team Mission

One of the Week 1 team assignments was to gather all the new keywords we encountered while studying into a single glossary, and this post is that glossary.
Since the keywords accumulated over the course of the study will probably add up, I decided to collect them in plain text, CSV, and other clearly structured formats so that they can later be reused for an LLM RAG practice exercise.

<Data Literacy> (Data literacy)

Data literacy is the ability to read, understand, create, and communicate data as information. (Source: Wikipedia)

It refers to the ability to understand and interpret data, and on that basis to make meaningful decisions. This is becoming more and more important in modern society and is essential in many fields for using data to solve problems and create value.

Data literacy includes several aspects:
- Data understanding: The ability to read data, understand the source and structure of the data, and grasp what the data represents. This means being able to interpret data presented in the form of numbers, graphs, and tables.
- Data analysis: The ability to analyze data and derive meaningful information from it. This includes basic statistical analysis, data visualization, and data mining.
- Data interpretation: The ability to interpret analyzed data, understand its meaning, and draw conclusions from it. This involves identifying trends and patterns shown by the data and using them to solve problems or gain new insights.
- Data utilization: The ability to apply data to real-world situations and make decisions. This supports Data-Driven Decision Making (DDDM) and includes data-based strategy formulation, problem solving, and performance evaluation.
- Data ethics: The ability to understand and follow ethical standards and legal requirements when using data. This covers personal information protection, data security, and the accuracy and reliability of data.

Reasons why data literacy matters:
- Decision making: Data-based decision making allows for more accurate and efficient decisions.
- Problem solving: Analyzing data helps identify the cause of a problem and derive effective solutions.
- Innovation: Gaining new insights from data uncovers new opportunities and drives innovation.
- Competitiveness: Effectively using data secures a competitive edge and strengthens market competitiveness.

For these reasons, data literacy is an essential capability that helps not only individuals but also entire organizations adopt a data-centric mindset and produce better outcomes through data.

<Data Mining> (Data Mining)

A process of systematically and automatically analyzing statistical rules or patterns within large amounts of stored data to extract valuable information. (Source - Wikipedia)

<Data Preprocessing> (Data preprocessing)

Data preprocessing can refer to manipulating, filtering, or augmenting data before it is analyzed, and it is often a critical step in the data mining process. If irrelevant and redundant information exists, or if the share of noisy and unreliable data is high, knowledge discovery during the training phase becomes more difficult. The data preparation and filtering phase can take a considerable amount of time. Examples of methods used for data preprocessing include cleaning, instance selection, normalization, one-hot encoding, data transformation, feature extraction, and feature selection. (Source: Wikipedia)
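As a small illustration of two of the methods named above, the sketch below applies min-max normalization and one-hot encoding with NumPy. The sample values and variable names are made up for this example; real projects often rely on a dedicated library for these steps.

```python
import numpy as np

# Hypothetical raw feature: heights in centimeters.
heights = np.array([150.0, 160.0, 170.0, 180.0])

# Min-max normalization: rescale the values into the range [0, 1].
normalized = (heights - heights.min()) / (heights.max() - heights.min())

# Hypothetical categorical feature stored as integer class ids.
colors = np.array([0, 2, 1, 2])
num_classes = 3

# One-hot encoding: row i has a single 1 in the column given by colors[i].
one_hot = np.eye(num_classes, dtype=int)[colors]

print(normalized)
print(one_hot)
```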

<Data Science> (Data Science)

Similar to data mining, it is an interdisciplinary field that uses scientific methodologies, processes, algorithms, and systems to extract knowledge and insights from various types of data, both structured and unstructured.

Main components of data science:
- Data Collection: Collecting data from various sources. This may include databases, web crawling, sensor data, and APIs.
- Data Preprocessing: Transforming the collected data into a form suitable for analysis. This process includes data cleaning, missing-value handling, data transformation, and outlier removal.
- Data Analysis: Exploring data, finding patterns using statistical techniques, and summarizing data. Descriptive statistics and exploratory data analysis (EDA) belong to this stage.
- Data Modeling: Using machine learning algorithms and statistical models to learn from data and build predictive models. Techniques such as regression, classification, clustering, and dimensionality reduction are used.
- Data Visualization: A process of expressing data visually so that it is easier to understand. Graphs, charts, and dashboards are used to communicate data visually.
- Interpretation of Results: Interpreting analytical results and drawing meaningful conclusions on that basis. Explaining how the results can be applied to business or research questions.
- Decision Support: Supporting decision making based on the results of data analysis. This is used in many fields such as business strategy formulation, policy decisions, and product development.

Major techniques and tools of data science:
- Descriptive Statistics: A statistical technique that summarizes and describes the basic characteristics of data.
- Predictive Analytics: An analytical technique that predicts the future based on past data. Machine learning and regression analysis are major tools.
- Data Mining: A process of extracting patterns or meaningful information from large amounts of data.
- Machine Learning: Developing and applying algorithms that learn from data and make predictions.
- Big Data Technologies: Using big data frameworks such as Hadoop and Spark to process and analyze large volumes of data.
- Programming languages: Languages used for data analysis and processing such as Python, R, and SQL.

Application areas of data science:
- Business: marketing strategy, customer segmentation, demand forecasting, financial analysis, and so on.
- Healthcare: disease diagnosis, patient management, medical image analysis, and so on.
- Public policy: crime prediction, traffic management, environmental monitoring, and so on.
- Sports: game analysis, optimization of player performance, and so on.
- Technology: recommendation systems, natural language processing, computer vision, and so on.

The role of a data scientist: A data scientist analyzes diverse data to solve organizational problems and to support data-based decision making. Data scientists carry out the following tasks:
- Data collection and preprocessing
- Data analysis and visualization
- Machine learning model development and evaluation
- Interpreting analysis results and writing reports
- Proposing data-based strategies for solving business problems

Data science is an important field that maximizes the value and usefulness of data and drives innovation in various areas.

<Domain Experience> (Domain Experience)

Domain experience refers to deep knowledge of and familiarity with a specific industry or field. Domain experience plays a very important role in the process of analyzing and interpreting data. With it, a data scientist can do more than discover patterns in the data; they can understand what those patterns actually mean within the relevant field.

Reasons why domain experience matters:
- Problem definition: A data science project starts with a clear problem definition. With domain experience, you can identify the right problems and ask the right questions.
- Data understanding: With domain experience, you can understand how the data was generated and collected, and what the data actually means. This is very useful in the data preprocessing stage.
- Modeling: Domain knowledge helps you choose appropriate models, interpret the model's results, and draw conclusions that are meaningful in the business context.
- Interpreting results: Domain experience is essential when deciding how to apply the results of data analysis within the relevant field. It helps predict how the results will affect the actual business.
- Decision support: The ultimate goal of data science is to make better decisions based on data. Domain experience allows a data scientist to provide insights into business strategy and to give useful advice to decision makers.

For example, a data scientist in the medical field needs deep understanding of medical data, patient records, and methods of disease diagnosis and treatment. In the financial field, knowledge of financial products, market trends, and risk management is important.
In conclusion, domain experience plays a crucial role in data science, both in analyzing data effectively and in applying the results to actual business or industry contexts.

<Deep Learning> (Deep Learning)

Deep Learning is a branch of machine learning that uses Artificial Neural Networks to learn from data and recognize complex patterns. Deep learning excels especially at processing and analyzing large amounts of data, and it is showing remarkable performance in various fields including image recognition, speech recognition, and natural language processing.

Main concepts and components of deep learning are as follows:
- Artificial Neural Networks: Structures inspired by biological neural networks, designed as algorithms to process and learn from input data. A neural network consists of an Input Layer, Hidden Layers, and an Output Layer.
- Neuron: The basic unit of a neural network. It receives an input, processes it together with weights, and produces an output via an activation function.
- Weight: A parameter that adjusts the importance of input data, tuned during the learning process.
- Activation Function: A function that applies a nonlinear transformation to a neuron's output, increasing the expressive power of the neural network. Examples: ReLU (Rectified Linear Unit), Sigmoid, Tanh (hyperbolic tangent).
- Deep Neural Networks (DNN): Artificial neural networks with multiple hidden layers, capable of learning complex data and recognizing high-dimensional patterns. The more hidden layers there are, the "deeper" the network is said to be.
- Convolutional Neural Networks (CNN): Neural networks mainly used for image recognition. They extract spatial features of images using Convolutional Layers and Pooling Layers.
  - Convolutional layer: Uses filters to extract features from images.
  - Pooling layer: Reduces the size of feature maps, lowering the computation cost while emphasizing important features.
- Recurrent Neural Networks (RNN): Neural networks used to process time-series or sequential data. They remember information from previous inputs and process it together with the current input. Variants include LSTM (Long Short-Term Memory) and GRU (Gated Recurrent Unit).
- Autoencoder: A neural network that compresses (encodes) input data and then reconstructs (decodes) it, learning features in the process. Mainly used for dimensionality reduction and noise removal.
- Generative Adversarial Networks (GAN): A structure where two neural networks (a generator and a discriminator) compete with each other while learning, used to generate realistic data. Examples include image generation and style transfer.

Main characteristics of deep learning:
- Large-scale data processing: It can effectively learn from large volumes of data and recognize high-dimensional patterns.
- Self-learning: It automatically extracts and learns complex features, eliminating the need to manually design features.
- Wide range of applications: It is used in many fields such as image recognition, speech recognition, natural language processing, autonomous driving, and recommendation systems.

The growth of deep learning has been made possible by the advancement of computing power and the increased accessibility of large amounts of data. As a result, deep learning has become a very useful tool for solving complex, high-dimensional problems.
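The neuron, weight, and activation function entries above can be sketched as a single forward pass. This is a minimal illustration, not a full network: the input values, weights, and bias are made up, and `relu` and `sigmoid` are small helper functions written for this example.

```python
import numpy as np

def relu(z):
    """ReLU: pass positive values through, clamp negatives to 0."""
    return np.maximum(0.0, z)

def sigmoid(z):
    """Sigmoid: squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical input with three features, plus made-up weights and bias.
inputs = np.array([1.0, 2.0, 3.0])
weights = np.array([0.5, -1.0, 0.25])
bias = 0.1

# A neuron first computes the weighted sum of its inputs plus the bias...
z = np.dot(inputs, weights) + bias

# ...and then applies a non-linear activation function to that sum.
print(z, relu(z), sigmoid(z), np.tanh(z))
```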

<Machine Learning> (Machine Learning)

Machine Learning is a branch of artificial intelligence (AI) that allows computers to learn from data and to make predictions or decisions based on that experience, without being explicitly programmed. Machine learning uses various algorithms and statistical models to analyze data, recognize patterns, and process new data based on what has been learned.

Main components and concepts of machine learning are as follows:
- Dataset: A collection of data used to train a machine learning model. A dataset consists of input data (features) and target variables (target).
- Algorithm: A method or procedure used to build a machine learning model. Algorithms are used to analyze data, find patterns, and make predictions.
- Training: The process by which a machine learning model learns from data. Using a training dataset, the model recognizes patterns and is tuned so that it can make predictions based on them.
- Validation: After training, a separate validation dataset is used to evaluate the model's performance. This is used to detect overfitting and to tune the model.
- Test: A process used to evaluate the final performance of the model. A test dataset is used to verify how well the model works in actual environments.
- Model: A predictive model built based on data and algorithms. The model makes predictions on new data.

Main types of machine learning are as follows:
- Supervised Learning: A model is trained using input data along with corresponding correct answers (labels). Examples include house price prediction and spam email filtering.
  - Classification: A problem of classifying input data into various categories. Example: email spam filtering (spam/not spam).
  - Regression: A problem of predicting continuous values. Example: house price prediction.
- Unsupervised Learning: A model is trained using data without correct answers. It is used to find structures or patterns in the data. Examples: clustering, dimensionality reduction.
  - Clustering: A problem of grouping data points with similar characteristics. Example: customer segmentation.
  - Association Rule Learning: A problem of finding associations between items within data. Example: market basket analysis.
- Reinforcement Learning: An agent learns behaviors that maximize rewards by interacting with an environment. Examples: game playing, robotic control.

Machine learning is widely used in various industries and plays an important role in many applications such as predictive analytics, natural language processing, image recognition, autonomous vehicles, and recommendation systems.
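The training, validation, and test datasets described above are usually cut from one dataset. The sketch below shows one possible way to do that with NumPy, assuming a 60% / 20% / 20% split; the data, the split ratio, and the variable names are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the shuffle is repeatable

# Hypothetical dataset: 10 samples with 2 features each, plus one target per sample.
features = np.arange(20).reshape(10, 2)
targets = np.arange(10)

# Shuffle the row indices, then cut them into 60% / 20% / 20%.
indices = rng.permutation(len(features))
train_idx, val_idx, test_idx = indices[:6], indices[6:8], indices[8:]

X_train, y_train = features[train_idx], targets[train_idx]  # used to fit the model
X_val, y_val = features[val_idx], targets[val_idx]          # used to tune and detect overfitting
X_test, y_test = features[test_idx], targets[test_idx]      # used once for the final evaluation

print(len(X_train), len(X_val), len(X_test))
```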

<Extrapolation> (extrapolation)

When the available data range is limited and we cannot directly obtain values beyond that range, extrapolation refers to estimating values beyond the observed limit using known observations. As shown in the figure, it predicts the value Q using the values P within the given range A to B; however, the accuracy of such a prediction tends to drop.
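A minimal sketch of the idea, assuming the known points follow a straight line: fit a line to the observations inside the range, then evaluate it at a point outside the range. The data and names are made up; with real data the estimate becomes less reliable the further it lies from the observed range.

```python
import numpy as np

# Hypothetical observations P inside the known range A..B (here x = 0..4).
x_known = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y_known = np.array([1.0, 3.0, 5.0, 7.0, 9.0])  # these points lie on y = 2x + 1

# Fit a straight line to the known points.
slope, intercept = np.polyfit(x_known, y_known, deg=1)

# Extrapolate: estimate Q at x = 6, which lies outside the observed range.
x_new = 6.0
y_estimate = slope * x_new + intercept
print(y_estimate)
```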

<Mathematical Functions> (Mathematical functions)

Variance:

The variance is calculated by squaring the deviation of each observed value from the mean, summing those squared deviations, and dividing by the total count. In other words, it is the average of squared deviations. If you simply add up the deviations (the differences between each observation and the mean), the result is zero, so the deviations are squared first to remove their signs before summing. The variance is also the square of the standard deviation.

std (Standard deviation):

A single number that expresses how widely the observed values of a dataset are scattered. The standard deviation is defined as the non-negative square root of the variance and is used as a numerical indicator of the dispersion of a statistical population or dataset. The smaller the standard deviation, the closer the values are, on average, to the mean. In statistics and probability, it is mainly applied to probability distributions, random variables, measured populations, or multisets. (Source: Wikipedia)

Put more simply, taking the square root of the variance, which was inflated by squaring, brings the value back to the original scale of the data. The standard deviation also underlies the population standard deviation, the sample standard deviation, and the standard error, which are the foundations of inferential statistics and therefore of much research methodology. It plays such an important role because it communicates intuitively how large or small the spread of the values is. (Source: Namu Wiki)
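The variance and standard deviation definitions above can be checked by hand on a small made-up dataset. The sketch computes the population versions (dividing by the total count) and compares them with Python's `statistics` module.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up observations
mean = sum(values) / len(values)

deviations = [v - mean for v in values]
# The raw deviations cancel each other out, which is why they are squared first.
sum_of_deviations = sum(deviations)

# Population variance: the average of the squared deviations.
variance = sum(d ** 2 for d in deviations) / len(values)
# Standard deviation: the non-negative square root of the variance.
std = math.sqrt(variance)

print(sum_of_deviations, variance, std)
# The standard library gives the same population values.
print(statistics.pvariance(values), statistics.pstdev(values))
```

Dividing by the count minus one instead of the count gives the sample variance, which `statistics.variance` computes.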

Exponential function (exponential function):

A transcendental function that takes the exponent of a power as a variable and is defined for all real numbers as its domain. It is the inverse of the logarithmic function. (Source: Wikipedia)

For reference, exponential and logarithmic functions inside numpy include exp, expm1, exp2, log, log10, log1p, log2, power, sqrt, and so on. (Source: Becoming a Developer)
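A short sketch of a few of these NumPy functions in use. The input values are arbitrary, and the variable names are only for illustration.

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])

powers_of_e = np.exp(x)          # e raised to each element
powers_of_two = np.exp2(x)       # 2 raised to each element
roundtrip = np.log(np.exp(x))    # log undoes exp, so this recovers x

# expm1 and log1p compute exp(x) - 1 and log(1 + x) and stay accurate for tiny x.
tiny = 1e-10
small_exp = np.expm1(tiny)
small_log = np.log1p(tiny)

squares = np.power(x, 2)
roots = np.sqrt(x)

print(powers_of_e, powers_of_two, roundtrip)
print(small_exp, small_log)
print(squares, roots)
```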

Trigonometric functions (trigonometric functions):

In mathematics, trigonometric functions are functions that express the size of an angle as a trigonometric ratio. In other words, they describe the relationship between the angles and side lengths of a triangle. Acute-angle trigonometric functions match the acute angles of a right triangle to the ratio of the lengths of two of its sides. The trigonometric functions can also be defined for any arbitrary angle. (Source: Wikipedia)

For reference, trigonometric functions inside numpy include sin, cos, tan, arcsin, arccos, arctan, and so on. (Source: Becoming a Developer)

Hyperbolic functions (hyperbolic functions):

In mathematics, hyperbolic functions are functions that share properties similar to ordinary trigonometric functions. Just as trigonometric functions appear when expressing a unit-circle graph parametrically, hyperbolic functions appear when expressing a standard hyperbola parametrically. (Source: Wikipedia)

For reference, hyperbolic functions inside numpy include sinh, cosh, tanh, arcsinh, arccosh, arctanh, and so on. (Source: Becoming a Developer)
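The link between the unit circle and the standard hyperbola mentioned above shows up in two well-known identities, which the following sketch checks numerically with NumPy. The angle and the parameter `t` are arbitrary example values.

```python
import numpy as np

angle = np.pi / 6  # 30 degrees, expressed in radians

# An inverse function recovers the original angle.
recovered = np.arcsin(np.sin(angle))

# Unit circle: the point (cos, sin) satisfies x^2 + y^2 = 1.
circle = np.cos(angle) ** 2 + np.sin(angle) ** 2

# Standard hyperbola: the point (cosh, sinh) satisfies x^2 - y^2 = 1.
t = 0.5
hyperbola = np.cosh(t) ** 2 - np.sinh(t) ** 2

print(recovered, circle, hyperbola)
```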

<Overfitting> (Overfitting)

A state in which the model is over-optimized to the training set, so it performs poorly on new data. (Source - [AI Basics] 3. Over-Fitting)

Overfitting (left - Overfitted, right - Good Fit)
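One way to see overfitting in practice is to fit the same noisy data with a simple and a very flexible model and compare their errors. The sketch below is only an illustration under made-up assumptions: the data is a straight line plus random noise, and the polynomial degrees 1 and 7 were chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Made-up data: a straight line plus noise, for training and for testing.
x_train = np.linspace(0.0, 1.0, 10)
y_train = 2.0 * x_train + rng.normal(scale=0.2, size=x_train.size)
x_test = np.linspace(0.05, 0.95, 10)
y_test = 2.0 * x_test + rng.normal(scale=0.2, size=x_test.size)

def mse(coeffs, x, y):
    """Mean squared error of the polynomial with the given coefficients."""
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

simple = np.polyfit(x_train, y_train, deg=1)    # matches the true trend
flexible = np.polyfit(x_train, y_train, deg=7)  # enough freedom to chase the noise

for name, coeffs in [("degree 1", simple), ("degree 7", flexible)]:
    print(name,
          "train:", mse(coeffs, x_train, y_train),
          "test:", mse(coeffs, x_test, y_test))
```

The flexible model always fits the training points at least as closely as the simple one. An overfitted model typically pays for that with a larger error on the test points.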

<Artificial Neural Networks> (ANNs, Artificial Neural Networks)

Artificial neural networks are algorithms inspired by biological neural networks (especially the brain in the central nervous system of animals) in machine learning and cognitive science. An artificial neural network refers, in general, to models in which artificial neurons (nodes), forming a network through synaptic connections, change the strength of those synaptic connections through learning so that the network gains problem-solving capabilities. Artificial neural networks are divided into supervised learning, in which the network is optimized for a problem by being given teacher signals (correct answers), and unsupervised learning, which does not require teacher signals. (Source: Wikipedia)

<Data Structure Type> (Data structure type)

Data with a non-linear structure (Non-linear Structure) :

Data whose structure cannot be described by a straight line alone and instead follows a curve or a more complex shape. Non-linear methods such as t-SNE, LLE, and autoencoders are often used.

- Image data: Image data is a typical example of data with a non-linear structure. Different parts of an image are connected to each other in different directions, but they form a particular structure.
- Text data: Text data can also have a non-linear structure. Each paragraph of a sentence or document is connected to others in different directions, with specific meanings or contexts.

Data with a linear structure (Linear Structure):

Data whose structure can generally be described by a straight line in two-dimensional space. Linear methods such as PCA are often used.

- Time series data: Time series data is a typical example of data with a linear structure. Each segment is connected to the previous one and forms a consistent pattern.
- Stock price data: Stock price data can also have a linear structure. Stock prices increase or decrease in a relatively straightforward manner over time.

<Dimensionality Reduction> (Dimensionality Reduction)

A method of converting high-dimensional data into low-dimensional data. (Source - Wikipedia)

Reducing the dimensionality of a high-dimensional dataset composed of many features and generating a new dataset of lower dimensionality. (Source - Dimension Reduction - PCA, LDA)

<The Curse of Dimensionality> (The curse of dimensionality)

The curse of dimensionality refers to the problems that arise when the number of dimensions becomes too large in data analysis - as dimensions increase, data spreads out widely and analysis becomes more difficult.

When the dimensionality of data grows, executing algorithms becomes very tricky. The curse of dimensionality occurs in fields such as numerical analysis, sampling, combinatorics, machine learning, data mining, and databases. The common theme of these problems is that as dimensionality grows, the volume of the space grows so quickly that the available data becomes sparse. The amount of data needed to obtain reliable results often grows exponentially with the dimensionality. Also, organizing and retrieving data often relies on detecting regions where objects form groups with similar attributes. However, in high-dimensional data, all objects are sparse and dissimilar in many respects, so common data organization strategies become inefficient. (Source: Wikipedia)

Various phenomena that arise when analyzing and organizing data in high-dimensional spaces, which do not occur in low-dimensional environments such as the 3-dimensional physical space of everyday experience. (Source - Wikipedia)

It refers to the phenomenon where, as the number of dimensions increases, the number of training samples becomes smaller relative to the number of dimensions, leading to performance degradation. As the number of dimensions grows, the number of variables grows, and the amount of data per individual dimension to be learned becomes smaller. (Source - The curse of dimensionality - causes and solutions)
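The sparsity described above can be made concrete with a rough count. The sketch assumes each dimension is divided into 10 bins and that 1,000 samples are available; both numbers and the helper names are made up for illustration.

```python
def cells_needed(dims, bins_per_axis=10):
    """Number of grid cells when every dimension is split into the same number of bins."""
    return bins_per_axis ** dims

def samples_per_cell(samples, dims, bins_per_axis=10):
    """Average number of samples that land in each grid cell."""
    return samples / cells_needed(dims, bins_per_axis)

samples = 1000
for dims in (1, 2, 3, 6):
    print(dims, cells_needed(dims), samples_per_cell(samples, dims))
```

With one dimension there are 100 samples per cell on average. With six dimensions the same 1,000 samples cover at most one cell in a thousand.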

1. Relationship between the number of training samples and the number of dimensions
- Increased data sparsity: As the number of dimensions grows, data points become more spread out in space. It's like having only a few people in a huge sports field.
- Increased need for training data: As the number of dimensions grows, more data is needed. For example, you need a lot of data just to have enough samples for each dimension.

2. Causes of the curse of dimensionality
- Distortion of distance measurement: In a high-dimensional space, the distances between different pairs of points become similar to one another, making it hard to judge the similarity between data points.
- Volume increase of space: As the number of dimensions grows, the volume of space grows exponentially, and the proportion occupied by the data becomes very small.
- Increased model complexity: With more dimensions, models become more complex and can suffer from overfitting. Overfitting is when the model fits the training data too closely and ends up failing to predict new data well.

3. Ways to overcome the curse of dimensionality
- Feature Selection: A method that removes unnecessary features and selects only the important ones to reduce dimensionality. For example, choosing only height and weight to analyze a person.
- Feature Extraction: A method of combining existing high-dimensional features to create new low-dimensional features. For example, combining grades from multiple subjects into a single 'GPA score'.

4. Types and characteristics of solutions
1) PCA (Principal Component Analysis)
- Characteristics: PCA reduces data linearly. That is, it simplifies the data as if everything were a straight line.
- Pros: Computation is fast and the variance of data is preserved as much as possible while reducing dimensionality.
- Cons: Only linear relationships are considered, so non-linear structures are not well captured.

2) t-SNE (t-Distributed Stochastic Neighbor Embedding)
- Characteristics: t-SNE reduces data non-linearly. It preserves the complex shape of the data well during reduction.
- Pros: It captures the complex non-linear structures of data, so it shows clusters well in low-dimensional space.
- Cons: Computation cost is high, and it is hard to apply to large-scale data.
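As a minimal sketch of the PCA idea described above, the following reduces made-up 2-dimensional data to 1 dimension using NumPy's SVD. It is one common way to compute PCA, not the only one, and the data and variable names are assumptions for this example.

```python
import numpy as np

# Made-up 2-D data lying close to the line y = x.
data = np.array([[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8], [5.0, 5.0]])

# 1) Center each feature at zero.
centered = data - data.mean(axis=0)

# 2) SVD of the centered data; the rows of vt are the principal directions.
_, singular_values, vt = np.linalg.svd(centered, full_matrices=False)

# 3) Share of the total variance explained by each component.
explained = singular_values ** 2 / np.sum(singular_values ** 2)

# 4) Project onto the first principal component: 2-D -> 1-D.
reduced = centered @ vt[0]

print(explained)
print(reduced)
```

Because the points lie almost on a straight line, the first component keeps nearly all of the variance, which is the situation where a linear method like PCA works well.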

<Tuple> (Tuple)

A list can have its values changed during program execution. However, sometimes you need a list whose values cannot change while the program is running. That's where the tuple comes in. (Source: https://wikidocs.net/16042)

A tuple behaves like a list except that its values cannot be changed and it is written with parentheses ( ) instead of square brackets [ ]. The parentheses are not strictly required; what is required is the comma.
e.g., t1 = (300, 500) or t1 = 300, 500

A single value in parentheses is not a tuple. However, if you put a comma after the single element, it is treated as a tuple. What defines a tuple is the comma (,), not the parentheses (()).
e.g., t2 = (20,)
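The points above can be tried out directly in Python. The variable names other than `t1` and `t2` are made up for this example.

```python
numbers_list = [300, 500]
numbers_list[0] = 100          # a list can be changed in place

t1 = (300, 500)
t1_no_parens = 300, 500        # the comma makes the tuple, not the parentheses

try:
    t1[0] = 100                # a tuple cannot be changed
except TypeError as error:
    message = str(error)

not_a_tuple = (20)             # just the integer 20
t2 = (20,)                     # a one-element tuple

print(message)
print(t1 == t1_no_parens, type(not_a_tuple).__name__, type(t2).__name__)
```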

<Features> (Features)

In the terminology of machine learning and pattern recognition, a feature refers to an individual, measurable empirical attribute discovered in the object being observed. (Source: Wikipedia)

Refers to the individual independent variables used in machine learning or data analysis. (Source - AI & Machine Learning Dictionary)

<Feature Vector> (Feature Vector)

A set of features is called a feature vector. The reason for representing it as a vector is that vectors are mathematically convenient to work with. The concept of a feature was borrowed from statistical techniques such as linear regression. The concepts of independent and dependent variables also come from statistics. (Source: Wikipedia)

<Typing in Programming Languages> (Typing in the programming language)

Each programming language has its own basic data types, called Primitive Data Types. Deciding which data type a particular piece of data has is called Typing. Typing is generally divided into Static Typing and Dynamic Typing, and Python uses Dynamic Typing by default. (Source: Python Dynamic Typing, https://seongonion.tistory.com/16#정리-1)
e.g., a_ndarray = np.array(a, int)

1. Dynamic typing
- a = 15
- Type information (how the computer should store the value) is omitted when writing code.
- Writing code is fast.
- Code execution is slow.
- The contents and logic of the code are easy to grasp.
- It is suitable for those who are first learning programming.
- It is not suitable for tasks where speed is critical, but it is suitable for small and simple projects.
- Since dynamically typed languages have no choice but to verify types at runtime, tools like TypeScript and Flow have appeared to ease this inconvenience.
- Languages that use dynamic typing - Groovy, JavaScript, Lisp, Lua, Objective-C, PHP, Prolog, Python, Ruby, Smalltalk, Tcl

2. Static typing
- int a = 15
- The opposite of dynamic typing: type information is stated explicitly when writing code.
- Writing code is slow.
- Code execution is fast.
- The structure of the code is easy to grasp.
- It can be hard for those who are first learning a programming language.
- It is suitable for large, complex projects in which many people participate.
- Since errors appear at compile time, errors can be checked more easily and quickly.
- Languages that use static typing - Ada, C, C++, C#, JADE, Java, Fortran, Haskell, ML, Pascal, Scala
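A small Python sketch of what dynamic typing means in practice: the same name can hold values of different types, and a type mistake only surfaces when the offending line actually runs. The function `add_one` is a hypothetical helper written for this example.

```python
a = 15
type_before = type(a).__name__   # 'int'

a = "fifteen"                    # the same name can later hold a str
type_after = type(a).__name__    # 'str'

def add_one(value: int) -> int:
    # Type hints document intent, but Python itself does not enforce them at runtime.
    return value + 1

caught_at_runtime = False
try:
    add_one("15")                # the mistake only surfaces when this line runs
except TypeError:
    caught_at_runtime = True

print(type_before, type_after, caught_at_runtime)
```

In a statically typed language, the equivalent of `add_one("15")` would normally be rejected at compile time.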

<Hyperparameters> (Hyper Parameters)

Hyperparameters are external configuration variables used by data scientists to manage the training of machine learning models. They are sometimes called model hyperparameters and are set manually before the model is trained. Hyperparameters are different from parameters - parameters are internal values automatically derived during the learning process and are not set by the data scientist. Examples of hyperparameters include the number of nodes and layers in a neural network, and the number of branches in a decision tree. Hyperparameters determine key aspects of the model such as architecture, learning rate, and model complexity. (Source: Amazon)
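The difference between a hyperparameter and a parameter can be sketched with a tiny gradient descent loop. This is an illustration under made-up assumptions: the data follows y = 3x, the model is y = w * x, and the values of `learning_rate` and `epochs` were picked by hand.

```python
# Made-up training data that follows y = 3 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

# Hyperparameters: chosen by hand before training starts.
learning_rate = 0.01
epochs = 200

# Parameter: learned from the data during training.
w = 0.0

for _ in range(epochs):
    # Gradient of the mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(w)  # ends up very close to 3
```

Changing `learning_rate` or `epochs` changes how training proceeds, whereas `w` is whatever value training arrives at.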

<dtype> (Data type objects)

NumPy dtype object: An object that describes the data type of the elements in a NumPy array, such as their kind and their size in bytes. (Source: https://runebook.dev/ko/docs/numpy/reference/arrays.dtypes)

1. Attributes
- kind: kind of data type ('i' for integer, 'f' for float, etc.)
- itemsize: byte size of each item
- shape: shape of the items
- byteorder: byte order (big-endian or little-endian)

2. Usage
- Creating a NumPy array: np.array(data, dtype)
- Converting a data type: array.astype(dtype)
- Comparing dtype objects: dtype1 == dtype2

3. Examples
- Creating an integer array: arr1 = np.array([1, 2, 3], dtype=np.int32)
- Creating a floating-point array: arr2 = np.array([1.1, 2.2, 3.3], dtype=np.float64)
- Creating a string array: arr3 = np.array(["a", "b", "c"], dtype=np.str_)
- Converting a data type: arr1 = arr1.astype(np.float32)
- Comparing dtype objects: np.dtype(np.int32) == np.dtype("int32")

4. Notes
- The dtype object is an essential element of NumPy arrays.
- Converting a data type can result in data loss.
- Be careful when using structured data types.
- If a wrong data type is specified, a TypeError occurs.
- If a data type cannot be converted, a ValueError occurs.
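The attributes, conversions, and notes above can be tried out together in a short sketch. The arrays are made-up examples, and names such as `truncated` and `conversion_failed` are only for illustration.

```python
import numpy as np

arr1 = np.array([1, 2, 3], dtype=np.int32)
dt = arr1.dtype
print(dt.kind, dt.itemsize, dt.byteorder)  # kind code, bytes per item, byte-order code

# Converting float -> int drops the fractional part (data loss).
arr2 = np.array([1.1, 2.2, 3.9], dtype=np.float64)
truncated = arr2.astype(np.int32)

# Two ways of naming the same dtype compare equal.
same = np.dtype(np.int32) == np.dtype("int32")

# A string that does not look like a number cannot be converted to an integer.
conversion_failed = False
try:
    np.array(["a", "b"]).astype(np.int32)
except ValueError:
    conversion_failed = True

print(truncated, same, conversion_failed)
```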

<LLM> (Large Language Model)

A model that has been trained on a vast amount of text data and is capable of generating various kinds of text. For example, it can generate poetry, code, scripts, music compositions, emails, letters, and other kinds of text. It can also be used for tasks such as translating text, summarizing it, or answering questions. (Source: datamaker)

<LMM> (Large Multimodal Model)

A model that, in addition to text data, is capable of integrating and processing several other types of data such as images and audio. For example, it can be used for summarizing and generating various types of media content such as movies, music, and news, as well as for tasks such as speech recognition, image recognition, and emotion analysis - tasks that involve handling several types of data at once. (Source: datamaker)

<K-NN Nearest-Neighbor Algorithm> (K-Nearest Neighbor)

An algorithm that finds the k nearest training data points to use for prediction. If k is 1, the single closest point is used for prediction; if k is 3, then three points are used. If k is too small, the model becomes overly sensitive to nearby points, and the chance of overfitting increases.
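A minimal sketch of the algorithm in plain Python, assuming Euclidean distance and a majority vote among the k nearest points. The function `knn_predict`, the points, and the labels are made up for this example; one "B" point is deliberately placed next to the "A" group to show how a small k reacts to it.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k):
    """Predict a label by majority vote among the k nearest training points."""
    distances = [
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    ]
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D points in two classes, with one "B" point sitting near the "A" group.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (1.5, 1.5), (5.0, 5.0), (5.2, 4.8)]
labels = ["A", "A", "A", "B", "B", "B"]

query = (1.6, 1.6)
print(knn_predict(points, labels, query, k=1))  # decided by the single closest point
print(knn_predict(points, labels, query, k=3))  # decided by a vote of three points
```

With k = 1 the prediction follows the lone nearby "B" point, while with k = 3 the surrounding "A" points outvote it.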

GitHub repository link

GitHub - normalstory/2024AI_CoachingStudy


This English version was translated by Claude.

Written by

친절한 찰쓰씨

Pleasant Charles — UI/UX researcher at AIT. Keeping notes on design, planning, and slow days here since 2010.
