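Graduate School Interview Preparation

This post is a collection of the notes I wrote while preparing for interviews at graduate schools of artificial intelligence (AI).

I organized the content so that it would be easy for me to memorize and understand, so some parts may differ from the precise definitions.

For that reason, if you are preparing for a graduate school interview, please do not rely on this post alone.

Also, many of the questions in this post were taken from the blog post at https://jrc-park.tistory.com/259.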

Probability and Statistics

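Q. What is the Central Limit Theorem?

Regardless of the population's distribution (as long as its variance is finite), if we repeatedly draw samples from the population and take the mean of each sample, the distribution of those sample means approaches a normal distribution. A sample size of about 30 to 50 is commonly said to be sufficient.

A. The distribution of sample means approximates a normal distribution as the sample size gets larger, regardless of the population's distribution.

A. A sufficiently large sample can estimate the characteristics of a population more accurately.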
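To make the statement concrete, here is a small simulation sketch. The choice of an exponential population, the sample size of 50, and the names `sample_means`, `center`, and `spread` are my own assumptions for illustration, not part of the original notes. It draws many samples from a clearly non-normal population and looks at how the sample means are distributed.

```python
import random
import statistics

def sample_means(n, trials, rng):
    """Draw `trials` samples of size n from Exp(1) and return the mean of each sample."""
    return [statistics.fmean(rng.expovariate(1.0) for _ in range(n))
            for _ in range(trials)]

rng = random.Random(0)  # fixed seed so the run is repeatable
means = sample_means(n=50, trials=5000, rng=rng)

# Exp(1) has mean 1 and standard deviation 1, so the CLT suggests the sample
# means are roughly normal with mean 1 and standard deviation 1 / sqrt(50).
center = statistics.fmean(means)
spread = statistics.stdev(means)

# For a normal distribution about 68% of the values lie within one standard deviation.
within_one_sd = sum(abs(m - center) <= spread for m in means) / len(means)
print(center, spread, within_one_sd)
```

Even though the exponential distribution is strongly skewed, the fraction within one standard deviation should come out close to the normal value of about 0.68.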

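Q. Where can the Central Limit Theorem be used?

A. Even with a relatively small number of samples, we can estimate the mean and variance of the population.

The sample mean can be treated as approximately normally distributed, whatever the distribution of the raw data (the raw data themselves do not become normal). Therefore techniques that assume a normal distribution, such as confidence intervals and z-tests, can be applied to the sample mean.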

Q. What is the law of large numbers?

A. If the sample size is sufficiently large, the sample mean becomes sufficiently close to the population mean.
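A quick way to see this is to average die rolls. This is only a sketch under my own assumptions (a fair six-sided die and the hypothetical helper `mean_of_rolls`); the population mean of a fair die is 3.5.

```python
import random

rng = random.Random(1)  # fixed seed so the run is repeatable

def mean_of_rolls(n):
    """Average of n rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

small = mean_of_rolls(10)        # can still be far from 3.5
large = mean_of_rolls(100_000)   # should be very close to 3.5
print(small, large)
```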

Q. What is the marginal distribution?

A. In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset.

Q. What is the conditional distribution?

A. Given two jointly distributed random variables X and Y, the conditional probability distribution of Y given X is the probability distribution of Y when X is known to be a particular value.
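The two definitions above can be checked on a tiny joint table. The weather example and the function names below are made up for illustration only.

```python
# Hypothetical joint distribution P(X, Y) over weather X and temperature Y.
joint = {
    ("sunny", "hot"): 0.4,
    ("sunny", "cold"): 0.1,
    ("rainy", "hot"): 0.1,
    ("rainy", "cold"): 0.4,
}

def marginal_x(joint):
    """P(X = x): sum the joint probabilities over every value of Y."""
    result = {}
    for (x, _y), p in joint.items():
        result[x] = result.get(x, 0.0) + p
    return result

def conditional_y_given_x(joint, x):
    """P(Y = y | X = x): joint probability divided by the marginal P(X = x)."""
    px = marginal_x(joint)[x]
    return {y: p / px for (xi, y), p in joint.items() if xi == x}

p_x = marginal_x(joint)                               # sunny and rainy are each 0.5
p_y_given_sunny = conditional_y_given_x(joint, "sunny")  # hot 0.8, cold 0.2
print(p_x, p_y_given_sunny)
```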

Unbiased Estimator
Biased Estimator

Irreducible Error
Reducible Error
Bias
Variance

Linear Algebra

Q. What is linear independence?
A. Consider the linear combination $c_1 a_1 + c_2 a_2 + \dots + c_n a_n = 0$. If the only solution of this equation is the trivial solution $c_1 = c_2 = \dots = c_n = 0$, then the vectors $a_1, \dots, a_n$ are linearly independent.

Q. What is a basis?
A. A set B of vectors in a vector space V is a basis if its elements are linearly independent and every element of V is a linear combination of elements of B.
1. B spans V
2. B is linearly independent

Q. What is a dimension?
A. The maximum number of linearly independent vectors in the vector space V is called the dimension of V and is denoted by dim V; it equals the number of vectors in any basis of V. Here, the dimension of the vector space is assumed to be finite.

Q. What is the Column Space?
A. The column space of a matrix A is the set of all linear combinations (the span) of the columns of A.

Q. What is the Null Space?
A. The set of all solutions x of Ax = 0, which forms a subspace.

Q. What is Symmetric Matrix?
A. a symmetric matrix is a square matrix that is equal to its transpose.

Q. What is a positive-definite Matrix?
A. $z^T M z > 0$. A symmetric real matrix M is called positive-definite when this holds for every real vector $z \neq 0$.
The quadratic form $z^T M z$ of such a matrix is convex (bowl-shaped).

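Q. What are an Eigenvector and an Eigenvalue?
A. An eigenvector is a non-zero vector whose direction is not changed by a linear transformation: it stays on the same line through the origin (it is only scaled, and flipped if the eigenvalue is negative).
A. The eigenvalue is the factor by which the eigenvector is scaled.

The eigenvalue belonging to a given eigenvector is unique, but the eigenvectors belonging to an eigenvalue are not: any non-zero scalar multiple of an eigenvector is also an eigenvector.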
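As a sanity check of these statements, here is a minimal sketch on a 2x2 matrix that I picked by hand (the matrix and the helper names `matvec` and `is_eigenpair` are illustrative assumptions). It verifies $Mv = \lambda v$ and shows that a rescaled eigenvector keeps the same eigenvalue.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def is_eigenpair(m, v, lam, tol=1e-12):
    """True if m v equals lam v (within tol) and v is not the zero vector."""
    if all(abs(x) <= tol for x in v):
        return False
    return all(abs(mv - lam * x) <= tol for mv, x in zip(matvec(m, v), v))

M = [[2.0, 1.0],
     [1.0, 2.0]]

v = [1.0, 1.0]                  # M v = [3, 3] = 3 * v
w = [1.0, -1.0]                 # M w = [1, -1] = 1 * w
scaled_v = [5.0 * x for x in v]  # a different vector on the same line as v

print(is_eigenpair(M, v, 3.0))         # True
print(is_eigenpair(M, scaled_v, 3.0))  # True: same eigenvalue, different eigenvector
print(is_eigenpair(M, w, 1.0))         # True
print(is_eigenpair(M, [1.0, 0.0], 2.0))  # False: [1, 0] is not an eigenvector
```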

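Q. What is the Jacobian Matrix?
A. The Jacobian Matrix is the matrix of all first-order partial derivatives of a vector-valued function of several variables. Entry (i, j) is the partial derivative of the i-th output with respect to the j-th input.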
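To see the shape of a Jacobian, the sketch below approximates it with central finite differences. The example function `f` and the helper `numerical_jacobian` are my own assumptions for illustration; for this `f` the exact Jacobian is [[2xy, x^2], [5, cos y]].

```python
import math

def f(v):
    """Example function from R^2 to R^2 (chosen only for illustration)."""
    x, y = v
    return [x ** 2 * y, 5 * x + math.sin(y)]

def numerical_jacobian(func, v, eps=1e-6):
    """Approximate the Jacobian of func at v with central differences."""
    m, n = len(func(v)), len(v)
    jac = [[0.0] * n for _ in range(m)]
    for j in range(n):
        plus, minus = list(v), list(v)
        plus[j] += eps
        minus[j] -= eps
        f_plus, f_minus = func(plus), func(minus)
        for i in range(m):
            # d f_i / d v_j
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2 * eps)
    return jac

J = numerical_jacobian(f, [1.0, 2.0])
exact = [[4.0, 1.0], [5.0, math.cos(2.0)]]  # [[2xy, x^2], [5, cos y]] at (1, 2)
print(J)
```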

The Basics of Machine Learning

Basic Machine Learning

Q. What is the ensemble?
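A. The main goal is to obtain a better result by using the collective intelligence of several models, for example by averaging the outputs of several individual models or by taking a majority vote among them.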

Q. Give examples of ensemble methods.
A. Bagging: can reduce the variance → can be used to counter overfitting
Boosting: can reduce the bias → can be used to learn better when the model is underfitting
Voting:

Q. What is Bagging?
A. A method that creates multiple datasets by sampling from the original dataset with replacement (bootstrap sampling).
Multiple models are trained in parallel on these bootstrap datasets, and their outputs are combined to make the prediction.
For classification, the class is decided by voting.
For regression, the value is the average of the outputs.
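The procedure above can be sketched in a few lines. Everything specific here is an assumption made for illustration: the base model is a one-threshold decision stump, the toy dataset is one-dimensional and noise-free, and the names `fit_stump`, `bagging_fit`, and `bagging_predict` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(points):
    """Pick the threshold t that minimises training errors for 'predict 1 if x > t'."""
    best_t, best_err = None, None
    for t, _ in points:
        err = sum((x > t) != bool(y) for x, y in points)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_fit(points, n_models, rng):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    models = []
    for _ in range(n_models):
        bootstrap = [rng.choice(points) for _ in points]
        models.append(fit_stump(bootstrap))
    return models

def bagging_predict(models, x):
    """Classification: majority vote over the individual stumps."""
    votes = Counter(int(x > t) for t in models)
    return votes.most_common(1)[0][0]

# Toy data: x = 0.00, 0.05, ..., 0.95 and the label is 1 for x >= 0.55.
points = [(i / 20, int(i >= 11)) for i in range(20)]
models = bagging_fit(points, n_models=25, rng=random.Random(0))
print(bagging_predict(models, 0.9), bagging_predict(models, 0.1))
```

Each stump could be trained independently, which is why bagging parallelises well. For regression the vote would be replaced by the average of the model outputs.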

Q. What is Boosting?
A. Models are trained in a sequential manner.
Train one model. Give larger weights to the examples it got wrong, and then train another model. This process is repeated. Finally, all N models make a prediction for a given input, and the final prediction is derived by combining their predictions.

Q. What is the difference between bagging and voting? Both use a kind of voting.
A. Voting uses many different kinds of models, whereas bagging usually uses the same kind of model, each trained on a different bootstrap sample.

Q. What is Boosting?
A. A method that assigns weights to the data that the previous round predicted incorrectly, and so compensates for the error step by step.
Where gradient descent updates the parameter W, boosting can be seen as updating the function F.
gradient descent → gradient descent in parameter space
Boosting → gradient descent in function space
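The "gradient descent in function space" view can be sketched with squared loss, where the negative gradient with respect to the current prediction F(x) is simply the residual y - F(x). The regression stump, the learning rate, the toy data, and all function names below are my own assumptions for illustration, not a reference implementation.

```python
def fit_regression_stump(xs, residuals):
    """Fit a one-split piecewise-constant function to the residuals (least squares)."""
    best = None
    for t in xs:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        if not right:
            continue  # splitting at the largest x leaves one side empty
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm

def gradient_boost(xs, ys, n_rounds=50, lr=0.3):
    """F <- F + lr * h, where each h is fitted to the current residuals."""
    stumps = []
    preds = [0.0] * len(xs)  # start from F(x) = 0
    for _ in range(n_rounds):
        # Negative gradient of 0.5 * (y - F(x))^2 with respect to F(x).
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_regression_stump(xs, residuals)
        stumps.append(h)
        preds = [p + lr * h(x) for p, x in zip(preds, xs)]
    return lambda x: sum(lr * h(x) for h in stumps)

def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [i / 10 for i in range(20)]
ys = [x ** 2 for x in xs]
model = gradient_boost(xs, ys)
initial_mse = mse(lambda x: 0.0, xs, ys)  # error of the starting function F = 0
final_mse = mse(model, xs, ys)
print(initial_mse, final_mse)
```

Each round adds a new function to F instead of changing a weight vector, which is the sense in which the update happens in function space.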

frequentist: examines the patient directly to find the source of pain.
bayesian: finds the source of pain by combining the examination with the symptoms of previous patients who had similar symptoms.

Q. Frequentist
A. The frequentist view is that the model has a fixed true value and the data are generated randomly. Therefore, we need to find the single model that best describes the data.

Q. Bayesian
A. There is no true model; the observed data are what is taken as true. Therefore, the goal is not to find the one and only model that best explains the data, but to find, among several candidate models, the one that best explains the current data.

Q. The Curse of Dimensionality
A. As the dimension increases, the volume of the feature space grows exponentially. Therefore, the amount of data needed to fill the enlarged feature space densely also grows exponentially.

Q. Cross-Validation
A. If the train-set and test-set are split statically, we end up choosing parameters that fit that fixed test-set. This can be seen as overfitting to the test-set rather than to the train-set. Cross-validation tries to solve this problem.

For K-Fold Cross-Validation, divide the entire dataset into K pieces, use K-1 of them for training, and use the remaining one as the test set. Repeat this process K times so that the test sets do not overlap.
As a result, K evaluation scores are produced, and their average is used to compare models or hyperparameter settings; the setting with the best average score is selected as the final hyperparameters.
This helps prevent overfitting to one specific split of the data. The disadvantage is that training and evaluation take about K times as long, because what would be done once is done K times.
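The splitting step can be sketched as follows. The helper names `k_fold_indices` and `cross_val_mse` are hypothetical, and the "model" is deliberately trivial (it predicts the mean of the training targets) so that only the fold logic is on display.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k non-overlapping test folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

def cross_val_mse(ys, k):
    """Average test MSE over k folds for a model that predicts the training mean."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        prediction = sum(ys[i] for i in train) / len(train)
        scores.append(sum((ys[i] - prediction) ** 2 for i in test) / len(test))
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 3))  # test folds of size 4, 3, 3
score = cross_val_mse([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], k=3)
print(folds[0], score)
```

In practice the data are usually shuffled before splitting; that step is left out here to keep the example deterministic.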

Generative Model VS Discriminative Model
Q. What is the Generative Model?
A. It is a model that learns from the given training data and generates similar data that follow the distribution of the training data.
P(X|Y) (together with P(Y), i.e. the joint distribution P(X, Y))

Q. What is the Discriminative Model?
A. It is a model that finds the probability of label Y when given data X.
P(Y|X)
⇒ Linear Regression and Logistic Regression are representative examples.

Prior Probability: p(x),
The a priori probability that an observer assigns to a system or model before making an observation.

Posterior Probability: p(x|z),
Given that a particular event z has been observed, the probability that the event was generated by a particular model x.

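Q. MLE (Maximum Likelihood Estimation) vs MAP (Maximum a Posteriori Estimation)
A. MAP additionally uses the prior probabilities p(a) and p(b) when computing the probability.
That is, MLE compares p(z|a) and p(z|b): the probability that z is produced by a or by b.
MAP compares p(a|z) and p(b|z): the probability that the observed z came from a or from b, where p(a|z) ∝ p(z|a) p(a).

A. MLE finds the parameter that maximizes the probability of the observed event A.
MAP also takes the prior probability of the parameter into account and maximizes the combined probability.

Seen this way, if the prior follows a uniform distribution, MLE = MAP.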
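A coin-flip example makes the last remark concrete. This is a sketch under my own assumptions: the data are Bernoulli, the prior on the heads probability is Beta(a, b) with a, b >= 1, and the function names are hypothetical. With these assumptions the MLE is heads / n and the MAP estimate is the mode of the Beta posterior.

```python
def mle_bernoulli(heads, n):
    """Maximum likelihood estimate of the heads probability."""
    return heads / n

def map_bernoulli(heads, n, a, b):
    """Mode of the Beta(heads + a, n - heads + b) posterior (assumes a, b >= 1 and n > 0)."""
    return (heads + a - 1) / (n + a + b - 2)

heads, n = 7, 10
theta_mle = mle_bernoulli(heads, n)                 # 0.7
theta_map_uniform = map_bernoulli(heads, n, 1, 1)   # Beta(1, 1) is uniform: 0.7, same as MLE
theta_map_informed = map_bernoulli(heads, n, 5, 5)  # prior centred on 0.5 pulls it to 11/18
print(theta_mle, theta_map_uniform, theta_map_informed)
```

The informative prior pulls the estimate from 0.7 toward 0.5, while the uniform prior leaves it unchanged.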

Q. ROC(Receiver Operating Characteristic)
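A. TPR (True Positive Rate) = Sensitivity = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$
FPR (False Positive Rate) = 1 - Specificity = $1 - \frac{\text{True Negative}}{\text{True Negative} + \text{False Positive}} = \frac{\text{False Positive}}{\text{False Positive} + \text{True Negative}}$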

The ROC curve shows the performance of a binary classifier by plotting TPR against FPR as the decision threshold varies.
The better the two classes can be distinguished, the closer the ROC curve lies to the top left.
A classifier that guesses at random already reaches an AUROC of 0.5, so a model at 0.5 or below should be discarded; as a rule of thumb, a model needs about 0.7 or more to be usable.
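One way to see why random guessing gives 0.5 is the ranking interpretation of AUROC: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one (ties count as half). The sketch below computes it directly from that definition; the function name `auroc` and the example scores are my own illustrative choices.

```python
def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
print(auroc([0.1, 0.2, 0.8, 0.9], labels))    # 1.0: every positive outranks every negative
print(auroc([0.5, 0.5, 0.5, 0.5], labels))    # 0.5: the scores carry no information
print(auroc([0.1, 0.4, 0.35, 0.8], labels))   # 0.75: one of the four pairs is mis-ranked
```

This pairwise loop is quadratic in the number of examples, so it is only meant for small illustrations.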

Q. What is Precision?
A. Out of the examples predicted as positive, the fraction that are actually positive.
precision = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Positive}}$

Q. What is Recall?
A. Out of the examples that are actually positive, the fraction that the model predicted as positive.
recall = $\frac{\text{True Positive}}{\text{True Positive} + \text{False Negative}}$

Q. Type 1 Error and Type 2 Error
A. Type 1 Error: judging False as True (False Positive)
Type 2 Error: judging True as False (False Negative)
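The quantities from the last few questions all come from the same four counts, which the following sketch computes on a small made-up set of labels (the function names and the example vectors are assumptions for illustration). Type 1 errors are the false positives and Type 2 errors are the false negatives.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels encoded as 0 and 1."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))  # Type 1 errors
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))  # Type 2 errors
    return tp, fp, tn, fn

def classification_metrics(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),   # same as TPR / sensitivity
        "fpr": fp / (fp + tn),      # same as 1 - specificity
    }

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)           # (3, 1, 3, 1)
metrics = classification_metrics(y_true, y_pred)    # precision 0.75, recall 0.75, fpr 0.25
print(counts, metrics)
```

The sketch does not guard against empty denominators, for example when the model never predicts the positive class.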

Q. Analytical algorithms vs iterative algorithms
A. Analytical algorithms: using the dataset X, θ can be found in a single step (closed form).
ex) MLE: find the optimum by setting the derivative to zero

Iterative algorithms: update θ repeatedly until it converges.
ex) MoG (Mixture of Gaussians), EM algorithm

Q. EM algorithm
Step 1. (E-step)
Starting from arbitrary initial parameter values, compute a quantity that approximates the likelihood as closely as possible under the current parameters (the expected log-likelihood with respect to the latent variables).

Step 2. (M-step)
Obtain new parameter values that maximize the quantity computed in the E-step. Steps 1 and 2 are repeated until convergence.

For MLE with a convex or concave objective, the solution is clear: use the derivative and pick the point where it is zero.
However, when several Gaussian models are mixed, as in a MoG, the objective is not convex, so the optimum cannot be found in one step and has to be approached with an iterative, heuristic technique such as EM, which is only guaranteed to reach a local optimum.
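The two steps can be sketched for a one-dimensional mixture of two Gaussians. The synthetic data, the initial values, the fixed number of iterations, and the names `normal_pdf` and `em_gmm_1d` are all assumptions made for illustration; a real implementation would also check convergence and guard against collapsing variances.

```python
import math
import random

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, mu, var, weight, n_iter=50):
    """EM for a 1-D Gaussian mixture; mu, var, weight are the initial guesses."""
    mu, var, weight = list(mu), list(var), list(weight)
    k = len(mu)
    for _ in range(n_iter):
        # E-step: responsibility of each component for each data point,
        # computed with the current parameters.
        resp = []
        for x in data:
            p = [weight[j] * normal_pdf(x, mu[j], var[j]) for j in range(k)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: parameters that maximise the expected log-likelihood
        # given those responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var[j] = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj
            weight[j] = nj / len(data)
    return mu, var, weight

rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(300)]
mu, var, weight = em_gmm_1d(data, mu=[-1.0, 1.0], var=[1.0, 1.0], weight=[0.5, 0.5])
print(mu, var, weight)
```

With well-separated clusters like these, the estimated means should end up close to -3 and 3; with a poor initialisation EM can settle in a worse local optimum.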

Q. PCA
A. PCA is a linear projection.
PCA finds an orthogonal basis that maximizes the variance of the data in the linearly projected space.
First, compute the covariance matrix of the (mean-centered) features of the dataset.
Then perform an eigendecomposition of the covariance matrix.
Finally, sort the eigenvectors by eigenvalue in decreasing order and project the dataset onto the leading eigenvectors.
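These steps can be followed by hand in two dimensions, where the eigendecomposition of the symmetric 2x2 covariance matrix has a closed form. The sketch below is limited to 2-D on purpose, and the function name and the toy points are my own assumptions; for real data a linear algebra library would be the usual choice.

```python
import math

def pca_first_component_2d(points):
    """Return (largest eigenvalue, unit eigenvector, 1-D projection) for 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    # Sample covariance matrix [[a, b], [b, c]].
    a = sum(x * x for x, _ in centered) / (n - 1)
    b = sum(x * y for x, y in centered) / (n - 1)
    c = sum(y * y for _, y in centered) / (n - 1)
    # Largest eigenvalue of a symmetric 2x2 matrix and a matching eigenvector.
    lam = (a + c) / 2 + math.sqrt(((a - c) / 2) ** 2 + b * b)
    if b != 0:
        v = (b, lam - a)
    else:
        v = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(v[0], v[1])
    direction = (v[0] / norm, v[1] / norm)
    projected = [x * direction[0] + y * direction[1] for x, y in centered]
    return lam, direction, projected

# Points scattered tightly around the line y = x.
ts = [-2.0, -1.0, 0.0, 1.0, 2.0]
es = [0.1, -0.1, 0.1, -0.1, 0.1]
points = [(t + e, t - e) for t, e in zip(ts, es)]
lam, direction, projected = pca_first_component_2d(points)
print(lam, direction)
```

For these points the first principal direction is the diagonal (1, 1) / sqrt(2), and the variance of the projected data equals the largest eigenvalue.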

Q. How to avoid overfitting?
1. make the model architecture simpler
  ⬝ use a smaller model
2. focus on underlying abstractions / explanatory factors
  ⬝ Feature Selection, PCA, etc.
3. restriction on the parameter space
  ⬝ shrinkage method
  ⬝ early stopping
4. spread out the probability mass from the training samples
  ⬝ adding noise
  ⬝ data augmentation

Q. What is an activation Function?
A. When a value from the previous node arrives at a node, it is not passed on to the next node unchanged; we first apply a certain function to it. That function is the activation function. We usually use non-linear functions as activation functions, because a stack of layers with linear activations is equivalent to a single linear layer, no matter how many layers are stacked.
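The claim about linear layers can be checked numerically. In this sketch the weight matrices and the input are arbitrary small integers I picked so the arithmetic is easy to follow, and biases are left out for brevity.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def relu(v):
    return [max(0, x) for x in v]

W1 = [[1, -1], [0, 1]]   # first layer weights
W2 = [[1, 1], [1, -1]]   # second layer weights
x = [1, 2]

two_linear_layers = matvec(W2, matvec(W1, x))   # [1, -3]
one_merged_layer = matvec(matmul(W2, W1), x)    # [1, -3]: identical to the line above
with_relu = matvec(W2, relu(matvec(W1, x)))     # [2, -2]: no longer a single linear map of x
print(two_linear_layers, one_merged_layer, with_relu)
```

Without the non-linearity the second layer adds no expressive power, because W2 (W1 x) = (W2 W1) x.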

Q. What is an Objective Function?
A. The function that we optimize (minimize or maximize) when training a model. We call that function the Objective Function.

Q. What is Loss Function?
A. A measure of the difference between the predicted value and the true value for a single example.

Q. What is Cost Function?
A. The losses aggregated over the dataset, usually the average of the per-example losses.

Loss Function Cost Function Objective Function
