1. What are the differences between data analysis and data mining?
Answer:
- Data Analysis: The process of inspecting, cleaning, transforming, and modeling data to discover useful information, draw conclusions, and support decision-making.
  - Purpose: Used to generate insights, find patterns, and understand trends.
  - Example: Analyzing sales data to determine which products perform best.
- Data Mining: The process of discovering patterns and relationships in large datasets, often through machine learning, statistics, and database techniques.
  - Purpose: Used to predict outcomes and find hidden patterns.
  - Example: A retailer using association rules to understand which products are often bought together.
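A minimal Python sketch contrasting the two, assuming pandas is available; the DataFrame and column names are made up for illustration. Summarizing revenue per product is descriptive analysis, while counting product co-occurrences is a toy stand-in for association-rule mining.

```python
import pandas as pd
from itertools import combinations
from collections import Counter

# Hypothetical transaction data; column names are illustrative only.
sales = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3],
    "product":  ["bread", "butter", "bread", "jam", "butter"],
    "revenue":  [2.5, 3.0, 2.5, 4.0, 3.0],
})

# Data analysis: summarize revenue per product to see which sells best.
print(sales.groupby("product")["revenue"].sum().sort_values(ascending=False))

# Data mining (toy version): count which product pairs appear in the same order,
# the intuition behind association-rule ("market basket") mining.
pairs = Counter()
for _, items in sales.groupby("order_id")["product"]:
    for pair in combinations(sorted(items), 2):
        pairs[pair] += 1
print(pairs.most_common())
```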
2. How would you handle missing or inconsistent data in a dataset?
Answer:
- Identify missing values by analyzing the dataset and determining which values are incomplete or inconsistent.
- Decide on an approach depending on the data:
  - Remove missing data if it’s minimal and doesn’t significantly impact the analysis.
  - Impute values using methods like the mean, median, or mode for numerical data, or a placeholder value for categorical data.
  - Use algorithms that handle missing values natively, such as decision trees in machine learning.
- Consult domain experts, if possible, to determine whether there is a logical way to fill in the gaps.
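A short sketch of the first two options in pandas; the dataset, columns, and placeholder value are hypothetical.

```python
import pandas as pd
import numpy as np

# Hypothetical dataset with gaps; names are illustrative only.
df = pd.DataFrame({
    "age":    [25, np.nan, 40, 31, np.nan],
    "city":   ["NY", "LA", None, "NY", "LA"],
    "income": [50000, 62000, np.nan, 58000, 61000],
})

print(df.isna().sum())        # identify missing values per column

df_drop = df.dropna()         # option 1: drop rows with any missing value

df_imputed = df.copy()        # option 2: impute
df_imputed["age"] = df_imputed["age"].fillna(df_imputed["age"].median())
df_imputed["income"] = df_imputed["income"].fillna(df_imputed["income"].mean())
df_imputed["city"] = df_imputed["city"].fillna("Unknown")  # placeholder for categorical data
print(df_imputed)
```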
3. Explain the difference between correlation and causation.
Answer:
- Correlation is a measure that describes the strength and direction of a relationship between two variables. However, correlation alone doesn’t imply that one variable causes the other.
  - Example: Ice cream sales and drowning incidents may have a positive correlation because both increase in summer, but one does not cause the other.
- Causation implies that one variable directly affects another. Establishing causation requires controlled experiments or further statistical analysis beyond correlation.
  - Example: A study showing that taking a specific medication leads to lower blood pressure indicates causation.
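A minimal sketch of measuring correlation with SciPy; the monthly figures below are invented to mirror the ice cream example and carry no causal meaning.

```python
import numpy as np
from scipy import stats

# Hypothetical monthly figures: ice cream sales and drowning incidents.
ice_cream_sales = np.array([120, 135, 150, 210, 260, 300, 320, 310, 250, 180, 140, 125])
drownings       = np.array([  2,   2,   3,   4,   6,   8,   9,   9,   6,   4,   3,   2])

r, p = stats.pearsonr(ice_cream_sales, drownings)
print(f"Pearson r = {r:.2f}, p-value = {p:.4f}")
# A strong positive r here reflects a shared seasonal driver (summer),
# not a causal link between the two variables.
```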
4. What is a p-value, and why is it important in data analysis?
Answer:
- The p-value represents the probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis is true.
- Importance: It helps determine the significance of the results in hypothesis testing.
- Interpretation: A low p-value (typically ≤ 0.05) suggests strong evidence against the null hypothesis, indicating that the observed effect is statistically significant. Conversely, a high p-value suggests insufficient evidence to reject the null hypothesis.
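A hedged example of a p-value in practice, using a two-sample t-test on simulated data (group names and parameters are assumptions for illustration).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical samples from two groups (e.g., a control and a treatment).
control   = rng.normal(loc=100, scale=15, size=50)
treatment = rng.normal(loc=108, scale=15, size=50)

# Null hypothesis: the two group means are equal.
t_stat, p_value = stats.ttest_ind(control, treatment)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:
    print("Low p-value: evidence against the null hypothesis.")
else:
    print("High p-value: insufficient evidence to reject the null hypothesis.")
```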
5. How would you handle outliers in a dataset?
Answer:
- Identify outliers by using methods such as:
  - Z-scores to check if values are several standard deviations away from the mean.
  - IQR (Interquartile Range), where data points outside the range Q1 − 1.5×IQR to Q3 + 1.5×IQR are considered outliers.
- Decide on a treatment based on the data context:
  - Remove outliers if they result from data entry errors or anomalies.
  - Transform data using log or square root transformations if outliers skew the data.
  - Use robust statistical techniques, like median-based metrics, that are less affected by outliers.
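A small NumPy sketch of both detection methods on a made-up sample with one obvious outlier.

```python
import numpy as np

# Hypothetical measurements with one obvious outlier.
data = np.array([10, 12, 11, 13, 12, 14, 11, 95])

# Z-score method: flag points more than 2.5 standard deviations from the mean
# (cutoffs of 2-3 are common in practice).
z_scores = (data - data.mean()) / data.std()
print("Z-score outliers:", data[np.abs(z_scores) > 2.5])

# IQR method: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
mask = (data < q1 - 1.5 * iqr) | (data > q3 + 1.5 * iqr)
print("IQR outliers:", data[mask])

# Possible treatment: a log transform to reduce skew caused by large values.
print("Log-transformed:", np.log(data))
```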
6. What are the differences between SQL’s JOIN types?
Answer:
- INNER JOIN: Returns records that have matching values in both tables.
- LEFT JOIN (or LEFT OUTER JOIN): Returns all records from the left table, and the matched records from the right table; if no match, NULL values are returned for columns from the right table.
- RIGHT JOIN (or RIGHT OUTER JOIN): Similar to LEFT JOIN, but returns all records from the right table.
- FULL JOIN (or FULL OUTER JOIN): Returns all records when there is a match in either table; unmatched rows will have NULLs.
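Since the other examples here use Python, the same join semantics can be illustrated with pandas, whose `how` argument mirrors SQL’s join types; the two tables below are made up.

```python
import pandas as pd

# Hypothetical tables: customers (left) and orders (right).
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Ann", "Bob", "Cara"]})
orders    = pd.DataFrame({"customer_id": [2, 3, 4], "amount": [50, 75, 20]})

print(pd.merge(customers, orders, on="customer_id", how="inner"))  # INNER JOIN
print(pd.merge(customers, orders, on="customer_id", how="left"))   # LEFT (OUTER) JOIN
print(pd.merge(customers, orders, on="customer_id", how="right"))  # RIGHT (OUTER) JOIN
print(pd.merge(customers, orders, on="customer_id", how="outer"))  # FULL (OUTER) JOIN
```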
7. Explain the difference between variance and standard deviation.
Answer:
- Variance: Measures the average squared difference between each data point and the mean. It provides a general sense of how spread out the data is.
  - Formula: σ² = Σ(xᵢ − μ)² / N
- Standard Deviation: The square root of the variance, bringing the unit back to the original data's scale. It shows how much data typically deviates from the mean.
  - Formula: σ = √(σ²)
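A quick numeric check of both formulas on a small made-up sample, using the population definitions above.

```python
import numpy as np

data = np.array([4, 8, 6, 5, 3, 7])   # hypothetical sample

mean = data.mean()
variance = ((data - mean) ** 2).mean()   # population variance: mean of squared deviations
std_dev = np.sqrt(variance)              # standard deviation: square root of the variance

print(variance, std_dev)
print(np.var(data), np.std(data))        # same results using NumPy's built-ins
```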
8. What is A/B testing, and how is it used in data analysis?
Answer:
- A/B Testing: A method to compare two versions (A and B) of a variable to determine which one performs better. Typically used in marketing and product development to optimize elements such as webpage design or call-to-action buttons.
- Process:
  - Define Hypotheses: Establish a clear hypothesis about what you expect to see.
  - Randomize Groups: Divide the audience into two random groups.
  - Measure Results: Collect data on key metrics and use statistical analysis to determine if the results are significant.
  - Interpret Findings: If one variant significantly outperforms the other, it suggests that variant is preferable.
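A minimal sketch of the analysis step, assuming statsmodels is available; the conversion counts and visitor numbers are invented for illustration.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: conversions out of visitors for variants A and B.
conversions = np.array([120, 150])     # A, B
visitors    = np.array([2400, 2380])

# Two-proportion z-test: null hypothesis is that both variants convert equally.
z_stat, p_value = proportions_ztest(conversions, visitors)
print(f"A: {conversions[0]/visitors[0]:.3%}  B: {conversions[1]/visitors[1]:.3%}")
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")

if p_value <= 0.05:
    print("The difference is statistically significant; the better-performing variant is preferable.")
else:
    print("No significant difference detected between A and B.")
```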
9. What is the difference between structured and unstructured data?
Answer:
- Structured Data: Data that is organized in a fixed format, such as tables with rows and columns. It is easy to store and analyze using traditional database tools (e.g., SQL).
  - Examples: Customer names, addresses, and phone numbers in a spreadsheet.
- Unstructured Data: Data that doesn’t have a predefined format or organization. It requires more complex methods to store, process, and analyze.
  - Examples: Images, videos, emails, social media posts.
10. What is the purpose of normalization in databases?
Answer:
- Normalization is the process of structuring a relational database to reduce data redundancy and improve data integrity.
- Process: Involves breaking down tables into smaller, related tables and establishing relationships between them.
- Benefits: Helps eliminate duplication, ensures data consistency, and improves database efficiency.
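A toy sketch of the idea in pandas (not a full normal-form treatment): repeated customer attributes are split out of a hypothetical orders table and referenced by a key.

```python
import pandas as pd

# Hypothetical denormalized table: customer details repeated on every order row.
orders_flat = pd.DataFrame({
    "order_id":      [101, 102, 103],
    "customer_name": ["Ann", "Ann", "Bob"],
    "customer_city": ["NY", "NY", "LA"],
    "amount":        [50, 20, 75],
})

# Normalization idea: move duplicated customer attributes into their own table
# and reference them by customer_id, removing redundancy.
customers = (orders_flat[["customer_name", "customer_city"]]
             .drop_duplicates()
             .reset_index(drop=True))
customers["customer_id"] = customers.index + 1

orders = orders_flat.merge(customers, on=["customer_name", "customer_city"])
orders = orders[["order_id", "customer_id", "amount"]]

print(customers)
print(orders)
```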
11. How would you approach analyzing a dataset you’ve never seen before?
Answer:
- Understand the Data Context: Identify the source, purpose, and key questions the data is intended to answer.
- Inspect the Data: Look at column names, data types, and sample values.
- Check for Missing/Outlier Values: Address any incomplete or abnormal data points.
- Summarize Statistics: Use descriptive statistics to understand distributions, averages, and trends.
- Visualize Key Variables: Plot histograms, scatter plots, or box plots to explore relationships and spot patterns.
- Document Findings: Note down insights and areas requiring deeper analysis.
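A first-pass checklist in pandas; the file path is a placeholder, and plotting assumes matplotlib is installed.

```python
import pandas as pd

# Hypothetical file path; replace with the dataset at hand.
df = pd.read_csv("new_dataset.csv")

print(df.shape)             # size of the dataset
print(df.dtypes)            # column names and data types
print(df.head())            # sample values
print(df.isna().sum())      # missing values per column
print(df.describe())        # summary statistics for numeric columns

# Quick visual exploration of numeric distributions (requires matplotlib).
df.hist(figsize=(10, 6))
```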
12. Explain the concept of time series analysis and its applications.
Answer:
- Time Series Analysis: A technique for analyzing data that is collected at different points in time, allowing us to detect patterns such as trends and seasonality.
- Components: Consists of trend, seasonal, and cyclical variations, as well as residual error.
- Applications: Forecasting stock prices, predicting sales in retail, analyzing website traffic patterns, and weather prediction.
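A sketch of separating those components with statsmodels, on a synthetic monthly series built from an assumed trend, seasonality, and noise.

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Hypothetical monthly series: upward trend plus yearly seasonality plus noise.
rng = np.random.default_rng(0)
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
values = (np.arange(48) * 2                              # trend
          + 10 * np.sin(2 * np.pi * np.arange(48) / 12)  # seasonality
          + rng.normal(0, 2, 48))                        # residual noise
series = pd.Series(values, index=idx)

# Decompose into trend, seasonal, and residual components.
result = seasonal_decompose(series, model="additive", period=12)
print(result.trend.dropna().head())
print(result.seasonal.head(12))
```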
13. What is the Central Limit Theorem?
Answer:
- Central Limit Theorem (CLT): States that the sampling distribution of the sample mean will approximate a normal distribution as the sample size grows, regardless of the population distribution’s shape.
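A small simulation illustrating the theorem: even when the population is heavily skewed, the means of repeated samples cluster around the population mean with spread close to σ/√n. The distribution and sample sizes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Population with a clearly non-normal (exponential, right-skewed) distribution.
population = rng.exponential(scale=2.0, size=100_000)

# Draw many samples of size n and record each sample's mean.
n = 50
sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])

# Per the CLT, the sample means are approximately normal around the population mean.
print("Population mean:      ", population.mean())
print("Mean of sample means: ", sample_means.mean())
print("Std of sample means:  ", sample_means.std(),
      "vs sigma/sqrt(n):", population.std() / np.sqrt(n))
```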
14. What is the difference between supervised and unsupervised learning?