Leverage AI for faster, more insightful data analysis without replacing your data engineering skills.
Data Analysis Workflow
Raw Data → Understanding → Cleaning → Analysis → Visualization → Insights
↑ ↑ ↑ ↑ ↑ ↑
AI AI AI AI AI AIPhase 1: Data Understanding
Quick Data Profile
Here's a sample of my dataset (first 50 rows as CSV):
[paste data]
Provide:
1. Column analysis (types, distributions, nulls)
2. Potential data quality issues
3. Relationships between columns
4. Suggested cleaning steps
5. Analysis questions this data could answerSchema Design
I have these raw log files:
[sample log entries]
Design a structured schema for storing this data in PostgreSQL.
Include:
- Table definitions with appropriate types
- Indexes for common queries
- Partitioning strategy if needed
- Sample insert/query statementsPhase 2: Data Cleaning
Generate Cleaning Scripts
Write a Python data cleaning script for this dataset:
Issues to fix:
- Column "date" has mixed formats (MM/DD/YYYY and YYYY-MM-DD)
- "price" column has currency symbols and commas
- "category" has inconsistent capitalization
- Several rows have missing "email" values
- Duplicate entries based on (user_id, timestamp)
Use pandas. Include validation checks.SQL Data Cleaning
Write SQL queries to clean this orders table:
Problems:
- Some order_total values are negative
- customer_email has entries like "N/A", "none", empty strings
- shipped_date is sometimes before order_date
- duplicate orders within 1 minute from same customer
For each issue:
1. Query to identify affected rows
2. Query to fix or flag them
3. Validation query to confirm the fixPhase 3: Analysis
Exploratory Analysis
Given this sales data schema:
orders(id, customer_id, product_id, quantity, total, created_at)
customers(id, segment, region, acquisition_date)
products(id, category, price, cost)
Write SQL queries for:
1. Monthly revenue trend (last 12 months)
2. Top 10 customers by lifetime value
3. Product category performance (revenue, margin, growth)
4. Customer cohort retention analysis
5. Seasonal patterns in purchasing
Include explanations of what each query reveals.Statistical Analysis
Perform statistical analysis on this A/B test data:
Control group: n=5000, conversions=150
Treatment group: n=5000, conversions=185
Calculate:
1. Conversion rates for each group
2. Statistical significance (p-value)
3. Confidence interval for the difference
4. Required sample size for 80% power
5. Recommendation on whether to implement
Show the calculations and Python code.Phase 4: Visualization
Chart Recommendations
I have this data about website traffic:
- Daily unique visitors (365 days)
- Traffic source breakdown (organic, paid, social, direct)
- Page views by section
- Bounce rate by landing page
- Conversion funnel stages
What visualizations should I create for:
1. Executive dashboard
2. Marketing team report
3. Technical performance review
Include chart types, key metrics to highlight, and sample Python/JS code.Generate Visualization Code
Create a Python visualization for this data:
Data: Monthly sales by region (4 regions, 24 months)
Requirements:
- Line chart with all regions
- Highlight the region with highest growth
- Add trend lines
- Professional styling (not default matplotlib)
- Export as PNG and SVG
Use matplotlib or plotly.Phase 5: Reporting
Automated Reports
Generate a weekly metrics report template:
Metrics to include:
- Revenue: total, vs last week, vs same week last year
- Orders: count, average value, by channel
- Customers: new, returning, churn indicators
- Top products: by revenue, by units, by growth
- Issues: failed payments, returns, support tickets
Format as:
1. Executive summary (3 bullet points)
2. Key metrics table
3. Detailed breakdowns
4. Action items
Include Python/SQL code to pull this data.Insight Generation
Here's our Q3 sales data summary:
[paste summary statistics]
Generate insights for a board presentation:
1. What are the 3 most important trends?
2. What's surprising or concerning?
3. What actions should leadership consider?
4. What additional data would help answer open questions?Common Data Tasks
SQL Query Generation
Natural language: "Find customers who bought product A but not product B
in the last 30 days, grouped by their signup month"
Schema:
customers(id, email, created_at)
orders(id, customer_id, product_id, created_at)
products(id, name, category)
Generate the SQL query with explanation.Data Transformation
Transform this JSON API response into a flat CSV:
```json
{
"users": [{
"id": 1,
"profile": {"name": "Alice", "tags": ["admin", "beta"]},
"orders": [{"id": 100, "total": 50}, {"id": 101, "total": 75}]
}]
}Output format: user_id, user_name, tag, order_id, order_total (one row per user-tag-order combination)
Write Python code using pandas.
### Regular Expressions for Data
```text
Write regex patterns to extract:
1. Email addresses from text
2. Phone numbers (US format)
3. URLs (http/https)
4. IP addresses (v4)
5. Dates in various formats
Include Python code with test cases.Best Practices
- Validate AI output: Always verify generated queries on a subset
- Preserve original data: Never modify source data directly
- Document transformations: Keep a log of all cleaning steps
- Test edge cases: AI may not handle all data anomalies
- Version control: Track analysis scripts like code
- ai data analysis
- data processing
- ai analytics
- data insights
- ai reports